# 9. Anomaly Detection

### Intuition

As the name suggests, anomaly detection algorithms detect when a data point deviates substantially from the rest.

More formally, we model the probability density $p(x)$ of the data, specify a threshold parameter $\epsilon$, and **flag** a data point $x_{test}$ as anomalous if $p(x_{test}) < \epsilon$.

### Algorithm

To develop our anomaly detection algorithm, we will assume that each feature $x_j$ is normally distributed:

$x_j \sim N(\mu_j, \sigma_j^2)$

The next steps are:

1. Choose features $x_j$ that may characterize anomalous examples  
2. Fit parameters $\mu_j$ and $\sigma_j^2$ for each feature $j$ from the $m$ training examples:

$\mu_j = \frac{1}{m} \sum^m_{i =1} x_j^{(i)}$  
$\sigma_j^2 = \frac{1}{m} \sum^m_{i =1} (x_j^{(i)} - \mu_j )^2$

3. Given new example $x$ compute $p(x)$:

$p(x) = \prod_{j = 1}^n p(x_j; \mu_j, \sigma_j^2) = \prod_{j = 1}^n \frac{1}{\sqrt{2\pi}\sigma_j} \exp \left(- \frac{(x_j - \mu_j )^2}{2\sigma_j^2}\right)$
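
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `fit_gaussian` and `density`, the toy data, and the threshold value are all assumptions made for the example.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate per-feature mean and variance from an (m, n) training matrix."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # np.var divides by m by default, as in the formula above
    return mu, sigma2

def density(X, mu, sigma2):
    """Return p(x) for each row of X as a product of per-feature Gaussians."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    expo = np.exp(-((X - mu) ** 2) / (2.0 * sigma2))
    return np.prod(coef * expo, axis=1)

# Hypothetical toy data: two features clustered around (5, 10)
X_train = np.array([[4.9, 10.1],
                    [5.1, 9.8],
                    [5.0, 10.0],
                    [4.8, 10.2],
                    [5.2, 9.9]])
mu, sigma2 = fit_gaussian(X_train)

epsilon = 1e-3  # assumed threshold; in practice it would be tuned on labeled data
x_normal = np.array([[5.0, 10.0]])
x_outlier = np.array([[8.0, 3.0]])

print(density(x_normal, mu, sigma2) < epsilon)   # not flagged
print(density(x_outlier, mu, sigma2) < epsilon)  # flagged as an anomaly
```

With many features the product of small densities can underflow to zero, so summing log-densities instead is a common variation.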

### Evaluating anomaly detection systems

In order to evaluate our models and make decisions on whether and how to improve them, we will use labeled data and treat evaluation as a supervised learning problem.

A good _rule of thumb_ data split to evaluate an algorithm would be:

* **Training set**: 60% of normal data (no anomalies)
* **Cross-validation**: 20% of normal data + 50% of anomalous data
* **Test**: 20% of normal data + 50% of anomalous data
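
As a rough sketch of this split, assuming the normal and anomalous examples are already stored in two separate arrays (the function name `split_for_anomaly_detection` and the synthetic data are hypothetical):

```python
import numpy as np

def split_for_anomaly_detection(X_normal, X_anomalous, seed=0):
    """Split into 60/20/20 of normal data and 50/50 of anomalous data."""
    rng = np.random.default_rng(seed)
    normal = X_normal[rng.permutation(len(X_normal))]
    anomalous = X_anomalous[rng.permutation(len(X_anomalous))]

    n_train = len(normal) * 6 // 10
    n_cv = len(normal) * 2 // 10
    n_test = len(normal) - n_train - n_cv
    half = len(anomalous) // 2

    # Training set: normal examples only, so no labels are needed
    X_train = normal[:n_train]

    # CV and test sets: normal examples (y = 0) plus anomalies (y = 1)
    X_cv = np.vstack([normal[n_train:n_train + n_cv], anomalous[:half]])
    y_cv = np.concatenate([np.zeros(n_cv, dtype=int), np.ones(half, dtype=int)])
    X_test = np.vstack([normal[n_train + n_cv:], anomalous[half:]])
    y_test = np.concatenate([np.zeros(n_test, dtype=int),
                             np.ones(len(anomalous) - half, dtype=int)])
    return X_train, (X_cv, y_cv), (X_test, y_test)

# Hypothetical data: 1000 normal examples and 20 labeled anomalies
rng = np.random.default_rng(1)
X_normal = rng.normal(size=(1000, 2))
X_anomalous = rng.normal(loc=6.0, size=(20, 2))

X_train, (X_cv, y_cv), (X_test, y_test) = split_for_anomaly_detection(X_normal, X_anomalous)
print(X_train.shape, X_cv.shape, X_test.shape)  # (600, 2) (210, 2) (210, 2)
```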

Now, we can evaluate it by:

1. Fitting model $p(x)$ on training set
2. On a CV / test example, predict $y = 1$ if $p(x) < \epsilon$ (anomaly) and $y = 0$ otherwise
3. Using possible evaluation metrics:
   - True positives, false positives, true negatives, false negatives
   - Precision / Recall
   - F1 score
        
**Note**: We can also choose $\epsilon$ based on the CV set, for example by picking the value that gives the best F1 score.
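
One possible way to pick $\epsilon$ on the CV set is to scan candidate thresholds and keep the one with the highest F1 score. The sketch below assumes the densities $p(x)$ of the CV examples are already computed; `f1_score`, `select_epsilon`, the number of candidate thresholds, and the toy values are assumptions for illustration only.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 score for binary labels, where 1 marks an anomaly."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv, n_steps=1000):
    """Scan thresholds between min and max density, keep the best F1."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), n_steps):
        y_pred = (p_cv < eps).astype(int)
        score = f1_score(y_cv, y_pred)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1

# Hypothetical CV densities and labels (1 = anomaly)
p_cv = np.array([0.30, 0.25, 0.28, 0.001, 0.31, 0.002, 0.27])
y_cv = np.array([0, 0, 0, 1, 0, 1, 0])

eps, f1 = select_epsilon(p_cv, y_cv)
print(eps, f1)
```

F1 is used here instead of accuracy because, with so few anomalies, a model that always predicts $y = 0$ would still score a high accuracy.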

An interesting question: why don't we use supervised learning (e.g. LR) directly on anomaly detection problems?

Some of the reasons might be:

* Strong imbalance in the number of negative (normal) vs. positive (anomalous) examples in anomaly detection
* It is hard to learn what constitutes a positive example (anomaly), as new anomalies can be substantially different from previous ones

### Feature selection

Although our algorithms may work well even with non-Gaussian features, it is usually good practice to try to transform features to make them closer to a normal distribution.
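
As an illustration, a log transform is one common choice for right-skewed features. The notes do not prescribe a particular transform, so the choice of `np.log`, the synthetic log-normal feature, and the `skewness` helper are assumptions made for this sketch.

```python
import numpy as np

def skewness(v):
    """Sample skewness: roughly 0 for a symmetric, Gaussian-like feature."""
    centered = v - v.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

# Hypothetical right-skewed, strictly positive feature
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# Candidate transform; np.sqrt(x) or np.log(x + c) are other options to try
x_log = np.log(x)

skew_raw = skewness(x)
skew_log = skewness(x_log)
print(skew_raw, skew_log)  # the transformed feature should be much closer to 0
```

In practice one would also plot a histogram of the feature before and after the transform to judge the fit by eye.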

It is also useful to perform **error analysis**: look at individual anomalies that the model missed and work out which new features would make them easier to characterize as anomalies.

### Multivariate Gaussian (Normal) Distribution

Sometimes our model fails to recognize anomalies simply because of the limitations of modeling each feature with its own independent Gaussian distribution. Computing $p(x_j)$ separately for each feature and multiplying the results can lead to the wrong outcome, so we can instead model $p(x)$ for all features at once.

This way, we can also detect anomalies for correlated variables. 

The algorithm is similar to what we have seen before:

1. Fit model $p(x)$ by estimating the mean vector $\mu$ and the covariance matrix $\Sigma$ from the training set
2. Given a new example $x$, compute:

$p(x) = \frac{1}{(2\pi)^{\frac{n}{2}} |\Sigma|^{\frac{1}{2}}} \exp \left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$

3. Flag an anomaly if $p(x) < \epsilon$

**Note**: The single-variable Gaussian model is just a special case of the multivariate model in which the covariance matrix $\Sigma$ is 0 except on the diagonal.
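
The multivariate version can be sketched as follows. The function names, the synthetic correlated data, and the use of the maximum-likelihood estimates (the sample mean and the covariance divided by $m$) are assumptions made for this example.

```python
import numpy as np

def fit_multivariate_gaussian(X):
    """Estimate the mean vector and covariance matrix from an (m, n) matrix."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / m
    return mu, Sigma

def multivariate_density(X, mu, Sigma):
    """Return p(x) for each row of the 2-D array X."""
    n = mu.shape[0]
    centered = X - mu
    # Solve Sigma z = (x - mu) rather than forming the inverse explicitly
    solved = np.linalg.solve(Sigma, centered.T).T
    quad_form = np.sum(centered * solved, axis=1)
    norm = (2.0 * np.pi) ** (n / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad_form) / norm

# Hypothetical data: two strongly correlated features
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
X_train = np.column_stack([x1, x2])
mu, Sigma = fit_multivariate_gaussian(X_train)

X_new = np.array([[1.5, 1.5],    # follows the correlation
                  [1.5, -1.5]])  # each value is ordinary, the combination is not
p = multivariate_density(X_new, mu, Sigma)
print(p)  # the second density is many orders of magnitude smaller
```

The second point would look unremarkable to a per-feature model, since both of its values are well within the range of the training data; only the full covariance matrix captures that the combination is unusual.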

### Recommender Systems

**Problem formulation**: how do we estimate unknown values for certain agents based on values given by other agents?  
**Example**: movie ratings (1 to 5)  
**Notation**: $r(i,j) = 1$ if user $j$ has rated movie $i$ (0 otherwise); $y^{(i,j)}$ = rating given by user $j$ to movie $i$ (defined only if rated)  

1) One potential technique to address this problem is **content-based recommender systems**: 

1. Assign a score to every movie according to certain parameters (e.g. romance, action, etc.) so we can represent each movie as a feature vector $x^{(i)}$
2. Learn a parameter vector $\Theta^{(j)}$ for each user  
3. Predict $y^{(i,j)}$ as $(\Theta^{(j)})^T x^{(i)}$

**Note**: This is mathematically equivalent to fitting a separate linear regression for each user.
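
A minimal sketch of the content-based approach, fitting one least-squares regression per user on the movies that user has rated. The movie features, the ratings, and the function name `fit_user_parameters` are hypothetical, and the sketch deliberately leaves out an intercept term and regularization to stay short.

```python
import numpy as np

# Hypothetical features for 5 movies: (romance, action)
X = np.array([[0.9, 0.1],
              [1.0, 0.0],
              [0.8, 0.2],
              [0.1, 0.9],
              [0.0, 1.0]])

# Y[i, j] = rating of movie i by user j; entries with R[i, j] = 0 are ignored
Y = np.array([[5.0, 0.0],
              [5.0, 1.0],
              [0.0, 2.0],
              [1.0, 5.0],
              [0.0, 5.0]])
R = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [1, 1],
              [0, 1]])

def fit_user_parameters(X, Y, R):
    """Fit Theta[j] by least squares, using only the movies user j rated."""
    n_users = Y.shape[1]
    Theta = np.zeros((n_users, X.shape[1]))
    for j in range(n_users):
        rated = R[:, j] == 1
        theta_j, *_ = np.linalg.lstsq(X[rated], Y[rated, j], rcond=None)
        Theta[j] = theta_j
    return Theta

Theta = fit_user_parameters(X, Y, R)

# predictions[i, j] = (Theta^(j))^T x^(i), including movies not yet rated
predictions = X @ Theta.T
print(np.round(predictions, 2))
```

User 0 only rated romance movies highly, so the predicted rating for the unrated romance movie (row 2) comes out higher than for the unrated action movie (row 4).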

2) Another potential approach is called **collaborative filtering**, where we don't even need features:

