<a href="https://colab.research.google.com/github/mathjams/machine-learning-basics/blob/main/anomaly_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from utils import *

%matplotlib inline

In [None]:
X_train, X_val, y_val = load_data()

To detect anomalies, we model each feature with a Gaussian distribution fitted to the training data. The Gaussian density is  $$ p(x ; \mu,\sigma ^2) = \frac{1}{\sqrt{2 \pi \sigma ^2}}\exp\left( - \frac{(x - \mu)^2}{2 \sigma ^2}\right). $$ For each feature $i$, we estimate the mean and the variance from the $m$ training examples using $$\mu_i = \frac{1}{m} \sum_{j=1}^m x_i^{(j)}$$ and $$\sigma_i^2 = \frac{1}{m} \sum_{j=1}^m (x_i^{(j)} - \mu_i)^2.$$
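
As a quick illustration of the density formula, here is a minimal sketch that evaluates it for a single feature. The helper name `gaussian_density` is hypothetical and is not part of this notebook or of `utils`; it only mirrors the formula above.

```python
import numpy as np

def gaussian_density(x, mu, var):
    """Univariate Gaussian density p(x; mu, var). Hypothetical helper for illustration."""
    return np.exp(-((x - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

# Density at the mean versus three standard deviations away (mu = 0, var = 1)
peak = gaussian_density(0.0, 0.0, 1.0)
tail = gaussian_density(3.0, 0.0, 1.0)
print(peak, tail)
```

The density is highest at the mean and drops off quickly in the tails, which is what lets a low density serve as a signal that an example is unusual.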

In [None]:
def estimate_gaussian(X):
    """
    Calculates mean and variance of all features
    in the dataset

    Args:
        X (ndarray): (m, n) Data matrix, m samples, n features

    Returns:
        mu (ndarray): (n,) Mean of all features
        var (ndarray): (n,) Variance of all features
    """

    # Per-feature mean over the m samples (axis 0)
    mu = np.mean(X, axis=0)
    # Per-feature variance with the 1/m normalisation used in the formula above
    var = np.mean((X - mu) ** 2, axis=0)
    return mu, var
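
As a sanity check on the estimates, the sketch below computes the same quantities with NumPy's built-ins on a small made-up matrix (`X_toy` is illustrative data, not the notebook's dataset). It relies on `np.var` defaulting to `ddof=0`, which is the $\frac{1}{m}$ normalisation used here.

```python
import numpy as np

# Toy data: 3 samples, 2 features (values chosen only for illustration)
X_toy = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 9.0]])

mu_toy = X_toy.mean(axis=0)               # per-feature mean
var_toy = X_toy.var(axis=0)               # ddof=0 by default: divides by m
var_unbiased = X_toy.var(axis=0, ddof=1)  # divides by m - 1 instead

print(mu_toy, var_toy, var_unbiased)
```

The two variance estimates differ only by the factor $\frac{m}{m-1}$, so the difference matters little for large datasets.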

Next, we set a threshold $\epsilon$ to decide whether an example is an anomaly. We compute the density $p(x)$ of each example under the fitted Gaussian model, and if it falls below $\epsilon$, we flag the example as an anomaly. We try different values of $\epsilon$ on the validation set to find the best threshold.
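
The notebook does not show how the density of an example is computed, so the following is only a sketch under one common assumption: the features are treated as independent, so $p(x)$ is the product of the per-feature Gaussian densities. The helper name `feature_product_density`, the toy data, and the value of `epsilon` are all hypothetical.

```python
import numpy as np

def feature_product_density(X, mu, var):
    """Hypothetical helper: p(x) as a product of per-feature Gaussian densities."""
    # Per-feature densities, shape (m, n); broadcasting applies mu and var column-wise
    per_feature = np.exp(-((X - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    # Independence assumption: multiply across features to get one density per example
    return per_feature.prod(axis=1)

# Toy data (made up): three points near the mean and one far away
X = np.array([[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1], [5.0, 5.0]])
mu = np.array([0.0, 0.0])
var = np.array([1.0, 1.0])

p = feature_product_density(X, mu, var)
epsilon = 1e-3            # arbitrary threshold for this toy example
is_anomaly = p < epsilon  # True where the density falls below the threshold
print(is_anomaly)
```

Only the far-away point is flagged here. With many features the product can underflow, in which case summing log-densities is a safer variant.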

For a given $\epsilon$, we count the true positives ($tp$), false positives ($fp$), and false negatives ($fn$) on the validation set. From these we define the precision $prec=\frac{tp}{tp+fp}$, the recall $rec=\frac{tp}{tp+fn}$, and the $F_1$ score $F_1=\frac{2\cdot prec\cdot rec}{prec+rec}$, which is the harmonic mean of precision and recall. We choose the $\epsilon$ with the highest $F_1$ score.
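
To make the arithmetic concrete, here is a small worked example with made-up counts. The function name `f1_from_counts` is hypothetical; it applies the definitions above and returns 0 when there are no true positives.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from raw counts. Hypothetical helper for illustration."""
    if tp == 0:
        return 0.0  # no true positives: the score is taken to be 0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Example: 3 anomalies caught, 1 false alarm, 2 anomalies missed
score = f1_from_counts(tp=3, fp=1, fn=2)
print(score)
```

With these counts, $prec = 0.75$ and $rec = 0.6$, so $F_1 = \frac{2 \cdot 0.75 \cdot 0.6}{0.75 + 0.6} = \frac{2}{3}$.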

In [None]:
def select_threshold(y_val, p_val):
    """
    Finds the best threshold to use for selecting outliers
    based on the results from a validation set (p_val)
    and the ground truth (y_val)

    Args:
        y_val (ndarray): Ground truth on validation set
        p_val (ndarray): Results on validation set

    Returns:
        epsilon (float): Threshold chosen
        F1 (float):      F1 score by choosing epsilon as threshold
    """
    best_epsilon = 0
    best_F1 = 0
    step_size = (max(p_val) - min(p_val)) / 1000

    for epsilon in np.arange(min(p_val), max(p_val), step_size):
        # Flag an example as an anomaly when its density is below epsilon
        predictions = p_val < epsilon

        # Count outcomes against the ground truth (1 = anomaly, 0 = normal)
        tp = np.sum(predictions & (y_val == 1))
        fp = np.sum(predictions & (y_val == 0))
        fn = np.sum(~predictions & (y_val == 1))

        if tp == 0:
            # No true positives: set F1 to 0 and avoid dividing by zero
            F1 = 0
        else:
            prec = tp / (tp + fp)
            rec = tp / (tp + fn)
            F1 = 2 * prec * rec / (prec + rec)

        if F1 > best_F1:
            best_F1 = F1
            best_epsilon = epsilon

    return best_epsilon, best_F1