# Anomaly Detection

### Introduction to anomaly detection

*   Also known as outlier analysis. An **outlier (anomaly)** is a data object that deviates significantly from the rest of the objects.

  <img src="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9780123814791/files/images/F000125f12-01-9780123814791.jpg">

  Fig 1. The objects in region R are outliers

*   Types of outliers
 *  **Global outlier** (point anomaly) - an object that deviates significantly from the rest of the data set
 *  **Contextual outlier** - an object that deviates significantly with respect to a specific context, such as the date, the location, and possibly some other factors
 *  **Collective outlier** - a subset of objects that, as a whole, deviates significantly from the entire data set

 <img src="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9780123814791/files/images/F000125f12-02-9780123814791.jpg">

 Fig 2. The black objects form a collective outlier.

*   Challenges of anomaly detection
 *  Modeling normal objects and outliers effectively
 *  Application-specific outlier detection
 *  Handling noise in outlier detection
 *  Understandability

*   Model development and evaluation (Goldstein & Uchida, 2016):

 *  **Supervised anomaly detection** uses **a fully labeled dataset** for training.
 *  **Semi-supervised anomaly detection** uses an **anomaly-free training dataset**. Afterwards, deviations in the test data from that normal model are used to detect anomalies.
 *  **Unsupervised anomaly detection algorithms** use only **intrinsic information** of the data in order to detect instances deviating from the majority of the data. ***“Although unsupervised anomaly detection does not utilize any label information in practice, they are needed for evaluation and comparison. When new algorithms are proposed, it is common practice that an available public classification dataset is modified and the method is compared with the most known algorithms” (Goldstein & Uchida, 2016, p. 17).***

* Review of 290 articles published from 2000 to 2020 (Nassif et al., 2021):
  * 29 distinct machine learning models
  * 22 datasets used, covering e.g. video anomaly detection, intrusion detection, hyperspectral imagery, and medical applications
  * Gaussian models appeared in 2 of the 290 articles
  * Unsupervised anomaly detection was adopted more often than the other categories of anomaly detection

###   Example: Credit Card Fraud Detection


*   Data can be downloaded from Kaggle: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
*   **Context**
It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

*  **Content**
The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

  It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes value 1 in case of fraud and 0 otherwise.

  Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
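
  To illustrate why accuracy is misleading here, the sketch below compares accuracy with average precision (a common step-wise estimate of the AUPRC) on synthetic labels. The 0.2% fraud rate, the two toy detectors, and all variable names are assumptions made for illustration; none of this uses the real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

rng = np.random.default_rng(0)

# Synthetic, highly imbalanced labels (assumption: about 0.2% positives)
n_samples = 50_000
y_true = (rng.random(n_samples) < 0.002).astype(int)

# A useless detector: the same score for every transaction, nothing is flagged
scores_constant = np.zeros(n_samples)
y_pred_constant = np.zeros(n_samples, dtype=int)

# A somewhat informative detector: frauds tend to receive higher scores
scores_informative = rng.normal(size=n_samples) + 3.0 * y_true

accuracy_constant = accuracy_score(y_true, y_pred_constant)
auprc_constant = average_precision_score(y_true, scores_constant)
auprc_informative = average_precision_score(y_true, scores_informative)

print(f"accuracy of 'never fraud':   {accuracy_constant:.4f}")
print(f"AUPRC of constant scores:    {auprc_constant:.4f}")
print(f"AUPRC of informative scores: {auprc_informative:.4f}")
```

  The detector that never flags anything still reaches an accuracy above 99%, while its average precision collapses to the fraud rate, which is why the precision-recall view is preferred for this kind of data.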







In [None]:
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
# Set the working directory; remove this cell or replace 'your path' with your own working directory
!pwd
import os
os.chdir('your path')
!pwd

In [None]:
# import the modules
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from scipy import optimize
from scipy.stats import multivariate_normal
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
# Setup the environment
import warnings
warnings.filterwarnings(action='ignore', category=RuntimeWarning) # This will help keep the notebook clean from unnecessary warnings

In [None]:
'''
# to unzip the data; uncomment when running for the first time
import zipfile
with zipfile.ZipFile("archive.zip","r") as zip_ref:
    zip_ref.extractall("data")
'''

In [None]:
# import data to a dataframe
df_full = pd.read_csv(os.path.join(os.getcwd(), "creditcard.csv"))

In [None]:
# return the columns
print(df_full.columns.values)

In [None]:
print(df_full.head())

#### Subsampling for demonstration purposes. Running the script on the whole dataset takes a while.

In [None]:
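# stratified 5% sample: draw 5% of the rows from each class so the fraud ratio is preserved
# GroupBy.sample keeps the 'Class' column, which the cells below rely on
df_sample = df_full.groupby('Class', group_keys=False).sample(frac=0.05)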

#### Separate feature set and label set

In [None]:
# Separating features from the labels
# 'Class' is the target variable indicating fraud or no fraud, so it is excluded from the feature set
data_features = df_sample.drop(columns=['Class'])

In [None]:
# labels
# The target variable 'Class' is stored separately for model training or evaluation
data_label = df_sample['Class']
data_label.value_counts()

#### Normalise the features

In [None]:
# function to normalise the features using z-score normalisation
def featureNormalise(X):

    ### write your codes here

    return X_norm

In [None]:
normalised_features = featureNormalise(data_features)
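
One possible way to fill in the z-score step is sketched below. The name `feature_normalise_sketch` and the toy data are hypothetical, and the handling of constant columns is an assumption rather than part of the exercise. It uses the population standard deviation (`ddof=0`) and returns a NumPy array.

```python
import numpy as np
import pandas as pd

def feature_normalise_sketch(X):
    """Z-score each column: subtract the column mean, divide by the column std."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)        # population std (ddof=0)
    sigma[sigma == 0] = 1.0      # assumption: constant columns become all zeros
    return (X - mu) / sigma

# Toy data with the same kind of un-scaled columns as 'Time' and 'Amount'
toy = pd.DataFrame({"Time": [0.0, 10.0, 20.0, 30.0],
                    "Amount": [5.0, 5.0, 100.0, 250.0]})
toy_norm = feature_normalise_sketch(toy)

print(toy_norm.mean(axis=0))  # each column mean is (numerically) zero
print(toy_norm.std(axis=0))   # each column std is one
```

After this transformation every feature contributes on the same scale, which matters for both t-SNE distances and the Gaussian model fitted later.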

#### Inspect the feature space using tSNE and visualisation

Compared with PCA, t-SNE is better at preserving the local structure of the data, which makes it more suitable for visualising high-dimensional data.



In [None]:
# t-Distributed Stochastic Neighbor Embedding (t-SNE)
# use tSNE to visualize high-dimensional data: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

# Use t-SNE for dimensionality reduction: it helps visualize high-dimensional data by preserving local relationships,
# making it easier to identify clusters and patterns in a visual, two or three-dimensional space.
# reduce the dimension to 3 in the embedded space
tSNE_embedded = TSNE(n_components=3, learning_rate='auto', init='random', perplexity=10).fit_transform(normalised_features)

print('t-SNE done!')

In [None]:
# append the tSNE embedded space to the sample dataframe
df_sample['tsne-one'] = tSNE_embedded[:,0]
df_sample['tsne-two'] = tSNE_embedded[:,1]
df_sample['tsne-three'] = tSNE_embedded[:,2]

In [None]:
# visualise the data after tSNE. Code referenced from: https://scikit-learn.org/stable/auto_examples/manifold/plot_swissroll.html#sphx-glr-auto-examples-manifold-plot-swissroll-py
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
fig.add_axes(ax)
ax.scatter(
    df_sample['tsne-one'], df_sample['tsne-two'], df_sample['tsne-three'], c=df_sample['Class'], s=50, alpha=0.8
)
ax.set_title('Data distribution in tSNE embedded 3D space')
ax.view_init(azim=-66, elev=12)
_ = ax.text2D(0.8, 0.05, s='n_samples='+str(len(df_sample)), transform=ax.transAxes)

#### Inspect the feature space using PCA and visualisation

In [None]:
# SVD factorisation: https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html
# A (input) = U S Vh, where S contains the singular values of A,
# the columns of U are eigenvectors of A Ah, and the rows of Vh are eigenvectors of Ah A
def pca(X):
    # Calculate the covariance matrix of the transposed input data X
    ### write your codes here

    # Perform SVD on the covariance matrix
    # U: Matrix of eigenvectors, which are the principal components
    # S: Vector of singular values, indicating the variance captured by each principal component
    # _: The right singular vectors (not used here)
    ### write your codes here

    return U, S

# function to convert original data to principal components
def projectData(X, U, K):

    ### write your codes here

    return Z

In [None]:
# fit the model on normalised features
U, S = pca(normalised_features)

In [None]:
# reduce to 3 principal components
n_pca = 3
pca_reduced = projectData(normalised_features, U, n_pca)
print("PCA done!")
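
As a rough sketch of the two steps described in the stubs above (covariance matrix, then SVD, then projection onto the first K components), the following toy example may help. The function names `pca_sketch` and `project_data_sketch` are hypothetical, and the code assumes the input columns are already zero-mean, as they are after z-score normalisation.

```python
import numpy as np

def pca_sketch(X):
    """Return principal directions U (as columns) and their variances S."""
    X = np.asarray(X, dtype=float)
    m = X.shape[0]
    cov = (X.T @ X) / m              # covariance, assuming zero-mean columns
    U, S, _ = np.linalg.svd(cov)     # singular values come sorted, largest first
    return U, S

def project_data_sketch(X, U, K):
    """Project X onto the first K principal directions."""
    return np.asarray(X, dtype=float) @ U[:, :K]

# Toy data: five features with very different spreads
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
toy = toy - toy.mean(axis=0)

U_toy, S_toy = pca_sketch(toy)
Z_toy = project_data_sketch(toy, U_toy, 3)
explained = S_toy[:3].sum() / S_toy.sum()

print(Z_toy.shape)
print(f"variance retained by 3 components: {explained:.3f}")
```

The ratio of the leading singular values to their total is a quick check on how much variance the reduced space keeps.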

In [None]:
# append the pca reduced feature space to the sample dataframe
df_sample['pca-one'] = pca_reduced[:,0]
df_sample['pca-two'] = pca_reduced[:,1]
df_sample['pca-three'] = pca_reduced[:,2]

In [None]:
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
fig.add_axes(ax)
ax.scatter(
    df_sample['pca-one'], df_sample['pca-two'], df_sample['pca-three'], c=df_sample['Class'], s=50, alpha=0.8
)
ax.set_title('Data distribution in PCA reduced 3D space')
ax.view_init(azim=-66, elev=12)
_ = ax.text2D(0.8, 0.05, s='n_samples='+str(len(df_sample)), transform=ax.transAxes)

In [None]:
fig, axs = plt.subplots(figsize=(8, 8), nrows=2)
axs[0].scatter(df_sample['pca-one'], df_sample['pca-two'], c=df_sample['Class'])
axs[0].set_title('PCA reduced Data - Component 1 vs Component 2')
axs[1].scatter(df_sample['pca-one'], df_sample['pca-three'], c=df_sample['Class'])
_ = axs[1].set_title('PCA reduced Data - Component 1 vs Component 3')

#### Estimating Gaussian Parameters - mean and variance

In [None]:
# function to estimate Gaussian parameters: mean, variance
def estimateGaussian(X):
    """
    Estimates the parameters of a Gaussian distribution using the data.

    Args:
    X (ndarray): Input data matrix where each row represents a sample and each column a feature.

    Returns:
    mu (ndarray): Vector of means for each feature.
    sigma2 (ndarray): Vector of variances for each feature.
    """
    ### write your codes here

    return mu, sigma2

#### Calculate the Gaussian parameters

In [None]:
# Calculate the Gaussian parameters for the normalized features
mu, sigma2 = estimateGaussian(normalised_features)
print(mu, sigma2)
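
A minimal sketch of the parameter estimation, assuming the maximum-likelihood estimates (per-feature mean and per-feature variance with `ddof=0`) are what the exercise intends. The name `estimate_gaussian_sketch` and the toy matrix are hypothetical.

```python
import numpy as np

def estimate_gaussian_sketch(X):
    """Per-feature mean and variance (maximum-likelihood, i.e. divide by m)."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)   # ddof=0: divides by m, not m - 1
    return mu, sigma2

# Three samples, two features
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])
mu_toy, sigma2_toy = estimate_gaussian_sketch(toy)

print(mu_toy)
print(sigma2_toy)
```

For z-score normalised features the result should be close to a mean of 0 and a variance of 1 for every column, which is a useful sanity check on the normalisation step.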

#### Construct the Gaussian distribution

In [None]:
# apply scipy.stats.multivariate_normal function to generate a multivariate normal random variable based on the Gaussian parameters
# refer to: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.multivariate_normal.html for function details
distribution = multivariate_normal(mean=mu, cov=np.diag(sigma2))
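
Passing `np.diag(sigma2)` as the covariance treats the features as independent, so the joint density is simply the product of the per-feature univariate Gaussian densities. The small check below illustrates this on made-up parameters; all names and values are assumptions for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Hypothetical parameters for three independent features
mu_toy = np.array([0.0, 1.0, -2.0])
sigma2_toy = np.array([1.0, 4.0, 0.25])

points = np.array([[0.0, 1.0, -2.0],    # exactly at the mean
                   [1.0, -1.0, -1.5],
                   [3.0, 5.0, 0.0]])    # far from the mean

# Joint density with a diagonal covariance matrix
joint = multivariate_normal(mean=mu_toy, cov=np.diag(sigma2_toy)).pdf(points)

# Product of univariate densities, one per feature
product = norm.pdf(points, loc=mu_toy, scale=np.sqrt(sigma2_toy)).prod(axis=1)

print(joint)
print(product)
```

Both computations agree, and the point at the mean receives the highest density. Correlations between features are ignored by this model; capturing them would require a full covariance matrix.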

#### Fit data to the model and estimate the probability for each data point

In [None]:
# apply the probability density function: pdf(x, mean=None, cov=1, allow_singular=False)
probs = distribution.pdf(normalised_features)
min(probs)

In [None]:
max(probs)

#### Without labels - Select the best threshold using grid search and visual inspection

In [None]:
# function to walk through a range of epsilon values and visualise the outliers
def selectThresholdGS(p):
  # define a value list to perform the grid search
  epsilons = [np.min(p), np.max(p)*1e-06, np.max(p)*1e-05, np.max(p)*1e-04, np.max(p)*1e-03, np.max(p)*1e-02, np.max(p)*1e-01, np.max(p)]
  for epsilon in epsilons:
    # label each point as normal (0) or anomalous (1) using the fitted model
    predictions = (p < epsilon).astype(int)
    # plot the 3D graph to fit the outliers
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111, projection='3d')
    fig.add_axes(ax)
    ax.scatter(
        df_sample['tsne-one'], df_sample['tsne-two'], df_sample['tsne-three'], c=predictions, s=50, alpha=0.8)
    ax.set_title('Data distribution in tSNE embedded 3D space after fitting')
    ax.view_init(azim=-66, elev=12)
    _ = ax.text2D(0.8, 0.05, s='epsilon= '+str(epsilon)+' n_samples='+str(len(df_sample)), transform=ax.transAxes)

#### Apply the threshold selection procedure

In [None]:
selectThresholdGS(probs)
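
Alongside the plots, it can help to see how many points each candidate threshold flags. The sketch below does this on synthetic standard-normal data; the data, the variable names, and the choice to skip the `np.min(p)` candidate (which flags nothing) are assumptions for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Synthetic stand-in for the normalised features (assumption: 4 features)
rng = np.random.default_rng(0)
toy = rng.normal(size=(1000, 4))
p_toy = multivariate_normal(mean=np.zeros(4), cov=np.eye(4)).pdf(toy)

# Same scaling factors as the grid above, relative to the largest density
factors = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
flagged_counts = []
for factor in factors:
    epsilon = np.max(p_toy) * factor
    n_flagged = int(np.sum(p_toy < epsilon))
    flagged_counts.append(n_flagged)
    print(f"epsilon={epsilon:.3e}  flagged={n_flagged} "
          f"({n_flagged / len(p_toy):.1%})")
```

The count grows monotonically with epsilon, so without labels a threshold is usually picked where the flagged fraction is plausible for the application, for example near the expected fraud rate.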

### Anomaly Detection

In [None]:
# import the modules
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy import optimize
from scipy.io import loadmat
from scipy.stats import multivariate_normal

In [None]:
import warnings
warnings.filterwarnings(action='ignore', category=RuntimeWarning)

#### Load the dataset `ex8data1.mat`

In [None]:
data = loadmat("ex8data1.mat")
X = data["X"]
print(X.shape)
Xval, yval = data["Xval"], data["yval"].ravel()
print(Xval.shape, yval.shape)

In [None]:
# original dataset, split into X and Xval
data

In [None]:
# X for estimating the parameters - mean variance
X

In [None]:
# Xval for testing the parameters and evaluating the performance
Xval

#### Visualise the dataset with the scatter plot

In [None]:
# use scatter plot to visualise the data distribution for X
plt.figure()
plt.scatter(X[:, 0], X[:, 1])
plt.xlabel("Latency (ms)")
plt.ylabel("Throughput (mb/s)")
plt.show()

In [None]:
# use scatter plot to visualise the data distribution for Xval
plt.figure()
plt.scatter(Xval[:, 0], Xval[:, 1])
plt.xlabel("Latency (ms)")
plt.ylabel("Throughput (mb/s)")
plt.show()

#### Estimating Gaussian Parameters - mean and variance

In [None]:
# function to estimate Gaussian parameters: mean, variance
def estimateGaussian(X):

    ### write your codes here

    return mu, sigma2

#### Calculate the Gaussian parameters

In [None]:
mu, sigma2 = estimateGaussian(X)
print(mu, sigma2)

#### Visualise the dataset and the Gaussian anomaly detector

In [None]:
# function to visualise fitting the Gaussian anomaly detector on the data
def visualiseFit(distribution, X):
    # Defining Grid Limits
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    # Creating a Meshgrid
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                         np.arange(y_min, y_max, 0.1))
    X_plot = np.c_[xx.ravel(), yy.ravel()]

    # apply the probability density function to return the probabilities of objects
    y_plot = distribution.pdf(X_plot).reshape(xx.shape)

    plt.figure()
    plt.scatter(X[:, 0], X[:, 1])
    plt.contour(xx, yy, y_plot, levels=[1e-20, 1e-17, 1e-14, 1e-11, 1e-8, 1e-5, 1e-2])
    plt.show()

#### Construct the Gaussian distribution

In [None]:
# apply scipy.stats.multivariate_normal function to generate a multivariate normal random variable based on the Gaussian parameters
# refer to: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.multivariate_normal.html for function details
distribution = ### write your codes here
visualiseFit(distribution, X)

#### Select the best threshold returning the highest F1 score

In [None]:
# function to select the best epsilon (threshold) for determining anomalies
def selectThreshold(yval, pval):
    # Initialize the best F1 score to the lowest possible value
    ### write your codes here

    step = (np.max(pval) - np.min(pval)) / 1000

    # Iterate through potential epsilon values to find the best threshold
    for epsilon in np.arange(np.min(pval),
                             np.max(pval) + step, step):
        predictions = ### write your codes here       # Generate predictions for this threshold
        tp = ### write your codes here     # Calculate true positives
        fp = ### write your codes here  # Calculate false positives
        fn = ### write your codes here  # Calculate false negatives
        precision = ### write your codes here
        recall = ### write your codes here
        F1 = ### write your codes here
        # If this F1 score is better, update the best scores and threshold
        if F1 > bestF1:
            bestF1 = F1
            bestEpsilon = epsilon
    return bestEpsilon, bestF1

#### Apply the threshold selection procedure

In [None]:
# apply the probability density function: pdf(x, mean=None, cov=1, allow_singular=False)
pval = distribution.pdf(Xval)
bestEpsilon, bestF1 = selectThreshold(yval, pval)  # Get the best threshold and F1 score
print(bestEpsilon, bestF1)
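
The threshold search can be sketched as follows, assuming a point is predicted anomalous when its density falls below epsilon and that `yval` uses 1 for anomalies. The name `select_threshold_sketch`, the toy arrays, and the decision to skip thresholds with no true positives are assumptions, not the reference solution.

```python
import numpy as np

def select_threshold_sketch(yval, pval):
    """Scan 1000 evenly spaced thresholds and keep the one with the best F1."""
    best_epsilon, best_f1 = 0.0, 0.0
    step = (np.max(pval) - np.min(pval)) / 1000
    for epsilon in np.arange(np.min(pval), np.max(pval) + step, step):
        predictions = pval < epsilon                  # True means anomaly
        tp = np.sum(predictions & (yval == 1))        # anomalies caught
        fp = np.sum(predictions & (yval == 0))        # normal points flagged
        fn = np.sum(~predictions & (yval == 1))       # anomalies missed
        if tp == 0:
            continue   # F1 would be zero or undefined, so it cannot improve
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_epsilon = f1, epsilon
    return best_epsilon, best_f1

# Toy validation set: the two labelled anomalies have the lowest densities
yval_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])
pval_toy = np.array([0.30, 0.28, 0.25, 0.22, 0.20, 0.18, 0.02, 0.01])
eps_toy, f1_toy = select_threshold_sketch(yval_toy, pval_toy)

print(eps_toy, f1_toy)
```

On this toy set any threshold between the largest anomalous density and the smallest normal density separates the classes perfectly, so the search stops improving at the first such value.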

#### Visualise the dataset and the Gaussian anomaly detector. Outliers need to be indicated.

In [None]:
def visualiseFitOutliers(distribution, X, outliers):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                         np.arange(y_min, y_max, 0.1))
    X_plot = np.c_[xx.ravel(), yy.ravel()]
    y_plot = distribution.pdf(X_plot).reshape(xx.shape)

    plt.figure()
    plt.scatter(X[:, 0], X[:, 1])
    plt.scatter(X[outliers, 0], X[outliers, 1], s=100)
    plt.contour(xx, yy, y_plot, levels=[1e-20, 1e-17, 1e-14, 1e-11, 1e-8, 1e-5, 1e-2])
    plt.show()

In [None]:
# The Gaussian parameters were estimated on X and the threshold epsilon was selected on Xval.
# Given that the distributions of X and Xval are similar, the same threshold can be applied to X.
p = distribution.pdf(X)
outliers = ### write your codes here
visualiseFitOutliers(distribution, X, outliers)

In [None]:
pval = distribution.pdf(Xval)
outliers = ### write your codes here
visualiseFitOutliers(distribution, Xval, outliers)
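
The `outliers` argument works naturally as a boolean mask, since `X[outliers, 0]` then selects only the flagged rows. A tiny sketch with made-up densities and a hypothetical threshold:

```python
import numpy as np

# Hypothetical densities for five points and a hypothetical threshold
p_toy = np.array([0.12, 0.09, 1e-6, 0.2, 3e-5])
epsilon_toy = 1e-4

outliers_toy = p_toy < epsilon_toy           # boolean mask, True for anomalies
outlier_indices = np.where(outliers_toy)[0]  # the same information as indices

print(outliers_toy)
print(outlier_indices)
```

Either the mask or the index array can be passed to NumPy indexing, so both forms are compatible with the plotting function above.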

### Task - Perform anomaly detection on a high-dimensional dataset `ex8data2.mat`

In this exercise, repeat the above process to construct a Gaussian anomaly detector and apply it to the dataset `ex8data2.mat`.

#### Load another dataset `ex8data2.mat`

In [None]:
data = loadmat("ex8data2.mat")
X = data["X"]
print(X.shape)
Xval, yval = data["Xval"], data["yval"].ravel()
print(Xval.shape, yval.shape)

#### Construct the Gaussian anomaly detector, select the best threshold and calculate the best F1 score

In [None]:
# Estimate the Gaussian parameters from the training data
mu, sigma2 = ### write your codes here

# Create a multivariate normal distribution with the estimated parameters
distribution = ### write your codes here

# Evaluate the probability density at each training sample
p = ### write your codes here

# Evaluate the probability density at each validation sample
pval = ### write your codes here

# Determine the best threshold epsilon using the F1 score from validation labels
bestEpsilon, bestF1 = selectThreshold(yval, pval)

print(bestEpsilon, bestF1)  # 1.38e-18 0.615385
print(sum(p < bestEpsilon))  # 117