<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Methods" data-toc-modified-id="Methods-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Methods</a></span><ul class="toc-item"><li><span><a href="#DBSCAN" data-toc-modified-id="DBSCAN-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>DBSCAN</a></span></li><li><span><a href="#One-Class-SVM" data-toc-modified-id="One-Class-SVM-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>One Class SVM</a></span></li><li><span><a href="#Isolation-Forest" data-toc-modified-id="Isolation-Forest-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Isolation Forest</a></span></li><li><span><a href="#AutoEncoders" data-toc-modified-id="AutoEncoders-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>AutoEncoders</a></span></li><li><span><a href="#GANomaly" data-toc-modified-id="GANomaly-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>GANomaly</a></span></li></ul></li></ul></div>

# Anomaly Detection

## Introduction

**Anomaly detection** is an important problem in machine
learning with a wide range of applications, such as fraud
detection, intrusion detection, event detection, and health care (for example, spotting a malignant tumor in an MRI scan).
In most anomaly detection problems, a large amount of normal data is available, and the task is to detect
anomalies that deviate from the normal data.
**Anomaly detection** algorithms model the data distribution and then report samples that are atypical under that distribution as anomalies.

Various **ML** algorithms are used for **Anomaly Detection**; in this notebook we explore some of them. Datasets for **Anomaly Detection** are hard to find, so we use a mixture of fraud, synthetic, and vision data.


Some of the ML algorithms we will explore are:
1. Density-Based Anomaly Detection, **DBSCAN**.
2. Support Vector Machine-Based Anomaly Detection, **One Class SVM**.
3. Decision Tree-Based Anomaly Detection, **Isolation Forest**.
4. Deep Neural Network-Based Anomaly Detection, **Autoencoders**.
5. GAN-Based Anomaly Detection, **GANomaly**.

Some of these techniques, like **DBSCAN** and **One Class SVM**, are traditional **ML** techniques, while others, like **GANomaly**, are recent deep learning techniques. Many other methods have been applied successfully to **Anomaly Detection**; the list above is a representative yet wide selection of methods from the literature.

## Methods
### DBSCAN

**Density-based spatial clustering of applications with noise (DBSCAN)** is a very popular clustering algorithm proposed by Martin Ester et al.
https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf

It is a density-based clustering algorithm: given a set of points, it groups together points that are closely packed and marks as anomalous the points that lie alone in low-density regions. One advantage of **DBSCAN** over **KMeans** is that it can also identify non-circular clusters. One limitation of **DBSCAN** is that it suffers from the **curse of dimensionality**: as the dimensionality of the data grows, **DBSCAN** underperforms.
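
The claim about non-circular clusters can be checked with a small sketch. It is not part of the original analysis: the two-moons data, the `eps` and `min_samples` values, and the use of the adjusted Rand index as the comparison metric are all assumptions chosen for illustration.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: clusters that are clearly not circular.
X_moons, y_moons = make_moons(n_samples=400, noise=0.05, random_state=0)

# eps and min_samples are illustrative values picked for this toy data.
labels_dbscan = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)
labels_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# Adjusted Rand index: 1.0 means the clustering matches the true moons exactly.
ari_dbscan = adjusted_rand_score(y_moons, labels_dbscan)
ari_kmeans = adjusted_rand_score(y_moons, labels_kmeans)
print(f"DBSCAN ARI = {ari_dbscan:.2f}, KMeans ARI = {ari_kmeans:.2f}")
```

On data like this, **KMeans** tends to cut each moon in half with a straight boundary, while **DBSCAN** follows the dense curved regions.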

Below we create synthetic data to show application of **DBSCAN**.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

%matplotlib inline

The data set is sampled from three **Gaussian** distributions centered at (0,0), (1,1), and (0,1), plus two outliers at (2,0) and (0,3).

In [None]:
n_points_per_cluster = 100

#Cluster 1 at -> (0,0)
C1 = [0, 0] + .1 * np.random.randn(n_points_per_cluster, 2)
y1 = np.ones(shape=n_points_per_cluster)

#Cluster 2 at -> (1,1)
C2 = [1, 1] + .1 * np.random.randn(n_points_per_cluster, 2)
y2 = np.ones(shape=n_points_per_cluster) + 1

#Cluster 3 at -> (0,1)
C3 = [0, 1] + .1 * np.random.randn(n_points_per_cluster, 2)
y3 = np.ones(shape=n_points_per_cluster) + 2

#Anomalies
C4 = np.array([[2,0],[0,3]])
y4 = [4,4]

X = np.vstack((C1, C2, C3, C4))
y = np.hstack((y1,y2,y3,y4))
y= y.astype('int')

In [None]:
plt.figure(figsize=(12,8))
plt.title('Cluster plot')
sns.scatterplot(x=X[:,0], y=X[:,1], hue=y, palette='dark')

From the plot above we can see three clusters, 1, 2, and 3, and two outliers at (2,0) and (0,3). Now we apply **DBSCAN** to identify these clusters and outliers.

In [None]:
#We initialize DBScan with epsilon of 0.2 and min_samples of 10
dbscan = DBSCAN(eps=0.2, min_samples=10)

In [None]:
y_dbscan = dbscan.fit_predict(X)

In [None]:
plt.figure(figsize=(12,8))
sns.scatterplot(x=X[:,0], y=X[:,1], hue=y_dbscan, palette='dark')

We can see from the plot above that **DBSCAN** correctly identified the three clusters, labeled 0, 1, and 2, and marked the two outliers with the label -1.
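
To pull the flagged points out programmatically rather than reading them off the plot, one can filter on the label -1. The sketch below is a minimal, self-contained version that regenerates similar data with a fixed seed; the seed and the variable names (`X_toy`, `noise_mask`) are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Three Gaussian clusters plus two hand-placed outliers, as in the example above.
centers = np.array([[0, 0], [1, 1], [0, 1]])
clusters = [c + 0.1 * rng.standard_normal((100, 2)) for c in centers]
outliers = np.array([[2, 0], [0, 3]])
X_toy = np.vstack(clusters + [outliers])

labels = DBSCAN(eps=0.2, min_samples=10).fit_predict(X_toy)

# DBSCAN marks noise points with the label -1.
noise_mask = labels == -1
print("Points flagged as anomalies:")
print(X_toy[noise_mask])
print("Cluster labels found:", np.unique(labels[~noise_mask]))
```

Depending on the random draw, a few sparse points on the fringe of a cluster may be flagged along with the two planted outliers.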

This is a toy example, but it illustrates the application of **DBSCAN**. A popular variation of **DBSCAN** is the **OPTICS** algorithm, which can also identify clusters of very different densities.

### One Class SVM

The second ML method we discuss is **OneClassSVM**. **OneClassSVM** is an unsupervised learning algorithm that is trained only on the "normal" data. It learns the boundary of the normal data and classifies anything outside that boundary as anomalous.
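
As a minimal sketch of this idea, the example below fits scikit-learn's `OneClassSVM` on a synthetic 2-D Gaussian blob and queries one point inside and one point far outside the training data. The toy data and the `gamma` and `nu` values are assumptions for illustration, not settings used later in this notebook.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# "Normal" training data: a single 2-D Gaussian blob around the origin.
X_normal_toy = rng.standard_normal((500, 2))

# nu upper-bounds the fraction of training points allowed outside the boundary.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_normal_toy)

# predict returns +1 for points inside the learned boundary and -1 outside it.
X_query = np.array([[0.0, 0.0],   # near the center of the normal data
                    [6.0, 6.0]])  # far from anything seen in training
preds = ocsvm.predict(X_query)
print(preds)
print("Fraction of training points flagged:",
      np.mean(ocsvm.predict(X_normal_toy) == -1))
```

The +1/-1 convention is why the dev and test labels later in this section use 1 for normal samples and -1 for anomalies.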


The method was originally proposed in this paper, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.675.575&rep=rep1&type=pdf. It is still a fairly popular method for anomaly detection, especially when datasets are not very large.


Below we demonstrate the use of **One Class SVM** on the credit card fraud dataset from Kaggle, https://www.kaggle.com/mlg-ulb/creditcardfraud.

There are 492 cases of fraud out of 284,807 transactions.

In [None]:
#Lets load the data
data = pd.read_csv('../data/creditcard.csv')

In [None]:
#Lets look at the head of the data
data.head()

In [None]:
data.info()

The data has 31 columns. We drop the `Time` column; the `Class` column marks each transaction as fraud (1) or not (0).

In [None]:
data.drop(['Time'],inplace=True, axis=1)

In [None]:
#Let's look at the distribution of normal and anomalous data
data['Class'].value_counts()

Let's use the data to create train, dev, and test splits:
1. Train: 80% of the normal data.
2. Dev: 10% of the normal data and 50% of the anomalous data.
3. Test: 10% of the normal data and 50% of the anomalous data.

In [None]:
data_normal = data[data['Class']==0]
data_anomalous = data[data['Class']==1]

In [None]:
X_normal = data_normal.drop(['Class'], axis=1).values
#Lets downsample the X_normal data so that we can compute quickly
number_of_rows=20000
ran_rows = np.random.choice(X_normal.shape[0], size=number_of_rows, replace=False)
X_normal = X_normal[ran_rows,:]
X_anomalous = data_anomalous.drop(['Class'], axis=1).values

In [None]:
from sklearn.model_selection import train_test_split
X_nm_train, X_nm_test_dev = train_test_split(X_normal, train_size=0.8)
X_nm_test, X_nm_dev = train_test_split(X_nm_test_dev, train_size=0.5)
X_a_test, X_a_dev = train_test_split(X_anomalous, train_size=0.5)
#Train set only has normal samples
X_train = X_nm_train
#Dev set has 10% of Normal samples and 50% of anomalous samples
X_dev = np.vstack([X_nm_dev, X_a_dev])
y_dev = np.vstack([np.full(shape=(X_nm_dev.shape[0],1), dtype='int',fill_value=1), 
                   np.full(shape=(X_a_dev.shape[0],1), dtype='int',fill_value=-1)])
#Test set has 10% of Normal samples and 50% of anomalous samples
X_test = np.vstack([X_nm_test, X_a_test])
y_test = np.vstack([np.full(shape=(X_nm_test.shape[0],1),dtype='int',fill_value=1),
                    np.full(shape=(X_a_test.shape[0],1), dtype='int',fill_value=-1)])

In [None]:
print(f"Train set shape={X_train.shape}")
print(f"Dev set shape={X_dev.shape}")
print(f"Test set shape={X_test.shape}")
print(f"Anomalies in Dev Set={np.sum(y_dev==-1)}")
print(f"Anomalies in Test Set={np.sum(y_test==-1)}")

In [None]:
#Now lets normalize X data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_dev_std = sc.transform(X_dev)
X_test_std = sc.transform(X_test)

In [None]:
#Lets train oneclass SVM with default parameters
from sklearn.svm import OneClassSVM
clf = OneClassSVM().fit(X_train_std)

In [None]:
y_dev_pred = clf.predict(X_dev_std)
y_test_pred = clf.predict(X_test_std)

In [None]:
#Lets look at confusion matrix 
from sklearn.metrics import confusion_matrix, f1_score
cm = confusion_matrix(y_dev, y_dev_pred)
print(cm)

In [None]:
cm=confusion_matrix(y_test, y_test_pred)
print(cm)

The performance is not very good. Let's use hold-out validation on the dev set to tune some parameters of OneClassSVM.

In [None]:
#Let's write a simple hold-out validation loop to find good hyperparameters for OneClassSVM
gammas = np.logspace(-4,0,5)
nus = np.linspace(0.01,0.1,5)
for g in gammas:
    for nu in nus:
        clf = OneClassSVM(gamma=g, nu=nu)
        clf.fit(X_train_std)
        y_dev_pred = clf.predict(X_dev_std)
        print(f"gamma={g:.2}, nu = {nu:.2}, CV score = {f1_score(y_dev,y_dev_pred, pos_label=-1):.4}")
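
Instead of reading the best combination off the printed output, the loop can keep track of it. The sketch below is one way to do that, assuming a hypothetical helper `tune_one_class_svm` and a small synthetic stand-in for the standardized credit card data so that the block runs on its own.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid
from sklearn.svm import OneClassSVM

def tune_one_class_svm(X_train, X_dev, y_dev, param_grid):
    """Return (best_params, best_f1) over param_grid, scored on the dev set."""
    best_params, best_f1 = None, -1.0
    for params in ParameterGrid(param_grid):
        model = OneClassSVM(**params).fit(X_train)
        # Anomalies carry the label -1, so that is the positive class for F1.
        score = f1_score(y_dev, model.predict(X_dev), pos_label=-1)
        if score > best_f1:
            best_params, best_f1 = params, score
    return best_params, best_f1

# Small synthetic stand-in data (an assumption, not the credit card data).
rng = np.random.default_rng(0)
X_train_toy = rng.standard_normal((300, 2))
X_dev_toy = np.vstack([rng.standard_normal((100, 2)),      # normal samples
                       5 + rng.standard_normal((20, 2))])  # anomalous samples
y_dev_toy = np.hstack([np.ones(100, dtype=int), -np.ones(20, dtype=int)])

grid = {"gamma": np.logspace(-4, 0, 5).tolist(),
        "nu": np.linspace(0.01, 0.1, 5).tolist()}
best_params, best_f1 = tune_one_class_svm(X_train_toy, X_dev_toy, y_dev_toy, grid)
print(best_params, f"dev F1 = {best_f1:.3f}")
```

With the real data, the same helper would be called with `X_train_std`, `X_dev_std`, and `y_dev`.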

In [None]:
#Lets fit the best classifier
clf = OneClassSVM(gamma=0.01, nu=0.01)
clf.fit(X_train_std)
y_dev_pred = clf.predict(X_dev_std)
cm = confusion_matrix(y_dev, y_dev_pred)
print(cm)
print(f"Dev F1 Score = {f1_score(y_dev,y_dev_pred, pos_label=-1):0.3}")

#Lets also check the test set
y_test_pred = clf.predict(X_test_std)
cm = confusion_matrix(y_test, y_test_pred)
print(cm)
print(f"Test F1 Score = {f1_score(y_test,y_test_pred, pos_label=-1):0.3}")

The results show a high F1 score on both the dev and test sets.
On the test set, the algorithm identifies 199 of the 246 anomalies; exact counts vary between runs because the sampling and splits are random.

### Isolation Forest

**Isolation Forest** is one of the most successful methods for anomaly/outlier detection. It was first proposed in this paper:
https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest

In the **Isolation Forest** algorithm, anomalies are explicitly isolated. The algorithm builds each tree by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature, repeating this recursively until the points are isolated.

The authors of the algorithm observed that, because anomalies are so different from normal data, they end up at a smaller depth in the tree than normal data points. Based on this observation, they defined an anomaly score from the depth of each data point in the trees to rank data points as anomalous or normal.
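
The depth observation can be reproduced with a few lines of NumPy. The sketch below is a simplified illustration, not the authors' implementation: it does not build a full tree or subsample the data, it only follows the branch that contains one chosen point and counts the random splits until that point is alone. The function names and the toy data are assumptions.

```python
import numpy as np

def isolation_depth(X, idx, rng):
    """Count the random axis-aligned splits needed to isolate X[idx]."""
    subset, point, depth = X, X[idx], 0
    while len(subset) > 1:
        lo, hi = subset.min(axis=0), subset.max(axis=0)
        splittable = np.flatnonzero(hi > lo)
        if splittable.size == 0:  # only exact duplicates remain
            break
        feature = rng.choice(splittable)               # random feature
        split = rng.uniform(lo[feature], hi[feature])  # random split value
        # Keep only the side of the split that contains the point.
        if point[feature] < split:
            subset = subset[subset[:, feature] < split]
        else:
            subset = subset[subset[:, feature] >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
X_iso = np.vstack([rng.standard_normal((256, 2)),  # normal points
                   [[8.0, 8.0]]])                  # one obvious anomaly (index 256)

def mean_depth(idx, n_trials=200):
    """Average isolation depth of X_iso[idx] over repeated random partitions."""
    return np.mean([isolation_depth(X_iso, idx, rng) for _ in range(n_trials)])

# Use the normal point closest to the origin as a typical inlier.
inlier_idx = int(np.argmin(np.linalg.norm(X_iso[:256], axis=1)))
inlier_depth = mean_depth(inlier_idx)
anomaly_depth = mean_depth(256)
print("inlier mean depth :", inlier_depth)
print("anomaly mean depth:", anomaly_depth)
```

The isolated point at (8, 8) is usually cut off after only a few splits, while a point in the middle of the dense blob needs many more.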

An anomaly score close to 1 indicates an anomaly, and a score well below 0.5 indicates a normal data point.
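
In the Isolation Forest paper the score is $s(x, n) = 2^{-E[h(x)] / c(n)}$, where $E[h(x)]$ is the average depth of point $x$ over the trees and $c(n)$ is the average path length of an unsuccessful binary search tree lookup over $n$ points, which normalizes the depth. The sketch below computes both quantities; the function names are assumptions, and `n = 256` is used only as an illustrative subsample size.

```python
import numpy as np

def average_path_length(n):
    """c(n) = 2 H(n-1) - 2 (n-1) / n, with H the harmonic number."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0  # exact value, since H(1) = 1
    harmonic = np.log(n - 1) + np.euler_gamma  # H(n-1) ~ ln(n-1) + Euler's constant
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_depth, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n))."""
    return 2.0 ** (-mean_depth / average_path_length(n))

n = 256  # illustrative number of points used to build each tree
for depth in (1, 4, average_path_length(n), 20):
    print(f"mean depth {depth:6.2f} -> score {anomaly_score(depth, n):.3f}")
```

A point whose average depth equals $c(n)$ scores exactly 0.5, shallower points score closer to 1, and deeper points score closer to 0. Note that scikit-learn's `IsolationForest.score_samples` returns the negative of this score, so lower values there mean more anomalous.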

Below we use the **KDDCup99 (SMTP)** dataset:

https://scikit-learn.org/0.19/modules/generated/sklearn.datasets.fetch_kddcup99.html

This dataset is derived from a DARPA intrusion detection evaluation and is a standard dataset for anomaly detection.

In [None]:
#Download KDDCup99 (SMTP) data 
from sklearn.datasets import fetch_kddcup99
data = fetch_kddcup99(subset='smtp', as_frame=True, percent10=False)

In [None]:
df = data['frame']

In [None]:
#Convert the labels to ints: 1 for normal traffic, -1 for any attack
df['labels'] = np.where(df['labels']==b'normal.', 1, -1)

In [None]:
df.labels.value_counts()

In [None]:
df.head()

In [None]:
#We will follow similar strategy as before for train test split
X_normal = df[df['labels']==1][['duration','src_bytes','dst_bytes']].values
X_anomalous = df[df['labels']==-1][['duration','src_bytes','dst_bytes']].values

#Lets downsample the X_normal data so that we can compute quickly
number_of_rows=10000
ran_rows = np.random.choice(X_normal.shape[0], size=number_of_rows, replace=False)
X_normal = X_normal[ran_rows,:]

In [None]:
from sklearn.model_selection import train_test_split
X_nm_train, X_nm_test_dev = train_test_split(X_normal, train_size=0.8)
X_nm_test, X_nm_dev = train_test_split(X_nm_test_dev, train_size=0.5)
X_a_test, X_a_dev = train_test_split(X_anomalous, train_size=0.5)
#Train set only has normal samples
X_train = X_nm_train
#Dev set has 10% of Normal samples and 50% of anomalous samples
X_dev = np.vstack([X_nm_dev, X_a_dev])
y_dev = np.vstack([np.full(shape=(X_nm_dev.shape[0],1), dtype='int',fill_value=1), 
                   np.full(shape=(X_a_dev.shape[0],1), dtype='int',fill_value=-1)])
#Test set has 10% of Normal samples and 50% of anomalous samples
X_test = np.vstack([X_nm_test, X_a_test])
y_test = np.vstack([np.full(shape=(X_nm_test.shape[0],1),dtype='int',fill_value=1),
                    np.full(shape=(X_a_test.shape[0],1), dtype='int',fill_value=-1)])

In [None]:
print(f"Train set shape={X_train.shape}")
print(f"Dev set shape={X_dev.shape}")
print(f"Test set shape={X_test.shape}")
print(f"Anomalies in Dev Set={np.sum(y_dev==-1)}")
print(f"Anomalies in Test Set={np.sum(y_test==-1)}")

In [None]:
#Let's train Isolation Forest with default parameters
from sklearn.ensemble import IsolationForest
clf = IsolationForest().fit(X_train)

In [None]:
y_dev_pred = clf.predict(X_dev)
y_test_pred = clf.predict(X_test)

In [None]:
#Lets look at confusion matrix 
from sklearn.metrics import confusion_matrix, f1_score
cm = confusion_matrix(y_dev, y_dev_pred)
print(cm)

In [None]:
cm=confusion_matrix(y_test, y_test_pred)
print(cm)

The algorithm has decent recall on the test set but very low precision: it flags many normal samples as anomalies (false positives).
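
Precision and recall for the anomaly class can be computed directly rather than read off the confusion matrix. The sketch below uses made-up labels purely for illustration; the only thing carried over from the notebook is the convention that anomalies are labeled -1.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 1 = normal, -1 = anomaly.
y_true_toy = np.array([1, 1, 1, 1, 1, 1, -1, -1, -1, -1])
y_pred_toy = np.array([1, 1, 1, -1, -1, -1, -1, -1, -1, 1])

# labels=[-1, 1] puts the anomaly class in the first row and column.
cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=[-1, 1])
print(cm_toy)

# pos_label=-1 makes the anomaly class the "positive" class.
precision = precision_score(y_true_toy, y_pred_toy, pos_label=-1)
recall = recall_score(y_true_toy, y_pred_toy, pos_label=-1)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy case three of the four anomalies are caught (recall 0.75), but half of the six flagged samples are false alarms (precision 0.5).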

In [None]:
#Let's write a simple hold-out validation loop to find good hyperparameters for Isolation Forest
n_estimators = [100, 200, 500, 1000]
for n in n_estimators:
    clf = IsolationForest(n_estimators=n)
    clf.fit(X_train)
    y_dev_pred = clf.predict(X_dev)
    print(f"n_estimators={n}, CV score = {f1_score(y_dev, y_dev_pred, pos_label=-1):.4}")

The best estimator is the one with n_estimators=100.

### AutoEncoders

### GANomaly