<img src='img/logo.png'>
<img src='img/title.png'>

# Outlier detection

This notebook discusses several techniques for identifying statistical outliers, including elliptic envelopes, PCA, kernel density estimation, and isolation forests.

# Table of Contents
* [Outlier detection](#Outlier-detection)
	* [Example data: German Credit Card Fraud](#Example-data:-German-Credit-Card-Fraud)
	* [Elliptic Envelope](#Elliptic-Envelope)
	* [PCA](#PCA)
	* [Kernel Density Estimation (KDE)](#Kernel-Density-Estimation-%28KDE%29)
	* [Isolation Forest](#Isolation-Forest)
	* [Summary](#Summary)


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import src.mglearn as mglearn
%matplotlib inline

#plt.rcParams['image.interpolation'] = "none"
#np.set_printoptions(precision=3)

## Example data: German Credit Card Fraud

We first load the data and inspect it, then encode the categorical features as dummy variables.

In [None]:
data = pd.read_csv("data/german_cc_fraud.csv")

In [None]:
data['class'].value_counts()

In [None]:
data.head()

### Dummies

As with many scikit-learn models, we need to convert the categorical features to dummy variables, with one True/False column per category.

In [None]:
data_dummies = pd.get_dummies(data.drop("class", axis=1))

In [None]:
data_dummies.columns  # expansion in number of columns
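
To make the column expansion concrete, here is a minimal sketch on a hypothetical toy frame (the `amount` and `purpose` columns are invented for illustration and are not taken from the credit data). It shows how `pd.get_dummies` leaves numeric columns alone and splits each categorical column into one column per category.

```python
import pandas as pd

# Hypothetical toy frame: one numeric and one categorical column.
toy = pd.DataFrame({
    "amount": [1000, 2500, 400],
    "purpose": ["car", "tv", "car"],
})

toy_dummies = pd.get_dummies(toy)

# The numeric column is kept; "purpose" becomes one column per category.
print(list(toy_dummies.columns))
```

With two distinct values in `purpose`, the two original columns become three.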

In [None]:
X = data_dummies.values.astype(float)  # np.float was removed in NumPy 1.24

In [None]:
X.shape

## Elliptic Envelope

We apply elliptic envelope outlier detection to the input data `X` after scaling it and transforming it with PCA.

`EllipticEnvelope` detects outliers in Gaussian-distributed data. Note the important cautionary [advice in the docs](http://scikit-learn.org/stable/modules/generated/sklearn.covariance.EllipticEnvelope.html):

```text
Outlier detection from covariance estimation may break or not perform well in high-dimensional settings. In particular, one will always take care to work with 

n_samples > n_features ** 2
```
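
The rule of thumb quoted above is easy to check before fitting. The helper below is a hypothetical convenience function, not part of scikit-learn, and the shapes passed to it are made up for illustration.

```python
def enough_samples_for_covariance(n_samples: int, n_features: int) -> bool:
    """Return True when n_samples > n_features ** 2 (the rule of thumb quoted above)."""
    return n_samples > n_features ** 2


# Hypothetical shapes, for illustration only.
print(enough_samples_for_covariance(1000, 20))  # 1000 > 400
print(enough_samples_for_covariance(1000, 60))  # 1000 > 3600 does not hold
```

A check like this is one plausible reason to reduce the dimensionality with PCA before fitting the envelope, as the cells in this section do.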

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.covariance import EllipticEnvelope

In [None]:
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=.8)
X_preprocessed = pca.fit_transform(X_scaled)

In [None]:
pca.n_components_

In [None]:
ee = EllipticEnvelope(contamination=.3).fit(X_preprocessed)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(data['class'] == "good", ee.predict(X_preprocessed) == 1)

In [None]:
from sklearn.metrics import roc_auc_score
roc_auc_score(data['class'] == "good", ee.decision_function(X_preprocessed))
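
`roc_auc_score` takes the true labels and a continuous score, and measures how often a "good" sample is ranked above a "not good" one. The labels and scores below are hypothetical values chosen to make the calculation easy to follow by hand.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores: True marks the "good" (inlier) class,
# and a higher score means "looks more like an inlier".
is_good = [False, False, True, True]
scores = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (good, not-good) pairs are ranked correctly, so the AUC is 0.75.
auc = roc_auc_score(is_good, scores)
print(auc)
```

This is why the cells in this notebook pass `decision_function` or `score_samples` output, where higher values indicate inliers, together with `data['class'] == "good"`.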

## PCA

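The plot below shows the explained variance ratio, that is, the fraction of the total variance explained by each extracted principal component. We then use the log-likelihood that the fitted PCA model assigns to each sample (`score_samples`) as an outlier score.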

In [None]:
pca_full = PCA().fit(X_scaled)
plt.figure()
plt.plot(pca_full.explained_variance_ratio_);
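
The explained variance ratio also determines how many components are kept when `n_components` is given as a fraction, as in `PCA(n_components=.8)` earlier. The sketch below uses synthetic data (an assumption for illustration) and sets `svd_solver="full"` explicitly so that both fits use the same solver.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic data (an assumption): 10 features with decreasing variance.
X_toy = rng.normal(size=(300, 10)) * np.arange(10, 0, -1)

pca_all = PCA(svd_solver="full").fit(X_toy)
cumulative = np.cumsum(pca_all.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.8.
n_needed = int(np.searchsorted(cumulative, 0.8, side="right") + 1)

pca_80 = PCA(n_components=0.8, svd_solver="full").fit(X_toy)
print(n_needed, pca_80.n_components_)
```

The two printed numbers should agree: a fractional `n_components` keeps just enough components to pass that share of the total variance.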

In [None]:
roc_auc_score(data['class'] == "good", pca.score_samples(X_scaled))

## Kernel Density Estimation (KDE)

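The [sklearn docs on KDE](http://scikit-learn.org/stable/modules/density.html) walk through an interesting example demonstrating how binning data into fixed bins can produce misleading histograms whose appearance depends strongly on where the bin edges fall.

Kernel density methods are often used for histogram smoothing. Each kernel takes a `bandwidth` parameter that controls how smooth the estimate is: larger values give a smoother density. Here we use the estimated log-density from `score_samples` as an outlier score, since points in low-density regions are more likely to be outliers.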

In [None]:
from sklearn.neighbors import KernelDensity

In [None]:
kde = KernelDensity(bandwidth=5).fit(X_scaled)

In [None]:
plt.figure()
plt.hist(kde.score_samples(X_scaled), bins=100);

In [None]:
roc_auc_score(data['class'] == "good", kde.score_samples(X_scaled))
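
To illustrate how the KDE score separates typical points from unusual ones, here is a minimal sketch on synthetic one-dimensional data. The data, the query points, and the two bandwidth values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
# Synthetic 1-D data (an assumption): 200 draws from a standard normal.
X_toy = rng.normal(size=(200, 1))

# One point in the bulk of the data and one far away from it.
queries = np.array([[0.0], [6.0]])

results = {}
for bandwidth in (0.2, 1.0):
    kde_toy = KernelDensity(bandwidth=bandwidth).fit(X_toy)
    # score_samples returns the log of the estimated density.
    results[bandwidth] = kde_toy.score_samples(queries)
    print(bandwidth, results[bandwidth])
```

For both bandwidths the far-away point gets a much lower log-density than the central one, while the size of the gap depends on the bandwidth.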

## Isolation Forest

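Isolation forest is a [recursive splitting algorithm](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html) for finding outliers: points that can be isolated from the rest of the data with fewer random splits are scored as more anomalous.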

In [None]:
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=.3).fit(data_dummies.values)

In [None]:
from sklearn.metrics import confusion_matrix, roc_auc_score
confusion_matrix(data['class'] == "good", iso.predict(data_dummies.values) == 1)

In [None]:
roc_auc_score(data['class'] == "good", iso.decision_function(data_dummies.values))
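
The sketch below shows the same scoring convention on synthetic data: a Gaussian blob plus a handful of far-away points, both invented for illustration. Lower `decision_function` values mean a point is easier to isolate and therefore more anomalous.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Synthetic data (an assumption): a Gaussian blob plus 10 far-away points.
inliers = rng.normal(size=(300, 2))
outliers = rng.uniform(low=5, high=9, size=(10, 2))
X_toy = np.vstack([inliers, outliers])

iso_toy = IsolationForest(random_state=0).fit(X_toy)
scores = iso_toy.decision_function(X_toy)  # lower means more anomalous

# Average score of the blob versus the far-away points.
print(scores[:300].mean(), scores[300:].mean())
```

The far-away points should receive a clearly lower average score than the blob.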

## Summary

In this notebook, we reviewed the following outlier detection techniques in preparation for more advanced topics:

* [Elliptic Envelope](#Elliptic-Envelope)
* [PCA](#PCA)
* [Kernel Density Estimation (KDE)](#Kernel-Density-Estimation-%28KDE%29)
* [Isolation Forest](#Isolation-Forest)

<a href='Outliers_Exercises.ipynb' class='btn btn-primary btn-lg'>Exercises</a>

<img src='img/copyright.png'>