Based on:

http://scikit-learn.org/stable/auto_examples/covariance/plot_outlier_detection.html#sphx-glr-auto-examples-covariance-plot-outlier-detection-py

http://scikit-learn.org/stable/modules/outlier_detection.html

## Known contamination

When the amount of contamination is known, this example illustrates several different ways of performing novelty and outlier detection:

* based on a robust estimator of covariance, which assumes that the data are Gaussian distributed and performs better than the One-Class SVM in that case;

* using the One-Class SVM and its ability to capture the shape of the data set, hence performing better when the data is strongly non-Gaussian, for example with two well-separated clusters;

* using the Isolation Forest algorithm, which is based on random forests and is therefore better suited to high-dimensional settings, although it also performs quite well in the examples below;

* using the Local Outlier Factor, which measures the local deviation of a given data point with respect to its neighbors by comparing their local densities.
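
To make the last item concrete, here is a minimal sketch of the Local Outlier Factor on a toy data set. The data (one tight cluster plus a single distant point), the seed, and `n_neighbors=20` are assumptions chosen for illustration and are not taken from the notebook.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy data (assumed for illustration): 50 points in a tight cluster
# plus one isolated point far away from it.
rng = np.random.RandomState(0)
X = np.vstack([0.3 * rng.randn(50, 2), [[5.0, 5.0]]])

# Each point's local density is compared with that of its 20 nearest neighbors.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)            # 1 for inliers, -1 for outliers
scores = lof.negative_outlier_factor_  # more negative means more abnormal

print("label of the isolated point:", labels[-1])
print("index of the most abnormal point:", scores.argmin())
```

The isolated point sits in a region far sparser than the one its neighbors occupy, so it receives the most negative score.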


The ground truth about inliers and outliers is given by the points' colors, while the orange-filled area indicates which points are reported as inliers by each method.

Here, we assume that we know the fraction of outliers in the datasets. Thus, rather than using the 'predict' method of the estimators, we set a threshold on the 'decision_function' scores so that the corresponding fraction of points falls below it and is flagged as outliers.
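
The thresholding idea can be sketched as follows. This is an illustration under stated assumptions, not the notebook's own code: the toy data, the choice of `EllipticEnvelope` as the estimator, and the use of `np.percentile` to pick the cut-off are all assumptions (`scipy.stats.scoreatpercentile` would serve the same purpose).

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(42)
n_samples = 200
outliers_fraction = 0.25
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers

# Toy data (assumed for illustration): a Gaussian blob of inliers
# plus uniformly scattered outliers.
X = np.vstack([
    0.3 * rng.randn(n_inliers, 2),
    rng.uniform(low=-6, high=6, size=(n_outliers, 2)),
])

clf = EllipticEnvelope(contamination=outliers_fraction, random_state=42)
clf.fit(X)

# Higher scores mean "more normal". Place the cut-off at the percentile
# that matches the known outlier fraction instead of calling predict().
scores = clf.decision_function(X)
threshold = np.percentile(scores, 100 * outliers_fraction)
y_pred = np.where(scores > threshold, 1, -1)  # 1 = inlier, -1 = outlier

n_flagged = int((y_pred == -1).sum())
print("points flagged as outliers:", n_flagged)
```

Because the cut-off is a percentile of the scores themselves, the number of flagged points matches the assumed outlier fraction regardless of how each estimator scales its decision function.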

In [2]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager

from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
# from sklearn.neighbors import LocalOutlierFactor

In [5]:
rng = np.random.RandomState(42)

n_samples = 200
outliers_fraction = 0.25
cluster_separation = [0, 1, 2]

In [6]:
one_class_svm = OneClassSVM(nu=0.95 * outliers_fraction + 0.05)