### Anomaly Detection
This section introduces anomaly detection in scikit-learn and shows how it is used to identify unusual data points.
Anomaly detection is a technique for identifying data points in a dataset that do not fit well with the rest of the data. It has many business applications, such as fraud detection, intrusion detection, system health monitoring, surveillance, and predictive maintenance. Anomalies, also called outliers, fall into three categories:
1. Point anomalies: An individual data instance is anomalous with respect to the rest of the data.
2. Contextual anomalies: A data instance is anomalous only in a specific context.
3. Collective anomalies: A collection of related data instances is anomalous with respect to the entire dataset, even though the individual values may not be.

### Two Methods

Two methods, outlier detection and novelty detection, can be used for anomaly detection. It is important to understand the distinction between them.

**Outlier detection**
The training data contains outliers, which are defined as observations that are far from the rest of the data. For that reason, outlier detection estimators try to fit the region where the training data is most concentrated and ignore the deviant observations. It is also known as unsupervised anomaly detection.

**Novelty detection**
It is concerned with detecting an unobserved pattern in new observations, that is, a pattern that is not present in the training data. Here, the training data is not polluted by outliers. It is also known as semi-supervised anomaly detection.

scikit-learn provides a set of ML tools that can be used for both outlier detection and novelty detection. These tools are implemented as estimator objects that learn from the data in an unsupervised way through the fit() method; new observations can then be labeled as inliers or outliers with the predict() method.
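
The fit-then-predict pattern for novelty detection can be sketched as follows. This is a minimal illustration rather than a reference implementation: the training data, the new observations, and the choice of `LocalOutlierFactor` with `novelty=True` and `n_neighbors=20` are all assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)

# Hypothetical clean training data: a single cluster around the origin.
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# Hypothetical new observations: one near the cluster, one far from it.
X_new = np.array([[0.1, -0.2], [8.0, 8.0]])

# novelty=True makes predict() available for unseen data.
estimator = LocalOutlierFactor(n_neighbors=20, novelty=True)
estimator.fit(X_train)

labels = estimator.predict(X_new)            # 1 = inlier, -1 = outlier
scores = estimator.decision_function(X_new)  # negative values = abnormal
print(labels, scores)
```

The point far from the training cluster receives the label -1 and a negative decision value, while the point near the origin is kept as an inlier.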

#### Sklearn algorithms for Outlier Detection



#### Fitting an elliptic envelope

This algorithm assumes that regular data comes from a known distribution, such as a Gaussian distribution. For outlier detection, scikit-learn provides an object named **covariance.EllipticEnvelope**.
This object fits a robust covariance estimate to the data and thus fits an ellipse to the central data points. It ignores the points outside the central mode.

In [1]:
import numpy as np
from sklearn.covariance import EllipticEnvelope
# The covariance matrix must be symmetric positive semi-definite.
true_cov = np.array([[.8, .3], [.3, .4]])
X = np.random.RandomState(0).multivariate_normal(mean=[0, 0], cov=true_cov, size=500)
cov = EllipticEnvelope(random_state=0).fit(X)



In [2]:
# Now we can use the predict method. It returns 1 for an inlier and -1 for an outlier.
cov.predict([[0, 0],[2, 2]])

array([ 1, -1])
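
To make the "ellipse" more concrete, the sketch below compares the squared Mahalanobis distance of two points under the fitted robust estimate. The covariance matrix, the `contamination=0.1` setting, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(0)

# Illustrative covariance (symmetric, positive definite); an assumption of this sketch.
true_cov = np.array([[0.8, 0.3], [0.3, 0.4]])
X = rng.multivariate_normal(mean=[0, 0], cov=true_cov, size=500)

# contamination is the expected share of outliers in the training data.
envelope = EllipticEnvelope(contamination=0.1, random_state=0).fit(X)

points = np.array([[0.0, 0.0], [2.0, 2.0]])

# Squared Mahalanobis distance under the robust location/covariance estimate.
distances = envelope.mahalanobis(points)
labels = envelope.predict(points)  # 1 = inside the ellipse, -1 = outside
print(distances, labels)
```

A point is flagged as an outlier when its distance exceeds a threshold derived from `contamination`, so raising `contamination` shrinks the ellipse.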

#### Isolation Forest

For high-dimensional datasets, one efficient way to detect outliers is to use random forests. scikit-learn provides the ensemble.IsolationForest class, which isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that feature.
Because this recursive partitioning can be represented as a tree, the number of splits needed to isolate a sample equals the path length from the root node to the terminating node. Anomalies tend to be isolated in fewer splits, so a short average path length across the trees indicates an outlier.

In [3]:
# The Python script below uses sklearn.ensemble.IsolationForest to fit 10 trees on the given data:

from sklearn.ensemble import IsolationForest
import numpy as np
X = np.array([[-1, -2], [-3, -3], [-3, -4], [0, 0], [-50, 60]])
OUTDClf = IsolationForest(n_estimators=10)
OUTDClf.fit(X)
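
Once the forest is fitted, it can score the samples. The sketch below reuses the data from the cell above; `random_state=0` and the variable names are assumptions added only to make the run repeatable.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array([[-1, -2], [-3, -3], [-3, -4], [0, 0], [-50, 60]])

# random_state is fixed here only for reproducibility (an assumption of this sketch).
forest = IsolationForest(n_estimators=10, random_state=0).fit(X)

scores = forest.score_samples(X)  # lower score = shorter path = more abnormal
labels = forest.predict(X)        # 1 = inlier, -1 = outlier
most_abnormal = X[np.argmin(scores)]
print(scores, labels, most_abnormal)
```

The sample `[-50, 60]` lies far from the other four, so it is separated after very few splits and receives the lowest score.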

#### Local Outlier Factor

The Local Outlier Factor (LOF) algorithm is another efficient way to perform outlier detection on high-dimensional data. scikit-learn provides the neighbors.LocalOutlierFactor class, which computes a score, called the local outlier factor, reflecting the degree of abnormality of each observation. The main idea of the algorithm is to detect samples that have a substantially lower density than their neighbors. That is why it measures the local density deviation of a given data point with respect to its neighbors.

In [4]:
# The Python script below uses sklearn.neighbors.NearestNeighbors to build the
# nearest-neighbor search that LOF relies on, fitted on an array representing our data set:

from sklearn.neighbors import NearestNeighbors
samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
LOFneigh = NearestNeighbors(n_neighbors=1, algorithm="ball_tree",p=1)
LOFneigh.fit(samples)

In [7]:
# Now we can ask the fitted estimator which point is closest to [0.5, 1., 1.5] by using the following Python script:

print(LOFneigh.kneighbors([[.5, 1., 1.5]]))

(array([[1.5]]), array([[2]]))
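
The neighbor search above is only the building block. A minimal sketch of the LOF estimator itself, on a small illustrative one-dimensional dataset (the data and `n_neighbors=2` are assumptions for the example), could look like this:

```python
from sklearn.neighbors import LocalOutlierFactor

# Illustrative data: three nearby values and one isolated value.
X = [[-1.1], [0.2], [101.1], [0.3]]

lof = LocalOutlierFactor(n_neighbors=2)

# In the default (outlier detection) mode, use fit_predict on the training data.
labels = lof.fit_predict(X)            # 1 = inlier, -1 = outlier
factors = lof.negative_outlier_factor_ # the more negative, the more abnormal
print(labels, factors)
```

The isolated value 101.1 has a far lower local density than its two nearest neighbors, so it is the only sample labeled -1. Note that in this default mode the estimator does not offer `predict` for new data; that requires `novelty=True`.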


#### One-Class SVM

The One-Class SVM, introduced by Schölkopf et al., is an unsupervised outlier detection method. It is also very efficient on high-dimensional data and estimates the support of a high-dimensional distribution. It is implemented in the Support Vector Machines module as the sklearn.svm.OneClassSVM object. To define a frontier, it requires a kernel (RBF is the most common choice) and a scalar parameter (nu in scikit-learn). For a better understanding, let's fit our data with the svm.OneClassSVM object:

In [8]:
from sklearn.svm import OneClassSVM
X = [[0], [0.89], [0.90], [0.91], [1]]
OSVMclf = OneClassSVM(gamma='scale').fit(X)

In [9]:
OSVMclf.score_samples(X)

array([1.00296414, 1.43923306, 1.44902238, 1.45691778, 1.43923306])
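
The fitted frontier can also be applied to new observations. The sketch below reuses the training data from the cell above; the new points in `X_new` and the explicit `nu=0.5` (the scikit-learn default) are assumptions made for illustration.

```python
from sklearn.svm import OneClassSVM

X = [[0], [0.89], [0.90], [0.91], [1]]

# Hypothetical new observations: one inside the training range, one far outside it.
X_new = [[0.9], [5.0]]

# nu bounds the fraction of training points allowed outside the frontier.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.5).fit(X)

labels = model.predict(X_new)              # 1 = inlier, -1 = outlier
decision = model.decision_function(X_new)  # signed distance to the frontier
print(labels, decision)
```

The value 5.0 is far from every training point, so its decision value is negative and it is labeled -1.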