# Anomaly Detection

PyCaret’s Anomaly Detection module is an unsupervised machine learning module for identifying rare items, events, or observations that raise suspicion by differing significantly from the majority of the data. Typically, the anomalous items translate to some kind of problem, such as bank fraud, a structural defect, a medical problem, or an error.

Anomaly detection techniques fall into three broad categories:

- Unsupervised anomaly detection: These techniques detect anomalies in an unlabeled dataset under the assumption that most instances are normal, by looking for the instances that fit the rest of the dataset least well.

- Supervised anomaly detection: This technique requires a dataset whose instances have been labeled "normal" or "abnormal", and involves training a classifier.

- Semi-supervised anomaly detection: This technique constructs a model representing normal behavior from a given normal training dataset, and then tests how likely a test instance is to have been generated by the learned model.
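
The semi-supervised idea can be sketched in a few lines of plain Python. This is only an illustration, not PyCaret's implementation: the training values, the choice of a Gaussian as the model of normal behavior, and the 3-standard-deviation cutoff are all assumptions made for the example.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical one-dimensional readings known to be normal (training data).
normal_train = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]

# "Model of normal behavior": a Gaussian fitted to the normal training data.
model = NormalDist(mu=mean(normal_train), sigma=stdev(normal_train))


def density(x):
    """Likelihood (probability density) of x under the learned model."""
    return model.pdf(x)


def is_anomaly(x, z_threshold=3.0):
    """Flag x when it lies more than z_threshold standard deviations from the mean."""
    return abs(x - model.mean) / model.stdev > z_threshold


# Score two hypothetical test instances against the learned model.
for x in (10.05, 12.5):
    print(x, density(x), is_anomaly(x))
```

A test instance close to the training values gets a high density and is kept, while one far from them gets a density near zero and is flagged.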

In [1]:
from pycaret.datasets import get_data
data = get_data('mice')

Unnamed: 0,MouseID,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,...,pCFOS_N,SYP_N,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N,Genotype,Treatment,Behavior,class
0,309_1,0.503644,0.747193,0.430175,2.816329,5.990152,0.21883,0.177565,2.373744,0.232224,...,0.108336,0.427099,0.114783,0.13179,0.128186,1.675652,Control,Memantine,C/S,c-CS-m
1,309_2,0.514617,0.689064,0.41177,2.789514,5.685038,0.211636,0.172817,2.29215,0.226972,...,0.104315,0.441581,0.111974,0.135103,0.131119,1.74361,Control,Memantine,C/S,c-CS-m
2,309_3,0.509183,0.730247,0.418309,2.687201,5.622059,0.209011,0.175722,2.283337,0.230247,...,0.106219,0.435777,0.111883,0.133362,0.127431,1.926427,Control,Memantine,C/S,c-CS-m
3,309_4,0.442107,0.617076,0.358626,2.466947,4.979503,0.222886,0.176463,2.152301,0.207004,...,0.111262,0.391691,0.130405,0.147444,0.146901,1.700563,Control,Memantine,C/S,c-CS-m
4,309_5,0.43494,0.61743,0.358802,2.365785,4.718679,0.213106,0.173627,2.134014,0.192158,...,0.110694,0.434154,0.118481,0.140314,0.14838,1.83973,Control,Memantine,C/S,c-CS-m


In [2]:
from pycaret.anomaly import *

exp_ano101 = setup(data, normalize = True, 
                   ignore_features = ['MouseID'],
                   session_id = 123)

Unnamed: 0,Description,Value
0,session_id,123
1,Original Data,"(1080, 82)"
2,Missing Values,True
3,Numeric Features,77
4,Categorical Features,4
5,Ordinal Features,False
6,High Cardinality Features,False
7,High Cardinality Method,
8,Transformed Data,"(1080, 91)"
9,CPU Jobs,-1


In [3]:
models()

ID,Name,Reference
abod,Angle-base Outlier Detection,pyod.models.abod.ABOD
cluster,Clustering-Based Local Outlier,pyod.models.cblof.CBLOF
cof,Connectivity-Based Local Outlier,pyod.models.cof.COF
iforest,Isolation Forest,pyod.models.iforest.IForest
histogram,Histogram-based Outlier Detection,pyod.models.hbos.HBOS
knn,K-Nearest Neighbors Detector,pyod.models.knn.KNN
lof,Local Outlier Factor,pyod.models.lof.LOF
svm,One-class SVM detector,pyod.models.ocsvm.OCSVM
pca,Principal Component Analysis,pyod.models.pca.PCA
mcd,Minimum Covariance Determinant,pyod.models.mcd.MCD


- Angle-based Outlier Detection: Relies on the variance of the angles between a point and pairs of other points rather than on distances alone. Points with a low angle variance are considered outliers. It is intended for high-dimensional data.
- Clustering-Based Local Outlier: Groups the data into large and small clusters and scores each point by the size of its cluster and its distance to the nearest large cluster.
- Connectivity-Based Local Outlier: Measures how well connected a point is to its neighborhood, which helps with patterns that are not spherical in density.
- Isolation Forest: Detects anomalies by isolating points with random splits (anomalous points tend to need fewer splits to be separated from the rest of the data), rather than by modelling the normal points.
- Histogram-based Outlier Detection: Assumes the features are independent and calculates the degree of anomaly by building a histogram for each feature.
- K-Nearest Neighbors: Uses the distance from a point to its kth nearest neighbor as the outlier score.
- Local Outlier Factor: Computes the local density deviation of a given data point with respect to its neighbors.
- One-class SVM detector: Learns a boundary that encloses most of the data. Points that fall outside the boundary are outliers.
- Principal Component Analysis: Breaks the source data matrix down into its principal components, then reconstructs the original data using just the first few principal components. The reconstructed data will be similar to, but not exactly the same as, the original data. The items whose reconstruction differs most from the original are the anomalous items.
- Minimum Covariance Determinant (MCD): A highly robust estimator of multivariate location and scatter. It is not meant to be used with multi-modal data.
- Subspace Outlier Detection: Used for high-dimensional data (the number of features is larger than the number of observations).
- Stochastic Outlier Selection: Considers a point an outlier when the other data points have insufficient affinity with it.
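
To make one of these descriptions concrete, the K-Nearest Neighbors score can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PyOD implementation: the function name `knn_outlier_scores`, the sample points, and `k=2` are all hypothetical.

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical 2-D data: a tight cluster plus one far-away point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_outlier_scores(points, k=2)

# The point with the largest score is the strongest outlier candidate.
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(points[most_anomalous], scores[most_anomalous])
```

The isolated point is far from even its second-nearest neighbor, so its score is much larger than the scores inside the cluster.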

In [9]:
iforest = create_model('iforest')
print(iforest)

IForest(behaviour='new', bootstrap=False, contamination=0.05,
    max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1,
    random_state=123, verbose=0)


We have created an Isolation Forest model using create_model(). Notice that the contamination parameter is set to 0.05, which is the default value when you do not pass the fraction parameter to create_model(). The fraction parameter specifies the expected proportion of outliers in the dataset. In the example below, we create a One-class Support Vector Machine model with a fraction of 0.025.

In [10]:
svm = create_model('svm', fraction = 0.025)
print(svm)

OCSVM(cache_size=200, coef0=0.0, contamination=0.025, degree=3, gamma='auto',
   kernel='rbf', max_iter=-1, nu=0.5, shrinking=True, tol=0.001,
   verbose=False)
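
The effect of the fraction parameter can be illustrated with a small, library-free sketch. It assumes that a detector turns scores into labels by flagging the highest-scoring share of points; the helper `label_by_fraction` and the scores are hypothetical, and the exact thresholding rule used inside PyOD may differ in detail.

```python
def label_by_fraction(scores, fraction=0.05):
    """Label the highest-scoring `fraction` of points 1 (outlier), the rest 0 (inlier)."""
    n_outliers = round(fraction * len(scores))
    if n_outliers == 0:
        # Fraction too small for this sample size: nothing is flagged.
        return [0] * len(scores), None
    # Threshold is the n_outliers-th largest score.
    threshold = sorted(scores, reverse=True)[n_outliers - 1]
    labels = [1 if s >= threshold else 0 for s in scores]
    return labels, threshold


# Hypothetical anomaly scores for 40 observations (higher = more anomalous).
scores = [i / 10 for i in range(40)]

labels_default, thr_default = label_by_fraction(scores, fraction=0.05)
labels_small, thr_small = label_by_fraction(scores, fraction=0.025)
print(sum(labels_default), thr_default)
print(sum(labels_small), thr_small)
```

Under the same assumption, fractions of 0.05, 0.025, and 0.1 on the 1,080 rows of this dataset would correspond to roughly 54, 27, and 108 flagged rows respectively.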


In [25]:
cof = create_model('cof')
print(cof)

COF(contamination=0.05, method='fast', n_neighbors=None)


In [28]:
mcd = create_model('mcd',fraction = 0.1)
print(mcd)

MCD(assume_centered=False, contamination=0.1, random_state=123,
  store_precision=True, support_fraction=None)


## Assign Model

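This function assigns anomaly labels to the dataset for a given model (1 = outlier, 0 = inlier). It returns the data with two added columns: Anomaly, which holds the label, and Anomaly_Score, which holds the score computed by the model.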

In [22]:
result = assign_model(iforest)
result.sort_values(by=['Anomaly','MouseID'], ascending=False).head()

Unnamed: 0,MouseID,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,...,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N,Genotype,Treatment,Behavior,class,Anomaly,Anomaly_Score
1078,J3295_14,0.221242,0.412894,0.243974,1.876347,2.384088,0.208897,0.173623,2.086028,0.192044,...,0.335936,0.251317,0.365353,1.404031,Ts65Dn,Saline,S/C,t-SC-s,1,0.021575
1077,J3295_13,0.2287,0.395179,0.234118,1.733184,2.220852,0.220665,0.161435,1.989723,0.185164,...,0.321306,0.229193,0.355213,1.430825,Ts65Dn,Saline,S/C,t-SC-s,1,0.013844
433,50810F_14,0.399414,0.496445,0.368883,2.116688,2.801757,0.314095,0.230029,3.525721,0.291092,...,,,,1.477917,Control,Saline,C/S,c-CS-s,1,0.019591
432,50810F_13,0.350444,0.456195,0.356233,1.959475,2.934774,0.290621,0.19259,3.191432,0.255886,...,,,,1.421179,Control,Saline,C/S,c-CS-s,1,0.005366
431,50810F_12,0.390873,0.512566,0.296296,2.112434,3.481481,0.284722,0.198743,3.588624,0.266865,...,,,,1.367338,Control,Saline,C/S,c-CS-s,1,0.000196


In [23]:
result = assign_model(svm)
result.sort_values(by=['Anomaly','MouseID'], ascending=False).head()

Unnamed: 0,MouseID,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,...,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N,Genotype,Treatment,Behavior,class,Anomaly,Anomaly_Score
391,50810A_2,0.435422,0.831312,0.419144,2.98743,3.968701,0.193388,0.206712,2.352209,0.293948,...,0.160983,0.189128,0.163777,1.963432,Control,Saline,C/S,c-CS-s,1,68.886878
390,50810A_1,0.488149,0.867264,0.438504,3.174743,4.158349,0.205755,0.202858,2.522979,0.294838,...,0.148312,0.165001,0.18675,2.077084,Control,Saline,C/S,c-CS-s,1,78.043372
286,365_2,0.309432,0.539782,0.3474,2.544861,4.586578,0.237969,0.184401,4.770859,0.225514,...,,0.176881,,1.009781,Control,Memantine,S/C,c-SC-m,1,91.889324
283,364_14,0.209583,0.420929,0.258377,2.279541,3.014697,0.26749,0.167842,4.345091,0.278366,...,0.240399,,,1.091343,Control,Memantine,S/C,c-SC-m,1,76.390346
554,3516_15,0.178447,0.264085,0.230076,1.440279,2.014136,0.185208,0.150994,2.111043,0.142594,...,,0.320755,0.413903,0.830387,Control,Saline,S/C,c-SC-s,1,68.991053


In [26]:
result = assign_model(cof)
result.sort_values(by=['Anomaly','MouseID'], ascending=False).head()

Unnamed: 0,MouseID,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,...,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N,Genotype,Treatment,Behavior,class,Anomaly,Anomaly_Score
1079,J3295_15,0.302626,0.461059,0.256564,2.09279,2.594348,0.251001,0.191811,2.361816,0.223632,...,0.335062,0.252995,0.365278,1.370999,Ts65Dn,Saline,S/C,t-SC-s,1,1.307107
1063,J1291_14,0.163325,0.332454,0.255145,1.901847,2.44591,0.53905,0.176781,2.731398,0.243536,...,0.289016,0.201268,0.28063,1.362446,Ts65Dn,Saline,S/C,t-SC-s,1,1.213466
433,50810F_14,0.399414,0.496445,0.368883,2.116688,2.801757,0.314095,0.230029,3.525721,0.291092,...,,,,1.477917,Control,Saline,C/S,c-CS-s,1,1.246822
390,50810A_1,0.488149,0.867264,0.438504,3.174743,4.158349,0.205755,0.202858,2.522979,0.294838,...,0.148312,0.165001,0.18675,2.077084,Control,Saline,C/S,c-CS-s,1,1.210199
289,365_5,0.284857,0.50695,0.323482,2.44038,3.802926,0.236284,0.177762,4.531236,0.208339,...,,0.209821,,1.108153,Control,Memantine,S/C,c-SC-m,1,1.22913


In [29]:
result = assign_model(mcd)
result.sort_values(by=['Anomaly','MouseID'], ascending=False).head()

Unnamed: 0,MouseID,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,...,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N,Genotype,Treatment,Behavior,class,Anomaly,Anomaly_Score
558,J2292_4,0.239189,0.383179,0.355445,2.298053,4.789223,0.2187,0.169346,5.502377,0.199004,...,0.197401,0.168381,0.17879,1.29588,Control,Saline,S/C,c-SC-s,1,451.913876
1063,J1291_14,0.163325,0.332454,0.255145,1.901847,2.44591,0.53905,0.176781,2.731398,0.243536,...,0.289016,0.201268,0.28063,1.362446,Ts65Dn,Saline,S/C,t-SC-s,1,1451.465037
434,50810F_15,0.34781,0.469789,0.344411,1.916918,2.724698,0.311178,0.19713,3.182779,0.274924,...,,,,1.453513,Control,Saline,C/S,c-CS-s,1,690.588031
433,50810F_14,0.399414,0.496445,0.368883,2.116688,2.801757,0.314095,0.230029,3.525721,0.291092,...,,,,1.477917,Control,Saline,C/S,c-CS-s,1,619.786664
432,50810F_13,0.350444,0.456195,0.356233,1.959475,2.934774,0.290621,0.19259,3.191432,0.255886,...,,,,1.421179,Control,Saline,C/S,c-CS-s,1,746.826565
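
The cells above sort by Anomaly and then MouseID, so the rows displayed are not ordered by how anomalous they are. A minimal sketch of ranking by score follows, using plain Python and only the five MCD rows shown above. It assumes that a higher Anomaly_Score means a more anomalous row, which is the usual PyOD convention but is not stated in this notebook.

```python
# (row index, MouseID, Anomaly_Score) copied from the five MCD rows shown above.
mcd_head = [
    (558, "J2292_4", 451.913876),
    (1063, "J1291_14", 1451.465037),
    (434, "50810F_15", 690.588031),
    (433, "50810F_14", 619.786664),
    (432, "50810F_13", 746.826565),
]

# Rank by score, highest first (assumption: higher score = more anomalous).
ranked = sorted(mcd_head, key=lambda row: row[2], reverse=True)

for index, mouse_id, score in ranked:
    print(index, mouse_id, score)
```

On the full DataFrame, the same ordering could be obtained by sorting `result` on the `Anomaly_Score` column in descending order.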


As in the previous examples, models can be saved using the save_model() function.