# Anomaly Detection - Stage 1

### Objective

1. Getting the data
2. Setting up the environment
3. Creating a model
4. Plotting a model
5. Saving/loading a model

## Dataset

For this tutorial we use the Mice Protein Expression dataset from the UCI Machine Learning Repository. It contains the expression levels of 77 proteins/protein modifications that produced detectable signals in the nuclear fraction of the cortex. There are 1080 measurements per protein, and each measurement can be treated as an independent sample (mouse).

### 1. Getting the Data

In [2]:
from pycaret.datasets import get_data
dataset = get_data('mice')

,MouseID,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,...,pCFOS_N,SYP_N,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N,Genotype,Treatment,Behavior,class
0,309_1,0.503644,0.747193,0.430175,2.816329,5.990152,0.21883,0.177565,2.373744,0.232224,...,0.108336,0.427099,0.114783,0.13179,0.128186,1.675652,Control,Memantine,C/S,c-CS-m
1,309_2,0.514617,0.689064,0.41177,2.789514,5.685038,0.211636,0.172817,2.29215,0.226972,...,0.104315,0.441581,0.111974,0.135103,0.131119,1.74361,Control,Memantine,C/S,c-CS-m
2,309_3,0.509183,0.730247,0.418309,2.687201,5.622059,0.209011,0.175722,2.283337,0.230247,...,0.106219,0.435777,0.111883,0.133362,0.127431,1.926427,Control,Memantine,C/S,c-CS-m
3,309_4,0.442107,0.617076,0.358626,2.466947,4.979503,0.222886,0.176463,2.152301,0.207004,...,0.111262,0.391691,0.130405,0.147444,0.146901,1.700563,Control,Memantine,C/S,c-CS-m
4,309_5,0.43494,0.61743,0.358802,2.365785,4.718679,0.213106,0.173627,2.134014,0.192158,...,0.110694,0.434154,0.118481,0.140314,0.14838,1.83973,Control,Memantine,C/S,c-CS-m


In [3]:
dataset.shape

(1080, 82)

In [4]:
dataset.describe()

,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,pELK_N,...,SHH_N,BAD_N,BCL2_N,pS6_N,pCFOS_N,SYP_N,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N
count,1077.0,1077.0,1077.0,1077.0,1077.0,1077.0,1077.0,1077.0,1077.0,1077.0,...,1080.0,867.0,795.0,1080.0,1005.0,1080.0,900.0,870.0,810.0,1080.0
mean,0.42581,0.617102,0.319088,2.297269,3.843934,0.233168,0.181846,3.537109,0.212574,1.428682,...,0.226676,0.157914,0.134762,0.121521,0.131053,0.446073,0.169609,0.183135,0.20544,1.337784
std,0.249362,0.25164,0.049383,0.347293,0.9331,0.041634,0.027042,1.295169,0.032587,0.466904,...,0.028989,0.029537,0.027417,0.014276,0.023863,0.066432,0.059402,0.040406,0.055514,0.317126
min,0.145327,0.245359,0.115181,1.330831,1.73754,0.063236,0.064043,1.343998,0.112812,0.429032,...,0.155869,0.088305,0.080657,0.067254,0.085419,0.258626,0.079691,0.105537,0.101787,0.586479
25%,0.288121,0.473361,0.287444,2.057411,3.155678,0.205755,0.164595,2.479834,0.190823,1.203665,...,0.206395,0.136424,0.115554,0.110839,0.113506,0.398082,0.125848,0.155121,0.165143,1.081423
50%,0.366378,0.565782,0.316564,2.296546,3.760855,0.231177,0.182302,3.32652,0.210594,1.355846,...,0.224,0.152313,0.129468,0.121626,0.126523,0.448459,0.15824,0.174935,0.193994,1.317441
75%,0.487711,0.698032,0.348197,2.528481,4.440011,0.257261,0.197418,4.48194,0.234595,1.561316,...,0.241655,0.174017,0.148235,0.131955,0.143652,0.490773,0.197876,0.204542,0.235215,1.585824
max,2.516367,2.602662,0.49716,3.757641,8.482553,0.53905,0.317066,7.46407,0.306247,6.113347,...,0.358289,0.282016,0.261506,0.158748,0.256529,0.759588,0.479763,0.360692,0.413903,2.129791


Hold out 5% of the data as unseen data for prediction later; the remaining 95% is used for modeling.

In [5]:
data = dataset.sample(frac=0.95, random_state=42)
data_unseen = dataset.drop(data.index)

In [6]:
data.shape

(1026, 82)

In [7]:
data_unseen.shape

(54, 82)

In [8]:
data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)
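
The shapes above (1026 and 54 rows) follow directly from taking 95% of 1080 rows and keeping the remainder. The sketch below reproduces only that arithmetic and the disjointness of the two sets, using the standard library instead of pandas. The helper `holdout_split` is a hypothetical name introduced for illustration, and the rows it picks will not match those chosen by `dataset.sample`.

```python
import random


def holdout_split(n_rows, frac, seed=42):
    """Split row positions into a modeling set and an unseen set.

    Hypothetical helper for illustration; it mirrors the sizes of the
    sample/drop pattern above, not the exact rows pandas would select.
    """
    rng = random.Random(seed)
    n_model = round(n_rows * frac)  # number of rows kept for modeling
    model_idx = sorted(rng.sample(range(n_rows), n_model))
    chosen = set(model_idx)
    # Everything not sampled becomes the unseen hold-out set.
    unseen_idx = [i for i in range(n_rows) if i not in chosen]
    return model_idx, unseen_idx


model_idx, unseen_idx = holdout_split(1080, 0.95)
print(len(model_idx), len(unseen_idx))  # 1026 54
```

Because the unseen rows are defined as "everything not sampled", the two sets can never overlap, which is what makes the hold-out set a fair check later on.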

### 2. Setting up Environment

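The setup() function initializes the environment in PyCaret and creates the transformation pipeline that prepares the data for modeling and deployment. Here, normalize=True scales the numeric features, ignore_features drops the MouseID identifier column, and session_id fixes the random seed for reproducibility.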

In [9]:
from pycaret.anomaly import *

exp = setup(data, 
            normalize=True,
            ignore_features=['MouseID'],
            session_id=123)

Setup Succesfully Completed!


,Description,Value
0,session_id,123
1,Original Data,"(1026, 82)"
2,Missing Values,True
3,Numeric Features,76
4,Categorical Features,6
5,Ordinal Features,False
6,High Cardinality Features,False
7,Transformed Data,"(1026, 91)"
8,Numeric Imputer,mean
9,Categorical Imputer,constant
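
The summary above reports a mean imputer for numeric features, and the setup call requested normalization. As a rough illustration of what those two steps do to a single numeric column, here is a minimal standard-library sketch. It assumes z-score scaling (subtract the mean, divide by the population standard deviation) as the normalization method; the helper names `impute_mean` and `zscore` and the sample values are made up for this example and are not PyCaret functions.

```python
import math
from statistics import mean, pstdev


def _is_missing(v):
    """Treat None and NaN as missing values."""
    return v is None or math.isnan(v)


def impute_mean(values):
    """Replace missing entries with the mean of the observed entries."""
    fill = mean(v for v in values if not _is_missing(v))
    return [fill if _is_missing(v) else v for v in values]


def zscore(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


# Made-up column with one missing measurement.
column = [0.50, 0.44, float("nan"), 0.51, 0.43]
scaled = zscore(impute_mean(column))
print(scaled)
```

A mean-imputed entry ends up at (approximately) zero after z-scoring, so it neither pulls a sample toward nor pushes it away from the centre of that feature.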


### 3. Create Model

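Create an Isolation Forest model using create_model(). The printed model shows contamination=0.05, i.e. about 5% of the samples are expected to be outliers.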

In [10]:
iforest = create_model('iforest')

In [11]:
print(iforest)

IForest(behaviour='new', bootstrap=False, contamination=0.05,
    max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=1,
    random_state=123, verbose=0)


In [14]:
# Create a One-class SVM; fraction is the expected proportion of outliers (2.5% here)

svm = create_model('svm', fraction=0.025)

In [15]:
print(svm)

OCSVM(cache_size=200, coef0=0.0, contamination=0.025, degree=3, gamma='auto',
   kernel='rbf', max_iter=-1, nu=0.5, shrinking=True, tol=0.001,
   verbose=False)


In [16]:
models()

ID,Name,Reference
abod,Angle-base Outlier Detection,pyod.models.abod.ABOD
iforest,Isolation Forest,pyod.models.iforest
cluster,Clustering-Based Local Outlier,pyod.models.cblof
cof,Connectivity-Based Outlier Factor,pyod.models.cof
histogram,Histogram-based Outlier Detection,pyod.models.hbos
knn,k-Nearest Neighbors Detector,pyod.models.knn
lof,Local Outlier Factor,pyod.models.lof
svm,One-class SVM detector,pyod.models.ocsvm
pca,Principal Component Analysis,pyod.models.pca
mcd,Minimum Covariance Determinant,pyod.models.mcd


In [17]:
iforest_result = assign_model(iforest)
iforest_result.head()

,MouseID,DYRK1A_N,ITSN1_N,BDNF_N,NR1_N,NR2A_N,pAKT_N,pBRAF_N,pCAMKII_N,pCREB_N,...,H3AcK18_N,EGR1_N,H3MeK4_N,CaNA_N,Genotype,Treatment,Behavior,class,Label,Score
0,50810F_4,0.492403,0.658379,0.339319,2.446823,4.613029,0.27325,0.218692,4.184162,0.26128,...,,,,1.45239,Control,Saline,C/S,c-CS-s,0,-0.039132
1,3516_9,0.182518,0.298969,0.229708,1.725425,2.699869,0.174822,0.139538,2.747931,0.187309,...,,0.220072,0.338278,1.090741,Control,Saline,S/C,c-SC-s,1,0.005363
2,3411_12,0.28845,0.515536,0.286301,2.043971,3.312488,0.218683,0.19914,2.929255,0.226304,...,,0.286819,,1.152579,Ts65Dn,Memantine,S/C,t-SC-m,0,-0.085396
3,3416_4,0.5715,0.747993,0.311465,2.450201,3.82727,0.200075,0.165454,2.424611,0.192925,...,0.14936,,,1.720202,Ts65Dn,Memantine,C/S,t-CS-m,0,-0.08537
4,J1291_2,0.287189,0.523557,0.319746,2.42549,3.589465,0.244044,0.189254,3.807835,0.250662,...,0.287167,0.127822,0.220443,1.372286,Ts65Dn,Saline,S/C,t-SC-s,0,-0.074109


Two columns, Label and Score, are appended to the end of the data. Label is 0 for inliers and 1 for outliers/anomalies; Score holds the anomaly score computed by the algorithm.
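
To make the relationship between Score, Label, and the outlier fraction concrete, the sketch below shows one common way a detector can turn scores into labels. It assumes a PyOD-style convention: higher scores mean "more anomalous", and the cut-off is the (1 - fraction) percentile of the training scores. Whether a given PyCaret version does exactly this is an assumption here. The function names and the scores are made up for illustration.

```python
def percentile(values, q):
    """Linearly interpolated percentile of values, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def label_by_fraction(scores, fraction):
    """Label the highest-scoring `fraction` of samples as outliers (1)."""
    threshold = percentile(scores, 100.0 * (1.0 - fraction))
    labels = [1 if s > threshold else 0 for s in scores]
    return labels, threshold


# Made-up anomaly scores; higher is assumed to mean more anomalous.
scores = [-0.039, 0.005, -0.085, -0.085, -0.074,
          -0.060, -0.091, -0.052, -0.070, 0.021]
labels, threshold = label_by_fraction(scores, fraction=0.2)
print(labels)  # [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
```

Under this convention, lowering the fraction raises the threshold, so fewer samples receive Label 1 while their scores stay the same.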

### 4. Plot Model

In [18]:
plot_model(iforest)