# Anomaly Detection Using Machine Learning

__Anomaly:__

- Uncertainty

- Outlier detection 

- Novelty detection (Something new)

- Unusual pattern 

- Inconsistent data points


Time-series anomaly detection

Video-level anomaly detection

Image-level anomaly detection

  - Out-of-Distribution (OOD) detection target

  - Anomaly segmentation target
  
  
__PyOD__: a Python toolkit for detecting outlying objects in multivariate data (available since May 2018)





## Benchmark 

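__Linear models for outlier detection__

- __PCA: Principal Component Analysis__: uses the sum of weighted projected distances to the eigenvector hyperplane as the outlier score.

- __MCD: Minimum Covariance Determinant__: fits a robust estimate of the mean and covariance and uses each point's Mahalanobis distance to that estimate as the outlier score.

- __OCSVM: One-Class Support Vector Machine__: an unsupervised variant of the SVM that learns a boundary around the bulk of the data and scores points by how far they fall outside it.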
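To make the Mahalanobis-distance idea behind MCD concrete, here is a minimal NumPy sketch. It is a simplification: it uses the ordinary sample mean and covariance, not the robust MCD estimate that PyOD's `MCD` relies on, and the function name `mahalanobis_scores` is hypothetical.

```python
import numpy as np

def mahalanobis_scores(X_train, X_test):
    """Score test points by Mahalanobis distance to the training data.

    Simplified: plain sample mean/covariance, not the robust MCD estimate.
    """
    mean = X_train.mean(axis=0)
    cov = np.cov(X_train, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates a singular covariance
    diff = X_test - mean
    # Squared distance per row: diff_i^T @ cov_inv @ diff_i
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0))

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))
X_test = np.array([[0.0, 0.0], [6.0, 6.0]])  # a central point and a distant one
scores = mahalanobis_scores(X_train, X_test)
print(scores)  # the distant point should receive the larger score
```

A larger score means the point lies further from the centre of the training data once feature scale and correlation are taken into account.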

__Proximity-based outlier detection models__:

- __LOF:__ Local Outlier Factor

- __CBLOF:__ Cluster-Based Local Outlier Factor

- __KNN: k Nearest Neighbours:__ uses the distance to the kth nearest neighbour as the outlier score

- __HBOS:__ Histogram-Based Outlier Score
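The kNN score described above can be sketched in a few lines of NumPy. This is an illustrative brute-force version, not PyOD's `KNN` implementation; the function name `knn_scores` and the choice `k=5` are assumptions.

```python
import numpy as np

def knn_scores(X_train, X_test, k=5):
    """Outlier score = Euclidean distance to the k-th nearest training point."""
    # Pairwise distances, shape (n_test, n_train)
    diff = X_test[:, None, :] - X_train[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Sort each row ascending and take the k-th smallest distance
    return np.sort(dist, axis=1)[:, k - 1]

rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 2))
X_test = np.array([[0.0, 0.0], [8.0, 8.0]])  # a central point and a distant one
scores = knn_scores(X_train, X_test, k=5)
print(scores)  # the distant point should receive the larger score
```

Points in dense regions have a close kth neighbour and therefore a low score; isolated points have a high one.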

__Probabilistic models:__

- __ABOD:__ Angle-Based Outlier Detection

__Outlier Ensembles & combination Frameworks__: optimisation

- __Isolation Forest__

- __Feature Bagging__
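As a rough illustration of how feature bagging combines detectors, the sketch below trains a simple kNN-distance scorer on several random feature subsets and averages the scores. It is not PyOD's `FeatureBagging` (which uses LOF as its default base detector); the helper names, the subset-size rule (between half and all-but-one of the features) and averaging as the combination rule are assumptions made for the example, and it expects at least two features.

```python
import numpy as np

def knn_distance(X_train, X_test, k=5):
    """Base detector: distance to the k-th nearest training point."""
    diff = X_test[:, None, :] - X_train[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return np.sort(dist, axis=1)[:, k - 1]

def feature_bagging_scores(X_train, X_test, n_estimators=10, k=5, seed=0):
    """Average base-detector scores over random feature subsets."""
    rng = np.random.RandomState(seed)
    n_features = X_train.shape[1]  # assumed to be >= 2
    all_scores = []
    for _ in range(n_estimators):
        # Subset size drawn from [n_features // 2, n_features - 1]
        size = rng.randint(n_features // 2, n_features)
        cols = rng.choice(n_features, size=size, replace=False)
        all_scores.append(knn_distance(X_train[:, cols], X_test[:, cols], k))
    return np.mean(all_scores, axis=0)

rng = np.random.RandomState(1)
X_train = rng.normal(size=(100, 4))
X_test = np.array([[0.0, 0.0, 0.0, 0.0], [8.0, 8.0, 8.0, 8.0]])
scores = feature_bagging_scores(X_train, X_test)
print(scores)  # the distant point should receive the larger score
```

Because each base detector sees a different projection of the data, the averaged score is less sensitive to any single noisy feature.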

## Import packages

In [14]:
import os  # to build file paths
import sys
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split  # to split the dataset
from scipy.io import loadmat  # to load MATLAB .mat files


## Import PyOD packages & methods

In [20]:
from pyod.models.pca import PCA       # Principal Component Analysis
from pyod.models.mcd import MCD       # Minimum Covariance Determinant
from pyod.models.ocsvm import OCSVM   # One-Class Support Vector Machine
from pyod.models.lof import LOF       # Local Outlier Factor
from pyod.models.cblof import CBLOF   # Cluster-Based Local Outlier Factor
from pyod.models.knn import KNN       # k Nearest Neighbours
from pyod.models.hbos import HBOS     # Histogram-Based Outlier Score
from pyod.models.abod import ABOD     # Angle-Based Outlier Detection
from pyod.models.iforest import IForest                  # Isolation Forest
from pyod.models.feature_bagging import FeatureBagging   # Feature Bagging

## Import performance metric packages

In [22]:
from pyod.utils.utility import standardizer
from pyod.utils.utility import precision_n_scores
from sklearn.metrics import roc_auc_score
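To show what the two metrics measure, here is a small NumPy sketch of ROC AUC and precision at rank n. These are simplified stand-ins written for illustration, not the `roc_auc_score` or `precision_n_scores` implementations (which may, for example, treat tied scores differently); the function names `roc_auc` and `precision_at_n` are hypothetical.

```python
import numpy as np

def roc_auc(y_true, scores):
    """Probability that a random outlier scores higher than a random inlier."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    # Compare every outlier score against every inlier score
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def precision_at_n(y_true, scores):
    """Fraction of true outliers among the n highest scores (n = #outliers)."""
    y_true = np.asarray(y_true).astype(int)
    scores = np.asarray(scores, dtype=float)
    n = int(y_true.sum())
    top_n = np.argsort(scores)[::-1][:n]  # indices of the n largest scores
    return y_true[top_n].sum() / n

y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.9, 0.3, 0.8, 0.7])
auc = roc_auc(y_true, scores)            # 6 of 8 outlier/inlier pairs ranked correctly
p_at_n = precision_at_n(y_true, scores)  # 1 of the top 2 scores is a true outlier
print(auc, p_at_n)  # 0.75 0.5
```

ROC AUC looks at the whole ranking, while precision at rank n only looks at the top of it, so a detector can do well on one and poorly on the other.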

## Define the data files and read X, y

In [23]:
mat_file_list = ['arrhythmia.mat',
 'cardio.mat',
 'glass.mat',
 'ionosphere.mat',
 'letter.mat',
 'lympho.mat',
 'mnist.mat',
 'musk.mat',
 'optdigits.mat',
 'pendigits.mat',
 'pima.mat',
 'satellite.mat',
 'satimage-2.mat',
 'shuttle.mat',
 'vertebral.mat',
 'vowels.mat',
 'wbc.mat']  # all the dataset files in .mat format (each listed once)

In [27]:
mat_file_list

['arrhythmia.mat',
 'cardio.mat',
 'glass.mat',
 'ionosphere.mat',
 'letter.mat',
 'lympho.mat',
 'mnist.mat',
 'musk.mat',
 'optdigits.mat',
 'pendigits.mat',
 'pima.mat',
 'satellite.mat',
 'satimage-2.mat',
 'shuttle.mat',
 'vertebral.mat',
 'vowels.mat',
 'wbc.mat']
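Before loading anything, it can help to confirm which of the listed files are actually present in the directory you plan to read from. The helper below is a hypothetical convenience, not part of PyOD or SciPy; it is demonstrated on a temporary directory so that it runs without the real datasets.

```python
import os
import tempfile

def split_existing(data_dir, file_names):
    """Return (found, missing) lists for file_names inside data_dir."""
    found, missing = [], []
    for name in file_names:
        path = os.path.join(data_dir, name)
        (found if os.path.isfile(path) else missing).append(name)
    return found, missing

# Demonstration with a throwaway directory containing one empty file
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, 'cardio.mat'), 'wb').close()
    found, missing = split_existing(demo_dir, ['cardio.mat', 'glass.mat'])

print(found, missing)  # ['cardio.mat'] ['glass.mat']
```

In the notebook, calling it with the real data directory and `mat_file_list` would report missing files up front instead of failing partway through a long loop.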

In [32]:
data = loadmat('data/cardio.mat')
data # 'X' holds the input features and 'y' the outlier labels

{'__header__': b'MATLAB 5.0 MAT-file, written by Octave 3.8.0, 2014-12-18 10:48:09 UTC',
 '__version__': '1.0',
 '__globals__': [],
 'X': array([[ 0.00491231,  0.69319077, -0.20364049, ...,  0.23149795,
         -0.28978574, -0.49329397],
        [ 0.11072935, -0.07990259, -0.20364049, ...,  0.09356344,
         -0.25638541, -0.49329397],
        [ 0.21654639, -0.27244466, -0.20364049, ...,  0.02459619,
         -0.25638541,  1.14001753],
        ...,
        [-0.41835583, -0.91998844, -0.16463485, ..., -1.49268341,
          0.24461959, -0.49329397],
        [-0.41835583, -0.91998844, -0.15093411, ..., -1.42371616,
          0.14441859, -0.49329397],
        [-0.41835583, -0.91998844, -0.20364049, ..., -1.28578165,
          3.58465295, -0.49329397]]),
 'y': array([[0.],
        [0.],
        [0.],
        ...,
        [1.],
        [1.],
        [1.]])}

In [33]:
len(data)

5

In [34]:
data.keys() # keys wrapped in double underscores are metadata added by loadmat

dict_keys(['__header__', '__version__', '__globals__', 'X', 'y'])

In [35]:
data.values()

dict_values([b'MATLAB 5.0 MAT-file, written by Octave 3.8.0, 2014-12-18 10:48:09 UTC', '1.0', [], array([[ 0.00491231,  0.69319077, -0.20364049, ...,  0.23149795,
        -0.28978574, -0.49329397],
       [ 0.11072935, -0.07990259, -0.20364049, ...,  0.09356344,
        -0.25638541, -0.49329397],
       [ 0.21654639, -0.27244466, -0.20364049, ...,  0.02459619,
        -0.25638541,  1.14001753],
       ...,
       [-0.41835583, -0.91998844, -0.16463485, ..., -1.49268341,
         0.24461959, -0.49329397],
       [-0.41835583, -0.91998844, -0.15093411, ..., -1.42371616,
         0.14441859, -0.49329397],
       [-0.41835583, -0.91998844, -0.20364049, ..., -1.28578165,
         3.58465295, -0.49329397]]), array([[0.],
       [0.],
       [0.],
       ...,
       [1.],
       [1.],
       [1.]])])

### Input feature shape in Mat Files

In [40]:
type(data['X']), data['X'].shape # Input (Independent)

(numpy.ndarray, (1831, 21))

### Output feature shape in Mat Files

In [42]:
type(data['y']), data['y'].shape # Output (target) stored as a column of shape (n, 1)

(numpy.ndarray, (1831, 1))
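Because the labels arrive as an `(n, 1)` column, they are usually flattened before use. The small sketch below uses a made-up four-row label array (not the cardio data) to show the flattening and the outlier fraction that can be derived from it.

```python
import numpy as np

# Toy label column shaped like data['y']: 0 = inlier, 1 = outlier
y_column = np.array([[0.0], [0.0], [0.0], [1.0]])

y_flat = y_column.ravel()  # shape (4, 1) -> (4,)
outliers_fraction = np.count_nonzero(y_flat) / len(y_flat)

print(y_column.shape, y_flat.shape, outliers_fraction)  # (4, 1) (4,) 0.25
```

The resulting fraction is the kind of value PyOD detectors accept as their `contamination` parameter.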

In [43]:
df_columns = ['Data','#Sample','#Dimensions','Outlier Perc','PCA','MCD','OCSVM','LOF','CBLOF','KNN','HBOS','ABOD','IFOREST',
              'FEATUREBAGGING'] # column names shared by the result DataFrames


### ROC performance evaluation table

__Receiver Operating Characteristic (ROC) | Area Under the Curve (AUC)__

In [47]:
roc_df = pd.DataFrame(columns=df_columns) 
roc_df #empty dataframe with column names

Unnamed: 0,Data,#Sample,#Dimensions,Outlier Perc,PCA,MCD,OCSVM,LOF,CBLOF,KNN,HBOS,ABOD,IFOREST,FEATUREBAGGING


### precision_n_scores performance evaluation table

In [48]:
prn_df = pd.DataFrame(columns=df_columns)
prn_df

Unnamed: 0,Data,#Sample,#Dimensions,Outlier Perc,PCA,MCD,OCSVM,LOF,CBLOF,KNN,HBOS,ABOD,IFOREST,FEATUREBAGGING


### Execution time performance evaluation table

In [None]:
time_df = pd.DataFrame(columns=df_columns)

## Exploring .mat files

In [75]:
from time import time

random_state = np.random.RandomState(42)
data_dir = 'data'  # folder holding the .mat files (cardio.mat was loaded from here earlier)

for mat_file in mat_file_list:
    print("\n... Processing", mat_file, '...')
    mat = loadmat(os.path.join(data_dir, mat_file))

    X = mat['X']
    y = mat['y'].ravel()  # flatten the (n, 1) label column to shape (n,)
    outliers_fraction = np.count_nonzero(y) / len(y)
    outliers_percentage = round(outliers_fraction * 100, ndigits=4)

    # construct containers for saving results
    roc_list = [mat_file[:-4], X.shape[0], X.shape[1], outliers_percentage]
    prn_list = [mat_file[:-4], X.shape[0], X.shape[1], outliers_percentage]
    time_list = [mat_file[:-4], X.shape[0], X.shape[1], outliers_percentage]

    # 60% data for training and 40% for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=random_state)

    # standardizing data for processing
    X_train_norm, X_test_norm = standardizer(X_train, X_test)

    # Detectors are listed in the same order as the score columns of df_columns
    # (PCA ... FEATUREBAGGING) so that each result lands under its own column.
    classifiers = {
        'Principal Component Analysis (PCA)': PCA(
            contamination=outliers_fraction, random_state=random_state),
        'Minimum Covariance Determinant (MCD)': MCD(
            contamination=outliers_fraction, random_state=random_state),
        'One-class SVM (OCSVM)': OCSVM(contamination=outliers_fraction),
        'Local Outlier Factor (LOF)': LOF(contamination=outliers_fraction),
        'Cluster-based Local Outlier Factor (CBLOF)': CBLOF(
            contamination=outliers_fraction, check_estimator=False,
            random_state=random_state),
        'K Nearest Neighbors (KNN)': KNN(contamination=outliers_fraction),
        'Histogram-based Outlier Score (HBOS)': HBOS(
            contamination=outliers_fraction),
        'Angle-based Outlier Detector (ABOD)': ABOD(
            contamination=outliers_fraction),
        'Isolation Forest': IForest(contamination=outliers_fraction,
                                    random_state=random_state),
        'Feature Bagging': FeatureBagging(contamination=outliers_fraction,
                                          random_state=random_state),
    }

    for clf_name, clf in classifiers.items():
        # time the fit plus the scoring of the test set
        t0 = time()
        clf.fit(X_train_norm)
        test_scores = clf.decision_function(X_test_norm)
        t1 = time()
        duration = round(t1 - t0, ndigits=4)
        time_list.append(duration)

        roc = round(roc_auc_score(y_test, test_scores), ndigits=4)
        prn = round(precision_n_scores(y_test, test_scores), ndigits=4)

        print('{clf_name} ROC:{roc}, precision @ rank n:{prn}, '
              'execution time: {duration}s'.format(
                  clf_name=clf_name, roc=roc, prn=prn, duration=duration))

        roc_list.append(roc)
        prn_list.append(prn)

    # append one row per dataset to each result table
    temp_df = pd.DataFrame(time_list).transpose()
    temp_df.columns = df_columns
    time_df = pd.concat([time_df, temp_df], axis=0)

    temp_df = pd.DataFrame(roc_list).transpose()
    temp_df.columns = df_columns
    roc_df = pd.concat([roc_df, temp_df], axis=0)

    temp_df = pd.DataFrame(prn_list).transpose()
    temp_df.columns = df_columns
    prn_df = pd.concat([prn_df, temp_df], axis=0)
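The loop standardizes the features with `standardizer(X_train, X_test)` before fitting each detector. As far as I understand, this amounts to z-scoring both sets with statistics computed on the training set only; the NumPy sketch below illustrates that idea under this assumption and is not PyOD's implementation. The function name `standardize` is hypothetical.

```python
import numpy as np

def standardize(X_train, X_test):
    """Z-score both sets using the training mean and standard deviation."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled instead of dividing by zero
    return (X_train - mean) / std, (X_test - mean) / std

X_train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
X_test = np.array([[3.0, 12.0]])
X_train_norm, X_test_norm = standardize(X_train, X_test)

print(X_train_norm.mean(axis=0))  # approximately [0, 0]
print(X_test_norm)                # [[0. 2.]]
```

Reusing the training statistics for the test set keeps information about the test data out of the preprocessing step.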
