## Anomaly Detection in the Data

### Unsupervised Learning Approach for Outlier or Anomaly Detection

- **Assumption**: Training data (unlabelled) contains both normal and anomalous observations.
- The model identifies outliers during the fitting process.
- This approach is taken when outliers are defined as points that exist in low-density regions in the data.
- Any new observations that do not belong to high-density regions are considered outliers.
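
As a rough illustration of the low-density idea, the distance from a point to its k-th nearest neighbour can act as a density proxy: the larger that distance, the sparser the neighbourhood. The sketch below is a minimal NumPy version on synthetic data. The function name `knn_outlier_scores`, the toy data and `k=5` are assumptions chosen for illustration and are not part of the original analysis.

```python
import numpy as np

def knn_outlier_scores(X, k=5, method="largest"):
    """Score each row by its distance to nearby rows (higher = more isolated)."""
    # Pairwise Euclidean distances between all rows
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    nearest = np.sort(dist, axis=1)[:, :k]  # distances to the k nearest rows
    if method == "largest":
        return nearest[:, -1]               # distance to the k-th neighbour
    if method == "median":
        return np.median(nearest, axis=1)   # median distance to the k neighbours
    raise ValueError("method must be 'largest' or 'median'")

# Synthetic data (assumption): a dense cluster plus one far-away point at row 100
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(size=(100, 2)), [[6.0, 6.0]]])

scores_kth = knn_outlier_scores(X_demo, k=5, method="largest")
scores_median = knn_outlier_scores(X_demo, k=5, method="median")
most_isolated = int(np.argmax(scores_kth))
print("Row with the highest kNN score:", most_isolated)
```

The pairwise distance matrix grows quadratically with the number of rows, so this form is only suitable for small samples; library implementations typically rely on tree-based neighbour searches instead.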

Below are some of the algorithms that can be used to fit a model and flag likely outliers.

  1. Proximity-Based Outlier Detection Models:
     1. **LOF: Local Outlier Factor**
     2. **CBLOF: Clustering-Based Local Outlier Factor**
     3. **kNN: k Nearest Neighbors** (use the distance to the kth nearest 
     neighbor as the outlier score)
     4. **Median kNN** Outlier Detection (use the median distance to k nearest 
     neighbors as the outlier score)
     
     
  2. Outlier Ensembles and Combination Frameworks
     1. **Isolation Forest**
         - The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

         - Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.

         - This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.

         - Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, those samples are highly likely to be anomalies.
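
The path-length argument above can be sketched with a toy isolation procedure. This is a simplified illustration under stated assumptions, not the scikit-learn or PyOD implementation: it skips subsampling and score normalization, and the helper names `isolation_path_length` and `mean_path_length`, along with the synthetic data, are hypothetical.

```python
import numpy as np

def isolation_path_length(X, idx, rng, max_depth=50):
    """Count the random splits needed to isolate row `idx` of X."""
    rows = np.arange(len(X))
    depth = 0
    while len(rows) > 1 and depth < max_depth:
        sub = X[rows]
        # Only features that still vary among the remaining rows can be split
        splittable = np.flatnonzero(sub.max(axis=0) > sub.min(axis=0))
        if splittable.size == 0:
            break  # remaining rows are identical and cannot be separated
        feature = rng.choice(splittable)
        split = rng.uniform(sub[:, feature].min(), sub[:, feature].max())
        # Keep only the rows on the same side of the split as the target row
        same_side = (sub[:, feature] < split) == (X[idx, feature] < split)
        rows = rows[same_side]
        depth += 1
    return depth

def mean_path_length(X, idx, n_trees=100, seed=0):
    """Average the path length of row `idx` over several random partitionings."""
    rng = np.random.default_rng(seed)
    return float(np.mean([isolation_path_length(X, idx, rng) for _ in range(n_trees)]))

# Synthetic data (assumption): 200 clustered points plus one distant point at row 200
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])

outlier_len = mean_path_length(X_demo, idx=200)
inlier_len = float(np.mean([mean_path_length(X_demo, i) for i in range(10)]))
print("Mean path length, distant point:", outlier_len)
print("Mean path length, clustered points:", inlier_len)
```

The distant point is usually cut off after only a few splits, while points inside the cluster need many more, which is the behaviour the averaged path length is meant to capture.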

### Load Packages

In [37]:
import json  # needed to read data_issues.json further down

import pandas as pd
import numpy as np

# Import models
from pyod.models.abod import ABOD
from pyod.models.cblof import CBLOF
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF

# To build the machine learning pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

### Import Data

In [4]:
df = pd.read_excel("Data_Science_Problem-external.xlsx", sheet_name="Data")

### Exclude Rows with Data Issues
- Load the JSON file created during the data-issue identification step and collect the affected row indices, so that these rows can be excluded before training the ML model

In [40]:
temp = []
with open('data_issues.json', 'r') as json_file:
    data = json.load(json_file)
    # Each element is looked up by the column name at the same position in df.columns
    for element, colname in zip(data, df.columns):
        temp.extend(element[colname]["Anomalous_Value_Index"])

# De-duplicate: a row may be flagged for more than one column
exclusion_index = set(temp)
print("Total likely affected rows removed prior to building ML model: %s" % len(exclusion_index))

Total likely affected rows removed prior to building ML model: 1657
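
The loop above pairs each JSON element with a column name by position. A slightly more defensive variant, sketched below, reads the column name from each element's own key instead. The JSON layout shown is only inferred from the loop (a list of single-key objects, each holding an `Anomalous_Value_Index` list); the sample values and the helper name `collect_flagged_indices` are hypothetical.

```python
import json

def collect_flagged_indices(issues):
    """Union of all row indices flagged in any column."""
    flagged = set()
    for entry in issues:
        for column, details in entry.items():
            # A missing key is treated as "no rows flagged" for that column
            flagged.update(details.get("Anomalous_Value_Index", []))
    return flagged

# Hypothetical sample mirroring the assumed layout of data_issues.json
sample_json = """
[
  {"sigid":   {"Anomalous_Value_Index": [3, 7]}},
  {"os_name": {"Anomalous_Value_Index": [7, 12]}}
]
"""
sample_exclusion = collect_flagged_indices(json.loads(sample_json))
print(sorted(sample_exclusion))
```

Reading the key from each element avoids silently mismatching columns if the order in the JSON file ever differs from `df.columns`.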


### Cleaning of data
- Exclude rows flagged with data-specific issues
- Segregate the numeric columns and cast them to a numeric dtype

In [44]:
def data_cleaning(dataframe, data_issue_index):
    # Drop rows that are entirely empty (use the argument, not the global df)
    df_new = dataframe.dropna(how="all")
    df_new = df_new[['dir', 'mod', 'nm', 'prod', 'productId', 'protocol',
                     'sigid', 'sigwid', 'machine_type', 'os_arch', 'os_name',
                     'product_name', 'product_version', 'sp_major_version',
                     'trial_copy', 'country_name', 'real_region_name']]

    # Remove rows flagged during data-issue identification; copy so the
    # assignment below does not trigger pandas' SettingWithCopyWarning
    df_final = df_new.loc[~df_new.index.isin(list(data_issue_index))].copy()

    # Cast the numeric columns to a numeric dtype
    numeric_cols = ["sigid", "sp_major_version", "trial_copy"]
    df_final[numeric_cols] = df_final[numeric_cols].apply(pd.to_numeric)

    print("Size of data after cleaning: {}".format(df_final.shape))
    return df_final

In [45]:
df_final = data_cleaning(df,exclusion_index)

Size of data after cleaning: (6342, 17)


### Feature Engineering Pipeline
- One-hot encoding of categorical features
- Missing values in categorical features replaced with the placeholder category "missing"
- Missing values in numerical features imputed with the column mean
- Numerical features standardized to zero mean and unit variance

In [8]:
def data_featureEngineering(df_final):
    
    #We create the preprocessing pipelines for both numeric and categorical data.
    numeric_features = ["sigid","sp_major_version","trial_copy"]
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())])

    categorical_features = ['dir', 'mod', 'nm', 'prod', 'productId', 'protocol',
               'sigwid', 'machine_type','os_arch', 'os_name', 
                'product_name','product_version', 'country_name','real_region_name']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])
    
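    X = preprocessor.fit_transform(df_final)
    # ColumnTransformer may return a sparse matrix; densify only in that case
    if hasattr(X, "toarray"):
        X = X.toarray()
    print("Shape of Training data {}".format(X.shape))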
    
    return X

In [9]:
X = data_featureEngineering(df_final)

Shape of Training data (6342, 136)
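
The jump from 17 input columns to 136 features comes from one-hot encoding, which creates one column per distinct category (including the "missing" placeholder). The sketch below reproduces the same three steps on a tiny hypothetical frame. It uses `pd.get_dummies` and manual standardization as stand-ins for `OneHotEncoder` and `StandardScaler`, and the column values are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame with the same kinds of columns as the real data
toy = pd.DataFrame({
    "os_arch": ["64Bit", "32Bit", None, "64Bit"],
    "protocol": ["SMB", "SMB", "TCP", None],
    "sigid": [30735.0, 31780.0, None, 30735.0],
})
cat_cols = ["os_arch", "protocol"]
num_cols = ["sigid"]

# Categorical: fill gaps with a placeholder, then one column per category
cats = pd.get_dummies(toy[cat_cols].fillna("missing"))

# Numeric: mean imputation, then zero mean and unit variance
# (population standard deviation, ddof=0, as StandardScaler uses)
nums = toy[num_cols].fillna(toy[num_cols].mean())
nums = (nums - nums.mean()) / nums.std(ddof=0)

encoded = pd.concat([nums, cats], axis=1)
print(encoded.shape)
print(list(encoded.columns))
```

Each categorical column here has three distinct values after imputation, so two categorical columns expand to six indicator columns alongside the single scaled numeric column.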


### Explore multiple algorithms to identify probable anomalous records

- __Key Assumption__: Roughly 5% of the sample is assumed to consist of outliers (`outliers_fraction = 0.05` in the function that follows). This outlier fraction is a hyperparameter that can be fine-tuned later.
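
To show what the assumed outlier fraction does in practice, the sketch below turns raw anomaly scores into binary labels with a percentile threshold. To the best of my understanding this mirrors how PyOD applies `contamination`, but treat it as an approximation rather than the library's exact code. The function name `label_by_contamination` and the synthetic scores are assumptions.

```python
import numpy as np

def label_by_contamination(scores, contamination):
    """Flag scores strictly above the (1 - contamination) percentile."""
    threshold = np.percentile(scores, 100 * (1 - contamination))
    labels = (scores > threshold).astype(int)  # 1 = outlier, 0 = inlier
    return labels, threshold

# Case 1: 1000 distinct synthetic scores, so about 5% end up above the threshold
rng = np.random.default_rng(0)
distinct_scores = rng.random(1000)
labels, _ = label_by_contamination(distinct_scores, 0.05)
n_flagged = int(labels.sum())

# Case 2: heavily tied scores; the threshold lands on the tied value itself,
# so nothing is strictly above it
tied_scores = np.array([0.0] * 90 + [1.0] * 10)
tied_labels, _ = label_by_contamination(tied_scores, 0.05)
n_flagged_tied = int(tied_labels.sum())

print(n_flagged, n_flagged_tied)
```

The second case suggests why a detector can flag fewer rows than the requested fraction when many rows share the same score, which can happen with one-hot encoded data containing many duplicate records.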

In [81]:
def build_classfier(X,df_final,export = True):
    
    random_state = np.random.RandomState(42)
    outliers_fraction = 0.05
    
    # Candidate detectors to compare before selecting one
    classifiers = {
        'Cluster-based Local Outlier Factor (CBLOF)': CBLOF(
            contamination=outliers_fraction,
            check_estimator=False, random_state=random_state),
        'K Nearest Neighbors (KNN)': KNN(
            contamination=outliers_fraction),
        'Average KNN': KNN(method='mean',
                           contamination=outliers_fraction),
        'Isolation Forest': IForest(contamination=outliers_fraction,
                                    random_state=random_state),
    }
    outlier_selection_list = []  # To store results of each fitted classifier

    for clf_name, clf in classifiers.items():
        print("Building %s model" % clf_name)
        clf.fit(X)
        # Binary prediction per data point: 0 = inlier, 1 = outlier
        y_pred = clf.predict(X)
        n_outliers = np.count_nonzero(y_pred == 1)
        n_inliers = len(y_pred) - n_outliers
        outlier_selection_list.append({"Algorithm": clf_name, "Oultiers Predicted": n_outliers})
        print('OUTLIERS : ', n_outliers, 'INLIERS : ', n_inliers, clf_name)
    
    # Select the classifier predicting the maximum number of outliers
    outlier_selection_list = pd.DataFrame(outlier_selection_list)
    selection = outlier_selection_list["Oultiers Predicted"].idxmax(axis = 0)
    classifier_name = outlier_selection_list.loc[selection,"Algorithm"]
    
    print("\n...................................................")
    print("Final Model Chosen to fit data: %s"%classifier_name)
    print("...................................................\n")
    
    # Finally, fit the selected classifier and store the result to disk
    clf = classifiers[classifier_name]
    clf.fit(X)

    # Binary prediction per data point: 0 = inlier, 1 = outlier
    y_pred = clf.predict(X)
    y_pred_label = ["Inlier" if x == 0 else "Outlier" for x in y_pred]
    y_predicted = pd.DataFrame({"Predicted_Label":y_pred_label,"Binary_Label": y_pred})
    y_predicted.index = df_final.index
    df_export = df_final.merge(y_predicted,left_index=True,right_index=True)
    print(df_export.head())
    print(df_export.tail())
    
    if export:
        df_export.to_csv("Outputdata_Likely_Outliers.csv",index= False)

    return df_export, clf,outlier_selection_list

In [82]:
df_export, clf,outlier_selection_list = build_classfier(X,df_final)

Building Cluster-based Local Outlier Factor (CBLOF) model
OUTLIERS :  312 INLIERS :  6030 Cluster-based Local Outlier Factor (CBLOF)
Building K Nearest Neighbors (KNN) model
OUTLIERS :  203 INLIERS :  6139 K Nearest Neighbors (KNN)
Building Average KNN model
OUTLIERS :  140 INLIERS :  6202 Average KNN
Building Isolation Forest model
OUTLIERS :  313 INLIERS :  6029 Isolation Forest

...................................................
Final Model Chosen to fit data: Isolation Forest
...................................................
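
The counts printed above are easier to compare as fractions of the 6342 cleaned rows. The short calculation below uses only the numbers from this output; the dictionary keys are abbreviated labels chosen for readability.

```python
# Outlier counts copied from the model comparison output
n_rows = 6342
outlier_counts = {
    "CBLOF": 312,
    "KNN": 203,
    "Average KNN": 140,
    "Isolation Forest": 313,
}

# Share of rows each detector flagged as outliers
fractions = {name: count / n_rows for name, count in outlier_counts.items()}
for name, frac in fractions.items():
    print(f"{name:>18}: {frac:.2%}")
```

All four fractions sit below the 5% contamination setting, with CBLOF and Isolation Forest closest to it. Since the selection rule picks the detector with the most flagged rows, the margin between those two is a single row. Tied scores at the threshold are one plausible reason the kNN variants flag fewer rows, though the output here does not confirm that.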





   dir  mod                           nm prod productId protocol  sigid  \
8   in  blk     SMB/Autoblue.UN!SP.30735   UN    qhpdt9      SMB  30735   
10  in  blk     SMB/Autoblue.UN!SP.30735   UN    qhpdt4      SMB  30735   
11  in  blk  SMB/EternalBlue.UN!SP.31780   UN    qhpdt9      SMB  31780   
12  in  blk  SMB/EternalBlue.UN!SP.31780   UN    qhpdt9      SMB  31780   
13  in  blk  SMB/EternalBlue.UN!SP.31780   UN    qhpdt9      SMB  31780   

     sigwid machine_type os_arch                     os_name product_name  \
8   qhcltr4      DESKTOP   64Bit            Windows 7 64 bit     qhpname7   
10  qhcltr4      DESKTOP   64Bit  Windows Server 2012 64 bit     qhpname3   
11  qhcltr4      DESKTOP   64Bit            Windows 7 64 bit     qhpname7   
12  qhcltr4      DESKTOP   64Bit            Windows 7 64 bit     qhpname7   
13  qhcltr4      DESKTOP   64Bit            Windows 7 64 bit     qhpname7   

   product_version  sp_major_version  trial_copy country_name  \
8          qhpver0   