# Model Transparency in Big Data Analytics for Intrusion Detection Systems: A Framework for Defining AI Trustworthiness (proposed title)

## Problem statement

“An intrusion detection system has become a vital mechanism to detect a wide variety of malicious activities in the cyber domain. However, this system still faces an important limitation when it comes to detecting zero-day attacks, concerning the reduction of relatively high false alarm rates” – Moustafa, 2017

In cybersecurity, an Intrusion Detection System (IDS) is essential to a solid line of defense against cyber intrusions. Intrusion detection is a well-known binary classification problem in the defense space: decide whether a given record is an attack or not. Using widely cited models such as Random Forest and XGBoost, we present a scalable framework for evaluating a model use case for anomaly detection systems. We also want to drive home the importance of understandability in decision-making models and highlight the utility of TAP’s outputs.

We have a clean dataset generated from an IDS, labeled '0' for No-Attack and '1' for Attack. We will walk through several well-known pre-processing steps, with understandability as the goal of this notebook. Although the data is clean, we still need to check that every feature matters to the model, using the 'Feature Importance' produced with the generated model.

## Dataset

UNSW-NB15, a public and widely studied dataset for modeling decision-making in cybersecurity attacks.

## Method

Using a cleaned dataset from UNSW-NB15, labeled '0' for No-Attack and '1' for Attack, we’ll extract feature importance and model-performance metrics. We will go through different pre-processing methods to highlight understandability. We evaluate model performance in several ways: a) cross-validation to root out over-fitting, b) feature pre-processing and importance, and c) hyper-parameter tuning and model-metric tracking.
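
The three evaluation steps can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: the data is a synthetic stand-in produced by `make_classification`, and the parameter grid, fold count, and F1 scoring choice are placeholders picked for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the pre-processed UNSW-NB15 features (assumption:
# the real notebook would pass its own feature matrix and 0/1 labels here).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)

# Stratified folds keep the attack / no-attack ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# a) Cross-validation: a large gap between folds, or between training and
#    validation scores, is a warning sign of over-fitting.
base_model = RandomForestClassifier(n_estimators=50, random_state=0)
cv_scores = cross_val_score(base_model, X, y, cv=cv, scoring="f1")

# c) Hyper-parameter tuning over a deliberately tiny, illustrative grid.
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(base_model, param_grid, cv=cv, scoring="f1")
search.fit(X, y)

# b) Impurity-based feature importance from the best model found.
importances = search.best_estimator_.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first

print("F1 per fold:", np.round(cv_scores, 3))
print("Best parameters:", search.best_params_)
print("Feature ranking:", ranking)
```

Impurity-based importances are only one possible notion of feature importance; model-agnostic explainers can rank features differently.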

## Output

Model metrics will be captured in the model-tracking feature, and TAP’s UI will be used for feature importance. The goal is to ensure that every feature matters to the model via the 'Feature Importance' that comes out with the generated model, and to track model metrics.

* Logistic Regression
* Random Forest or another tree-based method (currently testing XGBoost as well)
* SHAP and LIME
* Model-metrics tracked: Accuracy, Precision, Recall, F1, True-positive rate, False-positive rate, False-alarm rate
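
The tracked metrics all derive from the four confusion-matrix counts. The sketch below computes them with NumPy only. `binary_metrics` is a hypothetical helper, not part of TAP, and it treats the false-alarm rate as equal to the false-positive rate. That is a common convention but an assumption here, since the notebook does not define the term.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute the tracked metrics for 0/1 labels (1 = Attack)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)

    tp = int(np.sum(y_true & y_pred))    # attacks correctly flagged
    tn = int(np.sum(~y_true & ~y_pred))  # normal traffic correctly passed
    fp = int(np.sum(~y_true & y_pred))   # normal traffic flagged (false alarm)
    fn = int(np.sum(y_true & ~y_pred))   # attacks missed

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero on degenerate inputs.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # identical to the true-positive rate
    fpr = ratio(fp, fp + tn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "true_positive_rate": recall,
        "false_positive_rate": fpr,
        "false_alarm_rate": fpr,  # assumption: FAR taken to be the FPR
    }

# Toy labels, for illustration only.
m = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 0, 1, 0, 1, 0, 0, 1])
print(m)
```

If false-alarm rate is meant in a different sense, only the last dictionary entry needs to change.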


In [6]:
import math, time, random, datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn')
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns
import missingno
import pickle

from tap.trustworthy import mlobject, explain, fairness, robust

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, MaxAbsScaler

from imblearn.over_sampling import SMOTE, RandomOverSampler
from tqdm import tqdm
from minio import Minio
import boto3
import numba

# AIF360 fairness datasets, metrics, and debiasing algorithms
from aif360.datasets import BinaryLabelDataset
from aif360.datasets import AdultDataset, GermanDataset, CompasDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.metrics import ClassificationMetric
from aif360.metrics.utils import compute_boolean_conditioning_vector

from aif360.algorithms.preprocessing.optim_preproc_helpers.data_preproc_functions import load_preproc_data_adult, load_preproc_data_compas, load_preproc_data_german
from aif360.algorithms.inprocessing.adversarial_debiasing import AdversarialDebiasing


from IPython.display import Markdown, display
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

In [7]:
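# Stream the training set from the TAP artifact store (MinIO) into a DataFrame
import os

# Credentials are read from the environment rather than hardcoded in the notebook.
# The variable names MINIO_ACCESS_KEY / MINIO_SECRET_KEY are placeholders (an assumption).
client = Minio(
        "minio-service.kubeflow.svc.cluster.local:9000",
        access_key=os.environ["MINIO_ACCESS_KEY"],
        secret_key=os.environ["MINIO_SECRET_KEY"],
        secure=False)

bucket = "tai-experiments"
path = "/unsw/UNSW_NB15_training-set.csv"
data = client.get_object(bucket_name=bucket, object_name=path)

# Parse the CSV stream, using the first column (id) as the index
df_train = pd.read_csv(data, index_col=0, low_memory=False)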

In [8]:
# Stream the testing set from the same bucket, reusing `client` and `bucket`
# from the previous cell
path = "/unsw/UNSW_NB15_testing-set.csv"
data = client.get_object(bucket_name=bucket, object_name=path)

# Parse the CSV stream, using the first column (id) as the index
df_test = pd.read_csv(data, index_col=0, low_memory=False)

In [9]:
df_test.head()

| id | dur | proto | service | state | spkts | dpkts | sbytes | dbytes | rate | sttl | ... | ct_dst_sport_ltm | ct_dst_src_ltm | is_ftp_login | ct_ftp_cmd | ct_flw_http_mthd | ct_src_ltm | ct_srv_dst | is_sm_ips_ports | attack_cat | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.121478 | tcp | - | FIN | 6 | 4 | 258 | 172 | 74.08749 | 252 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | Normal | 0 |
| 2 | 0.649902 | tcp | - | FIN | 14 | 38 | 734 | 42014 | 78.473372 | 62 | ... | 1 | 2 | 0 | 0 | 0 | 1 | 6 | 0 | Normal | 0 |
| 3 | 1.623129 | tcp | - | FIN | 8 | 16 | 364 | 13186 | 14.170161 | 62 | ... | 1 | 3 | 0 | 0 | 0 | 2 | 6 | 0 | Normal | 0 |
| 4 | 1.681642 | tcp | ftp | FIN | 12 | 12 | 628 | 770 | 13.677108 | 62 | ... | 1 | 3 | 1 | 1 | 0 | 2 | 1 | 0 | Normal | 0 |
| 5 | 0.449454 | tcp | - | FIN | 10 | 6 | 534 | 268 | 33.373826 | 254 | ... | 1 | 40 | 0 | 0 | 0 | 2 | 39 | 0 | Normal | 0 |


In [10]:
df_train.head()

| id | dur | proto | service | state | spkts | dpkts | sbytes | dbytes | rate | sttl | ... | ct_dst_sport_ltm | ct_dst_src_ltm | is_ftp_login | ct_ftp_cmd | ct_flw_http_mthd | ct_src_ltm | ct_srv_dst | is_sm_ips_ports | attack_cat | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.1e-05 | udp | - | INT | 2 | 0 | 496 | 0 | 90909.0902 | 254 | ... | 1 | 2 | 0 | 0 | 0 | 1 | 2 | 0 | Normal | 0 |
| 2 | 8e-06 | udp | - | INT | 2 | 0 | 1762 | 0 | 125000.0003 | 254 | ... | 1 | 2 | 0 | 0 | 0 | 1 | 2 | 0 | Normal | 0 |
| 3 | 5e-06 | udp | - | INT | 2 | 0 | 1068 | 0 | 200000.0051 | 254 | ... | 1 | 3 | 0 | 0 | 0 | 1 | 3 | 0 | Normal | 0 |
| 4 | 6e-06 | udp | - | INT | 2 | 0 | 900 | 0 | 166666.6608 | 254 | ... | 1 | 3 | 0 | 0 | 0 | 2 | 3 | 0 | Normal | 0 |
| 5 | 1e-05 | udp | - | INT | 2 | 0 | 2126 | 0 | 100000.0025 | 254 | ... | 1 | 3 | 0 | 0 | 0 | 2 | 3 | 0 | Normal | 0 |
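
The `head()` outputs show that `proto`, `service`, and `state` are categorical, and that the first rows of the two sets carry different values (`udp`/`INT` in training, `tcp`/`FIN` in testing). One possible pre-processing step, sketched here under assumptions, is to one-hot encode those columns and align the test columns to the training columns. `encode_splits` is a hypothetical helper, the toy frames reuse a few values from the rows shown above with most columns omitted, and dropping `attack_cat` (which would otherwise leak the label) is a choice made for this example rather than something the notebook states.

```python
import pandas as pd

CATEGORICAL = ["proto", "service", "state"]

def encode_splits(train, test, label_col="label", drop_cols=("attack_cat",)):
    """One-hot encode categoricals and align test columns to train columns."""
    X_train = train.drop(columns=[label_col, *drop_cols])
    X_test = test.drop(columns=[label_col, *drop_cols])

    X_train = pd.get_dummies(X_train, columns=CATEGORICAL)
    X_test = pd.get_dummies(X_test, columns=CATEGORICAL)

    # Categories seen only in the test set are dropped, and categories seen
    # only in the training set are added to the test set as all-zero columns.
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
    return X_train, train[label_col], X_test, test[label_col]

# Toy frames built from a handful of the values displayed above.
toy_train = pd.DataFrame({
    "dur": [1.1e-05, 8e-06], "proto": ["udp", "udp"], "service": ["-", "-"],
    "state": ["INT", "INT"], "spkts": [2, 2],
    "attack_cat": ["Normal", "Normal"], "label": [0, 0],
})
toy_test = pd.DataFrame({
    "dur": [0.121478, 1.681642], "proto": ["tcp", "tcp"], "service": ["-", "ftp"],
    "state": ["FIN", "FIN"], "spkts": [6, 12],
    "attack_cat": ["Normal", "Normal"], "label": [0, 0],
})

X_train, y_train, X_test, y_test = encode_splits(toy_train, toy_test)
print(list(X_train.columns))
```

Aligning to the training columns keeps the feature layout identical for any model fitted on the training set, at the cost of ignoring categories that appear only at test time.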
