# Benchmark Testing on the KDD Cup Dataset
In this notebook, the general steps of a classification procedure are presented. Two methods (Random Forest and Neural Network) are applied to the problem of attack identification and detection on an existing dataset, KDDCup'99. The notebook shows how to program with TensorFlow, scikit-learn, NumPy and Matplotlib.

## Data Engineering

### Environment SetUp
If the environment is not yet set up, install the general toolkits below by uncommenting the lines. If the environment is ready, skip this step.

In [None]:
#! pip3 install numpy
#! pip3 install pandas
#! pip3 install -U scikit-learn
#! pip3 install graphviz
#! pip3 install pydotplus

### General SetUp
First of all, we import all the needed libraries to the kernel.

In [None]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # pyplot is the plotting interface; plt is its conventional alias
from IPython.display import Image
import pydotplus
from sklearn.model_selection import train_test_split

Second, load the data into the kernel. pd.read_csv() reads the dataset from the csv file and returns a dataframe, which is used in all the following steps. The parameters passed here are the path to the dataset, the separator, and the columns to keep (usecols): the first 42 columns, i.e. 41 features plus the label.

In [None]:
data_path = "../../../Dataset/kddcup99.csv"

dataset = pd.read_csv(data_path, sep=',', usecols=range(0, 42))

print("Dataset Shape:", dataset.shape)
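The effect of usecols can be checked on a tiny in-memory file before loading the full dataset. This is a minimal sketch: the three rows and the column names below are made up for illustration and are not taken from the real csv file.

```python
import io
import pandas as pd

# Hypothetical three-row sample in a KDD-like layout (made-up values, not real data).
sample_csv = io.StringIO(
    "duration,protocol_type,service,label,extra\n"
    "0,tcp,http,normal,x\n"
    "0,icmp,ecr_i,smurf,y\n"
    "2,tcp,private,neptune,z\n"
)

# usecols=range(0, 4) keeps the first four columns and drops the trailing one.
sample = pd.read_csv(sample_csv, sep=",", usecols=range(0, 4))

print("Sample shape:", sample.shape)
print(sample.dtypes)                    # text columns are the ones that need encoding later
print(sample["label"].value_counts())   # class distribution of the label column
```

Checking dtypes and the label distribution right after loading shows early which columns are categorical and how imbalanced the classes are.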

Display the dataset. For a large dataframe, Jupyter shows a truncated preview.

In [None]:
dataset

### Start the pre-training SetUp
Divide the dataset into two parts: the collection of features (input_x) and the labels (input_y). There are 41 features; the label column names the individual traffic types, which are commonly grouped into 5 classes (normal plus four attack categories). Sklearn's train_test_split() then assigns 20% of the data to the testing set and the rest to the training set.

In [None]:
input_x = dataset.iloc[:, 0:41]
input_y = dataset.iloc[:, 41]

train_x, test_x, train_y, test_y = train_test_split(input_x, input_y, test_size=0.20)
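Because the split is random, each run produces different sets, and a rare class can end up under-represented in the testing set. A minimal sketch of a reproducible, stratified split follows; the synthetic data and the demo_ names are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 numeric features, imbalanced binary label.
rng = np.random.default_rng(0)
demo_x = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f0", "f1", "f2"])
demo_y = pd.Series(["normal"] * 80 + ["abnormal"] * 20)

# random_state fixes the shuffle; stratify keeps the class ratio in both splits.
demo_train_x, demo_test_x, demo_train_y, demo_test_y = train_test_split(
    demo_x, demo_y, test_size=0.20, random_state=42, stratify=demo_y
)

print(demo_train_x.shape, demo_test_x.shape)
print(demo_test_y.value_counts())
```

With stratification the 80/20 class ratio of the full data is preserved in the 20-row testing set.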

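Collapse the attack types into two classes: normal and abnormal. Here we only detect whether network traffic is malicious, not which attack it is. The dictionary new_class maps every attack name to 'abnormal'; labels that are not listed in it, such as normal traffic, are left unchanged by replace().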

In [None]:
new_class = {'back':'abnormal', 'buffer_overflow':'abnormal', 'ftp_write':'abnormal', 'guess_passwd':'abnormal', 'imap':'abnormal',
            'ipsweep':'abnormal', 'land':'abnormal', 'loadmodule':'abnormal', 'multihop':'abnormal', 'neptune':'abnormal', 'nmap':'abnormal',
            'perl':'abnormal', 'phf':'abnormal', 'pod':'abnormal', 'portsweep':'abnormal', 'rootkit':'abnormal', 'satan':'abnormal',
            'smurf':'abnormal', 'spy':'abnormal', 'teardrop':'abnormal', 'warezclient':'abnormal', 'warezmaster':'abnormal'}
train_y = train_y.replace(new_class)
test_y = test_y.replace(new_class)
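An alternative that does not require listing every attack name is to keep the benign label and rename everything else. This is only a sketch, and it assumes the benign label is spelled exactly "normal"; check the spelling in the actual label column before using it.

```python
import pandas as pd

# Toy label column for illustration.
demo_labels = pd.Series(["normal", "smurf", "neptune", "normal", "back"])

# Series.where keeps values where the condition holds and substitutes the rest.
# Assumption: the benign class is spelled exactly "normal".
binary_labels = demo_labels.where(demo_labels == "normal", "abnormal")

print(binary_labels.tolist())
```

Unlike an explicit dictionary, this variant also relabels attack names that were not anticipated, which may or may not be what is wanted.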

### Data Encoding
Transform the features and labels into numeric codes. For this we use the preprocessing module of the scikit-learn library.

In [None]:
from sklearn import preprocessing

In [None]:
train_y

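Encode the labels of the training and testing sets with sklearn.preprocessing.LabelEncoder() so that each class is represented by an integer. The encoder is fitted on the training labels only and then applied to both sets, which keeps the mapping identical.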

In [None]:
le_y = preprocessing.LabelEncoder()
le_y.fit(train_y)
train_y = le_y.transform(train_y)
test_y = le_y.transform(test_y)
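To see which integer stands for which class, the fitted encoder can be inspected. A small self-contained sketch with toy labels (demo_le is a name introduced here for illustration):

```python
from sklearn import preprocessing

demo_le = preprocessing.LabelEncoder()
demo_le.fit(["normal", "abnormal", "normal"])

# classes_ holds the sorted unique labels; the position of a label is its code.
print(demo_le.classes_)

codes = demo_le.transform(["abnormal", "normal"])
print(codes)

# inverse_transform maps the codes back to the original strings.
print(demo_le.inverse_transform(codes))
```

Since the classes are sorted alphabetically, "abnormal" receives code 0 and "normal" receives code 1, which matters when reading the confusion matrix later.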

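The features need encoding as well as the labels. Go through the features column by column and transform every text-valued column into integer codes, using the same mapping for the training and testing sets.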

In [None]:
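# Work on copies so the assignments below do not modify a slice of the dataset.
train_x = train_x.copy()
test_x = test_x.copy()

encoders = {}  # one fitted encoder per categorical column
for col in train_x.columns:
    if train_x[col].dtype == object:
        le_x = preprocessing.LabelEncoder()
        # Fit on the values of both splits so that they share one mapping
        # and no category is unknown at transform time.
        le_x.fit(pd.concat([train_x[col], test_x[col]]))
        train_x[col] = le_x.transform(train_x[col])
        encoders[col] = le_x

In [None]:
for col in test_x.columns:
    if test_x[col].dtype == object:
        # Reuse the encoder from above; refitting on the test set would assign different codes.
        test_x[col] = encoders[col].transform(test_x[col])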
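Integer encoding of features only works if both splits use one mapping, and a category that appears only in the testing set needs explicit handling. One possible way to get both is scikit-learn's OrdinalEncoder, sketched below on toy frames. It assumes scikit-learn 0.24 or later for the handle_unknown option; the frames and the -1 placeholder are illustrative choices, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames; "icmp" occurs only in the test frame.
demo_train = pd.DataFrame({"protocol_type": ["tcp", "udp", "tcp"], "duration": [0, 3, 1]})
demo_test = pd.DataFrame({"protocol_type": ["udp", "icmp"], "duration": [2, 0]})

cat_cols = ["protocol_type"]  # categorical columns, listed explicitly for this sketch

# Categories unseen during fit are encoded as -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = enc.fit_transform(demo_train[cat_cols])  # fit on training data only
test_codes = enc.transform(demo_test[cat_cols])        # same mapping for test data

demo_train["protocol_type"] = train_codes[:, 0]
demo_test["protocol_type"] = test_codes[:, 0]

print(demo_train)
print(demo_test)
```

Fitting on the training frame alone mirrors deployment, where new categories can show up at prediction time.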

## 1. Decision Tree

In [None]:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
dt = clf.fit(train_x, train_y)

In [None]:
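import graphviz
# Export the fitted tree in DOT format and render it with pydotplus.
tree_data = tree.export_graphviz(clf, out_file=None)
graph = pydotplus.graph_from_dot_data(tree_data)
graph.write_png("tree.png")  # save the rendering to disk
Image(graph.create_png())  # last expression in the cell, so Jupyter displays the image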
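When Graphviz is not installed, or the tree is too large to read as an image, the learned rules can also be printed as text. A minimal sketch on synthetic data follows; the depth limit and the feature names f0 to f3 are assumptions chosen to keep the output short.

```python
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic stand-in data with four numeric features.
demo_X, demo_y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# A shallow tree keeps the printed rules readable.
demo_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(demo_X, demo_y)

# export_text returns the decision rules as an indented string.
rules = tree.export_text(demo_tree, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```

Limiting max_depth is also a simple guard against overfitting, since an unrestricted tree can grow until every training sample is classified correctly.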

## 2. Random Forest Classifier
Start training with the random forest classifier, here an ensemble of 100 trees.

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(train_x, train_y)

In [None]:
print(clf.feature_importances_)
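The raw importance array is hard to read without the column names. A small sketch that pairs each value with its feature name and sorts them, using synthetic data and made-up feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the feature names are hypothetical.
demo_X, demo_y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
names = ["f0", "f1", "f2", "f3", "f4"]

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(demo_X, demo_y)

# Pair each importance with its feature name and list the largest first.
importances = pd.Series(forest.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the notebook's data, the index would be train_x.columns instead of the made-up names.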

## 3. Support Vector Machine (SVM) Classifier

In [None]:
from sklearn.svm import LinearSVC
clf = LinearSVC(random_state=0, tol=1e-5)
clf.fit(train_x, train_y)  
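Linear SVMs are sensitive to feature scale, and features with very different ranges can slow or prevent convergence. A hedged sketch of one common remedy, standardizing inside a pipeline, is shown below on synthetic data; the inflated first feature is an artificial stand-in for a large-valued column.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data; one feature is inflated to mimic mixed scales.
demo_X, demo_y = make_classification(n_samples=300, n_features=6, random_state=0)
demo_X[:, 0] *= 1000.0

# The scaler is fitted on whatever data is passed to fit() and reapplied on predict().
svm_pipeline = make_pipeline(
    StandardScaler(),
    LinearSVC(random_state=0, tol=1e-5),
)
svm_pipeline.fit(demo_X, demo_y)

train_accuracy = svm_pipeline.score(demo_X, demo_y)
print("Training accuracy:", train_accuracy)
```

Wrapping the scaler and the classifier in one pipeline ensures the testing set is scaled with statistics learned from the training set only.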

## Result evaluation

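Apply the trained model to the testing dataset, and print the accuracy and confusion matrix. Note that clf refers to the classifier that was trained most recently, so rerun these cells after training each model to evaluate it.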

In [None]:
prid = clf.predict(test_x)

print("Accuracy:", clf.score(test_x, test_y))

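Print the confusion matrix to read off the FP, FN, TP and TN counts. confusion_matrix() expects the true labels first and the predictions second; rows then correspond to the true classes and columns to the predicted classes.

In [None]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_y, prid))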
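Accuracy alone can be misleading on imbalanced traffic, so it helps to derive precision and recall from the confusion matrix. A small worked sketch with hand-written toy labels follows. For a binary problem, ravel() returns the counts in the order TN, FP, FN, TP with label 1 treated as the positive class; which class is label 1 depends on the label encoding.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy true labels and predictions, written by hand for illustration.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# True labels first, predictions second.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(cm)
print("precision:", precision, "recall:", recall)
print(classification_report(y_true, y_pred))
```

classification_report prints the same per-class precision and recall together with the F1 score, so the manual computation mainly serves as a check on how the matrix is laid out.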