# Benchmark Testing on the KDD Cup Dataset
This notebook presents the general steps of a classification procedure. Two methods (Random Forest and Neural Network) are applied to the problem of attack identification and detection on an existing dataset, KDDCup'99. The notebook shows how to program with TensorFlow, scikit-learn, NumPy and Matplotlib.

## Environment SetUp
If the environment is not ready for the procedure, install the required toolkits first. If the environment is already set up, skip this step.

In [None]:
#! pip3 install numpy
#! pip3 install pandas
#! pip3 install -U scikit-learn

## General SetUp
First, import the required libraries into the kernel.

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

Second, load the data into the kernel. pd.read_csv() reads the dataset from the CSV file and returns a DataFrame, which is used in the following steps. Its arguments include the path to the dataset, the separator, and the columns to read.

In [2]:
data_path = "../../../Dataset/kddcup99.csv"

dataset = pd.read_csv(data_path, sep=',', usecols=range(0, 42))

dataset.shape

(494020, 42)
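
To make the effect of sep and usecols concrete, here is a minimal sketch on a tiny in-memory CSV. The rows and the extra column are made up for illustration and are not taken from the KDD Cup file.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the CSV file (made-up rows, not KDD Cup data).
csv_text = (
    "duration,protocol_type,src_bytes,label,extra\n"
    "0,tcp,181,normal,x\n"
    "0,udp,105,smurf,y\n"
)

# usecols=range(0, 4) keeps the first four columns and drops "extra".
demo = pd.read_csv(io.StringIO(csv_text), sep=",", usecols=range(0, 4))

print(demo.shape)
print(list(demo.columns))
```

The same idea applies to the real file: range(0, 42) keeps the first 42 columns, which matches the shape reported above.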

Show the whole dataset.

In [3]:
dataset

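| Unnamed: 0 | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tcp | http | SF | 181 | 5450 | 0 | 0 | 0 | 0 | ... | 9 | 1.0 | 0.0 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | normal |
| 1 | 0 | tcp | http | SF | 239 | 486 | 0 | 0 | 0 | 0 | ... | 19 | 1.0 | 0.0 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | normal |
| 2 | 0 | tcp | http | SF | 235 | 1337 | 0 | 0 | 0 | 0 | ... | 29 | 1.0 | 0.0 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | normal |
| 3 | 0 | tcp | http | SF | 219 | 1337 | 0 | 0 | 0 | 0 | ... | 39 | 1.0 | 0.0 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | normal |
| 4 | 0 | tcp | http | SF | 217 | 2032 | 0 | 0 | 0 | 0 | ... | 49 | 1.0 | 0.0 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | normal |
| 5 | 0 | tcp | http | SF | 217 | 2032 | 0 | 0 | 0 | 0 | ... | 59 | 1.0 | 0.0 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | normal |
| 6 | 0 | tcp | http | SF | 212 | 1940 | 0 | 0 | 0 | 0 | ... | 69 | 1.0 | 0.0 | 1.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.0 | normal |
| 7 | 0 | tcp | http | SF | 159 | 4087 | 0 | 0 | 0 | 0 | ... | 79 | 1.0 | 0.0 | 0.09 | 0.04 | 0.00 | 0.00 | 0.00 | 0.0 | normal |
| 8 | 0 | tcp | http | SF | 210 | 151 | 0 | 0 | 0 | 0 | ... | 89 | 1.0 | 0.0 | 0.12 | 0.04 | 0.00 | 0.00 | 0.00 | 0.0 | normal |
| 9 | 0 | tcp | http | SF | 212 | 786 | 0 | 0 | 0 | 1 | ... | 99 | 1.0 | 0.0 | 0.12 | 0.05 | 0.00 | 0.00 | 0.00 | 0.0 | normal |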


## Start the pre-training SetUp
Divide the dataset into two parts: the collection of features (input_x) and the labels (input_y). There are 41 features and 5 classes. Then use scikit-learn's train_test_split() to assign 20% of the data to the testing set and the rest to the training set.

In [4]:
input_x = dataset.iloc[:, 0:41]
input_y = dataset.iloc[:, 41]

train_x, test_x, train_y, test_y = train_test_split(input_x, input_y, test_size=0.20)

In [5]:
train_x.shape

(395216, 41)
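
The split above is drawn at random on every run. As a hedged sketch (not part of the original procedure), passing random_state makes the split reproducible and stratify keeps the class proportions equal in both sets. The data below is synthetic and all toy_ names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 3 features, 80/20 class imbalance.
rng = np.random.default_rng(0)
toy_x = pd.DataFrame(rng.random((100, 3)), columns=["f0", "f1", "f2"])
toy_y = pd.Series(["abnormal"] * 80 + ["normal"] * 20)

# random_state fixes the shuffle; stratify preserves the 80/20 label ratio.
toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy_x, toy_y, test_size=0.20, random_state=42, stratify=toy_y
)

print(toy_train_x.shape, toy_test_x.shape)
print(toy_test_y.value_counts().to_dict())
```

Whether stratification is worthwhile here depends on how rare the minority class is in the real data; with a large dataset a plain random split is usually close to proportional anyway.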

Categorize the attack types into two classes: normal and abnormal. Here, we only detect whether the network traffic is malicious. The dictionary new_class, defined below, maps every attack name to abnormal and is used for the replacement.

In [6]:
train_y.shape

(395216,)

In [7]:
new_class = {'back':'abnormal', 'buffer_overflow':'abnormal', 'ftp_write':'abnormal', 'guess_passwd':'abnormal', 'imap':'abnormal',
            'ipsweep':'abnormal', 'land':'abnormal', 'loadmodule':'abnormal', 'multihop':'abnormal', 'neptune':'abnormal', 'nmap':'abnormal',
            'perl':'abnormal', 'phf':'abnormal', 'pod':'abnormal', 'portsweep':'abnormal', 'rootkit':'abnormal', 'satan':'abnormal',
            'smurf':'abnormal', 'spy':'abnormal', 'teardrop':'abnormal', 'warezclient':'abnormal', 'warezmaster':'abnormal'}
train_y = train_y.replace(new_class)
test_y = test_y.replace(new_class)
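
The dictionary lists every attack name explicitly, so a label that is missing from it would pass through unchanged. A possible alternative, shown here only as a sketch on made-up labels, is to keep normal and map everything else to abnormal. The names toy_labels and binary_labels are hypothetical.

```python
import pandas as pd

# Made-up labels; "unlisted_attack" stands for a name absent from a mapping dictionary.
toy_labels = pd.Series(["normal", "smurf", "neptune", "normal", "unlisted_attack"])

# Series.where keeps values where the condition holds and replaces the rest.
binary_labels = toy_labels.where(toy_labels == "normal", "abnormal")

print(binary_labels.tolist())
# ['normal', 'abnormal', 'abnormal', 'normal', 'abnormal']
```

The explicit dictionary has the advantage of documenting which attack types are expected, so the choice between the two is a matter of preference.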

In [8]:
train_y

343485      normal
178930    abnormal
202361    abnormal
188435    abnormal
128328    abnormal
352394    abnormal
149448      normal
376162    abnormal
141678      normal
11266     abnormal
187739    abnormal
10444     abnormal
297784    abnormal
141653      normal
460885    abnormal
338177    abnormal
281337    abnormal
115941    abnormal
166678    abnormal
265125    abnormal
247971    abnormal
35921       normal
111175    abnormal
90607       normal
135269    abnormal
83498       normal
362617    abnormal
207264    abnormal
427751    abnormal
196287    abnormal
            ...   
421725    abnormal
223556    abnormal
387032    abnormal
94753     abnormal
283623    abnormal
455329      normal
308906    abnormal
486486      normal
426130    abnormal
300143    abnormal
182772    abnormal
377942    abnormal
269524    abnormal
128969    abnormal
341922    abnormal
121544    abnormal
54511     abnormal
338356    abnormal
262267    abnormal
119401    abnormal
24361       normal
216602    abnormal

## Data Encoding
Encode the categorical features and the labels as integers. This step relies on the preprocessing module of the scikit-learn library.

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing

In [None]:
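le_y = preprocessing.LabelEncoder()
# Fit on the whole label Series; train_y[0] would look up index label 0 instead.
le_y.fit(train_y)
train_y = le_y.transform(train_y)
test_y = le_y.transform(test_y)  # encode the test labels with the same mapping
train_y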
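
As a minimal sketch of what LabelEncoder does with two-class labels, the example below uses a made-up four-row Series (the index values are arbitrary and sample_y, toy_le are hypothetical names). The classes are sorted alphabetically, so abnormal maps to 0 and normal maps to 1.

```python
import pandas as pd
from sklearn import preprocessing

# Made-up labels with a non-contiguous index, similar to a shuffled training split.
sample_y = pd.Series(
    ["normal", "abnormal", "abnormal", "normal"],
    index=[343485, 178930, 202361, 149448],
)

toy_le = preprocessing.LabelEncoder()
encoded = toy_le.fit_transform(sample_y)  # fit and transform the whole Series

print(list(toy_le.classes_))                      # ['abnormal', 'normal']
print(encoded.tolist())                           # [1, 0, 0, 1]
print(toy_le.inverse_transform([0, 1]).tolist())  # ['abnormal', 'normal']
```

inverse_transform is handy later for turning predicted integers back into readable class names.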

In [None]:
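# Work on copies so the column assignments do not trigger SettingWithCopyWarning.
train_x, test_x = train_x.copy(), test_x.copy()
for col in train_x.columns:
    # Encode every non-numeric column (protocol_type, service, flag) as integers.
    if not pd.api.types.is_numeric_dtype(train_x[col]):
        le_x = preprocessing.LabelEncoder()
        le_x.fit(input_x[col])  # fit on all rows so test-only categories are known
        train_x[col] = le_x.transform(train_x[col])
        test_x[col] = le_x.transform(test_x[col])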
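
A LabelEncoder raises an error when asked to transform a category it did not see during fitting, which can happen with rare values in a column such as service. One possible alternative, not used in this notebook, is OrdinalEncoder with an explicit code for unknown categories. This sketch assumes scikit-learn 0.24 or newer; the toy frames and the value -1 are illustrative choices.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical columns; "icmp" and "ftp" appear only in the test frame.
toy_train = pd.DataFrame(
    {"protocol_type": ["tcp", "udp", "tcp"], "service": ["http", "smtp", "http"]}
)
toy_test = pd.DataFrame(
    {"protocol_type": ["icmp", "tcp"], "service": ["http", "ftp"]}
)

# Categories unseen during fit are encoded as -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(toy_train)
encoded_test = enc.transform(toy_test)

print(encoded_test.tolist())
# [[-1.0, 0.0], [0.0, -1.0]]
```

OrdinalEncoder also encodes several columns in one call, which removes the need for a loop over columns.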

In [None]:
train_x

In [None]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(train_x, train_y)
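
Once the classifier is fitted, it can be evaluated on the held-out test set. The sketch below shows one way to do that with accuracy_score and confusion_matrix. It runs on synthetic data from make_classification rather than on KDDCup'99, so the toy_ and tr_/te_ names are hypothetical and the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class problem standing in for the encoded features and labels.
toy_x, toy_y = make_classification(n_samples=500, n_features=10, random_state=0)
tr_x, te_x, tr_y, te_y = train_test_split(
    toy_x, toy_y, test_size=0.20, random_state=0
)

toy_clf = RandomForestClassifier(n_estimators=100, random_state=0)
toy_clf.fit(tr_x, tr_y)

# Predict on the held-out rows and compare against the true labels.
pred_y = toy_clf.predict(te_x)
acc = accuracy_score(te_y, pred_y)
cm = confusion_matrix(te_y, pred_y)  # rows: true class, columns: predicted class

print(acc)
print(cm)
```

For an attack detector the confusion matrix is usually more informative than accuracy alone, because it separates missed attacks from false alarms.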