# Benchmark Testing on the KDD Cup Dataset
This notebook presents the general steps of a classification procedure. Two methods (Random Forest and Neural Network) are applied to the problem of attack identification and detection on an existing dataset, KDDCup'99. The notebook shows how to program with TensorFlow, scikit-learn, NumPy and Matplotlib.

## Environment SetUp
If the environment is not ready for the procedure, install the general toolkits below. If the environment is already set up, skip this step.

In [None]:
#! pip3 install numpy
#! pip3 install pandas
#! pip3 install -U scikit-learn

## General SetUp
First, import all the required libraries into the kernel.

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib as plot
from sklearn.model_selection import train_test_split

Second, load the data into the kernel. pd.read_csv() reads the dataset from the CSV file and returns a DataFrame, which is used in the following steps. Its parameters include the path to the dataset, the column separator (sep) and the columns to load (usecols).

In [2]:
data_path = "../../../Dataset/kddcup99.csv"

dataset = pd.read_csv(data_path, sep=',', usecols=range(0, 42))

dataset.shape

(494020, 42)
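
As a minimal sketch of how `usecols` selects columns by position, the example below reads a small in-memory CSV instead of the real file. The three-row sample and the names `csv_text` and `demo` are hypothetical and only illustrate the call.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the real CSV file.
csv_text = """duration,protocol_type,src_bytes,label
0,tcp,181,normal
0,icmp,1032,smurf
0,tcp,0,neptune
"""

# usecols keeps only the listed column positions: here the first three of four.
demo = pd.read_csv(io.StringIO(csv_text), sep=",", usecols=range(0, 3))

print(demo.shape)            # (number of rows, number of columns kept)
print(list(demo.columns))    # the label column is not loaded
```

Checking `shape` and the column names right after loading is a cheap way to confirm that the expected columns were read.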

Display the dataset (only the first rows are shown here).

In [3]:
dataset

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
0,0,tcp,http,SF,181,5450,0,0,0,0,...,9,1.0,0.0,0.11,0.00,0.00,0.00,0.00,0.0,normal
1,0,tcp,http,SF,239,486,0,0,0,0,...,19,1.0,0.0,0.05,0.00,0.00,0.00,0.00,0.0,normal
2,0,tcp,http,SF,235,1337,0,0,0,0,...,29,1.0,0.0,0.03,0.00,0.00,0.00,0.00,0.0,normal
3,0,tcp,http,SF,219,1337,0,0,0,0,...,39,1.0,0.0,0.03,0.00,0.00,0.00,0.00,0.0,normal
4,0,tcp,http,SF,217,2032,0,0,0,0,...,49,1.0,0.0,0.02,0.00,0.00,0.00,0.00,0.0,normal
5,0,tcp,http,SF,217,2032,0,0,0,0,...,59,1.0,0.0,0.02,0.00,0.00,0.00,0.00,0.0,normal
6,0,tcp,http,SF,212,1940,0,0,0,0,...,69,1.0,0.0,1.00,0.04,0.00,0.00,0.00,0.0,normal
7,0,tcp,http,SF,159,4087,0,0,0,0,...,79,1.0,0.0,0.09,0.04,0.00,0.00,0.00,0.0,normal
8,0,tcp,http,SF,210,151,0,0,0,0,...,89,1.0,0.0,0.12,0.04,0.00,0.00,0.00,0.0,normal
9,0,tcp,http,SF,212,786,0,0,0,1,...,99,1.0,0.0,0.12,0.05,0.00,0.00,0.00,0.0,normal


## Start the pre-training SetUp
Divide the dataset into two parts: the collection of features (input_x) and the labels (input_y). There are 41 features and 5 classes. Sklearn's train_test_split() then assigns 20% of the data to the testing set and the rest to the training set.

In [4]:
input_x = dataset.iloc[:, 0:41]
input_y = dataset.iloc[:, 41]

train_x, test_x, train_y, test_y = train_test_split(input_x, input_y, test_size=0.20)

In [5]:
train_x.shape

(395216, 41)
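
The split above is unseeded, so the rows that land in each set change from run to run. The sketch below shows one possible variation on toy data: `random_state` makes the split repeatable and `stratify` keeps the class proportions equal in both sets. Both arguments and the `toy_*` names are additions for illustration and are not part of the original cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 80 of class 0 and 20 of class 1.
toy_x = np.arange(200).reshape(100, 2)
toy_y = np.array([0] * 80 + [1] * 20)

# random_state fixes the shuffle; stratify preserves the 80/20 class ratio.
toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy_x, toy_y, test_size=0.20, random_state=42, stratify=toy_y
)

print(toy_train_x.shape, toy_test_x.shape)
print(np.bincount(toy_test_y))   # class counts in the test set
```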

Categorize the attack types into two classes: normal and abnormal. Here, we only detect whether network traffic is malicious. The pre-defined dictionary new_class maps every attack name to 'abnormal' and is used for the replacement.

In [6]:
train_y.shape

(395216,)

In [7]:
new_class = {'back':'abnormal', 'buffer_overflow':'abnormal', 'ftp_write':'abnormal', 'guess_passwd':'abnormal', 'imap':'abnormal',
            'ipsweep':'abnormal', 'land':'abnormal', 'loadmodule':'abnormal', 'multihop':'abnormal', 'neptune':'abnormal', 'nmap':'abnormal',
            'perl':'abnormal', 'phf':'abnormal', 'pod':'abnormal', 'portsweep':'abnormal', 'rootkit':'abnormal', 'satan':'abnormal',
            'smurf':'abnormal', 'spy':'abnormal', 'teardrop':'abnormal', 'warezclient':'abnormal', 'warezmaster':'abnormal'}
train_y = train_y.replace(new_class)
test_y = test_y.replace(new_class)
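
The dictionary only covers the attack names it lists, so any label missing from it would pass through unchanged and become an extra class. A rule-based alternative, sketched below on a hypothetical toy Series, treats everything that is not 'normal' as 'abnormal'. This assumes the normal class is spelled exactly `normal`, as in the output shown in this notebook.

```python
import pandas as pd

# Hypothetical labels, including one attack name that is not in any dictionary.
labels = pd.Series(["normal", "smurf", "neptune", "normal", "new_attack"])

# Series.where keeps values where the condition is True and replaces the rest.
binary = labels.where(labels == "normal", "abnormal")

print(binary.value_counts())
```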

In [8]:
train_y

43074     abnormal
315411    abnormal
328306    abnormal
350016    abnormal
288080    abnormal
444286    abnormal
331602    abnormal
268415    abnormal
447612    abnormal
72726     abnormal
447247    abnormal
10166     abnormal
389439    abnormal
340942    abnormal
264355    abnormal
386076    abnormal
221028    abnormal
35232       normal
464080    abnormal
283329    abnormal
399669    abnormal
302855    abnormal
124354    abnormal
345241      normal
206072    abnormal
67408     abnormal
376287    abnormal
138255      normal
142322      normal
452330      normal
            ...   
73402     abnormal
272484    abnormal
321112    abnormal
317368    abnormal
92552     abnormal
155997    abnormal
280684    abnormal
35981       normal
132905    abnormal
479330    abnormal
110593    abnormal
331135    abnormal
20460       normal
225562    abnormal
75578       normal
417812    abnormal
208295    abnormal
287411    abnormal
75947       normal
273083    abnormal
52363     abnormal
120381    ab

## Data Encoding
Encode the categorical features and the labels as integers. This step relies on the scikit-learn preprocessing module.

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing

In [11]:
le_y = preprocessing.LabelEncoder()
le_y.fit(train_y)
train_y = le_y.transform(train_y)
test_y = le_y.transform(test_y)
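
To see which integer each class receives, the fitted encoder can be inspected. The sketch below uses a hypothetical four-element label list; LabelEncoder sorts the class names, so with the two classes used here 'abnormal' becomes 0 and 'normal' becomes 1.

```python
from sklearn import preprocessing

enc = preprocessing.LabelEncoder()
enc.fit(["normal", "abnormal", "abnormal", "normal"])

# classes_ holds the sorted class names; the position is the integer code.
print(list(enc.classes_))

codes = enc.transform(["abnormal", "normal"])
print(codes)

# inverse_transform maps integer codes back to the original names.
print(list(enc.inverse_transform(codes)))
```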

In [12]:
# Encode every text (object-dtype) feature column as integer codes.
for col in train_x.columns:
    if train_x[col].dtype == object:
        le_x = preprocessing.LabelEncoder()
        le_x.fit(train_x[col])
        train_x[col] = le_x.transform(train_x[col])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


In [15]:
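# Encode the text feature columns of the test set.
# Note: a new encoder is fitted on the test set here, so the integer codes
# are not guaranteed to match the codes assigned in the training set.
for col in test_x.columns:
    if test_x[col].dtype == object:
        le_x = preprocessing.LabelEncoder()
        le_x.fit(test_x[col])
        test_x[col] = le_x.transform(test_x[col])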

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
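
Because the two loops fit separate encoders, the same category can receive different codes in the two sets. In the tables displayed in this notebook, for example, rows with src_bytes 1032 appear with flag 9 in train_x and flag 8 in test_x. One way to keep the codes aligned is to fit a single encoder on the training set and reuse it on the test set. The sketch below swaps in scikit-learn's OrdinalEncoder, not the LabelEncoder used above, and runs on hypothetical toy frames. The `unknown_value=-1` choice for categories unseen during fitting is an assumption, and `handle_unknown="use_encoded_value"` requires scikit-learn 0.24 or later.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frames; "RSTO" occurs only in the test set.
toy_train = pd.DataFrame({"flag": ["SF", "S0", "REJ", "SF"],
                          "src_bytes": [181, 0, 0, 239]})
toy_test = pd.DataFrame({"flag": ["SF", "RSTO"],
                         "src_bytes": [1032, 0]})

# Select the non-numeric columns to encode.
cat_cols = list(toy_train.select_dtypes(exclude="number").columns)

# Fit on the training set only, then apply the same mapping to the test set.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(toy_train[cat_cols])
test_codes = encoder.transform(toy_test[cat_cols])

print(encoder.categories_)   # sorted categories learned from the training set
print(train_codes.ravel())
print(test_codes.ravel())    # "SF" keeps its training code; "RSTO" maps to -1
```

Encoding into new arrays also avoids assigning into a slice of the original DataFrame, which is what triggers the pandas warning shown above.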


In [16]:
train_x

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
43074,0,1,45,10,0,0,0,0,0,0,...,74,1,0.01,0.86,0.89,0.00,0.89,1.0,0.00,0.00
315411,0,0,14,9,1032,0,0,0,0,0,...,255,255,1.00,0.00,1.00,0.00,0.00,0.0,0.00,0.00
328306,0,0,14,9,1032,0,0,0,0,0,...,255,255,1.00,0.00,1.00,0.00,0.00,0.0,0.00,0.00
350016,0,1,45,5,0,0,0,0,0,0,...,255,10,0.04,0.07,0.00,0.00,1.00,1.0,0.00,0.00
288080,0,0,14,9,1032,0,0,0,0,0,...,255,255,1.00,0.00,1.00,0.00,0.00,0.0,0.00,0.00
444286,0,0,14,9,520,0,0,0,0,0,...,255,255,1.00,0.00,1.00,0.00,0.00,0.0,0.00,0.00
331602,0,0,14,9,1032,0,0,0,0,0,...,255,255,1.00,0.00,1.00,0.00,0.00,0.0,0.00,0.00
268415,0,0,14,9,1032,0,0,0,0,0,...,255,255,1.00,0.00,1.00,0.00,0.00,0.0,0.00,0.00
447612,0,0,14,9,520,0,0,0,0,0,...,255,255,1.00,0.00,1.00,0.00,0.00,0.0,0.00,0.00
72726,0,1,45,5,0,0,0,0,0,0,...,255,14,0.05,0.07,0.00,0.00,1.00,1.0,0.00,0.00


In [17]:
test_x

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
109809,0,1,44,4,0,0,0,0,0,0,...,255,6,0.02,0.09,0.00,0.00,1.00,1.00,0.00,0.00
186518,0,0,14,8,1032,0,0,0,0,0,...,255,255,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00
228517,0,0,14,8,1032,0,0,0,0,0,...,255,255,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00
36874,0,1,22,8,308,372,0,0,0,0,...,17,255,1.00,0.00,0.06,0.04,0.00,0.00,0.00,0.00
410508,0,0,14,8,520,0,0,0,0,0,...,255,255,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00
241655,0,0,14,8,1032,0,0,0,0,0,...,255,255,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00
450928,0,1,19,8,1010,0,0,0,0,0,...,41,3,0.07,0.68,0.07,0.00,0.00,0.00,0.00,0.00
280446,0,0,14,8,1032,0,0,0,0,0,...,255,255,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00
102300,0,0,14,8,1032,0,0,0,0,0,...,255,255,1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00
392355,0,1,44,4,0,0,0,0,0,0,...,255,20,0.08,0.05,0.00,0.00,1.00,1.00,0.00,0.00


In [18]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(train_x, train_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [30]:
prid = clf.predict(test_x)

In [31]:
print(clf.feature_importances_)

[8.14069930e-03 5.57783079e-02 3.37311292e-02 2.64779053e-02
 8.00037363e-02 1.14984444e-01 5.54817345e-06 3.43503093e-03
 1.30645485e-05 7.96096883e-03 1.04025340e-04 6.24077343e-02
 5.50565457e-03 5.08289516e-05 4.52703599e-06 3.44733717e-05
 4.17607452e-05 1.34269768e-05 1.27793815e-05 0.00000000e+00
 0.00000000e+00 6.95882220e-04 1.86201180e-01 4.72047767e-02
 8.06653232e-03 1.15849254e-02 3.66017115e-03 2.40186500e-03
 2.43417502e-02 2.30726825e-02 1.27678971e-02 8.32021346e-02
 2.61745056e-02 2.05284498e-02 2.55257688e-02 6.13595667e-02
 3.31967142e-02 2.15987992e-02 2.81561425e-03 3.27546269e-03
 3.61927588e-03]
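
The raw array is hard to read without the feature names. A minimal sketch of pairing each importance with its column name and sorting is shown below. It trains on synthetic data from make_classification, and the column names `f0` to `f4` and the `toy_*` variables are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with five features.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
toy_features = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3", "f4"])

toy_clf = RandomForestClassifier(n_estimators=50, random_state=0)
toy_clf.fit(toy_features, y)

# Index the importances by column name and list the largest first.
ranked = pd.Series(toy_clf.feature_importances_,
                   index=toy_features.columns).sort_values(ascending=False)
print(ranked)
```

With the notebook's own objects, the same idea would use `clf.feature_importances_` and `train_x.columns`.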


In [32]:
print("Accuracy:", clf.score(test_x, test_y))

Accuracy: 0.9997570948544593


In [33]:
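from sklearn.metrics import confusion_matrix
# The arguments are passed as (predicted, true), so rows are predicted labels
# and columns are true labels.
confusion_matrix(prid, test_y)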

array([[79237,     4],
       [   20, 19543]])
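
scikit-learn documents the signature as confusion_matrix(y_true, y_pred), where rows follow the first argument. The sketch below uses small hypothetical arrays to show that swapping the arguments transposes the matrix, and adds per-class precision and recall, which are more informative than accuracy when the classes are imbalanced. Class 0 is treated as the positive class here, on the assumption that it stands for 'abnormal'.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (0 = abnormal, 1 = normal).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_pred)        # rows: true, columns: predicted
swapped = confusion_matrix(y_pred, y_true)   # rows: predicted, columns: true

print(cm)
print(swapped)                               # the transpose of cm

# Precision and recall for class 0.
precision = precision_score(y_true, y_pred, pos_label=0)
recall = recall_score(y_true, y_pred, pos_label=0)
print(precision, recall)
```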