# Benchmark Testing on the KDD Cup Dataset
This notebook presents the general steps of a classification procedure. Two methods (Random Forest and Neural Network) are applied to the problem of attack identification and detection on an existing dataset, KDDCup'99. The notebook shows how to program with TensorFlow, scikit-learn, NumPy and Matplotlib.

## Environment SetUp
If the environment is not ready for the procedure, install the required toolkits. If the environment is ready, skip this step.

In [None]:
#! pip3 install numpy
#! pip3 install pandas
#! pip3 install -U scikit-learn
#! pip3 install tensorflow
#! pip3 install matplotlib

## General SetUp
First, import the required libraries into the kernel.

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plot  # the plotting functions live in the pyplot submodule
from sklearn.model_selection import train_test_split

Second, load the data into the kernel. pd.read_csv() reads the dataset from the CSV file and returns a DataFrame, which is used in the following steps. Its parameters include the path to the dataset, the separator, whether the file has a header row, and the columns to keep.

In [2]:
data_path = "../../../Dataset/kddcup.data.txt"

dataset = pd.read_csv(data_path, sep=',', header=None, usecols=range(0, 42))

dataset.shape


(4898431, 42)
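
To make the read_csv() parameters concrete, here is a minimal sketch that parses a few records from an in-memory string instead of the real file. The three records and the names sample_csv and sample are made up for illustration and only mimic the header-less, comma-separated layout.

```python
import io
import pandas as pd

# Three made-up records in the same comma-separated, header-less layout
sample_csv = io.StringIO(
    "0,tcp,http,SF,215,normal.\n"
    "0,icmp,ecr_i,SF,1032,smurf.\n"
    "0,tcp,private,S0,0,neptune.\n"
)

# header=None makes pandas label the columns 0..n-1;
# usecols keeps only the listed column positions
sample = pd.read_csv(sample_csv, sep=",", header=None, usecols=range(0, 6))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # integer columns are parsed as int64; text columns keep their string values
```

Because the file has no header row, the columns are addressed by position, which is why the later cells select them with iloc.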

Display the dataset. The notebook renders only a truncated preview, with the middle columns elided.

In [3]:
dataset

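| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tcp | http | SF | 215 | 45076 | 0 | 0 | 0 | 0 | ... | 0 | 0.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 1 | 0 | tcp | http | SF | 162 | 4528 | 0 | 0 | 0 | 0 | ... | 1 | 1.0 | 0.0 | 1.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 2 | 0 | tcp | http | SF | 236 | 1228 | 0 | 0 | 0 | 0 | ... | 2 | 1.0 | 0.0 | 0.50 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 3 | 0 | tcp | http | SF | 233 | 2032 | 0 | 0 | 0 | 0 | ... | 3 | 1.0 | 0.0 | 0.33 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 4 | 0 | tcp | http | SF | 239 | 486 | 0 | 0 | 0 | 0 | ... | 4 | 1.0 | 0.0 | 0.25 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 5 | 0 | tcp | http | SF | 238 | 1282 | 0 | 0 | 0 | 0 | ... | 5 | 1.0 | 0.0 | 0.20 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 6 | 0 | tcp | http | SF | 235 | 1337 | 0 | 0 | 0 | 0 | ... | 6 | 1.0 | 0.0 | 0.17 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 7 | 0 | tcp | http | SF | 234 | 1364 | 0 | 0 | 0 | 0 | ... | 7 | 1.0 | 0.0 | 0.14 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 8 | 0 | tcp | http | SF | 239 | 1295 | 0 | 0 | 0 | 0 | ... | 8 | 1.0 | 0.0 | 0.12 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 9 | 0 | tcp | http | SF | 181 | 5450 | 0 | 0 | 0 | 0 | ... | 9 | 1.0 | 0.0 | 0.11 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |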


## Pre-training SetUp
Divide the dataset into two parts: the collection of features (input_x) and the labels (input_y). There are 41 features; the label column holds the traffic type, which is commonly grouped into 5 classes (normal, DoS, Probe, R2L and U2R). Use scikit-learn's train_test_split() to assign 20% of the data to the testing set and the rest to the training set.

In [4]:
input_x = dataset.iloc[:, 0:41]
input_y = dataset.iloc[:, 41]

train_x, test_x, train_y, test_y = train_test_split(input_x, input_y, test_size=0.20, random_state=0)

In [5]:
train_x.shape

(3918744, 41)
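
The 3,918,744 training rows are the 80% share of the 4,898,431 records; the remaining 979,687 rows form the testing set. The same split can be sketched on a small array where the sizes are easy to check by hand. The names toy_x, toy_y and the tr_/te_ prefixes are hypothetical and used only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each
toy_x = np.arange(20).reshape(10, 2)
toy_y = np.array(["normal"] * 4 + ["abnormal"] * 6)

# test_size=0.20 holds out 20% of the rows;
# random_state fixes the shuffle so the split is repeatable
tr_x, te_x, tr_y, te_y = train_test_split(
    toy_x, toy_y, test_size=0.20, random_state=0
)

print(tr_x.shape, te_x.shape)  # 8 training rows and 2 testing rows
```

Features and labels are shuffled together, so each row of tr_x still lines up with its entry in tr_y.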

Group the traffic types into two classes: normal and abnormal. Here, we only detect whether network traffic is malicious, not which attack it is. The pre-defined dictionary new_class maps every original label to its new class and is passed to replace().

In [6]:
new_class = {'back.':'abnormal', 'buffer_overflow.':'abnormal', 'ftp_write.':'abnormal', 'guess_passwd.':'abnormal', 'imap.':'abnormal',
            'ipsweep.':'abnormal', 'land.':'abnormal', 'loadmodule.':'abnormal', 'multihop.':'abnormal', 'neptune.':'abnormal', 'nmap.':'abnormal',
            'perl.':'abnormal', 'phf.':'abnormal', 'pod.':'abnormal', 'portsweep.':'abnormal', 'rootkit.':'abnormal', 'satan.':'abnormal',
            'smurf.':'abnormal', 'spy.':'abnormal', 'teardrop.':'abnormal', 'warezclient.':'abnormal', 'warezmaster.':'abnormal', '0.00':'abnormal',
            'normal.':'normal'}
train_y = train_y.replace(new_class)
test_y = test_y.replace(new_class)

In [7]:
train_y

1819665    abnormal
1639904    abnormal
635867     abnormal
4477014      normal
96271      abnormal
3525534    abnormal
1458482      normal
1257318    abnormal
17165        normal
2646177    abnormal
1115646    abnormal
1710963    abnormal
602093     abnormal
2241760    abnormal
878371       normal
4055010    abnormal
3317391    abnormal
4382267    abnormal
2971990    abnormal
2611755    abnormal
2228309    abnormal
1373807      normal
3478142    abnormal
1879194    abnormal
2995142    abnormal
3556672    abnormal
1284620    abnormal
2034844    abnormal
259004       normal
2002239    abnormal
             ...   
3329289    abnormal
4668302    abnormal
4064029    abnormal
2327283    abnormal
556209     abnormal
1419224      normal
4087907    abnormal
441170     abnormal
4065728    abnormal
3743151    abnormal
3491838    abnormal
532165     abnormal
2527987    abnormal
4527056      normal
2951283    abnormal
3107661    abnormal
606745     abnormal
887633       normal
3046886    abnormal
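
As an alternative to listing every attack name, the two-class mapping can be sketched with a single comparison against the normal label. This is not the notebook's method, only a shorter variant under the assumption that every label other than 'normal.' should count as abnormal; the series raw is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical handful of raw labels in the dataset's format
raw = pd.Series(["normal.", "smurf.", "neptune.", "normal.", "back."])

# "normal." becomes "normal"; every other label becomes "abnormal"
binary = pd.Series(
    np.where(raw == "normal.", "normal", "abnormal"), index=raw.index
)

print(binary.value_counts())
```

The design difference matters for unexpected values: replace() with a dictionary leaves any label missing from the dictionary unchanged, whereas this comparison silently treats it as abnormal.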


## Data Encoding
Convert the features and labels into numeric codes. This step relies on the scikit-learn library.

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

In [14]:
le.fit(train_y)
# Encode every training label; transforming le.classes_ alone would
# leave train_y with only two elements instead of one code per sample
train_y = le.transform(train_y)

# Show the integer code assigned to each class in le.classes_
le.transform(le.classes_)

array([0, 1])
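
The following minimal sketch shows what LabelEncoder does on a small, made-up label array: it learns the sorted set of classes, returns one integer per sample, and can map the integers back to the original strings. The names labels, encoder and encoded are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels after the two-class mapping
labels = np.array(["abnormal", "normal", "abnormal", "abnormal", "normal"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)  # one integer code per sample

print(encoder.classes_)                    # learned classes, in sorted order
print(encoded)                             # codes index into classes_
print(encoder.inverse_transform(encoded))  # back to the original strings
```

The encoded array has the same length as the input, which is what a classifier's fit() expects alongside the feature matrix.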

In [13]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(train_x, train_y)

ValueError: could not convert string to float: 'icmp'
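
The error arises because train_x still contains text values such as 'icmp', 'tcp' and 'http' (columns 1 to 3 in the preview above), and RandomForestClassifier only accepts numeric input. One possible fix, sketched here on a made-up miniature of the data, is to encode those columns with scikit-learn's OrdinalEncoder before fitting. The frame toy_x, the labels toy_y and the list text_cols are assumptions for illustration, not part of the notebook.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical miniature of train_x: columns 1-3 hold text values
toy_x = pd.DataFrame({
    0: [0, 0, 0, 0],
    1: ["tcp", "icmp", "tcp", "udp"],
    2: ["http", "ecr_i", "private", "domain_u"],
    3: ["SF", "SF", "S0", "SF"],
    4: [215, 1032, 0, 44],
})
toy_y = [1, 0, 0, 1]  # made-up encoded labels: 1 = normal, 0 = abnormal

text_cols = [1, 2, 3]

# Learn one integer code per distinct string in each text column
encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy_x[text_cols])

# Replace the text columns with their numeric codes
encoded = toy_x.copy()
for i, col in enumerate(text_cols):
    encoded[col] = codes[:, i]

# With purely numeric features the classifier can now be fitted
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(encoded, toy_y)
print(clf.predict(encoded))
```

The encoder should be fitted on the training features only and then reused with transform() on the testing features so that both share the same codes. One-hot encoding is a common alternative when an artificial ordering of the categories is undesirable.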