# Benchmark Testing on the KDD Cup Dataset
This notebook presents the general steps of a classification procedure. Two methods, Random Forest and a neural network, are applied to the problem of attack identification and detection on an existing dataset, KDDCup'99. Along the way, the notebook shows how to program with TensorFlow, scikit-learn, NumPy, and Matplotlib.

## Environment SetUp
If the environment is not yet set up for this procedure, install the required packages first. If the environment is already set up, skip this step.
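
One way to decide whether this step can be skipped is to ask Python which packages are already importable. The sketch below is not part of the original notebook: the helper name `missing_packages` and the package list are assumptions based on the libraries imported later on.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that cannot be imported in this kernel."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Assumed list of top-level import names used by this notebook.
required = ["numpy", "pandas", "sklearn", "tensorflow", "matplotlib"]

# An empty list means the installation cell can be skipped.
print(missing_packages(required))
```

Note that the import name (`sklearn`) can differ from the name passed to pip (`scikit-learn`).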

In [1]:
#! pip3 install numpy
#! pip3 install pandas
#! pip3 install -U scikit-learn


## General SetUp
First, import the required libraries into the kernel.

In [11]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

Second, load the data into the kernel. pd.read_csv() reads the dataset from the CSV file and returns a DataFrame, which is used in the following steps. The parameters passed here are the path to the dataset, the separator, header=None (the file has no header row), and usecols, which selects the 42 columns to read (41 features plus the label).

In [12]:
data_path = "../Dataset/kddcup.data.txt"

dataset = pd.read_csv(data_path, sep=',', header=None, usecols=range(0, 42))

dataset.shape

(4898431, 42)
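
To see what these parameters do without loading the full file, the same call can be tried on a few in-memory lines. This is only a sketch: the three records and the extra fifth field are made up for illustration and are not rows of the real dataset.

```python
import io
import pandas as pd

# Three made-up records with four fields each, plus a stray fifth field.
sample = io.StringIO(
    "0,tcp,http,normal.,x\n"
    "0,icmp,ecr_i,smurf.,x\n"
    "0,tcp,private,neptune.,x\n"
)

# header=None: the first line is data, so columns are numbered 0, 1, 2, ...
# usecols=range(0, 4): keep only the first four columns and drop the rest.
toy = pd.read_csv(sample, sep=",", header=None, usecols=range(0, 4))

print(toy.shape)          # (rows, columns)
print(list(toy.columns))  # integer column labels
```

Because no header is given, later cells address columns by position, which is why the notebook uses iloc.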

Show the whole dataset.

In [13]:
dataset

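|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tcp | http | SF | 215 | 45076 | 0 | 0 | 0 | 0 | ... | 0 | 0.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 1 | 0 | tcp | http | SF | 162 | 4528 | 0 | 0 | 0 | 0 | ... | 1 | 1.0 | 0.0 | 1.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 2 | 0 | tcp | http | SF | 236 | 1228 | 0 | 0 | 0 | 0 | ... | 2 | 1.0 | 0.0 | 0.50 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 3 | 0 | tcp | http | SF | 233 | 2032 | 0 | 0 | 0 | 0 | ... | 3 | 1.0 | 0.0 | 0.33 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 4 | 0 | tcp | http | SF | 239 | 486 | 0 | 0 | 0 | 0 | ... | 4 | 1.0 | 0.0 | 0.25 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 5 | 0 | tcp | http | SF | 238 | 1282 | 0 | 0 | 0 | 0 | ... | 5 | 1.0 | 0.0 | 0.20 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 6 | 0 | tcp | http | SF | 235 | 1337 | 0 | 0 | 0 | 0 | ... | 6 | 1.0 | 0.0 | 0.17 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 7 | 0 | tcp | http | SF | 234 | 1364 | 0 | 0 | 0 | 0 | ... | 7 | 1.0 | 0.0 | 0.14 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 8 | 0 | tcp | http | SF | 239 | 1295 | 0 | 0 | 0 | 0 | ... | 8 | 1.0 | 0.0 | 0.12 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |
| 9 | 0 | tcp | http | SF | 181 | 5450 | 0 | 0 | 0 | 0 | ... | 9 | 1.0 | 0.0 | 0.11 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | normal. |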


## Start the pre-training SetUp
Split the dataset into two parts: the features (input_x) and the labels (input_y). There are 41 features; the label column holds individual attack names, which are commonly grouped into 5 classes (normal traffic plus four attack categories). scikit-learn's train_test_split() then assigns 20% of the data to the testing set and the remaining 80% to the training set.

In [28]:
input_x = dataset.iloc[:, 0:41]
input_y = dataset.iloc[:, 41]

train_x, test_x, train_y, test_y = train_test_split(input_x, input_y, test_size=0.20, random_state=0)
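
The behaviour of test_size and random_state is easier to see on a small array. The following is a sketch on made-up data, not on the KDD records; the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 toy samples with 2 features each
y = np.array([0, 1] * 5)           # toy labels

# test_size=0.20 holds out 20% of the rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=0)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.20, random_state=0)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

train_test_split() also accepts a stratify argument that keeps class proportions equal in both parts. The notebook does not use it, but it may be worth considering when some classes are rare.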

In [29]:
train_x.shape

(3918744, 41)

In [33]:
test_y

1809341      smurf.
3563864    neptune.
2992676      smurf.
2333965      smurf.
1882045      smurf.
3181076      smurf.
4697502    neptune.
4552438     normal.
2017195      smurf.
1311800      smurf.
619775     neptune.
2893685      smurf.
590285     neptune.
1874554      smurf.
1817132      smurf.
1289225      smurf.
562422     neptune.
3734009    neptune.
2559288      smurf.
4094193      smurf.
4647141    neptune.
597117     neptune.
1036541     normal.
744279      normal.
1112802    neptune.
1919355      smurf.
1812064      smurf.
2707596      smurf.
3652753    neptune.
486530       smurf.
             ...   
657103     neptune.
2979324      smurf.
1473166      smurf.
3240331      smurf.
1382049      satan.
2615238      smurf.
2295337      smurf.
4312987      smurf.
3534748    neptune.
3044206      smurf.
1365493     normal.
1766901      smurf.
1427574     normal.
1390755     normal.
2003782      smurf.
1591269      smurf.
661274     neptune.
4747125    neptune.
951551       smurf.
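
The labels printed above are individual attack names, whereas the text speaks of 5 classes. A possible way to connect the two is to map each name to one of the categories usually cited for KDD Cup'99 (normal, DoS, probe, R2L, U2R). The dictionary below is an assumption added for illustration and does not appear in the notebook; names it does not cover are marked as unknown.

```python
import pandas as pd

# Assumed grouping of attack names into the usual KDD Cup'99 categories.
ATTACK_CATEGORY = {
    "normal.": "normal",
    "back.": "dos", "land.": "dos", "neptune.": "dos",
    "pod.": "dos", "smurf.": "dos", "teardrop.": "dos",
    "ipsweep.": "probe", "nmap.": "probe",
    "portsweep.": "probe", "satan.": "probe",
    "ftp_write.": "r2l", "guess_passwd.": "r2l", "imap.": "r2l",
    "multihop.": "r2l", "phf.": "r2l",
    "warezclient.": "r2l", "warezmaster.": "r2l",
    "buffer_overflow.": "u2r", "loadmodule.": "u2r",
    "perl.": "u2r", "rootkit.": "u2r",
}

def to_category(label):
    """Map a raw label to its coarse category, or 'unknown' if unmapped."""
    return ATTACK_CATEGORY.get(label, "unknown")

# A few labels in the style of the output above (not real rows).
labels = pd.Series(["smurf.", "neptune.", "normal.", "satan.", "smurf."])
coarse = labels.map(to_category)

# Frequency of each coarse category in the toy series.
print(coarse.value_counts())
```

Counting the categories this way also gives a quick view of class imbalance before training.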


## Data Preprocessing
Convert the labels into integer codes. This step relies on the preprocessing module of the scikit-learn library.

In [54]:
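from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing

# Learn the mapping from label strings to integer codes
le = preprocessing.LabelEncoder()
le.fit(train_y)

# Print every distinct label found in the training set
for label in le.classes_:
    print(label)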

0.00
back.
buffer_overflow.
ftp_write.
guess_passwd.
imap.
ipsweep.
land.
loadmodule.
multihop.
neptune.
nmap.
normal.
perl.
phf.
pod.
portsweep.
rootkit.
satan.
smurf.
teardrop.
warezclient.
warezmaster.
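
Fitting the encoder only learns the list of classes; the conversion itself happens in transform(), and inverse_transform() maps codes back to names. The sketch below uses a handful of made-up labels, and the names `toy_labels` and `le_toy` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["smurf.", "normal.", "neptune.", "smurf."]

le_toy = LabelEncoder()
codes = le_toy.fit_transform(toy_labels)       # strings -> integer codes

print(list(le_toy.classes_))                   # distinct labels in sorted order
print(list(codes))                             # one code per input label
print(list(le_toy.inverse_transform(codes)))   # integer codes -> strings
```

Codes follow the sorted order of the class names, so they carry no meaning beyond identity. The class list printed by the notebook also contains the entry 0.00, which does not look like an attack name; it may come from a malformed row, although the notebook does not say.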


In [32]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(train_x, train_y)

ValueError: could not convert string to float: 'icmp'
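
The ValueError arises because the feature frame still contains string columns (the rows displayed earlier show values such as tcp, http and SF in columns 1 to 3), while RandomForestClassifier expects numeric input. One possible remedy, not shown in the notebook, is to one-hot encode those columns before fitting. The sketch below does this on made-up records; the column names and values are illustrative assumptions, and with the notebook's unnamed columns the positions 1, 2 and 3 would presumably take their place.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up records; column names are illustrative, not taken from the notebook.
toy_x = pd.DataFrame({
    "duration": [0, 0, 0, 0, 0, 0],
    "protocol": ["tcp", "icmp", "tcp", "icmp", "tcp", "udp"],
    "service": ["http", "ecr_i", "private", "ecr_i", "http", "domain_u"],
    "flag": ["SF", "SF", "S0", "SF", "SF", "SF"],
    "src_bytes": [215, 1032, 0, 1032, 162, 44],
})
toy_y = ["normal.", "smurf.", "neptune.", "smurf.", "normal.", "normal."]

# One-hot encode the string columns and pass the numeric ones through unchanged.
encode = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["protocol", "service", "flag"])],
    remainder="passthrough",
)

# Chain encoding and classifier so the same encoding is applied at predict time.
model = make_pipeline(encode, RandomForestClassifier(n_estimators=10, random_state=0))
model.fit(toy_x, toy_y)

pred = model.predict(toy_x)
print(pred)
```

Wrapping the encoder and the classifier in one pipeline means the test set is transformed with the categories learned from the training set, and handle_unknown="ignore" keeps prediction from failing on a category that was never seen during fitting.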