## Machine Learning project - analyzing the NSL-KDD data with SVM

The KDD99 intrusion detection dataset[1] is one of the most frequently studied datasets of its kind.  This network traffic dataset was originally published in 1999, and other researchers[2] published a revised version, NSL-KDD, in 2009.  (Most importantly, a very large number of duplicate records were removed.)

My notebook works with the revised NSL-KDD dataset resulting from the 2009 work.  Since the original NSL-KDD set no longer appears to be hosted online, I've made a copy on my GitHub[3].

First, I will explore the dataset and pre-process it into a form usable for SVM modeling.  I will then explore the hyperparameter space with a subset of the data.  Once I've selected good values for the SVM hyperparameters, I'll rebuild the model using the entire training set and review its performance against the provided test set.

[1] http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[2] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A Detailed Analysis of the KDD CUP 99 Data Set” (2009)
[3] https://github.com/mwinton/NSL-KDD-Data-Set

In [1]:
# Start by importing the usual required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # pyplot is the recommended plotting interface (pylab is discouraged)


## OPTION 1: Load full data set.

If you are planning to skip the hyperparameter optimization below, you can load the full training set now.

In [2]:
#Load the full training data if you don't plan to wait for the hyperparameter optimization.
raw_data=pd.read_csv('https://raw.githubusercontent.com/mwinton/NSL-KDD-Data-Set/master/KDDTrain+.csv', header=None)

## OPTION 2: Load smaller (20% data set).

If you are planning to run the hyperparameter optimization (128 permutations explored), then use the following data set.  (Even with this smaller dataset, that optimization step will take ~1 hour.)

Or if you just want the final step (building the model and running it against the test data) to run more quickly, you can use this set.  The resulting accuracy is not significantly lower than when using the full training set.

In [2]:
#Load a smaller subset of the data for exploration and hyperparameter search
raw_data=pd.read_csv('https://raw.githubusercontent.com/mwinton/NSL-KDD-Data-Set/master/20%20Percent%20Training%20Set.csv', header=None)



## Load test set.

No matter which option you chose above for the training data, you will need the test set.

In [3]:
#Load full test set
raw_test_data = pd.read_csv('https://raw.githubusercontent.com/mwinton/NSL-KDD-Data-Set/master/KDDTest+.csv', header=None)


In [4]:
# Combine train & test data into a single DataFrame for pre-processing.
# Keep track of number of rows of train vs. test in order to re-split them later.
num_rows_train = len(raw_data)
X_combined = pd.concat([raw_data,raw_test_data], axis=0)

# Read feature names from a file and add to the DataFrame
fnames = pd.read_csv('https://raw.githubusercontent.com/mwinton/NSL-KDD-Data-Set/master/Field%20Names.csv', header=None)
col_names = fnames[0].values
col_names = np.append(col_names,['labels','difficulty_level'])
X_combined.columns = col_names


In [5]:
# The attack type table from the dataset was incomplete.  It only lists attacks in the training set, and not those in the test set.  I have manually created a dict with the full list.

atype_dict = {'back':'dos','buffer_overflow':'u2r','ftp_write':'r2l',
              'guess_passwd':'r2l','imap':'r2l','ipsweep':'probe','land':'dos',
              'loadmodule':'u2r','multihop':'r2l','neptune':'dos',
              'nmap':'probe','perl':'u2r','phf':'r2l','pod':'dos',
              'portsweep':'probe','rootkit':'u2r','satan':'probe','smurf':'dos',
              'spy':'r2l','teardrop':'dos','warezclient':'r2l','warezmaster':'r2l',
              'normal':'normal','unknown':'unknown',
              'apache2':'dos','udpstorm':'dos','processtable':'dos','mailbomb':'dos',
              'saint':'probe','mscan':'probe','xterm':'u2r','ps':'u2r',
              'sqlattack':'u2r','snmpgetattack':'r2l','named':'r2l','xlock':'r2l',
              'xsnoop':'r2l','sendmail':'r2l','httptunnel':'r2l','worm':'r2l',
              'snmpguess':'r2l'}

# Add a column for attack_type, and delete the originals. We don't need them.
X_combined['attack_type'] = X_combined['labels'].map(atype_dict)
del X_combined['labels']
del X_combined['difficulty_level']
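
Because `Series.map` silently produces NaN for any label that is missing from the dict, a quick coverage check can be useful.  The following is a minimal sketch on made-up data (`labels`, `toy_dict`, and `mystery_attack` are hypothetical, not part of the dataset); in the notebook, a check like this would have to run before the `labels` column is deleted.

```python
import pandas as pd

# Hypothetical miniature of the label column and mapping (not the real data)
labels = pd.Series(['neptune', 'normal', 'satan', 'mystery_attack'])
toy_dict = {'neptune': 'dos', 'normal': 'normal', 'satan': 'probe'}

mapped = labels.map(toy_dict)

# Labels that are absent from the dict become NaN after .map(); list them
unmapped = sorted(labels[mapped.isna()].unique())
print(unmapped)
```

An empty list would indicate that every label in the data has an attack category.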


## Pre-processing

First, we need to split off the y columns (labels) from the X_combined DataFrame.  We also need to convert them to numeric form for the sklearn SVM algorithms.  We will use LabelEncoder for this.

In [6]:
from sklearn.preprocessing import LabelEncoder

# Convert y to numeric labels
y_combined=X_combined.pop('attack_type')
le = LabelEncoder()
le.fit(y_combined)
y_combined = le.transform(y_combined)
y_labels = le.inverse_transform([0,1,2,3,4]) #generate this for label graphs/reports later
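
To illustrate what LabelEncoder does, here is a small sketch on a made-up label list (`toy_y` and `toy_le` are hypothetical names).  It shows that the classes are numbered in sorted order, which is why the integers 0 to 4 correspond to dos, normal, probe, r2l, u2r in the reports later on.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels using the five categories from the notebook
toy_y = ['normal', 'dos', 'probe', 'dos', 'u2r', 'r2l']

toy_le = LabelEncoder()
toy_encoded = toy_le.fit_transform(toy_y)

print(list(toy_le.classes_))  # classes are stored in sorted order
print(list(toy_encoded))      # each label replaced by its index in classes_
```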


## Perform one-hot encoding on the protocol_type, service, and flag features

This is an important step.  Because these are nominal, rather than ordinal values, we cannot simply convert them to integers.  For example, it's not meaningful for an algorithm to say 'tcp' > 'icmp' or vice versa. Instead, we use the one-hot encoder to create a set of columns for each value of the original feature.  These new columns include only 0's and 1's.  (Note: many papers in published literature appear to have dealt with these columns incorrectly.)

In [7]:
# Perform one-hot encoding on nominal features
ohe_protocol_cols = pd.get_dummies(X_combined['protocol_type'])
ohe_service_cols = pd.get_dummies(X_combined['service'])  # 66 columns
ohe_flag_cols = pd.get_dummies(X_combined['flag'])
X_combined = pd.concat([X_combined,ohe_protocol_cols,ohe_service_cols,ohe_flag_cols],axis=1)

# Drop the original nominal columns now that their one-hot versions are in place
drop_cols = ['protocol_type','service','flag']
X_combined = X_combined.drop(columns=drop_cols)
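
As a small illustration of the one-hot step, the sketch below applies the same idea to a made-up three-row frame (`toy` and its values are hypothetical).  Newer pandas versions return boolean dummy columns by default, so `dtype=int` is passed here to get literal 0's and 1's.

```python
import pandas as pd

# Made-up frame with one numeric and one nominal feature
toy = pd.DataFrame({'duration': [0, 2, 5],
                    'protocol_type': ['tcp', 'icmp', 'tcp']})

# One new 0/1 column per distinct value of the nominal feature
dummies = pd.get_dummies(toy['protocol_type'], dtype=int)

# Replace the nominal column with its one-hot columns
toy_encoded = pd.concat([toy.drop(columns=['protocol_type']), dummies], axis=1)
print(toy_encoded)
```

Encoding the combined train and test frame, as the notebook does, has the side effect that both sets end up with the same columns even if a value occurs in only one of them.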


## Split out train vs. test, and then also create a cross-validation set

We cannot optimize our hyperparameters by comparing against either the data the model is trained on, or the final test data.  So instead, we will pull out 30% of records from the training data as a cross-validation set.  Each hyperparameter combination will be trained on the remaining 70% and scored on this 30% set.  Once optimal values of the hyperparameters are found, we will re-train the model on the entire training set (including this cross-validation subset).

(Note that the random_state flag is set in order to make the experiment repeatable.)

In [8]:
from sklearn.model_selection import train_test_split

# Re-split train vs test data
X_train = X_combined[:num_rows_train]
X_test = X_combined[num_rows_train:]
y_train = y_combined[:num_rows_train]
y_test = y_combined[num_rows_train:]

# Test-train-split for cross-validation
X_train,X_cv,y_train,y_cv = train_test_split(X_train, y_train, 
                                                 random_state = 1, test_size = 0.3)
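
The effect of `test_size` and `random_state` can be seen on a small made-up array (`toy_X` and `toy_y` are hypothetical).  The sketch also shows the `stratify` argument, which this notebook does not use; it is included only as one possible way to keep class proportions the same in both splits when some classes are rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, two balanced classes
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.array([0] * 10 + [1] * 10)

# Same arguments as the notebook: 30% held out, fixed seed
X_a, X_b, y_a, y_b = train_test_split(toy_X, toy_y, test_size=0.3, random_state=1)
X_a2, X_b2, y_a2, y_b2 = train_test_split(toy_X, toy_y, test_size=0.3, random_state=1)

print(X_a.shape, X_b.shape)        # 14 training rows, 6 held-out rows
print(np.array_equal(X_b, X_b2))   # same seed gives the same split

# Optional variant (an assumption, not used in the notebook): stratified split
_, _, _, y_strat = train_test_split(toy_X, toy_y, test_size=0.3,
                                    random_state=1, stratify=toy_y)
print(np.bincount(y_strat))        # class counts in the held-out part
```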


## Apply feature scaling

SVM does not perform well if features are not on similar scales, so feature scaling is a requirement.  Recall that the one-hot encoded columns are already in [0,1], so I've chosen to use the MinMaxScaler because it will also output feature values in this range.  (Since this seemed to work sufficiently well, I didn't try the alternate StandardScaler.)

(Note that I also tried using SelectKBest to reduce the feature space to the top 20 features, but the model didn't perform as well as when I used all features. I have not included that code here.)

In [9]:
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training split only, then apply the same scaling to all three splits.

mmsc = MinMaxScaler()
mmsc.fit(X_train)
X_train_norm = mmsc.transform(X_train)
X_test_norm = mmsc.transform(X_test)
X_cv_norm = mmsc.transform(X_cv)
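
The following sketch shows what MinMaxScaler does on a made-up single-feature array (`toy_train`, `toy_test`, and `toy_scaler` are hypothetical).  Since the scaler learns its minimum and maximum from the training data only, values in other sets that exceed the training range can land outside [0,1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data; the scaler only ever sees toy_train
toy_train = np.array([[0.0], [5.0], [10.0]])
toy_test = np.array([[2.5], [20.0]])

toy_scaler = MinMaxScaler().fit(toy_train)

train_scaled = toy_scaler.transform(toy_train).ravel()
test_scaled = toy_scaler.transform(toy_test).ravel()

print(train_scaled)  # training values span exactly 0 to 1
print(test_scaled)   # 20.0 is above the training maximum, so it maps above 1
```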


## OPTIONAL: optimizing the hyperparameters - THIS TAKES A LONG TIME!

This was an important step in the experiment, but it takes a long time (~1 hour on my computer).  If you don't want to go through that pain, **DO NOT CLICK** to run the next box.  Instead, skip ahead to the box after it.

I explored both rbf and sigmoid kernels.  I also explored C and gamma values of 2^k with k in [-8,-6,-4,-2,0,2,4,6].  With 2 kernels, 8 values of C, and 8 values of gamma, this results in 128 combinations that will be run.
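
The size of this search space can be checked with a few lines.  This is only a sketch of the arithmetic; the names `kernels`, `exponents`, and `grid` are hypothetical.

```python
from itertools import product

kernels = ['rbf', 'sigmoid']
exponents = [-8, -6, -4, -2, 0, 2, 4, 6]

# Every (kernel, C exponent, gamma exponent) combination
grid = list(product(kernels, exponents, exponents))

print(len(grid))                                  # 2 * 8 * 8 combinations
print(2.0 ** exponents[0], 2.0 ** exponents[-1])  # smallest and largest value tried
```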

In [None]:
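# DO NOT RUN THE CODE IN THIS BOX UNLESS YOU ARE PREPARED TO WAIT FOR A VERY LONG TIME!

from sklearn.svm import SVC

# Search grid described above: two kernels, and C / gamma = 2**k for each exponent k
svm_kernels = ['rbf', 'sigmoid']
svm_C_exp_vals = [-8, -6, -4, -2, 0, 2, 4, 6]
svm_gamma_exp_vals = [-8, -6, -4, -2, 0, 2, 4, 6]

scores = []
for k in svm_kernels:
    for c in svm_C_exp_vals:
        for g in svm_gamma_exp_vals:
            # Train on the 70% split, then score on the held-out 30% split
            svc = SVC(C = (2**c), kernel = k, gamma = (2**g))
            svc.fit(X_train_norm,y_train)
            score = svc.score(X_cv_norm,y_cv)
            scores.append((k,c,g,score)) # in hindsight, probably better saved as a dict
            print('\nSVM (k=%s C=2**%d gamma=2**%d): accuracy score = %.3f' % (k,c,g,score))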


# Prepare hyperparameter data for 3d plotting
plt_scores = np.array(scores)
rbf_scores = plt_scores[plt_scores[:,0]=='rbf'][:,[1,2,3]]
sigmoid_scores = plt_scores[plt_scores[:,0]=='sigmoid'][:,[1,2,3]]
plot_sets = {'rbf':rbf_scores,'sigmoid':sigmoid_scores}

# Plot scatter plot (one 3D figure per kernel)
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib versions

for p in plot_sets:
    plt_data = plot_sets[p]
    # np.array(scores) stored every field as a string, so convert back to numbers
    c = plt_data[:,0].astype(int)
    g = plt_data[:,1].astype(int)
    score = plt_data[:,2].astype(float)
    fig = plt.figure()
    fig.suptitle('SVM hyperparam. optimization: '+str(p))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(c,g,score)
    ax.set_xlabel('log2 C')
    ax.set_ylabel('log2 gamma')
    ax.set_zlabel('accuracy')
plt.show()
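
As an alternative to reading the best combination off the plots, the score tuples could be loaded into a DataFrame and sorted.  This is a minimal sketch under the assumption that `scores` keeps the (kernel, C exponent, gamma exponent, accuracy) layout used above; the three rows in `toy_scores` are made-up illustrative values, not results from the actual search.

```python
import pandas as pd

# Made-up rows in the same layout as the notebook's `scores` list
toy_scores = [('rbf', 6, 0, 0.995), ('rbf', 4, 0, 0.990), ('sigmoid', 6, -4, 0.950)]

scores_df = pd.DataFrame(toy_scores,
                         columns=['kernel', 'log2_C', 'log2_gamma', 'accuracy'])

# Row with the highest validation accuracy
best = scores_df.loc[scores_df['accuracy'].idxmax()]
best_C = 2.0 ** int(best['log2_C'])
best_gamma = 2.0 ** int(best['log2_gamma'])

print(best['kernel'], best_C, best_gamma)
```

A DataFrame also avoids the string conversion that `np.array(scores)` causes, since each column keeps its own type.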


## Continue on, using optimal hyperparameters from previous grid search.

We will regenerate the full training set (including the cross-validation data) and move forward using the following choices: kernel = rbf ; C = 2^6 ; gamma = 1.  These resulted in 99.5% accuracy on the cross-validation set.

In [10]:
# Set the parameters to use going forward
k_selected = 'rbf'
C_selected = 2**6
gamma_selected = 2**0

# Regenerate full training set (including CV data)
X_train = X_combined[:num_rows_train]
X_test = X_combined[num_rows_train:]
y_train = y_combined[:num_rows_train]
y_test = y_combined[num_rows_train:]

# re-run the feature scaling
mmsc = MinMaxScaler()
mmsc.fit(X_train)
X_train_norm = mmsc.transform(X_train)
X_test_norm = mmsc.transform(X_test)


## Finally, we can make predictions for our test set to see how our model performed.

Note that the following block of code could take up to ~10 minutes to run if you're using the full training set.  If you're using the smaller ~20% training set, it should run in ~2 minutes.

You should see a classification report similar to this one:

SVM (k=rbf C=64 gamma=1): accuracy score = 0.765
             precision    recall  f1-score   support

        dos       0.98      0.79      0.88      7458
     normal       0.67      0.97      0.80      9710
      probe       0.78      0.64      0.71      2421
        r2l       0.63      0.11      0.18      2887
        u2r       0.79      0.22      0.35        67

avg / total       0.78      0.76      0.73     22543


In [13]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Run SVM on our Test set 
svc = SVC(C = C_selected, kernel = k_selected, gamma = gamma_selected)
svc.fit(X_train_norm,y_train)
svc_predictions = svc.predict(X_test_norm)
score = svc.score(X_test_norm,y_test)
# C_selected and gamma_selected hold the actual values (64 and 1), not exponents, so print them directly
print('\nSVM (k=%s C=%g gamma=%g): accuracy score = %.3f' % 
              (k_selected,C_selected,gamma_selected,score))
print(classification_report(y_test,svc_predictions, target_names = y_labels))



SVM (k=rbf C=64 gamma=1): accuracy score = 0.765
             precision    recall  f1-score   support

        dos       0.98      0.79      0.88      7458
     normal       0.67      0.97      0.80      9710
      probe       0.78      0.64      0.71      2421
        r2l       0.63      0.11      0.18      2887
        u2r       0.79      0.22      0.35        67

avg / total       0.78      0.76      0.73     22543
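
The notebook imports `confusion_matrix` but never calls it.  A confusion matrix can show which classes are being mistaken for which, for example how many attacks are predicted as normal.  The sketch below uses made-up labels (`toy_true`, `toy_pred`, and `toy_labels` are hypothetical) purely to show how to read the result.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels
toy_true = ['dos', 'dos', 'normal', 'normal', 'r2l', 'r2l']
toy_pred = ['dos', 'normal', 'normal', 'normal', 'normal', 'r2l']
toy_labels = ['dos', 'normal', 'r2l']

# Rows are true classes, columns are predicted classes, both in toy_labels order
cm = confusion_matrix(toy_true, toy_pred, labels=toy_labels)
print(cm)
```

In this toy matrix, the 'normal' column contains one misclassified dos record and one misclassified r2l record.  For the model above, the equivalent call would presumably be `confusion_matrix(y_test, svc_predictions)`.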



## CONCLUSION

This SVM model obtains ~76.5% accuracy on the NSL-KDD test set.  Its precision is best for DOS attacks (the highest-volume attack in the dataset).  Its worst performance is on the R2L attacks (which are present in very few records in the dataset), where recall is only 0.11.  Its poor precision for NORMAL traffic (0.67) also suggests that much work would be needed to productionize such a system.  Other ML models should also be explored.