# 4. Network Intrusion Detection

The dataset used in this notebook comes from the KDD Cup 1999 competition.

Here you can find the original task description given to the competition participants: [task description](http://kdd.ics.uci.edu/databases/kddcup99/task.html)

The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between *bad* connections, called intrusions or attacks, and *good* normal connections. 
The database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

Download instructions:
- download the file kddcup.data.gz from [here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html)
- move it into the 'datasets' folder (or any other folder, as long as you know the path)
- extract the archive

In [None]:
import pandas as pd
import numpy as np
import os

## Load the dataset

In [None]:
# as usual, you might have to change this depending on the path you chose and your operating system
DATA_DIR = 'datasets'
FILENAME = 'kddcup.data.corrected'

In [None]:
# feature names obtained from: http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names
header_names = [
    'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 
    'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 
    'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 
    'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 
    'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 
    'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 
    'dst_host_srv_rerror_rate', 'attack_type'
]

In [None]:
df = pd.read_csv(os.path.join(DATA_DIR, FILENAME), header=None, names=header_names, sep=',')

<div class="alert alert-block alert-danger">
    <b>Q: What is the effect of setting <i>header=None</i>?</b>
</div>

Hint: take a look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

<div class="alert alert-block alert-success">
ANS
</div>

<div class="alert alert-block alert-danger">
    <b>Q: What is the effect of setting <i>names=header_names</i>?</b>
</div>

<div class="alert alert-block alert-success">
ANS
</div>
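
To experiment with these two parameters without loading the full file, here is a minimal sketch that reads a made-up two-row CSV from memory. The data and the three column names are illustrative assumptions, not taken from the real dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content with no header line, similar in shape to the KDD file
raw = "0,tcp,http\n5,udp,domain_u\n"

# Default behaviour: pandas infers the header from the first line,
# so the first data row is consumed as column labels
df_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every line as data; names= supplies the column labels
df_named = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["duration", "protocol_type", "service"],
)

print(df_default.shape)
print(df_named.shape)
print(list(df_named.columns))
```

Comparing the two shapes shows whether a data row was lost to the header.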

<div class="alert alert-block alert-danger">
    <b>Q: How many rows are in the dataframe?</b>
</div>

<div class="alert alert-block alert-info">
<b>
IMPORTANT:
    
The cell below reduces the size of the dataframe by sampling some of its elements. This is only done to work with a smaller amount of data. You can try to run the notebook without running this cell; if it crashes due to memory errors, come back here and rerun the notebook with less data.
    
If you still have trouble, there is a smaller version available on the same website.
The file name is *kddcup.data_10_percent.gz*.
</b>
</div>

In [None]:
df = df.sample(frac=0.1)

<div class="alert alert-block alert-danger">
    <b>Q: How many rows are in the dataframe after sampling?</b>
</div>

## Initial analysis of the data

<div class="alert alert-block alert-danger">
<b>Q: Display the first 5 rows of the dataframe</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: How many columns does the original dataframe have?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: How many FEATURES?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Are there any categorical variables?</b>
</div>

Hint: use the [.info()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) method.
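
As a rough illustration of what `.info()` reveals, the sketch below builds a tiny made-up dataframe (the values are assumptions for demonstration only) and lists the columns that are not stored as numbers. Non-numeric columns are good candidates for categorical variables.

```python
import pandas as pd

# Toy dataframe with two numeric columns and one text column
toy = pd.DataFrame({
    "duration": [0, 5, 2],
    "protocol_type": ["tcp", "udp", "tcp"],
    "src_bytes": [181, 239, 0],
})

# Prints one line per column with its dtype and non-null count
toy.info()

# Columns whose dtype is not numeric
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(text_cols)
```

Note that a numeric dtype does not rule out a categorical meaning: binary flags stored as 0/1 are a typical case.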

## Pre-processing the dataset and analysis

In [None]:
col_names = np.array(header_names)

nominal_idx = [1, 2, 3]
binary_idx = [6, 11, 13, 14, 20, 21]
numeric_idx = list(set(range(41)).difference(nominal_idx).difference(binary_idx))

nominal_cols = col_names[nominal_idx].tolist()
binary_cols = col_names[binary_idx].tolist()
numeric_cols = col_names[numeric_idx].tolist()

<div class="alert alert-block alert-danger">
    <b>Q: What is the difference between <i>col_names</i> and <i>header_names</i>?</b>
</div>

[hint](https://docs.python.org/3/library/functions.html#type)

<div class="alert alert-block alert-success">
ANS
</div>

<div class="alert alert-block alert-danger">
<b>Q: How many distinct values exist for the categorical variables?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: If you solved the previous question with more than two lines of code, do the same but try to use only <i>two</i> lines of code.</b>
</div>

hint: remember the `for` loop

<div class="alert alert-block alert-danger">
<b>Q: What are the possible values of the categorical variables?</b>
</div>

Try to use only **two** lines of code to print all the possible values of the categorical variables.
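
One possible shape for such a loop, shown on a made-up dataframe (the values and the `toy_nominal_cols` list are assumptions for illustration), is sketched below.

```python
import pandas as pd

# Toy dataframe with two categorical columns
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "flag": ["SF", "SF", "REJ", "SF"],
})
toy_nominal_cols = ["protocol_type", "flag"]

# nunique() counts the distinct values, unique() returns them
for col in toy_nominal_cols:
    print(col, toy[col].nunique(), sorted(toy[col].unique()))
```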

<div class="alert alert-block alert-danger">
<b>Q: What are the maximum, minimum, and average duration of the entries in the dataframe?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: For how many entries is 'root_shell' set, and for how many is it not?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Count the number of entries for each 'protocol_type'</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Which is the most frequent 'service'?</b> Try to write a cell that prints the name of the most frequent service as a string.
</div>

Hint: remember the [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) and [sort_values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) methods.
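
The sketch below shows how `groupby` and `sort_values` can be combined on a made-up column (the service names and counts are illustrative assumptions). `value_counts` is shown as an alternative that reaches the same result.

```python
import pandas as pd

# Toy column: "http" appears three times, "smtp" twice, "ecr_i" once
toy = pd.DataFrame({"service": ["http", "smtp", "http", "ecr_i", "http", "smtp"]})

# Count rows per service, then sort so the largest group comes first
counts = toy.groupby("service").size().sort_values(ascending=False)
most_frequent = counts.index[0]
print(most_frequent)

# Alternative: value_counts() followed by idxmax()
most_frequent_alt = toy["service"].value_counts().idxmax()
print(most_frequent_alt)
```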

## Mapping each attack type to one category

To do this, we are going to use the file *training_attack_types.txt*, which maps each attack in the original dataset to one category.

If you have downloaded the zip archive from beep, you already have this file in the dataset folder; otherwise, you can get it from the GitHub repo or from [here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html).

<div class="alert alert-block alert-danger">
<b>Q: How many different 'attack_types' are in the dataframe?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Display the number of occurrences of each attack type</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: What does the following cell do?</b>
</div>

- hint: display the dataframe before and after performing this operation and look at it
- [hint](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)

In [None]:
df['attack_type'] = df.apply(lambda r: r['attack_type'][:-1], axis=1)

<div class="alert alert-block alert-success">
ANS
</div>
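
To see the effect of the slicing on its own, here is a small sketch on a made-up Series of labels (the values are assumptions). It also shows the vectorized `.str` accessor, which is usually faster than a row-wise `apply` on large dataframes.

```python
import pandas as pd

# Toy labels ending with a trailing dot
labels = pd.Series(["normal.", "smurf.", "neptune."])

# Row-by-row slicing: drop the last character of each string
via_apply = labels.apply(lambda s: s[:-1])

# Vectorized equivalent using the .str accessor
via_str = labels.str[:-1]

print(via_apply.tolist())
print(via_str.tolist())
```

If some labels might not end with a dot, `labels.str.rstrip('.')` is a safer choice, since it leaves such labels unchanged.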

In [None]:
category = dict()
category['benign'] = ['normal']

TRAINING_ATTACK_TYPES_FILENAME = 'training_attack_types.txt'
with open(os.path.join(DATA_DIR, TRAINING_ATTACK_TYPES_FILENAME), 'r') as f:
    for line in f.readlines():
        attack, cat = line.strip().split(' ')
        if cat in category.keys():
            category[cat].append(attack)
        else:
            category[cat] = [attack]

attack_mapping = {v: k for k in category for v in category[k]}

<div class="alert alert-block alert-danger">
<b>Q: What is 'attack_mapping'?</b>
</div>

<div class="alert alert-block alert-success">
ANS
</div>
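
The inversion in the last line of the cell above can be tried in isolation. The sketch below uses made-up input lines in place of the real file (the attack names, categories, and the blank trailing line are assumptions) and skips lines that do not contain exactly two fields.

```python
# Hypothetical file content: "<attack> <category>" per line, plus a blank line
lines = ["smurf dos\n", "neptune dos\n", "nmap probe\n", "\n"]

toy_category = {"benign": ["normal"]}
for line in lines:
    parts = line.split()
    if len(parts) != 2:
        continue  # ignore blank or malformed lines
    attack, cat = parts
    toy_category.setdefault(cat, []).append(attack)

# Invert category -> [attacks] into attack -> category
toy_mapping = {attack: cat for cat, attacks in toy_category.items() for attack in attacks}
print(toy_mapping)
```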

<div class="alert alert-block alert-danger">
<b>Q: How many categories of attacks are there? What are their names?</b>
</div>

### Perform the actual mapping

In [None]:
df['attack_category'] = df.apply(lambda r: attack_mapping[r['attack_type']], axis=1)

<div class="alert alert-block alert-danger">
<b>Q: Count the number of occurrences of each category</b>
</div>

## Data preparation: dummy variables

We have some categorical variables, so we have to convert them to one-hot encoded variables.

<div class="alert alert-block alert-danger">
<b>Q: Create a new DataFrame encoding the categorical attributes with one hot encoding.</b>
</div>

In [None]:
# Convert categorical feature into dummy variables with one-hot encoding
df_one_hot = # TODO
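
As an illustration of one-hot encoding in general, the sketch below applies `pd.get_dummies` to a made-up dataframe (the values are assumptions). It is one possible approach, not necessarily the one intended for this exercise.

```python
import pandas as pd

# Toy dataframe with one numeric and one categorical column
toy = pd.DataFrame({
    "duration": [0, 5, 2],
    "protocol_type": ["tcp", "udp", "tcp"],
})

# Each distinct value of protocol_type becomes its own indicator column
toy_one_hot = pd.get_dummies(toy, columns=["protocol_type"])
print(toy_one_hot.columns.tolist())
print(toy_one_hot)
```

Columns not listed in `columns=` are passed through unchanged.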

## Data preparation: Train-test split

In [None]:
from sklearn.model_selection import train_test_split

# Split dataset up into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    # TODO X, 
    # TODO Y, 
    test_size=0.3
)
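
For reference, here is a sketch of the same call on made-up arrays. `random_state` and `stratify` are optional additions that are not part of the cell above: the first makes the split reproducible, the second keeps the class proportions similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features, balanced binary labels
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 5 + [1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.3,
    random_state=42,   # assumption: fixed seed for reproducibility
    stratify=y_toy,    # assumption: preserve class balance
)
print(X_tr.shape, X_te.shape)
```

In the real notebook, the feature matrix should not contain the label columns.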

## Data preparation: scaling

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# This cell might take a while to run
# also, if it crashes it might mean that you do not have enough memory available
standard_scaler = StandardScaler().fit(X_train[numeric_cols])

X_train[numeric_cols] = standard_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = standard_scaler.transform(X_test[numeric_cols])
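
The scaler above is fitted on the training set only and then reused on the test set. The sketch below shows the consequence on made-up numbers: the training data ends up with zero mean and unit variance, while the test data is shifted and scaled by the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [5.0]])

# Mean and standard deviation are learned from the training data only
scaler = StandardScaler().fit(train)

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(scaler.mean_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

Fitting on the training set alone avoids leaking information about the test set into the preprocessing step.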

## Data preparation: converting label to integers

<div class="alert alert-block alert-danger">
<b>Q: What does the following cell do?</b>
</div>

In [None]:
y_train_bin = y_train.apply(lambda x: 0 if x == 'benign' else 1)
y_test_bin = y_test.apply(lambda x: 0 if x == 'benign' else 1)

<div class="alert alert-block alert-success">
ANS
</div>
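
The same kind of conversion can be tried on a made-up Series (the labels are assumptions). The sketch also shows a vectorized comparison, which gives the same values without a Python-level lambda.

```python
import pandas as pd

# Toy category labels
y = pd.Series(["benign", "dos", "probe", "benign"])

# Element-wise conversion, as in the cell above
y_apply = y.apply(lambda x: 0 if x == "benign" else 1)

# Vectorized alternative: boolean comparison cast to integers
y_vec = (y != "benign").astype(int)

print(y_apply.tolist())
print(y_vec.tolist())
```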

## Training the models for binary classification

As a first step, find the model that best detects whether an entry is malicious or not (i.e. use the binary label). We will work on multi-class classification in the next notebook.

We will try several models and also look at how to perform cross-validation.

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix

import time

In [None]:
# Toy example: how to measure elapsed time
a = 5
t0 = time.time()
for value in range(10):
    a *= value
print("Elapsed time:", time.time() - t0, "s")

<div class="alert alert-block alert-danger">
    <b>Q: Define a function named <code>print_eval_metrics</code> that receives as input the predictions and the true labels and <i>prints</i> accuracy, precision, recall and the confusion matrix.</b>
</div>

In [None]:
def print_eval_metrics(predicted_labels, true_labels):
    accuracy = # TODO
    precision = # TODO
    recall = # TODO
    cm = # TODO
    
    # Do not change the code below
    print("ACCURACY:  %.5f" % accuracy)
    print("PRECISION: %.5f" % precision)
    print("RECALL:    %.5f" % recall)
    print("CONFUSION MATRIX:")
    print(cm)
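
When filling in the function, keep in mind that the scikit-learn metric functions expect the true labels first and the predictions second. The sketch below uses made-up labels to show that swapping the two arguments changes precision.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Toy labels: 1 true positive, 2 false positives, 1 false negative, 1 true negative
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

# Swapping the arguments turns precision into recall
prec_swapped = precision_score(y_pred, y_true)

print(acc, prec, rec)
print(cm)
print(prec_swapped)
```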

<div class="alert alert-block alert-danger">
<b>Q: Define, train, and evaluate the models specified in the following cells. For all of them, measure the training time as well.</b>
</div>

<div class="alert alert-block alert-danger">
<b>Naive Bayes.</b>
</div>

In [None]:
clf_nb = # TODO define the classifier

t0 = time.time()
# TODO fit the classifier
elapsed_time = # TODO
print("training time = %.2f" % elapsed_time)

# TODO perform the prediction
y_pred_nb = 

# evaluate
print_eval_metrics(y_pred_nb, y_test_bin)

<div class="alert alert-block alert-danger">
<b>Random Forest</b>
</div>

In [None]:
clf_rf = # TODO define the classifier

t0 = time.time()
# TODO fit the classifier
elapsed_time = # TODO
print("training time = %.2f" % elapsed_time)

# TODO perform the prediction
y_pred_rf = 

# evaluate
print_eval_metrics(y_pred_rf, y_test_bin)

<div class="alert alert-block alert-danger">
<b>Decision Tree</b>
</div>

In [None]:
clf_dt = # TODO define the classifier

t0 = time.time()
# TODO fit the classifier
elapsed_time = # TODO
print("training time = %.2f" % elapsed_time)

# TODO perform the prediction
y_pred_dt = 

# evaluate
print_eval_metrics(y_pred_dt, y_test_bin)

<div class="alert alert-block alert-danger">
<b>k-nearest neighbors (this might take a while) </b>
</div>

In [None]:
clf_knn = # TODO define the classifier

t0 = time.time()
# TODO fit the classifier
elapsed_time = # TODO
print("training time = %.2f" % elapsed_time)

# TODO perform the prediction
y_pred_knn = 

# evaluate
print_eval_metrics(y_pred_knn, y_test_bin)

<div class="alert alert-block alert-danger">
<b>
SVM; try two versions of the SVM: 
    
    i) kernel='rbf' 
    ii) kernel='linear'
    
(this might take a while)
</b>
</div>

In [None]:
clf_svc = # TODO define the classifier with kernel='rbf'

t0 = time.time()
# TODO fit the classifier
elapsed_time = # TODO
print("training time = %.2f" % elapsed_time)

# TODO perform the prediction
y_pred_svc = 

# evaluate
print_eval_metrics(y_pred_svc, y_test_bin)

In [None]:
clf_svc = # TODO define the classifier with kernel='linear'

t0 = time.time()
# TODO fit the classifier
elapsed_time = # TODO
print("training time = %.2f" % elapsed_time)

# TODO perform the prediction
y_pred_svc = 

# evaluate
print_eval_metrics(y_pred_svc, y_test_bin)

<div class="alert alert-block alert-danger">
<b>Q: The following cell performs cross-validation to find the best-performing configuration of the Decision Tree Classifier. Run it as is and then try to change the parameters to improve the performance.</b>
</div>

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
search = GridSearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid={'max_depth': [2, 3, 4, 5, 10, 15, 20, 25, 50, None]}, 
    cv=10
)
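# fit the grid search itself (the name `clf` is never defined in this notebook);
# with the default refit=True, best_estimator_ is retrained on the full training set
search.fit(X_train, y_train_bin)
best_clf = search.best_estimator_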

y_pred = best_clf.predict(X_test)

print_eval_metrics(y_pred, y_test_bin)
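
To see what a grid search exposes after fitting, here is a small self-contained sketch on synthetic data. The dataset, the reduced parameter grid, and `cv=5` are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

toy_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, None]},
    cv=5,
)
toy_search.fit(X_toy, y_toy)

# Best parameter combination and its mean cross-validated score
print(toy_search.best_params_)
print(toy_search.best_score_)

# Mean score for every candidate in the grid
for params, score in zip(toy_search.cv_results_["params"],
                         toy_search.cv_results_["mean_test_score"]):
    print(params, round(float(score), 3))
```

Inspecting `cv_results_` helps decide which region of the grid to refine next.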

<div class="alert alert-block alert-danger">
<b>Q: Which model do you think works best? Why do you say so?</b>
</div>

<div class="alert alert-block alert-success">
ANS
</div>