If you want to remove the margins of the notebook, run the following cell.

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# 5. Network Intrusion Detection

We use the same dataset as in the previous practical session.

Our goal is to build a network intrusion detector, a predictive model capable of distinguishing between *bad* connections, called intrusions or attacks, and *good* normal connections.
In the last session we focused on **binary classification**, distinguishing between standard connections and attacks. In this session we will focus on **multi-class classification**: we will try not only to detect the attacks but also to correctly classify the type of attack.

Download instructions:
- download the file kddcup.data.gz from [here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) (**it is the same as last session, you can use the same file!**)
- move it to the 'datasets' folder (or to some other folder, as long as you know the path)
- extract the archive

You will need [XGBoost](https://xgboost.readthedocs.io/en/latest/python/python_intro.html) and [seaborn](https://seaborn.pydata.org/) for this notebook.

If you use `pip` for managing packages, you can install them with:
```
    pip install xgboost
    pip install seaborn
```

If you use Anaconda, these commands should work fine for you:
```
    conda install -c anaconda seaborn
    conda install -c conda-forge xgboost
```

In [None]:
import numpy as np
import pandas as pd
import time

# Import libraries for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Import classifiers
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# Import evaluation metrics
from sklearn.metrics import (
    accuracy_score, 
    recall_score, 
    precision_score, 
    multilabel_confusion_matrix
)

In [None]:
from collections import Counter

`Counter` is a `dict` subclass for counting hashable objects: elements are stored as dictionary keys and their counts as dictionary values. Counts can be any integer value, including zero or negative counts. [DOC](https://docs.python.org/3/library/collections.html#collections.Counter)
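
As a quick illustration of how `Counter` behaves, here is a minimal sketch on toy labels (made up for this example, not taken from the dataset). Counting pairs works the same way as counting single values, which is what the heatmap code further down in this notebook relies on.

```python
from collections import Counter

# Toy labels, invented for illustration
predicted = ["dos", "dos", "benign", "probe", "dos"]
actual = ["dos", "benign", "benign", "probe", "dos"]

# Counting single values
label_counts = Counter(actual)
print(label_counts["dos"])       # 2
print(label_counts["missing"])   # 0: absent keys count as zero, no KeyError

# Counting (predicted, actual) pairs
pair_counts = Counter(zip(predicted, actual))
print(pair_counts[("dos", "dos")])      # 2
print(pair_counts[("dos", "benign")])   # 1
```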

## Load the dataset

In [None]:
# as usual, you might have to change the value of these variables depending on the path you chose and your OS
DATA_DIR = 'datasets/'
FILENAME = 'kddcup.data.corrected'

In [None]:
# feature names obtained from: http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names
header_names = [
    'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 
    'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 
    'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 
    'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 
    'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 
    'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 
    'dst_host_srv_rerror_rate', 'attack_type'
]

In [None]:
df = pd.read_csv(DATA_DIR+FILENAME, header=None, names=header_names, sep=',')

<div class="alert alert-block alert-info">
<b>
IMPORTANT:
    
The cell below reduces the size of the dataframe by sampling some of its elements. This is only done to work with a smaller amount of data. You can try to run the notebook without running this cell; if it crashes due to memory errors, come back here and rerun the notebook with less data.
    
If you still have trouble, there is a smaller version available on the same website.
The file name is *kddcup.data_10_percent.gz*.
</b>
</div>
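
If you want the subsample to be the same every time you rerun the notebook, one option is to fix the random seed. The sketch below uses a toy DataFrame; `random_state=42` is an arbitrary value chosen for this example, not something this notebook prescribes.

```python
import pandas as pd

toy_df = pd.DataFrame({"x": range(10)})

# Same seed -> the same rows are sampled on every run
sample_a = toy_df.sample(frac=0.2, random_state=42)
sample_b = toy_df.sample(frac=0.2, random_state=42)

print(len(sample_a))                            # 2 rows (20% of 10)
print(sample_a.index.equals(sample_b.index))    # True
```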

In [None]:
df = df.sample(frac=0.2)

## Data analysis

Let's quickly repeat part of the analysis we did last time, and use this as an opportunity to learn some visualization methods.

In [None]:
df[:5]

In [None]:
print("Number of rows = %d" % len(df.index))
print("Number of columns = %d" % len(df.columns))

In [None]:
col_names = np.array(header_names)

nominal_idx = [1, 2, 3]
binary_idx = [6, 11, 13, 14, 20, 21]
numeric_idx = list(set(range(41)).difference(nominal_idx).difference(binary_idx))

nominal_cols = col_names[nominal_idx].tolist()
binary_cols = col_names[binary_idx].tolist()
numeric_cols = col_names[numeric_idx].tolist()

print("categorical attributes: \n", nominal_cols, "\n")
print("binary attributes: \n", binary_cols, "\n")
print("numeric attributes: \n", numeric_cols, "\n")

### Let's check the distribution of the `protocol_type` attribute

In [None]:
tmp_df = df.groupby('protocol_type').size().reset_index()

In [None]:
tmp_df

Instead of just printing the numbers, let's try to use the bar plot from `matplotlib` to plot the distribution.

It works as follows:
- initialize a "figure" and "axes" object with `plt.subplots()`
- plot what you want in the specified axis (with `ax.<sth>`)
- show the plot with `plt.show()`

In [None]:
fig, ax = plt.subplots()
ax.bar(tmp_df['protocol_type'], tmp_df[0])
plt.show()

With matplotlib, you can modify many aspects of the plot.

In [None]:
fig, ax = plt.subplots()
ax.bar(tmp_df['protocol_type'], tmp_df[0])

ax.set_xlabel('protocol type')
ax.set_ylabel('Num. of connections')
ax.set_title('Number of connections for each protocol_type')

plt.show()

Alternatively you could also use `barh`.

In [None]:
fig, ax = plt.subplots()
ax.barh(tmp_df['protocol_type'], tmp_df[0])
plt.show()

### Let's check the distribution of the `flag` attribute

<div class="alert alert-block alert-danger">
<b>
Q: plot the distribution of <code>flag</code> using bar or barh from matplotlib
</b>
</div>

Documentation:
- [bar](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.bar.html)
- [barh](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.barh.html)

<div class="alert alert-block alert-danger">
<b>
Q: the values are of different magnitude, thus the plot is not really readable. Try to use <code>set_xscale</code> or <code>set_yscale</code> (depending on whether you are using barh or bar) to set 'log' scale and make it more readable.
</b>
</div>

Hint: look at the format I used above to add the name of the axis and the title (i.e. `ax.set_xlabel()`).

Documentation:
- [set_xscale](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.axes.Axes.set_xscale.html)
- [set_yscale](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_yscale.html)
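
To see what a log scale does without touching the dataset, here is a minimal sketch on made-up counts (the category names and numbers are invented for this example).

```python
import matplotlib.pyplot as plt

# Invented counts spanning several orders of magnitude
names = ["a", "b", "c"]
counts = [5, 300, 120000]

fig, ax = plt.subplots()
ax.barh(names, counts)
ax.set_xscale("log")   # barh puts the counts on the x axis
ax.set_xlabel("count (log scale)")
plt.show()
```

With `bar` instead of `barh`, the counts are on the y axis, so the equivalent call would be `ax.set_yscale("log")`.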

### Let's check the distribution of the binary attributes

`matplotlib` lets you create different plots in the same `fig` object.
You can do so by specifying different axes, as follows:
- while initializing `fig` and `ax`, specify the dimension of the grid in `plt.subplots()` (e.g. `plt.subplots(2,2)` for a 2x2 grid)
- in that case, `ax` is a matrix (or array) of axes and you have to specify the one you want to access while plotting (e.g. `ax[0][1].<sth>`) 

In [None]:
binary_cols

In [None]:
fig, ax = plt.subplots(3, 2, figsize=(15, 8))

tmp_df = df.groupby('land').size().reset_index().sort_values('land')
ax[0][0].barh(tmp_df['land'], tmp_df[0])
ax[0][0].set_yticks(tmp_df['land'])

tmp_df = df.groupby('logged_in').size().reset_index().sort_values('logged_in')
ax[0][1].barh(tmp_df['logged_in'], tmp_df[0])
ax[0][1].set_yticks(tmp_df['logged_in'])

tmp_df = df.groupby('root_shell').size().reset_index().sort_values('root_shell')
ax[1][0].barh(tmp_df['root_shell'], tmp_df[0])
ax[1][0].set_yticks(tmp_df['root_shell'])

tmp_df = df.groupby('su_attempted').size().reset_index().sort_values('su_attempted')
ax[1][1].barh(tmp_df['su_attempted'], tmp_df[0])
ax[1][1].set_yticks(tmp_df['su_attempted'])

tmp_df = df.groupby('is_host_login').size().reset_index().sort_values('is_host_login')
ax[2][0].barh(tmp_df['is_host_login'], tmp_df[0])
ax[2][0].set_yticks(tmp_df['is_host_login'])

tmp_df = df.groupby('is_guest_login').size().reset_index().sort_values('is_guest_login')
ax[2][1].barh(tmp_df['is_guest_login'], tmp_df[0])
ax[2][1].set_yticks(tmp_df['is_guest_login'])

plt.show()

You could also automate the creation of such plots without having to rewrite everything several times, as follows:

In [None]:
cols_to_plot = [
    ['land', 'logged_in'],
    ['root_shell', 'su_attempted'],
    ['is_host_login', 'is_guest_login']
]

fig, ax = plt.subplots(3, 2, figsize=(15, 10))

for idx_x in range(3):
    for idx_y in range(2):
        column_name = cols_to_plot[idx_x][idx_y]
        tmp_df = df.groupby(column_name).size().reset_index().sort_values(column_name)
        ax[idx_x][idx_y].barh(tmp_df[column_name], tmp_df[0])
        ax[idx_x][idx_y].set_yticks(tmp_df[column_name])
        ax[idx_x][idx_y].set_xscale('log')
        ax[idx_x][idx_y].set_title(column_name)

plt.show()

Interestingly, from this plot, we can see that `is_host_login` is always 0 and `su_attempted` actually is not binary.
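
The same observation can be checked numerically. Here is a minimal sketch on a toy frame: the values are invented, only the column names match the dataset.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "land": [0, 0, 1, 0],
    "is_host_login": [0, 0, 0, 0],
    "su_attempted": [0, 1, 2, 0],
})

# Number of distinct values per column: 1 means constant, more than 2 means not binary
n_unique = toy_df.nunique()
print(n_unique)

constant_cols = n_unique[n_unique == 1].index.tolist()
non_binary_cols = n_unique[n_unique > 2].index.tolist()
print(constant_cols)     # ['is_host_login']
print(non_binary_cols)   # ['su_attempted']
```

On the real data, the analogous call would be `df[binary_cols].nunique()`.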

## Mapping each attack type to one category

In [None]:
# drop the last character (the trailing '.') of each label
df['attack_type'] = df['attack_type'].str[:-1]

In [None]:
len(df['attack_type'].unique())

In [None]:
tmp_df = df.groupby('attack_type').size().reset_index().sort_values(0, ascending=False)
tmp_df

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.barh(tmp_df['attack_type'], tmp_df[0])
ax.set_xscale('log')
ax.set_ylabel('attack_type')
ax.set_xlabel('N. of samples (log scale)')
plt.show()

In [None]:
category = dict()
category['benign'] = ['normal']

with open(DATA_DIR+'training_attack_types.txt', 'r') as f:
    for line in f.readlines():
        attack, cat = line.strip().split(' ')
        if cat in category.keys():
            category[cat].append(attack)
        else:
            category[cat] = [attack]

attack_mapping = {v: k for k in category for v in category[k]}

In [None]:
print("Attack mapping:")
print(attack_mapping)

### Perform the actual mapping

In [None]:
df['attack_category'] = df.apply(lambda r: attack_mapping[r['attack_type']], axis=1)

In [None]:
tmp_df = df.groupby('attack_category').size().reset_index().sort_values(0, ascending=False)
display(tmp_df)

# This example shows you that you can specify the color of each bar
color = ['green' if category=='benign' else 'red' for category in tmp_df['attack_category']]
fig, ax = plt.subplots()
ax.barh(tmp_df['attack_category'], tmp_df[0], color=color)
ax.set_xscale('log')
ax.set_ylabel('attack_category')
ax.set_xlabel('N. of samples (log scale)')
plt.show()

In [None]:
attack2int = {x: idx for idx, x in enumerate(df['attack_category'].unique())}
int2attack = {v: k for k, v in attack2int.items()}
print("attack2int:", attack2int)
print("int2attack:", int2attack)

## Data preparation: dummy variables

We have some categorical variables. Thus, we have to convert them to one-hot encoded variables.

<div class="alert alert-block alert-danger">
<b>Q: Create a new DataFrame encoding the categorical attributes with one hot encoding.</b>
</div>
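
As a reminder of what one-hot encoding produces, here is a minimal sketch on a toy frame. It uses `pd.get_dummies`, which is one possible way to do it (an assumption of this example, not necessarily the intended solution); the values are invented.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "src_bytes": [181, 105, 1032, 239],
})

# Each distinct value of the listed column becomes its own indicator column
toy_one_hot = pd.get_dummies(toy_df, columns=["protocol_type"])
print(toy_one_hot.columns.tolist())
# ['src_bytes', 'protocol_type_icmp', 'protocol_type_tcp', 'protocol_type_udp']
```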

In [None]:
# Convert categorical feature into dummy variables with one-hot encoding
df_one_hot =

## Data preparation: Train-test split

<div class="alert alert-block alert-danger">
<b>Q: Perform data split.</b>
</div>
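
Some attack categories are very rare, so a purely random split may leave very few of them in one of the two sets. The sketch below shows, on toy data, how the `stratify` argument of `train_test_split` keeps class proportions similar in both splits. Using `stratify` and `random_state` here is a suggestion of this example, not a requirement of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features, imbalanced labels
X = np.arange(20).reshape(10, 2)
y = np.array(["benign"] * 8 + ["dos"] * 2)

# stratify=y keeps the class proportions similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)
print(X_tr.shape, X_te.shape)   # (5, 2) (5, 2)
print(sorted(y_te))             # four 'benign' and one 'dos'
```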

In [None]:
from sklearn.model_selection import train_test_split

# Split dataset up into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    # TODO features, 
    # TODO labels, 
    test_size=0.3
)

## Data preparation: scaling

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# This cell might take a while to run
# also, if it crashes it might mean that you do not have enough memory available
standard_scaler = StandardScaler().fit(X_train[numeric_cols])

X_train[numeric_cols] = standard_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = standard_scaler.transform(X_test[numeric_cols])

## Training the models

<div class="alert alert-block alert-danger">
<b>Q: Naive Bayes: define and train the model. Also, measure how long the training takes.</b>
</div>

In [None]:
clf_nb = # TODO: initialize the model

t0 = time.time()
# TODO: fit the model
print("elapsed time = %.2f" % (time.time()-t0))

y_pred_nb = # TODO: perform the prediction

<div class="alert alert-block alert-danger">
<b>Q: Naive Bayes: compute the accuracy.</b>
</div>

In [None]:
accuracy_nb = # TODO: measure accuracy
print("ACCURACY:", accuracy_nb)

<div class="alert alert-block alert-danger">
<b>Q: Naive Bayes: Use the following code to print the multilabel confusion matrix and the heatmap.</b>
</div>

In [None]:
multilabel_confusion_matrix(y_test, y_pred_nb)

For each class, the confusion matrix above behaves like a binary confusion matrix, considering the specified class vs all the others.
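
To make the one-vs-rest layout concrete, here is a minimal sketch on toy labels (invented for this example). Each 2x2 matrix returned by scikit-learn is laid out as `[[TN, FP], [FN, TP]]` for the corresponding class.

```python
from sklearn.metrics import multilabel_confusion_matrix

# Toy labels, invented for illustration
y_true = ["benign", "dos", "dos", "probe", "benign", "dos"]
y_pred = ["benign", "dos", "benign", "probe", "benign", "dos"]

labels = ["benign", "dos", "probe"]
mcm = multilabel_confusion_matrix(y_true, y_pred, labels=labels)

# One 2x2 matrix per class: that class vs. all the others
for label, matrix in zip(labels, mcm):
    tn, fp, fn, tp = matrix.ravel()
    print(label, "TN=%d FP=%d FN=%d TP=%d" % (tn, fp, fn, tp))
```

Passing `labels` explicitly fixes the order of the matrices, which makes it easier to tell which matrix belongs to which class.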

Given the predictions and the true values, you can also plot a seaborn heatmap to understand how the predictions are distributed.

In [None]:
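c = Counter(zip(y_pred_nb, y_test))
# create a pandas DataFrame filled with zeros: rows = true labels, columns = predicted labels
dff = pd.DataFrame(0, columns=np.unique(y_pred_nb), index=np.unique(y_test))
# insert the counts in the DataFrame
for (pred, true), v in c.items():
    # .loc avoids chained indexing, which may fail to write the value in recent pandas versions
    dff.loc[true, pred] = v

# plot the heatmap
sns.heatmap(dff, annot=True, fmt="d")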

The rows represent the **True** values, and the columns the **predicted** values.

**The elements of the diagonal are the objects correctly classified.**

**If you sum the values on the row corresponding to, for example, `u2r`, you will obtain the number of `u2r` attacks in the test set.
On the other hand, if you sum the values on the column corresponding to `u2r`, you will obtain the number of test connections which were predicted as `u2r`.**
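
The row and column sums are exactly what per-class recall and precision are built from. Here is a minimal sketch on toy labels (invented for this example), using `pd.crosstab` as a compact way to build the same table of true vs. predicted labels.

```python
import pandas as pd

# Toy labels, invented for illustration
y_true = pd.Series(["benign", "dos", "dos", "u2r", "benign", "dos", "benign"], name="true")
y_pred = pd.Series(["benign", "dos", "benign", "dos", "benign", "dos", "dos"], name="predicted")

# Rows are true labels, columns are predicted labels
table = pd.crosstab(y_true, y_pred)
print(table)

n_true_dos = table.loc["dos"].sum()    # row sum: number of true 'dos' samples
n_pred_dos = table["dos"].sum()        # column sum: number of samples predicted as 'dos'
n_correct_dos = table.loc["dos", "dos"]

print("recall for dos:", n_correct_dos / n_true_dos)
print("precision for dos:", n_correct_dos / n_pred_dos)
```

Note that `u2r` never appears as a column in this toy table because no sample was predicted as `u2r`.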

<div class="alert alert-block alert-danger">
<b>Q: Decision Tree: define, train and test the model computing the evaluation metrics (*accuracy* and *multilabel_confusion_matrix*). Also, measure how long the training takes.</b>
</div>

In [None]:
clf_dt = # TODO: initialize the model

t0 = time.time()
# TODO: fit the model
print("elapsed time = %.2f" % (time.time()-t0))

y_pred_dt = # TODO: perform the prediction

In [None]:
accuracy_dt = # TODO: measure accuracy
print("ACCURACY:", accuracy_dt)

In [None]:
# TODO: multilabel_confusion_matrix

<div class="alert alert-block alert-danger">
<b>Q: Using the code above, plot the heatmap.</b>
</div>

You can copy and paste the cell used for Naive Bayes, you just have to change one variable.

In [None]:
# TODO: heatmap

<div class="alert alert-block alert-danger">
<b>Q: Random Forest: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

In [None]:
clf_rf = # TODO: initialize the model

t0 = time.time()
# TODO: fit the model
print("elapsed time = %.2f" % (time.time()-t0))

y_pred_rf = # TODO: perform the prediction

In [None]:
accuracy_rf = # TODO: measure accuracy
print("ACCURACY:", accuracy_rf)

In [None]:
# TODO: multilabel_confusion_matrix

<div class="alert alert-block alert-danger">
<b>Q: Using the code above, plot the heatmap.</b>
</div>

In [None]:
# TODO: heatmap

<div class="alert alert-block alert-danger">
<b>Q: SVM: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

In [None]:
clf_svc = # TODO: initialize the model

t0 = time.time()
# TODO: fit the model
print("elapsed time = %.2f" % (time.time()-t0))

y_pred_svc = # TODO: perform the prediction

In [None]:
accuracy_svc = # TODO: measure accuracy
print("ACCURACY:", accuracy_svc)

In [None]:
# TODO: multilabel_confusion_matrix

<div class="alert alert-block alert-danger">
<b>Q: Using the code above, plot the heatmap.</b>
</div>

In [None]:
# TODO: heatmap

<div class="alert alert-block alert-danger">
<b>Q: Run your first neural net</b>
</div>

In [None]:
clf_nn = MLPClassifier(hidden_layer_sizes=(200, ), activation='relu')

# train the classifier
t0 = time.time()
# TODO: fit the model
print("elapsed time = %.2f" % (time.time()-t0))

y_pred_nn = # TODO: perform the prediction

In [None]:
accuracy_nn = # TODO: measure accuracy
print("ACCURACY:", accuracy_nn)

In [None]:
# TODO: multilabel_confusion_matrix

<div class="alert alert-block alert-danger">
<b>Q: Using the code above, plot the heatmap.</b>
</div>

In [None]:
# TODO: heatmap

### XGBoost

<div class="alert alert-block alert-danger">
<b>Q: XGBoost: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

XGBoost is not part of sklearn but its usage is pretty much the same as the models from sklearn. The class of the classifier is `XGBClassifier()`.

[Documentation](https://xgboost.readthedocs.io/en/latest/python/python_api.html) (search for class xgboost.XGBClassifier)
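
Depending on the XGBoost version you have installed, `XGBClassifier` may refuse string class labels and expect integers from 0 to n_classes - 1 (as far as I know, this is the case in recent releases, so treat it as an assumption to check against your version). The `attack2int` and `int2attack` dictionaries defined earlier can handle the conversion. Here is a minimal sketch on toy labels, with the dictionaries rebuilt locally under hypothetical names so that the block runs on its own.

```python
import pandas as pd

# Toy labels standing in for y_train
y_toy = pd.Series(["benign", "dos", "probe", "dos", "benign"])

# Same construction as attack2int / int2attack earlier in the notebook
toy_attack2int = {x: idx for idx, x in enumerate(y_toy.unique())}
toy_int2attack = {v: k for k, v in toy_attack2int.items()}

# Strings -> integers, e.g. before calling fit
y_toy_int = y_toy.map(toy_attack2int)
print(y_toy_int.tolist())    # [0, 1, 2, 1, 0]

# Integers -> strings, e.g. after calling predict
y_toy_back = y_toy_int.map(toy_int2attack)
print(y_toy_back.tolist() == y_toy.tolist())   # True
```

Predictions usually come back as a NumPy array, so wrapping them first (for example `pd.Series(y_pred).map(int2attack)`) is one way to convert them back to category names.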

In [None]:
clf_xgb = # TODO: initialize the XGBClassifier() object (as starter, use default arguments)

t0 = time.time()
# TODO: fit the model
print("elapsed time = %.2f" % (time.time()-t0))

In [None]:
y_pred_xgb = # TODO: perform prediction

In [None]:
accuracy_xgb = # TODO: measure accuracy
print("ACCURACY:", accuracy_xgb)

In [None]:
# TODO multilabel_confusion_matrix

<div class="alert alert-block alert-danger">
<b>Q: Using the code above, plot the heatmap.</b>
</div>

In [None]:
# TODO: heatmap

<div class="alert alert-block alert-danger">
<b>Q: Which model do you think works best? Why?</b>
</div>

<div class="alert alert-block alert-success">
ANS:
</div>

## Analyse feature importance

<div class="alert alert-block alert-danger">
<b>Q: How many features does our model get as input?</b>
</div>

We cannot assume that every feature is as important as the others. Some features might be very useful, some other features might even worsen the prediction!

A way to measure the "importance" of each feature is given by some models. For instance:
- the `feature_importances_` attribute of Random Forests
- the `plot_importance` function of xgboost
- the `feature_importances_` attribute of xgboost classifiers

Still, we have to be careful as we cannot blindly trust these values.

Let's start with the `plot_importance` function from xgboost.

In [None]:
import xgboost as xgb

In [None]:
fig, ax = plt.subplots(figsize=(15, 15))
# plot the importance of each feature of the XGBoost classifier trained above
xgb.plot_importance(clf_xgb, ax=ax)
plt.show()

<div class="alert alert-block alert-danger">
<b>Q: What is the importance of the features according to the XGBoost model trained above? Which are the most important features? Which are the least important features? And how big is the difference between their importance?</b>
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

<div class="alert alert-block alert-danger">
<b>
    Q: 
    Use the following line to store the importances in an array.
</b>
</div>

In [None]:
importances = clf_xgb.feature_importances_

In [None]:
print(importances)

<div class="alert alert-block alert-danger">
<b>
    Q: 
    Use the following lines to create a dict that maps each column to its importance (as computed by the XGBoost model).
</b>
</div>

In [None]:
# pair each column name with its importance (both follow the column order of X_train)
feature_importances_dict = dict(zip(X_train.columns, importances))

In [None]:
feature_importances_dict

<div class="alert alert-block alert-danger">
<b>Q: Focus on the least important features: look at their distribution, their max values, etc. Is there anything strange with them?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Focus on the most important features: look at their distribution, their max values, etc. Is there anything strange with them?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Try to remove the least important features (if you want, you can try removing different numbers of features) and see how the performance changes. Try also removing the most important features. 
    
For future work, also observe how the features' importance changes in each situation, and do not limit this analysis to the XGBoost, but try to do the same with the other models as well.</b>
</div>
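
One possible way to organise this experiment is to sort the importances and drop the bottom `k` columns. The sketch below uses invented importances and a toy feature matrix; `k = 2` is an arbitrary choice, and all the `toy_*` names are hypothetical.

```python
import pandas as pd

# Invented importances; in the notebook they would come from feature_importances_dict
toy_importances = pd.Series(
    {"src_bytes": 0.45, "count": 0.30, "flag_SF": 0.20, "urgent": 0.05, "num_outbound_cmds": 0.0}
)

# Sort ascending and pick the k least important features
k = 2
least_important = toy_importances.sort_values().index[:k].tolist()
print(least_important)   # ['num_outbound_cmds', 'urgent']

# Toy feature matrix with the same columns
X_toy = pd.DataFrame([[1, 2, 1, 0, 0], [3, 4, 0, 0, 0]], columns=toy_importances.index)
X_toy_reduced = X_toy.drop(columns=least_important)
print(X_toy_reduced.columns.tolist())   # ['src_bytes', 'count', 'flag_SF']
```

In the notebook, the same columns would have to be dropped from both `X_train` and `X_test` before retraining the model.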

Similarly, the RandomForest has a `feature_importances_` attribute, that returns the importance of each feature. You can find the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.feature_importances_).
**However, it is important not to blindly trust the values returned by this attribute**, as it only represents "the (normalized) total reduction of the criterion brought by that feature".
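
One way to cross-check impurity-based importances is permutation importance on held-out data: shuffle one feature at a time and measure how much the score drops. This technique is not part of the session itself; the sketch below is only an illustration on synthetic data, where feature 0 determines the label and feature 1 is pure noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic data: feature 0 determines the label, feature 1 is pure noise
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)

# Train on the first 200 samples, keep the last 100 as held-out data
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[:200], y[:200])

# Impurity-based importances, computed from the training data
print(clf.feature_importances_)

# Permutation importances, computed on the held-out data
result = permutation_importance(clf, X[200:], y[200:], n_repeats=5, random_state=0)
print(result.importances_mean)
```

If the two rankings disagree strongly on the real data, that is a hint that the impurity-based values should be interpreted with care.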

---