# 4 - Network Intrusion Detection

The dataset used in this notebook comes from the KDD Cup 1999 competition.

Here you can find the original task description given to the competition participants: [task description](http://kdd.ics.uci.edu/databases/kddcup99/task.html).

The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between *bad* connections, called intrusions or attacks, and *good* normal connections. 
The database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

Download instructions:
- download the file kddcup.data.gz from [here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html)
- move it to the 'datasets' folder (or to any other folder, as long as you know the path)
- extract the archive

As usual, go through the notebook and answer the questions (N.B. not all of them require coding).

In [None]:
import pandas as pd
import numpy as np

## Load the dataset

<div class="alert alert-block alert-info">
<b>
You might have to change the values of the variables below to match the location of the dataset on your machine.</b>
</div>

In [None]:
DATA_DIR = 'datasets/'
FILENAME = 'kddcup.data.corrected'

In [None]:
# feature names obtained from: http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names
header_names = [
    'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 
    'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 
    'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 
    'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 
    'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 
    'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 
    'dst_host_srv_rerror_rate', 'attack_type'
]

In [None]:
df = pd.read_csv(DATA_DIR+FILENAME, header=None, names=header_names, sep=',')

<div class="alert alert-block alert-danger">
    <b>Q: What is the effect of setting <i>header=None</i>?</b>
</div>

[hint](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

<div class="alert alert-block alert-success">
ANS
</div>

<div class="alert alert-block alert-danger">
    <b>Q: What is the effect of setting <i>names=header_names</i>?</b>
</div>

<div class="alert alert-block alert-success">
ANS
</div>
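
As a small illustration of the two `read_csv` parameters discussed above, here is a sketch on a tiny in-memory CSV. The data and the name `toy_names` are made up for the example; they are not taken from the KDD file.

```python
import io
import pandas as pd

# Toy CSV with no header row (hypothetical data, not the KDD file)
raw = "0,tcp,http\n2,udp,domain_u\n"
toy_names = ["duration", "protocol_type", "service"]

# Default behaviour: the first line is consumed as column labels
wrong = pd.read_csv(io.StringIO(raw))

# header=None: every line is data; names=...: our own column labels
right = pd.read_csv(io.StringIO(raw), header=None, names=toy_names)

print(wrong.shape, list(wrong.columns))
print(right.shape, list(right.columns))
```

Comparing the two shapes shows where the first data row ends up in each case.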

<div class="alert alert-block alert-info">
<b>
IMPORTANT:
    
The cell below reduces the size of the dataframe by sampling some of its elements. This is only done to work with a smaller amount of data. You can try to run the notebook without running this cell; if it crashes due to memory errors, come back here and rerun the notebook with less data.
    
If you still have trouble, there is a smaller version available on the same website.
The file name is *kddcup.data_10_percent.gz*.
</b>
</div>

In [None]:
df = df.sample(frac=0.4)

## Initial analysis of the data

<div class="alert alert-block alert-danger">
<b>Q: Display the first 5 rows of the dataframe</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: How many entries are in the dataframe?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: How many columns does the original dataframe have?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: How many FEATURES?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Are there any categorical variables?</b>
</div>

[hint](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html)
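
The kind of inspection these questions ask for can be sketched on a toy dataframe. The values below are invented; `select_dtypes` is just one possible way to spot non-numeric columns, not necessarily the intended one.

```python
import pandas as pd

# Hypothetical three-row dataframe with one text column
toy = pd.DataFrame({
    "duration": [0, 2, 5],
    "protocol_type": ["tcp", "udp", "tcp"],
    "src_bytes": [181, 239, 0],
})

toy.info()                      # prints dtype and non-null count per column
n_rows, n_cols = toy.shape      # (rows, columns)

# Columns that are not numeric are candidates for categorical variables
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(n_rows, n_cols, text_cols)
```

Keep in mind that a column stored as an integer can still be categorical in meaning (e.g. a 0/1 flag), so the dtype is only a first clue.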

## Pre-processing the dataset and continuing the analysis

In [None]:
col_names = np.array(header_names)

nominal_idx = [1, 2, 3]
binary_idx = [6, 11, 13, 14, 20, 21]
numeric_idx = sorted(set(range(41)) - set(nominal_idx) - set(binary_idx))

nominal_cols = col_names[nominal_idx].tolist()
binary_cols = col_names[binary_idx].tolist()
numeric_cols = col_names[numeric_idx].tolist()

<div class="alert alert-block alert-danger">
    <b>Q: What is the difference between <i>col_names</i> and <i>header_names</i>?</b>
</div>

[hint](https://docs.python.org/3/library/functions.html#type)

<div class="alert alert-block alert-success">
ANS
</div>

<div class="alert alert-block alert-danger">
<b>Q: How many distinct values exist for the categorical variables?</b>
</div>

In [None]:
print(nominal_cols[0], ":", len(df[nominal_cols[0]].unique()))
print()  # TODO
print()  # TODO

<div class="alert alert-block alert-danger">
<b>Q: Do the same as above, but try to use only <i>two</i> lines of code.</b>
</div>

hint: remember the `for` loop

<div class="alert alert-block alert-danger">
<b>Q: What are the possible values of the categorical variables?</b>
</div>

Try to use only **two** lines of code to print all the possible values of the categorical variables.
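
A minimal sketch of the loop pattern on toy data (the column values are made up, and the real dataframe will have many more distinct values):

```python
import pandas as pd

# Hypothetical dataframe with two categorical columns
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "flag": ["SF", "SF", "REJ", "SF"],
})

# One loop handles every column: count and list the distinct values
for col in ["protocol_type", "flag"]:
    print(col, ":", toy[col].nunique(), sorted(toy[col].unique()))

# DataFrame.nunique gives the same counts for all columns at once
distinct_counts = toy.nunique().to_dict()
print(distinct_counts)
```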

<div class="alert alert-block alert-danger">
<b>Q: What are the maximum, minimum and average durations of the entries in the dataframe?</b>
</div>

In [None]:
# max


In [None]:
# min


In [None]:
# average


<div class="alert alert-block alert-danger">
<b>Q: For how many entries is 'root_shell' set, and for how many is it not?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Count the number of entries for each 'protocol_type'</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Which is the most frequent 'service'?</b>
</div>

- [hint](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
- [hint](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)
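
The two hinted functions can be combined as in the sketch below, shown on an invented `service` column. `value_counts` is an alternative that is not mentioned in the hints but gives the same counts.

```python
import pandas as pd

# Hypothetical column of service names
toy = pd.DataFrame({"service": ["http", "smtp", "http", "ftp", "http", "smtp"]})

# groupby + size counts the rows per value; sort_values orders the counts
by_group = toy.groupby("service").size().sort_values(ascending=False)

# value_counts does the counting and the descending sort in one call
by_counts = toy["service"].value_counts()

most_frequent = by_group.index[0]
print(by_group)
print(most_frequent)
```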

## Mapping each attack type to one category

<div class="alert alert-block alert-danger">
<b>Q: How many different 'attack_types' are in the dataframe and how common are they?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: What does the following cell do?</b>
</div>

- [hint1](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)
- hint2: display the dataframe after performing this operation and look at it

In [None]:
df['attack_type'] = df.apply(lambda r: r['attack_type'][:-1], axis=1)

<div class="alert alert-block alert-success">
ANS
</div>

The file *training_attack_types.txt* maps each of the attacks in the original dataset to one category.
The file can be found [here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html), or in the GitHub repo.

In [None]:
from collections import defaultdict

You can think of `defaultdict` as a dictionary that creates a default value (here, an empty list) the first time a missing key is accessed.
If you are interested in the details, you can find the documentation [here](https://docs.python.org/2/library/collections.html#collections.defaultdict).
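
To see how `defaultdict` differs from a plain `dict`, consider this small sketch. The keys and values are only illustrative.

```python
from collections import defaultdict

groups = defaultdict(list)
groups["dos"].append("smurf")    # "dos" is missing, so list() is called first
groups["dos"].append("neptune")
groups["probe"].append("nmap")
print(dict(groups))

# The same access pattern on a plain dict fails on the missing key
plain = {}
try:
    plain["dos"].append("smurf")
except KeyError as err:
    print("plain dict raised KeyError:", err)
```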

In [None]:
category = defaultdict(list)
category['benign'].append('normal')

In [None]:
TRAINING_ATTACK_TYPES_FILENAME = 'training_attack_types.txt'

In [None]:
with open(DATA_DIR+TRAINING_ATTACK_TYPES_FILENAME, 'r') as f:
    for line in f.readlines():
        attack, cat = line.strip().split(' ')
        category[cat].append(attack)

attack_mapping = {v: k for k in category for v in category[k]}

<div class="alert alert-block alert-danger">
<b>Q: What is 'attack_mapping'? (type of variable, meaning of its content, etc.)</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: How many categories of attacks are there? What are their names?</b>
</div>

### Performing the actual mapping

In [None]:
df['attack_category'] = df.apply(lambda r: attack_mapping[r['attack_type']], axis=1)
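
As a side note, the row-wise `apply` above can be slow on millions of rows. The sketch below, on made-up data with a hypothetical `toy_mapping`, compares it with `Series.map`, which looks up each value of a single column in the dictionary.

```python
import pandas as pd

# Hypothetical mapping and data, much smaller than the real ones
toy_mapping = {"normal": "benign", "smurf": "dos", "nmap": "probe"}
toy = pd.DataFrame({"attack_type": ["normal", "smurf", "nmap", "smurf"]})

# Row-wise lookup, as in the cell above
row_wise = toy.apply(lambda r: toy_mapping[r["attack_type"]], axis=1)

# Column-wise lookup with Series.map
vectorised = toy["attack_type"].map(toy_mapping)

print(vectorised.tolist())
print(row_wise.tolist() == vectorised.tolist())
```

One difference to be aware of: a value missing from the dictionary raises a `KeyError` in the `apply` version, while `map` silently produces `NaN`.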

<div class="alert alert-block alert-danger">
<b>Q: Count the number of occurrences of each category</b>
</div>

## Data preparation: dummy variables

We have some categorical variables. Thus, we have to convert them to one-hot encoded variables.

<div class="alert alert-block alert-danger">
<b>Q: Create a new DataFrame encoding the categorical attributes with one hot encoding.</b>
</div>

In [None]:
# Convert categorical feature into dummy variables with one-hot encoding
# Be careful, there are several categorical attributes
df_one_hot = 
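
For reference, this is what one-hot encoding looks like on a toy dataframe. The sketch assumes `pd.get_dummies` as the encoding function, which is one common choice; the data is invented.

```python
import pandas as pd

# Hypothetical dataframe with one numeric and two categorical columns
toy = pd.DataFrame({
    "duration": [0, 2, 5],
    "protocol_type": ["tcp", "udp", "tcp"],
    "flag": ["SF", "REJ", "SF"],
})

# Each listed column is replaced by one indicator column per distinct value
toy_one_hot = pd.get_dummies(toy, columns=["protocol_type", "flag"])
print(toy_one_hot.columns.tolist())
print(toy_one_hot)
```

The numeric column is left untouched, and the number of new columns equals the number of distinct values in the encoded columns.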

## Data preparation: Train-test split

In [None]:
from sklearn.model_selection import train_test_split

# Split dataset up into train and test sets
X_train, X_test, y_train, y_test = 
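
A minimal sketch of `train_test_split` on toy arrays follows. The 50/50 split, `stratify` and `random_state` are choices made for this example only; pick values that suit the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features, 8 of class 0 and 2 of class 1
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

# stratify keeps the class proportions the same in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, stratify=y_toy, random_state=0
)
print(X_tr.shape, X_te.shape)
print(y_tr.sum(), y_te.sum())   # number of class-1 samples in each half
```

Stratifying matters most when some classes are rare, since a purely random split could leave a rare class out of the test set.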

## Data preparation: scaling

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# This cell might take a while to run.
# If it crashes, you may not have enough memory available: try closing some other
#     windows and rerunning the notebook.
standard_scaler = StandardScaler().fit(X_train[numeric_cols])

X_train[numeric_cols] = standard_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = standard_scaler.transform(X_test[numeric_cols])
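
The cell above fits the scaler on the training set only and then reuses it on the test set. The effect can be checked on a toy example with invented numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature train and test sets
train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[10.0], [40.0]])

scaler = StandardScaler().fit(train)    # learns mean and std from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)    # reuses the train statistics

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled training data has zero mean and unit standard deviation, while the test data generally does not, because no information from the test set was used to compute the statistics.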

## Data preparation: converting label to integers

<div class="alert alert-block alert-danger">
<b>Q: What does the following cell do?</b>
</div>

In [None]:
y_train_bin = y_train.apply(lambda x: 0 if x == 'benign' else 1)
y_test_bin = y_test.apply(lambda x: 0 if x == 'benign' else 1)

<div class="alert alert-block alert-success">
ANS
</div>

## Training the models: 2 classes

As a first step, find the best model for detecting whether an entry is malicious or not (i.e. use the binary label).

Try to train Decision Trees, Random Forests, kNN models, SVMs and Naive Bayes to find the best performing model.

Feel free to modify the parameters of each model in order to find the best configuration; here is the documentation for each model:
- [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html)
- [kNN](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
- [Decision Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
- [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

Looking for the best configuration in this way might seem like looking for a needle in a haystack, and you might think that there must be some smarter way to do this.
Indeed, there are, but we'll see them in a later session.

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

#### You can use the usual metrics, but be careful: accuracy alone is not enough. When evaluating a model, you have to consider every class.
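
The following sketch shows why accuracy can be misleading on imbalanced data. The labels are invented (1 = attack, 0 = benign), and the "detector" simply flags everything as an attack.

```python
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

# Hypothetical imbalanced test set: 95 attacks and 5 benign connections
y_true = [1] * 95 + [0] * 5
# A useless detector that predicts "attack" for every connection
y_pred = [1] * 100

acc = accuracy_score(y_true, y_pred)
attack_recall = recall_score(y_true, y_pred, pos_label=1)
benign_recall = recall_score(y_true, y_pred, pos_label=0)
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted

print(acc, attack_recall, benign_recall)
print(cm)
```

The accuracy looks high, yet not a single benign connection is recognised; the per-class recall and the confusion matrix make this visible.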

In [None]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix

In [None]:
import time

<div class="alert alert-block alert-danger">
<b>Q: Naive Bayes: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

- [hint](https://docs.python.org/2/library/time.html#time.time) for measuring elapsed time
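
The overall define/train/time/evaluate pattern can be sketched as follows. This uses synthetic data from `make_classification` rather than the KDD dataframe, and all variable names are placeholders.

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Synthetic binary classification problem (not the KDD data)
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

clf = GaussianNB()

# Measure only the training step
start = time.time()
clf.fit(X_tr, y_tr)
fit_seconds = time.time() - start

y_hat = clf.predict(X_te)
acc = accuracy_score(y_te, y_hat)
prec = precision_score(y_te, y_hat)
rec = recall_score(y_te, y_hat)

print(f"training took {fit_seconds:.4f} s")
print(acc, prec, rec)
```

The same skeleton works for the other classifiers: only the line that creates `clf` changes.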

In [None]:
# define the classifier, train it and perform the prediction
clf_nb = 

y_pred_nb = 

In [None]:
# Compare test set predictions with ground truth labels
accuracy_nb = 
precision_nb = 
recall_nb = 

In [None]:
print(accuracy_nb)
print(precision_nb)
print(recall_nb)

In [None]:
# Show the confusion matrix


<div class="alert alert-block alert-danger">
<b>Q: Decision Tree: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

In [None]:
# define the classifier, train it and perform the prediction
clf_dt = 

y_pred_dt = 

In [None]:
# Compare test set predictions with ground truth labels
accuracy_dt = 
precision_dt = 
recall_dt = 

In [None]:
print(accuracy_dt)
print(precision_dt)
print(recall_dt)

In [None]:
# Show the confusion matrix


<div class="alert alert-block alert-danger">
<b>Q: Random Forest: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

In [None]:
# define the classifier, train it and perform the prediction
clf_rf = 

y_pred_rf = 

In [None]:
# Compare test set predictions with ground truth labels
accuracy_rf = 
precision_rf = 
recall_rf = 

In [None]:
print(accuracy_rf)
print(precision_rf)
print(recall_rf)

In [None]:
# Show the confusion matrix


<div class="alert alert-block alert-danger">
<b>Q: SVM: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

In [None]:
# define the classifier, train it and perform the prediction
clf_svc = 

y_pred_svc = 

In [None]:
# Compare test set predictions with ground truth labels
accuracy_svc = 
precision_svc = 
recall_svc = 

In [None]:
print(accuracy_svc)
print(precision_svc)
print(recall_svc)

In [None]:
# Show the confusion matrix


<div class="alert alert-block alert-danger">
<b>Q: Which model do you think works best? Why do you say so?</b>
</div>

<div class="alert alert-block alert-success">
ANS
</div>

## Training the models: 5 classes

Now try to focus on the specific attack category.

#### Be careful with the evaluation metrics. For each model, focus on the accuracy for every class.
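
One way to read "accuracy for every class" is the fraction of samples of each class that are predicted correctly, i.e. the per-class recall. A sketch on invented labels with three classes:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical ground truth and predictions for three classes
labels = ["benign", "dos", "probe"]
y_true = ["benign"] * 4 + ["dos"] * 4 + ["probe"] * 2
y_pred = ["benign"] * 4 + ["dos"] * 3 + ["benign"] + ["dos", "probe"]

overall = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=labels)   # rows: true, columns: predicted
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)

print(overall)
print(cm)
print(dict(zip(labels, per_class_recall)))
```

Here the overall accuracy hides the fact that the rarest class is recognised only half of the time.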

<div class="alert alert-block alert-danger">
<b>Q: Naive Bayes: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

In [None]:
# define the classifier, train it and perform the prediction
clf_nb = 

y_pred_nb = 

In [None]:
# Compare test set predictions with ground truth labels (i.e. measure the accuracy)


<div class="alert alert-block alert-danger">
<b>Q: Decision Tree: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

In [None]:
# define the classifier, train it and perform the prediction
clf_dt = 

y_pred_dt = 

In [None]:
# Compare test set predictions with ground truth labels (i.e. measure the accuracy)


<div class="alert alert-block alert-danger">
<b>Q: Random Forest: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

In [None]:
# define the classifier, train it and perform the prediction
clf_rf = 

y_pred_rf = 

In [None]:
# Compare test set predictions with ground truth labels (i.e. measure the accuracy)


<div class="alert alert-block alert-danger">
<b>Q: SVM: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

In [None]:
# define the classifier, train it and perform the prediction
clf_svc = 

y_pred_svc = 

In [None]:
# Compare test set predictions with ground truth labels (i.e. measure the accuracy)


<div class="alert alert-block alert-danger">
<b>Q: Which model do you think works best? Why do you say so?</b>
</div>

<div class="alert alert-block alert-success">
ANS
</div>

## Analyse feature importance

<div class="alert alert-block alert-danger">
<b>Q: How many features does our model get as input?</b>
</div>

We cannot assume that every feature is as important as the others.

Some features might be very useful, while others might even worsen the prediction!

<div class="alert alert-block alert-danger">
<b>Q: What is the importance of each feature according to the RF trained above? Which are the most important features? Which are the least important? And how big is the difference between their importances?</b>
</div>

- [hint1](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.feature_importances_)
- [hint2](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argmax.html)
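
A sketch of how the two hints fit together, on a synthetic problem: the feature names `f0`...`f5` are placeholders, and `np.argsort` is used here instead of `np.argmax` in order to rank all features rather than find only the top one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 6 features, only 2 of which carry information
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
toy_feature_names = np.array([f"f{i}" for i in range(6)])

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

importances = rf.feature_importances_      # one score per feature, summing to 1
order = np.argsort(importances)[::-1]      # indices from most to least important

for name, score in zip(toy_feature_names[order], importances[order]):
    print(f"{name}: {score:.3f}")
```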

<div class="alert alert-block alert-danger">
<b>Q: Focus on the least important features: look at their distribution, their max values, etc. Is there anything strange with them?</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Try to remove the least important features (you can try removing different numbers of features) and see how the performance changes. Try also removing the most important features. Observe how the features' importance changes in each situation. Lastly, do not limit this analysis to the Random Forest, but try to do the same with the other models as well.</b>
</div>

---