# 4&5 - Network Intrusion Detection

The dataset used in this notebook originally comes from a KDD competition held several years ago.

Here you can find the original task description given to the competition participants: [task description](http://kdd.ics.uci.edu/databases/kddcup99/task.html)

The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between *bad* connections, called intrusions or attacks, and *good* normal connections. 
The database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

Download instruction:
- download the file kddcup.data.gz from [here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html)
- move it in the 'datasets' folder (or in some other folders, as long as you know the path)
- extract the archive

As usual, go through the notebook and answer the questions (N.B. not all of them require some coding)

In [1]:
import pandas as pd
import numpy as np

## Load the dataset

In [2]:
# you might have to change the value of these variables depending on the path you chose
DATA_DIR = 'datasets/'
FILENAME = 'kddcup.data.corrected'

In [3]:
# feature names obtained from: http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names
header_names = [
    'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 
    'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 
    'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 
    'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 
    'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 
    'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 
    'dst_host_srv_rerror_rate', 'attack_type'
]

In [4]:
df = pd.read_csv(DATA_DIR+FILENAME, header=None, names=header_names, sep=',')

<div class="alert alert-block alert-danger">
    <b>Q: What is the effect of setting <i>header=None</i>?</b>
</div>

[hint](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

<div class="alert alert-block alert-success">
If you do not insert header=None, the read_csv function will try to infer the name of the columns from the first line of the csv file. In practice, you will a DF with wrong names and you will lose one data entry.
</div>

<div class="alert alert-block alert-danger">
    <b>Q: What is the effect of setting <i>names=header_names</i>?</b>
</div>

<div class="alert alert-block alert-success">
Since read_csv does not infer the column names from the csv file (because we have set header=None), with names=header_names we are telling the function which are the names of the columns of the dataframe.
</div>

<div class="alert alert-block alert-info">
<b>
IMPORTANT:
    
The cell below reduces the size of the dataframe by sampling some of its elements. This is only done to work with a smaller amount of data. You can try to run the notebook without running this cell; if it crashes due to memory errors, come back here and rerun the notebook with less data.
    
If you still have troubles, there is a smaller version available on the same website.
The file name is *kddcup.data_10_percent.gz*.
</b>
</div>

In [5]:
df = df.sample(frac=0.4)

## Initial analysis of the data

<div class="alert alert-block alert-danger">
<b>Q: Display the first 5 rows of the dataframe</b>
</div>

<div class="alert alert-block alert-success">
Several possibilities:
</div>

In [6]:
df[:5]

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,attack_type
4280694,0,icmp,ecr_i,SF,520,0,0,0,0,0,...,255,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,smurf.
752742,0,tcp,http,REJ,0,0,0,0,0,0,...,255,1.0,0.0,0.33,0.21,0.0,0.0,1.0,1.0,normal.
3051348,0,icmp,ecr_i,SF,1032,0,0,0,0,0,...,255,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,smurf.
3531508,0,tcp,private,S0,0,0,0,0,0,0,...,4,0.02,0.08,0.0,0.0,1.0,1.0,0.0,0.0,neptune.
1739299,0,icmp,ecr_i,SF,1032,0,0,0,0,0,...,255,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,smurf.


In [7]:
df.head(5)

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,attack_type
4280694,0,icmp,ecr_i,SF,520,0,0,0,0,0,...,255,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,smurf.
752742,0,tcp,http,REJ,0,0,0,0,0,0,...,255,1.0,0.0,0.33,0.21,0.0,0.0,1.0,1.0,normal.
3051348,0,icmp,ecr_i,SF,1032,0,0,0,0,0,...,255,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,smurf.
3531508,0,tcp,private,S0,0,0,0,0,0,0,...,4,0.02,0.08,0.0,0.0,1.0,1.0,0.0,0.0,neptune.
1739299,0,icmp,ecr_i,SF,1032,0,0,0,0,0,...,255,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,smurf.


<div class="alert alert-block alert-danger">
<b>Q: How many entries are in the dataframe?</b>
</div>

<div class="alert alert-block alert-success">
Several possibilities.
</div>

In [8]:
len(df.index)

1959372

In [9]:
df.shape[0]

1959372

<div class="alert alert-block alert-danger">
<b>Q: How many columns does the original dataframe have?</b>
</div>

In [10]:
df.shape[1]

42

In [11]:
len(df.columns)

42

<div class="alert alert-block alert-danger">
<b>Q: How many FEATURES?</b>
</div>

<div class="alert alert-block alert-success">
41, since the features (aka attributes) are all the columns except attack_type. If you want to get that number with code you can do:
</div>

In [12]:
len(df.drop('attack_type', axis=1).columns)

41

<div class="alert alert-block alert-danger">
<b>Q: Are there any categorical variables?</b>
</div>

[hint](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html)

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1959372 entries, 4280694 to 4295474
Data columns (total 42 columns):
duration                       int64
protocol_type                  object
service                        object
flag                           object
src_bytes                      int64
dst_bytes                      int64
land                           int64
wrong_fragment                 int64
urgent                         int64
hot                            int64
num_failed_logins              int64
logged_in                      int64
num_compromised                int64
root_shell                     int64
su_attempted                   int64
num_root                       int64
num_file_creations             int64
num_shells                     int64
num_access_files               int64
num_outbound_cmds              int64
is_host_login                  int64
is_guest_login                 int64
count                          int64
srv_count                  

<div class="alert alert-block alert-success">
Yes, 3 of them. All the columns that are *not* int64 or float64
</div>

<div class="alert alert-block alert-info">
If you want to get with code the name of the categorical attributes (in this case it is not needed, but if you have thousands of features you might need it), you cannot do that from df.info() directly.
However you can do as follows:
</div>

In [14]:
set(df.columns) - set(df.describe().columns)

{'attack_type', 'flag', 'protocol_type', 'service'}

<div class="alert alert-block alert-info">
df.describe() create a dataframe containing statistics about df. Since the statistics are only about numerical attributes, the other attributes are dropped from the dataframe. Thus, if you keep the columns that were in df but not in df.describe(), you keep all the non-numerical attributes of df.
</div>

## pre-processing the dataset and continuing the analysis

In [15]:
col_names = np.array(header_names)

nominal_idx = [1, 2, 3]
binary_idx = [6, 11, 13, 14, 20, 21]
numeric_idx = list(set(range(41)).difference(nominal_idx).difference(binary_idx))

nominal_cols = col_names[nominal_idx].tolist()
binary_cols = col_names[binary_idx].tolist()
numeric_cols = col_names[numeric_idx].tolist()

<div class="alert alert-block alert-danger">
    <b>Q: What is the difference between <i>col_names</i> and <i>header_names</i>?</b>
</div>

[hint](https://docs.python.org/3/library/functions.html#type)

In [16]:
header_names

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate',
 'attack_type']

In [17]:
col_names

array(['duration', 'protocol_type', 'service', 'flag', 'src_bytes',
       'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot',
       'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell',
       'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
       'num_access_files', 'num_outbound_cmds', 'is_host_login',
       'is_guest_login', 'count', 'srv_count', 'serror_rate',
       'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate',
       'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate',
       'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate',
       'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
       'dst_host_srv_diff_host_rate', 'dst_host_serror_rate',
       'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
       'dst_host_srv_rerror_rate', 'attack_type'], dtype='<U27')

In [18]:
type(col_names)

numpy.ndarray

In [19]:
type(header_names)

list

<div class="alert alert-block alert-success">
Different types
</div>

<div class="alert alert-block alert-danger">
<b>Q: How many distinct values exist for the categorical variables?</b>
</div>

In [20]:
print(nominal_cols[0], ":", len(df[nominal_cols[0]].unique()))
print(nominal_cols[1], ":", len(df[nominal_cols[1]].unique()))
print(nominal_cols[2], ":", len(df[nominal_cols[2]].unique()))

protocol_type : 3
service : 69
flag : 11


<div class="alert alert-block alert-danger">
<b>Q: Do the same as above, but try to use only *two* lines of code.</b>
</div>

hint: remember the `for` loop

In [21]:
for col in nominal_cols:
    print(col, ":", len(df[col].unique()))

protocol_type : 3
service : 69
flag : 11


Another option (but the previous one is better)

In [22]:
for idx in range(len(nominal_cols)):
    print(nominal_cols[idx], ":", len(df[nominal_cols[idx]].unique()))

protocol_type : 3
service : 69
flag : 11


<div class="alert alert-block alert-danger">
<b>Q: What are the possible values of the categorical variables?</b>
</div>

Try to use only **two** lines of code to print all the possible values of the categorical variables.

In [23]:
for col in nominal_cols:
    print(col, ":", df[col].unique())

protocol_type : ['icmp' 'tcp' 'udp']
service : ['ecr_i' 'http' 'private' 'ftp' 'uucp' 'uucp_path' 'domain_u' 'ftp_data'
 'systat' 'smtp' 'other' 'urh_i' 'shell' 'rje' 'finger' 'auth' 'eco_i'
 'netbios_dgm' 'discard' 'vmnet' 'csnet_ns' 'pop_3' 'urp_i' 'whois'
 'netbios_ns' 'klogin' 'telnet' 'nnsp' 'ntp_u' 'supdup' 'ctf' 'ssh'
 'netbios_ssn' 'remote_job' 'daytime' 'exec' 'hostnames' 'time' 'efs'
 'nntp' 'kshell' 'login' 'sunrpc' 'netstat' 'name' 'pop_2' 'domain' 'link'
 'http_443' 'mtp' 'courier' 'bgp' 'iso_tsap' 'sql_net' 'IRC' 'printer'
 'imap4' 'echo' 'ldap' 'gopher' 'Z39_50' 'X11' 'tim_i' 'http_8001'
 'http_2784' 'tftp_u' 'red_i' 'aol' 'pm_dump']
flag : ['SF' 'REJ' 'S0' 'RSTO' 'RSTR' 'S1' 'SH' 'RSTOS0' 'OTH' 'S2' 'S3']


<div class="alert alert-block alert-danger">
<b>Q: Which is the maximum duration, minimun duration and average duration of the entries in the dataframe?</b>
</div>

In [24]:
df['duration'].max()

58329

In [25]:
df['duration'].min()

0

In [26]:
df['duration'].mean()

48.815367883178894

<div class="alert alert-block alert-danger">
<b>Q: How many entries are 'root_shell' and how many aren't?</b>
</div>

In [27]:
df.groupby('root_shell').size().reset_index()

Unnamed: 0,root_shell,0
0,0,1959246
1,1,126


<div class="alert alert-block alert-success">
You can do the same with value_counts as well, but it does *not* return a DataFrame object, differently from the solution with the groupby.
</div>

In [28]:
df['root_shell'].value_counts()

0    1959246
1        126
Name: root_shell, dtype: int64

<div class="alert alert-block alert-danger">
<b>Q: Count the number of entries for each 'protocol_type'</b>
</div>

In [29]:
df.groupby('protocol_type').size().reset_index()

Unnamed: 0,protocol_type,0
0,icmp,1133905
1,tcp,747766
2,udp,77701


<div class="alert alert-block alert-danger">
<b>Q: Which is the most frequent 'service'?</b>
</div>

- [hint](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
- [hint](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)

In [30]:
df.groupby('service').size().reset_index().sort_values(0, ascending=False)['service'].values[0]

'ecr_i'

Analyzing a bit the cell above:
- `df.groupby('service').size().reset_index()` creates a dataframe that contains the number of occurrences of every service
- `.sort_values(0, ascending=False)` sorts such dataframe accordingly to the number of occurrences (descending order)
- `['service']` gets the Series representing the service column
- `.values` returns the sequence of elements of the Series
- `[0]` returns the first element

## Mapping each attack type to one category

<div class="alert alert-block alert-danger">
<b>Q: How many different 'attack_types' are in the dataframe and how common are they?</b>
</div>

In [31]:
len(df['attack_type'].unique())

21

In [32]:
df.groupby('attack_type').size().reset_index().sort_values(0, ascending=False)

Unnamed: 0,attack_type,0
17,smurf.,1123617
9,neptune.,428252
11,normal.,389104
16,satan.,6390
5,ipsweep.,5001
14,portsweep.,4200
10,nmap.,911
0,back.,867
19,warezclient.,444
18,teardrop.,407


<div class="alert alert-block alert-danger">
<b>Q: What does the following cell do?</b>
</div>

- [hint1](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)
- hint2: display the dataframe after performing this operation and look at it

In [33]:
display(df['attack_type'][:2])

4280694     smurf.
752742     normal.
Name: attack_type, dtype: object

In [34]:
df['attack_type'] = df.apply(lambda r: r['attack_type'][:-1], axis=1)

In [35]:
display(df['attack_type'][:2])

4280694     smurf
752742     normal
Name: attack_type, dtype: object

<div class="alert alert-block alert-success">
Removes the last character from all the entries in the attack_type column.
    
Specifically, it overwrites the values in the 'attack_type' column, and removes the latest character from all the elements of that column.
</div>

Analyzing a bit the cell above:
- `df['attack_type'] =`: specifies that the results of the operation on the right end side of the "=" will be written in the 'attacl_type' column
- `df.apply(lambda r: ..., axis=1)`: tells us that we are going to perform one operation one line at a time. The operation to perform is the one defined in `...` (in this case `r['attack_type'][:-1]` 
- `r['attack_type'][:-1]`: the operation consits in getting the current value in the 'attack_type' column and remove the last character.

The file *training_attack_types.txt* maps each of the attacks in the original dataset to 1 category.
The file can be found [here](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html), or in the github repo.

In [36]:
from collections import defaultdict

You can think of `defaultdict` as a dictionary.
If you are interested in the details, you can find the documentation [here](https://docs.python.org/2/library/collections.html#collections.defaultdict).

In [37]:
category = defaultdict(list)
category['benign'].append('normal')

In [38]:
TRAINING_ATTACK_TYPES_FILENAME = 'training_attack_types.txt'

In [39]:
with open(DATA_DIR+TRAINING_ATTACK_TYPES_FILENAME, 'r') as f:
    for line in f.readlines():
        attack, cat = line.strip().split(' ')
        category[cat].append(attack)

attack_mapping = {v: k for k in category for v in category[k]}

<div class="alert alert-block alert-danger">
<b>Q: What is 'attack_mapping'?</b>
</div>

In [40]:
type(attack_mapping)

dict

In [41]:
attack_mapping

{'normal': 'benign',
 'back': 'dos',
 'land': 'dos',
 'neptune': 'dos',
 'pod': 'dos',
 'smurf': 'dos',
 'teardrop': 'dos',
 'buffer_overflow': 'u2r',
 'loadmodule': 'u2r',
 'perl': 'u2r',
 'rootkit': 'u2r',
 'ftp_write': 'r2l',
 'guess_passwd': 'r2l',
 'imap': 'r2l',
 'multihop': 'r2l',
 'phf': 'r2l',
 'spy': 'r2l',
 'warezclient': 'r2l',
 'warezmaster': 'r2l',
 'ipsweep': 'probe',
 'nmap': 'probe',
 'portsweep': 'probe',
 'satan': 'probe'}

<div class="alert alert-block alert-success">
It is a dictionary, it maps each attack_type to one category
</div>

<div class="alert alert-block alert-danger">
<b>Q: How many categories of attacks are there? What are their names?</b>
</div>

In [42]:
len(set(attack_mapping.values()))

5

In [43]:
set(attack_mapping.values())

{'benign', 'dos', 'probe', 'r2l', 'u2r'}

### Perform the actual mapping

In [44]:
df['attack_category'] = df.apply(lambda r: attack_mapping[r['attack_type']], axis=1)

<div class="alert alert-block alert-info">
This is similar to what was done above. The difference is that here a new column is created (named "attack_category"), which contains, for each row, the category of the attack type (<code>r['attack_type']</code> is used as a key to access the dictionary <code>attack_mapping</code>) 
</div>

<div class="alert alert-block alert-danger">
<b>Q: Count the number of occurrences of each category</b>
</div>

In [45]:
df.groupby('attack_category').size().reset_index().sort_values(0, ascending=False)

Unnamed: 0,attack_category,0
1,dos,1553261
0,benign,389104
2,probe,16502
3,r2l,483
4,u2r,22


## Data preparation: dummy variables

We have some categorical variables. Thus, we have to converte them to one-hot encoded variables.

<div class="alert alert-block alert-danger">
<b>Q: Create a new DataFrame encoding the categorical attributes with one hot encoding.</b>
</div>

In [46]:
# Convert categorical feature into dummy variables with one-hot encoding
df_one_hot = pd.get_dummies(df, columns=nominal_cols)

## Data preparation: Train-test split

In [47]:
from sklearn.model_selection import train_test_split

# Split dataset up into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df_one_hot.drop(['attack_category', 'attack_type'], axis=1), 
    df_one_hot['attack_category'], 
    test_size=0.3
)

## Data preparation: scaling

In [48]:
from sklearn.preprocessing import StandardScaler

In [49]:
# This cell might take a while to run
# also, if it crashes it might mean that you do not have enough memory available, try rerunning the notebook 
#     closing some other windows
standard_scaler = StandardScaler().fit(X_train[numeric_cols])

X_train[numeric_cols] = standard_scaler.transform(X_train[numeric_cols])
X_test[numeric_cols] = standard_scaler.transform(X_test[numeric_cols])

  return self.partial_fit(X, y)
  
  import sys


## Data preparation: converting label to integers

<div class="alert alert-block alert-danger">
<b>Q: What does the following cell do?</b>
</div>

In [50]:
y_train_bin = y_train.apply(lambda x: 0 if x is 'benign' else 1)
y_test_bin = y_test.apply(lambda x: 0 if x is 'benign' else 1)

<div class="alert alert-block alert-success">
Converts the values in y_train and y_test in order to have only two classes (i.e. banign, malicious) instead of 5.
</div>

REMEMBER: `.apply` applies the `lambda` function within the `()` to each element of the sequence (in this case a Series).

## Training the models: 2 classes

As a first step, find the best model in detecting whether an entry is malicious or not (i.e. use the binary label).

Try to train Decision Trees, Random Forests, kNN models, SVMs and Naive Bayes to find the best performing model.

Feel free to modify the parameters of each model in order to find the best configuration; you can find the link to the documentation of each model in the cheatsheet.

Looking for the best configuration in this way might seem as looking for a needle in a haystack and you might think that there must be some smarter ways to do this.
Indeed, there are, but we'll see them in a later session.

In [51]:
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

#### You can use the usual metrics, but be careful: you have to consider all the classes, when evaluating the model, the accuracy is not enough!

In [52]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix

In [53]:
import time

<div class="alert alert-block alert-danger">
<b>Q: Logistic Regression: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

- [hint](https://docs.python.org/2/library/time.html#time.time) for measuring elapsed time

In [54]:
# toy Example: how to measure time
a = 5
my_var = [1,2,3,4,5,6,7,8,9,0]

t0 = time.time()
for value in my_var:
    a *= value
print("Elapsed time:", time.time() - t0, "s")

Elapsed time: 6.103515625e-05 s


In [55]:
# define the classifier
clf_lr = LogisticRegression(solver='lbfgs', max_iter=400)

# train the classifier
t0 = time.time()
clf_lr.fit(X_train, y_train_bin)
print("elapsed time = %.2f" % (time.time()-t0))

# perform the prediction
y_pred_lr = clf_lr.predict(X_test)



elapsed time = 274.05


In [56]:
print("ACCURACY: ", accuracy_score(y_pred_lr, y_test_bin))
print("PRECISION:", precision_score(y_pred_lr, y_test_bin))
print("RECALL:   ", recall_score(y_pred_lr, y_test_bin))
print(confusion_matrix(y_test_bin, y_pred_lr))

ACCURACY:  0.9987478989881119
PRECISION: 0.9987857843366179
RECALL:    0.999651566463697
[[116562    164]
 [   572 470514]]


<div class="alert alert-block alert-danger">
<b>Q: Do the same thing as above but with L1 regularization.</b>
</div>

In [57]:
# define the classifier
clf_lr = LogisticRegression(penalty='l1', max_iter=300)

# train the classifier
t0 = time.time()
clf_lr.fit(X_train, y_train_bin)
print("elapsed time = %.2f" % (time.time()-t0))

# perform the prediction
y_pred_lr = clf_lr.predict(X_test)



elapsed time = 1457.79


In [58]:
print("ACCURACY: ", accuracy_score(y_pred_lr, y_test_bin))
print("PRECISION:", precision_score(y_pred_lr, y_test_bin))
print("RECALL:   ", recall_score(y_pred_lr, y_test_bin))
print(confusion_matrix(y_test_bin, y_pred_lr))

ACCURACY:  0.9985624655502099
PRECISION: 0.9985989819268668
RECALL:    0.9996068940165019
[[116541    185]
 [   660 470426]]


<div class="alert alert-block alert-danger">
<b>Q: Naive Bayes: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

In [59]:
# define the classifier
clf_nb = GaussianNB()

# train the classifier
t0 = time.time()
clf_nb.fit(X_train, y_train_bin)
print("elapsed time = %.2f" % (time.time()-t0))

# perform the prediction
y_pred_nb = clf_nb.predict(X_test)

elapsed time = 7.04


In [60]:
print("ACCURACY: ", accuracy_score(y_pred_nb, y_test_bin))
print("PRECISION:", precision_score(y_pred_nb, y_test_bin))
print("RECALL:   ", recall_score(y_pred_nb, y_test_bin))
print(confusion_matrix(y_test_bin, y_pred_nb))

ACCURACY:  0.9552424924976013
PRECISION: 0.9442713220091449
RECALL:    0.999874125905563
[[116670     56]
 [ 26253 444833]]


<div class="alert alert-block alert-danger">
<b>Q: Decision Tree: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

In [61]:
# define the classifier
clf_dt = DecisionTreeClassifier()

# train the classifier
t0 = time.time()
clf_dt.fit(X_train, y_train_bin)
print("elapsed time = %.2f" % (time.time()-t0))

# perform the prediction
y_pred_dt = clf_dt.predict(X_test)

elapsed time = 34.57


In [62]:
print("ACCURACY: ", accuracy_score(y_pred_dt, y_test_bin))
print("PRECISION:", precision_score(y_pred_dt, y_test_bin))
print("RECALL:   ", recall_score(y_pred_dt, y_test_bin))
print(confusion_matrix(y_test_bin, y_pred_dt))

ACCURACY:  0.9999149387899533
PRECISION: 0.9999448083789372
RECALL:    0.9999490536719566
[[116702     24]
 [    26 471060]]


<div class="alert alert-block alert-danger">
<b>Q: Random Forest: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

In [63]:
# define the classifier
clf_rf = RandomForestClassifier()

# train the classifier
t0 = time.time()
clf_rf.fit(X_train, y_train_bin)
print("elapsed time = %.2f" % (time.time()-t0))

# perform the prediction
y_pred_rf = clf_rf.predict(X_test)



elapsed time = 25.99


In [64]:
print("ACCURACY: ", accuracy_score(y_pred_rf, y_test_bin))
print("PRECISION:", precision_score(y_pred_rf, y_test_bin))
print("RECALL:   ", recall_score(y_pred_rf, y_test_bin))
print(confusion_matrix(y_test_bin, y_pred_rf))

ACCURACY:  0.9999251461351588
PRECISION: 0.9999172125684058
RECALL:    0.9999893854606285
[[116721      5]
 [    39 471047]]


<div class="alert alert-block alert-danger">
<b>Q: SVM: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

In [65]:
# define the classifier
clf_svc = SVC(kernel='linear')

# train the classifier
t0 = time.time()
clf_svc.fit(X_train, y_train_bin)
print("elapsed time = %.2f" % (time.time()-t0))

# perform the prediction
y_pred_svc = clf_svc.predict(X_test)

elapsed time = 15019.32


In [66]:
print("ACCURACY: ", accuracy_score(y_pred_svc, y_test_bin))
print("PRECISION:", precision_score(y_pred_svc, y_test_bin))
print("RECALL:   ", recall_score(y_pred_svc, y_test_bin))
print(confusion_matrix(y_test_bin, y_pred_svc))

ACCURACY:  0.9990966499493035
PRECISION: 0.999150898137495
RECALL:    0.9997217602592939
[[116595    131]
 [   400 470686]]


<div class="alert alert-block alert-danger">
<b>Q: Which model do you think works best? Why do you say so?</b>
</div>

**ANS**:

For choosing the model, you have to take in consideration different aspects:
- evaluation metrics (i.e. accuracy, precision and recall)
- the importance of each metric also depends on the specific scenario (e.g. I cannot afford to miss any attacks vs. I don't want to raise false positive because I don't want to warn the user)
- training time
- inference time (i.e. how long does it take to run on the test data?)

It is rare the case when you can find "the optimum" model. Usually, you have a model that works better than another considering what you are interested in. 

## Training the models: 5 classes

Now try to focus on the specific attack category.

#### Be careful with the evaluation metrics. 

<div class="alert alert-block alert-danger">
<b>Q: Naive Bayes: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

In [67]:
# define the classifier
clf_nb = GaussianNB()

# train the classifier
t0 = time.time()
clf_nb.fit(X_train, y_train)
print("elapsed time = %.2f" % (time.time()-t0))

# perform the prediction
y_pred_nb = clf_nb.predict(X_test)

elapsed time = 6.52


In [68]:
# Compare test set predictions with ground truth labels
print("ACCURACY:", accuracy_score(y_pred_nb, y_test))

ACCURACY: 0.9704089062489367


<div class="alert alert-block alert-danger">
<b>Q: Decision Tree: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

In [69]:
# define the classifier
clf_dt = DecisionTreeClassifier()

# train the classifier
t0 = time.time()
clf_dt.fit(X_train, y_train)
print("elapsed time = %.2f" % (time.time()-t0))

# perform the prediction
y_pred_dt = clf_dt.predict(X_test)

elapsed time = 27.40


In [70]:
# Compare test set predictions with ground truth labels
print("ACCURACY:", accuracy_score(y_pred_dt, y_test))

ACCURACY: 0.9999115363415514


<div class="alert alert-block alert-danger">
<b>Q: Random Forest: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

In [71]:
# define the classifier
clf_rf = RandomForestClassifier()

# train the classifier
t0 = time.time()
clf_rf.fit(X_train, y_train)
print("elapsed time = %.2f" % (time.time()-t0))

# perform the prediction
y_pred_rf = clf_rf.predict(X_test)



elapsed time = 22.02


In [72]:
# Compare test set predictions with ground truth labels
print("ACCURACY:", accuracy_score(y_pred_rf, y_test))

ACCURACY: 0.9998911216511401


<div class="alert alert-block alert-danger">
<b>Q: SVM: define, train and test the model computing the evaluation metrics. Also, measure how long the training takes.</b>
</div>

In [73]:
# define the classifier
clf_svc = SVC(gamma='auto')

# train the classifier
t0 = time.time()
clf_svc.fit(X_train, y_train)
print("elapsed time = %.2f" % (time.time()-t0))

# perform the prediction
y_pred_svc = clf_svc.predict(X_test)

elapsed time = 2148.87


In [74]:
# Compare test set predictions with ground truth labels
print("ACCURACY:", accuracy_score(y_pred_svc, y_test))

ACCURACY: 0.9996920784196308


<div class="alert alert-block alert-danger">
<b>Q: Run your first neural net</b>
</div>

In [75]:
from sklearn.neural_network import MLPClassifier

In [76]:
# define the classifier
clf_nn = MLPClassifier(hidden_layer_sizes=(200, ), activation='relu')

# train the classifier
t0 = time.time()
clf_nn.fit(X_train, y_train)
print("elapsed time = %.2f" % (time.time()-t0))

# perform the prediction
y_pred_nn = clf_nn.predict(X_test)

elapsed time = 949.75


In [77]:
# Compare test set predictions with ground truth labels
print("ACCURACY:", accuracy_score(y_pred_nn, y_test))

ACCURACY: 0.9998281763557055


<div class="alert alert-block alert-danger">
<b>Q: Which model do you think works best? How can you say so</b>
</div>

**ANS**:

All the things said above about model selection are still valid.
The main difference, considering that this is a multi-class problem, consists in the fact that now you have much more variance: while in the previous case the models could only make 2 types of mistakes (attack instead of non-attack and vice-versa), now they can miss the correct prediction in several ways (e.g. it detected an attack, but categorized it wrongly).

## Analyse feature importance

<div class="alert alert-block alert-danger">
<b>Q: How many features does our model get as input?</b>
</div>

In [78]:
len(X_train.columns)

121

We cannot assume that every feature is as important as the others.

Some features might be very useful, some other features might even worsen the prediction!

<div class="alert alert-block alert-danger">
<b>Q: Which is the importance of the features accordingly to the RF trained above? Which are the most important features? Which are the least important features? And how big is the difference between their importance?</b>
</div>

#### IMPORTANT:
The RandomForest has a `feature_importances_` attribute, that returns the importance of each feature. You can find the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.feature_importances_).
**However, it is important not to trust blindly the values returned by such attribute**, as it only represents "the (normalized) total reduction of the criterion brought by that feature".

In [79]:
clf_rf.feature_importances_

array([2.49783694e-02, 7.86719970e-02, 2.79961035e-02, 3.99985809e-06,
       4.88471965e-04, 8.69840082e-07, 3.66001677e-04, 1.36824757e-05,
       1.27292465e-01, 9.57389423e-04, 5.09547961e-06, 2.43338641e-09,
       1.06535525e-05, 1.13752588e-05, 1.96620735e-06, 2.49730293e-06,
       0.00000000e+00, 0.00000000e+00, 1.62888466e-05, 1.95555789e-01,
       1.10363332e-01, 4.85717378e-03, 2.10793354e-03, 1.82960546e-03,
       1.85301331e-03, 7.65582871e-02, 9.16699276e-03, 1.37252872e-03,
       2.95075575e-02, 3.36137293e-03, 7.04669582e-03, 9.53503586e-02,
       2.26836179e-03, 5.05127778e-03, 8.35487666e-03, 3.37864036e-03,
       5.03475947e-04, 9.38956746e-03, 3.31387796e-02, 2.95219022e-04,
       5.17461577e-03, 9.14061070e-07, 1.31125491e-06, 0.00000000e+00,
       0.00000000e+00, 3.22106185e-06, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 1.97082097e-08, 1.44270229e-06, 0.00000000e+00,
       7.05533397e-06, 1.11885237e-03, 3.51465114e-07, 4.51760859e-03,
      

Given an array, it is possible to find the index of the maximum with `argmax`, provided by numpy.

In [80]:
importances = clf_rf.feature_importances_
max_idx = np.argmax(importances)

In [81]:
X_train.columns[max_idx]

'count'

<div class="alert alert-block alert-danger">
<b>Q: Focus on the least important features: look at their distribution, their max values, etc. Is there anything strange with them?</b>
</div>

You can use `np.argmin` to find the index of the least important feature.

In [82]:
importances = clf_rf.feature_importances_
min_idx = np.argmin(importances)

In [83]:
X_train.columns[min_idx]

'num_outbound_cmds'

<div class="alert alert-block alert-danger">
<b>Q: Try to remove the least important features (you can try removing different numbers of features) and see how the performance changes. Try also removing the most important features. Observe how the features' importance changes in each situation. Lastly, do not limit this analysis to the Random Forest, but try to do the same with the other models as well.</b>
</div>

Here I create I dictionary that associates to each feature his importance as obtained from the RandomForest.

In [84]:
feature_importances_dict = dict()
for idx in range(len(X_train.columns)):
    feature_importances_dict[X_train.columns[idx]] = importances[idx]

In [85]:
feature_importances_dict

{'duration': 0.02497836939867357,
 'src_bytes': 0.07867199696743044,
 'dst_bytes': 0.027996103472112698,
 'land': 3.999858093965659e-06,
 'wrong_fragment': 0.0004884719652679889,
 'urgent': 8.698400822888515e-07,
 'hot': 0.0003660016765998213,
 'num_failed_logins': 1.3682475711805469e-05,
 'logged_in': 0.12729246506051473,
 'num_compromised': 0.0009573894234668866,
 'root_shell': 5.095479611377647e-06,
 'su_attempted': 2.4333864087901366e-09,
 'num_root': 1.0653552495647344e-05,
 'num_file_creations': 1.1375258847366232e-05,
 'num_shells': 1.966207350195493e-06,
 'num_access_files': 2.4973029265387215e-06,
 'num_outbound_cmds': 0.0,
 'is_host_login': 0.0,
 'is_guest_login': 1.6288846604956262e-05,
 'count': 0.19555578906870275,
 'srv_count': 0.11036333195098395,
 'serror_rate': 0.004857173775372635,
 'srv_serror_rate': 0.0021079335406117856,
 'rerror_rate': 0.001829605461191011,
 'srv_rerror_rate': 0.001853013314799317,
 'same_srv_rate': 0.07655828705942423,
 'diff_srv_rate': 0.0091669

In [86]:
# this is the "original" one
clf_rf = RandomForestClassifier()
clf_rf.fit(X_train, y_train)
print("elapsed time = %.2f" % (time.time()-t0))
print(accuracy_score(clf_rf.predict(X_test), y_test))



elapsed time = 980.79
0.999921743686757


In [87]:
# Dropping the "most important" 
t0 = time.time()
new_X_train = X_train.drop(X_train.columns[max_idx], axis=1)
new_X_test = X_test.drop(X_train.columns[max_idx], axis=1)
clf_rf.fit(new_X_train, y_train)
print("elapsed time = %.2f" % (time.time()-t0))
print(accuracy_score(clf_rf.predict(new_X_test), y_test))

elapsed time = 18.70
0.9999268473593598


In [88]:
# keeping only the "most important" 
t0 = time.time()
new_X_train = X_train[[X_train.columns[max_idx]]]
new_X_test = X_test[[X_train.columns[max_idx]]]
clf_rf.fit(new_X_train, y_train)
print("elapsed time = %.2f" % (time.time()-t0))
print(accuracy_score(clf_rf.predict(new_X_test), y_test))

elapsed time = 18.99
0.9837941382618932


In [89]:
# dropping the least important
t0 = time.time()
new_X_train = X_train.drop(X_train.columns[min_idx], axis=1)
new_X_test = X_test.drop(X_train.columns[min_idx], axis=1)
clf_rf.fit(new_X_train, y_train)
print("elapsed time = %.2f" % (time.time()-t0))
print(accuracy_score(clf_rf.predict(new_X_test), y_test))

elapsed time = 21.27
0.9999013289963458


In [90]:
# keeping only the "least important" 
t0 = time.time()
new_X_train = X_train[[X_train.columns[min_idx]]]
new_X_test = X_test[[X_train.columns[min_idx]]]
clf_rf.fit(new_X_train, y_train)
print("elapsed time = %.2f" % (time.time()-t0))
print(accuracy_score(clf_rf.predict(new_X_test), y_test))

elapsed time = 3.76
0.7926173674576225


<div class="alert alert-block alert-info">
As you can see from this resutls, it is not easy to understand the importance of each feature.
You have to tinker with the features (by dropping them, combining them, using only some of them, etc.) in order to really understand their effect.
    
This, indeed, is one of the most time consuming phases when creating a model for addressing a real problem.
</div>

---