# Assignment 1: Classification for network intrusion detection

The goal of this assignment is to build the most efficient *Naive Bayes classifier* for **network intrusion detection**. The data set considered in this case is [KDD Cup 1999 Data](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) from *The Third International Knowledge Discovery and Data Mining Tools Competition*.

*Note that this notebook should not be completed in a single pass but in iterative steps. My advice is to train (and evaluate) a complete classifier as quickly as possible, without spending too much time on the exploratory and preprocessing stages. Then incrementally improve the models (with respect to the evaluation metrics) by developing the earlier stages.*

Complete this cell with the group composition:

## Imports

The following cell contains the only modules allowed for this assignment.

In [None]:
import numpy as np
import pandas
import matplotlib.pyplot as plt

from sklearn import preprocessing
import sklearn.naive_bayes as naive_bayes
import sklearn.model_selection as model_selection
from sklearn.ensemble import VotingClassifier
import sklearn.metrics

from scipy import stats

%matplotlib notebook

## Get the data

Download kddcup.data_10_percent.gz from http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html. See http://kdd.ics.uci.edu/databases/kddcup99/task.html for more details on this dataset.

Complete the following cell so that `file_name` is initialized (do not remove anything from the provided cells).

In [None]:
attributes = np.array(['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes',
              'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in',
              'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations',
              'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login',
              'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate',
              'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate',
              'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
              'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate',
              'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate'])

data_types = np.array(['numerical', 'categorical', 'categorical', 'categorical', 'numerical', 'numerical',
                       'categorical','numerical', 'numerical', 'numerical', 'numerical', 'categorical',
                       'numerical', 'numerical', 'numerical', 'numerical', 'numerical', 'numerical',
                       'numerical', 'numerical', 'categorical', 'categorical', 'numerical', 'numerical',
                       'numerical', 'numerical', 'numerical', 'numerical', 'numerical', 'numerical',
                       'numerical', 'numerical', 'numerical', 'numerical', 'numerical', 'numerical',
                       'numerical', 'numerical', 'numerical', 'numerical', 'numerical'])

file_name = ''

columns = np.concatenate((attributes, ['label']))
df = pandas.read_csv(file_name, names=columns)

Reorder the columns of the data frame so that numerical columns appear first, then the categorical columns, and at the end the label. For this purpose, use [pandas.DataFrame.reindex](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html).

In [None]:
# Write your code here
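As a rough illustration of the reordering step, the following sketch works on a tiny made-up frame. The names `toy_attributes`, `toy_types` and `toy_df` are hypothetical stand-ins for `attributes`, `data_types` and `df`, and the values are invented.

```python
import numpy as np
import pandas

# Hypothetical stand-ins for the notebook's `attributes` and `data_types` arrays.
toy_attributes = np.array(['duration', 'protocol_type', 'src_bytes', 'flag'])
toy_types = np.array(['numerical', 'categorical', 'numerical', 'categorical'])

toy_df = pandas.DataFrame(
    [[0, 'tcp', 181, 'SF', 'normal.'],
     [2, 'udp', 239, 'S0', 'smurf.']],
    columns=np.concatenate((toy_attributes, ['label'])))

# Boolean masks select the attribute names of each type.
numerical = toy_attributes[toy_types == 'numerical']
categorical = toy_attributes[toy_types == 'categorical']

# reindex returns a new frame whose columns follow the requested order.
new_order = np.concatenate((numerical, categorical, ['label']))
toy_df = toy_df.reindex(columns=new_order)

print(list(toy_df.columns))
# ['duration', 'src_bytes', 'protocol_type', 'flag', 'label']
```

Keeping the `numerical` and `categorical` name arrays around can be convenient later, when each classifier needs its own subset of columns.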

In this classifier, we only consider the five classes DoS, R2L, U2R, probing and normal (encoded as `dos`, `r2l`, `u2r`, `probe` and `normal` in the dictionary below). Based on http://kdd.ics.uci.edu/databases/kddcup99/training_attack_types, modify the data frame so that only these five values appear in the label column. For this purpose, you should use methods from [pandas.Series](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html). Recall that each column in a *DataFrame* is a *Series*.

In [None]:
attacks = {
'back.' : 'dos',
'buffer_overflow.' : 'u2r',
'ftp_write.' : 'r2l',
'guess_passwd.' : 'r2l',
'imap.' : 'r2l',
'ipsweep.' : 'probe',
'land.' : 'dos',
'loadmodule.' : 'u2r',
'multihop.' : 'r2l',
'neptune.' : 'dos',
'nmap.' : 'probe',
'perl.' : 'u2r',
'phf.' : 'r2l',
'pod.' : 'dos',
'portsweep.' : 'probe',
'rootkit.' : 'u2r',
'satan.' : 'probe',
'smurf.' : 'dos',
'spy.' : 'r2l',
'teardrop.' : 'dos',
'warezclient.' : 'r2l',
'warezmaster.' : 'r2l',
'normal.' : 'normal'
}

# Write your code here
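One possible way to apply such a dictionary is `Series.map`, sketched below on a hypothetical three-entry subset of `attacks` and an invented label column.

```python
import pandas

# Hypothetical subset of the `attacks` dictionary defined in the cell above.
toy_attacks = {'smurf.': 'dos', 'nmap.': 'probe', 'normal.': 'normal'}
toy_labels = pandas.Series(['smurf.', 'normal.', 'nmap.', 'smurf.'])

# Series.map looks each value up in the dictionary;
# values that are missing from the dictionary become NaN.
mapped = toy_labels.map(toy_attacks)

print(mapped.tolist())
# ['dos', 'normal', 'probe', 'dos']
```

Because unmapped values turn into NaN, checking `mapped.isna().sum()` afterwards is a cheap way to detect a label that the dictionary does not cover.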

Save the *label* column in a variable named `target` (for later use) and [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) it from the data frame.

In [None]:
# Write your code here
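A minimal sketch of this step on an invented two-row frame (`toy_df` is a hypothetical stand-in for `df`):

```python
import pandas

toy_df = pandas.DataFrame({'duration': [0, 2], 'label': ['normal', 'dos']})

# Keep the labels as a Series, then remove the column from the features.
target = toy_df['label']
toy_df = toy_df.drop(columns='label')

print(list(toy_df.columns))
# ['duration']
```

`drop` returns a new frame by default, so the result has to be assigned back.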

## Explore the data

With the help of methods from [pandas.Series](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html), print the different values of the `label` column (now stored in `target`) along with their respective absolute frequencies.

In [None]:
# Write your code here
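`Series.value_counts` is one candidate for this task. The sketch below uses an invented label series, `toy_target`, in place of `target`.

```python
import pandas

toy_target = pandas.Series(['dos', 'dos', 'normal', 'probe', 'dos', 'normal'])

# Absolute frequencies, sorted from most to least frequent.
counts = toy_target.value_counts()

# Relative frequencies (each count divided by the total).
relative = toy_target.value_counts(normalize=True)

print(counts)
print(relative)
```

On this toy series, `dos` appears 3 times out of 6, so its relative frequency is 0.5.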

With the help of [matplotlib](https://matplotlib.org/), create a histogram of the relative frequencies of the values in this column. Take care with the presentation (for example: titles, scales, colors, ...).

In [None]:
# Write your code here
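Since the label is categorical, a bar chart is one reasonable reading of "histogram" here. The sketch below is only a starting point on invented data. The logarithmic axis is an assumption that may help if the real classes turn out to be heavily imbalanced.

```python
import matplotlib
# Non-interactive backend so the sketch also runs as a plain script;
# this line is not needed inside the notebook.
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import pandas

toy_target = pandas.Series(['dos'] * 6 + ['normal'] * 3 + ['probe'])
relative = toy_target.value_counts(normalize=True)

fig, ax = plt.subplots()
ax.bar(list(relative.index), relative.values, color='steelblue')
ax.set_yscale('log')  # assumption: useful when some classes are very rare
ax.set_xlabel('class')
ax.set_ylabel('relative frequency (log scale)')
ax.set_title('Class distribution (toy data)')
```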

Add figures and statistics that could help to get a better understanding of the data set.

In [None]:
# Write your code here

## Data Preprocessing

Use [pandas.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) to convert the categorical attributes so that they can be used to train a *Naive Bayes* model.

In [None]:
# Write your code here
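To show what the conversion produces, here is a small sketch on an invented frame with a single categorical column:

```python
import pandas

toy_df = pandas.DataFrame({'duration': [0, 2, 5],
                           'protocol_type': ['tcp', 'udp', 'tcp']})

# Each distinct value of the listed columns becomes its own indicator column,
# named <column>_<value>; the other columns are left untouched.
encoded = pandas.get_dummies(toy_df, columns=['protocol_type'])

print(list(encoded.columns))
# ['duration', 'protocol_type_tcp', 'protocol_type_udp']
```

The names of the generated indicator columns can be collected afterwards (for example, all columns that are not numerical attributes) to form the feature list of the categorical classifier.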

In this first assignment, we consider a simple evaluation model that requires a training set and a test set comprising 70% and 30% of the data set, respectively. With the help of [model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), randomly split the data frame index into two pandas.Index objects that define these two parts of the data set. The DataFrame *index* can be retrieved with its `index` attribute.

In [None]:
# Write your code here 
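The idea of splitting the index rather than the data can be sketched as follows. The frame is invented, and `random_state=0` is an arbitrary choice made only so that the split is reproducible.

```python
import pandas
from sklearn import model_selection

toy_df = pandas.DataFrame({'duration': range(10)})

# Split the row labels, not the rows themselves: 70% for training, 30% for testing.
train_index, test_index = model_selection.train_test_split(
    toy_df.index, test_size=0.3, random_state=0)

# The two index sets can then select rows from any object sharing this index.
train_part = toy_df.loc[train_index]
test_part = toy_df.loc[test_index]

print(len(train_part), len(test_part))
# 7 3
```

Passing the labels through the `stratify` parameter is one option for keeping rare classes represented in both parts.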

Apply the required preprocessing to the categorical and numerical features. For example, you can use tools from the [preprocessing](http://scikit-learn.org/stable/modules/preprocessing.html) module and traditional math functions from [numpy](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.math.html). Both the training set and the test set have to be preprocessed. However, ensure that the information provided by the test set is never used to train the model. In particular, pay attention to the difference between the `fit_transform` and `transform` methods.

In [None]:
# Write your code here 
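The difference between the two methods can be seen on a toy example. `StandardScaler` is used here purely as an illustration; whether standardization is the right preprocessing for this data set is left open.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = preprocessing.StandardScaler()

# fit_transform estimates the mean and standard deviation on the training set
# and applies them in one step.
X_train_scaled = scaler.fit_transform(X_train)

# transform reuses the training statistics; the test set is never used for fitting.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)   # mean learned from the training set only
print(X_test_scaled)  # test value expressed in training-set units
```

Calling `fit_transform` on the test set instead would recompute the statistics from test data, which is exactly the leak the paragraph above warns against.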

## Modeling

Build a first *Naive Bayes* classifier (named `clf1`) that uses only the numerical attributes for the prediction.

In [None]:
# Write your code here

Build a second *Naive Bayes* classifier (named `clf2`) that uses only the categorical attributes for the prediction.

In [None]:
# Write your code here
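The `naive_bayes` module offers several variants. One plausible pairing, assumed here and not prescribed by the assignment, is `GaussianNB` for continuous features and `BernoulliNB` for the 0/1 indicator columns. The sketch below fits both on invented data; `toy_clf1` and `toy_clf2` are hypothetical names.

```python
import numpy as np
from sklearn import naive_bayes

y = np.array(['normal', 'normal', 'dos', 'dos'])
X_num = np.array([[0.1], [0.2], [5.0], [5.2]])      # one continuous feature
X_cat = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])  # two indicator features

# Gaussian likelihood per feature and class for the numerical part.
toy_clf1 = naive_bayes.GaussianNB().fit(X_num, y)

# Bernoulli likelihood per feature and class for the binary part.
toy_clf2 = naive_bayes.BernoulliNB().fit(X_cat, y)

pred_num = toy_clf1.predict(np.array([[0.15], [5.1]]))
pred_cat = toy_clf2.predict(np.array([[1, 0], [0, 1]]))

print(pred_num, pred_cat)
```

Both classifiers expose `classes_`, which gives the column order used by `predict_proba`.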

Build a third model (named `clf3`) that combines these two classifiers by completing the following class definition.

In [None]:
# Complete the following class definition

class CombinedClassifier:
    def __init__(self, clf1, clf1_features, clf2, clf2_features):
        self.clf1 = clf1
        self.clf2 = clf2
        self.clf1_features = clf1_features
        self.clf2_features = clf2_features
        
    def predict(self, X): pass
    
    def predict_proba(self, X): pass

clf3 = CombinedClassifier(clf1, numerical, clf2, dummy_categorical)
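One way to reason about the combination, offered as a sketch and not as the expected solution: if the two feature groups are assumed conditionally independent given the class (the same assumption Naive Bayes already makes), then $P(y \mid x_1, x_2) \propto P(y \mid x_1)\,P(y \mid x_2)\,/\,P(y)$. The helper `combine_posteriors` below is hypothetical and works on plain probability arrays, such as the outputs of `predict_proba`.

```python
import numpy as np

def combine_posteriors(proba1, proba2, prior):
    """Combine two posterior arrays of shape (n_samples, n_classes).

    Assumes both arrays use the same class order and that every prior is > 0.
    """
    # The prior is counted once in each posterior, so divide it out once.
    unnormalized = proba1 * proba2 / prior
    # Renormalize each row so that it sums to 1.
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)

p1 = np.array([[0.8, 0.2]])   # invented posterior from a first classifier
p2 = np.array([[0.6, 0.4]])   # invented posterior from a second classifier
prior = np.array([0.5, 0.5])  # invented class prior

combined = combine_posteriors(p1, p2, prior)
print(combined)
```

With these numbers the unnormalized scores are 0.96 and 0.16, giving 6/7 and 1/7 after normalization. For many features or very small probabilities, doing the same computation with log-probabilities may be numerically safer.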

## Evaluating Models

A powerful evaluation tool for classification is the so-called **confusion matrix**, denoted by $C \in \mathbb{R}^{d \times d}$, where $d$ is the number of classes. Each element $C[x, y]$ of this matrix is defined as the fraction (or number) of items of class $x$ that are labeled as $y$ by the classification model. Hence, for a perfect classifier and when fractions are used, the diagonal values $C[i, i]$ are equal to 1 and the other values $C[i, j]$, with $i \neq j$, which represent the *confused* classes, are equal to 0.
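As a small worked example of this definition, the sketch below computes the matrix for five invented predictions, first as counts and then as row-wise fractions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ['dos', 'normal', 'probe']
y_true = ['dos', 'dos', 'normal', 'normal', 'probe']
y_pred = ['dos', 'normal', 'normal', 'normal', 'probe']

# Rows are true classes, columns are predicted classes, in the order of `labels`.
C = confusion_matrix(y_true, y_pred, labels=labels)

# Divide each row by its sum to obtain fractions per true class.
C_fraction = C / C.sum(axis=1, keepdims=True)

print(C)
# [[1 1 0]
#  [0 2 0]
#  [0 0 1]]
print(C_fraction)
```

Here one of the two `dos` items is labeled `normal`, so the first row of `C_fraction` is 0.5, 0.5, 0.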

Use the [confusion_matrix]() function to compute this matrix for `clf1`, `clf2` and `clf3`. Then create a single figure that displays these three matrices graphically, stacked vertically (see [this example](http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)).

In [None]:
# Write your code here

With the help of [metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics), compute the *precision* and *recall* for each class (not the average!) and each classifier.

In [None]:
# Write your code here
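Per-class values can be obtained by passing `average=None` to the scoring functions, as in this sketch on invented predictions:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

labels = ['dos', 'normal', 'probe']
y_true = ['dos', 'dos', 'normal', 'normal', 'probe']
y_pred = ['dos', 'normal', 'normal', 'normal', 'probe']

# average=None returns one score per class, in the order of `labels`.
precision = precision_score(y_true, y_pred, labels=labels, average=None)
recall = recall_score(y_true, y_pred, labels=labels, average=None)

for name, p, r in zip(labels, precision, recall):
    print(f'{name}: precision={p:.2f} recall={r:.2f}')
```

On this toy data, `normal` has precision 2/3 (three items predicted, two correct) and `dos` has recall 1/2 (two true items, one found).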

## Conclusion

Write a concise conclusion on this analysis. Among other things, discuss the strengths and weaknesses of this approach.