# Classification and Clustering
Intrusion detection on the NSL-KDD dataset, derived from the [KDD99](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset. [Full description of the original challenge (1999)](http://kdd.ics.uci.edu/databases/kddcup99/task.html).

> The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.




The dataset contains four attack categories (see the [taxonomy](http://kdd.ics.uci.edu/databases/kddcup99/training_attack_types)):
> - **DOS**: denial-of-service, e.g. syn flood;
> - **R2L**: unauthorized access from a remote machine, e.g. guessing password;
> - **U2R**:  unauthorized access to local superuser (root) privileges, e.g., various buffer overflow attacks;
> - **probing**: surveillance and other probing, e.g., port scanning.

Category | Attacks
--- | --- 
dos | back,land,neptune,pod,smurf,teardrop
probe | ipsweep,nmap,portsweep,satan
r2l | ftp_write,guess_passwd,imap,multihop,phf,spy,warezclient,warezmaster
u2r | buffer_overflow,loadmodule,perl,rootkit

## Learning objectives
- [**done**] load and prepare the data
- classification with Random Forest and Naive Bayes
- clustering with K-means and DBSCAN
- evaluating the classification with performance metrics
- naive evaluation of the clustering using the attack categories
- evaluating the clustering with performance metrics
- *(optional) visualization with t-SNE*
- *(optional) clustering with other techniques*

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
import os

# Data

## Load dataset

In [None]:
dataset_path = os.path.join('.', 'dataset')
train20_path = os.path.join(dataset_path, 'KDDTrain+_20Percent.txt')
train_path = os.path.join(dataset_path, 'KDDTrain+.txt')
test_path = os.path.join(dataset_path, 'KDDTest+.txt')

In [None]:
col_names = np.array(["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","labels", "foo"])  # "foo": trailing difficulty-score column, dropped in load_data

In [None]:
attack_types = {
    'normal': 'normal',
    
    'back': 'DoS',
    'land': 'DoS',
    'neptune': 'DoS',
    'pod': 'DoS',
    'smurf': 'DoS',
    'teardrop': 'DoS',
    'mailbomb': 'DoS',
    'apache2': 'DoS',
    'processtable': 'DoS',
    'udpstorm': 'DoS',
    
    'ipsweep': 'Probe',
    'nmap': 'Probe',
    'portsweep': 'Probe',
    'satan': 'Probe',
    'mscan': 'Probe',
    'saint': 'Probe',

    'ftp_write': 'R2L',
    'guess_passwd': 'R2L',
    'imap': 'R2L',
    'multihop': 'R2L',
    'phf': 'R2L',
    'spy': 'R2L',
    'warezclient': 'R2L',
    'warezmaster': 'R2L',
    'sendmail': 'R2L',
    'named': 'R2L',
    'snmpgetattack': 'R2L',
    'snmpguess': 'R2L',
    'xlock': 'R2L',
    'xsnoop': 'R2L',
    'worm': 'R2L',
    
    'buffer_overflow': 'U2R',
    'loadmodule': 'U2R',
    'perl': 'U2R',
    'rootkit': 'U2R',
    'httptunnel': 'U2R',
    'ps': 'U2R',    
    'sqlattack': 'U2R',
    'xterm': 'U2R'
}

In [None]:
categorical_idx = [1, 2, 3]
binary_idx = [6, 11, 13, 14, 20, 21]
# sorted() guarantees a deterministic column order (set iteration order is not guaranteed)
numerical_idx = sorted(set(range(41)).difference(categorical_idx).difference(binary_idx))

categorical_cols = col_names[categorical_idx].tolist()
binary_cols = col_names[binary_idx].tolist()
numerical_cols = col_names[numerical_idx].tolist()

In [None]:
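def load_data(data_path):
    """Load an NSL-KDD file and add the coarse `attack_type` category."""
    # The files have no header row; the trailing "foo" column is not used
    df = pd.read_csv(data_path, header=None, index_col=False, names=col_names).drop("foo", axis='columns')
    # Map each specific attack label to its category (DoS, Probe, R2L, U2R or normal)
    df['attack_type'] = df.labels.apply(lambda attack: attack_types[attack])

    # Cast categorical features to 'category' and numerical features to float32
    df = df.astype({col: 'category' for col in categorical_cols})
    df = df.astype({col: np.float32 for col in numerical_cols})

    return df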

In [None]:
df_train20 = load_data(train20_path)
df_train = load_data(train_path)
df_test = load_data(test_path)

In [None]:
df_train20.head()

In [None]:
df_train20.describe()

# Separating labels from features

In [None]:
var_names = np.array(["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate"])
target_name = np.array(["attack_type"])

# Train-test split
Already done by design: the dataset ships with separate train and test files.

For practice, split `df_train` further using `sklearn.model_selection.train_test_split`.

*doc: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html*

In [None]:
from sklearn import model_selection

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(df_train[var_names],
                                                                    df_train[target_name],
                                                                    test_size=0.2,
                                                                    train_size=0.8)

Check the sizes of the resulting sets and their proportions (80% / 20%):

In [None]:
X_test.shape[0] + X_train.shape[0] == df_train.shape[0]

In [None]:
X_test.shape[0] / (X_test.shape[0] + X_train.shape[0])
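Because some attack categories are rare, a purely random split can leave very few of them in one of the two sets. The following is a minimal sketch, on a hypothetical toy DataFrame (`toy` and its columns are made up for illustration), of how the `stratify` and `random_state` parameters of `train_test_split` preserve class proportions and make the split reproducible:

```python
import pandas as pd
from sklearn import model_selection

# Hypothetical toy data: 80 'normal', 15 'DoS', 5 'U2R'
toy = pd.DataFrame({
    'feature': range(100),
    'attack_type': ['normal'] * 80 + ['DoS'] * 15 + ['U2R'] * 5,
})

toy_train, toy_test = model_selection.train_test_split(
    toy,
    test_size=0.2,
    stratify=toy['attack_type'],  # keep the class proportions in both sets
    random_state=0,               # make the split reproducible
)

# Each class keeps its 20% share in the test set
print(toy_test['attack_type'].value_counts())
```

The same two parameters could be passed to the split of `df_train` above, with `stratify=df_train['attack_type']`.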

### One-hot encoding for categorical variables
Also called 'dummies'. A column with N categorical values is turned into N binary columns:

| categ_var |
|-----------|
| TCP       |
| UDP       |
| TCP       |
| ICMP      |

becomes:

| categ_var_tcp | categ_var_udp | categ_var_icmp |
|---------------|---------------|----------------|
| 1             | 0             | 0              |
| 0             | 1             | 0              |
| 1             | 0             | 0              |
| 0             | 0             | 1              |
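The table above can be reproduced with `pd.get_dummies` on a small hypothetical DataFrame (`toy` is made up for illustration). Note that pandas orders the new columns alphabetically and keeps the original case of the values, so the column names differ slightly from the table:

```python
import pandas as pd

# Hypothetical one-column DataFrame matching the table above
toy = pd.DataFrame({'categ_var': ['TCP', 'UDP', 'TCP', 'ICMP']})

# dtype=int yields 0/1 columns instead of booleans
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```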

In [None]:
X_train.protocol_type.value_counts()

In [None]:
dummies = pd.get_dummies(X_train[categorical_cols])
X_dummies = pd.concat([X_train, dummies], axis=1)
X_dummies.drop(categorical_cols, axis='columns', inplace=True)

# If you used df_train instead of X_train:
# X_dummies.drop('attack_type', axis='columns', inplace=True)
# X_dummies.drop('labels', axis='columns', inplace=True)

# Classification
Some pointers:
- https://scikit-learn.org/stable/modules/tree.html#classification
- https://scikit-learn.org/stable/modules/ensemble.html#random-forests
- https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes

In [None]:
from sklearn import ensemble

In [None]:
clf = ensemble.RandomForestClassifier(n_estimators=100,
                                      max_depth=2,
                                      random_state=101010)

In [None]:
y_train.values

In [None]:
clf.fit(X_dummies, np.ravel(y_train))
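The objectives also mention Naive Bayes. Here is a minimal sketch of `sklearn.naive_bayes.GaussianNB` on hypothetical toy data (the arrays below are made up for illustration); it exposes the same `fit` / `predict` interface as the random forest above, so it could be fitted on `X_dummies` in the same way:

```python
import numpy as np
from sklearn import naive_bayes

# Hypothetical toy data: two well-separated groups of points
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
y_toy = np.array(['normal'] * 3 + ['DoS'] * 3)

# Gaussian Naive Bayes models each feature as a per-class normal distribution
nb = naive_bayes.GaussianNB()
nb.fit(X_toy, y_toy)

nb_pred = nb.predict(np.array([[0.1, 0.1], [5.1, 5.1]]))
print(nb_pred)
```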

# Performance metrics - classification
Try to compute:
- accuracy
- precision
- recall
- confusion matrix

*doc: https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics*

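We apply the same transformations to the test set so that it has the same variables:

In [None]:
test_dummies = pd.get_dummies(X_test[categorical_cols])
X_test_dummies = pd.concat([X_test, test_dummies], axis=1)
X_test_dummies.drop(categorical_cols, axis='columns', inplace=True)
# Align with the training columns; a dummy column absent from one split is filled with 0
X_test_dummies = X_test_dummies.reindex(columns=X_dummies.columns, fill_value=0)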

In [None]:
y_pred = clf.predict(X_test_dummies)

## Problem: unbalanced dataset

The dataset is very `unbalanced`. We can `downsample` the DoS and normal classes, or `upsample` the minority classes. Alternatively, we can look for models that are `robust` to unbalanced datasets.
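As an illustration of the upsampling option, here is a minimal sketch using `sklearn.utils.resample` on a hypothetical toy DataFrame. The helper `upsample_minority` and the data are assumptions made for this example, not part of the original notebook. Upsampling should only be applied to the training split, never to the test split:

```python
import pandas as pd
from sklearn.utils import resample

def upsample_minority(df, label_col, random_state=0):
    """Hypothetical helper: resample every class up to the largest class size."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        # Draw with replacement until the class reaches the target size
        parts.append(resample(group, replace=True, n_samples=target,
                              random_state=random_state))
    # Shuffle so that classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

# Hypothetical toy data: 6 'normal', 3 'DoS', 1 'U2R'
toy = pd.DataFrame({'x': range(10),
                    'attack_type': ['normal'] * 6 + ['DoS'] * 3 + ['U2R']})
balanced = upsample_minority(toy, 'attack_type')
print(balanced['attack_type'].value_counts())
```

Another option that avoids duplicating rows is the `class_weight='balanced'` parameter of `RandomForestClassifier`, which reweights classes inversely to their frequency.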

In [None]:
y_train.attack_type.value_counts()

In [None]:
np.unique(y_test)

Because of this imbalance, the model ends up predicting only two classes on the test set:

In [None]:
np.unique(y_pred)

In [None]:
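# Measure performance
from sklearn import metrics

# Fix the label order explicitly so that rows and columns match the plot's tick labels
class_labels = ['DoS', 'Probe', 'R2L', 'U2R', 'normal']
conf_matrix = metrics.confusion_matrix(np.ravel(y_test), y_pred, labels=class_labels)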

In [None]:
# code from https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
def plot_confusion_matrix(cm, classes,
                          cmap=plt.cm.Blues):

    title = 'Confusion matrix'
    
    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

In [None]:
plot_confusion_matrix(cm=conf_matrix, classes=['DoS', 'Probe', 'R2L', 'U2R', 'normal'])
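Accuracy, precision and recall, listed at the start of this section, can be computed with `sklearn.metrics`. The sketch below uses small hypothetical label arrays (made up for illustration) that mimic a model predicting only two classes. Macro averaging gives each class the same weight, so the missed minority classes pull precision and recall down even when accuracy looks acceptable:

```python
from sklearn import metrics

# Hypothetical labels: the model never predicts 'Probe' or 'R2L'
y_true_toy = ['normal', 'normal', 'DoS', 'DoS', 'Probe', 'R2L']
y_pred_toy = ['normal', 'normal', 'DoS', 'normal', 'normal', 'normal']

accuracy = metrics.accuracy_score(y_true_toy, y_pred_toy)
# zero_division=0 scores never-predicted classes as 0 instead of raising a warning
precision = metrics.precision_score(y_true_toy, y_pred_toy,
                                    average='macro', zero_division=0)
recall = metrics.recall_score(y_true_toy, y_pred_toy,
                              average='macro', zero_division=0)

print(accuracy, precision, recall)  # 0.5, 0.35, 0.375
```

On the notebook's data, the same calls would take `np.ravel(y_test)` and `y_pred` as arguments.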

# Clustering
Some pointers:
- https://scikit-learn.org/stable/modules/clustering.html#k-means
- https://scikit-learn.org/stable/modules/clustering.html#dbscan

In [None]:
from sklearn import cluster

The maximum values of the variables differ by several orders of magnitude (from 0 up to about 1e9). We need to "normalize" them (here, divide each column by its maximum) to compute more meaningful distances while clustering and to improve performance.

In [None]:
X_dummies.describe()

In [None]:
# Normalization: divide each column by its maximum
X_normalized = X_dummies.apply(lambda x: (x * 1.0) / x.max())
# Columns whose maximum is 0 become NaN (0/0) and are dropped
X_normalized_no_na = X_normalized.dropna(axis='columns')

In [None]:
X_normalized.describe()
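As an alternative to dividing by the maximum, `sklearn.preprocessing.MinMaxScaler` rescales each column to the [0, 1] range and maps constant columns to 0 instead of producing NaN. This is a sketch on a hypothetical toy array, not the method used in this notebook:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: very different scales, plus one all-zero column
X_toy = np.array([[0.0, 1e9, 0.0],
                  [5.0, 0.0, 0.0],
                  [10.0, 5e8, 0.0]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_toy)
print(X_scaled)
```

A fitted scaler can also be reused on the test set with `scaler.transform`, which keeps the scaling consistent between the two sets.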

In [None]:
kmeans_normalized = cluster.KMeans(n_clusters=2, random_state=0).fit(X=X_normalized_no_na)

In [None]:
kmeans_normalized.labels_

In [None]:
np.unique(kmeans_normalized.labels_)

In [None]:
labels = y_train.copy()
labels['cluster'] = kmeans_normalized.labels_

In [None]:
labels[:5]

### K = 2

In [None]:
labels.groupby(['attack_type', 'cluster']).size()

In [None]:
labels.groupby(['cluster', 'attack_type']).size()
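The group counts above can also be read as a contingency table. The sketch below, on a hypothetical toy DataFrame with the same two columns as `labels`, builds the table with `pd.crosstab` and derives a simple purity score (the fraction of samples that belong to the majority category of their cluster). Purity is an illustrative choice here, not a metric used in the original notebook:

```python
import pandas as pd

# Hypothetical toy data with the same columns as `labels`
toy_labels = pd.DataFrame({
    'attack_type': ['normal', 'normal', 'normal', 'DoS', 'DoS', 'Probe'],
    'cluster': [0, 0, 1, 1, 1, 1],
})

# Rows: clusters, columns: attack categories
table = pd.crosstab(toy_labels['cluster'], toy_labels['attack_type'])
print(table)

# Purity: sum of each cluster's majority count, divided by the number of samples
purity = table.max(axis=1).sum() / len(toy_labels)
print(purity)
```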

### K = 3

In [None]:
kmeans_normalized = cluster.KMeans(n_clusters=3, random_state=0).fit(X=X_normalized_no_na)
labels = y_train.copy()
labels['cluster'] = kmeans_normalized.labels_

In [None]:
labels.groupby(['cluster', 'attack_type']).size()

### K = 5

In [None]:
kmeans_normalized = cluster.KMeans(n_clusters=5, random_state=0).fit(X=X_normalized_no_na)
labels = y_train.copy()
labels['cluster'] = kmeans_normalized.labels_

In [None]:
labels.groupby(['cluster', 'attack_type']).size()
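The objectives also mention DBSCAN, which needs no number of clusters up front and marks isolated points as noise with the label `-1`. Here is a minimal sketch on hypothetical toy points; the `eps` and `min_samples` values are chosen for this toy data only and would need tuning on the normalized NSL-KDD features:

```python
import numpy as np
from sklearn import cluster

# Hypothetical toy data: two tight groups and one isolated point
X_toy = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
                  [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
                  [10.0, 0.0]])

# eps: neighbourhood radius; min_samples: points (including itself) needed to form a core point
dbscan = cluster.DBSCAN(eps=0.3, min_samples=3).fit(X_toy)
print(dbscan.labels_)  # [ 0  0  0  1  1  1 -1]
```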

### Some results and visualizations
Comparison of k-means and local density clustering: http://yinsenm.github.io/2014/08/18/kdd99-cluster/

# Performance metrics - clustering
Try to compute:
- silhouette
- homogeneity

*doc: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation*

For the silhouette coefficient, the best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.

In [None]:
# k=2 (run this after using k=2 when clustering)
metrics.silhouette_score(X=X_normalized_no_na,
                         labels=kmeans_normalized.labels_,
                         metric='manhattan')

In [None]:
# k=3 (run this after using k=3 when clustering)
metrics.silhouette_score(X=X_normalized_no_na,
                         labels=kmeans_normalized.labels_,
                         metric='manhattan')
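Homogeneity is listed above but not computed. The sketch below, on hypothetical toy blobs, shows `metrics.homogeneity_score`, which compares clusters with known categories and equals 1 when every cluster contains a single category. It also shows the `sample_size` parameter of `silhouette_score`, which evaluates the score on a random subset. This helps on large datasets because the silhouette requires pairwise distances:

```python
import numpy as np
from sklearn import cluster, metrics

# Hypothetical toy data: two well-separated blobs with known categories
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
blob_b = rng.normal(loc=10.0, scale=1.0, size=(100, 2))
X_toy = np.vstack([blob_a, blob_b])
true_labels = np.array(['normal'] * 100 + ['DoS'] * 100)

km = cluster.KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# Homogeneity needs ground-truth categories; silhouette only needs the data
homogeneity = metrics.homogeneity_score(true_labels, km.labels_)
silhouette = metrics.silhouette_score(X_toy, km.labels_, metric='manhattan',
                                      sample_size=100, random_state=0)
print(homogeneity, silhouette)
```

On the notebook's data, the homogeneity call would take `np.ravel(y_train)` and `kmeans_normalized.labels_`.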

### Visualization attempt

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
pca_result = pca.fit_transform(X_normalized_no_na.values)

print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))

In [None]:
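# Assumption: y_l holds the attack categories encoded as integer codes, used as colors
y_l = y_train['attack_type'].astype('category').cat.codes
plt.scatter(pca_result[:, 0], pca_result[:, 1], c=y_l, s=2)
plt.show()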
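The optional t-SNE objective can be approached in a similar way with `sklearn.manifold.TSNE`. The sketch below runs on hypothetical random data; on the real features it would be reasonable to work on a random subsample first, since t-SNE is much slower than PCA. The `perplexity` value is an arbitrary choice for this example:

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical toy data standing in for a subsample of the normalized features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 10))

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=30, init='pca', random_state=0)
embedding = tsne.fit_transform(X_toy)
print(embedding.shape)  # (300, 2)
```

The two columns of `embedding` can then be passed to `plt.scatter` exactly like the PCA components above.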

In [None]:
pca_result[:,0].shape

In [None]:
y_train.shape

In [None]:
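# k=5: one panel per cluster, projected on the first two principal components
# (assumes kmeans_normalized was last fitted with n_clusters=5)
fig, axes = plt.subplots(1, 5, figsize=(20, 4), sharex=True, sharey=True)
for k, ax in enumerate(axes):
    mask = kmeans_normalized.labels_ == k
    ax.scatter(pca_result[mask, 0], pca_result[mask, 1], s=2)
    ax.set_title('cluster {}'.format(k))
plt.show()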