In [1]:
# ----------------------------------------------------------------
# IoT Netprofiler
# Licensed under The MIT License [see LICENSE for details]
# Written by Luca Maiano - https://www.linkedin.com/in/lucamaiano/
# ----------------------------------------------------------------

# Attack Detection

In this notebook we apply *learning algorithms* to detect attacks in an IoT network as accurately as possible. In particular, we focus on **supervised** and **unsupervised** algorithms and compare the effectiveness of the following:
1. K-Nearest Neighbor (KNN)
2. Random Forests Classifier
3. Support Vector Machines (SVM)
4. Deep Neural Network Classifier
5. K-Means

All of the algorithms above have been used extensively in *state-of-the-art* solutions to **anomaly-detection** problems [2].

## Metrics Identification

Following the approach of Yavuz et al. [1], we deal with the imbalanced dataset by using AUC-ROC together with the following metrics, where $TP$, $FP$ and $FN$ denote the number of true positives, false positives and false negatives:
1. $Precision = \dfrac{TP}{TP+FP}$

2. $Recall = \dfrac{TP}{TP+FN}$

3. $F1 = 2\,\dfrac{Precision \cdot Recall}{Precision + Recall}$

4. AUC-ROC.

An AUC-ROC above 0.5 means that the model performs better than random guessing.
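As a quick sanity check of the definitions above, the following sketch computes the four metrics on a small set of made-up labels, assuming scikit-learn is available. The arrays `y_true` and `y_pred` are invented for illustration and are not taken from the experiments.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy labels, invented for illustration: 1 = attacked, 0 = normal.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels 0/1, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions the AUC-ROC reduces to the mean of the
# true positive rate and the true negative rate.
auc = roc_auc_score(y_true, y_pred)

print(f'precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}')
```

Here the classifier raises two false alarms and misses one attack, so precision (0.6) is lower than recall (0.75), while the AUC-ROC stays above the 0.5 baseline.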




In [85]:
import csv
import pandas as pd
import numpy as np
import sys, os

import warnings 
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

from lib.utils import trace_processing
from lib.visualization import data_visualization
from lib.analysis import trace_statistics
from lib.analysis import trace_classification


Let us start by importing two sets of experiments:
1. *9 nodes* examples containing grids and random topologies;
2. *16 nodes* examples containing grids and random topologies.

In [86]:
exp_9_nodes = trace_processing.import_trace('data/experiments/cooja3-9nodes/traces/', 'traces.csv')
exp_16_nodes = trace_processing.import_trace('data/experiments/cooja3-16nodes/traces/', 'traces.csv')

experiments = exp_9_nodes + exp_16_nodes

# Experiment 1: Attacked vs Normal Behaviour

Now we import the *features* and *normalize* the data to speed up the learning process. For this experiment, we consider just two classes:
1. **Normal** behaviour (0), meaning that the network is not under attack;
2. **Attacked** (1), i.e. an attack has been performed (Black Hole or Grey Hole).
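The normalization itself is delegated to the project's `trace_processing.feature_normalization` helper, whose implementation is not shown in this notebook. As a rough illustration only, the sketch below applies a min-max rescaling with pandas. That this is what the helper does is an assumption, and `min_max_normalize` and the `toy` frame are hypothetical names introduced here.

```python
import pandas as pd

def min_max_normalize(df, excluded):
    """Rescale every column not listed in `excluded` to the [0, 1] range."""
    out = df.copy()
    cols = [c for c in df.columns if c not in excluded]
    span = df[cols].max() - df[cols].min()
    # Constant columns have a span of 0; use 1 to avoid dividing by zero.
    span = span.replace(0, 1)
    out[cols] = (df[cols] - df[cols].min()) / span
    return out

# Hypothetical toy data with two numeric features.
toy = pd.DataFrame({
    'node': ['a', 'b', 'c'],
    'mean': [10.0, 20.0, 30.0],
    'loss': [0.0, 5.0, 10.0],
    'label': [0, 1, 1],
})
toy_norm = min_max_normalize(toy, ['node', 'label'])
print(toy_norm)
```

Identifier and label columns are passed through untouched, mirroring the list of excluded columns used in the cell below.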

In [87]:
data = None
n_classes = 2

for experiment in experiments:
    label = 0
    topology = experiment[0].split('/')[2].split('cooja3-')[1]
    experiment_id = topology + '/' + experiment[1]
    
    if n_classes == 2:
        # Assign a label
        if experiment[1].find('gh') >= 0 or experiment[1].find('bh') >= 0:
            label = 1
    else:
        # Assign a label
        if experiment[1].find('gh') >= 0:
            label = 1
        elif experiment[1].find('bh') >= 0:
            label = 2
    nodes, packets_node = trace_processing.process_cooja_traces(experiment[0], experiment[1])
    
    if data is None:
        data = trace_processing.feature_extraction(nodes, packets_node, label, experiment_id, log_transform=True, window_size=48)    
    else:
        data = pd.concat([data, trace_processing.feature_extraction(nodes, packets_node, label, experiment_id, log_transform=True, window_size=48)])

data = data.sample(frac=1).reset_index(drop=True)
norm_data = trace_processing.feature_normalization(data, ['node', 'experiment', 'label'])
norm_data.head(5)

Unnamed: 0,node,experiment,tr_time,pckt_count,mean,var,hop,min,max,loss,outliers,label
0,aaaa::212:7404:4:404:,16nodes/grid_normal_2019-02-26_11:48_,0.276714,0.347826,0.402463,0.407981,0.0,0.233569,0.612539,0.652174,0.076923,0
1,aaaa::212:7408:8:808:,9nodes/grid9_normal_2019-02-13_17:05_,0.621075,0.673913,0.601773,0.154327,0.75,0.516164,0.651923,0.326087,0.076923,0
2,aaaa::212:7407:7:707:,16nodes/grid_1gh50-9_2019-02-19_23:54_,0.76547,0.956522,0.427156,0.138889,0.25,0.37339,0.550957,0.043478,0.153846,1
3,aaaa::212:7404:4:404:,9nodes/rnd2_1bh-8_2019-02-15_17:28_,0.65586,1.0,0.210058,0.042219,0.25,0.21519,0.286983,0.0,0.0,1
4,aaaa::212:7403:3:303:,9nodes/grid9_1bh-9_2019-02-13_15:57_,0.611601,1.0,0.143371,0.068582,0.0,0.099969,0.219303,0.0,0.0,1


Now we can split the dataset into a *training set* and a *test set* containing 80% and 20% of the samples respectively.

In [88]:
X = norm_data.drop(['node', 'experiment', 'label'], axis=1)
y = norm_data['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 80% training and 20% test

X.head()

Unnamed: 0,tr_time,pckt_count,mean,var,hop,min,max,loss,outliers
0,0.276714,0.347826,0.402463,0.407981,0.0,0.233569,0.612539,0.652174,0.076923
1,0.621075,0.673913,0.601773,0.154327,0.75,0.516164,0.651923,0.326087,0.076923
2,0.76547,0.956522,0.427156,0.138889,0.25,0.37339,0.550957,0.043478,0.153846
3,0.65586,1.0,0.210058,0.042219,0.25,0.21519,0.286983,0.0,0.0
4,0.611601,1.0,0.143371,0.068582,0.0,0.099969,0.219303,0.0,0.0


## Feature Selection

Starting from the results obtained during the data exploration phase, we can now train and compare the learning algorithms. The results are compared across different sets of features. At each iteration, the set is reduced by removing both the most relevant and the least relevant feature: the most relevant feature is dropped to avoid *overfitting*, the least relevant one to avoid *underfitting*.

The experiment is repeated iteratively as follows:
1. select a set of features;
2. run the learning algorithms;
3. measure their performance.

At the end, the best-performing set of features is selected. We start with the entire set of features.
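The loop described above could be sketched as follows. This is only an illustration under stated assumptions: feature relevance is taken from a random forest's `feature_importances_`, performance is the mean cross-validated AUC, and the data is synthetic (`X_toy`, `y_toy`). The notebook does not state how relevance was actually ranked.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the normalized feature matrix (not the notebook's data).
X_arr, y_toy = make_classification(n_samples=300, n_features=6,
                                   n_informative=3, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(6)])

features = list(X_toy.columns)
history = []  # (feature subset, mean cross-validated AUC)

while len(features) >= 2:
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    # Steps 2 and 3: run the algorithm and measure its performance.
    auc = cross_val_score(model, X_toy[features], y_toy,
                          cv=5, scoring='roc_auc').mean()
    history.append((tuple(features), auc))

    # Rank the current features by importance (ascending order).
    model.fit(X_toy[features], y_toy)
    ranked = pd.Series(model.feature_importances_, index=features).sort_values()

    # Step 1 for the next round: drop the least and the most important feature.
    to_drop = (ranked.index[0], ranked.index[-1])
    features = [f for f in features if f not in to_drop]

best_features, best_auc = max(history, key=lambda item: item[1])
print(best_features, round(best_auc, 3))
```

Each round removes two features, so six features yield three candidate subsets, and the one with the highest mean AUC is kept.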

### Supervised Algorithms

1. The first algorithm that we evaluate is **k-NN**.

In [89]:
knn_pred = trace_classification.k_nn_classifier(X_train, y_train, X_test, y_test, n_neighbors=3, cross_val=5)
knn_results, knn_confusion_matrix = trace_classification.test_metrics('knn', y_test, knn_pred)
knn_results

AUC on validation set 1/5: 0.6589068825910931
AUC on validation set 2/5: 0.6892712550607287
AUC on validation set 3/5: 0.6503036437246963
AUC on validation set 4/5: 0.7036466966611413
AUC on validation set 5/5: 0.6953982161180835
Mean AUC 0.680 (Std +/- 0.021)


Unnamed: 0,model,accuracy,precision,recall,f1-score,auc roc
0,knn,0.755708,0.662544,0.670081,0.665994,0.670081
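The `k_nn_classifier` helper belongs to the project's `lib` package and is not shown here. A report similar to the one above could be produced with scikit-learn roughly as follows. This is a sketch on synthetic data, assuming stratified 5-fold cross-validation scored by AUC; the helper's real internals may differ.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the notebook's features and labels.
X_toy, y_toy = make_classification(n_samples=400, n_features=9,
                                   weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=3)

# 5-fold cross-validation on the training split only, scored by AUC-ROC.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_aucs = cross_val_score(knn, X_tr, y_tr, cv=cv, scoring='roc_auc')
for i, fold_auc in enumerate(fold_aucs, start=1):
    print(f'AUC on validation set {i}/5: {fold_auc:.3f}')
print(f'Mean AUC {fold_aucs.mean():.3f} (Std +/- {fold_aucs.std():.3f})')

# Final model: fit on the whole training split, predict on the held-out test set.
knn.fit(X_tr, y_tr)
toy_pred = knn.predict(X_te)
```

Keeping the test split out of the cross-validation means the final metrics are computed on samples the model has never seen.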


2. Now we evaluate **RandomForest Classifier**.

In [90]:
rfc_pred = trace_classification.random_forest_classifier(X_train, y_train, X_test, y_test, n_estimators=100, cross_val=5)
rfc_results, rfc_confusion_matrix = trace_classification.test_metrics('random forest', y_test, rfc_pred)
rfc_results

AUC on validation set 1/5: 0.7001518218623481
AUC on validation set 2/5: 0.6991396761133603
AUC on validation set 3/5: 0.6573886639676114
AUC on validation set 4/5: 0.6943918225590023
AUC on validation set 5/5: 0.7036466966611413
Mean AUC 0.704 (Std +/- 0.000)


Unnamed: 0,model,accuracy,precision,recall,f1-score,auc roc
0,random forest,0.792237,0.7064,0.687062,0.695357,0.687062


3. Next we evaluate **Support Vector Machines (SVM)**.

In [None]:
svm_pred = trace_classification.svm_classifier(X_train, y_train, X_test, y_test, kernel='linear', cross_val=5)
svm_results, svm_confusion_matrix = trace_classification.test_metrics('svm', y_test, svm_pred)
svm_results

AUC on validation set 1/5: 0.6088056680161944
AUC on validation set 2/5: 0.5199898785425102
AUC on validation set 3/5: 0.6260121457489878
AUC on validation set 4/5: 0.5800378877575183
AUC on validation set 5/5: 0.6135646065198517
Mean AUC 0.590 (Std +/- 0.038)


Unnamed: 0,model,accuracy,precision,recall,f1-score,auc roc
0,svm,0.808219,0.788294,0.615546,0.63442,0.615546


4. Finally, we evaluate a **Deep Neural Network**.

In [None]:
nn_pred = trace_classification.neural_net_classifier(X_train, y_train, X_test, y_test, '2classes_ATCK_NORM_48pckts')
nn_results, nn_confusion_matrix = trace_classification.test_metrics('neural network', y_test, nn_pred)
nn_results

### Unsupervised Algorithms

Now we can try to model the problem as an unsupervised learning problem. First, we apply a PCA transformation to the data points; three of the resulting components are then used to visualize them in a 3D space.

In [None]:
X_pca = trace_classification.pca_transformation(X, n_components=len(X.columns))
X_pca.head()

5. Let us try modeling the problem with **K-Means**.

In [None]:
kmeans_pred, centroids = trace_classification.kmeans_classifier(X_pca, n_clusters=2)
data_visualization.plot_3d_points(X_pca[0], X_pca[1], X_pca[2], y, plot_name='KMeans_K2_48pckts', centroids=centroids)  # plot the first three principal components
kmeans_results, kmeans_confusion_matrix = trace_classification.test_metrics('kmeans', y, kmeans_pred)
kmeans_results
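Because K-Means assigns arbitrary cluster ids, comparing them directly against the labels is only meaningful once each cluster is mapped to a class. The sketch below shows one way to do this with scikit-learn on synthetic data: project to three principal components, cluster, then map every cluster to its majority label. The majority-vote mapping is an addition for illustration, and whether `kmeans_classifier` or `test_metrics` performs a similar alignment is not shown in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in: 200 points, 9 features, 2 underlying groups.
X_toy, y_toy = make_blobs(n_samples=200, centers=2, n_features=9, random_state=0)

# Project onto the first three principal components.
X_proj = PCA(n_components=3).fit_transform(X_toy)

# Cluster the projected points.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_proj)
clusters = kmeans.labels_

# Cluster ids are arbitrary: map each cluster to its most frequent true label.
mapping = {c: np.bincount(y_toy[clusters == c]).argmax()
           for c in np.unique(clusters)}
aligned_pred = np.array([mapping[c] for c in clusters])

accuracy = (aligned_pred == y_toy).mean()
print(f'accuracy after alignment: {accuracy:.3f}')
```

Note that the mapping uses the true labels, so the resulting score describes how well the clusters match the classes rather than a fully unsupervised prediction.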

## Conclusions

Let us compare the results obtained from the experiment.

In [None]:
trace_classification.write_results([knn_results, rfc_results, svm_results, nn_results, kmeans_results], list(X.columns.values), 'ATCK_NORM_48pckts', 2)

Based on the table above, the **neural network** outperforms all the other models. **KNN** also achieves very good results, followed by **random forests**.

# Experiment 2: Normal Behaviour vs Grey Hole vs Black Hole Attack

For this second scenario, we want to detect which attack has been performed (if any). Each node will be labeled as follows:
1. **Normal** behaviour (0) meaning that the entire network is not under attack;
2. **Grey Hole** (1);
3. **Black Hole** (2).

In [None]:
data = None
n_classes = 3

for experiment in experiments:
    label = 0
    topology = experiment[0].split('/')[2].split('cooja3-')[1]
    experiment_id = topology + '/' + experiment[1]
    
    if n_classes == 2:
        # Assign a label
        if experiment[1].find('gh') >= 0 or experiment[1].find('bh') >= 0:
            label = 1
    else:
        # Assign a label
        if experiment[1].find('gh') >= 0:
            label = 1
        elif experiment[1].find('bh') >= 0:
            label = 2
    nodes, packets_node = trace_processing.process_cooja_traces(experiment[0], experiment[1])
    
    if data is None:
        data = trace_processing.feature_extraction(nodes, packets_node, label, experiment_id, log_transform=True, window_size=48)    
    else:
        data = pd.concat([data, trace_processing.feature_extraction(nodes, packets_node, label, experiment_id, log_transform=True, window_size=48)])

data = data.sample(frac=1).reset_index(drop=True)
norm_data = trace_processing.feature_normalization(data, ['node', 'experiment', 'label'])

# Separate the features from the labels
X = norm_data.drop(['node', 'experiment', 'label'], axis=1)
y = norm_data['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 80% training and 20% test

norm_data.head()

## Feature Selection

We repeat the experiment following the same approach as before: we choose a set of features and iterate, removing the most important and the least important one, until we obtain the best results.

### Supervised Algorithms

1. The first algorithm that we evaluate is **k-NN**.

In [None]:
knn_pred = trace_classification.k_nn_classifier(X_train, y_train, X_test, y_test, n_neighbors=3, cross_val=5)
knn_results, knn_confusion_matrix = trace_classification.test_metrics('knn', y_test, knn_pred)
knn_results

2. Now we evaluate **RandomForest Classifier**.

In [None]:
rfc_pred = trace_classification.random_forest_classifier(X_train, y_train, X_test, y_test, n_estimators=100, cross_val=5)
rfc_results, rfc_confusion_matrix = trace_classification.test_metrics('random forest', y_test, rfc_pred)
rfc_results

3. Next we evaluate **Support Vector Machines (SVM)**.

In [None]:
svm_pred = trace_classification.svm_classifier(X_train, y_train, X_test, y_test, kernel='linear', cross_val=5)
svm_results, svm_confusion_matrix = trace_classification.test_metrics('svm', y_test, svm_pred)
svm_results

4. Finally, we evaluate a **Deep Neural Network**.

In [None]:
nn_pred = trace_classification.neural_net_classifier(X_train, y_train, X_test, y_test, '3classes_BH_GH_NORM_48pckts')
nn_results, nn_confusion_matrix = trace_classification.test_metrics('neural network', y_test, nn_pred)
nn_results

### Unsupervised Algorithms

Now we can try to model the problem as an unsupervised learning problem. First, we apply a PCA transformation to the data points; three of the resulting components are then used to visualize them in a 3D space.

In [None]:
X_pca = trace_classification.pca_transformation(X, n_components=len(X.columns))
X_pca.head()

5. Let us try modeling the problem with **K-Means**.

In [None]:
kmeans_pred, centroids = trace_classification.kmeans_classifier(X_pca, n_clusters=3)
data_visualization.plot_3d_points(X_pca[0], X_pca[1], X_pca[2], y, plot_name='KMeans_K3_48pckts', centroids=centroids)  # plot the first three principal components
kmeans_results, kmeans_confusion_matrix = trace_classification.test_metrics('kmeans', y, kmeans_pred)
kmeans_results

## Conclusions

Let us compare the results obtained from the experiment.

In [None]:
trace_classification.write_results([knn_results, rfc_results, svm_results, nn_results, kmeans_results], list(X.columns.values), 'BH_GH_NORM_48pckts', 3)

In this case, the **neural network** still performs well, followed by **SVM**.

# References

1. *Deep Learning for Detection of Routing Attacks in the Internet of Things*, International Journal of Computational Intelligence Systems (2018), by Furkan Yusuf Yavuz, Devrim Ünal and Ensar Gul.
2. *Machine Learning in IoT Security: Current Solutions and Future Challenges*, arXiv:1904.05735v1 (Mar 2019), by Fatima Hussain, Rasheed Hussain, Syed Ali Hassan and Ekram Hossain.
3. *Almost Everything You Need to Know About Time Series*, https://towardsdatascience.com/almost-everything-you-need-to-know-about-time-series-860241bdc578, by Marco Peixeiro.
4. *How to Check if Time Series Data is Stationary with Python*, https://machinelearningmastery.com/time-series-data-stationary-python/, by Jason Brownlee.
5. *K-Means Clustering in Python*, https://mubaris.com/posts/kmeans-clustering/, by Mubaris NK.
6. *AUC ROC Curve Scoring Function for Multi-class Classification*, https://medium.com/@plog397/auc-roc-curve-scoring-function-for-multi-class-classification-9822871a6659, by Eric Plog.