In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("data/preprocessed_CTU-IoT-Malware-Capture-21-1.csv")

In [3]:
df.head()

Unnamed: 0,id.resp_h,proto,service,duration,orig_bytes,resp_bytes,conn_state,missed_bytes,history,orig_pkts,orig_ip_bytes,resp_pkts,resp_ip_bytes,label
0,8,2,1,-0.031471,-0.12058,-0.091638,2,-0.028072,1,-0.031437,-0.033291,-0.032746,-0.035204,0
1,8,2,1,-0.031469,0.083804,0.114779,5,-0.028072,2,-0.025729,-0.022584,-0.021844,-0.016371,0
2,8,2,1,-0.025755,0.056852,-0.089369,2,-0.028072,1,-0.025729,-0.02421,-0.032746,-0.035204,0
3,8,2,1,-0.031469,0.083804,0.114779,5,-0.028072,2,-0.025729,-0.022584,-0.021844,-0.016371,0
4,8,2,1,-0.025717,0.144445,0.232731,5,-0.028072,2,-0.020022,-0.015129,-0.021844,-0.009663,0


# Train Test Split

In [4]:
# Build a train/test split that puts 80% of each class, and therefore most malware examples, in the test set

malware = df[df['label'] == 1]
malware_test = malware.sample(frac=0.8, random_state=42)
malware_train = malware.drop(malware_test.index)

benign = df[df['label'] == 0]

benign_test = benign.sample(frac=0.8, random_state=42)
benign_train = benign.drop(benign_test.index)

train = pd.concat([malware_train, benign_train])
test = pd.concat([malware_test, benign_test])

X_train = train.drop(['label'], axis=1)
y_train = train['label']

X_test = test.drop(['label'], axis=1)
y_test = test['label']

# Shuffle the data
from sklearn.utils import shuffle
X_train, y_train = shuffle(X_train, y_train, random_state=42)
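The split above can be checked by counting the labels on each side. The sketch below repeats the same per-class sampling on a small made-up DataFrame; the helper name `split_per_class` and the toy data are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

def split_per_class(frame, label_col="label", test_frac=0.8, seed=42):
    """Send test_frac of every class to the test set, the rest to the train set."""
    test_parts = []
    for _, group in frame.groupby(label_col):
        # sample the same fraction from each class separately
        test_parts.append(group.sample(frac=test_frac, random_state=seed))
    test_part = pd.concat(test_parts)
    # everything that was not sampled stays in the train set
    train_part = frame.drop(test_part.index)
    return train_part, test_part

# toy data (made up): 5 "malware" rows and 15 "benign" rows
toy = pd.DataFrame({"feature": range(20), "label": [1] * 5 + [0] * 15})
toy_train, toy_test = split_per_class(toy)

print(toy_train["label"].value_counts())
print(toy_test["label"].value_counts())
```

On the real data, calling `value_counts()` on `y_train` and `y_test` gives the same kind of check.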


# K-Nearest Neighbors (KNN)

### Simple explanation of how this model works
- Choose hyperparameters for the model (number of neighbors to look for, distance metric, etc.)
- For each unclassified point, compute the distance between it and <b>all</b> points in the training set
    - This distance (Euclidean, Manhattan, etc.) is measured in the feature space and can be computed regardless of the number of features (dimensions).
- Pick the K training points that are closest to the unclassified point.
- Assign the majority label among those K points to the new point.
    - So, for k=5, if 3 of the closest points are labeled as "red" and 2 are labeled as "blue", the new point will be labeled as "red".
- Repeat for every unclassified point in the dataset
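
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN internally; the function name `knn_predict` and the toy points are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_new, k=5, p=2):
    """Classify each row of X_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for x in np.asarray(X_new, dtype=float):
        # Minkowski distance to every training point (p=2 Euclidean, p=1 Manhattan)
        distances = (np.abs(X_train - x) ** p).sum(axis=1) ** (1.0 / p)
        # indices of the k closest training points
        nearest = np.argsort(distances)[:k]
        # majority label among those neighbors
        majority_label = Counter(y_train[nearest]).most_common(1)[0][0]
        predictions.append(majority_label)
    return np.array(predictions)

# toy data (made up): three "red" points near the origin, three "blue" points near (1, 1)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array(["red", "red", "red", "blue", "blue", "blue"])

# with k=5 the neighbors are 3 "red" and 2 "blue" points, so the vote goes to "red"
toy_pred = knn_predict(X_toy, y_toy, [[0.3, 0.3]], k=5)
print(toy_pred)
```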

<img src="https://images.datacamp.com/image/upload/v1686762721/image2_a2876c62d1.png" alt="knn" width="400"/>

In this notebook I will instantiate several KNN classifiers with different numbers of neighbors. I will also set the weights parameter to "distance", so that closer neighbors have more influence on the prediction.
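
To see what the weights parameter changes, here is a small sketch on made-up one-dimensional data (the points and variable names are assumptions for illustration). The query point has one very close neighbor of class 1 and two farther neighbors of class 0, so a plain majority vote and a distance-weighted vote disagree.

```python
from sklearn.neighbors import KNeighborsClassifier

# toy data (made up): one class-1 point at 0.0, two class-0 points around 1.0
X_small = [[0.0], [1.0], [1.1]]
y_small = [1, 0, 0]
query = [[0.1]]

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_small, y_small)
distance_knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_small, y_small)

# uniform: every neighbor counts once, so class 0 wins 2 votes to 1
uniform_pred = uniform_knn.predict(query)[0]
# distance: each vote is weighted by 1/distance, so the neighbor at distance 0.1 dominates
distance_pred = distance_knn.predict(query)[0]

print("uniform:", uniform_pred, "distance:", distance_pred)
```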

In [5]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

grid = GridSearchCV(knn, param_grid={'n_neighbors': [1,2,3,4,5]}, cv=5) 
grid_search = grid.fit(X_train, y_train)
print("Best parameters: ", grid_search.best_params_)
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
print("Best accuracy on test set: {:.2f}".format(grid_search.score(X_test, y_test)))



Best parameters:  {'n_neighbors': 1}
Best cross-validation score: 1.00
Best accuracy on test set: 1.00
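
The scores printed above are rounded to two decimals, so they do not show whether the other values of n_neighbors scored the same. One way to check is to read the full cross-validation table from `cv_results_`. The sketch below does this on synthetic data (the blobs and the names `demo_search` and `cv_table` are assumptions for illustration); on the notebook's data the same indexing would be applied to `grid_search.cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# synthetic, well-separated two-class data (illustration only, not the CTU data)
X_demo, y_demo = make_blobs(n_samples=100, centers=[[-5, -5], [5, 5]],
                            cluster_std=1.0, random_state=42)

demo_search = GridSearchCV(KNeighborsClassifier(),
                           param_grid={'n_neighbors': [1, 2, 3, 4, 5]}, cv=5)
demo_search.fit(X_demo, y_demo)

# one row per candidate value of n_neighbors
cv_table = pd.DataFrame(demo_search.cv_results_)[
    ['param_n_neighbors', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(cv_table)
```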


In [9]:
knn_1 = KNeighborsClassifier(n_neighbors=1, weights="distance", p=1)
knn_2 = KNeighborsClassifier(n_neighbors=2, weights="distance", p=1)
knn_3 = KNeighborsClassifier(n_neighbors=3, weights="distance", p=1)
knn_5 = KNeighborsClassifier(n_neighbors=5, weights="distance", p=1)

y_pred_1 = knn_1.fit(X_train, y_train).predict(X_test)
y_pred_2 = knn_2.fit(X_train, y_train).predict(X_test)
y_pred_3 = knn_3.fit(X_train, y_train).predict(X_test)
y_pred_5 = knn_5.fit(X_train, y_train).predict(X_test)

In [10]:
# get the confusion matrix for each model, flattened to (tn, fp, fn, tp)
from sklearn.metrics import confusion_matrix

cm_1 = confusion_matrix(y_test, y_pred_1).ravel()
cm_2 = confusion_matrix(y_test, y_pred_2).ravel()
cm_3 = confusion_matrix(y_test, y_pred_3).ravel()
cm_5 = confusion_matrix(y_test, y_pred_5).ravel()

In [11]:
# make a df with all the confusion matrices
cm_df = pd.DataFrame([cm_1, cm_2, cm_3, cm_5], columns=['tn', 'fp', 'fn', 'tp'], index=[1, 2, 3, 5])
# set index name to be the number of neighbors
cm_df.index.rename('neighbors', inplace=True)
cm_df.head()

neighbors,tn,fp,fn,tp
1,2618,0,0,11
2,2618,0,0,11
3,2618,0,0,11
5,2618,0,0,11
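
With only 11 malicious rows against 2618 benign ones in the test set, accuracy alone says little, so it helps to derive precision and recall from the same counts. A short sketch, with the counts copied from the table above into a new DataFrame (the name `cm_check` and the extra columns are assumptions for illustration):

```python
import pandas as pd

# counts copied from the confusion-matrix table above
cm_check = pd.DataFrame(
    {'tn': [2618] * 4, 'fp': [0] * 4, 'fn': [0] * 4, 'tp': [11] * 4},
    index=pd.Index([1, 2, 3, 5], name='neighbors'),
)

total = cm_check.sum(axis=1)
cm_check['accuracy'] = (cm_check['tp'] + cm_check['tn']) / total
cm_check['precision'] = cm_check['tp'] / (cm_check['tp'] + cm_check['fp'])
cm_check['recall'] = cm_check['tp'] / (cm_check['tp'] + cm_check['fn'])
# accuracy of a trivial model that always predicts "benign"
cm_check['always_benign_accuracy'] = (cm_check['tn'] + cm_check['fp']) / total

print(cm_check[['accuracy', 'precision', 'recall', 'always_benign_accuracy']])
```

A model that always answers "benign" would already reach about 99.6% accuracy on this test set, which is why recall and precision on the malware class are the more informative numbers here.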


# Conclusion
All four models produced the same confusion matrix on the test set, so none of them outperformed the others there: each one classified all 2618 benign and all 11 malicious connections correctly. The GridSearch reported 1 neighbour as the best parameter, and with equal accuracy it is the simplest configuration to prefer, although a single neighbour also makes the prediction more sensitive to noisy or mislabeled training points.
It is remarkable that 80% of the malicious cases were in the <b>test set</b>, so the model had very few malicious examples to learn from and still predicted all of them correctly.


<img src="https://www.researchgate.net/publication/333430988/figure/fig8/AS:960478901710860@1606007423711/Example-of-Euclidean-and-Manhattan-distances-between-two-points-A-and-B-The-Euclidean.png" alt="manhattan vs euclidean" width="275"/>

In addition, after setting p=1, which selects the Manhattan distance instead of the default Euclidean distance, all models detected all the anomalies, which is a great result. The Manhattan distance appears to be a better fit for this use case.
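
For reference, the two distances differ as follows on a pair of made-up points: the Manhattan distance sums the absolute differences per feature, while the Euclidean distance is the length of the straight line between the points.

```python
import numpy as np

# two made-up points in a 2-feature space
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Manhattan distance (p=1): sum of absolute differences per feature
manhattan = np.abs(a - b).sum()
# Euclidean distance (p=2): square root of the sum of squared differences
euclidean = np.sqrt(((a - b) ** 2).sum())

print(manhattan, euclidean)  # 7.0 5.0
```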



Compared to the Isolation Forest model, KNN clearly performs better here, as it detected all the anomalies. However, it is worth noting that the Isolation Forest is unsupervised and never sees the labels, which puts it at a clear disadvantage against a supervised model such as KNN.