## 🧪 K-Nearest Neighbors with and without SMOTE

This notebook evaluates a K-Nearest Neighbors (KNN) classifier on a binary classification problem derived from a labeled network traffic dataset.

We compare two versions of the model:
- Without any resampling
- With **SMOTE** (Synthetic Minority Over-sampling Technique) applied before training

Both pipelines are evaluated with stratified 5-fold cross-validation. The confusion matrices from the five test folds are summed into one cumulative matrix per pipeline.


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE


### 📂 Load dataset and define binary target

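We load the CSV file and create a binary column `label_bin` from the `label` column:
- 1 if the label contains the substring `cyberattack` (case-insensitive), i.e. any type of cyberattack
- 0 otherwise, i.e. normal traffic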


In [2]:
df = pd.read_csv("20240625_Flooding_Heartbeat_filtered_ordered_OcppFlows_120_labelled.csv")
df['label_bin'] = df['label'].apply(lambda x: 1 if 'cyberattack' in str(x).lower() else 0)
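
Because the rule above is a substring match, it can be worth checking it on a few values before relying on `label_bin`. The sketch below applies the same rule to hypothetical label strings and counts the resulting classes. The actual label values in the CSV are not shown in this notebook, so the strings in `sample_labels` are assumptions, and `to_binary` is a helper name introduced only for illustration.

```python
from collections import Counter

def to_binary(label):
    """Same rule as the cell above: 1 if 'cyberattack' appears in the label, else 0."""
    return 1 if 'cyberattack' in str(label).lower() else 0

# Hypothetical label values, for illustration only (not taken from the CSV)
sample_labels = ["Normal", "Cyberattack_flooding", "cyberattack_heartbeat", None, "normal"]

binary = [to_binary(v) for v in sample_labels]
print(binary)           # one 0/1 value per label
print(Counter(binary))  # class counts
```

On the real data, `df['label_bin'].value_counts()` gives the same kind of summary. Note that under this rule a missing label is converted to a string without the substring and therefore maps to 0.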


### 🧹 Drop non-numeric and irrelevant columns

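We remove identifiers, timestamps, IP addresses, ports, and the original `label` column. The target `label_bin` is also excluded from the feature matrix `X` and kept separately as `y`.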


In [3]:
columns_to_drop = ['flow_id', 'flow_start_timestamp', 'flow_end_timestamp',
                   'src_ip', 'dst_ip', 'src_port', 'dst_port', 'label']
X = df.drop(columns=columns_to_drop + ['label_bin'])
y = df['label_bin']
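
`StandardScaler` and KNN both need purely numeric input, and the drop list above removes columns by name rather than by type. A minimal sketch of a dtype check follows, using a small hypothetical frame; the column names `duration`, `n_packets`, and `proto` are assumptions, not columns known to be in the dataset.

```python
import pandas as pd

# Hypothetical stand-in for X; the real feature columns are not listed in this notebook
X_demo = pd.DataFrame({
    "duration": [0.5, 1.2, 0.3],
    "n_packets": [10, 42, 7],
    "proto": ["tcp", "tcp", "udp"],  # a leftover non-numeric column
})

# Columns that are not numeric would break scaling and distance computation
non_numeric = X_demo.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Keep only the numeric features
X_numeric = X_demo.select_dtypes(include="number")
print(X_numeric.columns.tolist())
```

Running the same `select_dtypes(exclude="number")` call on `X` should return an empty list if the drop step removed every non-numeric column.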


### ⚙️ Define pipelines with and without SMOTE


In [None]:
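# Baseline: standardize features, then fit KNN (default n_neighbors=5)
pipeline_no_smote = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', KNeighborsClassifier())
])

# Same model, with SMOTE oversampling the minority class first.
# ImbPipeline applies samplers only during fit, so test data is never resampled.
pipeline_smote = ImbPipeline([
    ('smote', SMOTE(random_state=13)),
    ('scaler', StandardScaler()),
    ('clf', KNeighborsClassifier())
])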
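
To make the `smote` step less opaque: SMOTE builds each synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbors. The sketch below is a simplified NumPy illustration of that idea, not imblearn's implementation; the function name `smote_sketch` and its parameters are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=13):
    """Simplified SMOTE-style oversampling (illustration only, not imblearn's code)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other minority points

    # Pairwise Euclidean distances among minority samples
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbors
    neighbors = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # random minority point per new sample
    pick = rng.integers(0, k, size=n_new)   # which of its k neighbors to use
    nn = neighbors[base, pick]
    lam = rng.random((n_new, 1))            # interpolation factor in [0, 1)

    # New point lies on the segment between the base point and its neighbor
    return X_min[base] + lam * (X_min[nn] - X_min[base])

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=6, k=2)
print(synthetic.shape)
```

Since the neighbor search is distance-based, the position of the `smote` step matters: in `pipeline_smote` it runs before the scaler, so neighbors are found on unscaled features. Placing the scaler first is a common alternative when feature ranges differ widely.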


### 🔁 Run stratified 5-fold cross-validation

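For each fold, both pipelines are fitted on the training split and evaluated on the test split, and the resulting confusion matrix is added to a running total. Rows are true classes and columns are predicted classes, in the order `[0, 1]`.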


In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=13)
cm_no_smote = np.zeros((2, 2), dtype=int)
cm_smote = np.zeros((2, 2), dtype=int)

for train_idx, test_idx in cv.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # Without SMOTE
    pipeline_no_smote.fit(X_train, y_train)
    y_pred_no = pipeline_no_smote.predict(X_test)
    cm_no_smote += confusion_matrix(y_test, y_pred_no, labels=[0, 1])

    # With SMOTE
    pipeline_smote.fit(X_train, y_train)
    y_pred_smote = pipeline_smote.predict(X_test)
    cm_smote += confusion_matrix(y_test, y_pred_smote, labels=[0, 1])


### 📊 Cumulative confusion matrices


In [6]:
print("Cumulative confusion matrix WITHOUT SMOTE (KNN):")
print(cm_no_smote)

print("\nCumulative confusion matrix WITH SMOTE (KNN):")
print(cm_smote)


Cumulative confusion matrix WITHOUT SMOTE (KNN):
[[  87    0]
 [   0 8700]]

Cumulative confusion matrix WITH SMOTE (KNN):
[[  87    0]
 [   0 8700]]
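
Per-class precision and recall follow directly from a cumulative confusion matrix. A short sketch using the matrix printed above (rows are true classes, columns are predicted classes, in the order `[0, 1]`):

```python
import numpy as np

# Cumulative confusion matrix as printed above
cm = np.array([[87, 0],
               [0, 8700]])

true_pos = np.diag(cm)                  # correctly classified samples per class
recall = true_pos / cm.sum(axis=1)      # divide by true count of each class (row sums)
precision = true_pos / cm.sum(axis=0)   # divide by predicted count of each class (column sums)
accuracy = true_pos.sum() / cm.sum()

print("recall per class:   ", recall)
print("precision per class:", precision)
print("accuracy:           ", accuracy)
```

Both pipelines produce the same matrix, so these values apply to the run with SMOTE and the run without it. The row sums also show the class balance: 87 samples of class 0 against 8700 of class 1.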
