## 🌲 Random Forest Classification with and without SMOTE

This notebook evaluates a `RandomForestClassifier` on imbalanced network traffic data under two setups:
- Without any resampling
- With **SMOTE** (Synthetic Minority Over-sampling Technique) applied to the training folds before fitting

We compare the two setups using confusion matrices accumulated over 5-fold stratified cross-validation, from which classification metrics follow.
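
To make the idea behind SMOTE concrete, the sketch below builds one synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours. It is a simplified NumPy illustration, not the `imblearn` implementation; the function name `smote_like_sample`, the toy points, and `k=2` are assumptions made for this example.

```python
import numpy as np

def smote_like_sample(minority, index, k, rng):
    """Return one synthetic point built from minority[index] (illustrative only)."""
    base = minority[index]
    # Euclidean distance from the base point to every minority point
    dists = np.linalg.norm(minority - base, axis=1)
    # Indices of the k nearest neighbours, skipping the base point itself
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = minority[rng.choice(neighbours)]
    # Pick a random position on the segment between base and neighbour
    lam = rng.uniform(0.0, 1.0)
    return base + lam * (neighbour - base)

rng = np.random.default_rng(13)
# Hypothetical minority-class points in a 2-D feature space
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
synthetic = smote_like_sample(minority, index=0, k=2, rng=rng)
print(synthetic)
```

The synthetic point always lies on the line segment between two existing minority samples, which is why SMOTE adds variety without copying rows verbatim.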


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE


### 📂 Load dataset and create binary labels

We load the CSV and derive a binary label from the `label` column:
- `1` for any flow whose label contains "cyberattack" (case-insensitive)
- `0` for everything else, treated as normal traffic


In [2]:
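# Load the labelled flow records
filepath = "20240625_Flooding_Heartbeat_filtered_ordered_OcppFlows_120_labelled.csv"
df = pd.read_csv(filepath)
# Binary target: 1 if the label mentions "cyberattack" (case-insensitive), else 0
df['label_bin'] = df['label'].apply(lambda x: 1 if 'cyberattack' in str(x).lower() else 0)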
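
Before training, it helps to confirm how imbalanced the binary label actually is. The sketch below applies the same substring rule to a small made-up `label` column and counts each class; the label strings are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical label values for illustration only
toy = pd.DataFrame({"label": ["Normal", "Cyberattack_flooding", "cyberattack_x", None]})

# Same rule as the cell above: 1 if the label mentions "cyberattack", else 0
toy["label_bin"] = toy["label"].apply(lambda x: 1 if "cyberattack" in str(x).lower() else 0)

# Number of rows per class, ordered by class value
counts = toy["label_bin"].value_counts().sort_index()
print(counts)
print("Share of attacks:", counts[1] / counts.sum())
```

On the real data the equivalent check would be `df['label_bin'].value_counts()`.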


### 🧹 Drop non-numeric and irrelevant columns

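We remove flow identifiers, timestamps, IP addresses, and ports, along with the original `label` column. The remaining columns form the feature matrix `X`, and `label_bin` is the target `y`.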


In [3]:
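# Identifier, timestamp, and address columns, plus the original text label
columns_to_drop = ['flow_id', 'flow_start_timestamp', 'flow_end_timestamp',
                   'src_ip', 'dst_ip', 'src_port', 'dst_port', 'label']
# Features: everything except the dropped columns and the target itself
X = df.drop(columns=columns_to_drop + ['label_bin'])
y = df['label_bin']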
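
Since only numeric features are meant to remain, a quick check can catch any leftover text column before it reaches the scaler. The sketch below runs on a made-up frame; the column names `pkt_count` and `proto` are hypothetical.

```python
import pandas as pd

# Hypothetical feature frame: one numeric column and one leftover text column
X_toy = pd.DataFrame({"pkt_count": [10, 12, 7], "proto": ["tcp", "tcp", "udp"]})

# List the columns whose dtype is not numeric
non_numeric = X_toy.select_dtypes(exclude="number").columns.tolist()
print("Non-numeric columns:", non_numeric)

# Keep only the numeric columns
X_numeric = X_toy.select_dtypes(include="number")
print(X_numeric.columns.tolist())
```

An empty `non_numeric` list on the real `X` would confirm that the drop list covers every text column.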


### 🔁 Define pipelines with and without SMOTE and prepare cross-validation


In [4]:
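# Baseline: scale the features, then fit the forest on the data as-is
pipeline_no_smote = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=13))
])

# imblearn's Pipeline applies SMOTE only during fit, so test folds are never resampled
pipeline_smote = ImbPipeline([
    ('smote', SMOTE(random_state=13)),
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=13))
])

# Stratified folds keep the class ratio roughly constant in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=13)
# Accumulators for the confusion matrices summed over all folds
cm_no_smote = np.zeros((2, 2), dtype=int)
cm_smote = np.zeros((2, 2), dtype=int)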
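
To see what stratification does to the splits, the sketch below runs the same `StratifiedKFold` configuration on placeholder labels whose class counts (87 and 8700) match the row totals of the confusion matrices reported at the end of this notebook. The arrays `X_toy` and `y_toy` are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder labels with the class counts seen in this notebook's output
y_toy = np.array([0] * 87 + [1] * 8700)
X_toy = np.zeros((len(y_toy), 1))  # feature values are irrelevant for splitting

cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=13)
fold_counts = []
for _, test_idx in cv_toy.split(X_toy, y_toy):
    # Count class 0 and class 1 samples in this test fold
    fold_counts.append(np.bincount(y_toy[test_idx], minlength=2).tolist())

print(fold_counts)
```

Each test fold receives a near-equal share of the rare class, so no fold is left without minority samples to evaluate on.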


### ⚙️ Train and evaluate both pipelines with 5-fold cross-validation


In [5]:
for train_idx, test_idx in cv.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # Without SMOTE
    pipeline_no_smote.fit(X_train, y_train)
    y_pred_no = pipeline_no_smote.predict(X_test)
    cm_no_smote += confusion_matrix(y_test, y_pred_no, labels=[0, 1])

    # With SMOTE
    pipeline_smote.fit(X_train, y_train)
    y_pred_smote = pipeline_smote.predict(X_test)
    cm_smote += confusion_matrix(y_test, y_pred_smote, labels=[0, 1])
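
The manual loop can also be expressed with scikit-learn's `cross_val_predict`, which returns one out-of-fold prediction per sample; a single confusion matrix over those predictions corresponds to summing the per-fold matrices. The sketch below shows this on synthetic data from `make_classification`. The data, the 90/10 class weights, and `n_estimators=20` are assumptions for the example, not this notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic imbalanced data standing in for the real flows
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=13
)

cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=13)
clf = RandomForestClassifier(n_estimators=20, random_state=13)

# Each sample is predicted once, by a model trained on the other folds
y_pred = cross_val_predict(clf, X_toy, y_toy, cv=cv_toy)
cm_total = confusion_matrix(y_toy, y_pred, labels=[0, 1])
print(cm_total)
```

Because `imblearn` pipelines follow the scikit-learn estimator interface, a pipeline such as `pipeline_smote` should be usable in place of `clf` here.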


### 📊 Display cumulative confusion matrices


In [6]:
print("Confusion matrix WITHOUT SMOTE (Random Forest):")
print(cm_no_smote)

print("\nConfusion matrix WITH SMOTE (Random Forest):")
print(cm_smote)


Confusion matrix WITHOUT SMOTE (Random Forest):
[[  87    0]
 [   0 8700]]

Confusion matrix WITH SMOTE (Random Forest):
[[  87    0]
 [   0 8700]]
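
Reading the matrices: with `labels=[0, 1]`, rows are true classes and columns are predictions, so the row totals show 87 normal flows and 8700 attack flows. Normal traffic is therefore the minority class here, and both setups classify every flow correctly. A minimal sketch of deriving per-class metrics from the printed matrix follows; the variable names are chosen for this example.

```python
import numpy as np

# Cumulative matrix printed above; rows are true labels, columns are predictions
cm = np.array([[87, 0],
               [0, 8700]])

tn, fp, fn, tp = cm.ravel()
support = cm.sum(axis=1)  # true samples per class: [normal, attack]

precision_attack = tp / (tp + fp)  # share of predicted attacks that are real
recall_attack = tp / (tp + fn)     # share of real attacks that were caught
recall_normal = tn / (tn + fp)     # share of normal flows left unflagged

print("Samples per class:", support.tolist())
print("Attack precision:", precision_attack)
print("Attack recall:", recall_attack)
print("Normal recall:", recall_normal)
```

Since the two matrices are identical, SMOTE produces no measurable difference for the random forest on this dataset.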
