# The project: Multiclass Classification on RT_IOT2022 Cybersecurity Dataset

This project performs automated benchmarking of multiple classification models on the **RT_IOT2022** dataset using a full **preprocessing + modeling pipeline**, with evaluation based on **Macro F1-Score** to account for class imbalance.

---

## 1. Class Labels in RT_IOT2022 Dataset

The `RT_IOT2022` dataset contains the following class labels in the `Attack_type` column:

## RT_IOT2022 Dataset – Actual Class Labels and Descriptions

## RT_IOT2022 – Class Labels and Simple Descriptions

1. **ARP_poisioning** – Attacker pretends to be another device to intercept communication on the network.

2. **DDOS_Slowloris** – Slow attack that keeps many connections open to overload a server.

3. **DOS_SYN_Hping** – Denial of Service attack using many fake TCP requests to crash the system.

4. **MQTT_Publish** – Normal IoT device sending data using the MQTT protocol.

5. **Metasploit_Brute_Force_SSH** – Trying many passwords to break into SSH using the Metasploit tool.

6. **NMAP_FIN_SCAN** – Stealth scan to find open ports without being easily detected.

7. **NMAP_OS_DETECTION** – Scanning to detect the operating system of a device.

8. **NMAP_TCP_scan** – Checking which TCP ports are open on a device.

9. **NMAP_UDP_SCAN** – Checking which UDP ports are open.

10. **NMAP_XMAS_TREE_SCAN** – Special scan using unusual flags to bypass firewalls and find open ports.

11. **Thing_Speak** – Normal traffic from an IoT device sending data to the ThingSpeak cloud.

12. **Wipro_bulb** – Normal traffic from a smart light bulb made by Wipro.


---

## 2. Models for Evaluation

The following classifiers are evaluated:
- `RandomForestClassifier`
- `XGBoostClassifier`
- `KNeighborsClassifier`
- `SVC` (Support Vector Classifier with RBF kernel)
- `LogisticRegression`

Each model is encapsulated in a `Pipeline` together with the preprocessor.

---

## 3. Evaluation Strategy: Stratified K-Fold Cross-Validation

- `StratifiedKFold(n_splits=5)` ensures class distribution is preserved across all 5 folds.
- Predictions are generated using `cross_val_predict` to evaluate model performance in a fair, cross-validated manner.

---

## 4. Metrics and Reporting

For each model:
- **Macro F1-Score** is printed (averages F1 across all classes equally).
- Full classification report is displayed, including per-class precision, recall, and F1.
- Confusion matrix is plotted with `seaborn.heatmap` and saved to the `confusion_matrices/` folder.


# The dataset: RT_IOT2022

## Source

The RT_IOT2022 dataset is provided by the UCI Machine Learning Repository. It was published in January 2024 and is designed specifically for research and development in intrusion detection systems (IDS) for Internet of Things (IoT) networks.

## Purpose

The dataset was created to provide realistic network traffic for IoT systems, including both normal behavior and various types of cyber-attacks. It enables researchers and developers to train and evaluate machine learning models for anomaly detection and classification tasks.


## The dataset: RT_IOT2022 — Feature Description Table

| **Column Name**                  | **Type**      | **Description**                                                                 |
|----------------------------------|---------------|---------------------------------------------------------------------------------|
| `id.orig_p`                      | Integer       | Source port number of the connection                                            |
| `id.resp_p`                      | Integer       | Destination (response) port number                                             |
| `proto`                          | Categorical   | Transport protocol used (e.g., TCP, UDP)                                       |
| `service`                        | Continuous    | Application service (e.g., HTTP, MQTT) provided by Zeek                        |
| `flow_duration`                  | Continuous    | Total duration of the network flow                                             |
| `fwd_pkts_tot`                   | Integer       | Total number of packets sent in the forward direction                          |
| `bwd_pkts_tot`                   | Integer       | Total number of packets in the backward direction                              |
| `fwd_data_pkts_tot`              | Integer       | Total number of data packets (forward direction)                              |
| `bwd_data_pkts_tot`              | Integer       | Total number of data packets (reverse direction)                              |
| `fwd_pkts_per_sec`               | Continuous    | Packet rate per second (forward direction)                                     |
| `bwd_pkts_per_sec`               | Continuous    | Packet rate per second (reverse direction)                                     |
| `flow_pkts_per_sec`              | Continuous    | Overall packet rate per second across the flow                                 |
| `down_up_ratio`                  | Continuous    | Ratio of downstream to upstream traffic volume                                 |
| `fwd_header_size_tot`            | Integer       | Total header bytes (forward direction)                                         |
| `fwd_header_size_min`            | Integer       | Smallest header size observed (forward)                                        |
| `fwd_header_size_max`            | Integer       | Largest header size observed (forward)                                         |
| `bwd_header_size_tot`            | Integer       | Total header bytes (backward direction)                                        |
| `bwd_header_size_min`            | Integer       | Smallest header size observed (backward)                                       |
| `bwd_header_size_max`            | Integer       | Largest header size observed (backward)                                        |
| `flow_FIN_flag_count`            | Integer       | Count of TCP FIN flags within the flow                                         |
| `flow_SYN_flag_count`            | Integer       | Count of TCP SYN flags within the flow                                         |
| `flow_RST_flag_count`            | Integer       | Count of TCP RST flags within the flow                                         |
| `fwd_PSH_flag_count`             | Integer       | Count of forward TCP PSH flags                                                 |
| `bwd_PSH_flag_count`             | Integer       | Count of backward TCP PSH flags                                                |
| `flow_ACK_flag_count`            | Integer       | Count of TCP ACK flags across the flow                                         |
| `fwd_URG_flag_count`             | Integer       | Count of forward TCP URG flags                                                 |
| `bwd_URG_flag_count`             | Integer       | Count of backward TCP URG flags                                                |
| `flow_CWR_flag_count`            | Integer       | Count of TCP CWR flags within the flow                                         |
| `flow_ECE_flag_count`            | Integer       | Count of TCP ECE flags within the flow                                         |
| `fwd_pkts_payload.min`           | Integer       | Minimum payload size in forward packets                                        |
| `fwd_pkts_payload.max`           | Integer       | Maximum payload size in forward packets                                        |
| `fwd_pkts_payload.tot`           | Integer       | Total payload size across forward packets                                      |
| `fwd_pkts_payload.avg`           | Continuous    | Average payload size in forward packets                                        |
| `fwd_pkts_payload.std`           | Continuous    | Standard deviation of forward payload sizes                                   |
| `bwd_pkts_payload.min`           | Integer       | Minimum payload size in backward packets                                       |
| `bwd_pkts_payload.max`           | Integer       | Maximum payload size in backward packets                                       |
| `bwd_pkts_payload.tot`           | Integer       | Total payload size across backward packets                                     |
| `bwd_pkts_payload.avg`           | Continuous    | Average payload size in backward packets                                       |
| `bwd_pkts_payload.std`           | Continuous    | Standard deviation of backward payload sizes                                  |
| `flow_pkts_payload.min`          | Integer       | Minimum payload size across all packets in the flow                            |
| `flow_pkts_payload.max`          | Integer       | Maximum payload size across the flow                                           |
| `flow_pkts_payload.tot`          | Integer       | Total payload size across the entire flow                                      |
| `flow_pkts_payload.avg`          | Continuous    | Average payload size across the flow                                           |
| `flow_pkts_payload.std`          | Continuous    | Standard deviation of payload sizes across the flow                            |
| `fwd_iat.min`                    | Continuous    | Minimum inter-arrival time (forward direction)                                |
| `fwd_iat.max`                    | Continuous    | Maximum inter-arrival time (forward)                                          |
| `fwd_iat.tot`                    | Continuous    | Total inter-arrival time (forward)                                            |
| `fwd_iat.avg`                    | Continuous    | Average inter-arrival time (forward)                                          |
| `fwd_iat.std`                    | Continuous    | Standard deviation of inter-arrival times (forward)                           |
| `bwd_iat.min`                    | Continuous    | Minimum inter-arrival time (backward direction)                               |
| `bwd_iat.max`                    | Continuous    | Maximum inter-arrival time (backward)                                         |
| `bwd_iat.tot`                    | Continuous    | Total inter-arrival time (backward)                                           |
| `bwd_iat.avg`                    | Continuous    | Average inter-arrival time (backward)                                         |
| `bwd_iat.std`                    | Continuous    | Standard deviation of backward inter-arrival times                            |
| `flow_iat.min`                   | Continuous    | Minimum inter-arrival time across entire flow                                 |
| `flow_iat.max`                   | Continuous    | Maximum inter-arrival time across flow                                        |
| `flow_iat.tot`                   | Continuous    | Total inter-arrival time across flow                                          |
| `flow_iat.avg`                   | Continuous    | Average inter-arrival time across the flow                                    |
| `flow_iat.std`                   | Continuous    | Standard deviation of inter-arrival times across the flow                     |
| `payload_bytes_per_second`       | Continuous    | Average payload throughput per second                                          |
| `fwd_subflow_pkts`               | Integer       | Forward subflow packet count                                                  |
| `bwd_subflow_pkts`               | Integer       | Backward subflow packet count                                                 |
| `fwd_subflow_bytes`              | Integer       | Forward subflow total bytes                                                   |
| `bwd_subflow_bytes`              | Integer       | Backward subflow total bytes                                                  |
| `fwd_bulk_bytes`                 | Integer       | Bulk (large) bytes in forward direction                                       |
| `bwd_bulk_bytes`                 | Integer       | Bulk bytes in backward direction                                              |
| `fwd_bulk_packets`               | Integer       | Number of bulk packets (forward direction)                                    |
| `bwd_bulk_packets`               | Integer       | Number of bulk packets (backward)                                            |
| `fwd_bulk_rate`                  | Continuous    | Rate of bulk data in forward direction                                        |
| `bwd_bulk_rate`                  | Continuous    | Rate of bulk data in backward direction                                       |
| `active.min`                     | Continuous    | Minimum duration of active flow segments                                      |
| `active.max`                     | Continuous    | Maximum duration of active segments                                           |
| `active.tot`                     | Continuous    | Total duration of active flow segments                                        |
| `active.avg`                     | Continuous    | Average duration of active segment durations                                 |
| `active.std`                     | Continuous    | Standard deviation of active segment durations                               |
| `idle.min`                       | Continuous    | Minimum duration of idle times                                                |
| `idle.max`                       | Continuous    | Maximum duration of idle periods                                              |
| `idle.tot`                       | Continuous    | Total idle time duration                                                      |
| `idle.avg`                       | Continuous    | Average idle period duration                                                  |
| `idle.std`                       | Continuous    | Standard deviation of idle durations                                          |
| `fwd_init_window_size`           | Integer       | Initial TCP window size (forward direction)                                  |
| `bwd_init_window_size`           | Integer       | Initial TCP window size (backward direction)                                 |
| `fwd_last_window_size`           | Integer       | Last observed TCP window size (forward)                                       |
| `Attack_type`                    | Categorical   | Target variable: label for each record, indicating normal vs various attacks |


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [10]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset
df = pd.read_csv("/content/drive/MyDrive/Portfolio datasets/cyber/RT_IOT2022.csv")

# Drop rows with empty label
df = df[df['Attack_type'].notnull()]
df.head()

Unnamed: 0.1,Unnamed: 0,id.orig_p,id.resp_p,proto,service,flow_duration,fwd_pkts_tot,bwd_pkts_tot,fwd_data_pkts_tot,bwd_data_pkts_tot,...,active.std,idle.min,idle.max,idle.tot,idle.avg,idle.std,fwd_init_window_size,bwd_init_window_size,fwd_last_window_size,Attack_type
0,0,38667,1883,tcp,mqtt,32.011598,9,5,3,3,...,0.0,29729180.0,29729180.0,29729180.0,29729180.0,0.0,64240,26847,502,MQTT_Publish
1,1,51143,1883,tcp,mqtt,31.883584,9,5,3,3,...,0.0,29855280.0,29855280.0,29855280.0,29855280.0,0.0,64240,26847,502,MQTT_Publish
2,2,44761,1883,tcp,mqtt,32.124053,9,5,3,3,...,0.0,29842150.0,29842150.0,29842150.0,29842150.0,0.0,64240,26847,502,MQTT_Publish
3,3,60893,1883,tcp,mqtt,31.961063,9,5,3,3,...,0.0,29913770.0,29913770.0,29913770.0,29913770.0,0.0,64240,26847,502,MQTT_Publish
4,4,51087,1883,tcp,mqtt,31.902362,9,5,3,3,...,0.0,29814700.0,29814700.0,29814700.0,29814700.0,0.0,64240,26847,502,MQTT_Publish


In [6]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset
df = pd.read_csv("/content/drive/MyDrive/Portfolio datasets/cyber/RT_IOT2022.csv")

# Drop rows with empty label
df = df[df['Attack_type'].notnull()]

# Encode the label
le = LabelEncoder()
df['Attack_type'] = le.fit_transform(df['Attack_type'])
class_names = le.classes_

# Separate features and label
X = df.drop('Attack_type', axis=1)
y = df['Attack_type']

# Detect column types
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

# Define transformers
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Define models to test
models = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "XGBoost": XGBClassifier(eval_metric='logloss'),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVC": SVC(kernel='rbf'),
    "LogisticRegression": LogisticRegression(max_iter=1000)
}

# Stratified K-Fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Directory to save models
os.makedirs("models", exist_ok=True)
os.makedirs("confusion_matrices", exist_ok=True)

for name, clf in models.items():
    print(f"\n=== {name} ===")

    # Full pipeline
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', clf)
    ])

    # Cross-validated predictions (not probabilities)
    y_pred = cross_val_predict(pipeline, X, y, cv=skf)

    # Evaluation
    print(classification_report(y, y_pred, target_names=class_names, zero_division=0))


    # Confusion Matrix
    cm = confusion_matrix(y, y_pred)
    plt.figure(figsize=(10, 8))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Purples',
            xticklabels=class_names, yticklabels=class_names)
    plt.title(f"Confusion Matrix - {name}")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.tight_layout()
    plt.savefig(f"confusion_matrices/{name}_confusion_matrix.png")
    plt.close()

    # Save trained pipeline on entire dataset
    pipeline.fit(X, y)
    joblib.dump(pipeline, f"models/{name}_pipeline.pkl")



=== RandomForest ===
                            precision    recall  f1-score   support

            ARP_poisioning       0.99      1.00      0.99      7750
            DDOS_Slowloris       0.98      0.98      0.98       534
             DOS_SYN_Hping       1.00      1.00      1.00     94659
              MQTT_Publish       1.00      1.00      1.00      4146
Metasploit_Brute_Force_SSH       0.94      0.78      0.85        37
             NMAP_FIN_SCAN       0.92      0.86      0.89        28
         NMAP_OS_DETECTION       1.00      1.00      1.00      2000
             NMAP_TCP_scan       1.00      1.00      1.00      1002
             NMAP_UDP_SCAN       0.99      0.99      0.99      2590
       NMAP_XMAS_TREE_SCAN       1.00      1.00      1.00      2010
               Thing_Speak       1.00      0.99      1.00      8108
                Wipro_bulb       0.99      0.96      0.97       253

                  accuracy                           1.00    123117
                 macro a