# IoT Intrusion Detection System – Full Pipeline (Colab Ready)

This notebook implements a **complete IoT IDS pipeline** based on the architecture you described:

1. **Data loading from Google Drive**
2. **Data preprocessing & cleaning**
3. **Exploratory data analysis (EDA)** of the class distribution and missing values
4. **SMOTE** for handling class imbalance
5. **Feature scaling (MinMaxScaler)**
6. **Hybrid feature selection**  
   - Binary Grey Wolf Optimizer (**BGWO**)  
   - **RFE–XGBoost** for fine-grained feature ranking
7. **Hyperparameter optimization** of XGBoost using **Bayesian Optimization (BO-TPE via Hyperopt)**
8. **Final XGBoost training & evaluation**
9. **Checkpoint saving** (model, scaler, label encoder, feature masks, metrics, hyperparameters) to Google Drive

> 🔧 **Important**: You must update the `DATASETS` dictionary (paths and label column names) to match your actual dataset files in Google Drive.


In [None]:
# =========================
# 0. Install Dependencies
# =========================
# Run this cell once when you start a new Colab session.

!pip install xgboost hyperopt imbalanced-learn joblib

In [None]:
# =========================
# 1. Imports & Basic Setup
# =========================

import os
import json
import warnings

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.feature_selection import RFE

from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

from hyperopt import fmin, tpe, hp, Trials, STATUS_OK

import matplotlib.pyplot as plt
import seaborn as sns

import joblib

warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

plt.rcParams['figure.figsize'] = (7, 5)
plt.rcParams['axes.grid'] = True

In [None]:
# =======================================
# 2. Mount Google Drive & Dataset Configs
# =======================================

from google.colab import drive
drive.mount('/content/drive')

# Root folder where you will store ALL datasets
DATA_ROOT = "/content/drive/MyDrive/IoT_IDS_datasets"

# Root folder for saving checkpoints (models, scalers, etc.)
CHECKPOINT_ROOT = "/content/drive/MyDrive/IoT_IDS_checkpoints"
os.makedirs(CHECKPOINT_ROOT, exist_ok=True)

# ---------------------------
# DATASET CONFIGURATIONS
# ---------------------------
# 🔴 IMPORTANT:
# 1. Update `path` to your actual CSV file in Drive.
# 2. Set `label_col` to the correct label/target column.
# 3. Optionally use `drop_cols` to remove ID or timestamp columns.
# 4. Optionally define `binary_mapping` for binary experiments (normal vs attack).

DATASETS = {
    "N_BaIoT": {
        "path": f"{DATA_ROOT}/N_BaIoT_sample.csv",  # TODO: change filename
        "label_col": "label",                       # TODO: change to actual label column
        "drop_cols": [],                            # e.g., ['id', 'timestamp']
        "binary_mapping": None                      # or a dict mapping original labels to 'normal'/'attack'
    },
    "BoT_IoT": {
        "path": f"{DATA_ROOT}/BoT_IoT_sample.csv",
        "label_col": "label",
        "drop_cols": [],
        "binary_mapping": None
    },
    "WUSTL_IIOT_2021": {
        "path": f"{DATA_ROOT}/WUSTL_IIOT_2021_sample.csv",
        "label_col": "label",
        "drop_cols": [],
        "binary_mapping": None
    },
    "WUSTL_EHMS_2020": {
        "path": f"{DATA_ROOT}/WUSTL_EHMS_2020_sample.csv",
        "label_col": "label",
        "drop_cols": [],
        "binary_mapping": None
    },
    "NSL_KDD": {
        "path": f"{DATA_ROOT}/NSL_KDD_sample.csv",
        "label_col": "label",
        "drop_cols": [],
        "binary_mapping": None
    },
}
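
The `binary_mapping` option is applied with `Series.map` during preprocessing, so any label missing from the dictionary becomes NaN and its row is dropped. A minimal sketch, assuming hypothetical class names (`dos`, `probe`, `unknown_class`) that are not taken from any of the datasets above:

```python
import pandas as pd

# Hypothetical raw labels; real datasets use their own class names.
binary_mapping = {"normal": "normal", "dos": "attack", "probe": "attack"}

labels = pd.Series(["normal", "dos", "probe", "unknown_class"])

mapped = labels.map(binary_mapping)  # labels not in the dict become NaN
kept = mapped.dropna()               # the pipeline drops those rows

print(kept.tolist())
```

Listing every original class in the mapping avoids silently losing rows for classes that were forgotten.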

In [None]:
# =====================================
# 3. Data Loading, EDA & Preprocessing
# =====================================

def load_dataset(config):
    """
    Load dataset from CSV based on the configuration dictionary.
    """
    path = config["path"]
    print(f"[INFO] Loading dataset from: {path}")
    df = pd.read_csv(path)
    print(f"[INFO] Raw shape: {df.shape}")
    return df


def explore_dataset(df, label_col):
    """
    Basic exploratory data analysis (EDA):
    - Show first rows
    - Summary info
    - Class distribution for labels
    - Missing values per column
    """
    print("\n[EDA] Head:")
    display(df.head())

    print("\n[EDA] Info:")
    df.info()  # prints its report directly and returns None, so no print() wrapper

    print("\n[EDA] Label distribution:")
    if label_col in df.columns:
        print(df[label_col].value_counts())
        df[label_col].value_counts().plot(kind='bar', title='Label Distribution')
        plt.xlabel('Class')
        plt.ylabel('Count')
        plt.show()
    else:
        print(f"Label column '{label_col}' not found in dataframe.")

    print("\n[EDA] Missing values per column:")
    print(df.isna().sum().sort_values(ascending=False).head(20))


def preprocess_dataset(df, label_col, drop_cols=None, binary_mapping=None):
    """
    Full preprocessing pipeline:

    1. Drop unwanted columns (IDs, timestamps, etc.).
    2. Drop rows where label is missing.
    3. Optional binary mapping (e.g., normal vs attack).
    4. Drop duplicate rows.
    5. Separate features (X) and labels (y).
    6. Handle numerical + categorical features:
       - Convert obvious numeric strings to numeric.
       - Impute missing values (median for numeric, mode for categorical columns).
       - One-hot encode remaining categorical columns.
    7. Fill any remaining NaNs with 0 as a safety net.
    8. Label-encode target variable.
    """
    df = df.copy()

    # 1) Drop columns if requested
    if drop_cols:
        df = df.drop(columns=drop_cols, errors="ignore")

    # 2) Drop rows where label is missing
    df = df.dropna(subset=[label_col])

    # 3) Optional binary mapping
    if binary_mapping is not None:
        df[label_col] = df[label_col].map(binary_mapping)
        # Drop rows with unmapped labels
        df = df.dropna(subset=[label_col])

    # 4) Remove duplicates
    df = df.drop_duplicates()

    # 5) Separate features and labels
    y_raw = df[label_col]
    X_raw = df.drop(columns=[label_col])

    # 6) Convert obviously numeric strings to numeric
    for col in X_raw.columns:
        if X_raw[col].dtype == 'object':
            try:
                X_raw[col] = pd.to_numeric(X_raw[col])
            except Exception:
                # keep as object for one-hot encoding later
                pass

    # Identify categorical (object) columns
    cat_cols = X_raw.select_dtypes(include=['object', 'category']).columns.tolist()
    num_cols = [c for c in X_raw.columns if c not in cat_cols]

    # Handle missing values separately for numeric and categorical
    if num_cols:
        X_raw[num_cols] = X_raw[num_cols].fillna(X_raw[num_cols].median())
    if cat_cols:
        X_raw[cat_cols] = X_raw[cat_cols].fillna(X_raw[cat_cols].mode().iloc[0])

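    # One-hot encode categorical columns (dtype=float keeps the matrix numeric;
    # recent pandas versions otherwise emit boolean dummy columns)
    X = pd.get_dummies(X_raw, columns=cat_cols, drop_first=True, dtype=float)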

    # As a safety net, fill any remaining NaNs
    X = X.fillna(0)

    # 8) Label encode target
    y_encoder = LabelEncoder()
    y = y_encoder.fit_transform(y_raw)

    print(f"[PREPROCESS] After preprocessing: X shape = {X.shape}, y shape = {y.shape}")
    print(f"[PREPROCESS] Number of classes: {len(np.unique(y))}")

    return X.values, y, y_encoder
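
The feature-handling steps in `preprocess_dataset` can be traced on a toy frame. This is a small sketch, assuming two made-up columns (`duration`, `proto`) rather than columns from a real dataset:

```python
import pandas as pd

# Toy data: column names and values are illustrative assumptions.
toy = pd.DataFrame({
    "duration": ["1", "2", None, "4"],     # numeric values stored as strings
    "proto": ["tcp", "udp", "tcp", None],  # genuinely categorical
})

# Numeric strings convert cleanly; the missing entry becomes NaN.
toy["duration"] = pd.to_numeric(toy["duration"])

# Median of [1, 2, 4] is 2.0, so the gap is filled with 2.0.
toy["duration"] = toy["duration"].fillna(toy["duration"].median())

# Mode of ['tcp', 'udp', 'tcp'] is 'tcp'.
toy["proto"] = toy["proto"].fillna(toy["proto"].mode().iloc[0])

# drop_first=True drops the first category ('tcp'), leaving only 'proto_udp'.
encoded = pd.get_dummies(toy, columns=["proto"], drop_first=True, dtype=float)
print(encoded)
```

With `drop_first=True`, a row of all-zero dummy columns stands for the dropped category, which keeps the encoded matrix one column narrower per categorical feature.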

In [None]:
# =========================================
# 4. SMOTE Oversampling & Feature Scaling
# =========================================

def apply_smote_and_scale(X_train, y_train, X_test, random_state=42):
    """
    Apply SMOTE to handle class imbalance, then MinMax scaling.

    Returns:
    - X_res_scaled: Resampled and scaled training features
    - y_res: Resampled labels
    - X_test_scaled: Scaled test features
    - scaler: Fitted MinMaxScaler (for reuse on new data)
    """
    sm = SMOTE(random_state=random_state)
    X_res, y_res = sm.fit_resample(X_train, y_train)
    print(f"[SMOTE] Train before: {X_train.shape}, after: {X_res.shape}")

    scaler = MinMaxScaler()
    X_res_scaled = scaler.fit_transform(X_res)
    X_test_scaled = scaler.transform(X_test)

    return X_res_scaled, y_res, X_test_scaled, scaler
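
SMOTE creates each synthetic minority sample by interpolating between a minority point and one of its nearest minority-class neighbours. The sketch below illustrates that idea in plain NumPy; `smote_like_samples` is a hypothetical helper written for this note, not the `imblearn` implementation, and it ignores details such as per-class sampling targets.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Generate n_new points on segments between minority samples and
    one of their k nearest minority neighbours (simplified sketch)."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances; a point is never its own neighbour.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)         # seed points
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))                           # interpolation factor in [0, 1)

    return X_min[base] + lam * (X_min[chosen] - X_min[base])

# Four minority points on the corners of the unit square (toy data).
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=6)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two real minority points, it stays inside the region the minority class already occupies.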

In [None]:
# ====================================
# 5. Binary Grey Wolf Optimizer (BGWO)
# ====================================

def evaluate_feature_subset(mask, X, y, random_state=42):
    """
    Fitness function for BGWO:
    - Train a lightweight XGBoost model on the selected features.
    - Use validation accuracy.
    - Slightly penalize using too many features.
    """
    if mask.sum() == 0:
        return 0.0  # invalid: no features

    X_sub = X[:, mask]

    X_tr, X_val, y_tr, y_val = train_test_split(
        X_sub, y, test_size=0.3, stratify=y, random_state=random_state
    )

    # Lightweight XGBoost for fitness evaluation (fast but effective)
    clf = XGBClassifier(
        n_estimators=50,
        max_depth=5,
        learning_rate=0.1,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='multi:softmax' if len(np.unique(y)) > 2 else 'binary:logistic',
        eval_metric='mlogloss' if len(np.unique(y)) > 2 else 'logloss',
        tree_method='hist',
        random_state=random_state,
        n_jobs=-1,
        verbosity=0
    )

    clf.fit(X_tr, y_tr)
    y_pred = clf.predict(X_val)
    acc = accuracy_score(y_val, y_pred)

    # Penalty for many features
    feat_ratio = mask.sum() / len(mask)
    fitness = 0.99 * acc + 0.01 * (1.0 - feat_ratio)
    return fitness


def bgwo_feature_selection(
    X, y,
    num_wolves=10,
    max_iter=15,
    subsample=8000,
    random_state=42
):
    """
    Binary Grey Wolf Optimization for feature selection.

    - X, y: Training data (scaled).
    - num_wolves: population size.
    - max_iter: number of iterations.
    - subsample: if dataset is huge, we use a subset to speed up fitness evaluation.

    Returns:
    - best_mask: boolean array indicating selected features.
    """
    np.random.seed(random_state)
    n_samples, n_features = X.shape

    # Subsample to speed up fitness evaluation on large datasets
    if n_samples > subsample:
        idx = np.random.choice(n_samples, subsample, replace=False)
        X_eval = X[idx]
        y_eval = y[idx]
    else:
        X_eval, y_eval = X, y

    # Initialize wolves in continuous space [0, 1]
    positions = np.random.rand(num_wolves, n_features)

    # Sigmoid-based binarization
    def continuous_to_binary(pos):
        s = 1 / (1 + np.exp(-10 * (pos - 0.5)))
        rand_mat = np.random.rand(*pos.shape)
        return (s > rand_mat).astype(int)

    # Initialize alpha, beta, delta wolves
    alpha_pos = None
    beta_pos = None
    delta_pos = None
    alpha_mask = None  # binary mask that actually achieved alpha_score
    alpha_score = -np.inf
    beta_score = -np.inf
    delta_score = -np.inf

    for iter_idx in range(max_iter):
        # Parameter 'a' decreases linearly from 2 to 0
        a = 2 - 2 * (iter_idx / max_iter)

        for i in range(num_wolves):
            bin_mask = continuous_to_binary(positions[i]).astype(bool)
            fitness = evaluate_feature_subset(bin_mask, X_eval, y_eval, random_state)

            # Update alpha, beta, delta
            if fitness > alpha_score:
                delta_score, delta_pos = beta_score, beta_pos
                beta_score, beta_pos = alpha_score, alpha_pos
                alpha_score, alpha_pos = fitness, positions[i].copy()
                alpha_mask = bin_mask.copy()
            elif fitness > beta_score:
                delta_score, delta_pos = beta_score, beta_pos
                beta_score, beta_pos = fitness, positions[i].copy()
            elif fitness > delta_score:
                delta_score, delta_pos = fitness, positions[i].copy()

        # Position update for all wolves
        for i in range(num_wolves):
            for j in range(n_features):
                r1, r2 = np.random.rand(), np.random.rand()
                A1 = 2 * a * r1 - a
                C1 = 2 * r2
                D_alpha = abs(C1 * alpha_pos[j] - positions[i, j])
                X1 = alpha_pos[j] - A1 * D_alpha

                r1, r2 = np.random.rand(), np.random.rand()
                A2 = 2 * a * r1 - a
                C2 = 2 * r2
                D_beta = abs(C2 * beta_pos[j] - positions[i, j])
                X2 = beta_pos[j] - A2 * D_beta

                r1, r2 = np.random.rand(), np.random.rand()
                A3 = 2 * a * r1 - a
                C3 = 2 * r2
                D_delta = abs(C3 * delta_pos[j] - positions[i, j])
                X3 = delta_pos[j] - A3 * D_delta

                positions[i, j] = (X1 + X2 + X3) / 3.0

        # Clip positions to [0, 1]
        positions = np.clip(positions, 0, 1)

        print(f"[BGWO] Iter {iter_idx + 1}/{max_iter} - best fitness: {alpha_score:.6f}")

    # Return the mask that was actually evaluated. Re-binarizing alpha_pos here
    # would draw a new random mask that may differ from the scored one.
    best_binary = alpha_mask
    print(f"[BGWO] Selected {best_binary.sum()} / {n_features} features")
    return best_binary
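
Two pieces of the BGWO code above are easy to check by hand: the sigmoid transfer function that turns a continuous position into a selection probability, and the fitness that trades accuracy against subset size. The sketch below restates both formulas as standalone helpers; the names `transfer` and `fitness` and the accuracy value 0.95 are illustrative assumptions.

```python
import numpy as np

def transfer(pos):
    """Steep sigmoid centred at 0.5: probability that a feature is selected."""
    return 1.0 / (1.0 + np.exp(-10.0 * (pos - 0.5)))

def fitness(acc, mask):
    """99% weight on accuracy, 1% reward for using fewer features."""
    feat_ratio = mask.sum() / len(mask)
    return 0.99 * acc + 0.01 * (1.0 - feat_ratio)

probs = transfer(np.array([0.0, 0.5, 1.0]))
print(probs)  # near 0 at position 0, exactly 0.5 at 0.5, near 1 at position 1

# Same hypothetical accuracy, different subset sizes (5 vs. 20 of 20 features).
small_mask = np.array([True] * 5 + [False] * 15)
full_mask = np.ones(20, dtype=bool)

fit_small = fitness(0.95, small_mask)  # 0.99 * 0.95 + 0.01 * 0.75 = 0.948
fit_full = fitness(0.95, full_mask)    # 0.99 * 0.95 + 0.01 * 0.00 = 0.9405
print(fit_small, fit_full)
```

With these weights the size penalty only breaks ties between subsets of nearly equal accuracy: removing every feature but one can add at most about 0.01 to the fitness.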

In [None]:
# =============================================
# 6. RFE-XGBoost (Refined Feature Selection)
# =============================================

def rfe_xgboost_feature_selection(X, y, initial_mask, random_state=42):
    """
    Second-stage feature selection:

    1. Restrict features to those selected by BGWO.
    2. Apply RFE (Recursive Feature Elimination) using XGBoost as base estimator.
    3. Return a final boolean mask over ALL original features.

    Parameters:
    - X, y: training data (after SMOTE & scaling).
    - initial_mask: boolean mask of features selected by BGWO.

    Returns:
    - final_mask: boolean mask over all original features.
    """
    selected_indices = np.where(initial_mask)[0]
    X_reduced = X[:, selected_indices]

    base_clf = XGBClassifier(
        n_estimators=100,
        max_depth=6,
        learning_rate=0.1,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='multi:softmax' if len(np.unique(y)) > 2 else 'binary:logistic',
        eval_metric='mlogloss' if len(np.unique(y)) > 2 else 'logloss',
        tree_method='hist',
        random_state=random_state,
        n_jobs=-1,
        verbosity=0
    )

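    # Keep about 50% of BGWO-selected features (tunable), with a floor of 5,
    # but never request more features than BGWO actually kept
    n_features_bgwo = X_reduced.shape[1]
    n_to_select = min(n_features_bgwo, max(5, n_features_bgwo // 2))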

    print(f"[RFE] Running RFE on {n_features_bgwo} features, selecting {n_to_select}...")
    rfe = RFE(
        estimator=base_clf,
        n_features_to_select=n_to_select,
        step=0.1
    )
    rfe.fit(X_reduced, y)
    support_reduced = rfe.support_  # mask among BGWO features

    # Map back to full feature space
    final_mask = np.zeros(X.shape[1], dtype=bool)
    final_mask[selected_indices[support_reduced]] = True

    print(f"[RFE] Final selected features: {final_mask.sum()} / {X.shape[1]}")
    return final_mask
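
The last step of `rfe_xgboost_feature_selection` maps a mask defined over the BGWO-selected columns back to the full feature space. A small sketch of that index arithmetic, using made-up masks instead of real BGWO or RFE output:

```python
import numpy as np

# Hypothetical BGWO result over 6 original features.
initial_mask = np.array([True, False, True, True, False, True])
selected_indices = np.where(initial_mask)[0]  # original indices 0, 2, 3, 5

# Hypothetical RFE support over the 4 BGWO-selected features.
support_reduced = np.array([True, False, False, True])

# Positions kept by RFE are translated back to original column indices.
final_mask = np.zeros(initial_mask.size, dtype=bool)
final_mask[selected_indices[support_reduced]] = True

print(final_mask.tolist())
```

The final mask can only keep features that BGWO kept, so it is always a subset of `initial_mask` and can be applied directly to the full-width train and test matrices.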

In [None]:
# ==================================================================
# 7. Hyperparameter Optimization (BO-TPE) & Final Model Evaluation
# ==================================================================

def bo_tpe_optimize_xgboost(X, y, max_evals=25, random_state=42):
    """
    Bayesian Optimization (Tree-structured Parzen Estimator - TPE)
    to tune key XGBoost hyperparameters.

    Returns:
    - best_params: dictionary with best n_estimators, max_depth, learning_rate
    """
    n_classes = len(np.unique(y))
    objective_type = 'multi:softprob' if n_classes > 2 else 'binary:logistic'
    eval_metric = 'mlogloss' if n_classes > 2 else 'logloss'

    def objective(params):
        params = dict(params)

        clf = XGBClassifier(
            n_estimators=int(params['n_estimators']),
            max_depth=int(params['max_depth']),
            learning_rate=float(params['learning_rate']),
            subsample=0.8,
            colsample_bytree=0.8,
            objective=objective_type,
            eval_metric=eval_metric,
            tree_method='hist',
            random_state=random_state,
            n_jobs=-1,
            verbosity=0
        )

        cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=random_state)
        scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy', n_jobs=-1)
        acc = scores.mean()
        loss = 1.0 - acc

        return {
            'loss': loss,
            'status': STATUS_OK,
            'acc': acc
        }

    search_space = {
        'n_estimators': hp.quniform('n_estimators', 50, 200, 10),
        'max_depth': hp.quniform('max_depth', 3, 12, 1),
        'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.5))
    }

    trials = Trials()
    best = fmin(
        fn=objective,
        space=search_space,
        algo=tpe.suggest,
        max_evals=max_evals,
        trials=trials,
        rstate=np.random.default_rng(random_state)  # hyperopt >= 0.2.7 expects a NumPy Generator
    )

    best_params = {
        'n_estimators': int(best['n_estimators']),
        'max_depth': int(best['max_depth']),
        'learning_rate': float(best['learning_rate'])
    }
    print(f"[BO-TPE] Best params: {best_params}")
    return best_params


def train_evaluate_xgboost(X_train, y_train, X_test, y_test, params, random_state=42):
    """
    Train final XGBoost with optimized hyperparameters and evaluate on test set.

    Returns:
    - clf: trained XGBClassifier
    - metrics: (accuracy, precision, recall, f1)
    """
    n_classes = len(np.unique(y_train))
    objective_type = 'multi:softmax' if n_classes > 2 else 'binary:logistic'
    eval_metric = 'mlogloss' if n_classes > 2 else 'logloss'

    clf = XGBClassifier(
        n_estimators=params['n_estimators'],
        max_depth=params['max_depth'],
        learning_rate=params['learning_rate'],
        subsample=0.8,
        colsample_bytree=0.8,
        objective=objective_type,
        eval_metric=eval_metric,
        tree_method='hist',
        random_state=random_state,
        n_jobs=-1,
        verbosity=0
    )

    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    rec = recall_score(y_test, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)

    print("\n[RESULTS]")
    print(f"Accuracy : {acc:.6f}")
    print(f"Precision: {prec:.6f}")
    print(f"Recall   : {rec:.6f}")
    print(f"F1-score : {f1:.6f}")

    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(7, 6))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
    plt.title("Confusion Matrix")
    plt.xlabel("Predicted")
    plt.ylabel("True")
    plt.show()

    return clf, (acc, prec, rec, f1)
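
The evaluation above reports `average='weighted'` metrics, which weight each class's score by its share of the true labels. The sketch below recomputes them from a confusion matrix in NumPy; the 2x2 matrix is a made-up example, not a result from this pipeline.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predictions.
cm = np.array([[8, 2],
               [1, 9]])

support = cm.sum(axis=1)          # true samples per class
tp = np.diag(cm)                  # correct predictions per class

precision = tp / cm.sum(axis=0)   # per-class precision
recall = tp / support             # per-class recall
f1 = 2 * precision * recall / (precision + recall)

weights = support / support.sum()
weighted_precision = float((weights * precision).sum())
weighted_recall = float((weights * recall).sum())
weighted_f1 = float((weights * f1).sum())
accuracy = float(tp.sum() / cm.sum())

print(accuracy, weighted_precision, weighted_recall, weighted_f1)
```

Weighted recall always equals accuracy, because the class weights cancel the per-class denominators. On imbalanced test sets it is therefore worth also inspecting per-class or macro-averaged scores, which this notebook does not compute.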

In [None]:
# =======================================
# 8. Checkpoint Saving (for Next Phases)
# =======================================

def save_checkpoints(
    save_dir,
    scaler,
    label_encoder,
    bgwo_mask,
    final_mask,
    best_params,
    metrics,
    model
):
    """
    Save all important artifacts for later reuse.

    Files created inside save_dir:
    - scaler.joblib          : MinMaxScaler
    - label_encoder.joblib   : LabelEncoder for target
    - bgwo_mask.npy          : Boolean mask after BGWO
    - final_mask.npy         : Boolean mask after RFE (final features)
    - best_params.json       : Best hyperparameters found by BO-TPE
    - metrics.json           : Accuracy, Precision, Recall, F1
    - xgb_model.json         : Trained XGBoost model (booster format)
    """
    os.makedirs(save_dir, exist_ok=True)

    # Scaler & label encoder
    joblib.dump(scaler, os.path.join(save_dir, "scaler.joblib"))
    joblib.dump(label_encoder, os.path.join(save_dir, "label_encoder.joblib"))

    # Feature masks
    np.save(os.path.join(save_dir, "bgwo_mask.npy"), bgwo_mask.astype(bool))
    np.save(os.path.join(save_dir, "final_mask.npy"), final_mask.astype(bool))

    # Hyperparameters & metrics
    with open(os.path.join(save_dir, "best_params.json"), "w") as f:
        json.dump(best_params, f, indent=4)

    metrics_dict = {
        "accuracy": float(metrics[0]),
        "precision": float(metrics[1]),
        "recall": float(metrics[2]),
        "f1_score": float(metrics[3]),
    }
    with open(os.path.join(save_dir, "metrics.json"), "w") as f:
        json.dump(metrics_dict, f, indent=4)

    # Model (XGBoost)
    model_path = os.path.join(save_dir, "xgb_model.json")
    model.save_model(model_path)

    print(f"[CHECKPOINT] Saved all artifacts to: {save_dir}")
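
For later phases, the saved mask and hyperparameters can be read back with `np.load` and `json.load`. The sketch below round-trips toy artifacts through a temporary directory and applies the mask to one new, already scaled sample; the values are placeholders, and loading the scaler, label encoder, and model is left out.

```python
import json
import os
import tempfile

import numpy as np

# Placeholder artifacts standing in for real pipeline output.
final_mask = np.array([True, False, True, False])
best_params = {"n_estimators": 120, "max_depth": 6, "learning_rate": 0.1}

with tempfile.TemporaryDirectory() as save_dir:
    # Same file names and formats as save_checkpoints uses.
    np.save(os.path.join(save_dir, "final_mask.npy"), final_mask)
    with open(os.path.join(save_dir, "best_params.json"), "w") as f:
        json.dump(best_params, f, indent=4)

    loaded_mask = np.load(os.path.join(save_dir, "final_mask.npy"))
    with open(os.path.join(save_dir, "best_params.json")) as f:
        loaded_params = json.load(f)

# A new sample must be scaled first, then reduced with the same mask.
x_new_scaled = np.array([[0.1, 0.2, 0.3, 0.4]])
x_new_selected = x_new_scaled[:, loaded_mask]
print(x_new_selected)
```

New data has to pass through the same column order, scaler, and mask as the training data before it reaches the model.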

In [None]:
# ================================
# 9. Full Pipeline (One Dataset)
# ================================

def run_full_pipeline(
    dataset_name,
    test_size=0.2,
    random_state=42,
    bgwo_wolves=10,
    bgwo_iters=10,
    bgwo_subsample=8000,
    bo_evals=20,
    save_checkpoints_flag=True
):
    """
    Run the complete IDS pipeline on a given dataset.

    Steps:
    1. Load dataset.
    2. EDA (optional, you can call explore_dataset separately).
    3. Preprocess (cleaning, encoding).
    4. Train-test split.
    5. SMOTE + MinMax scaling.
    6. BGWO feature selection.
    7. RFE-XGBoost feature selection.
    8. Hyperparameter optimization with BO-TPE.
    9. Final XGBoost training & evaluation.
    10. Save checkpoints to Google Drive (optional).

    Returns a dictionary containing:
    - model
    - metrics
    - label_encoder
    - scaler
    - feature_mask (final mask)
    - bgwo_mask
    - best_params
    - checkpoint_dir (if saving enabled)
    """
    assert dataset_name in DATASETS, f"Unknown dataset: {dataset_name}"

    cfg = DATASETS[dataset_name]

    # Create checkpoint folder for this dataset
    save_dir = os.path.join(CHECKPOINT_ROOT, dataset_name)
    if save_checkpoints_flag:
        os.makedirs(save_dir, exist_ok=True)

    # 1) Load
    df = load_dataset(cfg)

    # Optional: EDA (uncomment if you want to see it each run)
    # explore_dataset(df, cfg['label_col'])

    # 2) Preprocess
    X, y, y_encoder = preprocess_dataset(
        df,
        label_col=cfg["label_col"],
        drop_cols=cfg.get("drop_cols", []),
        binary_mapping=cfg.get("binary_mapping", None)
    )

    # 3) Train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=test_size,
        stratify=y,
        random_state=random_state
    )
    print(f"[SPLIT] Train: {X_train.shape}, Test: {X_test.shape}")

    # 4) SMOTE + scaling
    X_train_res, y_train_res, X_test_scaled, scaler = apply_smote_and_scale(
        X_train, y_train, X_test, random_state=random_state
    )

    # 5) BGWO feature selection
    bgwo_mask = bgwo_feature_selection(
        X_train_res, y_train_res,
        num_wolves=bgwo_wolves,
        max_iter=bgwo_iters,
        subsample=bgwo_subsample,
        random_state=random_state
    )

    X_train_bgwo = X_train_res[:, bgwo_mask]
    X_test_bgwo = X_test_scaled[:, bgwo_mask]
    print(f"[BGWO] After BGWO: Train {X_train_bgwo.shape}, Test {X_test_bgwo.shape}")

    # 6) RFE-XGBoost feature selection
    final_mask = rfe_xgboost_feature_selection(
        X_train_res, y_train_res, bgwo_mask, random_state=random_state
    )

    X_train_fs = X_train_res[:, final_mask]
    X_test_fs = X_test_scaled[:, final_mask]
    print(f"[FEATURES] Final selected: Train {X_train_fs.shape}, Test {X_test_fs.shape}")

    # 7) Hyperparameter optimization
    best_params = bo_tpe_optimize_xgboost(
        X_train_fs, y_train_res,
        max_evals=bo_evals,
        random_state=random_state
    )

    # 8) Final training & evaluation
    clf, metrics = train_evaluate_xgboost(
        X_train_fs, y_train_res,
        X_test_fs, y_test,
        best_params,
        random_state=random_state
    )

    # 9) Save checkpoints
    if save_checkpoints_flag:
        save_checkpoints(
            save_dir=save_dir,
            scaler=scaler,
            label_encoder=y_encoder,
            bgwo_mask=bgwo_mask,
            final_mask=final_mask,
            best_params=best_params,
            metrics=metrics,
            model=clf
        )

    return {
        "model": clf,
        "metrics": metrics,
        "label_encoder": y_encoder,
        "scaler": scaler,
        "feature_mask": final_mask,
        "bgwo_mask": bgwo_mask,
        "best_params": best_params,
        "checkpoint_dir": save_dir if save_checkpoints_flag else None
    }
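
In the scaling step of the pipeline, the scaler is fitted on training data only and then applied unchanged to the test set. The NumPy sketch below mirrors min-max scaling on toy arrays to show the consequence: test values outside the training range fall outside [0, 1]. The numbers are illustrative assumptions.

```python
import numpy as np

X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
X_test = np.array([[2.0, 40.0]])

# Statistics come from the training data only.
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min
col_range[col_range == 0] = 1.0  # constant columns: avoid division by zero

X_train_scaled = (X_train - col_min) / col_range
X_test_scaled = (X_test - col_min) / col_range

print(X_train_scaled)
print(X_test_scaled)  # second value exceeds 1 because 40 > training max of 30
```

Tree-based models such as XGBoost handle such out-of-range values without trouble, so no clipping is applied here.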

## 10. Run the Full Pipeline on a Dataset

- Make sure you have **updated the `DATASETS` dictionary** (paths + label column + optional binary mapping).
- Then you can run the pipeline on any of the defined dataset keys, for example: `NSL_KDD`, `BoT_IoT`, etc.


In [None]:
# =======================
# Example: Run on NSL_KDD
# =======================
# 1. Ensure `DATASETS["NSL_KDD"]["path"]` and `label_col` are correct.
# 2. Then run this cell.

# Uncomment when ready:
# result_nsl = run_full_pipeline(
#     dataset_name="NSL_KDD",
#     test_size=0.2,
#     random_state=42,
#     bgwo_wolves=10,
#     bgwo_iters=8,       # you can increase this for better search (more time)
#     bgwo_subsample=8000,
#     bo_evals=15,        # more evaluations -> better tuning (more time)
#     save_checkpoints_flag=True
# )
# print("NSL-KDD metrics:", result_nsl["metrics"])
# print("Checkpoints saved in:", result_nsl["checkpoint_dir"])

In [None]:
# =======================
# Example: Run on BoT-IoT
# =======================
# Uncomment when you have configured DATASETS["BoT_IoT"] correctly.

# result_bot = run_full_pipeline(
#     dataset_name="BoT_IoT",
#     test_size=0.2,
#     random_state=42,
#     bgwo_wolves=10,
#     bgwo_iters=8,
#     bgwo_subsample=8000,
#     bo_evals=15,
#     save_checkpoints_flag=True
# )
# print("BoT-IoT metrics:", result_bot["metrics"])
# print("Checkpoints saved in:", result_bot["checkpoint_dir"])