# XGBoost Hyperparameter Optimization

This notebook implements a three-stage Bayesian optimization process for tuning XGBoost hyperparameters using Optuna. The approach systematically explores the hyperparameter space to reduce FPR.

## Overview of the Optimization Process

The hyperparameter optimization follows a sequential approach, progressing from fundamental to advanced parameters. We begin by fine-tuning the basic tree structure through depth, learning rate, and number of trees. This foundation is then enhanced by optimizing how data and features are sampled, along with controlling leaf node sizes. The process ends in adjusting regularization parameters to balance model complexity and prevent overfitting, using split thresholds and both L1 and L2 penalties.

### Stage 1: Basic Parameter Optimization

- Optimizes fundamental XGBoost parameters
- Optuna with 50 trials
- The goal is to maximize classification precision
- Parameter ranges:
  - max_depth: 3 to 10
  - learning_rate: 0.01 to 0.3
  - n_estimators: 50 to 200

### Stage 2: Sampling Parameter Optimization

- Builds upon best parameters from Stage 1
- Optuna with 30 trials
- Parameter ranges:
  - subsample: 0.5 to 1.0
  - colsample_bytree: 0.5 to 1.0
  - min_child_weight: 1 to 10

### Stage 3: Regularization Parameter Optimization

- Builds upon best parameters from Stages 1 and 2
- Optuna with 30 trials
- Parameter ranges:
  - gamma: 0 to 5
  - reg_alpha: 0 to 5
  - reg_lambda: 0 to 5

In [1]:
!pip install xgboost scikit-learn pandas optuna

from google.colab import drive

drive.mount('/content/drive')

Collecting optuna
  Downloading optuna-4.1.0-py3-none-any.whl.metadata (16 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.14.0-py3-none-any.whl.metadata (7.4 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.9.0-py3-none-any.whl.metadata (10 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Downloading Mako-1.3.8-py3-none-any.whl.metadata (2.9 kB)
Downloading optuna-4.1.0-py3-none-any.whl (364 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m364.4/364.4 kB[0m [31m5.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading alembic-1.14.0-py3-none-any.whl (233 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m233.5/233.5 kB[0m [31m10.7 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading colorlog-6.9.0-py3-none-any.whl (11 kB)
Downloading Mako-1.3.8-py3-none-any.whl (78 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m78.6/78.6 kB[0m [31m5.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: Ma

In [None]:
import os
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
import optuna

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler


def load_and_preprocess_data():
    data = pd.read_csv("/content/drive/My Drive/data/Data.csv")
    labels = pd.read_csv("/content/drive/My Drive/data/Label.csv")

    df = pd.merge(data, labels, left_index=True, right_index=True)

    X = df.drop("Label", axis=1)
    y = df["Label"]

    attack_mask = y > 0
    X_attacks = X[attack_mask]
    y_multiclass = y[attack_mask]

    y_multiclass = y_multiclass - 1

    categorical_cols = X.select_dtypes(include=["object"]).columns
    for col in categorical_cols:
        le = LabelEncoder()
        X[col] = le.fit_transform(X[col])
        X_attacks[col] = le.fit_transform(X_attacks[col])

    scaler_multi = StandardScaler()
    X_attacks_scaled = scaler_multi.fit_transform(X_attacks)
    X_attacks_scaled = pd.DataFrame(X_attacks_scaled, columns=X_attacks.columns)

    return X_attacks_scaled, y_multiclass


def create_directories():
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    multiclass_dir = f"models/multiclass/xgboost_{timestamp}"
    os.makedirs(multiclass_dir, exist_ok=True)
    return multiclass_dir


def objective_stage1(trial, X_train, y_train, X_valid, y_valid):
    param = {
        'objective': 'multi:softmax',
        'num_class': 9,
        'eval_metric': 'mlogloss',
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'n_estimators': trial.suggest_int('n_estimators', 50, 200),
        'random_state': 42
    }

    model = xgb.XGBClassifier(**param)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    accuracy = (preds == y_valid).mean()

    return accuracy


def objective_stage2(trial, X_train, y_train, X_valid, y_valid, best_params):
    param = {
        **best_params,
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10)
    }

    model = xgb.XGBClassifier(**param)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    accuracy = (preds == y_valid).mean()

    return accuracy


def objective_stage3(trial, X_train, y_train, X_valid, y_valid, best_params):
    param = {
        **best_params,
        'gamma': trial.suggest_float('gamma', 0, 5),
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 5),
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 5)
    }

    model = xgb.XGBClassifier(**param)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    accuracy = (preds == y_valid).mean()

    return accuracy


def train_and_evaluate():
    X, y_multiclass = load_and_preprocess_data()
    multiclass_dir = create_directories()

    X_train, X_test, y_train, y_test = train_test_split(X, y_multiclass, test_size=0.2, random_state=42)

    study_stage1 = optuna.create_study(direction='maximize')
    study_stage1.optimize(lambda trial: objective_stage1(trial, X_train, y_train, X_test, y_test), n_trials=50)
    best_params_stage1 = study_stage1.best_params
    print("Stage 1 Best Parameters:", best_params_stage1)

    study_stage2 = optuna.create_study(direction='maximize')
    study_stage2.optimize(lambda trial: objective_stage2(trial, X_train, y_train, X_test, y_test, best_params_stage1), n_trials=30)
    best_params_stage2 = {**best_params_stage1, **study_stage2.best_params}
    print("Stage 2 Best Parameters:", best_params_stage2)

    study_stage3 = optuna.create_study(direction='maximize')
    study_stage3.optimize(lambda trial: objective_stage3(trial, X_train, y_train, X_test, y_test, best_params_stage2), n_trials=30)
    best_params_stage3 = {**best_params_stage2, **study_stage3.best_params}
    print("Stage 3 Best Parameters:", best_params_stage3)

    final_model = xgb.XGBClassifier(**best_params_stage3)
    final_model.fit(X_train, y_train)

    preds = final_model.predict(X_test)
    report = classification_report(y_test, preds, target_names=["Analysis", "Backdoor", "DoS", "Exploits", "Fuzzers", "Generic", "Reconnaissance", "Shellcode", "Worms"])
    print(report)

    final_model.save_model(f"{multiclass_dir}/final_model.json")

    with open(f"{multiclass_dir}/report.txt", "w") as f:
        f.write("XGBoost Multiclass Classification Results\n")
        f.write("==================================\n\n")
        f.write(f"Stage 1 Best Parameters: {best_params_stage1}\n")
        f.write(f"Stage 2 Best Parameters: {best_params_stage2}\n")
        f.write(f"Stage 3 Best Parameters: {best_params_stage3}\n")
        f.write("\nClassification Report:\n")
        f.write(report)


if __name__ == "__main__":
    train_and_evaluate()


[I 2024-12-08 15:49:07,328] A new study created in memory with name: no-name-b2e1a0a8-36b0-491d-831e-4f6e5c4f7ea5
[I 2024-12-08 15:49:46,029] Trial 0 finished with value: 0.7671485181671038 and parameters: {'max_depth': 6, 'learning_rate': 0.22559782883684773, 'n_estimators': 168}. Best is trial 0 with value: 0.7671485181671038.
[I 2024-12-08 15:50:14,937] Trial 1 finished with value: 0.7669810794217782 and parameters: {'max_depth': 7, 'learning_rate': 0.20644201623785383, 'n_estimators': 105}. Best is trial 0 with value: 0.7671485181671038.
[I 2024-12-08 15:51:12,065] Trial 2 finished with value: 0.7710554222247028 and parameters: {'max_depth': 10, 'learning_rate': 0.14281253195860238, 'n_estimators': 111}. Best is trial 2 with value: 0.7710554222247028.
[I 2024-12-08 15:52:12,989] Trial 3 finished with value: 0.7683764022994921 and parameters: {'max_depth': 8, 'learning_rate': 0.07019434996598799, 'n_estimators': 180}. Best is trial 2 with value: 0.7710554222247028.
[I 2024-12-08 15:

Stage 1 Best Parameters: {'max_depth': 10, 'learning_rate': 0.10846417402017274, 'n_estimators': 116}


[I 2024-12-08 16:24:44,578] Trial 0 finished with value: 0.7658648211196071 and parameters: {'subsample': 0.5880963925858063, 'colsample_bytree': 0.5504913614676639, 'min_child_weight': 5}. Best is trial 0 with value: 0.7658648211196071.
[I 2024-12-08 16:25:19,723] Trial 1 finished with value: 0.7679298989786236 and parameters: {'subsample': 0.9770637161572737, 'colsample_bytree': 0.6286715543596686, 'min_child_weight': 10}. Best is trial 1 with value: 0.7679298989786236.
[I 2024-12-08 16:25:57,839] Trial 2 finished with value: 0.7668136406764525 and parameters: {'subsample': 0.6938129915389537, 'colsample_bytree': 0.6701802255793936, 'min_child_weight': 7}. Best is trial 1 with value: 0.7679298989786236.
[I 2024-12-08 16:26:44,558] Trial 3 finished with value: 0.7664229502706926 and parameters: {'subsample': 0.6323791594959572, 'colsample_bytree': 0.6988455293836077, 'min_child_weight': 2}. Best is trial 1 with value: 0.7679298989786236.
[I 2024-12-08 16:27:24,090] Trial 4 finished wi

Stage 2 Best Parameters: {'max_depth': 10, 'learning_rate': 0.10846417402017274, 'n_estimators': 116, 'subsample': 0.9347771906515, 'colsample_bytree': 0.7472126360381557, 'min_child_weight': 2}


[I 2024-12-08 16:47:09,429] Trial 0 finished with value: 0.7595021487972317 and parameters: {'gamma': 3.008395239220958, 'reg_alpha': 1.567899752338261, 'reg_lambda': 0.3819476557917728}. Best is trial 0 with value: 0.7595021487972317.
[I 2024-12-08 16:47:30,166] Trial 1 finished with value: 0.7560417480605012 and parameters: {'gamma': 4.255064018359466, 'reg_alpha': 0.8114406264693924, 'reg_lambda': 2.9891188512468188}. Best is trial 0 with value: 0.7595021487972317.
[I 2024-12-08 16:48:07,184] Trial 2 finished with value: 0.7699949768376402 and parameters: {'gamma': 0.59860752687388, 'reg_alpha': 0.8331702924238371, 'reg_lambda': 0.7750858785468784}. Best is trial 2 with value: 0.7699949768376402.
[I 2024-12-08 16:48:36,646] Trial 3 finished with value: 0.7624044203828766 and parameters: {'gamma': 0.8316952336299183, 'reg_alpha': 4.013898837041079, 'reg_lambda': 2.9298814174916203}. Best is trial 2 with value: 0.7699949768376402.
[I 2024-12-08 16:48:57,015] Trial 4 finished with valu

Stage 3 Best Parameters: {'max_depth': 10, 'learning_rate': 0.10846417402017274, 'n_estimators': 116, 'subsample': 0.9347771906515, 'colsample_bytree': 0.7472126360381557, 'min_child_weight': 2, 'gamma': 0.40485133721885, 'reg_alpha': 0.12792230585405645, 'reg_lambda': 1.3261593035347525}
                precision    recall  f1-score   support

      Analysis       0.48      0.44      0.46        68
      Backdoor       0.88      0.48      0.62        77
           DoS       0.81      0.29      0.43       917
      Exploits       0.81      0.78      0.80      6191
       Fuzzers       0.69      0.95      0.80      5859
       Generic       0.89      0.74      0.81       929
Reconnaissance       0.90      0.67      0.77      3370
     Shellcode       0.64      0.28      0.39       457
         Worms       0.64      0.33      0.43        49

      accuracy                           0.77     17917
     macro avg       0.75      0.55      0.61     17917
  weighted avg       0.79      0.77 

```Stage 3 Best Parameters: {'max_depth': 10, 'learning_rate': 0.10846417402017274, 'n_estimators': 116, 'subsample': 0.9347771906515, 'colsample_bytree': 0.7472126360381557, 'min_child_weight': 2, 'gamma': 0.40485133721885, 'reg_alpha': 0.12792230585405645, 'reg_lambda': 1.3261593035347525}```

## Results

The optimization process produced a set of tuned parameters, with the final model using a relatively deep tree structure (max_depth: 10) balanced by a moderate learning rate (≈0.11). The sampling strategy favors using most of the training data (subsample: ≈0.93) while being more selective with features (colsample_bytree: ≈0.75). The regularization parameters suggest a light L1 penalty (reg_alpha: ≈0.13) with stronger L2 regularization (reg_lambda: ≈1.33).

However, when comparing performance metrics, this optimized model shows only marginal improvements over XGBoost's default configuration. This suggests we've reached a performance plateau where:
- The problem space may be well-covered by default parameters
- The dataset's inherent patterns are already being effectively captured
- Further parameter tuning is unlikely to yield meaningful gains