# 09: XGBoost Model

This notebook focuses on developing, tuning, and evaluating an XGBoost classifier for predicting severe traffic accidents.

**PRD References:** 3.1.5.4 (Gradient Boosting - General), 3.1.5.6 (XGBoost/AdaBoost), FR3 (Model Training & Tuning), 9.1 (Jupyter Notebooks), 9.3 (Performance Logging), 10.5 (Hyperparameter Logging).

## 1. Setup and Imports

In [4]:
import pandas as pd
import numpy as np
import joblib
import json
from pathlib import Path
import time
import sys

# Ensure src directory is in Python path
sys.path.append(str(Path.cwd().parent / 'src'))

# Scikit-learn imports
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score, roc_auc_score

# XGBoost import
from xgboost import XGBClassifier

# Custom utilities
from modeling_utils import compute_classification_metrics, append_performance_record

# Configuration
DATA_DIR = Path.cwd().parent / 'data'
PROCESSED_DATA_FILE = DATA_DIR / 'processed' / 'preprocessed_data.csv'
MODELS_DIR = Path.cwd().parent / 'models'
REPORTS_DIR = Path.cwd().parent / 'reports'
PERFORMANCE_EXCEL_FILE = REPORTS_DIR / 'model_performance_summary.xlsx'
RANDOM_STATE = 42
MODEL_NAME = 'XGBoost'
CV_SHEET_NAME = f'{MODEL_NAME}_CV_Trials'
MODEL_FILENAME = f'{MODEL_NAME.lower().replace(" ", "_")}_best_model.joblib'

MODELS_DIR.mkdir(parents=True, exist_ok=True)
REPORTS_DIR.mkdir(parents=True, exist_ok=True)

## 2. Load Data

Load the preprocessed data. The target is the `SEVERITY` column; every other column is used as a feature.

In [5]:
try:
    df = pd.read_csv(PROCESSED_DATA_FILE)
except FileNotFoundError:
    print(f"Error: Processed data file not found at {PROCESSED_DATA_FILE}")
    print("Please ensure '02_data_preprocessing.ipynb' has been run successfully.")
    df = None

if df is not None:
    print(f"Data loaded successfully: {df.shape}")
    if 'SEVERITY' not in df.columns:
        print("Error: Target column 'SEVERITY' not found in the dataframe.")
        X, y = None, None
    else:
        X = df.drop('SEVERITY', axis=1)
        y = df['SEVERITY']
        print(f"Features shape: {X.shape}, Target shape: {y.shape}")
        print(f"Target distribution:\n{y.value_counts(normalize=True)}")

Data loaded successfully: (22072, 44)
Features shape: (22072, 43), Target shape: (22072,)
Target distribution:
SEVERITY
0    0.931859
1    0.068141
Name: proportion, dtype: float64


## 3. Train-Test Split

Split the data 80/20 into training and test sets, stratifying on the target so that both sets keep the same class proportions.

In [6]:
if X is not None and y is not None:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
    )
    print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
    print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")
else:
    print("Skipping train-test split as data was not loaded properly.")
    X_train, X_test, y_train, y_test = None, None, None, None

X_train shape: (17657, 43), y_train shape: (17657,)
X_test shape: (4415, 43), y_test shape: (4415,)
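To make the effect of `stratify=y` concrete, here is a minimal standard-library sketch of a stratified split. It is a hand-rolled stand-in for illustration only, not scikit-learn's implementation; the helper names `stratified_split` and `positive_rate` and the toy labels (930 negatives, 70 positives) are assumptions chosen to roughly mirror the class balance reported above.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class in the same ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


def positive_rate(labels, indices):
    """Share of positive (label 1) samples among the given indices."""
    return sum(labels[i] for i in indices) / len(indices)


# Toy labels (hypothetical): 93% negative, 7% positive.
labels = [0] * 930 + [1] * 70
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(positive_rate(labels, train_idx), positive_rate(labels, test_idx))
```

Because each class is split separately, the positive rate is the same in both partitions, which is the property the stratified split in the cell above relies on.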


## 4. Handle Class Imbalance

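XGBoost can compensate for class imbalance through the `scale_pos_weight` parameter, which controls the balance of positive and negative weights. A typical value is the ratio of negative to positive training samples. We compute that ratio here and include it as a candidate value in the hyperparameter grid.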

In [7]:
if y_train is not None:
    scale_pos_weight_value = y_train.value_counts()[0] / y_train.value_counts()[1]
    print(f"Calculated scale_pos_weight: {scale_pos_weight_value:.2f}")
else:
    scale_pos_weight_value = 1 # Default if y_train is not available
    print("y_train not available, using default scale_pos_weight=1")

Calculated scale_pos_weight: 13.68
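As a quick sanity check on that value, the ratio can be reproduced from the class proportions printed in Section 2. This is a minimal sketch: the helper `negative_to_positive_ratio` is a hypothetical name, and the proportions come from the full dataset rather than the training split, so it only approximates the training-set calculation (the stratified split keeps the two nearly identical).

```python
from collections import Counter


def negative_to_positive_ratio(labels, positive_label=1):
    """Ratio of negative to positive samples, a common scale_pos_weight choice."""
    counts = Counter(labels)
    n_pos = counts[positive_label]
    n_neg = sum(counts.values()) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples; the ratio is undefined")
    return n_neg / n_pos


# Toy labels (hypothetical): nine negatives for every positive.
toy_labels = [0] * 9 + [1]
print(negative_to_positive_ratio(toy_labels))

# Same ratio computed from the proportions printed in Section 2.
ratio_from_proportions = 0.931859 / 0.068141
print(f"{ratio_from_proportions:.2f}")
```

The second figure agrees with the `scale_pos_weight` printed by the cell above when rounded to two decimals.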


## 5. Model Definition and Hyperparameter Tuning (GridSearchCV)

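We use `GridSearchCV` with 3-fold cross-validation to search the hyperparameter grid. Each candidate is scored on F1, ROC AUC, precision, and recall, and the candidate with the best mean F1 is refit on the full training set.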

In [8]:
if X_train is not None:
    # Note: XGBoost's scikit-learn API uses 'objective': 'binary:logistic' by default for binary classification.
    # 'eval_metric' can be set to 'logloss', 'auc', 'aucpr' etc.
    # 'use_label_encoder' is deprecated in recent XGBoost releases, so it is not passed here.
    xgb_classifier = XGBClassifier(random_state=RANDOM_STATE, eval_metric='logloss')

    # Define a standard parameter grid for ideal tuning
    param_grid = {
        'n_estimators': [100, 200, 300],
        'learning_rate': [0.01, 0.05, 0.1],
        'max_depth': [3, 5, 7],
        'gamma': [0, 0.1, 0.5], # Minimum loss reduction required to make a further partition
        'subsample': [0.6, 0.8, 1.0], # Subsample ratio of the training instance
        'colsample_bytree': [0.6, 0.8, 1.0], # Subsample ratio of columns when constructing each tree
        'scale_pos_weight': [1, scale_pos_weight_value] # For handling class imbalance
    }

    # Define a smaller, more focused parameter grid for faster initial run
    param_grid_fast = {
        'n_estimators': [50],
        'learning_rate': [0.05],
        'max_depth': [5],
        'scale_pos_weight': [scale_pos_weight_value]
        # 'gamma': [0],
        # 'subsample': [0.8],
        # 'colsample_bytree': [0.8],
    }

    # Custom ROC AUC scorer that scores on predict_proba for the positive class.
    # The target is binary, so roc_auc_score needs no averaging or multi_class options.
    def roc_auc_proba_scorer(estimator, X_data, y_true_data):
        y_proba = estimator.predict_proba(X_data)[:, 1]
        return roc_auc_score(y_true_data, y_proba)

    scoring = {
        'F1': make_scorer(f1_score, average='weighted'),
        'ROC_AUC': roc_auc_proba_scorer, 
        'Precision': make_scorer(precision_score, average='weighted', zero_division=0),
        'Recall': make_scorer(recall_score, average='weighted', zero_division=0)
    }

    grid_search = GridSearchCV(
        estimator=xgb_classifier,
        param_grid=param_grid, # Full grid (1458 candidates); pass param_grid_fast for a quicker run
        scoring=scoring,
        refit='F1', # Refit the best model using F1 score
        cv=3,       # Number of cross-validation folds (3 for quicker run)
        verbose=2,
        n_jobs=-1   # Use all available cores
    )

    print(f"Starting GridSearchCV for {MODEL_NAME}...")
    start_time_grid_search = time.time()
    grid_search.fit(X_train, y_train)
    end_time_grid_search = time.time()
    grid_search_duration = end_time_grid_search - start_time_grid_search
    print(f"GridSearchCV completed in {grid_search_duration:.2f} seconds.")

    print(f"Best F1 score from GridSearchCV: {grid_search.best_score_:.4f}")
    print(f"Best parameters from GridSearchCV: {grid_search.best_params_}")
else:
    print("Skipping GridSearchCV as training data is not available.")
    grid_search = None
    grid_search_duration = 0

Starting GridSearchCV for XGBoost...
Fitting 3 folds for each of 1458 candidates, totalling 4374 fits


[CV] END colsample_bytree=0.6, gamma=0, learning_rate=0.01, max_depth=3, n_estimators=100, scale_pos_weight=1, subsample=0.8; total time=   2.2s
[CV] END colsample_bytree=0.6, gamma=0, learning_rate=0.01, max_depth=3, n_estimators=100, scale_pos_weight=1, subsample=1.0; total time=   2.1s
[CV] END colsample_bytree=0.6, gamma=0, learning_rate=0.01, max_depth=3, n_estimators=100, scale_pos_weight=1, subsample=0.8; total time=   2.3s
[CV] END colsample_bytree=0.6, gamma=0, learning_rate=0.01, max_depth=3, n_estimators=100, scale_pos_weight=1, subsample=0.6; total time=   2.4s
[CV] END colsample_bytree=0.6, gamma=0, learning_rate=0.01, max_depth=3, n_estimators=100, scale_pos_weight=1, subsample=0.6; total time=   2.3s
[CV] END colsample_bytree=0.6, gamma=0, learning_rate=0.01, max_depth=3, n_estimators=100, scale_pos_weight=1, subsample=1.0; total time=   2.4s
[CV] END colsample_bytree=0.6, gamma=0, learning_rate=0.01, max_depth=3, n_estimators=100, scale_pos_weight=1, subsample=0.6; tota
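The candidate and fit counts in the log above follow directly from the grid: the number of candidates is the product of the number of values per parameter, and each candidate is fit once per fold. The sketch below recomputes them with the standard library; `scale_pos_weight_value` is set to 13.68 here only as a placeholder for the value computed in Section 4.

```python
from itertools import product

scale_pos_weight_value = 13.68  # placeholder for the value computed in Section 4

param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'gamma': [0, 0.1, 0.5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'scale_pos_weight': [1, scale_pos_weight_value],
}

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)

cv_folds = 3
total_fits = n_candidates * cv_folds  # excludes the final refit of the best candidate

print(n_candidates, total_fits)
```

Six parameters with three values each and one with two give 3^6 × 2 = 1458 candidates, or 4374 fits across three folds. Dropping a parameter or a value shrinks the search multiplicatively, which is why `param_grid_fast` runs so much faster.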

## 6. Log Hyperparameter Tuning Experiments

Log each hyperparameter combination tried by GridSearchCV and its performance to the Excel sheet.

In [9]:
if grid_search is not None and hasattr(grid_search, 'cv_results_'):
    cv_results = grid_search.cv_results_
    print(f"Logging {len(cv_results['params'])} CV trials to Excel sheet '{CV_SHEET_NAME}'...")

    for i in range(len(cv_results['params'])):
        params_tried = cv_results['params'][i]
        record = {
            'Model': MODEL_NAME,
            'Sheet_Context': 'CV_Trial',
            'Hyperparameter_Set_Tried': json.dumps(params_tried),
            'CV_F1_Mean': cv_results.get('mean_test_F1', [np.nan]*len(cv_results['params']))[i],
            'CV_F1_Std': cv_results.get('std_test_F1', [np.nan]*len(cv_results['params']))[i],
            'CV_ROC_AUC_Mean': cv_results.get('mean_test_ROC_AUC', [np.nan]*len(cv_results['params']))[i],
            'CV_ROC_AUC_Std': cv_results.get('std_test_ROC_AUC', [np.nan]*len(cv_results['params']))[i],
            'CV_Precision_Mean': cv_results.get('mean_test_Precision', [np.nan]*len(cv_results['params']))[i],
            'CV_Precision_Std': cv_results.get('std_test_Precision', [np.nan]*len(cv_results['params']))[i],
            'CV_Recall_Mean': cv_results.get('mean_test_Recall', [np.nan]*len(cv_results['params']))[i],
            'CV_Recall_Std': cv_results.get('std_test_Recall', [np.nan]*len(cv_results['params']))[i],
            'CV_Rank_F1': cv_results.get('rank_test_F1', [np.nan]*len(cv_results['params']))[i],
            'Fit_Time_Seconds_Mean': cv_results.get('mean_fit_time', [np.nan]*len(cv_results['params']))[i]
        }
        append_performance_record(PERFORMANCE_EXCEL_FILE, record, sheet_name=CV_SHEET_NAME)
    print("CV trials logging complete.")
else:
    print("Skipping CV trials logging as GridSearchCV results are not available.")

Logging 1458 CV trials to Excel sheet 'XGBoost_CV_Trials'...
Info: Sheet 'XGBoost_CV_Trials' not found in /home/cmark/Projects/TrafficAccidentSeverity/reports/model_performance_summary.xlsx. Creating new sheet with common columns.
CV trials logging complete.


## 7. Best Model Evaluation and Logging

Evaluate the best model found by GridSearchCV on the training and test sets, then log its performance.

In [10]:
if grid_search is not None and hasattr(grid_search, 'best_estimator_') and X_train is not None:
    best_xgb_model = grid_search.best_estimator_
    best_params = grid_search.best_params_

    # Predictions
    y_train_pred = best_xgb_model.predict(X_train)
    y_train_prob = best_xgb_model.predict_proba(X_train)[:, 1]
    y_test_pred = best_xgb_model.predict(X_test)
    y_test_prob = best_xgb_model.predict_proba(X_test)[:, 1]

    # Compute metrics
    train_metrics = compute_classification_metrics(y_train, y_train_pred, y_train_prob)
    test_metrics = compute_classification_metrics(y_test, y_test_pred, y_test_prob)

    print(f"Best {MODEL_NAME} Model Performance:")
    print(f"Training Set Metrics: {train_metrics}")
    print(f"Test Set Metrics: {test_metrics}")

    # Log final model performance
    final_record = {
        'Model': MODEL_NAME,
        'Sheet_Context': 'Final_Model',
        'Selected_Final_Hyperparameters': json.dumps(best_params),
        'Training_Time_Seconds': grid_search_duration, # Total GridSearchCV time as proxy
        'Train_Precision': train_metrics.get('Precision'),
        'Train_Recall': train_metrics.get('Recall'),
        'Train_F1': train_metrics.get('F1'),
        'Train_ROC_AUC': train_metrics.get('ROC_AUC'),
        'Test_Precision': test_metrics.get('Precision'),
        'Test_Recall': test_metrics.get('Recall'),
        'Test_F1': test_metrics.get('F1'),
        'Test_ROC_AUC': test_metrics.get('ROC_AUC'),
        'CV_Best_F1_Score': grid_search.best_score_
    }
    append_performance_record(PERFORMANCE_EXCEL_FILE, final_record, sheet_name='Model_Summaries')
    print("Final model performance logged.")

    # Save the best model
    model_save_path = MODELS_DIR / MODEL_FILENAME
    joblib.dump(best_xgb_model, model_save_path)
    print(f"Best {MODEL_NAME} model saved to {model_save_path}")
else:
    print("Skipping final model evaluation, logging, and saving as prerequisites are not met.")
    best_xgb_model = None

Best XGBoost Model Performance:
Training Set Metrics: {'Precision': 0.9467893187959916, 'Recall': 0.944894376168092, 'F1': 0.927290105830906, 'ROC_AUC': np.float64(0.9423341841902678)}
Test Set Metrics: {'Precision': 0.8965493328439902, 'Recall': 0.9306908267270668, 'F1': 0.9020889450687456, 'ROC_AUC': np.float64(0.7137232559754635)}
Final model performance logged.
Best XGBoost model saved to /home/cmark/Projects/TrafficAccidentSeverity/models/xgboost_best_model.joblib
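When reading these numbers, keep in mind that support-weighted averages are dominated by the majority class on data this imbalanced. The scorers in Section 5 use `average='weighted'`; whether `compute_classification_metrics` does the same is not shown in this notebook, so treat that as an assumption. The sketch below uses toy labels (hypothetical, 93% negative) and a trivial classifier that always predicts the negative class, and computes per-class and weighted F1 by hand.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    n = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, cls) * y_true.count(cls) / n
        for cls in sorted(set(y_true))
    )


# Toy data (hypothetical): 930 negatives, 70 positives, always predict 0.
y_true = [0] * 930 + [1] * 70
y_pred = [0] * 1000

f1_negative = f1_for_class(y_true, y_pred, 0)
f1_positive = f1_for_class(y_true, y_pred, 1)
f1_weighted = weighted_f1(y_true, y_pred)

print(f"{f1_negative:.3f} {f1_positive:.3f} {f1_weighted:.3f}")
```

A classifier that never flags a severe accident still reaches a weighted F1 of roughly 0.90 on this toy data. This does not show that the tuned model behaves that way, but it suggests reporting positive-class precision, recall, and F1 alongside the weighted figures.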


## 8. Conclusion

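This notebook tuned an XGBoost classifier with GridSearchCV over 1458 hyperparameter combinations, logged every trial, and saved the best model. The selected model reaches a test F1 of 0.902 and a test ROC AUC of 0.714, compared with 0.927 and 0.942 on the training set; the drop in ROC AUC suggests the model overfits. All metrics are recorded in the summary Excel sheet for comparison with other models.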