# 10: Bagging Classifier Model

This notebook develops, tunes, and evaluates a Bagging Classifier for predicting severe traffic accidents. Decision Trees serve as the base estimators.

**PRD References:** 3.1.5.5 (Bagging), FR3 (Model Training & Tuning), 9.1 (Jupyter Notebooks), 9.3 (Performance Logging), 10.5 (Hyperparameter Logging).

## 1. Setup and Imports

In [1]:
import pandas as pd
import numpy as np
import joblib
import json
from pathlib import Path
import time
import sys

# Ensure src directory is in Python path
sys.path.append(str(Path.cwd().parent / 'src'))

# Scikit-learn imports
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier # Base estimator
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score, roc_auc_score

# Custom utilities
from modeling_utils import compute_classification_metrics, append_performance_record

# Configuration
DATA_DIR = Path.cwd().parent / 'data'
PROCESSED_DATA_FILE = DATA_DIR / 'processed' / 'preprocessed_data.csv'
MODELS_DIR = Path.cwd().parent / 'models'
REPORTS_DIR = Path.cwd().parent / 'reports'
PERFORMANCE_EXCEL_FILE = REPORTS_DIR / 'model_performance_summary.xlsx'
RANDOM_STATE = 42
MODEL_NAME = 'BaggingClassifier'
CV_SHEET_NAME = f'{MODEL_NAME}_CV_Trials'
MODEL_FILENAME = f'{MODEL_NAME.lower().replace(" ", "_")}_best_model.joblib'

MODELS_DIR.mkdir(parents=True, exist_ok=True)
REPORTS_DIR.mkdir(parents=True, exist_ok=True)

## 2. Load Data

In [2]:
print(f"Attempting to load data from: {PROCESSED_DATA_FILE}")
try:
    df = pd.read_csv(PROCESSED_DATA_FILE)
except FileNotFoundError:
    print(f"Error: Processed data file not found at {PROCESSED_DATA_FILE}")
    print("Please ensure '02_data_preprocessing.ipynb' has been run successfully.")
    df = None
except Exception as e:
    print(f"An error occurred while loading the data: {e}")
    df = None

if df is not None:
    print(f"Data loaded successfully: {df.shape}")
    if 'SEVERITY' not in df.columns:
        print("Error: Target column 'SEVERITY' not found in the dataframe.")
        X, y = None, None
    else:
        X = df.drop('SEVERITY', axis=1)
        y = df['SEVERITY']
        print(f"Features shape: {X.shape}, Target shape: {y.shape}")
        print(f"Target distribution:\n{y.value_counts(normalize=True)}")
else:
    X, y = None, None

Attempting to load data from: /home/cmark/Projects/TrafficAccidentSeverity/data/processed/preprocessed_data.csv
Data loaded successfully: (22072, 44)
Features shape: (22072, 43), Target shape: (22072,)
Target distribution:
SEVERITY
0    0.931859
1    0.068141
Name: proportion, dtype: float64


## 3. Train-Test Split

In [3]:
if X is not None and y is not None:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
    )
    print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
    print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")
    # Basic check for NaNs after split
    print(f"NaNs in X_train: {X_train.isnull().sum().sum()}")
    print(f"NaNs in X_test: {X_test.isnull().sum().sum()}")

else:
    print("Skipping train-test split as data was not loaded properly.")
    X_train, X_test, y_train, y_test = None, None, None, None

X_train shape: (17657, 43), y_train shape: (17657,)
X_test shape: (4415, 43), y_test shape: (4415,)
NaNs in X_train: 0
NaNs in X_test: 0


## 4. Handle Class Imbalance (via base estimator or BaggingClassifier params)

`BaggingClassifier` has no `class_weight` parameter of its own, so the imbalance in the target (about 6.8% of records are severe) has to be handled elsewhere. Two options:

- Pass an `estimator` that supports class weighting, such as `DecisionTreeClassifier(class_weight='balanced')`.
- Resample the training data (for example with SMOTE) before fitting `BaggingClassifier`.

For simplicity and consistency with the other models, this notebook takes the first option and tunes `class_weight` on the base Decision Tree.
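
To make the `'balanced'` option concrete, here is a minimal sketch of the weighting formula that scikit-learn documents for it, `n_samples / (n_classes * np.bincount(y))`. The class counts are an assumption inferred from the shape and proportions printed in Section 2, not read from the data, and the exact values during tuning will differ slightly because fitting happens on the training split and CV folds.

```python
# Class counts inferred from the Section 2 output (an assumption):
# 22072 rows with a 0.068141 share of class 1 corresponds to 1504 severe records.
n_samples = 22072
class_counts = {0: 20568, 1: 1504}
n_classes = len(class_counts)

# 'balanced' weighting: n_samples / (n_classes * count_of_class)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in balanced_weights.items():
    print(f"class {label}: weight {weight:.3f}")

# Each class ends up carrying the same total weight, which is the point of 'balanced'.
total_weight_per_class = {
    label: balanced_weights[label] * class_counts[label]
    for label in class_counts
}
print(total_weight_per_class)
```

Under these counts, a severe record is weighted roughly 14 times more heavily than a non-severe one.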

## 5. Model Definition and Hyperparameter Tuning (GridSearchCV)

In [4]:
if X_train is not None and y_train is not None:
    # Base estimator
    base_estimator = DecisionTreeClassifier(random_state=RANDOM_STATE)

    bagging_classifier = BaggingClassifier(
        estimator=base_estimator, # Parameter is named `estimator` since scikit-learn 1.2 (formerly `base_estimator`)
        random_state=RANDOM_STATE,
        n_jobs=-1 # Use all available cores for Bagging itself
    )

    # Define a parameter grid
    param_grid_fast = {
        'n_estimators': [50, 100], # Number of base estimators
        'estimator__max_depth': [5, 10, None], # Max depth of base Decision Tree
        'estimator__class_weight': ['balanced', None], # Class weight for base Decision Tree
        'max_samples': [0.7, 1.0],    # Fraction of samples for training each base estimator
        'max_features': [0.7, 1.0]  # Fraction of features for training each base estimator
    }
    
    # A more comprehensive grid for thorough tuning (can be time-consuming)
    param_grid_full = {
        'n_estimators': [50, 100, 200],
        'estimator__max_depth': [5, 10, 20, None],
        'estimator__min_samples_split': [2, 5, 10],
        'estimator__min_samples_leaf': [1, 2, 4],
        'estimator__class_weight': ['balanced', None],
        'max_samples': [0.5, 0.7, 1.0],
        'max_features': [0.5, 0.7, 1.0],
        'bootstrap': [True, False],
        'bootstrap_features': [True, False]
    }

    # Custom ROC AUC scorer that scores on the positive-class probability
    def roc_auc_proba_scorer(estimator, X_data, y_true_data):
        if hasattr(estimator, "predict_proba"):
            y_proba = estimator.predict_proba(X_data)[:, 1]
            # Binary target: roc_auc_score only needs the positive-class scores
            return roc_auc_score(y_true_data, y_proba)
        # Fallback for estimators without predict_proba (BaggingClassifier has it)
        return 0.5 # Chance-level score

    scoring = {
        'F1': make_scorer(f1_score, average='weighted'),
        'ROC_AUC': roc_auc_proba_scorer,
        'Precision': make_scorer(precision_score, average='weighted', zero_division=0),
        'Recall': make_scorer(recall_score, average='weighted', zero_division=0)
    }

    grid_search = GridSearchCV(
        estimator=bagging_classifier,
        param_grid=param_grid_fast, # Using the faster grid for now
        scoring=scoring,
        refit='F1',
        cv=2, # Reduced CV folds for faster initial run
        verbose=2,
        n_jobs=-1 # GridSearchCV's n_jobs
    )

    print(f"Starting GridSearchCV for {MODEL_NAME}...")
    start_time_grid_search = time.time()
    try:
        grid_search.fit(X_train, y_train)
        end_time_grid_search = time.time()
        grid_search_duration = end_time_grid_search - start_time_grid_search
        print(f"GridSearchCV completed in {grid_search_duration:.2f} seconds.")
        print(f"Best F1 score from GridSearchCV: {grid_search.best_score_:.4f}")
        print(f"Best parameters from GridSearchCV: {grid_search.best_params_}")
    except Exception as e:
        print(f"An error occurred during GridSearchCV: {e}")
        grid_search = None
        grid_search_duration = 0
else:
    print("Skipping GridSearchCV as training data is not available.")
    grid_search = None
    grid_search_duration = 0

Starting GridSearchCV for BaggingClassifier...
Fitting 2 folds for each of 48 candidates, totalling 96 fits
[CV] END estimator__class_weight=balanced, estimator__max_depth=5, max_features=1.0, max_samples=0.7, n_estimators=50; total time=   3.1s
[CV] END estimator__class_weight=balanced, estimator__max_depth=5, max_features=0.7, max_samples=0.7, n_estimators=50; total time=   3.5s
[CV] END estimator__class_weight=balanced, estimator__max_depth=5, max_features=0.7, max_samples=0.7, n_estimators=50; total time=   4.0s
[CV] END estimator__class_weight=balanced, estimator__max_depth=5, max_features=0.7, max_samples=1.0, n_estimators=50; total time=   4.2s
[CV] END estimator__class_weight=balanced, estimator__max_depth=5, max_features=1.0, max_samples=0.7, n_estimators=50; total time=   4.3s
[CV] END estimator__class_weight=balanced, estimator__max_depth=5, max_features=0.7, max_samples=1.0, n_estimators=50; total time=   4.5s
[CV] END estimator__class_weight=balanced, estimator__max_depth=
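
The candidate count in the log above follows directly from the grid, because `GridSearchCV` evaluates every combination of the listed values. The sketch below reproduces the count for both grids using only the standard library. The dictionaries are copied from the cell above, and `count_candidates` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import math

param_grid_fast = {
    'n_estimators': [50, 100],
    'estimator__max_depth': [5, 10, None],
    'estimator__class_weight': ['balanced', None],
    'max_samples': [0.7, 1.0],
    'max_features': [0.7, 1.0],
}

param_grid_full = {
    'n_estimators': [50, 100, 200],
    'estimator__max_depth': [5, 10, 20, None],
    'estimator__min_samples_split': [2, 5, 10],
    'estimator__min_samples_leaf': [1, 2, 4],
    'estimator__class_weight': ['balanced', None],
    'max_samples': [0.5, 0.7, 1.0],
    'max_features': [0.5, 0.7, 1.0],
    'bootstrap': [True, False],
    'bootstrap_features': [True, False],
}


def count_candidates(param_grid):
    """Number of combinations an exhaustive search over this grid would try."""
    return math.prod(len(values) for values in param_grid.values())


cv_folds = 2
for name, grid in [('fast', param_grid_fast), ('full', param_grid_full)]:
    n_candidates = count_candidates(grid)
    print(f"{name}: {n_candidates} candidates, {n_candidates * cv_folds} fits with cv={cv_folds}")
# fast: 48 candidates, 96 fits with cv=2
# full: 7776 candidates, 15552 fits with cv=2
```

The full grid is 162 times larger than the fast one, which is worth keeping in mind before switching `param_grid` in the search.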

## 6. Log Hyperparameter Tuning Experiments

In [5]:
if grid_search is not None and hasattr(grid_search, 'cv_results_'):
    cv_results = grid_search.cv_results_
    n_trials = len(cv_results['params'])
    print(f"Logging {n_trials} CV trials to Excel sheet '{CV_SHEET_NAME}'...")

    def cv_value(key, i):
        # Return the i-th entry for a cv_results_ key, or NaN if the key is absent
        # (e.g. when a scorer was not part of the search)
        return cv_results[key][i] if key in cv_results else np.nan

    for i, params_tried in enumerate(cv_results['params']):
        record = {
            'Model': MODEL_NAME,
            'Sheet_Context': 'CV_Trial',
            'Hyperparameter_Set_Tried': json.dumps(params_tried),
            'CV_F1_Mean': cv_value('mean_test_F1', i),
            'CV_F1_Std': cv_value('std_test_F1', i),
            'CV_ROC_AUC_Mean': cv_value('mean_test_ROC_AUC', i),
            'CV_ROC_AUC_Std': cv_value('std_test_ROC_AUC', i),
            'CV_Precision_Mean': cv_value('mean_test_Precision', i),
            'CV_Precision_Std': cv_value('std_test_Precision', i),
            'CV_Recall_Mean': cv_value('mean_test_Recall', i),
            'CV_Recall_Std': cv_value('std_test_Recall', i),
            'CV_Rank_F1': cv_value('rank_test_F1', i),
            'Fit_Time_Seconds_Mean': cv_value('mean_fit_time', i)
        }
        append_performance_record(PERFORMANCE_EXCEL_FILE, record, sheet_name=CV_SHEET_NAME)
    print("CV trials logging complete.")
else:
    print("Skipping CV trials logging as GridSearchCV results are not available or an error occurred.")

Logging 48 CV trials to Excel sheet 'BaggingClassifier_CV_Trials'...
Info: Sheet 'BaggingClassifier_CV_Trials' not found in /home/cmark/Projects/TrafficAccidentSeverity/reports/model_performance_summary.xlsx. Creating new sheet with common columns.
CV trials logging complete.
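
For a quick look at the trials without opening the Excel file, `cv_results_` can also be loaded into a pandas DataFrame. The sketch below shows the idea on a small synthetic dataset with a tiny grid; the data, the grid, and the `demo_search` and `trials` names are assumptions for illustration only, and this is not how `append_performance_record` works (its implementation is not shown in this notebook).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic, imbalanced stand-in for the real training data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=42
)

demo_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={'max_depth': [3, None]},
    scoring={
        'F1': make_scorer(f1_score, average='weighted'),
        'ROC_AUC': 'roc_auc',
    },
    refit='F1',
    cv=2,
)
demo_search.fit(X_demo, y_demo)

# cv_results_ is a dict of equal-length arrays, so it maps directly onto a DataFrame.
columns_of_interest = [
    'params', 'mean_test_F1', 'std_test_F1',
    'mean_test_ROC_AUC', 'rank_test_F1', 'mean_fit_time',
]
trials = pd.DataFrame(demo_search.cv_results_)[columns_of_interest]
trials = trials.sort_values('rank_test_F1')
print(trials.to_string(index=False))
```

The same two lines that build `trials` should work on the notebook's `grid_search` object, since the scorer names (`F1`, `ROC_AUC`) determine the column names in the same way.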


## 7. Best Model Evaluation and Logging

In [6]:
prerequisites_met = (
    grid_search is not None
    and hasattr(grid_search, 'best_estimator_')
    and all(obj is not None for obj in (X_train, y_train, X_test, y_test))
)
if prerequisites_met:
    best_bagging_model = grid_search.best_estimator_
    best_params = grid_search.best_params_

    # Predictions
    y_train_pred = best_bagging_model.predict(X_train)
    y_train_prob = best_bagging_model.predict_proba(X_train)[:, 1]
    y_test_pred = best_bagging_model.predict(X_test)
    y_test_prob = best_bagging_model.predict_proba(X_test)[:, 1]

    # Compute metrics
    train_metrics = compute_classification_metrics(y_train, y_train_pred, y_train_prob)
    test_metrics = compute_classification_metrics(y_test, y_test_pred, y_test_prob)

    print(f"Best {MODEL_NAME} Model Performance:")
    print(f"Training Set Metrics: {train_metrics}")
    print(f"Test Set Metrics: {test_metrics}")

    # Log final model performance
    final_record = {
        'Model': MODEL_NAME,
        'Sheet_Context': 'Final_Model',
        'Selected_Final_Hyperparameters': json.dumps(best_params),
        'Training_Time_Seconds': grid_search_duration, # Total GridSearchCV time
        'Train_Precision': train_metrics.get('Precision'),
        'Train_Recall': train_metrics.get('Recall'),
        'Train_F1': train_metrics.get('F1'),
        'Train_ROC_AUC': train_metrics.get('ROC_AUC'),
        'Test_Precision': test_metrics.get('Precision'),
        'Test_Recall': test_metrics.get('Recall'),
        'Test_F1': test_metrics.get('F1'),
        'Test_ROC_AUC': test_metrics.get('ROC_AUC'),
        'CV_Best_F1_Score': grid_search.best_score_ if hasattr(grid_search, 'best_score_') else np.nan
    }
    append_performance_record(PERFORMANCE_EXCEL_FILE, final_record, sheet_name='Model_Summaries')
    print("Final model performance logged.")

    # Save the best model
    model_save_path = MODELS_DIR / MODEL_FILENAME
    try:
        joblib.dump(best_bagging_model, model_save_path)
        print(f"Best {MODEL_NAME} model saved to {model_save_path}")
    except Exception as e:
        print(f"Error saving model {MODEL_NAME} to {model_save_path}: {e}")
else:
    print("Skipping final model evaluation, logging, and saving as prerequisites are not met (e.g., data loading error, GridSearchCV error, or test data missing).")
    best_bagging_model = None

Best BaggingClassifier Model Performance:
Training Set Metrics: {'Precision': 0.9989816882676894, 'Recall': 0.9989805742764909, 'F1': 0.9989770108697928, 'ROC_AUC': np.float64(0.9999988127812636)}
Test Set Metrics: {'Precision': 0.8997168030242177, 'Recall': 0.9306908267270668, 'F1': 0.9039773055765735, 'ROC_AUC': np.float64(0.6843987066285289)}
Final model performance logged.
Best BaggingClassifier model saved to /home/cmark/Projects/TrafficAccidentSeverity/models/baggingclassifier_best_model.joblib
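
When reading the weighted metrics above, it can help to know what a trivial classifier would score on a target this imbalanced. The sketch below computes weighted precision, recall, and F1 for a model that always predicts the majority class. The label vector is synthetic: the 301 severe cases are an assumption derived from the reported test size and class proportion, and the comparison also assumes that `compute_classification_metrics` uses the same weighted averaging as the CV scorers, which this notebook does not show.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Synthetic test labels with roughly the reported class balance (assumption):
# 4415 test rows, of which about 6.8% (301) are severe.
n_test, n_severe = 4415, 301
y_true_demo = np.array([0] * (n_test - n_severe) + [1] * n_severe)

# Trivial classifier: always predict the majority (non-severe) class.
y_majority_pred = np.zeros(n_test, dtype=int)

baseline = {
    'Precision': precision_score(
        y_true_demo, y_majority_pred, average='weighted', zero_division=0
    ),
    'Recall': recall_score(
        y_true_demo, y_majority_pred, average='weighted', zero_division=0
    ),
    'F1': f1_score(
        y_true_demo, y_majority_pred, average='weighted', zero_division=0
    ),
}
print({name: round(float(value), 4) for name, value in baseline.items()})
```

If the baseline's weighted recall and F1 land close to the test values reported above, per-class metrics for the severe class, together with ROC AUC, are likely to be more informative than the weighted averages alone.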


## 8. Conclusion

In [7]:
print(f"Finished processing for {MODEL_NAME}.")
if 'best_bagging_model' in locals() and best_bagging_model is not None:
    print("The Bagging Classifier was successfully trained, tuned, evaluated, and saved.")
    print(f"Best parameters found: {best_params}")
    print(f"Test F1 Score: {test_metrics.get('F1'):.4f}, Test ROC AUC: {test_metrics.get('ROC_AUC'):.4f}")
else:
    print("The Bagging Classifier could not be fully processed due to earlier errors (check logs).")

print(f"All results, including hyperparameter trials for {MODEL_NAME} (if run), are logged in '{PERFORMANCE_EXCEL_FILE}'.")

Finished processing for BaggingClassifier.
The Bagging Classifier was successfully trained, tuned, evaluated, and saved.
Best parameters found: {'estimator__class_weight': None, 'estimator__max_depth': None, 'max_features': 1.0, 'max_samples': 1.0, 'n_estimators': 50}
Test F1 Score: 0.9040, Test ROC AUC: 0.6844
All results, including hyperparameter trials for BaggingClassifier (if run), are logged in '/home/cmark/Projects/TrafficAccidentSeverity/reports/model_performance_summary.xlsx'.
