# 07: Random Forest Model

This notebook focuses on developing, tuning, and evaluating a Random Forest classifier for predicting severe traffic accidents.

## 1. Setup and Imports

In [14]:
import pandas as pd
import numpy as np
import joblib
import json
from pathlib import Path
import time
import sys

# Ensure src directory is in Python path
sys.path.append(str(Path.cwd().parent / 'src'))

# Scikit-learn imports
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score, roc_auc_score

# Imbalance handling (optional, if used)
# from imblearn.over_sampling import SMOTE 

# Custom utilities
from modeling_utils import compute_classification_metrics, append_performance_record
# from preprocessing_utils import load_preprocessed_data # Assuming this function exists

# Configuration
DATA_DIR = Path.cwd().parent / 'data'
PROCESSED_DATA_FILE = DATA_DIR / 'processed' / 'preprocessed_data.csv'
MODELS_DIR = Path.cwd().parent / 'models'
REPORTS_DIR = Path.cwd().parent / 'reports'
PERFORMANCE_EXCEL_FILE = REPORTS_DIR / 'model_performance_summary.xlsx'
RANDOM_STATE = 42

MODELS_DIR.mkdir(parents=True, exist_ok=True)
REPORTS_DIR.mkdir(parents=True, exist_ok=True)

## 2. Load Data

Load the preprocessed data. Ensure that the target variable `SEVERITY` and all features are correctly identified.

In [15]:
df = None  # Stays None if loading fails, so the check below can skip cleanly
try:
    df = pd.read_csv(PROCESSED_DATA_FILE)
except FileNotFoundError:
    print(f"Error: Processed data file not found at {PROCESSED_DATA_FILE}")
    print("Please ensure '02_data_preprocessing.ipynb' and '04_modeling_pipeline_setup.ipynb' have been run successfully.")

if 'df' in locals() and df is not None:
    print(f"Data loaded successfully: {df.shape}")
    # Assuming 'SEVERITY' is the target and other columns are features
    # This might need adjustment based on the actual columns from preprocessing
    if 'SEVERITY' not in df.columns:
        print("Error: Target column 'SEVERITY' not found in the dataframe.")
    else:
        X = df.drop('SEVERITY', axis=1)
        y = df['SEVERITY']
        print(f"Features shape: {X.shape}, Target shape: {y.shape}")
        print(f"Target distribution:\n{y.value_counts(normalize=True)}")

Data loaded successfully: (22072, 44)
Features shape: (22072, 43), Target shape: (22072,)
Target distribution:
SEVERITY
0    0.931859
1    0.068141
Name: proportion, dtype: float64


## 3. Train-Test Split

Split the data into training and testing sets. We use stratification to maintain the proportion of the target variable in both sets.

In [16]:
if 'X' in locals() and 'y' in locals():
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
    )
    print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
    print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")
else:
    print("Skipping train-test split as data was not loaded properly.")

X_train shape: (17657, 43), y_train shape: (17657,)
X_test shape: (4415, 43), y_test shape: (4415,)
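
To see what `stratify=y` buys us, the following minimal sketch splits a synthetic dataset with roughly the same 7% positive rate and compares the positive share in each partition. The data and the names `X_demo`, `y_demo`, `y_tr`, and `y_te` are illustrative assumptions, not the accident data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 7% positives (assumption)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 70 + [0] * 930)

# Same split settings as the notebook cell above
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both partitions keep a positive rate close to 7%
print(f"Train positive rate: {y_tr.mean():.4f}")
print(f"Test positive rate:  {y_te.mean():.4f}")
```

Without `stratify`, a random 20% sample of a rare class can drift noticeably from the overall rate, which makes test metrics harder to compare with cross-validation scores.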


## 4. Handle Class Imbalance (Optional)

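Only about 6.8% of records belong to the positive (severe) class, so the imbalance has to be addressed. `RandomForestClassifier` accepts `class_weight='balanced'`, which weights classes inversely to their frequency in the training data, or `'balanced_subsample'`, which recomputes those weights on each tree's bootstrap sample. Because the reweighting happens inside training and adds no synthetic rows, it is a convenient alternative to resampling techniques such as SMOTE for tree-based ensembles. We include `class_weight` in the hyperparameter grid so that the search can compare weighted and unweighted forests.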
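
As a rough illustration of the `'balanced'` setting, scikit-learn assigns each class the weight `n_samples / (n_classes * count)`. The sketch below checks that formula against `compute_class_weight` on a small toy label vector; `y_toy` is an assumption for illustration, not the accident data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (illustrative only): 9 negatives, 1 positive
y_toy = np.array([0] * 9 + [1])
classes = np.unique(y_toy)

# Weights as scikit-learn computes them for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same weights from the formula n_samples / (n_classes * count per class)
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.round(4).tolist())))
```

The rare class receives the larger weight, so misclassifying a positive sample costs more during tree construction.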

## 5. Model Definition and Hyperparameter Tuning (GridSearchCV)

We'll use GridSearchCV to find the best hyperparameters for the Random Forest model.

In [26]:
if 'X_train' in locals():
    rf_classifier = RandomForestClassifier(random_state=RANDOM_STATE)

    # Define a smaller, more focused parameter grid for faster initial run
    # Expand this grid for more thorough tuning
    param_grid = {
        'n_estimators': [100, 200], # Number of trees in the forest
        'max_depth': [10, 20, None], # Maximum depth of the tree
        'min_samples_split': [2, 5],    # Minimum number of samples required to split an internal node
        'min_samples_leaf': [1, 2],     # Minimum number of samples required to be at a leaf node
        'class_weight': ['balanced', 'balanced_subsample', None]
    }

    # Minimal grid for smoke tests; it runs in a few seconds.
    # Pass param_grid_fast to GridSearchCV below in place of param_grid for a quick check.
    param_grid_fast = {
        'n_estimators': [100], # Number of trees in the forest
        'max_depth': [10], # Maximum depth of the tree
        'min_samples_split': [2],    # Minimum number of samples required to split an internal node
        'min_samples_leaf': [1],     # Minimum number of samples required to be at a leaf node
        'class_weight': ['balanced', None]
    }

    

    # Define scoring metrics. With several metrics, `scoring` is a dictionary and
    # `refit` names the entry used to choose the best parameters.
    # Note: average='weighted' weights each class by its support, so these scores are
    # dominated by the majority class (about 93% of the rows here).
    scoring = {
        'F1': make_scorer(f1_score, average='weighted'),
        # The built-in 'roc_auc' scorer ranks by predicted scores/probabilities;
        # make_scorer(roc_auc_score) would score hard 0/1 predictions by default.
        'ROC_AUC': 'roc_auc',
        'Precision': make_scorer(precision_score, average='weighted', zero_division=0),
        'Recall': make_scorer(recall_score, average='weighted', zero_division=0)
    }

    grid_search = GridSearchCV(
        estimator=rf_classifier, 
        param_grid=param_grid, 
        scoring=scoring, 
        refit='F1', # Refit the best model using F1 score
        cv=3, # Number of cross-validation folds (use 3 for quicker run, 5-10 for robust)
        verbose=2, 
        n_jobs=-1 # Use all available cores
    )

    print("Starting GridSearchCV for Random Forest...")
    start_time_grid_search = time.time()
    grid_search.fit(X_train, y_train)
    end_time_grid_search = time.time()
    print(f"GridSearchCV completed in {end_time_grid_search - start_time_grid_search:.2f} seconds.")

    print(f"Best F1 score from GridSearchCV: {grid_search.best_score_:.4f}")
    print(f"Best parameters from GridSearchCV: {grid_search.best_params_}")
else:
    print("Skipping GridSearchCV as training data is not available.")

Starting GridSearchCV for Random Forest...
Fitting 3 folds for each of 72 candidates, totalling 216 fits
[CV] END class_weight=balanced, max_depth=10, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   2.2s
[CV] END class_weight=balanced, max_depth=10, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   2.2s
[CV] END class_weight=balanced, max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.7s
[CV] END class_weight=balanced, max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.7s
[CV] END class_weight=balanced, max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.8s
[CV] END class_weight=balanced, max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.7s
[CV] END class_weight=balanced, max_depth=10, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   2.7s
[CV] END class_weight=balan
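
The "72 candidates, totalling 216 fits" line in the log follows directly from the grid: the number of candidates is the product of the option counts, and each candidate is fitted once per fold. A minimal sketch using `ParameterGrid` reproduces the arithmetic (the grid is copied from the cell above; `cv_folds` is an illustrative name):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the tuning cell above
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'class_weight': ['balanced', 'balanced_subsample', None],
}

cv_folds = 3
n_candidates = len(ParameterGrid(param_grid))  # 2 * 3 * 2 * 2 * 3
total_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {total_fits} fits")
```

Checking this count before launching a search is a cheap way to estimate run time when the grid is expanded.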

## 6. Log Hyperparameter Tuning Experiments

Log each hyperparameter combination tried by GridSearchCV and its performance to the Excel sheet.

In [27]:
if 'grid_search' in locals():
    cv_results = grid_search.cv_results_
    print(f"Logging {len(cv_results['params'])} CV trials to Excel...")
    
    for i in range(len(cv_results['params'])):
        params_tried = cv_results['params'][i]
        # CV scores for each metric
        mean_f1 = cv_results['mean_test_F1'][i]
        std_f1 = cv_results['std_test_F1'][i]
        mean_roc_auc = cv_results['mean_test_ROC_AUC'][i]
        std_roc_auc = cv_results['std_test_ROC_AUC'][i]
        mean_precision = cv_results['mean_test_Precision'][i]
        std_precision = cv_results['std_test_Precision'][i]
        mean_recall = cv_results['mean_test_Recall'][i]
        std_recall = cv_results['std_test_Recall'][i]
        
        record = {
            'Model': 'Random Forest',
            'Sheet_Context': 'CV_Trial',
            'Hyperparameter_Set_Tried': json.dumps(params_tried),
            'CV_F1_Mean': mean_f1,
            'CV_F1_Std': std_f1,
            'CV_ROC_AUC_Mean': mean_roc_auc,
            'CV_ROC_AUC_Std': std_roc_auc,
            'CV_Precision_Mean': mean_precision,
            'CV_Precision_Std': std_precision,
            'CV_Recall_Mean': mean_recall,
            'CV_Recall_Std': std_recall,
            'CV_Rank_F1': cv_results['rank_test_F1'][i]
        }
        append_performance_record(PERFORMANCE_EXCEL_FILE, record, sheet_name='RandomForest_CV_Trials')
    print("CV trials logging complete.")
else:
    print("Skipping CV trials logging as GridSearchCV object is not available.")

Logging 72 CV trials to Excel...
CV trials logging complete.
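
For a quick look at the trials without opening the Excel file, `cv_results_` can also be loaded into a DataFrame. The sketch below runs a tiny search on synthetic data so that it is self-contained; the dataset, the reduced grid, and the names `search` and `trials` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small imbalanced synthetic dataset (assumption, not the accident data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=42
)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_grid={'max_depth': [3, None], 'class_weight': ['balanced', None]},
    scoring={'F1': 'f1', 'ROC_AUC': 'roc_auc'},
    refit='F1',
    cv=3,
)
search.fit(X_demo, y_demo)

# One row per hyperparameter combination, best F1 rank first
results = pd.DataFrame(search.cv_results_)
cols = ['params', 'mean_test_F1', 'std_test_F1', 'mean_test_ROC_AUC', 'rank_test_F1']
trials = results[cols].sort_values('rank_test_F1').reset_index(drop=True)
print(trials)
```

The column names follow the pattern `mean_test_<scorer name>`, which is why the logging loop above reads keys such as `mean_test_F1`.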


## 7. Best Model Evaluation and Logging

Evaluate the best model found by GridSearchCV on the training and test sets, then log its performance.

In [28]:
if 'grid_search' in locals() and 'X_train' in locals():
    best_rf_model = grid_search.best_estimator_
    best_params = grid_search.best_params_

    # GridSearchCV refits the best model on the whole training set and records the
    # duration of that final fit in `refit_time_` (available whenever refit is enabled),
    # so we log that instead of the total search time.
    approx_training_time = grid_search.refit_time_

    # Predictions
    y_train_pred = best_rf_model.predict(X_train)
    y_train_prob = best_rf_model.predict_proba(X_train)[:, 1] # Probability for the positive class
    y_test_pred = best_rf_model.predict(X_test)
    y_test_prob = best_rf_model.predict_proba(X_test)[:, 1]

    # Compute metrics
    train_metrics = compute_classification_metrics(y_train, y_train_pred, y_train_prob)
    test_metrics = compute_classification_metrics(y_test, y_test_pred, y_test_prob)

    print("Best Random Forest Model Performance:")
    print(f"Training Set Metrics: {train_metrics}")
    print(f"Test Set Metrics: {test_metrics}")

    # Log final model performance
    final_record = {
        'Model': 'Random Forest (Best Tuned)',
        'Sheet_Context': 'Final_Model',
        'Selected_Final_Hyperparameters': json.dumps(best_params),
        'Training_Time_Seconds': approx_training_time, # Or more accurate if refitted
        'Train_Precision': train_metrics.get('Precision'),
        'Train_Recall': train_metrics.get('Recall'),
        'Train_F1': train_metrics.get('F1'),
        'Train_ROC_AUC': train_metrics.get('ROC_AUC'),
        'Test_Precision': test_metrics.get('Precision'),
        'Test_Recall': test_metrics.get('Recall'),
        'Test_F1': test_metrics.get('F1'),
        'Test_ROC_AUC': test_metrics.get('ROC_AUC'),
        'Class_Imbalance_Strategy': str(best_params.get('class_weight')), # 'None' if no class weighting was selected
        'Notes': f"Best model from GridSearchCV with {grid_search.cv}-fold CV, refit on F1."
    }
    append_performance_record(PERFORMANCE_EXCEL_FILE, final_record, sheet_name='Model_Summaries')
    print("Final model performance logged.")
else:
    print("Skipping final model evaluation and logging as prerequisites are not met.")

Best Random Forest Model Performance:
Training Set Metrics: {'Precision': 0.998482657350985, 'Recall': 0.9984708614147364, 'F1': 0.9984746681476719, 'ROC_AUC': np.float64(0.999814187637749)}
Test Set Metrics: {'Precision': 0.9060619834978159, 'Recall': 0.9318233295583239, 'F1': 0.906780326467548, 'ROC_AUC': np.float64(0.7106674882138133)}
Final model performance logged.
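
The weighted test recall above (0.9318) is almost identical to the share of the majority class (0.9319), so it is worth checking positive-class metrics separately. The following sketch, on made-up labels rather than the notebook's predictions, shows how a classifier that never predicts the positive class still obtains high weighted scores.

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

# Made-up labels (assumption): 93 negatives, 7 positives
y_true = np.array([0] * 93 + [1] * 7)
y_pred = np.zeros_like(y_true)  # always predict the majority class

# Weighted averages are dominated by the majority class
weighted_recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)

# Recall on the positive class alone exposes the failure
positive_recall = recall_score(y_true, y_pred, pos_label=1, zero_division=0)

print(f"Weighted recall: {weighted_recall:.4f}")
print(f"Weighted F1:     {weighted_f1:.4f}")
print(f"Positive recall: {positive_recall:.4f}")
```

Whether the tuned forest behaves this way cannot be read off the weighted numbers alone; a per-class report on `y_test` and `y_test_pred` would settle it.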


## 8. Save Model

Save the best trained Random Forest model for future use.

In [29]:
if 'best_rf_model' in locals():
    model_filename = MODELS_DIR / 'random_forest_best_model.joblib'
    joblib.dump(best_rf_model, model_filename)
    print(f"Best Random Forest model saved to {model_filename}")
else:
    print("Skipping model saving as the best model is not available.")

Best Random Forest model saved to /home/cmark/Projects/TrafficAccidentSeverity/models/random_forest_best_model.joblib
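
A saved model can be restored later with `joblib.load`. The sketch below round-trips a small stand-in model through a temporary directory to show the pattern; the synthetic data and the file name `demo_model.joblib` are assumptions, and in practice the path would be the `model_filename` used above.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model trained on synthetic data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save and reload through a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'demo_model.joblib'
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions
same_predictions = bool((reloaded.predict(X_demo) == model.predict(X_demo)).all())
print(f"Predictions identical after reload: {same_predictions}")
```

Loading is most reliable with the same scikit-learn version that saved the model, and joblib files should only be loaded from trusted sources.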


## 9. Summary and Documentation

The Random Forest model was implemented, and hyperparameters were tuned using GridSearchCV. The `class_weight` parameter was included in the tuning process to address class imbalance. 

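Key findings:
*   Best Parameters: (Refer to `grid_search.best_params_` output above)
*   Best CV F1-Score: (Refer to `grid_search.best_score_` output above)
*   Test Set Performance (weighted F1-score): 0.9068
*   Test Set Performance (ROC AUC): 0.7107
*   Training-set scores are near-perfect (weighted F1 0.9985, ROC AUC 0.9998), so the gap to the test set points to substantial overfitting.
*   The weighted test recall (0.9318) is almost identical to the majority-class share (0.9319), so the weighted metrics say little about how well severe accidents are detected.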

All hyperparameter trials and the final model's performance have been logged to `reports/model_performance_summary.xlsx`. The best model has been saved to `models/random_forest_best_model.joblib`.

Further analysis could involve exploring feature importances from the Random Forest model.
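
As a starting point for that follow-up, impurity-based importances are available from a fitted forest through `feature_importances_`. The sketch below uses synthetic data and placeholder column names (assumptions); with the notebook's objects, `best_rf_model` and `X_train.columns` would take their place.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names (assumption)
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(5)])

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances, normalized by scikit-learn to sum to 1
importances = (
    pd.Series(model.feature_importances_, index=X_demo.columns)
    .sort_values(ascending=False)
)
print(importances)
```

Impurity-based importances can favor high-cardinality features, so permutation importance on the test set is a useful cross-check.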