# Commit 15: Decision Tree - Initial & Tuning

**Objective:** Implement, tune, and evaluate a Decision Tree classifier for severe traffic accident prediction. Log all hyperparameter tuning experiments and final results to the Excel sheet.

## 1. Setup and Imports

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import time
import json
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
import sys
import os

# Add src directory to path for custom modules
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..', 'src')))
import modeling_utils as mu
import preprocessing_utils as pu

# Configure Pandas display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

# Define constants
PROCESSED_DATA_PATH = '../data/processed/preprocessed_data.csv'
MODEL_PATH = '../models/decision_tree_model.joblib'
EXCEL_SUMMARY_PATH = '../reports/model_performance_summary.xlsx'
MODEL_NAME = 'Decision Tree'
RANDOM_STATE = 42

## 2. Load Data

In [31]:
df = pd.read_csv(PROCESSED_DATA_PATH)
print(f"Data loaded successfully. Shape: {df.shape}")
df.head()

Data loaded successfully. Shape: (22072, 42)


Unnamed: 0,SEVERITY,Y,X,DATETIME_UTC,hour,day_of_week,day,month,year,is_weekend,season,ROAD_EDSA,MAIN_CAUSE_Human error,MAIN_CAUSE_Other (see description),MAIN_CAUSE_Road defect,MAIN_CAUSE_Unknown,MAIN_CAUSE_Vehicle defect,COLLISION_TYPE_Angle Impact,COLLISION_TYPE_Head-On,COLLISION_TYPE_Hit Object,COLLISION_TYPE_Multiple,COLLISION_TYPE_No Collision Stated,COLLISION_TYPE_Rear-End,COLLISION_TYPE_Self-Accident,COLLISION_TYPE_Side Swipe,WEATHER_Unknown,WEATHER_clear-day,WEATHER_clear-night,WEATHER_cloudy,WEATHER_fog,WEATHER_partly-cloudy-day,WEATHER_partly-cloudy-night,WEATHER_rain,LIGHT_Unknown,LIGHT_day,LIGHT_dusk,LIGHT_night,REPORTING_AGENCY_MMDA Metrobase,REPORTING_AGENCY_MMDA Road Safety Unit,REPORTING_AGENCY_Other,desc_word_count,desc_contains_collision
0,Property,14.65771,121.01979,2014-06-30 05:40:00,5,0,30,6,2014,False,Summer,True,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,True,False,30,1
1,Property,14.65771,121.01979,2014-03-17 01:00:00,1,0,17,3,2014,False,Spring,True,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,True,False,38,1
2,Injury,14.65771,121.01979,2013-11-26 02:00:00,2,1,26,11,2013,False,Fall,True,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,True,False,30,1
3,Property,14.65771,121.01979,2013-10-26 13:00:00,13,5,26,10,2013,True,Fall,True,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,True,False,31,1
4,Injury,14.65771,121.01966,2013-06-26 23:30:00,23,2,26,6,2013,False,Summer,True,True,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,True,False,32,1


## 3. Define Features (X) and Target (y)

In [32]:
# 'SEVERITY' (Property / Injury / Fatal) is the target variable;
# all other columns are treated as features after preprocessing in notebook 02.

if 'SEVERITY' not in df.columns:
    raise ValueError("Target variable 'SEVERITY' not found in the dataframe.")

X = df.drop('SEVERITY', axis=1)
y = df['SEVERITY']

print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")
print(f"Target distribution: {y.value_counts(normalize=True)}")

Features (X) shape: (22072, 41)
Target (y) shape: (22072,)
Target distribution: SEVERITY
Property   0.93127
Injury     0.06773
Fatal      0.00100
Name: proportion, dtype: float64
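
With 93% of records in the `Property` class, a trivial majority-class predictor is a useful reference point for every metric reported later. The following is a minimal sketch of that baseline, using only the proportions printed above; the variable names are illustrative and not part of the notebook.

```python
# Class proportions copied from the target distribution printed above.
class_share = {"Property": 0.93127, "Injury": 0.06773, "Fatal": 0.00100}

# A classifier that always predicts the most frequent class.
majority_class = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_class]

# Such a classifier has recall 1 for the majority class and 0 for the others,
# so its macro-averaged recall is 1 / number_of_classes.
baseline_macro_recall = 1.0 / len(class_share)

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.5f}")
print(f"Baseline macro recall: {baseline_macro_recall:.5f}")
```

Accuracy alone would therefore look strong even for a model that never flags an injury or fatality, which is why the notebook tunes on an F1 variant instead.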


## 4. Train-Test Split

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y)
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

X_train shape: (17657, 41), y_train shape: (17657,)
X_test shape: (4415, 41), y_test shape: (4415,)


## 5. Numerical Feature Scaling

In [34]:
# Identify numerical columns (assuming they are of type float or int and not binary/boolean after one-hot encoding)
numerical_cols = X_train.select_dtypes(include=np.number).columns.tolist()

# Remove any columns that are already binary (e.g., from one-hot encoding) or shouldn't be scaled
# This step might need adjustment based on the actual feature set after preprocessing
cols_to_scale = [col for col in numerical_cols if len(X_train[col].unique()) > 2] 

scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

if cols_to_scale:
    X_train_scaled[cols_to_scale] = scaler.fit_transform(X_train[cols_to_scale])
    X_test_scaled[cols_to_scale] = scaler.transform(X_test[cols_to_scale])
    print(f"Scaled columns: {cols_to_scale}")
else:
    print("No columns identified for scaling.")

X_train_scaled.head()

Scaled columns: ['Y', 'X', 'hour', 'day_of_week', 'day', 'month', 'year', 'desc_word_count']


Unnamed: 0,Y,X,DATETIME_UTC,hour,day_of_week,day,month,year,is_weekend,season,ROAD_EDSA,MAIN_CAUSE_Human error,MAIN_CAUSE_Other (see description),MAIN_CAUSE_Road defect,MAIN_CAUSE_Unknown,MAIN_CAUSE_Vehicle defect,COLLISION_TYPE_Angle Impact,COLLISION_TYPE_Head-On,COLLISION_TYPE_Hit Object,COLLISION_TYPE_Multiple,COLLISION_TYPE_No Collision Stated,COLLISION_TYPE_Rear-End,COLLISION_TYPE_Self-Accident,COLLISION_TYPE_Side Swipe,WEATHER_Unknown,WEATHER_clear-day,WEATHER_clear-night,WEATHER_cloudy,WEATHER_fog,WEATHER_partly-cloudy-day,WEATHER_partly-cloudy-night,WEATHER_rain,LIGHT_Unknown,LIGHT_day,LIGHT_dusk,LIGHT_night,REPORTING_AGENCY_MMDA Metrobase,REPORTING_AGENCY_MMDA Road Safety Unit,REPORTING_AGENCY_Other,desc_word_count,desc_contains_collision
7948,-0.97779,0.21942,2009-04-05 04:00:00,-0.8084,1.64366,-1.19811,-0.70811,-1.75055,True,Spring,True,False,False,False,True,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,True,False,False,False,False,True,False,0.69034,0
17297,1.27048,-0.00204,2013-10-23 05:45:00,-0.65935,-0.44388,0.86189,1.10753,-0.18682,False,Fall,True,False,False,False,True,False,False,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,True,False,-0.61884,0
3678,-1.25551,-0.47388,2014-10-13 16:20:00,0.98016,-1.48766,-0.28256,1.10753,0.20411,False,Fall,True,False,False,False,True,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,True,False,False,False,False,True,False,-0.18245,0
6054,-0.6812,0.71897,2016-03-30 12:15:00,0.38397,-0.44388,1.663,-1.01072,0.98597,False,Spring,True,False,False,False,True,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,True,False,False,False,False,True,False,0.03575,0
13923,1.31281,-1.19265,2015-10-30 11:30:00,0.23493,0.59989,1.663,1.10753,0.59504,False,Fall,True,False,False,False,True,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False,False,True,False,0.90853,1
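
The fit-on-train, apply-to-both pattern used above can be illustrated with plain NumPy. This is a sketch on synthetic data (the arrays and their parameters are made up for illustration); it mirrors what `StandardScaler` computes, namely the training mean and the population standard deviation (`ddof=0`).

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-ins for one numerical column in the train and test splits.
train = rng.normal(loc=14.6, scale=0.05, size=(100, 1))
test = rng.normal(loc=14.6, scale=0.05, size=(20, 1))

# Statistics are estimated on the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same statistics are then applied to both splits.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```

The scaled training column has mean 0 and standard deviation 1 by construction, while the test column is only approximately standardized, since it reuses the training statistics.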


## 6. Class Imbalance Handling (SMOTE)

In [35]:
print(f"Original y_train distribution: {y_train.value_counts(normalize=True)}")

# Prepare data for SMOTE: keep only numeric and boolean features
X_train_smote = X_train_scaled.select_dtypes(include=['number', 'bool'])

smote = SMOTE(random_state=RANDOM_STATE)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_smote, y_train)

print(f"Resampled X_train shape: {X_train_resampled.shape}")
print(f"Resampled y_train distribution: {y_train_resampled.value_counts(normalize=True)}")

Original y_train distribution: SEVERITY
Property   0.93125
Injury     0.06774
Fatal      0.00102
Name: proportion, dtype: float64
Resampled X_train shape: (49329, 39)
Resampled y_train distribution: SEVERITY
Property   0.33333
Injury     0.33333
Fatal      0.33333
Name: proportion, dtype: float64
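
SMOTE creates each synthetic minority row by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step on two made-up points; `interpolate_minority_sample` is a hypothetical helper, not the imbalanced-learn implementation. It also checks the row count reported above, assuming all three classes were raised to the majority count.

```python
import numpy as np

def interpolate_minority_sample(x, neighbor, rng):
    """Return a point on the segment between x and one of its neighbours."""
    lam = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + lam * (neighbor - x)

rng = np.random.default_rng(42)
x = np.array([0.2, -1.0])          # an illustrative minority sample
neighbor = np.array([0.6, -0.4])   # an illustrative minority-class neighbour
synthetic = interpolate_minority_sample(x, neighbor, rng)
print(synthetic)

# Three equally sized classes account for the resampled row count above.
resampled_rows = 49329
rows_per_class = resampled_rows // 3
print(rows_per_class, rows_per_class * 3 == resampled_rows)
```

Because most of the `Injury` and nearly all of the `Fatal` rows in the resampled set are interpolated, scores measured on that set are not directly comparable to scores on the original distribution.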


## 7. Decision Tree Model Implementation and Hyperparameter Tuning

In [36]:
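# class_weight='balanced' is kept as a baseline strategy. On the SMOTE-balanced training
# data the class counts are already equal, so the resulting weights are approximately uniform.
dt_classifier = DecisionTreeClassifier(random_state=RANDOM_STATE, class_weight='balanced')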

# Define parameter grid for GridSearchCV
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'ccp_alpha': [0.0, 0.001, 0.01, 0.1] # Cost-complexity pruning
}

# Initialize GridSearchCV
# Using F1-score for evaluation as it's a good metric for imbalanced datasets
grid_search = GridSearchCV(estimator=dt_classifier, 
                           param_grid=param_grid, 
                           cv=5, 
                           scoring='f1_weighted', # or 'roc_auc', 'f1_macro', etc.
                           verbose=1, 
                           n_jobs=-1, 
                           return_train_score=True)

print("Starting GridSearchCV for Decision Tree...")
start_time_tuning = time.time()
grid_search.fit(X_train_resampled, y_train_resampled) # Use resampled data for tuning
tuning_time_seconds = time.time() - start_time_tuning
print(f"GridSearchCV completed in {tuning_time_seconds:.2f} seconds.")

# Best estimator
best_dt_model = grid_search.best_estimator_
print(f"Best parameters found: {grid_search.best_params_}")
print(f"Best cross-validation F1 score: {grid_search.best_score_:.4f}")

Starting GridSearchCV for Decision Tree...
Fitting 5 folds for each of 360 candidates, totalling 1800 fits
GridSearchCV completed in 430.85 seconds.
Best parameters found: {'ccp_alpha': 0.0, 'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best cross-validation F1 score: 0.9416
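
The candidate and fit counts in the log follow directly from the grid: `GridSearchCV` evaluates the Cartesian product of all parameter values, once per fold. A short sketch that reproduces the arithmetic with the standard library (the variable names are illustrative):

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 5, 10, 15, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "ccp_alpha": [0.0, 0.001, 0.01, 0.1],
}

# Every combination of one value per parameter is one candidate.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)

n_folds = 5
total_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {total_fits} fits")
```

Each value added to any one parameter multiplies the total, which is worth keeping in mind before widening the grid.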


### 7.1 Log All Hyperparameter Tuning Trials to Excel

In [38]:
# Log all hyperparameter tuning trials to Excel using append_performance_record from modeling_utils.py
cv_results_df = pd.DataFrame(grid_search.cv_results_)

# Select relevant columns and prepare for logging
columns_to_log = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score', 'mean_train_score', 'std_train_score']
log_df = cv_results_df[columns_to_log].copy()
log_df.rename(columns={'params': 'Hyperparameter_Set_Tried', 
                       'mean_test_score': 'CV_Score_for_Set',
                       'std_test_score': 'CV_Std_Dev_for_Set',
                       'rank_test_score': 'CV_Rank'}, inplace=True)

log_df['Model_Name'] = MODEL_NAME + ' (Tuning Trial)'
log_df['Hyperparameter_Set_Tried'] = log_df['Hyperparameter_Set_Tried'].astype(str) # Convert dict to string for Excel
log_df['Timestamp'] = pd.Timestamp.now()
log_df['Selected_Final_Hyperparameters'] = ''
log_df['Training_Time_Seconds'] = tuning_time_seconds
log_df['Train_Precision'] = ''
log_df['Train_Recall'] = ''
log_df['Train_F1'] = ''
log_df['Train_ROC_AUC'] = ''
log_df['Test_Precision'] = ''
log_df['Test_Recall'] = ''
log_df['Test_F1'] = ''
log_df['Test_ROC_AUC'] = ''
log_df['Class_Imbalance_Strategy'] = 'SMOTE'
log_df['Notes'] = 'GridSearchCV tuning trial.'

print(f"Logging {len(log_df)} hyperparameter tuning trials to Excel...")
for _, row in log_df.iterrows():
    record = {
        'Model_Name': row['Model_Name'],
        'Timestamp': row['Timestamp'],
        'Hyperparameter_Set_Tried': row['Hyperparameter_Set_Tried'],
        'CV_Score_for_Set': row['CV_Score_for_Set'],
        'Selected_Final_Hyperparameters': row['Selected_Final_Hyperparameters'],
        'Training_Time_Seconds': row['Training_Time_Seconds'],
        'Train_Precision': row['Train_Precision'],
        'Train_Recall': row['Train_Recall'],
        'Train_F1': row['Train_F1'],
        'Train_ROC_AUC': row['Train_ROC_AUC'],
        'Test_Precision': row['Test_Precision'],
        'Test_Recall': row['Test_Recall'],
        'Test_F1': row['Test_F1'],
        'Test_ROC_AUC': row['Test_ROC_AUC'],
        'Class_Imbalance_Strategy': row['Class_Imbalance_Strategy'],
        'Notes': row['Notes']
    }
    mu.append_performance_record(EXCEL_SUMMARY_PATH, record)
print("Hyperparameter tuning trials logged.")
log_df.head()

Logging 360 hyperparameter tuning trials to Excel...
Hyperparameter tuning trials logged.


Unnamed: 0,Hyperparameter_Set_Tried,CV_Score_for_Set,CV_Std_Dev_for_Set,CV_Rank,mean_train_score,std_train_score,Model_Name,Timestamp,Selected_Final_Hyperparameters,Training_Time_Seconds,Train_Precision,Train_Recall,Train_F1,Train_ROC_AUC,Test_Precision,Test_Recall,Test_F1,Test_ROC_AUC,Class_Imbalance_Strategy,Notes
0,"{'ccp_alpha': 0.0, 'criterion': 'gini', 'max_d...",0.93953,0.02589,3,0.99994,4e-05,Decision Tree (Tuning Trial),2025-05-16 08:40:20.061453,,430.84739,,,,,,,,,SMOTE,GridSearchCV tuning trial.
1,"{'ccp_alpha': 0.0, 'criterion': 'gini', 'max_d...",0.93863,0.02562,4,0.9922,0.00237,Decision Tree (Tuning Trial),2025-05-16 08:40:20.061453,,430.84739,,,,,,,,,SMOTE,GridSearchCV tuning trial.
2,"{'ccp_alpha': 0.0, 'criterion': 'gini', 'max_d...",0.93667,0.0257,9,0.98218,0.00403,Decision Tree (Tuning Trial),2025-05-16 08:40:20.061453,,430.84739,,,,,,,,,SMOTE,GridSearchCV tuning trial.
3,"{'ccp_alpha': 0.0, 'criterion': 'gini', 'max_d...",0.93579,0.02508,10,0.98678,0.00317,Decision Tree (Tuning Trial),2025-05-16 08:40:20.061453,,430.84739,,,,,,,,,SMOTE,GridSearchCV tuning trial.
4,"{'ccp_alpha': 0.0, 'criterion': 'gini', 'max_d...",0.93561,0.02579,12,0.98545,0.00346,Decision Tree (Tuning Trial),2025-05-16 08:40:20.061453,,430.84739,,,,,,,,,SMOTE,GridSearchCV tuning trial.
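
Because `return_train_score=True` was set, the logged trials can also be compared by the gap between mean train score and mean CV score, a rough indicator of overfitting. The sketch below uses only the five rows displayed above; the tuple layout and names are illustrative.

```python
# (CV rank, mean CV score, mean train score) copied from the five rows shown above.
trials = [
    (3, 0.93953, 0.99994),
    (4, 0.93863, 0.99220),
    (9, 0.93667, 0.98218),
    (10, 0.93579, 0.98678),
    (12, 0.93561, 0.98545),
]

# Train-minus-CV gap per trial, keyed by CV rank.
gaps = {rank: train - cv for rank, cv, train in trials}
smallest_gap_rank = min(gaps, key=gaps.get)

for rank, gap in sorted(gaps.items()):
    print(f"rank {rank:>2}: train - CV = {gap:.5f}")
print(f"Smallest gap among these rows: rank {smallest_gap_rank}")
```

Among these rows, the best-ranked trial also has the largest gap, so a slightly lower-ranked, more constrained tree may generalize comparably.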


## 8. Evaluate Best Model

In [41]:
print("Evaluating the best Decision Tree model on training and test sets...")

# Ensure feature columns match those used during fitting (SMOTE)
smote_feature_cols = X_train_smote.columns.tolist()

# Predictions on the original training set (scaled, not resampled), using only SMOTE columns.
# NOTE: SEVERITY has three classes and scikit-learn stores `classes_` in sorted order
# (Fatal, Injury, Property), so `[:, 1]` keeps only the 'Injury' probability column.
y_train_pred = best_dt_model.predict(X_train_scaled[smote_feature_cols])
y_train_proba = best_dt_model.predict_proba(X_train_scaled[smote_feature_cols])[:, 1]

y_test_pred = best_dt_model.predict(X_test_scaled[smote_feature_cols])
y_test_proba = best_dt_model.predict_proba(X_test_scaled[smote_feature_cols])[:, 1]

# Calculate metrics using compute_classification_metrics from modeling_utils.py
train_metrics = mu.compute_classification_metrics(y_train, y_train_pred, y_train_proba)
test_metrics = mu.compute_classification_metrics(y_test, y_test_pred, y_test_proba)

print("\nTrain Set Metrics (Original - Scaled, Not Resampled):")
for metric, value in train_metrics.items():
    print(f"{metric}: {value:.4f}")

print("\nTest Set Metrics:")
for metric, value in test_metrics.items():
    print(f"{metric}: {value:.4f}")

# Plot ROC Curve for Test Set
if hasattr(mu, 'plot_roc_curve'):
    mu.plot_roc_curve(y_test, y_test_proba, model_name=MODEL_NAME, set_name='Test')
    plt.show()

# Plot Confusion Matrix for Test Set
if hasattr(mu, 'plot_confusion_matrix'):
    mu.plot_confusion_matrix(y_test, y_test_pred, model_name=MODEL_NAME, set_name='Test')
    plt.show()

Evaluating the best Decision Tree model on training and test sets...


ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
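
The error arises because scikit-learn's precision, recall, and F1 functions default to `average='binary'`, which is rejected for a three-class target. The internals of `mu.compute_classification_metrics` are not shown here, so the following is only a sketch of a multiclass-aware alternative on toy data. `multiclass_metrics` is a hypothetical helper; its result keys are chosen to match the names read in Section 9, and it expects the full probability matrix rather than a single column.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    roc_auc_score,
)

def multiclass_metrics(y_true, y_pred, y_proba, labels, average="weighted"):
    """Hypothetical helper: averaged metrics for a multiclass target."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, average=average, zero_division=0
    )
    # One-vs-rest ROC AUC needs one probability column per class,
    # ordered like `labels` (which must be sorted).
    roc_auc = roc_auc_score(
        y_true, y_proba, multi_class="ovr", average=average, labels=labels
    )
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
        "ROC_AUC": roc_auc,
    }

# Toy data; columns of y_proba follow the sorted class order.
labels = ["Fatal", "Injury", "Property"]
y_true = np.array(["Property", "Property", "Injury", "Fatal", "Property", "Injury"])
y_proba = np.array([
    [0.0, 0.1, 0.9],
    [0.1, 0.2, 0.7],
    [0.1, 0.6, 0.3],
    [0.5, 0.3, 0.2],
    [0.0, 0.6, 0.4],  # a Property row misclassified as Injury
    [0.1, 0.7, 0.2],
])
y_pred = np.array(labels)[y_proba.argmax(axis=1)]

metrics = multiclass_metrics(y_true, y_pred, y_proba, labels)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With the fitted model, the analogous inputs would be `best_dt_model.classes_` for `labels` and the unsliced `predict_proba` output for `y_proba`. Whether weighted or macro averaging is more appropriate depends on how much weight the rare `Fatal` class should carry.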

## 9. Log Final Model Performance to Excel

In [42]:
print("Logging final model performance to Excel...")
final_record = {
    'Model_Name': MODEL_NAME + ' (Final Tuned)',
    'Timestamp': pd.Timestamp.now().isoformat(),
    'Hyperparameter_Set_Tried': '',
    'CV_Score_for_Set': '',
    'Selected_Final_Hyperparameters': json.dumps(grid_search.best_params_),
    'Training_Time_Seconds': tuning_time_seconds,
    'Train_Precision': train_metrics.get('Precision'),
    'Train_Recall': train_metrics.get('Recall'),
    'Train_F1': train_metrics.get('F1'),
    'Train_ROC_AUC': train_metrics.get('ROC_AUC'),
    'Test_Precision': test_metrics.get('Precision'),
    'Test_Recall': test_metrics.get('Recall'),
    'Test_F1': test_metrics.get('F1'),
    'Test_ROC_AUC': test_metrics.get('ROC_AUC'),
    'Class_Imbalance_Strategy': 'SMOTE',
    'Notes': 'Used SMOTE on training data for tuning. Evaluation on original train set (scaled) and test set (scaled).'
}
mu.append_performance_record(EXCEL_SUMMARY_PATH, final_record)
print("Final model performance logged.")

Logging final model performance to Excel...


NameError: name 'train_metrics' is not defined

## 10. Save the Best Model

In [None]:
print(f"Saving the best Decision Tree model to {MODEL_PATH}")
joblib.dump(best_dt_model, MODEL_PATH)
print("Model saved successfully.")

# Also save the scaler if it was used and not saved centrally before
if cols_to_scale:
    scaler_path = '../models/decision_tree_scaler.joblib'
    joblib.dump(scaler, scaler_path)
    print(f"Scaler saved to {scaler_path}")


## 11. Documentation and Summary of Results

### Model: Decision Tree

**Preprocessing Steps Applied:**
1. Loaded preprocessed data from `preprocessed_data.csv`.
2. Features (X) and target (y = `SEVERITY`, with classes `Property`, `Injury`, and `Fatal`) defined.
3. Data split into training (80%) and testing (20%) sets, stratified by target.
4. Numerical features scaled using `StandardScaler` (fit on training data, applied to train and test).
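5. Class imbalance in the training set handled using `SMOTE`, applied only to the scaled training data after restricting it to numeric and boolean columns (this drops `DATETIME_UTC` and `season`, leaving 39 features).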

**Hyperparameter Tuning:**
- **Method:** `GridSearchCV` with 5-fold cross-validation.
- **Scoring Metric for Tuning:** `f1_weighted`.
- **Parameter Grid Searched:**
  - `criterion`: ['gini', 'entropy']
  - `max_depth`: [None, 5, 10, 15, 20]
  - `min_samples_split`: [2, 5, 10]
  - `min_samples_leaf`: [1, 2, 5]
  - `ccp_alpha`: [0.0, 0.001, 0.01, 0.1]
- **Best Hyperparameters Found:** `{'ccp_alpha': 0.0, 'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}`
- **Best CV F1 Score:** 0.9416 (weighted F1, measured on the SMOTE-resampled training data)
- **Tuning Time:** 430.85 seconds (360 candidates, 1800 fits)
- **Logging:** All hyperparameter combinations tried and their CV scores were logged to the `Hyperparameter_Tuning_Log` sheet in `model_performance_summary.xlsx`. The final selected model's details and performance metrics are intended for the `Model_Performance_Summary` sheet; in this run that logging cell stopped with a `NameError` because `train_metrics` was never defined.

**Evaluation Metrics (Best Tuned Model):**
*(Pending: the evaluation cell in Section 8 raised a `ValueError` on the multiclass target, so `train_metrics` and `test_metrics` were not computed in this run.)*
| Metric         | Train (Original, Scaled) | Test (Scaled) |
|----------------|--------------------------|---------------|
| Accuracy       |                          |               |
| Precision      |                          |               |
| Recall         |                          |               |
| F1 Score       |                          |               |
| ROC AUC        |                          |               |

**Model Persistence:**
- The best tuned Decision Tree model is written to `../models/decision_tree_model.joblib` by the cell in Section 10, which had not been executed in this run.
- The scaler is written to `../models/decision_tree_scaler.joblib` by the same cell (if applicable).

**Observations & Next Steps:**
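- The selected tree is unconstrained (`max_depth=None`, `min_samples_leaf=1`, `ccp_alpha=0.0`), and the first logged trial shows a mean train score of 0.99994 against a CV score of 0.93953, which suggests overfitting on the resampled data.
- *(Remaining items to be filled after a successful evaluation: test-set performance, comparison to other models if available, and any specific insights from the Decision Tree model.)*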