# STM Transit Delay Data Modeling

## Overview

This notebook compares tree-based regression and classification models (XGBoost, LightGBM and CatBoost) to find the one that predicts STM transit delays most accurately.

## Data Description

`exp_trip_duration`: Expected duration of a trip, in seconds.<br>
`route_direction`: Route direction in degrees.<br>
`route_type_Night`, `route_type_High Frequency`: One-Hot features for the type of bus line.<br>
`stop_location_group`: Stop cluster based on coordinates.<br>
`stop_distance`: Distance between the previous and current stop, in meters.<br>
`trip_phase_end`: One-Hot feature for trip progress.<br>
`exp_delay_prev_stop`: Expected duration between the previous and current stop, in seconds.<br>
`wheelchair_boarding`: Indicates whether the stop is accessible to people using a wheelchair.<br>
`sch_rel_Scheduled`: One-Hot feature for schedule relationship.<br>
`time_of_day_evening`, `time_of_day_morning`, `time_of_day_night`: One-Hot features for time of day.<br>
`is_weekend`: Boolean indicating whether the day of the week falls on the weekend.<br>
`is_peak_hour`: Boolean indicating whether the scheduled arrival time is at peak hour.<br>
`temperature`: Air temperature at 2 meters above ground, in Celsius.<br>
`relative_humidity`: Relative humidity at 2 meters above ground, in percentage.<br>
`precipitation`: Total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters.<br>
`pressure`: Atmospheric air pressure reduced to mean sea level (msl), in hPa.<br>
`cloud_cover`: Total cloud cover as an area fraction.<br>
`windspeed`: Wind speed at 10 meters above ground, in kilometers per hour.<br>
`wind_direction`: Wind direction at 10 meters above ground.<br>
`delay`: Difference between real and scheduled arrival time, in seconds.<br>
`delay_class`: Delay category, from early to late.

## Imports

In [1]:
from catboost import CatBoostRegressor, CatBoostClassifier
import joblib
import lightgbm as lgb
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
import random
import seaborn as sns
import shap
from sklearn.metrics import cohen_kappa_score, confusion_matrix, f1_score, classification_report, mean_absolute_error, root_mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.utils import shuffle
import sys
import xgboost as xgb

In [2]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import DELAY_CLASS

In [3]:
# Load data
df = pd.read_parquet('../data/preprocessed.parquet')
print(f'Shape of dataset: {df.shape}')

Shape of dataset: (7523993, 22)


## Split the data

In [4]:
# Get delay class distribution
proportions = df['delay_class'].value_counts(normalize=True)
prop_early = float(proportions.loc[0])
prop_on_time = float(proportions.loc[1])
prop_late = float(proportions.loc[2])

In [5]:
# Get row number of each class
sample_size = int(1e6)
nb_early = math.ceil(prop_early * sample_size)
nb_late = math.ceil(prop_late * sample_size)
nb_on_time = sample_size - (nb_early + nb_late)

In [6]:
# Due to the large volume of data, sample 1M rows, stratified manually to preserve delay class distribution
early_df = df[df['delay_class'] == 0].sample(n=nb_early, random_state=42)
on_time_df = df[df['delay_class'] == 1].sample(n=nb_on_time, random_state=42)
late_df = df[df['delay_class'] == 2].sample(n=nb_late, random_state=42)

sample_df = pd.concat([early_df, on_time_df, late_df], ignore_index=True)
sample_df = shuffle(sample_df, random_state=42).reset_index(drop=True)
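
The manual per-class sampling above can also be written as a grouped sample. The following is a minimal sketch on synthetic data: `toy_df`, its class proportions and the sample size are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical values)
rng = np.random.default_rng(42)
toy_df = pd.DataFrame({
    'delay_class': rng.choice([0, 1, 2], size=100_000, p=[0.01, 0.88, 0.11]),
    'delay': rng.normal(50, 150, size=100_000),
})

# Draw the same fraction from every class so the class proportions are preserved
sample_size = 10_000
frac = sample_size / len(toy_df)
toy_sample = (
    toy_df.groupby('delay_class')
    .sample(frac=frac, random_state=42)
    .sample(frac=1, random_state=42)  # shuffle the rows
    .reset_index(drop=True)
)

# Compare class proportions before and after sampling
full_props = toy_df['delay_class'].value_counts(normalize=True).sort_index()
sample_props = toy_sample['delay_class'].value_counts(normalize=True).sort_index()
print((full_props - sample_props).abs().max())
```

Because each group size is rounded independently, the total can differ from `sample_size` by a few rows, which is why the notebook computes the last class count by subtraction instead.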

In [7]:
# Use 60% for training
df_train, df_temp = train_test_split(
    sample_df, 
    test_size=0.4, 
    stratify=sample_df['delay_class'], # stratify with delay class
    random_state=42
)

In [8]:
# Split validation and test sets
df_val, df_test = train_test_split(
  df_temp,
  test_size=0.5,
  stratify=df_temp['delay_class'],
  random_state=42
)

del df_temp

In [9]:
# Check if the delay class distribution is preserved
print(sample_df['delay_class'].value_counts(normalize=True))
print(df_train['delay_class'].value_counts(normalize=True))
print(df_val['delay_class'].value_counts(normalize=True))
print(df_test['delay_class'].value_counts(normalize=True))

delay_class
1    0.884208
2    0.107581
0    0.008211
Name: proportion, dtype: float64
delay_class
1    0.884208
2    0.107580
0    0.008212
Name: proportion, dtype: float64
delay_class
1    0.884205
2    0.107585
0    0.008210
Name: proportion, dtype: float64
delay_class
1    0.88421
2    0.10758
0    0.00821
Name: proportion, dtype: float64


In [10]:
# Separate features from target variables
feature_cols = [col for col in df.columns if col not in ['delay', 'delay_class']]

X_train = df_train[feature_cols]
X_val = df_val[feature_cols]
X_test = df_test[feature_cols]

y_reg_train = df_train['delay']
y_reg_val = df_val['delay']
y_reg_test = df_test['delay']

y_class_train = df_train['delay_class']
y_class_val = df_val['delay_class']
y_class_test = df_test['delay_class']

**Scaling**

Since only tree-based models are explored in this project, scaling is not needed because the models are not sensitive to the absolute scale or distribution of the features.
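
To illustrate why scaling does not matter, here is a small sketch using a hand-written single-split search rather than any of the libraries used in this notebook. A tree split only depends on the ordering of a feature's values, so any strictly increasing transform (standardization, log) yields the same partition of the rows. The toy `stop_distance` and `delay` arrays are invented for the example.

```python
import numpy as np

def best_split_partition(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return the boolean 'goes left' mask of the split that minimizes squared error."""
    order = np.argsort(x, kind='stable')
    y_sorted = y[order]
    best_sse, best_k = np.inf, 1
    for k in range(1, len(x)):  # the first k sorted samples go left
        left, right = y_sorted[:k], y_sorted[k:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_k = sse, k
    mask = np.zeros(len(x), dtype=bool)
    mask[order[:best_k]] = True
    return mask

rng = np.random.default_rng(0)
stop_distance = rng.permutation(200).astype(float) * 10  # distinct toy values, in meters
delay = np.where(stop_distance > 1200, 90.0, 30.0) + rng.normal(0, 5, 200)

raw_mask = best_split_partition(stop_distance, delay)
scaled_mask = best_split_partition(
    (stop_distance - stop_distance.mean()) / stop_distance.std(), delay
)
log_mask = best_split_partition(np.log1p(stop_distance), delay)

# The same rows end up on the same side of the split in all three cases
print(np.array_equal(raw_mask, scaled_mask), np.array_equal(raw_mask, log_mask))
```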

## Regression Model

### Fit Base Models

All three libraries let you set the number of boosting rounds and an early-stopping patience. To start, every model runs for up to 100 rounds and stops early if the validation metric fails to improve for 3 consecutive rounds.
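
The early-stopping rule can be sketched in plain Python. This is a simplified illustration of the idea, not the exact logic of XGBoost, LightGBM or CatBoost, which differ in details such as minimum-improvement thresholds. The function name and the validation scores are hypothetical.

```python
def early_stopping_round(val_scores: list[float], patience: int) -> int:
    """Return the index of the best round, stopping once `patience`
    rounds have passed without a lower validation score."""
    best_score = float('inf')
    best_round = 0
    for i, score in enumerate(val_scores):
        if score < best_score:
            best_score, best_round = score, i
        elif i - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round

improving = [153.2, 150.6, 149.7, 149.3, 148.8]       # keeps improving: never stops early
plateau = [153.2, 150.6, 150.7, 150.9, 150.8, 149.0]  # stalls after round 1

print(early_stopping_round(improving, patience=3))
print(early_stopping_round(plateau, patience=3))
```

In the second series, training stops before the later improvement at round 5 is ever seen, which is the risk of a small patience such as 3.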

In [11]:
# Create dataframe to track metrics
reg_metrics_df = pd.DataFrame(columns=['model', 'MAE', 'RMSE', 'R²'])

In [12]:
def add_reg_metrics(reg_metrics_df:pd.DataFrame, y_pred:pd.Series, y_val:pd.Series, model_name:str) -> pd.DataFrame:
	mae = mean_absolute_error(y_val, y_pred)
	rmse = root_mean_squared_error(y_val, y_pred)
	r2 = r2_score(y_val, y_pred)

	reg_metrics_df.loc[len(reg_metrics_df)] = [model_name, mae, rmse, r2]
	return reg_metrics_df

#### XGBoost

In [None]:
# Create regression matrices
xg_train_data = xgb.DMatrix(X_train, y_reg_train, enable_categorical=False)
xg_val_data = xgb.DMatrix(X_val, y_reg_val, enable_categorical=False)
xg_test_data = xgb.DMatrix(X_test, y_reg_test, enable_categorical=False)
xg_eval_set = [(xg_train_data, 'train'), (xg_val_data, 'validation')]
xg_test_set = [(xg_train_data, 'train'), (xg_test_data, 'test')]

In [14]:
# Train model
xg_reg_base = xgb.train(
  params= {'objective': 'reg:squarederror', 'tree_method': 'hist'},
  dtrain=xg_train_data,
  num_boost_round=100,
  evals=xg_eval_set,
  verbose_eval=10,
  early_stopping_rounds=3
)

[0]	train-rmse:152.93557	validation-rmse:153.17449
[10]	train-rmse:150.18422	validation-rmse:150.58364
[20]	train-rmse:149.06105	validation-rmse:149.71753
[30]	train-rmse:148.41409	validation-rmse:149.27499
[40]	train-rmse:147.71234	validation-rmse:148.81655
[50]	train-rmse:147.13256	validation-rmse:148.43416
[60]	train-rmse:146.69414	validation-rmse:148.18049
[70]	train-rmse:146.21668	validation-rmse:147.89726
[80]	train-rmse:145.75915	validation-rmse:147.68019
[90]	train-rmse:145.33211	validation-rmse:147.42488
[99]	train-rmse:144.96538	validation-rmse:147.25898


In [15]:
# Evaluate model
y_pred = xg_reg_base.predict(xg_val_data)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'xg_reg_base')
reg_metrics_df

Unnamed: 0,model,MAE,RMSE,R²
0,xg_reg_base,79.331802,147.258983,0.090091


**MAE**<br>
On average, the predictions are off by 79 seconds, which is not very good.

**RMSE**<br>
The higher RMSE compared to MAE suggests that there are some significant prediction errors that influence the overall error metric.

**R²**<br>
The model explains 9% of the variance, which indicates the model is a very poor fit to the data.
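
As a rough illustration of how these three metrics relate, the sketch below computes them by hand with NumPy on made-up delays. A single large miss inflates RMSE far more than MAE, and a model that always predicts the mean delay scores an R² of 0, which is the baseline that the 9% should be read against.

```python
import numpy as np

def regression_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute MAE, RMSE and R² from their definitions."""
    errors = y_true - y_pred
    mae = np.abs(errors).mean()
    rmse = np.sqrt((errors ** 2).mean())
    r2 = 1 - (errors ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()
    return {'MAE': mae, 'RMSE': rmse, 'R2': r2}

# Toy delays in seconds (invented values): small errors plus one very late trip
y_true = np.array([10.0, 20.0, 30.0, 40.0, 600.0])
y_pred = np.array([15.0, 15.0, 35.0, 35.0, 100.0])

metrics = regression_metrics(y_true, y_pred)
baseline = regression_metrics(y_true, np.full_like(y_true, y_true.mean()))

print(metrics)   # RMSE is well above MAE because of the single 500-second miss
print(baseline)  # predicting the mean everywhere gives R² = 0
```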

#### LightGBM

In [16]:
# Train model
lgb_train_data = lgb.Dataset(X_train, label=y_reg_train)
lgb_val_data = lgb.Dataset(X_val, label=y_reg_val, reference=lgb_train_data)

lgb_reg_base = lgb.train(
    {
        'objective': 'regression',
        'metric': 'rmse',
        'learning_rate': 0.05,
        'max_depth': -1
    },
    lgb_train_data,
    valid_sets=[lgb_val_data],
    num_boost_round=100,
    callbacks=[lgb.early_stopping(stopping_rounds=3)]
)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.032420 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1066
[LightGBM] [Info] Number of data points in the train set: 600000, number of used features: 20
[LightGBM] [Info] Start training from score 53.781391
Training until validation scores don't improve for 3 rounds
Did not meet early stopping. Best iteration is:
[100]	valid_0's rmse: 149.712


In [17]:
# Evaluate model
y_pred = lgb_reg_base.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'lgb_reg_base')
reg_metrics_df

Unnamed: 0,model,MAE,RMSE,R²
0,xg_reg_base,79.331802,147.258983,0.090091
1,lgb_reg_base,81.259661,149.711666,0.059528


Overall, the LightGBM model performs worse than XGBoost.

#### CatBoost

In [18]:
# Fit model
cat_reg_base = CatBoostRegressor(
    iterations=100,
    learning_rate=0.05,
    depth=10,
    random_seed=42,
    verbose=10
)

cat_reg_base.fit(X_train, y_reg_train, eval_set=(X_val, y_reg_val), early_stopping_rounds=3)

0:	learn: 153.8873443	test: 154.1248588	best: 154.1248588 (0)	total: 188ms	remaining: 18.6s
10:	learn: 152.3135360	test: 152.5779089	best: 152.5779089 (10)	total: 2.31s	remaining: 18.7s
20:	learn: 151.5067878	test: 151.8104114	best: 151.8104114 (20)	total: 3.78s	remaining: 14.2s
30:	learn: 150.9801791	test: 151.3066600	best: 151.3066600 (30)	total: 5.09s	remaining: 11.3s
40:	learn: 150.5716939	test: 150.9312246	best: 150.9312246 (40)	total: 6.45s	remaining: 9.29s
50:	learn: 150.2456259	test: 150.6354564	best: 150.6354564 (50)	total: 7.88s	remaining: 7.57s
60:	learn: 149.9425947	test: 150.3767799	best: 150.3767799 (60)	total: 9.18s	remaining: 5.87s
70:	learn: 149.6384661	test: 150.1229627	best: 150.1229627 (70)	total: 10.5s	remaining: 4.31s
80:	learn: 149.3885366	test: 149.9176878	best: 149.9176878 (80)	total: 11.9s	remaining: 2.79s
90:	learn: 149.2123146	test: 149.7639969	best: 149.7639969 (90)	total: 13.3s	remaining: 1.32s
99:	learn: 149.0450631	test: 149.6322459	best: 149.6322459 (99

<catboost.core.CatBoostRegressor at 0x11e07d940>

In [19]:
# Evaluate model
y_pred = cat_reg_base.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'cat_reg_base')
reg_metrics_df

Unnamed: 0,model,MAE,RMSE,R²
0,xg_reg_base,79.331802,147.258983,0.090091
1,lgb_reg_base,81.259661,149.711666,0.059528
2,cat_reg_base,80.994939,149.632246,0.060526


CatBoost performs slightly better than LightGBM but worse than XGBoost. Without hyperparameter tuning, XGBoost seems to capture a bit more of the underlying patterns than the other two models.

### Hyperparameter Tuning

#### XGBoost

In [None]:
# Grid search with 2-fold cross-validation
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error',
    cv=2,
    n_jobs=-1,
    verbose=2
)

grid_search.fit(X_train, y_reg_train)

Fitting 2 folds for each of 243 candidates, totalling 486 fits


[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=1.0; total time=  15.8s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=1.0; total time=  16.0s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.8; total time=  16.6s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.8; total time=  16.6s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.6; total time=  16.9s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.6; total time=  17.0s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=200, subsample=0.6; total time=  28.3s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=200, subsample=0.6; total time=  28.4s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estima
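
The fit count reported by the grid search follows directly from the size of the grid. A short sketch using only the standard library, with the same grid values as the XGBoost search above:

```python
from itertools import product
from math import prod

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}
cv = 2

# Every combination of values is one candidate; each candidate is fitted once per fold
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv

# Enumerate the candidates explicitly to confirm the count
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]

print(n_candidates, n_fits, candidates[0])
```

Adding one more three-valued parameter would triple the number of fits, which is worth keeping in mind given that each fit here takes tens of seconds.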

In [None]:
# Best model
xg_best_model = grid_search.best_estimator_
xg_best_params = grid_search.best_params_

In [None]:
# Train best model with more boost rounds
# Note: xgb.train does not use 'n_estimators'; the number of trees is set by
# num_boost_round (capped by early stopping), so it is left out of params.
xg_reg_tuned = xgb.train(
  params={
    'objective': 'reg:squarederror',
    'tree_method': 'hist',
    'max_depth': xg_best_params['max_depth'],
    'learning_rate': xg_best_params['learning_rate'],
    'subsample': xg_best_params['subsample'],
    'colsample_bytree': xg_best_params['colsample_bytree'],
  },
  dtrain=xg_train_data,
  num_boost_round=1000,
  evals=xg_eval_set,
  verbose_eval=10,
  early_stopping_rounds=50
)

In [None]:
# Evaluate model
y_pred = xg_reg_tuned.predict(xg_val_data)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'xg_reg_tuned')
reg_metrics_df

There's a significant improvement from the base XGBoost model, especially the R-squared.

#### LightGBM

In [None]:
param_grid = {
  'n_estimators': [100, 500, 1000],
  'learning_rate': [0.01, 0.05, 0.1],
  'max_depth': [5, 10, 15],
  'num_leaves': [20, 31, 40],
  'min_child_samples': [10, 20, 30],
  'subsample': [0.8, 1.0],
  'colsample_bytree': [0.8, 1.0]
}

lgb_model = lgb.LGBMRegressor(random_state=42)

grid_search = GridSearchCV(
    estimator=lgb_model,
    param_grid=param_grid,
    cv=2, 
    scoring='neg_root_mean_squared_error',
    verbose=2,
    n_jobs=-1
)

grid_search.fit(X_train, y_reg_train)

In [None]:
# Best model
lgb_best_model = grid_search.best_estimator_
lgb_best_params = grid_search.best_params_

In [None]:
# Train model with more boost rounds and early stopping
# Note: 'n_estimators' is an alias of num_iterations in LightGBM and would
# override num_boost_round if passed in params, so it is left out here.
lgb_reg_tuned = lgb.train(
    params={
        'objective': 'regression',
        'metric': 'rmse',
        'learning_rate': lgb_best_params['learning_rate'],
        'max_depth': lgb_best_params['max_depth'],
        'num_leaves': lgb_best_params['num_leaves'],
        'min_child_samples': lgb_best_params['min_child_samples'],
        'subsample': lgb_best_params['subsample'],
        'colsample_bytree': lgb_best_params['colsample_bytree']
    },
    train_set=lgb_train_data,
    valid_sets=[lgb_val_data],
    num_boost_round=1000,
    callbacks=[lgb.early_stopping(stopping_rounds=50)]
)

In [None]:
# Evaluate model
y_pred = lgb_reg_tuned.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'lgb_reg_tuned')
reg_metrics_df

Interpret results

#### CatBoost

In [None]:
param_grid = {
  'iterations': [500, 1000],
  'learning_rate': [0.01, 0.05, 0.1],
  'depth': [6, 8, 10],
  'l2_leaf_reg': [1, 3, 5],
  'border_count': [32, 64, 128],
  'bagging_temperature': [0, 1, 5],
}

cat_model = CatBoostRegressor(verbose=0, random_seed=42)

grid_search = GridSearchCV(
    estimator=cat_model,
    param_grid=param_grid,
    cv=2,
    scoring='neg_root_mean_squared_error',
    verbose=2,
    n_jobs=-1
)

grid_search.fit(X_train, y_reg_train)

In [None]:
# Best model (kept under its own name; cat_reg_tuned is fitted in the next cell)
cat_best_model = grid_search.best_estimator_
cat_best_params = grid_search.best_params_

In [None]:
# Train best model with more iterations
cat_reg_tuned = CatBoostRegressor(
    iterations=cat_best_params['iterations'],
    learning_rate=cat_best_params['learning_rate'],
    depth=cat_best_params['depth'],
    l2_leaf_reg=cat_best_params['l2_leaf_reg'],
    border_count=cat_best_params['border_count'],
    bagging_temperature=cat_best_params['bagging_temperature'],
    random_seed=42,
    verbose=50
)

cat_reg_tuned.fit(X_train, y_reg_train, eval_set=(X_val, y_reg_val), early_stopping_rounds=50)

In [None]:
# Evaluate model
y_pred = cat_reg_tuned.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'cat_reg_tuned')
reg_metrics_df

In [None]:
# Get model with lowest RMSE
reg_metrics_df.nsmallest(n=1, columns='RMSE')

The best model is XGBoost. This is the model that will be used for the rest of the analysis.

### Residual Analysis

In [None]:
# Get best model (idxmin returns an index label, so use .loc)
reg_metrics_df.loc[reg_metrics_df['RMSE'].idxmin()]

In [None]:
# Get predictions (a Booster trained with xgb.train predicts on a DMatrix)
best_model = xg_reg_tuned
y_pred = best_model.predict(xg_val_data)

In [None]:
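# Plot residuals
y_true = y_reg_val  # actual validation delays
model_name = 'xg_reg_tuned'  # used in the saved figure's filename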
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(15, 7))

# Predicted vs. actual values
ax1.scatter(x=y_pred, y=y_true)
ax1.set_title('Predicted vs. Actual values')
ax1.set_xlabel('Predicted delay (seconds)')
ax1.set_ylabel('Actual delay (seconds)')
ax1.grid(True)

# Residuals
residuals = y_true - y_pred
ax2.scatter(x=y_pred, y=residuals)
ax2.set_title('Residual Plot')
ax2.set_xlabel('Predicted Delay (seconds)')
ax2.set_ylabel('Residuals (seconds)')
ax2.axhline(0, linestyle='--', color='orange')
ax2.grid(True)

fig.suptitle('Residual Analysis', fontsize=18)
fig.tight_layout()
fig.savefig(f'../images/residual_analysis_{model_name}.png', bbox_inches='tight')
plt.show()

### Feature Importances

#### MDI

#### SHAP Plots

### Feature Pruning

### Retrain Model with Best Features

### Retune Parameters

## Classification Model

### Fit Base Models

In [None]:
# Create dataframe to track metrics
class_metrics_df = pd.DataFrame(columns=['model', 'params', 'f1_macro', 'kappa'])

In [None]:
def add_class_metrics(class_metrics_df:pd.DataFrame, model, y_pred:pd.Series, y_val:pd.Series, model_name:str) -> pd.DataFrame:
	f1_macro = f1_score(y_val, y_pred, average='macro')
	kappa = cohen_kappa_score(y_val, y_pred)

	class_metrics_df.loc[len(class_metrics_df)] = [model_name, model.get_params(), f1_macro, kappa]
	return class_metrics_df
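
Macro F1 and Cohen's kappa are tracked instead of accuracy because the classes are heavily imbalanced. The sketch below computes both from a confusion matrix with NumPy for a hypothetical model that always predicts "on time". The counts are illustrative and only roughly mirror the class proportions seen earlier.

```python
import numpy as np

def macro_f1_and_kappa(cm: np.ndarray) -> tuple[float, float]:
    """Compute macro F1 and Cohen's kappa from a confusion matrix
    (rows = actual class, columns = predicted class)."""
    n = cm.sum()
    tp = np.diag(cm).astype(float)
    actual = cm.sum(axis=1)
    predicted = cm.sum(axis=0)
    # Per-class F1 = 2*TP / (actual + predicted), set to 0 when undefined
    denom = actual + predicted
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    p_observed = tp.sum() / n                         # plain accuracy
    p_expected = (actual * predicted).sum() / n ** 2  # agreement expected by chance
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return float(f1.mean()), float(kappa)

# 1,000 toy trips: 8 early, 884 on time, 108 late, all predicted "on time"
cm = np.array([
    [0, 8, 0],
    [0, 884, 0],
    [0, 108, 0],
])

accuracy = np.diag(cm).sum() / cm.sum()
macro_f1, kappa = macro_f1_and_kappa(cm)
print(accuracy, macro_f1, kappa)
```

Accuracy looks strong at 0.884, yet macro F1 is about 0.31 and kappa is 0, which is what makes these two metrics more informative for this dataset.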

#### XGBoost

#### LightGBM

#### CatBoost

### Hyperparameter Tuning

### Feature Importances

### Feature Pruning

### Retrain Model with Best Features

### Retune Parameters

## Final Model

### Evaluate with Test Set

### Make Prediction

In [None]:
# Save best model
joblib.dump(best_model, 'best_xgb_model.pkl')

## End

# STM Transit Delay Data Modeling Draft

### Random Forest Regressor

#### Fit Model

In [None]:
# Fit base model
from sklearn.ensemble import RandomForestRegressor

rf_reg_base = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf_reg_base.fit(X_train_sample, y_reg_train_sample)

#### Evaluate Model

In [None]:
# Calculate metrics
y_pred = rf_reg_base.predict(X_val)
reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'rf_reg_base')
reg_metrics_df

**MAE**<br>
On average, the predictions are off by 74 seconds, which is not very good.

**RMSE**<br>
The higher RMSE compared to MAE suggests that there are some significant prediction errors that influence the overall error metric.

**R²**<br>
The model explains 10.85% of the variance, which indicates the model is a poor fit to the data.

In [None]:
# Plot residual analysis
plot_residuals(y_pred, y_reg_val, 'rf_reg_base')

Interpret plot

### XGBoost Regressor

#### Fit Base Model

In [None]:
# Train a model


#### Evaluate Model

In [None]:
y_pred = xg_reg_base.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'xg_reg_base')
reg_metrics_df

**MAE:**

This is a slight improvement over the Random Forest model. The model is now, on average, 73.9 seconds off in its predictions, which is a reduction of almost 10 seconds.

**RMSE:**

The RMSE has also decreased compared to the previous model indicating that the XGBoost model is performing better and has reduced the impact of large errors.

**R²:**

This is a substantial improvement from -4.67%. With an R-squared of 12.52%, the XGBoost model explains more variance in the data, which shows that it's capturing more of the underlying patterns than the previous model. However, it's still a poor fit to the data.

In [None]:
# Plot residual analysis
plot_residuals(y_pred, y_reg_val, 'xg_reg_base')

#### Hyperparameter tuning

In [None]:
# Perform GridSearch with 5-fold CV
# (the estimator is named xgb_model so that it does not shadow the xgb module)
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    scoring='neg_mean_absolute_error',
    cv=5,
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_reg_train)

In [None]:
# Best model
xg_reg_tuned = grid_search.best_estimator_
xg_best_params = grid_search.best_params_

In [None]:
y_pred = xg_reg_tuned.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'xg_reg_tuned')
reg_metrics_df

### LightGBM Regressor

#### Fit Base Model

#### Evaluate Model

In [None]:
y_pred = lgb_reg_base.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'lgb_reg_base')
reg_metrics_df

#### Hyperparameter tuning

In [None]:
lgb_param_grid = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [5, 10, 15],
    'num_leaves': [20, 31, 40],
    'min_child_samples': [10, 20, 30],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

lgb_model = lgb.LGBMRegressor(random_state=42)

lgb_grid = GridSearchCV(
    estimator=lgb_model,
    param_grid=lgb_param_grid,
    cv=3,  # 3-fold cross-validation
    scoring='neg_mean_squared_error',
    verbose=1,
    n_jobs=-1
)

lgb_grid.fit(X_train, y_reg_train)

In [None]:
lgb_reg_tuned = lgb_grid.best_estimator_

In [None]:
# Best parameters and score
print("Best Parameters for LightGBM:")
print(lgb_grid.best_params_)
print(f"Best RMSE: {(-lgb_grid.best_score_) ** 0.5}")
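
A note on converting the grid-search score to an RMSE: scikit-learn maximizes scores, so `neg_mean_squared_error` reports the MSE with its sign flipped. The sign has to be restored before taking the square root, and because `**` binds tighter than unary minus in Python, parentheses are needed. A small sketch with a made-up score value:

```python
import math

best_score = -22500.0  # hypothetical value of best_score_ (negated MSE)

# Correct: negate first, then take the square root
rmse = math.sqrt(-best_score)

# Without parentheses, Python evaluates -(best_score ** 0.5);
# a negative float raised to 0.5 yields a complex number
wrong = -best_score ** 0.5

print(rmse, type(wrong).__name__)
```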

In [None]:
y_pred = lgb_reg_tuned.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'lgb_reg_tuned')
reg_metrics_df

### CatBoost Regressor

In [None]:
cat_reg_base = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.05,
    depth=10,
    random_seed=42,
    verbose=100
)

cat_reg_base.fit(X_train, y_reg_train, eval_set=(X_val, y_reg_val), early_stopping_rounds=50)

In [None]:
y_pred = cat_reg_base.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'cat_reg_base')
reg_metrics_df

## Classification Model

### Random Forest Classifier

#### Fit Base Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

# class_weight helps for rare classes
rf_class_base = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced', n_jobs=-1)
rf_class_base.fit(X_train, y_class_train)
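
For reference, `class_weight='balanced'` weights each class by `n_samples / (n_classes * count)`, so rare classes count for more during training. A minimal sketch of that formula with NumPy, using toy labels that only approximate the class imbalance in this dataset:

```python
import numpy as np

# Toy labels: 8 early (0), 884 on time (1), 108 late (2)
y = np.repeat([0, 1, 2], [8, 884, 108])

counts = np.bincount(y)
weights = len(y) / (len(counts) * counts)

for label, (count, weight) in enumerate(zip(counts, weights)):
    print(f'class {label}: {count} rows, weight {weight:.3f}')
```

With these weights every class contributes the same total weight, so the early class is no longer drowned out by the on-time majority.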

#### Evaluate Model

In [None]:
y_pred = rf_class_base.predict(X_val)

class_metrics_df = add_class_metrics(class_metrics_df, rf_class_base, y_pred, y_class_val, 'rf_class_base')
class_metrics_df[['model', 'f1_macro', 'kappa']]

Interpret results

In [None]:
class_labels = list(DELAY_CLASS.values())

In [None]:
def print_class_report(y_val, y_pred, labels):
	print(classification_report(y_val, y_pred, target_names=labels))

In [None]:
# Classification report
print_class_report(y_class_val, y_pred, class_labels)

In [None]:
def plot_confusion_matrix(y_val, y_pred, labels:list, model_name:str):
	cm = confusion_matrix(y_val, y_pred)
	plt.figure(figsize=(8, 6))
	sns.heatmap(cm, annot=True, fmt='d', cmap='crest', xticklabels=labels, yticklabels=labels)
	plt.xlabel('Predicted', fontsize=14)
	plt.ylabel('Actual', fontsize=14)
	plt.title('Confusion Matrix', fontsize=18)
	plt.tight_layout()
	plt.savefig(f'../images/cm_{model_name}.png', bbox_inches='tight')
	plt.show()

In [None]:
# Confusion matrix heatmap
plot_confusion_matrix(y_class_val, y_pred, class_labels, 'rf_class_base')

Interpret results

#### Hyperparameter Tuning

In [None]:
# Define parameter grid
param_dist = {
    'n_estimators': list(range(100, 700, 100)),
    'max_depth': [None, 10, 20, 30, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False],
    'class_weight': ['balanced', 'balanced_subsample']
}

# Initialize base model
rf = RandomForestClassifier(random_state=42, n_jobs=-1)

# Randomized search with 2-fold CV (to save computation time)
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=10, # 10 combinations
    cv=2,
    verbose=1,
    scoring='f1_macro', # optimize for macro F1
    n_jobs=-1
)

# Fit search
random_search.fit(X_train, y_class_train)

In [None]:
# Best model
rf_class_tuned = random_search.best_estimator_
rf_best_params = random_search.best_params_

In [None]:
y_pred = rf_class_tuned.predict(X_val)

class_metrics_df = add_class_metrics(class_metrics_df, rf_class_tuned, y_pred, y_class_val, 'rf_class_tuned')
class_metrics_df[['model', 'f1_macro', 'kappa']]

Interpret results

In [None]:
# Classification report
print_class_report(y_class_val, y_pred, class_labels)

In [None]:
# Confusion matrix
plot_confusion_matrix(y_class_val, y_pred, class_labels, 'rf_class_tuned')

Interpret results

### XGBoost Classifier

#### Fit Base Model

In [None]:
xg_class_base = xgb.XGBClassifier(
    objective='multi:softmax',
    num_class=3,
    eval_metric='mlogloss',
    random_state=42,
)

xg_class_base.fit(X_train, y_class_train)

#### Evaluate Model

In [None]:
y_pred = xg_class_base.predict(X_val)

class_metrics_df = add_class_metrics(class_metrics_df, xg_class_base, y_pred, y_class_val, 'xg_class_base')
class_metrics_df[['model', 'f1_macro', 'kappa']]

Interpret results

In [None]:
# Classification report
print_class_report(y_class_val, y_pred, class_labels)

In [None]:
# Confusion matrix heatmap
plot_confusion_matrix(y_class_val, y_pred, class_labels, 'xg_class_base')

#### Hyperparameter Tuning

In [None]:
# Define param grid
param_grid = {
    'n_estimators': [100, 200, 400],
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'min_child_weight': [1, 3, 5],
    'gamma': [0, 0.1, 0.3, 0.5]
}

# Initialize model (named xgb_clf so that it does not shadow the xgb module)
xgb_clf = xgb.XGBClassifier(
    objective='multi:softmax',
    num_class=3,
    eval_metric='mlogloss',
    random_state=42
)

# Grid search with 3-fold CV
grid_search = GridSearchCV(
    estimator=xgb_clf,
    param_grid=param_grid,
    cv=3,
    verbose=1,
    scoring='f1_macro',
    n_jobs=-1
)

# Fit
grid_search.fit(X_train, y_class_train)

In [None]:
# Best model
xg_class_tuned = grid_search.best_estimator_
xg_best_params = grid_search.best_params_

In [None]:
# Calculate metrics
y_pred = xg_class_tuned.predict(X_val)

class_metrics_df = add_class_metrics(class_metrics_df, xg_class_tuned, y_pred, y_class_val, 'xg_class_tuned')
class_metrics_df[['model', 'f1_macro', 'kappa']]

Interpret results

In [None]:
# Classification report
print_class_report(y_class_val, y_pred, class_labels)

In [None]:
# Confusion matrix heatmap
plot_confusion_matrix(y_class_val, y_pred, class_labels, 'xg_class_tuned')

Interpret results

## End