# STM Transit Delay Data Modeling

## Overview

This notebook explores tree-based regression and classification models to find the one that predicts STM transit delays most accurately. The featured models are XGBoost, LightGBM, and CatBoost.

## Data Description

`exp_trip_duration`: Expected duration of a trip, in seconds.<br>
`route_direction_North`, `route_direction_South`, `route_direction_West`: One-Hot features for route directions.<br>
`route_type_Night`, `route_type_High Frequency`: One-Hot features for types of bus lines.<br>
`stop_location_group`: Stop cluster based on coordinates.<br>
`stop_distance`: Distance between the previous and current stop, in meters.<br>
`trip_phase_end`: One-Hot feature for trip progress.<br>
`exp_delay_prev_stop`: Expected duration between the previous and current stop, in seconds.<br>
`wheelchair_boarding`: Indicates whether the stop is accessible to wheelchair users.<br>
`sch_rel_Scheduled`: One-Hot feature for schedule relationship.<br>
`time_of_day_evening`, `time_of_day_morning`, `time_of_day_night`: One-Hot features for time of day.<br>
`is_weekend`: Boolean value indicating whether the day of the week falls on the weekend.<br>
`is_peak_hour`: Boolean value indicating whether the scheduled arrival time is at peak hour.<br>
`temperature`: Air temperature at 2 meters above ground, in Celsius.<br>
`relative_humidity`: Relative humidity at 2 meters above ground, in percentage.<br>
`precipitation`: Total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters.<br>
`pressure`: Atmospheric air pressure reduced to mean sea level (msl), in hPa.<br>
`cloud_cover`: Total cloud cover as an area fraction.<br>
`windspeed`: Wind speed at 10 meters above ground, in kilometers per hour.<br>
`wind_direction`: Wind direction at 10 meters above ground.<br>
`delay`: Difference between real and scheduled arrival time, in seconds.<br>
`delay_class`: Delay category, from early to late.
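
Each group of one-hot columns listed above appears to omit one category (for instance, only three `route_direction_*` columns are present), which is what dropping the first level during encoding would produce. The sketch below illustrates this on a hypothetical toy frame. The raw `route_direction` column and its four values are assumptions, because the preprocessing step is not shown in this notebook.

```python
import pandas as pd

# Hypothetical raw column; the real preprocessing is not part of this notebook
toy = pd.DataFrame({'route_direction': ['East', 'North', 'South', 'West', 'North']})

# drop_first=True removes the first category in sorted order ('East' here),
# leaving one indicator column for each remaining direction
encoded = pd.get_dummies(toy, columns=['route_direction'], drop_first=True)

print(encoded.columns.tolist())
```

Dropping one level keeps the indicators from being perfectly collinear. A row with all three indicators set to false then represents the dropped category.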

## Imports

In [1]:
from catboost import CatBoostRegressor, CatBoostClassifier
import joblib
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
import seaborn as sns
import shap
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import cohen_kappa_score, confusion_matrix, f1_score, classification_report, mean_absolute_error, root_mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
import sys
import xgboost as xgb

In [2]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import DELAY_CLASS

In [3]:
df = pd.read_parquet('../data/preprocessed.parquet')
print(f'Number of rows: {len(df)}')

Number of rows: 7429915


## Split the data

In [4]:
# Keep 30% for validation and testing
df_train, df_temp = train_test_split(
    df, 
    test_size=0.3, 
    stratify=df['delay_class'], 
    random_state=42
)

In [None]:
# Split validation and test set to make the train-val-test split 70-15-15
df_val, df_test = train_test_split(
    df_temp,
    test_size=0.5,
    stratify=df_temp['delay_class'],
    random_state=42
)

del df_temp
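
The two-step split works because holding out 30% and then halving it leaves 0.3 × 0.5 = 15% each for validation and test. A minimal sketch on a made-up frame (the `toy` data and its class counts are illustrative assumptions, not the STM data) confirms the resulting sizes:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced three-class target (illustrative only)
toy = pd.DataFrame({
    'x': np.arange(1000),
    'delay_class': np.repeat([0, 1, 2], [10, 890, 100]),
})

# Step 1: hold out 30% of the rows
train, temp = train_test_split(
    toy, test_size=0.3, stratify=toy['delay_class'], random_state=42
)

# Step 2: split the held-out rows in half (15% validation, 15% test)
val, test = train_test_split(
    temp, test_size=0.5, stratify=temp['delay_class'], random_state=42
)

print(len(train), len(val), len(test))
```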

In [6]:
# Check if the delay class distribution is preserved
print(df['delay_class'].value_counts(normalize=True))
print(df_train['delay_class'].value_counts(normalize=True))
print(df_val['delay_class'].value_counts(normalize=True))
print(df_test['delay_class'].value_counts(normalize=True))

delay_class
1    0.891401
2    0.100321
0    0.008278
Name: proportion, dtype: float64
delay_class
1    0.891401
2    0.100321
0    0.008278
Name: proportion, dtype: float64
delay_class
1    0.891401
2    0.100321
0    0.008278
Name: proportion, dtype: float64
delay_class
1    0.891401
2    0.100321
0    0.008277
Name: proportion, dtype: float64


In [7]:
# Separate features from target variables
feature_cols = [col for col in df.columns if col not in ['delay', 'delay_class']]

X_train = df_train[feature_cols]
X_val = df_val[feature_cols]
X_test = df_test[feature_cols]

y_reg_train = df_train['delay']
y_reg_val = df_val['delay']
y_reg_test = df_test['delay']

y_class_train = df_train['delay_class']
y_class_val = df_val['delay_class']
y_class_test = df_test['delay_class']

Because only tree-based models are explored in this project, feature scaling is not needed: tree splits depend only on the ordering of feature values, so the models are insensitive to the absolute scale or distribution of the features.
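
The scale-insensitivity of tree models can be checked empirically. The sketch below fits the same decision tree on synthetic features before and after a positive affine rescaling and compares the predictions. The data and variable names are made up for illustration, and a single `DecisionTreeRegressor` stands in for the boosted ensembles used later.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic integer-valued features and a noisy target (illustrative only)
X_raw = rng.integers(0, 100, size=(200, 3)).astype(float)
y = 2.0 * X_raw[:, 0] - X_raw[:, 1] + rng.normal(0, 5, size=200)

# A positive affine rescaling preserves the ordering of every feature
X_scaled = X_raw * 10.0 + 5.0

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_raw, y)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_scaled, y)

# Both trees partition the samples identically, so predictions should agree
pred_raw = tree_raw.predict(X_raw)
pred_scaled = tree_scaled.predict(X_scaled)
same = np.allclose(pred_raw, pred_scaled)
print(same)
```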

## Regression Model

### Fit Base Models

#### XGBoost

#### LightGBM

#### CatBoost

### Residual Analysis

### Hyperparameter Tuning

### Feature Importances

#### MDI

#### SHAP Plots

### Feature Pruning

### Retrain Model with Best Features

### Retune Parameters

## Classification

### Fit Base Models

#### XGBoost

#### LightGBM

#### CatBoost

### Hyperparameter Tuning

### Feature Importances

### Feature Pruning

### Retrain Model with Best Features

### Retune Parameters

## Final Model

### Evaluate with Test Set

### Make Prediction

## End

# STM Transit Delay Data Modeling Draft

### Random Forest Regressor

#### Fit Model

In [None]:
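# Fit base model on a subsample of the training set
# (assumption: the sample is not defined elsewhere in this notebook, so 10% is drawn here)
X_train_sample = X_train.sample(frac=0.1, random_state=42)
y_reg_train_sample = y_reg_train.loc[X_train_sample.index]
rf_reg_base = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf_reg_base.fit(X_train_sample, y_reg_train_sample)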

#### Evaluate Model

In [10]:
# Create dataframe to track metrics
reg_metrics_df = pd.DataFrame(columns=['model', 'MAE', 'RMSE', 'R²'])

In [11]:
def add_reg_metrics(reg_metrics_df: pd.DataFrame, y_pred: np.ndarray, y_val: pd.Series, model_name: str) -> pd.DataFrame:
    """Append MAE, RMSE and R² for one model to the metrics dataframe."""
    mae = mean_absolute_error(y_val, y_pred)
    rmse = root_mean_squared_error(y_val, y_pred)
    r2 = r2_score(y_val, y_pred)

    # Add the new row at the next integer index
    reg_metrics_df.loc[len(reg_metrics_df)] = [model_name, mae, rmse, r2]
    return reg_metrics_df

In [None]:
# Calculate metrics
y_pred = rf_reg_base.predict(X_val)
reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'rf_reg_base')
reg_metrics_df

**MAE**<br>
On average, the predictions are off by 74 seconds, which is not very good.

**RMSE**<br>
The higher RMSE compared to MAE suggests that there are some significant prediction errors that influence the overall error metric.

**R²**<br>
The model explains 10.85% of the variance, which indicates the model is a poor fit to the data.
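
To make the three metrics above concrete, here is a small worked example on made-up numbers (not the notebook's data). One large miss is enough to push the RMSE well above the MAE, which is the pattern described in the RMSE note. The sketch uses `np.sqrt(mean_squared_error(...))` so that it also runs on older scikit-learn versions without `root_mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical delays in seconds; the last trip is badly under-predicted
y_true = np.array([0.0, 10.0, 20.0, 30.0, 400.0])
y_pred = np.array([5.0, 5.0, 25.0, 25.0, 100.0])

# MAE weighs every error equally: (5 + 5 + 5 + 5 + 300) / 5
mae = mean_absolute_error(y_true, y_pred)

# RMSE squares the errors first, so the single 300-second miss dominates
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# R² compares the squared error with that of always predicting the mean
r2 = r2_score(y_true, y_pred)

print(f'MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}')
```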

In [None]:
def plot_residuals(y_pred: pd.Series, y_true:pd.Series, model_name:str) -> None:
	fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(15, 7))

	# Predicted vs. actual values
	ax1.scatter(x=y_pred, y=y_true)
	ax1.set_title('Predicted vs. Actual values')
	ax1.set_xlabel('Predicted delay (seconds)')
	ax1.set_ylabel('Actual delay (seconds)')
	ax1.grid(True)

	# Residuals
	residuals = y_true - y_pred
	ax2.scatter(x=y_pred, y=residuals)
	ax2.set_title('Residual Plot')
	ax2.set_xlabel('Predicted Delay (seconds)')
	ax2.set_ylabel('Residuals (seconds)')
	ax2.axhline(0, linestyle='--', color='orange')
	ax2.grid(True)

	fig.suptitle('Residual Analysis', fontsize=18)
	fig.tight_layout()
	fig.savefig(f'../images/residual_analysis_{model_name}.png', bbox_inches='tight')
	plt.show()

In [None]:
# Plot residual analysis
plot_residuals(y_pred, y_reg_val, 'rf_reg_base')

Interpret plot

### XGBoost Regressor

#### Fit Base Model

In [8]:
# Initialize the XGBoost model with default parameters
xg_reg_base = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

In [9]:
# Train a model
xg_reg_base.fit(X_train, y_reg_train)

#### Evaluate Model

In [12]:
y_pred = xg_reg_base.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'xg_reg_base')
reg_metrics_df

|   | model | MAE | RMSE | R² |
|---|---|---|---|---|
| 0 | xg_reg_base | 67.702981 | 114.622799 | 0.082915 |


**MAE:**

On average, the XGBoost predictions are off by about 67.7 seconds.

**RMSE:**

The RMSE of about 114.6 seconds is well above the MAE, which suggests that a relatively small number of large errors dominates the overall error.

**R²:**

With an R² of 0.083, the XGBoost model explains only about 8.3% of the variance in the delay, so it is still a poor fit to the data.

In [None]:
# Plot residual analysis
plot_residuals(y_pred, y_reg_val, 'xg_reg_base')

#### Hyperparameter tuning

In [None]:
# Perform GridSearch with 5-fold CV
# Named xgb_reg so the imported xgb module is not shadowed
xgb_reg = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

grid_search = GridSearchCV(
    estimator=xgb_reg,
    param_grid=param_grid,
    scoring='neg_mean_absolute_error',
    cv=5,
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_reg_train)
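
Before launching an exhaustive search like the one above, it helps to count how many models it will train. A minimal sketch using scikit-learn's `ParameterGrid` on a copy of the same grid follows. The variable names `n_candidates` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

# Five parameters with three values each give 3**5 candidates
n_candidates = len(ParameterGrid(param_grid))

# Every candidate is fitted once per fold (the final refit is not counted)
n_fits = n_candidates * 5

print(n_candidates, n_fits)
```

With 243 candidates and 5 folds, the search trains 1,215 models on a training set of roughly 5.2 million rows, so a randomized search or a smaller grid may be the more practical choice.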

In [None]:
# Best model
xg_reg_tuned = grid_search.best_estimator_
xg_best_params = grid_search.best_params_

In [None]:
y_pred = xg_reg_tuned.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'xg_reg_tuned')
reg_metrics_df

### LightGBM Regressor

#### Fit Base Model

In [15]:
# Fit
train_data = lgb.Dataset(X_train, label=y_reg_train)
valid_data = lgb.Dataset(X_val, label=y_reg_val, reference=train_data)

lgb_reg_base = lgb.train(
    {
        'objective': 'regression',
        'metric': 'rmse',
        'learning_rate': 0.05,
        'max_depth': 10
    },
    train_data,
    valid_sets=[valid_data],
    num_boost_round=100,
    callbacks=[lgb.early_stopping(stopping_rounds=3)]
)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.460066 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1230
[LightGBM] [Info] Number of data points in the train set: 5200940, number of used features: 23
[LightGBM] [Info] Start training from score 45.342338
Training until validation scores don't improve for 3 rounds
Did not meet early stopping. Best iteration is:
[100]	valid_0's rmse: 116.755


#### Evaluate Model

In [None]:
y_pred = lgb_reg_base.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'lgb_reg_base')
reg_metrics_df

|   | model | MAE | RMSE | R² |
|---|---|---|---|---|
| 0 | xg_reg_base | 67.702981 | 114.622799 | 0.082915 |
| 1 | lgb_reg_base | 69.450762 | 116.754556 | 0.048486 |


#### Hyperparameter tuning

In [None]:
lgb_param_grid = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [5, 10, 15],
    'num_leaves': [20, 31, 40],
    'min_child_samples': [10, 20, 30],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

lgb_model = lgb.LGBMRegressor(random_state=42)

lgb_grid = GridSearchCV(
    estimator=lgb_model,
    param_grid=lgb_param_grid,
    cv=3,  # 3-fold cross-validation
    scoring='neg_mean_squared_error',
    verbose=1,
    n_jobs=-1
)

lgb_grid.fit(X_train, y_reg_train)

In [None]:
lgb_reg_tuned = lgb_grid.best_estimator_

In [None]:
# Best parameters and score
print("Best Parameters for LightGBM:")
print(lgb_grid.best_params_)
print(f"Best RMSE: {(-lgb_grid.best_score_) ** 0.5}")  # negate first: the score is negative MSE

In [None]:
y_pred = lgb_reg_tuned.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'lgb_reg_tuned')
reg_metrics_df

### CatBoost Regressor

In [17]:
cat_reg_base = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.05,
    depth=10,
    random_seed=42,
    verbose=100
)

cat_reg_base.fit(X_train, y_reg_train, eval_set=(X_val, y_reg_val), early_stopping_rounds=50)

0:	learn: 119.3347044	test: 119.5142649	best: 119.5142649 (0)	total: 1.23s	remaining: 20m 30s
100:	learn: 116.2567547	test: 116.4351562	best: 116.4351562 (100)	total: 2m 16s	remaining: 20m 15s
200:	learn: 115.4994487	test: 115.7030307	best: 115.7030307 (200)	total: 4m 38s	remaining: 18m 27s
300:	learn: 114.9264458	test: 115.1532379	best: 115.1532379 (300)	total: 6m 28s	remaining: 15m 1s
400:	learn: 114.4448184	test: 114.6970710	best: 114.6970710 (400)	total: 8m 22s	remaining: 12m 31s
500:	learn: 114.0007762	test: 114.2745097	best: 114.2745097 (500)	total: 10m 25s	remaining: 10m 22s
600:	learn: 113.6298374	test: 113.9263879	best: 113.9263879 (600)	total: 12m 31s	remaining: 8m 18s
700:	learn: 113.3070343	test: 113.6257002	best: 113.6257002 (700)	total: 14m 32s	remaining: 6m 12s
800:	learn: 113.0050785	test: 113.3467621	best: 113.3467621 (800)	total: 16m 24s	remaining: 4m 4s
900:	learn: 112.7215418	test: 113.0800406	best: 113.0800406 (900)	total: 18m 31s	remaining: 2m 2s
999:	learn: 112.4

<catboost.core.CatBoostRegressor at 0x11ae7f620>

In [18]:
y_pred = cat_reg_base.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_reg_val, 'cat_reg_base')
reg_metrics_df

|   | model | MAE | RMSE | R² |
|---|---|---|---|---|
| 0 | xg_reg_base | 67.702981 | 114.622799 | 0.082915 |
| 1 | lgb_reg_base | 69.450762 | 116.754556 | 0.048486 |
| 2 | cat_reg_base | 66.518261 | 112.839323 | 0.111232 |


## Classification Model

### Random Forest Classifier

#### Fit Base Model

In [None]:
# class_weight helps for rare classes
rf_class_base = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced', n_jobs=-1) 
rf_class_base.fit(X_train, y_class_train)
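
To see what `class_weight='balanced'` does for the rare classes, the weights can be computed directly: each class receives `n_samples / (n_classes * class_count)`. The sketch below uses toy labels whose shares only roughly mirror the distribution printed earlier. The counts are assumptions chosen for round numbers.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: about 0.8% class 0, 89% class 1 and 10% class 2 (illustrative)
y_toy = np.repeat([0, 1, 2], [8, 892, 100])
classes = np.unique(y_toy)

# 'balanced' weights are inversely proportional to class frequency
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)

for cls, weight in zip(classes, weights):
    print(f'class {cls}: weight {weight:.3f}')
```

The rarest class ends up with a weight more than a hundred times larger than the majority class, so its misclassifications count far more during training.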

#### Evaluate Model

In [None]:
class_metrics_df = pd.DataFrame(columns=['model', 'params', 'f1_macro', 'kappa'])

In [None]:
def add_class_metrics(class_metrics_df:pd.DataFrame, model, y_pred:pd.Series, y_val:pd.Series, model_name:str) -> pd.DataFrame:
	f1_macro = f1_score(y_val, y_pred, average='macro')
	kappa = cohen_kappa_score(y_val, y_pred)

	class_metrics_df.loc[len(class_metrics_df)] = [model_name, model.get_params(), f1_macro, kappa]
	return class_metrics_df
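
Macro F1 and Cohen's kappa are tracked here rather than plain accuracy, and a toy example shows why that matters for imbalanced classes. In the sketch below (made-up labels with roughly the class shares seen earlier), a model that always predicts the majority class looks accurate but scores poorly on both tracked metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

# Toy labels with a dominant class 1 (illustrative counts)
y_true = np.repeat([0, 1, 2], [8, 892, 100])

# A trivial "model" that always predicts the majority class
y_pred = np.ones_like(y_true)

accuracy = accuracy_score(y_true, y_pred)

# zero_division=0 scores the never-predicted classes as 0 without a warning
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)

# Kappa measures agreement beyond what chance alone would produce
kappa = cohen_kappa_score(y_true, y_pred)

print(f'accuracy={accuracy:.3f}  macro F1={macro_f1:.3f}  kappa={kappa:.3f}')
```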

In [None]:
y_pred = rf_class_base.predict(X_val)

class_metrics_df = add_class_metrics(class_metrics_df, rf_class_base, y_pred, y_class_val, 'rf_class_base')
class_metrics_df[['model', 'f1_macro', 'kappa']]

Interpret results

In [None]:
# Convert to a list (assumes DELAY_CLASS is a dict ordered by class code)
class_labels = list(DELAY_CLASS.values())

In [None]:
def print_class_report(y_val, y_pred, labels):
	print(classification_report(y_val, y_pred, target_names=labels))

In [None]:
# Classification report
print_class_report(y_class_val, y_pred, class_labels)

In [None]:
def plot_confusion_matrix(y_val, y_pred, labels:list, model_name:str):
	cm = confusion_matrix(y_val, y_pred)
	plt.figure(figsize=(8, 6))
	sns.heatmap(cm, annot=True, fmt='d', cmap='crest', xticklabels=labels, yticklabels=labels)
	plt.xlabel('Predicted', fontsize=14)
	plt.ylabel('Actual', fontsize=14)
	plt.title('Confusion Matrix', fontsize=18)
	plt.tight_layout()
	plt.savefig(f'../images/cm_{model_name}.png', bbox_inches='tight')
	plt.show()

In [None]:
# Confusion matrix heatmap
plot_confusion_matrix(y_class_val, y_pred, class_labels, 'rf_class_base')

Interpret results

#### Hyperparameter Tuning

In [None]:
# Define parameter grid
param_dist = {
    'n_estimators': list(range(100, 700, 100)),
    'max_depth': [None, 10, 20, 30, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False],
    'class_weight': ['balanced', 'balanced_subsample']
}

# Initialize base model
rf = RandomForestClassifier(random_state=42, n_jobs=-1)

# Randomized search with 2-fold CV (to save computation time)
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=10, # 10 combinations
    cv=2,
    verbose=1,
    scoring='f1_macro', # optimize for macro F1
    n_jobs=-1
)

# Fit search
random_search.fit(X_train, y_class_train)
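
For a sense of how little of the search space this randomized search covers, the candidates can be drawn and counted with `ParameterSampler`, which `RandomizedSearchCV` uses internally. This is a sketch: `random_state=42` is added here only to make the draw reproducible and is not set in the search above.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as above
param_dist = {
    'n_estimators': list(range(100, 700, 100)),
    'max_depth': [None, 10, 20, 30, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False],
    'class_weight': ['balanced', 'balanced_subsample'],
}

# Size of the full grid that an exhaustive search would have to cover
n_total = len(ParameterGrid(param_dist))

# Draw 10 candidates, as n_iter=10 does in the randomized search
candidates = list(ParameterSampler(param_dist, n_iter=10, random_state=42))

print(f'{len(candidates)} of {n_total} combinations sampled')
```

With 2-fold CV, the 10 sampled candidates amount to 20 fits plus the final refit, compared with 2,160 combinations in the full grid.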

In [None]:
# Best model
rf_class_tuned = random_search.best_estimator_
rf_best_params = random_search.best_params_

In [None]:
y_pred = rf_class_tuned.predict(X_val)

class_metrics_df = add_class_metrics(class_metrics_df, rf_class_tuned, y_pred, y_class_val, 'rf_class_tuned')
class_metrics_df[['model', 'f1_macro', 'kappa']]

Interpret results

In [None]:
# Classification report
print_class_report(y_class_val, y_pred, class_labels)

In [None]:
# Confusion matrix
plot_confusion_matrix(y_class_val, y_pred, class_labels, 'rf_class_tuned')

Interpret results

### XGBoost Classifier

#### Fit Base Model

In [None]:
xg_class_base = xgb.XGBClassifier(
    objective='multi:softmax',
    num_class=3,
    eval_metric='mlogloss',
    random_state=42,
)

xg_class_base.fit(X_train, y_class_train)

#### Evaluate Model

In [None]:
y_pred = xg_class_base.predict(X_val)

class_metrics_df = add_class_metrics(class_metrics_df, xg_class_base, y_pred, y_class_val, 'xg_class_base')
class_metrics_df[['model', 'f1_macro', 'kappa']]

Interpret results

In [None]:
# Classification report
print_class_report(y_class_val, y_pred, class_labels)

In [None]:
# Confusion matrix heatmap
plot_confusion_matrix(y_class_val, y_pred, class_labels, 'xg_class_base')

#### Hyperparameter Tuning

In [None]:
# Define param grid
param_grid = {
    'n_estimators': [100, 200, 400],
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'min_child_weight': [1, 3, 5],
    'gamma': [0, 0.1, 0.3, 0.5]
}

# Initialize model (named xgb_clf so the imported xgb module is not shadowed)
xgb_clf = xgb.XGBClassifier(
    objective='multi:softmax',
    num_class=3,
    eval_metric='mlogloss',
    random_state=42
)

# Grid search with 3-fold CV
grid_search = GridSearchCV(
    estimator=xgb_clf,
    param_grid=param_grid,
    cv=3,
    verbose=1,
    scoring='f1_macro',
    n_jobs=-1
)

# Fit
grid_search.fit(X_train, y_class_train)

In [None]:
# Best model
xg_class_tuned = grid_search.best_estimator_
xg_best_params = grid_search.best_params_

In [None]:
# Calculate metrics for the tuned XGBoost classifier
y_pred = xg_class_tuned.predict(X_val)

class_metrics_df = add_class_metrics(class_metrics_df, xg_class_tuned, y_pred, y_class_val, 'xg_class_tuned')
class_metrics_df[['model', 'f1_macro', 'kappa']]

Interpret results

In [None]:
# Classification report
print_class_report(y_class_val, y_pred, class_labels)

In [None]:
# Confusion matrix heatmap
plot_confusion_matrix(y_class_val, y_pred, class_labels, 'xg_class_tuned')

Interpret results

## Feature Importances

## Final Model

### Fit Model

### Evaluate on Test Set

### Plot residuals

### Make Prediction

### Export Data

## End