# STM Transit Delay Data Modeling

## Overview

This notebook compares tree-based machine learning models to find the one that predicts STM transit delays with the lowest error. The featured models are XGBoost, LightGBM and CatBoost, which are well suited to large datasets with mixed data types and high-cardinality features.

## Imports

In [1]:
from catboost import CatBoostRegressor
import joblib
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import shap
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
import xgboost as xgb

In [2]:
# Load data
df = pd.read_parquet('../data/preprocessed.parquet')
print(f'Shape of dataset: {df.shape}')

Shape of dataset: (1500000, 59)


## Split the data

In [3]:
# Separate features from target variable
X = df.drop('delay', axis=1)
y = df['delay']

All three models train iteratively and are monitored on a validation set, which is also used to compare and tune them. A separate hold-out (test) set is therefore kept to evaluate the final model.

In [4]:
# Train-validation-test split (60-20-20)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

del X_temp
del y_temp
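
As a quick sanity check on the 60-20-20 comment, the sketch below works out the fractions produced by the two successive `test_size` values. The row count is the one printed when the dataset was loaded; the variable names are illustrative only.

```python
# Fractions produced by two successive splits, mirroring the test_size values above.
first_test_size = 0.4   # share of all rows sent to the temporary set
second_test_size = 0.5  # share of the temporary set sent to the test set

train_frac = 1 - first_test_size
val_frac = first_test_size * (1 - second_test_size)
test_frac = first_test_size * second_test_size

n_rows = 1_500_000  # dataset size printed earlier in the notebook
sizes = {
    'train': round(n_rows * train_frac),
    'val': round(n_rows * val_frac),
    'test': round(n_rows * test_frac),
}
print(sizes)
```

The training size matches the 900000 data points that LightGBM reports in its training log further down.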

**Scaling**

Since only tree-based models are explored in this project, scaling is not needed because the models are not sensitive to the absolute scale or distribution of the features.
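
The scale-invariance argument can be illustrated with a minimal sketch. The helper `best_split_mask` is a hypothetical, simplified single-split search written for this illustration (it is not how XGBoost, LightGBM or CatBoost are implemented), and the data is synthetic. It shows that rescaling a feature changes the threshold value but not which rows end up on each side of the split.

```python
import numpy as np

def best_split_mask(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return the boolean 'goes left' mask of the single split minimizing squared error."""
    order = np.argsort(x, kind='stable')
    xs, ys = x[order], y[order]
    best_sse, best_threshold = np.inf, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold fits between equal values
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_threshold = sse, (xs[i] + xs[i - 1]) / 2
    return x <= best_threshold

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)                                # synthetic feature
y = np.where(x > 6, 120.0, 30.0) + rng.normal(0, 5, size=200)   # synthetic target

mask_raw = best_split_mask(x, y)
mask_scaled = best_split_mask(1000 * x + 5, y)  # same feature after a positive affine rescaling
same_partition = bool(np.array_equal(mask_raw, mask_scaled))
print(same_partition)
```

Because the split only depends on the ordering of the feature values, any order-preserving transformation leaves the partition, and therefore the fitted tree, unchanged.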

## Fit Base Models

All three libraries let you set a number of boosting rounds and an early-stopping patience. To start, all models will run for at most 100 rounds, stopping early if the validation score does not improve for 3 consecutive rounds.
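
For readers unfamiliar with early stopping, here is a simplified sketch of the idea. The function `best_iteration_with_patience` and the score list are hypothetical; the libraries implement their own versions of this logic with more options.

```python
def best_iteration_with_patience(val_scores, patience):
    """Return (best_index, stop_index) for a metric where lower is better."""
    best_index = 0
    for i, score in enumerate(val_scores):
        if score < val_scores[best_index]:
            best_index = i              # new best validation score
        elif i - best_index >= patience:
            return best_index, i        # no improvement for `patience` rounds
    return best_index, len(val_scores) - 1  # ran through all rounds

# Hypothetical validation RMSE per round
scores = [155.1, 146.1, 144.0, 143.9, 144.2, 144.5, 144.8, 143.0]
result = best_iteration_with_patience(scores, patience=3)
print(result)
```

With a patience of 3, training stops at round 6 and keeps round 3 as the best iteration, even though round 7 would have been better. A small patience can therefore stop training prematurely on a noisy validation curve.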

In [5]:
# Create dataframe to track metrics
metrics_df = pd.DataFrame(columns=['model', 'MAE', 'RMSE', 'R²'])

In [6]:
def add_reg_metrics(metrics_df:pd.DataFrame, y_pred:pd.Series, y_true:pd.Series, model_name:str) -> pd.DataFrame:
	mae = mean_absolute_error(y_true, y_pred)
	rmse = root_mean_squared_error(y_true, y_pred)
	r2 = r2_score(y_true, y_pred)

	metrics_df.loc[len(metrics_df)] = [model_name, mae, rmse, r2]
	return metrics_df

### XGBoost

In [7]:
# Create regression matrices
xg_train_data = xgb.DMatrix(X_train, y_train, enable_categorical=False)
xg_val_data = xgb.DMatrix(X_val, y_val, enable_categorical=False)
xg_eval_set = [(xg_train_data, 'train'), (xg_val_data, 'validation')]

In [8]:
# Train model
xg_reg_base = xgb.train(
  params= {'objective': 'reg:squarederror', 'tree_method': 'hist'},
  dtrain=xg_train_data,
  num_boost_round=100,
  evals=xg_eval_set,
  verbose_eval=10,
  early_stopping_rounds=3
)

[0]	train-rmse:155.80079	validation-rmse:155.08603
[10]	train-rmse:145.90481	validation-rmse:146.12332
[20]	train-rmse:143.38536	validation-rmse:144.00183
[30]	train-rmse:141.07567	validation-rmse:141.94091
[40]	train-rmse:139.40611	validation-rmse:140.61362
[50]	train-rmse:137.73674	validation-rmse:139.30879
[60]	train-rmse:136.06218	validation-rmse:137.87494
[70]	train-rmse:134.54850	validation-rmse:136.68737
[80]	train-rmse:133.54640	validation-rmse:135.93771
[90]	train-rmse:132.45315	validation-rmse:135.01277
[99]	train-rmse:131.92073	validation-rmse:134.69873


In [9]:
# Evaluate model
y_pred = xg_reg_base.predict(xg_val_data)

metrics_df = add_reg_metrics(metrics_df, y_pred, y_val, 'xg_reg_base')
metrics_df

Unnamed: 0,model,MAE,RMSE,R²
0,xg_reg_base,68.705388,134.698728,0.294505


**MAE**<br>
On average, the predictions are off by 69 seconds, which is reasonable, knowing that [STM](https://www.stm.info/en/info/networks/bus-network-and-schedules-enlightened) considers a bus arriving 3 minutes after the planned schedule as being on time.

**RMSE**<br>
The higher RMSE compared to MAE suggests that there are some significant prediction errors that influence the overall error metric.

**R²**<br>
The model explains 29.45% of the variance, which is low but understandable given how random transit delays can be (bad weather, vehicle breakdowns, accidents, etc.).
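
The sketch below illustrates two of the points above. The first part uses hypothetical errors to show how a single large miss inflates RMSE relative to MAE. The second part uses the identity R² = 1 − MSE / Var(y) to back out the spread of the validation target from the `xg_reg_base` row of the metrics table.

```python
import math
import numpy as np

# Hypothetical errors in seconds: mostly small, with one large miss.
errors = np.array([10.0, -20.0, 15.0, -5.0, 600.0])
mae = np.abs(errors).mean()
rmse = math.sqrt((errors ** 2).mean())
print(f'MAE: {mae:.1f}  RMSE: {rmse:.1f}')

# R² = 1 - MSE / Var(y), so Var(y) = RMSE² / (1 - R²).
reported_rmse, reported_r2 = 134.698728, 0.294505  # xg_reg_base row above
implied_std = math.sqrt(reported_rmse ** 2 / (1 - reported_r2))
print(f'Implied std of validation delays: {implied_std:.1f} s')
```

The implied standard deviation of roughly 160 seconds gives a sense of scale: a model that always predicted the mean delay would have an RMSE of about that size.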

### LightGBM

In [10]:
# Create regression datasets
lgb_train_data = lgb.Dataset(X_train, label=y_train)
lgb_val_data = lgb.Dataset(X_val, label=y_val, reference=lgb_train_data)

In [11]:
# Train model
lgb_reg_base = lgb.train(
    params={
        'objective': 'regression',
        'metric': 'rmse',
        'learning_rate': 0.05,
        'max_depth': -1
    },
    train_set=lgb_train_data,
    valid_sets=[lgb_val_data],
    num_boost_round=100,
    callbacks=[lgb.early_stopping(stopping_rounds=3)]
)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.122161 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4754
[LightGBM] [Info] Number of data points in the train set: 900000, number of used features: 58
[LightGBM] [Info] Start training from score 52.317412
Training until validation scores don't improve for 3 rounds
Did not meet early stopping. Best iteration is:
[100]	valid_0's rmse: 143.628


In [12]:
# Evaluate model
y_pred = lgb_reg_base.predict(X_val)

metrics_df = add_reg_metrics(metrics_df, y_pred, y_val, 'lgb_reg_base')
metrics_df

Unnamed: 0,model,MAE,RMSE,R²
0,xg_reg_base,68.705388,134.698728,0.294505
1,lgb_reg_base,72.783585,143.628133,0.197867


The LightGBM model performs worse than XGBoost, especially in terms of R-squared.

### CatBoost

In [13]:
# Fit model
cat_reg_base = CatBoostRegressor(
    iterations=100,
    learning_rate=0.05,
    depth=10,
    random_seed=42,
    verbose=10
)

cat_reg_base.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=3)

0:	learn: 160.2481177	test: 159.5089783	best: 159.5089783 (0)	total: 439ms	remaining: 43.5s
10:	learn: 154.4491909	test: 153.6929522	best: 153.6929522 (10)	total: 4.47s	remaining: 36.2s
20:	learn: 151.5593984	test: 150.9251366	best: 150.9251366 (20)	total: 8.58s	remaining: 32.3s
30:	learn: 149.8589391	test: 149.3471130	best: 149.3471130 (30)	total: 13s	remaining: 28.9s
40:	learn: 148.6627666	test: 148.2603400	best: 148.2603400 (40)	total: 17.4s	remaining: 25.1s
50:	learn: 147.7654611	test: 147.4723157	best: 147.4723157 (50)	total: 22.4s	remaining: 21.5s
60:	learn: 146.9983279	test: 146.8140565	best: 146.8140565 (60)	total: 27.7s	remaining: 17.7s
70:	learn: 146.3649630	test: 146.2503484	best: 146.2503484 (70)	total: 33.9s	remaining: 13.9s
80:	learn: 145.8601177	test: 145.7948161	best: 145.7948161 (80)	total: 40.3s	remaining: 9.46s
90:	learn: 145.3587754	test: 145.3627418	best: 145.3627418 (90)	total: 46.7s	remaining: 4.62s
99:	learn: 144.9163148	test: 144.9739177	best: 144.9739177 (99)	

<catboost.core.CatBoostRegressor at 0x11d1ddd30>

In [14]:
# Evaluate model
y_pred = cat_reg_base.predict(X_val)

metrics_df = add_reg_metrics(metrics_df, y_pred, y_val, 'cat_reg_base')
metrics_df

Unnamed: 0,model,MAE,RMSE,R²
0,xg_reg_base,68.705388,134.698728,0.294505
1,lgb_reg_base,72.783585,143.628133,0.197867
2,cat_reg_base,72.896601,144.973918,0.182765


CatBoost performs similarly to LightGBM. So far, XGBoost seems to capture a bit more of the underlying patterns than the other two models.

## Hyperparameter Tuning

### XGBoost

In [15]:
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500, 600],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'alpha': [0, 1, 2, 5], # L1 regularization
    'lambda': [0, 1, 2, 5] # L2 regularization
}

xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_dist,
    n_iter=25,
    scoring='neg_root_mean_squared_error',
    cv=2,
    verbose=1,
    random_state=42
)

random_search.fit(X_train, y_train)

Fitting 2 folds for each of 25 candidates, totalling 50 fits


In [16]:
# Best model
xg_reg_tuned = random_search.best_estimator_
xg_best_params = random_search.best_params_

In [17]:
# Evaluate model
y_pred = xg_reg_tuned.predict(X_val)

metrics_df = add_reg_metrics(metrics_df, y_pred, y_val, 'xg_reg_tuned')
metrics_df

Unnamed: 0,model,MAE,RMSE,R²
0,xg_reg_base,68.705388,134.698728,0.294505
1,lgb_reg_base,72.783585,143.628133,0.197867
2,cat_reg_base,72.896601,144.973918,0.182765
3,xg_reg_tuned,65.801101,130.530964,0.337487


There's a slight improvement over the base XGBoost model, and it's the best performing model so far.
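
To put the random search in perspective, the following sketch counts the combinations in the XGBoost grid defined above and compares that number with the 25 sampled candidates. The counts are read off `param_dist`.

```python
import math

# Number of candidate values per hyperparameter in the XGBoost grid above.
grid_sizes = {
    'n_estimators': 6, 'max_depth': 3, 'learning_rate': 3, 'subsample': 3,
    'colsample_bytree': 3, 'alpha': 4, 'lambda': 4,
}
total_combinations = math.prod(grid_sizes.values())
n_iter = 25  # candidates sampled by RandomizedSearchCV
coverage = n_iter / total_combinations
print(f'{total_combinations} combinations, {coverage:.2%} sampled')
```

Only a small fraction of the grid is explored, so the tuned parameters should be read as a good candidate rather than the optimum of the grid.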

### LightGBM

In [18]:
param_dist = {
  'n_estimators': [100, 200, 300, 400, 500, 600],
  'learning_rate': [0.01, 0.05, 0.1],
  'max_depth': [5, 10, 15],
  'num_leaves': [20, 31, 40],
  'min_child_samples': [10, 20, 30],
  'subsample': [0.8, 1.0],
  'colsample_bytree': [0.8, 1.0],
  'reg_alpha': [0, 1, 2, 5],
  'reg_lambda': [0, 1, 2, 5],
}

lgb_model = lgb.LGBMRegressor(random_state=42)

random_search = RandomizedSearchCV(
    estimator=lgb_model,
    param_distributions=param_dist,
    n_iter=25,
    cv=2, 
    scoring='neg_root_mean_squared_error',
    verbose=1,
    random_state=42
)

random_search.fit(X_train, y_train)

Fitting 2 folds for each of 25 candidates, totalling 50 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.194622 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4754
[LightGBM] [Info] Number of data points in the train set: 450000, number of used features: 58
[LightGBM] [Info] Start training from score 52.461629
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.199122 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4754
[LightGBM] [Info] Number of data points in the train set: 450000, number of used features: 58
[LightGBM] [Info] Start training from score 52.173195

In [19]:
# Best model
lgb_reg_tuned = random_search.best_estimator_
lgb_best_params = random_search.best_params_

In [20]:
# Evaluate model
y_pred = lgb_reg_tuned.predict(X_val)

metrics_df = add_reg_metrics(metrics_df, y_pred, y_val, 'lgb_reg_tuned')
metrics_df

Unnamed: 0,model,MAE,RMSE,R²
0,xg_reg_base,68.705388,134.698728,0.294505
1,lgb_reg_base,72.783585,143.628133,0.197867
2,cat_reg_base,72.896601,144.973918,0.182765
3,xg_reg_tuned,65.801101,130.530964,0.337487
4,lgb_reg_tuned,66.353197,129.303391,0.34989


The performance is very similar to the previous tuned model. The MAE is slightly worse but the RMSE and the R-squared are slightly better.
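
The size of the change from base to tuned models can be made explicit with a small calculation on the RMSE values copied from the metrics table. Note that the base and tuned models also differ in learning rate and number of rounds, so this is only a rough comparison.

```python
# RMSE values copied from the metrics table above (seconds).
rmse = {
    'xg_reg_base': 134.698728, 'xg_reg_tuned': 130.530964,
    'lgb_reg_base': 143.628133, 'lgb_reg_tuned': 129.303391,
}

def pct_change(before: float, after: float) -> float:
    """Relative change in percent; negative means the error went down."""
    return 100 * (after - before) / before

xg_gain = pct_change(rmse['xg_reg_base'], rmse['xg_reg_tuned'])
lgb_gain = pct_change(rmse['lgb_reg_base'], rmse['lgb_reg_tuned'])
print(f'XGBoost: {xg_gain:.1f}%  LightGBM: {lgb_gain:.1f}%')
```

LightGBM benefits far more from tuning than XGBoost, which is consistent with its base configuration being the weaker starting point.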

### CatBoost

In [None]:
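param_dist = {
    'iterations': [100, 200, 300, 400, 500, 600],
    'learning_rate': [0.01, 0.05, 0.1],
    'depth': [6, 8, 10],
    # L2 regularization; reg_lambda is an alias of l2_leaf_reg, so it is not listed twice,
    # and CatBoost has no L1 counterpart of reg_alpha
    'l2_leaf_reg': [1, 3, 5],
    'border_count': [32, 64, 128],
    'bagging_temperature': [0, 1, 5],
    # Feature subsampling; rsm is CatBoost's closest counterpart of colsample_bytree
    'rsm': [0.8, 1.0]
}

# bagging_temperature only applies to the Bayesian bootstrap
cat_model = CatBoostRegressor(verbose=0, random_seed=42, bootstrap_type='Bayesian')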

random_search = RandomizedSearchCV(
    estimator=cat_model,
    param_distributions=param_dist,
    n_iter=25,
    cv=2,
    scoring='neg_root_mean_squared_error',
    verbose=1,
    random_state=42
)

random_search.fit(X_train, y_train)

In [None]:
# Best model
cat_reg_tuned = random_search.best_estimator_
cat_best_params = random_search.best_params_

In [None]:
# Evaluate model
y_pred = cat_reg_tuned.predict(X_val)

metrics_df = add_reg_metrics(metrics_df, y_pred, y_val, 'cat_reg_tuned')
metrics_df

Among the tuned results shown above, XGBoost has the lowest MAE, while LightGBM is marginally ahead on RMSE and R-squared. XGBoost also has the fastest fitting time, so it is the model that will be used for the rest of the analysis.

## Residual Analysis

In [None]:
# Get predictions
best_model = xg_reg_tuned
y_pred = best_model.predict(X_val)

In [None]:
def plot_residuals(y_true, y_pred, model_name:str) -> None:
	fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(15, 7))

	# Predicted vs. actual values
	ax1.scatter(x=y_pred, y=y_true)
	ax1.set_title('Predicted vs. Actual values')
	ax1.set_xlabel('Predicted Delay (seconds)')
	ax1.set_ylabel('Actual Delay (seconds)')
	ax1.grid(True)

	# Residuals
	residuals = y_true - y_pred
	ax2.scatter(x=y_pred, y=residuals)
	ax2.set_title('Residual Plot')
	ax2.set_xlabel('Predicted Delay (seconds)')
	ax2.set_ylabel('Residuals (seconds)')
	ax2.axhline(0, linestyle='--', color='orange')
	ax2.grid(True)

	fig.suptitle('Residual Analysis', fontsize=18)
	fig.tight_layout()
	fig.savefig(f'../images/residual_analysis_{model_name}.png', bbox_inches='tight')
	plt.show()

In [None]:
# Plot residuals
plot_residuals(y_val, y_pred, 'xg_reg_tuned')

**Predicted vs. Actual Plot**

There's a dense cluster around 0 for both predicted and actual values, indicating that most delays and most predictions are centered near 0. However, there is substantial spread both above and below the diagonal, which shows that the model both underpredicts and overpredicts. There are also clear outliers far from the main cluster.


**Residual Plot**

The residuals show a visible funnel shape: their spread increases as the predicted delay increases. This is a sign of heteroscedasticity (the variance of the errors is not constant across predictions) and indicates a systematic weakness of the model for larger predicted delays.
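
A funnel shape can also be checked numerically rather than by eye. The sketch below is one possible approach, run on synthetic data that stands in for `y_pred` and the residuals: it bins the predictions into quartiles and compares the residual spread in each bin.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for y_pred and residuals: the noise grows with the prediction.
y_pred = rng.uniform(0, 300, size=5000)
residuals = rng.normal(0, 10 + 0.5 * y_pred)

# Split predictions into quartile bins and measure the residual spread in each.
edges = np.quantile(y_pred, [0, 0.25, 0.5, 0.75, 1.0])
bin_ids = np.digitize(y_pred, edges[1:-1])  # bin index 0..3 for each prediction
spread = np.array([residuals[bin_ids == b].std() for b in range(4)])
print(np.round(spread, 1))
```

A spread that rises steadily from the first bin to the last is the numerical counterpart of the funnel seen in the residual plot.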

## Feature Importance Plot

In [None]:
# Get top 5 most important features
importances = best_model.get_booster().get_score(importance_type='weight')
importances_df = pd.DataFrame.from_dict(importances, orient='index').rename(columns={0: 'importance'}).reset_index().rename(columns={'index': 'feature'})
importances_df.sort_values('importance', ascending=False).head()

In [None]:
# Plot the feature importance
ax = xgb.plot_importance(best_model, importance_type='weight')
ax.figure.tight_layout()
ax.figure.savefig('../images/feature_importances_xg_reg_tuned.png')
plt.show()

**Most Important Features:**
- `exp_trip_duration` This is the most important feature in the model. It seems like the expected trip duration is highly predictive of the actual delay. This makes sense as longer expected trips are more prone to disruptions and variations.
- `hist_avg_delay` Historical average delay is the second most important predictor. This aligns well with time series predictability since past delays often indicate patterns or bottlenecks that repeat over time.
- `stop_distance` Long gaps might lead to longer travel times or greater variability.
- `stop_location_group` This is also highly influential. Grouping the stops by location might be capturing specific problematic areas or geographic patterns that contribute to delays (e.g., heavy traffic zones, construction areas, major intersections).
- `temperature_2m`, `wind_direction_10m`, `wind_speed_10m` Weather conditions do play a role, but not as heavily as trip-related features. The fact that wind and temperature are relatively impactful suggests weather variability might affect delays more than just precipitation alone.

**Least Important Features:**
- `is_peak_hour` This is less impactful than expected. It suggests that perhaps peak hours are not as unpredictable as other features.
- `frequency_very_frequent` The bus frequency contributes only modestly to the prediction. More frequent buses might be less susceptible to delays since missed connections or unexpected traffic issues don't accumulate as much.
- `time_of_day_evening`, `time_of_day_morning`, `time_of_day_night` Evening seems to be a bit more influential than morning or night, which could indicate evening rush hour impacts.
- `wheelchair_boarding` Very low importance, indicating it has minimal influence on delays.
- `schedule_relationship_Scheduled` This has almost no impact, which might indicate that deviations from scheduled times are not systematically captured by the model.

## SHAP Plots

In [None]:
def shap_plot(shap_values, X_true, model_name:str, barplot:bool=True) -> None:
	if barplot:
		shap.summary_plot(shap_values, X_true, plot_type='bar', show=False)
		plt.title('SHAP Summary Barplot')
		plot_type = 'barplot' 
	else: # beeswarm
		shap.summary_plot(shap_values, X_true, show=False)
		plt.title('SHAP Summary Beeswarm Plot')
		plot_type = 'beeswarm_plot' 
	plt.tight_layout()
	plt.savefig(f'../images/shap_{plot_type}_{model_name}.png', bbox_inches='tight')
	plt.show()

In [None]:
def shap_single_pred(X_true, explainer, shap_values, model_name:str) -> None:
	random.seed(42)
	index = random.randrange(len(X_true))
	shap.force_plot(
		explainer.expected_value,
		shap_values[index, :],
		X_true.iloc[index, :],
		figsize=(15, 4),
		contribution_threshold=0.1,
		matplotlib=True,
		show=False)
	plt.tight_layout()
	plt.savefig(f'../images/shap_force_plot_{model_name}.png', bbox_inches='tight')
	plt.show()

In [None]:
# Initialize SHAP
sample_size = 10000 # sample validation set to prevent memory overload
X_val_sample = X_val.sample(n=sample_size, random_state=42) 
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_val_sample)

In [None]:
# Summary barplot
shap_plot(shap_values, X_val_sample, 'xg_reg_tuned', barplot=True)

**Comparison with XGBoost feature importances**

- `hist_avg_delay` is a much stronger predictor in the SHAP analysis than in XGBoost's internal measure. It's clearly impactful across different delays.
- `frequency_very_rare` is hidden in XGBoost's view, but SHAP recognizes it as influential, possibly due to rare scheduling patterns causing large delays.
- `stop_distance` is potentially overvalued by XGBoost. SHAP indicates it might not be as impactful.
- SHAP considers `time_of_day_evening` to be more important. SHAP probably picked up an interaction that XGBoost's split count does not reflect.

In [None]:
# Summary beeswarm plot
shap_plot(shap_values, X_val_sample, 'xg_reg_tuned', barplot=False)

**Interpretation**

- `hist_avg_delay` and `exp_trip_duration` are the top features:
	- High values of `hist_avg_delay` (in red) push predictions higher.
	- For `exp_trip_duration`, high values both increase and decrease the delay prediction, indicating complex interactions.
- Weather Variables:
	- `temperature_2m`, `wind_speed_10m`, and `wind_direction_10m` also affect predictions. For instance, higher wind speeds push predictions slightly upwards, which makes sense given that weather disturbances can slow down traffic.
- Time of Day:
	- Evening seems to affect delay more than morning or night, aligning with typical rush hour traffic.

**Insight:**

The high influence of `hist_avg_delay` confirms that delay is highly dependent on past performance. This could be useful for forecasting in specific segments or optimizing bus routes during peak times.

In [None]:
# Force plot a single prediction
shap_single_pred(X_val_sample, explainer, shap_values, 'xg_reg_tuned')

This plot is a breakdown of the specific prediction (`34.97`) for one instance. For clarity, only the features whose absolute SHAP value exceeds 10% of the total absolute contribution (`contribution_threshold=0.1`) are displayed.

- Features that increase the prediction (red):
	- `time_of_day_evening`: Being in the evening strongly increases the delay, especially at peak times.
	- `frequency_rare`, `frequency_very_rare`: If the bus service is rare, it also increases the expected delay.
- Features that decrease the prediction (blue):
	- `hist_avg_delay`: A value of `34.05` decreases the delay.
	- `stop_location_group`: A value of `5.0` (east of downtown Montreal) also reduces the delay, which is expected, depending on the time of day.
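
A force plot relies on the additive property of SHAP values: the prediction equals the explainer's expected value plus the sum of the per-feature SHAP values. The sketch below illustrates this with hypothetical numbers; they are not the values behind the plot above.

```python
import numpy as np

# Hypothetical numbers for one row; not taken from the notebook's model.
expected_value = 52.0  # stands in for explainer.expected_value (average model output)
feature_names = ['time_of_day_evening', 'frequency_rare', 'hist_avg_delay', 'stop_location_group']
shap_row = np.array([12.5, 4.0, -9.0, -24.5])  # one SHAP value per feature

prediction = expected_value + shap_row.sum()
pushes_up = [name for name, v in zip(feature_names, shap_row) if v > 0]    # red in the plot
pushes_down = [name for name, v in zip(feature_names, shap_row) if v < 0]  # blue in the plot
print(prediction, pushes_up, pushes_down)
```

Positive values are drawn in red and push the prediction above the expected value; negative values are drawn in blue and pull it below.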

## Feature Pruning

XGBoost's internal importance metrics are purely split-based (the plots above use `weight`, the number of times a feature is used to split) and do not measure how much each feature actually moves the predictions. SHAP, on the other hand, shows the actual contribution of each feature to the predictions, highlighting that some frequently used splits might be less impactful than XGBoost suggests. Therefore, the feature elimination will be based on the SHAP values.
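
The importance plots earlier in the notebook use `importance_type='weight'`, which counts splits. The toy sketch below, built on hypothetical split records rather than the fitted model, shows why a count-based ranking can disagree with a ranking based on average gain per split.

```python
from collections import defaultdict

# Hypothetical split records (feature, loss reduction); not taken from the model.
splits = [
    ('stop_distance', 2.0), ('stop_distance', 1.5), ('stop_distance', 1.0),
    ('stop_distance', 0.5), ('hist_avg_delay', 40.0), ('hist_avg_delay', 30.0),
]

weight = defaultdict(int)        # number of splits per feature
total_gain = defaultdict(float)  # summed loss reduction per feature
for feature, gain in splits:
    weight[feature] += 1
    total_gain[feature] += gain

avg_gain = {f: total_gain[f] / weight[f] for f in weight}
print(dict(weight))
print(avg_gain)
```

In this toy case `stop_distance` is used in more splits, yet each of its splits reduces the loss far less than a split on `hist_avg_delay`. Continuous features with many candidate thresholds tend to collect many small splits in this way.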

In [None]:
shap_abs_mean = np.abs(shap_values).mean(axis=0)
shap_df = pd.DataFrame({
  'feature': X.columns,
  'mean_abs_shap': shap_abs_mean
})
shap_df.head()

In [None]:
# Identify low impact features (below 1)
low_impact_features = shap_df[shap_df['mean_abs_shap'] < 1]

print('Features with low SHAP impact to remove:\n')
print(low_impact_features)

In [None]:
# Drop features
features_to_drop = low_impact_features['feature'].tolist()

X_pruned = X.drop(features_to_drop, axis=1)
X_train_pruned = X_train.drop(features_to_drop, axis=1)
X_val_pruned = X_val.drop(features_to_drop, axis=1)
X_test_pruned = X_test.drop(features_to_drop, axis=1)

## Retrain Model with Best Features

In [None]:
# Create new regression matrices
xg_train_data = xgb.DMatrix(X_train_pruned, y_train, enable_categorical=False)
xg_val_data = xgb.DMatrix(X_val_pruned, y_val, enable_categorical=False)
xg_test_data = xgb.DMatrix(X_test_pruned, y_test, enable_categorical=False)
xg_eval_set = [(xg_train_data, 'train'), (xg_val_data, 'validation')]
xg_test_set = [(xg_train_data, 'train'), (xg_test_data, 'test')]

In [None]:
# Retrain model
xg_reg_pruned = xgb.train(
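  params={
    'objective': 'reg:squarederror',
    'tree_method': 'hist',
    'max_depth': xg_best_params['max_depth'],
    'learning_rate': xg_best_params['learning_rate'],
    'subsample': xg_best_params['subsample'],
    'colsample_bytree': xg_best_params['colsample_bytree'],
    'alpha': xg_best_params['alpha'],   # L1 regularization found by the search
    'lambda': xg_best_params['lambda']  # L2 regularization found by the search
  },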
  dtrain=xg_train_data,
  num_boost_round=xg_best_params['n_estimators'],
  evals=xg_eval_set,
  verbose_eval=50,
  early_stopping_rounds=3
)

In [None]:
# Evaluate model
y_pred = xg_reg_pruned.predict(xg_val_data)

metrics_df = add_reg_metrics(metrics_df, y_pred, y_val, 'xg_reg_pruned')
metrics_df

The performance got worse with feature pruning. The best model of all is the tuned XGBoost model.

## Final Model

### Retrain Model

In [None]:
# Create regression matrices
xg_train_data = xgb.DMatrix(X_train, y_train, enable_categorical=False)
xg_val_data = xgb.DMatrix(X_val, y_val, enable_categorical=False)
xg_test_data = xgb.DMatrix(X_test, y_test, enable_categorical=False)
xg_eval_set = [(xg_train_data, 'train'), (xg_val_data, 'validation')]
xg_test_set = [(xg_train_data, 'train'), (xg_test_data, 'test')]

In [None]:
# Train final model with more boosting rounds
xg_reg_final = xgb.train(
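  params={
    'objective': 'reg:squarederror',
    'tree_method': 'hist',
    'max_depth': xg_best_params['max_depth'],
    'learning_rate': xg_best_params['learning_rate'],
    'subsample': xg_best_params['subsample'],
    'colsample_bytree': xg_best_params['colsample_bytree'],
    'alpha': xg_best_params['alpha'],   # L1 regularization found by the search
    'lambda': xg_best_params['lambda']  # L2 regularization found by the search
  },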
  dtrain=xg_train_data,
  num_boost_round=10000,
  evals=xg_eval_set,
  verbose_eval=50,
  early_stopping_rounds=50
)

### Evaluate with Test Set

In [None]:
final_model = xg_reg_final

In [None]:
# Evaluate model
y_pred = final_model.predict(xg_test_data)

metrics_df = add_reg_metrics(metrics_df, y_pred, y_test, 'xg_reg_final')
metrics_df

In [None]:
# Plot residuals
plot_residuals(y_test, y_pred, 'xg_reg_final')

### SHAP Plots

In [None]:
# Initialize SHAP
sample_size = 10000 # sample test set to prevent memory overload
X_test_sample = X_test.sample(n=sample_size, random_state=42) 
explainer = shap.TreeExplainer(final_model)
shap_values = explainer.shap_values(X_test_sample)

In [None]:
# Summary barplot
shap_plot(shap_values, X_test_sample, 'xg_reg_final', barplot=True)

In [None]:
# Summary beeswarm plot
shap_plot(shap_values, X_test_sample, 'xg_reg_final', barplot=False)

In [None]:
# Force plot a single prediction
shap_single_pred(X_test_sample, explainer, shap_values, 'xg_reg_final')

### Make Prediction

In [None]:
# Display features
best_features = X.columns.tolist()
best_features

In [None]:
# Create feature matrix
test_input = {
    'cloud_cover': [0],
    'exp_trip_duration': [3600],
    'frequency_normal': [1],
    'relative_humidity_2m': [60],
    'wind_direction_10m': [140],
    'precipitation': [0],
    'time_of_day_morning': [0],
    'hist_avg_delay': [300],
    'route_direction_South': [0],
    'wind_speed_10m': [10],
    'time_of_day_evening': [0],
    'stop_location_group': [2],
    'is_peak_hour': [1],
    'trip_phase_middle': [0],
    'frequency_very_rare': [0],
    'route_direction_North': [0],
    'route_direction_West': [1],
    'frequency_rare': [0],
    'temperature_2m': [24.3],
    'stop_distance': [400],
    'trip_phase_start': [0]
}

# Align columns with the training features; any feature not listed above is
# assumed to be 0 (an assumption made only for this illustrative input)
x_test = pd.DataFrame(test_input).reindex(columns=best_features, fill_value=0)

In [None]:
# Predict delay
# final_model is an xgb.Booster, which expects a DMatrix rather than a DataFrame
prediction = final_model.predict(xgb.DMatrix(x_test))
print(f'Predicted delay: {prediction[0]:.2f} seconds')

### Export Data

In [None]:
xg_best_params

In [None]:
# Save model, hyperparameters and predictors
joblib.dump(final_model, '../models/regression_model.pkl')
joblib.dump(xg_best_params, '../models/best_hyperparams.pkl')
joblib.dump(best_features, '../models/best_features.pkl')

## End