# STM Transit Delay Data Modeling

## Overview

This notebook compares tree-based machine learning models to find the one that predicts STM transit delays most accurately. The featured models are XGBoost, LightGBM and CatBoost, because they handle large datasets with mixed data types and high-cardinality features well.

## Data Description

`exp_trip_duration`: Expected duration of a trip, in seconds.<br>
`route_direction_North`, `route_direction_South`, `route_direction_West`: One-Hot features for route direction.<br>
`route_type_Night`, `route_type_HighFrequency` : One-Hot features for types of bus lines<br>
`frequency_frequent`, `frequency_normal`, `frequency_rare`, `frequency_very_frequent`, `frequency_very_rare`: One-Hot features for number of arrivals per hour.<br>
`stop_location_group`: Stop cluster based on coordinates.<br>
`stop_distance`: Distance between the previous and current stop, in meters.<br>
`trip_phase_middle`, `trip_phase_end`: One-Hot feature for trip progress.<br>
`exp_delay_prev_stop`: Expected duration between the previous and current stop, in seconds.<br>
`wheelchair_boarding`: Indicates if the stop is accessible to people using wheelchairs.<br>
`sch_rel_Scheduled`: One-Hot feature for schedule relationship.<br>
`time_of_day_evening`, `time_of_day_morning`, `time_of_day_night`: One-Hot features for time of day.<br>
`is_peak_hour`: Boolean value indicating if the scheduled arrival time is at peak hour.<br>
`temperature_2m`: Air temperature at 2 meters above ground, in Celsius.<br>
`relative_humidity_2m`: Relative humidity at 2 meters above ground, in percentage.<br>
`precipitation`: Total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters.<br>
`pressure_msl`: Atmospheric air pressure reduced to mean sea level (msl), in hPa.<br>
`cloud_cover`: Total cloud cover as an area fraction.<br>
`wind_speed_10m`: Wind speed at 10 meters above ground, in kilometers per hour.<br>
`wind_direction_10m`: Wind direction at 10 meters above ground.<br>
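
The One-Hot columns listed above are typically produced by expanding a categorical column into indicator columns. The sketch below is only an illustration: the raw column name `route_direction`, its values, and the choice to drop one category (`East`) as the baseline are assumptions, since the actual preprocessing happens outside this notebook and may differ.

```python
import pandas as pd

# Hypothetical raw values; the real preprocessing step is not part of this notebook
raw = pd.DataFrame({'route_direction': ['North', 'East', 'West', 'South', 'West']})

# drop_first=True removes the alphabetically first category (East) as the baseline
encoded = pd.get_dummies(
    raw['route_direction'],
    prefix='route_direction',
    drop_first=True,
    dtype=int
)
print(encoded)
```

Under these assumptions, a row with all three indicators at 0 stands for the dropped baseline category.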

## Imports

In [None]:
from catboost import CatBoostRegressor
import joblib
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import shap
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
import xgboost as xgb

In [None]:
# Load data
df = pd.read_parquet('../data/preprocessed.parquet')
print(f'Shape of dataset: {df.shape}')

## Split the data

In [None]:
# Separate features from target variable
X = df.drop('delay', axis=1)
y = df['delay']

The three models can run multiple iterations against a training set and a validation set. Because the validation set guides early stopping and model selection, a separate hold-out (test) set is kept to evaluate the final model.

In [None]:
# Train-validation-test split (60-20-20)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

del X_temp
del y_temp
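
As a quick sanity check on the two-step split, the sketch below repeats the same two calls on a toy array of 1000 rows (a made-up stand-in for the real data). Taking 40% first and then halving that remainder should yield a 60-20-20 split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real feature matrix and target
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.arange(1000)

# Step 1: keep 60% for training, set 40% aside
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X_toy, y_toy, test_size=0.4, random_state=42)
# Step 2: split the 40% remainder in half (20% validation, 20% test)
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

print(len(X_tr), len(X_va), len(X_te))  # 600 200 200
```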

**Scaling**

Since only tree-based models are explored in this project, scaling is not needed because the models are not sensitive to the absolute scale or distribution of the features.
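
The claim can be illustrated on synthetic data. The sketch below uses a single scikit-learn decision tree as a stand-in for the boosted models (an assumption made to keep the example small) and fits it once on raw features and once on linearly rescaled features. Because splits depend only on the ordering of feature values, both trees should produce the same predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(0, 5, size=200)

# Linear rescaling with a positive factor preserves the ordering of every feature
X_toy_scaled = (X_toy - 50.0) / 25.0

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_toy, y_toy)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_toy_scaled, y_toy)

# Compare predictions on the training rows in their respective scales
same_predictions = np.allclose(tree_raw.predict(X_toy), tree_scaled.predict(X_toy_scaled))
print(same_predictions)
```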

## Fit Base Models

All three libraries let you set a number of boosting rounds and an early-stopping patience. To start, every model runs for up to 100 rounds and stops early after 3 rounds without improvement on the validation set.
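
The early-stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration, not the exact logic of XGBoost, LightGBM or CatBoost; the function name `early_stopping` and the loss values are made up for the example.

```python
def early_stopping(val_losses, stopping_rounds):
    """Return (best_round, last_round_run) for a sequence of validation losses."""
    best_loss = float('inf')
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember where it happened
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= stopping_rounds:
            # No improvement for `stopping_rounds` rounds: stop training
            return best_round, round_idx
    return best_round, len(val_losses) - 1

# Made-up validation losses: improvement stalls after round 2
toy_losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.8, 3.0]
best_round, last_round = early_stopping(toy_losses, stopping_rounds=3)
print(best_round, last_round)  # 2 5
```

In this toy run the loop stops at round 5 and never reaches the lower loss at round 6, which shows how a small patience can end training prematurely on a noisy validation curve.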

In [None]:
# Create dataframe to track metrics
metrics_df = pd.DataFrame(columns=['model', 'MAE', 'RMSE', 'R²'])

In [None]:
def add_reg_metrics(metrics_df:pd.DataFrame, y_pred:pd.Series, y_true:pd.Series, model_name:str) -> pd.DataFrame:
	mae = mean_absolute_error(y_true, y_pred)
	rmse = root_mean_squared_error(y_true, y_pred)
	r2 = r2_score(y_true, y_pred)

	metrics_df.loc[len(metrics_df)] = [model_name, mae, rmse, r2]
	return metrics_df

### XGBoost

In [None]:
# Create regression matrices
xg_train_data = xgb.DMatrix(X_train, y_train, enable_categorical=False)
xg_val_data = xgb.DMatrix(X_val, y_val, enable_categorical=False)
xg_test_data = xgb.DMatrix(X_test, y_test, enable_categorical=False)
xg_eval_set = [(xg_train_data, 'train'), (xg_val_data, 'validation')]
xg_test_set = [(xg_train_data, 'train'), (xg_test_data, 'test')]

In [None]:
# Train model
xg_reg_base = xgb.train(
  params= {'objective': 'reg:squarederror', 'tree_method': 'hist'},
  dtrain=xg_train_data,
  num_boost_round=100,
  evals=xg_eval_set,
  verbose_eval=10,
  early_stopping_rounds=3
)

In [None]:
# Evaluate model
y_pred = xg_reg_base.predict(xg_val_data)

metrics_df = add_reg_metrics(metrics_df, y_pred, y_val, 'xg_reg_base')
metrics_df

**MAE**<br>
On average, the predictions are off by 70 seconds, which is reasonable, knowing that [STM](https://www.stm.info/en/info/networks/bus-network-and-schedules-enlightened) considers a bus arriving 3 minutes after the planned schedule as being on time.

**RMSE**<br>
The higher RMSE compared to MAE suggests that there are some significant prediction errors that influence the overall error metric.

**R²**<br>
The model explains 24.35% of the variance, which is low but understandable given how random transit delays can be (bad weather, vehicle breakdowns, accidents, etc.).
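
To make the relationship between the three metrics concrete, the sketch below computes them by hand on made-up numbers (not the notebook's data). Four predictions are off by 10 seconds and one is off by 200 seconds; the single large error pulls the RMSE well above the MAE.

```python
import numpy as np

# Made-up delays in seconds, for illustration only
y_true_toy = np.array([0.0, 60.0, 120.0, 180.0, 240.0])
errors_toy = np.array([10.0, -10.0, 10.0, -10.0, 200.0])
y_pred_toy = y_true_toy + errors_toy

mae = np.mean(np.abs(y_pred_toy - y_true_toy))           # average absolute error
rmse = np.sqrt(np.mean((y_pred_toy - y_true_toy) ** 2))  # squaring penalizes large errors

ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)          # unexplained variation
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)   # total variation
r2 = 1 - ss_res / ss_tot

print(f'MAE: {mae:.2f} | RMSE: {rmse:.2f} | R²: {r2:.4f}')
```

Here the MAE is 48 seconds while the RMSE is close to 90 seconds, and the R² is negative because the one outlier makes the predictions worse than simply predicting the mean.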

### LightGBM

In [None]:
# Create regression datasets
lgb_train_data = lgb.Dataset(X_train, label=y_train)
lgb_val_data = lgb.Dataset(X_val, label=y_val, reference=lgb_train_data)
lgb_test_data = lgb.Dataset(X_test, label=y_test, reference=lgb_train_data)

In [None]:
# Train model
lgb_reg_base = lgb.train(
    params={
        'objective': 'regression',
        'metric': 'rmse',
        'learning_rate': 0.05,
        'max_depth': -1
    },
    train_set=lgb_train_data,
    valid_sets=[lgb_val_data],
    num_boost_round=100,
    callbacks=[lgb.early_stopping(stopping_rounds=3)]
)

In [None]:
# Evaluate model
y_pred = lgb_reg_base.predict(X_val)

metrics_df = add_reg_metrics(metrics_df, y_pred, y_val, 'lgb_reg_base')
metrics_df

The LightGBM model performs worse than XGBoost, especially in terms of R-squared.

### CatBoost

In [None]:
# Fit model
cat_reg_base = CatBoostRegressor(
    iterations=100,
    learning_rate=0.05,
    depth=10,
    random_seed=42,
    verbose=10
)

cat_reg_base.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=3)

In [None]:
# Evaluate model
y_pred = cat_reg_base.predict(X_val)

metrics_df = add_reg_metrics(metrics_df, y_pred, y_val, 'cat_reg_base')
metrics_df

CatBoost performs almost like LightGBM. So far, XGBoost seems to capture a bit more of the underlying patterns than the two other models.

## Hyperparameter Tuning

### XGBoost

In [None]:
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500, 600],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_dist,
    n_iter=50,
    scoring='neg_root_mean_squared_error',
    cv=2,
    verbose=1,
    random_state=42
)

random_search.fit(X_train, y_train)
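
To put `n_iter=50` in perspective, the sketch below counts how many combinations the XGBoost grid above contains and how many model fits the randomized search performs. It only uses scikit-learn helpers and does not train anything.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as the XGBoost cell above
xgb_space = {
    'n_estimators': [100, 200, 300, 400, 500, 600],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

n_total = len(ParameterGrid(xgb_space))  # every possible combination
sampled = list(ParameterSampler(xgb_space, n_iter=50, random_state=42))
n_fits = len(sampled) * 2                # each candidate is fitted once per fold (cv=2)

print(f'{len(sampled)} of {n_total} combinations sampled, {n_fits} fits in total')
```

The search therefore covers roughly a tenth of the grid, so the reported best parameters are the best among the sampled candidates rather than a guaranteed optimum.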

In [None]:
# Best model
xg_reg_tuned = random_search.best_estimator_
xg_best_params = random_search.best_params_

In [None]:
# Evaluate model
y_pred = xg_reg_tuned.predict(X_val)

metrics_df = add_reg_metrics(metrics_df, y_pred, y_val, 'xg_reg_tuned')
metrics_df

There's a significant improvement from the base XGBoost model and it's the best performing model so far.

### LightGBM

In [None]:
param_dist = {
  'n_estimators': [100, 200, 300, 400, 500, 600],
  'learning_rate': [0.01, 0.05, 0.1],
  'max_depth': [5, 10, 15],
  'num_leaves': [20, 31, 40],
  'min_child_samples': [10, 20, 30],
  'subsample': [0.8, 1.0],
  'colsample_bytree': [0.8, 1.0]
}

lgb_model = lgb.LGBMRegressor(random_state=42)

random_search = RandomizedSearchCV(
    estimator=lgb_model,
    param_distributions=param_dist,
    n_iter=50,
    cv=2, 
    scoring='neg_root_mean_squared_error',
    verbose=1,
    random_state=42
)

random_search.fit(X_train, y_train)

In [None]:
# Best model
lgb_reg_tuned = random_search.best_estimator_
lgb_best_params = random_search.best_params_

In [None]:
# Evaluate model
y_pred = lgb_reg_tuned.predict(X_val)

metrics_df = add_reg_metrics(metrics_df, y_pred, y_val, 'lgb_reg_tuned')
metrics_df

The performance is very similar to the previous tuned model. The MAE is slightly worse but the RMSE and the R-squared are slightly better.

### CatBoost

In [None]:
param_dist = {
  'iterations': [100, 200, 300, 400, 500, 600], # same key that is read back from cat_best_params later
  'learning_rate': [0.01, 0.05, 0.1],
  'depth': [6, 8, 10],
  'l2_leaf_reg': [1, 3, 5],
  'border_count': [32, 64, 128],
  'bagging_temperature': [0, 1, 5],
}

cat_model = CatBoostRegressor(verbose=0, random_seed=42)

random_search = RandomizedSearchCV(
    estimator=cat_model,
    param_distributions=param_dist,
    n_iter=50,
    cv=2,
    scoring='neg_root_mean_squared_error',
    verbose=1,
    random_state=42
)

random_search.fit(X_train, y_train)

In [None]:
# Best model
cat_reg_tuned = random_search.best_estimator_
cat_best_params = random_search.best_params_

In [None]:
# Evaluate model
y_pred = cat_reg_tuned.predict(X_val)

metrics_df = add_reg_metrics(metrics_df, y_pred, y_val, 'cat_reg_tuned')
metrics_df

After tuning, CatBoost is the best model. This is the model that will be used for the rest of the analysis.

## Residual Analysis

In [None]:
# Get predictions
best_model = cat_reg_tuned
y_pred = best_model.predict(X_val)

In [None]:
def plot_residuals(y_true, y_pred, model_name:str) -> None:
	fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(15, 7))

	# Predicted vs. actual values
	ax1.scatter(x=y_pred, y=y_true)
	ax1.set_title('Predicted vs. Actual values')
	ax1.set_xlabel('Predicted Delay (seconds)')
	ax1.set_ylabel('Actual Delay (seconds)')
	ax1.grid(True)

	# Residuals
	residuals = y_true - y_pred
	ax2.scatter(x=y_pred, y=residuals)
	ax2.set_title('Residual Plot')
	ax2.set_xlabel('Predicted Delay (seconds)')
	ax2.set_ylabel('Residuals (seconds)')
	ax2.axhline(0, linestyle='--', color='orange')
	ax2.grid(True)

	fig.suptitle('Residual Analysis', fontsize=18)
	fig.tight_layout()
	fig.savefig(f'../images/residual_analysis_{model_name}.png', bbox_inches='tight')
	plt.show()

In [None]:
# Plot residuals
plot_residuals(y_val, y_pred, 'cat_reg_tuned')

**Predicted vs. Actual Plot**

There's a dense cluster around 0 for both predicted and actual values, indicating that many predictions and actual delays are centered near 0. However, there is substantial spread both above and below the diagonal line, which suggests both underprediction and overprediction. There are clear outliers that are far from the main cluster.


**Residual Plot**

The residuals show a visible funnel shape, which indicates a systematic error in prediction. The spread of residuals increases as the predicted delay increases. This is a sign of heteroscedasticity (the variance of errors is not constant across all predictions).
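
A funnel shape can also be checked numerically by binning the predictions and comparing the residual spread per bin. The sketch below does this on synthetic data whose noise grows with the predicted value; the data and the variable names are made up for illustration and do not come from the model above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic predictions and residuals whose spread grows with the prediction
y_pred_toy = rng.uniform(0, 300, size=5000)
residuals_toy = rng.normal(0, 1, size=5000) * (5 + 0.5 * y_pred_toy)

toy = pd.DataFrame({'y_pred': y_pred_toy, 'residual': residuals_toy})
toy['pred_bin'] = pd.qcut(toy['y_pred'], q=4)  # four equally populated bins

# Standard deviation of residuals per prediction bin
spread = toy.groupby('pred_bin', observed=True)['residual'].std()
print(spread)
```

A spread that rises steadily from the lowest to the highest bin is the tabular counterpart of the funnel seen in a residual plot.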

## Feature Importance Plot

In [None]:
def plot_feat_importance(feature_importances, model_name:str) -> None:
	plt.figure(figsize=(10, 6))
	plt.barh(feature_importances['Feature Id'], feature_importances['Importances'])
	plt.gca().invert_yaxis()
	plt.title('Feature Importance')
	plt.xlabel('Importance')
	plt.tight_layout()
	plt.savefig(f'../images/feature_importances_{model_name}.png', bbox_inches='tight')
	plt.show()

In [None]:
# Get sorted feature importances
feature_importances = best_model.get_feature_importance(prettified=True)
feature_importances = feature_importances.sort_values(by='Importances', ascending=False)
feature_importances.head()

In [None]:
# Plot the feature importance
plot_feat_importance(feature_importances, 'cat_reg_tuned')

**Most Important Features:**
- `exp_trip_duration`: This is the most important feature in the model. The expected trip duration appears highly predictive of the actual delay, which makes sense because longer expected trips are more prone to disruptions and variations.
- `hist_avg_delay`: Historical average delay is the second most important predictor. This aligns well with time series predictability since past delays often indicate patterns or bottlenecks that repeat over time.
- `stop_location_group`: This is also highly influential. Grouping the stops by location might be capturing specific problematic areas or geographic patterns that contribute to delays (e.g., heavy traffic zones, construction areas, major intersections).
- `temperature_2m`, `wind_direction_10m`: Weather conditions do play a role, but not as heavily as trip-related features. The fact that wind direction and temperature are relatively impactful suggests that weather variability as a whole might affect delays more than precipitation alone.

**Least Important Features:**
- `is_peak_hour`: This is less impactful than expected. It suggests that perhaps delays at peak hours are not as unpredictable as assumed.
- `time_of_day_morning`, `time_of_day_night`: Evening seems to be more influential than morning or night, which could indicate evening rush hour impacts.
- `frequency_very_frequent`: The bus frequency is contributing to the prediction. More frequent buses might be less susceptible to delays since missed connections or unexpected traffic issues don't accumulate as much.
- `wheelchair_boarding`: Very low importance, indicating it has minimal influence on delays.
- `schedule_relationship_Scheduled`: This has almost no impact, which might indicate that deviations from scheduled times are not systematically captured by the model.

## SHAP Plots

In [None]:
def shap_plot(shap_values, X_true, model_name:str, barplot:bool=True) -> None:
	if barplot:
		shap.summary_plot(shap_values, X_true, plot_type='bar', show=False)
		plt.title('SHAP Summary Barplot')
		plot_type = 'barplot' 
	else: # beeswarm
		shap.summary_plot(shap_values, X_true, show=False)
		plt.title('SHAP Summary Beeswarm Plot')
		plot_type = 'beeswarm_plot' 
	plt.tight_layout()
	plt.savefig(f'../images/shap_{plot_type}_{model_name}.png', bbox_inches='tight')
	plt.show()

In [None]:
def shap_single_pred(X_true, explainer, shap_values, model_name:str) -> None:
	index = random.randrange(len(X_true))
	shap.force_plot(
		explainer.expected_value,
		shap_values[index, :],
		X_true.iloc[index, :],
		figsize=(15, 4),
		contribution_threshold=0.075,
		matplotlib=True,
		show=False)
	plt.tight_layout()
	plt.savefig(f'../images/shap_force_plot_{model_name}.png', bbox_inches='tight')
	plt.show()

In [None]:
# Initialize SHAP
sample_size = 250 # sample validation set to prevent memory overload
X_val_sample = X_val.sample(n=sample_size, random_state=42) 
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_val_sample)

In [None]:
# Summary barplot
shap_plot(shap_values, X_val_sample, 'cat_reg_tuned', barplot=True)

- The ranking matches the feature importances plot:
	- `hist_avg_delay` and `exp_trip_duration` dominate the list, with a substantial gap from the rest.
	- `frequency_very_rare` and `stop_location_group` are also critical.
	- Weather (`temperature_2m`, `wind_speed_10m`) plays a moderate role.

**Insight:**

The impact of weather is fairly low, probably because the data was collected in spring, when extreme weather conditions are uncommon.
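
The ranking in a SHAP summary barplot is based on the mean absolute SHAP value of each feature. The sketch below reproduces that computation on a small made-up SHAP matrix (the numbers are illustrative only and unrelated to the model above).

```python
import numpy as np
import pandas as pd

# Made-up SHAP matrix: 4 predictions (rows) x 3 features (columns)
feature_names = ['hist_avg_delay', 'exp_trip_duration', 'temperature_2m']
toy_shap_values = np.array([
    [ 30.0, -20.0,  2.0],
    [-10.0,  25.0, -1.0],
    [ 40.0, -15.0,  3.0],
    [-20.0,  10.0, -2.0],
])

# Mean absolute contribution per feature, sorted like the barplot
mean_abs_shap = pd.Series(np.abs(toy_shap_values).mean(axis=0), index=feature_names)
ranking = mean_abs_shap.sort_values(ascending=False)
print(ranking)
```

Taking absolute values first matters: a feature that pushes predictions strongly up for some rows and strongly down for others still ranks high, even though its signed contributions would partly cancel out.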

In [None]:
# Summary beeswarm plot
shap_plot(shap_values, X_val_sample, 'cat_reg_tuned', barplot=False)

The beeswarm plot offers a global view of all predictions and their feature influences.

- `hist_avg_delay` and `exp_trip_duration` are the top features:
	- High values of `hist_avg_delay` (in red) push predictions higher.
	- For `exp_trip_duration`, high values both increase and decrease the delay prediction, indicating complex interactions.
- Color Representation:
	- Red = High feature value, Blue = Low feature value.
	- For example, when `route_direction_West` is true (red), it pushes the prediction up. When it is false (blue), it has little to no effect.
- Weather Variables:
	- `temperature_2m`, `wind_speed_10m`, and `wind_direction_10m` also affect predictions. For instance, higher wind speeds push predictions slightly upwards, which makes sense given that weather disturbances can slow down traffic.
- Time of Day:
	- Evening seems to affect delay more than morning or night, aligning with typical rush hour traffic.

**Insight:**

The high influence of `hist_avg_delay` confirms that delay is highly dependent on past performance. This could be useful for forecasting in specific segments or optimizing bus routes during peak times.

In [None]:
# Force plot a single prediction
shap_single_pred(X_val_sample, explainer, shap_values, 'cat_reg_tuned')

This plot is a breakdown of the specific prediction (`68.23`) for one instance.

- Features that increase the prediction (red):
	- `route_direction_West`: When the bus is going West, it strongly increases the delay.
	- `frequency_very_rare`: If the bus service is rare, it also increases the expected delay.
	- `hist_avg_delay`: A large historical average delay of `100.59` also pushes the prediction up significantly.
- Features that decrease the prediction (blue):
	- `stop_location_group`: A value of `1.0` (extreme West of Montreal) for the stop group reduces the delay.
	- `exp_trip_duration`: An expected trip duration of `3960.0` seconds also reduces the delay, which is a bit surprising. This might indicate that long trips in this grouping tend to be managed better.<br><br>

**Insight:**

The model predicts more delay (`68.23`) when:
- The bus heads West.
- It is part of a rare frequency group.
- Historical delays have been high.
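
A force plot relies on the additive property of SHAP values: the base value (the explainer's expected value) plus the contributions of all features equals the prediction for that row. The sketch below shows the arithmetic with made-up numbers; they are not the values from the plot above.

```python
# Made-up base value and per-feature contributions, in seconds
base_value = 40.0
contributions = {
    'route_direction_West': 12.0,   # pushes the prediction up
    'frequency_very_rare': 8.0,     # pushes the prediction up
    'hist_avg_delay': 15.0,         # pushes the prediction up
    'stop_location_group': -5.0,    # pulls the prediction down
    'exp_trip_duration': -10.0,     # pulls the prediction down
}

# Prediction = base value + sum of all feature contributions
toy_prediction = base_value + sum(contributions.values())
print(f'Prediction: {toy_prediction:.2f} seconds')
```

Reading a force plot this way makes it easier to see which features account for the gap between the average prediction and the prediction for one specific trip.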

## Feature Elimination

In [None]:
# Recursive feature elimination with a patience of 2

X_current = X.copy()
X_train_current = X_train.copy()
X_val_current = X_val.copy()
X_test_current = X_test.copy()

best_rmse = float(metrics_df['RMSE'].min())
best_features = X_current.columns.tolist()
tracking = [] # To store number of features and RMSE
patience = 2
patience_counter = 0

keep_going = True

while keep_going and len(X_current.columns) > 10: # Keep at least 10 features
	cat_model = CatBoostRegressor(
		iterations=cat_best_params['iterations'],
		learning_rate=cat_best_params['learning_rate'],
		depth=cat_best_params['depth'],
		l2_leaf_reg=cat_best_params['l2_leaf_reg'],
		border_count=cat_best_params['border_count'],
		bagging_temperature=cat_best_params['bagging_temperature'],
		random_seed=42,
		verbose=10
	)

	cat_model.fit(X_train_current, y_train, eval_set=(X_val_current, y_val), early_stopping_rounds=3)

	# Predict and calculate RMSE
	y_pred = cat_model.predict(X_val_current)
	rmse = root_mean_squared_error(y_val, y_pred)
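	tracking.append({
		'nb_features': len(X_current.columns),
		'RMSE': rmse # RMSE of this iteration, not the running best
	})

	# Feature importance, indexed by feature name so idxmin() returns the weakest column
	importances = pd.Series(cat_model.get_feature_importance(), index=X_train_current.columns)
	weakest_feature = importances.idxmin()
	weakest_score = importances.min()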

	if rmse <= best_rmse:
		best_rmse = rmse
		best_features = X_current.columns.tolist()

		# Drop the weakest feature
		print(f'RMSE: {rmse:.4f} | Dropping: {weakest_feature} (importance {importances.min():.6f})')
		X_current = X_current.drop(columns=[weakest_feature])
		X_train_current = X_train_current.drop(columns=[weakest_feature])
		X_val_current = X_val_current.drop(columns=[weakest_feature])
		X_test_current = X_test_current.drop(columns=[weakest_feature])

		patience_counter = 0 # Reset patience if RMSE improves
	
	else:
		patience_counter += 1
		print(f'Patience counter: {patience_counter}/{patience}')
	
		if patience_counter >= patience:
			print('Performance worsened. Stopping feature elimination.')
			keep_going = False
		else:
			# Patience not yet exhausted: still drop the weakest feature and continue
			X_current = X_current.drop(columns=[weakest_feature])
			X_train_current = X_train_current.drop(columns=[weakest_feature])
			X_val_current = X_val_current.drop(columns=[weakest_feature])
			X_test_current = X_test_current.drop(columns=[weakest_feature])


print('\nBest set of features found:\n')
print(best_features)
print(f'Final validation RMSE: {best_rmse:.4f}')

In [None]:
# Plot RMSE vs. number of features
tracking_df = pd.DataFrame(tracking)

plt.figure(figsize=(10,6))
plt.plot(tracking_df['nb_features'], tracking_df['RMSE'], marker='o')
plt.gca().invert_xaxis()  # More features on the left
plt.xlabel('Number of Features')
plt.ylabel('Validation RMSE')
plt.title('Recursive Feature Elimination Progress')
plt.grid(True)
plt.tight_layout()
plt.savefig('../images/rfe_rmse_tracking.png', bbox_inches='tight')
plt.show()

In [None]:
# Remove columns from input matrices
X_pruned = X[best_features]
X_train_pruned = X_train[best_features]
X_val_pruned = X_val[best_features]
X_test_pruned = X_test[best_features]

## Retrain Model with Best Features

In [None]:
# Retrain Model
cat_reg_pruned = CatBoostRegressor(
    iterations=10000,
    learning_rate=cat_best_params['learning_rate'],
    depth=cat_best_params['depth'],
    l2_leaf_reg=cat_best_params['l2_leaf_reg'],
    border_count=cat_best_params['border_count'],
    bagging_temperature=cat_best_params['bagging_temperature'],
    random_seed=42,
    verbose=50
)

# Fit the newly created pruned model (not the previously tuned one)
cat_reg_pruned.fit(X_train_pruned, y_train, eval_set=(X_val_pruned, y_val), early_stopping_rounds=50)

In [None]:
# Evaluate model
y_pred = cat_reg_pruned.predict(X_val_pruned)

metrics_df = add_reg_metrics(metrics_df, y_pred, y_val, 'cat_reg_pruned')
metrics_df

The model metrics with fewer features are similar, which means eliminating features didn't worsen the performance too much.

## Retune Parameters

In [None]:
param_dist = {
  'iterations': [50, 100],
  'learning_rate': [0.01, 0.05, 0.1],
  'depth': [6, 8, 10],
  'l2_leaf_reg': [1, 3, 5],
  'border_count': [32, 64, 128],
  'bagging_temperature': [0, 1, 5],
}

cat_model = CatBoostRegressor(verbose=0, random_seed=42)

random_search = RandomizedSearchCV(
    estimator=cat_model,
    param_distributions=param_dist,
    n_iter=50,
    cv=2,
    scoring='neg_root_mean_squared_error',
    verbose=1,
    random_state=42
)

random_search.fit(X_train_pruned, y_train)

In [None]:
# Best model
cat_pruned_best_model = random_search.best_estimator_
cat_pruned_best_params = random_search.best_params_

In [None]:
# Retrain final model with more iterations
cat_reg_final = CatBoostRegressor(
    iterations=10000,
    learning_rate=cat_pruned_best_params['learning_rate'],
    depth=cat_pruned_best_params['depth'],
    l2_leaf_reg=cat_pruned_best_params['l2_leaf_reg'],
    border_count=cat_pruned_best_params['border_count'],
    bagging_temperature=cat_pruned_best_params['bagging_temperature'],
    random_seed=42,
    verbose=50
)

cat_reg_final.fit(X_train_pruned, y_train, eval_set=(X_val_pruned, y_val), early_stopping_rounds=50)

## Final Model

### Evaluate with Test Set

In [None]:
final_model = cat_reg_final

In [None]:
# Evaluate model
y_pred = final_model.predict(X_test_pruned)

metrics_df = add_reg_metrics(metrics_df, y_pred, y_test, 'cat_reg_final')
metrics_df

In [None]:
# Plot residuals
plot_residuals(y_test, y_pred, 'cat_reg_final')

### Feature importances

In [None]:
# Get top 5 most important features
feature_importances = final_model.get_feature_importance(prettified=True)
feature_importances = feature_importances.sort_values(by='Importances', ascending=False)
feature_importances.head()

In [None]:
# Plot the feature importance
plot_feat_importance(feature_importances, 'cat_reg_final')

### SHAP Plots

In [None]:
# Initialize SHAP
sample_size = 250 # sample test set to prevent memory overload
X_test_sample = X_test_pruned.sample(n=sample_size, random_state=42) 
explainer = shap.TreeExplainer(final_model)
shap_values = explainer.shap_values(X_test_sample)

In [None]:
# Summary barplot
shap_plot(shap_values, X_test_sample, 'cat_reg_final', barplot=True)

In [None]:
# Summary beeswarm plot
shap_plot(shap_values, X_test_sample, 'cat_reg_final', barplot=False)

In [None]:
# Force plot a single prediction
shap_single_pred(X_test_sample, explainer, shap_values, 'cat_reg_final')

### Make Prediction

In [None]:
# Load stop coordinates scaler
scaler_coords = joblib.load('../models/scaler_coords.pkl')

In [None]:
# Display features
best_features = X_test_pruned.columns.tolist()
best_features

In [None]:
# Create feature matrix
test_input = {
	'exp_trip_duration': [3600],
	'relative_humidity_2m': [60],
	'wind_direction_10m': [140],
	'precipitation': [0],
	'time_of_day_morning': [0],
	'hist_avg_delay': [300],
	'route_direction_South': [0],
	'wind_speed_10m': [10],
	'frequency_normal': [1],
	'time_of_day_evening': [0],
	'stop_location_group': [2],
	'is_peak_hour': [1],
	'trip_phase_middle': [0],
	'frequency_very_rare': [0],
	'route_direction_North': [0],
	'route_direction_West': [1],
	'frequency_rare': [0],
	'temperature_2m': [24.3],
	'stop_distance': [400],
	'cloud_cover': [0],
	'trip_phase_start': [0]
}

x_test = pd.DataFrame(test_input)

In [None]:
# Predict delay
prediction = final_model.predict(x_test)
print(f'Predicted delay: {prediction[0]:.2f} seconds')
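
When building a single-row input by hand, the columns should match the training features in both name and order. The sketch below shows one way to enforce this; the helper `align_features` and the three-feature list are hypothetical and only serve as an illustration.

```python
import pandas as pd

def align_features(row: dict, feature_names: list) -> pd.DataFrame:
    """Build a one-row DataFrame with columns in the order used for training."""
    missing = [name for name in feature_names if name not in row]
    if missing:
        # Fail early instead of silently predicting on incomplete input
        raise KeyError(f'Missing features: {missing}')
    return pd.DataFrame([row])[feature_names]

# Hypothetical feature list and input values, for illustration only
expected_features = ['exp_trip_duration', 'hist_avg_delay', 'is_peak_hour']
toy_row = {'is_peak_hour': 1, 'exp_trip_duration': 3600, 'hist_avg_delay': 300}

aligned = align_features(toy_row, expected_features)
print(aligned)
```

With the notebook's objects, the same idea would amount to selecting `best_features` from the input DataFrame before calling `predict`.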

### Export Data

In [None]:
# Save model, hyperparameters and predictors
joblib.dump(final_model, '../models/regression_model.pkl')
joblib.dump(cat_pruned_best_params, '../models/best_hyperparams.pkl')
joblib.dump(best_features, '../models/best_features.pkl')

## End