# STM Transit Delay Data Modeling

## Overview

This notebook explores tree-based machine learning models to find the one that predicts STM transit delays most accurately. The featured models are XGBoost, LightGBM and CatBoost, because they handle large datasets with mixed feature types and high-cardinality features well.

## Data Description

`exp_trip_duration`: Expected duration of a trip, in seconds.<br>
`route_direction_North`, `route_direction_South`, `route_direction_West`: One-Hot features for route direction.<br>
`route_type_Night`, `route_type_HighFrequency`: One-Hot features for types of bus lines.<br>
`frequency_frequent`, `frequency_normal`, `frequency_rare`, `frequency_very_frequent`, `frequency_very_rare`: One-Hot features for number of arrivals per hour.<br>
`stop_location_group`: Stop cluster based on coordinates.<br>
`stop_distance`: Distance between the previous and current stop, in meters.<br>
`trip_phase_middle`, `trip_phase_end`: One-Hot feature for trip progress.<br>
`exp_delay_prev_stop`: Expected duration between the previous and current stop, in seconds.<br>
`wheelchair_boarding`: Indicates if the stop is accessible to people using a wheelchair.<br>
`sch_rel_Scheduled`: One-Hot feature for schedule relationship.<br>
`time_of_day_evening`, `time_of_day_morning`, `time_of_day_night`: One-Hot features for time of day.<br>
`is_peak_hour`: Boolean value indicating if the scheduled arrival time is at peak hour.<br>
`temperature_2m`: Air temperature at 2 meters above ground, in Celsius.<br>
`relative_humidity_2m`: Relative humidity at 2 meters above ground, in percentage.<br>
`precipitation`: Total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters.<br>
`pressure_msl`: Atmospheric air pressure reduced to mean sea level (msl), in hPa.<br>
`cloud_cover`: Total cloud cover as an area fraction.<br>
`windspeed_10m`: Wind speed at 10 meters above ground, in kilometers per hour.<br>
`wind_direction_10m`: Wind direction at 10 meters above ground.<br>
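
Several of the features above are one-hot encoded, and some groups appear to omit one category (for example, there is no `route_direction_East` column). A minimal sketch of one-hot encoding with a dropped baseline category follows. The category list and the choice of `East` as the baseline are assumptions made for illustration; they are not taken from the preprocessing code.

```python
def one_hot(value, categories, baseline):
    """Encode `value` as one-hot columns, omitting the baseline category."""
    if value not in categories:
        raise ValueError(f'Unknown category: {value}')
    return {
        f'route_direction_{cat}': int(value == cat)
        for cat in categories
        if cat != baseline
    }

# Hypothetical category list; 'East' is assumed to be the dropped baseline
directions = ['East', 'North', 'South', 'West']

north = one_hot('North', directions, baseline='East')
east = one_hot('East', directions, baseline='East')
print(north)
print(east)
```

Under this assumption, a row with all three direction columns set to 0 represents the baseline category.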

## Imports

In [1]:
from catboost import CatBoostRegressor
import joblib
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import shap
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
import xgboost as xgb

In [2]:
# Load data
df = pd.read_parquet('../data/preprocessed.parquet')
print(f'Shape of dataset: {df.shape}')

Shape of dataset: (1500000, 26)


## Split the data

In [3]:
# Separate features from target variable
X = df.drop('delay', axis=1)
y = df['delay']

All three models can be trained iteratively while monitoring a validation set. A separate hold-out (test) set is therefore kept aside to evaluate the final model.

In [4]:
# Train-validation-test split (60-20-20)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

del X_temp
del y_temp
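
The two-step split above yields a 60-20-20 partition because the second call halves the 40% remainder. A quick arithmetic check, using the row count printed when the data was loaded (1,500,000):

```python
n_rows = 1_500_000  # from the dataset shape printed above

# First split: 40% of the rows go to the temporary set
n_temp = n_rows * 40 // 100
n_train = n_rows - n_temp

# Second split: the temporary set is halved into validation and test
n_test = n_temp // 2
n_val = n_temp - n_test

print(n_train, n_val, n_test)  # 900000 300000 300000
```

The training size of 900,000 rows agrees with the number of data points LightGBM reports in its training log.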

**Scaling**

Only tree-based models are explored in this project, so scaling is not needed: tree splits depend on the ordering of feature values, not on their absolute scale.
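
The claim above can be illustrated with a single regression split: a tree picks the threshold that minimizes the squared error of the two resulting groups, and that grouping depends only on the order of the feature values. The sketch below uses a hypothetical `best_split` helper and toy data (neither is part of this notebook) to show that min-max scaling a feature leaves the chosen partition unchanged.

```python
def sse(values):
    """Sum of squared errors around the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return the set of row indices sent left by the best single split on x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_cost, best_left = float('inf'), None
    for k in range(1, len(order)):
        left, right = order[:k], order[k:]
        # Skip positions that would separate identical feature values
        if x[left[-1]] == x[right[0]]:
            continue
        cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
        if cost < best_cost:
            best_cost, best_left = cost, frozenset(left)
    return best_left

# Toy data: a distance in meters and a delay in seconds
stop_distance = [120.0, 450.0, 300.0, 900.0, 150.0, 700.0]
delay = [10.0, 60.0, 20.0, 95.0, 15.0, 80.0]

# Min-max scaling is a monotonic transformation of the feature
lo, hi = min(stop_distance), max(stop_distance)
scaled = [(d - lo) / (hi - lo) for d in stop_distance]

same_partition = best_split(stop_distance, delay) == best_split(scaled, delay)
print(same_partition)
```

Only the threshold value changes with the scale; the rows that end up on each side of the split stay the same.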

## Fit Base Models

All three models let you set a number of boosting rounds and an early stopping criterion. To start, each model runs for up to 100 rounds with an early stopping patience of 3 rounds.
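
Early stopping halts training once the validation metric has not improved for a given number of consecutive rounds. A minimal, library-independent sketch of that rule follows, with a hypothetical `early_stopping_round` helper and made-up validation scores:

```python
def early_stopping_round(val_scores, patience):
    """Return (best_round, stop_round) for a metric where lower is better.

    Training stops at the first round where the best score is
    `patience` rounds old, or at the last round if that never happens.
    """
    best_round, best_score = 0, float('inf')
    for rnd, score in enumerate(val_scores):
        if score < best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            return best_round, rnd
    return best_round, len(val_scores) - 1

# Made-up validation RMSE values: improvement stalls after round 3
scores = [150.0, 148.0, 147.0, 146.5, 146.6, 146.7, 146.9, 146.4]
best, stopped = early_stopping_round(scores, patience=3)
print(best, stopped)  # 3 6
```

With a small patience, training can stop before a later improvement is reached (round 7 in this toy sequence), which is one reason to use a larger patience when the learning rate is low.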

In [5]:
# Create dataframe to track metrics
reg_metrics_df = pd.DataFrame(columns=['model', 'MAE', 'RMSE', 'R²'])

In [6]:
def add_reg_metrics(reg_metrics_df:pd.DataFrame, y_pred:pd.Series, y_true:pd.Series, model_name:str) -> pd.DataFrame:
	mae = mean_absolute_error(y_true, y_pred)
	rmse = root_mean_squared_error(y_true, y_pred)
	r2 = r2_score(y_true, y_pred)

	reg_metrics_df.loc[len(reg_metrics_df)] = [model_name, mae, rmse, r2]
	return reg_metrics_df

### XGBoost

In [7]:
# Create regression matrices
xg_train_data = xgb.DMatrix(X_train, y_train, enable_categorical=False)
xg_val_data = xgb.DMatrix(X_val, y_val, enable_categorical=False)
xg_test_data = xgb.DMatrix(X_test, y_test, enable_categorical=False)
xg_eval_set = [(xg_train_data, 'train'), (xg_val_data, 'validation')]
xg_test_set = [(xg_train_data, 'train'), (xg_test_data, 'test')]

In [8]:
# Train model
xg_reg_base = xgb.train(
  params= {'objective': 'reg:squarederror', 'tree_method': 'hist'},
  dtrain=xg_train_data,
  num_boost_round=100,
  evals=xg_eval_set,
  verbose_eval=10,
  early_stopping_rounds=3
)

[0]	train-rmse:156.46323	validation-rmse:155.71727
[10]	train-rmse:148.03189	validation-rmse:148.07687
[20]	train-rmse:145.54309	validation-rmse:145.91902
[30]	train-rmse:144.11409	validation-rmse:144.70793
[40]	train-rmse:142.91902	validation-rmse:143.75161
[50]	train-rmse:141.55458	validation-rmse:142.59166
[60]	train-rmse:140.69614	validation-rmse:141.90451
[70]	train-rmse:139.56006	validation-rmse:140.94534
[80]	train-rmse:138.69328	validation-rmse:140.24273
[90]	train-rmse:137.89886	validation-rmse:139.57982
[97]	train-rmse:137.62058	validation-rmse:139.47898


In [9]:
# Evaluate model
y_pred = xg_reg_base.predict(xg_val_data)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_val, 'xg_reg_base')
reg_metrics_df

| | model | MAE | RMSE | R² |
|---|---|---|---|---|
| 0 | xg_reg_base | 70.365457 | 139.478984 | 0.243542 |


**MAE**<br>
On average, the predictions are off by 70 seconds, which is reasonable.

**RMSE**<br>
The higher RMSE compared to MAE suggests that there are some significant prediction errors that influence the overall error metric.

**R²**<br>
The model explains 24.35% of the variance. This is low, but understandable given how unpredictable transit delays can be (bad weather, vehicle breakdowns, accidents, etc.).
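
To make the gap between MAE and RMSE concrete, the sketch below computes the three metrics from their definitions on made-up delays (not project data). A single large miss raises RMSE much more than MAE because errors are squared before averaging.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R²) computed from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    mean_true = sum(y_true) / n
    total_var = sum((t - mean_true) ** 2 for t in y_true) / n
    return mae, sqrt(mse), 1 - mse / total_var

# Made-up delays in seconds; the last trip is badly underpredicted
y_true = [30.0, 60.0, 45.0, 50.0, 600.0]
y_pred = [40.0, 50.0, 55.0, 40.0, 200.0]

mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f'MAE={mae:.1f}  RMSE={rmse:.1f}  R2={r2:.3f}')
```

Four of the five errors are only 10 seconds, yet the single 400-second miss roughly doubles the RMSE relative to the MAE.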

### LightGBM

In [10]:
# Create regression datasets
lgb_train_data = lgb.Dataset(X_train, label=y_train)
lgb_val_data = lgb.Dataset(X_val, label=y_val, reference=lgb_train_data)
lgb_test_data = lgb.Dataset(X_test, label=y_test, reference=lgb_train_data)

In [11]:
# Train model
lgb_reg_base = lgb.train(
    params={
        'objective': 'regression',
        'metric': 'rmse',
        'learning_rate': 0.05,
        'max_depth': -1
    },
    train_set=lgb_train_data,
    valid_sets=[lgb_val_data],
    num_boost_round=100,
    callbacks=[lgb.early_stopping(stopping_rounds=3)]
)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.063701 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 861
[LightGBM] [Info] Number of data points in the train set: 900000, number of used features: 25
[LightGBM] [Info] Start training from score 52.317412
Training until validation scores don't improve for 3 rounds
Did not meet early stopping. Best iteration is:
[100]	valid_0's rmse: 145.977


In [12]:
# Evaluate model
y_pred = lgb_reg_base.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_val, 'lgb_reg_base')
reg_metrics_df

| | model | MAE | RMSE | R² |
|---|---|---|---|---|
| 0 | xg_reg_base | 70.365457 | 139.478984 | 0.243542 |
| 1 | lgb_reg_base | 73.185825 | 145.976751 | 0.17142 |


The LightGBM model performs worse than XGBoost, especially in terms of R-squared.

### CatBoost

In [13]:
# Fit model
cat_reg_base = CatBoostRegressor(
    iterations=100,
    learning_rate=0.05,
    depth=10,
    random_seed=42,
    verbose=10
)

cat_reg_base.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=3)

0:	learn: 160.3047745	test: 159.5508086	best: 159.5508086 (0)	total: 227ms	remaining: 22.5s
10:	learn: 155.0720266	test: 154.2661407	best: 154.2661407 (10)	total: 2.19s	remaining: 17.7s
20:	learn: 152.5198134	test: 151.7871580	best: 151.7871580 (20)	total: 3.89s	remaining: 14.6s
30:	learn: 151.1482569	test: 150.5139002	best: 150.5139002 (30)	total: 5.62s	remaining: 12.5s
40:	learn: 150.1356224	test: 149.5836087	best: 149.5836087 (40)	total: 7.49s	remaining: 10.8s
50:	learn: 149.2087764	test: 148.7391813	best: 148.7391813 (50)	total: 9.38s	remaining: 9.01s
60:	learn: 148.6057792	test: 148.2251318	best: 148.2251318 (60)	total: 11.5s	remaining: 7.35s
70:	learn: 148.0129444	test: 147.6992814	best: 147.6992814 (70)	total: 13.7s	remaining: 5.59s
80:	learn: 147.5463877	test: 147.2977566	best: 147.2977566 (80)	total: 15.7s	remaining: 3.69s
90:	learn: 146.9514147	test: 146.7248924	best: 146.7248924 (90)	total: 17.8s	remaining: 1.76s
99:	learn: 146.4551328	test: 146.2884541	best: 146.2884541 (99

<catboost.core.CatBoostRegressor at 0x120ef5fd0>

In [14]:
# Evaluate model
y_pred = cat_reg_base.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_val, 'cat_reg_base')
reg_metrics_df

| | model | MAE | RMSE | R² |
|---|---|---|---|---|
| 0 | xg_reg_base | 70.365457 | 139.478984 | 0.243542 |
| 1 | lgb_reg_base | 73.185825 | 145.976751 | 0.17142 |
| 2 | cat_reg_base | 73.080722 | 146.288454 | 0.167877 |


CatBoost performs almost like LightGBM. Without tuning, XGBoost seems to capture a bit more of the underlying patterns than the two other models.

## Hyperparameter Tuning

### XGBoost

In [15]:
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500, 600],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_dist,
    n_iter=50,
    scoring='neg_root_mean_squared_error',
    cv=2,
    n_jobs=-1,
    verbose=1,
    random_state=42
)

random_search.fit(X_train, y_train)

Fitting 2 folds for each of 50 candidates, totalling 100 fits
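
Randomized search evaluates only a sample of the grid. A short calculation shows how much of the XGBoost grid defined above is covered with `n_iter=50`:

```python
from math import prod

# Same grid as the XGBoost search above
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500, 600],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

n_iter, cv = 50, 2
n_combinations = prod(len(values) for values in param_dist.values())
coverage = n_iter / n_combinations
n_fits = n_iter * cv

print(n_combinations, n_fits, f'{coverage:.1%}')  # 486 100 10.3%
```

The 100 fits match the log line above, and the search samples about a tenth of the 486 possible combinations.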


In [16]:
# Best model
xg_best_model = random_search.best_estimator_
xg_best_params = random_search.best_params_

In [17]:
# Train best model with more boost rounds
xg_reg_tuned = xgb.train(
  params= {
    'objective': 'reg:squarederror',
    'tree_method': 'hist',
    'max_depth': xg_best_params['max_depth'],
    'learning_rate': xg_best_params['learning_rate'],
    'subsample': xg_best_params['subsample'],
    'colsample_bytree': xg_best_params['colsample_bytree'],
  },
  dtrain=xg_train_data,
  num_boost_round=10000,
  evals=xg_eval_set,
  verbose_eval=50,
  early_stopping_rounds=50
)

[0]	train-rmse:157.44938	validation-rmse:156.75882
[50]	train-rmse:140.30183	validation-rmse:141.66669
[100]	train-rmse:135.65984	validation-rmse:138.16842
[150]	train-rmse:132.54442	validation-rmse:136.15099
[200]	train-rmse:130.13743	validation-rmse:134.72254
[250]	train-rmse:128.53028	validation-rmse:134.03920
[300]	train-rmse:127.07783	validation-rmse:133.50085
[350]	train-rmse:125.60486	validation-rmse:132.82838
[400]	train-rmse:124.33808	validation-rmse:132.32552
[450]	train-rmse:123.35154	validation-rmse:132.12580
[500]	train-rmse:122.28692	validation-rmse:131.79863
[550]	train-rmse:121.46720	validation-rmse:131.66445
[600]	train-rmse:120.58133	validation-rmse:131.52871
[650]	train-rmse:119.86069	validation-rmse:131.52896
[657]	train-rmse:119.74549	validation-rmse:131.53226


In [18]:
# Evaluate model
y_pred = xg_reg_tuned.predict(xg_val_data)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_val, 'xg_reg_tuned')
reg_metrics_df

| | model | MAE | RMSE | R² |
|---|---|---|---|---|
| 0 | xg_reg_base | 70.365457 | 139.478984 | 0.243542 |
| 1 | lgb_reg_base | 73.185825 | 145.976751 | 0.17142 |
| 2 | cat_reg_base | 73.080722 | 146.288454 | 0.167877 |
| 3 | xg_reg_tuned | 66.431701 | 131.526843 | 0.327339 |


There's a significant improvement from the base XGBoost model and it's the best performing model so far.

### LightGBM

In [19]:
param_dist = {
  'n_estimators': [100, 500, 1000],
  'learning_rate': [0.01, 0.05, 0.1],
  'max_depth': [5, 10, 15],
  'num_leaves': [20, 31, 40],
  'min_child_samples': [10, 20, 30],
  'subsample': [0.8, 1.0],
  'colsample_bytree': [0.8, 1.0]
}

lgb_model = lgb.LGBMRegressor(random_state=42)

random_search = RandomizedSearchCV(
    estimator=lgb_model,
    param_distributions=param_dist,
    n_iter=50,
    cv=2, 
    scoring='neg_root_mean_squared_error',
    verbose=1,
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)

Fitting 2 folds for each of 50 candidates, totalling 100 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.109476 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 861
[LightGBM] [Info] Number of data points in the train set: 450000, number of used features: 25
[LightGBM] [Info] Start training from score 52.173195
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.083355 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 861
[LightGBM] [Info] Number of data points in the train set: 450000, number of used features: 25
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.128195 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is

In [20]:
# Best model
lgb_best_model = random_search.best_estimator_
lgb_best_params = random_search.best_params_

In [21]:
# Train model with more boost rounds and early stopping
lgb_reg_tuned = lgb.train(
    params={
        'objective': 'regression',
        'metric': 'rmse',
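        # Note: `n_estimators` is an alias of `num_iterations`. When present in
        # params, it takes precedence over `num_boost_round`, so training is
        # capped at this value rather than at 10000 rounds.
        'n_estimators': lgb_best_params['n_estimators'],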
        'learning_rate': lgb_best_params['learning_rate'],
        'max_depth': lgb_best_params['max_depth'],
        'num_leaves': lgb_best_params['num_leaves'],
        'min_child_samples': lgb_best_params['min_child_samples'],
        'subsample': lgb_best_params['subsample'],
        'colsample_bytree': lgb_best_params['colsample_bytree']
    },
    train_set=lgb_train_data,
    valid_sets=[lgb_val_data],
    num_boost_round=10000,
    callbacks=[lgb.early_stopping(stopping_rounds=50)]
)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.077953 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 861
[LightGBM] [Info] Number of data points in the train set: 900000, number of used features: 25
[LightGBM] [Info] Start training from score 52.317412
Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[1000]	valid_0's rmse: 131.41


In [22]:
# Evaluate model
y_pred = lgb_reg_tuned.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_val, 'lgb_reg_tuned')
reg_metrics_df

| | model | MAE | RMSE | R² |
|---|---|---|---|---|
| 0 | xg_reg_base | 70.365457 | 139.478984 | 0.243542 |
| 1 | lgb_reg_base | 73.185825 | 145.976751 | 0.17142 |
| 2 | cat_reg_base | 73.080722 | 146.288454 | 0.167877 |
| 3 | xg_reg_tuned | 66.431701 | 131.526843 | 0.327339 |
| 4 | lgb_reg_tuned | 66.705643 | 131.410076 | 0.328533 |


The performance is very similar to the tuned XGBoost model. The MAE is slightly worse, but the RMSE and R-squared are slightly better.
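
Because every model is scored on the same validation set, RMSE and R² are tied together by `R² = 1 - RMSE² / Var(y)`. As a consistency check, the sketch below recovers the implied standard deviation of the validation target from each row of the metrics table above. The rows should agree with each other, and they do, at roughly 160 seconds.

```python
from math import sqrt

# (RMSE, R²) pairs copied from the metrics table above
metrics = {
    'xg_reg_base': (139.478984, 0.243542),
    'lgb_reg_base': (145.976751, 0.17142),
    'cat_reg_base': (146.288454, 0.167877),
    'xg_reg_tuned': (131.526843, 0.327339),
    'lgb_reg_tuned': (131.410076, 0.328533),
}

# R² = 1 - RMSE² / Var(y)  =>  std(y) = RMSE / sqrt(1 - R²)
implied_std = {name: rmse / sqrt(1 - r2) for name, (rmse, r2) in metrics.items()}

for name, std in implied_std.items():
    print(f'{name}: {std:.2f}')
```

Seen this way, an RMSE of about 131 seconds against a target spread of about 160 seconds is what an R² near 0.33 means in practice.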

### CatBoost

In [23]:
param_dist = {
  'iterations': [100, 500, 1000],
  'learning_rate': [0.01, 0.05, 0.1],
  'depth': [6, 8, 10],
  'l2_leaf_reg': [1, 3, 5],
  'border_count': [32, 64, 128],
  'bagging_temperature': [0, 1, 5],
}

cat_model = CatBoostRegressor(verbose=0, random_seed=42)

random_search = RandomizedSearchCV(
    estimator=cat_model,
    param_distributions=param_dist,
    n_iter=50,
    cv=2,
    scoring='neg_root_mean_squared_error',
    verbose=1,
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)

Fitting 2 folds for each of 50 candidates, totalling 100 fits


KeyboardInterrupt: 

In [None]:
# Best model
cat_best_model = random_search.best_estimator_
cat_best_params = random_search.best_params_

In [None]:
# Train best model with more iterations
cat_reg_tuned = CatBoostRegressor(
    iterations=10000,
    learning_rate=cat_best_params['learning_rate'],
    depth=cat_best_params['depth'],
    l2_leaf_reg=cat_best_params['l2_leaf_reg'],
    border_count=cat_best_params['border_count'],
    bagging_temperature=cat_best_params['bagging_temperature'],
    random_seed=42,
    verbose=50
)

cat_reg_tuned.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=50)

In [None]:
# Evaluate model
y_pred = cat_reg_tuned.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_val, 'cat_reg_tuned')
reg_metrics_df

The tuned CatBoost model is the one used for the rest of the analysis. The random search above was interrupted (`KeyboardInterrupt`), so this run shows no metrics for it.

## Residual Analysis

In [None]:
# Get predictions
best_model = cat_reg_tuned
y_pred = best_model.predict(X_val)

In [None]:
def plot_residuals(y_true, y_pred, model_name:str) -> None:
	fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(15, 7))

	# Predicted vs. actual values
	ax1.scatter(x=y_pred, y=y_true)
	ax1.set_title('Predicted vs. Actual values')
	ax1.set_xlabel('Predicted Delay (seconds)')
	ax1.set_ylabel('Actual Delay (seconds)')
	ax1.grid(True)

	# Residuals
	residuals = y_true - y_pred
	ax2.scatter(x=y_pred, y=residuals)
	ax2.set_title('Residual Plot')
	ax2.set_xlabel('Predicted Delay (seconds)')
	ax2.set_ylabel('Residuals (seconds)')
	ax2.axhline(0, linestyle='--', color='orange')
	ax2.grid(True)

	fig.suptitle('Residual Analysis', fontsize=18)
	fig.tight_layout()
	fig.savefig(f'../images/residual_analysis_{model_name}.png', bbox_inches='tight')
	plt.show()

In [None]:
# Plot residuals
plot_residuals(y_val, y_pred, 'cat_reg_tuned')

**Predicted vs. Actual Plot**

There's a dense cluster around 0 for both predicted and actual values, indicating that many predictions and actual delays are centered near 0. However, there is substantial spread both above and below the diagonal line, which suggests both underprediction and overprediction. There are also clear outliers far from the main cluster.


**Residual Plot**

The residuals show a visible diagonal stripe pattern, which indicates a systematic error in prediction. The spread of residuals increases as the predicted delay increases. This is a sign of heteroscedasticity (the variance of errors is not constant across all predictions).
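
Heteroscedasticity can also be checked numerically by grouping residuals by predicted value and comparing their spread per group. The sketch below uses synthetic data whose noise grows with the prediction (an assumption made for illustration; it is not the model's output) and a hypothetical `residual_spread_by_bin` helper.

```python
import random
from statistics import pstdev

def residual_spread_by_bin(y_pred, residuals, n_bins):
    """Return the residual standard deviation within equal-width bins of y_pred."""
    lo, hi = min(y_pred), max(y_pred)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for pred, res in zip(y_pred, residuals):
        # Clamp so the maximum prediction falls in the last bin
        idx = min(int((pred - lo) / width), n_bins - 1)
        bins[idx].append(res)
    return [pstdev(b) if len(b) > 1 else 0.0 for b in bins]

# Synthetic predictions and residuals: noise scale grows with the prediction
rng = random.Random(42)
y_pred = [rng.uniform(0, 300) for _ in range(2000)]
residuals = [rng.gauss(0, 10 + 0.5 * p) for p in y_pred]

spread = residual_spread_by_bin(y_pred, residuals, n_bins=4)
print([round(s, 1) for s in spread])
```

A spread that rises from the first bin to the last is the numerical counterpart of the fan shape in the residual plot. Applied to the notebook's data, the inputs would be the validation predictions and `y_val - y_pred`.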

## Feature Importance Plot

In [None]:
def plot_feat_importance(feature_importances, model_name:str) -> None:
	plt.figure(figsize=(10, 6))
	plt.barh(feature_importances['Feature Id'], feature_importances['Importances'])
	plt.gca().invert_yaxis()
	plt.title('Feature Importance')
	plt.xlabel('Importance')
	plt.tight_layout()
	plt.savefig(f'../images/feature_importances_{model_name}.png', bbox_inches='tight')
	plt.show()

In [None]:
# Get sorted feature importances
feature_importances = best_model.get_feature_importance(prettified=True)
feature_importances = feature_importances.sort_values(by='Importances', ascending=False)
feature_importances

In [None]:
# Plot the feature importance
plot_feat_importance(feature_importances, 'cat_reg_tuned')

Interpret plot

## SHAP Plots

In [None]:
# Initialize SHAP
sample_size = 250 # sample validation set to prevent memory overload
X_val_sample = X_val.sample(n=sample_size, random_state=42) 
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_val_sample)

In [None]:
# Summary barplot
shap.summary_plot(shap_values, X_val_sample, plot_type='bar', show=False)
plt.title('SHAP Summary Barplot')
plt.tight_layout()
plt.savefig(f'../images/shap_barplot_cat_reg_tuned.png', bbox_inches='tight')
plt.show()

In [None]:
# Summary beeswarm plot
shap.summary_plot(shap_values, X_val_sample, show=False)
plt.title('SHAP Summary Beeswarm Plot')
plt.tight_layout()
plt.savefig(f'../images/shap_beeswarm_plot_cat_reg_tuned.png', bbox_inches='tight')
plt.show()

In [None]:
# Force plot a single prediction
random.seed(0)
index = random.randrange(sample_size)
shap.force_plot(
    explainer.expected_value,
    shap_values[index, :],
    X_val_sample.iloc[index, :],
    figsize=(15, 4),
    contribution_threshold=0.1,
    matplotlib=True,
    show=False
)
plt.tight_layout()
plt.savefig(f'../images/shap_force_plot_cat_reg_tuned.png', bbox_inches='tight')
plt.show()
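
A force plot relies on the additivity of SHAP values: the base value plus the per-feature contributions equals the model's prediction for that row. The sketch below illustrates this on a hypothetical linear model with made-up weights and data, where SHAP values have a simple closed form if the features are treated as independent. It is a stand-in for intuition, not the CatBoost explainer used above.

```python
# Hypothetical linear model: f(x) = bias + sum(w_i * x_i)
weights = {'stop_distance': 0.05, 'is_peak_hour': 20.0, 'precipitation': 8.0}
bias = 12.0

# Made-up background data used to compute the base (expected) value
background = [
    {'stop_distance': 200.0, 'is_peak_hour': 0.0, 'precipitation': 0.0},
    {'stop_distance': 400.0, 'is_peak_hour': 1.0, 'precipitation': 2.0},
    {'stop_distance': 600.0, 'is_peak_hour': 1.0, 'precipitation': 1.0},
]

def predict(row):
    """Prediction of the hypothetical linear model for one row."""
    return bias + sum(weights[f] * row[f] for f in weights)

# Base value: mean prediction over the background data
base_value = sum(predict(r) for r in background) / len(background)

# For a linear model with independent features: phi_i = w_i * (x_i - mean(x_i))
feature_means = {f: sum(r[f] for r in background) / len(background) for f in weights}
row = {'stop_distance': 500.0, 'is_peak_hour': 1.0, 'precipitation': 4.0}
shap_row = {f: weights[f] * (row[f] - feature_means[f]) for f in weights}

reconstructed = base_value + sum(shap_row.values())
print(round(reconstructed, 6), round(predict(row), 6))
```

In the force plot, `explainer.expected_value` plays the role of `base_value` and each arrow corresponds to one entry of `shap_row`.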

## Feature Pruning

In [None]:
# Calculate mean SHAP per feature
shap_abs_mean = np.abs(shap_values).mean(axis=0)
shap_df = pd.DataFrame({
    'feature': X_val.columns,
    'mean_abs_shap': shap_abs_mean
}).sort_values('mean_abs_shap', ascending=False)
shap_df.head()

In [None]:
# Identify low impact features (mean absolute SHAP value below 1)
low_shap_features = shap_df[shap_df['mean_abs_shap'] < 1]

print('Features with low SHAP impact to remove:\n')
print(low_shap_features)

In [None]:
# Keep best features
features_to_drop = low_shap_features['feature'].tolist()
all_features = X.columns.tolist()

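# Preserve the original column order (a set difference would not)
features_to_drop_set = set(features_to_drop)
best_features = [f for f in all_features if f not in features_to_drop_set]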

In [None]:
# Remove columns from input matrices
X_pruned = X[best_features]
X_train_pruned = X_train[best_features]
X_val_pruned = X_val[best_features]
X_test_pruned = X_test[best_features]

## Retrain Model with Best Features

In [None]:
# Retrain Model
cat_reg_pruned = CatBoostRegressor(
    iterations=10000,
    learning_rate=cat_best_params['learning_rate'],
    depth=cat_best_params['depth'],
    l2_leaf_reg=cat_best_params['l2_leaf_reg'],
    border_count=cat_best_params['border_count'],
    bagging_temperature=cat_best_params['bagging_temperature'],
    random_seed=42,
    verbose=50
)

cat_reg_pruned.fit(X_train_pruned, y_train, eval_set=(X_val_pruned, y_val), early_stopping_rounds=50)

In [None]:
# Evaluate model
y_pred = cat_reg_pruned.predict(X_val_pruned)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_val, 'cat_reg_pruned')
reg_metrics_df

## Retune Parameters

In [None]:
param_dist = {
  'iterations': [100, 500, 1000],
  'learning_rate': [0.01, 0.05, 0.1],
  'depth': [6, 8, 10],
  'l2_leaf_reg': [1, 3, 5],
  'border_count': [32, 64, 128],
  'bagging_temperature': [0, 1, 5],
}

cat_model = CatBoostRegressor(verbose=0, random_seed=42)

random_search = RandomizedSearchCV(
    estimator=cat_model,
    param_distributions=param_dist,
    n_iter=50,
    cv=2,
    scoring='neg_root_mean_squared_error',
    verbose=1,
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train_pruned, y_train)

In [None]:
# Best model
cat_pruned_best_model = random_search.best_estimator_
cat_pruned_best_params = random_search.best_params_

In [None]:
# Retrain model
cat_reg_final = CatBoostRegressor(
    iterations=10000,
    learning_rate=cat_pruned_best_params['learning_rate'],
    depth=cat_pruned_best_params['depth'],
    l2_leaf_reg=cat_pruned_best_params['l2_leaf_reg'],
    border_count=cat_pruned_best_params['border_count'],
    bagging_temperature=cat_pruned_best_params['bagging_temperature'],
    random_seed=42,
    verbose=50
)

cat_reg_final.fit(X_train_pruned, y_train, eval_set=(X_val_pruned, y_val), early_stopping_rounds=50)

## Final Model

### Evaluate with Test Set

In [None]:
final_model = cat_reg_final

In [None]:
# Evaluate model
y_pred = final_model.predict(X_test_pruned)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_test, 'cat_reg_final')
reg_metrics_df

In [None]:
# Plot residuals
plot_residuals(y_test, y_pred, 'cat_reg_final')

### Feature importances

In [None]:
# Get top 5 most important features
feature_importances = final_model.get_feature_importance(prettified=True)
feature_importances = feature_importances.sort_values(by='Importances', ascending=False)
feature_importances.head()

In [None]:
# Plot the feature importance
plot_feat_importance(feature_importances, 'cat_reg_final')

### SHAP Summary Plots

In [None]:
# Initialize SHAP
sample_size = 250 # sample test set to prevent memory overload
X_test_sample = X_test_pruned.sample(n=sample_size, random_state=42)
explainer = shap.TreeExplainer(final_model)
shap_values = explainer.shap_values(X_test_sample)

In [None]:
# Summary barplot
shap.summary_plot(shap_values, X_test_sample, plot_type='bar', show=False)
plt.title('SHAP Summary Barplot')
plt.tight_layout()
plt.savefig(f'../images/shap_barplot_cat_reg_final.png', bbox_inches='tight')
plt.show()

In [None]:
# Summary beeswarm plot
shap.summary_plot(shap_values, X_test_sample, show=False)
plt.title('SHAP Summary Beeswarm Plot')
plt.tight_layout()
plt.savefig(f'../images/shap_beeswarm_plot_cat_reg_final.png', bbox_inches='tight')
plt.show()

In [None]:
# Force plot a single prediction
random.seed(0)
index = random.randrange(sample_size)
shap.force_plot(
    explainer.expected_value,
    shap_values[index, :],
    X_test_sample.iloc[index, :],
    figsize=(15, 4),
    contribution_threshold=0.1,
    matplotlib=True,
    show=False
)
plt.tight_layout()
plt.savefig(f'../images/shap_force_plot_cat_reg_final.png', bbox_inches='tight')
plt.show()

### Make Prediction

In [None]:
# Load stop coordinates scaler
scaler_coords = joblib.load('../models/scaler_coords.pkl')

In [None]:
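# Create feature matrix
# Placeholder values: a real row must contain the features of X_train_pruned, in the same order
input_features = np.array([[0.1, 0.2, 0.3]])

In [None]:
# Predict delay (predict returns an array, so take the first element)
prediction = final_model.predict(input_features)
print(f'Predicted delay: {prediction[0]:.2f} seconds')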
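
The model expects one row per prediction, with the same features in the same order as the training matrix. A minimal sketch of assembling such a row from named values follows, using a hypothetical `build_row` helper and a shortened, made-up feature list:

```python
def build_row(feature_order, values):
    """Return a single-row feature matrix ordered like the training columns."""
    missing = [f for f in feature_order if f not in values]
    extra = [f for f in values if f not in feature_order]
    if missing or extra:
        raise ValueError(f'missing={missing}, unexpected={extra}')
    return [[float(values[f]) for f in feature_order]]

# Made-up subset of features; in the notebook this would be the
# column order of the pruned training matrix
feature_order = ['exp_trip_duration', 'stop_distance', 'is_peak_hour']
values = {'is_peak_hour': 1, 'exp_trip_duration': 1800, 'stop_distance': 350.5}

row = build_row(feature_order, values)
print(row)  # [[1800.0, 350.5, 1.0]]
```

Failing loudly on missing or unexpected names is a deliberate choice here, since a silently misordered row would still produce a prediction.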

### Export Data

In [None]:
# Save model
joblib.dump(final_model, 'regression_model.pkl')

## End