# STM Transit Delay Data Modeling

## Overview

This notebook compares tree-based gradient boosting models to find the one that predicts STM transit delays most accurately. The models are XGBoost, LightGBM and CatBoost, chosen because they handle large datasets with mixed feature types and high-cardinality categories well.

## Data Description

`exp_trip_duration`: Expected duration of a trip, in seconds.<br>
`route_direction`: Route direction in degrees.<br>
`route_type_Night`, `route_type_High Frequency`: One-Hot features for types of bus lines.<br>
`stop_location_group`: Stop cluster based on coordinates.<br>
`stop_distance`: Distance between the previous and current stop, in meters.<br>
`trip_phase_end`: One-Hot feature for trip progress.<br>
`exp_delay_prev_stop`: Expected duration between the previous and current stop, in seconds.<br>
`wheelchair_boarding`: Indicates whether the stop is accessible to people in wheelchairs.<br>
`sch_rel_Scheduled`: One-Hot feature for schedule relationship.<br>
`time_of_day_evening`, `time_of_day_morning`, `time_of_day_night`: One-Hot features for time of day.<br>
`is_weekend`: Boolean indicating whether the day of the week falls on the weekend.<br>
`is_peak_hour`: Boolean indicating whether the scheduled arrival time is at peak hour.<br>
`temperature`: Air temperature at 2 meters above ground, in Celsius.<br>
`relative_humidity`: Relative humidity at 2 meters above ground, in percentage.<br>
`precipitation`: Total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters.<br>
`pressure`: Atmospheric air pressure reduced to mean sea level (msl), in hPa.<br>
`cloud_cover`: Total cloud cover as an area fraction.<br>
`windspeed`: Wind speed at 10 meters above ground, in kilometers per hour.<br>
`wind_direction`: Wind direction at 10 meters above ground.<br>
`delay`: Difference between real and scheduled arrival time, in seconds.<br>

## Imports

In [None]:
from catboost import CatBoostRegressor
import joblib
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import shap
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
import sys
import xgboost as xgb

In [3]:
# Load data
df = pd.read_parquet('../data/preprocessed.parquet')
print(f'Shape of dataset: {df.shape}')

Shape of dataset: (800000, 25)


## Split the data

In [34]:
# Separate features from target variable
X = df.drop('delay', axis=1)
y = df['delay']

In [35]:
# Train-validation-test split (60-20-20)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

del X_temp
del y_temp
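
As a quick check on the arithmetic, holding out 40% of the rows and then halving that hold-out gives a 60-20-20 split. The sketch below reproduces the expected row counts for the 800,000-row dataset. `split_sizes` is a hypothetical helper, and rounding the hold-out up with `math.ceil` is an assumption that makes no difference here because the products are whole numbers.

```python
import math

def split_sizes(n_rows, first_test_size=0.4, second_test_size=0.5):
    """Return (train, validation, test) row counts for a two-step split."""
    n_temp = math.ceil(n_rows * first_test_size)    # rows held out by the first split
    n_train = n_rows - n_temp
    n_test = math.ceil(n_temp * second_test_size)   # second split halves the hold-out
    n_val = n_temp - n_test
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(800_000)
print(n_train, n_val, n_test)
```

The training count agrees with the "Number of data points in the train set: 480000" line in the LightGBM log further down.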

**Scaling**

Since only tree-based models are explored in this project, scaling is not needed: tree splits depend only on the ordering of feature values, so the models are not sensitive to the absolute scale or distribution of the features.
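
The scaling claim can be checked on a toy example. The sketch below finds the best single regression split (a one-node tree) on a synthetic feature, once on the raw values and once on standardized values, and compares which rows end up on the left side. `best_split_mask` and the synthetic `temperature` and `delay` arrays are illustrative assumptions, not part of the project's pipeline.

```python
import numpy as np

def best_split_mask(x, y):
    """Return a mask of the rows sent left by the best single split on x.

    The best split minimises the summed squared error of the two sides,
    as a regression tree does at each node.
    """
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    best_sse, best_k = np.inf, 1
    for k in range(1, len(xs)):
        if xs[k] == xs[k - 1]:
            continue  # cannot split between equal feature values
        left, right = ys[:k], ys[k:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_k = sse, k
    mask = np.zeros(len(x), dtype=bool)
    mask[order[:best_k]] = True
    return mask

rng = np.random.default_rng(0)
temperature = rng.normal(loc=10.0, scale=12.0, size=300)   # synthetic feature
delay = np.where(temperature < 0.0, 120.0, 40.0) + rng.normal(scale=30.0, size=300)

# Standardizing is a strictly increasing transform, so row order is unchanged
standardized = (temperature - temperature.mean()) / temperature.std()

same_partition = np.array_equal(
    best_split_mask(temperature, delay),
    best_split_mask(standardized, delay),
)
print(same_partition)
```

Only the threshold value changes under scaling; the rows on each side of the split, and therefore the predictions, stay the same.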

## Fit Base Models

All three libraries let you set the number of boosting rounds and an early-stopping patience. To start, each model runs for up to 100 rounds and stops early if the validation metric has not improved for 3 consecutive rounds.
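
The early-stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration and not the exact logic of any of the three libraries; `early_stopping_round` and the score list are hypothetical.

```python
def early_stopping_round(val_scores, patience):
    """Return (best_round, stop_round) for a lower-is-better validation metric."""
    best_round = 0
    for current_round, score in enumerate(val_scores):
        if score < val_scores[best_round]:
            best_round = current_round
        elif current_round - best_round >= patience:
            return best_round, current_round  # no improvement for `patience` rounds
    return best_round, len(val_scores) - 1    # ran out of rounds first

# Hypothetical validation RMSE per round
scores = [150.0, 145.0, 142.0, 142.5, 142.3, 142.1, 141.0]
best_round, stop_round = early_stopping_round(scores, patience=3)
print(best_round, stop_round)
```

With a patience of 3, training stops before the last score in the list is ever computed, which is why a small patience can cut training short on a noisy validation curve.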

In [8]:
# Create dataframe to track metrics
reg_metrics_df = pd.DataFrame(columns=['model', 'MAE', 'RMSE', 'R²'])

In [None]:
def add_reg_metrics(reg_metrics_df: pd.DataFrame, y_pred: pd.Series, y_true: pd.Series, model_name: str) -> pd.DataFrame:
	"""Append one row with the MAE, RMSE and R² of `model_name` and return the dataframe."""
	mae = mean_absolute_error(y_true, y_pred)
	rmse = root_mean_squared_error(y_true, y_pred)
	r2 = r2_score(y_true, y_pred)

	reg_metrics_df.loc[len(reg_metrics_df)] = [model_name, mae, rmse, r2]
	return reg_metrics_df
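
For reference, the three metrics tracked by this helper can be computed by hand. The sketch below is a plain-Python equivalent on made-up numbers, where a single large miss pulls RMSE well above MAE; `regression_metrics` and the sample values are illustrative assumptions.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R² from two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - (mse * n) / ss_tot                        # 1 - SS_res / SS_tot
    return mae, math.sqrt(mse), r2

# Hypothetical delays in seconds: four small misses and one large one
y_true_demo = [30.0, 60.0, 90.0, 120.0, 600.0]
y_pred_demo = [40.0, 50.0, 100.0, 110.0, 400.0]
mae, rmse, r2 = regression_metrics(y_true_demo, y_pred_demo)
print(round(mae, 2), round(rmse, 2), round(r2, 3))
```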

### XGBoost

In [None]:
# Create regression matrices
xg_train_data = xgb.DMatrix(X_train, y_train, enable_categorical=False)
xg_val_data = xgb.DMatrix(X_val, y_val, enable_categorical=False)
xg_test_data = xgb.DMatrix(X_test, y_test, enable_categorical=False)
xg_eval_set = [(xg_train_data, 'train'), (xg_val_data, 'validation')]
xg_test_set = [(xg_train_data, 'train'), (xg_test_data, 'test')]

In [11]:
# Train model
xg_reg_base = xgb.train(
  params= {'objective': 'reg:squarederror', 'tree_method': 'hist'},
  dtrain=xg_train_data,
  num_boost_round=100,
  evals=xg_eval_set,
  verbose_eval=10,
  early_stopping_rounds=3
)

[0]	train-rmse:151.68551	validation-rmse:151.34824
[10]	train-rmse:143.57776	validation-rmse:143.47792
[20]	train-rmse:141.25503	validation-rmse:141.45290
[30]	train-rmse:139.72854	validation-rmse:140.15436
[40]	train-rmse:138.03110	validation-rmse:138.77747
[50]	train-rmse:136.81656	validation-rmse:137.82060
[60]	train-rmse:135.87552	validation-rmse:137.08295
[70]	train-rmse:134.88446	validation-rmse:136.37501
[80]	train-rmse:134.26456	validation-rmse:135.96706
[90]	train-rmse:133.37860	validation-rmse:135.38716
[96]	train-rmse:132.99772	validation-rmse:135.13422


In [12]:
# Evaluate model
y_pred = xg_reg_base.predict(xg_val_data)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_val, 'xg_reg_base')
reg_metrics_df

| | model | MAE | RMSE | R² |
|---|---|---|---|---|
| 0 | xg_reg_base | 72.400162 | 135.134219 | 0.248317 |


**MAE**<br>
On average, the predictions are off by about 72 seconds, which leaves a lot of room for improvement.

**RMSE**<br>
The RMSE (135 seconds) is almost twice the MAE (72 seconds), which suggests that a small number of large prediction errors dominate the squared-error metric.

**R²**<br>
The model explains only 24.83% of the variance in delay. That is low, but understandable given how unpredictable transit delays can be (bad weather, vehicle breakdowns, accidents, etc.).
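
One way to put these numbers in context is to back out the error of a trivial model that always predicts the mean validation delay. Since R² = 1 - MSE / Var(y), the target's standard deviation, which is also the RMSE of that trivial model, follows from the two reported metrics. The helper name `implied_target_std` is a hypothetical one used only for this sketch.

```python
import math

def implied_target_std(rmse, r2):
    """Back out the target's standard deviation from RMSE and R².

    Uses R² = 1 - MSE / Var(y), so Var(y) = RMSE² / (1 - R²).
    """
    return math.sqrt(rmse ** 2 / (1.0 - r2))

# Validation metrics reported above for the base XGBoost model
baseline_rmse = implied_target_std(rmse=135.134219, r2=0.248317)
print(round(baseline_rmse, 1))
```

The base model's RMSE of about 135 seconds is therefore best read against the spread of the delays themselves, not against zero.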

### LightGBM

In [13]:
# Train model
lgb_train_data = lgb.Dataset(X_train, label=y_train)
lgb_val_data = lgb.Dataset(X_val, label=y_val, reference=lgb_train_data)
lgb_test_data = lgb.Dataset(X_test, label=y_test, reference=lgb_train_data)

lgb_reg_base = lgb.train(
    params={
        'objective': 'regression',
        'metric': 'rmse',
        'learning_rate': 0.05,
        'max_depth': -1
    },
    train_set=lgb_train_data,
    valid_sets=[lgb_val_data],
    num_boost_round=100,
    callbacks=[lgb.early_stopping(stopping_rounds=3)]
)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.028327 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 798
[LightGBM] [Info] Number of data points in the train set: 480000, number of used features: 23
[LightGBM] [Info] Start training from score 53.910862
Training until validation scores don't improve for 3 rounds
Did not meet early stopping. Best iteration is:
[100]	valid_0's rmse: 141.595


In [14]:
# Evaluate model
y_pred = lgb_reg_base.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_val, 'lgb_reg_base')
reg_metrics_df

| | model | MAE | RMSE | R² |
|---|---|---|---|---|
| 0 | xg_reg_base | 72.400162 | 135.134219 | 0.248317 |
| 1 | lgb_reg_base | 75.631386 | 141.595168 | 0.17472 |


Overall, the LightGBM model performs worse than XGBoost.

### CatBoost

In [15]:
# Fit model
cat_reg_base = CatBoostRegressor(
    iterations=100,
    learning_rate=0.05,
    depth=10,
    random_seed=42,
    verbose=10
)

cat_reg_base.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=3)

0:	learn: 155.3866858	test: 155.0752858	best: 155.0752858 (0)	total: 175ms	remaining: 17.3s
10:	learn: 150.1283401	test: 149.8244834	best: 149.8244834 (10)	total: 1.44s	remaining: 11.6s
20:	learn: 147.5512752	test: 147.2669953	best: 147.2669953 (20)	total: 2.58s	remaining: 9.69s
30:	learn: 146.0437598	test: 145.7606510	best: 145.7606510 (30)	total: 3.58s	remaining: 7.97s
40:	learn: 145.0447379	test: 144.7991356	best: 144.7991356 (40)	total: 4.59s	remaining: 6.61s
50:	learn: 144.3414812	test: 144.1352105	best: 144.1352105 (50)	total: 5.65s	remaining: 5.43s
60:	learn: 143.7224012	test: 143.5373634	best: 143.5373634 (60)	total: 6.86s	remaining: 4.38s
70:	learn: 143.2066704	test: 143.0644667	best: 143.0644667 (70)	total: 7.96s	remaining: 3.25s
80:	learn: 142.7382035	test: 142.6122296	best: 142.6122296 (80)	total: 9.14s	remaining: 2.14s
90:	learn: 142.3898467	test: 142.3135788	best: 142.3135788 (90)	total: 10.2s	remaining: 1.01s
99:	learn: 141.9604719	test: 141.8988930	best: 141.8988930 (99

<catboost.core.CatBoostRegressor at 0x12c7956a0>

In [16]:
# Evaluate model
y_pred = cat_reg_base.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_val, 'cat_reg_base')
reg_metrics_df

| | model | MAE | RMSE | R² |
|---|---|---|---|---|
| 0 | xg_reg_base | 72.400162 | 135.134219 | 0.248317 |
| 1 | lgb_reg_base | 75.631386 | 141.595168 | 0.17472 |
| 2 | cat_reg_base | 75.327649 | 141.898893 | 0.171176 |


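CatBoost performs almost the same as LightGBM. Without hyperparameter tuning, XGBoost captures somewhat more of the underlying patterns than the other two models, but the comparison is not like-for-like: XGBoost ran with its default learning rate of 0.3, while LightGBM and CatBoost used 0.05 and were both still improving when they reached the 100-round limit.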

## Hyperparameter Tuning

### XGBoost

In [None]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0]
}

xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)

grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error',
    cv=2,
    n_jobs=-1,
    verbose=2
)

grid_search.fit(X_train, y_train)

Fitting 2 folds for each of 243 candidates, totalling 486 fits
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=1.0; total time=  10.7s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=1.0; total time=  11.0s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.8; total time=  11.5s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.8; total time=  11.7s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.6; total time=  12.0s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.6; total time=  12.0s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=200, subsample=0.6; total time=  19.9s
[CV] END colsample_bytree=0.6, learning_rate=0.01, max_depth=3, n_estimators=200, subsample=0.6; total time=  20.4s
[CV] END 
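
The "243 candidates, totalling 486 fits" line in the log follows directly from the grid: five parameters with three values each give 3^5 combinations, and each one is fitted once per fold. A minimal check, with the grid copied from the cell above:

```python
from itertools import product

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}
n_folds = 2  # matches cv=2 in the grid search

# Every combination of one value per parameter is a candidate
n_candidates = len(list(product(*param_grid.values())))
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

Adding a fourth value to each parameter would raise the count to 1,024 candidates, which is why the searches for the other two models sample the grid instead of exhausting it.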

In [18]:
# Best model
xg_best_model = grid_search.best_estimator_
xg_best_params = grid_search.best_params_

In [None]:
# Train best model with more boost rounds
xg_reg_tuned = xgb.train(
  params= {
    'objective': 'reg:squarederror',
    'tree_method': 'hist',
    'max_depth': xg_best_params['max_depth'],
    'learning_rate': xg_best_params['learning_rate'],
    'subsample': xg_best_params['subsample'],
    'colsample_bytree': xg_best_params['colsample_bytree'],
  },
  dtrain=xg_train_data,
  num_boost_round=1000,
  evals=xg_eval_set,
  verbose_eval=50,
  early_stopping_rounds=50
)

[0]	train-rmse:152.70902	validation-rmse:152.40559
[50]	train-rmse:135.79286	validation-rmse:137.19029
[100]	train-rmse:132.00960	validation-rmse:134.88922
[150]	train-rmse:129.19271	validation-rmse:133.60614
[200]	train-rmse:127.01186	validation-rmse:132.72683
[250]	train-rmse:124.93803	validation-rmse:132.06293
[300]	train-rmse:123.33210	validation-rmse:131.66940
[350]	train-rmse:121.98580	validation-rmse:131.36681
[400]	train-rmse:120.77165	validation-rmse:131.26063
[450]	train-rmse:119.62523	validation-rmse:131.11842
[500]	train-rmse:118.46142	validation-rmse:130.97391
[550]	train-rmse:117.51765	validation-rmse:131.03619
[600]	train-rmse:116.55800	validation-rmse:130.97030
[618]	train-rmse:116.28911	validation-rmse:131.03971


In [20]:
# Evaluate model
y_pred = xg_reg_tuned.predict(xg_val_data)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_val, 'xg_reg_tuned')
reg_metrics_df

| | model | MAE | RMSE | R² |
|---|---|---|---|---|
| 0 | xg_reg_base | 72.400162 | 135.134219 | 0.248317 |
| 1 | lgb_reg_base | 75.631386 | 141.595168 | 0.17472 |
| 2 | cat_reg_base | 75.327649 | 141.898893 | 0.171176 |
| 3 | xg_reg_tuned | 69.754902 | 131.039707 | 0.293178 |


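This is a clear improvement over the base XGBoost model: RMSE falls from 135.13 to 131.04 and R² rises from 0.248 to 0.293. Note that `Booster.predict` uses every tree by default, so this score reflects the final round (618) rather than the best round found by early stopping. Passing `iteration_range=(0, xg_reg_tuned.best_iteration + 1)` to `predict` would score the best round instead.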

### LightGBM

In [None]:
param_dist = {
  'n_estimators': [100, 500, 1000],
  'learning_rate': [0.01, 0.05, 0.1],
  'max_depth': [5, 10, 15],
  'num_leaves': [20, 31, 40],
  'min_child_samples': [10, 20, 30],
  'subsample': [0.8, 1.0],
  'colsample_bytree': [0.8, 1.0]
}

lgb_model = lgb.LGBMRegressor(random_state=42)

random_search = RandomizedSearchCV(
    estimator=lgb_model,
    param_distributions=param_dist,
    n_iter=20,
    cv=2, 
    scoring='neg_root_mean_squared_error',
    verbose=2,
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)

Fitting 2 folds for each of 10 candidates, totalling 20 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.074197 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 797
[LightGBM] [Info] Number of data points in the train set: 240000, number of used features: 23
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.076455 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 797
[LightGBM] [Info] Start training from score 54.036375
[LightGBM] [Info] Number of data points in the train set: 240000, number of used features: 23
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.084400 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is 

Exception ignored in: <function ResourceTracker.__del__ at 0x10564b740>
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.13/3.13.3/Frameworks/Python.framework/Versions/3.13/lib/python3.13/multiprocessing/resource_tracker.py", line 82, in __del__
  File "/usr/local/Cellar/python@3.13/3.13.3/Frameworks/Python.framework/Versions/3.13/lib/python3.13/multiprocessing/resource_tracker.py", line 91, in _stop
  File "/usr/local/Cellar/python@3.13/3.13.3/Frameworks/Python.framework/Versions/3.13/lib/python3.13/multiprocessing/resource_tracker.py", line 116, in _stop_locked
ChildProcessError: [Errno 10] No child processes
Exception ignored in: <function ResourceTracker.__del__ at 0x103fcf740>
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.13/3.13.3/Frameworks/Python.framework/Versions/3.13/lib/python3.13/multiprocessing/resource_tracker.py", line 82, in __del__
  File "/usr/local/Cellar/python@3.13/3.13.3/Frameworks/Python.framework/Versions/3.13/lib/py

In [None]:
# Best model
lgb_best_model = random_search.best_estimator_
lgb_best_params = random_search.best_params_
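
To get a sense of how much of the LightGBM search space a randomized search with `n_iter=20` actually visits, the sketch below enumerates the full grid and draws 20 candidates without replacement. It uses Python's `random` module and not scikit-learn's own sampler, so it will not reproduce the candidates chosen by `RandomizedSearchCV`; it only illustrates the coverage.

```python
import random
from itertools import product

param_dist = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [5, 10, 15],
    'num_leaves': [20, 31, 40],
    'min_child_samples': [10, 20, 30],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

# Full grid: one dict per combination of parameter values
all_candidates = [
    dict(zip(param_dist, values)) for values in product(*param_dist.values())
]

rng = random.Random(42)
sampled = rng.sample(all_candidates, k=20)   # 20 distinct candidates
coverage = len(sampled) / len(all_candidates)
print(len(all_candidates), f'{coverage:.1%}')
```

With only about 2% of the combinations tried, the "best" parameters are best read as a reasonable starting point and not as an optimum.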

In [24]:
# Train model with more boost rounds and early stopping.
# 'n_estimators' is left out of params: it is an alias of num_boost_round,
# and the round count is already set explicitly below.
lgb_reg_tuned = lgb.train(
    params={
        'objective': 'regression',
        'metric': 'rmse',
        'learning_rate': lgb_best_params['learning_rate'],
        'max_depth': lgb_best_params['max_depth'],
        'num_leaves': lgb_best_params['num_leaves'],
        'min_child_samples': lgb_best_params['min_child_samples'],
        'subsample': lgb_best_params['subsample'],
        'colsample_bytree': lgb_best_params['colsample_bytree']
    },
    train_set=lgb_train_data,
    valid_sets=[lgb_val_data],
    num_boost_round=1000,
    callbacks=[lgb.early_stopping(stopping_rounds=50)]
)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.034397 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 798
[LightGBM] [Info] Number of data points in the train set: 480000, number of used features: 23
[LightGBM] [Info] Start training from score 53.910862
Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[1000]	valid_0's rmse: 129.303


In [25]:
# Evaluate model
y_pred = lgb_reg_tuned.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_val, 'lgb_reg_tuned')
reg_metrics_df

| | model | MAE | RMSE | R² |
|---|---|---|---|---|
| 0 | xg_reg_base | 72.400162 | 135.134219 | 0.248317 |
| 1 | lgb_reg_base | 75.631386 | 141.595168 | 0.17472 |
| 2 | cat_reg_base | 75.327649 | 141.898893 | 0.171176 |
| 3 | xg_reg_tuned | 69.754902 | 131.039707 | 0.293178 |
| 4 | lgb_reg_tuned | 69.231826 | 129.303118 | 0.311788 |


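The MAE is slightly better than that of the tuned XGBoost model (69.23 vs. 69.75) and much better than that of the base LightGBM model (75.63). This is the best-performing model so far. Training ran all 1,000 rounds without triggering early stopping, so additional rounds might improve it further.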

### CatBoost

In [None]:
param_dist = {
  'iterations': [500, 1000, 5000, 10000],
  'learning_rate': [0.01, 0.05, 0.1],
  'depth': [6, 8, 10],
  'l2_leaf_reg': [1, 3, 5],
  'border_count': [32, 64, 128],
  'bagging_temperature': [0, 1, 5],
}

cat_model = CatBoostRegressor(verbose=0, random_seed=42)

random_search = RandomizedSearchCV(
    estimator=cat_model,
    param_distributions=param_dist,
    n_iter=20,
    cv=2,
    scoring='neg_root_mean_squared_error',
    verbose=2,
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)

Fitting 1 folds for each of 20 candidates, totalling 20 fits


ValueError: String indexing is not supported with 'axis=0'

In [None]:
# Best model
cat_reg_tuned = random_search.best_estimator_
cat_best_params = random_search.best_params_

In [29]:
# Train best model with more iterations
cat_reg_tuned = CatBoostRegressor(
    iterations=cat_best_params['iterations'],
    learning_rate=cat_best_params['learning_rate'],
    depth=cat_best_params['depth'],
    l2_leaf_reg=cat_best_params['l2_leaf_reg'],
    border_count=cat_best_params['border_count'],
    bagging_temperature=cat_best_params['bagging_temperature'],
    random_seed=42,
    verbose=50
)

cat_reg_tuned.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=50)

0:	learn: 154.6180754	test: 154.2871082	best: 154.2871082 (0)	total: 71.7ms	remaining: 1m 11s
50:	learn: 141.8678885	test: 141.7358836	best: 141.7358836 (50)	total: 4.76s	remaining: 1m 28s
100:	learn: 138.5668795	test: 138.9022186	best: 138.9022186 (100)	total: 8.86s	remaining: 1m 18s
150:	learn: 135.7098152	test: 136.5678245	best: 136.5678245 (150)	total: 13.1s	remaining: 1m 13s
200:	learn: 133.6280092	test: 134.9808557	best: 134.9808557 (200)	total: 17.8s	remaining: 1m 10s


250:	learn: 131.8014953	test: 133.6458091	best: 133.6458091 (250)	total: 23.4s	remaining: 1m 9s
300:	learn: 130.1983887	test: 132.5410263	best: 132.5410263 (300)	total: 29.7s	remaining: 1m 8s
350:	learn: 128.9997688	test: 131.7598405	best: 131.7598405 (350)	total: 34.8s	remaining: 1m 4s


400:	learn: 127.9438661	test: 131.1042549	best: 131.1042549 (400)	total: 40.3s	remaining: 1m
450:	learn: 127.0001795	test: 130.5908246	best: 130.5908246 (450)	total: 45.9s	remaining: 55.8s
500:	learn: 126.0772027	test: 130.0947316	best: 130.0947316 (500)	total: 51.1s	remaining: 50.8s
550:	learn: 125.3247069	test: 129.7158125	best: 129.7158125 (550)	total: 56.2s	remaining: 45.8s
600:	learn: 124.6634135	test: 129.4214156	best: 129.4214156 (600)	total: 1m 1s	remaining: 40.8s


650:	learn: 123.9471581	test: 129.0782831	best: 129.0782831 (650)	total: 1m 7s	remaining: 36.1s
700:	learn: 123.3304717	test: 128.8548226	best: 128.8548226 (700)	total: 1m 12s	remaining: 30.9s
750:	learn: 122.7895684	test: 128.7033411	best: 128.7033411 (750)	total: 1m 17s	remaining: 25.8s
800:	learn: 122.2006696	test: 128.4493226	best: 128.4493226 (800)	total: 1m 23s	remaining: 20.6s
850:	learn: 121.6771602	test: 128.2091398	best: 128.2091398 (850)	total: 1m 28s	remaining: 15.5s
900:	learn: 121.1390890	test: 128.0248069	best: 128.0248069 (900)	total: 1m 33s	remaining: 10.3s
950:	learn: 120.7025713	test: 127.9012894	best: 127.9012894 (950)	total: 1m 39s	remaining: 5.1s
999:	learn: 120.2878209	test: 127.7752498	best: 127.7720994 (998)	total: 1m 44s	remaining: 0us

bestTest = 127.7720994
bestIteration = 998

Shrink model to first 999 iterations.


<catboost.core.CatBoostRegressor at 0x12d697890>

In [30]:
# Evaluate model
y_pred = cat_reg_tuned.predict(X_val)

reg_metrics_df = add_reg_metrics(reg_metrics_df, y_pred, y_val, 'cat_reg_tuned')
reg_metrics_df

| | model | MAE | RMSE | R² |
|---|---|---|---|---|
| 0 | xg_reg_base | 72.400162 | 135.134219 | 0.248317 |
| 1 | lgb_reg_base | 75.631386 | 141.595168 | 0.17472 |
| 2 | cat_reg_base | 75.327649 | 141.898893 | 0.171176 |
| 3 | xg_reg_tuned | 69.754902 | 131.039707 | 0.293178 |
| 4 | lgb_reg_tuned | 69.231826 | 129.303118 | 0.311788 |
| 5 | cat_reg_tuned | 68.330397 | 127.772099 | 0.327989 |


In [31]:
# Get model with lowest RMSE
reg_metrics_df.nsmallest(n=1, columns='RMSE')

| | model | MAE | RMSE | R² |
|---|---|---|---|---|
| 5 | cat_reg_tuned | 68.330397 | 127.772099 | 0.327989 |


The best model is the tuned CatBoost model: it has the lowest validation RMSE (127.77) as well as the best MAE and R². This is the model that will be used for the rest of the analysis.
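
To see how much tuning helped each library, the relative RMSE reduction can be computed from the values in the metrics table above. The dictionary layout and variable names are just for this sketch.

```python
# Validation RMSE values copied from the metrics table above
rmse = {
    'xg': {'base': 135.134219, 'tuned': 131.039707},
    'lgb': {'base': 141.595168, 'tuned': 129.303118},
    'cat': {'base': 141.898893, 'tuned': 127.772099},
}

# Percentage drop in RMSE from the base model to the tuned model
improvement_pct = {
    name: 100 * (scores['base'] - scores['tuned']) / scores['base']
    for name, scores in rmse.items()
}
best = min(rmse, key=lambda name: rmse[name]['tuned'])

for name, pct in improvement_pct.items():
    print(f'{name}: {pct:.2f}% lower RMSE after tuning')
print(f'lowest tuned RMSE: {best}')
```

LightGBM and CatBoost gain the most, which is consistent with their base runs being cut off at 100 rounds with a learning rate of 0.05.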

### Residual Analysis

In [None]:
# Get predictions
best_model = cat_reg_tuned
y_pred = best_model.predict(X_val)

In [None]:
# Plot residuals
y_true = y_val
model_name = 'cat_reg_tuned'  # used in the output file name
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(15, 7))

# Predicted vs. actual values
ax1.scatter(x=y_pred, y=y_true)
ax1.set_title('Predicted vs. Actual values')
ax1.set_xlabel('Predicted delay (seconds)')
ax1.set_ylabel('Actual delay (seconds)')
ax1.grid(True)

# Residuals
residuals = y_true - y_pred
ax2.scatter(x=y_pred, y=residuals)
ax2.set_title('Residual Plot')
ax2.set_xlabel('Predicted Delay (seconds)')
ax2.set_ylabel('Residuals (seconds)')
ax2.axhline(0, linestyle='--', color='orange')
ax2.grid(True)

fig.suptitle('Residual Analysis', fontsize=18)
fig.tight_layout()
fig.savefig(f'../images/residual_analysis_{model_name}.png', bbox_inches='tight')
plt.show()
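
A few summary statistics can complement the two plots. The sketch below reports the mean residual (bias), the median, the 5th and 95th percentiles, and the share of predictions within a tolerance. `residual_summary`, the 60-second tolerance and the synthetic arrays are assumptions for illustration; in the notebook the inputs would be the validation target and the model's predictions.

```python
import numpy as np

def residual_summary(y_true, y_pred, tolerance=60.0):
    """Summarise residuals (actual minus predicted), in the target's units."""
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    return {
        'mean': float(residuals.mean()),               # systematic bias
        'median': float(np.median(residuals)),
        'p05': float(np.percentile(residuals, 5)),
        'p95': float(np.percentile(residuals, 95)),
        'share_within_tolerance': float((np.abs(residuals) <= tolerance).mean()),
    }

# Synthetic stand-ins for actual and predicted delays, in seconds
rng = np.random.default_rng(42)
y_true_demo = rng.normal(loc=50.0, scale=150.0, size=1000)
y_pred_demo = y_true_demo + rng.normal(loc=0.0, scale=70.0, size=1000)

summary = residual_summary(y_true_demo, y_pred_demo)
for name, value in summary.items():
    print(f'{name}: {value:.3f}')
```

A mean residual far from zero would indicate that the model systematically over- or under-predicts delays, which the scatter plots alone can hide when points overlap heavily.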

## Feature Importances

### MDI

### SHAP Plots

## Feature Pruning

## Retrain Model with Best Features

## Retune Parameters

## Final Model

### Evaluate with Test Set

### Make Prediction

In [33]:
# Save best model
joblib.dump(best_model, 'best_catboost_model.pkl')

NameError: name 'best_model' is not defined

## End