# TTC Subway Delay Prediction — Model Training

**Author:** ...
**Date:** February 2026  
**Input:** `cleaned_ttc_delay_data.csv` (from preprocessing notebook)  
**Objective:** Train a regression model to predict subway delay duration in minutes

---

## 1. Setup & Data Loading

Import required libraries and load the preprocessed dataset.

**Libraries needed:**
- `pandas`, `numpy` — data manipulation
- `scikit-learn` — model training, evaluation, preprocessing
- `matplotlib`, `seaborn` — visualization
- `lightgbm` — gradient boosting model (imported below); `xgboost` is an optional alternative

In [1]:
# --- Libraries ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, TimeSeriesSplit, RandomizedSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, root_mean_squared_error, r2_score

from lightgbm import LGBMRegressor

import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.dpi'] = 100

In [2]:
# --- Load Data ---
df = pd.read_csv('../data/processing/cleaned_ttc_delay_data.csv', parse_dates=['Date'])

print(f"Shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nTarget summary (min_delay_capped):\n{df['min_delay_capped'].describe()}")
print(f"\nZero-delay ratio: {(df['min_delay_capped'] == 0).mean():.1%}")
df.head()


Shape: (52059, 14)

Column types:
Date                        datetime64[us]
Line                                   str
Station                                str
Code                                   str
hour                                 int64
day_of_week                          int64
is_weekend                           int64
month                                int64
week                                 int64
year                                 int64
min_delay_capped                     int64
route_avg_delay                    float64
route_hour_avg_delay               float64
route_day_hour_avg_delay           float64
dtype: object

Target summary (min_delay_capped):
count    52059.000000
mean         2.600165
std          5.740640
min          0.000000
25%          0.000000
50%          0.000000
75%          4.000000
max         60.000000
Name: min_delay_capped, dtype: float64

Zero-delay ratio: 64.5%


Unnamed: 0,Date,Line,Station,Code,hour,day_of_week,is_weekend,month,week,year,min_delay_capped,route_avg_delay,route_hour_avg_delay,route_day_hour_avg_delay
0,2024-01-01,Line 1,SHEPPARD STATION,MUI,2,0,0,1,1,2024,0,2.860133,0.623867,0.630631
1,2024-01-01,Line 1,DUNDAS STATION,MUIS,2,0,0,1,1,2024,0,2.860133,0.623867,0.630631
2,2024-01-01,Line 1,DUNDAS STATION,MUPAA,2,0,0,1,1,2024,4,2.860133,0.623867,0.630631
3,2024-01-01,Line 2,KENNEDY BD STATION,PUTDN,2,0,0,1,1,2024,10,2.352354,0.473251,1.017857
4,2024-01-01,Line 1,BLOOR STATION,MUPAA,2,0,0,1,1,2024,4,2.860133,0.623867,0.630631


---
## 2. Feature Preparation

Prepare the feature matrix (`X`) and target vector (`y`) for model training.

### 2.1 Define Features and Target

| Column | Role | Notes |
|--------|------|-------|
| `min_delay_capped` | **Target** | Delay in minutes (0–60) |
| `Line`, `Station`, `Code` | Categorical features | Require encoding |
| `hour`, `day_of_week`, `is_weekend` | Numeric features | Ready to use |
| `month`, `week`, `year` | Numeric features | Ready to use |
| `route_avg_delay`, `route_hour_avg_delay`, `route_day_hour_avg_delay` | Numeric features | Historical averages; if derived from the target, compute them from training data only to avoid leakage |
| `Date` | **Drop** | Not a model feature — used only for train/test split |

In [3]:
target = 'min_delay_capped'
y = df[target]

X = df.drop(columns=['Date', target])

cat_cols = ['Line', 'Station', 'Code']
num_cols = [c for c in X.columns if c not in cat_cols]

print(f"Target: {target}")
print(f"\nCategorical features ({len(cat_cols)}): {cat_cols}")
for col in cat_cols:
    print(f"  {col}: {X[col].nunique()} unique values")
print(f"\nNumeric features ({len(num_cols)}): {num_cols}")
print(f"\nX shape: {X.shape}")
X.head()


Target: min_delay_capped

Categorical features (3): ['Line', 'Station', 'Code']
  Line: 4 unique values
  Station: 731 unique values
  Code: 130 unique values

Numeric features (9): ['hour', 'day_of_week', 'is_weekend', 'month', 'week', 'year', 'route_avg_delay', 'route_hour_avg_delay', 'route_day_hour_avg_delay']

X shape: (52059, 12)


Unnamed: 0,Line,Station,Code,hour,day_of_week,is_weekend,month,week,year,route_avg_delay,route_hour_avg_delay,route_day_hour_avg_delay
0,Line 1,SHEPPARD STATION,MUI,2,0,0,1,1,2024,2.860133,0.623867,0.630631
1,Line 1,DUNDAS STATION,MUIS,2,0,0,1,1,2024,2.860133,0.623867,0.630631
2,Line 1,DUNDAS STATION,MUPAA,2,0,0,1,1,2024,2.860133,0.623867,0.630631
3,Line 2,KENNEDY BD STATION,PUTDN,2,0,0,1,1,2024,2.352354,0.473251,1.017857
4,Line 1,BLOOR STATION,MUPAA,2,0,0,1,1,2024,2.860133,0.623867,0.630631


### 2.2 Encode Categorical Features

String columns cannot be used directly in ML models. Encoding strategy:

| Column | Unique Values | Encoding Method |
|--------|--------------|----------------|
| `Line` | 4 | One-hot encoding — few categories, no ordinality |
| `Station` | 731 | Target encoding — too many for one-hot, encodes mean delay per station |
| `Code` | 130 | Target encoding — same rationale as Station |

**Important:** Target encoding must be fit on training data only, then applied to test data. Fitting on the full dataset causes data leakage.
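
The fit-on-train-only rule above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual encoder: the helper names `fit_target_encoding` and `apply_target_encoding`, the `smoothing` parameter, and the `Station_te` column are all hypothetical.

```python
import pandas as pd


def fit_target_encoding(train_df, col, target, smoothing=10.0):
    """Learn smoothed per-category target means from the training rows only."""
    global_mean = train_df[target].mean()
    stats = train_df.groupby(col)[target].agg(["mean", "count"])
    # Shrink categories with few rows toward the global mean.
    encoded = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return encoded.to_dict(), global_mean


def apply_target_encoding(df, col, mapping, global_mean):
    """Map categories to learned means; unseen categories get the global mean."""
    return df[col].map(mapping).fillna(global_mean)


# Toy data standing in for the real training and test frames (assumption).
train = pd.DataFrame({
    "Station": ["A", "A", "B", "B", "B", "C"],
    "min_delay_capped": [0, 4, 10, 0, 2, 6],
})
test = pd.DataFrame({"Station": ["A", "C", "Z"]})  # "Z" never appears in training

mapping, global_mean = fit_target_encoding(
    train, "Station", "min_delay_capped", smoothing=2.0
)
train["Station_te"] = apply_target_encoding(train, "Station", mapping, global_mean)
test["Station_te"] = apply_target_encoding(test, "Station", mapping, global_mean)
print(test)
```

The mapping is learned once from `train` and reused unchanged on `test`, so no test-set targets influence the encoding. scikit-learn (1.3 and later) also ships `sklearn.preprocessing.TargetEncoder`, which adds internal cross-fitting on the training set and may be a better fit for the real pipeline.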

---
## 3. Train/Test Split

Split the data into training and test sets.

**Considerations:**
- Use **time-based split** (e.g., train on 2024, test on 2025) to simulate real-world deployment — the model should predict future delays, not past ones.
- Alternative: random 80/20 split with `train_test_split` if temporal generalization is not a concern.
- Ensure encoding (Section 2.2) is fit on training data only.
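
A time-based split can be as simple as a date cutoff. The sketch below uses a toy frame; the cutoff date is a hypothetical value and should be chosen to match the actual date range of the data.

```python
import pandas as pd

# Toy frame standing in for `df`; only the `Date` column matters for the split.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2024-03-01", "2024-11-15", "2025-01-02", "2025-06-30"]),
    "min_delay_capped": [0, 5, 3, 0],
})

cutoff = pd.Timestamp("2025-01-01")  # hypothetical cutoff (assumption)

train_mask = toy["Date"] < cutoff
train_df, test_df = toy[train_mask], toy[~train_mask]

# Sanity check: every training row precedes every test row.
assert train_df["Date"].max() < test_df["Date"].min()
print(len(train_df), len(test_df))
```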

---
## 4. Baseline Model

Establish a baseline to benchmark all subsequent models against.

**Baseline strategies:**
- **Mean predictor:** Always predict the training-set mean delay (~2.6 min). Among constant predictions, this minimizes squared error.
- **Median predictor:** Always predict 0 (the median), which reflects the zero-inflated target. Among constant predictions, this minimizes MAE.

Record baseline metrics (MAE, RMSE, R²) for comparison.
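
Both baselines are available through scikit-learn's `DummyRegressor`. The sketch below uses a small synthetic zero-inflated target (an assumption, not the real data) and computes RMSE with NumPy so it does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic zero-inflated targets standing in for the real split (assumption).
y_train = np.array([0, 0, 0, 4, 10, 0, 2, 0])
y_test = np.array([0, 0, 6, 0, 3])
# DummyRegressor ignores the features, so placeholders are enough here.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

baselines = {}
for strategy in ("mean", "median"):
    model = DummyRegressor(strategy=strategy).fit(X_train, y_train)
    pred = model.predict(X_test)
    baselines[strategy] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(np.mean((y_test - pred) ** 2))),
        "R2": r2_score(y_test, pred),
    }

for name, metrics in baselines.items():
    print(name, metrics)
```

On a target with this many zeros, the median baseline tends to have the lower MAE, so it is the harder one to beat on that metric.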

---
## 5. Model Training

Train multiple regression models and compare performance.

### 5.1 Linear Regression

A simple linear model to understand baseline linear relationships between features and delay.

**Expectations:** Likely poor performance due to zero-inflated target and non-linear patterns, but establishes a linear benchmark.

### 5.2 Random Forest Regressor

An ensemble of decision trees that handles non-linear relationships and feature interactions.

**Why Random Forest:**
- Handles mixed feature types (numeric + encoded categorical)
- Captures non-linear patterns (e.g., hour + day interactions)
- Provides feature importance rankings
- Robust to outliers and skewed distributions

**Key hyperparameters to tune:**
- `n_estimators` — number of trees (start with 100-500)
- `max_depth` — tree depth (start with None, then constrain if overfitting)
- `min_samples_leaf` — minimum samples per leaf (regularization)
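
A minimal Random Forest fit using the hyperparameters listed above might look like this. The data is synthetic (a weekday rush-hour effect invented for illustration) and the parameter values are starting points, not tuned results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 400
hour = rng.integers(0, 24, n)
is_weekend = rng.integers(0, 2, n)
X = np.column_stack([hour, is_weekend])
# Synthetic target: longer delays during weekday morning rush (an interaction).
rush = (hour >= 7) & (hour <= 9) & (is_weekend == 0)
y = np.where(rush, 8.0, 1.0) + rng.normal(0, 0.5, n)

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

rf = RandomForestRegressor(
    n_estimators=200,      # number of trees
    max_depth=None,        # unconstrained depth; lower it if the model overfits
    min_samples_leaf=5,    # regularization: minimum samples per leaf
    random_state=42,
)
rf.fit(X_train, y_train)

mae = mean_absolute_error(y_test, rf.predict(X_test))
print(f"Test MAE: {mae:.2f}")
print(dict(zip(["hour", "is_weekend"], rf.feature_importances_.round(3))))
```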

### 5.3 Gradient Boosting (XGBoost / LightGBM)

Gradient boosting builds trees sequentially, with each tree correcting the errors of the previous one.

**Why Gradient Boosting:**
- Often among the strongest performers on tabular data
- Can be adapted to zero-inflated, skewed targets (e.g., via a Tweedie objective in XGBoost or LightGBM)
- Built-in handling of missing values (XGBoost, LightGBM) and native categorical features (LightGBM)
- Fast training, with optional GPU support

**Key hyperparameters to tune:**
- `learning_rate` — step size (0.01-0.1)
- `n_estimators` — number of boosting rounds (100-1000)
- `max_depth` — tree depth (3-8)
- `subsample` — fraction of data per tree (0.7-1.0)
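
The four hyperparameters above can be tried out with a short sketch. To keep dependencies minimal, this example swaps in scikit-learn's `GradientBoostingRegressor`, which accepts the same four parameter names; it is a stand-in, not the XGBoost or LightGBM model itself. The data is synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
n = 500
X = rng.uniform(0, 1, size=(n, 3))
# Synthetic non-linear target: feature 1 only matters when feature 0 is high.
y = 5.0 * (X[:, 0] > 0.7) * X[:, 1] + rng.normal(0, 0.1, n)

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

gbr = GradientBoostingRegressor(
    learning_rate=0.05,   # step size
    n_estimators=300,     # boosting rounds
    max_depth=3,          # depth of each tree
    subsample=0.8,        # fraction of rows sampled per tree
    random_state=42,
)
gbr.fit(X_train, y_train)

mae = mean_absolute_error(y_test, gbr.predict(X_test))
print(f"Test MAE: {mae:.3f}")
```

`LGBMRegressor` accepts the same four keyword arguments through its scikit-learn wrapper, so it should be possible to swap it in with few changes.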

### 5.4 Two-Stage Model (Optional — Advanced)

Addresses the zero-inflated target distribution by splitting the problem:

**Stage 1 — Classification:** Will there be a delay? (0 vs >0)
- Train a binary classifier (e.g., XGBClassifier) on all records
- Target: `is_delayed = (min_delay_capped > 0).astype(int)`

**Stage 2 — Regression:** If yes, how many minutes?
- Train a regressor only on records where `min_delay_capped > 0` (~18,500 records)
- The non-zero distribution is much healthier for regression

**Prediction pipeline:**
1. Run Stage 1 → predict probability of delay
2. If the probability exceeds a chosen threshold (e.g., 0.5), run Stage 2 → predict delay minutes
3. Otherwise, output 0
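
The two-stage pipeline can be sketched as below. Random forests stand in for the classifier and regressor to keep the example dependency-free (the text suggests `XGBClassifier`); the 0.5 threshold and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(2)
n = 600
X = rng.uniform(0, 1, size=(n, 2))
# Synthetic zero-inflated target: most rows are 0, the rest are positive delays.
delayed = X[:, 0] > 0.65
y = np.where(delayed, 3.0 + 10.0 * X[:, 1], 0.0)

X_train, X_test = X[:450], X[450:]
y_train, y_test = y[:450], y[450:]

# Stage 1: classify delay vs. no delay on all training records.
is_delayed = (y_train > 0).astype(int)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, is_delayed)

# Stage 2: regress delay minutes on the non-zero training records only.
nonzero = y_train > 0
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train[nonzero], y_train[nonzero])

# Prediction: use the regressor only where Stage 1 predicts a delay.
threshold = 0.5  # hypothetical cutoff; tune on validation data
p_delay = clf.predict_proba(X_test)[:, 1]
y_pred = np.where(p_delay >= threshold, reg.predict(X_test), 0.0)

print(f"Two-stage MAE: {np.mean(np.abs(y_pred - y_test)):.3f}")
```

A softer variant multiplies `p_delay` by the Stage 2 prediction to get an expected delay instead of applying a hard threshold.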

---
## 6. Hyperparameter Tuning

Optimize the best-performing model from Section 5.

**Approach:**
- Use `GridSearchCV` or `RandomizedSearchCV` with cross-validation
- Use time-series aware cross-validation (`TimeSeriesSplit`) if using time-based split
- Scoring metric: negative MAE (`neg_mean_absolute_error`) — most interpretable for delay prediction

**Parameter grids** should be defined based on which model performed best in Section 5.
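
A randomized search with time-aware folds might look like the following sketch. The estimator, the candidate values, and the synthetic data are placeholders; rows are assumed to be sorted chronologically, which `TimeSeriesSplit` relies on.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

rng = np.random.default_rng(3)
n = 300
X = rng.uniform(0, 1, size=(n, 3))  # rows assumed to be in chronological order
y = 4.0 * X[:, 0] + rng.normal(0, 0.2, n)

# Hypothetical search space; adapt it to whichever model won in Section 5.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                            # number of sampled configurations
    scoring="neg_mean_absolute_error",   # higher (closer to 0) is better
    cv=TimeSeriesSplit(n_splits=3),      # each fold validates on later rows
    random_state=42,
)
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"Best CV MAE: {-search.best_score_:.3f}")
```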

---
## 7. Model Evaluation

Evaluate the final tuned model on the held-out test set.

### 7.1 Regression Metrics

| Metric | What it measures | Why it matters |
|--------|-----------------|----------------|
| **MAE** | Mean Absolute Error | Average prediction error in minutes — most interpretable |
| **RMSE** | Root Mean Squared Error | Penalizes large errors more heavily |
| **R²** | Coefficient of determination | Proportion of variance explained (at most 1; negative when the model does worse than predicting the mean) |
| **MAPE** | Mean Absolute % Error | Relative error — use only on non-zero records |

Compare all models side-by-side, including the baseline.
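
The four metrics in the table can be collected in one helper. `regression_report` is a hypothetical name, and the sample values are made up to show the calculation; MAPE is restricted to non-zero actuals, as noted in the table.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, RMSE, R2, and MAPE (non-zero actuals only) as a dict."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    nonzero = y_true != 0
    if nonzero.any():
        mape = np.mean(np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero])) * 100
    else:
        mape = float("nan")  # undefined when every actual value is zero
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "R2": r2_score(y_true, y_pred),
        "MAPE_nonzero_%": float(mape),
    }


report = regression_report(y_true=[0, 0, 4, 10], y_pred=[1, 0, 2, 8])
print(report)
```

Calling the helper once per model (including the baselines) and stacking the results in a DataFrame gives the side-by-side comparison.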

### 7.2 Prediction Analysis

Visualize model predictions to understand where it performs well and where it fails.

**Plots to create:**
- **Actual vs Predicted** scatter plot — should cluster along the diagonal
- **Residual distribution** — should be centered around zero
- **Error by Line** — does the model perform equally across all lines?
- **Error by Hour** — are certain times harder to predict?
- **Prediction distribution** — does it match the actual distribution?
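
The per-line and per-hour error breakdowns reduce to a group-by on absolute error. The frame below is a made-up stand-in for a table of test-set predictions; the column names `actual`, `predicted`, and `abs_error` are assumptions.

```python
import pandas as pd

# Hypothetical test-set results; in practice, build this from y_test and model output.
results = pd.DataFrame({
    "Line": ["Line 1", "Line 1", "Line 2", "Line 2"],
    "hour": [8, 17, 8, 17],
    "actual": [4, 0, 10, 2],
    "predicted": [3, 1, 6, 2],
})
results["abs_error"] = (results["actual"] - results["predicted"]).abs()

mae_by_line = results.groupby("Line")["abs_error"].mean()
mae_by_hour = results.groupby("hour")["abs_error"].mean()

print(mae_by_line)
print(mae_by_hour)
```

Each resulting Series can be drawn directly with `mae_by_line.plot(kind="bar")` to produce the "Error by Line" and "Error by Hour" charts.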

### 7.3 Feature Importance

Understand which features contribute most to predictions.

**Methods:**
- **Built-in importance** — tree-based models provide feature importance scores
- **Permutation importance** — measures the drop in held-out performance when a feature is shuffled; less biased toward high-cardinality features than impurity-based scores

Plot the top features ranked by importance.
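
Permutation importance is available in `sklearn.inspection`. The sketch below uses synthetic data in which only one feature carries signal, so the expected ranking is known in advance; the feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(4)
n = 400
X = rng.uniform(0, 1, size=(n, 3))
y = 6.0 * X[:, 0] + rng.normal(0, 0.1, n)  # only the first feature matters
feature_names = ["signal", "noise_a", "noise_b"]  # hypothetical names

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature on held-out data and measure how much the score drops.
result = permutation_importance(
    model, X_test, y_test,
    scoring="neg_mean_absolute_error",
    n_repeats=10,
    random_state=0,
)

ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```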

---
## 8. Model Export

Save the trained model and preprocessing artifacts for deployment.

**Files to save:**
- Trained model → `../models/delay_model.pkl` (using `joblib`)
- Encoders (target encoder for Station/Code) → `../models/encoders.pkl`
- Feature list → `../models/feature_columns.json`

**Important:** Save the encoders alongside the model — they are required to transform new data at inference time.
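
A save-and-reload round trip for the three artifacts could look like this. The model, encoder contents, and feature list are placeholders, and a temporary directory stands in for `../models` so the sketch runs anywhere.

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder artifacts (assumptions); substitute the real model and encoders.
feature_columns = ["hour", "is_weekend"]
model = LinearRegression().fit(
    np.array([[8, 0], [17, 0], [12, 1]]), np.array([4.0, 3.0, 1.0])
)
encoders = {"Station": {"EXAMPLE STATION": 0.0}, "global_mean": 0.0}  # dummy values

out_dir = Path(tempfile.mkdtemp())  # stand-in for ../models
joblib.dump(model, out_dir / "delay_model.pkl")
joblib.dump(encoders, out_dir / "encoders.pkl")
(out_dir / "feature_columns.json").write_text(json.dumps(feature_columns))

# Reload everything and confirm the model predicts the same values.
loaded_model = joblib.load(out_dir / "delay_model.pkl")
loaded_encoders = joblib.load(out_dir / "encoders.pkl")
loaded_columns = json.loads((out_dir / "feature_columns.json").read_text())

sample = np.array([[8, 0]])
print(loaded_columns, loaded_model.predict(sample))
```

Storing the feature list separately makes it possible to reorder incoming columns at inference time so they match the order used during training.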

---
## 9. Results Summary

Document the final model performance, key decisions, and next steps.

**Template:**

| Item | Value |
|------|-------|
| **Best model** | *(fill in)* |
| **Test MAE** | *(fill in)* minutes |
| **Test RMSE** | *(fill in)* minutes |
| **Test R²** | *(fill in)* |
| **Baseline MAE** | *(fill in)* minutes |
| **Improvement over baseline** | *(fill in)* % |
| **Top 3 features** | *(fill in)* |

**Limitations:**
- *(Document any limitations discovered during training)*

**Next steps:**
- *(List potential improvements: more data, additional features, deployment plan)*