# MBTA On-Time Performance Prediction

- **Goal:** Predict the daily on-time percentage (stored as a fraction between 0 and 1) of each bus route for September 2024. An arrival counts as on time when `delay_minutes <= 5`.
- **Training Data:** MBTA Bus Arrival & Departure Times for January–August 2024 (historical arrival and delay records)
- **Model:** Random Forest Regressor
- **Features:**
  - `route_cat`: integer code for each route (categorical)
  - `day_of_week`: day of the week (0 = Monday, 6 = Sunday)
  - `delay_minutes`: average delay in minutes for the route on a given day
- **Metric:** RMSE (Root Mean Squared Error), with R² also reported; we will look at other evaluation metrics later

---

After the midterm report, we plan to improve the model by:
- Experimenting with more advanced models (e.g. XGBoost, ensemble methods)
- Engineering additional features (e.g. rolling delay averages, route-level stats, external data)
- Expanding the training data range
- Conducting additional evaluation and visualization of the data
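
As an illustration of the planned rolling delay averages, the sketch below adds two lagged features per route. It assumes a daily table with the notebook's `route_id_str`, `service_date`, and `delay_minutes` columns; the helper `add_lagged_delay_features`, the new column names, and the toy numbers are hypothetical. Each feature uses only earlier service days, so it would be known before the day being predicted.

```python
import pandas as pd

def add_lagged_delay_features(daily: pd.DataFrame, window: int = 7) -> pd.DataFrame:
    """Add previous-observation and rolling-mean delay features per route.

    Assumes one row per route-day with the columns route_id_str,
    service_date, and delay_minutes.
    """
    out = daily.sort_values(['route_id_str', 'service_date']).copy()
    by_route = out.groupby('route_id_str')['delay_minutes']
    # shift(1) moves each value down one row within its route, so a row
    # only sees delays from earlier service days (not necessarily the
    # previous calendar day if a date is missing).
    out['delay_prev_day'] = by_route.shift(1)
    out[f'delay_roll{window}'] = by_route.transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    return out

# Tiny synthetic example (made-up numbers, not MBTA data)
toy = pd.DataFrame({
    'route_id_str': ['1'] * 4 + ['39'] * 3,
    'service_date': pd.to_datetime(
        ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04',
         '2024-01-01', '2024-01-02', '2024-01-03']),
    'delay_minutes': [2.0, 4.0, 6.0, 8.0, 1.0, 3.0, 5.0],
})
toy_feats = add_lagged_delay_features(toy, window=2)
print(toy_feats)
```

The first row of every route has no history, so both features are `NaN` there; those rows would need to be dropped or imputed before fitting.
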

In [17]:
output_path = "/Users/chris/Desktop/MBTA_Bus_Arrival_Departure_Times_2024/MBTA_Bus_2024_Preprocessed.csv"
df_merged.to_csv(output_path, index=False)
print(f"Preprocessed data saved to {output_path}")


Preprocessed data saved to /Users/chris/Desktop/MBTA_Bus_Arrival_Departure_Times_2024/MBTA_Bus_2024_Preprocessed.csv


In [18]:
# 1. Create a flag for on-time vs. late
df_merged['on_time_flag'] = df_merged['delay_minutes'] <= 5

# 2. Extract day of week from service_date
df_merged['day_of_week'] = df_merged['service_date'].dt.dayofweek

# 3. Aggregate by route + date (one row per route-day)
daily = df_merged.groupby(['route_id_str', 'service_date'], as_index=False).agg({
    'on_time_flag': 'mean',
    'delay_minutes': 'mean',
    'day_of_week': 'first'
})
daily.rename(columns={'on_time_flag': 'on_time_pct'}, inplace=True)

# 4. Encode route as a categorical variable
daily['route_cat'] = daily['route_id_str'].astype('category').cat.codes

# 5. Split into features (X) and target (y)
X = daily[['route_cat', 'day_of_week', 'delay_minutes']]
y = daily['on_time_pct']

# 6. Train/test split (random for demo; you could do a time-based split)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 7. Choose a model (RandomForestRegressor as example)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 8. Predict on the test set
y_pred = model.predict(X_test)

# 9. Evaluate
from sklearn.metrics import mean_squared_error, r2_score
# Take the root explicitly: the `squared` argument is deprecated since scikit-learn 1.4
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2   = r2_score(y_test, y_pred)

print("RandomForest Regressor:")
print(f"RMSE: {rmse:.3f}")
print(f"R^2:  {r2:.3f}")


RandomForest Regressor:
RMSE: 0.111
R^2:  0.565
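
An RMSE value is easier to judge next to a naive baseline. The sketch below is one possible baseline, not part of the notebook: it predicts each route-day with that route's mean `on_time_pct` in the training rows. The helper `route_mean_baseline` and the toy numbers are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

def route_mean_baseline(train: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    """Predict each test row with its route's mean on_time_pct in train.

    Routes that never appear in train fall back to the overall training mean.
    """
    route_means = train.groupby('route_id_str')['on_time_pct'].mean()
    overall_mean = train['on_time_pct'].mean()
    return test['route_id_str'].map(route_means).fillna(overall_mean)

# Made-up on-time fractions, for illustration only
train = pd.DataFrame({
    'route_id_str': ['1', '1', '39', '39'],
    'on_time_pct': [0.50, 0.70, 0.80, 0.90],
})
test = pd.DataFrame({
    'route_id_str': ['1', '39', '66'],  # route 66 is absent from train
    'on_time_pct': [0.60, 0.80, 0.70],
})

baseline_pred = route_mean_baseline(train, test)
baseline_rmse = np.sqrt(mean_squared_error(test['on_time_pct'], baseline_pred))
print(f"Baseline RMSE: {baseline_rmse:.3f}")
```

If the random forest does not clearly beat this kind of baseline on the same split, the extra features are adding little.
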




In [19]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# -----------------------------------------------------------------
# 0. Assume df_merged already loaded/cleaned:
#    - Includes columns: service_date (datetime), route_id_str, delay_minutes
#    - If not, read your CSV or do the merges first, then run:
#        df_merged['route_id_str'] = df_merged['route_id'].astype(str)
# -----------------------------------------------------------------

# 1) Define on_time_flag and year_month columns
df_merged['on_time_flag'] = df_merged['delay_minutes'] <= 5
df_merged['year_month'] = df_merged['service_date'].dt.to_period('M')  # e.g. 2024-01, 2024-02, etc.
df_merged['day_of_week'] = df_merged['service_date'].dt.dayofweek

# 2) Split: Train on Jan–Aug 2024, Predict on Sept 2024
train_mask = (df_merged['year_month'] >= '2024-01') & (df_merged['year_month'] <= '2024-08')
test_mask  = (df_merged['year_month'] == '2024-09')

df_train = df_merged[train_mask].copy()
df_test  = df_merged[test_mask].copy()

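# 3) Aggregate each subset to daily route-level
#    Route codes come from one shared, sorted list of routes so that a route
#    maps to the same integer in train and test, even if a route is missing
#    from one of the two periods.
route_categories = sorted(df_merged['route_id_str'].unique())

def aggregate_daily_route(df):
    daily = df.groupby(['route_id_str','service_date'], as_index=False).agg({
        'on_time_flag': 'mean',
        'delay_minutes': 'mean',
        'day_of_week': 'first'
    })
    daily.rename(columns={'on_time_flag':'on_time_pct'}, inplace=True)
    daily['route_cat'] = pd.Categorical(daily['route_id_str'], categories=route_categories).codes
    return daily

daily_train = aggregate_daily_route(df_train)
daily_test  = aggregate_daily_route(df_test)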

# 4) Prepare features (X) and target (y) for training
#    Here: X = route_cat, day_of_week, daily mean delay
X_train = daily_train[['route_cat','day_of_week','delay_minutes']]
y_train = daily_train['on_time_pct']

# 5) Train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 6) Predict for SEPT 2024
X_test = daily_test[['route_cat','day_of_week','delay_minutes']]
y_pred = model.predict(X_test)

# Store predictions in daily_test
daily_test['predicted_on_time_pct'] = y_pred

# 7) Optional: If you DO have real data for September, you can evaluate
if 'on_time_pct' in daily_test.columns:
    # Evaluate if actual data is present
    y_true = daily_test['on_time_pct']
    if not y_true.isna().all():
        # Take the root explicitly: the `squared` argument is deprecated since scikit-learn 1.4
        rmse = np.sqrt(mean_squared_error(y_true, y_pred))
        r2   = r2_score(y_true, y_pred)
        print("Evaluation on September 2024 (Time-Based):")
        print(f"  RMSE: {rmse:.3f}")
        print(f"  R^2:  {r2:.3f}")
    else:
        print("No ground truth for September—just predictions.")
else:
    print("No on_time_pct column in daily_test to compare—just predictions.")

# 8) Present the results
print("\n----- SAMPLES OF PREDICTIONS FOR SEPT 2024 -----")
display(daily_test[['route_id_str','service_date','predicted_on_time_pct']].head(20))

# (Optional) Group by route for a final average
route_preds = daily_test.groupby('route_id_str', as_index=False)['predicted_on_time_pct'].mean()
route_preds.sort_values('predicted_on_time_pct', ascending=False, inplace=True)

print("\n----- AVERAGE PREDICTED ON-TIME % BY ROUTE (SEPT 2024) -----")
display(route_preds.head(20))


Evaluation on September 2024 (Time-Based):
  RMSE: 0.132
  R^2:  0.307

----- SAMPLES OF PREDICTIONS FOR SEPT 2024 -----




|   | route_id_str | service_date | predicted_on_time_pct |
|---|---|---|---|
| 0 | 1 | 2024-09-01 | 0.795503 |
| 1 | 1 | 2024-09-02 | 0.486686 |
| 2 | 1 | 2024-09-03 | 0.42432 |
| 3 | 1 | 2024-09-04 | 0.428193 |
| 4 | 1 | 2024-09-05 | 0.302507 |
| 5 | 1 | 2024-09-06 | 0.331868 |
| 6 | 1 | 2024-09-07 | 0.612485 |
| 7 | 1 | 2024-09-08 | 0.532394 |
| 8 | 1 | 2024-09-09 | 0.412598 |
| 9 | 1 | 2024-09-10 | 0.391647 |



----- AVERAGE PREDICTED ON-TIME % BY ROUTE (SEPT 2024) -----


|   | route_id_str | predicted_on_time_pct |
|---|---|---|
| 37 | 194 | 0.869316 |
| 35 | 192 | 0.806786 |
| 103 | 55 | 0.782503 |
| 22 | 121 | 0.777363 |
| 119 | 69 | 0.772478 |
| 118 | 68 | 0.759508 |
| 50 | 226 | 0.751792 |
| 121 | 71 | 0.746434 |
| 132 | 85 | 0.744888 |
| 141 | 94 | 0.741892 |
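
The evaluation in this cell reports RMSE and R². As one of the other metrics mentioned in the summary, mean absolute error (MAE) could be reported alongside them. The sketch below is a minimal example; `regression_report` is a hypothetical helper and the input values are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred) -> dict:
    """Return RMSE, MAE, and R^2 for one set of predictions."""
    return {
        'rmse': float(np.sqrt(mean_squared_error(y_true, y_pred))),
        'mae': float(mean_absolute_error(y_true, y_pred)),
        'r2': float(r2_score(y_true, y_pred)),
    }

# Made-up on-time fractions, for illustration only
y_true_demo = [0.80, 0.60, 0.70, 0.90]
y_pred_demo = [0.75, 0.65, 0.70, 0.80]

report = regression_report(y_true_demo, y_pred_demo)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

In the notebook this could be called as `regression_report(daily_test['on_time_pct'], y_pred)`. MAE is less sensitive than RMSE to a few route-days with very large errors, so the two together show whether errors are broad or concentrated.
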


In [21]:
# 1) Ensure your df_merged has a datetime column 'service_date' and a route_id_str, 
#    plus an on_time_flag or on_time_pct column. 
#    For example:
#       df_merged['on_time_flag'] = df_merged['delay_minutes'] <= 5
#       df_merged['service_date'] = pd.to_datetime(df_merged['service_date'], errors='coerce')
#       df_merged['route_id_str'] = df_merged['route_id'].astype(str)

# 2) Create a 'year_month' column (period), then filter for September 2024:
df_merged['year_month'] = df_merged['service_date'].dt.to_period('M')
sept_mask = (df_merged['year_month'] == '2024-09')
df_sept = df_merged[sept_mask].copy()

# 3) Aggregate by route + date to compute daily on_time_pct
#    If you haven't already computed on_time_flag, do so:
df_sept['on_time_flag'] = df_sept['delay_minutes'] <= 5

daily_sept = df_sept.groupby(['route_id_str','service_date'], as_index=False).agg({
    'on_time_flag': 'mean'
})
daily_sept.rename(columns={'on_time_flag':'on_time_pct'}, inplace=True)

# 4) Now compute average on-time percentage by route across all of September 2024
route_actuals = (
    daily_sept
    .groupby('route_id_str', as_index=False)['on_time_pct']
    .mean()
)

route_actuals.rename(columns={'on_time_pct': 'avg_on_time_pct'}, inplace=True)
route_actuals.sort_values('avg_on_time_pct', ascending=False, inplace=True)

print("----- AVERAGE ON-TIME % BY ROUTE (SEPT 2024) -----")
display(route_actuals.head(20))


----- AVERAGE ON-TIME % BY ROUTE (SEPT 2024) -----


|   | route_id_str | avg_on_time_pct |
|---|---|---|
| 37 | 194 | 0.879167 |
| 69 | 351 | 0.846226 |
| 35 | 192 | 0.81713 |
| 22 | 121 | 0.81519 |
| 103 | 55 | 0.803509 |
| 118 | 68 | 0.772059 |
| 112 | 61 | 0.767151 |
| 50 | 226 | 0.76589 |
| 117 | 67 | 0.765365 |
| 132 | 85 | 0.75679 |
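
The predicted and actual route averages are easier to compare side by side than in two separate tables. The sketch below merges them on `route_id_str` and ranks routes by absolute error. The helper `compare_route_averages` is hypothetical; the three demo rows are copied from the two tables above, and in the notebook the full `route_preds` and `route_actuals` frames would be passed instead.

```python
import pandas as pd

def compare_route_averages(route_preds: pd.DataFrame,
                           route_actuals: pd.DataFrame) -> pd.DataFrame:
    """Join predicted and actual route averages and rank by absolute error."""
    merged = route_preds.merge(route_actuals, on='route_id_str', how='inner')
    # Negative error means the model under-predicted on-time performance
    merged['error'] = merged['predicted_on_time_pct'] - merged['avg_on_time_pct']
    merged['abs_error'] = merged['error'].abs()
    return merged.sort_values('abs_error', ascending=False).reset_index(drop=True)

# Three routes copied from the prediction and actuals tables
route_preds_demo = pd.DataFrame({
    'route_id_str': ['194', '192', '55'],
    'predicted_on_time_pct': [0.869316, 0.806786, 0.782503],
})
route_actuals_demo = pd.DataFrame({
    'route_id_str': ['194', '192', '55'],
    'avg_on_time_pct': [0.879167, 0.817130, 0.803509],
})

comparison = compare_route_averages(route_preds_demo, route_actuals_demo)
print(comparison)
```

An inner join keeps only routes present in both tables. Route 351, for example, appears in the top actuals but not in the top predictions shown, so checking its row in the full comparison would show how far off its prediction was.
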
