## Classification: Random Forest

Because the regression models did not clearly show which variables affect streetcar delay times, we next try Random Forest classification. In early exploratory tests, k-NN classification performed poorly on this type of data. A random forest should cope better with noisy, mixed-type data that has many independent variables.

In [220]:
# Import required libraries
import numpy as np
import pandas as pd
import glob
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn import set_config

In [221]:
# Import weather data
# Get list of all weather csv files in the specified directory
weather_csv_files = glob.glob("weather_data/*.csv")

# Define the columns to use from the weather data
columns_to_use = ["Date/Time (LST)", "Temp (°C)", "Precip. Amount (mm)"]

# Read and concatenate all weather data into a single DataFrame
weather_data = pd.concat(
    [pd.read_csv(file, usecols=columns_to_use) for file in weather_csv_files],
    ignore_index=True
)
# Convert 'Date/Time (LST)' to datetime format
weather_data["Date/Time (LST)"] = pd.to_datetime(weather_data["Date/Time (LST)"])

# Display the first few rows of the weather data
weather_data.head()

Unnamed: 0,Date/Time (LST),Temp (°C),Precip. Amount (mm)
0,2024-01-01 00:00:00,-0.7,0.2
1,2024-01-01 01:00:00,-0.9,0.3
2,2024-01-01 02:00:00,-1.5,0.1
3,2024-01-01 03:00:00,-1.5,0.1
4,2024-01-01 04:00:00,-1.8,0.0


In [222]:
# Extract year, month, day, hour from the datetime column
weather_data["year"] = weather_data["Date/Time (LST)"].dt.year
weather_data["month"] = weather_data["Date/Time (LST)"].dt.month
weather_data["day"] = weather_data["Date/Time (LST)"].dt.day
weather_data["hour"] = weather_data["Date/Time (LST)"].dt.hour

weather_data.head()

Unnamed: 0,Date/Time (LST),Temp (°C),Precip. Amount (mm),year,month,day,hour
0,2024-01-01 00:00:00,-0.7,0.2,2024,1,1,0
1,2024-01-01 01:00:00,-0.9,0.3,2024,1,1,1
2,2024-01-01 02:00:00,-1.5,0.1,2024,1,1,2
3,2024-01-01 03:00:00,-1.5,0.1,2024,1,1,3
4,2024-01-01 04:00:00,-1.8,0.0,2024,1,1,4


In [223]:
# Load ttc streetcar delay dataset
ttc_data = pd.read_csv("ttc-streetcar-delay-data-2024_cleaned.csv")

In [224]:
# Prepare for merging: rename the TTC columns to match the weather data's key columns
ttc_data = ttc_data.rename(columns={"Year": "year", "Month": "month", "Day of Month": "day", "Hour of Day": "hour"})
ttc_data.head()

Unnamed: 0,Date,Line,Time,Day,Location,Incident,Min Delay,Min Gap,Bound,Vehicle,month,Week,day,hour,Season
0,28-Apr-24,301.0,03:54,Sunday,WOLSELEY LOOP,Cleaning - Unsanitary,30.0,60.0,E,8118.0,4.0,18.0,28.0,3.0,Spring
1,14-Sep-24,301.0,02:23,Saturday,WOLSELEY LOOP,Utilized Off Route,30.0,60.0,W,8112.0,9.0,37.0,14.0,2.0,Summer
2,28-Jan-24,301.0,02:11,Sunday,WARDEN AND COMSTOCK,Mechanical,10.0,20.0,W,8734.0,1.0,5.0,28.0,2.0,Winter
3,15-Nov-24,301.0,02:35,Friday,THE QUEENSWAY AND WIND,Security,24.0,39.0,W,4588.0,11.0,46.0,15.0,2.0,Fall
4,25-Aug-24,301.0,03:11,Sunday,THE QUEENSWAY AND GLEN,Mechanical,10.0,30.0,E,4569.0,8.0,35.0,25.0,3.0,Summer


In [225]:
# Merge data frames and drop unnecessary columns
ttc_weather = pd.merge(ttc_data, weather_data, on=["month", "day", "hour"], how="left")
ttc_weather.drop(columns=["Date/Time (LST)", "year"], inplace=True)
ttc_weather

Unnamed: 0,Date,Line,Time,Day,Location,Incident,Min Delay,Min Gap,Bound,Vehicle,month,Week,day,hour,Season,Temp (°C),Precip. Amount (mm)
0,28-Apr-24,301.0,03:54,Sunday,WOLSELEY LOOP,Cleaning - Unsanitary,30.0,60.0,E,8118.0,4.0,18.0,28.0,3.0,Spring,13.4,0.0
1,14-Sep-24,301.0,02:23,Saturday,WOLSELEY LOOP,Utilized Off Route,30.0,60.0,W,8112.0,9.0,37.0,14.0,2.0,Summer,19.2,0.0
2,28-Jan-24,301.0,02:11,Sunday,WARDEN AND COMSTOCK,Mechanical,10.0,20.0,W,8734.0,1.0,5.0,28.0,2.0,Winter,1.8,0.0
3,15-Nov-24,301.0,02:35,Friday,THE QUEENSWAY AND WIND,Security,24.0,39.0,W,4588.0,11.0,46.0,15.0,2.0,Fall,6.9,0.0
4,25-Aug-24,301.0,03:11,Sunday,THE QUEENSWAY AND GLEN,Mechanical,10.0,30.0,E,4569.0,8.0,35.0,25.0,3.0,Summer,18.5,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13920,18-Feb-24,512.0,19:50,Sunday,AND KEELE,Operations,8.0,15.0,E,8337.0,2.0,8.0,18.0,19.0,Winter,-0.5,0.0
13921,19-Apr-24,512.0,20:24,Friday,AND GLENHOLME,Mechanical,10.0,20.0,W,8799.0,4.0,16.0,19.0,20.0,Spring,11.1,0.0
13922,20-Feb-24,512.0,16:20,Tuesday,AND EARLS COU,Collision - TTC Involved,4.0,8.0,E,8295.0,2.0,8.0,20.0,16.0,Winter,2.5,0.0
13923,13-May-24,512.0,16:17,Monday,AND DEER PARK,Operations,5.0,10.0,E,8860.0,5.0,20.0,13.0,16.0,Spring,20.7,0.0
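
The left merge above joins on month, day, and hour only, which is safe while the weather table holds exactly one reading per key. If a second year of weather files were ever added, each delay row would silently match more than one reading. A minimal sketch of guarding against that, using toy frames (`delays` and `weather` are illustrative stand-ins, not the real data) and the `validate` and `indicator` parameters of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins for ttc_data and weather_data (values are illustrative only)
delays = pd.DataFrame({
    "month": [1, 1, 4],
    "day": [28, 28, 28],
    "hour": [2, 2, 3],
    "Min Delay": [10.0, 7.0, 30.0],
})
weather = pd.DataFrame({
    "month": [1, 4],
    "day": [28, 28],
    "hour": [2, 3],
    "Temp (°C)": [1.8, 13.4],
})

# validate="many_to_one" raises MergeError if a (month, day, hour) key
# appears more than once in the weather table
merged = pd.merge(
    delays,
    weather,
    on=["month", "day", "hour"],
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Rows flagged "left_only" found no matching weather reading
unmatched = (merged["_merge"] == "left_only").sum()
print(len(merged), unmatched)
```

With a valid many-to-one merge, the merged frame has the same number of rows as the delay table, which is a quick sanity check worth running on the real data.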


In [226]:
# Calculate and print Q1, Median, Q3, and Mean of "Min Delay" in the original dataset
# These values will determine how to classify minute delays into 'minor' and 'moderate' delay categories
Median = ttc_weather["Min Delay"].median()
Q1 = ttc_weather["Min Delay"].quantile(0.25)
Q3 = ttc_weather["Min Delay"].quantile(0.75)
Mean = ttc_weather["Min Delay"].mean()
print(Q1, Median, Q3, Mean)

5.0 10.0 13.0 16.172148807813848


We will classify delays as follows:
- Minor -> greater than 0 and less than the median (under 10 min)
- Moderate -> from the median up to Q3 + 1.5 × IQR (10-25 min, inclusive)

Not included in our model:
- No Delay -> delay time = 0. We're interested in delays only.
- Major -> greater than Q3 + 1.5 × IQR (> 25 min). These delays are outliers in the dataset and are likely hard to predict.
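
The 25-minute cutoff follows directly from the quartiles printed above. A short worked check, with the Q1 and Q3 values copied from that output (the variable names here are illustrative):

```python
# Quartiles printed by the cell above
q1, q3 = 5.0, 13.0

iqr = q3 - q1                 # interquartile range: 8.0
upper_fence = q3 + 1.5 * iqr  # upper outlier fence: 25.0

print(iqr, upper_fence)
```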

In [227]:
# Drop locations that appear fewer than 5 times in the data frame,
# as these are likely to be typos or rare events
location_counts = ttc_weather['Location'].value_counts()
locations_to_keep = location_counts[location_counts >= 5].index

# Filter the dataframe to keep only locations that appear 5 or more times
ttc_weather = ttc_weather[ttc_weather['Location'].isin(locations_to_keep)]
ttc_weather

Unnamed: 0,Date,Line,Time,Day,Location,Incident,Min Delay,Min Gap,Bound,Vehicle,month,Week,day,hour,Season,Temp (°C),Precip. Amount (mm)
0,28-Apr-24,301.0,03:54,Sunday,WOLSELEY LOOP,Cleaning - Unsanitary,30.0,60.0,E,8118.0,4.0,18.0,28.0,3.0,Spring,13.4,0.0
1,14-Sep-24,301.0,02:23,Saturday,WOLSELEY LOOP,Utilized Off Route,30.0,60.0,W,8112.0,9.0,37.0,14.0,2.0,Summer,19.2,0.0
3,15-Nov-24,301.0,02:35,Friday,THE QUEENSWAY AND WIND,Security,24.0,39.0,W,4588.0,11.0,46.0,15.0,2.0,Fall,6.9,0.0
4,25-Aug-24,301.0,03:11,Sunday,THE QUEENSWAY AND GLEN,Mechanical,10.0,30.0,E,4569.0,8.0,35.0,25.0,3.0,Summer,18.5,0.0
5,15-May-24,301.0,02:42,Wednesday,SUNNYSIDE LOOP,Security,0.0,0.0,,4589.0,5.0,20.0,15.0,2.0,Spring,14.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13892,25-Nov-24,512.0,15:38,Monday,BATHURST AND ST CLAIR,Emergency Services,11.0,19.0,W,4521.0,11.0,48.0,25.0,15.0,Fall,5.2,0.0
13893,12-Dec-24,512.0,06:25,Thursday,BATHURST AND ST CLAIR,Held By,19.0,28.0,E,4403.0,12.0,50.0,12.0,6.0,Fall,-3.4,0.0
13894,31-Dec-24,512.0,07:16,Tuesday,BATHURST AND ST CLAIR,Rail/Switches,7.0,15.0,N,4454.0,12.0,53.0,31.0,7.0,Winter,0.9,0.0
13895,21-Nov-24,512.0,18:44,Thursday,BATHURST AND ST CLAIR,Cleaning - Unsanitary,8.0,16.0,W,4436.0,11.0,47.0,21.0,18.0,Fall,5.2,0.0


In [228]:
# Define function to classify delays as 'minor' or 'moderate'
def classify_delay(min_delay):
    if 0 < min_delay < 10:
        return "Minor"
    elif 10 <= min_delay <= 25:
        return "Moderate"
    else:
        return None  # Exclude other cases

# Work on an explicit copy so the column assignment below
# does not trigger pandas' SettingWithCopyWarning
ttc_weather = ttc_weather.copy()

# Label each delay, then keep only the minor/moderate rows
ttc_weather["Delay Class"] = ttc_weather["Min Delay"].apply(classify_delay)
ttc_classify = ttc_weather[ttc_weather["Delay Class"].notna()]
ttc_classify

Unnamed: 0,Date,Line,Time,Day,Location,Incident,Min Delay,Min Gap,Bound,Vehicle,month,Week,day,hour,Season,Temp (°C),Precip. Amount (mm),Delay Class
3,15-Nov-24,301.0,02:35,Friday,THE QUEENSWAY AND WIND,Security,24.0,39.0,W,4588.0,11.0,46.0,15.0,2.0,Fall,6.9,0.0,Moderate
4,25-Aug-24,301.0,03:11,Sunday,THE QUEENSWAY AND GLEN,Mechanical,10.0,30.0,E,4569.0,8.0,35.0,25.0,3.0,Summer,18.5,0.0,Moderate
6,3-Jul-24,301.0,03:49,Wednesday,SUNNYSIDE LOOP,Emergency Services,20.0,40.0,W,4491.0,7.0,27.0,3.0,3.0,Summer,20.8,0.0,Moderate
10,3-Oct-24,301.0,02:39,Thursday,SPADINA AND KING,Overhead,22.0,42.0,,0.0,10.0,40.0,3.0,2.0,Fall,11.2,0.0,Moderate
11,27-Oct-24,301.0,03:00,Sunday,SPADINA AND KING,Operations,10.0,20.0,S,4518.0,10.0,44.0,27.0,3.0,Fall,2.5,0.0,Moderate
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13892,25-Nov-24,512.0,15:38,Monday,BATHURST AND ST CLAIR,Emergency Services,11.0,19.0,W,4521.0,11.0,48.0,25.0,15.0,Fall,5.2,0.0,Moderate
13893,12-Dec-24,512.0,06:25,Thursday,BATHURST AND ST CLAIR,Held By,19.0,28.0,E,4403.0,12.0,50.0,12.0,6.0,Fall,-3.4,0.0,Moderate
13894,31-Dec-24,512.0,07:16,Tuesday,BATHURST AND ST CLAIR,Rail/Switches,7.0,15.0,N,4454.0,12.0,53.0,31.0,7.0,Winter,0.9,0.0,Minor
13895,21-Nov-24,512.0,18:44,Thursday,BATHURST AND ST CLAIR,Cleaning - Unsanitary,8.0,16.0,W,4436.0,11.0,47.0,21.0,18.0,Fall,5.2,0.0,Minor
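
The same labeling rule can also be written with boolean masks instead of a row-by-row `apply`. This is only a sketch on made-up delay values (`delays`, `labels`, and `kept` are hypothetical names), chosen to exercise each boundary of the rule:

```python
import pandas as pd

# Illustrative delay values in minutes, chosen to hit every boundary
delays = pd.Series([0.0, 3.0, 9.0, 10.0, 25.0, 26.0, 60.0], name="Min Delay")

# Start with every row unlabeled, then fill in the two classes by mask
labels = pd.Series(None, index=delays.index, dtype="object", name="Delay Class")
labels[(delays > 0) & (delays < 10)] = "Minor"
labels[(delays >= 10) & (delays <= 25)] = "Moderate"

# Unlabeled rows (no delay, or above 25 minutes) are dropped
kept = labels[labels.notna()]
print(kept.tolist())
```

Only the rows with delays of 3, 9, 10, and 25 minutes survive the filter, matching the behavior of `classify_delay`.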


In [229]:
# Define X without modifying the original data frame
# Drop columns that exploratory analysis suggested would not be useful for classification
X = ttc_classify.drop(columns=["Delay Class", "Min Delay", "Min Gap", "Date", "Time", "Bound", "Vehicle", "Season", "Day"])

# One-hot encode categorical columns
X_encoded = pd.get_dummies(X, columns=["Line", "Location", "Incident"], drop_first=True)
X_encoded = X_encoded.astype(float)

# Encode target (Delay Class) from the original data frame
le = LabelEncoder()
y_encoded = le.fit_transform(ttc_classify["Delay Class"])
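
Here `pd.get_dummies` runs on the full data frame before the train/test split, so the set of dummy columns is derived from all rows. One possible alternative, sketched below on toy data, is to put scikit-learn's `OneHotEncoder` inside a `Pipeline` so the categories are learned from the training rows only and unseen categories are ignored at prediction time. The frame `X_toy`, the labels `y_toy`, and the pipeline name `clf` are assumptions for illustration, not part of the analysis above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame with the same kinds of columns as X (values are illustrative)
X_toy = pd.DataFrame({
    "Line": [501.0, 504.0, 501.0, 512.0, 504.0, 512.0],
    "Incident": ["Mechanical", "Security", "Operations",
                 "Mechanical", "Security", "Operations"],
    "hour": [2.0, 8.0, 17.0, 3.0, 9.0, 18.0],
})
y_toy = ["Minor", "Moderate", "Minor", "Moderate", "Minor", "Moderate"]

# One-hot encode the categorical columns; pass numeric columns through unchanged
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Line", "Incident"]),
    ],
    remainder="passthrough",
)

clf = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=10, random_state=4)),
])
clf.fit(X_toy, y_toy)

# A category never seen during fit is encoded as all zeros rather than raising
X_new = pd.DataFrame({"Line": [999.0], "Incident": ["Diversion"], "hour": [12.0]})
pred = clf.predict(X_new)
print(pred)
```

A pipeline like this can also be passed directly to `GridSearchCV`, so the encoding is refit inside each cross-validation fold.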

In [230]:
# Split data into training & testing data and fit model
# Use Random Forest classification
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, stratify=y_encoded, test_size=0.2, random_state=4)
model = RandomForestClassifier(class_weight='balanced', random_state=4)
model.fit(X_train, y_train)

In [231]:
# Test model and produce classification report
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=le.classes_))

              precision    recall  f1-score   support

       Minor       0.83      0.53      0.64       523
    Moderate       0.83      0.95      0.88      1224

    accuracy                           0.83      1747
   macro avg       0.83      0.74      0.76      1747
weighted avg       0.83      0.83      0.81      1747
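
To put the 0.83 accuracy in context, it helps to compare it with a classifier that always predicts the larger class. A quick calculation from the support column of the report above (the dictionary `support` is just a convenient container for those two numbers):

```python
# Test-set support taken from the classification report above
support = {"Minor": 523, "Moderate": 1224}

total = sum(support.values())
majority_baseline = max(support.values()) / total

print(total, round(majority_baseline, 3))  # 1747 0.701
```

Always predicting "Moderate" would already score about 0.70 accuracy, so the per-class recall (0.53 for Minor) is a more informative measure of how well the model separates the two classes.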



In [232]:
# Determine which variables are the most important for classification
feature_importance = pd.Series(model.feature_importances_, index=X_encoded.columns)
print(feature_importance.sort_values(ascending=False).head(30))

hour                                 0.103263
Temp (°C)                            0.095667
Week                                 0.086609
day                                  0.078120
month                                0.059825
Line_512.0                           0.035594
Line_510.0                           0.028461
Line_506.0                           0.018817
Line_505.0                           0.018792
Line_503.0                           0.017879
Precip. Amount (mm)                  0.014960
Incident_Mechanical                  0.012889
Incident_Operations                  0.012127
Line_504.0                           0.011517
Incident_Diversion                   0.010796
Line_501.0                           0.010731
Incident_Emergency Services          0.010513
Incident_Held By                     0.010341
Incident_Security                    0.010063
Incident_General Delay               0.008784
Location_LESLIE BARNS                0.008100
Incident_Collision - TTC Involved 
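
Because `Line`, `Location`, and `Incident` were one-hot encoded, their importance is spread across many dummy columns, which makes them look weaker than the numeric variables in a ranking like the one above. One way to compare like with like is to sum the importances per source column. The sketch below uses only a handful of values copied from the output above, so its totals are illustrative rather than the real per-variable sums; `source_column` is a hypothetical helper.

```python
import pandas as pd

# A few importances copied from the output above (the rest are omitted,
# so the sums below are illustrative only)
importance = pd.Series({
    "hour": 0.103263,
    "Temp (°C)": 0.095667,
    "Line_512.0": 0.035594,
    "Line_510.0": 0.028461,
    "Incident_Mechanical": 0.012889,
    "Incident_Operations": 0.012127,
    "Location_LESLIE BARNS": 0.008100,
})

encoded_prefixes = ("Line", "Location", "Incident")

def source_column(name):
    """Map a dummy column such as 'Line_512.0' back to 'Line'."""
    prefix = name.split("_", 1)[0]
    return prefix if prefix in encoded_prefixes else name

# Group the index labels by their source column and sum the importances
grouped = importance.groupby(source_column).sum().sort_values(ascending=False)
print(grouped)
```

On the real model, the same grouping could be applied to the full `feature_importance` series.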

In [233]:
# Cross-validation and hyperparameter tuning
# Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'class_weight': [None, 'balanced']
}

In [234]:
# Set up grid search with 5-fold cross-validation
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=4),
    param_grid,
    cv=5,
    scoring='f1_weighted'
)

In [235]:
grid_search.fit(X_train, y_train)

In [238]:
best_model = grid_search.best_estimator_
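
The grid above has 3 × 3 × 3 × 2 = 54 parameter combinations, so with `cv=5` the search fits 270 models before refitting the best one. After fitting, `best_params_` and `best_score_` show which combination won and its mean cross-validated score. A small self-contained sketch on synthetic data (the dataset, the reduced grid, and the names `X_demo`, `y_demo`, and `search` are assumptions chosen to keep the example fast):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the real features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=4)

search = GridSearchCV(
    RandomForestClassifier(random_state=4),
    {"n_estimators": [10, 20], "max_depth": [None, 5]},
    cv=3,
    scoring="f1_weighted",
)
search.fit(X_demo, y_demo)

# Winning parameter combination and its mean cross-validated weighted F1
print(search.best_params_)
print(round(search.best_score_, 3))
```

Printing `grid_search.best_params_` in the notebook would show which settings `best_model` actually uses.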

In [239]:
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=le.classes_))

              precision    recall  f1-score   support

       Minor       0.77      0.57      0.65       523
    Moderate       0.83      0.93      0.88      1224

    accuracy                           0.82      1747
   macro avg       0.80      0.75      0.77      1747
weighted avg       0.82      0.82      0.81      1747



In [240]:
# Determine which variables are the most important for classification in the tuned model
feature_importance = pd.Series(best_model.feature_importances_, index=X_encoded.columns)
print(feature_importance.sort_values(ascending=False).head(30))