# STM Transit Delay Data Modeling

This notebook preprocesses STM trip-update data merged with historical weather data. Three regression models will then be explored, and the one with the best predictive performance will be selected.

## Data Description

`trip_id` unique identifier of a trip<br>
`route_id` bus line<br>
`stop_id` stop number<br>
`stop_sequence` sequence of the stop, for ordering<br>
`realtime_arrival_time` actual arrival time, as a Unix timestamp in seconds<br>
`scheduled_arrival_time` planned arrival time, as a Unix timestamp in milliseconds<br>
`realtime_departure_time` actual departure time, as a Unix timestamp in seconds<br>
`scheduled_departure_time` planned departure time, as a Unix timestamp in milliseconds<br>
`schedule_relationship` state of the trip, 0 means scheduled and 1 means skipped<br>
`temperature` air temperature at 2 meters above ground, in Celsius<br>
`precipitation` total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters<br>
`windspeed` wind speed at 10 meters above ground, in km/h<br>
`weathercode` weather condition as a numeric code, see [this table](https://open-meteo.com/en/docs#weather_variable_documentation) for details<br>
`delay` difference between actual and planned arrival time, in seconds
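
As a quick check of the `delay` definition, the sketch below recomputes it for two rows shown in the data preview of this notebook. It assumes, based on the magnitudes of those sample values, that the realtime timestamp is in seconds and the scheduled timestamp is in milliseconds. The `sample` frame and the `delay_check` column are illustrative names, not part of the dataset.

```python
import pandas as pd

# Two rows copied from the df.head() preview in this notebook.
# Assumption: realtime values are Unix seconds, scheduled values are Unix milliseconds.
sample = pd.DataFrame({
    "realtime_arrival_time": [1745384718, 1745384751],
    "scheduled_arrival_time": [1745384708000, 1745384751000],
})

# Bring both timestamps to seconds before subtracting.
scheduled_s = sample["scheduled_arrival_time"] / 1000
sample["delay_check"] = sample["realtime_arrival_time"] - scheduled_s

print(sample["delay_check"].tolist())  # [10.0, 0.0]
```

These values agree with the `delay` column of the same two rows in the preview, which suggests that a positive delay means the bus arrived later than planned.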

## Imports

In [24]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

## Data Preprocessing

In [5]:
local_timezone = 'Canada/Eastern'

In [2]:
# Load data
df = pd.read_csv('data/stm_weather_merged.csv')
df.head()

Unnamed: 0,trip_id,route_id,stop_id,stop_sequence,realtime_arrival_time,scheduled_arrival_time,realtime_departure_time,scheduled_departure_time,schedule_relationship,temperature,precipitation,windspeed,weathercode,delay
0,285028348,189,54433,20,1745384718,1745384708000,1745384718,1745384708000,0,1.5,0.0,16.2,0,10.0
1,285028348,189,54444,21,1745384751,1745384751000,1745384751,1745384751000,0,1.5,0.0,16.2,0,0.0
2,285028348,189,54445,22,1745384785,1745384785000,1745384785,1745384785000,0,1.5,0.0,16.2,0,0.0
3,285028348,189,54451,23,1745384806,1745384806000,1745384806,1745384806000,0,1.5,0.0,16.2,0,0.0
4,285028348,189,54456,24,1745384829,1745384829000,1745384829,1745384829000,0,1.5,0.0,16.2,0,0.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311035 entries, 0 to 311034
Data columns (total 14 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   trip_id                   311035 non-null  int64  
 1   route_id                  311035 non-null  int64  
 2   stop_id                   311035 non-null  int64  
 3   stop_sequence             311035 non-null  int64  
 4   realtime_arrival_time     311035 non-null  int64  
 5   scheduled_arrival_time    311035 non-null  int64  
 6   realtime_departure_time   311035 non-null  int64  
 7   scheduled_departure_time  311035 non-null  int64  
 8   schedule_relationship     311035 non-null  int64  
 9   temperature               311035 non-null  float64
 10  precipitation             311035 non-null  float64
 11  windspeed                 311035 non-null  float64
 12  weathercode               311035 non-null  int64  
 13  delay                     311035 non-null  float64

In [9]:
# Convert realtime arrival timestamp (Unix seconds) to a timezone-aware datetime
rt_arrival_dt = pd.to_datetime(df['realtime_arrival_time'], unit='s', utc=True)
rt_arrival_dt = rt_arrival_dt.dt.tz_convert(local_timezone)
rt_arrival_dt

0        2025-04-23 01:05:18-04:00
1        2025-04-23 01:05:51-04:00
2        2025-04-23 01:06:25-04:00
3        2025-04-23 01:06:46-04:00
4        2025-04-23 01:07:09-04:00
                    ...           
311030   2025-04-23 19:42:51-04:00
311031   2025-04-23 19:43:27-04:00
311032   2025-04-23 19:44:14-04:00
311033   2025-04-23 19:44:54-04:00
311034   2025-04-23 19:49:00-04:00
Name: realtime_arrival_time, Length: 311035, dtype: datetime64[ns, Canada/Eastern]
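
The conversion above can be reproduced on a single value to confirm the unit and the timezone handling. This is a minimal sketch: `to_local_datetime` is a hypothetical helper, and it uses `America/Toronto`, the canonical IANA zone that the `Canada/Eastern` alias points to, on the assumption that both names resolve to the same rules on the machine running the notebook.

```python
import pandas as pd

# Assumption: America/Toronto is equivalent to the 'Canada/Eastern' alias used in this notebook.
LOCAL_TZ = "America/Toronto"

def to_local_datetime(epoch_seconds: pd.Series, tz: str = LOCAL_TZ) -> pd.Series:
    """Hypothetical helper: convert Unix epoch seconds to tz-aware local datetimes."""
    utc_dt = pd.to_datetime(epoch_seconds, unit="s", utc=True)
    return utc_dt.dt.tz_convert(tz)

# First realtime_arrival_time value from the data preview.
first = to_local_datetime(pd.Series([1745384718]))
print(first.iloc[0])  # 2025-04-23 01:05:18-04:00
```

Parsing as UTC first and converting afterwards keeps the epoch value unambiguous. Only the displayed wall-clock time changes with the target zone.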

In [17]:
# Extract hour of day (0-23) and day of week (Monday=0) as features
df['hour_of_day'] = rt_arrival_dt.dt.hour
df['day_of_week'] = rt_arrival_dt.dt.day_of_week
df.head()

Unnamed: 0,trip_id,route_id,stop_id,stop_sequence,realtime_arrival_time,scheduled_arrival_time,realtime_departure_time,scheduled_departure_time,schedule_relationship,temperature,precipitation,windspeed,weathercode,delay,hour_of_day,day_of_week
0,285028348,189,54433,20,1745384718,1745384708000,1745384718,1745384708000,0,1.5,0.0,16.2,0,10.0,1,2
1,285028348,189,54444,21,1745384751,1745384751000,1745384751,1745384751000,0,1.5,0.0,16.2,0,0.0,1,2
2,285028348,189,54445,22,1745384785,1745384785000,1745384785,1745384785000,0,1.5,0.0,16.2,0,0.0,1,2
3,285028348,189,54451,23,1745384806,1745384806000,1745384806,1745384806000,0,1.5,0.0,16.2,0,0.0,1,2
4,285028348,189,54456,24,1745384829,1745384829000,1745384829,1745384829000,0,1.5,0.0,16.2,0,0.0,1,2
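
To read the two new columns correctly, it helps to recall that pandas numbers weekdays from Monday=0. The sketch below checks this on the first arrival time from the output above. The timestamp literal and the `America/Toronto` zone name (the canonical equivalent of `Canada/Eastern`) are the only assumptions.

```python
import pandas as pd

# First local arrival time shown in the conversion output of this notebook.
ts = pd.Timestamp("2025-04-23 01:05:18", tz="America/Toronto")

hour_of_day = ts.hour         # hour in local time, 0-23
day_of_week = ts.day_of_week  # Monday=0 ... Sunday=6
day_name = ts.day_name()      # human-readable weekday

print(hour_of_day, day_of_week, day_name)  # 1 2 Wednesday
```

So the `hour_of_day` of 1 and `day_of_week` of 2 in the preview rows correspond to 1 a.m. on a Wednesday.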


In [None]:
# Use Label Encoding (tree-based models)
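
The cell above is left empty in the notebook. One possible way to fill it in is sketched below with scikit-learn's `LabelEncoder` on a small stand-in frame. Which columns the author intended to encode is not stated, so `route_id` and `stop_id` are assumptions, as are the `demo` frame, the `_encoded` suffix, and the `encoders` dictionary.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in frame built from identifiers that appear in the data preview.
demo = pd.DataFrame({
    "route_id": [189, 189, 189, 189],
    "stop_id": [54445, 54433, 54456, 54433],
})

# Assumption: identifier columns are the ones to encode.
encoders = {}
for col in ["route_id", "stop_id"]:
    enc = LabelEncoder()
    # fit_transform maps each distinct value to an integer in sorted order.
    demo[col + "_encoded"] = enc.fit_transform(demo[col])
    encoders[col] = enc  # keep the fitted encoder to decode or reuse later

print(demo["stop_id_encoded"].tolist())  # [1, 0, 2, 0]
```

The integer codes carry an arbitrary order. Tree-based models can usually cope with that because they split on thresholds, whereas linear models would generally need one-hot encoding instead.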

In [11]:
df.columns

Index(['trip_id', 'route_id', 'stop_id', 'stop_sequence',
       'realtime_arrival_time', 'scheduled_arrival_time',
       'realtime_departure_time', 'scheduled_departure_time',
       'schedule_relationship', 'temperature', 'precipitation', 'windspeed',
       'weathercode', 'delay', 'hour_of_day'],
      dtype='object')

## End