# STM Transit Delay Data Preparation

## Data description

### Real-time STM Trip Updates

`current_time` timestamp of the time the data was collected<br>
`trip_id` unique identifier of a trip<br>
`route_id` bus or metro line<br>
`start_date` schedule date<br>
`stop_id` stop number<br>
`arrival_time` actual arrival time, in seconds since the Unix epoch<br>
`departure_time` actual departure time, in seconds since the Unix epoch<br>
`schedule_relationship` state of the trip, 0 means scheduled and 1 means skipped

### Scheduled STM Trips

`trip_id` unique identifier of a trip<br>
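`arrival_time` scheduled arrival time, as an `HH:MM:SS` string relative to the schedule date (hours can exceed 24)<br>
`departure_time` scheduled departure time, as an `HH:MM:SS` string relative to the schedule date (hours can exceed 24)<br>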
`stop_id` stop number<br>
`stop_sequence` sequence of the stop, for ordering

### STM Stops

`stop_id` unique identifier of a stop<br>
`stop_code` stop number<br>
`stop_name` stop name<br>
`stop_lat` stop latitude<br>
`stop_lon` stop longitude<br>
`stop_url` stop web page<br>
`location_type` stop type as defined by GTFS, 0 being a stop or platform, 1 a station and 2 a station entrance or exit<br>
`parent_station` parent station (e.g. a metro station with multiple exits)<br>
`wheelchair_boarding` indicates whether the stop is wheelchair accessible, 1 being accessible and 2 not accessible

### Weather Archive

`time` date and hour of the archived weather<br>
`temperature` air temperature at 2 meters above ground, in Celsius<br>
`precipitation` total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters<br>
`windspeed` wind speed at 10 meters above ground, in km/h<br>
`weathercode` World Meteorological Organization (WMO) code

## Imports

In [None]:
from datetime import timedelta
import numpy as np
import pandas as pd
import sys

In [None]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import LOCAL_TIMEZONE

In [None]:
real_stm_df = pd.read_csv('../data/fetched_stm.csv', low_memory=False)

In [None]:
planned_stm_df = pd.read_csv('../data/stop_times_2025-04-23.txt')

In [None]:
stops_df = pd.read_csv('../data/stops_2025-04-23.txt')

In [None]:
weather_df = pd.read_csv('../data/fetched_weather.csv')

## Merge Data

### Realtime and Scheduled Trips

In [43]:
stm_trips_df = pd.merge(left=real_stm_df, right=planned_stm_df, how='inner', on=['trip_id', 'stop_id'])
stm_trips_df.head()

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id,arrival_time_x,departure_time_x,schedule_relationship,arrival_time_y,departure_time_y,stop_sequence
0,1745385000.0,285028348,189,20250422,54433,1745384718,1745384718,0,25:05:08,25:05:08,20
1,1745385000.0,285028348,189,20250422,54444,1745384751,1745384751,0,25:05:51,25:05:51,21
2,1745385000.0,285028348,189,20250422,54445,1745384785,1745384785,0,25:06:25,25:06:25,22
3,1745385000.0,285028348,189,20250422,54451,1745384806,1745384806,0,25:06:46,25:06:46,23
4,1745385000.0,285028348,189,20250422,54456,1745384829,1745384829,0,25:07:09,25:07:09,24
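
Because the merge above keys on `trip_id` and `stop_id` only, a trip that serves the same stop more than once (a loop route, for example) would match each real-time row against every scheduled visit to that stop. The sketch below shows one way to look for such trips before merging; the toy values are illustrative and not taken from the STM files.

```python
import pandas as pd

# Toy schedule (illustrative values): trip 1 serves stop 100 twice.
planned = pd.DataFrame({
    'trip_id': [1, 1, 1],
    'stop_id': [100, 200, 100],
    'stop_sequence': [1, 2, 3],
})

# Count scheduled visits per (trip, stop) pair and keep the repeated ones
visits = planned.groupby(['trip_id', 'stop_id']).size()
repeated = visits[visits > 1]
print(repeated)
```

Any pair listed in `repeated` will produce extra rows in an inner merge on these two keys.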


In [44]:
stm_trips_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1942041 entries, 0 to 1942040
Data columns (total 11 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   current_time           float64
 1   trip_id                int64  
 2   route_id               object 
 3   start_date             int64  
 4   stop_id                int64  
 5   arrival_time_x         int64  
 6   departure_time_x       int64  
 7   schedule_relationship  int64  
 8   arrival_time_y         object 
 9   departure_time_y       object 
 10  stop_sequence          int64  
dtypes: float64(1), int64(7), object(3)
memory usage: 163.0+ MB


In [45]:
# Convert start_date to datetime
stm_trips_df['start_date'] = pd.to_datetime(stm_trips_df['start_date'], format='%Y%m%d')
stm_trips_df.head()

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id,arrival_time_x,departure_time_x,schedule_relationship,arrival_time_y,departure_time_y,stop_sequence
0,1745385000.0,285028348,189,2025-04-22,54433,1745384718,1745384718,0,25:05:08,25:05:08,20
1,1745385000.0,285028348,189,2025-04-22,54444,1745384751,1745384751,0,25:05:51,25:05:51,21
2,1745385000.0,285028348,189,2025-04-22,54445,1745384785,1745384785,0,25:06:25,25:06:25,22
3,1745385000.0,285028348,189,2025-04-22,54451,1745384806,1745384806,0,25:06:46,25:06:46,23
4,1745385000.0,285028348,189,2025-04-22,54456,1745384829,1745384829,0,25:07:09,25:07:09,24


In [46]:
def parse_gtfs_time(row) -> pd.Timestamp:
	'''
	Converts the scheduled GTFS arrival time string of a row
	(e.g., '25:30:00') to a datetime by adding it to the schedule date.
	Hours can exceed 24 for trips that run past midnight.
	'''
	hours, minutes, seconds = map(int, row['arrival_time_y'].split(':'))
	total_seconds = hours * 3600 + minutes * 60 + seconds

	parsed_time = row['start_date'] + timedelta(seconds=total_seconds)
	return parsed_time
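
Applying `parse_gtfs_time` row by row can be slow on roughly 1.9 million rows. A vectorized variant might look like the sketch below; the helper name `gtfs_times_to_datetime` and the sample values are illustrative and not part of the notebook.

```python
import pandas as pd

def gtfs_times_to_datetime(service_dates: pd.Series, gtfs_times: pd.Series) -> pd.Series:
    """Add GTFS 'HH:MM:SS' strings (hours may exceed 24) to their service dates."""
    # Split into three integer columns: hours, minutes, seconds
    parts = gtfs_times.str.split(':', expand=True).astype('int64')
    total_seconds = parts[0] * 3600 + parts[1] * 60 + parts[2]
    return service_dates + pd.to_timedelta(total_seconds, unit='s')

sample = pd.DataFrame({
    'start_date': pd.to_datetime(['20250422', '20250422'], format='%Y%m%d'),
    'arrival_time_y': ['25:05:08', '08:15:00'],
})
sample['scheduled_arrival_time'] = gtfs_times_to_datetime(
    sample['start_date'], sample['arrival_time_y']
)
print(sample)
```

The result is still timezone-naive, so it would need the same localization step as the row-wise version.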

In [None]:
# Convert planned arrival time to localized datetime
stm_trips_df['scheduled_arrival_time'] = stm_trips_df.apply(parse_gtfs_time, axis=1)
stm_trips_df['scheduled_arrival_time'] = stm_trips_df['scheduled_arrival_time'].dt.tz_localize(LOCAL_TIMEZONE)
stm_trips_df.head()

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id,arrival_time_x,departure_time_x,schedule_relationship,arrival_time_y,departure_time_y,stop_sequence,scheduled_arrival_time
0,1745385000.0,285028348,189,2025-04-22,54433,1745384718,1745384718,0,25:05:08,25:05:08,20,2025-04-23 01:05:08-04:00
1,1745385000.0,285028348,189,2025-04-22,54444,1745384751,1745384751,0,25:05:51,25:05:51,21,2025-04-23 01:05:51-04:00
2,1745385000.0,285028348,189,2025-04-22,54445,1745384785,1745384785,0,25:06:25,25:06:25,22,2025-04-23 01:06:25-04:00
3,1745385000.0,285028348,189,2025-04-22,54451,1745384806,1745384806,0,25:06:46,25:06:46,23,2025-04-23 01:06:46-04:00
4,1745385000.0,285028348,189,2025-04-22,54456,1745384829,1745384829,0,25:07:09,25:07:09,24,2025-04-23 01:07:09-04:00
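
With its default arguments, `tz_localize` raises an error for wall-clock times that do not exist, such as those skipped when daylight saving time starts. The sketch below shows one possible way to handle that case. It assumes `LOCAL_TIMEZONE` is something like `America/Toronto`, which is a guess: the notebook output only shows a -04:00 offset.

```python
import pandas as pd

# Assumption: the local timezone is America/Toronto (not confirmed by the notebook)
ASSUMED_TIMEZONE = 'America/Toronto'

naive = pd.Series(pd.to_datetime([
    '2025-03-09 02:30:00',  # skipped hour when daylight saving time starts
    '2025-04-23 01:05:08',  # ordinary time
]))

# Shift nonexistent wall-clock times forward to the next valid instant
localized = naive.dt.tz_localize(ASSUMED_TIMEZONE, nonexistent='shift_forward')
print(localized)
```

Whether shifting forward is appropriate depends on how the schedule treats the transition day, so this is only one option.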


In [48]:
# Convert planned time to timestamp in milliseconds since epoch
stm_trips_df['scheduled_arrival_time'] = stm_trips_df['scheduled_arrival_time'].astype('int64') // 10**6
stm_trips_df.head()

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id,arrival_time_x,departure_time_x,schedule_relationship,arrival_time_y,departure_time_y,stop_sequence,scheduled_arrival_time
0,1745385000.0,285028348,189,2025-04-22,54433,1745384718,1745384718,0,25:05:08,25:05:08,20,1745384708000
1,1745385000.0,285028348,189,2025-04-22,54444,1745384751,1745384751,0,25:05:51,25:05:51,21,1745384751000
2,1745385000.0,285028348,189,2025-04-22,54445,1745384785,1745384785,0,25:06:25,25:06:25,22,1745384785000
3,1745385000.0,285028348,189,2025-04-22,54451,1745384806,1745384806,0,25:06:46,25:06:46,23,1745384806000
4,1745385000.0,285028348,189,2025-04-22,54456,1745384829,1745384829,0,25:07:09,25:07:09,24,1745384829000
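
The integer cast above assumes the column is stored with nanosecond resolution. A variant that does not depend on the storage unit is to subtract the epoch and divide by one millisecond, sketched here on a single value from the table above.

```python
import pandas as pd

# One localized value taken from the notebook output
localized = pd.Series([pd.Timestamp('2025-04-23 01:05:08-04:00')])

# Subtract the Unix epoch in UTC, then count whole milliseconds
epoch = pd.Timestamp('1970-01-01', tz='UTC')
epoch_ms = (localized.dt.tz_convert('UTC') - epoch) // pd.Timedelta(milliseconds=1)
print(epoch_ms.iloc[0])
```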


In [49]:
# Convert realtime arrival and departure times from seconds to milliseconds
stm_trips_df['arrival_time_x'] = stm_trips_df['arrival_time_x'] * 1000
stm_trips_df['departure_time_x'] = stm_trips_df['departure_time_x'] * 1000

In [50]:
# Analyse distribution
stm_trips_df[['arrival_time_x', 'departure_time_x']].describe()

Unnamed: 0,arrival_time_x,departure_time_x
count,1942041.0,1942041.0
mean,1660843000000.0,1640085000000.0
std,375035900000.0,415859300000.0
min,0.0,0.0
25%,1745481000000.0,1745472000000.0
50%,1745528000000.0,1745527000000.0
75%,1745588000000.0,1745588000000.0
max,1745634000000.0,1745633000000.0


In [51]:
# Replace missing arrival times (stored as 0) with the departure time, as the two are usually the same
zero_mask = stm_trips_df['arrival_time_x'] == 0
stm_trips_df.loc[zero_mask, 'arrival_time_x'] = stm_trips_df.loc[zero_mask, 'departure_time_x']

In [52]:
# Delete the rows whose arrival time is still 0 (both times were missing)
zero_mask = stm_trips_df['arrival_time_x'] == 0
stm_trips_df = stm_trips_df[~zero_mask]
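
The two cells above can be illustrated on a toy frame: a zero arrival time falls back to the departure time, and a row is dropped only when both are zero. The values below are illustrative.

```python
import pandas as pd

trips = pd.DataFrame({
    'arrival_time_x': [0, 1745384751000, 0],
    'departure_time_x': [1745384718000, 1745384751000, 0],
})

# Where the arrival time is 0, take the departure time instead
arrival = trips['arrival_time_x']
trips['arrival_time_x'] = arrival.mask(arrival == 0, trips['departure_time_x'])

# Rows that are still 0 had neither an arrival nor a departure time
cleaned = trips[trips['arrival_time_x'] != 0]
print(cleaned)
```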

In [53]:
# Rename the realtime arrival time column
stm_trips_df = stm_trips_df.rename(columns={'arrival_time_x': 'realtime_arrival_time'})
stm_trips_df.head()

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id,realtime_arrival_time,departure_time_x,schedule_relationship,arrival_time_y,departure_time_y,stop_sequence,scheduled_arrival_time
0,1745385000.0,285028348,189,2025-04-22,54433,1745384718000,1745384718000,0,25:05:08,25:05:08,20,1745384708000
1,1745385000.0,285028348,189,2025-04-22,54444,1745384751000,1745384751000,0,25:05:51,25:05:51,21,1745384751000
2,1745385000.0,285028348,189,2025-04-22,54445,1745384785000,1745384785000,0,25:06:25,25:06:25,22,1745384785000
3,1745385000.0,285028348,189,2025-04-22,54451,1745384806000,1745384806000,0,25:06:46,25:06:46,23,1745384806000
4,1745385000.0,285028348,189,2025-04-22,54456,1745384829000,1745384829000,0,25:07:09,25:07:09,24,1745384829000


### Trips and Stops

In [54]:
# Merge stops to trips
merged_stm_df = pd.merge(left=stm_trips_df, right=stops_df, how='inner', left_on='stop_id', right_on='stop_code')
merged_stm_df.head()

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id_x,realtime_arrival_time,departure_time_x,schedule_relationship,arrival_time_y,departure_time_y,...,scheduled_arrival_time,stop_id_y,stop_code,stop_name,stop_lat,stop_lon,stop_url,location_type,parent_station,wheelchair_boarding
0,1745385000.0,285028348,189,2025-04-22,54433,1745384718000,1745384718000,0,25:05:08,25:05:08,...,1745384708000,54433,54433,Notre-Dame / No 10150,45.617546,-73.507835,https://www.stm.info/fr/recherche#stq=54433,0,,1
1,1745385000.0,285028348,189,2025-04-22,54444,1745384751000,1745384751000,0,25:05:51,25:05:51,...,1745384751000,54444,54444,Notre-Dame / Gamble,45.62163,-73.505533,https://www.stm.info/fr/recherche#stq=54444,0,,1
2,1745385000.0,285028348,189,2025-04-22,54445,1745384785000,1745384785000,0,25:06:25,25:06:25,...,1745384785000,54445,54445,Notre-Dame / No 10800,45.624606,-73.503332,https://www.stm.info/fr/recherche#stq=54445,0,,1
3,1745385000.0,285028348,189,2025-04-22,54451,1745384806000,1745384806000,0,25:06:46,25:06:46,...,1745384806000,54451,54451,Notre-Dame / Richard,45.62627,-73.501486,https://www.stm.info/fr/recherche#stq=54451,0,,1
4,1745385000.0,285028348,189,2025-04-22,54456,1745384829000,1745384829000,0,25:07:09,25:07:09,...,1745384829000,54456,54456,Notre-Dame / Hinton,45.628078,-73.499449,https://www.stm.info/fr/recherche#stq=54456,0,,1
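
This merge assumes each `stop_code` appears once in the stops file. If that is in doubt, the `validate` argument of `pd.merge` can check it, as in this sketch with toy data (stop names are copied from the output above, the rest is illustrative).

```python
import pandas as pd

trips = pd.DataFrame({'stop_id': [54433, 54444, 54433]})
stops = pd.DataFrame({
    'stop_code': [54433, 54444],
    'stop_name': ['Notre-Dame / No 10150', 'Notre-Dame / Gamble'],
})

# many_to_one: raises MergeError if stop_code is not unique in stops
merged = pd.merge(trips, stops, how='inner',
                  left_on='stop_id', right_on='stop_code',
                  validate='many_to_one')

# A stops table with a repeated stop_code fails the check
duplicated_stops = pd.concat([stops, stops.iloc[[0]]], ignore_index=True)
try:
    pd.merge(trips, duplicated_stops, how='inner',
             left_on='stop_id', right_on='stop_code',
             validate='many_to_one')
    check_failed = False
except pd.errors.MergeError:
    check_failed = True
print(len(merged), check_failed)
```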


In [55]:
# Keep relevant columns
merged_stm_df = merged_stm_df[[
  'trip_id',
  'route_id',
  'stop_id_x',
  'stop_lat',
  'stop_lon',
  'stop_sequence',
  'wheelchair_boarding',
  'realtime_arrival_time',
  'scheduled_arrival_time'
]]
merged_stm_df = merged_stm_df.rename(columns={'stop_id_x': 'stop_id'})
merged_stm_df.head()

Unnamed: 0,trip_id,route_id,stop_id,stop_lat,stop_lon,stop_sequence,wheelchair_boarding,realtime_arrival_time,scheduled_arrival_time
0,285028348,189,54433,45.617546,-73.507835,20,1,1745384718000,1745384708000
1,285028348,189,54444,45.62163,-73.505533,21,1,1745384751000,1745384751000
2,285028348,189,54445,45.624606,-73.503332,22,1,1745384785000,1745384785000
3,285028348,189,54451,45.62627,-73.501486,23,1,1745384806000,1745384806000
4,285028348,189,54456,45.628078,-73.499449,24,1,1745384829000,1745384829000


In [56]:
# Convert arrival timestamp to datetime
rt_arrival_dt = pd.to_datetime(merged_stm_df['realtime_arrival_time'], origin='unix', unit='ms', utc=True)
rt_arrival_dt

0         2025-04-23 05:05:18+00:00
1         2025-04-23 05:05:51+00:00
2         2025-04-23 05:06:25+00:00
3         2025-04-23 05:06:46+00:00
4         2025-04-23 05:07:09+00:00
                     ...           
1885393   2025-04-26 00:29:00+00:00
1885394   2025-04-26 00:31:20+00:00
1885395   2025-04-26 00:31:36+00:00
1885396   2025-04-26 00:32:00+00:00
1885397   2025-04-26 00:34:00+00:00
Name: realtime_arrival_time, Length: 1885398, dtype: datetime64[ns, UTC]

In [57]:
# TODO: remove this cell after collecting historical data
# Remove 3 days to match historical data
rt_arrival_dt = rt_arrival_dt - pd.DateOffset(days=3)
rt_arrival_dt

0         2025-04-20 05:05:18+00:00
1         2025-04-20 05:05:51+00:00
2         2025-04-20 05:06:25+00:00
3         2025-04-20 05:06:46+00:00
4         2025-04-20 05:07:09+00:00
                     ...           
1885393   2025-04-23 00:29:00+00:00
1885394   2025-04-23 00:31:20+00:00
1885395   2025-04-23 00:31:36+00:00
1885396   2025-04-23 00:32:00+00:00
1885397   2025-04-23 00:34:00+00:00
Name: realtime_arrival_time, Length: 1885398, dtype: datetime64[ns, UTC]

In [58]:
# Round arrival time to the nearest hour
rounded_arrival_dt = rt_arrival_dt.dt.round('h')
rounded_arrival_dt

0         2025-04-20 05:00:00+00:00
1         2025-04-20 05:00:00+00:00
2         2025-04-20 05:00:00+00:00
3         2025-04-20 05:00:00+00:00
4         2025-04-20 05:00:00+00:00
                     ...           
1885393   2025-04-23 00:00:00+00:00
1885394   2025-04-23 01:00:00+00:00
1885395   2025-04-23 01:00:00+00:00
1885396   2025-04-23 01:00:00+00:00
1885397   2025-04-23 01:00:00+00:00
Name: realtime_arrival_time, Length: 1885398, dtype: datetime64[ns, UTC]
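
Rounding to the nearest hour sends arrivals after the half hour to the following hour, whereas flooring keeps them in the hour in which they occurred. Which one matches the weather archive better depends on how its hourly values are bucketed. The sketch below compares the two on two timestamps from the earlier output.

```python
import pandas as pd

arrivals = pd.Series(pd.to_datetime(
    ['2025-04-26 00:29:00', '2025-04-26 00:31:20'], utc=True
))

comparison = pd.DataFrame({
    'arrival': arrivals,
    'rounded': arrivals.dt.round('h'),  # nearest hour
    'floored': arrivals.dt.floor('h'),  # start of the current hour
})
print(comparison)
```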

In [59]:
# Format time to match weather data
merged_stm_df['time'] = rounded_arrival_dt.dt.strftime('%Y-%m-%dT%H:%M')
merged_stm_df.head()

Unnamed: 0,trip_id,route_id,stop_id,stop_lat,stop_lon,stop_sequence,wheelchair_boarding,realtime_arrival_time,scheduled_arrival_time,time
0,285028348,189,54433,45.617546,-73.507835,20,1,1745384718000,1745384708000,2025-04-20T05:00
1,285028348,189,54444,45.62163,-73.505533,21,1,1745384751000,1745384751000,2025-04-20T05:00
2,285028348,189,54445,45.624606,-73.503332,22,1,1745384785000,1745384785000,2025-04-20T05:00
3,285028348,189,54451,45.62627,-73.501486,23,1,1745384806000,1745384806000,2025-04-20T05:00
4,285028348,189,54456,45.628078,-73.499449,24,1,1745384829000,1745384829000,2025-04-20T05:00


In [60]:
# Get duplicates
duplicate_mask = merged_stm_df.duplicated()
merged_stm_df[duplicate_mask]

Unnamed: 0,trip_id,route_id,stop_id,stop_lat,stop_lon,stop_sequence,wheelchair_boarding,realtime_arrival_time,scheduled_arrival_time,time
253230,285285001,811,62138,45.589525,-73.537341,1,1,1745447520000,1745444760000,2025-04-20T23:00
253231,285285001,811,62138,45.589525,-73.537341,22,1,1745447520000,1745447520000,2025-04-20T23:00
258234,285285013,811,62138,45.589525,-73.537341,1,1,1745448180000,1745445480000,2025-04-20T23:00
258235,285285013,811,62138,45.589525,-73.537341,22,1,1745448180000,1745448180000,2025-04-20T23:00
297697,285007882,72,55717,45.508261,-73.672905,34,1,1745450840000,1745450840000,2025-04-20T23:00
...,...,...,...,...,...,...,...,...,...,...
1870979,284740269,67,55333,45.583717,-73.649799,37,1,1745627209000,1745627209000,2025-04-23T00:00
1870980,284740269,67,55334,45.584434,-73.650933,38,1,1745627280000,1745627280000,2025-04-23T00:00
1873182,284739979,67,55046,45.581373,-73.646077,36,1,1745627700000,1745627700000,2025-04-23T01:00
1873183,284739979,67,55333,45.583717,-73.649799,37,1,1745627929000,1745627929000,2025-04-23T01:00
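
By default, `duplicated()` flags only the second and later copies of a row, so the first occurrence of each duplicate is missing from the listing above. Passing `keep=False` shows every copy, as in this small sketch with illustrative values.

```python
import pandas as pd

rows = pd.DataFrame({
    'trip_id': [1, 1, 2],
    'stop_id': [10, 10, 20],
    'realtime_arrival_time': [1000, 1000, 2000],
})

later_copies = rows[rows.duplicated()]           # second and later copies only
all_copies = rows[rows.duplicated(keep=False)]   # every row that has a twin
deduplicated = rows.drop_duplicates()            # keeps the first copy of each
print(len(later_copies), len(all_copies), len(deduplicated))
```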


In [61]:
# Remove duplicates
merged_stm_df = merged_stm_df.drop_duplicates()

### STM and Weather

In [62]:
# Merge STM data with weather data
df = pd.merge(left=merged_stm_df, right=weather_df, how='inner', on='time').drop('time', axis=1)
df.head()

Unnamed: 0,trip_id,route_id,stop_id,stop_lat,stop_lon,stop_sequence,wheelchair_boarding,realtime_arrival_time,scheduled_arrival_time,temperature,precipitation,windspeed,weathercode
0,285028348,189,54433,45.617546,-73.507835,20,1,1745384718000,1745384708000,1.5,0.0,16.2,0
1,285028348,189,54444,45.62163,-73.505533,21,1,1745384751000,1745384751000,1.5,0.0,16.2,0
2,285028348,189,54445,45.624606,-73.503332,22,1,1745384785000,1745384785000,1.5,0.0,16.2,0
3,285028348,189,54451,45.62627,-73.501486,23,1,1745384806000,1745384806000,1.5,0.0,16.2,0
4,285028348,189,54456,45.628078,-73.499449,24,1,1745384829000,1745384829000,1.5,0.0,16.2,0
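
An inner merge silently drops STM rows whose `time` has no match in the weather data. To see which hours are missing, a left merge with `indicator=True` can be used, as in the sketch below with toy values.

```python
import pandas as pd

stm = pd.DataFrame({
    'trip_id': [1, 2, 3],
    'time': ['2025-04-20T05:00', '2025-04-20T06:00', '2025-04-27T05:00'],
})
weather = pd.DataFrame({
    'time': ['2025-04-20T05:00', '2025-04-20T06:00'],
    'temperature': [1.5, 1.2],
})

# _merge is 'both' for matched rows and 'left_only' for rows without weather
checked = pd.merge(stm, weather, how='left', on='time', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched['time'].unique())
```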


In [63]:
# Convert route_id to integer
df['route_id'] = df['route_id'].astype('int64')
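
`astype('int64')` raises an error if any `route_id` is not a plain integer string. Should that happen with other data, `pd.to_numeric` with `errors='coerce'` can help locate the offending values. The identifier `'X5'` below is hypothetical and only serves the illustration.

```python
import pandas as pd

route_ids = pd.Series(['189', '10', 'X5'])  # 'X5' is a hypothetical non-numeric ID

# Values that cannot be parsed become NaN instead of raising
numeric = pd.to_numeric(route_ids, errors='coerce')
invalid = route_ids[numeric.isna()]
print(invalid.tolist())
```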

In [64]:
# Convert wheelchair_boarding to boolean
df['wheelchair_boarding'] = df['wheelchair_boarding'] == 1

## Export Data

In [65]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1678987 entries, 0 to 1678986
Data columns (total 13 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   trip_id                 1678987 non-null  int64  
 1   route_id                1678987 non-null  int64  
 2   stop_id                 1678987 non-null  int64  
 3   stop_lat                1678987 non-null  float64
 4   stop_lon                1678987 non-null  float64
 5   stop_sequence           1678987 non-null  int64  
 6   wheelchair_boarding     1678987 non-null  bool   
 7   realtime_arrival_time   1678987 non-null  int64  
 8   scheduled_arrival_time  1678987 non-null  int64  
 9   temperature             1678987 non-null  float64
 10  precipitation           1678987 non-null  float64
 11  windspeed               1678987 non-null  float64
 12  weathercode             1678987 non-null  int64  
dtypes: bool(1), float64(5), int64(7)
memory usage: 155.3 MB


In [66]:
df.describe()

Unnamed: 0,trip_id,route_id,stop_id,stop_lat,stop_lon,stop_sequence,realtime_arrival_time,scheduled_arrival_time,temperature,precipitation,windspeed,weathercode
count,1678987.0,1678987.0,1678987.0,1678987.0,1678987.0,1678987.0,1678987.0,1678987.0,1678987.0,1678987.0,1678987.0,1678987.0
mean,285205100.0,154.353,54875.49,45.527,-73.63521,24.56761,1745525000000.0,1745525000000.0,8.347553,0.2135231,13.37588,15.00449
std,634675.1,132.4954,3192.591,0.06398477,0.08994233,16.95879,67284820.0,67273530.0,2.88997,0.6234591,4.802622,22.34531
min,284726600.0,10.0,50101.0,45.40267,-73.9562,1.0,1745384000000.0,1745384000000.0,-0.4,0.0,3.9,0.0
25%,284776800.0,55.0,52183.0,45.47644,-73.66915,11.0,1745490000000.0,1745490000000.0,7.5,0.0,9.5,3.0
50%,285008600.0,121.0,54610.0,45.51987,-73.61722,22.0,1745528000000.0,1745528000000.0,8.6,0.0,13.0,3.0
75%,285282400.0,195.0,57001.0,45.57295,-73.57333,35.0,1745586000000.0,1745586000000.0,10.2,0.0,17.7,3.0
max,286574700.0,968.0,62442.0,45.70112,-73.48058,117.0,1745624000000.0,1745626000000.0,13.5,3.5,23.1,63.0


In [67]:
# Export data to CSV
df.to_csv('../data/stm_weather_merged.csv', index=False)

## End