# STM Transit Delay Data Preparation

## Data description

### Real-time STM Trip Updates

`current_time` timestamp of the time the data was collected<br>
`trip_id` unique identifier of a trip<br>
`route_id` bus or metro line<br>
`start_date` schedule date<br>
`stop_id` stop number<br>
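`arrival_time` actual arrival time, as a Unix timestamp in seconds (converted to milliseconds during cleaning)<br>
`departure_time` actual departure time, as a Unix timestamp in seconds (converted to milliseconds during cleaning)<br>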
`schedule_relationship` state of the trip, 0 means scheduled and 1 means skipped

### Scheduled STM Trips

`trip_id` unique identifier of a trip<br>
`arrival_time` scheduled arrival time, as a GTFS `HH:MM:SS` string (hours can exceed 24 for trips that run past midnight)<br>
`departure_time` scheduled departure time, as a GTFS `HH:MM:SS` string<br>
`stop_id` stop number<br>
`stop_sequence` sequence of the stop, for ordering

### STM Stops

`stop_id` unique identifier of a stop<br>
`stop_code` stop number<br>
`stop_name` stop name<br>
`stop_lat` stop latitude<br>
`stop_lon` stop longitude<br>
`stop_url` stop web page<br>
`location_type` stop type; under the GTFS convention, 0 is a stop or platform, 1 a station and 2 a station entrance or exit<br>
`parent_station` parent station (e.g., a metro station with multiple exits)<br>
`wheelchair_boarding` indicates whether the stop is accessible to people in wheelchairs, 1 being true and 2 being false

### Weather Archive and Forecast

`time` date and hour of the archived weather<br>
`temperature` air temperature at 2 meters above ground, in Celsius<br>
`precipitation` total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters<br>
`windspeed` wind speed at 10 meters above ground, in km/h<br>
`weathercode` World Meteorological Organization (WMO) code

## Imports

In [1]:
from datetime import timedelta
import pandas as pd
import sys

In [2]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import fetch_weather, LOCAL_TIMEZONE

In [3]:
real_stm_df = pd.read_csv('../data/fetched_stm.csv', low_memory=False)

In [4]:
planned_stm_df = pd.read_csv('../data/stop_times_2025-04-23.txt')

In [5]:
stops_df = pd.read_csv('../data/stops_2025-04-23.txt')

In [6]:
weather_df = pd.read_csv('../data/fetched_historical_weather.csv')

## Clean Data

In [7]:
real_stm_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2852528 entries, 0 to 2852527
Data columns (total 8 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   current_time           float64
 1   trip_id                int64  
 2   route_id               object 
 3   start_date             int64  
 4   stop_id                int64  
 5   arrival_time           int64  
 6   departure_time         int64  
 7   schedule_relationship  int64  
dtypes: float64(1), int64(6), object(1)
memory usage: 174.1+ MB


In [None]:
# Sort trips
subset = ['current_time', 'start_date', 'trip_id', 'route_id', 'stop_id']
real_stm_df = real_stm_df.sort_values(by=subset)

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id,arrival_time,departure_time,schedule_relationship
2852523,1745766000.0,286593931,35,20250427,52705,1745768400,1745768400,0
2852524,1745766000.0,286593931,35,20250427,52666,1745768472,1745768472,0
2852525,1745766000.0,286593931,35,20250427,52588,1745768563,1745768563,0
2852526,1745766000.0,286593931,35,20250427,52499,1745768640,1745768640,0
2852527,1745766000.0,286593931,35,20250427,52412,1745768820,0,0


In [10]:
# Get proportion of duplicates
new_subset = subset[1:]
duplicate_mask = real_stm_df.duplicated(subset=new_subset)
print(f'{(duplicate_mask.sum() / len(real_stm_df)):.2%}')

24.02%


In [11]:
# Remove duplicates
real_stm_df = real_stm_df.drop_duplicates(subset=new_subset, keep='last') # keep latest update
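
As a small illustration of the deduplication above, the sketch below uses made-up rows (not taken from the dataset) to show why the earlier sort matters: once rows are ordered by `current_time`, `keep='last'` retains the most recent update for each trip and stop. The column subset is shortened for readability.

```python
import pandas as pd

# Toy feed snapshots (hypothetical values): trip 1 at stop 10 is reported twice
toy_df = pd.DataFrame({
    'current_time': [200.0, 100.0, 100.0],
    'trip_id': [1, 1, 2],
    'stop_id': [10, 10, 20],
    'arrival_time': [1060, 1000, 2000],
})

# Sort so that the newest snapshot of each (trip_id, stop_id) pair comes last
toy_df = toy_df.sort_values(by=['current_time', 'trip_id', 'stop_id'])

key = ['trip_id', 'stop_id']
duplicate_share = toy_df.duplicated(subset=key).mean()  # share of repeated keys
latest_df = toy_df.drop_duplicates(subset=key, keep='last')

print(f'{duplicate_share:.2%}')
print(latest_df)
```

Here one of the three rows repeats an existing key, and the surviving row for trip 1 is the one collected at `current_time` 200.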

In [12]:
# Convert realtime arrival and departure time to milliseconds
real_stm_df['arrival_time'] = real_stm_df['arrival_time'] * 1000
real_stm_df['departure_time'] = real_stm_df['departure_time'] * 1000

In [13]:
# Get distribution of realtime arrival times
real_stm_df[['arrival_time', 'departure_time']].describe()

Unnamed: 0,arrival_time,departure_time
count,2167304.0,2167304.0
mean,1649678000000.0,1645944000000.0
std,397754100000.0,404965800000.0
min,0.0,0.0
25%,1745497000000.0,1745496000000.0
50%,1745579000000.0,1745579000000.0
75%,1745661000000.0,1745660000000.0
max,1745773000000.0,1745773000000.0


In [14]:
# Get proportion of rows with zero arrival times
zero_mask = real_stm_df['arrival_time'] == 0
print(f'{(zero_mask.sum() / len(real_stm_df)):.2%}')

5.49%


In [15]:
# Get proportion of rows where the arrival and departure times are different
diff_date_mask = real_stm_df['arrival_time'] != real_stm_df['departure_time']
print(f'{(diff_date_mask.sum() / len(real_stm_df)):.2%}')

5.96%


In [None]:
# Get rows where the arrival and departure times differ
diff_date_df = real_stm_df[diff_date_mask]

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id,arrival_time,departure_time,schedule_relationship
44,1.745385e+09,285028348,189,20250422,53610,1745386080000,0,0
61,1.745385e+09,285001809,97,20250422,54010,1745385828000,0,0
62,1.745385e+09,284777405,100,20250422,50561,1745384652000,1745384663000,0
70,1.745385e+09,284777405,100,20250422,50734,1745384907000,1745384940000,0
71,1.745385e+09,284777405,100,20250422,50769,1745385060000,0,0
...,...,...,...,...,...,...,...,...
2852410,1.745766e+09,284752803,121,20250427,55934,0,1745766000000,0
2852457,1.745766e+09,284752803,121,20250427,54216,1745769180000,0,0
2852458,1.745766e+09,286592942,102,20250427,53796,0,1745766000000,0
2852483,1.745766e+09,286592942,102,20250427,50489,1745767320000,0,0


In [17]:
# Replace zero arrival times with departure times, as they are usually the same
real_stm_df.loc[zero_mask, 'arrival_time'] = real_stm_df.loc[zero_mask, 'departure_time']

In [18]:
# Get proportion of rows with zero arrival times again
zero_mask = real_stm_df['arrival_time'] == 0
print(f'{(zero_mask.sum() / len(real_stm_df)):.2%}')

2.98%


In [19]:
# Delete the rows with 0 arrival times
real_stm_df = real_stm_df[~zero_mask]
zero_mask = real_stm_df['arrival_time'] == 0
assert zero_mask.sum() == 0
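
The zero-handling steps above can be sketched on a toy frame. The values are hypothetical and only illustrate the three cases: arrival missing, departure missing, and both missing.

```python
import pandas as pd

# Hypothetical rows: 0 stands for a missing timestamp in the feed
toy_df = pd.DataFrame({
    'arrival_time': [0, 1745766000000, 0],
    'departure_time': [1745766000000, 0, 0],
})

# Fall back to the departure time wherever the arrival time is zero
zero_mask = toy_df['arrival_time'] == 0
toy_df.loc[zero_mask, 'arrival_time'] = toy_df.loc[zero_mask, 'departure_time']

# Rows where both timestamps were zero carry no usable time and are dropped
toy_df = toy_df[toy_df['arrival_time'] != 0]
print(toy_df)
```

Only the row with both timestamps missing is lost; the first row is recovered from its departure time.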

In [20]:
# Rename arrival time
real_stm_df = real_stm_df.rename(columns={'arrival_time': 'realtime_arrival_time'})

In [21]:
# Drop departure time
real_stm_df = real_stm_df.drop('departure_time', axis=1)
real_stm_df.columns

Index(['current_time', 'trip_id', 'route_id', 'start_date', 'stop_id',
       'realtime_arrival_time', 'schedule_relationship'],
      dtype='object')

## Merge Data

### Realtime and Scheduled Trips

In [None]:
stm_trips_df = pd.merge(left=real_stm_df, right=planned_stm_df, how='inner', on=['trip_id', 'stop_id'])

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id,realtime_arrival_time,schedule_relationship,arrival_time,departure_time,stop_sequence
0,1745385000.0,285028348,189,20250422,54433,1745384718000,0,25:05:08,25:05:08,20
1,1745385000.0,285028348,189,20250422,54444,1745384751000,0,25:05:51,25:05:51,21
2,1745385000.0,285028348,189,20250422,54445,1745384785000,0,25:06:25,25:06:25,22
3,1745385000.0,285028348,189,20250422,54451,1745384806000,0,25:06:46,25:06:46,23
4,1745385000.0,285028348,189,20250422,54456,1745384829000,0,25:07:09,25:07:09,24


In [23]:
# Convert start_date to datetime
stm_trips_df['start_date'] = pd.to_datetime(stm_trips_df['start_date'], format='%Y%m%d')
assert(stm_trips_df['start_date'].dtype == 'datetime64[ns]')

In [24]:
def parse_gtfs_time(row) -> pd.Timestamp:
	'''
	Converts the row's GTFS arrival time string (e.g., '25:30:00') to a datetime
	by adding it as an offset to the row's start_date.
	'''
	hours, minutes, seconds = map(int, row['arrival_time'].split(':'))
	total_seconds = hours * 3600 + minutes * 60 + seconds

	parsed_time = row['start_date'] + timedelta(seconds=total_seconds)
	return parsed_time

In [25]:
# Convert planned arrival time to localized datetime
stm_trips_df['scheduled_arrival_time'] = stm_trips_df.apply(parse_gtfs_time, axis=1)
stm_trips_df['scheduled_arrival_time'] = stm_trips_df['scheduled_arrival_time'].dt.tz_localize(LOCAL_TIMEZONE)
assert isinstance(stm_trips_df['scheduled_arrival_time'].dtype, pd.DatetimeTZDtype)

In [None]:
# Convert planned time to timestamp in milliseconds since epoch
stm_trips_df['scheduled_arrival_time'] = stm_trips_df['scheduled_arrival_time'].astype('int64') // 10**6

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id,realtime_arrival_time,schedule_relationship,arrival_time,departure_time,stop_sequence,scheduled_arrival_time
1090760,1745590000.0,285010246,468,2025-04-25,58015,1745590980000,0,10:21:30,10:21:30,21,1745590890000
1191158,1745600000.0,285008057,177,2025-04-25,55925,1745602569000,0,13:36:09,13:36:09,18,1745602569000
226028,1745446000.0,286572024,107,2025-04-23,56503,1745446163000,0,18:04:44,18:04:44,40,1745445884000
397776,1745492000.0,285028638,141,2025-04-24,54184,1745495880000,0,07:58:00,07:58:00,1,1745495880000
1912182,1745723000.0,286589213,90,2025-04-26,57303,1745722941000,0,23:01:38,23:01:38,19,1745722898000
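
The row-wise `apply` above can be slow on millions of rows. A vectorized variant of the same conversion might look like the sketch below. The helper name `gtfs_to_epoch_ms` and the constant `ASSUMED_TIMEZONE` are illustrative and not part of the notebook, and `'America/Toronto'` is an assumption, since the actual value of `LOCAL_TIMEZONE` is defined in `scripts.custom_functions`.

```python
import pandas as pd

# Assumption: the feed uses Eastern time; the notebook's LOCAL_TIMEZONE may differ
ASSUMED_TIMEZONE = 'America/Toronto'

def gtfs_to_epoch_ms(start_date: pd.Series, gtfs_time: pd.Series) -> pd.Series:
    """Combine service dates and GTFS 'HH:MM:SS' strings into epoch milliseconds."""
    # Split 'HH:MM:SS' into three integer columns; hours may exceed 24
    parts = gtfs_time.str.split(':', expand=True).astype('int64')
    seconds = parts[0] * 3600 + parts[1] * 60 + parts[2]

    # Offset from the service date, then attach the local timezone
    local_dt = (start_date + pd.to_timedelta(seconds, unit='s')).dt.tz_localize(ASSUMED_TIMEZONE)

    # Milliseconds elapsed since the Unix epoch
    epoch = pd.Timestamp('1970-01-01', tz='UTC')
    return (local_dt.dt.tz_convert('UTC') - epoch) // pd.Timedelta(milliseconds=1)

toy_df = pd.DataFrame({
    'start_date': pd.to_datetime(['20250422', '20250424'], format='%Y%m%d'),
    'arrival_time': ['25:05:08', '07:58:00'],
})
toy_df['scheduled_arrival_time'] = gtfs_to_epoch_ms(toy_df['start_date'], toy_df['arrival_time'])
print(toy_df)
```

Under the timezone assumption, the two toy rows reproduce the scheduled values shown in the notebook output for the same date and time pairs (`1745384708000` and `1745495880000`). Times that fall in a daylight saving transition would need the `ambiguous` or `nonexistent` arguments of `tz_localize`.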


### Trips and Stops

In [None]:
merged_stm_df = pd.merge(left=stm_trips_df, right=stops_df, how='inner', left_on='stop_id', right_on='stop_code')

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id_x,realtime_arrival_time,schedule_relationship,arrival_time,departure_time,stop_sequence,scheduled_arrival_time,stop_id_y,stop_code,stop_name,stop_lat,stop_lon,stop_url,location_type,parent_station,wheelchair_boarding
0,1745385000.0,285028348,189,2025-04-22,54433,1745384718000,0,25:05:08,25:05:08,20,1745384708000,54433,54433,Notre-Dame / No 10150,45.617546,-73.507835,https://www.stm.info/fr/recherche#stq=54433,0,,1
1,1745385000.0,285028348,189,2025-04-22,54444,1745384751000,0,25:05:51,25:05:51,21,1745384751000,54444,54444,Notre-Dame / Gamble,45.62163,-73.505533,https://www.stm.info/fr/recherche#stq=54444,0,,1
2,1745385000.0,285028348,189,2025-04-22,54445,1745384785000,0,25:06:25,25:06:25,22,1745384785000,54445,54445,Notre-Dame / No 10800,45.624606,-73.503332,https://www.stm.info/fr/recherche#stq=54445,0,,1
3,1745385000.0,285028348,189,2025-04-22,54451,1745384806000,0,25:06:46,25:06:46,23,1745384806000,54451,54451,Notre-Dame / Richard,45.62627,-73.501486,https://www.stm.info/fr/recherche#stq=54451,0,,1
4,1745385000.0,285028348,189,2025-04-22,54456,1745384829000,0,25:07:09,25:07:09,24,1745384829000,54456,54456,Notre-Dame / Hinton,45.628078,-73.499449,https://www.stm.info/fr/recherche#stq=54456,0,,1


In [None]:
# Keep relevant columns
merged_stm_df = merged_stm_df[[
  'current_time',
  'trip_id',
  'route_id',
  'stop_id_x',
  'stop_lat',
  'stop_lon',
  'stop_sequence',
  'wheelchair_boarding',
  'realtime_arrival_time',
  'scheduled_arrival_time'
]]

Unnamed: 0,current_time,trip_id,route_id,stop_id_x,stop_lat,stop_lon,stop_sequence,wheelchair_boarding,realtime_arrival_time,scheduled_arrival_time
1670681,1745683000.0,285117964,164,60861,45.497132,-73.725505,44,1,1745685605000,1745685605000
1630647,1745680000.0,284301086,121,55554,45.506659,-73.692474,44,1,1745680510000,1745680414000
914493,1745557000.0,285029142,86,54430,45.632404,-73.509587,23,1,1745557348000,1745557224000
773961,1745532000.0,284740796,140,55149,45.593519,-73.632749,12,1,1745535344000,1745535344000
964161,1745575000.0,286570586,63,51397,45.472925,-73.613057,13,1,1745578380000,1745578380000


In [29]:
# Rename stop id
merged_stm_df = merged_stm_df.rename(columns={'stop_id_x': 'stop_id'})

In [30]:
# Convert route_id to integer
merged_stm_df['route_id'] = merged_stm_df['route_id'].astype('int64')

In [31]:
# Convert wheelchair_boarding to a 0/1 flag (1 = accessible)
merged_stm_df['wheelchair_boarding'] = (merged_stm_df['wheelchair_boarding'] == 1).astype('int64')

### STM and Weather

In [None]:
# Convert arrival timestamp to datetime
rt_arrival_dt = pd.to_datetime(merged_stm_df['realtime_arrival_time'], origin='unix', unit='ms', utc=True)

0   2025-04-23 05:05:18+00:00
1   2025-04-23 05:05:51+00:00
2   2025-04-23 05:06:25+00:00
3   2025-04-23 05:06:46+00:00
4   2025-04-23 05:07:09+00:00
Name: realtime_arrival_time, dtype: datetime64[ns, UTC]

In [33]:
# Round arrival time to the nearest hour
merged_stm_df['rounded_arrival_dt'] = rt_arrival_dt.dt.round('h')

In [None]:
# Format time to match weather data
merged_stm_df['time'] = merged_stm_df['rounded_arrival_dt'].dt.strftime('%Y-%m-%dT%H:%M')

Unnamed: 0,current_time,trip_id,route_id,stop_id,stop_lat,stop_lon,stop_sequence,wheelchair_boarding,realtime_arrival_time,scheduled_arrival_time,rounded_arrival_dt,time
1992401,1745752000.0,284214964,128,55504,45.518604,-73.6982,59,1,1745753885000,1745753885000,2025-04-27 12:00:00+00:00,2025-04-27T12:00
1612382,1745676000.0,283610164,86,53425,45.643739,-73.51416,59,1,1745677386000,1745677320000,2025-04-26 14:00:00+00:00,2025-04-26T14:00
1092624,1745590000.0,285029603,141,52119,45.57405,-73.579328,13,1,1745589801000,1745589780000,2025-04-25 14:00:00+00:00,2025-04-25T14:00
562192,1745510000.0,285282827,136,54790,45.594485,-73.610108,6,1,1745512998000,1745512998000,2025-04-24 17:00:00+00:00,2025-04-24T17:00
237445,1745446000.0,286573808,195,57549,45.448196,-73.751896,3,1,1745449033000,1745449033000,2025-04-23 23:00:00+00:00,2025-04-23T23:00
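
The three steps above (timestamp to datetime, rounding, formatting) can be checked in isolation. The sketch below applies them to two `realtime_arrival_time` values that appear in the outputs above.

```python
import pandas as pd

# Two realtime arrival timestamps in milliseconds, taken from the outputs above
realtime_ms = pd.Series([1745384718000, 1745753885000])

# Epoch milliseconds -> timezone-aware UTC datetimes
arrival_dt = pd.to_datetime(realtime_ms, unit='ms', utc=True)

# Round to the nearest hour and format as the weather join key
hour_key = arrival_dt.dt.round('h').dt.strftime('%Y-%m-%dT%H:%M')
print(hour_key.tolist())
```

The resulting keys are in UTC, so the join assumes the weather `time` column is expressed in UTC as well.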


In [None]:
# Merge STM data with historical weather
df = pd.merge(left=merged_stm_df, right=weather_df, how='left', on='time')

Unnamed: 0,current_time,trip_id,route_id,stop_id,stop_lat,stop_lon,stop_sequence,wheelchair_boarding,realtime_arrival_time,scheduled_arrival_time,rounded_arrival_dt,time,temperature,precipitation,windspeed,weathercode
0,1745385000.0,285028348,189,54433,45.617546,-73.507835,20,1,1745384718000,1745384708000,2025-04-23 05:00:00+00:00,2025-04-23T05:00,4.5,0.0,9.5,0.0
1,1745385000.0,285028348,189,54444,45.62163,-73.505533,21,1,1745384751000,1745384751000,2025-04-23 05:00:00+00:00,2025-04-23T05:00,4.5,0.0,9.5,0.0
2,1745385000.0,285028348,189,54445,45.624606,-73.503332,22,1,1745384785000,1745384785000,2025-04-23 05:00:00+00:00,2025-04-23T05:00,4.5,0.0,9.5,0.0
3,1745385000.0,285028348,189,54451,45.62627,-73.501486,23,1,1745384806000,1745384806000,2025-04-23 05:00:00+00:00,2025-04-23T05:00,4.5,0.0,9.5,0.0
4,1745385000.0,285028348,189,54456,45.628078,-73.499449,24,1,1745384829000,1745384829000,2025-04-23 05:00:00+00:00,2025-04-23T05:00,4.5,0.0,9.5,0.0


In [None]:
# Flag rows with any missing value (rows left without weather after the merge)
null_weather_mask = df.isna().any(axis=1)

Unnamed: 0,current_time,trip_id,route_id,stop_id,stop_lat,stop_lon,stop_sequence,wheelchair_boarding,realtime_arrival_time,scheduled_arrival_time,rounded_arrival_dt,time,temperature,precipitation,windspeed,weathercode
1287616,1745611000.0,285028147,449,53225,45.664081,-73.536064,2,0,1745614140000,1745614140000,2025-04-25 21:00:00+00:00,2025-04-25T21:00,,,,
1669451,1745683000.0,286575902,136,53988,45.564019,-73.550718,28,1,1745686419000,1745686419000,2025-04-26 17:00:00+00:00,2025-04-26T17:00,,,,
1107385,1745590000.0,285028841,141,52554,45.593444,-73.565971,26,1,1745592620000,1745592620000,2025-04-25 15:00:00+00:00,2025-04-25T15:00,,,,
1739137,1745694000.0,283552546,51,50597,45.460928,-73.648144,7,1,1745695609000,1745695522000,2025-04-26 19:00:00+00:00,2025-04-26T19:00,,,,
1046856,1745582000.0,286571722,106,57083,45.428955,-73.638628,12,0,1745584206000,1745584206000,2025-04-25 13:00:00+00:00,2025-04-25T13:00,,,,


In [37]:
# Get proportion of rows with null weather
print(f'{(null_weather_mask.sum() / len(df)):.2%}')

61.55%


In [38]:
# Separate null and non null rows
not_null_df = df[~null_weather_mask]
null_df = df[null_weather_mask]

In [None]:
# Fetch forecast weather
start_date = null_df['rounded_arrival_dt'].min().strftime('%Y-%m-%d')
end_date = null_df['rounded_arrival_dt'].max().strftime('%Y-%m-%d')

weather_list = fetch_weather(start_date=start_date, end_date=end_date, forecast=True)
weather_df = pd.DataFrame(weather_list)

Unnamed: 0,time,temperature,precipitation,windspeed,weathercode
0,2025-04-25T00:00,8.9,0.0,10.7,0
1,2025-04-25T01:00,8.4,0.0,6.9,0
2,2025-04-25T02:00,7.7,0.0,8.0,2
3,2025-04-25T03:00,6.9,0.0,8.4,0
4,2025-04-25T04:00,6.5,0.0,6.6,0


In [None]:
# Merge null weather dataframe with forecast
null_df = null_df.drop(['temperature', 'precipitation', 'windspeed', 'weathercode'], axis=1)
null_df = pd.merge(left=null_df, right=weather_df, how='inner', on='time')

Unnamed: 0,current_time,trip_id,route_id,stop_id,stop_lat,stop_lon,stop_sequence,wheelchair_boarding,realtime_arrival_time,scheduled_arrival_time,rounded_arrival_dt,time,temperature,precipitation,windspeed,weathercode
0,1745532000.0,285010565,968,60296,45.514212,-73.684175,1,1,1745539337000,1745533800000,2025-04-25 00:00:00+00:00,2025-04-25T00:00,8.9,0.0,10.7,0
1,1745532000.0,285010565,968,61988,45.510414,-73.81174,3,1,1745540656000,1745535900000,2025-04-25 00:00:00+00:00,2025-04-25T00:00,8.9,0.0,10.7,0
2,1745532000.0,284779443,166,51290,45.496392,-73.616885,17,1,1745537409000,1745531880000,2025-04-25 00:00:00+00:00,2025-04-25T00:00,8.9,0.0,10.7,0
3,1745532000.0,284779443,166,51252,45.495321,-73.618952,18,1,1745537501000,1745531972000,2025-04-25 00:00:00+00:00,2025-04-25T00:00,8.9,0.0,10.7,0
4,1745532000.0,284779443,166,51254,45.494653,-73.619497,19,1,1745537533000,1745532004000,2025-04-25 00:00:00+00:00,2025-04-25T00:00,8.9,0.0,10.7,0


In [41]:
# Concatenate the null and non-null weather dataframes
df = pd.concat([not_null_df, null_df]).reset_index(drop=True)
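
The archive-then-forecast fallback used in this section can be summarized on toy data. All weather values below are hypothetical, and plain DataFrames stand in for the archive file and for the result of `fetch_weather`.

```python
import pandas as pd

WEATHER_COLS = ['temperature', 'precipitation', 'windspeed', 'weathercode']

trips = pd.DataFrame({
    'trip_id': [1, 2, 3],
    'time': ['2025-04-23T05:00', '2025-04-25T13:00', '2025-04-25T14:00'],
})
# Hypothetical archive covering only the first hour
archive = pd.DataFrame({
    'time': ['2025-04-23T05:00'],
    'temperature': [4.5], 'precipitation': [0.0],
    'windspeed': [9.5], 'weathercode': [0],
})
# Hypothetical forecast covering the hours missing from the archive
forecast = pd.DataFrame({
    'time': ['2025-04-25T13:00', '2025-04-25T14:00'],
    'temperature': [12.0, 13.0], 'precipitation': [0.0, 0.2],
    'windspeed': [8.0, 7.5], 'weathercode': [1, 51],
})

# 1. Left merge keeps every trip; unmatched hours get NaN weather
merged = trips.merge(archive, how='left', on='time')
missing = merged[WEATHER_COLS].isna().any(axis=1)

# 2. Refill the unmatched rows from the forecast
filled = merged[missing].drop(columns=WEATHER_COLS).merge(forecast, how='inner', on='time')

# 3. Stack both parts back together
result = pd.concat([merged[~missing], filled], ignore_index=True)
print(result)
```

Because the second merge is an inner join, any trip whose hour is absent from both sources is dropped rather than kept with null weather.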

## Export Data

In [45]:
# Keep relevant columns
df = df[['current_time', 'trip_id', 'route_id', 'stop_id', 'stop_lat', 'stop_lon',
       'stop_sequence', 'wheelchair_boarding', 'realtime_arrival_time',
       'scheduled_arrival_time', 'temperature', 'precipitation', 'windspeed', 'weathercode']]

In [46]:
# Export data to CSV
df.to_csv('../data/stm_weather_merged.csv', index=False)

## End