# STM Transit Delay Data Preparation

## Data description

### Real-time STM Trip Updates

`current_time` Unix timestamp (in seconds) of when the data was collected<br>
`trip_id` unique identifier of a trip<br>
`route_id` bus or metro line<br>
`start_date` schedule date<br>
`stop_id` stop number<br>
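`arrival_time` actual arrival time, as a Unix timestamp in seconds (converted to milliseconds during cleaning)<br>
`departure_time` actual departure time, as a Unix timestamp in seconds (converted to milliseconds during cleaning)<br>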
`schedule_relationship` state of the trip, 0 means scheduled and 1 means skipped

### Scheduled STM Trips

`trip_id` unique identifier of a trip<br>
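`arrival_time` scheduled arrival time, as a GTFS `HH:MM:SS` string relative to the service day (hours can exceed 24)<br>
`departure_time` scheduled departure time, as a GTFS `HH:MM:SS` string relative to the service day (hours can exceed 24)<br>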
`stop_id` stop number<br>
`stop_sequence` sequence of the stop, for ordering

### STM Stops

`stop_id` unique identifier of a stop<br>
`stop_code` stop number<br>
`stop_name` stop name<br>
`stop_lat` stop latitude<br>
`stop_lon` stop longitude<br>
`stop_url` stop web page<br>
`location_type` stop type following the GTFS convention, 0 (or blank) being a stop or platform, 1 a station, and 2 a station entrance or exit<br>
`parent_station` parent station (e.g., a metro station with multiple exits)<br>
`wheelchair_boarding` indicates whether the stop is wheelchair accessible, 1 being accessible and 2 being not accessible (0 means no information)

### Weather Archive and Forecast

`time` date and hour of the archived weather<br>
`temperature` air temperature at 2 meters above ground, in Celsius<br>
`precipitation` total precipitation (rain, showers, snow) over the preceding hour, in millimeters<br>
`windspeed` wind speed at 10 meters above ground, in km/h<br>
`weathercode` World Meteorological Organization (WMO) code

## Imports

In [1]:
from datetime import timedelta
import pandas as pd
import sys

In [2]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import fetch_weather, LOCAL_TIMEZONE

In [37]:
real_stm_df = pd.read_csv('../data/fetched_stm.csv', low_memory=False)

In [4]:
planned_stm_df = pd.read_csv('../data/stop_times_2025-04-23.txt')

In [5]:
stops_df = pd.read_csv('../data/stops_2025-04-23.txt')

In [6]:
weather_df = pd.read_csv('../data/fetched_historical_weather.csv')

## Clean Data

In [26]:
real_stm_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2796947 entries, 0 to 2796946
Data columns (total 8 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   current_time           float64
 1   trip_id                int64  
 2   route_id               object 
 3   start_date             int64  
 4   stop_id                int64  
 5   arrival_time           int64  
 6   departure_time         int64  
 7   schedule_relationship  int64  
dtypes: float64(1), int64(6), object(1)
memory usage: 170.7+ MB


In [25]:
real_stm_df.sample(5)

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id,arrival_time,departure_time,schedule_relationship
265548,1745446000.0,285029388,49,20250423,50350,1745449320,1745449320,0
1547157,1745597000.0,285006923,68,20250425,58299,1745597847,1745597847,0
1339150,1745579000.0,285029049,86,20250425,53586,1745579700,1745579700,0
656058,1745503000.0,284739187,467,20250424,61046,1745506728,1745506728,0
1060539,1745536000.0,284740819,48,20250424,51807,1745536821,1745536821,0


In [38]:
# Sort trips
subset = ['current_time', 'start_date', 'trip_id', 'route_id', 'stop_id']
real_stm_df = real_stm_df.sort_values(by=subset)
real_stm_df.tail()

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id,arrival_time,departure_time,schedule_relationship
2796942,1745759000.0,286592394,38,20250427,51833,1745759895,1745759895,0
2796943,1745759000.0,286592394,38,20250427,51700,1745759958,1745759958,0
2796944,1745759000.0,286592394,38,20250427,53974,1745760034,1745760034,0
2796945,1745759000.0,286592394,38,20250427,56270,1745760120,1745760120,0
2796946,1745759000.0,286592394,38,20250427,53763,1745760300,0,0


In [39]:
# Get proportion of duplicates
new_subset = subset[1:]
duplicate_mask = real_stm_df.duplicated(subset=new_subset)
print(f'{(duplicate_mask.sum() / len(real_stm_df)):.2%}')

24.03%


In [40]:
# Remove duplicates
real_stm_df = real_stm_df.drop_duplicates(subset=new_subset, keep='last') # keep latest update
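
The deduplication step relies on the earlier sort by `current_time`: with `keep='last'`, the row that survives for each key is the most recently collected update. A minimal sketch on toy data (the values and the `updates` frame are made up for illustration):

```python
import pandas as pd

# Toy feed: the same (start_date, trip_id, route_id, stop_id) was polled twice.
updates = pd.DataFrame({
    'current_time': [100.0, 160.0, 100.0],
    'start_date': [20250423, 20250423, 20250423],
    'trip_id': [1, 1, 2],
    'route_id': ['49', '49', '68'],
    'stop_id': [50350, 50350, 58299],
    'arrival_time': [1000, 1030, 2000],
})

key = ['start_date', 'trip_id', 'route_id', 'stop_id']

# Sorting by collection time first makes keep='last' select the newest poll.
latest = (
    updates.sort_values('current_time')
    .drop_duplicates(subset=key, keep='last')
    .sort_values(key)
    .reset_index(drop=True)
)
print(latest[['trip_id', 'arrival_time']])
```

Without the sort, `keep='last'` would only keep whichever duplicate happens to appear last in the file.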

In [41]:
# Convert realtime arrival and departure time to milliseconds
real_stm_df['arrival_time'] = real_stm_df['arrival_time'] * 1000
real_stm_df['departure_time'] = real_stm_df['departure_time'] * 1000
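
To sanity-check the unit conversion, it can help to turn one value back into a readable date. The sketch below uses only the standard library; the helper names `seconds_to_ms` and `ms_to_utc` are hypothetical and not part of the project code.

```python
from datetime import datetime, timezone


def seconds_to_ms(epoch_seconds: int) -> int:
    """Convert a Unix timestamp in seconds to milliseconds."""
    return epoch_seconds * 1000


def ms_to_utc(epoch_ms: int) -> datetime:
    """Convert a Unix timestamp in milliseconds to an aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


# Raw arrival_time of the first sampled row in the real-time data.
arrival_ms = seconds_to_ms(1745449320)
print(arrival_ms)
print(ms_to_utc(arrival_ms).isoformat())
```

A date in late April 2025 confirms that the raw feed is in seconds and that the multiplied value is in milliseconds.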

In [42]:
# Get distribution of realtime arrival times
real_stm_df[['arrival_time', 'departure_time']].describe()

Unnamed: 0,arrival_time,departure_time
count,2124822.0,2124822.0
mean,1649682000000.0,1645919000000.0
std,397740400000.0,405005500000.0
min,0.0,0.0
25%,1745496000000.0,1745496000000.0
50%,1745577000000.0,1745577000000.0
75%,1745644000000.0,1745644000000.0
max,1745766000000.0,1745766000000.0


In [43]:
# Get proportion of rows with zero arrival times
zero_mask = real_stm_df['arrival_time'] == 0
print(f'{(zero_mask.sum() / len(real_stm_df)):.2%}')

5.49%


In [44]:
# Replace zero arrival times with the departure time, as the two are usually the same
real_stm_df.loc[zero_mask, 'arrival_time'] = real_stm_df.loc[zero_mask, 'departure_time']

In [45]:
# Get proportion of rows with zero arrival times again
zero_mask = real_stm_df['arrival_time'] == 0
print(f'{(zero_mask.sum() / len(real_stm_df)):.2%}')

2.99%


In [46]:
# Delete the rows with 0 arrival times
real_stm_df = real_stm_df[~zero_mask]
zero_mask = real_stm_df['arrival_time'] == 0
assert zero_mask.sum() == 0
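
The two-step treatment of zero arrival times (fall back to the departure time, then drop what is still zero) can be sketched on a toy frame. The `stops` frame below is illustrative, and `Series.mask` is used here as one possible way to express the fallback, not as the notebook's own method.

```python
import pandas as pd

stops = pd.DataFrame({
    'arrival_time':   [1745449320000, 0, 0],
    'departure_time': [1745449320000, 1745597847000, 0],
})

# Where arrival_time is the 0 placeholder, take departure_time instead.
arrival = stops['arrival_time'].mask(stops['arrival_time'] == 0,
                                     stops['departure_time'])
cleaned = stops.assign(arrival_time=arrival)

# Rows where both times were 0 carry no usable timestamp and are dropped.
cleaned = cleaned[cleaned['arrival_time'] != 0].reset_index(drop=True)
print(cleaned)
```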

In [47]:
# Rename arrival time
real_stm_df = real_stm_df.rename(columns={'arrival_time': 'realtime_arrival_time'})

In [48]:
# Drop departure time
real_stm_df = real_stm_df.drop('departure_time', axis=1)
real_stm_df.columns

Index(['current_time', 'trip_id', 'route_id', 'start_date', 'stop_id',
       'realtime_arrival_time', 'schedule_relationship'],
      dtype='object')

## Merge Data

### Realtime and Scheduled Trips

In [49]:
stm_trips_df = pd.merge(left=real_stm_df, right=planned_stm_df, how='inner', on=['trip_id', 'stop_id'])
stm_trips_df.head()

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id,realtime_arrival_time,schedule_relationship,arrival_time,departure_time,stop_sequence
0,1745385000.0,285028348,189,20250422,54433,1745384718000,0,25:05:08,25:05:08,20
1,1745385000.0,285028348,189,20250422,54444,1745384751000,0,25:05:51,25:05:51,21
2,1745385000.0,285028348,189,20250422,54445,1745384785000,0,25:06:25,25:06:25,22
3,1745385000.0,285028348,189,20250422,54451,1745384806000,0,25:06:46,25:06:46,23
4,1745385000.0,285028348,189,20250422,54456,1745384829000,0,25:07:09,25:07:09,24
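
An inner merge silently drops real-time rows that have no scheduled counterpart. One way to measure that loss, sketched here on toy frames, is a left merge with `indicator=True`; `validate='many_to_one'` additionally checks that each `(trip_id, stop_id)` pair is unique in the schedule, which is an assumption about the data rather than something the notebook verifies.

```python
import pandas as pd

realtime = pd.DataFrame({'trip_id': [1, 1, 2], 'stop_id': [10, 11, 10]})
schedule = pd.DataFrame({
    'trip_id': [1, 1],
    'stop_id': [10, 11],
    'arrival_time': ['25:05:08', '25:05:51'],
})

# how='left' keeps every real-time row; the _merge column records its origin.
checked = realtime.merge(
    schedule,
    how='left',
    on=['trip_id', 'stop_id'],
    indicator=True,
    validate='many_to_one',
)
unmatched_share = (checked['_merge'] == 'left_only').mean()
print(f'{unmatched_share:.2%}')
```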


In [50]:
# Convert start_date to datetime
stm_trips_df['start_date'] = pd.to_datetime(stm_trips_df['start_date'], format='%Y%m%d')
assert(stm_trips_df['start_date'].dtype == 'datetime64[ns]')

In [51]:
def parse_gtfs_time(row) -> pd.Timestamp:
    '''
    Converts the GTFS arrival time string of a row (e.g., '25:30:00')
    to a naive datetime by adding it to the row's start date.
    Hours can exceed 24 for trips that run past midnight.
    '''
    hours, minutes, seconds = map(int, row['arrival_time'].split(':'))
    total_seconds = hours * 3600 + minutes * 60 + seconds

    parsed_time = row['start_date'] + timedelta(seconds=total_seconds)
    return parsed_time
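
GTFS times are offsets from the start of the service day, which is why `25:05:08` is valid: it means 01:05:08 on the following calendar day. The standalone sketch below reproduces the conversion for one row with the standard library only. It assumes a fixed UTC-4 offset (Eastern Daylight Time) in place of `LOCAL_TIMEZONE`, which only holds while daylight saving time is in effect, and `gtfs_time_to_epoch_ms` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-4 offset (EDT); not valid outside daylight saving time.
EDT = timezone(timedelta(hours=-4))


def gtfs_time_to_epoch_ms(start_date: str, gtfs_time: str, tz=EDT) -> int:
    """Combine a YYYYMMDD service date and an HH:MM:SS GTFS time into epoch ms."""
    hours, minutes, seconds = map(int, gtfs_time.split(':'))
    service_day = datetime.strptime(start_date, '%Y%m%d').replace(tzinfo=tz)
    # timedelta handles hours >= 24 by rolling over to the next calendar day.
    moment = service_day + timedelta(hours=hours, minutes=minutes, seconds=seconds)
    return int(moment.timestamp()) * 1000


# First merged row: start_date 20250422, scheduled arrival 25:05:08.
print(gtfs_time_to_epoch_ms('20250422', '25:05:08'))
```

The result can be compared with the `scheduled_arrival_time` that the notebook computes for the same stop.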

In [52]:
# Convert planned arrival time to localized datetime
stm_trips_df['scheduled_arrival_time'] = stm_trips_df.apply(parse_gtfs_time, axis=1)
stm_trips_df['scheduled_arrival_time'] = stm_trips_df['scheduled_arrival_time'].dt.tz_localize(LOCAL_TIMEZONE)
assert isinstance(stm_trips_df['scheduled_arrival_time'].dtype, pd.DatetimeTZDtype)

In [53]:
# Convert planned time to timestamp in milliseconds since epoch
stm_trips_df['scheduled_arrival_time'] = stm_trips_df['scheduled_arrival_time'].astype('int64') // 10**6
stm_trips_df.sample(5)

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id,realtime_arrival_time,schedule_relationship,arrival_time,departure_time,stop_sequence,scheduled_arrival_time
839999,1745543000.0,286570845,90,2025-04-24,61603,1745545304000,0,21:41:44,21:41:44,13,1745545304000
1997989,1745752000.0,286580963,168,2025-04-27,52732,1745753855000,0,07:37:35,07:37:35,5,1745753855000
1884670,1745716000.0,284302839,103,2025-04-26,50627,1745718066000,0,21:41:06,21:41:06,8,1745718066000
680835,1745525000.0,284738577,69,2025-04-24,55301,1745526902000,0,16:35:02,16:35:02,66,1745526902000
731697,1745528000.0,285010337,468,2025-04-24,50126,1745529577000,0,17:16:23,17:16:23,14,1745529383000
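
Row-wise `apply` can be slow on millions of rows. A vectorized variant of the same conversion is sketched below as a possible alternative, not as the notebook's method. It assumes that `'America/Toronto'` is an acceptable stand-in for `LOCAL_TIMEZONE` and that `pd.to_timedelta` parses the `HH:MM:SS` strings, including hours above 24.

```python
import pandas as pd

trips = pd.DataFrame({
    'start_date': pd.to_datetime(['20250422', '20250424'], format='%Y%m%d'),
    'arrival_time': ['25:05:08', '17:16:23'],
})

# Parse the GTFS strings as durations measured from the start of the service day.
offsets = pd.to_timedelta(trips['arrival_time'])

# Assumption: 'America/Toronto' stands in for LOCAL_TIMEZONE.
scheduled = (trips['start_date'] + offsets).dt.tz_localize('America/Toronto')

# Milliseconds since the Unix epoch, independent of the datetime resolution.
epoch = pd.Timestamp('1970-01-01', tz='UTC')
scheduled_ms = (scheduled.dt.tz_convert('UTC') - epoch) // pd.Timedelta(milliseconds=1)
print(scheduled_ms.tolist())
```

Dividing a timedelta by one millisecond avoids depending on the nanosecond resolution that `astype('int64') // 10**6` assumes.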


### Trips and Stops

In [54]:
merged_stm_df = pd.merge(left=stm_trips_df, right=stops_df, how='inner', left_on='stop_id', right_on='stop_code')
merged_stm_df.head()

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id_x,realtime_arrival_time,schedule_relationship,arrival_time,departure_time,stop_sequence,scheduled_arrival_time,stop_id_y,stop_code,stop_name,stop_lat,stop_lon,stop_url,location_type,parent_station,wheelchair_boarding
0,1745385000.0,285028348,189,2025-04-22,54433,1745384718000,0,25:05:08,25:05:08,20,1745384708000,54433,54433,Notre-Dame / No 10150,45.617546,-73.507835,https://www.stm.info/fr/recherche#stq=54433,0,,1
1,1745385000.0,285028348,189,2025-04-22,54444,1745384751000,0,25:05:51,25:05:51,21,1745384751000,54444,54444,Notre-Dame / Gamble,45.62163,-73.505533,https://www.stm.info/fr/recherche#stq=54444,0,,1
2,1745385000.0,285028348,189,2025-04-22,54445,1745384785000,0,25:06:25,25:06:25,22,1745384785000,54445,54445,Notre-Dame / No 10800,45.624606,-73.503332,https://www.stm.info/fr/recherche#stq=54445,0,,1
3,1745385000.0,285028348,189,2025-04-22,54451,1745384806000,0,25:06:46,25:06:46,23,1745384806000,54451,54451,Notre-Dame / Richard,45.62627,-73.501486,https://www.stm.info/fr/recherche#stq=54451,0,,1
4,1745385000.0,285028348,189,2025-04-22,54456,1745384829000,0,25:07:09,25:07:09,24,1745384829000,54456,54456,Notre-Dame / Hinton,45.628078,-73.499449,https://www.stm.info/fr/recherche#stq=54456,0,,1


In [55]:
# Keep relevant columns
merged_stm_df = merged_stm_df[[
  'trip_id',
  'route_id',
  'stop_id_x',
  'stop_lat',
  'stop_lon',
  'stop_sequence',
  'wheelchair_boarding',
  'realtime_arrival_time',
  'scheduled_arrival_time'
]]
merged_stm_df.sample(5)

Unnamed: 0,trip_id,route_id,stop_id_x,stop_lat,stop_lon,stop_sequence,wheelchair_boarding,realtime_arrival_time,scheduled_arrival_time
451254,285029979,189,53608,45.633674,-73.493652,36,1,1745499078000,1745499078000
140686,285282662,172,56676,45.465885,-73.553942,9,2,1745411896000,1745411896000
1464357,285006667,70,60292,45.513971,-73.683504,29,1,1745636280000,1745636280000
27435,285030263,364,53061,45.572203,-73.549672,41,1,1745399539000,1745399379000
735107,285008067,164,50193,45.524478,-73.707622,26,1,1745530434000,1745530140000


In [56]:
# Rename stop id
merged_stm_df = merged_stm_df.rename(columns={'stop_id_x': 'stop_id'})

In [57]:
# Convert route_id to integer
merged_stm_df['route_id'] = merged_stm_df['route_id'].astype('int64')

In [58]:
# Convert wheelchair_boarding to a binary flag (1 = accessible, 0 = otherwise)
merged_stm_df['wheelchair_boarding'] = (merged_stm_df['wheelchair_boarding'] == 1).astype('int64')

### STM and Weather

In [59]:
# Convert arrival timestamp to datetime
rt_arrival_dt = pd.to_datetime(merged_stm_df['realtime_arrival_time'], origin='unix', unit='ms', utc=True)
rt_arrival_dt.head()

0   2025-04-23 05:05:18+00:00
1   2025-04-23 05:05:51+00:00
2   2025-04-23 05:06:25+00:00
3   2025-04-23 05:06:46+00:00
4   2025-04-23 05:07:09+00:00
Name: realtime_arrival_time, dtype: datetime64[ns, UTC]

In [60]:
# Round arrival time to the nearest hour
merged_stm_df['rounded_arrival_dt'] = rt_arrival_dt.dt.round('h')

In [61]:
# Format time to match weather data
merged_stm_df['time'] = merged_stm_df['rounded_arrival_dt'].dt.strftime('%Y-%m-%dT%H:%M')
merged_stm_df.sample(5)

Unnamed: 0,trip_id,route_id,stop_id,stop_lat,stop_lon,stop_sequence,wheelchair_boarding,realtime_arrival_time,scheduled_arrival_time,rounded_arrival_dt,time
888206,286573308,37,51899,45.457082,-73.590388,17,1,1745551775000,1745551775000,2025-04-25 03:00:00+00:00,2025-04-25T03:00
1053480,286571866,107,56701,45.4392,-73.580966,16,1,1745583488000,1745583488000,2025-04-25 12:00:00+00:00,2025-04-25T12:00
523876,284728252,197,52395,45.563802,-73.571255,18,1,1745507180000,1745507040000,2025-04-24 15:00:00+00:00,2025-04-24T15:00
753783,285010322,209,57905,45.495508,-73.808813,30,1,1745533955000,1745533767000,2025-04-24 23:00:00+00:00,2025-04-24T23:00
1550662,286588850,198,57174,45.436606,-73.702576,17,1,1745662403000,1745662403000,2025-04-26 10:00:00+00:00,2025-04-26T10:00
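
The rounding and formatting steps together build the join key for the weather table. The standard-library sketch below reproduces that key for single timestamps; `weather_key` is a hypothetical helper, and it rounds exact half-hour ties upward, which may differ from how pandas breaks ties.

```python
from datetime import datetime, timezone

HOUR_MS = 3_600_000


def weather_key(epoch_ms: int) -> str:
    """Round an epoch-millisecond timestamp to the nearest UTC hour and format it."""
    # Adding half an hour before flooring rounds to the nearest hour (ties go up).
    rounded_ms = (epoch_ms + HOUR_MS // 2) // HOUR_MS * HOUR_MS
    rounded = datetime.fromtimestamp(rounded_ms / 1000, tz=timezone.utc)
    return rounded.strftime('%Y-%m-%dT%H:%M')


# Two realtime_arrival_time values from the sample above.
print(weather_key(1745551775000))  # 03:29:35 UTC, rounds down
print(weather_key(1745533955000))  # 22:32:35 UTC, rounds up
```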


In [62]:
# Merge STM data with historical weather
df = pd.merge(left=merged_stm_df, right=weather_df, how='left', on='time')
df.head()

Unnamed: 0,trip_id,route_id,stop_id,stop_lat,stop_lon,stop_sequence,wheelchair_boarding,realtime_arrival_time,scheduled_arrival_time,rounded_arrival_dt,time,temperature,precipitation,windspeed,weathercode
0,285028348,189,54433,45.617546,-73.507835,20,1,1745384718000,1745384708000,2025-04-23 05:00:00+00:00,2025-04-23T05:00,4.5,0.0,9.5,0.0
1,285028348,189,54444,45.62163,-73.505533,21,1,1745384751000,1745384751000,2025-04-23 05:00:00+00:00,2025-04-23T05:00,4.5,0.0,9.5,0.0
2,285028348,189,54445,45.624606,-73.503332,22,1,1745384785000,1745384785000,2025-04-23 05:00:00+00:00,2025-04-23T05:00,4.5,0.0,9.5,0.0
3,285028348,189,54451,45.62627,-73.501486,23,1,1745384806000,1745384806000,2025-04-23 05:00:00+00:00,2025-04-23T05:00,4.5,0.0,9.5,0.0
4,285028348,189,54456,45.628078,-73.499449,24,1,1745384829000,1745384829000,2025-04-23 05:00:00+00:00,2025-04-23T05:00,4.5,0.0,9.5,0.0


In [64]:
# Get rows with null weather
null_weather_mask = df.isna().any(axis=1)
df[null_weather_mask].sample(5)

Unnamed: 0,trip_id,route_id,stop_id,stop_lat,stop_lon,stop_sequence,wheelchair_boarding,realtime_arrival_time,scheduled_arrival_time,rounded_arrival_dt,time,temperature,precipitation,windspeed,weathercode
1960888,284330195,363,50934,45.538957,-73.631096,34,1,1745737537000,1745737527000,2025-04-27 07:00:00+00:00,2025-04-27T07:00,,,,
1750312,286575050,24,51950,45.491563,-73.587763,31,1,1745697578000,1745697578000,2025-04-26 20:00:00+00:00,2025-04-26T20:00,,,,
1261013,284739389,48,55052,45.593724,-73.644549,45,1,1745610428000,1745610428000,2025-04-25 20:00:00+00:00,2025-04-25T20:00,,,,
790602,284740819,48,61638,45.659542,-73.541353,48,1,1745537608000,1745537498000,2025-04-25 00:00:00+00:00,2025-04-25T00:00,,,,
1380547,284740167,69,50267,45.548512,-73.678792,21,1,1745622728000,1745622728000,2025-04-25 23:00:00+00:00,2025-04-25T23:00,,,,


In [65]:
# Get proportion of rows with null weather
print(f'{(null_weather_mask.sum() / len(df)):.2%}')

60.78%


In [66]:
# Separate null and non-null rows
not_null_df = df[~null_weather_mask]
null_df = df[null_weather_mask]

In [67]:
# Fetch forecast weather
start_date = null_df['rounded_arrival_dt'].min().strftime('%Y-%m-%d')
end_date = null_df['rounded_arrival_dt'].max().strftime('%Y-%m-%d')

weather_list = fetch_weather(start_date=start_date, end_date=end_date, forecast=True)
weather_df = pd.DataFrame(weather_list)
weather_df.head()

Unnamed: 0,time,temperature,precipitation,windspeed,weathercode
0,2025-04-25T00:00,8.9,0.0,10.7,0
1,2025-04-25T01:00,8.4,0.0,6.9,0
2,2025-04-25T02:00,7.7,0.0,8.0,2
3,2025-04-25T03:00,6.9,0.0,8.4,0
4,2025-04-25T04:00,6.5,0.0,6.6,0


In [68]:
# Merge null weather dataframe with forecast
null_df = null_df.drop(['temperature', 'precipitation', 'windspeed', 'weathercode'], axis=1)
null_df = pd.merge(left=null_df, right=weather_df, how='inner', on='time')
null_df.head()

Unnamed: 0,trip_id,route_id,stop_id,stop_lat,stop_lon,stop_sequence,wheelchair_boarding,realtime_arrival_time,scheduled_arrival_time,rounded_arrival_dt,time,temperature,precipitation,windspeed,weathercode
0,285010565,968,60296,45.514212,-73.684175,1,1,1745539337000,1745533800000,2025-04-25 00:00:00+00:00,2025-04-25T00:00,8.9,0.0,10.7,0
1,285010565,968,61988,45.510414,-73.81174,3,1,1745540656000,1745535900000,2025-04-25 00:00:00+00:00,2025-04-25T00:00,8.9,0.0,10.7,0
2,284779443,166,51290,45.496392,-73.616885,17,1,1745537409000,1745531880000,2025-04-25 00:00:00+00:00,2025-04-25T00:00,8.9,0.0,10.7,0
3,284779443,166,51252,45.495321,-73.618952,18,1,1745537501000,1745531972000,2025-04-25 00:00:00+00:00,2025-04-25T00:00,8.9,0.0,10.7,0
4,284779443,166,51254,45.494653,-73.619497,19,1,1745537533000,1745532004000,2025-04-25 00:00:00+00:00,2025-04-25T00:00,8.9,0.0,10.7,0


In [69]:
# Merge null and non-null weather dataframes
df = pd.concat([not_null_df, null_df]).reset_index()
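
The split, re-merge and concatenate sequence amounts to "use the archive where it exists, otherwise use the forecast". The same idea can be sketched with a single merge by combining both weather sources first. The frames below are toy data, and the sketch assumes the archive simply has no row for hours it does not cover.

```python
import pandas as pd

stm = pd.DataFrame({
    'time': ['2025-04-23T05:00', '2025-04-25T00:00', '2025-04-27T07:00'],
})
archive = pd.DataFrame({'time': ['2025-04-23T05:00'], 'temperature': [4.5]})
forecast = pd.DataFrame({
    'time': ['2025-04-23T05:00', '2025-04-25T00:00'],
    'temperature': [5.0, 8.9],  # toy values
})

# Archive rows come first, so keep='first' prefers them over forecast rows.
weather = (
    pd.concat([archive, forecast], ignore_index=True)
    .drop_duplicates(subset='time', keep='first')
)

merged = stm.merge(weather, how='left', on='time')
still_missing = merged['temperature'].isna().sum()
print(merged)
```

Unlike the inner merge used for the forecast rows, this keeps hours that neither source covers, so they would still need to be dropped or imputed explicitly.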

In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2046917 entries, 0 to 2046916
Data columns (total 16 columns):
 #   Column                  Dtype              
---  ------                  -----              
 0   index                   int64              
 1   trip_id                 int64              
 2   route_id                int64              
 3   stop_id                 int64              
 4   stop_lat                float64            
 5   stop_lon                float64            
 6   stop_sequence           int64              
 7   wheelchair_boarding     int64              
 8   realtime_arrival_time   int64              
 9   scheduled_arrival_time  int64              
 10  rounded_arrival_dt      datetime64[ns, UTC]
 11  time                    object             
 12  temperature             float64            
 13  precipitation           float64            
 14  windspeed               float64            
 15  weathercode             float64            
dtypes: datetime64[ns, UTC](1), float64(6), int64(8), object(1)

In [71]:
df.describe()

Unnamed: 0,index,trip_id,route_id,stop_id,stop_lat,stop_lon,stop_sequence,wheelchair_boarding,realtime_arrival_time,scheduled_arrival_time,temperature,precipitation,windspeed,weathercode
count,2046917.0,2046917.0,2046917.0,2046917.0,2046917.0,2046917.0,2046917.0,2046917.0,2046917.0,2046917.0,2046917.0,2046917.0,2046917.0,2046917.0
mean,535546.3,285122300.0,148.6693,54824.8,45.52683,-73.63253,23.24688,0.9381646,1745577000000.0,1745577000000.0,10.10562,0.1844518,10.84386,13.00501
std,333297.7,815580.9,126.7947,3179.528,0.06347622,0.08821122,16.57459,0.2408565,98488130.0,98486320.0,3.775451,0.5508451,4.443218,21.50994
min,0.0,283211500.0,10.0,50101.0,45.40267,-73.9562,1.0,0.0,1745384000000.0,1745384000000.0,1.2,0.0,0.6,0.0
25%,255864.0,284738800.0,55.0,52159.0,45.47673,-73.66536,10.0,1.0,1745505000000.0,1745505000000.0,7.6,0.0,6.9,2.0
50%,511729.0,285008000.0,121.0,54525.0,45.52006,-73.61511,20.0,1.0,1745583000000.0,1745583000000.0,9.6,0.0,11.2,3.0
75%,767612.0,285283000.0,192.0,56929.0,45.5723,-73.5724,33.0,1.0,1745661000000.0,1745661000000.0,12.7,0.0,14.2,3.0
max,1244145.0,286594900.0,968.0,62442.0,45.70112,-73.48058,117.0,1.0,1745766000000.0,1745766000000.0,17.9,5.1,22.3,75.0


## Export Data

In [72]:
# Keep relevant columns
df = df[['trip_id', 'route_id', 'stop_id', 'stop_lat', 'stop_lon',
       'stop_sequence', 'wheelchair_boarding', 'realtime_arrival_time',
       'scheduled_arrival_time', 'temperature', 'precipitation', 'windspeed', 'weathercode']]

In [73]:
# Export data to CSV
df.to_csv('../data/stm_weather_merged.csv', index=False)

## End