# STM Transit Delay Data Preparation

## Overview

This notebook cleans and merges data collected from [STM](https://www.stm.info/en/about/developers), [Open-Meteo](https://open-meteo.com/en/docs) and [Tomtom](https://developer.tomtom.com/) and prepares it for data analysis and/or preprocessing.

## Data Description

### STM Real-time Trip Updates

`current_time`: Timestamp when the data was fetched from the GTFS, in milliseconds.<br>
`trip_id`: Unique identifier for the transit trip.<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`start_date`: Start date of the transit trip.<br>
`stop_id`: Unique identifier of a stop.<br>
`arrival_time`, `departure_time`: Realtime arrival and departure time, in seconds<br>
`schedule_relationship`: State of the trip, 0 meaning "scheduled", 1 meaning "skipped" and 2 meaning "no data".

### STM Scheduled Trips

`trip_id`: Unique identifier for the transit trip.<br>
`arrival_time`, `departure_time`: Scheduled arrival and departure time.<br>
`stop_id`: Unique identifier of a stop.<br>
`stop_sequence`: Sequence of a stop, for ordering.

### STM Stops

`stop_id`: Unique identifier of a stop.<br>
`stop_code`: Bus stop or metro station number.<br>
`stop_name`: Bus stop or metro station name<br>
`stop_lat`, `stop_lon`: Stop coordinates.<br>
`stop_url`: Stop web page.<br>
`location_type`: Stop type.<br>
`parent_station`: Parent station (metro station with multiple exits).<br>
`wheelchair_boarding`: Indicates if the stop is accessible for people in wheelchair, 1 being true and 2 being false.

### STM Real-time Vehicle Positions

`current_time`: Timestamp when the data was fetched from the GTFS, in milliseconds.<br>
`vehicle_id`: Unique identifier for a vehicle.<br>
`trip_id`: Unique identifier for the transit trip.<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`start_date`: Start date of a transit trip.<br>
`start_time`: Start time of a transit trip.<br>
`latitude`, `longitude`: Vehicle current position.<br>
`bearing`: Direction that the vehicle is facing, from 0 to 360 degrees.<br>
`speed`: Momentary speed measured by the vehicle, in meters per second.<br>
`stop_sequence`: Sequence of a stop, for ordering.<br>
`status`: Vehicle stop status in relation with a stop that it's currently approaching or is at, 1 being "stopped at" and 2 being "in transit to".<br>
`timestamp`: Timestamp when STM updated the data, in seconds.<br>
`occupancy_status`: Degree of passenger occupancy, ranging from 1 (empty) to 7 (not accepting passengers).

### STM Service Alerts

`start_time`: Start time of the alert, in seconds.<br>
`end_time`: End time of the alert, in seconds.<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`stop_id`: Unique identifier of a stop.

### STM Route Types

`route_id`: Unique identifier for a bus or metro line.<br>
`route_type`: Type of bus line (e.g. Night)<br>

### Open-Meteo Weather Archive and Forecast

`time`: Date and hour or the weather.<br>
`temperature`: Air temperature at 2 meters above ground, in Celsius.<br>
`relative_humidity`: Relative humidity at 2 meters above ground, in percentage.<br>
`dew_point`: Dew point temperature at 2 meters above ground, in Celsius.<br>
`precipitation`: Total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters.<br>
`pressure`: Atmospheric air pressure reduced to mean sea level (msl), in hPa.<br>
`visibility`: Viewing distance in meters.<br>
`cloud_cover`: Total cloud cover as an area fraction.<br>
`windspeed`: Wind speed at 10 meters above ground, in kilometers per hour.<br>
`wind_direction`: Wind direction at 10 meters above ground.<br>
`wind_gusts`: Gusts at 10 meters above ground as a maximum of the preceding hour.<br>

### Tomtom Traffic Incidents

`category`: Category of the incident.<br>
`start_time`: Start time of the incident, in ISO8601 format.<br>
`end_time`: End time of the incident, in ISO8601 format.<br>
`length`: Length of the incident in meters.<br>
`delay`: Delay in seconds caused by the incident (except road closures).<br>
`magnitude_of_delay`: Severity of the delay, ranging from 0 to 4 (minor to major).<br>
`last_report_time`: Date when the last time the incident was reported,in ISO8601 format.<br>
`latitude`, `longitude`: Coordinates of the incident.

## Imports

In [1]:
from datetime import datetime, timedelta, timezone
import geopandas as gpd
import numpy as np
import pandas as pd
import sys

In [2]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import fetch_weather, LOCAL_TIMEZONE, OCCUPANCY_STATUS, SCHEDULE_RELATIONSHIP

In [3]:
# Import data
trips_df = pd.read_csv('../data/api/fetched_stm_trip_updates.csv',low_memory=False)
schedules_df = pd.read_csv('../data/download/stop_times_2025-04-30.txt')
stops_df = pd.read_csv('../data/download/stops_2025-04-30.txt')
positions_df =  pd.read_csv('../data/api/fetched_stm_vehicle_positions.csv', low_memory=False)
alerts_df = pd.read_csv('../data/api/fetched_stm_service_alerts.csv')
routes_df = pd.read_csv('../data/route_types.csv')
traffic_df = pd.read_csv('../data/api/fetched_traffic.csv')
weather_df = pd.read_csv('../data/api/fetched_historical_weather.csv')

## Merge Data

### Schedules and stops

In [4]:
# Sort values by stop sequence
schedules_df = schedules_df.sort_values(by=['trip_id', 'stop_sequence'])

In [5]:
# Add trip progress (vehicles further along the trip are more likely to be delayed)
total_stops = schedules_df.groupby('trip_id')['stop_id'].transform('count')
schedules_df['trip_progress'] = schedules_df['stop_sequence'] / total_stops

In [6]:
# Get distribution of trip progress
schedules_df['trip_progress'].describe()

count    6.629625e+06
mean     5.139220e-01
std      2.887289e-01
min      8.547009e-03
25%      2.631579e-01
50%      5.142857e-01
75%      7.647059e-01
max      1.000000e+00
Name: trip_progress, dtype: float64

In [7]:
# Merge schedules and stops
schedules_stops_df = pd.merge(left=schedules_df, right=stops_df, how='inner', left_on='stop_id', right_on='stop_code') \
	.rename(columns={'stop_id_x': 'stop_id'}) \
	.drop(['stop_id_y', 'stop_code', 'stop_url'], axis=1)

In [8]:
schedules_stops_df.columns

Index(['trip_id', 'arrival_time', 'departure_time', 'stop_id', 'stop_sequence',
       'trip_progress', 'stop_name', 'stop_lat', 'stop_lon', 'location_type',
       'parent_station', 'wheelchair_boarding'],
      dtype='object')

In [9]:
# Get coordinates of previous stop
schedules_stops_df = schedules_stops_df.sort_values(by=['trip_id', 'stop_sequence'])
schedules_stops_df['prev_lat'] = schedules_stops_df.groupby('trip_id')['stop_lat'].shift(1)
schedules_stops_df['prev_lon'] = schedules_stops_df.groupby('trip_id')['stop_lon'].shift(1)

In [10]:
# Make sure the null values are from first stops
prev_null_mask = (schedules_stops_df['prev_lat'].isna()) | (schedules_stops_df['prev_lon'].isna())
first_stop_mask = schedules_stops_df['stop_sequence'] == 1
assert prev_null_mask.sum() == first_stop_mask.sum()

In [11]:
# Create GeoDataFrames for previous and current stop
sch_gdf1 = gpd.GeoDataFrame(
  schedules_stops_df[['prev_lon', 'prev_lat']],
  geometry=gpd.points_from_xy(schedules_stops_df['prev_lon'], schedules_stops_df['prev_lat']),
  crs='EPSG:4326' # WGS84 (sea level)
).to_crs(epsg=3857) # Convert to metric for buffering (meters)

sch_gdf2 = gpd.GeoDataFrame(
  schedules_stops_df[['stop_lon', 'stop_lat']],
  geometry=gpd.points_from_xy(schedules_stops_df['stop_lon'], schedules_stops_df['stop_lat']),
  crs='EPSG:4326'
).to_crs(epsg=3857)

In [12]:
# Calculate distance from previous stop
schedules_stops_df['stop_distance'] = sch_gdf1.distance(sch_gdf2)
schedules_stops_df['stop_distance'].describe()

count    6.216518e+06
mean     4.019345e+02
std      7.161721e+02
min      2.092191e+01
25%      2.490739e+02
50%      3.286372e+02
75%      4.299291e+02
max      1.965439e+04
Name: stop_distance, dtype: float64

In [13]:
# Drop previous coordinates
schedules_stops_df = schedules_stops_df.drop(['prev_lat', 'prev_lon'], axis=1)

In [14]:
# Replace null distances by zero (first stop of the trip)
schedules_stops_df['stop_distance'] = schedules_stops_df['stop_distance'].fillna(0)
assert schedules_stops_df['stop_distance'].isna().sum() == 0

In [15]:
schedules_stops_df['stop_distance'].describe()

count    6.388741e+06
mean     3.910995e+02
std      7.094460e+02
min      0.000000e+00
25%      2.428608e+02
50%      3.243015e+02
75%      4.266621e+02
max      1.965439e+04
Name: stop_distance, dtype: float64

In [16]:
# Get stop with largest distance
schedules_stops_df.iloc[schedules_stops_df['stop_distance'].idxmax()]

trip_id                                    281377832
arrival_time                                05:33:00
departure_time                              05:33:00
stop_id                                        61261
stop_sequence                                     13
trip_progress                                    1.0
stop_name              YUL Aéroport Montréal-Trudeau
stop_lat                                   45.456622
stop_lon                                  -73.751615
location_type                                      0
parent_station                                   NaN
wheelchair_boarding                                1
stop_distance                            19654.39188
Name: 124692, dtype: object

After double-checking in the STM website, the large distance make sense because the bus stops before or after this one are far away.

### Realtime and Scheduled Trips

In [17]:
# Convert route_id to integer
trips_df['route_id'] = trips_df['route_id'].str.extract(r'(\d+)')
trips_df['route_id'] = trips_df['route_id'].astype('int64')

In [18]:
# Get proportion of duplicates
subset = trips_df.drop('current_time', axis=1).columns
duplicate_mask = trips_df.duplicated(subset=subset)
print(f'{duplicate_mask.mean():.2%}')

10.12%


In [19]:
# Remove duplicates
trips_df = trips_df.drop_duplicates(subset=subset, keep='last')

In [20]:
# Rename arrival and departure time
trips_df = trips_df.rename(columns={'arrival_time': 'rt_arrival_time','departure_time': 'rt_departure_time'})

In [21]:
# Merge trip updates with schedule
stm_trips_df = pd.merge(left=trips_df, right=schedules_stops_df, how='inner', on=['trip_id', 'stop_id'])

In [22]:
stm_trips_df.columns

Index(['current_time', 'trip_id', 'route_id', 'start_date', 'stop_id',
       'rt_arrival_time', 'rt_departure_time', 'schedule_relationship',
       'arrival_time', 'departure_time', 'stop_sequence', 'trip_progress',
       'stop_name', 'stop_lat', 'stop_lon', 'location_type', 'parent_station',
       'wheelchair_boarding', 'stop_distance'],
      dtype='object')

In [23]:
# Convert start_date to datetime
stm_trips_df['start_date_dt'] = pd.to_datetime(stm_trips_df['start_date'], format='%Y%m%d')

In [24]:
def parse_gtfs_time(df:pd.DataFrame, date_column:str, time_column:str) -> pd.Series:
	'''
	Converts GTFS time string (e.g., '25:30:00') to localized datetime
	based on the arrival or departure time.
	'''
	time_columns = ['hours', 'minutes', 'seconds']
	split_cols = df[time_column].str.split(':', expand=True).apply(pd.to_numeric)
	split_cols.columns = time_columns
	seconds_delta = (split_cols['hours'] * 3600) + (split_cols['minutes'] * 60) + split_cols['seconds']
	
	# Convert datetime to seconds
	start_seconds = df[date_column].astype('int') / 10**9

	# Add seconds 
	total_seconds = start_seconds + seconds_delta

	# Convert to datetime
	parsed_time = pd.to_datetime(total_seconds, origin='unix', unit='s').dt.tz_localize(LOCAL_TIMEZONE)

	return parsed_time

In [25]:
# Parse GTFS scheduled arrival and departure time
parsed_arrival_time = parse_gtfs_time(stm_trips_df, 'start_date_dt', 'arrival_time')
parsed_departure_time = parse_gtfs_time(stm_trips_df, 'start_date_dt', 'departure_time')

In [26]:
# Convert scheduled arrival and departure time to UTC datetime
stm_trips_df['sch_arrival_time'] = parsed_arrival_time.dt.tz_convert(timezone.utc)
stm_trips_df['sch_departure_time'] = parsed_departure_time.dt.tz_convert(timezone.utc)

In [27]:
# Get rows where scheduled arrival and departure time are different
stm_trips_df[stm_trips_df['sch_arrival_time'] != stm_trips_df['sch_departure_time']]

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id,rt_arrival_time,rt_departure_time,schedule_relationship,arrival_time,departure_time,...,stop_name,stop_lat,stop_lon,location_type,parent_station,wheelchair_boarding,stop_distance,start_date_dt,sch_arrival_time,sch_departure_time
5232,1.745791e+09,284216122,470,20250427,57794,1745792340,1745792340,0,18:19:00,18:20:00,...,Terminus Fairview,45.465697,-73.831619,0,,1,391.127323,2025-04-27,2025-04-27 22:19:00+00:00,2025-04-27 22:20:00+00:00
10326,1.745791e+09,284216100,470,20250427,60415,1745792054,1745792054,0,18:06:00,18:07:00,...,Terminus Fairview / Brunswick,45.465962,-73.831866,0,,1,782.045238,2025-04-27,2025-04-27 22:06:00+00:00,2025-04-27 22:07:00+00:00
10754,1.745791e+09,284216575,747,20250427,60997,0,1745791860,0,18:06:00,18:11:00,...,YUL Aéroport Montréal-Trudeau,45.456567,-73.751618,0,,1,0.000000,2025-04-27,2025-04-27 22:06:00+00:00,2025-04-27 22:11:00+00:00
11788,1.745791e+09,284216533,747,20250427,60997,0,1745794440,0,18:49:00,18:54:00,...,YUL Aéroport Montréal-Trudeau,45.456567,-73.751618,0,,1,0.000000,2025-04-27,2025-04-27 22:49:00+00:00,2025-04-27 22:54:00+00:00
12297,1.745791e+09,284216124,470,20250427,60415,1745795640,1745795700,0,19:14:00,19:15:00,...,Terminus Fairview / Brunswick,45.465962,-73.831866,0,,1,782.045238,2025-04-27,2025-04-27 23:14:00+00:00,2025-04-27 23:15:00+00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5774218,1.746551e+09,284727843,161,20250506,50819,1746552660,1746552780,0,13:31:00,13:33:00,...,Station Plamondon (Van Horne / Victoria),45.494232,-73.637641,0,,1,328.056105,2025-05-06,2025-05-06 17:31:00+00:00,2025-05-06 17:33:00+00:00
5774246,1.746551e+09,284728077,18,20250506,53851,1746552060,1746552180,0,13:21:00,13:23:00,...,Station Beaubien,45.535535,-73.604655,0,,1,271.004292,2025-05-06,2025-05-06 17:21:00+00:00,2025-05-06 17:23:00+00:00
5774722,1.746551e+09,284741173,56,20250506,50792,1746553320,1746553440,0,13:42:00,13:44:00,...,Station Crémazie Nord,45.546503,-73.638938,0,,1,390.332123,2025-05-06,2025-05-06 17:42:00+00:00,2025-05-06 17:44:00+00:00
5776390,1.746551e+09,284727983,161,20250506,50820,1746553140,1746553260,0,13:39:00,13:41:00,...,Station Plamondon (Van Horne / Victoria),45.494105,-73.637966,0,,1,449.523333,2025-05-06,2025-05-06 17:39:00+00:00,2025-05-06 17:41:00+00:00


In [28]:
# Replace 0 timestamps with NaN
stm_trips_df['rt_arrival_time'] = stm_trips_df['rt_arrival_time'].replace({0: np.nan})
stm_trips_df['rt_departure_time'] = stm_trips_df['rt_departure_time'].replace({0: np.nan})

In [29]:
# Convert realtime arrival and departure time to UTC datetime
stm_trips_df['rt_arrival_time'] = pd.to_datetime(stm_trips_df['rt_arrival_time'], origin='unix', unit='s', utc=True)
stm_trips_df['rt_departure_time'] = pd.to_datetime(stm_trips_df['rt_departure_time'], origin='unix', unit='s', utc=True)

In [30]:
# Calculate delay (realtime - scheduled)
# Start with arrival time, if null, calculate with departure time
stm_trips_df['delay'] = (stm_trips_df['rt_arrival_time'] - stm_trips_df['sch_arrival_time']) / pd.Timedelta(seconds=1)
stm_trips_df['delay'] = stm_trips_df['delay'].fillna(((stm_trips_df['rt_departure_time'] - stm_trips_df['sch_departure_time']) / pd.Timedelta(seconds=1)))

In [31]:
# Get distribution
stm_trips_df['delay'].describe()

count    5.628233e+06
mean     6.568096e+01
std      4.642355e+02
min     -1.359200e+04
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      5.458500e+04
Name: delay, dtype: float64

In [32]:
# Get null delays count
print(stm_trips_df['delay'].isna().sum())

149201


In [33]:
# Replace the null delays with the average delay by route, stop, day of week and hour
stm_trips_df['day_of_week'] = stm_trips_df['sch_arrival_time'].dt.day_of_week
stm_trips_df['hour'] = stm_trips_df['sch_arrival_time'].dt.hour
stm_trips_df['delay'] = stm_trips_df['delay'] \
	.fillna(stm_trips_df.groupby(['route_id', 'stop_id', 'day_of_week', 'hour'])['delay'].transform('mean')) \
	.fillna(stm_trips_df.groupby(['route_id', 'stop_id', 'day_of_week'])['delay'].transform('mean')) \
	.fillna(stm_trips_df.groupby(['route_id', 'stop_id'])['delay'].transform('mean')) \
	.fillna(stm_trips_df.groupby('route_id')['delay'].transform('mean'))
assert stm_trips_df['delay'].isna().sum() == 0

In [34]:
# Make sure the distribution didn't change too much
stm_trips_df['delay'].describe()

count    5.777434e+06
mean     6.629402e+01
std      4.609923e+02
min     -1.359200e+04
25%      0.000000e+00
50%      0.000000e+00
75%      6.000000e+00
max      5.458500e+04
Name: delay, dtype: float64

In [35]:
stm_trips_df.columns

Index(['current_time', 'trip_id', 'route_id', 'start_date', 'stop_id',
       'rt_arrival_time', 'rt_departure_time', 'schedule_relationship',
       'arrival_time', 'departure_time', 'stop_sequence', 'trip_progress',
       'stop_name', 'stop_lat', 'stop_lon', 'location_type', 'parent_station',
       'wheelchair_boarding', 'stop_distance', 'start_date_dt',
       'sch_arrival_time', 'sch_departure_time', 'delay', 'day_of_week',
       'hour'],
      dtype='object')

In [36]:
# Remove uneeded columns
stm_trips_df = stm_trips_df.drop(['current_time', 'arrival_time', 'departure_time', 'start_date_dt', 'day_of_week', 'hour'], axis=1)

### Trips and Traffic Data

In [37]:
# Get proportion of duplicates
duplicate_mask = traffic_df.duplicated()
print(f'{duplicate_mask.mean():.2%}')

37.86%


In [38]:
# Remove duplicates
traffic_df = traffic_df.drop_duplicates(keep='last').reset_index(drop=True)

In [39]:
# Convert traffic start_time and end_time to datetime
traffic_df['start_time'] = pd.to_datetime(traffic_df['start_time'], utc=True)
traffic_df['end_time'] = pd.to_datetime(traffic_df['end_time'], utc=True)

In [40]:
# Sort by date
traffic_df = traffic_df.sort_values(by='start_time').reset_index(drop=True)

In [41]:
# Fill null end times with current time (assuming the incident is still ongoing)
traffic_df['end_time'] = traffic_df['end_time'].fillna(datetime.now(timezone.utc).replace(microsecond=0))
assert traffic_df['end_time'].isna().sum() == 0

In [42]:
# Create GeoDataFrame for trip updates
stm_trips_gdf = gpd.GeoDataFrame(
  stm_trips_df,
  geometry=gpd.points_from_xy(stm_trips_df['stop_lon'], stm_trips_df['stop_lat']),
  crs='EPSG:4326'
).to_crs(epsg=3857)

In [43]:
# Create GeoDataFrame for traffic incidents
traffic_gdf = gpd.GeoDataFrame(
    traffic_df,
    geometry=gpd.points_from_xy(traffic_df['longitude'], traffic_df['latitude']),
    crs='EPSG:4326'
).to_crs(epsg=3857)

In [44]:
# Perform spatial join with nearest incidents
joined = gpd.sjoin_nearest(
  left_df=stm_trips_gdf,
  right_df=traffic_gdf,
  how='left',
  max_distance=500, # get incidents within 500 m
  distance_col='distance')

In [45]:
joined.columns

Index(['trip_id', 'route_id', 'start_date', 'stop_id', 'rt_arrival_time',
       'rt_departure_time', 'schedule_relationship', 'stop_sequence',
       'trip_progress', 'stop_name', 'stop_lat', 'stop_lon', 'location_type',
       'parent_station', 'wheelchair_boarding', 'stop_distance',
       'sch_arrival_time', 'sch_departure_time', 'delay_left', 'geometry',
       'index_right', 'category', 'start_time', 'end_time', 'length',
       'delay_right', 'magnitude_of_delay', 'last_report_time', 'latitude',
       'longitude', 'distance'],
      dtype='object')

In [46]:
# Filter joined incidents by time overlap
active_incident = (joined['start_time'] <= joined['sch_arrival_time']) & (joined['end_time'] >= joined['sch_arrival_time'])
joined = joined[active_incident].copy()

In [47]:
# Group by trip index (from trip_gdf) to attach traffic features
incident_summary = joined.groupby('index_right').agg({
	'category': pd.Series.mode,
	'delay_right': 'mean',
   	'distance': 'mean',
	'magnitude_of_delay': pd.Series.mode,
	'geometry': 'count'  # number of incidents
}).rename(columns={
	'category': 'incident_category',
	'delay_right': 'incident_avg_delay',
  	'distance': 'avg_distance_to_incident',
  	'magnitude_of_delay': 'incident_delay_magnitude',
	'geometry': 'incident_count',
})

In [48]:
# Create boolean column incident_nearby
incident_summary['incident_nearby'] = (incident_summary['incident_count'] > 0).astype('int64')

In [49]:
# Merge back to original trip_updates
trips_traffic_df = stm_trips_gdf.reset_index().merge(incident_summary, left_index=True, right_on='index_right', how='left')

In [50]:
# Fill missing values for trips with no incidents
trips_traffic_df = trips_traffic_df.fillna({
	'incident_category': np.nan,
	'incident_delay': np.nan,
	'avg_distance_to_incident': np.nan,
  	'incident_delay_magnitude': np.nan,
	'incident_count': 0,
	'incident_nearby': 0
})

In [51]:
# Drop uneeded columns
trips_traffic_df = trips_traffic_df.drop(columns=['geometry', 'index_right', 'index'], axis=1)

### Vehicle Positions

In [52]:
# Sort values
positions_df = positions_df.sort_values(by=['trip_id', 'stop_sequence', 'timestamp'])

In [53]:
# Get proportion of duplicates
subset = ['trip_id', 'stop_sequence']
duplicate_mask = positions_df.duplicated(subset=subset)
print(f'{duplicate_mask.mean():.2%}')

38.20%


In [54]:
# Drop duplicates
positions_df = positions_df.drop_duplicates(subset=subset, keep='last')

In [55]:
# Rename latitude and longitude
positions_df = positions_df.rename(columns={
  'latitude': 'vehicle_lat',
  'longitude': 'vehicle_lon',
  'status': 'vehicle_status',
  'bearing': 'vehicle_bearing',
  'speed': 'vehicle_speed',
  'timestamp': 'vehicle_dt'
})

In [56]:
# Merge positions
stm_trips_positions_df = pd.merge(left=trips_traffic_df, right=positions_df, how='inner', on=['trip_id', 'stop_sequence']) \
	.rename(columns={'route_id_x': 'route_id'}) \
	.drop(['current_time', 'route_id_y', 'start_date_x', 'start_date_y', 'start_time'], axis=1)

In [57]:
stm_trips_positions_df.columns

Index(['trip_id', 'route_id', 'stop_id', 'rt_arrival_time',
       'rt_departure_time', 'schedule_relationship', 'stop_sequence',
       'trip_progress', 'stop_name', 'stop_lat', 'stop_lon', 'location_type',
       'parent_station', 'wheelchair_boarding', 'stop_distance',
       'sch_arrival_time', 'sch_departure_time', 'delay', 'incident_category',
       'incident_avg_delay', 'avg_distance_to_incident',
       'incident_delay_magnitude', 'incident_count', 'incident_nearby',
       'vehicle_id', 'vehicle_lat', 'vehicle_lon', 'vehicle_bearing',
       'vehicle_speed', 'vehicle_status', 'vehicle_dt', 'occupancy_status'],
      dtype='object')

In [58]:
# Create GeoDataFrames for vehicle and stop positions
pos_gdf1 = gpd.GeoDataFrame(
  stm_trips_positions_df[['vehicle_lon', 'vehicle_lat']],
  geometry=gpd.points_from_xy(stm_trips_positions_df['vehicle_lon'], stm_trips_positions_df['vehicle_lat']),
  crs='EPSG:4326'
).to_crs(epsg=3857)

pos_gdf2 = gpd.GeoDataFrame(
  stm_trips_positions_df[['stop_lon', 'stop_lat']],
  geometry=gpd.points_from_xy(stm_trips_positions_df['stop_lon'], stm_trips_positions_df['stop_lat']),
  crs='EPSG:4326'
).to_crs(epsg=3857)

In [59]:
# Calculate the vehicle distance from the current stop
stm_trips_positions_df['vehicle_distance'] = pos_gdf1.distance(pos_gdf2)
stm_trips_positions_df['vehicle_distance'].describe()

count    1.418698e+06
mean     5.004887e+02
std      4.513786e+04
min      3.001229e-02
25%      1.182238e+01
50%      1.646870e+02
75%      3.267283e+02
max      9.992432e+06
Name: vehicle_distance, dtype: float64

In [60]:
# The large vehicle distances don't seem to make sense with the delays, the coordinates and distance won't be used
further_than_previous_stop = stm_trips_positions_df['vehicle_distance'] > stm_trips_positions_df['stop_distance']
stm_trips_positions_df[further_than_previous_stop].sort_values('vehicle_distance', ascending=False)

Unnamed: 0,trip_id,route_id,stop_id,rt_arrival_time,rt_departure_time,schedule_relationship,stop_sequence,trip_progress,stop_name,stop_lat,...,incident_nearby,vehicle_id,vehicle_lat,vehicle_lon,vehicle_bearing,vehicle_speed,vehicle_status,vehicle_dt,occupancy_status,vehicle_distance
1077083,283608587,43,55258,2025-05-03 23:28:18+00:00,2025-05-03 23:28:18+00:00,0,34,0.548387,Lacordaire / Henri-Bourassa,45.606293,...,0.0,36015,0.000950,-0.001567,0.0,9.166739,2,1746315053,1,9.992432e+06
1075045,283608587,43,55258,2025-05-03 23:21:33+00:00,2025-05-03 23:21:33+00:00,0,34,0.548387,Lacordaire / Henri-Bourassa,45.606293,...,0.0,36015,0.000950,-0.001567,0.0,9.166739,2,1746315053,1,9.992432e+06
1111821,283854581,192,54882,2025-05-04 20:28:19+00:00,2025-05-04 20:28:19+00:00,0,20,0.408163,Robert / Aimé-Renaud,45.589070,...,0.0,36040,0.001783,-0.002900,0.0,2.500020,2,1746390571,1,9.988454e+06
805253,285031853,49,53684,2025-05-02 00:27:46+00:00,2025-05-02 00:27:46+00:00,0,40,0.645161,Perras / 49e Avenue,45.652292,...,0.0,36031,0.031000,-0.050567,312.0,8.611180,2,1746060348,1,9.986185e+06
1298650,285031853,49,53684,2025-05-06 00:27:32+00:00,2025-05-06 00:27:32+00:00,0,40,0.645161,Perras / 49e Avenue,45.652292,...,0.0,36031,0.031000,-0.050567,312.0,8.611180,2,1746060348,1,9.986185e+06
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
723175,285010221,419,57822,NaT,2025-05-01 18:54:00+00:00,0,1,0.052632,Brunswick / Terminus Fairview,45.466579,...,0.0,28099,45.466579,-73.830933,328.0,0.000000,1,1746470688,1,8.144308e-02
871729,285010143,419,57822,NaT,2025-05-02 12:46:00+00:00,0,1,0.052632,Brunswick / Terminus Fairview,45.466579,...,0.0,38073,45.466579,-73.830933,0.0,0.000000,1,1746535477,1,8.144308e-02
272579,285010147,419,57822,NaT,2025-04-29 14:47:00+00:00,0,1,0.052632,Brunswick / Terminus Fairview,45.466579,...,0.0,31148,45.466579,-73.830933,0.0,0.000000,1,1746024338,1,8.144308e-02
1142567,285010163,419,57822,NaT,2025-05-05 11:55:00+00:00,0,1,0.052632,Brunswick / Terminus Fairview,45.466579,...,0.0,28008,45.466579,-73.830933,330.0,0.000000,1,1746013539,1,8.144308e-02


### Service Alerts

In [61]:
# Get proportion of duplicates
duplicate_mask = alerts_df.duplicated()
print(f'{duplicate_mask.mean():.2%}')

81.23%


In [62]:
# Remove duplicates
alerts_df = alerts_df.drop_duplicates(keep='last').reset_index(drop=True)

In [63]:
# Convert timestamps to datetime
alerts_df['start_time'] = pd.to_datetime(alerts_df['start_time'], origin='unix', unit='s', utc=True)
alerts_df['end_time'] = pd.to_datetime(alerts_df['end_time'], origin='unix', unit='s', utc=True)

In [64]:
# Fill null end times with current dates (assuming the alert is still active)
alerts_df['end_time'] = alerts_df['end_time'].fillna(datetime.now(timezone.utc).replace(microsecond=0))

In [65]:
# Sort values by start time
alerts_df = alerts_df.sort_values('start_time').reset_index(drop=True)

In [66]:
# Merge alerts
stm_df = pd.merge(left=stm_trips_positions_df, right=alerts_df, how='left', on=['route_id', 'stop_id'])

In [67]:
# Add boolean column if the stop as an active alert
has_alert_mask = (stm_df['start_time'].notna()) & \
	(stm_df['sch_arrival_time'] >= stm_df['start_time']) & \
	(stm_df['sch_arrival_time'] <= stm_df['end_time'])
stm_df['stop_has_alert'] = has_alert_mask.astype('int64')

In [68]:
stm_df['stop_has_alert'].value_counts()

stop_has_alert
0    1385991
1      79201
Name: count, dtype: int64

In [69]:
stm_df.columns

Index(['trip_id', 'route_id', 'stop_id', 'rt_arrival_time',
       'rt_departure_time', 'schedule_relationship', 'stop_sequence',
       'trip_progress', 'stop_name', 'stop_lat', 'stop_lon', 'location_type',
       'parent_station', 'wheelchair_boarding', 'stop_distance',
       'sch_arrival_time', 'sch_departure_time', 'delay', 'incident_category',
       'incident_avg_delay', 'avg_distance_to_incident',
       'incident_delay_magnitude', 'incident_count', 'incident_nearby',
       'vehicle_id', 'vehicle_lat', 'vehicle_lon', 'vehicle_bearing',
       'vehicle_speed', 'vehicle_status', 'vehicle_dt', 'occupancy_status',
       'vehicle_distance', 'start_time', 'end_time', 'stop_has_alert'],
      dtype='object')

In [70]:
# Drop unneeded datetime columns
stm_df = stm_df.drop(['start_time', 'end_time'], axis=1)

### Route Types

In [71]:
stm_df = pd.merge(left=stm_df, right=routes_df, how='inner', on='route_id')

In [72]:
stm_df.columns

Index(['trip_id', 'route_id', 'stop_id', 'rt_arrival_time',
       'rt_departure_time', 'schedule_relationship', 'stop_sequence',
       'trip_progress', 'stop_name', 'stop_lat', 'stop_lon', 'location_type',
       'parent_station', 'wheelchair_boarding', 'stop_distance',
       'sch_arrival_time', 'sch_departure_time', 'delay', 'incident_category',
       'incident_avg_delay', 'avg_distance_to_incident',
       'incident_delay_magnitude', 'incident_count', 'incident_nearby',
       'vehicle_id', 'vehicle_lat', 'vehicle_lon', 'vehicle_bearing',
       'vehicle_speed', 'vehicle_status', 'vehicle_dt', 'occupancy_status',
       'vehicle_distance', 'stop_has_alert', 'route_type'],
      dtype='object')

### STM and Weather

In [73]:
# Convert time string to datetime
time_dt = pd.to_datetime(weather_df['time'], utc=True)

In [74]:
# Calculate dates for weather forecast
last_day_weather = time_dt.max()
start_date = last_day_weather + timedelta(days=1)
end_date = stm_df['sch_arrival_time'].max()

In [75]:
# Fetch forecast weather
start_date_str = start_date.strftime('%Y-%m-%d')
end_date_str = end_date.strftime('%Y-%m-%d')

forecast_list = fetch_weather(start_date=start_date_str, end_date=end_date_str, forecast=True)
forecast_df = pd.DataFrame(forecast_list)

In [76]:
# Merge archive and forecast weather
weather_df = pd.concat([weather_df, forecast_df], ignore_index=True)

In [77]:
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 240 entries, 0 to 239
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   time               240 non-null    object 
 1   temperature        240 non-null    float64
 2   relative_humidity  240 non-null    int64  
 3   dew_point          240 non-null    float64
 4   precipitation      240 non-null    float64
 5   pressure           240 non-null    float64
 6   visibility         72 non-null     float64
 7   cloud_cover        240 non-null    int64  
 8   windspeed          240 non-null    float64
 9   wind_direction     240 non-null    int64  
 10  wind_gusts         240 non-null    float64
dtypes: float64(7), int64(3), object(1)
memory usage: 20.8+ KB


In [78]:
# Drop visibility column because most of the values are null
weather_df = weather_df.drop('visibility', axis=1)

In [79]:
# Round arrival time to the nearest hour
rounded_arrival_dt = stm_df['sch_arrival_time'].dt.round('h')

In [80]:
# Format time to match weather data
stm_df['time'] = rounded_arrival_dt.dt.strftime('%Y-%m-%dT%H:%M')

In [81]:
# Merge STM with weather
df = pd.merge(left=stm_df, right=weather_df, how='inner', on='time').drop('time', axis=1)

## Clean Data

### Drop Columns

In [82]:
# Remove columns with constant values or with more than 50% missing values
df = df.loc[:, (df.nunique() > 1) & (df.isna().mean() < 0.5)]
df.columns

Index(['trip_id', 'route_id', 'stop_id', 'rt_arrival_time',
       'rt_departure_time', 'schedule_relationship', 'stop_sequence',
       'trip_progress', 'stop_name', 'stop_lat', 'stop_lon',
       'wheelchair_boarding', 'stop_distance', 'sch_arrival_time',
       'sch_departure_time', 'delay', 'incident_count', 'incident_nearby',
       'vehicle_id', 'vehicle_lat', 'vehicle_lon', 'vehicle_bearing',
       'vehicle_speed', 'vehicle_status', 'vehicle_dt', 'occupancy_status',
       'vehicle_distance', 'stop_has_alert', 'route_type', 'temperature',
       'relative_humidity', 'dew_point', 'precipitation', 'pressure',
       'cloud_cover', 'windspeed', 'wind_direction', 'wind_gusts'],
      dtype='object')

In [83]:
df.isna().sum()

trip_id                       0
route_id                      0
stop_id                       0
rt_arrival_time           94954
rt_departure_time        123826
schedule_relationship         0
stop_sequence                 0
trip_progress                 0
stop_name                     0
stop_lat                      0
stop_lon                      0
wheelchair_boarding           0
stop_distance                 0
sch_arrival_time              0
sch_departure_time            0
delay                         0
incident_count                0
incident_nearby               0
vehicle_id                    0
vehicle_lat                   0
vehicle_lon                   0
vehicle_bearing               0
vehicle_speed                 0
vehicle_status                0
vehicle_dt                    0
occupancy_status              0
vehicle_distance              0
stop_has_alert                0
route_type                    0
temperature                   0
relative_humidity             0
dew_poin

### Convert columns

In [84]:
# Get columns with two values
two_values = df.loc[:, df.nunique() == 2]
for column in two_values.columns:
  print(df[column].value_counts())

wheelchair_boarding
1    1384425
2      80767
Name: count, dtype: int64
incident_nearby
0.0    1461399
1.0       3793
Name: count, dtype: int64
vehicle_status
2    1059888
1     405304
Name: count, dtype: int64
stop_has_alert
0    1385991
1      79201
Name: count, dtype: int64


In [85]:
# Convert columns with 2 unique values to integer
df['wheelchair_boarding'] = (df['wheelchair_boarding'] == 1).astype('int64')
df['vehicle_in_transit'] = (df['vehicle_status'] == 2).astype('int64')
df['incident_nearby'] = df['incident_nearby'].astype('int64')

### Create Delay Categories

**Delay Threshholds, according to [STM's definition](https://www.stm.info/en/info/networks/bus-network-and-schedules-enlightened)**

- Early: delay < -1 min
- On Time: -1 min ≤ delay < 3 min
- Late: delay ≥ 3 min

In [86]:
# Add delay category column
labels = ['Early', 'On Time', 'Late']
ranges = [-np.inf, -60, 180, np.inf]
df['delay_class'] = pd.cut(df['delay'], bins=ranges, labels=labels, include_lowest=True, right=False)
df['delay_class'].value_counts(normalize=True)

delay_class
On Time    0.859204
Late       0.123705
Early      0.017091
Name: proportion, dtype: float64

### Convert Schedule Relationship to Categories

In [87]:
# Create schedule relationship mapping
sch_codes = df['schedule_relationship'].sort_values().unique()
condition_list = []
label_list = []

for code in sch_codes:
  condition_list.append(df['schedule_relationship'] == code)
  label_list.append(SCHEDULE_RELATIONSHIP[code])

In [88]:
# Create categories
df['schedule_relationship'] = np.select(condition_list, label_list, default='Unknown')
df['schedule_relationship'].value_counts()

schedule_relationship
Scheduled    1451509
Skipped         7680
No Data         6003
Name: count, dtype: int64

### Convert Occupancy Status Into Categories

In [89]:
# Create occupancy status mapping
occ_codes = df['occupancy_status'].sort_values().unique()
condition_list = []
label_list = []

for code in occ_codes:
  condition_list.append(df['occupancy_status'] == code)
  label_list.append(OCCUPANCY_STATUS[code])

In [90]:
# Create categories
df['occupancy_status'] = np.select(condition_list, label_list, default='Unknown')
df['occupancy_status'].value_counts()

occupancy_status
Empty                         668324
Many seats available          464196
Few seats available           308785
Crushed standing room only     22760
Unknown                         1127
Name: count, dtype: int64

In [91]:
df.columns

Index(['trip_id', 'route_id', 'stop_id', 'rt_arrival_time',
       'rt_departure_time', 'schedule_relationship', 'stop_sequence',
       'trip_progress', 'stop_name', 'stop_lat', 'stop_lon',
       'wheelchair_boarding', 'stop_distance', 'sch_arrival_time',
       'sch_departure_time', 'delay', 'incident_count', 'incident_nearby',
       'vehicle_id', 'vehicle_lat', 'vehicle_lon', 'vehicle_bearing',
       'vehicle_speed', 'vehicle_status', 'vehicle_dt', 'occupancy_status',
       'vehicle_distance', 'stop_has_alert', 'route_type', 'temperature',
       'relative_humidity', 'dew_point', 'precipitation', 'pressure',
       'cloud_cover', 'windspeed', 'wind_direction', 'wind_gusts',
       'vehicle_in_transit', 'delay_class'],
      dtype='object')

## Export Data

In [92]:
# Drop and reorder columns
df = df[[
	'trip_id',
	'vehicle_id',
  	'vehicle_bearing',
	'vehicle_speed',
	'vehicle_in_transit',
	'occupancy_status',
  	'route_id',
    'route_type',
  	'stop_id',
    'stop_name',
    'stop_lat',
    'stop_lon',
    'stop_distance',
    'stop_sequence',
    'trip_progress',
    'stop_has_alert',
    'schedule_relationship',
	'wheelchair_boarding',
  	'rt_arrival_time',
	'rt_departure_time',
  	'sch_arrival_time',
    'sch_departure_time',
    'delay',
    'delay_class',
    'incident_count',
    'incident_nearby',
	'temperature',
    'relative_humidity',
  	'dew_point',
  	'precipitation',
  	'pressure',
	'cloud_cover',
  	'windspeed',
  	'wind_direction',
  	'wind_gusts'
]]

In [93]:
# Export data to CSV
df.to_parquet('../data/stm_weather_traffic_merged.parquet', index=False)

## End