# STM Transit Delay Data Preparation

## Overview

This notebook cleans and merges data collected from [STM](https://www.stm.info/en/about/developers), [Open-Meteo](https://open-meteo.com/en/docs) and [Tomtom](https://developer.tomtom.com/) and prepares it for data analysis and/or preprocessing.

## Data Description

### STM Real-time Trip Updates

`current_time`: Timestamp when the data was fetched from the GTFS, in microseconds.<br>
`trip_id`: Unique identifier for the transit trip.<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`start_date`: Start date of the transit trip.<br>
`stop_id`: Unique identifier of a stop.<br>
`arrival_time`, `departure_time`: Realtime arrival and departure time, in microseconds<br>
`schedule_relationship`: State of the trip, 0 meaning "scheduled", 1 meaning "skipped" and 2 meaning "no data".

### STM Scheduled Trips

`trip_id`: Unique identifier for the transit trip.<br>
`arrival_time`, `departure_time`: Scheduled arrival and departure time.<br>
`stop_id`: Unique identifier of a stop.<br>
`stop_sequence`: Sequence of a stop, for ordering.

### STM Stops

`stop_id`: Unique identifier of a stop.<br>
`stop_code`: Bus stop or metro station number.<br>
`stop_name`: Bus stop or metro station name<br>
`stop_lat`, `stop_lon`: Stop coordinates.<br>
`stop_url`: Stop web page.<br>
`location_type`: Stop type.<br>
`parent_station`: Parent station (metro station with multiple exits).<br>
`wheelchair_boarding`: Indicates if the stop is accessible for people in wheelchair, 1 being true and 2 being false.

### STM Real-time Vehicle Positions

`current_time`: Timestamp when the data was fetched from the GTFS, in microseconds.<br>
`vehicle_id`: Unique identifier for a vehicle.<br>
`trip_id`: Unique identifier for the transit trip.<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`start_date`: Start date of a transit trip.<br>
`start_time`: Start time of a transit trip.<br>
`latitude`, `longitude`: Vehicle current position.<br>
`bearing`: Direction that the vehicle is facing, from 0 to 360 degrees.<br>
`speed`: Momentary speed measured by the vehicle, in meters per second.<br>
`stop_sequence`: Sequence of a stop, for ordering.<br>
`status`: Vehicle stop status in relation with a stop that it's currently approaching or is at, 1 being "stopped at" and 2 being "in transit to".<br>
`timestamp`: Timestamp when STM updated the data, in milliseconds.<br>
`occupancy_status`: Degree of passenger occupancy, ranging from 1 (empty) to 7 (not accepting passengers).

### STM Service Alerts

`start_time`: Start time of the alert, in microseconds.<br>
`end_time`: End time of the alert, in microseconds.<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`stop_id`: Unique identifier of a stop.

### Open-Meteo Weather Archive and Forecast

`time`: Date and hour or the weather.<br>
`temperature`: Air temperature at 2 meters above ground, in Celsius.<br>
`precipitation`: Total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters.<br>
`windspeed`: Wind speed at 10 meters above ground, in kilometers per hour.<br>
`weathercode`: World Meteorological Organization (WMO) code, see [Open-Meteo API documentation](https://open-meteo.com/en/docs#weather_variable_documentation) for the list.

### Tomtom Traffic Incidents

`category`: Category of the incident.<br>
`start_time`: Start time of the incident, in ISO8601 format.<br>
`end_time`: End time of the incident, in ISO8601 format.<br>
`length`: Length of the incident in meters.<br>
`delay`: Delay in seconds caused by the incident (except road closures).<br>
`magnitude_of_delay`: Severity of the delay, ranging from 0 to 4 (minor to major).<br>
`last_report_time`: Date when the last time the incident was reported,in ISO8601 format.<br>
`latitude`, `longitude`: Coordinates of the incident.

## Imports

In [None]:
from datetime import datetime, timedelta, timezone
import geopandas as gpd
import numpy as np
import pandas as pd
import sys

In [2]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import fetch_weather, LOCAL_TIMEZONE, OCCUPANCY_STATUS, SCHEDULE_RELATIONSHIP, WEATHER_CODES

In [3]:
# Import data
trips_df = pd.read_csv('../data/api/fetched_stm_trip_updates.csv',low_memory=False)
schedules_df = pd.read_csv('../data/download/stop_times_2025-04-30.txt')
stops_df = pd.read_csv('../data/download/stops_2025-04-30.txt')
positions_df =  pd.read_csv('../data/api/fetched_stm_vehicle_positions.csv', low_memory=False)
alerts_df = pd.read_csv('../data/api/fetched_stm_service_alerts.csv')
weather_df = pd.read_csv('../data/api/fetched_historical_weather.csv')
traffic_df = pd.read_csv('../data/api/fetched_traffic.csv')

## Merge Data

### Schedules and stops

In [4]:
# Sort values by stop sequence
schedules_df = schedules_df.sort_values(by=['trip_id', 'stop_sequence'])

In [5]:
# Add trip progress (vehicles further along the trip are more likely to be delayed)
total_stops = schedules_df.groupby('trip_id')['stop_id'].transform('count')
schedules_df['trip_progress'] = schedules_df['stop_sequence'] / total_stops

In [6]:
# Get distribution of trip progress
schedules_df['trip_progress'].describe()

count    6.629625e+06
mean     5.139220e-01
std      2.887289e-01
min      8.547009e-03
25%      2.631579e-01
50%      5.142857e-01
75%      7.647059e-01
max      1.000000e+00
Name: trip_progress, dtype: float64

In [7]:
# Merge schedules and stops
schedules_stops_df = pd.merge(left=schedules_df, right=stops_df, how='inner', left_on='stop_id', right_on='stop_code')

In [8]:
schedules_stops_df.columns

Index(['trip_id', 'arrival_time', 'departure_time', 'stop_id_x',
       'stop_sequence', 'trip_progress', 'stop_id_y', 'stop_code', 'stop_name',
       'stop_lat', 'stop_lon', 'stop_url', 'location_type', 'parent_station',
       'wheelchair_boarding'],
      dtype='object')

In [9]:
# Rename stop id and drop uneeded columns
schedules_stops_df = schedules_stops_df.rename(columns={'stop_id_x': 'stop_id'})
schedules_stops_df = schedules_stops_df.drop(['stop_id_y', 'stop_code', 'stop_url'], axis=1)

In [10]:
# Get coordinates of previous stop
schedules_stops_df = schedules_stops_df.sort_values(by=['trip_id', 'stop_sequence'])
schedules_stops_df['prev_lat'] = schedules_stops_df.groupby('trip_id')['stop_lat'].shift(1)
schedules_stops_df['prev_lon'] = schedules_stops_df.groupby('trip_id')['stop_lon'].shift(1)

In [11]:
# Make sure the null values are from first stops
prev_null_mask = (schedules_stops_df['prev_lat'].isna()) | (schedules_stops_df['prev_lon'].isna())
first_stop_mask = schedules_stops_df['stop_sequence'] == 1
assert prev_null_mask.sum() == first_stop_mask.sum()

In [None]:
# Create GeoDataFrames from schedule
sch_gdf1 = gpd.GeoDataFrame(
  schedules_stops_df,
  geometry=gpd.points_from_xy(schedules_stops_df['prev_lon'], schedules_stops_df['prev_lat']),
  crs='EPSG:4326' # WGS84 (sea level)
).to_crs(epsg=3857) # Convert to metric for buffering (meters)

sch_gdf2 = gpd.GeoDataFrame(
  schedules_stops_df,
  geometry=gpd.points_from_xy(schedules_stops_df['stop_lon'], schedules_stops_df['stop_lat']),
  crs='EPSG:4326'
).to_crs(epsg=3857)

0                  NaN
1           309.882231
2           321.760219
3           270.745872
4           190.753715
              ...     
6388736    5823.342923
6388737            NaN
6388738    5823.342923
6388739            NaN
6388740    5823.342923
Length: 6388741, dtype: float64

In [15]:
# Calculate distance from previous stop
schedules_stops_df['stop_distance'] = sch_gdf1.distance(sch_gdf2)
schedules_stops_df['stop_distance'].describe()

count    6.216518e+06
mean     4.019345e+02
std      7.161721e+02
min      2.092191e+01
25%      2.490739e+02
50%      3.286372e+02
75%      4.299291e+02
max      1.965439e+04
Name: stop_distance, dtype: float64

In [16]:
# Drop previous coordinates
schedules_stops_df = schedules_stops_df.drop(['prev_lat', 'prev_lon'], axis=1)

In [17]:
# Replace null distances by zero (first stop of the trip)
schedules_stops_df['stop_distance'] = schedules_stops_df['stop_distance'].fillna(0)

In [18]:
schedules_stops_df['stop_distance'].describe()

count    6.388741e+06
mean     3.910995e+02
std      7.094460e+02
min      0.000000e+00
25%      2.428608e+02
50%      3.243015e+02
75%      4.266621e+02
max      1.965439e+04
Name: stop_distance, dtype: float64

In [19]:
# Get stop with largest distance
schedules_stops_df.iloc[schedules_stops_df['stop_distance'].idxmax()]

trip_id                                    281377832
arrival_time                                05:33:00
departure_time                              05:33:00
stop_id                                        61261
stop_sequence                                     13
trip_progress                                    1.0
stop_name              YUL Aéroport Montréal-Trudeau
stop_lat                                   45.456622
stop_lon                                  -73.751615
location_type                                      0
parent_station                                   NaN
wheelchair_boarding                                1
stop_distance                            19654.39188
Name: 124692, dtype: object

After double-checking in the STM website, the large distance make sense because the bus stops before or after this are far away.

### Realtime and Scheduled Trips

In [20]:
# Convert route_id to integer
trips_df['route_id'] = trips_df['route_id'].str.extract(r'(\d+)')
trips_df['route_id'] = trips_df['route_id'].astype('int64')

In [21]:
# Get proportion of duplicates
subset = trips_df.drop('current_time', axis=1).columns
duplicate_mask = trips_df.duplicated(subset=subset)
print(f'{duplicate_mask.mean():.2%}')

9.98%


In [22]:
# Remove duplicates
trips_df = trips_df.drop_duplicates(subset=subset, keep='last')

In [23]:
# Rename arrival and departure time
trips_df = trips_df.rename(columns={'arrival_time': 'rt_arrival_time','departure_time': 'rt_departure_time'})

In [24]:
# Merge trip updates with schedule
stm_trips_df = pd.merge(left=trips_df, right=schedules_stops_df, how='inner', on=['trip_id', 'stop_id'])

In [25]:
stm_trips_df.columns

Index(['current_time', 'trip_id', 'route_id', 'start_date', 'stop_id',
       'rt_arrival_time', 'rt_departure_time', 'schedule_relationship',
       'arrival_time', 'departure_time', 'stop_sequence', 'trip_progress',
       'stop_name', 'stop_lat', 'stop_lon', 'location_type', 'parent_station',
       'wheelchair_boarding', 'stop_distance'],
      dtype='object')

In [26]:
# Convert start_date to datetime
stm_trips_df['start_date_dt'] = pd.to_datetime(stm_trips_df['start_date'], format='%Y%m%d')

In [27]:
def parse_gtfs_time(start_date:pd.Timestamp, stop_time:str) -> pd.Timestamp:
	'''
	Converts GTFS time string (e.g., '25:30:00') to datetime
	based on the arrival time.
	'''
	hours, minutes, seconds = map(int, stop_time.split(':'))
	total_seconds = hours * 3600 + minutes * 60 + seconds

	parsed_time = start_date + timedelta(seconds=total_seconds)
	return parsed_time

In [28]:
# Parse GTFS scheduled arrival and departure time
parsed_arrival_time = stm_trips_df.apply(lambda row: parse_gtfs_time(row['start_date_dt'], row['arrival_time']), axis=1)
parsed_departure_time = stm_trips_df.apply(lambda row: parse_gtfs_time(row['start_date_dt'], row['departure_time']), axis=1)

In [29]:
# Convert scheduled arrival and departure time to UTC datetime
sch_arrival_time_local = parsed_arrival_time.dt.tz_localize(LOCAL_TIMEZONE)
sch_departure_time_local = parsed_departure_time.dt.tz_localize(LOCAL_TIMEZONE)

stm_trips_df['sch_arrival_time'] = sch_arrival_time_local.dt.tz_convert(timezone.utc)
stm_trips_df['sch_departure_time'] = sch_departure_time_local.dt.tz_convert(timezone.utc)

In [30]:
# Get rows where scheduled arrival and departure time are different
stm_trips_df[stm_trips_df['sch_arrival_time'] != stm_trips_df['sch_departure_time']]

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id,rt_arrival_time,rt_departure_time,schedule_relationship,arrival_time,departure_time,...,stop_name,stop_lat,stop_lon,location_type,parent_station,wheelchair_boarding,stop_distance,start_date_dt,sch_arrival_time,sch_departure_time
5232,1.745791e+09,284216122,470,20250427,57794,1745792340,1745792340,0,18:19:00,18:20:00,...,Terminus Fairview,45.465697,-73.831619,0,,1,391.127323,2025-04-27,2025-04-27 22:19:00+00:00,2025-04-27 22:20:00+00:00
10326,1.745791e+09,284216100,470,20250427,60415,1745792054,1745792054,0,18:06:00,18:07:00,...,Terminus Fairview / Brunswick,45.465962,-73.831866,0,,1,782.045238,2025-04-27,2025-04-27 22:06:00+00:00,2025-04-27 22:07:00+00:00
10754,1.745791e+09,284216575,747,20250427,60997,0,1745791860,0,18:06:00,18:11:00,...,YUL Aéroport Montréal-Trudeau,45.456567,-73.751618,0,,1,0.000000,2025-04-27,2025-04-27 22:06:00+00:00,2025-04-27 22:11:00+00:00
11788,1.745791e+09,284216533,747,20250427,60997,0,1745794440,0,18:49:00,18:54:00,...,YUL Aéroport Montréal-Trudeau,45.456567,-73.751618,0,,1,0.000000,2025-04-27,2025-04-27 22:49:00+00:00,2025-04-27 22:54:00+00:00
12297,1.745791e+09,284216124,470,20250427,60415,1745795640,1745795700,0,19:14:00,19:15:00,...,Terminus Fairview / Brunswick,45.465962,-73.831866,0,,1,782.045238,2025-04-27,2025-04-27 23:14:00+00:00,2025-04-27 23:15:00+00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4715733,1.746418e+09,284216088,470,20250504,60415,1746418860,1746418920,0,24:21:00,24:22:00,...,Terminus Fairview / Brunswick,45.465962,-73.831866,0,,1,782.045238,2025-05-04,2025-05-05 04:21:00+00:00,2025-05-05 04:22:00+00:00
4717881,1.746418e+09,284216669,747,20250504,60997,0,1746420720,0,24:47:00,24:52:00,...,YUL Aéroport Montréal-Trudeau,45.456567,-73.751618,0,,1,0.000000,2025-05-04,2025-05-05 04:47:00+00:00,2025-05-05 04:52:00+00:00
4723556,1.746418e+09,284216794,747,20250504,60997,0,1746419640,0,24:29:00,24:34:00,...,YUL Aéroport Montréal-Trudeau,45.456567,-73.751618,0,,1,0.000000,2025-05-04,2025-05-05 04:29:00+00:00,2025-05-05 04:34:00+00:00
4724675,1.746418e+09,284216133,470,20250504,60415,1746420660,1746420720,0,24:51:00,24:52:00,...,Terminus Fairview / Brunswick,45.465962,-73.831866,0,,1,782.045238,2025-05-04,2025-05-05 04:51:00+00:00,2025-05-05 04:52:00+00:00


In [31]:
# Replace 0 timestamps with NaN
stm_trips_df['rt_arrival_time'] = stm_trips_df['rt_arrival_time'].replace({0: np.nan})
stm_trips_df['rt_departure_time'] = stm_trips_df['rt_departure_time'].replace({0: np.nan})

In [32]:
# Convert realtime arrival and departure time to UTC datetime
stm_trips_df['rt_arrival_time'] = pd.to_datetime(stm_trips_df['rt_arrival_time'] * 1000, origin='unix', unit='ms', utc=True)
stm_trips_df['rt_departure_time'] = pd.to_datetime(stm_trips_df['rt_departure_time'] * 1000, origin='unix', unit='ms', utc=True)

In [33]:
# Calculate delay (realtime - scheduled)
# Start with arrival time, if null, calculate with departure time
stm_trips_df['delay'] = (stm_trips_df['rt_arrival_time'] - stm_trips_df['sch_arrival_time']) / pd.Timedelta(seconds=1)
stm_trips_df['delay'] = stm_trips_df['delay'].fillna(((stm_trips_df['rt_departure_time'] - stm_trips_df['sch_departure_time']) / pd.Timedelta(seconds=1)))

In [34]:
# Get distribution
stm_trips_df['delay'].describe()

count    4.607676e+06
mean     7.108818e+01
std      4.999845e+02
min     -1.359200e+04
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      5.458500e+04
Name: delay, dtype: float64

In [35]:
# Get null delays count
print(stm_trips_df['delay'].isna().sum())

121559


In [36]:
# Replace the null delays with the average delay by route, stop, day of week and hour
stm_trips_df['day_of_week'] = stm_trips_df['sch_arrival_time'].dt.day_of_week
stm_trips_df['hour'] = stm_trips_df['sch_arrival_time'].dt.hour
stm_trips_df['delay'] = stm_trips_df['delay'] \
	.fillna(stm_trips_df.groupby(['route_id', 'stop_id', 'day_of_week', 'hour'])['delay'].transform('mean')) \
	.fillna(stm_trips_df.groupby(['route_id', 'stop_id', 'day_of_week'])['delay'].transform('mean')) \
	.fillna(stm_trips_df.groupby(['route_id', 'stop_id'])['delay'].transform('mean')) \
	.fillna(stm_trips_df.groupby('route_id')['delay'].transform('mean'))
assert stm_trips_df['delay'].isna().sum() == 0

In [37]:
# Make sure the distribution didn't change too much
stm_trips_df['delay'].describe()

count    4.729235e+06
mean     7.175174e+01
std      4.961106e+02
min     -1.359200e+04
25%      0.000000e+00
50%      0.000000e+00
75%      1.100000e+01
max      5.458500e+04
Name: delay, dtype: float64

In [38]:
stm_trips_df.columns

Index(['current_time', 'trip_id', 'route_id', 'start_date', 'stop_id',
       'rt_arrival_time', 'rt_departure_time', 'schedule_relationship',
       'arrival_time', 'departure_time', 'stop_sequence', 'trip_progress',
       'stop_name', 'stop_lat', 'stop_lon', 'location_type', 'parent_station',
       'wheelchair_boarding', 'stop_distance', 'start_date_dt',
       'sch_arrival_time', 'sch_departure_time', 'delay', 'day_of_week',
       'hour'],
      dtype='object')

In [39]:
# Remove uneeded columns
stm_trips_df = stm_trips_df.drop(['current_time', 'arrival_time', 'departure_time', 'start_date_dt', 'day_of_week', 'hour'], axis=1)

### Trips and Traffic Data

In [41]:
# Get proportion of duplicates
duplicate_mask = traffic_df.duplicated()
print(f'{duplicate_mask.mean():.2%}')

38.98%


In [42]:
# Remove duplicates
traffic_df = traffic_df.drop_duplicates(keep='last').reset_index(drop=True)

In [43]:
# Convert traffic start_time and end_time to datetime
traffic_df['start_time'] = pd.to_datetime(traffic_df['start_time'], utc=True)
traffic_df['end_time'] = pd.to_datetime(traffic_df['end_time'], utc=True)

In [44]:
# Sort by date
traffic_df = traffic_df.sort_values(by='start_time').reset_index(drop=True)

In [45]:
# Keep only traffic incidents on same date as trip updates
relevant_dates = stm_trips_df['sch_arrival_time'].dt.date.unique()
traffic_df = traffic_df[traffic_df['start_time'].dt.date.isin(relevant_dates)]

In [46]:
# Fill null end times with current time (assuming the incident is still ongoing)
traffic_df['end_time'] = traffic_df['end_time'].fillna(datetime.now(timezone.utc).replace(microsecond=0))
assert traffic_df['end_time'].isna().sum() == 0

In [47]:
# Create GeoDataFrame for trip updates
stm_trips_gdf = gpd.GeoDataFrame(
  stm_trips_df,
  geometry=gpd.points_from_xy(stm_trips_df['stop_lon'], stm_trips_df['stop_lat']),
  crs='EPSG:4326'
).to_crs(epsg=3857)

In [48]:
# Create GeoDataFrame for traffic incidents
traffic_gdf = gpd.GeoDataFrame(
    traffic_df,
    geometry=gpd.points_from_xy(traffic_df['longitude'], traffic_df['latitude']),
    crs='EPSG:4326'
).to_crs(epsg=3857)

In [49]:
# Perform spatial join with nearest incidents
joined = gpd.sjoin_nearest(
  left_df=stm_trips_gdf,
  right_df=traffic_gdf,
  how='left',
  max_distance=1000, # get incidents within 1 km
  distance_col='distance')

In [50]:
joined.columns

Index(['trip_id', 'route_id', 'start_date', 'stop_id', 'rt_arrival_time',
       'rt_departure_time', 'schedule_relationship', 'stop_sequence',
       'trip_progress', 'stop_name', 'stop_lat', 'stop_lon', 'location_type',
       'parent_station', 'wheelchair_boarding', 'stop_distance',
       'sch_arrival_time', 'sch_departure_time', 'delay_left', 'geometry',
       'index_right', 'category', 'start_time', 'end_time', 'length',
       'delay_right', 'magnitude_of_delay', 'last_report_time', 'latitude',
       'longitude', 'distance'],
      dtype='object')

In [51]:
# Filter joined incidents by time overlap
active_incident = (joined['start_time'] <= joined['sch_arrival_time']) & (joined['end_time'] >= joined['sch_arrival_time'])
joined = joined[active_incident].copy()

if joined.empty:
	stm_trips_gdf['incident_nearby'] = 0
	stm_trips_gdf['incident_delay'] = 0
	stm_trips_gdf['weighted_incident_delay'] = 0

In [52]:
joined.head()

Unnamed: 0,trip_id,route_id,start_date,stop_id,rt_arrival_time,rt_departure_time,schedule_relationship,stop_sequence,trip_progress,stop_name,...,category,start_time,end_time,length,delay_right,magnitude_of_delay,last_report_time,latitude,longitude,distance
3,285076964,212,20250427,58570,NaT,NaT,2,1,0.076923,Lakeshore / Macdonald,...,6.0,2025-04-27 14:56:30+00:00,2025-04-27 23:01:30+00:00,4112.790791,1248.0,3.0,,45.408775,-73.944283,914.335222
125,284752678,69,20250427,50209,2025-04-27 22:18:28+00:00,2025-04-27 22:18:28+00:00,0,56,0.848485,de l'Acadie / Gouin,...,8.0,2025-04-27 20:58:00+00:00,2025-05-05 04:28:40+00:00,82.273511,,4.0,,45.543949,-73.702248,124.094222
125,284752678,69,20250427,50209,2025-04-27 22:18:28+00:00,2025-04-27 22:18:28+00:00,0,56,0.848485,de l'Acadie / Gouin,...,8.0,2025-04-27 21:58:00+00:00,2025-05-05 04:28:40+00:00,82.273511,,4.0,,45.543949,-73.702248,124.094222
126,284752678,69,20250427,50205,2025-04-27 22:19:11+00:00,2025-04-27 22:19:11+00:00,0,57,0.863636,Gouin / René-Guénette,...,8.0,2025-04-27 20:58:00+00:00,2025-05-05 04:28:40+00:00,82.273511,,4.0,,45.543949,-73.702248,386.693105
126,284752678,69,20250427,50205,2025-04-27 22:19:11+00:00,2025-04-27 22:19:11+00:00,0,57,0.863636,Gouin / René-Guénette,...,8.0,2025-04-27 21:58:00+00:00,2025-05-05 04:28:40+00:00,82.273511,,4.0,,45.543949,-73.702248,386.693105


In [54]:
# Group by trip index (from trip_gdf) to attach traffic features
incident_summary = joined.groupby('index_right').agg({
	'category': pd.Series.mode,
	'delay_right': 'mean',
   	'distance': 'mean',
	'magnitude_of_delay': pd.Series.mode,
	'geometry': 'count'  # number of incidents
}).rename(columns={
	'category': 'incident_category',
	'delay_right': 'incident_avg_delay',
  	'distance': 'avg_distance_to_incident',
  	'magnitude_of_delay': 'incident_delay_magnitude',
	'geometry': 'incident_count',
})

In [57]:
# Create boolean column incident_nearby
incident_summary['incident_nearby'] = (incident_summary['incident_count'] > 0).astype(int)

In [63]:
# Merge back to original trip_updates
trips_traffic_df = stm_trips_gdf.reset_index().merge(incident_summary, left_index=True, right_on='index_right', how='left')

In [64]:
# Fill missing values for trips with no incidents
trips_traffic_df = trips_traffic_df.fillna({
	'incident_category': np.nan,
	'incident_delay': np.nan,
	'avg_distance_to_incident': np.nan,
  	'incident_delay_magnitude': np.nan,
	'incident_count': 0,
	'incident_nearby': 0
})

In [66]:
# Drop uneeded columns
trips_traffic_df = trips_traffic_df.drop(columns=['geometry', 'index_right', 'index'], axis=1)

### Vehicle Positions

In [67]:
# Get proportion of duplicates
subset = positions_df.drop('current_time', axis=1).columns
duplicate_mask = positions_df.duplicated(subset=subset)
print(f'{duplicate_mask.mean():.2%}')

0.01%


In [68]:
# Drop duplicates
positions_df = positions_df.drop_duplicates(subset=subset, keep='last')

In [69]:
# Rename latitude and longitude
positions_df = positions_df.rename(columns={
  'latitude': 'vehicle_lat',
  'longitude': 'vehicle_lon',
  'status': 'vehicle_status',
  'bearing': 'vehicle_bearing',
  'speed': 'vehicle_speed',
  'timestamp': 'vehicle_dt'
})

In [70]:
# Merge positions
stm_trips_positions_df = pd.merge(left=trips_traffic_df, right=positions_df, how='inner', on=['trip_id', 'route_id', 'start_date', 'stop_sequence'])

In [71]:
stm_trips_positions_df.columns

Index(['trip_id', 'route_id', 'start_date', 'stop_id', 'rt_arrival_time',
       'rt_departure_time', 'schedule_relationship', 'stop_sequence',
       'trip_progress', 'stop_name', 'stop_lat', 'stop_lon', 'location_type',
       'parent_station', 'wheelchair_boarding', 'stop_distance',
       'sch_arrival_time', 'sch_departure_time', 'delay', 'incident_category',
       'incident_avg_delay', 'avg_distance_to_incident',
       'incident_delay_magnitude', 'incident_count', 'incident_nearby',
       'current_time', 'vehicle_id', 'start_time', 'vehicle_lat',
       'vehicle_lon', 'vehicle_bearing', 'vehicle_speed', 'vehicle_status',
       'vehicle_dt', 'occupancy_status'],
      dtype='object')

In [72]:
# Drop uneeded columns
stm_trips_positions_df = stm_trips_positions_df.drop(['current_time', 'start_date', 'start_time'], axis=1)

In [73]:
# Create GeoDataFrames for vehicle and stop positions
pos_gdf1 = gpd.GeoDataFrame(
  stm_trips_positions_df,
  geometry=gpd.points_from_xy(stm_trips_positions_df['vehicle_lon'], stm_trips_positions_df['vehicle_lat']),
  crs='EPSG:4326'
).to_crs(epsg=3857)

pos_gdf2 = gpd.GeoDataFrame(
  stm_trips_positions_df,
  geometry=gpd.points_from_xy(stm_trips_positions_df['stop_lon'], stm_trips_positions_df['stop_lat']),
  crs='EPSG:4326'
).to_crs(epsg=3857)

In [74]:
# Calculate the vehicle distance from the current stop
stm_trips_positions_df['vehicle_distance'] = pos_gdf1.distance(pos_gdf2)
stm_trips_positions_df['vehicle_distance'].describe()

count    3.797300e+05
mean     6.855589e+02
std      3.974336e+04
min      3.001229e-02
25%      1.132212e+01
50%      1.602335e+02
75%      3.451729e+02
max      9.992432e+06
Name: vehicle_distance, dtype: float64

In [75]:
further_than_previous_stop = stm_trips_positions_df['vehicle_distance'] > stm_trips_positions_df['stop_distance']
stm_trips_positions_df[further_than_previous_stop]

Unnamed: 0,trip_id,route_id,stop_id,rt_arrival_time,rt_departure_time,schedule_relationship,stop_sequence,trip_progress,stop_name,stop_lat,...,incident_nearby,vehicle_id,vehicle_lat,vehicle_lon,vehicle_bearing,vehicle_speed,vehicle_status,vehicle_dt,occupancy_status,vehicle_distance
3,283213434,968,60296,NaT,2025-04-27 23:00:00+00:00,0,1,0.333333,Station Côte-Vertu,45.514212,...,0.0,40187,45.514233,-73.684097,0.0,0.00000,1,1745793882,1,9.250945
40,283855195,33,54923,2025-04-27 22:15:02+00:00,2025-04-27 22:15:02+00:00,0,35,0.583333,Langelier / Robert,45.597087,...,0.0,39109,45.600529,-73.595451,120.0,2.50002,2,1745792095,1,1000.994575
63,284752224,55,52947,2025-04-27 22:18:23+00:00,NaT,0,47,1.000000,Saint-Laurent / Saint-Jacques,45.506482,...,0.0,42015,45.503372,-73.560776,0.0,0.00000,2,1745792078,2,711.137195
80,283553131,51,50110,2025-04-27 22:55:00+00:00,NaT,0,53,1.000000,Gare Montréal-Ouest (Elmhurst / Sherbrooke),45.454203,...,0.0,41007,45.469250,-73.602531,0.0,0.00000,2,1745795448,1,4950.018548
81,283553304,51,50110,2025-04-27 22:12:37+00:00,NaT,0,53,1.000000,Gare Montréal-Ouest (Elmhurst / Sherbrooke),45.454203,...,0.0,40089,45.457382,-73.639320,39.0,7.50006,2,1745792086,2,558.927653
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
379705,284752385,139,52014,2025-05-05 04:01:40+00:00,2025-05-05 04:01:40+00:00,0,18,0.418605,Pie-IX / Saint-Zotique,45.563793,...,0.0,42048,45.561543,-73.576050,293.0,12.50010,2,1746417586,2,873.805121
379710,284214504,100,50377,2025-05-05 04:01:27+00:00,2025-05-05 04:01:27+00:00,0,23,0.534884,Station De la Savane (Décarie / de Sorel),45.500347,...,0.0,37081,45.498466,-73.660233,0.0,0.00000,2,1746417598,1,350.183587
379722,286594469,496,56472,2025-05-05 04:11:27+00:00,2025-05-05 04:11:27+00:00,0,2,0.090909,Notre-Dame / Saint-Pierre,45.442188,...,0.0,40241,45.486481,-73.578423,49.0,10.00008,2,1746417586,2,10414.102103
379724,284214402,100,55356,2025-05-05 04:01:34+00:00,2025-05-05 04:01:34+00:00,0,19,0.513514,Côte-de-Liesse Nord / No 5415,45.500679,...,0.0,37106,45.508480,-73.671997,0.0,0.00000,2,1746417580,1,1258.337249


The large vehicle distances don't make sense with the delays. It will confuse the model so the coordinates and the distance won't be used.

In [76]:
# Drop unneeded columns
stm_trips_positions_df = stm_trips_positions_df.drop(['vehicle_lat', 'vehicle_lon', 'vehicle_dt', 'vehicle_distance'], axis=1)

### Service Alerts

In [77]:
# Get proportion of duplicates
duplicate_mask = alerts_df.duplicated()
print(f'{duplicate_mask.mean():.2%}')

74.01%


In [78]:
# Remove duplicates
alerts_df = alerts_df.drop_duplicates(keep='last').reset_index(drop=True)

In [79]:
# Convert timestamps to datetime
alerts_df['start_time'] = pd.to_datetime(alerts_df['start_time'] * 1000, origin='unix', unit='ms', utc=True)
alerts_df['end_time'] = pd.to_datetime(alerts_df['end_time'] * 1000, origin='unix', unit='ms', utc=True)

In [80]:
# Fill null end time with current date (assuming the alert is still active)
alerts_df['end_time'] = alerts_df['end_time'].fillna(datetime.now(timezone.utc).replace(microsecond=0))

In [None]:
# Sort values by start time
alerts_df = alerts_df.sort_values('start_time').reset_index(drop=True)

In [82]:
# Merge alerts
stm_df = pd.merge(left=stm_trips_positions_df, right=alerts_df, how='left', on=['route_id', 'stop_id'])

In [83]:
stm_df.columns

Index(['trip_id', 'route_id', 'stop_id', 'rt_arrival_time',
       'rt_departure_time', 'schedule_relationship', 'stop_sequence',
       'trip_progress', 'stop_name', 'stop_lat', 'stop_lon', 'location_type',
       'parent_station', 'wheelchair_boarding', 'stop_distance',
       'sch_arrival_time', 'sch_departure_time', 'delay', 'incident_category',
       'incident_avg_delay', 'avg_distance_to_incident',
       'incident_delay_magnitude', 'incident_count', 'incident_nearby',
       'vehicle_id', 'vehicle_bearing', 'vehicle_speed', 'vehicle_status',
       'occupancy_status', 'index', 'start_time', 'end_time'],
      dtype='object')

In [84]:
# Add column has_alert
has_alert_mask = (stm_df['start_time'].notna()) & \
	(stm_df['sch_arrival_time'] >= stm_df['start_time']) & \
	(stm_df['sch_arrival_time'] <= stm_df['end_time'])
stm_df['stop_has_alert'] = has_alert_mask.astype('int64')

In [85]:
stm_df['stop_has_alert'].value_counts()

stop_has_alert
0    365874
1     26348
Name: count, dtype: int64

In [86]:
stm_df.columns

Index(['trip_id', 'route_id', 'stop_id', 'rt_arrival_time',
       'rt_departure_time', 'schedule_relationship', 'stop_sequence',
       'trip_progress', 'stop_name', 'stop_lat', 'stop_lon', 'location_type',
       'parent_station', 'wheelchair_boarding', 'stop_distance',
       'sch_arrival_time', 'sch_departure_time', 'delay', 'incident_category',
       'incident_avg_delay', 'avg_distance_to_incident',
       'incident_delay_magnitude', 'incident_count', 'incident_nearby',
       'vehicle_id', 'vehicle_bearing', 'vehicle_speed', 'vehicle_status',
       'occupancy_status', 'index', 'start_time', 'end_time',
       'stop_has_alert'],
      dtype='object')

In [None]:
# Drop unneeded datetime columns and index
stm_df = stm_df.drop(['start_time', 'end_time'], axis=1)

### STM and Weather

In [88]:
# Convert time string to datetime
time_dt = pd.to_datetime(weather_df['time'], utc=True)

In [89]:
# Calculate dates for weather forecast
last_day_weather = time_dt.max()
start_date = last_day_weather + timedelta(days=1)
end_date = stm_df['rt_arrival_time'].max()

In [90]:
# Fetch forecast weather
start_date_str = start_date.strftime('%Y-%m-%d')
end_date_str = end_date.strftime('%Y-%m-%d')

forecast_list = fetch_weather(start_date=start_date_str, end_date=end_date_str, forecast=True)
forecast_df = pd.DataFrame(forecast_list)

In [91]:
# Merge archive and forecast weather
weather_df = pd.concat([weather_df, forecast_df], ignore_index=True)

In [92]:
# Round arrival time to the nearest hour
rounded_arrival_dt = stm_df['rt_arrival_time'].dt.round('h')

In [93]:
# Format time to match weather data
stm_df['time'] = rounded_arrival_dt.dt.strftime('%Y-%m-%dT%H:%M')

In [94]:
# Merge STM with weather
df = pd.merge(left=stm_df, right=weather_df, how='inner', on='time').drop('time', axis=1)

## Clean Data

In [95]:
df.isna().sum()

trip_id                          0
route_id                         0
stop_id                          0
rt_arrival_time                  0
rt_departure_time            47374
schedule_relationship            0
stop_sequence                    0
trip_progress                    0
stop_name                        0
stop_lat                         0
stop_lon                         0
location_type                    0
parent_station              354424
wheelchair_boarding              0
stop_distance                    0
sch_arrival_time                 0
sch_departure_time               0
delay                            0
incident_category           353855
incident_avg_delay          353876
avg_distance_to_incident    353855
incident_delay_magnitude    353855
incident_count                   0
incident_nearby                  0
vehicle_id                       0
vehicle_bearing                  0
vehicle_speed                    0
vehicle_status                   0
occupancy_status    

### Convert columns

In [96]:
# Get columns with two values
two_values = df.loc[:, df.nunique() == 2]
two_values.columns

Index(['wheelchair_boarding', 'incident_nearby', 'vehicle_status',
       'stop_has_alert'],
      dtype='object')

In [97]:
print(df['wheelchair_boarding'].value_counts())
print(df['vehicle_status'].value_counts())
print(df['incident_nearby'].value_counts())
print(df['stop_has_alert'].value_counts())

wheelchair_boarding
1    336147
2     18277
Name: count, dtype: int64
vehicle_status
2    276890
1     77534
Name: count, dtype: int64
incident_nearby
0.0    353855
1.0       569
Name: count, dtype: int64
stop_has_alert
0    332528
1     21896
Name: count, dtype: int64


In [98]:
# Convert columns with 2 unique values to integer
df['wheelchair_boarding'] = (df['wheelchair_boarding'] == 1).astype('int64')
df['vehicle_in_transit'] = (df['vehicle_status'] == 2).astype('int64')
df['incident_nearby'] = df['incident_nearby'].astype('int64')

In [99]:
# Drop vehicle_status
df = df.drop('vehicle_status', axis=1)

### Drop Other Columns

In [101]:
df['incident_count'].describe()

count    354424.000000
mean          0.042125
std           4.068655
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max        1345.000000
Name: incident_count, dtype: float64

In [100]:
# Remove columns with constant values or with more than 50% missing values
df = df.loc[:, (df.nunique() > 1) & (df.isna().mean() < 0.5)]
df.columns

Index(['trip_id', 'route_id', 'stop_id', 'rt_arrival_time',
       'rt_departure_time', 'stop_sequence', 'trip_progress', 'stop_name',
       'stop_lat', 'stop_lon', 'wheelchair_boarding', 'stop_distance',
       'sch_arrival_time', 'sch_departure_time', 'delay', 'incident_count',
       'incident_nearby', 'vehicle_id', 'vehicle_bearing', 'vehicle_speed',
       'occupancy_status', 'stop_has_alert', 'temperature', 'precipitation',
       'windspeed', 'weathercode', 'vehicle_in_transit'],
      dtype='object')

### Create incident categories

### Create Delay Categories

In [102]:
df['delay'].describe()

count    354424.000000
mean         58.760061
std         309.153463
min      -13592.000000
25%           0.000000
50%           0.000000
75%          39.000000
max       31108.000000
Name: delay, dtype: float64

**Delay Threshholds**

- Early: delay < -2 min
- Slightly Early: -2 min ≤ delay < -1 min
- On Time: -1 min ≤ delay < 3 min, according to [STM](https://www.stm.info/en/info/networks/bus-network-and-schedules-enlightened)
- Slightly Late: 3 min ≤ delay < 5 min
- Late: delay ≥ 5 min

In [103]:
# Add delay category column
labels = ['Early', 'Slightly Early', 'On Time', 'Slighly Late', 'Late']
ranges = [-np.inf, -120, -60, 180, 300, np.inf]
df['delay_class'] = pd.cut(df['delay'], bins=ranges, labels=labels, include_lowest=True, right=False)
df['delay_class'].value_counts(normalize=True)

delay_class
On Time           0.843924
Late              0.066195
Slighly Late      0.062690
Early             0.016404
Slightly Early    0.010787
Name: proportion, dtype: float64

### Create Weather Categories

In [104]:
# Create weather code mapping
weathercodes = df['weathercode'].sort_values().unique()
condition_list = []
label_list = []

for code in weathercodes:
  condition_list.append(df['weathercode'] == code)
  label_list.append(WEATHER_CODES[code])

In [105]:
# Create categories
df['weather'] = np.select(condition_list, label_list, default='Unknown')
df['weather'].value_counts()

weather
Overcast            180009
Clear sky            80612
Partly cloudy        33371
Light drizzle        23645
Mainly clear         22539
Slight rain          10279
Dense drizzle         2966
Moderate drizzle      1003
Name: count, dtype: int64

### Convert Schedule Relationship to Categories

In [None]:
# Create schedule relationship mapping
# sch_codes = df['schedule_relationship'].sort_values().unique()
# condition_list = []
# label_list = []

# for code in sch_codes:
#   condition_list.append(df['schedule_relationship'] == code)
#   label_list.append(SCHEDULE_RELATIONSHIP[code])

In [None]:
# Create categories
# df['schedule_relationship'] = np.select(condition_list, label_list, default='Unknown')
# df['schedule_relationship'].value_counts()

### Convert Occupancy Status to Categories

In [106]:
# Create occupancy status mapping
occ_codes = df['occupancy_status'].sort_values().unique()
condition_list = []
label_list = []

for code in occ_codes:
  condition_list.append(df['occupancy_status'] == code)
  label_list.append(OCCUPANCY_STATUS[code])

In [107]:
# Create categories
df['occupancy_status'] = np.select(condition_list, label_list, default='Unknown')
df['occupancy_status'].value_counts()

occupancy_status
Empty                         153296
Many seats available          116922
Few seats available            78539
Crushed standing room only      5433
Unknown                          234
Name: count, dtype: int64

## Export Data

In [109]:
# Drop and reorder columns
df = df[[
	'trip_id',
  	'vehicle_id',
    'occupancy_status',
    'vehicle_in_transit',
    'vehicle_bearing',
    'vehicle_speed',
    'wheelchair_boarding',
	'route_id',
  	'stop_id',
    'stop_name',
	'stop_lat',
  	'stop_lon',
    'stop_distance',
	'stop_sequence',
  	'trip_progress',
    'stop_has_alert',
	'rt_arrival_time',
	'rt_departure_time',
	'sch_arrival_time',
  	'sch_departure_time',
  	'delay',
    'delay_class',
    'incident_nearby',
  	#'incident_category',
	#'incident_delay',
  	#'avg_distance_to_incident',
	#'incident_delay_magnitude',
  	#'incident_count',
	'temperature',
  	'precipitation',
  	'windspeed',
  	'weather',
]]

In [110]:
# Export data to CSV
df.to_csv('../data/stm_weather_traffic_merged.csv', index=False)

## End