# STM Transit Delay Data Preparation

## Overview

This notebook merges STM trip updates, schedules, stops, vehicle positions, service alerts, weather and traffic incidents into a single dataset, and prepares it for data analysis and preprocessing.

## Data description

### Real-time Trip Updates

`current_time`: Timestamp when the data was fetched from the GTFS, in microseconds.<br>
`trip_id`: Unique identifier for the transit trip.<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`start_date`: Start date of the transit trip.<br>
`stop_id`: Unique identifier of a stop.<br>
`arrival_time`, `departure_time`: Real-time arrival and departure times, as Unix timestamps in seconds.<br>
`schedule_relationship`: State of the trip, 0 meaning "scheduled", 1 meaning "skipped" and 2 meaning "no data".

### Scheduled STM Trips

`trip_id`: Unique identifier for the transit trip.<br>
`arrival_time`, `departure_time`: Scheduled arrival and departure time.<br>
`stop_id`: Unique identifier of a stop.<br>
`stop_sequence`: Sequence of a stop, for ordering.

### STM Stops

`stop_id`: Unique identifier of a stop.<br>
`stop_code`: Bus stop or metro station number.<br>
`stop_name`: Bus stop or metro station name.<br>
`stop_lat`, `stop_lon`: Stop coordinates.<br>
`stop_url`: Stop web page.<br>
`location_type`: Stop type.<br>
`parent_station`: Parent station (metro station with multiple exits).<br>
`wheelchair_boarding`: Indicates whether the stop is accessible to people using wheelchairs, 1 being true and 2 being false.

### Real-time Vehicle Positions

`current_time`: Timestamp when the data was fetched from the GTFS, in microseconds.<br>
`vehicle_id`: Unique identifier for a vehicle.<br>
`trip_id`: Unique identifier for the transit trip.<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`start_date`: Start date of a transit trip.<br>
`start_time`: Start time of a transit trip.<br>
`latitude`, `longitude`: Vehicle current position.<br>
`bearing`: Direction that the vehicle is facing, from 0 to 360 degrees.<br>
`speed`: Momentary speed measured by the vehicle, in meters per second.<br>
`stop_sequence`: Sequence of a stop, for ordering.<br>
`status`: Vehicle stop status in relation with a stop that it's currently approaching or is at, 1 being "stopped at" and 2 being "in transit to".<br>
`timestamp`: Timestamp when STM updated the data, as a Unix timestamp in seconds.<br>
`occupancy_status`: Degree of passenger occupancy, ranging from 1 (empty) to 7 (not accepting passengers).

### Service Alerts

`start_time`: Start time of the alert, as a Unix timestamp in seconds.<br>
`end_time`: End time of the alert, as a Unix timestamp in seconds.<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`stop_id`: Unique identifier of a stop.

### Weather Archive and Forecast

`time`: Date and hour of the weather observation.<br>
`temperature`: Air temperature at 2 meters above ground, in Celsius.<br>
`precipitation`: Total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters.<br>
`windspeed`: Wind speed at 10 meters above ground, in kilometers per hour.<br>
`weathercode`: World Meteorological Organization (WMO) code, see [Open-Meteo API documentation](https://open-meteo.com/en/docs#weather_variable_documentation) for the list.

### Traffic Incidents

`category`: Category of the incident.<br>
`start_time`: Start time of the incident, in ISO8601 format.<br>
`end_time`: End time of the incident, in ISO8601 format.<br>
`length`: Length of the incident in meters.<br>
`delay`: Delay in seconds caused by the incident (except road closures).<br>
`magnitude_of_delay`: Severity of the delay, ranging from 0 to 4 (minor to major).<br>
`last_report_time`: Date and time when the incident was last reported, in ISO8601 format.<br>
`latitude`, `longitude`: Coordinates of the incident.

## Imports

In [1]:
from datetime import datetime, timedelta, timezone
from haversine import haversine, Unit
import numpy as np
import pandas as pd
import sys

In [2]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import fetch_weather, LOCAL_TIMEZONE, OCCUPANCY_STATUS, SCHEDULE_RELATIONSHIP, WEATHER_CODES

In [None]:
# Import data
trips_df = pd.read_csv('../data/api/fetched_stm_trip_updates.csv', low_memory=False)
schedules_df = pd.read_csv('../data/download/stop_times_2025-04-30.txt')
stops_df = pd.read_csv('../data/download/stops_2025-04-30.txt')
positions_df =  pd.read_csv('../data/api/fetched_stm_vehicle_positions.csv', low_memory=False)
alerts_df = pd.read_csv('../data/api/fetched_stm_service_alerts.csv')
weather_df = pd.read_csv('../data/api/fetched_historical_weather.csv')
traffic_df = pd.read_csv('../data/api/fetched_traffic.csv')

## Merge Data

### Schedules and stops

In [4]:
# Sort values by stop sequence
schedules_df = schedules_df.sort_values(by=['trip_id', 'stop_sequence'])

In [6]:
# Reset stop sequences (some stops might be missing)
schedules_df['stop_sequence'] = schedules_df.groupby('trip_id').cumcount() + 1

In [7]:
# Add trip progress (vehicles further along the trip are more likely to be delayed)
total_stops = schedules_df.groupby('trip_id')['stop_id'].transform('count')
schedules_df['trip_progress'] = schedules_df['stop_sequence'] / total_stops

In [8]:
# Get distribution of trip progress
schedules_df['trip_progress'].describe()

count    6.629625e+06
mean     5.139220e-01
std      2.887289e-01
min      8.547009e-03
25%      2.631579e-01
50%      5.142857e-01
75%      7.647059e-01
max      1.000000e+00
Name: trip_progress, dtype: float64
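
The renumbering and trip progress steps above can be illustrated on a toy schedule. This is a minimal sketch: the trip and stop identifiers are made up, and trip "A" deliberately has gaps in its stop sequence.

```python
import pandas as pd

# Hypothetical toy schedule: trip "A" has gaps in its stop sequence
toy_df = pd.DataFrame({
    'trip_id': ['A', 'A', 'A', 'B', 'B'],
    'stop_id': [10, 11, 12, 20, 21],
    'stop_sequence': [1, 3, 7, 1, 2],
})

toy_df = toy_df.sort_values(by=['trip_id', 'stop_sequence'])

# Renumber the stops 1..n within each trip
toy_df['stop_sequence'] = toy_df.groupby('trip_id').cumcount() + 1

# Progress is the share of the trip's stops reached so far
total_stops = toy_df.groupby('trip_id')['stop_id'].transform('count')
toy_df['trip_progress'] = toy_df['stop_sequence'] / total_stops

print(toy_df)
```

The last stop of every trip ends up with a progress of exactly 1, which is what the later `last_stop` mask relies on.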

In [9]:
# Merge schedules and stops
schedules_stops_df = pd.merge(left=schedules_df, right=stops_df, how='inner', left_on='stop_id', right_on='stop_code')

In [10]:
schedules_stops_df.columns

Index(['trip_id', 'arrival_time', 'departure_time', 'stop_id_x',
       'stop_sequence', 'trip_progress', 'stop_id_y', 'stop_code', 'stop_name',
       'stop_lat', 'stop_lon', 'stop_url', 'location_type', 'parent_station',
       'wheelchair_boarding'],
      dtype='object')

In [11]:
# Rename stop id and drop other stop id columns
schedules_stops_df = schedules_stops_df.rename(columns={'stop_id_x': 'stop_id'})
schedules_stops_df = schedules_stops_df.drop(['stop_id_y', 'stop_code'], axis=1)

In [12]:
# Get coordinates of previous stop
schedules_stops_df = schedules_stops_df.sort_values(by=['trip_id', 'stop_sequence'])
schedules_stops_df['prev_lat'] = schedules_stops_df.groupby('trip_id')['stop_lat'].shift(1)
schedules_stops_df['prev_lon'] = schedules_stops_df.groupby('trip_id')['stop_lon'].shift(1)

In [13]:
# Make sure the null values are from first stops
prev_null_mask = (schedules_stops_df['prev_lat'].isna()) | (schedules_stops_df['prev_lon'].isna())
first_stop_mask = schedules_stops_df['stop_sequence'] == 1
assert prev_null_mask.sum() == first_stop_mask.sum()

In [14]:
# Get distance from previous stop
schedules_stops_df['stop_distance'] = schedules_stops_df.apply(
  lambda row: haversine((row['prev_lat'], row['prev_lon']), (row['stop_lat'], row['stop_lon']), unit=Unit.METERS),
  axis=1
)

In [15]:
# Drop previous coordinates
schedules_stops_df = schedules_stops_df.drop(['prev_lat', 'prev_lon'], axis=1)

In [16]:
# Replace null distances by zero (first stop of the trip)
schedules_stops_df['stop_distance'] = schedules_stops_df['stop_distance'].fillna(0)

In [17]:
schedules_stops_df['stop_distance'].describe()

count    6.388741e+06
mean     2.737270e+02
std      4.968967e+02
min      0.000000e+00
25%      1.699571e+02
50%      2.269746e+02
75%      2.988377e+02
max      1.376798e+04
Name: stop_distance, dtype: float64
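
Computing the distance row by row with `apply` can be slow on several million rows. As a possible alternative, the haversine formula can be written with NumPy so that it works on whole columns at once. This is a sketch rather than the notebook's method: `haversine_m` and `EARTH_RADIUS_M` are hypothetical names, and the radius is an assumed mean Earth radius, so results may differ very slightly from the `haversine` package.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius, in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters; accepts scalars or NumPy arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

# Example: identical points, then points one degree of latitude apart
lats = np.array([45.0, 45.0])
lons = np.array([-73.0, -73.0])
dist = haversine_m(lats, lons, lats + np.array([0.0, 1.0]), lons)
print(dist)
```

Missing coordinates (such as the previous stop of a first stop) propagate as `NaN`, so the later `fillna(0)` step would still apply.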

In [18]:
# Get stop with largest distance
schedules_stops_df.loc[schedules_stops_df['stop_distance'].idxmax()]

trip_id                                                  281377832
arrival_time                                              05:33:00
departure_time                                            05:33:00
stop_id                                                      61261
stop_sequence                                                   13
trip_progress                                                  1.0
stop_name                            YUL Aéroport Montréal-Trudeau
stop_lat                                                 45.456622
stop_lon                                                -73.751615
stop_url               https://www.stm.info/fr/recherche#stq=61261
location_type                                                    0
parent_station                                                 NaN
wheelchair_boarding                                              1
stop_distance                                          13767.98028
Name: 124692, dtype: object

After checking on the STM website, the large distance makes sense because the bus stops before and after the airport are far away.

### Realtime and Scheduled Trips

In [19]:
# Convert route_id to integer
trips_df['route_id'] = trips_df['route_id'].str.extract(r'(\d+)')
trips_df['route_id'] = trips_df['route_id'].astype('int64')

In [22]:
# Get proportion of duplicates
subset = trips_df.drop('current_time', axis=1).columns
duplicate_mask = trips_df.duplicated(subset=subset)
print(f'{duplicate_mask.mean():.2%}')

10.15%


In [23]:
# Remove duplicates
trips_df = trips_df.drop_duplicates(subset=subset)

In [24]:
# Rename arrival and departure time
trips_df = trips_df.rename(columns={'arrival_time': 'rt_arrival_time','departure_time': 'rt_departure_time'})

In [25]:
# Merge trip updates with schedule
stm_trips_df = pd.merge(left=trips_df, right=schedules_stops_df, how='inner', on=['trip_id', 'stop_id'])

In [26]:
stm_trips_df.columns

Index(['current_time', 'trip_id', 'route_id', 'start_date', 'stop_id',
       'rt_arrival_time', 'rt_departure_time', 'schedule_relationship',
       'arrival_time', 'departure_time', 'stop_sequence', 'trip_progress',
       'stop_name', 'stop_lat', 'stop_lon', 'stop_url', 'location_type',
       'parent_station', 'wheelchair_boarding', 'stop_distance'],
      dtype='object')

In [27]:
# Convert start_date to datetime
stm_trips_df['start_date_dt'] = pd.to_datetime(stm_trips_df['start_date'], format='%Y%m%d')

In [28]:
def parse_gtfs_time(start_date:pd.Timestamp, stop_time:str) -> pd.Timestamp:
	'''
	Converts a GTFS time string (e.g., '25:30:00') to a datetime
	by adding it to the start date of the trip. Hours can exceed 23
	for trips that run past midnight.
	'''
	hours, minutes, seconds = map(int, stop_time.split(':'))
	total_seconds = hours * 3600 + minutes * 60 + seconds

	parsed_time = start_date + timedelta(seconds=total_seconds)
	return parsed_time
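
For large tables, the same conversion can be done on whole columns instead of one row at a time. The following is a sketch of one way to do it, not the notebook's implementation; `parse_gtfs_times` is a hypothetical helper and it assumes every time string has the form `HH:MM:SS`.

```python
import pandas as pd

def parse_gtfs_times(start_dates: pd.Series, stop_times: pd.Series) -> pd.Series:
    """Add 'HH:MM:SS' offsets (hours may exceed 23) to the trip start dates."""
    parts = stop_times.str.split(':', expand=True).astype('int64')
    seconds = parts[0] * 3600 + parts[1] * 60 + parts[2]
    return start_dates + pd.to_timedelta(seconds, unit='s')

# Toy input: the second time rolls over to the next calendar day
dates = pd.to_datetime(pd.Series(['20250430', '20250430']), format='%Y%m%d')
times = pd.Series(['05:33:00', '25:30:00'])
parsed = parse_gtfs_times(dates, times)
print(parsed)
```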

In [29]:
# Parse GTFS scheduled arrival and departure time
parsed_arrival_time = stm_trips_df.apply(lambda row: parse_gtfs_time(row['start_date_dt'], row['arrival_time']), axis=1)
parsed_departure_time = stm_trips_df.apply(lambda row: parse_gtfs_time(row['start_date_dt'], row['departure_time']), axis=1)

In [30]:
# Convert scheduled arrival and departure time to UTC datetime
sch_arrival_time_local = parsed_arrival_time.dt.tz_localize(LOCAL_TIMEZONE)
sch_departure_time_local = parsed_departure_time.dt.tz_localize(LOCAL_TIMEZONE)

stm_trips_df['sch_arrival_time'] = sch_arrival_time_local.dt.tz_convert(timezone.utc)
stm_trips_df['sch_departure_time'] = sch_departure_time_local.dt.tz_convert(timezone.utc)

In [31]:
# Convert realtime arrival and departure time to UTC datetime
stm_trips_df['rt_arrival_time'] = pd.to_datetime(stm_trips_df['rt_arrival_time'] * 1000, origin='unix', unit='ms', utc=True)
stm_trips_df['rt_departure_time'] = pd.to_datetime(stm_trips_df['rt_departure_time'] * 1000, origin='unix', unit='ms', utc=True)

In [32]:
# Get distribution of realtime timestamps
stm_trips_df[['rt_arrival_time', 'rt_departure_time']].describe()

Unnamed: 0,rt_arrival_time,rt_departure_time
count,4231901,4231901
mean,2022-09-22 04:16:03.703160832+00:00,2022-04-01 10:11:16.473557760+00:00
min,1970-01-01 00:00:00+00:00,1970-01-01 00:00:00+00:00
25%,2025-04-29 10:30:41+00:00,2025-04-29 08:40:26+00:00
50%,2025-04-30 19:53:00+00:00,2025-04-30 19:21:00+00:00
75%,2025-05-02 11:50:49+00:00,2025-05-02 11:35:19+00:00
max,2025-05-04 03:00:00+00:00,2025-05-04 02:56:00+00:00


In [33]:
# Zero timestamps were converted to 1970-01-01; build masks for them and for first and last stops
zero_arrival_time = stm_trips_df['rt_arrival_time'].dt.year == 1970
zero_departure_time = stm_trips_df['rt_departure_time'].dt.year == 1970
first_stop = stm_trips_df['stop_sequence'] == 1
last_stop = stm_trips_df['trip_progress'] == 1

In [34]:
# For first stops, replace 0 departure time with scheduled time
stm_trips_df.loc[(first_stop & zero_departure_time), 'rt_departure_time'] = stm_trips_df.loc[(first_stop & zero_departure_time), 'sch_departure_time']

In [35]:
# For last stops, replace 0 arrival time with scheduled time
stm_trips_df.loc[(last_stop & zero_arrival_time), 'rt_arrival_time'] = stm_trips_df.loc[(last_stop & zero_arrival_time), 'sch_arrival_time']

In [36]:
# For other stops, replace 0 arrival time with scheduled time
stm_trips_df.loc[(~first_stop & ~last_stop & zero_arrival_time), 'rt_arrival_time'] = \
	stm_trips_df.loc[(~first_stop & ~last_stop & zero_arrival_time), 'sch_arrival_time']

In [37]:
# Check distribution of dates again
stm_trips_df[['rt_arrival_time', 'rt_departure_time']].describe()

Unnamed: 0,rt_arrival_time,rt_departure_time
count,4231901,4231901
mean,2024-01-15 18:19:01.991859200+00:00,2022-05-17 06:45:26.211661568+00:00
min,1970-01-01 00:00:00+00:00,1970-01-01 00:00:00+00:00
25%,2025-04-29 12:17:34+00:00,2025-04-29 09:23:58+00:00
50%,2025-04-30 20:53:30+00:00,2025-04-30 19:26:32+00:00
75%,2025-05-02 12:18:45+00:00,2025-05-02 11:37:51+00:00
max,2025-05-04 03:00:00+00:00,2025-05-04 02:56:00+00:00


In [38]:
# Replace 1970 dates by null
zero_arrival_time = stm_trips_df['rt_arrival_time'].dt.year == 1970
zero_departure_time = stm_trips_df['rt_departure_time'].dt.year == 1970
stm_trips_df.loc[zero_arrival_time, 'rt_arrival_time'] = None
stm_trips_df.loc[zero_departure_time, 'rt_departure_time'] = None
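
Instead of looking for the year 1970 after the conversion, the zero timestamps could also be masked using the raw values. The sketch below assumes, as the conversion cells above suggest, that the raw values are Unix timestamps in seconds and that 0 means "no value"; the sample numbers are illustrative.

```python
import pandas as pd

# Unix timestamps in seconds; 0 stands for a missing value
raw = pd.Series([1745793882, 0, 1746321339])

# Convert, then blank out the rows whose raw value was 0
converted = pd.to_datetime(raw, unit='s', utc=True).mask(raw == 0)
print(converted)
```

The masked rows become `NaT`, so they can still be filled from the scheduled times with `fillna` where that is appropriate.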

In [39]:
stm_trips_df[['rt_arrival_time', 'rt_departure_time']].describe()

Unnamed: 0,rt_arrival_time,rt_departure_time
count,4133218,4005798
mean,2025-05-01 00:19:58.538594048+00:00,2025-05-01 00:19:23.698283520+00:00
min,2025-04-27 20:44:51+00:00,2025-04-27 20:44:51+00:00
25%,2025-04-29 14:00:44+00:00,2025-04-29 13:57:49+00:00
50%,2025-04-30 21:47:55.500000+00:00,2025-04-30 21:46:48+00:00
75%,2025-05-02 12:51:46+00:00,2025-05-02 12:52:14+00:00
max,2025-05-04 03:00:00+00:00,2025-05-04 02:56:00+00:00


In [40]:
# Calculate delay (realtime - scheduled)
# For first stops, calculate with departure time
# For the rest, calculate with arrival time
stm_trips_df['delay'] = np.where(
	stm_trips_df['stop_sequence'] == 1,
	(stm_trips_df['rt_departure_time'] - stm_trips_df['sch_departure_time']) / pd.Timedelta(seconds=1),
	(stm_trips_df['rt_arrival_time'] - stm_trips_df['sch_arrival_time']) / pd.Timedelta(seconds=1)
)

In [41]:
# Get distribution
stm_trips_df['delay'].describe()

count    4.231901e+06
mean     7.006612e+01
std      4.807224e+02
min     -1.359200e+04
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      5.458500e+04
Name: delay, dtype: float64
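
To make the sign convention explicit, here is a small sketch of the same delay rule on two made-up rows: a first stop that departs one minute late and a second stop reached 30 seconds early. It uses `.dt.total_seconds()`, which gives the same result as dividing by `pd.Timedelta(seconds=1)`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'stop_sequence': [1, 2],
    'rt_departure_time': pd.to_datetime(['2025-04-30 12:01:00', None], utc=True),
    'sch_departure_time': pd.to_datetime(['2025-04-30 12:00:00', '2025-04-30 12:10:00'], utc=True),
    'rt_arrival_time': pd.to_datetime([None, '2025-04-30 12:09:30'], utc=True),
    'sch_arrival_time': pd.to_datetime(['2025-04-30 12:00:00', '2025-04-30 12:10:00'], utc=True),
})

# First stops use the departure time, all other stops use the arrival time
delay = np.where(
    toy['stop_sequence'] == 1,
    (toy['rt_departure_time'] - toy['sch_departure_time']).dt.total_seconds(),
    (toy['rt_arrival_time'] - toy['sch_arrival_time']).dt.total_seconds(),
)
print(delay)
```

A positive delay means the vehicle is late and a negative delay means it is early.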

In [42]:
stm_trips_df.columns

Index(['current_time', 'trip_id', 'route_id', 'start_date', 'stop_id',
       'rt_arrival_time', 'rt_departure_time', 'schedule_relationship',
       'arrival_time', 'departure_time', 'stop_sequence', 'trip_progress',
       'stop_name', 'stop_lat', 'stop_lon', 'stop_url', 'location_type',
       'parent_station', 'wheelchair_boarding', 'stop_distance',
       'start_date_dt', 'sch_arrival_time', 'sch_departure_time', 'delay'],
      dtype='object')

In [43]:
# Remove extra datetime columns
stm_trips_df = stm_trips_df.drop(['arrival_time', 'departure_time', 'start_date_dt'], axis=1)

### Vehicle Positions

In [46]:
# Get proportion of duplicates
subset = positions_df.drop('current_time', axis=1).columns
duplicate_mask = positions_df.duplicated(subset=subset)
print(f'{duplicate_mask.mean():.2%}')

0.01%


In [47]:
# Drop duplicates
positions_df = positions_df.drop_duplicates(subset=subset)

In [44]:
# Rename vehicle columns
positions_df = positions_df.rename(columns={
  'latitude': 'vehicle_lat',
  'longitude': 'vehicle_lon',
  'status': 'vehicle_status',
  'bearing': 'vehicle_bearing',
  'speed': 'vehicle_speed',
  'timestamp': 'vehicle_dt'
})

In [48]:
# Merge positions
stm_trips_positions_df = pd.merge(left=stm_trips_df, right=positions_df, how='inner', on=['trip_id', 'route_id', 'start_date', 'stop_sequence'])

In [49]:
stm_trips_positions_df.columns

Index(['current_time_x', 'trip_id', 'route_id', 'start_date', 'stop_id',
       'rt_arrival_time', 'rt_departure_time', 'schedule_relationship',
       'stop_sequence', 'trip_progress', 'stop_name', 'stop_lat', 'stop_lon',
       'stop_url', 'location_type', 'parent_station', 'wheelchair_boarding',
       'stop_distance', 'sch_arrival_time', 'sch_departure_time', 'delay',
       'current_time_y', 'vehicle_id', 'start_time', 'vehicle_lat',
       'vehicle_lon', 'vehicle_bearing', 'vehicle_speed', 'vehicle_status',
       'vehicle_dt', 'occupancy_status'],
      dtype='object')

In [50]:
# Drop current timestamps and start_date
stm_trips_positions_df = stm_trips_positions_df.drop(['current_time_x', 'current_time_y', 'start_date'], axis=1)

In [51]:
# Calculate distance between the vehicle and the current stop
stm_trips_positions_df['vehicle_distance'] = stm_trips_positions_df.apply(
  	lambda row: haversine((row['vehicle_lat'], row['vehicle_lon']), (row['stop_lat'], row['stop_lon']), unit=Unit.METERS),
  	axis=1)

In [52]:
stm_trips_positions_df['vehicle_distance'].describe()

count    3.447580e+05
mean     5.019009e+02
std      3.330059e+04
min      2.102731e-02
25%      7.932472e+00
50%      1.124396e+02
75%      2.420350e+02
max      8.742123e+06
Name: vehicle_distance, dtype: float64

In [53]:
# Show rows where the vehicle is farther from the stop than the previous stop is
further_than_previous_stop = stm_trips_positions_df['vehicle_distance'] > stm_trips_positions_df['stop_distance']
stm_trips_positions_df[further_than_previous_stop]

Unnamed: 0,trip_id,route_id,stop_id,rt_arrival_time,rt_departure_time,schedule_relationship,stop_sequence,trip_progress,stop_name,stop_lat,...,vehicle_id,start_time,vehicle_lat,vehicle_lon,vehicle_bearing,vehicle_speed,vehicle_status,vehicle_dt,occupancy_status,vehicle_distance
4,283213434,968,60296,NaT,2025-04-27 23:00:00+00:00,0,1,0.333333,Station Côte-Vertu,45.514212,...,40187,19:00:00,45.514233,-73.684097,0.0,0.00000,1,1745793882,1,6.475190
53,283855195,33,54923,2025-04-27 22:15:02+00:00,2025-04-27 22:15:02+00:00,0,35,0.583333,Langelier / Robert,45.597087,...,39109,17:49:00,45.600529,-73.595451,120.0,2.50002,2,1745792095,1,699.591350
76,284752224,55,52947,2025-04-27 22:18:23+00:00,NaT,0,47,1.000000,Saint-Laurent / Saint-Jacques,45.506482,...,42015,17:25:00,45.503372,-73.560776,0.0,0.00000,2,1745792078,2,497.842018
93,283553131,51,50110,2025-04-27 22:55:00+00:00,NaT,0,53,1.000000,Gare Montréal-Ouest (Elmhurst / Sherbrooke),45.454203,...,41007,18:04:00,45.469250,-73.602531,0.0,0.00000,2,1745795448,1,3467.991306
94,283553304,51,50110,2025-04-27 22:12:37+00:00,NaT,0,53,1.000000,Gare Montréal-Ouest (Elmhurst / Sherbrooke),45.454203,...,40089,17:13:00,45.457382,-73.639320,39.0,7.50006,2,1745792086,2,391.626870
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
344736,286591884,61,52898,2025-05-04 01:15:22+00:00,2025-05-04 01:15:22+00:00,0,7,0.137255,Wellington / Murray,45.492384,...,29064,21:09:00,45.493366,-73.557610,121.0,2.77780,2,1746321339,1,118.280391
344739,284302838,103,53799,NaT,2025-05-04 01:09:00+00:00,0,1,0.035714,Station Villa-Maria,45.479498,...,37074,21:09:00,45.479473,-73.619514,0.0,0.00000,1,1746320446,1,5.810695
344740,284302838,103,50962,2025-05-04 01:15:53+00:00,2025-05-04 01:15:53+00:00,0,9,0.321429,Grand Boulevard / de Terrebonne,45.470548,...,37074,21:09:00,45.469296,-73.628548,0.0,0.00000,2,1746321358,1,274.358528
344754,286589756,71,52148,NaT,2025-05-04 01:03:00+00:00,0,1,0.035714,Station Guy-Concordia (De Maisonneuve/St-Mathieu),45.494350,...,40001,21:03:00,45.494312,-73.580826,0.0,0.00000,1,1746320457,1,6.469056


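Some vehicle distances are implausibly large (up to about 8,742 km), and many exceed the distance from the previous stop, which is inconsistent with the reported delays. Because these values would likely confuse the model, the vehicle coordinates and the computed distance won't be used.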

In [54]:
# Drop unneeded columns
stm_trips_positions_df = stm_trips_positions_df.drop(['vehicle_lat', 'vehicle_lon', 'vehicle_dt', 'vehicle_distance', 'start_time'], axis=1)

### Service Alerts

In [67]:
# Get proportion of duplicates
duplicate_mask = alerts_df.duplicated()
print(f'{duplicate_mask.mean():.2%}')

62.65%


In [68]:
# Remove duplicates
alerts_df = alerts_df.drop_duplicates().reset_index(drop=True)

In [69]:
# Convert timestamps to datetime
alerts_df['start_time'] = pd.to_datetime(alerts_df['start_time'] * 1000, origin='unix', unit='ms', utc=True)
alerts_df['end_time'] = pd.to_datetime(alerts_df['end_time'] * 1000, origin='unix', unit='ms', utc=True)

In [70]:
# Fill null end time with current date (assuming the alert is still active)
alerts_df['end_time'] = alerts_df['end_time'].fillna(datetime.now(timezone.utc).replace(microsecond=0))

In [71]:
# Sort values by date
alerts_df = alerts_df.sort_values('start_time').reset_index()

In [72]:
stm_df = pd.merge(left=stm_trips_positions_df, right=alerts_df, how='left', on=['route_id', 'stop_id'])

In [73]:
stm_df.columns

Index(['trip_id', 'route_id', 'stop_id', 'rt_arrival_time',
       'rt_departure_time', 'schedule_relationship', 'stop_sequence',
       'trip_progress', 'stop_name', 'stop_lat', 'stop_lon', 'stop_url',
       'location_type', 'parent_station', 'wheelchair_boarding',
       'stop_distance', 'sch_arrival_time', 'sch_departure_time', 'delay',
       'vehicle_id', 'vehicle_bearing', 'vehicle_speed', 'vehicle_status',
       'occupancy_status', 'index', 'start_time', 'end_time'],
      dtype='object')

In [74]:
# Add column has_alert
has_alert_mask = (stm_df['start_time'].notna()) & \
	(stm_df['rt_arrival_time'] >= stm_df['start_time']) & \
	(stm_df['rt_arrival_time'] <= stm_df['end_time'])
stm_df['stop_has_alert'] = has_alert_mask.astype('int64')

In [75]:
stm_df['stop_has_alert'].value_counts()

stop_has_alert
0    335460
1     19741
Name: count, dtype: int64

In [76]:
# Drop unneeded datetime columns and index
stm_df = stm_df.drop(['start_time', 'end_time', 'index'], axis=1)
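
A left merge returns one row per matching alert, so a stop update whose route and stop have several alerts appears more than once after the merge. If one row per update is preferred, the flag can be aggregated back to the original rows. The sketch below is one possible approach on made-up data; `row_id` is a hypothetical helper column that does not exist in the notebook.

```python
import pandas as pd

trips = pd.DataFrame({
    'row_id': [0, 1],  # hypothetical identifier of each stop update
    'route_id': [51, 51],
    'stop_id': [100, 200],
    'rt_arrival_time': pd.to_datetime(['2025-04-30 12:00', '2025-04-30 12:00'], utc=True),
})
alerts = pd.DataFrame({
    'route_id': [51, 51],
    'stop_id': [100, 100],
    'start_time': pd.to_datetime(['2025-04-29 00:00', '2025-04-30 11:00'], utc=True),
    'end_time': pd.to_datetime(['2025-04-29 06:00', '2025-04-30 13:00'], utc=True),
})

# The first update matches two alerts, so the merge yields three rows
merged = trips.merge(alerts, how='left', on=['route_id', 'stop_id'])

# An alert is active when the arrival falls between its start and end
active = merged['rt_arrival_time'].between(merged['start_time'], merged['end_time'])

# Collapse back to one flag per original update
has_alert = active.groupby(merged['row_id']).any().astype('int64')
trips['stop_has_alert'] = trips['row_id'].map(has_alert)
print(trips)
```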

### STM and Weather

In [77]:
# Convert time string to datetime
time_dt = pd.to_datetime(weather_df['time'], utc=True)

In [78]:
# Calculate dates for weather forecast
last_day_weather = time_dt.max()
start_date = last_day_weather + timedelta(days=1)
end_date = stm_df['rt_arrival_time'].max()

In [79]:
# Fetch forecast weather
start_date_str = start_date.strftime('%Y-%m-%d')
end_date_str = end_date.strftime('%Y-%m-%d')

forecast_list = fetch_weather(start_date=start_date_str, end_date=end_date_str, forecast=True)
forecast_df = pd.DataFrame(forecast_list)

In [80]:
# Merge archive and forecast weather
weather_df = pd.concat([weather_df, forecast_df], ignore_index=True)

In [81]:
# Round arrival time to the nearest hour
rounded_arrival_dt = stm_df['rt_arrival_time'].dt.round('h')

In [82]:
# Format time to match weather data
stm_df['time'] = rounded_arrival_dt.dt.strftime('%Y-%m-%dT%H:%M')

In [83]:
# Merge STM with weather
stm_weather_df = pd.merge(left=stm_df, right=weather_df, how='inner', on='time').drop('time', axis=1)
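
Because this is an inner merge, rows whose `time` key has no match in the weather data are dropped, including rows with a missing arrival time. A quick way to see how many rows would be lost is a left merge with `indicator=True`. The sketch below uses made-up values to show the key format and the check.

```python
import pandas as pd

arrivals = pd.Series(pd.to_datetime(
    ['2025-04-30 19:10:00', '2025-04-30 19:45:00', None], utc=True
))

# Same key construction as above: round to the hour, then format as text
toy_stm = pd.DataFrame({
    'time': arrivals.dt.round('h').dt.strftime('%Y-%m-%dT%H:%M'),
})
toy_weather = pd.DataFrame({'time': ['2025-04-30T19:00'], 'temperature': [12.5]})

# 'left_only' rows are the ones an inner merge would drop
checked = pd.merge(toy_stm, toy_weather, how='left', on='time', indicator=True)
unmatched = (checked['_merge'] == 'left_only').sum()
print(checked)
print(unmatched)
```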

### Traffic Data

In [84]:
# Get proportion of duplicates
duplicate_mask = traffic_df.duplicated()
print(f'{duplicate_mask.mean():.2%}')

36.18%


In [85]:
# Remove duplicates
traffic_df = traffic_df.drop_duplicates(keep='last').reset_index(drop=True)

In [86]:
# Convert traffic start_time and end_time to datetime
traffic_df['start_time'] = pd.to_datetime(traffic_df['start_time'], utc=True)
traffic_df['end_time'] = pd.to_datetime(traffic_df['end_time'], utc=True)

In [87]:
# Sort by date
traffic_df = traffic_df.sort_values(by='start_time').reset_index(drop=True)

In [88]:
# Fill null end times with current time (assuming the incident is still ongoing)
traffic_df['end_time'] = traffic_df['end_time'].fillna(datetime.now(timezone.utc).replace(microsecond=0))
assert traffic_df['end_time'].isna().sum() == 0

In [89]:
# Build traffic cache (incidents grouped by the 15-minute interval in which they start)
def build_traffic_cache(traffic_df:pd.DataFrame) -> dict:
	traffic_cache = {}
	traffic_df['quarter_hour'] = traffic_df['start_time'].dt.floor('15min')

	for (quarter_hour, group) in traffic_df.groupby('quarter_hour'):
		traffic_cache[quarter_hour] = group.copy()

	return traffic_cache

Many trip updates fall on the same day, often within the same hour, so filtering the active traffic incidents separately for each update would repeat the same work and take a long time on a large dataset. Traffic incidents are stable over minutes or hours, so they are cached by 15-minute interval. The cache is keyed on each incident's start time, so a lookup only returns incidents that started in the same 15-minute interval as the arrival.
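
The cache in `build_traffic_cache` groups incidents by the interval in which they start, so an incident that began in an earlier interval and is still ongoing is not found by a later lookup. If such incidents should be included, one option is to register each incident under every interval it overlaps. The following is a sketch of that idea, not the notebook's method; `build_interval_cache` is a hypothetical function, and very long incidents would create many cache entries.

```python
import pandas as pd

def build_interval_cache(incidents: pd.DataFrame, freq: str = '15min') -> dict:
    """Map each interval start to the incidents active at some point in it.

    Assumes `incidents` has a unique index and non-null, timezone-aware
    `start_time` and `end_time` columns.
    """
    labels_by_bucket = {}
    for label, start, end in zip(incidents.index, incidents['start_time'], incidents['end_time']):
        # Every interval from the one containing the start to the one containing the end
        for bucket in pd.date_range(start.floor(freq), end.floor(freq), freq=freq):
            labels_by_bucket.setdefault(bucket, []).append(label)
    return {bucket: incidents.loc[labels] for bucket, labels in labels_by_bucket.items()}

# Toy incident (made-up values) active from 12:05 to 12:40 UTC
toy_incidents = pd.DataFrame({
    'start_time': pd.to_datetime(['2025-04-30 12:05'], utc=True),
    'end_time': pd.to_datetime(['2025-04-30 12:40'], utc=True),
    'category': [1],
})
interval_cache = build_interval_cache(toy_incidents)

# An arrival at 12:31 falls in the 12:30 interval, which the incident overlaps
arrival = pd.Timestamp('2025-04-30 12:31', tz='UTC')
active_candidates = interval_cache.get(arrival.floor('15min'))
print(active_candidates)
```

The exact start and end comparison is still needed after the lookup, since an interval can contain incidents that end before, or start after, the arrival time.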

In [90]:
def calculate_nearby_incidents(trip_update:pd.Series, traffic_cache:dict, max_distance:int=500) -> pd.Series:
	trip_datetime = trip_update['rt_arrival_time']
	stop_coords = (trip_update['stop_lat'], trip_update['stop_lon'])

	trip_quarter_hour = trip_datetime.floor('15min')

	# Get cached incidents
	quarter_hour_incidents = traffic_cache.get(trip_quarter_hour)

	# Stop if there are no incidents for that 15-minute interval
	if quarter_hour_incidents is None or quarter_hour_incidents.empty:
		return pd.Series({
			'incident_nearby': 0,
			'nearest_incident_distance': None,
			'incident_category': None,
			'incident_delay': None,
			'incident_delay_magnitude': None
		})

	# Keep the incidents that are active at the arrival time
	active_incidents = quarter_hour_incidents[
		(quarter_hour_incidents['start_time'] <= trip_datetime) &
		(quarter_hour_incidents['end_time'] >= trip_datetime)
	].copy()

	if active_incidents.empty:
		return pd.Series({
			'incident_nearby': 0,
			'nearest_incident_distance': None,
			'incident_category': None,
			'incident_delay': None,
			'incident_delay_magnitude': None
		})

	# Calculate the distance between the stop and each incident
	active_incidents['distance'] = active_incidents.apply(
		lambda row: haversine(stop_coords, (row['latitude'], row['longitude']), unit=Unit.METERS),
		axis=1
	)

	# Filter nearby
	nearby_incidents = active_incidents[active_incidents['distance'] <= max_distance]

	if nearby_incidents.empty:
		return pd.Series({
			'incident_nearby': 0,
			'nearest_incident_distance': None,
			'incident_category': None,
			'incident_delay': None,
			'incident_delay_magnitude': None
		})
	else:
		nearest = nearby_incidents.loc[nearby_incidents['distance'].idxmin()]
		return pd.Series({
			'incident_nearby': 1,
			'nearest_incident_distance': nearest['distance'],
			'incident_category': nearest['category'],
			'incident_delay': nearest['delay'],
			'incident_delay_magnitude': nearest['magnitude_of_delay']
		})

In [91]:
# Get traffic columns (get incidents within 2 km)
traffic_cache = build_traffic_cache(traffic_df)
traffic_cols = stm_weather_df.apply(lambda row: calculate_nearby_incidents(row, traffic_cache, 2000), axis=1)

Traffic in the general area (within 1 to 2 km) can affect delays even when the incident is not directly at the stop. This is why incidents within 2 km are considered.

In [92]:
# Merge the traffic
df = pd.concat([stm_weather_df, traffic_cols], axis=1)

## Clean Data

### Convert columns

In [93]:
# Get columns with two values
two_values = df.loc[:, df.nunique() == 2]
two_values.columns

Index(['wheelchair_boarding', 'vehicle_status', 'stop_has_alert',
       'incident_nearby'],
      dtype='object')

In [94]:
print(df['wheelchair_boarding'].value_counts())
print(df['vehicle_status'].value_counts())
print(df['incident_nearby'].value_counts())
print(df['stop_has_alert'].value_counts())

wheelchair_boarding
1    305388
2     16820
Name: count, dtype: int64
vehicle_status
2    251528
1     70680
Name: count, dtype: int64
incident_nearby
0.0    268662
1.0     53546
Name: count, dtype: int64
stop_has_alert
0    302467
1     19741
Name: count, dtype: int64


In [95]:
# Convert columns with 2 unique values to binary (0/1) integers
df['wheelchair_boarding'] = (df['wheelchair_boarding'] == 1).astype('int64')
df['vehicle_in_transit'] = (df['vehicle_status'] == 2).astype('int64')
df['incident_nearby'] = df['incident_nearby'].astype('int64')

### Drop columns

In [96]:
# Remove columns with constant values or with more than 50% missing values
df = df.loc[:, (df.nunique() > 1) & (df.isna().mean() < 0.5)]
df.columns

Index(['trip_id', 'route_id', 'stop_id', 'rt_arrival_time',
       'rt_departure_time', 'schedule_relationship', 'stop_sequence',
       'trip_progress', 'stop_name', 'stop_lat', 'stop_lon', 'stop_url',
       'wheelchair_boarding', 'stop_distance', 'sch_arrival_time',
       'sch_departure_time', 'delay', 'vehicle_id', 'vehicle_bearing',
       'vehicle_speed', 'vehicle_status', 'occupancy_status', 'stop_has_alert',
       'temperature', 'precipitation', 'windspeed', 'weathercode',
       'incident_nearby', 'vehicle_in_transit'],
      dtype='object')
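
The column filter combines two boolean Series indexed by column name. A toy example (with made-up columns) shows which columns each condition removes:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'constant': [1, 1, 1, 1, 1],
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0, 2.0],
    'useful': [1, 2, 3, 4, 5],
})

# Keep columns with more than one distinct value and less than 50% missing values
keep = (toy.nunique() > 1) & (toy.isna().mean() < 0.5)
filtered = toy.loc[:, keep]
print(filtered.columns.tolist())
```

`nunique` ignores missing values by default, so a column holding a single value plus nulls also counts as constant.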

**Other columns to drop**

`stop_url`: A URL is not useful for data analysis.<br>
`vehicle_status`: It has been converted to `vehicle_in_transit`.

### Create Delay Categories

In [104]:
df['delay'].describe()

count    322208.000000
mean         58.600575
std         308.803202
min      -13592.000000
25%           0.000000
50%           0.000000
75%          40.000000
max       31108.000000
Name: delay, dtype: float64

In [105]:
# Create ranges (in seconds) and labels based on distribution
labels = ['Very Early', 'Early', 'On Time', 'Late', 'Very Late']
ranges = [-np.inf, -60, 0, 60, 300, np.inf]

In [106]:
# Add delay category column
df['delay_class'] = pd.cut(df['delay'], bins=ranges, labels=labels, include_lowest=True, right=False)
df['delay_class'].value_counts(normalize=True)

delay_class
On Time       0.716010
Late          0.173354
Very Late     0.066584
Very Early    0.027662
Early         0.016390
Name: proportion, dtype: float64
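
With `right=False`, each bin includes its left edge and excludes its right edge, so a delay of exactly 0 seconds is "On Time" and a delay of exactly 60 seconds is "Late". The sketch below checks the boundaries on a few sample delays, using the same labels and ranges as above.

```python
import numpy as np
import pandas as pd

labels = ['Very Early', 'Early', 'On Time', 'Late', 'Very Late']
ranges = [-np.inf, -60, 0, 60, 300, np.inf]

# Sample delays in seconds, chosen around the bin edges
sample = pd.Series([-61, -60, -1, 0, 59, 60, 299, 300])
classes = pd.cut(sample, bins=ranges, labels=labels, right=False)
print(pd.DataFrame({'delay': sample, 'delay_class': classes}))
```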

### Create Weather Categories

In [107]:
# Create weather code mapping
weathercodes = df['weathercode'].sort_values().unique()
condition_list = []
label_list = []

for code in weathercodes:
  condition_list.append(df['weathercode'] == code)
  label_list.append(WEATHER_CODES[code])

In [108]:
# Create categories
df['weather'] = np.select(condition_list, label_list, default='Unknown')
df['weather'].value_counts()

weather
Overcast            144665
Clear sky            92861
Mainly clear         30047
Partly cloudy        20265
Light drizzle        20118
Slight rain          10281
Dense drizzle         2967
Moderate drizzle      1004
Name: count, dtype: int64
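
The same code-to-label mapping can also be written with `Series.map`, which avoids building the condition and label lists. This is a sketch of an alternative, assuming the lookup is a plain dictionary keyed by code; the `codes` dictionary here is a hypothetical subset standing in for `WEATHER_CODES`.

```python
import pandas as pd

# Hypothetical subset of the code-to-label dictionary
codes = {0: 'Clear sky', 1: 'Mainly clear', 2: 'Partly cloudy', 3: 'Overcast'}

weathercode = pd.Series([0, 3, 3, 99])

# Codes missing from the dictionary become NaN, then 'Unknown'
weather = weathercode.map(codes).fillna('Unknown')
print(weather.value_counts())
```

The same pattern would apply to the schedule relationship and occupancy status columns.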

### Convert Schedule Relationship to Categories

In [109]:
# Create schedule relationship mapping
sch_codes = df['schedule_relationship'].sort_values().unique()
condition_list = []
label_list = []

for code in sch_codes:
  condition_list.append(df['schedule_relationship'] == code)
  label_list.append(SCHEDULE_RELATIONSHIP[code])

In [110]:
# Create categories
df['schedule_relationship'] = np.select(condition_list, label_list, default='Unknown')
df['schedule_relationship'].value_counts()

schedule_relationship
Scheduled    321824
Skipped         370
No Data          14
Name: count, dtype: int64

### Convert Occupancy Status to Categories

In [111]:
# Create occupancy status mapping
occ_codes = df['occupancy_status'].sort_values().unique()
condition_list = []
label_list = []

for code in occ_codes:
  condition_list.append(df['occupancy_status'] == code)
  label_list.append(OCCUPANCY_STATUS[code])

In [112]:
# Create categories
df['occupancy_status'] = np.select(condition_list, label_list, default='Unknown')
df['occupancy_status'].value_counts()

occupancy_status
Empty                         136521
Many seats available          106648
Few seats available            73563
Crushed standing room only      5282
Unknown                          194
Name: count, dtype: int64

## Export Data

In [113]:
# Drop and reorder columns
df = df[[
    'trip_id',
    'vehicle_id',
    'vehicle_in_transit', # converted
    'vehicle_bearing',
    'vehicle_speed',
    'occupancy_status',
    'route_id',
    'stop_id',
    'stop_name',
    'stop_lat',
    'stop_lon',
    'stop_distance', # engineered
    'stop_sequence',
    'trip_progress', # engineered
    'stop_has_alert', # engineered
    'wheelchair_boarding', # converted
    'schedule_relationship',
    'rt_arrival_time', # parsed
    'sch_arrival_time', # parsed
    'rt_departure_time', # parsed
    'sch_departure_time', # parsed
    'delay', # engineered
    'delay_class', # engineered
    'temperature',
    'precipitation',
    'windspeed',
    'weather',
    'incident_nearby', # engineered
]]

In [114]:
# Export data to CSV
df.to_csv('../data/stm_weather_traffic_merged.csv', index=False)

## End