# STM Transit Delay Data Preparation

## Overview

This notebook merges STM real-time trip updates, schedules, stops, and vehicle positions with weather and traffic incident data, then prepares the merged dataset for data analysis and preprocessing.

## Data description

### Real-time Trip Updates

`current_time`: Timestamp when the data was fetched from the GTFS, in microseconds.<br>
`trip_id`: Unique identifier for the transit trip.<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`start_date`: Start date of the transit trip.<br>
`stop_id`: Unique identifier of a stop.<br>
`arrival_time`, `departure_time`: Real-time arrival and departure times, as Unix timestamps in seconds.<br>
`schedule_relationship`: State of the trip, 0 meaning "scheduled" and 1 meaning "skipped".

### Scheduled STM Trips

`trip_id`: Unique identifier for the transit trip.<br>
`arrival_time`, `departure_time`: Scheduled arrival and departure time.<br>
`stop_id`: Unique identifier of a stop.<br>
`stop_sequence`: Sequence of a stop, for ordering.

### STM Stops

`stop_id`: Unique identifier of a stop.<br>
`stop_code`: Bus stop or metro station number.<br>
`stop_name`: Bus stop or metro station name.<br>
`stop_lat`, `stop_lon`: Stop coordinates.<br>
`stop_url`: Stop web page.<br>
`location_type`: Stop type.<br>
`parent_station`: Parent station (metro station with multiple exits).<br>
`wheelchair_boarding`: Indicates whether the stop is accessible to people using a wheelchair, 1 being true and 2 being false.

### Real-time Vehicle Positions

`current_time`: Timestamp when the data was fetched from the GTFS, in microseconds.<br>
`vehicle_id`: Unique identifier for a vehicle.<br>
`trip_id`: Unique identifier for the transit trip.<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`start_date`: Start date of a transit trip.<br>
`start_time`: Start time of a transit trip.<br>
`latitude`, `longitude`: Vehicle current position.<br>
`bearing`: Direction that the vehicle is facing, from 0 to 360 degrees.<br>
`speed`: Momentary speed measured by the vehicle, in meters per second.<br>
`stop_sequence`: Sequence of a stop, for ordering.<br>
`status`: Vehicle status relative to the stop it is currently approaching or at, 1 being "stopped at" and 2 being "in transit to".<br>
`timestamp`: Unix timestamp of when STM updated the data, in seconds.<br>
`occupancy_status`: Degree of passenger occupancy, ranging from 1 (empty) to 7 (not accepting passengers).

### Weather Archive and Forecast

`time`: Date and hour of the weather record.<br>
`temperature`: Air temperature at 2 meters above ground, in Celsius.<br>
`precipitation`: Total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters.<br>
`windspeed`: Wind speed at 10 meters above ground, in kilometers per hour.<br>
`weathercode`: World Meteorological Organization (WMO) code, see [Open-Meteo API documentation](https://open-meteo.com/en/docs#weather_variable_documentation) for the list.

### Traffic Incidents

`category`: Category of the incident.<br>
`start_time`: Start time of the incident, in ISO8601 format.<br>
`end_time`: End time of the incident, in ISO8601 format.<br>
`length`: Length of the incident in meters.<br>
`delay`: Delay in seconds caused by the incident (except road closures).<br>
`magnitude_of_delay`: Severity of the delay, ranging from 0 to 4 (minor to major).<br>
`last_report_time`: Date and time when the incident was last reported, in ISO8601 format.<br>
`latitude`, `longitude`: Coordinates of the incident.

## Imports

In [1]:
from datetime import datetime, timedelta, timezone
from haversine import haversine, Unit
import pandas as pd
import sys

In [2]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import fetch_weather, LOCAL_TIMEZONE

In [3]:
# Import data
trips_df = pd.read_csv('../data/api/fetched_stm_trip_updates.csv', low_memory=False)
schedules_df = pd.read_csv('../data/download/stop_times_2025-04-30.txt')
stops_df = pd.read_csv('../data/download/stops_2025-04-30.txt')
positions_df = pd.read_csv('../data/api/fetched_stm_vehicle_positions.csv', low_memory=False)
weather_df = pd.read_csv('../data/api/fetched_historical_weather.csv')
traffic_df = pd.read_csv('../data/api/fetched_traffic.csv')

## Clean Trip Updates

In [4]:
# Convert route_id to integer
trips_df['route_id'] = trips_df['route_id'].str.extract(r'(\d+)')
trips_df['route_id'] = trips_df['route_id'].astype('int64')

In [5]:
# Sort trips
trips_df = trips_df.sort_values(by=['current_time', 'trip_id', 'route_id', 'arrival_time'])

In [6]:
# Get proportion of duplicates
subset = ['start_date', 'trip_id', 'route_id', 'stop_id']
duplicate_mask = trips_df.duplicated(subset=subset)
print(f'{duplicate_mask.mean():.2%}')

26.02%


In [7]:
# Remove duplicates
trips_df = trips_df.drop_duplicates(subset=subset, keep='last') # keep latest update
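
The sort-then-deduplicate pattern used above can be illustrated on a toy frame. This is a minimal sketch with made-up values; only the column names come from the trip updates, and the name `key_cols` is an assumption for illustration.

```python
import pandas as pd

# Toy trip updates: the same stop of trip 1 was fetched twice
toy_df = pd.DataFrame({
    'current_time': [100, 200, 100],
    'start_date': [20250430, 20250430, 20250430],
    'trip_id': [1, 1, 2],
    'route_id': [10, 10, 20],
    'stop_id': [5, 5, 7],
    'arrival_time': [1000, 1060, 2000],
})

key_cols = ['start_date', 'trip_id', 'route_id', 'stop_id']

# Sort by fetch time so that keep='last' retains the most recent update
toy_df = toy_df.sort_values(by='current_time')
latest_df = toy_df.drop_duplicates(subset=key_cols, keep='last')
```

Because the rows are sorted by `current_time` first, `keep='last'` keeps the update fetched most recently for each stop of each trip.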

In [8]:
# Convert realtime arrival and departure time to milliseconds
trips_df['arrival_time'] = trips_df['arrival_time'] * 1000
trips_df['departure_time'] = trips_df['departure_time'] * 1000

In [9]:
# Rename arrival and departure time
trips_df = trips_df.rename(columns={'arrival_time': 'rt_arrival_time','departure_time': 'rt_departure_time'})

## Merge Data

### Schedules and stops

In [10]:
# Sort values by stop sequence
schedules_df = schedules_df.sort_values(by=['trip_id', 'stop_sequence'])

In [11]:
# Reset stop sequences (some stops might be missing)
schedules_df['stop_sequence'] = schedules_df.groupby('trip_id').cumcount() + 1

In [12]:
# Add trip progress (vehicles further along the trip are more likely to be delayed)
total_stops = schedules_df.groupby('trip_id')['stop_id'].transform('count')
schedules_df['trip_progress'] = schedules_df['stop_sequence'] / total_stops
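
To make the resequencing and the trip progress concrete, here is a small sketch on invented data. The trips `A` and `B` and their stop ids are hypothetical; the operations mirror the cells above.

```python
import pandas as pd

# Toy schedule with gaps in the raw stop sequence
toy_schedule = pd.DataFrame({
    'trip_id': ['A', 'A', 'A', 'B', 'B'],
    'stop_id': [11, 12, 13, 21, 22],
    'stop_sequence': [1, 3, 4, 2, 5],
})

toy_schedule = toy_schedule.sort_values(by=['trip_id', 'stop_sequence'])

# Renumber the stops from 1 within each trip
toy_schedule['stop_sequence'] = toy_schedule.groupby('trip_id').cumcount() + 1

# Divide by the number of stops in the trip, so the last stop has a progress of 1
stops_per_trip = toy_schedule.groupby('trip_id')['stop_id'].transform('count')
toy_schedule['trip_progress'] = toy_schedule['stop_sequence'] / stops_per_trip
```

The minimum progress is therefore 1 divided by the number of stops of the longest trip, and the maximum is always 1.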

In [13]:
# Get distribution of trip progress
schedules_df['trip_progress'].describe()

count    6.629625e+06
mean     5.139220e-01
std      2.887289e-01
min      8.547009e-03
25%      2.631579e-01
50%      5.142857e-01
75%      7.647059e-01
max      1.000000e+00
Name: trip_progress, dtype: float64

In [14]:
# Merge schedules and stops
schedules_stops_df = pd.merge(left=schedules_df, right=stops_df, how='inner', left_on='stop_id', right_on='stop_code')

In [15]:
# Get coordinates of previous stop
schedules_stops_df = schedules_stops_df.sort_values(by=['trip_id', 'stop_sequence'])
schedules_stops_df['prev_lat'] = schedules_stops_df.groupby('trip_id')['stop_lat'].shift(1)
schedules_stops_df['prev_lon'] = schedules_stops_df.groupby('trip_id')['stop_lon'].shift(1)

In [16]:
# Make sure the null values are from first stops
prev_null_mask = (schedules_stops_df['prev_lat'].isna()) | (schedules_stops_df['prev_lon'].isna())
first_stop_mask = schedules_stops_df['stop_sequence'] == 1
assert prev_null_mask.sum() == first_stop_mask.sum()

In [17]:
# Get distance from previous stop
schedules_stops_df['stop_distance'] = schedules_stops_df.apply(
  lambda row: haversine((row['prev_lat'], row['prev_lon']), (row['stop_lat'], row['stop_lon']), unit=Unit.METERS),
  axis=1
)

In [None]:
# Replace null distances by zero 
schedules_stops_df['stop_distance'] = schedules_stops_df['stop_distance'].fillna(0)
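
Applying `haversine` row by row can be slow on several million rows. A possible alternative, sketched here under the assumption that a mean Earth radius of 6,371,008.8 m is acceptable, is to compute the same great-circle formula on whole columns with NumPy. The function name `haversine_vectorized` and the toy coordinates are hypothetical.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius, in meters

def haversine_vectorized(lat1, lon1, lat2, lon2):
    '''Great-circle distance in meters between coordinates given in degrees.'''
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

# Toy stops: the first stop of a trip has no previous stop
toy_stops = pd.DataFrame({
    'prev_lat': [np.nan, 45.0],
    'prev_lon': [np.nan, -73.0],
    'stop_lat': [45.0, 46.0],
    'stop_lon': [-73.0, -73.0],
})

toy_stops['stop_distance'] = haversine_vectorized(
    toy_stops['prev_lat'], toy_stops['prev_lon'],
    toy_stops['stop_lat'], toy_stops['stop_lon'],
).fillna(0)
```

Missing previous coordinates propagate as null distances, so the same `fillna(0)` step still applies.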

In [None]:
schedules_stops_df.columns

In [None]:
# Rename stop id and drop other stop id columns
schedules_stops_df = schedules_stops_df.rename(columns={'stop_id_x': 'stop_id'})
schedules_stops_df = schedules_stops_df.drop(['stop_id_y', 'stop_code'], axis=1)

### Realtime and Scheduled Trips

In [None]:
# Merge trip updates with schedule
stm_trips_df = pd.merge(left=trips_df, right=schedules_stops_df, how='inner', on=['trip_id', 'stop_id'])

In [None]:
# Convert start_date to datetime
stm_trips_df['start_date_dt'] = pd.to_datetime(stm_trips_df['start_date'], format='%Y%m%d')

In [None]:
def parse_gtfs_time(row) -> pd.Timestamp:
	'''
	Converts GTFS time string (e.g., '25:30:00') to datetime
	based on the arrival time.
	'''
	hours, minutes, seconds = map(int, row['arrival_time'].split(':'))
	total_seconds = hours * 3600 + minutes * 60 + seconds

	parsed_time = row['start_date_dt'] + timedelta(seconds=total_seconds)
	return parsed_time

In [None]:
# Convert planned arrival time to datetime
parsed_time = stm_trips_df.apply(parse_gtfs_time, axis=1)
sch_arrival_time_local = parsed_time.dt.tz_localize(LOCAL_TIMEZONE)
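
GTFS times can exceed 24 hours for trips that run past midnight of the service day. As a hedged alternative to the row-wise function above, `pd.to_timedelta` can parse such strings for a whole column at once. The toy values below are invented for illustration.

```python
import pandas as pd

# Toy merged trips; the second stop is served after midnight
toy_trips = pd.DataFrame({
    'start_date': [20250430, 20250430],
    'arrival_time': ['08:15:00', '25:30:00'],
})

# Service day at midnight
service_day = pd.to_datetime(toy_trips['start_date'].astype(str), format='%Y%m%d')

# Hour values of 24 or more roll over to the next day
offset = pd.to_timedelta(toy_trips['arrival_time'])

toy_trips['scheduled_arrival'] = service_day + offset
```

The result is still timezone-naive, so it would need the same `tz_localize` step as the row-wise version.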

In [None]:
# Convert real-time and scheduled arrival times to UTC datetime
stm_trips_df['realtime_arrival_time'] = pd.to_datetime(stm_trips_df['rt_arrival_time'], origin='unix', unit='ms', utc=True)
stm_trips_df['scheduled_arrival_time'] = sch_arrival_time_local.dt.tz_convert(timezone.utc)

In [None]:
# Calculate delay in seconds (real - scheduled)
stm_trips_df['delay'] = (stm_trips_df['realtime_arrival_time'] - stm_trips_df['scheduled_arrival_time']) / pd.Timedelta(seconds=1)

In [None]:
# Get distribution
stm_trips_df['delay'].describe()

There are some extreme delays (~2h15min early to ~5h15min late), which could greatly affect the performance of a predictive model.

In [None]:
# Get proportion of trips that are on time
on_time_mask = (stm_trips_df['delay'] >= -60) & (stm_trips_df['delay'] <= 180)
print(f'{on_time_mask.mean():.2%}')

According to [STM](https://www.stm.info/en/info/networks/bus-network-and-schedules-enlightened), a vehicle is considered on time if it arrives at the stop no more than 1 minute before or 3 minutes after the scheduled time.
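
The on-time window can also be expressed as a small labelling function. This is only a sketch: the function name `classify_delay` and the labels `early`, `on_time`, and `late` are assumptions, not part of the dataset.

```python
def classify_delay(delay_seconds: float, early_limit: float = -60, late_limit: float = 180) -> str:
    '''Label a delay in seconds using the 1 minute early / 3 minutes late window.'''
    if delay_seconds < early_limit:
        return 'early'
    if delay_seconds > late_limit:
        return 'late'
    return 'on_time'

# A vehicle arriving 90 seconds ahead of schedule is outside the window
label = classify_delay(-90)
```

Both limits are inclusive, which matches the `>=` and `<=` comparisons used in the mask.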

### Vehicle Positions

In [None]:
# Rename vehicle columns (position, status, bearing and speed)
positions_df = positions_df.rename(columns={
  'latitude': 'vehicle_lat',
  'longitude': 'vehicle_lon',
  'status': 'vehicle_status',
  'bearing': 'vehicle_bearing',
  'speed': 'vehicle_speed'
})

In [None]:
# Convert vehicle timestamp to datetime
positions_df['vehicle_dt'] = pd.to_datetime(positions_df['timestamp'] * 1000, origin='unix', unit='ms', utc=True)
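
Multiplying by 1000 and parsing with `unit='ms'` gives the same result as parsing the original values with `unit='s'`. The sketch below checks this on two hypothetical Unix timestamps in seconds.

```python
import pandas as pd

# Hypothetical Unix timestamps, in seconds
timestamps = pd.Series([1745971200, 1745974800])

# Approach used in this notebook: convert to milliseconds first
from_ms = pd.to_datetime(timestamps * 1000, origin='unix', unit='ms', utc=True)

# Equivalent parsing directly from seconds
from_s = pd.to_datetime(timestamps, unit='s', utc=True)

same_result = bool((from_ms == from_s).all())
```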

In [None]:
# Sort values
subset = ['trip_id', 'route_id', 'start_date', 'stop_sequence']
positions_df = positions_df.sort_values(by=subset)

In [None]:
# Get duplicates
duplicate_mask = positions_df.duplicated(subset=subset)
positions_df[duplicate_mask]

In [None]:
# Get proportion of duplicates
print(f'{duplicate_mask.mean():.2%}')

In [None]:
# Remove duplicates
positions_df = positions_df.drop_duplicates(subset=subset)

In [None]:
# Merge with other STM data
stm_df = pd.merge(left=stm_trips_df, right=positions_df, how='inner', on=['trip_id', 'route_id', 'start_date', 'stop_sequence'])

In [None]:
# Calculate distance between the vehicle and the previous stop
stm_df['vehicle_distance'] = stm_df.apply(
  lambda row: haversine((row['prev_lat'], row['prev_lon']), (row['vehicle_lat'], row['vehicle_lon']), unit=Unit.METERS),
  axis=1)

In [None]:
# Calculate relative distance
stm_df['vehicle_rel_distance'] = stm_df['vehicle_distance'] / stm_df['stop_distance']

In [None]:
# Get distribution of relative distance
stm_df['vehicle_rel_distance'].describe()

In [None]:
# Get null values
null_rel_dist_mask = stm_df['vehicle_rel_distance'].isna()
stm_df[null_rel_dist_mask]

In [None]:
stm_df[null_rel_dist_mask]['stop_sequence'].value_counts()

In [None]:
# Replace the null relative distance by 1, because it's the first stop
stm_df['vehicle_rel_distance'] = stm_df['vehicle_rel_distance'].fillna(1)

In [None]:
# Get trips where the relative distance is above 100%
above_one_mask = stm_df['vehicle_rel_distance'] > 1
stm_df.loc[above_one_mask, ['vehicle_status', 'vehicle_speed', 'vehicle_rel_distance']]

In [None]:
# Get proportion of trips with relative distance above 100%
print(f'{above_one_mask.mean():.2%}')

In [None]:
# Clip the relative distance to 100% if the vehicle status is 1 (stopped at)
stopped_mask = stm_df['vehicle_status'] == 1
stm_df.loc[above_one_mask & stopped_mask, 'vehicle_rel_distance'] = 1

In [None]:
# Impute the other values to 0.5 (between the previous and current stop)
stm_df.loc[above_one_mask & ~stopped_mask, 'vehicle_rel_distance'] = 0.5

In [None]:
# Fill in other null values with 0.5
stm_df['vehicle_rel_distance'] = stm_df['vehicle_rel_distance'].fillna(0.5)

In [None]:
# Set the vehicle speed to 0 for stopped vehicles whose relative distance was above 100%
stm_df.loc[above_one_mask & stopped_mask, 'vehicle_speed'] = 0

In [None]:
# Get new distribution of relative distance
stm_df['vehicle_rel_distance'].describe()
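
The imputation rules for the relative distance can be summarized in one expression. The following sketch uses `np.select` on invented rows; it is a compact restatement of the steps above under the same assumptions, not a replacement for them.

```python
import numpy as np
import pandas as pd

toy_positions = pd.DataFrame({
    'vehicle_rel_distance': [0.4, 1.8, 2.5, np.nan],
    'vehicle_status': [2, 1, 2, 1],  # 1 = stopped at, 2 = in transit to
})

rel_distance = toy_positions['vehicle_rel_distance']
stopped = toy_positions['vehicle_status'] == 1

conditions = [
    rel_distance.isna(),            # first stop: no previous stop
    (rel_distance > 1) & stopped,   # stopped at the stop: clip to 100%
    (rel_distance > 1) & ~stopped,  # in transit: assume the midpoint
]
choices = [1.0, 1.0, 0.5]

# Rows matching no condition keep their original value
toy_positions['vehicle_rel_distance'] = np.select(conditions, choices, default=rel_distance)
```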

### STM and Weather

In [None]:
# Convert time string to datetime
time_dt = pd.to_datetime(weather_df['time'], utc=True)

In [None]:
# Calculate dates for weather forecast
last_day_weather = time_dt.max()
start_date = last_day_weather + timedelta(days=1)
end_date = stm_df['realtime_arrival_time'].max()

In [None]:
# Fetch forecast weather
start_date_str = start_date.strftime('%Y-%m-%d')
end_date_str = end_date.strftime('%Y-%m-%d')

forecast_list = fetch_weather(start_date=start_date_str, end_date=end_date_str, forecast=True)
forecast_df = pd.DataFrame(forecast_list)

In [None]:
# Merge archive and forecast weather
weather_df = pd.concat([weather_df, forecast_df], ignore_index=True)

In [None]:
# Round arrival time to the nearest hour
rounded_arrival_dt = stm_df['realtime_arrival_time'].dt.round('h')

In [None]:
# Format time to match weather data
stm_df['time'] = rounded_arrival_dt.dt.strftime('%Y-%m-%dT%H:%M')

In [None]:
# Merge STM with weather
stm_weather_df = pd.merge(left=stm_df, right=weather_df, how='inner', on='time')
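
An inner merge silently drops arrivals that have no matching weather hour. One way to check for this, sketched here on invented values, is a left merge with `indicator=True`. The frame names `arrivals` and `weather` are hypothetical.

```python
import pandas as pd

arrivals = pd.DataFrame({
    'realtime_arrival_time': pd.to_datetime(
        ['2025-04-30 14:29:00', '2025-04-30 14:31:00', '2025-04-30 23:45:00'],
        utc=True,
    ),
})
weather = pd.DataFrame({
    'time': ['2025-04-30T14:00', '2025-04-30T15:00'],
    'temperature': [12.5, 13.1],
})

# Round to the nearest hour and format the key like the weather data
arrivals['time'] = (
    arrivals['realtime_arrival_time'].dt.round('h').dt.strftime('%Y-%m-%dT%H:%M')
)

# The _merge column flags the rows that found no weather record
merged = arrivals.merge(weather, how='left', on='time', indicator=True)
unmatched_count = int((merged['_merge'] == 'left_only').sum())
```

Here the last arrival rounds to midnight of the next day, which is not covered by the toy weather data, so it would be lost in an inner merge.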

### Traffic Data

In [None]:
# Get proportion of duplicates
duplicate_mask = traffic_df.duplicated()
print(f'{duplicate_mask.mean():.2%}')

In [None]:
# Remove duplicates
traffic_df = traffic_df.drop_duplicates(keep='last').reset_index(drop=True)

In [None]:
# Convert traffic start_time and end_time to datetime
traffic_df['start_time'] = pd.to_datetime(traffic_df['start_time'], utc=True)
traffic_df['end_time'] = pd.to_datetime(traffic_df['end_time'], utc=True)

In [None]:
# Sort by date
traffic_df = traffic_df.sort_values(by='start_time').reset_index(drop=True)

In [None]:
# Fill null end times with current time (assuming the incident is still ongoing)
traffic_df['end_time'] = traffic_df['end_time'].fillna(datetime.now(timezone.utc).replace(microsecond=0))
assert traffic_df['end_time'].isna().sum() == 0

In [None]:
# Build traffic cache (for every 15 min interval)
def build_traffic_cache(traffic_df:pd.DataFrame) -> dict:
	traffic_cache = {}
	quarter_hours = traffic_df['start_time'].dt.floor('15min')

	# Group the incidents by the quarter hour in which they started
	for (quarter_hour, group) in traffic_df.groupby(quarter_hours):
		traffic_cache[quarter_hour] = group.copy()

	return traffic_cache

Since many trip updates fall on the same day (often within the same hour), filtering the active traffic incidents for each trip individually would repeat the same work and take a long time on a large dataset. Traffic incidents typically remain stable over minutes or hours, so they are cached by 15-minute intervals.
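
Note that a cache keyed on the floored start time only returns incidents that began in the same 15-minute interval as the vehicle timestamp, so an incident that started earlier and is still active is not found. A possible variant, sketched below, stores each incident under every interval it overlaps. The function name `build_interval_cache` and the toy incident are hypothetical, and incidents with very long durations would produce many rows.

```python
import pandas as pd

def build_interval_cache(incidents: pd.DataFrame, freq: str = '15min') -> dict:
    '''Map each interval start to the incidents active during that interval.'''
    rows = incidents.copy()

    # List every interval start between the floored start and end times
    rows['interval'] = [
        list(pd.date_range(start.floor(freq), end.floor(freq), freq=freq))
        for start, end in zip(rows['start_time'], rows['end_time'])
    ]

    # One row per (incident, interval) pair
    exploded = rows.explode('interval')
    exploded['interval'] = pd.to_datetime(exploded['interval'], utc=True)

    return {
        interval: group.drop(columns='interval')
        for interval, group in exploded.groupby('interval')
    }

# Toy incident active from 10:05 to 10:40 UTC (made-up values)
toy_incidents = pd.DataFrame({
    'category': ['roadworks'],
    'start_time': pd.to_datetime(['2025-04-30 10:05:00'], utc=True),
    'end_time': pd.to_datetime(['2025-04-30 10:40:00'], utc=True),
})
toy_cache = build_interval_cache(toy_incidents)
```

With this layout, a lookup at 10:30 returns the incident even though it started at 10:05.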

In [None]:
def calculate_nearby_incidents(trip_update:pd.Series, traffic_cache:dict, max_distance:int=500) -> pd.Series:
	'''
	Returns the nearest active incident within max_distance (in meters)
	of the vehicle position, or default values if there is none.
	'''
	# Default values returned when no incident is found
	no_incident = pd.Series({
		'incident_nearby': 0,
		'nearest_incident_distance': None,
		'incident_category': None,
		'incident_delay': None,
		'incident_delay_magnitude': None
	})

	trip_datetime = trip_update['vehicle_dt']
	vehicle_coords = (trip_update['vehicle_lat'], trip_update['vehicle_lon'])
	trip_quarter_hour = trip_datetime.floor('15min')

	# Get cached incidents
	quarter_hour_incidents = traffic_cache.get(trip_quarter_hour)

	# Stop if there are no incidents for that quarter hour
	if quarter_hour_incidents is None or quarter_hour_incidents.empty:
		return no_incident

	# Filter for incidents that are active at the vehicle timestamp
	active_incidents = quarter_hour_incidents[
		(quarter_hour_incidents['start_time'] <= trip_datetime) &
		(quarter_hour_incidents['end_time'] >= trip_datetime)
	].copy()

	if active_incidents.empty:
		return no_incident

	# Calculate the distance between the vehicle and each incident
	active_incidents['distance'] = active_incidents.apply(
		lambda row: haversine(vehicle_coords, (row['latitude'], row['longitude']), unit=Unit.METERS),
		axis=1
	)

	# Filter nearby
	nearby_incidents = active_incidents[active_incidents['distance'] <= max_distance]

	if nearby_incidents.empty:
		return no_incident

	# Keep the closest incident
	nearest = nearby_incidents.loc[nearby_incidents['distance'].idxmin()]
	return pd.Series({
		'incident_nearby': 1,
		'nearest_incident_distance': nearest['distance'],
		'incident_category': nearest['category'],
		'incident_delay': nearest['delay'],
		'incident_delay_magnitude': nearest['magnitude_of_delay']
	})

In [None]:
# Get traffic columns (get incidents within 1.5 km)
traffic_cache = build_traffic_cache(traffic_df)
traffic_cols = stm_weather_df.apply(lambda row: calculate_nearby_incidents(row, traffic_cache, 1500), axis=1)

Traffic in the surrounding area (within 1 to 2 km) can affect delays even when the incident is not directly at the vehicle's position. This is why incidents within 1.5 km are considered.

In [None]:
# Merge the traffic
df = pd.concat([stm_weather_df, traffic_cols], axis=1)

## Export Data

In [None]:
# Get columns with two values
two_values = df.loc[:, df.nunique() == 2]
two_values.columns

In [None]:
print(df['wheelchair_boarding'].value_counts())
print(df['vehicle_status'].value_counts())
print(df['incident_nearby'].value_counts())

In [None]:
# Convert columns with 2 unique values to binary integers (0/1)
df['wheelchair_boarding'] = (df['wheelchair_boarding'] == 1).astype('int64')
df['vehicle_in_transit'] = (df['vehicle_status'] == 2).astype('int64')
df['incident_nearby'] = df['incident_nearby'].astype('int64')

In [None]:
# Remove columns with constant values or with more than 50% missing values
df = df.loc[:, (df.nunique() > 1) & (df.isna().mean() < 0.5)]
df.columns
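
The column filter above combines two conditions: more than one distinct value, and less than 50% missing values. A minimal sketch on an invented frame shows which columns survive; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

toy_features = pd.DataFrame({
    'constant': [1, 1, 1, 1, 1],
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0, 2.0],
    'useful': [1, 2, 3, 4, 5],
})

# nunique() ignores nulls, and isna().mean() is the share of missing values
keep_mask = (toy_features.nunique() > 1) & (toy_features.isna().mean() < 0.5)
filtered_features = toy_features.loc[:, keep_mask]
```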

**Columns to drop**

`current_time_x`, `current_time_y`: It's only the time when the data was collected.<br>
`start_date`, `rt_arrival_time`, `rt_departure_time`, `arrival_time`, `departure_time`, `start_date_dt`, `start_time`: The real-time and scheduled arrival times have already been calculated.<br>
`stop_sequence`: The trip progress has been calculated.<br>
`stop_name`: There is already the stop id.<br>
`stop_url`: A URL is not useful for data analysis.<br>
`prev_lat`, `prev_lon`, `stop_distance`, `vehicle_lat`, `vehicle_lon`, `vehicle_distance`: The relative vehicle distance from the previous stop, along with the incidents near the vehicle position, have been calculated.<br>
`vehicle_status`: The column vehicle_in_transit has been added.<br>
`timestamp`, `vehicle_dt`: The incidents at the time when the vehicle was nearby have been added.<br>
`time`: It was only used to merge the weather data.


In [None]:
# Keep relevant columns
df = df[[
	'trip_id',
	'vehicle_id',
	'vehicle_in_transit', # engineered
	'vehicle_rel_distance', # engineered
	'vehicle_bearing',
	'vehicle_speed',
	'occupancy_status',
	'route_id',
	'stop_id',
	'stop_lat',
	'stop_lon',
	'trip_progress', # engineered
	'wheelchair_boarding', # engineered
	'realtime_arrival_time', # parsed
	'scheduled_arrival_time', # parsed
	'delay', # engineered
	'temperature',
	'precipitation',
	'windspeed',
	'weathercode',
	'incident_nearby', # engineered
]]

In [None]:
df.info()

In [None]:
# Export data to CSV
df.to_csv('../data/stm_weather_traffic_merged.csv', index=False)

## End