# STM Transit Delay Data Preparation

## Overview

This notebook cleans and merges data collected from [STM](https://www.stm.info/en/about/developers) and [Open-Meteo](https://open-meteo.com/en/docs) and prepares it for data analysis and/or preprocessing.

## Data Description

### STM Schedule

`trip_id`: Unique identifier for the transit trip.<br>
`arrival_time`, `departure_time`: Scheduled arrival and departure time.<br>
`stop_id`: Unique identifier of a stop.<br>
`stop_sequence`: Sequence of a stop, for ordering.

### STM Stops

`stop_id`: Unique identifier of a stop.<br>
`stop_code`: Bus stop or metro station number.<br>
`stop_name`: Bus stop or metro station name.<br>
`stop_lat`, `stop_lon`: Stop coordinates.<br>
`stop_url`: Stop web page.<br>
`location_type`: Stop type.<br>
`parent_station`: Parent station (metro station with multiple exits).<br>
`wheelchair_boarding`: Indicates whether the stop is accessible to people in wheelchairs, 0 meaning "no information", 1 being "accessible" and 2 being "not accessible".

### STM Trips

`route_id`: Unique identifier for the bus or metro line.<br>
`service_id`: Identifies a set of dates when service is available for one or more routes.<br>
`trip_id`: Unique identifier for the transit trip.<br>
`trip_headsign`: Direction of the trip (Nord, Sud, Ouest or Est in the raw data).<br>
`direction_id`: Boolean value for the direction.<br>
`shape_id`: Identifies a geospatial shape describing the vehicle travel path for a trip.<br>
`wheelchair_accessible`: Indicates wheelchair accessibility, 0 meaning "no information", 1 being "accessible" and 2 being "not accessible".<br>
`note_fr`, `note_en`: Additional comment in French and English.

### STM Real-Time Trip Updates

`current_time`: Timestamp when the data was fetched from the GTFS, in milliseconds.<br>
`trip_id`: Unique identifier for the transit trip.<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`start_date`: Start date of the transit trip.<br>
`stop_id`: Unique identifier of a stop.<br>
`arrival_time`, `departure_time`: Realtime arrival and departure time, in seconds.<br>
`schedule_relationship`: State of the trip, 0 meaning "scheduled", 1 meaning "skipped" and 2 meaning "no data".

### STM Route Types

`route_id`: Unique identifier for a bus or metro line.<br>
`route_type`: Type of bus line (e.g. Night).

### Open-Meteo Weather Archive

`time`: Date and hour of the weather observation.<br>
`temperature_2m`: Air temperature at 2 meters above ground, in Celsius.<br>
`relative_humidity_2m`: Relative humidity at 2 meters above ground, in percentage.<br>
`precipitation`: Total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters.<br>
`pressure`: Atmospheric air pressure reduced to mean sea level (msl), in hPa.<br>
`cloud_cover`: Total cloud cover as an area fraction.<br>
`windspeed_10m`: Wind speed at 10 meters above ground, in kilometers per hour.<br>
`wind_direction_10m`: Wind direction at 10 meters above ground.<br>

## Imports

In [None]:
from datetime import datetime, timedelta, timezone
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sys

In [None]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import fetch_weather, LOCAL_TIMEZONE, SCHEDULE_RELATIONSHIP

In [None]:
# Import data
schedules_df = pd.read_csv('../data/download/stop_times_2025-04-30.txt')
stops_df = pd.read_csv('../data/download/stops_2025-04-30.txt')
trips_df = pd.read_csv('../data/download/trips_2025-04-30.txt')
trip_updates_df = pd.read_csv('../data/api/fetched_stm_trip_updates.csv', low_memory=False)
routes_df = pd.read_csv('../data/route_types.csv')
weather_df = pd.read_csv('../data/api/fetched_historical_weather.csv')

## Merge Data

### Schedules and stops

In [None]:
# Sort values by stop sequence
schedules_df = schedules_df.sort_values(by=['trip_id', 'stop_sequence'])

In [None]:
# Add trip progress (vehicles further along the trip are more likely to be delayed)
total_stops = schedules_df.groupby('trip_id')['stop_id'].transform('count')
schedules_df['trip_progress'] = schedules_df['stop_sequence'] / total_stops
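
To make the trip progress calculation concrete, here is a minimal sketch on hypothetical toy data (the trip and stop identifiers are invented). It assumes `stop_sequence` starts at 1 and is consecutive within a trip; GTFS only requires the values to increase along the trip, so it may be worth checking this on the real feed.

```python
import pandas as pd

# Toy schedule: trip "A" has 4 stops, trip "B" has 2 (hypothetical data)
toy_schedule = pd.DataFrame({
    'trip_id': ['A', 'A', 'A', 'A', 'B', 'B'],
    'stop_id': [10, 11, 12, 13, 20, 21],
    'stop_sequence': [1, 2, 3, 4, 1, 2],
})

# transform('count') broadcasts each trip's stop count back to every row
stops_per_trip = toy_schedule.groupby('trip_id')['stop_id'].transform('count')
toy_schedule['trip_progress'] = toy_schedule['stop_sequence'] / stops_per_trip
```

With consecutive sequences, the last stop of every trip gets a progress of exactly 1.0, whatever the trip length.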

In [None]:
# Get distribution of trip progress
schedules_df['trip_progress'].describe()

In [None]:
# Merge schedules and stops
schedules_stops_df = pd.merge(left=schedules_df, right=stops_df, how='inner', left_on='stop_id', right_on='stop_code') \
	.rename(columns={'stop_id_x': 'stop_id'}) \
	.drop(['stop_id_y', 'stop_code', 'stop_url'], axis=1)

In [None]:
# Get coordinates of previous stop
schedules_stops_df = schedules_stops_df.sort_values(by=['trip_id', 'stop_sequence'])
schedules_stops_df['prev_lat'] = schedules_stops_df.groupby('trip_id')['stop_lat'].shift(1)
schedules_stops_df['prev_lon'] = schedules_stops_df.groupby('trip_id')['stop_lon'].shift(1)

In [None]:
# Make sure scheduled arrival time has no null values
assert schedules_stops_df['arrival_time'].isna().sum() == 0

In [None]:
# Get scheduled arrival time of previous stop
schedules_stops_df['prev_time'] = schedules_stops_df.groupby('trip_id')['arrival_time'].shift(1)

In [None]:
# Make sure the null coordinates are from first stops
prev_null_mask = (schedules_stops_df['prev_lat'].isna()) | (schedules_stops_df['prev_lon'].isna())
first_stop_mask = schedules_stops_df['stop_sequence'] == 1
assert prev_null_mask.sum() == first_stop_mask.sum()
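
The previous-stop lookup and its sanity check can be illustrated on hypothetical toy data. The sketch also compares the two masks row by row, which is a slightly stricter check than comparing their counts, since equal counts could in principle hide mismatched rows.

```python
import pandas as pd

# Hypothetical stops for two trips
toy_stops = pd.DataFrame({
    'trip_id': ['A', 'A', 'A', 'B', 'B'],
    'stop_sequence': [1, 2, 3, 1, 2],
    'stop_lat': [45.50, 45.51, 45.52, 45.60, 45.61],
})

# shift(1) within each trip pulls the previous row's value; first stops get NaN
toy_stops = toy_stops.sort_values(['trip_id', 'stop_sequence'])
toy_stops['prev_lat'] = toy_stops.groupby('trip_id')['stop_lat'].shift(1)

# Row-by-row comparison of the null mask and the first-stop mask
toy_null_mask = toy_stops['prev_lat'].isna()
toy_first_stop_mask = toy_stops['stop_sequence'] == 1
masks_match = bool((toy_null_mask == toy_first_stop_mask).all())
```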

In [None]:
def parse_gtfs_time(df:pd.DataFrame, date_column:str, time_column:str, milliseconds:bool=True) -> pd.Series:
	'''
	Converts a GTFS time string (e.g., '25:30:00') to a localized datetime
	by adding it as an offset to the date in `date_column`.
	GTFS hours can exceed 23 for trips that run past midnight.

	Despite its name, `milliseconds` selects the resolution of `date_column`:
	True for datetime64[ns], False for datetime64[us].
	'''
	# Split 'HH:MM:SS' into numeric parts and convert to a number of seconds
	time_columns = ['hours', 'minutes', 'seconds']
	split_cols = df[time_column].str.split(':', expand=True).apply(pd.to_numeric)
	split_cols.columns = time_columns
	seconds_delta = (split_cols['hours'] * 3600) + (split_cols['minutes'] * 60) + split_cols['seconds']

	# Convert the date to seconds since the Unix epoch
	if milliseconds:
		start_seconds = df[date_column].astype('int64') / 10**9 # nanoseconds to seconds
	else:
		start_seconds = df[date_column].astype('int64') / 10**6 # microseconds to seconds

	# Add the time offset to the date
	total_seconds = start_seconds + seconds_delta

	# Convert to datetime
	parsed_time = pd.to_datetime(total_seconds, origin='unix', unit='s').dt.tz_localize(LOCAL_TIMEZONE)

	return parsed_time

In [None]:
# Add column with current date
schedules_stops_df['today'] = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)

In [None]:
# Parse arrival time
schedules_stops_df['parsed_time'] = parse_gtfs_time(schedules_stops_df, 'today', 'arrival_time', False)

In [None]:
# Parse previous arrival time
schedules_stops_df['parsed_prev_time'] = parse_gtfs_time(schedules_stops_df, 'today', 'prev_time', False)
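
The idea behind the parsing step is that a GTFS time is an offset from the start of the service day, so hours above 23 roll over to the next calendar day. A minimal sketch with the standard library, using a hypothetical helper `gtfs_time_to_datetime` and naive datetimes (time zone handling is deliberately left out):

```python
from datetime import datetime, timedelta

def gtfs_time_to_datetime(service_date: datetime, gtfs_time: str) -> datetime:
    """Hypothetical helper: add an 'HH:MM:SS' offset (hours may exceed 23) to a service date."""
    hours, minutes, seconds = (int(part) for part in gtfs_time.split(':'))
    return service_date + timedelta(hours=hours, minutes=minutes, seconds=seconds)

# A stop scheduled at 25:30:00 on the April 30 service day falls on May 1 at 01:30
service_day = datetime(2025, 4, 30)
late_arrival = gtfs_time_to_datetime(service_day, '25:30:00')
```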

In [None]:
# Calculate expected trip duration
schedules_stops_df['trip_start'] = schedules_stops_df.groupby('trip_id')['parsed_time'].transform('min')
schedules_stops_df['trip_end'] = schedules_stops_df.groupby('trip_id')['parsed_time'].transform('max')
schedules_stops_df['exp_trip_duration'] = (schedules_stops_df['trip_end'] - schedules_stops_df['trip_start']) / pd.Timedelta(seconds=1)

In [None]:
# Get distribution
schedules_stops_df['exp_trip_duration'].describe()

In [None]:
# Calculate expected travel time between previous and current stop (difference of scheduled arrivals, in seconds)
schedules_stops_df['exp_delay_prev_stop'] = (schedules_stops_df['parsed_time'] - schedules_stops_df['parsed_prev_time']) / pd.Timedelta(seconds=1)

In [None]:
# Get distribution
schedules_stops_df['exp_delay_prev_stop'].describe()

In [None]:
# Assert that the null values are from first stops
assert (schedules_stops_df['stop_sequence'] == 1).sum() == schedules_stops_df['exp_delay_prev_stop'].isna().sum()

In [None]:
# Fill null values with 0 (first stop)
schedules_stops_df['exp_delay_prev_stop'] = schedules_stops_df['exp_delay_prev_stop'].fillna(0)

In [None]:
# Create GeoDataFrames for previous and current stop
sch_gdf1 = gpd.GeoDataFrame(
  schedules_stops_df[['prev_lon', 'prev_lat']],
  geometry=gpd.points_from_xy(schedules_stops_df['prev_lon'], schedules_stops_df['prev_lat']),
  crs='EPSG:4326' # WGS84 (latitude/longitude in degrees)
).to_crs(epsg=3857) # Project to Web Mercator (coordinates in meters)

sch_gdf2 = gpd.GeoDataFrame(
  schedules_stops_df[['stop_lon', 'stop_lat']],
  geometry=gpd.points_from_xy(schedules_stops_df['stop_lon'], schedules_stops_df['stop_lat']),
  crs='EPSG:4326'
).to_crs(epsg=3857)

In [None]:
# Calculate distance from previous stop
schedules_stops_df['stop_distance'] = sch_gdf1.distance(sch_gdf2)
schedules_stops_df['stop_distance'].describe()

In [None]:
# Replace null distances by zero (first stop of the trip)
schedules_stops_df['stop_distance'] = schedules_stops_df['stop_distance'].fillna(0)
assert schedules_stops_df['stop_distance'].isna().sum() == 0
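
One caveat about the distances above: planar distances measured in EPSG:3857 are stretched by roughly 1/cos(latitude), which is about 1.43 at Montreal's latitude. The sketch below is not the notebook's method. It compares a great-circle (haversine) distance with the Web Mercator distance for two hypothetical points, using only the standard library, to show the size of the effect.

```python
import math

EARTH_RADIUS_M = 6_371_008.8     # mean Earth radius used for the haversine formula
MERCATOR_RADIUS_M = 6_378_137.0  # sphere radius used by EPSG:3857

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def web_mercator_xy(lat, lon):
    """Project a WGS84 point to spherical Web Mercator (EPSG:3857) coordinates."""
    x = MERCATOR_RADIUS_M * math.radians(lon)
    y = MERCATOR_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y

# Two hypothetical points about 1.1 km apart near downtown Montreal
point_a = (45.50, -73.57)
point_b = (45.51, -73.57)

true_distance = haversine_m(*point_a, *point_b)
xa, ya = web_mercator_xy(*point_a)
xb, yb = web_mercator_xy(*point_b)
projected_distance = math.hypot(xb - xa, yb - ya)

# Ratio of projected to great-circle distance, close to 1 / cos(45.5 degrees)
inflation = projected_distance / true_distance
```

Because the factor is nearly constant over a city-sized area, it should not change the ranking of stop distances, but it matters if `stop_distance` is read as meters on the ground.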

In [None]:
# Get stop with largest distance
schedules_stops_df.loc[schedules_stops_df['stop_distance'].idxmax()]

The large distance makes sense because the expected time between the previous stop and this one is 21 minutes.

In [None]:
schedules_stops_df.columns

In [None]:
# Drop unneeded columns
schedules_stops_df = schedules_stops_df.drop([
  'prev_lat',
  'prev_lon',
  'prev_time',
  'today',
  'parsed_prev_time',
  'trip_start', 
  'trip_end',
  ], axis=1)

### Trips

In [None]:
# Keep relevant columns
trips_df = trips_df[['trip_id', 'route_id', 'trip_headsign', 'wheelchair_accessible']]

In [None]:
# Rename trip_headsign
trips_df = trips_df.rename(columns={'trip_headsign': 'route_direction'})

In [None]:
# Translate directions
# na=False treats missing headsigns as non-matching
condition_list = [
	trips_df['route_direction'].str.contains('Nord', na=False),
	trips_df['route_direction'].str.contains('Sud', na=False),
	trips_df['route_direction'].str.contains('Ouest', na=False),
	trips_df['route_direction'].str.contains('Est', na=False),
]
label_list = ['North', 'South', 'West', 'East']

trips_df['route_direction'] = np.select(condition_list, label_list, default='Metro')
trips_df['route_direction'].value_counts()
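
The translation relies on `np.select` picking the first matching condition and falling back to the default otherwise. A minimal sketch on hypothetical headsigns (the real feed may use other spellings), where anything that names no compass direction, including a missing value, ends up labelled `'Metro'`:

```python
import numpy as np
import pandas as pd

# Hypothetical headsigns, including a station name and a missing value
headsigns = pd.Series(['Nord', 'Sud', 'Ouest', 'Est', 'Station Angrignon', None])

# str.contains is case-sensitive, so 'Ouest' does not match the pattern 'Est'
conditions = [
    headsigns.str.contains('Nord', na=False),
    headsigns.str.contains('Sud', na=False),
    headsigns.str.contains('Ouest', na=False),
    headsigns.str.contains('Est', na=False),
]
labels = ['North', 'South', 'West', 'East']

directions = np.select(conditions, labels, default='Metro')
```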

In [None]:
trips_df.info()

In [None]:
schedules_stops_df.columns

In [None]:
# Merge with schedules and stops
scheduled_trips_df = pd.merge(left=schedules_stops_df, right=trips_df, how='inner', on='trip_id')

In [None]:
# TODO: calculate frequency of arrival per route per stop (keep parsed_time)

In [None]:
scheduled_trips_df.isna().sum()

In [None]:
# Get rows where wheelchair_boarding and wheelchair_accessible are different
scheduled_trips_df[scheduled_trips_df['wheelchair_boarding'] != scheduled_trips_df['wheelchair_accessible']]

In [None]:
# Keep wheelchair_boarding as it's stop specific
scheduled_trips_df = scheduled_trips_df.drop('wheelchair_accessible', axis=1)

### Realtime and Scheduled Trips

In [None]:
# Convert route_id to integer
trip_updates_df['route_id'] = trip_updates_df['route_id'].str.extract(r'(\d+)')
trip_updates_df['route_id'] = trip_updates_df['route_id'].astype('int64')

In [None]:
# Get proportion of duplicates
subset = trip_updates_df.drop('current_time', axis=1).columns
duplicate_mask = trip_updates_df.duplicated(subset=subset)
print(f'{duplicate_mask.mean():.2%}')

In [None]:
# Remove duplicates
trip_updates_df = trip_updates_df.drop_duplicates(subset=subset, keep='last').reset_index(drop=True)
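
To see what the deduplication keeps, consider a hypothetical stop polled three times. Two polls return the same prediction and differ only in `current_time`, so one of them is dropped. The third carries a revised prediction and survives.

```python
import pandas as pd

# Three fetches of the same trip and stop (hypothetical values)
toy_updates = pd.DataFrame({
    'current_time': [1000, 2000, 3000],
    'trip_id': ['A', 'A', 'A'],
    'stop_id': [10, 10, 10],
    'arrival_time': [500, 500, 530],
})

# Compare rows on everything except the fetch timestamp
key_columns = [col for col in toy_updates.columns if col != 'current_time']
deduplicated = toy_updates.drop_duplicates(subset=key_columns, keep='last').reset_index(drop=True)
```

Note that a trip and stop pair can still appear more than once after this step whenever the predicted time changed between fetches.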

In [None]:
# Rename arrival and departure time
trip_updates_df = trip_updates_df.rename(columns={'arrival_time': 'rt_arrival_time','departure_time': 'rt_departure_time'})

In [None]:
# Merge trip updates with schedule
merged_stm_df = pd.merge(left=trip_updates_df, right=scheduled_trips_df, how='inner', on=['trip_id', 'route_id', 'stop_id'])
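
An inner merge silently drops real-time rows that have no match in the schedule. One possible way to measure how many rows are lost, sketched on hypothetical toy data, is a left merge with `indicator=True`:

```python
import pandas as pd

# Hypothetical real-time updates and schedule; trip "C" is missing from the schedule
toy_rt = pd.DataFrame({'trip_id': ['A', 'A', 'C'], 'stop_id': [10, 11, 30]})
toy_sch = pd.DataFrame({
    'trip_id': ['A', 'A', 'B'],
    'stop_id': [10, 11, 20],
    'arrival_time': ['08:00:00', '08:05:00', '09:00:00'],
})

# The _merge column flags rows found only in the left table
audit = pd.merge(toy_rt, toy_sch, how='left', on=['trip_id', 'stop_id'], indicator=True)
unmatched_share = (audit['_merge'] == 'left_only').mean()
```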

In [None]:
merged_stm_df.columns

#### Calculate Delay

In [None]:
# Convert start_date to datetime
merged_stm_df['start_date_dt'] = pd.to_datetime(merged_stm_df['start_date'], format='%Y%m%d')

In [None]:
# Parse GTFS scheduled arrival and departure times
parsed_arrival_time = parse_gtfs_time(merged_stm_df, 'start_date_dt', 'arrival_time')
parsed_departure_time = parse_gtfs_time(merged_stm_df, 'start_date_dt', 'departure_time')

In [None]:
# Convert scheduled arrival and departure time to UTC datetime
merged_stm_df['sch_arrival_time'] = parsed_arrival_time.dt.tz_convert(timezone.utc)
merged_stm_df['sch_departure_time'] = parsed_departure_time.dt.tz_convert(timezone.utc)

In [None]:
# Get rows where scheduled arrival and departure time are different
merged_stm_df[merged_stm_df['sch_arrival_time'] != merged_stm_df['sch_departure_time']]

In [None]:
# Replace 0 timestamps with NaN
merged_stm_df['rt_arrival_time'] = merged_stm_df['rt_arrival_time'].replace({0: np.nan})
merged_stm_df['rt_departure_time'] = merged_stm_df['rt_departure_time'].replace({0: np.nan})

In [None]:
# Convert realtime arrival and departure time to UTC datetime
merged_stm_df['rt_arrival_time'] = pd.to_datetime(merged_stm_df['rt_arrival_time'], origin='unix', unit='s', utc=True)
merged_stm_df['rt_departure_time'] = pd.to_datetime(merged_stm_df['rt_departure_time'], origin='unix', unit='s', utc=True)

In [None]:
# Calculate delay (realtime - scheduled)
# Start with arrival time, if null, calculate with departure time
merged_stm_df['delay'] = (merged_stm_df['rt_arrival_time'] - merged_stm_df['sch_arrival_time']) / pd.Timedelta(seconds=1)
merged_stm_df['delay'] = merged_stm_df['delay'].fillna(((merged_stm_df['rt_departure_time'] - merged_stm_df['sch_departure_time']) / pd.Timedelta(seconds=1)))
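
The arrival-first, departure-fallback logic can be checked on two hypothetical rows: one with both real-time values, and one where the real-time arrival is missing.

```python
import pandas as pd

scheduled = pd.to_datetime(['2025-04-30 12:00:00', '2025-04-30 12:10:00'], utc=True)

# Hypothetical stop events; the second row has no real-time arrival
toy_events = pd.DataFrame({
    'sch_arrival_time': scheduled,
    'sch_departure_time': scheduled,
    'rt_arrival_time': pd.to_datetime(['2025-04-30 12:01:30', None], utc=True),
    'rt_departure_time': pd.to_datetime(['2025-04-30 12:02:00', '2025-04-30 12:09:00'], utc=True),
})

# Dividing a timedelta by one second yields a float number of seconds
arrival_delay = (toy_events['rt_arrival_time'] - toy_events['sch_arrival_time']) / pd.Timedelta(seconds=1)
departure_delay = (toy_events['rt_departure_time'] - toy_events['sch_departure_time']) / pd.Timedelta(seconds=1)

# Use the arrival delay where available, otherwise the departure delay
toy_events['delay'] = arrival_delay.fillna(departure_delay)
```

The first row is 90 seconds late by arrival time. The second is 60 seconds early, taken from the departure time.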

In [None]:
# Get distribution
merged_stm_df['delay'].describe()

#### Handle Outliers

In [None]:
# Plot histogram
plt.figure(figsize=(10, 5))
sns.histplot(merged_stm_df['delay'], bins=50, kde=True)
plt.title('Distribution of Delay Times')
plt.xlabel('Delay Time (seconds)')
plt.ylabel('Frequency')
plt.savefig('../images/delay_histogram.png', bbox_inches='tight')
plt.show()

In [None]:
# Plot boxplot
plt.figure(figsize=(10, 5))
sns.boxplot(x=merged_stm_df['delay'])
plt.title('Boxplot of Delay Times (in seconds)')
plt.savefig('../images/delay_boxplot.png', bbox_inches='tight')
plt.show()

The distribution of delay times is highly skewed: most values are concentrated near 0, but the tails extend far in both directions. Extreme outliers reach about 55,000 seconds (more than 15 hours), and negative values go beyond -10,000 seconds (almost 3 hours early). Such values are unrealistic for transit delays and most likely reflect data entry errors, sensor glitches or edge cases (cancelled trips, detours, etc.).

In [None]:
print(merged_stm_df['delay'].max() / merged_stm_df['delay'].min())

In [None]:
# Filter outliers, based on expected trip duration and "skewness" (positive delay is about 4x negative delay)
# If a delay is longer than the expected trip duration, it's most likely a cancelled trip.
outlier_mask = (merged_stm_df['delay'] <= merged_stm_df['exp_trip_duration'] * -0.25) | (merged_stm_df['delay'] >= merged_stm_df['exp_trip_duration'])
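
The filter scales with each trip's expected duration instead of using a fixed cutoff. A small sketch on hypothetical values shows which rows the rule flags:

```python
import pandas as pd

# Hypothetical delays and expected trip durations, in seconds
toy_delays = pd.DataFrame({
    'delay': [60, -100, -400, 1800, 2500],
    'exp_trip_duration': [1200, 1200, 1200, 1800, 2400],
})

# Early by at least a quarter of the trip duration, or late by at least the full duration
too_early = toy_delays['delay'] <= toy_delays['exp_trip_duration'] * -0.25
too_late = toy_delays['delay'] >= toy_delays['exp_trip_duration']
toy_outlier_mask = too_early | too_late
```

For a 1,200-second trip, the accepted range is therefore between -300 and 1,200 seconds, exclusive at both ends.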

In [None]:
# Inspect outliers
outliers_df = merged_stm_df[outlier_mask]
outliers_df[['trip_id', 'route_id', 'stop_name', 'route_direction', 'trip_progress', 'sch_arrival_time', 'delay']].sort_values('delay', ascending=False)

In [None]:
# Calculate proportion
print(f'{outlier_mask.mean():.2%}')

In [None]:
# Remove outliers
merged_stm_df = merged_stm_df[~outlier_mask]

In [None]:
# Replot histogram
plt.figure(figsize=(10, 5))
sns.histplot(merged_stm_df['delay'], bins=50, kde=True)
plt.title('Distribution of Delay Times (After Filtering)')
plt.xlabel('Delay Time (seconds)')
plt.ylabel('Frequency')
plt.savefig('../images/delay_histogram_filtered.png', bbox_inches='tight')
plt.show()

In [None]:
# Replot boxplot
plt.figure(figsize=(10, 5))
sns.boxplot(x=merged_stm_df['delay'])
plt.title('Boxplot of Delay Times (After Filtering)')
plt.savefig('../images/delay_boxplot_filtered.png', bbox_inches='tight')
plt.show()

In [None]:
# Get null delays count
print(merged_stm_df['delay'].isna().sum())

In [None]:
# Replace the null delays with the overall average delay
merged_stm_df['delay'] = merged_stm_df['delay'].fillna(merged_stm_df['delay'].mean())
assert merged_stm_df['delay'].isna().sum() == 0

In [None]:
# Get new distribution
merged_stm_df['delay'].describe()

In [None]:
merged_stm_df.columns

In [None]:
# Remove unneeded columns
merged_stm_df = merged_stm_df.drop(['current_time', 'start_date', 'arrival_time', 'departure_time', 'start_date_dt'], axis=1)

### Route Types

In [None]:
stm_df = pd.merge(left=merged_stm_df, right=routes_df, how='inner', on='route_id')

In [None]:
stm_df.columns

### STM and Weather

In [None]:
weather_df.info()

In [None]:
# Convert time string to datetime
time_dt = pd.to_datetime(weather_df['time'], utc=True)

In [None]:
# Round arrival time to the nearest hour
rounded_arrival_dt = stm_df['sch_arrival_time'].dt.round('h')

In [None]:
# Format time to match weather data
stm_df['time'] = rounded_arrival_dt.dt.strftime('%Y-%m-%dT%H:%M')

In [None]:
# Merge STM with weather
df = pd.merge(left=stm_df, right=weather_df, how='inner', on='time').drop('time', axis=1)
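
The join key is a string built from the arrival time rounded to the nearest hour. The sketch below uses hypothetical data and assumes the weather timestamps are expressed in UTC, like `sch_arrival_time`. If the archive was fetched in local time, the keys would be offset by a few hours.

```python
import pandas as pd

# Hypothetical arrivals just before and just after the half hour
toy_stm = pd.DataFrame({
    'sch_arrival_time': pd.to_datetime(['2025-04-30 12:29:00', '2025-04-30 12:31:00'], utc=True),
})

# Hypothetical hourly weather rows, keyed by an ISO-like string
toy_weather = pd.DataFrame({
    'time': ['2025-04-30T12:00', '2025-04-30T13:00'],
    'temperature_2m': [10.5, 11.0],
})

# 12:29 rounds down to 12:00 and 12:31 rounds up to 13:00
toy_stm['time'] = toy_stm['sch_arrival_time'].dt.round('h').dt.strftime('%Y-%m-%dT%H:%M')
toy_merged = toy_stm.merge(toy_weather, how='inner', on='time')
```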

## Clean Data

### Drop Columns

In [None]:
# Remove columns with constant values or with more than 50% missing values
df = df.loc[:, (df.nunique() > 1) & (df.isna().mean() < 0.5)]
df.columns
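
The column filter combines two boolean Series indexed by column name. A sketch with hypothetical columns shows each rule at work: `nunique()` ignores missing values, and the missing-value share must be strictly below 50%.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: one useful, one constant, one 60% missing, one 40% missing
toy_features = pd.DataFrame({
    'delay': [10.0, 20.0, 30.0, 40.0, 50.0],
    'constant': [1, 1, 1, 1, 1],
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0, 2.0],
    'half_ok': [np.nan, np.nan, 1.0, 2.0, 3.0],
})

# Keep columns with more than one distinct value and less than 50% missing values
keep_mask = (toy_features.nunique() > 1) & (toy_features.isna().mean() < 0.5)
kept_columns = toy_features.loc[:, keep_mask].columns.tolist()
```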

### Convert columns

In [None]:
# Get columns with two values
two_values = df.loc[:, df.nunique() == 2]
for column in two_values.columns:
  print(df[column].value_counts())

In [None]:
# Convert wheelchair_boarding to a binary indicator (1 = accessible, 0 = otherwise)
df['wheelchair_boarding'] = (df['wheelchair_boarding'] == 1).astype('int64')

### Convert schedule_relationship to Categories

In [None]:
def convert_to_categories(df:pd.DataFrame, column:str, map_dict:dict) -> pd.Series:
	'''
	Replaces the numeric codes in `column` with their labels from `map_dict`.
	Note that the column is also overwritten in place on `df`.
	'''
	codes = df[column].sort_values().unique()
	condition_list = []
	label_list = []

	for code in codes:
		condition_list.append(df[column] == code)
		label_list.append(map_dict[code])

	df[column] = np.select(condition_list, label_list, default='Unknown')
	return df[column]

In [None]:
df['schedule_relationship'] = convert_to_categories(df, 'schedule_relationship', SCHEDULE_RELATIONSHIP)
df['schedule_relationship'].value_counts()
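
In `convert_to_categories`, a code that is absent from the mapping raises an error at `map_dict[code]`, so the `'Unknown'` default is never reached. If the mapping is a plain dictionary, a possible alternative is `Series.map` followed by `fillna`. The mapping below is hypothetical (labels taken from the data description above). The real `SCHEDULE_RELATIONSHIP` is defined in `scripts/custom_functions.py` and is not shown here.

```python
import pandas as pd

# Hypothetical mapping, standing in for SCHEDULE_RELATIONSHIP
toy_mapping = {0: 'Scheduled', 1: 'Skipped', 2: 'NoData'}

# Code 7 is not in the mapping and falls back to 'Unknown'
toy_codes = pd.Series([0, 1, 2, 0, 7])
toy_labels = toy_codes.map(toy_mapping).fillna('Unknown')
```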

## EDA

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
# Get correlation of delay with other numeric variables
numeric_df = df.select_dtypes(include=[np.number])
corr_matrix = numeric_df.corr()
corr_with_delay = corr_matrix.drop('delay', axis=1).loc['delay'].sort_values(key=abs, ascending=False)
corr_with_delay

In [None]:
# Export data to Parquet
df.to_parquet('../data/stm_weather_merged.parquet', index=False)

## End