# STM Transit Delay Data Preparation

## Data description

### Real-time Trip Updates

`current_time` timestamp when the data was collected<br>
`trip_id` unique identifier of a trip<br>
`route_id` bus line<br>
`start_date` start date of the trip<br>
`stop_id` stop number<br>
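`arrival_time` actual arrival time, as a Unix timestamp in seconds (converted to milliseconds during cleaning)<br>
`departure_time` actual departure time, as a Unix timestamp in seconds (converted to milliseconds during cleaning)<br>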
`schedule_relationship` state of the trip, 0 means scheduled and 1 means skipped

### Scheduled STM Trips

`trip_id` unique identifier of a trip<br>
`arrival_time` scheduled arrival time, as a GTFS `HH:MM:SS` string (hours can exceed 24 for trips running past midnight)<br>
`departure_time` scheduled departure time, as a GTFS `HH:MM:SS` string<br>
`stop_id` stop number<br>
`stop_sequence` sequence of the stop, for ordering

### STM Stops

`stop_id` unique identifier of a stop<br>
`stop_code` bus stop or metro station number<br>
`stop_name` bus stop or metro station name<br>
`stop_lat` stop latitude<br>
`stop_lon` stop longitude<br>
`stop_url` stop web page<br>
`location_type` stop type<br>
`parent_station` parent station (metro station with multiple exits)<br>
`wheelchair_boarding` indicates whether the stop is wheelchair accessible (0 = no information, 1 = accessible, 2 = not accessible)

### Real-time Vehicle Positions

`current_time` timestamp when the data was collected<br>
`vehicle_id` unique identifier of a vehicle<br>
`trip_id` unique identifier of a trip<br>
`route_id` bus or metro line<br>
`start_date` start date of a trip<br>
`start_time` start time of a trip<br>
`latitude` vehicle current latitude<br>
`longitude` vehicle current longitude<br>
`bearing` direction that the vehicle is facing<br>
`speed` momentary speed measured by the vehicle, in meters per second<br>
`stop_sequence` sequence of the stop, for ordering<br>
`status` status of the vehicle relative to the stop it is approaching or stopped at<br>
`timestamp` timestamp when STM updated the data<br>
`occupancy_status` degree of passenger occupancy

### Weather Archive and Forecast

`time` date and hour of the weather<br>
`temperature` air temperature at 2 meters above ground, in Celsius<br>
`precipitation` total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters<br>
`windspeed` wind speed at 10 meters above ground, in km/h<br>
`weathercode` World Meteorological Organization (WMO) code

### Traffic Incidents

`category` category of the incident<br>
`start_time` start time of the incident in ISO8601 format<br>
`end_time` end time of the incident in ISO8601 format<br>
`length` length of the incident in meters<br>
`delay` delay in seconds caused by the incident (except road closures)<br>
`magnitude_of_delay` severity of the delay<br>
`last_report_time` time the incident was last reported, in ISO8601 format<br>
`latitude` latitude of the incident<br>
`longitude` longitude of the incident

## Imports

In [None]:
from datetime import datetime, timedelta, timezone
from haversine import haversine, Unit
import pandas as pd
import sys

In [None]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import fetch_weather, LOCAL_TIMEZONE

In [5]:
trips_df = pd.read_csv('../data/fetched_stm_trip_updates.csv', low_memory=False)

In [6]:
# TODO: use all schedule files using glob
schedules_df = pd.read_csv('../data/stop_times_2025-04-23.txt')
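
The TODO above could be addressed along the following lines. This is a minimal sketch, assuming every schedule export follows the `stop_times_<date>.txt` naming seen above; the helper `load_schedules` and the `schedule_file` column are hypothetical names introduced for illustration.

```python
from pathlib import Path

import pandas as pd


def load_schedules(data_dir: str, pattern: str = 'stop_times_*.txt') -> pd.DataFrame:
    '''Hypothetical helper: read and stack every schedule file matching the pattern.'''
    frames = []
    for path in sorted(Path(data_dir).glob(pattern)):
        frame = pd.read_csv(path)
        frame['schedule_file'] = path.name  # remember which export a row came from
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f'No files matching {pattern} in {data_dir}')
    return pd.concat(frames, ignore_index=True)
```

A call such as `load_schedules('../data')` would then replace the single `read_csv`. Overlapping exports may contain the same `trip_id` and `stop_id` pairs, so a deduplication step may be needed afterwards.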

In [7]:
stops_df = pd.read_csv('../data/stops_2025-04-23.txt')

In [8]:
positions_df = pd.read_csv('../data/fetched_stm_vehicle_positions.csv', low_memory=False)

In [9]:
weather_df = pd.read_csv('../data/fetched_historical_weather.csv')

In [10]:
traffic_df = pd.read_csv('../data/fetched_traffic.csv')

## Clean Data

In [11]:
# Convert route_id to integer
trips_df['route_id'] = trips_df['route_id'].str.extract(r'(\d+)')
trips_df['route_id'] = trips_df['route_id'].astype('int64')
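
The cast to `int64` raises if any `route_id` contains no digits or is missing. A more tolerant variant is sketched below; the raw values are hypothetical, since the actual format of `route_id` in the feed is not shown here.

```python
import pandas as pd

# Hypothetical raw values; the real feed may look different
raw = pd.Series(['10', '747', '35E', None])

# expand=False returns a Series holding the first run of digits (NaN if none)
digits = raw.str.extract(r'(\d+)', expand=False)

# The nullable Int64 dtype keeps missing values instead of raising
route_id = pd.to_numeric(digits).astype('Int64')
print(route_id)
```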

In [12]:
# Sort trips
trips_df = trips_df.sort_values(by=['current_time', 'trip_id', 'route_id', 'arrival_time'])

In [13]:
# Get proportion of duplicates
subset = ['start_date', 'trip_id', 'route_id', 'stop_id']
duplicate_mask = trips_df.duplicated(subset=subset)
print(f'{duplicate_mask.mean():.2%}')

25.27%


In [14]:
# Remove duplicates
trips_df = trips_df.drop_duplicates(subset=subset, keep='last') # keep latest update
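
Keeping the last duplicate only yields the latest update because the rows were sorted by `current_time` beforehand. The toy example below, with made-up values, illustrates that interaction.

```python
import pandas as pd

# Toy trip updates: the same stop of trip 1 is reported twice
updates = pd.DataFrame({
    'current_time': [100, 200, 100],
    'start_date': [20250428, 20250428, 20250428],
    'trip_id': [1, 1, 2],
    'route_id': [10, 10, 20],
    'stop_id': [500, 500, 600],
    'arrival_time': [1000, 1060, 2000],
})

key = ['start_date', 'trip_id', 'route_id', 'stop_id']

# Sorting by collection time first makes "last" mean "most recent"
latest = (
    updates
    .sort_values('current_time')
    .drop_duplicates(subset=key, keep='last')
)
print(latest[['trip_id', 'arrival_time']])
```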

In [15]:
# Convert realtime arrival and departure times from seconds to milliseconds
trips_df['arrival_time'] = trips_df['arrival_time'] * 1000
trips_df['departure_time'] = trips_df['departure_time'] * 1000

In [16]:
# Get distribution of realtime arrival times
trips_df[['arrival_time', 'departure_time']].describe()

Unnamed: 0,arrival_time,departure_time
count,1235901.0,1235901.0
mean,1644153000000.0,1640610000000.0
std,408991800000.0,415603600000.0
min,0.0,0.0
25%,1745843000000.0,1745843000000.0
50%,1745881000000.0,1745880000000.0
75%,1745937000000.0,1745937000000.0
max,1745977000000.0,1745977000000.0


In [17]:
# Get proportion of rows with zero arrival times
zero_mask = trips_df['arrival_time'] == 0
print(f'{zero_mask.mean():.2%}')

5.83%


In [18]:
# Get proportion of rows where the arrival and departure times are different
diff_date_mask = trips_df['arrival_time'] != trips_df['departure_time']
print(f'{diff_date_mask.mean():.2%}')

6.08%


In [19]:
# Display a sample of rows where the arrival and departure times differ
trips_df[diff_date_mask].sample(10)

Unnamed: 0,current_time,trip_id,route_id,start_date,stop_id,arrival_time,departure_time,schedule_relationship
1228877,1745935000.0,285008393,200,20250429,57797,1745938080000,0,0
220048,1745834000.0,285283264,125,20250428,52433,1745835665000,0,0
1532486,1745960000.0,285031740,486,20250429,53611,1745962020000,0,0
365110,1745845000.0,285001207,25,20250428,51787,1745845271000,1745845320000,0
834139,1745885000.0,284779317,165,20250428,56136,1745887232000,0,0
963795,1745906000.0,284728912,370,20250428,51956,1745906570000,1745906640000,0
351187,1745842000.0,286570602,63,20250428,60662,1745844360000,0,0
1177161,1745932000.0,286571494,104,20250429,56306,0,1745934180000,0
126728,1745806000.0,284753154,146,20250427,53827,0,1745806680000,0
1292053,1745942000.0,286572885,114,20250429,57072,0,1745945400000,0


In [20]:
# Replace zero arrival times by departure times, as they are usually the same
trips_df.loc[zero_mask, 'arrival_time'] = trips_df.loc[zero_mask, 'departure_time']

In [21]:
# Get proportion of rows with zero arrival times again
zero_mask = trips_df['arrival_time'] == 0
print(f'{zero_mask.mean():.2%}')

3.28%


In [22]:
# Delete the rows with 0 arrival times
trips_df = trips_df[~zero_mask]
zero_mask = trips_df['arrival_time'] == 0
assert zero_mask.sum() == 0
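
The fallback and the final filter can also be expressed with `Series.mask`. The sketch below uses a few timestamps from the sample rows shown earlier plus one made-up row where both times are zero.

```python
import pandas as pd

times = pd.DataFrame({
    'arrival_time': [0, 1745845271000, 0],
    'departure_time': [1745934180000, 1745845320000, 0],
})

# mask() replaces values where the condition is True with the fallback column
times['arrival_time'] = times['arrival_time'].mask(
    times['arrival_time'] == 0, times['departure_time']
)

# Rows where both times were 0 cannot be recovered and are dropped
times = times[times['arrival_time'] != 0]
print(times)
```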

In [23]:
# Rename arrival time
trips_df = trips_df.rename(columns={'arrival_time': 'realtime_arrival_time'})

In [24]:
# Drop departure time
trips_df = trips_df.drop('departure_time', axis=1)

## Merge Data

### Realtime and Scheduled Trips

In [25]:
# Sort values by stop sequence
schedules_df = schedules_df.sort_values(by=['trip_id', 'stop_sequence'])

In [26]:
# Reset stop sequences (some stops might be missing)
schedules_df['stop_sequence'] = schedules_df.groupby('trip_id').cumcount() + 1

In [27]:
# Add trip progress (vehicles further along the trip are more likely to be delayed)
total_stops = schedules_df.groupby('trip_id')['stop_id'].transform('count')
schedules_df['trip_progress'] = schedules_df['stop_sequence'] / total_stops
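
A toy schedule, with made-up identifiers, shows what the renumbering and the progress ratio produce when a trip has a gap in its stop sequence.

```python
import pandas as pd

# Toy schedule: trip 1 has a gap in its stop sequence (2 is missing)
schedule = pd.DataFrame({
    'trip_id': [1, 1, 1, 2, 2],
    'stop_id': [101, 103, 104, 201, 202],
    'stop_sequence': [1, 3, 4, 1, 2],
})

schedule = schedule.sort_values(['trip_id', 'stop_sequence'])

# Renumber stops 1..n within each trip, then divide by the trip's stop count
schedule['stop_sequence'] = schedule.groupby('trip_id').cumcount() + 1
total_stops = schedule.groupby('trip_id')['stop_id'].transform('count')
schedule['trip_progress'] = schedule['stop_sequence'] / total_stops
print(schedule)
```

The last stop of every trip ends up with a progress of 1.0, regardless of how many stops the trip has.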

In [28]:
# Merge realtime and scheduled trips (#TODO: concatenate new schedule before left joining)
stm_trips_df = pd.merge(left=trips_df, right=schedules_df, how='inner', on=['trip_id', 'stop_id'])
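
An inner join silently drops realtime rows that have no schedule entry. One way to measure how many are lost is a left join with `indicator=True`; the sketch below uses toy frames with made-up identifiers.

```python
import pandas as pd

realtime = pd.DataFrame({'trip_id': [1, 1, 2], 'stop_id': [10, 11, 20]})
scheduled = pd.DataFrame({
    'trip_id': [1, 1, 3],
    'stop_id': [10, 11, 30],
    'stop_sequence': [1, 2, 1],
})

# indicator=True adds a _merge column labelling each row as both or left_only
check = pd.merge(realtime, scheduled, how='left', on=['trip_id', 'stop_id'], indicator=True)
unmatched_share = (check['_merge'] == 'left_only').mean()
print(f'{unmatched_share:.2%} of realtime rows have no schedule entry')
```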

In [31]:
# Convert start_date to datetime
stm_trips_df['start_date_dt'] = pd.to_datetime(stm_trips_df['start_date'], format='%Y%m%d')

In [32]:
def parse_gtfs_time(row) -> pd.Timestamp:
	'''
	Converts the GTFS arrival time string of a row (e.g., '25:30:00')
	to a datetime by adding it to the start date of the trip.
	Hours can exceed 24 for trips that run past midnight.
	'''
	hours, minutes, seconds = map(int, row['arrival_time'].split(':'))
	total_seconds = hours * 3600 + minutes * 60 + seconds

	parsed_time = row['start_date_dt'] + timedelta(seconds=total_seconds)
	return parsed_time

In [33]:
# Convert scheduled arrival time to localized datetime
stm_trips_df['scheduled_arrival_time'] = stm_trips_df.apply(parse_gtfs_time, axis=1)
stm_trips_df['scheduled_arrival_time'] = stm_trips_df['scheduled_arrival_time'].dt.tz_localize(LOCAL_TIMEZONE)

In [34]:
# Convert scheduled arrival time to timestamp in milliseconds since epoch
stm_trips_df['scheduled_arrival_time'] = stm_trips_df['scheduled_arrival_time'].astype('int64') // 10**6
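
The same conversion can be written without a row-wise `apply`. The sketch below is one possible vectorized version on toy data; the timezone name is an assumption, since the value of `LOCAL_TIMEZONE` is defined in the custom code and not shown here.

```python
import pandas as pd

# Assumption: stands in for the notebook's LOCAL_TIMEZONE, whose value is not shown
ASSUMED_TIMEZONE = 'America/Toronto'

trips = pd.DataFrame({
    'start_date': [20250428, 20250428],
    'arrival_time': ['08:15:00', '25:30:00'],  # 25:30 means 01:30 on the next day
})

# Split 'HH:MM:SS' into integer columns and turn them into a timedelta
parts = trips['arrival_time'].str.split(':', expand=True).astype('int64')
offset = pd.to_timedelta(parts[0] * 3600 + parts[1] * 60 + parts[2], unit='s')

# Start date + offset, localized, then expressed in milliseconds since epoch
service_day = pd.to_datetime(trips['start_date'], format='%Y%m%d')
local = (service_day + offset).dt.tz_localize(ASSUMED_TIMEZONE)
epoch = pd.Timestamp('1970-01-01', tz='UTC')
trips['scheduled_arrival_time'] = (local - epoch) // pd.Timedelta(milliseconds=1)
print(trips)
```

Dividing the time elapsed since the epoch by one millisecond avoids depending on the internal resolution of the datetime column. Local times that fall in a daylight saving transition would still need the `ambiguous` and `nonexistent` arguments of `tz_localize`.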

### Trips and Stops

In [35]:
trips_stops_df = pd.merge(left=stm_trips_df, right=stops_df, how='inner', left_on='stop_id', right_on='stop_code')

In [36]:
trips_stops_df.columns

Index(['current_time', 'trip_id', 'route_id', 'start_date', 'stop_id_x',
       'realtime_arrival_time', 'schedule_relationship', 'arrival_time',
       'departure_time', 'stop_sequence', 'trip_progress', 'start_date_dt',
       'scheduled_arrival_time', 'stop_id_y', 'stop_code', 'stop_name',
       'stop_lat', 'stop_lon', 'stop_url', 'location_type', 'parent_station',
       'wheelchair_boarding'],
      dtype='object')

In [37]:
# Rename stop id
trips_stops_df = trips_stops_df.rename(columns={'stop_id_x': 'stop_id'})

In [38]:
# Convert wheelchair_boarding to a binary flag (1 = accessible, 0 = otherwise)
trips_stops_df['wheelchair_boarding'] = (trips_stops_df['wheelchair_boarding'] == 1).astype('int64')

### Vehicle Positions

In [39]:
subset = ['trip_id', 'route_id', 'start_date', 'stop_sequence']
positions_df = positions_df.sort_values(by=subset)

In [40]:
duplicate_mask = positions_df.duplicated(subset=subset)
positions_df[duplicate_mask]

Unnamed: 0,current_time,vehicle_id,trip_id,route_id,start_date,start_time,latitude,longitude,bearing,speed,stop_sequence,status,timestamp,occupancy_status
56821,1.745929e+09,33829,908894,470,20250429.0,07:15:00,45.496933,-73.704300,41.0,3.88892,24,2,1745928932,3
15261,1.745852e+09,32802,914104,470,20250428.0,10:24:00,45.459759,-73.891930,144.0,0.00000,3,1,1745852444,1
15784,1.745853e+09,32802,914104,470,20250428.0,10:24:00,45.459759,-73.891930,94.0,0.00000,3,1,1745853345,1
16322,1.745854e+09,32802,914104,470,20250428.0,10:24:00,45.459759,-73.891930,0.0,0.00000,3,1,1745854247,1
16857,1.745855e+09,32802,914104,470,20250428.0,10:24:00,45.459759,-73.891930,115.0,0.00000,3,1,1745855147,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92496,1.745969e+09,30086,286598485,721,20250429.0,19:00:00,45.524750,-73.550217,305.0,7.77784,4,2,1745969449,1
54859,1.745927e+09,42045,286749881,69,20250429.0,07:58:00,45.575150,-73.655861,0.0,0.00000,1,1,1745927124,1
55689,1.745928e+09,39136,287230574,56,20250429.0,07:29:00,45.618084,-73.581253,0.0,0.00000,16,2,1745928023,1
56793,1.745929e+09,39136,287230574,56,20250429.0,07:29:00,45.618450,-73.580353,0.0,0.00000,16,2,1745928937,1


In [41]:
# Get proportion of duplicates
print(f'{duplicate_mask.mean():.2%}')

3.14%


In [42]:
# Remove duplicates
positions_df = positions_df.drop_duplicates(subset=subset)
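
By default `drop_duplicates` keeps the first row of each group, and the sort above does not order rows within a group by time. If the most recent report per stop is preferred, as for the trip updates, the feed `timestamp` can be added to the sort. A sketch on toy rows modelled on the output above:

```python
import pandas as pd

positions = pd.DataFrame({
    'trip_id': [914104, 914104, 914104],
    'route_id': [470, 470, 470],
    'start_date': [20250428, 20250428, 20250428],
    'stop_sequence': [3, 3, 4],
    'timestamp': [1745853345, 1745852444, 1745854247],
    'bearing': [94.0, 144.0, 0.0],
})

key = ['trip_id', 'route_id', 'start_date', 'stop_sequence']

# Order by the feed timestamp so that keep='last' retains the newest report per key
latest_positions = (
    positions
    .sort_values(key + ['timestamp'])
    .drop_duplicates(subset=key, keep='last')
)
print(latest_positions)
```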

In [43]:
# Merge with other STM data
stm_df = pd.merge(left=trips_stops_df, right=positions_df, how='inner', on=['trip_id', 'route_id', 'start_date', 'stop_sequence'])

In [44]:
stm_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82063 entries, 0 to 82062
Data columns (total 32 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   current_time_x          82063 non-null  float64       
 1   trip_id                 82063 non-null  int64         
 2   route_id                82063 non-null  int64         
 3   start_date              82063 non-null  int64         
 4   stop_id                 82063 non-null  int64         
 5   realtime_arrival_time   82063 non-null  int64         
 6   schedule_relationship   82063 non-null  int64         
 7   arrival_time            82063 non-null  object        
 8   departure_time          82063 non-null  object        
 9   stop_sequence           82063 non-null  int64         
 10  trip_progress           82063 non-null  float64       
 11  start_date_dt           82063 non-null  datetime64[ns]
 12  scheduled_arrival_time  82063 non-null  int64 

### STM and Weather

In [45]:
# Convert time string to datetime
time_dt = pd.to_datetime(weather_df['time'], utc=True)

In [46]:
# Convert STM arrival timestamp to datetime 
stm_df['arrival_time_dt'] = pd.to_datetime(stm_df['realtime_arrival_time'], origin='unix', unit='ms', utc=True)

In [47]:
# Calculate dates for weather forecast
last_day_weather = time_dt.max()
start_date = last_day_weather + timedelta(days=1)
end_date = stm_df['arrival_time_dt'].max()

In [48]:
# Fetch forecast weather
start_date_str = start_date.strftime('%Y-%m-%d')
end_date_str = end_date.strftime('%Y-%m-%d')

forecast_list = fetch_weather(start_date=start_date_str, end_date=end_date_str, forecast=True)
forecast_df = pd.DataFrame(forecast_list)

In [49]:
# Merge archive and forecast weather
weather_df = pd.concat([weather_df, forecast_df], ignore_index=True)

In [50]:
# Round arrival time to the nearest hour
stm_df['rounded_arrival_dt'] = stm_df['arrival_time_dt'].dt.round('h')

In [51]:
# Format time to match weather data
stm_df['time'] = stm_df['rounded_arrival_dt'].dt.strftime('%Y-%m-%dT%H:%M')

In [52]:
# Merge STM with weather
stm_weather_df = pd.merge(left=stm_df, right=weather_df, how='inner', on='time')
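
The join key is built by rounding each arrival to the nearest hour and formatting it like the weather `time` column. The toy example below walks through those steps, assuming the weather times are expressed in UTC; the temperature values are made up.

```python
import pandas as pd

arrivals = pd.DataFrame({
    'realtime_arrival_time': [1745842500000, 1745845271000],  # epoch milliseconds
})
weather = pd.DataFrame({
    'time': ['2025-04-28T12:00', '2025-04-28T13:00'],
    'temperature': [11.5, 12.9],  # made-up values
})

# Epoch milliseconds -> UTC datetime -> nearest hour -> weather-style key
arrival_dt = pd.to_datetime(arrivals['realtime_arrival_time'], unit='ms', utc=True)
arrivals['time'] = arrival_dt.dt.round('h').dt.strftime('%Y-%m-%dT%H:%M')

merged = pd.merge(arrivals, weather, how='inner', on='time')
print(merged)
```

Arrivals whose rounded hour has no weather row are dropped by the inner join, so comparing row counts before and after the merge is a quick sanity check.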

### Traffic Data

In [53]:
# Get proportion of duplicates
duplicate_mask = traffic_df.duplicated()
print(f'{duplicate_mask.mean():.2%}')

31.36%


In [54]:
# Remove duplicates
traffic_df = traffic_df.drop_duplicates(keep='last').reset_index(drop=True)

In [55]:
# Convert traffic start_time and end_time to datetime
traffic_df['start_time_dt'] = pd.to_datetime(traffic_df['start_time'], utc=True)
traffic_df['end_time_dt'] = pd.to_datetime(traffic_df['end_time'], utc=True)

In [56]:
# Sort by date
traffic_df = traffic_df.sort_values(by='start_time_dt').reset_index(drop=True)

In [57]:
# Fill null end times with current time (assuming the incident is still ongoing)
traffic_df['end_time_dt'] = traffic_df['end_time_dt'].fillna(datetime.now(timezone.utc).replace(microsecond=0))
assert traffic_df['end_time_dt'].isna().sum() == 0

In [58]:
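# Build traffic cache
def build_traffic_cache(traffic_df:pd.DataFrame) -> dict:
	'''
	Groups incidents by every hour in which they are active, so a lookup
	by trip hour also finds incidents that started in an earlier hour.
	'''
	hourly_df = traffic_df.dropna(subset=['start_time_dt']).copy()
	hourly_df['hour'] = pd.Series([
		list(pd.date_range(start.floor('h'), end.floor('h'), freq='h'))
		for start, end in zip(hourly_df['start_time_dt'], hourly_df['end_time_dt'])
	], index=hourly_df.index, dtype='object')
	hourly_df = hourly_df.explode('hour')

	return {hour: group.copy() for hour, group in hourly_df.groupby('hour')}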

Many trip updates fall on the same day, often within the same hour, so filtering the active traffic incidents separately for each trip would repeat the same work many times and be slow on a large dataset. Because traffic incidents are stable over minutes or hours, the incidents are cached by hour and each trip update only looks up its own hour.
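
A toy illustration of the lookup, with made-up incident values. The incidents here start and end within a single hour, so the start hour is the only bucket each one needs.

```python
import pandas as pd

incidents = pd.DataFrame({
    'category': [1, 6, 8],
    'start_time_dt': pd.to_datetime(
        ['2025-04-28T12:05:00Z', '2025-04-28T12:40:00Z', '2025-04-28T14:10:00Z'], utc=True
    ),
})

# Group once: one small DataFrame per hour bucket
incidents['hour'] = incidents['start_time_dt'].dt.floor('h')
cache = {hour: group for hour, group in incidents.groupby('hour')}

# Each trip update then needs only a dictionary lookup keyed by its own hour
trip_time = pd.Timestamp('2025-04-28T12:30:00Z')
candidates = cache.get(trip_time.floor('h'))
print(len(candidates))
```

The expensive grouping happens once, while the per-trip cost is a dictionary lookup followed by a filter over the handful of incidents in that bucket.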

In [59]:
def calculate_nearby_incidents(trip_update:pd.Series, traffic_cache:dict, max_distance:int=500) -> pd.Series:
	trip_datetime = trip_update['arrival_time_dt']
	stop_coords = (trip_update['stop_lat'], trip_update['stop_lon'])

	trip_hour = trip_datetime.floor('h')

	# Get cached incidents
	hour_incidents = traffic_cache.get(trip_hour)

	# Stop if there are no incidents for that hour
	if hour_incidents is None or hour_incidents.empty:
		return pd.Series({
			'incident_nearby': 0,
			'nearest_incident_distance': None,
			'incident_category': None,
			'incident_delay': None,
			'incident_delay_magnitude': None
		})

	# Filter for active incidents at that trip hour
	active_incidents = hour_incidents[
		(hour_incidents['start_time_dt'] <= trip_datetime) &
		(hour_incidents['end_time_dt'] >= trip_datetime)
	].copy()

	if active_incidents.empty:
		return pd.Series({
			'incident_nearby': 0,
			'nearest_incident_distance': None,
			'incident_category': None,
			'incident_delay': None,
			'incident_delay_magnitude': None
		})

	# Calculate distance     
	active_incidents['distance'] = active_incidents.apply(
		lambda row: haversine(stop_coords, (row['latitude'], row['longitude']), unit=Unit.METERS),
		axis=1
	)

	# Filter nearby
	nearby_incidents = active_incidents[active_incidents['distance'] <= max_distance]

	if nearby_incidents.empty:
		return pd.Series({
			'incident_nearby': 0,
			'nearest_incident_distance': None,
			'incident_category': None,
			'incident_delay': None,
			'incident_delay_magnitude': None
		})
	else:
		nearest = nearby_incidents.loc[nearby_incidents['distance'].idxmin()]
		return pd.Series({
			'incident_nearby': 1,
			'nearest_incident_distance': nearest['distance'],
			'incident_category': nearest['category'],
			'incident_delay': nearest['delay'],
			'incident_delay_magnitude': nearest['magnitude_of_delay']
		})
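
The row-wise `apply` calls `haversine` once per incident. As a possible alternative, the haversine formula can be evaluated on whole arrays with NumPy. The function `haversine_m` below is a hypothetical helper written for this sketch, not part of the `haversine` package, and the coordinates are made up.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    '''Great-circle distance in meters; inputs in degrees, scalars or arrays.'''
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


# Distance from one stop to several incidents in a single call
incident_lat = np.array([45.5017, 46.5017])
incident_lon = np.array([-73.5673, -73.5673])
distances = haversine_m(45.5017, -73.5673, incident_lat, incident_lon)
print(distances)
```

The second incident lies exactly one degree of latitude north of the stop, which corresponds to roughly 111.2 km.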

In [None]:
# Get traffic columns (get incidents within 500 meters)
traffic_cache = build_traffic_cache(traffic_df)
traffic_cols = stm_weather_df.apply(lambda row: calculate_nearby_incidents(row, traffic_cache), axis=1)

In [None]:
# Add the traffic columns to the merged STM and weather data
df = pd.concat([stm_weather_df, traffic_cols], axis=1)

## Export Data

In [None]:
# Remove columns with constant values or with more than 50% missing values
df = df.loc[:, (df.nunique() > 1) & (df.isna().mean() < 0.5)]
df.columns
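
The column filter combines two boolean Series indexed by column name. A toy frame with made-up values shows which columns survive.

```python
import pandas as pd

frame = pd.DataFrame({
    'route_id': [10, 20, 30, 40],
    'schedule_relationship': [0, 0, 0, 0],        # constant -> dropped
    'incident_delay': [None, None, None, 120.0],  # 75% missing -> dropped
    'temperature': [11.5, None, 12.9, 13.0],      # 25% missing -> kept
})

# nunique() ignores missing values; isna().mean() is the share of missing values
keep = (frame.nunique() > 1) & (frame.isna().mean() < 0.5)
filtered = frame.loc[:, keep]
print(list(filtered.columns))
```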

In [None]:
# Keep relevant columns
df = df[[
	'vehicle_id',
	'occupancy_status',
	'route_id',
	'stop_id',
	'stop_lat',
	'stop_lon',
	'stop_sequence',
	'trip_progress',
	'wheelchair_boarding',
	'realtime_arrival_time',
	'scheduled_arrival_time',
	'temperature',
	'precipitation',
	'windspeed',
	'weathercode',
	'incident_nearby',
	#'incident_category',
	#'nearest_incident_distance',
	#'incident_delay',
	#'incident_delay_magnitude'
]]

In [None]:
df.info()

In [None]:
# Export data to CSV
df.to_csv('../data/stm_weather_traffic_merged.csv', index=False)

## End