# STM Transit Delay Data Preparation

## Data description

### Real-time Trip Updates

`current_time` timestamp when the data was collected<br>
`trip_id` unique identifier of a trip<br>
`route_id` bus or metro line<br>
`start_date` start date of the trip<br>
`stop_id` stop number<br>
`arrival_time` actual arrival time, in milliseconds<br>
`departure_time` actual departure time, in milliseconds<br>
`schedule_relationship` state of the trip, 0 means scheduled and 1 means skipped

### Scheduled STM Trips

`trip_id` unique identifier of a trip<br>
`arrival_time` scheduled arrival time, in milliseconds<br>
`departure_time` scheduled departure time, in milliseconds<br>
`stop_id` stop number<br>
`stop_sequence` sequence of the stop, for ordering

### STM Stops

`stop_id` unique identifier of a stop<br>
`stop_code` stop number<br>
`stop_name` stop name<br>
`stop_lat` stop latitude<br>
`stop_lon` stop longitude<br>
`stop_url` stop web page<br>
`location_type` stop type, 1 being a metro station and 2 a bus stop<br>
`parent_station` parent station (ex: a metro station with multiple exits)<br>
`wheelchair_boarding` indicates whether the stop is wheelchair accessible, 1 being true and 2 being false

### Real-time Vehicle Positions

`current_time` timestamp when the data was collected<br>
`vehicle_id` unique identifier of a vehicle<br>
`trip_id` unique identifier of a trip<br>
`route_id` bus or metro line<br>
`start_date` start date of a trip<br>
`start_time` start time of a trip<br>
`latitude` vehicle current latitude<br>
`longitude` vehicle current longitude<br>
`bearing` direction that the vehicle is facing<br>
`speed` momentary speed measured by the vehicle, in meters per second<br>
`stop_sequence` sequence of the stop, for ordering<br>
`status` vehicle status relative to the stop it is approaching or currently at<br>
`timestamp` timestamp when STM updated the data<br>
`occupancy_status` degree of passenger occupancy

### Weather Archive and Forecast

`time` date and hour of the archived weather<br>
`temperature` air temperature at 2 meters above ground, in Celsius<br>
`precipitation` total precipitation (rain, showers, snow) over the preceding hour, in millimeters<br>
`windspeed` wind speed at 10 meters above ground, in km/h<br>
`weathercode` World Meteorological Organization (WMO) code

### Traffic Incidents

`category` category of the incident<br>
`start_time` start time of the incident in ISO8601 format<br>
`end_time` end time of the incident in ISO8601 format<br>
`length` length of the incident in meters<br>
`delay` delay in seconds caused by the incident (except road closures)<br>
`magnitude_of_delay` severity of the delay<br>
`last_report_time` date and time, in ISO8601 format, when the incident was last reported<br>
`latitude` latitude of the incident<br>
`longitude` longitude of the incident

## Imports

In [113]:
from datetime import datetime, timedelta, UTC
from haversine import haversine, Unit
import numpy as np
import pandas as pd
import sys

In [2]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import fetch_weather, LOCAL_TIMEZONE

In [3]:
trips_df = pd.read_csv('../data/fetched_stm_trip_updates.csv', low_memory=False)

In [4]:
schedules_df = pd.read_csv('../data/stop_times_2025-04-23.txt')

In [5]:
stops_df = pd.read_csv('../data/stops_2025-04-23.txt')

In [6]:
positions_df = pd.read_csv('../data/fetched_stm_vehicle_positions.csv', low_memory=False)

In [7]:
weather_df = pd.read_csv('../data/fetched_historical_weather.csv')

In [88]:
traffic_df = pd.read_csv('../data/fetched_traffic.csv')

## Clean Data

In [9]:
# Convert route_id to integer
trips_df['route_id'] = trips_df['route_id'].str.extract(r'(\d+)')
trips_df['route_id'] = trips_df['route_id'].astype('int64')

In [10]:
# Sort trips
subset = ['current_time', 'start_date', 'trip_id', 'route_id', 'stop_id']
trips_df = trips_df.sort_values(by=subset)

In [11]:
# Get proportion of duplicates
new_subset = subset[1:]
duplicate_mask = trips_df.duplicated(subset=new_subset)
print(f'{(duplicate_mask.sum() / len(trips_df)):.2%}')

24.21%


In [12]:
# Remove duplicates
trips_df = trips_df.drop_duplicates(subset=new_subset, keep='last') # keep latest update
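
The deduplication above relies on the earlier sort: because `current_time` is the first sort key, `keep='last'` retains the most recently collected snapshot for each trip and stop. A minimal sketch on toy data (all values are hypothetical) that shows this behaviour:

```python
import pandas as pd

# Toy data (hypothetical values): trip 1 was captured twice at stop 500
updates = pd.DataFrame({
    'current_time': [100.0, 200.0, 150.0],
    'start_date': [20250423, 20250423, 20250423],
    'trip_id': [1, 1, 2],
    'route_id': [10, 10, 20],
    'stop_id': [500, 500, 600],
    'arrival_time': [1000, 1060, 2000],
})

key = ['start_date', 'trip_id', 'route_id', 'stop_id']

# Sort by collection time first so the last row of each key is the newest snapshot
updates = updates.sort_values(by=['current_time'] + key)

# Keep only the newest snapshot per (start_date, trip_id, route_id, stop_id)
latest = updates.drop_duplicates(subset=key, keep='last')
print(latest)
```

Without the sort, `keep='last'` would only keep whichever row happens to come last in file order.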

In [13]:
# Convert realtime arrival and departure time to milliseconds
trips_df['arrival_time'] = trips_df['arrival_time'] * 1000
trips_df['departure_time'] = trips_df['departure_time'] * 1000

In [14]:
# Get distribution of realtime arrival times
trips_df[['arrival_time', 'departure_time']].describe()

| | arrival_time | departure_time |
|---|---|---|
| count | 2389599.0 | 2389599.0 |
| mean | 1649726000000.0 | 1646224000000.0 |
| std | 397701500000.0 | 404469900000.0 |
| min | 0.0 | 0.0 |
| 25% | 1745501000000.0 | 1745500000000.0 |
| 50% | 1745589000000.0 | 1745589000000.0 |
| 75% | 1745689000000.0 | 1745689000000.0 |
| max | 1745814000000.0 | 1745814000000.0 |


In [15]:
# Get proportion of rows with zero arrival times
zero_mask = trips_df['arrival_time'] == 0
print(f'{(zero_mask.sum() / len(trips_df)):.2%}')

5.49%
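
The zero minimum in the summary and the share of zero arrival times printed above are consistent with 0 being a placeholder for a missing time rather than a real timestamp. That is an interpretation of the data, not something the notebook confirms. A minimal sketch, on hypothetical values, of converting epoch seconds to milliseconds and masking the zeros before reading them as datetimes:

```python
import pandas as pd

# Hypothetical epoch timestamps in seconds; 0 stands in for a missing time
arrival_s = pd.Series([1745501000, 0, 1745589000])

# Same conversion as in the notebook: seconds to milliseconds
arrival_ms = arrival_s * 1000

# Mask the zeros so they become NaT instead of 1970-01-01
arrival_dt = pd.to_datetime(arrival_ms.where(arrival_ms > 0), unit='ms', utc=True)
print(arrival_dt)
```

Left unmasked, the zeros pull the mean far below the quartiles, which is what the summary statistics show.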


In [16]:
# Get proportion of rows where the arrival and departure times are different
diff_date_mask = trips_df['arrival_time'] != trips_df['departure_time']
print(f'{(diff_date_mask.sum() / len(trips_df)):.2%}')

5.96%


In [17]:
# Get rows where the arrival and departure times differ
diff_date_df = trips_df[diff_date_mask]

In [18]:
# Replace zero arrival times with departure times, as the two are usually the same
trips_df.loc[zero_mask, 'arrival_time'] = trips_df.loc[zero_mask, 'departure_time']

In [19]:
# Get proportion of rows with zero arrival times again
zero_mask = trips_df['arrival_time'] == 0
print(f'{(zero_mask.sum() / len(trips_df)):.2%}')

2.97%


In [20]:
# Delete the rows with 0 arrival times
trips_df = trips_df[~zero_mask]
zero_mask = trips_df['arrival_time'] == 0
assert zero_mask.sum() == 0

In [21]:
# Rename arrival time
trips_df = trips_df.rename(columns={'arrival_time': 'realtime_arrival_time'})

In [22]:
# Drop departure time
trips_df = trips_df.drop('departure_time', axis=1)

## Merge Data

### Realtime and Scheduled Trips

In [23]:
stm_trips_df = pd.merge(left=trips_df, right=schedules_df, how='inner', on=['trip_id', 'stop_id'])
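
An inner merge silently drops realtime rows that have no scheduled counterpart. One way to measure that loss before committing to the inner join is a left merge with `indicator=True`. The sketch below uses toy frames with hypothetical values, not the notebook's data:

```python
import pandas as pd

# Toy frames (hypothetical values) standing in for trips_df and schedules_df
realtime = pd.DataFrame({
    'trip_id': [1, 1, 2],
    'stop_id': [10, 11, 10],
    'realtime_arrival_time': [1745501000000, 1745501300000, 1745502000000],
})
scheduled = pd.DataFrame({
    'trip_id': [1, 1, 3],
    'stop_id': [10, 11, 10],
    'arrival_time': ['08:00:00', '08:05:00', '09:00:00'],
})

# A left merge keeps every realtime row; _merge records whether a match was found
checked = pd.merge(left=realtime, right=scheduled, how='left',
                   on=['trip_id', 'stop_id'], indicator=True)

# Share of realtime rows that an inner merge would discard
unmatched_share = (checked['_merge'] == 'left_only').mean()
print(f'{unmatched_share:.2%}')
```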

In [24]:
# Convert start_date to datetime
stm_trips_df['start_date_dt'] = pd.to_datetime(stm_trips_df['start_date'], format='%Y%m%d')

In [25]:
def parse_gtfs_time(row) -> pd.Timestamp:
	'''
	Convert a GTFS arrival time string to a datetime by adding it
	to the trip's start date. GTFS times can exceed 24:00:00
	(e.g., '25:30:00') for trips that run past midnight.
	'''
	hours, minutes, seconds = map(int, row['arrival_time'].split(':'))
	total_seconds = hours * 3600 + minutes * 60 + seconds

	parsed_time = row['start_date_dt'] + timedelta(seconds=total_seconds)
	return parsed_time
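
Applying a Python function row by row can be slow on millions of rows. The same conversion can be sketched with vectorized pandas operations. The helper name `gtfs_times_to_datetime` and the sample values below are hypothetical and only illustrate the idea:

```python
import pandas as pd

def gtfs_times_to_datetime(start_dates: pd.Series, gtfs_times: pd.Series) -> pd.Series:
    '''Add GTFS 'HH:MM:SS' offsets (hours may exceed 24) to the trip start dates.'''
    # Split 'HH:MM:SS' into three integer columns
    parts = gtfs_times.str.split(':', expand=True).astype('int64')
    total_seconds = parts[0] * 3600 + parts[1] * 60 + parts[2]
    return start_dates + pd.to_timedelta(total_seconds, unit='s')

# Hypothetical sample: the second time runs past midnight
sample = pd.DataFrame({
    'start_date_dt': pd.to_datetime(['20250423', '20250423'], format='%Y%m%d'),
    'arrival_time': ['08:15:30', '25:30:00'],
})
sample['scheduled_arrival_time'] = gtfs_times_to_datetime(
    sample['start_date_dt'], sample['arrival_time']
)
print(sample)
```

The result is timezone-naive, like the output of `parse_gtfs_time`, so it would still need to be localized afterwards.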

In [26]:
# Convert planned arrival time to localized datetime
stm_trips_df['scheduled_arrival_time'] = stm_trips_df.apply(parse_gtfs_time, axis=1)
stm_trips_df['scheduled_arrival_time'] = stm_trips_df['scheduled_arrival_time'].dt.tz_localize(LOCAL_TIMEZONE)

In [27]:
# Convert planned time to timestamp in milliseconds since epoch
stm_trips_df['scheduled_arrival_time'] = stm_trips_df['scheduled_arrival_time'].astype('int64') // 10**6

### Trips and Stops

In [28]:
trips_stops_df = pd.merge(left=stm_trips_df, right=stops_df, how='inner', left_on='stop_id', right_on='stop_code')

In [29]:
trips_stops_df.columns

Index(['current_time', 'trip_id', 'route_id', 'start_date', 'stop_id_x',
       'realtime_arrival_time', 'schedule_relationship', 'arrival_time',
       'departure_time', 'stop_sequence', 'start_date_dt',
       'scheduled_arrival_time', 'stop_id_y', 'stop_code', 'stop_name',
       'stop_lat', 'stop_lon', 'stop_url', 'location_type', 'parent_station',
       'wheelchair_boarding'],
      dtype='object')

In [31]:
# Keep relevant columns
trips_stops_df = trips_stops_df[[
  'current_time',
  'trip_id',
  'route_id',
  'stop_id_x',
  'stop_lat',
  'stop_lon',
  'start_date',
  'stop_sequence',
  'wheelchair_boarding',
  'realtime_arrival_time',
  'scheduled_arrival_time'
]]

In [32]:
# Rename stop id
trips_stops_df = trips_stops_df.rename(columns={'stop_id_x': 'stop_id'})

In [33]:
# Convert wheelchair_boarding to a binary flag (1 = accessible, 0 = otherwise)
trips_stops_df['wheelchair_boarding'] = (trips_stops_df['wheelchair_boarding'] == 1).astype('int64')

### Vehicle Positions

In [34]:
stm_df = pd.merge(left=trips_stops_df, right=positions_df, how='inner', on=['trip_id', 'route_id', 'start_date', 'stop_sequence']) \
			.rename(columns={'current_time_x': 'current_time', 'status': 'vehicle_stop_status'}) \
			.drop('current_time_y', axis=1)

In [35]:
stm_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3296 entries, 0 to 3295
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   current_time            3296 non-null   float64
 1   trip_id                 3296 non-null   int64  
 2   route_id                3296 non-null   int64  
 3   stop_id                 3296 non-null   int64  
 4   stop_lat                3296 non-null   float64
 5   stop_lon                3296 non-null   float64
 6   start_date              3296 non-null   int64  
 7   stop_sequence           3296 non-null   int64  
 8   wheelchair_boarding     3296 non-null   int64  
 9   realtime_arrival_time   3296 non-null   int64  
 10  scheduled_arrival_time  3296 non-null   int64  
 11  vehicle_id              3296 non-null   int64  
 12  start_time              3296 non-null   object 
 13  latitude                3296 non-null   float64
 14  longitude               3296 non-null   

### STM and Weather

In [36]:
# Convert arrival timestamp to datetime
rt_arrival_dt = pd.to_datetime(stm_df['realtime_arrival_time'], origin='unix', unit='ms', utc=True)

In [44]:
# Round arrival time to the nearest hour
stm_df['rounded_arrival_dt'] = rt_arrival_dt.dt.round('h')

In [45]:
# Format time to match weather data
stm_df['time'] = stm_df['rounded_arrival_dt'].dt.strftime('%Y-%m-%dT%H:%M')
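
The weather join only works if the formatted key matches the `time` strings in the weather file exactly. A small sketch, using hypothetical arrival times, of what rounding and formatting produce, with flooring shown for comparison:

```python
import pandas as pd

# Hypothetical UTC arrival times
arrival_dt = pd.Series(pd.to_datetime(['2025-04-24 13:20:00', '2025-04-24 13:50:00'], utc=True))

# Rounding moves 13:50 to the next hour, flooring keeps it in the current one
rounded = arrival_dt.dt.round('h')
floored = arrival_dt.dt.floor('h')

# Same string format as the merge key built above
keys = rounded.dt.strftime('%Y-%m-%dT%H:%M')
print(keys.tolist())
```

These keys are formatted in UTC. Whether the weather file is also keyed in UTC is not visible from this notebook, so that is worth checking before merging.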

In [46]:
# Merge STM data with historical weather
stm_weather_df = pd.merge(left=stm_df, right=weather_df, how='left', on='time')

In [47]:
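# Flag rows that are missing any weather value
weather_cols = ['temperature', 'precipitation', 'windspeed', 'weathercode']
null_weather_mask = stm_weather_df[weather_cols].isna().any(axis=1)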

In [48]:
# Get proportion of rows with null weather
print(f'{(null_weather_mask.sum() / len(stm_weather_df)):.2%}')

100.00%


In [49]:
# Separate null and non null rows
not_null_df = stm_weather_df[~null_weather_mask]
null_df = stm_weather_df[null_weather_mask]

In [50]:
# Fetch forecast weather
start_date = null_df['rounded_arrival_dt'].min().strftime('%Y-%m-%d')
end_date = null_df['rounded_arrival_dt'].max().strftime('%Y-%m-%d')

weather_list = fetch_weather(start_date=start_date, end_date=end_date, forecast=True)
weather_df = pd.DataFrame(weather_list)

In [51]:
# Merge null weather dataframe with forecast
null_df = null_df.drop(['temperature', 'precipitation', 'windspeed', 'weathercode'], axis=1)
null_df = pd.merge(left=null_df, right=weather_df, how='inner', on='time')

In [52]:
# Merge null and non null weather dataframes
stm_weather_df = pd.concat([not_null_df, null_df]).reset_index()
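
An alternative to splitting and concatenating is to merge both weather sources and fill the archive gaps from the forecast column by column. This is only a sketch on toy frames with hypothetical values. Unlike the inner merge above, it keeps rows that neither source covers, and it preserves the original row order:

```python
import pandas as pd

weather_cols = ['temperature', 'precipitation']

# Toy frames (hypothetical values)
trips = pd.DataFrame({
    'trip_id': [1, 2, 3],
    'time': ['2025-04-24T13:00', '2025-04-24T14:00', '2025-04-24T15:00'],
})
archive = pd.DataFrame({
    'time': ['2025-04-24T13:00'],
    'temperature': [10.0],
    'precipitation': [0.0],
})
forecast = pd.DataFrame({
    'time': ['2025-04-24T14:00', '2025-04-24T15:00'],
    'temperature': [12.0, 13.0],
    'precipitation': [0.5, 0.0],
})

# Left-merge both sources; forecast columns get a '_forecast' suffix
merged = trips.merge(archive, how='left', on='time')
merged = merged.merge(forecast, how='left', on='time', suffixes=('', '_forecast'))

# Prefer archived values and fall back to the forecast where they are missing
for col in weather_cols:
    merged[col] = merged[col].fillna(merged[f'{col}_forecast'])
merged = merged.drop(columns=[f'{col}_forecast' for col in weather_cols])
print(merged)
```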

### Traffic Data

In [91]:
# Get proportion of duplicates
duplicate_mask = traffic_df.duplicated()
print(f'{(duplicate_mask.sum() / len(traffic_df)):.2%}')

42.68%


In [92]:
# Remove duplicates
traffic_df = traffic_df.drop_duplicates(keep='last').reset_index()

In [None]:
# Convert the STM realtime arrival timestamp to datetime
stm_weather_df['arrival_time_dt'] = pd.to_datetime(stm_weather_df['realtime_arrival_time'], origin='unix', unit='ms', utc=True)

In [97]:
# Convert start_time and end_time to datetime
traffic_df['start_time_dt'] = pd.to_datetime(traffic_df['start_time'], utc=True)
traffic_df['end_time_dt'] = pd.to_datetime(traffic_df['end_time'], utc=True)

In [101]:
# Fill null end times with current time
traffic_df['end_time_dt'] = traffic_df['end_time_dt'].fillna(datetime.now(UTC).replace(microsecond=0))
assert traffic_df['end_time_dt'].isna().sum() == 0

In [114]:
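def calculate_nearby_incidents(trip_update, traffic_df, max_distance=500):
    '''
    Return features describing the nearest traffic incident that is active
    at the arrival time and within max_distance meters of the stop.
    '''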
    trip_time = trip_update['arrival_time_dt']
    stop_coords = (trip_update['stop_lat'], trip_update['stop_lon'])
    
    # Filter active incidents
    active_incidents = traffic_df[
        (traffic_df['start_time_dt'] <= trip_time) &
        (traffic_df['end_time_dt'] >= trip_time)
    ].copy()
    
    if active_incidents.empty:
        return pd.Series({
            'incident_nearby': 0,
            'nearest_incident_distance': None,
            'incident_category': None,
            'incident_delay': None,
            'incident_delay_magnitude': 0
        })
    
    # Calculate distance
    active_incidents['distance'] = active_incidents.apply(
        lambda row: haversine(stop_coords, (row['latitude'], row['longitude']), unit=Unit.METERS),
        axis=1
    )
    
    # Filter nearby
    nearby_incidents = active_incidents[active_incidents['distance'] <= max_distance]
    
    if nearby_incidents.empty:
        return pd.Series({
            'incident_nearby': 0,
            'nearest_incident_distance': None,
            'incident_category': 0,
            'incident_delay': None,
            'incident_delay_magnitude': 0
        })
    else:
        nearest = nearby_incidents.loc[nearby_incidents['distance'].idxmin()]
        return pd.Series({
            'incident_nearby': 1,
            'nearest_incident_distance': nearest['distance'],
            'incident_category': nearest['category'],
            'incident_delay': nearest['delay'],
            'incident_delay_magnitude': nearest['magnitude_of_delay']
        })
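
The function above computes distances with a row-wise `apply`, which is repeated for every trip update. The distance step can also be sketched with NumPy so that one stop is compared against all incidents at once. The helper `haversine_m`, the Earth radius constant, and the coordinates below are assumptions for illustration, and results may differ slightly from the `haversine` package depending on the radius it uses:

```python
import numpy as np

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius, in meters

def haversine_m(lat1, lon1, lat2, lon2):
    '''Great-circle distance in meters; inputs in degrees, scalars or arrays.'''
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

# Hypothetical stop and incident coordinates
stop_lat, stop_lon = 45.5017, -73.5673
incident_lats = np.array([45.5017, 45.5067, 45.6000])
incident_lons = np.array([-73.5673, -73.5673, -73.5673])

# One call returns the distance from the stop to every incident
distances = haversine_m(stop_lat, stop_lon, incident_lats, incident_lons)
nearby = distances <= 1000
print(distances.round(1), nearby)
```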

In [None]:
# Get traffic columns
traffic_cols = stm_weather_df.apply(
    lambda row: calculate_nearby_incidents(row, traffic_df, max_distance=1000),
    axis=1
)

In [122]:
# Merge the new features back
df = pd.concat([stm_weather_df, traffic_cols], axis=1)

## Export Data

In [123]:
df.info(max_cols=None)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3296 entries, 0 to 3295
Data columns (total 33 columns):
 #   Column                     Non-Null Count  Dtype              
---  ------                     --------------  -----              
 0   index                      3296 non-null   int64              
 1   current_time               3296 non-null   float64            
 2   trip_id                    3296 non-null   int64              
 3   route_id                   3296 non-null   int64              
 4   stop_id                    3296 non-null   int64              
 5   stop_lat                   3296 non-null   float64            
 6   stop_lon                   3296 non-null   float64            
 7   start_date                 3296 non-null   int64              
 8   stop_sequence              3296 non-null   int64              
 9   wheelchair_boarding        3296 non-null   int64              
 10  realtime_arrival_time      3296 non-null   int64              
 11  sche

In [124]:
df.columns

Index(['index', 'current_time', 'trip_id', 'route_id', 'stop_id', 'stop_lat',
       'stop_lon', 'start_date', 'stop_sequence', 'wheelchair_boarding',
       'realtime_arrival_time', 'scheduled_arrival_time', 'vehicle_id',
       'start_time', 'latitude', 'longitude', 'bearing', 'speed',
       'vehicle_stop_status', 'timestamp', 'occupancy_status', 'time',
       'rounded_arrival_dt', 'temperature', 'precipitation', 'windspeed',
       'weathercode', 'arrival_time_dt', 'incident_nearby',
       'nearest_incident_distance', 'incident_category', 'incident_delay',
       'incident_delay_magnitude'],
      dtype='object')

In [125]:
# Keep relevant columns
# TODO: Add more incident features if needed
df = df[[
    'current_time', 'trip_id', 'vehicle_id', 'vehicle_stop_status',
    'occupancy_status', 'route_id', 'stop_id', 'stop_lat', 'stop_lon',
    'stop_sequence', 'wheelchair_boarding', 'realtime_arrival_time',
    'scheduled_arrival_time', 'temperature', 'precipitation', 'windspeed',
    'weathercode', 'incident_nearby', 'incident_category',
    # 'nearest_incident_distance', 'incident_delay', 'incident_delay_magnitude'
]]

In [127]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3296 entries, 0 to 3295
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   current_time            3296 non-null   float64
 1   trip_id                 3296 non-null   int64  
 2   vehicle_id              3296 non-null   int64  
 3   vehicle_stop_status     3296 non-null   int64  
 4   occupancy_status        3296 non-null   int64  
 5   route_id                3296 non-null   int64  
 6   stop_id                 3296 non-null   int64  
 7   stop_lat                3296 non-null   float64
 8   stop_lon                3296 non-null   float64
 9   stop_sequence           3296 non-null   int64  
 10  wheelchair_boarding     3296 non-null   int64  
 11  realtime_arrival_time   3296 non-null   int64  
 12  scheduled_arrival_time  3296 non-null   int64  
 13  temperature             3296 non-null   float64
 14  precipitation           3296 non-null   

In [126]:
# Export data to CSV
df.to_csv('../data/stm_weather_merged.csv', index=False)

## End