# STM Transit Delay Data Preparation

## Overview

This notebook merges data from different sources and prepares it for data analysis and preprocessing.

## Data description

### Real-time Trip Updates

`current_time`: Timestamp when the data was fetched from the GTFS Realtime feed, as a Unix timestamp in seconds.<br>
`trip_id`: Unique identifier for the transit trip.<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`start_date`: Start date of the transit trip.<br>
`stop_id`: Unique identifier of a stop.<br>
`arrival_time`, `departure_time`: Real-time arrival and departure times, as Unix timestamps in seconds.<br>
`schedule_relationship`: State of the trip, 0 meaning "scheduled" and 1 meaning "skipped".

### Scheduled STM Trips

`trip_id`: Unique identifier for the transit trip.<br>
`arrival_time`, `departure_time`: Scheduled arrival and departure time.<br>
`stop_id`: Unique identifier of a stop.<br>
`stop_sequence`: Sequence of a stop, for ordering.

### STM Stops

`stop_id`: Unique identifier of a stop.<br>
`stop_code`: Bus stop or metro station number.<br>
`stop_name`: Bus stop or metro station name.<br>
`stop_lat`, `stop_lon`: Stop coordinates.<br>
`stop_url`: Stop web page.<br>
`location_type`: Stop type.<br>
`parent_station`: Parent station (metro station with multiple exits).<br>
`wheelchair_boarding`: Indicates whether the stop is accessible to wheelchair users, 1 being true and 2 being false.

### Real-time Vehicle Positions

`current_time`: Timestamp when the data was fetched from the GTFS Realtime feed, as a Unix timestamp in seconds.<br>
`vehicle_id`: Unique identifier for a vehicle.<br>
`trip_id`: Unique identifier for the transit trip.<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`start_date`: Start date of a transit trip.<br>
`start_time`: Start time of a transit trip.<br>
`latitude`, `longitude`: Vehicle current position.<br>
`bearing`: Direction that the vehicle is facing, from 0 to 360 degrees.<br>
`speed`: Momentary speed measured by the vehicle, in meters per second.<br>
`stop_sequence`: Sequence of a stop, for ordering.<br>
`status`: Vehicle stop status in relation with a stop that it's currently approaching or is at, 1 being "stopped at" and 2 being "in transit to".<br>
`timestamp`: Timestamp when STM updated the data, as a Unix timestamp in seconds.<br>
`occupancy_status`: Degree of passenger occupancy, ranging from 1 (empty) to 7 (not accepting passengers).

### Weather Archive and Forecast

`time`: Date and hour of the weather observation.<br>
`temperature`: Air temperature at 2 meters above ground, in Celsius.<br>
`precipitation`: Total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters.<br>
`windspeed`: Wind speed at 10 meters above ground, in kilometers per hour.<br>
`weathercode`: World Meteorological Organization (WMO) code, see [Open-Meteo API documentation](https://open-meteo.com/en/docs#weather_variable_documentation) for the list.

### Traffic Incidents

`category`: Category of the incident.<br>
`start_time`: Start time of the incident, in ISO8601 format.<br>
`end_time`: End time of the incident, in ISO8601 format.<br>
`length`: Length of the incident in meters.<br>
`delay`: Delay in seconds caused by the incident (except road closures).<br>
`magnitude_of_delay`: Severity of the delay, ranging from 0 to 4 (minor to major).<br>
`last_report_time`: Date and time when the incident was last reported, in ISO8601 format.<br>
`latitude`, `longitude`: Coordinates of the incident.

## Imports

In [1]:
from datetime import datetime, timedelta, timezone
from haversine import haversine, Unit
import numpy as np
import pandas as pd
import sys

In [2]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import fetch_weather, LOCAL_TIMEZONE

In [3]:
# Import data
trips_df = pd.read_csv('../data/api/fetched_stm_trip_updates.csv', low_memory=False)
schedules_df = pd.read_csv('../data/download/stop_times_2025-04-30.txt')
stops_df = pd.read_csv('../data/download/stops_2025-04-30.txt')
positions_df = pd.read_csv('../data/api/fetched_stm_vehicle_positions.csv', low_memory=False)
weather_df = pd.read_csv('../data/api/fetched_historical_weather.csv')
traffic_df = pd.read_csv('../data/api/fetched_traffic.csv')

## Merge Data

### Schedules and stops

In [4]:
# Sort values by stop sequence
schedules_df = schedules_df.sort_values(by=['trip_id', 'stop_sequence'])

In [5]:
# Reset stop sequences (some stops might be missing)
schedules_df['stop_sequence'] = schedules_df.groupby('trip_id').cumcount() + 1

In [6]:
# Add trip progress (vehicles further along the trip are more likely to be delayed)
total_stops = schedules_df.groupby('trip_id')['stop_id'].transform('count')
schedules_df['trip_progress'] = schedules_df['stop_sequence'] / total_stops

In [7]:
# Get distribution of trip progress
schedules_df['trip_progress'].describe()

count    6.629625e+06
mean     5.139220e-01
std      2.887289e-01
min      8.547009e-03
25%      2.631579e-01
50%      5.142857e-01
75%      7.647059e-01
max      1.000000e+00
Name: trip_progress, dtype: float64

In [8]:
# Merge schedules and stops
schedules_stops_df = pd.merge(left=schedules_df, right=stops_df, how='inner', left_on='stop_id', right_on='stop_code')

In [9]:
schedules_stops_df.columns

Index(['trip_id', 'arrival_time', 'departure_time', 'stop_id_x',
       'stop_sequence', 'trip_progress', 'stop_id_y', 'stop_code', 'stop_name',
       'stop_lat', 'stop_lon', 'stop_url', 'location_type', 'parent_station',
       'wheelchair_boarding'],
      dtype='object')

In [10]:
# Rename stop id and drop other stop id columns
schedules_stops_df = schedules_stops_df.rename(columns={'stop_id_x': 'stop_id'})
schedules_stops_df = schedules_stops_df.drop(['stop_id_y', 'stop_code'], axis=1)

In [11]:
# Get coordinates of previous stop
schedules_stops_df = schedules_stops_df.sort_values(by=['trip_id', 'stop_sequence'])
schedules_stops_df['prev_lat'] = schedules_stops_df.groupby('trip_id')['stop_lat'].shift(1)
schedules_stops_df['prev_lon'] = schedules_stops_df.groupby('trip_id')['stop_lon'].shift(1)

In [12]:
# Make sure the null values are from first stops
prev_null_mask = (schedules_stops_df['prev_lat'].isna()) | (schedules_stops_df['prev_lon'].isna())
first_stop_mask = schedules_stops_df['stop_sequence'] == 1
assert prev_null_mask.sum() == first_stop_mask.sum()

In [13]:
# Get distance from previous stop
schedules_stops_df['stop_distance'] = schedules_stops_df.apply(
  lambda row: haversine((row['prev_lat'], row['prev_lon']), (row['stop_lat'], row['stop_lon']), unit=Unit.METERS),
  axis=1
)

In [14]:
# Replace null distances by zero (first stop of the trip)
schedules_stops_df['stop_distance'] = schedules_stops_df['stop_distance'].fillna(0)
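
Row-wise `apply` calls `haversine` once per row, which may be slow on millions of schedule rows. As a hedged alternative, the sketch below computes the same great-circle distance on whole columns with NumPy. The helper name `haversine_m`, the radius constant, and the toy coordinates are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

# Mean Earth radius in meters (assumed constant; the haversine package may use a slightly different value)
EARTH_RADIUS_M = 6_371_008.8

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters; accepts scalars, NumPy arrays, or pandas Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

# Toy coordinates: the second pair has no previous stop, so its distance stays NaN
prev_lat = np.array([45.0, np.nan])
prev_lon = np.array([-73.0, np.nan])
stop_lat = np.array([46.0, 45.5])
stop_lon = np.array([-73.0, -73.6])

distances = haversine_m(prev_lat, prev_lon, stop_lat, stop_lon)
print(distances)
```

In the notebook, the call would take the `prev_lat`, `prev_lon`, `stop_lat` and `stop_lon` columns directly, and the missing distances of first stops could still be filled with 0 afterwards.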

### Realtime and Scheduled Trips

In [15]:
# Convert route_id to integer
trips_df['route_id'] = trips_df['route_id'].str.extract(r'(\d+)')
trips_df['route_id'] = trips_df['route_id'].astype('int64')

In [16]:
# Sort trips
trips_df = trips_df.sort_values(by=['current_time', 'trip_id', 'route_id', 'arrival_time'])

In [17]:
# Get proportion of duplicates
subset = ['start_date', 'trip_id', 'route_id', 'stop_id']
duplicate_mask = trips_df.duplicated(subset=subset)
print(f'{duplicate_mask.mean():.2%}')

26.27%


In [18]:
# Remove duplicates
trips_df = trips_df.drop_duplicates(subset=subset, keep='last') # keep latest update
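
`keep='last'` retains the latest update only because the rows were sorted by `current_time` beforehand. The toy example below illustrates that interaction; the values are made up for demonstration.

```python
import pandas as pd

# Two fetches of the same trip/stop pair, plus one unrelated row (toy values)
updates = pd.DataFrame({
    'current_time': [100, 200, 100],
    'trip_id': [1, 1, 2],
    'stop_id': [10, 10, 20],
    'arrival_time': [1000, 1030, 2000],
})

# Sort by fetch time so that the last duplicate is the most recent one
latest = (
    updates
    .sort_values(by='current_time')
    .drop_duplicates(subset=['trip_id', 'stop_id'], keep='last')
)
print(latest)
```

For trip 1, only the row fetched at `current_time` 200 remains.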

In [20]:
# Rename arrival and departure time
trips_df = trips_df.rename(columns={'arrival_time': 'rt_arrival_time', 'departure_time': 'rt_departure_time'})

In [21]:
# Merge trip updates with schedule
stm_trips_df = pd.merge(left=trips_df, right=schedules_stops_df, how='inner', on=['trip_id', 'stop_id'])

In [22]:
stm_trips_df.columns

Index(['current_time', 'trip_id', 'route_id', 'start_date', 'stop_id',
       'rt_arrival_time', 'rt_departure_time', 'schedule_relationship',
       'arrival_time', 'departure_time', 'stop_sequence', 'trip_progress',
       'stop_name', 'stop_lat', 'stop_lon', 'stop_url', 'location_type',
       'parent_station', 'wheelchair_boarding', 'prev_lat', 'prev_lon',
       'stop_distance'],
      dtype='object')

In [23]:
# Convert start_date to datetime
stm_trips_df['start_date_dt'] = pd.to_datetime(stm_trips_df['start_date'], format='%Y%m%d')

In [24]:
def parse_gtfs_time(start_date:pd.Timestamp, stop_time:str) -> pd.Timestamp:
	'''
	Converts a GTFS time string (e.g., '25:30:00') to a datetime
	by adding it as an offset to the trip start date. Hours can
	exceed 24 for trips that run past midnight.
	'''
	hours, minutes, seconds = map(int, stop_time.split(':'))
	total_seconds = hours * 3600 + minutes * 60 + seconds

	parsed_time = start_date + timedelta(seconds=total_seconds)
	return parsed_time
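
To illustrate why the time string is treated as an offset rather than a clock time, here is a minimal standalone sketch using only the standard library. It mirrors the function above but is rewritten for plain `datetime` objects, and the service day is an arbitrary example.

```python
from datetime import datetime, timedelta

def parse_gtfs_time(start_date: datetime, stop_time: str) -> datetime:
    # GTFS allows hours above 23 for trips that continue after midnight
    hours, minutes, seconds = map(int, stop_time.split(':'))
    return start_date + timedelta(hours=hours, minutes=minutes, seconds=seconds)

service_day = datetime(2025, 4, 27)
after_midnight = parse_gtfs_time(service_day, '25:30:00')
print(after_midnight)  # 2025-04-28 01:30:00
```

A stop scheduled at `25:30:00` therefore lands on the following calendar day, which a plain time parser would reject.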

In [26]:
# Parse GTFS scheduled arrival and departure time
arrival_parsed_time = stm_trips_df.apply(lambda row: parse_gtfs_time(row['start_date_dt'], row['arrival_time']), axis=1)
departure_parsed_time = stm_trips_df.apply(lambda row: parse_gtfs_time(row['start_date_dt'], row['departure_time']), axis=1)

In [27]:
# Convert scheduled arrival and departure time to UTC datetime
sch_arrival_time_local = arrival_parsed_time.dt.tz_localize(LOCAL_TIMEZONE)
sch_departure_time_local = departure_parsed_time.dt.tz_localize(LOCAL_TIMEZONE)

stm_trips_df['sch_arrival_time'] = sch_arrival_time_local.dt.tz_convert(timezone.utc)
stm_trips_df['sch_departure_time'] = sch_departure_time_local.dt.tz_convert(timezone.utc)

In [None]:
# Convert realtime arrival and departure time to UTC datetime
stm_trips_df['rt_arrival_time'] = pd.to_datetime(stm_trips_df['rt_arrival_time'] * 1000, origin='unix', unit='ms', utc=True)
stm_trips_df['rt_departure_time'] = pd.to_datetime(stm_trips_df['rt_departure_time'] * 1000, origin='unix', unit='ms', utc=True)

In [29]:
stm_trips_df.columns

Index(['current_time', 'trip_id', 'route_id', 'start_date', 'stop_id',
       'rt_arrival_time', 'rt_departure_time', 'schedule_relationship',
       'arrival_time', 'departure_time', 'stop_sequence', 'trip_progress',
       'stop_name', 'stop_lat', 'stop_lon', 'stop_url', 'location_type',
       'parent_station', 'wheelchair_boarding', 'prev_lat', 'prev_lon',
       'stop_distance', 'start_date_dt', 'sch_arrival_time',
       'sch_departure_time'],
      dtype='object')

In [30]:
# Remove the raw scheduled time strings (replaced by sch_arrival_time and sch_departure_time)
stm_trips_df = stm_trips_df.drop(['arrival_time', 'departure_time'], axis=1)

In [31]:
# For first stops, calculate delay with departure time
# For the rest, calculate delay with arrival time
stm_trips_df['delay'] = np.where(
	stm_trips_df['stop_sequence'] == 1,
	(stm_trips_df['rt_departure_time'] - stm_trips_df['sch_departure_time']) / pd.Timedelta(seconds=1),
	(stm_trips_df['rt_arrival_time'] - stm_trips_df['sch_arrival_time']) / pd.Timedelta(seconds=1)
)

In [32]:
# Get distribution
stm_trips_df['delay'].describe()

count    3.014556e+06
mean    -5.571049e+07
std      3.068682e+08
min     -1.746237e+09
25%      0.000000e+00
50%      0.000000e+00
75%      2.800000e+01
max      5.458500e+04
Name: delay, dtype: float64

In [33]:
# The 0 dates have been converted to 1970-01-01
# If realtime arrival or departure year is 1970, impute delay with 0
zero_times = (stm_trips_df['rt_arrival_time'].dt.year == 1970) | (stm_trips_df['rt_departure_time'].dt.year == 1970)
stm_trips_df.loc[zero_times, 'delay'] = 0
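
An alternative to detecting the year 1970 after the fact is to mark zero timestamps as missing at conversion time. The sketch below shows this on a toy Series. It is an illustration under the assumption that a raw value of 0 means "no time provided", and it is not the approach used in this notebook.

```python
import pandas as pd

# Toy real-time arrival values in Unix seconds; 0 stands for a missing time
rt_arrival = pd.Series([0, 1745793882])

# Convert to UTC datetimes, then blank out the rows whose raw value was 0
rt_arrival_dt = pd.to_datetime(rt_arrival, unit='s', utc=True).mask(rt_arrival == 0)
print(rt_arrival_dt)
```

With missing times stored as `NaT`, a delay computed from them would be `NaN` and could then be imputed explicitly.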

In [34]:
stm_trips_df['delay'].describe()

count    3.014556e+06
mean     6.808611e+01
std      4.128609e+02
min     -1.341200e+04
25%      0.000000e+00
50%      0.000000e+00
75%      7.000000e+00
max      5.458500e+04
Name: delay, dtype: float64

In [35]:
# Get proportion of trips that are on time
on_time_mask = (stm_trips_df['delay'] >= -60) & (stm_trips_df['delay'] <= 180)
print(f'{on_time_mask.mean():.2%}')

87.12%


According to [STM](https://www.stm.info/en/info/networks/bus-network-and-schedules-enlightened), a vehicle is considered on time if it arrives at the stop no more than 1 minute before or 3 minutes after the scheduled time. This corresponds to a delay between -60 and 180 seconds.
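
The on-time window can also be expressed as a small labelling helper. This is a sketch only: the function name `classify_punctuality` and the labels are hypothetical, while the thresholds are the ones used in the mask above.

```python
def classify_punctuality(delay_seconds: float) -> str:
    """Labels a delay in seconds using the -60 to 180 second on-time window."""
    if delay_seconds < -60:
        return 'early'
    if delay_seconds > 180:
        return 'late'
    return 'on_time'

print(classify_punctuality(-90))  # early
print(classify_punctuality(28))   # on_time
print(classify_punctuality(240))  # late
```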

### Vehicle Positions

In [36]:
# Prefix the vehicle columns to tell them apart from the stop columns
positions_df = positions_df.rename(columns={
  'latitude': 'vehicle_lat',
  'longitude': 'vehicle_lon',
  'status': 'vehicle_status',
  'bearing': 'vehicle_bearing',
  'speed': 'vehicle_speed'
})

In [37]:
# Convert vehicle timestamp to datetime
positions_df['vehicle_dt'] = pd.to_datetime(positions_df['timestamp'] * 1000, origin='unix', unit='ms', utc=True)

In [38]:
# Sort values
subset = ['trip_id', 'route_id', 'start_date', 'stop_sequence']
positions_df = positions_df.sort_values(by=subset)

In [39]:
# Get proportion of duplicates (recompute the mask for the positions data)
duplicate_mask = positions_df.duplicated(subset=subset)
print(f'{duplicate_mask.mean():.2%}')


In [40]:
# Remove duplicates
positions_df = positions_df.drop_duplicates(subset=subset)

In [41]:
# Merge position with the rest of the STM data
stm_df = pd.merge(left=stm_trips_df, right=positions_df, how='inner', on=['trip_id', 'route_id', 'start_date', 'stop_sequence'])

In [42]:
stm_df.columns

Index(['current_time_x', 'trip_id', 'route_id', 'start_date', 'stop_id',
       'rt_arrival_time', 'rt_departure_time', 'schedule_relationship',
       'stop_sequence', 'trip_progress', 'stop_name', 'stop_lat', 'stop_lon',
       'stop_url', 'location_type', 'parent_station', 'wheelchair_boarding',
       'prev_lat', 'prev_lon', 'stop_distance', 'start_date_dt',
       'sch_arrival_time', 'sch_departure_time', 'delay', 'current_time_y',
       'vehicle_id', 'start_time', 'vehicle_lat', 'vehicle_lon',
       'vehicle_bearing', 'vehicle_speed', 'vehicle_status', 'timestamp',
       'occupancy_status', 'vehicle_dt'],
      dtype='object')

In [43]:
# Calculate distance between the vehicle and the previous stop
stm_df['vehicle_distance'] = stm_df.apply(
	lambda row: haversine((row['prev_lat'], row['prev_lon']), (row['vehicle_lat'], row['vehicle_lon']), unit=Unit.METERS),
	axis=1
)

In [44]:
# Calculate relative distance
stm_df['vehicle_rel_distance'] = stm_df['vehicle_distance'] / stm_df['stop_distance']

In [45]:
# Get distribution of relative distance
stm_df['vehicle_rel_distance'].describe()

count    200387.000000
mean          1.074166
std         121.330806
min           0.000478
25%           0.127664
50%           0.647989
75%           0.998137
max       39568.770182
Name: vehicle_rel_distance, dtype: float64

In [46]:
# Get null values
null_rel_dist_mask = stm_df['vehicle_rel_distance'].isna()
stm_df[null_rel_dist_mask]

Unnamed: 0,current_time_x,trip_id,route_id,start_date,stop_id,rt_arrival_time,rt_departure_time,schedule_relationship,stop_sequence,trip_progress,...,vehicle_lat,vehicle_lon,vehicle_bearing,vehicle_speed,vehicle_status,timestamp,occupancy_status,vehicle_dt,vehicle_distance,vehicle_rel_distance
1,1.745791e+09,283213434,968,20250427,60296,1970-01-01 00:00:00+00:00,2025-04-27 23:00:00+00:00,0,1,0.333333,...,45.514233,-73.684097,0.0,0.00000,1,1745793882,1,2025-04-27 22:44:42+00:00,,
217,1.745791e+09,283854289,189,20250427,53754,1970-01-01 00:00:00+00:00,2025-04-27 22:28:10+00:00,0,1,0.015625,...,45.596958,-73.535728,113.0,0.00000,1,1745792090,1,2025-04-27 22:14:50+00:00,,
273,1.745791e+09,283212926,207,20250427,57821,1970-01-01 00:00:00+00:00,2025-04-27 22:59:00+00:00,0,1,0.023810,...,45.465130,-73.831184,146.0,7.50006,1,1745793885,1,2025-04-27 22:44:45+00:00,,
275,1.745791e+09,286580770,29,20250427,53035,1970-01-01 00:00:00+00:00,2025-04-27 22:59:00+00:00,0,1,0.038462,...,45.546761,-73.551224,0.0,0.00000,1,1745793890,1,2025-04-27 22:44:50+00:00,,
276,1.745791e+09,283212651,201,20250427,60889,1970-01-01 00:00:00+00:00,2025-04-27 22:59:00+00:00,0,1,0.014706,...,45.466412,-73.831253,0.0,0.00000,1,1745793896,1,2025-04-27 22:44:56+00:00,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
225741,1.746230e+09,285284501,185,20250502,61881,1970-01-01 00:00:00+00:00,2025-05-03 00:01:00+00:00,0,1,0.020408,...,45.595333,-73.509613,166.0,0.00000,1,1746229543,1,2025-05-02 23:45:43+00:00,,
225742,1.746230e+09,284740889,193,20250502,54058,1970-01-01 00:00:00+00:00,2025-05-03 00:11:00+00:00,0,1,0.020833,...,45.534592,-73.643776,0.0,0.00000,1,1746230444,2,2025-05-03 00:00:44+00:00,,
225743,1.746230e+09,284727277,197,20250502,51661,1970-01-01 00:00:00+00:00,2025-05-03 00:04:00+00:00,0,1,0.025000,...,45.530968,-73.597610,210.0,0.00000,1,1746230428,1,2025-05-03 00:00:28+00:00,,
225744,1.746230e+09,284740168,69,20250502,54223,1970-01-01 00:00:00+00:00,2025-05-03 00:12:00+00:00,0,1,0.015152,...,45.617542,-73.606514,0.0,0.00000,1,1746230432,1,2025-05-03 00:00:32+00:00,,


In [47]:
stm_df[null_rel_dist_mask]['stop_sequence'].value_counts()

stop_sequence
1    25360
Name: count, dtype: int64

In [48]:
# Replace the null relative distance by 1, because it's the first stop
stm_df['vehicle_rel_distance'] = stm_df['vehicle_rel_distance'].fillna(1)

In [49]:
# Get trips where the relative distance is above 100%
above_one_mask = stm_df['vehicle_rel_distance'] > 1
stm_df.loc[above_one_mask, ['vehicle_status', 'vehicle_speed', 'vehicle_rel_distance']]

Unnamed: 0,vehicle_status,vehicle_speed,vehicle_rel_distance
2,1,2.50002,1.000709
9,1,0.00000,1.000453
11,1,0.00000,1.001372
16,1,0.00000,1.000593
18,1,0.00000,1.000572
...,...,...,...
225681,2,0.00000,1.002725
225683,1,0.00000,1.001234
225690,2,1.66668,1.035759
225695,2,7.50006,3.147961


In [50]:
# Get proportion of trips with relative distance above 100%
print(f'{above_one_mask.mean():.2%}')

16.94%


In [51]:
# Clip the relative distance to 100% if the vehicle status is 1 (stopped at)
stopped_mask = stm_df['vehicle_status'] == 1
stm_df.loc[above_one_mask & stopped_mask, 'vehicle_rel_distance'] = 1

In [52]:
# Impute the remaining values (vehicle not stopped) with 0.5 (midway between the previous and current stop)
stm_df.loc[above_one_mask & ~stopped_mask, 'vehicle_rel_distance'] = 0.5

In [54]:
# Set the vehicle speed to 0 if the vehicle is stopped
stm_df.loc[stopped_mask, 'vehicle_speed'] = 0

In [55]:
# Get new distribution of relative distance
stm_df['vehicle_rel_distance'].describe()

count    225747.000000
mean          0.596245
std           0.377892
min           0.000478
25%           0.165817
50%           0.664172
75%           0.999716
max           1.000000
Name: vehicle_rel_distance, dtype: float64

### STM and Weather

In [57]:
# Convert time string to datetime
time_dt = pd.to_datetime(weather_df['time'], utc=True)

In [59]:
# Calculate dates for weather forecast
last_day_weather = time_dt.max()
start_date = last_day_weather + timedelta(days=1)
end_date = stm_df['rt_arrival_time'].max()

In [60]:
# Fetch forecast weather
start_date_str = start_date.strftime('%Y-%m-%d')
end_date_str = end_date.strftime('%Y-%m-%d')

forecast_list = fetch_weather(start_date=start_date_str, end_date=end_date_str, forecast=True)
forecast_df = pd.DataFrame(forecast_list)

In [61]:
# Merge archive and forecast weather
weather_df = pd.concat([weather_df, forecast_df], ignore_index=True)

In [63]:
# Round arrival time to the nearest hour
rounded_arrival_dt = stm_df['rt_arrival_time'].dt.round('h')

In [64]:
# Format time to match weather data
stm_df['time'] = rounded_arrival_dt.dt.strftime('%Y-%m-%dT%H:%M')

In [65]:
# Merge STM with weather
stm_weather_df = pd.merge(left=stm_df, right=weather_df, how='inner', on='time')
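
An inner merge silently drops any row whose `time` key has no matching weather record. As a hedged check, the sketch below uses a left merge with pandas' `indicator` and `validate` options to count unmatched rows on toy data. The frames and values are made up for illustration.

```python
import pandas as pd

# Toy frames: the second STM row has no matching weather hour
stm = pd.DataFrame({'time': ['2025-04-27T22:00', '2025-04-27T23:00'], 'delay': [30, 0]})
weather = pd.DataFrame({'time': ['2025-04-27T22:00'], 'temperature': [12.5]})

# validate raises if a weather hour appears more than once; indicator records the match status
checked = pd.merge(stm, weather, how='left', on='time', indicator=True, validate='many_to_one')
unmatched = (checked['_merge'] == 'left_only').sum()
print(f'{unmatched} row(s) without weather data')
```

Running the same check on `stm_df` and `weather_df` would show how many rows the inner merge discards.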

### Traffic Data

In [66]:
# Get proportion of duplicates
duplicate_mask = traffic_df.duplicated()
print(f'{duplicate_mask.mean():.2%}')

33.93%


In [67]:
# Remove duplicates
traffic_df = traffic_df.drop_duplicates(keep='last').reset_index(drop=True)

In [68]:
# Convert traffic start_time and end_time to datetime
traffic_df['start_time'] = pd.to_datetime(traffic_df['start_time'], utc=True)
traffic_df['end_time'] = pd.to_datetime(traffic_df['end_time'], utc=True)

In [69]:
# Sort by date
traffic_df = traffic_df.sort_values(by='start_time').reset_index(drop=True)

In [70]:
# Fill null end times with current time (assuming the incident is still ongoing)
traffic_df['end_time'] = traffic_df['end_time'].fillna(datetime.now(timezone.utc).replace(microsecond=0))
assert traffic_df['end_time'].isna().sum() == 0

In [71]:
# Build traffic cache (for every 15 min interval)
def build_traffic_cache(traffic_df:pd.DataFrame) -> dict:
	'''
	Groups the incidents by the 15-minute interval in which they start.
	Adds a quarter_hour column to traffic_df.
	'''
	traffic_cache = {}
	traffic_df['quarter_hour'] = traffic_df['start_time'].dt.floor('15min')

	for (quarter_hour, group) in traffic_df.groupby('quarter_hour'):
		traffic_cache[quarter_hour] = group.copy()

	return traffic_cache

Many trip updates fall on the same day, and often in the same hour, so filtering the active traffic incidents for each trip individually would repeat the same work and take a long time on a large dataset. Because traffic incidents tend to stay stable over minutes or hours, they are cached in 15-minute intervals. The cache is keyed on each incident's start time, so a lookup returns only the incidents that started in the same 15-minute interval as the vehicle timestamp.
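
One way to also capture incidents that started in an earlier interval and are still ongoing would be to register each incident under every interval it overlaps. The sketch below illustrates the idea. The function name `build_overlap_cache` and the toy incidents are hypothetical, and this is not the cache used in this notebook.

```python
import pandas as pd

def build_overlap_cache(incidents: pd.DataFrame, freq: str = '15min') -> dict:
    """Maps each interval start to the incidents active at some point during that interval."""
    slot_to_rows = {}
    for idx, row in incidents.iterrows():
        # Every interval from the one containing start_time to the one containing end_time
        slots = pd.date_range(row['start_time'].floor(freq), row['end_time'].floor(freq), freq=freq)
        for slot in slots:
            slot_to_rows.setdefault(slot, []).append(idx)
    return {slot: incidents.loc[rows] for slot, rows in slot_to_rows.items()}

# Toy incidents with made-up categories and times
incidents = pd.DataFrame({
    'category': ['jam', 'road_works'],
    'start_time': pd.to_datetime(['2025-04-27T10:05:00Z', '2025-04-27T10:20:00Z'], utc=True),
    'end_time': pd.to_datetime(['2025-04-27T10:40:00Z', '2025-04-27T10:25:00Z'], utc=True),
})

overlap_cache = build_overlap_cache(incidents)
slot = pd.Timestamp('2025-04-27 10:15', tz='UTC')
print(overlap_cache[slot]['category'].tolist())  # ['jam', 'road_works']
```

The trade-off is memory: a long incident appears in many intervals, and incidents whose missing end time was filled with the current time would span every interval up to now.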

In [72]:
def calculate_nearby_incidents(trip_update:pd.Series, traffic_cache:dict, max_distance:int=500) -> pd.Series:
	'''
	Returns the nearest active incident within max_distance meters
	of the vehicle position, or null details if there is none.
	'''
	# Default result when no incident qualifies
	no_incident = pd.Series({
		'incident_nearby': 0,
		'nearest_incident_distance': None,
		'incident_category': None,
		'incident_delay': None,
		'incident_delay_magnitude': None
	})

	trip_datetime = trip_update['vehicle_dt']
	vehicle_coords = (trip_update['vehicle_lat'], trip_update['vehicle_lon'])

	# Get cached incidents for the vehicle's 15-minute interval
	trip_quarter_hour = trip_datetime.floor('15min')
	quarter_hour_incidents = traffic_cache.get(trip_quarter_hour)

	# Stop if there are no incidents for that interval
	if quarter_hour_incidents is None or quarter_hour_incidents.empty:
		return no_incident

	# Filter for incidents active at the vehicle timestamp
	active_incidents = quarter_hour_incidents[
		(quarter_hour_incidents['start_time'] <= trip_datetime) &
		(quarter_hour_incidents['end_time'] >= trip_datetime)
	].copy()

	if active_incidents.empty:
		return no_incident

	# Calculate distance between the vehicle and each incident
	active_incidents['distance'] = active_incidents.apply(
		lambda row: haversine(vehicle_coords, (row['latitude'], row['longitude']), unit=Unit.METERS),
		axis=1
	)

	# Filter nearby
	nearby_incidents = active_incidents[active_incidents['distance'] <= max_distance]

	if nearby_incidents.empty:
		return no_incident

	# Keep the details of the closest incident
	nearest = nearby_incidents.loc[nearby_incidents['distance'].idxmin()]
	return pd.Series({
		'incident_nearby': 1,
		'nearest_incident_distance': nearest['distance'],
		'incident_category': nearest['category'],
		'incident_delay': nearest['delay'],
		'incident_delay_magnitude': nearest['magnitude_of_delay']
	})

In [73]:
# Get traffic columns (get incidents within 1.5 km)
traffic_cache = build_traffic_cache(traffic_df)
traffic_cols = stm_weather_df.apply(lambda row: calculate_nearby_incidents(row, traffic_cache, 1500), axis=1)

Traffic in the surrounding area (roughly 1 to 2 km) can delay a vehicle even when the incident is not at its exact position. This is why incidents within 1.5 km of the vehicle are taken into account.

In [74]:
# Merge the traffic
df = pd.concat([stm_weather_df, traffic_cols], axis=1)

## Export Data

In [75]:
# Get columns with two values
two_values = df.loc[:, df.nunique() == 2]
two_values.columns

Index(['wheelchair_boarding', 'vehicle_status', 'incident_nearby'], dtype='object')

In [76]:
print(df['wheelchair_boarding'].value_counts())
print(df['vehicle_status'].value_counts())
print(df['incident_nearby'].value_counts())

wheelchair_boarding
1    189689
2     10927
Name: count, dtype: int64
vehicle_status
2    155274
1     45342
Name: count, dtype: int64
incident_nearby
0.0    190722
1.0      9894
Name: count, dtype: int64


In [77]:
# Convert columns with 2 unique values to 0/1 integer flags
df['wheelchair_boarding'] = (df['wheelchair_boarding'] == 1).astype('int64')
df['vehicle_in_transit'] = (df['vehicle_status'] == 2).astype('int64')
df['incident_nearby'] = df['incident_nearby'].astype('int64')

In [78]:
# Remove columns with constant values or with more than 50% missing values
df = df.loc[:, (df.nunique() > 1) & (df.isna().mean() < 0.5)]
df.columns

Index(['current_time_x', 'trip_id', 'route_id', 'start_date', 'stop_id',
       'rt_arrival_time', 'rt_departure_time', 'stop_sequence',
       'trip_progress', 'stop_name', 'stop_lat', 'stop_lon', 'stop_url',
       'wheelchair_boarding', 'prev_lat', 'prev_lon', 'stop_distance',
       'start_date_dt', 'sch_arrival_time', 'sch_departure_time', 'delay',
       'current_time_y', 'vehicle_id', 'start_time', 'vehicle_lat',
       'vehicle_lon', 'vehicle_bearing', 'vehicle_speed', 'vehicle_status',
       'timestamp', 'occupancy_status', 'vehicle_dt', 'vehicle_distance',
       'vehicle_rel_distance', 'time', 'temperature', 'precipitation',
       'windspeed', 'weathercode', 'incident_nearby', 'vehicle_in_transit'],
      dtype='object')

**Columns to drop**

`current_time_x`, `current_time_y`: These only record when the data was collected.<br>
`start_date`, `start_date_dt`, `start_time`: The real-time and scheduled datetimes have already been calculated.<br>
`stop_sequence`: The trip progress has been calculated from it.<br>
`stop_name`: The stop is already identified by `stop_id`.<br>
`stop_url`: A URL is not useful for data analysis.<br>
`prev_lat`, `prev_lon`, `stop_distance`, `vehicle_lat`, `vehicle_lon`, `vehicle_distance`: The relative vehicle distance from the previous stop and the incidents near the vehicle position have already been calculated.<br>
`vehicle_status`: Replaced by the `vehicle_in_transit` column.<br>
`timestamp`, `vehicle_dt`: Only needed to find the incidents active near the vehicle at that time.<br>
`time`: Only needed to merge the weather data.


In [82]:
# Keep relevant columns
df = df[[
	'trip_id',
	'vehicle_id',
	'vehicle_in_transit', # engineered
	'vehicle_rel_distance', # engineered
	'vehicle_bearing',
	'vehicle_speed',
	'occupancy_status',
	'route_id',
	'stop_id',
	'stop_lat',
	'stop_lon',
	'trip_progress', # engineered
	'wheelchair_boarding', # engineered
	'rt_arrival_time', # parsed
	'sch_arrival_time', # parsed
	'delay', # engineered
	'temperature',
	'precipitation',
	'windspeed',
	'weathercode',
	'incident_nearby', # engineered
]]

In [83]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200616 entries, 0 to 200615
Data columns (total 21 columns):
 #   Column                Non-Null Count   Dtype              
---  ------                --------------   -----              
 0   trip_id               200616 non-null  int64              
 1   vehicle_id            200616 non-null  int64              
 2   vehicle_in_transit    200616 non-null  int64              
 3   vehicle_rel_distance  200616 non-null  float64            
 4   vehicle_bearing       200616 non-null  float64            
 5   vehicle_speed         200616 non-null  float64            
 6   occupancy_status      200616 non-null  int64              
 7   route_id              200616 non-null  int64              
 8   stop_id               200616 non-null  int64              
 9   stop_lat              200616 non-null  float64            
 10  stop_lon              200616 non-null  float64            
 11  trip_progress         200616 non-null  float64      

In [84]:
df.isna().sum()

trip_id                 0
vehicle_id              0
vehicle_in_transit      0
vehicle_rel_distance    0
vehicle_bearing         0
vehicle_speed           0
occupancy_status        0
route_id                0
stop_id                 0
stop_lat                0
stop_lon                0
trip_progress           0
wheelchair_boarding     0
rt_arrival_time         0
sch_arrival_time        0
delay                   0
temperature             0
precipitation           0
windspeed               0
weathercode             0
incident_nearby         0
dtype: int64

In [85]:
# Export data to CSV
df.to_csv('../data/stm_weather_traffic_merged.csv', index=False)

## End