# STM Transit Delay Data Preprocessing

This notebook preprocesses STM trip-update data merged with weather data, to prepare it for a tree-based regression model that predicts delays in seconds.

## Data Description

`trip_id` unique identifier of a trip<br>
`vehicle_id` unique identifier of a vehicle<br>
`vehicle_lat` vehicle latitude<br>
`vehicle_lon` vehicle longitude<br>
`vehicle_distance` vehicle distance from the stop<br>
`vehicle_status` status of the vehicle relative to the stop it is currently approaching or stopped at, 1 being stopped at and 2 being in transit to<br>
`vehicle_bearing` direction that the vehicle is facing<br>
`vehicle_speed` momentary speed measured by the vehicle, in meters per second<br>
`stop_sequence` sequence of the stop, for ordering<br>
`occupancy_status` degree of passenger occupancy<br>
`route_id` bus or metro line<br>
`stop_id` stop number<br>
`stop_lat` stop latitude<br>
`stop_lon` stop longitude<br>
`trip_progress` how far along the trip the vehicle is, from 0 to 1<br>
`wheelchair_boarding` indicates whether the stop is accessible to people in wheelchairs, 1 being true and 2 being false<br>
`realtime_arrival_time` actual arrival time, in UTC<br>
`scheduled_arrival_time` planned arrival time, in UTC<br>
`temperature` air temperature at 2 meters above ground, in Celsius<br>
`precipitation` total precipitation (rain, showers, snow) over the preceding hour, in millimeters<br>
`windspeed` wind speed at 10 meters above ground, in km/h<br>
`weathercode` World Meteorological Organization (WMO) code<br>
`incident_nearby` indicates whether an incident happened within 500 meters of the vehicle position

## Imports

In [1]:
import numpy as np
import pandas as pd
import pickle
from sklearn.preprocessing import LabelEncoder
import sys

In [2]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import INCIDENT_CATEGORIES, LOCAL_TIMEZONE, STOP_STATUS, WEATHER_CODES

In [3]:
# Load data
df = pd.read_csv('../data/stm_weather_traffic_merged.csv')

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147305 entries, 0 to 147304
Data columns (total 24 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   trip_id                 147305 non-null  int64  
 1   vehicle_id              147305 non-null  int64  
 2   vehicle_lat             147305 non-null  float64
 3   vehicle_lon             147305 non-null  float64
 4   vehicle_distance        147305 non-null  float64
 5   vehicle_status          147305 non-null  int64  
 6   vehicle_bearing         147305 non-null  float64
 7   vehicle_speed           147305 non-null  float64
 8   occupancy_status        147305 non-null  int64  
 9   route_id                147305 non-null  int64  
 10  stop_id                 147305 non-null  int64  
 11  stop_lat                147305 non-null  float64
 12  stop_lon                147305 non-null  float64
 13  stop_sequence           147305 non-null  int64  
 14  trip_progress       

## Data Preprocessing

### Handle Outliers

In [5]:
# Compute mean and standard deviation
mean_delay = df['delay'].mean()
std_delay = df['delay'].std()

In [6]:
# Filter outliers based on standard deviation
outlier_mask = (df['delay'] < mean_delay - 3 * std_delay) | (df['delay'] > mean_delay + 3 * std_delay)

In [7]:
# Get outliers
df[outlier_mask]

| | trip_id | vehicle_id | vehicle_lat | vehicle_lon | vehicle_distance | vehicle_status | vehicle_bearing | vehicle_speed | occupancy_status | route_id | ... | trip_progress | wheelchair_boarding | realtime_arrival_time | scheduled_arrival_time | delay | temperature | precipitation | windspeed | weathercode | incident_nearby |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 97 | 283584468 | 30214 | 45.539738 | -73.611061 | 177.873373 | 2 | 304.0 | 5.00004 | 1 | 30 | ... | 0.555556 | 1 | 2025-04-27 22:15:52+00:00 | 2025-04-27 21:58:48+00:00 | 1024.0 | 11.0 | 0.0 | 13.3 | 0 | 0.0 |
| 98 | 283584468 | 30214 | 45.556622 | -73.667496 | 58.519512 | 2 | 149.0 | 2.50002 | 1 | 30 | ... | 1.000000 | 1 | 2025-04-27 22:32:04+00:00 | 2025-04-27 22:15:00+00:00 | 1024.0 | 10.7 | 0.0 | 13.4 | 0 | 0.0 |
| 289 | 284215026 | 37067 | 45.496548 | -73.642517 | 7.096642 | 1 | 128.0 | 0.00000 | 1 | 124 | ... | 0.487805 | 1 | 2025-04-27 22:45:04+00:00 | 2025-04-27 22:15:10+00:00 | 1794.0 | 10.7 | 0.0 | 13.4 | 0 | 0.0 |
| 295 | 283585000 | 41071 | 45.517879 | -73.566330 | 5.317782 | 2 | 302.0 | 3.33336 | 1 | 30 | ... | 0.155556 | 0 | 2025-04-27 22:55:37+00:00 | 2025-04-27 22:40:16+00:00 | 921.0 | 10.7 | 0.0 | 13.4 | 0 | 0.0 |
| 297 | 286581685 | 38034 | 45.508125 | -73.532669 | 612.566331 | 2 | 261.0 | 2.50002 | 2 | 777 | ... | 0.200000 | 1 | 2025-04-27 23:06:12+00:00 | 2025-04-27 22:42:00+00:00 | 1452.0 | 10.7 | 0.0 | 13.4 | 0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 146855 | 284779053 | 33820 | 45.495277 | -73.615616 | 101.706857 | 2 | 87.0 | 12.50010 | 2 | 465 | ... | 0.414634 | 1 | 2025-05-01 13:01:52+00:00 | 2025-05-01 12:44:24+00:00 | 1048.0 | 12.2 | 0.0 | 4.6 | 1 | 0.0 |
| 147188 | 284741525 | 30843 | 45.575249 | -73.611328 | 105.571550 | 2 | 122.0 | 0.00000 | 1 | 439 | ... | 0.594595 | 1 | 2025-05-01 13:00:12+00:00 | 2025-05-01 12:44:38+00:00 | 934.0 | 12.2 | 0.0 | 4.6 | 1 | 0.0 |
| 147197 | 284741542 | 32809 | 45.574940 | -73.610626 | 41.327177 | 2 | 121.0 | 0.00000 | 2 | 439 | ... | 0.400000 | 1 | 2025-05-01 13:00:02+00:00 | 2025-05-01 12:43:38+00:00 | 984.0 | 12.2 | 0.0 | 4.6 | 1 | 0.0 |
| 147241 | 285010170 | 28031 | 45.466579 | -73.830933 | 0.057054 | 1 | 55.0 | 0.00000 | 1 | 419 | ... | 0.052632 | 1 | 2025-05-01 13:47:39+00:00 | 2025-05-01 13:01:00+00:00 | 2799.0 | 14.0 | 0.0 | 7.8 | 3 | 0.0 |


In [7]:
# Get proportion of outliers
print(f'{outlier_mask.mean():.2%}')

1.08%


In [9]:
# Remove outliers (copy so that later column assignments do not act on a slice of the original frame)
df = df[~outlier_mask].copy()

In [10]:
# Get new distribution
df['delay'].describe()

count    145659.000000
mean         54.306586
std         135.602415
min        -742.000000
25%           0.000000
50%           0.000000
75%          67.000000
max         883.000000
Name: delay, dtype: float64

After filtering, delays range from about 12 min 22 s early (-742 s) to about 14 min 43 s late (883 s), which seems more reasonable.
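
The three-sigma rule used above can be wrapped in a small helper so that the same threshold is applied consistently. This is a minimal sketch rather than the notebook's own code: `split_sigma_outliers` and the toy data are hypothetical, introduced only for illustration.

```python
import pandas as pd

def split_sigma_outliers(frame, column, n_sigma=3.0):
    """Return (inliers, outliers) using a mean +/- n_sigma * std rule."""
    mean = frame[column].mean()
    std = frame[column].std()
    # Equivalent to (x < mean - n_sigma * std) | (x > mean + n_sigma * std)
    mask = (frame[column] - mean).abs() > n_sigma * std
    return frame[~mask].copy(), frame[mask].copy()

# Toy data: 20 on-time observations and one extreme delay (values are made up).
toy = pd.DataFrame({'delay': [0.0] * 20 + [3000.0]})
inliers, outliers = split_sigma_outliers(toy, 'delay')
print(len(inliers), len(outliers))
```

Because the mean and standard deviation are themselves pulled by extreme values, this rule is best read as a coarse filter.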

In [10]:
df.columns

Index(['trip_id', 'vehicle_id', 'vehicle_lat', 'vehicle_lon',
       'vehicle_distance', 'vehicle_status', 'vehicle_bearing',
       'vehicle_speed', 'occupancy_status', 'route_id', 'stop_id', 'stop_lat',
       'stop_lon', 'stop_sequence', 'trip_progress', 'wheelchair_boarding',
       'realtime_arrival_time', 'scheduled_arrival_time', 'delay',
       'temperature', 'precipitation', 'windspeed', 'weathercode',
       'incident_nearby'],
      dtype='object')

### Encode Datetime

In [None]:
# Convert arrival times
df['realtime_arrival_time'] = pd.to_datetime(df['realtime_arrival_time'], utc=True).dt.tz_convert(LOCAL_TIMEZONE)
df['scheduled_arrival_time'] = pd.to_datetime(df['scheduled_arrival_time'], utc=True).dt.tz_convert(LOCAL_TIMEZONE)

In [None]:
df['realtime_arrival_time']

In [None]:
df['scheduled_arrival_time']

In [13]:
# Convert datetimes to day and hour
df['day'] = df['realtime_arrival_time'].dt.day_of_week
df['hour'] = df['realtime_arrival_time'].dt.hour

df['sch_day'] = df['scheduled_arrival_time'].dt.day_of_week
df['sch_hour'] = df['scheduled_arrival_time'].dt.hour
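
Converting to local time before extracting `day` and `hour` matters for trips close to midnight UTC. The sketch below illustrates this with a made-up timestamp. It assumes that `LOCAL_TIMEZONE` resolves to an Eastern-time zone such as `America/Toronto`; the real value lives in `scripts.custom_functions` and is not shown in this notebook.

```python
import pandas as pd

# Assumption: stand-in for LOCAL_TIMEZONE, whose real value is not shown here.
ASSUMED_LOCAL_TIMEZONE = 'America/Toronto'

utc_times = pd.Series(pd.to_datetime(['2025-04-28 02:30:00+00:00'], utc=True))
local_times = utc_times.dt.tz_convert(ASSUMED_LOCAL_TIMEZONE)

# In UTC this is early Monday morning; locally it is still Sunday evening.
utc_day, utc_hour = utc_times.dt.day_of_week.iloc[0], utc_times.dt.hour.iloc[0]
local_day, local_hour = local_times.dt.day_of_week.iloc[0], local_times.dt.hour.iloc[0]
print(utc_day, utc_hour, local_day, local_hour)
```

Without the conversion, this observation would be labelled as a weekday at 2 am instead of a weekend evening, which would also affect `is_weekend` and `is_peak_hour`.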

In [14]:
# Use cyclical (sin/cos) encoding for day and hour, which suits periodic time features:
# values at the wrap-around (e.g. hour 23 and hour 0) end up close together
df['day_sin'] = np.sin(2 * np.pi * df['day'] / 7)
df['day_cos'] = np.cos(2 * np.pi * df['day'] / 7)

df['sch_day_sin'] = np.sin(2 * np.pi * df['sch_day'] / 7)
df['sch_day_cos'] = np.cos(2 * np.pi * df['sch_day'] / 7)

df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

df['sch_hour_sin'] = np.sin(2 * np.pi * df['sch_hour'] / 24)
df['sch_hour_cos'] = np.cos(2 * np.pi * df['sch_hour'] / 24)
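
To make the wrap-around concrete, the following sketch compares distances between encoded hours. `cyclical_encode` is a hypothetical helper written for this illustration; it applies the same sin/cos formula as the cell above.

```python
import numpy as np

def cyclical_encode(values, period):
    """Map values on a cycle of length `period` to (sin, cos) coordinates."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

hours = cyclical_encode([0, 12, 23], period=24)

# Euclidean distance between encoded hours
dist_23_to_0 = np.linalg.norm(hours[2] - hours[0])
dist_12_to_0 = np.linalg.norm(hours[1] - hours[0])
print(dist_23_to_0 < dist_12_to_0)
```

In the encoded space, hour 23 sits next to hour 0 while hour 12 is the farthest point, whereas the raw integers would place 23 farthest from 0. Whether a tree-based model benefits from this over the raw integer is worth validating empirically.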

In [15]:
# Add boolean value is_weekend
weekend_mask = df['day'].isin([5, 6])
df['is_weekend'] = np.where(weekend_mask, 1, 0)

In [16]:
# Add boolean value is_peak_hour (weekdays, hours 7-9 and 16-18 inclusive, i.e. 07:00-09:59 and 16:00-18:59)
peak_hour_mask = ~weekend_mask & df['hour'].isin([7, 8, 9, 16, 17, 18])
df['is_peak_hour'] = np.where(peak_hour_mask, 1, 0)

### Use Label Encoding for vehicle_id, route_id and stop_id

In [17]:
le_vehicle = LabelEncoder()
df['vehicle_id'] = le_vehicle.fit_transform(df['vehicle_id'])

In [18]:
le_route = LabelEncoder()
df['route_id'] = le_route.fit_transform(df['route_id'])

In [19]:
le_stop = LabelEncoder()
df['stop_id'] = le_stop.fit_transform(df['stop_id'])

### Convert vehicle status to categories

In [21]:
# Create status code mapping
status_codes = df['vehicle_status'].sort_values().unique()
condition_list = []
label_list = []

for code in status_codes:
    condition_list.append(df['vehicle_status'] == code)
    label_list.append(STOP_STATUS[code])

In [22]:
# Create categories
df['vehicle_status'] = np.select(condition_list, label_list, default='Unknown')

In [23]:
df['vehicle_status'].value_counts()

vehicle_status
In Transit To    90860
Stopped At       40773
Name: count, dtype: int64
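
The same code-to-label conversion can also be written with `Series.map`, which avoids building the condition lists by hand. This sketch assumes that `STOP_STATUS` is a plain dict from integer code to label. The dict below is a hypothetical stand-in based on the data description (1 is stopped at, 2 is in transit to).

```python
import pandas as pd

# Hypothetical stand-in for STOP_STATUS from scripts.custom_functions.
ASSUMED_STOP_STATUS = {1: 'Stopped At', 2: 'In Transit To'}

status_codes = pd.Series([2, 1, 2, 7])  # 7 is a made-up code with no label

# Codes missing from the dict become NaN, then fall back to 'Unknown'.
status_labels = status_codes.map(ASSUMED_STOP_STATUS).fillna('Unknown')
print(status_labels.tolist())
```

The same pattern would apply to `weathercode` with `WEATHER_CODES`, under the same assumption about its type.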

In [27]:
# Use One Hot Encoding
one_hot = pd.get_dummies(df['vehicle_status'], drop_first=True, dtype='int64', prefix='vehicle_status')
df = df.join(one_hot)

### Convert weathercode to Categories

In [24]:
# Create weather code mapping
weathercodes = df['weathercode'].sort_values().unique()
condition_list = []
label_list = []

for code in weathercodes:
    condition_list.append(df['weathercode'] == code)
    label_list.append(WEATHER_CODES[code])

In [25]:
# Create categories
df['weather'] = np.select(condition_list, label_list, default='Unknown')

In [26]:
df['weather'].value_counts()

weather
Clear sky        72641
Overcast         35254
Mainly clear     14806
Light drizzle     4605
Partly cloudy     4327
Name: count, dtype: int64

In [28]:
# Use One Hot Encoding
one_hot = pd.get_dummies(df['weather'], drop_first=True, dtype='int64', prefix='weather')
df = df.join(one_hot)
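
With `drop_first=True`, the first category in sorted order gets no column of its own and becomes the implicit baseline. Here that is `Clear sky`, which is why no `weather_Clear sky` column appears later. A small sketch on made-up rows:

```python
import pandas as pd

weather = pd.Series(['Overcast', 'Clear sky', 'Mainly clear', 'Clear sky'], name='weather')

dummies = pd.get_dummies(weather, drop_first=True, dtype='int64', prefix='weather')
print(dummies.columns.tolist())

# Rows for the dropped baseline category are all zeros.
baseline_row_sum = int(dummies.iloc[1].sum())
```

A row with zeros in every `weather_*` column therefore means `Clear sky`, not a missing value.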

### Convert incident_category to Categories 

## Export Data

In [29]:
df.columns

Index(['trip_id', 'vehicle_id', 'vehicle_lat', 'vehicle_lon',
       'vehicle_distance', 'vehicle_status', 'vehicle_bearing',
       'vehicle_speed', 'occupancy_status', 'route_id', 'stop_id', 'stop_lat',
       'stop_lon', 'stop_sequence', 'trip_progress', 'wheelchair_boarding',
       'realtime_arrival_time', 'scheduled_arrival_time', 'delay',
       'temperature', 'precipitation', 'windspeed', 'weathercode',
       'incident_nearby', 'day', 'hour', 'sch_day', 'sch_hour', 'day_sin',
       'day_cos', 'sch_day_sin', 'sch_day_cos', 'hour_sin', 'hour_cos',
       'sch_hour_sin', 'sch_hour_cos', 'is_weekend', 'is_peak_hour', 'weather',
       'vehicle_status_Stopped At', 'weather_Light drizzle',
       'weather_Mainly clear', 'weather_Overcast', 'weather_Partly cloudy'],
      dtype='object')

In [30]:
# Keep relevant columns
df = df[[
    'vehicle_id',
    'vehicle_lat',
    'vehicle_lon',
    'vehicle_distance',
    'vehicle_status_Stopped At',
    'vehicle_bearing',
    'vehicle_speed',
    'occupancy_status',
    'route_id',
    'stop_id',
    'stop_lat',
    'stop_lon',
    'stop_sequence',
    'trip_progress',
    'wheelchair_boarding',
    'day_sin',
    'day_cos',
    'sch_day_sin',
    'sch_day_cos',
    'hour_sin',
    'hour_cos',
    'sch_hour_sin',
    'sch_hour_cos',
    'is_weekend',
    'is_peak_hour',
    'temperature',
    'precipitation',
    'windspeed',
    'weather_Light drizzle',
    'weather_Mainly clear',
    'weather_Overcast',
    'weather_Partly cloudy',
    'incident_nearby',
    'delay'
]]

In [31]:
# Export encoders
encoders = {
    'le_vehicle': le_vehicle,
    'le_route': le_route,
    'le_stop': le_stop,
}

with open('../models/label_encoders.pkl', 'wb') as handle:
    pickle.dump(encoders, handle, protocol=pickle.HIGHEST_PROTOCOL)
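
When the pickled encoders are reloaded later, `LabelEncoder.transform` raises a `ValueError` for ids it did not see during fitting. The sketch below shows one possible way to handle that. It is an assumption about downstream use, not part of this notebook: `encode_known`, the `-1` fallback and the route ids are all hypothetical, and the pickle round-trip is done in memory instead of through `label_encoders.pkl`.

```python
import io
import pickle

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy encoder standing in for le_route (the route ids here are made up).
le_route = LabelEncoder().fit([30, 124, 439, 777])

# Round-trip through pickle in memory, mirroring the dump above.
buffer = io.BytesIO()
pickle.dump({'le_route': le_route}, buffer, protocol=pickle.HIGHEST_PROTOCOL)
buffer.seek(0)
encoders = pickle.load(buffer)

def encode_known(encoder, values, unknown=-1):
    """Encode values, mapping ids the encoder never saw to `unknown`."""
    values = np.asarray(values)
    known = np.isin(values, encoder.classes_)
    encoded = np.full(values.shape, unknown, dtype=np.int64)
    encoded[known] = encoder.transform(values[known])
    return encoded

codes = encode_known(encoders['le_route'], [124, 999, 30])
print(codes.tolist())
```

Whether `-1` is a sensible fallback depends on the model that consumes these features, so treat it as a placeholder choice.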

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 131633 entries, 0 to 133074
Data columns (total 34 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   vehicle_id                 131633 non-null  int64  
 1   vehicle_lat                131633 non-null  float64
 2   vehicle_lon                131633 non-null  float64
 3   vehicle_distance           131633 non-null  float64
 4   vehicle_status_Stopped At  131633 non-null  int64  
 5   vehicle_bearing            131633 non-null  float64
 6   vehicle_speed              131633 non-null  float64
 7   occupancy_status           131633 non-null  int64  
 8   route_id                   131633 non-null  int64  
 9   stop_id                    131633 non-null  int64  
 10  stop_lat                   131633 non-null  float64
 11  stop_lon                   131633 non-null  float64
 12  stop_sequence              131633 non-null  int64  
 13  trip_progress              131633 

In [None]:
# Export dataframe
df.to_csv('../data/preprocessed.csv', index=False)

## End