# STM Transit Delay Data Preprocessing

This notebook preprocesses the merged STM trip-update and historical weather data.

## Data Description

`trip_id` unique identifier of a trip<br>
`route_id` bus or metro line<br>
`stop_id` stop number<br>
`stop_lat` stop latitude<br>
`stop_lon` stop longitude<br>
`stop_sequence` sequence of the stop, for ordering<br>
`wheelchair_boarding` indicates whether the stop is wheelchair accessible (1 = accessible, 2 = not accessible)<br>
`realtime_arrival_time` actual arrival time, as a Unix timestamp in milliseconds<br>
`scheduled_arrival_time` planned arrival time, as a Unix timestamp in milliseconds<br>
`temperature` air temperature at 2 meters above ground, in Celsius<br>
`precipitation` total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters<br>
`windspeed` wind speed at 10 meters above ground, in km/h<br>
`weathercode` World Meteorological Organization (WMO) code<br>
`delay` difference between actual and planned arrival time, in seconds<br>
`delay_previous_stop` delay at the previous stop
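
The relationship between the timestamp columns and the two delay columns can be illustrated with a small sketch. The three rows below are hypothetical, and the `groupby(...).shift(1)` step is only one plausible way to derive `delay_previous_stop`; the merged CSV used in this notebook already contains both columns, so how they were actually built is not shown here.

```python
import pandas as pd

# Hypothetical rows for a single trip; timestamps are Unix epoch milliseconds
sample = pd.DataFrame({
    'trip_id': [1, 1, 1],
    'stop_sequence': [1, 2, 3],
    'realtime_arrival_time': [1745398380000, 1745398436000, 1745398482000],
    'scheduled_arrival_time': [1745398320000, 1745398440000, 1745398482000],
})

# Delay in seconds: positive means late, negative means early
sample['delay'] = (
    sample['realtime_arrival_time'] - sample['scheduled_arrival_time']
) / 1000

# Assumption: the previous stop is the preceding stop_sequence of the same trip
sample = sample.sort_values(['trip_id', 'stop_sequence'])
sample['delay_previous_stop'] = sample.groupby('trip_id')['delay'].shift(1)

print(sample[['stop_sequence', 'delay', 'delay_previous_stop']])
```

The first stop of a trip has no predecessor, so its `delay_previous_stop` is missing under this approach and would need to be filled or dropped.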

## Imports

In [20]:
import numpy as np
import pandas as pd
import pickle
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, StandardScaler
import sys

In [2]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import LOCAL_TIMEZONE, WEATHER_CODES

## Data Preprocessing

In [4]:
# Load data
df = pd.read_csv('../data/stm_weather_merged.csv')

### Drop Unnecessary Columns

In [5]:
# Drop scheduled_arrival_time: it is redundant given realtime_arrival_time and delay
df = df.drop('scheduled_arrival_time', axis=1)

### Encode Datetime

In [6]:
# Convert realtime arrival timestamp to datetime
rt_arrival_dt = pd.to_datetime(df['realtime_arrival_time'], origin='unix', unit='ms', utc=True)
rt_arrival_dt = rt_arrival_dt.dt.tz_convert(LOCAL_TIMEZONE)
rt_arrival_dt.head()

0   2025-04-23 04:53:00-04:00
1   2025-04-23 04:53:56-04:00
2   2025-04-23 04:54:42-04:00
3   2025-04-23 04:55:08-04:00
4   2025-04-23 04:55:35-04:00
Name: realtime_arrival_time, dtype: datetime64[ns, Canada/Eastern]
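
Converting to local time before extracting features matters because the hour of day differs between UTC and Montreal time. The sketch below uses a single hand-picked timestamp and the zone name `America/Toronto`, which is an assumption standing in for `LOCAL_TIMEZONE` (the output above shows `Canada/Eastern`, which refers to the same zone in the tz database).

```python
import pandas as pd

# One Unix timestamp in milliseconds (2025-04-23 08:53:00 UTC)
ts_ms = pd.Series([1745398380000])

# Parse as UTC first, then convert to the local zone
utc = pd.to_datetime(ts_ms, unit='ms', utc=True)
local = utc.dt.tz_convert('America/Toronto')  # assumed stand-in for LOCAL_TIMEZONE

print(utc.dt.hour.iloc[0])           # hour in UTC
print(local.dt.hour.iloc[0])         # hour in local time
print(local.dt.day_of_week.iloc[0])  # Monday = 0, Sunday = 6
```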

In [7]:
# Convert datetime to useful features
df['day'] = rt_arrival_dt.dt.day_of_week
df['hour'] = rt_arrival_dt.dt.hour

In [None]:
# Use cyclical (sine/cosine) encoding for day and hour so the model sees the wrap-around
# (hour 23 is adjacent to hour 0, and Sunday is adjacent to Monday)
df['day_sin'] = np.sin(2 * np.pi * df['day'] / 7)
df['day_cos'] = np.cos(2 * np.pi * df['day'] / 7)

df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
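
To see why the sine/cosine pair captures the wrap-around, the following sketch compares distances between encoded hours. The helper `encode_hour` is hypothetical and only mirrors the formula used in the cell above.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# Euclidean distances between encoded hours
dist_23_0 = np.linalg.norm(encode_hour(23) - encode_hour(0))
dist_0_1 = np.linalg.norm(encode_hour(0) - encode_hour(1))
dist_0_12 = np.linalg.norm(encode_hour(0) - encode_hour(12))

print(dist_23_0, dist_0_1, dist_0_12)
```

With the raw integer, hours 23 and 0 are 23 units apart; after encoding, they are as close as hours 0 and 1, while midnight and noon sit at opposite ends of the circle.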

In [9]:
# Add binary flag is_weekend (day_of_week 5 = Saturday, 6 = Sunday)
df['is_weekend'] = np.where(df['day'].isin([5, 6]), 1, 0)

In [10]:
# Add binary flag is_peak_hour (hours 7-9 and 16-18, i.e. 07:00-09:59 and 16:00-18:59)
peak_hour_mask = df['hour'].isin([7, 8, 9, 16, 17, 18])
df['is_peak_hour'] = np.where(peak_hour_mask, 1, 0)

In [11]:
# Drop unneeded columns
df = df.drop(['realtime_arrival_time', 'day', 'hour'], axis=1)

### Convert Boolean Columns to Integer

In [13]:
df['wheelchair_boarding'] = df['wheelchair_boarding'].astype('int64')

### Use Label Encoding for route_id and stop_id

In [14]:
le_route = LabelEncoder()
df['route_id'] = le_route.fit_transform(df['route_id'])

In [15]:
le_stop = LabelEncoder()
df['stop_id'] = le_stop.fit_transform(df['stop_id'])
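
A fitted `LabelEncoder` maps each distinct identifier to an integer based on sorted order, can map the integers back, and rejects identifiers it did not see during fitting. The sketch below illustrates this with made-up route numbers; it is not taken from the STM data.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Hypothetical route ids; classes are stored in sorted order: [24, 51, 165]
codes = le.fit_transform([165, 51, 165, 24])
original = le.inverse_transform(codes)

# An id that was not present at fit time raises ValueError
try:
    le.transform([999])
    unseen_ok = True
except ValueError:
    unseen_ok = False

print(codes, original, unseen_ok)
```

This is one reason to keep the fitted encoders: any later data has to be encoded with the same mapping, and routes or stops absent from the training data need separate handling.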

### Convert weathercode Into Categories

In [16]:
# Create mapping
weathercodes = df['weathercode'].sort_values().unique()
condition_list = []
label_list = []

for code in weathercodes:
  condition_list.append(df['weathercode'] == code)
  label_list.append(WEATHER_CODES[code])

In [17]:
# Create categories
df['weather'] = np.select(condition_list, label_list, default='Unknown')

In [18]:
# Use One Hot Encoding
one_hot = pd.get_dummies(df['weather'], drop_first=True, dtype='int64', prefix='weather')
df = df.drop(['weathercode', 'weather'], axis=1).join(one_hot)
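
With `drop_first=True`, the alphabetically first category gets no column of its own and is represented by a row of zeros. A minimal sketch with made-up labels (the real labels come from `WEATHER_CODES`):

```python
import pandas as pd

# Hypothetical weather labels
weather = pd.Series(
    ['Overcast', 'Clear sky', 'Slight rain', 'Clear sky'], name='weather'
)

one_hot = pd.get_dummies(weather, drop_first=True, dtype='int64', prefix='weather')

# 'Clear sky' sorts first, so it becomes the implicit baseline category
print(one_hot)
```

Dropping one level avoids perfectly collinear columns, which mainly matters for linear models.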

### Reduce Stop Coordinates to One Feature (PCA)

In [21]:
stop_coords = df[['stop_lat', 'stop_lon']]
stop_coords

          stop_lat   stop_lon
0        45.697416 -73.491724
1        45.694233 -73.496278
2        45.690526 -73.497867
3        45.688389 -73.498520
4        45.686125 -73.499308
...            ...        ...
1302603  45.469756 -73.536156
1302604  45.469756 -73.536156
1302605  45.469756 -73.536156
1302606  45.499992 -73.565764

In [22]:
# Standardize the coordinates: PCA is sensitive to feature scale
scaler_coord = StandardScaler()
coords_scaled = scaler_coord.fit_transform(stop_coords)
coords_scaled

array([[ 2.668891  ,  1.59675097],
       [ 2.61907636,  1.54603494],
       [ 2.56106101,  1.52833891],
       ...,
       [-0.8940372 ,  1.10193008],
       [-0.42083726,  0.77219797],
       [-0.42083726,  0.77219797]])

In [25]:
# Apply PCA
pca = PCA(n_components=1)
pca_coords = pca.fit_transform(coords_scaled)
df['pca_coords'] = pca_coords
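
Keeping a single component discards part of the spatial information, and `explained_variance_ratio_` gives a rough measure of how much is retained. The sketch below uses only five coordinate pairs copied from the output above, so its numbers are illustrative and do not describe the full dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# A few (stop_lat, stop_lon) pairs taken from the preview above
coords = np.array([
    [45.697416, -73.491724],
    [45.694233, -73.496278],
    [45.690526, -73.497867],
    [45.469756, -73.536156],
    [45.499992, -73.565764],
])

scaled = StandardScaler().fit_transform(coords)
pca_demo = PCA(n_components=1)
reduced = pca_demo.fit_transform(scaled)

# Share of total variance kept by the single component
ratio = pca_demo.explained_variance_ratio_[0]

# For two standardized features this equals (1 + |r|) / 2,
# where r is the Pearson correlation between latitude and longitude
r = np.corrcoef(coords[:, 0], coords[:, 1])[0, 1]
print(ratio, (1 + abs(r)) / 2)
```

If latitude and longitude are only weakly correlated across stops, the ratio approaches 0.5 and the single feature loses much of the location signal, so it may be worth checking this value on the real data.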

In [26]:
# Drop latitude and longitude
df = df.drop(['stop_lat', 'stop_lon'], axis=1)

In [29]:
# Keep relevant columns and reorder them
df = df[[
  'route_id',
  'stop_id',
  'pca_coords',
  'wheelchair_boarding',
  'day_sin',
  'day_cos',
  'hour_sin',
  'hour_cos',
  'is_weekend',
  'is_peak_hour',
  'delay_previous_stop',
  'temperature',
  'precipitation',
  'windspeed',
  'weather_Light drizzle',
  'weather_Mainly clear',
  'weather_Moderate drizzle',
  'weather_Moderate rain',
  'weather_Overcast',
  'weather_Partly cloudy',
  'weather_Slight rain',
  'delay',
  ]]

## Export Data

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1302608 entries, 0 to 1302607
Data columns (total 22 columns):
 #   Column                    Non-Null Count    Dtype  
---  ------                    --------------    -----  
 0   route_id                  1302608 non-null  int64  
 1   stop_id                   1302608 non-null  int64  
 2   pca_coords                1302608 non-null  float64
 3   wheelchair_boarding       1302608 non-null  int64  
 4   day_sin                   1302608 non-null  float64
 5   day_cos                   1302608 non-null  float64
 6   hour_sin                  1302608 non-null  float64
 7   hour_cos                  1302608 non-null  float64
 8   is_weekend                1302608 non-null  int64  
 9   is_peak_hour              1302608 non-null  int64  
 10  delay_previous_stop       1302608 non-null  float64
 11  temperature               1302608 non-null  float64
 12  precipitation             1302608 non-null  float64
 13  windspeed                 1

In [32]:
# Export encoders
encoders = {
  'le_route': le_route,
  'le_stop': le_stop
}
with open('../models/label_encoders.pickle', 'wb') as handle:
  pickle.dump(encoders, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [34]:
# Export scaler
with open('../models/coord_scaler.pickle', 'wb') as handle:
  pickle.dump(scaler_coord, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [35]:
# Export the fitted PCA model (not the transformed coordinates)
with open('../models/coord_pca.pickle', 'wb') as handle:
  pickle.dump(pca, handle, protocol=pickle.HIGHEST_PROTOCOL)
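
At inference time, a new stop's coordinates have to pass through the same fitted scaler and the same fitted PCA model, which is why the fitted objects (rather than their outputs) are the useful things to persist. The sketch below is self-contained: it fits small stand-in objects on coordinates from the preview above and round-trips them through `pickle` in memory instead of reading the files under `../models/`. The new stop's coordinates are hypothetical.

```python
import pickle

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in training coordinates (stop_lat, stop_lon)
train = np.array([
    [45.697416, -73.491724],
    [45.694233, -73.496278],
    [45.469756, -73.536156],
    [45.499992, -73.565764],
])

scaler = StandardScaler().fit(train)
pca_model = PCA(n_components=1).fit(scaler.transform(train))

# In-memory round trip, standing in for pickle.dump / pickle.load on disk
scaler_loaded = pickle.loads(pickle.dumps(scaler))
pca_loaded = pickle.loads(pickle.dumps(pca_model))

# Hypothetical new stop: scale first, then project
new_stop = np.array([[45.5, -73.56]])
feature = pca_loaded.transform(scaler_loaded.transform(new_stop))
print(feature)
```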

In [37]:
# Export dataframe
df.to_csv('../data/preprocessed.csv', index=False)

## End