# STM Transit Delay Data Preprocessing

This notebook preprocesses STM trip update data that has been merged with historical weather data.

## Data Description

`trip_id` unique identifier of a trip<br>
`route_id` bus or metro line<br>
`stop_id` stop number<br>
`stop_lat` stop latitude<br>
`stop_lon` stop longitude<br>
`stop_sequence` sequence of the stop, for ordering<br>
`wheelchair_boarding` indicates whether the stop is wheelchair accessible, 1 being true and 2 being false<br>
`realtime_arrival_time` actual arrival time, as a Unix timestamp in milliseconds<br>
`scheduled_arrival_time` planned arrival time, in milliseconds<br>
`temperature` air temperature at 2 meters above ground, in Celsius<br>
`precipitation` total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters<br>
`windspeed` wind speed at 10 meters above ground, in km/h<br>
`weathercode` weather condition as a numeric code<br>
`delay` difference between actual and planned arrival time, in seconds<br>
`delay_previous_stop` delay at the previous stop<br>
`is_holiday` indicates whether the day of the trip is a holiday
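
The `delay` definition above can be illustrated with a small sketch. It assumes that both arrival times are epoch timestamps in milliseconds and that `delay` is actual minus planned, converted to seconds. The timestamps are toy values, not rows from the dataset.

```python
import pandas as pd

# Toy timestamps in epoch milliseconds (illustrative values only)
toy = pd.DataFrame({
    'scheduled_arrival_time': [1745398380000, 1745398440000],
    'realtime_arrival_time': [1745398436000, 1745398425000],
})

# Assumed definition: actual minus planned, converted from milliseconds to seconds
toy['delay'] = (toy['realtime_arrival_time'] - toy['scheduled_arrival_time']) / 1000
```

Under this assumption a positive value means the vehicle arrived late and a negative value means it arrived early.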

## Imports

In [91]:
import numpy as np
import pandas as pd
import pickle
from scripts.custom_functions import WEATHER_CODES
from sklearn.preprocessing import LabelEncoder

In [92]:
# Set timezone
local_timezone = 'Canada/Eastern'

## Data Preprocessing

In [93]:
# Load data
df = pd.read_csv('data/stm_weather_merged.csv')

### Drop Unnecessary Columns

In [94]:
# Drop scheduled_arrival_time, since delay already captures the gap between planned and actual arrival
df = df.drop('scheduled_arrival_time', axis=1)

### Encode Datetime

In [95]:
# Convert realtime arrival timestamp to datetime
rt_arrival_dt = pd.to_datetime(df['realtime_arrival_time'], origin='unix', unit='ms', utc=True)
rt_arrival_dt = rt_arrival_dt.dt.tz_convert(local_timezone)
rt_arrival_dt.head()

0   2025-04-23 04:53:56-04:00
1   2025-04-23 04:54:42-04:00
2   2025-04-23 04:55:08-04:00
3   2025-04-23 04:55:35-04:00
4   2025-04-23 04:56:00-04:00
Name: realtime_arrival_time, dtype: datetime64[ns, Canada/Eastern]
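
The two-step conversion above (parse as UTC, then convert to local time) matters because the local UTC offset changes with daylight saving time. A minimal sketch with two toy timestamps, one in April and one in January:

```python
import pandas as pd

local_timezone = 'Canada/Eastern'

# Epoch milliseconds: one spring timestamp (daylight time), one winter timestamp (standard time)
ms = pd.Series([1745398436000, 1736942400000])

# Parse as UTC first, then convert so the hour reflects the local wall clock
local_dt = pd.to_datetime(ms, unit='ms', utc=True).dt.tz_convert(local_timezone)

utc_offsets_hours = [ts.utcoffset().total_seconds() / 3600 for ts in local_dt]
local_hours = local_dt.dt.hour.tolist()
```

Features such as the hour of day are then derived from the local wall-clock time rather than from UTC.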

In [96]:
# Extract day of week (Monday = 0) and hour of day from the local datetime
df['day_of_week'] = rt_arrival_dt.dt.day_of_week
df['hour_of_day'] = rt_arrival_dt.dt.hour

In [97]:
# Use cyclical encoding for day of week and hour of day, as it's more suitable for periodic features:
# the model can "understand" the wrap-around (e.g. hour 23 is next to hour 0)
df['day_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
df['day_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7)

df['hour_sin'] = np.sin(2 * np.pi * df['hour_of_day'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour_of_day'] / 24)
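
To illustrate the wrap-around property, the sketch below compares distances in the encoded space. The helper `encode_hour` is hypothetical and exists only for this example.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# 23:00 and 00:00 are one hour apart, but their raw values differ by 23
raw_gap = abs(23 - 0)

# In the encoded space they are as close as any other pair of adjacent hours
dist_wrap = np.linalg.norm(encode_hour(23) - encode_hour(0))
dist_adjacent = np.linalg.norm(encode_hour(11) - encode_hour(12))
```

Both the sine and the cosine are needed: either one alone maps two different hours to the same value.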

In [98]:
# Add binary flag is_weekend (Saturday = 5, Sunday = 6)
df['is_weekend'] = np.where(df['day_of_week'].isin([5, 6]), 1, 0)

In [99]:
# Add binary flag is_peak_hour (hours 7-9 and 16-18, i.e. 7:00-9:59 and 16:00-18:59)
peak_hour_mask = (df['hour_of_day'].isin([7, 8, 9])) | (df['hour_of_day'].isin([16, 17, 18]))
df['is_peak_hour'] = np.where(peak_hour_mask, 1, 0)

In [100]:
# Drop unneeded columns
df = df.drop(['realtime_arrival_time', 'day_of_week', 'hour_of_day'], axis=1)

### Convert Boolean Columns to Integer

In [101]:
bool_columns = df.select_dtypes(include='bool').columns
bool_columns

Index(['wheelchair_boarding', 'is_holiday'], dtype='object')

In [102]:
df[bool_columns] = df[bool_columns].astype('int64')

### Use Label Encoding for route_id and stop_id

In [103]:
le_route = LabelEncoder()
df['route_id'] = le_route.fit_transform(df['route_id'])

In [104]:
le_stop = LabelEncoder()
df['stop_id'] = le_stop.fit_transform(df['stop_id'])
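
As a quick illustration of what `LabelEncoder` does, the sketch below fits one on a few hypothetical route identifiers. Each code is the index of the label in the sorted `classes_` array, and a label that was not seen during fitting raises an error on `transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical route identifiers, for illustration only
routes = [51, 165, 24, 51]

le = LabelEncoder()
codes = le.fit_transform(routes)

# inverse_transform maps the integer codes back to the original labels
original = le.inverse_transform(codes)

# transform() raises ValueError for a label that was not seen during fit
try:
    le.transform([999])
    unseen_ok = True
except ValueError:
    unseen_ok = False
```

The resulting integers carry no meaningful order, which is something to keep in mind when choosing a model that would treat them as ordinal.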

### Convert weathercode Into Categories

In [105]:
# Create mapping
weathercodes = df['weathercode'].sort_values().unique()
condition_list = []
label_list = []

for code in weathercodes:
    condition_list.append(df['weathercode'] == code)
    label_list.append(WEATHER_CODES[code])

In [106]:
# Create categories
df['weather'] = np.select(condition_list, label_list, default='Unknown')
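
The mapping step can be tried on toy data. In the sketch below, `weather_codes` is a hypothetical stand-in for `WEATHER_CODES` (whose actual contents live in `scripts/custom_functions` and are not shown here). It also shows that `Series.map` gives the same labels on this toy input.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for scripts.custom_functions.WEATHER_CODES
weather_codes = {0: 'Clear sky', 3: 'Overcast', 61: 'Slight rain'}

toy = pd.DataFrame({'weathercode': [3, 0, 61, 3]})

# One boolean condition per code, paired with its label
codes = toy['weathercode'].sort_values().unique()
conditions = [toy['weathercode'] == code for code in codes]
labels = [weather_codes[code] for code in codes]
toy['weather'] = np.select(conditions, labels, default='Unknown')

# Alternative: look each code up directly; codes missing from the dict become 'Unknown'
mapped = toy['weathercode'].map(weather_codes).fillna('Unknown')
```

One difference between the two: the loop raises a `KeyError` for a code missing from the dictionary, while `map` silently yields `'Unknown'` after the `fillna`.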

In [107]:
# Use one-hot encoding; drop_first=True drops the first category, which becomes the baseline
one_hot = pd.get_dummies(df['weather'], drop_first=True, dtype='int64', prefix='weather')
df = df.drop(['weathercode', 'weather'], axis=1).join(one_hot)
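
A small sketch of the effect of `drop_first=True`, using toy labels: the first category in sorted order gets no column of its own and is represented by a row of zeros.

```python
import pandas as pd

# Toy weather labels, for illustration only
weather = pd.Series(['Overcast', 'Clear sky', 'Slight rain', 'Overcast'], name='weather')

one_hot = pd.get_dummies(weather, drop_first=True, dtype='int64', prefix='weather')

# 'Clear sky' sorts first, so it is dropped and encoded as all zeros
clear_sky_row = one_hot.iloc[1].tolist()
```

Dropping one category avoids a redundant column, since its value is implied by the others.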

In [108]:
df.columns

Index(['trip_id', 'route_id', 'stop_id', 'stop_lat', 'stop_lon',
       'stop_sequence', 'wheelchair_boarding', 'temperature', 'precipitation',
       'windspeed', 'delay', 'delay_previous_stop', 'is_holiday', 'day_sin',
       'day_cos', 'hour_sin', 'hour_cos', 'is_weekend', 'is_peak_hour',
       'weather_Light drizzle', 'weather_Mainly clear',
       'weather_Moderate drizzle', 'weather_Moderate rain', 'weather_Overcast',
       'weather_Partly cloudy', 'weather_Slight rain'],
      dtype='object')

In [None]:
# Keep relevant columns and reorder them
df = df[[
  'route_id',
  'stop_id',
  'stop_lat',
  'stop_lon',
  'wheelchair_boarding',
  'day_sin',
  'day_cos',
  'hour_sin',
  'hour_cos',
  'is_weekend',
  'is_holiday',
  'is_peak_hour',
  'delay_previous_stop',
  'temperature',
  'precipitation',
  'windspeed',
  'weather_Light drizzle',
  'weather_Mainly clear',
  'weather_Moderate drizzle',
  'weather_Moderate rain',
  'weather_Overcast',
  'weather_Partly cloudy',
  'weather_Slight rain',
  'delay',
  ]]

In [110]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 951210 entries, 0 to 951209
Data columns (total 24 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   route_id                  951210 non-null  int64  
 1   stop_id                   951210 non-null  int64  
 2   stop_lat                  951210 non-null  float64
 3   stop_lon                  951210 non-null  float64
 4   wheelchair_boarding       951210 non-null  int64  
 5   day_sin                   951210 non-null  float64
 6   day_cos                   951210 non-null  float64
 7   hour_sin                  951210 non-null  float64
 8   hour_cos                  951210 non-null  float64
 9   is_weekend                951210 non-null  int64  
 10  is_holiday                951210 non-null  int64  
 11  is_peak_hour              951210 non-null  int64  
 12  delay_previous_stop       951210 non-null  float64
 13  temperature               951210 non-null  f

In [111]:
# Export route label encoder and stop label encoder
encoders = {
  'le_route': le_route,
  'le_stop': le_stop
}
with open('models/encoders.pickle', 'wb') as handle:
    pickle.dump(encoders, handle, protocol=pickle.HIGHEST_PROTOCOL)
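
For reference, a minimal round-trip sketch of how the exported encoders could be loaded and reused later. It writes to a temporary file rather than `models/encoders.pickle`, and the identifiers are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.preprocessing import LabelEncoder

# Hypothetical encoders fitted on toy identifiers
le_route = LabelEncoder().fit([24, 51, 165])
le_stop = LabelEncoder().fit([50001, 50002])
encoders = {'le_route': le_route, 'le_stop': le_stop}

# Write to a temporary file here; the notebook writes to models/encoders.pickle
path = os.path.join(tempfile.mkdtemp(), 'encoders.pickle')
with open(path, 'wb') as handle:
    pickle.dump(encoders, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Load the encoders back and encode a raw route identifier
with open(path, 'rb') as handle:
    loaded = pickle.load(handle)

encoded_route = loaded['le_route'].transform([51])[0]
```

Pickled scikit-learn objects are best loaded with the same scikit-learn version that created them.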

In [112]:
# Export dataframe
df.to_csv('data/preprocessed.csv', index=False)

## End