# STM Transit Delay Data Preprocessing

## Overview

This notebook preprocesses STM trip update, weather and traffic data in order to build a regression model that predicts delays in seconds.

## Data Description (to be completed)

`trip_id`: Unique identifier for the transit trip.<br>
`vehicle_id`: Unique identifier for a vehicle.<br>
`vehicle_lat`, `vehicle_lon`: Vehicle current position.<br>
`vehicle_distance`: Vehicle distance from the stop, in meters.<br>
`vehicle_in_transit`: Indicates if a vehicle is in transit or if it has stopped.<br>
`vehicle_bearing`: Direction that the vehicle is facing, from 0 to 360 degrees.<br>
`vehicle_speed`: Momentary speed measured by the vehicle, in meters per second.<br>
`stop_sequence`: Sequence of the stop, for ordering.<br>
`occupancy_status`: Degree of passenger occupancy, ranging from 1 (empty) to 7 (not accepting passengers).<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`stop_id`: Unique identifier of a stop.<br>
`stop_name`: Name of the stop.<br>
`stop_lat`, `stop_lon`: Stop coordinates.<br>
`trip_progress`: How far along the trip the vehicle is, from 0 (first stop) to 1 (last stop).<br>
`wheelchair_boarding`: Indicates if the stop is accessible to people using a wheelchair.<br>
`rt_arrival_time`, `sch_arrival_time`: Realtime and scheduled arrival time, in UTC.<br>
`delay`: Difference between real and scheduled arrival time, in seconds.<br>
`delay_cat`: Delay magnitude, from very early to very late.<br>
`temperature`: Air temperature at 2 meters above ground, in Celsius.<br>
`precipitation`: Total precipitation (rain, showers, snow) over the preceding hour, in millimeters.<br>
`windspeed`: Wind speed at 10 meters above ground, in km/h.<br>
`weathercode`: World Meteorological Organization (WMO) code, see [Open-Meteo API documentation](https://open-meteo.com/en/docs#weather_variable_documentation) for the list.<br>
`incident_nearby`: Indicates if an incident happened within 1.5 km of the vehicle position.
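
To make the `delay` definition concrete, here is a minimal sketch that derives it from the realtime and scheduled arrival times. The two sample rows are hypothetical, and the sign convention (positive means late) is inferred from the description above rather than from the upstream code.

```python
import pandas as pd

# Hypothetical sample: realtime and scheduled arrival times, both in UTC
sample = pd.DataFrame({
    "rt_arrival_time": ["2025-04-27 22:15:32+00:00", "2025-04-27 22:08:30+00:00"],
    "sch_arrival_time": ["2025-04-27 22:14:00+00:00", "2025-04-27 22:09:00+00:00"],
})

rt = pd.to_datetime(sample["rt_arrival_time"], utc=True)
sch = pd.to_datetime(sample["sch_arrival_time"], utc=True)

# Assumed convention: positive values mean late, negative values mean early
sample["delay"] = (rt - sch).dt.total_seconds()
```

With these values the first row is 92 seconds late and the second is 30 seconds early.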

## Imports

In [108]:
import numpy as np
import pandas as pd
import pickle
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
import sys

In [None]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import DELAY_CLASS, LOCAL_TIMEZONE, OCCUPANCY_STATUS

In [110]:
# Load data
df = pd.read_csv('../data/stm_weather_traffic_merged.csv')

In [111]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 236704 entries, 0 to 236703
Data columns (total 28 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   trip_id                236704 non-null  int64  
 1   vehicle_id             236704 non-null  int64  
 2   vehicle_in_transit     236704 non-null  int64  
 3   vehicle_bearing        236704 non-null  float64
 4   vehicle_speed          236704 non-null  float64
 5   occupancy_status       236704 non-null  object 
 6   route_id               236704 non-null  int64  
 7   stop_id                236704 non-null  int64  
 8   stop_name              236704 non-null  object 
 9   stop_lat               236704 non-null  float64
 10  stop_lon               236704 non-null  float64
 11  stop_distance          236704 non-null  float64
 12  stop_sequence          236704 non-null  int64  
 13  trip_progress          236704 non-null  float64
 14  stop_has_alert         236704 non-nu

In [112]:
df.head()

| | trip_id | vehicle_id | vehicle_in_transit | vehicle_bearing | vehicle_speed | occupancy_status | route_id | stop_id | stop_name | stop_lat | ... | sch_arrival_time | rt_departure_time | sch_departure_time | delay | delay_class | temperature | precipitation | windspeed | weather | incident_nearby |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 283855472 | 32018 | 1 | 54.0 | 0.0 | Empty | 49 | 60515 | Perras / 81e Avenue | 45.668193 | ... | 2025-04-27 22:09:00+00:00 | | 2025-04-27 22:09:00+00:00 | 0.0 | On Time | 11.3 | 0.0 | 13.3 | Clear sky | 0 |
| 1 | 283854752 | 32014 | 0 | 54.0 | 0.0 | Many seats available | 49 | 55288 | Maurice-Duplessis / Langelier | 45.616888 | ... | 2025-04-27 22:14:00+00:00 | 2025-04-27 22:15:32+00:00 | 2025-04-27 22:14:00+00:00 | 92.0 | On Time | 11.3 | 0.0 | 13.3 | Clear sky | 0 |
| 2 | 283854752 | 32014 | 1 | 334.0 | 11.94454 | Empty | 49 | 60320 | Saint-Jean-Baptiste / Émilie-Du Châtelet | 45.665987 | ... | 2025-04-27 22:47:31+00:00 | 2025-04-27 22:49:03+00:00 | 2025-04-27 22:47:31+00:00 | 92.0 | On Time | 11.0 | 0.0 | 13.4 | Clear sky | 0 |
| 3 | 284752678 | 31821 | 0 | 0.0 | 0.0 | Many seats available | 69 | 53802 | Prison de Bordeaux | 45.546494 | ... | 2025-04-27 22:08:26+00:00 | 2025-04-27 22:14:40+00:00 | 2025-04-27 22:08:26+00:00 | 374.0 | Late | 11.3 | 0.0 | 13.3 | Clear sky | 0 |
| 4 | 284215436 | 33844 | 1 | 269.0 | 5.00004 | Many seats available | 165 | 51626 | de la Côte-des-Neiges / Blueridge | 45.495629 | ... | 2025-04-27 22:07:00+00:00 | 2025-04-27 22:11:46+00:00 | 2025-04-27 22:07:00+00:00 | 286.0 | Late | 11.3 | 0.0 | 13.3 | Clear sky | 0 |


## Data Preprocessing

### Handle Delay Outliers

### Encode Datetime

In [121]:
# Convert scheduled arrival time
df['sch_arrival_time'] = pd.to_datetime(df['sch_arrival_time'], utc=True).dt.tz_convert(LOCAL_TIMEZONE)

In [122]:
print(df['sch_arrival_time'].isna().sum())

0


In [123]:
# Convert datetime to month, day and hour
df['month'] = df['sch_arrival_time'].dt.month
df['day_of_week'] = df['sch_arrival_time'].dt.day_of_week
df['hour'] = df['sch_arrival_time'].dt.hour
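
The conversion to local time matters because the calendar features are read off the local clock. The sketch below shows how the same UTC timestamp can land on a different day and hour once converted. `ASSUMED_TIMEZONE` is an assumption standing in for `LOCAL_TIMEZONE`, whose actual value lives in `scripts.custom_functions` and is not shown here.

```python
import pandas as pd

# Assumption: a stand-in for LOCAL_TIMEZONE (Montréal follows this zone's rules)
ASSUMED_TIMEZONE = "America/Toronto"

# Two hypothetical scheduled arrival times in UTC
ts = pd.Series(pd.to_datetime(
    ["2025-04-27 22:09:00+00:00", "2025-04-28 03:30:00+00:00"], utc=True
))

local = ts.dt.tz_convert(ASSUMED_TIMEZONE)

# Calendar features extracted from the local time, not from UTC
features = pd.DataFrame({
    "month": local.dt.month,
    "day_of_week": local.dt.day_of_week,  # Monday = 0, Sunday = 6
    "hour": local.dt.hour,
})
```

The second timestamp is a Monday at 03:30 in UTC but a Sunday at 23:30 local time, so skipping the conversion would misplace both `day_of_week` and `hour`.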

In [124]:
# Add boolean value is_weekend
weekend_mask = df['day_of_week'].isin([5, 6])
df['is_weekend'] = np.where(weekend_mask, 1, 0)

In [125]:
# Add boolean value is_peak_hour (weekdays, hours 7 to 9 and 16 to 18 inclusive, i.e. 7:00-9:59 and 16:00-18:59)
peak_hour_mask = ~weekend_mask & df['hour'].isin([7, 8, 9, 16, 17, 18])
df['is_peak_hour'] = np.where(peak_hour_mask, 1, 0)
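
As a quick sanity check of the two flags, the rules can be applied to a handful of hand-picked rows. This is only an illustrative sketch on hypothetical data; `calendar` is not part of the notebook's dataframe.

```python
import pandas as pd

# Hypothetical rows: Monday 8h, Monday 12h, Saturday 8h, Friday 18h
calendar = pd.DataFrame({"day_of_week": [0, 0, 5, 4], "hour": [8, 12, 8, 18]})

is_weekend = calendar["day_of_week"].isin([5, 6])  # Saturday = 5, Sunday = 6
is_peak = ~is_weekend & calendar["hour"].isin([7, 8, 9, 16, 17, 18])

calendar["is_weekend"] = is_weekend.astype(int)
calendar["is_peak_hour"] = is_peak.astype(int)
```

Only the Monday 8h and Friday 18h rows are flagged as peak hours; the Saturday 8h row is excluded because it falls on a weekend.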

In [126]:
# Drop datetime columns
df = df.drop(['rt_arrival_time', 'rt_departure_time', 'sch_arrival_time', 'sch_departure_time'], axis=1)

### Use Label Encoding for vehicle_id, route_id and stop_id

In [127]:
le_vehicle = LabelEncoder()
df['vehicle_id'] = le_vehicle.fit_transform(df['vehicle_id'])

In [128]:
le_route = LabelEncoder()
df['route_id'] = le_route.fit_transform(df['route_id'])

In [129]:
le_stop = LabelEncoder()
df['stop_id'] = le_stop.fit_transform(df['stop_id'])
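
The short sketch below, on hypothetical route ids, illustrates two properties of `LabelEncoder` that are worth keeping in mind: codes follow the sorted order of the ids, and an id that was not seen during fitting cannot be transformed.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical route ids
route_ids = np.array([165, 49, 69, 49])

encoder = LabelEncoder()
codes = encoder.fit_transform(route_ids)     # classes are sorted: 49, 69, 165
restored = encoder.inverse_transform(codes)  # maps codes back to the original ids

# An id that was not present at fit time raises a ValueError
try:
    encoder.transform([999])
    unseen_accepted = True
except ValueError:
    unseen_accepted = False
```

Because of the second property, new vehicles, routes or stops appearing at prediction time would need explicit handling, for instance refitting the encoders or filtering those rows.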

### Use Ordinal Encoding for Occupancy Status and Delay Class

In [130]:
# Create occupancy map (status label -> ordinal code)
occ_map = {status: number for number, status in OCCUPANCY_STATUS.items()}

occ_map

{'Unknown': 0,
 'Empty': 1,
 'Many seats available': 2,
 'Few seats available': 3,
 'Standing room only': 4,
 'Crushed standing room only': 5,
 'Full': 6,
 'Not accepting passengers': 7}

In [131]:
# Map values
df['occupancy_status'] = df['occupancy_status'].map(occ_map)
df['occupancy_status'].value_counts()

occupancy_status
1    101535
2     76302
3     52858
5      3596
0       155
Name: count, dtype: int64
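
One caveat with `Series.map` is that any label missing from the dictionary silently becomes `NaN`. The sketch below shows one way to surface such labels. `OCCUPANCY_STATUS_EXAMPLE` is a hypothetical, shortened stand-in for `OCCUPANCY_STATUS`, assumed to map codes to labels as the printed `occ_map` suggests.

```python
import pandas as pd

# Assumption: shortened stand-in for OCCUPANCY_STATUS (code -> label)
OCCUPANCY_STATUS_EXAMPLE = {
    0: "Unknown",
    1: "Empty",
    2: "Many seats available",
    3: "Few seats available",
}

# Invert it to get label -> code
label_to_code = {label: code for code, label in OCCUPANCY_STATUS_EXAMPLE.items()}

# Hypothetical column containing one label that is not in the map
statuses = pd.Series(["Empty", "Few seats available", "Standing"])
encoded = statuses.map(label_to_code)

# Labels that failed to map show up as NaN in the encoded column
unmapped = statuses[encoded.isna()].unique().tolist()
```

An empty `unmapped` list confirms that every status was encoded.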

In [132]:
# Create delay map (class label -> ordinal code)
delay_map = {cat: number for number, cat in DELAY_CLASS.items()}

delay_map

{'Very Early': 0, 'Early': 1, 'On Time': 2, 'Late': 3, 'Very Late': 4}

In [133]:
# Map values
df['delay_class'] = df['delay_class'].map(delay_map)
df['delay_class'].value_counts()

delay_class
2    186276
3     26111
4      9279
0      7658
1      5122
Name: count, dtype: int64

### Use One Hot Encoding for Weather and Schedule Relationship

In [134]:
df['weather'].value_counts()

weather
Overcast            100355
Clear sky            69098
Mainly clear         22403
Partly cloudy        17453
Light drizzle        14487
Slight rain           7652
Dense drizzle         2264
Moderate drizzle       734
Name: count, dtype: int64

In [135]:
# Collapse categories
df['weather'] = np.where(df['weather'].isin(['Mainly clear', 'Partly cloudy', 'Overcast']), 'Cloudy', df['weather'])
df['weather'] = np.where(df['weather'].isin(['Light drizzle', 'Moderate drizzle', 'Dense drizzle']), 'Drizzle', df['weather'])
df['weather'] = np.where(df['weather'].isin(['Slight rain', 'Moderate rain']), 'Rain', df['weather'])
df['weather'].value_counts()

weather
Cloudy       140211
Clear sky     69098
Drizzle       17485
Rain           7652
Name: count, dtype: int64
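
The same grouping can also be expressed as a single lookup table, which keeps the category definitions in one place. This is an alternative sketch on hypothetical data, not the approach used in the cell above; `WEATHER_GROUPS` is a name introduced for illustration.

```python
import pandas as pd

# Hypothetical lookup table: detailed label -> collapsed category
WEATHER_GROUPS = {
    "Mainly clear": "Cloudy",
    "Partly cloudy": "Cloudy",
    "Overcast": "Cloudy",
    "Light drizzle": "Drizzle",
    "Moderate drizzle": "Drizzle",
    "Dense drizzle": "Drizzle",
    "Slight rain": "Rain",
    "Moderate rain": "Rain",
}

weather = pd.Series(["Overcast", "Clear sky", "Dense drizzle", "Slight rain"])

# Labels absent from the table (here "Clear sky") are left unchanged
collapsed = weather.replace(WEATHER_GROUPS)
```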

In [136]:
# Use One Hot Encoding for weather
one_hot = pd.get_dummies(df['weather'], drop_first=True, dtype='int64', prefix='weather')
df = df.join(one_hot).drop('weather', axis=1)
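
With `drop_first=True`, the first category in sorted order is dropped and becomes the implicit baseline. The small sketch below, on hypothetical values, shows that `Clear sky` plays that role here: its rows have zeros in every remaining column.

```python
import pandas as pd

# Hypothetical collapsed weather values
weather = pd.Series(["Cloudy", "Clear sky", "Drizzle", "Rain", "Clear sky"], name="weather")

dummies = pd.get_dummies(weather, drop_first=True, dtype="int64", prefix="weather")

# "Clear sky" sorts first, so it has no column of its own
baseline_rows = dummies[weather == "Clear sky"]
```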

In [137]:
# Get remaining string columns
df.select_dtypes(include='object').columns

Index(['stop_name', 'schedule_relationship'], dtype='object')

In [138]:
df['schedule_relationship'].value_counts()

schedule_relationship
Scheduled    234094
Skipped         348
No Data           4
Name: count, dtype: int64

In [139]:
# Use One Hot Encoding for schedule_relationship
one_hot = pd.get_dummies(df['schedule_relationship'], drop_first=True, dtype='int64', prefix='schedule_relationship')
df = df.join(one_hot).drop('schedule_relationship', axis=1)

## Export Data

In [140]:
df.columns

Index(['trip_id', 'vehicle_id', 'vehicle_in_transit', 'vehicle_bearing',
       'vehicle_speed', 'occupancy_status', 'route_id', 'stop_id', 'stop_name',
       'stop_lat', 'stop_lon', 'stop_distance', 'stop_sequence',
       'trip_progress', 'stop_has_alert', 'wheelchair_boarding', 'delay',
       'delay_class', 'temperature', 'precipitation', 'windspeed',
       'incident_nearby', 'month', 'day_of_week', 'hour', 'is_weekend',
       'is_peak_hour', 'weather_Cloudy', 'weather_Drizzle', 'weather_Rain',
       'schedule_relationship_Scheduled', 'schedule_relationship_Skipped'],
      dtype='object')

**Columns to drop**

`trip_id`: A trip is already determined by its vehicle, route and date.<br>
`stop_name`: Redundant with `stop_id`.

`delay` is kept alongside `delay_class`: it is the target, in seconds, of the regression model.

In [141]:
# Keep relevant columns and reorder
df = df[[
    'vehicle_id',
    'vehicle_in_transit',
    'vehicle_bearing',
    'vehicle_speed',
    'occupancy_status',
    'route_id',
    'stop_id',
    'stop_lat',
    'stop_lon',
    'stop_distance',
    'stop_sequence',
    'trip_progress',
    'stop_has_alert',
    'wheelchair_boarding',
    'schedule_relationship_Scheduled',
    'schedule_relationship_Skipped',
    'temperature',
    'precipitation',
    'windspeed',
    'month',
    'day_of_week',
    'hour',
    'is_weekend',
    'is_peak_hour',
    'weather_Cloudy',
    'weather_Drizzle',
    'weather_Rain',
    'incident_nearby',
    'delay',
    'delay_class',
]]

In [142]:
# Export encoders
encoders = {
    'le_vehicle': le_vehicle,
    'le_route': le_route,
    'le_stop': le_stop,
}

with open('../models/label_encoders.pkl', 'wb') as handle:
    pickle.dump(encoders, handle, protocol=pickle.HIGHEST_PROTOCOL)
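
The exported encoders are meant to be reloaded later so that new data receives the same codes. Here is a minimal round-trip sketch, assuming a temporary directory in place of `../models/` and a hypothetical encoder fitted on three route ids.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.preprocessing import LabelEncoder

# Hypothetical encoder fitted on three route ids
le_route = LabelEncoder().fit([49, 69, 165])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "label_encoders.pkl"

    # Save the encoders the same way as in the cell above
    with open(path, "wb") as handle:
        pickle.dump({"le_route": le_route}, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Reload them, as a later notebook or script would
    with open(path, "rb") as handle:
        loaded = pickle.load(handle)

# The reloaded encoder assigns the same codes as the original one
new_codes = loaded["le_route"].transform([165, 49])
```

Pickled scikit-learn objects are generally best reloaded with the same scikit-learn version that produced them.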

In [143]:
# Check that all columns are numeric
len(df.columns) == len(df.select_dtypes([np.number]).columns)

True
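
If this check ever returns `False`, it helps to know which columns are responsible. A possible sketch, on a hypothetical frame with one leftover string column:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one column that was not encoded
frame = pd.DataFrame({
    "stop_id": [1, 2],
    "delay": [92.0, 374.0],
    "stop_name": ["A", "B"],
})

numeric_columns = frame.select_dtypes(include=[np.number]).columns

# Columns present in the frame but absent from the numeric selection
non_numeric = frame.columns.difference(numeric_columns).tolist()
```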

In [144]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 234446 entries, 0 to 236703
Data columns (total 30 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   vehicle_id                       234446 non-null  int64  
 1   vehicle_in_transit               234446 non-null  int64  
 2   vehicle_bearing                  234446 non-null  float64
 3   vehicle_speed                    234446 non-null  float64
 4   occupancy_status                 234446 non-null  int64  
 5   route_id                         234446 non-null  int64  
 6   stop_id                          234446 non-null  int64  
 7   stop_lat                         234446 non-null  float64
 8   stop_lon                         234446 non-null  float64
 9   stop_distance                    234446 non-null  float64
 10  stop_sequence                    234446 non-null  int64  
 11  trip_progress                    234446 non-null  float64
 12  stop_ha

In [145]:
# Export dataframe
df.to_csv('../data/preprocessed.csv', index=False)

## End