# STM Transit Delay Data Preprocessing

## Overview

This notebook preprocesses STM trip update, weather, and traffic data in order to build a regression model that predicts delays in seconds.

## Data Description

`trip_id`: Unique identifier for the transit trip.<br>
`vehicle_id`: Unique identifier for a vehicle.<br>
`vehicle_lat`, `vehicle_lon`: Vehicle current position.<br>
`vehicle_distance`: Vehicle distance from the stop, in meters.<br>
`vehicle_in_transit`: Indicates if a vehicle is in transit or if it has stopped.<br>
`vehicle_bearing`: Direction that the vehicle is facing, from 0 to 360 degrees.<br>
`vehicle_speed`: Momentary speed measured by the vehicle, in meters per second.<br>
`stop_sequence`: Sequence of the stop, for ordering.<br>
`occupancy_status`: Degree of passenger occupancy, ranging from 1 (empty) to 7 (not accepting passengers).<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`stop_id`: Unique identifier of a stop.<br>
`stop_name`: Name of the stop.<br>
`stop_lat`, `stop_lon`: Stop coordinates.<br>
`trip_progress`: How far along the trip is the vehicle, from 0 (first stop) to 1 (last stop).<br>
`wheelchair_boarding`: Indicates if the stop is accessible to people using wheelchairs.<br>
`rt_arrival_time`, `sch_arrival_time`: Realtime and scheduled arrival time, in UTC.<br>
`delay`: Difference between real and scheduled arrival time, in seconds.<br>
`temperature`: Air temperature at 2 meters above ground, in Celsius.<br>
`precipitation`: Total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters.<br>
`windspeed`: Wind speed at 10 meters above ground, in km/h.<br>
`weathercode`: World Meteorological Organization (WMO) code, see [Open-Meteo API documentation](https://open-meteo.com/en/docs#weather_variable_documentation) for the list.<br>
`incident_nearby`: Indicates if an incident happened within 1.5 km of the vehicle position.

## Imports

In [31]:
import numpy as np
import pandas as pd
import pickle
from sklearn.preprocessing import LabelEncoder
import sys

In [32]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import LOCAL_TIMEZONE, WEATHER_CODES

In [33]:
# Load data
df = pd.read_csv('../data/stm_weather_traffic_merged.csv')

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200616 entries, 0 to 200615
Data columns (total 21 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   trip_id               200616 non-null  int64  
 1   vehicle_id            200616 non-null  int64  
 2   vehicle_in_transit    200616 non-null  int64  
 3   vehicle_rel_distance  200616 non-null  float64
 4   vehicle_bearing       200616 non-null  float64
 5   vehicle_speed         200616 non-null  float64
 6   occupancy_status      200616 non-null  int64  
 7   route_id              200616 non-null  int64  
 8   stop_id               200616 non-null  int64  
 9   stop_lat              200616 non-null  float64
 10  stop_lon              200616 non-null  float64
 11  trip_progress         200616 non-null  float64
 12  wheelchair_boarding   200616 non-null  int64  
 13  rt_arrival_time       200616 non-null  object 
 14  sch_arrival_time      200616 non-null  object 
 15  

## Data Preprocessing

### Handle Outliers

In [35]:
# Compute mean and standard deviation
mean_delay = df['delay'].mean()
std_delay = df['delay'].std()

In [36]:
# Filter outliers based on standard deviation
outlier_mask = (df['delay'] < mean_delay - 3 * std_delay) | (df['delay'] > mean_delay + 3 * std_delay)

In [37]:
# Get outliers
df[outlier_mask]

Unnamed: 0,trip_id,vehicle_id,vehicle_in_transit,vehicle_rel_distance,vehicle_bearing,vehicle_speed,occupancy_status,route_id,stop_id,stop_lat,...,trip_progress,wheelchair_boarding,rt_arrival_time,sch_arrival_time,delay,temperature,precipitation,windspeed,weathercode,incident_nearby
96,283584468,30214,1,0.388767,304.0,5.00004,1,30,51335,45.540641,...,0.555556,1,2025-04-27 22:15:52+00:00,2025-04-27 21:58:48+00:00,1024.0,11.0,0.0,13.3,0,1
150,283211830,38066,1,0.703762,75.0,12.50010,1,201,57772,45.441143,...,0.776119,1,2025-04-27 22:12:21+00:00,2025-04-27 21:58:04+00:00,857.0,11.0,0.0,13.3,0,0
284,284215026,37067,0,1.000000,128.0,0.00000,1,124,50691,45.496493,...,0.487805,1,2025-04-27 22:45:04+00:00,2025-04-27 22:15:10+00:00,1794.0,10.7,0.0,13.4,0,1
290,283585000,41071,1,0.990828,302.0,3.33336,1,30,52585,45.517925,...,0.155556,0,2025-04-27 22:55:37+00:00,2025-04-27 22:40:16+00:00,921.0,10.7,0.0,13.4,0,0
291,286581685,38034,1,0.487827,261.0,2.50002,2,777,54330,45.506277,...,0.200000,1,2025-04-27 23:06:12+00:00,2025-04-27 22:42:00+00:00,1452.0,10.7,0.0,13.4,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
198914,286570340,29088,1,0.047994,85.0,5.55560,2,38,51909,45.463735,...,0.454545,1,2025-05-02 23:01:09+00:00,2025-05-02 22:45:22+00:00,947.0,11.9,0.0,16.2,0,0
198915,286570340,29088,1,0.327831,62.0,3.61114,1,38,62231,45.462978,...,0.863636,1,2025-05-02 23:15:03+00:00,2025-05-02 22:58:33+00:00,990.0,11.9,0.0,16.2,0,0
199274,284779316,31855,1,0.060582,117.0,3.33336,1,165,53865,45.499904,...,0.342857,1,2025-05-02 23:48:02+00:00,2025-05-02 23:23:26+00:00,1476.0,10.5,0.0,10.9,2,1
200302,286572332,29095,1,0.034292,49.0,6.11116,1,35,52321,45.482505,...,0.577778,1,2025-05-03 00:01:11+00:00,2025-05-02 23:46:57+00:00,854.0,10.5,0.0,10.9,2,0


In [38]:
# Get proportion of outliers
print(f'{outlier_mask.mean():.2%}')

1.20%


In [39]:
# Remove outliers (copy so later column assignments do not act on a view)
df = df[~outlier_mask].copy()

In [40]:
# Get new distribution
df['delay'].describe()

count    198203.000000
mean         53.046851
std         131.176983
min        -691.000000
25%           0.000000
50%           0.000000
75%          60.000000
max         816.000000
Name: delay, dtype: float64

After filtering, the delay ranges from about 11.5 minutes early (-691 s) to about 13.6 minutes late (816 s), which seems more reasonable.
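The three-sigma rule used above can be checked in isolation on synthetic data. The following is a minimal sketch, not the notebook's own code: `three_sigma_mask` and the synthetic `delays` series are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd


def three_sigma_mask(values: pd.Series, n_sigmas: float = 3.0) -> pd.Series:
    """Return True where a value lies more than n_sigmas standard deviations from the mean."""
    mean = values.mean()
    std = values.std()
    return (values - mean).abs() > n_sigmas * std


# Synthetic delays in seconds: mostly moderate values plus two extreme ones
rng = np.random.default_rng(0)
delays = pd.Series(np.concatenate([rng.normal(50, 100, 1000), [5000.0, -4000.0]]))

mask = three_sigma_mask(delays)
filtered = delays[~mask]
print(f'{mask.mean():.2%} flagged, {len(filtered)} rows kept')
```

Note that the mean and standard deviation are themselves pulled by the extreme values, so the threshold depends on how heavy the tails are.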

In [41]:
df.columns

Index(['trip_id', 'vehicle_id', 'vehicle_in_transit', 'vehicle_rel_distance',
       'vehicle_bearing', 'vehicle_speed', 'occupancy_status', 'route_id',
       'stop_id', 'stop_lat', 'stop_lon', 'trip_progress',
       'wheelchair_boarding', 'rt_arrival_time', 'sch_arrival_time', 'delay',
       'temperature', 'precipitation', 'windspeed', 'weathercode',
       'incident_nearby'],
      dtype='object')

### Encode Datetime

In [42]:
df['sch_arrival_time'].min()

'2025-04-27 21:55:00+00:00'

In [43]:
# Convert scheduled arrival time
df['sch_arrival_time'] = pd.to_datetime(df['sch_arrival_time'], utc=True).dt.tz_convert(LOCAL_TIMEZONE)

In [44]:
# Extract day of week and hour from the local datetime
df['day_of_week'] = df['sch_arrival_time'].dt.day_of_week
df['hour'] = df['sch_arrival_time'].dt.hour

In [45]:
# Add boolean value is_weekend
weekend_mask = df['day_of_week'].isin([5, 6])
df['is_weekend'] = np.where(weekend_mask, 1, 0)

In [46]:
# Add boolean value is_peak_hour (weekdays, hours 7-9 and 16-18 inclusive, i.e. 7:00-9:59 and 16:00-18:59)
peak_hour_mask = ~weekend_mask & df['hour'].isin([7, 8, 9, 16, 17, 18])
df['is_peak_hour'] = np.where(peak_hour_mask, 1, 0)
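The datetime features above can be verified on a few hand-picked timestamps. This sketch assumes `LOCAL_TIMEZONE` is an IANA zone name for Montréal; `'America/Toronto'` is used here as a stand-in and is an assumption, as are the sample timestamps.

```python
import pandas as pd

LOCAL_TZ = 'America/Toronto'  # assumption: stand-in for LOCAL_TIMEZONE

times = pd.Series([
    '2025-04-28 12:30:00+00:00',  # Monday, 08:30 local time (UTC-4)
    '2025-04-28 17:00:00+00:00',  # Monday, 13:00 local time
    '2025-05-03 12:30:00+00:00',  # Saturday, 08:30 local time
])

# Parse as UTC, then convert to local time before extracting features
local = pd.to_datetime(times, utc=True).dt.tz_convert(LOCAL_TZ)

features = pd.DataFrame({
    'day_of_week': local.dt.day_of_week,  # Monday is 0, Sunday is 6
    'hour': local.dt.hour,
})
features['is_weekend'] = features['day_of_week'].isin([5, 6]).astype(int)
features['is_peak_hour'] = (
    (features['is_weekend'] == 0)
    & features['hour'].isin([7, 8, 9, 16, 17, 18])
).astype(int)
print(features)
```

Converting to local time first matters: the same UTC hour falls in a different local hour, which changes both the peak-hour flag and, near midnight, the day of week.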

### Use Label Encoding for trip_id, vehicle_id, route_id and stop_id

In [47]:
le_trip = LabelEncoder()
df['trip_id'] = le_trip.fit_transform(df['trip_id'])

In [48]:
le_vehicle = LabelEncoder()
df['vehicle_id'] = le_vehicle.fit_transform(df['vehicle_id'])

In [49]:
le_route = LabelEncoder()
df['route_id'] = le_route.fit_transform(df['route_id'])

In [50]:
le_stop = LabelEncoder()
df['stop_id'] = le_stop.fit_transform(df['stop_id'])
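As a small illustration of what `LabelEncoder` does with identifiers like these, the sketch below encodes a few made-up route IDs. The sample values are assumptions chosen for the example; the point is that codes follow the sorted order of the original IDs and that labels not seen during fitting raise an error at transform time.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up route IDs for illustration
route_ids = np.array([30, 201, 124, 30, 777])

le = LabelEncoder()
encoded = le.fit_transform(route_ids)

# classes_ holds the sorted unique labels; a label's code is its index there
print(le.classes_)
print(encoded)

# The mapping is reversible
decoded = le.inverse_transform(encoded)

# Labels that were not present during fit are rejected
try:
    le.transform([999])
    unseen_accepted = True
except ValueError:
    unseen_accepted = False
print('unseen label accepted:', unseen_accepted)
```

The integer codes carry an arbitrary order, so models that treat inputs as ordinal may read meaning into it; tree-based models are usually less sensitive to this.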

### Convert weathercode to Categories

In [51]:
# Create weather code mapping
weathercodes = df['weathercode'].sort_values().unique()
condition_list = []
label_list = []

for code in weathercodes:
    condition_list.append(df['weathercode'] == code)
    label_list.append(WEATHER_CODES[code])

In [52]:
# Create categories
df['weather'] = np.select(condition_list, label_list, default='Unknown')

In [53]:
df['weather'].value_counts()

weather
Overcast         74867
Clear sky        61834
Mainly clear     22715
Partly cloudy    15178
Light drizzle    14280
Slight rain       7149
Dense drizzle     2180
Name: count, dtype: int64

In [54]:
# Collapse categories
df['weather'] = np.where(df['weather'].isin(['Mainly clear', 'Partly cloudy', 'Overcast']), 'Cloudy', df['weather'])
df['weather'] = np.where(df['weather'].isin(['Light drizzle', 'Moderate drizzle', 'Dense drizzle']), 'Drizzle', df['weather'])
df['weather'] = np.where(df['weather'].isin(['Slight rain', 'Moderate rain']), 'Rain', df['weather'])
df['weather'].value_counts()

weather
Cloudy       112760
Clear sky     61834
Drizzle       16460
Rain           7149
Name: count, dtype: int64
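The mapping and collapsing steps above can also be expressed with dictionary lookups. This is a sketch under assumptions: `WMO_LABELS` is a hypothetical stand-in for a subset of `WEATHER_CODES` (labels taken from the Open-Meteo WMO code list), and `GROUPS` is a hypothetical name for the collapsing rules.

```python
import pandas as pd

# Assumption: stand-in for part of WEATHER_CODES
WMO_LABELS = {
    0: 'Clear sky', 1: 'Mainly clear', 2: 'Partly cloudy', 3: 'Overcast',
    51: 'Light drizzle', 53: 'Moderate drizzle', 55: 'Dense drizzle',
    61: 'Slight rain', 63: 'Moderate rain',
}

# Fine-grained label -> collapsed category
GROUPS = {
    'Mainly clear': 'Cloudy', 'Partly cloudy': 'Cloudy', 'Overcast': 'Cloudy',
    'Light drizzle': 'Drizzle', 'Moderate drizzle': 'Drizzle', 'Dense drizzle': 'Drizzle',
    'Slight rain': 'Rain', 'Moderate rain': 'Rain',
}

codes = pd.Series([0, 3, 51, 61, 99])  # 99 is not in the mapping

# Codes without a label become 'Unknown'
labels = codes.map(WMO_LABELS).fillna('Unknown')

# Labels without a group are kept unchanged
weather = labels.map(lambda label: GROUPS.get(label, label))
print(weather.tolist())
```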

In [55]:
# Use One Hot Encoding
one_hot = pd.get_dummies(df['weather'], drop_first=True, dtype='int64', prefix='weather')
df = df.join(one_hot)
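With `drop_first=True`, the alphabetically first category becomes the implicit baseline. The sketch below, using a made-up `weather` series, shows that `Clear sky` gets no column and is represented by all zeros.

```python
import pandas as pd

# Made-up sample of collapsed weather categories
weather = pd.Series(['Cloudy', 'Clear sky', 'Rain', 'Drizzle', 'Clear sky'], name='weather')

one_hot = pd.get_dummies(weather, drop_first=True, dtype='int64', prefix='weather')
print(one_hot)

# Rows with the dropped category ('Clear sky') have zeros in every dummy column
baseline_rows = one_hot[weather == 'Clear sky']
```

Dropping one level avoids perfectly collinear dummy columns, which matters mainly for linear models.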

### Convert incident_category to Categories 

## Export Data

In [56]:
df.columns

Index(['trip_id', 'vehicle_id', 'vehicle_in_transit', 'vehicle_rel_distance',
       'vehicle_bearing', 'vehicle_speed', 'occupancy_status', 'route_id',
       'stop_id', 'stop_lat', 'stop_lon', 'trip_progress',
       'wheelchair_boarding', 'rt_arrival_time', 'sch_arrival_time', 'delay',
       'temperature', 'precipitation', 'windspeed', 'weathercode',
       'incident_nearby', 'day_of_week', 'hour', 'is_weekend', 'is_peak_hour',
       'weather', 'weather_Cloudy', 'weather_Drizzle', 'weather_Rain'],
      dtype='object')

**Columns to drop**

`trip_id`: A trip is associated with a vehicle, a route and a date.<br>
`sch_arrival_time`: The day of week and hour have been extracted.<br>
`rt_arrival_time`: The delay already captures it.<br>
`weather`, `weathercode`: Dummies have been created for the weather.<br>

In [None]:
# Keep relevant columns
df = df[[
    'vehicle_id',
    'vehicle_in_transit',
    'vehicle_rel_distance',
    'vehicle_bearing',
    'vehicle_speed',
    'occupancy_status',
    'route_id',
    'stop_id',
    'stop_lat',
    'stop_lon',
    'trip_progress',
    'wheelchair_boarding',
    'day_of_week',
    'hour',
    'is_weekend',
    'is_peak_hour',
    'temperature',
    'precipitation',
    'windspeed',
    'weather_Cloudy',
    'weather_Drizzle',
    'weather_Rain',
    'incident_nearby',
    'delay'
]]

In [58]:
# Export encoders
encoders = {
    'le_vehicle': le_vehicle,
    'le_route': le_route,
    'le_stop': le_stop,
}

with open('../models/label_encoders.pkl', 'wb') as handle:
    pickle.dump(encoders, handle, protocol=pickle.HIGHEST_PROTOCOL)
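For reference, a pickled encoder dictionary like this one can be restored and reused on new identifiers. The sketch below is an assumption about later use, not part of the notebook: it uses an in-memory buffer instead of `../models/label_encoders.pkl` and a made-up encoder fitted on three route IDs.

```python
import io
import pickle

from sklearn.preprocessing import LabelEncoder

# Made-up encoder standing in for le_route
le_route = LabelEncoder().fit([30, 124, 201])

# Serialize to an in-memory buffer (a file handle works the same way)
buffer = io.BytesIO()
pickle.dump({'le_route': le_route}, buffer, protocol=pickle.HIGHEST_PROTOCOL)

# Restore and apply the same mapping to new rows
buffer.seek(0)
restored = pickle.load(buffer)
codes = restored['le_route'].transform([201, 30])
print(codes)
```

Pickles should only be loaded from trusted sources, and loading generally works best with the same scikit-learn version that wrote them.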

In [59]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 198203 entries, 0 to 200615
Data columns (total 25 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   trip_id               198203 non-null  int64  
 1   vehicle_id            198203 non-null  int64  
 2   vehicle_in_transit    198203 non-null  int64  
 3   vehicle_rel_distance  198203 non-null  float64
 4   vehicle_bearing       198203 non-null  float64
 5   vehicle_speed         198203 non-null  float64
 6   occupancy_status      198203 non-null  int64  
 7   route_id              198203 non-null  int64  
 8   stop_id               198203 non-null  int64  
 9   stop_lat              198203 non-null  float64
 10  stop_lon              198203 non-null  float64
 11  trip_progress         198203 non-null  float64
 12  wheelchair_boarding   198203 non-null  int64  
 13  day_of_week           198203 non-null  int32  
 14  hour                  198203 non-null  int32  
 15  is_we

In [60]:
# Export dataframe
df.to_csv('../data/preprocessed.csv', index=False)

## End