# STM Transit Delay Data Preprocessing

This notebook preprocesses STM trip updates and weather data to prepare them for a tree-based regression model that predicts delays in seconds.

## Data Description

`trip_id` unique identifier of a trip<br>
`vehicle_id` unique identifier of a vehicle<br>
`occupancy_status` degree of passenger occupancy<br>
`route_id` bus or metro line<br>
`stop_id` stop number<br>
`stop_lat` stop latitude<br>
`stop_lon` stop longitude<br>
`stop_sequence` sequence of the stop, for ordering<br>
`trip_progress` how far along the trip the bus is, from 0 to 1<br>
`wheelchair_boarding` indicates whether the stop is accessible to people in wheelchairs, 1 being true and 2 being false<br>
`realtime_arrival_time` actual arrival time, as a Unix timestamp in milliseconds<br>
`scheduled_arrival_time` planned arrival time, as a Unix timestamp in milliseconds<br>
`temperature` air temperature at 2 meters above ground, in Celsius<br>
`precipitation` total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters<br>
`windspeed` wind speed at 10 meters above ground, in km/h<br>
`weathercode` World Meteorological Organization (WMO) code<br>
`incident_nearby` indicates if an incident happened within 500 meters when the vehicle arrived at the stop
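
The `delay` target is not described in the list above. In the sample rows printed in the Handle Outliers section, it matches the difference between the two arrival timestamps converted from milliseconds to seconds. A minimal sketch under that assumption (the helper name `compute_delay_seconds` is hypothetical and not part of the project code):

```python
import pandas as pd

def compute_delay_seconds(realtime_ms: pd.Series, scheduled_ms: pd.Series) -> pd.Series:
    """Delay in seconds; positive means late, negative means early."""
    return (realtime_ms - scheduled_ms) / 1000

# Two timestamp pairs copied from the outlier rows shown in this notebook
sample = pd.DataFrame({
    'realtime_arrival_time': [1745792152000, 1746057203000],
    'scheduled_arrival_time': [1745791128000, 1746058560000],
})
sample['delay'] = compute_delay_seconds(
    sample['realtime_arrival_time'], sample['scheduled_arrival_time']
)
print(sample['delay'].tolist())
```

The two results agree with the `delay` values of the corresponding rows, which suggests (but does not prove) that the merged file was built this way.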

## Imports

In [1]:
import numpy as np
import pandas as pd
import pickle
from sklearn.preprocessing import LabelEncoder
import sys

In [20]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import INCIDENT_CATEGORIES, LOCAL_TIMEZONE, STOP_STATUS, WEATHER_CODES

In [3]:
# Load data
df = pd.read_csv('../data/stm_weather_traffic_merged.csv')

## Data Preprocessing

### Handle Outliers

In [4]:
# Compute mean and standard deviation
mean_delay = df['delay'].mean()
std_delay = df['delay'].std()

In [5]:
# Filter outliers based on standard deviation
outlier_mask = (df['delay'] < mean_delay - 3 * std_delay) | (df['delay'] > mean_delay + 3 * std_delay)

In [6]:
# Get outliers
df[outlier_mask]

Unnamed: 0,trip_id,vehicle_id,vehicle_lat,vehicle_lon,vehicle_distance,vehicle_status,vehicle_bearing,vehicle_speed,occupancy_status,route_id,...,trip_progress,wheelchair_boarding,realtime_arrival_time,scheduled_arrival_time,delay,temperature,precipitation,windspeed,weathercode,incident_nearby
97,283584468,30214,45.539738,-73.611061,177.873373,2,304.0,5.00004,1,30,...,0.555556,1,1745792152000,1745791128000,1024.0,11.0,0.0,13.3,0,0.0
98,283584468,30214,45.556622,-73.667496,58.519512,2,149.0,2.50002,1,30,...,1.000000,1,1745793124000,1745792100000,1024.0,10.7,0.0,13.4,0,0.0
289,284215026,37067,45.496548,-73.642517,7.096642,1,128.0,0.00000,1,124,...,0.487805,1,1745793904000,1745792110000,1794.0,10.7,0.0,13.4,0,0.0
295,283585000,41071,45.517879,-73.566330,5.317782,2,302.0,3.33336,1,30,...,0.155556,0,1745794537000,1745793616000,921.0,10.7,0.0,13.4,0,0.0
297,286581685,38034,45.508125,-73.532669,612.566331,2,261.0,2.50002,2,777,...,0.200000,1,1745795172000,1745793720000,1452.0,10.7,0.0,13.4,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131867,286572213,44019,45.429012,-73.608215,247.662553,2,0.0,0.00000,3,110,...,0.808219,0,1746058957000,1746057683000,1274.0,3.7,0.0,7.7,3,0.0
132521,285285011,31064,45.589542,-73.537331,2.095260,1,0.0,0.00000,1,811,...,0.045455,1,1746058320000,1746055620000,2700.0,3.7,0.0,7.7,3,0.0
132611,285028811,31081,45.615032,-73.618874,7.101719,1,0.0,0.00000,1,49,...,0.634615,1,1746057203000,1746058560000,-1357.0,3.7,0.0,7.7,3,0.0
132612,285028811,31081,45.591904,-73.645943,131.202951,2,209.0,8.88896,1,49,...,0.923077,1,1746057203000,1746059284000,-2081.0,3.7,0.0,7.7,3,0.0


In [7]:
# Get proportion of outliers
print(f'{outlier_mask.mean():.2%}')

1.08%


In [8]:
# Remove outliers (copy so that later column assignments do not trigger SettingWithCopyWarning)
df = df[~outlier_mask].copy()

In [9]:
# Get new distribution
df['delay'].describe()

count    131633.000000
mean         55.956371
std         139.738659
min        -781.000000
25%           0.000000
50%           0.000000
75%          71.000000
max         913.000000
Name: delay, dtype: float64

After filtering, the delay ranges from about 13 minutes early (-781 s) to about 15 minutes late (913 s), which seems more reasonable.
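
The three-sigma rule used in this section can be wrapped in a small reusable function. The sketch below is illustrative only: `three_sigma_mask` is a hypothetical helper and the delays are made-up values, not STM data.

```python
import pandas as pd

def three_sigma_mask(values: pd.Series, n_std: float = 3.0) -> pd.Series:
    """True where a value lies more than n_std standard deviations from the mean."""
    mean, std = values.mean(), values.std()
    return (values < mean - n_std * std) | (values > mean + n_std * std)

# Synthetic delays (made-up values): 99 small delays plus one extreme value
delays = pd.Series([0.0] * 50 + [60.0] * 49 + [5000.0])
mask = three_sigma_mask(delays)

print(mask.sum())             # number of flagged rows
print(delays[~mask].max())    # largest delay kept after filtering
```

Because the mean and standard deviation are themselves pulled by extreme values, this rule can be lenient on heavily skewed data; a quantile-based threshold would be one alternative to compare against.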

In [10]:
df.columns

Index(['trip_id', 'vehicle_id', 'vehicle_lat', 'vehicle_lon',
       'vehicle_distance', 'vehicle_status', 'vehicle_bearing',
       'vehicle_speed', 'occupancy_status', 'route_id', 'stop_id', 'stop_lat',
       'stop_lon', 'stop_sequence', 'trip_progress', 'wheelchair_boarding',
       'realtime_arrival_time', 'scheduled_arrival_time', 'delay',
       'temperature', 'precipitation', 'windspeed', 'weathercode',
       'incident_nearby'],
      dtype='object')

### Encode Datetime

In [11]:
# Convert real and scheduled timestamps
df['realtime_arrival_time'] = pd.to_datetime(df['realtime_arrival_time'], origin='unix', unit='ms', utc=True)
df['scheduled_arrival_time'] = pd.to_datetime(df['scheduled_arrival_time'], origin='unix', unit='ms', utc=True)

In [12]:
# Convert arrival times to local timezone
df['realtime_arrival_time'] = df['realtime_arrival_time'].dt.tz_convert(LOCAL_TIMEZONE)
df['scheduled_arrival_time'] = df['scheduled_arrival_time'].dt.tz_convert(LOCAL_TIMEZONE)
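
The conversion matters because the day and hour features extracted next depend on the timezone. The sketch below converts one timestamp from the sample rows. The value of `LOCAL_TIMEZONE` is not shown in this notebook, so `'America/Toronto'` is an assumption used only for illustration.

```python
import pandas as pd

# Assumption: stand-in for LOCAL_TIMEZONE, whose value is not shown in this notebook
ASSUMED_TIMEZONE = 'America/Toronto'

# realtime_arrival_time of one sample row, as a Unix timestamp in milliseconds
ts_ms = pd.Series([1745792152000])

utc_time = pd.to_datetime(ts_ms, unit='ms', utc=True)
local_time = utc_time.dt.tz_convert(ASSUMED_TIMEZONE)

print(utc_time.iloc[0])
print(local_time.iloc[0])
print(local_time.dt.day_of_week.iloc[0], local_time.dt.hour.iloc[0])
```

Under this assumption the arrival falls in the evening peak window in local time, whereas the UTC hour would place it four hours later.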

In [13]:
# Convert datetimes to day and hour
df['day'] = df['realtime_arrival_time'].dt.day_of_week
df['hour'] = df['realtime_arrival_time'].dt.hour

df['sch_day'] = df['scheduled_arrival_time'].dt.day_of_week
df['sch_hour'] = df['scheduled_arrival_time'].dt.hour

In [14]:
# Use cyclical encoding for day and hour so that the model can represent
# the wrap-around (e.g. hour 23 is adjacent to hour 0)
df['day_sin'] = np.sin(2 * np.pi * df['day'] / 7)
df['day_cos'] = np.cos(2 * np.pi * df['day'] / 7)

df['sch_day_sin'] = np.sin(2 * np.pi * df['sch_day'] / 7)
df['sch_day_cos'] = np.cos(2 * np.pi * df['sch_day'] / 7)

df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

df['sch_hour_sin'] = np.sin(2 * np.pi * df['sch_hour'] / 24)
df['sch_hour_cos'] = np.cos(2 * np.pi * df['sch_hour'] / 24)
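
To illustrate the wrap-around property, the sketch below encodes a few hours and compares distances between the resulting points on the unit circle. `cyclical_encode` is a hypothetical helper that applies the same sine and cosine formula as the cell above.

```python
import numpy as np

def cyclical_encode(values, period):
    """Map values on a cycle of length `period` to (sin, cos) coordinates."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)

hours = np.array([23, 0, 1, 12])
hour_sin, hour_cos = cyclical_encode(hours, 24)
points = np.column_stack([hour_sin, hour_cos])

# Euclidean distances between encoded hours
dist_23_to_0 = np.linalg.norm(points[0] - points[1])
dist_0_to_1 = np.linalg.norm(points[1] - points[2])
dist_0_to_12 = np.linalg.norm(points[1] - points[3])
print(dist_23_to_0, dist_0_to_1, dist_0_to_12)
```

Hours 23 and 0 end up as close as hours 0 and 1, while hours 0 and 12 sit at opposite sides of the circle. With the raw integer hour, 23 and 0 would be the farthest apart.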

In [15]:
# Add boolean value is_weekend
weekend_mask = df['day'].isin([5, 6])
df['is_weekend'] = np.where(weekend_mask, 1, 0)

In [16]:
# Add boolean value is_peak_hour (weekdays, hours 7-9 and 16-18 inclusive, i.e. 7:00-9:59 and 16:00-18:59)
peak_hour_mask = ~weekend_mask & df['hour'].isin([7, 8, 9, 16, 17, 18])
df['is_peak_hour'] = np.where(peak_hour_mask, 1, 0)
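
As a quick check of the two flags, the sketch below applies the same rules to a small made-up frame (day 0 is Monday, as returned by `dt.day_of_week`).

```python
import pandas as pd

# Made-up rows: Monday 8h, Monday 12h, Saturday 8h, Friday 18h
sample = pd.DataFrame({
    'day': [0, 0, 5, 4],
    'hour': [8, 12, 8, 18],
})

weekend_mask = sample['day'].isin([5, 6])
peak_hour_mask = ~weekend_mask & sample['hour'].isin([7, 8, 9, 16, 17, 18])

sample['is_weekend'] = weekend_mask.astype(int)
sample['is_peak_hour'] = peak_hour_mask.astype(int)
print(sample)
```

The Saturday 8h row is flagged as weekend but not as peak hour, since the peak flag only applies to weekdays.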

### Use Label Encoding for vehicle_id, route_id and stop_id

In [17]:
le_vehicle = LabelEncoder()
df['vehicle_id'] = le_vehicle.fit_transform(df['vehicle_id'])

In [18]:
le_route = LabelEncoder()
df['route_id'] = le_route.fit_transform(df['route_id'])

In [19]:
le_stop = LabelEncoder()
df['stop_id'] = le_stop.fit_transform(df['stop_id'])
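
A fitted `LabelEncoder` stores the sorted unique values in `classes_`, and its `transform` method raises a `ValueError` for values it did not see during fitting. If new vehicles, routes or stops can appear at prediction time, one possible workaround is to map unseen values to a sentinel. The sketch below is an assumption about how that could be done, not part of this project: `encode_with_unknown` is a hypothetical helper, and the array stands in for `le_route.classes_`.

```python
import numpy as np

def encode_with_unknown(values, classes, unknown=-1):
    """Encode values by their position in the sorted `classes` array.

    Values that are not present in `classes` are mapped to `unknown`.
    """
    values = np.asarray(values)
    classes = np.asarray(classes)
    positions = np.searchsorted(classes, values)
    positions = np.clip(positions, 0, len(classes) - 1)
    known = classes[positions] == values
    return np.where(known, positions, unknown)

# Stand-in for le_route.classes_ (sorted, as LabelEncoder stores them)
route_classes = np.array([30, 49, 110, 124])
encoded = encode_with_unknown([124, 30, 999, 5], route_classes)
print(encoded)
```

Whether a sentinel such as -1 is appropriate depends on the downstream model; a tree-based model can split on it, but it carries no information about the unseen entity.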

### Convert vehicle status to categories

In [21]:
# Create status code mapping
status_codes = df['vehicle_status'].sort_values().unique()
condition_list = []
label_list = []

for code in status_codes:
  condition_list.append(df['vehicle_status'] == code)
  label_list.append(STOP_STATUS[code])

In [22]:
# Create categories
df['vehicle_status'] = np.select(condition_list, label_list, default='Unknown')

In [23]:
df['vehicle_status'].value_counts()

vehicle_status
In Transit To    90860
Stopped At       40773
Name: count, dtype: int64
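
The same code-to-label conversion can also be written with `Series.map`, which avoids building the condition and label lists. In the sketch below, `ASSUMED_STOP_STATUS` is a stand-in for `STOP_STATUS` (whose contents are not shown in this notebook), based on the GTFS-realtime `VehicleStopStatus` values.

```python
import pandas as pd

# Assumption: stand-in for STOP_STATUS, following GTFS-realtime VehicleStopStatus
ASSUMED_STOP_STATUS = {0: 'Incoming At', 1: 'Stopped At', 2: 'In Transit To'}

codes = pd.Series([2, 1, 2, 7])  # 7 is a made-up code with no label

# Codes missing from the mapping become NaN, then 'Unknown'
labels = codes.map(ASSUMED_STOP_STATUS).fillna('Unknown')
print(labels.tolist())
```

Unlike indexing the dictionary inside a loop, this version does not raise a `KeyError` when a code has no label.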

In [27]:
# Use One Hot Encoding
one_hot = pd.get_dummies(df['vehicle_status'], drop_first=True, dtype='int64', prefix='vehicle_status')
df = df.join(one_hot)

### Convert weathercode to Categories

In [24]:
# Create weather code mapping
weathercodes = df['weathercode'].sort_values().unique()
condition_list = []
label_list = []

for code in weathercodes:
  condition_list.append(df['weathercode'] == code)
  label_list.append(WEATHER_CODES[code])

In [25]:
# Create categories
df['weather'] = np.select(condition_list, label_list, default='Unknown')

In [26]:
df['weather'].value_counts()

weather
Clear sky        72641
Overcast         35254
Mainly clear     14806
Light drizzle     4605
Partly cloudy     4327
Name: count, dtype: int64

In [28]:
# Use One Hot Encoding
one_hot = pd.get_dummies(df['weather'], drop_first=True, dtype='int64', prefix='weather')
df = df.join(one_hot)
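
With `drop_first=True`, the first category in sorted order is dropped and becomes the implicit baseline, which is why no `weather_Clear sky` column appears in the exported data. The sketch below shows this on a few made-up rows.

```python
import pandas as pd

weather = pd.Series(['Clear sky', 'Overcast', 'Mainly clear', 'Clear sky'], name='weather')
one_hot = pd.get_dummies(weather, drop_first=True, dtype='int64', prefix='weather')

print(one_hot.columns.tolist())
# Rows with the baseline category have zeros in every remaining column
print(one_hot.sum(axis=1).tolist())
```

A side effect worth keeping in mind is that the set of columns depends on the categories present in the data, so a dataset with different weather conditions would produce different columns.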

### Convert incident_category to Categories 

## Export Data

In [29]:
df.columns

Index(['trip_id', 'vehicle_id', 'vehicle_lat', 'vehicle_lon',
       'vehicle_distance', 'vehicle_status', 'vehicle_bearing',
       'vehicle_speed', 'occupancy_status', 'route_id', 'stop_id', 'stop_lat',
       'stop_lon', 'stop_sequence', 'trip_progress', 'wheelchair_boarding',
       'realtime_arrival_time', 'scheduled_arrival_time', 'delay',
       'temperature', 'precipitation', 'windspeed', 'weathercode',
       'incident_nearby', 'day', 'hour', 'sch_day', 'sch_hour', 'day_sin',
       'day_cos', 'sch_day_sin', 'sch_day_cos', 'hour_sin', 'hour_cos',
       'sch_hour_sin', 'sch_hour_cos', 'is_weekend', 'is_peak_hour', 'weather',
       'vehicle_status_Stopped At', 'weather_Light drizzle',
       'weather_Mainly clear', 'weather_Overcast', 'weather_Partly cloudy'],
      dtype='object')

In [30]:
# Keep relevant columns
df = df[[
    'vehicle_id',
    'vehicle_lat',
    'vehicle_lon',
    'vehicle_distance',
    'vehicle_status_Stopped At',
    'vehicle_bearing',
    'vehicle_speed',
    'occupancy_status',
    'route_id',
    'stop_id',
    'stop_lat',
    'stop_lon',
    'stop_sequence',
    'trip_progress',
    'wheelchair_boarding',
    'day_sin',
    'day_cos',
    'sch_day_sin',
    'sch_day_cos',
    'hour_sin',
    'hour_cos',
    'sch_hour_sin',
    'sch_hour_cos',
    'is_weekend',
    'is_peak_hour',
    'temperature',
    'precipitation',
    'windspeed',
    'weather_Light drizzle',
    'weather_Mainly clear',
    'weather_Overcast',
    'weather_Partly cloudy',
    'incident_nearby',
    'delay'
]]

In [31]:
# Export encoders
encoders = {
    'le_vehicle': le_vehicle,
    'le_route': le_route,
    'le_stop': le_stop,
}

with open('../models/label_encoders.pkl', 'wb') as handle:
    pickle.dump(encoders, handle, protocol=pickle.HIGHEST_PROTOCOL)
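
The exported file can be read back with `pickle.load` when the encoders are needed again, for example at prediction time. The round-trip sketch below uses a temporary directory and plain lists as stand-ins for the fitted encoders, so it does not touch `../models`.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the fitted encoders; plain lists replace the LabelEncoder objects
encoders = {'le_vehicle': [30214, 37067], 'le_route': [30, 124], 'le_stop': [1, 2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'label_encoders.pkl'

    # Write the dictionary, then read it back
    with open(path, 'wb') as handle:
        pickle.dump(encoders, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, 'rb') as handle:
        restored = pickle.load(handle)

print(sorted(restored))
```

Pickle files should only be loaded from trusted sources, and pickled scikit-learn objects are best loaded with the same library version that created them.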

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 131633 entries, 0 to 133074
Data columns (total 34 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   vehicle_id                 131633 non-null  int64  
 1   vehicle_lat                131633 non-null  float64
 2   vehicle_lon                131633 non-null  float64
 3   vehicle_distance           131633 non-null  float64
 4   vehicle_status_Stopped At  131633 non-null  int64  
 5   vehicle_bearing            131633 non-null  float64
 6   vehicle_speed              131633 non-null  float64
 7   occupancy_status           131633 non-null  int64  
 8   route_id                   131633 non-null  int64  
 9   stop_id                    131633 non-null  int64  
 10  stop_lat                   131633 non-null  float64
 11  stop_lon                   131633 non-null  float64
 12  stop_sequence              131633 non-null  int64  
 13  trip_progress              131633 

In [33]:
# Export dataframe
df.to_csv('../data/preprocessed.csv', index=False)

## End