# STM Transit Delay Data Preprocessing

## Overview

This notebook preprocesses STM trip update, weather, and traffic data in order to build regression and classification models that predict delays, which are measured in seconds.

## Data Description

`trip_id`: Unique identifier for the transit trip.<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`stop_id`: Unique identifier of a stop.<br>
`stop_name`: Name of the stop.<br>
`stop_lat`, `stop_lon`: Stop coordinates.<br>
`stop_distance`: Distance between the previous and current stop, in meters.<br>
`stop_sequence`: Sequence of the stop, for ordering.<br>
`trip_progress`: How far along the trip the vehicle is, from 0 (first stop) to 1 (last stop).<br>
`stop_has_alert`: Indicates if there's a message about the stop being moved or cancelled.<br>
`schedule_relationship`: State of the schedule: "scheduled", "skipped" or "no data".<br>
`wheelchair_boarding`: Indicates if the stop is accessible to people using a wheelchair.<br>
`rt_arrival_time`, `rt_departure_time`, `sch_arrival_time`, `sch_departure_time`: Realtime and scheduled times, in UTC.<br>
`delay`: Difference between real and scheduled arrival time, in seconds.<br>
`delay_class`: Delay category, from early to late.<br>
`incident_count`: Number of incidents within 1 km of the stop.<br>
`incident_nearby`: Indicates if an incident happened within 1 km of the stop.<br>
`temperature`: Air temperature at 2 meters above ground, in Celsius.<br>
`precipitation`: Total precipitation (rain, showers, snow) over the preceding hour, in millimeters.<br>
`windspeed`: Wind speed at 10 meters above ground, in km/h.<br>
`weather`: Weather description (e.g., Light drizzle).<br>
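
As a minimal illustration of how `delay` relates to the timestamp columns, the sketch below subtracts a scheduled arrival from a realtime arrival. The timestamps are made up for the example, and the actual computation happens upstream of this notebook, so this is an assumption about the definition rather than the pipeline's own code.

```python
import pandas as pd

# Made-up UTC timestamps, for illustration only
sch_arrival = pd.to_datetime(pd.Series(['2025-05-05 12:30:00', '2025-05-05 12:45:00']), utc=True)
rt_arrival = pd.to_datetime(pd.Series(['2025-05-05 12:31:30', '2025-05-05 12:44:00']), utc=True)

# Positive values mean the vehicle arrived late, negative values mean it arrived early
delay_seconds = (rt_arrival - sch_arrival).dt.total_seconds()
print(delay_seconds.tolist())
```

With these values, the first stop is reached 90 seconds late and the second 60 seconds early.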

## Imports

In [1]:
import numpy as np
import pandas as pd
import pickle
from sklearn.preprocessing import LabelEncoder
import sys

In [None]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import DELAY_CLASS, LOCAL_TIMEZONE

In [3]:
# Load data
df = pd.read_parquet('../data/stm_weather_traffic_merged.parquet')

## Data Preprocessing

### Handle Delay Outliers

In [4]:
df['delay'].describe()

count    5.409842e+06
mean     6.778535e+01
std      4.761799e+02
min     -1.359200e+04
25%      0.000000e+00
50%      0.000000e+00
75%      1.700000e+01
max      5.458500e+04
Name: delay, dtype: float64

In [5]:
# Compute mean and standard deviation
mean_delay = df['delay'].mean()
std_delay = df['delay'].std()

In [6]:
# Flag delays more than 3 standard deviations from the mean
outlier_mask = (df['delay'] < mean_delay - 3 * std_delay) | (df['delay'] > mean_delay + 3 * std_delay)

In [8]:
# Get proportion of outliers
print(f'{outlier_mask.mean():.2%}')

0.52%


In [9]:
# Remove outliers
df = df[~outlier_mask].reset_index(drop=True)

In [10]:
# Get new distribution
df['delay'].describe()

count    5.381532e+06
mean     5.308398e+01
std      1.498280e+02
min     -1.360000e+03
25%      0.000000e+00
50%      0.000000e+00
75%      1.320000e+01
max      1.496000e+03
Name: delay, dtype: float64
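
The three-standard-deviation rule used above can be wrapped in a small helper and checked on toy data. This is a minimal sketch: the function name `three_sigma_mask` and the toy values are hypothetical and not part of the project code.

```python
import pandas as pd


def three_sigma_mask(values: pd.Series, n_std: float = 3.0) -> pd.Series:
    """Return True where a value lies more than n_std standard deviations from the mean."""
    mean = values.mean()
    std = values.std()
    return (values < mean - n_std * std) | (values > mean + n_std * std)


# Toy delays in seconds (made-up): 19 ordinary values and one extreme value
toy_delays = pd.Series([0.0] * 10 + [10.0] * 9 + [5000.0])

mask = three_sigma_mask(toy_delays)
filtered = toy_delays[~mask].reset_index(drop=True)

print(f'{mask.mean():.2%} flagged, {len(filtered)} rows kept')
```

One design caveat: the mean and standard deviation are themselves pulled by extreme values, so the thresholds depend on the very outliers they are meant to catch.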

### Encode Datetime

In [11]:
# Convert scheduled arrival time
df['sch_arrival_time'] = pd.to_datetime(df['sch_arrival_time'], utc=True).dt.tz_convert(LOCAL_TIMEZONE)

In [12]:
# Convert datetime to month, day and hour
df['month'] = df['sch_arrival_time'].dt.month
df['day_of_week'] = df['sch_arrival_time'].dt.day_of_week
df['hour'] = df['sch_arrival_time'].dt.hour
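
Converting to local time before extracting features matters because the UTC timestamp and the local timestamp can fall on different days. The sketch below shows this on two made-up timestamps. The value of `LOCAL_TIMEZONE` is not shown in this notebook, so `'America/Toronto'` is an assumption used only for illustration.

```python
import pandas as pd

# Assumed timezone: the real LOCAL_TIMEZONE constant lives in scripts.custom_functions
ASSUMED_TIMEZONE = 'America/Toronto'

# Made-up scheduled arrival times in UTC
utc_times = pd.to_datetime(pd.Series(['2025-05-05 12:30:00', '2025-05-11 02:00:00']), utc=True)
local_times = utc_times.dt.tz_convert(ASSUMED_TIMEZONE)

features = pd.DataFrame({
    'utc_day_of_week': utc_times.dt.day_of_week,
    'day_of_week': local_times.dt.day_of_week,
    'month': local_times.dt.month,
    'hour': local_times.dt.hour,
})
print(features)
```

Under this assumption, the second timestamp is a Sunday in UTC but a Saturday evening locally, so extracting features from the raw UTC values would mislabel the day and the hour.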

In [13]:
# Add boolean value is_weekend
weekend_mask = df['day_of_week'].isin([5, 6])
df['is_weekend'] = np.where(weekend_mask, 1, 0)

In [14]:
# Add boolean value is_peak_hour (weekdays, hours 7 to 9 and 16 to 18 inclusive, i.e. 7:00-9:59 and 16:00-18:59)
peak_hour_mask = ~weekend_mask & df['hour'].isin([7, 8, 9, 16, 17, 18])
df['is_peak_hour'] = np.where(peak_hour_mask, 1, 0)
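
The two flags can be verified on a handful of hand-picked rows. This is a small sketch on made-up values that mirrors the logic above, using `astype` on the boolean masks as an alternative to `np.where`.

```python
import pandas as pd

# Made-up rows: Monday 8h, Monday 12h, Saturday 8h, Friday 17h
toy = pd.DataFrame({'day_of_week': [0, 0, 5, 4], 'hour': [8, 12, 8, 17]})

is_weekend = toy['day_of_week'].isin([5, 6])
is_peak_hour = ~is_weekend & toy['hour'].isin([7, 8, 9, 16, 17, 18])

# Casting a boolean mask to int64 gives the same 0/1 encoding as np.where(mask, 1, 0)
toy['is_weekend'] = is_weekend.astype('int64')
toy['is_peak_hour'] = is_peak_hour.astype('int64')
print(toy)
```

The Saturday 8h row is flagged as a weekend but not as a peak hour, since peak hours only apply to weekdays.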

In [15]:
# Drop datetime columns
df = df.drop(['rt_arrival_time', 'rt_departure_time', 'sch_arrival_time', 'sch_departure_time'], axis=1)

### Use Label Encoding for trip_id, route_id and stop_id

In [16]:
le_trip = LabelEncoder()
df['trip_id'] = le_trip.fit_transform(df['trip_id'])

In [17]:
le_route = LabelEncoder()
df['route_id'] = le_route.fit_transform(df['route_id'])

In [18]:
le_stop = LabelEncoder()
df['stop_id'] = le_stop.fit_transform(df['stop_id'])
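
For reference, the sketch below shows how a fitted `LabelEncoder` behaves on made-up route identifiers: codes follow the sorted order of the classes, `inverse_transform` recovers the original identifiers, and labels not seen during fitting raise a `ValueError`.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up route identifiers, for illustration only
route_ids = ['51', '10', '51', '24']

encoder = LabelEncoder()
codes = encoder.fit_transform(route_ids)

print(encoder.classes_.tolist())                    # classes in sorted order
print(codes.tolist())                               # integer code of each input row
print(encoder.inverse_transform(codes).tolist())    # back to the original identifiers

# Labels that were not seen during fitting cannot be encoded
try:
    encoder.transform(['999'])
    unseen_label_failed = False
except ValueError:
    unseen_label_failed = True
print(unseen_label_failed)
```

The last point is worth keeping in mind when the saved encoders are applied to new data, since a trip, route, or stop that was absent at fit time would need separate handling.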

### Use Ordinal Encoding for delay_class

In [19]:
delay_map = {value: key for (key, value) in DELAY_CLASS.items()}
df['delay_class'] = df['delay_class'].map(delay_map).astype('int64')
df['delay_class'].value_counts(normalize=True)

delay_class
1    0.881166
2    0.109353
0    0.009482
Name: proportion, dtype: float64
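
The mapping step above inverts a code-to-label dictionary into a label-to-code dictionary. The contents of `DELAY_CLASS` are not shown in this notebook, so the dictionary below is a hypothetical stand-in chosen only to illustrate the mechanics.

```python
import pandas as pd

# Hypothetical stand-in for DELAY_CLASS (the real labels are defined in scripts.custom_functions)
ASSUMED_DELAY_CLASS = {0: 'Early', 1: 'On Time', 2: 'Late'}

# Invert code -> label into label -> code
delay_map = {label: code for code, label in ASSUMED_DELAY_CLASS.items()}

labels = pd.Series(['On Time', 'Late', 'Early', 'On Time'])
encoded = labels.map(delay_map)

# map() returns NaN for labels missing from the dictionary, so check before casting to int
assert encoded.notna().all(), 'found labels missing from the mapping'
encoded = encoded.astype('int64')
print(encoded.tolist())

# A label outside the mapping becomes NaN
unmapped = pd.Series(['Unknown']).map(delay_map)
print(unmapped.isna().all())
```

Checking for NaN before the cast makes a mismatch between the data and the mapping easier to diagnose than the error raised by `astype('int64')` alone.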

### Use One Hot Encoding for Weather and Schedule Relationship

In [20]:
df['weather'].value_counts()

weather
Overcast            2723768
Clear sky            839285
Mainly clear         631466
Partly cloudy        447877
Light drizzle        415693
Slight rain          161546
Moderate drizzle     118356
Dense drizzle         37193
Moderate rain          6348
Name: count, dtype: int64

In [21]:
# Collapse categories
df['weather'] = np.where(df['weather'].isin(['Mainly clear', 'Partly cloudy', 'Overcast']), 'Cloudy', df['weather'])
df['weather'] = np.where(df['weather'].isin(['Light drizzle', 'Moderate drizzle', 'Dense drizzle']), 'Drizzle', df['weather'])
df['weather'] = np.where(df['weather'].isin(['Slight rain', 'Moderate rain']), 'Rain', df['weather'])
df['weather'].value_counts()

weather
Cloudy       3803111
Clear sky     839285
Drizzle       571242
Rain          167894
Name: count, dtype: int64
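
The same grouping can also be expressed as a single lookup table, which keeps all category decisions in one place. This is an alternative sketch on made-up rows, not the approach used in the cell above; the name `WEATHER_GROUPS` is hypothetical.

```python
import pandas as pd

# Hypothetical lookup table reproducing the grouping used above
WEATHER_GROUPS = {
    'Mainly clear': 'Cloudy',
    'Partly cloudy': 'Cloudy',
    'Overcast': 'Cloudy',
    'Light drizzle': 'Drizzle',
    'Moderate drizzle': 'Drizzle',
    'Dense drizzle': 'Drizzle',
    'Slight rain': 'Rain',
    'Moderate rain': 'Rain',
}

weather = pd.Series(['Overcast', 'Clear sky', 'Dense drizzle', 'Slight rain', 'Mainly clear'])

# Values missing from the table (here 'Clear sky') are left unchanged
collapsed = weather.replace(WEATHER_GROUPS)
print(collapsed.value_counts())
```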

In [22]:
# Use One Hot Encoding for weather
one_hot = pd.get_dummies(df['weather'], drop_first=True, dtype='int64', prefix='weather')
df = df.join(one_hot).drop('weather', axis=1)
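
With `drop_first=True`, `pd.get_dummies` drops the first category in sorted order, which then acts as the baseline. The short sketch below, on made-up rows, shows that `Clear sky` is the dropped category for `weather` and is represented by zeros in every remaining column.

```python
import pandas as pd

weather = pd.Series(['Cloudy', 'Clear sky', 'Rain', 'Drizzle'], name='weather')

one_hot = pd.get_dummies(weather, drop_first=True, dtype='int64', prefix='weather')
print(one_hot)

# The baseline category has no column of its own: its row is all zeros
clear_sky_row = one_hot.loc[weather == 'Clear sky'].iloc[0]
print(clear_sky_row.tolist())
```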

In [23]:
# Get remaining string columns
df.select_dtypes(include='object').columns

Index(['stop_name', 'schedule_relationship'], dtype='object')

In [24]:
# Drop stop name, as only numerical columns should remain
df = df.drop('stop_name', axis=1)

In [25]:
# Use One Hot Encoding for schedule_relationship
one_hot = pd.get_dummies(df['schedule_relationship'], drop_first=True, dtype='int64', prefix='sch_rel')
df = df.join(one_hot).drop('schedule_relationship', axis=1)

## Export Data

In [26]:
df.columns

Index(['trip_id', 'route_id', 'stop_id', 'stop_lat', 'stop_lon',
       'stop_distance', 'stop_sequence', 'trip_progress', 'stop_has_alert',
       'wheelchair_boarding', 'delay', 'delay_class', 'incident_count',
       'incident_nearby', 'temperature', 'precipitation', 'windspeed', 'month',
       'day_of_week', 'hour', 'is_weekend', 'is_peak_hour', 'weather_Cloudy',
       'weather_Drizzle', 'weather_Rain', 'sch_rel_Scheduled',
       'sch_rel_Skipped'],
      dtype='object')

In [27]:
# Reorder columns
df = df[[
    'trip_id',
    'route_id',
    'stop_id',
    'stop_lat',
    'stop_lon',
    'stop_distance',
    'stop_sequence',
    'trip_progress',
    'stop_has_alert',
    'sch_rel_Scheduled',
    'sch_rel_Skipped',
    'wheelchair_boarding',
    'month',
    'day_of_week',
    'hour',
    'is_weekend',
    'is_peak_hour',
    'incident_count',
    'incident_nearby',
    'temperature',
    'precipitation',
    'windspeed',
    'weather_Cloudy',
    'weather_Drizzle',
    'weather_Rain',
    'delay',
    'delay_class',
]]

In [29]:
# Assert all columns are numeric
assert len(df.columns) == len(df.select_dtypes([np.number]).columns)
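
If the assertion ever fails, it helps to know which columns are responsible. The helper below is a hypothetical sketch (the name `non_numeric_columns` is not part of the project) that lists them by excluding numeric dtypes.

```python
import numpy as np
import pandas as pd


def non_numeric_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns whose dtype is not a NumPy numeric type."""
    return list(frame.select_dtypes(exclude=[np.number]).columns)


# Made-up frame with one leftover string column
toy = pd.DataFrame({'hour': [8, 17], 'temperature': [12.5, 18.0], 'stop_name': ['A', 'B']})

print(non_numeric_columns(toy))
print(non_numeric_columns(toy.drop('stop_name', axis=1)))
```

Note that boolean columns are not counted as numeric by `np.number`, so they would be reported as well.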

In [30]:
# Export encoders
encoders = {
    'le_trip': le_trip,
    'le_route': le_route,
    'le_stop': le_stop,
}

with open('../models/label_encoders.pkl', 'wb') as handle:
    pickle.dump(encoders, handle, protocol=pickle.HIGHEST_PROTOCOL)
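
A later step can load the pickled dictionary and use `inverse_transform` to recover the original identifiers. The sketch below is a self-contained round trip on a made-up encoder, written to a temporary directory rather than to `../models/`.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.preprocessing import LabelEncoder

# Made-up encoder standing in for le_route
le_route_demo = LabelEncoder().fit(['10', '24', '51'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'label_encoders.pkl'

    # Save the encoders the same way as above
    with open(path, 'wb') as handle:
        pickle.dump({'le_route': le_route_demo}, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Load them back, as a downstream notebook or script would
    with open(path, 'rb') as handle:
        restored = pickle.load(handle)

original_ids = restored['le_route'].inverse_transform([0, 2]).tolist()
print(original_ids)
```

Pickled scikit-learn objects are generally best loaded with the same library version that produced them, and pickle files should only be loaded from trusted sources.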

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5381532 entries, 0 to 5381531
Data columns (total 27 columns):
 #   Column               Dtype  
---  ------               -----  
 0   trip_id              int64  
 1   route_id             int64  
 2   stop_id              int64  
 3   stop_lat             float64
 4   stop_lon             float64
 5   stop_distance        float64
 6   stop_sequence        int64  
 7   trip_progress        float64
 8   stop_has_alert       int64  
 9   sch_rel_Scheduled    int64  
 10  sch_rel_Skipped      int64  
 11  wheelchair_boarding  int64  
 12  month                int32  
 13  day_of_week          int32  
 14  hour                 int32  
 15  is_weekend           int64  
 16  is_peak_hour         int64  
 17  incident_count       float64
 18  incident_nearby      int64  
 19  temperature          float64
 20  precipitation        float64
 21  windspeed            float64
 22  weather_Cloudy       int64  
 23  weather_Drizzle      int64  
 24  weather_Rain         int64  
 25  delay                float64
 26  delay_class          int64  

In [32]:
# Export dataframe
df.to_parquet('../data/preprocessed.parquet', index=False)

## End