# STM Transit Delay Data Preprocessing

## Overview

This notebook preprocesses STM trip update, weather, and traffic data in order to build a regression model that predicts delays in seconds.

## Data Description (to be completed)

`trip_id`: Unique identifier for the transit trip.<br>
`vehicle_id`: Unique identifier for a vehicle.<br>
`vehicle_lat`, `vehicle_lon`: Vehicle current position.<br>
`stop_distance`: Vehicle distance from the stop, in meters.<br>
`vehicle_in_transit`: Indicates if a vehicle is in transit or if it has stopped.<br>
`vehicle_bearing`: Direction that the vehicle is facing, from 0 to 360 degrees.<br>
`vehicle_speed`: Momentary speed measured by the vehicle, in meters per second.<br>
`stop_sequence`: Sequence of the stop, for ordering.<br>
`occupancy_status`: Degree of passenger occupancy, ranging from 1 (empty) to 7 (not accepting passengers).<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`stop_id`: Unique identifier of a stop.<br>
`stop_name`: Name of the stop.<br>
`stop_lat`, `stop_lon`: Stop coordinates.<br>
`trip_progress`: How far along the trip is the vehicle, from 0 (first stop) to 1 (last stop).<br>
`wheelchair_boarding`: Indicates if the stop is accessible to wheelchair users.<br>
`rt_arrival_time`, `sch_arrival_time`: Realtime and scheduled arrival time, in UTC.<br>
`delay`: Difference between real and scheduled arrival time, in seconds.<br>
`delay_class`: Delay magnitude, from early to late.<br>
`temperature`: Air temperature at 2 meters above ground, in Celsius.<br>
`precipitation`: Total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters.<br>
`windspeed`: Wind speed at 10 meters above ground, in km/h.<br>
`weathercode`: World Meteorological Organization (WMO) code, see [Open-Meteo API documentation](https://open-meteo.com/en/docs#weather_variable_documentation) for the list.<br>
`incident_nearby`: Indicates if an incident happened within 1.5 km of the vehicle position.

## Imports

In [1]:
import numpy as np
import pandas as pd
import pickle
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
import sys

In [2]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import DELAY_CLASS, LOCAL_TIMEZONE, OCCUPANCY_STATUS

In [3]:
# Load data
df = pd.read_csv('../data/stm_weather_traffic_merged.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 341924 entries, 0 to 341923
Data columns (total 31 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   trip_id                   341924 non-null  int64  
 1   vehicle_id                341924 non-null  int64  
 2   occupancy_status          341924 non-null  object 
 3   vehicle_in_transit        341924 non-null  int64  
 4   vehicle_bearing           341924 non-null  float64
 5   vehicle_speed             341924 non-null  float64
 6   wheelchair_boarding       341924 non-null  int64  
 7   route_id                  341924 non-null  int64  
 8   stop_id                   341924 non-null  int64  
 9   stop_name                 341924 non-null  object 
 10  stop_lat                  341924 non-null  float64
 11  stop_lon                  341924 non-null  float64
 12  stop_distance             341924 non-null  float64
 13  stop_sequence             341924 non-null  i

In [5]:
df.head()

| | trip_id | vehicle_id | occupancy_status | vehicle_in_transit | vehicle_bearing | vehicle_speed | wheelchair_boarding | route_id | stop_id | stop_name | ... | incident_nearby | incident_category | incident_delay | avg_distance_to_incident | incident_delay_magnitude | incident_count | temperature | precipitation | windspeed | weather |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 284214231 | 33839 | Many seats available | 0 | 0.0 | 0.0 | 1 | 80 | 51798 | du Parc / Villeneuve | ... | 0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 11.0 | 0.0 | 13.4 | Clear sky |
| 1 | 284214231 | 33839 | Empty | 1 | 0.0 | 0.0 | 1 | 80 | 54225 | Aréna Howie-Morenz | ... | 0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 10.4 | 0.0 | 11.6 | Clear sky |
| 2 | 283855472 | 32018 | Empty | 1 | 54.0 | 0.0 | 1 | 49 | 60515 | Perras / 81e Avenue | ... | 0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 11.3 | 0.0 | 13.3 | Clear sky |
| 3 | 283213434 | 40187 | Many seats available | 1 | 287.0 | 7.50006 | 1 | 968 | 61988 | Gare Roxboro-Pierrefonds | ... | 0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 10.4 | 0.0 | 11.6 | Clear sky |
| 4 | 283213434 | 40187 | Empty | 1 | 0.0 | 0.0 | 1 | 968 | 61988 | Gare Roxboro-Pierrefonds | ... | 0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 10.4 | 0.0 | 11.6 | Clear sky |


## Data Preprocessing

### Handle Delay Outliers

In [6]:
df['delay'].describe()

count    341924.000000
mean         60.406848
std         310.178323
min      -13592.000000
25%           0.000000
50%           0.000000
75%          40.000000
max       31108.000000
Name: delay, dtype: float64

In [7]:
# Compute mean and standard deviation
mean_delay = df['delay'].mean()
std_delay = df['delay'].std()

In [8]:
# Filter outliers based on standard deviation
outlier_mask = (df['delay'] < mean_delay - 3 * std_delay) | (df['delay'] > mean_delay + 3 * std_delay)

In [9]:
# Get outliers
df[outlier_mask].sample(20)

| | trip_id | vehicle_id | occupancy_status | vehicle_in_transit | vehicle_bearing | vehicle_speed | wheelchair_boarding | route_id | stop_id | stop_name | ... | incident_nearby | incident_category | incident_delay | avg_distance_to_incident | incident_delay_magnitude | incident_count | temperature | precipitation | windspeed | weather |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 212430 | 284778708 | 37053 | Few seats available | 0 | 28.0 | 2.50002 | 1 | 162 | 56377 | Guelph / Jellicoe | ... | 0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 11.1 | 0.2 | 7.7 | Light drizzle |
| 150833 | 284777346 | 37097 | Many seats available | 0 | 0.0 | 0.0 | 1 | 100 | 55514 | Montée de Liesse / Bourg | ... | 0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 7.1 | 0.0 | 8.4 | Overcast |
| 306 | 286581685 | 38034 | Many seats available | 1 | 261.0 | 2.50002 | 1 | 777 | 54330 | Casino de Montréal | ... | 0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 11.0 | 0.0 | 13.4 | Clear sky |
| 277328 | 286572216 | 40074 | Empty | 1 | 0.0 | 0.0 | 1 | 110 | 57060 | Newman / Angrignon | ... | 0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 10.6 | 0.0 | 6.0 | Overcast |
| 11338 | 285002392 | 40096 | Many seats available | 1 | 212.0 | 10.00008 | 1 | 410 | 53845 | René-Lévesque / De Champlain | ... | 0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 18.5 | 0.0 | 11.9 | Overcast |
| 211581 | 286574319 | 41080 | Empty | 1 | 232.0 | 12.5001 | 1 | 420 | 56386 | Côte-Saint-Luc / Westminster | ... | 0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 11.1 | 0.2 | 7.7 | Light drizzle |
| 124327 | 284779889 | 40916 | Empty | 1 | 223.0 | 2.22224 | 1 | 213 | 55554 | Côte-Vertu / Alexis-Nihon | ... | 0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 7.0 | 0.0 | 12.5 | Clear sky |
| 205249 | 284778705 | 37053 | Empty | 1 | 211.0 | 2.50002 | 1 | 162 | 56389 | Kildare / Cavendish | ... | 0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 12.4 | 0.0 | 9.5 | Overcast |
| 151579 | 284740861 | 31218 | Many seats available | 1 | 24.0 | 1.66668 | 1 | 460 | 50671 | Crémazie / Saint-Laurent | ... | 0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 7.7 | 0.0 | 9.1 | Overcast |
| 93688 | 284740313 | 41058 | Few seats available | 0 | 0.0 | 0.0 | 1 | 41 | 51473 | Saint-Michel / Crémazie | ... | 0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 18.4 | 0.3 | 21.9 | Light drizzle |


In [10]:
# Get proportion of outliers
print(f'{outlier_mask.mean():.2%}')

0.89%


In [11]:
# Remove outliers
df = df[~outlier_mask].copy()

In [12]:
# Get new distribution
df['delay'].describe()

count    338879.000000
mean         51.385014
std         141.813369
min        -868.000000
25%           0.000000
50%           0.000000
75%          32.000000
max         990.000000
Name: delay, dtype: float64
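
The filter above keeps rows whose delay lies within three standard deviations of the mean. The following is a minimal sketch of the same rule on toy data; the helper name `three_sigma_mask` and the toy values are illustrative assumptions, not part of the project code.

```python
import pandas as pd


def three_sigma_mask(values: pd.Series, k: float = 3.0) -> pd.Series:
    """Return True for values more than k standard deviations from the mean."""
    mean, std = values.mean(), values.std()
    return (values < mean - k * std) | (values > mean + k * std)


# Toy delays in seconds: 19 small values plus one extreme reading.
toy_delays = pd.Series([0, 10, -5, 20, 15, 0, 5, 30, -10, 25,
                        0, 10, 5, 15, 20, -5, 0, 10, 5, 5000])

mask = three_sigma_mask(toy_delays)
filtered = toy_delays[~mask]

print(f"{mask.mean():.2%} flagged, {len(filtered)} rows kept")
```

Because the mean and standard deviation are computed on data that still contains the outliers, extreme values widen the band they are tested against. On small samples this can hide outliers entirely, so the rule is best suited to large datasets like this one.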

### Encode Datetime

In [13]:
# Convert scheduled arrival time
df['sch_arrival_time'] = pd.to_datetime(df['sch_arrival_time'], utc=True).dt.tz_convert(LOCAL_TIMEZONE)

In [14]:
# Convert datetime to month, day of week and hour
df['month'] = df['sch_arrival_time'].dt.month
df['day_of_week'] = df['sch_arrival_time'].dt.day_of_week
df['hour'] = df['sch_arrival_time'].dt.hour

In [15]:
# Add boolean value is_weekend
weekend_mask = df['day_of_week'].isin([5, 6])
df['is_weekend'] = np.where(weekend_mask, 1, 0)

In [16]:
# Add boolean value is_peak_hour (weekdays, hours 7 to 9 and 16 to 18 inclusive)
peak_hour_mask = ~weekend_mask & df['hour'].isin([7, 8, 9, 16, 17, 18])
df['is_peak_hour'] = np.where(peak_hour_mask, 1, 0)

In [17]:
# Drop datetime columns
df = df.drop(['rt_arrival_time', 'rt_departure_time', 'sch_arrival_time', 'sch_departure_time'], axis=1)
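
To make the datetime encoding concrete, here is a small sketch that applies the same steps to three toy timestamps. The timezone name is an assumption: `LOCAL_TIMEZONE` is defined in the project's custom module and its value is not shown here, so `'America/Toronto'` is used as a stand-in.

```python
import pandas as pd

# Assumption: stand-in for LOCAL_TIMEZONE, whose actual value is not shown in this notebook.
ASSUMED_TIMEZONE = "America/Toronto"

toy = pd.DataFrame({"sch_arrival_time": [
    "2025-04-28 12:30:00+00:00",  # Monday, 08:30 local time
    "2025-04-28 17:30:00+00:00",  # Monday, 13:30 local time
    "2025-05-03 12:30:00+00:00",  # Saturday, 08:30 local time
]})

# Parse as UTC, then convert to local time before extracting calendar features.
local = pd.to_datetime(toy["sch_arrival_time"], utc=True).dt.tz_convert(ASSUMED_TIMEZONE)
toy["day_of_week"] = local.dt.day_of_week  # Monday is 0, Sunday is 6
toy["hour"] = local.dt.hour
toy["is_weekend"] = toy["day_of_week"].isin([5, 6]).astype(int)
toy["is_peak_hour"] = (
    (toy["is_weekend"] == 0) & toy["hour"].isin([7, 8, 9, 16, 17, 18])
).astype(int)

print(toy[["day_of_week", "hour", "is_weekend", "is_peak_hour"]])
```

Converting to local time first matters: the same 12:30 UTC timestamp would otherwise be read as hour 12 and miss the morning peak.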

### Use Label Encoding for vehicle_id, route_id and stop_id

In [18]:
le_vehicle = LabelEncoder()
df['vehicle_id'] = le_vehicle.fit_transform(df['vehicle_id'])

In [19]:
le_route = LabelEncoder()
df['route_id'] = le_route.fit_transform(df['route_id'])

In [20]:
le_stop = LabelEncoder()
df['stop_id'] = le_stop.fit_transform(df['stop_id'])
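
As a quick illustration of what `LabelEncoder` does to the identifiers, the sketch below fits an encoder on a few toy route ids (illustrative values only) and shows the round trip and the behaviour on an unseen id.

```python
from sklearn.preprocessing import LabelEncoder

# Toy route identifiers; the real values come from the route_id column.
route_ids = [80, 49, 968, 80, 49]

le = LabelEncoder()
encoded = le.fit_transform(route_ids)

print(le.classes_)                     # sorted unique labels
print(encoded)                         # position of each label in classes_
print(le.inverse_transform(encoded))   # round trip back to the original ids

# A label that was not seen during fit raises ValueError at transform time.
try:
    le.transform([777])
except ValueError as err:
    print(f"unseen label: {err}")
```

The encoded integers only reflect sort order of the ids, not any real ordering between routes, and new ids that appear after fitting need explicit handling at prediction time.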

### Use Ordinal Encoding for Occupancy Status and Delay Class

In [21]:
# Create occupancy map
occ_map = {}

for number, status in OCCUPANCY_STATUS.items():
    occ_map[status] = number

occ_map

{'Unknown': 0,
 'Empty': 1,
 'Many seats available': 2,
 'Few seats available': 3,
 'Standing room only': 4,
 'Crushed standing room only': 5,
 'Full': 6,
 'Not accepting passengers': 7}

In [22]:
# Map values
df['occupancy_status'] = df['occupancy_status'].map(occ_map)
df['occupancy_status'].value_counts()

occupancy_status
1    144666
2    112509
3     76235
5      5245
0       224
Name: count, dtype: int64

In [23]:
# Create delay map
delay_map = {}

for number, cat in DELAY_CLASS.items():
    delay_map[cat] = number

delay_map

{'Early': 0, 'Slightly Early': 1, 'On Time': 2, 'Slighly Late': 3, 'Late': 4}

In [24]:
# Map values
df['delay_class'] = df['delay_class'].map(delay_map)
df['delay_class'].value_counts()

delay_class
2    289007
3     21524
4     20220
0      4407
1      3721
Name: count, dtype: int64
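
The two maps above are built by inverting a code-to-label dictionary and passing the result to `Series.map`. Here is a minimal sketch of that pattern, assuming a shortened stand-in for `OCCUPANCY_STATUS`; it also shows that labels missing from the map become `NaN`, which is worth checking after mapping.

```python
import pandas as pd

# Assumption: shortened stand-in for OCCUPANCY_STATUS (integer code -> label).
occupancy_status = {0: "Unknown", 1: "Empty", 2: "Many seats available"}

# Invert code -> label into label -> code.
occ_map = {status: number for number, status in occupancy_status.items()}

labels = pd.Series(["Empty", "Many seats available", "Unknown", "Not in the map"])
codes = labels.map(occ_map)

# Labels absent from the map become NaN, so a null check catches typos or new categories.
print(codes)
print("unmapped labels:", labels[codes.isna()].tolist())
```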

### Use One Hot Encoding for Weather

In [25]:
df['weather'].value_counts()

weather
Overcast            170500
Clear sky            78138
Partly cloudy        32238
Light drizzle        22477
Mainly clear         21836
Slight rain           9860
Dense drizzle         2878
Moderate drizzle       952
Name: count, dtype: int64

In [26]:
# Collapse categories
df['weather'] = np.where(df['weather'].isin(['Mainly clear', 'Partly cloudy', 'Overcast']), 'Cloudy', df['weather'])
df['weather'] = np.where(df['weather'].isin(['Light drizzle', 'Moderate drizzle', 'Dense drizzle']), 'Drizzle', df['weather'])
df['weather'] = np.where(df['weather'].isin(['Slight rain', 'Moderate rain']), 'Rain', df['weather'])
df['weather'].value_counts()

weather
Cloudy       224574
Clear sky     78138
Drizzle       26307
Rain           9860
Name: count, dtype: int64
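
The same collapsing can also be expressed with a single lookup dictionary, which keeps the grouping in one place. This is an alternative sketch on toy values, not the approach used in the cell above.

```python
import pandas as pd

# Each fine-grained label points to its collapsed group; unlisted labels stay unchanged.
collapse = {
    "Mainly clear": "Cloudy", "Partly cloudy": "Cloudy", "Overcast": "Cloudy",
    "Light drizzle": "Drizzle", "Moderate drizzle": "Drizzle", "Dense drizzle": "Drizzle",
    "Slight rain": "Rain", "Moderate rain": "Rain",
}

weather = pd.Series(["Overcast", "Clear sky", "Dense drizzle", "Slight rain"])
collapsed = weather.replace(collapse)

print(collapsed.tolist())
```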

In [27]:
# Use One Hot Encoding for weather
one_hot = pd.get_dummies(df['weather'], drop_first=True, dtype='int64', prefix='weather')
df = df.join(one_hot).drop('weather', axis=1)

In [28]:
# Get remaining string columns
df.select_dtypes(include='object').columns

Index(['stop_name'], dtype='object')
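
Because `drop_first=True` is used for the weather columns, one category has no column of its own. The sketch below, on toy values, shows which category is dropped and how it is represented.

```python
import pandas as pd

weather = pd.Series(["Clear sky", "Cloudy", "Drizzle", "Rain", "Cloudy"], name="weather")

one_hot = pd.get_dummies(weather, drop_first=True, dtype="int64", prefix="weather")
print(one_hot.columns.tolist())

# A row of all zeros stands for the dropped baseline category.
baseline_rows = one_hot.sum(axis=1) == 0
print(weather[baseline_rows].tolist())
```

The first category in sorted order, here `Clear sky`, becomes the implicit baseline, which matches the three `weather_*` columns in the exported data.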

## Export Data

In [30]:
df.columns

Index(['trip_id', 'vehicle_id', 'occupancy_status', 'vehicle_in_transit',
       'vehicle_bearing', 'vehicle_speed', 'wheelchair_boarding', 'route_id',
       'stop_id', 'stop_name', 'stop_lat', 'stop_lon', 'stop_distance',
       'stop_sequence', 'trip_progress', 'delay', 'delay_class',
       'incident_nearby', 'incident_category', 'incident_delay',
       'avg_distance_to_incident', 'incident_delay_magnitude',
       'incident_count', 'temperature', 'precipitation', 'windspeed', 'month',
       'day_of_week', 'hour', 'is_weekend', 'is_peak_hour', 'weather_Cloudy',
       'weather_Drizzle', 'weather_Rain'],
      dtype='object')

**Columns to drop**

`trip_id`: A trip is associated with a vehicle, a route and a date.<br>
`stop_name`: There's already the stop id.

`delay` is kept alongside `delay_class`, since it is the regression target in seconds.

In [31]:
# Keep relevant columns and reorder
df = df[[
    'vehicle_id',
    'occupancy_status',
    'vehicle_in_transit',
    'vehicle_bearing',
    'vehicle_speed',
    'wheelchair_boarding',
    'route_id',
    'stop_id',
    'stop_lat',
    'stop_lon',
    'stop_distance',
    'stop_sequence',
    'trip_progress',
    'windspeed',
    'month',
    'day_of_week',
    'hour',
    'is_weekend',
    'is_peak_hour',
    'incident_nearby',
    'incident_category',
    'incident_delay',
    'avg_distance_to_incident',
    'incident_delay_magnitude',
    'incident_count',
    'temperature',
    'precipitation',
    'weather_Cloudy',
    'weather_Drizzle',
    'weather_Rain',
    'delay',
    'delay_class',
]]

In [32]:
# Export encoders
encoders = {
    'le_vehicle': le_vehicle,
    'le_route': le_route,
    'le_stop': le_stop,
}

with open('../models/label_encoders.pkl', 'wb') as handle:
    pickle.dump(encoders, handle, protocol=pickle.HIGHEST_PROTOCOL)
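
The pickled encoders are meant to be reloaded later so that new data gets the same integer codes. A minimal round-trip sketch follows; it uses a toy encoder and a temporary file (a hypothetical location chosen only for this example) rather than the project's `../models` path.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.preprocessing import LabelEncoder

# Toy stand-in for one of the fitted encoders.
le_route = LabelEncoder().fit([49, 80, 968])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "label_encoders.pkl"  # hypothetical location for this sketch
    with open(path, "wb") as handle:
        pickle.dump({"le_route": le_route}, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

# The restored encoder reproduces the same mapping and can undo it.
codes = restored["le_route"].transform([968, 49])
print(codes, restored["le_route"].inverse_transform(codes))
```

Pickle files should only be loaded from trusted sources, and loading generally works best with the same scikit-learn version that wrote them.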

In [33]:
# Check that all columns are numeric
len(df.columns) == len(df.select_dtypes([np.number]).columns)

True

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 338879 entries, 0 to 341923
Data columns (total 32 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   vehicle_id                338879 non-null  int64  
 1   occupancy_status          338879 non-null  int64  
 2   vehicle_in_transit        338879 non-null  int64  
 3   vehicle_bearing           338879 non-null  float64
 4   vehicle_speed             338879 non-null  float64
 5   wheelchair_boarding       338879 non-null  int64  
 6   route_id                  338879 non-null  int64  
 7   stop_id                   338879 non-null  int64  
 8   stop_lat                  338879 non-null  float64
 9   stop_lon                  338879 non-null  float64
 10  stop_distance             338879 non-null  float64
 11  stop_sequence             338879 non-null  int64  
 12  trip_progress             338879 non-null  float64
 13  windspeed                 338879 non-null  float6

In [35]:
# Export dataframe
df.to_csv('../data/preprocessed.csv', index=False)

## End