# STM Transit Delay Data Preprocessing

This notebook preprocesses STM trip updates and weather data to prepare them for a tree-based regression model that predicts delays in seconds.

## Data Description

`trip_id` unique identifier of a trip<br>
`vehicle_id` unique identifier of a vehicle<br>
`occupancy_status` degree of passenger occupancy<br>
`route_id` bus or metro line<br>
`stop_id` stop number<br>
`stop_lat` stop latitude<br>
`stop_lon` stop longitude<br>
`stop_sequence` sequence of the stop, for ordering<br>
`trip_progress` how far along the trip the bus is, from 0 to 1<br>
`wheelchair_boarding` indicates whether the stop is wheelchair accessible, 1 being true and 2 being false<br>
`realtime_arrival_time` actual arrival time, in milliseconds<br>
`scheduled_arrival_time` planned arrival time, in milliseconds<br>
`temperature` air temperature at 2 meters above ground, in Celsius<br>
`precipitation` total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters<br>
`windspeed` wind speed at 10 meters above ground, in km/h<br>
`weathercode` World Meteorological Organization (WMO) code<br>
`incident_nearby` indicates if an incident happened within 500 meters when the vehicle arrived at the stop

## Imports

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
from sklearn.preprocessing import LabelEncoder
import sys

In [None]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import INCIDENT_CATEGORIES, LOCAL_TIMEZONE, WEATHER_CODES

In [None]:
# Load data
df = pd.read_csv('../data/stm_weather_traffic_merged.csv')

## Small EDA

In [None]:
# Get correlations of features with delay
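# numeric_only=True skips non-numeric columns; drop the self-correlation of delay by name
correlations = df.corr(numeric_only=True)['delay'].drop('delay')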
correlations.sort_values(key=abs, ascending=False)

In [None]:
# Get distribution of delays
df['delay'].describe()

In [None]:
# Get proportion of trips that are on time
on_time_mask = df['delay'] == 0
print(f'{(on_time_mask.sum() / len(df)):.2%}')

In [None]:
# Plot distribution of delay
plt.hist(df['delay'], bins=50)
plt.title('Distribution of delay')
plt.xlabel('Delay (seconds)')
plt.ylabel('Frequency')
plt.savefig('../images/delay_hist.png', bbox_inches='tight')
plt.show()

The plot shows some extreme values (from roughly 2 h 15 min early to roughly 5 h 15 min late), which could greatly affect model performance.

In [None]:
# Plot trip progress vs. delay
plt.scatter(x=df['trip_progress'], y=df['delay'])
plt.title('Trip Progress vs. Delay')
plt.xlabel('Trip Progress')
plt.ylabel('Delay (seconds)')
plt.savefig('../images/trip_progress_vs_delay.png', bbox_inches='tight')
plt.show()

From the plot, the delay looks relatively stable across trip progress. Many data points cluster around the x-axis, indicating that in many cases the delay does not change significantly over the course of the trip. The wide range of values suggests that delays are highly variable, but there is no clear relationship between trip progress and delay. This could imply that other factors, such as traffic, accidents, mechanical issues or weather, have more impact on delays.
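
One way to check this visual impression numerically is to bin trip progress and compare the delay statistics per bin. The sketch below uses synthetic data (the `toy` frame and its distribution parameters are assumptions for illustration, not values from the STM dataset), so it only shows the mechanics:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: delay is drawn independently of trip progress
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'trip_progress': rng.uniform(0, 1, size=5000),
    'delay': rng.normal(60, 180, size=5000),
})

# Split trip progress into 5 equal-width bins covering [0, 1]
progress_bins = pd.cut(toy['trip_progress'], bins=np.linspace(0, 1, 6), include_lowest=True)

# Median, spread and row count of delay per bin
delay_by_progress = toy.groupby(progress_bins, observed=True)['delay'].agg(['median', 'std', 'count'])
print(delay_by_progress)
```

If the medians stay close to each other across bins while the standard deviation remains large, that supports the reading of a high-variance delay with no clear dependence on trip progress.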

## Data Preprocessing

### Handle Outliers

In [None]:
# Compute mean and standard deviation
mean_delay = df['delay'].mean()
std_delay = df['delay'].std()

In [None]:
# Filter outliers based on standard deviation
outlier_mask = (df['delay'] < mean_delay - 3 * std_delay) | (df['delay'] > mean_delay + 3 * std_delay)

In [None]:
# Get outliers
df[outlier_mask]

In [None]:
# Get proportion of outliers
print(f'{outlier_mask.mean():.2%}')

In [None]:
# Remove outliers (copy so that later column assignments don't trigger SettingWithCopyWarning)
df = df[~outlier_mask].copy()

In [None]:
# Get new distribution
df['delay'].describe()

After filtering, delays range from roughly 16 min early to 18 min 45 s late, which seems more reasonable.
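
The three-standard-deviation rule used above can be wrapped in a small reusable helper. This is a minimal sketch on synthetic data; the function name `remove_sigma_outliers` and the toy values are hypothetical and not part of the project code:

```python
import numpy as np
import pandas as pd

def remove_sigma_outliers(frame: pd.DataFrame, column: str, n_sigmas: float = 3.0) -> pd.DataFrame:
    """Return a copy of `frame` keeping rows within n_sigmas standard deviations of the column mean."""
    mean = frame[column].mean()
    std = frame[column].std()
    keep_mask = (frame[column] - mean).abs() <= n_sigmas * std
    return frame.loc[keep_mask].copy()

# Synthetic delays in seconds: 1000 moderate values plus one extreme value
rng = np.random.default_rng(0)
toy = pd.DataFrame({'delay': np.append(rng.normal(60, 120, size=1000), 20000)})

filtered = remove_sigma_outliers(toy, 'delay')
print(len(toy), len(filtered))
```

Note that the mean and standard deviation are themselves computed on data that still contains the extreme values, so the thresholds are wider than they would be on clean data.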

In [None]:
df.columns

### Encode Datetime

In [None]:
# Convert real and scheduled timestamps
df['realtime_arrival_time'] = pd.to_datetime(df['realtime_arrival_time'], origin='unix', unit='ms', utc=True)
df['scheduled_arrival_time'] = pd.to_datetime(df['scheduled_arrival_time'], origin='unix', unit='ms', utc=True)

In [None]:
# Convert arrival times to local timezone
df['realtime_arrival_time'] = df['realtime_arrival_time'].dt.tz_convert(LOCAL_TIMEZONE)
df['scheduled_arrival_time'] = df['scheduled_arrival_time'].dt.tz_convert(LOCAL_TIMEZONE)

In [None]:
# Convert datetimes to day and hour
df['day'] = df['realtime_arrival_time'].dt.day_of_week
df['hour'] = df['realtime_arrival_time'].dt.hour

df['sch_day'] = df['scheduled_arrival_time'].dt.day_of_week
df['sch_hour'] = df['scheduled_arrival_time'].dt.hour
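
The conversion chain above (epoch milliseconds to UTC, then to local time, then to day and hour) can be traced on a single value. This is a sketch under one assumption: `'America/Toronto'` stands in for `LOCAL_TIMEZONE`, whose actual value is defined in `scripts.custom_functions` and not shown in this notebook.

```python
import pandas as pd

# Assumption: stand-in for LOCAL_TIMEZONE, whose real value is not shown here
ASSUMED_TIMEZONE = 'America/Toronto'

# One timestamp in milliseconds since the Unix epoch (2023-11-14 22:13:20 UTC)
raw_ms = pd.Series([1_700_000_000_000])

# Parse as timezone-aware UTC, then shift to the local timezone
utc_time = pd.to_datetime(raw_ms, unit='ms', utc=True)
local_time = utc_time.dt.tz_convert(ASSUMED_TIMEZONE)

utc_hour = utc_time.dt.hour.iloc[0]
local_hour = local_time.dt.hour.iloc[0]
local_day = local_time.dt.day_of_week.iloc[0]  # Monday=0 ... Sunday=6

print(utc_hour, local_hour, local_day)
```

Because the extracted hour differs between UTC and local time, the timezone conversion has to happen before the `day` and `hour` features are derived.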

In [None]:
# Use Cyclical Encoding for day and hour, as it's more suitable for time-related features
# And the model can "understand" the wrap-around
df['day_sin'] = np.sin(2 * np.pi * df['day'] / 7)
df['day_cos'] = np.cos(2 * np.pi * df['day'] / 7)

df['sch_day_sin'] = np.sin(2 * np.pi * df['sch_day'] / 7)
df['sch_day_cos'] = np.cos(2 * np.pi * df['sch_day'] / 7)

df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

df['sch_hour_sin'] = np.sin(2 * np.pi * df['sch_hour'] / 24)
df['sch_hour_cos'] = np.cos(2 * np.pi * df['sch_hour'] / 24)

In [None]:
# Add boolean value is_weekend
weekend_mask = df['day'].isin([5, 6])
df['is_weekend'] = np.where(weekend_mask, 1, 0)

In [None]:
# Add boolean value is_peak_hour (weekdays, hours 7-9 and 16-18 inclusive, i.e. 7:00-9:59 and 16:00-18:59)
peak_hour_mask = ~weekend_mask & df['hour'].isin([7, 8, 9, 16, 17, 18])
df['is_peak_hour'] = np.where(peak_hour_mask, 1, 0)

### Use Label Encoding for vehicle_id, route_id and stop_id

In [None]:
le_vehicle = LabelEncoder()
df['vehicle_id'] = le_vehicle.fit_transform(df['vehicle_id'])

In [None]:
le_route = LabelEncoder()
df['route_id'] = le_route.fit_transform(df['route_id'])

In [None]:
le_stop = LabelEncoder()
df['stop_id'] = le_stop.fit_transform(df['stop_id'])
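
The following sketch shows how `LabelEncoder` behaves on a toy list of route identifiers (the values are made up for illustration): classes are sorted before integer codes are assigned, the mapping is reversible, and a label that was not seen during fitting raises an error.

```python
from sklearn.preprocessing import LabelEncoder

# Toy route identifiers as strings (illustrative values only)
routes = ['51', '165', '51', '80']

encoder = LabelEncoder()
route_codes = encoder.fit_transform(routes)

# Classes are sorted (lexicographically for strings), so '165' gets code 0
print(list(encoder.classes_))
print(list(route_codes))

# The integer codes can be mapped back to the original labels
decoded = encoder.inverse_transform(route_codes)

# A label that was not present at fit time cannot be transformed
try:
    encoder.transform(['999'])
    unseen_raised = False
except ValueError:
    unseen_raised = True
```

The last point matters when the fitted encoders are reused on new data: any vehicle, route or stop that was absent from the fitted data needs to be handled before calling `transform`.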

### Convert weathercode to Categories

In [None]:
# Create weather code mapping
weathercodes = df['weathercode'].sort_values().unique()
condition_list = []
label_list = []

for code in weathercodes:
    condition_list.append(df['weathercode'] == code)
    label_list.append(WEATHER_CODES[code])

In [None]:
# Create categories
df['weather'] = np.select(condition_list, label_list, default='Unknown')

In [None]:
df['weather'].value_counts()

In [None]:
# Use One Hot Encoding
one_hot = pd.get_dummies(df['weather'], drop_first=True, dtype='int64', prefix='weather')
df = df.join(one_hot)
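
The effect of `drop_first=True` is easier to see on a small example. In the sketch below, `toy_weather_codes` is a hypothetical stand-in for `WEATHER_CODES` (whose contents are not shown in this notebook), and the labelling step uses `Series.map` as a compact alternative to `np.select`, not the approach used above:

```python
import pandas as pd

# Hypothetical stand-in for WEATHER_CODES
toy_weather_codes = {0: 'Clear sky', 1: 'Mainly clear', 2: 'Partly cloudy', 3: 'Overcast'}

# Code 99 is deliberately missing from the mapping
toy = pd.DataFrame({'weathercode': [0, 1, 3, 2, 0, 99]})

# Map codes to labels; unmapped codes become 'Unknown'
toy['weather'] = toy['weathercode'].map(toy_weather_codes).fillna('Unknown')

# One-hot encode; drop_first removes the first category in sorted order
one_hot = pd.get_dummies(toy['weather'], drop_first=True, dtype='int64', prefix='weather')
print(one_hot)
```

The first category in sorted order (`Clear sky` in this toy example) has no column of its own and is represented by a row of zeros, which avoids a redundant column.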

### Convert incident_category to Categories 

## Export Data

In [None]:
df.columns

In [None]:
# Keep encoded columns and reorder them
df = df[[
    'vehicle_id',
    'occupancy_status',
    'route_id',
    'stop_id',
    'stop_lat',
    'stop_lon',
    'stop_sequence',
    'trip_progress',
    'wheelchair_boarding',
    'day_sin',
    'day_cos',
    'sch_day_sin',
    'sch_day_cos',
    'hour_sin',
    'hour_cos',
    'sch_hour_sin',
    'sch_hour_cos',
    'is_weekend',
    'is_peak_hour',
    'temperature',
    'precipitation',
    'windspeed',
    'weather_Light drizzle',
    'weather_Mainly clear',
    'weather_Overcast',
    'weather_Partly cloudy',
    'delay'
]]

In [None]:
# Export encoders
encoders = {
    'le_vehicle': le_vehicle,
    'le_route': le_route,
    'le_stop': le_stop,
}

with open('../models/label_encoders.pkl', 'wb') as handle:
    pickle.dump(encoders, handle, protocol=pickle.HIGHEST_PROTOCOL)
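
A downstream script could restore the pickled encoders and reuse them on new data. The sketch below shows the round trip with a toy encoder and a temporary directory, so it does not touch `../models`; the labels and the name `le_demo` are illustrative assumptions.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.preprocessing import LabelEncoder

# Toy encoder standing in for le_stop (illustrative labels only)
le_demo = LabelEncoder().fit(['a', 'b', 'c'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'label_encoders.pkl'

    # Save the encoders the same way as in the export cell
    with open(path, 'wb') as handle:
        pickle.dump({'le_stop': le_demo}, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Load them back, as a downstream script would
    with open(path, 'rb') as handle:
        restored = pickle.load(handle)

# The restored encoder reproduces the fitted mapping
restored_codes = restored['le_stop'].transform(['c', 'a'])
print(list(restored_codes))
```

Pickle files should only be loaded from trusted sources, and loading generally works best with the same scikit-learn version that was used to save them.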

In [None]:
df.info()

In [None]:
# Export dataframe
df.to_csv('../data/preprocessed.csv', index=False)

## End