# STM Transit Delay Data Preprocessing

## Overview

This notebook preprocesses STM trip update, weather, and traffic data in order to build a regression model that predicts delays in seconds.

## Data Description

`trip_id`: Unique identifier for the transit trip.<br>
`vehicle_id`: Unique identifier for a vehicle.<br>
`vehicle_lat`, `vehicle_lon`: Vehicle current position.<br>
`vehicle_distance`: Vehicle distance from the stop, in meters.<br>
`vehicle_in_transit`: Indicates if a vehicle is in transit or if it has stopped.<br>
`vehicle_bearing`: Direction that the vehicle is facing, from 0 to 360 degrees.<br>
`vehicle_speed`: Momentary speed measured by the vehicle, in meters per second.<br>
`stop_sequence`: Sequence of the stop, for ordering.<br>
`occupancy_status`: Degree of passenger occupancy, ranging from 1 (empty) to 7 (not accepting passengers).<br>
`route_id`: Unique identifier for a bus or metro line.<br>
`stop_id`: Unique identifier of a stop.<br>
`stop_name`: Name of the stop.<br>
`stop_lat`, `stop_lon`: Stop coordinates.<br>
`trip_progress`: How far along the trip is the vehicle, from 0 (first stop) to 1 (last stop).<br>
`wheelchair_boarding`: Indicates if the stop is accessible to people using wheelchairs.<br>
`rt_arrival_time`, `sch_arrival_time`: Realtime and scheduled arrival time, in UTC.<br>
`delay`: Difference between real and scheduled arrival time, in seconds.<br>
`temperature`: Air temperature at 2 meters above ground, in Celsius.<br>
`precipitation`: Total precipitation (rain, showers, snow) sum of the preceding hour, in millimeters.<br>
`windspeed`: Wind speed at 10 meters above ground, in km/h.<br>
`weathercode`: World Meteorological Organization (WMO) code, see [Open-Meteo API documentation](https://open-meteo.com/en/docs#weather_variable_documentation) for the list.<br>
`incident_nearby`: Indicates if an incident happened within 1.5 km of the vehicle position.
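
As a minimal sketch of how `delay` relates to the two arrival times described above, the snippet below subtracts the scheduled time from the realtime one on made-up timestamps. The column names follow the `df.info()` output of this notebook, and the sign convention (positive means late) is an assumption based on the description.

```python
import pandas as pd

# Made-up arrival times in UTC, for illustration only
arrivals = pd.DataFrame({
    'realtime_arrival_time': pd.to_datetime(
        ['2025-04-28T12:32:30Z', '2025-04-28T12:44:00Z'], utc=True),
    'scheduled_arrival_time': pd.to_datetime(
        ['2025-04-28T12:30:00Z', '2025-04-28T12:45:00Z'], utc=True),
})

# Assumed convention: positive delay = late, negative delay = early
arrivals['delay'] = (
    arrivals['realtime_arrival_time'] - arrivals['scheduled_arrival_time']
).dt.total_seconds()
```

Here the first vehicle is 150 seconds late and the second is 60 seconds early.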

## Imports

In [65]:
import numpy as np
import pandas as pd
import pickle
from sklearn.preprocessing import LabelEncoder
import sys

In [66]:
# Import custom code
sys.path.insert(0, '..')
from scripts.custom_functions import LOCAL_TIMEZONE, WEATHER_CODES

In [67]:
# Load data
df = pd.read_csv('../data/stm_weather_traffic_merged.csv')

In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210347 entries, 0 to 210346
Data columns (total 20 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   vehicle_id              210347 non-null  int64  
 1   vehicle_in_transit      210347 non-null  int64  
 2   vehicle_rel_distance    187277 non-null  float64
 3   vehicle_bearing         210347 non-null  float64
 4   vehicle_speed           210347 non-null  float64
 5   occupancy_status        210347 non-null  int64  
 6   route_id                210347 non-null  int64  
 7   stop_id                 210347 non-null  int64  
 8   stop_lat                210347 non-null  float64
 9   stop_lon                210347 non-null  float64
 10  trip_progress           210347 non-null  float64
 11  wheelchair_boarding     210347 non-null  int64  
 12  realtime_arrival_time   210347 non-null  object 
 13  scheduled_arrival_time  210347 non-null  object 
 14  delay               

## Data Preprocessing

### Handle Outliers

In [None]:
# Compute mean and standard deviation
mean_delay = df['delay'].mean()
std_delay = df['delay'].std()

In [None]:
# Filter outliers based on standard deviation
outlier_mask = (df['delay'] < mean_delay - 3 * std_delay) | (df['delay'] > mean_delay + 3 * std_delay)

In [None]:
# Get outliers
df[outlier_mask]

In [None]:
# Get proportion of outliers
print(f'{outlier_mask.mean():.2%}')

In [None]:
# Remove outliers
df = df[~outlier_mask]

In [None]:
# Get new distribution
df['delay'].describe()

After filtering, delays range from about 16 min early to 18 min 45 s late, which seems more reasonable.
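
The three-sigma rule used above can be sketched on a small synthetic series. The values below are made up, and the helper name `three_sigma_mask` is hypothetical (it is not part of the notebook or of `scripts.custom_functions`).

```python
import numpy as np
import pandas as pd

def three_sigma_mask(values: pd.Series) -> pd.Series:
    """Return True where a value lies more than 3 standard deviations from the mean."""
    mean, std = values.mean(), values.std()
    return (values < mean - 3 * std) | (values > mean + 3 * std)

# Made-up delays in seconds: 200 typical values plus two extreme ones
typical = np.linspace(-300, 420, 200)
delays = pd.Series(np.concatenate([typical, [7200.0, -5400.0]]))

outlier_mask = three_sigma_mask(delays)
filtered = delays[~outlier_mask]
```

Note that the mean and standard deviation are themselves inflated by the extreme values, so this rule is conservative. A quantile-based or IQR-based filter is a common alternative when the distribution is heavily skewed.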

In [None]:
df.columns

### Encode Datetime

In [None]:
# Convert scheduled arrival time
df['scheduled_arrival_time'] = pd.to_datetime(df['scheduled_arrival_time'], utc=True).dt.tz_convert(LOCAL_TIMEZONE)

In [None]:
# Extract day of week and hour from the datetime
df['day_of_week'] = df['scheduled_arrival_time'].dt.day_of_week
df['hour'] = df['scheduled_arrival_time'].dt.hour

In [None]:
# Add boolean value is_weekend
weekend_mask = df['day_of_week'].isin([5, 6])
df['is_weekend'] = np.where(weekend_mask, 1, 0)

In [None]:
# Add boolean value is_peak_hour (weekdays, 7:00-9:59 or 16:00-18:59)
peak_hour_mask = ~weekend_mask & df['hour'].isin([7, 8, 9, 16, 17, 18])
df['is_peak_hour'] = np.where(peak_hour_mask, 1, 0)
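
To illustrate why the timestamps are converted to local time before the features are extracted, here is a small sketch on made-up UTC timestamps. `LOCAL_TZ` is an assumed stand-in for `LOCAL_TIMEZONE`, whose actual value is defined in `scripts.custom_functions` and is not shown in this notebook.

```python
import pandas as pd

# Assumption: stand-in for LOCAL_TIMEZONE from scripts.custom_functions
LOCAL_TZ = 'America/Toronto'

# Made-up scheduled arrival times in UTC
times = pd.Series([
    '2025-04-28T12:30:00Z',  # Monday 08:30 local time
    '2025-04-28T02:00:00Z',  # Monday in UTC, but Sunday 22:00 local time
    '2025-05-03T21:00:00Z',  # Saturday 17:00 local time
])
local = pd.to_datetime(times, utc=True).dt.tz_convert(LOCAL_TZ)

features = pd.DataFrame({
    'day_of_week': local.dt.day_of_week,  # Monday = 0, Sunday = 6
    'hour': local.dt.hour,
})
features['is_weekend'] = features['day_of_week'].isin([5, 6]).astype(int)
features['is_peak_hour'] = (
    (features['is_weekend'] == 0)
    & features['hour'].isin([7, 8, 9, 16, 17, 18])
).astype(int)
```

The second timestamp would be counted as a weekday if the features were extracted in UTC, and the third shows that an hour of 17 on a weekend is not flagged as peak.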

### Use Label Encoding for vehicle_id, route_id and stop_id

In [None]:
le_vehicle = LabelEncoder()
df['vehicle_id'] = le_vehicle.fit_transform(df['vehicle_id'])

In [None]:
le_route = LabelEncoder()
df['route_id'] = le_route.fit_transform(df['route_id'])

In [None]:
le_stop = LabelEncoder()
df['stop_id'] = le_stop.fit_transform(df['stop_id'])
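
As a short illustration of what `LabelEncoder` does to an ID column, the sketch below encodes a few made-up route IDs. It also shows that `transform` rejects IDs that were not seen during `fit`, which matters if the saved encoders are later applied to new data.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up route IDs, for illustration only
route_ids = np.array([165, 24, 165, 51, 24])

le = LabelEncoder()
codes = le.fit_transform(route_ids)   # classes are sorted: 24 -> 0, 51 -> 1, 165 -> 2
original = le.inverse_transform(codes)

# An ID that was not present during fit raises a ValueError
try:
    le.transform([999])
    unseen_accepted = True
except ValueError:
    unseen_accepted = False
```

The resulting integer codes carry no meaningful order, so they tend to suit tree-based models better than linear ones.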

### Convert weathercode to Categories

In [None]:
# Create weather code mapping
weathercodes = df['weathercode'].sort_values().unique()
condition_list = []
label_list = []

for code in weathercodes:
    condition_list.append(df['weathercode'] == code)
    label_list.append(WEATHER_CODES[code])

In [None]:
# Create categories
df['weather'] = np.select(condition_list, label_list, default='Unknown')

In [None]:
df['weather'].value_counts()

In [None]:
# Collapse categories
df['weather'] = np.where(df['weather'].isin(['Mainly clear', 'Partly cloudy', 'Overcast']), 'Cloudy', df['weather'])
df['weather'] = np.where(df['weather'].isin(['Light drizzle', 'Moderate drizzle', 'Dense drizzle']), 'Drizzle', df['weather'])
df['weather'] = np.where(df['weather'].isin(['Slight rain', 'Moderate rain']), 'Rain', df['weather'])
df['weather'].value_counts()

In [None]:
# Use One Hot Encoding
one_hot = pd.get_dummies(df['weather'], drop_first=True, dtype='int64', prefix='weather')
df = df.join(one_hot)
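
The mapping, collapsing, and one-hot steps above can also be sketched more compactly with `Series.map` and `Series.replace`. The dictionaries `CODES` and `GROUPS` below are hypothetical: `CODES` is a small stand-in for `WEATHER_CODES`, whose full content lives in `scripts.custom_functions`, and the sample codes are made up.

```python
import pandas as pd

# Hypothetical subset of WMO code labels (stand-in for WEATHER_CODES)
CODES = {
    0: 'Clear sky', 1: 'Mainly clear', 2: 'Partly cloudy', 3: 'Overcast',
    51: 'Light drizzle', 61: 'Slight rain', 63: 'Moderate rain',
}
# Collapse fine-grained labels into broader categories, as in the cells above
GROUPS = {
    'Mainly clear': 'Cloudy', 'Partly cloudy': 'Cloudy', 'Overcast': 'Cloudy',
    'Light drizzle': 'Drizzle',
    'Slight rain': 'Rain', 'Moderate rain': 'Rain',
}

weathercode = pd.Series([0, 2, 3, 51, 63, 99])  # 99 is not in CODES
weather = weathercode.map(CODES).fillna('Unknown').replace(GROUPS)

# drop_first removes the alphabetically first category ('Clear sky'),
# which becomes the implicit baseline (all dummy columns equal to 0)
dummies = pd.get_dummies(weather, drop_first=True, dtype='int64', prefix='weather')
```

With `drop_first=True`, which category is dropped depends on which categories appear in the data, so the resulting column names should be checked before selecting them by name.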

### Convert incident_category to Categories 

## Export Data

In [None]:
df.columns

**Columns to drop**

`scheduled_arrival_time`: The day of week and hour have been extracted.<br>
`realtime_arrival_time`: Redundant, since `delay` already captures it.<br>
`weather`, `weathercode`: Dummies have been created for the weather.

In [None]:
# Keep relevant columns
df = df[[
    'vehicle_id',
    'vehicle_in_transit',
    'vehicle_rel_distance',
    'vehicle_bearing',
    'vehicle_speed',
    'occupancy_status',
    'route_id',
    'stop_id',
    'stop_lat',
    'stop_lon',
    'trip_progress',
    'wheelchair_boarding',
    'day_of_week',
    'hour',
    'is_weekend',
    'is_peak_hour',
    'temperature',
    'precipitation',
    'windspeed',
    'weather_Cloudy',
    'weather_Drizzle',
    'weather_Rain',
    'incident_nearby',
    'delay'
]]

In [None]:
# Export encoders
encoders = {
    'le_vehicle': le_vehicle,
    'le_route': le_route,
    'le_stop': le_stop,
}

with open('../models/label_encoders.pkl', 'wb') as handle:
    pickle.dump(encoders, handle, protocol=pickle.HIGHEST_PROTOCOL)
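
As a quick check that pickled encoders keep their fitted mapping, the sketch below round-trips an encoder through `pickle` in memory rather than through the file written above. The route IDs are made up.

```python
import pickle
from sklearn.preprocessing import LabelEncoder

# Fit an encoder on made-up route IDs
le_route_demo = LabelEncoder().fit([24, 51, 165])

# Serialize and restore in memory (the notebook writes to a .pkl file instead)
payload = pickle.dumps({'le_route': le_route_demo}, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

# The restored encoder applies the same mapping as the original
restored_codes = restored['le_route'].transform([165, 24])
```

Pickled scikit-learn objects are best loaded with the same scikit-learn version that created them, and pickle files from untrusted sources should not be loaded.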

In [None]:
df.info()

In [None]:
df.columns

In [None]:
# Export dataframe
df.to_csv('../data/preprocessed.csv', index=False)

## End