In [1]:
import pandas as pd
import numpy as np
import polars as pl
import json
import matplotlib.pyplot as plt
import matplotlib as mpl

## Machine Learning Exercise 1 - Introduction to Regression for Predictive Analytics

For this exercise, you've been provided a dataset derived from the full WeGo headway dataset. Your goal is to see how well a model can predict the ADHERENCE value.

Note that this dataset is a condensed version of the dataset that you've worked with before. Each trip has been condensed to a single row, identified by the ID column (a combination of the original CALENDAR_ID and TRIP_ID columns), where the ADHERENCE value is the ADHERENCE value for the next-to-last stop.

In [2]:
# For example, the rows for the first trip in the original data:
# 	ID 	CALENDAR_ID 	TRIP_ID 	ADHERENCE 	TRIP_EDGE
# 0 	120230801_345104 	120230801 	345104 	-2.13333 	1
# 1 	120230801_345104 	120230801 	345104 	-2.45 	0
# 2 	120230801_345104 	120230801 	345104 	-0.933333 	0
# 3 	120230801_345104 	120230801 	345104 	6.28333 	2

# Has been reduced to just the value for the next-to-last stop:
# 	ID 	CALENDAR_ID 	TRIP_ID 	ADHERENCE 	TRIP_EDGE
# 2 	120230801_345104 	120230801 	345104 	-0.933333 	0

# Note: Make sure that you perform a train/test split before fitting any models so that you can properly measure the performance of the model.

In [3]:
wego = pd.read_csv('../data/wego_ml.csv')

In [4]:
wego.head()

Unnamed: 0,ID,CALENDAR_ID,SERVICE_ABBR,ADHERENCE_ID,DATE,ROUTE_ABBR,BLOCK_ABBR,OPERATOR,TRIP_ID,OVERLOAD_ID,...,ADJUSTED_ONTIME_COUNT,STOP_CANCELLED,PREV_SCHED_STOP_CANCELLED,IS_RELIEF,BLOCK_STOP_ORDER,DWELL_IN_MINS,NextDay_Scheduled,NextDay_Actual_Arrival,NextDay_Actual_Departure,STARTING_ADHERENCE
0,120230801_345104,120230801,1,99457892,2023-08-01,22,2200,1040,345104,0,...,1,0,0.0,0,19,0.0,0,0,0,-2.133333
1,120230801_345105,120230801,1,99457895,2023-08-01,22,2200,1040,345105,0,...,1,0,0.0,0,51,0.0,0,0,0,-1.583333
2,120230801_345106,120230801,1,99457899,2023-08-01,22,2200,1040,345106,0,...,1,0,0.0,0,80,0.0,0,0,0,-1.716666
3,120230801_345107,120230801,1,99457902,2023-08-01,22,2200,1040,345107,0,...,0,0,0.0,0,112,0.0,0,0,0,-1.316666
4,120230801_345108,120230801,1,99457906,2023-08-01,22,2200,1040,345108,0,...,1,0,0.0,0,141,0.0,0,0,0,-1.516666


 1.   Fit a linear regression model predicting the ADHERENCE using the ROUTE_ABBR and ROUTE_DIRECTION_NAME columns. Measure the performance of the model using the R^2 and mean absolute error metrics. Interpret the meaning of each metric.

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

Instantiate linear regression

In [6]:
linreg = LinearRegression()

Pull out the variables we want to fit our model on.

In [7]:
X = wego[['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME']]
y = wego['ADHERENCE']

In [8]:
X.head()

Unnamed: 0,ROUTE_ABBR,ROUTE_DIRECTION_NAME
0,22,TO DOWNTOWN
1,22,FROM DOWNTOWN
2,22,TO DOWNTOWN
3,22,FROM DOWNTOWN
4,22,TO DOWNTOWN


Create dummy columns to encode the route and route direction columns as categorical variables

In [9]:
X=pd.get_dummies(X, columns=['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME'], drop_first=True)
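As a small illustration of what the call above produces, here is a sketch on a toy frame. The values are hypothetical and not taken from the WeGo data; the point is only to show which columns `drop_first=True` removes (the first level of each encoded column, which becomes the reference category absorbed by the intercept).

```python
import pandas as pd

# Toy frame with hypothetical values; not the WeGo data
toy = pd.DataFrame({
    "ROUTE_ABBR": [3, 22, 22, 55],
    "ROUTE_DIRECTION_NAME": [
        "TO DOWNTOWN", "FROM DOWNTOWN", "TO DOWNTOWN", "FROM DOWNTOWN",
    ],
})

# One indicator column per level, minus the first (sorted) level of each column
toy_dummies = pd.get_dummies(
    toy, columns=["ROUTE_ABBR", "ROUTE_DIRECTION_NAME"], drop_first=True
)

dummy_columns = list(toy_dummies.columns)
print(dummy_columns)
```

Here route 3 and "FROM DOWNTOWN" have no column of their own, so each remaining coefficient is read as a difference relative to that reference level.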

test/train split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 7)

In [11]:
X.head()

Unnamed: 0,ROUTE_ABBR_7,ROUTE_ABBR_22,ROUTE_ABBR_23,ROUTE_ABBR_50,ROUTE_ABBR_52,ROUTE_ABBR_55,ROUTE_ABBR_56,ROUTE_DIRECTION_NAME_TO DOWNTOWN
0,False,True,False,False,False,False,False,True
1,False,True,False,False,False,False,False,False
2,False,True,False,False,False,False,False,True
3,False,True,False,False,False,False,False,False
4,False,True,False,False,False,False,False,True


Fit the model on the training data

In [12]:
linreg.fit(X_train, y_train)

Extract intercept and coefficients

In [13]:
linreg.intercept_

-5.142421109756599

In [14]:
linreg.coef_

array([-0.01888845,  2.00406432,  1.52858121,  1.05122561, -0.30103696,
       -2.25162994,  0.29060458,  2.03452803])

Use predict to generate predictions for the test set

In [15]:
y_pred = linreg.predict(X_test)
y_pred

array([-3.13835679, -3.10789308, -2.05666747, ..., -7.39405105,
       -2.05666747, -2.8172885 ])

Evaluate the model on the test set using the R^2 and mean absolute error metrics

In [16]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [17]:
mean_absolute_error(y_test, y_pred)

3.5197742674855497

In [18]:
r2_score(y_test, y_pred)

0.07739460504060436
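To help interpret the two metrics, the sketch below recomputes them by hand on a few made-up values (hypothetical numbers, not from the WeGo data) and checks them against scikit-learn. It also shows that a constant-only prediction, the mean of the observed values, scores an R^2 of 0 on the data the mean was computed from.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical adherence values and predictions; not taken from the WeGo data
y_true = np.array([-3.0, -1.0, 0.0, 2.0, -8.0])
y_hat = np.array([-2.5, -2.0, -1.0, 0.5, -5.0])

# MAE: average absolute miss, in the same units as the target
mae_manual = np.mean(np.abs(y_true - y_hat))

# R^2: 1 - (model squared error) / (squared error of always predicting the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# A constant-only prediction (the mean of y_true) gives R^2 = 0 on this data
constant_pred = np.full_like(y_true, y_true.mean())
r2_constant = r2_score(y_true, constant_pred)

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
print(r2_constant)
```

Read this way, the MAE of about 3.52 above means the predictions miss the observed ADHERENCE by roughly 3.5 units on average (minutes, assuming ADHERENCE is recorded in minutes), and the R^2 of about 0.077 means the model accounts for only about 8% of the variation in test-set ADHERENCE relative to always predicting the mean.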

 2.   Now, try using the ROUTE_ABBR, ROUTE_DIRECTION_NAME, and OPERATOR columns. Does this improve the model? Warning: Your model may perform very poorly once you add the OPERATOR column. If so, this is likely because some operators have very few observations. One option to correct this is to assign an "Other" (or -999999) value to operators with few observations.
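One possible way to do the "Other" grouping suggested above is sketched here on a toy column. The operator IDs, the `MIN_OBS` cutoff, and the `OPERATOR_GROUPED` column name are all assumptions chosen for illustration; they are not part of the exercise data.

```python
import pandas as pd

# Hypothetical operator IDs: 1040 is frequent, 244 and 301 are rare
toy = pd.DataFrame({"OPERATOR": [1040, 1040, 1040, 244, 301, 1040]})

MIN_OBS = 3  # assumed cutoff; would need tuning on the real data

# Count observations per operator and collect the rare ones
counts = toy["OPERATOR"].value_counts()
rare_operators = counts[counts < MIN_OBS].index

# Keep frequent operators as-is; replace rare ones with the sentinel value
toy["OPERATOR_GROUPED"] = toy["OPERATOR"].where(
    ~toy["OPERATOR"].isin(rare_operators), -999999
)

grouped_values = toy["OPERATOR_GROUPED"].tolist()
print(grouped_values)
```

On real data it would probably be safer to compute the counts from the training rows only, so that the grouping does not depend on the test set.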

In [40]:
X = wego[['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME', 'OPERATOR']]
y = wego['ADHERENCE']

In [41]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64333 entries, 0 to 64332
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   ROUTE_ABBR            64333 non-null  int64 
 1   ROUTE_DIRECTION_NAME  64333 non-null  object
 2   OPERATOR              64333 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.5+ MB


Get dummies for route, route direction, and operator

In [42]:
X=pd.get_dummies(X, columns=['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME', 'OPERATOR'], drop_first=True)

In [43]:
X.head()

Unnamed: 0,ROUTE_ABBR_7,ROUTE_ABBR_22,ROUTE_ABBR_23,ROUTE_ABBR_50,ROUTE_ABBR_52,ROUTE_ABBR_55,ROUTE_ABBR_56,ROUTE_DIRECTION_NAME_TO DOWNTOWN,OPERATOR_244,OPERATOR_301,...,OPERATOR_3134,OPERATOR_3138,OPERATOR_3140,OPERATOR_3142,OPERATOR_3144,OPERATOR_3149,OPERATOR_3156,OPERATOR_3165,OPERATOR_3166,OPERATOR_3173
0,False,True,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,True,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,True,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False


In [44]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 7)

In [45]:
linreg.fit(X_train, y_train)

In [25]:
linreg.intercept_

21840321895050.72

In [46]:
linreg.coef_

array([ 2.54149524e-01,  1.46898586e+00,  1.30964724e+00,  7.84696579e-01,
       -6.00234756e-02, -2.47630258e+00,  2.87006543e-01,  1.95766414e+00,
        6.26116176e-01,  1.87671603e+00,  2.08812586e-01, -3.83783655e-02,
        3.32392278e+00, -1.49177928e+00,  1.85899033e+00, -2.41663665e+00,
       -5.87138866e-01,  3.41583864e+00,  1.73456590e+00,  6.52160956e-01,
        5.75313673e-01,  6.67761632e+00,  1.29310546e-01,  7.01876478e-01,
        5.51672591e-01,  2.86315871e+00,  8.66889851e-01,  6.99005066e-01,
        1.28101115e+00, -4.39545713e+00,  2.35969059e+00, -3.11260199e+00,
        3.72244680e+00,  3.05833573e+00,  2.37616463e+00,  2.84754808e+00,
        8.88548980e-01, -4.06450687e+00,  1.96191786e+00,  2.04784262e+00,
        5.91821969e-01,  1.83582788e+00,  2.20229265e+00,  3.47310566e+00,
        1.04082143e+00,  6.47084205e-01,  9.90103647e-01,  2.75698087e+00,
        4.46553740e-01, -1.11860290e+00,  8.07619043e-01,  4.68968887e+00,
       -7.17210659e-01, -

In [48]:
y_pred = linreg.predict(X_test)
y_pred

array([-1.76472301, -4.1385375 , -3.25058092, ..., -4.82660554,
       -4.34645322, -2.52302506])

In [49]:
mean_absolute_error(y_test, y_pred)

3.23063141088477

In [50]:
r2_score(y_test, y_pred)

0.17291641211126374

 3.   Finally, the data you have been provided has a STARTING_ADHERENCE column, which contains the ADHERENCE at the beginning of the route. If you add this metric, does it improve the model? Is this of any practical use?

In [52]:
X = wego[['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME', 'OPERATOR', 'STARTING_ADHERENCE']]
y = wego['ADHERENCE'] 

In [53]:
X=pd.get_dummies(X, columns=['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME', 'OPERATOR'], drop_first=True)

In [54]:
X.head()

Unnamed: 0,STARTING_ADHERENCE,ROUTE_ABBR_7,ROUTE_ABBR_22,ROUTE_ABBR_23,ROUTE_ABBR_50,ROUTE_ABBR_52,ROUTE_ABBR_55,ROUTE_ABBR_56,ROUTE_DIRECTION_NAME_TO DOWNTOWN,OPERATOR_244,...,OPERATOR_3134,OPERATOR_3138,OPERATOR_3140,OPERATOR_3142,OPERATOR_3144,OPERATOR_3149,OPERATOR_3156,OPERATOR_3165,OPERATOR_3166,OPERATOR_3173
0,-2.133333,False,True,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
1,-1.583333,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,-1.716666,False,True,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
3,-1.316666,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,-1.516666,False,True,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False


In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 7)

In [56]:
linreg.fit(X_train, y_train)

In [57]:
linreg.intercept_

-2.1886294018809935

In [58]:
linreg.coef_

array([ 7.61851772e-01, -2.62948285e-01,  7.77975166e-01,  7.82536317e-01,
        9.77188200e-01, -1.07526964e-01, -1.97223762e+00, -4.77602285e-02,
        1.29196997e+00, -1.43657976e+00,  3.56142767e-01, -1.26837467e+00,
       -3.10785949e-01,  4.33236300e-01, -2.28805203e+00, -2.24041108e-01,
       -2.11791529e+00, -1.87988052e+00,  9.64476306e-01,  3.03254994e-01,
       -5.86808179e-01, -8.24718025e-01,  4.10369021e+00, -1.19815252e+00,
       -1.22936318e+00, -9.31568373e-01,  6.61937389e-01, -1.40241611e-01,
        3.32524175e-01, -1.69358336e-01, -1.10623765e-01,  1.07230103e+00,
       -5.49702798e-01,  1.53210943e+00,  1.05788887e+00,  8.09743589e-01,
        7.37334085e-01,  6.09317938e-01,  4.74973404e+00,  3.32187031e-01,
        7.22885052e-01, -1.19269093e+00, -6.24389560e-01,  1.40970343e+00,
        2.45252559e+00, -6.51446398e-01,  1.17829492e-01,  6.10695579e-01,
        8.57241216e-01, -7.12175731e-01, -1.15895076e+00, -5.95125036e-01,
        2.04351487e+00, -

In [59]:
y_pred = linreg.predict(X_test)
y_pred

array([-2.57386152, -3.03136761, -1.86261137, ..., -2.91032874,
       -5.81147004, -1.72379508])

In [60]:
mean_absolute_error(y_test, y_pred)

2.787772245920324

In [61]:
r2_score(y_test, y_pred)

0.44115571085847016

Bonus Questions:

1.  How well does a constant-only model perform compared to the models above?

2.   For this exercise, you were provided data that was already prepared by condensing each trip into one row. Go back to the original dataset and perform the preparation, creating an ID column and keeping only the next-to-last ADHERENCE value.
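For the second bonus question, one possible approach is sketched below using the four example rows shown near the top of this notebook. It assumes the rows are already in stop order within each trip; the variable names `raw`, `from_end`, and `condensed` are illustrative only.

```python
import pandas as pd

# The four rows of the first trip, copied from the example earlier in the notebook
raw = pd.DataFrame({
    "CALENDAR_ID": [120230801] * 4,
    "TRIP_ID": [345104] * 4,
    "ADHERENCE": [-2.13333, -2.45, -0.933333, 6.28333],
    "TRIP_EDGE": [1, 0, 0, 2],
})

# ID is CALENDAR_ID and TRIP_ID joined with an underscore
raw["ID"] = raw["CALENDAR_ID"].astype(str) + "_" + raw["TRIP_ID"].astype(str)

# Position of each row counted from the end of its trip: 0 = last, 1 = next-to-last
# (assumes rows are sorted in stop order within each ID)
from_end = raw.groupby("ID").cumcount(ascending=False)

# Keep only the next-to-last row of each trip
condensed = raw[from_end == 1]
print(condensed)
```

Trips with a single row have no next-to-last stop and are dropped by this filter; whether that is the desired behavior depends on how the provided dataset was built.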