**Note:** Make sure that you perform a train/test split before fitting any models so that you can properly measure the performance of the model.

1. Fit a linear regression model predicting the ADHERENCE using the ROUTE_ABBR and ROUTE_DIRECTION_NAME columns. Measure the performance of the model using the R^2 and mean absolute error metrics. Interpret the meaning of each metric.

2. Now, try using the ROUTE_ABBR, ROUTE_DIRECTION_NAME, and OPERATOR. Does this improve the model? Warning: Your model may perform very poorly once you add the OPERATOR. If so, this is likely because some operators have very few observations. One option to correct this is to assign an "Other" (or -999999) value to operators with few observations.

3. Finally, the data you have been provided has a STARTING_ADHERENCE column, which contains the ADHERENCE at the beginning of the route. If you add this metric, does it improve the model? Is this of any practical use?

**Bonus Questions:** 
* How well does a constant-only model perform compared to the models above?
* For this exercise, you were provided data that was already prepared by condensing each trip into one row. Go back to the original dataset and perform the preparation, creating an ID column and keeping only the next-to-last ADHERENCE value.
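
For the second bonus question, one possible approach is sketched below on a tiny made-up table. It assumes that the original data has one row per time point, that ROUTE_STOP_SEQUENCE orders the stops within a trip, and that the ID is CALENDAR_ID and TRIP_ID joined by an underscore (which matches the IDs in the prepared file). The `stops` frame and all of its values are invented for illustration, and the STARTING_ADHERENCE step is an optional extra.

```python
import pandas as pd

# Hypothetical stop-level data: one row per time point on each trip.
stops = pd.DataFrame({
    'CALENDAR_ID': [120230801] * 7,
    'TRIP_ID': [345104] * 4 + [345105] * 3,
    'ROUTE_STOP_SEQUENCE': [1, 2, 3, 4, 1, 2, 3],
    'ADHERENCE': [-2.0, -1.5, -3.0, -4.5, 0.5, -1.0, -2.5],
})

# Build a trip-level ID from the calendar day and the trip number.
stops['ID'] = stops['CALENDAR_ID'].astype(str) + '_' + stops['TRIP_ID'].astype(str)

# Order the stops within each trip, then count backwards from the last stop:
# 0 = last stop, 1 = next-to-last stop.
stops = stops.sort_values(['ID', 'ROUTE_STOP_SEQUENCE'])
from_end = stops.groupby('ID').cumcount(ascending=False)

# Keep the next-to-last row of every trip (single-stop trips drop out).
trips = stops[from_end == 1].reset_index(drop=True)

# Optionally attach the first ADHERENCE of each trip as STARTING_ADHERENCE.
starting = (stops.groupby('ID')['ADHERENCE'].first()
            .rename('STARTING_ADHERENCE')
            .reset_index())
trips = trips.merge(starting, on='ID')

print(trips[['ID', 'ADHERENCE', 'STARTING_ADHERENCE']])
```

Using `cumcount(ascending=False)` rather than positional indexing keeps the selection explicit about what "next-to-last" means when trips have different numbers of stops.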

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [2]:
wego = pd.read_csv('../data/wego_ml.csv')

In [3]:
wego

,ID,CALENDAR_ID,SERVICE_ABBR,ADHERENCE_ID,DATE,ROUTE_ABBR,BLOCK_ABBR,OPERATOR,TRIP_ID,OVERLOAD_ID,...,ADJUSTED_ONTIME_COUNT,STOP_CANCELLED,PREV_SCHED_STOP_CANCELLED,IS_RELIEF,BLOCK_STOP_ORDER,DWELL_IN_MINS,NextDay_Scheduled,NextDay_Actual_Arrival,NextDay_Actual_Departure,STARTING_ADHERENCE
0,120230801_345104,120230801,1,99457892,2023-08-01,22,2200,1040,345104,0,...,1,0,0.0,0,19,0.000000,0,0,0,-2.133333
1,120230801_345105,120230801,1,99457895,2023-08-01,22,2200,1040,345105,0,...,1,0,0.0,0,51,0.000000,0,0,0,-1.583333
2,120230801_345106,120230801,1,99457899,2023-08-01,22,2200,1040,345106,0,...,1,0,0.0,0,80,0.000000,0,0,0,-1.716666
3,120230801_345107,120230801,1,99457902,2023-08-01,22,2200,1040,345107,0,...,0,0,0.0,0,112,0.000000,0,0,0,-1.316666
4,120230801_345108,120230801,1,99457906,2023-08-01,22,2200,1040,345108,0,...,1,0,0.0,0,141,0.000000,0,0,0,-1.516666
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64328,120230913_353986,120230913,1,100702403,2023-09-13,50,8601,2400,353986,0,...,0,0,0.0,0,23,3.966666,0,0,0,-15.183333
64329,120230913_353988,120230913,1,100702412,2023-09-13,55,8604,359,353988,0,...,1,0,0.0,0,2,44.400000,0,0,0,-2.900000
64330,120230913_354069,120230913,1,100702754,2023-09-13,50,9302,2423,354069,0,...,1,0,0.0,0,2,35.916666,0,0,0,-0.300000
64331,120230913_354098,120230913,1,100702882,2023-09-13,50,9950,2563,354098,0,...,0,0,0.0,0,2,51.966666,0,0,0,1.933333


In [4]:
wego.columns

Index(['ID', 'CALENDAR_ID', 'SERVICE_ABBR', 'ADHERENCE_ID', 'DATE',
       'ROUTE_ABBR', 'BLOCK_ABBR', 'OPERATOR', 'TRIP_ID', 'OVERLOAD_ID',
       'ROUTE_DIRECTION_NAME', 'TIME_POINT_ABBR', 'ROUTE_STOP_SEQUENCE',
       'TRIP_EDGE', 'LATITUDE', 'LONGITUDE', 'SCHEDULED_TIME',
       'ACTUAL_ARRIVAL_TIME', 'ACTUAL_DEPARTURE_TIME', 'ADHERENCE',
       'SCHEDULED_HDWY', 'ACTUAL_HDWY', 'HDWY_DEV', 'ADJUSTED_EARLY_COUNT',
       'ADJUSTED_LATE_COUNT', 'ADJUSTED_ONTIME_COUNT', 'STOP_CANCELLED',
       'PREV_SCHED_STOP_CANCELLED', 'IS_RELIEF', 'BLOCK_STOP_ORDER',
       'DWELL_IN_MINS', 'NextDay_Scheduled', 'NextDay_Actual_Arrival',
       'NextDay_Actual_Departure', 'STARTING_ADHERENCE'],
      dtype='object')

**1. Fit a linear regression model predicting the ADHERENCE using the ROUTE_ABBR and ROUTE_DIRECTION_NAME columns. Measure the performance of the model using the R^2 and mean absolute error metrics. Interpret the meaning of each metric.**

In [5]:
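# One-hot encode both categorical predictors. drop_first=True drops one reference
# level per column, which avoids perfectly collinear dummy columns; each remaining
# coefficient is then a difference relative to that reference level.
wego_dummies = pd.get_dummies(wego[['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME']], columns=['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME'], drop_first=True)
X = wego_dummies
y = wego['ADHERENCE'].values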

# predictors = ['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME']
# X = wego_data[predictors]
# X = pd.get_dummies(X, columns=predictors)
# y = wego_data['ADHERENCE']

In [6]:
wego_dummies

,ROUTE_ABBR_7,ROUTE_ABBR_22,ROUTE_ABBR_23,ROUTE_ABBR_50,ROUTE_ABBR_52,ROUTE_ABBR_55,ROUTE_ABBR_56,ROUTE_DIRECTION_NAME_TO DOWNTOWN
0,0,1,0,0,0,0,0,1
1,0,1,0,0,0,0,0,0
2,0,1,0,0,0,0,0,1
3,0,1,0,0,0,0,0,0
4,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...
64328,0,0,0,1,0,0,0,0
64329,0,0,0,0,0,1,0,1
64330,0,0,0,1,0,0,0,1
64331,0,0,0,1,0,0,0,1


In [7]:
# Hold out 30% of the trips for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
# mean_squared_error returns the MSE (not its square root) by default
mse = mean_squared_error(y_test, y_pred)
print('R2 Score:', r2)
print('mean_absolute_error:', mae)
print('mean_squared_error:', mse)

R2 Score: 0.08422539727803569
mean_absolute_error: 3.50529424564376
mean_squared_error: 30.036479404919717
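
To help interpret the two metrics, the sketch below recomputes them by hand on synthetic data (the arrays are generated and are not the WeGo data). R^2 is one minus the ratio of the model's squared error to the squared error of always predicting the mean of the test target, so a value of about 0.08 means the model removes roughly 8% of that variation. The mean absolute error is the average size of a miss, in the same units as ADHERENCE. A constant-only baseline, here built with scikit-learn's `DummyRegressor`, is included for comparison.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one weak predictor plus a lot of noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 1))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(scale=2.0, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)

# R^2 = 1 - (residual sum of squares) / (total sum of squares around the test mean)
ss_res = np.sum((y_te - pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# MAE = average absolute difference between actual and predicted values
mae_manual = np.mean(np.abs(y_te - pred))

# Constant-only baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
baseline_pred = baseline.predict(X_te)

print('R2 (manual): ', r2_manual, ' R2 (sklearn): ', r2_score(y_te, pred))
print('MAE (manual):', mae_manual, ' MAE (sklearn):', mean_absolute_error(y_te, pred))
print('Baseline R2: ', r2_score(y_te, baseline_pred))
print('Baseline MAE:', mean_absolute_error(y_te, baseline_pred))
```

The baseline's test R^2 is at or slightly below zero by construction, which makes it a natural reference point for the constant-only bonus question.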


In [8]:
linreg.coef_

array([ 0.08631152,  2.05373663,  1.57786822,  1.0790815 , -0.36304208,
       -2.26434508,  0.29030593,  1.99549622])
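
The coefficient array is easier to read when each value is paired with its dummy column name. A minimal sketch, using a small made-up frame in place of `wego`: with `drop_first=True`, each coefficient is the expected difference in ADHERENCE relative to the dropped reference level, and the intercept is the prediction for the reference route and direction.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up stand-in for the trip data (values are illustrative only).
demo = pd.DataFrame({
    'ROUTE_ABBR': [3, 3, 7, 7, 22, 22, 3, 7],
    'ROUTE_DIRECTION_NAME': ['FROM DOWNTOWN', 'TO DOWNTOWN'] * 4,
    'ADHERENCE': [-4.0, -2.0, -3.5, -1.0, -2.0, 0.5, -4.5, -1.5],
})

X_demo = pd.get_dummies(demo[['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME']],
                        columns=['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME'],
                        drop_first=True)
model = LinearRegression().fit(X_demo, demo['ADHERENCE'])

# Label each coefficient with the dummy column it belongs to.
coefs = pd.Series(model.coef_, index=X_demo.columns)
print('intercept (reference: route 3, FROM DOWNTOWN):', model.intercept_)
print(coefs)
```

In the notebook itself, the equivalent would be `pd.Series(linreg.coef_, index=X.columns)`.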

**2. Now, try using the ROUTE_ABBR, ROUTE_DIRECTION_NAME, and OPERATOR. Does this improve the model? Warning: Your model may perform very poorly once you add the OPERATOR. If so, this is likely because some operators have very few observations. One option to correct this is to assign an "Other" (or -999999) value to operators with few observations.**

In [9]:
predictors = ['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME', 'OPERATOR']
X = wego[predictors]
# Treat all three predictors, including the numeric OPERATOR ID, as categories
X = pd.get_dummies(X, columns=predictors)
y = wego['ADHERENCE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
# mean_squared_error returns the MSE (not its square root) by default
mse = mean_squared_error(y_test, y_pred)
print('R2:', r2)
print('mae:', mae)
print('mse:', mse)

R2: 0.17956676044750286
mae: 3.2336921290544045
mse: 26.90937926175696


In [10]:
wego[['OPERATOR']].describe()

,OPERATOR
count,64333.0
mean,1932.589371
std,771.300712
min,235.0
25%,1372.0
50%,2001.0
75%,2569.0
max,3173.0
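
Because OPERATOR is an identifier rather than a quantity, its mean and quartiles say little; the number of trips per operator is what matters when deciding which operators to group. A small sketch on made-up IDs (the cutoff of 2 is arbitrary and only suits this toy data):

```python
import pandas as pd

# Made-up operator IDs; in the notebook this would be wego['OPERATOR'].
operator = pd.Series([1040, 1040, 1040, 2400, 2400, 359, 2423, 2563, 1040, 2400])

# Trips per operator, most frequent first.
counts = operator.value_counts()
print(counts)

# Summary of the counts themselves, useful when choosing a cutoff.
print(counts.describe())

# Operators at or below the (arbitrary) cutoff, and the share of trips they cover.
threshold = 2
rare = counts[counts <= threshold].index.tolist()
rare_share = operator.isin(rare).mean()
print('rare operators:', sorted(rare), 'share of trips:', rare_share)
```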


In [11]:
wego['OPERATOR'].value_counts()[lambda x:x > 50].index.tolist()

[1762,
 1761,
 2164,
 1170,
 379,
 1040,
 1147,
 1823,
 1222,
 1617,
 1266,
 2352,
 723,
 326,
 2312,
 409,
 1190,
 2324,
 2053,
 2424,
 1133,
 2246,
 341,
 1368,
 2793,
 1204,
 1401,
 304,
 1927,
 2399,
 1410,
 2733,
 2760,
 1650,
 1893,
 2088,
 2109,
 2832,
 1566,
 3090,
 2062,
 1237,
 2630,
 2689,
 2780,
 1623,
 522,
 2041,
 2374,
 520,
 1590,
 2638,
 1052,
 332,
 329,
 559,
 1914,
 1824,
 3031,
 1992,
 1372,
 2001,
 2765,
 1800,
 2706,
 1788,
 2990,
 659,
 1361,
 778,
 1343,
 2963,
 2012,
 785,
 1208,
 2112,
 3064,
 2040,
 357,
 2055,
 517,
 1500,
 1177,
 2082,
 1792,
 2500,
 1582,
 2649,
 1352,
 1202,
 1277,
 1892,
 2306,
 2217,
 414,
 1542,
 2127,
 1842,
 1394,
 1763,
 1682,
 2161,
 2371,
 1948,
 2335,
 2200,
 1305,
 1489,
 1359,
 1717,
 1379,
 1330,
 2697,
 2969,
 2059,
 2529,
 2321,
 1593,
 3087,
 3079,
 1457,
 3122,
 2792,
 2357,
 2458,
 611,
 2967,
 3101,
 2623,
 359,
 2783,
 1703,
 2675,
 1997,
 2560,
 3013,
 1958,
 2610,
 1251,
 2752,
 2512,
 2505,
 2348,
 2230,
 1866,
 258]

In [12]:
predictors = ['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME', 'OPERATOR']
X = wego[predictors].copy()  # copy so the assignment below does not write into a slice of wego
y = wego['ADHERENCE']

# Operators with more than 50 observations
operators = wego['OPERATOR'].value_counts()[lambda x: x > 50].index.tolist()
# isin(operators) selects these frequent operators, so they (not the rare ones)
# are the rows relabeled as -999999 here
X.loc[X['OPERATOR'].isin(operators), 'OPERATOR'] = -999999

X = pd.get_dummies(X, columns=predictors, drop_first=True)

# No test_size given, so train_test_split uses its default 25% test share
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
# mean_squared_error returns the MSE (not its square root) by default
mse = mean_squared_error(y_test, y_pred)
print('R2:', r2)
print('mae:', mae)
print('mse:', mse)

R2: 0.08314976963548881
mae: 3.5119451136425663
mse: 30.426262182934142
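
Note that `isin(operators)` in the cell above matches the operators with more than 50 observations, so the frequent operators are the ones relabeled. This may be why the scores fall back to roughly those of the route-and-direction model. If the goal is to group the rare operators, as the exercise suggests, the mask would need to be negated. A minimal sketch on made-up data (the `demo` frame and the `min_count` value are illustrative assumptions):

```python
import pandas as pd

# Made-up stand-in for wego[['ROUTE_ABBR', 'OPERATOR']].
demo = pd.DataFrame({
    'ROUTE_ABBR': [22, 22, 50, 50, 55, 55, 22, 50],
    'OPERATOR':   [1040, 1040, 1040, 2400, 2400, 359, 2423, 2400],
})

min_count = 3  # arbitrary cutoff for this toy example

# Operators with at least `min_count` trips keep their own category.
counts = demo['OPERATOR'].value_counts()
frequent = counts[counts >= min_count].index

# Work on a copy and relabel every operator that is NOT frequent.
grouped = demo.copy()
grouped.loc[~grouped['OPERATOR'].isin(frequent), 'OPERATOR'] = -999999

# One-hot encode; with drop_first=True the -999999 group becomes the reference level.
X_demo = pd.get_dummies(grouped, columns=['ROUTE_ABBR', 'OPERATOR'], drop_first=True)
print(X_demo.columns.tolist())
```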


**3. Finally, the data you have been provided has a STARTING_ADHERENCE column, which contains the ADHERENCE at the beginning of the route. If you add this metric, does it improve the model? Is this of any practical use?**

In [13]:
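predictors = ['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME', 'OPERATOR', 'STARTING_ADHERENCE']
X = wego[predictors].copy()  # copy so the assignment below does not write into a slice of wego
y = wego['ADHERENCE']

# Operators with more than 50 observations
operators = wego['OPERATOR'].value_counts()[lambda x: x > 50].index.tolist()
# isin(operators) selects these frequent operators, so they (not the rare ones)
# are the rows relabeled as -999999 here
X.loc[X['OPERATOR'].isin(operators), 'OPERATOR'] = -999999

# columns=predictors includes STARTING_ADHERENCE, so this continuous column is
# also expanded into one indicator column per distinct value
X = pd.get_dummies(X, columns=predictors)

# No test_size given, so train_test_split uses its default 25% test share
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
# mean_squared_error returns the MSE (not its square root) by default
mse = mean_squared_error(y_test, y_pred)
print('R2:', r2)
print('mae:', mae)
print('mse:', mse)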

R2: -9.036180521478176e+21
mae: 30671130987.494442
mse: 2.9987143872943245e+23
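
Scores this extreme are what one would expect when a continuous column is one-hot encoded: passing `columns=predictors` to `pd.get_dummies` also turns STARTING_ADHERENCE into one indicator column per distinct value, and many of those values are likely to appear only in the training split or only in the test split. A sketch of the alternative, encoding only the categorical columns and leaving STARTING_ADHERENCE numeric, on synthetic data (all values below are generated, not taken from the WeGo file):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with one categorical and one continuous predictor.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    'ROUTE_ABBR': rng.choice([22, 50, 55], size=n),
    'STARTING_ADHERENCE': rng.normal(loc=-1.0, scale=2.0, size=n),
})
demo['ADHERENCE'] = 0.8 * demo['STARTING_ADHERENCE'] - 2.0 + rng.normal(scale=1.0, size=n)

categorical = ['ROUTE_ABBR']
X_demo = demo[categorical + ['STARTING_ADHERENCE']]

# Only the categorical columns are expanded; STARTING_ADHERENCE stays one numeric column.
X_demo = pd.get_dummies(X_demo, columns=categorical, drop_first=True)
y_demo = demo['ADHERENCE']

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

print('columns:', X_demo.columns.tolist())
print('R2:', r2_score(y_te, pred))
print('mae:', mean_absolute_error(y_te, pred))
```

Whether such a model is of practical use depends on when the prediction is needed, since STARTING_ADHERENCE is only known once the trip has begun.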
