In [33]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import numpy as np

In [2]:
wego = pd.read_csv('../data/wego_ml.csv')

1. Fit a linear regression model predicting the ADHERENCE using the ROUTE_ABBR and ROUTE_DIRECTION_NAME columns. Measure the performance of the model using the R^2 and mean absolute error metrics. Interpret the meaning of each metric.

In [6]:
# LinearRegression instance
linreg = LinearRegression()

In [8]:
# create X and y
X = wego[['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME']]
y = wego['ADHERENCE']

In [13]:
# both X columns are categorical, so one-hot encode them with get_dummies
# note: drop_first = True would make the coefficients easier to interpret (the dropped
# level is absorbed into the intercept), but it is not applied here
X = pd.get_dummies(X, columns = X.columns)
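The effect of `drop_first` can be checked on a toy frame. This is a minimal sketch on made-up data (the `ROUTE` column and its values are hypothetical, not taken from `wego_ml.csv`): the predictions should match with or without `drop_first`, but with it each coefficient reads as a difference from the dropped level, whose mean sits in the intercept.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data for illustration only
toy = pd.DataFrame({
    'ROUTE': ['A', 'A', 'B', 'B', 'C', 'C'],
    'y': [1.0, 3.0, 4.0, 6.0, 9.0, 11.0],
})

# One-hot encode with and without dropping the first level
X_full = pd.get_dummies(toy[['ROUTE']], columns=['ROUTE'], dtype=float)
X_drop = pd.get_dummies(toy[['ROUTE']], columns=['ROUTE'], drop_first=True, dtype=float)

full = LinearRegression().fit(X_full, toy['y'])
drop = LinearRegression().fit(X_drop, toy['y'])

# Both models predict the per-route mean
print(full.predict(X_full))
print(drop.predict(X_drop))

# With drop_first, the intercept is the mean of route A and the
# coefficients are the offsets of routes B and C from route A
print(drop.intercept_, dict(zip(X_drop.columns, drop.coef_)))
```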

In [15]:
# create train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [23]:
# fit linear regression model
linreg.fit(X_train, y_train)

# generate predictions
y_pred = linreg.predict(X_test)

In [24]:
# measure performance
mean_absolute_error(y_test, y_pred)

3.5201387587455897

In [25]:
r2_score(y_test, y_pred)

0.07982146967916637
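To interpret these two numbers: the MAE is the average absolute gap between predicted and actual ADHERENCE, in the same units as ADHERENCE, so this model's predictions are off by about 3.5 units on average. R^2 is the share of the variance in ADHERENCE that the model explains relative to always predicting the mean, so about 8% here. A minimal sketch, on made-up numbers rather than the WeGo data, showing that both metrics can be reproduced by hand:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up values for illustration only (not from wego_ml.csv)
y_true = np.array([-2.0, -5.0, 1.0, -8.0, -3.0])
y_hat = np.array([-3.0, -4.0, -1.0, -6.0, -3.5])

# MAE: mean of the absolute errors
mae_manual = np.mean(np.abs(y_true - y_hat))

# R^2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The manual values should agree with scikit-learn's
print(mae_manual, mean_absolute_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```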

2. Now, try using the ROUTE_ABBR, ROUTE_DIRECTION_NAME, and OPERATOR. Does this improve the model? Warning: Your model may perform very poorly once you add the OPERATOR. If so, this is likely because some operators have very few observations. One option to correct this is to assign an "Other" (or -999999) value to operators with few observations.

In [26]:
# create X and y
X = wego[['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME', 'OPERATOR']]
y = wego['ADHERENCE']

# X columns are categorical, use get_dummies
X = pd.get_dummies(X, columns = X.columns)

In [27]:
# create train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

# fit linear regression model
linreg.fit(X_train, y_train)

# generate predictions
y_pred = linreg.predict(X_test)

# measure performance
print(mean_absolute_error(y_test, y_pred))
r2_score(y_test, y_pred)

3.253868914867842


0.17416221731811343
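The warning in question 2 suggests grouping operators that have few observations. A minimal sketch of one way to do that, on a small made-up frame; the helper name `lump_rare` and the cutoff `min_obs` are assumptions for illustration, since the exercise does not specify a threshold:

```python
import pandas as pd

def lump_rare(series, min_obs, other=-999999):
    """Replace values that occur fewer than min_obs times with `other`."""
    counts = series.value_counts()
    rare = counts[counts < min_obs].index
    # keep values that are not rare, replace the rest with `other`
    return series.where(~series.isin(rare), other)

# Made-up operator IDs for illustration only
toy = pd.DataFrame({'OPERATOR': [101, 101, 101, 202, 202, 303, 404]})
toy['OPERATOR_LUMPED'] = lump_rare(toy['OPERATOR'], min_obs=2)
print(toy['OPERATOR_LUMPED'].value_counts())
```

The lumped column would then take the place of OPERATOR in the list of columns passed to `pd.get_dummies`.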

3. Finally, the data you have been provided has a STARTING_ADHERENCE column, which contains the ADHERENCE at the beginning of the route. If you add this metric, does it improve the model? Is this of any practical use?

In [31]:
# create X and y
X = wego[['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME', 'OPERATOR', 'STARTING_ADHERENCE']]
y = wego['ADHERENCE']

# the first three columns are categorical, use get_dummies on them;
# STARTING_ADHERENCE is numeric and is left as-is
X = pd.get_dummies(X, columns = ['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME', 'OPERATOR'])

In [32]:
# create train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

# fit linear regression model
linreg.fit(X_train, y_train)

# generate predictions
y_pred = linreg.predict(X_test)

# measure performance
print(mean_absolute_error(y_test, y_pred))
r2_score(y_test, y_pred)

2.8039172634576675


0.41618375826552423

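Adding STARTING_ADHERENCE raises R^2 from about 0.17 to about 0.42 and lowers the MAE from about 3.25 to about 2.80. This makes sense: how early or late a trip starts carries a lot of information about its adherence later on. In practical terms, the value is only known once the trip has begun, so it can help with predictions made during a trip but not with predictions made in advance.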

**Bonus Questions:** 
* How well does a constant-only model perform compared to the models above?

Without a model, how well would simply predicting the training average do?

In [34]:
y_train.mean()

-3.9276449081846256

In [35]:
# create a constant numpy array
y_train.mean() * np.ones_like(y_test)

array([-3.92764491, -3.92764491, -3.92764491, ..., -3.92764491,
       -3.92764491, -3.92764491])

In [38]:
mean_absolute_error(y_test, (y_train.mean() * np.ones_like(y_test)))

3.676956865658779

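The constant-only baseline has an MAE of about 3.68. The first model (3.52) is only slightly better, while adding OPERATOR (3.25) and then STARTING_ADHERENCE (2.80) gives larger improvements.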

An alternative approach would be to create a constant column (no predictors) and reuse the linear regression code.
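As a sketch of that alternative, a `LinearRegression` fitted on a single constant column ends up predicting the training mean, which is also what scikit-learn's `DummyRegressor` with `strategy='mean'` does. The data below are made up for illustration:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Made-up target values for illustration only
y_tr = np.array([-4.0, -2.0, -6.0, 0.0])
y_te = np.array([-3.0, -5.0, 1.0])

# A single constant column stands in for "no predictors"
const_X_tr = np.ones((len(y_tr), 1))
const_X_te = np.ones((len(y_te), 1))

# Option 1: DummyRegressor ignores the features and predicts the training mean
dummy = DummyRegressor(strategy='mean').fit(const_X_tr, y_tr)
pred_dummy = dummy.predict(const_X_te)

# Option 2: LinearRegression on the constant column; the intercept becomes the mean
lin = LinearRegression().fit(const_X_tr, y_tr)
pred_lin = lin.predict(const_X_te)

print(pred_dummy, pred_lin)
print(mean_absolute_error(y_te, pred_dummy))
```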

* For this exercise, you were provided data that was already prepared by condensing each trip into one row. Go back to the original dataset and perform the preparation, creating an ID column and keeping only the next-to-last ADHERENCE value.
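
A minimal sketch of that preparation on a made-up frame. The column names `CALENDAR_ID`, `TRIP_ID`, and `STOP_SEQUENCE` are assumptions about the original dataset and are not confirmed by this notebook; substitute whichever columns identify a trip and order its stops:

```python
import pandas as pd

# Made-up stop-level rows; the column names are hypothetical
raw = pd.DataFrame({
    'CALENDAR_ID': [1, 1, 1, 1, 1, 2, 2],
    'TRIP_ID': [10, 10, 10, 11, 11, 10, 10],
    'STOP_SEQUENCE': [1, 2, 3, 1, 2, 1, 2],
    'ADHERENCE': [-1.0, -2.5, -4.0, 0.5, -0.5, -3.0, -6.0],
})

# Build one ID per trip by combining the day and trip identifiers
raw['ID'] = raw['CALENDAR_ID'].astype(str) + '_' + raw['TRIP_ID'].astype(str)

# Order the stops within each trip, then count rows from the end of each trip:
# the last row gets 0 and the next-to-last row gets 1
raw = raw.sort_values(['ID', 'STOP_SEQUENCE'])
from_end = raw.groupby('ID').cumcount(ascending=False)

# Keep only the next-to-last row of each trip
trips = raw.loc[from_end == 1, ['ID', 'ADHERENCE']].reset_index(drop=True)
print(trips)
```

Counting from the end with `cumcount` means a trip with a single row contributes nothing, which seems like a reasonable default when there is no next-to-last value to keep.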