## Machine Learning Exercise 4 - Regularized Regression

For this exercise, you'll be attempting to predict the next-to-last ADHERENCE value per trip. 

Download the prepared dataset from [here](https://drive.google.com/file/d/1Rki8-zZTet8jaWDYwIOnW5EE4-8lYL_q/view?usp=sharing).
 
You've been provided variables containing the HOUR, DAY_OF_WEEK, STARTING_ADHERENCE, and STARTING_DWELL time.

1. Fit a regular, unregularized linear regression model using ROUTE_ABBR, ROUTE_DIRECTION_NAME, OPERATOR, DAY_OF_WEEK, HOUR, STARTING_ADHERENCE, and STARTING_DWELL. How well does this model do? Inspect the coefficients for this model. What do you notice? 

2. Now, use ridge regression. How well does this model do compared to the regular linear regression model? How do the coefficients compare?

3. Finally, try a lasso regression model. Try adjusting the regularization strength (the alpha parameter) to see how it affects the model. Inspect both the performance and the non-zero coefficients.
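For the third task, the effect of alpha on the number of non-zero coefficients can be sketched on synthetic data. This is only an illustration: the data, the names `X_demo`, `y_demo`, and `nonzero_counts`, and the alpha grid are assumptions made for the example, not part of the WeGo dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for illustration): only the first
# three of ten features actually influence the target.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 10
X_demo = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=n_samples)

# Fit a lasso model for each alpha and count the surviving coefficients.
nonzero_counts = {}
for alpha in [0.001, 0.1, 1.0, 10.0]:
    lasso = Lasso(alpha=alpha, max_iter=10_000)
    lasso.fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(lasso.coef_ != 0))
    r2 = lasso.score(X_demo, y_demo)
    print(f"alpha={alpha}: R2={r2:.3f}, non-zero coefficients={nonzero_counts[alpha]}")
```

Larger alpha values push more coefficients to exactly zero, which is the behavior to look for when inspecting the non-zero coefficients on the real data.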

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression, Lasso, RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [2]:
wego_reg = pd.read_csv('../data/wego_regularization.csv')

In [9]:
wego_reg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62760 entries, 0 to 62759
Data columns (total 38 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   CALENDAR_ID                62760 non-null  int64  
 1   SERVICE_ABBR               62760 non-null  int64  
 2   ADHERENCE_ID               62760 non-null  int64  
 3   DATE                       62760 non-null  object 
 4   ROUTE_ABBR                 62760 non-null  int64  
 5   BLOCK_ABBR                 62760 non-null  int64  
 6   OPERATOR                   62760 non-null  int64  
 7   TRIP_ID                    62760 non-null  int64  
 8   OVERLOAD_ID                62760 non-null  int64  
 9   ROUTE_DIRECTION_NAME       62760 non-null  object 
 10  TIME_POINT_ABBR            62760 non-null  object 
 11  ROUTE_STOP_SEQUENCE        62760 non-null  float64
 12  TRIP_EDGE                  62760 non-null  int64  
 13  LATITUDE                   62760 non-null  flo

In [41]:
predictors = ['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME', 'OPERATOR', 'DAY_OF_WEEK', 'HOUR', 
        'STARTING_ADHERENCE', 'STARTING_DWELL']

categorical_predictors = ['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME', 'OPERATOR', 'DAY_OF_WEEK', 'HOUR']

X = wego_reg[predictors]
X = pd.get_dummies(X, columns = categorical_predictors)
y = wego_reg['ADHERENCE']
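An alternative to calling `pd.get_dummies` on the full table is to put the encoding inside a scikit-learn pipeline, so the encoder is fit on the training rows only and can cope with categories it has not seen. The following is a minimal sketch under stated assumptions: the `toy` frame and its values are made up for illustration, and only a subset of the exercise's columns is used.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows that only mimic the column names used in the exercise.
toy = pd.DataFrame({
    "ROUTE_ABBR": [3, 7, 3, 22, 7, 22, 3, 7],
    "HOUR": [6, 7, 8, 6, 7, 8, 7, 6],
    "STARTING_ADHERENCE": [-1.0, 0.5, -2.5, 0.0, -0.5, -3.0, 1.0, -1.5],
    "STARTING_DWELL": [2.0, 0.0, 5.5, 1.0, 3.0, 0.5, 4.0, 2.5],
})
toy_target = pd.Series([-2.0, -0.5, -4.0, -1.0, -1.5, -5.0, 0.0, -2.5])

# One-hot encode the categorical columns and standardize the numeric ones.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["ROUTE_ABBR", "HOUR"]),
    ("scale", StandardScaler(), ["STARTING_ADHERENCE", "STARTING_DWELL"]),
])

model = Pipeline([("preprocess", preprocess), ("ridge", Ridge(alpha=1.0))])
model.fit(toy, toy_target)

# A route that never appeared in training is encoded as all zeros
# instead of raising an error.
unseen = pd.DataFrame({
    "ROUTE_ABBR": [99],
    "HOUR": [7],
    "STARTING_ADHERENCE": [-1.0],
    "STARTING_DWELL": [2.0],
})
prediction = model.predict(unseen)
print(prediction)
```

Standardizing the numeric columns also puts them on a comparable scale, which matters once a penalty on coefficient size is involved.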


In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [43]:
linreg = LinearRegression()
linreg.fit(X_train, y_train)

In [44]:
y_pred = linreg.predict(X_test)
y_pred

array([-4.96832275, -1.56323242, -6.03112793, ..., -2.86590576,
       -4.08575439, -9.90179443])

In [45]:
linreg.intercept_

-320296704671.1143

In [46]:
linreg.coef_

array([ 8.16908355e-01,  5.80034141e-02,  1.79608548e+11,  1.79608548e+11,
        1.79608548e+11,  1.79608548e+11,  1.79608548e+11,  1.79608548e+11,
        1.79608548e+11,  1.79608548e+11,  2.01079817e+11,  2.01079817e+11,
        9.14512394e+08,  9.14512392e+08,  9.14512394e+08,  9.14512390e+08,
        9.14512392e+08,  9.14512393e+08,  9.14512391e+08,  9.14512392e+08,
        9.14512391e+08,  9.14512391e+08,  9.14512393e+08,  9.14512393e+08,
        9.14512392e+08,  9.14512392e+08,  9.14512394e+08,  9.14512392e+08,
        9.14512391e+08,  9.14512392e+08,  9.14512393e+08,  9.14512393e+08,
        9.14512393e+08,  9.14512392e+08,  9.14512394e+08,  9.14512392e+08,
        9.14512394e+08,  9.14512394e+08,  9.14512393e+08,  9.14512393e+08,
        9.14512392e+08,  9.14512394e+08,  9.14512393e+08,  9.14512393e+08,
        9.14512390e+08,  9.14512392e+08,  9.14512394e+08,  9.14512395e+08,
        9.14512392e+08,  9.14512393e+08,  9.14512393e+08,  9.14512393e+08,
        9.14512392e+08,  

In [47]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [48]:
mean_squared_error(y_test, y_pred)

19.825067819352572

In [49]:
mean_absolute_error(y_test, y_pred)

2.735623889851219

In [50]:
r2_score(y_test, y_pred)

0.38907184385559135
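One likely explanation for the enormous intercept and coefficients above is perfect collinearity: when every level of a categorical variable gets its own dummy column, the dummies for that variable sum to one in every row, which duplicates the intercept. The sketch below checks this on a small made-up frame (the `demo` data and the helper `design_rank` are hypothetical names for illustration).

```python
import numpy as np
import pandas as pd

# Small made-up frame with two categorical predictors.
demo = pd.DataFrame({
    "DAY_OF_WEEK": [1, 2, 3, 1, 2, 3, 1, 2],
    "HOUR": [7, 8, 7, 8, 7, 8, 8, 7],
})

def design_rank(frame):
    """Return (rank, column count) of the design matrix with an intercept column."""
    design = np.column_stack([np.ones(len(frame)), frame.to_numpy(dtype=float)])
    return np.linalg.matrix_rank(design), design.shape[1]

# All levels kept: the dummy columns are collinear with the intercept.
full = pd.get_dummies(demo, columns=["DAY_OF_WEEK", "HOUR"])
# First level of each variable dropped: the redundancy is removed.
reduced = pd.get_dummies(demo, columns=["DAY_OF_WEEK", "HOUR"], drop_first=True)

full_rank, full_cols = design_rank(full)
reduced_rank, reduced_cols = design_rank(reduced)
print(f"all levels:  rank {full_rank} of {full_cols} columns")
print(f"drop_first:  rank {reduced_rank} of {reduced_cols} columns")
```

When the rank is lower than the number of columns, infinitely many coefficient vectors give the same predictions, so the individual values can be arbitrarily large while the test metrics still look reasonable.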

Now, use ridge regression. How well does this model do compared to the regular linear regression model? How do the coefficients compare?

In [51]:
from sklearn.linear_model import Ridge

In [54]:
predictors = ['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME', 'OPERATOR', 'DAY_OF_WEEK', 'HOUR', 
        'STARTING_ADHERENCE', 'STARTING_DWELL']

categorical_predictors = ['ROUTE_ABBR', 'ROUTE_DIRECTION_NAME', 'OPERATOR', 'DAY_OF_WEEK', 'HOUR']

X = wego_reg[predictors]
X = pd.get_dummies(X, columns = categorical_predictors)
y = wego_reg['ADHERENCE']

In [55]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X, y, random_state = 42)

In [56]:
alphas = [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
ridge_scores = []
for alpha in alphas:

    # Create a Ridge regression model
    ridge = Ridge(alpha=alpha)

    # Fit on the training split created in In [55]
    ridge.fit(X2_train, y2_train)

    # Obtain R-squared on the matching test split
    score = ridge.score(X2_test, y2_test)
    ridge_scores.append(score)
print(ridge_scores)

[0.3890647526651544, 0.38897668448261413, 0.3882114241157334, 0.38105782256512954, 0.35532068259807703, 0.3233986294153153]
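The loop above compares alphas by their test-set score. A common alternative is to let `RidgeCV` (imported at the top but not used) choose alpha by cross-validation on the training data, keeping the test set for a final check. The following sketch uses synthetic data and hypothetical names (`X_demo`, `Xd_train`, and so on), so its numbers say nothing about the WeGo results.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic regression problem (an assumption for illustration).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 8))
y_demo = X_demo @ rng.normal(size=8) + rng.normal(scale=1.0, size=300)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=42
)

# Same alpha grid as in the loop above.
alphas = [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]

# RidgeCV picks alpha using cross-validation within the training data only.
ridge_cv = RidgeCV(alphas=alphas)
ridge_cv.fit(Xd_train, yd_train)

best_alpha = ridge_cv.alpha_
test_r2 = ridge_cv.score(Xd_test, yd_test)
print(f"selected alpha: {best_alpha}, test R2: {test_r2:.3f}")
```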


In [57]:
# `ridge` is the last model fit in the loop above, so this score is for alpha = 10000.0
print(f'Ridge R2 Score: {r2_score(y2_test, ridge.predict(X2_test))}')

Ridge R2 Score: 0.3233986294153153


In [58]:
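# y_pred still holds the LinearRegression predictions from In [44], so this repeats the linear regression MSE;
# use ridge.predict(X2_test) in place of y_pred to get the ridge MSE
mean_squared_error(y2_test, y_pred)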

19.825067819352572