## Linear Models

In this notebook, we model the data using several linear regression models. We begin by importing the necessary modules and loading the train and test data from the train_test directory.

In [1]:
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, GridSearchCV

from mtcars_practice.config import data_dir

In [2]:
X_train = np.load(data_dir + '/train_test/X_train.npy')
X_test = np.load(data_dir + '/train_test/X_test.npy')

y_train = np.load(data_dir + '/train_test/y_train.npy')
y_test = np.load(data_dir + '/train_test/y_test.npy')

We start with linear regression. Just for curiosity's sake, we try both the LinearRegression class and the SGDRegressor class, since their implementations differ: LinearRegression solves the least-squares problem directly, while SGDRegressor minimizes the loss iteratively with stochastic gradient descent. The SGDRegressor produces a higher training RMSE (3.83 versus 3.12), probably because SGD stops at a suboptimal solution. We will continue with the LinearRegression class for the rest of the modeling.

In [3]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

y_pred = lin_reg.predict(X_train)
lin_reg_rmse = np.sqrt(mean_squared_error(y_train, y_pred))

print(lin_reg_rmse)

3.1220569526033626


In [4]:
sgd_reg = SGDRegressor()
sgd_reg.fit(X_train, y_train)

y_pred_sgd = sgd_reg.predict(X_train)
sgd_reg_rmse = np.sqrt(mean_squared_error(y_train, y_pred_sgd))

print(sgd_reg_rmse)

3.82700346770967
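SGD is sensitive to feature scale and to its stopping criterion, so part of the gap above may come from those settings rather than from the algorithm itself. The following is a minimal sketch on synthetic stand-in data, not the notebook's arrays; `X_demo`, `y_demo`, and `train_rmse` are hypothetical names introduced for illustration, and whether the real features are already standardized is not known from this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales
# (an assumption for illustration; not X_train / y_train).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
y_demo = X_demo @ np.array([2.0, -0.5, 0.03]) + rng.normal(scale=0.5, size=200)


def train_rmse(model, X, y):
    """Fit the model and return its RMSE on the data it was fit on."""
    model.fit(X, y)
    return np.sqrt(mean_squared_error(y, model.predict(X)))


ols_rmse = train_rmse(LinearRegression(), X_demo, y_demo)

# Standardize the features before SGD and fix the seed so the run is repeatable.
sgd_scaled = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=5000, tol=1e-6, random_state=0),
)
sgd_rmse = train_rmse(sgd_scaled, X_demo, y_demo)

print(f'OLS RMSE: {ols_rmse:.3f}')
print(f'SGD RMSE: {sgd_rmse:.3f}')
```

Ordinary least squares minimizes the training MSE over all linear models with an intercept, so the SGD training RMSE can approach it but never fall below it.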


While the root mean squared error is about 3 mpg over the entire training set, this says little about how well the model generalizes. We use 4-fold cross-validation to estimate that.

In [5]:
lin_reg = LinearRegression()
scores_lin = cross_val_score(lin_reg, X_train, y_train, scoring='neg_mean_squared_error', cv=4)
rmse_scores_lin = np.sqrt(-scores_lin)

print(rmse_scores_lin, '\n')
print('Mean RMSE: ', rmse_scores_lin.mean())
print('Std RMSE: ', rmse_scores_lin.std())

[2.78111103 3.0702988  3.81408112 3.2900013 ] 

Mean RMSE:  3.2388730648939728
Std RMSE:  0.37796883501639367
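The same three steps (score with `neg_mean_squared_error`, negate, take the square root) recur for every model in this notebook. scikit-learn scorers follow a greater-is-better convention, which is why the MSE comes back negated. A small helper could wrap the pattern; `cv_rmse` and the synthetic `X_demo` / `y_demo` arrays below are hypothetical and only serve to make the sketch runnable on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score


def cv_rmse(model, X, y, cv=4):
    """Return the per-fold RMSE from k-fold cross-validation."""
    # The scorer returns negative MSE, so flip the sign before the square root.
    scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=cv)
    return np.sqrt(-scores)


# Synthetic stand-in data (an assumption; not the notebook's X_train / y_train).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(scale=1.0, size=100)

rmse_lin = cv_rmse(LinearRegression(), X_demo, y_demo)
rmse_ridge = cv_rmse(Ridge(alpha=0.5), X_demo, y_demo)

for name, rmse in (('linear', rmse_lin), ('ridge', rmse_ridge)):
    print(f'{name}: mean RMSE {rmse.mean():.3f}, std {rmse.std():.3f}')
```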


It is possible that the plain linear regression model is overfitting, so we try regularizing it. Ridge, Lasso, and Elastic Net regression each add a penalty on the size of the coefficients.

The mean cross-validated RMSE of the Ridge regression model actually ends up slightly higher than that of plain linear regression when using alpha=0.5.
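For reference, these are the objectives the three scikit-learn estimators minimize, as stated in the library's documentation. Here $n$ is the number of training samples, $w$ the coefficient vector, $\alpha$ the `alpha` parameter, and $\rho$ the `l1_ratio` parameter; the symbol names are chosen here for illustration.

$$
\begin{aligned}
\text{Ridge:} \quad & \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} \quad & \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Elastic Net:} \quad & \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
$$

Because the Ridge loss is not divided by $n$ while the Lasso and Elastic Net losses are, the same numeric `alpha` does not mean the same amount of regularization across the three models, so their `alpha` values should not be compared directly.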

In [6]:
ridge_reg = Ridge(alpha=0.5, solver='cholesky')
scores_ridge = cross_val_score(ridge_reg, X_train, y_train, scoring='neg_mean_squared_error', cv=4)
rmse_scores_ridge = np.sqrt(-scores_ridge)

print(rmse_scores_ridge, '\n')
print('Mean RMSE: ', rmse_scores_ridge.mean())
print('Std RMSE: ', rmse_scores_ridge.std())

[2.82016917 3.06479369 3.84565084 3.22690326] 

Mean RMSE:  3.2393792396047205
Std RMSE:  0.3787935120129052


In [7]:
lasso_reg = Lasso(alpha=0.1)
scores_lasso = cross_val_score(lasso_reg, X_train, y_train, scoring='neg_mean_squared_error', cv=4)
rmse_scores_lasso = np.sqrt(-scores_lasso)

print(rmse_scores_lasso, '\n')
print('Mean RMSE: ', rmse_scores_lasso.mean())
print('Std RMSE: ', rmse_scores_lasso.std())

[2.99362495 3.1981228  3.90388921 3.31271714] 

Mean RMSE:  3.3520885245278143
Std RMSE:  0.3384655368117302


In [8]:
elnet_reg = ElasticNet(alpha=0.1, l1_ratio=0.5)
scores_elnet = cross_val_score(elnet_reg, X_train, y_train, scoring='neg_mean_squared_error', cv=4)
rmse_scores_elnet = np.sqrt(-scores_elnet)

print(rmse_scores_elnet, '\n')
print('Mean RMSE: ', rmse_scores_elnet.mean())
print('Std RMSE: ', rmse_scores_elnet.std())

[3.32493105 3.41076207 4.08204827 3.70663456] 

Mean RMSE:  3.631093989511821
Std RMSE:  0.29637309388251093


It is possible that the scores above are due to suboptimal hyperparameter selection for the regularized regression models. We will attempt to find better values using grid search.

In [16]:
param_grid = [{'alpha': np.arange(0.01, 0.02, 0.001)}]
print(param_grid)

grid_search = GridSearchCV(Lasso(), param_grid, cv=3, scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(X_train, y_train)

lasso_best_params = grid_search.best_params_
print(lasso_best_params)

[{'alpha': array([0.01 , 0.011, 0.012, 0.013, 0.014, 0.015, 0.016, 0.017, 0.018,
       0.019])}]
{'alpha': 0.017999999999999995}


In [17]:
param_grid = [{'alpha': np.arange(0.1, 1, 0.1)}]
print(param_grid)

grid_search = GridSearchCV(Ridge(solver='cholesky'), param_grid, cv=3, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

ridge_best_params = grid_search.best_params_
print(ridge_best_params)

[{'alpha': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])}]
{'alpha': 0.30000000000000004}


In [21]:
param_grid = [{'alpha': np.arange(0.001, 0.01, 0.001), 'l1_ratio': np.arange(0.05, 0.95, 0.05)}]
print(param_grid)

grid_search = GridSearchCV(ElasticNet(), param_grid, cv=3, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

elnet_best_params = grid_search.best_params_
print(elnet_best_params)

[{'alpha': array([0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009]), 'l1_ratio': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 ])}]
{'alpha': 0.009000000000000001, 'l1_ratio': 0.9000000000000001}
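Two checks are useful after a grid search: converting `best_score_` back to an RMSE, and checking whether the winning value sits on the boundary of the grid, which would suggest widening the search. The sketch below uses synthetic stand-in data and a hypothetical log-spaced grid; `X_demo`, `y_demo`, `alphas`, and `on_edge` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the notebook's X_train / y_train).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(scale=1.0, size=100)

# A log-spaced grid covers several orders of magnitude with few candidates.
alphas = np.logspace(-3, 2, 6)
search = GridSearchCV(Ridge(), {'alpha': alphas}, cv=3,
                      scoring='neg_mean_squared_error')
search.fit(X_demo, y_demo)

best_alpha = search.best_params_['alpha']

# best_score_ is the mean negated MSE over the folds, so this is the square
# root of the mean MSE, not the mean of the per-fold RMSE values.
best_cv_rmse = np.sqrt(-search.best_score_)

# True when the selected alpha is the smallest or largest candidate.
on_edge = bool(best_alpha in (alphas.min(), alphas.max()))

print(f'best alpha: {best_alpha:g}')
print(f'CV RMSE at best alpha: {best_cv_rmse:.3f}')
print(f'best alpha on grid edge: {on_edge}')
```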


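Now that we have selected hyperparameters for the regularized linear regression models via grid search, let us evaluate the models on the test data. Note that the Elastic Net's selected alpha and l1_ratio are both the largest values in their grids, so its true optimum may lie outside the searched range. We include the plain linear regression model as a baseline.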

In [22]:
lin_reg = LinearRegression()
ridge_reg = Ridge(alpha=ridge_best_params['alpha'], solver='cholesky')
lasso_reg = Lasso(alpha=lasso_best_params['alpha'])
elnet_reg = ElasticNet(alpha=elnet_best_params['alpha'], l1_ratio=elnet_best_params['l1_ratio'])

for model in (lin_reg, ridge_reg, lasso_reg, elnet_reg):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f'Model: {model}  RMSE: {rmse}\n')

Model: LinearRegression()  RMSE: 2.7556148834117464

Model: Ridge(alpha=0.30000000000000004, solver='cholesky')  RMSE: 2.785089953490792

Model: Lasso(alpha=0.017999999999999995)  RMSE: 2.797656188727464

Model: ElasticNet(alpha=0.009000000000000001, l1_ratio=0.9000000000000001)  RMSE: 2.7901020857143037

