## Linear Models

In this notebook, we try to model the data using a linear regression model. We begin by importing the necessary modules and loading the train and test data from the train_test directory.

In [1]:
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

from mtcars_practice.config import data_dir

In [2]:
X_train = np.load(data_dir + '/train_test/X_train.npy')
X_test = np.load(data_dir + '/train_test/X_test.npy')

y_train = np.load(data_dir + '/train_test/y_train.npy')
y_test = np.load(data_dir + '/train_test/y_test.npy')

We will start with linear regression. Just for curiosity's sake, we try both the LinearRegression class and the SGDRegressor class, since their implementations differ: LinearRegression solves the least-squares problem directly, while SGDRegressor fits the coefficients iteratively with stochastic gradient descent. The SGDRegressor produces a higher RMSE, probably because the SGD algorithm settles on a suboptimal solution. We will continue with the LinearRegression class for the rest of the modeling.

In [3]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

y_pred = lin_reg.predict(X_train)
lin_reg_rmse = np.sqrt(mean_squared_error(y_train, y_pred))

print(lin_reg_rmse)

3.1220569526033626


In [4]:
sgd_reg = SGDRegressor()
sgd_reg.fit(X_train, y_train)

y_pred_sgd = sgd_reg.predict(X_train)
sgd_reg_rmse = np.sqrt(mean_squared_error(y_train, y_pred_sgd))

print(sgd_reg_rmse)

3.902317609532293
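
One common reason for SGD to stop short of the least-squares solution is that the features sit on very different scales. The following is a minimal sketch of that idea, not a rerun of the cells above: it uses synthetic stand-in data (`X_demo`, `y_demo` and their coefficients are made up for illustration), standardizes the features inside a pipeline, and fixes `random_state` so the result is repeatable. Whether the arrays in `train_test` are already standardized is not stated here, so treat the scaling step as an assumption to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the notebook's dataset):
# three features on very different scales, plus Gaussian noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 300.0])
y_demo = 20.0 + X_demo @ np.array([1.5, -0.04, -0.01]) + rng.normal(scale=3.0, size=200)

# Direct least-squares fit: no linear model can reach a lower training RMSE.
ols = LinearRegression().fit(X_demo, y_demo)
ols_rmse = np.sqrt(mean_squared_error(y_demo, ols.predict(X_demo)))

# SGD on standardized features, with a fixed seed for reproducibility.
sgd = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
)
sgd.fit(X_demo, y_demo)
sgd_rmse = np.sqrt(mean_squared_error(y_demo, sgd.predict(X_demo)))

print(f'OLS RMSE: {ols_rmse:.3f}  scaled SGD RMSE: {sgd_rmse:.3f}')
```

Because ordinary least squares minimizes the training error exactly, the SGD figure can approach the OLS figure but never drop below it; the gap shows how far the iterative fit is from convergence.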


The root mean squared error is about 3 mpg over the entire training set, but that figure is computed on the same data the model was fit on, so it says little about how well the model generalizes. We use 4-fold cross-validation to measure this.

In [6]:
lin_reg = LinearRegression()
scores_lin = cross_val_score(lin_reg, X_train, y_train, scoring='neg_mean_squared_error', cv=4)
rmse_scores_lin = np.sqrt(-scores_lin)  # scores are negated MSE values, so flip the sign first

print(rmse_scores_lin, '\n')
print('Mean RMSE: ', rmse_scores_lin.mean())
print('Std RMSE: ', rmse_scores_lin.std())

[2.78111103 3.0702988  3.81408112 3.2900013 ] 

Mean RMSE:  3.2388730648939728
Std RMSE:  0.37796883501639367
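
To make explicit what `cross_val_score` does with `cv=4` and `scoring='neg_mean_squared_error'`, here is a small sketch that repeats the same procedure by hand with `KFold`. It runs on synthetic stand-in data (`X_demo` and `y_demo` are invented for illustration), so the numbers will not match the output above; the point is only that the manual loop and the helper agree.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical, not the notebook's dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=80)

model = LinearRegression()

# Manual 4-fold loop: fit on three folds, score on the held-out fold.
manual_rmse = []
for train_idx, val_idx in KFold(n_splits=4).split(X_demo):
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[val_idx])
    manual_rmse.append(np.sqrt(mean_squared_error(y_demo[val_idx], fold_pred)))
manual_rmse = np.array(manual_rmse)

# Same thing through the helper; for a regressor, an integer cv means
# an unshuffled KFold, so the folds are identical to the loop above.
scores = cross_val_score(model, X_demo, y_demo,
                         scoring='neg_mean_squared_error', cv=4)
helper_rmse = np.sqrt(-scores)

print(manual_rmse)
print(helper_rmse)
```

Since the folds are not shuffled, the split depends on row order; if the training rows are sorted in some meaningful way, passing a `KFold(n_splits=4, shuffle=True, random_state=...)` object as `cv` may give a fairer estimate.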


The plain linear regression model may be overfitting, so we try regularizing it. Ridge, Lasso, and Elastic Net regression each add a penalty on the size of the coefficients, which gives the model a degree of regularization.
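
For reference, the three estimators differ only in the penalty added to the squared-error loss. As documented in the scikit-learn user guide, and omitting the intercept for brevity, the objectives minimized over the coefficient vector $w$ are as follows, where $n$ is the number of training samples, $\alpha$ is the `alpha` argument, and $\rho$ is `l1_ratio`:

$$
\text{Ridge:} \quad \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:} \quad \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

$$
\text{Elastic Net:} \quad \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

Note that the Ridge loss is not divided by $2n$, so the same numeric value of `alpha` does not mean the same regularization strength across the three classes.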

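The mean cross-validated RMSE of the Ridge regression model ends up marginally higher than that of plain linear regression (3.2394 versus 3.2389) with alpha=0.5. Lasso and Elastic Net, both with alpha=0.1, score higher still, at 3.35 and 3.63.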

In [11]:
ridge_reg = Ridge(alpha=0.5, solver='cholesky')
scores_ridge = cross_val_score(ridge_reg, X_train, y_train, scoring='neg_mean_squared_error', cv=4)
rmse_scores_ridge = np.sqrt(-scores_ridge)

print(rmse_scores_ridge, '\n')
print('Mean RMSE: ', rmse_scores_ridge.mean())
print('Std RMSE: ', rmse_scores_ridge.std())

[2.82016917 3.06479369 3.84565084 3.22690326] 

Mean RMSE:  3.2393792396047205
Std RMSE:  0.3787935120129052


In [15]:
lasso_reg = Lasso(alpha=0.1)
scores_lasso = cross_val_score(lasso_reg, X_train, y_train, scoring='neg_mean_squared_error', cv=4)
rmse_scores_lasso = np.sqrt(-scores_lasso)

print(rmse_scores_lasso, '\n')
print('Mean RMSE: ', rmse_scores_lasso.mean())
print('Std RMSE: ', rmse_scores_lasso.std())

[2.99362495 3.1981228  3.90388921 3.31271714] 

Mean RMSE:  3.3520885245278143
Std RMSE:  0.3384655368117302


In [17]:
elnet_reg = ElasticNet(alpha=0.1, l1_ratio=0.5)
scores_elnet = cross_val_score(elnet_reg, X_train, y_train, scoring='neg_mean_squared_error', cv=4)
rmse_scores_elnet = np.sqrt(-scores_elnet)

print(rmse_scores_elnet, '\n')
print('Mean RMSE: ', rmse_scores_elnet.mean())
print('Std RMSE: ', rmse_scores_elnet.std())

[3.32493105 3.41076207 4.08204827 3.70663456] 

Mean RMSE:  3.631093989511821
Std RMSE:  0.29637309388251093
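
The regularization strengths above were fixed by hand. One possible next step, sketched here under stated assumptions, is to scan a range of `alpha` values with `GridSearchCV` and keep the one with the best cross-validated score. The data (`X_demo`, `y_demo`) and the `alphas` grid are hypothetical choices for illustration, not values taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical, not the notebook's dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=2.0, size=80)

# Candidate regularization strengths (an arbitrary illustrative grid).
alphas = [0.01, 0.1, 0.5, 1.0, 5.0, 10.0]

search = GridSearchCV(
    Ridge(),
    param_grid={'alpha': alphas},
    scoring='neg_mean_squared_error',
    cv=4,
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_['alpha']
# best_score_ is the mean negated MSE over the folds, so this is the
# root of the mean MSE, not the mean of the per-fold RMSE values.
best_rmse = np.sqrt(-search.best_score_)

print(f'Best alpha: {best_alpha}  CV RMSE: {best_rmse:.3f}')
```

The same pattern applies to `Lasso` and `ElasticNet` by swapping the estimator and, for Elastic Net, adding `l1_ratio` to the parameter grid.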


For curiosity's sake, we would like to see the performance of each model on the test dataset.

In [25]:
lin_reg = LinearRegression()
ridge_reg = Ridge(alpha=0.5, solver='cholesky')
lasso_reg = Lasso(alpha=0.1)
elnet_reg = ElasticNet(alpha=0.1, l1_ratio=0.5)

for model in (lin_reg, ridge_reg, lasso_reg, elnet_reg):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f'Model: {model}  RMSE: {rmse}\n')

Model: LinearRegression()  RMSE: 2.7556148834117464

Model: Ridge(alpha=0.5, solver='cholesky')  RMSE: 2.8017895090837035

Model: Lasso(alpha=0.1)  RMSE: 2.894542712268647

Model: ElasticNet(alpha=0.1)  RMSE: 3.167661044403877

