## Linear Models

In this notebook, we try to model the data using various linear regression models. We begin by import the necessary modules, as well as loading the train and test data from the train_test directory.

In [5]:
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

from mtcars_practice.config import data_dir

In [3]:
X_train = np.load(data_dir + '/train_test/X_train.npy')
X_test = np.load(data_dir + '/train_test/X_test.npy')

y_train = np.load(data_dir + '/train_test/y_train.npy')
y_test = np.load(data_dir + '/train_test/y_test.npy')

We will start with linear regression. Just for curiosities sake, we try both the LinearRegression class, as well as the SGDRegressor class, since the implementations are different. Both of them produce the same RMSE.

In [8]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

y_pred = lin_reg.predict(X_train)
lin_reg_rmse = np.sqrt(mean_squared_error(y_train, y_pred))

print(lin_reg_rmse)

3.1220569526033626


In [9]:
sgd_reg = SGDRegressor()
sgd_reg.fit(X_train, y_train)

y_pred_sgd = sgd_reg.predict(X_train)
sgd_reg_rmse = np.sqrt(mean_squared_error(y_train, y_pred))

print(sgd_reg_rmse)

3.1220569526033626


While we see that the root mean squared error is about 3 mpg over the training set, we would like to measure the generalizability of the model. We try using cross validation to measure this.

In [11]:
lin_reg = LinearRegression()
scores = cross_val_score(lin_reg, X_train, y_train, scoring='neg_mean_squared_error', cv=4)
rmse_scores = np.sqrt(-scores)

print(rmse_scores)
print(rmse_scores.mean(), rmse_scores.std())

[2.78111103 3.0702988  3.81408112 3.2900013 ]
3.2388730648939728 0.37796883501639367
