# Multiple Linear Regression Model

This notebook models the relationship between a car's design parameters and its CO<sub>2</sub> emission value.
It is the same kind of model as model_1, but uses multiple features instead of one.

Again, begin by importing all the required modules:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Markdown as md
import pickle

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

%matplotlib inline

It's the same dataset, but let's see the details again anyway:

In [2]:
df = pd.read_csv("../datasets/FuelConsumption.csv")
df.head()

| | MODELYEAR | MAKE | MODEL | VEHICLECLASS | ENGINESIZE | CYLINDERS | TRANSMISSION | FUELTYPE | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | ACURA | ILX | COMPACT | 2.0 | 4 | AS5 | Z | 9.9 | 6.7 | 8.5 | 33 | 196 |
| 1 | 2014 | ACURA | ILX | COMPACT | 2.4 | 4 | M6 | Z | 11.2 | 7.7 | 9.6 | 29 | 221 |
| 2 | 2014 | ACURA | ILX HYBRID | COMPACT | 1.5 | 4 | AV7 | Z | 6.0 | 5.8 | 5.9 | 48 | 136 |
| 3 | 2014 | ACURA | MDX 4WD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.7 | 9.1 | 11.1 | 25 | 255 |
| 4 | 2014 | ACURA | RDX AWD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.1 | 8.7 | 10.6 | 27 | 244 |


Along with a statistical summary of the numeric columns:

In [3]:
df.describe()

| | MODELYEAR | ENGINESIZE | CYLINDERS | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
|---|---|---|---|---|---|---|---|---|
| count | 1067.0 | 1067.0 | 1067.0 | 1067.0 | 1067.0 | 1067.0 | 1067.0 | 1067.0 |
| mean | 2014.0 | 3.346298 | 5.794752 | 13.296532 | 9.474602 | 11.580881 | 26.441425 | 256.228679 |
| std | 0.0 | 1.415895 | 1.797447 | 4.101253 | 2.79451 | 3.485595 | 7.468702 | 63.372304 |
| min | 2014.0 | 1.0 | 3.0 | 4.6 | 4.9 | 4.7 | 11.0 | 108.0 |
| 25% | 2014.0 | 2.0 | 4.0 | 10.25 | 7.5 | 9.0 | 21.0 | 207.0 |
| 50% | 2014.0 | 3.4 | 6.0 | 12.6 | 8.8 | 10.9 | 26.0 | 251.0 |
| 75% | 2014.0 | 4.3 | 8.0 | 15.55 | 10.85 | 13.35 | 31.0 | 294.0 |
| max | 2014.0 | 8.4 | 12.0 | 30.2 | 20.5 | 25.8 | 60.0 | 488.0 |


I'll use engine size and combined fuel consumption (in MPG) as planned previously, along with the number of cylinders, to keep the model's inputs the same as in the solution notebook:

In [4]:
features_df = df[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB_MPG']]
X_train, X_test, y_train, y_test = train_test_split(features_df, np.asanyarray(df[['CO2EMISSIONS']]), test_size=0.25)
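The split above is random, so the coefficients and scores further down change from run to run. `train_test_split` accepts a `random_state` argument to pin the shuffle. The sketch below shows the same idea in plain Python on made-up data; `split_rows` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import random


def split_rows(rows, test_size=0.25, seed=0):
    """Hypothetical helper: shuffle row indices with a fixed seed, then cut."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle order
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(20))  # stand-in for 20 dataset rows
train_a, test_a = split_rows(rows, seed=42)
train_b, test_b = split_rows(rows, seed=42)

print(len(train_a), len(test_a))  # sizes of the two parts
print(test_a == test_b)           # identical split when the seed is reused
```

With a fixed seed, anyone rerunning the notebook gets the same train and test rows, which makes the reported scores comparable between runs.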

After the split, initialize the model and fit it to the training set:

In [5]:
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()
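For some intuition about what `fit` does here: `LinearRegression` solves an ordinary least-squares problem, choosing the intercept and coefficients that minimize the squared error on the training set. The sketch below does the same on a tiny made-up dataset in plain Python, using the normal equations. `solve` and `fit_least_squares` are hypothetical helpers for this illustration only, and scikit-learn's actual solver works differently under the hood.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], so row operations apply to both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_least_squares(features, targets):
    """Return [intercept, coef_1, ...] minimizing the squared error."""
    design = [[1.0] + list(row) for row in features]  # leading 1 = intercept
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, targets))
           for i in range(k)]
    return solve(xtx, xty)


# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (not from the dataset)
features = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
targets = [1 + 2 * x1 + 3 * x2 for x1, x2 in features]

weights = fit_least_squares(features, targets)
print([round(w, 6) for w in weights])
```

Because the toy targets are generated without noise, the fit recovers the generating intercept and coefficients up to floating-point error.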

I wrote a one-liner to render the model equation as Markdown. The coefficients appear at full precision because the bare `{}` placeholders carry no format spec; a spec such as `{:.2f}` would round them.

In [6]:
md("Model Equation: {} = {} + {} * {} + {} * {} + {} * {}".format(r"$\hat{y}$", model.intercept_[0], model.coef_[0][0], "$x_1$", model.coef_[0][1], "$x_2$", model.coef_[0][2], "$x_3$"))

Model Equation: $\hat{y}$ = 308.50889687234303 + 9.177390259922134 * $x_1$ + 7.704448342986529 * $x_2$ + -4.831740812244262 * $x_3$
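If shorter numbers are preferred, Python's format specs can round each value while building the string. A minimal sketch, with the intercept and coefficients copied by hand from the output above rather than read from `model`:

```python
# Values copied from the printed model equation
intercept = 308.50889687234303
coefs = [9.177390259922134, 7.704448342986529, -4.831740812244262]

# Emit the sign separately so a negative coefficient reads "- 4.83", not "+ -4.83"
terms = " ".join(
    f"{'+' if c >= 0 else '-'} {abs(c):.2f} * $x_{i}$"
    for i, c in enumerate(coefs, start=1)
)

# Raw f-string: the backslash in \hat stays literal, and {{y}} renders as {y}
equation = rf"Model Equation: $\hat{{y}}$ = {intercept:.2f} {terms}"
print(equation)
```

In the notebook, passing `equation` to `md(...)` would render it the same way as the cell above.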

Make some predictions with this model:

In [7]:
predict = model.predict(X_test)
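For a linear model, `predict` just evaluates the fitted equation for each row. As a sanity check, here is that arithmetic done by hand for the first row shown in `df.head()`, using the coefficients printed above. The variable names are my own, and this row may or may not have ended up in the test set.

```python
# Intercept and coefficients copied from the printed model equation,
# in feature order: ENGINESIZE, CYLINDERS, FUELCONSUMPTION_COMB_MPG
intercept = 308.50889687234303
coef_engine = 9.177390259922134
coef_cylinders = 7.704448342986529
coef_mpg = -4.831740812244262

# First row of df.head(): the 2014 Acura ILX
engine_size, cylinders, comb_mpg, actual_co2 = 2.0, 4, 33, 196

predicted_co2 = (
    intercept
    + coef_engine * engine_size
    + coef_cylinders * cylinders
    + coef_mpg * comb_mpg
)
print(f"predicted: {predicted_co2:.1f}, actual: {actual_co2}")
```

The hand-computed value lands within a few units of the recorded 196, which is about what the error metrics below would suggest.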

And see how well it worked:

In [8]:
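# np.mean of the squared residuals is the mean squared error (MSE), not the residual sum of squares
print("Mean Squared Error: {}\n".format(np.mean((predict - y_test) ** 2)))
print("Model Score: {}\n".format(model.score(X_test, y_test)))
# r2_score expects (y_true, y_pred); in this order it matches model.score
md("{} score: {}".format("$R^2$", r2_score(y_test, predict)))

Mean Squared Error: 463.9342837808609

Model Score: 0.8887854369437045



$R^2$ score: 0.8887854369437045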
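Two details of these metrics are easy to mix up: the mean of the squared residuals is the mean squared error (MSE), while the residual sum of squares (RSS) is their sum; and $R^2$ is not symmetric, so swapping the true and predicted values changes the result. The sketch below shows both on a handful of values. The true values come from `df.head()`, the predictions are made up, and `r_squared` is a hand-written helper that follows the usual definition $1 - SS_{res} / SS_{tot}$.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [196, 221, 136, 255, 244]  # CO2EMISSIONS from df.head()
y_pred = [200, 215, 140, 250, 250]  # made-up predictions for illustration

rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # sum of squares
mse = rss / len(y_true)                                  # mean of squares

print("RSS:", rss)
print("MSE:", mse)
print("R^2 (true, pred):", r_squared(y_true, y_pred))
print("R^2 (pred, true):", r_squared(y_pred, y_true))
```

The two $R^2$ values differ because the denominator is the variance of whichever list is passed first, so the true values belong in the first position.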

Finally, save a copy of this model:

In [9]:
with open('../saved_models/model_2.sav', 'wb') as model_file:
    pickle.dump(model, model_file)
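To use the saved model later, it can be read back with `pickle.load`. The sketch below shows the round trip on a plain dictionary standing in for the fitted model, written to a temporary directory so it runs anywhere; the stand-in object and the temporary path are assumptions for illustration, not the notebook's real file.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model; any picklable object behaves the same way
stand_in_model = {"intercept": 308.51, "coef": [9.18, 7.70, -4.83]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_2.sav")

    # Save: open in binary write mode and let the context manager close the file
    with open(path, "wb") as model_file:
        pickle.dump(stand_in_model, model_file)

    # Load: open in binary read mode and rebuild the object
    with open(path, "rb") as model_file:
        restored = pickle.load(model_file)

print(restored == stand_in_model)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.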