We will use the physical attributes of a car to predict its fuel economy in miles per gallon (mpg).

Linear regression produces a model in the form:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

The coefficients are estimated by minimising the residual sum of squares (RSS), given by the equation below:

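$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
$$RSS = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \dots - \hat{\beta}_p x_{ip})^2$$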
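
To make the RSS concrete, here is a minimal sketch on made-up data (the arrays `x` and `y` and the helper `rss` are illustrative assumptions, not part of the car data set). It fits a line by least squares with NumPy and compares its RSS with that of a deliberately worse line.

```python
import numpy as np

# Toy data (made up for illustration): y = 1 + 2*x exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])


def rss(beta0, beta1):
    """Residual sum of squares for the line y_hat = beta0 + beta1 * x."""
    y_hat = beta0 + beta1 * x
    return float(np.sum((y - y_hat) ** 2))


# Least-squares estimates: the design matrix has a column of ones
# for the intercept and a column holding the single feature.
A = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

rss_best = rss(beta_hat[0], beta_hat[1])  # essentially zero for this toy data
rss_other = rss(0.0, 2.0)                 # every residual is 1, so RSS = 4

print(beta_hat, rss_best, rss_other)
```

Any other choice of intercept and slope gives an RSS at least as large as the least-squares one, which is the sense in which the fitted coefficients are "best".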

In [1]:
import pandas as pd

df = pd.read_csv(r'C:\Users\Acer\Downloads\Day 2-20190519T040411Z-001\Day 2\Linear Regression\auto-mpg.csv')

In [4]:
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


We don’t need the car name column, so let’s remove it.

In [6]:
df = df.drop('car name', axis=1)

,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin
0,18.0,8,307.0,130,3504,12.0,70,1
1,15.0,8,350.0,165,3693,11.5,70,1
2,18.0,8,318.0,150,3436,11.0,70,1
3,16.0,8,304.0,150,3433,12.0,70,1
4,17.0,8,302.0,140,3449,10.5,70,1
5,15.0,8,429.0,198,4341,10.0,70,1
6,14.0,8,454.0,220,4354,9.0,70,1
7,14.0,8,440.0,215,4312,8.5,70,1
8,14.0,8,455.0,225,4425,10.0,70,1
9,15.0,8,390.0,190,3850,8.5,70,1


Also note that the "origin" column records where the car came from. This is a nominal (unordered) categorical variable, so the numeric codes 1, 2 and 3 must not be treated as quantities; we need to create binary dummy variables for it instead.

In [7]:
df['origin'] = df['origin'].replace({1: 'america', 2: 'europe', 3: 'asia'})

In [8]:
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin
0,18.0,8,307.0,130,3504,12.0,70,america
1,15.0,8,350.0,165,3693,11.5,70,america
2,18.0,8,318.0,150,3436,11.0,70,america
3,16.0,8,304.0,150,3433,12.0,70,america
4,17.0,8,302.0,140,3449,10.5,70,america


In [9]:
df = pd.get_dummies(df, columns=['origin'])

In [10]:
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin_america,origin_asia,origin_europe
0,18.0,8,307.0,130,3504,12.0,70,1,0,0
1,15.0,8,350.0,165,3693,11.5,70,1,0,0
2,18.0,8,318.0,150,3436,11.0,70,1,0,0
3,16.0,8,304.0,150,3433,12.0,70,1,0,0
4,17.0,8,302.0,140,3449,10.5,70,1,0,0
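
The encoding step above can be tried on a small, made-up frame. In this sketch the `toy` DataFrame is hypothetical, `dtype=int` is used so the indicators print as 0/1 (recent pandas versions default to booleans), and `drop_first=True` is an optional variation that is not used in this walkthrough.

```python
import pandas as pd

# Hypothetical three-row frame, used only to show the shape of the encoding.
toy = pd.DataFrame({"mpg": [18.0, 26.0, 31.0],
                    "origin": ["america", "europe", "asia"]})

# One indicator column per category, named <column>_<category>.
encoded = pd.get_dummies(toy, columns=["origin"], dtype=int)
print(encoded.columns.tolist())

# Optional variation: drop the first category so it acts as the baseline.
encoded_baseline = pd.get_dummies(toy, columns=["origin"],
                                  drop_first=True, dtype=int)
print(encoded_baseline.columns.tolist())
```

With all three indicators kept, they always sum to 1, so they are perfectly collinear with the intercept. The fit and its predictions are unaffected, but the individual origin coefficients are only meaningful relative to one another; dropping one category is a common way to make them directly interpretable.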


Some horsepower values are missing and are denoted by question marks, so we’ll need to remove the affected rows.

In [14]:
import numpy as np

In [15]:
df = df.replace('?', np.nan)

In [17]:
df.isnull().mean()

mpg               0.000000
cylinders         0.000000
displacement      0.000000
horsepower        0.015075
weight            0.000000
acceleration      0.000000
model year        0.000000
origin_america    0.000000
origin_asia       0.000000
origin_europe     0.000000
dtype: float64

In [18]:
df = df.dropna()

In [19]:
df.shape

(392, 10)
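
One detail worth checking after this kind of clean-up is the column's dtype: a column that contained '?' is read as text, and replacing '?' with NaN does not by itself turn the remaining strings into numbers. A possible alternative, sketched here on a made-up frame (`toy` is hypothetical), is `pd.to_numeric` with `errors="coerce"`, which does both steps at once.

```python
import pandas as pd

# Made-up sample; '?' marks a missing horsepower value, as in the data set.
toy = pd.DataFrame({"mpg": [25.0, 21.0, 19.0],
                    "horsepower": ["95", "?", "130"]})

# errors="coerce" turns anything unparseable (here '?') into NaN and
# returns a numeric column instead of a column of strings.
toy["horsepower"] = pd.to_numeric(toy["horsepower"], errors="coerce")

# Drop the rows that now contain NaN.
clean = toy.dropna()

print(clean.dtypes)
print(clean.shape)
```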

Now we can split our data into a training set and a test set.

In [20]:
X = df.drop('mpg', axis=1)
y = df[['mpg']]

from sklearn.model_selection import train_test_split

# Split X and y into training (75%) and test (25%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
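
As a quick check on the split sizes, here is a small sketch using placeholder arrays with the same number of rows (392) as our cleaned data. `X_demo` and `y_demo` are made-up stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 392 rows, matching the row count after dropna().
X_demo = np.arange(392 * 2).reshape(392, 2)
y_demo = np.arange(392)

# Same settings as above: 25% of the rows are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1)

print(X_tr.shape, X_te.shape)  # 294 training rows and 98 test rows
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the cell gives the same rows in each set.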

We train our LinearRegression model on the training set.

In [21]:
from sklearn.linear_model import LinearRegression

regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

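Now that our model is trained, we can view its coefficients using regression_model.coef_. Because y was passed as a single-column DataFrame, this is a 2D NumPy array of shape (1, number of features), so the coefficients for our one target are in regression_model.coef_[0].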

In [23]:
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model.coef_[0][idx]))

The coefficient for cylinders is -0.2463375586996188
The coefficient for displacement is 0.023870338307149623
The coefficient for horsepower is -0.006017238617773053
The coefficient for weight is -0.007336432943899324
The coefficient for acceleration is 0.21897778104124999
The coefficient for model year is 0.7851801072779495
The coefficient for origin_america is -1.7624934092199256
The coefficient for origin_asia is 0.80962691908585
The coefficient for origin_europe is 0.9528664901340751
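
The `[0]` index used above is needed because of how the target was passed in. The following sketch, on made-up data (`X_demo`, `y_1d` and `y_2d` are illustrative), shows how the shapes of `coef_` and `intercept_` depend on whether y is one-dimensional or a single column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: 5 samples, 2 features, target y = 1 + 2*x1 + 3*x2.
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]])
y_1d = 1.0 + 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1]
y_2d = y_1d.reshape(-1, 1)  # a single column, like df[['mpg']]

model_1d = LinearRegression().fit(X_demo, y_1d)
model_2d = LinearRegression().fit(X_demo, y_2d)

# 1D target: coef_ has shape (n_features,) and intercept_ is a scalar.
print(model_1d.coef_.shape, np.shape(model_1d.intercept_))

# 2D target: coef_ has shape (n_targets, n_features), intercept_ (n_targets,).
print(model_2d.coef_.shape, model_2d.intercept_.shape)
```

Passing `df['mpg']` (a Series) instead of `df[['mpg']]` would therefore remove the need for the extra `[0]`.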


regression_model.intercept_ contains an array of intercepts (β0 values), one per target. We have a single target, so we take the first element.

In [24]:
intercept = regression_model.intercept_[0]

print("The intercept for our model is {}".format(intercept))

The intercept for our model is -19.809183848815916


So we can write our linear model as:

$$Y = -19.81 - 0.25 X_1 + 0.02 X_2 - 0.01 X_3 - 0.01 X_4 + 0.22 X_5 + 0.79 X_6 - 1.76 X_7 + 0.81 X_8 + 0.95 X_9$$

Here X1 to X9 are the features in the order listed above (cylinders through origin_europe). Note that we have not scaled the features, so they are measured on very different scales (weight is in the thousands, acceleration in the teens). The size of a coefficient therefore tells us nothing about the relative importance of its feature.
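
To illustrate the point about scale, here is a sketch on synthetic data (the `weight` and `accel` arrays and the coefficients used to generate `y_demo` are made up, not taken from the car data). It fits the same model on raw and on standardised features; `StandardScaler` is one common choice of scaling, not something this walkthrough applies to the real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (values are made up).
weight = rng.normal(3000.0, 800.0, size=200)
accel = rng.normal(15.0, 2.5, size=200)
X_demo = np.column_stack([weight, accel])
y_demo = 45.0 - 0.007 * weight + 0.2 * accel + rng.normal(0.0, 1.0, size=200)

# Same model, fitted on raw and on standardised features.
raw = LinearRegression().fit(X_demo, y_demo)
scaled = LinearRegression().fit(StandardScaler().fit_transform(X_demo), y_demo)

print(raw.coef_)     # change in y per one unit of each feature
print(scaled.coef_)  # change in y per one standard deviation of each feature
```

On the raw scale the weight coefficient looks negligible next to the acceleration one, yet per standard deviation it is by far the larger effect. Each standardised coefficient is simply the raw coefficient multiplied by that feature's standard deviation.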

A common measure of how well a regression model fits the data is the R² statistic.

The R² statistic is defined as follows:

$$R^2 = 1 - \frac{RSS}{TSS}$$

The RSS (residual sum of squares) measures the variability left unexplained after performing the regression.
The TSS (total sum of squares) measures the total variability in Y.
Therefore the R² statistic measures the proportion of variability in Y that is explained by X using our model.
R² can be determined using our test set and the model’s score method.

In [25]:
regression_model.score(X_test, y_test)

0.8285231316459772

So in our model, 82.85% of the variability in Y on the test set can be explained using X.
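
The definition can be checked by hand. This is a minimal sketch with made-up values (`y_true` and `y_pred` are illustrative, not our test set) that computes RSS, TSS and R² directly with NumPy.

```python
import numpy as np

# Made-up ground truth and predictions, for illustration only.
y_true = np.array([18.0, 15.0, 26.0, 31.0])
y_pred = np.array([17.0, 16.0, 27.0, 29.0])

# RSS: squared residuals 1 + 1 + 1 + 4 = 7
rss = np.sum((y_true - y_pred) ** 2)

# TSS: squared deviations from the mean (22.5) sum to 161
tss = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1.0 - rss / tss
print(rss, tss, r2)
```

For a regressor such as LinearRegression, the score method computes this same quantity from the data passed to it.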

We can also compute the mean squared error (MSE) using scikit-learn’s mean_squared_error function, which compares the predictions for the test set (data not used for training) with the ground truth for that set:

In [26]:
from sklearn.metrics import mean_squared_error

y_predict = regression_model.predict(X_test)

regression_model_mse = mean_squared_error(y_test, y_predict)

regression_model_mse

12.230963834602674

In [27]:
import math

math.sqrt(regression_model_mse)

3.4972794904900972

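So the root mean squared error (RMSE) on our test set is 3.50 mpg, meaning our predictions are typically off by roughly that much. Because errors are squared before averaging, large misses count for more than they would in a plain average of absolute errors.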
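
The difference between the RMSE and a plain average of the absolute errors is easy to see on a tiny made-up example (`y_true` and `y_pred` below are illustrative values, not our test set).

```python
import numpy as np

# Made-up ground truth and predictions; one prediction misses by 6 mpg.
y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_pred = np.array([21.0, 24.0, 30.0, 41.0])

errors = y_pred - y_true          # 1, -1, 0, 6

mse = np.mean(errors ** 2)        # (1 + 1 + 0 + 36) / 4 = 9.5
rmse = np.sqrt(mse)               # about 3.08
mae = np.mean(np.abs(errors))     # (1 + 1 + 0 + 6) / 4 = 2.0

print(mse, rmse, mae)
```

The single large miss pulls the RMSE well above the mean absolute error, which is why the RMSE is best read as a typical error size that penalises outliers, not as a literal average distance.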

We can use our model to predict the miles per gallon for another, unseen car. Let’s give it a go on the following:

Cylinders – 4
Displacement – 121
Horsepower – 110
Weight – 2800
Acceleration – 15.4
Year – 81
Origin – Asia

In [28]:
regression_model.predict([[4, 121, 110, 2800, 15.4, 81, 0, 1, 0]])

array([[28.6713418]])

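The specifications above are those of a Saab 900s, and the prediction is reasonably close to the actual figure of 26 mpg for this car. (Saab is a Swedish manufacturer, so strictly the origin_europe dummy applies here, not origin_asia; given the coefficients above, this would change the prediction by only about 0.14 mpg.)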
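
The prediction can be reproduced by hand from the fitted equation. This sketch copies the intercept and coefficients printed earlier and applies them to the feature values passed to predict above; the dictionary names are illustrative.

```python
# Intercept and coefficients copied from the model output above.
intercept = -19.809183848815916
coefficients = {
    "cylinders": -0.2463375586996188,
    "displacement": 0.023870338307149623,
    "horsepower": -0.006017238617773053,
    "weight": -0.007336432943899324,
    "acceleration": 0.21897778104124999,
    "model year": 0.7851801072779495,
    "origin_america": -1.7624934092199256,
    "origin_asia": 0.80962691908585,
    "origin_europe": 0.9528664901340751,
}

# Feature values as passed to regression_model.predict above.
car = {
    "cylinders": 4, "displacement": 121, "horsepower": 110, "weight": 2800,
    "acceleration": 15.4, "model year": 81,
    "origin_america": 0, "origin_asia": 1, "origin_europe": 0,
}

# Y = intercept + sum of (coefficient * feature value)
prediction = intercept + sum(coefficients[name] * car[name] for name in coefficients)
print(prediction)
```

The result agrees with the value returned by predict, which confirms that the model is nothing more than the linear equation written out earlier.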