In [None]:
%matplotlib inline
import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd

**Note**: in this notebook I am just practicing the concepts of multiple linear regression. I am not considering some aspects of machine learning practice, such as the imputation of missing values or the normalisation of the predictor variables.

The dataset used in this notebook is the Auto MPG dataset from the UCI Machine Learning Repository: (https://archive.ics.uci.edu/ml/datasets/Auto+MPG).

The mpg column is the response variable. All the predictor variables can be treated as numerical except origin, which is categorical.

In [None]:
df = pd.read_csv('data/auto-mpg.csv', sep=r'\s+', header=None, usecols=range(8))
df.columns = ['mpg', 'cylinders', 'displacement',
              'horsepower', 'weight', 'acceleration',
              'model_year', 'origin']
df.head()

Removing rows with missing values:

In [None]:
# Missing horsepower values are encoded as '?'; drop those rows and work on a copy
df = df[df.horsepower != '?'].copy()
df['horsepower'] = df.horsepower.astype(float)

In [None]:
fig, ax = plt.subplots(2, 4)

# One histogram per column, filling the 2x4 grid row by row
for axis, name in zip(ax.flat, df.columns):
    axis.hist(df[name])
    axis.set_title(name + ' distribution')

fig.set_figwidth(16)
fig.set_figheight(4)
plt.tight_layout()

We now check whether any pair of predictors is collinear, that is, whether any two predictors are strongly correlated. Collinearity is an issue because small variations in the data or the model may produce large, erratic changes in the coefficient estimates of the multiple regression model. This does not necessarily affect the accuracy of the model's predictions as a whole, but it makes the estimated effect of each individual predictor unreliable.
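
To make the instability concrete, here is a minimal sketch on synthetic data (not the Auto MPG data; the variable names, sample size and noise levels are assumptions chosen for illustration). Two nearly identical predictors give a badly conditioned X^T X, which is what makes the individual coefficient estimates sensitive to small changes in the data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 200

x1 = rng.normal(size=n)
x2_collinear = x1 + 0.01 * rng.normal(size=n)  # almost a copy of x1
x2_independent = rng.normal(size=n)            # unrelated to x1

def design(a, b):
    """Design matrix with an intercept column followed by the two predictors."""
    return np.column_stack([np.ones(len(a)), a, b])

X_col = design(x1, x2_collinear)
X_ind = design(x1, x2_independent)

# Pearson correlation between the two predictors in each design
r_col = np.corrcoef(x1, x2_collinear)[0, 1]
r_ind = np.corrcoef(x1, x2_independent)[0, 1]

# Condition number of X^T X: a large value means that small perturbations
# of the data can be amplified into large changes in the coefficients
cond_col = np.linalg.cond(X_col.T @ X_col)
cond_ind = np.linalg.cond(X_ind.T @ X_ind)

print('collinear:   r = %.4f, cond = %.1f' % (r_col, cond_col))
print('independent: r = %.4f, cond = %.1f' % (r_ind, cond_ind))
```

The collinear design should report a correlation close to 1 and a condition number several orders of magnitude larger than the independent one.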

There are 7 predictor variables, making a total of 7*6/2 = 21 combinations.
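
As a quick check of that count, the sketch below enumerates the unordered pairs with `itertools.combinations` and compares the result with the closed-form expression n(n-1)/2. The list of names simply mirrors the predictor columns used in this notebook.

```python
from itertools import combinations

predictors = ['cylinders', 'displacement', 'horsepower', 'weight',
              'acceleration', 'model_year', 'origin']

pairs = list(combinations(predictors, 2))  # unordered pairs, no repeats
n = len(predictors)

print(len(pairs))        # number of enumerated pairs
print(n * (n - 1) // 2)  # closed-form count
```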

In [None]:
fig, ax = plt.subplots(5,5)
row = 0
col = 0

for i in range(1,len(df.columns)):
    for j in range(i+1,len(df.columns)):
        
        var1 = df[df.columns[i]]
        var2 = df[df.columns[j]]
        
        ax[row, col].scatter(var1, var2)
        
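        # Sample Pearson correlation: sum of products of the standardised values over n-1
        z1 = (var1 - var1.mean()) / var1.std()
        z2 = (var2 - var2.mean()) / var2.std()
        R = np.sum(z1 * z2) / (len(df) - 1)
        
        ax[row, col].set_title(df.columns[i] + ' vs ' + df.columns[j] + '\nR = ' + str(round(R, 3)))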
        
        col = col + 1
        if col > 4:
            col = 0
            row = row + 1
            
fig.set_figwidth(14)
fig.set_figheight(14)
plt.tight_layout()

There seems to be a fairly strong linear relationship among four of the variables: displacement, horsepower, weight and cylinders. We will examine later, during model selection, how this may affect the accuracy of the model.
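
The R values in the plot titles come from the sample correlation formula written out by hand. As a sanity check, the sketch below applies the same formula to a small made-up sample (the numbers are illustrative, not taken from the dataset) and compares the result with `np.corrcoef`. The helper name `pearson_r` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample, for illustration only
a = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
b = pd.Series([2.1, 3.9, 6.2, 8.1, 9.8])

def pearson_r(x, y):
    """Sample Pearson correlation, written out as in the scatter-plot cell."""
    zx = (x - x.mean()) / x.std()  # pandas .std() uses ddof=1 by default
    zy = (y - y.mean()) / y.std()
    return np.sum(zx * zy) / (len(x) - 1)

r_manual = pearson_r(a, b)
r_numpy = np.corrcoef(a, b)[0, 1]

print(r_manual, r_numpy)
```

The two values should agree up to floating-point error, which suggests the hand-written formula could be replaced by `np.corrcoef` or `Series.corr` if brevity matters more than practice.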



To use categorical predictor variables in our multiple regression model (in our case only the 'origin' predictor is categorical), we need to transform them into indicator, or dummy, variables. An indicator variable takes the value 1 when the original categorical variable had the level that the indicator represents, and 0 otherwise. If the categorical variable has only two levels, a single indicator variable is enough. More generally, a variable with k levels needs only k-1 indicators when the model includes an intercept: the omitted level acts as the baseline, and including all k indicators alongside the intercept makes the columns of the design matrix linearly dependent. The cells below create one indicator column per level of origin.
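
As an illustration of the encoding, here is a minimal sketch using `pd.get_dummies` on a toy column (the values are made up and merely stand in for 'origin'). It shows both the full encoding, with one indicator per level, and the common convention of dropping one level so that it serves as the baseline when the model has an intercept.

```python
import pandas as pd

# Toy column with three levels, standing in for 'origin' (values are made up)
toy = pd.DataFrame({'origin': [1, 3, 2, 1, 3]})

# One indicator per level
full = pd.get_dummies(toy['origin'], prefix='origin').astype(int)

# Drop the first level so that it becomes the baseline category
reduced = pd.get_dummies(toy['origin'], prefix='origin', drop_first=True).astype(int)

print(full)
print(reduced)
```

In the full encoding every row sums to 1, which is exactly the intercept column; in the reduced encoding a row of zeros identifies the baseline level.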

In [None]:
print(np.unique(df.origin))

In [None]:
for v in np.unique(df.origin):
    df['origin_' + str(v)] = (df.origin == v).astype(int)
del df['origin']
df.head()

We will use least squares to fit the multiple regression model.
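
For reference, least squares chooses the coefficients that minimise the residual sum of squares, which leads to the normal equations (X^T X) B = X^T Y. The sketch below works on synthetic data with made-up coefficients (`true_beta` is an assumption for illustration, not something estimated from the Auto MPG data), solves the normal equations and compares the result with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible
n = 100

# Synthetic predictors and response; true_beta is made up for illustration
X_raw = rng.normal(size=(n, 2))
true_beta = np.array([3.0, 1.5, -2.0])    # intercept, slope 1, slope 2
X = np.column_stack([np.ones(n), X_raw])  # prepend the intercept column
y = X @ true_beta + 0.1 * rng.normal(size=n)

# Normal equations: solve (X^T X) beta = X^T y without forming the inverse
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```

Both estimates should agree with each other and land close to `true_beta`. On well-conditioned data the two approaches are interchangeable; `np.linalg.lstsq` tends to be the more robust choice when predictors are strongly correlated.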

In [None]:
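# origin_1 is left out as the baseline level: together with the intercept
# column, all three origin indicators would be linearly dependent and
# X.T @ X would be singular
columns = ['cylinders', 'displacement', 'horsepower', 'weight',
        'acceleration', 'model_year', 'origin_2', 'origin_3']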

Y = df.mpg.values
X = df[columns].values

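# Prepend a column of ones for the intercept term
X = np.append(np.ones((X.shape[0], 1)), X, axis=1)

# Least-squares estimate from the normal equations (X^T X) B = X^T Y,
# solved directly instead of forming the explicit inverse
B = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, Y))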

print('Fitted linear model:')
print('intercept: ' + str(B[0]))
# B[0] is the intercept, so the coefficient of columns[i] is B[i+1]
for i in range(len(columns)):
    print(columns[i] + ': ' + str(B[i+1]))

In [None]:
# TODO: plot linear regression for each predictor variable to test that it worked