In [None]:
%matplotlib inline
import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st

**Note**: in this notebook I am just practicing the concepts of multiple linear regression. I am not considering some aspects related to machine learning, such as the imputation of missing values or the normalisation of the predictor variables.

## Loading and processing the data

The dataset used in this notebook is the Auto MPG dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Auto+MPG).

The mpg column is the response variable. All the predictor variables can be treated as numerical except origin, which is categorical.

In [None]:
# Raw string so that '\s' is passed to the regex engine instead of being read as a string escape
df = pd.read_csv('data/auto-mpg.csv', sep=r'\s+', header=None, usecols=range(8))
df.columns = ['mpg', 'cylinders', 'displacement',
              'horsepower', 'weight', 'acceleration',
              'model_year', 'origin']
df.head()

Removing rows with missing values:

In [None]:
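# Missing horsepower values are encoded as '?'; copy the filtered frame
# so that the assignment below does not act on a view of the original
df = df[df.horsepower != '?'].copy()
df['horsepower'] = df['horsepower'].astype(float)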
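
An alternative way to handle the `?` placeholders is to coerce the column to numeric and drop the rows that become NaN. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not rows from the dataset):

```python
import pandas as pd

# Toy data: '?' marks a missing horsepower value (made-up rows, not the real dataset)
toy = pd.DataFrame({'mpg': [18.0, 25.0, 21.0],
                    'horsepower': ['130.0', '?', '95.0']})

# Non-numeric entries such as '?' become NaN
toy['horsepower'] = pd.to_numeric(toy['horsepower'], errors='coerce')

# Drop the rows where horsepower could not be parsed
toy = toy.dropna(subset=['horsepower']).reset_index(drop=True)
print(toy)
```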

In [None]:
fig, ax = plt.subplots(2, 4)

# One histogram per column, filling the 2x4 grid row by row
for axis, column in zip(ax.flat, df.columns):
    axis.hist(df[column])
    axis.set_title(column + ' distribution')

fig.set_figwidth(16)
fig.set_figheight(4)
plt.tight_layout()

We now check whether any pair of predictors is collinear, that is, strongly correlated. Collinearity is an issue because small variations in the data or the model may produce large, erratic changes in the coefficient estimates of the multiple regression model. This does not necessarily affect the accuracy of the model's predictions as a whole, but it makes the estimated effect of each individual predictor unreliable.

There are 7 predictor variables, making a total of 7*6/2 = 21 combinations.
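
The pair count and the pairwise correlations can also be obtained in a few lines. This is a minimal sketch on synthetic data (the array `data`, the injected collinearity and the 0.8 threshold are all assumptions made for illustration), using `itertools.combinations` and `np.corrcoef`:

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for 7 predictors (made-up data, not the Auto MPG values)
n_obs, n_pred = 200, 7
data = rng.normal(size=(n_obs, n_pred))
data[:, 1] = data[:, 0] + 0.1 * rng.normal(size=n_obs)  # make columns 0 and 1 collinear

# Every unordered pair of predictors: 7 * 6 / 2 = 21
pairs = list(combinations(range(n_pred), 2))

# np.corrcoef treats rows as variables by default, so pass rowvar=False
corr = np.corrcoef(data, rowvar=False)

# Pairs whose absolute correlation exceeds an arbitrary threshold of 0.8
strong = [(i, j) for i, j in pairs if abs(corr[i, j]) > 0.8]
print(len(pairs), strong)
```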

In [None]:
fig, ax = plt.subplots(5,5)
row = 0
col = 0

for i in range(1,len(df.columns)):
    for j in range(i+1,len(df.columns)):
        
        var1 = df[df.columns[i]]
        var2 = df[df.columns[j]]
        
        ax[row, col].scatter(var1, var2)
        
        R = 1/(len(df)-1)*np.sum(((var1-var1.mean())/var1.std())*((var2-var2.mean())/var2.std()))
        
        ax[row, col].set_title(df.columns[i] + ' vs ' + df.columns[j] + '\nR = ' + str(R))
        
        col = col + 1
        if col > 4:
            col = 0
            row = row + 1
            
fig.set_figwidth(14)
fig.set_figheight(14)
plt.tight_layout()

There seems to be a fairly strong linear relationship among a set of four variables: displacement, horsepower, weight and cylinders. We will examine later, during model selection, how this may affect the accuracy of the model.



To use categorical predictor variables in the multiple regression model (here only the 'origin' predictor is categorical), we need to transform them into indicator, or dummy, variables. An indicator variable takes the value 1 when the original categorical variable equals the level it represents and 0 otherwise. A categorical variable with k levels requires k-1 indicator variables, since the remaining level acts as the baseline; in particular, a two-level variable requires only one.
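
As an illustration of the encoding, here is a minimal sketch using `pd.get_dummies` on a made-up three-level column (the `toy` frame is hypothetical). Note that `drop_first=True` uses the first level as the baseline, whereas the cell further below builds the indicators by hand for levels 1 and 2:

```python
import pandas as pd

# Toy categorical column with three levels (made-up values)
toy = pd.DataFrame({'origin': [1, 2, 3, 1, 3]})

# drop_first=True drops the indicator of the first level (origin 1), which becomes the baseline
dummies = pd.get_dummies(toy['origin'], prefix='origin', drop_first=True).astype(int)
print(dummies)
```

A row with zeros in every indicator column belongs to the baseline level.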

In [None]:
print(np.unique(df.origin))

In [None]:
for v in [1,2]:
    df['origin_' + str(v)] = (df.origin == v).astype(int)
del df['origin']
df.head()

## Fitting the multiple regression model

We will use least squares to fit the multiple regression model.
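
Least squares chooses the coefficients that minimise the sum of squared residuals, and the cell below computes them with the closed-form normal equations. As a sanity check of that formula, the following sketch on synthetic data (all names and values here are illustrative assumptions) compares it with `np.linalg.lstsq`, which solves the same problem without forming an explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 100 observations, 3 predictors, known coefficients
n = 100
X_raw = rng.normal(size=(n, 3))
true_B = np.array([2.0, 1.0, -0.5, 3.0])  # intercept first

# Prepend a column of ones for the intercept
X_demo = np.column_stack([np.ones(n), X_raw])
Y_demo = X_demo @ true_B + 0.1 * rng.normal(size=n)

# Closed form via the normal equations
B_normal = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ Y_demo

# Same fit via a least-squares solver (no explicit inverse)
B_lstsq, *_ = np.linalg.lstsq(X_demo, Y_demo, rcond=None)

print(B_normal)
print(B_lstsq)
```

Both approaches should agree up to floating-point error; the solver is generally the safer choice when predictors are strongly correlated.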

In [None]:
columns = ['cylinders', 'displacement', 'horsepower', 'weight',
        'acceleration', 'model_year', 'origin_1', 'origin_2']

Y = df.mpg.values
X = df[columns].values

X = np.append(np.ones((X.shape[0], 1)), X, axis=1)

B = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)), X.T), Y)

print('Fitted linear model:')
print('intercept: ' + str(B[0]))
for i in range(len(columns)):
    print(columns[i] + ': ' + str(B[i+1]))
    
residuals = Y - np.dot(X, B)

adj_r2 = 1 - ((Y.shape[0]-1)*(np.std(residuals)**2)/((Y.shape[0]-len(columns)-1)*(np.std(Y)**2)))
print('\nAdjusted R2: ' + str(adj_r2))

Let's use added variable plots to visualise the relationship between the response variable and each predictor variable individually. The added variable plot for x_0 is built by plotting the residuals of the response variable, after fitting the model with all the predictor variables except x_0, against the residuals of x_0, after regressing x_0 on all the other predictor variables.
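
A useful property of this construction is that the least-squares slope in the added variable plot equals the coefficient of x_0 in the full model. The sketch below checks this on synthetic data with two correlated predictors (the helper `ols` and all data are illustrative assumptions):

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y ~ X (X already contains the intercept column)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(2)
n = 200

# Synthetic, correlated predictors x0 and x1 (made-up data)
x1 = rng.normal(size=n)
x0 = 0.7 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x0 - 1.5 * x1 + rng.normal(size=n)

ones = np.ones(n)
X_full = np.column_stack([ones, x0, x1])
X_rest = np.column_stack([ones, x1])  # all predictors except x0

# Coefficient of x0 in the full model
b_full = ols(X_full, y)[1]

# Residuals of y and of x0 after regressing each on the remaining predictors
res_y = y - X_rest @ ols(X_rest, y)
res_x0 = x0 - X_rest @ ols(X_rest, x0)

# Slope of the line through the added variable plot
# (both residual vectors have zero mean, so no intercept is needed)
b_avp = (res_x0 @ res_y) / (res_x0 @ res_x0)

print(b_full, b_avp)
```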

In [None]:
def added_variable_plot(df, predictor, ax):
    Y = df.mpg.values
    X_0 = df[predictor].values
    
    # All predictors except the one under study (drop returns a new frame)
    X = df[columns].drop(columns=predictor).values
    
    X = np.append(np.ones((X.shape[0], 1)), X, axis=1)
    
    B_y = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)), X.T), Y)
    B_x0 = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)), X.T), X_0)
    
    residuals_y = Y - np.dot(X, B_y)
    residuals_x0 = X_0 - np.dot(X, B_x0)
    
    ax.scatter(residuals_x0, residuals_y)
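    ax.set_xlabel(predictor + ' residuals')
    ax.set_ylabel('mpg residuals')
    
    # Least-squares line through the residuals
    # (ddof=1 so that the standard deviations match the 1/(n-1) factor in R)
    sd_x0 = np.std(residuals_x0, ddof=1)
    sd_y = np.std(residuals_y, ddof=1)
    R = 1/(Y.shape[0] - 1)*np.sum((residuals_x0 - np.mean(residuals_x0))/sd_x0*(residuals_y - np.mean(residuals_y))/sd_y)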
    b1 = R*np.std(residuals_y)/np.std(residuals_x0)
    b0 = np.mean(residuals_y) - b1*np.mean(residuals_x0)
    xo = np.min(residuals_x0) - 0.1
    xf = np.max(residuals_x0) + 0.1
    ax.plot([xo, xf], [b1*xo + b0, b1*xf + b0], 'r', lw=2)
    ax.set_title('R = ' + str(R))
    
fig, ax = plt.subplots(3,3)
row = 0
column = 0
for c in columns:
    added_variable_plot(df, c, ax[row][column])
    column = column + 1
    if column > 2:
        column = 0
        row = row + 1

fig.set_figwidth(8)
fig.set_figheight(8)
plt.tight_layout()

## Checking model assumptions

Is a linear multiple regression model appropriate with this data? The conditions we need to evaluate are:

- The residuals are normally distributed
- The variability of the residuals is nearly constant
- The residuals are independent
- Each variable is linearly related to the outcome

We will assess these conditions by means of a series of plots. We first build a q-q plot to test whether the residuals are normally distributed:

In [None]:
quantiles = np.arange(0.01,0.99,0.01)
q_theoretical = [st.norm.ppf(i, loc=np.mean(residuals), scale=np.std(residuals)) for i in quantiles]

q_sample = [np.percentile(residuals, i*100) for i in quantiles]

fig, ax = plt.subplots()
ax.scatter(q_sample, q_theoretical, color='blue')

min_value = min(np.min(q_theoretical), np.min(q_sample))
max_value = max(np.max(q_theoretical), np.max(q_sample))
ax.plot([min_value, max_value], [min_value, max_value], 'k--')

ax.set_xlabel('residuals')
ax.set_ylabel('theoretical')

The residuals appear to be approximately normally distributed, with perhaps some outliers in the right tail of the distribution.
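
As a cross-check of the hand-built q-q plot, `scipy.stats.probplot` computes the same kind of quantile pairs and also reports how well a straight line fits them. This is a minimal sketch on a synthetic sample standing in for the residuals (the sample itself is an assumption, not the notebook's data):

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3)

# Synthetic stand-in for the residuals (made-up, drawn from a normal distribution)
sample = rng.normal(loc=0.0, scale=3.0, size=300)

# probplot returns the theoretical and ordered sample quantiles,
# plus the slope, intercept and correlation of a line fitted through them
(theoretical_q, ordered_sample), (slope, intercept, r) = st.probplot(sample, dist='norm')

print(slope, intercept, r)
```

A correlation `r` close to 1 indicates that the points lie close to a straight line, which is what we expect for normally distributed residuals.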

Let's test now whether the variability of the residuals is nearly constant, by means of a plot of the residuals against the fitted values:

In [None]:
fig, ax = plt.subplots()
ax.scatter(np.dot(X, B), residuals)
ax.set_xlabel('fitted values')
ax.set_ylabel('residuals')

There seems to be non-linearity in the data. A linear multiple regression model may not be the best model for this data. 

The aim of these plots is to determine whether some structure still exists in the data after fitting the multiple regression model. We may need to adjust the model to try to account for the extra structure (for instance, the non-constant variability of the residuals). If we are not able to do so, we can still report the results, that is, the fitted linear model, as long as we also report its shortcomings.
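
To illustrate what "adjusting the model" can look like, here is a sketch on synthetic data with a curved relationship (the data, the helper `fit_residuals` and the choice of a squared term are assumptions for illustration, not a recommendation for this dataset). A straight-line fit leaves curvature in the residuals, and adding a squared predictor removes it:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300

# Synthetic data with a curved relationship (made-up, not the Auto MPG values)
x = rng.uniform(0, 10, size=n)
y = 40 - 6 * x + 0.4 * x**2 + rng.normal(size=n)

def fit_residuals(X, y):
    """Residuals of the least-squares fit of y on X."""
    B = np.linalg.lstsq(X, y, rcond=None)[0]
    return y - X @ B

ones = np.ones(n)
res_linear = fit_residuals(np.column_stack([ones, x]), y)
res_quadratic = fit_residuals(np.column_stack([ones, x, x**2]), y)

# Leftover curvature: correlation between the residuals and the centred squared predictor
curvature = (x - x.mean())**2
corr_linear = np.corrcoef(res_linear, curvature)[0, 1]
corr_quadratic = np.corrcoef(res_quadratic, curvature)[0, 1]
print(corr_linear, corr_quadratic)
```

A large correlation for the linear fit and a value near zero for the quadratic fit indicate that the extra term has absorbed the structure that was left in the residuals.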