# Multiple Linear Regression

In this notebook we will extend the simple linear regression model that we built in the last chapter to predict car prices. By including several independent variables, we aim to produce better predictions.

### Package and Data Loading

As before, we will import the required packages and our car price data set.

In [None]:
import pandas as pd
import matplotlib.pyplot as plot
import statsmodels.api as stats
import numpy as np

In [None]:
carprice_df = pd.read_csv('CarPrice_Assignment.csv')

### Assessing the Data

In [None]:
carprice_df.shape

In [None]:
carprice_df.head()

We can list every column and its data type using [df.info()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html), and count how many unique values each column contains using [df.nunique()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.nunique.html).

In [None]:
carprice_df.info()

Here we check the number of unique values in the categorical variables only, selecting them with [df.select_dtypes()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html).

In [None]:
carprice_df.select_dtypes(include='object').nunique()

As we can see, the data contains a mixture of numeric types and categorical (object) types. We will remove the car_ID field from the data as this is only an identifier. For the purposes of this lesson we will also remove CarName from the data as it contains a large number of unique values (How could we extract more useful information from this variable?).

In [None]:
carprice_df = carprice_df.drop(columns=['car_ID', 'CarName'])
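As one possible answer to the question about CarName, the sketch below pulls a brand out of each name. It assumes the names follow a "brand model" layout, uses a small made-up Series rather than the real column, and the `brand` feature is hypothetical.

```python
import pandas as pd

# Toy stand-in for the CarName column; the "brand model" layout is an assumption.
names = pd.Series(['audi 100 ls', 'bmw x3', 'toyota corolla', 'toyota corona'],
                  name='CarName')

# Treat the text before the first space as a hypothetical 'brand' feature.
brand = names.str.split(' ', n=1).str[0]

# The brand has fewer unique values than the full name.
print(names.nunique(), brand.nunique())
```

A feature like this could then be one-hot encoded alongside the other categorical variables.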

## Basic Multiple Regression Model

Before we build our full model using all of the data available in the dataset, we will first practice by fitting a straightforward model with four independent variables that we think might be relevant to the price. Fitting the model with statsmodels is very similar to what we did previously for the simple linear regression model. We will use enginesize from the simple model plus curbweight, peakrpm and citympg.

In [None]:
Y_basic = carprice_df.price
X_basic = stats.add_constant(carprice_df[['enginesize', 'curbweight', 'peakrpm', 'citympg']])

The only difference from the previous chapter is that we add the constant column to a dataframe of several independent variables rather than to a single independent variable. The fitting process is exactly the same.

In [None]:
model_basic = stats.OLS(Y_basic, X_basic)
results_basic = model_basic.fit()

We can see our results and the parameters for each of the independent variables using the .summary() method again.

In [None]:
print(results_basic.summary())

With these parameter values we can construct our model:

$\textrm{price}=116*\textrm{enginesize}+ 5.7*\textrm{curbweight}+2.7*\textrm{peakrpm}+10.8*\textrm{citympg}-30220$
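To see how the equation turns inputs into a price, the sketch below evaluates it by hand. The coefficients are the rounded values from the equation above; the car itself is invented purely for illustration.

```python
# Rounded coefficients copied from the model equation.
coefficients = {'enginesize': 116, 'curbweight': 5.7, 'peakrpm': 2.7, 'citympg': 10.8}
intercept = -30220

# Hypothetical car, made up for this example (not a row from the dataset).
car = {'enginesize': 130, 'curbweight': 2500, 'peakrpm': 5000, 'citympg': 25}

# Multiply each variable by its coefficient, sum the terms and add the intercept.
predicted_price = intercept + sum(coefficients[name] * car[name] for name in coefficients)
print(round(predicted_price))
```

In practice the same calculation is done for us by the fitted results object, using the unrounded parameters.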

## Full Multiple Regression Model

We can now look at building our final model using the full range of features available to us. Before we build the model we need to prepare the data and reduce the number of independent variables we have.

We can look at the correlations between different numerical variables in a handy way using a correlation matrix - this allows us to see the correlation between all pairs of variables at once. We can then remove some of the independent variables that are highly correlated and would cause problems with the algorithm due to multicollinearity. We can create this correlation matrix using the [df.corr()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) method. We add a red/blue heatmap to better see where the extreme correlations are.

In [None]:
carprice_df.select_dtypes(exclude='object').corr().style.background_gradient(cmap='coolwarm')
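Reading a large heatmap by eye can be error-prone, so it may help to list the strongly correlated pairs directly. The sketch below does this on a small made-up dataframe; the 0.9 cut-off and the column names `a`, `b` and `c` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: 'b' is almost a copy of 'a', while 'c' is only weakly related.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [1.1, 2.0, 2.9, 4.2, 5.0],
    'c': [3.0, 1.0, 4.0, 1.0, 5.0],
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (variable, variable) pairs and keep those above the cut-off.
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

Each pair in the result is a candidate for dropping one of its two variables.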

We can see that highwaympg and citympg are highly correlated, with a correlation coefficient of 0.97. Removing highwaympg will get rid of this correlation and help reduce the complexity of our model. We also choose to remove carlength and carwidth to remove some more high correlations.

In [None]:
carprice_df = carprice_df.drop(columns=['carlength', 'carwidth', 'highwaympg'])

#### One Hot Encoding

We can also use the categorical data by one-hot encoding it. This is where we make each category in a categorical variable its own independent variable with a binary 1/0 value. We use the function [pd.get_dummies()](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) to do this for the categorical variables and then join the result back to the numerical variables using [pd.concat()](https://pandas.pydata.org/docs/reference/api/pandas.concat.html). We drop the first category of each variable to prevent multicollinearity: its value is already implied when all of the other categories are 0.

In [None]:
# dtype=int gives 1/0 columns (pandas 2.0+ would otherwise return True/False)
dummy = pd.get_dummies(carprice_df.select_dtypes(include='object'), drop_first=True, dtype=int)

In [None]:
carprice_df = pd.concat([carprice_df.select_dtypes(exclude='object'), dummy], axis=1)
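The effect of one-hot encoding with a dropped first category is easiest to see on a tiny example. The dataframe below is made up for illustration; only the column naming pattern mirrors the real data.

```python
import pandas as pd

# Toy categorical data (three made-up cars).
toy = pd.DataFrame({'fueltype': ['gas', 'diesel', 'gas'],
                    'drivewheel': ['fwd', 'rwd', '4wd']})

# drop_first=True removes the first category (in sorted order) of each variable;
# dtype=int returns 1/0 values rather than True/False.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

The dropped categories (diesel and 4wd here) become the baseline: a row belongs to the baseline when all of the remaining columns for that variable are 0.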

We can repeat the above process of removing highly correlated variables, now including the one-hot-encoded features.

In [None]:
carprice_df.corr().style.background_gradient(cmap='coolwarm')

In [None]:
carprice_df = carprice_df.drop(columns=['compressionratio', 'drivewheel_fwd', 'enginetype_rotor', 'fuelsystem_4bbl', 'fuelsystem_idi'])

In [None]:
carprice_df.shape

We now have 35 independent variables (plus our target variable price) to use in our regression model.

### Test/Train Split

As in the previous notebook, we will split our data with 70% into the training set and the remaining 30% into the test set. 

In [None]:
train_df = carprice_df.sample(frac=0.7, random_state=99)  # random_state is a seed value
test_df = carprice_df.drop(train_df.index)

In [None]:
train_df.shape

In [None]:
test_df.shape
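As a quick sanity check on this way of splitting, the sketch below applies the same sample-then-drop pattern to a ten-row toy dataframe and confirms that the two sets share no rows. The dataframe and variable names are made up for illustration.

```python
import pandas as pd

# Toy dataframe with ten rows.
toy = pd.DataFrame({'x': range(10)})

# Sample 70% of the rows for training, then drop those rows to get the test set.
train = toy.sample(frac=0.7, random_state=99)
test = toy.drop(train.index)

print(len(train), len(test))
# No index appears in both sets.
print(set(train.index).isdisjoint(test.index))
```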

### Fitting the Linear Regression Model

We once again use statsmodels to fit our linear regression model. We do this in the same way as the previous notebook except now our X_train contains all of our independent variables (plus the constant column).

In [None]:
Y_train = train_df.price
X_train = stats.add_constant(train_df.drop(columns=['price']))

In [None]:
model_carprice = stats.OLS(Y_train, X_train)
results_carprice = model_carprice.fit()

In [None]:
print(results_carprice.summary())

For example, here we can see that enginelocation_rear has a coefficient of 7389, so with everything else held constant we predict that a car with the engine in the rear will cost on average \$7389 more than a car with the engine in the front. We also predict that each additional pound of curbweight adds \$3.3 to the price. In this state we cannot compare the coefficients to one another as they all have different units - it makes no sense to compare pounds in curbweight with rpm in peakrpm!
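One common way to put coefficients on a comparable footing is to standardize each independent variable before fitting. The sketch below is not part of this notebook's analysis: it uses synthetic data and NumPy's least-squares solver, and the names `weight`, `rpm` and `zscore` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for two variables measured in different units.
weight = rng.normal(2500, 500, n)   # pound-like scale
rpm = rng.normal(5000, 400, n)      # rpm-like scale
price = 3.0 * weight + 2.0 * rpm + rng.normal(0, 100, n)

def zscore(values):
    """Rescale to mean 0 and standard deviation 1."""
    return (values - values.mean()) / values.std()

# Design matrix: constant column plus the standardized variables.
X = np.column_stack([np.ones(n), zscore(weight), zscore(rpm)])
coefs, *_ = np.linalg.lstsq(X, price, rcond=None)

# Each slope now means "change in price per one standard deviation".
print(coefs[1:])
```

After standardizing, both slopes share the same unit, so their sizes can be compared directly.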

Our sum of square residuals is then:

In [None]:
print('The sum of square residuals is {:.1f}'.format(results_carprice.ssr))

The sum of square residuals is 579728377.8


In our simple linear regression in the last notebook we found a value of 2.3 billion for the sum of square residuals, whereas here our value is about 0.58 billion. As we are training our model on the same number of datapoints, we can see that the multiple linear regression produces a smaller total error on the training data.
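For reference, the sum of square residuals is the sum of the squared differences between observed and predicted values. The sketch below computes it by hand on three made-up numbers.

```python
import numpy as np

# Made-up observed and predicted values.
observed = np.array([10.0, 12.0, 15.0])
predicted = np.array([11.0, 12.0, 13.0])

# Residual = observed - predicted; square each one and add them up.
residuals = observed - predicted
ssr = np.sum(residuals ** 2)
print(ssr)
```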

We can also use our test set to compare our predictions with the observed values.

In [None]:
Y_test = test_df.price
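# has_constant='add' always adds the constant column, even if a test-set column happens to be constant
test_df = stats.add_constant(test_df, has_constant='add')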
X_test = test_df[X_train.columns]

In [None]:
test_predictions = results_carprice.predict(X_test)

In [None]:
plot.scatter(test_predictions, Y_test)
plot.plot([5000, 50000], [5000, 50000], c='k', ls='--')
plot.xlabel('Predicted Price [$]')
plot.ylabel('Observed Price [$]')
plot.show()

Plotting our predicted prices against the observed values again, we can see that the points lie much closer to the diagonal line than they did for the simple linear regression model. We will explore the coefficients that we calculate in our model and the metrics used to evaluate them more in the next chapter.

## Scikit-Learn

Again, we can repeat this exercise using Scikit-Learn.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

### Test/Train Split

When we have multiple independent variables, we do not need to reshape our X array as it is already a 2D array. We can therefore insert our training independent variables straight into the fitting method.

In [None]:
Y = carprice_df.price
X = carprice_df.drop(columns=['price'])
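The difference in array shape can be checked directly. The sketch below uses a made-up two-row dataframe to show why a single column needs reshaping while several columns do not.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({'enginesize': [100, 150],
                    'curbweight': [2000, 3000],
                    'price': [10000.0, 20000.0]})

single = toy['enginesize'].values            # one variable: 1D array
single_2d = single.reshape(-1, 1)            # reshaped into a single 2D column
multi = toy.drop(columns=['price']).values   # several variables: already 2D

print(single.shape, single_2d.shape, multi.shape)
```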

In [None]:
sk_X_train, sk_X_test, sk_Y_train, sk_Y_test = train_test_split(X, Y, test_size=0.3, random_state=99)

In [None]:
regressor = LinearRegression()  
regressor.fit(sk_X_train, sk_Y_train)

In [None]:
sk_intercept_carprice = regressor.intercept_
sk_coeffs_carprice = regressor.coef_
sk_ssr_carprice = np.sum((sk_Y_train-regressor.predict(sk_X_train))**2)

Unlike the statsmodels version, the LinearRegression class does not supply a convenient summary of the best-fit coefficients. However, the coefficients are returned in the same order as the columns of our X array, so we can combine the column names and coefficient values in a pandas Series to make them easier to read.

In [None]:
pd.Series(sk_coeffs_carprice, index=sk_X_train.columns)

In [None]:
print('The intercept value is {:.1f}'.format(sk_intercept_carprice))
print('The sum of square residuals is {:.1f}'.format(sk_ssr_carprice))