# Multiple Linear Regression

As one progresses from Simple Linear Regression to Multiple Linear Regression, a few common questions arise.

This walkthrough tackles two of the most frequently asked questions about Multiple Linear Regression.

**Question 1:** How do I use my multiple linear regression model to make a single prediction, for example, the profit of a startup with R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = California?

**Question 2:** How do I get the final regression equation $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$ with the final values of the coefficients?

Here's the step-by-step coding exercise:

* Import libraries
* Import datasets
* Encode categorical data
* Split data into Training and Test sets
* Train Multiple Linear Regression model
* Predict Test set results
* Make single prediction
* Get final linear regression equation with the values of the coefficients

## Dataset

### Layout

* Columns: 5
	* R&D spend
	* Administration cost
	* Marketing spend
	* State
	* Profit
* Rows: 50 observations
	* Each row represents one company
		* Features:
			* R&D spend
			* Administration cost
			* Marketing spend
			* State
		* Dependent variable:
			* Profit

### Background

* Each company provided extracts from its profit and loss statement, covering all features and the overall profit for a given year in a particular state
* Data is completely anonymized

### Wants to Understand

**Venture Capitalist (VC)**

* Which states companies perform best in
* All else being equal, whether a company that spends more on marketing, for example, performs better
* When assessing companies to invest in, whether R&D spend or marketing spend yields more profit

### Goals

* VC wants to build a multiple linear regression model that predicts profit, in order to determine which companies are most profitable to invest in

## No Feature Scaling

* Feature scaling is not needed here: in the multiple linear regression equation each feature is multiplied by its own coefficient, which is fitted to that feature's units
* If one feature takes much larger values than the others, its coefficient shrinks to compensate, so the fitted predictions are the same with or without scaling
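
The compensation described above can be checked numerically. The following is a minimal sketch on synthetic data (the feature values and true coefficients are made up for illustration), using ordinary least squares via `np.linalg.lstsq` rather than the `LinearRegression` object used later. The helper name `fit_ols` is hypothetical. Dividing one feature by 1000 multiplies its coefficient by 1000 and leaves the predictions unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): two features on very different scales.
n = 40
big = rng.uniform(0, 500_000, n)     # e.g. a spend measured in dollars
small = rng.uniform(0, 10, n)        # e.g. a small-valued feature
y = 50_000 + 0.8 * big + 3_000 * small + rng.normal(0, 1_000, n)

def fit_ols(features, target):
    """Ordinary least squares with an intercept; returns (intercept, coefs)."""
    design = np.column_stack([np.ones(len(target)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta[0], beta[1:]

X_raw = np.column_stack([big, small])
X_rescaled = np.column_stack([big / 1000, small])   # first feature in thousands

b0_raw, b_raw = fit_ols(X_raw, y)
b0_res, b_res = fit_ols(X_rescaled, y)

pred_raw = b0_raw + X_raw @ b_raw
pred_res = b0_res + X_rescaled @ b_res

print("coefficient on raw feature:     ", b_raw[0])
print("coefficient on rescaled feature:", b_res[0])
print("predictions identical:", np.allclose(pred_raw, pred_res))
```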

## No Need to Check Assumptions of Linear Regression

* For this exercise, the assumptions of linear regression do not need to be checked beforehand
* With a new dataset, one typically experiments with several types of models to see which performs best on held-out data; multiple linear regression can still be attempted even if the relationships are not linear, in which case it will simply score worse (for example, a lower R² on the test set) than other models
* One would then select the better-performing model, so for pure prediction an upfront assumption check is optional
* If the relationships in the dataset are approximately linear, multiple linear regression will tend to score well; the model itself never checks the assumptions, and they still matter when the coefficients are to be interpreted
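
The "try it and compare" workflow described above can be sketched as follows. This is a minimal illustration on synthetic, deliberately non-linear data (not the startup dataset), with `np.polyfit` standing in for the candidate models and a hypothetical helper `r_squared` as the score. The straight-line fit is still allowed to run; it just scores poorly on the held-out data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, deliberately non-linear data (illustrative only).
x = rng.uniform(-3, 3, 200)
y = x ** 2 + rng.normal(0, 0.3, 200)

# Hold out the last 20% of the samples as a test set.
split = int(0.8 * len(x))
x_train, x_test = x[:split], x[split:]
y_train, y_test = y[:split], y[split:]

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Candidate 1: straight line. Candidate 2: quadratic.
line = np.polyfit(x_train, y_train, deg=1)
quad = np.polyfit(x_train, y_train, deg=2)

r2_line = r_squared(y_test, np.polyval(line, x_test))
r2_quad = r_squared(y_test, np.polyval(quad, x_test))

print(f"linear fit test R^2:    {r2_line:.3f}")
print(f"quadratic fit test R^2: {r2_quad:.3f}")
```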

![](Multiple_Linear_Regression_No_Feature_Scaling.png)

## No Accounting for Dummy Variable Trap

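* For prediction, one does not need to drop a dummy column to avoid the dummy variable trap with the state independent variable
* The `sklearn` class `LinearRegression` fits with a least-squares solver that tolerates the redundant dummy column, so predictions are unaffected; no column is removed automatically, and the individual state coefficients are only meaningful relative to one another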
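
To see why keeping all three state dummies is harmless for prediction, here is a minimal sketch on synthetic data (illustrative values only), using `np.linalg.lstsq` in place of `LinearRegression`. With an intercept, the three dummy columns are perfectly collinear, yet the least-squares predictions match those from a fit with one dummy dropped.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 3 states, one numeric feature.
n = 30
state = np.arange(n) % 3                      # 0, 1, 2, 0, 1, 2, ...
dummies = np.eye(3)[state]                    # one-hot columns, shape (n, 3)
spend = rng.uniform(0, 100, n)
y = 10 + np.array([0.0, 5.0, -3.0])[state] + 2 * spend + rng.normal(0, 1, n)

ones = np.ones((n, 1))
full = np.hstack([ones, dummies, spend[:, None]])            # all 3 dummies
dropped = np.hstack([ones, dummies[:, 1:], spend[:, None]])  # first dummy dropped

# lstsq returns the minimum-norm solution when the design is rank deficient.
beta_full, *_ = np.linalg.lstsq(full, y, rcond=None)
beta_drop, *_ = np.linalg.lstsq(dropped, y, rcond=None)

print("columns / rank (all dummies):", full.shape[1], np.linalg.matrix_rank(full))
print("columns / rank (one dropped):", dropped.shape[1], np.linalg.matrix_rank(dropped))
print("same predictions:", np.allclose(full @ beta_full, dropped @ beta_drop))
```

In the collinear fit the individual dummy coefficients are not uniquely determined, so only their differences carry meaning.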

## Applying Model Building Methods

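* For this exercise, one does not need to apply model building methods, such as Backward Elimination
* The `sklearn` class `LinearRegression` fits a coefficient for every feature supplied; it does not compute P-values or select features, so a feature with little predictive value typically just receives a coefficient that contributes little to the prediction
* If feature selection is wanted, it has to be done as a separate step, keeping the features with the lowest P-values (the most statistically significant ones)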
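
For readers who do want to inspect feature significance, the statistics that Backward Elimination relies on can be computed directly. The following is a minimal sketch on synthetic data (illustrative values; the helper name `ols_t_statistics` is hypothetical), using the standard ordinary least squares formulas with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): x1 drives y, x2 is pure noise.
n = 50
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 3.0 + 2.0 * x1 + rng.normal(0, 1, n)

def ols_t_statistics(features, target):
    """Return OLS coefficients (intercept first) and their t-statistics."""
    design = np.column_stack([np.ones(len(target)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ beta
    dof = design.shape[0] - design.shape[1]          # n - number of parameters
    sigma2 = residuals @ residuals / dof             # residual variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)  # covariance of the estimates
    return beta, beta / np.sqrt(np.diag(cov))

beta, t_stats = ols_t_statistics(np.column_stack([x1, x2]), y)
for name, b, t in zip(["intercept", "x1", "x2"], beta, t_stats):
    print(f"{name:>9}: coef = {b:7.3f}, t = {t:7.2f}")
```

A large absolute t-statistic corresponds to a small P-value; Backward Elimination repeatedly drops the feature with the largest P-value and refits.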

## Visualizing Test Set Results

* Unlike simple linear regression, multiple linear regression results cannot be drawn as a single 2D plot, because there are several features rather than one to place on the x-axis
* Solution is to display 2 vectors:
    1. Vector of the predicted dependent variable values (profit) from the test set
    2. Vector of the actual dependent variable values (profit) from the test set
* Then evaluate the model by comparing the two vectors to see whether the predicted and actual values (profit) are close

## Import Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Import Dataset

In [2]:
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [3]:
print(*X[:25], sep='\n')

[165349.2 136897.8 471784.1 'New York']
[162597.7 151377.59 443898.53 'California']
[153441.51 101145.55 407934.54 'Florida']
[144372.41 118671.85 383199.62 'New York']
[142107.34 91391.77 366168.42 'Florida']
[131876.9 99814.71 362861.36 'New York']
[134615.46 147198.87 127716.82 'California']
[130298.13 145530.06 323876.68 'Florida']
[120542.52 148718.95 311613.29 'New York']
[123334.88 108679.17 304981.62 'California']
[101913.08 110594.11 229160.95 'Florida']
[100671.96 91790.61 249744.55 'California']
[93863.75 127320.38 249839.44 'Florida']
[91992.39 135495.07 252664.93 'California']
[119943.24 156547.42 256512.92 'Florida']
[114523.61 122616.84 261776.23 'New York']
[78013.11 121597.55 264346.06 'California']
[94657.16 145077.58 282574.31 'New York']
[91749.16 114175.79 294919.57 'Florida']
[86419.7 153514.11 0.0 'New York']
[76253.86 113867.3 298664.47 'California']
[78389.47 153773.43 299737.29 'New York']
[73994.56 122782.75 303319.26 'Florida']
[67532.53 105751.03 304768.73 

In [4]:
print(*y[:25], sep='\n')

192261.83
191792.06
191050.39
182901.99
166187.94
156991.12
156122.51
155752.6
152211.77
149759.96
146121.95
144259.4
141585.52
134307.35
132602.65
129917.04
126992.93
125370.37
124266.9
122776.86
118474.03
111313.02
110352.25
108733.99
108552.04


## Encode Categorical Data

* Apply one-hot encoding (one 0/1 indicator column per category) to the state column, at index 3; the remaining columns are left untouched because `remainder` is set to `'passthrough'`
* Transform the state column using the `ColumnTransformer` class and convert the result into a NumPy array

In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [6]:
print(*X[:25], sep='\n')

[0.0 0.0 1.0 165349.2 136897.8 471784.1]
[1.0 0.0 0.0 162597.7 151377.59 443898.53]
[0.0 1.0 0.0 153441.51 101145.55 407934.54]
[0.0 0.0 1.0 144372.41 118671.85 383199.62]
[0.0 1.0 0.0 142107.34 91391.77 366168.42]
[0.0 0.0 1.0 131876.9 99814.71 362861.36]
[1.0 0.0 0.0 134615.46 147198.87 127716.82]
[0.0 1.0 0.0 130298.13 145530.06 323876.68]
[0.0 0.0 1.0 120542.52 148718.95 311613.29]
[1.0 0.0 0.0 123334.88 108679.17 304981.62]
[0.0 1.0 0.0 101913.08 110594.11 229160.95]
[1.0 0.0 0.0 100671.96 91790.61 249744.55]
[0.0 1.0 0.0 93863.75 127320.38 249839.44]
[1.0 0.0 0.0 91992.39 135495.07 252664.93]
[0.0 1.0 0.0 119943.24 156547.42 256512.92]
[0.0 0.0 1.0 114523.61 122616.84 261776.23]
[1.0 0.0 0.0 78013.11 121597.55 264346.06]
[0.0 0.0 1.0 94657.16 145077.58 282574.31]
[0.0 1.0 0.0 91749.16 114175.79 294919.57]
[0.0 0.0 1.0 86419.7 153514.11 0.0]
[1.0 0.0 0.0 76253.86 113867.3 298664.47]
[0.0 0.0 1.0 78389.47 153773.43 299737.29]
[0.0 1.0 0.0 73994.56 122782.75 303319.26]
[0.0 1.0 0.0 

## Split Dataset into Training Set and Test Set

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Train Multiple Linear Regression Model on Training Set

In [8]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

## Predict Test Set Results

In [9]:
y_pred = regressor.predict(X_test)

* Display numeric values with a precision of 2 decimal places
* Display the predicted and actual values (profit) side by side by concatenating the two vectors
* The `concatenate` function joins multiple arrays or vectors:
    * 1st argument
        * Expects a tuple of arrays/vectors to concatenate
        * The arrays must have the same shape, except along the axis being joined
        * Each 1D vector is first turned into a column with `reshape`, using `len` for the number of rows and a hard-coded 1 for the number of columns
    * 2nd argument
        * Expects the axis (0 = stack vertically, adding rows, or 1 = join horizontally, adding columns)
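
The effect of `reshape` and of the axis argument is easiest to see on a tiny example. The sketch below uses made-up values in place of `y_pred` and `y_test`.

```python
import numpy as np

predicted = np.array([10.0, 20.0, 30.0])   # illustrative values
actual = np.array([11.0, 19.0, 33.0])

# Reshape each 1D vector (shape (3,)) into a column (shape (3, 1)).
predicted_col = predicted.reshape(len(predicted), 1)
actual_col = actual.reshape(len(actual), 1)

# axis=1 joins the columns side by side; axis=0 would stack them vertically.
side_by_side = np.concatenate((predicted_col, actual_col), axis=1)
stacked = np.concatenate((predicted_col, actual_col), axis=0)

print(side_by_side.shape)   # (3, 2)
print(stacked.shape)        # (6, 1)
```

`np.column_stack((predicted, actual))` produces the same side-by-side array without the explicit reshapes.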

In [10]:
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
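
Comparing the two columns by eye works for ten rows, but a summary number is easier to track. As a minimal sketch, the values below are copied from the printed output above (so they carry its two-decimal rounding), and the mean absolute error and R² are computed by hand with NumPy. These two metrics are a choice made here for illustration and are not part of the original walkthrough.

```python
import numpy as np

# Values copied from the printed output above: (predicted, actual) per row.
pairs = np.array([
    [103015.20, 103282.38],
    [132582.28, 144259.40],
    [132447.74, 146121.95],
    [ 71976.10,  77798.83],
    [178537.48, 191050.39],
    [116161.24, 105008.31],
    [ 67851.69,  81229.06],
    [ 98791.73,  97483.56],
    [113969.44, 110352.25],
    [167921.07, 166187.94],
])
predicted, actual = pairs[:, 0], pairs[:, 1]

errors = actual - predicted
mae = np.mean(np.abs(errors))                    # mean absolute error
ss_res = np.sum(errors ** 2)                     # unexplained variation
ss_tot = np.sum((actual - actual.mean()) ** 2)   # total variation
r2 = 1 - ss_res / ss_tot                         # coefficient of determination

print(f"MAE: {mae:,.2f}")
print(f"R^2: {r2:.4f}")
```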


## Make Single Prediction

### Example: The profit of a startup with R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = California

In [11]:
print(regressor.predict([[1, 0, 0, 160000, 130000, 300000]]))

[181566.92]


Therefore, our model predicts that the profit of a Californian startup which spent 160000 in R&D, 130000 in Administration, and 300000 in Marketing is $181,566.92.

**Important Note 1:** Notice that the values of the features were all input in a double pair of square brackets. That's because the `predict` method always expects a 2D array as the format of its inputs. And putting our values into a double pair of square brackets makes the input exactly a 2D array.

Simply put:

$1, 0, 0, 160000, 130000, 300000 \rightarrow \textrm{scalars}$

$[1, 0, 0, 160000, 130000, 300000] \rightarrow \textrm{1D array}$

$[[1, 0, 0, 160000, 130000, 300000]] \rightarrow \textrm{2D array}$
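
The same distinction can be checked through the `shape` of a NumPy array. This is a small sketch; `reshape(1, -1)` is shown as one common way to turn a single 1D row into the 2D form.

```python
import numpy as np

row_1d = np.array([1, 0, 0, 160000, 130000, 300000])
row_2d = np.array([[1, 0, 0, 160000, 130000, 300000]])

print(row_1d.shape)                  # (6,)   -> 1D: six values, no row axis
print(row_2d.shape)                  # (1, 6) -> 2D: one row, six features
print(row_1d.reshape(1, -1).shape)   # (1, 6) -> same 2D shape built from the 1D array
```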

**Important Note 2:** Notice also that the "California" state was not input as a string in the last column but as "1, 0, 0" in the first three columns. That's because the `predict` method expects the one-hot-encoded values of the state, and as we see in the second row of the matrix of features `X`, "California" was encoded as "1, 0, 0" (the encoder orders the categories alphabetically: California, Florida, New York). Be careful to include these values in the first three columns, not the last three, because `ColumnTransformer` places the output of its listed transformers (here the dummy variables) first and the `passthrough` columns after them.
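
Typing the dummy values by hand is easy to get wrong. One alternative, sketched below, is to pass the raw row through the fitted `ColumnTransformer` and let it produce the encoded columns. To stay self-contained, the sketch fits on only the first ten rows of `X` and `y` as printed earlier, so its predicted profit will differ from the value obtained with the full training set.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# First ten rows of the dataset, copied from the printed output above.
X_raw = np.array([
    [165349.2, 136897.8, 471784.1, 'New York'],
    [162597.7, 151377.59, 443898.53, 'California'],
    [153441.51, 101145.55, 407934.54, 'Florida'],
    [144372.41, 118671.85, 383199.62, 'New York'],
    [142107.34, 91391.77, 366168.42, 'Florida'],
    [131876.9, 99814.71, 362861.36, 'New York'],
    [134615.46, 147198.87, 127716.82, 'California'],
    [130298.13, 145530.06, 323876.68, 'Florida'],
    [120542.52, 148718.95, 311613.29, 'New York'],
    [123334.88, 108679.17, 304981.62, 'California'],
], dtype=object)
y = np.array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94,
              156991.12, 156122.51, 155752.6, 152211.77, 149759.96])

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [3])],
    remainder='passthrough',
)
X = np.array(ct.fit_transform(X_raw), dtype=float)
regressor = LinearRegression().fit(X, y)

# The category order decides which dummy column stands for which state.
print(ct.named_transformers_['encoder'].categories_)

# Encode a new raw row with the already-fitted transformer, then predict.
new_raw = np.array([[160000.0, 130000.0, 300000.0, 'California']], dtype=object)
new_encoded = np.array(ct.transform(new_raw), dtype=float)
print(new_encoded)
print(regressor.predict(new_encoded))
```

Using `ct.transform` rather than `ct.fit_transform` on the new row matters: the encoder must reuse the categories it learned from the training data.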

## Get Final Linear Regression Equation with Values of Coefficients

In [12]:
np.set_printoptions(precision=4, suppress=True)
print(np.array(regressor.coef_))
print(regressor.intercept_)

[  86.6384 -872.6458  786.0074    0.7735    0.0329    0.0366]
42467.52924857642


Therefore, the equation of our multiple linear regression model is:

$$\textrm{Profit} = 86.6 \times \textrm{Dummy State 1} - 873 \times \textrm{Dummy State 2} + 786 \times \textrm{Dummy State 3} + 0.773 \times \textrm{R and D Spend} + 0.0329 \times \textrm{Administration} + 0.0366 \times \textrm{Marketing Spend} + 42467.53$$

**Important Note:** To get these coefficients, we read the `coef_` and `intercept_` attributes of our regressor object. Unlike methods, attributes in Python are accessed without parentheses and simply hold a value or an array of values; in `sklearn`, a trailing underscore marks an attribute that is set when the model is fitted.
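
As a check on the equation, the single prediction from earlier can be reproduced by hand. The sketch below plugs the printed coefficients and intercept into the equation for the California startup. Because the printed coefficients are rounded to four decimals, the result lands within a few dollars of the 181566.92 returned by `predict` rather than matching it exactly.

```python
import numpy as np

# Coefficients and intercept copied from the printed output above
# (coefficients are rounded to 4 decimals, so the result is approximate).
coefficients = np.array([86.6384, -872.6458, 786.0074, 0.7735, 0.0329, 0.0366])
intercept = 42467.52924857642

# California (1, 0, 0), R&D = 160000, Administration = 130000, Marketing = 300000
features = np.array([1, 0, 0, 160000, 130000, 300000])

# Profit = b_0 + b_1 * x_1 + ... + b_6 * x_6
profit = intercept + coefficients @ features
print(round(float(profit), 2))
```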