# Multiple Linear Regression

### Assumptions of Linear regression

Linear regression makes the following assumptions:
1. Linearity
2. Homoscedasticity
3. Multivariate normality
4. Independence of errors
5. Lack of multicollinearity

Always check these assumptions before building a linear regression model.

### Intuition of Multiple Linear Regression : Dummy Variables

**Dummy Variables**

There are two types of variables: categorical variables (eg. Country: Germany, USA, UK) and numerical variables (eg. Profit: 100, 200).

The regression equation can only work with numerical values, so categorical variables have to be converted to numbers. This is done by introducing dummy variables, as explained in the encoding step.

**Dummy variable Trap**

It is not a good idea to include all the dummy columns in the model. For example, if a categorical variable has two country values, then after encoding, one dummy column alone already defines the state of the other column (a value of 1 means the first country, and a value of 0 means the second country).

In a linear regression, consider a case with two dummy variables, D1 & D2.

*D2 = 1 - D1*

If we introduce both variables into the model, it cannot distinguish the effect of one from the effect of the other. This is called the dummy variable trap. Because of this, a categorical variable with two values gets only one dummy variable in the model; more generally, a categorical variable with k values gets k - 1 dummy variables.

No bias is added to the system by including only one variable in this case. The constant b0 in the multiple linear regression equation absorbs the effect of the omitted category, which acts as the baseline.

**As a general rule: if the constant term b0 is present in the multiple linear regression equation, always omit one dummy variable from each set of dummy variables.**
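The relation D2 = 1 - D1 can be checked numerically. The sketch below uses a small made-up two-country column (an assumption for illustration, not the tutorial's dataset) and shows that a design matrix holding the constant column and both dummies has fewer independent columns than it has columns.

```python
import numpy as np

# Hypothetical two-country column, already encoded as 0/1 dummies.
d1 = np.array([1, 0, 1, 0, 1, 0])   # 1 = first country
d2 = 1 - d1                         # 1 = second country, fully determined by d1
ones = np.ones_like(d1)             # the column that multiplies b0

X_trap = np.column_stack([ones, d1, d2])   # constant + both dummies
X_ok = np.column_stack([ones, d1])         # constant + one dummy

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print("with both dummies:", rank_trap, "independent columns out of", X_trap.shape[1])
print("with one dummy:   ", rank_ok, "independent columns out of", X_ok.shape[1])
```

When the rank is smaller than the number of columns, the coefficients are not uniquely determined, which is exactly the trap described above.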




### P-Values & Statistical Significance

Intuition for statistical significance:
Consider the case of tossing a coin. We have two hypotheses here.

1. A null hypothesis (H_0): in one universe, the coin is fair (heads and tails are equally likely).
2. An alternative hypothesis (H_A): in the other universe, the coin is not fair (one side comes up more often than the other).

If we live in the universe where the null hypothesis is true, we can run the experiment and compute how probable each observed result is.

For example, under the null hypothesis:

+ Probability of a tail on the first toss: 0.5
+ Probability of tails on the first two tosses: 0.25
+ Probability of tails on the first three tosses: 0.125
+ Probability of tails on the first four tosses: 0.0625

(Four consecutive tails has a probability of only 0.0625. If the null hypothesis is true and we get a tail again on the fifth toss, the probability drops to about 3%, which is low enough that we have to question our assumption that the coin is fair.

So if we are still getting tails on the seventh consecutive toss (probability below 1%), it is very likely that the null hypothesis is wrong and the coin is not fair. This matches our natural intuition.)

In this scenario the P-value (probability value), here the probability of observing such a run of tails if the coin is fair, shrinks with every additional tail.

Now consider the alternative hypothesis H_A with a coin that has tails on both faces. Every toss gives a tail, so under H_A the probability of the observed run is 100% no matter how long it is.


This is where statistical significance comes in. We assume the null hypothesis is true and set a threshold, the significance level, of 0.05 (5%). If the P-value of the observed result falls below this threshold, the result is too unlikely under the null hypothesis, so we question its validity and reject it.
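The coin example can be reproduced in a few lines. This is a minimal sketch that assumes the 5% threshold mentioned above and only looks at runs of consecutive tails.

```python
# Probability of n tails in a row under H_0 (fair coin) is 0.5 ** n.
alpha = 0.05  # significance level

probs = {n: 0.5 ** n for n in range(1, 8)}
for n, p in probs.items():
    decision = "reject H_0" if p < alpha else "keep H_0"
    print(f"{n} tails in a row: probability = {p:.4f} -> {decision}")

# Shortest run of tails whose probability falls below the threshold.
first_reject = min(n for n, p in probs.items() if p < alpha)
print("first run length below the threshold:", first_reject)
```

With a 5% threshold, the fourth tail (0.0625) is not yet enough to reject the null hypothesis, while the fifth (0.03125) is.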


### Building a Model Step by Step

Why should we filter the independent variables when designing the model?

1. Garbage in, garbage out: if we feed the model too many variables, including unnecessary ones, we may end up with a garbage model that gives bad results. So we should keep only the significant independent variables in the model.

2. The model should be explainable: we need to be able to explain mathematically how each variable affects the dependent variable, which becomes hard with too many variables.

So how do we determine which variables to eliminate from the model?

**5 Methods of Building models**
1. All-in
2. Backward Elimination
3. Forward selection
4. Bidirectional Elimination
5. Score Comparison

(Methods 2, 3 and 4 are sometimes grouped together as stepwise regression.)


1. **All In**: Use all the variables. This is done when:
    + You have prior knowledge that all the variables are relevant predictors.
    + You have to (someone gives you all the variables and asks you to build the model from them).
    + A framework you work within advises you to use all the variables.
    + You are preparing for the Backward Elimination step.
    

2. **Backward Elimination**: Eliminate one variable at each step based on a significance level (SL)
    + Step 1: Select a significance level to stay in the model (eg: SL = 0.05, i.e. 5%)
    + Step 2: Fit the full model with all possible predictors
    + Step 3: Consider the predictor with the **highest** P-value.
        If P > SL, go to step 4;
        otherwise go to FIN
    + Step 4: Remove the predictor
    + Step 5: Fit the model without this predictor
    + Step 6: Go back to step 3 and check the predictor with the highest P-value again.

    FIN: The model is finished when the highest P-value is no longer greater than the SL.

3. **Forward Selection**:
    + Step 1: Select a significance level to enter the model (eg: SL = 0.05)
    + Step 2: Fit all simple regression models, **y ~ Xn**, and select the one variable with the lowest P-value
    + Step 3: Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have
    + Step 4: Consider the predictor with the **lowest** P-value. If **P < SL**, go to step 3, otherwise go to FIN
    + FIN: When P is no longer less than SL, keep the **previous** model.
    
    
4. **Bidirectional Elimination**
    + Step 1: Select a significance level to enter and a significance level to stay in the model (eg: SLEnter = 0.05, SLStay = 0.05)
    + Step 2: Perform the next step of Forward Selection (a new variable must have P < SLEnter to enter)
    + Step 3: Perform ALL the steps of Backward Elimination (old variables must have P < SLStay to stay)
    + Step 4: Repeat steps 2 and 3 until no new variable can enter and no old variable can exit
    + FIN: The model is ready.
    
    
5. **All Possible Models**: Score comparison

    + Step 1: Select a criterion of goodness of fit (eg: Akaike information criterion)
    + Step 2: Construct all possible regression models: 2^N - 1 combinations in total
    + Step 3: Select the one with the best criterion
    + FIN: The model is ready

    Example: with 10 columns there are 1023 possible models, so this is a very resource-consuming approach.
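The count of 2^N - 1 models can be verified by enumerating the candidate subsets. The column names below are placeholders (an assumption for illustration); each non-empty subset stands for one model that would have to be fitted and scored.

```python
from itertools import combinations

columns = [f"x{i}" for i in range(1, 11)]   # 10 hypothetical predictor names

# Every non-empty subset of the columns is one candidate model.
subsets = [
    subset
    for size in range(1, len(columns) + 1)
    for subset in combinations(columns, size)
]

print(len(subsets))                 # number of models to fit
print(2 ** len(columns) - 1)        # the closed-form count
```

The number doubles with every extra column, which is why this method is rarely practical beyond a handful of predictors.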

In this tutorial, we have used the Backward Elimination approach (because it is faster)
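The Backward Elimination steps listed above can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes NumPy and SciPy are available, computes the P-values from an ordinary least-squares fit with a constant term, and runs on synthetic data in which two columns carry signal and one is pure noise. The helper names `ols_p_values` and `backward_elimination` are made up for this example.

```python
import numpy as np
from scipy import stats


def ols_p_values(design, y):
    """Two-sided t-test P-values for every column of the design matrix."""
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    sigma2 = residuals @ residuals / (n - k)              # residual variance
    std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(design.T @ design)))
    return 2 * stats.t.sf(np.abs(beta / std_err), df=n - k)


def backward_elimination(X, y, names, sl=0.05):
    keep = list(range(X.shape[1]))                        # Step 2: start with all predictors
    while keep:
        design = np.column_stack([np.ones(len(y)), X[:, keep]])
        p = ols_p_values(design, y)[1:]                   # skip the constant term
        worst = int(np.argmax(p))                         # Step 3: highest P-value
        if p[worst] <= sl:                                # FIN: nothing is above SL
            break
        del keep[worst]                                   # Steps 4-5: remove, then refit
    return [names[i] for i in keep]


# Synthetic data (an assumption for illustration, not the 50_Startups dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

selected = backward_elimination(X, y, ["signal_a", "signal_b", "noise"])
print(selected)
```

The constant term is never a candidate for removal here, which matches the general rule about keeping b0 in the equation.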


## Importing the libraries

In [11]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [12]:
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [13]:
dataset[:5]

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [14]:
print(X)

[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']
 [131876.9 99814.71 362861.36 'New York']
 [134615.46 147198.87 127716.82 'California']
 [130298.13 145530.06 323876.68 'Florida']
 [120542.52 148718.95 311613.29 'New York']
 [123334.88 108679.17 304981.62 'California']
 [101913.08 110594.11 229160.95 'Florida']
 [100671.96 91790.61 249744.55 'California']
 [93863.75 127320.38 249839.44 'Florida']
 [91992.39 135495.07 252664.93 'California']
 [119943.24 156547.42 256512.92 'Florida']
 [114523.61 122616.84 261776.23 'New York']
 [78013.11 121597.55 264346.06 'California']
 [94657.16 145077.58 282574.31 'New York']
 [91749.16 114175.79 294919.57 'Florida']
 [86419.7 153514.11 0.0 'New York']
 [76253.86 113867.3 298664.47 'California']
 [78389.47 153773.43 299737.29 'New York']
 [73994.56 122782.75 303319.26 'Florida']
 [67532

## Encoding categorical data

The dataset has one categorical column (State), so we have to one-hot encode it.
The scikit-learn class OneHotEncoder is used for this.

In [15]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [16]:
print(X)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3
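A variant of the encoding step omits one dummy column at encoding time, following the general rule from the dummy variable section. The sketch below is not what the notebook does; it assumes a scikit-learn version in which `OneHotEncoder` accepts `drop='first'`, and it uses only the first three rows of the printed `X`, stored under the illustrative names `X_small` and `ct_drop`.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First three rows of X, copied from the printout above.
X_small = np.array([
    [165349.2, 136897.8, 471784.1, 'New York'],
    [162597.7, 151377.59, 443898.53, 'California'],
    [153441.51, 101145.55, 407934.54, 'Florida'],
], dtype=object)

ct_drop = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(drop='first'), [3])],
    remainder='passthrough',
    sparse_threshold=0,   # always return a dense array
)
X_encoded = np.array(ct_drop.fit_transform(X_small), dtype=float)
print(X_encoded)
```

Categories are sorted alphabetically, so the dropped (baseline) state is California and the two remaining dummy columns stand for Florida and New York.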

Note: **In multiple linear regression there is no need to apply feature scaling; the coefficients compensate for the different scales of the features.**
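The claim about feature scaling can be checked with a small experiment. The sketch below uses synthetic data and plain least squares from NumPy (both assumptions for illustration): one feature is divided by 100000, its coefficient grows by the same factor, and the fitted values stay the same.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X_raw = np.column_stack([
    np.ones(n),                      # constant column for b0
    rng.uniform(0, 1, n),            # small-scale feature
    rng.uniform(0, 100000, n),       # large-scale feature
])
y = 5 + 2 * X_raw[:, 1] + 0.0003 * X_raw[:, 2] + rng.normal(0, 0.1, n)

X_scaled = X_raw.copy()
X_scaled[:, 2] /= 100000             # rescale the large feature

b_raw, *_ = np.linalg.lstsq(X_raw, y, rcond=None)
b_scaled, *_ = np.linalg.lstsq(X_scaled, y, rcond=None)

same_fit = np.allclose(X_raw @ b_raw, X_scaled @ b_scaled)
print("same fitted values:", same_fit)
print("coefficient ratio:", b_scaled[2] / b_raw[2])
```

This holds for ordinary least squares; models with a penalty on the coefficients behave differently, which is outside the scope of this section.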

Note: **Do we need to check the linearity assumption? No, if the dataset has no linear relationship, our model will give a bad result. That's all**

## Splitting the dataset into the Training set and Test set

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Training the Multiple Linear Regression model on the Training set

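**Do we need to worry about the dummy variable trap?**

Not for making predictions. The `LinearRegression` class we import from scikit-learn does not drop a dummy column for us, but its least-squares solver still returns a valid fit when the columns are perfectly collinear, so the predictions are unaffected. The individual dummy coefficients are not uniquely determined in that case, so drop one column yourself if you want to interpret them.

**Do we have to do backward elimination?**

Not in order to train the model and make predictions. `LinearRegression` fits every feature it is given; it does not compute P-values or remove features. Backward elimination is a separate, optional step for keeping only the statistically significant features.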

In [18]:
from sklearn.linear_model import LinearRegression
# It is the same class used for the single and multiple linear regression.
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()
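How `LinearRegression` behaves when all dummy columns are kept can be checked directly. The sketch below uses synthetic data with three made-up categories (an assumption for illustration) and compares the predictions of a model fitted on all three dummy columns with one fitted after removing the first dummy column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 60
state = rng.integers(0, 3, size=n)            # three hypothetical categories
dummies = np.eye(3)[state]                    # one-hot encoding, all three columns
spend = rng.uniform(0, 100, size=(n, 1))      # one numerical feature
y = 10 + np.array([0.0, 5.0, -3.0])[state] + 2 * spend[:, 0] + rng.normal(0, 1, n)

X_all = np.hstack([dummies, spend])           # keeps the redundant dummy
X_drop = X_all[:, 1:]                         # first dummy column removed

model_all = LinearRegression().fit(X_all, y)
model_drop = LinearRegression().fit(X_drop, y)

same_predictions = np.allclose(model_all.predict(X_all), model_drop.predict(X_drop))
print("same predictions:", same_predictions)
print("dummy coefficients, all columns kept:", model_all.coef_[:3])
print("dummy coefficients, one column dropped:", model_drop.coef_[:2])
```

The predictions agree, but the dummy coefficients differ between the two fits: with one column dropped they are measured relative to the omitted category, which is the easier form to interpret.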

## Predicting the Test set results

We can't plot a graph because we have several features, so we simply build two vectors (predicted and actual profits) and compare them side by side.

In [21]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
# Reshape both 1-D vectors into column vectors with reshape and len.
# Then concatenate them side by side (axis=1): predictions on the left, actual values on the right.
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
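Beyond comparing the two columns by eye, the same ten pairs can be summarised with a couple of standard error measures. The values below are copied from the printed output above; the choice of mean absolute error and R squared is an addition for illustration, not part of the original notebook.

```python
import numpy as np

# Predicted and actual profits, copied from the printed comparison above.
pred = np.array([103015.2, 132582.28, 132447.74, 71976.1, 178537.48,
                 116161.24, 67851.69, 98791.73, 113969.44, 167921.07])
actual = np.array([103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

mae = np.mean(np.abs(actual - pred))                 # mean absolute error
ss_res = np.sum((actual - pred) ** 2)                # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)       # total sum of squares
r2 = 1 - ss_res / ss_tot                             # coefficient of determination

print(f"MAE = {mae:.2f}")
print(f"R^2 = {r2:.4f}")
```

An R squared close to 1 means the predictions track the actual profits closely on this test set.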


## Backward Elimination

Backward Elimination is not needed to train the model and make predictions with Scikit-Learn. Note, however, that `LinearRegression` does not select statistically significant features on its own: it fits every column it is given and reports no P-values.

However, if you do really want to learn how to manually implement Backward Elimination in Python and identify the most statistically significant features, please find in this link below some old videos I made on how to implement Backward Elimination in Python:

https://www.dropbox.com/sh/pknk0g9yu4z06u7/AADSTzieYEMfs1HHxKHt9j1ba?dl=0


These are old videos made on Spyder but the dataset and the code are the same as in the previous video lectures of this section on Multiple Linear Regression, except that I had manually removed the first column to avoid the Dummy Variable Trap with this line of code:

Avoiding the Dummy Variable Trap

X = X[:, 1:]


Just keep this line for the Backward Elimination implementation. For making predictions you don't have to remove a dummy variable column manually, because the predictions of Scikit-Learn's `LinearRegression` are not affected by the redundant column.


## Bonus

Question 1: How do I use my multiple linear regression model to make a single prediction, for example, the profit of a startup with R&D Spend = 160000, Administration Spend = 130000, Marketing Spend = 300000 and State = California?


In [22]:
X_predict = [[1.0, 0.0, 0.0, 160000, 130000, 300000]]

In [23]:
y_pred = regressor.predict(X_predict)
# gives a numpy array

In [29]:
y_pred[0]

181566.92389385228

Question 2: How do I get the final regression equation y = b0 + b1 x1 + b2 x2 + ... with the final values of the coefficients?

In [30]:
print(regressor.coef_)
print(regressor.intercept_)

[ 8.66e+01 -8.73e+02  7.86e+02  7.73e-01  3.29e-02  3.66e-02]
42467.52924854249


Therefore, the equation of our multiple linear regression model is:

**Profit = 86.6 × Dummy State 1 − 873 × Dummy State 2 + 786 × Dummy State 3 + 0.773 × R&D Spend + 0.0329 × Administration + 0.0366 × Marketing Spend + 42467.53**
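As a sanity check, the equation can be evaluated by hand for the California startup from Question 1. The sketch below uses the rounded coefficients printed above, so the result only approximately matches the value returned by `regressor.predict`.

```python
import numpy as np

# Rounded coefficients and intercept, copied from the printed output above.
coef = np.array([86.6, -873.0, 786.0, 0.773, 0.0329, 0.0366])
intercept = 42467.53

# California, R&D Spend = 160000, Administration = 130000, Marketing Spend = 300000
x_california = np.array([1.0, 0.0, 0.0, 160000, 130000, 300000])

profit = intercept + coef @ x_california
print(round(profit, 2))
```

The small gap to 181566.92 comes from rounding the coefficients to three significant digits.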

Important Note: To get these coefficients we used the "coef_" and "intercept_" attributes of our regressor object. Attributes in Python are different from methods: they are accessed without parentheses and simply hold a value or an array of values.
