## Multiple Linear Regression

### Index 
- [Assumptions of Linear Regression](#assumptions)
- [Equation and Method](#equation)
- [Exercise](#excercise)

In [1]:
# importing some basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

<a id='assumptions'></a>
### Assumptions of Linear Regression
- Linearity
- [Homoscedasticity](https://en.wikipedia.org/wiki/Homoscedasticity)
- [Multivariate normality](https://en.wikipedia.org/wiki/Multivariate_normal_distribution)
- Independence of errors
- [Lack of multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity)

##### Dummy variable trap.
Categorical variables should be split into dummy variables, and one of the resulting dummy columns should be omitted. The omitted category becomes the baseline: when all remaining dummy columns are 0, the model treats the observation as belonging to the omitted category, and a 1 in any other column shifts the prediction by that column's coefficient. We omit one column because of a phenomenon called the dummy variable trap, and the culprit is multicollinearity. The independent variables in a regression should be linearly independent, but the full set of dummy columns sums to 1 in every row, which makes them linearly dependent on the intercept term. Removing one column eliminates the dummy variable trap.
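
The redundancy can be checked numerically. This is a minimal sketch, assuming a hypothetical three-category `state` column and an intercept column of ones (the values are made up and are not read from the dataset used later). With every dummy column present the design matrix loses a rank; dropping one column restores full column rank.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column, for illustration only
state = pd.Series(['New York', 'California', 'Florida',
                   'New York', 'Florida', 'California'])

# One dummy column per category
dummies = pd.get_dummies(state).to_numpy(dtype=float)
intercept = np.ones((len(state), 1))

# Every row of the dummies sums to 1, i.e. it reproduces the intercept column
row_sums = dummies.sum(axis=1)

# Intercept plus ALL dummy columns: 4 columns, but only rank 3
full = np.hstack([intercept, dummies])
rank_full = np.linalg.matrix_rank(full)

# Dropping one dummy column removes the redundancy: 3 columns, rank 3
reduced = np.hstack([intercept, dummies[:, 1:]])
rank_reduced = np.linalg.matrix_rank(reduced)

print(row_sums, rank_full, rank_reduced)
```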

##### P value
The p-value is the probability of getting a sample like ours, or one more extreme, if the null hypothesis is true. So we assume the null hypothesis is true and then determine how “strange” our sample really is. If it is not that strange (a large p-value), we don’t change our mind about the null hypothesis. As the p-value gets smaller, we start to doubt that the null hypothesis is true, and once it falls below a chosen significance level we reject the null hypothesis.

- [Explanation 1](http://www.mathbootcamps.com/what-is-a-p-value/)
- [Explanation 2](http://www.wikihow.com/Calculate-P-Value)
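
As a small worked example with hypothetical numbers (not taken from this notebook's dataset): suppose a coin lands heads 60 times in 100 flips, and the null hypothesis is that the coin is fair. The sketch below computes the exact two-sided p-value from the binomial distribution using only the standard library.

```python
from math import comb

n, observed = 100, 60   # hypothetical experiment: 60 heads in 100 flips
p_null = 0.5            # null hypothesis: the coin is fair

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# "As extreme or more extreme": at least as far from the expected
# number of heads as the observed count, in either direction
expected = n * p_null
distance = abs(observed - expected)
p_value = sum(binom_pmf(k, n, p_null)
              for k in range(n + 1)
              if abs(k - expected) >= distance)

print(f"p-value = {p_value:.4f}")
```

The smaller this number, the harder it is to reconcile the observed sample with a fair coin.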

<a id='equation'></a>
### Equation and Method

Like simple linear regression, multiple linear regression uses a linear equation, but with multiple independent variables, to predict a dependent variable.

$y = b_{0} + b_{1} x_{1} + b_{2} x_{2} + b_{3} x_{3} + \dots + b_{n} x_{n}$
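
For many observations at once, the same equation is a matrix-vector product plus the intercept. A minimal sketch with made-up coefficients and observations (every number below is an illustrative assumption):

```python
import numpy as np

b0 = 2.0                         # intercept b_0 (made-up)
b = np.array([0.5, -1.0, 3.0])   # b_1, b_2, b_3 (made-up)

# Two observations with three independent variables each
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 0.0, 1.0]])

# y = b0 + b1*x1 + b2*x2 + b3*x3, evaluated for every row at once
y = b0 + X @ b
print(y)
```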


#### Different methods
- ##### All-in

> Here we throw in all the variables that we have. We usually do this when prior knowledge tells us that the variables are significant, or when a particular framework requires these variables to be included.

- ##### Backward Elimination

> 1) We first select a significance level to stay in the model (e.g. SL = 0.05).

> 2) We fit the model with all the possible predictors (variables).

> 3) Consider the predictor with the highest P-value. If it is higher than SL, remove that predictor; otherwise end the procedure.

> 4) Fit the model without the removed predictor, then go back to step 3 and repeat the check.

- ##### Forward Selection

> 1) We first select a significance level to enter the model (e.g. SL = 0.05).

> 2) We fit all simple regression models, one per predictor, and select the one with the lowest P-value.

> 3) We keep this variable and fit all possible models with one extra predictor added to the ones already selected.

> 4) We then consider the added predictor with the lowest P-value. If P < SL, keep it and go back to step 3; otherwise end the procedure and keep the previous model.

- ##### Bidirectional Elimination

> 1) Select a significance level for entering the model (S-enter) and one for staying in it (S-stay).

> 2) Perform the next step of forward selection: a new variable must have P < S-enter to enter.

> 3) Perform all steps of backward elimination: existing variables must have P < S-stay to stay. Then go back to step 2.

> 4) When no new variable can enter and no existing variable can exit, end the procedure.

- ##### All possible models/ score comparison

> Construct a model for every possible combination of variables ($2^{n} - 1$ models for $n$ variables), compare their scores, and select the best model.
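
An exhaustive comparison can be sketched with `itertools.combinations`. The choice of adjusted $R^{2}$ as the score is an assumption made for this illustration (other criteria are possible), and the data are synthetic.

```python
from itertools import combinations
import numpy as np

def adjusted_r2(features, y):
    """Fit OLS with an intercept via least squares and return the adjusted R^2."""
    n, k = features.shape
    X = np.hstack([np.ones((n, 1)), features])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    ss_res = residuals @ residuals
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Synthetic data (an assumption for illustration):
# y depends only on the first two features
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4))
y = 3.0 + 2.0 * features[:, 0] - 1.5 * features[:, 1] + rng.normal(scale=0.5, size=200)

# Score every non-empty subset of the feature columns
scores = {}
for size in range(1, features.shape[1] + 1):
    for subset in combinations(range(features.shape[1]), size):
        scores[subset] = adjusted_r2(features[:, list(subset)], y)

best_subset = max(scores, key=scores.get)
print(len(scores), best_subset)
```

The number of models doubles with every added variable, so this approach is only practical for a small number of predictors.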

<a id='excercise'></a>
### Exercise

The objective of this exercise is to inspect a data set of startups and build a model that can predict the profit from the other variables.

In [78]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# sklearn.cross_validation was removed; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# statsmodels.api exposes the array-based OLS; statsmodels.formula.api is for R-style formulas
import statsmodels.api as sm

##### Preprocessing

In [2]:
dataset = pd.read_csv('50_Startups.csv')
dataset.head()

|   | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|-----------|----------------|-----------------|-------|--------|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [41]:
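# Independent variables: R&D Spend, Administration, Marketing Spend, State
x = dataset.iloc[:, :4].values
# Dependent variable: Profit
y = dataset.iloc[:, 4].values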

In [51]:
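# OneHotEncoder(categorical_features=...) was removed from scikit-learn;
# ColumnTransformer applies the encoder to the State column (index 3) only.
from sklearn.compose import ColumnTransformer

# The dummy columns come first in the output, followed by the untouched numeric columns
column_transformer = ColumnTransformer([('state', OneHotEncoder(), [3])],
                                       remainder='passthrough', sparse_threshold=0)
x = np.array(column_transformer.fit_transform(x), dtype=float)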

Avoiding the dummy variable trap by dropping the first dummy column

In [52]:
x = x[:, 1:]

In [59]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

##### Building the model

In [58]:
# fitting the model
regressor = LinearRegression()
regressor.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [60]:
# Predicting the test set results
y_predict = regressor.predict(x_test)
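
To put a number on how closely predictions match held-out values, one option is to compute $R^{2}$ and the mean absolute error with `sklearn.metrics`. The sketch below is self-contained and runs on synthetic data (an assumption for illustration), so its scores say nothing about the startup dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a known linear relationship plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2 close to 1 and a small MAE both indicate a good fit on the test set
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R^2 = {r2:.3f}, MAE = {mae:.3f}")
```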

##### Backward elimination
Even though we were able to build our model and predict the test values with some amount of accuracy, we still haven't looked at how much each independent variable contributes to the dependent variable that we are predicting. Backward elimination answers that by checking the P-value of each coefficient, and for this we use the statsmodels library imported earlier. One detail needs care: scikit-learn's `LinearRegression` fits the intercept $b_{0}$ automatically (`fit_intercept=True`), but a statsmodels OLS model only estimates coefficients for the columns that are present in the data, so it has no $b_{0}$ unless we supply one. To incorporate it, we simply add another column to our dataset that is filled with 1's.

In [80]:
# Adding a column full of 1's for the intercept.
x = np.append(arr=np.ones((x.shape[0], 1)).astype(int), values=x, axis=1)
# The ones are passed as arr and x as values so that the intercept column comes first.
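
A quick way to see why the column of ones works: in an ordinary least squares fit, the coefficient attached to that column is the intercept. The sketch below uses `np.linalg.lstsq` on synthetic, noise-free data (an assumption for illustration) so that the recovered coefficients match the ones used to generate `y`.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))

# Noise-free target with intercept 4.0, so the fit is exact
y = 4.0 + 1.5 * features[:, 0] - 2.0 * features[:, 1]

# Prepend a column of ones, mirroring the np.append call above
X = np.append(arr=np.ones((features.shape[0], 1)), values=features, axis=1)

# Ordinary least squares: the coefficient of the ones column is the intercept b_0
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)
```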