## Multiple Linear Regression

In this exercise, a multiple linear regression model is used to predict the profitability of a company, as a guide for investment, from several data variables.

### Index 
- #### [Assumptions of Linear Regression](#assumptions)
- #### [Equation and Method](#equation)
- #### [Exercise](#excercise)
- #### [Conclusion](#conclusion)

In [1]:
# importing some basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

<a id='assumptions'></a>
### Assumptions of Linear Regression
- Linearity
- [Homoscedasticity](https://en.wikipedia.org/wiki/Homoscedasticity)
- [Multivariate normality](https://en.wikipedia.org/wiki/Multivariate_normal_distribution)
- Independence of errors
- [Lack of multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity)

##### Dummy variable trap.
Each categorical variable should be split into dummy variables, one column per category, and one of those columns should then be omitted. The regression model still accounts for the omitted category: it becomes the baseline that is absorbed into the intercept, and a 1 in any of the remaining columns shifts the prediction relative to that baseline. The reason for omitting a column is a phenomenon called the dummy variable trap, and the main culprit is multicollinearity. The independent variables should be linearly independent, but if we keep all the dummy columns they always add up to 1, which is exactly the constant column of the intercept, so they are linearly dependent. Removing one column eliminates the dummy variable trap.
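A small numerical check can make the trap concrete. The sketch below uses a made-up list of five state labels (an assumption for illustration, not the exercise data) and compares the rank of the design matrix with and without the redundant dummy column.

```python
import numpy as np

# Hypothetical sample of a three-level categorical variable (illustration only).
states = np.array(["New York", "California", "Florida", "New York", "Florida"])
categories = np.unique(states)  # sorted: California, Florida, New York

# One dummy column per category: 1 where the row belongs to that category.
dummies = (states[:, None] == categories[None, :]).astype(float)
intercept = np.ones((len(states), 1))

# Intercept plus ALL dummy columns: the dummies sum to the intercept column,
# so the columns are linearly dependent and the matrix loses one rank.
full = np.hstack([intercept, dummies])
rank_full = np.linalg.matrix_rank(full)

# Dropping one dummy column removes the dependency.
reduced = np.hstack([intercept, dummies[:, 1:]])
rank_reduced = np.linalg.matrix_rank(reduced)

print(full.shape[1], rank_full)        # 4 columns, rank 3
print(reduced.shape[1], rank_reduced)  # 3 columns, rank 3
```

A rank lower than the number of columns means the least-squares coefficients are not uniquely determined, which is the practical problem behind the trap.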

##### P value
The p-value is the probability of getting a sample like ours, or more extreme than ours, IF the null hypothesis is true. So, we assume the null hypothesis is true and then determine how “strange” our sample really is. If it is not that strange (a large p-value), then we don’t change our mind about the null hypothesis. As the p-value gets smaller, we start to doubt that the null hypothesis is true, and below a chosen threshold we change our minds and reject it.

- [Explanation 1](http://www.mathbootcamps.com/what-is-a-p-value/)
- [Explanation 2](http://www.wikihow.com/Calculate-P-Value)
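
To make the definition concrete, here is a minimal sketch with a hypothetical coin-flip experiment (not part of the exercise data): the null hypothesis is that the coin is fair, and we observe 9 heads in 10 flips. The helper name `binomial_tail` is made up for this example.

```python
from math import comb

def binomial_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_flips, n_heads = 10, 9

# One-sided p-value: probability of 9 or more heads if the coin is fair.
p_one_sided = binomial_tail(n_flips, n_heads)

# Two-sided p-value: by symmetry of the fair coin, 9+ heads or 9+ tails.
p_two_sided = 2 * p_one_sided

print(round(p_one_sided, 4))  # 0.0107
print(round(p_two_sided, 4))  # 0.0215
```

Both values are below 0.05, so at that significance level this sample would count as "strange" enough to reject the fair-coin hypothesis.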

<a id='equation'></a>
### Equation and Method

Like simple linear regression, multiple linear regression uses a linear equation, but with multiple independent variables, to predict a dependent variable.

$y = b_{0} + b_{1} x_{1} + b_{2} x_{2} + b_{3} x_{3} + \dots + b_{n} x_{n}$
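
As a minimal sketch of the equation, the snippet below evaluates $y$ for two rows with NumPy. The coefficient values and inputs are made up for illustration.

```python
import numpy as np

# Hypothetical coefficients: intercept b0 and slopes b1..b3 (illustration only).
b0 = 2.0
b = np.array([0.5, -1.0, 3.0])

# Two observations with three independent variables each.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 0.0, 1.0]])

# y = b0 + b1*x1 + b2*x2 + b3*x3, evaluated for every row at once.
y = b0 + X @ b

print(y)  # y[0] = 9.5, y[1] = 7.0
```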


#### Different methods
- ##### All-in

> Here we throw in all the variables that we have. We usually do this when we have prior knowledge that our variables are significant, or when a particular framework tells us that these variables should be included.

- ##### Backward Elimination

> 1) We first select a significance level to stay in the model (e.g. SL = 0.05).

> 2) We fit the model with all the possible predictors (variables).

> 3) Consider the predictor with the highest P-value. If it is higher than SL, remove that predictor; otherwise end the procedure.

> 4) Fit the model without the removed predictor, then go back to step 3 and do the same check.

- ##### Forward Selection

> 1) We first select a significance level to enter the model (e.g. SL = 0.05).

> 2) We fit all simple regression models, one per predictor, and select the one with the lowest P-value.

> 3) We keep this variable and fit all possible models with one extra predictor added to the ones already selected.

> 4) We then consider the new predictor with the lowest P-value. If P < SL, we keep it and go back to step 3; otherwise we end the procedure and keep the previous model.

- ##### Bidirectional Elimination

> 1) Select significance levels for entering and for staying in the model (S-enter and S-stay).

> 2) Perform the next step of forward selection: a new variable must have P < S-enter to enter.

> 3) Perform all steps of backward elimination: old variables must have P < S-stay to stay. Then go back to step 2.

> 4) When no new variables can enter and no old variables can exit, end the procedure.

- ##### All possible models/ score comparison

> Construct a model for every possible combination of variables ($2^{n} - 1$ models for $n$ variables), compare their scores, and select the best model.
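
The all-possible-models approach can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic, the helper names `adjusted_r2` and `best_subset` are made up, and adjusted $R^2$ is used as the score because the text above does not fix a criterion.

```python
from itertools import combinations
import numpy as np

def adjusted_r2(X, y):
    """Fit least squares with an intercept and return the adjusted R^2."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = residuals @ residuals
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def best_subset(X, y):
    """Score every non-empty subset of columns; return (best_columns, best_score)."""
    n_features = X.shape[1]
    scored = []
    for size in range(1, n_features + 1):
        for cols in combinations(range(n_features), size):
            scored.append((adjusted_r2(X[:, list(cols)], y), cols))
    score, cols = max(scored)
    return cols, score

# Synthetic data (illustration only): y depends on columns 0 and 2, not on column 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 4.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=60)

cols, score = best_subset(X, y)
print(cols, round(score, 3))
```

With three candidate variables only 7 models are fitted, but the count doubles with every added variable, which is why this method is rarely practical for wide data sets.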

<a id='excercise'></a>
### Exercise
- [Building the model](#building)
- [Backward elimination](#backelimination)
- [Backward elimination simpler code](#simplercode)

The objective of this exercise is to inspect the data set of startups and build a model that can predict the profit from the other variables.

In [2]:
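from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# the OLS class is exposed by statsmodels.api
import statsmodels.api as sm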



##### Preprocessing

In [3]:
dataset = pd.read_csv('50_Startups.csv')
dataset.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [4]:
x =  dataset.iloc[:, :4].values
y = dataset.iloc[:, 4].values

In [5]:
label_x = LabelEncoder()
x[:, 3] = label_x.fit_transform(x[:, 3])

one_hot_encoder = OneHotEncoder(categorical_features=[3])
x = one_hot_encoder.fit_transform(x).toarray()

Avoiding the dummy variable trap by dropping the first dummy column

In [6]:
x = x[:, 1:]
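
The `categorical_features` argument of `OneHotEncoder` used above was removed in later scikit-learn releases, so that cell may not run on a current installation. One alternative, sketched here with pandas rather than scikit-learn, is `pd.get_dummies` with `drop_first=True`, which encodes the column and drops one dummy in a single step. The small frame below reuses the five rows printed by `dataset.head()`; the names `sample`, `encoded` and `x_sample` are made up for this sketch.

```python
import pandas as pd

# The five rows shown by dataset.head() above (features only).
sample = pd.DataFrame({
    "R&D Spend": [165349.20, 162597.70, 153441.51, 144372.41, 142107.34],
    "Administration": [136897.80, 151377.59, 101145.55, 118671.85, 91391.77],
    "Marketing Spend": [471784.10, 443898.53, 407934.54, 383199.62, 366168.42],
    "State": ["New York", "California", "Florida", "New York", "Florida"],
})

# Encode State and drop the first category (California) to avoid the dummy variable trap.
encoded = pd.get_dummies(sample, columns=["State"], drop_first=True)

print(list(encoded.columns))
x_sample = encoded.to_numpy(dtype=float)
print(x_sample.shape)  # (5, 5)
```

Note that `get_dummies` appends the dummy columns at the end instead of placing them first, so the column indices used in the later cells would have to be adjusted if this variant were swapped in.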

In [7]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

<a id='building'></a>
##### Building the model

In [8]:
# fitting the model
regressor = LinearRegression()
regressor.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [9]:
# Predicting the test set results
y_predict = regressor.predict(x_test)
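
One common way to quantify how close `y_predict` is to `y_test` is the coefficient of determination $R^2$. The helper below is a minimal NumPy sketch (the function name `r_squared` and the example values are made up); scikit-learn offers an equivalent `r2_score` in `sklearn.metrics`.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()          # unexplained variation
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total variation
    return 1.0 - ss_res / ss_tot

# Made-up values, for illustration only.
y_true_demo = [100.0, 150.0, 200.0, 250.0]
y_pred_demo = [110.0, 140.0, 210.0, 240.0]

score = r_squared(y_true_demo, y_pred_demo)
print(round(score, 3))  # 0.968
```

In the notebook the call would be `r_squared(y_test, y_predict)`; a value close to 1 means the predictions explain most of the variation in the test profits.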

<a id='backelimination'></a>
##### Backward elimination
Even though we were able to build our model and predict the test values with some amount of accuracy, we still haven't looked at how much each independent variable contributes to the dependent variable that we are predicting. We also have to account for the intercept $b_{0}$ of our equation. scikit-learn's `LinearRegression` fits it automatically (`fit_intercept=True`), but the statsmodels `OLS` class used below estimates one coefficient per column of the input matrix and nothing more, so $b_{0}$ would not be calculated. To incorporate it, we simply add another column to our dataset that is filled with 1's.

In [10]:
# Adding a column full of 1's for the intercept.
x = np.append(arr = np.ones((50, 1)).astype(int), values = x, axis=1)
# The ones are passed as `arr` and x as `values` so that the intercept column comes first.
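
To see what the column of 1's does, the sketch below fits made-up, noise-free data with plain least squares (`np.linalg.lstsq`, standing in for the OLS fit). Without the column the fit is forced through the origin; with it, the first coefficient is the intercept.

```python
import numpy as np

# Made-up, noise-free data generated from y = 5 + 2*x (illustration only).
x_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = 5.0 + 2.0 * x_demo[:, 0]

# Without a constant column there is only a slope, so the line must pass through the origin.
slope_only, *_ = np.linalg.lstsq(x_demo, y_demo, rcond=None)

# Prepend a column of ones; its coefficient plays the role of b0.
x_with_const = np.hstack([np.ones((x_demo.shape[0], 1)), x_demo])
coef, *_ = np.linalg.lstsq(x_with_const, y_demo, rcond=None)

print(slope_only)  # a single slope of about 3.67, not the true 2
print(coef)        # approximately [5, 2]: intercept first, then slope
```

statsmodels also provides `add_constant` in `statsmodels.api` for the same purpose; the manual `np.append` used in the cell above has the same effect.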

We can now check the $P$-values associated with the different variables and estimate their contribution to the value we are trying to predict. We use the statsmodels library for this.

In [11]:
x_out = x.astype('int64')
x_opt = x_out[:, [0,1,2,3,4,5 ]]

In [12]:
# First Iteration
regressor_OLS = sm.OLS(endog=y, exog=x_opt).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 169.9 |
| Date: | Tue, 22 May 2018 | Prob (F-statistic): | 1.34e-27 |
| Time: | 16:38:54 | Log-Likelihood: | -525.38 |
| No. Observations: | 50 | AIC: | 1063.0 |
| Df Residuals: | 44 | BIC: | 1074.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.013e+04 | 6884.855 | 7.281 | 0.000 | 3.63e+04 | 6.4e+04 |
| x1 | 198.7542 | 3371.026 | 0.059 | 0.953 | -6595.103 | 6992.611 |
| x2 | -42.0063 | 3256.058 | -0.013 | 0.990 | -6604.161 | 6520.148 |
| x3 | 0.8060 | 0.046 | 17.368 | 0.000 | 0.712 | 0.900 |
| x4 | -0.0270 | 0.052 | -0.517 | 0.608 | -0.132 | 0.078 |
| x5 | 0.0270 | 0.017 | 1.574 | 0.123 | -0.008 | 0.062 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.783 | Durbin-Watson: | 1.283 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.267 |
| Skew: | -0.948 | Prob(JB): | 2.41e-05 |
| Kurtosis: | 5.572 | Cond. No. | 1450000.0 |


##### Interpreting the result
After we have obtained the description of the model, we select the variable with the largest $P$ value, eliminate it, and refit the model. This process is repeated until no variable has a $P$ value greater than the significance level of 0.05 (5%). Here $X_{2}$ has the largest $P$ value, so we eliminate it and fit the model again.
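
The selection rule can be sketched with the $P$ values of the first summary (copied from its `P>|t|` column). Leaving the intercept out of the candidates is an assumption of this sketch.

```python
import numpy as np

SL = 0.05  # significance level

# Predictor labels and p-values taken from the first OLS summary.
names = ["x1", "x2", "x3", "x4", "x5"]
p_values = np.array([0.953, 0.990, 0.000, 0.608, 0.123])

# Pick the predictor with the largest p-value and compare it with SL.
worst = int(np.argmax(p_values))
if p_values[worst] > SL:
    print(f"remove {names[worst]} (p = {p_values[worst]:.3f})")
else:
    print("all remaining predictors are significant")
```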

In [13]:
# eliminating x2
x_opt = x_out[:, [0,1,3,4,5 ]]

In [14]:
# Second Iteration
regressor_OLS = sm.OLS(endog=y, exog=x_opt).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.946 |
| Method: | Least Squares | F-statistic: | 217.2 |
| Date: | Tue, 22 May 2018 | Prob (F-statistic): | 8.49e-29 |
| Time: | 16:38:54 | Log-Likelihood: | -525.38 |
| No. Observations: | 50 | AIC: | 1061.0 |
| Df Residuals: | 45 | BIC: | 1070.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.011e+04 | 6647.901 | 7.537 | 0.000 | 3.67e+04 | 6.35e+04 |
| x1 | 220.1847 | 2900.553 | 0.076 | 0.940 | -5621.828 | 6062.197 |
| x2 | 0.8060 | 0.046 | 17.606 | 0.000 | 0.714 | 0.898 |
| x3 | -0.0270 | 0.052 | -0.523 | 0.604 | -0.131 | 0.077 |
| x4 | 0.0270 | 0.017 | 1.592 | 0.118 | -0.007 | 0.061 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.759 | Durbin-Watson: | 1.282 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.173 |
| Skew: | -0.948 | Prob(JB): | 2.53e-05 |
| Kurtosis: | 5.563 | Cond. No. | 1400000.0 |


In [15]:
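# Eliminate x1 (column 1, the remaining state dummy).
# Note: column 5 (Marketing Spend) is dropped in this same step.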
x_opt = x_out[:, [0,3,4 ]]

In [16]:
# Third Iteration
regressor_OLS = sm.OLS(endog=y, exog=x_opt).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.948 |
| Model: | OLS | Adj. R-squared: | 0.946 |
| Method: | Least Squares | F-statistic: | 426.8 |
| Date: | Tue, 22 May 2018 | Prob (F-statistic): | 7.29e-31 |
| Time: | 16:38:54 | Log-Likelihood: | -526.83 |
| No. Observations: | 50 | AIC: | 1060.0 |
| Df Residuals: | 47 | BIC: | 1065.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.489e+04 | 6016.737 | 9.122 | 0.000 | 4.28e+04 | 6.7e+04 |
| x1 | 0.8621 | 0.030 | 28.589 | 0.000 | 0.801 | 0.923 |
| x2 | -0.0530 | 0.049 | -1.073 | 0.289 | -0.152 | 0.046 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.679 | Durbin-Watson: | 1.189 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 20.451 |
| Skew: | -0.961 | Prob(JB): | 3.62e-05 |
| Kurtosis: | 5.474 | Cond. No. | 665000.0 |


In [17]:
# Eliminate column 4 (Administration), reported as x2 in the third summary
x_opt = x_out[:, [0, 3]]

In [18]:
# Fourth Iteration
regressor_OLS = sm.OLS(endog=y, exog=x_opt).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.947 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 849.8 |
| Date: | Tue, 22 May 2018 | Prob (F-statistic): | 3.50e-32 |
| Time: | 16:38:54 | Log-Likelihood: | -527.44 |
| No. Observations: | 50 | AIC: | 1059.0 |
| Df Residuals: | 48 | BIC: | 1063.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.903e+04 | 2537.900 | 19.320 | 0.000 | 4.39e+04 | 5.41e+04 |
| x1 | 0.8543 | 0.029 | 29.151 | 0.000 | 0.795 | 0.913 |

| | | | |
|---|---|---|---|
| Omnibus: | 13.727 | Durbin-Watson: | 1.116 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 18.538 |
| Skew: | -0.911 | Prob(JB): | 9.43e-05 |
| Kurtosis: | 5.361 | Cond. No. | 165000.0 |


<a id='simplercode'></a>
##### Backward elimination simpler code

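```python
import numpy as np
import statsmodels.api as sm

def backwardElimination(x, y, sl):
    """Repeatedly drop the column with the largest p-value until none exceeds sl."""
    numVars = x.shape[1]
    for _ in range(numVars):
        regressor_OLS = sm.OLS(y, x).fit()
        pvalues = np.asarray(regressor_OLS.pvalues, dtype=float)
        worst = int(np.argmax(pvalues))
        if pvalues[worst] <= sl:
            break  # every remaining column is significant
        x = np.delete(x, worst, axis=1)
    # Summary of the last model that was fitted
    print(regressor_OLS.summary())
    return x

SL = 0.05
# x is the feature matrix with the intercept column added earlier
X_opt = x[:, [0, 1, 2, 3, 4, 5]]
X_Modeled = backwardElimination(X_opt, y, SL)
```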

<a id='conclusion'></a>
### Conclusion
From this, we can conclude that the variable that contributes most notably to the profit of a startup is its R&D spend.