# Multiple Linear Regression

### Assumptions

- [ ] Linearity
- [ ] Homoscedasticity
- [ ] Multivariate normality
- [ ] Independence of errors
- [ ] Lack of multicollinearity

### Finding Most Significant Independent Variables for Prediction

There are five main methods. Backward Elimination, Forward Selection, and Bidirectional Elimination are collectively known as stepwise regression.

- [ ] All in
- [x] Backward Elimination
- [ ] Forward Selection
- [ ] Bidirectional Elimination
- [ ] Score Comparison

## Backward Elimination

#### Importing Libraries and Data

In [107]:
import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.read_csv("./data.csv")
df.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


#### Making Dummy Variables from the Categorical Variable & Separating Independent and Dependent Variables

In [108]:
labelEncoder = LabelEncoder()
oneHotEncoder = OneHotEncoder(handle_unknown='ignore')

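# Take every column except the target (Profit); copy to avoid SettingWithCopyWarning
X = df.iloc[:, :-1].copy()
# Encode State as integers (alphabetical: California=0, Florida=1, New York=2)
X["State"] = labelEncoder.fit_transform(X["State"])
# One-hot encode the integer codes into three dummy columns named 0, 1, 2
dummy_state = pd.DataFrame(oneHotEncoder.fit_transform(pd.DataFrame(X["State"])).toarray())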

X = dummy_state.join(X)
X.drop("State", axis=1, inplace=True)
X.head()

| | 0 | 1 | 2 | R&D Spend | Administration | Marketing Spend |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 165349.2 | 136897.8 | 471784.1 |
| 1 | 1.0 | 0.0 | 0.0 | 162597.7 | 151377.59 | 443898.53 |
| 2 | 0.0 | 1.0 | 0.0 | 153441.51 | 101145.55 | 407934.54 |
| 3 | 0.0 | 0.0 | 1.0 | 144372.41 | 118671.85 | 383199.62 |
| 4 | 0.0 | 1.0 | 0.0 | 142107.34 | 91391.77 | 366168.42 |
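Keeping all three State dummies together with an intercept column makes the design matrix linearly dependent, since the three dummies always sum to 1. That is consistent with the very large condition numbers (around 1e+17) reported in the regression summaries in this notebook. A common remedy is to drop one dummy level. The sketch below uses `pandas.get_dummies` instead of the notebook's encoders, and the small `states` frame is an assumption built from the five rows shown by `df.head()`:

```python
import pandas as pd

# The State values from the first five rows of the data (see df.head())
states = pd.DataFrame(
    {"State": ["New York", "California", "Florida", "New York", "Florida"]}
)

# drop_first=True omits the first category in sorted order (California),
# so the remaining dummies no longer sum to the intercept column
dummies = pd.get_dummies(states["State"], drop_first=True, dtype=float)
print(dummies)
```

With this encoding, California becomes the baseline level and the coefficients on the Florida and New York columns are read as differences from it.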


In [109]:
y = df["Profit"]
y.head(2)

0    192261.83
1    191792.06
Name: Profit, dtype: float64

#### Adding a Column of Ones for the Intercept (the x0 Values in the Multiple Regression Formula)

In [110]:
# Prepend the intercept column; the row count comes from X instead of a hard-coded 50
X = np.append(arr=np.ones((X.shape[0], 1)), values=X, axis=1)
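The column of ones plays the role of $x_0$ in the multiple regression formula. Written out for the six predictor columns used here (the symbols $b_i$ and $\varepsilon$ are notation assumed for this sketch, not taken from the notebook):

$$
y = b_0 x_0 + b_1 x_1 + b_2 x_2 + \dots + b_6 x_6 + \varepsilon, \qquad x_0 = 1
$$

`sm.OLS` does not add an intercept term on its own, so the constant $b_0$ is only estimated when this column is present in `exog`.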

#### Fitting All the Independent Variables in the Model

In [111]:
X_opt = X[:, [0,1,2,3,4,5,6]]
r_OLS = sm.OLS(endog=y, exog=X_opt).fit()
r_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 169.9 |
| Date: | Tue, 28 Jun 2022 | Prob (F-statistic): | 1.34e-27 |
| Time: | 19:42:44 | Log-Likelihood: | -525.38 |
| No. Observations: | 50 | AIC: | 1063.0 |
| Df Residuals: | 44 | BIC: | 1074.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.763e+04 | 5073.636 | 7.417 | 0.000 | 2.74e+04 | 4.79e+04 |
| x1 | 1.249e+04 | 2449.797 | 5.099 | 0.000 | 7554.868 | 1.74e+04 |
| x2 | 1.269e+04 | 2726.700 | 4.654 | 0.000 | 7195.596 | 1.82e+04 |
| x3 | 1.245e+04 | 2486.364 | 5.007 | 0.000 | 7439.285 | 1.75e+04 |
| x4 | 0.8060 | 0.046 | 17.369 | 0.000 | 0.712 | 0.900 |
| x5 | -0.0270 | 0.052 | -0.517 | 0.608 | -0.132 | 0.078 |
| x6 | 0.0270 | 0.017 | 1.574 | 0.123 | -0.008 | 0.062 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.782 | Durbin-Watson: | 1.283 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.266 |
| Skew: | -0.948 | Prob(JB): | 2.41e-05 |
| Kurtosis: | 5.572 | Cond. No. | 2.69e+17 |
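The Adj. R-squared figure in this summary can be reproduced from R-squared, the number of observations, and Df Model. A small sketch follows; the function name `adjusted_r2` is illustrative and not a statsmodels API. Because the inputs are the rounded values printed in the summary, the result only matches to about three decimals:

```python
def adjusted_r2(r2, n_obs, df_model):
    """Adjusted R-squared for a model with an intercept.

    r2       -- ordinary R-squared
    n_obs    -- number of observations
    df_model -- number of estimated slope terms (Df Model in the summary)
    """
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - df_model - 1)

# Values read from the full-model summary: R-squared 0.951, 50 observations, Df Model 5
full_model_adj_r2 = adjusted_r2(0.951, 50, 5)
print(round(full_model_adj_r2, 3))
```

Unlike plain R-squared, this quantity is penalized for every extra term, which makes it a more useful figure to track while variables are removed.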


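#### Removing Variables Where P > SL (the Chosen Significance Level)

In the summaries, x1 to x3 are the State dummies (California, Florida, New York), x4 is R&D Spend, x5 is Administration, and x6 is Marketing Spend. Backward elimination normally drops the variable with the highest p-value first, which in the full model is x5 (p = 0.608). The cells here drop x6 (p = 0.123) first and x5 in the following step.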

In [112]:
X_opt = X[:, [0,1,2,3,4,5]]
r_OLS = sm.OLS(endog=y, exog=X_opt).fit()
r_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.948 |
| Model: | OLS | Adj. R-squared: | 0.943 |
| Method: | Least Squares | F-statistic: | 205.0 |
| Date: | Tue, 28 Jun 2022 | Prob (F-statistic): | 2.9e-28 |
| Time: | 19:42:44 | Log-Likelihood: | -526.75 |
| No. Observations: | 50 | AIC: | 1064.0 |
| Df Residuals: | 45 | BIC: | 1073.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.122e+04 | 4607.941 | 8.945 | 0.000 | 3.19e+04 | 5.05e+04 |
| x1 | 1.339e+04 | 2421.500 | 5.529 | 0.000 | 8511.111 | 1.83e+04 |
| x2 | 1.448e+04 | 2518.987 | 5.748 | 0.000 | 9405.870 | 1.96e+04 |
| x3 | 1.335e+04 | 2459.306 | 5.428 | 0.000 | 8395.623 | 1.83e+04 |
| x4 | 0.8609 | 0.031 | 27.665 | 0.000 | 0.798 | 0.924 |
| x5 | -0.0527 | 0.050 | -1.045 | 0.301 | -0.154 | 0.049 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.275 | Durbin-Watson: | 1.197 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 19.26 |
| Skew: | -0.953 | Prob(JB): | 6.57e-05 |
| Kurtosis: | 5.369 | Cond. No. | 9.16e+17 |


In [113]:
X_opt = X[:, [0,1,2,3,4]]
r_OLS = sm.OLS(endog=y, exog=X_opt).fit()
r_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.947 |
| Model: | OLS | Adj. R-squared: | 0.943 |
| Method: | Least Squares | F-statistic: | 272.4 |
| Date: | Tue, 28 Jun 2022 | Prob (F-statistic): | 2.76e-29 |
| Time: | 19:42:44 | Log-Likelihood: | -527.35 |
| No. Observations: | 50 | AIC: | 1063.0 |
| Df Residuals: | 46 | BIC: | 1070.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.686e+04 | 1959.786 | 18.806 | 0.000 | 3.29e+04 | 4.08e+04 |
| x1 | 1.189e+04 | 1956.677 | 6.079 | 0.000 | 7955.697 | 1.58e+04 |
| x2 | 1.306e+04 | 2122.665 | 6.152 | 0.000 | 8785.448 | 1.73e+04 |
| x3 | 1.19e+04 | 2036.022 | 5.847 | 0.000 | 7805.580 | 1.6e+04 |
| x4 | 0.8530 | 0.030 | 28.226 | 0.000 | 0.792 | 0.914 |

| | | | |
|---|---|---|---|
| Omnibus: | 13.418 | Durbin-Watson: | 1.122 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 17.605 |
| Skew: | -0.907 | Prob(JB): | 0.00015 |
| Kurtosis: | 5.271 | Cond. No. | 3.7e+17 |
