# Multiple-Linear-Regression with Backward Elimination in Python

Multiple linear regression assesses whether one continuous dependent variable can be predicted from a set of independent (or predictor) variables. In other words, it measures how much of the variance in the dependent variable is explained by the predictors. Variable selection procedures help decide which predictors are worth keeping, which makes the analysis more efficient.
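As a sketch in standard notation (the symbols $b_j$, $\varepsilon$, $\hat{y}_i$ and $\bar{y}$ are assumed here for illustration), the model with $k$ predictors and the share of variance it explains, $R^2$, can be written as:

$$
y = b_0 x_0 + b_1 x_1 + \dots + b_k x_k + \varepsilon, \qquad x_0 = 1
$$

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

Here $b_0$ is the intercept, $\hat{y}_i$ is the prediction for observation $i$, and $\bar{y}$ is the mean of the observed values. An $R^2$ close to 1 means the predictors account for most of the variance.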

#### Example dataset: 

Which startups have the most profit? Those in California or those in New York? Those that spend the most on R&D or those that spend the most on marketing?

#### There are five methods to build a model:

1. All-in
2. Backward Elimination
3. Forward Selection
4. Bidirectional Elimination
5. Score Comparison

#### Note: We will use Backward Elimination because it is the fastest.


## Step 1: Data processing

In [1]:
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Importing the dataset
dataset = pd.read_csv('Startups.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values

# Encoding the categorical variable (column 3) as one-hot columns
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
# make_column_transformer expects (transformer, columns) tuples;
# sparse_threshold=0 makes the result a dense array
preprocess = make_column_transformer((OneHotEncoder(categories='auto'), [3]), remainder="passthrough", sparse_threshold=0)
X = preprocess.fit_transform(X)
# Casting to int truncates any decimals; it matches the output printed below
X = np.array(X, dtype=int)

# Avoiding the Dummy Variable Trap
X = X[:, 1:]

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y , random_state=0, test_size=0.2)
print("X train:\n",X_train)
print("\ny train:\n",y_train)
print("\nX test:\n",X_test)
print("\ny test:\n",y_test)

X train:
 [[     1      0  55493 103057 214634]
 [     0      1  46014  85047 205517]
 [     1      0  75328 144135 134050]
 [     0      0  46426 157693 210797]
 [     1      0  91749 114175 294919]
 [     1      0 130298 145530 323876]
 [     1      0 119943 156547 256512]
 [     0      1   1000 124153   1903]
 [     0      1    542  51743      0]
 [     0      1  65605 153032 107138]
 [     0      1 114523 122616 261776]
 [     1      0  61994 115641  91131]
 [     0      0  63408 129219  46085]
 [     0      0  78013 121597 264346]
 [     0      0  23640  96189 148001]
 [     0      0  76253 113867 298664]
 [     0      1  15505 127382  35534]
 [     0      1 120542 148718 311613]
 [     0      0  91992 135495 252664]
 [     0      0  64664 139553 137962]
 [     0      1 131876  99814 362861]
 [     0      1  94657 145077 282574]
 [     0      0  28754 118546 172795]
 [     0      0      0 116983  45173]
 [     0      0 162597 151377 443898]
 [     1      0  93863 127320 249839]
 [
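The "dummy variable trap" handled in the cell above can be illustrated with a minimal sketch. It uses a small hypothetical category vector, not the startup data: when every one-hot column is kept alongside an intercept, the dummy columns sum to the intercept column, so the design matrix loses rank.

```python
import numpy as np

# Hypothetical three-level category for six observations.
states = np.array([0, 1, 2, 0, 1, 2])
one_hot = np.eye(3, dtype=int)[states]            # one column per category
intercept = np.ones((len(states), 1), dtype=int)

full = np.hstack([intercept, one_hot])            # intercept + all 3 dummies
reduced = np.hstack([intercept, one_hot[:, 1:]])  # intercept + 2 dummies

# The three dummies add up to the intercept column, so `full` is rank-deficient.
print(np.linalg.matrix_rank(full), "of", full.shape[1], "columns")
# Dropping one dummy restores full column rank.
print(np.linalg.matrix_rank(reduced), "of", reduced.shape[1], "columns")
```

Dropping the first dummy column, as `X = X[:, 1:]` does, makes that category the baseline against which the remaining dummies are measured.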

## Step 2: Fitting Multiple Linear Regression to the Training set

In [2]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

## Step 3: Predicting the Test set results and Comparing the results

In [3]:
y_pred = regressor.predict(X_test)
print("y test:\n",y_test)
print("\ny pred:\n",y_pred)


y test:
 [103282.38 144259.4  146121.95  77798.83 191050.39 105008.31  81229.06
  97483.56 110352.25 166187.94]

y pred:
 [103015.24646785 132581.94062687 132448.09397395  71975.74395634
 178537.52007851 116161.05196902  67851.47761323  98791.74112204
 113969.41004647 167921.22416077]
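Comparing the two arrays by eye is hard, so a summary metric can help. The sketch below copies the values printed above and computes the root mean squared error and R² with plain NumPy. The variable names are illustrative; `sklearn.metrics` offers `mean_squared_error` and `r2_score` for the same purpose.

```python
import numpy as np

# Values copied from the y test / y pred output printed above.
y_test = np.array([103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])
y_pred = np.array([103015.24646785, 132581.94062687, 132448.09397395,
                   71975.74395634, 178537.52007851, 116161.05196902,
                   67851.47761323, 98791.74112204, 113969.41004647,
                   167921.22416077])

residuals = y_test - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))         # typical size of a prediction error
ss_res = np.sum(residuals ** 2)                 # variation left unexplained
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total variation in the test targets
r2 = 1 - ss_res / ss_tot                        # share of variance explained

print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.4f}")
```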


## Step 4: Building the optimal model using backward elimination

The statsmodels OLS class does not add the intercept term (x0 = 1 in the regression equation) automatically, so a column of ones must be added to X. The scikit-learn LinearRegression class fits the intercept by default.
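To see why the x0 column matters, here is a minimal sketch using NumPy's least-squares solver on hypothetical data lying exactly on the line y = 10 + 2x. Without a column of ones, the fit is forced through the origin. The same holds for any solver that uses only the columns it is given.

```python
import numpy as np

# Hypothetical data on the exact line y = 10 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 10.0 + 2.0 * x

# Without a column of ones, the only free parameter is the slope.
coef_no_const, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)

# With x0 = 1 prepended, the first coefficient is the intercept b0.
X_const = np.column_stack([np.ones_like(x), x])
coef_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print("slope only:        ", coef_no_const)
print("intercept and slope:", coef_const)
```

The second fit recovers the intercept 10 and slope 2, while the first compensates for the missing intercept with a distorted slope.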

Backward Elimination Algorithm:

##### STEP 1: Select a significance level to stay in the model (e.g. SL = 0.05)

##### STEP 2: Fit the full model with all possible predictors 

##### STEP 3: Consider the predictor with the highest P-value. If P > SL, go to STEP 4, otherwise go to FIN 

##### STEP 4: Remove the predictor 

##### STEP 5: Fit model without this variable, then return to STEP 3

##### FIN: Your Model Is Ready 

In [4]:
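# OLS is imported from statsmodels.api (recent statsmodels releases no longer expose it in statsmodels.formula.api)
import statsmodels.api as sm
# Prepend a column of ones (x0) so that OLS fits an intercept
X = np.append(arr = np.ones((X.shape[0], 1)).astype(int), values = X, axis = 1)
print(X)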

[[     1      0      1 165349 136897 471784]
 [     1      0      0 162597 151377 443898]
 [     1      1      0 153441 101145 407934]
 [     1      0      1 144372 118671 383199]
 [     1      1      0 142107  91391 366168]
 [     1      0      1 131876  99814 362861]
 [     1      0      0 134615 147198 127716]
 [     1      1      0 130298 145530 323876]
 [     1      0      1 120542 148718 311613]
 [     1      0      0 123334 108679 304981]
 [     1      1      0 101913 110594 229160]
 [     1      0      0 100671  91790 249744]
 [     1      1      0  93863 127320 249839]
 [     1      0      0  91992 135495 252664]
 [     1      1      0 119943 156547 256512]
 [     1      0      1 114523 122616 261776]
 [     1      0      0  78013 121597 264346]
 [     1      0      1  94657 145077 282574]
 [     1      1      0  91749 114175 294919]
 [     1      0      1  86419 153514      0]
 [     1      0      0  76253 113867 298664]
 [     1      0      1  78389 153773 299737]
 [     1  

In [5]:
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Dep. Variable: | y | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 169.9 |
| Date: | Sat, 08 Jun 2019 | Prob (F-statistic): | 1.34e-27 |
| Time: | 21:41:18 | Log-Likelihood: | -525.38 |
| No. Observations: | 50 | AIC: | 1063.0 |
| Df Residuals: | 44 | BIC: | 1074.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | 5.013e+04 | 6884.855 | 7.281 | 0.000 | 3.63e+04 | 6.4e+04 |
| x1 | 198.7542 | 3371.026 | 0.059 | 0.953 | -6595.103 | 6992.611 |
| x2 | -42.0063 | 3256.058 | -0.013 | 0.990 | -6604.161 | 6520.148 |
| x3 | 0.8060 | 0.046 | 17.368 | 0.000 | 0.712 | 0.900 |
| x4 | -0.0270 | 0.052 | -0.517 | 0.608 | -0.132 | 0.078 |
| x5 | 0.0270 | 0.017 | 1.574 | 0.123 | -0.008 | 0.062 |

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Omnibus: | 14.783 | Durbin-Watson: | 1.283 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.267 |
| Skew: | -0.948 | Prob(JB): | 2.41e-05 |
| Kurtosis: | 5.572 | Cond. No. | 1.45e+06 |


##### x2 has the highest p-value (0.990 > 0.05), so it is eliminated (Step 4) and the model is fitted without it (Step 5).

In [6]:
X_opt = X[:, [0, 1, 3, 4, 5]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Dep. Variable: | y | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.946 |
| Method: | Least Squares | F-statistic: | 217.2 |
| Date: | Sat, 08 Jun 2019 | Prob (F-statistic): | 8.49e-29 |
| Time: | 21:41:18 | Log-Likelihood: | -525.38 |
| No. Observations: | 50 | AIC: | 1061.0 |
| Df Residuals: | 45 | BIC: | 1070.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | 5.011e+04 | 6647.901 | 7.537 | 0.000 | 3.67e+04 | 6.35e+04 |
| x1 | 220.1847 | 2900.553 | 0.076 | 0.940 | -5621.828 | 6062.197 |
| x2 | 0.8060 | 0.046 | 17.606 | 0.000 | 0.714 | 0.898 |
| x3 | -0.0270 | 0.052 | -0.523 | 0.604 | -0.131 | 0.077 |
| x4 | 0.0270 | 0.017 | 1.592 | 0.118 | -0.007 | 0.061 |

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Omnibus: | 14.759 | Durbin-Watson: | 1.282 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.173 |
| Skew: | -0.948 | Prob(JB): | 2.53e-05 |
| Kurtosis: | 5.563 | Cond. No. | 1.40e+06 |


##### x1 has the highest p-value (0.940 > 0.05), so it is eliminated (Step 4) and the model is fitted without it (Step 5).

In [7]:
X_opt = X[:, [0, 3, 4, 5]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

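| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Dep. Variable: | y | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.948 |
| Method: | Least Squares | F-statistic: | 296.0 |
| Date: | Sat, 08 Jun 2019 | Prob (F-statistic): | 4.53e-30 |
| Time: | 21:41:18 | Log-Likelihood: | -525.39 |
| No. Observations: | 50 | AIC: | 1059.0 |
| Df Residuals: | 46 | BIC: | 1066.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | 5.012e+04 | 6572.384 | 7.626 | 0.000 | 3.69e+04 | 6.34e+04 |
| x1 | 0.8057 | 0.045 | 17.846 | 0.000 | 0.715 | 0.897 |
| x2 | -0.0268 | 0.051 | -0.526 | 0.602 | -0.130 | 0.076 |
| x3 | 0.0272 | 0.016 | 1.655 | 0.105 | -0.006 | 0.060 |

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Omnibus: | 14.839 | Durbin-Watson: | 1.282 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.443 |
| Skew: | -0.949 | Prob(JB): | 2.21e-05 |
| Kurtosis: | 5.587 | Cond. No. | 1.40e+06 |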


##### x2 (column index 4 of X) has the highest p-value (0.602 > 0.05), so it is eliminated (Step 4) and the model is fitted without it (Step 5).

In [8]:
X_opt = X[:, [0, 3, 5]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Dep. Variable: | y | R-squared: | 0.950 |
| Model: | OLS | Adj. R-squared: | 0.948 |
| Method: | Least Squares | F-statistic: | 450.8 |
| Date: | Sat, 08 Jun 2019 | Prob (F-statistic): | 2.16e-31 |
| Time: | 21:41:18 | Log-Likelihood: | -525.54 |
| No. Observations: | 50 | AIC: | 1057.0 |
| Df Residuals: | 47 | BIC: | 1063.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | 4.698e+04 | 2689.941 | 17.464 | 0.000 | 4.16e+04 | 5.24e+04 |
| x1 | 0.7966 | 0.041 | 19.265 | 0.000 | 0.713 | 0.880 |
| x2 | 0.0299 | 0.016 | 1.927 | 0.060 | -0.001 | 0.061 |

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Omnibus: | 14.678 | Durbin-Watson: | 1.257 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.162 |
| Skew: | -0.939 | Prob(JB): | 2.54e-05 |
| Kurtosis: | 5.575 | Cond. No. | 5.32e+05 |


##### x2 (column index 5 of X) has the highest p-value (0.060 > 0.05), so it is eliminated (Step 4) and the model is fitted without it (Step 5).

In [9]:
X_opt = X[:, [0, 3]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Dep. Variable: | y | R-squared: | 0.947 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 849.8 |
| Date: | Sat, 08 Jun 2019 | Prob (F-statistic): | 3.50e-32 |
| Time: | 21:41:18 | Log-Likelihood: | -527.44 |
| No. Observations: | 50 | AIC: | 1059.0 |
| Df Residuals: | 48 | BIC: | 1063.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | 4.903e+04 | 2537.900 | 19.320 | 0.000 | 4.39e+04 | 5.41e+04 |
| x1 | 0.8543 | 0.029 | 29.151 | 0.000 | 0.795 | 0.913 |

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Omnibus: | 13.727 | Durbin-Watson: | 1.116 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 18.538 |
| Skew: | -0.911 | Prob(JB): | 9.43e-05 |
| Kurtosis: | 5.361 | Cond. No. | 1.65e+05 |


#### Only column index 3 of X remains, which holds the R&D value.

#### Thus, at the 0.05 significance level, R&D spending is the only predictor of profit that the model retains.
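Each summary reports an adjusted R-squared alongside R-squared. As a rough cross-check, the sketch below recomputes it from the rounded figures in the five summaries, using the standard formula 1 - (1 - R²)(n - 1)/(n - k - 1) with n observations and k predictors. The helper name `adjusted_r2` is illustrative, and because the inputs are rounded to three decimals the recomputed values agree only approximately.

```python
# (R-squared, Adj. R-squared, Df Model) as reported in the five summaries above.
reported = [
    (0.951, 0.945, 5),
    (0.951, 0.946, 4),
    (0.951, 0.948, 3),
    (0.950, 0.948, 2),
    (0.947, 0.945, 1),
]
n_obs = 50  # "No. Observations" in every summary


def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


recomputed = [adjusted_r2(r2, n_obs, k) for r2, _, k in reported]
for (r2, adj, k), value in zip(reported, recomputed):
    print(f"k={k}: reported {adj:.3f}, recomputed {value:.4f}")
```

The reported adjusted R-squared rises from 0.945 to 0.948 as the first three predictors are dropped, then falls back to 0.945 once the last one (p = 0.060) is removed, so that final elimination is arguably a borderline call.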