# Backward Elimination Using the 50 Startups Dataset

## What is Backward Elimination?
Backward elimination is a feature selection technique used while building a machine learning model. It removes the features that do not have a significant effect on the dependent variable (the predicted output). There are several ways to build a model in machine learning:

1. All-in
2. Backward Elimination
3. Forward Selection
4. Bidirectional Elimination
5. Score Comparison

These are the possible methods for building a model in machine learning, but here we will use only the Backward Elimination process, as it is the fastest method.

## Steps of Backward Elimination
The main steps of the backward elimination process are:

**Step-1**: First, select a significance level (SL) that a predictor must meet to stay in the model, for example SL = 0.05.

**Step-2**: Fit the complete model with all possible predictors/independent variables.

**Step-3**: Choose the predictor with the highest p-value.

- If p-value > SL, go to Step 4.
- Otherwise, finish: the model is ready.

**Step-4**: Remove that predictor.

**Step-5**: Rebuild and fit the model with the remaining variables, then return to Step 3.

## Need for Backward Elimination: An optimal Multiple Linear Regression model
A model that includes all the independent variables is not necessarily optimal, because we do not know which features affect the prediction the most and which affect it the least.

Unnecessary features increase the complexity of the model. It is therefore better to keep only the most significant features, so that the model stays simple and gives better results.

So, to optimize the performance of the MLR model, we will use the Backward Elimination method: it keeps the most significant features and removes the least significant ones. Let's apply it to our MLR model.
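
One way to see why extra features are not free is the adjusted R², which statsmodels reports as `Adj. R-squared` in its summary tables. A minimal sketch, assuming the standard formula; the helper name `adjusted_r2` is hypothetical and the numbers are only illustrative:

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Same R^2 and sample size, fewer predictors: the adjusted value rises.
five_predictors = adjusted_r2(0.951, 50, 5)
three_predictors = adjusted_r2(0.951, 50, 3)
print(five_predictors, three_predictors)
```

Because of the penalty on the number of predictors, dropping a feature that barely changes R² can raise the adjusted value.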


## Steps for Backward Elimination method:
First we will build the model with all the independent features included.

In [1]:
# Importing the required libraries

import numpy as np
import pandas as pd

In [51]:
# Importing the dataset

dataset = pd.read_csv('C:/Users/podug/Desktop/Datahill/Datasets/50_Startups.csv')
dataset.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [52]:
# Extracting Independent and Dependent features

X = dataset.iloc[:, :-1]
y = dataset.iloc[:, 4:5]

In [53]:
X.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State
0,165349.2,136897.8,471784.1,New York
1,162597.7,151377.59,443898.53,California
2,153441.51,101145.55,407934.54,Florida
3,144372.41,118671.85,383199.62,New York
4,142107.34,91391.77,366168.42,Florida


In [9]:
y.head()

Unnamed: 0,Profit
0,192261.83
1,191792.06
2,191050.39
3,182901.99
4,166187.94


In [54]:
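# Converting the categorical feature into dummy columns
# (drop_first=True drops the first category, California, to avoid the dummy variable trap;
# dtype=int keeps the 0/1 output shown below on pandas versions that default to booleans)

states = pd.get_dummies(X['State'], drop_first=True, dtype=int)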
states.head()

Unnamed: 0,Florida,New York
0,0,1
1,0,0
2,1,0
3,0,1
4,1,0
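
`drop_first=True` drops the first category in sorted order (California here), which becomes the baseline that the remaining dummy columns are measured against. A small sketch on a hypothetical three-row series; `dtype=int` is passed only so that the output is 0/1 on pandas versions that default to booleans:

```python
import pandas as pd

# Hypothetical three-row sample covering each state once
sample = pd.Series(["New York", "California", "Florida"], name="State")

dummies = pd.get_dummies(sample, drop_first=True, dtype=int)
print(dummies)
# California has no column of its own: it is the row with Florida=0, New York=0
```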


In [55]:
# Dropping the 'State' column 

X = X.drop('State', axis=1)
X.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend
0,165349.2,136897.8,471784.1
1,162597.7,151377.59,443898.53
2,153441.51,101145.55,407934.54
3,144372.41,118671.85,383199.62
4,142107.34,91391.77,366168.42


In [56]:
# Adding dummy categorical columns

X = pd.concat([X, states], axis=1)
X.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,Florida,New York
0,165349.2,136897.8,471784.1,0,1
1,162597.7,151377.59,443898.53,0,0
2,153441.51,101145.55,407934.54,1,0
3,144372.41,118671.85,383199.62,0,1
4,142107.34,91391.77,366168.42,1,0


In [66]:
# Splitting the dataset into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((40, 5), (10, 5), (40, 1), (10, 1))

In [67]:
# Fitting Multiple Linear Regression to the training set

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()

In [68]:
# Predicting the test set result

y_pred = regressor.predict(X_test)

In [69]:
# Checking the score

print('Train Score: ',regressor.score(X_train, y_train))
print('Test Score: ',regressor.score(X_test, y_test))

Train Score:  0.9537019995248526
Test Score:  0.8987266414328636


In [88]:
# The difference between the train and test scores

score_diff = regressor.score(X_train, y_train) - regressor.score(X_test, y_test)
score_diff

0.054975358091988946

In [70]:
# Calculating r2score

from sklearn.metrics import r2_score

r2score = r2_score(y_test, y_pred)
r2score

0.8987266414328636
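
This value equals the test score above because `LinearRegression.score` returns the coefficient of determination R². A minimal sketch on synthetic data (the arrays and names here are made up for illustration) that also computes R² by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: y depends linearly on one feature plus noise
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(30, 1))
y_demo = 4.0 + 2.5 * X_demo[:, 0] + rng.normal(scale=1.0, size=30)

model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(model.score(X_demo, y_demo), r2_score(y_demo, y_hat), r2_manual)
```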

The difference between the train and test scores is about 0.055.

**Note**: We will use these scores as a baseline to estimate the effect of removing features with the Backward Elimination process.

### Step 1: Preparation of Backward Elimination:
**Importing the library**: First, we need to import the **statsmodels.api** module, which is used to estimate various statistical models such as OLS (Ordinary Least Squares) and to add the constant feature.

In [19]:
import statsmodels.api as sm

**Adding a column to the matrix of features**: The MLR equation, y = b0 + b1x1 + ... + bnxn, contains a constant term b0, but this term is not present in our matrix of features, so we need to add it manually. We will add a column with values x0 = 1 associated with the constant term b0.

To add it, we will use the **add_constant** function of the **statsmodels** library, which adds a column named const with every value set to 1.

In [71]:
X = sm.add_constant(X)
X.head()

Unnamed: 0,const,R&D Spend,Administration,Marketing Spend,Florida,New York
0,1.0,165349.2,136897.8,471784.1,0,1
1,1.0,162597.7,151377.59,443898.53,0,0
2,1.0,153441.51,101145.55,407934.54,1,0
3,1.0,144372.41,118671.85,383199.62,0,1
4,1.0,142107.34,91391.77,366168.42,1,0


### Step-2: Fit the complete model with all possible predictors/independent variables.
Now we will actually apply the backward elimination process. First we create a new feature matrix x_opt, which will end up containing only the independent features that significantly affect the dependent variable. It starts out with all the features.

Next, as per the Backward Elimination process, we choose a significance level (0.05) and fit the model with all possible predictors. To fit the model, we create an opt_regressor object from the OLS class of the statsmodels library and fit it with the fit() method.

We then need the p-values to compare with the SL, so we call the summary() method to get the summary table with all the values.

In [72]:
x_opt = X
x_opt.head()

Unnamed: 0,const,R&D Spend,Administration,Marketing Spend,Florida,New York
0,1.0,165349.2,136897.8,471784.1,0,1
1,1.0,162597.7,151377.59,443898.53,0,0
2,1.0,153441.51,101145.55,407934.54,1,0
3,1.0,144372.41,118671.85,383199.62,0,1
4,1.0,142107.34,91391.77,366168.42,1,0


In [73]:
# Fit an OLS (Ordinary Least Squares) model with the intercept column added

opt_regressor = sm.OLS(endog = y, exog = x_opt).fit()
opt_regressor.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.945
Method:,Least Squares,F-statistic:,169.9
Date:,"Wed, 20 Jan 2021",Prob (F-statistic):,1.34e-27
Time:,08:40:21,Log-Likelihood:,-525.38
No. Observations:,50,AIC:,1063.0
Df Residuals:,44,BIC:,1074.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.013e+04,6884.820,7.281,0.000,3.62e+04,6.4e+04
R&D Spend,0.8060,0.046,17.369,0.000,0.712,0.900
Administration,-0.0270,0.052,-0.517,0.608,-0.132,0.078
Marketing Spend,0.0270,0.017,1.574,0.123,-0.008,0.062
Florida,198.7888,3371.007,0.059,0.953,-6595.030,6992.607
New York,-41.8870,3256.039,-0.013,0.990,-6604.003,6520.229

0,1,2,3
Omnibus:,14.782,Durbin-Watson:,1.283
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.266
Skew:,-0.948,Prob(JB):,2.41e-05
Kurtosis:,5.572,Cond. No.,1450000.0


### Step-3: Choose the predictor which has the highest P-value
Observe the p-values of the features. The highest p-value is 0.990, for New York. Since it is greater than the SL (0.05), we will remove the New York variable (a dummy variable) from the feature matrix and refit the model.
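
The same choice can be made programmatically. A minimal sketch that copies the p-values from the summary table above into a pandas Series (with a fitted statsmodels result they would come from its `pvalues` attribute); the variable names are illustrative:

```python
import pandas as pd

SL = 0.05

# p-values copied from the P>|t| column of the summary table
pvalues = pd.Series({
    "const": 0.000,
    "R&D Spend": 0.000,
    "Administration": 0.608,
    "Marketing Spend": 0.123,
    "Florida": 0.953,
    "New York": 0.990,
})

candidates = pvalues.drop("const")   # the constant is never removed
worst = candidates.idxmax()
if candidates[worst] > SL:
    print(f"Remove {worst!r} (p = {candidates[worst]:.3f})")
```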

In [74]:
x_opt = X.iloc[:, :-1]
x_opt.head()

Unnamed: 0,const,R&D Spend,Administration,Marketing Spend,Florida
0,1.0,165349.2,136897.8,471784.1,0
1,1.0,162597.7,151377.59,443898.53,0
2,1.0,153441.51,101145.55,407934.54,1
3,1.0,144372.41,118671.85,383199.62,0
4,1.0,142107.34,91391.77,366168.42,1


In [75]:
opt_regressor = sm.OLS(endog = y, exog = x_opt).fit()
opt_regressor.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.946
Method:,Least Squares,F-statistic:,217.2
Date:,"Wed, 20 Jan 2021",Prob (F-statistic):,8.49e-29
Time:,08:40:57,Log-Likelihood:,-525.38
No. Observations:,50,AIC:,1061.0
Df Residuals:,45,BIC:,1070.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.011e+04,6647.870,7.537,0.000,3.67e+04,6.35e+04
R&D Spend,0.8060,0.046,17.606,0.000,0.714,0.898
Administration,-0.0270,0.052,-0.523,0.604,-0.131,0.077
Marketing Spend,0.0270,0.017,1.592,0.118,-0.007,0.061
Florida,220.1585,2900.536,0.076,0.940,-5621.821,6062.138

0,1,2,3
Omnibus:,14.758,Durbin-Watson:,1.282
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.172
Skew:,-0.948,Prob(JB):,2.53e-05
Kurtosis:,5.563,Cond. No.,1400000.0


In the output above, the highest p-value is now 0.940, for the Florida variable, which is the other dummy variable. It is greater than 0.05, so we will remove it and refit the model.

In [76]:
x_opt = x_opt.iloc[:, :-1]
x_opt.head()

Unnamed: 0,const,R&D Spend,Administration,Marketing Spend
0,1.0,165349.2,136897.8,471784.1
1,1.0,162597.7,151377.59,443898.53
2,1.0,153441.51,101145.55,407934.54
3,1.0,144372.41,118671.85,383199.62
4,1.0,142107.34,91391.77,366168.42


In [77]:
opt_regressor = sm.OLS(endog = y, exog = x_opt).fit()
opt_regressor.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.948
Method:,Least Squares,F-statistic:,296.0
Date:,"Wed, 20 Jan 2021",Prob (F-statistic):,4.53e-30
Time:,08:41:10,Log-Likelihood:,-525.39
No. Observations:,50,AIC:,1059.0
Df Residuals:,46,BIC:,1066.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.012e+04,6572.353,7.626,0.000,3.69e+04,6.34e+04
R&D Spend,0.8057,0.045,17.846,0.000,0.715,0.897
Administration,-0.0268,0.051,-0.526,0.602,-0.130,0.076
Marketing Spend,0.0272,0.016,1.655,0.105,-0.006,0.060

0,1,2,3
Omnibus:,14.838,Durbin-Watson:,1.282
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.442
Skew:,-0.949,Prob(JB):,2.21e-05
Kurtosis:,5.586,Cond. No.,1400000.0


The next highest p-value is 0.602, for Administration, which is still greater than 0.05. So we will remove the Administration column and refit the model.

In [78]:
x_opt = x_opt.iloc[:, [0, 1, 3]]
x_opt.head()

Unnamed: 0,const,R&D Spend,Marketing Spend
0,1.0,165349.2,471784.1
1,1.0,162597.7,443898.53
2,1.0,153441.51,407934.54
3,1.0,144372.41,383199.62
4,1.0,142107.34,366168.42


In [79]:
opt_regressor = sm.OLS(endog = y, exog = x_opt).fit()
opt_regressor.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.950
Model:,OLS,Adj. R-squared:,0.948
Method:,Least Squares,F-statistic:,450.8
Date:,"Wed, 20 Jan 2021",Prob (F-statistic):,2.16e-31
Time:,08:41:23,Log-Likelihood:,-525.54
No. Observations:,50,AIC:,1057.0
Df Residuals:,47,BIC:,1063.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,4.698e+04,2689.933,17.464,0.000,4.16e+04,5.24e+04
R&D Spend,0.7966,0.041,19.266,0.000,0.713,0.880
Marketing Spend,0.0299,0.016,1.927,0.060,-0.001,0.061

0,1,2,3
Omnibus:,14.677,Durbin-Watson:,1.257
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.161
Skew:,-0.939,Prob(JB):,2.54e-05
Kurtosis:,5.575,Cond. No.,532000.0


As the output above shows, the Administration variable has been removed. One variable with a high p-value remains: Marketing Spend, with a p-value of 0.060, which is still above the significance level of 0.05. So, finally, we will remove it as well.

In [80]:
x_opt = x_opt.iloc[:, :-1]
x_opt.head()

Unnamed: 0,const,R&D Spend
0,1.0,165349.2
1,1.0,162597.7
2,1.0,153441.51
3,1.0,144372.41
4,1.0,142107.34


In [81]:
opt_regressor = sm.OLS(endog = y, exog = x_opt).fit()
opt_regressor.summary()

0,1,2,3
Dep. Variable:,Profit,R-squared:,0.947
Model:,OLS,Adj. R-squared:,0.945
Method:,Least Squares,F-statistic:,849.8
Date:,"Wed, 20 Jan 2021",Prob (F-statistic):,3.50e-32
Time:,08:41:32,Log-Likelihood:,-527.44
No. Observations:,50,AIC:,1059.0
Df Residuals:,48,BIC:,1063.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,4.903e+04,2537.897,19.320,0.000,4.39e+04,5.41e+04
R&D Spend,0.8543,0.029,29.151,0.000,0.795,0.913

0,1,2,3
Omnibus:,13.727,Durbin-Watson:,1.116
Prob(Omnibus):,0.001,Jarque-Bera (JB):,18.536
Skew:,-0.911,Prob(JB):,9.44e-05
Kurtosis:,5.361,Cond. No.,165000.0


As the output above shows, only two columns are left: the constant and R&D Spend. So R&D Spend is the only independent variable that is significant for the prediction, and we can now predict using this variable alone.
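
With the coefficients from the last summary table, the fitted model is approximately Profit = 49030 + 0.8543 × R&D Spend. A minimal sketch of using it; the function name `predict_profit` is hypothetical, and the coefficients are the rounded values shown in the table, so the results are approximate:

```python
# Rounded coefficients read from the final OLS summary table
INTERCEPT = 4.903e4   # const
SLOPE = 0.8543        # R&D Spend


def predict_profit(rd_spend):
    """Approximate profit from the one-feature model."""
    return INTERCEPT + SLOPE * rd_spend


# First row of the dataset: R&D Spend = 165349.2, actual Profit = 192261.83
predicted = predict_profit(165349.2)
print(round(predicted, 2))
```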

## Estimating the performance:
Previously, we calculated the train and test scores of the model using all the feature variables. Now we will check the scores with only one feature variable (R&D Spend). Our independent and dependent datasets now look like this:

In [82]:
X = x_opt.iloc[:,1:2]
X.head()

Unnamed: 0,R&D Spend
0,165349.2
1,162597.7
2,153441.51
3,144372.41
4,142107.34


In [83]:
y.head()

Unnamed: 0,Profit
0,192261.83
1,191792.06
2,191050.39
3,182901.99
4,166187.94


## Building the Linear Regression model with the new datasets:

In [84]:
# Splitting the dataset into training and test datasets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state =42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((40, 1), (10, 1), (40, 1), (10, 1))

In [85]:
# Fitting the linear regression model (now with a single feature) to the training set

regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

In [86]:
# Checking the score

train_score = regressor.score(X_train, y_train)
test_score = regressor.score(X_test, y_test)
print(f'Train score: {train_score}, Test score: {test_score}')

Train score: 0.9467864227524652, Test score: 0.9265108109341951


In [87]:
# The difference between the two scores

score_diff = train_score - test_score
score_diff

0.02027561181827009

The train/test score differences of the previous model, which used all the features, and the present model, which uses only one feature, are:

- score difference of the previous model = 0.055
- score difference of the present model = 0.020

The smaller gap between the train and test scores suggests that the present model generalizes better when predicting the profit.


In [89]:
# Checking the r2score

from sklearn.metrics import r2_score

r2score = r2_score(y_test, y_pred)
r2score

0.9265108109341951

We can also compare the r2score of both models.

Previous model r2score = 0.8987266414328636

Present model r2score = 0.9265108109341951

We got this result using only one independent variable (R&D Spend) instead of four. Hence our model is now simpler and, on this test split, more accurate.