<a href="https://colab.research.google.com/github/r4phael/ml-course/blob/master/notebooks/4_propagation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Importando Libs e o Dataset

In [0]:
import pandas as pd
import random
import numpy as np
from sklearn.preprocessing import StandardScaler 
from sklearn.model_selection import train_test_split

Definindo função de escalonamento e importando dataset:

In [0]:
def feature_scaling(data):
    sc = StandardScaler()
    return sc.fit_transform(data)

df = pd.read_csv('https://raw.githubusercontent.com/r4phael/ml-course/master/data/pricing_houses.csv')

##  Forward Elimination

### Etapa de Seleção e Tratamento dos Dados

Selecionando as principais features do dataset:

In [0]:
df = df.loc[:, ['LotArea', 'PoolArea', 'GarageArea', 'YearBuilt', 'SalePrice']]

df.head(5)

Unnamed: 0,LotArea,PoolArea,GarageArea,YearBuilt,SalePrice
0,8450,0,548,2003,208500
1,9600,0,460,1976,181500
2,11250,0,608,2001,223500
3,9550,0,642,1915,140000
4,14260,0,836,2000,250000


Definindo as variáveis indepedentes e dependentes, normalição das features e dividisão do dataset em conjunto de treinamento e testes:

In [0]:
X = df[df.columns[~df.columns.isin(['SalePrice'])]].values
y = df['SalePrice'].values.reshape(-1,1)

# Normalização das features:
X = feature_scaling(X)

# Dividindo os dados
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

Importando e treinando o modelo de Regressao Linear Multipla com o Conjunto de Treinamento:

In [0]:
# Importando o modelo
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Avaliando o modelo com a métrica r2

In [0]:
regressor.score(X_test, y_test)

0.4759953905083013

### Realizando o processo de Forward Elimination


No processo de Forward Elimination, iremos selecionar as features incrementalmente uma por uma e analisamos se a mesma contribui para a melhoria do modelo. Porsteriormente, treinamos o modelo com a package OLS que realiza o processo de cálculo dos coeficientes para analise:

Selecionando a primeira feature: *LotArea*

In [70]:
df_forward = df.loc[:,['LotArea', 'SalePrice']]

df_forward.head(5)

Unnamed: 0,LotArea,SalePrice
0,8450,208500
1,9600,181500
2,11250,223500
3,9550,140000
4,14260,250000


Realizando o processo de forward elimination. Primeiro, será inserido uma coluna preenchida com valores 1 no começo da matriz de variáveis. Isso é realizada para que sejam feito os calculos necessários:

In [73]:
X = np.append(arr = np.ones((1460,1)).astype(int), values = df_forward, axis =1)

X[1:5,:]


array([[     1,   9600, 181500],
       [     1,  11250, 223500],
       [     1,   9550, 140000],
       [     1,  14260, 250000]])

Após isso, será utilizada a package OLS para calculo de importancia das features no output do modelo:

In [77]:
# Importando a package.
import statsmodels.regression.linear_model as sm

X_opt = X[:, [0,1]]
regressor_ols = sm.OLS(endog = y, exog = X[:, [0,1]]).fit()
regressor_ols.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.07
Model:,OLS,Adj. R-squared:,0.069
Method:,Least Squares,F-statistic:,109.1
Date:,"Fri, 08 Nov 2019",Prob (F-statistic):,1.12e-24
Time:,15:38:08,Log-Likelihood:,-18491.0
No. Observations:,1460,AIC:,36990.0
Df Residuals:,1458,BIC:,37000.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.588e+05,2914.717,54.495,0.000,1.53e+05,1.65e+05
x1,2.1000,0.201,10.445,0.000,1.706,2.494

0,1,2,3
Omnibus:,587.66,Durbin-Watson:,1.998
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3374.003
Skew:,1.788,Prob(JB):,0.0
Kurtosis:,9.532,Cond. No.,21100.0


Analisando os valores acima, vimos que as features X1(*LotArea* - Área do lote) possui um P-value significativo, ou seja, dentro do level de signifcância definida (SL = .05) .Portanto, deixamos a mesma e escolhemos outra feature para incrementar no modelo conforme o processo de Forward Elimination.  

**Obs: Definimos um level de significância de .05 para que as features permaneçam no modelo (SL = .05).**

In [0]:
#Analisando todas as features, exceto a feature X2 (Índice 1)

X_opt = X_train[:, [0,2,3,4]]
regressor_ols = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_ols.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.46
Model:,OLS,Adj. R-squared:,0.459
Method:,Least Squares,F-statistic:,330.7
Date:,"Fri, 08 Nov 2019",Prob (F-statistic):,2.9e-155
Time:,15:10:03,Log-Likelihood:,-14443.0
No. Observations:,1168,AIC:,28890.0
Df Residuals:,1164,BIC:,28910.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.811e+05,1663.777,108.821,0.000,1.78e+05,1.84e+05
x1,5649.3615,1610.913,3.507,0.000,2488.743,8809.980
x2,3.772e+04,1912.254,19.727,0.000,3.4e+04,4.15e+04
x3,2.214e+04,1853.006,11.949,0.000,1.85e+04,2.58e+04

0,1,2,3
Omnibus:,441.037,Durbin-Watson:,1.983
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2831.186
Skew:,1.602,Prob(JB):,0.0
Kurtosis:,9.922,Cond. No.,1.67


Adicionando a segunda feature no dataframe

In [83]:
df_forward = pd.concat([df_forward, df['GarageArea']], axis=1)

df_forward.head(5)

Unnamed: 0,LotArea,SalePrice,GarageArea
0,8450,208500,548
1,9600,181500,460
2,11250,223500,608
3,9550,140000,642
4,14260,250000,836


Calculando os coeficientes com a segunda features:

In [85]:
# Adicionando 1 na primeira coluna da matriz de features 
X = np.append(arr = np.ones((1460,1)).astype(int), values = df_forward, axis =1)

# Selecionando apenas as features de indice 0-const, 1-SalesPrice, 3-GarageArea
X_opt = X[:, [0,1,3]]
regressor_ols = sm.OLS(endog = y, exog = X_opt).fit()
regressor_ols.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.412
Model:,OLS,Adj. R-squared:,0.412
Method:,Least Squares,F-statistic:,511.2
Date:,"Fri, 08 Nov 2019",Prob (F-statistic):,6.340000000000001e-169
Time:,15:50:13,Log-Likelihood:,-18156.0
No. Observations:,1460,AIC:,36320.0
Df Residuals:,1457,BIC:,36330.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,6.322e+04,4015.973,15.742,0.000,5.53e+04,7.11e+04
x1,1.2453,0.163,7.663,0.000,0.927,1.564
x2,221.1574,7.587,29.151,0.000,206.276,236.039

0,1,2,3
Omnibus:,544.83,Durbin-Watson:,2.023
Prob(Omnibus):,0.0,Jarque-Bera (JB):,5456.748
Skew:,1.446,Prob(JB):,0.0
Kurtosis:,12.018,Cond. No.,36500.0


**Final:** Todas as features acima estão dentro dentro intervalo de significância do modelo (SL = .05). Portanto, o processo continua de maneira incremental até que o modelo não seja improvisado com a adição de novas features.

Calculando o score do modelo:

In [87]:
regressor = LinearRegression()
regressor.fit(X_opt, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Analisando o novo score do modelo com a métrica r2

In [90]:
regressor.score(X_opt, y)

0.41235186539832547

##  Backward Elimination

### Etapa de Seleção e Tratamento dos Dados

Selecionando as principais features do dataset:

In [52]:
df = df.loc[:, ['LotArea', 'PoolArea', 'GarageArea', 'YearBuilt', 'SalePrice']]

df.head(5)

Unnamed: 0,LotArea,PoolArea,GarageArea,YearBuilt,SalePrice
0,8450,0,548,2003,208500
1,9600,0,460,1976,181500
2,11250,0,608,2001,223500
3,9550,0,642,1915,140000
4,14260,0,836,2000,250000


Definindo as variáveis indepedentes e dependentes, normalição das features e dividisão do dataset em conjunto de treinamento e testes:

In [0]:
X = df[df.columns[~df.columns.isin(['SalePrice'])]].values
y = df['SalePrice'].values.reshape(-1,1)

# Normalização das features:
X = feature_scaling(X)

# Dividindo os dados
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

Importando e treinando o modelo de Regressao Linear Multipla com o Conjunto de Treinamento:

In [54]:
# Importando o modelo
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Avaliando o modelo com a métrica r2

In [56]:
regressor.score(X_test, y_test)

0.4759953905083013

### Realizando o processo de Backward Elimination


Realizando o processo de backward elimination. Primeiro, será inserido uma coluna preenchida com valores 1 no começo da matriz de variáveis. Isso é realizada para que sejam feito os calculos necessários:

In [59]:
X_train = np.append(arr = np.ones((1168,1)).astype(int), values = X_train, axis =1)

X_train[2,:]


array([ 1.        , -0.1743691 , -0.06869175, -2.21296298, -2.02923537])

Selecionando as variáveis do conjunto de treinamento e treinando o modelo com a package OLS que realiza o processo de cálculo dos coeficientes para analise:

In [61]:
# Importando a package.
import statsmodels.regression.linear_model as sm

X_opt = X_train[:, [0,1,2,3,4]]
regressor_ols = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_ols.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.49
Model:,OLS,Adj. R-squared:,0.488
Method:,Least Squares,F-statistic:,279.4
Date:,"Fri, 08 Nov 2019",Prob (F-statistic):,2.31e-168
Time:,15:09:52,Log-Likelihood:,-14409.0
No. Observations:,1168,AIC:,28830.0
Df Residuals:,1163,BIC:,28850.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.809e+05,1617.751,111.820,0.000,1.78e+05,1.84e+05
x1,1.268e+04,1533.794,8.267,0.000,9670.830,1.57e+04
x2,4729.7608,1570.186,3.012,0.003,1649.047,7810.475
x3,3.476e+04,1893.434,18.359,0.000,3.1e+04,3.85e+04
x4,2.331e+04,1807.127,12.897,0.000,1.98e+04,2.69e+04

0,1,2,3
Omnibus:,435.926,Durbin-Watson:,2.015
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3360.331
Skew:,1.517,Prob(JB):,0.0
Kurtosis:,10.736,Cond. No.,1.75


Analisando os valores acima, vimos que a feature X2 (*PoolArea* - Área da piscina) possui o menor grau de importância, ou seja, o maior P-value, já que possui um p-value de .008, equanto as outras features possui um valor abaixo de .000 .Portanto, removemos a mesma e reiniciamos o processo conforme o algoritmo de Backward Elimination.  

**Obs: Definimos um level de significância de .05 para que as features permaneçam no modelo (SL = .05).**

In [62]:
#Analisando todas as features, exceto a feature X2 (Índice 1)

X_opt = X_train[:, [0,2,3,4]]
regressor_ols = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_ols.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.46
Model:,OLS,Adj. R-squared:,0.459
Method:,Least Squares,F-statistic:,330.7
Date:,"Fri, 08 Nov 2019",Prob (F-statistic):,2.9e-155
Time:,15:10:03,Log-Likelihood:,-14443.0
No. Observations:,1168,AIC:,28890.0
Df Residuals:,1164,BIC:,28910.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.811e+05,1663.777,108.821,0.000,1.78e+05,1.84e+05
x1,5649.3615,1610.913,3.507,0.000,2488.743,8809.980
x2,3.772e+04,1912.254,19.727,0.000,3.4e+04,4.15e+04
x3,2.214e+04,1853.006,11.949,0.000,1.85e+04,2.58e+04

0,1,2,3
Omnibus:,441.037,Durbin-Watson:,1.983
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2831.186
Skew:,1.602,Prob(JB):,0.0
Kurtosis:,9.922,Cond. No.,1.67


**Final:** Todas as features acima estão dentro dentro intervalo de significância do modelo (SL = .05). Portanto, o processo é finalizado e seguimos para o treinamento do modelo.

In [63]:
regressor = LinearRegression()
regressor.fit(X_opt, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Analisando o novo score do modelo com a métrica r2

In [64]:
regressor.score(X_test, y_test)

0.4484569043113157