# Variable Selection Process Using R²

### Importing libraries and functions:

Importing libraries

In [0]:
import pandas as pd
import random
import numpy as np
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

Defining helper functions

In [0]:
# Scaling helper: standardizes each column to zero mean and unit variance
def feature_scaling(data):
    sc = StandardScaler()
    return sc.fit_transform(data)

### Data exploration and preparation

Loading the dataset used in this study. The goal of the regression models is to predict house prices from characteristics such as location, area, etc.

Source: [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)

In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/r4phael/ml-course/master/data/pricing_houses.csv')


Listing all the columns of the dataset:

In [4]:
df.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [0]:
# Selecting a few features to make the problem easier to visualize
df = df.loc[:, ['LotArea', 'PoolArea', 'GarageArea', 'OverallCond','YearBuilt', 'YrSold', 'Fireplaces',
                'SalePrice']]

Describing the dataset

In [7]:
df.describe()

Unnamed: 0,LotArea,PoolArea,GarageArea,OverallCond,YearBuilt,YrSold,Fireplaces,SalePrice
count,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,10516.828082,2.758904,472.980137,5.575342,1971.267808,2007.815753,0.613014,180921.19589
std,9981.264932,40.177307,213.804841,1.112799,30.202904,1.328095,0.644666,79442.502883
min,1300.0,0.0,0.0,1.0,1872.0,2006.0,0.0,34900.0
25%,7553.5,0.0,334.5,5.0,1954.0,2007.0,0.0,129975.0
50%,9478.5,0.0,480.0,5.0,1973.0,2008.0,1.0,163000.0
75%,11601.5,0.0,576.0,6.0,2000.0,2009.0,1.0,214000.0
max,215245.0,738.0,1418.0,9.0,2010.0,2010.0,3.0,755000.0


Viewing the dataset

In [8]:
df.head(5)

Unnamed: 0,LotArea,PoolArea,GarageArea,OverallCond,YearBuilt,YrSold,Fireplaces,SalePrice
0,8450,0,548,5,2003,2008,0,208500
1,9600,0,460,8,1976,2007,1,181500
2,11250,0,608,5,2001,2008,1,223500
3,9550,0,642,5,1915,2006,1,140000
4,14260,0,836,5,2000,2008,1,250000


## Forward Elimination


### Data Selection and Preparation

In Forward Elimination (more commonly called forward selection), we add features to the model one at a time and check whether each one improves it. At each step, we fit the model with the statsmodels OLS class, which reports the R² and adjusted R² values used in the analysis:

Viewing the main features of the dataset:

In [9]:
df.head(5)

Unnamed: 0,LotArea,PoolArea,GarageArea,OverallCond,YearBuilt,YrSold,Fireplaces,SalePrice
0,8450,0,548,5,2003,2008,0,208500
1,9600,0,460,8,1976,2007,1,181500
2,11250,0,608,5,2001,2008,1,223500
3,9550,0,642,5,1915,2006,1,140000
4,14260,0,836,5,2000,2008,1,250000


Defining the independent and dependent variables and standardizing the features:

In [0]:
X = df[df.columns[~df.columns.isin(['SalePrice'])]].values
y = df['SalePrice'].values.reshape(-1,1)

# Standardizing the features:
X = feature_scaling(X)
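
The cell above standardizes the full dataset before it is split. A common alternative is to estimate the mean and standard deviation on the training rows only and reuse them for the test rows. The following is a minimal NumPy sketch of that idea, not the notebook's method; `standardize_fit` and `standardize_apply` are hypothetical helper names, and the toy values are the first rows of *LotArea* and *GarageArea* shown earlier.

```python
import numpy as np

def standardize_fit(X_train):
    """Return per-column mean and population standard deviation (ddof=0)."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return mean, std

def standardize_apply(X, mean, std):
    """Apply previously estimated statistics to any set of rows."""
    return (X - mean) / std

# Toy data: LotArea and GarageArea for the first four houses
X_demo = np.array([[8450.0, 548.0],
                   [9600.0, 460.0],
                   [11250.0, 608.0],
                   [9550.0, 642.0]])
X_tr, X_te = X_demo[:3], X_demo[3:]   # hypothetical train/test split

mean, std = standardize_fit(X_tr)
X_tr_scaled = standardize_apply(X_tr, mean, std)
X_te_scaled = standardize_apply(X_te, mean, std)  # reuses training statistics
```

With this layout the test rows never influence the scaling parameters, which keeps the test score closer to what the model would see on new data.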

### Running the Forward Elimination Process


Running the forward elimination process. First, a column filled with 1s is inserted at the start of the feature matrix. The statsmodels OLS class does not add an intercept on its own, so this column lets it estimate the constant term. Next, the dataset is split into training and test sets:

In [11]:
X = np.append(arr = np.ones((X.shape[0], 1)), values = X, axis = 1)

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

X_train[1:5,:2]

array([[ 1.        , -0.26857781],
       [ 1.        , -0.1743691 ],
       [ 1.        , -0.33241925],
       [ 1.        , -0.55290771]])
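
The intercept column can also be built without hard-coding the number of rows. This is a small sketch using plain NumPy; `add_intercept` is a hypothetical helper name and the demo values are made up for illustration.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones so a linear model can estimate a constant term."""
    n_rows = X.shape[0]
    return np.column_stack([np.ones(n_rows), X])

# Made-up standardized values, two features and three rows
X_demo = np.array([[-0.27, 0.10],
                   [-0.17, -0.30],
                   [-0.33, 0.50]])
X_with_const = add_intercept(X_demo)
```

Column 0 of the result is the constant, so the original features shift to indices 1 and up, which is the indexing used in the cells that follow.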

Next, the OLS class is used to measure how much the features contribute to the model output, starting with only the first feature, *LotArea*:

In [13]:
# Importing the package.
import statsmodels.regression.linear_model as sm

X_opt = X_train[:, [0,1]]
regressor_ols = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_ols.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.071
Model:,OLS,Adj. R-squared:,0.07
Method:,Least Squares,F-statistic:,88.93
Date:,"Sat, 30 Nov 2019",Prob (F-statistic):,2.13e-20
Time:,13:27:50,Log-Likelihood:,-14760.0
No. Observations:,1168,AIC:,29520.0
Df Residuals:,1166,BIC:,29530.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.811e+05,2180.395,83.063,0.000,1.77e+05,1.85e+05
x1,1.907e+04,2022.624,9.430,0.000,1.51e+04,2.3e+04

0,1,2,3
Omnibus:,425.777,Durbin-Watson:,2.049
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1973.518
Skew:,1.657,Prob(JB):,0.0
Kurtosis:,8.437,Cond. No.,1.08


From the values above, feature X1 (*LotArea*, the lot area) alone yields an *Adj. R-squared* of 0.070. It is still too early to judge whether that is enough, so we keep it and pick another feature to add to the model, following the Forward Elimination process.

**Note: in this notebook we use *Adj. R-squared* as the selection criterion. A feature is kept when adding it raises *Adj. R-squared* (moves it closer to 1), since that suggests it has a positive impact on the model.**
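
Adjusted R² corrects plain R² for the number of features: with n observations and p features (not counting the constant), it equals 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below applies this formula to the rounded values of the first summary above (R² = 0.071, 1168 observations, 1 feature); `adjusted_r2` is a hypothetical helper name.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R² from plain R², sample size, and feature count (constant excluded)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Values read from the first OLS summary: R² = 0.071, n = 1168, p = 1
one_feature = adjusted_r2(0.071, 1168, 1)   # rounds to 0.070, as in the summary

# Same R² with more features gives a lower adjusted value
same_fit_more_features = adjusted_r2(0.071, 1168, 5)
```

Because the penalty grows with p, a feature that leaves R² practically unchanged lowers the adjusted value, which is the signal this notebook uses to reject a feature.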

Computing the coefficients after adding the second feature, pool area (*PoolArea*):

In [0]:
# Selecting only the columns at index 0-const, 1-LotArea, 2-PoolArea
X_opt = X_train[:, [0,1,2]]
regressor_ols = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_ols.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.08
Model:,OLS,Adj. R-squared:,0.078
Method:,Least Squares,F-statistic:,50.31
Date:,"Fri, 29 Nov 2019",Prob (F-statistic):,1.1e-21
Time:,00:44:06,Log-Likelihood:,-14754.0
No. Observations:,1168,AIC:,29510.0
Df Residuals:,1165,BIC:,29530.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.811e+05,2171.175,83.405,0.000,1.77e+05,1.85e+05
x1,1.85e+04,2021.631,9.149,0.000,1.45e+04,2.25e+04
x2,6952.4530,2102.250,3.307,0.001,2827.833,1.11e+04

0,1,2,3
Omnibus:,375.808,Durbin-Watson:,2.045
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1345.385
Skew:,1.54,Prob(JB):,7.13e-293
Kurtosis:,7.261,Cond. No.,1.11


As the results above show, adding feature X2 (*PoolArea*, the pool area) raised *Adj. R-squared* slightly, from 0.070 to 0.078. The selection process therefore continues.

Computing the coefficients after adding the third feature, garage area (*GarageArea*):

In [0]:
# Selecting only the columns at index 0-const, 1-LotArea, 2-PoolArea, 3-GarageArea
X_opt = X_train[:, [0,1,2,3]]
regressor_ols = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_ols.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.417
Model:,OLS,Adj. R-squared:,0.416
Method:,Least Squares,F-statistic:,277.7
Date:,"Fri, 29 Nov 2019",Prob (F-statistic):,6.26e-136
Time:,00:44:06,Log-Likelihood:,-14487.0
No. Observations:,1168,AIC:,28980.0
Df Residuals:,1164,BIC:,29000.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.805e+05,1728.549,104.432,0.000,1.77e+05,1.84e+05
x1,1.114e+04,1634.120,6.815,0.000,7930.811,1.43e+04
x2,4089.5066,1677.167,2.438,0.015,798.898,7380.116
x3,4.63e+04,1783.141,25.968,0.000,4.28e+04,4.98e+04

0,1,2,3
Omnibus:,326.351,Durbin-Watson:,2.019
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2200.071
Skew:,1.113,Prob(JB):,0.0
Kurtosis:,9.345,Cond. No.,1.25


As the results above show, adding feature X3 (*GarageArea*, the garage area) raised *Adj. R-squared* considerably, from 0.078 to 0.416. The selection process therefore continues.

Computing the coefficients after adding the 4th feature, overall condition (*OverallCond*):

In [0]:
# Selecting only the columns at index 0-const, 1-LotArea, 2-PoolArea, 3-GarageArea, 4-OverallCond
X_opt = X_train[:, [0,1,2,3,4]]
regressor_ols = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_ols.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.417
Model:,OLS,Adj. R-squared:,0.415
Method:,Least Squares,F-statistic:,208.2
Date:,"Fri, 29 Nov 2019",Prob (F-statistic):,1.06e-134
Time:,00:44:06,Log-Likelihood:,-14487.0
No. Observations:,1168,AIC:,28980.0
Df Residuals:,1163,BIC:,29010.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.805e+05,1729.280,104.384,0.000,1.77e+05,1.84e+05
x1,1.112e+04,1635.325,6.800,0.000,7911.844,1.43e+04
x2,4088.4437,1677.789,2.437,0.015,796.611,7380.276
x3,4.64e+04,1802.016,25.749,0.000,4.29e+04,4.99e+04
x4,652.2132,1741.565,0.374,0.708,-2764.748,4069.175

0,1,2,3
Omnibus:,326.932,Durbin-Watson:,2.02
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2206.916
Skew:,1.114,Prob(JB):,0.0
Kurtosis:,9.354,Cond. No.,1.31


As the results above show, adding feature X4 (*OverallCond*, the overall condition of the house) lowered *Adj. R-squared* slightly, from 0.416 to 0.415, so it does not contribute positively to the regression model. We therefore drop this feature and continue with the remaining ones.

Computing the coefficients after adding the 5th feature, year built (*YearBuilt*):

In [0]:
# Selecting only the columns at index 0-const, 1-LotArea, 2-PoolArea, 3-GarageArea, 5-YearBuilt
X_opt = X_train[:, [0,1,2,3,5]]
regressor_ols = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_ols.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.49
Model:,OLS,Adj. R-squared:,0.488
Method:,Least Squares,F-statistic:,279.4
Date:,"Fri, 29 Nov 2019",Prob (F-statistic):,2.31e-168
Time:,00:44:06,Log-Likelihood:,-14409.0
No. Observations:,1168,AIC:,28830.0
Df Residuals:,1163,BIC:,28850.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.809e+05,1617.751,111.820,0.000,1.78e+05,1.84e+05
x1,1.268e+04,1533.794,8.267,0.000,9670.830,1.57e+04
x2,4729.7608,1570.186,3.012,0.003,1649.047,7810.475
x3,3.476e+04,1893.434,18.359,0.000,3.1e+04,3.85e+04
x4,2.331e+04,1807.127,12.897,0.000,1.98e+04,2.69e+04

0,1,2,3
Omnibus:,435.926,Durbin-Watson:,2.015
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3360.331
Skew:,1.517,Prob(JB):,0.0
Kurtosis:,10.736,Cond. No.,1.75


As the results above show, adding *YearBuilt* (the year the house was built, labeled x4 in this summary because *OverallCond* was dropped) raised *Adj. R-squared* considerably, from 0.416 to 0.488. We therefore keep this feature and continue with the remaining ones.

Computing the coefficients after adding the 6th feature, year sold (*YrSold*):

In [0]:
# Selecting only the columns at index 0-const, 1-LotArea, 2-PoolArea, 3-GarageArea, 5-YearBuilt, 6-YrSold
X_opt = X_train[:, [0,1,2,3,5,6]]
regressor_ols = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_ols.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.49
Model:,OLS,Adj. R-squared:,0.488
Method:,Least Squares,F-statistic:,223.4
Date:,"Fri, 29 Nov 2019",Prob (F-statistic):,3.98e-167
Time:,00:44:07,Log-Likelihood:,-14409.0
No. Observations:,1168,AIC:,28830.0
Df Residuals:,1162,BIC:,28860.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.809e+05,1618.408,111.773,0.000,1.78e+05,1.84e+05
x1,1.268e+04,1534.427,8.263,0.000,9667.786,1.57e+04
x2,4758.9244,1574.959,3.022,0.003,1668.843,7849.006
x3,3.477e+04,1894.485,18.353,0.000,3.11e+04,3.85e+04
x4,2.33e+04,1807.929,12.889,0.000,1.98e+04,2.69e+04
x5,416.2903,1629.323,0.255,0.798,-2780.453,3613.034

0,1,2,3
Omnibus:,435.787,Durbin-Watson:,2.015
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3362.49
Skew:,1.516,Prob(JB):,0.0
Kurtosis:,10.74,Cond. No.,1.75


As shown above, adding feature X5 (*YrSold*, the year of sale) left *Adj. R-squared* unchanged at 0.488. It will therefore not be used in the final model.

Computing the coefficients after adding the last feature, fireplaces (*Fireplaces*):

In [0]:
# Selecting only the columns at index 0-const, 1-LotArea, 2-PoolArea, 3-GarageArea, 5-YearBuilt, 7-Fireplaces
X_opt = X_train[:, [0,1,2,3,5,7]]
regressor_ols = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_ols.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.561
Model:,OLS,Adj. R-squared:,0.559
Method:,Least Squares,F-statistic:,297.4
Date:,"Fri, 29 Nov 2019",Prob (F-statistic):,5.51e-205
Time:,00:44:07,Log-Likelihood:,-14321.0
No. Observations:,1168,AIC:,28650.0
Df Residuals:,1162,BIC:,28690.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.809e+05,1501.172,120.485,0.000,1.78e+05,1.84e+05
x1,7974.8970,1463.911,5.448,0.000,5102.693,1.08e+04
x2,3212.5095,1461.214,2.199,0.028,345.596,6079.423
x3,3.046e+04,1784.732,17.065,0.000,2.7e+04,3.4e+04
x4,2.256e+04,1677.774,13.448,0.000,1.93e+04,2.59e+04
x5,2.201e+04,1602.250,13.735,0.000,1.89e+04,2.52e+04

0,1,2,3
Omnibus:,439.886,Durbin-Watson:,1.978
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4524.322
Skew:,1.438,Prob(JB):,0.0
Kurtosis:,12.203,Cond. No.,1.87


As shown above, adding feature X5 (*Fireplaces*, the number of fireplaces) raised *Adj. R-squared* significantly, from 0.488 to 0.559. Since this was the last candidate, the process ends here with a list of features that each improved the regression model (increased *Adj. R-squared*).


**Final list of features:** 1-LotArea, 2-PoolArea, 3-GarageArea, 5-YearBuilt, 7-Fireplaces

Training the model on the training set with the selected features:

In [0]:
regressor = LinearRegression()
regressor.fit(X_train[:, [1,2,3,5,7]], y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Evaluating the model with the R² metric on the test set:

In [0]:
regressor.score(X_test[:, [1,2,3,5,7]], y_test)

0.5507793131653085
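
The manual pass above (try each feature in column order, keep it only if adjusted R² rises) can be written as a loop. The following is a minimal sketch on synthetic data, using NumPy least squares instead of statsmodels; `adjusted_r2_fit`, `forward_pass`, and the `min_gain` tolerance are hypothetical names introduced for illustration, and `y` is assumed to be one-dimensional.

```python
import numpy as np

def adjusted_r2_fit(X, y):
    """Fit OLS with an intercept and return adjusted R² on the same data."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def forward_pass(X, y, min_gain=1e-4):
    """Visit columns in order; keep one only if adjusted R² rises by more than min_gain."""
    selected, best = [], 0.0   # 0.0 is the adjusted R² of the intercept-only model
    for j in range(X.shape[1]):
        score = adjusted_r2_fit(X[:, selected + [j]], y)
        if score > best + min_gain:
            selected.append(j)
            best = score
    return selected, best

# Synthetic demo: column 1 is an exact copy of column 0 and adds no information
rng = np.random.default_rng(0)
n = 200
x0, x2 = rng.normal(size=(2, n))
y_demo = 3.0 * x0 + 2.0 * x2 + rng.normal(size=n)
X_demo = np.column_stack([x0, x0.copy(), x2])

selected, best = forward_pass(X_demo, y_demo)   # keeps columns 0 and 2
```

Like the manual procedure, this visits features in a fixed order, so the result can depend on that order. A stricter forward selection would compare all remaining candidates at every step and add the best one.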

##  Backward Elimination

In Backward Elimination, we start with all available features. We then remove features one at a time and check whether each removal improves the model, that is, whether *Adj. R-squared* increases or at least does not drop. As before, we fit the model with the statsmodels OLS class, which reports the R² values used in the analysis.

### Data Selection and Preparation

Viewing all the features of the dataset:

In [14]:
df.head(5)

Unnamed: 0,LotArea,PoolArea,GarageArea,OverallCond,YearBuilt,YrSold,Fireplaces,SalePrice
0,8450,0,548,5,2003,2008,0,208500
1,9600,0,460,8,1976,2007,1,181500
2,11250,0,608,5,2001,2008,1,223500
3,9550,0,642,5,1915,2006,1,140000
4,14260,0,836,5,2000,2008,1,250000


Defining the independent and dependent variables and standardizing the features (the train/test split follows in the next step):

In [0]:
X = df[df.columns[~df.columns.isin(['SalePrice'])]].values
y = df['SalePrice'].values

# Standardizing the features:
X = feature_scaling(X)

### Running the Backward Elimination Process


This process works in the opposite direction to the forward approach: at each iteration a feature is removed and its impact on the model is assessed through *Adj. R-squared*.

First, a column filled with 1s is inserted at the start of the feature matrix so that OLS can estimate the intercept. The dataset is then split into training and test sets:

In [16]:
X = np.append(arr = np.ones((X.shape[0], 1)), values = X, axis = 1)

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

X_train[1:5,:2]


array([[ 1.        , -0.26857781],
       [ 1.        , -0.1743691 ],
       [ 1.        , -0.33241925],
       [ 1.        , -0.55290771]])

Selecting all features of the training set and fitting the model with OLS to compute the coefficients.

In [0]:
# Importing the package.
import statsmodels.regression.linear_model as sm

# Fitting with all the features:
X_opt = X_train[:, [0,1,2,3,4,5,6,7]]
regressor_ols = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_ols.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.572
Model:,OLS,Adj. R-squared:,0.57
Method:,Least Squares,F-statistic:,221.7
Date:,"Fri, 29 Nov 2019",Prob (F-statistic):,7.88e-209
Time:,00:44:07,Log-Likelihood:,-14307.0
No. Observations:,1168,AIC:,28630.0
Df Residuals:,1160,BIC:,28670.0
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.808e+05,1483.694,121.883,0.000,1.78e+05,1.84e+05
x1,8061.6719,1446.960,5.571,0.000,5222.720,1.09e+04
x2,3336.4581,1448.058,2.304,0.021,495.353,6177.563
x3,3.006e+04,1765.796,17.021,0.000,2.66e+04,3.35e+04
x4,8726.2152,1607.366,5.429,0.000,5572.546,1.19e+04
x5,2.612e+04,1783.040,14.647,0.000,2.26e+04,2.96e+04
x6,236.3377,1494.222,0.158,0.874,-2695.342,3168.017
x7,2.165e+04,1584.915,13.663,0.000,1.85e+04,2.48e+04

0,1,2,3
Omnibus:,464.282,Durbin-Watson:,1.997
Prob(Omnibus):,0.0,Jarque-Bera (JB):,5052.077
Skew:,1.521,Prob(JB):,0.0
Kurtosis:,12.724,Cond. No.,2.04


From the summary above, the model has an *Adj. R-squared* of 0.570. Following the Backward Elimination process, we remove one feature (X6, *YrSold*) and check *Adj. R-squared* again.


In [0]:
# Fitting with all features except 6-YrSold

X_opt = X_train[:, [0,1,2,3,4,5,7]]
regressor_ols = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_ols.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.572
Model:,OLS,Adj. R-squared:,0.57
Method:,Least Squares,F-statistic:,258.8
Date:,"Fri, 29 Nov 2019",Prob (F-statistic):,4.75e-210
Time:,00:44:07,Log-Likelihood:,-14307.0
No. Observations:,1168,AIC:,28630.0
Df Residuals:,1161,BIC:,28660.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.808e+05,1483.065,121.935,0.000,1.78e+05,1.84e+05
x1,8062.8559,1446.333,5.575,0.000,5225.137,1.09e+04
x2,3320.0150,1443.714,2.300,0.022,487.434,6152.596
x3,3.005e+04,1764.767,17.028,0.000,2.66e+04,3.35e+04
x4,8732.9953,1606.119,5.437,0.000,5581.775,1.19e+04
x5,2.612e+04,1781.992,14.658,0.000,2.26e+04,2.96e+04
x6,2.165e+04,1584.243,13.668,0.000,1.85e+04,2.48e+04

0,1,2,3
Omnibus:,464.454,Durbin-Watson:,1.996
Prob(Omnibus):,0.0,Jarque-Bera (JB):,5049.657
Skew:,1.522,Prob(JB):,0.0
Kurtosis:,12.721,Cond. No.,2.04


From the summary above, *Adj. R-squared* did not change after removing feature X6 (*YrSold*, the year of sale). The feature can therefore be removed: it does not improve the model and only adds complexity.
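
The notebook does not say why X6 was tried first. One plausible reason, stated here as an assumption, is that it has the largest p-value (0.874) in the full-model summary, which corresponds to the smallest absolute t-statistic. The sketch below ranks features that way on synthetic data with plain NumPy; `t_statistics` is a hypothetical helper name.

```python
import numpy as np

def t_statistics(X, y):
    """OLS t-statistic per feature column (the intercept is fitted but not returned)."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]                      # residual degrees of freedom
    sigma2 = float(resid @ resid) / dof       # estimated noise variance
    cov = sigma2 * np.linalg.inv(A.T @ A)     # coefficient covariance matrix
    se = np.sqrt(np.diag(cov))
    return (beta / se)[1:]

# Synthetic demo: column 1 has no effect on the target
rng = np.random.default_rng(0)
n = 200
x0, x1, x2 = rng.normal(size=(3, n))
y_demo = 3.0 * x0 + 2.0 * x2 + rng.normal(size=n)
X_demo = np.column_stack([x0, x1, x2])

t_vals = t_statistics(X_demo, y_demo)
weakest = int(np.argmin(np.abs(t_vals)))      # first candidate for removal
```

Picking the feature with the smallest absolute t-statistic gives each backward step a clear candidate, instead of trying removals in an arbitrary order.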


Removing the 7th feature, fireplaces (*Fireplaces*):

In [0]:
# Fitting with all features except 6-YrSold and 7-Fireplaces

X_opt = X_train[:, [0,1,2,3,4,5]]
regressor_ols = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_ols.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.503
Model:,OLS,Adj. R-squared:,0.501
Method:,Least Squares,F-statistic:,235.6
Date:,"Fri, 29 Nov 2019",Prob (F-statistic):,9.43e-174
Time:,00:44:07,Log-Likelihood:,-14394.0
No. Observations:,1168,AIC:,28800.0
Df Residuals:,1162,BIC:,28830.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.809e+05,1597.251,113.233,0.000,1.78e+05,1.84e+05
x1,1.269e+04,1514.348,8.382,0.000,9722.747,1.57e+04
x2,4821.4956,1550.364,3.110,0.002,1779.669,7863.322
x3,3.424e+04,1871.783,18.292,0.000,3.06e+04,3.79e+04
x4,9632.6563,1728.328,5.573,0.000,6241.664,1.3e+04
x5,2.722e+04,1917.246,14.197,0.000,2.35e+04,3.1e+04

0,1,2,3
Omnibus:,449.309,Durbin-Watson:,2.034
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3662.549
Skew:,1.554,Prob(JB):,0.0
Kurtosis:,11.099,Cond. No.,1.96


From the summary above, *Adj. R-squared* dropped considerably, from 0.570 to 0.501, after removing *Fireplaces*. That feature should therefore be kept, and we continue the process with another feature.

Removing the 5th feature, year built (*YearBuilt*):

In [0]:
# Fitting with all features except 5-YearBuilt, 6-YrSold and 7-Fireplaces

X_opt = X_train[:, [0,1,2,3,4]]
regressor_ols = sm.OLS(endog = y_train, exog = X_opt).fit()
regressor_ols.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.417
Model:,OLS,Adj. R-squared:,0.415
Method:,Least Squares,F-statistic:,208.2
Date:,"Fri, 29 Nov 2019",Prob (F-statistic):,1.06e-134
Time:,00:44:07,Log-Likelihood:,-14487.0
No. Observations:,1168,AIC:,28980.0
Df Residuals:,1163,BIC:,29010.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.805e+05,1729.280,104.384,0.000,1.77e+05,1.84e+05
x1,1.112e+04,1635.325,6.800,0.000,7911.844,1.43e+04
x2,4088.4437,1677.789,2.437,0.015,796.611,7380.276
x3,4.64e+04,1802.016,25.749,0.000,4.29e+04,4.99e+04
x4,652.2132,1741.565,0.374,0.708,-2764.748,4069.175

0,1,2,3
Omnibus:,326.932,Durbin-Watson:,2.02
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2206.916
Skew:,1.114,Prob(JB):,0.0
Kurtosis:,9.354,Cond. No.,1.31


**Final:** From the summary above, *Adj. R-squared* dropped considerably, to 0.415, after removing year built (*YearBuilt*). Note that this fit leaves out *Fireplaces* as well, so the drop is measured against the 0.501 of the previous fit. *YearBuilt* should therefore be kept. We can end the process here and use the model with the highest *Adj. R-squared*, which contains this feature.

Training the model on the training set with the selected features:

In [0]:
regressor = LinearRegression()
regressor.fit(X_train[:, [1,2,3,4,5,7]], y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Evaluating the new model on the test set with the R² metric:

In [0]:
regressor.score(X_test[:, [1,2,3,4,5,7]], y_test)

0.5633529452543352
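
The backward procedure can also be sketched as a loop: start from all features, try removing each one, and make the removal permanent only when adjusted R² does not fall. This is a minimal NumPy sketch on synthetic data rather than the notebook's statsmodels workflow; `adjusted_r2_fit`, `backward_pass`, and the `min_loss` tolerance are hypothetical names, and `y` is assumed to be one-dimensional.

```python
import numpy as np

def adjusted_r2_fit(X, y):
    """Fit OLS with an intercept and return adjusted R² on the same data."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def backward_pass(X, y, min_loss=1e-4):
    """Try removing columns from last to first; drop one if adjusted R² falls by at most min_loss."""
    selected = list(range(X.shape[1]))
    best = adjusted_r2_fit(X[:, selected], y)
    for j in reversed(range(X.shape[1])):
        trial = [k for k in selected if k != j]
        if not trial:              # always keep at least one feature
            continue
        score = adjusted_r2_fit(X[:, trial], y)
        if score >= best - min_loss:
            selected, best = trial, score
    return selected, best

# Synthetic demo: column 1 is an exact copy of column 0 and adds no information
rng = np.random.default_rng(0)
n = 200
x0, x2 = rng.normal(size=(2, n))
y_demo = 3.0 * x0 + 2.0 * x2 + rng.normal(size=n)
X_demo = np.column_stack([x0, x0.copy(), x2])

selected, best = backward_pass(X_demo, y_demo)   # drops the redundant column 1
```

Unlike the manual run above, each trial here removes a single feature from the current set, so a feature that was judged worth keeping stays in the model for the following comparisons.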