### Choosing the Best Prediction Model for the Ames Dataset

In [147]:
import pickle
import pathlib

import numpy as np
import pandas as pd

In [148]:
DATA_DIR = pathlib.Path.cwd().parent / 'data'
print(DATA_DIR)

c:\Users\giuli\OneDrive - Insper - Institudo de Ensino e Pesquisa\Documentos\insper\MachineLearning\data


In [149]:
clean_data_path = DATA_DIR / 'processed' / 'ames_clean.pkl'

In [150]:
with open(clean_data_path, 'rb') as file:
    data = pickle.load(file)

In [151]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2876 entries, 0 to 2929
Data columns (total 70 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   MS.SubClass      2876 non-null   category
 1   MS.Zoning        2876 non-null   category
 2   Lot.Frontage     2876 non-null   float64 
 3   Lot.Area         2876 non-null   float64 
 4   Lot.Shape        2876 non-null   category
 5   Land.Contour     2876 non-null   category
 6   Lot.Config       2876 non-null   category
 7   Land.Slope       2876 non-null   category
 8   Neighborhood     2876 non-null   category
 9   Bldg.Type        2876 non-null   category
 10  House.Style      2876 non-null   category
 11  Overall.Qual     2876 non-null   category
 12  Overall.Cond     2876 non-null   category
 13  Roof.Style       2876 non-null   category
 14  Mas.Vnr.Type     2876 non-null   category
 15  Mas.Vnr.Area     2876 non-null   float64 
 16  Exter.Qual       2876 non-null   category
 17  

In [152]:
model_data = data.copy()

##### *Encoding categorical variables*

Let's identify all the categorical variables, both the nominal ones (categorical with no order among the categories) and the ordinal ones (categorical with a defined order).

In [153]:
categorical_columns = []
ordinal_columns = []
for col in model_data.select_dtypes('category').columns:
    if model_data[col].cat.ordered:
        ordinal_columns.append(col)
    else:
        categorical_columns.append(col)

In [154]:
ordinal_columns

['Lot.Shape',
 'Land.Slope',
 'Overall.Qual',
 'Overall.Cond',
 'Exter.Qual',
 'Exter.Cond',
 'Heating.QC',
 'Electrical',
 'Kitchen.Qual',
 'Functional',
 'Paved.Drive',
 'Fence']

In [155]:
categorical_columns

['MS.SubClass',
 'MS.Zoning',
 'Land.Contour',
 'Lot.Config',
 'Neighborhood',
 'Bldg.Type',
 'House.Style',
 'Roof.Style',
 'Mas.Vnr.Type',
 'Foundation',
 'Bsmt.Qual',
 'Bsmt.Cond',
 'Bsmt.Exposure',
 'BsmtFin.Type.1',
 'BsmtFin.Type.2',
 'Central.Air',
 'Garage.Type',
 'Garage.Finish',
 'Sale.Type',
 'Sale.Condition',
 'Condition',
 'Exterior']


##### *Encoding ordinal variables*

Ordinal variables can be converted to integers directly: the lowest category receives the value 0, the next one receives the value 1, and so on.
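The cell that follows relies on `pd.factorize(..., sort=True)`. As a minimal alternative sketch of the same idea, the `.cat.codes` accessor of an ordered categorical returns the position of each value in the declared category order. The toy column and its levels are hypothetical and are not taken from the Ames data.

```python
import pandas as pd

# Hypothetical toy column; categories are declared from lowest to highest.
quality = pd.Series(
    pd.Categorical(
        ['Gd', 'TA', 'Ex', 'Fa', 'TA'],
        categories=['Fa', 'TA', 'Gd', 'Ex'],
        ordered=True,
    )
)

# .cat.codes gives each value's index in the declared category order.
codes = quality.cat.codes
print(codes.tolist())  # [2, 1, 3, 0, 1]
```

Because the codes follow the declared order rather than the order of appearance, the integer scale keeps the ranking of the original levels.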

In [156]:
for col in ordinal_columns:
    codes, _ = pd.factorize(data[col], sort=True)
    model_data[col] = codes

In [157]:
model_data[ordinal_columns].info()

<class 'pandas.core.frame.DataFrame'>
Index: 2876 entries, 0 to 2929
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Lot.Shape     2876 non-null   int64
 1   Land.Slope    2876 non-null   int64
 2   Overall.Qual  2876 non-null   int64
 3   Overall.Cond  2876 non-null   int64
 4   Exter.Qual    2876 non-null   int64
 5   Exter.Cond    2876 non-null   int64
 6   Heating.QC    2876 non-null   int64
 7   Electrical    2876 non-null   int64
 8   Kitchen.Qual  2876 non-null   int64
 9   Functional    2876 non-null   int64
 10  Paved.Drive   2876 non-null   int64
 11  Fence         2876 non-null   int64
dtypes: int64(12)
memory usage: 292.1 KB


In [158]:
data['Lot.Shape'].value_counts()

Lot.Shape
Reg    1824
IR1     960
IR2      76
IR3      16
Name: count, dtype: int64

In [159]:
model_data['Lot.Shape'].value_counts()

Lot.Shape
0    1824
1     960
2      76
3      16
Name: count, dtype: int64

##### *Encoding nominal variables*

The strategy for encoding nominal variables is to create several new numeric variables that indicate whether a given item belongs to each category of the original variable. These are called dummy variables.

Each of these new variables contains only the values 0 or 1, where:

- 1 indicates that the item belongs to the category represented by the variable;
- 0 indicates that the item does not belong to that category.

For a given item, at most one dummy variable derived from the same original column has the value 1, and all the others are 0. Because the call below uses `drop_first=True`, the first category of each column has no dummy of its own and is represented by all of that column's dummies being 0.
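A minimal sketch on a hypothetical toy column (the labels `A`, `B`, `C` are illustrative, not Ames categories) shows what `pd.get_dummies` produces when `drop_first=True` is passed, as in the next cell.

```python
import pandas as pd

# Hypothetical nominal column with three unordered levels.
toy = pd.DataFrame({
    'Group': pd.Categorical(['B', 'C', 'A', 'B'], categories=['A', 'B', 'C']),
})

# drop_first=True omits the dummy for the first level ('A').
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies.columns.tolist())  # ['Group_B', 'Group_C']

# Each row has at most one dummy set; the 'A' row has none.
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())  # [1, 1, 0, 1]
```

Dropping one level per column avoids perfectly collinear dummies, which matters for linear models such as Ridge.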

In [160]:
model_data = pd.get_dummies(model_data, drop_first=True)

In [161]:
model_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2876 entries, 0 to 2929
Columns: 165 entries, Lot.Frontage to Exterior_Other
dtypes: bool(119), float64(34), int64(12)
memory usage: 1.4 MB


In [162]:
for cat in categorical_columns:
    dummies = []
    for col in model_data.columns:
        if col.startswith(cat + "_"):
            dummies.append(f'"{col}"')
    dummies_str = ', '.join(dummies)
    print(f'From column "{cat}" we made {dummies_str}\n')

From column "MS.SubClass" we made "MS.SubClass_30", "MS.SubClass_50", "MS.SubClass_60", "MS.SubClass_70", "MS.SubClass_80", "MS.SubClass_85", "MS.SubClass_90", "MS.SubClass_120", "MS.SubClass_160", "MS.SubClass_190", "MS.SubClass_Other"

From column "MS.Zoning" we made "MS.Zoning_RH", "MS.Zoning_RL", "MS.Zoning_RM"

From column "Land.Contour" we made "Land.Contour_HLS", "Land.Contour_Low", "Land.Contour_Lvl"

From column "Lot.Config" we made "Lot.Config_CulDSac", "Lot.Config_FR2", "Lot.Config_FR3", "Lot.Config_Inside"

From column "Neighborhood" we made "Neighborhood_BrDale", "Neighborhood_BrkSide", "Neighborhood_ClearCr", "Neighborhood_CollgCr", "Neighborhood_Crawfor", "Neighborhood_Edwards", "Neighborhood_Gilbert", "Neighborhood_IDOTRR", "Neighborhood_MeadowV", "Neighborhood_Mitchel", "Neighborhood_NAmes", "Neighborhood_NPkVill", "Neighborhood_NWAmes", "Neighborhood_NoRidge", "Neighborhood_NridgHt", "Neighborhood_OldTown", "Neighborhood_SWISU", "Neighborhood_Sawyer", "Neighborhood_Sa

##### *Splitting the dataset into training and test sets*

In [163]:
X = model_data.drop(columns=['SalePrice']).copy()
y = model_data['SalePrice'].copy()

In [164]:
X.values, y.values

(array([[141.0, 31770.0, 1, ..., False, False, False],
        [80.0, 11622.0, 0, ..., False, False, False],
        [81.0, 14267.0, 1, ..., True, False, False],
        ...,
        [62.0, 10441.0, 0, ..., False, False, False],
        [77.0, 10010.0, 0, ..., False, False, False],
        [74.0, 9627.0, 0, ..., False, False, False]], dtype=object),
 array([5.33243846, 5.0211893 , 5.23552845, ..., 5.12057393, 5.23044892,
        5.27415785]))

In [165]:
from sklearn.model_selection import train_test_split

In [166]:
RANDOM_SEED = 42

In [167]:
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=RANDOM_SEED,
)

In [168]:
X.shape, Xtrain.shape, Xtest.shape

((2876, 164), (2157, 164), (719, 164))

In [169]:
y.shape, ytrain.shape, ytest.shape

((2876,), (2157,), (719,))

##### *Model 1 - Random Forest*

In [170]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [171]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', DummyRegressor()),
])

param_grid = [
    {
        'model': [DummyRegressor()],
    },
    {
        'model': [RandomForestRegressor()],
        'model__n_estimators': [200, 300, 400, 500],
        'model__max_depth': [20, 30, 40, 50]
    }
]

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
)

grid_search.fit(Xtrain, ytrain)

print("Best parameters found: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)

Best parameters found:  {'model': RandomForestRegressor(), 'model__max_depth': 40, 'model__n_estimators': 400}
Best score:  -0.0032873155981634843


##### *Model 2 - Ridge*

In [172]:
from sklearn.linear_model import Ridge

In [173]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', DummyRegressor()),
])

param_grid = [
    {
        'model': [DummyRegressor()],
    },
    {
        'model': [Ridge()],
        'model__alpha': [0.1, 1.0, 10.0, 100.0, 1000.0]
    },
]

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
)
grid_search.fit(Xtrain, ytrain)

print("Best parameters found: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)

Best parameters found:  {'model': Ridge(), 'model__alpha': 100.0}
Best score:  -0.0028599092258193024


##### *Training and Evaluating the Chosen Model*

The model with the best cross-validated performance was Ridge with `alpha` set to 100. Let's train it on all the training data and measure its performance on the test data.
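The comparison behind this choice can be made explicit. The sketch below copies the two best scores printed by the grid searches above; since both used `scoring='neg_mean_squared_error'`, negating them gives the MSE and the square root gives an RMSE in the same units as the target.

```python
import math

# Best cross-validated scores copied from the two grid-search outputs above.
# They are negated MSE values, so values closer to zero are better.
cv_scores = {
    'RandomForestRegressor': -0.0032873155981634843,
    'Ridge': -0.0028599092258193024,
}

# Negate to recover the MSE, then take the square root to get the RMSE.
cv_rmse = {name: math.sqrt(-score) for name, score in cv_scores.items()}

best_name = min(cv_rmse, key=cv_rmse.get)
for name, rmse in cv_rmse.items():
    print(f'{name}: CV RMSE = {rmse:.4f}')
print('Lowest CV RMSE:', best_name)  # Lowest CV RMSE: Ridge
```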

In [174]:
from sklearn.metrics import mean_squared_error

In [175]:
best_model = grid_search.best_estimator_

best_model.fit(Xtrain, ytrain)

ytest_pred = best_model.predict(Xtest)

test_mse = mean_squared_error(ytest, ytest_pred)

test_rmse = test_mse ** 0.5

print("Test MSE: ", test_mse)
print("Test RMSE: ", test_rmse)


Test MSE:  0.0038387982901674296
Test RMSE:  0.061958036526082956
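To read this RMSE on the price scale, one option is to convert it to a multiplicative factor. The sketch below assumes the target is `log10(SalePrice)`. The `y` values near 5.3 printed earlier suggest this, but this section does not state it, so treat the interpretation as an assumption.

```python
# Test RMSE copied from the output above.
test_rmse = 0.061958036526082956

# ASSUMPTION: the target is log10(SalePrice); this section does not confirm it.
# Under that assumption, an error of test_rmse in log10 units corresponds to
# a multiplicative factor of 10 ** test_rmse on the price scale.
error_factor = 10 ** test_rmse

print(f'Typical multiplicative error: x{error_factor:.3f}')
print(f'Roughly {100 * (error_factor - 1):.1f}% above the true price')
```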


Now let's find out which variables have the greatest impact on house prices.

In [None]:
import pandas as pd
import numpy as np

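# best_model is a Pipeline, so the fitted coefficients live on its final
# 'model' step (the Ridge regressor), not on the pipeline object itself.
coefficients = best_model.named_steps['model'].coef_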

coef_df = pd.DataFrame({
    'Feature': Xtrain.columns,
    'Coefficient': coefficients
})

coef_df['AbsCoefficient'] = np.abs(coef_df['Coefficient'])
coef_df = coef_df.sort_values(by='AbsCoefficient', ascending=False)

print(coef_df)

                Feature  Coefficient  AbsCoefficient
4          Overall.Qual     0.027609        0.027609
18          Gr.Liv.Area     0.024350        0.024350
5          Overall.Cond     0.016293        0.016293
16          X2nd.Flr.SF     0.015260        0.015260
46            House.Age    -0.015208        0.015208
..                  ...          ...             ...
91     Bldg.Type_2fmCon    -0.000103        0.000103
52       MS.SubClass_85     0.000074        0.000074
97   House.Style_2.5Fin     0.000074        0.000074
67    Lot.Config_Inside    -0.000057        0.000057
159     Exterior_Stucco     0.000034        0.000034

[164 rows x 3 columns]
