# Predictive model for house prices
Kaggle competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Pedro Lucas | pedro.pessoal14@gmail.com

In [1]:
# dependencies

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# models
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import BaggingRegressor

In [2]:
modelos = [DecisionTreeRegressor(), RandomForestRegressor(), AdaBoostRegressor(), 
           ExtraTreesRegressor(), GradientBoostingRegressor(), BaggingRegressor()]

In [3]:
sample_path = './data/sample_submission.csv'
train_path = './data/train.csv'
test_path = './data/test.csv'

sample_submission = pd.read_csv(sample_path)
train = pd.read_csv(train_path)
test = pd.read_csv(test_path)

### Cleaning the data

In [4]:
# info on the columns with at least one null value (train)
null = [col for col in train if train[col].isna().sum() >= 1]
train[null].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   LotFrontage   1201 non-null   float64
 1   Alley         91 non-null     object 
 2   MasVnrType    1452 non-null   object 
 3   MasVnrArea    1452 non-null   float64
 4   BsmtQual      1423 non-null   object 
 5   BsmtCond      1423 non-null   object 
 6   BsmtExposure  1422 non-null   object 
 7   BsmtFinType1  1423 non-null   object 
 8   BsmtFinType2  1422 non-null   object 
 9   Electrical    1459 non-null   object 
 10  FireplaceQu   770 non-null    object 
 11  GarageType    1379 non-null   object 
 12  GarageYrBlt   1379 non-null   float64
 13  GarageFinish  1379 non-null   object 
 14  GarageQual    1379 non-null   object 
 15  GarageCond    1379 non-null   object 
 16  PoolQC        7 non-null      object 
 17  Fence         281 non-null    object 
 18  MiscFeature   54 non-null     object 

In [5]:
train[null].isna().sum()

LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

In [6]:
# null counts for the same columns (test)
test[null].isna().sum()

LotFrontage      227
Alley           1352
MasVnrType        16
MasVnrArea        15
BsmtQual          44
BsmtCond          45
BsmtExposure      44
BsmtFinType1      42
BsmtFinType2      42
Electrical         0
FireplaceQu      730
GarageType        76
GarageYrBlt       78
GarageFinish      78
GarageQual        78
GarageCond        78
PoolQC          1456
Fence           1169
MiscFeature     1408
dtype: int64

Looking at the data, we have:

- **Columns to delete:** Alley, PoolQC, Fence, MiscFeature
(more than 70% of the values in these columns are null, which makes it impractical to clean them for use in the model)


- **Drop subset:** MasVnrType, MasVnrArea, Electrical, LotFrontage, MSZoning, Utilities, Exterior1st, Exterior2nd, Functional, SaleType 
(few rows have nulls in these columns, so those rows can be dropped without major consequences; LotFrontage is the exception, with 259 nulls in train and 227 in test)


- **Fill nulls with:** 
    - all basement-related columns = No basement
    - all garage-related columns  = No garage
    - FireplaceQu = No Fireplace
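
The split between "delete the column" and "drop the rows" can be sketched with the fraction of nulls per column. This is a minimal illustration on a toy frame with made-up values, not the notebook's data; `toy`, `threshold`, `to_delete` and `few_nulls` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (hypothetical values).
toy = pd.DataFrame({
    'Alley': [np.nan, np.nan, np.nan, 'Grvl', np.nan],
    'Electrical': ['SBrkr', 'SBrkr', np.nan, 'FuseA', 'SBrkr'],
    'LotArea': [8450, 9600, 11250, 9550, 14260],
})

# Fraction of missing values per column (mean of a boolean mask).
null_frac = toy.isna().mean()

# Columns above the threshold are candidates for deletion;
# columns with some, but fewer, nulls are candidates for a row-wise dropna.
threshold = 0.70
to_delete = null_frac[null_frac > threshold].index.tolist()
few_nulls = null_frac[(null_frac > 0) & (null_frac <= threshold)].index.tolist()

print(to_delete)
print(few_nulls)
```

Running the same two lines on `train` and on `test` separately helps spot columns, such as MSZoning, that only have nulls in one of the two files.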

In [7]:
bsmt = [col for col in train if 'Bsmt' in col] # basement-related columns
garage = [col for col in train if 'Garage' in col] # garage-related columns
lot = [col for col in train if 'Lot' in col] # lot-related columns

# columns dropped entirely (Utilities is removed here as a column, not via the row-wise subset below)
inuteis = ['Id', 'Alley', 'PoolQC', 'Fence', 'MiscFeature', 'Utilities', 'MoSold']

# columns whose null rows are dropped
minorias = ['MasVnrType', 'MasVnrArea', 'Electrical', 'LotFrontage', 'MSZoning', 
            'Exterior1st', 'Exterior2nd', 'Functional', 'SaleType']

In [8]:
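def cleanDf(df):
    '''Clean the dataset in place and return it'''
    
    # errors='ignore' skips columns that are already gone (e.g. on a second run)
    df.drop(columns=inuteis, errors='ignore', inplace=True)
    
    # drop the rows that have nulls in the sparse-null columns
    df.dropna(subset=minorias, inplace=True)

    df[bsmt] = df[bsmt].fillna('No basement')

    df[garage] = df[garage].fillna('No garage')

    # assign back instead of calling fillna(inplace=True) on a column selection
    df['FireplaceQu'] = df['FireplaceQu'].fillna('No Fireplace')

    df.reset_index(drop=True, inplace=True)
        
    return df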

In [9]:
clean_train = cleanDf(train)
clean_test = cleanDf(test)
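
Because `cleanDf` drops rows, the cleaned test set ends up with fewer rows than the competition expects predictions for. One possible alternative, sketched below under the assumption that imputing is acceptable for these columns, is to fill nulls with statistics computed on the training set only. `impute_from_train` and the toy frames are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train.csv and test.csv (hypothetical values).
toy_train = pd.DataFrame({
    'LotFrontage': [65.0, 80.0, np.nan, 60.0],
    'MSZoning': ['RL', 'RL', 'RM', np.nan],
})
toy_test = pd.DataFrame({
    'LotFrontage': [np.nan, 70.0],
    'MSZoning': [np.nan, 'RM'],
})

def impute_from_train(train_df, other_df):
    '''Fill nulls in other_df with statistics computed on train_df only.'''
    out = other_df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(train_df[col]):
            fill = train_df[col].median()        # numeric column: training median
        else:
            fill = train_df[col].mode().iloc[0]  # categorical column: most frequent training value
        out[col] = out[col].fillna(fill)
    return out

filled_test = impute_from_train(toy_train, toy_test)
print(filled_test)
```

Every test row survives this step, so each Id can receive a prediction.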

### Model
Before applying a regression model, the data has to be adapted. We follow these steps:

- convert the data to numeric values
- summarize the garage data (clustering)
- summarize the basement data (clustering)
- summarize the lot data (clustering)
- apply the model
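
For the first step, one thing to watch is that an encoder fitted separately on train and on test can assign different codes to the same category. The sketch below shows one way to keep the codes consistent, assuming scikit-learn's `OrdinalEncoder` (0.24 or later) is an acceptable substitute for a per-column `LabelEncoder`; the toy frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical columns (hypothetical values).
toy_train = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave'],
                          'SaleType': ['WD', 'New', 'WD']})
toy_test = pd.DataFrame({'Street': ['Grvl', 'Pave'],
                         'SaleType': ['COD', 'WD']})  # 'COD' never appears in training

# Fit once on the training data; categories unseen at fit time are mapped to -1.
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_codes = enc.fit_transform(toy_train)

# Reuse the fitted encoder so that 'Pave' and 'WD' get the same codes in both sets.
test_codes = enc.transform(toy_test)

print(train_codes)
print(test_codes)
```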

In [10]:
clean_train.head()

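| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | LotShape | LandContour | LotConfig | LandSlope | Neighborhood | ... | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.0 | 8450 | Pave | Reg | Lvl | Inside | Gtl | CollgCr | ... | 61 | 0 | 0 | 0 | 0 | 0 | 2008 | WD | Normal | 208500 |
| 1 | 20 | RL | 80.0 | 9600 | Pave | Reg | Lvl | FR2 | Gtl | Veenker | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2007 | WD | Normal | 181500 |
| 2 | 60 | RL | 68.0 | 11250 | Pave | IR1 | Lvl | Inside | Gtl | CollgCr | ... | 42 | 0 | 0 | 0 | 0 | 0 | 2008 | WD | Normal | 223500 |
| 3 | 70 | RL | 60.0 | 9550 | Pave | IR1 | Lvl | Corner | Gtl | Crawfor | ... | 35 | 272 | 0 | 0 | 0 | 0 | 2006 | WD | Abnorml | 140000 |
| 4 | 60 | RL | 84.0 | 14260 | Pave | IR1 | Lvl | FR2 | Gtl | NoRidge | ... | 84 | 0 | 0 | 0 | 0 | 0 | 2008 | WD | Normal | 250000 |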


In [11]:
def encodar(dataset):
    '''Convert the feature columns of the dataset to numeric codes'''
    
    le = LabelEncoder()
    df = dataset.copy()
    
    # encode every column except the target (if present), which stays numeric;
    # values are cast to str first so mixed-type columns can be encoded
    for col in df.columns.drop('SalePrice', errors='ignore'):
        df[col] = le.fit_transform(df[col].astype('str'))
    
    return df

def clusterizar(dataset, n_grupos):
    '''Summarize the lot, basement and garage columns of the dataset into a given number of groups'''
    
    grupos = [lot, bsmt, garage]
    nomes = ['lot', 'bsmt', 'garage']
    km = KMeans(n_clusters=n_grupos, random_state=0)
    
    for grupo, nome in zip(grupos, nomes):
        # refit KMeans on each column group and store the cluster label of every row
        dataset[f'Grupo_{nome}'] = km.fit_predict(dataset[grupo])
        
    return dataset

def preparar(dataset):
    '''Prepare the dataset for use in a predictive model'''
    
    df = encodar(dataset)
    df = clusterizar(df, n_grupos=3)
    
    # the original lot, basement and garage columns are replaced by their cluster labels
    colunas_dropar = [col for col in df if 'Lot' in col or 'Bsmt' in col or 'Garage' in col]
    df.drop(columns=colunas_dropar, inplace=True)
    
    return df
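
`clusterizar` fits KMeans again on whatever dataset it receives, so cluster 0 in train and cluster 0 in test need not describe the same kind of house. A minimal sketch of one way to keep the labels comparable is to fit on the training rows and call `predict` on the test rows. It uses synthetic points rather than the notebook's columns, and the explicit `n_init=10` is an assumption made to keep behaviour stable across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic training points: three well separated blobs of 20 points each.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
toy_train = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Synthetic test points: one point close to each blob.
toy_test = centers + 0.1

km = KMeans(n_clusters=3, n_init=10, random_state=0)

# Learn the centroids from the training rows only...
train_labels = km.fit_predict(toy_train)

# ...and assign test rows to those same centroids.
test_labels = km.predict(toy_test)

print(train_labels[:3], test_labels)
```

With this pattern a given label refers to the same centroid in both datasets, which is what a model trained on the labels implicitly assumes.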
    

In [12]:
# preparing the test dataset
test_prep = preparar(clean_test)
# test_prep

In [13]:
# preparing the training dataset
train_prep = preparar(clean_train)
train_prep

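| | MSSubClass | MSZoning | Street | LandContour | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | ... | ScreenPorch | PoolArea | MiscVal | YrSold | SaleType | SaleCondition | SalePrice | Grupo_lot | Grupo_bsmt | Grupo_garage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9 | 3 | 1 | 3 | 0 | 5 | 2 | 2 | 0 | 5 | ... | 0 | 0 | 0 | 2 | 8 | 4 | 208500 | 0 | 0 | 2 |
| 1 | 4 | 3 | 1 | 3 | 0 | 24 | 1 | 2 | 0 | 2 | ... | 0 | 0 | 0 | 1 | 8 | 4 | 181500 | 0 | 1 | 2 |
| 2 | 9 | 3 | 1 | 3 | 0 | 5 | 2 | 2 | 0 | 5 | ... | 0 | 0 | 0 | 2 | 8 | 4 | 223500 | 1 | 0 | 0 |
| 3 | 10 | 3 | 1 | 3 | 0 | 6 | 2 | 2 | 0 | 5 | ... | 0 | 0 | 0 | 0 | 8 | 0 | 140000 | 0 | 0 | 0 |
| 4 | 9 | 3 | 1 | 3 | 0 | 15 | 2 | 2 | 0 | 5 | ... | 0 | 0 | 0 | 2 | 8 | 4 | 250000 | 1 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1189 | 9 | 3 | 1 | 3 | 0 | 8 | 2 | 2 | 0 | 5 | ... | 0 | 0 | 0 | 1 | 8 | 4 | 175000 | 0 | 0 | 2 |
| 1190 | 4 | 3 | 1 | 3 | 0 | 14 | 2 | 2 | 0 | 2 | ... | 0 | 0 | 0 | 4 | 8 | 4 | 210000 | 1 | 1 | 2 |
| 1191 | 10 | 3 | 1 | 3 | 0 | 6 | 2 | 2 | 0 | 5 | ... | 0 | 0 | 6 | 4 | 8 | 4 | 266500 | 0 | 1 | 1 |
| 1192 | 4 | 3 | 1 | 3 | 0 | 12 | 2 | 2 | 0 | 2 | ... | 0 | 0 | 0 | 4 | 8 | 4 | 142125 | 0 | 2 | 1 |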


In [14]:
# defining features and target
X = train_prep.drop(columns='SalePrice')
y = train_prep.SalePrice

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [16]:
# which regression model gets the best score (score() returns R² for regressors)

for modelo in modelos:
    modelo.fit(X_train, y_train)
    print(modelo, modelo.score(X_test, y_test))

DecisionTreeRegressor() 0.1544994324488046
RandomForestRegressor() 0.7816890867997033
AdaBoostRegressor() 0.7045771116044484
ExtraTreesRegressor() 0.754456742371445
GradientBoostingRegressor() 0.7308683144027852
BaggingRegressor() 0.7133823246162538


Since `RandomForestRegressor` had the highest score on the held-out split (R² ≈ 0.78), we use it as the final model.
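
The ranking above comes from a single 75/25 split with unseeded models, so it can change from run to run. A hedged sketch of a more stable comparison uses k-fold cross-validation with fixed seeds. It runs on synthetic data from `make_regression`, not on the house-price features, and the `candidates` dictionary is a hypothetical stand-in for the `modelos` list.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (stand-in for X and y).
X_toy, y_toy = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Fixed random_state makes each model reproducible.
candidates = {
    'tree': DecisionTreeRegressor(random_state=0),
    'forest': RandomForestRegressor(n_estimators=50, random_state=0),
}

# Five shuffled folds; every model is scored on the same folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, est in candidates.items():
    scores = cross_val_score(est, X_toy, y_toy, cv=cv, scoring='r2')
    results[name] = (scores.mean(), scores.std())
    print(f'{name}: mean R2 = {scores.mean():.3f}, std = {scores.std():.3f}')
```

Comparing the mean together with the spread across folds gives a better idea of whether the gap between two models is larger than the noise of a single split.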

In [17]:
# fitting the model on the full training dataset
model = RandomForestRegressor()
model.fit(X, y)

RandomForestRegressor()

In [18]:
# predicting on the test data
pred = model.predict(test_prep)
pred

array([114157.13, 135922.87, 164651.44, ..., 135430.8 , 118315.25,
       234604.84])

In [19]:
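# file for submission to Kaggle
# caveat: cleanDf resets the index after dropping rows, so test_prep.index is 0..len(pred)-1;
# this fills the first len(pred) rows in order, and the remaining rows keep the sample values
sample_submission.iloc[test_prep.index, 1] = pred
sample_submission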

| | Id | SalePrice |
|---|---|---|
| 0 | 1461 | 114157.130000 |
| 1 | 1462 | 135922.870000 |
| 2 | 1463 | 164651.440000 |
| 3 | 1464 | 176922.670000 |
| 4 | 1465 | 187151.750000 |
| ... | ... | ... |
| 1454 | 2915 | 167081.220949 |
| 1455 | 2916 | 164788.778231 |
| 1456 | 2917 | 219222.423400 |
| 1457 | 2918 | 184924.279659 |


In [20]:
sample_submission.to_csv('submission.csv', index=False)
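
Since rows are dropped from the test set during cleaning and its index is then reset, assigning predictions by position can attach a price to the wrong Id. A minimal sketch of an Id-based alternative follows, assuming the Id column is kept aside before cleaning. The toy frame and the placeholder predictions are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy test set (hypothetical values); the second row has a null and will be dropped.
toy_test = pd.DataFrame({'Id': [1461, 1462, 1463, 1464],
                         'LotFrontage': [80.0, np.nan, 74.0, 78.0]})

# Drop the null rows but keep Id, so every surviving row still carries its own Id.
kept = toy_test.dropna(subset=['LotFrontage'])
features = kept.drop(columns='Id')  # what a model would receive

# Placeholder predictions, one per surviving row (a fitted model would produce these).
preds = np.array([100000.0, 200000.0, 300000.0])

predicted = pd.DataFrame({'Id': kept['Id'].to_numpy(), 'SalePrice': preds})

# Left-join onto the full Id list: rows without a prediction show up as NaN
# instead of silently keeping an unrelated value.
submission = toy_test[['Id']].merge(predicted, on='Id', how='left')
print(submission)
```

The NaN rows make the gap explicit; they would still need a value (for example from imputing the missing features before predicting) before the file is a complete submission.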