In [1]:
import sys
sys.path.insert(0, '/home/matheus/Documentos/house-recommendation/src')
import pandas as pd
pd.set_option('mode.chained_assignment', None)
from joblib import load

# Property recommender

The goal of this notebook is to demonstrate the machine learning processes that were implemented: data ingestion, data preprocessing, model experimentation, model training, model evaluation, and inference with the chosen model.

To learn more about the data, see the exploratory analysis: [CLICK HERE](https://github.com/mathdeoliveira/house-recommendation/blob/master/analysis/analise_exploratoria.ipynb).

The processes were split into Python packages, each with its own responsibility. To look at each one separately, visit my GitHub: [CLICK HERE](https://github.com/mathdeoliveira).

For the project definition, its objective and its results, see the document produced at the end of the whole process by clicking HERE.


<img src="https://www.kdnuggets.com/wp-content/uploads/crisp-dm-4-problems-fig1.png"/>

## Data ingestion

The process starts with data acquisition: the data was collected by scraping real-estate agency websites in my city. This process is described in more detail in the Data Extraction document, which you can access by clicking HERE.

Once the scraping is done, the data must be imported and split into train and test sets. After that, each file can be read according to the parameter passed. Let's walk through the data acquisition process.

In [2]:
# package developed for data ingestion
from data_source import DataSource

In [3]:
# function responsible for reading the original file and splitting it into train and test
DataSource().generate_test_data()

This gives us two new files, the train file and the test file, saved in the Data directory.
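The internals of `generate_test_data` are not shown in this notebook. As a rough sketch of what such a split involves, assuming a shuffled hold-out split, the helper `split_rows` below is hypothetical, and so are its test fraction of 0.33 and its seed. With 1793 rows in total, that fraction rounded up happens to give the 1201 and 592 rows reported later in this notebook.

```python
import math
import random


def split_rows(rows, test_fraction=0.33, seed=42):
    """Shuffle the rows and split them into (train, test) lists.

    Hypothetical helper: the fraction and seed are assumptions,
    not values taken from the data_source package.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(1793))  # stand-in for the scraped rows
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # 1201 592
```

Fixing the seed keeps the split reproducible, so the train and test files do not change between runs.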

In [9]:
# function responsible for reading the train or test file, depending on the parameter
df_train = DataSource().read_data(train = True)

df_train.head()

Unnamed: 0,Id,area,quartos,garagem,banheiros,bairro,preco,y
0,336,84,3,2,2,Jardim Finotti,300000,0
1,1717,74,3,1,2,Copacabana,260000,0
2,528,61,2,1,1,Vida Nova,170000,0
3,1122,46,2,1,1,Jardim Finotti,170000,1
4,1460,57,2,1,2,Nova Uberlândia,160000,0


In [5]:
df_test = DataSource().read_data(train = False)

df_test.head()

Unnamed: 0,Id,area,quartos,garagem,banheiros,bairro,preco,y
0,1495,32,1,1,1,Santa Maria,200000,0
1,267,68,2,1,2,Dona Zulmira,160000,0
2,121,52,2,1,1,Segismundo Pereira,182000,1
3,546,44,2,1,1,Shopping Park,120000,0
4,345,125,2,2,2,Jardim Europa Ii,230000,1


In [6]:
print('The training data has {} rows and {} columns'.format(df_train.shape[0], df_train.shape[1]))
print('The test data has {} rows and {} columns'.format(df_test.shape[0], df_test.shape[1]))

The training data has 1201 rows and 8 columns
The test data has 592 rows and 8 columns


This concludes the data acquisition step, handled by the data_source.py package.
With the data loaded, preprocessing begins; continue to the next step.

## Preprocessing

This step is needed to get the data into a form suitable for machine learning algorithms. For example, we have string values such as "Planalto"; machine learning algorithms work better when such a field is numeric, so we need some process to perform that transformation. Besides this example there are other steps to handle. Let's move on to the demonstration of the preprocessing with the package created for it.

#### Training data

In [3]:
from preprocessing import PreProcessing
pre = PreProcessing()

In [10]:
X_train, y_train = pre.preprocess(df_train, train = True)

Creating dataframe for data manipulation
Droping columns with missing values
Dropping column with id
Creating list with numeric features
Creating list with categorical features
removing target
['bairro']
feature encoder
feature normalization and encoding


In [9]:
X_train.head()

Unnamed: 0,bairro,area,quartos,garagem,banheiros,preco
0,0.169858,-0.433758,0.317536,0.048544,-0.092727,-0.275415
1,0.169858,-0.490799,0.317536,-0.761197,-0.092727,-0.38956
2,0.169858,-0.564952,-0.845149,-0.761197,-1.044569,-0.646385
3,0.084929,-0.650513,-0.845149,-0.761197,-1.044569,-0.646385
4,0.169858,-0.587768,-0.845149,-0.761197,-0.092727,-0.674921


In [10]:
y_train.head()

0    0
1    0
2    0
3    1
4    0
Name: y, dtype: int64

First we build a dataframe holding all the information about the training data, such as the columns, the percentage of nulls and the column types. With that we can start manipulating the data.

We start by dropping columns whose percentage of nulls exceeds a chosen threshold; since we have no missing data, nothing is removed.

The Id column is removed from the dataset, since it brings no benefit to the algorithms: it is just an incremental value.

Two lists are created, one storing the numeric columns and the other the categorical columns. They are used to standardize the numeric data and to encode the categorical columns.

If the train parameter is true, the target variable is removed from the data so that only the features are processed.

In this process, StandardScaler was chosen for the numeric features and CatBoostEncoder for the categorical ones; should other transformations need to be tested in the future, the change will be easy to make in the package. Below I explain these two transformations in a bit more detail.

The result is the processed training dataframe, with its numeric and categorical features already transformed, together with the target.
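The column-handling steps described above can be sketched in plain Python. This is only an illustration, not the code of the `preprocessing` package: the function `plan_columns` and its `null_threshold` default are hypothetical, and the rows are the first three training rows shown earlier.

```python
def plan_columns(rows, null_threshold=0.5, id_column="Id", target="y"):
    """Return (numeric_cols, categorical_cols) for a list of dict rows.

    Hypothetical sketch: drops columns with too many missing values,
    drops the id and the target, then separates the rest by value type.
    """
    n_rows = len(rows)
    kept = []
    for col in rows[0].keys():
        null_fraction = sum(row[col] is None for row in rows) / n_rows
        if null_fraction > null_threshold:
            continue  # too many missing values
        if col in (id_column, target):
            continue  # not a feature
        kept.append(col)

    numeric = [
        col for col in kept
        if all(isinstance(row[col], (int, float))
               for row in rows if row[col] is not None)
    ]
    categorical = [col for col in kept if col not in numeric]
    return numeric, categorical


rows = [
    {"Id": 336, "area": 84, "quartos": 3, "garagem": 2, "banheiros": 2,
     "bairro": "Jardim Finotti", "preco": 300000, "y": 0},
    {"Id": 1717, "area": 74, "quartos": 3, "garagem": 1, "banheiros": 2,
     "bairro": "Copacabana", "preco": 260000, "y": 0},
    {"Id": 528, "area": 61, "quartos": 2, "garagem": 1, "banheiros": 1,
     "bairro": "Vida Nova", "preco": 170000, "y": 0},
]

numeric, categorical = plan_columns(rows)
print(numeric)      # ['area', 'quartos', 'garagem', 'banheiros', 'preco']
print(categorical)  # ['bairro']
```

The single categorical column matches the `['bairro']` line printed in the preprocessing log.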

#### Test data

In [11]:
X_test = pre.preprocess(df_test[pre.train_features], train = False)

Creating dataframe for data manipulation
Droping columns with missing values
Dropping column with id
Creating list with numeric features
Creating list with categorical features
removing target
['bairro']
feature encoder
feature normalization and encoding


In [12]:
X_test.head()

Unnamed: 0,bairro,area,quartos,garagem,banheiros,preco
0,0.005857,-0.73037,-2.007834,-0.761197,-1.044569,-0.560776
1,0.056619,-0.525023,-0.845149,-0.761197,-0.092727,-0.674921
2,0.361643,-0.616289,-0.845149,-0.761197,-1.044569,-0.612141
3,0.005479,-0.661921,-0.845149,-0.761197,-1.044569,-0.789065
4,0.169858,-0.19989,-0.845149,0.048544,-0.092727,-0.475168


The test data was transformed with the preprocessing fitted on the training data: nothing is re-fitted on the test set.

This returns a transformed test dataframe, which lets us evaluate the machine learning algorithms on data they have not seen.
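The fit-on-train, transform-on-test idea can be illustrated with a toy standardizer (standardization itself is described in the next section). `SimpleScaler` is a hypothetical stand-in, not the class used by the package; the values are the `area` entries from the first five train and test rows shown earlier.

```python
import statistics


class SimpleScaler:
    """Toy standardizer: learns mean and std in fit, reuses them in transform."""

    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)  # population std (ddof=0)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


train_area = [84, 74, 61, 46, 57]
test_area = [32, 68, 52, 44, 125]

scaler = SimpleScaler().fit(train_area)      # statistics come from train only
train_scaled = scaler.transform(train_area)
test_scaled = scaler.transform(test_area)    # reuses the train mean and std

print(scaler.mean_)  # 64.4
```

Because the test values are scaled with the training statistics, the scaled test column will generally not have mean zero, and that is expected.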

#### Standard Scaler

Whenever we reach this step we need to decide which transformation to apply to the data, to keep the algorithm from being biased toward the features with the largest values. The options are normalization or standardization; for this case I chose standardization, so let's see what it is.

Standardization rescales the data so that the mean is zero and the standard deviation is one. Let's look at a practical example with the data.

In [13]:
df_train['area'].describe()

count    1201.000000
mean      160.043297
std       175.385874
min        12.000000
25%        60.000000
50%        94.000000
75%       245.000000
max      3000.000000
Name: area, dtype: float64

In [14]:
from sklearn.preprocessing import StandardScaler

In [15]:
scaler = StandardScaler()

scaler = scaler.fit(df_train[['area']])

Above we have the following:

I use the `area` column of our data as an example. We look at how it is distributed: three quarters of the values are at or below 245 square meters, and the maximum of 3000 shows that outliers are present.
We import the standardizer from scikit-learn and fit it on this column.

In [16]:
df_train['area'] = scaler.transform(df_train[['area']])

In [17]:
df_train['area'].describe()

count    1.201000e+03
mean    -1.774878e-17
std      1.000417e+00
min     -8.444521e-01
25%     -5.706558e-01
50%     -3.767168e-01
75%      4.846006e-01
max      1.619936e+01
Name: area, dtype: float64

We use the fitted scaler to transform the column and look at how the data is now distributed. The data has been standardized: the mean is approximately zero and the standard deviation is approximately 1.
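The reported standard deviation is 1.000417 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (ddof=0), while pandas' `describe()` reports the sample standard deviation (ddof=1). A short check, using only the summary statistics printed above for the `area` column:

```python
import math

n = 1201
mean = 160.043297
sample_std = 175.385874  # as printed by describe(), ddof=1

# StandardScaler uses the population standard deviation (ddof=0).
pop_std = sample_std * math.sqrt((n - 1) / n)

scaled_min = (12.0 - mean) / pop_std
scaled_max = (3000.0 - mean) / pop_std
scaled_std = sample_std / pop_std  # equals sqrt(n / (n - 1))

print(round(scaled_std, 6))  # 1.000417
print(scaled_min, scaled_max)
```

The computed minimum and maximum agree with the `min` and `max` rows of the second `describe()` output up to rounding of the inputs.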

Since our dataset does not have many outliers, standardization was chosen so that the variables do not differ much in scale from one another. This reduces the impact of scale on the machine learning algorithm.

#### CatBoostEncoder

The `category_encoders` library offers several ways to transform categorical data, and one of them is the CatBoostEncoder used here. It is one of the more recent techniques and is designed to address data leakage problems.

It works similarly to LOO (leave-one-out encoder), but computes the target statistics "on the fly".

The encoding comes from CatBoost, a gradient boosting algorithm that uses binary decision trees as base learners, and the details are fairly involved. To keep this text short, I invite you to read this [link](https://arxiv.org/pdf/1706.09516.pdf), which is a paper, to better understand the encoder, or this [link](https://towardsdatascience.com/benchmarking-categorical-encoders-9c322bd77ee8), a post that explains the CatBoostEncoder process in more detail.

Below I show the algorithm in action.

In [19]:
import category_encoders as ce

In [20]:
catb = ce.CatBoostEncoder(cols = ['bairro'])

In [21]:
df_train['bairro'] = catb.fit_transform(df_train['bairro'], y=df_train['y'])

In [22]:
df_train['bairro'].head()

0    0.169858
1    0.169858
2    0.169858
3    0.084929
4    0.169858
Name: bairro, dtype: float64
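The values printed above can be reproduced with a small from-scratch sketch of an ordered target statistic. It assumes that each row is encoded as `(sum of earlier targets in the same category + prior * a) / (number of earlier rows in that category + a)`, with the prior equal to the overall share of class 1 (204 of 1201 training rows) and smoothing weight `a = 1`. This is a reading of the printed numbers, not a copy of the library's code, and `ordered_target_encode` is a hypothetical name.

```python
def ordered_target_encode(categories, targets, prior, a=1.0):
    """Encode each row using only the targets of earlier rows in its category."""
    sums, counts = {}, {}
    encoded = []
    for category, y in zip(categories, targets):
        seen_sum = sums.get(category, 0.0)
        seen_count = counts.get(category, 0)
        encoded.append((seen_sum + prior * a) / (seen_count + a))
        # The current target only becomes visible to later rows.
        sums[category] = seen_sum + y
        counts[category] = seen_count + 1
    return encoded


# First five training rows shown earlier in this notebook.
bairro = ["Jardim Finotti", "Copacabana", "Vida Nova",
          "Jardim Finotti", "Nova Uberlândia"]
y = [0, 0, 0, 1, 0]
prior = 204 / 1201  # share of class 1 in the full training set

encoded = ordered_target_encode(bairro, y, prior)
print([round(v, 6) for v in encoded])
# [0.169858, 0.169858, 0.169858, 0.084929, 0.169858]
```

The fourth row is the second occurrence of "Jardim Finotti": the earlier row had target 0, so the encoding is (0 + prior) / (1 + 1), half the prior. A row never sees its own target, which is how this scheme limits leakage.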

# Experimenting with algorithms

The problem to be solved is a classification problem, and there are countless machine learning algorithms for that purpose. To test some of them, I developed this package to read the data, preprocess it and test the selected algorithms so we can see how they perform on our data.

This step also includes balancing the target variable: the data is imbalanced, so the target must be balanced before we test the machine learning algorithms.
For this I chose the SMOTE method from the imblearn package, used together with a Pipeline so that balancing happens first. SMOTE is an oversampling technique, that is, a way of adding synthetic data for the minority class. An explanation with an example follows.


In [16]:
df_train.y.value_counts()

0    997
1    204
Name: y, dtype: int64

We can see that class '0' is the majority class, corresponding to properties I am not interested in. With SMOTE we can even out the classes, avoiding the imbalance so that the model is as accurate as possible.
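To give an intuition for what SMOTE does, here is a simplified from-scratch sketch: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The function `smote_sketch`, its parameters and the toy points are hypothetical; the notebook itself relies on imblearn, whose implementation differs in its details.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + lam * (nb - b) for b, nb in zip(base, neighbour))
        )
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # toy minority points
n_new = 997 - 204  # synthetic rows needed to match class 0 in the training set
synthetic = smote_sketch(minority, n_new)
print(len(synthetic))  # 793
```

Since every synthetic point lies between two real minority points, the new rows stay inside the region already occupied by the minority class.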

In [16]:
from experiments import Experiments

In [11]:
train_df = DataSource().read_data(train=True)
X_train, y_train = pre.preprocess(train_df, train=True)

Creating dataframe for data manipulation
Droping columns with missing values
Dropping column with id
Creating list with numeric features
Creating list with categorical features
removing target
['bairro']
feature encoder
feature normalization and encoding


In [12]:
X_train.shape, y_train.shape

((1201, 6), (1201,))

In [13]:
Experiments().train_model(X_train, y_train)

Treinando o modelo decision_tree
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
Cross val score using RepeatedStratifiedKFold
0.8405021645021645
Treinando o modelo random_forest
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                   

[... output truncated; per-iteration training log (iterations 135 to 999) omitted ...]
Cross val score using RepeatedStratifiedKFold
0.9644312554112554
Treinando o modelo lgbm
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_le

{'decision_tree': DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                        max_depth=None, max_features=None, max_leaf_nodes=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, presort='deprecated',
                        random_state=None, splitter='best'),
 'random_forest': RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                        criterion='gini', max_depth=None, max_features='auto',
                        max_leaf_nodes=None, max_samples=None,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=100,
                        n_jobs=None, oob_score=False, random_state=None,
                        verbos

The train_model function of the Experiments class trains several algorithms and shows us their performance on the training data using `cross_val_score` and `RepeatedStratifiedKFold`.
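The "stratified" part of `RepeatedStratifiedKFold` means that every fold keeps roughly the same class ratio as the full training set. A minimal sketch of that idea, using the class counts shown earlier (997 and 204), is below. The function `stratified_folds` is hypothetical, and the choice of 5 splits is an assumption, since the settings used inside `Experiments` are not shown here.

```python
import random
from collections import Counter


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign row indices to folds so that each fold keeps the class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


labels = [0] * 997 + [1] * 204
folds = stratified_folds(labels)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

Each fold ends up with 199 or 200 rows of class 0 and 40 or 41 rows of class 1. The "repeated" part simply reruns the procedure with different shuffles and averages the scores.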

This gives us the performance of the chosen algorithms. We can now test these models on the test data; for that, the run_experiment function was developed. It trains each model, evaluates it on the test data, shows the chosen metrics and saves them to a .csv file.

In [20]:
Experiments().run_experiment()

Reading Data
Preprocessing train data
Creating dataframe for data manipulation
Droping columns with missing values
Dropping column with id
Creating list with numeric features
Creating list with categorical features
removing target
['bairro']
feature encoder
feature normalization and encoding
Preprocessing test data
Creating dataframe for data manipulation
Droping columns with missing values
Dropping column with id
Creating list with numeric features
Creating list with categorical features
removing target
['bairro']
feature encoder
feature normalization and encoding
Training model
Treinando o modelo decision_tree
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
            

[... output truncated; per-iteration training log (iterations 129 to 943) omitted ...]

{'roc_auc_score': 0.8992351890674741,
 'average_precision_score': 0.7142796574486898}

In [27]:
xgboost_metrics = pd.read_csv('../output/xgboost.csv', names = ['metric', 'value'], skiprows=[0])
xgboost_metrics

Unnamed: 0,metric,value
0,roc_auc_score,0.899235
1,average_precision_score,0.71428


In [28]:
catboost_metrics = pd.read_csv('../output/catboost.csv', names = ['metric', 'value'], skiprows=[0])
catboost_metrics

Unnamed: 0,metric,value
0,roc_auc_score,0.897779
1,average_precision_score,0.685778


With this we can compare the metric results on the test data and choose the model that performed best.
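As a small illustration of that comparison, the two metric tables shown above can be put into a dictionary and ranked. Only the two models whose CSV files are displayed in this notebook are included, and `best_model` is a hypothetical helper.

```python
# Values copied from the two CSV outputs shown above.
metrics = {
    "xgboost": {"roc_auc_score": 0.899235, "average_precision_score": 0.71428},
    "catboost": {"roc_auc_score": 0.897779, "average_precision_score": 0.685778},
}


def best_model(metrics, metric_name):
    """Return the model name with the highest value for the given metric."""
    return max(metrics, key=lambda name: metrics[name][metric_name])


print(best_model(metrics, "roc_auc_score"))            # xgboost
print(best_model(metrics, "average_precision_score"))  # xgboost
```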

# Model evaluation

The experimentation part already shows the metrics, where we have the evaluation of each model. To make it explicit, here are the two metrics chosen to evaluate the models: ROC_AUC_SCORE and AVERAGE_PRECISION_SCORE.

A function was developed that takes the predicted values and the true values; scikit-learn does the calculations, and the function returns a dictionary with the metric results.
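To make the two metrics concrete, here is a from-scratch sketch of what they measure. It is not the code of the `metrics` package, which delegates to scikit-learn. The function names are hypothetical, the toy labels and scores are made up for illustration, and the average precision version assumes there are no tied scores.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive (no ties assumed)."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc(y_true, scores))                      # 0.75
print(round(average_precision(y_true, scores), 4))  # 0.8333
```

Both metrics work on predicted scores rather than hard class labels, which makes them useful for an imbalanced target like the one in this project.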

In [4]:
from metrics import Metrics

In [12]:
test_df = DataSource().read_data(train=False)
X_test = pre.preprocess(test_df[pre.train_features], train=False)
y_test = test_df['y']

Creating dataframe for data manipulation
Droping columns with missing values
Dropping column with id
Creating list with numeric features
Creating list with categorical features
removing target
['bairro']
feature encoder
feature normalization and encoding


In [13]:
X_test.head()

Unnamed: 0,bairro,area,quartos,garagem,banheiros,preco
0,0.005857,-0.73037,-2.007834,-0.761197,-1.044569,-0.560776
1,0.056619,-0.525023,-0.845149,-0.761197,-0.092727,-0.674921
2,0.361643,-0.616289,-0.845149,-0.761197,-1.044569,-0.612141
3,0.005479,-0.661921,-0.845149,-0.761197,-1.044569,-0.789065
4,0.169858,-0.19989,-0.845149,0.048544,-0.092727,-0.475168


In [14]:
y_test.head()

0    0
1    0
2    1
3    0
4    1
Name: y, dtype: int64

In [17]:
models = Experiments().train_model(X_train, y_train)

Treinando o modelo decision_tree
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
Cross val score using RepeatedStratifiedKFold
0.834022077922078
Treinando o modelo random_forest
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                    

139:	learn: 0.1869690	total: 524ms	remaining: 3.22s
140:	learn: 0.1866880	total: 529ms	remaining: 3.22s
141:	learn: 0.1859544	total: 532ms	remaining: 3.21s
142:	learn: 0.1853508	total: 534ms	remaining: 3.2s
143:	learn: 0.1847395	total: 538ms	remaining: 3.2s
144:	learn: 0.1840164	total: 540ms	remaining: 3.19s
145:	learn: 0.1834468	total: 544ms	remaining: 3.18s
146:	learn: 0.1830638	total: 546ms	remaining: 3.17s
147:	learn: 0.1824375	total: 550ms	remaining: 3.17s
148:	learn: 0.1820416	total: 554ms	remaining: 3.16s
149:	learn: 0.1816308	total: 559ms	remaining: 3.17s
150:	learn: 0.1810669	total: 566ms	remaining: 3.18s
151:	learn: 0.1807025	total: 569ms	remaining: 3.18s
152:	learn: 0.1800593	total: 573ms	remaining: 3.17s
153:	learn: 0.1795170	total: 576ms	remaining: 3.16s
154:	learn: 0.1791900	total: 579ms	remaining: 3.16s
155:	learn: 0.1787976	total: 582ms	remaining: 3.15s
156:	learn: 0.1783781	total: 586ms	remaining: 3.15s
157:	learn: 0.1779474	total: 590ms	remaining: 3.14s
158:	learn: 0.

329:	learn: 0.1295488	total: 1.1s	remaining: 2.24s
330:	learn: 0.1293369	total: 1.1s	remaining: 2.23s
331:	learn: 0.1291627	total: 1.11s	remaining: 2.23s
332:	learn: 0.1289419	total: 1.11s	remaining: 2.23s
333:	learn: 0.1286720	total: 1.12s	remaining: 2.23s
334:	learn: 0.1283481	total: 1.12s	remaining: 2.23s
335:	learn: 0.1280914	total: 1.13s	remaining: 2.23s
336:	learn: 0.1278679	total: 1.13s	remaining: 2.22s
337:	learn: 0.1275017	total: 1.13s	remaining: 2.22s
338:	learn: 0.1273883	total: 1.13s	remaining: 2.21s
339:	learn: 0.1271259	total: 1.14s	remaining: 2.21s
340:	learn: 0.1268920	total: 1.14s	remaining: 2.21s
341:	learn: 0.1266233	total: 1.15s	remaining: 2.21s
342:	learn: 0.1262587	total: 1.15s	remaining: 2.2s
343:	learn: 0.1261832	total: 1.15s	remaining: 2.2s
344:	learn: 0.1260886	total: 1.15s	remaining: 2.19s
345:	learn: 0.1259034	total: 1.16s	remaining: 2.19s
346:	learn: 0.1256826	total: 1.16s	remaining: 2.18s
347:	learn: 0.1253622	total: 1.16s	remaining: 2.18s
348:	learn: 0.12

532:	learn: 0.0981616	total: 1.67s	remaining: 1.46s
533:	learn: 0.0979721	total: 1.67s	remaining: 1.46s
534:	learn: 0.0978408	total: 1.68s	remaining: 1.46s
535:	learn: 0.0978152	total: 1.68s	remaining: 1.45s
536:	learn: 0.0977435	total: 1.69s	remaining: 1.45s
537:	learn: 0.0977132	total: 1.7s	remaining: 1.46s
538:	learn: 0.0974442	total: 1.7s	remaining: 1.46s
539:	learn: 0.0972904	total: 1.71s	remaining: 1.45s
540:	learn: 0.0972170	total: 1.71s	remaining: 1.45s
541:	learn: 0.0971013	total: 1.71s	remaining: 1.45s
542:	learn: 0.0969785	total: 1.72s	remaining: 1.45s
543:	learn: 0.0968502	total: 1.72s	remaining: 1.44s
544:	learn: 0.0967886	total: 1.72s	remaining: 1.44s
545:	learn: 0.0966166	total: 1.72s	remaining: 1.43s
546:	learn: 0.0964581	total: 1.73s	remaining: 1.43s
547:	learn: 0.0962791	total: 1.73s	remaining: 1.43s
548:	learn: 0.0961571	total: 1.73s	remaining: 1.42s
549:	learn: 0.0961066	total: 1.73s	remaining: 1.42s
550:	learn: 0.0959823	total: 1.74s	remaining: 1.42s
...

711:	learn: 0.0783724	total: 2.24s	remaining: 907ms
712:	learn: 0.0782557	total: 2.25s	remaining: 905ms
713:	learn: 0.0781525	total: 2.25s	remaining: 903ms
714:	learn: 0.0781338	total: 2.26s	remaining: 901ms
715:	learn: 0.0780478	total: 2.26s	remaining: 898ms
716:	learn: 0.0779941	total: 2.27s	remaining: 895ms
717:	learn: 0.0779458	total: 2.27s	remaining: 892ms
718:	learn: 0.0778598	total: 2.27s	remaining: 890ms
719:	learn: 0.0777711	total: 2.28s	remaining: 886ms
720:	learn: 0.0776553	total: 2.28s	remaining: 883ms
721:	learn: 0.0775726	total: 2.29s	remaining: 880ms
722:	learn: 0.0774411	total: 2.29s	remaining: 877ms
723:	learn: 0.0773861	total: 2.29s	remaining: 874ms
724:	learn: 0.0772836	total: 2.29s	remaining: 870ms
725:	learn: 0.0771885	total: 2.3s	remaining: 867ms
726:	learn: 0.0770886	total: 2.3s	remaining: 864ms
727:	learn: 0.0770124	total: 2.3s	remaining: 860ms
728:	learn: 0.0769707	total: 2.31s	remaining: 857ms
729:	learn: 0.0769482	total: 2.31s	remaining: 854ms
...

876:	learn: 0.0654528	total: 2.81s	remaining: 394ms
877:	learn: 0.0653956	total: 2.81s	remaining: 391ms
878:	learn: 0.0653295	total: 2.82s	remaining: 388ms
879:	learn: 0.0652723	total: 2.82s	remaining: 385ms
880:	learn: 0.0652194	total: 2.82s	remaining: 381ms
881:	learn: 0.0651730	total: 2.83s	remaining: 378ms
882:	learn: 0.0650967	total: 2.83s	remaining: 375ms
883:	learn: 0.0650449	total: 2.83s	remaining: 372ms
884:	learn: 0.0649978	total: 2.84s	remaining: 369ms
885:	learn: 0.0649608	total: 2.84s	remaining: 366ms
886:	learn: 0.0648591	total: 2.84s	remaining: 362ms
887:	learn: 0.0648039	total: 2.85s	remaining: 359ms
888:	learn: 0.0647582	total: 2.85s	remaining: 356ms
889:	learn: 0.0647128	total: 2.85s	remaining: 353ms
890:	learn: 0.0646590	total: 2.86s	remaining: 350ms
891:	learn: 0.0645624	total: 2.86s	remaining: 346ms
892:	learn: 0.0644719	total: 2.86s	remaining: 343ms
893:	learn: 0.0643738	total: 2.87s	remaining: 340ms
894:	learn: 0.0642966	total: 2.87s	remaining: 337ms
...

In [19]:
# evaluate every trained model on the test set and print its metrics
for name, model in models.items():
    y_pred = model.predict(X_test)
    print(Metrics().calculate_classification(name, y_test, pd.Series(y_pred)))

{'model_name': 'decision_tree', 'roc_auc_score': 0.8454654864508114, 'average_precision_score': 0.5878590855005948}
{'model_name': 'random_forest', 'roc_auc_score': 0.8934894013510366, 'average_precision_score': 0.7006494767226982}
{'model_name': 'logistic_regression', 'roc_auc_score': 0.8343427284727074, 'average_precision_score': 0.43706272310045896}
{'model_name': 'catboost', 'roc_auc_score': 0.8893741750135881, 'average_precision_score': 0.6769240228707586}
{'model_name': 'lgbm', 'roc_auc_score': 0.8871418588399721, 'average_precision_score': 0.7046836386459028}
{'model_name': 'xgboost', 'roc_auc_score': 0.8861130522556099, 'average_precision_score': 0.6982941162186446}


This gives us each model's performance on the test data.
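To make the comparison easier to read, the result dictionaries printed above can be sorted by each metric. The snippet below is only a sketch: the values are copied from the output above, and `rank_by` is a hypothetical helper, not part of the project's `Metrics` package.

```python
# Results copied from the evaluation output above.
results = [
    {'model_name': 'decision_tree', 'roc_auc_score': 0.8454654864508114, 'average_precision_score': 0.5878590855005948},
    {'model_name': 'random_forest', 'roc_auc_score': 0.8934894013510366, 'average_precision_score': 0.7006494767226982},
    {'model_name': 'logistic_regression', 'roc_auc_score': 0.8343427284727074, 'average_precision_score': 0.43706272310045896},
    {'model_name': 'catboost', 'roc_auc_score': 0.8893741750135881, 'average_precision_score': 0.6769240228707586},
    {'model_name': 'lgbm', 'roc_auc_score': 0.8871418588399721, 'average_precision_score': 0.7046836386459028},
    {'model_name': 'xgboost', 'roc_auc_score': 0.8861130522556099, 'average_precision_score': 0.6982941162186446},
]


def rank_by(results, metric):
    """Hypothetical helper: model names ordered from best to worst on `metric`."""
    ordered = sorted(results, key=lambda r: r[metric], reverse=True)
    return [r['model_name'] for r in ordered]


auc_ranking = rank_by(results, 'roc_auc_score')
ap_ranking = rank_by(results, 'average_precision_score')

print('ROC AUC:', auc_ranking)
print('Average precision:', ap_ranking)
```

On these numbers, `random_forest` leads on ROC AUC and `lgbm` on average precision, with `catboost` second on ROC AUC; the gaps among the four ensemble models are small.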

# Training the chosen model and inference

Based on the experiments, we can choose the model to use for this problem. In this case I chose the CatBoostClassifier, so I developed a package that reads the data, preprocesses it, trains the model, and saves the trained model to a `.pkl` file.

We can then use this trained model for inference and put it into production.

In [20]:
from model_training import ModelTraining

In [21]:
ModelTraining().model_training()

Reading data
Starting training
Creating dataframe for data manipulation
Droping columns with missing values
Dropping column with id
Creating list with numeric features
Creating list with categorical features
removing target
['bairro']
feature encoder
feature normalization and encoding
Starting training model
Learning rate set to 0.013833
0:	learn: 0.6776137	total: 7.08ms	remaining: 7.07s
1:	learn: 0.6609138	total: 16.8ms	remaining: 8.37s
2:	learn: 0.6456310	total: 21.7ms	remaining: 7.2s
3:	learn: 0.6289561	total: 24.5ms	remaining: 6.1s
4:	learn: 0.6135193	total: 31.7ms	remaining: 6.31s
5:	learn: 0.5994544	total: 36.9ms	remaining: 6.11s
6:	learn: 0.5840590	total: 39.9ms	remaining: 5.66s
7:	learn: 0.5679893	total: 62.5ms	remaining: 7.75s
8:	learn: 0.5536119	total: 66.4ms	remaining: 7.31s
9:	learn: 0.5420985	total: 70.3ms	remaining: 6.96s
10:	learn: 0.5328099	total: 75.8ms	remaining: 6.81s
11:	learn: 0.5223293	total: 82.7ms	remaining: 6.81s
...

160:	learn: 0.1773630	total: 845ms	remaining: 4.4s
161:	learn: 0.1768189	total: 861ms	remaining: 4.46s
162:	learn: 0.1765215	total: 870ms	remaining: 4.46s
163:	learn: 0.1762380	total: 882ms	remaining: 4.49s
164:	learn: 0.1759496	total: 895ms	remaining: 4.53s
165:	learn: 0.1754797	total: 900ms	remaining: 4.52s
166:	learn: 0.1752695	total: 912ms	remaining: 4.55s
167:	learn: 0.1749162	total: 922ms	remaining: 4.57s
168:	learn: 0.1745630	total: 929ms	remaining: 4.57s
169:	learn: 0.1741032	total: 932ms	remaining: 4.55s
170:	learn: 0.1738631	total: 938ms	remaining: 4.54s
171:	learn: 0.1731794	total: 941ms	remaining: 4.53s
172:	learn: 0.1728111	total: 944ms	remaining: 4.51s
173:	learn: 0.1723729	total: 948ms	remaining: 4.5s
174:	learn: 0.1720802	total: 958ms	remaining: 4.52s
175:	learn: 0.1715413	total: 968ms	remaining: 4.53s
176:	learn: 0.1712919	total: 973ms	remaining: 4.52s
177:	learn: 0.1708816	total: 980ms	remaining: 4.52s
178:	learn: 0.1704616	total: 992ms	remaining: 4.55s
...

342:	learn: 0.1297251	total: 1.82s	remaining: 3.48s
343:	learn: 0.1296064	total: 1.82s	remaining: 3.48s
344:	learn: 0.1294608	total: 1.83s	remaining: 3.47s
345:	learn: 0.1293396	total: 1.83s	remaining: 3.46s
346:	learn: 0.1291519	total: 1.83s	remaining: 3.45s
347:	learn: 0.1288322	total: 1.84s	remaining: 3.44s
348:	learn: 0.1287133	total: 1.84s	remaining: 3.44s
349:	learn: 0.1284739	total: 1.85s	remaining: 3.43s
350:	learn: 0.1283954	total: 1.85s	remaining: 3.43s
351:	learn: 0.1283442	total: 1.86s	remaining: 3.43s
352:	learn: 0.1279915	total: 1.87s	remaining: 3.42s
353:	learn: 0.1278161	total: 1.87s	remaining: 3.42s
354:	learn: 0.1276784	total: 1.88s	remaining: 3.41s
355:	learn: 0.1273717	total: 1.88s	remaining: 3.41s
356:	learn: 0.1271950	total: 1.89s	remaining: 3.4s
357:	learn: 0.1270723	total: 1.89s	remaining: 3.39s
358:	learn: 0.1267954	total: 1.9s	remaining: 3.39s
359:	learn: 0.1265638	total: 1.91s	remaining: 3.39s
360:	learn: 0.1261768	total: 1.91s	remaining: 3.38s
...

521:	learn: 0.1031027	total: 2.79s	remaining: 2.55s
522:	learn: 0.1029378	total: 2.8s	remaining: 2.55s
523:	learn: 0.1027992	total: 2.81s	remaining: 2.55s
524:	learn: 0.1027191	total: 2.81s	remaining: 2.55s
525:	learn: 0.1026214	total: 2.82s	remaining: 2.54s
526:	learn: 0.1024426	total: 2.83s	remaining: 2.54s
527:	learn: 0.1023142	total: 2.84s	remaining: 2.54s
528:	learn: 0.1021501	total: 2.85s	remaining: 2.54s
529:	learn: 0.1020637	total: 2.85s	remaining: 2.53s
530:	learn: 0.1020033	total: 2.85s	remaining: 2.52s
531:	learn: 0.1018080	total: 2.87s	remaining: 2.52s
532:	learn: 0.1016609	total: 2.87s	remaining: 2.52s
533:	learn: 0.1015811	total: 2.88s	remaining: 2.51s
534:	learn: 0.1014485	total: 2.88s	remaining: 2.5s
535:	learn: 0.1013900	total: 2.88s	remaining: 2.5s
536:	learn: 0.1013668	total: 2.89s	remaining: 2.49s
537:	learn: 0.1011478	total: 2.9s	remaining: 2.49s
538:	learn: 0.1010746	total: 2.9s	remaining: 2.48s
539:	learn: 0.1010176	total: 2.9s	remaining: 2.47s
...

717:	learn: 0.0811528	total: 3.74s	remaining: 1.47s
718:	learn: 0.0810478	total: 3.75s	remaining: 1.46s
719:	learn: 0.0809587	total: 3.75s	remaining: 1.46s
720:	learn: 0.0808846	total: 3.76s	remaining: 1.45s
721:	learn: 0.0807769	total: 3.76s	remaining: 1.45s
722:	learn: 0.0806714	total: 3.77s	remaining: 1.44s
723:	learn: 0.0805677	total: 3.77s	remaining: 1.44s
724:	learn: 0.0805047	total: 3.78s	remaining: 1.43s
725:	learn: 0.0804234	total: 3.79s	remaining: 1.43s
726:	learn: 0.0803349	total: 3.79s	remaining: 1.42s
727:	learn: 0.0802818	total: 3.79s	remaining: 1.42s
728:	learn: 0.0801545	total: 3.8s	remaining: 1.41s
729:	learn: 0.0800956	total: 3.81s	remaining: 1.41s
730:	learn: 0.0800098	total: 3.81s	remaining: 1.4s
731:	learn: 0.0798500	total: 3.81s	remaining: 1.4s
732:	learn: 0.0797487	total: 3.82s	remaining: 1.39s
733:	learn: 0.0796964	total: 3.83s	remaining: 1.39s
734:	learn: 0.0795954	total: 3.83s	remaining: 1.38s
735:	learn: 0.0794956	total: 3.83s	remaining: 1.38s
...

905:	learn: 0.0662857	total: 4.5s	remaining: 467ms
906:	learn: 0.0661915	total: 4.5s	remaining: 462ms
907:	learn: 0.0661538	total: 4.51s	remaining: 457ms
908:	learn: 0.0661141	total: 4.51s	remaining: 452ms
909:	learn: 0.0660444	total: 4.52s	remaining: 447ms
910:	learn: 0.0659892	total: 4.52s	remaining: 442ms
911:	learn: 0.0659504	total: 4.52s	remaining: 437ms
912:	learn: 0.0659020	total: 4.54s	remaining: 432ms
913:	learn: 0.0658483	total: 4.54s	remaining: 427ms
914:	learn: 0.0657630	total: 4.54s	remaining: 422ms
915:	learn: 0.0657257	total: 4.55s	remaining: 417ms
916:	learn: 0.0656626	total: 4.55s	remaining: 412ms
917:	learn: 0.0656059	total: 4.55s	remaining: 407ms
918:	learn: 0.0655522	total: 4.55s	remaining: 402ms
919:	learn: 0.0654569	total: 4.56s	remaining: 396ms
920:	learn: 0.0654127	total: 4.56s	remaining: 391ms
921:	learn: 0.0653158	total: 4.56s	remaining: 386ms
922:	learn: 0.0652695	total: 4.57s	remaining: 381ms
923:	learn: 0.0651871	total: 4.57s	remaining: 376ms
...

{'model': <catboost.core.CatBoostClassifier at 0x7fe22e6d61f0>,
 'preprocessing': <preprocessing.PreProcessing at 0x7fe22e6d6370>,
 'columns': ['area', 'quartos', 'garagem', 'banheiros', 'preco', 'bairro']}

In [23]:
modelo = load('../output/modelo.pkl')

In [31]:
modelo['model']

<catboost.core.CatBoostClassifier at 0x7fe22e6e4d90>

In [29]:
modelo['columns']

['area', 'quartos', 'garagem', 'banheiros', 'preco', 'bairro']

This makes the whole machine learning process easy to access from a single file: the trained model, the full preprocessing step, and the columns used to train the model.
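The idea of bundling everything into one dictionary and persisting it can be sketched as follows. This is an illustration under assumptions: it uses the standard-library `pickle` module as a stand-in (the notebook itself loads the file with `joblib.load`), and the `model` and `preprocessing` values are placeholders rather than the real `CatBoostClassifier` and `PreProcessing` objects.

```python
import pickle
import tempfile
from pathlib import Path

# Placeholder objects: in the project these would be the fitted
# CatBoostClassifier and the PreProcessing instance.
bundle = {
    'model': {'kind': 'placeholder-model'},
    'preprocessing': {'kind': 'placeholder-preprocessing'},
    'columns': ['area', 'quartos', 'garagem', 'banheiros', 'preco', 'bairro'],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'modelo.pkl'

    # Save the whole bundle as a single artifact.
    with path.open('wb') as fh:
        pickle.dump(bundle, fh)

    # Load it back, as the inference step would.
    with path.open('rb') as fh:
        restored = pickle.load(fh)

print(sorted(restored.keys()))
print(restored['columns'])
```

Keeping the preprocessing object and the column list next to the model means the inference code does not have to re-declare them, which reduces the chance of training and inference drifting apart.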

The next step is inference, where we use the trained model to predict values for new input data, in this case the test data.

In [2]:
from model_inference import ModelInference

In [3]:
ModelInference().predict()

Loading model
Loading data
Preprocessing Data
Creating dataframe for data manipulation
Droping columns with missing values
Dropping column with id
Creating list with numeric features
Creating list with categorical features
removing target
['bairro']
feature encoder
feature normalization and encoding
bairro       0
area         0
quartos      0
garagem      0
banheiros    0
preco        0
dtype: int64
Predicting
Evaluating model in test data
{'model_name': 'CatBoost', 'roc_auc_score': 0.8904029815979502, 'average_precision_score': 0.6826965105266992}
Saving file


array([0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,

# Using the model

This step focuses on using the chosen, trained model to classify data based on the feature values given to it. That way we can see how it performs on real data.
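Before feeding new data to the model, it can help to confirm that every column used in training is present. The sketch below assumes the expected columns are the ones stored under `modelo['columns']` and that the raw file has the columns shown in the prediction output; `missing_columns` is a hypothetical helper, not part of the project.

```python
def missing_columns(expected, available):
    """Hypothetical helper: expected columns that are absent from `available`."""
    available_set = set(available)
    return [col for col in expected if col not in available_set]


# Columns stored with the trained model (see modelo['columns']).
expected_columns = ['area', 'quartos', 'garagem', 'banheiros', 'preco', 'bairro']

# Columns of the raw file, as shown in the prediction output.
raw_columns = ['Id', 'area', 'quartos', 'garagem', 'banheiros', 'bairro', 'preco']

missing = missing_columns(expected_columns, raw_columns)
if missing:
    raise ValueError(f'Missing columns for inference: {missing}')
print('All training columns are present.')
```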

In [14]:
new_df = pd.read_excel('../data/imoveis_raw.xlsx')

In [20]:
new_df.head()

new_df_copy = new_df.copy()

In [21]:
new_df.shape

(575, 8)

In [16]:
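# load the saved artifact: a dict with the model, the preprocessing object and the training columns
modelo = load('../output/modelo.pkl')
# run the saved preprocessing object in inference mode (train=False)
modelo['preprocessing'].preprocess(new_df_copy, train=False)
# predict the class of each property
y_pred = modelo['model'].predict(new_df_copy)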

Creating dataframe for data manipulation
Droping columns with missing values
Dropping column with id
Creating list with numeric features
Creating list with categorical features
removing target
['bairro']
feature encoder
feature normalization and encoding


In [19]:
new_df['y'] = y_pred
new_df.loc[new_df['y'] == 1]

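|     | Id   | area | quartos | garagem | banheiros | bairro               | preco    | y   |
|-----|------|------|---------|---------|-----------|----------------------|----------|-----|
| 2   | 1791 | 60   | 2       | 2       | 2         | Santa Monica         | 195000.0 | 1   |
| 14  | 1803 | 92   | 2       | 1       | 2         | Presidente Roosevelt | 185000.0 | 1   |
| 15  | 1804 | 54   | 2       | 1       | 2         | Santa Monica         | 175000.0 | 1   |
| 16  | 1805 | 54   | 2       | 1       | 2         | Santa Monica         | 175000.0 | 1   |
| 23  | 1812 | 75   | 3       | 1       | 1         | Presidente Roosevelt | 170000.0 | 1   |
| ... | ...  | ...  | ...     | ...     | ...       | ...                  | ...      | ... |
| 552 | 2341 | 60   | 2       | 2       | 2         | Santa Mônica         | 190000.0 | 1   |
| 557 | 2346 | 78   | 2       | 3       | 1         | Jardim Brasília      | 184000.0 | 1   |
| 559 | 2348 | 60   | 2       | 1       | 2         | Santa Mônica         | 180000.0 | 1   |
| 563 | 2352 | 72   | 2       | 2       | 2         | Tibery               | 190000.0 | 1   |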


In [24]:
new_df.loc[new_df['y'] == 1].describe()

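|       | Id          | area      | quartos  | garagem  | banheiros | preco         | y    |
|-------|-------------|-----------|----------|----------|-----------|---------------|------|
| count | 99.0        | 99.0      | 99.0     | 99.0     | 99.0      | 99.0          | 99.0 |
| mean  | 2082.717172 | 80.010101 | 2.262626 | 1.575758 | 1.828283  | 195333.838485 | 1.0  |
| std   | 175.887316  | 38.272251 | 0.464799 | 0.624188 | 0.40508   | 20910.821199  | 0.0  |
| min   | 1791.0      | 45.0      | 2.0      | 1.0      | 1.0       | 115000.0      | 1.0  |
| 25%   | 1967.5      | 60.0      | 2.0      | 1.0      | 2.0       | 184000.0      | 1.0  |
| 50%   | 2081.0      | 68.0      | 2.0      | 2.0      | 2.0       | 195000.0      | 1.0  |
| 75%   | 2246.5      | 82.0      | 2.5      | 2.0      | 2.0       | 210000.0      | 1.0  |
| max   | 2362.0      | 252.0     | 4.0      | 4.0      | 3.0       | 240000.0      | 1.0  |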


This gives us the properties that our model classified as recommended for purchase. The recommendations stayed within the criteria I wanted: the desired locations, with the chosen characteristics.
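One detail visible in the rows above is that the same neighborhood appears both as `Santa Monica` and as `Santa Mônica`. As plain strings these are different values, so counts per neighborhood can be split across the two spellings. The sketch below shows one possible way to normalize the names before counting; `normalize_name` is a hypothetical helper, and the list contains only the neighborhoods of the rows displayed above, not all 99 recommended properties.

```python
import unicodedata
from collections import Counter


def normalize_name(name: str) -> str:
    """Hypothetical helper: strip accents, surrounding spaces and case."""
    decomposed = unicodedata.normalize('NFKD', name)
    without_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_accents.strip().lower()


# Neighborhoods of the recommended rows displayed above.
bairros = [
    'Santa Monica', 'Presidente Roosevelt', 'Santa Monica', 'Santa Monica',
    'Presidente Roosevelt', 'Santa Mônica', 'Jardim Brasília', 'Santa Mônica',
    'Tibery',
]

raw_counts = Counter(bairros)
normalized_counts = Counter(normalize_name(b) for b in bairros)

print(raw_counts.most_common())
print(normalized_counts.most_common())
```

Whether to apply such a normalization before training is a design choice; it is shown here only as a way to inspect the recommendations per neighborhood.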

# Caveats

Like every machine learning model, this one will not always be 100% correct. This project is therefore a property recommender that makes it easier to search for properties on the web and, based on its training, shows the properties most similar to the ones I indicated I would buy.

Caveats:
* The link to each property's page was removed to avoid problems related to the scraping
* The dataset is relatively small, so more properties and new features need to be collected
* Along with collecting more properties, new features are needed, such as the number of suites, the neighborhood's population, built area and total area, etc.