<a href="https://colab.research.google.com/github/guilhermeaugusto9/sigmoidal/blob/master/07_4_Deploy_para_Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img alt="Colaboratory logo" width="15%" src="https://raw.githubusercontent.com/carlosfab/escola-data-science/master/img/novo_logo_bg_claro.png">

#### **Data Science na Prática 2.0**
*by [sigmoidal.ai](https://sigmoidal.ai)*

---

# Real Estate Prices in São Paulo

In this module, we will train a model to predict the sale price of apartments in the city of São Paulo and use that model to power a web application through a *deploy*.

Since the goal is to focus on building the *webapp* and on how to deploy an application, the exploratory analysis step is omitted here (I did it beforehand).

Having already identified the unnecessary and redundant columns, I will go straight to training, mainly to show how to export and import the model with the `joblib` library.

## Real Estate Data

The data used here was obtained from [this link](https://www.kaggle.com/argonalyst/sao-paulo-real-estate-sale-rent-april-2019) and was made publicly available by the startup OpenImob.

To make your project easier, I have made the `csv` file available at [this link](https://www.dropbox.com/s/h8blgaphkfpqsn5/sao-paulo-properties-april-2019.csv?dl=1), hosted on my Dropbox.

## Data Analysis and Cleaning

The original data contains 13,640 entries and 16 columns, with the `Price` column as our target variable.

In [3]:
# import the required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# load the dataset into a dataframe
url_dataset = "https://www.dropbox.com/s/h8blgaphkfpqsn5/sao-paulo-properties-april-2019.csv?dl=1"
df = pd.read_csv(url_dataset)

# show the first 5 entries
display(df.head())

Unnamed: 0,Price,Condo,Size,Rooms,Toilets,Suites,Parking,Elevator,Furnished,Swimming Pool,New,District,Negotiation Type,Property Type,Latitude,Longitude
0,930,220,47,2,2,1,1,0,0,0,0,Artur Alvim/São Paulo,rent,apartment,-23.543138,-46.479486
1,1000,148,45,2,2,1,1,0,0,0,0,Artur Alvim/São Paulo,rent,apartment,-23.550239,-46.480718
2,1000,100,48,2,2,1,1,0,0,0,0,Artur Alvim/São Paulo,rent,apartment,-23.542818,-46.485665
3,1000,200,48,2,2,1,1,0,0,0,0,Artur Alvim/São Paulo,rent,apartment,-23.547171,-46.483014
4,1300,410,55,2,2,1,1,1,0,0,0,Artur Alvim/São Paulo,rent,apartment,-23.525025,-46.482436


If you look at the output above, the district names carry information that is unnecessary for this particular *dataset*: the *string* `"/São Paulo"` is appended to the end of each name. Using `df_clean['District'].apply(lambda x: x.split('/')[0])`, I simply removed that suffix and left the column cleaner.

If you explore this *dataset* further, you will see that it covers two situations: rent or sale.

In [4]:
df_clean = df.copy()

# clean the district names
df_clean['District'] = df_clean['District'].apply(lambda x: x.split('/')[0])

# show the first 5 entries
df_clean.head()

Unnamed: 0,Price,Condo,Size,Rooms,Toilets,Suites,Parking,Elevator,Furnished,Swimming Pool,New,District,Negotiation Type,Property Type,Latitude,Longitude
0,930,220,47,2,2,1,1,0,0,0,0,Artur Alvim,rent,apartment,-23.543138,-46.479486
1,1000,148,45,2,2,1,1,0,0,0,0,Artur Alvim,rent,apartment,-23.550239,-46.480718
2,1000,100,48,2,2,1,1,0,0,0,0,Artur Alvim,rent,apartment,-23.542818,-46.485665
3,1000,200,48,2,2,1,1,0,0,0,0,Artur Alvim,rent,apartment,-23.547171,-46.483014
4,1300,410,55,2,2,1,1,1,0,0,0,Artur Alvim,rent,apartment,-23.525025,-46.482436
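
As noted above, the dataset mixes rent and sale listings. The notebook keeps both and lets the dummy variables encode the negotiation type. If one wanted to restrict the data to sale listings only, a minimal sketch could look like the following. The `sample` frame is made up for illustration and only borrows the column names from the dataset.

```python
import pandas as pd

# Small made-up sample that reuses the dataset's column names
sample = pd.DataFrame({
    "Price": [930, 1000, 450000, 380000],
    "Size": [47, 45, 70, 62],
    "Negotiation Type": ["rent", "rent", "sale", "sale"],
})

# Keep only the sale listings and drop the now-constant column
is_sale = sample["Negotiation Type"] == "sale"
sale_only = sample[is_sale].drop(columns="Negotiation Type")

print(sale_only.shape)
```

Filtering this way keeps monthly rents and sale prices, which sit on very different scales, from sharing the same target column.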


## Machine Learning Model

I arbitrarily chose a Random Forest model and looked at three main evaluation metrics: R², MAE, and MSE.

In [6]:
# dummy variables (one-hot encode the categorical columns)
df_clean = pd.get_dummies(df_clean)

# split into X (features) and y (target)
X_simp = df_clean.drop('Price', axis=1)
y_simp = df_clean['Price']

# split into training and test sets
X_train_simp, X_test_simp, y_train_simp, y_test_simp = train_test_split(X_simp, y_simp, test_size=0.33)

# instantiate and train the model
model = RandomForestRegressor(random_state=42)
model.fit(X_train_simp, y_train_simp)

# make predictions on the test set
y_pred_simp = model.predict(X_test_simp)

# evaluation metrics
print("r2: \t{:.4f}".format(r2_score(y_test_simp, y_pred_simp)))
print("MAE: \t{:.4f}".format(mean_absolute_error(y_test_simp, y_pred_simp)))
print("MSE: \t{:.4f}".format(mean_squared_error(y_test_simp, y_pred_simp)))

r2: 	0.9336
MAE: 	45915.7094
MSE: 	22882973423.1370
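
The MSE is expressed in squared price units, which makes it hard to compare with the MAE. A common complement is the RMSE, the square root of the MSE. The sketch below simply reuses the MSE value printed above; it was not part of the original run.

```python
import math

# MSE value copied from the output above
mse = 22882973423.1370

# RMSE is the square root of the MSE, in the same unit as Price
rmse = math.sqrt(mse)

print(f"RMSE: {rmse:.2f}")
```

The result is roughly 151 thousand, well above the MAE, which suggests that a few large errors weigh heavily on the squared metric.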


#### Saving the model

Our model is trained and able to make predictions. However, it is "stuck" in the *kernel* running inside Google Colab.

Imagine having to run every cell again each time you wanted to make a prediction. It would be impractical!

To export the *machine learning* model (in fact, this can be done with any picklable Python object), I will use the `joblib` library.

In [7]:
# save the model in joblib format
from joblib import dump, load

dump(model, 'model.joblib')

['model.joblib']
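
To illustrate that `dump` and `load` are not limited to models, here is a minimal sketch that round-trips a plain dictionary through a temporary file. The `metadata` dictionary and the file name are hypothetical and exist only for this example.

```python
import os
import tempfile

from joblib import dump, load

# Hypothetical object to persist; any picklable Python object would do
metadata = {"target": "Price", "test_size": 0.33, "columns": ["Condo", "Size", "Rooms"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "metadata.joblib")
    saved_files = dump(metadata, path)  # returns the list of files written
    restored = load(path)

print(restored == metadata)
```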

Once you export your model, it is extremely important to also save the names of the *features* the model expects to receive, in the exact order used during training.

Just as we did with the model, I saved the feature names in `features.names`.

In [None]:
# save the feature names of the simple model
features = X_train_simp.columns.values

dump(features, 'features.names')
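
The reason the order matters is that the web app has to rebuild a row with exactly the same columns before calling `predict`. A minimal sketch of that step, assuming the app receives a JSON payload, is shown below. The `feature_names` list is a hypothetical subset of the saved names, and the payload is made up.

```python
import json

import pandas as pd

# Hypothetical subset of the saved feature names, in training order
feature_names = ["Condo", "Size", "Rooms", "District_Artur Alvim", "District_Belém"]

# Made-up JSON payload: keys arrive in arbitrary order and some are missing
payload = json.loads('{"Size": 47, "District_Artur Alvim": 1, "Condo": 220}')

# Build a one-row frame, then force the training column order;
# features absent from the payload are filled with 0
row = pd.DataFrame([payload]).reindex(columns=feature_names, fill_value=0)

print(list(row.columns))
```

Using `reindex` keeps the column order independent of how the client orders the payload keys.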

#### Loading the model

Once you have saved the model to a file, you can load it again using `load()` from `joblib`.

In [None]:
# load the model and the feature names
new_model = load('model.joblib')
features = load('features.names')

In [None]:
# check the type of the new variable
type(new_model)
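
Beyond checking the type, a quick sanity check is to confirm that the reloaded model predicts exactly what the original did. The sketch below does this on a tiny synthetic dataset (not the São Paulo data); all names ending in `_toy` are made up for the example.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor

# Tiny synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.random((50, 3))
y_toy = X_toy @ np.array([1.0, 2.0, 3.0])

model_toy = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_toy, y_toy)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_toy.joblib")
    dump(model_toy, path)
    restored_toy = load(path)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(model_toy.predict(X_toy), restored_toy.predict(X_toy))
print(same_predictions)
```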

In [1]:
import sklearn
sklearn.__version__

'0.23.1'
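
The version is presumably printed because a model serialized with `joblib` is generally expected to be loaded with the same scikit-learn version that produced it. One way to make that explicit is to store the version next to the model and compare it at load time. This is a minimal sketch; the `model_meta.json` file name is an assumption, not something the original notebook creates.

```python
import json
import os
import tempfile

import sklearn

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical metadata file saved alongside the model
    meta_path = os.path.join(tmp_dir, "model_meta.json")

    # At export time: record the version used for training
    with open(meta_path, "w") as f:
        json.dump({"sklearn_version": sklearn.__version__}, f)

    # At load time: read it back in the serving environment
    with open(meta_path) as f:
        saved_version = json.load(f)["sklearn_version"]

# Compare against the version installed where the model is served
versions_match = saved_version == sklearn.__version__
print(versions_match)
```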

In [None]:
X_simp

Unnamed: 0,Condo,Size,Rooms,Toilets,Suites,Parking,Elevator,Furnished,Swimming Pool,New,Latitude,Longitude,District_Alto de Pinheiros,District_Anhanguera,District_Aricanduva,District_Artur Alvim,District_Barra Funda,District_Bela Vista,District_Belém,District_Bom Retiro,District_Brasilândia,District_Brooklin,District_Brás,District_Butantã,District_Cachoeirinha,District_Cambuci,District_Campo Belo,District_Campo Grande,District_Campo Limpo,District_Cangaíba,District_Capão Redondo,District_Carrão,District_Casa Verde,District_Cidade Ademar,District_Cidade Dutra,District_Cidade Líder,District_Cidade Tiradentes,District_Consolação,District_Cursino,District_Ermelino Matarazzo,...,District_Perus,District_Pinheiros,District_Pirituba,District_Ponte Rasa,District_Raposo Tavares,District_República,District_Rio Pequeno,District_Sacomã,District_Santa Cecília,District_Santana,District_Santo Amaro,District_Sapopemba,District_Saúde,District_Socorro,District_São Domingos,District_São Lucas,District_São Mateus,District_São Miguel,District_São Rafael,District_Sé,District_Tatuapé,District_Tremembé,District_Tucuruvi,District_Vila Andrade,District_Vila Curuçá,District_Vila Formosa,District_Vila Guilherme,District_Vila Jacuí,District_Vila Leopoldina,District_Vila Madalena,District_Vila Maria,District_Vila Mariana,District_Vila Matilde,District_Vila Olimpia,District_Vila Prudente,District_Vila Sônia,District_Água Rasa,Negotiation Type_rent,Negotiation Type_sale,Property Type_apartment
0,220,47,2,2,1,1,0,0,0,0,-23.543138,-46.479486,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
1,148,45,2,2,1,1,0,0,0,0,-23.550239,-46.480718,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
2,100,48,2,2,1,1,0,0,0,0,-23.542818,-46.485665,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
3,200,48,2,2,1,1,0,0,0,0,-23.547171,-46.483014,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
4,410,55,2,2,1,1,1,0,0,0,-23.525025,-46.482436,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13635,420,51,2,1,0,1,0,0,0,0,-23.653004,-46.635463,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1
13636,630,74,3,2,1,2,0,0,1,0,-23.648930,-46.641982,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1
13637,1100,114,3,3,1,1,0,0,1,0,-23.649693,-46.649783,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1
13638,48,39,1,2,1,1,0,1,1,0,-23.652060,-46.637046,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1


In [None]:
import numpy as np
import json

# map every feature name to 0 (one zero per column, hence shape[1])
dict(zip(X_simp.columns.values, np.zeros(X_simp.shape[1]).astype(int)))

{'Condo': 0,
 'District_Alto de Pinheiros': 0,
 'District_Anhanguera': 0,
 'District_Aricanduva': 0,
 'District_Artur Alvim': 0,
 'District_Barra Funda': 0,
 'District_Bela Vista': 0,
 'District_Belém': 0,
 'District_Bom Retiro': 0,
 'District_Brasilândia': 0,
 'District_Brooklin': 0,
 'District_Brás': 0,
 'District_Butantã': 0,
 'District_Cachoeirinha': 0,
 'District_Cambuci': 0,
 'District_Campo Belo': 0,
 'District_Campo Grande': 0,
 'District_Campo Limpo': 0,
 'District_Cangaíba': 0,
 'District_Capão Redondo': 0,
 'District_Carrão': 0,
 'District_Casa Verde': 0,
 'District_Cidade Ademar': 0,
 'District_Cidade Dutra': 0,
 'District_Cidade Líder': 0,
 'District_Cidade Tiradentes': 0,
 'District_Consolação': 0,
 'District_Cursino': 0,
 'District_Ermelino Matarazzo': 0,
 'District_Freguesia do Ó': 0,
 'District_Grajaú': 0,
 'District_Guaianazes': 0,
 'District_Iguatemi': 0,
 'District_Ipiranga': 0,
 'District_Itaim Bibi': 0,
 'District_Itaim Paulista': 0,
 'District_Itaquera': 0,
 'Dis

In [None]:
# serialize the same feature template to JSON (one zero per column, hence shape[1])
json_object = json.dumps(dict(zip(X_simp.columns.values, np.zeros(X_simp.shape[1]).astype(int).tolist())), indent=4)
print(json_object)

{
    "Condo": 0,
    "Size": 0,
    "Rooms": 0,
    "Toilets": 0,
    "Suites": 0,
    "Parking": 0,
    "Elevator": 0,
    "Furnished": 0,
    "Swimming Pool": 0,
    "New": 0,
    "Latitude": 0,
    "Longitude": 0,
    "District_Alto de Pinheiros": 0,
    "District_Anhanguera": 0,
    "District_Aricanduva": 0,
    "District_Artur Alvim": 0,
    "District_Barra Funda": 0,
    "District_Bela Vista": 0,
    "District_Bel\u00e9m": 0,
    "District_Bom Retiro": 0,
    "District_Brasil\u00e2ndia": 0,
    "District_Brooklin": 0,
    "District_Br\u00e1s": 0,
    "District_Butant\u00e3": 0,
    "District_Cachoeirinha": 0,
    "District_Cambuci": 0,
    "District_Campo Belo": 0,
    "District_Campo Grande": 0,
    "District_Campo Limpo": 0,
    "District_Canga\u00edba": 0,
    "District_Cap\u00e3o Redondo": 0,
    "District_Carr\u00e3o": 0,
    "District_Casa Verde": 0,
    "District_Cidade Ademar": 0,
    "District_Cidade Dutra": 0,
    "District_Cidade L\u00edder": 0,
    "District_Cidade T