## Data Science Lopes
The test below is meant to show us how you approach problem solving and how well you understand the data provided, and it gives you room to demonstrate your Data Science and Machine Learning skills.

The data describes Lopes real-estate listings and their prices; the features represent the characteristics of each property.

### Objective
The goal is for the candidate to **carry out a brief analysis of the data**, in whatever way they see fit, and to **develop a pricing (regression) model** using the `sale` column as the target (dependent variable). The suggested cost function to minimize is **RMSE**; if you prefer a different one, **justify your choice**.
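
For reference, RMSE is the square root of the mean squared difference between observed and predicted prices. The sketch below is a minimal NumPy version; the function name `rmse` and the toy values are illustrative assumptions, not part of the test material.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy prices invented for illustration only (not taken from train.csv).
example_true = [500_000.0, 800_000.0, 1_200_000.0]
example_pred = [450_000.0, 850_000.0, 1_200_000.0]
example_rmse = rmse(example_true, example_pred)
print(f"RMSE: {example_rmse:.2f}")
```

Because the errors are squared before averaging, a few badly mispriced properties can dominate the score, which is worth keeping in mind if you argue for a different metric.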

We suggest that the code include all the comments needed for us to follow the process, and that it follow good programming and software development practices.

### Recommendations
Take advantage of Jupyter Notebook's markdown cells and comments to document your line of reasoning.

### Final Files
The final deliverables are the notebook `train.ipynb` and the predictions of the model you developed, `output.csv`.

After training and tuning the model, run inference on `test.csv`, which contains every column of `train.csv` except the **target**. The final file must contain only two columns, `sku` and `predict`, as in the example below, and must be saved in `.csv` format under the name `output.csv`.

| sku | predict |
|---|---|
| 51173fe76f683f3ccd556ed32ce2f2ca | 750000 |
| b62f5788ed7c0f37d522085fe401dab7 | 230000 |
| cb40092e1db4824b5d352e2b46f4a478 | 856700 |
| 8253796889c2733a3a10fd4d1e9af1c5 | 1000000 |
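
One way to produce a file in this layout is sketched below. The helper `build_output` and the toy frame are hypothetical names used for illustration; only the column names `sku` and `predict` and the file name `output.csv` come from the instructions.

```python
import pandas as pd


def build_output(test_df: pd.DataFrame, predictions) -> pd.DataFrame:
    """Return a two-column frame (sku, predict) in the row order of test_df."""
    predictions = list(predictions)
    if len(predictions) != len(test_df):
        raise ValueError("expected exactly one prediction per test row")
    return pd.DataFrame({"sku": test_df["sku"].to_numpy(), "predict": predictions})


# Tiny made-up frame standing in for test.csv.
toy_test = pd.DataFrame({"sku": ["a1", "b2"], "living_area": [70.0, 120.0]})
toy_output = build_output(toy_test, [350000.0, 610000.0])

# index=False keeps the pandas row index out of the file,
# so it contains only the two required columns.
# toy_output.to_csv("output.csv", index=False)
```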

### Imports

In [1]:
# list all required imports below
import pandas as pd

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format', lambda x: '%.4f' % x)

### Reading the Data

In [2]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [3]:
print(f"train shape: {train.shape}\ntest shape: {test.shape}")

train shape: (39411, 37)
test shape: (4378, 36)
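
The shapes above (37 columns against 36) are consistent with `test.csv` lacking only the target. A quick sanity check along these lines can confirm it; `check_schema` is a hypothetical helper and the toy frames are made up for illustration.

```python
import pandas as pd


def check_schema(train_df: pd.DataFrame, test_df: pd.DataFrame, target: str = "sale"):
    """Return (missing, extra) column names of test_df relative to train_df minus target."""
    expected = [col for col in train_df.columns if col != target]
    missing = [col for col in expected if col not in test_df.columns]
    extra = [col for col in test_df.columns if col not in expected]
    return missing, extra


# Made-up miniature frames; with the real data, pass `train` and `test` instead.
toy_train = pd.DataFrame({"sku": ["a1"], "living_area": [70.0], "sale": [350000.0]})
toy_test = pd.DataFrame({"sku": ["b2"], "living_area": [120.0]})
missing, extra = check_schema(toy_train, toy_test)
print(f"missing in test: {missing}, unexpected in test: {extra}")
```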


In [4]:
train.head()

Unnamed: 0,sku,condominium_fee,iptu_fee,product_floor,status,category,bedroom_qty,living_area,suite_qty,bathroom_qty,parking,min_park_distance,min_train_distance,min_subway_distance,pet_care_c,piscina_coberta_c,piscina_descoberta_c,piscina_i,academia_de_ginastica_c,ar_condicionado_i,spa_c,bicicletario_c,deck_c,lavabo_i,salao_de_jogos_c,sauna_c,cozinha_americana_i,armario_embutido_i,cozinha_mobiliada_i,lavanderia_c,varanda_gourmet_i,churrasqueira_i,brinquedoteca_c,espaco_gourmet_c,playground_c,city,sale
0,51173fe76f683f3ccd556ed32ce2f2ca,1908.0,552.84,5.0,active,apartamento,3.0,166.0,1.0,3.0,2,0.3307,4.009,0.4869,0,0,0,0,0,0,0,0,0,1,0,0,1,1,1,0,0,0,0,0,0,sao paulo,1783892.2176
1,b62f5788ed7c0f37d522085fe401dab7,1746.0,289.0,11.0,active,apartamento,3.0,123.84,1.0,3.0,1,0.3307,4.009,0.4869,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,sao paulo,1284963.0454
2,cb40092e1db4824b5d352e2b46f4a478,1460.0,169.97,0.0,active,apartamento,2.0,104.0,0.0,0.0,2,0.5705,4.1132,0.5146,0,1,1,0,1,0,1,0,1,0,1,1,0,0,0,0,0,0,0,1,1,sao paulo,1739909.2404
3,8573b93d0835a1d9273e529a0def882f,2259.0,950.0,,active,apartamento,3.0,166.0,1.0,3.0,2,0.3307,4.009,0.4869,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,sao paulo,2107451.4011
4,8253796889c2733a3a10fd4d1e9af1c5,2000.0,610.0,8.0,active,apartamento,3.0,177.0,1.0,3.0,1,0.3307,4.009,0.4869,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,sao paulo,1906100.2876


### EDA
In the section below, develop your exploratory data analysis; it can focus on a single variable or be more general.

The idea is to understand the behavior and profile of the data, find patterns and anomalies, and so on.

In [5]:
train.describe()

Unnamed: 0,condominium_fee,iptu_fee,product_floor,bedroom_qty,living_area,suite_qty,bathroom_qty,parking,min_park_distance,min_train_distance,min_subway_distance,pet_care_c,piscina_coberta_c,piscina_descoberta_c,piscina_i,academia_de_ginastica_c,ar_condicionado_i,spa_c,bicicletario_c,deck_c,lavabo_i,salao_de_jogos_c,sauna_c,cozinha_americana_i,armario_embutido_i,cozinha_mobiliada_i,lavanderia_c,varanda_gourmet_i,churrasqueira_i,brinquedoteca_c,espaco_gourmet_c,playground_c,sale
count,29441.0,32084.0,37921.0,39386.0,39411.0,39245.0,39371.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0,39411.0
mean,6456.1247,687.8721,4.4675,2.6879,158.6503,1.0814,2.6695,2.0764,7.5075,8.9119,7.5321,0.0167,0.1167,0.2816,0.0209,0.3588,0.0289,0.0881,0.0193,0.0223,0.2458,0.3181,0.1062,0.0668,0.3783,0.296,0.0279,0.0211,0.1338,0.1125,0.0762,0.4118,1064288.8833
std,814097.8372,13249.4525,9.0738,1.1163,1752.7553,1.154,1.6588,1.668,175.572,174.9768,175.9303,0.1282,0.3211,0.4498,0.143,0.4797,0.1675,0.2834,0.1377,0.1477,0.4306,0.4657,0.3081,0.2497,0.485,0.4565,0.1647,0.1438,0.3404,0.316,0.2653,0.4922,783338.8206
min,0.01,0.01,-2.0,0.0,0.0,0.0,0.0,0.0,0.0131,0.0457,0.0234,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,102412.5043
25%,450.0,110.0,0.0,2.0,65.0,0.0,1.0,1.0,1.0833,1.7268,0.6831,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,470150.3869
50%,800.0,290.0,2.0,3.0,110.0,1.0,2.0,2.0,1.6761,2.8844,1.3924,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,807334.9417
75%,1480.0,646.32,7.0,3.0,190.0,1.0,4.0,3.0,2.4944,4.5561,2.4766,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1434706.3329
max,139600400.0,1800000.0,890.0,62.0,325540.0,30.0,51.0,44.0,5795.5733,5778.2582,5806.5216,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4021362.7453
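
The summary above already hints at two data-quality issues: several columns have fewer than 39,411 non-null values (for example `condominium_fee`, with 29,441), and some maxima sit far beyond the third quartile (a `condominium_fee` of 139,600,400 against a 75th percentile of 1,480). A possible way to put both on one screen is sketched below; `quality_report` is a hypothetical helper and the toy frame only imitates the pattern.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per numeric column: share of missing values plus median, 99th percentile and max."""
    numeric = df.select_dtypes(include="number")
    report = pd.DataFrame({
        "missing_pct": numeric.isna().mean() * 100,
        "median": numeric.median(),
        "p99": numeric.quantile(0.99),
        "max": numeric.max(),
    })
    # Columns with the most missing values come first.
    return report.sort_values("missing_pct", ascending=False)


# Toy frame invented for illustration; with the real data, call quality_report(train).
toy = pd.DataFrame({
    "condominium_fee": [450.0, 800.0, np.nan, 1480.0, 139600400.0],
    "living_area": [65.0, 110.0, 190.0, 80.0, 95.0],
})
report = quality_report(toy)
print(report)
```

A large gap between `p99` and `max` suggests a handful of extreme entries worth inspecting individually before deciding whether to clip, correct, or drop them.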


### Train Model
In the section below, develop the training of the pricing model using the target `sale`.

### Predictions

In [None]:
#predicts.to_csv("output.csv", index=False)