## Rental Property Project

#### This project consists of organizing and cleaning the data so that a predictive model can estimate rental
#### prices for properties.
#### The dataset was downloaded directly from Kaggle.
#### https://www.kaggle.com/rubenssjr/brasilian-houses-to-rent?select=houses_to_rent_v2.csv

In [1]:
# Imports for exploring the data
import pandas as pd
import seaborn as sns
import numpy as np
from math import sqrt
from scipy import stats
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
# scikit-learn imports

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

In [3]:
# Loading the dataset
housesDF = pd.read_csv("datasets_554905_1035602_houses_to_rent_v2.csv")
housesDF.head(5)

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$),total (R$)
0,São Paulo,70,2,1,1,7,acept,furnished,2065,3300,211,42,5618
1,São Paulo,320,4,4,0,20,acept,not furnished,1200,4960,1750,63,7973
2,Porto Alegre,80,1,1,1,6,acept,not furnished,1000,2800,0,41,3841
3,Porto Alegre,51,2,1,0,2,acept,not furnished,270,1112,22,17,1421
4,São Paulo,25,1,1,0,1,not acept,not furnished,0,800,25,11,836


In [4]:
# Checking the DataFrame dimensions
housesDF.shape

(10692, 13)

In [5]:
# Checking the data types
housesDF.dtypes

city                   object
area                    int64
rooms                   int64
bathroom                int64
parking spaces          int64
floor                  object
animal                 object
furniture              object
hoa (R$)                int64
rent amount (R$)        int64
property tax (R$)       int64
fire insurance (R$)     int64
total (R$)              int64
dtype: object

## Handling the categorical variables

In [6]:
housesDF.furniture = housesDF.furniture.map({"furnished": 1, "not furnished": 0})
housesDF.animal = housesDF.animal.map({"acept": 1, "not acept": 0})
housesDF.head()

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$),total (R$)
0,São Paulo,70,2,1,1,7,1,1,2065,3300,211,42,5618
1,São Paulo,320,4,4,0,20,1,0,1200,4960,1750,63,7973
2,Porto Alegre,80,1,1,1,6,1,0,1000,2800,0,41,3841
3,Porto Alegre,51,2,1,0,2,1,0,270,1112,22,17,1421
4,São Paulo,25,1,1,0,1,0,0,0,800,25,11,836


The floor feature contains the symbol "-", which needs to be handled.

In [7]:

housesDF.groupby('floor').size()

floor
-      2461
1      1081
10      357
11      303
12      257
13      200
14      170
15      147
16      109
17       96
18       75
19       53
2       985
20       44
21       42
22       24
23       25
24       19
25       25
26       20
27        8
28        6
29        5
3       931
301       1
32        2
35        1
4       748
46        1
5       600
51        1
6       539
7       497
8       490
9       369
dtype: int64

In [8]:
# Creating a backup copy of the first DataFrame
housesDF1 = housesDF.copy()
housesDF1.head(5)

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$),total (R$)
0,São Paulo,70,2,1,1,7,1,1,2065,3300,211,42,5618
1,São Paulo,320,4,4,0,20,1,0,1200,4960,1750,63,7973
2,Porto Alegre,80,1,1,1,6,1,0,1000,2800,0,41,3841
3,Porto Alegre,51,2,1,0,2,1,0,270,1112,22,17,1421
4,São Paulo,25,1,1,0,1,0,0,0,800,25,11,836


In [9]:
# Using the replace() method of the str accessor to substitute the symbol
# and checking the result
housesDF1.floor = housesDF1.floor.str.replace("-", "0")
housesDF1.groupby("floor").size()

floor
0      2461
1      1081
10      357
11      303
12      257
13      200
14      170
15      147
16      109
17       96
18       75
19       53
2       985
20       44
21       42
22       24
23       25
24       19
25       25
26       20
27        8
28        6
29        5
3       931
301       1
32        2
35        1
4       748
46        1
5       600
51        1
6       539
7       497
8       490
9       369
dtype: int64

In [161]:
# Number of properties per city
housesDF1.groupby('city').size()

city
Belo Horizonte    1258
Campinas           853
Porto Alegre      1193
Rio de Janeiro    1501
São Paulo         5887
dtype: int64

In [10]:
# Converting the floor column to a numeric type
housesDF1.floor = housesDF1.floor.astype(int)
housesDF1['rent amount (R$)'] = housesDF1['rent amount (R$)'].astype(float)
housesDF1.dtypes

city                    object
area                     int64
rooms                    int64
bathroom                 int64
parking spaces           int64
floor                    int32
animal                   int64
furniture                int64
hoa (R$)                 int64
rent amount (R$)       float64
property tax (R$)        int64
fire insurance (R$)      int64
total (R$)               int64
dtype: object

In [11]:
# Encoding the city variable with get_dummies()
housesDF2 = pd.get_dummies(housesDF1)
housesDF2.shape

(10692, 17)

In [12]:
housesDF2.head()

Unnamed: 0,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$),total (R$),city_Belo Horizonte,city_Campinas,city_Porto Alegre,city_Rio de Janeiro,city_São Paulo
0,70,2,1,1,7,1,1,2065,3300.0,211,42,5618,0,0,0,0,1
1,320,4,4,0,20,1,0,1200,4960.0,1750,63,7973,0,0,0,0,1
2,80,1,1,1,6,1,0,1000,2800.0,0,41,3841,0,0,1,0,0
3,51,2,1,0,2,1,0,270,1112.0,22,17,1421,0,0,1,0,0
4,25,1,1,0,1,0,0,0,800.0,25,11,836,0,0,0,0,1


In [13]:
# Checking that the encoding succeeded
housesDF2.dtypes

area                     int64
rooms                    int64
bathroom                 int64
parking spaces           int64
floor                    int32
animal                   int64
furniture                int64
hoa (R$)                 int64
rent amount (R$)       float64
property tax (R$)        int64
fire insurance (R$)      int64
total (R$)               int64
city_Belo Horizonte      uint8
city_Campinas            uint8
city_Porto Alegre        uint8
city_Rio de Janeiro      uint8
city_São Paulo           uint8
dtype: object

## Checking for missing and hidden missing values

In [14]:
housesDF2.shape

(10692, 17)

In [15]:
len(housesDF2.floor)

10692

Property tax

The property tax attribute contains zero values; however, not every property pays this tax, as explained
on the official site of the São Paulo city government:
https://www.prefeitura.sp.gov.br/cidade/secretarias/fazenda/servicos/iptu/index.php?p=2462
Each city has its own conditions for exemption, but since São Paulo accounts for the majority of the properties, its rules are used here.

Hoa

This feature refers to the condominium fee charged to property owners, and everyone in a condominium is required to pay it, as
explained at the link below:
https://blog.viasul.com/taxa-de-condominio/#:~:text=A%20taxa%20de%20condom%C3%ADnio%20deve,judicialmente%20o%20respons%C3%A1vel%20pelo%20im%C3%B3vel
It therefore needs to be treated carefully.

The remaining features also contain zero values, but these are entirely plausible.



In [16]:
# Checking for hidden missing values, i.e. zeros
print(len(housesDF2.loc[housesDF2['floor'] == 0]))
print(len(housesDF2.loc[housesDF2['total (R$)'] == 0]))
print(len(housesDF2.loc[housesDF2['area'] == 0]))
print(len(housesDF2.loc[housesDF2['rooms'] == 0]))
print(len(housesDF2.loc[housesDF2['bathroom'] == 0]))
print(len(housesDF2.loc[housesDF2['parking spaces'] == 0]))
print(len(housesDF2.loc[housesDF2['rent amount (R$)'] == 0]))
print(len(housesDF2.loc[housesDF2['hoa (R$)'] == 0]))
print(len(housesDF2.loc[housesDF2['property tax (R$)'] == 0]))
print(len(housesDF2.loc[housesDF2['fire insurance (R$)'] == 0]))

2461
0
0
0
0
2683
0
2373
1596
0
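The per-column zero counts above can also be obtained in a single expression. The sketch below uses a made-up miniature of the dataset (the `sample` frame and its values are illustrative assumptions, not rows taken from the real file) to show the idea.

```python
import pandas as pd

# Hypothetical miniature of the dataset, for illustration only
sample = pd.DataFrame({
    "floor": [7, 0, 6, 0],
    "hoa (R$)": [2065, 0, 1000, 270],
    "property tax (R$)": [211, 1750, 0, 0],
    "rent amount (R$)": [3300, 4960, 2800, 1112],
})

# Comparing the whole frame with 0 gives a boolean mask;
# summing it column-wise yields one zero count per column
zero_counts = (sample == 0).sum()
print(zero_counts)
```

Applied to `housesDF2`, the same expression would list every column at once, which makes it harder to forget one.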


In [17]:
# Inspecting the condominium fee values to find the best treatment
analize = housesDF2.groupby("hoa (R$)")
groups = analize.size().sort_index(ascending = False)
groups.head(30)

hoa (R$)
1117000    2
220000     1
200000     1
81150      1
32000      1
15000      1
14130      1
14000      1
10000      2
9900       1
9500       1
9000       1
8600       1
8500       2
8362       1
8300       1
8133       1
8043       1
8000       3
7963       1
7900       1
7774       1
7630       1
7552       1
7500       1
7400       2
7200       1
7100       1
7000       6
6900       2
dtype: int64

In [18]:
# Most fees fall within a fairly consistent range, while a few values are extreme.
# To avoid distorting the data, the zeros are replaced with the most frequent value.

housesDF3 = housesDF2.copy()
hoa = housesDF2['hoa (R$)'].values.reshape(-1, 1)  # SimpleImputer expects a 2D array

simpleImputer = SimpleImputer(missing_values=0, strategy='most_frequent')
housesDF3['hoa (R$)'] = simpleImputer.fit_transform(hoa).ravel()  # back to 1D for the column

In [19]:
housesDF3.head()

Unnamed: 0,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$),total (R$),city_Belo Horizonte,city_Campinas,city_Porto Alegre,city_Rio de Janeiro,city_São Paulo
0,70,2,1,1,7,1,1,2065,3300.0,211,42,5618,0,0,0,0,1
1,320,4,4,0,20,1,0,1200,4960.0,1750,63,7973,0,0,0,0,1
2,80,1,1,1,6,1,0,1000,2800.0,0,41,3841,0,0,1,0,0
3,51,2,1,0,2,1,0,270,1112.0,22,17,1421,0,0,1,0,0
4,25,1,1,0,1,0,0,400,800.0,25,11,836,0,0,0,0,1


In [20]:
# Checking that it worked
print(len(housesDF3.loc[housesDF3['hoa (R$)'] == 0]))

0
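To make the imputation step easier to follow, here is a minimal sketch on a made-up fee column (the numbers are assumptions chosen for illustration). It treats `0` as the missing marker and fills it with the most frequent non-zero value, which `SimpleImputer` exposes in `statistics_` after fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical fee values; 0 plays the role of the hidden missing value
hoa_demo = np.array([0, 400, 400, 1200, 0, 270]).reshape(-1, 1)

imputer = SimpleImputer(missing_values=0, strategy="most_frequent")
filled = imputer.fit_transform(hoa_demo).ravel()

print(imputer.statistics_)  # value used to fill each column
print(filled)
```

Inspecting `statistics_` on the real data is a quick way to confirm which fee the zeros were replaced with.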


## Checking the data distribution and changing the scale

In [21]:
# Checking the skewness of the data
# only the numeric variables are of interest here, although the categorical ones are included
housesDF3.skew()

area                   69.596804
rooms                   0.702391
bathroom                1.213810
parking spaces          1.487534
floor                  11.816997
animal                 -1.336494
furniture               1.193954
hoa (R$)               69.099460
rent amount (R$)        1.838877
property tax (R$)      96.013594
fire insurance (R$)     1.970400
total (R$)             58.960803
city_Belo Horizonte     2.373633
city_Campinas           3.102254
city_Porto Alegre       2.467708
city_Rio de Janeiro     2.070692
city_São Paulo         -0.203467
dtype: float64

In [22]:
# Checking the kurtosis of the data
# only the numeric variables are of interest here, although the categorical ones are included
housesDF3.kurt()

area                   5548.308334
rooms                     1.487659
bathroom                  1.134852
parking spaces            2.769075
floor                   529.389095
animal                   -0.213825
furniture                -0.574583
hoa (R$)               4917.994375
rent amount (R$)          4.624228
property tax (R$)      9667.782564
fire insurance (R$)       5.934963
total (R$)             3926.019305
city_Belo Horizonte       3.634813
city_Campinas             7.625406
city_Porto Alegre         4.090346
city_Rio de Janeiro       2.288194
city_São Paulo           -1.958968
dtype: float64

### Feature selection and checking the correlation level
The city dummy variables show only a weak, mostly negative correlation with the rent amount, so
they are removed here through manual feature selection. The "total (R$)" feature is also removed, since it makes no sense to have
the total amount due when the rent amount is unknown.

In [23]:
# Using Pearson's method, which assumes the data are roughly normally distributed
housesDF3.corr(method='pearson')

Unnamed: 0,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$),total (R$),city_Belo Horizonte,city_Campinas,city_Porto Alegre,city_Rio de Janeiro,city_São Paulo
area,1.0,0.193796,0.226766,0.193983,-0.012724,0.039626,0.008175,0.007607,0.180742,0.039059,0.188078,0.051799,0.039573,-0.006391,-0.0301,-0.033015,0.019956
rooms,0.193796,1.0,0.733763,0.61751,-0.078687,0.17219,-0.080694,0.00931,0.541758,0.075252,0.565148,0.134597,0.160442,-0.037927,-0.110521,-0.090485,0.04988
bathroom,0.226766,0.733763,1.0,0.697379,0.004894,0.118255,0.017938,0.052132,0.668504,0.109253,0.676399,0.208339,0.042927,-0.057893,-0.128674,-0.138039,0.181574
parking spaces,0.193983,0.61751,0.697379,1.0,-0.020767,0.127432,-0.00472,0.011768,0.578361,0.098378,0.597348,0.148684,0.079569,-0.009469,-0.125913,-0.219921,0.186898
floor,-0.012724,-0.078687,0.004894,-0.020767,1.0,-0.021851,0.105994,0.015431,0.073596,0.012626,0.013652,0.036431,-0.072633,-0.038222,-0.067114,0.012153,0.101859
animal,0.039626,0.17219,0.118255,0.127432,-0.021851,1.0,-0.087972,-0.022003,0.067754,-0.003006,0.079152,-0.007143,-0.033127,0.02454,0.055098,0.024321,-0.043768
furniture,0.008175,-0.080694,0.017938,-0.00472,0.105994,-0.087972,1.0,0.001975,0.164235,0.000985,0.141768,0.037781,-0.087635,-0.077911,0.01953,0.025182,0.069255
hoa (R$),0.007607,0.00931,0.052132,0.011768,0.015431,-0.022003,0.001975,1.0,0.037908,0.007697,0.031854,0.955284,0.027699,-0.010655,-0.016262,-0.004029,0.000969
rent amount (R$),0.180742,0.541758,0.668504,0.578361,0.073596,0.067754,0.164235,0.037908,1.0,0.107884,0.987343,0.26449,-0.024869,-0.132342,-0.162051,-0.07865,0.24569
property tax (R$),0.039059,0.075252,0.109253,0.098378,0.012626,-0.003006,0.000985,0.007697,0.107884,1.0,0.105661,0.218344,-0.011036,-0.020754,-0.027675,-0.014285,0.045946
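When the goal is to pick features by their relationship with the target, it can help to look at a single column of the correlation matrix, ordered by strength. The following is only a sketch on a small made-up frame (the `sample` data is an assumption), not the ranking for the real dataset.

```python
import pandas as pd

# Hypothetical rows, for illustration only
sample = pd.DataFrame({
    "rooms": [2, 4, 1, 2, 1, 3],
    "bathroom": [1, 4, 1, 1, 1, 2],
    "floor": [7, 20, 6, 2, 1, 3],
    "rent amount (R$)": [3300.0, 4960.0, 2800.0, 1112.0, 800.0, 3500.0],
})
target = "rent amount (R$)"

# Correlation of every feature with the target, target itself excluded
corr = sample.corr(method="pearson")[target].drop(target)

# Order by absolute strength while keeping the sign of each coefficient
corr_with_target = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(corr_with_target)
```

Sorting by absolute value keeps strongly negative features near the top instead of treating them as weak.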


In [24]:
# Removing the categorical variables with low correlation, plus the total column
columnsList = ['city_Belo Horizonte', 'city_Campinas', 'city_Porto Alegre', 'city_Rio de Janeiro', 'city_São Paulo', 'total (R$)']
dfSemCategorica = housesDF3.drop(columns= columnsList)

dfSemCategorica.head()

Unnamed: 0,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$)
0,70,2,1,1,7,1,1,2065,3300.0,211,42
1,320,4,4,0,20,1,0,1200,4960.0,1750,63
2,80,1,1,1,6,1,0,1000,2800.0,0,41
3,51,2,1,0,2,1,0,270,1112.0,22,17
4,25,1,1,0,1,0,0,400,800.0,25,11


In [25]:
# Standardizing the numeric variables
scaler = StandardScaler().fit(dfSemCategorica)
scalerDF = scaler.transform(dfSemCategorica)


In [26]:
# Building the DataFrame and naming the columns
columnsNames = ['area', 'rooms', 'bathroom', 'parking spaces', 'floor', 'animal', 'furniture', 'hoa (R$)', 'rent amount (R$)', 'property tax (R$)', 'fire insurance (R$)']
dfStandard = pd.DataFrame(scalerDF, columns= columnsNames)
dfStandard.head()

Unnamed: 0,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$)
0,-0.147522,-0.432099,-0.87896,-0.383245,0.318352,0.534522,1.761488,0.05147,-0.174935,-0.050103,-0.236589
1,0.318035,1.275535,1.253036,-1.012395,2.460468,0.534522,-0.567702,-0.004029,0.312099,0.445121,0.203056
2,-0.128899,-1.285916,-0.87896,-0.383245,0.153574,0.534522,-0.567702,-0.016861,-0.321632,-0.117999,-0.257525
3,-0.182904,-0.432099,-0.87896,-1.012395,-0.505538,0.534522,-0.567702,-0.063699,-0.816881,-0.11092,-0.759976
4,-0.231322,-1.285916,-0.87896,-1.012395,-0.670317,-1.870829,-0.567702,-0.055358,-0.90842,-0.109955,-0.885589


In [27]:
# Removing outliers with scipy: keep rows where every |z-score| is below 3
dfNoOutliers = dfStandard[(np.abs(stats.zscore(dfStandard)) < 3).all(axis=1)]
dfNoOutliers

Unnamed: 0,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$)
0,-0.147522,-0.432099,-0.878960,-0.383245,0.318352,0.534522,1.761488,0.051470,-0.174935,-0.050103,-0.236589
1,0.318035,1.275535,1.253036,-1.012395,2.460468,0.534522,-0.567702,-0.004029,0.312099,0.445121,0.203056
2,-0.128899,-1.285916,-0.878960,-0.383245,0.153574,0.534522,-0.567702,-0.016861,-0.321632,-0.117999,-0.257525
3,-0.182904,-0.432099,-0.878960,-1.012395,-0.505538,0.534522,-0.567702,-0.063699,-0.816881,-0.110920,-0.759976
4,-0.231322,-1.285916,-0.878960,-1.012395,-0.670317,-1.870829,-0.567702,-0.055358,-0.908420,-0.109955,-0.885589
...,...,...,...,...,...,...,...,...,...,...,...
10686,0.001456,0.421718,0.542371,0.245905,0.483130,-1.870829,1.761488,-0.055358,2.817683,-0.117999,2.485023
10687,-0.160557,-0.432099,-0.878960,-0.383245,-0.011204,-1.870829,1.761488,-0.055230,-0.709499,-0.110276,-0.655299
10689,-0.147522,0.421718,0.542371,-1.012395,0.483130,-1.870829,1.761488,-0.018145,0.617228,-0.011167,0.517088
10690,-0.054410,-0.432099,-0.168294,0.245905,0.483130,0.534522,1.761488,0.020673,2.377592,-0.028222,2.129120
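A small sketch, on made-up numbers, of how the placement of parentheses changes a z-score filter. Taking `np.abs` of the boolean comparison only rejects high values, whereas comparing `np.abs(z)` with the threshold rejects both tails. The array `values` is an assumption built for the demonstration.

```python
import numpy as np
from scipy import stats

# 98 well-behaved points plus one high and one low outlier (hypothetical data)
values = np.concatenate([np.linspace(-1, 1, 98), [8.0, -8.0]]).reshape(-1, 1)
z = stats.zscore(values)

# abs() applied to a boolean mask is a no-op: only z >= 3 is rejected
upper_only = np.abs(z < 3).all(axis=1)

# abs() applied to the z-scores: both z >= 3 and z <= -3 are rejected
two_sided = (np.abs(z) < 3).all(axis=1)

print(upper_only.sum(), two_sided.sum())
```

On strongly right-skewed data the two masks may happen to agree, but the two-sided form is the one that matches the usual "three standard deviations" rule.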


## Testing some machine learning models

As the results below show, linear regression, CART, Ridge and the support vector machine
performed best, with a mean squared error close to 0 on the standardized target. Since these results are already very satisfactory,
hyperparameter optimization is not needed in this particular case.

In [28]:
finalDF = dfNoOutliers.copy()

### Linear regression

In [29]:
# Separating the predictors from the target and cross-validating the model

X = finalDF.drop(columns = ['rent amount (R$)'])
Y = finalDF['rent amount (R$)']

fold = 40
seed = 7

lr = LinearRegression()

kfold = KFold(fold, shuffle= True, random_state= seed)
result = cross_val_score(lr, X, Y, cv= kfold, scoring= 'neg_mean_squared_error')

print("Linear regression cross-validated MSE: %.2f" %(np.abs(result.mean())))

Linear regression cross-validated MSE: 0.01


### Lasso

In [30]:
X = finalDF.drop(columns = ['rent amount (R$)'])
Y = finalDF['rent amount (R$)']

fold = 10
seed = 7

lasso = Lasso()

kfold = KFold(fold, shuffle= True, random_state= seed)
result = cross_val_score(lasso, X, Y, cv= kfold, scoring= 'neg_mean_squared_error')

print("Lasso cross-validated MSE: %.2f" %(abs(result.mean())))

Lasso cross-validated MSE: 0.66


### KNN

In [31]:
X = finalDF.drop(columns = ['rent amount (R$)'])
Y = finalDF['rent amount (R$)']

fold = 10
seed = 7

knn = KNeighborsRegressor()

kfold = KFold(fold, shuffle= True, random_state= seed)
result = cross_val_score(knn, X, Y, cv= kfold, scoring= 'neg_mean_squared_error')

print("KNN cross-validated MSE: %.2f" %(np.abs(result.mean())))

KNN cross-validated MSE: 0.03


### Cart

In [146]:
X = finalDF.drop(columns = ['rent amount (R$)'])
Y = finalDF['rent amount (R$)']

fold = 10
seed = 7

cart = DecisionTreeRegressor()

kfold = KFold(fold, shuffle= True, random_state= seed)
result = cross_val_score(cart, X, Y, cv= kfold, scoring= 'neg_mean_squared_error')

print("Decision tree (CART) cross-validated MSE: %.2f" %(np.abs(result.mean())))

Decision tree (CART) cross-validated MSE: 0.01


### Ridge

In [33]:
X = finalDF.drop(columns = ['rent amount (R$)'])
Y = finalDF['rent amount (R$)']

fold = 10
seed = 7

ridge = Ridge()

kfold = KFold(fold, shuffle = True, random_state = seed)
result = cross_val_score(ridge, X, Y, cv = kfold, scoring= 'neg_mean_squared_error')

print("Ridge cross-validated MSE: %.2f" %(np.abs(result.mean())))

Ridge cross-validated MSE: 0.01


### Elastic Net

In [34]:
X = finalDF.drop(columns = ['rent amount (R$)'])
Y = finalDF['rent amount (R$)']

fold = 10
seed = 7

elastic = ElasticNet()

kfold = KFold(fold, shuffle = True, random_state = seed)
result = cross_val_score(elastic, X, Y, cv = kfold, scoring= 'neg_mean_squared_error')

print("Elastic Net cross-validated MSE: %.2f" %(np.abs(result.mean())))

Elastic Net cross-validated MSE: 0.52


### Support Vector Machine

In [35]:
X = finalDF.drop(columns = ['rent amount (R$)'])
Y = finalDF['rent amount (R$)']

fold = 10
seed = 7

svr = SVR()

kfold = KFold(fold, shuffle = True, random_state = seed)
result = cross_val_score(svr, X, Y, cv = kfold, scoring= 'neg_mean_squared_error')

print("Support Vector Machine cross-validated MSE: %.2f" %(np.abs(result.mean())))

Support Vector Machine cross-validated MSE: 0.01
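The cells above repeat the same evaluation for each algorithm. One possible way to express that comparison as a single loop is sketched below. It runs on synthetic data from `make_regression` (an assumption standing in for `X` and `Y`), and the names `X_demo`, `y_demo`, `models` and `mse_by_model` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real features and target
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=7)

models = {
    "Linear regression": LinearRegression(),
    "Lasso": Lasso(),
    "KNN": KNeighborsRegressor(),
    "CART": DecisionTreeRegressor(random_state=7),
    "Ridge": Ridge(),
    "Elastic Net": ElasticNet(),
    "SVR": SVR(),
}

# Same folds for every model so the scores are comparable
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

mse_by_model = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=kfold,
                             scoring="neg_mean_squared_error")
    # The scorer returns negated MSE, so flip the sign back
    mse_by_model[name] = -scores.mean()
    print("%s cross-validated MSE: %.2f" % (name, mse_by_model[name]))
```

Sharing one `KFold` object across models also avoids the situation where one algorithm is evaluated with a different number of folds than the others.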


### Preparing the model for production

For me as a data scientist, the cells above with the machine learning algorithms are sufficient: I reached my goal. However, for
the model to go into production, that is, to actually start being used, it is important to simplify the training and use
of the algorithm as much as possible, so that the machine learning engineer can deploy it and carry out the remaining preparation for the end user.

In [36]:
# Preparation: removing outliers from the DataFrame from which the categorical variables were
# dropped earlier during feature selection

dfProducao = dfSemCategorica[(np.abs(stats.zscore(dfSemCategorica)) < 3).all(axis=1)]
dfProducao.head(1)

Unnamed: 0,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$)
0,70,2,1,1,7,1,1,2065,3300.0,211,42


In [140]:
# Splitting the data into training and test sets for the chosen models

X = finalDF.drop(columns = ['rent amount (R$)'])
Y = finalDF['rent amount (R$)']
seed = 7

x_treino, x_test, y_treino, y_test = train_test_split(X, Y, test_size = 0.3, random_state = seed)


In [148]:
# Creating the pipelines

pipelr = Pipeline([('standard', StandardScaler()), ('lr', LinearRegression())])
pipeCart = Pipeline([('standard', StandardScaler()), ('cart', DecisionTreeRegressor())])
pipeSVM = Pipeline([('standard', StandardScaler()), ('svr', SVR())])
pipeRidge = Pipeline([('standard', StandardScaler()), ('ridge', Ridge())])

### Training the pipelines

In [149]:

pipelr.fit(x_treino, y_treino)

Pipeline(memory=None,
         steps=[('standard',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('lr',
                 LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                  normalize=False))],
         verbose=False)

In [150]:
pipeCart.fit(x_treino, y_treino)

Pipeline(memory=None,
         steps=[('standard',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('cart',
                 DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse',
                                       max_depth=None, max_features=None,
                                       max_leaf_nodes=None,
                                       min_impurity_decrease=0.0,
                                       min_impurity_split=None,
                                       min_samples_leaf=1, min_samples_split=2,
                                       min_weight_fraction_leaf=0.0,
                                       presort='deprecated', random_state=None,
                                       splitter='best'))],
         verbose=False)

In [151]:
pipeSVM.fit(x_treino, y_treino)

Pipeline(memory=None,
         steps=[('standard',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='scale', kernel='rbf', max_iter=-1, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

In [152]:
pipeRidge.fit(x_treino, y_treino)

Pipeline(memory=None,
         steps=[('standard',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('ridge',
                 Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                       max_iter=None, normalize=False, random_state=None,
                       solver='auto', tol=0.001))],
         verbose=False)

The models' scores remain practically unchanged.
Note that the mean squared error is non-negative and has no upper bound, with values close to zero being ideal; it is small here because the target was standardized.
The 0.01 shown earlier is a value rounded to two decimal places, and it corresponds to an R² of roughly 98%, not 99%.

In [158]:
print("R² of the Linear Regression model %.2f%%" %(pipelr.score(x_test, y_test) * 100))
print("R² of the CART model %.2f%%" %(pipeCart.score(x_test, y_test)* 100))
print("R² of the Support Vector Machine model %.2f%%" %(pipeSVM.score(x_test, y_test) * 100))
print("R² of the Ridge model %.2f%%" %(pipeRidge.score(x_test, y_test) * 100))

R² of the Linear Regression model 98.57%
R² of the CART model 98.46%
R² of the Support Vector Machine model 98.36%
R² of the Ridge model 98.57%
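For regressors, scikit-learn's `score()` returns the coefficient of determination R², which relates to the mean squared error through R² = 1 - MSE / Var(y). The sketch below checks that identity on made-up standardized-looking values (`y_true` and `y_hat` are assumptions, not predictions from the models above).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions, each off by 0.1
y_true = np.array([-1.2, -0.5, 0.0, 0.4, 1.3])
y_hat = np.array([-1.1, -0.6, 0.1, 0.5, 1.2])

mse = mean_squared_error(y_true, y_hat)
r2 = r2_score(y_true, y_hat)

# Same R² rebuilt from the MSE and the (population) variance of the target
r2_from_mse = 1 - mse / np.var(y_true)

print("MSE: %.4f  R²: %.4f  1 - MSE/Var: %.4f" % (mse, r2, r2_from_mse))
```

This is why an MSE near 0.01 on a target with variance close to 1 goes together with an R² in the high nineties.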


## Values required for prediction
### [area, rooms, bathroom, parking spaces, floor, animal, furniture, hoa (R$), property tax (R$), fire insurance (R$)]

In [159]:
# Testing the model with hypothetical values
teste = np.array([70, 3, 4, 2, 3, 1, 0, 2000, 300, 100]).reshape(1, -1)
teste

array([[  70,    3,    4,    2,    3,    1,    0, 2000,  300,  100]])

In [160]:
# Predicting the price
pipelr.predict(teste)

array([1961.17128654])

### <html><font size = 6><font color = 'blue'>Conclusion</font></font></html>

This machine learning model is complete and able to predict rental prices with high precision, and it is also
ready to go into production. Using pipelines makes it easier for machine learning engineers to deploy the project, because it works like encapsulating the code, a concept from OOP (Object-Oriented Programming), and consequently automates
the whole process.
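One common way to hand a fitted pipeline over for deployment is to serialize it with `joblib`. The sketch below assumes that approach; the notebook itself does not say how the model is shipped. The data, the coefficients and the file name `rent_pipeline.joblib` are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical raw-unit features and target, for illustration only
rng = np.random.default_rng(7)
X_demo = rng.uniform(20, 300, size=(50, 3))
y_demo = 10.0 * X_demo[:, 0] + 200.0 * X_demo[:, 1] + rng.normal(0, 50, size=50)

# Scaling and model travel together, so callers can pass raw values
pipe = Pipeline([("standard", StandardScaler()), ("lr", LinearRegression())])
pipe.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rent_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, path)          # what the data scientist would hand over
    restored = joblib.load(path)     # what the serving code would load

same_predictions = np.allclose(restored.predict(X_demo[:5]), pipe.predict(X_demo[:5]))
print(same_predictions)
```

Because the scaler is stored inside the pipeline, the serving side does not need to reproduce any preprocessing by hand, provided the pipeline was fitted on features in their original units.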