During Semana DataScience, an event run by Blog Minerando, a Machine Learning model was built to predict house prices in Boston. The dataset contains features that can influence the price of a house.

Importing the basic libraries



In [0]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

Reading and inspecting the dataset




In [0]:
df = pd.read_csv('data.csv')

In [1206]:
df.describe()

statistic,CRIM,INDUS,CHAS,NOX,RM,PTRATIO,B,LSTAT,MEDV
count,490.0,490.0,490.0,490.0,490.0,490.0,490.0,490.0,490.0
mean,3.643241,11.113143,0.059184,0.554307,5.740816,18.52,355.855449,12.92402,21.635918
std,8.722154,6.821302,0.236209,0.116688,0.737657,2.110478,92.634273,7.08318,7.865301
min,0.00632,0.74,0.0,0.385,3.0,12.6,0.32,1.98,5.0
25%,0.082045,5.19,0.0,0.449,5.0,17.4,375.9125,7.3475,16.7
50%,0.24751,9.69,0.0,0.538,6.0,19.1,391.77,11.675,20.9
75%,3.647422,18.1,0.0,0.624,6.0,20.2,396.3225,17.1175,24.675
max,88.9762,27.74,1.0,0.871,8.0,22.0,396.9,37.97,48.8


In [1207]:
df.shape

(490, 9)

In [1208]:
features = df.iloc[:,0:8].values
features.shape

(490, 8)

In [1209]:
alvo = df.iloc[:, 8].values
alvo.shape

(490,)

Handling null values

In [1210]:
df.isnull()

index,CRIM,INDUS,CHAS,NOX,RM,PTRATIO,B,LSTAT,MEDV
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
485,False,False,False,False,False,False,False,False,False
486,False,False,False,False,False,False,False,False,False
487,False,False,False,False,False,False,False,False,False
488,False,False,False,False,False,False,False,False,False


In [0]:
df.dropna(axis=0, inplace=True)

In [1212]:
df.isnull()

index,CRIM,INDUS,CHAS,NOX,RM,PTRATIO,B,LSTAT,MEDV
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
485,False,False,False,False,False,False,False,False,False
486,False,False,False,False,False,False,False,False,False
487,False,False,False,False,False,False,False,False,False
488,False,False,False,False,False,False,False,False,False
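Printing the full boolean frame makes it hard to tell whether any value is actually missing. The sketch below counts missing values per column instead. It uses a small made-up frame (the names `toy`, `features_toy` and `target_toy` and all values are illustrative, not taken from data.csv). It also drops incomplete rows before extracting the feature and target arrays, since arrays extracted earlier would still contain those rows.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (illustrative data, not from data.csv)
toy = pd.DataFrame({
    "RM": [6.0, np.nan, 7.0],
    "MEDV": [24.0, 21.6, 34.7],
})

# Count missing values per column instead of printing every cell
null_counts = toy.isnull().sum()
print(null_counts)

# Drop incomplete rows first, then extract features and target
clean = toy.dropna(axis=0)
features_toy = clean.iloc[:, 0:1].values
target_toy = clean.iloc[:, 1].values
print(features_toy.shape, target_toy.shape)
```

On the notebook's own data, every column reports a count of 490 in df.describe(), so no rows are expected to be dropped.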


In [1213]:
df.head()

index,CRIM,INDUS,CHAS,NOX,RM,PTRATIO,B,LSTAT,MEDV
0,0.00632,2.31,0.0,0.538,6,15.3,396.9,4.98,24.0
1,0.02731,7.07,0.0,0.469,6,17.8,396.9,9.14,21.6
2,0.02729,7.07,0.0,0.469,7,17.8,392.83,4.03,34.7
3,0.03237,2.18,0.0,0.458,6,18.7,394.63,2.94,33.4
4,0.06905,2.18,0.0,0.458,7,18.7,396.9,5.33,36.2


Splitting the dataset into training and test sets, with 70% of the rows for training and 30% for testing (test_size = 0.3).

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_teste, y_train, y_teste = train_test_split(features, alvo,
                                                      test_size=0.3,
                                                      random_state=0)
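To check how many rows end up in each part, the split can be reproduced on synthetic arrays with the same shape as features and alvo. This is a minimal sketch: `X_demo`, `y_demo` and the other `*_demo`-style names are placeholders, and the values are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic arrays shaped like the notebook's features (490, 8) and target (490,)
X_demo = np.arange(490 * 8, dtype=float).reshape(490, 8)
y_demo = np.arange(490, dtype=float)

# Same arguments as the notebook: 30% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

print(X_tr.shape, X_te.shape)
```

With 490 rows, this gives 343 training rows and 147 test rows.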

Building the model with a Random Forest regressor

In [0]:
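from sklearn.ensemble import RandomForestRegressor
# Random forest with 50 trees; each split considers 6 of the 8 features.
# random_state=None means the results vary slightly from run to run.
regressor = RandomForestRegressor(n_estimators=50, random_state=None, max_features=6)
regressor.fit(X_train, y_train)
# score() returns the coefficient of determination (R^2) on the test set
score = regressor.score(X_teste, y_teste)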

Checking the root mean squared error (RMSE): the square root of the mean of the squared differences between the actual values and the values predicted by the model.


In [0]:
y_pred = regressor.predict(X_teste)

In [0]:
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_teste, y_pred))
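The same quantity can be computed by hand with NumPy, which makes each step of the definition visible. The values below are made up for illustration and the `*_demo` names are placeholders, not part of the notebook.

```python
import numpy as np

# Illustrative values only (not the notebook's data)
y_true_demo = np.array([24.0, 22.0, 35.0, 33.0])
y_pred_demo = np.array([23.0, 23.0, 33.0, 35.0])

errors_demo = y_true_demo - y_pred_demo   # per-row differences: 1, -1, 2, -2
mse_demo = np.mean(errors_demo ** 2)      # mean of squared differences: 10 / 4
rmse_demo = np.sqrt(mse_demo)             # square root brings it back to the target's units

print(mse_demo, rmse_demo)
```

Because of the square root, the RMSE is expressed in the same units as MEDV, so it can be read as a typical prediction error.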

In [1218]:
score

0.8261008680553994

In [1219]:
print('Model results on the test data:')
print('\nRMSE: {}'.format(rmse))

Model results on the test data:

RMSE: 3.6397838090840997


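Creating a DataFrame, df_results, with one column for the actual values and one for the values predicted by the model. The column name valor_predito_reg_linear is kept as in the original notebook, although the predictions come from the random forest.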

In [0]:
df_results = pd.DataFrame()

In [0]:
df_results['valor_real'] = y_teste

In [0]:
df_results['valor_predito_reg_linear'] = y_pred

In [1223]:
df_results

index,valor_real,valor_predito_reg_linear
0,24.3,23.094
1,32.5,31.574
2,17.8,15.628
3,19.5,19.196
4,19.9,21.358
...,...,...
142,15.2,15.842
143,17.4,20.012
144,13.6,13.792
145,19.9,20.496
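The results frame can also be built in a single call, and a per-row error column makes the largest misses easy to find. This is a minimal sketch with made-up values; `results_demo` and the columns `valor_predito` and `erro_absoluto` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative values only (not the notebook's predictions)
y_true_demo = np.array([24.0, 32.0, 18.0])
y_hat_demo = np.array([23.0, 31.5, 15.5])

# Build both columns at once from a dict of arrays
results_demo = pd.DataFrame({
    "valor_real": y_true_demo,
    "valor_predito": y_hat_demo,
})

# Absolute error per row, useful for sorting by the worst predictions
results_demo["erro_absoluto"] = (
    results_demo["valor_real"] - results_demo["valor_predito"]
).abs()

print(results_demo.sort_values("erro_absoluto", ascending=False))
```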


Plotting the values predicted by the random forest against the actual values.

In [1224]:
import plotly.graph_objects as go

# Create traces
fig = go.Figure()

# Line with the actual test values
fig.add_trace(go.Scatter(x=df_results.index,
                         y=df_results.valor_real,
                         mode='lines+markers',
                         name='Actual value'))

# Line with the values predicted by the random forest
fig.add_trace(go.Scatter(x=df_results.index,
                         y=df_results.valor_predito_reg_linear,
                         mode='lines',
                         line=dict(color='#D2691E'),
                         name='Predicted value (random forest)'))

# Show the figure
fig.show()