# Study Project - _House Prices_

Hands-on study of house price regression (using the _California House Prices_ dataset).

---

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1EOKP1UJ1ZAVzjUIaUP6t61dE84sJj3nO?usp=sharing)

---

[Leonichel Guimarães (PIBITI/CNPq-FA-UEM)](https://github.com/leonichel)

Professor Linnyer Ruiz (advisor)

---

Reference:

GÉRON, Aurélien. _Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems_. 2. ed. O'Reilly Media, 2019.

---

Manna Team  |  UEM       |     CNPq
:----------:|:----------:|:----------:|
<img src="https://manna.team/_next/static/images/logo2-e283461cfa92b2105bfd67e8e530529e.png" alt="Manna Team" width="200"/> | <img src="https://marcoadp.github.io/WebSiteDIN/img/logo-uem2.svg" alt="UEM" width="200"/> | <img src="https://www.gov.br/cnpq/pt-br/canais_atendimento/identidade-visual/logo_cnpq.svg" alt="CNPq" width="200"/>

## Reading and exploring the dataset

### Importing libraries

In [None]:
# 1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Reading

In [None]:
# DEMONSTRATION ONLY - DO NOT RUN
data = pd.read_csv("/content/sample_data/california_housing_train.csv")
dataTest = pd.read_csv("/content/sample_data/california_housing_test.csv")
data = pd.concat([data, dataTest], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

data.head()

In [None]:
# 2
data = pd.read_csv("https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv")

data.head()

### Exploration

In [None]:
data.info();

In [None]:
data["ocean_proximity"].value_counts()

In [None]:
data.describe()

In [None]:
data.boxplot(['median_house_value'], figsize=(10, 10));

In [None]:
y = data['ocean_proximity'].value_counts()

plt.figure(figsize=(10,5))
plt.title('Ocean Proximity Summary')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Ocean Proximity', fontsize=12)

sns.barplot(x=y.index, y=y.values, alpha=0.7, palette="Set2")  # recent seaborn versions require keyword arguments

plt.show()

In [None]:
data.hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
data.isnull().sum()

### Creating a test set

In [None]:
# 3
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2, random_state=42)

housing = train.copy()

In [None]:
train.info()

In [None]:
test.info()

## Data visualization and analysis

### Geographic visualization

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1, figsize=(12, 8));

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", 
           alpha=0.4, s=housing["population"]/100, label="population", 
           figsize=(12,8), c="median_house_value", cmap=plt.get_cmap("rainbow"), 
           colorbar=True)
            
plt.legend();

Correlation analysis (_Pearson_ coefficient)

![Correlation](https://qph.fs.quoracdn.net/main-qimg-055d00ae63f6d33fc2d9c716af031f37.webp)
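
As a minimal sketch of what the coefficient measures, the snippet below computes Pearson's r from its definition (the covariance of the two variables divided by the product of their standard deviations) and compares it with `np.corrcoef`. The helper `pearson_r` and the arrays `income` and `price` are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y over the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

# Hypothetical toy values (not from the housing data)
income = np.array([1.5, 2.0, 3.5, 5.0, 8.0])
price = np.array([90.0, 120.0, 180.0, 260.0, 400.0])

r_manual = pearson_r(income, price)
r_numpy = np.corrcoef(income, price)[0, 1]
print(r_manual, r_numpy)
```

Values close to 1 or -1 indicate a strong linear relationship; values close to 0 indicate no linear relationship, although a non-linear one may still exist.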

In [None]:
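# numeric_only=True skips the text column ocean_proximity (needed in recent pandas versions)
housing.corr(numeric_only=True)

In [None]:
housing.corr(numeric_only=True)["median_house_value"].sort_values(ascending=False)

In [None]:
sns.heatmap(housing.corr(numeric_only=True), square=True)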

In [None]:
attributes = ["median_house_value", "median_income", "total_rooms",
"housing_median_age", "latitude"]

pd.plotting.scatter_matrix(housing[attributes], figsize=(14, 9), alpha=0.1);

### Experiments with combined attributes

In [None]:
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

housing.head()

In [None]:
housing.corr(numeric_only=True)["median_house_value"].sort_values(ascending=False)

## Preparing the data for the learning model

_scikit-learn_ objects are divided into:

* Estimators (_estimators_): learn parameters from the data with _fit()_
* Transformers (_transformers_): modify the data with _transform()_
* Predictors (_predictors_): produce predictions with _predict()_
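
A minimal sketch of the three roles on made-up data: `SimpleImputer` learns the column median with `fit()` and fills the gap with `transform()`, while `LinearRegression` is fitted and then queried with `predict()`. The arrays `X` and `y` are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Hypothetical toy feature with one missing value
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Estimator: fit() learns parameters from the data (here, the column median)
imputer = SimpleImputer(strategy="median")
imputer.fit(X)
print(imputer.statistics_)  # median learned from the non-missing values

# Transformer: transform() applies what was learned
X_filled = imputer.transform(X)

# Predictor: after fit(), predict() returns estimates for new inputs
model = LinearRegression().fit(X_filled, y)
print(model.predict(np.array([[5.0]])))
```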

### Resetting _housing_ and separating variables

In [None]:
#4
housing = train.drop("median_house_value", axis=1)
housing_labels = train["median_house_value"].copy()

housing_num = housing.drop("ocean_proximity", axis=1)

In [None]:
housing.head()

In [None]:
housing_num.head()

### Cleaning the data (_data cleaning_)

In [None]:
# DEMONSTRATION ONLY - DO NOT RUN

# Remove missing values
housing.dropna(subset=["total_bedrooms"]) # Option 1: drop the affected rows
housing.drop("total_bedrooms", axis=1) # Option 2: drop the whole column

# Or replace them with the median of the values
median = housing["total_bedrooms"].median() # Option 3
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)

# Or replace them using scikit-learn tools - Option 4
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num) # numeric columns only
X = imputer.transform(housing_num) # NumPy array
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index) # back into a DataFrame

### Handling non-numeric data

In [None]:
# DEMONSTRATION ONLY - DO NOT RUN

housing_cat = housing[["ocean_proximity"]] # the only text column

# Map categories to integers (0, 1, 2, 3, ...); best suited to categories with a natural order
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder() # create the encoder
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat) # apply the encoder

housing_cat_encoded # show the result
ordinal_encoder.categories_ # show the category names

# Map categories to one-hot vectors ([1 0 0 0], [0 1 0 0], ...); suited to unordered categories
# (with many classes, e.g. professions or postal codes, this produces one column per class)
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder() # create the encoder
housing_cat_1hot = cat_encoder.fit_transform(housing_cat) # apply the encoder (returns a sparse matrix)

housing_cat_1hot.toarray() # show the result
cat_encoder.categories_ # show the category names

### Feature scaling (_Feature Scaling_)

Fit the scaler on the training set only, then reuse it to transform the test set;

Two methods, with _scikit-learn_:

* Normalization (_min-max scaling_): values are rescaled to the range 0 to 1 (or another range); Equation: $X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$

* Standardization (_standardization_): less affected by _outliers_, and does not bound values to a specific range. Equation: $X_{stand} = \frac{X - X_{mean}}{sd}$, where $X_{mean}$ is the mean and $sd$ is the standard deviation

[Further reading 1](https://towardsdatascience.com/normalization-vs-standardization-quantitative-analysis-a91e8a79cebf)

[Further reading 2](https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/)
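
The two formulas can be checked by hand. The sketch below applies them with NumPy to a small made-up column containing one outlier and confirms that `MinMaxScaler` and `StandardScaler` return the same values. The array `X` is an illustrative assumption; note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical column with one outlier (100.0)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling: (X - X_min) / (X_max - X_min)
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: (X - mean) / standard deviation (population, ddof=0)
X_stand = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm.ravel())
print(X_stand.ravel())

# The scikit-learn scalers give the same values
assert np.allclose(X_norm, MinMaxScaler().fit_transform(X))
assert np.allclose(X_stand, StandardScaler().fit_transform(X))
```

With min-max scaling the outlier is mapped to 1 and the four ordinary values are squeezed into a narrow band near 0, which illustrates why this method is the more sensitive of the two to outliers.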

In [None]:
# DEMONSTRATION ONLY - DO NOT RUN

# Normalization (numeric columns only; the text column cannot be scaled)
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
housing_num_scaled = scaler.fit_transform(housing_num)

# Standardization

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
housing_num_scaled = scaler.fit_transform(housing_num)

### Custom transformers

In [None]:
# 5
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self # nothing else to do

    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]

        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]

        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

### Transformation pipelines (_Transformation Pipelines_)

Extremely important: the path, that is, the order of the steps the data go through when being transformed.
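
A minimal sketch of how a pipeline chains its steps in order, on made-up data: each step receives the output of the previous one, so missing values are imputed before the scaler sees them. The array `X` and the step names are illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with missing values in both columns
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 40.0]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # step 1: fill missing values
    ("scaler", StandardScaler()),                   # step 2: scale the filled data
])
X_pipe = pipe.fit_transform(X)

# Same result as applying the steps by hand, in the same order
X_step1 = SimpleImputer(strategy="median").fit_transform(X)
X_manual = StandardScaler().fit_transform(X_step1)

print(np.allclose(X_pipe, X_manual))
```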

In [None]:
# 6
# Numeric pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer    # data cleaning
from sklearn.preprocessing import OneHotEncoder # encodes text (categorical) data
from sklearn.preprocessing import StandardScaler # feature scaling

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
    ])

# housing_num_tr = num_pipeline.fit_transform(housing_num)

# Full pipeline, for both numeric and non-numeric data
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)

## Learning models

### Evaluation

We have not touched the test set yet, so we evaluate the model on the training set. Once we are confident in the model, we validate it on the test set. We use two methods:

* Mean squared error (_mean squared error_), reported as its square root (RMSE)
* Cross-validation (_cross-validation_)

![Cross-validation](https://slideplayer.com.br/slide/356790/2/images/14/Valida%C3%A7%C3%A3o+Cruzada.jpg)
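
A minimal sketch of both methods on synthetic data (an assumption made only for illustration): the RMSE is computed by hand and with `mean_squared_error`, and a 5-fold cross-validation is run with `cross_val_score`, whose `neg_mean_squared_error` scores are negated before taking the square root.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption for illustration): y = 3x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# RMSE by hand and with scikit-learn
rmse_manual = np.sqrt(np.mean((y - pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y, pred))

# 5-fold cross-validation: scores are negative MSE, so negate before the square root
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)
cv_rmse = np.sqrt(-scores)

print(rmse_manual, rmse_sklearn)
print(cv_rmse.mean(), cv_rmse.std())
```

The training-set RMSE measures the fit on data the model has already seen, while each cross-validation score comes from a fold held out during training, so the latter is usually the more realistic estimate.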

In [None]:
# 7
# Evaluation helpers
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

def mean_squared(labels, predictions):
    # Print the root mean squared error (RMSE) of the predictions
    mse = mean_squared_error(labels, predictions)
    rmse = np.sqrt(mse)

    print("RMSE:", rmse)

def cross_val(model, features, labels):
    # 10-fold cross-validation; scores are negative MSE, so negate before the square root
    scores = cross_val_score(model, features, labels,
                             scoring="neg_mean_squared_error", cv=10)

    rmse_scores = np.sqrt(-scores)

    print("Scores:", rmse_scores)
    print("Mean:", rmse_scores.mean())
    print("Standard deviation:", rmse_scores.std())

### Saving models

In [None]:
# 7
# Save
import joblib  # sklearn.externals.joblib was removed from scikit-learn; use the standalone package

def saveModel(model, name):

    joblib.dump(model, name)

# Load
# model = joblib.load("model.pkl")

### Ordinary least squares (_ordinary least squares_)

In [None]:
# Training
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

In [None]:
# Prediction
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
# Evaluation on the training set
housing_predictions = lin_reg.predict(housing_prepared)

mean_squared(housing_labels, housing_predictions)
cross_val(lin_reg, housing_prepared, housing_labels)

### _Ridge_ regression

In [None]:
# Training
from sklearn import linear_model

regRidge = linear_model.Ridge(alpha=.5)
regRidge.fit(housing_prepared, housing_labels)

In [None]:
# Prediction
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", regRidge.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
# Evaluation on the training set
housing_predictions = regRidge.predict(housing_prepared)

mean_squared(housing_labels, housing_predictions)
cross_val(regRidge, housing_prepared, housing_labels)

### LASSO regression

In [None]:
# Training
from sklearn import linear_model

regLasso = linear_model.Lasso(alpha=0.1)
regLasso.fit(housing_prepared, housing_labels)

In [None]:
# Prediction
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", regLasso.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
# Evaluation on the training set
housing_predictions = regLasso.predict(housing_prepared)

mean_squared(housing_labels, housing_predictions)
cross_val(regLasso, housing_prepared, housing_labels)

### Decision trees

In [None]:
# Training
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

In [None]:
# Prediction
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", tree_reg.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
# Evaluation on the training set
housing_predictions = tree_reg.predict(housing_prepared)

mean_squared(housing_labels, housing_predictions)
cross_val(tree_reg, housing_prepared, housing_labels)

### Support vector machine (_SVM_)

In [None]:
# Training
from sklearn.svm import SVR

svm_reg = SVR(kernel="linear")
svm_reg.fit(housing_prepared, housing_labels)

In [None]:
# Prediction
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", svm_reg.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
# Evaluation on the training set
housing_predictions = svm_reg.predict(housing_prepared)

mean_squared(housing_labels, housing_predictions)
cross_val(svm_reg, housing_prepared, housing_labels)

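### Saving and restoring models

In [None]:
# Save
# Note: forest_reg is only created in the random forest section below; train it before running this cell
saveModel(forest_reg, "model.pkl")

In [None]:
# Load
model = joblib.load("model.pkl")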
print("Predictions:", model.predict(some_data_prepared))
print("Labels:", list(some_labels))

## Fine-tuning the models (_fine-tuning_)

### Searching for the best models (_estimators_)

In [None]:
# DEMONSTRATION ONLY - DO NOT RUN - VERY SLOW

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LinearRegression
from sklearn import svm
from sklearn.linear_model import SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
 
pipeline = Pipeline(steps=[('estimator', LinearRegression())])

params_grid = [{
                'estimator':[LinearRegression()],
                },
                {
                'estimator': [svm.SVR()],
                'estimator__C': [1,2,3],
                'estimator__epsilon': [0.1,0.2,0.3],
                },
                {
                'estimator':[SGDRegressor()],
                'estimator__max_iter':[500,1000,1500],
               },
                {
                'estimator':[KNeighborsRegressor()],
                'estimator__n_neighbors':[3,5,7],
               },
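               {
                'estimator':[tree.DecisionTreeRegressor()],
                'estimator__criterion':['squared_error', 'poisson'],  # 'mse' was renamed to 'squared_error' in scikit-learn 1.0
               },
               {
                # note: GaussianNB is a classifier, not a regressor; kept because it appears in the recorded results below
                'estimator':[GaussianNB()]
               },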
              ]

grid_search = GridSearchCV(pipeline, params_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)

grid_search.fit(housing_prepared, housing_labels)

In [None]:
# Show the best hyperparameter combination
grid_search.best_params_

''' result:
{'estimator': KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=7, p=2,
                     weights='uniform'), 'estimator__n_neighbors': 7} '''

In [None]:
# Get the best model (estimator)
grid_search.best_estimator_

''' result:
Pipeline(memory=None,
         steps=[('estimator',
                 KNeighborsRegressor(algorithm='auto', leaf_size=30,
                                     metric='minkowski', metric_params=None,
                                     n_jobs=None, n_neighbors=7, p=2,
                                     weights='uniform'))],
         verbose=False) '''

In [None]:
# Show the results (RMSE for each combination)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

''' result:
67866.82839151102 {'estimator': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)}
118493.77560936943 {'estimator': SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False), 'estimator__C': 1, 'estimator__epsilon': 0.1}
118493.77560936943 {'estimator': SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False), 'estimator__C': 1, 'estimator__epsilon': 0.2}
118493.77560936943 {'estimator': SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False), 'estimator__C': 1, 'estimator__epsilon': 0.3}
118201.75022645184 {'estimator': SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False), 'estimator__C': 2, 'estimator__epsilon': 0.1}
118201.75022645184 {'estimator': SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False), 'estimator__C': 2, 'estimator__epsilon': 0.2}
118201.75022645184 {'estimator': SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
(...)
29964296.48683099 {'estimator': SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
             eta0=0.01, fit_intercept=True, l1_ratio=0.15,
             learning_rate='invscaling', loss='squared_loss', max_iter=1000,
             n_iter_no_change=5, penalty='l2', power_t=0.25, random_state=None,
             shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,
             warm_start=False), 'estimator__max_iter': 1500}
63796.83006909728 {'estimator': KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=7, p=2,
                    weights='uniform'), 'estimator__n_neighbors': 3}
61682.59497366035 {'estimator': KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=7, p=2,
                    weights='uniform'), 'estimator__n_neighbors': 5}
61160.46360411159 {'estimator': KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=7, p=2,
                    weights='uniform'), 'estimator__n_neighbors': 7}
(...)
96416.18105983316 {'estimator': GaussianNB(priors=None, var_smoothing=1e-09)} '''

### Ensemble methods (_ensembles_)

#### Random forest (_random forest_)

In [None]:
# Training
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)  # fixed: the original called fit on the undefined name `reg`

In [None]:
# Prediction
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", forest_reg.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
# Evaluation on the training set
housing_predictions = forest_reg.predict(housing_prepared)

mean_squared(housing_labels, housing_predictions)
cross_val(forest_reg, housing_prepared, housing_labels)

#### _AdaBoost_ regressor

In [None]:
# Training
from sklearn.ensemble import AdaBoostRegressor

regr = AdaBoostRegressor(random_state=42, n_estimators=100)
regr.fit(housing_prepared, housing_labels)

In [None]:
# Prediction
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", regr.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
# Evaluation on the training set
housing_predictions = regr.predict(housing_prepared)

mean_squared(housing_labels, housing_predictions)
cross_val(regr, housing_prepared, housing_labels)

#### _Gradient Boosting_ regressor

In [None]:
# Training
from sklearn.ensemble import GradientBoostingRegressor

reg = GradientBoostingRegressor(random_state=0)
reg.fit(housing_prepared, housing_labels)

In [None]:
# Prediction
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", reg.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
# Evaluation on the training set
housing_predictions = reg.predict(housing_prepared)

mean_squared(housing_labels, housing_predictions)
cross_val(reg, housing_prepared, housing_labels)

#### Stacking (_stacking_)

In [None]:
# Training
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import StackingRegressor

estimators = [
              ('kn', KNeighborsRegressor()),
              ('gb', GradientBoostingRegressor(random_state=42))
              ]

reg = StackingRegressor(
    estimators=estimators,
    final_estimator=RandomForestRegressor(n_estimators=10,random_state=42))

reg.fit(housing_prepared, housing_labels)

In [None]:
# Evaluation on the training set
housing_predictions = reg.predict(housing_prepared)

mean_squared(housing_labels, housing_predictions)
cross_val(reg, housing_prepared, housing_labels)

### Hyperparameter optimization

#### Grid search for optimal hyperparameter values

In [None]:
# 8
from sklearn.ensemble import RandomForestRegressor  # needed when only the numbered cells are run
from sklearn.model_selection import GridSearchCV

param_grid = [
    # first try 12 (3×4) hyperparameter combinations
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 more (2×3) combinations with bootstrap set to False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, for a total of (12+6)*5 = 90 training rounds
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
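
The count of 18 combinations and 90 training rounds stated in the cell comments can be checked with `ParameterGrid`, which enumerates the combinations of a grid without training anything. This is only a sketch; the grid is copied from the cell so that the snippet runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell, repeated so the snippet is self-contained
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

n_combinations = len(ParameterGrid(param_grid))  # 3*4 + 1*2*3
n_folds = 5
print(n_combinations, n_combinations * n_folds)
```

With the default `refit=True`, `GridSearchCV` also performs one extra fit on the whole training set using the best combination.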

In [None]:
# Show the best hyperparameter combination
grid_search.best_params_

In [None]:
# Get the best model (estimator)
grid_search.best_estimator_

In [None]:
# Show the results (RMSE for each combination)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

In [None]:
# Show the results as a table
pd.DataFrame(grid_search.cv_results_)

#### Randomized search for optimal hyperparameter values

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)

In [None]:
# Show the results (RMSE for each combination)
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

### Extras

#### Automated search for the best parameters

In [None]:
# DEMONSTRATION ONLY - DO NOT RUN
# Assumes `feature_importances` and `final_pipeline` from the demonstration cells below

param_grid = [{
    'preparation__num__imputer__strategy': ['mean', 'median', 'most_frequent'],
    'feature_selection__k': list(range(1, len(feature_importances) + 1))
}]

# The step names in param_grid match final_pipeline; the original referenced an undefined pipeline name
grid_search_prep = GridSearchCV(final_pipeline, param_grid, cv=5,
                                scoring='neg_mean_squared_error', verbose=2)
grid_search_prep.fit(housing, housing_labels)

#### Selecting the best attributes

In [None]:
# DEMONSTRATION ONLY - DO NOT RUN

from sklearn.base import BaseEstimator, TransformerMixin

def indices_of_top_k(arr, k):
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k
    def fit(self, X, y=None):
        self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)
        return self
    def transform(self, X):
        return X[:, self.feature_indices_]

In [None]:
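# DEMONSTRATION ONLY - DO NOT RUN

# Assumption: the importances come from the random forest found by the grid search in cell 8
feature_importances = grid_search.best_estimator_.feature_importances_

k = 5
top_k_feature_indices = indices_of_top_k(feature_importances, k)
top_k_feature_indices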

In [None]:
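# DEMONSTRATION ONLY - DO NOT RUN

# Column names in the order produced by full_pipeline (the earlier 5-name `attributes` list is too short)
extra_attribs = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
attributes = num_attribs + extra_attribs + list(full_pipeline.named_transformers_["cat"].categories_[0])
np.array(attributes)[top_k_feature_indices]
sorted(zip(feature_importances, attributes), reverse=True)[:k]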

#### Full pipeline

In [None]:
# DEMONSTRATION ONLY - DO NOT RUN

final_pipeline = Pipeline([
    ('preparation', full_pipeline),
    ('feature_selection', TopFeatureSelector(feature_importances, k)),
    ('reg', RandomForestRegressor(max_features=k, n_estimators=180))
])

final_pipeline.fit(housing, housing_labels)

## Validating the model on the test set

In [None]:
# 9
# Validation
final_model = grid_search.best_estimator_

X_test = test.drop("median_house_value", axis=1)
y_test = test["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)

final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse

In [None]:
# 10
# Compute the 95% confidence interval for the test RMSE
from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))

## Exporting the model

Note: to get from the start to this exported model, run every numbered cell from 1 to 11 (two cells are numbered 7; run both). The remaining cells were used for analysis and for building other models, which were not exported.

In [None]:
# 11 
saveModel(final_model, "Housing_RandomForest.pkl")