### Modelos baseados em árvores

<br>

Ao longo do módulo, discutimos bastante as árvores de decisão, bem como ensemble de árvores, como Random Forest e algoritmos do tipo boosting.

Esses __ensembles acabam tendo muitos hiperparâmetros;__ escolhe-los de forma manual acaba sendo muito custoso e tedioso. 

Neste exercício, vamos discutir a respeito da metolodia __grid-search__, que otimiza essa busca de hiperparâmetros.

Considere o dataset abaixo (basta executar as células):

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

In [3]:
#problema de regressão

X, y = load_diabetes().data, load_diabetes().target
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = 0.25, random_state = 42)
print(Xtrain.shape, Xtest.shape, ytrain.shape, ytest.shape)

(331, 10) (111, 10) (331,) (111,)


Imagine que queremos testar - usando cross-validation - várias instâncias de Random Forests: com 10 árvores, com 100 árvores, com 1000 árvores, com profundidade máxima 1, 5, 10. 

Como podemos proceder? O código abaixo exemplifica um jeito:

In [4]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

for n_est in [10,100,100]:
    for prof in [1,5,10]:
        rf = RandomForestRegressor(n_estimators=n_est, max_depth=prof)
        cvres = cross_val_score(estimator=rf, X = Xtrain, y = ytrain, cv = 3, scoring='r2')
        print("estimators: ", n_est, " prof: ", prof, " | R2 mean / std: ", cvres.mean(), ' / ', cvres.std())

estimators:  10  prof:  1  | R2 mean / std:  0.3438726299902708  /  0.025743659390274534
estimators:  10  prof:  5  | R2 mean / std:  0.4149228487272518  /  0.022172874186633884
estimators:  10  prof:  10  | R2 mean / std:  0.3521818694605585  /  0.07248973908345704
estimators:  100  prof:  1  | R2 mean / std:  0.337000322604752  /  0.032999610753598495
estimators:  100  prof:  5  | R2 mean / std:  0.43115787775827835  /  0.03662193947493671
estimators:  100  prof:  10  | R2 mean / std:  0.4208618290727248  /  0.039079357767148944
estimators:  100  prof:  1  | R2 mean / std:  0.32722633127569367  /  0.036980695219750406
estimators:  100  prof:  5  | R2 mean / std:  0.4243161692008444  /  0.04223702553000389
estimators:  100  prof:  10  | R2 mean / std:  0.4173069827654519  /  0.04217196824974701


Podemos, com algum trabalho, escolher o melhor modelo.

Se quisermos testar mais parâmetros, podemos aumentar nosso loop... mais isso vai ficando cada vez mais complicado.

A proposta do __grid-search__ é justamente fazer isso de forma mais automática!

Podemos importar a função GridSearchCV do módulo model_selection do sklearn e usá-la para isso. 
Na prática, precisamos definir um __estimador base__ para o grid. Além disso, precisamos definir um __dicionário de parâmetros__ a ser testado. Ainda, definiremos a quantidade de folds para cross-validation e qual a métrica de performance que queremos otimizar:

In [5]:
#importando a função
from sklearn.model_selection import GridSearchCV

In [6]:
#definindo o estimador base
estimador_base = RandomForestRegressor()

#definindo o dicionario de parâmetros do modelo
params_RF = {"n_estimators":[10,1000], "max_depth":[2,10]}

In [7]:
grid = GridSearchCV(estimator = estimador_base, 
                    param_grid = params_RF, 
                    scoring = 'r2', 
                    cv = 3)

grid

GridSearchCV(cv=3, estimator=RandomForestRegressor(),
             param_grid={'max_depth': [2, 10], 'n_estimators': [10, 1000]},
             scoring='r2')

In [8]:
#treinando os modelos no grid
grid.fit(Xtrain, ytrain)

GridSearchCV(cv=3, estimator=RandomForestRegressor(),
             param_grid={'max_depth': [2, 10], 'n_estimators': [10, 1000]},
             scoring='r2')

O objeto "grid", após o treinamento acima, conterá várias informações muito relevantes. 

__1- "best_params_":__ retorna os melhores parâmetros, de acordo com a métrica de performance avaliada na cross-validation;

__1- "best_score_":__ retorna o melhor score - métrica de performance - nos dados de validação;

__1- "best_estimator_":__ retorna o melhor modelo, já treinado;

__1- "cv_results_":__ retorna uma visão geral dos resultados.

In [9]:
grid.best_params_

{'max_depth': 10, 'n_estimators': 1000}

In [10]:
grid.best_score_

0.4189516175012673

In [11]:
grid.best_estimator_

RandomForestRegressor(max_depth=10, n_estimators=1000)

In [12]:
grid.cv_results_

{'mean_fit_time': array([0.02414926, 1.43388629, 0.02081561, 2.02043716]),
 'std_fit_time': array([0.01154013, 0.04863793, 0.00737235, 0.13899456]),
 'mean_score_time': array([0.0026652 , 0.06289959, 0.        , 0.08225807]),
 'std_score_time': array([0.00376916, 0.00053476, 0.        , 0.00589496]),
 'param_max_depth': masked_array(data=[2, 2, 10, 10],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_n_estimators': masked_array(data=[10, 1000, 10, 1000],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'max_depth': 2, 'n_estimators': 10},
  {'max_depth': 2, 'n_estimators': 1000},
  {'max_depth': 10, 'n_estimators': 10},
  {'max_depth': 10, 'n_estimators': 1000}],
 'split0_test_score': array([0.34168606, 0.35385612, 0.35526116, 0.36834359]),
 'split1_test_score': array([0.42072262, 0.44719916, 0.46265561, 0.45033351]),
 'split2_test_score': array([0.39317396, 0.4

__Exercício 1:__ Utilizando o dataset abaixo, faça um grid_search com KNN's, Random Forests e GradientBoostings e retorne o melhor modelo de cada tipo.

__Obs.:__ Lembre-se de fazer um pré-processamento nos dados!

In [13]:
#preco_mediano_das_casas é a variável target
df = pd.read_csv("preco_casas.csv")
print(df.shape)
df.head()

(20640, 10)


Unnamed: 0,longitude,latitude,idade_mediana_das_casas,total_comodos,total_quartos,populacao,familias,salario_mediano,preco_mediano_das_casas,proximidade_ao_mar
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,PERTO DA BAÍA
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,PERTO DA BAÍA
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,PERTO DA BAÍA
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,PERTO DA BAÍA
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,PERTO DA BAÍA


In [14]:
# dividindo os dados

dftrain, dftest = train_test_split(df, random_state = 0)
print(df.shape)
print(dftrain.shape)
print(dftest.shape)

(20640, 10)
(15480, 10)
(5160, 10)


In [15]:
# Pré-processamento dos dados

def preprocessamento_completo(df, dataset_de_treino = True, cat_encoder = None, std_scaler = None):

    dff = df.copy()

    #retirando valores faltantes
    dff = dff.dropna(axis = 0)
    
    variaveis_para_normalizar = ['latitude',
                                 'longitude',
                                'idade_mediana_das_casas',
                                 'total_comodos',
                                 'total_quartos',
                                 'populacao',
                                 'familias',
                                 'salario_mediano']

    if dataset_de_treino:  
        
        #OHE
        encoder = OneHotEncoder()
        df_prox_mar_OHE = encoder.fit_transform(dff[['proximidade_ao_mar']]).toarray()

        #normalização
        sc = StandardScaler()
        variaveis_norm = sc.fit_transform(dff[variaveis_para_normalizar])
        
        X, y =  np.c_[df_prox_mar_OHE, variaveis_norm], dff.preco_mediano_das_casas.values
        return X, y, encoder, sc
    
    else:
        #OHE
        df_prox_mar_OHE = cat_encoder.transform(dff[['proximidade_ao_mar']]).toarray()
        
        #normalização
        variaveis_norm = std_scaler.transform(dff[variaveis_para_normalizar]) 
        
        X, y =  np.c_[df_prox_mar_OHE, variaveis_norm], dff.preco_mediano_das_casas.values
        return X, y

In [16]:
Xtrain, ytrain, encoder_train, scaler_train  = preprocessamento_completo(df = dftrain, 
                                                                         dataset_de_treino = True, 
                                                                         cat_encoder = None, 
                                                                         std_scaler = None)

In [17]:
Xtrain.shape, ytrain.shape, dftrain.shape

((15331, 13), (15331,), (15480, 10))

In [18]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

In [19]:
# faça um grid_search com KNN's, Random Forests e GradientBoostings e retorne o melhor modelo de cada tipo:

#KNN

#definindo o estimador base
estimador_base = KNeighborsRegressor()

#definindo o dicionario de parâmetros do modelo
params_RF = {'n_neighbors': [3,5,11,19],
            'weights': ['uniform', 'distance']}

In [20]:
grid = GridSearchCV(estimator = estimador_base, 
                    param_grid = params_RF, 
                    verbose = 1,
                    cv = 3,
                    n_jobs = -1)

grid

GridSearchCV(cv=3, estimator=KNeighborsRegressor(), n_jobs=-1,
             param_grid={'n_neighbors': [3, 5, 11, 19],
                         'weights': ['uniform', 'distance']},
             verbose=1)

In [21]:
#treinando os modelos no grid
grid.fit(Xtrain, ytrain)

Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:    9.7s finished


GridSearchCV(cv=3, estimator=KNeighborsRegressor(), n_jobs=-1,
             param_grid={'n_neighbors': [3, 5, 11, 19],
                         'weights': ['uniform', 'distance']},
             verbose=1)

In [22]:
grid.best_params_

{'n_neighbors': 11, 'weights': 'distance'}

In [23]:
grid.best_score_

0.724256628091983

In [24]:
grid.best_estimator_

KNeighborsRegressor(n_neighbors=11, weights='distance')

In [25]:
grid.cv_results_

{'mean_fit_time': array([0.09372719, 0.1300892 , 0.10875003, 0.10847592, 0.1492339 ,
        0.10770496, 0.10043557, 0.07882055]),
 'std_fit_time': array([0.02209124, 0.04879662, 0.0068061 , 0.02218682, 0.01994212,
        0.00261229, 0.01698595, 0.01363808]),
 'mean_score_time': array([0.54585226, 0.616606  , 0.71821745, 0.84641528, 1.04065863,
        0.87461146, 0.99074181, 0.92390712]),
 'std_score_time': array([0.08055478, 0.02730402, 0.0366151 , 0.06579543, 0.05650464,
        0.05255909, 0.03306908, 0.01272117]),
 'param_n_neighbors': masked_array(data=[3, 3, 5, 5, 11, 11, 19, 19],
              mask=[False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_weights': masked_array(data=['uniform', 'distance', 'uniform', 'distance',
                    'uniform', 'distance', 'uniform', 'distance'],
              mask=[False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=objec

In [26]:
# Random Forest

#definindo o estimador base
estimador_base = RandomForestRegressor()

#definindo o dicionario de parâmetros do modelo
params_RF = {"n_estimators":[10,1000], "max_depth":[2,10]}

In [27]:
grid = GridSearchCV(estimator = estimador_base, 
                    param_grid = params_RF, 
                    scoring = 'r2', 
                    cv = 3)

grid

GridSearchCV(cv=3, estimator=RandomForestRegressor(),
             param_grid={'max_depth': [2, 10], 'n_estimators': [10, 1000]},
             scoring='r2')

In [28]:
#treinando os modelos no grid
grid.fit(Xtrain, ytrain)

GridSearchCV(cv=3, estimator=RandomForestRegressor(),
             param_grid={'max_depth': [2, 10], 'n_estimators': [10, 1000]},
             scoring='r2')

In [29]:
grid.best_params_

{'max_depth': 10, 'n_estimators': 1000}

In [30]:
grid.best_score_

0.7794954239391165

In [31]:
grid.best_estimator_

RandomForestRegressor(max_depth=10, n_estimators=1000)

In [32]:
grid.cv_results_

{'mean_fit_time': array([ 0.14830565, 10.84426252,  0.43850867, 43.88705095]),
 'std_fit_time': array([0.05510599, 0.01452842, 0.00088697, 0.05511266]),
 'mean_score_time': array([0.        , 0.17844303, 0.00520658, 0.69482287]),
 'std_score_time': array([0.        , 0.00793536, 0.00736322, 0.00729655]),
 'param_max_depth': masked_array(data=[2, 2, 10, 10],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_n_estimators': masked_array(data=[10, 1000, 10, 1000],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'max_depth': 2, 'n_estimators': 10},
  {'max_depth': 2, 'n_estimators': 1000},
  {'max_depth': 10, 'n_estimators': 10},
  {'max_depth': 10, 'n_estimators': 1000}],
 'split0_test_score': array([0.52198605, 0.51944578, 0.76730194, 0.7789192 ]),
 'split1_test_score': array([0.50460465, 0.50789849, 0.75319377, 0.7690912 ]),
 'split2_test_score': array([0.5151826 ,

In [33]:
# GradientBoosting

#definindo o estimador base
estimador_base = GradientBoostingRegressor()

#definindo o dicionario de parâmetros do modelo
params_RF = {'n_estimators': [10,50,100], 'learning_rate':[0.001,0.01, 0.1]}

In [34]:
grid = GridSearchCV(estimator = estimador_base, 
                    param_grid = params_RF, 
                    scoring = 'r2', 
                    cv = 3)

grid

GridSearchCV(cv=3, estimator=GradientBoostingRegressor(),
             param_grid={'learning_rate': [0.001, 0.01, 0.1],
                         'n_estimators': [10, 50, 100]},
             scoring='r2')

In [35]:
#treinando os modelos no grid
grid.fit(Xtrain, ytrain)

GridSearchCV(cv=3, estimator=GradientBoostingRegressor(),
             param_grid={'learning_rate': [0.001, 0.01, 0.1],
                         'n_estimators': [10, 50, 100]},
             scoring='r2')

In [36]:
grid.best_params_

{'learning_rate': 0.1, 'n_estimators': 100}

In [37]:
grid.best_score_

0.7723663905094972

In [38]:
grid.best_estimator_

GradientBoostingRegressor()

In [39]:
grid.cv_results_

{'mean_fit_time': array([0.29841232, 1.18162982, 2.28385671, 0.22462519, 1.12909897,
        2.27075617, 0.25610677, 1.21786729, 2.19507202]),
 'std_fit_time': array([0.08715627, 0.07509833, 0.05210106, 0.00691102, 0.03457957,
        0.04248532, 0.00896721, 0.02724469, 0.01244631]),
 'mean_score_time': array([0.        , 0.00786074, 0.00555102, 0.        , 0.01041587,
        0.01576757, 0.        , 0.00520563, 0.01041452]),
 'std_score_time': array([0.        , 0.00637772, 0.00785033, 0.        , 0.00736513,
        0.00018403, 0.        , 0.00736187, 0.00736418]),
 'param_learning_rate': masked_array(data=[0.001, 0.001, 0.001, 0.01, 0.01, 0.01, 0.1, 0.1, 0.1],
              mask=[False, False, False, False, False, False, False, False,
                    False],
        fill_value='?',
             dtype=object),
 'param_n_estimators': masked_array(data=[10, 50, 100, 10, 50, 100, 10, 50, 100],
              mask=[False, False, False, False, False, False, False, False,
              

__Exercício 2:__ Crie uma classe para comparar o grid_search dentre vários modelos distintos.
    
    
Essa classe, gridSearchAll(), já está pré-desenvolvida no código abaixo. O exercício consiste de __completar essa classe.__ Para isso, crie o métodos fit_all, que irá treinar, usando grid_search, todos os grids que tenham sido pré-construídos e inseridos na classe.
Ainda, a quantidade de folds para a validação cruzada no grid_search deve ser implementada no método construtor da classe, bem como qual a métrica de performance a ser avaliada. 
Finalmente, salve o melhor modelo de cada grid e tenha um método best_all_grid_models que retorna o melhor modelo dentre todos os grids.

In [61]:
class gridSearchAll():
    
    def __init__(self, scoring, num_folds):
        self.grid_models = []
        self.scoring = scoring
        self.num_folds = num_folds
        self.best = []
    
    def insert_model(self, estimator_base, param_grid):
        self.grid_models.append([estimator_base, param_grid])
        
    def fit_all(self, X, y):
        for estimator, param in self.grid_models:
            gridCV = GridSearchCV(estimator, param_grid=param, cv=self.num_folds, scoring=self.scoring)
            gridCV.fit(X, y)
            print(gridCV.best_params_)
            self.best.append(gridCV.best_params_)
   
    def best_all_grid_models(self):
        best_models = []
        for grid in self.grid:
            best_models.append(grid.best_estimator_)
        return best_models

In [41]:
gd = gridSearchAll(scoring='accuracy', num_folds=5)

In [42]:
gd.grid_models

[]

In [43]:
params_RF

{'n_estimators': [10, 50, 100], 'learning_rate': [0.001, 0.01, 0.1]}

In [44]:
gd.insert_model(estimator_base = RandomForestRegressor(), param_grid = params_RF)

In [45]:
gd.grid_models

[[RandomForestRegressor(),
  {'n_estimators': [10, 50, 100], 'learning_rate': [0.001, 0.01, 0.1]}]]

In [46]:
gd.insert_model(estimator_base = KNeighborsRegressor(), param_grid = {"n_neighbors":[1,2,10]})

In [47]:
gd.grid_models

[[RandomForestRegressor(),
  {'n_estimators': [10, 50, 100], 'learning_rate': [0.001, 0.01, 0.1]}],
 [KNeighborsRegressor(), {'n_neighbors': [1, 2, 10]}]]

__Exercício 3:__ Usando a classe criada, analise novamente os modelos criados no exercício 1.

In [54]:
rf = RandomForestRegressor()
knn = KNeighborsRegressor()
gb = GradientBoostingRegressor()

In [70]:
gd.insert_model(rf, [{"n_estimators":[10,1000], "max_depth":[2,10]}])
gd.insert_model(knn, [{'n_neighbors': [3,5,11,19], 'weights': ['uniform', 'distance']}])
gd.insert_model(gb, [{'n_estimators': [10,50,100], 'learning_rate':[0.001,0.01, 0.1]}])

In [71]:
gd.grid_models

[[RandomForestRegressor(),
  {'n_estimators': [10, 50, 100], 'learning_rate': [0.001, 0.01, 0.1]}],
 [KNeighborsRegressor(), {'n_neighbors': [1, 2, 10]}],
 [RandomForestRegressor(),
  [{'n_estimators': [10, 50, 100],
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 8, 12, 20]}]],
 [RandomForestRegressor(),
  [{'n_estimators': [10, 50, 100],
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 8, 12, 20]}]],
 [KNeighborsRegressor(), [{'n_neighbors': [2, 5, 8, 10, 12]}]],
 [GradientBoostingRegressor(),
  [{'n_estimators': [10, 50, 100], 'learning_rate': [0.001, 0.01, 0.1]}]],
 [RandomForestRegressor(),
  [{'n_estimators': [10, 50, 100],
    'criterion': ['squared_error', 'absolute_error', 'poisson'],
    'max_depth': [2, 8, 12, 20]}]],
 [KNeighborsRegressor(), [{'n_neighbors': [2, 5, 8, 10, 12]}]],
 [GradientBoostingRegressor(),
  [{'n_estimators': [10, 50, 100], 'learning_rate': [0.001, 0.01, 0.1]}]],
 [KNeighborsRegressor(), [{'n_neighbors': [2, 5, 8, 10, 12]}]],
 [Grad

In [72]:
gd.fit_all(Xtrain, ytrain)

ValueError: Invalid parameter learning_rate for estimator RandomForestRegressor(). Check the list of available parameters with `estimator.get_params().keys()`.