## Auxiliary Material 3: GridSearch
 

This example shows how GridSearch works: given a classification algorithm and several candidate values for its parameters, it evaluates each resulting model (classifier) using cross-validation.

In [1]:
''' Imports the Pandas and NumPy libraries
'''
import pandas as pd
import numpy as np

In [2]:
''' Imports from sklearn the function that loads the breast cancer dataset. In this dataset we will classify
whether a person has a benign breast tumor (target value 1) or a malignant one (target value 0).
'''
from sklearn.datasets import load_breast_cancer

In [3]:
# the dataset is loaded in sklearn's own format
aux_dataset = load_breast_cancer()
# converts the dataset to a pd.DataFrame
dataset = pd.DataFrame(aux_dataset['data'], columns = aux_dataset['feature_names'])
# adds the class column
dataset['target'] = aux_dataset['target']
# displays the examples in the dataset
dataset

index,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0


Note that in this example we only separate the independent variables (X) from the dependent variable (y). Splitting the data into training and test sets (fold by fold) for cross-validation is handled internally by GridSearchCV; we only need to configure the number of folds to use.
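To make the internal splitting more concrete, the sketch below reproduces the kind of split that takes place. It assumes the behavior documented by scikit-learn: when `cv` is an integer and the estimator is a classifier with a binary or multiclass target, a `StratifiedKFold` without shuffling is used. The variable names (`skf`, `fold_sizes`) are illustrative only, and the data is loaded with `return_X_y=True` to keep the block self-contained.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

# Load the features and the target as NumPy arrays
X, y = load_breast_cancer(return_X_y=True)

# 5 stratified folds: each fold keeps roughly the same class proportions
skf = StratifiedKFold(n_splits=5)

fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    fold_sizes.append((len(train_idx), len(test_idx)))
    # y == 1 means benign, so the mean of y is the fraction of benign cases
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}, "
          f"benign fraction in test={y[test_idx].mean():.3f}")
```

Each example appears in exactly one test fold, so every model is always scored on data it did not see during training.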

In [4]:
# X: every column except the class
X = dataset.loc[:, dataset.columns != 'target'] 
# y: the class column, flattened into a 1-D array
y = np.array(dataset.loc[:, dataset.columns == 'target']).ravel()

In [5]:
# imports the k-nearest neighbors classification algorithm
from sklearn.neighbors import KNeighborsClassifier
# imports GridSearchCV
from sklearn.model_selection import GridSearchCV

# the grid (a dictionary) holds the hyperparameters of the algorithm that will be tested
# it tests k with the values below and the hyperparameter weights with the values 'uniform' and 'distance'
parameters = {'n_neighbors' : [1, 3, 5, 7, 9, 11, 13], 'weights' : ['uniform', 'distance']}

# defines the classification algorithm that will be used
knn = KNeighborsClassifier()
''' configures the GridSearch with the classification algorithm = knn (instantiated), the parameters to test are
    the ones defined in parameters, uses cross-validation = 5, and the evaluation metric is accuracy.'''  
gs = GridSearchCV(knn, parameters, cv=5, scoring='accuracy')
# the grid search trains every model according to the configuration above
gs.fit(X, y)

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 3, 5, 7, 9, 11, 13],
                         'weights': ['uniform', 'distance']},
             scoring='accuracy')
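As a rough mental model of what the search does, the sketch below enumerates the same grid with `ParameterGrid` and scores each combination with `cross_val_score`. It is a conceptual equivalent under the assumption that both use the same default 5-fold splitting, not a description of how GridSearchCV is implemented internally. The names `candidates` and `manual_results` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
parameters = {'n_neighbors': [1, 3, 5, 7, 9, 11, 13],
              'weights': ['uniform', 'distance']}

# Every combination of the listed values: 7 values of k x 2 weightings
candidates = list(ParameterGrid(parameters))
print(len(candidates), "candidates,", len(candidates) * 5, "fits with cv=5")

manual_results = []
for params in candidates:
    # One accuracy score per fold for this combination
    scores = cross_val_score(KNeighborsClassifier(**params), X, y,
                             cv=5, scoring='accuracy')
    manual_results.append((params, scores.mean()))

# Combination with the highest mean accuracy across the folds
best_params, best_mean = max(manual_results, key=lambda item: item[1])
print(best_params, round(best_mean, 6))
```

Besides these cross-validated fits, GridSearchCV by default (`refit=True`) trains the best combination once more on the full dataset.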

In [6]:
# puts the results in a DataFrame for easier viewing
results = pd.DataFrame(gs.cv_results_)
results

index,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_neighbors,param_weights,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.00227,0.000214,0.007737,0.000657,1,uniform,"{'n_neighbors': 1, 'weights': 'uniform'}",0.859649,0.929825,0.912281,0.912281,0.911504,0.905108,0.023754,13
1,0.002249,4.1e-05,0.003314,0.000415,1,distance,"{'n_neighbors': 1, 'weights': 'distance'}",0.859649,0.929825,0.912281,0.912281,0.911504,0.905108,0.023754,13
2,0.002278,4.3e-05,0.008211,0.000833,3,uniform,"{'n_neighbors': 3, 'weights': 'uniform'}",0.877193,0.921053,0.947368,0.938596,0.911504,0.919143,0.024482,12
3,0.002328,0.000211,0.003512,0.000148,3,distance,"{'n_neighbors': 3, 'weights': 'distance'}",0.877193,0.929825,0.947368,0.947368,0.920354,0.924422,0.025805,11
4,0.002499,0.00054,0.007939,0.000181,5,uniform,"{'n_neighbors': 5, 'weights': 'uniform'}",0.885965,0.938596,0.938596,0.947368,0.929204,0.927946,0.021763,8
5,0.002269,7e-05,0.003757,5.8e-05,5,distance,"{'n_neighbors': 5, 'weights': 'distance'}",0.894737,0.938596,0.938596,0.947368,0.929204,0.9297,0.018402,5
6,0.002185,1.8e-05,0.007884,4.6e-05,7,uniform,"{'n_neighbors': 7, 'weights': 'uniform'}",0.877193,0.938596,0.947368,0.947368,0.920354,0.926176,0.026404,9
7,0.002168,2.2e-05,0.003751,3.4e-05,7,distance,"{'n_neighbors': 7, 'weights': 'distance'}",0.868421,0.938596,0.947368,0.947368,0.920354,0.924422,0.029687,10
8,0.002163,1e-05,0.007872,3.5e-05,9,uniform,"{'n_neighbors': 9, 'weights': 'uniform'}",0.877193,0.938596,0.947368,0.95614,0.938053,0.93147,0.027934,3
9,0.002157,4e-05,0.003761,2.6e-05,9,distance,"{'n_neighbors': 9, 'weights': 'distance'}",0.868421,0.938596,0.947368,0.95614,0.946903,0.931486,0.032017,2


Note that the table above shows the fit and scoring times of each model, its parameter settings, the score on each fold (accuracy, in this case), the mean and standard deviation of the scores across the folds, and the rank of each model.
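As a quick check of how the summary columns relate to the per-fold columns, the sketch below recomputes `mean_test_score` and `std_test_score` for the first row of the table (`n_neighbors = 1`, `weights = 'uniform'`) from its five fold scores. The fold values are copied from the table; the variable names are illustrative.

```python
import numpy as np

# split0..split4 test scores of the first row of the results table
fold_scores = np.array([0.859649, 0.929825, 0.912281, 0.912281, 0.911504])

mean_score = fold_scores.mean()
# np.std defaults to the population standard deviation (ddof=0)
std_score = fold_scores.std()

print(round(mean_score, 6))  # 0.905108, the mean_test_score in the table
print(round(std_score, 6))   # approximately 0.023754, the std_test_score in the table
```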

Now let's create a view with only the most relevant columns and sort the models by rank, which makes the results easier to read.

In [7]:
view = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
results[view].sort_values(by='rank_test_score')

index,params,mean_test_score,std_test_score,rank_test_score
12,"{'n_neighbors': 13, 'weights': 'uniform'}",0.93324,0.028568,1
9,"{'n_neighbors': 9, 'weights': 'distance'}",0.931486,0.032017,2
8,"{'n_neighbors': 9, 'weights': 'uniform'}",0.93147,0.027934,3
13,"{'n_neighbors': 13, 'weights': 'distance'}",0.929716,0.031354,4
5,"{'n_neighbors': 5, 'weights': 'distance'}",0.9297,0.018402,5
10,"{'n_neighbors': 11, 'weights': 'uniform'}",0.9297,0.02774,5
11,"{'n_neighbors': 11, 'weights': 'distance'}",0.927961,0.030044,7
4,"{'n_neighbors': 5, 'weights': 'uniform'}",0.927946,0.021763,8
6,"{'n_neighbors': 7, 'weights': 'uniform'}",0.926176,0.026404,9
7,"{'n_neighbors': 7, 'weights': 'distance'}",0.924422,0.029687,10


Note in the table above that the best hyperparameter setting, among the values tested, is 'n_neighbors = 13' and 'weights = uniform'.

It is possible to retrieve the best classifier, as well as just its results and parameter settings, as shown in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
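A minimal sketch of that retrieval, using the `best_params_`, `best_score_` and `best_estimator_` attributes described in the documentation above. The block repeats the search so that it runs on its own. The exact scores may vary with the scikit-learn version, so no output is shown here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
parameters = {'n_neighbors': [1, 3, 5, 7, 9, 11, 13],
              'weights': ['uniform', 'distance']}

gs = GridSearchCV(KNeighborsClassifier(), parameters, cv=5, scoring='accuracy')
gs.fit(X, y)

# Hyperparameters of the top-ranked model
best_params = gs.best_params_
# Mean cross-validated accuracy of the top-ranked model
best_score = gs.best_score_
# The top-ranked model refit on the whole dataset (requires refit=True, the default)
best_model = gs.best_estimator_

print(best_params)
print(round(best_score, 6))

# The refit model can be used directly for prediction
predictions = best_model.predict(X[:5])
print(predictions)
```

Because the best model is refit on all of the data, its predictions on `X` are not an unbiased estimate of performance; the cross-validated `best_score_` is the number to report from this search.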