<h1>Concept</h1>

<h2><i>Randomized search cross validation</i></h2>

When tuning the hyperparameters of a machine learning model, the space defined by the candidate hyperparameters can be so large that testing every discrete point in it (which is what the *GridSearchCV* class does) becomes infeasible. In such cases it is useful to run a random search over the parameter combinations with the ***RandomizedSearchCV*** class.

This class randomly samples points from the defined space and evaluates them. Like *GridSearchCV*, it exposes several attributes after fitting, such as the *scores* of each candidate and the best-performing model.

Some of the **advantages** of random search are:

- The number of samples to test can be set independently of the total number of parameters and possible combinations
- Adding parameters that do not influence performance does not reduce the efficiency of the search
- Continuous intervals can be used, not only discrete values (for parameters that accept continuous values)
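The last point can be illustrated with a minimal sketch that mixes discrete lists with `scipy.stats` distributions. The synthetic dataset, the names `X_demo`, `y_demo`, `param_distributions` and `demo_search`, and the choice of `min_impurity_decrease` as the continuous parameter are assumptions made only for this illustration; they are not part of the application further down.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only so the sketch is self-contained
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

param_distributions = {
    "max_depth": randint(3, 6),                    # integers 3, 4, 5 (upper bound is exclusive)
    "min_samples_leaf": randint(32, 129),          # integers 32 to 128
    "min_impurity_decrease": uniform(0.0, 0.05),   # continuous values in [0, 0.05]
    "criterion": ["gini", "entropy"],              # plain lists are sampled uniformly
}

demo_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions,
    n_iter=10,        # number of sampled candidates, independent of the size of the space
    cv=5,
    random_state=0,   # makes the sampled candidates reproducible
)
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_)
```

Any object with an `rvs` method can be used as a distribution here, which is why a grid search cannot cover the same space without first discretizing it.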

<h1>Application</h1>

In [7]:
# Importing the libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import KFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

In [2]:
# Loading the dataset
base = pd.read_csv(r'Dados\base.csv')
base.head()

Unnamed: 0,preco,vendido,idade_do_modelo,km_por_ano
0,30941.02,1,18,35085.22134
1,40557.96,1,20,12622.05362
2,89627.5,0,12,11440.79806
3,95276.14,0,3,43167.32682
4,117384.68,1,4,12770.1129


In [3]:
# Defining inputs (features) and outputs (target)
x = base[['preco', 'idade_do_modelo', 'km_por_ano']].values
y = base['vendido'].values.ravel()

In [6]:
# Tuning a decision tree with RandomizedSearchCV

SEED=301
np.random.seed(SEED)

parameters = {
    'max_depth': [3,4,5],
    'min_samples_split':range(32,129),
    'min_samples_leaf': range(32,129),
    'criterion': ['gini', 'entropy']
}

busca = RandomizedSearchCV(
    DecisionTreeClassifier(),
    parameters,
    cv = KFold(n_splits=10, shuffle=True),
    n_iter=30
)
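To get a sense of how small the sampled fraction is, the size of the full grid implied by `parameters` can be counted. The sketch below re-declares the same dictionary so that it runs on its own; the names `n_combinations` and `sampled_fraction` are illustrative only.

```python
from math import prod

# Same search space as in the cell above
parameters = {
    'max_depth': [3, 4, 5],
    'min_samples_split': range(32, 129),
    'min_samples_leaf': range(32, 129),
    'criterion': ['gini', 'entropy'],
}

# A full grid search would evaluate one candidate per combination of values
n_combinations = prod(len(values) for values in parameters.values())
n_iter = 30
sampled_fraction = n_iter / n_combinations

print(n_combinations)  # 56454
print(f'{sampled_fraction:.4%}')
```

With 10-fold cross validation, an exhaustive search over this space would require fitting every one of those combinations 10 times, while the random search above fits only 30 candidates per fold.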

In [8]:
# Running a simple test, without an outer cross validation:
busca.fit(x,y)
resultados_simples = pd.DataFrame(busca.cv_results_)

In [9]:
resultados_simples.head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_min_samples_split,param_min_samples_leaf,param_max_depth,param_criterion,params,split0_test_score,...,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.013898,0.001687,0.0008,0.000246,97,92,3,entropy,"{'min_samples_split': 97, 'min_samples_leaf': ...",0.775,...,0.772,0.818,0.783,0.784,0.781,0.79,0.794,0.7869,0.0127,1
1,0.014547,0.000685,0.000905,0.000376,115,112,4,entropy,"{'min_samples_split': 115, 'min_samples_leaf':...",0.775,...,0.772,0.818,0.783,0.784,0.781,0.79,0.794,0.7869,0.0127,1
2,0.018346,0.002397,0.000752,0.000337,57,70,5,entropy,"{'min_samples_split': 57, 'min_samples_leaf': ...",0.775,...,0.761,0.815,0.784,0.784,0.778,0.79,0.794,0.7848,0.013497,28
3,0.015749,0.004485,0.000901,0.000489,58,117,3,entropy,"{'min_samples_split': 58, 'min_samples_leaf': ...",0.775,...,0.772,0.818,0.783,0.784,0.781,0.79,0.794,0.7869,0.0127,1
4,0.018897,0.003263,0.000901,0.000375,68,98,5,entropy,"{'min_samples_split': 68, 'min_samples_leaf': ...",0.775,...,0.761,0.818,0.783,0.784,0.777,0.79,0.794,0.7849,0.014244,24


In [10]:
# Now running a test with an outer cross validation (nested cross validation):

SEED=301
np.random.seed(SEED)

cv_score = cross_val_score(busca, x, y, cv=KFold(n_splits=10, shuffle=True))

In [11]:
# Checking the score of each outer fold:
cv_score

array([0.778, 0.782, 0.792, 0.783, 0.798, 0.788, 0.785, 0.762, 0.796,
       0.796])

In [12]:
# Checking the final mean score and its interval (mean ± 2 standard deviations):

print(f'Mean score: {cv_score.mean()*100:.2f}')
print(f'Interval: [{(cv_score.mean() - 2*cv_score.std())*100:.2f}, {(cv_score.mean() + 2*cv_score.std())*100:.2f}]')

Mean score: 78.60
Interval: [76.55, 80.65]
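The interval above is the mean plus or minus two standard deviations of the fold scores. As a minimal sketch, the same calculation can be wrapped in a small helper and checked against the scores printed earlier; the function name `score_interval` and the variable `recorded_scores` are hypothetical and exist only for this illustration.

```python
import numpy as np

def score_interval(scores, n_std=2):
    """Return (mean, lower, upper), where the bounds are mean -/+ n_std standard deviations."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    std = scores.std()  # population standard deviation (ddof=0), as in the cell above
    return mean, mean - n_std * std, mean + n_std * std

# Fold scores copied from the cross_val_score output shown earlier
recorded_scores = [0.778, 0.782, 0.792, 0.783, 0.798,
                   0.788, 0.785, 0.762, 0.796, 0.796]

mean, lower, upper = score_interval(recorded_scores)
print(f'Mean score: {mean*100:.2f}')                    # Mean score: 78.60
print(f'Interval: [{lower*100:.2f}, {upper*100:.2f}]')  # Interval: [76.55, 80.65]
```

Note that this is a rough spread of the fold scores rather than a formal confidence interval, since the folds share most of their training data.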


In [13]:
# Checking the best estimator:
busca.best_estimator_
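One detail worth keeping in mind: `cross_val_score` works on clones of the estimator it receives, so the best estimator shown here comes from the earlier `busca.fit(x, y)` call, not from the outer cross validation. The sketch below illustrates this on synthetic data; the dataset, the smaller search space and the names `X_demo`, `y_demo`, `search`, `outer_scores` and `fitted_by_outer_cv` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to keep the sketch self-contained
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    {'max_depth': [3, 4, 5], 'min_samples_leaf': range(10, 40)},
    n_iter=5,
    cv=3,
    random_state=0,
)

# The outer cross validation fits clones, leaving `search` itself unfitted
outer_scores = cross_val_score(search, X_demo, y_demo, cv=3)
fitted_by_outer_cv = hasattr(search, 'best_estimator_')
print(fitted_by_outer_cv)  # False

# Only an explicit fit populates the attributes on `search`
search.fit(X_demo, y_demo)
print(search.best_params_)
print(search.best_score_)
```

The outer scores estimate how well the whole tuning procedure generalizes, while `best_params_` and `best_estimator_` describe the single model selected by the explicit fit.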