# Hyperparameter optimization with a decision tree

In this notebook we try hyperparameter optimization on a decision tree and compare the results with those obtained without optimization.



In [17]:
# import the modules
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_validate
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from imblearn.under_sampling import RandomUnderSampler


In [18]:
heart = pd.read_csv('heart_failure_clinical_records_dataset.csv')

In [19]:
heart.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


In [20]:
x = heart.drop(['DEATH_EVENT'],axis=1).values
y = heart.DEATH_EVENT.values

x_std = StandardScaler().fit_transform(x)

In [21]:
nm = RandomUnderSampler(random_state=10)

x_nm , y_nm = nm.fit_resample(x_std,y)

## Applying grid search

In [22]:
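# NOTE: 'log_loss' is not a valid criterion in the scikit-learn version used here (see the KeyError below),
# and min_impurity_decrease values of 1 or more are above the maximum Gini impurity of a two-class node (0.5).
parametros = {'criterion':['gini','entropy','log_loss'],'splitter':['best','random'],'min_samples_leaf':np.arange(1,11,1),
              'max_features':['auto','sqrt','log2'],'max_leaf_nodes':np.arange(30,50,1),'min_impurity_decrease':np.arange(1,11,1)}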

In [23]:
grid_search = GridSearchCV(estimator=DecisionTreeClassifier(),param_grid=parametros).fit(x_nm,y_nm)

60000 fits failed out of a total of 180000.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
60000 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/tree/_classes.py", line 942, in fit
    X_idx_sorted=X_idx_sorted,
  File "/usr/local/lib/python3.7/dist-packages/sklearn/tree/_classes.py", line 352, in fit
    criterion = CRITERIA_CLF[self.criterion](
KeyError: 'log_loss'
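
The totals in this warning can be reproduced from the grid itself. The sketch below is only a sanity check, assuming the default 5-fold cross-validation of `GridSearchCV`; `grid_sizes` and `n_folds` are names introduced here for illustration.

```python
from math import prod

# Number of candidate values per hyperparameter, read off the grid above.
grid_sizes = {
    "criterion": 3,               # gini, entropy, log_loss
    "splitter": 2,                # best, random
    "min_samples_leaf": 10,       # np.arange(1, 11, 1)
    "max_features": 3,            # auto, sqrt, log2
    "max_leaf_nodes": 20,         # np.arange(30, 50, 1)
    "min_impurity_decrease": 10,  # np.arange(1, 11, 1)
}
n_folds = 5  # assumption: the default cv of GridSearchCV

n_candidates = prod(grid_sizes.values())
total_fits = n_candidates * n_folds
# Every candidate with criterion='log_loss' fails, i.e. one third of the grid.
failed_fits = total_fits // grid_sizes["criterion"]

print(n_candidates, total_fits, failed_fits)  # 36000 180000 60000
```

The 60000 failed fits are exactly the candidates that use `'log_loss'`, so removing that value from the grid should be enough to silence this particular warning.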



In [24]:
melhores_parametros = grid_search.best_params_
melhor_resultado = grid_search.best_score_
print(melhores_parametros)
print(melhor_resultado)

{'criterion': 'gini', 'max_features': 'auto', 'max_leaf_nodes': 30, 'min_impurity_decrease': 1, 'min_samples_leaf': 1, 'splitter': 'best'}
0.4948717948717949
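
A best score of about 0.49 on balanced data suggests that the tuned tree is not splitting at all. One possible reason is the scale of `min_impurity_decrease`. The sketch below computes the weighted impurity decrease as described in the scikit-learn documentation, using the Gini criterion and a hypothetical best-case split; the helper names `gini` and `impurity_decrease` are introduced here and are not part of the notebook.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease of a split:
    N_t / N * (impurity - N_t_R / N_t * right - N_t_L / N_t * left)."""
    n_parent, n_left, n_right = sum(parent), sum(left), sum(right)
    return (n_parent / n_total) * (
        gini(parent)
        - (n_left / n_parent) * gini(left)
        - (n_right / n_parent) * gini(right)
    )


# Hypothetical best case: a perfectly balanced root split into two pure children.
best_case = impurity_decrease(parent=[50, 50], left=[50, 0], right=[0, 50], n_total=100)
print(best_case)  # 0.5
```

Even this ideal split only reaches 0.5, so a threshold of 1 or more rejects every Gini split and the tree stays a single leaf.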


In [25]:
tree = DecisionTreeClassifier(criterion='gini',max_features='auto',max_leaf_nodes=30,
                              min_impurity_decrease=1,min_samples_leaf=1,splitter='best')

In [26]:
nome_metricas = ['accuracy', 'precision_macro', 'recall_macro']

metricas_ran = cross_validate(tree,x_nm, y_nm, cv=7, scoring=nome_metricas)
for met in metricas_ran:
  print(f'-{met}')
  print(f"-- {metricas_ran[met]}")
  media = np.mean(metricas_ran[met])
  desvio = np.std(metricas_ran[met])
  print(f'Mean of {met}: {media}')
  print(f'Std dev {desvio}')
  print(f'Interval [{(media-(2*desvio)):.3f},{(media+(2*desvio)):.3f}]')
  print('-*-'*20)

-fit_time
-- [0.00338531 0.00060296 0.00056314 0.00054598 0.00050163 0.00049019
 0.00049067]
Mean of fit_time: 0.0009399822780064174
Std dev 0.0009990556016993856
Interval [-0.001,0.003]
-*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*-
-score_time
-- [0.00529575 0.00587773 0.00253081 0.0023284  0.00214481 0.00213194
 0.00210404]
Mean of score_time: 0.00320192745753697
Std dev 0.0015224608463211633
Interval [0.000,0.006]
-*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*-
-test_accuracy
-- [0.5        0.5        0.5        0.48148148 0.48148148 0.48148148
 0.48148148]
Mean of test_accuracy: 0.48941798941798936
Std dev 0.00916428998713693
Interval [0.471,0.508]
-*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*-
-test_precision_macro
-- [0.25       0.25       0.25       0.24074074 0.24074074 0.24074074
 0.24074074]
Mean of test_precision_macro: 0.24470899470899468
Std dev 0.004582144993568465
Interval [0.236,0.254]
-*--*--*--*--*--*--*--*--*--*--*--*--*-

  _warn_prf(average, modifier, msg_start, len(result))


Comparing the decision tree's metrics with and without optimization:
- decision tree *without optimization*:
  - accuracy: 0.759
  - precision: 0.779
  - recall: 0.762

- decision tree *with optimization*:
  - accuracy: 0.489
  - precision: 0.244
  - recall: 0.5
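
The per-fold values reported for the tuned tree (accuracy 0.4815, macro precision 0.2407, macro recall 0.5) are what a classifier that always predicts the same class would produce. The sketch below checks this by hand; the fold composition (13 versus 14 samples) is a hypothetical choice that happens to reproduce those numbers, and `macro_scores` is a helper written for this illustration.

```python
def macro_scores(y_true, y_pred, labels=(0, 1)):
    """Accuracy, macro precision and macro recall.

    Precision is set to 0 for a class that is never predicted, which is
    the fallback scikit-learn applies when it emits its precision warning.
    """
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for _, p in pairs)
        actual = sum(t == label for t, _ in pairs)
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)


# Hypothetical 27-sample test fold: 13 samples of class 0 and 14 of class 1.
y_true = [0] * 13 + [1] * 14
y_pred = [0] * 27  # a tree with no splits predicts one class everywhere

acc, prec, rec = macro_scores(y_true, y_pred)
print(f"{acc:.4f} {prec:.4f} {rec:.4f}")  # 0.4815 0.2407 0.5000
```

If this reading is right, the drop in the metrics reflects a tree that never splits rather than a genuinely worse set of hyperparameters.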

## Applying random search

In [27]:
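# NOTE: 'log_loss' raises a KeyError in the scikit-learn version used here, and every
# min_impurity_decrease value is >= 1, above the maximum Gini impurity of a two-class node (0.5).
parametros2 = {'criterion':['gini','entropy','log_loss'],'splitter':['best','random'],'min_samples_leaf':np.arange(1,101,1),
              'max_features':['auto','sqrt','log2'],'max_leaf_nodes':np.arange(30,50,1),'min_impurity_decrease':np.arange(1,101,1)}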

In [28]:
random_search = RandomizedSearchCV(estimator=DecisionTreeClassifier(),param_distributions=parametros2)
random_search.fit(x_nm,y_nm)
melhores_parametros2 = random_search.best_params_
melhor_resultado2 = random_search.best_score_

15 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/tree/_classes.py", line 942, in fit
    X_idx_sorted=X_idx_sorted,
  File "/usr/local/lib/python3.7/dist-packages/sklearn/tree/_classes.py", line 352, in fit
    criterion = CRITERIA_CLF[self.criterion](
KeyError: 'log_loss'

        nan        nan 0.49487179 0.49487179]


In [29]:
print(melhores_parametros2)
print(melhor_resultado2)

{'splitter': 'best', 'min_samples_leaf': 59, 'min_impurity_decrease': 11, 'max_leaf_nodes': 45, 'max_features': 'log2', 'criterion': 'entropy'}
0.4948717948717949


In [30]:
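# NOTE: these values do not match the best parameters printed above; the random search is not seeded,
# so they probably come from an earlier run. DecisionTreeClassifier(**melhores_parametros2) avoids the mismatch.
tree2 = DecisionTreeClassifier(splitter='random',min_samples_leaf=99,min_impurity_decrease=16,max_leaf_nodes=30,
                               max_features='sqrt',criterion='gini')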


metricas_ran2 = cross_validate(tree2,x_nm, y_nm, cv=7, scoring=nome_metricas)
for met in metricas_ran2:
  print(f'-{met}')
  print(f"-- {metricas_ran2[met]}")
  media = np.mean(metricas_ran2[met])
  desvio = np.std(metricas_ran2[met])
  print(f'Mean of {met}: {media}')
  print(f'Std dev {desvio}')
  print(f'Interval [{(media-(2*desvio)):.3f},{(media+(2*desvio)):.3f}]')
  print('-*-'*20)

-fit_time
-- [0.00312185 0.00054312 0.00054288 0.00048685 0.00047874 0.00048232
 0.00048733]
Mean of fit_time: 0.000877584729875837
Std dev 0.0009165865621709188
Interval [-0.001,0.003]
-*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*-
-score_time
-- [0.00582433 0.00222778 0.00214505 0.00213671 0.00212312 0.00212002
 0.00211883]
Mean of score_time: 0.002670833042689732
Std dev 0.0012878918411188837
Interval [0.000,0.005]
-*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*-
-test_accuracy
-- [0.5        0.5        0.5        0.48148148 0.48148148 0.48148148
 0.48148148]
Mean of test_accuracy: 0.48941798941798936
Std dev 0.00916428998713693
Interval [0.471,0.508]
-*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*--*-
-test_precision_macro
-- [0.25       0.25       0.25       0.24074074 0.24074074 0.24074074
 0.24074074]
Mean of test_precision_macro: 0.24470899470899468
Std dev 0.004582144993568465
Interval [0.236,0.254]
-*--*--*--*--*--*--*--*--*--*--*--*--*-

  _warn_prf(average, modifier, msg_start, len(result))


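Comparing the three, we get the following results:
- Decision tree *without optimization*:
  - accuracy: 0.759
  - precision: 0.779
  - recall: 0.762

- Decision tree *with grid search optimization*:
  - accuracy: 0.489
  - precision: 0.244
  - recall: 0.5
- Decision tree *with random search optimization*:
  - accuracy: 0.489
  - precision: 0.244
  - recall: 0.5

So hyperparameter optimization did not improve the results here. Both searches return identical scores, and an accuracy near 0.49 with a macro recall of 0.5 is what a tree that predicts a single class gives on balanced data. This is likely a consequence of the search space rather than of tuning itself: every candidate value of `min_impurity_decrease` is 1 or more, while a two-class node has an impurity of at most 0.5 (Gini) or 1 (entropy), so essentially no split can reach the threshold.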
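
One way to check whether the search space, rather than tuning as such, explains this outcome is to rerun the search with `min_impurity_decrease` on a fractional scale and without the values that the installed scikit-learn rejects. The sketch below is an assumption-laden illustration, not the notebook's method: it uses synthetic data from `make_classification` as a stand-in for `x_nm, y_nm`, and the grid values are hypothetical choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): replace with x_nm, y_nm from the notebook.
X_demo, y_demo = make_classification(n_samples=200, n_features=12, random_state=10)

# Hypothetical search space: only values accepted by both older and recent
# scikit-learn releases, with impurity thresholds on a fractional scale.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2", None],
    "min_samples_leaf": [1, 5, 10],
    "min_impurity_decrease": [0.0, 0.01, 0.05],
}

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=10),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
    error_score="raise",  # fail loudly instead of silently scoring NaN
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Setting `error_score="raise"` makes an unsupported parameter value stop the search immediately, which is easier to notice than a warning about NaN scores. The scores on the synthetic data say nothing about the heart failure dataset; only a rerun on `x_nm, y_nm` would show whether tuning helps there.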