# Hyperparameter Optimization

Thanks to Martín Gonella for creating the contents of this session!


We start working on hyperparameter optimization in Python with Scikit-Learn. We begin with a guided example that uses the same dataset as the previous notebook, the Breast Cancer Wisconsin (diagnostic) dataset. At the end, we propose a similar analysis with a different dataset.

## 1. Breast Cancer Wisconsin (diagnostic) dataset

In [1]:
import pandas as pd
import numpy as np
import scipy as sp

from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score

**To investigate:** Do you recognize all the libraries we just imported and their objects? If not, remember to always read the documentation.

We import the dataset just as we did in the previous notebook.

In [2]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

If you inspect the variable `data`, you will notice that it behaves like a dictionary (scikit-learn returns a dictionary-like `Bunch` object), so we convert it into a Pandas `DataFrame`.

In [3]:
df = pd.DataFrame(np.c_[data['data'], data['target']],
                  columns= np.append(data['feature_names'], ['target']))
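
The conversion above can also be delegated to scikit-learn. The following is a minimal sketch, assuming scikit-learn 0.23 or later (where the `as_frame` argument is available); the names `bunch` and `frame` are only illustrative. Note that `np.c_` casts the target to float, which is why it shows up as `0.0` in the notebook's DataFrame.

```python
from sklearn.datasets import load_breast_cancer

# The loader returns a dictionary-like Bunch object
bunch = load_breast_cancer()
print(sorted(bunch.keys()))

# as_frame=True asks scikit-learn to return pandas objects;
# .frame holds the 30 features plus a 'target' column
frame = load_breast_cancer(as_frame=True).frame
print(frame.shape)
```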

Since we have already explored it, we simply select the attributes we will use.

In [4]:
features_mean = list(df.columns[0:10])
features_mean

['mean radius',
 'mean texture',
 'mean perimeter',
 'mean area',
 'mean smoothness',
 'mean compactness',
 'mean concavity',
 'mean concave points',
 'mean symmetry',
 'mean fractal dimension']

In [5]:
data = df[features_mean + ['target']]
data.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | 0.0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | 0.0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | 0.0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | 0.0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | 0.0 |


We separate the `features` from the `target` so that we can then split the data into `train` and `test` sets.

Class 1 (the one treated as positive) has more instances, so the split below uses `train_test_split` with `stratify`.

In [6]:
X = data.drop(['target'],axis=1)
y = data['target']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42) #test_size=0.33
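
To see what `stratify` buys us, the sketch below compares the class proportions of the full data with those of each split. It is only an illustration: it reloads the dataset with `return_X_y=True` (all 30 features) and the helper `class_share` is a hypothetical name, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Same call as in the notebook: stratify=y keeps the class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

def class_share(labels):
    """Fraction of samples belonging to each class."""
    return np.bincount(labels) / len(labels)

full_share = class_share(y)
train_share = class_share(y_train)
test_share = class_share(y_test)
print("full :", full_share)
print("train:", train_share)
print("test :", test_share)
```

The three rows should be nearly identical; without `stratify` the proportions can drift between splits, especially on small datasets.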

Great, our data is ready!

Now let's choose a classification model; we go with a `KNeighborsClassifier`. Afterwards, you can try some other classifier.

In [7]:
knn = KNeighborsClassifier()

As you will recall from the course notes, we saw three strategies for hyperparameter optimization:

* Manual.
* Grid search.
* Random search.

As already mentioned, manual search can be tedious and inefficient, so we will try the other two: **random** and **grid** search. We also already have an idea of how these models perform on this dataset from the previous session.
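
For contrast, this is roughly what a manual search looks like. It is a minimal sketch under stated assumptions: the candidate values of `n_neighbors` are arbitrary, the full dataset with all 30 features is used for brevity, and `manual_scores` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Manual search: try each candidate by hand and keep the best mean CV accuracy
manual_scores = {}
for k in [1, 5, 10, 15]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    manual_scores[k] = cross_val_score(knn_k, X, y, cv=5).mean()

best_k = max(manual_scores, key=manual_scores.get)
print(manual_scores)
print("best k:", best_k)
```

Every additional hyperparameter adds another nested loop, which is the bookkeeping that `GridSearchCV` and `RandomizedSearchCV` take over.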

### 1.1 Grid Search

Let's define the grid we need for `GridSearchCV`. What type of object is it, from a programming point of view? Also pay attention to the data type used for each hyperparameter. As always, check the class documentation.

`Dictionary`

In [8]:
# Grid for the KNN Grid Search
param_grid = {'n_neighbors':np.arange(1, 20),
              'weights': ['uniform', 'distance'], 
              'leaf_size':[1,3,5,7,10],
              'algorithm':['auto', 'kd_tree']}
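
Before fitting, it helps to know how many models a grid implies. The sketch below uses `ParameterGrid` from scikit-learn to enumerate the combinations of the grid above; `n_candidates` and `n_fits` are illustrative names, and the fold count of 5 matches the `cv=5` used in this notebook.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {'n_neighbors': np.arange(1, 20),
              'weights': ['uniform', 'distance'],
              'leaf_size': [1, 3, 5, 7, 10],
              'algorithm': ['auto', 'kd_tree']}

# ParameterGrid enumerates every combination that a grid search would try
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 5  # one fit per candidate per fold with cv=5
print(n_candidates, n_fits)  # 380 1900
```

The count is the product of the list lengths (19 x 2 x 5 x 2), so each extra hyperparameter multiplies the cost.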

Once the grid is defined, we can train the model.

In [9]:
# STRATEGY 1: Grid Search
model = GridSearchCV(knn, param_grid=param_grid, cv=5)

# Train: KNN with the grid defined above and 5-fold cross-validation
model.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid={'algorithm': ['auto', 'kd_tree'],
                         'leaf_size': [1, 3, 5, 7, 10],
                         'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19]),
                         'weights': ['uniform', 'distance']})

Great, we have now trained our KNN model over a grid of hyperparameters. Grid search also runs cross-validation, so each hyperparameter configuration has been validated on held-out folds.

<img src="https://media.giphy.com/media/rVbAzUUSUC6dO/giphy.gif" width="400" />

**But how do I choose the best configuration? What is the best performance? And what about the rest of the results?**

`best_score_` `best_params_` `best_estimator_`

**Hint:** The right answer is always in the documentation.

Three attributes of the fitted search object (a scikit-learn class) help answer these questions: `best_params_`, `best_score_` and `cv_results_`.

**To investigate:** Before continuing with the notebook, read a bit more of the general documentation of [`GridSearchCV()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) and [`RandomizedSearchCV()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html), and in particular about the three attributes just mentioned.

<img src="https://media.giphy.com/media/2k8EwXEwhoQGQ/giphy.gif" width="400" />

In [10]:
print("Best KNN parameters: "+str(model.best_params_))
print("Best KNN score: "+str(model.best_score_)+'\n')

scores = pd.DataFrame(model.cv_results_)
scores

Best KNN parameters: {'algorithm': 'auto', 'leaf_size': 1, 'n_neighbors': 10, 'weights': 'distance'}
Best KNN score: 0.8873324213406292



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_algorithm,param_leaf_size,param_n_neighbors,param_weights,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.003452,0.000300,0.005184,0.000360,auto,1,1,uniform,"{'algorithm': 'auto', 'leaf_size': 1, 'n_neigh...",0.860465,0.870588,0.823529,0.823529,0.882353,0.852093,0.024329,311
1,0.002666,0.000127,0.002374,0.000231,auto,1,1,distance,"{'algorithm': 'auto', 'leaf_size': 1, 'n_neigh...",0.860465,0.870588,0.823529,0.823529,0.882353,0.852093,0.024329,311
2,0.002515,0.000237,0.004059,0.000212,auto,1,2,uniform,"{'algorithm': 'auto', 'leaf_size': 1, 'n_neigh...",0.802326,0.870588,0.847059,0.788235,0.870588,0.835759,0.034439,361
3,0.003008,0.000659,0.002929,0.000581,auto,1,2,distance,"{'algorithm': 'auto', 'leaf_size': 1, 'n_neigh...",0.860465,0.870588,0.823529,0.823529,0.882353,0.852093,0.024329,311
4,0.002551,0.000234,0.004120,0.000191,auto,1,3,uniform,"{'algorithm': 'auto', 'leaf_size': 1, 'n_neigh...",0.860465,0.858824,0.858824,0.800000,0.882353,0.852093,0.027532,311
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
375,0.002287,0.000326,0.002042,0.000053,kd_tree,10,17,distance,"{'algorithm': 'kd_tree', 'leaf_size': 10, 'n_n...",0.872093,0.870588,0.858824,0.894118,0.929412,0.885007,0.024960,11
376,0.002320,0.000365,0.003938,0.000163,kd_tree,10,18,uniform,"{'algorithm': 'kd_tree', 'leaf_size': 10, 'n_n...",0.837209,0.882353,0.858824,0.882353,0.905882,0.873324,0.023399,181
377,0.002162,0.000108,0.002249,0.000367,kd_tree,10,18,distance,"{'algorithm': 'kd_tree', 'leaf_size': 10, 'n_n...",0.872093,0.870588,0.847059,0.894118,0.929412,0.882654,0.027719,41
378,0.002262,0.000209,0.004356,0.000411,kd_tree,10,19,uniform,"{'algorithm': 'kd_tree', 'leaf_size': 10, 'n_n...",0.860465,0.882353,0.858824,0.882353,0.917647,0.880328,0.021250,81


This DataFrame holds all the results returned by `GridSearchCV()`. There is a lot of information to explore, but run the following cells before doing so.
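
One possible way to make that table easier to read is to keep a few columns and sort by rank. This is a sketch, not the notebook's method: it fits a deliberately small, hypothetical grid (`small_grid`) on the full dataset so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# A deliberately small grid so the example runs quickly
small_grid = {'n_neighbors': [3, 5, 10], 'weights': ['uniform', 'distance']}
search = GridSearchCV(KNeighborsClassifier(), param_grid=small_grid, cv=5)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)
cols = ['param_n_neighbors', 'param_weights',
        'mean_test_score', 'std_test_score', 'rank_test_score']
# Best candidates first
top = results[cols].sort_values('rank_test_score')
print(top)
```

The first row corresponds to `best_params_`, and its `mean_test_score` is the value reported as `best_score_`.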

We have read the documentation, so we know we can predict with the best model as follows:

In [11]:
# Predict on the test data
prediction_test = model.predict(X_test)
print('KNN accuracy on Test:',  accuracy_score(y_test, prediction_test))
# Predict on the train data
prediction_train = model.predict(X_train)
print('KNN accuracy on Train:', accuracy_score(y_train, prediction_train))

KNN accuracy on Test: 0.9020979020979021


Why do we predict on the test set? Was this set involved in training the model?
`To get a realistic performance measure, on data the model has not seen. The test data are not used for training.`

In [13]:
# Confusion matrix
cm = confusion_matrix(y_test,prediction_test)
print("KNN confusion matrix:")
print(cm)

KNN confusion matrix:
[[43 10]
 [ 4 86]]


In [14]:
# Classification report
report = classification_report(y_test, prediction_test)
print("KNN classification report:")
print(report)

KNN classification report:
              precision    recall  f1-score   support

         0.0       0.91      0.81      0.86        53
         1.0       0.90      0.96      0.92        90

    accuracy                           0.90       143
   macro avg       0.91      0.88      0.89       143
weighted avg       0.90      0.90      0.90       143



**To think about**: did the model's performance improve compared with what we did in the previous notebook? Which other hyperparameters could you explore to see whether performance improves? Can the results of `GridSearchCV` be explored more thoroughly? Now take some time to explore all the information it returns.
`For KNN you can change the metric, that is, the type of distance it computes`
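
The suggestion about the distance metric could be tried along these lines. This is a hedged sketch: the three metric names are valid values for `KNeighborsClassifier`, but the choice of candidates and the names `metric_grid` and `metric_search` are assumptions made for illustration, and all 30 features are used instead of the 10 selected earlier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# 'metric' selects the distance function used to find neighbors
metric_grid = {'n_neighbors': [5, 10, 15],
               'metric': ['euclidean', 'manhattan', 'chebyshev']}
metric_search = GridSearchCV(KNeighborsClassifier(), metric_grid, cv=5)
metric_search.fit(X_train, y_train)

print(metric_search.best_params_)
print(metric_search.best_score_)
```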

**Exercise - Challenge:** Repeat what we did, but for a decision tree. Some hyperparameters that may be interesting to explore in this case are: `criterion`, `max_depth`, `min_samples_split` and `min_samples_leaf`.

Let's go with a `DecisionTreeClassifier`

In [15]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()

In [16]:
# Grid for the decision tree Grid Search
param_grid_clf = {'criterion': ['gini', 'entropy'],
                  'max_depth': np.arange(1, 10),
                  'min_samples_split': [2, 3, 4],
                  'min_samples_leaf': [1, 3, 5]}

In [17]:
# STRATEGY 1: Grid Search
model_clf = GridSearchCV(clf, param_grid=param_grid_clf, cv=5)

# Train: decision tree with the grid defined above and 5-fold cross-validation
model_clf.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': array([1, 2, 3, 4, 5, 6, 7, 8, 9]),
                         'min_samples_leaf': [1, 3, 5],
                         'min_samples_split': [2, 3, 4]})

In [18]:
print("Best tree parameters: "+str(model_clf.best_params_))
print("Best tree score: "+str(model_clf.best_score_)+'\n')
scores = pd.DataFrame(model_clf.cv_results_)
scores

Best tree parameters: {'criterion': 'gini', 'max_depth': 4, 'min_samples_leaf': 3, 'min_samples_split': 2}
Best tree score: 0.9436662106703146



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_criterion,param_max_depth,param_min_samples_leaf,param_min_samples_split,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.003217,0.000420,0.001696,0.000094,gini,1,1,2,"{'criterion': 'gini', 'max_depth': 1, 'min_sam...",0.918605,0.917647,0.917647,0.905882,0.917647,0.915486,0.004816,114
1,0.003091,0.000493,0.001873,0.000209,gini,1,1,3,"{'criterion': 'gini', 'max_depth': 1, 'min_sam...",0.918605,0.917647,0.917647,0.905882,0.917647,0.915486,0.004816,114
2,0.002613,0.000263,0.001588,0.000168,gini,1,1,4,"{'criterion': 'gini', 'max_depth': 1, 'min_sam...",0.918605,0.917647,0.917647,0.905882,0.917647,0.915486,0.004816,114
3,0.002160,0.000084,0.001335,0.000021,gini,1,3,2,"{'criterion': 'gini', 'max_depth': 1, 'min_sam...",0.918605,0.917647,0.917647,0.905882,0.917647,0.915486,0.004816,114
4,0.002093,0.000049,0.001271,0.000087,gini,1,3,3,"{'criterion': 'gini', 'max_depth': 1, 'min_sam...",0.918605,0.917647,0.917647,0.905882,0.917647,0.915486,0.004816,114
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
157,0.003520,0.000057,0.001235,0.000007,entropy,9,3,3,"{'criterion': 'entropy', 'max_depth': 9, 'min_...",0.930233,0.905882,0.941176,0.905882,0.929412,0.922517,0.014203,66
158,0.003620,0.000222,0.001240,0.000020,entropy,9,3,4,"{'criterion': 'entropy', 'max_depth': 9, 'min_...",0.930233,0.905882,0.964706,0.905882,0.929412,0.927223,0.021585,33
159,0.003398,0.000130,0.001273,0.000079,entropy,9,5,2,"{'criterion': 'entropy', 'max_depth': 9, 'min_...",0.895349,0.905882,0.894118,0.894118,0.964706,0.910834,0.027296,143
160,0.003512,0.000196,0.001273,0.000048,entropy,9,5,3,"{'criterion': 'entropy', 'max_depth': 9, 'min_...",0.918605,0.905882,0.894118,0.894118,0.964706,0.915486,0.026221,114


In [19]:
# Predict on the test data
prediction_clf_test = model_clf.predict(X_test)
print('Decision tree accuracy on Test:', accuracy_score(y_test, prediction_clf_test))

Decision tree accuracy on Test: 0.9300699300699301


In [20]:
# Predict on the train data
prediction_clf_train = model_clf.predict(X_train)
print('Decision tree accuracy on Train:', accuracy_score(y_train, prediction_clf_train))

Decision tree accuracy on Train: 0.9741784037558685


In [21]:
# Confusion matrix
cm = confusion_matrix(y_test,prediction_clf_test)
print("Decision tree confusion matrix:")
print(cm)

Decision tree confusion matrix:
[[47  6]
 [ 4 86]]


In [22]:
# Classification report
report = classification_report(y_test, prediction_clf_test)
print("Decision tree classification report:")
print(report)

Decision tree classification report:
              precision    recall  f1-score   support

         0.0       0.92      0.89      0.90        53
         1.0       0.93      0.96      0.95        90

    accuracy                           0.93       143
   macro avg       0.93      0.92      0.92       143
weighted avg       0.93      0.93      0.93       143



### 1.2 Random Search

The methodology is very similar. The main difference is that, when defining the search space, we no longer pass explicit values for every hyperparameter; for those we want explored at random, we pass a random distribution to sample from.

In [23]:
# Search space for Random Search
param_dist = {'n_neighbors':sp.stats.randint(1, 20),
              'weights': ['uniform', 'distance'], 
              'leaf_size':sp.stats.randint(1, 10),
              'algorithm':['auto', 'kd_tree']}

Check the documentation to understand what it does (it is rather long; the beginning is enough):

In [24]:
#help(sp.stats.randint)
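
As a complement to reading the help text, the sketch below draws from the distribution directly. It assumes only SciPy and scikit-learn; `ParameterSampler` is scikit-learn's utility for drawing candidates from a dictionary of lists and distributions, and the variable names are illustrative.

```python
from scipy import stats
from sklearn.model_selection import ParameterSampler

# randint(low, high) draws integers from low up to high - 1
dist = stats.randint(1, 20)
samples = dist.rvs(size=1000, random_state=0)
print(samples.min(), samples.max())

# Draw a handful of candidate configurations from a mixed search space:
# lists are sampled uniformly, distributions through their rvs method
param_dist = {'n_neighbors': stats.randint(1, 20),
              'weights': ['uniform', 'distance']}
candidates = list(ParameterSampler(param_dist, n_iter=5, random_state=0))
for candidate in candidates:
    print(candidate)
```

Note that the upper bound is exclusive, so `randint(1, 10)` for `leaf_size` never proposes 10.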

We can now train our model. Pay attention to the `n_iter` parameter.

`    Number of parameter settings that are sampled. n_iter trades off runtime vs quality of the solution.`


In [25]:
# STRATEGY 2: Random Search
model = RandomizedSearchCV(knn, param_dist,n_iter=100, random_state=0, cv=5)

# Train: KNN with the distributions defined above and 5-fold cross-validation
model.fit(X_train, y_train)

RandomizedSearchCV(cv=5, estimator=KNeighborsClassifier(), n_iter=100,
                   param_distributions={'algorithm': ['auto', 'kd_tree'],
                                        'leaf_size': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f19ef2053a0>,
                                        'n_neighbors': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f1a186e6bb0>,
                                        'weights': ['uniform', 'distance']},
                   random_state=0)

In [26]:
print("Best parameters: "+str(model.best_params_))
print("Best score: "+str(model.best_score_)+'\n')

scores = pd.DataFrame(model.cv_results_)
scores

Best parameters: {'algorithm': 'auto', 'leaf_size': 2, 'n_neighbors': 10, 'weights': 'distance'}
Best score: 0.8873324213406292



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_algorithm,param_leaf_size,param_n_neighbors,param_weights,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.002808,0.000496,0.002278,0.000278,auto,6,1,distance,"{'algorithm': 'auto', 'leaf_size': 6, 'n_neigh...",0.860465,0.870588,0.823529,0.823529,0.882353,0.852093,0.024329,81
1,0.002496,0.000232,0.002088,0.000039,kd_tree,4,8,distance,"{'algorithm': 'kd_tree', 'leaf_size': 4, 'n_ne...",0.895349,0.894118,0.847059,0.870588,0.917647,0.884952,0.024097,8
2,0.002185,0.000149,0.003934,0.000132,kd_tree,6,19,uniform,"{'algorithm': 'kd_tree', 'leaf_size': 6, 'n_ne...",0.860465,0.882353,0.858824,0.882353,0.917647,0.880328,0.021250,26
3,0.002644,0.000697,0.005096,0.001754,kd_tree,7,13,uniform,"{'algorithm': 'kd_tree', 'leaf_size': 7, 'n_ne...",0.872093,0.882353,0.847059,0.882353,0.917647,0.880301,0.022696,30
4,0.002323,0.000256,0.002018,0.000135,kd_tree,7,8,distance,"{'algorithm': 'kd_tree', 'leaf_size': 7, 'n_ne...",0.895349,0.894118,0.847059,0.870588,0.917647,0.884952,0.024097,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.002871,0.000311,0.002714,0.000314,auto,6,12,distance,"{'algorithm': 'auto', 'leaf_size': 6, 'n_neigh...",0.883721,0.882353,0.835294,0.882353,0.929412,0.882627,0.029768,19
96,0.002909,0.000292,0.002698,0.000258,auto,3,14,distance,"{'algorithm': 'auto', 'leaf_size': 3, 'n_neigh...",0.872093,0.870588,0.835294,0.882353,0.929412,0.877948,0.030251,40
97,0.002763,0.000186,0.002557,0.000255,kd_tree,7,11,distance,"{'algorithm': 'kd_tree', 'leaf_size': 7, 'n_ne...",0.883721,0.882353,0.858824,0.882353,0.917647,0.884979,0.018797,6
98,0.002992,0.000346,0.004684,0.000199,auto,1,4,uniform,"{'algorithm': 'auto', 'leaf_size': 1, 'n_neigh...",0.790698,0.800000,0.847059,0.788235,0.882353,0.821669,0.037078,97


Did it find something similar to Grid Search? Was it faster?
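
One way to answer both questions is to time the two strategies side by side. The following is a rough sketch, assuming a reduced search space (only `n_neighbors` and `weights`) and a hypothetical `n_iter=10`; timings depend on the machine, so no expected numbers are given.

```python
import time
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

grid = {'n_neighbors': np.arange(1, 20), 'weights': ['uniform', 'distance']}
dist = {'n_neighbors': stats.randint(1, 20), 'weights': ['uniform', 'distance']}

searches = {
    'grid': GridSearchCV(KNeighborsClassifier(), grid, cv=5),
    'random': RandomizedSearchCV(KNeighborsClassifier(), dist, n_iter=10,
                                 random_state=0, cv=5),
}

summary = {}
for name, search in searches.items():
    start = time.perf_counter()
    search.fit(X, y)
    elapsed = time.perf_counter() - start
    # Number of candidates evaluated, best CV score, wall-clock seconds
    summary[name] = (len(search.cv_results_['params']),
                     search.best_score_, elapsed)
    print(name, summary[name])
```

The random search evaluates only `n_iter` candidates, so it cannot beat the exhaustive grid over the same space, but it often gets close at a fraction of the cost.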

In [30]:
# Predict on the test data
prediction_test = model.predict(X_test)
print('KNN accuracy on Test:',  accuracy_score(y_test, prediction_test))
# Predict on the train data
prediction_train = model.predict(X_train)
print('KNN accuracy on Train:', accuracy_score(y_train, prediction_train))

KNN accuracy on Test: 0.9020979020979021
KNN accuracy on Train: 1.0


In [31]:
# Confusion matrix
cm = confusion_matrix(y_test,prediction_test)
print("Confusion matrix:")
print(cm)

Confusion matrix:
[[43 10]
 [ 4 86]]


In [32]:
# Classification report
report = classification_report(y_test, prediction_test)
print("Classification report:")
print(report)

Classification report:
              precision    recall  f1-score   support

         0.0       0.91      0.81      0.86        53
         1.0       0.90      0.96      0.92        90

    accuracy                           0.90       143
   macro avg       0.91      0.88      0.89       143
weighted avg       0.90      0.90      0.90       143



---
## The guided part is over, now it's your turn...

<img src="https://www.mememaker.net/api/bucket?path=static/img/memes/full/2020/Jan/6/10/now-it-s-your-turn-15161.png" width="400" />
    
Now it's your turn to apply everything you learned to a new dataset. We generate it artificially with scikit-learn's [make_classification()](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) function.

In [33]:
import seaborn as sns
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100000, n_features=4, n_informative=4,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=1.0, random_state=40)

In [34]:
df = pd.DataFrame()

for i in range(X.shape[1]):
    df['x' + str(i)] = X[:,i]
df['y'] = y 
df.head()

| | x0 | x1 | x2 | x3 | y |
|---|---|---|---|---|---|
| 0 | -0.292624 | -0.783805 | -3.849876 | -1.758806 | 1 |
| 1 | 1.629085 | -1.059597 | -1.654043 | 0.43143 | 0 |
| 2 | -1.504583 | 0.35132 | -1.838668 | -0.418151 | 1 |
| 3 | 1.093868 | -1.61616 | 1.727165 | 1.464183 | 0 |
| 4 | -1.081868 | 0.864286 | 0.838435 | -4.457802 | 1 |


Let's explore the dataset we are going to work with.

In [None]:
sns.pairplot(data = df, vars = df.columns[:-1], hue = 'y');

**Exercises:**
1. Explore the hyperparameter space of a decision tree with `Grid Search`, trained on the artificial dataset above. Choose the hyperparameters that maximize accuracy. Then evaluate the performance on the Test set and compare it with the score obtained by `Grid Search`. Are they different? Why might that be?

`On train the F-score comes out higher because it is biased: those are the data used for training. With test I can report the more realistic performance measure.`

Some useful recommendations:
   * Remember that the space to explore is defined through a dictionary. Some variables that may be interesting to explore for a decision tree are: `criterion`, `max_depth`, `min_samples_split` and `min_samples_leaf`.
   * The results of `GridSearchCV` are stored in a dictionary accessed with `.cv_results_`. To see the *keys* of that dictionary, use `.cv_results_.keys()`.
   * At the end, `GridSearchCV` trains a model on the whole training set with the best parameters it found. That model can therefore be used to predict with `.predict()`.
   * We recommend keeping the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) of `GridSearchCV` in Scikit-Learn at hand.

2. Repeat exercise 1, but this time evaluating precision, recall, F-score and AUC-ROC.

**Note** that multiple metrics can be evaluated at once. Also note that if you do not choose one metric over the others (through the `refit` parameter), `GridSearchCV` cannot retrain the best model. What do the hyperparameters that maximize each metric look like? For example, compare precision and recall.

3. **Optional 1:** repeat exercises 1 and 2, this time using `Random Search`.
4. **Optional 2:** If you still have time and energy, repeat for a `KNN classifier`.
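
The note on multiple metrics in exercise 2 can be sketched as follows. This is one possible setup rather than a reference solution: it uses a much smaller `n_samples` than the notebook to stay fast, a hypothetical one-parameter grid, and `refit='f1'` as an arbitrary choice of the metric that decides the final model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Smaller n_samples than in the notebook, only to keep the example fast
X, y = make_classification(n_samples=2000, n_features=4, n_informative=4,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=1.0, random_state=40)

scoring = ['precision', 'recall', 'f1', 'roc_auc']
multi_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [2, 4, 6, 8]},
    scoring=scoring,
    refit='f1',  # metric used to pick best_params_ and refit the final model
    cv=5,
)
multi_search.fit(X, y)

results = pd.DataFrame(multi_search.cv_results_)
# Hyperparameters that maximize each metric separately
best_per_metric = {
    m: results.loc[results['rank_test_' + m].idxmin(), 'params']
    for m in scoring
}
print(best_per_metric)
print("refit choice:", multi_search.best_params_)
```

With several metrics, `cv_results_` gains one set of `mean_test_<metric>` and `rank_test_<metric>` columns per scorer, which is what makes the per-metric comparison possible.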

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [36]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()

In [37]:
# Grid for the decision tree Grid Search
param_grid_clf = {'max_depth': np.arange(1, 10),
                  'criterion': ['gini', 'entropy'],
                  'min_samples_split': np.arange(2, 4),
                  'min_samples_leaf': np.arange(2, 4)}
#param_grid_clf = {'max_depth':np.arange(1, 10),'criterion': ['gini', 'entropy'],'min_samples_leaf':np.arange(1, 5)}

In [38]:
# STRATEGY 1: Grid Search
#model_clf = GridSearchCV(clf, param_grid=param_grid_clf, cv=5)
model_clf = GridSearchCV(clf, param_grid=param_grid_clf, cv=5,scoring='f1')

# Train: decision tree with the grid defined above and 5-fold cross-validation
model_clf.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': array([1, 2, 3, 4, 5, 6, 7, 8, 9]),
                         'min_samples_leaf': array([2, 3]),
                         'min_samples_split': array([2, 3])},
             scoring='f1')

In [39]:
print("Best tree parameters: "+str(model_clf.best_params_))
print("Best tree score: "+str(model_clf.best_score_)+'\n')
scores = pd.DataFrame(model_clf.cv_results_)
scores

Best tree parameters: {'criterion': 'entropy', 'max_depth': 9, 'min_samples_leaf': 3, 'min_samples_split': 3}
Best tree score: 0.9574199196463944



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_criterion,param_max_depth,param_min_samples_leaf,param_min_samples_split,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.040417,0.003279,0.005366,0.000161,gini,1,2,2,"{'criterion': 'gini', 'max_depth': 1, 'min_sam...",0.823998,0.801519,0.815040,0.808712,0.809577,0.811769,0.007476,65
1,0.037231,0.000312,0.005223,0.000155,gini,1,2,3,"{'criterion': 'gini', 'max_depth': 1, 'min_sam...",0.823998,0.801519,0.815040,0.808712,0.809577,0.811769,0.007476,65
2,0.038256,0.001583,0.005454,0.000422,gini,1,3,2,"{'criterion': 'gini', 'max_depth': 1, 'min_sam...",0.823998,0.801519,0.815040,0.808712,0.809577,0.811769,0.007476,65
3,0.037258,0.000476,0.005212,0.000154,gini,1,3,3,"{'criterion': 'gini', 'max_depth': 1, 'min_sam...",0.823998,0.801519,0.815040,0.808712,0.809577,0.811769,0.007476,65
4,0.065700,0.000520,0.005383,0.000119,gini,2,2,2,"{'criterion': 'gini', 'max_depth': 2, 'min_sam...",0.881948,0.873420,0.877000,0.875558,0.879425,0.877470,0.002972,57
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,0.351822,0.002640,0.005761,0.000104,entropy,8,3,3,"{'criterion': 'entropy', 'max_depth': 8, 'min_...",0.956389,0.953361,0.954506,0.958234,0.959619,0.956422,0.002304,15
68,0.391124,0.004637,0.005892,0.000180,entropy,9,2,2,"{'criterion': 'entropy', 'max_depth': 9, 'min_...",0.956285,0.954931,0.956362,0.958408,0.959181,0.957033,0.001545,9
69,0.389694,0.002969,0.006016,0.000242,entropy,9,2,3,"{'criterion': 'entropy', 'max_depth': 9, 'min_...",0.956158,0.954799,0.956150,0.958552,0.959181,0.956968,0.001639,10
70,0.390113,0.002881,0.006062,0.000201,entropy,9,3,2,"{'criterion': 'entropy', 'max_depth': 9, 'min_...",0.956043,0.955767,0.956030,0.958350,0.959952,0.957229,0.001652,3


In [41]:
# Predict on the test data
prediction_clf_test = model_clf.predict(X_test)
print('Decision tree accuracy on Test:',  accuracy_score(y_test, prediction_clf_test))
# Predict on the train data
prediction_clf_train = model_clf.predict(X_train)
print('Decision tree accuracy on Train:', accuracy_score(y_train, prediction_clf_train))

Decision tree accuracy on Test: 0.95884
Decision tree accuracy on Train: 0.9652


In [43]:
# Confusion matrix
cm = confusion_matrix(y_test,prediction_clf_test)
print("Decision tree confusion matrix:")
print(cm)

Decision tree confusion matrix:
[[11905   606]
 [  423 12066]]


In [47]:
# Classification report
report = classification_report(y_test, prediction_clf_test)
print("Decision tree classification report:")
print(report)

Decision tree classification report:
              precision    recall  f1-score   support

           0       0.97      0.95      0.96     12511
           1       0.95      0.97      0.96     12489

    accuracy                           0.96     25000
   macro avg       0.96      0.96      0.96     25000
weighted avg       0.96      0.96      0.96     25000



In [48]:
print(classification_report(y_train, prediction_clf_train))

              precision    recall  f1-score   support

           0       0.97      0.96      0.96     37485
           1       0.96      0.97      0.97     37515

    accuracy                           0.97     75000
   macro avg       0.97      0.97      0.97     75000
weighted avg       0.97      0.97      0.97     75000
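
The small gap between the train and test reports above can also be tracked inside the search itself. The sketch below is an illustration under assumptions: a reduced `n_samples`, a hypothetical grid over `max_depth` only, and the `return_train_score=True` option of `GridSearchCV`, which records the score on the training folds as well.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=4, n_informative=4,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=1.0, random_state=40)

depth_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [1, 3, 5, 10, None]},
    cv=5,
    return_train_score=True,  # also record the score on each training fold
)
depth_search.fit(X, y)

gap = pd.DataFrame(depth_search.cv_results_)[
    ['param_max_depth', 'mean_train_score', 'mean_test_score']
]
gap['train_minus_test'] = gap['mean_train_score'] - gap['mean_test_score']
print(gap)
```

A gap that widens as `max_depth` grows is a typical sign of overfitting, which is why the test set gives the more realistic number.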

