## Introduction to Machine Learning
### Meta-Classifiers. Bagging

This example shows how to use sklearn to train a classifier with a
Bagging scheme.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

For details on the dataset or the train/test split, see the examples from session 2.

In [49]:
bankruptcy = pd.read_csv('../data/bankruptcy.csv', index_col='Company')

In [50]:
features = bankruptcy.loc[:, bankruptcy.columns != 'Bankrupt']
target = bankruptcy.Bankrupt

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

In [9]:
x_train, x_test, y_train, y_test = train_test_split(features.values,
                                                    target.values,
                                                    test_size=0.4,
                                                    stratify=target.values)

In [20]:
vals, counts = np.unique(y_train, return_counts=True)
dict(zip(vals, counts))

{'no': 15, 'yes': 15}

The class we use for bagging works like the other ML algorithms,
but we have to pass it the base algorithm in *base_estimator*
(this parameter was renamed to *estimator* in scikit-learn 1.2).

In [36]:
tree = DecisionTreeClassifier()
bagg = BaggingClassifier(base_estimator=tree,
                         n_estimators=10, 
                         random_state=0)

In [37]:
bagg = bagg.fit(x_train, y_train)

In [38]:
y_pred = bagg.predict(x_test)
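To make the idea behind `fit` and `predict` concrete, here is a minimal hand-written sketch of bagging: draw bootstrap samples, train one tree per sample, and combine the predictions by majority vote. It is an illustration under assumptions, not sklearn's actual implementation (which, among other things, averages predicted probabilities when the base estimator provides them). The data is a synthetic stand-in generated with `make_classification`, and names such as `X_fit`, `X_new` and `manual_pred` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the bankruptcy dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_fit, y_fit, X_new = X[:150], y[:150], X[150:]

rng = np.random.default_rng(0)
n_estimators = 10
trees = []
for _ in range(n_estimators):
    # Bootstrap sample: draw len(X_fit) indices with replacement
    idx = rng.integers(0, len(X_fit), size=len(X_fit))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_fit[idx], y_fit[idx]))

# One row of predictions per tree, shape (n_estimators, n_new_examples)
votes = np.stack([t.predict(X_new) for t in trees])

# Majority vote per example; labels are 0/1 here, and an exact tie goes to class 0
manual_pred = (votes.mean(axis=0) > 0.5).astype(int)
print(manual_pred[:10])
```

Each tree sees a slightly different version of the training set, so their individual errors tend to differ and the vote smooths them out.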

We also train a standalone tree to check the improvement obtained with the meta-classifier.

In [39]:
single_tree = DecisionTreeClassifier().fit(x_train, y_train)
y_pred_single = single_tree.predict(x_test)

In [40]:
print("Tree:", accuracy_score(y_test, y_pred_single))
print("Bagging:", accuracy_score(y_test, y_pred))      

Tree: 0.8
Bagging: 0.95
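The test set in this split has only 20 examples, so each one is worth 5 accuracy points and a single split can be noisy. One way to get a steadier comparison is cross-validation. The sketch below uses `cross_val_score` on synthetic stand-in data (an assumption, not the bankruptcy dataset), so its numbers are not comparable with the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with the same number of examples as the original (50)
X, y = make_classification(n_samples=50, n_features=5, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
# The base estimator is passed positionally, which sidesteps the keyword name
bagg = BaggingClassifier(tree, n_estimators=10, random_state=0)

# 5-fold cross-validated accuracy for each model
tree_scores = cross_val_score(tree, X, y, cv=5)
bagg_scores = cross_val_score(bagg, X, y, cv=5)

print("Tree:    %.2f +/- %.2f" % (tree_scores.mean(), tree_scores.std()))
print("Bagging: %.2f +/- %.2f" % (bagg_scores.mean(), bagg_scores.std()))
```

Reporting the mean together with the spread across folds makes it easier to judge whether the difference between the two models is meaningful.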


We can inspect the sample (bag) that was used to train each
classifier. Each array holds the indices of the selected examples; repeated
indices appear because the sampling is done with replacement.

In [45]:
bagg.estimators_samples_

[array([12,  3, 18,  0,  0, 18, 22,  9,  7,  3, 15, 25, 15, 23,  9,  6,  1,
        17, 21,  7, 27, 15, 27,  9,  7,  2, 16,  7, 25, 21]),
 array([11, 29, 19, 28,  0, 16, 17, 13, 24, 17,  7, 12,  9,  4,  8, 13, 21,
         2,  7, 14, 15, 12, 18, 14,  2, 11, 11, 28, 10, 11]),
 array([13, 19, 14,  5,  4, 14, 10,  5, 23,  1, 23, 23, 11,  1, 24, 16, 21,
        14, 16, 23, 22,  9, 26, 15, 14, 16,  9, 17, 21,  6]),
 array([ 4,  5,  3, 16, 24,  3, 16, 20,  3, 27, 28, 21,  1,  9,  7,  5, 14,
        21,  5, 24,  7, 25, 27, 10, 12, 21, 22, 24, 12, 20]),
 array([ 1, 23,  9, 11, 11, 11,  5, 19, 15,  6,  3,  6,  5, 19,  4, 15, 10,
         7, 28,  0,  9,  2, 13, 21,  2, 26, 19, 20,  2, 20]),
 array([17, 23, 18, 23,  0,  8, 17, 19, 13, 19, 26, 17, 25, 22, 13, 29, 27,
         0, 11, 21,  6, 26, 18, 25,  0,  6,  2, 17, 22, 15]),
 array([29, 18, 29, 16, 20, 14, 25, 21, 10, 14,  4,  9, 13,  8,  7, 28, 27,
        10, 17, 13, 16, 25, 16,  6,  7, 23, 25, 29, 14, 28]),
 array([ 4, 12, 29, 17,  7, 26, 17

In [48]:
pd.Series(bagg.estimators_samples_[0]).value_counts()

7     4
15    3
9     3
25    2
3     2
27    2
18    2
21    2
0     2
17    1
16    1
12    1
22    1
6     1
23    1
2     1
1     1
dtype: int64
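The counts above show that the first bag contains only 17 distinct examples out of 30, since some indices are drawn several times and others not at all. The sketch below simulates many bootstrap samples of size 30 with NumPy and compares the average fraction of distinct indices with the theoretical value `1 - (1 - 1/n)**n`, which is roughly 64% for `n = 30`. The number of simulations and the variable names are arbitrary choices for illustration.

```python
import numpy as np

n = 30  # training-set size, as in the split above
rng = np.random.default_rng(0)

# Draw many bootstrap samples and count the distinct indices in each one
n_unique = [np.unique(rng.integers(0, n, size=n)).size for _ in range(2000)]

# Probability that a given index appears at least once in a bootstrap sample
expected_fraction = 1 - (1 - 1 / n) ** n

print("simulated mean fraction:", np.mean(n_unique) / n)
print("theoretical fraction:   ", expected_fraction)
```

The examples left out of a bag are called out-of-bag examples; `BaggingClassifier` can use them to estimate accuracy without a separate test set when it is created with `oob_score=True`.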

___

#### Proposed Exercise

- Use the BaggingRegressor class to improve the performance of a regression tree on the Boston housing valuation dataset.
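As a possible starting point, here is a minimal sketch of the same tree-versus-bagging comparison for regression. It does not use the Boston data: `load_boston` was removed from scikit-learn in version 1.2, so the sketch assumes a synthetic stand-in generated with `make_regression`. Swapping in the real dataset, and tuning `n_estimators`, is left as the exercise.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption, not the Boston dataset)
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=0)

# Baseline: a single regression tree
single_tree = DecisionTreeRegressor(random_state=0).fit(x_train, y_train)

# Bagging over regression trees; the base estimator is passed positionally
bagg = BaggingRegressor(DecisionTreeRegressor(random_state=0),
                        n_estimators=10, random_state=0).fit(x_train, y_train)

# score() returns the coefficient of determination R^2 for regressors
print("Tree R^2:   ", single_tree.score(x_test, y_test))
print("Bagging R^2:", bagg.score(x_test, y_test))
```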