## Introduction to Machine Learning
### Meta-classifiers, Output by Voting
### Solution

This notebook works through an exercise that combines several classifiers
into a meta-classifier that predicts by majority vote.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

We will use the dataset on telephone marketing of banking products. For simplicity, we preprocess it
directly with the pandas get_dummies function. For a more suitable preprocessing, see the preprocessing notebook from
session 3.
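As a rough illustration of what get_dummies does, the sketch below applies it to a tiny frame. The column values are hypothetical and only mimic the shape of the bank data: numeric columns pass through unchanged, and each category of a text column becomes its own indicator column.

```python
import pandas as pd

# Toy frame (hypothetical values) with one numeric and one categorical column
toy = pd.DataFrame({
    "age": [30, 33, 35],
    "marital": ["married", "married", "single"],
})

# Numeric columns are kept as they are; each category gets an indicator column
toy_encoded = pd.get_dummies(toy)
print(toy_encoded.columns.tolist())
```

The generated names follow the column_value pattern, which is why the real dataset ends up with columns such as marital_married and marital_single.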

In [2]:
bank_marketing = pd.read_csv('../../data/bank.csv', sep=';')

In [3]:
from sklearn import preprocessing

In [4]:
bank_marketing.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


In [5]:
raw_features = bank_marketing.drop(columns='y')
features = pd.get_dummies(raw_features)
target = bank_marketing.y

In [6]:
features.columns

Index(['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'education_primary', 'education_secondary', 'education_tertiary',
       'education_unknown', 'default_no', 'default_yes', 'housing_no',
       'housing_yes', 'loan_no', 'loan_yes', 'contact_cellular',
       'contact_telephone', 'contact_unknown', 'month_apr', 'month_aug',
       'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'poutcome_failure', 'poutcome_other', 'poutcome_success',
       'poutcome_unknown'],
      dtype='object')

___


### Proposed Exercise

Train 3 classifiers, for example a decision tree, K-NN and Naive Bayes, and build a meta-classifier that predicts the output by majority vote.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

In [8]:
train_x, test_x, train_y, test_y = train_test_split(features.values,
                                                    target.values,
                                                    test_size=0.7,
                                                    stratify=target.values,
                                                    random_state=11
                                                    )
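Note that test_size=0.7 holds out 70% of the rows for testing, and stratify keeps the class proportions the same in both parts. The following is a minimal sketch of that behaviour on toy labels; the 90/10 class balance is an assumption for illustration, not a property taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical): 90 "no" and 10 "yes"
toy_y = np.array(["no"] * 90 + ["yes"] * 10)
toy_X = np.arange(100).reshape(-1, 1)

# Same arguments as the split above: 70% held out, stratified on the labels
_, _, toy_train_y, toy_test_y = train_test_split(
    toy_X, toy_y, test_size=0.7, stratify=toy_y, random_state=11
)

# Share of "yes" labels in each part of the split
train_yes_rate = np.mean(toy_train_y == "yes")
test_yes_rate = np.mean(toy_test_y == "yes")
print(len(toy_train_y), len(toy_test_y), train_yes_rate, test_yes_rate)
```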

We build and train the 3 classifiers.

In [9]:
tree = DecisionTreeClassifier(max_depth=5)
knn = KNeighborsClassifier()
nbayes = GaussianNB()

In [10]:
tree = tree.fit(train_x, train_y)
knn = knn.fit(train_x, train_y)
nbayes = nbayes.fit(train_x, train_y)

We compute the predictions of each classifier and combine them into a DataFrame.

In [11]:
tree_pred = tree.predict(test_x)
knn_pred = knn.predict(test_x)
nbayes_pred = nbayes.predict(test_x)

In [12]:
all_pred = pd.DataFrame({
    'tree': tree_pred,
    'knn': knn_pred,
    'nbayes': nbayes_pred
})
all_pred.head(10)

Unnamed: 0,tree,knn,nbayes
0,no,no,no
1,no,no,no
2,no,no,no
3,no,no,no
4,no,no,no
5,yes,no,yes
6,no,no,no
7,no,no,no
8,no,no,no
9,no,no,no


For each trio of predictions we take the most frequent one
by computing the row-wise mode. With three voters and two classes there is always a unique majority.

In [13]:
all_pred.mode(axis=1)

Unnamed: 0,0
0,no
1,no
2,no
3,no
4,no
...,...
3160,no
3161,yes
3162,no
3163,no
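The row-wise mode can be checked by hand on a small frame. The sketch below uses hypothetical votes. It also shows a caveat that does not arise in this exercise: with an even number of voters, a tie makes mode return one column per tied label.

```python
import pandas as pd

# Toy votes (hypothetical) from three classifiers on three samples
votes = pd.DataFrame({
    "tree":   ["no", "yes", "yes"],
    "knn":    ["no", "no",  "yes"],
    "nbayes": ["yes", "no", "yes"],
})

# Row-wise mode; selecting column 0 gives the winning label as a Series
majority = votes.mode(axis=1)[0]
print(majority.tolist())  # ['no', 'no', 'yes']

# With two voters that disagree, both labels are modes of the row
tied = pd.DataFrame({"a": ["no"], "b": ["yes"]}).mode(axis=1)
print(tied.shape)
```

Selecting column 0 is a convenient way to get a one-dimensional result when, as here, ties cannot happen.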


In [14]:
meta_pred = all_pred.mode(axis=1)

In [15]:
meta_pred

Unnamed: 0,0
0,no
1,no
2,no
3,no
4,no
...,...
3160,no
3161,yes
3162,no
3163,no


We compute the accuracy of each classifier, including
that of the voting meta-classifier.

In [16]:
print("tree:", accuracy_score(test_y, tree_pred))
print("knn:", accuracy_score(test_y, knn_pred))
print("nbayes:", accuracy_score(test_y, nbayes_pred))
print("voting:", accuracy_score(test_y, meta_pred))

tree: 0.885308056872038
knn: 0.8821484992101106
nbayes: 0.8382306477093207
voting: 0.8890995260663507
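For reference, accuracy is simply the fraction of samples whose predicted label matches the true one. A minimal sketch on hypothetical labels, compared against accuracy_score:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels (hypothetical), using the same "yes"/"no" classes as the target
y_true = np.array(["no", "no", "yes", "yes", "no"])
y_pred = np.array(["no", "yes", "yes", "no", "no"])

# Accuracy is the share of positions where prediction and truth agree
manual_accuracy = np.mean(y_true == y_pred)
sklearn_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, sklearn_accuracy)
```

Here three of the five predictions match, so both values come out as 0.6.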


In this case, as expected, the voting scheme achieves a higher accuracy (0.889)
than any individual classifier, although the margin over the best one, the tree (0.885), is small.
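scikit-learn also provides VotingClassifier, which with voting="hard" performs the same kind of majority vote in a single estimator. The sketch below is an alternative to the manual approach, not the method used in this notebook. It runs on synthetic data from make_classification as a stand-in for the bank dataset, and the fixed random_state on the tree is an added assumption so that both routes fit identical models.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (assumption: stands in for the bank dataset)
X, y = make_classification(n_samples=600, n_features=10, random_state=11)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.7, stratify=y, random_state=11
)

estimators = [
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("knn", KNeighborsClassifier()),
    ("nbayes", GaussianNB()),
]

# voting="hard" predicts the majority class label among the base estimators
voter = VotingClassifier(estimators=estimators, voting="hard")
voter.fit(X_train, y_train)
voter_pred = voter.predict(X_test)

# Manual vote for comparison: fit each model separately, take the row-wise mode
manual_votes = pd.DataFrame({
    name: model.fit(X_train, y_train).predict(X_test)
    for name, model in estimators
})
manual_pred = manual_votes.mode(axis=1)[0].to_numpy()

voter_accuracy = accuracy_score(y_test, voter_pred)
print("voting accuracy:", voter_accuracy)
print("same as manual vote:", np.array_equal(voter_pred, manual_pred))
```

VotingClassifier fits clones of the estimators it receives, so the original objects stay unfitted and can be reused for the manual comparison.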