# <u>Supervised Methods for Classification - Part 1</u>

## Use case

<img src = 'https://datahack-prod.s3.ap-south-1.amazonaws.com/__sized__/contest_cover/loanpre-thumbnail-1200x1200.png'>

### Initial library imports

In [None]:
import warnings
warnings.filterwarnings("ignore")

# Import the required Python libraries.
import pandas as pd      ## DataFrame / dataset handling
import numpy as np       ## Vector and matrix operations
from scipy import stats  ## Mathematical tools and algorithms for Python

import seaborn as sns
import matplotlib.pyplot as plt

We use a single seed throughout the notebook for all random processes.

In [None]:
seed = 2021

In [None]:
target = 'Loan_Status'

### **Initial data load**

We will use the datasets for this case as already preprocessed in the Module 13 notebook. They have received the following treatment:

1. 80/20 train/test split
2. Handling of missing values
3. Labeling of variables
4. Encoding of categorical variables (dummies)
5. Treatment of extreme values and outliers
6. Creation of new variables
7. Final rescaling of the data

Note: remember to balance the train set if the target rate is below 5% (do not go beyond 15%-25% in the balanced target).
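
The balancing rule above could be sketched as follows. This is a minimal random-oversampling helper, assuming a binary target encoded 0/1 where 1 is the minority class; the function name `oversample_minority`, its `target_share` parameter and the toy data are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def oversample_minority(df, target, target_share=0.15, seed=2021):
    """Randomly duplicate minority rows (target == 1) until they make up
    roughly `target_share` of the returned frame (hypothetical helper)."""
    minority = df[df[target] == 1]
    majority = df[df[target] == 0]
    if len(minority) == 0 or len(minority) / len(df) >= target_share:
        return df.copy()  # nothing to resample, or already balanced enough
    # Solve n_min / (n_min + n_maj) = target_share for n_min
    n_min = int(np.ceil(target_share * len(majority) / (1 - target_share)))
    extra = minority.sample(n=n_min - len(minority), replace=True, random_state=seed)
    return pd.concat([df, extra], ignore_index=True)

# Toy data with a 3% positive rate (assumed, for illustration only)
toy = pd.DataFrame({"x": np.arange(1000), "Loan_Status": [1] * 30 + [0] * 970})
balanced = oversample_minority(toy, "Loan_Status", target_share=0.15)
print(len(balanced), balanced["Loan_Status"].mean())
```

Resampling of this kind should be applied to the train set only, so that the test set keeps the original class proportions.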

In [None]:
train = pd.read_csv('data/train_preprocesed.csv')
test = pd.read_csv('data/test_preprocesed.csv')

In [None]:
# Check the dimensions of the train set
train.shape

In [None]:
# Check the dimensions of the test set
test.shape

In [None]:
# Overview of the train data
train.head()

In [None]:
# Overview of the test data
test.head()

In [None]:
X_train = train.drop(target, axis =1)
y_train = train[target]

X_test = test.drop(target, axis =1)
y_test = test[target]

## Machine Learning Algorithms

### 1. Binary Logistic Regression

In [None]:
# Step 01: choose and train an ML algorithm
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

In [None]:
# hyperparameters of the model
lr.get_params()

In [None]:
lr.fit(X_train,y_train) # Training!

In [None]:
# feature names
lr.feature_names_in_

In [None]:
# model coefficients
lr.coef_

In [None]:
lr.predict_proba(X_test)[:,1]   # Predicted probability of the positive class

In [None]:
lr.predict(X_test)  # Predicted class

In [None]:
# Step 02: with the trained algorithm, predict on the train and test data!

y_pred_train=lr.predict(X_train) # Prediction on train
y_pred_test= lr.predict(X_test) # Prediction on test

y_proba_test= lr.predict_proba(X_test)[:,1]   # Predicted probabilities of the target

In [None]:
# Step 03: review the appropriate technical validation metrics!
from sklearn import metrics

def metricas_confusion(y_train,y_pred_train,y_test,y_pred_test):
    # Confusion matrix
    print("Confusion matrix: Train")
    cm_train = metrics.confusion_matrix(y_train,y_pred_train)
    print(cm_train)

    print("Confusion matrix: Test")
    cm_test = metrics.confusion_matrix(y_test,y_pred_test)
    print(cm_test)

    # Accuracy
    print("Accuracy: Train")
    accuracy_train=metrics.accuracy_score(y_train,y_pred_train)
    print(accuracy_train)

    print("Accuracy: Test")
    accuracy_test=metrics.accuracy_score(y_test,y_pred_test)
    print(accuracy_test)

    # Precision: share of predicted positives that are actually positive
    print("Precision: Train")
    precision_train=metrics.precision_score(y_train,y_pred_train)
    print(precision_train)

    print("Precision: Test")
    precision_test=metrics.precision_score(y_test,y_pred_test)
    print(precision_test)

    # Recall or sensitivity: share of actual positives that are detected
    print("Recall: Train")
    recall_train=metrics.recall_score(y_train,y_pred_train)
    print(recall_train)

    print("Recall: Test")
    recall_test=metrics.recall_score(y_test,y_pred_test)
    print(recall_test)

In [None]:
metricas_confusion(y_train,y_pred_train,y_test,y_pred_test)
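
To make the link between the confusion matrix and the three scores explicit, here is a small sketch that recomputes them by hand. The arrays `y_true` and `y_pred` are made-up toy labels, not results from the loan dataset.

```python
import numpy as np
from sklearn import metrics

# Toy labels, assumed for illustration only
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

# scikit-learn lays out the binary matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions over all cases
precision = tp / (tp + fp)                   # how many predicted positives are right
recall = tp / (tp + fn)                      # how many actual positives are found

print(accuracy, precision, recall)
```

The hand-computed values agree with `metrics.accuracy_score`, `metrics.precision_score` and `metrics.recall_score` on the same inputs.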

In [None]:
from sklearn.metrics import classification_report

print(metrics.classification_report(y_test, y_pred_test))

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve

def plot_roc_curve(y, y_proba, label = ''):
    '''
    Plot the ROC curve for the given labels and probabilities.
    
    params:
    y: true labels
    y_proba: probabilities produced by the model
    label: model name shown in the title and legend
    '''
    
    auc_roc = roc_auc_score(y, y_proba)
    fpr, tpr, thresholds = roc_curve(y, y_proba)
    
    plt.figure(figsize=(8,6))
    plt.rcParams.update({'font.size': 12})
    plt.plot([0, 1], [0, 1], c = 'red')   # diagonal = random classifier
    plt.plot(fpr, tpr, label= (f"ROC curve {label} (AUC = {auc_roc:.4f})"))
    plt.xlabel("FPR")
    plt.ylabel("TPR")
    plt.title(f"ROC curve {label}")
    plt.legend(loc=4, numpoints=1)

In [None]:
# ROC AUC
roc_auc_score(y_test, y_proba_test)

In [None]:
# ROC curve plot
plot_roc_curve(y_test, y_proba_test, 'Logistic Regression')
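
As a brief illustration of what the AUC figure in the plot represents, the sketch below integrates the ROC points directly and compares the result with `roc_auc_score`. The labels and scores are toy values assumed for the example.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Toy labels and scores, assumed for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

fpr, tpr, thresholds = roc_curve(y_true, scores)
area = auc(fpr, tpr)  # trapezoidal area under the (FPR, TPR) points

print(area, roc_auc_score(y_true, scores))
```

Both numbers coincide: the AUC is the area under the curve drawn by `plot_roc_curve`, which can also be read as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one.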

### Running with several solvers and regularization strengths
We rebuild the logistic regression model on the same dataset, this time trying different values of <b>solver</b> and of the regularization parameter <b>C</b> (the inverse of the regularization strength, so smaller values mean stronger regularization). This lets us compare several models until we find the most suitable one:

In [None]:
solvers=['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
Cs=[0.01,0.02,0.05,0.1]

for s in solvers:
    for c in Cs:
        LR = LogisticRegression(C=c, solver=s).fit(X_train,y_train)
        yhat = LR.predict(X_test)
        yhat_prob = LR.predict_proba(X_test)[:,1]
        print("Solver="+s+", C="+str(c)+
              "->Accuracy: "+str(metrics.accuracy_score(y_test, yhat)) +
             "->AUC : "+str(roc_auc_score(y_test, yhat_prob)))

### 2. Discriminant Analysis

In [None]:
# Linear and quadratic discriminant analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA, QuadraticDiscriminantAnalysis as QDA

lda = LDA()
model_lda = lda.fit(X_train, y_train)

qda = QDA()
model_qda = qda.fit(X_train, y_train)

In [None]:
# Predict! LDA
Y_pred_train = lda.predict(X_train) # prediction on train
Y_pred_test  = lda.predict(X_test) # prediction on test

Y_proba_test= lda.predict_proba(X_test)[:,1]   # Predicted probabilities of the target

In [None]:
# Compute the most relevant metrics!
metricas_confusion(y_train,Y_pred_train,y_test,Y_pred_test)

In [None]:
print(metrics.classification_report(y_test, Y_pred_test))

In [None]:
# ROC AUC
roc_auc_score(y_test, Y_proba_test)

In [None]:
# ROC curve plot
plot_roc_curve(y_test, Y_proba_test, 'Linear Discriminant Analysis')

In [None]:
# Predict! QDA
Y_pred_train = qda.predict(X_train) # prediction on train
Y_pred_test  = qda.predict(X_test) # prediction on test

Y_proba_test= qda.predict_proba(X_test)[:,1]   # Predicted probabilities of the target

In [None]:
# Compute the most relevant metrics!
metricas_confusion(y_train,Y_pred_train,y_test,Y_pred_test)

In [None]:
print(metrics.classification_report(y_test, Y_pred_test))

In [None]:
# ROC AUC
roc_auc_score(y_test, Y_proba_test)

In [None]:
# ROC curve plot (these probabilities come from the QDA model)
plot_roc_curve(y_test, Y_proba_test, 'Quadratic Discriminant Analysis')

### 3. CART Classification Tree

In [None]:
# CART classification tree (shallow "bonsai" tree)
from sklearn.tree import DecisionTreeClassifier
tree_bonsai = DecisionTreeClassifier(
                       ccp_alpha=0.0, 
                       class_weight=None, 
                       criterion='gini',
                       max_depth=2,           # Maximum depth of the tree
                       max_features=3,        # Maximum number of features considered per split
                       max_leaf_nodes=None,   # Maximum number of leaf nodes
                       min_samples_leaf=100, 
                       min_samples_split=200,
                       min_weight_fraction_leaf=0.0, 
                       random_state=seed,     # notebook-wide seed, for reproducibility
                       splitter='best')

In [None]:
# CART classification tree (fully grown, default hyperparameters)
from sklearn.tree import DecisionTreeClassifier
tree_complete = DecisionTreeClassifier()

In [None]:
# CART classification tree (expert settings)
from sklearn.tree import DecisionTreeClassifier
tree_expert = DecisionTreeClassifier(
                       ccp_alpha=0.0, 
                       class_weight=None, 
                       criterion='gini',
                       max_depth=3,         # Maximum depth of the tree
                       max_features=6,      # Maximum number of features considered per split
                       max_leaf_nodes=None, # Maximum number of leaf nodes
                       min_samples_leaf=20, 
                       min_samples_split=40,
                       min_weight_fraction_leaf=0.0, 
                       random_state=seed,   # notebook-wide seed, for reproducibility
                       splitter='best')

In [None]:
# Train!
tree_bonsai = tree_bonsai.fit(X_train,y_train) # fit the model to the data
tree_complete = tree_complete.fit(X_train,y_train) # fit the model to the data
tree_expert = tree_expert.fit(X_train,y_train) # fit the model to the data

In [None]:
# Visualize the tree!
from sklearn.tree import plot_tree
_ = plot_tree(tree_bonsai, feature_names = X_train.columns, rounded = True, filled = True)

In [None]:
plt.figure(figsize=(12,12))
_ = plot_tree(tree_complete, feature_names = X_train.columns, rounded = True, filled = True)
plt.show()

In [None]:
plt.figure(figsize=(12,12))
_ = plot_tree(tree_expert, fontsize= 10, feature_names = X_train.columns, rounded = True, filled = True)
plt.show()

In [None]:
# Predict!
Y_pred_train = tree_expert.predict(X_train) # prediction on train
Y_pred_test  = tree_expert.predict(X_test) # prediction on test

Y_proba_test= tree_expert.predict_proba(X_test)[:,1]   # Predicted probabilities of the target

In [None]:
# Compute the most relevant metrics!
metricas_confusion(y_train,Y_pred_train,y_test,Y_pred_test)

In [None]:
print(metrics.classification_report(y_test, Y_pred_test))

In [None]:
# ROC AUC
roc_auc_score(y_test, Y_proba_test)

In [None]:
# ROC curve plot
plot_roc_curve(y_test, Y_proba_test, 'Classification tree')

### 4. KNN

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()   # rescaling the data is recommended before using this technique
scaler.fit(X_train)

X_train_ss = scaler.transform(X_train)
X_test_ss = scaler.transform(X_test)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train_ss, y_train)

In [None]:
# Predict!
Y_pred_train = knn.predict(X_train_ss) # prediction on train
Y_pred_test  = knn.predict(X_test_ss) # prediction on test

Y_proba_test= knn.predict_proba(X_test_ss)[:,1]   # Predicted probabilities of the target

In [None]:
# Compute the most relevant metrics!
metricas_confusion(y_train,Y_pred_train,y_test,Y_pred_test)

In [None]:
print(metrics.classification_report(y_test, Y_pred_test))

In [None]:
# ROC AUC
roc_auc_score(y_test, Y_proba_test)

In [None]:
# ROC curve plot
plot_roc_curve(y_test, Y_proba_test, 'KNN')

### 5. SVM

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler().fit(X_train)   # rescaling the data is recommended before using this technique
X_train_ss = sc.transform(X_train)
X_test_ss = sc.transform(X_test)

In [None]:
from sklearn import svm
svml = svm.SVC(kernel='linear', C=0.01, probability=True)
svml.fit(X_train_ss, y_train)

In [None]:
# Predict!
Y_pred_train = svml.predict(X_train_ss) # prediction on train
Y_pred_test  = svml.predict(X_test_ss) # prediction on test

Y_proba_test= svml.predict_proba(X_test_ss)[:,1]   # Predicted probabilities of the target

In [None]:
# Compute the most relevant metrics!
metricas_confusion(y_train,Y_pred_train,y_test,Y_pred_test)

In [None]:
print(metrics.classification_report(y_test, Y_pred_test))

In [None]:
# ROC AUC
roc_auc_score(y_test, Y_proba_test)

In [None]:
# ROC curve plot
plot_roc_curve(y_test, Y_proba_test, 'SVM')

### Hyperparameter search with GridSearch
This procedure finds the best hyperparameters of a model through an exhaustive search.
A list of candidate values is supplied for each hyperparameter of the algorithm.
The model is evaluated for every combination of hyperparameters, and the combination with the best value of the evaluation metric is selected.

We rebuild the SVM model on the same dataset, this time with a procedure that searches for the best hyperparameters, using linear and radial kernels.
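
The mechanics described above can be sketched on a small synthetic problem. This is only an illustration under stated assumptions: the data come from `make_classification`, and the names `X_demo`, `y_demo`, `param_grid_demo` and `search` are hypothetical and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

# Synthetic binary problem, assumed for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=2021)

param_grid_demo = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}

# Every combination is a candidate: 3 values of C x 2 values of gamma = 6
n_candidates = len(ParameterGrid(param_grid_demo))

search = GridSearchCV(SVC(kernel="rbf"), param_grid_demo, scoring="roc_auc", cv=5)
search.fit(X_demo, y_demo)

print(n_candidates)
print(search.best_params_, search.best_score_)
```

With 5-fold cross-validation each of the 6 candidates is fitted 5 times, so the cost grows quickly as more values or more hyperparameters are added to the grid.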

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ParameterGrid
import multiprocessing

In [None]:
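# List the valid names for the `scoring` argument.
# `sklearn.metrics.SCORERS` is no longer available in recent scikit-learn releases;
# `get_scorer_names()` is the supported replacement.
from sklearn.metrics import get_scorer_names
sorted(get_scorer_names())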

In [None]:
# Hyperparameter grid evaluated - LINEAR KERNEL
# ==============================================================================
param_grid = {'C': [0.1,1,10,100],
              'tol': [2**-2,2**-1,2**0]
             }

# Grid search with cross-validation
# ==============================================================================
grid = GridSearchCV(
        estimator  = svm.SVC(kernel = 'linear',gamma='scale'),
        param_grid = param_grid,
        scoring    = 'roc_auc',
        n_jobs     = multiprocessing.cpu_count() - 1,
        cv         = RepeatedKFold(n_splits=5, n_repeats=3, random_state=123), 
        refit      = True,
        verbose    = 0,
        return_train_score = True
       )

grid.fit(X = X_train, y = y_train)

# Results
# ==============================================================================
resultados = pd.DataFrame(grid.cv_results_)
resultados.filter(regex = '(param*|mean_t|std_t)') \
    .drop(columns = 'params') \
    .sort_values('mean_test_score', ascending = False) \
    .head(10)

In [None]:
# Hyperparameter grid evaluated - RADIAL KERNEL
# ==============================================================================
param_grid = {'C': [0.1,1,10,100],
              'tol': [2**-2,2**-1,2**0],
              'gamma': [10**-4,10**-3,10**-2,10**-1]
             }

# Grid search with cross-validation
# ==============================================================================
grid = GridSearchCV(
        estimator  = svm.SVC(kernel = 'rbf',gamma='scale'),
        param_grid = param_grid,
        scoring    = 'roc_auc',
        n_jobs     = multiprocessing.cpu_count() - 1,
        cv         = RepeatedKFold(n_splits=5, n_repeats=3, random_state=123), 
        refit      = True,
        verbose    = 0,
        return_train_score = True
       )

grid.fit(X = X_train, y = y_train)

# Results
# ==============================================================================
resultados = pd.DataFrame(grid.cv_results_)
resultados.filter(regex = '(param*|mean_t|std_t)') \
    .drop(columns = 'params') \
    .sort_values('mean_test_score', ascending = False) \
    .head(10)

Let's try this exhaustive search with a decision tree, searching over several important parameters:

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Hyperparameter grid evaluated - CART TREE
# ==============================================================================
param_grid = {'max_depth': [1,2,3,4,5,6,7,8,9,10],
              'max_features': [3,4,5,6,7,8,9],
              'min_samples_leaf': [20,50,100],
              'min_samples_split': [40,100,200]
             }

# Grid search with cross-validation
# ==============================================================================
grid = GridSearchCV(
        estimator  = DecisionTreeClassifier(
                       ccp_alpha=0.0, 
                       class_weight=None, 
                       criterion='gini',
                       max_leaf_nodes=None,
                       min_weight_fraction_leaf=0.0, 
                       random_state=None, 
                       splitter='best'),
        param_grid = param_grid,
        scoring    = 'roc_auc',
        n_jobs     = multiprocessing.cpu_count() - 1,
        cv         = RepeatedKFold(n_splits=5, n_repeats=3, random_state=123), 
        refit      = True,
        verbose    = 0,
        return_train_score = True
       )

grid.fit(X = X_train, y = y_train)

# Results
# ==============================================================================
resultados = pd.DataFrame(grid.cv_results_)
resultados.filter(regex = '(param*|mean_t|std_t)') \
    .drop(columns = 'params') \
    .sort_values('mean_test_score', ascending = False) \
    .head(10)

In [None]:
# CART classification tree (best parameters found by the grid search)
from sklearn.tree import DecisionTreeClassifier
tree_final = DecisionTreeClassifier(
                       ccp_alpha=0.0, 
                       class_weight=None, 
                       criterion='gini',
                       max_depth=10,        # Maximum depth of the tree
                       max_features=8,      # Maximum number of features considered per split
                       max_leaf_nodes=None, # Maximum number of leaf nodes
                       min_samples_leaf=20, 
                       min_samples_split=100,
                       min_weight_fraction_leaf=0.0, 
                       random_state=seed,   # notebook-wide seed, for reproducibility
                       splitter='best')

tree_final = tree_final.fit(X_train,y_train) # fit the model to the data

In [None]:
plt.figure(figsize=(12,12))
_ = plot_tree(tree_final, fontsize= 10, feature_names = X_train.columns, rounded = True, filled = True)
plt.show()
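
Instead of copying the winning values by hand, the refitted model can be read straight from the search object. The sketch below shows this on synthetic data; `X_demo`, `y_demo`, `tree_search` and the small grid are assumptions made for the example, not the notebook's actual search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem, assumed for illustration only
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=2021)

tree_search = GridSearchCV(
    DecisionTreeClassifier(random_state=2021),
    {"max_depth": [2, 4, 6], "min_samples_leaf": [5, 20]},
    scoring="roc_auc",
    cv=5,
    refit=True,   # refit the best combination on the full training data
)
tree_search.fit(X_demo, y_demo)

best_tree = tree_search.best_estimator_   # already fitted, ready for predict / plot_tree
print(tree_search.best_params_)
```

Because `refit=True`, `best_estimator_` is already trained on all the data passed to `fit`, so it can be handed directly to `plot_tree` or used for prediction.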

### 6. ENSEMBLE MODELS: BASIC TECHNIQUES

### Max Voting

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3= LogisticRegression()

In [None]:
model1.fit(X_train,y_train)
model2.fit(X_train,y_train)
model3.fit(X_train,y_train)

pred1=model1.predict(X_test)
pred2=model2.predict(X_test)
pred3=model3.predict(X_test)

In [None]:
from statistics import mode
final_pred = np.array([])
for i in range(0,len(X_test)):
    final_pred = np.append(final_pred, mode([pred1[i], pred2[i], pred3[i]]))

In [None]:
final_pred

In [None]:
# Compute the most relevant metrics!
print(metrics.classification_report(y_test, final_pred))
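
The manual majority vote above can also be expressed with scikit-learn's `VotingClassifier`. The sketch below is one possible equivalent on synthetic data; the data and the names `X_tr`, `X_te`, `voter` and `hard_pred` are assumptions introduced for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem, assumed for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=2021)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2021)

voter = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=2021)),
        ("knn", KNeighborsClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",   # each model casts one vote; the most frequent class wins
)
voter.fit(X_tr, y_tr)
hard_pred = voter.predict(X_te)
print(hard_pred[:10])
```

With three voters and two classes there can be no ties, so the result matches the per-row `mode` computed in the loop.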

### Averaging

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3= LogisticRegression()

In [None]:
model1.fit(X_train,y_train)
model2.fit(X_train,y_train)
model3.fit(X_train,y_train)

In [None]:
pred1=model1.predict_proba(X_test)
pred2=model2.predict_proba(X_test)
pred3=model3.predict_proba(X_test)

In [None]:
finalpred=(pred1+pred2+pred3)/3

In [None]:
finalpred

In [None]:
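# Column 1 holds the averaged probability of the positive class (assuming a 0/1 target)
finalpredf = (finalpred[:,1] >= 0.5).astype(int)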

In [None]:
# Compute the most relevant metrics!
print(metrics.classification_report(y_test, finalpredf))

### Weighted Average

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3= LogisticRegression()

model1.fit(X_train,y_train)
model2.fit(X_train,y_train)
model3.fit(X_train,y_train)

pred1=model1.predict_proba(X_test)
pred2=model2.predict_proba(X_test)
pred3=model3.predict_proba(X_test)

finalpred=(pred1*0.3+pred2*0.2+pred3*0.5)

In [None]:
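# Column 1 holds the weighted probability of the positive class (assuming a 0/1 target)
finalpredf = (finalpred[:,1] >= 0.5).astype(int)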

In [None]:
# Compute the most relevant metrics!
print(metrics.classification_report(y_test, finalpredf))
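
The weighted average of probabilities can likewise be written with `VotingClassifier` in soft-voting mode. This is a sketch on synthetic data that reuses the 0.3 / 0.2 / 0.5 weights from the cell above; the data and the names `soft_voter`, `avg_proba` and `soft_pred` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem, assumed for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=2021)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2021)

weights = [0.3, 0.2, 0.5]   # tree, KNN, logistic regression
soft_voter = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=2021)),
        ("knn", KNeighborsClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",      # average predicted probabilities instead of counting votes
    weights=weights,
)
soft_voter.fit(X_tr, y_tr)

avg_proba = soft_voter.predict_proba(X_te)   # weighted average, one column per class
soft_pred = soft_voter.predict(X_te)         # class with the highest averaged probability
print(avg_proba[:3], soft_pred[:3])
```

Setting `weights=None` gives the plain (unweighted) averaging shown in the previous subsection.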

### 7. BAGGING

In [None]:
## Supervised models: Random Forest ##
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500,  # Number of trees
                            max_features= 6,   # Features considered at each split
                            max_depth=4        # Maximum depth of each tree
                            )
rf.fit(X_train, y_train) # Train the algorithm

In [None]:
# Predict with the trained algorithm to validate it
y_pred_train=rf.predict(X_train) # Prediction on train
y_pred_test= rf.predict(X_test) # Prediction on test

In [None]:
# Compute the most relevant metrics!
metricas_confusion(y_train,y_pred_train,y_test,y_pred_test)

### 8. BOOSTING

### AdaBoost

In [None]:
## Supervised models: AdaBoost ##
from sklearn.ensemble import AdaBoostClassifier  # Step 01: import
AdaBoost=AdaBoostClassifier(learning_rate=0.001, 
                            n_estimators=250) # Step 02: instantiate and set hyperparameters
AdaBoost.fit(X_train, y_train)                   # Step 03: train the algorithm

In [None]:
# Predict with the trained algorithm to validate it
y_pred_train=AdaBoost.predict(X_train) # Prediction on train
y_pred_test= AdaBoost.predict(X_test) # Prediction on test

In [None]:
# Compute the most relevant metrics!
metricas_confusion(y_train,y_pred_train,y_test,y_pred_test)

More on ensemble models: 
https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/