In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore

from sklearn.model_selection import train_test_split, GridSearchCV
# from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import precision_recall_fscore_support

In [2]:
# Option to display all of the dataset's columns in the notebook
pd.set_option('display.max_columns', 50)

# Practical 04: Supervised Learning

To finalize our model, we will apply sampling strategies to split the data into train and test, and run cross-validation on train. We will try several classifiers and evaluate the results with multiple metrics. Finally, we will compute feature importance and draw conclusions.

## Goal of the practical

### Train-Validation-Test
(take from the previous practical)
- Splitting the dataset into train/validation/test
- Stratification

### Preprocessing
(take from the previous practical)
- Handling null values
- Standardization
- Encoding of categorical variables

### Defining the metrics

We will define the metrics to use:
- Accuracy
- Precision
- Recall
- F1
- AUC
- PR AUC  

We will also look into how to use the classification report and the confusion matrix, and how to use cross-validation.
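
As a rough illustration of the metrics listed above, the following sketch computes them with scikit-learn on a synthetic, imbalanced dataset. The data, the logistic regression model and every variable name are assumptions made for the example, not part of the practical. PR AUC is approximated here with `average_precision_score`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic binary problem with roughly 10% positives (a stand-in for the bank data).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)               # hard labels: accuracy, precision, recall, F1
y_score = model.predict_proba(X_te)[:, 1]  # positive-class scores: AUC and PR AUC

metrics = {
    "accuracy": accuracy_score(y_te, y_pred),
    "precision": precision_score(y_te, y_pred, zero_division=0),
    "recall": recall_score(y_te, y_pred),
    "f1": f1_score(y_te, y_pred),
    "roc_auc": roc_auc_score(y_te, y_score),
    "pr_auc": average_precision_score(y_te, y_score),
}
cm = confusion_matrix(y_te, y_pred)  # rows: true class, columns: predicted class

print(metrics)
print(cm)
print(classification_report(y_te, y_pred, zero_division=0))
```

Threshold-based metrics use `predict`, while AUC and PR AUC need continuous scores, which is why the sketch keeps `y_pred` and `y_score` separate.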

### Testing several models

We will run several tests with different types of scikit-learn models:
- Logistic regression
- SVM
- Naive Bayes
- etc.  
We will use cross-validation and compare the results against validation and test.
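
A minimal sketch of that comparison, assuming synthetic data and default hyperparameters: each candidate model is scored with the same stratified 5-fold split. `GaussianNB` is used only because the demo features are continuous; the names `models` and `cv_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data (about 20% positives).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Scale-sensitive models get their own scaler, fitted inside each fold.
models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "svm": make_pipeline(StandardScaler(), SVC()),
    "naive_bayes": GaussianNB(),
}

# The same folds are reused for every model so that the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {
    name: cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="f1_macro")
    for name, estimator in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```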

### Tree-based models

In this part we will use models that do not belong to the scikit-learn library.  
These models are among the most widely used today and have proven effective in many Kaggle competitions.  
In addition, they have the advantage that ...
- XGBoost
- LightGBM

### Hyperparameter optimization

In this section we will try several kinds of hyperparameter optimization to improve our metrics.
- Grid Search
- Randomized Search

### Explainability

We will compute feature importance and, optionally, use the SHAP library to analyze the predictions.


### Presentation

At the end of the practical, you need to prepare 3 or 4 slides that will be included in the final presentation.  
The slides must cover the preprocessing stages, the models we used, how we optimized the hyperparameters, how we validated, and which metrics we used. Finally, we will answer from a business point of view whether the model is useful or not.

### Recommended libraries

We will mainly use scikit-learn, and optionally xgboost and lightgbm.  
I recommend the following material:  

- https://scikit-learn.org/stable/ -> Reference for the scikit-learn library. It contains almost everything we are going to use: pipelines, preprocessing and several models.
- https://xgboost.readthedocs.io/en/latest/ -> A widely used library because of its very good results. It is a "boosting tree" type of algorithm.
- https://lightgbm.readthedocs.io/en/latest/ -> Another library similar to xgboost, increasingly popular thanks to its ease of use and good results.
- https://shap.readthedocs.io/en/latest/index.html -> The SHAP library, for explainability and for analyzing predictions.
- https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py -> Example of pipelines, cross-validation and hyperparameter optimization

## Practical 04: Supervised Learning - Solution

In [3]:
# Read the dataset with the pandas function "read_csv"
df = pd.read_csv("data.csv", sep=";")

### Train-Validation-Test

(Take the code from Practical 03)

In [4]:
from sklearn.base import BaseEstimator, TransformerMixin

def get_contactado(x):
    if x >= 999:
        return '0'
    elif x < 6:
        return '1'
    elif 5 < x < 11:
        return '2'   
    else:
        return '3'
    
def productos_financieros(x):
    if x.loan == 'yes' or x.housing == 'yes':
        return 'yes'
    else:
        return 'no'

class CatCustom(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame is not modified in place
        X = X.copy()

        X["job"] = X.job.replace("unknown", X.job.mode()[0])
        X["marital"] = X.marital.replace("unknown", X.marital.mode()[0])
        X["education"] = X.education.replace("unknown", X.education.mode()[0])
        X["loan"] = X.loan.replace("unknown", X.loan.mode()[0])
        X["housing"] = X.housing.replace("unknown", X.housing.mode()[0])
        
        # Education
        X["education"] = X["education"].replace({
                            'illiterate': 'ninguno',
                            'basic.4y': 'primario',
                           'basic.6y':'primario',
                           'basic.9y': 'primario',
                           'high.school':'secundario',
                           'professional.course':'terciario',
                           'university.degree':'universitario'})
        # Create this column to flag whether the customer has any financial product
        X['productos_financieros'] = X.apply(productos_financieros, axis=1)
        # Drop the columns that were combined above
        X = X.drop(['loan', 'housing'], axis=1)
        
        X = pd.get_dummies(X)
        return X
    
class NumCustom(BaseEstimator, TransformerMixin):
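    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame is not modified in place
        X = X.copy()
        # Remove outliers (disabled: kept below as a string literal for reference)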
        '''
        cols = ['age']
        for c in cols:
            z_scores = zscore(df[c])
            abs_z_scores = np.abs(z_scores)
            filtered_entries = (abs_z_scores < 3)
            X = X[filtered_entries]
        
        X_with_nan = X[cols][X[cols].isna().any(axis=1)]
        X.merge(X_with_nan)
        '''
        
        # Bucket 'pdays' into the categorical feature 'contactado'
        X['contactado'] = X['pdays'].apply(get_contactado)
        
        X = X.drop(["pdays", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "nr.employed", "duration"], axis=1)
        return X

In [5]:
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
    ("num_custom", NumCustom())
])

cat_pipeline = Pipeline([
    ("cat_custom", CatCustom()),
])

pipeline_completo = Pipeline([
    ("num_custom", NumCustom()),
    ("cat_custom", CatCustom()),
])

In [6]:
df["y"] = df["y"].replace({"no": 0, "yes":1})

In [7]:
X = df.drop(columns='y')
y = df.y

In [8]:
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

In [9]:
X_temp.shape, y_temp.shape, X_test.shape, y_test.shape

((32950, 20), (32950,), (8238, 20), (8238,))

In [10]:
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.2, stratify=y_temp, random_state=42)

In [11]:
X_train.shape, y_train.shape, X_val.shape, y_val.shape

((26360, 20), (26360,), (6590, 20), (6590,))
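
To see what `stratify` contributes to the two splits above, here is a small sketch on synthetic labels (10% positives, a hypothetical rate chosen for the example). It repeats the same two-step 80/20 split and checks that each subset keeps the original class proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with exactly 10% positives; the features are just row ids.
y_all = np.array([1] * 100 + [0] * 900)
X_all = np.arange(len(y_all)).reshape(-1, 1)

# First split: hold out 20% as test. Second split: 20% of the remainder as validation.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.2, stratify=y_tmp, random_state=42
)

# Share of positives in each subset; with stratification they stay close to 0.10.
positive_rates = {
    "train": y_tr.mean(),
    "validation": y_va.mean(),
    "test": y_te.mean(),
}
print(len(y_tr), len(y_va), len(y_te))
print(positive_rates)
```

The overall proportions end up at 64% train, 16% validation and 20% test, which matches the shapes shown above for the bank dataset.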

### Preprocessing

In [12]:
X_train = pipeline_completo.fit_transform(X=X_train)


(Take the code from Practical 03)

### Defining the metrics

Justify which metrics will be used:  
- Accuracy
- Precision
- Recall
- F1
- AUC
- PR AUC  

Explanation of the chosen metrics for a non-technical stakeholder

In [None]:
# Example: our model identifies the customers who will take out a fixed-term deposit.
# PRECISION: of the customers our model says will convert, precision is the percentage that actually converted.
# If the model flags 80 customers as likely to convert and precision is 50%, then 40 of them actually convert.
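
The arithmetic in that example can be reproduced with `precision_score`. The label vectors below are made up to match the scenario (80 flagged customers, half of whom convert); the 20 missed converters and 100 true negatives are arbitrary filler and do not affect precision.

```python
from sklearn.metrics import precision_score

# Hypothetical outcome: 40 true positives, 40 false positives,
# 20 false negatives and 100 true negatives.
y_true = [1] * 40 + [0] * 40 + [1] * 20 + [0] * 100
y_pred = [1] * 40 + [1] * 40 + [0] * 20 + [0] * 100

predicted_positives = sum(y_pred)            # customers the model flags
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
expected_conversions = precision * predicted_positives

print(predicted_positives, precision, expected_conversions)
```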

### Testing several models

We will run several tests with different types of scikit-learn models:
- Logistic regression
- SVM
- Naive Bayes
- etc.  
We will use cross-validation and compare the results against validation and test.

In [13]:
import mlflow

In [14]:
# The pipeline was already fitted on the training set, so only transform here
X_test = pipeline_completo.transform(X=X_test)


In [15]:
train_cols = set(X_train.columns.to_list())
test_cols = set(X_test.columns.to_list())

In [16]:
train_cols - test_cols

{'default_yes'}

In [17]:
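# Add the missing dummy column (filled with 0) and match the training column order;
# reindex also drops any column that is not present in X_train
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)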

In [18]:
# The pipeline was already fitted on the training set, so only transform here
X_val = pipeline_completo.transform(X=X_val)


In [19]:
train_cols = set(X_train.columns.to_list())
val_cols = set(X_val.columns.to_list())
train_cols - val_cols

set()

In [20]:
mlflow.set_experiment("bank_experiment")

#### LogisticRegression

In [None]:
from sklearn.linear_model import LogisticRegression

with mlflow.start_run(run_name="LogRes"):

    pipeline_numerico_lineal = Pipeline([
                                 ('standard_scaler', StandardScaler()),
                                ])

    X_t = pipeline_numerico_lineal.fit_transform(X=X_train)
    X_test_linear = pipeline_numerico_lineal.transform(X=X_test)
    
    log_ref = LogisticRegression()
    log_ref.fit(X_t, y_train)
    
    y_pred_linear = log_ref.predict(X_test_linear)
    
    prec, rec, fscore, _ = precision_recall_fscore_support(y_test, y_pred_linear, average="macro")
    mlflow.log_metric("precision", prec)
    mlflow.log_metric("recall", rec)
    mlflow.log_metric("f1-score", fscore)
    

#### Decision Tree Clf

In [None]:
from sklearn.tree import DecisionTreeClassifier

with mlflow.start_run(run_name="DTC"):
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X_train, y_train)
    
    prec, rec, fscore, _ = precision_recall_fscore_support(y_test, clf.predict(X_test), average="macro")
    mlflow.log_metric("precision", prec)
    mlflow.log_metric("recall", rec)
    mlflow.log_metric("f1-score", fscore)

#### SVM

In [None]:
from sklearn.svm import SVC

with mlflow.start_run(run_name="SVC"):
    clf = SVC(random_state=42)
    clf.fit(X_train, y_train)
    
    prec, rec, fscore, _ = precision_recall_fscore_support(y_test, clf.predict(X_test), average="macro")
    mlflow.log_metric("precision", prec)
    mlflow.log_metric("recall", rec)
    mlflow.log_metric("f1-score", fscore)

#### Naive Bayes

In [None]:
from sklearn.naive_bayes import BernoulliNB

with mlflow.start_run(run_name="Naive Bayes"):
    clf = BernoulliNB()
    clf.fit(X_train, y_train)
    
    prec, rec, fscore, _ = precision_recall_fscore_support(y_test, clf.predict(X_test), average="macro")
    mlflow.log_metric("precision", prec)
    mlflow.log_metric("recall", rec)
    mlflow.log_metric("f1-score", fscore)

#### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

with mlflow.start_run(run_name="RFC"):
    clf = RandomForestClassifier(random_state=42)
    clf.fit(X_train, y_train)
    
    prec, rec, fscore, _ = precision_recall_fscore_support(y_test, clf.predict(X_test), average="macro")
    mlflow.log_metric("precision", prec)
    mlflow.log_metric("recall", rec)
    mlflow.log_metric("f1-score", fscore)

#### XGBoost

In [27]:
from xgboost import XGBClassifier

with mlflow.start_run(run_name="XGBC"):
    clf = XGBClassifier(objective="binary:logistic",
                            use_label_encoder=False,
                            random_state=42)
    clf.fit(X_train, y_train)
    
    prec, rec, fscore, _ = precision_recall_fscore_support(y_test, clf.predict(X_test), average="macro")
    mlflow.log_metric("precision", prec)
    mlflow.log_metric("recall", rec)
    mlflow.log_metric("f1-score", fscore)



#### LightGBM

In [None]:
from lightgbm import LGBMClassifier

with mlflow.start_run(run_name="LGBMC"):
    params = {
        "objective": "binary",
        "random_state": 42
    }
    mlflow.log_params(params)
    
    clf = LGBMClassifier(**params)
    clf.fit(X_train, y_train)
    
    prec, rec, fscore, _ = precision_recall_fscore_support(y_test, clf.predict(X_test), average="macro")
    mlflow.log_metric("precision", prec)
    mlflow.log_metric("recall", rec)
    mlflow.log_metric("f1-score", fscore)

#### MLFlow ui

In [21]:
!mlflow ui

[2021-09-21 09:28:56 -0300] [219] [INFO] Starting gunicorn 20.1.0
[2021-09-21 09:28:56 -0300] [219] [INFO] Listening at: http://127.0.0.1:5000 (219)
[2021-09-21 09:28:56 -0300] [219] [INFO] Using worker: sync
[2021-09-21 09:28:56 -0300] [221] [INFO] Booting worker with pid: 221
^C
[2021-09-21 09:29:42 -0300] [219] [INFO] Handling signal: int
[2021-09-21 09:29:42 -0300] [221] [INFO] Worker exiting (pid: 221)


### Hyperparameter optimization

In this section we will try several kinds of hyperparameter optimization to improve our metrics. We will pick one of the models (XGBoost or LightGBM) and search for its optimal parameters.

#### Grid Search

##### GS - Log Reg

In [40]:
from sklearn.linear_model import LogisticRegression

with mlflow.start_run(run_name="GS-logreg"):
    pipeline_numerico_lineal = Pipeline([
                                 ('standard_scaler', StandardScaler()),
                                ])

    X_t = pipeline_numerico_lineal.fit_transform(X=X_train)
    X_test_linear = pipeline_numerico_lineal.transform(X=X_test)

    param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }

    log_reg = LogisticRegression()
    grid = GridSearchCV(log_reg, param_grid, cv=5, n_jobs=-1, scoring="f1_macro")
    grid.fit(X_t, y_train)

    mlflow.log_metric("f1-score", grid.cv_results_["mean_test_score"][grid.best_index_])
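
In the cell above the scaler is fitted on all of `X_train` before cross-validation, so every validation fold has already influenced the scaling statistics. A sketch of one way to avoid this, assuming synthetic data and hypothetical step names `scaler` and `clf`, is to put the scaler inside the estimator handed to `GridSearchCV`, so that it is refitted on the training part of each fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Scaler and classifier are cross-validated together as one estimator.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid_demo = {"clf__C": [0.01, 0.1, 1, 10]}

grid_demo = GridSearchCV(pipe, param_grid_demo, cv=5, scoring="f1_macro")
grid_demo.fit(X_demo, y_demo)

print(grid_demo.best_params_, grid_demo.best_score_)
```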

##### GS - DTC

In [47]:
from sklearn.tree import DecisionTreeClassifier

with mlflow.start_run(run_name="GS-DTC"):
    param_grid={'criterion': ["gini", "entropy"],
               'random_state': [42],
               'max_depth': [2, 3, 5, 7, 10],
               }
    clf = DecisionTreeClassifier()
    grid = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1, scoring="f1_macro")
    grid.fit(X_train, y_train)

    mlflow.log_metric("f1-score", grid.cv_results_["mean_test_score"][grid.best_index_])

##### GS - SVC

In [None]:
from sklearn.svm import SVC

with mlflow.start_run(run_name="GS-SVC"):
    param_grid = {'kernel': ['poly', 'rbf', 'sigmoid'],
                 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                 'gamma': [0.1, 1, 10, 100],
                 'random_state':[42]}
    clf = SVC()
    grid = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1, scoring="f1_macro")
    grid.fit(X_train, y_train)
    
    mlflow.log_metric("f1-score", grid.cv_results_["mean_test_score"][grid.best_index_])

##### GS - Naive Bayes

In [50]:
from sklearn.naive_bayes import BernoulliNB

with mlflow.start_run(run_name="GS - Naive Bayes"):
    param_grid={'alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
               'fit_prior': [False, True]}
    
    clf = BernoulliNB()
    grid = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1, scoring="f1_macro", verbose=2)
    grid.fit(X_train, y_train)
    
    mlflow.log_metric("f1-score", grid.cv_results_["mean_test_score"][grid.best_index_])

Fitting 5 folds for each of 14 candidates, totalling 70 fits


##### GS - Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

with mlflow.start_run(run_name="GS - RFC"):
    param_grid = {'n_estimators': [500, 1000],
                 'criterion': ['gini', 'entropy'],
                  'max_depth': [2,3,5,7,10],
                  'max_features': ['sqrt', 'log2'],  # 'auto' was an alias of 'sqrt' for classifiers
                  'bootstrap': [True, False]
                 }
    
    clf = RandomForestClassifier()
    grid = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1, scoring="f1_macro", verbose=2)
    grid.fit(X_train, y_train)
    
    mlflow.log_metric("f1-score", grid.cv_results_["mean_test_score"][grid.best_index_])

In [64]:
grid.best_estimator_

RandomForestClassifier(bootstrap=False, max_depth=10, max_features='sqrt',
                       n_estimators=500)

##### GS - XGBoost

In [25]:
from xgboost import XGBClassifier

with mlflow.start_run(run_name="GS - XGB"):
    param_grid = {'objective': ["binary:logistic"],
                  'learning_rate': [0.01, 0.1, 0.3],
                  'gamma': [0.1, 1, 10],
                  'max_depth': [2,3,5,7,10],
                  'alpha': [0.1, 1, 100],
                  'n_estimators': [500],
                  'random_state': [42],
                  'use_label_encoder': [False]
                 }
    
    clf = XGBClassifier()
    grid = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1, scoring="f1_macro", verbose=2)
    grid.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_val, y_val)], early_stopping_rounds=20)
    
    mlflow.log_metric("f1-score", grid.cv_results_["mean_test_score"][grid.best_index_])

Fitting 5 folds for each of 135 candidates, totalling 675 fits

KeyboardInterrupt: 

In [33]:
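# Note: SCORERS was removed in scikit-learn 1.3; on newer versions use
# sklearn.metrics.get_scorer_names() instead
from sklearn.metrics import SCORERS
SCORERS.keys()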

dict_keys(['explained_variance', 'r2', 'max_error', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'top_k_accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'adjusted_rand_score', 'rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_wei

In [None]:
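import xgboost as xgb
from sklearn.decomposition import PCA

model = xgb.XGBClassifier()
# Assumed pipeline: the step names 'pca' and 'model' match the
# 'pca__' / 'model__' prefixes used in param_grid
pipeline = Pipeline([('pca', PCA()), ('model', model)])

param_grid = {
    'pca__n_components': [5, 10, 15, 20, 25, 30],
    'model__max_depth': [2, 3, 5, 7, 10],
    'model__n_estimators': [10, 100, 500],
}

grid = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, scoring='roc_auc')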

#### Randomized Search

In [None]:
# This example is very similar to the previous one, and we should reach similar parameters in much less time
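
A minimal sketch of a randomized search, assuming synthetic data and a decision tree as the model. Instead of trying every combination, `RandomizedSearchCV` samples `n_iter` candidates from the given distributions, which is where the time saving comes from. The ranges below are illustrative choices, not tuned values.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Lists are sampled uniformly; randint(low, high) draws integers in [low, high).
param_distributions = {
    "criterion": ["gini", "entropy"],
    "max_depth": randint(2, 11),
    "min_samples_leaf": randint(1, 20),
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions,
    n_iter=10,        # number of sampled candidates, regardless of the size of the space
    cv=5,
    scoring="f1_macro",
    random_state=42,  # makes the sampled candidates reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```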

### Feature importance and explainability

In [None]:
import xgboost as xgb
variables_numericas = ['cons.price.idx', 'cons.conf.idx', 'age','duration']
variables_categoricas = ['education', 'marital','job', 'contact', 'day_of_week']

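# Select the chosen variables from the raw features: X_train was already transformed
# above, so the columns are taken from X using the training-set index
X_t = X.loc[X_train.index, variables_categoricas + variables_numericas]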

pipeline_numerico = Pipeline([
                             ('standard_scaler', StandardScaler()),
                            ])

pipeline_completo = ColumnTransformer([('num', pipeline_numerico, variables_numericas),
                                   ('cat', OneHotEncoder(), variables_categoricas),
                                  ])

pipeline_modelo = Pipeline([('preprocess', pipeline_completo),
                            ('xgb', xgb.XGBClassifier())])


In [None]:
pipeline_modelo.fit(X_t, y_train)

#### Getting the feature names

In [None]:
# One-hot encoding increases the number of features, so we need the new list of column names.
numeric_features = variables_numericas
cat_features = pipeline_modelo.named_steps['preprocess'].transformers_[1][1].get_feature_names_out(variables_categoricas)

#### Feature importance using XGBoost

In [None]:
onehot_columns = np.array(cat_features)
numeric_features_list = np.array(numeric_features)
numeric_features_list = np.append(numeric_features_list, onehot_columns)

In [None]:
# The feature importance values need to be sorted (argsort gives the order of the indices)
sorted_idx = pipeline_modelo[1].feature_importances_.argsort()
plt.barh(numeric_features_list[sorted_idx], pipeline_modelo[1].feature_importances_[sorted_idx])
plt.xlabel("Xgboost Feature Importance")
plt.show()

#### Feature importance using eli5

In [None]:
import eli5
# Using eli5 takes care of sorting the columns for us

In [None]:
onehot_columns = cat_features
features_list = list(numeric_features)
features_list.extend(onehot_columns)

In [None]:
eli5.explain_weights(pipeline_modelo[1], top=50, feature_names=features_list)

#### Using SHAP for feature importance and explainability of the predictions (optional)

Optionally, we can use the SHAP library, which provides an explanation for each individual prediction. These explanations can be aggregated to obtain a global feature importance for the model.


https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/tree_based_models/Census%20income%20classification%20with%20XGBoost.html