## <a id = '0'> Index </a>

* [**Environment**](#1)  
   * [Libraries](#1d1)  
   * [Functions](#1d2)  
   * [Constants](#1d3)

* [**Data loading**](#2)


## <a id = '1'> Environment </a>
[index](#0)

### <a id = '1d1'> Libraries </a>

In [1]:
import os
import pandas as pd
import numpy as np

from sklearn.preprocessing import  LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB, MultinomialNB
import xgboost as xgb
from xgboost import XGBClassifier

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, ParameterGrid

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, log_loss, classification_report, make_scorer, confusion_matrix
from scipy.stats import ks_2samp

import matplotlib.pyplot as plt

import math
import itertools

import joblib
# from config import data_folder

In [2]:
os.chdir("../")

In [16]:
MODEL_PATH = "output/models/V2/" 

### <a id = '1d2'> Functions </a>

In [4]:
from src.utils import *

## <a id = '2'> Data loading </a>
[index](#0)

In [5]:
train_data = pd.read_csv("output/features/transform/train_features_pca.csv")
test_data = pd.read_csv("output/features/transform/test_features_pca.csv")

In [6]:
# Columns excluded from modeling
skip_columns = ["patient", "label"]
le = LabelEncoder()

# Build the feature matrices and the target variable
X_train = train_data.drop(columns = skip_columns)
y_train = le.fit_transform(train_data["label"])

X_test = test_data.drop(columns = skip_columns)
# Reuse the mapping fitted on the training labels instead of refitting on the test set
y_test = le.transform(test_data["label"])
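The test labels are best encoded with the mapping learned from the training labels, so that each integer code refers to the same class in both splits. Below is a minimal sketch of the difference, using hypothetical label names that are not taken from this dataset. In this notebook both splits appear to contain all four classes, so the two approaches likely produce the same codes here, but that is not guaranteed in general.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, for illustration only (not the notebook's real classes)
train_labels = np.array(["healthy", "mild", "severe", "moderate", "mild"])
test_labels = np.array(["severe", "healthy", "severe"])

encoder = LabelEncoder()
y_train_demo = encoder.fit_transform(train_labels)  # learns the class -> code mapping
y_test_demo = encoder.transform(test_labels)        # reuses that mapping

# Refitting on the test labels builds a new mapping from the classes seen there
refit_codes = LabelEncoder().fit_transform(test_labels)

print(list(encoder.classes_))
print(list(y_test_demo), list(refit_codes))
```

When a class is missing from the test split, the refitted encoder assigns different codes to the same labels, which silently corrupts every metric computed afterwards.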


In [7]:
train_data.shape

(1728, 287)

In [8]:
recall_macro = make_scorer(recall_score, average='macro') # Every class carries the same weight
f1_score_macro = make_scorer(f1_score, average='macro') # Every class carries the same weight
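Macro averaging takes the unweighted mean of the per-class scores, whereas weighted averaging scales each class by its support. The toy labels below are made up to show the gap on an imbalanced sample. With perfectly balanced classes the two averages of recall coincide, and the F1 averages are typically very close.

```python
from sklearn.metrics import f1_score, make_scorer

# Made-up imbalanced labels: class 0 dominates
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0]

macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro")
weighted_f1 = f1_score(y_true_demo, y_pred_demo, average="weighted")

# The same metric wrapped as a scorer that GridSearchCV can consume
macro_scorer_demo = make_scorer(f1_score, average="macro")

print(round(macro_f1, 4), round(weighted_f1, 4))
```

The macro score is lower here because the two minority classes, which are predicted worse, count as much as the majority class.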

## Logistic Regression

In [9]:
# Estimator
lr = LogisticRegression(penalty='l2',
                        C=1e5,
                        solver='lbfgs',
                        multi_class='multinomial',
                        random_state = 42)

# Parameters
params_lr = {
    'penalty': ['l2'],
    'C': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10]
}

# Grid Search
grid_lr = GridSearchCV(estimator=lr,
                       param_grid=params_lr,
                       scoring=f1_score_macro,
                       cv=5,
                       verbose=1,
                       n_jobs=-1)

In [10]:
grid_lr.fit(X_train, y_train)

Fitting 5 folds for each of 7 candidates, totalling 35 fits


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
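The convergence warning above suggests raising `max_iter` or scaling the inputs. One possible way to do both, sketched here on synthetic data, is to put a `StandardScaler` and the classifier in a `Pipeline` so that scaling is refitted inside each cross-validation fold. The step names `scaler` and `clf` and the value `max_iter=1000` are assumptions for illustration. Whether scaling helps on PCA features like the ones used in this notebook would need to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for the real features
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=8, n_classes=4,
                                     random_state=42)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
grid_demo = GridSearchCV(pipe,
                         param_grid={"clf__C": [0.001, 0.01, 0.1, 1]},
                         scoring="f1_macro",
                         cv=5)
grid_demo.fit(X_demo, y_demo)

print(grid_demo.best_params_)
```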

In [11]:
# Predictions
y_pred_grid_lr = grid_lr.predict(X_test)
y_pred_prob_grid_lr = grid_lr.predict_proba(X_test)

In [12]:
grid_lr.best_params_

{'C': 0.001, 'penalty': 'l2'}

In [13]:
print(params_to_markdown(grid_lr.best_params_))

| parámetro   | valor   |
|:------------|:--------|
| C           | 0.001   |
| penalty     | l2      |


In [14]:
grid_lr.best_score_

0.542072175757129
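`best_score_` is the mean cross-validated score of the winning candidate only. To see how the other candidates compare, the `cv_results_` attribute can be loaded into a DataFrame. The sketch below uses synthetic data and a reduced grid, both chosen only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class data standing in for the real features
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=6, n_classes=4,
                                     random_state=0)

search_demo = GridSearchCV(LogisticRegression(max_iter=1000),
                           param_grid={"C": [0.01, 0.1, 1]},
                           scoring="f1_macro",
                           cv=5)
search_demo.fit(X_demo, y_demo)

# One row per candidate, best candidate first
columns = ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
cv_table = (pd.DataFrame(search_demo.cv_results_)[columns]
            .sort_values("rank_test_score"))

print(cv_table)
```

The standard deviation across folds gives a rough sense of whether the gap between two candidates is meaningful.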

In [15]:
grid_lr.best_estimator_

In [17]:
joblib.dump(grid_lr, MODEL_PATH + "LR/logistic_regression_model.pkl")

LR_preds = pd.DataFrame({
    "patient" : test_data["patient"],
    "label" : test_data["label"],
    "y_true" : y_test,
    "pred" : y_pred_grid_lr,
})

LR_probas = pd.DataFrame(y_pred_prob_grid_lr, columns=[f"proba_clase_{c}" for c in grid_lr.classes_])
LR_results = pd.concat([LR_preds, LR_probas], axis=1)
LR_results.to_csv(MODEL_PATH + "LR/predictions.csv", index=False)
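`joblib.dump` and `to_csv` write to a directory that must already exist. A minimal sketch of creating the folder first and checking that a reloaded model predicts the same values follows. The temporary directory, the file name `model.pkl`, and the toy data are placeholders, not the notebook's real paths.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary problem, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Placeholder folder layout, loosely mirroring MODEL_PATH + "LR/"
    model_dir = os.path.join(tmp_dir, "models", "LR")
    os.makedirs(model_dir, exist_ok=True)  # no error if it already exists

    model_file = os.path.join(model_dir, "model.pkl")
    joblib.dump(model, model_file)

    # Round trip: the restored model should reproduce the predictions
    restored = joblib.load(model_file)
    same_predictions = bool(np.array_equal(model.predict(X_demo),
                                           restored.predict(X_demo)))

print(same_predictions)
```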


In [18]:
table_metrics(y_test,y_pred_grid_lr, y_pred_prob_grid_lr)

|    | metric             |    value |
|---:|:-------------------|---------:|
|  0 | accuracy           | 0.527083 |
|  1 | precision_weighted | 0.528874 |
|  2 | recall_weighted    | 0.527083 |
|  3 | f1_weighted        | 0.527638 |
|  4 | roc_auc_ovr        | 0.790475 |
|  5 | log_loss           | 1.068772 |
|  6 | gini_normalized    | 0.580949 |
|  7 | ks_test_clase_0    | 0.408333 |
|  8 | ks_test_clase_1    | 0.525    |
|  9 | ks_test_clase_2    | 0.388889 |


In [19]:
confusion_matrix(y_test, y_pred_grid_lr)

array([[58, 20, 17, 25],
       [20, 68, 25,  7],
       [18, 17, 59, 26],
       [23,  6, 23, 68]])
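Per-class recall and precision can be read directly from this matrix: rows are true classes and columns are predicted classes, so recall divides the diagonal by the row sums and precision divides it by the column sums. A short sketch using the matrix printed above:

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm_demo = np.array([[58, 20, 17, 25],
                    [20, 68, 25,  7],
                    [18, 17, 59, 26],
                    [23,  6, 23, 68]])

recall_per_class = np.diag(cm_demo) / cm_demo.sum(axis=1)     # diagonal / row totals
precision_per_class = np.diag(cm_demo) / cm_demo.sum(axis=0)  # diagonal / column totals
accuracy_demo = np.trace(cm_demo) / cm_demo.sum()

print(recall_per_class.round(4))
print(precision_per_class.round(4))
print(round(accuracy_demo, 6))
```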

In [20]:
print(classification_report(y_test, y_pred_grid_lr))

              precision    recall  f1-score   support

           0       0.49      0.48      0.49       120
           1       0.61      0.57      0.59       120
           2       0.48      0.49      0.48       120
           3       0.54      0.57      0.55       120

    accuracy                           0.53       480
   macro avg       0.53      0.53      0.53       480
weighted avg       0.53      0.53      0.53       480



In [21]:
genera_metricas_markdown(y_test,y_pred_grid_lr, y_pred_prob_grid_lr)

| metric             |    value |
|:-------------------|---------:|
| accuracy           | 0.527083 |
| precision_weighted | 0.528874 |
| recall_weighted    | 0.527083 |
| f1_weighted        | 0.527638 |
| roc_auc_ovr        | 0.790475 |
| log_loss           | 1.06877  |
| gini_normalized    | 0.580949 |
| ks_test_clase_0    | 0.408333 |
| ks_test_clase_1    | 0.525    |
| ks_test_clase_2    | 0.388889 |
| ks_test_clase_3    | 0.494444 |


|              |   precision |   recall |   f1-score |    support |
|:-------------|------------:|---------:|-----------:|-----------:|
| 0            |    0.487395 | 0.483333 |   0.485356 | 120        |
| 1            |    0.612613 | 0.566667 |   0.588745 | 120        |
| 2            |    0.475806 | 0.491667 |   0.483607 | 120        |
| 3            |    0.539683 | 0.566667 |   0.552846 | 120        |
| accuracy     |    0.527083 | 0.527083 |   0.527083 |   0.527083 |
| macro avg    |    0.528874 | 0.527083 |   0.527638 | 480        |
| weighted a
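The source of `table_metrics` and `genera_metricas_markdown` lives in `src.utils` and is not shown here. The reported values are consistent with a normalized Gini of `2 * AUC - 1` (2 × 0.790475 − 1 ≈ 0.58095), and the per-class KS values may come from a one-vs-rest two-sample test on the predicted probabilities. The helper below, `gini_and_ks`, is a hypothetical reconstruction under those assumptions, run on synthetic probabilities, and it assumes class labels are the integers `0..K-1`.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import roc_auc_score


def gini_and_ks(y_true, y_proba):
    """Hypothetical helper: OvR AUC, Gini = 2*AUC - 1, and one KS statistic per class."""
    y_true = np.asarray(y_true)
    auc = roc_auc_score(y_true, y_proba, multi_class="ovr")
    gini = 2 * auc - 1
    ks_by_class = {}
    for k in range(y_proba.shape[1]):
        in_class = y_proba[y_true == k, k]   # P(class k) for samples that are class k
        out_class = y_proba[y_true != k, k]  # P(class k) for all other samples
        ks_by_class[k] = ks_2samp(in_class, out_class).statistic
    return auc, gini, ks_by_class


# Synthetic 3-class probabilities with some signal, for illustration only
rng = np.random.default_rng(42)
y_demo = rng.integers(0, 3, size=300)
logits = rng.normal(size=(300, 3))
logits[np.arange(300), y_demo] += 1.5
proba_demo = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

auc_demo, gini_demo, ks_demo = gini_and_ks(y_demo, proba_demo)
print(round(auc_demo, 4), round(gini_demo, 4), ks_demo)
```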

In [22]:
cm_lr = confusion_matrix(y_test, y_pred_grid_lr)
df_cm_lr = pd.DataFrame(cm_lr,
                         index = [f"Real {label}" for label in grid_lr.classes_],
                         columns = [f"Pred {label}" for label in grid_lr.classes_])
print(df_cm_lr.to_markdown())


|        |   Pred 0 |   Pred 1 |   Pred 2 |   Pred 3 |
|:-------|---------:|---------:|---------:|---------:|
| Real 0 |       58 |       20 |       17 |       25 |
| Real 1 |       20 |       68 |       25 |        7 |
| Real 2 |       18 |       17 |       59 |       26 |
| Real 3 |       23 |        6 |       23 |       68 |


## Random Forest Classifier

In [23]:
# Estimator
rfc = RandomForestClassifier(random_state = 42,
                             n_jobs = -1,
                             bootstrap = True)

# Parameters
params_rfc = {'n_estimators': [100, 350, 500],
              'max_features': ['log2', 'sqrt'],
              'max_depth': [5, 10, 20],
              'min_samples_split': [2, 10, 30],
              'min_samples_leaf': [2, 10, 30]}

# Grid Search
grid_rfc = GridSearchCV(estimator=rfc,
                       param_grid=params_rfc,
                       scoring=f1_score_macro,
                       cv=5,
                       verbose=1,
                       n_jobs=-1)

In [24]:
grid_rfc.fit(X_train, y_train)

Fitting 5 folds for each of 162 candidates, totalling 810 fits
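The number of candidates can be checked before launching a search with `ParameterGrid`, which enumerates the Cartesian product of the parameter lists. A small sketch with the same grid values as above:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as params_rfc above
params_rfc_demo = {'n_estimators': [100, 350, 500],
                   'max_features': ['log2', 'sqrt'],
                   'max_depth': [5, 10, 20],
                   'min_samples_split': [2, 10, 30],
                   'min_samples_leaf': [2, 10, 30]}

n_candidates = len(ParameterGrid(params_rfc_demo))  # 3 * 2 * 3 * 3 * 3
n_fits = n_candidates * 5                           # one fit per candidate per fold (cv=5)

print(n_candidates, n_fits)
```

The final refit on the full training set adds one more fit that the log line does not count.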


In [25]:
grid_rfc.best_params_

{'max_depth': 10,
 'max_features': 'sqrt',
 'min_samples_leaf': 2,
 'min_samples_split': 2,
 'n_estimators': 500}

In [26]:
print(params_to_markdown(grid_rfc.best_params_))

| parámetro         | valor   |
|:------------------|:--------|
| max_depth         | 10      |
| max_features      | sqrt    |
| min_samples_leaf  | 2       |
| min_samples_split | 2       |
| n_estimators      | 500     |


In [27]:
grid_rfc.best_score_

0.5049454564247575

In [28]:
grid_rfc.best_estimator_

In [29]:
# Predictions
y_pred_grid_rfc = grid_rfc.predict(X_test)
y_pred_prob_grid_rfc = grid_rfc.predict_proba(X_test)

In [30]:
joblib.dump(grid_rfc, MODEL_PATH + "RFC/random_forest_model.pkl")

RFC_preds = pd.DataFrame({
    "patient" : test_data["patient"],
    "label" : test_data["label"],
    "y_true" : y_test,
    "pred" : y_pred_grid_rfc,
})

RFC_probas = pd.DataFrame(y_pred_prob_grid_rfc, columns=[f"proba_clase_{c}" for c in grid_rfc.classes_])
RFC_results = pd.concat([RFC_preds, RFC_probas], axis=1)
RFC_results.to_csv(MODEL_PATH + "RFC/predictions.csv", index=False)


In [31]:
table_metrics(y_test,y_pred_grid_rfc, y_pred_prob_grid_rfc)

|    | metric             |    value |
|---:|:-------------------|---------:|
|  0 | accuracy           | 0.483333 |
|  1 | precision_weighted | 0.481121 |
|  2 | recall_weighted    | 0.483333 |
|  3 | f1_weighted        | 0.48103  |
|  4 | roc_auc_ovr        | 0.751476 |
|  5 | log_loss           | 1.241261 |
|  6 | gini_normalized    | 0.502951 |
|  7 | ks_test_clase_0    | 0.302778 |
|  8 | ks_test_clase_1    | 0.522222 |
|  9 | ks_test_clase_2    | 0.327778 |


In [32]:
confusion_matrix(y_test, y_pred_grid_rfc)

array([[48, 22, 20, 30],
       [23, 70, 21,  6],
       [25, 21, 45, 29],
       [27,  9, 15, 69]])

In [33]:
print(classification_report(y_test, y_pred_grid_rfc))

              precision    recall  f1-score   support

           0       0.39      0.40      0.40       120
           1       0.57      0.58      0.58       120
           2       0.45      0.38      0.41       120
           3       0.51      0.57      0.54       120

    accuracy                           0.48       480
   macro avg       0.48      0.48      0.48       480
weighted avg       0.48      0.48      0.48       480



In [34]:
genera_metricas_markdown(y_test,y_pred_grid_rfc, y_pred_prob_grid_rfc)

| metric             |    value |
|:-------------------|---------:|
| accuracy           | 0.483333 |
| precision_weighted | 0.481121 |
| recall_weighted    | 0.483333 |
| f1_weighted        | 0.48103  |
| roc_auc_ovr        | 0.751476 |
| log_loss           | 1.24126  |
| gini_normalized    | 0.502951 |
| ks_test_clase_0    | 0.302778 |
| ks_test_clase_1    | 0.522222 |
| ks_test_clase_2    | 0.327778 |
| ks_test_clase_3    | 0.45     |


|              |   precision |   recall |   f1-score |    support |
|:-------------|------------:|---------:|-----------:|-----------:|
| 0            |    0.390244 | 0.4      |   0.395062 | 120        |
| 1            |    0.57377  | 0.583333 |   0.578512 | 120        |
| 2            |    0.445545 | 0.375    |   0.40724  | 120        |
| 3            |    0.514925 | 0.575    |   0.543307 | 120        |
| accuracy     |    0.483333 | 0.483333 |   0.483333 |   0.483333 |
| macro avg    |    0.481121 | 0.483333 |   0.48103  | 480        |
| weighted a

In [35]:
cm_rfc = confusion_matrix(y_test, y_pred_grid_rfc)
df_cm_rfc = pd.DataFrame(cm_rfc,
                         index = [f"Real {label}" for label in grid_rfc.classes_],
                         columns = [f"Pred {label}" for label in grid_rfc.classes_])
print(df_cm_rfc.to_markdown())


|        |   Pred 0 |   Pred 1 |   Pred 2 |   Pred 3 |
|:-------|---------:|---------:|---------:|---------:|
| Real 0 |       48 |       22 |       20 |       30 |
| Real 1 |       23 |       70 |       21 |        6 |
| Real 2 |       25 |       21 |       45 |       29 |
| Real 3 |       27 |        9 |       15 |       69 |


## Gradient Boosting Classifier

In [36]:
# Estimator
gb = GradientBoostingClassifier(learning_rate=0.05,
                                subsample=0.5,
                                max_depth=6,
                                n_estimators=10,
                                random_state=42,
                                )

# Parameters
params_gb = {'n_estimators': [1,10,100], 
             'learning_rate' : [0.01,0.05,0.1],
             'subsample' : [0.1,0.5,1.0], 
             'max_depth': [5,10,20],
             'min_samples_split': [2, 10, 30],
             'min_samples_leaf': [2, 10, 30],
             'max_features': ['log2', 'sqrt']}

#Grid Search
grid_gb = GridSearchCV(estimator=gb,
                       param_grid=params_gb,
                       scoring=f1_score_macro,
                       cv = 5,
                       verbose=1,
                       n_jobs=-1)

In [37]:
# Training
grid_gb.fit(X_train, y_train)

Fitting 5 folds for each of 1458 candidates, totalling 7290 fits
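An exhaustive search over 1458 candidates (7290 fits) can take a long time. One alternative, which is not what this notebook does, is `RandomizedSearchCV`: it samples a fixed number of candidates from the same parameter lists. The sketch below runs on synthetic data, and the reduced values, `n_iter=5`, and `cv=3` are chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 4-class data standing in for the real features
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=6, n_classes=4,
                                     random_state=42)

# Lists are sampled uniformly; values are reduced here for speed
param_distributions = {'n_estimators': [5, 10],
                       'learning_rate': [0.01, 0.05, 0.1],
                       'subsample': [0.5, 1.0],
                       'max_depth': [2, 3],
                       'min_samples_split': [2, 10],
                       'min_samples_leaf': [2, 10],
                       'max_features': ['log2', 'sqrt']}

random_search = RandomizedSearchCV(GradientBoostingClassifier(random_state=42),
                                   param_distributions=param_distributions,
                                   n_iter=5,          # number of sampled candidates
                                   scoring="f1_macro",
                                   cv=3,
                                   random_state=42)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
```

The trade-off is that the best combination in the full grid may never be sampled, so `n_iter` sets the balance between runtime and coverage.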


In [38]:
grid_gb.best_params_

{'learning_rate': 0.1,
 'max_depth': 5,
 'max_features': 'sqrt',
 'min_samples_leaf': 2,
 'min_samples_split': 30,
 'n_estimators': 100,
 'subsample': 1.0}

In [39]:
print(params_to_markdown(grid_gb.best_params_))

| parámetro         | valor   |
|:------------------|:--------|
| learning_rate     | 0.1     |
| max_depth         | 5       |
| max_features      | sqrt    |
| min_samples_leaf  | 2       |
| min_samples_split | 30      |
| n_estimators      | 100     |
| subsample         | 1.0     |


In [40]:
grid_gb.best_score_

0.5046356149324713

In [41]:
grid_gb.best_estimator_

In [42]:
# Predictions
y_pred_grid_gb = grid_gb.predict(X_test)
y_pred_prob_grid_gb = grid_gb.predict_proba(X_test)

In [43]:
joblib.dump(grid_gb, MODEL_PATH + "GB/gradient_boosting_model.pkl")

GB_preds = pd.DataFrame({
    "patient" : test_data["patient"],
    "label" : test_data["label"],
    "y_true" : y_test,
    "pred" : y_pred_grid_gb,
})

GB_probas = pd.DataFrame(y_pred_prob_grid_gb, columns=[f"proba_clase_{c}" for c in grid_gb.classes_])
GB_results = pd.concat([GB_preds, GB_probas], axis=1)
GB_results.to_csv(MODEL_PATH + "GB/predictions.csv", index=False)


In [44]:
table_metrics(y_test,y_pred_grid_gb, y_pred_prob_grid_gb)

|    | metric             |    value |
|---:|:-------------------|---------:|
|  0 | accuracy           | 0.525    |
|  1 | precision_weighted | 0.530189 |
|  2 | recall_weighted    | 0.525    |
|  3 | f1_weighted        | 0.526186 |
|  4 | roc_auc_ovr        | 0.75559  |
|  5 | log_loss           | 1.152912 |
|  6 | gini_normalized    | 0.511181 |
|  7 | ks_test_clase_0    | 0.319444 |
|  8 | ks_test_clase_1    | 0.491667 |
|  9 | ks_test_clase_2    | 0.319444 |


In [45]:
confusion_matrix(y_test, y_pred_grid_gb)

array([[63, 20, 16, 21],
       [24, 64, 26,  6],
       [28, 16, 54, 22],
       [25,  5, 19, 71]])

In [46]:
print(classification_report(y_test, y_pred_grid_gb))

              precision    recall  f1-score   support

           0       0.45      0.53      0.48       120
           1       0.61      0.53      0.57       120
           2       0.47      0.45      0.46       120
           3       0.59      0.59      0.59       120

    accuracy                           0.53       480
   macro avg       0.53      0.53      0.53       480
weighted avg       0.53      0.53      0.53       480



In [47]:
genera_metricas_markdown(y_test,y_pred_grid_gb, y_pred_prob_grid_gb)

| metric             |    value |
|:-------------------|---------:|
| accuracy           | 0.525    |
| precision_weighted | 0.530189 |
| recall_weighted    | 0.525    |
| f1_weighted        | 0.526186 |
| roc_auc_ovr        | 0.75559  |
| log_loss           | 1.15291  |
| gini_normalized    | 0.511181 |
| ks_test_clase_0    | 0.319444 |
| ks_test_clase_1    | 0.491667 |
| ks_test_clase_2    | 0.319444 |
| ks_test_clase_3    | 0.483333 |


|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0            |    0.45     | 0.525    |   0.484615 |   120     |
| 1            |    0.609524 | 0.533333 |   0.568889 |   120     |
| 2            |    0.469565 | 0.45     |   0.459574 |   120     |
| 3            |    0.591667 | 0.591667 |   0.591667 |   120     |
| accuracy     |    0.525    | 0.525    |   0.525    |     0.525 |
| macro avg    |    0.530189 | 0.525    |   0.526186 |   480     |
| weighted avg |    

In [48]:
cm_gb = confusion_matrix(y_test, y_pred_grid_gb)
df_cm_gb = pd.DataFrame(cm_gb,
                         index = [f"Real {label}" for label in grid_gb.classes_],
                         columns = [f"Pred {label}" for label in grid_gb.classes_])
print(df_cm_gb.to_markdown())

|        |   Pred 0 |   Pred 1 |   Pred 2 |   Pred 3 |
|:-------|---------:|---------:|---------:|---------:|
| Real 0 |       63 |       20 |       16 |       21 |
| Real 1 |       24 |       64 |       26 |        6 |
| Real 2 |       28 |       16 |       54 |       22 |
| Real 3 |       25 |        5 |       19 |       71 |


## Naive Bayes

In [49]:
gnb = GaussianNB()

params_gnb = {
    'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
}
#Grid Search
grid_gnb = GridSearchCV(estimator=gnb,
                       param_grid=params_gnb,
                       scoring=f1_score_macro,
                       cv=5,
                       verbose=1,
                       n_jobs=-1)

In [50]:
# Training
grid_gnb.fit(X_train, y_train)

Fitting 5 folds for each of 7 candidates, totalling 35 fits


In [51]:
grid_gnb.best_params_

{'var_smoothing': 1e-09}

In [52]:
print(params_to_markdown(grid_gnb.best_params_))

| parámetro     |   valor |
|:--------------|--------:|
| var_smoothing |   1e-09 |


In [53]:
grid_gnb.best_score_

0.3286336224760228

In [54]:
# Predictions
y_pred_grid_gnb = grid_gnb.predict(X_test)
y_pred_prob_grid_gnb = grid_gnb.predict_proba(X_test)

In [55]:
joblib.dump(grid_gnb, MODEL_PATH + "GNB/Gaussian_Naive_Bayes_model.pkl")

GNB_preds = pd.DataFrame({
    "patient" : test_data["patient"],
    "label" : test_data["label"],
    "y_true" : y_test,
    "pred" : y_pred_grid_gnb,
})

GNB_probas = pd.DataFrame(y_pred_prob_grid_gnb, columns=[f"proba_clase_{c}" for c in grid_gnb.classes_])
GNB_results = pd.concat([GNB_preds, GNB_probas], axis=1)
GNB_results.to_csv(MODEL_PATH + "GNB/predictions.csv", index=False)


In [56]:
table_metrics(y_test,y_pred_grid_gnb, y_pred_prob_grid_gnb)

|    | metric             |    value |
|---:|:-------------------|---------:|
|  0 | accuracy           | 0.335417 |
|  1 | precision_weighted | 0.374984 |
|  2 | recall_weighted    | 0.335417 |
|  3 | f1_weighted        | 0.287205 |
|  4 | roc_auc_ovr        | 0.616916 |
|  5 | log_loss           | 6.404771 |
|  6 | gini_normalized    | 0.233831 |
|  7 | ks_test_clase_0    | 0.186111 |
|  8 | ks_test_clase_1    | 0.266667 |
|  9 | ks_test_clase_2    | 0.163889 |


In [57]:
confusion_matrix(y_test, y_pred_grid_gnb)

array([[47, 65,  5,  3],
       [25, 88,  6,  1],
       [32, 65, 12, 11],
       [35, 61, 10, 14]])

In [58]:
print(classification_report(y_test, y_pred_grid_gnb))

              precision    recall  f1-score   support

           0       0.34      0.39      0.36       120
           1       0.32      0.73      0.44       120
           2       0.36      0.10      0.16       120
           3       0.48      0.12      0.19       120

    accuracy                           0.34       480
   macro avg       0.37      0.34      0.29       480
weighted avg       0.37      0.34      0.29       480



In [59]:
genera_metricas_markdown(y_test,y_pred_grid_gnb, y_pred_prob_grid_gnb)

| metric             |    value |
|:-------------------|---------:|
| accuracy           | 0.335417 |
| precision_weighted | 0.374984 |
| recall_weighted    | 0.335417 |
| f1_weighted        | 0.287205 |
| roc_auc_ovr        | 0.616916 |
| log_loss           | 6.40477  |
| gini_normalized    | 0.233831 |
| ks_test_clase_0    | 0.186111 |
| ks_test_clase_1    | 0.266667 |
| ks_test_clase_2    | 0.163889 |
| ks_test_clase_3    | 0.216667 |


|              |   precision |   recall |   f1-score |    support |
|:-------------|------------:|---------:|-----------:|-----------:|
| 0            |    0.338129 | 0.391667 |   0.362934 | 120        |
| 1            |    0.315412 | 0.733333 |   0.441103 | 120        |
| 2            |    0.363636 | 0.1      |   0.156863 | 120        |
| 3            |    0.482759 | 0.116667 |   0.187919 | 120        |
| accuracy     |    0.335417 | 0.335417 |   0.335417 |   0.335417 |
| macro avg    |    0.374984 | 0.335417 |   0.287205 | 480        |
| weighted a

In [60]:
cm_gnb = confusion_matrix(y_test, y_pred_grid_gnb)
df_cm_gnb = pd.DataFrame(cm_gnb,
                         index = [f"Real {label}" for label in grid_gnb.classes_],
                         columns = [f"Pred {label}" for label in grid_gnb.classes_])
print(df_cm_gnb.to_markdown())

|        |   Pred 0 |   Pred 1 |   Pred 2 |   Pred 3 |
|:-------|---------:|---------:|---------:|---------:|
| Real 0 |       47 |       65 |        5 |        3 |
| Real 1 |       25 |       88 |        6 |        1 |
| Real 2 |       32 |       65 |       12 |       11 |
| Real 3 |       35 |       61 |       10 |       14 |


## XGB Classifier

In [67]:
xgb_base = xgb.XGBClassifier(
    learning_rate=0.1,
    n_estimators=5000,
    eval_metric="mlogloss",
    objective="multi:softprob",
    # early_stopping_rounds=50,
    
)

xgb_base.fit(X_train, y_train,
             eval_set=[(X_test, y_test)],             
             verbose = 1)

# xgb_n_estimator = xgb_base.best_iteration

[0]	validation_0-mlogloss:1.35553
[1]	validation_0-mlogloss:1.33309
[2]	validation_0-mlogloss:1.31500
[3]	validation_0-mlogloss:1.30022
[4]	validation_0-mlogloss:1.28894
[5]	validation_0-mlogloss:1.27549
[6]	validation_0-mlogloss:1.26431
[7]	validation_0-mlogloss:1.25507
[8]	validation_0-mlogloss:1.24945
[9]	validation_0-mlogloss:1.23887
[10]	validation_0-mlogloss:1.23391
[11]	validation_0-mlogloss:1.22665
[12]	validation_0-mlogloss:1.22010
[13]	validation_0-mlogloss:1.21551
[14]	validation_0-mlogloss:1.21124
[15]	validation_0-mlogloss:1.20745
[16]	validation_0-mlogloss:1.20262
[17]	validation_0-mlogloss:1.20002
[18]	validation_0-mlogloss:1.19812
[19]	validation_0-mlogloss:1.19490
[20]	validation_0-mlogloss:1.19297
[21]	validation_0-mlogloss:1.19017
[22]	validation_0-mlogloss:1.19078
[23]	validation_0-mlogloss:1.18647
[24]	validation_0-mlogloss:1.18508
[25]	validation_0-mlogloss:1.18403
[26]	validation_0-mlogloss:1.18082
[27]	validation_0-mlogloss:1.17709
[28]	validation_0-mlogloss:1.1
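The log above tracks the loss on the test set. If early stopping were enabled (it is commented out in the cell), choosing the number of boosting rounds from that curve would tune the model on the test data. A common alternative is to carve a validation split out of the training data and pick the round count there. The sketch below illustrates the idea with scikit-learn's `GradientBoostingClassifier` as a stand-in, so it runs without extra dependencies. It is not the XGBoost configuration used in this notebook, and the data is synthetic. With XGBoost, the analogous step would be to pass the validation split as `eval_set` instead of the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the training features
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=8, n_classes=4,
                                     random_state=42)

# Hold out part of the training data; the real test set stays untouched
X_fit, X_val, y_fit, y_val = train_test_split(X_demo, y_demo,
                                              test_size=0.25,
                                              stratify=y_demo,
                                              random_state=42)

booster = GradientBoostingClassifier(learning_rate=0.1, n_estimators=60,
                                     max_depth=3, random_state=42)
booster.fit(X_fit, y_fit)

# Validation log loss after each boosting round
val_losses = [log_loss(y_val, proba, labels=booster.classes_)
              for proba in booster.staged_predict_proba(X_val)]

# Round with the lowest validation loss (1-based count of trees)
best_n_estimators = int(np.argmin(val_losses)) + 1
print(best_n_estimators, round(min(val_losses), 4))
```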

In [68]:
# Predictions
y_pred_grid_xgb_b = xgb_base.predict(X_test)
y_pred_prob_grid_xgb_b = xgb_base.predict_proba(X_test)

In [69]:
table_metrics(y_test,y_pred_grid_xgb_b, y_pred_prob_grid_xgb_b)

|    | metric             |    value |
|---:|:-------------------|---------:|
|  0 | accuracy           | 0.514583 |
|  1 | precision_weighted | 0.516245 |
|  2 | recall_weighted    | 0.514583 |
|  3 | f1_weighted        | 0.515084 |
|  4 | roc_auc_ovr        | 0.762303 |
|  5 | log_loss           | 1.477503 |
|  6 | gini_normalized    | 0.524606 |
|  7 | ks_test_clase_0    | 0.333333 |
|  8 | ks_test_clase_1    | 0.494444 |
|  9 | ks_test_clase_2    | 0.380556 |


In [70]:
joblib.dump(xgb_base, MODEL_PATH + "XGB/XGB_base_model.pkl")

XGB_base_preds = pd.DataFrame({
    "patient" : test_data["patient"],
    "label" : test_data["label"],
    "y_true" : y_test,
    "pred" : y_pred_grid_xgb_b,
})

XGB_base_probas = pd.DataFrame(y_pred_prob_grid_xgb_b, columns=[f"proba_clase_{c}" for c in xgb_base.classes_])
XGB_base_results = pd.concat([XGB_base_preds, XGB_base_probas], axis=1)
XGB_base_results.to_csv(MODEL_PATH + "XGB/predictions_base.csv", index=False)


In [71]:
genera_metricas_markdown(y_test,y_pred_grid_xgb_b, y_pred_prob_grid_xgb_b)

| metric             |    value |
|:-------------------|---------:|
| accuracy           | 0.514583 |
| precision_weighted | 0.516245 |
| recall_weighted    | 0.514583 |
| f1_weighted        | 0.515084 |
| roc_auc_ovr        | 0.762303 |
| log_loss           | 1.4775   |
| gini_normalized    | 0.524606 |
| ks_test_clase_0    | 0.333333 |
| ks_test_clase_1    | 0.494444 |
| ks_test_clase_2    | 0.380556 |
| ks_test_clase_3    | 0.505556 |


|              |   precision |   recall |   f1-score |    support |
|:-------------|------------:|---------:|-----------:|-----------:|
| 0            |    0.445378 | 0.441667 |   0.443515 | 120        |
| 1            |    0.603604 | 0.558333 |   0.580087 | 120        |
| 2            |    0.48     | 0.5      |   0.489796 | 120        |
| 3            |    0.536    | 0.558333 |   0.546939 | 120        |
| accuracy     |    0.514583 | 0.514583 |   0.514583 |   0.514583 |
| macro avg    |    0.516245 | 0.514583 |   0.515084 | 480        |
| weighted a

In [72]:
cm_xgb_base = confusion_matrix(y_test, y_pred_grid_xgb_b)
df_cm_xgb_base = pd.DataFrame(cm_xgb_base,
                         index = [f"Real {label}" for label in xgb_base.classes_],
                         columns = [f"Pred {label}" for label in xgb_base.classes_])
print(df_cm_xgb_base.to_markdown())

|        |   Pred 0 |   Pred 1 |   Pred 2 |   Pred 3 |
|:-------|---------:|---------:|---------:|---------:|
| Real 0 |       53 |       22 |       14 |       31 |
| Real 1 |       22 |       67 |       26 |        5 |
| Real 2 |       22 |       16 |       60 |       22 |
| Real 3 |       22 |        6 |       25 |       67 |


In [73]:
print(params_to_markdown(xgb_base.get_params()))

| parámetro          | valor          |
|:-------------------|:---------------|
| objective          | multi:softprob |
| enable_categorical | False          |
| eval_metric        | mlogloss       |
| learning_rate      | 0.1            |
| missing            | nan            |
| n_estimators       | 5000           |


In [74]:
# Convert the data to DMatrix format for model training
xgb_dmatrix = xgb.DMatrix(data=X_train, label=y_train)
xgb_dmatrix_test = xgb.DMatrix(data=X_test, label=y_test)

# Parameters
params_xgb = {'learning_rate': [0.01, 0.1, 0.5],
              'n_estimators' : [100, 350, 500],
              'subsample' : [0.3, 0.5, 0.9 ],
              'max_depth': [5, 10, 20],
              'min_child_weight': [1, 5, 10],
              'colsample_bytree': [0.3, 0.5, 0.9],
}

# Estimator (named xgb_clf so the imported xgb module is not shadowed)
xgb_clf = XGBClassifier()

# Grid Search
grid_xgb = GridSearchCV(estimator=xgb_clf,
                       param_grid=params_xgb,
                       scoring=recall_macro,
                       cv=5,
                       verbose=1,
                       n_jobs=-1)

In [None]:
grid_xgb.fit(X_train, y_train)

Fitting 5 folds for each of 729 candidates, totalling 3645 fits




KeyboardInterrupt: 