## <a id = '0'> Index </a>

* [**Environment**](#1)  
   * [Libraries](#1d1)  
   * [Functions](#1d2)  
   * [Constants](#1d3)

* [**Data loading**](#2)


## <a id = '1'> Environment </a>
[index](#0)

### <a id = '1d1'> Libraries </a>

In [1]:
import os
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
import xgboost as xgb

from sklearn.model_selection import GridSearchCV

from sklearn.metrics import recall_score, f1_score, classification_report, make_scorer, confusion_matrix

import joblib
# from config import data_folder

In [2]:
os.chdir("../")

In [3]:
MODEL_PATH = "output/models/V3/" 

In [4]:
folders = ["LR", "GB", "GNB", "RFC", "XGB"]

# Create the folders inside MODEL_PATH
for folder in folders:
    os.makedirs(os.path.join(MODEL_PATH, folder), exist_ok=True)

### <a id = '1d2'> Functions </a>

In [5]:
from src.utils import table_metrics, params_to_markdown, genera_metricas_markdown, get_metrics_mode
from src.TicToc import TicToc
tt = TicToc()

## <a id = '2'> Data loading </a>
[index](#0)

In [6]:
train_data = pd.read_csv("output/chunk_data/chunk_100/pre_model/train_data.csv")
test_data = pd.read_csv("output/chunk_data/chunk_100/pre_model/test_data.csv")
val_data = pd.read_csv("output/chunk_data/chunk_100/pre_model/val_data.csv")

In [7]:
# Columns that are not used for modelling
skip_columns = ["patient_id", "label", "chunk"]
le = LabelEncoder()

# Build the features and the target variable.
# The encoder is fit on the training labels only and reused for the other
# splits, so the class-to-integer mapping is identical everywhere.
X_train = train_data.drop(columns=skip_columns)
y_train = le.fit_transform(train_data["label"])
# .copy() avoids pandas' SettingWithCopyWarning when the column is added
train_preds = train_data[["patient_id", "label"]].copy()
train_preds["y_true"] = y_train

X_test = test_data.drop(columns=skip_columns)
y_test = le.transform(test_data["label"])
test_preds = test_data[["patient_id", "label"]].copy()
test_preds["y_true"] = y_test

X_val = val_data.drop(columns=skip_columns)
y_val = le.transform(val_data["label"])
val_preds = val_data[["patient_id", "label"]].copy()
val_preds["y_true"] = y_val
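A label encoder is normally fit once, on the training labels, and then only applied to the other splits. Refitting it per split gives the same mapping only as long as every split contains exactly the same set of classes. The sketch below illustrates the fit-once pattern; the label names are invented for the example and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, used only to illustrate the pattern
train_labels = np.array(["mild", "severe", "healthy", "moderate"])
test_labels = np.array(["severe", "healthy", "healthy"])

le = LabelEncoder()
y_train = le.fit_transform(train_labels)  # learns the (sorted) classes
y_test = le.transform(test_labels)        # reuses the learned mapping

# Inspect the class-to-integer mapping learned from the training labels
mapping = {cls: int(code) for cls, code in zip(le.classes_, le.transform(le.classes_))}
print(mapping)
print(y_train, y_test)
```

`le.transform` raises a `ValueError` if a split contains a label that was not seen during fitting, which is usually the safer behaviour.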


In [8]:
recall_macro = make_scorer(recall_score, average='macro')  # every class has the same weight
f1_score_macro = make_scorer(f1_score, average='macro')  # every class has the same weight
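To make the effect of `average='macro'` concrete, the sketch below compares macro and weighted recall on a small, deliberately imbalanced toy example. The labels are made up for illustration.

```python
from sklearn.metrics import recall_score

# Toy labels: class 0 dominates, classes 1 and 2 are rare
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

# Per-class recall is 1.0 (class 0), 0.5 (class 1) and 0.0 (class 2)
macro = recall_score(y_true, y_pred, average="macro")        # plain mean of the three
weighted = recall_score(y_true, y_pred, average="weighted")  # mean weighted by support

print(f"macro={macro:.2f}  weighted={weighted:.2f}")
```

The macro average penalises the missed minority classes more strongly than the weighted average. On perfectly balanced classes the two coincide.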

## Logistic Regression

In [9]:
# Estimator (the C set here is overridden by the grid below)
lr = LogisticRegression(penalty='l2',
                        C=1e5,
                        solver='lbfgs',
                        random_state=42)

# Parameters
params_lr = {
    'penalty': ['l2'],
    'C': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10]
}

# Grid search (scored on accuracy; the other models below use recall_macro)
grid_lr = GridSearchCV(estimator=lr,
                       param_grid=params_lr,
                       scoring='accuracy',
                       cv=5,
                       verbose=1,
                       n_jobs=-1)

In [10]:
tt.tic()
grid_lr.fit(X_train, y_train)
tt.toc()

Fitting 5 folds for each of 7 candidates, totalling 35 fits


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Elapsed time: 24.092620 seconds


24.092620134353638
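The solver output above names two remedies: scaling the features or raising `max_iter`. The following is a minimal sketch of both, assuming synthetic data from `make_classification` in place of the real features. The step names `scaler` and `clf`, the `max_iter` value and the reduced grid are illustrative choices, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for the real feature matrix
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=4, n_classes=4,
                                     random_state=42)

# Scaling inside the pipeline is refit on each CV training fold,
# so no information leaks from the held-out fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

# Parameters of a pipeline step are addressed as <step>__<param>
param_grid = {"clf__C": [0.001, 0.01, 0.1, 1, 10]}

grid_demo = GridSearchCV(pipe, param_grid=param_grid, scoring="accuracy", cv=5)
grid_demo.fit(X_demo, y_demo)
print(grid_demo.best_params_)
```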

In [11]:
# Prediction
y_pred_grid_lr = grid_lr.best_estimator_.predict(X_test)
y_pred_prob_grid_lr = grid_lr.best_estimator_.predict_proba(X_test)

In [12]:
test_preds_LR = test_preds.copy()
test_preds_LR["pred"] = y_pred_grid_lr
print(get_metrics_mode(test_preds_LR))

| metric             |    value |
|:-------------------|---------:|
| accuracy           | 0.333333 |
| precision_weighted | 0.340975 |
| recall_weighted    | 0.333333 |
| f1_weighted        | 0.335172 |
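`get_metrics_mode` is imported from `src.utils` and its implementation is not shown here. Judging only from its name and from the columns it receives (`patient_id`, `y_true`, `pred`), one plausible reading is that it collapses the per-chunk predictions into a single prediction per patient by majority vote before computing the metrics. The sketch below illustrates that idea under this assumption; the function `aggregate_by_patient` and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.metrics import accuracy_score


def aggregate_by_patient(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per patient with the most frequent prediction."""
    return (
        df.groupby("patient_id")
          .agg(y_true=("y_true", "first"),
               # Series.mode() returns sorted values, so ties go to the smallest class
               pred=("pred", lambda s: s.mode().iloc[0]))
          .reset_index()
    )


# Toy chunk-level predictions for three patients
toy = pd.DataFrame({
    "patient_id": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "y_true":     [1, 1, 1, 2, 2, 2, 0, 0],
    "pred":       [1, 1, 0, 0, 0, 2, 0, 3],
})

per_patient = aggregate_by_patient(toy)
patient_accuracy = accuracy_score(per_patient["y_true"], per_patient["pred"])
print(per_patient)
print(patient_accuracy)
```

If the helper works this way, it would explain why these figures differ from the chunk-level metrics reported further down.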


In [13]:
grid_lr.best_params_

{'C': 0.001, 'penalty': 'l2'}

In [14]:
print(params_to_markdown(grid_lr.best_params_))

| parameter   | value   |
|:------------|:--------|
| C           | 0.001   |
| penalty     | l2      |


In [15]:
grid_lr.best_score_

0.26878184344603984

In [16]:
grid_lr.best_estimator_

In [17]:
joblib.dump(grid_lr, MODEL_PATH + "LR/logistic_regression_model.pkl")

LR_preds = pd.DataFrame({
    "patient_id" : test_data["patient_id"],
    "label" : test_data["label"],
    "y_true" : y_test,
    "pred" : y_pred_grid_lr,
})

LR_probas = pd.DataFrame(y_pred_prob_grid_lr, columns=[f"proba_clase_{c}" for c in grid_lr.classes_])
LR_results = pd.concat([LR_preds, LR_probas], axis=1)
LR_results.to_csv(MODEL_PATH + "LR/predictions.csv", index=False)
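A model saved with `joblib.dump` can be restored later with `joblib.load`. The sketch below shows the round trip on a small stand-in model; the synthetic data, the temporary directory and the file name `demo_model.pkl` are assumptions made for the example.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small stand-in model trained on synthetic data
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, path)      # serialise the fitted estimator
    restored = joblib.load(path)  # load it back into memory

# The restored estimator reproduces the original predictions
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

Pickled estimators are generally only safe to load with the same scikit-learn version that produced them.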


In [18]:
table_metrics(y_test,y_pred_grid_lr, y_pred_prob_grid_lr)

|    | metric             |    value |
|---:|:-------------------|---------:|
|  0 | accuracy           | 0.256667 |
|  1 | precision_weighted | 0.256479 |
|  2 | recall_weighted    | 0.256667 |
|  3 | f1_weighted        | 0.256464 |
|  4 | roc_auc_ovr        | 0.500283 |
|  5 | log_loss           | 1.39063  |
|  6 | gini_normalized    | 0.000567 |
|  7 | ks_test_clase_0    | 0.071111 |
|  8 | ks_test_clase_1    | 0.06     |
|  9 | ks_test_clase_2    | 0.038889 |


In [20]:
f1_score(y_test, y_pred_grid_lr, average='macro')

0.2564641130371913

In [21]:
confusion_matrix(y_test, y_pred_grid_lr)

array([[67, 61, 82, 90],
       [68, 81, 77, 74],
       [82, 68, 79, 71],
       [66, 84, 69, 81]])

In [22]:
print(classification_report(y_test, y_pred_grid_lr))

              precision    recall  f1-score   support

           0       0.24      0.22      0.23       300
           1       0.28      0.27      0.27       300
           2       0.26      0.26      0.26       300
           3       0.26      0.27      0.26       300

    accuracy                           0.26      1200
   macro avg       0.26      0.26      0.26      1200
weighted avg       0.26      0.26      0.26      1200



In [23]:
genera_metricas_markdown(y_test,y_pred_grid_lr, y_pred_prob_grid_lr)

| metric             |       value |
|:-------------------|------------:|
| accuracy           | 0.256667    |
| precision_weighted | 0.256479    |
| recall_weighted    | 0.256667    |
| f1_weighted        | 0.256464    |
| roc_auc_ovr        | 0.500283    |
| log_loss           | 1.39063     |
| gini_normalized    | 0.000566667 |
| ks_test_clase_0    | 0.0711111   |
| ks_test_clase_1    | 0.06        |
| ks_test_clase_2    | 0.0388889   |
| ks_test_clase_3    | 0.03        |
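As a reference point for the log loss above: a classifier that assigns the same probability to each of the four classes scores ln(4) ≈ 1.386. A log loss of about 1.39 is therefore close to chance level, in line with the ROC AUC of roughly 0.5. The sketch below verifies the baseline on a toy input.

```python
import math

import numpy as np
from sklearn.metrics import log_loss

n_classes = 4
y_true = np.array([0, 1, 2, 3])

# Uniform prediction: probability 1/4 for every class on every sample
uniform_proba = np.full((len(y_true), n_classes), 1.0 / n_classes)

chance_log_loss = log_loss(y_true, uniform_proba, labels=[0, 1, 2, 3])
print(chance_log_loss, math.log(n_classes))
```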


|              |   precision |   recall |   f1-score |     support |
|:-------------|------------:|---------:|-----------:|------------:|
| 0            |    0.236749 | 0.223333 |   0.229846 |  300        |
| 1            |    0.27551  | 0.27     |   0.272727 |  300        |
| 2            |    0.257329 | 0.263333 |   0.260297 |  300        |
| 3            |    0.256329 | 0.27     |   0.262987 |  300        |
| accuracy     |    0.256667 | 0.256667 |   0.256667 |    0.256667 |
| macro avg    |    0.256479 | 0.2

In [24]:
cm_lr = confusion_matrix(y_test, y_pred_grid_lr)
df_cm_lr = pd.DataFrame(cm_lr,
                         index = [f"Real {label}" for label in grid_lr.classes_],
                         columns = [f"Pred {label}" for label in grid_lr.classes_])
print(df_cm_lr.to_markdown())


|        |   Pred 0 |   Pred 1 |   Pred 2 |   Pred 3 |
|:-------|---------:|---------:|---------:|---------:|
| Real 0 |       67 |       61 |       82 |       90 |
| Real 1 |       68 |       81 |       77 |       74 |
| Real 2 |       82 |       68 |       79 |       71 |
| Real 3 |       66 |       84 |       69 |       81 |
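The per-class precision and recall in the classification report can be recovered directly from this confusion matrix: recall divides each diagonal entry by its row sum, precision divides it by its column sum. A short sketch using the matrix printed above:

```python
import numpy as np

# Confusion matrix of the logistic regression (rows: real, columns: predicted)
cm = np.array([[67, 61, 82, 90],
               [68, 81, 77, 74],
               [82, 68, 79, 71],
               [66, 84, 69, 81]])

recall = np.diag(cm) / cm.sum(axis=1)     # correct / all samples of the real class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all samples predicted as the class
accuracy = np.trace(cm) / cm.sum()        # correct / all samples

print(np.round(recall, 6))
print(np.round(precision, 6))
print(round(accuracy, 6))
```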


## Random Forest Classifier

In [25]:
# Estimator
rfc = RandomForestClassifier(random_state=42,
                             n_jobs=-1,
                             bootstrap=True)

# Parameters
params_rfc = {'n_estimators': [100, 350, 500],
              'max_features': ['sqrt'],
              'max_depth': [5, 10, 20],
              'min_samples_split': [2, 10, 30],
              'min_samples_leaf': [2, 10, 30]}

# Grid search
grid_rfc = GridSearchCV(estimator=rfc,
                       param_grid=params_rfc,
                       scoring=recall_macro,
                       cv=5,
                       verbose=1,
                       n_jobs=-1)

In [26]:
tt.tic()
grid_rfc.fit(X_train, y_train)
tt.toc()

Fitting 5 folds for each of 81 candidates, totalling 405 fits




Elapsed time: 4135.943624 seconds


4135.943624019623
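The number of fits reported above follows from the size of the grid: every combination of parameter values is one candidate, and each candidate is fit once per fold. A quick way to check the cost of a grid before launching it is `ParameterGrid`, sketched here with the random forest grid from this section.

```python
from sklearn.model_selection import ParameterGrid

params_rfc = {'n_estimators': [100, 350, 500],
              'max_features': ['sqrt'],
              'max_depth': [5, 10, 20],
              'min_samples_split': [2, 10, 30],
              'min_samples_leaf': [2, 10, 30]}

cv_folds = 5
n_candidates = len(ParameterGrid(params_rfc))  # 3 * 1 * 3 * 3 * 3 combinations
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```

With more than an hour of search time, a randomized search over the same space (`RandomizedSearchCV`) would be one way to cap the budget, at the cost of not visiting every combination.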

In [27]:
grid_rfc.best_params_

{'max_depth': 20,
 'max_features': 'sqrt',
 'min_samples_leaf': 2,
 'min_samples_split': 10,
 'n_estimators': 500}

In [28]:
print(params_to_markdown(grid_rfc.best_params_))

| parameter         | value   |
|:------------------|:--------|
| max_depth         | 20      |
| max_features      | sqrt    |
| min_samples_leaf  | 2       |
| min_samples_split | 10      |
| n_estimators      | 500     |


In [29]:
grid_rfc.best_score_

0.5002990792276506

In [30]:
grid_rfc.best_estimator_

In [31]:
# Prediction
y_pred_grid_rfc = grid_rfc.predict(X_test)
y_pred_prob_grid_rfc = grid_rfc.predict_proba(X_test)

In [32]:
test_preds_RFC = test_preds.copy()
test_preds_RFC["pred"] = y_pred_grid_rfc
print(get_metrics_mode(test_preds_RFC))

| metric             |    value |
|:-------------------|---------:|
| accuracy           | 0.566667 |
| precision_weighted | 0.565486 |
| recall_weighted    | 0.566667 |
| f1_weighted        | 0.563655 |


In [33]:
joblib.dump(grid_rfc, MODEL_PATH + "RFC/random_forest_model.pkl")

RFC_preds = pd.DataFrame({
    "patient_id" : test_data["patient_id"],
    "label" : test_data["label"],
    "y_true" : y_test,
    "pred" : y_pred_grid_rfc,
})

RFC_probas = pd.DataFrame(y_pred_prob_grid_rfc, columns=[f"proba_clase_{c}" for c in grid_rfc.classes_])
RFC_results = pd.concat([RFC_preds, RFC_probas], axis=1)
RFC_results.to_csv(MODEL_PATH + "RFC/predictions.csv", index=False)


In [34]:
table_metrics(y_test,y_pred_grid_rfc, y_pred_prob_grid_rfc)

|    | metric             |    value |
|---:|:-------------------|---------:|
|  0 | accuracy           | 0.485    |
|  1 | precision_weighted | 0.482702 |
|  2 | recall_weighted    | 0.485    |
|  3 | f1_weighted        | 0.482669 |
|  4 | roc_auc_ovr        | 0.73496  |
|  5 | log_loss           | 1.20453  |
|  6 | gini_normalized    | 0.46992  |
|  7 | ks_test_clase_0    | 0.246667 |
|  8 | ks_test_clase_1    | 0.546667 |
|  9 | ks_test_clase_2    | 0.266667 |


In [35]:
confusion_matrix(y_test, y_pred_grid_rfc)

array([[115,  63,  53,  69],
       [ 46, 189,  40,  25],
       [ 68,  34, 109,  89],
       [ 53,   9,  69, 169]])

In [36]:
print(classification_report(y_test, y_pred_grid_rfc))

              precision    recall  f1-score   support

           0       0.41      0.38      0.40       300
           1       0.64      0.63      0.64       300
           2       0.40      0.36      0.38       300
           3       0.48      0.56      0.52       300

    accuracy                           0.48      1200
   macro avg       0.48      0.48      0.48      1200
weighted avg       0.48      0.48      0.48      1200



In [37]:
genera_metricas_markdown(y_test,y_pred_grid_rfc, y_pred_prob_grid_rfc)

| metric             |    value |
|:-------------------|---------:|
| accuracy           | 0.485    |
| precision_weighted | 0.482702 |
| recall_weighted    | 0.485    |
| f1_weighted        | 0.482669 |
| roc_auc_ovr        | 0.73496  |
| log_loss           | 1.20453  |
| gini_normalized    | 0.46992  |
| ks_test_clase_0    | 0.246667 |
| ks_test_clase_1    | 0.546667 |
| ks_test_clase_2    | 0.266667 |
| ks_test_clase_3    | 0.44     |
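The helper that produces this table lives in `src.utils` and is not shown, but the reported values are consistent with the usual relation gini = 2 · AUC − 1 applied to the one-vs-rest ROC AUC. The sketch below checks that relation against the figure above and shows how a one-vs-rest AUC can be computed with scikit-learn. The function name `gini_from_auc` and the toy probabilities are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def gini_from_auc(auc: float) -> float:
    """Normalized Gini coefficient derived from a ROC AUC value."""
    return 2.0 * auc - 1.0


# Value reported in the table above
reported_auc = 0.73496
gini = gini_from_auc(reported_auc)
print(round(gini, 5))

# Toy 4-class example: 0.7 on the true class, 0.1 on each other class
y_true = np.array([0, 1, 2, 3, 0, 1, 2, 3])
proba = np.eye(4)[y_true] * 0.6 + 0.1  # every row sums to 1
toy_auc = roc_auc_score(y_true, proba, multi_class="ovr")
print(toy_auc, gini_from_auc(toy_auc))
```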


|              |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0            |    0.407801 | 0.383333 |   0.395189 |   300     |
| 1            |    0.640678 | 0.63     |   0.635294 |   300     |
| 2            |    0.402214 | 0.363333 |   0.381786 |   300     |
| 3            |    0.480114 | 0.563333 |   0.518405 |   300     |
| accuracy     |    0.485    | 0.485    |   0.485    |     0.485 |
| macro avg    |    0.482702 | 0.485    |   0.482669 |  1200     |
| weighted avg |    

In [38]:
cm_rfc = confusion_matrix(y_test, y_pred_grid_rfc)
df_cm_rfc = pd.DataFrame(cm_rfc,
                         index = [f"Real {label}" for label in grid_rfc.classes_],
                         columns = [f"Pred {label}" for label in grid_rfc.classes_])
print(df_cm_rfc.to_markdown())


|        |   Pred 0 |   Pred 1 |   Pred 2 |   Pred 3 |
|:-------|---------:|---------:|---------:|---------:|
| Real 0 |      115 |       63 |       53 |       69 |
| Real 1 |       46 |      189 |       40 |       25 |
| Real 2 |       68 |       34 |      109 |       89 |
| Real 3 |       53 |        9 |       69 |      169 |


## Gradient Boosting Classifier

In [39]:
# Estimator (these starting values are overridden by the grid below)
gb = GradientBoostingClassifier(learning_rate=0.05,
                                subsample=0.5,
                                max_depth=6,
                                n_estimators=10,
                                random_state=42)

# Parameters
params_gb = {'n_estimators': [10, 100],
             'learning_rate': [0.01, 0.1],
             'subsample': [0.5, 1.0],
             'max_depth': [5, 10, 20],
             'min_samples_split': [2, 10],
             'min_samples_leaf': [10, 30],
             'max_features': ['sqrt']}

# Grid search
grid_gb = GridSearchCV(estimator=gb,
                       param_grid=params_gb,
                       scoring=recall_macro,
                       cv = 5,
                       verbose=1,
                       n_jobs=-1)

In [40]:
# Training
tt.tic()
grid_gb.fit(X_train, y_train)
tt.toc()

Fitting 5 folds for each of 96 candidates, totalling 480 fits
Elapsed time: 4336.662994 seconds


4336.6629939079285

In [41]:
grid_gb.best_params_

{'learning_rate': 0.1,
 'max_depth': 20,
 'max_features': 'sqrt',
 'min_samples_leaf': 30,
 'min_samples_split': 2,
 'n_estimators': 100,
 'subsample': 1.0}

In [42]:
print(params_to_markdown(grid_gb.best_params_))

| parameter         | value   |
|:------------------|:--------|
| learning_rate     | 0.1     |
| max_depth         | 20      |
| max_features      | sqrt    |
| min_samples_leaf  | 30      |
| min_samples_split | 2       |
| n_estimators      | 100     |
| subsample         | 1.0     |


In [43]:
grid_gb.best_score_

0.5269244313887171

In [44]:
grid_gb.best_estimator_

In [45]:
# Predictions
y_pred_grid_gb = grid_gb.predict(X_test)
y_pred_prob_grid_gb = grid_gb.predict_proba(X_test)

In [46]:
test_preds_GB = test_preds.copy()
test_preds_GB["pred"] = y_pred_grid_gb
print(get_metrics_mode(test_preds_GB))

| metric             |    value |
|:-------------------|---------:|
| accuracy           | 0.533333 |
| precision_weighted | 0.557202 |
| recall_weighted    | 0.533333 |
| f1_weighted        | 0.540249 |


In [47]:
# Save the gradient boosting search (not grid_rfc) under the GB folder
joblib.dump(grid_gb, MODEL_PATH + "GB/gradient_boosting_model.pkl")

GB_preds = pd.DataFrame({
    "patient_id" : test_data["patient_id"],
    "label" : test_data["label"],
    "y_true" : y_test,
    "pred" : y_pred_grid_gb,
})

GB_probas = pd.DataFrame(y_pred_prob_grid_gb, columns=[f"proba_clase_{c}" for c in grid_gb.classes_])
GB_results = pd.concat([GB_preds, GB_probas], axis=1)
GB_results.to_csv(MODEL_PATH + "GB/predictions.csv", index=False)


In [48]:
table_metrics(y_test,y_pred_grid_gb, y_pred_prob_grid_gb)

|    | metric             |    value |
|---:|:-------------------|---------:|
|  0 | accuracy           | 0.499167 |
|  1 | precision_weighted | 0.513695 |
|  2 | recall_weighted    | 0.499167 |
|  3 | f1_weighted        | 0.504324 |
|  4 | roc_auc_ovr        | 0.746159 |
|  5 | log_loss           | 1.190228 |
|  6 | gini_normalized    | 0.492319 |
|  7 | ks_test_clase_0    | 0.272222 |
|  8 | ks_test_clase_1    | 0.563333 |
|  9 | ks_test_clase_2    | 0.301111 |


In [49]:
confusion_matrix(y_test, y_pred_grid_gb)

array([[133,  48,  65,  54],
       [ 53, 183,  46,  18],
       [ 73,  15, 140,  72],
       [ 56,   4,  97, 143]])

In [50]:
print(classification_report(y_test, y_pred_grid_gb))

              precision    recall  f1-score   support

           0       0.42      0.44      0.43       300
           1       0.73      0.61      0.67       300
           2       0.40      0.47      0.43       300
           3       0.50      0.48      0.49       300

    accuracy                           0.50      1200
   macro avg       0.51      0.50      0.50      1200
weighted avg       0.51      0.50      0.50      1200



In [51]:
genera_metricas_markdown(y_test,y_pred_grid_gb, y_pred_prob_grid_gb)

| metric             |    value |
|:-------------------|---------:|
| accuracy           | 0.499167 |
| precision_weighted | 0.513695 |
| recall_weighted    | 0.499167 |
| f1_weighted        | 0.504324 |
| roc_auc_ovr        | 0.746159 |
| log_loss           | 1.19023  |
| gini_normalized    | 0.492319 |
| ks_test_clase_0    | 0.272222 |
| ks_test_clase_1    | 0.563333 |
| ks_test_clase_2    | 0.301111 |
| ks_test_clase_3    | 0.416667 |


|              |   precision |   recall |   f1-score |     support |
|:-------------|------------:|---------:|-----------:|------------:|
| 0            |    0.422222 | 0.443333 |   0.43252  |  300        |
| 1            |    0.732    | 0.61     |   0.665455 |  300        |
| 2            |    0.402299 | 0.466667 |   0.432099 |  300        |
| 3            |    0.498258 | 0.476667 |   0.487223 |  300        |
| accuracy     |    0.499167 | 0.499167 |   0.499167 |    0.499167 |
| macro avg    |    0.513695 | 0.499167 |   0.504324 | 1200        |
| we

In [52]:
cm_gb = confusion_matrix(y_test, y_pred_grid_gb)
df_cm_gb = pd.DataFrame(cm_gb,
                         index = [f"Real {label}" for label in grid_gb.classes_],
                         columns = [f"Pred {label}" for label in grid_gb.classes_])
print(df_cm_gb.to_markdown())

|        |   Pred 0 |   Pred 1 |   Pred 2 |   Pred 3 |
|:-------|---------:|---------:|---------:|---------:|
| Real 0 |      133 |       48 |       65 |       54 |
| Real 1 |       53 |      183 |       46 |       18 |
| Real 2 |       73 |       15 |      140 |       72 |
| Real 3 |       56 |        4 |       97 |      143 |


## Naive Bayes

In [53]:
gnb = GaussianNB()

params_gnb = {
    'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3]
}
#Grid Search
grid_gnb = GridSearchCV(estimator=gnb,
                       param_grid=params_gnb,
                       scoring=recall_macro,
                       cv=5,
                       verbose=1,
                       n_jobs=-1)

In [54]:
# Training
tt.tic()
grid_gnb.fit(X_train, y_train)
tt.toc()

Fitting 5 folds for each of 7 candidates, totalling 35 fits
Elapsed time: 3.714519 seconds


3.7145190238952637

In [55]:
grid_gnb.best_params_

{'var_smoothing': 1e-09}

In [56]:
print(params_to_markdown(grid_gnb.best_params_))

| parameter     |   value |
|:--------------|--------:|
| var_smoothing |   1e-09 |


In [57]:
grid_gnb.best_score_

0.33170643509929226

In [58]:
# Predictions
y_pred_grid_gnb = grid_gnb.predict(X_test)
y_pred_prob_grid_gnb = grid_gnb.predict_proba(X_test)

In [59]:
test_preds_GNB = test_preds.copy()
test_preds_GNB["pred"] = y_pred_grid_gnb
print(get_metrics_mode(test_preds_GNB))

| metric             |    value |
|:-------------------|---------:|
| accuracy           | 0.308333 |
| precision_weighted | 0.302306 |
| recall_weighted    | 0.308333 |
| f1_weighted        | 0.241931 |


In [60]:
joblib.dump(grid_gnb, MODEL_PATH + "GNB/Gaussian_Naive_Bayes_model.pkl")

GNB_preds = pd.DataFrame({
    "patient_id" : test_data["patient_id"],
    "label" : test_data["label"],
    "y_true" : y_test,
    "pred" : y_pred_grid_gnb,
})

GNB_probas = pd.DataFrame(y_pred_prob_grid_gnb, columns=[f"proba_clase_{c}" for c in grid_gnb.classes_])
GNB_results = pd.concat([GNB_preds, GNB_probas], axis=1)
GNB_results.to_csv(MODEL_PATH + "GNB/predictions.csv", index=False)


In [61]:
table_metrics(y_test,y_pred_grid_gnb, y_pred_prob_grid_gnb)

|    | metric             |     value |
|---:|:-------------------|----------:|
|  0 | accuracy           |  0.325833 |
|  1 | precision_weighted |  0.329307 |
|  2 | recall_weighted    |  0.325833 |
|  3 | f1_weighted        |  0.280119 |
|  4 | roc_auc_ovr        |  0.588748 |
|  5 | log_loss           | 20.860292 |
|  6 | gini_normalized    |  0.177495 |
|  7 | ks_test_clase_0    |  0.185556 |
|  8 | ks_test_clase_1    |  0.242222 |
|  9 | ks_test_clase_2    |  0.117778 |


In [62]:
confusion_matrix(y_test, y_pred_grid_gnb)

array([[ 31, 164,  29,  76],
       [ 13, 233,  27,  27],
       [ 20, 171,  46,  63],
       [ 34, 156,  29,  81]])

In [63]:
print(classification_report(y_test, y_pred_grid_gnb))

              precision    recall  f1-score   support

           0       0.32      0.10      0.16       300
           1       0.32      0.78      0.46       300
           2       0.35      0.15      0.21       300
           3       0.33      0.27      0.30       300

    accuracy                           0.33      1200
   macro avg       0.33      0.33      0.28      1200
weighted avg       0.33      0.33      0.28      1200



In [64]:
genera_metricas_markdown(y_test,y_pred_grid_gnb, y_pred_prob_grid_gnb)

| metric             |     value |
|:-------------------|----------:|
| accuracy           |  0.325833 |
| precision_weighted |  0.329307 |
| recall_weighted    |  0.325833 |
| f1_weighted        |  0.280119 |
| roc_auc_ovr        |  0.588748 |
| log_loss           | 20.8603   |
| gini_normalized    |  0.177495 |
| ks_test_clase_0    |  0.185556 |
| ks_test_clase_1    |  0.242222 |
| ks_test_clase_2    |  0.117778 |
| ks_test_clase_3    |  0.115556 |


|              |   precision |   recall |   f1-score |     support |
|:-------------|------------:|---------:|-----------:|------------:|
| 0            |    0.316327 | 0.103333 |   0.155779 |  300        |
| 1            |    0.321823 | 0.776667 |   0.455078 |  300        |
| 2            |    0.351145 | 0.153333 |   0.213457 |  300        |
| 3            |    0.327935 | 0.27     |   0.296161 |  300        |
| accuracy     |    0.325833 | 0.325833 |   0.325833 |    0.325833 |
| macro avg    |    0.329307 | 0.325833 |   0.280119 | 1200 

In [65]:
cm_gnb = confusion_matrix(y_test, y_pred_grid_gnb)
df_cm_gnb = pd.DataFrame(cm_gnb,
                         index = [f"Real {label}" for label in grid_gnb.classes_],
                         columns = [f"Pred {label}" for label in grid_gnb.classes_])
print(df_cm_gnb.to_markdown())

|        |   Pred 0 |   Pred 1 |   Pred 2 |   Pred 3 |
|:-------|---------:|---------:|---------:|---------:|
| Real 0 |       31 |      164 |       29 |       76 |
| Real 1 |       13 |      233 |       27 |       27 |
| Real 2 |       20 |      171 |       46 |       63 |
| Real 3 |       34 |      156 |       29 |       81 |


## XGB Classifier

In [66]:
xgb_base = xgb.XGBClassifier(
    learning_rate=0.1,
    n_estimators=4000,
    # eval_metric="merror",
    eval_metric="auc",
    objective="multi:softprob",
    early_stopping_rounds=500,
    
)

In [67]:
tt.tic()
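# Early stopping is monitored on the test split; the validation split
# is kept aside for the evaluation in the following cells.
xgb_base.fit(X_train, y_train,
             eval_set=[(X_test, y_test)],
             verbose=1)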

# xgb_n_estimator = xgb_base.best_iteration
tt.toc()

[0]	validation_0-auc:0.62871
[1]	validation_0-auc:0.64596
[2]	validation_0-auc:0.65933
[3]	validation_0-auc:0.66394
[4]	validation_0-auc:0.67429
[5]	validation_0-auc:0.67691
[6]	validation_0-auc:0.68130
[7]	validation_0-auc:0.68853
[8]	validation_0-auc:0.69308
[9]	validation_0-auc:0.69517
[10]	validation_0-auc:0.69710
[11]	validation_0-auc:0.70342
[12]	validation_0-auc:0.70312
[13]	validation_0-auc:0.70495
[14]	validation_0-auc:0.70760
[15]	validation_0-auc:0.71094
[16]	validation_0-auc:0.71315
[17]	validation_0-auc:0.71501
[18]	validation_0-auc:0.71624
[19]	validation_0-auc:0.71670
[20]	validation_0-auc:0.71804
[21]	validation_0-auc:0.71931
[22]	validation_0-auc:0.72110
[23]	validation_0-auc:0.72196
[24]	validation_0-auc:0.72251
[25]	validation_0-auc:0.72240
[26]	validation_0-auc:0.72282
[27]	validation_0-auc:0.72310
[28]	validation_0-auc:0.72418
[29]	validation_0-auc:0.72518
[30]	validation_0-auc:0.72510
[31]	validation_0-auc:0.72487
[32]	validation_0-auc:0.72608
[33]	validation_0-au

194.94480109214783

In [68]:
# Predictions (on the validation split)
y_pred_grid_xgb_b = xgb_base.predict(X_val)
y_pred_prob_grid_xgb_b = xgb_base.predict_proba(X_val)

In [69]:
test_preds_XGB = val_preds.copy()
test_preds_XGB["pred"] = y_pred_grid_xgb_b
print(get_metrics_mode(test_preds_XGB))

| metric             |    value |
|:-------------------|---------:|
| accuracy           | 0.528926 |
| precision_weighted | 0.537967 |
| recall_weighted    | 0.528926 |
| f1_weighted        | 0.532205 |


In [70]:
table_metrics(y_val,y_pred_grid_xgb_b, y_pred_prob_grid_xgb_b)

|    | metric             |    value |
|---:|:-------------------|---------:|
|  0 | accuracy           | 0.51405  |
|  1 | precision_weighted | 0.521677 |
|  2 | recall_weighted    | 0.51405  |
|  3 | f1_weighted        | 0.516701 |
|  4 | roc_auc_ovr        | 0.767521 |
|  5 | log_loss           | 1.121876 |
|  6 | gini_normalized    | 0.535042 |
|  7 | ks_test_clase_0    | 0.344286 |
|  8 | ks_test_clase_1    | 0.597143 |
|  9 | ks_test_clase_2    | 0.302043 |


In [72]:
joblib.dump(xgb_base, MODEL_PATH + "XGB/XGB_base_model.pkl")

XGB_base_preds = pd.DataFrame({
    "patient_id" : val_data["patient_id"],
    "label" : val_data["label"],
    "y_true" : y_val,
    "pred" : y_pred_grid_xgb_b,
})

XGB_base_probas = pd.DataFrame(y_pred_prob_grid_xgb_b, columns=[f"proba_clase_{c}" for c in xgb_base.classes_])
XGB_base_results = pd.concat([XGB_base_preds, XGB_base_probas], axis=1)
XGB_base_results.to_csv(MODEL_PATH + "XGB/predictions_base.csv", index=False)


In [73]:
genera_metricas_markdown(y_val,y_pred_grid_xgb_b, y_pred_prob_grid_xgb_b)

| metric             |    value |
|:-------------------|---------:|
| accuracy           | 0.51405  |
| precision_weighted | 0.521677 |
| recall_weighted    | 0.51405  |
| f1_weighted        | 0.516701 |
| roc_auc_ovr        | 0.767521 |
| log_loss           | 1.12188  |
| gini_normalized    | 0.535042 |
| ks_test_clase_0    | 0.344286 |
| ks_test_clase_1    | 0.597143 |
| ks_test_clase_2    | 0.302043 |
| ks_test_clase_3    | 0.459304 |


|              |   precision |   recall |   f1-score |    support |
|:-------------|------------:|---------:|-----------:|-----------:|
| 0            |    0.444816 | 0.443333 |   0.444073 |  300       |
| 1            |    0.732824 | 0.64     |   0.683274 |  300       |
| 2            |    0.412698 | 0.419355 |   0.416    |  310       |
| 3            |    0.5      | 0.556667 |   0.526814 |  300       |
| accuracy     |    0.51405  | 0.51405  |   0.51405  |    0.51405 |
| macro avg    |    0.522585 | 0.514839 |   0.51754  | 1210       |
| weighted a

In [74]:
cm_xgb_base = confusion_matrix(y_val, y_pred_grid_xgb_b)
df_cm_xgb_base = pd.DataFrame(cm_xgb_base,
                         index = [f"Real {label}" for label in xgb_base.classes_],
                         columns = [f"Pred {label}" for label in xgb_base.classes_])
print(df_cm_xgb_base.to_markdown())

|        |   Pred 0 |   Pred 1 |   Pred 2 |   Pred 3 |
|:-------|---------:|---------:|---------:|---------:|
| Real 0 |      133 |       45 |       71 |       51 |
| Real 1 |       49 |      192 |       37 |       22 |
| Real 2 |       69 |       17 |      130 |       94 |
| Real 3 |       48 |        8 |       77 |      167 |


In [75]:
print(params_to_markdown(xgb_base.get_params()))

| parameter             | value          |
|:----------------------|:---------------|
| objective             | multi:softprob |
| early_stopping_rounds | 500            |
| enable_categorical    | False          |
| eval_metric           | auc            |
| learning_rate         | 0.1            |
| missing               | nan            |
| n_estimators          | 4000           |
