# Pathology prediction (Enlarged_CM)


Reference: Soenksen, L.R., Ma, Y., Zeng, C. et al. Integrated multimodal artificial intelligence framework for healthcare applications. npj Digit. Med. 5, 149 (2022). https://doi.org/10.1038/s41746-022-00689-4

This notebook predicts the Enlarged Cardiomediastinum pathology from the CSV embeddings file.



## Introduction


Radiology notes were processed to determine if each of the pathologies was explicitly confirmed as present (value = 1), explicitly confirmed as absent (value = 0), inconclusive in the study (value = −1), or not explored (no value).

Selected samples: only samples labeled 0 or 1 are kept; all others are removed from the training and testing data.

Excluded variables: the unstructured radiology notes component (E_rad) is excluded from the allowable inputs to avoid potential overfitting or misrepresentation of the real predictive value.

Each target chest pathology is modeled as a separate binary classification task.

The final sample size for the Enlarged Cardiomediastinum pathology is N = 3206.
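
The sample selection described above can be sketched in a few lines. This is only an illustration, not the code used by ``HAIMDataset``: the ``records`` list, the ``label`` key, and the helper name are hypothetical, and ``None`` stands in for "not explored" (no value).

```python
# Hypothetical records: 1 = present, 0 = absent, -1 = inconclusive, None = not explored.
records = [
    {"haim_id": 1, "label": 1},
    {"haim_id": 2, "label": 0},
    {"haim_id": 3, "label": -1},
    {"haim_id": 4, "label": None},
    {"haim_id": 5, "label": 1},
]


def select_binary_samples(rows, label_key="label"):
    """Keep only the rows whose label is explicitly 0 or 1."""
    return [row for row in rows if row[label_key] in (0, 1)]


selected = select_binary_samples(records)
print([row["haim_id"] for row in selected])
```

Inconclusive and unexplored samples are dropped rather than recoded, so the remaining target is strictly binary.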

#### Imports

In [1]:
import os

# Move up one directory so that the ``src`` package can be imported
os.chdir('../')

from pandas import read_csv

from src.data import constants
from src.data.dataset import HAIMDataset
from src.evaluation.pycaret_evaluator import PyCaretEvaluator
from src.utils.metric_scores import *

#### Read data from local source



In [2]:
df = read_csv(constants.FILE_DF, nrows=constants.N_DATA)

#### Create a custom dataset for the HAIM experiment


Build the target column for the task and configure the dataset: use ``haim_id`` as the ``global_id`` and use all sources except ``radiology notes``.

In [3]:
dataset = HAIMDataset(df,  
                      constants.CHEST_PREDICTORS, 
                      constants.ALL_MODALITIES, 
                      constants.ENLARGED_CARDIOMEDIASTINUM,
                      constants.IMG_ID, 
                      constants.GLOBAL_ID)

#### Set hyper-parameters

In [4]:
# Define the grid of hyper-parameters for the tuning
grid_hps = {'max_depth': [5, 6, 7, 8],
            'n_estimators': [200, 300],
            'learning_rate': [0.3, 0.1, 0.05],
            }
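
To get a sense of the size of this search space, the grid can be expanded into its individual combinations. The sketch below uses only the standard library; ``expand_grid`` is a hypothetical helper and not part of the project code.

```python
from itertools import product

grid_hps = {
    "max_depth": [5, 6, 7, 8],
    "n_estimators": [200, 300],
    "learning_rate": [0.3, 0.1, 0.05],
}


def expand_grid(grid):
    """Return every hyper-parameter combination of the grid as a dict."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


combinations = expand_grid(grid_hps)
print(len(combinations))  # 4 * 2 * 3 combinations
print(combinations[0])
```

An exhaustive fivefold search over this grid would fit 24 × 5 = 120 models per training loop. The run log further down reports "Fitting 5 folds for each of 10 candidates", which suggests that the tuner samples 10 candidates from the grid instead of enumerating all of them. Whether that is the case depends on how ``PyCaretEvaluator`` configures the search, which is not shown in this notebook.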

### Model training and predictions using an XGBClassifier model with GridSearchCV and hyperparameter optimization


The goal of this section of the notebook is to compute the following metrics:

``ACCURACY_SCORE, BALANCED_ACCURACY_SCORE, SENSITIVITY, SPECIFICITY, AUC, BRIER SCORE, BINARY CROSS-ENTROPY``
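
For reference, the metrics listed above can be written out from their textbook definitions. The sketch below is a plain-Python illustration with a hypothetical helper name and toy inputs. It is not the implementation in ``src.utils.metric_scores``, which may differ in details such as thresholding or clipping. Sensitivity is the same quantity that PyCaret reports as Recall.

```python
import math


def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute binary-classification metrics from labels and predicted probabilities.

    Assumes that both classes are present in y_true.
    """
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(y_true)

    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate

    # AUC: probability that a random positive is scored above a random negative
    # (ties count as one half).
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    auc = wins / (len(pos) * len(neg))

    # Brier score: mean squared error of the predicted probabilities.
    brier = sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / n

    # Binary cross-entropy, with probabilities clipped away from 0 and 1.
    eps = 1e-15
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]
    bce = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y_true, clipped)
    ) / n

    return {
        "accuracy": (tp + tn) / n,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "auc": auc,
        "brier_score": brier,
        "binary_cross_entropy": bce,
    }


# Toy example with made-up labels and probabilities
scores = binary_metrics([1, 0, 1, 1, 0], [0.9, 0.2, 0.6, 0.4, 0.7])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```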


The hyperparameter combinations of individual XGBoost models were selected within each training loop using a ``fivefold cross-validated grid search`` on the training set (80%). This XGBoost ``tuning process`` selected the ``maximum depth of the trees (5–8)``, the number of ``estimators (200 or 300)``, and the ``learning rate (0.05, 0.1, 0.3)`` according to the parameter value combination leading to the highest observed AUROC within the training loop.


As mentioned previously, all XGBoost models were trained ``five times with five different data splits`` to repeat the experiments and compute average metrics.


Refer to page 8 of the study: https://doi.org/10.1038/s41746-022-00689-4
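
Taken together, the five data splits and the fivefold search inside each training set form a nested cross-validation. The sketch below shows only the index bookkeeping of such a scheme in plain Python. It is an illustration under assumptions, not the PyCaret implementation: the helper names, the toy labels, and the seeding are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed):
    """Assign sample positions to n_folds folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled members of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def nested_splits(labels, outer_folds=5, inner_folds=5, seed=42):
    """Yield (train_idx, test_idx, inner_splits) for every outer split."""
    outer = stratified_folds(labels, outer_folds, seed)
    for k, test_idx in enumerate(outer):
        train_idx = [i for j, fold in enumerate(outer) if j != k for i in fold]
        train_labels = [labels[i] for i in train_idx]
        # Inner folds are built on the outer training set only.
        inner = stratified_folds(train_labels, inner_folds, seed + k + 1)
        inner_splits = []
        for m, val_positions in enumerate(inner):
            val_idx = [train_idx[p] for p in val_positions]
            fit_idx = [
                train_idx[p] for j, fold in enumerate(inner) if j != m for p in fold
            ]
            inner_splits.append((fit_idx, val_idx))
        yield train_idx, test_idx, inner_splits


labels = [1] * 80 + [0] * 20  # hypothetical imbalanced labels
for train_idx, test_idx, inner_splits in nested_splits(labels):
    print(len(train_idx), len(test_idx), len(inner_splits))
```

In such a scheme the hyper-parameters are chosen on the inner splits (here by AUROC), and the selected model is scored once on the outer test split, which takes no part in the tuning.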

In [5]:
# Initialize the PyCaret Evaluator
evaluator = PyCaretEvaluator(
    dataset=dataset,
    target="EnlargedCardiomediastinum",
    experiment_name="CP_EnlargedCardiomediastinum",
    filepath="./results/enlargedcardiomediastinum",
)

# Model training and results evaluation
evaluator.run_experiment(
    train_size=0.8,
    fold=5,
    fold_strategy='stratifiedkfold',
    outer_fold=5,
    outer_strategy='stratifiedkfold',
    session_id=42,
    model='xgboost',
    optimize='AUC',
    custom_grid=grid_hps
)

Outer fold 1/5
Configuring PyCaret for outer fold 1
Creating model xgboost for outer fold 1


| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|------|----------|-----|--------|-------|----|-------|-----|
| 0 | 0.8637 | 0.8928 | 0.9608 | 0.8698 | 0.913 | 0.602 | 0.6182 |
| 1 | 0.8293 | 0.8623 | 0.9279 | 0.855 | 0.8899 | 0.5123 | 0.5209 |
| 2 | 0.8634 | 0.8887 | 0.941 | 0.8831 | 0.9111 | 0.6177 | 0.6235 |
| 3 | 0.8707 | 0.8723 | 0.9706 | 0.871 | 0.9181 | 0.6159 | 0.6367 |
| 4 | 0.8634 | 0.8802 | 0.9771 | 0.8592 | 0.9144 | 0.5838 | 0.6145 |
| Mean | 0.8581 | 0.8793 | 0.9555 | 0.8676 | 0.9093 | 0.5863 | 0.6028 |
| Std | 0.0147 | 0.0111 | 0.0184 | 0.0099 | 0.01 | 0.039 | 0.0416 |


Tuning hyperparameters for model xgboost with custom grid


| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|------|----------|-----|--------|-------|----|-------|-----|
| 0 | 0.8589 | 0.8958 | 0.9542 | 0.869 | 0.9097 | 0.5906 | 0.6043 |
| 1 | 0.8293 | 0.8666 | 0.9213 | 0.8593 | 0.8892 | 0.5189 | 0.5249 |
| 2 | 0.861 | 0.8886 | 0.9344 | 0.8851 | 0.9091 | 0.6147 | 0.6188 |
| 3 | 0.878 | 0.8746 | 0.9739 | 0.8765 | 0.9226 | 0.639 | 0.6591 |
| 4 | 0.8683 | 0.8854 | 0.9804 | 0.8621 | 0.9174 | 0.5987 | 0.6301 |
| Mean | 0.8591 | 0.8822 | 0.9528 | 0.8704 | 0.9096 | 0.5924 | 0.6075 |
| Std | 0.0164 | 0.0104 | 0.0225 | 0.0095 | 0.0114 | 0.0403 | 0.045 |


Fitting 5 folds for each of 10 candidates, totalling 50 fits
Transformation Pipeline and Model Successfully Saved


| Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|-------|----------|-----|--------|-------|----|-------|-----|
| Extreme Gradient Boosting | 0.8551 | 0.9039 | 0.954 | 0.8653 | 0.9075 | 0.5777 | 0.5926 |


Outer fold 2/5
Configuring PyCaret for outer fold 2
Creating model xgboost for outer fold 2


| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|------|----------|-----|--------|-------|----|-------|-----|
| 0 | 0.8297 | 0.8504 | 0.9346 | 0.8512 | 0.891 | 0.5059 | 0.5177 |
| 1 | 0.8662 | 0.9081 | 0.9706 | 0.8659 | 0.9153 | 0.6022 | 0.625 |
| 2 | 0.839 | 0.8627 | 0.9377 | 0.8589 | 0.8966 | 0.537 | 0.5477 |
| 3 | 0.8415 | 0.8996 | 0.9377 | 0.8614 | 0.898 | 0.5456 | 0.5556 |
| 4 | 0.8463 | 0.8605 | 0.9837 | 0.8384 | 0.9053 | 0.5121 | 0.5616 |
| Mean | 0.8445 | 0.8762 | 0.9529 | 0.8552 | 0.9012 | 0.5406 | 0.5615 |
| Std | 0.0121 | 0.0231 | 0.0203 | 0.0096 | 0.0084 | 0.0342 | 0.0351 |


Tuning hyperparameters for model xgboost with custom grid


| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|------|----------|-----|--------|-------|----|-------|-----|
| 0 | 0.8321 | 0.8489 | 0.9379 | 0.8516 | 0.8927 | 0.5113 | 0.5241 |
| 1 | 0.8686 | 0.9175 | 0.9739 | 0.8663 | 0.9169 | 0.608 | 0.6326 |
| 2 | 0.8537 | 0.8562 | 0.9508 | 0.8657 | 0.9062 | 0.5762 | 0.5896 |
| 3 | 0.8512 | 0.9037 | 0.9475 | 0.8653 | 0.9045 | 0.5706 | 0.5829 |
| 4 | 0.861 | 0.8746 | 0.9869 | 0.8507 | 0.9138 | 0.5652 | 0.6094 |
| Mean | 0.8533 | 0.8802 | 0.9594 | 0.8599 | 0.9068 | 0.5663 | 0.5877 |
| Std | 0.0122 | 0.0266 | 0.0181 | 0.0072 | 0.0084 | 0.0313 | 0.0362 |


Fitting 5 folds for each of 10 candidates, totalling 50 fits
Transformation Pipeline and Model Successfully Saved


| Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|-------|----------|-----|--------|-------|----|-------|-----|
| Extreme Gradient Boosting | 0.8705 | 0.9148 | 0.9561 | 0.8805 | 0.9168 | 0.6277 | 0.6386 |


Outer fold 3/5
Configuring PyCaret for outer fold 3
Creating model xgboost for outer fold 3


| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|------|----------|-----|--------|-------|----|-------|-----|
| 0 | 0.8491 | 0.8723 | 0.9608 | 0.8547 | 0.9046 | 0.55 | 0.5722 |
| 1 | 0.8394 | 0.8769 | 0.9444 | 0.855 | 0.8975 | 0.5309 | 0.5452 |
| 2 | 0.8341 | 0.8788 | 0.941 | 0.8516 | 0.8941 | 0.5164 | 0.5303 |
| 3 | 0.8366 | 0.8694 | 0.9443 | 0.8521 | 0.8958 | 0.5218 | 0.5369 |
| 4 | 0.8341 | 0.8473 | 0.9346 | 0.8563 | 0.8938 | 0.5192 | 0.5297 |
| Mean | 0.8387 | 0.8689 | 0.945 | 0.8539 | 0.8972 | 0.5277 | 0.5429 |
| Std | 0.0056 | 0.0113 | 0.0086 | 0.0018 | 0.004 | 0.0122 | 0.0157 |


Tuning hyperparameters for model xgboost with custom grid


| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|------|----------|-----|--------|-------|----|-------|-----|
| 0 | 0.8589 | 0.883 | 0.9608 | 0.8647 | 0.9102 | 0.5849 | 0.603 |
| 1 | 0.8297 | 0.8682 | 0.9575 | 0.8371 | 0.8933 | 0.4808 | 0.5087 |
| 2 | 0.8463 | 0.8956 | 0.9508 | 0.858 | 0.902 | 0.5504 | 0.5663 |
| 3 | 0.8415 | 0.8673 | 0.9574 | 0.8488 | 0.8998 | 0.5262 | 0.5489 |
| 4 | 0.8341 | 0.8647 | 0.9346 | 0.8563 | 0.8938 | 0.5192 | 0.5297 |
| Mean | 0.8421 | 0.8758 | 0.9522 | 0.853 | 0.8998 | 0.5323 | 0.5513 |
| Std | 0.0102 | 0.0118 | 0.0094 | 0.0094 | 0.0062 | 0.0345 | 0.0322 |


Fitting 5 folds for each of 10 candidates, totalling 50 fits
Transformation Pipeline and Model Successfully Saved


| Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|-------|----------|-----|--------|-------|----|-------|-----|
| Extreme Gradient Boosting | 0.8799 | 0.9038 | 0.9644 | 0.8848 | 0.9229 | 0.6531 | 0.6657 |


Outer fold 4/5
Configuring PyCaret for outer fold 4
Creating model xgboost for outer fold 4


| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|------|----------|-----|--------|-------|----|-------|-----|
| 0 | 0.837 | 0.8827 | 0.9379 | 0.8567 | 0.8955 | 0.5287 | 0.5401 |
| 1 | 0.8394 | 0.8575 | 0.9608 | 0.8448 | 0.8991 | 0.514 | 0.5405 |
| 2 | 0.8537 | 0.881 | 0.9443 | 0.8701 | 0.9057 | 0.582 | 0.5918 |
| 3 | 0.8415 | 0.8587 | 0.9216 | 0.8731 | 0.8967 | 0.5574 | 0.5612 |
| 4 | 0.8366 | 0.8807 | 0.9379 | 0.8567 | 0.8955 | 0.5247 | 0.5361 |
| Mean | 0.8416 | 0.8721 | 0.9405 | 0.8603 | 0.8985 | 0.5414 | 0.5539 |
| Std | 0.0063 | 0.0115 | 0.0126 | 0.0102 | 0.0038 | 0.0249 | 0.0208 |


Tuning hyperparameters for model xgboost with custom grid


| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|------|----------|-----|--------|-------|----|-------|-----|
| 0 | 0.8589 | 0.8888 | 0.951 | 0.8713 | 0.9094 | 0.5934 | 0.6052 |
| 1 | 0.8418 | 0.8678 | 0.9641 | 0.8453 | 0.9008 | 0.5197 | 0.5481 |
| 2 | 0.8439 | 0.8686 | 0.9443 | 0.8597 | 0.9 | 0.548 | 0.5607 |
| 3 | 0.861 | 0.8616 | 0.951 | 0.8739 | 0.9108 | 0.5984 | 0.6095 |
| 4 | 0.8244 | 0.9004 | 0.9444 | 0.8401 | 0.8892 | 0.4726 | 0.492 |
| Mean | 0.846 | 0.8775 | 0.9509 | 0.858 | 0.902 | 0.5464 | 0.5631 |
| Std | 0.0133 | 0.0147 | 0.0072 | 0.0135 | 0.0078 | 0.0471 | 0.0429 |


Fitting 5 folds for each of 10 candidates, totalling 50 fits
Transformation Pipeline and Model Successfully Saved


| Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|-------|----------|-----|--------|-------|----|-------|-----|
| Extreme Gradient Boosting | 0.8409 | 0.8876 | 0.9434 | 0.8571 | 0.8982 | 0.5377 | 0.5509 |


Outer fold 5/5
Configuring PyCaret for outer fold 5
Creating model xgboost for outer fold 5


| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|------|----------|-----|--------|-------|----|-------|-----|
| 0 | 0.854 | 0.886 | 0.9444 | 0.8705 | 0.906 | 0.5823 | 0.5921 |
| 1 | 0.8516 | 0.8758 | 0.9346 | 0.8746 | 0.9036 | 0.5824 | 0.5885 |
| 2 | 0.8561 | 0.8756 | 0.9377 | 0.8773 | 0.9065 | 0.5958 | 0.6021 |
| 3 | 0.8537 | 0.8907 | 0.9314 | 0.8796 | 0.9048 | 0.5901 | 0.5946 |
| 4 | 0.839 | 0.8829 | 0.9346 | 0.8614 | 0.8966 | 0.5366 | 0.5458 |
| Mean | 0.8509 | 0.8822 | 0.9366 | 0.8727 | 0.9035 | 0.5774 | 0.5846 |
| Std | 0.0061 | 0.0058 | 0.0044 | 0.0064 | 0.0036 | 0.021 | 0.0199 |


Tuning hyperparameters for model xgboost with custom grid


| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|------|----------|-----|--------|-------|----|-------|-----|
| 0 | 0.8735 | 0.896 | 0.9641 | 0.878 | 0.919 | 0.633 | 0.6477 |
| 1 | 0.8662 | 0.8953 | 0.951 | 0.8792 | 0.9137 | 0.6184 | 0.6279 |
| 2 | 0.8439 | 0.8652 | 0.9311 | 0.8685 | 0.8987 | 0.5601 | 0.5666 |
| 3 | 0.8634 | 0.8911 | 0.9444 | 0.8811 | 0.9117 | 0.6122 | 0.6194 |
| 4 | 0.8415 | 0.8925 | 0.9346 | 0.864 | 0.898 | 0.5452 | 0.5537 |
| Mean | 0.8577 | 0.888 | 0.9451 | 0.8742 | 0.9082 | 0.5938 | 0.6031 |
| Std | 0.0127 | 0.0116 | 0.0118 | 0.0067 | 0.0084 | 0.0346 | 0.0364 |


Fitting 5 folds for each of 10 candidates, totalling 50 fits
Transformation Pipeline and Model Successfully Saved


| Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|-------|----------|-----|--------|-------|----|-------|-----|
| Extreme Gradient Boosting | 0.844 | 0.8545 | 0.9518 | 0.855 | 0.9008 | 0.5407 | 0.5581 |


Final metrics table:

| Metric | Mean | Std Dev |
|--------|------|---------|
| Accuracy | 0.85646 | 0.015014 |
| AUC | 0.88250 | 0.019545 |
| Recall | 0.95360 | 0.010194 |
| Prec. | 0.86692 | 0.009822 |
| F1 | 0.90820 | 0.009547 |
| Kappa | 0.58264 | 0.044186 |
| MCC | 0.59686 | 0.044837 |
