# Pathology prediction (pneumothorax)

@References : Soenksen, L.R., Ma, Y., Zeng, C. et al. Integrated multimodal artificial intelligence framework for healthcare applications. npj Digit. Med. 5, 149 (2022). https://doi.org/10.1038/s41746-022-00689-4

In this notebook, the task is to predict the pneumothorax pathology from the CSV embeddings file.

## Introduction


Radiology notes were processed to determine if each of the pathologies was explicitly confirmed as present (value = 1), explicitly confirmed as absent (value = 0), inconclusive in the study (value = −1), or not explored (no value).

Selected samples: only samples labelled 0 or 1 are kept; all others are removed from the training and testing data.

Excluded variables: the unstructured radiology notes component (E_rad) is excluded from the allowable input to avoid potential overfitting or misrepresentation of real predictive value.

The model is based on binary classification for each target chest pathology.

The final sample size for the pneumothorax pathology is N = 17,159.
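
The label selection rule described above can be sketched with pandas. The toy frame below is an illustrative assumption, not the actual HAIM embeddings file; it only demonstrates keeping the 0 and 1 labels and dropping inconclusive (-1) and unexplored (missing) ones.

```python
import numpy as np
import pandas as pd

# Toy labels (hypothetical data): 1 = present, 0 = absent,
# -1 = inconclusive, NaN = not explored
toy = pd.DataFrame({
    "haim_id": [1, 2, 3, 4, 5, 6],
    "Pneumothorax": [1.0, 0.0, -1.0, np.nan, 1.0, 0.0],
})

# Keep only explicitly confirmed labels; both -1 and NaN fail the isin test
selected = toy[toy["Pneumothorax"].isin([0, 1])].copy()
selected["Pneumothorax"] = selected["Pneumothorax"].astype(int)

print(selected)
```

In this notebook the equivalent filtering is presumably handled when the dataset is built; the sketch is only meant to make the rule concrete.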


#### Imports

In [1]:
import os
# Move one directory up so that the `src` package can be imported
os.chdir('../')

from pandas import read_csv

from src.data import constants
from src.data.dataset import HAIMDataset
from src.evaluation.pycaret_evaluator import PyCaretEvaluator
from src.utils.metric_scores import *

#### Read data from local source



In [2]:
df = read_csv(constants.FILE_DF, nrows=constants.N_DATA)

#### Create a custom dataset for the HAIM experiment


Build the target column for the task at hand and set the dataset specifics: use ``haim_id`` as the ``global_id`` and use all sources except ``radiology notes``.

In [3]:
dataset = HAIMDataset(df,  
                      constants.CHEST_PREDICTORS, 
                      constants.ALL_MODALITIES, 
                      constants.PNEUMOTHORAX, 
                      constants.IMG_ID, 
                      constants.GLOBAL_ID)

#### Set hyper-parameters

In [4]:
# Define the grid of hyper-parameters for the tuning
grid_hps = {'max_depth': [5, 6, 7, 8],
            'n_estimators': [200, 300],
            'learning_rate': [0.3, 0.1, 0.05],
            }
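
To get a feel for the cost of this grid, the short sketch below enumerates every candidate configuration. It assumes an exhaustive grid search and the five inner and five outer folds configured later in this notebook; the fit count is a rough estimate, not a measurement.

```python
from itertools import product

grid_hps = {
    "max_depth": [5, 6, 7, 8],
    "n_estimators": [200, 300],
    "learning_rate": [0.3, 0.1, 0.05],
}

# Cartesian product of all value lists: one dict per candidate configuration
candidates = [dict(zip(grid_hps, values)) for values in product(*grid_hps.values())]

n_candidates = len(candidates)  # 4 * 2 * 3
inner_folds = 5  # assumption: inner cross-validation folds
outer_folds = 5  # assumption: outer data splits

# Fits spent on the grid search alone, ignoring any final refit per outer split
n_fits = n_candidates * inner_folds * outer_folds
print(n_candidates, n_fits)
```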

### Model training and predictions using an XGBClassifier model with GridSearchCV and Hyperparameters optimization


The goal of this section of the notebook is to compute the following metrics:

``ACCURACY_SCORE, BALANCED_ACCURACY_SCORE, SENSITIVITY, SPECIFICITY, AUC, BRIER SCORE, BINARY CROSS-ENTROPY``
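
As a reminder of what these metrics measure, here is a minimal pure-Python sketch on toy predictions. The helper names `auc_score` and `binary_metrics` are hypothetical and are not the functions in `src.utils.metric_scores`; the 0.5 decision threshold is also an assumption.

```python
import math

def auc_score(y_true, y_prob):
    """AUC as the share of (positive, negative) pairs ranked correctly (ties count 0.5)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def binary_metrics(y_true, y_prob, threshold=0.5):
    """Threshold-based and probabilistic metrics; assumes both classes are present."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = len(y_true)
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    eps = 1e-15  # clip probabilities so that log() stays finite
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]
    return {
        "accuracy": (tp + tn) / n,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "auc": auc_score(y_true, y_prob),
        "brier_score": sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / n,
        "binary_cross_entropy": -sum(
            t * math.log(p) + (1 - t) * math.log(1 - p)
            for t, p in zip(y_true, clipped)
        ) / n,
    }

# Toy example (hypothetical labels and predicted probabilities)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.3, 0.4, 0.2, 0.1, 0.7, 0.05]
metrics = binary_metrics(y_true, y_prob)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```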


The hyperparameter combinations of individual XGBoost models were selected within each training loop using a ``fivefold cross-validated grid search`` on the training set (80%). This XGBoost ``tuning process`` selected the ``maximum depth of the trees (5–8)``, the number of ``estimators (200 or 300)``, and the ``learning rate (0.05, 0.1, 0.3)`` according to the parameter value combination leading to the highest observed AUROC within the training loop.


All XGBoost models were trained ``five times with five different data splits`` to repeat the experiments and compute average metrics.


Refer to page 8 of the study: https://doi.org/10.1038/s41746-022-00689-4
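
The splitting scheme described above (five outer splits, each with a fivefold inner search) can be sketched with the standard library alone. This is an illustration of stratified nested splitting under assumed toy labels, not the implementation used by `PyCaretEvaluator`; the helper `stratified_kfold` is hypothetical.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, n_splits=5, seed=42):
    """Yield (train_idx, test_idx) pairs that roughly preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx

# Toy labels (hypothetical): 20 positives and 80 negatives
labels = [1] * 20 + [0] * 80

outer_sizes = []
n_inner_splits = 0
for outer_train, outer_test in stratified_kfold(labels, n_splits=5):
    # Outer loop: 80% for tuning and training, 20% held out for the reported metrics
    outer_sizes.append((len(outer_train), len(outer_test)))
    inner_labels = [labels[i] for i in outer_train]
    for inner_train, inner_val in stratified_kfold(inner_labels, n_splits=5):
        # Inner loop: every grid candidate would be fitted on inner_train and
        # scored on inner_val; these positions index into outer_train
        n_inner_splits += 1

print(outer_sizes, n_inner_splits)
```

The candidate with the best mean inner-fold score would then be refitted on the full outer training set and evaluated once on the held-out outer test set.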

In [5]:
# Initialize the PyCaret Evaluator
evaluator = PyCaretEvaluator(
    dataset=dataset,
    target="Pneumothorax",
    experiment_name="CP_Pneumothorax",
    filepath="./results/pneumothorax",
)

# Model training and results evaluation
evaluator.run_experiment(
    train_size=0.8,
    fold=5,
    fold_strategy='stratifiedkfold',
    outer_fold=5,
    outer_strategy='stratifiedkfold',
    session_id=42,
    model='xgboost',
    optimize='AUC',
    custom_grid=grid_hps
)

2024-10-21 16:37:00,227	INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265


(run_fold pid=352119) Outer fold 1
(run_fold pid=352119) Train indices: [    0     1     2 ... 17156 17157 17158]
(run_fold pid=352119) Test indices: [   13    14    15 ... 17135 17147 17155]
(run_fold pid=352119) Configuring PyCaret for outer fold 1
(run_fold pid=352122) Outer fold 2
(run_fold pid=352122) Train indices: [    0     1     2 ... 17155 17157 17158]
(run_fold pid=352122) Test indices: [    3     6     8 ... 17152 17154 17156]




(run_fold pid=352122)       Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
(run_fold pid=352122) Fold
(run_fold pid=352122) 0       0.9126  0.8939  0.5759  0.8800  0.6962  0.6477  0.6676
(run_fold pid=352122) 1       0.9144  0.9043  0.5733  0.8975  0.6997  0.6526  0.6749
(run_fold pid=352122) 2       0.9030  0.8527  0.5249  0.8621  0.6525  0.6000  0.6250
(run_fold pid=352122) 3       0.9144  0.8781  0.5591  0.9142  0.6938  0.6474  0.6739
(run_fold pid=352122) 4       0.9071  0.8892  0.4961  0.9403  0.6495  0.6018  0.6427
(run_fold pid=352122) Mean    0.9103  0.8836  0.5459  0.8988  0.6783  0.6299  0.6568
(run_fold pid=352122) Std     0.0045  0.0176  0.0308  0.0271  0.0224  0.0238  0.0197
(run_fold pid=352122) Configuring PyCaret for outer fold 2
(run_fold pid=352122) Tuning hyperparameters for model xgboost with custom gr

                                                          


(run_fold pid=352119) 0       0.9103  0.8928  0.5497  0.8936  0.6807  0.6320  0.6572
(run_fold pid=352119) 1       0.9098  0.8933  0.5550  0.8833  0.6817  0.6323  0.6555
(run_fold pid=352119) 2       0.9057  0.8889  0.5302  0.8783  0.6612  0.6103  0.6366
(run_fold pid=352119) 3       0.8998  0.8724  0.4961  0.8710  0.6321  0.5791  0.6099
(run_fold pid=352119) 4       0.9048  0.8937  0.5328  0.8675  0.6602  0.6085  0.6329
(run_fold pid=352119) Transformation Pipeline and Model Successfully Saved
(run_fold pid=352119)       Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
(run_fold pid=352119) Fold
(run_fold pid=352119) Mean    0.9061  0.8882  0.5328  0.8787  0.6632  0.6124  0.6384
(run_fold pid=352119) Std     0.0038  0.0081  0.0207  0.0093  0.0180  0.0195  0.0173
(run_fold pid=352119) Tuning hyperparameters for model xgboos

                                                          


(run_fold pid=352119) Configuring PyCaret for outer fold 3




(run_fold pid=352122) Transformation Pipeline and Model Successfully Saved
(run_fold pid=352122)                        Model  Accuracy     AUC  ...      F1   Kappa     MCC
(run_fold pid=352122) 0  Extreme Gradient Boosting    0.9181  0.9126  ...  0.7082  0.6637  0.6901
(run_fold pid=352122)
(run_fold pid=352122) [1 rows x 8 columns]
(run_fold pid=352122) Outer fold 4
(run_fold pid=352122) Train indices: [    0     1     2 ... 17156 17157 17158]
(run_fold pid=352122) Test indices: [    4     5     9 ... 17139 17141 17150]
(run_fold pid=352122) Configuring PyCaret for outer fold 4


                                                          


(run_fold pid=352119)       Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
(run_fold pid=352119) Fold
(run_fold pid=352119) 0       0.9030  0.8693  0.5157  0.8756  0.6491  0.5972  0.6254
(run_fold pid=352119) 1       0.8966  0.8647  0.4843  0.8605  0.6198  0.5653  0.5966
(run_fold pid=352119) 2       0.9030  0.8771  0.5144  0.8750  0.6479  0.5960  0.6244
(run_fold pid=352119) 3       0.9167  0.8959  0.5906  0.8929  0.7109  0.6646  0.6840
(run_fold pid=352119) 4       0.9062  0.8833  0.5118  0.9070  0.6544  0.6049  0.6381
(run_fold pid=352119) Mean    0.9051  0.8781  0.5234  0.8822  0.6564  0.6056  0.6337
(run_fold pid=352119) Std     0.0066  0.0110  0.0355  0.0161  0.0298  0.0324  0.0286
(run_fold pid=352119) Tuning hyperparameters for model xgboost with custom grid using grid search




(run_fold pid=352122) 0       0.8999  0.8737  0.5079  0.8584  0.6382  0.5844  0.6116
(run_fold pid=352122) 1       0.9021  0.8722  0.4974  0.8920  0.6387  0.5872  0.6208
(run_fold pid=352122) 2       0.9030  0.8825  0.5171  0.8717  0.6491  0.5970  0.6245
(run_fold pid=352122) 3       0.9103  0.8773  0.5643  0.8740  0.6858  0.6363  0.6570
(run_fold pid=352122) 4       0.9199  0.8796  0.5696  0.9476  0.7115  0.6683  0.6975
(run_fold pid=352122)       Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
(run_fold pid=352122) Fold
(run_fold pid=352122) Mean    0.9070  0.8771  0.5312  0.8887  0.6646  0.6147  0.6423
(run_fold pid=352122) Std     0.0073  0.0038  0.0299  0.0313  0.0292  0.0326  0.0316
(run_fold pid=352122) Tuning hyperparameters for model xgboost with custom grid using grid search


                                                          


(run_fold pid=352119) Transformation Pipeline and Model Successfully Saved
(run_fold pid=352119)                        Model  Accuracy     AUC  ...      F1   Kappa     MCC
(run_fold pid=352119) 0  Extreme Gradient Boosting    0.9222  0.9172  ...  0.7216  0.6794  0.7077
(run_fold pid=352119)
(run_fold pid=352119) [1 rows x 8 columns]
(run_fold pid=352119) Outer fold 5
(run_fold pid=352122) Transformation Pipeline and Model Successfully Saved
(run_fold pid=352119) Train indices: [    0     1     3 ... 17156 17157 17158]
(run_fold pid=352119) Test indices: [    2     7    12 ... 17146 17151 17153]
(run_fold pid=352122)




(run_fold pid=352119) Configuring PyCaret for outer fold 5
(run_fold pid=352122)                        Model  Accuracy     AUC  ...      F1   Kappa     MCC
(run_fold pid=352122) 0  Extreme Gradient Boosting    0.9301  0.9202  ...  0.7595  0.7204  0.7395
(run_fold pid=352122) [1 rows x 8 columns]


                                                          


(run_fold pid=352119)       Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC
(run_fold pid=352119) Fold
(run_fold pid=352119) 0       0.9085  0.8677  0.5419  0.8884  0.6732  0.6236  0.6494
(run_fold pid=352119) 1       0.9053  0.8679  0.5288  0.8783  0.6601  0.6090  0.6355
(run_fold pid=352119) 2       0.8966  0.8590  0.4672  0.8812  0.6106  0.5574  0.5948
(run_fold pid=352119) 3       0.9048  0.8603  0.5092  0.8981  0.6499  0.5997  0.6321
(run_fold pid=352119) 4       0.9121  0.8828  0.5591  0.8950  0.6882  0.6402  0.6642
(run_fold pid=352119) Mean    0.9055  0.8675  0.5212  0.8882  0.6564  0.6060  0.6352
(run_fold pid=352119) Std     0.0051  0.0085  0.0316  0.0077  0.0262  0.0279  0.0232
(run_fold pid=352119) Tuning hyperparameters for model xgboost with custom grid using grid search
(run_fold pid=352119) Transformation P

(raylet) Spilled 7659 MiB, 15 objects, write throughput 1327 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet) [2024-10-22 15:55:01,994 E 352020 352020] (raylet) node_manager.cc:3065: 1 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 852018e6044b47b3741b52a8d7dd68d1f083fb92e9c7b6b292aa1541, IP: 10.44.86.85) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 10.44.86.85`
(raylet)
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable

Final metrics table:
     Metric     Mean   Std Dev
0  Accuracy  0.91280  0.003134
1       AUC  0.89782  0.012412
2    Recall  0.55234  0.022815
3     Prec.  0.91136  0.011377
4        F1  0.68752  0.016419
5     Kappa  0.64042  0.017331
6       MCC  0.66772  0.013369
Best hyperparameters across all folds:
objective                  binary:logistic
base_score                             NaN
booster                             gbtree
callbacks                              NaN
colsample_bylevel                      NaN
colsample_bynode                       NaN
colsample_bytree                       NaN
device                                 cpu
early_stopping_rounds                  NaN
enable_categorical                   False
eval_metric                            NaN
feature_types                          NaN
gamma                                  NaN
grow_policy                            NaN
importance_type                        NaN
interaction_constraints                NaN
lear