lcapacitor/PREMOD

Digitally Enriching a Screening Population for Pancreatic Cancer Using Routine Blood-based Measures and Clinical Histories

This repository includes the implementation and experiments of the PRE-diagnostic pancreatic cancer risk MODel (PREMOD).

Architecture diagram

Authors

Chris Varghese (1,2,*), Leo Y. Li-Han (1,*), Richa Bisht (1), Ellen Larson (1), Frank Lee (1), Ryan M. Carr (1), Tanios S. Bekaii-Saab (3), Shounak Majumder (4), John D. Halamka (5), Mark Truty (1), Ajit H. Goenka (6), Hojjat Salehinejad (7,8), Cornelius A. Thiels (1)

* Co-First Authorship

  1. Department of Surgery, Mayo Clinic, Rochester, MN, USA
  2. Department of Surgery, University of Auckland, Auckland, NZ
  3. Department of Hematology and Oncology, Mayo Clinic, Phoenix, AZ, USA
  4. Division of Gastroenterology and Hepatology, Mayo Clinic, Rochester, MN, USA
  5. Mayo Clinic Platform, Mayo Clinic, Rochester, MN, USA
  6. Department of Radiology, Mayo Clinic, Rochester, MN, USA
  7. Division of Health Care Delivery Research, Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, USA
  8. Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, USA

1. Summary

Earlier detection of pancreatic cancer is key to enabling wider access to curative treatment and reducing cancer deaths; however, screening is not presently viable. Latent indicators of pathology are evident in an individual's disease and blood test trajectories and may predict the development of pancreatic cancer. Longitudinal sequences of coded diagnoses and blood test values accrued by patients throughout their clinical interactions were used to train a custom Transformer-based neural network with a multi-head attention mechanism to predict risk of pancreatic cancer with a multi-year lead time and to risk-stratify populations for targeted screening. The cohort comprised 6,017 adults with pancreatic cancer and 177,081 controls (overall median age 75, 45% female) with a median of 12 years (interquartile range 6.9 to 16.2) of medical history prior to pancreatic cancer diagnosis. External validation via leave-one-site-out, out-of-sample testing, predicting pancreatic cancer 1, 2, and 3 years prior to diagnosis, demonstrated mean areas under the receiver operating characteristic curve of 0.837 (95% confidence interval 0.827 to 0.848), 0.797 (95% confidence interval 0.782 to 0.813), and 0.760 (95% confidence interval 0.745 to 0.776), respectively. Estimated pancreatic cancer risks were well calibrated (calibration plot slope 1.08, intercept -0.077; Brier score 0.025), and a Bayesian population pancreatic cancer prevalence update allows estimated cancer risk outputs to be transportable across settings. At testing, a screening threshold of >3.3% risk of pancreatic cancer within 1 year offered a diagnostic odds ratio of 18.2. Our work therefore lays the foundation for a first population-level digital enrichment tool to widen access to curative-intent management of pancreatic cancer.
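The summary mentions a Bayesian prevalence update and a diagnostic odds ratio without giving formulas. The sketch below shows one common form of each, as an illustration only: a prior-shift correction that rescales the predicted odds by the ratio of target to training prior odds, and the standard DOR = (TP x TN) / (FP x FN). Whether PREMOD uses exactly this update is an assumption, and the function names are hypothetical (they are not taken from this repository).

```python
def adjust_for_prevalence(p: float, train_prev: float, target_prev: float) -> float:
    """Re-express a predicted risk p under a different base rate.

    Assumed form (prior-shift correction): multiply the predicted odds by the
    ratio of target prior odds to training prior odds.
    """
    for name, value in (("p", p), ("train_prev", train_prev), ("target_prev", target_prev)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be strictly between 0 and 1")
    odds = p / (1.0 - p)
    prior_ratio = (target_prev / (1.0 - target_prev)) / (train_prev / (1.0 - train_prev))
    new_odds = odds * prior_ratio
    return new_odds / (1.0 + new_odds)


def diagnostic_odds_ratio(tp: int, fp: int, fn: int, tn: int) -> float:
    """Diagnostic odds ratio from a 2x2 confusion table: (TP * TN) / (FP * FN)."""
    if fp == 0 or fn == 0:
        raise ValueError("DOR is undefined when FP or FN is zero")
    return (tp * tn) / (fp * fn)


if __name__ == "__main__":
    # Case fraction of the study cohort (6,017 cases, 177,081 controls).
    cohort_prev = 6017 / (6017 + 177081)
    # 0.005 is a made-up target prevalence, used only to show the call.
    print(adjust_for_prevalence(0.10, cohort_prev, 0.005))
```

A prediction equal to the training prevalence maps exactly onto the target prevalence, which is a quick sanity check for any update of this kind.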


2. Repository Overview

├── mock_data/                       
│   ├── mock_dx_data.csv        # Synthesized dataframe for diagnosis codes matching the schema.
│   └── mock_lab_data.csv       # Synthesized dataframe for blood tests matching the schema.
├── scripts/
│   ├── __init__.py            
│   ├── data_process.py         # Data processing functions: Site-wise patient split, data encoding, etc. 
│   ├── model_train.py          # Model training pipeline.
│   ├── model_trainer.py        # Model trainer class.
│   ├── model_eval.py           # Model evaluation pipeline.
│   ├── model_explain.py        # Model interpretability pipeline (feature contribution plots).
│   ├── model_risk_predict.py   # Continuous risk prediction (risk curves).
│   ├── models.py               # Mock implementation of the Transformer model (due to patent-related restrictions).
│   ├── loss_function.py        # Implementation of loss functions in model training.
│   └── utils.py                # Utility functions: dataset class, metric calculation, logging, visualization functions, etc.
├── main.py                     # Main script of the project.
├── requirements.txt            # Python pip dependencies.
├── LICENSE                     # Apache-2.0 license.
└── README.md                   # This documentation file.

3. Prerequisites & Environment Setup

Python Version: 3.10.20

# Create a virtual environment using Python 3.10.20
python3.10 -m venv pc_env
source pc_env/bin/activate

# Install required dependencies
pip install --upgrade pip
pip install -r requirements.txt

4. Data Preparation

Due to patient privacy restrictions, the raw clinical cohort cannot be shared publicly. Instead, we provide the mock data files mock_data/mock_dx_data.csv and mock_data/mock_lab_data.csv, which follow the data schema below and can be used to verify functionality.

  • The mock data files contain 20 years of diagnosis codes and blood test histories prior to the cancer diagnoses for 1000 subjects.
  • The cancer/target diagnosis date TARGET_DX_DATE for each subject is randomly chosen from the range 2025-01-01 to 2026-05-01; DIAGNOSIS_DATE and LAB_DATE are sampled from the 20 years preceding the corresponding TARGET_DX_DATE.
  • The diagnosis codes DIAGNOSIS_CODE are randomly generated in the formats 001 to 999 and A00 to Z99 to emulate ICD-9 and ICD-10 codes, respectively.
  • The blood test names LAB_ITEM_NAME are chosen from Lab01 to Lab31.
  • The total numbers of unique diagnosis codes and blood test items are 2,696 and 31, respectively, consistent with our study dataset.
  • The SITE_CODE is chosen from {0,1,2,3}, and the LABEL is selected from {0,1} following a positive-to-negative ratio of 1:10.
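The generation rules above can be sketched with the standard library alone. This is an illustration of the stated rules, not the script that produced the shipped mock files: the function names, the 50/50 mix of ICD-9-like and ICD-10-like codes, and the 365-day year are assumptions.

```python
import random
import string
from datetime import date, timedelta

TARGET_START = date(2025, 1, 1)
TARGET_END = date(2026, 5, 1)
HISTORY_DAYS = 20 * 365  # approximate 20-year look-back (ignores leap days)
SITES = (0, 1, 2, 3)


def random_diagnosis_code(rng: random.Random) -> str:
    """Return an ICD-9-like code (001 to 999) or an ICD-10-like code (A00 to Z99)."""
    if rng.random() < 0.5:  # assumed mix; the real proportion is not stated
        return f"{rng.randint(1, 999):03d}"
    return f"{rng.choice(string.ascii_uppercase)}{rng.randint(0, 99):02d}"


def make_mock_dx_rows(patient_id: int, rng: random.Random, n_events: int = 5) -> list:
    """Build diagnosis rows for one mock patient, using the schema column names."""
    span_days = (TARGET_END - TARGET_START).days
    target = TARGET_START + timedelta(days=rng.randint(0, span_days))
    site = rng.choice(SITES)
    label = 1 if rng.random() < 1 / 11 else 0  # 1:10 positive-to-negative ratio
    rows = []
    for _ in range(n_events):
        rows.append({
            "PATIENT_ID": patient_id,
            "SITE_CODE": site,
            "DIAGNOSIS_DATE": target - timedelta(days=rng.randint(1, HISTORY_DAYS)),
            "DIAGNOSIS_CODE": random_diagnosis_code(rng),
            "TARGET_DX_DATE": target,
            "LABEL": label,
        })
    return rows


if __name__ == "__main__":
    for row in make_mock_dx_rows(1000001, random.Random(0), n_events=3):
        print(row)
```

Passing an explicit random.Random instance keeps the output reproducible for a given seed.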

Running python scripts/data_process.py creates a mock_data_processed folder containing the encoding matrices generated for the diagnosis codes and lab tests. Each patient's matrices are saved as a tuple (Dx, Lab, Label) in a file named pid.joblib. These tuples are the input to the Dataset and the model.
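The layout of the Dx and Lab matrices is not described in this README. As a rough mental model only, the sketch below bins one patient's diagnosis events into yearly rows counted back from TARGET_DX_DATE, with one column per code in a vocabulary. The binning scheme, the count encoding, and the name encode_dx_events are all assumptions and may differ from what scripts/data_process.py actually does.

```python
from datetime import date


def encode_dx_events(events, vocab, target_date, n_years=20):
    """Return an n_years x len(vocab) count matrix as nested lists.

    Row 0 covers the 365 days immediately before target_date, row 1 the
    365 days before that, and so on. Events on or after target_date, events
    older than n_years, and codes outside the vocabulary are ignored.
    """
    column = {code: j for j, code in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in range(n_years)]
    for event_date, code in events:
        days_before = (target_date - event_date).days
        if days_before <= 0 or code not in column:
            continue
        year_bin = (days_before - 1) // 365
        if year_bin < n_years:
            matrix[year_bin][column[code]] += 1
    return matrix


if __name__ == "__main__":
    events = [(date(2025, 6, 1), "K86"), (date(2020, 6, 1), "250")]
    dx = encode_dx_events(events, vocab=["250", "K86"], target_date=date(2026, 1, 1))
    print(len(dx), len(dx[0]))
```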

Data schema:

a. mock_data/mock_dx_data.csv

| Column Name | Data Type | Description | Example Value |
| --- | --- | --- | --- |
| PATIENT_ID | int64 | Unique patient identifier | 1000001 |
| SITE_CODE | int64 | Clinical site location indicator (for LOSO evaluation) | 1 |
| DIAGNOSIS_DATE | datetime64 | Timestamp for the current event | 2016-01-01 |
| DIAGNOSIS_CODE | string | Structured ICD diagnostic code (ICD-9 or ICD-10) | "K86" |
| TARGET_DX_DATE | datetime64 | Timestamp for the cancer diagnosis | 2026-01-01 |
| LABEL | int64 | Binary outcome ground truth (1: Pancreatic Cancer, 0: Control) | 1 |

b. mock_data/mock_lab_data.csv

| Column Name | Data Type | Description | Example Value |
| --- | --- | --- | --- |
| PATIENT_ID | int64 | Unique patient identifier | 1000001 |
| LAB_DATE | datetime64 | Timestamp for the current event | 2016-01-01 |
| LAB_ITEM_NAME | string | Name of the current lab test | "Platelet Count" |
| RESULT_NUM | float64 | Numerical measurement of the current lab test item | 99.9 |
| TARGET_DX_DATE | datetime64 | Timestamp for the cancer diagnosis | 2026-01-01 |
| LABEL | int64 | Binary outcome ground truth (1: Pancreatic Cancer, 0: Control) | 1 |
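Before running the pipeline on your own data, it can help to check that a CSV matches the two schemas. The sketch below is one way to do that with the standard library. It assumes dates are stored as ISO strings (YYYY-MM-DD), as in the example values, and the names DX_SCHEMA, LAB_SCHEMA, and read_typed_rows are hypothetical helpers rather than part of this repository.

```python
import csv
import io
from datetime import date

# Column name -> converter, mirroring the two schema tables.
DX_SCHEMA = {
    "PATIENT_ID": int,
    "SITE_CODE": int,
    "DIAGNOSIS_DATE": date.fromisoformat,
    "DIAGNOSIS_CODE": str,
    "TARGET_DX_DATE": date.fromisoformat,
    "LABEL": int,
}
LAB_SCHEMA = {
    "PATIENT_ID": int,
    "LAB_DATE": date.fromisoformat,
    "LAB_ITEM_NAME": str,
    "RESULT_NUM": float,
    "TARGET_DX_DATE": date.fromisoformat,
    "LABEL": int,
}


def read_typed_rows(handle, schema):
    """Parse CSV rows into typed dicts; raise ValueError on schema violations."""
    reader = csv.DictReader(handle)
    missing = set(schema) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for raw in reader:
        row = {name: convert(raw[name]) for name, convert in schema.items()}
        if row["LABEL"] not in (0, 1):
            raise ValueError(f"LABEL must be 0 or 1, got {row['LABEL']}")
        rows.append(row)
    return rows


if __name__ == "__main__":
    sample = io.StringIO(
        "PATIENT_ID,LAB_DATE,LAB_ITEM_NAME,RESULT_NUM,TARGET_DX_DATE,LABEL\n"
        "1000001,2016-01-01,Platelet Count,99.9,2026-01-01,1\n"
    )
    print(read_typed_rows(sample, LAB_SCHEMA))
```

To check a file on disk, pass an open file handle (opened with newline="") instead of the in-memory sample.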

5. Training and Evaluation Pipeline

  • Train and test models using leave-one-site-out CV with 12-, 24-, and 36-month prediction lead times. Two folders, results and figures, will be created to contain trained models and plots, respectively.
python main.py --task site-loo-intervals
  • Train and test models using leave-one-site-out CV with a 12-month (default) prediction lead time.
python main.py --task site-loo
  • Train the model on the data from sites 0, 2, and 3, then test on site 1. The performance plots and calibration curves will be saved to figures/MODEL/CONFIG/PATH/eval.
python main.py --task site-one --train_sites 0-2-3 --test_sites 1
  • Test trained models on site 1, without post-hoc recalibration (default: test_calibration=1).
python main.py --task site-loo-eval --test_calibration 0
  • Model interpretability: create feature contribution plots and save to figures/MODEL/CONFIG/PATH/SHAP_analysis.
python main.py --task site-one-explain
  • Simulation of the continuous risk prediction: create the cancer risk curves and save to figures/MODEL/CONFIG/PATH/risk_curve.
python main.py --task site-one-risk
  • All available arguments are listed in main.py and can also be printed with:
python main.py -h
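The leave-one-site-out scheme and the prediction lead time used by the tasks above can be pictured as follows. This is a conceptual sketch, not the logic in main.py: the helper names are hypothetical, and implementing the lead time by dropping events dated within the lead window before TARGET_DX_DATE is an assumption.

```python
from datetime import date, timedelta

SITES = [0, 1, 2, 3]


def leave_one_site_out(sites):
    """Yield (train_sites, test_site) pairs, holding out each site once."""
    for held_out in sites:
        yield [s for s in sites if s != held_out], held_out


def format_sites(sites):
    """Join site codes in the dash-separated form used by --train_sites (e.g. 0-2-3)."""
    return "-".join(str(s) for s in sites)


def apply_lead_time(events, target_date, lead_months=12):
    """Keep (event_date, payload) pairs dated at least lead_months before target_date.

    Months are approximated by their average length (365.25 / 12 days).
    """
    cutoff = target_date - timedelta(days=round(lead_months * 365.25 / 12))
    return [(d, payload) for d, payload in events if d <= cutoff]


if __name__ == "__main__":
    # Print the site-one command equivalent to each leave-one-site-out fold.
    for train_sites, test_site in leave_one_site_out(SITES):
        print(
            "python main.py --task site-one "
            f"--train_sites {format_sites(train_sites)} --test_sites {test_site}"
        )
```

Holding out a whole site, rather than a random subset of patients, is what makes each fold an out-of-sample test on an unseen clinical location.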

6. Trained Checkpoints

The trained model is not publicly available. It can be shared for external validation under appropriate data use and privacy agreements; requests should be directed to the corresponding author.


7. Citation / BibTeX

@misc{varghese2026digitally,
Author = {Chris Varghese and Leo Y. Li-Han and Richa Bisht and Ellen Larson and Frank Lee and Ryan M. Carr and Tanios S. Bekaii-Saab and Shounak Majumder and John D. Halamka and Mark Truty and Ajit H. Goenka and Hojjat Salehinejad and Cornelius A. Thiels},
Title = {Digitally enriching a screening population for pancreatic cancer using routine blood-based measures and clinical histories},
Year = {2026},
Eprint = {arXiv:2605.30275},
}
