GitHub - lcapacitor/PREMOD: Digitally enriching a screening population for pancreatic cancer using routine blood-based measures and clinical histories

Digitally Enriching a Screening Population for Pancreatic Cancer Using Routine Blood-based Measures and Clinical Histories

This repository includes the implementation and experiments of the PRE-diagnostic pancreatic cancer risk MODel (PREMOD).

Authors

Chris Varghese¹,²,*, Leo Y. Li-Han¹,*, Richa Bisht¹, Ellen Larson¹, Frank Lee¹, Ryan M. Carr¹, Tanios S. Bekaii-Saab³, Shounak Majumder⁴, John D. Halamka⁵, Mark Truty¹, Ajit H. Goenka⁶, Hojjat Salehinejad⁷,⁸, Cornelius A. Thiels¹

* Co-first authorship

¹ Department of Surgery, Mayo Clinic, Rochester, MN, USA
² Department of Surgery, University of Auckland, Auckland, NZ
³ Department of Hematology and Oncology, Mayo Clinic, Phoenix, AZ, USA
⁴ Division of Gastroenterology and Hepatology, Mayo Clinic, Rochester, MN, USA
⁵ Mayo Clinic Platform, Mayo Clinic, Rochester, MN, USA
⁶ Department of Radiology, Mayo Clinic, Rochester, MN, USA
⁷ Division of Health Care Delivery Research, Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, USA
⁸ Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, USA

1. Summary

Earlier detection of pancreatic cancer is key to enabling wider access to curative treatment and reducing cancer deaths; however, screening is presently not viable. Latent indicators of pathology are evident in an individual’s disease and blood test trajectories and may predict the development of pancreatic cancer. Longitudinal sequences of coded diagnoses and blood test values accrued by patients throughout their clinical interactions were used to train a custom Transformer-based neural network with a multi-head attention mechanism to predict the risk of pancreatic cancer with a multi-year lead time and to risk-stratify populations for targeted screening. The cohort comprised 6,017 adults with pancreatic cancer and 177,081 controls (overall median age 75 years, 45% female) with a median of 12 years (interquartile range 6.9–16.2) of medical history prior to pancreatic cancer diagnosis. External validation via leave-one-site-out, out-of-sample testing, predicting pancreatic cancer 1, 2, and 3 years prior to diagnosis, demonstrated mean areas under the receiver operating characteristic curve of 0.837 (95% confidence interval 0.827–0.848), 0.797 (95% confidence interval 0.782–0.813), and 0.760 (95% confidence interval 0.745–0.776), respectively. Estimated pancreatic cancer risks were well calibrated (calibration plot slope 1.08, intercept −0.077; Brier score 0.025), and a Bayesian population pancreatic cancer prevalence update allows estimated cancer risk outputs to be transportable across settings. At testing, a screening threshold of >3.3% risk of pancreatic cancer within 1 year offered a diagnostic odds ratio of 18.2. Our work therefore lays the foundation for a first population-level digital enrichment tool to widen access to curative-intent management of pancreatic cancer.
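
The prevalence update mentioned in the summary can be illustrated with the standard prior-shift correction from Bayes' rule. This is a minimal sketch and not necessarily the procedure used in PREMOD; the function name `adjust_risk_for_prevalence` and the prevalence values in the example are hypothetical.

```python
def adjust_risk_for_prevalence(risk: float, train_prev: float, target_prev: float) -> float:
    """Rescale a predicted probability from a training prevalence to a target prevalence.

    Standard prior-shift correction: the predicted odds are multiplied by the
    ratio of the target prior odds to the training prior odds.
    """
    if not (0.0 < train_prev < 1.0 and 0.0 < target_prev < 1.0):
        raise ValueError("prevalences must lie strictly between 0 and 1")
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be a probability in [0, 1]")
    # Reweight the positive and negative class by how much each prior changed.
    pos = risk * target_prev / train_prev
    neg = (1.0 - risk) * (1.0 - target_prev) / (1.0 - train_prev)
    return pos / (pos + neg)


# Hypothetical example: a model fitted where prevalence was 50% is applied
# to a population where prevalence is 10%.
adjusted = adjust_risk_for_prevalence(0.5, train_prev=0.5, target_prev=0.1)
```

When the target prevalence equals the training prevalence the risk is returned unchanged, and a lower target prevalence always lowers the adjusted risk.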

2. Repository Overview

├── mock_data/                       
│   ├── mock_dx_data.csv        # Synthesized dataframe for diagnosis codes matching the schema.
│   └── mock_lab_data.csv       # Synthesized dataframe for blood tests matching the schema.
├── scripts/
│   ├── __init__.py            
│   ├── data_process.py         # Data processing functions: Site-wise patient split, data encoding, etc. 
│   ├── model_train.py          # Model training pipeline.
│   ├── model_trainer.py        # Model trainer class.
│   ├── model_eval.py           # Model evaluation pipeline.
│   ├── model_explain.py        # Model interpretability pipeline (feature contribution plots).
│   ├── model_risk_predict.py   # Continuous risk prediction (risk curves).
│   ├── models.py               # Mock implementation of the Transformer model (due to patent-related restrictions).
│   ├── loss_function.py        # Implementation of loss functions in model training.
│   └── utils.py                # Utility functions: dataset class, metric calculation, logging, visualization functions, etc.
├── main.py                     # Main script of the project.
├── requirements.txt            # Python pip dependencies.
├── LICENSE                     # Apache-2.0 license.
└── README.md                   # This documentation file.

3. Prerequisites & Environment Setup

Python Version: 3.10.20

# Create a virtual environment using Python 3.10.20
python3.10 -m venv pc_env
source pc_env/bin/activate

# Install required dependencies
pip install --upgrade pip
pip install -r requirements.txt

4. Data Preparation

Due to patient privacy restrictions, the raw clinical cohort cannot be shared publicly. Instead, we provide two mock data files, mock_data/mock_dx_data.csv and mock_data/mock_lab_data.csv, which follow the data schema stated below and can be used to verify functionality.

The mock data files contain 20 years of diagnosis codes and blood test histories prior to the cancer diagnoses for 1000 subjects.
The cancer/target diagnosis date TARGET_DX_DATE for each subject is randomly chosen from the range 2025-01-01 to 2026-05-01; DIAGNOSIS_DATE and LAB_DATE are selected in the range 20 years before the corresponding TARGET_DX_DATE.
The diagnosis codes DIAGNOSIS_CODE are randomly generated in the formats 001 to 999 and A00 to Z99 to emulate ICD-9 and ICD-10 codes, respectively.
The blood test names LAB_ITEM_NAME are chosen from Lab01 to Lab31.
There are 2,696 unique diagnosis codes and 31 unique blood test items, consistent with our study dataset.
The SITE_CODE is chosen from {0,1,2,3}, and the LABEL is selected from {0,1} following a positive-to-negative ratio of 1:10.
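
The generation rules above can be sketched as follows for the diagnosis table. This is an illustrative sketch, not the script used to produce the shipped mock files; the function names, the even mix of ICD-9-like and ICD-10-like codes, and the use of 20 × 365 days for the 20-year window are assumptions.

```python
import random
from datetime import date, timedelta


def random_icd_like_code(rng: random.Random) -> str:
    """Return a code shaped like ICD-9 (001-999) or ICD-10 (A00-Z99)."""
    if rng.random() < 0.5:  # assumption: even mix of the two formats
        return f"{rng.randint(1, 999):03d}"
    return f"{chr(ord('A') + rng.randrange(26))}{rng.randint(0, 99):02d}"


def mock_dx_rows(patient_id: int, n_events: int, rng: random.Random) -> list[dict]:
    """Generate diagnosis rows for one synthetic patient."""
    first, last = date(2025, 1, 1), date(2026, 5, 1)
    target = first + timedelta(days=rng.randrange((last - first).days + 1))
    site = rng.randrange(4)             # SITE_CODE in {0, 1, 2, 3}
    label = int(rng.random() < 1 / 11)  # 1:10 positive-to-negative ratio
    rows = []
    for _ in range(n_events):
        # Event dates fall within the 20 years before the target date.
        event = target - timedelta(days=rng.randint(1, 20 * 365))
        rows.append({
            "PATIENT_ID": patient_id,
            "SITE_CODE": site,
            "DIAGNOSIS_DATE": event.isoformat(),
            "DIAGNOSIS_CODE": random_icd_like_code(rng),
            "TARGET_DX_DATE": target.isoformat(),
            "LABEL": label,
        })
    return rows


rows = mock_dx_rows(1000001, n_events=5, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the synthetic rows reproducible between runs.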

Running python scripts/data_process.py creates a mock_data_processed folder containing the encoding matrices generated for diagnosis codes and lab tests. Each patient is saved as a tuple (Dx, Lab, Label) in a file named pid.joblib. These tuples are the input to the Dataset and the model.

Data schema:

a. mock_data/mock_dx_data.csv

Column Name	Data Type	Description	Example Value
`PATIENT_ID`	`int64`	Unique patient identifier	`1000001`
`SITE_CODE`	`int64`	Clinical site location indicator (for LOSO evaluation)	`1`
`DIAGNOSIS_DATE`	`datetime64`	Timestamp for the current event	`2016-01-01`
`DIAGNOSIS_CODE`	`string`	Structured ICD diagnostic code (ICD-9 or ICD-10)	`"K86"`
`TARGET_DX_DATE`	`datetime64`	Timestamp for the cancer diagnosis	`2026-01-01`
`LABEL`	`int64`	Binary outcome ground truth (1: Pancreatic Cancer, 0: Control)	`1`

b. mock_data/mock_lab_data.csv

Column Name	Data Type	Description	Example Value
`PATIENT_ID`	`int64`	Unique patient identifier	`1000001`
`LAB_DATE`	`datetime64`	Timestamp for the current event	`2016-01-01`
`LAB_ITEM_NAME`	`string`	Name of current lab test	`"Platelet Count"`
`RESULT_NUM`	`float64`	Numerical measurements of the current lab test item	`99.9`
`TARGET_DX_DATE`	`datetime64`	Timestamp for the cancer diagnosis	`2026-01-01`
`LABEL`	`int64`	Binary outcome ground truth (1: Pancreatic Cancer, 0: Control)	`1`
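
Before processing your own data, it can help to check that it matches the two schemas above. The following is a minimal sketch using only the standard library; `check_rows` is a hypothetical helper that is not part of this repository, and it assumes dates are stored in ISO format (YYYY-MM-DD) as in the example values.

```python
import csv
import io
from datetime import date

DX_COLUMNS = ["PATIENT_ID", "SITE_CODE", "DIAGNOSIS_DATE",
              "DIAGNOSIS_CODE", "TARGET_DX_DATE", "LABEL"]
LAB_COLUMNS = ["PATIENT_ID", "LAB_DATE", "LAB_ITEM_NAME",
               "RESULT_NUM", "TARGET_DX_DATE", "LABEL"]


def check_rows(handle, expected_columns, date_column):
    """Validate CSV rows against the schema and return the number of rows read."""
    reader = csv.DictReader(handle)
    missing = set(expected_columns) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    count = 0
    for row in reader:
        int(row["PATIENT_ID"])                 # must parse as an integer
        if row["LABEL"] not in {"0", "1"}:
            raise ValueError(f"LABEL must be 0 or 1, got {row['LABEL']!r}")
        if "RESULT_NUM" in row:
            float(row["RESULT_NUM"])           # lab values must be numeric
        event = date.fromisoformat(row[date_column])
        target = date.fromisoformat(row["TARGET_DX_DATE"])
        if event > target:
            raise ValueError("event dated after TARGET_DX_DATE")
        count += 1
    return count


# In-memory samples built from the example values in the tables above.
dx_sample = io.StringIO(
    "PATIENT_ID,SITE_CODE,DIAGNOSIS_DATE,DIAGNOSIS_CODE,TARGET_DX_DATE,LABEL\n"
    "1000001,1,2016-01-01,K86,2026-01-01,1\n"
)
lab_sample = io.StringIO(
    "PATIENT_ID,LAB_DATE,LAB_ITEM_NAME,RESULT_NUM,TARGET_DX_DATE,LABEL\n"
    "1000001,2016-01-01,Platelet Count,99.9,2026-01-01,1\n"
)
n_dx = check_rows(dx_sample, DX_COLUMNS, "DIAGNOSIS_DATE")
n_lab = check_rows(lab_sample, LAB_COLUMNS, "LAB_DATE")
```

To check a file on disk, pass an open file handle instead of the `io.StringIO` samples.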

5. Training and Evaluation Pipeline

Train and test models using leave-one-site-out CV with 12-, 24-, and 36-month prediction lead times. Two folders, results and figures, will be created to contain trained models and plots, respectively.

python main.py --task site-loo-intervals
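
The idea behind this task can be sketched in plain Python: hold out each site once, and hide the most recent history so that the model predicts ahead of the diagnosis. This is an illustration only. How the repository applies the lead time is not documented here, so the sketch assumes that events inside the lead window before TARGET_DX_DATE are dropped; the function names and the 30.44-day month are also assumptions.

```python
from datetime import date, timedelta


def leave_one_site_out(sites):
    """Yield (train_sites, test_site) pairs, holding out each site once."""
    unique = sorted(set(sites))
    for held_out in unique:
        yield [s for s in unique if s != held_out], held_out


def apply_lead_time(events, target_date, lead_months):
    """Keep only events dated at least `lead_months` before the target date.

    Months are approximated as 30.44 days (an assumption of this sketch).
    """
    cutoff = target_date - timedelta(days=round(lead_months * 30.44))
    return [(day, item) for day, item in events if day <= cutoff]


folds = list(leave_one_site_out([0, 1, 2, 3]))

history = [(date(2024, 1, 1), "K86"), (date(2025, 6, 1), "250")]
visible_12 = apply_lead_time(history, date(2026, 1, 1), lead_months=12)
visible_36 = apply_lead_time(history, date(2026, 1, 1), lead_months=36)
```

With four sites this yields four folds, and a longer lead time leaves less of each patient's history visible to the model.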

Train and test models using leave-one-site-out CV with a 12-month (default) prediction lead time.

python main.py --task site-loo

Train the model on the data from sites 0, 2, and 3, then test on site 1. The performance plots and calibration curves will be saved to figures/MODEL/CONFIG/PATH/eval.

python main.py --task site-one --train_sites 0-2-3 --test_sites 1
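
The hyphen-separated site lists in the command above could be parsed as sketched below. This is a hypothetical stand-in for illustration, not the argument handling in main.py; the defaults and the `parse_sites` helper are assumptions.

```python
import argparse


def parse_sites(text: str) -> list[int]:
    """Turn a hyphen-separated string such as "0-2-3" into [0, 2, 3]."""
    try:
        return [int(part) for part in text.split("-")]
    except ValueError as exc:
        raise argparse.ArgumentTypeError(f"invalid site list: {text!r}") from exc


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; see main.py for the real argument definitions.
    parser = argparse.ArgumentParser(description="Stand-in for the main.py CLI")
    parser.add_argument("--task", default="site-loo")
    parser.add_argument("--train_sites", type=parse_sites, default=[0, 2, 3])
    parser.add_argument("--test_sites", type=parse_sites, default=[1])
    return parser


args = build_parser().parse_args(
    ["--task", "site-one", "--train_sites", "0-2-3", "--test_sites", "1"]
)
# Training and test sites should not overlap in a held-out-site evaluation.
overlap = set(args.train_sites) & set(args.test_sites)
```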

Test trained models on site 1, without post-hoc recalibration (default: test_calibration=1).

python main.py --task site-loo-eval --test_calibration 0

Model interpretability: create feature contribution plots and save to figures/MODEL/CONFIG/PATH/SHAP_analysis.

python main.py --task site-one-explain

Simulation of the continuous risk prediction: create the cancer risk curves and save to figures/MODEL/CONFIG/PATH/risk_curve.

python main.py --task site-one-risk

All available arguments are defined in main.py and can also be listed with:

python main.py -h

6. Trained Checkpoints

The trained model is not publicly available, but can be made available for external validation, with appropriate data use and privacy agreements. This can be requested from the corresponding author.

7. Citation / BibTeX

@misc{varghese2026digitally,
Author = {Chris Varghese and Leo Y. Li-Han and Richa Bisht and Ellen Larson and Frank Lee and Ryan M. Carr and Tanios S. Bekaii-Saab and Shounak Majumder and John D. Halamka and Mark Truty and Ajit H. Goenka and Hojjat Salehinejad and Cornelius A. Thiels},
Title = {Digitally enriching a screening population for pancreatic cancer using routine blood-based measures and clinical histories},
Year = {2026},
Eprint = {arXiv:2605.30275},
}
