Digitally Enriching a Screening Population for Pancreatic Cancer Using Routine Blood-based Measures and Clinical Histories
This repository includes the implementation and experiments of the PRE-diagnostic pancreatic cancer risk MODel (PREMOD).
Authors
Chris Varghese1,2,*, Leo Y. Li-Han1,*, Richa Bisht1, Ellen Larson1, Frank Lee1, Ryan M. Carr1, Tanios S. Bekaii-Saab3, Shounak Majumder4, John D. Halamka5, Mark Truty1, Ajit H. Goenka6, Hojjat Salehinejad7,8, Cornelius A. Thiels1
* Co-First Authorship
1. Department of Surgery, Mayo Clinic, Rochester, MN, USA
2. Department of Surgery, University of Auckland, Auckland, NZ
3. Department of Hematology and Oncology, Mayo Clinic, Phoenix, AZ, USA
4. Division of Gastroenterology and Hepatology, Mayo Clinic, Rochester, MN, USA
5. Mayo Clinic Platform, Mayo Clinic, Rochester, Minnesota
6. Department of Radiology, Mayo Clinic, Rochester, Minnesota
7. Division of Health Care Delivery Research, Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, USA
8. Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, USA
Earlier detection of pancreatic cancer is key to enabling wider access to curative treatment and reducing cancer deaths; however, screening is presently not viable. Latent indicators of pathology are evident in an individual's disease and blood test trajectories and may predict the development of pancreatic cancer. Longitudinal sequences of coded diagnoses and blood test values accrued by patients throughout their clinical interactions were used to train a custom Transformer-based neural network with a multi-head attention mechanism to predict the risk of pancreatic cancer with a multi-year lead time and to risk-stratify populations for targeted screening. The cohort comprised 6,017 adults with pancreatic cancer and 177,081 controls (overall median age 75, 45% female) with a median of 12 years (interquartile range 6.9–16.2) of medical history prior to pancreatic cancer diagnosis. External validation via leave-one-site-out, out-of-sample testing, predicting pancreatic cancer 1, 2, and 3 years prior to diagnosis, demonstrated a mean area under the receiver operating characteristic curve of 0.837 (95% confidence interval 0.827–0.848), 0.797 (95% confidence interval 0.782–0.813), and 0.760 (95% confidence interval 0.745–0.776), respectively. Estimated pancreatic cancer risks were well-calibrated (calibration plot slope 1.08, intercept -0.077; Brier score 0.025), and a Bayesian population pancreatic cancer prevalence update allows estimated cancer risk outputs to be transportable across settings. At testing, a screening threshold of >3.3% risk of pancreatic cancer within 1 year offered a diagnostic odds ratio of 18.2. Our work therefore lays the foundation for a first population-level digital enrichment tool to widen access to curative-intent management of pancreatic cancer.
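The abstract mentions a Bayesian prevalence update that makes risk outputs transportable across settings, but this README does not describe the procedure. As a rough illustration only, the sketch below applies the standard prior-shift correction from Bayes' rule, which rescales a predicted probability from the prevalence seen in training to the prevalence of a target population. The function name `adjust_for_prevalence` and the target prevalence are assumptions made for this example and are not part of PREMOD.

```python
def adjust_for_prevalence(p: float, train_prev: float, target_prev: float) -> float:
    """Rescale probability p from the training prevalence to a target prevalence.

    Standard prior-shift correction (an assumed stand-in, not the PREMOD method).
    """
    for name, value in (("train_prev", train_prev), ("target_prev", target_prev)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must lie strictly between 0 and 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")

    # Reweight the positive and negative classes by the change in their priors.
    pos_weight = target_prev / train_prev
    neg_weight = (1.0 - target_prev) / (1.0 - train_prev)
    numerator = p * pos_weight
    return numerator / (numerator + (1.0 - p) * neg_weight)


# Case fraction in the study cohort, taken from the counts in the abstract.
cohort_prev = 6017 / (6017 + 177081)
# Hypothetical target-population prevalence, chosen only for illustration.
hypothetical_target_prev = 0.001
adjusted = adjust_for_prevalence(0.20, cohort_prev, hypothetical_target_prev)
print(f"adjusted risk: {adjusted:.4f}")
```

When the target prevalence equals the training prevalence, the function returns the input probability unchanged, which is a quick sanity check for any such update.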
├── mock_data/
│ ├── mock_dx_data.csv # Synthesized dataframe for diagnosis codes matching the schema.
│ └── mock_lab_data.csv # Synthesized dataframe for blood tests matching the schema.
├── scripts/
│ ├── __init__.py
│ ├── data_process.py # Data processing functions: Site-wise patient split, data encoding, etc.
│ ├── model_train.py # Model training pipeline.
│ ├── model_trainer.py # Model trainer class.
│ ├── model_eval.py # Model evaluation pipeline.
│ ├── model_explain.py # Model interpretability pipeline (feature contribution plots).
│ ├── model_risk_predict.py # Continuous risk prediction (risk curves).
│ ├── models.py # Mock implementation of the Transformer model (due to patent-related restrictions).
│ ├── loss_function.py # Implementation of loss functions in model training.
│ └── utils.py # Utility functions: dataset class, metric calculation, logging, visualization functions, etc.
├── main.py # Main script of the project.
├── requirements.txt # Python pip dependencies.
├── LICENSE # Apache-2.0 license.
└── README.md # This documentation file.
Python Version: 3.10.20
# Create a virtual environment using Python 3.10.20
python3.10 -m venv pc_env
source pc_env/bin/activate
# Install required dependencies
pip install --upgrade pip
pip install -r requirements.txt

Due to patient privacy restrictions, the raw clinical cohort cannot be shared publicly. Instead, we provide mock data, `mock_data/mock_dx_data.csv` and `mock_data/mock_lab_data.csv`, to verify functionality. Both files follow the data schema stated below.
- The mock data files contain 20 years of diagnosis codes and blood test histories prior to the cancer diagnoses for 1000 subjects.
- The cancer/target diagnosis date `TARGET_DX_DATE` for each subject is randomly chosen from the range 2025-01-01 to 2026-05-01; `DIAGNOSIS_DATE` and `LAB_DATE` are selected in the range 20 years before the corresponding `TARGET_DX_DATE`.
- The diagnosis codes `DIAGNOSIS_CODE` are randomly generated in the formats `001` to `999` and `A00` to `Z99` to emulate ICD-9 and ICD-10 codes, respectively.
- The blood test names `LAB_ITEM_NAME` are chosen from `Lab01` to `Lab31`.
- The total numbers of unique diagnosis codes and blood test items are 2696 and 31, respectively, consistent with our study dataset.
- The `SITE_CODE` is chosen from `{0,1,2,3}`, and the `LABEL` is selected from `{0,1}` following a positive-to-negative ratio of 1:10.
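The generation rules listed above can be sketched with the standard library alone. This is an illustration of those rules, not the script used to produce the shipped CSV files; the function names, the seed, the number of events per patient, and the `PATIENT_ID` numbering are assumptions.

```python
import random
import string
from datetime import date, timedelta


def random_dx_code(rng: random.Random) -> str:
    """Return an ICD-9-like ('001'-'999') or ICD-10-like ('A00'-'Z99') code."""
    if rng.random() < 0.5:
        return f"{rng.randint(1, 999):03d}"
    return f"{rng.choice(string.ascii_uppercase)}{rng.randint(0, 99):02d}"


def make_mock_dx_rows(n_patients: int = 1000, events_per_patient: int = 5, seed: int = 0):
    """Build diagnosis rows that follow the mock-data rules described above."""
    rng = random.Random(seed)
    start, end = date(2025, 1, 1), date(2026, 5, 1)
    rows = []
    for i in range(n_patients):
        # One target date, site and label per patient.
        target = start + timedelta(days=rng.randint(0, (end - start).days))
        site = rng.choice([0, 1, 2, 3])
        label = 1 if rng.random() < 1 / 11 else 0  # 1:10 ratio in expectation
        for _ in range(events_per_patient):
            # Each event falls within the 20 years before the target date.
            dx_date = target - timedelta(days=rng.randint(1, 20 * 365))
            rows.append({
                "PATIENT_ID": 1000001 + i,  # assumed numbering scheme
                "SITE_CODE": site,
                "DIAGNOSIS_DATE": dx_date,
                "DIAGNOSIS_CODE": random_dx_code(rng),
                "TARGET_DX_DATE": target,
                "LABEL": label,
            })
    return rows


rows = make_mock_dx_rows(n_patients=3, events_per_patient=2)
print(len(rows), rows[0]["DIAGNOSIS_CODE"])
```

The rows can be written out with `csv.DictWriter`; `date` objects serialize to the `YYYY-MM-DD` form shown in the schema.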
Running `python scripts/data_process.py` creates a `mock_data_processed` folder containing the encoding matrices generated for the diagnosis codes and lab tests. Each patient is saved as a tuple `(Dx, Lab, Label)` in a file named `pid.joblib`. These tuples are used as input to the Dataset and the model.
Data schema:
a. `mock_data/mock_dx_data.csv`
| Column Name | Data Type | Description | Example Value |
|---|---|---|---|
| `PATIENT_ID` | `int64` | Unique patient identifier | 1000001 |
| `SITE_CODE` | `int64` | Clinical site location indicator (for LOSO evaluation) | 1 |
| `DIAGNOSIS_DATE` | `datetime64` | Timestamp for the current event | 2016-01-01 |
| `DIAGNOSIS_CODE` | `string` | Structured ICD diagnostic code (ICD-9 or ICD-10 versions) | "K86" |
| `TARGET_DX_DATE` | `datetime64` | Timestamp for the cancer diagnosis | 2026-01-01 |
| `LABEL` | `int64` | Binary outcome ground truth (1: Pancreatic Cancer, 0: Control) | 1 |
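To check that a file follows the diagnosis schema above, each row can be parsed into typed fields. The sketch below uses only the standard library and an in-memory sample row built from the example values in the table. `parse_dx_row` is a hypothetical helper, not a function from `scripts/`, and the check that events do not postdate the target date is an assumption based on the histories being described as prior to diagnosis.

```python
import csv
import io
from datetime import date

DX_COLUMNS = (
    "PATIENT_ID", "SITE_CODE", "DIAGNOSIS_DATE",
    "DIAGNOSIS_CODE", "TARGET_DX_DATE", "LABEL",
)


def parse_dx_row(row: dict) -> dict:
    """Convert one CSV row (all strings) into typed values, raising on bad input."""
    missing = [c for c in DX_COLUMNS if c not in row]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    parsed = {
        "PATIENT_ID": int(row["PATIENT_ID"]),
        "SITE_CODE": int(row["SITE_CODE"]),
        "DIAGNOSIS_DATE": date.fromisoformat(row["DIAGNOSIS_DATE"]),
        "DIAGNOSIS_CODE": row["DIAGNOSIS_CODE"].strip(),
        "TARGET_DX_DATE": date.fromisoformat(row["TARGET_DX_DATE"]),
        "LABEL": int(row["LABEL"]),
    }
    if parsed["LABEL"] not in (0, 1):
        raise ValueError("LABEL must be 0 or 1")
    # Assumption: every recorded event precedes or equals the target date.
    if parsed["DIAGNOSIS_DATE"] > parsed["TARGET_DX_DATE"]:
        raise ValueError("DIAGNOSIS_DATE is after TARGET_DX_DATE")
    return parsed


sample = (
    "PATIENT_ID,SITE_CODE,DIAGNOSIS_DATE,DIAGNOSIS_CODE,TARGET_DX_DATE,LABEL\n"
    "1000001,1,2016-01-01,K86,2026-01-01,1\n"
)
records = [parse_dx_row(r) for r in csv.DictReader(io.StringIO(sample))]
print(records[0])
```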
b. `mock_data/mock_lab_data.csv`
| Column Name | Data Type | Description | Example Value |
|---|---|---|---|
| `PATIENT_ID` | `int64` | Unique patient identifier | 1000001 |
| `LAB_DATE` | `datetime64` | Timestamp for the current event | 2016-01-01 |
| `LAB_ITEM_NAME` | `string` | Name of current lab test | "Platelet Count" |
| `RESULT_NUM` | `float64` | Numerical measurement of the current lab test item | 99.9 |
| `TARGET_DX_DATE` | `datetime64` | Timestamp for the cancer diagnosis | 2026-01-01 |
| `LABEL` | `int64` | Binary outcome ground truth (1: Pancreatic Cancer, 0: Control) | 1 |
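The `SITE_CODE` column in the diagnosis table supports leave-one-site-out (LOSO) evaluation: each site is held out in turn while the model is trained on the remaining sites. A minimal sketch of such a patient-level split follows. `loso_folds` and the toy patient-to-site mapping are illustrative assumptions and do not reproduce `scripts/data_process.py`.

```python
from typing import Dict, Iterator, List, Tuple


def loso_folds(patient_site: Dict[int, int]) -> Iterator[Tuple[int, List[int], List[int]]]:
    """Yield (held_out_site, train_patient_ids, test_patient_ids) for each site."""
    for held_out in sorted(set(patient_site.values())):
        # Patients are assigned by site, so no patient appears in both sets.
        train = sorted(p for p, s in patient_site.items() if s != held_out)
        test = sorted(p for p, s in patient_site.items() if s == held_out)
        yield held_out, train, test


# Toy mapping from PATIENT_ID to SITE_CODE (hypothetical values).
patient_site = {1000001: 0, 1000002: 1, 1000003: 2, 1000004: 3, 1000005: 1}

for site, train_ids, test_ids in loso_folds(patient_site):
    print(f"held-out site {site}: {len(train_ids)} train, {len(test_ids)} test")
```

Splitting by site rather than by row keeps all of a patient's events on one side of the split, so the held-out site acts as an external test set.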
- Train and test models using leave-one-site-out CV with 12-, 24-, and 36-month prediction lead times. Two folders, `results` and `figures`, will be created to contain trained models and plots, respectively.

  `python main.py --task site-loo-intervals`

- Train and test models using leave-one-site-out CV with a 12-month (default) prediction lead time.

  `python main.py --task site-loo`

- Train the model on the data from sites 0, 2, and 3, then test on site 1. The performance plots and calibration curves will be saved to `figures/MODEL/CONFIG/PATH/eval`.

  `python main.py --task site-one --train_sites 0-2-3 --test_sites 1`

- Test trained models on site 1, without post-hoc recalibration (default: `test_calibration=1`).

  `python main.py --task site-loo-eval --test_calibration 0`

- Model interpretability: create feature contribution plots and save to `figures/MODEL/CONFIG/PATH/SHAP_analysis`.

  `python main.py --task site-one-explain`

- Simulation of the continuous risk prediction: create the cancer risk curves and save to `figures/MODEL/CONFIG/PATH/risk_curve`.

  `python main.py --task site-one-risk`

- All possible arguments can be found in `main.py` or by running:

  `python main.py -h`

The trained model is not publicly available, but can be made available for external validation, with appropriate data use and privacy agreements. This can be requested from the corresponding author.
@misc{varghese2026digitally,
Author = {Chris Varghese and Leo Y. Li-Han and Richa Bisht and Ellen Larson and Frank Lee and Ryan M. Carr and Tanios S. Bekaii-Saab and Shounak Majumder and John D. Halamka and Mark Truty and Ajit H. Goenka and Hojjat Salehinejad and Cornelius A. Thiels},
Title = {Digitally enriching a screening population for pancreatic cancer using routine blood-based measures and clinical histories},
Year = {2026},
Eprint = {arXiv:2605.30275},
}