Companion codebase for the paper "Retrieve, Then Classify: Corpus-Grounded Automation of Clinical Value Set Authoring".
Paper link:
This repository contains:
- create_dataset/: VSAC download, retrieval-index construction, dataset reconstruction, and the released lightweight split manifest
- model_training/: standalone training scripts for the MLP, LightGBM, and cross-encoder models
This repository does not include raw VSAC content. To reconstruct the dataset artifacts used for training, you must download the value set content locally with a valid UMLS API key.
Model weights are released separately on Hugging Face.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Or install from pyproject.toml:
pip install -e .
RASC/
  create_dataset/
    download_vsac.py
    build_index.py
    build_dataset.py
    release_manifest.py
    split_manifest_release.jsonl
  model_training/
    train_mlp.py
    train_lightgbm.py
    train_cross_encoder.py
export UMLS_API_KEY="YOUR_UMLS_API_KEY"
python create_dataset/download_vsac.py
This writes the local corpus to vsac_data/.
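Because the download depends on the UMLS_API_KEY environment variable, it can help to fail early when the key is missing. The following is a minimal sketch of such a check, not code from this repository: the function name require_umls_api_key is hypothetical, and download_vsac.py may already perform its own validation.

```python
import os


def require_umls_api_key(env=os.environ):
    """Return the UMLS API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of the repository.
    """
    key = env.get("UMLS_API_KEY", "").strip()
    # The placeholder from the export example is treated as "not set".
    if not key or key == "YOUR_UMLS_API_KEY":
        raise RuntimeError(
            "UMLS_API_KEY is not set; export it before running download_vsac.py"
        )
    return key
```

Calling require_umls_api_key() before starting a long download surfaces a missing or placeholder key immediately instead of partway through the run.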
The notebook configuration used title retrieval with SapBERT.
python create_dataset/build_index.py --strategy title
This writes the FAISS index to vsac_index/.
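Title retrieval amounts to nearest-neighbor search over title embeddings. The sketch below illustrates that idea with brute-force cosine similarity in plain Python. It is not the repository's implementation, which relies on SapBERT embeddings and a FAISS index; the function name top_k_titles, the toy two-dimensional vectors, and the example titles are all made up for illustration.

```python
import math


def top_k_titles(query_vec, title_vecs, k=10):
    """Return the k (title, score) pairs most similar to query_vec.

    Similarity is cosine similarity, computed by brute force.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    scored = [(title, cosine(query_vec, vec)) for title, vec in title_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-d "embeddings" (assumed values); real ones would come from the encoder.
toy_index = {
    "Diabetes Mellitus": [1.0, 0.0],
    "Type 2 Diabetes": [0.9, 0.1],
    "Hip Fracture": [0.0, 1.0],
}
nearest = top_k_titles([1.0, 0.0], toy_index, k=2)
```

Here nearest holds the two diabetes titles, ranked ahead of the unrelated one. A FAISS index serves the same purpose at corpus scale, where brute-force scoring in Python would be too slow.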
python create_dataset/build_dataset.py \
--vsac-dir vsac_data \
--index-dir vsac_index \
--strategy title \
--top-k 10 \
--out-dir dataset \
--holdout-publishers "Clinical Architecture" "CSTE Steward"
This produces:
- dataset/train_meta.pkl
- dataset/val_meta.pkl
- dataset/test_meta.pkl
- dataset/title_embs.npz
- dataset/code_embs.npz
- dataset/split_manifest.jsonl
- dataset/dataset_stats.json
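Before training, it may be worth confirming that every artifact listed above was actually written. The following is a small sketch of such a check, assuming the default output directory dataset/; the helper name missing_artifacts is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the build_dataset.py output list above.
EXPECTED_ARTIFACTS = [
    "train_meta.pkl",
    "val_meta.pkl",
    "test_meta.pkl",
    "title_embs.npz",
    "code_embs.npz",
    "split_manifest.jsonl",
    "dataset_stats.json",
]


def missing_artifacts(out_dir="dataset"):
    """Return the expected artifact names that are absent from out_dir."""
    root = Path(out_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]
```

An empty list from missing_artifacts() means all seven files are present; otherwise the returned names show which step of the reconstruction needs to be rerun.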
The lightweight release manifest is:
create_dataset/split_manifest_release.jsonl
To match your local download against it:
python create_dataset/release_manifest.py recover

All default hyperparameters are set to match the training notebooks used in this project.

python model_training/train_mlp.py
Optional threshold tuning on validation:
python model_training/train_mlp.py --tune-threshold

python model_training/train_lightgbm.py
Optional threshold tuning on validation:
python model_training/train_lightgbm.py --tune-threshold

python model_training/train_cross_encoder.py
Optional threshold tuning on validation:
python model_training/train_cross_encoder.py --tune-threshold

- The released manifest is lightweight and contains no VSAC content.
- Raw value set content must be downloaded locally by each user with their own UMLS credentials.
- The training scripts expect reconstructed dataset artifacts under dataset/.
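The --tune-threshold option on the training scripts tunes a decision threshold on the validation split. The sketch below shows one common way to do this, under assumptions not confirmed by this README: it maximizes F1 over a fixed grid of thresholds from 0.01 to 0.99. The metric, the grid, and the function name tune_threshold are illustrative choices, and the scripts may use a different procedure.

```python
def tune_threshold(scores, labels, candidates=None):
    """Return (best_threshold, best_f1) over candidate thresholds.

    scores are predicted probabilities and labels are 0/1 ground truth,
    both taken from the validation split.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # assumed grid: 0.01 .. 0.99

    def f1_at(threshold):
        pairs = list(zip(scores, labels))
        tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
        fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
        fn = sum(1 for s, y in pairs if s < threshold and y == 1)
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    # max() keeps the first (lowest) threshold among ties.
    best = max(candidates, key=f1_at)
    return best, f1_at(best)


# Toy validation data (made up for illustration).
val_scores = [0.10, 0.40, 0.35, 0.80, 0.65]
val_labels = [0, 0, 1, 1, 1]
threshold, f1 = tune_threshold(val_scores, val_labels)
```

Tuning on validation rather than test data keeps the held-out test metrics unbiased; the chosen threshold is then applied unchanged at test time.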