RASC

Companion codebase for the paper "Retrieve, Then Classify: Corpus-Grounded Automation of Clinical Value Set Authoring".

Paper link:

This repository contains:

create_dataset/: VSAC download, retrieval-index construction, dataset reconstruction, and the released lightweight split manifest
model_training/: standalone training scripts for the MLP, LightGBM, and cross-encoder models

This repository does not include raw VSAC content. To reconstruct the dataset artifacts used for training, you must download the value set content locally with a valid UMLS API key.

Model weights are released separately on Hugging Face.

Setup

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Or install from pyproject.toml:

pip install -e .

Repository Layout

RASC/
  create_dataset/
    download_vsac.py
    build_index.py
    build_dataset.py
    release_manifest.py
    split_manifest_release.jsonl
  model_training/
    train_mlp.py
    train_lightgbm.py
    train_cross_encoder.py

Data Reconstruction

1. Download VSAC content locally

export UMLS_API_KEY="YOUR_UMLS_API_KEY"
python create_dataset/download_vsac.py

This writes the local corpus to vsac_data/.

2. Build the semantic retrieval index

The notebook configuration used title retrieval with SAPBERT.

python create_dataset/build_index.py --strategy title

This writes the FAISS index to vsac_index/.

3. Reconstruct the manifest-compatible dataset artifacts

python create_dataset/build_dataset.py \
  --vsac-dir vsac_data \
  --index-dir vsac_index \
  --strategy title \
  --top-k 10 \
  --out-dir dataset \
  --holdout-publishers "Clinical Architecture" "CSTE Steward"

This produces:

dataset/train_meta.pkl
dataset/val_meta.pkl
dataset/test_meta.pkl
dataset/title_embs.npz
dataset/code_embs.npz
dataset/split_manifest.jsonl
dataset/dataset_stats.json

4. Recover split assignments from the released lightweight manifest

The lightweight release manifest is:

create_dataset/split_manifest_release.jsonl

To match your local download against it:

python create_dataset/release_manifest.py recover

Train Models

All default hyperparameters are set to mimic the training notebooks used in this project.

MLP

python model_training/train_mlp.py

Optional threshold tuning on validation:

python model_training/train_mlp.py --tune-threshold

LightGBM

python model_training/train_lightgbm.py

Optional threshold tuning on validation:

python model_training/train_lightgbm.py --tune-threshold

Cross-Encoder

python model_training/train_cross_encoder.py

Optional threshold tuning on validation:

python model_training/train_cross_encoder.py --tune-threshold

Notes

The released manifest is lightweight and contains no VSAC content.
Raw value set content must be downloaded locally by each user with their own UMLS credentials.
The training scripts expect reconstructed dataset artifacts under dataset/.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
create_dataset		create_dataset
model_training		model_training
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RASC

Setup

Repository Layout

Data Reconstruction

1. Download VSAC content locally

2. Build the semantic retrieval index

3. Reconstruct the manifest-compatible dataset artifacts

4. Recover split assignments from the released lightweight manifest

Train Models

MLP

LightGBM

Cross-Encoder

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

RASC

Setup

Repository Layout

Data Reconstruction

1. Download VSAC content locally

2. Build the semantic retrieval index

3. Reconstruct the manifest-compatible dataset artifacts

4. Recover split assignments from the released lightweight manifest

Train Models

MLP

LightGBM

Cross-Encoder

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages