# Hugging Face dataset export walkthrough (text)
Use this notebook as a step-by-step template for pulling public datasets from Hugging Face and saving each split locally as a clean `.jsonl` file under `data/`. It installs `datasets` and `huggingface-hub`, sets a target directory, iterates over every split (for example, train and test), and writes each example to disk as one JSON object per line, so the results plug directly into downstream tooling.

The dataset(s) in this example are used for demonstration purposes only. Microsoft does not endorse them specifically.

# Download ncbi/MedCalc-Bench-v1.1
This section installs the necessary Hugging Face tooling, fetches the `ncbi/MedCalc-Bench-v1.1` dataset from Hugging Face, and persists each available split as a JSON Lines file under `data/medcalc-bench-v1.1`.

`
@misc{khandekar2024medcalcbench,
      title={MedCalc-Bench: Evaluating Large Language Models for Medical Calculations}, 
      author={Nikhil Khandekar and Qiao Jin and Guangzhi Xiong and Soren Dunn and Serina S Applebaum and Zain Anwar and Maame Sarfo-Gyamfi and Conrad W Safranek and Abid A Anwar and Andrew Zhang and Aidan Gilson and Maxwell B Singer and Amisha Dave and Andrew Taylor and Aidong Zhang and Qingyu Chen and Zhiyong Lu},
      year={2024},
      eprint={2406.12036},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}`

In [None]:
import sys
import subprocess

subprocess.run(
    [sys.executable, "-m", "pip", "install", "datasets", "huggingface-hub"],
    check=True,
)
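
As an optional sanity check after the install cell, the sketch below confirms that the two packages can be found for import before any download starts. The helper name `missing_modules` is hypothetical, not part of either library. Note that the pip package `huggingface-hub` is imported as `huggingface_hub`.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names in module_names that cannot be located for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# These are import names, which can differ from the pip package names.
required = ["datasets", "huggingface_hub"]
missing = missing_modules(required)
if missing:
    print(f"Still missing after install: {missing}")
else:
    print("All required modules are importable.")
```

If anything is reported as missing in a notebook, restarting the kernel after the install is often enough for the new packages to be picked up.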

In [None]:
from datasets import load_dataset
from pathlib import Path
import json

target_dir = Path("data/medcalc-bench-v1.1")
target_dir.mkdir(parents=True, exist_ok=True)

dataset = load_dataset("ncbi/MedCalc-Bench-v1.1")

for split_name, split_dataset in dataset.items():
    output_path = target_dir / f"{split_name}.jsonl"
    with output_path.open("w", encoding="utf-8") as output_file:
        for record in split_dataset:
            json.dump(record, output_file, ensure_ascii=False)
            output_file.write("\n")
    print(f"Saved {split_name} split to {output_path}")

Run the second code cell to download the data; the JSONL files are saved per split in `data/medcalc-bench-v1.1`.
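
To confirm the export worked, the files can be read back line by line. The following is a minimal sketch, assuming the cell above has already written its output to `data/medcalc-bench-v1.1`; the helper name `read_jsonl` is hypothetical. If the directory does not exist yet, the loop simply prints nothing.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with Path(path).open("r", encoding="utf-8") as input_file:
        for line in input_file:
            line = line.strip()
            if line:
                yield json.loads(line)


target_dir = Path("data/medcalc-bench-v1.1")
for jsonl_path in sorted(target_dir.glob("*.jsonl")):
    records = list(read_jsonl(jsonl_path))
    # Field names are taken from the first record only.
    first_keys = sorted(records[0]) if records else []
    print(f"{jsonl_path.name}: {len(records)} records, fields: {first_keys}")
```

The record count printed for each file should match the size of the corresponding split in the loaded dataset.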

# Download ncbi/TrialGPT-Criterion-Annotations
This section loads the `ncbi/TrialGPT-Criterion-Annotations` dataset from Hugging Face and writes each split as JSON Lines into `data/trialgpt-criterion-annotations`.


Citation:
Qiao Jin, Zifeng Wang, Charalampos S. Floudas, Fangyuan Chen, Changlin Gong, Dara Bracken-Clarke, Elisabetta Xue, Yifan Yang, Jimeng Sun, Zhiyong Lu. Matching Patients to Clinical Trials with Large Language Models. Nat Commun. 2024;15:9074. doi: 10.1038/s41467-024-53081-z

In [None]:
from datasets import load_dataset
from pathlib import Path
import json

target_dir = Path("data/trialgpt-criterion-annotations")
target_dir.mkdir(parents=True, exist_ok=True)

dataset = load_dataset("ncbi/TrialGPT-Criterion-Annotations")

for split_name, split_dataset in dataset.items():
    output_path = target_dir / f"{split_name}.jsonl"
    with output_path.open("w", encoding="utf-8") as output_file:
        for record in split_dataset:
            json.dump(record, output_file, ensure_ascii=False)
            output_file.write("\n")
    print(f"Saved {split_name} split to {output_path}")

# Download rungalileo/medical_transcription_4
This section loads the `rungalileo/medical_transcription_4` dataset from Hugging Face and exports each split as JSON Lines files under `data/medical_transcription_4`.

No citation information could be located for this dataset. Please let us know if you believe you should be cited or credited.

In [None]:
from datasets import load_dataset
from pathlib import Path
import json

target_dir = Path("data/medical_transcription_4")
target_dir.mkdir(parents=True, exist_ok=True)

dataset = load_dataset("rungalileo/medical_transcription_4")

for split_name, split_dataset in dataset.items():
    output_path = target_dir / f"{split_name}.jsonl"
    with output_path.open("w", encoding="utf-8") as output_file:
        for record in split_dataset:
            json.dump(record, output_file, ensure_ascii=False)
            output_file.write("\n")
    print(f"Saved {split_name} split to {output_path}")
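
The three export cells above repeat the same loop with a different dataset name and target directory. One way to factor that out is sketched below. The function name `export_splits` is hypothetical, and it assumes its first argument is a mapping from split name to an iterable of JSON-serializable records, which is how the cells above already treat the object returned by `load_dataset`.

```python
import json
from pathlib import Path


def export_splits(splits, target_dir):
    """Write each split to <target_dir>/<split_name>.jsonl.

    `splits` is any mapping of split name to an iterable of
    JSON-serializable records. Returns a dict mapping each split
    name to the number of records written.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    counts = {}
    for split_name, split_dataset in splits.items():
        output_path = target_dir / f"{split_name}.jsonl"
        count = 0
        with output_path.open("w", encoding="utf-8") as output_file:
            for record in split_dataset:
                # One JSON object per line; keep non-ASCII text readable.
                json.dump(record, output_file, ensure_ascii=False)
                output_file.write("\n")
                count += 1
        counts[split_name] = count
    return counts
```

With `datasets` installed, a call such as `export_splits(load_dataset("rungalileo/medical_transcription_4"), "data/medical_transcription_4")` would be expected to produce the same files as the cell above, and the returned counts give a quick check on how many records each split contained.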