An implementation of OpenAI's HealthBench and HealthBench Professional evaluation frameworks, based on simple-evals. This repository focuses exclusively on HealthBench evaluations, adds support for additional models (Claude, Gemini), and is kept in sync with the upstream scoring and evaluation logic so results are fully reproducible.
| Eval name | Description | Examples |
|---|---|---|
| `healthbench` | Full HealthBench benchmark | 5,000 |
| `healthbench_hard` | Difficult subset of HealthBench | 1,000 |
| `healthbench_consensus` | Consensus subset of HealthBench | 3,671 |
| `healthbench_meta` | Meta-evaluation (grader quality) | 29,511 |
| `healthbench_professional` | HealthBench Professional (clinician chat tasks) | 525 |
Grader:

`healthbench`, `healthbench_hard`, `healthbench_consensus`, and `healthbench_meta` use `gpt-4.1-2025-04-14` (Chat Completions API) by default. `healthbench_professional` uses `gpt-5.4-2026-03-05` at low reasoning effort (Responses API), per the paper. Override with `--healthbench-grader-model` and `--healthbench-grader-reasoning-effort`.

Data:

`healthbench`, `healthbench_hard`, and `healthbench_consensus` load from 🤗 openai/healthbench, with the OpenAI public blob as a fallback. `healthbench_professional` loads from 🤗 openai/healthbench-professional, with a bundled local file as a fallback. `healthbench_meta` loads from the OpenAI public blob.
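The primary-source-with-fallback loading pattern described above can be sketched as follows. This is a minimal illustration, not the repository's actual loader; the function and argument names are hypothetical:

```python
def load_with_fallback(primary, fallback):
    """Try the primary loader (e.g. the HuggingFace dataset); on any
    failure, use the fallback source (e.g. the public blob or a bundled
    local file). Both arguments are zero-argument callables.

    Hypothetical sketch -- the real loader's names and error handling
    may differ.
    """
    try:
        return primary()
    except Exception:
        return fallback()

def hf_loader():
    # Simulate HuggingFace being unreachable.
    raise ConnectionError("HuggingFace unavailable")

# Falls back to the bundled rows when the primary loader raises.
rows = load_with_fallback(hf_loader, lambda: [{"prompt_id": "demo"}])
```

The same helper covers all three fallback pairs in the table of evals, since only the two callables change per eval.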
HealthBench Professional evaluates LLMs on real clinician chat tasks spanning three use cases: care consult, writing and documentation, and medical research. It applies a length adjustment penalty by default (center=2,000 chars, penalty=0.0147 per 500 chars) as described in Section 4.1 of the paper. Data is loaded from HuggingFace automatically, with the bundled local file as a fallback if HuggingFace is unavailable.
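The length adjustment can be sketched as a simple linear penalty. This is an illustrative reading of the parameters above, assuming the penalty applies only to characters beyond the center; consult Section 4.1 of the paper for the exact formula:

```python
def length_adjusted_score(score: float, response_chars: int,
                          center: int = 2000,
                          penalty_per_500: float = 0.0147) -> float:
    """Apply a linear length penalty to a graded score.

    Hypothetical sketch of the Section 4.1 adjustment: responses longer
    than `center` characters lose `penalty_per_500` points per 500
    characters of excess length; shorter responses are left unpenalized
    (an assumption of this sketch).
    """
    excess = max(0, response_chars - center)
    return score - penalty_per_500 * (excess / 500)
```

Under these assumptions, a 2,000-character response keeps its raw score, while a 3,000-character response loses 2 x 0.0147 = 0.0294 points.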
Step 1: Install uv (if not already installed):

```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Step 2: Clone the repository and install dependencies:

```shell
git clone https://github.com/your-username/HealthBench.git
cd HealthBench
uv sync
```

Create a `.env` file in the project root with your API keys:

```shell
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
GEMINI_API_KEY=your_gemini_key
```

Run all commands from inside the HealthBench directory.
Quick test (10 examples, useful for verifying setup):

```shell
uv run python -m healthbench \
  --model gpt-5.5-2026-04-23 \
  --eval healthbench_hard \
  --n-threads 4 \
  --examples 10
```

Run full HealthBench:
```shell
uv run python -m healthbench \
  --model gpt-4.1 \
  --eval healthbench
```

Run HealthBench Hard or Consensus:
```shell
uv run python -m healthbench --model gpt-4.1 --eval healthbench_hard
uv run python -m healthbench --model gpt-4.1 --eval healthbench_consensus
```

Run HealthBench Professional:
```shell
uv run python -m healthbench \
  --model gpt-4.1 \
  --eval healthbench_professional
```

This automatically loads the 525-example dataset from HuggingFace, applies the paper's default length adjustment (center=2,000, penalty=0.0147/500 chars), and uses gpt-5.4-2026-03-05 at low reasoning effort as the grader; all of these settings are visible in the printed args namespace at runtime.
Override the grader for any eval:
```shell
uv run python -m healthbench \
  --model gpt-4.1 \
  --eval healthbench \
  --healthbench-grader-model gpt-5.4-2026-03-05 \
  --healthbench-grader-reasoning-effort low
```

| Parameter | Description |
|---|---|
| `--model` | Model name (use `--list-models` to see all available models) |
| `--eval` | Evaluation type: `healthbench`, `healthbench_hard`, `healthbench_consensus`, `healthbench_meta`, `healthbench_professional` |
| `--n-threads` | Number of parallel threads (default: 4) |
| `--n-repeats` | Number of evaluation repeats (default: 1) |
| `--examples` | Number of examples to run (overrides the default) |
| `--debug` | Run in debug mode with 10 examples |
| `--output-dir` | Directory to write results (default: `results/`) |
| `--healthbench-input-path` | Custom JSONL data path in HealthBench format (only for `--eval=healthbench`) |
| `--healthbench-grader-model` | Grader model ID (default: `gpt-4.1-2025-04-14`; auto-set to `gpt-5.4-2026-03-05` for `healthbench_professional`) |
| `--healthbench-grader-reasoning-effort` | Reasoning effort for the grader: `low`, `medium`, `high` (auto-set to `low` for `healthbench_professional`) |
| `--healthbench-length-adjustment-center` | Center character count for the length penalty (auto-set to 2000 for `healthbench_professional`) |
| `--healthbench-length-adjustment-penalty-per-500-chars` | Score penalty per 500 response characters (auto-set to 0.0147 for `healthbench_professional`) |
| `--healthbench-professional-mode` | Validation bundle for `--eval=healthbench` with a custom input path; requires `--healthbench-input-path`, `--healthbench-grader-model gpt-5.4-2026-03-05`, `--healthbench-grader-reasoning-effort low`, and both length adjustment flags |
The --n-threads parameter controls parallel API requests. The default is 4, which is safe for local development on any machine. Raise it if you have high-tier API access and want faster runs.
| API tier | Recommended --n-threads |
|---|---|
| High-tier / Enterprise | 50–120 |
| Standard | 10–20 |
| Low-tier / Free / Local | 4 (default) |
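Conceptually, `--n-threads` bounds how many API requests are in flight at once via a worker pool. A minimal sketch of the idea (illustrative only; `grade_example` is a hypothetical stand-in for a real grading API call):

```python
from concurrent.futures import ThreadPoolExecutor

def grade_example(example: dict) -> float:
    # Hypothetical stand-in for a blocking grader API call;
    # returns a fake score derived from the prompt length.
    return len(example["prompt"]) % 2 / 2

def run_eval(examples: list[dict], n_threads: int = 4) -> list[float]:
    # Each worker handles one request at a time, so n_threads is the
    # upper bound on concurrent API calls, matching the CLI flag.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(grade_example, examples))

scores = run_eval([{"prompt": "a"}, {"prompt": "bb"}], n_threads=2)
```

Raising the worker count speeds up a run only until you hit your provider's rate limit, which is why the recommended values above scale with API tier.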
```shell
uv run python -m healthbench \
  --model gpt-4o \
  --eval healthbench \
  --n-threads 20
```

- Arora, R. K. et al. (2025). HealthBench: Evaluating Large Language Models Towards Improved Human Health. arXiv. https://arxiv.org/abs/2505.08775
- Soskin Hicks, R. et al. (2026). HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats. arXiv. https://arxiv.org/abs/2604.27470