An implementation of OpenAI's HealthBench and HealthBench Professional evaluation frameworks, based on simple-evals. This repository focuses exclusively on HealthBench evaluations, adds support for additional models (Claude, Gemini), and is kept in sync with the upstream scoring and evaluation logic so results are fully reproducible.
| Eval name | Description | Examples |
|---|---|---|
| `healthbench` | Full HealthBench benchmark | 5,000 |
| `healthbench_hard` | Difficult subset of HealthBench | 1,000 |
| `healthbench_consensus` | Consensus subset of HealthBench | 3,671 |
| `healthbench_meta` | Meta-evaluation (grader quality) | 29,511 |
| `healthbench_professional` | HealthBench Professional (clinician chat tasks) | 525 |
Grader:

`healthbench`, `healthbench_hard`, `healthbench_consensus`, and `healthbench_meta` use `gpt-4.1-2025-04-14` (Chat Completions API) by default. `healthbench_professional` uses `gpt-5.4-2026-03-05` at low reasoning effort (Responses API), per the paper. Override with `--healthbench-grader-model` and `--healthbench-grader-reasoning-effort`.

Data:

`healthbench`, `healthbench_hard`, and `healthbench_consensus` load from 🤗 openai/healthbench, with the OpenAI public blob as a fallback. `healthbench_professional` loads from 🤗 openai/healthbench-professional, with a bundled local file as a fallback. `healthbench_meta` loads from the OpenAI public blob.
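The primary-source-with-fallback loading pattern described above can be sketched as follows. This is a minimal illustration, not the repository's actual loader; the function and argument names are hypothetical:

```python
def load_with_fallback(primary, fallback):
    """Try the primary loader (e.g. the HuggingFace dataset); on any
    failure, use the fallback source (e.g. the public blob or a bundled
    local file). Both arguments are zero-argument callables.

    Hypothetical sketch -- the real loader's names and error handling
    may differ.
    """
    try:
        return primary()
    except Exception:
        return fallback()

def hf_loader():
    # Simulate HuggingFace being unreachable.
    raise ConnectionError("HuggingFace unavailable")

# Falls back to the bundled rows when the primary loader raises.
rows = load_with_fallback(hf_loader, lambda: [{"prompt_id": "demo"}])
```

The same helper covers all three fallback pairs in the table of evals, since only the two callables change per eval.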
HealthBench Professional evaluates LLMs on real clinician chat tasks spanning three use cases: care consult, writing and documentation, and medical research. It applies a length adjustment penalty by default (center=2,000 chars, penalty=0.0147 per 500 chars) as described in Section 4.1 of the paper. Data is loaded from HuggingFace automatically, with the bundled local file as a fallback if HuggingFace is unavailable.
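The length adjustment can be sketched as a simple linear penalty. This is an illustrative reading of the parameters above, assuming the penalty applies only to characters beyond the center; consult Section 4.1 of the paper for the exact formula:

```python
def length_adjusted_score(score: float, response_chars: int,
                          center: int = 2000,
                          penalty_per_500: float = 0.0147) -> float:
    """Apply a linear length penalty to a graded score.

    Hypothetical sketch of the Section 4.1 adjustment: responses longer
    than `center` characters lose `penalty_per_500` points per 500
    characters of excess length; shorter responses are left unpenalized
    (an assumption of this sketch).
    """
    excess = max(0, response_chars - center)
    return score - penalty_per_500 * (excess / 500)
```

Under these assumptions, a 2,000-character response keeps its raw score, while a 3,000-character response loses 2 x 0.0147 = 0.0294 points.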
Step 1: Install uv (if not already installed):

```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Step 2: Clone the repository and install dependencies:

```shell
git clone https://github.com/your-username/HealthBench.git
cd HealthBench
uv sync
```

Create a `.env` file in the project root with your API keys:

```shell
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
GEMINI_API_KEY=your_gemini_key
```

Run all commands from inside the HealthBench directory.
Quick test (10 examples, useful for verifying setup):

```shell
uv run python -m healthbench \
  --model gpt-5.5-2026-04-23 \
  --eval healthbench_hard \
  --n-threads 4 \
  --examples 10
```

Run full HealthBench:
```shell
uv run python -m healthbench \
  --model gpt-4.1 \
  --eval healthbench
```

Run HealthBench Hard or Consensus:
```shell
uv run python -m healthbench --model gpt-4.1 --eval healthbench_hard
uv run python -m healthbench --model gpt-4.1 --eval healthbench_consensus
```

Run HealthBench Professional:
```shell
uv run python -m healthbench \
  --model gpt-4.1 \
  --eval healthbench_professional
```

This automatically loads the 525-example dataset from HuggingFace, applies the paper's default length adjustment (center=2,000, penalty=0.0147/500 chars), and uses gpt-5.4-2026-03-05 at low reasoning effort as the grader; all of these settings are visible in the printed args namespace at runtime.
Override the grader for any eval:
```shell
uv run python -m healthbench \
  --model gpt-4.1 \
  --eval healthbench \
  --healthbench-grader-model gpt-5.4-2026-03-05 \
  --healthbench-grader-reasoning-effort low
```

| Parameter | Description |
|---|---|
| `--model` | Model name (use `--list-models` to see all available models) |
| `--eval` | Evaluation type: `healthbench`, `healthbench_hard`, `healthbench_consensus`, `healthbench_meta`, `healthbench_professional` |
| `--n-threads` | Number of parallel threads (default: 4) |
| `--n-repeats` | Number of evaluation repeats (default: 1) |
| `--examples` | Number of examples to run (overrides the default) |
| `--debug` | Run in debug mode with 10 examples |
| `--output-dir` | Directory to write results (default: `results/`) |
| `--healthbench-input-path` | Custom JSONL data path in HealthBench format (only for `--eval=healthbench`) |
| `--healthbench-grader-model` | Grader model ID (default: `gpt-4.1-2025-04-14`; auto-set to `gpt-5.4-2026-03-05` for `healthbench_professional`) |
| `--healthbench-grader-reasoning-effort` | Reasoning effort for the grader: `low`, `medium`, `high` (auto-set to `low` for `healthbench_professional`) |
| `--healthbench-length-adjustment-center` | Center character count for the length penalty (auto-set to 2000 for `healthbench_professional`) |
| `--healthbench-length-adjustment-penalty-per-500-chars` | Score penalty per 500 response characters (auto-set to 0.0147 for `healthbench_professional`) |
| `--healthbench-professional-mode` | Validation bundle for `--eval=healthbench` with a custom input path; requires `--healthbench-input-path`, `--healthbench-grader-model gpt-5.4-2026-03-05`, `--healthbench-grader-reasoning-effort low`, and both length adjustment flags |
The --n-threads parameter controls parallel API requests. The default is 4, which is safe for local development on any machine. Raise it if you have high-tier API access and want faster runs.
| API tier | Recommended --n-threads |
|---|---|
| High-tier / Enterprise | 50–120 |
| Standard | 10–20 |
| Low-tier / Free / Local | 4 (default) |
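Conceptually, `--n-threads` bounds how many API requests are in flight at once via a worker pool. A minimal sketch of the idea (illustrative only; `grade_example` is a hypothetical stand-in for a real grading API call):

```python
from concurrent.futures import ThreadPoolExecutor

def grade_example(example: dict) -> float:
    # Hypothetical stand-in for a blocking grader API call;
    # returns a fake score derived from the prompt length.
    return len(example["prompt"]) % 2 / 2

def run_eval(examples: list[dict], n_threads: int = 4) -> list[float]:
    # Each worker handles one request at a time, so n_threads is the
    # upper bound on concurrent API calls, matching the CLI flag.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(grade_example, examples))

scores = run_eval([{"prompt": "a"}, {"prompt": "bb"}], n_threads=2)
```

Raising the worker count speeds up a run only until you hit your provider's rate limit, which is why the recommended values above scale with API tier.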
```shell
uv run python -m healthbench \
  --model gpt-4o \
  --eval healthbench \
  --n-threads 20
```

- Arora, R. K. et al. (2025). HealthBench: Evaluating Large Language Models Towards Improved Human Health. arXiv. https://arxiv.org/abs/2505.08775
- Soskin Hicks, R. et al. (2026). HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats. arXiv. https://arxiv.org/abs/2604.27470