# Medical LLM Evaluation Framework
Part of the Chain-of-Evidence project by Lunit
Quick Start · Datasets · Metrics · Configuration · Extend · Project Structure · Development · License
| Date | Version | Update |
|---|---|---|
| 2026-04-08 | v0.1.0 | Initial release — 14 medical datasets, 8 metrics, async evaluation pipeline |
CoEval is an async-first evaluation framework built by Lunit's Chain-of-Evidence team for benchmarking LLMs on medical tasks. It is designed to:
- Evaluate your own model — Point to any OpenAI-compatible endpoint (vLLM, SGLang, OpenAI, HuggingFace TGI) and run standardized medical benchmarks.
- Compare models fairly — Run multiple models against the same datasets, metrics, and prompts for apples-to-apples comparison.
- Scale easily — Adding a new dataset is ~50 lines of Python + one YAML file. Adding a metric is even less.
It ships with 14 medical datasets, 8 metrics (deterministic + LLM-as-judge), and a Hydra-based config system for fully reproducible evaluations.
## Quick Start

```bash
git clone https://github.com/lunit-io/coeval.git
cd coeval
mise trust     # Required on first clone — trusts mise.toml config
mise run sync  # Installs Python 3.12 + deps via mise/uv
```

Note: sglang-gravity is included by default for serving Gravity MoE models locally. If you already have an OpenAI-compatible endpoint running (vLLM, OpenAI, Azure, etc.), skip the serving step below.
```bash
# Start SGLang server (default: learning-unit/Test, GPU 0, port 9006)
mise run serve

# Custom GPU and port
mise run serve -- --gpu 4 --port 9015

# Custom model with multi-GPU
mise run serve -- --model-path your/model --gpu 0,1 --tp 2
```

Or use any OpenAI-compatible API:

```bash
# Hosted API (OpenAI, Azure, etc.)
export OPENAI_API_KEY=sk-...
```

```bash
# Smoke test (5 samples, default dataset: PubMedQA)
mise run eval -- num_samples=5

# Single dataset
mise run eval -- datasets=medqa

# All 14 datasets
mise run eval -- datasets=all

# Custom endpoint (if not using mise run serve)
mise run eval -- client.api_base=http://localhost:8000/v1 client.model=your-model datasets=all
```

Some datasets (e.g., HealthBench) use an LLM judge to score responses instead of exact match. These require an OpenAI API key for the judge model.
Set it in mise.toml (recommended — keeps it out of your shell history):

```toml
# mise.toml → [env]
OPENAI_API_KEY = "sk-..."
```

Or export it directly:

```bash
export OPENAI_API_KEY=sk-...
```

Then run:

```bash
# Run HealthBench — your model generates responses, gpt-4.1 grades them
mise run eval -- datasets=healthbench_consensus
```

Note: `OPENAI_API_KEY` is used for both the model server and the judge model. Most MCQ datasets (MedQA, MedMCQA, etc.) use deterministic scoring and do not require a judge model.
Results are saved to `evaluation_outputs/`:

```
📁 evaluation_outputs/YYYY-MM-DD/HH-MM-SS/
├── results_<dataset>.json    # Per-sample: input, prediction, scores
├── summary_<dataset>.json    # Aggregated metrics + breakdown
└── summary_combined.json     # Cross-dataset leaderboard
```
## Datasets

All datasets are evaluated as MCQ (multiple-choice questions) unless noted otherwise: the model selects an answer letter (A/B/C/D) and is scored by exact match.
| Dataset | Key | Source | Task | Metric |
|---|---|---|---|---|
| MedQA | `medqa` | USMLE | 4-option MCQ | MCQ Accuracy |
| MedMCQA | `medmcqa` | AIIMS/PGI | 4-option MCQ | MCQ Accuracy |
| MMLU-Pro Health | `mmlu_pro_health` | MMLU-Pro subset | 10-option MCQ | MCQ Accuracy |
| HeadQA | `headqa` | Spanish medical exams | 4-option MCQ | MCQ Accuracy |
| CareQA | `careqa` | USMLE Step 1-3 | 4-option MCQ | MCQ Accuracy |
| M-ARC | `m_arc` | Medical ARC | 4-option MCQ | MCQ Accuracy |
| MetaMedQA | `metamedqa` | Meta medical eval | 4-option MCQ | MCQ Accuracy |
| MedXpertQA | `medxpertqa` | Expert medical QA | 4-option MCQ | MCQ Accuracy |
| Medbullets | `medbullets` | Step 2 practice | 4/5-option MCQ | MCQ Accuracy |
| MedHallu | `medhallu` | Hallucination detection | Binary classification | Macro F1 |
| MedCalc | `medcalc` | Clinical calculation | Open-ended numeric | Numeric Accuracy |
| PubMedQA | `pubmedqa` | PubMed abstracts | 3-option MCQ (Yes/No/Maybe) | MCQ Accuracy |
| HealthBench | `healthbench_consensus` | OpenAI | Open-ended multi-turn | LLM-as-judge (rubric) |
| AttributionBench | `attributionbench` | OSU NLP | Binary classification | Macro F1 |
```bash
mise run eval -- datasets=medqa                 # Single dataset
mise run eval -- datasets=all                   # All 14 datasets
mise run eval -- datasets=all num_samples=50    # Quick run, 50 samples each
```

## Metrics

### Deterministic metrics

| Metric | Class | Description |
|---|---|---|
| MCQ Accuracy | `MCQAccuracyMetric` | Robust answer letter extraction + exact match |
| Classification | `ClassificationMetric` | Label extraction + exact match |
| Numeric Accuracy | `NumericAccuracyMetric` | Numeric value comparison with tolerance |
Note: while these metrics are deterministic, scores may vary slightly across runs due to LLM generation randomness (e.g., different token sampling even at low temperatures). For fully reproducible results, use `temperature=0`.
### LLM-as-judge metrics

| Metric | Class | Description |
|---|---|---|
| HealthBench Rubric | `HealthBenchRubricMetric` | Per-criterion rubric scoring |
| Faithfulness | `FaithfulnessMetric` | Is the answer grounded in context? |
| Answer Relevancy | `AnswerRelevancyMetric` | Is the answer relevant to the question? |
| Contextual Precision | `ContextualPrecisionMetric` | Does context contain relevant info? |
| Contextual Recall | `ContextualRecallMetric` | Is all relevant info retrieved? |
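Conceptually, rubric scoring has the judge mark each criterion as met or unmet, then sums the earned points and normalizes by the total positive points, clipping to [0, 1] (negative-weight criteria act as penalties). A simplified sketch of that aggregation step only, not CoEval's exact implementation:

```python
def rubric_score(criteria: list[tuple[int, bool]]) -> float:
    """Aggregate (point_weight, met) pairs into a [0, 1] score.

    Positive weights reward met criteria; negative weights penalize
    undesirable behaviors the judge flagged as present.
    """
    total = sum(w for w, _ in criteria if w > 0)   # max achievable points
    earned = sum(w for w, met in criteria if met)  # points actually scored
    return max(0.0, min(1.0, earned / total)) if total else 0.0
```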
### Score aggregation

| Strategy | Key | Use case |
|---|---|---|
| Simple Average | `default` | Most MCQ datasets |
| Weighted Average | `weighted_avg` | Grouped sub-datasets |
| Macro F1 | `f1` | Classification tasks (e.g., AttributionBench) |
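As an illustration of the non-trivial strategy: macro F1 computes an F1 score per class and averages the classes equally, so a rare class weighs as much as a common one. A self-contained sketch (not CoEval's code):

```python
from collections import defaultdict


def macro_f1(pairs: list[tuple[str, str]]) -> float:
    """Macro-averaged F1 over (gold, pred) label pairs."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold, pred in pairs:
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # pred wrongly claimed this class
            fn[gold] += 1  # gold class was missed
    labels = set(tp) | set(fp) | set(fn)
    f1s = []
    for c in labels:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0
```

This is why MedHallu and AttributionBench (binary, potentially imbalanced) use macro F1 rather than plain accuracy.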
## Configuration

CoEval uses Hydra for composable configuration. Everything is overridable from the CLI.
| Variable | Description | Example |
|---|---|---|
| `OPENAI_API_KEY` | API key (model server + LLM-as-judge) | `sk-...` |
```bash
# Override any config value
mise run eval -- datasets=medqa num_samples=100 client.temperature=0.3

# Change system prompt
mise run eval -- 'system_prompt="Answer concisely."'

# Swap judge model for HealthBench
mise run eval -- datasets=healthbench_consensus datasets/metrics/judge@healthbench_judge=gpt-4.1
```

## Extend

CoEval is designed to be easily extensible:

- Adding a new dataset — ~50 lines of Python + one YAML config
- Adding a new metric — Extend `DeterministicMetric` or use DeepEval's LLM-as-judge
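Since the base-class API isn't spelled out in this README, the following shows only a hypothetical shape of a new deterministic metric; the `DeterministicMetric` stub below stands in for the real class under `src/coeval/metrics/`, whose interface may differ:

```python
from abc import ABC, abstractmethod


class DeterministicMetric(ABC):
    """Stand-in for CoEval's base class — illustrative only."""

    @abstractmethod
    def score(self, prediction: str, reference: str) -> float: ...


class ExactMatchMetric(DeterministicMetric):
    """Score 1.0 when the normalized prediction equals the reference."""

    def score(self, prediction: str, reference: str) -> float:
        return float(prediction.strip().lower() == reference.strip().lower())
```

The general pattern is the point: subclass, implement a per-sample scoring method, and register the metric in the dataset's YAML config.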
## Project Structure

```
src/coeval/
├── main.py                  # CLI entry point (Hydra)
├── clients/
│   └── passthrough.py       # PassthroughClient (OpenAI-compatible)
├── llm/                     # Low-level async LLM client
│   ├── client.py            # LLMClient (llama-index OpenAILike wrapper)
│   ├── config.py            # LLMConfig dataclass
│   └── exceptions.py        # Error hierarchy (timeout, rate limit, etc.)
├── core/
│   ├── runner.py            # EvalRunner — async generation + scoring pipeline
│   ├── evaluate.py          # Metric dispatch (deterministic vs LLM-as-judge)
│   ├── schema.py            # EvalResult, EvalSummary (Pydantic models)
│   └── types.py             # Centralized type aliases and deepeval re-exports
├── datasets/
│   ├── base.py              # GoldenDatasetBase, MultiTurnDatasetBase
│   ├── medqa.py             # 16 dataset loaders (one file each)
│   └── ...
├── metrics/                 # 3 deterministic + 5 LLM-as-judge metrics
├── util/                    # Parsers, score aggregation, Rich console
└── conf/                    # Hydra YAML configs
    ├── config.yaml          # Root config
    ├── client/              # LLM client configs
    ├── runner/              # Runner configs
    └── datasets/            # Per-dataset configs (dataset + metric + aggregator)
```
## Development

```bash
mise run sync     # Install dependencies
mise run test     # Run unit tests
mise run lint     # Ruff linter
mise run format   # Ruff formatter
```

## License

Copyright 2026 Lunit Inc.

Licensed under the Apache License, Version 2.0. See LICENSE for details.