# Medical LLM Evaluation Framework
Part of the Chain-of-Evidence project by Lunit
Quick Start · Datasets · Metrics · Configuration · Extend · Project Structure · Development · License
| Date | Version | Update |
|---|---|---|
| 2026-04-08 | v0.1.0 | Initial release — 14 medical datasets, 8 metrics, async evaluation pipeline |
CoEval is an async-first evaluation framework built by Lunit's Chain-of-Evidence team for benchmarking LLMs on medical tasks. It is designed to:
- Evaluate your own model — Point to any OpenAI-compatible endpoint (vLLM, SGLang, OpenAI, HuggingFace TGI) and run standardized medical benchmarks.
- Compare models fairly — Run multiple models against the same datasets, metrics, and prompts for apples-to-apples comparison.
- Scale easily — Adding a new dataset is ~50 lines of Python + one YAML file. Adding a metric is even less.
It ships with 14 medical datasets, 8 metrics (deterministic + LLM-as-judge), and a Hydra-based config system for fully reproducible evaluations.
## Quick Start

```bash
git clone https://github.com/lunit-io/coeval.git
cd coeval
mise trust     # Required on first clone — trusts mise.toml config
mise run sync  # Installs Python 3.12 + deps via mise/uv
```

Note: sglang-gravity is included by default for serving Gravity MoE models locally. If you already have an OpenAI-compatible endpoint running (vLLM, OpenAI, Azure, etc.), skip the serving step below.
```bash
# Start SGLang server (default: learning-unit/Test, GPU 0, port 9006)
mise run serve

# Custom GPU and port
mise run serve -- --gpu 4 --port 9015

# Custom model with multi-GPU
mise run serve -- --model-path your/model --gpu 0,1 --tp 2
```

Or use any OpenAI-compatible API:

```bash
# Hosted API (OpenAI, Azure, etc.)
export OPENAI_API_KEY=sk-...
```

```bash
# Smoke test (5 samples, default dataset: PubMedQA)
mise run eval -- num_samples=5

# Single dataset
mise run eval -- datasets=medqa

# All 14 datasets
mise run eval -- datasets=all

# Custom endpoint (if not using mise run serve)
mise run eval -- client.api_base=http://localhost:8000/v1 client.model=your-model datasets=all
```

Some datasets (e.g., HealthBench) use an LLM judge to score responses instead of exact match. These require an OpenAI API key for the judge model.
Set it in mise.toml (recommended — keeps it out of your shell history):

```toml
# mise.toml → [env]
OPENAI_API_KEY = "sk-..."
```

Or export it directly:

```bash
export OPENAI_API_KEY=sk-...
```

Then run:

```bash
# Run HealthBench — your model generates responses, gpt-4.1 grades them
mise run eval -- datasets=healthbench_consensus
```

Note: `OPENAI_API_KEY` is used for both the model server and the judge model. Most MCQ datasets (MedQA, MedMCQA, etc.) use deterministic scoring and do not require a judge model.
Results are saved to `evaluation_outputs/`:

```
📁 evaluation_outputs/YYYY-MM-DD/HH-MM-SS/
├── results_<dataset>.json    # Per-sample: input, prediction, scores
├── summary_<dataset>.json    # Aggregated metrics + breakdown
└── summary_combined.json     # Cross-dataset leaderboard
```
## Datasets

All datasets are evaluated as MCQ (multiple-choice questions) unless noted otherwise: the model selects an answer letter (A/B/C/D) and is scored by exact match.
| Dataset | Key | Source | Task | Metric |
|---|---|---|---|---|
| MedQA | `medqa` | USMLE | 4-option MCQ | MCQ Accuracy |
| MedMCQA | `medmcqa` | AIIMS/PGI | 4-option MCQ | MCQ Accuracy |
| MMLU-Pro Health | `mmlu_pro_health` | MMLU-Pro subset | 10-option MCQ | MCQ Accuracy |
| HeadQA | `headqa` | Spanish medical exams | 4-option MCQ | MCQ Accuracy |
| CareQA | `careqa` | USMLE Step 1-3 | 4-option MCQ | MCQ Accuracy |
| M-ARC | `m_arc` | Medical ARC | 4-option MCQ | MCQ Accuracy |
| MetaMedQA | `metamedqa` | Meta medical eval | 4-option MCQ | MCQ Accuracy |
| MedXpertQA | `medxpertqa` | Expert medical QA | 4-option MCQ | MCQ Accuracy |
| Medbullets | `medbullets` | Step 2 practice | 4/5-option MCQ | MCQ Accuracy |
| MedHallu | `medhallu` | Hallucination detection | Binary classification | Macro F1 |
| MedCalc | `medcalc` | Clinical calculation | Open-ended numeric | Numeric Accuracy |
| PubMedQA | `pubmedqa` | PubMed abstracts | 3-option MCQ (Yes/No/Maybe) | MCQ Accuracy |
| HealthBench | `healthbench_consensus` | OpenAI | Open-ended multi-turn | LLM-as-judge (rubric) |
| AttributionBench | `attributionbench` | OSU NLP | Binary classification | Macro F1 |
```bash
mise run eval -- datasets=medqa                 # Single dataset
mise run eval -- datasets=all                   # All 14 datasets
mise run eval -- datasets=all num_samples=50    # Quick run, 50 samples each
```

## Metrics

### Deterministic metrics

| Metric | Class | Description |
|---|---|---|
| MCQ Accuracy | `MCQAccuracyMetric` | Robust answer letter extraction + exact match |
| Classification | `ClassificationMetric` | Label extraction + exact match |
| Numeric Accuracy | `NumericAccuracyMetric` | Numeric value comparison with tolerance |
Note: while these metrics are deterministic, scores may vary slightly across runs due to LLM generation randomness (e.g., different token sampling even at low temperatures). For fully reproducible results, use `temperature=0`.
### LLM-as-judge metrics

| Metric | Class | Description |
|---|---|---|
| HealthBench Rubric | `HealthBenchRubricMetric` | Per-criterion rubric scoring |
| Faithfulness | `FaithfulnessMetric` | Is the answer grounded in context? |
| Answer Relevancy | `AnswerRelevancyMetric` | Is the answer relevant to the question? |
| Contextual Precision | `ContextualPrecisionMetric` | Does context contain relevant info? |
| Contextual Recall | `ContextualRecallMetric` | Is all relevant info retrieved? |
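Conceptually, rubric scoring has the judge mark each criterion as met or unmet, then sums the earned points and normalizes by the total positive points, clipping to [0, 1] (negative-weight criteria act as penalties). A simplified sketch of that aggregation step only, not CoEval's exact implementation:

```python
def rubric_score(criteria: list[tuple[int, bool]]) -> float:
    """Aggregate (point_weight, met) pairs into a [0, 1] score.

    Positive weights reward met criteria; negative weights penalize
    undesirable behaviors the judge flagged as present.
    """
    total = sum(w for w, _ in criteria if w > 0)   # max achievable points
    earned = sum(w for w, met in criteria if met)  # points actually scored
    return max(0.0, min(1.0, earned / total)) if total else 0.0
```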
### Score aggregation

| Strategy | Key | Use case |
|---|---|---|
| Simple Average | `default` | Most MCQ datasets |
| Weighted Average | `weighted_avg` | Grouped sub-datasets |
| Macro F1 | `f1` | Classification tasks (e.g., AttributionBench) |
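As an illustration of the non-trivial strategy: macro F1 computes an F1 score per class and averages the classes equally, so a rare class weighs as much as a common one. A self-contained sketch (not CoEval's code):

```python
from collections import defaultdict


def macro_f1(pairs: list[tuple[str, str]]) -> float:
    """Macro-averaged F1 over (gold, pred) label pairs."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold, pred in pairs:
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # pred wrongly claimed this class
            fn[gold] += 1  # gold class was missed
    labels = set(tp) | set(fp) | set(fn)
    f1s = []
    for c in labels:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0
```

This is why MedHallu and AttributionBench (binary, potentially imbalanced) use macro F1 rather than plain accuracy.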
## Configuration

CoEval uses Hydra for composable configuration. Everything is overridable from the CLI.
| Variable | Description | Example |
|---|---|---|
| `OPENAI_API_KEY` | API key (model server + LLM-as-judge) | `sk-...` |
```bash
# Override any config value
mise run eval -- datasets=medqa num_samples=100 client.temperature=0.3

# Change system prompt
mise run eval -- 'system_prompt="Answer concisely."'

# Swap judge model for HealthBench
mise run eval -- datasets=healthbench_consensus datasets/metrics/judge@healthbench_judge=gpt-4.1
```

## Extend

CoEval is designed to be easily extensible:

- Adding a new dataset — ~50 lines of Python + one YAML config
- Adding a new metric — Extend `DeterministicMetric` or use DeepEval's LLM-as-judge
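Since the base-class API isn't spelled out in this README, the following shows only a hypothetical shape of a new deterministic metric; the `DeterministicMetric` stub below stands in for the real class under `src/coeval/metrics/`, whose interface may differ:

```python
from abc import ABC, abstractmethod


class DeterministicMetric(ABC):
    """Stand-in for CoEval's base class — illustrative only."""

    @abstractmethod
    def score(self, prediction: str, reference: str) -> float: ...


class ExactMatchMetric(DeterministicMetric):
    """Score 1.0 when the normalized prediction equals the reference."""

    def score(self, prediction: str, reference: str) -> float:
        return float(prediction.strip().lower() == reference.strip().lower())
```

The general pattern is the point: subclass, implement a per-sample scoring method, and register the metric in the dataset's YAML config.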
## Project Structure

```
src/coeval/
├── main.py                  # CLI entry point (Hydra)
├── clients/
│   └── passthrough.py       # PassthroughClient (OpenAI-compatible)
├── llm/                     # Low-level async LLM client
│   ├── client.py            # LLMClient (llama-index OpenAILike wrapper)
│   ├── config.py            # LLMConfig dataclass
│   └── exceptions.py        # Error hierarchy (timeout, rate limit, etc.)
├── core/
│   ├── runner.py            # EvalRunner — async generation + scoring pipeline
│   ├── evaluate.py          # Metric dispatch (deterministic vs LLM-as-judge)
│   ├── schema.py            # EvalResult, EvalSummary (Pydantic models)
│   └── types.py             # Centralized type aliases and deepeval re-exports
├── datasets/
│   ├── base.py              # GoldenDatasetBase, MultiTurnDatasetBase
│   ├── medqa.py             # 16 dataset loaders (one file each)
│   └── ...
├── metrics/                 # 3 deterministic + 5 LLM-as-judge metrics
├── util/                    # Parsers, score aggregation, Rich console
└── conf/                    # Hydra YAML configs
    ├── config.yaml          # Root config
    ├── client/              # LLM client configs
    ├── runner/              # Runner configs
    └── datasets/            # Per-dataset configs (dataset + metric + aggregator)
```
## Development

```bash
mise run sync     # Install dependencies
mise run test     # Run unit tests
mise run lint     # Ruff linter
mise run format   # Ruff formatter
```

## License

Copyright 2026 Lunit Inc.

Licensed under the Apache License, Version 2.0. See LICENSE for details.