🩺 CoEval

Medical LLM Evaluation Framework

Part of the Chain-of-Evidence project by Lunit


Quick Start · Datasets · Metrics · Configuration · Extend · Project Structure · Development · License


📰 What's New

| Date | Version | Update |
| --- | --- | --- |
| 2026-04-08 | v0.1.0 | Initial release — 14 medical datasets, 8 metrics, async evaluation pipeline |

CoEval is an async-first evaluation framework built by Lunit's Chain-of-Evidence team for benchmarking LLMs on medical tasks. It is designed to:

  • Evaluate your own model — Point to any OpenAI-compatible endpoint (vLLM, SGLang, OpenAI, HuggingFace TGI) and run standardized medical benchmarks.
  • Compare models fairly — Run multiple models against the same datasets, metrics, and prompts for apples-to-apples comparison.
  • Scale easily — Adding a new dataset is ~50 lines of Python + one YAML file. Adding a metric is even less.

It ships with 14 medical datasets, 8 metrics (deterministic + LLM-as-judge), and a Hydra-based config system for fully reproducible evaluations.


🚀 Quick Start

1. Clone and install

git clone https://github.com/lunit-io/coeval.git
cd coeval
mise trust          # Required on first clone — trusts mise.toml config
mise run sync       # Installs Python 3.12 + deps via mise/uv

2. Serve your model

Note: sglang-gravity is included by default for serving Gravity MoE models locally. If you already have an OpenAI-compatible endpoint running (vLLM, OpenAI, Azure, etc.), skip this step.

# Start SGLang server (default: learning-unit/Test, GPU 0, port 9006)
mise run serve

# Custom GPU and port
mise run serve -- --gpu 4 --port 9015

# Custom model with multi-GPU
mise run serve -- --model-path your/model --gpu 0,1 --tp 2

Or use any OpenAI-compatible API:

# Hosted API (OpenAI, Azure, etc.)
export OPENAI_API_KEY=sk-...

3. Run evaluation

# Smoke test (5 samples, default dataset: PubMedQA)
mise run eval -- num_samples=5

# Single dataset
mise run eval -- datasets=medqa

# All 14 datasets
mise run eval -- datasets=all

# Custom endpoint (if not using mise run serve)
mise run eval -- client.api_base=http://localhost:8000/v1 client.model=your-model datasets=all

4. (Optional) LLM-as-judge datasets

Some datasets (e.g., HealthBench) use an LLM judge to score responses instead of exact match. These require an OpenAI API key for the judge model.

Set it in mise.toml (recommended — keeps it out of your shell history):

# mise.toml → [env]
OPENAI_API_KEY = "sk-..."

Or export it directly:

export OPENAI_API_KEY=sk-...

Then run:

# Run HealthBench — your model generates responses, gpt-4.1 grades them
mise run eval -- datasets=healthbench_consensus

Note: OPENAI_API_KEY is used for both the model server and the judge model. Most MCQ datasets (MedQA, MedMCQA, etc.) use deterministic scoring and do not require a judge model.
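For intuition, rubric-based judging works in two stages: the judge model returns a met/not-met verdict for each rubric criterion, and the per-criterion points are then aggregated into a single score. A minimal sketch of the aggregation stage — the `RubricCriterion` shape and the normalization (clipping to [0, 1], penalizing via negative points) are illustrative assumptions, not CoEval's actual `HealthBenchRubricMetric` API:

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    text: str
    points: int  # negative points penalize undesirable behavior
    met: bool    # verdict returned by the judge model

def rubric_score(criteria: list[RubricCriterion]) -> float:
    """Aggregate per-criterion judge verdicts into one score in [0, 1].

    Points earned from met criteria are normalized by the maximum
    achievable (positive) points, then clipped to [0, 1].
    """
    earned = sum(c.points for c in criteria if c.met)
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        return 0.0
    return min(max(earned / max_points, 0.0), 1.0)
```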

5. Check results

Results are saved to evaluation_outputs/:

📁 evaluation_outputs/YYYY-MM-DD/HH-MM-SS/
├── results_<dataset>.json        # Per-sample: input, prediction, scores
├── summary_<dataset>.json        # Aggregated metrics + breakdown
└── summary_combined.json         # Cross-dataset leaderboard

🧰 Datasets

All datasets are evaluated as MCQ (multiple-choice question) unless noted otherwise: the model selects an answer letter (A, B, C, ..., up to the dataset's option count) and is scored by exact match against the gold letter.

14 datasets

| Dataset | Key | Source | Task | Metric |
| --- | --- | --- | --- | --- |
| MedQA | medqa | USMLE | 4-option MCQ | MCQ Accuracy |
| MedMCQA | medmcqa | AIIMS/PGI | 4-option MCQ | MCQ Accuracy |
| MMLU-Pro Health | mmlu_pro_health | MMLU-Pro subset | 10-option MCQ | MCQ Accuracy |
| HeadQA | headqa | Spanish medical exams | 4-option MCQ | MCQ Accuracy |
| CareQA | careqa | USMLE Step 1-3 | 4-option MCQ | MCQ Accuracy |
| M-ARC | m_arc | Medical ARC | 4-option MCQ | MCQ Accuracy |
| MetaMedQA | metamedqa | Meta medical eval | 4-option MCQ | MCQ Accuracy |
| MedXpertQA | medxpertqa | Expert medical QA | 4-option MCQ | MCQ Accuracy |
| Medbullets | medbullets | Step 2 practice | 4/5-option MCQ | MCQ Accuracy |
| MedHallu | medhallu | Hallucination detection | Binary classification | Macro F1 |
| MedCalc | medcalc | Clinical calculation | Open-ended numeric | Numeric Accuracy |
| PubMedQA | pubmedqa | PubMed abstracts | 3-option MCQ (Yes/No/Maybe) | MCQ Accuracy |
| HealthBench | healthbench_consensus | OpenAI | Open-ended multi-turn | LLM-as-judge (rubric) |
| AttributionBench | attributionbench | OSU NLP | Binary classification | Macro F1 |
mise run eval -- datasets=medqa              # Single dataset
mise run eval -- datasets=all                # All 14 datasets
mise run eval -- datasets=all num_samples=50 # Quick run, 50 samples each

📊 Metrics

Deterministic (no LLM required)

| Metric | Class | Description |
| --- | --- | --- |
| MCQ Accuracy | MCQAccuracyMetric | Robust answer letter extraction + exact match |
| Classification | ClassificationMetric | Label extraction + exact match |
| Numeric Accuracy | NumericAccuracyMetric | Numeric value comparison with tolerance |

Note: While these metrics are deterministic, scores may vary slightly across runs due to LLM generation randomness (e.g., different token sampling even at low temperatures). For fully reproducible results, use temperature=0.
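Numeric comparison with tolerance is useful for MedCalc-style tasks where a clinically equivalent value may not be string-identical to the gold answer. A minimal sketch — the extraction heuristic (last number in the response) and the default tolerances are assumptions for illustration, not the exact behavior of `NumericAccuracyMetric`:

```python
import math
import re

def numeric_match(prediction: str, target: float,
                  rel_tol: float = 0.05, abs_tol: float = 1e-6) -> bool:
    """Score an open-ended numeric answer against the gold value.

    Takes the last number in the response as the final answer and
    compares it to the target within relative/absolute tolerance.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", prediction.replace(",", ""))
    if not numbers:
        return False
    value = float(numbers[-1])
    return math.isclose(value, target, rel_tol=rel_tol, abs_tol=abs_tol)
```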

LLM-as-Judge (requires judge model)

| Metric | Class | Description |
| --- | --- | --- |
| HealthBench Rubric | HealthBenchRubricMetric | Per-criterion rubric scoring |
| Faithfulness | FaithfulnessMetric | Is the answer grounded in context? |
| Answer Relevancy | AnswerRelevancyMetric | Is the answer relevant to the question? |
| Contextual Precision | ContextualPrecisionMetric | Does context contain relevant info? |
| Contextual Recall | ContextualRecallMetric | Is all relevant info retrieved? |

Aggregation

| Strategy | Key | Use case |
| --- | --- | --- |
| Simple Average | default | Most MCQ datasets |
| Weighted Average | weighted_avg | Grouped sub-datasets |
| Macro F1 | f1 | Classification tasks (e.g., AttributionBench) |
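The three strategies above can be sketched as plain functions; the function shapes are illustrative assumptions, not CoEval's actual aggregator API. Note that macro F1 averages per-class F1 without weighting by class frequency, which is why it is preferred for imbalanced binary tasks like AttributionBench:

```python
def simple_average(scores: list[float]) -> float:
    return sum(scores) / len(scores)

def weighted_average(group_scores: dict[str, float],
                     group_sizes: dict[str, int]) -> float:
    # Weight each sub-dataset's score by its sample count.
    total = sum(group_sizes.values())
    return sum(group_scores[g] * group_sizes[g] for g in group_scores) / total

def macro_f1(preds: list[str], golds: list[str]) -> float:
    # Unweighted mean of per-class F1: every class counts equally.
    labels = sorted(set(golds))
    f1s = []
    for label in labels:
        tp = sum(p == label == g for p, g in zip(preds, golds))
        fp = sum(p == label != g for p, g in zip(preds, golds))
        fn = sum(g == label != p for p, g in zip(preds, golds))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```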

⚙️ Configuration

CoEval uses Hydra for composable configuration. Everything is overridable from the CLI.

Environment Variables

| Variable | Description | Example |
| --- | --- | --- |
| OPENAI_API_KEY | API key (model server + LLM-as-judge) | sk-... |

CLI Overrides

# Override any config value
mise run eval -- datasets=medqa num_samples=100 client.temperature=0.3

# Change system prompt
mise run eval -- 'system_prompt="Answer concisely."'

# Swap judge model for HealthBench
mise run eval -- datasets=healthbench_consensus datasets/metrics/judge@healthbench_judge=gpt-4.1

🔧 Extend

CoEval is designed to be easily extensible: adding a new dataset takes roughly 50 lines of Python plus one YAML config, and adding a new metric takes even less. Existing loaders in src/coeval/datasets/ and metrics in src/coeval/metrics/ make good templates to copy.
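As a rough sketch of the shape a new dataset loader might take — the `Golden` schema and field names here are assumptions for illustration; check `GoldenDatasetBase` in src/coeval/datasets/base.py for the real contract:

```python
from dataclasses import dataclass

@dataclass
class Golden:
    """Minimal stand-in for a golden evaluation sample (assumed shape)."""
    question: str
    options: dict[str, str]  # e.g. {"A": "...", "B": "..."}
    answer: str              # gold answer letter

def load_my_dataset(raw_rows: list[dict]) -> list[Golden]:
    """Convert raw rows (e.g. from a HuggingFace dataset) into goldens."""
    goldens = []
    for row in raw_rows:
        options = dict(zip("ABCD", row["choices"]))
        goldens.append(Golden(
            question=row["question"],
            options=options,
            answer="ABCD"[row["answer_idx"]],
        ))
    return goldens
```

The corresponding YAML config under conf/datasets/ would then wire this loader to a metric and an aggregation strategy, making the dataset selectable via `datasets=<key>` on the CLI.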


🏗️ Project Structure

src/coeval/
├── main.py                 # CLI entry point (Hydra)
├── clients/
│   └── passthrough.py      # PassthroughClient (OpenAI-compatible)
├── llm/                    # Low-level async LLM client
│   ├── client.py           # LLMClient (llama-index OpenAILike wrapper)
│   ├── config.py           # LLMConfig dataclass
│   └── exceptions.py       # Error hierarchy (timeout, rate limit, etc.)
├── core/
│   ├── runner.py           # EvalRunner — async generation + scoring pipeline
│   ├── evaluate.py         # Metric dispatch (deterministic vs LLM-as-judge)
│   ├── schema.py           # EvalResult, EvalSummary (Pydantic models)
│   └── types.py            # Centralized type aliases and deepeval re-exports
├── datasets/
│   ├── base.py             # GoldenDatasetBase, MultiTurnDatasetBase
│   ├── medqa.py            # 16 dataset loaders (one file each)
│   └── ...
├── metrics/                # 3 deterministic + 5 LLM-as-judge metrics
├── util/                   # Parsers, score aggregation, Rich console
└── conf/                   # Hydra YAML configs
    ├── config.yaml         # Root config
    ├── client/             # LLM client configs
    ├── runner/             # Runner configs
    └── datasets/           # Per-dataset configs (dataset + metric + aggregator)

🛠️ Development

mise run sync       # Install dependencies
mise run test       # Run unit tests
mise run lint       # Ruff linter
mise run format     # Ruff formatter

📄 License

Copyright 2026 Lunit Inc.

Licensed under the Apache License, Version 2.0. See LICENSE for details.

About

Chain-of-Evidence Evaluation Framework for benchmarking medical LLMs, developed as part of the Lunit-led AI-specialized Foundation Model consortium.
