Home
Editing note: This wiki is generated from docs/wiki/ in the main repo, which is the single source of truth. Always edit the pages there and push to main; a GitHub Action mirrors them here. Changes made directly in the wiki UI will be overwritten on the next sync.
itemeval provides item-level LLM evaluation over any API, with built-in budget control. It is a thin layer on inspect_ai for studies that need every individual grading event (psychometrics, IRT, mixed models, judge reliability) rather than only aggregate scores.
benchmark source ─▶ adapter ─▶ items ─┐
                                      ├─▶ GENERATE ─▶ solutions store ─▶ GRADE ─▶ gradings table
design.yaml ─▶ facet grid expansion ──┘   (inspect)   (parquet+logs)     (inspect) (long-format)
You describe what to evaluate (a benchmark + a facet grid) in one YAML file; itemeval expands the grid into conditions, runs generation and grading as two decoupled, resumable stages on inspect_ai, and exports one long-format row per grading event with scores, judge reasoning, tokens, and dollars.
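To make "expands the grid into conditions" concrete, here is a minimal sketch of a full crossing of facets, assuming each facet is a named list of levels. The `expand_grid` helper and the facet names are hypothetical and are not itemeval's API; the actual config keys are described on the Configuration page.

```python
from itertools import product


def expand_grid(facets):
    """Return one dict per condition: the Cartesian product of all facet levels."""
    names = list(facets)
    level_lists = (facets[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


# Hypothetical facets for illustration only (not itemeval's real schema).
facets = {
    "model": ["model-a", "model-b"],
    "prompt": ["plain", "cot"],
    "temperature": [0.0, 0.7],
}

conditions = expand_grid(facets)
print(len(conditions))  # 2 * 2 * 2 = 8 conditions
print(conditions[0])
```

Under this assumption, every item is evaluated once per condition (times any replications), which is why the number of facet levels multiplies directly into cost.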
Each tutorial is a complete, runnable use case, step by step. They build on each other but stand alone.
| Tutorial | What you learn | Cost |
|---|---|---|
| 1 — Score a verifiable benchmark | The full pipeline on AIME 2025: estimate → generate → grade → export | ~2¢ |
| 2 — Grade with an LLM judge | Rubric-based judging of open-ended answers; the judge output contract | a few ¢ |
| 3 — Compare models and prompts | A crossed design with replications; analyzing the export in pandas | tens of ¢ |
| 4 — Add a second judge at $0 generation | New graders/rubrics over stored solutions; judge agreement | a few ¢ |
| 5 — Scale up without surprises | Policies, the cost gate, hard caps, batch mode, resume, savings | you decide |
New here? Do Getting Started first (free, no API key), then Tutorial 1.
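As a rough picture of why the tutorials work with one row per grading event, the sketch below aggregates a handful of made-up long-format rows at two different levels. The column names (`item`, `model`, `replication`, `score`) and the data are assumptions for illustration; the real export schema is described on the Outputs and Schemas page.

```python
from collections import defaultdict
from statistics import mean

# Made-up grading events, one dict per row (hypothetical columns and values).
rows = [
    {"item": "q1", "model": "model-a", "replication": 1, "score": 1.0},
    {"item": "q1", "model": "model-a", "replication": 2, "score": 0.0},
    {"item": "q1", "model": "model-b", "replication": 1, "score": 1.0},
    {"item": "q1", "model": "model-b", "replication": 2, "score": 1.0},
    {"item": "q2", "model": "model-a", "replication": 1, "score": 0.0},
    {"item": "q2", "model": "model-a", "replication": 2, "score": 0.0},
]


def mean_score_by(rows, *keys):
    """Group rows by the given columns and average the score within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row["score"])
    return {group: mean(scores) for group, scores in groups.items()}


by_model = mean_score_by(rows, "model")               # aggregate view
by_item_model = mean_score_by(rows, "item", "model")  # item-level view
print(by_model)
print(by_item_model)
```

The aggregate view hides that model-a's misses are concentrated on one item; the item-level view keeps that information available for IRT or mixed-model analysis.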
| Page | What it covers |
|---|---|
| Getting Started | Install, run the free demo pipeline in 5 minutes |
| Pipeline Concepts | Items, facets, conditions, replications, two-stage design, resume & caching |
| Configuration | Complete YAML reference for every config field |
| CLI | The five commands, options, and exit codes |
| Python API | The same pipeline from import itemeval — functions, results, kwargs |
| Outputs and Schemas | Study directory layout, parquet stores, export table, manifests |
| Budget and Costs | Estimation, confirmation gate, policies, pricing, batch mode |
| Cost Savings | Every saving option, measured price/time trade-offs, defaults, direct API vs OpenRouter |
| Error Handling | Failure channels, reporting, exit codes, retry & resume semantics |
| Agent Guide | Driving itemeval from an AI agent: contract, guardrails, recovery |
| Architecture | Module map: what each file does and why it exists |
| FAQ | Common errors, troubleshooting, design rationale |
itemeval estimate configs/my_study.yaml # projected $ per stage, no model API calls
itemeval generate configs/my_study.yaml # stage 1: solutions (resumable)
itemeval grade configs/my_study.yaml # stage 2: gradings (resumable, re-runnable per grader x rubric)
itemeval export configs/my_study.yaml # long-format parquet + CSV + cost ledger
itemeval status configs/my_study.yaml # grid completion matrix, spend, manifests
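If you prefer to script these commands, a minimal sketch using the standard library is shown below. It assumes only the command names listed above and that a nonzero exit code means the stage should not be followed by the next one; the actual exit codes are documented on the CLI page, and any interactive cost confirmation is not handled here. The helper names `pipeline_commands` and `run_pipeline` are hypothetical.

```python
import subprocess

# Stage order as listed above: estimate, then generate, grade, export.
STAGES = ("estimate", "generate", "grade", "export")


def pipeline_commands(config_path):
    """Build the argv list for each stage without running anything."""
    return [["itemeval", stage, config_path] for stage in STAGES]


def run_pipeline(config_path):
    """Run the stages in order, stopping at the first nonzero exit code."""
    for argv in pipeline_commands(config_path):
        # check=True raises subprocess.CalledProcessError on a nonzero exit.
        subprocess.run(argv, check=True)


commands = pipeline_commands("configs/my_study.yaml")
for argv in commands:
    print(" ".join(argv))
```

Calling `run_pipeline("configs/my_study.yaml")` would execute the stages for real, so run `estimate` on its own first if you want to inspect the projected cost before spending anything.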
Pre-1.0, exactly four surfaces are stable-ish (minor versions may still break them, with changelog notice):
- The CLI commands and exit codes.
- The config YAML schema.
- The on-disk outputs (parquet schemas, manifest JSON, study layout).
- The Python API, meaning everything in itemeval.__all__: load_config, prepare_study, estimate_study, run_generate, run_grade, export_study, build_status, ExperimentConfig, Item, __version__ (see Python API).
Every _-prefixed module is internal and free to change.