github-actions[bot] edited this page Jun 12, 2026 · 7 revisions

itemeval Wiki

Editing note: This wiki is generated from docs/wiki/ in the main repo, which is the single source of truth. Always edit the pages there and push to main — a GitHub Action mirrors them here. Changes made directly in the wiki UI will be overwritten on the next sync.

Item-level LLM evaluation over any API, with built-in budget control: a thin layer on inspect_ai for studies that need every individual grading event (psychometrics, IRT, mixed models, judge reliability), not just aggregate scores.

benchmark source ─▶ adapter ─▶ items ─┐
                                      ├─▶ GENERATE ─▶ solutions store ─▶ GRADE ─▶ gradings table
design.yaml ─▶ facet grid expansion ──┘   (inspect)    (parquet+logs)   (inspect)  (long-format)

You describe what to evaluate (a benchmark + a facet grid) in one YAML file; itemeval expands the grid into conditions, runs generation and grading as two decoupled, resumable stages on inspect_ai, and exports one long-format row per grading event with scores, judge reasoning, tokens, and dollars.
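As a rough illustration of what "expanding the grid into conditions" means, the sketch below takes a facet grid and returns one condition per combination of facet levels (a Cartesian product). The function name `expand_grid` and the facet names and levels are hypothetical and chosen only for this example; the actual config fields are documented on the Configuration page, and itemeval's own expansion may differ in detail.

```python
from itertools import product


def expand_grid(facets: dict[str, list]) -> list[dict]:
    """Return one condition per combination of facet levels."""
    names = list(facets)
    levels = (facets[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*levels)]


# Hypothetical facets, not itemeval's real schema.
facets = {
    "model": ["model-a", "model-b"],
    "prompt": ["plain", "cot"],
    "temperature": [0.0, 0.7],
}

conditions = expand_grid(facets)
print(len(conditions))  # 2 * 2 * 2 = 8 conditions
print(conditions[0])    # first combination of levels
```

Each condition would then be crossed with the items (and any replications), which is why the size of the grid drives cost so directly.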

Tutorials — learn by running

Each tutorial is a complete, runnable use case, step by step. They build on each other but stand alone.

| Tutorial | What you learn | Cost |
| --- | --- | --- |
| 1: Score a verifiable benchmark | The full pipeline on AIME 2025: estimate → generate → grade → export | ~2¢ |
| 2: Grade with an LLM judge | Rubric-based judging of open-ended answers; the judge output contract | a few ¢ |
| 3: Compare models and prompts | A crossed design with replications; analyzing the export in pandas | tens of ¢ |
| 4: Add a second judge at $0 generation | New graders/rubrics over stored solutions; judge agreement | a few ¢ |
| 5: Scale up without surprises | Policies, the cost gate, hard caps, batch mode, resume, savings | you decide |

New here? Do Getting Started first (free, no API key), then Tutorial 1.

Reference pages

| Page | What it covers |
| --- | --- |
| Getting Started | Install, run the free demo pipeline in 5 minutes |
| Pipeline Concepts | Items, facets, conditions, replications, two-stage design, resume & caching |
| Configuration | Complete YAML reference for every config field |
| CLI | The five commands, options, and exit codes |
| Python API | The same pipeline from `import itemeval`: functions, results, kwargs |
| Outputs and Schemas | Study directory layout, parquet stores, export table, manifests |
| Budget and Costs | Estimation, confirmation gate, policies, pricing, batch mode |
| Cost Savings | Every saving option, measured price/time trade-offs, defaults, direct API vs OpenRouter |
| Error Handling | Failure channels, reporting, exit codes, retry & resume semantics |
| Agent Guide | Driving itemeval from an AI agent: contract, guardrails, recovery |
| Architecture | Module map: what each file does and why it exists |
| FAQ | Common errors, troubleshooting, design rationale |

The five commands

itemeval estimate configs/my_study.yaml   # projected $ per stage, no model API calls
itemeval generate configs/my_study.yaml   # stage 1: solutions (resumable)
itemeval grade    configs/my_study.yaml   # stage 2: gradings (resumable, re-runnable per grader x rubric)
itemeval export   configs/my_study.yaml   # long-format parquet + CSV + cost ledger
itemeval status   configs/my_study.yaml   # grid completion matrix, spend, manifests
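
For scripted runs, the commands above can be chained from Python. This is a minimal sketch, assuming each command exits nonzero on failure and needs no interactive input (the actual exit codes and any confirmation behavior are described on the CLI and Budget and Costs pages). The helpers `pipeline_commands` and `run_pipeline` are hypothetical names for this example, not part of itemeval.

```python
import subprocess

# Subcommands in the order listed above.
STAGES = ["estimate", "generate", "grade", "export", "status"]


def pipeline_commands(config_path: str) -> list[list[str]]:
    """Build the argv list for each itemeval stage."""
    return [["itemeval", stage, config_path] for stage in STAGES]


def run_pipeline(config_path: str) -> None:
    """Run the stages in order, stopping at the first failure."""
    for argv in pipeline_commands(config_path):
        # check=True raises CalledProcessError on a nonzero exit code.
        subprocess.run(argv, check=True)


# Preview the commands without executing anything.
for argv in pipeline_commands("configs/my_study.yaml"):
    print(" ".join(argv))
```

Calling `run_pipeline("configs/my_study.yaml")` would execute the stages for real. Because generate and grade are resumable, re-running after a failure should pick up where the previous run stopped.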

Stability promises

Pre-1.0, exactly four surfaces are stable-ish (minor versions may still break them, with changelog notice):

  1. The CLI commands and exit codes.
  2. The config YAML schema.
  3. The on-disk outputs (parquet schemas, manifest JSON, study layout).
  4. The Python API: everything in itemeval.__all__, namely load_config, prepare_study, estimate_study, run_generate, run_grade, export_study, build_status, ExperimentConfig, Item, and __version__ (see the Python API page).

Every _-prefixed module is internal and free to change.
