Home

itemeval Wiki

Editing note: This wiki is generated from docs/wiki/ in the main repo, which is the single source of truth. Always edit the pages there and push to main — a GitHub Action mirrors them here. Changes made directly in the wiki UI will be overwritten on the next sync.

Item-level LLM evaluation over any API, with built-in budget control. itemeval is a thin layer on inspect_ai for studies that need every individual grading event (psychometrics, IRT, mixed models, judge reliability) rather than only aggregate scores.

benchmark source ─▶ adapter ─▶ items ─┐
                                      ├─▶ GENERATE ─▶ solutions store ─▶ GRADE ─▶ gradings table
design.yaml ─▶ facet grid expansion ──┘   (inspect)    (parquet+logs)   (inspect)  (long-format)

You describe what to evaluate (a benchmark + a facet grid) in one YAML file; itemeval expands the grid into conditions, runs generation and grading as two decoupled, resumable stages on inspect_ai, and exports one long-format row per grading event with scores, judge reasoning, tokens, and dollars.
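To make the grid expansion concrete, here is a minimal sketch of how a facet grid can turn into conditions via a Cartesian product. The facet names (`model`, `prompt`, `temperature`), their levels, and the `expand_grid` helper are hypothetical illustrations, not itemeval's actual schema or API; the Configuration page documents the real fields.

```python
from itertools import product


def expand_grid(facets):
    """Expand a facet grid into one condition dict per combination of levels.

    `facets` maps a facet name to its list of levels. The names used below
    are hypothetical; itemeval's real config schema may differ.
    """
    names = list(facets)
    level_lists = [facets[name] for name in names]
    return [dict(zip(names, levels)) for levels in product(*level_lists)]


# Hypothetical facet grid: 2 models x 2 prompts x 1 temperature = 4 conditions.
grid = {
    "model": ["model-a", "model-b"],
    "prompt": ["plain", "cot"],
    "temperature": [0.0],
}
conditions = expand_grid(grid)
for condition in conditions:
    print(condition)
```

Because the number of conditions is the product of the facet sizes, adding one more level to any facet multiplies the work, which is a good reason to run `itemeval estimate` before generating.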

Tutorials — learn by running

Each tutorial is a complete, runnable use case, step by step. They build on each other but stand alone.

Tutorial	What you learn	Cost
1 — Score a verifiable benchmark	The full pipeline on AIME 2025: estimate → generate → grade → export	~2¢
2 — Grade with an LLM judge	Rubric-based judging of open-ended answers; the judge output contract	a few ¢
3 — Compare models and prompts	A crossed design with replications; analyzing the export in pandas	tens of ¢
4 — Add a second judge at $0 generation	New graders/rubrics over stored solutions; judge agreement	a few ¢
5 — Scale up without surprises	Policies, the cost gate, hard caps, batch mode, resume, savings	you decide

New here? Do Getting Started first (free, no API key), then Tutorial 1.
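
As a taste of the kind of analysis Tutorials 3 and 4 work toward, the sketch below computes a mean score per condition and a simple percent agreement between two judges from long-format rows. The column names (`condition`, `item_id`, `grader`, `score`) and the data are made up for illustration; the real export schema is described in Outputs and Schemas. It uses only the standard library; the same grouping is a short `groupby` in pandas.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format rows: one row per grading event.
rows = [
    {"condition": "model-a/plain", "item_id": "q1", "grader": "judge-1", "score": 1},
    {"condition": "model-a/plain", "item_id": "q1", "grader": "judge-2", "score": 1},
    {"condition": "model-a/plain", "item_id": "q2", "grader": "judge-1", "score": 0},
    {"condition": "model-a/plain", "item_id": "q2", "grader": "judge-2", "score": 1},
    {"condition": "model-b/plain", "item_id": "q1", "grader": "judge-1", "score": 1},
    {"condition": "model-b/plain", "item_id": "q1", "grader": "judge-2", "score": 1},
]


def mean_score_by(rows, key):
    """Average the `score` column within each distinct value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["score"])
    return {value: mean(scores) for value, scores in groups.items()}


def percent_agreement(rows, grader_a, grader_b):
    """Share of (condition, item) pairs on which both graders gave the same score."""
    by_pair = defaultdict(dict)
    for row in rows:
        by_pair[(row["condition"], row["item_id"])][row["grader"]] = row["score"]
    paired = [s for s in by_pair.values() if grader_a in s and grader_b in s]
    if not paired:
        raise ValueError("no items were graded by both graders")
    return sum(s[grader_a] == s[grader_b] for s in paired) / len(paired)


print(mean_score_by(rows, "condition"))
print(percent_agreement(rows, "judge-1", "judge-2"))
```

Percent agreement is the crudest reliability measure; it is used here only because it fits in a few lines.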

Reference pages

Page	What it covers
Getting Started	Install, run the free demo pipeline in 5 minutes
Pipeline Concepts	Items, facets, conditions, replications, two-stage design, resume & caching
Configuration	Complete YAML reference for every config field
CLI	The five commands, options, and exit codes
Python API	The same pipeline from `import itemeval` — functions, results, kwargs
Outputs and Schemas	Study directory layout, parquet stores, export table, manifests
Budget and Costs	Estimation, confirmation gate, policies, pricing, batch mode
Cost Savings	Every saving option, measured price/time trade-offs, defaults, direct API vs OpenRouter
Error Handling	Failure channels, reporting, exit codes, retry & resume semantics
Agent Guide	Driving itemeval from an AI agent: contract, guardrails, recovery
Architecture	Module map: what each file does and why it exists
FAQ	Common errors, troubleshooting, design rationale

The five commands

itemeval estimate configs/my_study.yaml   # projected $ per stage, no model API calls
itemeval generate configs/my_study.yaml   # stage 1: solutions (resumable)
itemeval grade    configs/my_study.yaml   # stage 2: gradings (resumable, re-runnable per grader x rubric)
itemeval export   configs/my_study.yaml   # long-format parquet + CSV + cost ledger
itemeval status   configs/my_study.yaml   # grid completion matrix, spend, manifests
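
For scripted runs, the stages can be chained from Python. The following is a minimal sketch, not an official driver: it assumes that a non-zero exit code means a stage did not complete (the CLI page documents the actual codes) and that the commands can run without interactive input, which may not hold if the cost confirmation gate asks for approval. The helper names `build_commands` and `run_pipeline` are hypothetical.

```python
import subprocess

# Stage order taken from the command list above; `status` is left out
# because it only reports progress.
STAGES = ("estimate", "generate", "grade", "export")


def build_commands(config_path, stages=STAGES):
    """Return one argv list per stage, e.g. ['itemeval', 'estimate', path]."""
    return [["itemeval", stage, config_path] for stage in stages]


def run_pipeline(config_path, stages=STAGES):
    """Run each stage in order and stop at the first non-zero exit code.

    Assumption: a non-zero exit code means the stage did not complete.
    Returns 0 if every stage succeeded, otherwise the failing exit code.
    """
    for argv in build_commands(config_path, stages):
        result = subprocess.run(argv)
        if result.returncode != 0:
            return result.returncode
    return 0


# Print the commands without executing them (a dry run).
for argv in build_commands("configs/my_study.yaml"):
    print(" ".join(argv))
```

Since both generate and grade are described as resumable, re-running `run_pipeline` after a failure should pick up where the failed stage stopped rather than start over.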

Stability promises

Pre-1.0, exactly four surfaces are stable-ish (minor versions may still break them, with changelog notice):

The CLI commands and exit codes.
The config YAML schema.
The on-disk outputs (parquet schemas, manifest JSON, study layout).
The Python API: everything in itemeval.__all__, namely load_config, prepare_study, estimate_study, run_generate, run_grade, export_study, build_status, ExperimentConfig, Item, and __version__ (see the Python API page).

Every _-prefixed module is internal and free to change.
