Home

itemeval Wiki

Editing note: This wiki is generated from docs/wiki/ in the main repo, which is the single source of truth. Always edit the pages there and push to main — a GitHub Action mirrors them here. Changes made directly in the wiki UI will be overwritten on the next sync.

Item-level LLM evaluation over any API, with built-in budget control: a thin layer on inspect_ai for studies that need every individual grading event (psychometrics, G-theory, IRT), not just aggregate scores.

benchmark source ─▶ adapter ─▶ items ─┐
                                      ├─▶ GENERATE ─▶ solutions store ─▶ GRADE ─▶ gradings table
design.yaml ─▶ facet grid expansion ──┘   (inspect)    (parquet+logs)   (inspect)  (long-format)

You describe what to evaluate (a benchmark + a facet grid) in one YAML file; itemeval expands the grid into conditions, runs generation and grading as two decoupled, resumable stages on inspect_ai, and exports one long-format row per grading event with scores, judge reasoning, tokens, and dollars.
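
To make the grid expansion concrete, here is a minimal sketch of turning a facet grid into conditions as a Cartesian product of facet levels. The facet names and levels (`model`, `temperature`, `prompt_style`) and the helper `expand_grid` are hypothetical illustrations, not itemeval's actual config keys or internals; see the Configuration page for the real schema.

```python
from itertools import product

# Hypothetical facet grid: names and levels are illustrative only,
# not itemeval's actual config keys.
facets = {
    "model": ["model-a", "model-b"],
    "temperature": [0.0, 0.7],
    "prompt_style": ["plain", "cot"],
}


def expand_grid(facets):
    """Return one dict per condition: the Cartesian product of facet levels."""
    names = list(facets)
    return [dict(zip(names, levels)) for levels in product(*facets.values())]


conditions = expand_grid(facets)
print(len(conditions))  # 2 * 2 * 2 = 8 conditions
print(conditions[0])
```

In this sketch, each condition is one cell of the grid. Adding a level to any facet multiplies the number of conditions, which is why projecting spend before running is useful.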

Pages

Page	What it covers
Getting Started	Install, run the free demo pipeline in 5 minutes
Pipeline Concepts	Items, facets, conditions, replications, two-stage design, resume & caching
Configuration	Complete YAML reference for every config field
CLI	The five commands, options, and exit codes
Python API	The same pipeline from `import itemeval` — functions, results, kwargs
Outputs and Schemas	Study directory layout, parquet stores, export table, manifests
Budget and Costs	Estimation, confirmation gate, policies, pricing, batch mode
Error Handling	Failure channels, reporting, exit codes, retry & resume semantics
Architecture	Module map: what each file does and why it exists
FAQ	Common errors, troubleshooting, design rationale

The five commands

itemeval estimate configs/my_study.yaml   # projected $ per stage, no model API calls
itemeval generate configs/my_study.yaml   # stage 1: solutions (resumable)
itemeval grade    configs/my_study.yaml   # stage 2: gradings (resumable, re-runnable per grader x rubric)
itemeval export   configs/my_study.yaml   # long-format parquet + CSV + cost ledger
itemeval status   configs/my_study.yaml   # grid completion matrix, spend, manifests
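
The commands above can also be chained from a script. The following is a rough sketch, assuming the `itemeval` executable is on `PATH` and that each command signals failure with a non-zero exit code; the helpers `build_commands` and `run_pipeline` are hypothetical, and the sketch does not handle any interactive budget confirmation the CLI may ask for.

```python
import subprocess

# Stage order follows the command list above; "status" is omitted because it
# only reports progress and does not advance the pipeline.
STAGES = ("estimate", "generate", "grade", "export")


def build_commands(config_path, stages=STAGES):
    """Return one argv list per stage, e.g. ["itemeval", "generate", config_path]."""
    return [["itemeval", stage, config_path] for stage in stages]


def run_pipeline(config_path):
    """Run each stage in order and stop at the first failing one."""
    for argv in build_commands(config_path):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(argv, check=True)


commands = build_commands("configs/my_study.yaml")
for argv in commands:
    print(" ".join(argv))
```

Calling `run_pipeline("configs/my_study.yaml")` would execute the stages for real. Because generation and grading are resumable, re-running after a failure should pick up where the previous run stopped.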

Stability promises

Pre-1.0, exactly four surfaces are stable-ish (minor versions may still break them, with changelog notice):

The CLI commands and exit codes.
The config YAML schema.
The on-disk outputs (parquet schemas, manifest JSON, study layout).
The Python API, meaning everything in itemeval.__all__: load_config, prepare_study, estimate_study, run_generate, run_grade, export_study, build_status, ExperimentConfig, Item, __version__ (see the Python API page).

Every _-prefixed module is internal and free to change.
