github-actions[bot] edited this page Jun 10, 2026 · 7 revisions

itemeval Wiki

Editing note: This wiki is generated from docs/wiki/ in the main repo, which is the single source of truth. Always edit the pages there and push to main — a GitHub Action mirrors them here. Changes made directly in the wiki UI will be overwritten on the next sync.

Item-level LLM evaluation over any API, with built-in budget control — a thin layer on inspect_ai for studies that need every individual grading event (psychometrics, G-theory, IRT), never just aggregate scores.

benchmark source ─▶ adapter ─▶ items ─┐
                                      ├─▶ GENERATE ─▶ solutions store ─▶ GRADE ─▶ gradings table
design.yaml ─▶ facet grid expansion ──┘   (inspect)    (parquet+logs)   (inspect)  (long-format)

You describe what to evaluate (a benchmark + a facet grid) in one YAML file; itemeval expands the grid into conditions, runs generation and grading as two decoupled, resumable stages on inspect_ai, and exports one long-format row per grading event with scores, judge reasoning, tokens, and dollars.
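
To make "expands the grid into conditions" concrete, here is a minimal sketch of a full-factorial expansion. It is not itemeval's actual code: the function name expand_grid and the facet names and levels are hypothetical, and the real schema lives in design.yaml (see Configuration). It only illustrates the idea that each condition is one combination of facet levels.

```python
from itertools import product


def expand_grid(facets):
    """Return one condition dict per combination of facet levels.

    `facets` maps a facet name to its list of levels. This is an
    illustrative helper, not part of the itemeval API.
    """
    names = list(facets)
    level_lists = [facets[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


# Hypothetical facets; real names and levels come from your design.yaml.
facets = {
    "model": ["model-a", "model-b"],
    "temperature": [0.0, 0.7],
    "prompt_style": ["plain", "cot"],
}

conditions = expand_grid(facets)
print(len(conditions))  # 2 * 2 * 2 = 8 conditions
print(conditions[0])    # first combination, in facet order
```

Under this reading, the number of conditions is the product of the facet level counts, which is why the grid size drives projected cost.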

Pages

Page                 What it covers
Getting Started      Install, run the free demo pipeline in 5 minutes
Pipeline Concepts    Items, facets, conditions, replications, two-stage design, resume & caching
Configuration        Complete YAML reference for every config field
CLI                  The five commands, options, and exit codes
Python API           The same pipeline from import itemeval: functions, results, kwargs
Outputs and Schemas  Study directory layout, parquet stores, export table, manifests
Budget and Costs     Estimation, confirmation gate, policies, pricing, batch mode
Error Handling       Failure channels, reporting, exit codes, retry & resume semantics
Architecture         Module map: what each file does and why it exists
FAQ                  Common errors, troubleshooting, design rationale

The five commands

itemeval estimate configs/my_study.yaml   # projected $ per stage, no model API calls
itemeval generate configs/my_study.yaml   # stage 1: solutions (resumable)
itemeval grade    configs/my_study.yaml   # stage 2: gradings (resumable, re-runnable per grader x rubric)
itemeval export   configs/my_study.yaml   # long-format parquet + CSV + cost ledger
itemeval status   configs/my_study.yaml   # grid completion matrix, spend, manifests
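
If you want to chain the stages from a script, one possible approach is a small driver that runs the commands above in order and stops at the first non-zero exit code. This is a sketch under assumptions: build_commands and run_pipeline are hypothetical helpers, not part of itemeval; it treats any non-zero exit code as "stop" (the CLI page documents what each code actually means); and it does not handle interactive prompts such as a budget confirmation.

```python
import subprocess

# Stage order taken from the command list above; `status` is left out
# because it only reports progress.
STAGES = ["estimate", "generate", "grade", "export"]


def build_commands(config_path, stages=STAGES):
    """Build one argv list per stage, e.g. ['itemeval', 'grade', path]."""
    return [["itemeval", stage, config_path] for stage in stages]


def run_pipeline(config_path, runner=subprocess.run):
    """Run each stage in order; return the first non-zero exit code, else 0.

    `runner` is injectable so the control flow can be exercised without
    the itemeval CLI being installed.
    """
    for cmd in build_commands(config_path):
        result = runner(cmd)
        if result.returncode != 0:
            # Assumption: any non-zero code means the stage did not finish.
            return result.returncode
    return 0
```

Because generate and grade are described as resumable, re-running the same driver after a failure should pick up where the failed stage left off rather than starting over.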

Stability promises

Pre-1.0, exactly four surfaces are stable-ish (minor versions may still break them, with changelog notice):

  1. The CLI commands and exit codes.
  2. The config YAML schema.
  3. The on-disk outputs (parquet schemas, manifest JSON, study layout).
  4. The Python API, meaning everything in itemeval.__all__: load_config, prepare_study, estimate_study, run_generate, run_grade, export_study, build_status, ExperimentConfig, Item, __version__ (see the Python API page).

Every _-prefixed module is internal and free to change.
