CLI

CLI Reference

itemeval init DIR [options]
itemeval {estimate,generate,grade,export,status} CONFIG [options]

init scaffolds a new study into DIR; every other command takes the config YAML path as its argument. itemeval is installed as a console script; python -m itemeval.cli is equivalent.

The run/report commands (estimate|generate|grade|export|status) accept -C/--base-dir DIR to set the work directory that anchors outputs (the studies/ tree). It defaults to the current directory; inputs (prompts/rubrics) always resolve relative to the config file, independent of -C.

estimate, generate, grade, and status also accept --policy {dev,full-interactive,full-batch}, overriding budget.policy for that invocation only (see Budget-and-Costs#policies).

Exit codes (all commands)

Code	Meaning
0	success
1	unexpected error, or at least one condition failed during a run
2	config / template / adapter error (and argparse usage errors)
3	cost gate declined, or confirmation required in a non-interactive shell
4	projected cost exceeds `budget.max_usd` (hard cap; `--yes` does not override)

`init` — scaffold a new study

itemeval init DIR [--with-templates] [--force]

Writes DIR/config.yaml — a runnable starter study (mock provider, the USAMO demo dataset, builtin: template references) named after DIR. Refuses to overwrite an existing config.yaml unless --force. With --with-templates, also copies the referenced built-in prompts/rubrics into DIR/prompts/ and DIR/rubrics/ and rewrites the config to reference those local copies (bare names). Then cd DIR && itemeval status config.yaml.

`estimate` — projected cost, no model API calls

itemeval estimate CONFIG [--stage {generate,grade,all}] [--refresh-pricing] [--json]

Prints per-stage and per-condition projections (calls, tokens, USD), flags unpriced models, and warns when generation is uncapped (no max_tokens). --refresh-pricing pulls current per-token prices from the OpenRouter API into a local cache first. Projections show the full policy-effective grid plus the remaining figure (completed cells subtracted — what the next run can actually spend; the money gate operates on it). See Budget-and-Costs#estimation.

`generate` — stage 1 (resumable)

itemeval generate CONFIG [-y/--yes] [--force] [--condition F]...
                  [--display {none,plain,rich,full}] [--json]
                  [--policy P] [--wave LABEL]

Flow: estimate → print projection → gate → run each generate condition serially → upsert solutions, log index, ledger → write manifest. Conditions already complete print skipped: complete. A condition whose eval fails is reported and the rest continue (final exit 1).

-y/--yes confirms the gate non-interactively (never overrides max_usd).
--force re-runs completed work (rows are replaced, not duplicated).
--condition F (repeatable) selects conditions by exact id, id prefix, or slug — e.g. --condition gpt-5-mini_minimal_default.
--display passes through to inspect (default none; try rich interactively).
--json emits the run result as a single JSON document on stdout (the GenerateResult fields plus pricing, estimate_usd, and the gate outcome) and silences the live display unless --display is passed. A gate stop still emits a JSON document (projected cost, gate reason, the --yes rerun command) before exiting 3/4.
--wave LABEL re-observes the current scope as a new epoch block, keeping both observations (see Pipeline-Concepts#waves).

`grade` — stage 2 (resumable, re-runnable)

itemeval grade CONFIG [-y/--yes] [--force] [--condition F]...
               [--grader N]... [--rubric N]... [--display ...] [--json]
               [--policy P] [--wave LABEL]

Same flow over grade conditions (including --json and --wave, as on generate). Verifiable conditions cost $0 and need no model. --grader/--rubric (repeatable) narrow to specific judges/rubrics — useful for adding a new grader over existing solutions. The summary line reports parse_failures (rows kept with parse_ok=false).

`export` — analysis-ready tables

itemeval export CONFIG [--snapshot NAME] [--json]

Joins gradings × solutions into export/gradings_long.parquet (one row per grading event, 47 columns) plus a byte-equivalent CSV and ledger.csv. Prints per-stage spend and the internal reconciliation verdict (ledger totals vs row sums; reconciliation against provider dashboards is a manual step).

--snapshot NAME additionally freezes an immutable copy under export/snapshots/NAME/ (tables, locks, covering manifests, snapshot.json, STUDY_CARD.md); an existing name is refused with exit 2. See Outputs-and-Schemas#snapshots.

`status` — completion matrix, no model API calls

itemeval status CONFIG [--json]

Prints datasets (id @ revision, item counts), the policy-effective scope, both condition tables with done/expected, error and parse-failure counts, spend per stage, and manifest count. Both tables are scoped to the current grid at the current scope (wave 0); studies with more than one wave get an extra waves: line with per-wave gen/graded counts (Pipeline-Concepts#waves). --json emits the full structured report (also available for estimate and export).

Hints

Commands may end with up to two dim hint: lines on stderr — one observed fact from this run plus a doc pointer (hint: <fact> — learn more: <wiki-page#anchor>). Hints never change behavior and never block. Set ITEMEVAL_HINTS=off to silence them in the text rendering; under --json they always ride as structured data in the hints array ({code, message, learn_more}), uncapped.

Hint codes are stable and append-only:

Code	Fires when
`cache-zero-reads`	same-prefix calls were scheduled for a provider cache discount but none engaged
`empty-solutions`	completions finished without an API error but produced no gradable text
`unpriced-models`	a model has no pricing-table entry (dollar columns missing; run unaffected)

Typical session

itemeval estimate cfg.yaml --refresh-pricing   # sanity-check projected $
itemeval generate cfg.yaml                     # prompts if above confirm_above_usd
# ... interrupted? just re-run; completed conditions skip ...
itemeval generate cfg.yaml
itemeval grade    cfg.yaml
itemeval grade    cfg.yaml --grader second_judge   # later: new judge, $0 generation
itemeval export   cfg.yaml
itemeval status   cfg.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CLI

CLI Reference

Exit codes (all commands)

`init` — scaffold a new study

`estimate` — projected cost, no model API calls

`generate` — stage 1 (resumable)

`grade` — stage 2 (resumable, re-runnable)

`export` — analysis-ready tables

`status` — completion matrix, no model API calls

Hints

Typical session

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally

CLI

CLI Reference

Exit codes (all commands)

init — scaffold a new study

estimate — projected cost, no model API calls

generate — stage 1 (resumable)

grade — stage 2 (resumable, re-runnable)

export — analysis-ready tables

status — completion matrix, no model API calls

Hints

Typical session

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally

`init` — scaffold a new study

`estimate` — projected cost, no model API calls

`generate` — stage 1 (resumable)

`grade` — stage 2 (resumable, re-runnable)

`export` — analysis-ready tables

`status` — completion matrix, no model API calls