# CLI Reference ``` itemeval init DIR [options] itemeval {estimate,generate,grade,export,status} CONFIG [options] ``` `init` scaffolds a new study into `DIR`; every other command takes the config YAML path as its argument. `itemeval` is installed as a console script; `python -m itemeval.cli` is equivalent. The run/report commands (`estimate|generate|grade|export|status`) accept `-C/--base-dir DIR` to set the **work directory** that anchors outputs (the `studies/` tree). It defaults to the current directory; inputs (prompts/rubrics) always resolve relative to the config file, independent of `-C`. `estimate`, `generate`, `grade`, and `status` also accept `--policy {dev,full-interactive,full-batch}`, overriding `budget.policy` for that invocation only (see [Budget-and-Costs#policies](Budget-and-Costs.md#policies)). ## Exit codes (all commands) | Code | Meaning | |------|---------| | 0 | success | | 1 | unexpected error, or at least one condition failed during a run | | 2 | config / template / adapter error (and argparse usage errors) | | 3 | cost gate declined, or confirmation required in a non-interactive shell | | 4 | projected cost exceeds `budget.max_usd` (hard cap; `--yes` does not override) | ## `init` — scaffold a new study ``` itemeval init DIR [--with-templates] [--force] ``` Writes `DIR/config.yaml` — a runnable starter study (mock provider, the USAMO demo dataset, `builtin:` template references) named after `DIR`. Refuses to overwrite an existing `config.yaml` unless `--force`. With `--with-templates`, also copies the referenced built-in prompts/rubrics into `DIR/prompts/` and `DIR/rubrics/` and rewrites the config to reference those local copies (bare names). Then `cd DIR && itemeval status config.yaml`. ## `estimate` — projected cost, no model API calls ``` itemeval estimate CONFIG [--stage {generate,grade,all}] [--refresh-pricing] [--json] ``` Prints per-stage and per-condition projections (calls, tokens, USD), flags unpriced models, and warns when generation is uncapped (no `max_tokens`). `--refresh-pricing` pulls current per-token prices from the OpenRouter API into a local cache first. Projections show the **full** policy-effective grid plus the **remaining** figure (completed cells subtracted — what the next run can actually spend; the money gate operates on it). See [Budget-and-Costs#estimation](Budget-and-Costs.md#estimation). ## `generate` — stage 1 (resumable) ``` itemeval generate CONFIG [-y/--yes] [--force] [--condition F]... [--display {none,plain,rich,full}] [--json] [--policy P] [--wave LABEL] ``` Flow: estimate → print projection → **gate** → run each generate condition serially → upsert solutions, log index, ledger → write manifest. Conditions already complete print `skipped: complete`. A condition whose eval fails is reported and the rest continue (final exit 1). - `-y/--yes` confirms the gate non-interactively (never overrides `max_usd`). - `--force` re-runs completed work (rows are replaced, not duplicated). - `--condition F` (repeatable) selects conditions by exact id, id prefix, or slug — e.g. `--condition gpt-5-mini_minimal_default`. - `--display` passes through to inspect (default `none`; try `rich` interactively). - `--json` emits the run result as a single JSON document on stdout (the `GenerateResult` fields plus `pricing`, `estimate_usd`, and the `gate` outcome) and silences the live display unless `--display` is passed. A gate stop still emits a JSON document (projected cost, gate reason, the `--yes` rerun command) before exiting 3/4. - `--wave LABEL` re-observes the current scope as a new epoch block, keeping both observations (see [Pipeline-Concepts#waves](Pipeline-Concepts.md#waves)). ## `grade` — stage 2 (resumable, re-runnable) ``` itemeval grade CONFIG [-y/--yes] [--force] [--condition F]... [--grader N]... [--rubric N]... [--display ...] [--json] [--policy P] [--wave LABEL] ``` Same flow over grade conditions (including `--json` and `--wave`, as on `generate`). Verifiable conditions cost $0 and need no model. `--grader`/`--rubric` (repeatable) narrow to specific judges/rubrics — useful for adding a new grader over existing solutions. The summary line reports `parse_failures` (rows kept with `parse_ok=false`). ## `export` — analysis-ready tables ``` itemeval export CONFIG [--snapshot NAME] [--json] ``` Joins gradings × solutions into `export/gradings_long.parquet` (one row per grading event, 47 columns) plus a byte-equivalent CSV and `ledger.csv`. Prints per-stage spend and the internal reconciliation verdict (ledger totals vs row sums; reconciliation against provider dashboards is a manual step). `--snapshot NAME` additionally freezes an immutable copy under `export/snapshots/NAME/` (tables, locks, covering manifests, `snapshot.json`, `STUDY_CARD.md`); an existing name is refused with exit 2. See [Outputs-and-Schemas#snapshots](Outputs-and-Schemas.md#snapshots). ## `status` — completion matrix, no model API calls ``` itemeval status CONFIG [--json] ``` Prints datasets (id @ revision, item counts), the policy-effective scope, both condition tables with `done/expected`, error and parse-failure counts, spend per stage, and manifest count. Both tables are scoped to the current grid at the current scope (wave 0); studies with more than one wave get an extra `waves:` line with per-wave gen/graded counts ([Pipeline-Concepts#waves](Pipeline-Concepts.md#waves)). `--json` emits the full structured report (also available for `estimate` and `export`). ## Hints Commands may end with up to two dim `hint:` lines on **stderr** — one observed fact from this run plus a doc pointer (`hint: — learn more: `). Hints never change behavior and never block. Set `ITEMEVAL_HINTS=off` to silence them in the text rendering; under `--json` they always ride as structured data in the `hints` array (`{code, message, learn_more}`), uncapped. Hint codes are stable and append-only: | Code | Fires when | |---|---| | `cache-zero-reads` | same-prefix calls were scheduled for a provider cache discount but none engaged | | `empty-solutions` | completions finished without an API error but produced no gradable text | | `unpriced-models` | a model has no pricing-table entry (dollar columns missing; run unaffected) | ## Typical session ```bash itemeval estimate cfg.yaml --refresh-pricing # sanity-check projected $ itemeval generate cfg.yaml # prompts if above confirm_above_usd # ... interrupted? just re-run; completed conditions skip ... itemeval generate cfg.yaml itemeval grade cfg.yaml itemeval grade cfg.yaml --grader second_judge # later: new judge, $0 generation itemeval export cfg.yaml itemeval status cfg.yaml ```