-
Notifications
You must be signed in to change notification settings - Fork 0
CLI
itemeval init DIR [options]
itemeval {estimate,generate,grade,export,status} CONFIG [options]
init scaffolds a new study into DIR; every other command takes the config
YAML path as its argument. itemeval is installed as a console script;
python -m itemeval.cli is equivalent.
The run/report commands (estimate|generate|grade|export|status) accept
-C/--base-dir DIR to set the work directory that anchors outputs (the
studies/ tree). It defaults to the current directory; inputs (prompts/rubrics)
always resolve relative to the config file, independent of -C.
estimate, generate, grade, and status also accept
--policy {dev,full-interactive,full-batch}, overriding budget.policy for
that invocation only (see
Budget-and-Costs#policies).
| Code | Meaning |
|---|---|
| 0 | success |
| 1 | unexpected error, or at least one condition failed during a run |
| 2 | config / template / adapter error (and argparse usage errors) |
| 3 | cost gate declined, or confirmation required in a non-interactive shell |
| 4 | projected cost exceeds budget.max_usd (hard cap; --yes does not override) |
itemeval init DIR [--with-templates] [--force]
Writes DIR/config.yaml — a runnable starter study (mock provider, the USAMO
demo dataset, builtin: template references) named after DIR. Refuses to
overwrite an existing config.yaml unless --force. With --with-templates,
also copies the referenced built-in prompts/rubrics into DIR/prompts/ and
DIR/rubrics/ and rewrites the config to reference those local copies (bare
names). Then cd DIR && itemeval status config.yaml.
itemeval estimate CONFIG [--stage {generate,grade,all}] [--refresh-pricing] [--json]
Prints per-stage and per-condition projections (calls, tokens, USD), flags
unpriced models, and warns when generation is uncapped (no max_tokens).
--refresh-pricing pulls current per-token prices from the OpenRouter API
into a local cache first. Projections show the full policy-effective
grid plus the remaining figure (completed cells subtracted — what the
next run can actually spend; the money gate operates on it). See
Budget-and-Costs#estimation.
itemeval generate CONFIG [-y/--yes] [--force] [--condition F]...
[--display {none,plain,rich,full}] [--json]
[--policy P] [--wave LABEL]
Flow: estimate → print projection → gate → run each generate condition
serially → upsert solutions, log index, ledger → write manifest. Conditions
already complete print skipped: complete. A condition whose eval fails is
reported and the rest continue (final exit 1).
-
-y/--yesconfirms the gate non-interactively (never overridesmax_usd). -
--forcere-runs completed work (rows are replaced, not duplicated). -
--condition F(repeatable) selects conditions by exact id, id prefix, or slug — e.g.--condition gpt-5-mini_minimal_default. -
--displaypasses through to inspect (defaultnone; tryrichinteractively). -
--jsonemits the run result as a single JSON document on stdout (theGenerateResultfields pluspricing,estimate_usd, and thegateoutcome) and silences the live display unless--displayis passed. A gate stop still emits a JSON document (projected cost, gate reason, the--yesrerun command) before exiting 3/4. -
--wave LABELre-observes the current scope as a new epoch block, keeping both observations (see Pipeline-Concepts#waves).
itemeval grade CONFIG [-y/--yes] [--force] [--condition F]...
[--grader N]... [--rubric N]... [--display ...] [--json]
[--policy P] [--wave LABEL]
Same flow over grade conditions (including --json and --wave, as on
generate). Verifiable conditions cost $0 and need no
model. --grader/--rubric (repeatable) narrow to specific judges/rubrics —
useful for adding a new grader over existing solutions. The summary line
reports parse_failures (rows kept with parse_ok=false).
itemeval export CONFIG [--snapshot NAME] [--json]
Joins gradings × solutions into export/gradings_long.parquet (one row per
grading event, 47 columns) plus a byte-equivalent CSV and ledger.csv.
Prints per-stage spend and the internal reconciliation verdict (ledger totals
vs row sums; reconciliation against provider dashboards is a manual step).
--snapshot NAME additionally freezes an immutable copy under
export/snapshots/NAME/ (tables, locks, covering manifests, snapshot.json,
STUDY_CARD.md); an existing name is refused with exit 2. See
Outputs-and-Schemas#snapshots.
itemeval status CONFIG [--json]
Prints datasets (id @ revision, item counts), the policy-effective scope,
both condition tables with done/expected, error and parse-failure counts,
spend per stage, and manifest count. Both tables are scoped to the current
grid at the current scope (wave 0); studies with more than one wave get an
extra waves: line with per-wave gen/graded counts
(Pipeline-Concepts#waves). --json emits the
full structured report (also available for estimate and export).
Commands may end with up to two dim hint: lines on stderr — one
observed fact from this run plus a doc pointer
(hint: <fact> — learn more: <wiki-page#anchor>). Hints never change
behavior and never block. Set ITEMEVAL_HINTS=off to silence them in the
text rendering; under --json they always ride as structured data in the
hints array ({code, message, learn_more}), uncapped.
Hint codes are stable and append-only:
| Code | Fires when |
|---|---|
cache-zero-reads |
same-prefix calls were scheduled for a provider cache discount but none engaged |
empty-solutions |
completions finished without an API error but produced no gradable text |
unpriced-models |
a model has no pricing-table entry (dollar columns missing; run unaffected) |
itemeval estimate cfg.yaml --refresh-pricing # sanity-check projected $
itemeval generate cfg.yaml # prompts if above confirm_above_usd
# ... interrupted? just re-run; completed conditions skip ...
itemeval generate cfg.yaml
itemeval grade cfg.yaml
itemeval grade cfg.yaml --grader second_judge # later: new judge, $0 generation
itemeval export cfg.yaml
itemeval status cfg.yaml