CLI
itemeval {estimate,generate,grade,export,status} CONFIG [options]
Every command takes the config YAML path as its argument. itemeval is
installed as a console script (./.venv/bin/itemeval); python -m itemeval.cli is equivalent.
Exit codes:
| Code | Meaning |
|---|---|
| 0 | success |
| 1 | unexpected error, or at least one condition failed during a run |
| 2 | config / template / adapter error (and argparse usage errors) |
| 3 | cost gate declined, or confirmation required in a non-interactive shell |
| 4 | projected cost exceeds budget.max_usd (hard cap; --yes does not override) |
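When itemeval is driven from a script or CI job, the exit codes in the table above can be turned into readable messages. The following is a minimal sketch, not part of itemeval: `describe_exit` and `run_itemeval` are hypothetical helper names, and the only things taken from this page are the code meanings and the `python -m itemeval.cli` entry point.

```python
import subprocess
import sys

# Meanings copied from the exit-code table above.
EXIT_MEANINGS = {
    0: "success",
    1: "unexpected error, or at least one condition failed during a run",
    2: "config / template / adapter error (or argparse usage error)",
    3: "cost gate declined, or confirmation required in a non-interactive shell",
    4: "projected cost exceeds budget.max_usd (hard cap)",
}


def describe_exit(code: int) -> str:
    """Return the documented meaning of an exit code, or a fallback string."""
    return EXIT_MEANINGS.get(code, f"undocumented exit code {code}")


def run_itemeval(command: str, config: str, *extra: str) -> int:
    """Run `itemeval COMMAND CONFIG [extra...]` and report how it ended.

    Hypothetical wrapper: it uses the module entry point so it works
    without relying on the console script being on PATH.
    """
    argv = [sys.executable, "-m", "itemeval.cli", command, config, *extra]
    code = subprocess.run(argv).returncode
    print(f"itemeval {command}: exit {code} ({describe_exit(code)})", file=sys.stderr)
    return code
```

A caller could, for example, treat code 3 as "ask a human to confirm" and code 4 as "raise the budget or shrink the grid", rather than lumping both in with a generic failure.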
itemeval estimate CONFIG [--stage {generate,grade,all}] [--refresh-pricing] [--json]
Prints per-stage and per-condition projections (calls, tokens, USD), flags
unpriced models, and warns when generation is uncapped (no max_tokens).
--refresh-pricing pulls current per-token prices from the OpenRouter API
into a local cache first. The estimate always projects the full
policy-effective grid; completed work is not subtracted (conservative).
itemeval generate CONFIG [-y/--yes] [--force] [--condition F]...
[--display {none,plain,rich,full}]
Flow: estimate → print projection → gate → run each generate condition
serially → upsert solutions, log index, ledger → write manifest. Conditions
already complete print skipped: complete. A condition whose eval fails is
reported and the rest continue (final exit 1).
- -y/--yes confirms the gate non-interactively (never overrides max_usd).
- --force re-runs completed work (rows are replaced, not duplicated).
- --condition F (repeatable) selects conditions by exact id, id prefix, or slug, e.g. --condition gpt-5-mini_minimal_default.
- --display passes through to inspect (default none; try rich interactively).
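The cost gate that sits between the projection and the run can be pictured as a small decision function. This is a sketch of the documented behaviour under stated assumptions, not itemeval's actual code: the function name, its parameters, the order of the checks, and the use of strict `>` comparisons are all guesses. Only the outcomes come from this page: exit 4 for the hard cap, exit 3 for a declined or impossible confirmation, and the fact that --yes never overrides max_usd.

```python
from typing import Optional


def gate_exit_code(
    projected_usd: float,
    confirm_above_usd: Optional[float],
    max_usd: Optional[float],
    yes: bool = False,
    interactive: bool = True,
    user_accepts: bool = False,
) -> Optional[int]:
    """Return None if the run may proceed, else the documented exit code.

    Hypothetical helper; `user_accepts` stands in for the answer a user
    would give at an interactive prompt.
    """
    # Hard cap: checked first so that --yes can never bypass it.
    if max_usd is not None and projected_usd > max_usd:
        return 4
    # Below the confirmation threshold (or no threshold set): no prompt needed.
    if confirm_above_usd is None or projected_usd <= confirm_above_usd:
        return None
    # Above the threshold: --yes confirms without prompting.
    if yes:
        return None
    # No --yes and nobody to ask: confirmation is required but impossible.
    if not interactive:
        return 3
    # Interactive shell: proceed only if the user accepts.
    return None if user_accepts else 3
```

For instance, under these assumptions a projection of 12.0 USD against a max_usd of 10.0 yields exit 4 even with `yes=True`, while 7.0 USD with a confirm_above_usd of 5.0 proceeds with `yes=True` and yields exit 3 in a non-interactive shell without it.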
itemeval grade CONFIG [-y/--yes] [--force] [--condition F]...
[--grader N]... [--rubric N]... [--display ...]
Same flow over grade conditions. Verifiable conditions cost $0 and need no
model. --grader/--rubric (repeatable) narrow to specific judges/rubrics —
useful for adding a new grader over existing solutions. The summary line
reports parse_failures (rows kept with parse_ok=false).
itemeval export CONFIG [--json]
Joins gradings × solutions into export/gradings_long.parquet (one row per
grading event, 45 columns) plus a byte-equivalent CSV and ledger.csv.
Prints per-stage spend and the internal reconciliation verdict (ledger totals
vs row sums; reconciliation against provider dashboards is a manual step).
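The internal reconciliation (ledger totals vs row sums) amounts to comparing two per-stage sums. The sketch below illustrates that idea only. The record shape (dicts with `stage` and `usd` keys), the function name `reconcile`, and the tolerance are assumptions made for the example and do not reflect the real ledger or export schema.

```python
import math
from collections import defaultdict


def reconcile(ledger_entries, rows, tolerance=1e-6):
    """Compare per-stage ledger totals with per-stage sums over result rows.

    Both inputs are iterables of dicts with hypothetical keys
    'stage' (str) and 'usd' (float). Returns {stage: {...verdict...}}.
    """
    ledger_total = defaultdict(float)
    for entry in ledger_entries:
        ledger_total[entry["stage"]] += entry["usd"]

    row_total = defaultdict(float)
    for row in rows:
        row_total[row["stage"]] += row["usd"]

    report = {}
    # Cover stages that appear on either side, so a missing stage shows up
    # as a mismatch instead of being silently skipped.
    for stage in sorted(set(ledger_total) | set(row_total)):
        ledger_usd = ledger_total.get(stage, 0.0)
        rows_usd = row_total.get(stage, 0.0)
        report[stage] = {
            "ledger_usd": ledger_usd,
            "rows_usd": rows_usd,
            "ok": math.isclose(ledger_usd, rows_usd, abs_tol=tolerance),
        }
    return report
```

A check like this only shows that the tool's own records agree with each other, which is why comparison against provider dashboards remains a separate manual step.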
itemeval status CONFIG [--json]
Prints datasets (id @ revision, item counts), the policy-effective scope,
both condition tables with done/expected, error and parse-failure counts,
spend per stage, and manifest count. --json emits the full structured
report (also available for estimate and export).
itemeval estimate cfg.yaml --refresh-pricing # sanity-check projected $
itemeval generate cfg.yaml # prompts if above confirm_above_usd
# ... interrupted? just re-run; completed conditions skip ...
itemeval generate cfg.yaml
itemeval grade cfg.yaml
itemeval grade cfg.yaml --grader second_judge # later: new judge, $0 generation
itemeval export cfg.yaml
itemeval status cfg.yaml