CLI
```
itemeval init DIR [options]
itemeval {estimate,generate,grade,export,status} CONFIG [options]
```
init scaffolds a new study into DIR; every other command takes the config
YAML path as its argument. itemeval is installed as a console script;
python -m itemeval.cli is equivalent.
The run/report commands (estimate|generate|grade|export|status) accept
-C/--base-dir DIR to set the work directory that anchors outputs (the
studies/ tree). It defaults to the current directory; inputs (prompts/rubrics)
always resolve relative to the config file, independent of -C.
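The anchoring rule above can be pictured with a small sketch. The function name and dictionary keys are hypothetical and not part of itemeval; the sketch only assumes that the studies/ tree sits directly under the work directory and that inputs are looked up next to the config file.

```python
from pathlib import PurePosixPath


def resolve_anchors(config_path: str, base_dir: str = ".") -> dict:
    """Illustrative only: mirrors the documented rule, not itemeval's own code."""
    config = PurePosixPath(config_path)
    return {
        # Outputs hang off the work directory chosen with -C/--base-dir.
        "outputs_root": PurePosixPath(base_dir) / "studies",
        # Inputs (prompts/rubrics) resolve relative to the config file,
        # whatever -C is set to.
        "inputs_root": config.parent,
    }


anchors = resolve_anchors("demo/config.yaml", base_dir="/tmp/work")
print(anchors["outputs_root"])  # /tmp/work/studies
print(anchors["inputs_root"])   # demo
```

Changing `base_dir` moves only `outputs_root`; `inputs_root` depends solely on where the config lives.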
Exit codes:

| Code | Meaning |
|---|---|
| 0 | success |
| 1 | unexpected error, or at least one condition failed during a run |
| 2 | config / template / adapter error (and argparse usage errors) |
| 3 | cost gate declined, or confirmation required in a non-interactive shell |
| 4 | projected cost exceeds budget.max_usd (hard cap; --yes does not override) |
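For scripts or CI jobs that wrap itemeval, the table can be turned into a small lookup. This is a sketch: the enum member names and the suggested follow-up actions are assumptions made for illustration, and only the numeric codes and their meanings come from the table.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    # Member names are hypothetical; the values follow the table above.
    SUCCESS = 0
    FAILURE = 1
    CONFIG_ERROR = 2
    GATE_DECLINED = 3
    OVER_BUDGET = 4


# Suggested reactions are illustrative, not prescribed by itemeval.
NEXT_ACTION = {
    ExitCode.SUCCESS: "nothing to do",
    ExitCode.FAILURE: "inspect the reported condition or traceback, then re-run",
    ExitCode.CONFIG_ERROR: "fix the config, template, adapter, or command line",
    ExitCode.GATE_DECLINED: "re-run interactively or pass -y/--yes",
    ExitCode.OVER_BUDGET: "raise budget.max_usd or narrow the run; --yes will not help",
}


def next_action(returncode: int) -> str:
    """Map a process return code to a suggested follow-up."""
    try:
        return NEXT_ACTION[ExitCode(returncode)]
    except ValueError:
        return f"undocumented exit code {returncode}"


print(next_action(3))  # re-run interactively or pass -y/--yes
```

A wrapper would typically feed it the `returncode` attribute of a `subprocess.run(...)` result.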
```
itemeval init DIR [--with-templates] [--force]
```
Writes DIR/config.yaml — a runnable starter study (mock provider, the USAMO
demo dataset, builtin: template references) named after DIR. Refuses to
overwrite an existing config.yaml unless --force. With --with-templates,
also copies the referenced built-in prompts/rubrics into DIR/prompts/ and
DIR/rubrics/ and rewrites the config to reference those local copies (bare
names). Afterwards, run `cd DIR && itemeval status config.yaml` to check the new study.
```
itemeval estimate CONFIG [--stage {generate,grade,all}] [--refresh-pricing] [--json]
```
Prints per-stage and per-condition projections (calls, tokens, USD), flags
unpriced models, and warns when generation is uncapped (no max_tokens).
--refresh-pricing pulls current per-token prices from the OpenRouter API
into a local cache first. The estimate always projects the full
policy-effective grid; completed work is not subtracted (conservative).
```
itemeval generate CONFIG [-y/--yes] [--force] [--condition F]...
                         [--display {none,plain,rich,full}]
```
Flow: estimate → print projection → gate → run each generate condition
serially → upsert solutions, log index, ledger → write manifest. Conditions
already complete print skipped: complete. A condition whose eval fails is
reported and the rest continue (final exit 1).
- `-y/--yes` confirms the gate non-interactively (never overrides `max_usd`).
- `--force` re-runs completed work (rows are replaced, not duplicated).
- `--condition F` (repeatable) selects conditions by exact id, id prefix, or slug, e.g. `--condition gpt-5-mini_minimal_default`.
- `--display` passes through to inspect (default `none`; try `rich` interactively).
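The gate step in the flow above can be sketched as a pure decision function. This is a reading of the documented behaviour, not itemeval's implementation: the function name, its parameters, and the choice to treat a projection exactly at the threshold as not needing confirmation are assumptions. `confirm_above_usd` and `max_usd` are taken to be the budget settings the documentation mentions.

```python
from typing import Optional


def gate_exit_code(
    projected_usd: float,
    confirm_above_usd: float,
    max_usd: Optional[float],
    yes: bool,
    interactive: bool,
    user_accepts: bool = False,
) -> int:
    """Return 0 to proceed, or the documented exit code (3 or 4) to stop."""
    # Hard cap is checked first: -y/--yes never overrides max_usd.
    if max_usd is not None and projected_usd > max_usd:
        return 4
    # At or below the confirmation threshold, nothing needs confirming (assumed).
    if projected_usd <= confirm_above_usd:
        return 0
    # Above the threshold, -y/--yes confirms without a prompt.
    if yes:
        return 0
    # A prompt is needed, but a non-interactive shell cannot answer it.
    if not interactive:
        return 3
    # Interactive shell: the outcome depends on the user's answer.
    return 0 if user_accepts else 3


print(gate_exit_code(10.0, 1.0, 5.0, yes=True, interactive=True))    # 4
print(gate_exit_code(2.0, 1.0, 5.0, yes=False, interactive=False))   # 3
print(gate_exit_code(2.0, 1.0, 5.0, yes=True, interactive=False))    # 0
```

Checking the hard cap before anything else is what makes `--yes` unable to bypass it.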
```
itemeval grade CONFIG [-y/--yes] [--force] [--condition F]...
                      [--grader N]... [--rubric N]... [--display ...]
```
Same flow over grade conditions. Verifiable conditions cost $0 and need no
model. --grader/--rubric (repeatable) narrow to specific judges/rubrics —
useful for adding a new grader over existing solutions. The summary line
reports parse_failures (rows kept with parse_ok=false).
```
itemeval export CONFIG [--json]
```
Joins gradings × solutions into export/gradings_long.parquet (one row per
grading event, 45 columns) plus a byte-equivalent CSV and ledger.csv.
Prints per-stage spend and the internal reconciliation verdict (ledger totals
vs row sums; reconciliation against provider dashboards is a manual step).
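The internal reconciliation idea (ledger totals compared with row sums) can be illustrated with a short sketch. The record layout is hypothetical: the `stage` and `cost_usd` keys and the tolerance are assumptions for the example, not the columns itemeval actually writes.

```python
import math
from collections import defaultdict


def sum_by_stage(rows):
    """Total the cost of each stage from a list of dict records."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["stage"]] += row["cost_usd"]
    return dict(totals)


def reconcile(ledger_rows, result_rows, abs_tol=1e-6):
    """Return {stage: True/False} telling whether the two totals agree."""
    ledger = sum_by_stage(ledger_rows)
    results = sum_by_stage(result_rows)
    return {
        stage: math.isclose(
            ledger.get(stage, 0.0), results.get(stage, 0.0), abs_tol=abs_tol
        )
        for stage in sorted(set(ledger) | set(results))
    }


# Hypothetical data: generate agrees, grade does not.
ledger = [
    {"stage": "generate", "cost_usd": 0.30},
    {"stage": "grade", "cost_usd": 0.10},
]
rows = [
    {"stage": "generate", "cost_usd": 0.10},
    {"stage": "generate", "cost_usd": 0.20},
    {"stage": "grade", "cost_usd": 0.25},
]
print(reconcile(ledger, rows))  # {'generate': True, 'grade': False}
```

Comparing with a tolerance rather than `==` avoids false mismatches from floating-point summation.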
```
itemeval status CONFIG [--json]
```
Prints datasets (id @ revision, item counts), the policy-effective scope,
both condition tables with done/expected, error and parse-failure counts,
spend per stage, and manifest count. --json emits the full structured
report (also available for estimate and export).
```sh
itemeval estimate cfg.yaml --refresh-pricing   # sanity-check projected $
itemeval generate cfg.yaml                     # prompts if above confirm_above_usd
# ... interrupted? just re-run; completed conditions skip ...
itemeval generate cfg.yaml
itemeval grade cfg.yaml
itemeval grade cfg.yaml --grader second_judge  # later: new judge, $0 generation
itemeval export cfg.yaml
itemeval status cfg.yaml
```
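The same sequence can be driven from Python, stopping at the first non-zero exit code. This is a minimal sketch: `STEPS`, `run_step`, and `run_pipeline` are hypothetical names, and it relies only on the documented equivalence between the console script and `python -m itemeval.cli`. Because completed conditions are skipped, re-running the whole pipeline after an interruption should be cheap.

```python
import subprocess
import sys

# Mirrors the command sequence above; cfg.yaml is a placeholder path.
STEPS = [
    ["estimate", "cfg.yaml", "--refresh-pricing"],
    ["generate", "cfg.yaml"],
    ["grade", "cfg.yaml"],
    ["export", "cfg.yaml"],
    ["status", "cfg.yaml"],
]


def run_step(args):
    """Run one itemeval command via the module entry point; return its exit code."""
    return subprocess.run([sys.executable, "-m", "itemeval.cli", *args]).returncode


def run_pipeline(steps=STEPS, run=run_step):
    """Run the steps in order and return the first non-zero exit code, else 0."""
    for args in steps:
        code = run(args)
        if code != 0:
            print(f"itemeval {' '.join(args)} exited with {code}; stopping")
            return code
    return 0
```

Calling `run_pipeline()` from an interactive shell lets generate and grade prompt at the cost gate as usual. The `run` parameter exists so the sequencing can be exercised with a stub instead of the real CLI.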