Agent Guide

Agent Guide — driving itemeval from an AI agent

This page is written for an LLM agent (Claude Code, Codex, a custom orchestrator) operating itemeval on a user's behalf — and for the humans wiring that up. It is a compact operating contract: what to run, in what order, what every outcome means, and which guardrails must never be bypassed. If you maintain a study repo, copy the drop-in instructions block into your repo's CLAUDE.md / AGENTS.md.

(This is about using the installed package. For developing itemeval itself, see CLAUDE.md in the repo.)

Mental model in five lines

One YAML config fully describes a study: a benchmark (HuggingFace dataset + column mapping) and a facet grid (models × prompts × model-configs × graders × rubrics × replications).
generate produces solutions; grade scores stored solutions (verifiable scorer at $0, or LLM judge); the stages are decoupled — new graders/rubrics never re-pay generation.
All state lives under <cwd>/studies/<study>/ in keyed parquet stores; every command is idempotent and resumable — re-running is always safe and never duplicates or re-pays completed work.
Money is governed by config: budget.policy (scope), confirm_above_usd (confirmation gate), max_usd (hard abort, not overridable).
export writes the analysis artifact: export/gradings_long.parquet, one row per grading event.

The command contract

itemeval init DIR [--with-templates]            # scaffold config.yaml (no API calls)
itemeval estimate CONFIG [--refresh-pricing] [--json]   # projected $; NO model calls
itemeval generate CONFIG [--yes] [--json] [--force] [--condition F]...  # stage 1 (paid)
itemeval grade    CONFIG [--yes] [--json] [--force] [--grader N] [--rubric N] [--condition F]...  # stage 2 (paid if judge)
itemeval export   CONFIG [--json]               # tables + ledger (no API calls)
itemeval status   CONFIG [--json]               # completion matrix (no API calls)

estimate, status, export never call a model API and are always safe. First-ever run downloads the dataset from the HF Hub (free).
--json (on every command) emits the full structured report — prefer it over parsing human-readable stdout. On generate/grade it carries the run result plus pricing, estimate_usd, and the gate outcome, and a gate stop still emits a JSON document (projected cost, gate reason, the --yes rerun command) before exit 3/4.
-C/--base-dir DIR anchors the output tree (studies/); default is the current directory. Inputs (prompts/rubrics) always resolve relative to the config file.
For paid runs in non-interactive shells, --yes is required whenever the projection exceeds confirm_above_usd (there is no TTY to confirm on).

Exit codes (deterministic — branch on them)

Code	Meaning	Correct agent reaction
0	success	proceed
1	unexpected error, or ≥1 condition failed during the run	run `status --json`; re-run the same command (errored samples retry); escalate if the same rows fail repeatedly
2	config / template / adapter error	fix the YAML or template file; do not retry unchanged
3	cost gate needs confirmation (non-interactive)	report the projected cost to the user; re-run with `--yes` only within an authorized budget
4	projection exceeds `budget.max_usd`	stop. Never raise `max_usd` yourself — that number is the user's, not yours

Hard rules

Never raise or remove budget.max_usd, and never inflate confirm_above_usd, without an explicit user instruction quoting the new number. Exit 4 is the user's hard cap working as designed.
Always run estimate (ideally --refresh-pricing --json) before the first generate/grade of a session and compare total_usd against the budget you were given.
Start every new config at policy: dev (a few items). Scale to full-interactive/full-batch only after the dev run's export looks right.
Re-run, don't repair. On interruption or partial failure, re-invoke the identical command — the stores are keyed and the response cache prevents double payment. Never delete or hand-edit studies/<study>/*.parquet.
One command at a time per study directory. Never run two generate/grade processes concurrently on the same study.
Don't loop on parse failures. Judge rows with parse_ok=false are final results, not retryable errors; re-running grade will not change them (that needs --force or a rubric change — a user decision).
API keys come from the environment (OPENAI_API_KEY, etc.). Never write keys into configs or commit them.

Standard operating procedure

# 1. Scaffold (or receive) a config
itemeval init my_study && cd my_study

# 2. Edit config.yaml: point benchmark.datasets/mapping at the target dataset,
#    set solvers.models, choose facets.scorer (verifiable) or grader+rubric (judge),
#    keep policy: dev.

# 3. Validate without spending
itemeval status   config.yaml --json   # config parses; grid is what you expect
itemeval estimate config.yaml --refresh-pricing --json   # projected $; check warnings

# 4. Dev run (cheap), then inspect
itemeval generate config.yaml --yes
itemeval grade    config.yaml --yes
itemeval export   config.yaml --json
#    -> read studies/<study>/export/gradings_long.parquet; check scores, parse_ok,
#       empty-solution counts; sanity-check a solution and a judge reasoning by eye

# 5. Report findings + full-scope estimate to the user; on approval flip
#    budget.policy to full-batch, set max_usd, then repeat 3–4 at full scope.

Key config rules that bite agents (full reference: Configuration):

Validation is strict — unknown keys are rejected (exit 2 with the field named). Fix the config; don't retry.
facets needs at least one of scorer (verifiable: exact_match / multiple_choice / numeric) or grader (+ entries under graders:).
Template namespaces: builtin:NAME = packaged template; bare NAME = local file under prompts_dir/rubrics_dir (relative to the config file). Solver prompts must contain {input}; rubrics must contain {input} and {solution}.
Reasoning models need max_tokens headroom for hidden reasoning plus the visible answer; if grade/status report empty solutions, raise max_tokens or lower reasoning_effort and set solvers.on_empty: rerun.

Reading results programmatically

Prefer the parquet stores over stdout:

Artifact	Path under `studies/<study>/`	Use
Analysis table	`export/gradings_long.parquet`	one row per grading event; key cols: `item_id, model, prompt_name, replication, grader_name, rubric_name, score, parse_ok, solution, reasoning, gen_usd, grade_usd`
Solutions	`solutions.parquet`	per (condition × item × epoch): `solution, stop_reason, error`, tokens, `usd`
Gradings	`gradings.parquet`	per grading event incl. `parse_error`, `judge_completion`
Cost ledger	`ledger.parquet` / `export/ledger.csv`	spend by run × stage × condition × model
Manifests	`manifests/<run_id>.json`	full reproducibility record per run
Raw transcripts	`logs/<stage>/<condition_id>/*.eval`	inspect_ai logs (open with `inspect view`)

Or stay in Python — one public function per command, same semantics, pydantic results (.model_dump() for JSON). Consent is a parameter: pass max_usd= and the run raises itemeval.BudgetExceededError before any API call when the remaining projection exceeds it (Python API).

import itemeval
cfg  = itemeval.load_config("config.yaml")
prep = itemeval.prepare_study(cfg)
est  = itemeval.estimate_study(prep)        # remaining figures: est.generate.remaining_usd
gen  = itemeval.run_generate(prep, display="none", max_usd=BUDGET_USD)
assert not any(c.status == "error" for c in gen.conditions)
itemeval.run_grade(prep, display="none", max_usd=BUDGET_USD)
itemeval.export_study(cfg)

Failure triage table

Observation	Meaning	Action
exit 2, pydantic message naming a field	invalid config	fix that field
exit 2, `local template 'x' not found`	bare name with no local file	create the file, or use `builtin:x`
exit 3	gate wants confirmation	surface cost to user; `--yes` if authorized
exit 4	projection > `max_usd`	stop; report; user decides
exit 1 + `ERROR:` on a condition line	whole condition failed (auth, provider down)	check the named exception; fix env/keys; re-run
`errors=N` in summary / `err` column in status	per-sample provider failures	re-run the same command (they retry)
`parse_failures=N` / `parse_ok=false` rows	judge output didn't parse	final, not retryable; inspect `judge_completion`; consider raising grader `max_tokens` or fixing the rubric, then `grade --force`
`empty=N` / `empty` column	completions with no text (reasoning-token exhaustion)	raise `max_tokens` / lower `reasoning_effort`; `on_empty: rerun`; re-generate
rows with `usd = 0.0`, zero tokens	served by local response cache	normal — genuinely free
rows with `usd = null`	model not in pricing table	`estimate --refresh-pricing`; run is otherwise fine

Drop-in block for a study repo's CLAUDE.md / AGENTS.md

## Running evaluations (itemeval)

This repo's eval studies run on itemeval (https://github.com/luozm/itemeval —
agent contract: https://github.com/luozm/itemeval/wiki/Agent-Guide).

- Pipeline per study config: `itemeval estimate <cfg> --json` →
  `itemeval generate <cfg> --yes` → `itemeval grade <cfg> --yes` →
  `itemeval export <cfg> --json`. All commands are idempotent; on any
  interruption or partial failure, re-run the same command.
- ALWAYS `estimate` before the first paid command; report projected USD.
- NEVER change `budget.max_usd` or `confirm_above_usd` without an explicit
  instruction. Exit code 4 = over hard cap: stop and report.
- New/changed configs start at `budget.policy: dev`; full runs only after a
  green dev export and explicit approval.
- Results: read `studies/<study>/export/gradings_long.parquet` (one row per
  grading event). Never hand-edit anything under `studies/`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Agent Guide

Agent Guide — driving itemeval from an AI agent

Mental model in five lines

The command contract

Exit codes (deterministic — branch on them)

Hard rules

Standard operating procedure

Reading results programmatically

Failure triage table

Drop-in block for a study repo's CLAUDE.md / AGENTS.md

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally