Agent Guide
This page is written for an LLM agent (Claude Code, Codex, a custom
orchestrator) operating itemeval on a user's behalf — and for the humans
wiring that up. It is a compact operating contract: what to run, in what
order, what every outcome means, and which guardrails must never be bypassed.
If you maintain a study repo, copy the
drop-in instructions block
into your repo's CLAUDE.md / AGENTS.md.
(This is about using the installed package. For developing itemeval itself,
see CLAUDE.md in the repo.)
- One YAML config fully describes a study: a benchmark (HuggingFace dataset + column mapping) and a facet grid (models × prompts × model-configs × graders × rubrics × replications).
- `generate` produces solutions; `grade` scores stored solutions (verifiable scorer at $0, or LLM judge); the stages are decoupled, so new graders/rubrics never re-pay generation.
- All state lives under `<cwd>/studies/<study>/` in keyed parquet stores; every command is idempotent and resumable: re-running is always safe and never duplicates or re-pays completed work.
- Money is governed by config: `budget.policy` (scope), `confirm_above_usd` (confirmation gate), `max_usd` (hard abort, not overridable).
- `export` writes the analysis artifact: `export/gradings_long.parquet`, one row per grading event.
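To make the decoupling of the two stages concrete, here is a minimal sketch of how a facet grid multiplies out. The facet values are hypothetical, and the split into solver-side and grading-side facets is an assumption drawn from the description above, not itemeval's internal bookkeeping.

```python
from itertools import product

# Hypothetical facet values, for illustration only.
models = ["model-a", "model-b"]
prompts = ["zero_shot", "cot"]
model_configs = ["default"]
replications = [1, 2, 3]
graders = ["judge-x"]
rubrics = ["correctness", "clarity"]

# Assumed split: generation depends only on the solver-side facets ...
generation_cells = list(product(models, prompts, model_configs, replications))
# ... while each stored solution is graded once per (grader, rubric) pair.
grading_cells = list(product(generation_cells, graders, rubrics))

print(len(generation_cells))  # 12 solutions per item
print(len(grading_cells))     # 24 grading events per item

# Adding a rubric grows the grading side only; generation_cells is unchanged.
rubrics.append("style")
print(len(list(product(generation_cells, graders, rubrics))))  # 36
```

Under this reading, a new rubric adds grading events but no new solutions, which is why it never re-pays generation.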
itemeval init DIR [--with-templates] # scaffold config.yaml (no API calls)
itemeval estimate CONFIG [--refresh-pricing] [--json] # projected $; NO model calls
itemeval generate CONFIG [--yes] [--json] [--force] [--condition F]... # stage 1 (paid)
itemeval grade CONFIG [--yes] [--json] [--force] [--grader N] [--rubric N] [--condition F]... # stage 2 (paid if judge)
itemeval export CONFIG [--json] # tables + ledger (no API calls)
itemeval status CONFIG [--json] # completion matrix (no API calls)
- `estimate`, `status`, `export` never call a model API and are always safe. The first-ever run downloads the dataset from the HF Hub (free).
- `--json` (on every command) emits the full structured report; prefer it over parsing human-readable stdout. On `generate`/`grade` it carries the run result plus `pricing`, `estimate_usd`, and the `gate` outcome, and a gate stop still emits a JSON document (projected cost, gate reason, the `--yes` rerun command) before exit 3/4.
- `-C`/`--base-dir DIR` anchors the output tree (`studies/`); default is the current directory. Inputs (prompts/rubrics) always resolve relative to the config file.
- For paid runs in non-interactive shells, `--yes` is required whenever the projection exceeds `confirm_above_usd` (there is no TTY to confirm on).
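An agent driving the CLI from Python might wrap it roughly as follows. This is a sketch, not part of itemeval: `build_command` and `run_json` are hypothetical helper names, and the code assumes that with `--json` the command's stdout is a single JSON document.

```python
import json
import subprocess

PAID_SUBCOMMANDS = ("generate", "grade")

def build_command(subcommand, config, *, yes=False):
    """Assemble an argv list using only flags from the synopsis above."""
    if yes and subcommand not in PAID_SUBCOMMANDS:
        raise ValueError("--yes only applies to generate/grade")
    argv = ["itemeval", subcommand, config, "--json"]
    if yes:
        argv.append("--yes")
    return argv

def run_json(argv):
    """Run the command; return (exit_code, parsed_report_or_None)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    try:
        report = json.loads(proc.stdout)  # assumption: stdout is pure JSON
    except json.JSONDecodeError:
        report = None  # e.g. a crash before any JSON was written
    return proc.returncode, report

print(build_command("estimate", "config.yaml"))
# ['itemeval', 'estimate', 'config.yaml', '--json']
```

Keeping the exit code and the parsed report together lets the caller decide on the exit code first and read the report only for details such as the projected cost.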
| Code | Meaning | Correct agent reaction |
|---|---|---|
| 0 | success | proceed |
| 1 | unexpected error, or ≥1 condition failed during the run | run `status --json`; re-run the same command (errored samples retry); escalate if the same rows fail repeatedly |
| 2 | config / template / adapter error | fix the YAML or template file; do not retry unchanged |
| 3 | cost gate needs confirmation (non-interactive) | report the projected cost to the user; re-run with `--yes` only within an authorized budget |
| 4 | projection exceeds `budget.max_usd` | stop. Never raise `max_usd` yourself; that number is the user's, not yours |
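The exit-code table can be encoded as a small lookup so an orchestrator never improvises a reaction. The `Reaction` enum and `reaction_for` are hypothetical names, and treating any unlisted code as "escalate" is an assumption of this sketch rather than documented behavior.

```python
from enum import Enum

class Reaction(Enum):
    PROCEED = "proceed"
    RERUN = "run status --json, then re-run the same command"
    FIX_CONFIG = "fix the YAML or template; do not retry unchanged"
    ASK_USER = "report projected cost; --yes only within an authorized budget"
    STOP = "stop and report; never raise max_usd"
    ESCALATE = "unlisted exit code; hand back to the user"

EXIT_REACTIONS = {
    0: Reaction.PROCEED,
    1: Reaction.RERUN,
    2: Reaction.FIX_CONFIG,
    3: Reaction.ASK_USER,
    4: Reaction.STOP,
}

def reaction_for(exit_code):
    """Map an exit code to the reaction from the table above."""
    return EXIT_REACTIONS.get(exit_code, Reaction.ESCALATE)

print(reaction_for(4).name)  # STOP
```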
- Never raise or remove `budget.max_usd`, and never inflate `confirm_above_usd`, without an explicit user instruction quoting the new number. Exit 4 is the user's hard cap working as designed.
- Always run `estimate` (ideally `--refresh-pricing --json`) before the first `generate`/`grade` of a session and compare `total_usd` against the budget you were given.
- Start every new config at `policy: dev` (a few items). Scale to `full-interactive`/`full-batch` only after the dev run's export looks right.
- Re-run, don't repair. On interruption or partial failure, re-invoke the identical command; the stores are keyed and the response cache prevents double payment. Never delete or hand-edit `studies/<study>/*.parquet`.
- One command at a time per study directory. Never run two `generate`/`grade` processes concurrently on the same study.
- Don't loop on parse failures. Judge rows with `parse_ok=false` are final results, not retryable errors; re-running `grade` will not change them (that needs `--force` or a rubric change, which is a user decision).
- API keys come from the environment (`OPENAI_API_KEY`, etc.). Never write keys into configs or commit them.
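The two budget thresholds interact in a fixed order, which the following sketch spells out for a non-interactive shell. `gate_outcome` is a hypothetical function that mirrors the rules as described on this page; it is not itemeval's implementation, and the dollar figures are made up.

```python
def gate_outcome(projected_usd, confirm_above_usd, max_usd, *, yes=False):
    """Return the exit code the documented gates imply (non-interactive)."""
    # The hard cap is checked first: --yes cannot override it.
    if projected_usd > max_usd:
        return 4
    # Above the confirmation threshold, a non-interactive run needs --yes.
    if projected_usd > confirm_above_usd and not yes:
        return 3
    return 0

print(gate_outcome(12.0, 5.0, 50.0))            # 3: needs confirmation
print(gate_outcome(12.0, 5.0, 50.0, yes=True))  # 0: confirmed, under the cap
print(gate_outcome(60.0, 5.0, 50.0, yes=True))  # 4: over the hard cap
```

An agent can run this kind of check against the `estimate` output before a paid command, so it knows in advance whether it must go back to the user.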
# 1. Scaffold (or receive) a config
itemeval init my_study && cd my_study
# 2. Edit config.yaml: point benchmark.datasets/mapping at the target dataset,
# set solvers.models, choose facets.scorer (verifiable) or grader+rubric (judge),
# keep policy: dev.
# 3. Validate without spending
itemeval status config.yaml --json # config parses; grid is what you expect
itemeval estimate config.yaml --refresh-pricing --json # projected $; check warnings
# 4. Dev run (cheap), then inspect
itemeval generate config.yaml --yes
itemeval grade config.yaml --yes
itemeval export config.yaml --json
# -> read studies/<study>/export/gradings_long.parquet; check scores, parse_ok,
# empty-solution counts; sanity-check a solution and a judge reasoning by eye
# 5. Report findings + full-scope estimate to the user; on approval flip
# budget.policy to full-batch, set max_usd, then repeat 3–4 at full scope.

Key config rules that bite agents (full reference: Configuration):
- Validation is strict: unknown keys are rejected (exit 2 with the field named). Fix the config; don't retry.
- `facets` needs at least one of `scorer` (verifiable: `exact_match`/`multiple_choice`/`numeric`) or `grader` (+ entries under `graders:`).
- Template namespaces: `builtin:NAME` = packaged template; bare `NAME` = local file under `prompts_dir`/`rubrics_dir` (relative to the config file). Solver prompts must contain `{input}`; rubrics must contain `{input}` and `{solution}`.
- Reasoning models need `max_tokens` headroom for hidden reasoning plus the visible answer; if `grade`/`status` report `empty` solutions, raise `max_tokens` or lower `reasoning_effort` and set `solvers.on_empty: rerun`.
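The placeholder rule for templates is easy to check before spending anything. The sketch below is a hypothetical pre-flight helper (`missing_placeholders` is not an itemeval function); it only encodes the requirement stated above and may be looser or stricter than itemeval's own validation.

```python
# Required placeholders per template kind, as stated in the rules above.
REQUIRED = {
    "solver_prompt": ("{input}",),
    "rubric": ("{input}", "{solution}"),
}

def missing_placeholders(kind, template_text):
    """Return the required placeholders that the template text lacks."""
    return [p for p in REQUIRED[kind] if p not in template_text]

rubric = "Question: {input}\nScore the answer from 0 to 1."
print(missing_placeholders("rubric", rubric))  # ['{solution}']
```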
Prefer the parquet stores over stdout:
| Artifact | Path under `studies/<study>/` | Use |
|---|---|---|
| Analysis table | `export/gradings_long.parquet` | one row per grading event; key cols: `item_id`, `model`, `prompt_name`, `replication`, `grader_name`, `rubric_name`, `score`, `parse_ok`, `solution`, `reasoning`, `gen_usd`, `grade_usd` |
| Solutions | `solutions.parquet` | per (condition × item × epoch): `solution`, `stop_reason`, `error`, tokens, usd |
| Gradings | `gradings.parquet` | per grading event incl. `parse_error`, `judge_completion` |
| Cost ledger | `ledger.parquet` / `export/ledger.csv` | spend by run × stage × condition × model |
| Manifests | `manifests/<run_id>.json` | full reproducibility record per run |
| Raw transcripts | `logs/<stage>/<condition_id>/*.eval` | inspect_ai logs (open with `inspect view`) |
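A first look at the analysis table usually means a per-model mean score and a count of unparsed judge rows. The sketch below works on plain dicts so it runs without extra dependencies; the sample rows are invented, only the column names `model`, `score` and `parse_ok` come from the table above, and excluding unparsed rows from the mean is a choice made for this example.

```python
from collections import defaultdict
from statistics import mean

def summarize(rows):
    """Per-model mean score over parsed rows, plus a parse-failure count."""
    scores = defaultdict(list)
    parse_failures = 0
    for row in rows:
        if not row["parse_ok"]:
            parse_failures += 1  # final result, not retried; counted separately
            continue
        scores[row["model"]].append(row["score"])
    return {m: mean(s) for m, s in scores.items()}, parse_failures

# Invented rows standing in for gradings_long.parquet records.
rows = [
    {"model": "model-a", "score": 1.0, "parse_ok": True},
    {"model": "model-a", "score": 0.0, "parse_ok": True},
    {"model": "model-b", "score": 1.0, "parse_ok": False},
]
print(summarize(rows))  # ({'model-a': 0.5}, 1)
```

With pandas and a parquet engine installed, real rows could be obtained with `pandas.read_parquet(path).to_dict("records")` and passed to the same function.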
Or stay in Python — one public function per command, same semantics, pydantic
results (.model_dump() for JSON). Consent is a parameter: pass max_usd=
and the run raises itemeval.BudgetExceededError before any API call when
the remaining projection exceeds it
(Python API).
import itemeval
BUDGET_USD = 5.00  # placeholder value; substitute the cap the user actually authorized
cfg = itemeval.load_config("config.yaml")
prep = itemeval.prepare_study(cfg)
est = itemeval.estimate_study(prep) # remaining figures: est.generate.remaining_usd
gen = itemeval.run_generate(prep, display="none", max_usd=BUDGET_USD)
assert not any(c.status == "error" for c in gen.conditions)
itemeval.run_grade(prep, display="none", max_usd=BUDGET_USD)
itemeval.export_study(cfg)

| Observation | Meaning | Action |
|---|---|---|
| exit 2, pydantic message naming a field | invalid config | fix that field |
| exit 2, `local template 'x' not found` | bare name with no local file | create the file, or use `builtin:x` |
| exit 3 | gate wants confirmation | surface cost to user; `--yes` if authorized |
| exit 4 | projection > `max_usd` | stop; report; user decides |
| exit 1 + `ERROR:` on a condition line | whole condition failed (auth, provider down) | check the named exception; fix env/keys; re-run |
| `errors=N` in summary / `err` column in `status` | per-sample provider failures | re-run the same command (they retry) |
| `parse_failures=N` / `parse_ok=false` rows | judge output didn't parse | final, not retryable; inspect `judge_completion`; consider raising grader `max_tokens` or fixing the rubric, then `grade --force` |
| `empty=N` / `empty` column | completions with no text (reasoning-token exhaustion) | raise `max_tokens` / lower `reasoning_effort`; `on_empty: rerun`; re-generate |
| rows with `usd = 0.0`, zero tokens | served by local response cache | normal; genuinely free |
| rows with `usd = null` | model not in pricing table | `estimate --refresh-pricing`; run is otherwise fine |
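"Re-run the same command" should still be bounded, so that repeated failures get escalated rather than looped on. A minimal sketch, assuming a caller-supplied `run_once` callable that invokes the identical command and returns its exit code; `rerun_until_settled` and the default of three attempts are choices made for this example, not itemeval behavior.

```python
def rerun_until_settled(run_once, max_attempts=3):
    """Call run_once() until it stops returning exit code 1.

    Only exit 1 is retried. Codes 0, 2, 3 and 4 are returned at once,
    because re-running the unchanged command cannot help with them.
    Returns (last_exit_code, attempts_used).
    """
    code = 1
    for attempt in range(1, max_attempts + 1):
        code = run_once()
        if code != 1:
            return code, attempt
    return code, max_attempts  # still failing: escalate to the user

# Simulated command that fails twice, then succeeds.
codes = iter([1, 1, 0])
print(rerun_until_settled(lambda: next(codes)))  # (0, 3)
```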
## Running evaluations (itemeval)
This repo's eval studies run on itemeval (https://github.com/luozm/itemeval —
agent contract: https://github.com/luozm/itemeval/wiki/Agent-Guide).
- Pipeline per study config: `itemeval estimate <cfg> --json` →
`itemeval generate <cfg> --yes` → `itemeval grade <cfg> --yes` →
`itemeval export <cfg> --json`. All commands are idempotent; on any
interruption or partial failure, re-run the same command.
- ALWAYS `estimate` before the first paid command; report projected USD.
- NEVER change `budget.max_usd` or `confirm_above_usd` without an explicit
instruction. Exit code 4 = over hard cap: stop and report.
- New/changed configs start at `budget.policy: dev`; full runs only after a
green dev export and explicit approval.
- Results: read `studies/<study>/export/gradings_long.parquet` (one row per
grading event). Never hand-edit anything under `studies/`.