-
Notifications
You must be signed in to change notification settings - Fork 0
Pipeline Concepts
The canonical unit is an Item: id, input (the problem), target
(reference answer, may be empty), grading_scheme (rubric/points text, may be
null), metadata. Adapters turn benchmark rows into items via the
benchmark.mapping spec; item ids must be unique across all configured
datasets.
A facet is an experimental factor: solver model, prompt variant,
model-config (sampling/reasoning settings), grader, rubric, replication.
itemeval crosses them fully (crossing: full):
-
Generate conditions =
models × facets.prompt × facets.model_config -
Grade conditions =
facets.grader × facets.rubric, plus one verifiable-scorer condition iffacets.scoreris set -
Replications = inspect epochs; every (generate condition × item) runs
facets.replicationstimes
Every condition gets a stable, content-derived id:
<human-readable-slug>--<sha256-of-payload>[:12], e.g.
gpt-5-mini_minimal_default--3fa9c2d1e0b4. The hashed payload contains the
content that defines the condition — model id, resolved sampling params,
prompt/rubric name and content hash. Consequences:
- Editing a prompt file changes its conditions' ids — old rows stay in the store under the old id; the edited condition starts fresh. You can never silently mix results from two prompt versions.
- Ids are reproducible across machines and runs. Replication/epoch is never part of the id (it's a column).
- When a run's grid disagrees with stored rows — same facet name with a
different content hash (edited template), or an unchanged slug mapping to
a new id (changed sampling param) —
generate/gradeprint a config drift warning naming the facet, the hash change, and the affected row count (also in the run JSON aswarnings[]). Warnings never block: the run proceeds under the new condition by design.
Generate runs one inspect task per generate condition and upserts one row
per (condition × item × epoch) into solutions.parquet.
Grade fans out grade conditions over stored solutions — it never re-generates and never writes the solutions store:
-
Verifiable scorers (
exact_match,multiple_choice,numeric) are pure Python over the stored text: no model, no cost. - Judge grading builds a fresh inspect task whose dataset is the stored solutions rendered into a rubric template; the judge model runs at temperature 0 with its own logs, retries, caching, batch eligibility, and cost accounting.
Because the stages share only the solutions store, adding a new grader or
rubric later (itemeval grade --grader new_judge) re-uses every stored
solution at zero generation cost — that is the core reason this package
exists on top of inspect_ai rather than using inspect's in-eval scorers.
The packaged format suffix instructs judges to end with a fenced JSON block
{"score": <number>, "reasoning": "..."}. Parsing is strict (fenced blocks
last-to-first, then raw JSON objects) with exact failure codes
(no_json_object, no_score_in_json, score_not_numeric,
score_not_finite). Parse failures are results, not errors: the row is
kept with parse_ok=false and is final — it is not retried on re-runs
(use --force to re-grade). Sample-level errors (provider failures), by
contrast, leave error set and are re-attempted on the next run.
generate cfg.yaml --wave 2026-07 re-observes the same design scope as
a new observation, keeping both — the substrate for drift / model-downgrade
detection over time. A wave is an epoch block: wave w with
replications: R occupies epochs w·R+1 … (w+1)·R. The epoch axis is
already in every store key and both cache keys, so new waves are new keys —
no migration, no replacement — and fresh epoch numbers give fresh draws
automatically. Mechanics:
-
--waveis a design declaration: always explicit, never auto-fired. Without it, behavior is exactly the single-wave default (wavecolumn constantly 0,wave_labelnull — zero noise). - The offset eval runs with the local response cache off (announced:
wave 2026-07: local response cache off — re-observations must be fresh draws) — otherwise inspect would replay the wave-0 bytes as "new" observations. By design waves never replay: they cost full price. - Resumable mid-wave: re-running with the same label fills only the block's missing work; an existing label resumes its block, a new label allocates the next free one.
-
grade --wave LABELgrades exactly that block's solutions under the existing grade conditions; plaingradestays scoped to wave 0. - Rows carry
wave(int) andwave_labelcolumns (solutions/gradings/export, additive — old stores read as wave 0); manifests and the ledger record theepoch_offset.statusprints per-wave completion (generate and graded counts, e.g.waves: 0 — gen 8/8 · graded 8/8, 1 (w1) — gen 8/8 · graded 0/8) only when more than one wave exists; the main completion matrix stays scoped to wave 0, so wave progress lives entirely on this line. - Analysis:
df.groupby("wave")over the export, plusserved_modelfrom the manifests to attribute differences. - Config drift warnings are load-bearing here: a changed template between waves means the new wave is a different condition — the warning says so.
The parquet store is the source of truth; raw .eval logs are evidence.
- Upserts are keyed — generate:
(condition_id, item_id, epoch); grade:(grade_condition_id, gen_condition_id, item_id, epoch)— re-runs replace rather than duplicate. - A generate item is re-run iff any epoch is missing or errored. A solution is re-graded iff it has no successful grading row under that grade condition.
-
--forcere-runs everything selected;--condition <id-prefix-or-slug>narrows any run. - Interrupting a run is always safe; re-invoke the same command.
Two layers, both from inspect_ai:
-
Local response cache (
cache: true, the default): identical model calls (same model, prompt, sampling config, epoch) are served from disk. Re-running a wiped study re-pays nothing; such rows recordusd = 0.0. Per-epoch caching keeps replications distinct. -
Provider prompt caching: the grade stage sets
cache_prompt: auto, so repeated rubric+problem prefixes across solutions are cache-eligible on providers that support it.
Every run writes manifests/<run_id>.json: config content hash, dataset ids
- resolved revisions, prompt/rubric content hashes, model ids, requested
sampling params (effective per-condition values backfilled after the run),
seeds, package versions, the full condition grid, and the estimate the run
was approved under. Dataset revisions are pinned at first run in
dataset_locks.json. Same manifest + cache ⇒ identical results.