-
Notifications
You must be signed in to change notification settings - Fork 0
Outputs and Schemas
Everything a study produces lives under one directory:
<work_dir>/<output_dir>/<study>/, where work_dir is the current directory by
default (override with -C/--base-dir, or work_dir= in load_config).
studies/<study>/
items.parquet # canonical items snapshot (all loaded items)
solutions.parquet # one row per (generate condition x item x epoch)
gradings.parquet # one row per grading event
log_index.parquet # index of raw inspect .eval logs
ledger.parquet # cost ledger (run x stage x condition x model)
dataset_locks.json # dataset revisions pinned at first run
manifests/<run_id>.json # one reproducibility manifest per run
logs/generate/<condition_id>/*.eval # raw inspect logs (full transcripts)
logs/grade/<condition_id>/*.eval
export/gradings_long.parquet # the analysis-ready long table
export/gradings_long.csv # byte-equivalent CSV mirror
export/ledger.csv
All parquet stores use keyed, atomic upserts — re-runs replace rows, never
duplicate them. Raw .eval logs are append-only evidence; open them with
inspect view or inspect_ai.log.read_eval_log() to audit any number.
dataset_locks.json pins each dataset's resolved revision at first run so
every later run loads identical data. Every command that loads data prints
one provenance line per dataset — revision, downloaded vs reused from the HF
cache, and (only when this run wrote the lock) a
revision pinned in dataset_locks.json clause; the same facts ride JSON as
datasets[] ({id, split, revision, revision_source, cache, cache_dir, download_bytes, pinned_now}).
Provenance: study, run_id, condition_id, condition_slug, item_id, dataset_id, dataset_revision, epoch, model, prompt_name, prompt_hash, model_config_name.
Sampling params, requested and effective: temperature_*, top_p_*, max_tokens_*, seed_requested, reasoning_effort[_effective], reasoning_tokens_requested — provider-forced values show up as a
requested/effective mismatch.
Result: solution, stop_reason, error (errored rows are kept and re-run next
time).
Cost: input_tokens, output_tokens, total_tokens, cache_read_tokens, cache_write_tokens, reasoning_tokens, usd, latency_s.
Audit: log_file, sample_uuid, created_at.
Waves: wave (int, default 0), wave_label (null unless --wave was used);
old stores read as wave 0 — see
Pipeline-Concepts#waves. Gradings and the
export carry the same two columns, inherited from the graded solution row.
Provenance: study, run_id, grade_condition_id/slug, gen_condition_id, item_id, epoch, grade_kind (judge|verifiable), grader_name, grader_model, rubric_name, rubric_hash, scorer_name.
Result: score, score_raw, parse_ok, parse_error, reasoning, judge_completion, error.
Cost/audit: token columns, usd (0.0 for verifiable), latency_s,
log_file, created_at.
Invariant: parse_ok=false ⟺ parse_error set ⟺ score null (for rows
without a sample-level error). Parse failures are final; errors re-run.
The left-join of gradings onto solutions: 47 columns, grouped as
-
Design cell:
study, item_id, dataset_id, dataset_revision, model, prompt_name, prompt_hash, model_config_name, replication, gen_condition_id/slug, grade_condition_id/slug, grade_kind, grader_name, grader_model, rubric_name, rubric_hash, scorer_name -
Outcome:
score, score_raw, parse_ok, parse_error, reasoning, solution, judge_completion -
Params:
temperature_requested, temperature_effective, reasoning_effort -
Cost:
gen_*andgrade_*token counts,gen_usd, grade_usd, gen_latency_s, grade_latency_s -
Audit:
gen_run_id, grade_run_id, gen_log_file, grade_log_file, created_at
Never aggregated; parse failures present with parse_ok=false. This is the
input to IRT / mixed-model analysis.
provider (the inspect prefix of model — which provider's billing the row
belongs to), calls, token totals, usd, priced (was a price known),
batch (was batch mode on), created_at. export checks ledger totals equal
row sums per stage (internally reconciled). Null usd on a row means "model
unpriced"; 0.0 with empty tokens means the call was served by the response
cache (free).
Per run: run_id, stage, created_at; itemeval_version, python_version, packages (inspect-ai, pandas, pyarrow, pydantic, pyyaml, datasets);
config_path, config_sha256 + the full parsed config; per-dataset resolved
revisions and an items content hash; solver/rubric templates each as
{name, source, path, sha256} (source is local or builtin; built-in
templates record a package-relative, machine-independent path);
resolved graders; requested sampling params (+ sampling_effective per
condition, backfilled after the run); endpoints_effective per condition
({provider, base_url, served_model}, backfilled after the run — the endpoint
and provider-returned model snapshot that actually answered; base_url is null
on the provider's default endpoint); policy, effective replications/items
limit/batch; the complete condition grid with payloads; the condition ids
this run selected; and the estimate the run was approved under
(estimate_usd = the remaining figure, estimate_full_usd alongside).
Because manifests are immutable, they double as an endpoint drift
record: when past manifests show inconsistent served_model snapshots for a
model id this run uses (or the last run is >30 days old), generate/grade
print a best-effort warning — rows remain distinguishable by run_id.
itemeval export CONFIG --snapshot NAME (Python:
export_study(cfg, snapshot="NAME") → ExportResult.snapshot_path) runs a
normal export, then freezes an immutable copy:
export/snapshots/<name>/
gradings_long.parquet # frozen copy of the just-written export
gradings_long.csv
ledger.csv
dataset_locks.json # pins as of snapshot time
manifests/ # every manifest covering included rows
snapshot.json # name, created_at, itemeval_version, config_sha256,
# run_ids, row/condition counts, spend totals
STUDY_CARD.md # self-describing record (see below)
Why copy, not reference: the current-state layer is mutable (upserts
replace), so "the table as of pub-1" cannot be reconstructed later — history
is materialized at freeze time. Rules: names match
^[a-z0-9][a-z0-9_-]{0,63}$; an existing name is refused (exit 2,
snapshot 'pub1' exists — choose a new name) — refusing overwrite removes
the only destructive path; snapshots are never read by any compute path (not
resume, not merge). Consume them like any export (read the parquet; zip the
folder to share). status lists existing snapshots
(snapshots: pub1 (2026-06-11, 1,920 rows); snapshots[] in JSON).
STUDY_CARD.md is the HF-dataset-card analog written into every snapshot:
YAML front-matter (schema-versioned, itemeval_study_card: 1 — datasets and
pins, models, replications, graders, rows, spend) followed by sections every
number of which is derived from existing stores: Design (the facet grid with
content hashes and template sources), Execution (one row per run from
manifests + ledger, including served_model per condition — exactly which
provider snapshot answered), Results (completion matrix and per-condition
mean scores, labeled descriptive, not analysis), Costs (per-stage and
per-provider spend, savings decomposition), and Reproduce (the config
verbatim plus dataset pins). Configs must never contain secrets — keys live
in the environment.