FAQ
With an in-eval scorer, judge calls live inside the generating eval: one set
of logs, no separate batching/caching/cost attribution, and adding a second
judge or rubric later re-runs generation. itemeval stores solutions once and
fans grading out over them — itemeval grade --grader new_judge against a
finished study costs only judge tokens. For item-response analysis (where
grader and rubric are facets, not afterthoughts) this decoupling is the
whole point.
Condition ids hash the condition's content: model id, resolved sampling
params, prompt/rubric name and file content. If you edited a template or
changed solvers.temperature, the affected conditions are genuinely new
cells — old rows remain under the old id, and status shows the new cells
as 0/N. This is deliberate: you can never silently mix results produced
under different conditions. Cosmetic renames also change the id (the slug is
part of it), so finish naming before big runs.
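The content-hashing idea above can be sketched in a few lines. This is only an illustration, not itemeval's actual implementation: the function name `condition_id`, the canonical-JSON encoding, the use of SHA-256, and the `slug-digest` layout are all assumptions made for the example.

```python
import hashlib
import json


def condition_id(slug, model, sampling, template_name, template_text):
    """Hypothetical sketch: derive an id from everything that defines a condition."""
    payload = {
        "slug": slug,
        "model": model,
        "sampling": sampling,  # resolved sampling params, e.g. {"temperature": 0.0}
        "template_name": template_name,
        # Hash the file content, so editing the template changes the id.
        "template_sha": hashlib.sha256(template_text.encode("utf-8")).hexdigest(),
    }
    # Canonical JSON (sorted keys, fixed separators) so key order cannot change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return f"{slug}-{digest}"


args = ("openai/gpt-5-mini", {"temperature": 0.0}, "x", "Solve: {question}")
base = condition_id("baseline", *args)
same = condition_id("baseline", *args)
hotter = condition_id("baseline", "openai/gpt-5-mini", {"temperature": 0.7}, "x", "Solve: {question}")
edited = condition_id("baseline", "openai/gpt-5-mini", {"temperature": 0.0}, "x", "Solve carefully: {question}")
renamed = condition_id("base", *args)
```

In this sketch, identical inputs always give the same id, while a changed temperature, an edited template, or a renamed slug each produce a distinct id, which is the behaviour described above.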
Every facets.grader name needs either an entry under graders: or to be a
model id containing /. Add:

    graders:
      judge_a: {model: openai/gpt-5-mini}

A bare template name (x) resolves to a local file under prompts_dir/
rubrics_dir, anchored to the config file's directory. Either create
<config dir>/prompts/solver/x.md, point prompts_dir at the right directory,
or — if you meant a packaged template — reference it as builtin:x. The error
lists the local templates found and suggests builtin: when a built-in of that
name exists. (Outputs are separate: they anchor to the working directory, not
the config dir — see Configuration.)
The cost gate needs confirmation and stdin isn't a TTY. Pass --yes
(and set budget.max_usd as the un-overridable backstop).
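A confirmation gate of this kind might look roughly like the sketch below. The function `confirm_spend` and its parameters are hypothetical and do not come from itemeval; the only real API relied on is the standard `isatty()` method of file objects.

```python
import sys


def confirm_spend(estimated_usd, max_usd, assume_yes, stdin=None):
    """Hypothetical sketch of a cost gate with a hard budget backstop."""
    stdin = sys.stdin if stdin is None else stdin
    # The backstop is checked first, so --yes cannot override it.
    if max_usd is not None and estimated_usd > max_usd:
        raise RuntimeError(
            f"estimated ${estimated_usd:.2f} exceeds budget.max_usd ${max_usd:.2f}"
        )
    if assume_yes:
        return True
    # Without a terminal there is nobody to answer the prompt.
    if not stdin.isatty():
        raise RuntimeError("confirmation needed but stdin is not a TTY; pass --yes")
    print(f"Estimated cost ${estimated_usd:.2f}. Continue? [y/N] ", end="", flush=True)
    return stdin.readline().strip().lower() in {"y", "yes"}
```

In CI or under a pipe, `stdin.isatty()` is false, so the only way through such a gate is the explicit flag, with the budget cap still enforced.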
Those calls were served by inspect's local response cache — genuinely free.
Null usd is different: it means no price was known for the model (run
estimate --refresh-pricing or provide budget.pricing_path).
The judge's output didn't contain a valid {"score": ...} JSON block;
parse_error says exactly how it failed, and judge_completion holds the
raw text. These rows are kept (never dropped) and are final — re-running
grade won't retry them. If you fix a rubric to elicit better-formatted
output, the rubric hash changes and grading starts a fresh condition; to
re-grade in place, use grade --force.
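To make the failure modes concrete, here is a minimal sketch of how a judge completion could be parsed into either a score or a parse error. It is not itemeval's parser: the function name, the regular expression, the error messages, and the assumption that a valid score is a JSON number are all illustrative.

```python
import json
import re


def parse_judge_output(completion):
    """Hypothetical sketch: return (score, parse_error); exactly one is None."""
    # Take the first {...} span that mentions "score"; nested braces are not handled.
    match = re.search(r'\{[^{}]*"score"[^{}]*\}', completion)
    if match is None:
        return None, 'no JSON object with a "score" key found'
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    score = obj.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None, f"score is not a number: {score!r}"
    return float(score), None
```

Keeping the raw completion next to the error, as the store does with judge_completion, is what makes rows like these debuggable after the fact.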
Just re-run the same command. The store is keyed: completed work skips,
errored rows re-run, and the response cache means already-paid calls aren't
paid again. status shows exactly what's missing.
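The resume behaviour can be pictured as a lookup against a keyed store. The sketch below is an assumption-laden illustration: the `(condition_id, item_id)` key shape, the `status` field, and its `"done"` / `"error"` values are hypothetical and not taken from itemeval's schema.

```python
def pending_keys(planned, store):
    """Hypothetical sketch: return the planned cells that still need to run."""
    todo = []
    for key in planned:
        row = store.get(key)
        # Missing rows were never run; errored rows are retried.
        if row is None or row["status"] == "error":
            todo.append(key)
        # Rows with status "done" are skipped.
    return todo


planned = [("c1", "i1"), ("c1", "i2"), ("c1", "i3")]
store = {
    ("c1", "i1"): {"status": "done"},
    ("c1", "i2"): {"status": "error"},
}
todo = pending_keys(planned, store)  # i2 is retried, i3 was never run
```

Because the decision depends only on the key and the stored row, re-running the same command is idempotent for completed work.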
--condition <id|id-prefix|slug> (repeatable) on generate/grade;
--grader / --rubric on grade. The dev policy (budget.policy: dev)
limits runs to the first dev_items items globally.
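As a rough picture of how a repeatable selector that accepts an id, an id prefix, or a slug could be matched, consider the sketch below. The function name and the id-to-slug mapping are hypothetical, and how itemeval treats ambiguous prefixes is not covered here.

```python
def select_conditions(selectors, conditions):
    """Hypothetical sketch: conditions maps condition id -> slug."""
    selected = []
    for cid, slug in conditions.items():
        # Keep a condition if any selector is its full id, an id prefix, or its slug.
        if any(cid == s or cid.startswith(s) or slug == s for s in selectors):
            selected.append(cid)
    return selected


conditions = {"a1b2c3": "baseline", "a1ffee": "hot", "9d8e7f": "cot"}
by_prefix = select_conditions(["a1"], conditions)
by_slug = select_conditions(["cot"], conditions)
mixed = select_conditions(["9d8e7f", "hot"], conditions)
```

Here the prefix `a1` selects both conditions whose ids start with it, while a slug or full id selects exactly one.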
No model API calls, ever. The first run of any command resolves and downloads the dataset from the HF Hub (free); after that, the revision lock plus HF's local cache make loads effectively offline.
Not yet — adapter: hf is the only adapter in v0.1. GitHub-repo and local
JSONL adapters are on the roadmap ("Later"); the adapter protocol in
adapters/_base.py is the extension point.
studies/<study>/logs/<stage>/<condition_id>/*.eval — full inspect logs.
inspect view --log-dir studies/<study>/logs gives you the inspect UI over
them; every store row carries its log_file and sample_uuid.
Yes — any model id starting with mockllm/ runs a deterministic free stub
(solver-style or judge-style depending on stage). It exists so pipelines can
be validated end-to-end at $0; swap in real model ids when ready.