-
Notifications
You must be signed in to change notification settings - Fork 0
Configuration
github-actions[bot] edited this page Jun 12, 2026
·
4 revisions
One YAML file fully describes a study. Validation is strict
(extra="forbid"): unknown or misspelled keys are rejected at load time with
a pydantic error naming the field.
Paths resolve by intent: input paths (prompts_dir, rubrics_dir,
budget.pricing_path) resolve relative to the config file's directory, so a
config always finds its own templates; the output path (output_dir)
resolves relative to the working directory — the current directory by
default, override with -C/--base-dir (CLI) or work_dir= (load_config).
Outputs never land next to the config or inside the installed package. Absolute
paths are used as-is.
study: my_study # required; ^[a-z0-9][a-z0-9_-]{0,63}$
output_dir: studies # output study dir = <work_dir (CWD)>/<output_dir>/<study>
prompts_dir: prompts # local solver templates at <config dir>/<prompts_dir>/solver/<name>.md
rubrics_dir: rubrics # local rubric templates at <config dir>/<rubrics_dir>/<name>.md
cache: true # inspect local response cache, both stages
benchmark:
adapter: hf # only "hf" in v0.1
datasets: # one or more; item ids must be unique across all
- id: MathArena/usamo_2025
revision: 0a2c60f2... # optional commit SHA/tag; omit to pin at first run
split: train # default: train
name: null # optional HF config name
limit: null # optional: first N rows only
mapping: # dataset columns -> Item fields
input: problem # required
id: problem_idx # optional; default: row index
target: sample_solution # optional; default ""
grading_scheme: grading_scheme # optional; non-strings stored as canonical JSON
metadata: [points] # optional columns copied into Item.metadata
solvers:
models: [openai/gpt-5-mini, anthropic/claude-haiku-4-5] # inspect model ids, unique
temperature: 0.7 # optional, 0..2
max_tokens: 1024 # optional; unset => uncapped (estimate warns)
top_p: null # optional, (0, 1]
seed: null # optional; recorded; only some providers honor it
on_empty: skip # empty (no-error) solutions: skip | rerun | grade (default skip)
cache_prompt: auto # provider prompt caching for generation: auto|on|off (auto = on when replications > 1)
split_prompt: false # render prompt as system(template head)+user(item) for provider cache breakpoints; changes condition ids
facets:
prompt: [builtin:standard] # builtin:NAME = packaged; bare NAME = local file. default: [builtin:standard]
grader: [judge_a] # judge grading; [] if using scorer only
rubric: [builtin:standard] # same rule; used only with graders. default: [builtin:standard]
scorer: null # or: exact_match | multiple_choice | numeric
replications: 4 # default 1; = inspect epochs
model_config: # sampling/reasoning variants as a facet; default one "default" cell
- name: plain
- name: thinking
reasoning_effort: high # none|minimal|low|medium|high|xhigh|max (OpenAI-style)
reasoning_tokens: 8192 # Anthropic extended thinking budget
temperature: 1.0 # per-cell overrides of solvers.* fields
graders: # resolves facets.grader names
judge_a:
model: openai/gpt-5-mini # required; judge temperature is pinned to 0.0 in v0.1
max_tokens: 2048 # default 2048
reasoning_effort: null
split_rubric: false # rubric head as system msg + solution as user msg (provider cache breakpoint); changes condition ids
crossing: full # only "full" in v0.1
budget:
policy: dev # dev | full-interactive | full-batch (default: dev)
confirm_above_usd: 5 # gate threshold (default 5.0)
batch: auto # auto | true | false | <max batch size int>
max_usd: null # hard cap; exceeding estimate aborts (exit 4), never overridable
dev_items: 2 # dev policy: first N items (default 2)
dev_replications: null # dev policy: cap replications (default: keep)
pricing_path: null # explicit pricing JSON (else user cache, else packaged seed)
cache_schedule: auto # warm-then-fan-out scheduling of same-prefix calls: auto | off-
facetsrequires at least one ofgrader/scorer. Both may be set; then the grid contains the verifiable condition plus every grader × rubric. -
Grader resolution: a
facets.gradername is looked up ingraders:; a name containing/is treated directly as a judge model id with default settings. Anything else fails at grid-expansion time with aConfigError. -
model_configname: stored under the YAML keymodel_config(the pydantic field is internally aliased). Each cell's non-null fields override the matchingsolvers.*value for that condition;reasoning_effort/reasoning_tokensexist only on cells. -
solvers.on_empty: what to do with a completed generation that produced no gradable text (empty/blanksolution, no API error — typically a reasoning model whosemax_tokenswas spent entirely on hidden reasoning). This is a distinct channel from API errors (always re-attempted) and parse failures (always final).skip(default) excludes them from grading;rerunalso treats them as not-done so a latergeneratere-attempts them (raisemax_tokens/ lowerreasoning_effortfirst — an identical request hits the response cache and stays empty);gradesends the empty answer to the judge as-is (usually scored low). Either way they are surfaced, never silently counted as complete:gradeprints the count + stop-reason breakdown andstatusshows anemptycolumn. The usual cause is too small amax_tokensfor a reasoning model — give the cap room for reasoning plus the answer. -
Templates: built-in vs local. A
prompt/rubricentry references a template in one of two namespaces, never mixed or silently shadowed:builtin:NAMEresolves to a template packaged inside itemeval (runitemeval init --with-templatesto see/copy them); a bareNAMEresolves to a local file underprompts_dir/rubrics_dir. A localstandardandbuiltin:standardare distinct references — each is content-hashed and recorded with itssource(local/builtin) in the run manifest, so two same-named templates with different content never collide. A bare name with no local file errors at load (before any output is written) and, if a built-in of that name exists, suggests thebuiltin:form. Built-in templates ship today: promptsminimal,standard; rubricstandard. -
Templates are content-hashed; required placeholders are validated before
any run. Solver prompts must contain
{input}(optional{id}). Rubrics must contain{input}and{solution}(optional{target},{grading_scheme},{id}). Rendering replaces only known placeholders — LaTeX/JSON braces in templates and item text are safe. -
policy: devtrims the run to the firstdev_itemsitems and forces batch off — the recommended default until your pipeline looks right. -
batch: autoenables batch-API mode only underpolicy: full-batch(an integer sets the batch size). Batch-capable providers: openai, anthropic, google, grok, together.
from itemeval import load_config
cfg = load_config("configs/my_study.yaml") # ConfigError on any problem
cfg.study_dir # resolved output directoryThe whole pipeline is also drivable programmatically — see Python API.