Configuration
Zhimeng Luo edited this page Jun 10, 2026
One YAML file fully describes a study. Validation is strict
(extra="forbid"): unknown or misspelled keys are rejected at load time with
a pydantic error naming the field. Relative paths (output_dir,
prompts_dir, rubrics_dir, budget.pricing_path) resolve relative to
the config file's directory.
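As a rough illustration of the path rule above, the sketch below re-anchors a relative value at the config file's directory. `resolve_config_path` is a hypothetical helper written for this page, not part of itemeval, and leaving absolute paths unchanged is an assumption.

```python
from pathlib import Path


def resolve_config_path(config_file: str, value: str) -> Path:
    """Resolve a path-valued config field against the config file's directory."""
    base = Path(config_file).parent
    candidate = Path(value)
    # Assumption: absolute paths are kept as-is; only relative ones are re-anchored.
    return candidate if candidate.is_absolute() else base / candidate


# A relative prompts_dir in configs/my_study.yaml lands next to the config file,
# regardless of the directory the command is launched from.
prompts = resolve_config_path("configs/my_study.yaml", "prompts")
```

Anchoring at the config file rather than the working directory means a study can be run from anywhere without editing its paths.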
study: my_study                 # required; ^[a-z0-9][a-z0-9_-]{0,63}$
output_dir: studies             # study dir = <config dir>/<output_dir>/<study>
prompts_dir: prompts            # solver templates at <prompts_dir>/solver/<name>.md
rubrics_dir: rubrics            # rubric templates at <rubrics_dir>/<name>.md
cache: true                     # inspect local response cache, both stages

benchmark:
  adapter: hf                   # only "hf" in v0.1
  datasets:                     # one or more; item ids must be unique across all
    - id: MathArena/usamo_2025
      revision: 0a2c60f2...     # optional commit SHA/tag; omit to pin at first run
      split: train              # default: train
      name: null                # optional HF config name
      limit: null               # optional: first N rows only
      mapping:                  # dataset columns -> Item fields
        input: problem          # required
        id: problem_idx         # optional; default: row index
        target: sample_solution # optional; default ""
        grading_scheme: grading_scheme  # optional; non-strings stored as canonical JSON
        metadata: [points]      # optional columns copied into Item.metadata

solvers:
  models: [openai/gpt-5-mini, anthropic/claude-haiku-4-5]  # inspect model ids, unique
  temperature: 0.7              # optional, 0..2
  max_tokens: 1024              # optional; unset => uncapped (estimate warns)
  top_p: null                   # optional, (0, 1]
  seed: null                    # optional; recorded; only some providers honor it

facets:
  prompt: [minimal, standard]   # default: [default]
  grader: [judge_a]             # judge grading; [] if using scorer only
  rubric: [standard]            # default: [default]; used only with graders
  scorer: null                  # or: exact_match | multiple_choice | numeric
  replications: 4               # default 1; = inspect epochs
  model_config:                 # sampling/reasoning variants as a facet; default one "default" cell
    - name: plain
    - name: thinking
      reasoning_effort: high    # none|minimal|low|medium|high|xhigh|max (OpenAI-style)
      reasoning_tokens: 8192    # Anthropic extended thinking budget
      temperature: 1.0          # per-cell overrides of solvers.* fields

graders:                        # resolves facets.grader names
  judge_a:
    model: openai/gpt-5-mini    # required; judge temperature is pinned to 0.0 in v0.1
    max_tokens: 2048            # default 2048
    reasoning_effort: null

crossing: full                  # only "full" in v0.1

budget:
  policy: dev                   # dev | full-interactive | full-batch (default: dev)
  confirm_above_usd: 5          # gate threshold (default 5.0)
  batch: auto                   # auto | true | false | <max batch size int>
  max_usd: null                 # hard cap; exceeding estimate aborts (exit 4), never overridable
  dev_items: 2                  # dev policy: first N items (default 2)
  dev_replications: null        # dev policy: cap replications (default: keep)
  pricing_path: null            # explicit pricing JSON (else user cache, else packaged seed)
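The comments on study and output_dir in the example can be checked with a few lines of Python. This is a minimal sketch, not itemeval's own code: `is_valid_study_name` and `study_dir` are hypothetical names, the pattern is copied from the comment on study, and the layout follows the comment on output_dir (assuming a relative output_dir).

```python
import re
from pathlib import Path

# Pattern copied from the comment on `study`: a lowercase letter or digit,
# followed by up to 63 lowercase letters, digits, underscores, or hyphens.
STUDY_NAME = re.compile(r"^[a-z0-9][a-z0-9_-]{0,63}$")


def is_valid_study_name(name: str) -> bool:
    """Return True when the name matches the documented pattern."""
    return STUDY_NAME.fullmatch(name) is not None


def study_dir(config_file: str, output_dir: str, study: str) -> Path:
    """Compose <config dir>/<output_dir>/<study> for a relative output_dir."""
    return Path(config_file).parent / output_dir / study


ok = is_valid_study_name("my_study")
where = study_dir("configs/my_study.yaml", "studies", "my_study")
```

The length limit works out to 64 characters in total: one leading character plus at most 63 more.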
- facets requires at least one of grader / scorer. Both may be set; the grid then contains the verifiable condition plus every grader × rubric.
- Grader resolution: a facets.grader name is looked up in graders:; a name containing / is treated directly as a judge model id with default settings. Anything else fails at grid-expansion time with a ConfigError.
- model_config name: stored under the YAML key model_config (the pydantic field is internally aliased). Each cell's non-null fields override the matching solvers.* value for that condition; reasoning_effort / reasoning_tokens exist only on cells.
- Templates are content-hashed; required placeholders are validated at grid expansion. Solver prompts must contain {input} (optional {id}). Rubrics must contain {input} and {solution} (optional {target}, {grading_scheme}, {id}). Rendering replaces only known placeholders, so LaTeX/JSON braces in templates and item text are safe.
- policy: dev trims the run to the first dev_items items and forces batch off; it is the recommended default until your pipeline looks right.
- batch: auto enables batch-API mode only under policy: full-batch (an integer sets the batch size). Batch-capable providers: openai, anthropic, google, grok, together.
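To see why replacing only known placeholders keeps LaTeX and JSON braces intact, consider the sketch below. It is one possible way to get that behaviour, not itemeval's actual renderer; `render` and `missing_placeholders` are hypothetical names, and the single-pass regex substitution is an assumption about the approach.

```python
import re


def missing_placeholders(template: str, required: set[str]) -> set[str]:
    """Return the required placeholder names that the template lacks."""
    return {name for name in required if "{" + name + "}" not in template}


def render(template: str, values: dict[str, str]) -> str:
    """Replace only {name} tokens whose name is a key of `values`."""
    if not values:
        return template
    names = "|".join(re.escape(name) for name in values)
    pattern = re.compile(r"\{(" + names + r")\}")
    # A function replacement inserts the value literally (no backslash
    # processing), and the single pass never re-scans substituted text.
    return pattern.sub(lambda m: values[m.group(1)], template)


template = 'Problem {id}: {input}\nAnswer as \\boxed{x} or {"answer": 1}.'
values = {"id": "7", "input": "Compute \\frac{1}{2} + {id}"}
rendered = render(template, values)
```

Here \boxed{x} and {"answer": 1} pass through untouched because x and "answer": 1 are not known names, and the {id} inside the item text stays literal because substituted text is not scanned again. By contrast, str.format would raise on the same template.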
from itemeval import load_config
cfg = load_config("configs/my_study.yaml") # ConfigError on any problem
cfg.study_dir # resolved output directory

The whole pipeline is also drivable programmatically; see Python API.