Tutorial Budget and Scale

Tutorial 5 — Scale up without surprises

Use case: "The pipeline works at dev scope. Now I want the full run — hundreds or thousands of paid calls — without discovering the bill afterward."

itemeval's budget layer exists so that scaling up is a config change, not a leap of faith: estimate first, gate on a threshold, cap hard, batch for ~50% off, and resume anything that breaks. You will take a validated study (any of Tutorials 1–4; examples below use Tutorial 3's compare.yaml) to full scope.

Step 1 — Refresh pricing and re-estimate at full scope

Switch the policy from dev to a full run and set the guardrails:

budget:
  policy: full-batch        # all items, batch APIs on (~50% cheaper)
  confirm_above_usd: 5      # ask before anything projected above $5
  max_usd: 25               # hard cap: abort if projection exceeds this. Never overridable.

Then:

itemeval estimate compare.yaml --refresh-pricing

--refresh-pricing pulls current per-token prices (from the OpenRouter catalog) into a local cache, so projections use today's prices — do this before any sizeable run. Read the estimate's per-condition breakdown and its warnings; in particular, an uncapped-generation warning means you forgot max_tokens and the estimator had to assume a pessimistic default. The estimate always projects the full grid (it doesn't subtract completed work) — a deliberate, conservative planning number targeted to be within ~2× of actuals.

To keep prices fresh automatically instead, set budget.pricing_max_age_days: 7 — every cost-bearing command prints which pricing table it used either way.

Step 2 — Understand the three policies and the gate

policy	items	batch APIs	use for
`dev` (default)	first `dev_items`	forced off	pipeline validation
`full-interactive`	all	off unless `batch: true`	runs you watch
`full-batch`	all	on (`batch: auto`)	large unattended runs

When you run generate or grade, the projection meets the gate, in order:

projection > max_usd → abort (exit 4). --yes does not override.
projection ≤ confirm_above_usd → proceed.
otherwise → interactive Proceed? [y/N], or exit 3 if there's no TTY.

The scripting/CI pattern: set confirm_above_usd to your comfort level, pass --yes, and let max_usd be the backstop that no flag can talk past.

Step 3 — Run it (batched, unattended-safe)

itemeval generate compare.yaml --yes
itemeval grade    compare.yaml --yes

Under full-batch, eligible providers (OpenAI, Anthropic, Google, Grok, Together) receive the calls through their batch APIs at roughly half price — slower, but built for exactly this. Judge grading additionally benefits from provider prompt caching: the rubric + problem prefix repeats across solutions, so repeated prefixes are served from cache where the provider supports it.

Interruptions are a non-event. Ctrl-C, a crash, a rate-limit storm, an expired session — re-run the same command. The stores are keyed, so completed work skips, errored samples retry, and inspect_ai's local response cache means already-paid calls are never paid twice (re-served rows record usd = 0.0). Check progress any time with:

itemeval status compare.yaml    # done/expected per condition, errors, spend so far

If a few samples keep erroring, the run still completes (exit 1, failures reported per condition) — status and the stores tell you exactly which rows are missing; see Error Handling.

Step 4 — Audit what you actually spent

itemeval export compare.yaml

Beyond the data tables, export settles the books:

spend: generate $1.20 | grade $2.92
savings vs list price: $5.68 (58%) — cache $3.10, batch $2.58 (estimated)
provider   calls  spend   list_price  saved
anthropic  640    $2.92   $6.30       $3.38
openai     320    $1.20   $3.50       $2.30

Per-sample costs are recorded on every row (gen_usd, grade_usd); the per-run ledger aggregates them by stage × condition × model, and export verifies ledger totals equal row sums.
The savings report re-prices your tokens at the plain-API list price and splits what the package saved into prompt-cache and batch components.
Batch rows use a documented flat 0.5× approximation — the provider invoice is authoritative; the ledger records the batch flag so rows can be re-priced.

Step 5 — Pilot-then-scale, the recommended lifecycle

The pattern the package is built around:

dev — policy: dev, mock or cheap models, 2–10 items. Validate mapping, prompts, rubric, parsing. Cost: ≈ $0.
pilot — real models, dev_items: 20–50, full-interactive if you want to watch. Sanity-check estimate-vs-actual, per-item output quality, and judge parse rates.
full — full-batch, --yes, max_usd set. Walk away; status when curious; export when done.

Each step reuses the previous step's completed work where conditions overlap — nothing you validated is re-paid (note that dev runs the first N items, so its items are a subset of the full run).

Gotchas at scale

Reasoning models with tight max_tokens can burn the whole budget on hidden reasoning and return empty text — not an error, so it won't retry by default. itemeval surfaces these (status's empty column) and solvers.on_empty: rerun makes them re-attempt after you raise the cap (Configuration).
Unpriced models (not in the pricing table) run fine but carry null usd — estimate flags them up front; refresh pricing or supply budget.pricing_path.
Don't run two commands concurrently against the same study directory; stages are designed to run serially.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tutorial Budget and Scale

Tutorial 5 — Scale up without surprises

Step 1 — Refresh pricing and re-estimate at full scope

Step 2 — Understand the three policies and the gate

Step 3 — Run it (batched, unattended-safe)

Step 4 — Audit what you actually spent

Step 5 — Pilot-then-scale, the recommended lifecycle

Gotchas at scale

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally