
Tutorial Budget and Scale

github-actions[bot] edited this page Jun 11, 2026 · 1 revision

Tutorial 5 — Scale up without surprises

Use case: "The pipeline works at dev scope. Now I want the full run — hundreds or thousands of paid calls — without discovering the bill afterward."

itemeval's budget layer exists so that scaling up is a config change, not a leap of faith: estimate first, gate on a threshold, cap hard, batch for ~50% off, and resume anything that breaks. You will take a validated study (any of Tutorials 1–4; examples below use Tutorial 3's compare.yaml) to full scope.

Step 1 — Refresh pricing and re-estimate at full scope

Switch the policy from dev to a full run and set the guardrails:

budget:
  policy: full-batch        # all items, batch APIs on (~50% cheaper)
  confirm_above_usd: 5      # ask before anything projected above $5
  max_usd: 25               # hard cap: abort if projection exceeds this. Never overridable.

Then:

itemeval estimate compare.yaml --refresh-pricing

--refresh-pricing pulls current per-token prices (from the OpenRouter catalog) into a local cache, so projections use today's prices — do this before any sizeable run. Read the estimate's per-condition breakdown and its warnings; in particular, an uncapped-generation warning means you forgot max_tokens and the estimator had to assume a pessimistic default. The estimate always projects the full grid (it doesn't subtract completed work) — a deliberate, conservative planning number targeted to be within ~2× of actuals.

To keep prices fresh automatically instead, set budget.pricing_max_age_days: 7 — every cost-bearing command prints which pricing table it used either way.

Step 2 — Understand the three policies and the gate

policy            items            batch APIs              use for
dev (default)     first dev_items  forced off              pipeline validation
full-interactive  all              off unless batch: true  runs you watch
full-batch        all              on (batch: auto)        large unattended runs

When you run generate or grade, the projection meets the gate, in order:

  1. projection > max_usd → abort (exit 4). --yes does not override.
  2. projection ≤ confirm_above_usd → proceed.
  3. otherwise → interactive Proceed? [y/N], or exit 3 if there's no TTY.
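The gate order above can be sketched as a small pure function. This is an illustration of the documented rules, not itemeval's actual code: the name gate_decision, its parameters, and the returned labels are hypothetical, and treating --yes as an advance "yes" to the prompt is an assumption based on how the flag is used in this tutorial.

```python
def gate_decision(projection_usd, confirm_above_usd, max_usd,
                  has_tty, assume_yes=False):
    """Return a label for what the budget gate would do (hypothetical helper)."""
    # 1. The hard cap is checked first; nothing overrides it, including --yes.
    if projection_usd > max_usd:
        return "abort (exit 4)"
    # 2. At or below the confirmation threshold: proceed without asking.
    if projection_usd <= confirm_above_usd:
        return "proceed"
    # 3. In between. Assumption: --yes answers the prompt in advance.
    if assume_yes:
        return "proceed"
    return "prompt" if has_tty else "abort (exit 3)"


# The guardrails from Step 1: confirm_above_usd = 5, max_usd = 25.
for projection, tty, yes in [(30.0, True, True), (4.0, False, False),
                             (12.0, True, False), (12.0, False, False),
                             (12.0, False, True)]:
    print(projection, tty, yes, "->", gate_decision(projection, 5.0, 25.0, tty, yes))
```

The point of checking max_usd before anything else is visible in the first case: a $30 projection aborts even with assume_yes=True.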

The scripting/CI pattern: set confirm_above_usd to your comfort level, pass --yes, and let max_usd be the backstop that no flag can talk past.

Step 3 — Run it (batched, unattended-safe)

itemeval generate compare.yaml --yes
itemeval grade    compare.yaml --yes

Under full-batch, eligible providers (OpenAI, Anthropic, Google, Grok, Together) receive the calls through their batch APIs at roughly half price — slower, but built for exactly this. Judge grading additionally benefits from provider prompt caching: the rubric + problem prefix repeats across solutions, so repeated prefixes are served from cache where the provider supports it.

Interruptions are a non-event. Ctrl-C, a crash, a rate-limit storm, an expired session — re-run the same command. The stores are keyed, so completed work skips, errored samples retry, and inspect_ai's local response cache means already-paid calls are never paid twice (re-served rows record usd = 0.0). Check progress any time with:

itemeval status compare.yaml    # done/expected per condition, errors, spend so far
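The resume behaviour described above (completed work skips, errored samples retry) can be pictured with a keyed dictionary. This is a toy model under stated assumptions, not itemeval's store format: run_pending, the (condition, item) key shape, and the row fields are all hypothetical.

```python
def run_pending(keys, store, call):
    """Attempt every key that is not already done; return how many were attempted."""
    attempted = 0
    for key in keys:
        row = store.get(key)
        if row is not None and row["status"] == "done":
            continue  # completed work is skipped, so it is never paid twice
        attempted += 1
        try:
            store[key] = {"status": "done", "output": call(key)}
        except Exception as exc:
            # Errored rows stay in the store and are retried on the next run.
            store[key] = {"status": "error", "error": str(exc)}
    return attempted


# Demo: one sample fails on its first attempt, then succeeds on the re-run.
keys = [("cond_a", i) for i in range(3)]
store = {}
attempts = []

def flaky_call(key):
    attempts.append(key)
    if key == ("cond_a", 1) and attempts.count(key) == 1:
        raise RuntimeError("rate limited")
    return f"output for {key}"

first_run = run_pending(keys, store, flaky_call)    # attempts all three keys
second_run = run_pending(keys, store, flaky_call)   # retries only the errored key
print(first_run, second_run)
```

Re-running the same command is safe in this model because the key, not the run, decides whether a call is made.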

If a few samples keep erroring, the run still completes (exit 1, failures reported per condition) — status and the stores tell you exactly which rows are missing; see Error Handling.
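For unattended runs it can help to turn the exit codes mentioned in this tutorial into readable messages. The wrapper below is a sketch: describe_exit, run_stage, and main are hypothetical names, the only commands invoked are the generate and grade calls shown above, and treating exit code 0 as plain success is an assumption (the tutorial only documents 1, 3, and 4).

```python
import subprocess

# Exit codes as described in this tutorial; 0 meaning success is an assumption.
EXIT_MEANINGS = {
    0: "completed",
    1: "completed with per-condition failures; check `itemeval status`",
    3: "projection needs confirmation but no TTY was available",
    4: "projection exceeds max_usd; aborted",
}


def describe_exit(code):
    """Map an exit code to a short human-readable message."""
    return EXIT_MEANINGS.get(code, f"unrecognised exit code {code}")


def run_stage(stage, config="compare.yaml"):
    """Run one itemeval stage with --yes and return its exit code."""
    completed = subprocess.run(["itemeval", stage, config, "--yes"])
    return completed.returncode


def main():
    for stage in ("generate", "grade"):
        code = run_stage(stage)
        print(f"{stage}: {describe_exit(code)}")
        if code in (3, 4):
            # The budget gate refused; stop rather than continue to the next stage.
            raise SystemExit(code)
```

Call main() from a CI job. Whether to continue to grade after generate exits with 1 is a judgment call that this sketch leaves permissive.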

Step 4 — Audit what you actually spent

itemeval export compare.yaml

Beyond the data tables, export settles the books:

spend: generate $1.20 | grade $2.92
savings vs list price: $5.68 (58%) — cache $3.10, batch $2.58 (estimated)
provider   calls  spend   list_price  saved
anthropic  640    $2.92   $6.30       $3.38
openai     320    $1.20   $3.50       $2.30
  • Per-sample costs are recorded on every row (gen_usd, grade_usd); the per-run ledger aggregates them by stage × condition × model, and export verifies ledger totals equal row sums.
  • The savings report re-prices your tokens at the plain-API list price and splits what the package saved into prompt-cache and batch components.
  • Batch rows use a documented flat 0.5× approximation — the provider invoice is authoritative; the ledger records the batch flag so rows can be re-priced.
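The arithmetic behind the sample report can be checked by hand. The sketch below re-derives the totals from the provider rows shown above; the row dictionaries are an illustrative layout, not the actual export schema.

```python
from decimal import Decimal

# Provider rows copied from the sample report above.
rows = [
    {"provider": "anthropic", "calls": 640,
     "spend": Decimal("2.92"), "list_price": Decimal("6.30")},
    {"provider": "openai", "calls": 320,
     "spend": Decimal("1.20"), "list_price": Decimal("3.50")},
]

total_spend = sum(r["spend"] for r in rows)
total_list = sum(r["list_price"] for r in rows)
total_saved = total_list - total_spend
saved_pct = round(total_saved / total_list * 100)

# The savings split reported on the "savings vs list price" line.
cache_saved = Decimal("3.10")
batch_saved = Decimal("2.58")

print(f"spend ${total_spend} | list ${total_list} | saved ${total_saved} ({saved_pct}%)")
# -> spend $4.12 | list $9.80 | saved $5.68 (58%)
print("split adds up:", cache_saved + batch_saved == total_saved)
# -> split adds up: True
```

The same totals follow from the stage line: generate $1.20 plus grade $2.92 is also $4.12.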

Step 5 — Pilot-then-scale, the recommended lifecycle

The pattern the package is built around:

  1. dev: policy: dev, mock or cheap models, 2–10 items. Validate mapping, prompts, rubric, parsing. Cost: ≈ $0.
  2. pilot: real models, dev_items: 20–50, full-interactive if you want to watch. Sanity-check estimate-vs-actual, per-item output quality, and judge parse rates.
  3. full: full-batch, --yes, max_usd set. Walk away; status when curious; export when done.

Each step reuses the previous step's completed work where conditions overlap — nothing you validated is re-paid (note that dev runs the first N items, so its items are a subset of the full run).
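The pilot's estimate-vs-actual check can be made concrete with a ratio test against the ~2× target mentioned in Step 1. This is a minimal sketch: within_factor and the dollar figures are hypothetical, and the inputs are numbers you would read off the estimate and export output yourself.

```python
def within_factor(estimated_usd, actual_usd, factor=2.0):
    """True if the estimate is within `factor` of the actual spend, in either direction."""
    if actual_usd <= 0:
        # A fully cached re-run records usd = 0.0 and cannot be used for this check.
        raise ValueError("actual spend must be positive to compare against an estimate")
    ratio = estimated_usd / actual_usd
    return 1 / factor <= ratio <= factor


# Hypothetical pilot figures: estimated $3.00, actually spent $2.00.
print(within_factor(3.0, 2.0))
# Hypothetical outlier: estimated $10.00, actually spent $2.00.
print(within_factor(10.0, 2.0))
```

An estimate far outside the band at pilot scope is worth investigating before the full run, for example by looking for an uncapped-generation warning in the estimate.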

Gotchas at scale

  • Reasoning models with a tight max_tokens can spend the entire token allowance on hidden reasoning and return empty text. That is not an error, so it won't retry by default. itemeval surfaces these (the empty column in status), and solvers.on_empty: rerun makes them re-attempt after you raise max_tokens (Configuration).
  • Unpriced models (not in the pricing table) run fine but carry null usd; estimate flags them up front. Refresh pricing or supply budget.pricing_path.
  • Don't run two commands concurrently against the same study directory; stages are designed to run serially.
