Skip to content

Budget and Costs

github-actions[bot] edited this page Jun 12, 2026 · 3 revisions

Budget and Costs

itemeval refuses to be surprised by a bill: every paid stage is preceded by a projection and a gate, every call's cost is attributed afterward, and a hard cap can never be talked past.

Estimation

itemeval estimate projects each stage without any model API calls:

  • input tokens: a chars/4 heuristic over the actual rendered prompts (judge inputs use real stored solutions when present, otherwise a placeholder sized from the generation max_tokens);
  • output tokens: the configured max_tokens cap, or pessimistic defaults when uncapped (4096/call for generation — with a printed uncapped-generation warning — and 512/call for judges without a cap);
  • dollars: tokens × the pricing table, with a 0.5 multiplier for batch-eligible conditions.

The estimate reports two figures per stage. The full projection covers the whole policy-effective grid; the remaining projection subtracts already-complete cells (using the same predicates the runners use to decide what to run), so it is exactly what the next run can spend. On a partially complete study the lines read:

projected generate cost: $4.10 remaining of $11.30 full grid (63% complete)

JSON parity (append-only; usd keeps meaning the full grid): remaining_usd, full_usd, remaining_calls, completed_cells, total_cells, and rows_replaced on each StageEstimate. --force sets remaining = full (everything selected re-runs). Either way the estimate is a planning number, not an invoice — target accuracy is "within ~2× of actuals".

When the planned run would overwrite existing rows (--force, epoch extension, on_empty: rerun re-attempts), the projection block states it as part of the same single confirmation: this run replaces 48 existing rows (…) — never a second prompt.

The gate

generate and grade compare their stage's remaining projection (what this run can actually spend — completed work is never re-paid) against the config:

  1. projection > budget.max_usdabort, exit 4 — never overridable;
  2. projection ≤ confirm_above_usd → proceed;
  3. --yes → proceed;
  4. interactive terminal (and not --json) → ask Proceed? [y/N];
  5. otherwise → exit 3 with "re-run with --yes to confirm".

Under --json the gate never prompts — it proceeds under threshold or with --yes, otherwise exits 3 after emitting the JSON document (projected cost, gate reason, rerun command, hints).

CI/scripting pattern: set confirm_above_usd to your comfort level, pass --yes, and set max_usd as the backstop.

Policies

policy items batch use for
dev (default) first dev_items (2) forced off pipeline validation
full-interactive all off unless batch: true runs you watch
full-batch all on (batch: auto) large unattended runs, ~50% cheaper

--policy {dev,full-interactive,full-batch} (on estimate/generate/ grade/status) overrides budget.policy for that invocation only — the config is untouched. This makes the pilot flow zero-edit: generate cfg.yaml --policy dev → inspect the export → generate cfg.yaml. Python parity: prepare_study(cfg, policy=...). The effective policy and its source (config | override) are recorded in the run manifest and in the estimate/status JSON (policy, policy_source).

Batch mode flows through inspect_ai to the provider batch APIs (openai, anthropic, google, grok, together). The recorded per-row cost applies a flat 0.5 multiplier for batch runs — a documented approximation; provider invoices are authoritative, and the ledger records the batch flag so rows can be re-priced.

Pricing table

Lookup precedence:

  1. budget.pricing_path (explicit JSON, relative to the config dir);
  2. the user cache — $ITEMEVAL_PRICING_PATH or ~/.cache/itemeval/pricing.json, written by --refresh-pricing;
  3. the packaged seed (a handful of common models, point-in-time estimates — refresh before real runs).

estimate --refresh-pricing merges live per-token prices for every model on the OpenRouter API over the seed. Models with no price are flagged unpriced in estimates and carry null usd in stores — the run still works; only cost attribution is missing.

mockllm/* models are deliberately priced (at claude-sonnet-class rates) so demos and tests exercise the full dollar pipeline at $0 actual spend.

Why OpenRouter as the refresh source

OpenRouter's /models endpoint is the only practical way to keep the long tail fresh from a single call: it lists hundreds of models across providers in one public, keyless response with a uniform prompt/completion per-token schema. Native providers (OpenAI, Anthropic, Google) publish prices on docs pages, not a stable machine-readable API, and would each need their own auth and parser. Crucially, a refresh does not clobber curated native prices: it writes openrouter/<id> keys and only fills a bare native id when the seed lacks it (seed wins for native ids). The refresh is an estimate for planning — the provider invoice is authoritative.

Auto-refresh

Set budget.pricing_max_age_days to refresh the cached table automatically once it ages past the threshold — no manual --refresh-pricing. It is:

  • opt-in (default None = off), so no command makes a surprise network call and offline/CI runs are unaffected;
  • best-effort — a failed fetch (offline, API change) keeps the existing table and never breaks a run;
  • ignored when budget.pricing_path pins an explicit table (you chose it deliberately).

Provenance (knowing which prices you got)

Because prices can come from a pinned file, a months-old seed, or a fresh refresh, every cost-bearing command states where its numbers came from:

pricing: merged (updated 2026-06-08T00:00:00Z, 2d old) — just refreshed from OpenRouter

estimate, generate, grade, export, and status all print this line. Programmatically the same provenance is on Estimate.pricing and ExportResult.pricing (a PricingProvenance: source, updated_at, age_days, refreshed), and PreparedStudy.pricing_refreshed flags whether a live refresh ran during preparation.

Provider prompt caching (how it's priced and reported)

Providers discount input tokens whose prefix they recently processed. Which options engage this, their measured price/time trade-offs, and when to use them live in Cost Savings (cache_schedule, split_prompt, split_rubric, direct API vs OpenRouter). This page covers only the accounting:

  • Every row records cache_read_tokens / cache_write_tokens; run summaries print per-condition totals and the hit rate (cache_read=… cache_write=… hit_rows=…). Zeros on a run that should cache are the diagnostic signal.
  • Cache reads are priced at the model's cache-read rate (default 0.1× input); cache writes are priced at 1.25× input for Anthropic-family models (an explicit-caching surcharge) and $0 for providers with free automatic caching (OpenAI, Gemini, DeepSeek). --refresh-pricing pulls per-model cache rates from OpenRouter where published.
  • inspect's accounting keeps input_tokens exclusive of the cache buckets: true input = input + cache_read + cache_write. The savings report below relies on this.
  • Cache lifetimes are minutes, so these discounts apply within a run, not across days — cross-run reuse is the local response cache's job (a local-cache hit has null tokens and usd = 0.0, a different signature than a provider cache read).

Where actual costs come from

Per-sample token usage from inspect logs × the pricing table, recorded on every solutions/gradings row (usd) and aggregated per (run × stage × condition × model) in the ledger. Cache-served calls record usd = 0.0. export verifies ledger totals equal row sums; checking the totals against your provider dashboards is a manual step (expect the batch approximation delta).

Savings and per-provider spend

export re-prices the ledger's stored tokens at the current table to report what the package saved versus a plain-API list price, plus a per-provider breakdown (ExportResult.cost, a CostReport):

spend: generate $1.20 | grade $2.92
savings vs list price: $5.68 (58%) — cache $3.10, batch $2.58 (estimated; excludes resume / response-cache reuse)
provider   calls  spend   list_price  saved
anthropic  640    $2.92   $6.30       $3.38
openai     320    $1.20   $3.50       $2.30

Three cost points are computed per ledger row and summed:

point input pricing batch meaning
baseline every input token at the full rate none plain API, no caching/batch
after_cache cached tokens at their discounted rates none caching on, batch off
actual discounted cache rates ×0.5 if batched what you paid

The savings decompose exactly: cache_savings = baseline − after_cache, batch_savings = after_cache − actual, and the two sum to total_savings = baseline − actual. This relies on inspect's accounting that input_tokens excludes cached tokens (true input is input + cache_read + cache_write), so the baseline re-adds the cache buckets at the full rate.

Scope and caveats:

  • Cache writes are a surcharge, not a saving (the first call pays ~1.25× input on the cached prefix); savings only accrue on later reads, so cache_savings can be negative on a run with little reuse.
  • Batch savings inherit the flat 0.5 approximation — provider invoice is authoritative.
  • Resume / response-cache reuse is NOT counted. A local-cache hit carries no usage object, so its ledger row holds null tokens and contributes zero to both actual and baseline. The figure therefore covers the prompt-cache and batch discounts only; counting reuse savings (a join back to the original run's tokens) is a planned follow-up.
  • Reasoning tokens need no special handling — they sit inside output_tokens in both baseline and actual, so they cancel (a cost, not a saving).
  • unpriced models are excluded from the figures and listed separately.

Clone this wiki locally