# Budget and Costs itemeval refuses to be surprised by a bill: every paid stage is preceded by a projection and a gate, every call's cost is attributed afterward, and a hard cap can never be talked past. ## Estimation `itemeval estimate` projects each stage **without any model API calls**: - input tokens: a chars/4 heuristic over the actual rendered prompts (judge inputs use real stored solutions when present, otherwise a placeholder sized from the generation `max_tokens`); - output tokens: the configured `max_tokens` cap, or pessimistic defaults when uncapped (4096/call for generation — with a printed `uncapped-generation` warning — and 512/call for judges without a cap); - dollars: tokens × the pricing table, with a 0.5 multiplier for batch-eligible conditions. The estimate reports two figures per stage. The **full** projection covers the whole policy-effective grid; the **remaining** projection subtracts already-complete cells (using the same predicates the runners use to decide what to run), so it is exactly what the next run can spend. On a partially complete study the lines read: ``` projected generate cost: $4.10 remaining of $11.30 full grid (63% complete) ``` JSON parity (append-only; `usd` keeps meaning the full grid): `remaining_usd`, `full_usd`, `remaining_calls`, `completed_cells`, `total_cells`, and `rows_replaced` on each `StageEstimate`. `--force` sets remaining = full (everything selected re-runs). Either way the estimate is a planning number, not an invoice — target accuracy is "within ~2× of actuals". When the planned run would overwrite existing rows (`--force`, epoch extension, `on_empty: rerun` re-attempts), the projection block states it as part of the same single confirmation: `this run replaces 48 existing rows (…)` — never a second prompt. ## The gate `generate` and `grade` compare their stage's **remaining** projection (what this run can actually spend — completed work is never re-paid) against the config: 1. projection > `budget.max_usd` → **abort, exit 4** — never overridable; 2. projection ≤ `confirm_above_usd` → proceed; 3. `--yes` → proceed; 4. interactive terminal (and not `--json`) → ask `Proceed? [y/N]`; 5. otherwise → **exit 3** with "re-run with --yes to confirm". Under `--json` the gate never prompts — it proceeds under threshold or with `--yes`, otherwise exits 3 after emitting the JSON document (projected cost, gate reason, rerun command, hints). CI/scripting pattern: set `confirm_above_usd` to your comfort level, pass `--yes`, and set `max_usd` as the backstop. ## Policies | policy | items | batch | use for | |--------|-------|-------|---------| | `dev` (default) | first `dev_items` (2) | forced off | pipeline validation | | `full-interactive` | all | off unless `batch: true` | runs you watch | | `full-batch` | all | on (`batch: auto`) | large unattended runs, ~50% cheaper | `--policy {dev,full-interactive,full-batch}` (on `estimate`/`generate`/ `grade`/`status`) overrides `budget.policy` for that invocation only — the config is untouched. This makes the pilot flow zero-edit: `generate cfg.yaml --policy dev` → inspect the export → `generate cfg.yaml`. Python parity: `prepare_study(cfg, policy=...)`. The effective policy and its source (`config` | `override`) are recorded in the run manifest and in the estimate/status JSON (`policy`, `policy_source`). Batch mode flows through inspect_ai to the provider batch APIs (openai, anthropic, google, grok, together). The recorded per-row cost applies a flat 0.5 multiplier for batch runs — a documented approximation; **provider invoices are authoritative**, and the ledger records the `batch` flag so rows can be re-priced. ## Pricing table Lookup precedence: 1. `budget.pricing_path` (explicit JSON, relative to the config dir); 2. the user cache — `$ITEMEVAL_PRICING_PATH` or `~/.cache/itemeval/pricing.json`, written by `--refresh-pricing`; 3. the packaged seed (a handful of common models, point-in-time estimates — refresh before real runs). `estimate --refresh-pricing` merges live per-token prices for every model on the OpenRouter API over the seed. Models with no price are flagged `unpriced` in estimates and carry null `usd` in stores — the run still works; only cost attribution is missing. `mockllm/*` models are deliberately priced (at claude-sonnet-class rates) so demos and tests exercise the full dollar pipeline at $0 actual spend. ### Why OpenRouter as the refresh source OpenRouter's `/models` endpoint is the only practical way to keep the long tail fresh from a single call: it lists hundreds of models across providers in one **public, keyless** response with a **uniform** `prompt`/`completion` per-token schema. Native providers (OpenAI, Anthropic, Google) publish prices on docs pages, not a stable machine-readable API, and would each need their own auth and parser. Crucially, a refresh does **not** clobber curated native prices: it writes `openrouter/` keys and only fills a bare native id when the seed lacks it (`seed wins for native ids`). The refresh is an estimate for planning — the **provider invoice is authoritative**. ### Auto-refresh Set `budget.pricing_max_age_days` to refresh the cached table automatically once it ages past the threshold — no manual `--refresh-pricing`. It is: - **opt-in** (default `None` = off), so no command makes a surprise network call and offline/CI runs are unaffected; - **best-effort** — a failed fetch (offline, API change) keeps the existing table and never breaks a run; - **ignored** when `budget.pricing_path` pins an explicit table (you chose it deliberately). ### Provenance (knowing which prices you got) Because prices can come from a pinned file, a months-old seed, or a fresh refresh, every cost-bearing command states where its numbers came from: ``` pricing: merged (updated 2026-06-08T00:00:00Z, 2d old) — just refreshed from OpenRouter ``` `estimate`, `generate`, `grade`, `export`, and `status` all print this line. Programmatically the same provenance is on `Estimate.pricing` and `ExportResult.pricing` (a `PricingProvenance`: `source`, `updated_at`, `age_days`, `refreshed`), and `PreparedStudy.pricing_refreshed` flags whether a live refresh ran during preparation. ## Provider prompt caching (how it's priced and reported) Providers discount input tokens whose prefix they recently processed. **Which options engage this, their measured price/time trade-offs, and when to use them live in [Cost Savings](Cost-Savings.md)** (`cache_schedule`, `split_prompt`, `split_rubric`, direct API vs OpenRouter). This page covers only the accounting: - Every row records `cache_read_tokens` / `cache_write_tokens`; run summaries print per-condition totals and the hit rate (`cache_read=… cache_write=… hit_rows=…`). Zeros on a run that should cache are the diagnostic signal. - Cache reads are priced at the model's cache-read rate (default 0.1× input); cache **writes** are priced at 1.25× input for Anthropic-family models (an explicit-caching surcharge) and $0 for providers with free automatic caching (OpenAI, Gemini, DeepSeek). `--refresh-pricing` pulls per-model cache rates from OpenRouter where published. - inspect's accounting keeps `input_tokens` **exclusive** of the cache buckets: true input = `input + cache_read + cache_write`. The savings report below relies on this. - Cache lifetimes are minutes, so these discounts apply within a run, not across days — cross-run reuse is the local response cache's job (a local-cache hit has null tokens and `usd = 0.0`, a different signature than a provider cache read). ## Where actual costs come from Per-sample token usage from inspect logs × the pricing table, recorded on every solutions/gradings row (`usd`) and aggregated per (run × stage × condition × model) in the ledger. Cache-served calls record `usd = 0.0`. `export` verifies ledger totals equal row sums; checking the totals against your provider dashboards is a manual step (expect the batch approximation delta). ## Savings and per-provider spend `export` re-prices the ledger's stored tokens at the current table to report what the package saved versus a plain-API list price, plus a per-provider breakdown (`ExportResult.cost`, a `CostReport`): ``` spend: generate $1.20 | grade $2.92 savings vs list price: $5.68 (58%) — cache $3.10, batch $2.58 (estimated; excludes resume / response-cache reuse) provider calls spend list_price saved anthropic 640 $2.92 $6.30 $3.38 openai 320 $1.20 $3.50 $2.30 ``` Three cost points are computed per ledger row and summed: | point | input pricing | batch | meaning | |-------|---------------|-------|---------| | `baseline` | every input token at the **full** rate | none | plain API, no caching/batch | | `after_cache` | cached tokens at their discounted rates | none | caching on, batch off | | `actual` | discounted cache rates | ×0.5 if batched | what you paid | The savings decompose exactly: `cache_savings = baseline − after_cache`, `batch_savings = after_cache − actual`, and the two sum to `total_savings = baseline − actual`. This relies on inspect's accounting that `input_tokens` **excludes** cached tokens (true input is `input + cache_read + cache_write`), so the baseline re-adds the cache buckets at the full rate. **Scope and caveats:** - **Cache writes are a surcharge, not a saving** (the first call pays ~1.25× input on the cached prefix); savings only accrue on later reads, so `cache_savings` can be negative on a run with little reuse. - **Batch savings inherit the flat 0.5 approximation** — provider invoice is authoritative. - **Resume / response-cache reuse is NOT counted.** A local-cache hit carries no usage object, so its ledger row holds null tokens and contributes zero to both `actual` and `baseline`. The figure therefore covers the prompt-cache and batch discounts only; counting reuse savings (a join back to the original run's tokens) is a planned follow-up. - Reasoning tokens need no special handling — they sit inside `output_tokens` in both `baseline` and `actual`, so they cancel (a cost, not a saving). - `unpriced` models are excluded from the figures and listed separately.