-
Notifications
You must be signed in to change notification settings - Fork 0
Budget and Costs
itemeval refuses to be surprised by a bill: every paid stage is preceded by a projection and a gate, every call's cost is attributed afterward, and a hard cap can never be talked past.
itemeval estimate projects each stage without any model API calls:
- input tokens: a chars/4 heuristic over the actual rendered prompts (judge
inputs use real stored solutions when present, otherwise a placeholder
sized from the generation
max_tokens); - output tokens: the configured
max_tokenscap, or pessimistic defaults when uncapped (4096/call for generation — with a printeduncapped-generationwarning — and 512/call for judges without a cap); - dollars: tokens × the pricing table, with a 0.5 multiplier for batch-eligible conditions.
The estimate reports two figures per stage. The full projection covers the whole policy-effective grid; the remaining projection subtracts already-complete cells (using the same predicates the runners use to decide what to run), so it is exactly what the next run can spend. On a partially complete study the lines read:
projected generate cost: $4.10 remaining of $11.30 full grid (63% complete)
JSON parity (append-only; usd keeps meaning the full grid):
remaining_usd, full_usd, remaining_calls, completed_cells,
total_cells, and rows_replaced on each StageEstimate. --force sets
remaining = full (everything selected re-runs). Either way the estimate is a
planning number, not an invoice — target accuracy is "within ~2× of actuals".
When the planned run would overwrite existing rows (--force, epoch
extension, on_empty: rerun re-attempts), the projection block states it as
part of the same single confirmation:
this run replaces 48 existing rows (…) — never a second prompt.
generate and grade compare their stage's remaining projection (what
this run can actually spend — completed work is never re-paid) against the
config:
- projection >
budget.max_usd→ abort, exit 4 — never overridable; - projection ≤
confirm_above_usd→ proceed; -
--yes→ proceed; - interactive terminal (and not
--json) → askProceed? [y/N]; - otherwise → exit 3 with "re-run with --yes to confirm".
Under --json the gate never prompts — it proceeds under threshold or with
--yes, otherwise exits 3 after emitting the JSON document (projected cost,
gate reason, rerun command, hints).
CI/scripting pattern: set confirm_above_usd to your comfort level, pass
--yes, and set max_usd as the backstop.
| policy | items | batch | use for |
|---|---|---|---|
dev (default) |
first dev_items (2) |
forced off | pipeline validation |
full-interactive |
all | off unless batch: true
|
runs you watch |
full-batch |
all | on (batch: auto) |
large unattended runs, ~50% cheaper |
--policy {dev,full-interactive,full-batch} (on estimate/generate/
grade/status) overrides budget.policy for that invocation only — the
config is untouched. This makes the pilot flow zero-edit:
generate cfg.yaml --policy dev → inspect the export → generate cfg.yaml.
Python parity: prepare_study(cfg, policy=...). The effective policy and its
source (config | override) are recorded in the run manifest and in the
estimate/status JSON (policy, policy_source).
Batch mode flows through inspect_ai to the provider batch APIs (openai,
anthropic, google, grok, together). The recorded per-row cost applies a flat
0.5 multiplier for batch runs — a documented approximation; provider
invoices are authoritative, and the ledger records the batch flag so rows
can be re-priced.
Lookup precedence:
-
budget.pricing_path(explicit JSON, relative to the config dir); - the user cache —
$ITEMEVAL_PRICING_PATHor~/.cache/itemeval/pricing.json, written by--refresh-pricing; - the packaged seed (a handful of common models, point-in-time estimates — refresh before real runs).
estimate --refresh-pricing merges live per-token prices for every model on
the OpenRouter API over the seed. Models with no price are flagged
unpriced in estimates and carry null usd in stores — the run still works;
only cost attribution is missing.
mockllm/* models are deliberately priced (at claude-sonnet-class rates) so
demos and tests exercise the full dollar pipeline at $0 actual spend.
OpenRouter's /models endpoint is the only practical way to keep the long
tail fresh from a single call: it lists hundreds of models across providers in
one public, keyless response with a uniform prompt/completion
per-token schema. Native providers (OpenAI, Anthropic, Google) publish prices
on docs pages, not a stable machine-readable API, and would each need their own
auth and parser. Crucially, a refresh does not clobber curated native
prices: it writes openrouter/<id> keys and only fills a bare native id when
the seed lacks it (seed wins for native ids). The refresh is an estimate for
planning — the provider invoice is authoritative.
Set budget.pricing_max_age_days to refresh the cached table automatically
once it ages past the threshold — no manual --refresh-pricing. It is:
-
opt-in (default
None= off), so no command makes a surprise network call and offline/CI runs are unaffected; - best-effort — a failed fetch (offline, API change) keeps the existing table and never breaks a run;
-
ignored when
budget.pricing_pathpins an explicit table (you chose it deliberately).
Because prices can come from a pinned file, a months-old seed, or a fresh refresh, every cost-bearing command states where its numbers came from:
pricing: merged (updated 2026-06-08T00:00:00Z, 2d old) — just refreshed from OpenRouter
estimate, generate, grade, export, and status all print this line.
Programmatically the same provenance is on Estimate.pricing and
ExportResult.pricing (a PricingProvenance: source, updated_at,
age_days, refreshed), and PreparedStudy.pricing_refreshed flags whether a
live refresh ran during preparation.
Providers discount input tokens whose prefix they recently processed.
Which options engage this, their measured price/time trade-offs, and when
to use them live in Cost Savings (cache_schedule,
split_prompt, split_rubric, direct API vs OpenRouter). This page covers
only the accounting:
- Every row records
cache_read_tokens/cache_write_tokens; run summaries print per-condition totals and the hit rate (cache_read=… cache_write=… hit_rows=…). Zeros on a run that should cache are the diagnostic signal. - Cache reads are priced at the model's cache-read rate (default 0.1× input);
cache writes are priced at 1.25× input for Anthropic-family models (an
explicit-caching surcharge) and $0 for providers with free automatic
caching (OpenAI, Gemini, DeepSeek).
--refresh-pricingpulls per-model cache rates from OpenRouter where published. - inspect's accounting keeps
input_tokensexclusive of the cache buckets: true input =input + cache_read + cache_write. The savings report below relies on this. - Cache lifetimes are minutes, so these discounts apply within a run, not
across days — cross-run reuse is the local response cache's job (a
local-cache hit has null tokens and
usd = 0.0, a different signature than a provider cache read).
Per-sample token usage from inspect logs × the pricing table, recorded on
every solutions/gradings row (usd) and aggregated per (run × stage ×
condition × model) in the ledger. Cache-served calls record usd = 0.0.
export verifies ledger totals equal row sums; checking the totals against
your provider dashboards is a manual step (expect the batch approximation
delta).
export re-prices the ledger's stored tokens at the current table to report
what the package saved versus a plain-API list price, plus a per-provider
breakdown (ExportResult.cost, a CostReport):
spend: generate $1.20 | grade $2.92
savings vs list price: $5.68 (58%) — cache $3.10, batch $2.58 (estimated; excludes resume / response-cache reuse)
provider calls spend list_price saved
anthropic 640 $2.92 $6.30 $3.38
openai 320 $1.20 $3.50 $2.30
Three cost points are computed per ledger row and summed:
| point | input pricing | batch | meaning |
|---|---|---|---|
baseline |
every input token at the full rate | none | plain API, no caching/batch |
after_cache |
cached tokens at their discounted rates | none | caching on, batch off |
actual |
discounted cache rates | ×0.5 if batched | what you paid |
The savings decompose exactly: cache_savings = baseline − after_cache,
batch_savings = after_cache − actual, and the two sum to
total_savings = baseline − actual. This relies on inspect's accounting that
input_tokens excludes cached tokens (true input is
input + cache_read + cache_write), so the baseline re-adds the cache buckets
at the full rate.
Scope and caveats:
-
Cache writes are a surcharge, not a saving (the first call pays ~1.25×
input on the cached prefix); savings only accrue on later reads, so
cache_savingscan be negative on a run with little reuse. - Batch savings inherit the flat 0.5 approximation — provider invoice is authoritative.
-
Resume / response-cache reuse is NOT counted. A local-cache hit carries
no usage object, so its ledger row holds null tokens and contributes zero to
both
actualandbaseline. The figure therefore covers the prompt-cache and batch discounts only; counting reuse savings (a join back to the original run's tokens) is a planned follow-up. - Reasoning tokens need no special handling — they sit inside
output_tokensin bothbaselineandactual, so they cancel (a cost, not a saving). -
unpricedmodels are excluded from the figures and listed separately.