Budget and Costs

itemeval refuses to be surprised by a bill: every paid stage is preceded by a projection and a gate, every call's cost is attributed afterward, and a hard cap can never be talked past.

Estimation

itemeval estimate projects each stage without any model API calls:

input tokens: a chars/4 heuristic over the actual rendered prompts (judge inputs use real stored solutions when present, otherwise a placeholder sized from the generation max_tokens);
output tokens: the configured max_tokens cap, or pessimistic defaults when uncapped (4096/call for generation — with a printed uncapped-generation warning — and 512/call for judges without a cap);
dollars: tokens × the pricing table, with a 0.5 multiplier for batch-eligible conditions.

The estimate reports two figures per stage. The full projection covers the whole policy-effective grid; the remaining projection subtracts already-complete cells (using the same predicates the runners use to decide what to run), so it is exactly what the next run can spend. On a partially complete study the lines read:

projected generate cost: $4.10 remaining of $11.30 full grid (63% complete)

JSON parity (append-only; usd keeps meaning the full grid): remaining_usd, full_usd, remaining_calls, completed_cells, total_cells, and rows_replaced on each StageEstimate. --force sets remaining = full (everything selected re-runs). Either way the estimate is a planning number, not an invoice — target accuracy is "within ~2× of actuals".

When the planned run would overwrite existing rows (--force, epoch extension, on_empty: rerun re-attempts), the projection block states it as part of the same single confirmation: this run replaces 48 existing rows (…) — never a second prompt.

The gate

generate and grade compare their stage's remaining projection (what this run can actually spend — completed work is never re-paid) against the config:

projection > budget.max_usd → abort, exit 4 — never overridable;
projection ≤ confirm_above_usd → proceed;
--yes → proceed;
interactive terminal (and not --json) → ask Proceed? [y/N];
otherwise → exit 3 with "re-run with --yes to confirm".

Under --json the gate never prompts — it proceeds under threshold or with --yes, otherwise exits 3 after emitting the JSON document (projected cost, gate reason, rerun command, hints).

CI/scripting pattern: set confirm_above_usd to your comfort level, pass --yes, and set max_usd as the backstop.

Policies

policy	items	batch	use for
`dev` (default)	first `dev_items` (2)	forced off	pipeline validation
`full-interactive`	all	off unless `batch: true`	runs you watch
`full-batch`	all	on (`batch: auto`)	large unattended runs, ~50% cheaper

--policy {dev,full-interactive,full-batch} (on estimate/generate/ grade/status) overrides budget.policy for that invocation only — the config is untouched. This makes the pilot flow zero-edit: generate cfg.yaml --policy dev → inspect the export → generate cfg.yaml. Python parity: prepare_study(cfg, policy=...). The effective policy and its source (config | override) are recorded in the run manifest and in the estimate/status JSON (policy, policy_source).

Batch mode flows through inspect_ai to the provider batch APIs (openai, anthropic, google, grok, together). The recorded per-row cost applies a flat 0.5 multiplier for batch runs — a documented approximation; provider invoices are authoritative, and the ledger records the batch flag so rows can be re-priced.

Pricing table

Lookup precedence:

budget.pricing_path (explicit JSON, relative to the config dir);
the user cache — $ITEMEVAL_PRICING_PATH or ~/.cache/itemeval/pricing.json, written by --refresh-pricing;
the packaged seed (a handful of common models, point-in-time estimates — refresh before real runs).

estimate --refresh-pricing merges live per-token prices for every model on the OpenRouter API over the seed. Models with no price are flagged unpriced in estimates and carry null usd in stores — the run still works; only cost attribution is missing.

mockllm/* models are deliberately priced (at claude-sonnet-class rates) so demos and tests exercise the full dollar pipeline at $0 actual spend.

Why OpenRouter as the refresh source

OpenRouter's /models endpoint is the only practical way to keep the long tail fresh from a single call: it lists hundreds of models across providers in one public, keyless response with a uniform prompt/completion per-token schema. Native providers (OpenAI, Anthropic, Google) publish prices on docs pages, not a stable machine-readable API, and would each need their own auth and parser. Crucially, a refresh does not clobber curated native prices: it writes openrouter/<id> keys and only fills a bare native id when the seed lacks it (seed wins for native ids). The refresh is an estimate for planning — the provider invoice is authoritative.

Auto-refresh

Set budget.pricing_max_age_days to refresh the cached table automatically once it ages past the threshold — no manual --refresh-pricing. It is:

opt-in (default None = off), so no command makes a surprise network call and offline/CI runs are unaffected;
best-effort — a failed fetch (offline, API change) keeps the existing table and never breaks a run;
ignored when budget.pricing_path pins an explicit table (you chose it deliberately).

Provenance (knowing which prices you got)

Because prices can come from a pinned file, a months-old seed, or a fresh refresh, every cost-bearing command states where its numbers came from:

pricing: merged (updated 2026-06-08T00:00:00Z, 2d old) — just refreshed from OpenRouter

estimate, generate, grade, export, and status all print this line. Programmatically the same provenance is on Estimate.pricing and ExportResult.pricing (a PricingProvenance: source, updated_at, age_days, refreshed), and PreparedStudy.pricing_refreshed flags whether a live refresh ran during preparation.

Provider prompt caching (how it's priced and reported)

Providers discount input tokens whose prefix they recently processed. Which options engage this, their measured price/time trade-offs, and when to use them live in Cost Savings (cache_schedule, split_prompt, split_rubric, direct API vs OpenRouter). This page covers only the accounting:

Every row records cache_read_tokens / cache_write_tokens; run summaries print per-condition totals and the hit rate (cache_read=… cache_write=… hit_rows=…). Zeros on a run that should cache are the diagnostic signal.
Cache reads are priced at the model's cache-read rate (default 0.1× input); cache writes are priced at 1.25× input for Anthropic-family models (an explicit-caching surcharge) and $0 for providers with free automatic caching (OpenAI, Gemini, DeepSeek). --refresh-pricing pulls per-model cache rates from OpenRouter where published.
inspect's accounting keeps input_tokens exclusive of the cache buckets: true input = input + cache_read + cache_write. The savings report below relies on this.
Cache lifetimes are minutes, so these discounts apply within a run, not across days — cross-run reuse is the local response cache's job (a local-cache hit has null tokens and usd = 0.0, a different signature than a provider cache read).

Where actual costs come from

Per-sample token usage from inspect logs × the pricing table, recorded on every solutions/gradings row (usd) and aggregated per (run × stage × condition × model) in the ledger. Cache-served calls record usd = 0.0. export verifies ledger totals equal row sums; checking the totals against your provider dashboards is a manual step (expect the batch approximation delta).

Savings and per-provider spend

export re-prices the ledger's stored tokens at the current table to report what the package saved versus a plain-API list price, plus a per-provider breakdown (ExportResult.cost, a CostReport):

spend: generate $1.20 | grade $2.92
savings vs list price: $5.68 (58%) — cache $3.10, batch $2.58 (estimated; excludes resume / response-cache reuse)
provider   calls  spend   list_price  saved
anthropic  640    $2.92   $6.30       $3.38
openai     320    $1.20   $3.50       $2.30

Three cost points are computed per ledger row and summed:

point	input pricing	batch	meaning
`baseline`	every input token at the full rate	none	plain API, no caching/batch
`after_cache`	cached tokens at their discounted rates	none	caching on, batch off
`actual`	discounted cache rates	×0.5 if batched	what you paid

The savings decompose exactly: cache_savings = baseline − after_cache, batch_savings = after_cache − actual, and the two sum to total_savings = baseline − actual. This relies on inspect's accounting that input_tokens excludes cached tokens (true input is input + cache_read + cache_write), so the baseline re-adds the cache buckets at the full rate.

Scope and caveats:

Cache writes are a surcharge, not a saving (the first call pays ~1.25× input on the cached prefix); savings only accrue on later reads, so cache_savings can be negative on a run with little reuse.
Batch savings inherit the flat 0.5 approximation — provider invoice is authoritative.
Resume / response-cache reuse is NOT counted. A local-cache hit carries no usage object, so its ledger row holds null tokens and contributes zero to both actual and baseline. The figure therefore covers the prompt-cache and batch discounts only; counting reuse savings (a join back to the original run's tokens) is a planned follow-up.
Reasoning tokens need no special handling — they sit inside output_tokens in both baseline and actual, so they cancel (a cost, not a saving).
unpriced models are excluded from the figures and listed separately.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Budget and Costs

Budget and Costs

Estimation

The gate

Policies

Pricing table

Why OpenRouter as the refresh source

Auto-refresh

Provenance (knowing which prices you got)

Provider prompt caching (how it's priced and reported)

Where actual costs come from

Savings and per-provider spend

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally