-
Notifications
You must be signed in to change notification settings - Fork 0
Budget and Costs
itemeval refuses to be surprised by a bill: every paid stage is preceded by a projection and a gate, every call's cost is attributed afterward, and a hard cap can never be talked past.
itemeval estimate projects each stage without any model API calls:
- input tokens: a chars/4 heuristic over the actual rendered prompts (judge
inputs use real stored solutions when present, otherwise a placeholder
sized from the generation
max_tokens); - output tokens: the configured
max_tokenscap, or pessimistic defaults when uncapped (4096/call for generation — with a printeduncapped-generationwarning — and 512/call for judges without a cap); - dollars: tokens × the pricing table, with a 0.5 multiplier for batch-eligible conditions.
The estimate covers the full policy-effective grid (resume state is not subtracted) and is a planning number, not an invoice — target accuracy is "within ~2× of actuals".
generate and grade compare their stage's projection against the config:
- projection >
budget.max_usd→ abort, exit 4 — never overridable; - projection ≤
confirm_above_usd→ proceed; -
--yes→ proceed; - interactive terminal → ask
Proceed? [y/N]; - otherwise → exit 3 with "re-run with --yes to confirm".
CI/scripting pattern: set confirm_above_usd to your comfort level, pass
--yes, and set max_usd as the backstop.
| policy | items | batch | use for |
|---|---|---|---|
dev (default) |
first dev_items (2) |
forced off | pipeline validation |
full-interactive |
all | off unless batch: true
|
runs you watch |
full-batch |
all | on (batch: auto) |
large unattended runs, ~50% cheaper |
Batch mode flows through inspect_ai to the provider batch APIs (openai,
anthropic, google, grok, together). The recorded per-row cost applies a flat
0.5 multiplier for batch runs — a documented approximation; provider
invoices are authoritative, and the ledger records the batch flag so rows
can be re-priced.
Lookup precedence:
-
budget.pricing_path(explicit JSON, relative to the config dir); - the user cache —
$ITEMEVAL_PRICING_PATHor~/.cache/itemeval/pricing.json, written by--refresh-pricing; - the packaged seed (a handful of common models, point-in-time estimates — refresh before real runs).
estimate --refresh-pricing merges live per-token prices for every model on
the OpenRouter API over the seed. Models with no price are flagged
unpriced in estimates and carry null usd in stores — the run still works;
only cost attribution is missing.
mockllm/* models are deliberately priced (at claude-sonnet-class rates) so
demos and tests exercise the full dollar pipeline at $0 actual spend.
Per-sample token usage from inspect logs × the pricing table, recorded on
every solutions/gradings row (usd) and aggregated per (run × stage ×
condition × model) in the ledger. Cache-served calls record usd = 0.0.
export verifies ledger totals equal row sums; checking the totals against
your provider dashboards is a manual step (expect the batch approximation
delta).