Cost Savings
itemeval has several built-in ways to spend less on the same study. This page lists each one in plain terms: what it saves, what it costs you in return, when to use it, and what's already on by default. (Deeper detail: Budget and Costs.)
If you change nothing, you already get: never paying twice for completed work,
free re-runs, and smart call scheduling. If your study repeats long text —
the same question asked several times, or one rubric grading many answers —
turn on the two split options below and expect roughly half the bill on
those parts. If your run is large and you don't need results today, switch to
batch for another ~50% off.
Solutions are stored once. Adding another judge model or another rubric later re-grades the stored answers — you never pay for generation again. This is how itemeval works; nothing to configure.
Strictly speaking this is insurance, not a discount: it makes repeated work free rather than making necessary work cheaper. Your results themselves are safe in the study's data files regardless — this is about the API calls. Two layers:
- Resume: re-running a command skips work that's already done — completed calls aren't even attempted after an interruption or crash.
- Call memo: if an identical call is issued again (extending replications, `--force` after a fix, duplicate judge inputs), it's answered from your disk for $0 instead of re-billed. When this happens the run says so (`12 calls answered from local cache ($0) — cache dir: …`), and the run JSON carries `local_cache_rows` / `local_cache_dir`.
In practice this matters because iterating is the workflow — you will
re-run things, and none of it re-bills.
Trade-off: none. Limit: the memo lives on your machine — a
different computer starts fresh. Exception (by design): wave runs
(--wave) turn the memo off — re-observations must be fresh draws, so waves
never replay and always cost full price.
The consequence for growing a study: pilot first, scale later, and the pilot
is never wasted money. itemeval generate cfg.yaml --policy dev runs a few
items without touching the config; re-running at full scope only pays for the
delta, because completed rows resume-skip and identical calls replay from the
local memo at $0. The pilot-available hint points here when the money gate
engages with no completed rows behind it.
Providers charge ~75–90% less for input text they processed moments ago. Two things decide whether you actually get that discount:
a) Call order (budget.cache_schedule — already on).
The discount works like a toll transponder: the first call must register at
full price before the ones behind it get the fast lane. itemeval sends one
warm-up call per group, then the rest together. On OpenAI's own API it also
tags every scheduled request with a stable cache key per study and condition
and asks for 24-hour retention (free) — so your pilot in the morning still
discounts the full run in the afternoon. Nothing to configure.
b) Prompt packaging (solvers.split_prompt / graders.<name>.split_rubric
— off by default, recommended for repeat-heavy studies).
Your prompt has a reusable part (instructions, rubric, problem) and a changing
part (the specific answer). These options send them as two pieces so the
provider can recognize the reusable part. Required for Anthropic models called
through OpenRouter; helpful everywhere. The model sees exactly the same text.
If you run an Anthropic model through OpenRouter without the split option,
itemeval says so up front (the anthropic-openrouter-no-split hint, at
estimate time) and projects full price — that combination verifiably gets no
discount at all.
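
As a rough sketch, turning on both split options could look like the fragment below. The key paths (`solvers.split_prompt`, `graders.<name>.split_rubric`) come from this page; the boolean values and the grader name `my_judge` are assumptions for illustration, so check them against your own config.

```yaml
# Sketch only: key paths are from this page; `true` as the "on" value and
# the grader name `my_judge` are assumptions.
solvers:
  split_prompt: true      # send the reusable part (instructions, problem) separately from the changing part
graders:
  my_judge:               # hypothetical grader name, standing in for <name>
    split_rubric: true    # send the rubric separately from each answer being graded
```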
What we measured (real runs, June 2026):
| Situation | Without | With | Money | Time |
|---|---|---|---|---|
| Ask 2 questions × 5 times each (OpenAI) | $0.038 / 26 s | $0.017 / 39 s | −55% | +13 s |
| Same, Anthropic, with both options on | $0.119 / 40 s | $0.055 / 50 s | −54% | +10 s |
| One judge grading 116 answers (Anthropic) | $0.840 / 72 s | $0.426 / 35 s | −49% | −37 s |
The trade-off in one sentence: on small runs the warm-up call adds a few seconds in exchange for ~half price; on big runs you get both — cheaper and faster, because discounted calls also skip re-reading the long text.
Two limits apply. The discount never covers the model's output: its answers are always full price, so judge-style work (long input, short output) benefits most. And providers only discount reusable parts longer than a minimum (OpenAI ~1,000 tokens; Anthropic ~500–4,000 depending on the model), silently doing nothing below that. itemeval checks this before you spend: if a split layout's shared part estimates below the minimum, the `split-head-below-min` hint fires at estimate time and on the run itself. After the run, check the `cache_read=… hit_rows=…` numbers it prints: zeros on a big run mean the discount isn't engaging (that situation also triggers the `cache-zero-reads` hint).
budget.policy: full-batch sends calls through the provider's batch queue at
about half price. Trade-off: results take minutes to hours, with no live
progress — use it for large runs you'll collect later, never for iterating.
Limit: works with OpenAI/Anthropic/Google/Grok/Together directly; not
through OpenRouter.
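
A minimal sketch of the switch, assuming `policy` nests under `budget` in the usual YAML way, as the dotted name `budget.policy` suggests:

```yaml
# Sketch only: route calls through the provider's batch queue.
# Use with direct provider keys; this page says batch is not available through OpenRouter.
budget:
  policy: full-batch
```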
New configs run on the first 2 items (dev policy) until you scale up;
projected costs above confirm_above_usd ask first; max_usd is a hard stop
nothing can override; and estimate projects the bill with zero API calls.
One thing to know about projections: output is priced at your max_tokens
cap, since nothing else bounds it before the run. A generous cap (say, a
reasoning model with a fat budget producing short answers) over-states the
estimate — never under — so a real bill far below the projection usually
just means your cap is roomy.
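
The safeguards above might be combined in a config like the following. This is a sketch: `confirm_above_usd` and `max_usd` are named on this page, but nesting them under `budget` and the `max_usd` amount are assumptions.

```yaml
# Sketch only: placement under `budget` is assumed, not confirmed by this page.
budget:
  policy: dev             # default policy: run on the first 2 items until you scale up
  confirm_above_usd: 5    # ask before any projected cost above this (the default gate is $5)
  max_usd: 25             # hard stop; illustrative value, there is no cap by default
```

Because projections price output at the `max_tokens` cap, a `max_usd` set from the `estimate` figure leaves headroom rather than cutting a run short.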
Direct provider APIs and OpenRouter both work in the same config; pick per model:
Use the provider's own API (e.g. openai/…, anthropic/…) when…
- you want the discounts above to work reliably (through OpenRouter, requests sometimes land on backends that ignore them);
- you want batch mode (OpenRouter has none);
- you want one clean bill per provider with no marketplace fee.
Use OpenRouter (openrouter/…) when…
- you're comparing many models and want one key and one bill for all of them;
- the model has no direct account you own;
- the run is small enough that discounts don't matter anyway (dev/pilot).
Rule of thumb: pilot wide on OpenRouter, run big and cached on direct keys.
If you do run cached Anthropic models through OpenRouter, pin the upstream — OpenRouter is free to route your calls to hosts (Amazon Bedrock, Google Vertex) that ignore the caching markers, and the only symptom is a silently full-price bill. One config line fixes it (also available per grader):

```yaml
solvers:
  provider_routing: { order: [anthropic], allow_fallbacks: false }
```

The object is passed to OpenRouter verbatim, so anything from OpenRouter's provider-routing docs works. itemeval reminds you when this matters: a cached `openrouter/anthropic/*` run without it gets the `openrouter-unpinned-cache` hint. And you can verify the pin held after any run: the run's manifest records which host actually answered (`endpoints_effective` → `upstream`, e.g. "Anthropic" vs "Amazon Bedrock"), and if the upstream changes between runs of the same model, the next run warns you.

In short: OpenAI, Grok, and Gemini models cache fine through OpenRouter as-is; Anthropic and DeepSeek-style open models need the pin; OpenAI's keyed caching and all batch APIs need a direct key.
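
The per-grader form of the pin is mentioned above but not spelled out. A minimal sketch, assuming the same `provider_routing` key is accepted under a grader entry (the grader name `my_judge` is hypothetical):

```yaml
# Sketch only: per-grader placement is inferred from "also available per grader".
graders:
  my_judge:               # hypothetical grader name
    provider_routing: { order: [anthropic], allow_fallbacks: false }
```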
| Setting | Default | Change it when… |
|---|---|---|
| Free re-runs (cache) | on | almost never |
| Call scheduling (`budget.cache_schedule`) | on (auto) | a tiny latency-critical run (off) |
| Prompt packaging (`split_prompt` / `split_rubric`) | off | your study repeats long text: turn on for ~half price (note: starts fresh conditions, so decide before big runs) |
| Generation prompt caching (`solvers.cache_prompt`) | auto (on when replications > 1) | rarely |
| Batch (`budget.policy`) | off (dev) | large unattended runs → `full-batch` |
| Budget gate / hard cap | $5 ask / no cap | set `max_usd` before every big run |