# Tutorial 5 — Scale up without surprises

**Use case:** "The pipeline works at `dev` scope. Now I want the full run —
hundreds or thousands of paid calls — without discovering the bill afterward."

itemeval's budget layer exists so that scaling up is a config change, not a
leap of faith: estimate first, gate on a threshold, cap hard, batch for ~50%
off, and resume anything that breaks. You will take a validated study (any of
Tutorials 1–4; examples below use Tutorial 3's `compare.yaml`) to full scope.

## Step 1 — Refresh pricing and re-estimate at full scope

Switch the policy from `dev` to a full run and set the guardrails:

```yaml
budget:
  policy: full-batch        # all items, batch APIs on (~50% cheaper)
  confirm_above_usd: 5      # ask before anything projected above $5
  max_usd: 25               # hard cap: abort if projection exceeds this. Never overridable.
```

Then:

```bash
itemeval estimate compare.yaml --refresh-pricing
```

`--refresh-pricing` pulls current per-token prices (from the OpenRouter
catalog) into a local cache, so projections use today's prices — do this
before any sizeable run. Read the estimate's per-condition breakdown and its
warnings; in particular, an `uncapped-generation` warning means you forgot
`max_tokens` and the estimator had to assume a pessimistic default. The
estimate always projects the **full** grid (it doesn't subtract completed
work) — a deliberate, conservative planning number targeted to be within ~2×
of actuals.
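
To see why a missing `max_tokens` inflates the projection, here is a rough sketch of worst-case cost arithmetic. It is illustrative only and not itemeval's estimator: the `Price` class, the `project_usd` function, the per-million-token prices, and the 8192-token fallback are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Hypothetical per-million-token prices in USD."""
    input_per_mtok: float
    output_per_mtok: float


def project_usd(n_calls: int, prompt_tokens: int, max_tokens: int,
                price: Price, batch: bool = False) -> float:
    """Worst-case projection: every call is assumed to use its full output cap."""
    per_call = (prompt_tokens * price.input_per_mtok
                + max_tokens * price.output_per_mtok) / 1_000_000
    total = n_calls * per_call
    # Flat 0.5x batch approximation, mirroring the one described for batch rows.
    return total * 0.5 if batch else total


price = Price(input_per_mtok=3.0, output_per_mtok=15.0)  # made-up prices
capped = project_usd(1000, prompt_tokens=800, max_tokens=512, price=price)
# Assumed pessimistic fallback when no cap is configured (hypothetical value).
uncapped = project_usd(1000, prompt_tokens=800, max_tokens=8192, price=price)
print(f"capped: ${capped:.2f}  uncapped: ${uncapped:.2f}")
```

With these made-up numbers the uncapped projection is more than ten times the capped one, which is the kind of gap an `uncapped-generation` warning is pointing at.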

To keep prices fresh automatically instead, set
`budget.pricing_max_age_days: 7`. Either way, every cost-bearing command
prints which pricing table it used.

## Step 2 — Understand the three policies and the gate

| policy | items | batch APIs | use for |
|--------|-------|-----------|---------|
| `dev` (default) | first `dev_items` | forced off | pipeline validation |
| `full-interactive` | all | off unless `batch: true` | runs you watch |
| `full-batch` | all | on (`batch: auto`) | large unattended runs |

When you run `generate` or `grade`, the projected cost is checked against the
gate in this order:

1. projection > `max_usd` → **abort** (exit 4). `--yes` does *not* override.
2. projection ≤ `confirm_above_usd` → proceed.
3. otherwise → interactive `Proceed? [y/N]`, which `--yes` answers for you;
   without `--yes` and without a TTY, exit 3.

The scripting/CI pattern: set `confirm_above_usd` to your comfort level, pass
`--yes`, and let `max_usd` be the backstop that no flag can talk past.
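
The gate order can be written down as a small decision function. This is a minimal sketch of the rules as described here, not itemeval's source: the `Gate` enum and `gate` function are hypothetical names, and it assumes `--yes` stands in for the confirmation prompt even when there is no TTY.

```python
from enum import Enum


class Gate(Enum):
    PROCEED = "proceed"
    PROMPT = "ask Proceed? [y/N]"
    ABORT_OVER_CAP = "abort, exit 4"
    ABORT_NO_TTY = "abort, exit 3"


def gate(projection_usd: float, confirm_above_usd: float, max_usd: float,
         yes: bool = False, has_tty: bool = True) -> Gate:
    """Apply the three checks in the documented order."""
    # 1. The hard cap comes first, so no flag can get past it.
    if projection_usd > max_usd:
        return Gate.ABORT_OVER_CAP
    # 2. Cheap enough to run without asking.
    if projection_usd <= confirm_above_usd:
        return Gate.PROCEED
    # 3. In between: --yes answers the prompt (assumption); otherwise a TTY is needed.
    if yes:
        return Gate.PROCEED
    return Gate.PROMPT if has_tty else Gate.ABORT_NO_TTY


# Using the guardrails from Step 1 (confirm above $5, cap at $25):
print(gate(12.0, 5, 25, yes=True))   # Gate.PROCEED
print(gate(30.0, 5, 25, yes=True))   # Gate.ABORT_OVER_CAP
```

Checking the cap before anything else is what makes `max_usd` a true backstop: the `--yes` branch is only reachable once the projection is known to be under it.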

## Step 3 — Run it (batched, unattended-safe)

```bash
itemeval generate compare.yaml --yes
itemeval grade    compare.yaml --yes
```

Under `full-batch`, eligible providers (OpenAI, Anthropic, Google, Grok,
Together) receive the calls through their batch APIs at roughly half price —
slower, but built for exactly this. Judge grading additionally benefits from
provider prompt caching: the rubric + problem prefix repeats across solutions,
so repeated prefixes are served from cache where the provider supports it.

**Interruptions are a non-event.** After Ctrl-C, a crash, a rate-limit storm,
or an expired session, re-run the same command. The stores are keyed, so
completed work is skipped, errored samples are retried, and inspect_ai's local
response cache means already-paid calls are never paid twice (re-served rows
record `usd = 0.0`). Check progress any time with:

```bash
itemeval status compare.yaml    # done/expected per condition, errors, spend so far
```
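
To make the "keyed store" idea concrete, here is a minimal sketch of how a resume pass could sort expected work into skip, retry, and run. The key shape `(condition, item_id, sample_index)`, the row fields, and `plan_resume` are hypothetical; itemeval's real store layout may differ.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Key = Tuple[str, str, int]  # hypothetical: (condition, item_id, sample_index)


def plan_resume(expected: Iterable[Key],
                store: Dict[Key, dict]) -> Tuple[List[Key], List[Key], List[Key]]:
    """Split expected keys into (skip, retry, run) based on what the store holds."""
    skip: List[Key] = []
    retry: List[Key] = []
    run: List[Key] = []
    for key in expected:
        row: Optional[dict] = store.get(key)
        if row is None:
            run.append(key)      # never attempted
        elif row.get("error") is not None:
            retry.append(key)    # attempted, but errored
        else:
            skip.append(key)     # completed, nothing to pay for again
    return skip, retry, run


expected = [("gpt", "item-1", 0), ("gpt", "item-2", 0), ("gpt", "item-3", 0)]
store = {
    ("gpt", "item-1", 0): {"text": "answer", "error": None},
    ("gpt", "item-2", 0): {"text": None, "error": "rate_limit"},
}
skip, retry, run = plan_resume(expected, store)
print(len(skip), len(retry), len(run))
```

Because the decision depends only on the key and the stored row, re-running the same command any number of times converges on the same finished set.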

If a few samples keep erroring, the run still completes (exit 1, failures
reported per condition) — `status` and the stores tell you exactly which rows
are missing; see [Error Handling](Error-Handling.md).
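
For unattended runs, a thin wrapper can re-run a stage while it reports failures and stop on the exit codes that need a human. This is a sketch under assumptions: `run_with_retries` is a hypothetical helper, exit code 0 is assumed to mean success, and any other nonzero code is treated as retryable. The demo uses a fake runner, so nothing is actually executed.

```python
import subprocess
from typing import Callable, Optional, Sequence, Tuple

# Exit codes named in this tutorial; 0 as "success" is an assumption.
OK, COMPLETED_WITH_FAILURES, NEEDS_CONFIRMATION, OVER_CAP = 0, 1, 3, 4


def run_with_retries(cmd: Sequence[str], max_attempts: int = 3,
                     runner: Optional[Callable[[Sequence[str]], int]] = None
                     ) -> Tuple[int, int]:
    """Re-run cmd until it settles; return (last_exit_code, attempts_used)."""
    if runner is None:
        # Default: really run the command and read its exit code.
        runner = lambda c: subprocess.run(list(c)).returncode
    code = OK
    for attempt in range(1, max_attempts + 1):
        code = runner(cmd)
        # Success is final; exits 3 and 4 need a config or flag change, not a retry.
        if code in (OK, NEEDS_CONFIRMATION, OVER_CAP):
            return code, attempt
        # Anything else (exit 1, crashes): re-run, completed work is skipped.
    return code, max_attempts


# Demo with a fake runner: two runs with failures, then a clean one.
fake_codes = iter([1, 1, 0])
result = run_with_retries(["itemeval", "generate", "compare.yaml", "--yes"],
                          runner=lambda c: next(fake_codes))
print(result)  # (0, 3)
```

Capping the attempts matters: samples that fail for a persistent reason will keep returning exit 1, and at that point `status` is more useful than another retry.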

## Step 4 — Audit what you actually spent

```bash
itemeval export compare.yaml
```

Beyond the data tables, `export` settles the books:

```
spend: generate $1.20 | grade $2.92
savings vs list price: $5.68 (58%) — cache $3.10, batch $2.58 (estimated)
provider   calls  spend   list_price  saved
anthropic  640    $2.92   $6.30       $3.38
openai     320    $1.20   $3.50       $2.30
```

- Per-sample costs are recorded on every row (`gen_usd`, `grade_usd`); the
  per-run **ledger** aggregates them by stage × condition × model, and export
  verifies ledger totals equal row sums.
- The **savings report** re-prices your tokens at the plain-API list price and
  splits what the package saved into prompt-cache and batch components.
- Batch rows use a documented flat 0.5× approximation — the **provider invoice
  is authoritative**; the ledger records the `batch` flag so rows can be
  re-priced.
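
The summary line and the ledger check are simple arithmetic, sketched below with the figures from the sample report above. The function names and the row and ledger field names (`stage`, `condition`, `model`, `usd`) are assumptions for illustration, not itemeval's export code; `Decimal` is used so cents add up exactly.

```python
from collections import defaultdict
from decimal import Decimal

# Per-provider figures copied from the sample report.
providers = [
    {"provider": "anthropic", "spend": Decimal("2.92"), "list_price": Decimal("6.30")},
    {"provider": "openai", "spend": Decimal("1.20"), "list_price": Decimal("3.50")},
]


def savings_summary(rows):
    """Return (spend, list_price, saved, percent_saved) across all providers."""
    spend = sum((r["spend"] for r in rows), Decimal("0"))
    list_price = sum((r["list_price"] for r in rows), Decimal("0"))
    saved = list_price - spend
    pct = (saved / list_price * 100).quantize(Decimal("1")) if list_price else Decimal("0")
    return spend, list_price, saved, pct


def ledger_matches_rows(sample_rows, ledger):
    """Check that summed per-row costs equal the ledger, per stage x condition x model."""
    totals = defaultdict(Decimal)
    for r in sample_rows:
        totals[(r["stage"], r["condition"], r["model"])] += r["usd"]
    return dict(totals) == ledger


spend, list_price, saved, pct = savings_summary(providers)
print(f"spend ${spend}, list ${list_price}, saved ${saved} ({pct}%)")

# Hypothetical rows and ledger entries for the reconciliation check.
sample_rows = [
    {"stage": "generate", "condition": "a", "model": "m1", "usd": Decimal("0.40")},
    {"stage": "generate", "condition": "a", "model": "m1", "usd": Decimal("0.80")},
    {"stage": "grade", "condition": "a", "model": "m2", "usd": Decimal("2.92")},
]
ledger = {("generate", "a", "m1"): Decimal("1.20"),
          ("grade", "a", "m2"): Decimal("2.92")}
print(ledger_matches_rows(sample_rows, ledger))
```

The same totals can be used to compare the report against the provider invoice, which remains the authoritative figure for batch rows.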

## Step 5 — Pilot-then-scale, the recommended lifecycle

The pattern the package is built around:

1. **dev**: `policy: dev`, mock or cheap models, 2–10 items. Validate
   mapping, prompts, rubric, parsing. Cost: ≈ $0.
2. **pilot**: real models, `dev_items` set between 20 and 50,
   `full-interactive` if you want to watch. Sanity-check estimate-vs-actual,
   per-item output quality, and judge parse rates.
3. **full**: `full-batch`, `--yes`, `max_usd` set. Walk away, run `status`
   when curious, and `export` when done.

Each step reuses the previous step's completed work where conditions overlap —
nothing you validated is re-paid (note that `dev` runs the *first N* items, so
its items are a subset of the full run).
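
The pilot's estimate-vs-actual check can be reduced to two small helpers. This is a sketch under assumptions: the function names are hypothetical, the 2× tolerance echoes the estimator's stated target, and the linear extrapolation assumes full-run items cost about the same as pilot items.

```python
def within_factor(estimated_usd: float, actual_usd: float, factor: float = 2.0) -> bool:
    """True if the estimate is within `factor` of the actual spend, in either direction."""
    if actual_usd <= 0 or factor < 1:
        raise ValueError("actual_usd must be positive and factor must be >= 1")
    ratio = estimated_usd / actual_usd
    return 1 / factor <= ratio <= factor


def extrapolate_usd(pilot_actual_usd: float, pilot_items: int, full_items: int) -> float:
    """Scale pilot spend linearly to the full item count (a rough assumption)."""
    if pilot_items <= 0:
        raise ValueError("pilot_items must be positive")
    return pilot_actual_usd * full_items / pilot_items


# Made-up pilot: estimated $3.00, actually spent $2.00 on 30 items.
print(within_factor(3.0, 2.0))          # True
print(extrapolate_usd(2.0, 30, 600))    # 40.0
```

If the extrapolated figure lands well above `max_usd`, it is cheaper to find that out here than at the gate of the full run.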

## Gotchas at scale

- **Reasoning models with tight `max_tokens`** can spend the entire token
  allowance on hidden reasoning and return empty text. That is not an error,
  so it is not retried by default. itemeval surfaces these in `status`'s
  `empty` column, and `solvers.on_empty: rerun` makes them re-attempt after
  you raise the cap ([Configuration](Configuration.md)).
- **Unpriced models** (not in the pricing table) run fine but carry null `usd`
  — `estimate` flags them up front; refresh pricing or supply
  `budget.pricing_path`.
- **Don't run two commands concurrently** against the same study directory;
  stages are designed to run serially.