Problem

The bench-sweep.yml matrix is 5 methods × 5 budgets × 3 test_sets = 75 cells. Each cell currently runs benchmarks/run_final_eval.py from scratch on every benchmark instance — so within a fixed (method, test_set) group the heavy phase (clone, parse, fragment, discover, score) is repeated identically for every budget value, and only the cheap selection phase actually depends on budget.
Same shape as the calibration grid problem solved in #38: redundant heavy work along an axis whose only consumer is the light-phase selector.
Numbers

Today on the self-hosted Hetzner runner (one cell at a time, --workers 22) a full sweep takes ~6 h.
After the same split that #38 (perf(bench): cached pre-compute architecture for calibration grid, 12× speedup) already shipped on the Rust side (compute_scored_state + select_with_params), the 5 budget cells within each (method, test_set) group share one heavy run, so effective work drops from 75 heavy + 75 light to 15 heavy + 75 light for the four diffctx-family methods (PPR, Ego, Hybrid, BM25). Aider is an external repo-map baseline with no cache to share.
Expected wall-time: ~75 min instead of ~6 h on the same hardware.
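As a sanity check on those numbers, a back-of-envelope cost model — the per-cell timings below are made-up figures chosen to match the observed ~6 h, not measurements from the runner:

```python
# Rough cost model for the sweep. Per-cell timings are assumptions
# for illustration only, not measured on the Hetzner runner.
HEAVY_MIN = 4.6   # assumed minutes of heavy phase (clone/parse/fragment/discover/score) per cell
LIGHT_MIN = 0.2   # assumed minutes of light budget selection per cell

def sweep_minutes(heavy_cells: int, light_cells: int) -> float:
    """Wall time if cells run one at a time on a single runner."""
    return heavy_cells * HEAVY_MIN + light_cells * LIGHT_MIN

before = sweep_minutes(75, 75)  # every cell repeats the heavy phase: ~360 min (~6 h)
after = sweep_minutes(15, 75)   # one heavy run per (method, test_set) group: ~84 min
```

Under these assumed constants the split lands in the same ballpark as the ~75 min estimate above; the exact figure depends on how heavy-dominated each cell really is.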
Proposed fix
Phase 1 — run_final_eval budget-sweep mode
Add --cached-budgets B1,B2,B3,... to benchmarks/run_final_eval.py. When set, the runner:
1. Builds the eval-fn once per (instance, method) using compute_scored_state (heavy).
2. Loops over the budget list, calling select_with_params(state, budget=B) per budget (light).
3. Writes one checkpoint row per (instance, budget) into the appropriate per-cell JSONL — same layout as the existing one-budget path, so downstream aggregation is unchanged.
This is the orchestrator inversion that #38 already proved out for the (τ, cbf) axis: the worker function returns a list of (budget, EvalResult) pairs that the parent demuxes into per-cell checkpoints.
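A minimal sketch of that inversion. Only the compute_scored_state / select_with_params names come from #38; the worker/demux functions and the stub bodies (which just count calls) are hypothetical:

```python
import io
import json

# Stand-ins for the compute_scored_state / select_with_params split from
# #38 — counters only, to show the 1-heavy / N-light call pattern.
CALLS = {"heavy": 0, "light": 0}

def compute_scored_state(instance, method):
    CALLS["heavy"] += 1  # clone/parse/fragment/discover/score happens here
    return {"instance": instance, "scores": [3, 1, 2]}

def select_with_params(state, budget):
    CALLS["light"] += 1  # cheap selection against the cached state
    return {"instance": state["instance"], "budget": budget}

def eval_instance_all_budgets(instance, method, budgets):
    """Worker: heavy phase once, then one light selection per budget."""
    state = compute_scored_state(instance, method)
    return [(b, select_with_params(state, budget=b)) for b in budgets]

def demux(results, files):
    """Parent: one checkpoint row per (instance, budget), written to the
    per-cell JSONL for that budget — same layout as the one-budget path."""
    for budget, row in results:
        files[budget].write(json.dumps(row) + "\n")

files = {b: io.StringIO() for b in (8000, 16000, 32000)}
demux(eval_instance_all_budgets("inst-1", "ppr", [8000, 16000, 32000]), files)
# CALLS is now {"heavy": 1, "light": 3}
```

The key property is visible in the counters: three budgets cost one heavy call, and the parent never needs to know the budgets were batched.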
Phase 2 — bench-sweep.yml matrix collapse
Change the matrix from 75 cells (5 methods × 5 budgets × 3 test_sets) to 15 cells (5 methods × 3 test_sets) with the budget list passed inline:
```yaml
strategy:
  matrix:
    method: [ppr, ego, hybrid, bm25, aider]
    test_set: [contextbench_verified, polybench500, swebench_verified]
```

Inside each cell, invoke:

```bash
python -m benchmarks.run_final_eval \
  --baseline "${BASELINE}" \
  --winner /tmp/winner.json \
  --manifests-dir /tmp/manifest_one \
  --cached-budgets -1,0,32000,64000,128000 \
  --workers 22 \
  --out "${CELL_DIR}"
```
Aider stays at one budget per cell since it doesn't share state.
Acceptance criteria
cached-budgets mode produces byte-identical per-cell JSONL contents to the current one-budget-per-cell path on a small benchmark subset.
run_final_eval --baseline diffctx --cached-budgets 8000,16000,32000 finishes in <2× the wall time of the single-budget call (vs the current 3× linear cost).
Existing aggregate_sweep aggregator works unchanged on the new layout.
Full bench-sweep workflow finishes in <90 min on the self-hosted hetzner-48core runner (vs current ~6 h).
Out of scope
Aider caching — orthogonal, external tool.
Per-method state sharing — PPR/Ego/Hybrid all use the same compute_scored_state output but with a different scoring_mode arg upstream; sharing it would require a deeper refactor that splits the discovery-mode dispatch.
Affected files: `benchmarks/run_final_eval.py`, `benchmarks/adapters/runner.py`, `benchmarks/diffctx_eval_fn.py` (already has `pool_eval_all_cells` for the τ-axis case — generalize it to also accept a budget list), `.github/workflows/bench-sweep.yml`.