
perf(bench): cached pre-compute for bench-sweep budget axis (5x speedup) #39

@nikolay-e

Description

Problem

bench-sweep.yml matrix is 5 methods × 5 budgets × 3 test_sets = 75 cells. Each cell currently runs benchmarks/run_final_eval.py from scratch on every benchmark instance — so within a fixed (method, test_set) group the heavy phase (clone, parse, fragment, discover, score) is repeated identically for every budget value, and only the cheap selection phase actually depends on budget.

Same shape as the calibration grid problem solved in #38: redundant heavy work along an axis whose only consumer is the light-phase selector.

Numbers

  • 75 cells × ~500 instances/test_set ≈ 37 500 evals per sweep.
  • Today on the self-hosted Hetzner runner (one cell at a time, --workers 22) a sweep is ~6 h.
  • With the same split that #38 (perf(bench): cached pre-compute architecture for calibration grid, 12× speedup) already shipped on the Rust side (compute_scored_state + select_with_params), the 5 budget cells within each (method, test_set) group share one heavy run. For the four diffctx-family methods (PPR, Ego, Hybrid, BM25), effective work drops from 75 heavy + 75 light evals to 15 heavy + 75 light. Aider is an external repo-map baseline with no cached state to share.
  • Expected wall-time: ~75 min instead of ~6 h on the same hardware.

Proposed fix

Phase 1 — run_final_eval budget-sweep mode

Add --cached-budgets B1,B2,B3,... to benchmarks/run_final_eval.py. When set, the runner:

  1. Builds the eval-fn once per (instance, method) using compute_scored_state (heavy).
  2. Loops the budget list, calling select_with_params(state, budget=B) per budget (light).
  3. Writes one checkpoint row per (instance, budget) into the appropriate per-cell JSONL — same layout as the existing one-budget path so downstream aggregation is unchanged.
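
The two-phase loop might look like the sketch below. compute_scored_state and select_with_params are the real names from #38, but their signatures and the toy stand-in bodies here are assumptions for illustration only:

```python
# Toy stand-ins for the Rust-backed functions from #38 (assumed signatures).
def compute_scored_state(instance, method):
    # Heavy phase: clone, parse, fragment, discover, score -- runs once.
    return {"instance": instance, "method": method, "scores": [3, 1, 2]}

def select_with_params(state, budget):
    # Light phase: selection under the budget (-1 treated as uncapped, an assumption).
    k = len(state["scores"]) if budget < 0 else min(budget, len(state["scores"]))
    return sorted(state["scores"], reverse=True)[:k]

def eval_instance_cached(instance, method, budgets):
    """One heavy run shared across the whole budget list."""
    state = compute_scored_state(instance, method)                  # heavy, once
    return [(b, select_with_params(state, budget=b)) for b in budgets]

print(eval_instance_cached("repo#1", "ppr", [-1, 1, 2]))
```

The key property is that the heavy call appears outside the budget loop, so adding budgets costs only light-phase time.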

This is the orchestrator inversion that #38 already proved out for the (τ, cbf) axis: the worker function returns a list of (budget, EvalResult) pairs that the parent demuxes into per-cell checkpoints.
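
A minimal sketch of that demux step, assuming one JSONL file per budget cell (the file naming and row shape here are illustrative, not the actual checkpoint layout):

```python
import json
import os

def demux_checkpoints(instance_id, budget_results, out_dir):
    """Append one checkpoint row per (instance, budget) to its per-cell JSONL."""
    os.makedirs(out_dir, exist_ok=True)
    for budget, result in budget_results:
        # Hypothetical per-cell file naming; the real layout mirrors the
        # existing one-budget path so aggregation stays unchanged.
        path = os.path.join(out_dir, f"budget_{budget}.jsonl")
        with open(path, "a") as f:
            f.write(json.dumps({"instance": instance_id,
                                "budget": budget,
                                "result": result}) + "\n")
```

Because the parent appends each (budget, EvalResult) pair to its own file, downstream consumers still see one JSONL per cell.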

Phase 2 — bench-sweep.yml matrix collapse

Change the matrix from 75 cells (5 methods × 5 budgets × 3 test_sets) to 15 cells (5 methods × 3 test_sets) with the budget list passed inline:

```yaml
strategy:
  matrix:
    method: [ppr, ego, hybrid, bm25, aider]
    test_set: [contextbench_verified, polybench500, swebench_verified]
```

Inside each cell, invoke:

```bash
python -m benchmarks.run_final_eval \
  --baseline "${BASELINE}" \
  --winner /tmp/winner.json \
  --manifests-dir /tmp/manifest_one \
  --cached-budgets -1,0,32000,64000,128000 \
  --workers 22 \
  --out "${CELL_DIR}"
```

Aider stays at one budget per cell since it doesn't share state.

Acceptance criteria

  • --cached-budgets mode produces byte-identical per-cell JSONL contents to the current one-budget-per-cell path on a small benchmark subset.
  • run_final_eval --baseline diffctx --cached-budgets 8000,16000,32000 finishes in <2× the wall time of the single-budget call (vs the current 3× linear cost).
  • Existing aggregate_sweep aggregator works unchanged on the new layout.
  • Full bench-sweep workflow finishes in <90 min on the self-hosted hetzner-48core runner (vs current ~6 h).
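
The byte-identical criterion above is easy to script; a hashing helper like this (the helper itself is illustrative, not existing tooling) can compare per-cell outputs of both modes:

```python
import hashlib

def file_digest(path):
    """SHA-256 of a file's raw bytes, for the byte-identical JSONL check."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# usage (paths are placeholders):
# assert file_digest("single_budget/cell.jsonl") == file_digest("cached_budgets/cell.jsonl")
```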

Out of scope

  • Aider caching — orthogonal, external tool.
  • Per-method state sharing (PPR/Ego/Hybrid all consume the same compute_scored_state output with different scoring_mode arguments upstream; sharing it would require a deeper refactor that splits the discovery-mode dispatch).
