Problem

The bench-sweep.yml matrix is 5 methods × 5 budgets × 3 test_sets = 75 cells. Each cell currently runs benchmarks/run_final_eval.py from scratch on every benchmark instance — so within a fixed (method, test_set) group the heavy phase (clone, parse, fragment, discover, score) is repeated identically for every budget value, and only the cheap selection phase actually depends on budget.
Same shape as the calibration grid problem solved in #38: redundant heavy work along an axis whose only consumer is the light-phase selector.
Numbers

Today on the self-hosted Hetzner runner (one cell at a time, --workers 22) a full sweep takes ~6 h.
After the same split that #38 (perf(bench): cached pre-compute architecture for calibration grid, 12× speedup) already shipped on the Rust side (compute_scored_state + select_with_params), the 5 budget cells within each (method, test_set) group share one heavy run, so effective work drops from 75 heavy + 75 light to 15 heavy + 75 light for the four diffctx-family methods (PPR, Ego, Hybrid, BM25). Aider is an external repo-map baseline with no cache to share.
Expected wall-time: ~75 min instead of ~6 h on the same hardware.
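As a sanity check on those numbers, a back-of-envelope cost model — the per-cell timings below are made-up figures chosen to match the observed ~6 h, not measurements from the runner:

```python
# Rough cost model for the sweep. Per-cell timings are assumptions
# for illustration only, not measured on the Hetzner runner.
HEAVY_MIN = 4.6   # assumed minutes of heavy phase (clone/parse/fragment/discover/score) per cell
LIGHT_MIN = 0.2   # assumed minutes of light budget selection per cell

def sweep_minutes(heavy_cells: int, light_cells: int) -> float:
    """Wall time if cells run one at a time on a single runner."""
    return heavy_cells * HEAVY_MIN + light_cells * LIGHT_MIN

before = sweep_minutes(75, 75)  # every cell repeats the heavy phase: ~360 min (~6 h)
after = sweep_minutes(15, 75)   # one heavy run per (method, test_set) group: ~84 min
```

Under these assumed constants the split lands in the same ballpark as the ~75 min estimate above; the exact figure depends on how heavy-dominated each cell really is.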
Proposed fix
Phase 1 — run_final_eval budget-sweep mode
Add --cached-budgets B1,B2,B3,... to benchmarks/run_final_eval.py. When set, the runner:
1. Builds the eval-fn once per (instance, method) using compute_scored_state (heavy).
2. Loops over the budget list, calling select_with_params(state, budget=B) per budget (light).
3. Writes one checkpoint row per (instance, budget) into the appropriate per-cell JSONL — same layout as the existing one-budget path, so downstream aggregation is unchanged.
This is the orchestrator inversion that #38 already proved out for the (τ, cbf) axis: the worker function returns a list of (budget, EvalResult) pairs that the parent demuxes into per-cell checkpoints.
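A minimal sketch of that inversion. Only the compute_scored_state / select_with_params names come from #38; the worker/demux functions and the stub bodies (which just count calls) are hypothetical:

```python
import io
import json

# Stand-ins for the compute_scored_state / select_with_params split from
# #38 — counters only, to show the 1-heavy / N-light call pattern.
CALLS = {"heavy": 0, "light": 0}

def compute_scored_state(instance, method):
    CALLS["heavy"] += 1  # clone/parse/fragment/discover/score happens here
    return {"instance": instance, "scores": [3, 1, 2]}

def select_with_params(state, budget):
    CALLS["light"] += 1  # cheap selection against the cached state
    return {"instance": state["instance"], "budget": budget}

def eval_instance_all_budgets(instance, method, budgets):
    """Worker: heavy phase once, then one light selection per budget."""
    state = compute_scored_state(instance, method)
    return [(b, select_with_params(state, budget=b)) for b in budgets]

def demux(results, files):
    """Parent: one checkpoint row per (instance, budget), written to the
    per-cell JSONL for that budget — same layout as the one-budget path."""
    for budget, row in results:
        files[budget].write(json.dumps(row) + "\n")

files = {b: io.StringIO() for b in (8000, 16000, 32000)}
demux(eval_instance_all_budgets("inst-1", "ppr", [8000, 16000, 32000]), files)
# CALLS is now {"heavy": 1, "light": 3}
```

The key property is visible in the counters: three budgets cost one heavy call, and the parent never needs to know the budgets were batched.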
Phase 2 — bench-sweep.yml matrix collapse
Change the matrix from 75 cells (5 methods × 5 budgets × 3 test_sets) to 15 cells (5 methods × 3 test_sets) with the budget list passed inline:
```yaml
strategy:
  matrix:
    method: [ppr, ego, hybrid, bm25, aider]
    test_set: [contextbench_verified, polybench500, swebench_verified]
```

Inside each cell, invoke:

```bash
python -m benchmarks.run_final_eval \
  --baseline "${BASELINE}" \
  --winner /tmp/winner.json \
  --manifests-dir /tmp/manifest_one \
  --cached-budgets -1,0,32000,64000,128000 \
  --workers 22 \
  --out "${CELL_DIR}"
```
Aider stays at one budget per cell since it doesn't share state.
Acceptance criteria
cached-budgets mode produces byte-identical per-cell JSONL contents to the current one-budget-per-cell path on a small benchmark subset.
run_final_eval --baseline diffctx --cached-budgets 8000,16000,32000 finishes in <2× the wall time of the single-budget call (vs the current 3× linear cost).
Existing aggregate_sweep aggregator works unchanged on the new layout.
Full bench-sweep workflow finishes in <90 min on the self-hosted hetzner-48core runner (vs current ~6 h).
Out of scope
Aider caching — orthogonal, external tool.
Per-method state sharing — PPR/Ego/Hybrid all use the same compute_scored_state output but with a different scoring_mode arg upstream; sharing it would require a deeper refactor that splits the discovery-mode dispatch.
Affected files: `benchmarks/run_final_eval.py`, `benchmarks/adapters/runner.py`, `benchmarks/diffctx_eval_fn.py` (already has `pool_eval_all_cells` for the τ-axis case — generalize it to also accept a budget list), `.github/workflows/bench-sweep.yml`.