F-T-245: TBR 4-wide candidate batching (EW flat path) by ms609 · Pull Request #238 · ms609/TreeSearch

ms609 · 2026-03-28T17:43:15Z

Agent F. GHA 23690208221 PASSED.

Summary

Restructures the TBR rerooting inner loop to evaluate 4 regraft candidates simultaneously, exploiting memory-level parallelism at large tree sizes (180+ tips).

Changes

src/ts_fitch.h/cpp: Added fitch_indirect_cached_flat_x4() (EW) and fitch_na_indirect_cached_flat_x4() (NA-aware). Each processes 4 independent vroot_cache rows per block iteration; data-independence lets the CPU's out-of-order engine issue 4 separate L2 load streams concurrently. Early-exit fires when ALL 4 accumulators exceed cutoff (bitwise-AND, no branch).
src/ts_tbr.cpp:
- use_flat flag computed once per tbr_search() call (all_weight_one && no upweight_mask).
- SPR loop switched to flat variants when use_flat.
- TBR rerooting inner loop: use_flat && !use_iw → batch-of-4 while loop collecting non-skipped candidates. IW, Profile, and ratchet-upweight paths use existing scalar loop.
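
The 4-wide kernel described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from src/ts_fitch.cpp: the flat layout (`kStates` words per block, one bit per character), the `active` mask, and the names `block_steps`, `score_x1` and `score_x4` are all hypothetical. It shows only the two ideas from the description: four independent rows read per block iteration, and an early exit that combines the four cutoff tests with bitwise AND.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical flat layout: each block holds kStates consecutive 64-bit
// words, one word per state and one bit per character.
constexpr std::size_t kStates = 4;

// Portable population count (number of set bits).
inline int popcount64(std::uint64_t x) {
  int n = 0;
  while (x) { x &= x - 1; ++n; }
  return n;
}

// Steps added in one block: active characters whose state sets are disjoint.
inline int block_steps(const std::uint64_t* clip, const std::uint64_t* row,
                       std::uint64_t active) {
  std::uint64_t shared = 0;
  for (std::size_t s = 0; s < kStates; ++s) shared |= clip[s] & row[s];
  return popcount64(~shared & active);
}

// Scalar reference: one candidate row, exit once the cutoff is exceeded.
inline int score_x1(const std::uint64_t* clip, const std::uint64_t* row,
                    const std::uint64_t* active, std::size_t n_blocks,
                    int cutoff) {
  int acc = 0;
  for (std::size_t b = 0; b < n_blocks; ++b) {
    acc += block_steps(clip + b * kStates, row + b * kStates, active[b]);
    if (acc > cutoff) break;
  }
  return acc;
}

// 4-wide variant: four independent rows per block iteration.
inline std::array<int, 4> score_x4(
    const std::uint64_t* clip,
    const std::array<const std::uint64_t*, 4>& rows,
    const std::uint64_t* active, std::size_t n_blocks, int cutoff) {
  for (const std::uint64_t* r : rows) assert(r != nullptr);
  std::array<int, 4> acc{{0, 0, 0, 0}};
  for (std::size_t b = 0; b < n_blocks; ++b) {
    const std::size_t off = b * kStates;
    // The four loads do not depend on each other, so they can overlap.
    for (std::size_t k = 0; k < 4; ++k)
      acc[k] += block_steps(clip + off, rows[k] + off, active[b]);
    // Bitwise AND of the four comparisons: a single combined exit test.
    const bool all_over = (acc[0] > cutoff) & (acc[1] > cutoff) &
                          (acc[2] > cutoff) & (acc[3] > cutoff);
    if (all_over) break;
  }
  return acc;
}
```

In this sketch a lane that has already exceeded the cutoff keeps accumulating until the other three do too, so its returned value is only guaranteed to be above the cutoff, not to be the full score.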

Correctness

Batch early-exit is a screening heuristic only; the candidate < best_candidate check at output time is always authoritative.
Unused slots in a trailing partial batch are initialized to the cutoff_b sentinel; the output loop is bounded by b_n (which is < 4 for a partial batch), not by 4, so unused slots are never processed.
28 test-ts-tbr-search + 23 test-ts-constraint-small PASS locally; GHA full suite PASS.
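
The correctness points above can be illustrated with a sketch of the batching loop. It rests on assumptions rather than the actual src/ts_tbr.cpp code: `scan_candidates`, `Best`, and the two scorer callbacks are hypothetical, and candidates are reduced to integer edge indices. The sketch shows slots pre-filled with the `cutoff_b` sentinel, an output loop bounded by `b_n`, and the `score < best` comparison as the only place a candidate is accepted.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical result type: edge == -1 means no candidate beat the start score.
struct Best {
  int score;
  int edge;
};

using Score1 = std::function<int(int edge, int cutoff)>;
using Score4 =
    std::function<std::array<int, 4>(const std::array<int, 4>& edges, int cutoff)>;

inline Best scan_candidates(const std::vector<bool>& skip, int best_score,
                            const Score1& score1, const Score4& score4) {
  Best best{best_score, -1};
  const int n = static_cast<int>(skip.size());
  int ei = 0;
  while (ei < n) {
    const int cutoff_b = best.score;
    std::array<int, 4> edge{{-1, -1, -1, -1}};
    // Every slot starts at the sentinel, so an unused slot can never win.
    std::array<int, 4> score{{cutoff_b, cutoff_b, cutoff_b, cutoff_b}};
    int b_n = 0;
    // Collect up to 4 non-skipped candidates.
    while (ei < n && b_n < 4) {
      if (!skip[ei]) edge[b_n++] = ei;
      ++ei;
    }
    assert(b_n <= 4);
    if (b_n == 4) {
      score = score4(edge, cutoff_b);  // full batch: 4-wide kernel
    } else {
      for (int k = 0; k < b_n; ++k)    // partial batch: scalar fallback
        score[k] = score1(edge[k], cutoff_b);
    }
    // Output loop bounded by b_n; this comparison is the authoritative check.
    for (int k = 0; k < b_n; ++k) {
      if (score[k] < best.score) {
        best.score = score[k];
        best.edge = edge[k];
      }
    }
  }
  return best;
}
```

Because acceptance depends only on `score[k] < best.score`, a kernel that stops early and returns some value above the cutoff cannot cause a worse candidate to be accepted.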

Expected benefit

Estimated ~13% overall improvement on large trees (TBR = 86% of wall time at 180 tips). Hamilton benchmark (feature/tbr-batch vs cpp-search, mbank_X30754 + syab07205_206t, 60s/120s, 10 seeds) to follow.
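
As a rough consistency check on the estimate above, the implied saving inside TBR can be worked out with Amdahl-style arithmetic. This assumes that "overall improvement" means a reduction in wall time and that nothing outside TBR changes; the PR does not say how the figure was derived. Here f is the TBR share of wall time and r is the overall reduction:

```latex
\[
f = 0.86,\quad r = 0.13
\;\Longrightarrow\;
\frac{r}{f} = \frac{0.13}{0.86} \approx 0.151,
\qquad
s_{\mathrm{TBR}} = \frac{1}{1 - r/f} \approx 1.18 .
\]
```

Under those assumptions, a ~13% overall gain corresponds to roughly a 15% reduction in TBR time, or about a 1.18x speedup of the TBR phase.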

In the TBR rerooting inner loop, evaluate 4 regraft candidates simultaneously instead of one at a time. The 4 vroot_cache row accesses are data-independent within each block iteration, so the out-of-order CPU can serve them concurrently and hide L2 latency.

Changes:
- ts_fitch.h/cpp: add fitch_indirect_cached_flat_x4() (EW) and fitch_na_indirect_cached_flat_x4() (NA); process 4 vroot pointers per block, exit when all 4 exceed cutoff (bitwise-AND combined test).
- ts_tbr.cpp: compute use_flat flag once per tbr_search call (weight==1, no upweight_mask; normal EW search, not ratchet).
  * SPR loop: use fitch_indirect_bounded_flat / fitch_na_indirect_bounded_flat when use_flat (fewer CharBlock struct dereferences).
  * TBR rerooting inner loop: when use_flat && !use_iw, replace the sequential ei loop with a batch-of-4 while loop. Collect up to 4 non-skipped candidates, call x4 batch function, update best from all 4 results. Scalar fallback for trailing partial batches (< 4) and for IW / ratchet-upweight paths.

IW and ratchet (upweight_mask) paths are unchanged.

All 28 test-ts-tbr-search + 23 constraint-small tests pass.

…ge-tree PR cost

The bottleneck in the previous PR implementation was full TBR convergence on the full-size tree after every prune-reinsert cycle (step 6 in prune_reinsert_search). At 180 tips this takes ~7s/cycle; with c=5 cycles that is ~35s of full-tree TBR before the outer-loop TBR runs anyway.

Two new SearchControl() parameters:
  pruneReinsertNni = TRUE -- use NNI instead of TBR for full-tree polish (~5x cheaper at >=120 tips; outer-loop TBR restores full local optimality afterwards)
  pruneReinsertFullMoves = N -- limit full-tree TBR to N accepted moves (0 = converge, backward compat default)

Both default to backward-compatible values (NNI=FALSE, fullMoves=0). The large preset still has pruneReinsertCycles=0; re-enable once benchmarked with NNI polish.
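
The pruneReinsertFullMoves semantics described above (0 runs to convergence, N > 0 caps the number of accepted moves) can be sketched as below. This is an illustration only: `polish_with_limit` and the `try_move` callback are hypothetical names, not functions from TreeSearch, and the callback stands in for one attempt to find and apply an improving move.

```cpp
#include <cassert>
#include <functional>

// Runs try_move until it reports no improvement (convergence) or until
// max_moves moves have been accepted. max_moves == 0 means "no limit".
// Returns the number of accepted moves.
inline int polish_with_limit(const std::function<bool()>& try_move,
                             int max_moves) {
  assert(max_moves >= 0);
  int accepted = 0;
  while (max_moves == 0 || accepted < max_moves) {
    if (!try_move()) break;  // no improving move left: converged
    ++accepted;
  }
  return accepted;
}
```

With a limit of 0 the loop behaves like the unlimited version, which is what makes 0 a backward-compatible default in this sketch.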

ms609 added 3 commits March 28, 2026 17:14

chore(T-289f): Stage 5 benchmark — PR NNI polish vs TBR polish vs bas…

aa3f16e

…eline 5 large-tree datasets (131-206 tips), 3 configs, 2 budgets, 10 seeds = 300 runs. Builds from feature/tbr-batch for pruneReinsertNni parameter.

ms609 added a commit that referenced this pull request Mar 28, 2026

chore: T-245 status → PR #238; update S-COORD/S-PR notes

d67bed2

chore: agent-e PARKED — T-289f NNI polish done, awaiting GHA + Hamilton

f6318da

ms609 merged commit 7207e0b into cpp-search Mar 28, 2026
6 of 10 checks passed

ms609 deleted the feature/tbr-batch branch March 28, 2026 18:10

ms609 added a commit that referenced this pull request Mar 28, 2026

chore: S-COORD round 45 — PRs #237+#238 merged; agent-G active; T-289…

5f047c9

…f Stage 5 running

ms609 added a commit that referenced this pull request Mar 28, 2026

fix(T-289f): update Hamilton script to use cpp-search — feature/tbr-b…

2784432

…atch deleted after PR #238 merge

ms609 merged 4 commits into cpp-search from feature/tbr-batch
