F-T-245: TBR 4-wide candidate batching (EW flat path) #238
Merged
ms609 merged 4 commits into cpp-search on Mar 28, 2026
In the TBR rerooting inner loop, evaluate 4 regraft candidates
simultaneously instead of one at a time. The 4 independent vroot_cache
row accesses are data-independent within each block iteration, so the
out-of-order CPU can serve them concurrently and hide L2 latency.
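
A minimal sketch of the idea, not the code in ts_fitch.cpp: it assumes a simplified flat layout in which each row holds n_blocks * n_states 64-bit words, and all names (FlatData, extra_steps, extra_steps_x4) are hypothetical. The point is that the four row loads inside one block iteration do not depend on one another, and that the four cutoff comparisons are combined with bitwise AND into a single exit test.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat layout (not the real ts_fitch structures): every row
// stores n_blocks * n_states words, and bit c of word [b * n_states + s]
// is set when character c of block b may take state s.
struct FlatData {
    std::size_t n_blocks;
    std::size_t n_states;
    std::vector<std::uint64_t> active;  // per-block mask of characters in use
};

// Portable population count (number of set bits).
inline int popcount64(std::uint64_t x) {
    int n = 0;
    while (x) { x &= x - 1; ++n; }
    return n;
}

// Scalar reference: one step per active character whose state sets are disjoint.
inline int extra_steps(const std::uint64_t* clip, const std::uint64_t* row,
                       const FlatData& d) {
    assert(d.active.size() == d.n_blocks);
    int steps = 0;
    for (std::size_t b = 0; b < d.n_blocks; ++b) {
        std::uint64_t shared = 0;
        for (std::size_t s = 0; s < d.n_states; ++s)
            shared |= clip[b * d.n_states + s] & row[b * d.n_states + s];
        steps += popcount64(~shared & d.active[b]);
    }
    return steps;
}

// Batched variant: four rows advance through the same block together, so the
// four loads per state are independent of one another.
inline std::array<int, 4> extra_steps_x4(
        const std::uint64_t* clip,
        const std::array<const std::uint64_t*, 4>& rows,
        const FlatData& d, int cutoff) {
    assert(d.active.size() == d.n_blocks);
    std::array<int, 4> steps{{0, 0, 0, 0}};
    for (std::size_t b = 0; b < d.n_blocks; ++b) {
        const std::size_t base = b * d.n_states;
        std::uint64_t shared[4] = {0, 0, 0, 0};
        for (std::size_t s = 0; s < d.n_states; ++s) {
            const std::uint64_t c = clip[base + s];
            for (int k = 0; k < 4; ++k) shared[k] |= c & rows[k][base + s];
        }
        for (int k = 0; k < 4; ++k)
            steps[k] += popcount64(~shared[k] & d.active[b]);
        // Combine the four comparisons with bitwise AND (no short-circuit),
        // leaving a single exit test per block.
        const int all_over = int(steps[0] > cutoff) & int(steps[1] > cutoff) &
                             int(steps[2] > cutoff) & int(steps[3] > cutoff);
        if (all_over) break;  // every partial count already exceeds the cutoff
    }
    return steps;
}
```

In this sketch an early exit returns partial counts, all of them above the cutoff, so the caller must still compare each result against its current best before accepting it.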
Changes:
- ts_fitch.h/cpp: add fitch_indirect_cached_flat_x4() (EW) and
fitch_na_indirect_cached_flat_x4() (NA) — process 4 vroot pointers per
block, exit when all 4 exceed cutoff (bitwise-AND combined test).
- ts_tbr.cpp: compute use_flat flag once per tbr_search call
(weight==1, no upweight_mask — normal EW search, not ratchet).
* SPR loop: use fitch_indirect_bounded_flat /
fitch_na_indirect_bounded_flat when use_flat (fewer CharBlock
struct dereferences).
* TBR rerooting inner loop: when use_flat && !use_iw, replace the
sequential ei loop with a batch-of-4 while loop. Collect up to 4
non-skipped candidates, call x4 batch function, update best from
all 4 results. Scalar fallback for trailing partial batches (< 4)
and for IW / ratchet-upweight paths.
IW and ratchet (upweight_mask) paths are unchanged.
All 28 test-ts-tbr-search + 23 constraint-small tests pass.
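
The batch-of-4 collection loop described above can be sketched as follows. This is an illustration under assumptions, not the ts_tbr.cpp code: BestRegraft, best_regraft and the three callables (skip, score1, score4) are hypothetical stand-ins for the real skip test and the scalar / x4 scoring functions.

```cpp
#include <array>
#include <cassert>

// Hypothetical result type: lowest score seen so far and the edge giving it.
struct BestRegraft {
    int score;
    int edge;
};

// Hypothetical driver. skip(e) says whether edge e is ignored, score1(e, cutoff)
// scores one edge, and score4(edges, cutoff) scores four edges at once.
template <class Skip, class Score1, class Score4>
BestRegraft best_regraft(int n_edges, BestRegraft best,
                         Skip skip, Score1 score1, Score4 score4) {
    assert(n_edges >= 0);
    int ei = 0;
    while (ei < n_edges) {
        // Collect up to four non-skipped candidates.
        std::array<int, 4> batch{{-1, -1, -1, -1}};
        int b_n = 0;
        while (ei < n_edges && b_n < 4) {
            if (!skip(ei)) batch[b_n++] = ei;
            ++ei;
        }
        if (b_n == 4) {
            // Full batch: one call scores all four against the current best.
            const std::array<int, 4> s = score4(batch, best.score);
            for (int k = 0; k < 4; ++k)
                if (s[k] < best.score) best = BestRegraft{s[k], batch[k]};
        } else {
            // Trailing partial batch (< 4): scalar fallback, filled slots only.
            for (int k = 0; k < b_n; ++k) {
                const int s = score1(batch[k], best.score);
                if (s < best.score) best = BestRegraft{s, batch[k]};
            }
        }
    }
    return best;
}
```

Because the collection loop only stops short of four at the end of the edge list, the scalar branch runs at most once per call in this sketch.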
…ge-tree PR cost
The bottleneck in the previous PR implementation was full TBR convergence
on the full-size tree after every prune-reinsert cycle (step 6 in
prune_reinsert_search). At 180 tips this takes ~7s/cycle; with c=5 cycles
that is ~35s of full-tree TBR before the outer-loop TBR runs anyway.
Two new SearchControl() parameters:
  pruneReinsertNni = TRUE     -- use NNI instead of TBR for full-tree polish
                                 (~5x cheaper at >=120 tips; outer-loop TBR
                                 restores full local optimality afterwards)
  pruneReinsertFullMoves = N  -- limit full-tree TBR to N accepted moves
                                 (0 = converge, backward compat default)
Both default to backward-compatible values (NNI=FALSE, fullMoves=0).
The large preset still has pruneReinsertCycles=0; re-enable once
benchmarked with NNI polish.
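
The "N accepted moves, 0 = converge" semantics of pruneReinsertFullMoves could look roughly like this. It is a sketch only: polish() and try_move() are hypothetical names, and the real search loop in prune_reinsert_search may be structured differently.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: try_move() applies one improving move and returns true,
// or returns false when no improving move exists (local optimum).
// max_accepted_moves == 0 means "run to convergence".
inline int polish(const std::function<bool()>& try_move, int max_accepted_moves) {
    assert(max_accepted_moves >= 0);
    int accepted = 0;
    while (max_accepted_moves == 0 || accepted < max_accepted_moves) {
        if (!try_move()) break;  // converged before reaching the cap
        ++accepted;
    }
    return accepted;
}
```

With a cap of 0 the loop only stops at a local optimum, which matches the backward-compatible default described above.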
…eline 5 large-tree datasets (131-206 tips), 3 configs, 2 budgets, 10 seeds = 300 runs. Builds from feature/tbr-batch for pruneReinsertNni parameter.
ms609 added a commit that referenced this pull request, Mar 28, 2026
ms609 added a commit that referenced this pull request, Mar 28, 2026
…atch deleted after PR #238 merge
Agent F. GHA 23690208221 PASSED.
Summary
Restructures the TBR rerooting inner loop to evaluate 4 regraft candidates simultaneously, exploiting memory-level parallelism at large tree sizes (180+ tips).
Changes
- src/ts_fitch.h/cpp: Added fitch_indirect_cached_flat_x4() (EW) and fitch_na_indirect_cached_flat_x4() (NA-aware). Each processes 4 independent vroot_cache rows per block iteration; data-independence lets the CPU's out-of-order engine issue 4 separate L2 load streams concurrently. Early-exit fires when ALL 4 accumulators exceed cutoff (bitwise-AND, no branch).
- src/ts_tbr.cpp: use_flat flag computed once per tbr_search() call (all_weight_one && no upweight_mask).
  * SPR loop: uses fitch_indirect_bounded_flat / fitch_na_indirect_bounded_flat when use_flat.
  * TBR rerooting inner loop: use_flat && !use_iw → batch-of-4 while loop collecting non-skipped candidates. IW, Profile, and ratchet-upweight paths use existing scalar loop.
Correctness
- … candidate < best_candidate check at output time is always authoritative.
- … cutoff_b sentinel; output loop bounded by b_n < 4, not 4, so uninitialized slots are never processed.
- 28 test-ts-tbr-search + 23 test-ts-constraint-small PASS locally; GHA full suite PASS.
Expected benefit
Estimated ~13% overall improvement on large trees (TBR = 86% of wall time at 180 tips). Hamilton benchmark (feature/tbr-batch vs cpp-search, mbank_X30754 + syab07205_206t, 60s/120s, 10 seeds) to follow.
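
For a sense of what the estimate implies for the TBR kernel itself, Amdahl's law can be inverted. This is a back-of-envelope sketch, not a measurement; it assumes "~13% overall improvement" means an overall speedup of 1.13x, which the PR does not state explicitly, and the function names are hypothetical.

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction f of run time gets s times faster.
inline double overall_speedup(double f, double s) {
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

// Inverse: speedup needed on fraction f to reach an overall speedup S.
inline double required_kernel_speedup(double f, double S) {
    const double denom = 1.0 / S - (1.0 - f);
    assert(denom > 0.0);  // otherwise S is unreachable by speeding up f alone
    return f / denom;
}
```

With f = 0.86 and S = 1.13, required_kernel_speedup gives roughly 1.15x for the TBR portion. If the 13% is instead read as a 13% reduction in wall time (S = 1/0.87), the same formula gives roughly 1.18x.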