F-T-245: TBR 4-wide candidate batching (EW flat path) #238

Merged
ms609 merged 4 commits into cpp-search from feature/tbr-batch
Mar 28, 2026

@ms609 (Owner) commented Mar 28, 2026

Agent F. GHA 23690208221 PASSED.

Summary

Restructures the TBR rerooting inner loop to evaluate 4 regraft candidates simultaneously, exploiting memory-level parallelism at large tree sizes (180+ tips).

Changes

  • src/ts_fitch.h/cpp: Added fitch_indirect_cached_flat_x4() (EW) and fitch_na_indirect_cached_flat_x4() (NA-aware). Each processes 4 independent vroot_cache rows per block iteration; data-independence lets the CPU's out-of-order engine issue 4 separate L2 load streams concurrently. Early-exit fires when ALL 4 accumulators exceed cutoff (bitwise-AND, no branch).

  • src/ts_tbr.cpp:

    • use_flat flag computed once per tbr_search() call (all_weight_one && no upweight_mask).
    • SPR loop switched to flat variants when use_flat.
    • TBR rerooting inner loop: use_flat && !use_iw → batch-of-4 while loop collecting non-skipped candidates. IW, Profile, and ratchet-upweight paths use existing scalar loop.
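To illustrate the idea behind the x4 kernels, here is a minimal sketch of scoring four candidate rows against one clipped-subtree state set. It is not the code in src/ts_fitch.cpp: the `FlatStates` layout, `popcount64`, and `score_x4_sketch` are hypothetical names invented for this example, and treating "exceed cutoff" as a strict `>` is an assumption. The sketch shows the two properties described above: the four row loads in each block do not depend on each other, and the early exit combines the four comparisons with a non-short-circuit AND.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat layout (not the project's actual structs):
// word (b, s) sits at index b * n_states + s, and bit c of that word is set
// when character c of block b admits state s.
struct FlatStates {
  std::size_t n_blocks = 0;
  std::size_t n_states = 0;
  std::vector<std::uint64_t> words;   // n_blocks * n_states entries
  std::vector<std::uint64_t> active;  // mask of in-use characters, per block
};

// Portable population count (Kernighan's loop).
inline int popcount64(std::uint64_t x) {
  int n = 0;
  while (x) { x &= x - 1; ++n; }
  return n;
}

// Scores four candidate rows against one clipped-subtree state set.
// Each rows[k] must use the same layout as clip.words.
inline std::array<int, 4> score_x4_sketch(
    const FlatStates& clip,
    const std::array<const std::uint64_t*, 4>& rows,
    int cutoff) {
  assert(clip.words.size() == clip.n_blocks * clip.n_states);
  assert(clip.active.size() == clip.n_blocks);
  std::array<int, 4> steps{{0, 0, 0, 0}};
  for (std::size_t b = 0; b < clip.n_blocks; ++b) {
    const std::size_t base = b * clip.n_states;
    std::uint64_t hit0 = 0, hit1 = 0, hit2 = 0, hit3 = 0;
    for (std::size_t s = 0; s < clip.n_states; ++s) {
      const std::uint64_t c = clip.words[base + s];
      // Four loads with no dependency on one another.
      hit0 |= c & rows[0][base + s];
      hit1 |= c & rows[1][base + s];
      hit2 |= c & rows[2][base + s];
      hit3 |= c & rows[3][base + s];
    }
    // A character costs one step when no state is shared.
    const std::uint64_t act = clip.active[b];
    steps[0] += popcount64(~hit0 & act);
    steps[1] += popcount64(~hit1 & act);
    steps[2] += popcount64(~hit2 & act);
    steps[3] += popcount64(~hit3 & act);
    // Non-short-circuit AND: one combined test instead of four branches.
    const int all_over = int(steps[0] > cutoff) & int(steps[1] > cutoff) &
                         int(steps[2] > cutoff) & int(steps[3] > cutoff);
    if (all_over) break;
  }
  return steps;
}
```

Because the loop stops only when all four accumulators are over the cutoff, a returned value above the cutoff may be a partial count. The caller therefore still compares each result against the current best.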

Correctness

  • Batch early-exit is a screening heuristic only; the candidate < best_candidate check at output time is always authoritative.
  • Trailing partial-batch slots are initialized to the cutoff_b sentinel, and the output loop is bounded by b_n (which is < 4 for a partial batch) rather than the constant 4, so unfilled slots are never processed.
  • 28 test-ts-tbr-search + 23 test-ts-constraint-small PASS locally; GHA full suite PASS.
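The correctness points above can be sketched as a batch-collection loop. This is a simplified illustration, not the loop in src/ts_tbr.cpp: `Best`, `Score4`, and `scan_edges_batched` are hypothetical names, and the batch scorer is passed in as a callback that is assumed to write only the first `b_n` result slots. Result slots start at the `cutoff_b` sentinel, the output loop runs to `b_n`, and a candidate is accepted only if it is strictly below the current best.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

struct Best {
  int edge = -1;
  int score = std::numeric_limits<int>::max();
};

// Hypothetical batch scorer: (ids, b_n, cutoff, out); writes out[k] for k < b_n.
using Score4 = std::function<void(const std::array<int, 4>&, int, int,
                                  std::array<int, 4>&)>;

inline Best scan_edges_batched(const std::vector<int>& edges,
                               const std::function<bool(int)>& skip,
                               const Score4& score4,
                               int best_so_far) {
  Best best;
  best.score = best_so_far;
  std::size_t ei = 0;
  while (ei < edges.size()) {
    // Collect up to four non-skipped candidates.
    std::array<int, 4> ids{{-1, -1, -1, -1}};
    int b_n = 0;
    while (ei < edges.size() && b_n < 4) {
      const int e = edges[ei++];
      if (!skip(e)) ids[b_n++] = e;
    }
    if (b_n == 0) break;  // only skipped edges remained
    assert(b_n >= 1 && b_n <= 4);

    // Every result slot starts at the sentinel, so an unfilled slot
    // can never look like an improvement.
    const int cutoff_b = best.score;
    std::array<int, 4> out;
    out.fill(cutoff_b);
    score4(ids, b_n, cutoff_b, out);

    // Bounded by b_n, not 4; the strict comparison is authoritative.
    for (int k = 0; k < b_n; ++k) {
      if (out[k] < best.score) {
        best.score = out[k];
        best.edge = ids[k];
      }
    }
  }
  return best;
}
```

In this sketch a trailing partial batch goes through the same scorer with `b_n < 4`. A scalar fallback for that case would change only the scoring call, not the sentinel or the output loop.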

Expected benefit

Estimated ~13% overall improvement on large trees (TBR = 86% of wall time at 180 tips). Hamilton benchmark (feature/tbr-batch vs cpp-search, mbank_X30754 + syab07205_206t, 60s/120s, 10 seeds) to follow.
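As a rough check on what the estimate implies for the TBR phase itself, the arithmetic can be written out. This assumes that "~13% overall improvement" means a 13% cut in total wall time and that the non-TBR share of the run is unchanged. Both are assumptions, and `implied_phase_speedup` is a hypothetical helper written for this example.

```cpp
#include <cassert>

// If a phase takes `phase_fraction` of the original wall time and the total
// drops by `total_time_cut` (both as fractions of the original total), with
// everything outside the phase unchanged, the phase alone sped up by this factor.
inline double implied_phase_speedup(double phase_fraction,
                                    double total_time_cut) {
  assert(phase_fraction > 0.0 && phase_fraction <= 1.0);
  assert(total_time_cut >= 0.0 && total_time_cut < phase_fraction);
  const double phase_after = phase_fraction - total_time_cut;
  return phase_fraction / phase_after;
}
```

With `phase_fraction = 0.86` and `total_time_cut = 0.13`, this gives 0.86 / 0.73, or roughly a 1.18x speedup of the TBR phase under those assumptions.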

ms609 added 3 commits March 28, 2026 17:14
In the TBR rerooting inner loop, evaluate 4 regraft candidates
simultaneously instead of one at a time.  The 4 independent vroot_cache
row accesses are data-independent within each block iteration, so the
out-of-order CPU can serve them concurrently and hide L2 latency.

Changes:
- ts_fitch.h/cpp: add fitch_indirect_cached_flat_x4() (EW) and
  fitch_na_indirect_cached_flat_x4() (NA) — process 4 vroot pointers per
  block, exit when all 4 exceed cutoff (bitwise-AND combined test).
- ts_tbr.cpp: compute use_flat flag once per tbr_search call
  (weight==1, no upweight_mask — normal EW search, not ratchet).
  * SPR loop: use fitch_indirect_bounded_flat /
    fitch_na_indirect_bounded_flat when use_flat (fewer CharBlock
    struct dereferences).
  * TBR rerooting inner loop: when use_flat && !use_iw, replace the
    sequential ei loop with a batch-of-4 while loop.  Collect up to 4
    non-skipped candidates, call x4 batch function, update best from
    all 4 results.  Scalar fallback for trailing partial batches (< 4)
    and for IW / ratchet-upweight paths.

IW and ratchet (upweight_mask) paths are unchanged.
All 28 test-ts-tbr-search + 23 constraint-small tests pass.
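The path selection described in this commit message can be summarised in a short sketch. The struct and function names are hypothetical and the Profile path is not modelled. Only the conditions themselves (weight of 1 throughout, no upweight mask, and `use_flat && !use_iw` for the batch path) come from the text above.

```cpp
#include <cassert>

// Hypothetical bundle of the conditions named in the commit message.
struct SearchFlags {
  bool all_weight_one;     // every character has weight 1 (EW search)
  bool has_upweight_mask;  // ratchet perturbation is active
  bool use_iw;             // implied-weights scoring
};

enum class RerootLoop { BatchOf4, Scalar };

// Computed once per search call rather than per candidate.
constexpr bool compute_use_flat(const SearchFlags& f) {
  return f.all_weight_one && !f.has_upweight_mask;
}

constexpr RerootLoop choose_reroot_loop(const SearchFlags& f) {
  return (compute_use_flat(f) && !f.use_iw) ? RerootLoop::BatchOf4
                                            : RerootLoop::Scalar;
}

// Usage: a plain EW search takes the batch path; ratchet and IW do not.
inline void check_dispatch_examples() {
  assert(choose_reroot_loop({true, false, false}) == RerootLoop::BatchOf4);
  assert(choose_reroot_loop({true, true, false}) == RerootLoop::Scalar);
  assert(choose_reroot_loop({true, false, true}) == RerootLoop::Scalar);
}
```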
…ge-tree PR cost

The bottleneck in the previous PR implementation was full TBR convergence
on the full-size tree after every prune-reinsert cycle (step 6 in
prune_reinsert_search). At 180 tips this takes ~7s/cycle; with c=5 cycles
that is ~35s of full-tree TBR before the outer-loop TBR runs anyway.

Two new SearchControl() parameters:
  pruneReinsertNni = TRUE  -- use NNI instead of TBR for full-tree polish
                              (~5x cheaper at >=120 tips; outer-loop TBR
                              restores full local optimality afterwards)
  pruneReinsertFullMoves = N -- limit full-tree TBR to N accepted moves
                                (0 = converge, backward compat default)

Both default to backward-compatible values (NNI=FALSE, fullMoves=0).
The large preset still has pruneReinsertCycles=0; re-enable once
benchmarked with NNI polish.
…eline

5 large-tree datasets (131-206 tips), 3 configs, 2 budgets, 10 seeds = 300 runs.
Builds from feature/tbr-batch for pruneReinsertNni parameter.
@ms609 ms609 merged commit 7207e0b into cpp-search Mar 28, 2026
6 of 10 checks passed
@ms609 ms609 deleted the feature/tbr-batch branch March 28, 2026 18:10
ms609 added a commit that referenced this pull request Mar 28, 2026
ms609 added a commit that referenced this pull request Mar 28, 2026