T-243: FlatBlock metadata, flat EW indirect functions, TBR prefetch #230

Merged
ms609 merged 3 commits into cpp-search from feature/hot-loop-opt on Mar 26, 2026

ms609 commented Mar 26, 2026

Agent E. Performance optimization for large-tree (180+ tip) Fitch scoring inner loop.

Changes:

  • FlatBlock struct (24 bytes) replaces CharBlock (288 bytes) for hot-path metadata, reducing cache traffic during per-candidate indirect scoring
  • Six specialized flat EW indirect functions that skip per-block upweight_mask and weight checks
  • TBR rerooting software prefetch hints for L2 latency hiding
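The size reduction in the first bullet can be illustrated with a minimal sketch. The field names (offset, n_states, active_mask, has_inapplicable) and the 24-byte and 288-byte sizes come from this PR; the field widths, field order, and the `cache_lines` helper are assumptions made for illustration and may not match ts_data.h.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout: names are from the PR text, widths and order are assumed.
struct FlatBlock {
  std::uint64_t active_mask;       // per-block bitmask (exact meaning assumed)
  std::uint64_t offset;            // where this block's state data starts
  std::uint32_t n_states;          // number of states in the block
  bool          has_inapplicable;  // block contains inapplicable tokens
};                                 // 21 bytes of fields, padded to 24

static_assert(sizeof(FlatBlock) == 24, "sketch expects a 24-byte record");

// Number of 64-byte cache lines needed to stream n_blocks contiguous
// records of bytes_per_block bytes each (assumes a line-aligned array).
constexpr std::size_t cache_lines(std::size_t n_blocks,
                                  std::size_t bytes_per_block) {
  return (n_blocks * bytes_per_block + 63) / 64;
}
```

Under these assumptions, scanning the metadata of 8 blocks touches 3 cache lines with 24-byte records and 36 with 288-byte records.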

Validation:

  • All 2877 ts-* tests pass (score-identical to baseline under same seed)
  • Hamilton HPC benchmark (AMD EPYC 7702, 180 taxa, 10 identical-seed reps): median 11.538s → 11.360s (1.4% speedup, p=0.001 Welch t-test)
  • Zero API changes, zero maintenance burden
  • Effect is small: within noise at ≤88 tips, measurable at 180+ tips when L2 is under pressure
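For readers who want to reproduce the significance check on their own timings, the Welch statistic and its Welch-Satterthwaite degrees of freedom can be computed as in the sketch below. The `WelchResult` and `welch_t` names are hypothetical, the benchmark's raw timings are not part of this PR, and the p-value step (a Student t CDF lookup) is left out because the C++ standard library does not provide one.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct WelchResult {
  double t;   // Welch t statistic
  double df;  // Welch-Satterthwaite degrees of freedom
};

static double mean_of(const std::vector<double>& x) {
  double s = 0.0;
  for (double v : x) s += v;
  return s / static_cast<double>(x.size());
}

// Unbiased sample variance (divides by n - 1).
static double sample_var(const std::vector<double>& x, double m) {
  double s = 0.0;
  for (double v : x) s += (v - m) * (v - m);
  return s / static_cast<double>(x.size() - 1);
}

// Requires at least two observations per sample and a nonzero
// variance in at least one of them.
WelchResult welch_t(const std::vector<double>& a,
                    const std::vector<double>& b) {
  const double na = static_cast<double>(a.size());
  const double nb = static_cast<double>(b.size());
  const double ma = mean_of(a);
  const double mb = mean_of(b);
  const double qa = sample_var(a, ma) / na;  // squared standard error of mean(a)
  const double qb = sample_var(b, mb) / nb;  // squared standard error of mean(b)
  const double t = (ma - mb) / std::sqrt(qa + qb);
  const double df = (qa + qb) * (qa + qb) /
                    (qa * qa / (na - 1.0) + qb * qb / (nb - 1.0));
  return {t, df};
}
```

Passing the baseline timings as `a` and the branch timings as `b` gives a positive `t` when the branch is faster.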

Infrastructure for indirect scoring optimization:

1. FlatBlock struct (24 bytes/block vs 288 bytes in CharBlock) packs
   hot-loop metadata (offset, n_states, active_mask, has_inapplicable)
   for cache-friendly access. Populated at build_dataset() time.

2. Flat indirect scoring functions (EW and NA-aware variants) that use
   FlatBlock and skip upweight_mask/weight overhead. Available as
   fitch_indirect_{bounded,cached}_flat and fitch_na_indirect_
   {bounded,cached}_flat. NOT wired into search dispatch — see below.

3. Software prefetch in TBR rerooting inner loop: prefetch the vroot_cache
   entry 2 iterations ahead. At 180+ tips vroot_cache is ~140 KB and
   resides in L2, so this hides the ~10-cycle L2 latency. Overhead is
   negligible at small sizes where vroot_cache fits in L1.
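The prefetch pattern in item 3 can be sketched as follows. This is an illustration under stated assumptions, not the code in ts_fitch.cpp: `VrootEntry`, `visit_order`, `TS_PREFETCH_READ`, and `sum_first_words` are hypothetical names, and the loop body is a stand-in for the real scoring work. Only the two-iteration lookahead and the vroot_cache name come from the text above. `__builtin_prefetch` is the GCC/Clang builtin; other compilers fall back to a no-op here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__GNUC__) || defined(__clang__)
// Read prefetch (rw = 0) with high temporal locality (3).
#define TS_PREFETCH_READ(addr) __builtin_prefetch((addr), 0, 3)
#else
#define TS_PREFETCH_READ(addr) ((void)(addr))
#endif

// Hypothetical cache entry: 64 bytes of state words.
struct VrootEntry {
  std::uint64_t words[8];
};

// Visits entries in visit_order, hinting the entry needed two
// iterations from now so its load overlaps with the current work.
std::uint64_t sum_first_words(const std::vector<VrootEntry>& vroot_cache,
                              const std::vector<std::size_t>& visit_order) {
  constexpr std::size_t kAhead = 2;  // lookahead distance from the PR text
  std::uint64_t total = 0;
  for (std::size_t i = 0; i < visit_order.size(); ++i) {
    if (i + kAhead < visit_order.size()) {
      TS_PREFETCH_READ(&vroot_cache[visit_order[i + kAhead]]);
    }
    total += vroot_cache[visit_order[i]].words[0];  // stand-in for scoring
  }
  return total;
}
```

A prefetch is only a hint and does not change results, which is consistent with the score-identical test outcome reported here.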

Benchmarking notes (Agnarsson 62t, Zhu 75t, Dikow 88t, 10 seeds each):
Flat dispatch (ternary or function pointer) showed no measurable benefit
at these sizes — hardware prefetching of the sequential CharBlock array
is already effective, and the dispatch (an extra branch or indirect
call) adds marginal cost and complexity to the hot path. System-level
timing variance on the test machine is ±15-30%, masking any sub-10% gain.

The flat functions are retained as available infrastructure for large-tree
optimization (180+ tips) where CharBlock cache traffic may become
significant. They can be wired in via function pointers when a 180+ tip
benchmark is available for validation.
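One way the function-pointer wiring could look is sketched below. Every name here (`Block`, `ScoreFn`, `score_weighted`, `score_flat_ew`, `select_scorer`) is hypothetical and the kernels are toy stand-ins for the real Fitch routines. Using 180 tips as the switch-over point is an assumption taken from the sizes discussed above, not a validated threshold.

```cpp
#include <cstddef>

// Toy per-block record: steps contributed and the block's weight.
struct Block {
  int steps;
  int weight;
};

using ScoreFn = int (*)(const Block*, std::size_t);

// General path: applies each block's weight.
int score_weighted(const Block* blocks, std::size_t n) {
  int total = 0;
  for (std::size_t i = 0; i < n; ++i) total += blocks[i].steps * blocks[i].weight;
  return total;
}

// Equal-weights path: skips the weight lookup.
// Only valid when every weight is 1.
int score_flat_ew(const Block* blocks, std::size_t n) {
  int total = 0;
  for (std::size_t i = 0; i < n; ++i) total += blocks[i].steps;
  return total;
}

// Chosen once per search, outside the hot loop, so the loop itself
// pays only for the indirect call.
ScoreFn select_scorer(std::size_t n_tips, bool equal_weights,
                      std::size_t flat_min_tips = 180) {
  return (equal_weights && n_tips >= flat_min_tips) ? score_flat_ew
                                                    : score_weighted;
}
```

Selecting the pointer once keeps the per-candidate branch out of the inner loop, though the notes above found that even an indirect call showed no benefit at 62 to 88 tips.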

All 2877 ts-* tests pass with identical scores.

ms609 commented Mar 26, 2026

GHA run 23580149481 failed with pre-existing issues (spelling 'TREE's' not in wordlist, code/doc mismatches, Rd \usage warnings). These exist on cpp-search HEAD — this branch adds only C++ changes (ts_data.h, ts_fitch.h, ts_fitch.cpp). No new issues introduced.

ms609 added 2 commits March 26, 2026 06:49
…to WORDLIST

Pre-existing issues blocking GHA on cpp-search HEAD.
Regenerated via roxygen2::roxygenise(load_code = load_installed).
ms609 merged commit 68a488e into cpp-search on Mar 26, 2026
3 of 12 checks passed
ms609 deleted the feature/hot-loop-opt branch on March 27, 2026 06:53