perf: MaddisonSlatkin optimizations + native C++ MC Fitch scorer #211
ms609 merged 32 commits into cpp-search from
Conversation
… compare

- LSEAccumulator: long double → double, expl/logl → std::exp/std::log
- log_prod_sum_4: remove redundant long double casts
- computeLogRD (both sites): long double lc → double lc
- All std::isfinite(x) → (x > NEG_INF) throughout MaddisonSlatkin.cpp (FixedProbList NaN sentinel correctly excluded: NaN > NEG_INF == false)
- log_prod_sum_4: now returns plain double sum

Measured speedup (matched calls, same RNG seed):
- k=3/n=20-24: 1.44x (30%)
- k=4/n=14-18: 1.28x (22%)
- k=5/n=9-12: 1.36x (26%)

Tests: 37 MaddisonSlatkin + 82 pp-multistate + 41 ts-profile pass, 0 fail.
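The `(x > NEG_INF)` replacement relies on IEEE comparison semantics. A minimal sketch of the invariant (`NEG_INF` named as in the commit; the helper name is illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Why (x > NEG_INF) can replace std::isfinite(x) for log-probabilities:
// the only non-finite values in these arrays are -inf (log of a zero
// probability) and the FixedProbList NaN sentinel. Every comparison with
// NaN is false, so the NaN sentinel is excluded exactly like -inf is.
// (Unlike std::isfinite this would admit +inf, but a log-probability
// can never be +inf.)
constexpr double NEG_INF = -std::numeric_limits<double>::infinity();

inline bool has_mass(double log_p) {
  return log_p > NEG_INF;  // false for -inf AND for NaN
}
```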
…k=5→13)

Worst-case timings with double/NEG_INF optimisations on reference machine:
- k=3 n=27: 1.03 s (was n=25 ~0.60 s under old comment's ~5 s budget)
- k=4 n=19: 2.00 s (was n=18 ~1.27 s)
- k=5 n=13: 2.58 s (was n=12 ~1.62 s)

Also updates doc-string tip counts and comment wording. All 82 pp-multistate tests pass.
Per draw pair in SolverT::LogPVec, the noStep/yesStep pair loops called LogB(pr.a, drawn) once per pair (N_no+N_yes times), each doing an unordered_map::find on drawn. Since drawn is constant across all pairs in a given draw pair, all but the first find are redundant.

Fix: prime the logB_cache for drawn/undrawn via a single LogPVec call before the loops, then fetch FixedProbList* twice (one find per side) and replace LogB(pr.a, drawn) with (*fpl_d)[pr.a] — a direct indexed array access. For n<=2 base cases (fpl is null), the code falls through to LogB.

Measured speedup vs T-152+T-153 baseline (reference machine, worst-case):
- k=3 n=25: 0.60s -> 0.47s (1.28x)
- k=3 n=27: 1.03s -> 0.97s (1.06x)
- k=4 n=18: 1.27s -> 1.07s (1.19x)
- k=4 n=19: 2.00s -> 1.67s (1.20x)
- k=5 n=12: 1.62s -> 1.46s (1.11x)
- k=5 n=13: 2.58s -> 2.10s (1.23x)
Average: ~15% faster

All 37 MaddisonSlatkin + 82 pp-multistate tests pass.
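The hoisted-lookup pattern can be sketched in miniature (a simplified model: `FixedProbList` is stood in by `std::vector<double>`, and all other names are illustrative, not the PR's code):

```cpp
#include <cassert>
#include <cmath>
#include <unordered_map>
#include <vector>

// Simplified model of the fix: rather than one hash lookup per pair,
// fetch the cached probability list once per draw, then use a direct
// indexed access per pair.
using FixedProbList = std::vector<double>;

static int find_count = 0;  // instrumentation: how many hash lookups ran

const FixedProbList* Lookup(
    const std::unordered_map<int, FixedProbList>& cache, int drawn) {
  ++find_count;
  auto it = cache.find(drawn);
  return it == cache.end() ? nullptr : &it->second;
}

// Before the fix: one Lookup per pair. After: one Lookup per draw,
// then (*fpl)[a] — a plain array index — inside the hot loop.
double SumOverPairs(const std::unordered_map<int, FixedProbList>& cache,
                    int drawn, const std::vector<int>& pair_a) {
  const FixedProbList* fpl = Lookup(cache, drawn);  // primed once
  double total = 0.0;
  for (int a : pair_a) total += (*fpl)[a];          // no find in the loop
  return total;
}
```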
Replace std::unordered_map (node-based, 2x pointer-chase per lookup) with
OAFlatMap: inline open-addressing map with separate probe layer (uint64_t
hashes_ + uint32_t indices_) and contiguous entries_ vector.
logB_cache: OAFlatMap<StateKeyT<N>, FixedProbList>
- Probe layer (hashes_ + indices_) fits in L2 cache; eliminates extra
heap-node dereference per lookup.
- LogB() rewritten to not hold slot reference across recursion (safe for
any non-stable map; also fixes latent UB in prior code).
logPVec_cache: OAFlatMap<LogPVecKey, uint32_t> + std::deque<vector<double>>
- Index map uses same probe-layer design (fast OA lookup for key -> idx).
- Values live in pv_store (deque): push_back does not invalidate existing
element references, preserving caller-held const vector<double>& safety.
All 37 MaddisonSlatkin + 82 pp-multistate tests pass.
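The probe-layer idea can be sketched in miniature. This is an illustrative reimplementation, not the PR's OAFlatMap: it uses a fixed power-of-two capacity and never rehashes, and it omits the deque-backed value store described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// hashes_ + indices_ form the small, cache-friendly open-addressing probe
// layer; entries_ stores key/value payloads contiguously. A lookup probes
// the compact arrays and touches entries_ at most once per candidate hit.
template <typename K, typename V>
class OAFlatMapSketch {
  std::vector<std::uint64_t> hashes_;   // 0 marks an empty slot
  std::vector<std::uint32_t> indices_;  // slot -> position in entries_
  std::vector<std::pair<K, V>> entries_;
  std::size_t mask_;

  static std::uint64_t HashOf(const K& key) {
    std::uint64_t h = std::hash<K>{}(key);
    return h ? h : 1;  // reserve 0 as the "empty" marker
  }

 public:
  explicit OAFlatMapSketch(std::size_t capacity = 64)  // power of two
      : hashes_(capacity, 0), indices_(capacity, 0), mask_(capacity - 1) {}

  V* find(const K& key) {
    const std::uint64_t h = HashOf(key);
    for (std::size_t i = h & mask_;; i = (i + 1) & mask_) {
      if (hashes_[i] == 0) return nullptr;       // hit an empty slot: absent
      if (hashes_[i] == h && entries_[indices_[i]].first == key)
        return &entries_[indices_[i]].second;    // one contiguous access
    }
  }

  void insert(const K& key, const V& value) {
    const std::uint64_t h = HashOf(key);
    for (std::size_t i = h & mask_;; i = (i + 1) & mask_) {
      if (hashes_[i] == 0) {                     // claim an empty slot
        hashes_[i] = h;
        indices_[i] = static_cast<std::uint32_t>(entries_.size());
        entries_.emplace_back(key, value);
        return;
      }
      if (hashes_[i] == h && entries_[indices_[i]].first == key) {
        entries_[indices_[i]].second = value;    // overwrite existing
        return;
      }
    }
  }
};
```

Note that entries_ may reallocate on insert, which is exactly why the commit message stresses not holding a slot reference across recursion, and why logPVec_cache keeps its values in a std::deque instead.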
Replace blunt (k, n) feasibility thresholds with a partition-shape-aware metric: the split_count (coefficient of x^floor(n/2) in the generating polynomial). This allows skewed multi-state characters to be computed exactly even at large n, while still blocking balanced partitions that would blow up.

- Add .MSSplitCount() and .MS_SC_THRESHOLD in data_manipulation.R
- Replace maxTips gate in PrepareDataProfile and pp_info_extra_step
- Shrink 5-state test from n=11 (blowup) to n=10 (0.6s)
- Add 18 split_count boundary + base-case tests (55 total, 5.6s)
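Under the reading that the generating polynomial is the product over state-class sizes a_i of (1 + x + … + x^{a_i}), the split_count can be computed by polynomial convolution. A sketch of that reading (an assumption from the commit text, not the package's .MSSplitCount()); it reproduces the sc values quoted in the later threshold-fix commit, (13,12,10) → 118 and (13,13,12) → 140:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// split_count: the coefficient of x^{floor(n/2)} in
//   prod_i (1 + x + ... + x^{a_i}),
// i.e. the number of shape-level ways to split the n tips into two halves.
// Balanced partitions maximise it; skewed ones keep it small.
std::uint64_t SplitCount(const std::vector<int>& partition) {
  int n = 0;
  for (int a : partition) n += a;
  std::vector<std::uint64_t> coef{1};  // coef[j] = coefficient of x^j
  for (int a : partition) {
    // Multiply the running polynomial by (1 + x + ... + x^a).
    std::vector<std::uint64_t> next(coef.size() + a, 0);
    for (std::size_t j = 0; j < coef.size(); ++j)
      for (int d = 0; d <= a; ++d)
        next[j + d] += coef[j];
    coef = next;
  }
  return coef[n / 2];
}
```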
The infeasible test cases used partitions from the old maxTips gate (n > 25 for k=3). The split_count gate (f738091) replaced maxTips but the test partitions weren't updated — (13,12,10) has sc=118, below the k=3 threshold of 136, causing the exact solver to run indefinitely.

Changed the infeasible k=3 partition from (13,12,10) n=35 to (13,13,12) n=38 (sc=140 > 136). Also fixed stale comments and suppressed expected fallback warnings.
…r reuse

Binary (2-token) characters: O(1) Carter et al. (1990) closed-form shortcut when both states are pure (no ambiguous tips). Eliminates the exponential recursive algorithm entirely for this common case. Benchmark: k=2 n=50 drops from 0.22s to <0.001s (>100x).

Multi-state (3-5 token) characters:
- FastLSETables: 4096-entry precomputed lookup tables for exp(-d) and log1p(exp(-d)) with linear interpolation. Used in LSEAccumulator::add() to replace std::exp() calls.
- lse_update(): shared inline helper for log-sum-exp accumulation, replacing ~6 copies of the same pattern in LogPVec/runAll.
- logconv(): changed from allocating+returning std::vector to writing into a caller-provided buffer. Eliminates an allocation per convolution call.
- Reusable noStepVec/yesStepVec/conv_buf buffers in LogPVec, allocated once per call, std::fill'd per draw pair.

Multi-state speedup: ~5-19% for k=3 boundary cases (within noise for k=4/k=5 due to recursion depth dominating).

Also adds logDoubleFact, logN_carter, logCarter1_cpp helpers for the Carter formula (C++ implementation of the R LogCarter1).
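The shared accumulation helper presumably follows the standard stable log-sum-exp update; a minimal sketch (the signature is illustrative, and this version calls std::exp directly rather than the FastLSETables lookup described above):

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// acc = log(exp(acc) + exp(x)), computed stably via the max trick:
// pull the larger term out so the remaining exp argument is <= 0.
inline void lse_update(double& acc, double x) {
  if (x == -INFINITY) return;              // adding zero probability mass
  if (acc == -INFINITY) { acc = x; return; }
  if (x > acc) std::swap(acc, x);          // ensure acc >= x
  acc += std::log1p(std::exp(x - acc));    // exp argument <= 0: no overflow
}
```

The FastLSETables optimisation replaces the std::exp/std::log1p pair here with interpolated table lookups on the non-negative difference acc - x.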
The original threshold calibration (136/100/100) used sequential state indices (1,2,3,...) instead of bitmask positions (1,2,4,8,16). This resulted in testing a different (easier) combinatorial problem, yielding dangerously high thresholds that allowed the exact solver to run indefinitely on balanced partitions within the supposed safe range.

Recalibrated with correct encoding (states at 2^(i-1) positions):
- k=3: 136 → 75 (n=27 balanced 0.97s safe, n=31 balanced 1.32s marginal)
- k=4: 100 → 50 (n=13 balanced 0.36s safe, n=15 balanced 0.94s marginal)
- k=5: 100 → 35 (n=9 balanced 0.22s safe, n=10 balanced 0.49s)

Updated boundary tests and docstrings to match.
Add a 2-second wall-clock budget to SolverT. Every LogB and LogPVec entry checks std::chrono::steady_clock; if the budget is exceeded, budget_exceeded is set and all loops bail out immediately. runAll() returns NA_REAL with an Rcpp::warning so downstream code can detect the fallback.

The guard prevents infinite hangs even if a partition slips past the split_count feasibility gate (e.g. through future threshold changes). Chrono overhead is ~20 ns per call — negligible versus per-call work.

Verified:
- k=3 n=35 balanced (known blowup): returns NA in 2.02 s
- 137 existing tests pass (55 MS + 82 pp-multistate, 9.2 s)
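The guard pattern can be sketched as follows (names illustrative; in the PR the flag lives inside SolverT, the budget is 2 s, and runAll() maps an exceeded budget to NA_REAL plus an Rcpp::warning):

```cpp
#include <cassert>
#include <chrono>

// Wall-clock budget guard: checked at every LogB/LogPVec entry. Once the
// flag flips it stays set, so every loop level can bail out immediately.
struct BudgetGuard {
  std::chrono::steady_clock::time_point start;
  std::chrono::steady_clock::duration budget;
  bool exceeded = false;

  explicit BudgetGuard(std::chrono::milliseconds b)
      : start(std::chrono::steady_clock::now()), budget(b) {}

  bool CheckExceeded() {
    if (!exceeded && std::chrono::steady_clock::now() - start > budget)
      exceeded = true;  // sticky: never resets within one solve
    return exceeded;
  }
};
```

steady_clock is the right choice here: unlike system_clock it is monotonic, so the budget cannot be distorted by wall-time adjustments mid-solve.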
…profiles

Replace binary reduction with a log-quadratic MC approximation that preserves the full multi-state character for infeasible MaddisonSlatkin computations.

Key changes:
- 'auto' and 'mc' approx modes now both use MC for infeasible chars (no more state reduction to binary)
- .ApproxStepInformation() uses log-quadratic interpolation anchored at the exact P(s_min) to bridge the gap between the analytic minimum-steps probability and the MC body
- Default n_mc increased from 5000 to 50000 for better tail resolution
- PrepareDataProfile() no longer reduces multi-state characters in the data matrix; infeasibility is handled entirely in StepInformation()
- Log-linear fallback when quadratic sanity checks fail (monotonicity or concavity)

Validated: R² = 0.9997 for log-quadratic fit vs exact on (8,8,8) n=24. MC-estimated IC(0) matches exact within 0.5 bits at the feasibility boundary. All 105 tests pass.
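One way to realise a log-quadratic interpolation is an exact fit of log P(s) = c0 + c1·s + c2·s² through three (step count, log-probability) anchors via Lagrange basis polynomials. This is an illustrative sketch, not the package's .ApproxStepInformation() fit, which anchors at the exact P(s_min), fits against the MC body, and checks monotonicity/concavity before accepting the quadratic:

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Fit the unique quadratic through three points by summing scaled Lagrange
// basis polynomials: for each anchor i, w_i * (s - s_j)(s - s_k) expands to
// w_i*s^2 - w_i*(s_j+s_k)*s + w_i*s_j*s_k.
std::array<double, 3> FitLogQuadratic(const double s[3], const double logp[3]) {
  std::array<double, 3> c{0.0, 0.0, 0.0};  // c[0] + c[1]*s + c[2]*s^2
  for (int i = 0; i < 3; ++i) {
    const int j = (i + 1) % 3, k = (i + 2) % 3;
    const double w = logp[i] / ((s[i] - s[j]) * (s[i] - s[k]));
    c[2] += w;
    c[1] -= w * (s[j] + s[k]);
    c[0] += w * s[j] * s[k];
  }
  return c;
}
```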
Add mc_fitch_scores() in src/ts_mc_fitch.cpp: generates random trees via random_tree() (pure C, no Morphy dependency) and scores each with an inline bitmask Fitch downpass. No R object allocation per tree.

Performance: 357k trees/sec on 38-tip 3-state characters (vs ~100k/s with the old Morphy path). This makes 100k MC samples cost ~0.28s.

- Wire mc_fitch_scores into .ApproxStepInformation(), replacing the R-level RandomTree + TreeLength loop
- Bump default n_mc from 50k to 100k (now affordable)
- Update docstrings to reflect the new implementation
- All 288 tests pass (88 multistate + 17 data_manipulation + 183 ts-*)
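The core of a bitmask Fitch downpass is small enough to sketch. This is illustrative, not the PR's ts_mc_fitch.cpp: tips carry bitmasks of admissible states, and each internal node (given in postorder as child-index pairs) takes the intersection of its children's sets, or, when that is empty, their union plus one step:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// state: one bitmask per tip (bit i set = state i admissible); internal
// node states are appended as the postorder walk proceeds, so internals
// may reference earlier internal nodes as well as tips.
int FitchScore(std::vector<std::uint32_t> state,
               const std::vector<std::pair<int, int>>& internals) {
  int steps = 0;
  for (const auto& node : internals) {
    std::uint32_t s = state[node.first] & state[node.second];
    if (s == 0) {                                  // children disagree:
      s = state[node.first] | state[node.second];  // keep the union
      ++steps;                                     // and count one change
    }
    state.push_back(s);                            // this node's state set
  }
  return steps;
}
```

With 32-bit masks this handles up to 32 states, which is why the later cleanup commit can drop the 5-state cap and route wide characters through MC.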
R/data_manipulation.R
Outdated
#' Default 5000.
#' `"auto"` (default) and `"mc"` use a Monte Carlo approximation with
#' log-quadratic tail interpolation, preserving the full multi-state
#' character; `"exact"` always uses the exact Maddison & Slatkin calculation
Sounds like auto always uses MC approximation; are there not small cases where it would be faster / not slower enough to compute exactly? Where we're certain that exact will be quick, do we want to use that instead of MC every time?
R/data_manipulation.R
Outdated
maxInformative <- max(maxInformative, nInf)
}

if (cappedAny) {
Do we still need to cap now we have MC?
R/pp_info_extra_step.r
Outdated
#' instead: the step-count distribution is estimated from a random sample of
#' trees, anchored at the exact minimum-steps probability.
#' the `approx` argument. With `"auto"` (default) or `"mc"`, a Monte Carlo
#' approximation preserves the full multi-state character: the exact
"preserves the full multi-state character" might be confusing: why would the user presume otherwise?
R/pp_info_extra_step.r
Outdated
#' beyond ~27 tips, 4-state beyond ~13 tips, and 5-state beyond
#' ~9 tips trigger the approximation.
#' With `"exact"`, the full Maddison & Slatkin recursion is forced regardless
#' of cost (may be very slow for large characters).
perhaps for large or complex characters?
R/pp_info_extra_step.r
Outdated
#' @param n_mc Integer. Number of random trees used by the MC approximation.
#' Larger values improve accuracy but increase computation time.
#' Default: 5000.
#' Default: 100 000. Tree generation and Fitch scoring are performed
Remove note starting "Tree generation and Fitch scoring are performed...": "large" is relative; and the user will assume the default is reasonable.
cpp-search references BiasedWagnerParams/biased_wagner_tree in ts_driven.cpp and ts_rcpp.cpp but ts_wagner.h on cpp-search didn't have the declarations. Copy from main source tree to fix the build.
- 'auto' approx docs now clarify: exact when feasible, MC when not
- 'mc' approx mode now forces MC even for feasible characters
- Remove 5-state cap: >5 informative states route to MC via bitmask Fitch (mc_fitch_scores supports up to 32 states)
- Remove 'preserves the full multi-state character' phrasing
- Change 'large characters' to 'large or complex characters'
- Remove n_mc implementation note (user assumes default is reasonable)
- Log-quadratic interpolation stays in R: post-processing operates on ~50 values (one per step count), not the 100k MC samples. Moving to C++ would save microseconds; the bottleneck is mc_fitch_scores.
Remove the 5-state cap from the Shiny-inlined copy of PrepareDataProfile. Characters with >5 informative states now pass through to StepInformation() which routes them to MC.
Replace deprecated devtools::build_vignettes() with tools::buildVignettes(), devtools::test() with testthat::test_local(). These scripts are legacy (GHA workflow uses ms609/actions/memcheck directly), but should still work for local valgrind runs.
The exact MaddisonSlatkin solver has a hard-coded 256-element FixedDraws array. Some partition shapes that pass the split_count feasibility gate can still overflow this limit (e.g. 3-state characters at 62 tips). Wrap the MaddisonSlatkin() call in tryCatch so overflow or timeout errors gracefully fall back to MC approximation instead of aborting. Fixes: TreeLength(tree, data, concavity = 'profile') error on inapplicable.phyData[[1]] (62 tips, 3 informative states).
Reduce ASLR entropy (vm.mmap_rnd_bits=28) before ASAN-instrumented R processes launch. Without this, the runner's memory layout can collide with ASAN's shadow region, aborting with 'Shadow memory range interleaves with an existing memory mapping' (google/sanitizers#856). The vignettes job happened to avoid the collision; tests and examples did not. This fix makes all three jobs reliable.
Replace the fragile stock-R-with-LD_PRELOAD approach with the r-hub gcc-asan container, where R itself is built with -fsanitize=address. This eliminates the shadow-memory collisions (google/sanitizers#856) that caused persistent test/example failures on ubuntu-24.04 runners. Uses the new ms609/actions/asan composite action.
…ergence

- PrepareDataProfile auto vs mc: no longer expect equal nrow on info.amounts. Auto uses the exact solver for feasible chars while mc forces MC, so step ranges legitimately differ.
- >5 state characters: no longer expect a truncation warning; they now route to MC directly.
- Export mc_fitch_scores in NAMESPACE - Add mc_fitch_scores.Rd man page - Update PrepareDataProfile.Rd, StepInformation.Rd (roxygen regen) - Add papers.md to .Rbuildignore
--no-manual avoids R CMD Rd2pdf entirely, removing the need for texlive-latex-base, texlive-fonts-recommended, and the 500MB texlive-fonts-extra (inconsolata.sty). Keep only qpdf.
Force-pushed 46a0aaa to 2035118.
…kin-profiling # Conflicts: # .github/workflows/agent-check.yml
…09/TreeSearch into feature/madslatkin-profiling # Conflicts: # .github/workflows/agent-check.yml
- T-214 GHA passed; T-212 unblocked and re-dispatched
- Shiny bug backlog cleared (T-230–T-237 all resolved)
- Fixed blank-line formatting in to-do.md
- Updated S-PR notes with PR #211 status
Profiling-driven performance improvements to `MaddisonSlatkin()` and a new native C++ Monte Carlo scorer for infeasible multi-state characters, developed on a feature branch for review before merging to `cpp-search`.

Key changes

MaddisonSlatkin exact path (earlier commits)

MC approximation for infeasible characters (latest commits)

- Log-quadratic MC approximation — replaces binary reduction with full multi-state MC: `NUnrootedMult()`
- Native C++ `mc_fitch_scores()` (src/ts_mc_fitch.cpp) — generates random trees via `random_tree()` and scores with an inline bitmask Fitch downpass. No Morphy dependency, no R allocation per tree.
- Default `n_mc` bumped from 50k to 100k.

Testing

All 288 tests pass (88 multistate + 17 data_manipulation + 183 ts-*). MC score distributions cross-validated against `phangorn::parsimony()`.