Skip to content

perf: MaddisonSlatkin optimizations + native C++ MC Fitch scorer#211

Merged
ms609 merged 32 commits intocpp-searchfrom
feature/madslatkin-profiling
Mar 25, 2026
Merged

perf: MaddisonSlatkin optimizations + native C++ MC Fitch scorer#211
ms609 merged 32 commits intocpp-searchfrom
feature/madslatkin-profiling

Conversation

@ms609
Copy link
Owner

@ms609 ms609 commented Mar 22, 2026

Profiling-driven performance improvements to MaddisonSlatkin() and a new native C++ Monte Carlo scorer for infeasible multi-state characters, developed on a feature branch for review before merging to cpp-search.

Key changes

MaddisonSlatkin exact path (earlier commits)

  1. LSEAccumulator double/exp + NEG_INF compare — switch to double precision, faster exp, sentinel comparison
  2. Raise feasibility thresholds — updated post-optimization (k=3→27, k=4→19, k=5→13)
  3. Batch logB_cache lookups — ~15% speedup in LogPVec draw-pair loop
  4. OAFlatMap for caches — logB_cache/logPVec_cache use open-addressing flat map
  5. Partition-aware split_count gate — skip infeasible partitions early
  6. Carter closed-form for binary chars — analytic shortcut + LSE lookup tables + buffer reuse
  7. Wall-clock time budget — abort infeasible MaddisonSlatkin calls that exceed budget

MC approximation for infeasible characters (latest commits)

  1. Log-quadratic MC approximation — replaces binary reduction with full multi-state MC:

    • Exact anchor P(s_min) via NUnrootedMult()
    • MC body from random tree scoring
    • Log-quadratic interpolation bridging exact anchor to MC body
    • Log-linear fallback if sanity checks fail
  2. Native C++ mc_fitch_scores() (src/ts_mc_fitch.cpp) — generates random trees via random_tree() and scores with inline bitmask Fitch downpass. No Morphy dependency, no R allocation per tree.

Metric Old (R-level) New (C++)
Rate (n=38, k=3) ~4k trees/s 357k trees/s
100k trees cost ~25s 0.28s
Morphy dependency Yes No

Default n_mc bumped from 50k to 100k.

Testing

All 288 tests pass (88 multistate + 17 data_manipulation + 183 ts-*). MC score distributions cross-validated against phangorn::parsimony().

ms609 added 12 commits March 19, 2026 21:02
… compare

- LSEAccumulator: long double → double, expl/logl → std::exp/std::log
- log_prod_sum_4: remove redundant long double casts
- computeLogRD (both sites): long double lc → double lc
- All std::isfinite(x) → (x > NEG_INF) throughout MaddisonSlatkin.cpp
  (FixedProbList NaN sentinel correctly excluded: NaN > NEG_INF == false)
- log_prod_sum_4: now returns plain double sum

Measured speedup (matched calls, same RNG seed):
  k=3/n=20-24: 1.44x (30%)
  k=4/n=14-18: 1.28x (22%)
  k=5/n=9-12:  1.36x (26%)

Tests: 37 MaddisonSlatkin + 82 pp-multistate + 41 ts-profile pass, 0 fail.
…k=5→13)

Worst-case timings with double/NEG_INF optimisations on reference machine:
  k=3 n=27: 1.03 s  (was n=25 ~0.60 s under old comment's ~5 s budget)
  k=4 n=19: 2.00 s  (was n=18 ~1.27 s)
  k=5 n=13: 2.58 s  (was n=12 ~1.62 s)

Also updates doc-string tip counts and comment wording.
All 82 pp-multistate tests pass.
Per draw pair in SolverT::LogPVec, the noStep/yesStep pair loops called
LogB(pr.a, drawn) once per pair (N_no+N_yes times), each doing an
unordered_map::find on drawn. Since drawn is constant across all pairs in
a given draw pair, all but the first find are redundant.

Fix: prime the logB_cache for drawn/undrawn via a single LogPVec call
before the loops, then fetch FixedProbList* twice (one find per side)
and replace LogB(pr.a, drawn) with (*fpl_d)[pr.a] — a direct indexed
array access. For n<=2 base cases (fpl is null), falls through to LogB.

Measured speedup vs T-152+T-153 baseline (reference machine, worst-case):
  k=3 n=25:  0.60s -> 0.47s  (1.28x)
  k=3 n=27:  1.03s -> 0.97s  (1.06x)
  k=4 n=18:  1.27s -> 1.07s  (1.19x)
  k=4 n=19:  2.00s -> 1.67s  (1.20x)
  k=5 n=12:  1.62s -> 1.46s  (1.11x)
  k=5 n=13:  2.58s -> 2.10s  (1.23x)
  Average:   ~15% faster

All 37 MaddisonSlatkin + 82 pp-multistate tests pass.
Replace std::unordered_map (node-based, 2x pointer-chase per lookup) with
OAFlatMap: inline open-addressing map with separate probe layer (uint64_t
hashes_ + uint32_t indices_) and contiguous entries_ vector.

logB_cache: OAFlatMap<StateKeyT<N>, FixedProbList>
  - Probe layer (hashes_ + indices_) fits in L2 cache; eliminates extra
    heap-node dereference per lookup.
  - LogB() rewritten to not hold slot reference across recursion (safe for
    any non-stable map; also fixes latent UB in prior code).

logPVec_cache: OAFlatMap<LogPVecKey, uint32_t> + std::deque<vector<double>>
  - Index map uses same probe-layer design (fast OA lookup for key -> idx).
  - Values live in pv_store (deque): push_back does not invalidate existing
    element references, preserving caller-held const vector<double>& safety.

All 37 MaddisonSlatkin + 82 pp-multistate tests pass.
Replace blunt (k, n) feasibility thresholds with a partition-shape-aware
metric: the split_count (coefficient of x^floor(n/2) in the generating
polynomial). This allows skewed multi-state characters to be computed
exactly even at large n, while still blocking balanced partitions that
would blow up.

- Add .MSSplitCount() and .MS_SC_THRESHOLD in data_manipulation.R
- Replace maxTips gate in PrepareDataProfile and pp_info_extra_step
- Shrink 5-state test from n=11 (blowup) to n=10 (0.6s)
- Add 18 split_count boundary + base-case tests (55 total, 5.6s)
The infeasible test cases used partitions from the old maxTips gate
(n > 25 for k=3). The split_count gate (f738091) replaced maxTips but
the test partitions weren't updated — (13,12,10) has sc=118, below
the k=3 threshold of 136, causing the exact solver to run indefinitely.

Changed infeasible k=3 partition from (13,12,10) n=35 to (13,13,12)
n=38 (sc=140 > 136). Also fixed stale comments and suppressed expected
fallback warnings.
…r reuse

Binary (2-token) characters: O(1) Carter et al. (1990) closed-form
shortcut when both states are pure (no ambiguous tips). Eliminates
the exponential recursive algorithm entirely for this common case.
Benchmark: k=2 n=50 drops from 0.22s to <0.001s (>100x).

Multi-state (3-5 token) characters:
- FastLSETables: 4096-entry precomputed lookup tables for exp(-d)
  and log1p(exp(-d)) with linear interpolation. Used in
  LSEAccumulator::add() to replace std::exp() calls.
- lse_update(): shared inline helper for log-sum-exp accumulation,
  replacing ~6 copies of the same pattern in LogPVec/runAll.
- logconv(): changed from allocating+returning std::vector to
  writing into caller-provided buffer. Eliminates allocation per
  convolution call.
- Reusable noStepVec/yesStepVec/conv_buf buffers in LogPVec,
  allocated once per call, std::fill'd per draw pair.
Multi-state speedup: ~5-19% for k=3 boundary cases (within noise
for k=4/k=5 due to recursion depth dominating).

Also adds logDoubleFact, logN_carter, logCarter1_cpp helpers for
the Carter formula (C++ implementation of the R LogCarter1).
The original threshold calibration (136/100/100) used sequential state
indices (1,2,3,...) instead of bitmask positions (1,2,4,8,16). This
resulted in testing a different (easier) combinatorial problem, yielding
dangerously high thresholds that allowed the exact solver to run
indefinitely on balanced partitions within the supposed safe range.

Recalibrated with correct encoding (states at 2^(i-1) positions):
  k=3: 136 → 75  (n=27 balanced 0.97s safe, n=31 balanced 1.32s marginal)
  k=4: 100 → 50  (n=13 balanced 0.36s safe, n=15 balanced 0.94s marginal)
  k=5: 100 → 35  (n=9  balanced 0.22s safe, n=10 balanced 0.49s)

Updated boundary tests and docstrings to match.
Add a 2-second wall-clock budget to SolverT.  Every LogB and LogPVec
entry checks std::chrono::steady_clock; if the budget is exceeded,
budget_exceeded is set and all loops bail out immediately.  runAll()
returns NA_REAL with an Rcpp::warning so downstream code can detect
the fallback.

The guard prevents infinite hangs even if a partition slips past the
split_count feasibility gate (e.g. through future threshold changes).
Chrono overhead is ~20 ns per call — negligible versus per-call work.

Verified:
- k=3 n=35 balanced (known blowup): returns NA in 2.02 s
- 137 existing tests pass (55 MS + 82 pp-multistate, 9.2 s)
@ms609 ms609 marked this pull request as ready for review March 22, 2026 18:55
ms609 added 2 commits March 22, 2026 20:10
…profiles

Replace binary reduction with a log-quadratic MC approximation that
preserves the full multi-state character for infeasible MaddisonSlatkin
computations.

Key changes:
- 'auto' and 'mc' approx modes now both use MC for infeasible chars
  (no more state reduction to binary)
- .ApproxStepInformation() uses log-quadratic interpolation anchored
  at the exact P(s_min) to bridge the gap between the analytic
  minimum-steps probability and the MC body
- Default n_mc increased from 5000 to 50000 for better tail resolution
- PrepareDataProfile() no longer reduces multi-state characters in
  the data matrix; infeasibility handled entirely in StepInformation()
- Log-linear fallback when quadratic sanity checks fail (monotonicity
  or concavity)

Validated: R² = 0.9997 for log-quadratic fit vs exact on (8,8,8) n=24.
MC-estimated IC(0) matches exact within 0.5 bits at the feasibility
boundary. All 105 tests pass.
Add mc_fitch_scores() in src/ts_mc_fitch.cpp: generates random trees
via random_tree() (pure C, no Morphy dependency) and scores each with
an inline bitmask Fitch downpass.  No R object allocation per tree.

Performance: 357k trees/sec on 38-tip 3-state characters (vs ~100k/s
with the old Morphy path).  This makes 100k MC samples cost ~0.28s.

- Wire mc_fitch_scores into .ApproxStepInformation(), replacing the
  R-level RandomTree + TreeLength loop
- Bump default n_mc from 50k to 100k (now affordable)
- Update docstrings to reflect the new implementation
- All 288 tests pass (88 multistate + 17 data_manipulation + 183 ts-*)
@ms609 ms609 changed the title perf: MaddisonSlatkin profiling-driven optimizations perf: MaddisonSlatkin optimizations + native C++ MC Fitch scorer Mar 23, 2026
#' Default 5000.
#' `"auto"` (default) and `"mc"` use a Monte Carlo approximation with
#' log-quadratic tail interpolation, preserving the full multi-state
#' character; `"exact"` always uses the exact Maddison & Slatkin calculation
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like auto always uses MC approximation; are there not small cases where it would be faster / not slower enough to compute exactly? Where we're certain that exact will be quick, do we want to use that instead of MC every time?

maxInformative <- max(maxInformative, nInf)
}

if (cappedAny) {
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need to cap now we have MC?

#' instead: the step-count distribution is estimated from a random sample of
#' trees, anchored at the exact minimum-steps probability.
#' the `approx` argument. With `"auto"` (default) or `"mc"`, a Monte Carlo
#' approximation preserves the full multi-state character: the exact
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"preserves the full multi-state character" might be confusing: why would the user presume otherwise?

#' beyond ~27 tips, 4-state beyond ~13 tips, and 5-state beyond
#' ~9 tips trigger the approximation.
#' With `"exact"`, the full Maddison & Slatkin recursion is forced regardless
#' of cost (may be very slow for large characters).
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps for large or complex characters?

#' @param n_mc Integer. Number of random trees used by the MC approximation.
#' Larger values improve accuracy but increase computation time.
#' Default: 5000.
#' Default: 100 000. Tree generation and Fitch scoring are performed
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove note starting "Tree generation and Fitch scoring are performed...": "large" is relative; and the user will assume the default is reasonable.

ms609 added 12 commits March 24, 2026 11:22
cpp-search references BiasedWagnerParams/biased_wagner_tree in
ts_driven.cpp and ts_rcpp.cpp but ts_wagner.h on cpp-search didn't
have the declarations. Copy from main source tree to fix the build.
- 'auto' approx docs now clarify: exact when feasible, MC when not
- 'mc' approx mode now forces MC even for feasible characters
- Remove 5-state cap: >5 informative states route to MC via bitmask
  Fitch (mc_fitch_scores supports up to 32 states)
- Remove 'preserves the full multi-state character' phrasing
- Change 'large characters' to 'large or complex characters'
- Remove n_mc implementation note (user assumes default is reasonable)
- Log-quadratic interpolation stays in R: post-processing operates on
  ~50 values (one per step count), not the 100k MC samples. Moving to
  C++ would save microseconds; the bottleneck is mc_fitch_scores.
Remove the 5-state cap from the Shiny-inlined copy of
PrepareDataProfile. Characters with >5 informative states now pass
through to StepInformation() which routes them to MC.
Replace deprecated devtools::build_vignettes() with
tools::buildVignettes(), devtools::test() with testthat::test_local().
These scripts are legacy (GHA workflow uses ms609/actions/memcheck
directly), but should still work for local valgrind runs.
The exact MaddisonSlatkin solver has a hard-coded 256-element
FixedDraws array. Some partition shapes that pass the split_count
feasibility gate can still overflow this limit (e.g. 3-state
characters at 62 tips). Wrap the MaddisonSlatkin() call in tryCatch
so overflow or timeout errors gracefully fall back to MC
approximation instead of aborting.

Fixes: TreeLength(tree, data, concavity = 'profile') error on
inapplicable.phyData[[1]] (62 tips, 3 informative states).
Reduce ASLR entropy (vm.mmap_rnd_bits=28) before ASAN-instrumented
R processes launch.  Without this, the runner's memory layout can
collide with ASAN's shadow region, aborting with 'Shadow memory range
interleaves with an existing memory mapping' (google/sanitizers#856).

The vignettes job happened to avoid the collision; tests and examples
did not.  This fix makes all three jobs reliable.
Replace the fragile stock-R-with-LD_PRELOAD approach with the r-hub
gcc-asan container, where R itself is built with -fsanitize=address.
This eliminates the shadow-memory collisions (google/sanitizers#856)
that caused persistent test/example failures on ubuntu-24.04 runners.

Uses the new ms609/actions/asan composite action.
…ergence

- PrepareDataProfile auto vs mc: no longer expect equal nrow on
  info.amounts. Auto uses exact solver for feasible chars while mc
  forces MC, so step ranges legitimately differ.
- >5 state characters: no longer expect truncation warning; they now
  route to MC directly.
- Export mc_fitch_scores in NAMESPACE
- Add mc_fitch_scores.Rd man page
- Update PrepareDataProfile.Rd, StepInformation.Rd (roxygen regen)
- Add papers.md to .Rbuildignore
--no-manual avoids R CMD Rd2pdf entirely, removing the need for
texlive-latex-base, texlive-fonts-recommended, and the 500MB
texlive-fonts-extra (inconsolata.sty). Keep only qpdf.
@ms609 ms609 force-pushed the feature/madslatkin-profiling branch from 46a0aaa to 2035118 Compare March 24, 2026 15:02
ms609 added a commit that referenced this pull request Mar 25, 2026
- T-214 GHA passed; T-212 unblocked and re-dispatched
- Shiny bug backlog cleared (T-230–T-237 all resolved)
- Fixed blank-line formatting in to-do.md
- Updated S-PR notes with PR #211 status
@ms609 ms609 merged commit 8a06ff0 into cpp-search Mar 25, 2026
2 of 10 checks passed
@ms609 ms609 deleted the feature/madslatkin-profiling branch March 25, 2026 17:10
ms609 added a commit that referenced this pull request Mar 25, 2026
- Clarify MaddisonSlatkin() @examples (show logp intermediate)
- Point FixedDraws overflow error to StepInformation(approx='mc')
- Note MC fallback in MaddisonSlatkin Rd documentation

From uncommitted work on feature/madslatkin-profiling (PR #211).
ms609 added a commit that referenced this pull request Mar 25, 2026
- Clarify MaddisonSlatkin() @examples (show logp intermediate)
- Point FixedDraws overflow error to StepInformation(approx='mc')
- Note MC fallback in MaddisonSlatkin Rd documentation

From uncommitted work on feature/madslatkin-profiling (PR #211).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant