perf: MaddisonSlatkin optimizations + native C++ MC Fitch scorer #211
ms609 merged 32 commits into cpp-search from
Conversation
… compare

- LSEAccumulator: long double → double, expl/logl → std::exp/std::log
- log_prod_sum_4: remove redundant long double casts
- computeLogRD (both sites): long double lc → double lc
- All std::isfinite(x) → (x > NEG_INF) throughout MaddisonSlatkin.cpp (FixedProbList NaN sentinel correctly excluded: NaN > NEG_INF == false)
- log_prod_sum_4: now returns plain double sum

Measured speedup (matched calls, same RNG seed):
- k=3/n=20-24: 1.44x (30%)
- k=4/n=14-18: 1.28x (22%)
- k=5/n=9-12: 1.36x (26%)

Tests: 37 MaddisonSlatkin + 82 pp-multistate + 41 ts-profile pass, 0 fail.
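The `(x > NEG_INF)` replacement relies on IEEE comparison semantics. A minimal sketch of the invariant (`NEG_INF` named as in the commit; the helper name is illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Why (x > NEG_INF) can replace std::isfinite(x) for log-probabilities:
// the only non-finite values in these arrays are -inf (log of a zero
// probability) and the FixedProbList NaN sentinel. Every comparison with
// NaN is false, so the NaN sentinel is excluded exactly like -inf is.
// (Unlike std::isfinite this would admit +inf, but a log-probability
// can never be +inf.)
constexpr double NEG_INF = -std::numeric_limits<double>::infinity();

inline bool has_mass(double log_p) {
  return log_p > NEG_INF;  // false for -inf AND for NaN
}
```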
…k=5→13)

Worst-case timings with double/NEG_INF optimisations on reference machine:
- k=3 n=27: 1.03 s (was n=25 ~0.60 s under old comment's ~5 s budget)
- k=4 n=19: 2.00 s (was n=18 ~1.27 s)
- k=5 n=13: 2.58 s (was n=12 ~1.62 s)

Also updates doc-string tip counts and comment wording. All 82 pp-multistate tests pass.
Per draw pair in SolverT::LogPVec, the noStep/yesStep pair loops called LogB(pr.a, drawn) once per pair (N_no+N_yes times), each doing an unordered_map::find on drawn. Since drawn is constant across all pairs in a given draw pair, all but the first find are redundant.

Fix: prime the logB_cache for drawn/undrawn via a single LogPVec call before the loops, then fetch FixedProbList* twice (one find per side) and replace LogB(pr.a, drawn) with (*fpl_d)[pr.a] — a direct indexed array access. For n<=2 base cases (fpl is null), the code falls through to LogB.

Measured speedup vs T-152+T-153 baseline (reference machine, worst-case):
- k=3 n=25: 0.60s -> 0.47s (1.28x)
- k=3 n=27: 1.03s -> 0.97s (1.06x)
- k=4 n=18: 1.27s -> 1.07s (1.19x)
- k=4 n=19: 2.00s -> 1.67s (1.20x)
- k=5 n=12: 1.62s -> 1.46s (1.11x)
- k=5 n=13: 2.58s -> 2.10s (1.23x)
Average: ~15% faster

All 37 MaddisonSlatkin + 82 pp-multistate tests pass.
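The hoisted-lookup pattern can be sketched in miniature (a simplified model: `FixedProbList` is stood in by `std::vector<double>`, and all other names are illustrative, not the PR's code):

```cpp
#include <cassert>
#include <cmath>
#include <unordered_map>
#include <vector>

// Simplified model of the fix: rather than one hash lookup per pair,
// fetch the cached probability list once per draw, then use a direct
// indexed access per pair.
using FixedProbList = std::vector<double>;

static int find_count = 0;  // instrumentation: how many hash lookups ran

const FixedProbList* Lookup(
    const std::unordered_map<int, FixedProbList>& cache, int drawn) {
  ++find_count;
  auto it = cache.find(drawn);
  return it == cache.end() ? nullptr : &it->second;
}

// Before the fix: one Lookup per pair. After: one Lookup per draw,
// then (*fpl)[a] — a plain array index — inside the hot loop.
double SumOverPairs(const std::unordered_map<int, FixedProbList>& cache,
                    int drawn, const std::vector<int>& pair_a) {
  const FixedProbList* fpl = Lookup(cache, drawn);  // primed once
  double total = 0.0;
  for (int a : pair_a) total += (*fpl)[a];          // no find in the loop
  return total;
}
```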
Replace std::unordered_map (node-based, 2x pointer-chase per lookup) with
OAFlatMap: inline open-addressing map with separate probe layer (uint64_t
hashes_ + uint32_t indices_) and contiguous entries_ vector.
logB_cache: OAFlatMap<StateKeyT<N>, FixedProbList>
- Probe layer (hashes_ + indices_) fits in L2 cache; eliminates extra
heap-node dereference per lookup.
- LogB() rewritten to not hold slot reference across recursion (safe for
any non-stable map; also fixes latent UB in prior code).
logPVec_cache: OAFlatMap<LogPVecKey, uint32_t> + std::deque<vector<double>>
- Index map uses same probe-layer design (fast OA lookup for key -> idx).
- Values live in pv_store (deque): push_back does not invalidate existing
element references, preserving caller-held const vector<double>& safety.
All 37 MaddisonSlatkin + 82 pp-multistate tests pass.
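The probe-layer idea can be sketched in miniature. This is an illustrative reimplementation, not the PR's OAFlatMap: it uses a fixed power-of-two capacity and never rehashes, and it omits the deque-backed value store described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// hashes_ + indices_ form the small, cache-friendly open-addressing probe
// layer; entries_ stores key/value payloads contiguously. A lookup probes
// the compact arrays and touches entries_ at most once per candidate hit.
template <typename K, typename V>
class OAFlatMapSketch {
  std::vector<std::uint64_t> hashes_;   // 0 marks an empty slot
  std::vector<std::uint32_t> indices_;  // slot -> position in entries_
  std::vector<std::pair<K, V>> entries_;
  std::size_t mask_;

  static std::uint64_t HashOf(const K& key) {
    std::uint64_t h = std::hash<K>{}(key);
    return h ? h : 1;  // reserve 0 as the "empty" marker
  }

 public:
  explicit OAFlatMapSketch(std::size_t capacity = 64)  // power of two
      : hashes_(capacity, 0), indices_(capacity, 0), mask_(capacity - 1) {}

  V* find(const K& key) {
    const std::uint64_t h = HashOf(key);
    for (std::size_t i = h & mask_;; i = (i + 1) & mask_) {
      if (hashes_[i] == 0) return nullptr;       // hit an empty slot: absent
      if (hashes_[i] == h && entries_[indices_[i]].first == key)
        return &entries_[indices_[i]].second;    // one contiguous access
    }
  }

  void insert(const K& key, const V& value) {
    const std::uint64_t h = HashOf(key);
    for (std::size_t i = h & mask_;; i = (i + 1) & mask_) {
      if (hashes_[i] == 0) {                     // claim an empty slot
        hashes_[i] = h;
        indices_[i] = static_cast<std::uint32_t>(entries_.size());
        entries_.emplace_back(key, value);
        return;
      }
      if (hashes_[i] == h && entries_[indices_[i]].first == key) {
        entries_[indices_[i]].second = value;    // overwrite existing
        return;
      }
    }
  }
};
```

Note that entries_ may reallocate on insert, which is exactly why the commit message stresses not holding a slot reference across recursion, and why logPVec_cache keeps its values in a std::deque instead.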
Replace blunt (k, n) feasibility thresholds with a partition-shape-aware metric: the split_count (coefficient of x^floor(n/2) in the generating polynomial). This allows skewed multi-state characters to be computed exactly even at large n, while still blocking balanced partitions that would blow up.

- Add .MSSplitCount() and .MS_SC_THRESHOLD in data_manipulation.R
- Replace maxTips gate in PrepareDataProfile and pp_info_extra_step
- Shrink 5-state test from n=11 (blowup) to n=10 (0.6s)
- Add 18 split_count boundary + base-case tests (55 total, 5.6s)
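Under the reading that the generating polynomial is the product over state-class sizes a_i of (1 + x + … + x^{a_i}), the split_count can be computed by polynomial convolution. A sketch of that reading (an assumption from the commit text, not the package's .MSSplitCount()); it reproduces the sc values quoted in the later threshold-fix commit, (13,12,10) → 118 and (13,13,12) → 140:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// split_count: the coefficient of x^{floor(n/2)} in
//   prod_i (1 + x + ... + x^{a_i}),
// i.e. the number of shape-level ways to split the n tips into two halves.
// Balanced partitions maximise it; skewed ones keep it small.
std::uint64_t SplitCount(const std::vector<int>& partition) {
  int n = 0;
  for (int a : partition) n += a;
  std::vector<std::uint64_t> coef{1};  // coef[j] = coefficient of x^j
  for (int a : partition) {
    // Multiply the running polynomial by (1 + x + ... + x^a).
    std::vector<std::uint64_t> next(coef.size() + a, 0);
    for (std::size_t j = 0; j < coef.size(); ++j)
      for (int d = 0; d <= a; ++d)
        next[j + d] += coef[j];
    coef = next;
  }
  return coef[n / 2];
}
```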
The infeasible test cases used partitions from the old maxTips gate (n > 25 for k=3). The split_count gate (f738091) replaced maxTips but the test partitions weren't updated — (13,12,10) has sc=118, below the k=3 threshold of 136, causing the exact solver to run indefinitely.

Changed the infeasible k=3 partition from (13,12,10) n=35 to (13,13,12) n=38 (sc=140 > 136). Also fixed stale comments and suppressed expected fallback warnings.
…r reuse

Binary (2-token) characters: O(1) Carter et al. (1990) closed-form shortcut when both states are pure (no ambiguous tips). Eliminates the exponential recursive algorithm entirely for this common case. Benchmark: k=2 n=50 drops from 0.22s to <0.001s (>100x).

Multi-state (3-5 token) characters:
- FastLSETables: 4096-entry precomputed lookup tables for exp(-d) and log1p(exp(-d)) with linear interpolation. Used in LSEAccumulator::add() to replace std::exp() calls.
- lse_update(): shared inline helper for log-sum-exp accumulation, replacing ~6 copies of the same pattern in LogPVec/runAll.
- logconv(): changed from allocating+returning std::vector to writing into a caller-provided buffer. Eliminates an allocation per convolution call.
- Reusable noStepVec/yesStepVec/conv_buf buffers in LogPVec, allocated once per call, std::fill'd per draw pair.

Multi-state speedup: ~5-19% for k=3 boundary cases (within noise for k=4/k=5 due to recursion depth dominating).

Also adds logDoubleFact, logN_carter, logCarter1_cpp helpers for the Carter formula (C++ implementation of the R LogCarter1).
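The shared accumulation helper presumably follows the standard stable log-sum-exp update; a minimal sketch (the signature is illustrative, and this version calls std::exp directly rather than the FastLSETables lookup described above):

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// acc = log(exp(acc) + exp(x)), computed stably via the max trick:
// pull the larger term out so the remaining exp argument is <= 0.
inline void lse_update(double& acc, double x) {
  if (x == -INFINITY) return;              // adding zero probability mass
  if (acc == -INFINITY) { acc = x; return; }
  if (x > acc) std::swap(acc, x);          // ensure acc >= x
  acc += std::log1p(std::exp(x - acc));    // exp argument <= 0: no overflow
}
```

The FastLSETables optimisation replaces the std::exp/std::log1p pair here with interpolated table lookups on the non-negative difference acc - x.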
The original threshold calibration (136/100/100) used sequential state indices (1,2,3,...) instead of bitmask positions (1,2,4,8,16). This resulted in testing a different (easier) combinatorial problem, yielding dangerously high thresholds that allowed the exact solver to run indefinitely on balanced partitions within the supposed safe range.

Recalibrated with correct encoding (states at 2^(i-1) positions):
- k=3: 136 → 75 (n=27 balanced 0.97s safe, n=31 balanced 1.32s marginal)
- k=4: 100 → 50 (n=13 balanced 0.36s safe, n=15 balanced 0.94s marginal)
- k=5: 100 → 35 (n=9 balanced 0.22s safe, n=10 balanced 0.49s)

Updated boundary tests and docstrings to match.
Add a 2-second wall-clock budget to SolverT. Every LogB and LogPVec entry checks std::chrono::steady_clock; if the budget is exceeded, budget_exceeded is set and all loops bail out immediately. runAll() returns NA_REAL with an Rcpp::warning so downstream code can detect the fallback.

The guard prevents infinite hangs even if a partition slips past the split_count feasibility gate (e.g. through future threshold changes). Chrono overhead is ~20 ns per call — negligible versus per-call work.

Verified:
- k=3 n=35 balanced (known blowup): returns NA in 2.02 s
- 137 existing tests pass (55 MS + 82 pp-multistate, 9.2 s)
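The guard pattern can be sketched as follows (names illustrative; in the PR the flag lives inside SolverT, the budget is 2 s, and runAll() maps an exceeded budget to NA_REAL plus an Rcpp::warning):

```cpp
#include <cassert>
#include <chrono>

// Wall-clock budget guard: checked at every LogB/LogPVec entry. Once the
// flag flips it stays set, so every loop level can bail out immediately.
struct BudgetGuard {
  std::chrono::steady_clock::time_point start;
  std::chrono::steady_clock::duration budget;
  bool exceeded = false;

  explicit BudgetGuard(std::chrono::milliseconds b)
      : start(std::chrono::steady_clock::now()), budget(b) {}

  bool CheckExceeded() {
    if (!exceeded && std::chrono::steady_clock::now() - start > budget)
      exceeded = true;  // sticky: never resets within one solve
    return exceeded;
  }
};
```

steady_clock is the right choice here: unlike system_clock it is monotonic, so the budget cannot be distorted by wall-time adjustments mid-solve.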
…profiles

Replace binary reduction with a log-quadratic MC approximation that preserves the full multi-state character for infeasible MaddisonSlatkin computations.

Key changes:
- 'auto' and 'mc' approx modes now both use MC for infeasible chars (no more state reduction to binary)
- .ApproxStepInformation() uses log-quadratic interpolation anchored at the exact P(s_min) to bridge the gap between the analytic minimum-steps probability and the MC body
- Default n_mc increased from 5000 to 50000 for better tail resolution
- PrepareDataProfile() no longer reduces multi-state characters in the data matrix; infeasibility is handled entirely in StepInformation()
- Log-linear fallback when quadratic sanity checks fail (monotonicity or concavity)

Validated: R² = 0.9997 for log-quadratic fit vs exact on (8,8,8) n=24. MC-estimated IC(0) matches exact within 0.5 bits at the feasibility boundary. All 105 tests pass.
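One way to realise a log-quadratic interpolation is an exact fit of log P(s) = c0 + c1·s + c2·s² through three (step count, log-probability) anchors via Lagrange basis polynomials. This is an illustrative sketch, not the package's .ApproxStepInformation() fit, which anchors at the exact P(s_min), fits against the MC body, and checks monotonicity/concavity before accepting the quadratic:

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Fit the unique quadratic through three points by summing scaled Lagrange
// basis polynomials: for each anchor i, w_i * (s - s_j)(s - s_k) expands to
// w_i*s^2 - w_i*(s_j+s_k)*s + w_i*s_j*s_k.
std::array<double, 3> FitLogQuadratic(const double s[3], const double logp[3]) {
  std::array<double, 3> c{0.0, 0.0, 0.0};  // c[0] + c[1]*s + c[2]*s^2
  for (int i = 0; i < 3; ++i) {
    const int j = (i + 1) % 3, k = (i + 2) % 3;
    const double w = logp[i] / ((s[i] - s[j]) * (s[i] - s[k]));
    c[2] += w;
    c[1] -= w * (s[j] + s[k]);
    c[0] += w * s[j] * s[k];
  }
  return c;
}
```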
Add mc_fitch_scores() in src/ts_mc_fitch.cpp: generates random trees via random_tree() (pure C, no Morphy dependency) and scores each with an inline bitmask Fitch downpass. No R object allocation per tree.

Performance: 357k trees/sec on 38-tip 3-state characters (vs ~100k/s with the old Morphy path). This makes 100k MC samples cost ~0.28s.

- Wire mc_fitch_scores into .ApproxStepInformation(), replacing the R-level RandomTree + TreeLength loop
- Bump default n_mc from 50k to 100k (now affordable)
- Update docstrings to reflect the new implementation
- All 288 tests pass (88 multistate + 17 data_manipulation + 183 ts-*)
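The core of a bitmask Fitch downpass is small enough to sketch. This is illustrative, not the PR's ts_mc_fitch.cpp: tips carry bitmasks of admissible states, and each internal node (given in postorder as child-index pairs) takes the intersection of its children's sets, or, when that is empty, their union plus one step:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// state: one bitmask per tip (bit i set = state i admissible); internal
// node states are appended as the postorder walk proceeds, so internals
// may reference earlier internal nodes as well as tips.
int FitchScore(std::vector<std::uint32_t> state,
               const std::vector<std::pair<int, int>>& internals) {
  int steps = 0;
  for (const auto& node : internals) {
    std::uint32_t s = state[node.first] & state[node.second];
    if (s == 0) {                                  // children disagree:
      s = state[node.first] | state[node.second];  // keep the union
      ++steps;                                     // and count one change
    }
    state.push_back(s);                            // this node's state set
  }
  return steps;
}
```

With 32-bit masks this handles up to 32 states, which is why the later cleanup commit can drop the 5-state cap and route wide characters through MC.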
R/data_manipulation.R
Outdated
#' Default 5000.
#' `"auto"` (default) and `"mc"` use a Monte Carlo approximation with
#' log-quadratic tail interpolation, preserving the full multi-state
#' character; `"exact"` always uses the exact Maddison & Slatkin calculation
Sounds like auto always uses MC approximation; are there not small cases where it would be faster / not slower enough to compute exactly? Where we're certain that exact will be quick, do we want to use that instead of MC every time?
R/data_manipulation.R
Outdated
maxInformative <- max(maxInformative, nInf)
}

if (cappedAny) {
Do we still need to cap now we have MC?
R/pp_info_extra_step.r
Outdated
#' instead: the step-count distribution is estimated from a random sample of
#' trees, anchored at the exact minimum-steps probability.
#' the `approx` argument. With `"auto"` (default) or `"mc"`, a Monte Carlo
#' approximation preserves the full multi-state character: the exact
"preserves the full multi-state character" might be confusing: why would the user presume otherwise?
R/pp_info_extra_step.r
Outdated
#' beyond ~27 tips, 4-state beyond ~13 tips, and 5-state beyond
#' ~9 tips trigger the approximation.
#' With `"exact"`, the full Maddison & Slatkin recursion is forced regardless
#' of cost (may be very slow for large characters).
perhaps for large or complex characters?
R/pp_info_extra_step.r
Outdated
#' @param n_mc Integer. Number of random trees used by the MC approximation.
#' Larger values improve accuracy but increase computation time.
#' Default: 5000.
#' Default: 100 000. Tree generation and Fitch scoring are performed
Remove note starting "Tree generation and Fitch scoring are performed...": "large" is relative; and the user will assume the default is reasonable.
cpp-search references BiasedWagnerParams/biased_wagner_tree in ts_driven.cpp and ts_rcpp.cpp but ts_wagner.h on cpp-search didn't have the declarations. Copy from main source tree to fix the build.
- 'auto' approx docs now clarify: exact when feasible, MC when not
- 'mc' approx mode now forces MC even for feasible characters
- Remove 5-state cap: >5 informative states route to MC via bitmask Fitch (mc_fitch_scores supports up to 32 states)
- Remove 'preserves the full multi-state character' phrasing
- Change 'large characters' to 'large or complex characters'
- Remove n_mc implementation note (user assumes default is reasonable)
- Log-quadratic interpolation stays in R: post-processing operates on ~50 values (one per step count), not the 100k MC samples. Moving to C++ would save microseconds; the bottleneck is mc_fitch_scores.
Remove the 5-state cap from the Shiny-inlined copy of PrepareDataProfile. Characters with >5 informative states now pass through to StepInformation() which routes them to MC.
Replace deprecated devtools::build_vignettes() with tools::buildVignettes(), devtools::test() with testthat::test_local(). These scripts are legacy (GHA workflow uses ms609/actions/memcheck directly), but should still work for local valgrind runs.
The exact MaddisonSlatkin solver has a hard-coded 256-element FixedDraws array. Some partition shapes that pass the split_count feasibility gate can still overflow this limit (e.g. 3-state characters at 62 tips). Wrap the MaddisonSlatkin() call in tryCatch so overflow or timeout errors gracefully fall back to MC approximation instead of aborting. Fixes: TreeLength(tree, data, concavity = 'profile') error on inapplicable.phyData[[1]] (62 tips, 3 informative states).
Reduce ASLR entropy (vm.mmap_rnd_bits=28) before ASAN-instrumented R processes launch. Without this, the runner's memory layout can collide with ASAN's shadow region, aborting with 'Shadow memory range interleaves with an existing memory mapping' (google/sanitizers#856). The vignettes job happened to avoid the collision; tests and examples did not. This fix makes all three jobs reliable.
Replace the fragile stock-R-with-LD_PRELOAD approach with the r-hub gcc-asan container, where R itself is built with -fsanitize=address. This eliminates the shadow-memory collisions (google/sanitizers#856) that caused persistent test/example failures on ubuntu-24.04 runners. Uses the new ms609/actions/asan composite action.
…ergence

- PrepareDataProfile auto vs mc: no longer expect equal nrow on info.amounts. Auto uses the exact solver for feasible chars while mc forces MC, so step ranges legitimately differ.
- >5 state characters: no longer expect a truncation warning; they now route to MC directly.
- Export mc_fitch_scores in NAMESPACE - Add mc_fitch_scores.Rd man page - Update PrepareDataProfile.Rd, StepInformation.Rd (roxygen regen) - Add papers.md to .Rbuildignore
--no-manual avoids R CMD Rd2pdf entirely, removing the need for texlive-latex-base, texlive-fonts-recommended, and the 500MB texlive-fonts-extra (inconsolata.sty). Keep only qpdf.
Force-pushed 46a0aaa to 2035118.
…kin-profiling # Conflicts: # .github/workflows/agent-check.yml
…09/TreeSearch into feature/madslatkin-profiling # Conflicts: # .github/workflows/agent-check.yml
- T-214 GHA passed; T-212 unblocked and re-dispatched
- Shiny bug backlog cleared (T-230–T-237 all resolved)
- Fixed blank-line formatting in to-do.md
- Updated S-PR notes with PR #211 status
Profiling-driven performance improvements to `MaddisonSlatkin()` and a new native C++ Monte Carlo scorer for infeasible multi-state characters, developed on a feature branch for review before merging to `cpp-search`.

Key changes

MaddisonSlatkin exact path (earlier commits)

MC approximation for infeasible characters (latest commits)

- Log-quadratic MC approximation — replaces binary reduction with full multi-state MC: `NUnrootedMult()`
- Native C++ `mc_fitch_scores()` (src/ts_mc_fitch.cpp) — generates random trees via `random_tree()` and scores with an inline bitmask Fitch downpass. No Morphy dependency, no R allocation per tree.
- Default `n_mc` bumped from 50k to 100k.

Testing

All 288 tests pass (88 multistate + 17 data_manipulation + 183 ts-*). MC score distributions cross-validated against `phangorn::parsimony()`.