Skip to content

PMSR model implementation#40

Draft
StefanFlaumberg wants to merge 9 commits intoiqtree:masterfrom
StefanFlaumberg:sitemodel
Draft

PMSR model implementation#40
StefanFlaumberg wants to merge 9 commits intoiqtree:masterfrom
StefanFlaumberg:sitemodel

Conversation

@StefanFlaumberg
Copy link
Contributor

This pull request adapts the iqtree/iqtree2#391 PR for IQ-Tree 3.

Commits of the original PR were directly transferred, with only minor changes introduced for the sake of compatibility and small bug fixing.

@StefanFlaumberg StefanFlaumberg marked this pull request as draft June 23, 2025 23:05
@bqminh
Copy link
Member

bqminh commented Mar 5, 2026

Claude Code:

PR #40 — PMSR (Posterior Mean Site Rate) Model

Status: Draft | 2,386 additions / 1,475 deletions across 42 files

What It Does

Extends the existing PMSF (Posterior Mean Site Frequency) workflow to also support per-site rates. Three new modes:

  1. -rs — load pre-computed per-site rates from a .siterate file
  2. -rt <guide_tree> — infer per-site rates from a guide tree, then fix them for the main analysis
  3. -frt — combine PMSF + PMSR simultaneously (per-site frequencies and per-site rates)

The core change is that Alignment gains a ptn_rate_scaler vector (one rate multiplier per pattern) alongside the existing ptn_state_freq, and ModelSet is extended to use both.

Key Changes

  • alignment/alignment.h/.cpp: site_state_freq renamed to ptn_state_freq (pattern-indexed, not site-indexed — correct design), ptn_rate_scaler added, readSiteStateFreq() replaced by generic readSiteParamFile(file, "freq"|"rate")
  • model/modelset.cpp: Checkpoint methods (setCheckpoint, saveCheckpoint, restoreCheckpoint) were missing — now added. This is a genuine bug fix that benefits all SSF users.
  • model/rateheterogeneity.cpp: Constructor now requires PhyloTree* to detect SSR and set name = "+SSR"
  • tree/phylotree.cpp: computePatternRate() rewritten to correctly handle +I offset and posterior mean/max modes
  • tree/iqtree.cpp: BNNI bug fix — tree topology now established before model copy in bootstrap NNI loop
  • main/phyloanalysis.cpp: computeSiteFrequencyModel() → computeSiteSpecificModel() extended for both freq and rate
  • main/treetesting.cpp: printSiteRate() output now includes CatPP column (posterior probability of best category)

Issues That Should Block Merge

1. Memory aliasing / double-free in ptn_state_freq
ungroupSitePattern() and regroupSitePattern() copy the vector<double*> by value without deep-copying the pointed-to arrays. Multiple entries can point to the same buffer. When the old alignment is destroyed, those pointers are freed; the new alignment then double-frees them on destruction.

2. ptn_rate_scaler not checkpointed
ptn_rate_scaler lives on Alignment, which has no checkpoint serialisation for it. If a long run is interrupted and resumed from checkpoint, the site-specific rates are silently lost and must be re-read from the original file (which may not be available). Should be documented as a known limitation at minimum.

3. Tree mutation inside ModelSet constructor
normalizePtnRateScaler() is called inside the constructor and mutates the tree's branch lengths as a side effect. This is architecturally wrong — constructors should not mutate external objects. It also means restoreCheckpoint() does not redo the normalisation, leaving branch lengths in an inconsistent state after resume.

4. Partition model + SSR — no guard
--partition -rs is not explicitly blocked. The code reads site rates for the concatenated super-alignment, not per-partition, producing silently wrong results.

5. RateHeterogeneity(nullptr) null dereference
The new constructor evaluates phylo_tree->aln->isSSR() unconditionally. Any call site that passes nullptr will crash. A null check is needed.

Issues Worth Addressing

  • Stringly-typed dispatch: readSiteParamFile(file, "freq"|"rate") should use an enum class SiteParamType instead of a string to catch typos at compile time.
  • isSSF() / isSSR() lack const: prevents use on const Alignment& references.
  • Non-deterministic tie-breaking in computePatternRate: random_double() < 0.5 for equal-likelihood categories produces non-reproducible output — worth documenting.
  • Pairwise ML distances don't fully account for SSR (acknowledged limitation from iqtree2 PR #391 — should be documented).
  • Comments in modelfactory.cpp still say "site-specific state frequency model" in places that now cover both SSF and SSR.

Positives

  • The ModelSet checkpoint fix is a genuine bug fix benefiting all existing SSF users, independent of PMSR.
  • Renaming from site-indexed to pattern-indexed storage is the correct design direction.
  • The BNNI ordering bug fix in iqtree.cpp is correct.
  • The CatPP column addition to .siterates output is a useful improvement.
  • Scientific value is clear — the feature has been requested and positively received.

Verdict

Not ready to merge in current draft state. The core logic is correct, but the five structural issues (memory aliasing, missing checkpoint, constructor side-effects, partition incompatibility, null dereference) need to be resolved first. The 42-file scope and architectural changes to Alignment, ModelSet, and RateHeterogeneity warrant careful review of memory management and checkpoint correctness paths.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants