Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,13 @@ with its fastest available algorithm, then profiling and optimising (harness in
incompatible splits may now be resolved in a different but equally valid order.
- Added a reentrant, allocation-safe C++ tree primitive (`src/fact_tree.*`)
shared by the fast split-selection methods.
- `RStar()` no longer caps at 200 leaves. The dense `O(n^3)` triplet tensor is
replaced by per-tree constant-time LCA queries (`O(kn^2)` memory), and the
strong-cluster assembly is tightened from `O(n^4)` to about `O(n^3)` via the
single-linkage (Apresjan) construction of Jansson, Sung, Vu & Yiu (2016). The
R\* tree is unchanged — verified clade-for-clade against the previous
implementation up to 200 leaves — and large inputs are now bounded by running
time rather than a memory wall.
- `Frequency()` now uses a C++ port of the near-linear _O_(_kn_ log _n_)
frequency-difference algorithm of Jansson, Sung, Tabatabaee & Yang (2024,
STACS; their FDCT reference implementation, used with permission), replacing
Expand Down
30 changes: 15 additions & 15 deletions R/rstar.R
Original file line number Diff line number Diff line change
Expand Up @@ -22,10 +22,10 @@
#' R* is always a *refinement* of the majority-rule consensus
#' (every majority clade also appears in `RStar()`) and is a statistically
#' consistent estimator of a species tree from gene trees. Unlike [`Local()`]
#' it is **polynomial**, so it is not restricted to small leaf counts; `RStar()`
#' caps `n` at 200 purely as a memory safeguard on its dense triplet tensor (a
#' limit quite different in nature from `Local()`'s exact-exponential 20-leaf
#' bound).
#' it is **polynomial** and imposes no hard leaf-count limit: it stores no dense
#' \eqn{n^3} structure (memory is \eqn{O(kn^2)} for `k` input trees), so the
#' practical bound is running time (the tally is \eqn{O(kn^3)}), not a memory
#' wall -- quite unlike `Local()`'s exact-exponential 20-leaf bound.
#'
#' Like [`Adams()`], R* is a rooted method: triplet states depend on the
#' rooting, so input trees are treated as rooted on their current root. Root the
Expand All @@ -43,11 +43,17 @@
#' yields a single, well-defined tree (Lemma 1.1 of
#' \insertCite{Jansson2016a}{ConsTree}); there is no incompatibility or
#' "build-failure" case to resolve.
#' \item *Algorithm.* This implementation is correctness-first: an
#' \eqn{O(kn^3)} triplet tally followed by an \eqn{O(n^4)} strong-cluster
#' assembly. The sub-cubic and near-quadratic algorithms of
#' \insertCite{Jansson2013a,Jansson2016a}{ConsTree} are a deferred speed
#' optimisation.
#' \item *Algorithm.* Following \insertCite{Jansson2016a}{ConsTree}, the tree
#' is built from a leaf-pair similarity -- for each pair `a`, `b`, the number
#' of leaves `x` for which `ab|x` lies in \eqn{R_{maj}} -- whose single-linkage
#' (Apresjan) clusters form a laminar superset of the strong clusters; those
#' are then filtered exactly and assembled. The tally runs in \eqn{O(kn^3)}
#' time using per-tree constant-time LCA queries and stores no \eqn{n^3}
#' tensor (memory \eqn{O(kn^2)}); the assembly is about \eqn{O(n^3)}. The
#' asymptotically sub-cubic bounds of
#' \insertCite{Jansson2013a,Jansson2016a}{ConsTree} rely on
#' fast-matrix-multiplication and dynamic-connectivity machinery that does
#' more work than this for every practical leaf count, so they are not used.
#' }
#'
#' @inheritParams Strict
Expand Down Expand Up @@ -96,12 +102,6 @@ RStar <- function(trees) {
return(trees[[1L]])
}

if (n > 200L) {
stop(
"RStar() caps the dense triplet tensor at 200 leaves (n = ", n, ")."
)
}

# Relabel every tree 1..n in a shared canonical order and put in Preorder so
# the C++ LCA pass (which assumes parent precedes child) is valid.
edgeList <- lapply(trees, function(tr) {
Expand Down
2 changes: 1 addition & 1 deletion dev/oracle/rstar/check-rstar.R
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
#
# Run: Rscript.exe dev/oracle/rstar/check-rstar.R

.libPaths(c("C:/Users/pjjg18/GitHub/Consensus/.agent-cons", .libPaths()))
.libPaths(c(Sys.getenv("CONSTREE_LIB", "C:/Users/pjjg18/GitHub/Consensus/.agent-cons"), .libPaths()))
suppressMessages(library(ConsTree))
suppressMessages(library(TreeTools))
suppressMessages(library(ape))
Expand Down
2 changes: 1 addition & 1 deletion dev/oracle/rstar/check-strong-clusters.R
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@
# Exits non-zero on any mismatch. Run:
# Rscript.exe dev/oracle/rstar/check-strong-clusters.R

.libPaths(c("C:/Users/pjjg18/GitHub/Consensus/.agent-cons", .libPaths()))
.libPaths(c(Sys.getenv("CONSTREE_LIB", "C:/Users/pjjg18/GitHub/Consensus/.agent-cons"), .libPaths()))
suppressMessages(library(ConsTree))
suppressMessages(library(TreeTools))
suppressMessages(library(ape))
Expand Down
48 changes: 48 additions & 0 deletions dev/oracle/rstar/check-vs-legacy.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
# check-vs-legacy.R (dev-only; run AFTER the rewrite + reinstall)
# Asserts the NEW fast RStar reproduces the OLD build's clade set EXACTLY over the
# deterministic legacy-grid (n = 5..200). R* clade sets are unique (Jansson et
# al. 2016, Lemma 1.1), so new == old is an exact regression gate -- and it covers
# the medium-n, many-cluster regime the brute-force strong-cluster oracle
# (n <= 12) cannot reach. Exits non-zero on any mismatch.
#
# Run from the package root: Rscript.exe dev/oracle/rstar/check-vs-legacy.R

.libPaths(c(Sys.getenv("CONSTREE_LIB", "C:/Users/pjjg18/GitHub/Consensus/.agent-cons"), .libPaths()))
suppressMessages({ library(ConsTree); library(TreeTools); library(ape) })
source("dev/oracle/oracle.R") # CladeSet()
source("dev/oracle/rstar/legacy-grid.R") # legacyTrials()

# --- gate: confirm this is the NEW (cap-removed) build ------------------------
# Fingerprint: the shared .agent-cons is reinstalled by sibling chip sessions, so
# verify by BEHAVIOUR -- only this chip's build runs R* above 200 leaves.
ok <- tryCatch({
t <- ape::rtree(210L, rooted = TRUE); t[["edge.length"]] <- NULL
length(CladeSet(RStar(list(t, t)))) > 0L
}, error = function(e) FALSE)
if (!isTRUE(ok)) {
stop("check-vs-legacy.R: installed build still caps at <= 200 leaves (version ",
as.character(packageVersion("ConsTree")),
"). The shared library was likely re-clobbered; reinstall the new build ",
"and rerun. Aborting.")
}
cat("New build confirmed (version ", as.character(packageVersion("ConsTree")),
"; R* runs at n > 200).\n", sep = "")

snap <- readRDS("dev/oracle/rstar/legacy-clades.rds")
trials <- legacyTrials()
if (!setequal(names(snap), names(trials)))
stop("legacy-grid drift: snapshot keys != current trial keys. Regenerate the snapshot.")

fail <- 0L
for (key in names(trials)) {
newcl <- sort(CladeSet(RStar(trials[[key]])))
if (!setequal(newcl, snap[[key]])) {
fail <- fail + 1L
cat("MISMATCH ", key, "\n", sep = "")
cat(" extra in new: ", paste(setdiff(newcl, snap[[key]]), collapse = " ; "), "\n")
cat(" missing in new: ", paste(setdiff(snap[[key]], newcl), collapse = " ; "), "\n")
}
}
cat(sprintf("\nnew-vs-old: %d / %d trials match.\n", length(trials) - fail, length(trials)))
if (fail > 0L) quit(status = 1L)
cat("ALL new == old (exact clade-set match to n = 200).\n")
Binary file added dev/oracle/rstar/legacy-clades.rds
Binary file not shown.
52 changes: 52 additions & 0 deletions dev/oracle/rstar/legacy-grid.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
# legacy-grid.R (dev-only)
# Deterministic tree-set grid shared by snapshot-legacy.R (run against the OLD
# capped build) and check-vs-legacy.R (run against the NEW build). BOTH must see
# byte-identical trees, so every tree is produced from a fixed seed here -- do not
# call any RNG outside legacyTrials(), and do not edit this file between snapshot
# and check.
#
# The grid deliberately spans the MEDIUM-n, many-cluster regime that the
# brute-force strong-cluster oracle (n <= 12) cannot reach -- this is where the
# new fast assembly most needs an exact reference. R* clade sets are unique
# (Jansson et al. 2016, Lemma 1.1), so new == old is an exact check.

suppressMessages({ library(ape); library(TreeTools) })

.rstarAlign <- function(trees) {
labs <- trees[[1L]][["tip.label"]]
lapply(trees, function(tr) TreeTools::RenumberTips(tr, labs))
}

# Named list of tree-sets (each a list of rooted phylo on a shared label set).
legacyTrials <- function() {
ns <- c(5L, 8L, 12L, 20L, 35L, 60L, 100L, 150L, 200L)
ks <- c(2L, 3L, 5L, 8L)
regimes <- c("indep", "perturbed", "partly") # incongruent / congruent / non-binary
reps <- 2L
trials <- list()
seedctr <- 7000L
for (n in ns) for (k in ks) for (regime in regimes) for (rep in seq_len(reps)) {
seedctr <- seedctr + 1L
set.seed(seedctr)
if (regime == "perturbed") {
base <- ape::rtree(n, rooted = TRUE); base[["edge.length"]] <- NULL
trees <- lapply(seq_len(k), function(i) {
tr <- base
for (s in seq_len(2L)) { # a couple of tip-label swaps per tree
ij <- sample.int(n, 2L)
tr[["tip.label"]][ij] <- tr[["tip.label"]][rev(ij)]
}
tr
})
} else {
trees <- lapply(seq_len(k), function(i) {
tr <- ape::rtree(n, rooted = TRUE)
if (regime == "partly") tr <- ape::di2multi(tr, tol = 0.25) # collapse short edges -> fans
tr[["edge.length"]] <- NULL
tr
})
}
trials[[sprintf("n%03d_k%d_%s_r%d", n, k, regime, rep)]] <- .rstarAlign(trees)
}
trials
}
35 changes: 35 additions & 0 deletions dev/oracle/rstar/snapshot-legacy.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# snapshot-legacy.R (dev-only; run ONCE, BEFORE editing src/rstar.cpp)
# Captures CladeSet(RStar(.)) of the CURRENT (definition-exact, O(kn^3)+O(n^4),
# 200-leaf-capped) build over the deterministic legacy-grid, serialising it to
# dev/oracle/rstar/legacy-clades.rds. check-vs-legacy.R later asserts the new
# fast build reproduces every entry exactly. This is the strongest regression
# guard for R* (which has NO reference binary): the old code IS the reference.
#
# Run from the package root: Rscript.exe dev/oracle/rstar/snapshot-legacy.R

.libPaths(c(Sys.getenv("CONSTREE_LIB", "C:/Users/pjjg18/GitHub/Consensus/.agent-cons"), .libPaths()))
suppressMessages({ library(ConsTree); library(TreeTools); library(ape) })
source("dev/oracle/oracle.R") # CladeSet()
source("dev/oracle/rstar/legacy-grid.R") # legacyTrials()

# --- gate: refuse to run unless this is the OLD capped build ------------------
isOld <- tryCatch({ ConsTree:::rStarConsensus(list(), 201L); FALSE },
error = function(e) grepl("200 leaves", conditionMessage(e)))
if (!isTRUE(isOld)) {
stop("snapshot-legacy.R must run against the OLD (200-leaf-capped) build; ",
"the installed build does not error at n = 201. Aborting to avoid a ",
"self-referential snapshot.")
}
cat("Legacy build confirmed (version ",
as.character(packageVersion("ConsTree")), ", cap present).\n", sep = "")

trials <- legacyTrials()
cat("Snapshotting", length(trials), "tree-sets ...\n")
snap <- vector("list", length(trials))
names(snap) <- names(trials)
for (key in names(trials)) snap[[key]] <- sort(CladeSet(RStar(trials[[key]])))

saveRDS(snap, "dev/oracle/rstar/legacy-clades.rds")
nClades <- vapply(snap, length, integer(1))
cat(sprintf("Wrote dev/oracle/rstar/legacy-clades.rds (%d trials; clades/trial: min %d, median %g, max %d)\n",
length(snap), min(nClades), stats::median(nClades), max(nClades)))
13 changes: 11 additions & 2 deletions dev/profiling/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -35,9 +35,18 @@ Rscript.exe dev/profiling/compare.R baseline-2026-06-02.csv Greedy
`ape::rtree`. Two regimes: `independent` (incongruent, large split pool —
worst case for the `O(s²)` R pipeline) and `perturbed` (mostly congruent).
- Each call is guarded by a per-call elapsed `timeout`; cells that time out or
exceed a method's hard cap (e.g. `RStar` ≤ 200 tips) record `NA`, which is
itself informative.
exceed a method's bench cap (`BENCH_CAPS`) record `NA`, which is itself
informative. (`RStar`'s former 200-leaf *memory* cap is gone — see below — so
its `BENCH_CAPS` entry now only bounds grid runtime.)
- `compare.R` flags any cell where the output **split count** changed. For the
unique-output methods that is a bug; for **Greedy** it may be the documented
tie-break on equal-frequency incompatible splits — confirm against the FACT
oracle and sign off, do not silently re-baseline.
- **RStar** (round 1): the dense `O(n^3)` triplet tensor and its hard 200-leaf
cap were removed — memory is now `O(kn^2)` via per-tree constant-time LCA — and
the strong-cluster assembly was tightened from `O(n^4)` to about `O(n^3)`. The
R\* tree is unchanged (clade-exact vs the previous build on every grid cell to
n = 200; `dev/oracle/rstar/check-vs-legacy.R`). At n ≤ 200 timing is at parity
with the former code (the tally is still `O(kn^3)`); the practical win is the
lifted cap (n = 300 ≈ 0.2 s, n = 500 ≈ 0.7 s at k = 10 — formerly an immediate
error) and the far lower memory.
6 changes: 4 additions & 2 deletions dev/profiling/bench-common.R
Original file line number Diff line number Diff line change
Expand Up @@ -14,8 +14,10 @@ suppressMessages({
BENCH_METHODS <- c("Greedy", "Loose", "MajorityPlus",
"Frequency", "Adams", "RStar")

# Hard size caps documented in the package (skip cells that would error).
BENCH_CAPS <- c(RStar = 200L)
# Per-method size caps for the grid. RStar's former 200-leaf MEMORY cap is gone
# (it now stores O(k n^2), not an n^3 tensor); the practical bound is runtime
# (~O(k n^3)), so keep grid cells modest rather than erroring.
BENCH_CAPS <- c(RStar = 1000L)

# Generate `nTree` trees on `nTip` leaves under one of two regimes:
# "independent" -- k independent random topologies (incongruent; large split
Expand Down
2 changes: 2 additions & 0 deletions inst/WORDLIST
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
Aho
Apresjan
BHV
bipartitions
Billera
Expand All @@ -16,6 +17,7 @@ incongruent
Jansson
Karcher
Lapointe
LCA
Margoliash
MinILC
MinRLC
Expand Down
24 changes: 15 additions & 9 deletions man/RStar.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading
Loading