
Feature/rstar fast #6

Open
ms609 wants to merge 3 commits into main from feature/rstar-fast


@ms609 ms609 commented Jun 2, 2026

No description provided.

ms609 and others added 3 commits June 2, 2026 16:48
Replace the dense O(n^3) triplet tensor (the cause of the hard 200-leaf memory
cap) and the O(n^4) strong-cluster assembly in src/rstar.cpp with a tensor-free
construction that keeps the validated pipeline of Jansson, Sung, Vu & Yiu
(2016): similarity -> single-linkage (Apresjan) candidates -> filter-to-strong
-> build.

* Stage 0: per-tree constant-time LCA (Euler tour + sparse-table RMQ), stored as
  a pair-major O(kn^2) depth table -- no n^3 structure.
* Stage 1: tally the strict-plurality winner per triple into the n^2 similarity
  s(a,b) = #{x : ab|x in R_maj}.  O(kn^3) time, O(n^2) memory.
* Stage 2: maximum-spanning-tree single-linkage dendrogram of s yields a laminar
  superset of the strong clusters; filter bottom-up while building the R* forest,
  testing only new cross-block pairs and counting the smaller of inside/outside
  leaves (deriving the other via s).
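
Stage 0 can be illustrated with a minimal C++ sketch of an Euler tour plus a
sparse-table range-minimum query.  This is not the code in src/rstar.cpp: the
class name `LcaDepth`, the parent-vector input format and the method
`depthOfLca` are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: depth of the lowest common ancestor of two nodes of a
// rooted tree, answered in O(1) after O(m log m) preprocessing.
// Input: non-empty parent vector with parent[root] < 0.
class LcaDepth {
 public:
  explicit LcaDepth(const std::vector<int>& parent) {
    const int n = static_cast<int>(parent.size());
    std::vector<std::vector<int>> children(n);
    int root = 0;
    for (int v = 0; v < n; ++v) {
      if (parent[v] < 0) root = v;
      else children[parent[v]].push_back(v);
    }
    // Iterative Euler tour: record the depth at every visit of a node.
    first_.assign(n, 0);
    euler_.push_back(0);  // the root, at depth 0
    std::vector<std::pair<int, std::size_t>> stack;  // (node, next child index)
    stack.push_back({root, 0});
    while (!stack.empty()) {
      const int v = stack.back().first;
      const std::size_t i = stack.back().second;
      if (i < children[v].size()) {
        ++stack.back().second;
        const int c = children[v][i];
        first_[c] = euler_.size();
        euler_.push_back(static_cast<int>(stack.size()));  // depth of c
        stack.push_back({c, 0});
      } else {
        stack.pop_back();
        if (!stack.empty())  // back at the parent
          euler_.push_back(static_cast<int>(stack.size()) - 1);
      }
    }
    // Sparse table: table_[j][i] = min of euler_[i .. i + 2^j - 1].
    const std::size_t m = euler_.size();
    log_.assign(m + 1, 0);
    for (std::size_t i = 2; i <= m; ++i) log_[i] = log_[i / 2] + 1;
    table_.push_back(euler_);
    for (std::size_t len = 2; len <= m; len *= 2) {
      const std::vector<int>& prev = table_.back();
      std::vector<int> cur(m - len + 1);
      for (std::size_t i = 0; i + len <= m; ++i)
        cur[i] = std::min(prev[i], prev[i + len / 2]);
      table_.push_back(std::move(cur));
    }
  }

  // Minimum depth on the tour between the first visits of a and b.
  int depthOfLca(int a, int b) const {
    std::size_t lo = first_[a], hi = first_[b];
    if (lo > hi) std::swap(lo, hi);
    const int j = log_[hi - lo + 1];
    return std::min(table_[j][lo], table_[j][hi + 1 - (std::size_t{1} << j)]);
  }

 private:
  std::vector<std::size_t> first_;  // first tour position of each node
  std::vector<int> euler_;          // depths along the tour
  std::vector<int> log_;            // floor(log2(i))
  std::vector<std::vector<int>> table_;
};
```

A tree displays the triplet ab|c exactly when depthOfLca(a, b) is greater than
depthOfLca(a, c), which is why one LCA depth per leaf pair and tree is enough
input for the Stage 1 tally.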

The R* tree is unchanged -- verified three ways: new-vs-old clade-for-clade on a
216-cell grid to n=200 (dev/oracle/rstar/check-vs-legacy.R, new regression
oracle), the brute-force strong-cluster oracle (0 mismatches), and the four
property pillars (identity, congruent==aho-build, strict & majority refinement).
Full testthat suite green (289 pass).
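
For orientation, a brute-force strong-cluster check might look like the sketch
below.  It is not the oracle script in dev/oracle/rstar: it assumes the usual
R* definition (a leaf set A is a strong cluster when ab|x is in R_maj for every
pair a, b in A and every x outside A), and the names `TripletPredicate`,
`isStrongCluster` and `bruteForceStrongClusters` are hypothetical.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Assumed interface: rmaj(a, b, x) is true when the triplet ab|x is in R_maj.
using TripletPredicate = std::function<bool(int, int, int)>;

// Leaf sets are bitmasks: bit i set means leaf i is a member.
bool isStrongCluster(std::uint32_t members, int n, const TripletPredicate& rmaj) {
  for (int a = 0; a < n; ++a) {
    if (!((members >> a) & 1u)) continue;
    for (int b = a + 1; b < n; ++b) {
      if (!((members >> b) & 1u)) continue;
      for (int x = 0; x < n; ++x) {
        if ((members >> x) & 1u) continue;  // x must lie outside the set
        if (!rmaj(a, b, x)) return false;
      }
    }
  }
  return true;
}

// Enumerate every proper subset with at least two leaves.  Exponential in n,
// so only usable as a small-n oracle (requires n <= 31).
std::vector<std::uint32_t> bruteForceStrongClusters(int n, const TripletPredicate& rmaj) {
  std::vector<std::uint32_t> out;
  const std::uint32_t full = (1u << n) - 1u;
  for (std::uint32_t m = 1; m < full; ++m) {
    if ((m & (m - 1u)) == 0) continue;  // skip singletons
    if (isStrongCluster(m, n, rmaj)) out.push_back(m);
  }
  return out;
}
```

Comparing this list with the clusters of the tree returned by the fast
construction gives a mismatch count of the kind reported above.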

Memory is now O(kn^2), so the hard cap is gone: n=300 ~0.2s, n=500 ~0.7s at k=10
(formerly an immediate error).  At n<=200 timing is at parity with the former
code, because the tally stays O(kn^3).  The paper's sub-cubic bounds are
galactic (they rely on BMM / dynamic-connectivity machinery that is slower than
the naive tally for every feasible n), so they are deliberately not used.
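
The O(kn^3)-time, O(n^2)-memory tally can be sketched as follows.  This is an
illustration under stated assumptions rather than the routine in src/rstar.cpp:
the `DepthTable` layout (one n x n matrix of LCA depths per tree) and the
function name `tallySimilarity` are hypothetical.

```cpp
#include <vector>

// Assumed layout: depth[a][b] = depth of LCA(a, b) in one input tree.
using DepthTable = std::vector<std::vector<int>>;

// s[a][b] = number of leaves x for which ab|x is the strict-plurality triplet.
std::vector<std::vector<int>> tallySimilarity(const std::vector<DepthTable>& trees, int n) {
  std::vector<std::vector<int>> s(n, std::vector<int>(n, 0));
  for (int a = 0; a < n; ++a)
    for (int b = a + 1; b < n; ++b)
      for (int c = b + 1; c < n; ++c) {
        int ab = 0, ac = 0, bc = 0;  // votes for ab|c, ac|b, bc|a
        for (const DepthTable& d : trees) {
          if (d[a][b] > d[a][c]) ++ab;
          else if (d[a][c] > d[a][b]) ++ac;
          else if (d[b][c] > d[a][b]) ++bc;
          // otherwise the tree leaves {a, b, c} unresolved and casts no vote
        }
        // Only a strict winner counts; ties add nothing.
        if (ab > ac && ab > bc) { ++s[a][b]; ++s[b][a]; }
        else if (ac > ab && ac > bc) { ++s[a][c]; ++s[c][a]; }
        else if (bc > ab && bc > ac) { ++s[b][c]; ++s[c][b]; }
      }
  return s;
}
```

Only the n x n result is allocated here; the three vote counters are reused for
every triple, so no n^3 structure is ever held in memory.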

R/rstar.R drops the n>200 guard and updates @details; tests swap the cap-error
case for large-n success + refinement checks; the R* oracle scripts honour a
CONSTREE_LIB isolated-library override; bench cap raised.  RcppExports unchanged
(rStarConsensus signature stable).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Mark the unreachable defensive paths in src/rstar.cpp with `// # nocov`
(the k*n^2 memory guard, the n<3 direct-call guard, the small-n LCA
self-check stop, and closePair's fall-through return) so codecov does not
flag them as uncovered new lines.  Add Apresjan and LCA to inst/WORDLIST
(spell_check_package).  No behaviour change; all gates re-verified green on
the rebuilt package (four pillars, new-vs-old 216/216, brute-force
strong-cluster oracle, full testthat suite 289 pass).
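
As a hedged illustration of the marker, the sketch below tags the throwing
lines of a defensive size guard.  The function `depthTableCells` and its
overflow checks are hypothetical, not the guard in src/rstar.cpp; only the
trailing `// # nocov` comment mirrors what the commit describes.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>

// Hypothetical guard: number of cells in a k x n x n depth table, refusing
// sizes that would overflow std::size_t.
std::size_t depthTableCells(std::size_t k, std::size_t n) {
  const std::size_t limit = std::numeric_limits<std::size_t>::max();
  if (n != 0 && n > limit / n) {
    throw std::length_error("n * n overflows std::size_t");  // # nocov
  }
  const std::size_t pairs = n * n;
  if (pairs != 0 && k > limit / pairs) {
    throw std::length_error("k * n * n overflows std::size_t");  // # nocov
  }
  return k * pairs;
}
```

The marker excludes only the line it sits on, so the `if` conditions are still
counted as covered whenever the function runs.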

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
codecov Bot commented Jun 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.07%. Comparing base (9063623) to head (c022f7a).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main       #6      +/-   ##
==========================================
+ Coverage   99.06%   99.07%   +0.01%     
==========================================
  Files          18       18              
  Lines        2879     2933      +54     
==========================================
+ Hits         2852     2906      +54     
  Misses         27       27              

