
feat!: Rust COITrees backend for gvl.Table; promote to public API#237

Merged
d-laub merged 15 commits into main from feat/rust-table-overlap
Jun 20, 2026
@d-laub d-laub commented Jun 20, 2026

Collaborator

Summary

Replaces gvl.Table's polars-bio overlap backend with a self-contained Rust COITrees module (src/tables.rs, RustTable PyO3 class), and promotes Table from genvarloader.experimental to the public API.

This:

  • Fixes the max_mem blow-up during gvl.write()/update() for Table/annot tracks — the Rust streaming writer bounds the working set to one region's overlaps + one contig's lazily-built trees, and raises if a single region exceeds the budget.
  • Removes the non-deterministic polars-bio segfault source (#395) for the Table path. (polars-bio remains a transitive dep via genoray's var_ranges, so it is not fully removed from the tree.)
  • Promotes Table to the public API (from genvarloader import Table); deletes the experimental subpackage and the [table] extra.
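
The memory bound described in the first bullet can be pictured roughly as follows. This is a minimal Python sketch, not the Rust writer in src/tables.rs: the function name `stream_overlaps`, the half-open overlap test, and the choice of `MemoryError` are assumptions made for illustration. Only the 12 bytes/row figure and the "raise if a single region exceeds the budget" behavior come from this PR.

```python
from typing import Iterator

BYTES_PER_INTERVAL = 12  # row size stated in this PR; everything else is illustrative

Interval = tuple[int, int, float]  # (start, end, value), assumed layout


def stream_overlaps(
    regions: list[tuple[int, int]],
    intervals: list[Interval],
    max_mem: int,
) -> Iterator[list[Interval]]:
    """Yield one region's overlaps at a time so only that batch is held at once."""
    for r_start, r_end in regions:
        # Count first (brute force here; the real backend queries COITrees).
        n_hits = sum(1 for s, e, _ in intervals if s < r_end and e > r_start)
        needed = n_hits * BYTES_PER_INTERVAL
        if needed > max_mem:
            # Hypothetical error type; the PR only says the writer raises.
            raise MemoryError(
                f"region [{r_start}, {r_end}) needs {needed} bytes, budget is {max_mem}"
            )
        # Materialize only after the count fits the budget.
        yield [iv for iv in intervals if iv[0] < r_end and iv[1] > r_start]
```

Counting before materializing is what lets the budget check fail early, before any oversized buffer is allocated.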

Architecture

  • src/tables.rs — immutable interval store grouped by (chrom_code, sample_code), a COITrees overlap engine (count + materialize), and a streaming writer. Trees are built lazily one contig at a time and dropped on contig change.
  • python/genvarloader/_table.py — thin polars constructor/validator that factor-encodes the frame (sorted by (chrom, sample_id, start)) and delegates all overlap to RustTable.
  • On-disk format is byte-identical to the existing reader (INTERVAL_DTYPE, 12 bytes/row LE; i64 LE offsets), so datasets read back unchanged via np.memmap.
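
To make the on-disk layout concrete, here is a small sketch that packs and unpacks rows with the standard library instead of np.memmap. The PR states only 12 bytes/row little-endian and i64 little-endian offsets; the field split (int32 start, int32 end, float32 value), the n+1 row-count offsets, and the helper names `encode` and `decode_group` are assumptions for illustration.

```python
import struct

# ASSUMPTION: a 12-byte row is int32 start, int32 end, float32 value.
ROW = struct.Struct("<iif")
# Offsets are little-endian int64, as stated in the PR.
OFFSET = struct.Struct("<q")


def encode(groups):
    """Pack a list of interval groups into (row_bytes, offset_bytes).

    ASSUMPTION: offsets count rows and have len(groups) + 1 entries.
    """
    rows = bytearray()
    offsets = [0]
    for group in groups:
        for start, end, value in group:
            rows += ROW.pack(start, end, value)
        offsets.append(offsets[-1] + len(group))
    return bytes(rows), b"".join(OFFSET.pack(o) for o in offsets)


def decode_group(rows, offsets, i):
    """Read back the i-th group using its pair of offsets."""
    (lo,) = OFFSET.unpack_from(offsets, i * OFFSET.size)
    (hi,) = OFFSET.unpack_from(offsets, (i + 1) * OFFSET.size)
    return [ROW.unpack_from(rows, k * ROW.size) for k in range(lo, hi)]
```

Because the layout is fixed-width, a reader can jump straight to any group from its offsets without scanning earlier rows, which is what makes memory-mapped access practical.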

Breaking change

genvarloader.experimental.Table is removed; use genvarloader.Table. The [table] extra is gone.

Testing

  • Rust: cargo tests for store grouping, COITrees count (vs brute force), ordered interval materialization, and the streaming writer (byte-identical to a count→offsets→intervals oracle + max_mem error).
  • Python: brute-force numpy oracle unit tests + hypothesis property tests (100 examples each) for count_intervals, _intervals_from_offsets, and annot_overlap; a max_mem regression test; the end-to-end annot-DataFrame integration test is now un-skipped and runs in CI.
  • Full tree green: 800 passed, 20 skipped, 4 xfailed; cargo test --release 12 passed; ruff + pyrefly clean.
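
The count-versus-brute-force parity idea can be sketched without any third-party packages. The version below swaps in a sorted-endpoints counter (via `bisect`) for COITrees and a seeded `random` loop for hypothesis, so it illustrates the testing pattern only and is not the PR's actual test code. All function names are hypothetical, and half-open intervals with start < end are assumed.

```python
import random
from bisect import bisect_left, bisect_right


def count_brute(region, intervals):
    """Oracle: test every interval against the half-open region."""
    r_start, r_end = region
    return sum(1 for s, e in intervals if s < r_end and e > r_start)


def count_fast(region, sorted_starts, sorted_ends):
    """Stand-in for the indexed backend.

    overlaps = (#starts < r_end) - (#ends <= r_start), valid when start < end.
    """
    r_start, r_end = region
    return bisect_left(sorted_starts, r_end) - bisect_right(sorted_ends, r_start)


def check_parity(n_examples=100, seed=0):
    """Compare both counters on randomly generated inputs."""
    rng = random.Random(seed)
    for _ in range(n_examples):
        intervals = []
        for _ in range(rng.randint(0, 20)):
            start = rng.randint(0, 50)
            intervals.append((start, start + rng.randint(1, 10)))
        starts = sorted(s for s, _ in intervals)
        ends = sorted(e for _, e in intervals)
        r_start = rng.randint(0, 50)
        region = (r_start, r_start + rng.randint(1, 10))
        assert count_fast(region, starts, ends) == count_brute(region, intervals)
    return n_examples
```

The oracle is deliberately naive: it shares no code with the fast path, so agreement between the two is meaningful evidence rather than a tautology.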

Notes / follow-up

  • annot_overlap now returns an all-empty RaggedIntervals for an empty annotation BED (previously crashed).
  • Equal-start tie ordering follows the Table's stored (post-sort) order — see the parity test docstring.
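
The tie-ordering note can be illustrated with a stable sort on (chrom, sample_id, start), which is the sort key named in this PR. The rows and labels below are made-up sample data, and Python's built-in `sorted` stands in for the polars sort that Table performs.

```python
# Made-up rows: (chrom, sample_id, start, end, label).
rows = [
    ("chr1", "s1", 10, 20, "a"),
    ("chr1", "s0", 5, 9, "b"),
    ("chr1", "s1", 10, 15, "c"),  # same chrom, sample, and start as "a"
    ("chr1", "s1", 3, 4, "d"),
]

# sorted() is stable, so rows with equal keys keep their input order.
stored = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
labels = [r[4] for r in stored]
print(labels)
```

Since "a" and "c" share the full sort key, "a" stays ahead of "c" because it came first in the input. An oracle that wants byte-identical output has to read ties in this stored order rather than re-sorting by end or value.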

🤖 Generated with Claude Code

d-laub and others added 15 commits June 19, 2026 21:13
Full port of gvl.Table + annot_overlap off polars-bio onto a COITrees-backed
Rust module: makes write/update respect max_mem, removes the
non-deterministic polars-bio segfault (#395), drops the [table] extra, and
promotes Table out of experimental into the public API (CI-covered).

Phase 4 continuation of the Rust-migration roadmap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add /// Panics section to count() describing that sample codes >= n_samples
and chrom_code >= n_contigs cause panics; callers must pass factor-encoded
codes from the Table's own lists. Add debug_assert! before tree indexing to
make the trust-the-Python-boundary contract explicit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds hypothesis property test (100 examples) asserting byte-identical
parity between the Rust COITrees Table backend and an independent
brute-force numpy oracle across count_intervals and
_intervals_from_offsets.

Oracle uses t._df (the frame as stored by Table after its stable sort
on chrom/sample_id/start) to match the Rust backend's stored-index ordering for
equal-start interval ties. Also removes unused pytest imports left by
earlier tasks (ruff check fix).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Extends tests/unit/test_table_parity.py with a Hypothesis property test
(test_annot_overlap_matches_oracle) that verifies annot_overlap returns
correct starts/ends/values per region vs a brute-force numpy oracle,
matching tie-breaking via the same internal Table._df sorted ordering.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Remove @pytest.mark.skipif from test_annot_tracks (polars-bio gone;
  annot DataFrame path now uses Rust COITrees backend); also drop the
  no-longer-needed `os` and `pytest` imports.
- Add early-return guard in annot_overlap for empty annot DataFrames
  (0 rows), returning a well-formed all-empty RaggedIntervals of shape
  (n_regions, None) instead of crashing when the internal __annot__
  sample is missing from the Table.
- Add regression test test_annot_overlap_empty_annot in
  tests/unit/test_write_annot.py.
- Remove dead n_samples struct field from RustTable in src/tables.rs
  (was set but never read; eliminates the dead_code warning).
- Move `from ._table import Table` to alphabetically correct position
  in python/genvarloader/__init__.py (after ._ragged, before ._torch).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@d-laub d-laub merged commit 62600c8 into main Jun 20, 2026
8 checks passed
@d-laub d-laub deleted the feat/rust-table-overlap branch June 20, 2026 06:37