Route VCF/PGEN writes through genoray _dense2sparse_with_length #197

Merged
d-laub merged 11 commits into main from worktree-feat-dense2sparse-with-length on May 31, 2026

Collaborator

@d-laub d-laub commented May 31, 2026

Summary

  • Route VCF/PGEN dataset writes through genoray 2.7.0's _dense2sparse_with_length, so they produce per-haplotype-minimal sparse genotypes — identical to what the SVAR write path already produced via read_ranges_with_length. Previously VCF/PGEN applied plain dense2sparse to the over-extended window, so every haplotype kept all extension-tail variants (over-extending haplotypes with few/no deletions).
  • Fix extend_to_length=False, which was a silent no-op for VCF/PGEN (_write_from_vcf/_write_from_pgen never forwarded it to their chunk generators, which always called _chunk_ranges_with_length). It now genuinely disables extension via the non-length chunking APIs (VCF.chunk / PGEN.chunk). This makes the code match the existing gvl.write docstring and SKILL.md.
  • Compute each region's chromEnd (max_ends) from the furthest retained variant via v_ends = POS - clip(ILEN, max=0), matching _write_from_svar exactly, so all three input paths (VCF, PGEN, SVAR) agree.
  • Bump genoray dependency to 2.7.0 (pixi.toml ==2.7.0, pyproject.toml >=2.7.0,<3).
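The chromEnd rule in the third bullet can be illustrated with a small NumPy sketch. The variant table below is made up for illustration, and the coordinate convention (0- or 1-based, inclusive or exclusive end) is not stated in this PR, so the numbers only show the arithmetic of `v_ends = POS - clip(ILEN, max=0)`.

```python
import numpy as np

# Hypothetical variant table (illustrative values, not from the PR).
# ILEN is the indel length: negative for deletions, positive for insertions, 0 for SNPs.
pos = np.array([100, 105, 110, 120])
ilen = np.array([0, -3, 2, -10])

# clip(ILEN, max=0) zeroes out insertions and SNPs, so only deletions
# push a variant's end past its position.
v_ends = pos - np.clip(ilen, None, 0)

# The region end is the furthest end among the retained variants.
max_end = int(v_ends.max())

print(v_ends.tolist(), max_end)
```

Only the two deletions move their end past `POS`; the insertion and the SNP leave it unchanged.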

The implementation refactors the two near-duplicate generators so that each assembles a region's full dense window across genoray memory chunks and delegates conversion to a shared _window_to_sparse helper, which dispatches between _dense2sparse_with_length and dense2sparse. Empty regions and chunk/index alignment are guarded in both no-extend paths.
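To make the difference between the two conversion modes concrete, here is a toy model in plain NumPy. It is not genoray's code and does not call `_dense2sparse_with_length` or `dense2sparse`; the function name `toy_window_to_sparse`, its arguments, and the trimming rule (each carried deletion pushes the needed reference end out by its length) are simplifying assumptions made only for illustration.

```python
import numpy as np

def toy_window_to_sparse(genos, pos, ilen, region_end, extend_to_length):
    """Toy stand-in for the dispatch described above (hypothetical, not genoray's API).

    genos: (n_haplotypes, n_variants) boolean array, True where the haplotype
           carries the ALT allele. Variants must be sorted by position.
    Returns one array of kept variant indices per haplotype.
    """
    kept = []
    for hap in genos:
        carried = np.flatnonzero(hap)
        if not extend_to_length:
            # Plain conversion: keep every carried variant in the window.
            kept.append(carried)
            continue
        # Assumed trimming rule: a haplotype needs reference up to `need_end`;
        # each deletion it carries shortens it, so the needed end moves out.
        need_end = region_end
        idx = []
        for v in carried:
            if pos[v] >= need_end:
                break  # the rest of the extension tail is not needed here
            idx.append(v)
            need_end += max(-int(ilen[v]), 0)
        kept.append(np.array(idx, dtype=np.intp))
    return kept

# Window over-extended past region_end=20 to include the variants at 22 and 27.
pos = np.array([10, 15, 22, 27])
ilen = np.array([0, -5, 0, -2])
genos = np.array([[1, 0, 1, 1],    # no deletion before 20: needs no tail variants
                  [0, 1, 1, 1]],   # 5 bp deletion: needs reference up to 25
                 dtype=bool)

trimmed = [k.tolist() for k in toy_window_to_sparse(genos, pos, ilen, 20, True)]
plain = [k.tolist() for k in toy_window_to_sparse(genos, pos, ilen, 20, False)]
print(trimmed, plain)
```

In this toy, the first haplotype drops both tail variants under extension while the second keeps only the one it needs; the plain conversion keeps everything carried in the window, which is the over-extension the summary describes.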

Design + plan: docs/superpowers/specs/2026-05-30-dense2sparse-with-length-design.md, docs/superpowers/plans/2026-05-30-dense2sparse-with-length.md.

Test Plan

  • tests/integration/dataset/test_vcf_pgen_svar_parity.py: acceptance test. VCF and PGEN sparse output (start/ilen/alt) is byte-identical to SVAR across all region×sample pairs (with a non-emptiness guard so it can't pass vacuously).
  • tests/unit/dataset/genotypes/test_window_to_sparse.py: _window_to_sparse trims per-haplotype under extension and keeps carried variants without it.
  • tests/unit/dataset/test_write_no_extend_empty_region.py: extend_to_length=False write no longer crashes on variant-free regions.
  • Full suite: 494 passed, 11 skipped, 2 xfailed (only the missing-local-data 1kg slow tier is excluded).
  • Edge cases probed in review: multi-memory-chunk windows, insertion/deletion boundaries, reverse strand, jitter-expanded coordinates — all parity-consistent.
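The non-emptiness guard mentioned for the parity test can be sketched as follows. This is a minimal illustration, not the test in the repository: the helper name `assert_sparse_parity` and the dict-of-arrays layout keyed by (region, sample) are assumptions.

```python
import numpy as np

def assert_sparse_parity(expected, actual):
    """Compare sparse fields per (region, sample) key and refuse to pass vacuously.

    Both arguments are hypothetical dicts mapping (region, sample) to a dict
    with "start", "ilen", and "alt" arrays.
    """
    n_variants = 0
    for key, exp in expected.items():
        got = actual[key]
        for field in ("start", "ilen", "alt"):
            # Same dtype and same values, so the stored bytes match.
            assert got[field].dtype == exp[field].dtype, (key, field)
            np.testing.assert_array_equal(got[field], exp[field])
        n_variants += len(exp["start"])
    # Guard: an all-empty comparison would otherwise succeed trivially.
    assert n_variants > 0, "parity check compared zero variants"
    return n_variants

svar = {("r0", "s0"): {"start": np.array([10, 15]),
                       "ilen": np.array([0, -5]),
                       "alt": np.array([b"A", b"G"])}}
vcf = {k: {f: a.copy() for f, a in v.items()} for k, v in svar.items()}

n_checked = assert_sparse_parity(svar, vcf)
```

Counting compared variants and asserting the count is positive is what keeps the check from passing when every region happens to be empty.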

Note: the .gvl ground-truth datasets are gitignored and regenerated by pixi run -e dev gen (which the test task depends on); per-haplotype-minimal output means they must be regenerated before running the parity test locally.

🤖 Generated with Claude Code

@d-laub force-pushed the worktree-feat-dense2sparse-with-length branch from 32d7c5a to 8e58838 on May 31, 2026 06:23
d-laub and others added 11 commits May 30, 2026 23:25
Bump genoray to 2.7.0 and add the design for routing VCF/PGEN writes
through genoray's new _dense2sparse_with_length (per-haplotype-minimal,
matching SVAR), honoring extend_to_length=False, and aligning max_ends.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…N writes

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…index alignment

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sparse, fix max_ends

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@d-laub force-pushed the worktree-feat-dense2sparse-with-length branch from 8e58838 to 3a47057 on May 31, 2026 06:26
@d-laub d-laub merged commit 902f1dd into main May 31, 2026
7 checks passed
@d-laub d-laub deleted the worktree-feat-dense2sparse-with-length branch May 31, 2026 06:45