Route VCF/PGEN writes through genoray _dense2sparse_with_length #197
Merged
32d7c5a to 8e58838
Bump genoray to 2.7.0 and add the design for routing VCF/PGEN writes through genoray's new _dense2sparse_with_length (per-haplotype-minimal, matching SVAR), honoring extend_to_length=False, and aligning max_ends. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…N writes Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…parse, fix max_ends
…index alignment Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sparse, fix max_ends Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
8e58838 to 3a47057
Summary

- Routes VCF/PGEN writes through genoray's `_dense2sparse_with_length`, so they produce per-haplotype-minimal sparse genotypes, identical to what the SVAR write path already produced via `read_ranges_with_length`. Previously VCF/PGEN applied plain `dense2sparse` to the over-extended window, so every haplotype kept all extension-tail variants (over-extending haplotypes with few or no deletions).
- Honors `extend_to_length=False`, which was a silent no-op for VCF/PGEN (`_write_from_vcf`/`_write_from_pgen` never forwarded it to their chunk generators, which always called `_chunk_ranges_with_length`). It now genuinely disables extension via the non-length chunking APIs (`VCF.chunk`/`PGEN.chunk`). This makes the code match the existing `gvl.write` docstring and `SKILL.md`.
- Computes `chromEnd` (`max_ends`) from the furthest retained variant via `v_ends = POS - clip(ILEN, max=0)`, matching `_write_from_svar` exactly, so all three input paths (VCF, PGEN, SVAR) agree.
- Bumps genoray to `2.7.0` (`pixi.toml` `==2.7.0`, `pyproject.toml` `>=2.7.0,<3`).

Implementation refactors the two near-duplicate generators to assemble each region's full dense window across genoray memory-chunks and delegate conversion to a shared `_window_to_sparse` helper (which dispatches between `_dense2sparse_with_length` and `dense2sparse`). Empty regions and chunk/index alignment are guarded in both no-extend paths.

Design + plan: `docs/superpowers/specs/2026-05-30-dense2sparse-with-length-design.md`, `docs/superpowers/plans/2026-05-30-dense2sparse-with-length.md`.

Test Plan

- `tests/integration/dataset/test_vcf_pgen_svar_parity.py`: acceptance test. VCF and PGEN sparse output (start/ilen/alt) is byte-identical to SVAR across all region×sample pairs (with a non-emptiness guard so it can't pass vacuously).
- `tests/unit/dataset/genotypes/test_window_to_sparse.py`: `_window_to_sparse` trims per-haplotype under extension and keeps carried variants without it.
- `tests/unit/dataset/test_write_no_extend_empty_region.py`: an `extend_to_length=False` write no longer crashes on variant-free regions.

🤖 Generated with Claude Code
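For reviewers who want to sanity-check the `chromEnd` rule, here is a minimal NumPy sketch of `v_ends = POS - clip(ILEN, max=0)`. The function name `variant_ends` and the sample arrays are hypothetical and only illustrate the arithmetic; this is not the code in `_write_from_svar`.

```python
import numpy as np


def variant_ends(pos: np.ndarray, ilen: np.ndarray) -> np.ndarray:
    """Return POS - clip(ILEN, max=0) for each variant.

    Only deletions (ILEN < 0) push the end past POS; SNPs (ILEN == 0)
    and insertions (ILEN > 0) contribute nothing after clipping.
    """
    return pos - np.clip(ilen, None, 0)


# Hypothetical retained variants for one region: a SNP, an insertion, a deletion.
pos = np.array([100, 250, 400])
ilen = np.array([0, 3, -12])

v_ends = variant_ends(pos, ilen)
# The region's max end is taken from the furthest retained variant.
max_end = int(v_ends.max())
```

In this example the deletion at 400 sets the maximum, so a trailing deletion moves the region end while a trailing insertion would not.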
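To illustrate what "per-haplotype-minimal" means in the summary, the sketch below counts how many leading variants a single haplotype needs to span a target length. It is a simplified model under stated assumptions, not genoray's `_dense2sparse_with_length`: the name `n_variants_needed` is hypothetical, positions are sorted, and only deletions are assumed to extend the reference span a haplotype needs.

```python
from typing import Sequence


def n_variants_needed(
    pos: Sequence[int], ilen: Sequence[int], start: int, length: int
) -> int:
    """Count the leading variants one haplotype needs to cover `length` bases.

    Assumes `pos` is sorted ascending and that only deletions (ilen < 0)
    push the required reference end further right.
    """
    ref_end = start + length  # reference end needed if there were no deletions
    n = 0
    for p, il in zip(pos, ilen):
        if p >= ref_end:
            break  # this variant lies past what the haplotype needs
        ref_end -= min(il, 0)  # a deletion consumes extra reference bases
        n += 1
    return n


# Same over-extended window, two haplotypes with different deletion content.
window_pos = [5, 12, 20, 30]
hap_with_deletions = n_variants_needed(window_pos, [-3, 0, -2, 0], start=0, length=10)
hap_without_deletions = n_variants_needed(window_pos, [0, 0, 0, 0], start=0, length=10)
```

Under this model, the haplotype carrying a deletion keeps one extension-tail variant, while the deletion-free haplotype keeps none. Plain `dense2sparse` on the over-extended window would have kept the tail for both.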