Description
I am encountering an issue where the haplotype sequences generated by genvarloader (v0.22.0) appear to be truncated at the end compared to the reference sequence length.
While the variants (indels) within the sequence are correctly applied, the resulting strings are missing bases at the 3' end. This results in a sequence length that does not match the theoretically calculated length based on the VCF indels.
Environment
- genvarloader version: 0.22.0
- Reference: hg38
Code to Reproduce
import genvarloader as gvl
from tqdm import tqdm
import json
DEBUG = True
OUTPUT_FILE = "output_haplotypes.json"
# 1. Create GVL file
gvl_path = "test_APOB.gvl"
gvl.write(
path=gvl_path,
bed="test.bed",
variants="test.vcf.gz",
)
# 2. Open dataset and fetch ragged haplotypes
dataset = gvl.Dataset.open("test_APOB.gvl", reference="/ref/hg38.fa")
dataset = dataset.with_len("ragged").with_seqs("haplotypes")
genes = dataset.regions["name"].to_list()
samples = dataset.samples
haplotypes_dict = {}
for sample_idx, sample_name in enumerate(tqdm(samples, desc="Processing samples")):
haplotypes_dict[sample_name] = {}
for region_idx, gene_name in enumerate(genes):
haps = dataset[region_idx, sample_idx]
seq1 = haps[0].decode()
seq2 = haps[1].decode()
haplotypes_dict[sample_name][gene_name] = [seq1, seq2]
if DEBUG:
tqdm.write(f"{sample_name} _ {gene_name}: Strand1 len {len(seq1)}, Strand2 len {len(seq2)}")
with open(OUTPUT_FILE, "w") as f:
json.dump(haplotypes_dict, f, indent=2)
Observed Behavior
The generated sequences are consistently shorter than the reference sequence at the end.
- Reference Sequence Length: 42645 bp
- Haplotype 1 Length: 42636 bp (9 bp shorter than ref)
- Haplotype 2 Length: 42645 bp (Matches ref length, but content is truncated)
Expected Behavior
Based on the variants in the phased VCF (specifically the indels), I calculated the expected lengths to be:
- Expected Haplotype 1: 42641 bp
- Expected Haplotype 2: 42647 bp
Furthermore, upon manual inspection of the sequence tails, bases are missing compared to the reference:
Reference Tail: ...GCAGGTCCCGGTGGGAAT
Haplotype 1 Tail: ...GCAGGTCCCGGTG (5 nt loss)
Haplotype 2 Tail: ...GCAGGTCCCGGTGGGA (2 nt loss)
The variants in the upstream region match perfectly, but the truncation occurs randomly at the very end of the string.
Question
Is this a bug regarding the "ragged" length setting, or is there a specific parameter I need to set to ensure the full sequence context is retrieved?
Description
I am encountering an issue where the haplotype sequences generated by
genvarloader(v0.22.0) appear to be truncated at the end compared to the reference sequence length.While the variants (indels) within the sequence are correctly applied, the resulting strings are missing bases at the 3' end. This results in a sequence length that does not match the theoretically calculated length based on the VCF indels.
Environment
Code to Reproduce
Observed Behavior
The generated sequences are consistently shorter than the reference sequence at the end.
Expected Behavior
Based on the variants in the phased VCF (specifically the indels), I calculated the expected lengths to be:
Furthermore, upon manual inspection of the sequence tails, bases are missing compared to the reference:
The variants in the upstream region match perfectly, but the truncation occurs randomly at the very end of the string.
Question
Is this a bug regarding the "ragged" length setting, or is there a specific parameter I need to set to ensure the full sequence context is retrieved?