Skip to content

Ragged haplotype sequences are unexpectedly truncated at the 3' end #153

@VincentLHH

Description

@VincentLHH

Description
I am encountering an issue where the haplotype sequences generated by genvarloader (v0.22.0) appear to be truncated at the end compared to the reference sequence length.

While the variants (indels) within the sequence are correctly applied, the resulting strings are missing bases at the 3' end. This results in a sequence length that does not match the theoretically calculated length based on the VCF indels.

Environment

  • genvarloader version: 0.22.0
  • Reference: hg38

Code to Reproduce

import genvarloader as gvl
from tqdm import tqdm
import json

DEBUG = True
OUTPUT_FILE = "output_haplotypes.json"

# 1. Create GVL file
gvl_path = "test_APOB.gvl"
gvl.write(
    path=gvl_path,
    bed="test.bed",
    variants="test.vcf.gz",
)

# 2. Open dataset and fetch ragged haplotypes
dataset = gvl.Dataset.open("test_APOB.gvl", reference="/ref/hg38.fa")
dataset = dataset.with_len("ragged").with_seqs("haplotypes")

genes = dataset.regions["name"].to_list()
samples = dataset.samples

haplotypes_dict = {}

for sample_idx, sample_name in enumerate(tqdm(samples, desc="Processing samples")):
    haplotypes_dict[sample_name] = {}
    for region_idx, gene_name in enumerate(genes):
        haps = dataset[region_idx, sample_idx]
        seq1 = haps[0].decode()
        seq2 = haps[1].decode()
        
        haplotypes_dict[sample_name][gene_name] = [seq1, seq2]

        if DEBUG:
            tqdm.write(f"{sample_name} _ {gene_name}: Strand1 len {len(seq1)}, Strand2 len {len(seq2)}")

with open(OUTPUT_FILE, "w") as f:
    json.dump(haplotypes_dict, f, indent=2)

Observed Behavior
The generated sequences are consistently shorter than the reference sequence at the end.

  • Reference Sequence Length: 42645 bp
  • Haplotype 1 Length: 42636 bp (9 bp shorter than ref)
  • Haplotype 2 Length: 42645 bp (Matches ref length, but content is truncated)

Expected Behavior
Based on the variants in the phased VCF (specifically the indels), I calculated the expected lengths to be:

  • Expected Haplotype 1: 42641 bp
  • Expected Haplotype 2: 42647 bp

Furthermore, upon manual inspection of the sequence tails, bases are missing compared to the reference:

Reference Tail: ...GCAGGTCCCGGTGGGAAT
Haplotype 1 Tail: ...GCAGGTCCCGGTG (5 nt loss)
Haplotype 2 Tail: ...GCAGGTCCCGGTGGGA (2 nt loss)

The variants in the upstream region match perfectly, but the truncation occurs randomly at the very end of the string.

Question
Is this a bug regarding the "ragged" length setting, or is there a specific parameter I need to set to ensure the full sequence context is retrieved?


Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions