Ragged haplotype sequences are unexpectedly truncated at the 3' end

**Description**
I am encountering an issue where the haplotype sequences generated by `genvarloader` (v0.22.0) appear to be truncated at the end compared to the reference sequence length.

While the variants (indels) within the sequence are correctly applied, the resulting strings are missing bases at the 3' end. This results in a sequence length that does not match the theoretically calculated length based on the VCF indels.

**Environment**
- **genvarloader version:** 0.22.0
- **Reference:** hg38

**Code to Reproduce**
```python
import genvarloader as gvl
from tqdm import tqdm
import json

DEBUG = True
OUTPUT_FILE = "output_haplotypes.json"

# 1. Create GVL file
gvl_path = "test_APOB.gvl"
gvl.write(
    path=gvl_path,
    bed="test.bed",
    variants="test.vcf.gz",
)

# 2. Open dataset and fetch ragged haplotypes
dataset = gvl.Dataset.open("test_APOB.gvl", reference="/ref/hg38.fa")
dataset = dataset.with_len("ragged").with_seqs("haplotypes")

genes = dataset.regions["name"].to_list()
samples = dataset.samples

haplotypes_dict = {}

for sample_idx, sample_name in enumerate(tqdm(samples, desc="Processing samples")):
    haplotypes_dict[sample_name] = {}
    for region_idx, gene_name in enumerate(genes):
        haps = dataset[region_idx, sample_idx]
        seq1 = haps[0].decode()
        seq2 = haps[1].decode()
        
        haplotypes_dict[sample_name][gene_name] = [seq1, seq2]

        if DEBUG:
            tqdm.write(f"{sample_name} _ {gene_name}: Strand1 len {len(seq1)}, Strand2 len {len(seq2)}")

with open(OUTPUT_FILE, "w") as f:
    json.dump(haplotypes_dict, f, indent=2)
```

**Observed Behavior**
The generated sequences are consistently shorter than the reference sequence at the end.

- **Reference Sequence Length:** 42645 bp
- **Haplotype 1 Length:** 42636 bp (9 bp shorter than ref)
- **Haplotype 2 Length:** 42645 bp (Matches ref length, but content is truncated)

**Expected Behavior**
Based on the variants in the phased VCF (specifically the indels), I calculated the expected lengths to be:
- **Expected Haplotype 1:** 42641 bp
- **Expected Haplotype 2:** 42647 bp

Furthermore, upon manual inspection of the sequence tails, bases are missing compared to the reference:

> **Reference Tail:** `...GCAGGTCCCGGTGGGAAT`
> **Haplotype 1 Tail:** `...GCAGGTCCCGGTG` (5 nt loss)
> **Haplotype 2 Tail:** `...GCAGGTCCCGGTGGGA` (2 nt loss)

The variants in the upstream region match perfectly, but the truncation occurs randomly at the very end of the string.

**Question**
Is this a bug regarding the "ragged" length setting, or is there a specific parameter I need to set to ensure the full sequence context is retrieved?

---

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ragged haplotype sequences are unexpectedly truncated at the 3' end #153

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Ragged haplotype sequences are unexpectedly truncated at the 3' end #153

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions