VCF write: 15+ min stalls on individual CDS exons (1KG chr2 MANE Select, gvl 0.24.1)

## Summary

`gvl.write(...)` from VCF input on chr2 of the 1KG NYGC 30x phased panel makes forward progress at the start (~40 region/s on the easy regions) but then enters a regime where **single CDS exons (43-200 bp) take 15+ minutes each** to process. Across two runs (different `max_mem` budgets, different host) the slowdown reproduces in the same chr2 region range. The new `Variant reader resident size: 0 B; ...` log line introduced in #163 suggests the index-aware accounting is reporting zero footprint for the VCF reader, so the budgeting fix doesn't apply to VCF inputs in practice.

## Environment

| | |
|---|---|
| `genvarloader` | `0.24.1` (head of #163, commit `2d7d660`) |
| `genoray` | `>=2.4.0,<3` (resolved via PR's pin) |
| Python | 3.11 |
| OS | Ubuntu 24.04 (Runpod `runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404`) |
| Host | 16 vCPU / 250 GB RAM (A100-SXM4-80GB; GPU unused — `gvl.write` is CPU-bound) |
| Input VCF | 1KG NYGC 30x phased panel, chr2 — `1kGP_high_coverage_Illumina.chr2.filtered.SNV_INDEL_SV_phased_panel.vcf.gz` (~1 GB compressed, ~5M variants, 3202 samples) |
| Input BED | MANE Select chr2 CDS (14,818 rows, from Ensembl 113 GTF, ordered by `chrom`+`start`) |
| `max_mem` | tried both `"2g"` and `"16g"`; behaviour identical |

## Reproduction

```python
# pip install "genvarloader @ git+https://github.com/mcvickerlab/GenVarLoader@2d7d660" "genoray>=2.4.0"
import polars as pl
import pooch
import genvarloader as gvl
from pathlib import Path

# 1. VCF (1KG NYGC 30x phased panel — public)
vcf_url = (
    "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/"
    "1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/"
    "1kGP_high_coverage_Illumina.chr2.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"
)
vcf_path = pooch.retrieve(vcf_url, known_hash=None, path="/tmp/gvl-repro")
pooch.retrieve(vcf_url + ".tbi", known_hash=None, path="/tmp/gvl-repro")

# 2. MANE Select CDS BED for chr2 (Ensembl 113)
gtf_url = "https://ftp.ensembl.org/pub/release-113/gtf/homo_sapiens/Homo_sapiens.GRCh38.113.chr.gtf.gz"
gtf_path = pooch.retrieve(gtf_url, known_hash=None, path="/tmp/gvl-repro")
gtf = pl.read_csv(
    gtf_path, separator="\t", comment_prefix="#", has_header=False,
    new_columns=["chrom","source","feature","start","end","score","strand","frame","attrs"],
    infer_schema_length=0,
).with_columns(pl.col("start").cast(pl.Int64), pl.col("end").cast(pl.Int64))
gtf = gtf.filter(pl.col("feature") == "CDS")
gtf = gtf.with_columns(
    pl.col("attrs").str.extract(r'transcript_id "([^"]+)"').alias("transcript_id"),
    pl.col("attrs").str.extract(r'exon_number "?([0-9]+)"?').cast(pl.Int64, strict=False).alias("exon_number"),
    pl.col("attrs").str.contains("MANE_Select").alias("is_mane_select"),
)
mane = gtf.filter(pl.col("is_mane_select") & (pl.col("chrom") == "2"))
bed = mane.select(
    ("chr" + pl.col("chrom").cast(pl.Utf8)).alias("chrom"),
    (pl.col("start") - 1).alias("chromStart"),
    pl.col("end").alias("chromEnd"),
    "transcript_id", "exon_number", "strand",
)
print(f"chr2 MANE Select CDS rows: {bed.height}")   # 14818

# 3. write — this is where it stalls
gvl.write(
    path="/tmp/gvl-repro/chr2.svar",
    bed=bed,
    variants=str(vcf_path),
    overwrite=True,
    max_mem="16g",
)
```

## Observed behaviour

The tqdm progress bar shows a clean fast start, a slow patch around region 2004-2078 (rate dropped from ~47 region/s to ~1.3 region/s, then recovered to ~14 region/s by region 2284), and then a 15+ minute stall on **a single 43-base exon** at region 2683:

```
gvl 0.24.1
[chr2] gvl.write(/<...>/chr2, bed=14818 regions, vcf=1kGP_high_coverage_Illumina.chr2.filtered.SNV_INDEL_SV_phased_panel.vcf.gz)
INFO     | genvarloader._dataset._write:write:126 - Writing dataset to /<...>/chr2
INFO     | genvarloader._dataset._write:write:198 - Using 3202 samples.
INFO     | genvarloader._dataset._write:write:203 - Writing genotypes.
INFO     | genvarloader._dataset._write:write:207 - Variant reader resident size: 0 B; max_mem budget: 16.00 GiB; available for chunking: 16.00 GiB
INFO     | genvarloader._dataset._write:_write_from_vcf:363 - VCF genoray index is invalid, writing
INFO     | genoray._vcf:_load_index:1096 - Loading genoray index.
WARNING  | genvarloader._dataset._write:_write_from_vcf:438 - A region has no variants for any sample. ...

Processing genotypes for 14818 regions on contig chr2:   0%|          | 0/14818 [00:00<?, ? region/s]
Processing genotypes for 14818 regions on contig chr2:   3%|▎         | 405/14818 [00:10, 43.26 region/s]
Processing genotypes for 14818 regions on contig chr2:   8%|▊         | 1114/14818 [00:30, 38.08 region/s]
Processing genotypes for 14818 regions on contig chr2:  13%|█▎        | 1941/14818 [00:45, 47.73 region/s]   ← cruising
Processing genotypes for 14818 regions on contig chr2:  14%|█▎        | 2004/14818 [00:59, 47.73 region/s]
Processing genotypes for 14818 regions on contig chr2:  14%|█▎        | 2042/14818 [02:11<1:14:47,  2.85 region/s]  ← first slowdown
Processing genotypes for 14818 regions on contig chr2:  14%|█▍        | 2078/14818 [03:00, 1.33 region/s]   ← bottom of first slowdown
Processing genotypes for 14818 regions on contig chr2:  15%|█▌        | 2284/14818 [03:05,  7.04 region/s]  ← recovered
Processing genotypes for 14818 regions on contig chr2:  17%|█▋        | 2544/14818 [03:10, 14.68 region/s]
Processing genotypes for 14818 regions on contig chr2:  18%|█▊        | 2682/14818 [03:29<13:46, 14.68 region/s]   ← was here at t=03:29
Processing genotypes for 14818 regions on contig chr2:  18%|█▊        | 2683/14818 [19:58<7:51:19,  2.33s/ region] ← 16:29 min spent on ONE 43-bp exon
```

Region 2683 is a CDS exon of `E2F6` at `chr2:11,446,480-11,446,523` (43 bp, strand `-`) — structurally unremarkable. The first slow patch (regions 2004-2078) similarly hits common ordinary genes: `MYL1`, `ANAPC1`, `SGPP2`, `CRACDL`, `TANC1`. None of these are textbook hard cases (HLA, KIR, immunoglobulin, low-complexity / repetitive regions).

## Hypothesis

The slowness doesn't correlate with variant density of the BED region nor with chunk boundaries (a chunk flush would be deterministic, not random). My current guess is some kind of per-region quadratic scan or cache-miss pattern in the VCF reader: each region triggers a scan over a large data structure whose size grows with cumulative regions processed.

The "0 B" reported by the new `#163` accounting log strongly suggests the VCF reader path is not surfacing its memory footprint via `genoray`'s `.nbytes` — so #163's budgeting can't influence chunk-vs-index trade-off for VCF input.

## Questions

1. Is `Variant reader resident size: 0 B` expected for VCF inputs, or is the genoray VCF reader missing the `.nbytes` property?
2. Have you observed per-region runtime variation of this magnitude (40× region/s → 0.4 region/s) on other VCF inputs? Could you profile a single-region call on the E2F6 exon above against the 1KG NYGC 30x chr2 VCF?
3. Would converting the VCF to PGen up front (so the PGen `.nbytes` is populated) likely sidestep this?

Happy to attach a private slice of the in-pod log (~6 KB after stripping internal paths) or run additional instrumentation if helpful.

---

(Reporting from external pipeline use of `gvl.write`; not opening as a regression on #163 since the underlying behaviour is consistent across `max_mem` values — #163 made the budget accounting visible but didn't introduce this.)



`genvarloader`	`0.24.1` (head of #163, commit `2d7d660`)
`genoray`	`>=2.4.0,<3` (resolved via PR's pin)
Python	3.11
OS	Ubuntu 24.04 (Runpod `runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404`)
Host	16 vCPU / 250 GB RAM (A100-SXM4-80GB; GPU unused — `gvl.write` is CPU-bound)
Input VCF	1KG NYGC 30x phased panel, chr2 — `1kGP_high_coverage_Illumina.chr2.filtered.SNV_INDEL_SV_phased_panel.vcf.gz` (~1 GB compressed, ~5M variants, 3202 samples)
Input BED	MANE Select chr2 CDS (14,818 rows, from Ensembl 113 GTF, ordered by `chrom`+`start`)
`max_mem`	tried both `"2g"` and `"16g"`; behaviour identical

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VCF write: 15+ min stalls on individual CDS exons (1KG chr2 MANE Select, gvl 0.24.1) #164

Summary

Environment

Reproduction

Observed behaviour

Hypothesis

Questions

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

VCF write: 15+ min stalls on individual CDS exons (1KG chr2 MANE Select, gvl 0.24.1) #164

Description

Summary

Environment

Reproduction

Observed behaviour

Hypothesis

Questions

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions