Skip to content

VCF write: 15+ min stalls on individual CDS exons (1KG chr2 MANE Select, gvl 0.24.1) #164

@bschilder

Description

@bschilder

Summary

gvl.write(...) from VCF input on chr2 of the 1KG NYGC 30x phased panel makes forward progress at the start (~40 region/s on the easy regions) but then enters a regime where single CDS exons (43-200 bp) take 15+ minutes each to process. Across two runs (different max_mem budgets, different host) the slowdown reproduces in the same chr2 region range. The new Variant reader resident size: 0 B; ... log line introduced in #163 suggests the index-aware accounting is reporting zero footprint for the VCF reader, so the budgeting fix doesn't apply to VCF inputs in practice.

Environment

genvarloader 0.24.1 (head of #163, commit 2d7d660)
genoray >=2.4.0,<3 (resolved via PR's pin)
Python 3.11
OS Ubuntu 24.04 (Runpod runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404)
Host 16 vCPU / 250 GB RAM (A100-SXM4-80GB; GPU unused — gvl.write is CPU-bound)
Input VCF 1KG NYGC 30x phased panel, chr2 — 1kGP_high_coverage_Illumina.chr2.filtered.SNV_INDEL_SV_phased_panel.vcf.gz (~1 GB compressed, ~5M variants, 3202 samples)
Input BED MANE Select chr2 CDS (14,818 rows, from Ensembl 113 GTF, ordered by chrom+start)
max_mem tried both "2g" and "16g"; behaviour identical

Reproduction

# pip install "genvarloader @ git+https://github.com/mcvickerlab/GenVarLoader@2d7d660" "genoray>=2.4.0"
import polars as pl
import pooch
import genvarloader as gvl
from pathlib import Path

# 1. VCF (1KG NYGC 30x phased panel — public)
vcf_url = (
    "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/"
    "1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/"
    "1kGP_high_coverage_Illumina.chr2.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"
)
vcf_path = pooch.retrieve(vcf_url, known_hash=None, path="/tmp/gvl-repro")
pooch.retrieve(vcf_url + ".tbi", known_hash=None, path="/tmp/gvl-repro")

# 2. MANE Select CDS BED for chr2 (Ensembl 113)
gtf_url = "https://ftp.ensembl.org/pub/release-113/gtf/homo_sapiens/Homo_sapiens.GRCh38.113.chr.gtf.gz"
gtf_path = pooch.retrieve(gtf_url, known_hash=None, path="/tmp/gvl-repro")
gtf = pl.read_csv(
    gtf_path, separator="\t", comment_prefix="#", has_header=False,
    new_columns=["chrom","source","feature","start","end","score","strand","frame","attrs"],
    infer_schema_length=0,
).with_columns(pl.col("start").cast(pl.Int64), pl.col("end").cast(pl.Int64))
gtf = gtf.filter(pl.col("feature") == "CDS")
gtf = gtf.with_columns(
    pl.col("attrs").str.extract(r'transcript_id "([^"]+)"').alias("transcript_id"),
    pl.col("attrs").str.extract(r'exon_number "?([0-9]+)"?').cast(pl.Int64, strict=False).alias("exon_number"),
    pl.col("attrs").str.contains("MANE_Select").alias("is_mane_select"),
)
mane = gtf.filter(pl.col("is_mane_select") & (pl.col("chrom") == "2"))
bed = mane.select(
    ("chr" + pl.col("chrom").cast(pl.Utf8)).alias("chrom"),
    (pl.col("start") - 1).alias("chromStart"),
    pl.col("end").alias("chromEnd"),
    "transcript_id", "exon_number", "strand",
)
print(f"chr2 MANE Select CDS rows: {bed.height}")   # 14818

# 3. write — this is where it stalls
gvl.write(
    path="/tmp/gvl-repro/chr2.svar",
    bed=bed,
    variants=str(vcf_path),
    overwrite=True,
    max_mem="16g",
)

Observed behaviour

The tqdm progress bar shows a clean fast start, a slow patch around region 2004-2078 (rate dropped from ~47 region/s to ~1.3 region/s, then recovered to ~14 region/s by region 2284), and then a 15+ minute stall on a single 43-base exon at region 2683:

gvl 0.24.1
[chr2] gvl.write(/<...>/chr2, bed=14818 regions, vcf=1kGP_high_coverage_Illumina.chr2.filtered.SNV_INDEL_SV_phased_panel.vcf.gz)
INFO     | genvarloader._dataset._write:write:126 - Writing dataset to /<...>/chr2
INFO     | genvarloader._dataset._write:write:198 - Using 3202 samples.
INFO     | genvarloader._dataset._write:write:203 - Writing genotypes.
INFO     | genvarloader._dataset._write:write:207 - Variant reader resident size: 0 B; max_mem budget: 16.00 GiB; available for chunking: 16.00 GiB
INFO     | genvarloader._dataset._write:_write_from_vcf:363 - VCF genoray index is invalid, writing
INFO     | genoray._vcf:_load_index:1096 - Loading genoray index.
WARNING  | genvarloader._dataset._write:_write_from_vcf:438 - A region has no variants for any sample. ...

Processing genotypes for 14818 regions on contig chr2:   0%|          | 0/14818 [00:00<?, ? region/s]
Processing genotypes for 14818 regions on contig chr2:   3%|▎         | 405/14818 [00:10, 43.26 region/s]
Processing genotypes for 14818 regions on contig chr2:   8%|▊         | 1114/14818 [00:30, 38.08 region/s]
Processing genotypes for 14818 regions on contig chr2:  13%|█▎        | 1941/14818 [00:45, 47.73 region/s]   ← cruising
Processing genotypes for 14818 regions on contig chr2:  14%|█▎        | 2004/14818 [00:59, 47.73 region/s]
Processing genotypes for 14818 regions on contig chr2:  14%|█▎        | 2042/14818 [02:11<1:14:47,  2.85 region/s]  ← first slowdown
Processing genotypes for 14818 regions on contig chr2:  14%|█▍        | 2078/14818 [03:00, 1.33 region/s]   ← bottom of first slowdown
Processing genotypes for 14818 regions on contig chr2:  15%|█▌        | 2284/14818 [03:05,  7.04 region/s]  ← recovered
Processing genotypes for 14818 regions on contig chr2:  17%|█▋        | 2544/14818 [03:10, 14.68 region/s]
Processing genotypes for 14818 regions on contig chr2:  18%|█▊        | 2682/14818 [03:29<13:46, 14.68 region/s]   ← was here at t=03:29
Processing genotypes for 14818 regions on contig chr2:  18%|█▊        | 2683/14818 [19:58<7:51:19,  2.33s/ region] ← 16:29 min spent on ONE 43-bp exon

Region 2683 is a CDS exon of E2F6 at chr2:11,446,480-11,446,523 (43 bp, strand -) — structurally unremarkable. The first slow patch (regions 2004-2078) similarly hits common ordinary genes: MYL1, ANAPC1, SGPP2, CRACDL, TANC1. None of these are textbook hard cases (HLA, KIR, immunoglobulin, low-complexity / repetitive regions).

Hypothesis

The slowness doesn't correlate with variant density of the BED region nor with chunk boundaries (a chunk flush would be deterministic, not random). My current guess is some kind of per-region quadratic scan or cache-miss pattern in the VCF reader: each region triggers a scan over a large data structure whose size grows with cumulative regions processed.

The "0 B" reported by the new #163 accounting log strongly suggests the VCF reader path is not surfacing its memory footprint via genoray's .nbytes — so #163's budgeting can't influence chunk-vs-index trade-off for VCF input.

Questions

  1. Is Variant reader resident size: 0 B expected for VCF inputs, or is the genoray VCF reader missing the .nbytes property?
  2. Have you observed per-region runtime variation of this magnitude (40× region/s → 0.4 region/s) on other VCF inputs? Could you profile a single-region call on the E2F6 exon above against the 1KG NYGC 30x chr2 VCF?
  3. Would converting the VCF to PGen up front (so the PGen .nbytes is populated) likely sidestep this?

Happy to attach a private slice of the in-pod log (~6 KB after stripping internal paths) or run additional instrumentation if helpful.


(Reporting from external pipeline use of gvl.write; not opening as a regression on #163 since the underlying behaviour is consistent across max_mem values — #163 made the budget accounting visible but didn't introduce this.)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions