Summary
When building a per-chromosome .svar dataset from the 1000 Genomes
NYGC 30x phased release, gvl.write(..., max_mem="32g") consistently
exceeds the requested cap and gets OOM-killed on a 64 GiB CPU pod.
A second build that completes (chr14) follows a memory trajectory
that crosses 90 % of the 64 GiB ceiling before falling off in the
final write phase, i.e. peak resident is roughly 2x the
max_mem budget on chr2 and chr14 (chr18 stayed under the cap).
max_mem looks like it caps a single internal stage (the variant
chunker?) rather than the whole gvl.write pipeline, so callers
sizing pods on max_mem get blindsided.
Environment
| Component | Version / spec |
|---|---|
| GenVarLoader | 0.24.1 (latest on PyPI as of 2026-05-18) |
| polars | 1.40.1 |
| numpy | latest |
| Python | 3.12 (official runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404 image) |
| OS | Ubuntu 24.04 |
| CPU pod | 16 vCPU, 64 GiB RAM (Runpod cpu5c-16-64) |
| Container disk | 160 GiB |
Inputs
- Variants: 1kGP_high_coverage_Illumina.chr2.filtered.SNV_INDEL_SV_phased_panel.vcf.gz (2.5 GiB compressed; 2504 samples + 698 trio children = 3202 individuals; phased SNV+INDEL+SV)
- BED: Ensembl MANE Select CDS regions on chr2 (~3,800 regions, derived from the MANE 1.4 GTF); BED columns are chrom, chromStart, chromEnd, splice_donor_start, splice_donor_end, splice_acceptor_start, splice_acceptor_end
Reproducer
"""Reproduce gvl.write OOM on chr2 1KG NYGC 30x with max_mem=32g.

Expected: gvl.write resident memory stays at or below 32 GiB.
Actual: peak resident exceeds 64 GiB → OOM-killed (rc=137) on a
64 GiB host. Watching via `top` / cgroup memory.current, you'll see
the process climb past 32 GiB ~30 min in and keep climbing.

Run on a host with at least 100 GiB RAM if you want a non-OOM
baseline for the actual peak.
"""
from __future__ import annotations

import subprocess
from pathlib import Path

import genvarloader as gvl
import numpy as np
import polars as pl

EBI = (
    "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/"
    "1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV"
)
VCF_NAME = "1kGP_high_coverage_Illumina.chr2.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"
WORK = Path("/tmp/gvl_oom_repro")
WORK.mkdir(exist_ok=True)


def _curl(url: str, dest: Path) -> None:
    """Download url to dest unless a non-trivial file is already there."""
    if dest.exists() and dest.stat().st_size > 1024:
        return
    print(f"downloading {url} -> {dest}")
    subprocess.run(
        ["curl", "--retry", "5", "--retry-delay", "10", "-o", str(dest), url],
        check=True,
    )


# 1. Fetch VCF + tabix index.
vcf_path = WORK / VCF_NAME
tbi_path = WORK / (VCF_NAME + ".tbi")
_curl(f"{EBI}/{VCF_NAME}", vcf_path)
_curl(f"{EBI}/{VCF_NAME}.tbi", tbi_path)

# 2. Build a chr2 MANE-CDS BED. Replace with your own loader if you
#    have one; here we synthesize an equivalent shape: ~3800 CDS
#    regions of 100-1000 bp each, evenly spread across chr2.
chrom_len = 242_193_529  # GRCh38 chr2
n_regions = 3800
rng = np.random.default_rng(0)
starts = np.sort(rng.integers(1000, chrom_len - 1000, size=n_regions))
lengths = rng.integers(100, 1000, size=n_regions)
bed = pl.DataFrame({
    "chrom": ["chr2"] * n_regions,
    "chromStart": starts.tolist(),
    "chromEnd": (starts + lengths).tolist(),
    "splice_donor_start": (starts - 2).tolist(),
    "splice_donor_end": starts.tolist(),
    "splice_acceptor_start": (starts + lengths).tolist(),
    "splice_acceptor_end": (starts + lengths + 2).tolist(),
})

# 3. The OOMing call: request 32 GiB cap; observe >64 GiB peak.
out_dir = WORK / "chr2.svar"
print(f"gvl version: {gvl.__version__}")
print(f"polars version: {pl.__version__}")
print(f"gvl.write({out_dir}, bed.height={bed.height}, max_mem='32g')")
gvl.write(
    path=str(out_dir),
    bed=bed,
    variants=str(vcf_path),
    overwrite=True,
    max_mem="32g",
)

# Outputs ./chr2.svar/{metadata.json,variants.parquet,genotypes/*.npy,...}
total_bytes = sum(f.stat().st_size for f in out_dir.rglob("*") if f.is_file())
print(f"output size: {total_bytes / 1e9:.1f} GB")
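To capture the memory trajectory without watching `top` by hand, something like the following sampler could wrap the gvl.write call. This is a minimal sketch, not part of GenVarLoader: PeakSampler and read_cgroup_bytes are hypothetical helpers, and the cgroup v2 path /sys/fs/cgroup/memory.current is an assumption about the host.

```python
import threading
from pathlib import Path
from typing import Callable

# Assumed cgroup v2 location; adjust for cgroup v1 or a nested cgroup.
CGROUP_CURRENT = Path("/sys/fs/cgroup/memory.current")


def read_cgroup_bytes(path: Path = CGROUP_CURRENT) -> int:
    """Return the cgroup's current memory usage in bytes."""
    return int(path.read_text().strip())


class PeakSampler:
    """Poll a memory reader in a background thread and keep the maximum seen."""

    def __init__(self, reader: Callable[[], int] = read_cgroup_bytes,
                 interval_s: float = 5.0) -> None:
        self.reader = reader
        self.interval_s = interval_s
        self.peak = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        while True:
            self.peak = max(self.peak, self.reader())
            # wait() returns True once stop is requested, ending the loop.
            if self._stop.wait(self.interval_s):
                break

    def __enter__(self) -> "PeakSampler":
        self._thread.start()
        return self

    def __exit__(self, *exc) -> None:
        self._stop.set()
        self._thread.join()
        self.peak = max(self.peak, self.reader())  # one final sample
```

Usage would be `with PeakSampler() as s: gvl.write(...)` followed by printing `s.peak / 2**30`. On an OOM kill the in-memory peak is lost with the process, so logging each sample to a file inside the reader is the safer variant.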
Observed behavior
Four pods we ran with the same gvl.write arguments (varying chromosome and pod size):
| Chrom | RAM ceiling | max_mem arg | Result | Peak resident (observed) |
|---|---|---|---|---|
| chr2 | 64 GiB | "32g" | OOM-kill (rc=137) at T+22 min after chrom_vcf done | >60 GiB (cgroup OOM) |
| chr2 | 128 GiB | "32g" | Completes in ~110 min | ~81 GiB (63 % of 128) |
| chr14 | 64 GiB | "32g" | Completes in ~190 min | ~58 GiB (91 % of 64) before drop in write phase |
| chr18 | 64 GiB | "32g" | Completes in ~83 min | ~26 GiB (41 % of 64) |
The 64 GiB chr2 pod was OOM-killed at T+22 min (chrom_vcf done at
16:14:09 UTC, kernel SIGKILL at 16:36:17 UTC per our cgroup
log). The 64 GiB chr14 pod tracked a slowly accelerating memory
trajectory (+3-12 pp per 20 min) up to 91 % before polars released
buffers in the final write phase, dropping back to ~30 %.
For reference, chr18 (similar VCF size to chr14, 834 MB vs 966 MB)
peaked at only ~26 GiB — so the per-chrom memory profile varies by
≥3x for similar VCF sizes. We suspect the variance is driven by
variant count (chr2 has the most variants per gene in MANE-CDS)
rather than VCF size.
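As a quick arithmetic check on the numbers above, the peak-to-budget ratio for each completed run can be computed directly. The figures are the approximate peaks from the table; the OOM-killed chr2 run is omitted because only a lower bound is known.

```python
MAX_MEM_GIB = 32  # the max_mem="32g" budget

# (chrom, pod RAM in GiB) -> approximate observed peak resident in GiB.
observed_peak_gib = {
    ("chr2", 128): 81,
    ("chr14", 64): 58,
    ("chr18", 64): 26,
}

# Ratio of observed peak to the requested budget, rounded to two decimals.
ratios = {key: round(peak / MAX_MEM_GIB, 2) for key, peak in observed_peak_gib.items()}

for (chrom, ram), ratio in ratios.items():
    print(f"{chrom} on {ram} GiB pod: peak = {ratio:.2f}x max_mem")
```

The ratios come out at about 2.5x for chr2, 1.8x for chr14, and 0.8x for chr18, so "roughly 2x" describes the two heavy chromosomes while chr18 stays under the budget.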
Expected behavior
When max_mem="32g" is set, peak resident across the whole
gvl.write pipeline should stay at or below 32 GiB (allowing some
headroom for allocator slack, OS page cache, etc.). Failing that,
the documentation for max_mem should clarify which sub-stage it
governs and what the user-controllable knobs are for the others.
Hypothesis
Looking at the polars 1.40.1 query plans we've seen elsewhere, the
likely peak is during the per-region genotype materialization
(genoray/pysam lift into NumPy + polars chunking before the
parquet write). max_mem may only cap the streaming-decode buffer
inside genoray, not the materialized region tensors held in RAM
while polars batches them for the parquet write.
If that's right, a per-region chunked write (stream regions to disk
in batches sized by max_mem, flushing the tensor before reading
the next region) would solve it. Happy to PR a patch if that's the
direction you'd want.
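To make the proposal concrete, here is a minimal sketch of the batching step only. It is not GenVarLoader's internals: batch_regions is a hypothetical helper, and the per-region byte estimates are assumed to be supplied by the caller.

```python
from typing import Iterator, List, Sequence


def batch_regions(region_bytes: Sequence[int], budget_bytes: int) -> Iterator[List[int]]:
    """Yield lists of region indices whose estimated sizes sum to <= budget_bytes.

    A single region larger than the budget is emitted as its own batch,
    so the budget is a target rather than a hard guarantee in that case.
    """
    batch: List[int] = []
    used = 0
    for i, size in enumerate(region_bytes):
        # Close the current batch before it would exceed the budget.
        if batch and used + size > budget_bytes:
            yield batch
            batch, used = [], 0
        batch.append(i)
        used += size
    if batch:
        yield batch


# Toy sizes: the 10-byte region exceeds the 8-byte budget and is emitted alone.
batches = list(batch_regions([4, 4, 4, 10, 1], budget_bytes=8))
print(batches)
```

Each batch would be materialized, written, and released before the next one is read. One possible size estimate is variants in region x samples x ploidy x bytes per genotype, which is an assumption about the in-memory layout rather than something confirmed here.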
Workaround
For now we're sizing chr-scale builds on cpu3m (16 vCPU / 128 GiB)
instead of cpu5c (16 vCPU / 64 GiB) on every chrom we don't have a
prior memory profile for. This adds ~50 % to the per-chrom cost but
turns OOMs into successful builds. Documenting the rule of thumb
peak ≈ 2x max_mem in the docstring for gvl.write would have saved
us the misfire.
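The sizing rule we currently follow can be sketched as below. pick_pod is a hypothetical helper, the pod RAM figures are the ones quoted above, and the 0.8 headroom fraction is an assumption rather than a measured value.

```python
from typing import Optional

# RAM per pod type in GiB, as quoted in this issue.
PODS_GIB = {"cpu5c": 64, "cpu3m": 128}


def pick_pod(prior_peak_gib: Optional[float], headroom: float = 0.8) -> str:
    """Pick the smallest pod whose usable RAM covers a known peak.

    With no prior memory profile, fall back to the largest pod.
    """
    if prior_peak_gib is None:
        return max(PODS_GIB, key=PODS_GIB.get)
    for name, ram in sorted(PODS_GIB.items(), key=lambda kv: kv[1]):
        # Only count a fraction of the pod's RAM as usable.
        if prior_peak_gib <= ram * headroom:
            return name
    raise ValueError(f"no pod fits a {prior_peak_gib} GiB peak")


print(pick_pod(None))  # unprofiled chromosome
print(pick_pod(26))    # chr18-like profile
print(pick_pod(58))    # chr14-like profile
```

With these assumptions, a chr14-like peak of ~58 GiB lands on the 128 GiB pod because it leaves too little margin on 64 GiB.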
Adjacent / related
- #82 "RefDatasets, lower memory usage, and breaking changes to
transforms" (closed) touched memory in the train-time loader;
this issue is about the build-time gvl.write path, which I
don't think #82 covered.