gvl.write exceeds max_mem by >2x on chrom-scale builds (peak >64 GiB with max_mem="32g") #162

@bschilder

Summary

When building a per-chromosome .svar dataset from the 1000 Genomes
NYGC 30x phased release, gvl.write(..., max_mem="32g") consistently
exceeds the requested cap and gets OOM-killed on a 64 GiB CPU pod.
A second build that completes (chr14) follows a memory trajectory
that crosses 90 % of the 64 GiB ceiling before falling off in the
final write phase, i.e. peak resident memory is roughly 2x the
max_mem budget on both chr2 and chr14.

max_mem looks like it caps a single internal stage (the variant
chunker?) rather than the whole gvl.write pipeline, so callers
sizing pods on max_mem get blindsided.

Environment

GenVarLoader: 0.24.1 (latest on PyPI as of 2026-05-18)
polars: 1.40.1
numpy: latest
Python: 3.12 (official runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404 image)
OS: Ubuntu 24.04
CPU pod: 16 vCPU, 64 GiB RAM (Runpod cpu5c-16-64)
Container disk: 160 GiB

Inputs

Reproducer

"""Reproduce gvl.write OOM on chr2 1KG NYGC 30x with max_mem=32g.

Expected: gvl.write resident memory stays at or below 32 GiB.
Actual: peak resident exceeds 64 GiB → OOM-killed (rc=137) on a
64 GiB host. Watching via `top` / cgroup memory.current, you'll see
the process climb past 32 GiB ~30 min in and keep climbing.

Run on a host with at least 100 GiB RAM if you want a non-OOM
baseline for the actual peak.
"""

from __future__ import annotations

import subprocess
from pathlib import Path

import genvarloader as gvl
import polars as pl

EBI = (
    "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/"
    "1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV"
)
VCF_NAME = "1kGP_high_coverage_Illumina.chr2.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"

WORK = Path("/tmp/gvl_oom_repro")
WORK.mkdir(exist_ok=True)


def _curl(url: str, dest: Path) -> None:
    if dest.exists() and dest.stat().st_size > 1024:
        return
    print(f"downloading {url} -> {dest}")
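    # --fail makes curl exit non-zero on HTTP errors instead of saving the
    # error page; --location follows redirects.
    subprocess.run(
        ["curl", "--fail", "--location", "--retry", "5", "--retry-delay", "10",
         "-o", str(dest), url],
        check=True,
    )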


# 1. Fetch VCF + tabix index.
vcf_path = WORK / VCF_NAME
tbi_path = WORK / (VCF_NAME + ".tbi")
_curl(f"{EBI}/{VCF_NAME}", vcf_path)
_curl(f"{EBI}/{VCF_NAME}.tbi", tbi_path)

# 2. Build a chr2 MANE-CDS BED. Replace with your own loader if you
#    have one; here we synthesize an equivalent shape: ~3800 CDS
#    regions of 100-1000 bp each, evenly spread across chr2.
chrom_len = 242_193_529  # GRCh38 chr2
n_regions = 3800
import numpy as np
rng = np.random.default_rng(0)
starts = np.sort(rng.integers(1000, chrom_len - 1000, size=n_regions))
lengths = rng.integers(100, 1000, size=n_regions)
bed = pl.DataFrame({
    "chrom":                     ["chr2"] * n_regions,
    "chromStart":                starts.tolist(),
    "chromEnd":                  (starts + lengths).tolist(),
    "splice_donor_start":        (starts - 2).tolist(),
    "splice_donor_end":          starts.tolist(),
    "splice_acceptor_start":     (starts + lengths).tolist(),
    "splice_acceptor_end":       (starts + lengths + 2).tolist(),
})

# 3. The OOMing call: request 32 GiB cap; observe >64 GiB peak.
out_dir = WORK / "chr2.svar"
print(f"gvl version: {gvl.__version__}")
print(f"polars version: {pl.__version__}")
print(f"gvl.write({out_dir}, bed.height={bed.height}, max_mem='32g')")

gvl.write(
    path=str(out_dir),
    bed=bed,
    variants=str(vcf_path),
    overwrite=True,
    max_mem="32g",
)

# Outputs ./chr2.svar/{metadata.json,variants.parquet,genotypes/*.npy,...}
print(f"output size: {sum(f.stat().st_size for f in out_dir.rglob('*') if f.is_file()) / 1e9:.1f} GB")

Observed behavior

Four pods we ran against this call (same arguments, varying only the chromosome and host RAM):

| Chrom | RAM ceiling | max_mem arg | Result | Peak resident (observed) |
|-------|-------------|-------------|--------|--------------------------|
| chr2  | 64 GiB      | "32g"       | OOM-kill (rc=137) at T+22 min after chrom_vcf done | >60 GiB (cgroup OOM) |
| chr2  | 128 GiB     | "32g"       | Completes in ~110 min | ~81 GiB (63 % of 128) |
| chr14 | 64 GiB      | "32g"       | Completes in ~190 min | ~58 GiB (91 % of 64) before drop in write phase |
| chr18 | 64 GiB      | "32g"       | Completes in ~83 min  | ~26 GiB (41 % of 64) |

The 64 GiB chr2 pod was OOM-killed at T+22 min (chrom_vcf done at
16:14:09 UTC, kernel SIGKILL at 16:36:17 UTC per our cgroup
log). The 64 GiB chr14 pod tracked a slowly accelerating memory
trajectory (+3-12 pp per 20 min) up to 91 % before polars released
buffers in the final write phase, dropping back to ~30 %.
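
For anyone who wants to record a comparable trajectory, here is a minimal sketch of a per-process sampler. It is not what produced the numbers above (those came from our cgroup log, which also counts page cache); it assumes a Linux host with /proc mounted, and the helper names `parse_status_kib` and `sample_rss` are made up for illustration. `VmRSS` is the current resident set size and `VmHWM` is the kernel-tracked peak.

```python
from __future__ import annotations

import time
from pathlib import Path


def parse_status_kib(status_text: str, field: str) -> int | None:
    """Return the KiB value of a /proc/<pid>/status field such as VmRSS."""
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            # Lines look like "VmRSS:\t   51200 kB".
            return int(line.split()[1])
    return None


def sample_rss(pid: int, interval_s: float, n_samples: int) -> list[tuple[float, int]]:
    """Poll VmRSS for `pid`; returns (elapsed seconds, RSS in KiB) pairs."""
    status = Path(f"/proc/{pid}/status")
    t0 = time.monotonic()
    samples: list[tuple[float, int]] = []
    for _ in range(n_samples):
        try:
            rss = parse_status_kib(status.read_text(), "VmRSS")
        except OSError:
            break  # the process has exited (or /proc is unavailable)
        if rss is None:
            break
        samples.append((time.monotonic() - t0, rss))
        time.sleep(interval_s)
    return samples
```

Running `sample_rss(pid, 60.0, 240)` in a side process against the PID of the gvl.write job would give one sample per minute for four hours; reading `VmHWM` once at the end gives the peak without polling.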

For reference, chr18 (similar VCF size to chr14, 834 MB vs 966 MB)
peaked at only ~26 GiB against ~58 GiB for chr14, so the per-chrom
memory profile varies by more than 2x for similar VCF sizes (and by
~3x between chr18 and chr2). We suspect the variance is driven by
variant count (chr2 has the most variants per gene in MANE-CDS)
rather than VCF size.
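
To make the overshoot explicit, the peaks from the completed runs can be expressed as multiples of the requested budget. This is plain arithmetic on the figures reported above and assumes max_mem="32g" is interpreted as 32 GiB (we have not checked how GenVarLoader parses the suffix). The 64 GiB chr2 run is left out because it only gives a lower bound.

```python
MAX_MEM_GIB = 32  # assumption: "32g" means 32 GiB

# (chrom, host RAM in GiB) -> approximate observed peak resident in GiB
observed_peak_gib = {
    ("chr2", 128): 81,
    ("chr14", 64): 58,
    ("chr18", 64): 26,
}

# Peak resident memory as a multiple of the requested budget.
ratios = {key: peak / MAX_MEM_GIB for key, peak in observed_peak_gib.items()}

for (chrom, ram_gib), ratio in ratios.items():
    print(f"{chrom} on {ram_gib} GiB host: peak = {ratio:.1f}x max_mem")
```

That works out to about 2.5x for chr2, 1.8x for chr14, and 0.8x for chr18.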

Expected behavior

When max_mem="32g" is set, peak resident memory across the whole
gvl.write pipeline should stay at or below 32 GiB (allowing some
headroom for allocator slack, OS page cache, etc.). Failing that,
the documentation for max_mem should state which sub-stage it
governs and which user-controllable knobs exist for the others.

Hypothesis

Looking at the polars 1.40.1 query plans we've seen elsewhere, the
likely peak is during the per-region genotype materialization
(genoray/pysam lift into NumPy + polars chunking before the
parquet write). max_mem may only cap the streaming-decode buffer
inside genoray, not the materialized region tensors held in RAM
while polars batches them for the parquet write.

If that's right, a per-region chunked write (stream regions to disk
in batches sized by max_mem, flushing the tensor before reading
the next region) would solve it. Happy to PR a patch if that's the
direction you'd want.
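
A rough sketch of the batching logic we have in mind, independent of GenVarLoader's internals: `batch_by_budget` is a hypothetical helper, and the per-region byte estimates are assumed to be available up front (for example from variant count x samples x ploidy x itemsize). It only decides which regions go into each flush; materializing and writing a batch would happen where the comment indicates.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def batch_by_budget(sizes: Iterable[int], budget_bytes: int) -> Iterator[list[int]]:
    """Yield lists of region indices whose summed size fits in budget_bytes.

    A single region larger than the budget is emitted alone, so the loop
    always makes progress (that batch would still exceed the budget).
    """
    batch: list[int] = []
    used = 0
    for idx, size in enumerate(sizes):
        # Close the current batch before it would overflow the budget.
        if batch and used + size > budget_bytes:
            yield batch
            batch, used = [], 0
        batch.append(idx)
        used += size
    if batch:
        yield batch


# Toy example: estimated bytes per region and a 1000-byte budget.
region_bytes = [400, 300, 500, 1200, 100]
batches = list(batch_by_budget(region_bytes, budget_bytes=1000))
for batch in batches:
    # Here: materialize genotypes for `batch`, write them to disk,
    # then drop the arrays before starting the next batch.
    pass
print(batches)  # [[0, 1], [2], [3], [4]]
```

Greedy batching in region order keeps the output order stable, at the cost of sometimes leaving budget unused; whether that trade-off suits the on-disk layout is for the maintainers to judge.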

Workaround

For now we're sizing chr-scale builds on cpu3m (16 vCPU / 128 GiB)
instead of cpu5c (16 vCPU / 64 GiB) for every chrom without a
prior memory profile. This adds ~50 % to the per-chrom cost but
turns OOMs into successful builds. Documenting a rule of thumb of
peak ≈ 2-2.5x max_mem (chr2 reached ~81 GiB with max_mem="32g") in
the docstring for gvl.write would have saved us the misfire.
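
The sizing rule we apply can be written down as a small helper. This is only a sketch of our own heuristic: `smallest_pod` is a hypothetical function, the pod table contains just the two shapes mentioned here, and the peak factor is an empirical guess taken from our runs rather than anything GenVarLoader guarantees.

```python
from __future__ import annotations

# Pod shapes from this issue: name -> RAM in GiB.
PODS_GIB = {"cpu5c": 64, "cpu3m": 128}


def smallest_pod(max_mem_gib: float, peak_factor: float, pods: dict[str, int]) -> str | None:
    """Return the smallest pod whose RAM covers max_mem_gib * peak_factor."""
    need_gib = max_mem_gib * peak_factor
    fitting = [(ram, name) for name, ram in pods.items() if ram >= need_gib]
    # min() on (ram, name) tuples picks the pod with the least RAM.
    return min(fitting)[1] if fitting else None


# A factor of 2.5 matches the ~81 GiB chr2 peak observed with max_mem="32g".
choice = smallest_pod(32, 2.5, PODS_GIB)
print(choice)  # cpu3m
```

With a factor of exactly 2.0 the helper would still pick cpu5c (64 GiB), which is the shape that was OOM-killed on chr2, so the factor needs to sit above 2 for chroms without a profile.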

Adjacent / related
