`gvl.write` exceeds `max_mem` by >2x on chrom-scale builds (peak >64 GiB with `max_mem="32g"`) #162

## Summary

When building a per-chromosome `.svar` dataset from the 1000 Genomes
NYGC 30x phased release, `gvl.write(..., max_mem="32g")` consistently
exceeds the requested cap and gets OOM-killed on a 64 GiB CPU pod.
A second build that completes (chr14) follows a memory trajectory
that crosses 90 % of the 64 GiB ceiling before falling off in the
final write phase. Peak resident was roughly **1.8x to 2.5x the
`max_mem` budget** on the two chroms that overshot (chr14 at ~58 GiB;
chr2 at ~81 GiB when rerun on a 128 GiB pod).

`max_mem` looks like it caps a single internal stage (the variant
chunker?) rather than the whole `gvl.write` pipeline, so callers
sizing pods on `max_mem` get blindsided.

## Environment

| | |
|---|---|
| GenVarLoader | `0.24.1` (latest on PyPI as of 2026-05-18) |
| polars | `1.40.1` |
| numpy | latest |
| Python | 3.12 (official `runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404` image) |
| OS | Ubuntu 24.04 |
| CPU pod | 16 vCPU, 64 GiB RAM (Runpod `cpu5c-16-64`) |
| Container disk | 160 GiB |

## Inputs

- **Variants:** [`1kGP_high_coverage_Illumina.chr2.filtered.SNV_INDEL_SV_phased_panel.vcf.gz`](https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/) — 2.5 GiB compressed, 2504 samples + 698 trio children = 3202 individuals, phased SNV+INDEL+SV
- **BED:** Ensembl MANE Select CDS regions on chr2 (~3,800 regions, derived from the [MANE 1.4 GTF](https://ftp.ensembl.org/pub/release-115/gtf/homo_sapiens/)); BED columns are `chrom, chromStart, chromEnd, splice_donor_start, splice_donor_end, splice_acceptor_start, splice_acceptor_end`

## Reproducer

```python
"""Reproduce gvl.write OOM on chr2 1KG NYGC 30x with max_mem=32g.

Expected: gvl.write resident memory stays at or below 32 GiB.
Actual: peak resident exceeds 64 GiB → OOM-killed (rc=137) on a
64 GiB host. Watching via `top` / cgroup memory.current, you'll see
the process climb past 32 GiB ~30 min in and keep climbing.

Run on a host with at least 100 GiB RAM if you want a non-OOM
baseline for the actual peak.
"""

from __future__ import annotations

import subprocess
from pathlib import Path

import genvarloader as gvl
import polars as pl

EBI = (
    "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/"
    "1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV"
)
VCF_NAME = "1kGP_high_coverage_Illumina.chr2.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"

WORK = Path("/tmp/gvl_oom_repro")
WORK.mkdir(exist_ok=True)


def _curl(url: str, dest: Path) -> None:
    if dest.exists() and dest.stat().st_size > 1024:
        return
    print(f"downloading {url} -> {dest}")
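    # --fail makes curl exit non-zero on HTTP errors instead of saving an
    # error page as the VCF; --location follows redirects.
    subprocess.run(
        ["curl", "--fail", "--location", "--retry", "5", "--retry-delay", "10",
         "-o", str(dest), url],
        check=True,
    )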


# 1. Fetch VCF + tabix index.
vcf_path = WORK / VCF_NAME
tbi_path = WORK / (VCF_NAME + ".tbi")
_curl(f"{EBI}/{VCF_NAME}", vcf_path)
_curl(f"{EBI}/{VCF_NAME}.tbi", tbi_path)

# 2. Build a chr2 MANE-CDS BED. Replace with your own loader if you
#    have one; here we synthesize an equivalent shape: ~3800 CDS
#    regions of 100-1000 bp each, evenly spread across chr2.
chrom_len = 242_193_529  # GRCh38 chr2
n_regions = 3800
import numpy as np
rng = np.random.default_rng(0)
starts = np.sort(rng.integers(1000, chrom_len - 1000, size=n_regions))
lengths = rng.integers(100, 1000, size=n_regions)
bed = pl.DataFrame({
    "chrom":                     ["chr2"] * n_regions,
    "chromStart":                starts.tolist(),
    "chromEnd":                  (starts + lengths).tolist(),
    "splice_donor_start":        (starts - 2).tolist(),
    "splice_donor_end":          starts.tolist(),
    "splice_acceptor_start":     (starts + lengths).tolist(),
    "splice_acceptor_end":       (starts + lengths + 2).tolist(),
})

# 3. The OOMing call: request 32 GiB cap; observe >64 GiB peak.
out_dir = WORK / "chr2.svar"
print(f"gvl version: {gvl.__version__}")
print(f"polars version: {pl.__version__}")
print(f"gvl.write({out_dir}, bed.height={bed.height}, max_mem='32g')")

gvl.write(
    path=str(out_dir),
    bed=bed,
    variants=str(vcf_path),
    overwrite=True,
    max_mem="32g",
)

# Outputs /tmp/gvl_oom_repro/chr2.svar/{metadata.json,variants.parquet,genotypes/*.npy,...}
print(f"output size: {sum(f.stat().st_size for f in out_dir.rglob('*') if f.is_file()) / 1e9:.1f} GB")
```
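
To record the actual peak rather than eyeballing `top`, a small sampler like the one below can wrap the `gvl.write` call. This is a minimal sketch and Linux-only; `parse_vmrss_bytes` and `PeakRSSSampler` are hypothetical helper names, not part of GenVarLoader. It only sees the calling process's own RSS, so memory held by child processes (if any) would be missed, and the cgroup `memory.current` file remains the more complete source on a pod.

```python
import threading
from pathlib import Path


def parse_vmrss_bytes(status_text: str) -> int:
    """Extract VmRSS (reported in kB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1]) * 1024
    raise ValueError("VmRSS not found")


def read_rss_bytes() -> int:
    """Current resident set size of this process (Linux only)."""
    return parse_vmrss_bytes(Path("/proc/self/status").read_text())


class PeakRSSSampler:
    """Context manager that polls RSS in a background thread and keeps the max."""

    def __init__(self, interval_s: float = 1.0) -> None:
        self.interval_s = interval_s
        self.peak_bytes = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        while not self._stop.is_set():
            self.peak_bytes = max(self.peak_bytes, read_rss_bytes())
            self._stop.wait(self.interval_s)

    def __enter__(self) -> "PeakRSSSampler":
        self.peak_bytes = read_rss_bytes()
        self._thread.start()
        return self

    def __exit__(self, *exc) -> bool:
        self._stop.set()
        self._thread.join()
        # One last sample so a spike right before exit is not lost.
        self.peak_bytes = max(self.peak_bytes, read_rss_bytes())
        return False  # never swallow exceptions from the wrapped call
```

Wrapping the reproducer's call as `with PeakRSSSampler() as s: gvl.write(...)` and printing `s.peak_bytes / 2**30` afterwards gives a GiB figure for runs that complete. A SIGKILL from the OOM killer still loses the in-process number, so the sampler is mainly useful on the larger pod.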

## Observed behavior

Four pods we ran against this exact call:

| Chrom | RAM ceiling | `max_mem` arg | Result | Peak resident (observed) |
|-------|-------------|---------------|--------|--------------------------|
| chr2  | 64 GiB      | `"32g"`       | **OOM-kill (rc=137)** at T+22 min after chrom_vcf done | >60 GiB (cgroup OOM) |
| chr2  | 128 GiB     | `"32g"`       | Completes in ~110 min | ~81 GiB (63 % of 128) |
| chr14 | 64 GiB      | `"32g"`       | Completes in ~190 min | ~58 GiB (91 % of 64) before drop in write phase |
| chr18 | 64 GiB      | `"32g"`       | Completes in ~83 min  | ~26 GiB (41 % of 64) |

The 64 GiB chr2 pod was OOM-killed at `T+22 min` (chrom_vcf done at
`16:14:09 UTC`, kernel SIGKILL at `16:36:17 UTC` per our cgroup
log). The 64 GiB chr14 pod tracked a slowly accelerating memory
trajectory (+3-12 pp per 20 min) up to 91 % before polars released
buffers in the final write phase, dropping back to ~30 %.

For reference, chr18 has a similar VCF size to chr14 (834 MB vs 966 MB)
but peaked at only ~26 GiB against chr14's ~58 GiB, so the per-chrom
memory profile varies by ≥2x for similar VCF sizes. We suspect the
variance is driven by **variant count** (chr2 has the most variants
per gene in MANE-CDS) rather than VCF size.
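
The overshoot ratios implied by the table can be recomputed directly. The snippet below only restates the observed peaks from the three runs that completed (the OOM-killed chr2 run is left out because its true peak is unknown); `overshoot` is a hypothetical helper name.

```python
MAX_MEM_GIB = 32  # the max_mem="32g" argument used in every run

# (chrom, pod RAM in GiB, observed peak resident in GiB), from the table above
runs = [
    ("chr2", 128, 81),
    ("chr14", 64, 58),
    ("chr18", 64, 26),
]


def overshoot(peak_gib: float, max_mem_gib: float = MAX_MEM_GIB) -> float:
    """Peak resident expressed as a multiple of the requested max_mem."""
    return peak_gib / max_mem_gib


for chrom, ram_gib, peak_gib in runs:
    print(
        f"{chrom}: peak {peak_gib} GiB = {overshoot(peak_gib):.1f}x max_mem, "
        f"{peak_gib / ram_gib:.0%} of pod RAM"
    )

# Spread between the two chroms with similar VCF sizes.
chr14_vs_chr18 = 58 / 26
print(f"chr14 / chr18 peak ratio: {chr14_vs_chr18:.1f}x")
```

Only chr18 stays under the requested cap; chr14 and chr2 land at roughly 1.8x and 2.5x of it.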

## Expected behavior

When `max_mem="32g"` is set, peak resident across the whole
`gvl.write` pipeline should stay at or below 32 GiB (allowing some
headroom for amortizing allocator slack, OS page cache, etc.) — or
the documentation for `max_mem` should clarify which sub-stage it
governs and what the user-controllable knobs are for the others.

## Hypothesis

Looking at the polars 1.40.1 query plans we've seen elsewhere, the
likely peak is during the per-region genotype materialization
(`genoray`/`pysam` lift into NumPy + polars chunking before the
parquet write). `max_mem` may only cap the streaming-decode buffer
inside `genoray`, not the materialized region tensors held in RAM
while polars batches them for the parquet write.

If that's right, a per-region chunked write (stream regions to disk
in batches sized by `max_mem`, flushing the tensor before reading
the next region) would solve it. Happy to PR a patch if that's the
direction you'd want.
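
To make the proposal concrete, here is a minimal sketch of the batching logic we have in mind. It is not GenVarLoader's internal code: `estimate_region_bytes` and `batch_regions` are hypothetical names, and the size estimate (variants x samples x ploidy x itemsize) is an assumption about how the materialized genotype tensor scales.

```python
from typing import Iterator, Sequence


def estimate_region_bytes(
    n_variants: int, n_samples: int, ploidy: int = 2, itemsize: int = 1
) -> int:
    """Assumed footprint of one region's dense genotype tensor."""
    return n_variants * n_samples * ploidy * itemsize


def batch_regions(
    region_bytes: Sequence[int], budget_bytes: int
) -> Iterator[list[int]]:
    """Group consecutive region indices so each batch fits the budget.

    A single region larger than the budget still gets its own batch,
    since it cannot be split at this level.
    """
    batch: list[int] = []
    used = 0
    for idx, size in enumerate(region_bytes):
        if batch and used + size > budget_bytes:
            yield batch  # caller writes this batch to disk and frees it
            batch, used = [], 0
        batch.append(idx)
        used += size
    if batch:
        yield batch


# Toy example: five regions against a budget of 8 "bytes".
batches = list(batch_regions([4, 4, 4, 10, 1], budget_bytes=8))
print(batches)
```

The writer would then materialize, write, and release one batch at a time, so the held tensors are bounded by the budget plus at most one oversized region.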

## Workaround

For now we're sizing chr-scale builds on `cpu3m` (16 vCPU / 128 GiB)
instead of `cpu5c` (16 vCPU / 64 GiB) on every chrom we don't have a
prior memory profile for. This adds ~50 % to the per-chrom cost but
turns OOMs into successful builds. Documenting a rule of thumb such as
`peak ≈ 2-2.5x max_mem` (chr2 peaked at ~81 GiB with `max_mem="32g"`)
in the docstring for `gvl.write` would have saved us the misfire.
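
The sizing rule we apply can be written down as a small helper. This is a sketch under our own assumptions: `pick_pod_ram_gib` is a hypothetical name, the default `overshoot=2.5` comes from the single chr2 run on the 128 GiB pod, and the 10 % headroom is an arbitrary safety margin rather than a measured value.

```python
from typing import Iterable, Optional


def pick_pod_ram_gib(
    max_mem_gib: float,
    pod_sizes_gib: Iterable[int],
    overshoot: float = 2.5,   # assumed worst-case peak / max_mem ratio
    headroom: float = 0.10,   # assumed extra margin on top of the peak
) -> Optional[int]:
    """Smallest pod RAM that covers the expected peak, or None if none does."""
    needed = max_mem_gib * overshoot * (1 + headroom)
    for size in sorted(pod_sizes_gib):
        if size >= needed:
            return size
    return None


# The two pod shapes from this issue: cpu5c (64 GiB) and cpu3m (128 GiB).
print(pick_pod_ram_gib(32, [64, 128]))
```

With `max_mem="32g"` this selects the 128 GiB pod, which matches what we ended up doing by hand. Going the other way, staying on the 64 GiB pod under these assumptions would mean dropping `max_mem` to about 23 GiB or less, which we have not tested.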

## Adjacent / related

- Issue #82 "RefDatasets, lower memory usage, and breaking changes
  to transforms" (closed) touched memory in the train-time loader;
  this issue is about the **build-time** `gvl.write` path, which we
  don't think #82 covered.

