Summary
In genvarloader==0.36.0, iterating a genvarloader.Dataset concurrently from multiple spawn-ed worker processes intermittently hard-deadlocks in the Rust/rayon data-loading path. The hang is nondeterministic: the same configuration runs to completion on one launch and hangs on the next, which points to a timing-sensitive race rather than a deterministic bug. It is aggravated by rayon thread oversubscription, which _threads.cap_threads() currently fails to prevent (see the "Contributing factor" section below).
Environment
- genvarloader==0.36.0 (Rust backend, genvarloader.abi3.so)
- Python 3.12.13, torch==2.11.0+cu130, Linux-6.17.0-19-generic x86_64, glibc 2.39
- RunPod container. CFS CPU quota = 15.3 cores (/sys/fs/cgroup/cpu.max → 1530000 100000), but nproc and os.sched_getaffinity(0) both report 16 (a CFS quota is not reflected in the cpuset/affinity, so it is invisible to affinity-based CPU detection).
- Ambient RAYON_NUM_THREADS=16 set in the base image environment.
- Worker processes created via multiprocessing.get_context("spawn") + concurrent.futures.ProcessPoolExecutor. This is spawn, not fork, so each worker is a fresh interpreter with its own rayon global pool. (This rules out the classic rayon-pool-inherited-across-fork() deadlock.)
What we observed
Workload: N worker processes per GPU, each training a small model, each iterating its own genvarloader.Dataset (via a GeneRna pipeline) as the input source.
- tasks_per_gpu=5: hung after ~25 iterations. GPU util 0%, flat CPU, load average ≈ 1.0, worker threads blocked in futex (a lock that never releases; a deadlock, not thrashing). One worker went zombie → the ProcessPoolExecutor then wedged.
- tasks_per_gpu=2: one launch ran healthily (38 iterations, GPU 100%); a fresh launch with identical config hung after 40 iterations with the same futex/flat-CPU signature. → a race, not simple oversubscription.
- tasks_per_gpu=1 (single worker process → a single rayon pool doing parallel work at any moment): stable, no hangs observed over a long run.
The signature — threads parked on a futex, ~1.0 load, 0% GPU — is a genuine deadlock inside the parallel loader, not merely slow/oversubscribed execution (which would show high CPU, not flat).
Hypothesis
Because workers are spawn-ed, each has an independent address space and its own rayon global pool, so there is no cross-process rayon contention by construction. That suggests a per-process, timing-sensitive deadlock inside the rayon-parallel loading code whose race window is widened by CPU contention. With many workers oversubscribing ~15 cores, thread scheduling stretches and the window opens; with tasks_per_gpu=1 there is effectively no contention and the race essentially never fires (hence the reliable workaround).
Candidate root causes worth auditing in the Rust path:
- A Mutex/RwLock held across a rayon parallel region (par_iter/join/scope) where a task re-acquires the same lock → classic rayon deadlock, timing-dependent.
- Nested/re-entrant rayon install/broadcast, or blocking on a channel/Condvar from inside a rayon worker while holding a pool thread.
- Any global/OnceCell/lazy-init guard on the load path taken concurrently under contention.
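As an illustration of the first candidate, here is a minimal Python analogy (not genvarloader or rayon code): the caller holds a non-reentrant lock across a parallel region while the tasks try to take the same lock. The names task, run_with_lock_held and run_without_lock are made up for this sketch, and the acquire timeout exists only so the example terminates instead of hanging. This version blocks deterministically; as far as I understand, the rayon equivalent can depend on scheduling, because a pool thread waiting inside join/scope may pick up another task that wants a lock it already holds.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Non-reentrant lock, standing in for a Rust Mutex on the load path.
lock = threading.Lock()


def task(i: int) -> bool:
    # A real task would block forever here; the timeout makes the hang observable.
    acquired = lock.acquire(timeout=0.2)
    if acquired:
        lock.release()
    return acquired


def run_with_lock_held() -> List[bool]:
    with ThreadPoolExecutor(max_workers=4) as pool:
        with lock:  # lock is held across the whole parallel region
            return list(pool.map(task, range(4)))


def run_without_lock() -> List[bool]:
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(task, range(4)))
```

In run_with_lock_held every task times out waiting for the lock; without the timeout, the parallel region would never finish. Releasing the lock before entering the parallel region (run_without_lock) removes the cycle.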
Contributing factor: cap_threads() cannot cap when RAYON_NUM_THREADS is preset
genvarloader/_threads.py:
    def cap_threads() -> int:
        global _NUM_THREADS
        if _NUM_THREADS is None:
            _NUM_THREADS = _resolve_num_threads()
            os.environ.setdefault("RAYON_NUM_THREADS", str(_NUM_THREADS))
        return _NUM_THREADS
Two issues lead to oversubscription, which widens the race window described above:
- setdefault is a no-op when RAYON_NUM_THREADS is already set. Our base image exports RAYON_NUM_THREADS=16, which spawn-ed workers inherit, so cap_threads() never lowers it. Each of N workers then builds a 16-thread rayon pool ⇒ N×16 threads on a 15-core cgroup.
- _detect_cpus() uses os.sched_getaffinity(0), which ignores a CFS quota. Even with no ambient env var, this host resolves 16, not the true 15.3-core budget, so per-worker pools overshoot the cgroup even in the "clean" case. Reading /sys/fs/cgroup/cpu.max (with a cgroup-v1 cpu.cfs_quota_us/cpu.cfs_period_us fallback) would give the real quota.
Fixing the deadlock is the primary ask; capping threads correctly is a secondary fix that reduces how often it triggers.
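A rough sketch of what quota-aware detection could look like, to make the suggestion concrete. This is not the current genvarloader implementation: parse_cpu_max, cgroup_cpu_quota, effective_cpus and force_rayon_threads are hypothetical names, and the cgroup-v1 path assumes the usual /sys/fs/cgroup/cpu mount point, which may differ on other hosts.

```python
import math
import os
from pathlib import Path
from typing import Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Parse cgroup-v2 cpu.max content: "<quota> <period>" or "max <period>"."""
    parts = text.split()
    if len(parts) != 2 or parts[0] == "max":
        return None  # no quota configured, or unexpected format
    quota, period = int(parts[0]), int(parts[1])
    if quota <= 0 or period <= 0:
        return None
    return quota / period


def cgroup_cpu_quota() -> Optional[float]:
    """Return the CFS quota in cores, or None if no quota is visible."""
    v2 = Path("/sys/fs/cgroup/cpu.max")
    try:
        if v2.exists():
            return parse_cpu_max(v2.read_text())
        # cgroup-v1 fallback (ASSUMED mount point).
        base = Path("/sys/fs/cgroup/cpu")
        quota = int((base / "cpu.cfs_quota_us").read_text())
        period = int((base / "cpu.cfs_period_us").read_text())
    except (OSError, ValueError):
        return None
    if quota <= 0 or period <= 0:  # v1 reports -1 when unlimited
        return None
    return quota / period


def effective_cpus() -> int:
    """min(affinity count, floor(quota)), never below 1."""
    try:
        n = len(os.sched_getaffinity(0))
    except AttributeError:  # sched_getaffinity is not available on every platform
        n = os.cpu_count() or 1
    quota = cgroup_cpu_quota()
    if quota is not None:
        n = min(n, max(1, math.floor(quota)))
    return n


def force_rayon_threads(n: int) -> None:
    # Plain assignment overwrites an ambient value; setdefault would not.
    os.environ["RAYON_NUM_THREADS"] = str(n)
```

On the host described above, parse_cpu_max("1530000 100000") gives 15.3, so effective_cpus() would resolve to 15 rather than 16. Whether to round the quota down or up, and whether to divide it further across sibling worker processes, is a policy choice left open here.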
Workaround
Run a single Dataset-iterating process at a time (tasks_per_gpu=1). Explicitly setting GVL_NUM_THREADS / RAYON_NUM_THREADS low does not fully prevent the multi-worker hang (and, per the setdefault issue, an ambient RAYON_NUM_THREADS silently wins in workers regardless).
Asks
- Audit the rayon-parallel load path (genvarloader.abi3.so) for a lock-across-parallel-region / nested-install / condvar-in-worker deadlock that a stress harness under CPU contention can reproduce.
- In cap_threads(), overwrite RAYON_NUM_THREADS (don't setdefault) once GVL has resolved its own count, and make _detect_cpus() honor the cgroup CFS quota, not just affinity.
Happy to run a debug build or a stress reproducer on the affected host.
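For reference, the stress reproducer could be shaped roughly like the sketch below. It mirrors the setup described above (spawn context + ProcessPoolExecutor) but is not our actual training code: iterate_dataset is a hypothetical placeholder whose body would be replaced with real genvarloader.Dataset iteration, and init_worker, guarded, summarize and run_stress are names invented for this sketch. Setting RAYON_NUM_THREADS in the initializer assumes rayon reads it when the pool is first built in that worker, which I have not verified for genvarloader. faulthandler only dumps Python-level stacks, so it shows which Python call is stuck, not the Rust frames.

```python
import concurrent.futures as cf
import faulthandler
import multiprocessing as mp
import os
import time


def init_worker(rayon_threads: int) -> None:
    # Runs once in each spawned worker, before any task.
    # ASSUMPTION: the rayon pool is built later and picks this value up.
    os.environ["RAYON_NUM_THREADS"] = str(rayon_threads)


def iterate_dataset(worker_id: int, n_iters: int) -> int:
    # HYPOTHETICAL placeholder: replace the body with real Dataset iteration.
    done = 0
    for _ in range(n_iters):
        time.sleep(0.001)
        done += 1
    return done


def guarded(worker_id: int, n_iters: int, hang_after_s: float) -> int:
    # If the task is still running after hang_after_s, dump the Python
    # stacks to stderr and exit the worker so the parent does not wedge.
    faulthandler.dump_traceback_later(hang_after_s, exit=True)
    try:
        return iterate_dataset(worker_id, n_iters)
    finally:
        faulthandler.cancel_dump_traceback_later()


def summarize(futures) -> dict:
    """Count futures that finished cleanly vs. failed (e.g. BrokenProcessPool)."""
    ok = failed = 0
    for f in cf.as_completed(futures):
        if f.exception() is None:
            ok += 1
        else:
            failed += 1
    return {"ok": ok, "failed": failed}


def run_stress(workers: int = 5, n_iters: int = 1000,
               rayon_threads: int = 2, hang_after_s: float = 600.0) -> dict:
    ctx = mp.get_context("spawn")
    with cf.ProcessPoolExecutor(max_workers=workers, mp_context=ctx,
                                initializer=init_worker,
                                initargs=(rayon_threads,)) as pool:
        futures = [pool.submit(guarded, i, n_iters, hang_after_s)
                   for i in range(workers)]
        return summarize(futures)
```

Because the workers are spawn-ed, run_stress() has to be called from a script under an `if __name__ == "__main__":` guard. Looping it over several values of workers and rayon_threads would show whether the hang rate tracks oversubscription.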