Skip to content

Nondeterministic rayon deadlock when iterating Dataset from concurrent (spawn) worker processes (v0.36.0) #263

Description

@d-laub

Summary

In genvarloader==0.36.0, iterating a genvarloader.Dataset concurrently from multiple spawn-ed worker processes intermittently hard-deadlocks in the Rust/rayon data-loading path. It is nondeterministic — the same configuration runs to completion on one launch and hangs on the next — which points to a timing-sensitive race rather than a deterministic bug. The hang is aggravated by rayon thread oversubscription, which _threads.cap_threads() currently fails to prevent (see Contributing factors).

Environment

  • genvarloader==0.36.0 (Rust backend, genvarloader.abi3.so)
  • Python 3.12.13, torch==2.11.0+cu130, Linux-6.17.0-19-generic x86_64, glibc 2.39
  • RunPod container. CFS CPU quota = 15.3 cores (/sys/fs/cgroup/cpu.max1530000 100000), but nproc and os.sched_getaffinity(0) both report 16 (a CFS quota is not reflected in the cpuset/affinity, so it is invisible to affinity-based CPU detection).
  • Ambient RAYON_NUM_THREADS=16 set in the base image environment.
  • Worker processes created via multiprocessing.get_context("spawn") + concurrent.futures.ProcessPoolExecutorspawn, not fork, so each worker is a fresh interpreter with its own rayon global pool. (This rules out the classic rayon-pool-inherited-across-fork() deadlock.)

What we observed

Workload: N worker processes per GPU, each training a small model, each iterating its own genvarloader.Dataset (via a GeneRna pipeline) as the input source.

  • tasks_per_gpu=5: hung after ~25 iterations. GPU util 0%, flat CPU, load average ≈ 1.0, worker threads blocked in futex (a lock that never releases — a deadlock, not thrashing). One worker went zombie → the ProcessPoolExecutor then wedged.
  • tasks_per_gpu=2: one launch ran healthily (38 iterations, GPU 100%); a fresh launch with identical config hung after 40 iterations with the same futex/flat-CPU signature. → a race, not simple oversubscription.
  • tasks_per_gpu=1 (single worker process → a single rayon pool doing parallel work at any moment): stable, no hangs observed over a long run.

The signature — threads parked on a futex, ~1.0 load, 0% GPU — is a genuine deadlock inside the parallel loader, not merely slow/oversubscribed execution (which would show high CPU, not flat).

Hypothesis

Because workers are spawn-ed, each has an independent address space and its own rayon global pool, so there is no cross-process rayon contention by construction. That suggests a per-process, timing-sensitive deadlock inside the rayon-parallel loading code whose race window is widened by CPU contention. With many workers oversubscribing ~15 cores, thread scheduling stretches and the window opens; with tasks_per_gpu=1 there is effectively no contention and the race essentially never fires (hence the reliable workaround).

Candidate root causes worth auditing in the Rust path:

  • A Mutex/RwLock held across a rayon parallel region (par_iter/join/scope) where a task re-acquires the same lock → classic rayon deadlock, timing-dependent.
  • Nested/re-entrant rayon install/broadcast, or blocking on a channel/Condvar from inside a rayon worker while holding a pool thread.
  • Any global/OnceCell/lazy-init guard on the load path taken concurrently under contention.

Contributing factor: cap_threads() cannot cap when RAYON_NUM_THREADS is preset

genvarloader/_threads.py:

def cap_threads() -> int:
    global _NUM_THREADS
    if _NUM_THREADS is None:
        _NUM_THREADS = _resolve_num_threads()
        os.environ.setdefault("RAYON_NUM_THREADS", str(_NUM_THREADS))
    return _NUM_THREADS

Two issues that lead to oversubscription (which widens the race window above):

  1. setdefault is a no-op when RAYON_NUM_THREADS is already set. Our base image exports RAYON_NUM_THREADS=16, which spawn-ed workers inherit, so cap_threads() never lowers it. Each of N workers then builds a 16-thread rayon pool ⇒ N×16 threads on a 15-core cgroup.
  2. _detect_cpus() uses os.sched_getaffinity(0), which ignores a CFS quota. Even with no ambient env var, this host resolves 16, not the true 15.3-core budget — so per-worker pools overshoot the cgroup even in the "clean" case. Reading /sys/fs/cgroup/cpu.max (with a cgroup-v1 cpu.cfs_quota_us/cpu.cfs_period_us fallback) would give the real quota.

Fixing the deadlock is the primary ask; capping threads correctly is a secondary fix that reduces how often it triggers.

Workaround

Run a single Dataset-iterating process at a time (tasks_per_gpu=1). Explicitly setting GVL_NUM_THREADS / RAYON_NUM_THREADS low does not fully prevent the multi-worker hang (and, per the setdefault issue, an ambient RAYON_NUM_THREADS silently wins in workers regardless).

Asks

  1. Audit the rayon-parallel load path (genvarloader.abi3.so) for a lock-across-parallel-region / nested-install / condvar-in-worker deadlock that a stress harness under CPU contention can reproduce.
  2. In cap_threads(), overwrite RAYON_NUM_THREADS (don't setdefault) once GVL has resolved its own count, and make _detect_cpus() honor the cgroup CFS quota, not just affinity.

Happy to run a debug build or a stress reproducer on the affected host.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions