Nondeterministic rayon deadlock when iterating Dataset from concurrent (spawn) worker processes (v0.36.0)

## Summary

In `genvarloader==0.36.0`, iterating a `genvarloader.Dataset` concurrently from **multiple `spawn`-ed worker processes** intermittently **hard-deadlocks** in the Rust/rayon data-loading path. The hang is **nondeterministic**: the same configuration runs to completion on one launch and hangs on the next, which points to a timing-sensitive race rather than a deterministic bug. It is aggravated by rayon thread oversubscription, which `_threads.cap_threads()` currently fails to prevent (see [Contributing factor](#contributing-factor-cap_threads-cannot-cap-when-rayon_num_threads-is-preset)).

## Environment

- `genvarloader==0.36.0` (Rust backend, `genvarloader.abi3.so`)
- Python 3.12.13, `torch==2.11.0+cu130`, `Linux-6.17.0-19-generic x86_64`, glibc 2.39
- RunPod container. **CFS CPU quota = 15.3 cores** (`/sys/fs/cgroup/cpu.max` → `1530000 100000`), but `nproc` and `os.sched_getaffinity(0)` both report **16** (a CFS quota is not reflected in the cpuset/affinity, so it is invisible to affinity-based CPU detection).
- Ambient `RAYON_NUM_THREADS=16` set in the base image environment.
- Worker processes created via `multiprocessing.get_context("spawn")` + `concurrent.futures.ProcessPoolExecutor` — **`spawn`, not `fork`**, so each worker is a fresh interpreter with its own rayon global pool. (This rules out the classic *rayon-pool-inherited-across-`fork()`* deadlock.)

## What we observed

Workload: N worker processes per GPU, each training a small model, each iterating its own `genvarloader.Dataset` (via a `GeneRna` pipeline) as the input source.

- **`tasks_per_gpu=5`**: hung after ~25 iterations. GPU util 0%, **flat CPU, load average ≈ 1.0**, worker threads blocked in `futex` (a lock that never releases — a deadlock, not thrashing). One worker went zombie → the `ProcessPoolExecutor` then wedged.
- **`tasks_per_gpu=2`**: one launch ran healthily (38 iterations, GPU 100%); a **fresh launch with identical config hung after 40 iterations** with the same futex/flat-CPU signature. → **a race, not simple oversubscription.**
- **`tasks_per_gpu=1`** (single worker process → a single rayon pool doing parallel work at any moment): **stable**, no hangs observed over a long run.

The signature — threads parked on a futex, ~1.0 load, 0% GPU — is a genuine deadlock inside the parallel loader, not merely slow/oversubscribed execution (which would show high CPU, not flat).
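The futex observation can be re-checked on a hung worker without attaching a debugger. The following is a minimal sketch, assuming a Linux `/proc` that exposes a per-thread `wchan` file (it can read `0` when the thread is runnable or kernel symbols are unavailable). The names `thread_wait_channels`, `futex_fraction`, and the `proc_root` parameter are ours for illustration and are not part of genvarloader.

```python
from collections import Counter
from pathlib import Path


def thread_wait_channels(pid: int, proc_root: str = "/proc") -> Counter:
    """Count the kernel wait channel of every thread in process `pid`."""
    counts: Counter = Counter()
    for task in Path(proc_root, str(pid), "task").iterdir():
        try:
            wchan = (task / "wchan").read_text().strip()
        except OSError:
            continue  # the thread exited while we were scanning
        counts[wchan or "0"] += 1
    return counts


def futex_fraction(counts: Counter) -> float:
    """Fraction of threads whose wait channel name mentions a futex."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    parked = sum(n for name, n in counts.items() if "futex" in name)
    return parked / total
```

A fraction near 1.0 that stays constant across repeated samples, together with flat CPU, matches the signature described above. Idle pool threads also park on a futex, so a single sample is not conclusive on its own.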

## Hypothesis

Because workers are **`spawn`**-ed, each has an independent address space and its own rayon global pool, so no rayon state (pool, locks, queues) is shared *across processes*; what the workers do compete for is CPU time. That suggests a **per-process, timing-sensitive deadlock inside the rayon-parallel loading code** whose race window is widened by CPU contention. With many workers oversubscribing the ~15-core quota, thread scheduling stretches and the window opens; with `tasks_per_gpu=1` there is effectively no contention and the race essentially never fires (hence the reliable workaround).

Candidate root causes worth auditing in the Rust path:
- A `Mutex`/`RwLock` held across a `rayon` parallel region (`par_iter`/`join`/`scope`) where a task re-acquires the same lock → classic rayon deadlock, timing-dependent.
- Nested/re-entrant rayon `install`/`broadcast`, or blocking on a `channel`/`Condvar` from inside a rayon worker while holding a pool thread.
- Any global/`OnceCell`/lazy-init guard on the load path taken concurrently under contention.
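To make the first candidate concrete, here is a Python analogy of "lock held across a parallel region". It is not genvarloader or rayon code: `inner_task_gets_lock` is a hypothetical name, `threading.Lock` stands in for a non-reentrant Rust `Mutex`, and a thread pool stands in for the rayon pool. The nested task uses a timeout so the sketch reports the would-be deadlock instead of hanging. Unlike the suspected Rust bug, this version fails every time; in a work-stealing pool, whether the pattern fires would presumably depend on which thread ends up running the nested task.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def inner_task_gets_lock(hold_lock_while_waiting: bool, timeout_s: float = 0.2) -> bool:
    """Return True if the nested task could take the lock, False if it timed out."""
    lock = threading.Lock()  # non-reentrant, loosely analogous to a Rust Mutex

    def inner_task() -> bool:
        # The nested task needs the same lock as its caller.
        acquired = lock.acquire(timeout=timeout_s)
        if acquired:
            lock.release()
        return acquired

    with ThreadPoolExecutor(max_workers=2) as pool:
        if hold_lock_while_waiting:
            with lock:
                # Buggy shape: the caller blocks on the nested task
                # while still holding the lock the task needs.
                return pool.submit(inner_task).result()
        with lock:
            pass  # do the locked work first, then release
        # Safe shape: the lock is released before entering the parallel region.
        return pool.submit(inner_task).result()
```

With `hold_lock_while_waiting=True` the nested task times out; without the timeout it would block forever, which is the shape worth looking for in the Rust load path.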

## Contributing factor: `cap_threads()` cannot cap when `RAYON_NUM_THREADS` is preset

`genvarloader/_threads.py`:

```python
def cap_threads() -> int:
    global _NUM_THREADS
    if _NUM_THREADS is None:
        _NUM_THREADS = _resolve_num_threads()
        os.environ.setdefault("RAYON_NUM_THREADS", str(_NUM_THREADS))
    return _NUM_THREADS
```

Two issues that lead to oversubscription (which widens the race window above):

1. **`setdefault` is a no-op when `RAYON_NUM_THREADS` is already set.** Our base image exports `RAYON_NUM_THREADS=16`, which `spawn`-ed workers inherit, so `cap_threads()` never lowers it. Each of N workers then builds a **16-thread** rayon pool ⇒ `N×16` threads against a 15.3-core cgroup quota.
2. **`_detect_cpus()` uses `os.sched_getaffinity(0)`, which ignores a CFS *quota*.** Even with no ambient env var, this host resolves 16, not the true 15.3-core budget — so per-worker pools overshoot the cgroup even in the "clean" case. Reading `/sys/fs/cgroup/cpu.max` (with a cgroup-v1 `cpu.cfs_quota_us`/`cpu.cfs_period_us` fallback) would give the real quota.

Fixing the deadlock is the primary ask; capping threads correctly is a secondary fix that reduces how often it triggers.
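A rough sketch of what the two thread-capping fixes could look like. It is a sketch under assumptions, not a patch against the real `_threads.py`: the function names `cgroup_cpu_quota`, `effective_cpus`, and `cap_threads_overwrite` are hypothetical, the cgroup-v1 files are assumed to live under `<root>/cpu/`, and rounding the quota down is a design choice (rounding up would also be defensible).

```python
import math
import os
from pathlib import Path
from typing import MutableMapping, Optional


def cgroup_cpu_quota(root: str = "/sys/fs/cgroup") -> Optional[float]:
    """Return the CFS quota in cores, or None if unlimited or unreadable."""
    base = Path(root)
    try:
        # cgroup v2: "<quota> <period>" or "max <period>"
        quota_s, period_s = (base / "cpu.max").read_text().split()
        if quota_s != "max" and int(period_s) > 0:
            return int(quota_s) / int(period_s)
        return None
    except (OSError, ValueError):
        pass
    try:
        # cgroup v1 fallback (assumed mount layout); quota is -1 when unlimited
        quota = int((base / "cpu" / "cpu.cfs_quota_us").read_text())
        period = int((base / "cpu" / "cpu.cfs_period_us").read_text())
    except (OSError, ValueError):
        return None
    return quota / period if quota > 0 and period > 0 else None


def effective_cpus(quota: Optional[float], affinity: int) -> int:
    """Combine the affinity count with the CFS quota, never returning less than 1."""
    if quota is None:
        return max(1, affinity)
    return max(1, min(affinity, math.floor(quota)))


def cap_threads_overwrite(n_threads: int, env: MutableMapping[str, str] = os.environ) -> int:
    """Overwrite RAYON_NUM_THREADS instead of using setdefault."""
    env["RAYON_NUM_THREADS"] = str(n_threads)
    return n_threads
```

On the affected host this would resolve `effective_cpus(15.3, 16)` to 15 rather than 16, and an inherited `RAYON_NUM_THREADS=16` would be replaced rather than silently kept.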

## Workaround

Run a single `Dataset`-iterating process at a time (`tasks_per_gpu=1`). Explicitly setting `GVL_NUM_THREADS` / `RAYON_NUM_THREADS` low does **not** fully prevent the multi-worker hang (and, per the `setdefault` issue, an ambient `RAYON_NUM_THREADS` silently wins in workers regardless).

## Asks

1. Audit the rayon-parallel load path (`genvarloader.abi3.so`) for a lock-across-parallel-region / nested-install / condvar-in-worker deadlock that a stress harness under CPU contention can reproduce.
2. In `cap_threads()`, overwrite `RAYON_NUM_THREADS` (don't `setdefault`) once GVL has resolved its own count, and make `_detect_cpus()` honor the cgroup CFS quota, not just affinity.
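For the first ask, a stress harness could be shaped roughly as follows. This is a sketch under assumptions: `run_stress` and `_burn_cpu` are hypothetical names, and `target` is a placeholder for a zero-argument, module-level function that opens and iterates a `Dataset` (not shown here, since it depends on the dataset at hand). Workers that outlive the timeout are reported as hung and killed; optional busy-loop processes add CPU contention.

```python
import multiprocessing as mp
import time
from typing import Callable, List


def _burn_cpu() -> None:
    """Spin forever to create CPU contention; killed by the harness."""
    while True:
        pass


def run_stress(
    target: Callable[[], None],
    n_workers: int,
    timeout_s: float,
    n_burners: int = 0,
    start_method: str = "spawn",
) -> List[str]:
    """Run `target` in n_workers processes; return 'ok', 'hung', or 'exit=<code>' per worker."""
    ctx = mp.get_context(start_method)
    burners = [ctx.Process(target=_burn_cpu, daemon=True) for _ in range(n_burners)]
    workers = [ctx.Process(target=target, daemon=True) for _ in range(n_workers)]
    for proc in burners + workers:
        proc.start()
    deadline = time.monotonic() + timeout_s
    statuses: List[str] = []
    try:
        for proc in workers:
            # All workers share one overall deadline.
            proc.join(max(0.0, deadline - time.monotonic()))
            if proc.is_alive():
                proc.kill()
                proc.join()
                statuses.append("hung")
            elif proc.exitcode == 0:
                statuses.append("ok")
            else:
                statuses.append(f"exit={proc.exitcode}")
    finally:
        for proc in burners:
            proc.kill()
            proc.join()
    return statuses
```

Under `spawn`, `target` must be importable from the launching module and the call to `run_stress` must sit behind an `if __name__ == "__main__":` guard. Looping the harness over many launches with `n_workers` at 2 and 5 would mirror the configurations that hung for us.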

Happy to run a debug build or a stress reproducer on the affected host.

