Skip to content

TL/CUDA: fix NVLS fallback and EC persistent hangs#1320

Open
ikryukov wants to merge 5 commits into
openucx:masterfrom
ikryukov:fix-cuda-nvls-and-ec-persistent-hangs
Open

TL/CUDA: fix NVLS fallback and EC persistent hangs#1320
ikryukov wants to merge 5 commits into
openucx:masterfrom
ikryukov:fix-cuda-nvls-and-ec-persistent-hangs

Conversation

@ikryukov

@ikryukov ikryukov commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

What

Fix TL/CUDA NVLS hangs/failures that broke test_c10d_ucc.py (e.g.
test_ddp_checkpointing_dynamic_module) on aarch64 (Grace/GB200/VR200) and in
containers without ptrace permission:

  • NVLS init deadlock: if one rank fails to import the peer multicast handle
    (e.g. pidfd_getfd EPERM in a restricted container), it fell back while the
    other ranks blocked forever in the collective cuMulticastBindAddr. All ranks
    now exchange import status and disable NVLS together.
  • EC persistent executor hang on weakly-ordered CPUs: the CPU↔GPU shutdown
    and task-publish flags in device-mapped memory used inner-shareable fences,
    which don't order against the GPU on aarch64, so the persistent kernel never
    saw the update. Use the bus (outer-shareable) fences.
  • Wrong-result after NVLS fallback: allreduce was still routed to the NVLS
    algorithm (based on a static HW capability check) even when NVLS did not
    initialize. Advertise/route NVLS allreduce only when NVLS is actually enabled.
  • Diagnostics: on a peer-fd import denial, emit one actionable warning
    (ptrace_scope / CAP_SYS_PTRACE / docker / enroot hints); keep the rest at
    debug so a supported fallback is not noisy.

Why ?

On aarch64 (GB200/VR200) and permission-restricted containers, NVLS either
deadlocked at team creation (hang in the first DDP collective) or silently
produced wrong results after falling back. NCCL worked, so it was UCC-specific.
Root-caused on a VR200 node: the progress thread was stuck in
ucc_cuda_executor_persistent_stop, and the peer-fd EPERM caused an asymmetric
NVLS init.

Fixes: RM 5113172

How ?

  • tl_cuda_nvls: add STATE_SYNC_STATUS that allgathers each rank's import
    result; disable NVLS team-wide on any failure. Track nvls.enabled (set only
    after the final NVLS barrier).
  • ec_cuda_executor_persistent: publish/poll shutdown and task-post flags with
    ucc_memory_bus_store_fence() / ucc_memory_bus_load_fence().
  • tl_cuda_team / allreduce: gate advertising and dispatch of NVLS allreduce
    on nvls.enabled.

Validated on VR200 (aarch64) and H100 (x86): test_c10d_ucc.py passes
(50 passed / 14 skipped), single-node and multinode NVLS allreduce correct,
and the forced-EPERM path falls back cleanly instead of hanging.

@dpressle

dpressle commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

👀 Investigating Lint (codestyle)

@dpressle

dpressle commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

🤖 CI Triage AgentLint (codestyle) · commit dd2e7e72

The logs are crystal clear. Here is the complete diagnosis:


Summary: Lint (codestyle) check failed because all 4 commit titles in PR #1320 exceed the 50-character limit enforced by the commit title linter.

Root cause: The codestyle workflow enforces a strict 50-character maximum for non-merge commit titles. All four commits on the fix-cuda-nvls-and-ec-persistent-hangs branch violate this rule:

Commit Title Length Limit
TL/CUDA: only advertise/route NVLS allreduce when NVLS actually initialized 75 50
EC/CUDA: use bus store fence when publishing tasks to persistent kernel 71 50
EC/CUDA: fix persistent executor shutdown hang on weakly-ordered CPUs 69 50
TL/CUDA: fix NVLS init deadlock on asymmetric handle import failure 67 50

The linter script checks [ ${#msg} -gt 50 ] and only exempts titles starting with Merge. All four titles pass the valid-header format check (TL/, EC/, etc.) but fail solely on length.

Implicated commit: [REDACTED:Hex High Entropy String] (the HEAD commit, plus the 3 earlier commits in the PR range — all by the PR author)

File: .github/workflows/codestyle.yaml (the check_title() function enforcing the 50-char rule)

Suggested fix: Shorten each of the four commit titles to ≤ 50 characters and amend/rebase the commits. Suggested rewrites (all ≤ 50 chars):

  • TL/CUDA: only advertise NVLS AR when initialized (49)
  • EC/CUDA: bus store fence for persistent kernel tasks → needs trimming → EC/CUDA: bus fence for persistent kernel publish (47)
  • EC/CUDA: fix persistent executor shutdown hang (46)
  • TL/CUDA: fix NVLS init deadlock on import failure (49)

Then force-push the branch:

git rebase -i origin/master  # edit the four commit messages
git push --force-with-lease

Related: none

🛡️ This comment had 1 potential secret(s) redacted (Hex High Entropy String). See request_id c639f9e6-be3e-46ef-9ece-45d42391965f in the triage console for the audit trail.

@dpressle

dpressle commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

👀 Investigating Lint (codestyle)

@dpressle

dpressle commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

🤖 CI Triage AgentLint (codestyle) · commit 0305aeaa

The logs are crystal clear. Here is the complete diagnosis:


Summary: All 5 commits in PR #1320 fail the codestyle lint check because their titles exceed the 50-character limit enforced by codestyle.yaml.

Root cause: The project's commit title linter enforces a hard 50-character maximum (unless the title starts with "Merge"). Every commit on the fix-cuda-nvls-and-ec-persistent-hangs branch violates this rule. The exact offenders, with their lengths, are:

# Length Commit Title
1 76 TL/CUDA: emit actionable one-time warning when NVLS peer fd import is denied
2 75 TL/CUDA: only advertise/route NVLS allreduce when NVLS actually initialized
3 71 EC/CUDA: use bus store fence when publishing tasks to persistent kernel
4 69 EC/CUDA: fix persistent executor shutdown hang on weakly-ordered CPUs
5 67 TL/CUDA: fix NVLS init deadlock on asymmetric handle import failure

The check logic (from codestyle.yaml) is straightforward:

if [ ${#msg} -gt 50 ]
then
  if ! echo $msg | grep -qP '^Merge'
  then
    echo "Commit title is too long: ${#msg}"
    return 1
  fi
fi

All five titles match a valid header prefix (TL/ or EC/) and the headers themselves are fine — the sole failure reason is length.

Implicated commit: [REDACTED:Hex High Entropy String] (tip commit, and all 4 preceding commits in the PR branch)

File: .github/workflows/codestyle.yaml (the rule being enforced, not a bug in it)

Suggested fix: Rewrite (via git rebase -i) each of the 5 commit titles to be ≤ 50 characters. Suggested shortened versions:

Original Shortened (≤50 chars)
TL/CUDA: emit actionable one-time warning when NVLS peer fd import is denied TL/CUDA: warn on NVLS peer fd import denial (44)
TL/CUDA: only advertise/route NVLS allreduce when NVLS actually initialized TL/CUDA: skip NVLS allreduce if not initialized (48)
EC/CUDA: use bus store fence when publishing tasks to persistent kernel EC/CUDA: bus store fence for persistent tasks (46)
EC/CUDA: fix persistent executor shutdown hang on weakly-ordered CPUs EC/CUDA: fix executor shutdown hang on weak CPUs (49)
TL/CUDA: fix NVLS init deadlock on asymmetric handle import failure TL/CUDA: fix NVLS init deadlock on import fail (47)

Then force-push the branch: git push --force-with-lease origin fix-cuda-nvls-and-ec-persistent-hangs

Related: none

🛡️ This comment had 1 potential secret(s) redacted (Hex High Entropy String). See request_id 26268561-95bc-4d34-a4b1-532fd4f94b5d in the triage console for the audit trail.

@dpressle

dpressle commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

👀 Investigating Lint (codestyle)

@dpressle

dpressle commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

🤖 CI Triage AgentLint (codestyle) · commit 6c2b1194

The logs are clear. I have a complete picture of the failure. No further tool calls are needed.


Summary: Lint (codestyle) check failed — 6 out of 7 commits in PR #1320 have non-conforming commit titles (wrong header format and/or exceeding the 50-character limit).

Root cause: The UCC codestyle CI enforces two rules for every commit title in a PR:

  1. Must start with a recognised component header (e.g. TL/CUDA:, EC/CUDA:, CORE:, etc.) — no generic prose titles are allowed.
  2. Must not exceed 50 characters (unless it begins with Merge).

Six commits in this PR violate one or both rules:

Commit title Violation
Apply suggestions from code review Wrong header — no recognised component prefix
TL/CUDA: emit actionable one-time warning when NVLS peer fd import is denied Too long (76 chars)
TL/CUDA: only advertise/route NVLS allreduce when NVLS actually initialized Too long (75 chars)
EC/CUDA: use bus store fence when publishing tasks to persistent kernel Too long (71 chars)
EC/CUDA: fix persistent executor shutdown hang on weakly-ordered CPUs Too long (69 chars)
TL/CUDA: fix NVLS init deadlock on asymmetric handle import failure Too long (67 chars)

Only TL/CUDA: fix NVLS on fabric cluster UUID (44 chars, valid header) passed.

Implicated commit: [REDACTED:Hex High Entropy String] (tip of fix-cuda-nvls-and-ec-persistent-hangs, authored "Apply suggestions from code review")

File: .github/workflows/codestyle.yaml (the check_title function enforcing these rules)

Suggested fix: Rewrite the offending commit titles via interactive rebase (git rebase -i origin/master) and reword each bad commit. Apply these two rules to every new title:

  • Prefix with a valid component tag from the allowed list (e.g. TL/CUDA:, EC/CUDA:).
  • Keep the title ≤ 50 characters — abbreviate as needed, e.g.:
    • Apply suggestions from code reviewTL/CUDA: address review comments (33 chars) ✅
    • TL/CUDA: emit actionable one-time warning when NVLS peer fd import is deniedTL/CUDA: warn once on NVLS fd import denial (43 chars) ✅
    • TL/CUDA: only advertise/route NVLS allreduce when NVLS actually initializedTL/CUDA: skip NVLS allreduce if NVLS not inited (47 chars) ✅
    • EC/CUDA: use bus store fence when publishing tasks to persistent kernelEC/CUDA: use bus store fence for persistent tasks (49 chars) ✅
    • EC/CUDA: fix persistent executor shutdown hang on weakly-ordered CPUsEC/CUDA: fix shutdown hang on weakly-ordered CPUs (49 chars) ✅
    • TL/CUDA: fix NVLS init deadlock on asymmetric handle import failureTL/CUDA: fix NVLS init deadlock on handle failure (49 chars) ✅

Then force-push the branch to re-trigger CI.

Related: none

🛡️ This comment had 1 potential secret(s) redacted (Hex High Entropy String). See request_id c4ca7775-48ca-4194-ac03-1681ee4ef604 in the triage console for the audit trail.

@ikryukov ikryukov self-assigned this Jul 1, 2026
@ikryukov ikryukov force-pushed the fix-cuda-nvls-and-ec-persistent-hangs branch from 6c2b119 to 3463e98 Compare July 1, 2026 13:35
@ikryukov ikryukov changed the title TL/CUDA: Fix cuda nvls and ec persistent hangs TL/CUDA: fix NVLS fallback and EC persistent hangs Jul 1, 2026
@greptile-apps

greptile-apps Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes three independent TL/CUDA correctness issues that caused hangs or wrong results on aarch64 (Grace/GB200/VR200) and in permission-restricted containers. All three root causes are well-identified and the fixes are targeted.

  • NVLS init deadlock: A new STATE_SYNC_STATUS allgathers each rank's init_ready flag after the handle-import step; if any rank failed (e.g. pidfd_getfd EPERM), all ranks now fall back together instead of the successful ranks blocking forever in cuMulticastBindAddr. The state is properly guarded for re-entrant calling and the temporary buffer is freed in both the success path and the cleanup path.
  • EC persistent executor hang on aarch64: ucc_memory_bus_store_fence() (DMB OSHST / sfence) replaces ucc_memory_cpu_store_fence() (compiler barrier / DSB ISHST) when publishing tasks and the shutdown signal to device-mapped zero-copy memory, correctly ordering stores to the outer-shareable domain that the GPU observes.
  • Wrong-result after NVLS fallback: ucc_tl_cuda_get_supported_colls and allreduce_init now gate ALLREDUCE dispatch on team->nvls.enabled, which is set only after NVLS completes all init states successfully.

Confidence Score: 5/5

Safe to merge: all three fixes are narrowly scoped, well-tested on the target hardware (VR200 aarch64 and H100 x86), and revert-safe if any regression appears.

The memory-fence change is architecturally correct (DMB OSHST for aarch64, sfence for x86). The STATE_SYNC_STATUS allgather follows the existing OOB allgather pattern and handles re-entrant calling correctly. The nvls.enabled gate is applied consistently in both the score-building and allreduce init paths. Temporary buffer cleanup is handled in both success and error paths.

No files require special attention.

Important Files Changed

Filename Overview
src/components/ec/cuda/ec_cuda_executor_persistent.c Replaces ucc_memory_cpu_store_fence() with ucc_memory_bus_store_fence() in task_post and persistent_stop to fix GPU-visible ordering on weakly-ordered CPUs (aarch64/Grace); well-reasoned comments explain the DMB OSH vs DSB ISH distinction
src/components/tl/cuda/tl_cuda_nvls.c Adds STATE_SYNC_STATUS to allgather each rank's import result so a per-rank pidfd_getfd failure causes all ranks to fall back together rather than deadlocking in cuMulticastBindAddr; improves diagnostic logging for permission-denied cases
src/components/tl/cuda/tl_cuda_nvls.h Adds STATE_SYNC_STATUS to the state enum and three new fields to ucc_tl_cuda_nvls_t: init_ready, init_sync_data (temporary allgather buffer), and enabled (gates collective dispatch)
src/components/tl/cuda/tl_cuda_team.c Replaces the static ucc_tl_cuda_nvls_check_support() call with a runtime team->nvls.enabled check in ucc_tl_cuda_get_supported_colls(), and downgrades NVLS init failure from tl_error to tl_debug
src/components/tl/cuda/allreduce/allreduce.c Adds a cuda_team->nvls.enabled guard at the top of allreduce_init as a safety net, preventing dispatch to uninitialized NVLS resources if NVLS fell back

Reviews (5): Last reviewed commit: "TL/CUDA: warn when NVLS peer fd import d..." | Re-trigger Greptile

Comment thread src/components/tl/cuda/tl_cuda_nvls.c
@ikryukov

ikryukov commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator Author

/build

@janjust janjust force-pushed the fix-cuda-nvls-and-ec-persistent-hangs branch from 3463e98 to 633c122 Compare July 2, 2026 13:30
@ikryukov ikryukov force-pushed the fix-cuda-nvls-and-ec-persistent-hangs branch 2 times, most recently from 214eeee to c578ecc Compare July 2, 2026 15:09
ikryukov added 5 commits July 2, 2026 11:10
When a rank fails to import the multicast handle during NVLS team
initialization (e.g. pidfd_getfd returns EPERM in a container without
CAP_SYS_PTRACE or with a restrictive seccomp filter), it would bail out
and fall back while the other ranks proceeded into the collective
cuMulticastBindAddr barrier and blocked forever, deadlocking the team
(observed as a hang in the first DDP collective in test_c10d_ucc).

Add a STATE_SYNC_STATUS step that allgathers each rank's import status
after STATE_IMPORT_HANDLE. If any rank failed, all ranks disable NVLS
together and fall back symmetrically, avoiding the deadlock.

Also downgrade the expected NVLS init/fallback failures (peer fd import,
multicast object creation, top-level NVLS init failure) from tl_error to
tl_debug, matching how TL/SHARP reports failed initialization, so a
supported fallback does not emit a spurious error.
ucc_cuda_executor_persistent_stop() signals the persistent GPU kernel to
exit by writing eee->pidx = -1 and the SHUTDOWN state into device-mapped
(cudaHostAllocMapped, zero-copy) memory, then busy-waits for the kernel to
write SHUTDOWN_ACK back.

The shutdown flag was published without any memory barrier that orders the
store against the GPU. On strongly-ordered CPUs (x86) this happens to work,
but on weakly-ordered CPUs (aarch64, e.g. Grace/GB200/VR200) the inner-
shareable CPU fences used elsewhere do not order against the GPU's
shareability domain, so the persistent kernel may never observe pidx == -1.
It then never exits and never acknowledges the shutdown, leaving the CPU
spinning forever in the stop loop. This stalls the UCC progress thread and,
with the PyTorch UCC backend, manifests as a hang in the first/teardown
collective (e.g. test_ddp_checkpointing_dynamic_module hanging in an
all_gather/barrier).

Publish the shutdown flag with ucc_memory_bus_store_fence() (outer-shareable
on aarch64, sfence on x86), which is defined specifically to synchronize
write-back and device-mapped memory. The ack is a volatile flag in coherent
device-mapped memory, so it is observed without a load fence in the wait loop.
task_post() writes task args into device-mapped (zero-copy) memory and then
publishes them by advancing pidx, which the persistent GPU kernel polls. The
ordering "task args visible before pidx" was enforced with the inner-shareable
CPU store fence, which does not order stores against the GPU's shareability
domain on weakly-ordered CPUs (aarch64/Grace). Use ucc_memory_bus_store_fence()
(outer-shareable), consistent with the shutdown path, so the kernel cannot
observe an advanced pidx with stale task args.
ucc_tl_cuda_get_supported_colls() advertised NVLS ALLREDUCE based on the
static hardware capability check (cuMulticast attributes), not on whether
NVLS actually initialized for the team. When NVLS init falls back for a
single-node team (e.g. peer fd import denied, team size over the NVLS peer
limit, non-uniform ppn), the TL/CUDA team is still created, so allreduce
was routed to ucc_tl_cuda_allreduce_nvls_init with no NVLS resources set up,
producing wrong results / crashes instead of falling back.

Track whether NVLS finished initializing (nvls.enabled, set only after the
final NVLS barrier) and:
- gate advertising ALLREDUCE in get_supported_colls on nvls.enabled, so the
  score map routes allreduce to another TL when NVLS is unavailable;
- defensively return UCC_ERR_NOT_SUPPORTED from ucc_tl_cuda_allreduce_init
  when NVLS is not enabled.
When single-node NVLS falls back because importing a peer process file
descriptor is denied (EPERM/EACCES from pidfd_open/pidfd_getfd due to Yama
ptrace_scope, missing CAP_SYS_PTRACE, or a seccomp filter), emit a single
per-process warning that explains the cause and how to enable NVLS (host
sysctl kernel.yama.ptrace_scope=0, docker --cap-add=SYS_PTRACE /
--security-opt seccomp=unconfined, enroot --container-remap-root). The
per-occurrence detail stays at debug level to avoid spam.
@janjust janjust force-pushed the fix-cuda-nvls-and-ec-persistent-hangs branch from c578ecc to cef80e2 Compare July 2, 2026 16:10
@janjust

janjust commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

/build

@janjust

janjust commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

/build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants