TL/CUDA: fix NVLS fallback and EC persistent hangs by ikryukov · Pull Request #1320 · openucx/ucc

ikryukov · 2026-07-01T11:23:04Z

What

Fix TL/CUDA NVLS hangs/failures that broke test_c10d_ucc.py (e.g.
test_ddp_checkpointing_dynamic_module) on aarch64 (Grace/GB200/VR200) and in
containers without ptrace permission:

NVLS init deadlock: if one rank fails to import the peer multicast handle
(e.g. pidfd_getfd EPERM in a restricted container), it fell back while the
other ranks blocked forever in the collective cuMulticastBindAddr. All ranks
now exchange import status and disable NVLS together.
EC persistent executor hang on weakly-ordered CPUs: the CPU↔GPU shutdown
and task-publish flags in device-mapped memory used inner-shareable fences,
which don't order against the GPU on aarch64, so the persistent kernel never
saw the update. Use the bus (outer-shareable) fences.
Wrong-result after NVLS fallback: allreduce was still routed to the NVLS
algorithm (based on a static HW capability check) even when NVLS did not
initialize. Advertise/route NVLS allreduce only when NVLS is actually enabled.
Diagnostics: on a peer-fd import denial, emit one actionable warning
(ptrace_scope / CAP_SYS_PTRACE / docker / enroot hints); keep the rest at
debug so a supported fallback is not noisy.

Why ?

On aarch64 (GB200/VR200) and permission-restricted containers, NVLS either
deadlocked at team creation (hang in the first DDP collective) or silently
produced wrong results after falling back. NCCL worked, so it was UCC-specific.
Root-caused on a VR200 node: the progress thread was stuck in
ucc_cuda_executor_persistent_stop, and the peer-fd EPERM caused an asymmetric
NVLS init.

Fixes: RM 5113172

How ?

tl_cuda_nvls: add STATE_SYNC_STATUS that allgathers each rank's import
result; disable NVLS team-wide on any failure. Track nvls.enabled (set only
after the final NVLS barrier).
ec_cuda_executor_persistent: publish/poll shutdown and task-post flags with
ucc_memory_bus_store_fence() / ucc_memory_bus_load_fence().
tl_cuda_team / allreduce: gate advertising and dispatch of NVLS allreduce
on nvls.enabled.

Validated on VR200 (aarch64) and H100 (x86): test_c10d_ucc.py passes
(50 passed / 14 skipped), single-node and multinode NVLS allreduce correct,
and the forced-EPERM path falls back cleanly instead of hanging.

dpressle · 2026-07-01T11:46:02Z

👀 Investigating Lint (codestyle)…

dpressle · 2026-07-01T11:46:21Z

🤖 CI Triage Agent — Lint (codestyle) · commit dd2e7e72

The logs are crystal clear. Here is the complete diagnosis:

Summary: Lint (codestyle) check failed because all 4 commit titles in PR #1320 exceed the 50-character limit enforced by the commit title linter.

Root cause: The codestyle workflow enforces a strict 50-character maximum for non-merge commit titles. All four commits on the fix-cuda-nvls-and-ec-persistent-hangs branch violate this rule:

Commit Title	Length	Limit
`TL/CUDA: only advertise/route NVLS allreduce when NVLS actually initialized`	75	50
`EC/CUDA: use bus store fence when publishing tasks to persistent kernel`	71	50
`EC/CUDA: fix persistent executor shutdown hang on weakly-ordered CPUs`	69	50
`TL/CUDA: fix NVLS init deadlock on asymmetric handle import failure`	67	50

The linter script checks [ ${#msg} -gt 50 ] and only exempts titles starting with Merge. All four titles pass the valid-header format check (TL/, EC/, etc.) but fail solely on length.

Implicated commit: [REDACTED:Hex High Entropy String] (the HEAD commit, plus the 3 earlier commits in the PR range — all by the PR author)

File: .github/workflows/codestyle.yaml (the check_title() function enforcing the 50-char rule)

Suggested fix: Shorten each of the four commit titles to ≤ 50 characters and amend/rebase the commits. Suggested rewrites (all ≤ 50 chars):

TL/CUDA: only advertise NVLS AR when initialized (49)
EC/CUDA: bus store fence for persistent kernel tasks → needs trimming → EC/CUDA: bus fence for persistent kernel publish (47)
EC/CUDA: fix persistent executor shutdown hang (46)
TL/CUDA: fix NVLS init deadlock on import failure (49)

Then force-push the branch:

git rebase -i origin/master  # edit the four commit messages
git push --force-with-lease

Related: none

🛡️ This comment had 1 potential secret(s) redacted (Hex High Entropy String). See request_id c639f9e6-be3e-46ef-9ece-45d42391965f in the triage console for the audit trail.

dpressle · 2026-07-01T12:46:46Z

👀 Investigating Lint (codestyle)…

dpressle · 2026-07-01T12:47:08Z

🤖 CI Triage Agent — Lint (codestyle) · commit 0305aeaa

The logs are crystal clear. Here is the complete diagnosis:

Summary: All 5 commits in PR #1320 fail the codestyle lint check because their titles exceed the 50-character limit enforced by codestyle.yaml.

Root cause: The project's commit title linter enforces a hard 50-character maximum (unless the title starts with "Merge"). Every commit on the fix-cuda-nvls-and-ec-persistent-hangs branch violates this rule. The exact offenders, with their lengths, are:

#	Length	Commit Title
1	76	`TL/CUDA: emit actionable one-time warning when NVLS peer fd import is denied`
2	75	`TL/CUDA: only advertise/route NVLS allreduce when NVLS actually initialized`
3	71	`EC/CUDA: use bus store fence when publishing tasks to persistent kernel`
4	69	`EC/CUDA: fix persistent executor shutdown hang on weakly-ordered CPUs`
5	67	`TL/CUDA: fix NVLS init deadlock on asymmetric handle import failure`

The check logic (from codestyle.yaml) is straightforward:

if [ ${#msg} -gt 50 ]
then
  if ! echo $msg | grep -qP '^Merge'
  then
    echo "Commit title is too long: ${#msg}"
    return 1
  fi
fi

All five titles match a valid header prefix (TL/ or EC/) and the headers themselves are fine — the sole failure reason is length.

Implicated commit: [REDACTED:Hex High Entropy String] (tip commit, and all 4 preceding commits in the PR branch)

File: .github/workflows/codestyle.yaml (the rule being enforced, not a bug in it)

Suggested fix: Rewrite (via git rebase -i) each of the 5 commit titles to be ≤ 50 characters. Suggested shortened versions:

Original	Shortened (≤50 chars)
`TL/CUDA: emit actionable one-time warning when NVLS peer fd import is denied`	`TL/CUDA: warn on NVLS peer fd import denial` (44)
`TL/CUDA: only advertise/route NVLS allreduce when NVLS actually initialized`	`TL/CUDA: skip NVLS allreduce if not initialized` (48)
`EC/CUDA: use bus store fence when publishing tasks to persistent kernel`	`EC/CUDA: bus store fence for persistent tasks` (46)
`EC/CUDA: fix persistent executor shutdown hang on weakly-ordered CPUs`	`EC/CUDA: fix executor shutdown hang on weak CPUs` (49)
`TL/CUDA: fix NVLS init deadlock on asymmetric handle import failure`	`TL/CUDA: fix NVLS init deadlock on import fail` (47)

Then force-push the branch: git push --force-with-lease origin fix-cuda-nvls-and-ec-persistent-hangs

Related: none

🛡️ This comment had 1 potential secret(s) redacted (Hex High Entropy String). See request_id 26268561-95bc-4d34-a4b1-532fd4f94b5d in the triage console for the audit trail.

dpressle · 2026-07-01T13:12:42Z

👀 Investigating Lint (codestyle)…

dpressle · 2026-07-01T13:13:07Z

🤖 CI Triage Agent — Lint (codestyle) · commit 6c2b1194

The logs are clear. I have a complete picture of the failure. No further tool calls are needed.

Summary: Lint (codestyle) check failed — 6 out of 7 commits in PR #1320 have non-conforming commit titles (wrong header format and/or exceeding the 50-character limit).

Root cause: The UCC codestyle CI enforces two rules for every commit title in a PR:

Must start with a recognised component header (e.g. TL/CUDA:, EC/CUDA:, CORE:, etc.) — no generic prose titles are allowed.
Must not exceed 50 characters (unless it begins with Merge).

Six commits in this PR violate one or both rules:

Commit title	Violation
`Apply suggestions from code review`	Wrong header — no recognised component prefix
`TL/CUDA: emit actionable one-time warning when NVLS peer fd import is denied`	Too long (76 chars)
`TL/CUDA: only advertise/route NVLS allreduce when NVLS actually initialized`	Too long (75 chars)
`EC/CUDA: use bus store fence when publishing tasks to persistent kernel`	Too long (71 chars)
`EC/CUDA: fix persistent executor shutdown hang on weakly-ordered CPUs`	Too long (69 chars)
`TL/CUDA: fix NVLS init deadlock on asymmetric handle import failure`	Too long (67 chars)

Only TL/CUDA: fix NVLS on fabric cluster UUID (44 chars, valid header) passed.

Implicated commit: [REDACTED:Hex High Entropy String] (tip of fix-cuda-nvls-and-ec-persistent-hangs, authored "Apply suggestions from code review")

File: .github/workflows/codestyle.yaml (the check_title function enforcing these rules)

Suggested fix: Rewrite the offending commit titles via interactive rebase (git rebase -i origin/master) and reword each bad commit. Apply these two rules to every new title:

Prefix with a valid component tag from the allowed list (e.g. TL/CUDA:, EC/CUDA:).
Keep the title ≤ 50 characters — abbreviate as needed, e.g.:
- Apply suggestions from code review → TL/CUDA: address review comments (33 chars) ✅
- TL/CUDA: emit actionable one-time warning when NVLS peer fd import is denied → TL/CUDA: warn once on NVLS fd import denial (43 chars) ✅
- TL/CUDA: only advertise/route NVLS allreduce when NVLS actually initialized → TL/CUDA: skip NVLS allreduce if NVLS not inited (47 chars) ✅
- EC/CUDA: use bus store fence when publishing tasks to persistent kernel → EC/CUDA: use bus store fence for persistent tasks (49 chars) ✅
- EC/CUDA: fix persistent executor shutdown hang on weakly-ordered CPUs → EC/CUDA: fix shutdown hang on weakly-ordered CPUs (49 chars) ✅
- TL/CUDA: fix NVLS init deadlock on asymmetric handle import failure → TL/CUDA: fix NVLS init deadlock on handle failure (49 chars) ✅

Then force-push the branch to re-trigger CI.

Related: none

🛡️ This comment had 1 potential secret(s) redacted (Hex High Entropy String). See request_id c4ca7775-48ca-4194-ac03-1681ee4ef604 in the triage console for the audit trail.

greptile-apps · 2026-07-01T13:46:57Z

Greptile Summary

This PR fixes three independent TL/CUDA correctness issues that caused hangs or wrong results on aarch64 (Grace/GB200/VR200) and in permission-restricted containers. All three root causes are well-identified and the fixes are targeted.

NVLS init deadlock: A new STATE_SYNC_STATUS allgathers each rank's init_ready flag after the handle-import step; if any rank failed (e.g. pidfd_getfd EPERM), all ranks now fall back together instead of the successful ranks blocking forever in cuMulticastBindAddr. The state is properly guarded for re-entrant calling and the temporary buffer is freed in both the success path and the cleanup path.
EC persistent executor hang on aarch64: ucc_memory_bus_store_fence() (DMB OSHST / sfence) replaces ucc_memory_cpu_store_fence() (compiler barrier / DSB ISHST) when publishing tasks and the shutdown signal to device-mapped zero-copy memory, correctly ordering stores to the outer-shareable domain that the GPU observes.
Wrong-result after NVLS fallback: ucc_tl_cuda_get_supported_colls and allreduce_init now gate ALLREDUCE dispatch on team->nvls.enabled, which is set only after NVLS completes all init states successfully.

Confidence Score: 5/5

Safe to merge: all three fixes are narrowly scoped, well-tested on the target hardware (VR200 aarch64 and H100 x86), and revert-safe if any regression appears.

The memory-fence change is architecturally correct (DMB OSHST for aarch64, sfence for x86). The STATE_SYNC_STATUS allgather follows the existing OOB allgather pattern and handles re-entrant calling correctly. The nvls.enabled gate is applied consistently in both the score-building and allreduce init paths. Temporary buffer cleanup is handled in both success and error paths.

No files require special attention.

Important Files Changed

Filename	Overview
src/components/ec/cuda/ec_cuda_executor_persistent.c	Replaces ucc_memory_cpu_store_fence() with ucc_memory_bus_store_fence() in task_post and persistent_stop to fix GPU-visible ordering on weakly-ordered CPUs (aarch64/Grace); well-reasoned comments explain the DMB OSH vs DSB ISH distinction
src/components/tl/cuda/tl_cuda_nvls.c	Adds STATE_SYNC_STATUS to allgather each rank's import result so a per-rank pidfd_getfd failure causes all ranks to fall back together rather than deadlocking in cuMulticastBindAddr; improves diagnostic logging for permission-denied cases
src/components/tl/cuda/tl_cuda_nvls.h	Adds STATE_SYNC_STATUS to the state enum and three new fields to ucc_tl_cuda_nvls_t: init_ready, init_sync_data (temporary allgather buffer), and enabled (gates collective dispatch)
src/components/tl/cuda/tl_cuda_team.c	Replaces the static ucc_tl_cuda_nvls_check_support() call with a runtime team->nvls.enabled check in ucc_tl_cuda_get_supported_colls(), and downgrades NVLS init failure from tl_error to tl_debug
src/components/tl/cuda/allreduce/allreduce.c	Adds a cuda_team->nvls.enabled guard at the top of allreduce_init as a safety net, preventing dispatch to uninitialized NVLS resources if NVLS fell back

_{Reviews (5): Last reviewed commit: "TL/CUDA: warn when NVLS peer fd import d..." | Re-trigger Greptile}

ikryukov · 2026-07-01T13:55:38Z

/build

When a rank fails to import the multicast handle during NVLS team initialization (e.g. pidfd_getfd returns EPERM in a container without CAP_SYS_PTRACE or with a restrictive seccomp filter), it would bail out and fall back while the other ranks proceeded into the collective cuMulticastBindAddr barrier and blocked forever, deadlocking the team (observed as a hang in the first DDP collective in test_c10d_ucc). Add a STATE_SYNC_STATUS step that allgathers each rank's import status after STATE_IMPORT_HANDLE. If any rank failed, all ranks disable NVLS together and fall back symmetrically, avoiding the deadlock. Also downgrade the expected NVLS init/fallback failures (peer fd import, multicast object creation, top-level NVLS init failure) from tl_error to tl_debug, matching how TL/SHARP reports failed initialization, so a supported fallback does not emit a spurious error.

ucc_cuda_executor_persistent_stop() signals the persistent GPU kernel to exit by writing eee->pidx = -1 and the SHUTDOWN state into device-mapped (cudaHostAllocMapped, zero-copy) memory, then busy-waits for the kernel to write SHUTDOWN_ACK back. The shutdown flag was published without any memory barrier that orders the store against the GPU. On strongly-ordered CPUs (x86) this happens to work, but on weakly-ordered CPUs (aarch64, e.g. Grace/GB200/VR200) the inner- shareable CPU fences used elsewhere do not order against the GPU's shareability domain, so the persistent kernel may never observe pidx == -1. It then never exits and never acknowledges the shutdown, leaving the CPU spinning forever in the stop loop. This stalls the UCC progress thread and, with the PyTorch UCC backend, manifests as a hang in the first/teardown collective (e.g. test_ddp_checkpointing_dynamic_module hanging in an all_gather/barrier). Publish the shutdown flag with ucc_memory_bus_store_fence() (outer-shareable on aarch64, sfence on x86), which is defined specifically to synchronize write-back and device-mapped memory. The ack is a volatile flag in coherent device-mapped memory, so it is observed without a load fence in the wait loop.

task_post() writes task args into device-mapped (zero-copy) memory and then publishes them by advancing pidx, which the persistent GPU kernel polls. The ordering "task args visible before pidx" was enforced with the inner-shareable CPU store fence, which does not order stores against the GPU's shareability domain on weakly-ordered CPUs (aarch64/Grace). Use ucc_memory_bus_store_fence() (outer-shareable), consistent with the shutdown path, so the kernel cannot observe an advanced pidx with stale task args.

ucc_tl_cuda_get_supported_colls() advertised NVLS ALLREDUCE based on the static hardware capability check (cuMulticast attributes), not on whether NVLS actually initialized for the team. When NVLS init falls back for a single-node team (e.g. peer fd import denied, team size over the NVLS peer limit, non-uniform ppn), the TL/CUDA team is still created, so allreduce was routed to ucc_tl_cuda_allreduce_nvls_init with no NVLS resources set up, producing wrong results / crashes instead of falling back. Track whether NVLS finished initializing (nvls.enabled, set only after the final NVLS barrier) and: - gate advertising ALLREDUCE in get_supported_colls on nvls.enabled, so the score map routes allreduce to another TL when NVLS is unavailable; - defensively return UCC_ERR_NOT_SUPPORTED from ucc_tl_cuda_allreduce_init when NVLS is not enabled.

When single-node NVLS falls back because importing a peer process file descriptor is denied (EPERM/EACCES from pidfd_open/pidfd_getfd due to Yama ptrace_scope, missing CAP_SYS_PTRACE, or a seccomp filter), emit a single per-process warning that explains the cause and how to enable NVLS (host sysctl kernel.yama.ptrace_scope=0, docker --cap-add=SYS_PTRACE / --security-opt seccomp=unconfined, enroot --container-remap-root). The per-occurrence detail stays at debug level to avoid spam.

janjust · 2026-07-02T16:10:32Z

/build

janjust · 2026-07-02T16:59:33Z

/build

ikryukov self-assigned this Jul 1, 2026

ikryukov force-pushed the fix-cuda-nvls-and-ec-persistent-hangs branch from 6c2b119 to 3463e98 Compare July 1, 2026 13:35

ikryukov changed the title ~~TL/CUDA: Fix cuda nvls and ec persistent hangs~~ TL/CUDA: fix NVLS fallback and EC persistent hangs Jul 1, 2026

ikryukov requested review from Sergei-Lebedev, janjust and nsarka July 1, 2026 13:38

ikryukov added Ready-for-Review ai-review start ai-review Target v1.9.0 labels Jul 1, 2026

greptile-apps Bot reviewed Jul 1, 2026

View reviewed changes

Comment thread src/components/tl/cuda/tl_cuda_nvls.c

janjust force-pushed the fix-cuda-nvls-and-ec-persistent-hangs branch from 3463e98 to 633c122 Compare July 2, 2026 13:30

janjust approved these changes Jul 2, 2026

View reviewed changes

nsarka approved these changes Jul 2, 2026

View reviewed changes

ikryukov force-pushed the fix-cuda-nvls-and-ec-persistent-hangs branch 2 times, most recently from 214eeee to c578ecc Compare July 2, 2026 15:09

ikryukov added 5 commits July 2, 2026 11:10

janjust force-pushed the fix-cuda-nvls-and-ec-persistent-hangs branch from c578ecc to cef80e2 Compare July 2, 2026 16:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

TL/CUDA: fix NVLS fallback and EC persistent hangs#1320

TL/CUDA: fix NVLS fallback and EC persistent hangs#1320
ikryukov wants to merge 5 commits into
openucx:masterfrom
ikryukov:fix-cuda-nvls-and-ec-persistent-hangs

ikryukov commented Jul 1, 2026 •

edited

Loading

Uh oh!

dpressle commented Jul 1, 2026

Uh oh!

dpressle commented Jul 1, 2026

Uh oh!

dpressle commented Jul 1, 2026

Uh oh!

dpressle commented Jul 1, 2026

Uh oh!

dpressle commented Jul 1, 2026

Uh oh!

dpressle commented Jul 1, 2026

Uh oh!

greptile-apps Bot commented Jul 1, 2026 •

edited

Loading

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

Uh oh!

ikryukov commented Jul 1, 2026

Uh oh!

janjust commented Jul 2, 2026

Uh oh!

janjust commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

ikryukov commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why ?

How ?

Uh oh!

dpressle commented Jul 1, 2026

Uh oh!

dpressle commented Jul 1, 2026

Uh oh!

dpressle commented Jul 1, 2026

Uh oh!

dpressle commented Jul 1, 2026

Uh oh!

dpressle commented Jul 1, 2026

Uh oh!

dpressle commented Jul 1, 2026

Uh oh!

greptile-apps Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

Uh oh!

ikryukov commented Jul 1, 2026

Uh oh!

janjust commented Jul 2, 2026

Uh oh!

janjust commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ikryukov commented Jul 1, 2026 •

edited

Loading

greptile-apps Bot commented Jul 1, 2026 •

edited

Loading