Fix JACCL GID selection on Apple Thunderbolt RDMA (errno 22 RTR fail) #3468
Open
danielkristofik wants to merge 1 commit into ml-explore:main
Connection::info() in rdma.cpp scans the GID table for IPv4-mapped IPv6 GIDs (::ffff:x.x.x.x, RoCE v2 format). Apple Thunderbolt RDMA exposes only link-local IPv6 GIDs (fe80::...), so the filter never matches and gid is left uninitialized. The garbage value is then sent to the peer via the side channel, causing the kernel to reject the QP RTR transition with EINVAL (errno 22).

Initialize gid to zero, factor the GID selection into a try_gid() helper, and add a fallback that prefers index 1 (the actual rdma_enX port GID on Apple TB; index 0 is typically derived from a non-RDMA interface and routes elsewhere, surfacing as errno 60 ETIMEDOUT). The IPv4-mapped path remains the preferred match, so RoCE v2 setups are unaffected.

Tested on 2x Mac Studio M4 Max + Thunderbolt 5 mesh, macOS 26.4.1. Distributed init succeeds and sustained tensor-parallel inference works.
Drifter4242 added a commit to Drifter4242/mlx-jaccl-fix-small-recv that referenced this pull request May 1, 2026
Port of ml-explore/mlx PR ml-explore#3468 by danielkristofik. The GID scan loop in Connection::info() only accepts IPv4-mapped IPv6 GIDs (::ffff:x.x.x.x). Apple Thunderbolt RDMA exposes only link-local IPv6 GIDs (fe80::...), so the loop never matches, leaving gid uninitialized. The garbage value causes errno=22 EINVAL on queue pair RTR transition. Fix: zero-initialize gid, prefer IPv4-mapped first (preserves RoCE v2), then fall back to index 1 (the actual RDMA port GID on Apple TB).
Closes #3467.
Proposed changes

Fix `[jaccl] Changing queue pair to RTR failed with errno 22` on Apple Thunderbolt RDMA.

`Connection::info()` in `rdma.cpp` selects the local GID by scanning the GID table for an IPv4-mapped IPv6 GID (`::ffff:x.x.x.x`, the RoCE v2 standard). Apple Thunderbolt RDMA exposes only link-local IPv6 GIDs (`fe80::...`), so the filter never matches and `gid` is left uninitialized. The garbage value is propagated to the peer via the side channel, causing the kernel to reject the QP RTR transition with EINVAL.

This regression was introduced by #3412 (Jaccl refactor) when the pre-existing hardcoded `query_gid(ctx, 1, 1, &gid)` was replaced with the filter loop. See full root-cause analysis in the linked issue.

Changes

`mlx/distributed/jaccl/lib/jaccl/rdma.cpp`:

- Zero-initialize `gid` to avoid undefined behavior when no GID matches.
- Factor the GID selection into a `try_gid()` helper.
- Add a fallback that prefers index 1 (the actual rdma_enX port GID on Apple TB) when no IPv4-mapped GID is found.

The IPv4-mapped path remains the preferred match, so RoCE v2 setups are unaffected.
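The selection order described above (IPv4-mapped first, then index 1) can be illustrated with a small standalone sketch. This is not the code in the PR: `Gid`, `select_gid`, `is_ipv4_mapped`, `is_link_local`, and the builder helpers are hypothetical names, and the GID table is modeled as a plain vector of 16-byte arrays rather than being read through libibverbs. The sketch also returns `std::optional` instead of a zeroed GID, which is an assumption made here so the caller can tell "nothing usable" apart from a real entry.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for the 16 raw bytes of a GID.
using Gid = std::array<std::uint8_t, 16>;

// ::ffff:a.b.c.d has bytes 0..9 zero and bytes 10..11 equal to 0xff.
bool is_ipv4_mapped(const Gid& g) {
  for (std::size_t i = 0; i < 10; ++i) {
    if (g[i] != 0) return false;
  }
  return g[10] == 0xff && g[11] == 0xff;
}

// fe80::/10 link-local prefix: first 10 bits are 1111 1110 10.
bool is_link_local(const Gid& g) {
  return g[0] == 0xfe && (g[1] & 0xc0) == 0x80;
}

struct Selection {
  std::size_t index;
  Gid gid;
};

// Prefer the first IPv4-mapped entry; otherwise fall back to
// fallback_index (1 by default) if that slot is populated.
std::optional<Selection> select_gid(const std::vector<Gid>& table,
                                    std::size_t fallback_index = 1) {
  for (std::size_t i = 0; i < table.size(); ++i) {
    if (is_ipv4_mapped(table[i])) return Selection{i, table[i]};
  }
  if (fallback_index < table.size() && table[fallback_index] != Gid{}) {
    return Selection{fallback_index, table[fallback_index]};
  }
  return std::nullopt;  // nothing usable: the caller must not send a GID
}

// Illustrative builders for test data.
Gid make_link_local(std::uint8_t last_byte) {
  Gid g{};
  g[0] = 0xfe;
  g[1] = 0x80;
  g[15] = last_byte;
  return g;
}

Gid make_ipv4_mapped(std::uint8_t a, std::uint8_t b, std::uint8_t c,
                     std::uint8_t d) {
  Gid g{};
  g[10] = 0xff;
  g[11] = 0xff;
  g[12] = a;
  g[13] = b;
  g[14] = c;
  g[15] = d;
  return g;
}

void check_examples() {
  // Thunderbolt-like table: only link-local entries, so index 1 is chosen.
  std::vector<Gid> tb = {make_link_local(0x01), make_link_local(0x02)};
  auto tb_pick = select_gid(tb);
  assert(tb_pick.has_value());
  assert(tb_pick->index == 1);
  assert(is_link_local(tb_pick->gid));

  // RoCE-v2-like table: the IPv4-mapped entry wins wherever it sits.
  std::vector<Gid> roce = {make_link_local(0x01), make_link_local(0x02),
                           make_ipv4_mapped(10, 0, 0, 1)};
  auto roce_pick = select_gid(roce);
  assert(roce_pick.has_value());
  assert(roce_pick->index == 2);
}
```

Calling `check_examples()` exercises both table shapes. Making the fallback index a parameter is a choice of this sketch only; the PR description states that index 1 is preferred.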
Test plan

- Before the fix: `mlx.distributed.init(backend="jaccl")` → `errno 22 RTR fail`
- Fallback to GID index 0: `errno 60 ETIMEDOUT` (wrong port; index 0 is non-RDMA on Apple)
- Fallback to GID index 1: `all_sum` benchmark runs, sustained tensor-parallel inference (Qwen3.6-27B-4bit) works end-to-end

Checklist

- I have run `pre-commit run --all-files` to format my code / installed pre-commit prior to committing changes

Note: no unit tests added. The GID selection is platform-dependent (Apple TB vs RoCE v2 hardware), and the tests in `mlx/distributed/jaccl/lib/examples/` are runtime benchmarks rather than unit tests. Happy to add a mock-based unit test if reviewers prefer.
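One possible shape for such a mock-based test is sketched below, assuming the GID lookup can be injected as a callback. Everything here is hypothetical and not taken from the PR: `QueryGid`, `pick_gid`, `fake_table`, and the builder helpers are illustrative names. The callback returns 0 on success and nonzero on failure, mirroring the convention of `ibv_query_gid`, so a fake table can stand in for real hardware.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Gid = std::array<std::uint8_t, 16>;

// Hypothetical seam: returns 0 on success, nonzero on failure.
using QueryGid = std::function<int(int index, Gid& out)>;

struct Picked {
  int index = -1;  // -1 means no usable GID was found
  Gid gid{};       // zero-initialized, never garbage
};

static bool ipv4_mapped(const Gid& g) {
  for (std::size_t i = 0; i < 10; ++i) {
    if (g[i] != 0) return false;
  }
  return g[10] == 0xff && g[11] == 0xff;
}

// IPv4-mapped entries first, then index 1 if it is populated.
Picked pick_gid(const QueryGid& query, int table_len) {
  Picked result;
  Gid candidate{};
  for (int i = 0; i < table_len; ++i) {
    if (query(i, candidate) == 0 && ipv4_mapped(candidate)) {
      result.index = i;
      result.gid = candidate;
      return result;
    }
  }
  candidate = Gid{};
  if (query(1, candidate) == 0 && candidate != Gid{}) {
    result.index = 1;
    result.gid = candidate;
  }
  return result;
}

// Mock GID table: out-of-range indices fail like a real query would.
QueryGid fake_table(std::vector<Gid> entries) {
  return [entries = std::move(entries)](int index, Gid& out) {
    if (index < 0 || static_cast<std::size_t>(index) >= entries.size()) {
      return -1;
    }
    out = entries[static_cast<std::size_t>(index)];
    return 0;
  };
}

Gid link_local_gid(std::uint8_t last_byte) {
  Gid g{};
  g[0] = 0xfe;
  g[1] = 0x80;
  g[15] = last_byte;
  return g;
}

Gid mapped_v4_gid(std::uint8_t a, std::uint8_t b, std::uint8_t c,
                  std::uint8_t d) {
  Gid g{};
  g[10] = 0xff;
  g[11] = 0xff;
  g[12] = a;
  g[13] = b;
  g[14] = c;
  g[15] = d;
  return g;
}

void run_mock_tests() {
  // Thunderbolt-like: link-local only, expect the index 1 fallback.
  auto tb = fake_table(std::vector<Gid>{link_local_gid(1), link_local_gid(2)});
  Picked p = pick_gid(tb, 2);
  assert(p.index == 1);
  assert(p.gid == link_local_gid(2));

  // RoCE-v2-like: the IPv4-mapped entry is preferred over index 1.
  auto roce = fake_table(std::vector<Gid>{
      link_local_gid(1), link_local_gid(2), mapped_v4_gid(10, 0, 0, 1)});
  assert(pick_gid(roce, 3).index == 2);

  // Empty table: the result stays zeroed instead of holding garbage.
  auto empty = fake_table(std::vector<Gid>{});
  Picked e = pick_gid(empty, 0);
  assert(e.index == -1);
  assert(e.gid == Gid{});
}
```

Injecting the query as a `std::function` keeps the test free of any RDMA device dependency. Whether `Connection::info()` can be restructured to accept such a seam is an open question for the reviewers.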