Use `GpuIpcMem` for NVLS connections #719

chhwang · 2026-01-07T06:35:42Z

NvlsConnection now reuses GpuIpcMem internally for multicast memory handling.
Removed unnecessary barriers from connectNvlsCollective() (the CUDA API handles this automatically).
GpuIpcMem::map() and GpuIpcMem::mapMulticast() now return a shared pointer with a custom deleter that unmaps the memory, which prevents misuse of raw pointers and reduces the state stored in the GpuIpcMem instance.
For RuntimeIpc type handles, cudaIpcOpenMemHandle is now called in GpuIpcMem::map() instead of in the GpuIpcMem constructor, for consistency with the other handle types.
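The ownership pattern described above (mapping on demand in map() and returning a shared pointer whose deleter does the unmapping) can be sketched roughly as follows. This is only an illustration, not the actual GpuIpcMem code: `FakeIpcMem`, `liveMappings()`, and `liveAfterScope()` are hypothetical names, and `std::malloc`/`std::free` stand in for the real CUDA open/close calls so the sketch runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Hypothetical stand-in for GpuIpcMem (not the real class).
class FakeIpcMem {
 public:
  // The constructor only records the size; nothing is mapped yet.
  explicit FakeIpcMem(std::size_t bytes) : bytes_(bytes) {}

  // Maps on demand. The returned shared_ptr owns the mapping, and its
  // deleter releases it, so the object keeps no per-mapping state.
  std::shared_ptr<void> map() const {
    void* ptr = std::malloc(bytes_);  // stand-in for e.g. cudaIpcOpenMemHandle
    if (ptr == nullptr) throw std::bad_alloc();
    ++*liveMappings_;
    // Copy the counter into the deleter so it stays valid even if this
    // FakeIpcMem is destroyed before the mapping is released.
    auto counter = liveMappings_;
    return std::shared_ptr<void>(ptr, [counter](void* p) {
      std::free(p);  // stand-in for the matching close/unmap call
      --*counter;
    });
  }

  // Number of mappings that have not been released yet (for illustration).
  int liveMappings() const { return *liveMappings_; }

 private:
  std::size_t bytes_;
  std::shared_ptr<int> liveMappings_ = std::make_shared<int>(0);
};

// Returns the number of live mappings after all pointers leave scope.
int liveAfterScope() {
  FakeIpcMem mem(64);
  {
    std::shared_ptr<void> a = mem.map();
    std::shared_ptr<void> b = a;  // copies share one mapping
    assert(mem.liveMappings() == 1);
  }  // last owner destroyed here, so the deleter runs exactly once
  return mem.liveMappings();
}
```

With this shape a caller cannot forget the unmap or unmap twice, because release is tied to the lifetime of the last pointer copy.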

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

- [x] Move hash specialization and equality operator from std/global namespace to custom namespace
- [x] Update unordered_map to use custom hash and equality as template parameters
- [x] Add noexcept to equality operator
- [x] Verify the changes build correctly
- [x] Run code review and security checks

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
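The first three checklist items describe a common C++ pattern: rather than specializing `std::hash` or defining a global `operator==`, the hash and equality live in the project's own namespace and are passed to `std::unordered_map` as template parameters. A minimal sketch, assuming a hypothetical key type `HandleKey` in a hypothetical namespace `example` (neither name comes from the PR, and this sketch uses functor types rather than whatever form the PR itself uses):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

namespace example {  // hypothetical namespace

// Hypothetical key type used only for this illustration.
struct HandleKey {
  std::uint64_t id;
  int device;
};

// Hash functor kept in the custom namespace instead of specializing std::hash.
struct HandleKeyHash {
  std::size_t operator()(const HandleKey& k) const noexcept {
    std::size_t h1 = std::hash<std::uint64_t>{}(k.id);
    std::size_t h2 = std::hash<int>{}(k.device);
    // Simple hash combine; any reasonable mixing function would do.
    return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
  }
};

// Equality functor, marked noexcept, instead of a global operator==.
struct HandleKeyEqual {
  bool operator()(const HandleKey& a, const HandleKey& b) const noexcept {
    return a.id == b.id && a.device == b.device;
  }
};

// The map names both functors explicitly as template parameters.
using HandleCache =
    std::unordered_map<HandleKey, std::string, HandleKeyHash, HandleKeyEqual>;

// Inserts two equal keys and one distinct key, then reports the map size.
std::size_t cacheSizeAfterDuplicateInsert() {
  HandleCache cache;
  cache.emplace(HandleKey{42, 0}, "first");
  cache.emplace(HandleKey{42, 0}, "duplicate");  // equal key, not inserted
  cache.emplace(HandleKey{42, 1}, "other device");
  return cache.size();
}

}  // namespace example
```

Keeping both functors out of `std` and the global namespace avoids colliding with other definitions for the same type, at the cost of spelling out the template parameters wherever the map type is declared.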

Copilot

Pull request overview

This PR refactors NVLS (NVLink SHARP) connection handling to use the unified GpuIpcMem abstraction instead of managing CUDA multicast APIs directly. Delegating multicast memory management to the existing GpuIpcMem infrastructure simplifies the code.

Key Changes

Replaces manual multicast handle management with GpuIpcMem and GpuIpcMemHandle abstractions
Removes manual buffer allocation tracking (allocatedRanges_, freeRanges_) as this is now handled internally by GpuIpcMem
Removes explicit synchronization barriers in connectNvlsCollective, relying instead on the blocking behavior of cuMulticastBindAddr within GpuIpcMem::mapMulticast()

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/switch_channel.cc | Complete refactoring of NvlsConnection::Impl to use GpuIpcMem; updated license header; removed buffer allocation logic and synchronization barriers; simplified bindMemory implementation |
| python/csrc/switch_channel_py.cpp | Removed Python binding for getMultiCastMinGranularity method (breaking API change) |
| include/mscclpp/switch_channel.hpp | Removed public API methods addDevice() and getMultiCastMinGranularity() (breaking API changes) |

chhwang · 2026-01-07T08:45:57Z

/azp run mscclpp-ut

azure-pipelines · 2026-01-07T08:46:05Z

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

chhwang · 2026-01-07T08:50:03Z

/azp run mscclpp-ut

azure-pipelines · 2026-01-07T08:50:15Z

Azure Pipelines successfully started running 1 pipeline(s).

chhwang · 2026-01-14T07:50:37Z

/azp run mscclpp-ut

azure-pipelines · 2026-01-14T07:50:50Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (2)

python/csrc/switch_channel_py.cpp:2

The license header should use "MIT License" instead of "MIT license" (capital L) to match the project's standard format seen in other C++ files.

// Licensed under the MIT license.

include/mscclpp/switch_channel.hpp:2

The license header should use "MIT License" instead of "MIT license" (capital L) to match the project's standard format used in the implementation files.

// Licensed under the MIT license.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Binyang2014 and others added 24 commits December 4, 2025 19:20

add ipc cache

70c1d4d

WIP

1739f5a

Update src/registered_memory.cc

4ebe37e

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

WIP

2137325

fix ut

b1029b9

Merge branch 'main' into binyli/handle_cache

d97d230

Add GpuIpcMem class

fcb1ab6

Merge branch 'main' into chhwang/new-ipc-mem

73982f7

revert

ebec0ee

update

dc77036

update

c3d2c2b

Merge branch 'main' into chhwang/new-ipc-mem

8eccca7

Lint

3a07282

tackle comments

77245e5

lint

61cc7d6

Merge branch 'main' into chhwang/new-ipc-mem

c3f467b

add comments

61ee117

tackle comments

542800d

tackle comment

0d7f877

rocm fix

2ff8e1f

tackle comments

c99d344

more fix

0037490

Use GpuIpcMem for NVLS connections

a5817f8

chhwang requested a review from Copilot January 7, 2026 06:39

Copilot started reviewing on behalf of chhwang January 7, 2026 06:39

Merge branch 'main' into chhwang/new-ipc-mem

2abc702

Copilot AI reviewed Jan 7, 2026

src/switch_channel.cc (resolved)

src/switch_channel.cc (resolved)

tackle comments

2e184f9

chhwang changed the base branch from chhwang/new-ipc-mem to main January 7, 2026 08:49

chhwang changed the base branch from main to chhwang/new-ipc-mem January 12, 2026 15:51

chhwang added 4 commits January 13, 2026 01:02

Merge branch 'main' into chhwang/new-ipc-mem

85501a2

minor updates

9189f83

fix close(0) issue

6ffaa01

Merge branch 'chhwang/new-ipc-mem' into chhwang/new-sc

fa73fcc

Base automatically changed from chhwang/new-ipc-mem to main January 14, 2026 02:49

chhwang added 2 commits January 13, 2026 18:51

Merge branch 'main' into chhwang/new-sc

19307eb

updates

fab745b

chhwang requested a review from Copilot January 14, 2026 08:00

Copilot started reviewing on behalf of chhwang January 14, 2026 08:00

Copilot AI reviewed Jan 14, 2026

src/include/gpu_ipc_mem.hpp (outdated, resolved)

src/gpu_ipc_mem.cc (resolved)

lint and add a comment

534501a

chhwang requested a review from Binyang2014 January 14, 2026 08:14

Copilot AI reviewed Jan 14, 2026

chhwang mentioned this pull request Jan 14, 2026

Add GpuIpcMemHandle #610

Closed

Binyang2014 reviewed Jan 15, 2026

src/include/gpu_ipc_mem.hpp (outdated, resolved)

src/gpu_ipc_mem.cc (resolved)

tackle comment

b5fe18a

Binyang2014 approved these changes Jan 15, 2026

Merge branch 'main' into chhwang/new-sc

8ccd4ee

chhwang merged commit 105239f into main Jan 15, 2026
13 checks passed

chhwang deleted the chhwang/new-sc branch January 15, 2026 05:16

3 participants
