Skip to content

Conversation

@chhwang
Copy link
Contributor

@chhwang chhwang commented Jan 7, 2026

  • Now NvlsConnection internally reuses GpuIpcMem for multicast memory handling.
  • Removed unnecessary barriers from connectNvlsCollective() (CUDA API handles this automatically).
  • Updated GpuIpcMem::map() and GpuIpcMem::mapMulticast() to return a shared pointer with custom deleter for unmapping, which prevents misuse of raw pointers and reduces states to be stored in the GpuIpcMem instance.
  • Now for RuntimeIpc type handles, for consistency with other types, cudaIpcOpenMemHandle will be called in GpuIpcMem::map() instead of the ctor of GpuIpcMem.

Binyang2014 and others added 24 commits December 4, 2025 19:20
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- [x] Move hash specialization and equality operator from std/global
namespace to custom namespace
- [x] Update unordered_map to use custom hash and equality as template
parameters
- [x] Add noexcept to equality operator
- [x] Verify the changes build correctly
- [x] Run code review and security checks

<!-- START COPILOT CODING AGENT TIPS -->
---

✨ Let Copilot coding agent [set things up for
you](https://github.com/microsoft/mscclpp/issues/new?title=✨+Set+up+Copilot+instructions&body=Configure%20instructions%20for%20this%20repository%20as%20documented%20in%20%5BBest%20practices%20for%20Copilot%20coding%20agent%20in%20your%20repository%5D%28https://gh.io/copilot-coding-agent-tips%29%2E%0A%0A%3COnboard%20this%20repo%3E&assignees=copilot)
— coding agent works faster and does higher quality work when set up for
your repo.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Binyang2014 <9415966+Binyang2014@users.noreply.github.com>
Co-authored-by: Binyang Li <binyli@microsoft.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors NVLS (NVLink Sharp) connection handling to use the unified GpuIpcMem abstraction instead of directly managing CUDA multicast APIs. This simplifies the code by delegating multicast memory management to the existing GpuIpcMem infrastructure.

Key Changes

  • Replaces manual multicast handle management with GpuIpcMem and GpuIpcMemHandle abstractions
  • Removes manual buffer allocation tracking (allocatedRanges_, freeRanges_) as this is now handled internally by GpuIpcMem
  • Removes explicit synchronization barriers in connectNvlsCollective, relying instead on the blocking behavior of cuMulticastBindAddr within GpuIpcMem::mapMulticast()

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
src/switch_channel.cc Complete refactoring of NvlsConnection::Impl to use GpuIpcMem; updated license header; removed buffer allocation logic and synchronization barriers; simplified bindMemory implementation
python/csrc/switch_channel_py.cpp Removed Python binding for getMultiCastMinGranularity method (breaking API change)
include/mscclpp/switch_channel.hpp Removed public API methods addDevice() and getMultiCastMinGranularity() (breaking API changes)

@chhwang
Copy link
Contributor Author

chhwang commented Jan 7, 2026

/azp run mscclpp-ut

@azure-pipelines
Copy link

Azure Pipelines could not run because the pipeline triggers exclude this branch/path.

@chhwang chhwang changed the base branch from chhwang/new-ipc-mem to main January 7, 2026 08:49
@chhwang
Copy link
Contributor Author

chhwang commented Jan 7, 2026

/azp run mscclpp-ut

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@chhwang chhwang changed the base branch from main to chhwang/new-ipc-mem January 12, 2026 15:51
Base automatically changed from chhwang/new-ipc-mem to main January 14, 2026 02:49
@chhwang
Copy link
Contributor Author

chhwang commented Jan 14, 2026

/azp run mscclpp-ut

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (2)

python/csrc/switch_channel_py.cpp:2

  • The license header should use "MIT License" instead of "MIT license" (capital L) to match the project's standard format seen in other C++ files.
// Licensed under the MIT license.

include/mscclpp/switch_channel.hpp:2

  • The license header should use "MIT License" instead of "MIT license" (capital L) to match the project's standard format used in the implementation files.
// Licensed under the MIT license.

@chhwang chhwang requested a review from Binyang2014 January 14, 2026 08:14
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@chhwang chhwang mentioned this pull request Jan 14, 2026
@chhwang chhwang merged commit 105239f into main Jan 15, 2026
13 checks passed
@chhwang chhwang deleted the chhwang/new-sc branch January 15, 2026 05:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants