Expose Ring Communications #43
Comments
NCCL 1.3 relied on a single ring, but NCCL has since evolved to use multiple rings so that it can use all NVLinks and network cards, which means a rank no longer has a unique ring index. More recently, NCCL also started using trees for small and medium-sized operations. So the ring NCCL creates is neither unique nor the best way to communicate for, e.g., a halo exchange. Halo exchanges should be handled by point-to-point operations once they are implemented. Closing, as this is an old request; feel free to reopen or follow up.
Summary: Pull Request resolved: facebookresearch#43
Introduce user facing API ncclCommDump. It internally dumps NCCL internal state including:
- Basic comm metadata
- Pending, past, current collective kernels via CollTrace
- Past collectives and active network operations from ProxyTrace.
See details in design doc: https://docs.google.com/document/d/1ReXt2IKsjlzCUyi8bN4o5aFlmOYq7_FgduvOkqp95k4/edit?usp=sharing
Reviewed By: YulunW
Differential Revision: D53792058
fbshipit-source-id: 497984035614dff96c15bbfe7d86f74b86930f79
This is an excellent and necessary library. My understanding is that each collective communication is implemented via ring communications. If this is the case, a large class of problems (e.g. halo communications) could benefit greatly from exposing the collective ring communication as another primitive.
I imagine this could look similar to MPI's virtual topology:
https://computing.llnl.gov/tutorials/mpi/#Virtual_Topologies
where the ncclComm (or a wrapper-like object) would be exposed as a ring_communicator that could be passed to ring_rank, ring_coord, ring_shift, send, recv, and sendrecv-like functions.
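To make the proposal concrete, here is a minimal sketch of what the neighbor lookup behind a ring_shift call might compute, modeled on MPI_Cart_shift for a periodic 1-D topology. Everything in it is hypothetical: the ring_comm struct, ring_wrap, and the ring_shift signature are illustrative names, not NCCL API, and the sketch assumes the ring order is simply the user-visible rank order (0, 1, ..., nranks-1) rather than any ring NCCL builds internally.

```c
#include <assert.h>

/* Hypothetical ring descriptor; NOT part of NCCL.
   Assumes ring positions follow the user-visible rank order. */
typedef struct {
    int rank;    /* this process's position on the ring */
    int nranks;  /* total number of positions on the ring */
} ring_comm;

/* Wrap an index into [0, n), including for negative inputs
   (C's % operator can return a negative remainder). */
static int ring_wrap(int i, int n) {
    assert(n > 0);
    int r = i % n;
    return r < 0 ? r + n : r;
}

/* Hypothetical analogue of MPI_Cart_shift on a periodic ring:
   *dst is the rank this process would send to,
   *src is the rank it would receive from,
   for a displacement of disp positions (may be negative). */
static void ring_shift(const ring_comm *ring, int disp, int *src, int *dst) {
    *dst = ring_wrap(ring->rank + disp, ring->nranks);
    *src = ring_wrap(ring->rank - disp, ring->nranks);
}
```

For example, on a 4-rank ring, rank 0 with disp = 1 would get dst = 1 and src = 3. In a halo exchange, those two ranks would be the peers for the send and the receive, respectively.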
I was going to take a quick crack at this, but thought I would get some feedback from the experts first.