Do we need to manually activate peer-to-peer communication between GPUs? #68
Comments
NCCL will enable P2P if needed, but will not fail if it is already enabled.
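At the CUDA API level, "enable if needed, don't fail if already enabled" amounts to treating `cudaErrorPeerAccessAlreadyEnabled` as success. The sketch below is a hypothetical illustration of that pattern, not NCCL's actual source; the helper name `enablePeerAccessIdempotent` is invented for illustration.

```c
// Hypothetical sketch (not NCCL's internals): enable peer access
// idempotently, treating "already enabled" as success.
#include <cuda_runtime.h>

static cudaError_t enablePeerAccessIdempotent(int dev, int peerDev) {
  int canAccess = 0;
  cudaError_t err = cudaDeviceCanAccessPeer(&canAccess, dev, peerDev);
  if (err != cudaSuccess) return err;
  if (!canAccess) return cudaSuccess;  // no P2P path between these devices

  err = cudaSetDevice(dev);
  if (err != cudaSuccess) return err;

  err = cudaDeviceEnablePeerAccess(peerDev, 0 /* flags must be 0 */);
  if (err == cudaErrorPeerAccessAlreadyEnabled) {
    // The application (or another library) enabled it first; clear the
    // sticky error state and report success.
    cudaGetLastError();
    return cudaSuccess;
  }
  return err;
}
```

Note that this is exactly the call site cuda-memcheck flags later in this thread: the API call legitimately returns error 50, but the caller handles it.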
Thanks!
I am observing no P2P communication in nvprof when using BVLC Caffe with NCCL in the multi-GPU case. In the Caffe version without NCCL, I could see the P2P transfers between GPUs. Is there a reason why P2P is not being used by NCCL?
P2P is used, but through CUDA kernels. So you will not see explicit P2P cudaMemcpy operations; instead, CUDA kernels perform the computation as well as the remote P2P writes.
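To make the distinction concrete, here is a hypothetical sketch (kernel and variable names invented for illustration) of a remote P2P write: once peer access is enabled, a kernel on one GPU can dereference a pointer into another GPU's memory directly, so the profiler shows a kernel launch rather than a `cudaMemcpyPeer`.

```c
// Hypothetical sketch: a kernel on GPU 0 writes directly into GPU 1's
// memory. The store traverses NVLink/PCIe; no explicit memcpy appears
// in the profiler trace.
__global__ void pushToPeer(const float *local, float *remote, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) remote[i] = local[i];  // remote P2P write to the peer GPU
}

// Host-side setup (error checking omitted for brevity):
//   cudaSetDevice(1); cudaMalloc(&remoteBuf, bytes);
//   cudaSetDevice(0); cudaMalloc(&localBuf, bytes);
//   cudaDeviceEnablePeerAccess(1, 0);
//   pushToPeer<<<grid, block>>>(localBuf, remoteBuf, n);
```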
The problem is that cuda-memcheck will still complain about peer access already being enabled, which makes it hard to use when debugging NCCL applications. cuda-memcheck complains even when the application has no other problems, and it repeats this error message for every device communicator being initialized:

NCCL: Using devices
Rank 0 uses device 0 [0x01] GeForce GTX TITAN X
Rank 1 uses device 1 [0x02] GeForce GTX TITAN X
Rank 2 uses device 2 [0x03] GeForce GTX TITAN X
========= CUDA-MEMCHECK
========= Program hit cudaErrorPeerAccessAlreadyEnabled (error 50) due to "peer access is already enabled" on CUDA API call to cudaDeviceEnablePeerAccess.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2eea03]
========= Host Frame:/usr/local/cuda/lib64/libcudart.so.8.0 (cudaDeviceEnablePeerAccess + 0x1a9) [0x38f29]
========= Host Frame:/usr/local/cuda/lib64/libnccl.so.1 [0x56c2]
========= Host Frame:/usr/local/cuda/lib64/libnccl.so.1 (ncclCommInitAll + 0x646) [0x7a66]
@pseudotensor - You can tell cuda-memcheck to ignore those API error return values with an extra command-line flag; see the --help output for details.
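The comment above does not name the flag. On CUDA toolkits of that era, the relevant cuda-memcheck option appears to be `--report-api-errors`; verify the exact name and accepted values with `cuda-memcheck --help` for your version. The application name below is a placeholder.

```shell
# Suppress reports for CUDA API calls that return an error the application
# handles itself (such as cudaErrorPeerAccessAlreadyEnabled). Confirm the
# option with `cuda-memcheck --help` for your CUDA version.
cuda-memcheck --report-api-errors no ./my_nccl_app
```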
Do we need to enable peer-to-peer communication between GPUs manually, or does NCCL do it automatically (or not need it at all)?