Horovod hangs -- include/shm.h:42 NCCL WARN Cuda failure 'invalid argument' #893
I downgraded NCCL from 2.4.2-1 to 2.3.7-1 and reinstalled Horovod. That seems to have fixed the problem!
@maxhgerlach, did you happen to run in a container environment? Since NCCL is now open source, we can find the failing line, which is: `CUDACHECKGOTO(cudaHostRegister(ptr, shmsize, cudaHostRegisterMapped), res, cudaError);` Is it possible that you had insufficient shared memory provisioned?
@alsrgv, thanks for looking into this. We are not using any container technology. I believe there are no shared memory limits in place -- here's what I checked on the four hosts:
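For reference, checks along these lines (an illustrative sketch, not necessarily the exact commands used on those hosts) cover the usual shared-memory limits on Linux:

```shell
# Illustrative checks only -- not the exact commands from the original report.

# Size and usage of the POSIX shared-memory mount (/dev/shm) that NCCL's
# shared-memory transport allocates from
df -h /dev/shm

# Per-process locked-memory limit (cudaHostRegister pins host memory)
ulimit -l

# System-wide SysV shared-memory ceiling, in bytes
cat /proc/sys/kernel/shmmax
```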
That's pretty odd. @sjeaugey, does anything jump out to you as a possible cause for this?
This is a known, fixed bug: NVIDIA/nccl#185
@sjeaugey -- thanks for letting me know about that known issue! We will stick with NCCL 2.3 for now and reconsider 2.4 once the fix is released.
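For anyone pinning NCCL similarly, Horovod can be rebuilt against a specific NCCL installation via its documented build-time environment variables; a sketch, where the NCCL path is a placeholder:

```shell
# Rebuild Horovod against a locally installed NCCL 2.3 tree.
# HOROVOD_NCCL_HOME / HOROVOD_GPU_ALLREDUCE are Horovod build variables;
# the path below is a placeholder for your actual NCCL install prefix.
pip uninstall -y horovod
HOROVOD_GPU_ALLREDUCE=NCCL \
HOROVOD_NCCL_HOME=/usr/local/nccl-2.3.7 \
pip install --no-cache-dir horovod
```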
Hi,
I am trying to get Horovod training jobs running on a cluster built from NVIDIA RTX 2080 Ti GPUs and InfiniBand interconnects.
Installed software:
This is the device topology (PCI express dual root):
Horovod works fine on a single machine, or across multiple machines when fewer than four GPUs are used on each. With four or more GPUs per host, however, Horovod hangs, and NCCL warnings indicating a failure appear.
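For context, a 16-process run across four 4-GPU hosts is launched roughly as follows, following Horovod's documented Open MPI invocation (hostnames and the training script are placeholders):

```shell
# Illustrative launch only; host names and train.py are placeholders.
# NCCL_DEBUG=INFO surfaces the NCCL warnings seen in the hang.
mpirun -np 16 \
    -H host1:4,host2:4,host3:4,host4:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python train.py
```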
Here's an example; there is no progress after the last message: