cutorch.synchronize() does not work #9

Open
arashno opened this issue Nov 14, 2016 · 5 comments

arashno commented Nov 14, 2016

I installed NCCL and I am trying to use it.
Without NCCL, everything seems fine but slow.
With NCCL, my code simply stops when it calls cutorch.synchronize(); it hangs without producing any error.
How can I find the root of the problem?
Thanks

ngimel commented Nov 15, 2016

Do you have a small repro? A known issue with nccl is that it will hang if some other thread calls cudaFree while nccl kernels are being scheduled. You can try running your code with the env var THC_CACHING_ALLOCATOR=1 and see if the problem persists.
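For example (with your_script.lua standing in as a placeholder for whatever entry point you normally run; as far as I know the variable is read when cutorch is loaded, so it has to be set when the process is launched rather than from inside the script):

```
THC_CACHING_ALLOCATOR=1 th your_script.lua
```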

arashno commented Nov 15, 2016

I am using this code, but I changed it to use nccl:
https://github.com/soumith/imagenet-multiGPU.torch
I tried THC_CACHING_ALLOCATOR=1 and the problem persists.

ngimel commented Nov 15, 2016

No changes to this code are necessary to use nccl; it should pick it up automatically. Take a look at https://github.com/facebook/fb.resnet.torch/, which also uses nccl without deadlocks. If you are adding cutorch.synchronize() in such a way that it is called while nccl kernels are being scheduled, nccl will deadlock; make sure you are not doing that.
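Roughly, fb.resnet.torch sets up its parallel model like this (a sketch, not an exact copy; see models/init.lua in that repo for the real code):

```lua
-- dim=1, flattenParams=true, useNCCL=true: the third flag is what enables nccl
local gpus = torch.range(1, nGPU):totable()
local dpt = nn.DataParallelTable(1, true, true)
   :add(model, gpus)
   :threads(function()
      require 'cudnn'
   end)
model = dpt:cuda()
```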

arashno commented Nov 15, 2016

It doesn't seem that it uses nccl by default.
This is the original code in util.lua:

```lua
local model_single = model
model = nn.DataParallelTable(1)
for i = 1, nGPU do
   cutorch.setDevice(i)
   model:add(model_single:clone():cuda(), i)
end
```

I changed it to use nccl:

```lua
local model_single = model
model = nn.DataParallelTable(1, true, true)
model:threads(function()
   require 'cudnn'
end)
for i = 1, nGPU do
   cutorch.setDevice(i)
   model:add(model_single:clone():cuda(), i)
end
```

Without this change it uses the default communication between GPUs, which is very slow.
In fact, with the default communication, using 4 GPUs is slower than using just 1 GPU.

arashno commented Nov 16, 2016

Also, I found out that it works with 2 GPUs but hangs with 4 GPUs.
