
Multi-GPU support for the PyTorch bindings? #63

Closed
yongsiang-fb opened this issue Mar 14, 2022 · 5 comments · Fixed by #120
@yongsiang-fb

yongsiang-fb commented Mar 14, 2022

Hi,

I found that the following code would fail:

import torch as th
import tinycudann as tcnn
config = {
    "otype": "FullyFusedMLP",
    "activation": "ReLU",
    "output_activation": "None",
    "n_neurons": 64,
    "n_hidden_layers": 1,
}
net = tcnn.Network(16, 16, config)
net = net.to("cuda:1")
out = net(th.rand((256, 16), device="cuda:1"))

It seems the module does not properly support running on a GPU other than the default one, even after calling .to(device). Is it possible to fix this?

In addition, I also tried using torch.nn.DataParallel together with the hash encoding and the tiny MLP, and that fails as well. Is it possible to fix this? Thanks a lot!

@Tom94
Collaborator

Tom94 commented Mar 14, 2022

Hi there, yes, unfortunately tiny-cuda-nn does not support multi-GPU operation as of now. This is something that'd be cool to have in the future, but currently is not a high priority.

I'm going to leave this issue open to serve as a TODO marker.

Cheers!

@yongsiang-fb
Author

@Tom94 Thanks for the quick response! Hopefully it can be implemented one day. But even without multi-GPU support for now, shouldn't it still be possible to support the single-GPU case where both the input and the network are on cuda:1? In particular, I think we probably just need a CUDAGuard to change the default GPU to the one net.params is using?

I am guessing the error was caused by a mismatch between the default GPU and the GPU of the inputs/net.params. I even have some hope that th.nn.DataParallel would also work as long as there is a CUDAGuard, but maybe that's just wishful thinking. Would be great to hear your thoughts. Thanks!
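A sketch of the proposed workaround (forward_on is a hypothetical helper, not part of tcnn; it assumes a CUDA-enabled PyTorch build and is untested against tiny-cuda-nn itself):

```python
def forward_on(module, x, device):
    """Hypothetical workaround: make `device` the current CUDA device
    for the duration of the call, then restore the previous one.

    torch.cuda.device is a context manager that switches the current
    device on entry and restores it on exit, i.e. the Python-side
    counterpart of a C++ CUDAGuard.
    """
    import torch  # imported here so the sketch parses without torch installed
    with torch.cuda.device(device):
        return module(x.to(device))
```

Usage would be something like `out = forward_on(net, th.rand(256, 16), "cuda:1")`.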

@Tom94
Collaborator

Tom94 commented Mar 16, 2022

tcnn uses whichever CUDA device is "current" on the CPU thread, i.e. the device returned by cudaGetDevice. This device needs to remain the same across all calls into tcnn.

If CUDAGuard controls this (I haven't tried it myself), then yes, it should work the way you describe. If not, please let me know and I can add a tcnn-specific version of CUDAGuard that'll do the trick.

@yongsiang-fb
Author

Yes, you are absolutely right: it's the "current" device rather than the "default" device. CUDAGuard stores the original current device and sets the current device to the specified one. When the CUDAGuard goes out of scope, it restores the original current device.
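That save/set/restore behavior can be sketched with a GPU-free mock (DeviceGuard and _current are illustrative names, not part of any real API; _current stands in for the driver's per-thread current device, i.e. cudaGetDevice/cudaSetDevice):

```python
_current = 0  # device 0 is current initially

class DeviceGuard:
    """Minimal mock of CUDAGuard's save/set/restore semantics."""
    def __init__(self, device):
        self.device = device

    def __enter__(self):
        global _current
        self.saved = _current   # store the original current device
        _current = self.device  # make the requested device current
        return self

    def __exit__(self, *exc):
        global _current
        _current = self.saved   # restore on scope exit

with DeviceGuard(1):
    inside = _current   # 1 while the guard is alive
outside = _current      # back to 0 after the guard exits
```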

@MultiPath

MultiPath commented Jul 11, 2022

Hi, do you have any recent plans to support multi-GPU training? Or do you have any hints about why the current code prevents using multiple GPUs with PyTorch distributed training? I think it would be super useful to be able to train large-scale models with tiny MLPs using tinycudann. Thanks!
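For reference, the standard per-process DistributedDataParallel pattern looks like the sketch below (ddp_worker is a hypothetical helper; it assumes a CUDA build of PyTorch with NCCL, and whether tcnn works under it depends on the multi-GPU fix tracked in this issue, since tcnn needs cudaGetDevice to stay constant across all its calls):

```python
def ddp_worker(local_rank, world_size):
    """Sketch of one torch.distributed process (untested with tcnn)."""
    import torch
    import torch.distributed as dist
    import tinycudann as tcnn

    # Make this process's GPU current *before* constructing any tcnn
    # module, and never change it afterwards, so every call into tcnn
    # sees the same cudaGetDevice() result.
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl", rank=local_rank, world_size=world_size)

    net = tcnn.Network(16, 16, {
        "otype": "FullyFusedMLP",
        "activation": "ReLU",
        "output_activation": "None",
        "n_neurons": 64,
        "n_hidden_layers": 1,
    }).to(f"cuda:{local_rank}")
    net = torch.nn.parallel.DistributedDataParallel(
        net, device_ids=[local_rank])
    return net
```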
