
Issue while running the model over multiple GPUs using nn.DataParallel #85

Closed

ShoRit opened this issue Jun 19, 2021 · 2 comments

ShoRit commented Jun 19, 2021

I used the nn.DataParallel() method in an attempt to run the model over multiple GPUs. However, I am always met with this error:

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)

at the line

head = self.bn0(head)

Is there a way to remedy this? How did you carry out multi-GPU training?
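
For what it's worth, this class of error can be reproduced in a few lines outside the repository (purely illustrative, and it assumes a machine with at least two GPUs): a BatchNorm layer whose parameters live on one device is fed an input that lives on another.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(8).to("cuda:0")     # weight, bias, running stats on cuda:0
x = torch.randn(4, 8, device="cuda:1")  # input ends up on cuda:1
bn(x)  # raises the same "input ... same device as ... 'weight'" RuntimeError
```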

ShoRit commented Jun 19, 2021

One thing I discovered is that the tensors in self.embedding.weight are always on device 0, as opposed to the data, which has been split across the devices. This raises the question of whether the model is correctly being copied onto all 4 devices.

head cuda:3, embedding_weight cuda:0
head cuda:2, embedding_weight cuda:0
head cuda:1, embedding_weight cuda:0
head cuda:0, embedding_weight cuda:0
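
For reference, here is a minimal sketch of how nn.DataParallel is normally wired up (the names are illustrative, not this repository's model). DataParallel replicates the wrapped module's registered parameters and buffers onto every GPU on each forward call, so a weight that still reports cuda:0 inside a replica usually means the forward pass reaches it through a reference to the original, unwrapped model (or through a tensor created on a fixed device) rather than through self.

```python
import torch
import torch.nn as nn

class ScoringModel(nn.Module):  # illustrative stand-in, not the repo's model
    def __init__(self, num_entities, dim):
        super().__init__()
        self.embedding = nn.Embedding(num_entities, dim)  # registered -> replicated per GPU
        self.bn0 = nn.BatchNorm1d(dim)                    # registered -> replicated per GPU

    def forward(self, idx):
        head = self.embedding(idx)  # reach weights via self.*, not via an outer model object
        head = self.bn0(head)       # input and weight now live on the same replica's device
        return head

model = nn.DataParallel(ScoringModel(1000, 200).cuda())      # master copy on cuda:0
out = model(torch.randint(0, 1000, (64,), device="cuda:0"))  # batch is scattered across GPUs
```

If the wiring already looks like this and the weights still stay on cuda:0, checking whether the embedding table is shared with some object outside the module passed to nn.DataParallel would be the next thing to try.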

apoorvumang (Collaborator)

Unfortunately, multi-GPU is not yet supported in this code by default; you will have to make modifications. I would recommend using Hugging Face Accelerate.
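
For anyone taking the Accelerate route, a minimal training-loop sketch looks roughly like the following (the model, data, and optimizer here are placeholders, not this repository's code); the script is started with "accelerate launch train.py" after a one-time "accelerate config".

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Placeholder model and data, purely for illustration.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

accelerator = Accelerator()  # detects the launch configuration (single or multi-GPU)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

loss_fn = nn.CrossEntropyLoss()
for x, y in loader:                 # batches arrive on the correct device
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    accelerator.backward(loss)      # used in place of loss.backward()
    optimizer.step()
```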
