
Issue while running the model over multiple GPUs using nn.DataParallel #85

Closed

ShoRit opened this issue Jun 19, 2021 · 2 comments

ShoRit commented Jun 19, 2021

I used the nn.DataParallel() method in an attempt to run the model over multiple GPUs. However, I am always met with this error:

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_batch_norm)

at the line

head = self.bn0(head)

Is there a way to remedy this? How did you carry out multi-GPU training?
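
For what it's worth, this class of error can be reproduced in a few lines outside the repository (purely illustrative, and it assumes a machine with at least two GPUs): a BatchNorm layer whose parameters live on one device is fed an input that lives on another.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(8).to("cuda:0")     # weight, bias, running stats on cuda:0
x = torch.randn(4, 8, device="cuda:1")  # input ends up on cuda:1
bn(x)  # raises the same "input ... same device as ... 'weight'" RuntimeError
```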

ShoRit commented Jun 19, 2021

One thing I discovered is that the tensors in self.embedding.weight are always on device 0, as opposed to the data, which has been split across the devices. This raises the question of whether the model is correctly being copied onto all 4 devices.

head cuda:3, embedding_weight cuda:0
head cuda:2, embedding_weight cuda:0
head cuda:1, embedding_weight cuda:0
head cuda:0, embedding_weight cuda:0
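
For reference, here is a minimal sketch of how nn.DataParallel is normally wired up (the names are illustrative, not this repository's model). DataParallel replicates the wrapped module's registered parameters and buffers onto every GPU on each forward call, so a weight that still reports cuda:0 inside a replica usually means the forward pass reaches it through a reference to the original, unwrapped model (or through a tensor created on a fixed device) rather than through self.

```python
import torch
import torch.nn as nn

class ScoringModel(nn.Module):  # illustrative stand-in, not the repo's model
    def __init__(self, num_entities, dim):
        super().__init__()
        self.embedding = nn.Embedding(num_entities, dim)  # registered -> replicated per GPU
        self.bn0 = nn.BatchNorm1d(dim)                    # registered -> replicated per GPU

    def forward(self, idx):
        head = self.embedding(idx)  # reach weights via self.*, not via an outer model object
        head = self.bn0(head)       # input and weight now live on the same replica's device
        return head

model = nn.DataParallel(ScoringModel(1000, 200).cuda())      # master copy on cuda:0
out = model(torch.randint(0, 1000, (64,), device="cuda:0"))  # batch is scattered across GPUs
```

If the wiring already looks like this and the weights still stay on cuda:0, checking whether the embedding table is shared with some object outside the module passed to nn.DataParallel would be the next thing to try.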

apoorvumang (Collaborator)

Unfortunately, multi-GPU is not yet supported in this code by default; you will have to make modifications. I would recommend using Hugging Face Accelerate.
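
For anyone taking the Accelerate route, a minimal training-loop sketch looks roughly like the following (the model, data, and optimizer here are placeholders, not this repository's code); the script is started with "accelerate launch train.py" after a one-time "accelerate config".

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Placeholder model and data, purely for illustration.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

accelerator = Accelerator()  # detects the launch configuration (single or multi-GPU)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

loss_fn = nn.CrossEntropyLoss()
for x, y in loader:                 # batches arrive on the correct device
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    accelerator.backward(loss)      # used in place of loss.backward()
    optimizer.step()
```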
