
about multi-gpu support #34

Closed
pipipopo opened this issue Aug 29, 2018 · 6 comments

Comments

@pipipopo

Hi, thanks for your amazing work. I have a question about multi-GPU support. The original batch size is 2: one real image plus one random sample, so effectively a single pair. To train faster on multiple GPUs, the batch size should be larger, but simply increasing it causes an error. Is it easy to modify the code for this purpose?

@junyanz
Owner

junyanz commented Aug 30, 2018

What is your error? Have you set --gpu_ids?

@pipipopo
Author

I get the following error when using this configuration:

GPU_ID=0,1
--gpu_ids ${GPU_ID}
--batchSize 4

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

@pipipopo
Author

pipipopo commented Sep 3, 2018

create web directory ../checkpoints/edges2shoes/edges2shoes_bicycle_gan/web...
Traceback (most recent call last):
  File "./train.py", line 34, in <module>
    model.optimize_parameters()
  File "../BicycleGAN-master/models/bicycle_gan_model.py", line 211, in optimize_parameters
    self.update_G_and_E()
  File "../BicycleGAN-master/models/bicycle_gan_model.py", line 199, in update_G_and_E
    self.backward_EG()
  File "../BicycleGAN-master/models/bicycle_gan_model.py", line 151, in backward_EG
    self.loss_G_GAN = self.backward_G_GAN(self.fake_data_encoded, self.netD, self.opt.lambda_GAN)
  File "../BicycleGAN-master/models/bicycle_gan_model.py", line 143, in backward_G_GAN
    pred_fake = netD(fake)
  File "../anaconda2/envs/xx/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "../anaconda2/envs/xx/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 114, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "../anaconda2/envs/xx/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "../anaconda2/envs/xx/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
  File "../anaconda2/envs/xx/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 41, in _worker
    output = module(*input, **kwargs)
  File "../anaconda2/envs/xx/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "../BicycleGAN-master/models/networks.py", line 263, in forward
    result.append(self.model[i](down))
  File "../anaconda2/envs/xx/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "../anaconda2/envs/xx/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "../anaconda2/envs/xx/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "../anaconda2/envs/xx/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

@pipipopo
Author

pipipopo commented Sep 3, 2018

After reading the PyTorch docs, I still don't know how to assign the data to the right devices in BicycleGAN...

@pipipopo
Author

pipipopo commented Sep 3, 2018

Should
net = torch.nn.DataParallel(net, gpu_ids)
be paired with a matching
data.to(device)
call for the inputs?
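For reference, a minimal sketch of the usual DataParallel pattern in generic PyTorch (not BicycleGAN's exact code): wrap the network once, then move each input batch to the primary device. DataParallel scatters the batch across gpu_ids and gathers the outputs, so both the parameters and the inputs must start on gpu_ids[0]. The sketch falls back to CPU when fewer than two GPUs are visible.

```python
import torch
import torch.nn as nn

# Pick devices: use two GPUs if available, otherwise run on CPU.
gpu_ids = [0, 1] if torch.cuda.device_count() >= 2 else []
device = torch.device(f'cuda:{gpu_ids[0]}') if gpu_ids else torch.device('cpu')

# Move the network to the primary device BEFORE wrapping it.
net = nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)
if gpu_ids:
    net = nn.DataParallel(net, device_ids=gpu_ids)

# Inputs also go to the primary device; DataParallel splits the batch
# (here 4 -> 2 + 2) across the listed GPUs and gathers the result.
x = torch.randn(4, 3, 16, 16).to(device)
y = net(x)
print(y.shape)  # torch.Size([4, 8, 16, 16])
```
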

@junyanz
Owner

junyanz commented Sep 3, 2018

I found the issue. If you set --netD basic_256 and --netD2 basic_256, it should work. It seems that the multi-GPU issue comes from multi-scale D (--netD basic_256_multi). I haven't had time to address the issue there. I also updated the code so that it is compatible with pix2pix/CycleGAN. You may want to check out the latest commit.
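For context, one common way this kind of device mismatch arises under DataParallel (an illustration, not necessarily the exact bug in networks.py): submodules held in a plain Python list are invisible to nn.Module, so .to(device) and DataParallel's replicate step never move or copy them, and their weights stay on the original device. Registering them via nn.ModuleList fixes that.

```python
import torch
import torch.nn as nn

class BadMulti(nn.Module):
    """Stores sub-discriminators in a plain list: NOT registered."""
    def __init__(self):
        super().__init__()
        self.model = [nn.Conv2d(3, 8, 3) for _ in range(2)]

class GoodMulti(nn.Module):
    """Stores sub-discriminators in nn.ModuleList: registered properly."""
    def __init__(self):
        super().__init__()
        self.model = nn.ModuleList(nn.Conv2d(3, 8, 3) for _ in range(2))

bad, good = BadMulti(), GoodMulti()
# The plain list exposes no parameters, so .to()/replicate skip those layers.
print(len(list(bad.parameters())))   # 0
print(len(list(good.parameters())))  # 4  (2 convs x weight + bias)
```
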
