
bug in bc.py #7

Closed
zhangweifeng1218 opened this issue Jul 19, 2018 · 5 comments

Comments

@zhangweifeng1218

Line 39 in bc.py:
self.h_net = weight_norm(nn.Linear(h_dim, h_out), dim=None)
Should this be
self.h_net = weight_norm(nn.Linear(h_dim*self.k, h_out), dim=None)?

@jnhwkim
Owner

jnhwkim commented Jul 19, 2018

Yes, you're right. Can you send me a pull request for it?
Note that if the number of glimpses is fewer than 32, it does not affect the results, though.
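For reference, a rough sketch of the relevant branch in BCNet.__init__ (a simplified, partly hypothetical reconstruction, not the exact bc.py code): h_net is only created on the large-glimpse path, so smaller glimpse counts never hit the bug.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class BCNetSketch(nn.Module):
    """Sketch of BCNet's init branches; names follow bc.py, details assumed."""
    def __init__(self, v_dim, q_dim, h_dim, h_out, k=3):
        super(BCNetSketch, self).__init__()
        self.c = 32  # glimpse threshold
        self.k = k   # rank multiplier of the low-rank bilinear pooling
        # both modalities are projected to h_dim * k features
        self.v_net = nn.Linear(v_dim, h_dim * k)
        self.q_net = nn.Linear(q_dim, h_dim * k)
        if h_out <= self.c:
            # small glimpse counts use h_mat, which already spans
            # h_dim * k features, so they are unaffected by the bug
            self.h_mat = nn.Parameter(torch.randn(1, h_out, 1, h_dim * k))
            self.h_bias = nn.Parameter(torch.randn(1, h_out, 1, 1))
        else:
            # the reported fix: the input feature size must match the
            # h_dim * k joint representation, not h_dim
            self.h_net = weight_norm(nn.Linear(h_dim * k, h_out), dim=None)
```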

@zhangweifeng1218
Author

Thanks for your reply.
I downloaded your code and the required data, and ran 'python3 main.py --use_both True --use_vg True' on my machine, which has 4 Tesla V100 GPUs and PyTorch 0.4.0 installed.
But I got the following runtime error:

Traceback (most recent call last):
  File "main.py", line 99, in <module>
    train(model, train_loader, eval_loader, args.epochs, args.output, optim, epoch)
  File "/home1/yul/zwf/ban-vqa-master/train.py", line 72, in train
    pred, att = model(v, b, q, a)
  File "/home1/yul/.conda/envs/py3.5/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home1/yul/.conda/envs/py3.5/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 113, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home1/yul/.conda/envs/py3.5/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 118, in replicate
    return replicate(module, device_ids)
  File "/home1/yul/.conda/envs/py3.5/lib/python3.5/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
    param_copies = Broadcast.apply(devices, *params)
RuntimeError: slice() cannot be applied to a 0-dim tensor.

It seems that something goes wrong when torch copies the model onto the 4 GPUs. But there is no such error when I train other networks in a distributed fashion using nn.DataParallel. It is really confusing and I have not found the reason yet.
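One generic way to locate the parameter that trips the broadcast is to scan for 0-dim tensors; a sketch, where the Sequential model is a toy stand-in for the actual BAN model:

```python
import torch.nn as nn
from torch.nn.utils import weight_norm

# Toy stand-in for the real model; substitute the constructed BAN model.
model = nn.Sequential(weight_norm(nn.Linear(4, 8), dim=None))

# Broadcast.apply in nn.DataParallel cannot slice 0-dim tensors on
# PyTorch 0.4.0, so list every 0-dim parameter.
for name, p in model.named_parameters():
    if p.dim() == 0:
        print(name)  # prints '0.weight_g' on 0.4.0, nothing on 0.3.1
```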

@jnhwkim
Owner

jnhwkim commented Jul 21, 2018

@zhangweifeng1218 Unfortunately, our code is tested on PyTorch 0.3.1, as the README describes. I recommend checking the migration guide or related issues. Does the error persist when you run the code on 0.3.1? Also, I used 4 Titan Xps when I trained the model.

@zhangweifeng1218
Author

Thanks, I have found the reason. The implementation of weight_norm in PyTorch 0.4.0 is a little different: when dim is set to None, weight_norm in 0.4.0 outputs a 0-dim weight_g, which cannot be broadcast to multiple GPUs. Your code works well in PyTorch 0.3.1, whose weight_norm outputs a 1-dim weight_g when dim is None.
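A minimal repro of the difference, plus a hypothetical workaround (the safe route is still the README-pinned PyTorch 0.3.1):

```python
import torch.nn as nn
from torch.nn.utils import weight_norm

lin = weight_norm(nn.Linear(4, 8), dim=None)
print(lin.weight_g.dim())  # 0 on PyTorch 0.4.0, 1 on 0.3.1

# Hypothetical workaround on 0.4.0: reshape the magnitude parameter to
# 1-dim so DataParallel's Broadcast can slice it; the recomputed weight
# broadcasts identically either way.
lin.weight_g.data = lin.weight_g.data.view(1)
print(lin.weight_g.dim())  # 1
```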

@jnhwkim
Owner

jnhwkim commented Jul 22, 2018

@zhangweifeng1218 Good, thanks for the info.
