
bug in bc.py #7

Closed
zhangweifeng1218 opened this issue Jul 19, 2018 · 5 comments

Comments

@zhangweifeng1218

Line 39 in bc.py:
self.h_net = weight_norm(nn.Linear(h_dim, h_out), dim=None)
Should this be
self.h_net = weight_norm(nn.Linear(h_dim*self.k, h_out), dim=None)?

@jnhwkim
Owner

jnhwkim commented Jul 19, 2018

Yes, you're right. Can you send me a pull request for it?
Note that if the number of glimpses is fewer than 32, it does not affect the results, though.
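For reference, a rough sketch of the relevant branch in BCNet.__init__ (a simplified, partly hypothetical reconstruction, not the exact bc.py code): h_net is only created on the large-glimpse path, so smaller glimpse counts never hit the bug.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class BCNetSketch(nn.Module):
    """Sketch of BCNet's init branches; names follow bc.py, details assumed."""
    def __init__(self, v_dim, q_dim, h_dim, h_out, k=3):
        super(BCNetSketch, self).__init__()
        self.c = 32  # glimpse threshold
        self.k = k   # rank multiplier of the low-rank bilinear pooling
        # both modalities are projected to h_dim * k features
        self.v_net = nn.Linear(v_dim, h_dim * k)
        self.q_net = nn.Linear(q_dim, h_dim * k)
        if h_out <= self.c:
            # small glimpse counts use h_mat, which already spans
            # h_dim * k features, so they are unaffected by the bug
            self.h_mat = nn.Parameter(torch.randn(1, h_out, 1, h_dim * k))
            self.h_bias = nn.Parameter(torch.randn(1, h_out, 1, 1))
        else:
            # the reported fix: the input feature size must match the
            # h_dim * k joint representation, not h_dim
            self.h_net = weight_norm(nn.Linear(h_dim * k, h_out), dim=None)
```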

@zhangweifeng1218
Author

Thanks for your reply.
I downloaded your code and the required data, and ran 'python3 main.py --use_both True --use_vg True' on my machine, which has 4 Tesla V100 GPUs and PyTorch 0.4.0 installed.
But I got the following runtime error:

Traceback (most recent call last):
  File "main.py", line 99, in <module>
    train(model, train_loader, eval_loader, args.epochs, args.output, optim, epoch)
  File "/home1/yul/zwf/ban-vqa-master/train.py", line 72, in train
    pred, att = model(v, b, q, a)
  File "/home1/yul/.conda/envs/py3.5/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home1/yul/.conda/envs/py3.5/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 113, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home1/yul/.conda/envs/py3.5/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 118, in replicate
    return replicate(module, device_ids)
  File "/home1/yul/.conda/envs/py3.5/lib/python3.5/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
    param_copies = Broadcast.apply(devices, *params)
RuntimeError: slice() cannot be applied to a 0-dim tensor.

It seems that something goes wrong when torch copies the model onto the 4 GPUs. But there is no such error when I train other networks in a distributed fashion using nn.DataParallel. It is really confusing and I have not found the reason yet.
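One generic way to locate the parameter that trips the broadcast is to scan for 0-dim tensors; a sketch, where the Sequential model is a toy stand-in for the actual BAN model:

```python
import torch.nn as nn
from torch.nn.utils import weight_norm

# Toy stand-in for the real model; substitute the constructed BAN model.
model = nn.Sequential(weight_norm(nn.Linear(4, 8), dim=None))

# Broadcast.apply in nn.DataParallel cannot slice 0-dim tensors on
# PyTorch 0.4.0, so list every 0-dim parameter.
for name, p in model.named_parameters():
    if p.dim() == 0:
        print(name)  # prints '0.weight_g' on 0.4.0, nothing on 0.3.1
```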

@jnhwkim
Owner

jnhwkim commented Jul 21, 2018

@zhangweifeng1218 Unfortunately, our code is tested on PyTorch 0.3.1, as the README describes. I recommend checking the migration guide or related issues. Does the error persist when you run the code on 0.3.1? Also, I used 4 Titan Xps when I trained the model.

@zhangweifeng1218
Author

Thanks, I have found the reason. The implementation of weight_norm in PyTorch 0.4.0 is a little different: when dim is set to None, weight_norm in 0.4.0 outputs a 0-dim weight_g, which cannot be broadcast to multiple GPUs. Your code works well in PyTorch 0.3.1, whose weight_norm outputs a 1-dim weight_g when dim is None.
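A minimal repro of the difference, plus a hypothetical workaround (the safe route is still the README-pinned PyTorch 0.3.1):

```python
import torch.nn as nn
from torch.nn.utils import weight_norm

lin = weight_norm(nn.Linear(4, 8), dim=None)
print(lin.weight_g.dim())  # 0 on PyTorch 0.4.0, 1 on 0.3.1

# Hypothetical workaround on 0.4.0: reshape the magnitude parameter to
# 1-dim so DataParallel's Broadcast can slice it; the recomputed weight
# broadcasts identically either way.
lin.weight_g.data = lin.weight_g.data.view(1)
print(lin.weight_g.dim())  # 1
```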

@jnhwkim
Owner

jnhwkim commented Jul 22, 2018

@zhangweifeng1218 Good, thanks for the info.
