TypeError exception in AxialPositionalEncoding when using DataParallel #17

Closed

kl0211 opened this issue May 4, 2020 · 8 comments

kl0211 commented May 4, 2020

Hello,

I want to run SinkhornTransformerLM on multiple GPUs, so I'm wrapping the model in torch.nn.DataParallel. However, when I do this, I get an exception:

Traceback (most recent call last):
  File "script.py", line 27, in <module>
    model(x)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/sinkhorn_transformer/sinkhorn_transformer.py", line 792, in forward
    x = self.axial_pos_emb(x) + x
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/sinkhorn_transformer/sinkhorn_transformer.py", line 243, in forward
    return pos_emb[:, :t]
TypeError: 'int' object is not subscriptable

Looking at the code, it would seem that self.weights does not get populated. To reproduce this error, I took the first example in README.md and changed

model(x) # (1, 2048, 20000)

to

model = torch.nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count()))).to('cuda')
model(x)
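
For completeness, the full repro looks roughly like this. The SinkhornTransformerLM hyperparameters below are written from memory of the README's first example and may not match it exactly; wrapping in DataParallel is the only change.

import torch
from sinkhorn_transformer import SinkhornTransformerLM

# hyperparameters approximate the README's first example
model = SinkhornTransformerLM(
    num_tokens = 20000,
    dim = 1024,
    heads = 8,
    depth = 12,
    max_seq_len = 2048,
    bucket_size = 128,
    causal = False
)

# the only change: replicate the model across all visible GPUs
model = torch.nn.DataParallel(model, device_ids = list(range(torch.cuda.device_count()))).to('cuda')

x = torch.randint(0, 20000, (1, 2048)).cuda()
model(x) # expected output shape: (1, 2048, 20000)
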
lucidrains (Owner) commented

Nice! I'll fix this by tomorrow! (Both of my GPUs are in use at the moment) What are you using it for?

lucidrains (Owner) commented

@kl0211 oh no, pytorch/pytorch#36035. No worries, I'll think of something.
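
For context: if the replicas don't get the contents of the ParameterList, the error message makes sense. Assuming forward reduces over self.weights with something like sum() (an assumption on my part, not a quote of the code), an empty list falls back to the integer 0, and slicing an int raises exactly that TypeError:

# what a replica effectively does if its ParameterList comes back empty
weights = []                # no axial embedding tensors were replicated
pos_emb = sum(weights)      # sum of an empty sequence -> 0, a plain int
pos_emb[:, :2048]           # TypeError: 'int' object is not subscriptable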

lucidrains (Owner) commented

@kl0211 can you upgrade to the latest version and try again? I committed a hacky solution in 38fe17e

kl0211 (Author) commented May 5, 2020

@lucidrains I tried your latest commit. That let the example run to completion. However, when I then ran the enwik8_simple script, I got a different error:

Traceback (most recent call last):
  File "train.py", line 93, in <module>
    loss = model(next(train_loader), return_loss = True)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/sinkhorn_transformer/autoregressive_wrapper.py", line 115, in forward
    out = self.net(xi, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/sinkhorn_transformer/autopadder.py", line 68, in forward
    out = self.net(x, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/sinkhorn_transformer/sinkhorn_transformer.py", line 812, in forward
    x = self.axial_pos_emb(x) + x
RuntimeError: expected device cuda:0 but got device cuda:1

It seems like AxialPositionalEncoding is not being properly replicated across the different GPUs.
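
In case it helps, the usual workaround I've seen for this kind of device mismatch is to move whatever tensor the module keeps around onto the input's device inside forward, so each replica adds a tensor that lives on its own GPU. A rough sketch only; the attribute names are made up and this isn't the actual code:

import torch
from torch import nn

class PosEmbSketch(nn.Module):
    def __init__(self, dim, max_seq_len):
        super().__init__()
        # a plain tensor attribute is not moved by DataParallel's replicate(),
        # so without the .to() below every replica would see it on cuda:0
        self.pos_emb = torch.randn(1, max_seq_len, dim)

    def forward(self, x):
        b, t, e = x.shape
        # move the slice onto the input's device so the addition happens
        # on the replica's own GPU
        return x + self.pos_emb[:, :t].to(x.device)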

> Nice! I'll fix this by tomorrow! (Both of my GPUs are in use at the moment) What are you using it for?

I'm looking into using Sinkhorn (or Reformer) to create document embeddings. I want to see how it compares with embedding sentences (with sentence-transformers, for example) and merging them.

lucidrains (Owner) commented

@kl0211 oh ok, I put in a temporary fix, should work now!

very cool! I'd like to know how that turns out!

lucidrains (Owner) commented

@kl0211 you should try Deepspeed. DataParallel actually doesn't give you a very big speed up
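
Roughly what the swap looks like, from memory. The argument names and config keys should be double-checked against the DeepSpeed docs, and this assumes a release where deepspeed.initialize takes a config dict:

import deepspeed

# launched with the deepspeed launcher, e.g. `deepspeed train.py`
ds_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model = model,                          # e.g. the wrapped LM from enwik8_simple
    model_parameters = model.parameters(),
    config = ds_config,
)

for _ in range(num_batches):                # num_batches: whatever your training loop uses
    batch = next(train_loader).to(model_engine.device)
    loss = model_engine(batch, return_loss = True)
    model_engine.backward(loss)
    model_engine.step()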

kl0211 (Author) commented May 5, 2020

> @kl0211 oh ok, I put in a temporary fix, should work now!
>
> very cool! I'd like to know how that turns out!

@lucidrains, looks like your fix got it to work! Thanks a bunch!

> @kl0211 you should try Deepspeed. DataParallel actually doesn't give you a very big speed up

Cool! I'll see if I can try it out. Thanks for the tip!

lucidrains (Owner) commented

@kl0211 do share your results! This repository is still in the exploratory phase!
