TypeError exception in AxialPositionalEncoding when using DataParallel #17

Closed

kl0211 opened this issue May 4, 2020 · 8 comments

kl0211 commented May 4, 2020

Hello,

I want to run SinkhornTransformerLM on multiple GPUs, so I'm wrapping the model in torch.nn.DataParallel. However, when I do this, I get an exception:

Traceback (most recent call last):
  File "script.py", line 27, in <module>
    model(x)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/sinkhorn_transformer/sinkhorn_transformer.py", line 792, in forward
    x = self.axial_pos_emb(x) + x
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/sinkhorn_transformer/sinkhorn_transformer.py", line 243, in forward
    return pos_emb[:, :t]
TypeError: 'int' object is not subscriptable

Looking at the code, it would seem that self.weights does not get populated. To reproduce this error, I took the first example in README.md and changed

model(x) # (1, 2048, 20000)

to

model = torch.nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count()))).to('cuda')
model(x)
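
For completeness, the full repro looks roughly like this. The SinkhornTransformerLM hyperparameters below are written from memory of the README's first example and may not match it exactly; wrapping in DataParallel is the only change.

import torch
from sinkhorn_transformer import SinkhornTransformerLM

# hyperparameters approximate the README's first example
model = SinkhornTransformerLM(
    num_tokens = 20000,
    dim = 1024,
    heads = 8,
    depth = 12,
    max_seq_len = 2048,
    bucket_size = 128,
    causal = False
)

# the only change: replicate the model across all visible GPUs
model = torch.nn.DataParallel(model, device_ids = list(range(torch.cuda.device_count()))).to('cuda')

x = torch.randint(0, 20000, (1, 2048)).cuda()
model(x) # expected output shape: (1, 2048, 20000)
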
lucidrains (Owner) commented

Nice! I'll fix this by tomorrow! (Both of my GPUs are in use at the moment) What are you using it for?

lucidrains (Owner) commented

@kl0211 oh no, pytorch/pytorch#36035. No worries, I'll think of something.
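
For context: if the replicas don't get the contents of the ParameterList, the error message makes sense. Assuming forward reduces over self.weights with something like sum() (an assumption on my part, not a quote of the code), an empty list falls back to the integer 0, and slicing an int raises exactly that TypeError:

# what a replica effectively does if its ParameterList comes back empty
weights = []                # no axial embedding tensors were replicated
pos_emb = sum(weights)      # sum of an empty sequence -> 0, a plain int
pos_emb[:, :2048]           # TypeError: 'int' object is not subscriptable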

lucidrains (Owner) commented

@kl0211 can you upgrade to the latest version and try again? I committed a hacky solution in 38fe17e

kl0211 (Author) commented May 5, 2020

@lucidrains I tried your latest commit. That let the example run to completion. However, when I then ran the enwik8_simple script, I got a different error:

Traceback (most recent call last):
  File "train.py", line 93, in <module>
    loss = model(next(train_loader), return_loss = True)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/sinkhorn_transformer/autoregressive_wrapper.py", line 115, in forward
    out = self.net(xi, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/sinkhorn_transformer/autopadder.py", line 68, in forward
    out = self.net(x, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/sinkhorn_transformer/sinkhorn_transformer.py", line 812, in forward
    x = self.axial_pos_emb(x) + x
RuntimeError: expected device cuda:0 but got device cuda:1

It seems like AxialPositionalEncoding is not being properly replicated across the different GPUs.
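
In case it helps, the usual workaround I've seen for this kind of device mismatch is to move whatever tensor the module keeps around onto the input's device inside forward, so each replica adds a tensor that lives on its own GPU. A rough sketch only; the attribute names are made up and this isn't the actual code:

import torch
from torch import nn

class PosEmbSketch(nn.Module):
    def __init__(self, dim, max_seq_len):
        super().__init__()
        # a plain tensor attribute is not moved by DataParallel's replicate(),
        # so without the .to() below every replica would see it on cuda:0
        self.pos_emb = torch.randn(1, max_seq_len, dim)

    def forward(self, x):
        b, t, e = x.shape
        # move the slice onto the input's device so the addition happens
        # on the replica's own GPU
        return x + self.pos_emb[:, :t].to(x.device)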

> Nice! I'll fix this by tomorrow! (Both of my GPUs are in use at the moment) What are you using it for?

I'm looking into using Sinkhorn (or Reformer) to create document embeddings. I want to see how it compares with embedding sentences (with sentence-transformers, for example) and merging them.

lucidrains (Owner) commented

@kl0211 oh ok, I put in a temporary fix, should work now!

very cool! I'd like to know how that turns out!

lucidrains (Owner) commented

@kl0211 you should try Deepspeed. DataParallel actually doesn't give you a very big speed up
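
Roughly what the swap looks like, from memory. The argument names and config keys should be double-checked against the DeepSpeed docs, and this assumes a release where deepspeed.initialize takes a config dict:

import deepspeed

# launched with the deepspeed launcher, e.g. `deepspeed train.py`
ds_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model = model,                          # e.g. the wrapped LM from enwik8_simple
    model_parameters = model.parameters(),
    config = ds_config,
)

for _ in range(num_batches):                # num_batches: whatever your training loop uses
    batch = next(train_loader).to(model_engine.device)
    loss = model_engine(batch, return_loss = True)
    model_engine.backward(loss)
    model_engine.step()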

kl0211 (Author) commented May 5, 2020

> @kl0211 oh ok, I put in a temporary fix, should work now!
>
> very cool! I'd like to know how that turns out!

@lucidrains, looks like your fix got it to work! Thanks a bunch!

> @kl0211 you should try Deepspeed. DataParallel actually doesn't give you a very big speed up

Cool! I'll see if I can try it out. Thanks for the tip!

lucidrains (Owner) commented

@kl0211 do share your results! This repository is still in the exploratory phase!
