
Extra layer encoder_output_to_decoder_dim causes issue with distributed training #20

Closed
ncoop57 opened this issue May 4, 2022 · 2 comments

ncoop57 (Contributor) commented on May 4, 2022

Hiya, hope Ice Cream is doing well, as well as you of course!

I've been trying to get distributed training working with your library, and I discovered that this additional Linear layer, encoder_output_to_decoder_dim, is not used anywhere:

https://github.com/lucidrains/RETRO-pytorch/blob/main/retro_pytorch/retro_pytorch.py#L491

It seems to be a copy of the layer defined right above it, to_decoder_model_dim, which does get used. Having this extra layer that is not part of the loss calculation causes the following error under distributed data parallelism:

[RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one.](https://github.com/pytorch/pytorch/issues/43259)

I'm not sure whether this layer is supposed to be there and just isn't hooked up yet, or whether it's there by accident, so I wanted to ask 🤓
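
For anyone else hitting this, here is a minimal sketch of the failure mode (a made-up module, not the actual RETRO-pytorch code; names and dimensions are illustrative). A parameter that never participates in the forward pass never receives a gradient, so DDP's reducer raises the error above on the next iteration unless the layer is removed or DDP is constructed with `find_unused_parameters=True`:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self, enc_dim=256, dec_dim=512):
        super().__init__()
        self.to_decoder_model_dim = nn.Linear(enc_dim, dec_dim)           # used in forward
        self.encoder_output_to_decoder_dim = nn.Linear(enc_dim, dec_dim)  # defined but never called

    def forward(self, x):
        # only one of the two projections ever enters the autograd graph
        return self.to_decoder_model_dim(x)

# With the default find_unused_parameters=False, the unused layer's parameters
# never receive gradients during backward, and DDP's reducer errors out:
#
#   ddp = nn.parallel.DistributedDataParallel(ToyModel().cuda(), device_ids=[local_rank])
#   loss = ddp(torch.randn(8, 256).cuda()).pow(2).mean()
#   loss.backward()
```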

lucidrains (Owner) commented

[screenshot: Screen Shot 2022-05-04 at 3.45.31 PM]

thanks for asking about Ice Cream 😄 she is well!

and yes, that is indeed redundant, because the code was changed so that the encoder directly outputs to the decoder dimensions: https://github.com/lucidrains/RETRO-pytorch/blob/main/retro_pytorch/retro_pytorch.py#L518
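
roughly, the situation looks like this sketch (made-up names and dimensions, not the real class layout): the projection to the decoder dimension already lives inside the encoder, so the extra linear defined on the outside never gets called and can just be deleted

```python
import torch.nn as nn

class EncoderSketch(nn.Module):
    # stand-in for the encoder: it already projects to the decoder dimension internally
    def __init__(self, enc_dim=256, dec_dim=512):
        super().__init__()
        self.blocks = nn.Identity()                     # placeholder for the attention stack
        self.project_out = nn.Linear(enc_dim, dec_dim)  # encoder output is already dec_dim-sized

    def forward(self, x):
        return self.project_out(self.blocks(x))

class WrapperSketch(nn.Module):
    def __init__(self, enc_dim=256, dec_dim=512):
        super().__init__()
        self.encoder = EncoderSketch(enc_dim, dec_dim)
        # redundant: never called in forward, since the encoder already outputs
        # at the decoder dimension, so this layer can simply be deleted
        self.encoder_output_to_decoder_dim = nn.Linear(enc_dim, dec_dim)

    def forward(self, x):
        return self.encoder(x)
```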

let me go get that fixed! 🙏

lucidrains (Owner) commented

f166262

ncoop57 closed this as completed on May 4, 2022