
Error when using gloo as DDP backend #30

Closed

Saltychtao opened this issue Nov 3, 2022 · 3 comments

Comments

@Saltychtao

Hello! Thank you for your great work on implementing the VQ layer. When I use the VQ layer in DDP mode with gloo as the backend, as suggested in the README, I get the following error:

terminate called after throwing an instance of 'gloo::EnforceNotMet' what(): [enforce fail at ../third_party/gloo/gloo/transport/tcp/pair.cc:510] op.preamble.length <= op.nbytes. 8773632 vs 8386560

Do you have any ideas on how to solve this problem?
I also tried nccl as the backend, but the program just hangs forever...
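For readers landing here: below is a minimal sketch (not the reporter's code) of the kind of two-process gloo setup being described. The VectorQuantize arguments follow the library's README; the reporter's actual model and tensor shapes aren't given in the thread, so treat this as an illustration rather than a confirmed repro.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from vector_quantize_pytorch import VectorQuantize

def worker(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    # kmeans_init triggers cross-rank synchronization of the codebook,
    # which is where the collective ops (all_reduce etc.) come into play
    vq = VectorQuantize(dim=256, codebook_size=512, kmeans_init=True)
    x = torch.randn(8, 1024, 256)  # (batch, seq, dim)

    quantized, indices, commit_loss = vq(x)
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```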

@lucidrains
Owner

@Saltychtao Hey! I recently fixed a bug with distributed training (and another one this morning with regard to nccl).

What version are you on? Could you retry on the latest version and, if that doesn't work, send me a script that reproduces the error?

@Saltychtao
Author

Saltychtao commented Nov 8, 2022

@lucidrains Hello, I have tried the newest version of the code, but it still hangs when using nccl. Also, I am using your vector-quantize-pytorch library inside fairseq, a complex seq2seq framework, so it will take me some time to isolate a standalone script that reproduces the error. I will send it to you as soon as possible.

@Saltychtao
Author

I am using this library in fairseq, and it turns out that the all_reduce function from torch.distributed used in the library is not compatible with fairseq. When the functions from fairseq.distributed_utils are used instead, everything works well!
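For anyone hitting the same issue, here is a hypothetical sketch of the workaround described above: routing the reduction through fairseq's wrapper instead of calling torch.distributed directly. It assumes fairseq.distributed_utils exposes all_reduce(tensor, group, op="sum") and get_global_group(); verify both against your installed fairseq version.

```python
import torch
import torch.distributed as dist
from fairseq import distributed_utils  # module path taken from the comment above

t = torch.ones(4)

# what the VQ library does internally (hung / errored for the reporter):
dist.all_reduce(t)

# the reported workaround: go through fairseq's helper, which is aware of
# fairseq's own process groups (signature assumed, see note above)
distributed_utils.all_reduce(t, group=distributed_utils.get_global_group())
```

Since vector_quantize_pytorch imports torch.distributed itself, applying this in practice means patching or wrapping the library's internal calls rather than changing your own training loop.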
