Multi-node training is much slower than single-node training #187

Open
YingqingHe opened this issue Sep 29, 2022 · 1 comment

@YingqingHe

Hi, when I train models using tutel, I find that each training step takes much longer with multiple nodes than with a single node (with n nodes, a step takes roughly n times as long as with 1 node). As a result, multi-node training takes even more time than single-node training to finish one epoch.
Any debugging suggestions for this issue?
Thanks!!!

@ghostplant
Contributor

Hi, thanks for reporting this issue.

For a low-equipped distributed environment (e.g. Ethernet with low-end busbw), cross-node All2All is expected to show a significant drop in bandwidth utilization compared with single-node training, where the communication runs entirely over NVLink, unless you have high-end InfiniBand. Issue #160 discusses in detail what busbw is required to achieve a corresponding training throughput.
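
To make the scale of the gap concrete, here is a rough back-of-envelope comparison (the bandwidth and message-size numbers below are illustrative assumptions, not measurements from #160):

```python
# Back-of-envelope comparison of per-step All2All time under intra-node vs.
# cross-node bus bandwidth. All numbers are illustrative assumptions.
bytes_per_step = 1 * 2**30   # assume each All2All exchanges ~1 GiB per step
nvlink_busbw   = 300e9       # assumed intra-node NVLink busbw, ~300 GB/s
ethernet_busbw = 1.25e9      # assumed 10 GbE cross-node busbw, ~1.25 GB/s

print("intra-node All2All: %6.1f ms" % (bytes_per_step / nvlink_busbw * 1e3))
print("cross-node All2All: %6.1f ms" % (bytes_per_step / ethernet_busbw * 1e3))
# Under these assumptions the cross-node exchange is ~240x slower, which
# dominates the step time once training spans more than one node.
```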

The good news is that even though you see a throughput drop when first scaling to multiple nodes, further increasing the number of nodes no longer makes it significantly worse.

In addition, for a few scenarios you can set --parallel_type=adaptive:0, which skips All2All during training, and then check whether the step time improves.
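
For reference, here is a minimal sketch of how that option maps onto the Python API, assuming a tutel version whose moe_layer accepts the same parallel_type value that the bundled examples expose as a command-line flag; the layer dimensions and expert counts below are hypothetical placeholders:

```python
# Hedged sketch: building a tutel MoE layer with adaptive parallelism degree 0,
# i.e. the same setting as --parallel_type=adaptive:0, so training avoids the
# cross-node All2All. Assumes torch.distributed is already initialized and that
# the installed tutel version accepts the parallel_type keyword.
import torch
import torch.nn.functional as F
from tutel import moe as tutel_moe

model_dim = 1024   # hypothetical model width

moe_layer = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 2},
    model_dim=model_dim,
    experts={
        'type': 'ffn',
        'count_per_node': 2,              # hypothetical experts per node
        'hidden_size_per_expert': 4096,   # hypothetical FFN hidden size
        'activation_fn': lambda x: F.relu(x),
    },
    parallel_type='adaptive:0',           # skip All2All during training
)

x = torch.randn(8, 512, model_dim, device='cuda')
y = moe_layer(x)                          # output has the same shape as x
```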
