Question about all_gather speed #9

Closed
bendanzzc opened this issue Nov 8, 2022 · 2 comments

Comments

@bendanzzc

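Hello, thank you very much for open-sourcing Chinese CLIP; it has been a great help to our academic research.
Our lab has also tried pretraining on large-scale data, but with PyTorch the speed of all_gather drops significantly as the number of nodes grows, which makes training take much longer. How did you deal with the all_gather speed problem during your training? Thanks.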

@yangapku
Member

yangapku commented Nov 8, 2022

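Hello, this problem is indeed quite noticeable, and it depends heavily on the performance of the network between the nodes. In our training setup the machines supported RDMA communication and we enabled it, which is very important for all_gather performance and overall training speed.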
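
To illustrate why the interconnect matters so much here, below is a rough sketch of the textbook latency/bandwidth ("alpha-beta") cost model for a ring all_gather. It is not taken from this repository's training code. The function name, the shard size, and the latency and bandwidth figures are all hypothetical placeholders chosen for illustration, not measurements of any real cluster.

```python
def ring_all_gather_seconds(world_size, bytes_per_rank, latency_s, bandwidth_bytes_per_s):
    """Estimate ring all_gather time with a simple alpha-beta model.

    A ring all_gather takes (world_size - 1) steps; in each step every rank
    forwards one shard of `bytes_per_rank` bytes to its neighbour.
    """
    steps = world_size - 1
    return steps * (latency_s + bytes_per_rank / bandwidth_bytes_per_s)


# Hypothetical shard: 128 feature vectors of dimension 512 in fp16 (2 bytes each).
shard_bytes = 128 * 512 * 2

# Assumed link parameters, for illustration only (not measured values).
slow_link = {"latency_s": 50e-6, "bandwidth_bytes_per_s": 1.25e9}
fast_link = {"latency_s": 5e-6, "bandwidth_bytes_per_s": 12.5e9}

for world_size in (8, 32, 128):
    slow = ring_all_gather_seconds(world_size, shard_bytes, **slow_link)
    fast = ring_all_gather_seconds(world_size, shard_bytes, **fast_link)
    print(f"world_size={world_size:4d}  slow link: {slow * 1e3:.3f} ms  "
          f"fast link: {fast * 1e3:.3f} ms")
```

Under this model the cost grows linearly with the number of ranks, so lower per-step latency and higher bandwidth reduce the slope but do not remove the growth. Real collectives may use other algorithms (tree or hierarchical, for example), so treat this only as intuition.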

@bendanzzc
Author

Wow, an instant reply. Thanks!

yangapku closed this as completed Nov 8, 2022