Multi-node multi-GPU training #12

Closed
mantianlong opened this issue Nov 13, 2022 · 4 comments

Comments

@mantianlong

Hi, when I train on 2 machines with 4 A100 GPUs each, one epoch takes about 10 hours, while on a single machine with 4 A100 GPUs one epoch takes about 30 minutes. What could be causing the drop in efficiency for multi-node multi-GPU training?

@yangapku
Member

Hi, our multi-GPU training currently enables feature aggregation by default (controlled by the --skip-aggregate option in the training configuration). At the model output side, an inter-machine all-gather communication op is performed, which aggregates the local batch on each GPU into a global batch so that the contrastive loss can be computed with more negative samples. This operation consumes a lot of communication bandwidth, so it can be bottlenecked by the inter-machine network performance. The optimal setup is an inter-machine network that supports RDMA, which gives a much better training speedup.

@yangapku
Member

yangapku commented Nov 13, 2022

Here is the location in the code where the all-gather operation takes place:

# We gather tensors from all gpus to get more negatives to contrast with.
gathered_image_features = [
    torch.zeros_like(image_features) for _ in range(world_size)
]
gathered_text_features = [
    torch.zeros_like(text_features) for _ in range(world_size)
]
dist.all_gather(gathered_image_features, image_features)
dist.all_gather(gathered_text_features, text_features)
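
For context, here is a minimal sketch of how such gathered tensors are typically folded back into a global batch for the contrastive loss. It assumes an already-initialized process group; the helper name aggregated_clip_logits, the logit_scale argument, and the gradient-handling detail are illustrative assumptions, not a verbatim quote of this repository:

import torch
import torch.distributed as dist

def aggregated_clip_logits(image_features, text_features, logit_scale):
    # Assumes dist.init_process_group() has already been called (e.g. via torchrun)
    # and that image_features / text_features are this GPU's L2-normalized
    # embeddings of shape [local_batch, dim].
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    gathered_image_features = [torch.zeros_like(image_features) for _ in range(world_size)]
    gathered_text_features = [torch.zeros_like(text_features) for _ in range(world_size)]
    dist.all_gather(gathered_image_features, image_features)
    dist.all_gather(gathered_text_features, text_features)

    # all_gather does not backpropagate gradients to the gathered copies,
    # so substitute the local tensors back in at this rank's position.
    gathered_image_features[rank] = image_features
    gathered_text_features[rank] = text_features
    all_image_features = torch.cat(gathered_image_features, dim=0)
    all_text_features = torch.cat(gathered_text_features, dim=0)

    # Each local sample is now contrasted against world_size * local_batch
    # candidates instead of local_batch, which is where the extra negatives
    # (and the extra communication and memory cost) come from.
    logits_per_image = logit_scale * image_features @ all_text_features.t()
    logits_per_text = logit_scale * text_features @ all_image_features.t()
    return logits_per_image, logits_per_text

Note that the concatenation is why the logits matrix (and the activation memory behind it) grows with the global batch size when aggregation is enabled.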

@mantianlong
Author

Got it, awesome 👍🏻, thanks!

@Xujianzhong

Piggybacking on this thread to ask a question: we are currently using PyTorch 2.0, and as the number of machine nodes increases, GPU memory consumption gets worse and worse.

For example: with 1 machine of 8 V100 GPUs the batch size can reach 2048, but with 4 machines of 8 V100 GPUs each the batch size can only reach 1024.
@yangapku Have you encountered this issue, and is there any way to optimize it?
