Multi-node multi-GPU training #12

Closed
mantianlong opened this issue Nov 13, 2022 · 4 comments

Comments

@mantianlong

Hi, when I train on 2 machines with 4 A100 GPUs each, one epoch takes about 10 hours, while on a single machine with 4 A100 GPUs one epoch takes about 30 minutes. What could be causing the drop in efficiency for multi-node multi-GPU training?

@yangapku
Member

Hi, our multi-GPU training currently enables feature aggregation by default (controlled by the --skip-aggregate option in the training configuration). At the model output side, an inter-machine all-gather communication op is performed, which aggregates the local batch on each GPU into a global batch so that the contrastive loss can be computed with more negative samples. This operation consumes a lot of communication bandwidth, so it can be bottlenecked by the inter-machine network performance. The optimal setup is an inter-machine network that supports RDMA, which gives a much better training speedup.

@yangapku
Member

yangapku commented Nov 13, 2022

Here is the location in the code where the all-gather operation takes place:

# We gather tensors from all gpus to get more negatives to contrast with.
gathered_image_features = [
    torch.zeros_like(image_features) for _ in range(world_size)
]
gathered_text_features = [
    torch.zeros_like(text_features) for _ in range(world_size)
]
dist.all_gather(gathered_image_features, image_features)
dist.all_gather(gathered_text_features, text_features)
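
For context, here is a minimal sketch of how such gathered tensors are typically folded back into a global batch for the contrastive loss. It assumes an already-initialized process group; the helper name aggregated_clip_logits, the logit_scale argument, and the gradient-handling detail are illustrative assumptions, not a verbatim quote of this repository:

import torch
import torch.distributed as dist

def aggregated_clip_logits(image_features, text_features, logit_scale):
    # Assumes dist.init_process_group() has already been called (e.g. via torchrun)
    # and that image_features / text_features are this GPU's L2-normalized
    # embeddings of shape [local_batch, dim].
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    gathered_image_features = [torch.zeros_like(image_features) for _ in range(world_size)]
    gathered_text_features = [torch.zeros_like(text_features) for _ in range(world_size)]
    dist.all_gather(gathered_image_features, image_features)
    dist.all_gather(gathered_text_features, text_features)

    # all_gather does not backpropagate gradients to the gathered copies,
    # so substitute the local tensors back in at this rank's position.
    gathered_image_features[rank] = image_features
    gathered_text_features[rank] = text_features
    all_image_features = torch.cat(gathered_image_features, dim=0)
    all_text_features = torch.cat(gathered_text_features, dim=0)

    # Each local sample is now contrasted against world_size * local_batch
    # candidates instead of local_batch, which is where the extra negatives
    # (and the extra communication and memory cost) come from.
    logits_per_image = logit_scale * image_features @ all_text_features.t()
    logits_per_text = logit_scale * text_features @ all_image_features.t()
    return logits_per_image, logits_per_text

Note that the concatenation is why the logits matrix (and the activation memory behind it) grows with the global batch size when aggregation is enabled.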

@mantianlong
Author

Got it, awesome 👍🏻, thanks!

@Xujianzhong

Piggybacking on this thread to ask a question: we are currently using PyTorch 2.0, and as the number of machine nodes increases, GPU memory consumption gets worse and worse.

For example: with 1 machine of 8 V100 GPUs the batch size can reach 2048, but with 4 machines of 8 V100 GPUs each the batch size can only reach 1024.
@yangapku Have you encountered this issue, and is there any way to optimize it?
