Unstable GPU utilization when testing pipeline parallelism with HybridParallelPlugin #4746
Replies: 3 comments 1 reply
-
We need to integrate this part properly; a simple drop-in replacement may cause problems.
-
A question about the code implementation. I found that with Megatron's one_f_one_b schedule, GPU utilization stays stable at 100%, but ColossalAI's frequently drops. I compared the new pipeline code carefully: the implementation under the legacy folder is fairly close to Megatron's, but the new pipeline-parallel code omits the create_recv_buffer_with_shapes step that pre-allocates receive buffers. The latest pipeline code instead handles sending and receiving objects differently, mainly through the _broadcast_object_list, _send_object, and _recv_object functions. In these functions, the object is serialized into a byte stream and then sent or received via the broadcast function of PyTorch's distributed communication library (c10d). I'd like to ask: does the lack of a pre-allocated buffer hurt performance?
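The difference described above can be sketched with pure-Python stand-ins (these are illustrative functions, not the actual torch.distributed or ColossalAI calls): the legacy/Megatron-style path writes raw tensor data into a buffer whose shape is already known, while the object-based path pays per-iteration serialization and metadata-exchange costs.

```python
import pickle

# Pattern A (legacy / Megatron-style, hypothetical stand-in): the receiver
# pre-allocates a buffer of a known shape, and the payload is written
# straight into it -- no serialization, no size negotiation per step.
def recv_into_preallocated(src, recv_buffer):
    recv_buffer[:] = src  # raw element copy; size was agreed on up front
    return recv_buffer

# Pattern B (object-based path, as the comment describes it): the object is
# pickled to a byte stream, its length must be communicated first, then the
# bytes are sent and unpickled on the receiving side. Both the pickling CPU
# work and the extra size exchange happen on every communication step.
def send_as_object(obj):
    payload = pickle.dumps(obj)       # serialize (extra CPU work + copy)
    size = len(payload)               # metadata that must also be exchanged
    received = pickle.loads(payload)  # deserialize on the receiving rank
    return size, received

activation = [float(i) for i in range(4)]

buf = [0.0] * 4
recv_into_preallocated(activation, buf)

size, roundtrip = send_as_object(activation)
```

This is only a model of the control flow, not a benchmark; whether the serialization path is what actually causes the utilization drops would need profiling on real hardware.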
-
When testing pipeline parallelism with HybridParallelPlugin, GPU utilization is unstable and frequently drops below 50%, while Megatron usually stays above 90%. Is the pipeline communication path simply not yet optimized?
My parameters:
```python
plugin = HybridParallelPlugin(tp_size=1,
                              pp_size=2,
                              enable_flash_attention=True,
                              enable_fused_normalization=True,
                              enable_jit_fused=True,
                              microbatch_size=2,
                              precision='bf16',
                              zero_stage=1)
```
Throughput is also slow, only 4.8 samples/sec.
Utilization screenshots below:
[screenshot: GPU utilization]
[screenshot: GPU utilization]
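One factor worth ruling out, independent of the communication code: with pp_size=2, the 1F1B schedule itself leaves stages idle unless enough microbatches are in flight per step. A standard estimate of the idle ("bubble") fraction, with p pipeline stages and m microbatches, is (p - 1) / (m + p - 1); a quick sketch:

```python
def bubble_fraction(p, m):
    """Idle-time fraction of a 1F1B/GPipe-style pipeline schedule.

    p: number of pipeline stages, m: microbatches per optimizer step.
    """
    return (p - 1) / (m + p - 1)

# With pp_size=2, if the global batch only yields 2 microbatches per step,
# one third of each step is pipeline bubble:
low = bubble_fraction(2, 2)    # 1/3

# Raising the microbatch count shrinks the bubble considerably:
high = bubble_fraction(2, 16)  # 1/17
```

So if the global batch size divided by microbatch_size gives only a handful of microbatches, sub-50% utilization can come from the schedule alone, on top of any communication overhead.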