Unstable GPU utilization when testing pipeline parallelism with HybridParallelPlugin #4746
Replies: 3 comments 1 reply
-
We need to integrate this part properly; a simple drop-in replacement may cause problems.
-
A question about the code implementation. I found that with Megatron's one_f_one_b schedule, GPU utilization stays stable at 100%, but ColossalAI's frequently drops. I compared the new pipeline code carefully: the implementation under the legacy folder is fairly close to Megatron's, but the new pipeline-parallel code omits the create_recv_buffer_with_shapes step that pre-allocates receive buffers. The latest pipeline code instead handles sending and receiving objects differently, mainly through the _broadcast_object_list, _send_object, and _recv_object functions. In these functions, the object is serialized into a byte stream and then sent or received via the broadcast function of PyTorch's distributed communication library (c10d). I'd like to ask: does the lack of a pre-allocated buffer hurt performance?
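The difference described above can be sketched with pure-Python stand-ins (these are illustrative functions, not the actual torch.distributed or ColossalAI calls): the legacy/Megatron-style path writes raw tensor data into a buffer whose shape is already known, while the object-based path pays per-iteration serialization and metadata-exchange costs.

```python
import pickle

# Pattern A (legacy / Megatron-style, hypothetical stand-in): the receiver
# pre-allocates a buffer of a known shape, and the payload is written
# straight into it -- no serialization, no size negotiation per step.
def recv_into_preallocated(src, recv_buffer):
    recv_buffer[:] = src  # raw element copy; size was agreed on up front
    return recv_buffer

# Pattern B (object-based path, as the comment describes it): the object is
# pickled to a byte stream, its length must be communicated first, then the
# bytes are sent and unpickled on the receiving side. Both the pickling CPU
# work and the extra size exchange happen on every communication step.
def send_as_object(obj):
    payload = pickle.dumps(obj)       # serialize (extra CPU work + copy)
    size = len(payload)               # metadata that must also be exchanged
    received = pickle.loads(payload)  # deserialize on the receiving rank
    return size, received

activation = [float(i) for i in range(4)]

buf = [0.0] * 4
recv_into_preallocated(activation, buf)

size, roundtrip = send_as_object(activation)
```

This is only a model of the control flow, not a benchmark; whether the serialization path is what actually causes the utilization drops would need profiling on real hardware.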
-
When testing pipeline parallelism with HybridParallelPlugin, GPU utilization is unstable and frequently drops below 50%, while Megatron usually stays above 90%. Is the pipeline communication path simply not yet optimized?
My parameters:
```python
plugin = HybridParallelPlugin(tp_size=1,
                              pp_size=2,
                              enable_flash_attention=True,
                              enable_fused_normalization=True,
                              enable_jit_fused=True,
                              microbatch_size=2,
                              precision='bf16',
                              zero_stage=1)
```
Throughput is also slow, only 4.8 samples/sec.
Utilization screenshots below:
[screenshot: GPU utilization]
[screenshot: GPU utilization]
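One factor worth ruling out, independent of the communication code: with pp_size=2, the 1F1B schedule itself leaves stages idle unless enough microbatches are in flight per step. A standard estimate of the idle ("bubble") fraction, with p pipeline stages and m microbatches, is (p - 1) / (m + p - 1); a quick sketch:

```python
def bubble_fraction(p, m):
    """Idle-time fraction of a 1F1B/GPipe-style pipeline schedule.

    p: number of pipeline stages, m: microbatches per optimizer step.
    """
    return (p - 1) / (m + p - 1)

# With pp_size=2, if the global batch only yields 2 microbatches per step,
# one third of each step is pipeline bubble:
low = bubble_fraction(2, 2)    # 1/3

# Raising the microbatch count shrinks the bubble considerably:
high = bubble_fraction(2, 16)  # 1/17
```

So if the global batch size divided by microbatch_size gives only a handful of microbatches, sub-50% utilization can come from the schedule alone, on top of any communication overhead.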