[rank1]: Traceback (most recent call last):
[rank1]: File "/home/ma-user/twinkle/cookbook/transformers/fsdp2_moe.py", line 89, in <module>
[rank1]: train()
[rank1]: File "/home/ma-user/twinkle/cookbook/transformers/fsdp2_moe.py", line 76, in train
[rank1]: metric = model.calculate_metric(is_training=True)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/ma-user/twinkle/src/twinkle/infra/__init__.py", line 647, in wrapper
[rank1]: return func(self, *args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/ma-user/twinkle/src/twinkle/model/transformers/transformers.py", line 1019, in calculate_metric
[rank1]: return optimizer_config.calculate_metrics(is_training)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/ma-user/twinkle/src/twinkle/model/optimizer_group.py", line 82, in calculate_metrics
[rank1]: results.update(metric.calculate())
[rank1]: ^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/ma-user/twinkle/src/twinkle/metric/loss.py", line 59, in calculate
[rank1]: all_results = self.gather_results(local_results)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/ma-user/twinkle/src/twinkle/metric/base.py", line 25, in gather_results
[rank1]: all_results = torch_util.gather_object(local_results, self.device_mesh, self.process_group)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/ma-user/twinkle/src/twinkle/utils/framework.py", line 54, in gather_object
[rank1]: process_group = mpu.get_data_parallel_group_gloo(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/ma-user/Megatron-LM/megatron/core/parallel_state.py", line 1370, in get_data_parallel_group_gloo
[rank1]: assert _DATA_PARALLEL_GROUP_GLOO is not None, "data parallel group-gloo is not initialized"
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: AssertionError: data parallel group-gloo is not initialized
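Checklist
Bug Description
Training with FSDP2 + CP on NPU fails with the AssertionError shown in the traceback above ("data parallel group-gloo is not initialized").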
How to Reproduce
Pull the master branch of twinkle, adjust the NPU-related parameters, then run:
cd /home/ma-user/twinkle/cookbook/transformers && sh fsdp2_moe.sh
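Additional Information
Investigation shows that the all-gather handling in /home/ma-user/twinkle/src/twinkle/utils/framework.py (gather_object) calls into Megatron via mpu.get_data_parallel_group_gloo, but there is no if-branch checking whether the run is FSDP training. Under FSDP2 the Megatron parallel state is never initialized, so the assertion in parallel_state.py fires.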
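A rough sketch of the kind of guard that seems to be missing, in case it helps the discussion. This is not twinkle's actual code: resolve_gather_group, the is_megatron flag, and the get_megatron_group callback are hypothetical names, and the real gather_object takes a device_mesh and process_group that I have not modeled here. The idea is only to consult Megatron's parallel state when the run really uses Megatron, and otherwise fall back to the group passed in (or the default group).

```python
from typing import Any, Callable, List, Optional


def resolve_gather_group(
    is_megatron: bool,
    explicit_group: Optional[Any] = None,
    get_megatron_group: Optional[Callable[[], Any]] = None,
) -> Optional[Any]:
    """Pick the process group for an object gather (hypothetical helper).

    Megatron's parallel state is only queried when the run actually uses
    the Megatron backend; FSDP runs never reach that call.
    """
    if explicit_group is not None:
        # A group supplied by the caller always wins.
        return explicit_group
    if is_megatron and get_megatron_group is not None:
        # e.g. mpu.get_data_parallel_group_gloo in a Megatron run
        return get_megatron_group()
    # None selects the default process group in torch.distributed.
    return None


def gather_object(
    obj: Any,
    is_megatron: bool = False,
    explicit_group: Optional[Any] = None,
    get_megatron_group: Optional[Callable[[], Any]] = None,
) -> List[Any]:
    """Gather one Python object from every rank in the resolved group."""
    # Imported lazily so the group-selection logic can be tested without torch.
    import torch.distributed as dist

    if not (dist.is_available() and dist.is_initialized()):
        # Single-process run: nothing to gather.
        return [obj]

    group = resolve_gather_group(is_megatron, explicit_group, get_megatron_group)
    gathered: List[Any] = [None] * dist.get_world_size(group=group)
    dist.all_gather_object(gathered, obj, group=group)
    return gathered
```

With something like this, the FSDP2 path would call gather_object(local_results, is_megatron=False, explicit_group=self.process_group) and never touch megatron.core.parallel_state. How twinkle should detect the backend (a config flag, the device mesh, or something else) is left open here.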