LoRA training with FSDP2 raises AssertionError: data parallel group-gloo with context parallel combined is not initialized #184

@0hujun

Description

Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

Training with FSDP2 + CP on an NPU fails with the following error:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/ma-user/twinkle/cookbook/transformers/fsdp2_moe.py", line 89, in <module>
[rank1]:     train()
[rank1]:   File "/home/ma-user/twinkle/cookbook/transformers/fsdp2_moe.py", line 76, in train
[rank1]:     metric = model.calculate_metric(is_training=True)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/ma-user/twinkle/src/twinkle/infra/__init__.py", line 647, in wrapper
[rank1]:     return func(self, *args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/ma-user/twinkle/src/twinkle/model/transformers/transformers.py", line 1019, in calculate_metric
[rank1]:     return optimizer_config.calculate_metrics(is_training)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/ma-user/twinkle/src/twinkle/model/optimizer_group.py", line 82, in calculate_metrics
[rank1]:     results.update(metric.calculate())
[rank1]:                    ^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/ma-user/twinkle/src/twinkle/metric/loss.py", line 59, in calculate
[rank1]:     all_results = self.gather_results(local_results)
[rank1]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/ma-user/twinkle/src/twinkle/metric/base.py", line 25, in gather_results
[rank1]:     all_results = torch_util.gather_object(local_results, self.device_mesh, self.process_group)
[rank1]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/ma-user/twinkle/src/twinkle/utils/framework.py", line 54, in gather_object
[rank1]:     process_group = mpu.get_data_parallel_group_gloo(
[rank1]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/ma-user/Megatron-LM/megatron/core/parallel_state.py", line 1370, in get_data_parallel_group_gloo
[rank1]:     assert _DATA_PARALLEL_GROUP_GLOO is not None, "data parallel group-gloo is not initialized"
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: AssertionError: data parallel group-gloo is not initialized

How to Reproduce

Pull the master branch of twinkle, adjust the NPU-related parameters, then run cd /home/ma-user/twinkle/cookbook/transformers && sh fsdp2_moe.sh

Additional Information

Investigation shows that the all-gather handling in /home/ma-user/twinkle/src/twinkle/utils/framework.py unconditionally calls into Megatron's parallel state, but there is no if branch checking whether the run is FSDP training. Under FSDP2, Megatron's data-parallel gloo group is never initialized, so the call to mpu.get_data_parallel_group_gloo() trips the assertion above.

Labels

bug (Something isn't working)
