Training an MoE model with transformers 5.2.0: loss becomes NaN when using torch_npu's optimized MoE operators #10244
Unanswered
piekey1994 asked this question in Q&A
Replies: 0 comments
Following the earlier implementation in src/llamafactory/v1/plugins/model_plugins/kernels/ops/mlp/npu_fused_moe.py, I adapted the qwen3-moe replacement code for 5.2.0 as follows:
`class NpuMoeFused5_2:
    """Container for NPU fused MoE forward functions."""
if not is_transformers_version_greater_than("5.0.0"):
    kernel_moe_mapping["Qwen3MoeForCausalLM"] = {
        "Qwen3MoeSparseMoeBlock": Qwen3NpuMoeFused.qwen3moe_sparse_moe_block_forward
    }
else:
    kernel_moe_mapping["Qwen3MoeForCausalLM"] = {
        "Qwen3MoeExperts": NpuMoeFused5_2.npu_moe_experts_forward
    }`
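I then ran SFT with FSDP + LoRA. The first step is normal, but at the second step the loss is NaN. While debugging I also found that in the second step the hidden_states are already NaN before they reach the MLP, that is, at the attention stage. My gradient accumulation is 2, so in principle the first step should not have updated the model at all, and I don't understand why this happens. With the optimized MoE operator turned off, training runs, but it is very slow.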
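To narrow down where the NaN first appears, a minimal diagnostic sketch along these lines might help. It is not part of LLaMA-Factory or torch_npu; the helper names `attach_nonfinite_hooks` and `nonfinite_parameters` are hypothetical and use only standard PyTorch calls. The idea is to record which module first emits a non-finite output during a forward pass, and to check whether any parameter is already non-finite before the second step begins.

```python
import torch
from torch import nn


def _tensors(obj):
    """Yield every tensor found in a (possibly nested) module output."""
    if isinstance(obj, torch.Tensor):
        yield obj
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            yield from _tensors(item)
    elif isinstance(obj, dict):
        for item in obj.values():
            yield from _tensors(item)


def attach_nonfinite_hooks(model: nn.Module, report: list):
    """Hypothetical helper: append the name of each module whose forward
    output contains NaN/Inf. Hooks fire as each module finishes, so
    report[0] is the earliest (innermost) offender."""
    handles = []
    for name, module in model.named_modules():
        def hook(mod, args, output, name=name or "<root>"):
            for t in _tensors(output):
                if t.is_floating_point() and not torch.isfinite(t).all():
                    report.append(name)
                    break
        handles.append(module.register_forward_hook(hook))
    return handles


def nonfinite_parameters(model: nn.Module):
    """Hypothetical helper: names of parameters that already hold NaN/Inf."""
    return [n for n, p in model.named_parameters()
            if not torch.isfinite(p).all()]
```

Calling `nonfinite_parameters(model)` right after the first micro-step, before any optimizer step, would show whether weights were corrupted even though no update was expected; under FSDP this presumably inspects only the local shard on each rank. If the parameters are clean and `report[0]` names an attention module, the non-finite values are entering through the activations rather than the weights.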
Has anyone run into this problem? I also replaced the MoE operator for 3.5 in the same way, and the loss becomes NaN there as well.