modify NPU Qwen3.5 Megatron practice #9382
Code Review
This pull request updates the NPU support documentation for Qwen3.5, introducing the USE_MCORE_GDN=0 environment variable and detailing specific version constraints and functional limitations when using Megatron-SWIFT. Feedback focuses on improving clarity by marking the GDN disablement as mandatory for Qwen3.5 and suggesting explicit branch-switching instructions in the installation guide to avoid known mask issues.
| export PYTHONPATH=$PYTHONPATH:<your_local_megatron_lm_path> |
| export MEGATRON_LM_PATH=<your_local_megatron_lm_path> |
| |
| # 5. To fall back to the transformers GatedDeltaNet implementation, you can disable Megatron GDN |
The comment describes setting USE_MCORE_GDN=0 as optional ('If you need to roll back...'), but the explanation in the Qwen3.5 section (line 267) indicates it is mandatory for the current NPU environment because MindSpeed hasn't adapted the native GDN kernel yet. Updating the comment to reflect this requirement would be clearer for users.
| # 5. To fall back to the transformers GatedDeltaNet implementation, you can disable Megatron GDN |
| # 5. Disable Megatron GDN (required for Qwen3.5 in the current NPU environment) |
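| 1. The MindSpeed training combination specified in the current NPU documentation is `Megatron-LM v0.15.3 + MindSpeed core_r0.15.3`. This version of `megatron-core` does not yet include the native GDN kernel `core.ssm.gated_delta_net`, which was only introduced in `0.16`. |
| 2. `ms-swift>=4.1.0` uses Megatron's native GDN by default (`USE_MCORE_GDN=1`), and this path requires `megatron-core>=0.16`. In the current NPU environment you therefore need to set `USE_MCORE_GDN=0` explicitly, which switches GDN back to the native transformers implementation wrapped by `mcore-bridge`. Combined with the Qwen3.5 FLA NPU patch built into ms-swift, this redirects `chunk_gated_delta_rule` to the MindSpeed Triton operator. |
| 3. The known cost of this fallback path: the transformers GDN does not support packing, and it does not support TP/CP for GDN. |
| 4. In addition, the transformers GDN has a known mask-path issue under the NPU + flash-attn combination: when `padding_free=False`, GDN reads the `attention_mask` already processed by the trainer instead of the `attention_mask_2d` it actually needs, which triggers an asynchronous `aclnnFlashAttentionScore` error. This issue has been fixed in the `qwen3_5_npu` branch of `mcore-bridge`, and NPU users need a version that includes the fix. |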
The text mentions that a fix for the mask issue is available in the qwen3_5_npu branch of mcore-bridge. However, the installation instructions in section 3 (line 215) default to the main branch. Users following the guide sequentially might encounter the aclnnFlashAttentionScore error. It would be helpful to explicitly mention that users need to switch to this branch (e.g., git checkout qwen3_5_npu) during installation.
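The branch switch suggested in this comment could look roughly like the sketch below. It assumes `mcore-bridge` was already cloned locally; the function name `switch_mcore_bridge_branch` and the `<your_local_mcore_bridge_path>` placeholder are hypothetical and do not come from the guide.

```shell
# Hypothetical helper: switch a local mcore-bridge checkout to the branch
# that carries the mask fix. Nothing here is part of ms-swift or mcore-bridge.
#   $1: path to the local clone (assumed to exist already)
#   $2: branch name, defaulting to qwen3_5_npu as named in the PR text
switch_mcore_bridge_branch() {
  repo_path="$1"
  branch="${2:-qwen3_5_npu}"
  # -q keeps git quiet so that only the branch name is printed on success.
  git -C "$repo_path" checkout -q "$branch" || return 1
  # Print the branch that is now checked out, as a quick confirmation.
  git -C "$repo_path" rev-parse --abbrev-ref HEAD
}
```

Usage would be `switch_mcore_bridge_branch <your_local_mcore_bridge_path>`. If the clone predates the branch, run `git fetch origin` in it first, and reinstall the package the same way the installation step of the guide does.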
| export MEGATRON_LM_PATH=<your_local_megatron_lm_path> |
| |
| # 5. To fall back to the transformers GatedDeltaNet implementation, you can disable Megatron GDN |
| export USE_MCORE_GDN=0 |
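| 5. If MindSpeed later provides a `core_r0.16.x` adaptation branch, both constraints above (the explicit `USE_MCORE_GDN=0` setting and the limited functionality of the transformers GDN) can be lifted together. |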
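This could be tightened up. Please state the following points explicitly:
- The currently supported Megatron version is 0.15.3, which runs on the native transformers implementation underneath `mcore-bridge`
  a. TP/CP for GDN is not supported
  b. The attention mask issue
- Once Ascend NPU support moves to 0.16, the environment variable no longer needs to be set and Megatron's native GDN can be used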
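The version rule summarized above can be sketched as a small pre-flight check. This is only an illustration: `gdn_preflight` is a hypothetical helper, not part of ms-swift or MindSpeed, and it assumes the caller passes in the `megatron-core` version of the local Megatron-LM checkout by hand.

```shell
# Hypothetical pre-flight check; the function name is illustrative only.
#   $1: megatron-core version string, e.g. 0.15.3
# Per the PR text, USE_MCORE_GDN defaults to 1 in ms-swift>=4.1.0 and the
# native GDN path needs megatron-core>=0.16.
gdn_preflight() {
  gp_version="$1"
  gp_major="${gp_version%%.*}"   # text before the first dot
  gp_rest="${gp_version#*.}"     # text after the first dot
  gp_minor="${gp_rest%%.*}"      # second version component
  gp_flag="${USE_MCORE_GDN:-1}"  # treat unset as the documented default
  if [ "$gp_flag" != "0" ] && [ "$gp_major" -eq 0 ] && [ "$gp_minor" -lt 16 ]; then
    echo "error: megatron-core $gp_version lacks native GDN; export USE_MCORE_GDN=0" >&2
    return 1
  fi
  echo "ok: USE_MCORE_GDN=$gp_flag with megatron-core $gp_version"
}
```

With the current `v0.15.3` combination, `gdn_preflight 0.15.3` fails unless `USE_MCORE_GDN=0` is exported. With a `0.16` version it passes either way, which matches the reviewer's point that the variable can be dropped after the upgrade.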
PR type
PR information
Add USE_MCORE_GDN=0 for Qwen 3.5 practice. The core reason is that MindSpeed has not yet completed the adaptation for GDN.