npu qwen3.5 megatron padding_free fix #9196
Conversation
Code Review
This pull request introduces NPU-specific attention mask generation in the Megatron trainer. The reviewer identified several critical issues in the implementation, including a potential IndexError from dimension assumptions, incorrect bitwise logic for mask calculation, and the removal of necessary cleanup for non-NPU cases. A code suggestion was provided to correctly handle 2D and 4D masks while ensuring the attention_mask_2d key is properly managed across all execution paths.
```python
if self._should_use_npu_generated_attention_mask(args):
    if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
    batch['attention_mask'] = None
```
The implementation of `attention_mask_2d` generation has a few critical issues:

- Runtime Error (IndexError): `sum(dim=(1, 2))` assumes the input `attention_mask` is at least 3D (likely 4D). However, in most `swift` configurations, the dataloader provides a 2D `attention_mask` of shape `[batch_size, seq_len]`. This will cause a crash on NPU when `padding_free=False`.
- Incorrect Boolean Logic: Using the bitwise NOT operator `~` on integer tensors (standard for `swift` masks where 1 = valid, 0 = padding) results in negative values (`~1` is `-2`, `~0` is `-1`). The subsequent `> 0` check will always return `False`, effectively masking the entire sequence (see the short illustration below).
- Missing Cleanup: The original code always popped `attention_mask_2d` to avoid passing unexpected arguments to Megatron models. This cleanup is now missing for all non-NPU or `padding_free=True` cases, which might cause compatibility issues with models that don't expect this extra key in the batch.
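A quick PyTorch illustration of the boolean-logic point (a generic snippet, not code from this PR): bitwise NOT on an integer 0/1 mask yields only negative values, so the `> 0` check can never be true, whereas a bool mask inverts as expected.

```python
import torch

int_mask = torch.tensor([[1, 1, 0]])  # swift-style mask: 1 = valid token, 0 = padding
print(~int_mask)                      # tensor([[-2, -2, -1]]) -- bitwise NOT on integers
print((~int_mask) > 0)                # tensor([[False, False, False]]) -- the check never fires
print(~int_mask.bool())               # tensor([[False, False,  True]]) -- logical NOT once cast to bool
```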
Suggested improvement to handle both 2D/4D masks and maintain cleanup:
```diff
 if self._should_use_npu_generated_attention_mask(args):
     if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
-        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
+        mask = batch['attention_mask']
+        if mask.dim() == 4:
+            batch['attention_mask_2d'] = mask.bool().any(dim=1).any(dim=1)
+        else:
+            batch['attention_mask_2d'] = mask.bool()
     batch['attention_mask'] = None
+else:
+    batch.pop('attention_mask_2d', None)
```
Pull request overview
This PR targets an NPU-specific masking issue for Qwen3.5/Qwen3-Next (GDN) models in Megatron training when padding_free=False, by preserving a 2D attention mask through the trainer batch preparation path so downstream wrappers can consume an NPU-compatible mask.
Changes:
- Stop unconditionally dropping `attention_mask_2d` in the Megatron trainer `_prepare_batch` path.
- Add an NPU/flash-attn conditional branch to preserve (and, if missing, derive) `attention_mask_2d`, and to null out `attention_mask` for that path.
```python
def _should_use_npu_generated_attention_mask(self, args) -> bool:
    return (
        is_torch_npu_available()
        and args.task_type == 'causal_lm'
        and not args.padding_free
        and getattr(args, 'attention_backend', None) != 'local'
        and getattr(args, 'use_flash_attn', False)
    )
```
```python
text_position_ids = batch.get('position_ids')
if self._should_use_npu_generated_attention_mask(args):
    if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
```
```python
batch.pop('attention_mask_2d', None)
if text_position_ids is None:
    text_position_ids = batch.get('position_ids')
if self._should_use_npu_generated_attention_mask(args):
```
/gemini review
Code Review
This pull request introduces NPU-specific logic for generating 2D attention masks within the Megatron trainer's batch preparation. While the changes aim to optimize attention handling for NPU environments, several critical issues were identified: the new attention_mask_2d field is not integrated with Context Parallelism logic, setting the standard attention_mask to None will cause a crash in the get_last_tokens utility, and the mask reduction logic assumes a 4D tensor structure without validation, which may lead to dimension errors.
```python
text_position_ids = batch.get('position_ids')
if self._should_use_npu_generated_attention_mask(args):
    if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
```
The newly introduced attention_mask_2d field is not handled by get_batch_on_this_cp_rank in swift/megatron/trainers/utils.py. When Context Parallelism is enabled (args.context_parallel_size > 1), this 2D mask will remain at its full sequence length on all ranks instead of being sliced along the sequence dimension. This will lead to shape mismatches or incorrect attention results on NPU. Please ensure attention_mask_2d is added to the list of keys to be split in get_batch_on_this_cp_rank.
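As a rough sketch of what such slicing would involve (a generic example assuming Megatron's standard load-balanced CP layout, not the actual `get_batch_on_this_cp_rank` code), a 2D mask would be chunked along the sequence dimension the same way as other per-token tensors:

```python
import torch

def slice_mask_2d_for_cp_rank(mask_2d: torch.Tensor, cp_size: int, cp_rank: int) -> torch.Tensor:
    """Slice a [batch, seq] mask for one context-parallel rank.

    Assumes the load-balanced layout: the sequence is split into 2 * cp_size
    chunks and rank r keeps chunks r and (2 * cp_size - 1 - r).
    """
    if cp_size == 1:
        return mask_2d
    chunks = mask_2d.chunk(2 * cp_size, dim=1)
    return torch.cat([chunks[cp_rank], chunks[2 * cp_size - 1 - cp_rank]], dim=1)
```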
```python
if self._should_use_npu_generated_attention_mask(args):
    if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
    batch['attention_mask'] = None
```
Setting batch['attention_mask'] to None will cause a TypeError in get_last_tokens (line 1002) because it attempts to index the mask: (~attention_mask[:, 0, -1]). Since get_last_tokens is a standard utility used during training and evaluation steps to log metrics, this change will likely cause the process to crash. Consider keeping the original mask for trainer utilities or updating get_last_tokens to handle cases where attention_mask is None by falling back to attention_mask_2d.
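One hedged way to implement the suggested fallback (a hypothetical helper; the actual `get_last_tokens` signature and mask semantics in the repo may differ):

```python
from typing import Optional
import torch

def valid_token_mask(batch: dict) -> Optional[torch.Tensor]:
    """Recover a [batch, seq] valid-token mask, preferring the 4D mask when present.

    In a Megatron-style 4D mask, True means "masked out", so the last query row,
    inverted, marks the key positions that hold real tokens. The 2D fallback mask
    already uses True/1 for valid tokens.
    """
    attention_mask = batch.get('attention_mask')
    if attention_mask is not None:
        return ~attention_mask[:, 0, -1]
    mask_2d = batch.get('attention_mask_2d')
    if mask_2d is not None:
        return mask_2d.bool()
    return None
```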
```python
text_position_ids = batch.get('position_ids')
if self._should_use_npu_generated_attention_mask(args):
    if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
```
The logic (~batch['attention_mask']).sum(dim=(1, 2)) assumes that attention_mask is a 4D tensor (typically [B, 1, S, S] in Megatron). If the input mask is 2D or 3D, this operation will fail with a Dimension out of range error. It is safer to check the number of dimensions before performing the reduction.
PR type
PR information
fix(megatron): preserve `attention_mask_2d` for NPU GDN models when `padding_free=False`

Problem
On NPU, Qwen3.5 / Qwen3-Next can run with `padding_free=True` but fail with `padding_free=False`. The failure usually surfaces around `masked_select`, while the actual async error comes from **aclnnFlashAttentionScore**.

Cause
For GDN models:

- `padding_free=True` goes through the thd path and derives the mask from `cu_seqlens_q` (see the short sketch after this list)
- `padding_free=False` goes through the non-thd path and reads an external mask from kwargs

The trainer side did not preserve the NPU-compatible `attention_mask_2d` end-to-end, so the non-thd branch consumed an incompatible mask. This is why `padding_free=True` only bypassed the issue instead of fixing it.
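For illustration only (generic PyTorch, not code from this PR): in the packed thd layout, per-sequence lengths and a padding mask can be recovered from `cu_seqlens_q`, which is why that path does not need an external mask.

```python
import torch

cu_seqlens_q = torch.tensor([0, 3, 7, 12])       # packed boundaries of 3 sequences
seqlens = cu_seqlens_q[1:] - cu_seqlens_q[:-1]   # tensor([3, 4, 5])

max_len = int(seqlens.max())
# [3, 5] bool mask, True = valid token, reconstructed purely from cu_seqlens_q
mask_2d = torch.arange(max_len)[None, :] < seqlens[:, None]
```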
Fix

This PR preserves `attention_mask_2d` in the Megatron trainer batch path for the NPU flash-attn case, so downstream model wrappers can still access the correct 2D mask when `padding_free=False`.

The complete fix also includes a companion `mcore-bridge` update where GDN non-thd prefers `attention_mask_2d` on NPU (a hypothetical sketch follows).
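A minimal sketch of what "prefer `attention_mask_2d` on NPU" could look like in a GDN non-thd wrapper; the function name and structure are assumptions, not the actual `mcore-bridge` code.

```python
from typing import Optional
import torch
from transformers.utils import is_torch_npu_available

def pick_attention_mask(kwargs: dict) -> Optional[torch.Tensor]:
    """On NPU, hand the 2D mask to the non-thd GDN path; otherwise keep the usual mask."""
    mask_2d = kwargs.get('attention_mask_2d')
    if is_torch_npu_available() and mask_2d is not None:
        return mask_2d
    return kwargs.get('attention_mask')
```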
Impact

The issue is limited to models with HF-style GDN components such as Qwen3.5 / Qwen3-Next.
Pure Megatron attention paths are not affected.