
npu qwen3.5 megatron padding_free fix #9196

Merged
Jintao-Huang merged 6 commits into modelscope:main from addsubmuldiv:qwen3_5_npu on Apr 24, 2026

Conversation

@addsubmuldiv
Collaborator

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

fix(megatron): preserve attention_mask_2d for NPU GDN models when padding_free=False

Problem

On NPU, Qwen3.5 / Qwen3-Next can run with padding_free=True but fail with padding_free=False.
The failure usually surfaces around masked_select, while the actual async error comes from **aclnnFlashAttentionScore**.

Cause

For GDN models:

  • padding_free=True goes through the thd path and derives its mask from cu_seqlens_q
  • padding_free=False goes through the non-thd path and reads the external mask from kwargs

The trainer side did not preserve the NPU-compatible attention_mask_2d end-to-end, so the non-thd branch consumed an incompatible mask. This is why padding_free=True only bypassed the issue instead of fixing it.
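
For intuition only (this is illustrative, not the swift or mcore-bridge code): a thd path can reconstruct sequence membership, and from it a block-diagonal validity mask, purely from cu_seqlens_q, which is why it never needs the external mask:

```python
import torch

cu_seqlens_q = torch.tensor([0, 3, 7, 12])  # packed-sequence boundaries (thd layout)
lengths = cu_seqlens_q.diff()               # tensor([3, 4, 5])
seq_ids = torch.repeat_interleave(torch.arange(len(lengths)), lengths)
# seq_ids == tensor([0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2])
block_mask = seq_ids[:, None] == seq_ids[None, :]  # True only within a sequence
```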

Fix

This PR preserves attention_mask_2d in the Megatron trainer batch path for the NPU flash-attn case, so downstream model wrappers can still access the correct 2D mask when padding_free=False.

The complete fix also includes a companion mcore-bridge update where GDN non-thd prefers attention_mask_2d on NPU.
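
Concretely (simplified from the diff quoted in the reviews below; only the comments are added here), the trainer-side change looks like this:

```python
# swift/megatron/trainers/base.py, as quoted in the review comments
if self._should_use_npu_generated_attention_mask(args):
    # Preserve (or derive) the 2D mask so downstream GDN wrappers can
    # consume it on the non-thd path when padding_free=False.
    if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
    # Null out the regular mask; per the helper's name, NPU flash attention
    # generates its own mask on this path.
    batch['attention_mask'] = None
```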

Impact

The issue is limited to models with HF-style GDN components such as Qwen3.5 / Qwen3-Next.
Pure Megatron attention paths are not affected.

Copilot AI review requested due to automatic review settings on April 23, 2026 09:58
Contributor

@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces NPU-specific attention mask generation in the Megatron trainer. The reviewer identified several critical issues in the implementation, including a potential IndexError from dimension assumptions, incorrect bitwise logic for mask calculation, and the removal of necessary cleanup for non-NPU cases. A code suggestion was provided to correctly handle 2D and 4D masks while ensuring the attention_mask_2d key is properly managed across all execution paths.

Comment on lines +968 to +971
```python
if self._should_use_npu_generated_attention_mask(args):
    if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
    batch['attention_mask'] = None
```
Contributor


critical

The implementation of attention_mask_2d generation has a few critical issues:

  1. Runtime Error (IndexError): sum(dim=(1, 2)) assumes the input attention_mask is at least 3D (likely 4D). However, in most swift configurations, the dataloader provides a 2D attention_mask of shape [batch_size, seq_len]. This will cause a crash on NPU when padding_free=False.
  2. Incorrect Boolean Logic: Using the bitwise NOT operator ~ on integer tensors (standard for swift masks where 1=valid, 0=padding) results in negative values (~1 is -2, ~0 is -1). The subsequent > 0 check will always return False, effectively masking the entire sequence (see the snippet after this list).
  3. Missing Cleanup: The original code always popped attention_mask_2d to avoid passing unexpected arguments to Megatron models. This cleanup is now missing for all non-NPU or padding_free=True cases, which might cause compatibility issues with models that don't expect this extra key in the batch.
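
Gemini's second point is easy to verify (this assumes an integer 0/1 mask, as the comment does; on a bool tensor, ~ acts as logical NOT and the problem does not arise):

```python
import torch

m = torch.tensor([[1, 1, 0]])  # 1 = valid token, 0 = padding (integer dtype)
print(~m)                      # tensor([[-2, -2, -1]]) -- bitwise NOT, not logical NOT
print((~m) > 0)                # tensor([[False, False, False]]) -- every token masked out
print(m.bool())                # tensor([[True, True, False]]) -- intended semantics
```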

Suggested improvement to handle both 2D/4D masks and maintain cleanup:

```diff
 if self._should_use_npu_generated_attention_mask(args):
     if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
-        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
+        mask = batch['attention_mask']
+        if mask.dim() == 4:
+            batch['attention_mask_2d'] = mask.bool().any(dim=1).any(dim=1)
+        else:
+            batch['attention_mask_2d'] = mask.bool()
     batch['attention_mask'] = None
+else:
+    batch.pop('attention_mask_2d', None)
```


Copilot AI left a comment


Pull request overview

This PR targets an NPU-specific masking issue for Qwen3.5/Qwen3-Next (GDN) models in Megatron training when padding_free=False, by preserving a 2D attention mask through the trainer batch preparation path so downstream wrappers can consume an NPU-compatible mask.

Changes:

  • Stop unconditionally dropping attention_mask_2d in the Megatron trainer _prepare_batch path.
  • Add an NPU/flash-attn conditional branch to preserve (and, if missing, derive) attention_mask_2d, and to null out attention_mask for that path.


Comment thread: swift/megatron/trainers/base.py (Outdated)
Comment on lines +950 to +958
```diff
+def _should_use_npu_generated_attention_mask(self, args) -> bool:
+    return (
+        is_torch_npu_available()
+        and args.task_type == 'causal_lm'
+        and not args.padding_free
+        and getattr(args, 'attention_backend', None) != 'local'
+        and getattr(args, 'use_flash_attn', False)
+    )
+
 text_position_ids = batch.get('position_ids')
+if self._should_use_npu_generated_attention_mask(args):
+    if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
+        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
-batch.pop('attention_mask_2d', None)
 if text_position_ids is None:
     text_position_ids = batch.get('position_ids')
```
Collaborator


Please restore the previous behavior in the else branch.
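
Presumably (mirroring the cleanup line visible in the quoted diff above and in Gemini's earlier suggestion) the requested shape is:

```python
if self._should_use_npu_generated_attention_mask(args):
    if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
    batch['attention_mask'] = None
else:
    # previous behavior: strip the extra key before it reaches Megatron models
    batch.pop('attention_mask_2d', None)
```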

@Jintao-Huang
Collaborator

/gemini review

Contributor

@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces NPU-specific logic for generating 2D attention masks within the Megatron trainer's batch preparation. While the changes aim to optimize attention handling for NPU environments, several critical issues were identified: the new attention_mask_2d field is not integrated with Context Parallelism logic, setting the standard attention_mask to None will cause a crash in the get_last_tokens utility, and the mask reduction logic assumes a 4D tensor structure without validation, which may lead to dimension errors.

```python
text_position_ids = batch.get('position_ids')
if self._should_use_npu_generated_attention_mask(args):
    if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
```
Contributor


high

The newly introduced attention_mask_2d field is not handled by get_batch_on_this_cp_rank in swift/megatron/trainers/utils.py. When Context Parallelism is enabled (args.context_parallel_size > 1), this 2D mask will remain at its full sequence length on all ranks instead of being sliced along the sequence dimension. This will lead to shape mismatches or incorrect attention results on NPU. Please ensure attention_mask_2d is added to the list of keys to be split in get_batch_on_this_cp_rank.
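
A hypothetical sketch of what that split could look like for a [batch, seq] mask, assuming Megatron's usual two-chunks-per-rank CP load balancing (slice_2d_mask_for_cp is an invented name, not swift's API, and seq must be divisible by 2 * cp_size):

```python
import torch

def slice_2d_mask_for_cp(mask_2d: torch.Tensor, cp_size: int, cp_rank: int) -> torch.Tensor:
    # Megatron balances causal-attention load by giving each CP rank two
    # mirrored chunks: chunk cp_rank and chunk (2 * cp_size - 1 - cp_rank).
    b, s = mask_2d.shape
    chunks = mask_2d.view(b, 2 * cp_size, s // (2 * cp_size))
    idx = torch.tensor([cp_rank, 2 * cp_size - 1 - cp_rank], device=mask_2d.device)
    return chunks.index_select(1, idx).reshape(b, -1)
```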

```python
if self._should_use_npu_generated_attention_mask(args):
    if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
    batch['attention_mask'] = None
```
Contributor


high

Setting batch['attention_mask'] to None will cause a TypeError in get_last_tokens (line 1002) because it attempts to index the mask: (~attention_mask[:, 0, -1]). Since get_last_tokens is a standard utility used during training and evaluation steps to log metrics, this change will likely cause the process to crash. Consider keeping the original mask for trainer utilities or updating get_last_tokens to handle cases where attention_mask is None by falling back to attention_mask_2d.
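
One possible shape for the suggested fallback (hand-written here, not the actual get_last_tokens source; the [:, 0, -1] indexing comes from the comment above):

```python
def last_token_validity(attention_mask, attention_mask_2d):
    # Original path: reduce the 4D [B, 1, S, S] mask as the utility does today.
    if attention_mask is not None:
        return ~attention_mask[:, 0, -1]
    # NPU path: attention_mask was nulled out; fall back to the preserved
    # 2D mask, assuming True already marks valid tokens there.
    if attention_mask_2d is not None:
        return attention_mask_2d
    raise ValueError('neither attention_mask nor attention_mask_2d is available')
```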

```python
text_position_ids = batch.get('position_ids')
if self._should_use_npu_generated_attention_mask(args):
    if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
```
Contributor


medium

The logic (~batch['attention_mask']).sum(dim=(1, 2)) assumes that attention_mask is a 4D tensor (typically [B, 1, S, S] in Megatron). If the input mask is 2D or 3D, this operation will fail with a Dimension out of range error. It is safer to check the number of dimensions before performing the reduction.
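
A defensive variant of that reduction, along the lines already suggested in the first review pass (dtype and layout conventions assumed):

```python
mask = batch['attention_mask']
if mask.dim() == 4:    # Megatron-style [B, 1, S, S]
    batch['attention_mask_2d'] = mask.bool().any(dim=1).any(dim=1)
elif mask.dim() == 2:  # dataloader-style [B, S]
    batch['attention_mask_2d'] = mask.bool()
else:
    raise ValueError(f'unexpected attention_mask rank: {mask.dim()}')
```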

@Jintao-Huang merged commit 0967a19 into modelscope:main on Apr 24, 2026
1 of 3 checks passed