npu qwen3.5 megatron padding_free fix #9196
Conversation
Code Review
This pull request introduces NPU-specific attention mask generation in the Megatron trainer. The reviewer identified several critical issues in the implementation, including a potential IndexError from dimension assumptions, incorrect bitwise logic for mask calculation, and the removal of necessary cleanup for non-NPU cases. A code suggestion was provided to correctly handle 2D and 4D masks while ensuring the attention_mask_2d key is properly managed across all execution paths.
```python
if self._should_use_npu_generated_attention_mask(args):
    if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
    batch['attention_mask'] = None
```
The implementation of `attention_mask_2d` generation has a few critical issues:

- Runtime Error (IndexError): `sum(dim=(1, 2))` assumes the input `attention_mask` is at least 3D (likely 4D). However, in most `swift` configurations, the dataloader provides a 2D `attention_mask` of shape `[batch_size, seq_len]`. This will cause a crash on NPU when `padding_free=False`.
- Incorrect Boolean Logic: Using the bitwise NOT operator `~` on integer tensors (standard for `swift` masks where 1 = valid, 0 = padding) results in negative values (`~1` is `-2`, `~0` is `-1`). The subsequent `> 0` check will always return `False`, effectively masking the entire sequence (see the short illustration below).
- Missing Cleanup: The original code always popped `attention_mask_2d` to avoid passing unexpected arguments to Megatron models. This cleanup is now missing for all non-NPU or `padding_free=True` cases, which might cause compatibility issues with models that don't expect this extra key in the batch.
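A quick PyTorch illustration of the boolean-logic point (a generic snippet, not code from this PR): bitwise NOT on an integer 0/1 mask yields only negative values, so the `> 0` check can never be true, whereas a bool mask inverts as expected.

```python
import torch

int_mask = torch.tensor([[1, 1, 0]])  # swift-style mask: 1 = valid token, 0 = padding
print(~int_mask)                      # tensor([[-2, -2, -1]]) -- bitwise NOT on integers
print((~int_mask) > 0)                # tensor([[False, False, False]]) -- the check never fires
print(~int_mask.bool())               # tensor([[False, False,  True]]) -- logical NOT once cast to bool
```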
Suggested improvement to handle both 2D/4D masks and maintain cleanup:
```diff
 if self._should_use_npu_generated_attention_mask(args):
     if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
-        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
+        mask = batch['attention_mask']
+        if mask.dim() == 4:
+            batch['attention_mask_2d'] = mask.bool().any(dim=1).any(dim=1)
+        else:
+            batch['attention_mask_2d'] = mask.bool()
     batch['attention_mask'] = None
+else:
+    batch.pop('attention_mask_2d', None)
```
Pull request overview
This PR targets an NPU-specific masking issue for Qwen3.5/Qwen3-Next (GDN) models in Megatron training when padding_free=False, by preserving a 2D attention mask through the trainer batch preparation path so downstream wrappers can consume an NPU-compatible mask.
Changes:
- Stop unconditionally dropping `attention_mask_2d` in the Megatron trainer `_prepare_batch` path.
- Add an NPU/flash-attn conditional branch to preserve (and, if missing, derive) `attention_mask_2d`, and to null out `attention_mask` for that path.
```python
def _should_use_npu_generated_attention_mask(self, args) -> bool:
    return (
        is_torch_npu_available()
        and args.task_type == 'causal_lm'
        and not args.padding_free
        and getattr(args, 'attention_backend', None) != 'local'
        and getattr(args, 'use_flash_attn', False)
    )
```
```python
text_position_ids = batch.get('position_ids')
if self._should_use_npu_generated_attention_mask(args):
    if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
```
```python
batch.pop('attention_mask_2d', None)
if text_position_ids is None:
    text_position_ids = batch.get('position_ids')
if self._should_use_npu_generated_attention_mask(args):
```
/gemini review
Code Review
This pull request introduces NPU-specific logic for generating 2D attention masks within the Megatron trainer's batch preparation. While the changes aim to optimize attention handling for NPU environments, several critical issues were identified: the new attention_mask_2d field is not integrated with Context Parallelism logic, setting the standard attention_mask to None will cause a crash in the get_last_tokens utility, and the mask reduction logic assumes a 4D tensor structure without validation, which may lead to dimension errors.
```python
text_position_ids = batch.get('position_ids')
if self._should_use_npu_generated_attention_mask(args):
    if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
```
The newly introduced attention_mask_2d field is not handled by get_batch_on_this_cp_rank in swift/megatron/trainers/utils.py. When Context Parallelism is enabled (args.context_parallel_size > 1), this 2D mask will remain at its full sequence length on all ranks instead of being sliced along the sequence dimension. This will lead to shape mismatches or incorrect attention results on NPU. Please ensure attention_mask_2d is added to the list of keys to be split in get_batch_on_this_cp_rank.
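As a rough sketch of what such slicing would involve (a generic example assuming Megatron's standard load-balanced CP layout, not the actual `get_batch_on_this_cp_rank` code), a 2D mask would be chunked along the sequence dimension the same way as other per-token tensors:

```python
import torch

def slice_mask_2d_for_cp_rank(mask_2d: torch.Tensor, cp_size: int, cp_rank: int) -> torch.Tensor:
    """Slice a [batch, seq] mask for one context-parallel rank.

    Assumes the load-balanced layout: the sequence is split into 2 * cp_size
    chunks and rank r keeps chunks r and (2 * cp_size - 1 - r).
    """
    if cp_size == 1:
        return mask_2d
    chunks = mask_2d.chunk(2 * cp_size, dim=1)
    return torch.cat([chunks[cp_rank], chunks[2 * cp_size - 1 - cp_rank]], dim=1)
```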
```python
if self._should_use_npu_generated_attention_mask(args):
    if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
    batch['attention_mask'] = None
```
Setting batch['attention_mask'] to None will cause a TypeError in get_last_tokens (line 1002) because it attempts to index the mask: (~attention_mask[:, 0, -1]). Since get_last_tokens is a standard utility used during training and evaluation steps to log metrics, this change will likely cause the process to crash. Consider keeping the original mask for trainer utilities or updating get_last_tokens to handle cases where attention_mask is None by falling back to attention_mask_2d.
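One hedged way to implement the suggested fallback (a hypothetical helper; the actual `get_last_tokens` signature and mask semantics in the repo may differ):

```python
from typing import Optional
import torch

def valid_token_mask(batch: dict) -> Optional[torch.Tensor]:
    """Recover a [batch, seq] valid-token mask, preferring the 4D mask when present.

    In a Megatron-style 4D mask, True means "masked out", so the last query row,
    inverted, marks the key positions that hold real tokens. The 2D fallback mask
    already uses True/1 for valid tokens.
    """
    attention_mask = batch.get('attention_mask')
    if attention_mask is not None:
        return ~attention_mask[:, 0, -1]
    mask_2d = batch.get('attention_mask_2d')
    if mask_2d is not None:
        return mask_2d.bool()
    return None
```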
```python
text_position_ids = batch.get('position_ids')
if self._should_use_npu_generated_attention_mask(args):
    if 'attention_mask_2d' not in batch and batch.get('attention_mask') is not None:
        batch['attention_mask_2d'] = (~batch['attention_mask']).sum(dim=(1, 2)) > 0
```
The logic (~batch['attention_mask']).sum(dim=(1, 2)) assumes that attention_mask is a 4D tensor (typically [B, 1, S, S] in Megatron). If the input mask is 2D or 3D, this operation will fail with a Dimension out of range error. It is safer to check the number of dimensions before performing the reduction.
PR type
PR information
fix(megatron): preserve `attention_mask_2d` for NPU GDN models when `padding_free=False`

Problem
On NPU, Qwen3.5 / Qwen3-Next can run with `padding_free=True` but fail with `padding_free=False`. The failure usually surfaces around `masked_select`, while the actual async error comes from **aclnnFlashAttentionScore**.

Cause
For GDN models:

- `padding_free=True` goes through the thd path and derives the mask from `cu_seqlens_q` (see the short sketch after this list)
- `padding_free=False` goes through the non-thd path and reads an external mask from kwargs

The trainer side did not preserve the NPU-compatible `attention_mask_2d` end-to-end, so the non-thd branch consumed an incompatible mask. This is why `padding_free=True` only bypassed the issue instead of fixing it.
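For illustration only (generic PyTorch, not code from this PR): in the packed thd layout, per-sequence lengths and a padding mask can be recovered from `cu_seqlens_q`, which is why that path does not need an external mask.

```python
import torch

cu_seqlens_q = torch.tensor([0, 3, 7, 12])       # packed boundaries of 3 sequences
seqlens = cu_seqlens_q[1:] - cu_seqlens_q[:-1]   # tensor([3, 4, 5])

max_len = int(seqlens.max())
# [3, 5] bool mask, True = valid token, reconstructed purely from cu_seqlens_q
mask_2d = torch.arange(max_len)[None, :] < seqlens[:, None]
```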
Fix

This PR preserves `attention_mask_2d` in the Megatron trainer batch path for the NPU flash-attn case, so downstream model wrappers can still access the correct 2D mask when `padding_free=False`.

The complete fix also includes a companion `mcore-bridge` update where GDN non-thd prefers `attention_mask_2d` on NPU (a hypothetical sketch follows).
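A minimal sketch of what "prefer `attention_mask_2d` on NPU" could look like in a GDN non-thd wrapper; the function name and structure are assumptions, not the actual `mcore-bridge` code.

```python
from typing import Optional
import torch
from transformers.utils import is_torch_npu_available

def pick_attention_mask(kwargs: dict) -> Optional[torch.Tensor]:
    """On NPU, hand the 2D mask to the non-thd GDN path; otherwise keep the usual mask."""
    mask_2d = kwargs.get('attention_mask_2d')
    if is_torch_npu_available() and mask_2d is not None:
        return mask_2d
    return kwargs.get('attention_mask')
```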
Impact

The issue is limited to models with HF-style GDN components such as Qwen3.5 / Qwen3-Next.
Pure Megatron attention paths are not affected.