[NPU Feature] Qwen3.5 NPU FLA and fused-operator patches by ys2025-AI · Pull Request #204 · modelscope/twinkle

ys2025-AI · 2026-05-26T08:41:31Z

PR type

Bug Fix
New Feature
Document Updates
More Models or Datasets Support

PR information

Description

This PR ports the Qwen3.5 FLA and Ascend NPU fused-operator patches.

Motivation

Qwen3.5 introduces hybrid attention layers (linear_attention + full_attention). The linear_attention path relies on chunk_gated_delta_rule from the flash-linear-attention (FLA) library, which contains CUDA-only Triton kernels. On Ascend NPU, these kernels must be redirected to MindSpeed Triton implementations to achieve comparable performance.

Without this patch, Qwen3.5 falls back to the pure PyTorch torch_chunk_gated_delta_rule, resulting in ~33% slower training on NPU.

Main changes

| File | Change |
| --- | --- |
| `twinkle/kernel/chunk_gated_delta_rule.py` | New. MindSpeed Triton wrapper for `chunk_gated_delta_rule`. Re-exports the public API with a signature identical to the FLA library's. |
| `twinkle/kernel/monkey_patch_npu.py` | Extended. Adds `_patch_qwen3_5_fla()`, which: (1) spoofs `transformers.utils.is_flash_linear_attention_available` to bypass CUDA-only checks; (2) replaces the module-level `chunk_gated_delta_rule` with the MindSpeed implementation; (3) traverses instantiated model layers to re-bind the per-instance `chunk_gated_delta_rule` (required because Qwen3.5 caches the function reference at `__init__` time). |
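
The three steps listed for `_patch_qwen3_5_fla()` can be illustrated with a minimal, dependency-free sketch. This is not the code from the PR: `fake_utils`, `fake_modeling`, `LinearAttentionLayer`, `ToyModel`, and `patch_fla` are hypothetical stand-ins for the transformers utility module, the Qwen3.5 modeling module, and the real layers. It only shows why step (3) is needed when a layer stores the function reference in `__init__`.

```python
import types

# Hypothetical stand-ins for the real transformers modules (assumption).
fake_utils = types.SimpleNamespace(is_flash_linear_attention_available=lambda: False)
fake_modeling = types.SimpleNamespace(chunk_gated_delta_rule=None)


def torch_chunk_gated_delta_rule(*args, **kwargs):
    return "torch-fallback"


def npu_chunk_gated_delta_rule(*args, **kwargs):
    return "mindspeed-triton"


class LinearAttentionLayer:
    """Toy layer that caches the kernel reference at __init__ time."""

    def __init__(self):
        self.chunk_gated_delta_rule = (
            fake_modeling.chunk_gated_delta_rule or torch_chunk_gated_delta_rule
        )


class ToyModel:
    def __init__(self, n_layers):
        self.layers = [LinearAttentionLayer() for _ in range(n_layers)]

    def named_modules(self):
        for i, layer in enumerate(self.layers):
            yield f"layers.{i}", layer


def patch_fla(utils_mod, modeling_mod, model=None):
    # (1) Report FLA as available so CUDA-only availability checks pass.
    utils_mod.is_flash_linear_attention_available = lambda: True
    # (2) Replace the module-level function used by layers built later.
    modeling_mod.chunk_gated_delta_rule = npu_chunk_gated_delta_rule
    # (3) Re-bind the reference already cached on existing layer instances.
    patched = 0
    if model is not None:
        for _name, module in model.named_modules():
            if hasattr(module, "chunk_gated_delta_rule"):
                module.chunk_gated_delta_rule = npu_chunk_gated_delta_rule
                patched += 1
    return patched


model = ToyModel(n_layers=3)         # built before patching, holds the fallback
patched_count = patch_fla(fake_utils, fake_modeling, model)
late_layer = LinearAttentionLayer()  # built after patching, picks up step (2)
```

In this toy setup, steps (1) and (2) alone would leave the three existing layers on the fallback, since each already holds its own reference.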

Environment Variables

All FLA behavior is gated under the existing TWINKLE_NPU_PATCH hierarchy:

# Master switch for all NPU optimizations.
export TWINKLE_NPU_PATCH=1

# Enable fused operators (RMSNorm, RoPE, SwiGLU, SDPA).
export TWINKLE_NPU_FUSED_OPS=1

# Enable MoE Grouped MatMul.
export TWINKLE_NPU_MOE_PATCH=1

# Enable Flash Linear Attention for Qwen3.5.
# Default: 1 (enabled). Set to 0 to force torch fallback.
export TWINKLE_NPU_FLA=1
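
One way to read the hierarchy above is that the master switch gates every sub-flag. The sketch below assumes exactly that. Only the variable names and the `TWINKLE_NPU_FLA` default of 1 come from this PR; the helper names `_flag` and `resolve_npu_flags`, the accepted "off" spellings, and the defaults for the other three variables are assumptions made for illustration.

```python
import os


def _flag(env, name, default):
    """Interpret an environment value as a boolean (parsing rules are assumed)."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in ("0", "false", "off", "")


def resolve_npu_flags(env=None):
    """Return the effective toggles, with TWINKLE_NPU_PATCH gating the rest."""
    env = os.environ if env is None else env
    master = _flag(env, "TWINKLE_NPU_PATCH", default=False)  # default assumed
    return {
        "patch": master,
        "fused_ops": master and _flag(env, "TWINKLE_NPU_FUSED_OPS", default=True),  # default assumed
        "moe": master and _flag(env, "TWINKLE_NPU_MOE_PATCH", default=True),  # default assumed
        "fla": master and _flag(env, "TWINKLE_NPU_FLA", default=True),  # default stated in the PR
    }


# Example: master switch on, FLA explicitly disabled to force the torch fallback.
flags = resolve_npu_flags({"TWINKLE_NPU_PATCH": "1", "TWINKLE_NPU_FLA": "0"})
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test without touching the process environment.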

Related: modelscope/ms-swift#9223

Experiment results

Model: Qwen3.5-35B-A3B (40 layers, 30× linear_attention)
Hardware: Atlas 900 A3 (2 x NPU)
Dataset: GSM8K_ZH
Finetuning type: LoRA
Software: CANN 8.5.0 + torch/torch_npu 2.9.0 + MindSpeed 0.12.1 + triton-ascend 3.2.0 + transformers 5.9

| Metric | Baseline (FLA OFF) | FLA ON (MindSpeed) | Delta |
| --- | --- | --- | --- |
| Avg. duration per 10-step interval | 57.7 s | 43.8 s | −24.1% |
| Avg. Loss | — | — | −0.0005 (identical) |
| Avg. Grad Norm | — | — | −0.024 (within noise) |
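
As a quick check of the duration row, the same two measurements can be expressed three ways. The snippet below uses only the 57.7 s and 43.8 s figures reported above; the variable names are illustrative.

```python
baseline_s = 57.7  # avg. duration per 10-step interval, FLA OFF
fla_on_s = 43.8    # avg. duration per 10-step interval, FLA ON

# Time saved relative to the baseline (the "Delta" column).
delta_pct = (fla_on_s - baseline_s) / baseline_s * 100
# How much slower the baseline is relative to the FLA-enabled run.
slowdown_pct = (baseline_s - fla_on_s) / fla_on_s * 100
# Throughput ratio between the two runs.
speedup = baseline_s / fla_on_s

print(f"delta: {delta_pct:.1f}%  baseline slowdown: {slowdown_pct:.1f}%  speedup: {speedup:.2f}x")
```

Measured against the FLA-enabled run, the baseline comes out roughly 31.7% slower, which is in the same range as the ~33% figure quoted in the Motivation. The −24.1% delta is the same gap measured against the baseline instead.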

- Add kernelize_model integration to ep_fsdp2_lora and fsdp2 examples
- Support model parameter in apply_npu_patch for FLA instance patching
- Implement NPU-accelerated packed MoE experts with weight caching
- Add Qwen3.5 SparseMoeBlock forward with dual Transformers version support
- Support partial RoPE and gated RMSNorm with FP32 mode option
- Add MindSpeed Triton FLA backend integration for Qwen3.5
- Add environment variable controls for patch toggles
- Add dynamic model discovery for unknown model families

gemini-code-assist

Code Review

This pull request introduces Ascend NPU optimizations and monkey patches for Qwen3.5 and Qwen3.5-MoE models, including Flash Linear Attention (FLA) via MindSpeed Triton kernels, packed MoE experts, and dynamic model discovery. The review feedback highlights two critical issues in the patching logic: caching expert weights when gradients are required can break the PyTorch autograd graph, and calling model.named_modules() directly on a TransformersModel wrapper will raise an AttributeError during training.

gemini-code-assist Bot reviewed May 26, 2026

Comment thread src/twinkle/kernel/monkey_patch_npu.py

Comment thread src/twinkle/kernel/monkey_patch_npu.py

[Fix] Improve NPU expert weight caching and FLA model resolution

47fc871

- Skip weight cache when requires_grad=True to preserve autograd graph
- Resolve underlying PyTorch model from TransformersModel wrapper in FLA patch

tastelikefeet approved these changes May 26, 2026

tastelikefeet merged commit 03b86a1 into modelscope:main May 26, 2026
1 of 3 checks passed

tastelikefeet merged 2 commits into modelscope:main from ys2025-AI:main
