Skip to content

[model] support qwen3_next gdn#76

Merged
Jintao-Huang merged 5 commits into
modelscope:mainfrom
Jintao-Huang:support_qwen3_next_gdn
May 14, 2026
Merged

[model] support qwen3_next gdn#76
Jintao-Huang merged 5 commits into
modelscope:mainfrom
Jintao-Huang:support_qwen3_next_gdn

Conversation

@Jintao-Huang
Copy link
Copy Markdown
Collaborator

No description provided.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces Gated Delta Network (GDN) support for Qwen3 models by adding Qwen3NextGDNBridge and Qwen3NextLoader, and refactoring qwen3_5_gdn to inherit from these new components. Review feedback suggests removing the redundant Qwen3NextGatedDeltaNet subclass in favor of using GatedDeltaNet directly and centralizing the USE_MCORE_GDN environment variable retrieval to reduce code duplication across multiple files.

Comment thread src/mcore_bridge/model/gpts/qwen3_next_gdn.py Outdated
Comment thread src/mcore_bridge/model/gpts/qwen3_next.py
@Jintao-Huang
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces Gated Delta Net (GDN) support for Qwen3 models, including new bridge and loader implementations and a refactor of Qwen3_5 models to utilize shared GDN logic. Additionally, the codebase is updated to rename CustomTransformerBlock and CustomTransformerLayer to TransformerBlock and TransformerLayer for consistency. Review feedback identifies a bug in the TransformerLayer initialization where an incorrect super() call skips the parent class's constructor, and suggests using torch.equal for more robust tensor comparisons in the GDN bridge.

Comment thread src/mcore_bridge/model/modules/transformer_layer.py
Comment thread src/mcore_bridge/model/gpts/qwen3_next_gdn.py
@Jintao-Huang
Copy link
Copy Markdown
Collaborator Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for the Qwen3 Gated Delta Net (GDN) architecture, including new bridge and loader classes for both text and multimodal variants. It also refactors several core modules by renaming 'Custom' prefixed classes to more standard names and streamlining imports. Key feedback includes a critical bug in TransformerLayer where the use of super() with the wrong class name skips the parent's initialization logic. Additionally, it is recommended to replace runtime assert statements with explicit exception handling and to avoid hardcoding 'cuda' as a device to ensure better environment compatibility.

):
self.submodules_config = submodules
super(TransformerLayer, self).__init__(config=config, vp_stage=vp_stage)
super(McoreTransformerLayer, self).__init__(config=config, vp_stage=vp_stage)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The call super(McoreTransformerLayer, self).__init__ invokes the constructor of the grandparent class (the parent of McoreTransformerLayer), effectively skipping the initialization logic of McoreTransformerLayer. This is problematic because McoreTransformerLayer performs essential setup (such as setting self.config, self.layer_number, etc.). Furthermore, if the grandparent class (typically MegatronModule) does not accept the vp_stage keyword argument, this will result in a TypeError at runtime. If the intention is to initialize the parent class, use super().__init__(...) or super(TransformerLayer, self).__init__(...).

Comment on lines +54 to +55
assert (lora_A == hf_state_dict['in_proj_ba.lora_A.weight'].load()).all(), \
'Need to ensure QKVZBA\'s lora_A are consistent'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using assert for runtime validation is discouraged as assertions can be optimized away in production environments (e.g., when running Python with -O). It is better to use an explicit if check and raise a ValueError with a descriptive message.

Suggested change
assert (lora_A == hf_state_dict['in_proj_ba.lora_A.weight'].load()).all(), \
'Need to ensure QKVZBA\'s lora_A are consistent'
if not (lora_A == hf_state_dict['in_proj_qkvz.lora_A.weight'].load()).all():
raise ValueError('Need to ensure QKVZBA\'s lora_A are consistent')

qkvz_dim = key_dim * 2 + value_dim * 2
is_lora = False if mg_attn is None else isinstance(mg_attn.in_proj,
LoraParallelLinear) and self._peft_format
is_lora = torch.tensor([is_lora], dtype=torch.bool, device='cuda')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Hardcoding device='cuda' can cause issues in environments without GPU support or when running on other accelerators (e.g., NPU). It is safer to determine the device dynamically from the environment or existing tensors.

Suggested change
is_lora = torch.tensor([is_lora], dtype=torch.bool, device='cuda')
is_lora = torch.tensor([is_lora], dtype=torch.bool, device=torch.cuda.current_device() if torch.cuda.is_available() else 'cpu')

lm_model = model.language_model if hasattr(model, 'language_model') else model
for layer in lm_model.decoder.layers:
if hasattr(layer.self_attention, 'out_norm'):
assert hasattr(layer.self_attention.out_norm, 'zero_centered_gamma')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid using assert for critical runtime checks in model building logic, as it can be optimized away. Use an explicit check and raise an appropriate exception.

Suggested change
assert hasattr(layer.self_attention.out_norm, 'zero_centered_gamma')
if not hasattr(layer.self_attention.out_norm, 'zero_centered_gamma'):
raise AttributeError("layer.self_attention.out_norm missing 'zero_centered_gamma' attribute")

@Jintao-Huang Jintao-Huang merged commit 85df30c into modelscope:main May 14, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants