[model] support qwen3_next gdn#76
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces Gated Delta Network (GDN) support for Qwen3 models by adding Qwen3NextGDNBridge and Qwen3NextLoader, and refactoring qwen3_5_gdn to inherit from these new components. Review feedback suggests removing the redundant Qwen3NextGatedDeltaNet subclass in favor of using GatedDeltaNet directly and centralizing the USE_MCORE_GDN environment variable retrieval to reduce code duplication across multiple files.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces Gated Delta Net (GDN) support for Qwen3 models, including new bridge and loader implementations and a refactor of Qwen3_5 models to utilize shared GDN logic. Additionally, the codebase is updated to rename CustomTransformerBlock and CustomTransformerLayer to TransformerBlock and TransformerLayer for consistency. Review feedback identifies a bug in the TransformerLayer initialization where an incorrect super() call skips the parent class's constructor, and suggests using torch.equal for more robust tensor comparisons in the GDN bridge.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for the Qwen3 Gated Delta Net (GDN) architecture, including new bridge and loader classes for both text and multimodal variants. It also refactors several core modules by renaming 'Custom' prefixed classes to more standard names and streamlining imports. Key feedback includes a critical bug in TransformerLayer where the use of super() with the wrong class name skips the parent's initialization logic. Additionally, it is recommended to replace runtime assert statements with explicit exception handling and to avoid hardcoding 'cuda' as a device to ensure better environment compatibility.
| ): | ||
| self.submodules_config = submodules | ||
| super(TransformerLayer, self).__init__(config=config, vp_stage=vp_stage) | ||
| super(McoreTransformerLayer, self).__init__(config=config, vp_stage=vp_stage) |
There was a problem hiding this comment.
The call super(McoreTransformerLayer, self).__init__ invokes the constructor of the grandparent class (the parent of McoreTransformerLayer), effectively skipping the initialization logic of McoreTransformerLayer. This is problematic because McoreTransformerLayer performs essential setup (such as setting self.config, self.layer_number, etc.). Furthermore, if the grandparent class (typically MegatronModule) does not accept the vp_stage keyword argument, this will result in a TypeError at runtime. If the intention is to initialize the parent class, use super().__init__(...) or super(TransformerLayer, self).__init__(...).
| assert (lora_A == hf_state_dict['in_proj_ba.lora_A.weight'].load()).all(), \ | ||
| 'Need to ensure QKVZBA\'s lora_A are consistent' |
There was a problem hiding this comment.
Using assert for runtime validation is discouraged as assertions can be optimized away in production environments (e.g., when running Python with -O). It is better to use an explicit if check and raise a ValueError with a descriptive message.
| assert (lora_A == hf_state_dict['in_proj_ba.lora_A.weight'].load()).all(), \ | |
| 'Need to ensure QKVZBA\'s lora_A are consistent' | |
| if not (lora_A == hf_state_dict['in_proj_qkvz.lora_A.weight'].load()).all(): | |
| raise ValueError('Need to ensure QKVZBA\'s lora_A are consistent') |
| qkvz_dim = key_dim * 2 + value_dim * 2 | ||
| is_lora = False if mg_attn is None else isinstance(mg_attn.in_proj, | ||
| LoraParallelLinear) and self._peft_format | ||
| is_lora = torch.tensor([is_lora], dtype=torch.bool, device='cuda') |
There was a problem hiding this comment.
Hardcoding device='cuda' can cause issues in environments without GPU support or when running on other accelerators (e.g., NPU). It is safer to determine the device dynamically from the environment or existing tensors.
| is_lora = torch.tensor([is_lora], dtype=torch.bool, device='cuda') | |
| is_lora = torch.tensor([is_lora], dtype=torch.bool, device=torch.cuda.current_device() if torch.cuda.is_available() else 'cpu') |
| lm_model = model.language_model if hasattr(model, 'language_model') else model | ||
| for layer in lm_model.decoder.layers: | ||
| if hasattr(layer.self_attention, 'out_norm'): | ||
| assert hasattr(layer.self_attention.out_norm, 'zero_centered_gamma') |
There was a problem hiding this comment.
Avoid using assert for critical runtime checks in model building logic, as it can be optimized away. Use an explicit check and raise an appropriate exception.
| assert hasattr(layer.self_attention.out_norm, 'zero_centered_gamma') | |
| if not hasattr(layer.self_attention.out_norm, 'zero_centered_gamma'): | |
| raise AttributeError("layer.self_attention.out_norm missing 'zero_centered_gamma' attribute") |
No description provided.