Merged

docs/source/Megatron-SWIFT/Command-line-parameters.md (4 changes: 3 additions & 1 deletion)
@@ -185,10 +185,12 @@
 - moe_ffn_hidden_size: Hidden layer size of each expert's feed-forward network (FFN). Default is None; read automatically from config.json. If not found and `num_experts` is not None, it is set to ffn_hidden_size.
 - moe_shared_expert_intermediate_size: Total FFN hidden layer size of the shared experts. If there are multiple shared experts, it should equal `num_shared_experts * ffn_size_of_each_shared_expert`. Default is None; read automatically from config.json.
 - moe_router_topk: Number of experts each token is routed to. Default is None; read automatically from config.json.
+- moe_router_num_groups: Number of groups the experts are divided into for group-limited routing. See DeepSeek-V2 and DeepSeek-V3. Default is None; read automatically from config.json.
+- moe_router_group_topk: Number of groups selected in group-limited routing. Default is None; read automatically from config.json.
 - moe_router_pre_softmax: Enable pre-softmax routing for MoE, meaning softmax is applied before top-k selection. Default is None; read automatically from config.json.
 - 🔥moe_router_dtype: Data type used for routing computation and the weighted averaging of expert outputs. Options are 'none', 'fp32', and 'fp64'; this improves numerical stability, especially with a large number of experts. Used together with `moe_permute_fusion`, the performance impact is negligible. Default is 'fp32'; 'none' means the data type is left unchanged.
 - moe_router_score_function: Scoring function for MoE top-k routing. Can be "softmax" or "sigmoid". Default is None; read from config.json.
-- moe_router_bias_update_rate: Update rate of the expert bias in the auxiliary-loss-free load balancing strategy. The expert bias is updated according to the number of tokens each expert is assigned in the global batch: the bias increases for experts assigned fewer tokens and decreases for experts assigned more tokens. Default is 1e-3, the same value used in DeepSeek-V3.
+- moe_router_bias_update_rate: Update rate of the expert bias in the auxiliary-loss-free load balancing strategy. The expert bias is updated according to the number of tokens each expert is assigned in the global batch: the bias increases for experts assigned fewer tokens and decreases for experts assigned more tokens. Default is None; read from config.json.
 - moe_router_enable_expert_bias: Top-k routing with a dynamic expert bias in the auxiliary-loss-free load balancing strategy. Routing decisions are based on the sum of the routing scores and the expert bias. See https://arxiv.org/abs/2408.15664 for details. Default is None; read automatically from config.json.
 - moe_router_topk_scaling_factor: Default is None; read from config.json.
 - moe_router_load_balancing_type: Determines the router's load-balancing strategy. Options are "aux_loss", "seq_aux_loss", "sinkhorn", and "none". Default is None; read from config.json.
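
As an aside, the two new group-limited routing parameters above (moe_router_num_groups, moe_router_group_topk) follow the DeepSeek-V2/V3 scheme: experts are partitioned into groups, a few groups are selected per token, and ordinary top-k then runs over the surviving experts. A minimal PyTorch sketch of that idea (not the Megatron implementation; shapes and the group-scoring rule are simplified):

```python
import torch


def group_limited_topk(scores: torch.Tensor, num_groups: int, group_topk: int, topk: int):
    """Sketch of DeepSeek-V2/V3-style group-limited routing.

    scores: [num_tokens, num_experts] router scores (after softmax/sigmoid).
    Experts are split into num_groups groups; only the best group_topk groups
    per token stay as candidates, then ordinary top-k picks the experts.
    """
    num_tokens, num_experts = scores.shape
    group_scores = scores.view(num_tokens, num_groups, -1)
    # Rank groups by their best expert (DeepSeek-V3 actually sums each group's top-2 scores).
    top_groups = group_scores.max(dim=-1).values.topk(group_topk, dim=-1).indices
    # Zero out experts that belong to unselected groups.
    group_mask = torch.zeros(num_tokens, num_groups, device=scores.device)
    group_mask.scatter_(1, top_groups, 1.0)
    expert_mask = group_mask.unsqueeze(-1).expand_as(group_scores).reshape(num_tokens, num_experts)
    masked = scores.masked_fill(expert_mask == 0, float('-inf'))
    return masked.topk(topk, dim=-1)  # (values, expert indices)


# Roughly DeepSeek-V3-like settings: 256 experts, n_group=8, topk_group=4, top-k=8.
values, expert_ids = group_limited_topk(torch.rand(4, 256).softmax(dim=-1), 8, 4, 8)
```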

docs/source_en/Megatron-SWIFT/Command-line-parameters.md (4 changes: 3 additions & 1 deletion)
@@ -197,10 +197,12 @@ For guidance on selecting parallelization strategies, please refer to the [Train
 - moe_ffn_hidden_size: Hidden layer size of the feedforward network (ffn) for each expert. Default is None and will be automatically read from config.json. If not found and `num_experts` is not None, it will be set to ffn_hidden_size.
 - moe_shared_expert_intermediate_size: The total FFN hidden layer size for shared experts. If there are multiple shared experts, it should equal `num_shared_experts * ffn_size_of_each_shared_expert`. Default is None. Automatically read from config.json.
 - moe_router_topk: The number of experts each token is routed to. Default is None. Automatically read from config.json.
+- moe_router_num_groups: Number of groups to divide experts into for group-limited routing. Refers to DeepSeek-V2 and DeepSeek-V3. Default is None. Automatically read from config.json.
+- moe_router_group_topk: Number of selected groups for group-limited routing. Default is None. Automatically read from config.json.
 - moe_router_pre_softmax: Enable pre-softmax routing for MoE, meaning that softmax will be applied before top-k selection. Default is None. Automatically read from config.json.
 - 🔥moe_router_dtype: Data type used for routing computation and expert output weighted averaging. Options are 'none', 'fp32', and 'fp64', which enhances numerical stability, especially when the number of experts is large. When used together with `moe_permute_fusion`, the performance impact is negligible. Default is 'fp32'. 'none' means no change to data type.
 - moe_router_score_function: Scoring function for MoE TopK routing. Can be "softmax" or "sigmoid". Default is None and is read from config.json.
-- moe_router_bias_update_rate: Update rate of expert bias in the auxiliary-loss-free load balancing strategy. Expert bias is updated based on the number of tokens each expert is assigned in the global batch: bias increases for experts assigned fewer tokens, and decreases for those assigned more tokens. Default is 1e-3, same as used in DeepSeekV3.
+- moe_router_bias_update_rate: Update rate of expert bias in the auxiliary-loss-free load balancing strategy. Expert bias is updated based on the number of tokens each expert is assigned in the global batch: bias increases for experts assigned fewer tokens, and decreases for those assigned more tokens. Default is None and is read from config.json.
 - moe_router_enable_expert_bias: TopK routing with dynamic expert bias in the auxiliary-loss-free load balancing strategy. Routing decisions are based on the sum of routing scores and expert bias. See details at: https://arxiv.org/abs/2408.15664. Default is None and is automatically read from config.json.
 - moe_router_topk_scaling_factor: Default is None. This parameter is read from config.json.
 - moe_router_load_balancing_type: Determines the router’s load balancing strategy. Options are "aux_loss", "seq_aux_loss", "sinkhorn", and "none". Default is None and is read from config.json.
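
To make the auxiliary-loss-free balancing parameters above concrete (moe_router_enable_expert_bias, moe_router_bias_update_rate), here is a minimal sketch of the bias update described in https://arxiv.org/abs/2408.15664; the function and variable names are illustrative, not Megatron's API:

```python
import torch


def update_expert_bias(expert_bias: torch.Tensor, tokens_per_expert: torch.Tensor,
                       update_rate: float = 1e-3) -> torch.Tensor:
    """Aux-loss-free balancing: push the bias of under-loaded experts up and
    over-loaded experts down by a fixed step after each global batch."""
    mean_load = tokens_per_expert.float().mean()
    return expert_bias + update_rate * torch.sign(mean_load - tokens_per_expert.float())


# The bias only affects which experts are selected (scores + bias for top-k);
# the unbiased routing scores are still used to weight the expert outputs.
bias = update_expert_bias(torch.zeros(8), torch.tensor([10, 90, 40, 60, 55, 45, 70, 30]))
```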

swift/megatron/argument/megatron_args.py (4 changes: 3 additions & 1 deletion)
@@ -284,10 +284,12 @@ class MegatronArguments(ExtraMegatronArguments):
     moe_shared_expert_intermediate_size: Optional[int] = None

     moe_router_topk: Optional[int] = None
+    moe_router_num_groups: Optional[int] = None
+    moe_router_group_topk: Optional[int] = None
     moe_router_pre_softmax: Optional[bool] = None
     moe_router_dtype: Literal['none', 'fp32', 'fp64'] = 'fp32'
     moe_router_score_function: Literal['sigmoid', 'softmax'] = None
-    moe_router_bias_update_rate: float = 1e-3
+    moe_router_bias_update_rate: Optional[float] = None
     moe_router_enable_expert_bias: Optional[bool] = None
     moe_router_topk_scaling_factor: Optional[float] = None
     moe_router_load_balancing_type: Literal['aux_loss', 'seq_aux_loss', 'sinkhorn', 'none'] = None

swift/megatron/model/gpt_bridge.py (12 changes: 10 additions & 2 deletions)
@@ -2,11 +2,13 @@
 from copy import copy
 from typing import Optional

+import megatron.core
 import torch
 import torch.distributed as dist
 import torch.nn.functional as F
 from megatron.core import mpu
 from megatron.training import get_args
+from packaging import version
 from peft.utils import ModulesToSaveWrapper
 from tqdm import tqdm
 from transformers.modeling_utils import custom_object_save
@@ -41,6 +43,7 @@ def __init__(self, disable_tqmd: bool = False):
         self._init_meta_hf_model()
         self.hf_layers = deep_getattr(self.hf_model, self.hf_layers_prefix)
         self.module_mapping = {}
+        self.megatron_core_014 = version.parse(megatron.core.__version__) >= version.parse('0.14.0rc0')
         megatron_model_meta = get_megatron_model_meta(self.args.hf_model_type)
         if self.args.is_multimodal and megatron_model_meta.visual_cls is not None:
             self.module_mapping = megatron_model_meta.visual_cls.module_mapping
@@ -64,8 +67,7 @@ def _init_meta_hf_model(self):
         self.hf_model, self.processor = get_model_tokenizer(
             self.args.model_dir, model_type=self.args.hf_model_type, return_dummy_model=True)

-    @staticmethod
-    def _get_tp_split_dim(mg_key: Optional[str]) -> Optional[int]:
+    def _get_tp_split_dim(self, mg_key: Optional[str]) -> Optional[int]:
         if mg_key is None:
             return
         # ColumnLinear
@@ -78,6 +80,9 @@ def _get_tp_split_dim(self, mg_key: Optional[str]) -> Optional[int]:
             'linear_q_up_proj',
             'linear_kv_up_proj'
         }
+        if not self.megatron_core_014:
+            # https://github.com/NVIDIA/Megatron-LM/commit/720c8b40d8e7e2de1dd303d792f29093101c5e72
+            dim0_keys.update({'linear_q_down_proj', 'linear_kv_down_proj'})
         # RowLinear
         dim1_keys = {'linear_proj', 'linear_fc2'}

Review comment on lines +83 to 87 (severity: high):

For megatron-core>=0.14.0, linear_q_down_proj and linear_kv_down_proj are RowParallelLinear, which should be split along dimension 1. You've correctly handled the case for older versions by adding them to dim0_keys. However, for newer versions, they should be added to dim1_keys to ensure correct tensor parallelism splitting. Without this, they won't be split correctly.

        # RowLinear
        dim1_keys = {'linear_proj', 'linear_fc2'}
        if not self.megatron_core_014:
            # https://github.com/NVIDIA/Megatron-LM/commit/720c8b40d8e7e2de1dd303d792f29093101c5e72
            dim0_keys.update({'linear_q_down_proj', 'linear_kv_down_proj'})
        else:
            dim1_keys.update({'linear_q_down_proj', 'linear_kv_down_proj'})

         if 'lora_A' not in mg_key and 'lora_B' not in mg_key:
@@ -856,6 +861,9 @@ def _set_mla_attn_state(
                              to_mcore)
         self._set_state_dict(mg_attn, 'linear_kv_up_proj.weight', hf_state_dict, 'kv_b_proj.weight', to_mcore)
         if self.args.qk_layernorm:
+            if self.args.q_lora_rank is not None:
+                self._set_state_dict(mg_attn, 'linear_q_up_proj.layer_norm_weight', hf_state_dict,
+                                     'q_a_layernorm.weight', to_mcore)
             self._set_state_dict(mg_attn, 'linear_kv_up_proj.layer_norm_weight', hf_state_dict, 'kv_a_layernorm.weight',
                                  to_mcore)
         if to_mcore:
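
The reviewer's comment above hinges on how tensor-parallel weights are sharded: column-parallel layers are split along dim 0, row-parallel layers along dim 1, and linear_q_down_proj/linear_kv_down_proj changed category in megatron-core 0.14. A minimal sketch of that split logic (illustrative only, not gpt_bridge.py's actual code; the helper name is an assumption):

```python
from typing import Optional

import torch


def split_for_tp(weight: torch.Tensor, split_dim: Optional[int], tp_size: int, tp_rank: int) -> torch.Tensor:
    """Return this rank's shard: dim 0 for ColumnParallelLinear weights,
    dim 1 for RowParallelLinear weights, no split for replicated params."""
    if split_dim is None:
        return weight  # e.g. layer-norm weights are replicated across TP ranks
    return torch.chunk(weight, tp_size, dim=split_dim)[tp_rank]


w = torch.randn(1024, 512)            # [out_features, in_features]
col_shard = split_for_tp(w, 0, 4, 0)  # [256, 512]: column-parallel split
row_shard = split_for_tp(w, 1, 4, 0)  # [1024, 128]: row-parallel split
```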

swift/megatron/utils/config.py (5 changes: 4 additions & 1 deletion)
@@ -24,13 +24,16 @@
     # moe
     'moe_ffn_hidden_size': ['moe_intermediate_size'],
     'moe_shared_expert_intermediate_size': ['shared_expert_intermediate_size'],
-    'moe_router_topk': ['num_experts_per_tok', 'n_group', 'moe_topk', 'moe_k'],
+    'moe_router_topk': ['num_experts_per_tok', 'moe_topk', 'moe_k'],
+    'moe_router_num_groups': ['n_group'],
+    'moe_router_group_topk': ['topk_group'],
     'num_experts': ['num_experts', 'n_routed_experts', 'moe_num_experts'],
     'moe_router_pre_softmax': ['norm_topk_prob'],
     # deepseek
     'q_lora_rank': ['q_lora_rank'],
     'kv_lora_rank': ['kv_lora_rank'],
     'moe_router_score_function': ['scoring_func'],
+    'moe_router_bias_update_rate': ['aux_loss_alpha'],
     'qk_head_dim': ['qk_nope_head_dim'],
     'qk_pos_emb_head_dim': ['qk_rope_head_dim'],
     'moe_router_topk_scaling_factor': ['routed_scaling_factor'],
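
The mapping above translates HuggingFace config.json keys into Megatron arguments. A hedged sketch of how such a table is typically consumed (the helper name and surrounding logic are assumptions, not the actual swift/megatron/utils/config.py implementation):

```python
from typing import Any, Dict, List, Optional


def resolve_from_hf_config(current_value: Optional[Any], hf_config: Dict[str, Any],
                           candidate_keys: List[str]) -> Optional[Any]:
    """Keep an explicitly set value; otherwise fall back to the first matching
    key found in config.json (e.g. moe_router_num_groups <- 'n_group')."""
    if current_value is not None:
        return current_value
    for key in candidate_keys:
        if key in hf_config:
            return hf_config[key]
    return None  # left unset; Megatron's own default applies


# With the new mapping, n_group no longer feeds moe_router_topk; it now fills
# moe_router_num_groups, while topk_group fills moe_router_group_topk.
```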