Merged
2 changes: 1 addition & 1 deletion docs/source/Megatron-SWIFT/Command-line-parameters.md
Original file line number Diff line number Diff line change
@@ -298,7 +298,7 @@ Megatron training parameters are inherited from Megatron parameters and basic parameters (**shared with ms-swift
- gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example, set `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to None. This parameter only takes effect for `vit_gradient_checkpointing`.
- 🔥packing: Whether to use sequence packing to improve computational efficiency (better load balancing across nodes and processes, and higher GPU utilization, at the cost of extra preprocessing time) and stabilize GPU memory usage. Defaults to False. Currently supports CPT/SFT/DPO/KTO/RM.
- Note: **Sequences within the same batch remain mutually invisible**, except for Qwen3-Next.
- Note: **Packing reduces the number of dataset samples; adjust the gradient accumulation steps and learning rate accordingly**.
- Note: **Packing reduces the number of dataset samples; adjust global_batch_size and the learning rate accordingly**.
- packing_length: The length used for packing. Defaults to None, in which case it is set to max_length.
- packing_num_proc: Number of processes used for packing; defaults to 1. Note that different values of `packing_num_proc` produce different packed datasets. (This parameter does not take effect with streaming packing.)
- streaming: Stream reading and processing of the dataset; defaults to False.
3 changes: 2 additions & 1 deletion docs/source/Megatron-SWIFT/Quick-start.md
@@ -27,6 +27,7 @@ pip install --no-build-isolation transformer_engine[pytorch]
# pip install --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.5#egg=transformer_engine[pytorch]

# apex
# Note: Megatron-SWIFT can run in environments without apex by additionally setting `--no_gradient_accumulation_fusion true`.
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
@@ -65,7 +66,7 @@ modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu2
| torch | >=2.0 | 2.7.1/2.8.0 | |
| transformer_engine | >=2.3 | | |
| apex | | 0.1 | |
| megatron_core | | 0.14 | |
| megatron_core | >=0.12 | 0.14 | |
| flash_attn | | 2.8.1/3.0.0b1 | |
| transformers | >=4.33 | 4.57.1 | |
| modelscope | >=1.23 | | |
2 changes: 1 addition & 1 deletion docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -315,7 +315,7 @@ Megatron training parameters are inherited from Megatron parameters and basic pa
- gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example: set `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to `None`. This parameter only takes effect when `vit_gradient_checkpointing` is enabled.
- 🔥packing: Whether to use sequence packing to improve computational efficiency (achieving better load balancing across nodes and processes, and higher GPU utilization), at the cost of additional preprocessing time, while also stabilizing GPU memory usage. Defaults to `False`. Currently supported for CPT, SFT, DPO, KTO and RM.
- Note: **Sequences within the same batch remain mutually invisible**, except for Qwen3-Next.
- Note: **Packing reduces the number of samples in the dataset; please adjust the gradient accumulation steps and learning rate accordingly**.
- Note: **Packing will reduce the number of dataset samples. Please adjust global_batch_size and learning rate accordingly**.
- packing_length: The length to use for packing. Defaults to None, in which case it is set to max_length.
- packing_num_proc: Number of processes for packing, default is 1. Note that different values of `packing_num_proc` will result in different packed datasets. (This parameter does not take effect during streaming packing)
- streaming: Stream data loading and processing, default is False.
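To make the packing note above concrete, here is a minimal hedged sketch of a launch command (the model, dataset, and specific values are placeholders, not taken from this PR):

```shell
# Illustrative only: packing reduces the number of (packed) samples per epoch,
# so global_batch_size and the learning rate are re-tuned together with it.
megatron sft \
    --model <model_id_or_path> \
    --dataset <dataset> \
    --packing true \
    --packing_num_proc 4 \
    --global_batch_size 64 \
    --lr 1e-5
```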
3 changes: 2 additions & 1 deletion docs/source_en/Megatron-SWIFT/Quick-start.md
@@ -26,6 +26,7 @@ pip install --no-build-isolation transformer_engine[pytorch]
# pip install --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.5#egg=transformer_engine[pytorch]

# apex
# Note: Megatron-SWIFT can run in environments without apex by setting `--no_gradient_accumulation_fusion true`.
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
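Per the note above, apex can also be skipped entirely; a hedged sketch of the corresponding launch (everything other than the documented flag is a placeholder):

```shell
# Illustrative only: run Megatron-SWIFT without apex installed by disabling
# the fused gradient-accumulation kernel.
megatron sft \
    --model <model_id_or_path> \
    --dataset <dataset> \
    --no_gradient_accumulation_fusion true
```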
@@ -65,7 +66,7 @@ Recommended Operating Environment:
| torch | >=2.0 | 2.7.1/2.8.0 | |
| transformer_engine | >=2.3 | | |
| apex | | 0.1 | |
| megatron_core | | 0.14 | |
| megatron_core | >=0.12 | 0.14 | |
| flash_attn | | 2.8.1/3.0.0b1 | |
| transformers | >=4.33 | 4.57.1 | |
| modelscope | >=1.23 | | |
13 changes: 11 additions & 2 deletions examples/models/qwen3_next/mcore.sh
@@ -11,7 +11,10 @@ PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron sft \
--load Qwen3-Next-80B-A3B-Instruct-mcore \
--model Qwen/Qwen3-Next-80B-A3B-Instruct \
--load_safetensors true \
--save_safetensors true \
--merge_lora false \
--dataset 'swift/Chinese-Qwen3-235B-2507-Distill-data-110k-SFT#2000' \
'swift/self-cognition#1000' \
--load_from_cache_file true \
@@ -23,7 +26,7 @@ megatron sft \
--moe_permute_fusion true \
--moe_grouped_gemm true \
--moe_shared_expert_overlap true \
--moe_aux_loss_coeff 1e-3 \
--moe_aux_loss_coeff 1e-6 \
--micro_batch_size 2 \
--global_batch_size 16 \
--recompute_granularity full \
@@ -47,3 +50,9 @@ megatron sft \
--attention_backend flash \
--model_author swift \
--model_name swift-robot


# CUDA_VISIBLE_DEVICES=0,1,2,3 \
# swift infer \
# --adapters megatron_output/Qwen3-Next-80B-A3B-Instruct/vx-xxx/checkpoint-xxx \
# --stream true
2 changes: 2 additions & 0 deletions swift/megatron/argument/megatron_args.py
@@ -454,6 +454,8 @@ def __post_init__(self):
MegatronTunerMixin.__post_init__(self)
os.environ['CUDA_DEVICE_MAX_CONNECTIONS'] = '1'
self._set_default()
if self.optimizer_cpu_offload:
require_version('megatron-core>=0.13')
self.model_info, self.model_meta = get_model_info_meta(
self.model, model_type=self.model_type, use_hf=self.use_hf, hub_token=self.hub_token)
self.model_type = self.model_info.model_type
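The `require_version` guard added above follows a fail-fast pattern; a minimal self-contained sketch of the same idea (using `packaging` directly rather than the project's `require_version` helper, with illustrative names):

```python
from packaging import version

def check_min_version(installed: str, minimum: str, feature: str) -> None:
    # Raise early if an optional feature needs a newer megatron-core
    # than the one currently installed.
    if version.parse(installed) < version.parse(minimum):
        raise ImportError(f'{feature} requires megatron-core>={minimum}, '
                          f'but {installed} is installed')

# Mirrors the optimizer_cpu_offload check: 0.13+ passes, older raises.
check_min_version('0.14', '0.13', 'optimizer_cpu_offload')
```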
24 changes: 14 additions & 10 deletions swift/megatron/init.py
@@ -66,7 +66,7 @@ def _patch_mla_attention():
gather_from_tensor_model_parallel_region,
scatter_to_sequence_parallel_region,
)
megatron_core_013 = version.parse(megatron.core.__version__) >= version.parse('0.13.0rc0')
mcore_013 = version.parse(megatron.core.__version__) >= version.parse('0.13.0rc0')

# Code borrowed from NVIDIA/Megatron-LM
def forward(
@@ -112,7 +112,7 @@ def forward(
# Adjust key, value for inference
# ===================================================
# rotary_pos_emb = None
if megatron_core_013:
if mcore_013:
query, key, value, _, attn_mask_type, _ = self._adjust_key_value_for_inference(
inference_context, query, key, value, rotary_pos_emb=None)
else:
@@ -430,7 +430,7 @@ def _patch_TransformerLayer():
from megatron.training import get_args
from megatron.core.transformer import TransformerLayer
_origin_forward = TransformerLayer.forward
megatron_core_013 = version.parse(megatron.core.__version__) >= version.parse('0.13.0rc0')
mcore_013 = version.parse(megatron.core.__version__) >= version.parse('0.13.0rc0')

def forward(self, *_args, **kwargs):
"""
@@ -439,7 +439,7 @@ def forward(self, *_args, **kwargs):
This method calls the core computation of a transformer layer, including
self-attention, cross-attention (if applicable), and feed-forward operations.
"""
if not megatron_core_013:
if not mcore_013:
return _origin_forward(self, *_args, **kwargs)
hidden_states, context = self._forward_attention(*_args, **kwargs)
args = get_args()
@@ -551,11 +551,14 @@ def build_train_valid_test_datasets(build_train_valid_test_datasets_provider):
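The renamed `mcore_013` flag above implements a common compatibility gate: build keyword arguments conditionally so that older releases never see parameters they do not accept. A minimal sketch of the pattern, with illustrative names and no Megatron dependency:

```python
from packaging import version

MCORE_VERSION = '0.14.0'  # stand-in for megatron.core.__version__
mcore_013 = version.parse(MCORE_VERSION) >= version.parse('0.13.0rc0')

def call_with_optional_kwarg(func, cp_group=None):
    # Newer releases accept the extra keyword; older ones would reject it,
    # so the kwargs dict is built conditionally, as in the patches above.
    kwargs = {'cp_group': cp_group} if mcore_013 else {}
    return func(**kwargs)

print(call_with_optional_kwarg(lambda **kw: sorted(kw)))  # ['cp_group']
```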
def _patch_mrope():
from megatron.core.models.common.embeddings.rotary_pos_embedding import MultimodalRotaryEmbedding
from megatron.core import parallel_state
import megatron.core
from megatron.core.models.common.embeddings.rope_utils import (get_pos_emb_on_this_cp_rank,
_apply_rotary_pos_emb_bshd)
from megatron.core.models.common.embeddings import rope_utils
from megatron.training import get_args

mcore_013 = version.parse(megatron.core.__version__) >= version.parse('0.13.0rc0')

# Code borrowed from huggingface/transformers
def apply_interleaved_mrope(freqs, mrope_section):
"""Apply interleaved MRoPE to 3D rotary embeddings.
@@ -638,24 +641,25 @@ def _apply_rotary_pos_emb_thd(
Returns:
Tensor: Shape [t, h, d]. The input tensor after applying RoPE.
"""
use_batched_rope = False
if cp_group is not None:
cp_size = cp_group.size()
cu_seqlens_for_batched = cu_seqlens // cp_size
use_batched_rope = (freqs.dim() >= 1 and freqs.shape[0] == cu_seqlens_for_batched[-1]).item()
else:
args = get_args()
cp_size = args.context_parallel_size
cu_seqlens_for_batched = cu_seqlens // cp_size
use_batched_rope = (freqs.dim() >= 1 and freqs.shape[0] == cu_seqlens_for_batched[-1]).item()
if not use_batched_rope:
logger.warning_once('Using non-batched RoPE, which may affect performance.')
kwargs = {'cp_group': cp_group} if mcore_013 else {}
return _origin_apply_rotary_pos_emb_thd(
t,
cu_seqlens,
freqs,
rotary_interleaved=rotary_interleaved,
multi_latent_attention=multi_latent_attention,
mscale=mscale,
cp_group=cp_group,
**kwargs,
)
if cp_group is None:
raise ValueError('cp_group must be provided for THD format RoPE')

return _apply_rotary_pos_emb_bshd(
t.unsqueeze(1),
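The eligibility test in `_apply_rotary_pos_emb_thd` above reduces to a length comparison; a simplified sketch with plain integers (tensor shapes replaced by lengths, names illustrative):

```python
def can_use_batched_rope(freqs_len: int, cu_seqlens: list, cp_size: int) -> bool:
    # Batched RoPE applies only when the precomputed frequencies cover the
    # whole packed sequence after splitting lengths across context-parallel ranks.
    per_rank_cu = [c // cp_size for c in cu_seqlens]
    return freqs_len == per_rank_cu[-1]

# Packed lengths [4, 8] with cp_size=2 -> 6 positions per rank.
print(can_use_batched_rope(6, [0, 4, 12], 2))   # True
print(can_use_batched_rope(12, [0, 4, 12], 2))  # False
```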
70 changes: 45 additions & 25 deletions swift/megatron/model/gpt/qwen3_next.py
@@ -2,6 +2,7 @@
from copy import deepcopy
from typing import Optional, Tuple, Union

import megatron.core
import torch
from megatron.core.extensions.transformer_engine import TEColumnParallelLinear, TENorm, _get_extra_te_kwargs
from megatron.core.inference.contexts import BaseInferenceContext
@@ -17,13 +18,15 @@
from megatron.core.transformer.transformer_layer import get_transformer_layer_offset
from megatron.core.utils import deprecate_inference_params, is_fa_min_version
from megatron.training import get_args
from packaging import version

from swift.llm import ModelType
from swift.utils import get_logger
from ..constant import MegatronModelType
from ..gpt_bridge import GPTBridge
from ..register import MegatronModelMeta, register_megatron_model

mcore_013 = version.parse(megatron.core.__version__) >= version.parse('0.13.0rc0')
try:
from flashattn_hopper.flash_attn_interface import _flash_attn_forward
from flashattn_hopper.flash_attn_interface import flash_attn_with_kvcache as flash_attn3_with_kvcache
@@ -58,6 +61,7 @@ class Qwen3NextSelfAttention(SelfAttention):

def __init__(self, config: TransformerConfig, submodules: SelfAttentionSubmodules, *args, **kwargs):
super(SelfAttention, self).__init__(config, submodules, *args, attention_type='self', **kwargs)
kwargs = {'tp_group': self.model_comm_pgs.tp} if mcore_013 else {}
self.linear_qkv = build_module(
submodules.linear_qkv,
self.config.hidden_size,
@@ -69,7 +73,7 @@ def __init__(self, config: TransformerConfig, submodules: SelfAttentionSubmodule
skip_bias_add=False,
is_expert=False,
tp_comm_buffer_name='qkv',
tp_group=self.model_comm_pgs.tp,
**kwargs,
)

if submodules.q_layernorm is not None:
Expand Down Expand Up @@ -130,12 +134,22 @@ def forward(
(Tuple[Tensor, Tensor]) Attention output and bias.

"""
from megatron.core.utils import nvtx_range_pop, nvtx_range_push
try:
from megatron.core.utils import nvtx_range_pop, nvtx_range_push
except ImportError:

def nvtx_range_pop(*args, **kwargs):
return

def nvtx_range_push(*args, **kwargs):
return

# Check if we need to skip RoPE
# no_rope is 0-indexed array and self.layer_number is 1-indexed
no_rope = (self.config.no_rope_freq[self.layer_number - 1] if self.config.no_rope_freq else False)
if no_rope:
rotary_pos_emb = None
if hasattr(self.config, 'no_rope_freq'):
no_rope = (self.config.no_rope_freq[self.layer_number - 1] if self.config.no_rope_freq else False)
if no_rope:
rotary_pos_emb = None

inference_context = deprecate_inference_params(inference_context, inference_params)

@@ -194,17 +208,20 @@ def forward(
if (in_decode_mode and self.config.enable_cuda_graph and inference_context.is_static_batching()):
raise ValueError('CUDA graphs must use flash decode with static batching!')

query, key, value, rotary_pos_emb, attn_mask_type, block_table = (
self._adjust_key_value_for_inference(
inference_context,
query,
key,
value,
rotary_pos_emb,
rotary_pos_cos,
rotary_pos_sin,
sequence_len_offset,
))
result = self._adjust_key_value_for_inference(
inference_context,
query,
key,
value,
rotary_pos_emb,
rotary_pos_cos,
rotary_pos_sin,
sequence_len_offset,
)
if mcore_013:
query, key, value, rotary_pos_emb, attn_mask_type, block_table = result
else:
query, key, value, rotary_pos_emb, attn_mask_type = result

if packed_seq_params is not None:
query = query.squeeze(1)
@@ -215,6 +232,7 @@
# ================================================
# relative positional embedding (rotary embedding)
# ================================================
kwargs = {'cp_group': self.model_comm_pgs.cp} if mcore_013 else {}
nvtx_range_push(suffix='rotary_pos_emb')
if rotary_pos_emb is not None and not self.config.flash_decode:
q_pos_emb, k_pos_emb = rotary_pos_emb
@@ -239,18 +257,18 @@
q_pos_emb,
config=self.config,
cu_seqlens=cu_seqlens_q,
cp_group=self.model_comm_pgs.cp,
**kwargs,
)
else:
query = inference_context.apply_rotary_emb_query(query, q_pos_emb, self.config, cu_seqlens_q,
self.model_comm_pgs.cp)
**kwargs)
if k_pos_emb is not None:
key = apply_rotary_pos_emb(
key,
k_pos_emb,
config=self.config,
cu_seqlens=cu_seqlens_kv,
cp_group=self.model_comm_pgs.cp,
**kwargs,
)

# TODO, can apply positional embedding to value_layer so it has
@@ -418,16 +436,17 @@ def forward(self, hidden_states: torch.Tensor, **kwargs):


def get_local_layer_specs(config, layer_specs, vp_stage=None):
from megatron.core.transformer.enums import LayerType
num_layers_to_build = get_num_layers_to_build(config, vp_stage=vp_stage)
kwargs = {'vp_stage': vp_stage} if mcore_013 else {}
num_layers_to_build = get_num_layers_to_build(config, **kwargs)

if config.pipeline_model_parallel_layout is not None:
if getattr(config, 'pipeline_model_parallel_layout', None) is not None:
from megatron.core.transformer.enums import LayerType
local_layer_specs = [
layer_specs[layer_id] for layer_id in config.pipeline_model_parallel_layout.get_layer_id_list(
layer_type=LayerType.decoder, vp_stage=vp_stage)
layer_type=LayerType.decoder, **kwargs)
]
else:
offset = get_transformer_layer_offset(config, vp_stage=vp_stage)
offset = get_transformer_layer_offset(config, **kwargs)
local_layer_specs = layer_specs[offset:offset + num_layers_to_build]
return local_layer_specs

@@ -446,13 +465,14 @@ def get_qwen3_next_transformer_layer_spec(config, vp_stage=None):
config.linear_conv_kernel_dim = args.linear_conv_kernel_dim

layer_norm_impl = TENorm
kwargs = {'use_kitchen': config.use_kitchen} if mcore_013 else {}
moe_layer_spec = get_gpt_layer_with_transformer_engine_spec(
num_experts=config.num_moe_experts,
moe_grouped_gemm=config.moe_grouped_gemm,
qk_layernorm=config.qk_layernorm,
multi_latent_attention=config.multi_latent_attention,
moe_use_legacy_grouped_gemm=config.moe_use_legacy_grouped_gemm,
use_kitchen=config.use_kitchen,
**kwargs,
)
layer_specs = []
for layer_type in args.layer_types: