1 change: 1 addition & 0 deletions docs/source/Instruction/命令行参数.md
@@ -413,6 +413,7 @@ Vera uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`,
- 🔥create_checkpoint_symlink: Creates additional checkpoint symlinks to make it easier to write automated training scripts. The symlink paths for best_model and last_model are f'{output_dir}/best' and f'{output_dir}/last' respectively.
- 🔥packing: Whether to use sequence packing to improve computational efficiency (better load balancing across nodes and processes, higher GPU utilization) and stabilize GPU memory usage. Defaults to False. Currently supported for CPT/SFT/DPO/KTO/GKD.
- Note: When using packing, combine it with `--attn_impl flash_attn` and make sure "transformers>=4.44". For details, see [this PR](https://github.com/huggingface/transformers/pull/31629).
- Note: **Packing reduces the number of samples in the dataset; adjust the gradient accumulation steps and learning rate accordingly**.
- packing_length: The length used for packing. Defaults to None, in which case it is set to max_length.
- lazy_tokenize: Whether to use lazy tokenization. If set to False, all dataset samples are tokenized before training (for multimodal models, this includes reading images from disk). Defaults to None: False for LLM training and True for MLLM training, to save memory.
- Note: If you want to perform image data augmentation, set lazy_tokenize (or streaming) to True and modify the encode method of the Template class.
1 change: 1 addition & 0 deletions docs/source/Megatron-SWIFT/命令行参数.md
@@ -271,6 +271,7 @@ Megatron training parameters inherit from the Megatron parameters and basic parameters (**shared with ms-swift
- gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example: `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to None. This parameter only takes effect for `vit_gradient_checkpointing`.
- 🔥packing: Whether to use sequence packing to improve computational efficiency (better load balancing across nodes and processes and higher GPU utilization, at the cost of additional preprocessing time) and stabilize GPU memory usage. Defaults to False. Currently supported for CPT/SFT/DPO/KTO/RM.
- Note: **Different sequences within the same batch remain mutually invisible**, except for Qwen3-Next.
- Note: **Packing reduces the number of samples in the dataset; adjust the gradient accumulation steps and learning rate accordingly**.
- packing_length: The length used for packing. Defaults to None, in which case it is set to max_length.
- streaming: Read and process the dataset in streaming mode. Defaults to False.
- Note: Since the length of a streaming dataset cannot be determined, the `--train_iters` parameter must be set. Also set the `max_epochs` parameter to ensure training exits after the specified number of epochs, and to validate and save the model weights.
1 change: 1 addition & 0 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -420,6 +420,7 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine
- 🔥create_checkpoint_symlink: Creates additional checkpoint symlinks to facilitate writing automated training scripts. The symlink paths for `best_model` and `last_model` are `f'{output_dir}/best'` and `f'{output_dir}/last'` respectively.
- 🔥packing: Whether to use sequence packing to improve computational efficiency (better load balancing across nodes and processes, higher GPU utilization) and stabilize GPU memory usage. Default is False. Currently supported in CPT/SFT/DPO/KTO/GKD.
- Note: When using packing, please combine it with `--attn_impl flash_attn` and ensure "transformers>=4.44". For details, see [this PR](https://github.com/huggingface/transformers/pull/31629).
- Note: **Packing reduces the number of samples in the dataset; please adjust the gradient accumulation steps and learning rate accordingly**.
- packing_length: the length to use for packing. Defaults to None, in which case it is set to max_length.
- lazy_tokenize: Whether to use lazy tokenization. If set to `False`, all dataset samples will be tokenized (and for multimodal models, images will be loaded from disk) before training begins. Default is `None`: in LLM training, it defaults to `False`; in MLLM training, it defaults to `True` to save memory.
- Note: If you want to perform image data augmentation, you need to set `lazy_tokenize` (or `streaming`) to True and modify the `encode` method in the Template class.
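For illustration, here is a minimal sketch of such a modification, assuming a torchvision-based augmentation and that `Template` is importable from `swift.llm`; the subclass name, the `images` attribute, and the exact `encode` signature are assumptions and may differ from the actual ms-swift API:

```python
# Hedged sketch: apply image augmentation per sample inside a custom Template subclass.
# Requires lazy_tokenize (or streaming) so that encode() runs during training rather than once up front.
from torchvision import transforms

from swift.llm import Template  # assumption: Template is exposed at this import path

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])


class AugmentedTemplate(Template):

    def encode(self, inputs, *args, **kwargs):
        # Assumption: `inputs.images` holds the PIL images before tokenization.
        images = getattr(inputs, 'images', None)
        if images:
            inputs.images = [augment(img) for img in images]
        return super().encode(inputs, *args, **kwargs)
```

You would then swap this subclass in wherever your training script constructs the template.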
3 changes: 2 additions & 1 deletion docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -284,7 +284,8 @@ Megatron training parameters are inherited from Megatron parameters and basic pa
- vit_gradient_checkpointing: Whether to enable gradient checkpointing for the ViT (Vision Transformer) component during multimodal model training. Defaults to `True`. (**The ViT implementation in Megatron-SWIFT uses the Hugging Face `transformers` library.**)
- gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example: set `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to `None`. This parameter only takes effect when `vit_gradient_checkpointing` is enabled.
- 🔥packing: Whether to use sequence packing to improve computational efficiency (achieving better load balancing across nodes and processes, and higher GPU utilization), at the cost of additional preprocessing time, while also stabilizing GPU memory usage. Defaults to `False`. Currently supported for CPT, SFT, DPO, KTO and RM.
- **Note**: **Sequences within the same batch remain mutually invisible**, except for Qwen3-Next.
- Note: **Sequences within the same batch remain mutually invisible**, except for Qwen3-Next.
- Note: **Packing reduces the number of samples in the dataset; please adjust the gradient accumulation steps and learning rate accordingly**.
- streaming: Stream data loading and processing, default is False.
- Note: Since the length of a streaming dataset cannot be determined, the `--train_iters` parameter must be set. Also set the `max_epochs` parameter to ensure training exits after the specified number of epochs, and to validate and save the model weights accordingly. A rough way of estimating `--train_iters` is sketched below.
- Note: Streaming datasets can skip preprocessing wait time by overlapping preprocessing with training. Preprocessing for streaming datasets is performed only on rank 0 and then synchronized to other processes via data distribution. **This is generally less efficient than the data sharding approach used in non-streaming datasets.** When the training world_size is large, preprocessing and data distribution can become a training bottleneck.
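As a rough aid for choosing `--train_iters`, the small sketch below assumes that one global batch is consumed per iteration and that you can supply your own estimate of the dataset size (all numbers are hypothetical):

```python
# Hypothetical values: estimate --train_iters when a streaming dataset cannot report its length.
approx_num_samples = 200_000   # your own estimate of the dataset size
max_epochs = 3
global_batch_size = 64         # micro_batch_size * gradient_accumulation * data-parallel size

# One training iteration consumes one global batch, so:
train_iters = approx_num_samples * max_epochs // global_batch_size
print(f'--train_iters {train_iters}')
```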
2 changes: 2 additions & 0 deletions examples/export/quantize/awq.sh
@@ -1,3 +1,5 @@
pip install "transformers<4.52"
Contributor review comment (medium): To improve maintainability and help other developers, it would be beneficial to add a comment explaining why transformers<4.52 is required. This provides context for the version pinning.

Suggested change:
pip install "transformers<4.52"
pip install "transformers<4.52" # Pinned to avoid compatibility issues with AWQ in newer versions.

CUDA_VISIBLE_DEVICES=0 \
swift export \
--model Qwen/Qwen2.5-72B-Instruct \
3 changes: 2 additions & 1 deletion examples/export/quantize/gptq_v2.sh
@@ -1,3 +1,4 @@
# You need to install gptqmodel.
# OMP_NUM_THREADS=14 is required; please check this issue: https://github.com/AutoGPTQ/AutoGPTQ/issues/439
OMP_NUM_THREADS=14 \
CUDA_VISIBLE_DEVICES=0 \
@@ -10,4 +11,4 @@ swift export \
--max_length 2048 \
--quant_method gptq_v2 \
--quant_bits 4 \
--output_dir Qwen2.5-1.5B-Instruct-GPTQ-Int4
--output_dir Qwen2.5-1.5B-Instruct-GPTQ-V2-Int4
2 changes: 2 additions & 0 deletions examples/export/quantize/moe/awq.sh
@@ -1,3 +1,5 @@
pip install "transformers<4.52"
Contributor review comment (medium): Similar to the other awq.sh script, please add a comment explaining the reason for pinning the transformers version. This helps with future maintenance and understanding.

Suggested change:
pip install "transformers<4.52"
pip install "transformers<4.52" # Pinned to avoid compatibility issues with AWQ in newer versions.

CUDA_VISIBLE_DEVICES=0,1 \
swift export \
--model Qwen/Qwen3-30B-A3B \
2 changes: 1 addition & 1 deletion swift/llm/argument/base_args/quant_args.py
@@ -39,7 +39,7 @@ class QuantizeArguments:
    bnb_4bit_quant_storage: Optional[str] = None

    def get_quantization_config(self):
        if self.quant_method is None or self.quant_method in {'awq', 'gptq'}:
        if self.quant_method is None or self.quant_method in {'awq', 'gptq', 'gptq_v2'}:
            return None
        assert self.quant_method in {'bnb', 'hqq', 'eetq', 'quanto', 'fp8'}
        if self.quant_method != 'fp8' and self.quant_bits is None:
40 changes: 35 additions & 5 deletions swift/llm/export/quant.py
@@ -5,6 +5,8 @@

import torch
import torch.nn as nn
import transformers
from packaging import version
from tqdm import tqdm

from swift.llm import (ExportArguments, HfConfigFactory, MaxLengthError, ProcessorMixin, deep_getattr, load_dataset,
@@ -41,6 +43,10 @@ def quantize(self):
        elif args.quant_method in {'gptq', 'gptq_v2'}:
            self.template.model = self.model
            gptq_quantizer = self.gptq_model_quantize(v2=(args.quant_method == 'gptq_v2'))
            if args.quant_method == 'gptq_v2':
                if not getattr(self.model, '_dynamic_tied_weights_keys', None):
                    self.model._dynamic_tied_weights_keys = []
                self.model._dynamic_tied_weights_keys += ['wf_unsqueeze_zero', 'wf_unsqueeze_neg_one']
            gptq_quantizer.save(
                self.model,
                args.output_dir,
Expand Down Expand Up @@ -76,7 +82,7 @@ def _prepare_gptq_dataset(self, examples: List[Dict[str, torch.LongTensor]], bat
@torch.inference_mode()
def _get_quant_dataset(self, *args, **kwargs):
args = self.args
assert args.quant_method in {'awq', 'gptq'}
assert args.quant_method in {'awq', 'gptq', 'gptq_v2'}
template = self.template
n_samples = args.quant_n_samples
block_size = args.max_length
@@ -96,7 +102,7 @@ def _get_quant_dataset(self, *args, **kwargs):
                inputs = template.encode(data)
            except MaxLengthError:
                continue
            if is_multimodal and args.quant_method == 'gptq':
            if is_multimodal and args.quant_method in {'gptq', 'gptq_v2'}:
Contributor review comment (medium): The condition args.quant_method in {'gptq', 'gptq_v2'} is repeated multiple times within this function (here, and on lines 110 and 118). To improve code clarity and future maintainability, consider defining a boolean variable at the beginning of the function and reusing it. For example:

is_gptq_family = args.quant_method in {'gptq', 'gptq_v2'}

You could then use if is_multimodal and is_gptq_family: in the conditional checks.

                inputs.pop('labels', None)
                samples.append(inputs)
            else:
            if i == n_samples:
                break
        prog_bar.close()
@@ -107,15 +113,15 @@ def _get_quant_dataset(self, *args, **kwargs):
        if is_multimodal and args.quant_method == 'gptq':
        if is_multimodal and args.quant_method in {'gptq', 'gptq_v2'}:
            return samples
        # now concatenate all samples and split according to block size
        n_split = max(len(samples) // block_size, 1)
        logger.info(f'Split into {n_split} blocks')
        res = []
        for i in range(n_split):
            input_ids = samples[i * block_size:(i + 1) * block_size]
            if args.quant_method == 'gptq':
            if args.quant_method in {'gptq', 'gptq_v2'}:
                res.append({'input_ids': input_ids})
            else:
                res.append(torch.tensor(input_ids)[None])
@@ -226,6 +232,29 @@ def get_modules_in_block_to_quantize(model, block_name: str):
        res[experts_idx:experts_idx] = experts.values()
        return res

    @contextmanager
    def _patch_gptq_block(self, model, block_name_to_quantize):
        if version.parse(transformers.__version__) < version.parse('4.54'):
            yield
            return
        # compat transformers>=4.54
        blocks = deep_getattr(model, block_name_to_quantize)
        hooks = []

        def _to_tuple(module, input, output):
            # Newer transformers decoder blocks may return a bare tensor; wrap it in a tuple
            # so that downstream code that indexes the block output (e.g. optimum's GPTQ quantizer) keeps working.
            if not isinstance(output, (list, tuple)):
                output = (output, )
            return output

        for block in blocks:
            hooks.append(block.register_forward_hook(_to_tuple))

        try:
            yield
        finally:
            for hook in hooks:
                hook.remove()
    def gptq_model_quantize(self, v2: bool = False):
        from optimum.gptq import GPTQQuantizer
        args = self.args
@@ -247,7 +276,8 @@ def gptq_model_quantize(self, v2: bool = False):
        logger.info('Start quantizing the model...')
        logger.warning('The process of packing the model takes a long time and there is no progress bar. '
                       'Please be patient and wait...')
        gptq_quantizer.quantize_model(self.model, self.tokenizer)
        with self._patch_gptq_block(self.model, block_name_to_quantize):
            gptq_quantizer.quantize_model(self.model, self.tokenizer)
        self.model.config.quantization_config.pop('dataset', None)
        return gptq_quantizer
