1 change: 1 addition & 0 deletions docs/source/Instruction/命令行参数.md
@@ -413,6 +413,7 @@ Vera uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`,
- 🔥create_checkpoint_symlink: Creates additional checkpoint symlinks to make it easier to write automated training scripts. The symlink paths for best_model and last_model are f'{output_dir}/best' and f'{output_dir}/last' respectively.
- 🔥packing: Whether to use sequence packing to improve computational efficiency (better load balancing across nodes and processes, higher GPU utilization) and stabilize GPU memory usage. Defaults to False. Currently supported for CPT/SFT/DPO/KTO/GKD.
- Note: When using packing, combine it with `--attn_impl flash_attn` and make sure "transformers>=4.44". For details, see [this PR](https://github.com/huggingface/transformers/pull/31629).
- Note: **Packing reduces the number of samples in the dataset; adjust the gradient accumulation steps and learning rate accordingly**.
- packing_length: The length used for packing. Defaults to None, in which case it is set to max_length.
- lazy_tokenize: Whether to use lazy tokenization. If set to False, all dataset samples are tokenized before training (for multimodal models, this includes reading images from disk). Defaults to None: False for LLM training and True for MLLM training, to save memory.
- Note: If you want to perform image data augmentation, set lazy_tokenize (or streaming) to True and modify the encode method of the Template class.
1 change: 1 addition & 0 deletions docs/source/Megatron-SWIFT/命令行参数.md
@@ -271,6 +271,7 @@ Megatron training parameters inherit from the Megatron parameters and basic parameters (**shared with ms-swift
- gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example: `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to None. This parameter only takes effect for `vit_gradient_checkpointing`.
- 🔥packing: Whether to use sequence packing to improve computational efficiency (better load balancing across nodes and processes and higher GPU utilization, at the cost of additional preprocessing time) and stabilize GPU memory usage. Defaults to False. Currently supported for CPT/SFT/DPO/KTO/RM.
- Note: **Different sequences within the same batch remain mutually invisible**, except for Qwen3-Next.
- Note: **Packing reduces the number of samples in the dataset; adjust the gradient accumulation steps and learning rate accordingly**.
- packing_length: The length used for packing. Defaults to None, in which case it is set to max_length.
- streaming: Read and process the dataset in streaming mode. Defaults to False.
- Note: Since the length of a streaming dataset cannot be determined, the `--train_iters` parameter must be set. Also set the `max_epochs` parameter to ensure training exits after the specified number of epochs, and to validate and save the model weights.
1 change: 1 addition & 0 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -420,6 +420,7 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine
- 🔥create_checkpoint_symlink: Creates additional checkpoint symlinks to facilitate writing automated training scripts. The symlink paths for `best_model` and `last_model` are `f'{output_dir}/best'` and `f'{output_dir}/last'` respectively.
- 🔥packing: Whether to use sequence packing to improve computational efficiency (better load balancing across nodes and processes, higher GPU utilization) and stabilize GPU memory usage. Default is False. Currently supported in CPT/SFT/DPO/KTO/GKD.
- Note: When using packing, please combine it with `--attn_impl flash_attn` and ensure "transformers>=4.44". For details, see [this PR](https://github.com/huggingface/transformers/pull/31629).
- Note: **Packing reduces the number of samples in the dataset; please adjust the gradient accumulation steps and learning rate accordingly**.
- packing_length: the length to use for packing. Defaults to None, in which case it is set to max_length.
- lazy_tokenize: Whether to use lazy tokenization. If set to `False`, all dataset samples will be tokenized (and for multimodal models, images will be loaded from disk) before training begins. Default is `None`: in LLM training, it defaults to `False`; in MLLM training, it defaults to `True` to save memory.
- Note: If you want to perform image data augmentation, you need to set `lazy_tokenize` (or `streaming`) to True and modify the `encode` method in the Template class.
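For illustration, here is a minimal sketch of such a modification, assuming a torchvision-based augmentation and that `Template` is importable from `swift.llm`; the subclass name, the `images` attribute, and the exact `encode` signature are assumptions and may differ from the actual ms-swift API:

```python
# Hedged sketch: apply image augmentation per sample inside a custom Template subclass.
# Requires lazy_tokenize (or streaming) so that encode() runs during training rather than once up front.
from torchvision import transforms

from swift.llm import Template  # assumption: Template is exposed at this import path

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])


class AugmentedTemplate(Template):

    def encode(self, inputs, *args, **kwargs):
        # Assumption: `inputs.images` holds the PIL images before tokenization.
        images = getattr(inputs, 'images', None)
        if images:
            inputs.images = [augment(img) for img in images]
        return super().encode(inputs, *args, **kwargs)
```

You would then swap this subclass in wherever your training script constructs the template.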
3 changes: 2 additions & 1 deletion docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -284,7 +284,8 @@ Megatron training parameters are inherited from Megatron parameters and basic pa
- vit_gradient_checkpointing: Whether to enable gradient checkpointing for the ViT (Vision Transformer) component during multimodal model training. Defaults to `True`. (**The ViT implementation in Megatron-SWIFT uses the Hugging Face `transformers` library.**)
- gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example: set `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to `None`. This parameter only takes effect when `vit_gradient_checkpointing` is enabled.
- 🔥packing: Whether to use sequence packing to improve computational efficiency (achieving better load balancing across nodes and processes, and higher GPU utilization), at the cost of additional preprocessing time, while also stabilizing GPU memory usage. Defaults to `False`. Currently supported for CPT, SFT, DPO, KTO and RM.
- **Note**: **Sequences within the same batch remain mutually invisible**, except for Qwen3-Next.
- Note: **Sequences within the same batch remain mutually invisible**, except for Qwen3-Next.
- Note: **Packing reduces the number of samples in the dataset; please adjust the gradient accumulation steps and learning rate accordingly**.
- streaming: Stream data loading and processing, default is False.
- Note: Since the length of a streaming dataset cannot be determined, the `--train_iters` parameter must be set. Also set the `max_epochs` parameter to ensure training exits after the specified number of epochs, and to validate and save the model weights accordingly. A rough way of estimating `--train_iters` is sketched below.
- Note: Streaming datasets can skip preprocessing wait time by overlapping preprocessing with training. Preprocessing for streaming datasets is performed only on rank 0 and then synchronized to other processes via data distribution. **This is generally less efficient than the data sharding approach used in non-streaming datasets.** When the training world_size is large, preprocessing and data distribution can become a training bottleneck.
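As a rough aid for choosing `--train_iters`, the small sketch below assumes that one global batch is consumed per iteration and that you can supply your own estimate of the dataset size (all numbers are hypothetical):

```python
# Hypothetical values: estimate --train_iters when a streaming dataset cannot report its length.
approx_num_samples = 200_000   # your own estimate of the dataset size
max_epochs = 3
global_batch_size = 64         # micro_batch_size * gradient_accumulation * data-parallel size

# One training iteration consumes one global batch, so:
train_iters = approx_num_samples * max_epochs // global_batch_size
print(f'--train_iters {train_iters}')
```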
2 changes: 2 additions & 0 deletions examples/export/quantize/awq.sh
@@ -1,3 +1,5 @@
pip install "transformers<4.52"
Contributor review comment (medium): To improve maintainability and help other developers, it would be beneficial to add a comment explaining why transformers<4.52 is required. This provides context for the version pinning.

Suggested change:
pip install "transformers<4.52"
pip install "transformers<4.52" # Pinned to avoid compatibility issues with AWQ in newer versions.

CUDA_VISIBLE_DEVICES=0 \
swift export \
--model Qwen/Qwen2.5-72B-Instruct \
3 changes: 2 additions & 1 deletion examples/export/quantize/gptq_v2.sh
@@ -1,3 +1,4 @@
# You need to install gptqmodel.
# OMP_NUM_THREADS=14 is required; please check this issue: https://github.com/AutoGPTQ/AutoGPTQ/issues/439
OMP_NUM_THREADS=14 \
CUDA_VISIBLE_DEVICES=0 \
@@ -10,4 +11,4 @@ swift export \
--max_length 2048 \
--quant_method gptq_v2 \
--quant_bits 4 \
--output_dir Qwen2.5-1.5B-Instruct-GPTQ-Int4
--output_dir Qwen2.5-1.5B-Instruct-GPTQ-V2-Int4
2 changes: 2 additions & 0 deletions examples/export/quantize/moe/awq.sh
@@ -1,3 +1,5 @@
pip install "transformers<4.52"
Contributor review comment (medium): Similar to the other awq.sh script, please add a comment explaining the reason for pinning the transformers version. This helps with future maintenance and understanding.

Suggested change:
pip install "transformers<4.52"
pip install "transformers<4.52" # Pinned to avoid compatibility issues with AWQ in newer versions.

CUDA_VISIBLE_DEVICES=0,1 \
swift export \
--model Qwen/Qwen3-30B-A3B \
2 changes: 1 addition & 1 deletion swift/llm/argument/base_args/quant_args.py
@@ -39,7 +39,7 @@ class QuantizeArguments:
    bnb_4bit_quant_storage: Optional[str] = None

    def get_quantization_config(self):
        if self.quant_method is None or self.quant_method in {'awq', 'gptq'}:
        if self.quant_method is None or self.quant_method in {'awq', 'gptq', 'gptq_v2'}:
            return None
        assert self.quant_method in {'bnb', 'hqq', 'eetq', 'quanto', 'fp8'}
        if self.quant_method != 'fp8' and self.quant_bits is None:
40 changes: 35 additions & 5 deletions swift/llm/export/quant.py
@@ -5,6 +5,8 @@

import torch
import torch.nn as nn
import transformers
from packaging import version
from tqdm import tqdm

from swift.llm import (ExportArguments, HfConfigFactory, MaxLengthError, ProcessorMixin, deep_getattr, load_dataset,
@@ -41,6 +43,10 @@ def quantize(self):
        elif args.quant_method in {'gptq', 'gptq_v2'}:
            self.template.model = self.model
            gptq_quantizer = self.gptq_model_quantize(v2=(args.quant_method == 'gptq_v2'))
            if args.quant_method == 'gptq_v2':
                if not getattr(self.model, '_dynamic_tied_weights_keys', None):
                    self.model._dynamic_tied_weights_keys = []
                self.model._dynamic_tied_weights_keys += ['wf_unsqueeze_zero', 'wf_unsqueeze_neg_one']
            gptq_quantizer.save(
                self.model,
                args.output_dir,
Expand Down Expand Up @@ -76,7 +82,7 @@ def _prepare_gptq_dataset(self, examples: List[Dict[str, torch.LongTensor]], bat
@torch.inference_mode()
def _get_quant_dataset(self, *args, **kwargs):
args = self.args
assert args.quant_method in {'awq', 'gptq'}
assert args.quant_method in {'awq', 'gptq', 'gptq_v2'}
template = self.template
n_samples = args.quant_n_samples
block_size = args.max_length
@@ -96,7 +102,7 @@ def _get_quant_dataset(self, *args, **kwargs):
                inputs = template.encode(data)
            except MaxLengthError:
                continue
            if is_multimodal and args.quant_method == 'gptq':
            if is_multimodal and args.quant_method in {'gptq', 'gptq_v2'}:
Contributor review comment (medium): The condition args.quant_method in {'gptq', 'gptq_v2'} is repeated multiple times within this function (here, and on lines 110 and 118). To improve code clarity and future maintainability, consider defining a boolean variable at the beginning of the function and reusing it. For example:

is_gptq_family = args.quant_method in {'gptq', 'gptq_v2'}

You could then use if is_multimodal and is_gptq_family: in the conditional checks.

                inputs.pop('labels', None)
                samples.append(inputs)
            else:
            if i == n_samples:
                break
        prog_bar.close()
@@ -107,15 +113,15 @@ def _get_quant_dataset(self, *args, **kwargs):
        if is_multimodal and args.quant_method == 'gptq':
        if is_multimodal and args.quant_method in {'gptq', 'gptq_v2'}:
            return samples
        # now concatenate all samples and split according to block size
        n_split = max(len(samples) // block_size, 1)
        logger.info(f'Split into {n_split} blocks')
        res = []
        for i in range(n_split):
            input_ids = samples[i * block_size:(i + 1) * block_size]
            if args.quant_method == 'gptq':
            if args.quant_method in {'gptq', 'gptq_v2'}:
                res.append({'input_ids': input_ids})
            else:
                res.append(torch.tensor(input_ids)[None])
@@ -226,6 +232,29 @@ def get_modules_in_block_to_quantize(model, block_name: str):
        res[experts_idx:experts_idx] = experts.values()
        return res

    @contextmanager
    def _patch_gptq_block(self, model, block_name_to_quantize):
        if version.parse(transformers.__version__) < version.parse('4.54'):
            yield
            return
        # compat transformers>=4.54
        blocks = deep_getattr(model, block_name_to_quantize)
        hooks = []

        def _to_tuple(module, input, output):
            # Newer transformers decoder blocks may return a bare tensor; wrap it in a tuple
            # so that downstream code that indexes the block output (e.g. optimum's GPTQ quantizer) keeps working.
            if not isinstance(output, (list, tuple)):
                output = (output, )
            return output

        for block in blocks:
            hooks.append(block.register_forward_hook(_to_tuple))

        try:
            yield
        finally:
            for hook in hooks:
                hook.remove()
    def gptq_model_quantize(self, v2: bool = False):
        from optimum.gptq import GPTQQuantizer
        args = self.args
@@ -247,7 +276,8 @@ def gptq_model_quantize(self, v2: bool = False):
        logger.info('Start quantizing the model...')
        logger.warning('The process of packing the model takes a long time and there is no progress bar. '
                       'Please be patient and wait...')
        gptq_quantizer.quantize_model(self.model, self.tokenizer)
        with self._patch_gptq_block(self.model, block_name_to_quantize):
            gptq_quantizer.quantize_model(self.model, self.tokenizer)
        self.model.config.quantization_config.pop('dataset', None)
        return gptq_quantizer
