2 changes: 2 additions & 0 deletions docs/source/Customization/自定义数据集.md
@@ -141,6 +141,8 @@ alpaca format:
```
**Multi-label Task**
```jsonl
{"messages": [{"role": "user", "content": "<sentence>"}], "label": []}
{"messages": [{"role": "user", "content": "<sentence>"}], "label": [0, 2]}
{"messages": [{"role": "user", "content": "<sentence>"}], "label": [1, 3, 5]}
```

308 changes: 157 additions & 151 deletions docs/source/Instruction/命令行参数.md

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions docs/source/Megatron-SWIFT/命令行参数.md
@@ -240,8 +240,8 @@ lora training:
- ref_adapter_load: Same meaning as in DPO.
- beta: Parameter controlling the degree of deviation from the ref_model. A higher beta means less deviation from the ref_model. Default is `0.1`.
- loss_type: Default is 'kto'. For the available values, refer to the [TRL documentation](https://huggingface.co/docs/trl/main/en/kto_trainer#trl.KTOConfig.loss_type).
- desirable_weight: Offsets the impact of an imbalance in the number of desirable and undesirable pairs; the desirable loss is weighted by this coefficient. Default is `1.`.
- undesirable_weight: Offsets the impact of an imbalance in the number of desirable and undesirable pairs; the undesirable loss is weighted by this coefficient. Default is `1.`.
- desirable_weight: Offsets the impact of an imbalance in the number of desirable and undesirable examples; the desirable loss is weighted by this coefficient. Default is `1.`.
- undesirable_weight: Offsets the impact of an imbalance in the number of desirable and undesirable examples; the undesirable loss is weighted by this coefficient. Default is `1.`.
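
As a rough illustration of how these two weights can offset an imbalanced dataset, here is a minimal sketch in plain Python. The sample counts and the simple 1:1 balancing heuristic are assumptions for illustration only, not an ms-swift recipe; see the TRL documentation linked above for its own recommendations.

```python
# Hypothetical label counts in a KTO dataset (illustrative only).
n_desirable = 2_000
n_undesirable = 8_000

# Simple heuristic: scale the two loss terms so their weighted counts match.
desirable_weight = 1.0
undesirable_weight = desirable_weight * n_desirable / n_undesirable  # 0.25

# 2000 * 1.0 == 8000 * 0.25, so neither side dominates the loss purely by count.
print(desirable_weight, undesirable_weight)
```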

## Training Parameters

@@ -261,7 +261,7 @@ Megatron training parameters are inherited from Megatron parameters and basic pa
- Note: A streaming dataset can skip the preprocessing wait by overlapping preprocessing with training. Preprocessing for streaming datasets is performed only on rank 0 and is synchronized to the other processes via data distribution, which is usually less efficient than the data-sharding read approach used by non-streaming datasets. When the training world_size is large, preprocessing and data distribution become the training bottleneck.
- lazy_tokenize: Default is False. If set to False, all dataset samples are tokenized before training (this avoids errors surfacing during training); if set to True, the dataset is tokenized during training (this saves memory).
- cached_dataset: Use cached datasets (produced with the `swift export --to_cached_dataset true ...` command) during training to avoid spending GPU time on tokenization when training on large datasets. Default is `[]`.
- Note: cached_dataset supports `--packing` but does not support `--lazy_tokenize` or `--streaming`.
- Note: cached_dataset supports `--packing` but does not support `--lazy_tokenize` or `--streaming`. cached_dataset is currently not supported with CP.
- max_epochs: Forces training to exit once `max_epochs` is reached, then validates and saves the weights. This parameter is very useful when using a streaming dataset. Default is None.
- Note: If you use a non-streaming dataset, this parameter will automatically calculate train_iters for you, so you do not need to pass `train_iters` manually.
- enable_dft_loss: Whether to use the [DFT](https://arxiv.org/abs/2508.05629) (Dynamic Fine-Tuning) loss in SFT training. Default is False.
2 changes: 2 additions & 0 deletions docs/source_en/Customization/Custom-dataset.md
@@ -146,6 +146,8 @@ If `seq_kd` is enabled, the final round of the 'assistant' part is not required
**Multi-label Task**:

```jsonl
{"messages": [{"role": "user", "content": "<sentence>"}], "label": []}
{"messages": [{"role": "user", "content": "<sentence>"}], "label": [0, 2]}
{"messages": [{"role": "user", "content": "<sentence>"}], "label": [1, 3, 5]}
```
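
As a minimal illustration of how such label lists are typically consumed, the sketch below reads the jsonl shown above and turns each `label` list into a multi-hot vector. It uses only the standard library; the file name and `num_labels` are assumptions for illustration, not ms-swift internals.

```python
import json

num_labels = 6  # assumed total number of classes, for this illustration only

# Read a jsonl file in the format shown above (hypothetical file name).
with open("multi_label.jsonl", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        sentence = sample["messages"][0]["content"]
        labels = set(sample["label"])
        multi_hot = [1 if i in labels else 0 for i in range(num_labels)]
        print(sentence, multi_hot)  # e.g. label [1, 3, 5] -> [0, 1, 0, 1, 0, 1]
```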

490 changes: 250 additions & 240 deletions docs/source_en/Instruction/Command-line-parameters.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -276,7 +276,7 @@ Megatron training parameters are inherited from Megatron parameters and basic pa
- Note: Streaming datasets can skip preprocessing wait time by overlapping preprocessing with training. Preprocessing for streaming datasets is performed only on rank 0 and then synchronized to other processes via data distribution. This is generally less efficient than the data sharding approach used in non-streaming datasets. When the training world_size is large, preprocessing and data distribution can become a training bottleneck.
- lazy_tokenize: Default is False. If this parameter is set to False, all dataset samples are tokenized before training (this avoids errors during training); if set to True, tokenization occurs during training (this saves memory).
- cached_dataset: Use a cached dataset (generated with `swift export --to_cached_dataset true ...`) during training to avoid GPU time spent on tokenizing large datasets. Default: `[]`.
- Note: cached_dataset supports `--packing` but does not support `--lazy_tokenize` or `--streaming`.
- Note: cached_dataset supports `--packing` but does not support `--lazy_tokenize` or `--streaming`. cached_dataset is currently not supported with CP.
- max_epochs: Forces the training to exit after reaching `max_epochs`, and performs validation and saving of the model weights. This parameter is especially useful when using a streaming dataset. Default is None.
- Note: If you use a non-streaming dataset, this parameter will automatically calculate train_iters for you, so there is no need to pass `train_iters` manually (see the sketch after this list).
- enable_dft_loss: Whether to use [DFT](https://arxiv.org/abs/2508.05629) (Dynamic Fine-Tuning) loss in SFT training, default is False.
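
For intuition, here is a rough sketch of how `train_iters` follows from the dataset size, `max_epochs`, and the global batch size. The numbers and the exact rounding are illustrative assumptions; Megatron-SWIFT's internal computation may differ.

```python
import math

# Hypothetical values, for illustration only.
num_train_samples = 100_000
global_batch_size = 256
max_epochs = 3

# A non-streaming dataset has a known length, so the trainer can derive
# the iteration count instead of requiring --train_iters to be passed.
train_iters = math.ceil(num_train_samples * max_epochs / global_batch_size)
print(train_iters)  # 1172
```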
2 changes: 1 addition & 1 deletion examples/infer/demo_bert.py
@@ -1,5 +1,5 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
# demo_seq_cls: https://github.com/modelscope/ms-swift/blob/main/examples/train/seq_cls/qwen2_vl/infer.py
# demo_seq_cls: https://github.com/modelscope/ms-swift/blob/main/examples/train/seq_cls/qwen2_5_omni/infer.py
import os
from typing import List

2 changes: 2 additions & 0 deletions examples/megatron/lora/qwen3_235b.sh
@@ -1,4 +1,6 @@
# 8 * 80GiB, 3.2s/it
# If you're doing full-parameter training, you'll need 64 × 80 GiB of GPU memory

PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
2 changes: 2 additions & 0 deletions examples/models/qwen3_vl/mcore.sh
@@ -1,4 +1,6 @@
# 8 * 80GiB; 45min
# If you're doing full-parameter training, you'll need 64 × 80 GiB of GPU memory

PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
IMAGE_MAX_TOKEN_NUM=1024 \