2 changes: 2 additions & 0 deletions docs/source/Customization/自定义数据集.md
@@ -141,6 +141,8 @@ alpaca format:
```
**Multi-label Task**
```jsonl
{"messages": [{"role": "user", "content": "<sentence>"}], "label": []}
{"messages": [{"role": "user", "content": "<sentence>"}], "label": [0, 2]}
{"messages": [{"role": "user", "content": "<sentence>"}], "label": [1, 3, 5]}
```

308 changes: 157 additions & 151 deletions docs/source/Instruction/命令行参数.md

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions docs/source/Megatron-SWIFT/命令行参数.md
@@ -240,8 +240,8 @@ lora training:
- ref_adapter_load: Same meaning as in DPO.
- beta: Parameter controlling the degree of deviation from the ref_model. A higher beta means less deviation from the ref_model. Default is `0.1`.
- loss_type: Default is 'kto'. For the available values, refer to the [TRL documentation](https://huggingface.co/docs/trl/main/en/kto_trainer#trl.KTOConfig.loss_type).
- desirable_weight: Offsets the impact of an imbalance in the number of desirable and undesirable pairs; the desirable loss is weighted by this coefficient. Default is `1.`.
- undesirable_weight: Offsets the impact of an imbalance in the number of desirable and undesirable pairs; the undesirable loss is weighted by this coefficient. Default is `1.`.
- desirable_weight: Offsets the impact of an imbalance in the number of desirable and undesirable examples; the desirable loss is weighted by this coefficient. Default is `1.`.
- undesirable_weight: Offsets the impact of an imbalance in the number of desirable and undesirable examples; the undesirable loss is weighted by this coefficient. Default is `1.`.
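
As a rough illustration of how these two weights can offset an imbalanced dataset, here is a minimal sketch in plain Python. The sample counts and the simple 1:1 balancing heuristic are assumptions for illustration only, not an ms-swift recipe; see the TRL documentation linked above for its own recommendations.

```python
# Hypothetical label counts in a KTO dataset (illustrative only).
n_desirable = 2_000
n_undesirable = 8_000

# Simple heuristic: scale the two loss terms so their weighted counts match.
desirable_weight = 1.0
undesirable_weight = desirable_weight * n_desirable / n_undesirable  # 0.25

# 2000 * 1.0 == 8000 * 0.25, so neither side dominates the loss purely by count.
print(desirable_weight, undesirable_weight)
```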

## Training Parameters

@@ -261,7 +261,7 @@ Megatron training parameters are inherited from Megatron parameters and basic pa
- Note: A streaming dataset can skip the preprocessing wait by overlapping preprocessing with training. Preprocessing for streaming datasets is performed only on rank 0 and is synchronized to the other processes via data distribution, which is usually less efficient than the data-sharding read approach used by non-streaming datasets. When the training world_size is large, preprocessing and data distribution become the training bottleneck.
- lazy_tokenize: Default is False. If set to False, all dataset samples are tokenized before training (this avoids errors surfacing during training); if set to True, the dataset is tokenized during training (this saves memory).
- cached_dataset: Use cached datasets (produced with the `swift export --to_cached_dataset true ...` command) during training to avoid spending GPU time on tokenization when training on large datasets. Default is `[]`.
- Note: cached_dataset supports `--packing` but does not support `--lazy_tokenize` or `--streaming`.
- Note: cached_dataset supports `--packing` but does not support `--lazy_tokenize` or `--streaming`. cached_dataset is currently not supported with CP.
- max_epochs: Forces training to exit once `max_epochs` is reached, then validates and saves the weights. This parameter is very useful when using a streaming dataset. Default is None.
- Note: If you use a non-streaming dataset, this parameter will automatically calculate train_iters for you, so you do not need to pass `train_iters` manually.
- enable_dft_loss: Whether to use the [DFT](https://arxiv.org/abs/2508.05629) (Dynamic Fine-Tuning) loss in SFT training. Default is False.
2 changes: 2 additions & 0 deletions docs/source_en/Customization/Custom-dataset.md
@@ -146,6 +146,8 @@ If `seq_kd` is enabled, the final round of the 'assistant' part is not required
**Multi-label Task**:

```jsonl
{"messages": [{"role": "user", "content": "<sentence>"}], "label": []}
{"messages": [{"role": "user", "content": "<sentence>"}], "label": [0, 2]}
{"messages": [{"role": "user", "content": "<sentence>"}], "label": [1, 3, 5]}
```
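
As a minimal illustration of how such label lists are typically consumed, the sketch below reads the jsonl shown above and turns each `label` list into a multi-hot vector. It uses only the standard library; the file name and `num_labels` are assumptions for illustration, not ms-swift internals.

```python
import json

num_labels = 6  # assumed total number of classes, for this illustration only

# Read a jsonl file in the format shown above (hypothetical file name).
with open("multi_label.jsonl", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        sentence = sample["messages"][0]["content"]
        labels = set(sample["label"])
        multi_hot = [1 if i in labels else 0 for i in range(num_labels)]
        print(sentence, multi_hot)  # e.g. label [1, 3, 5] -> [0, 1, 0, 1, 0, 1]
```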

490 changes: 250 additions & 240 deletions docs/source_en/Instruction/Command-line-parameters.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -276,7 +276,7 @@ Megatron training parameters are inherited from Megatron parameters and basic pa
- Note: Streaming datasets can skip preprocessing wait time by overlapping preprocessing with training. Preprocessing for streaming datasets is performed only on rank 0 and then synchronized to other processes via data distribution. This is generally less efficient than the data sharding approach used in non-streaming datasets. When the training world_size is large, preprocessing and data distribution can become a training bottleneck.
- lazy_tokenize: Default is False. If this parameter is set to False, all dataset samples are tokenized before training (this avoids errors during training); if set to True, tokenization occurs during training (this saves memory).
- cached_dataset: Use a cached dataset (generated with `swift export --to_cached_dataset true ...`) during training to avoid GPU time spent on tokenizing large datasets. Default: `[]`.
- Note: cached_dataset supports `--packing` but does not support `--lazy_tokenize` or `--streaming`.
- Note: cached_dataset supports `--packing` but does not support `--lazy_tokenize` or `--streaming`. cached_dataset is currently not supported with CP.
- max_epochs: Forces the training to exit after reaching `max_epochs`, and performs validation and saving of the model weights. This parameter is especially useful when using a streaming dataset. Default is None.
- Note: If you use a non-streaming dataset, this parameter will automatically calculate train_iters for you, so there is no need to pass `train_iters` manually (see the sketch after this list).
- enable_dft_loss: Whether to use [DFT](https://arxiv.org/abs/2508.05629) (Dynamic Fine-Tuning) loss in SFT training, default is False.
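
For intuition, here is a rough sketch of how `train_iters` follows from the dataset size, `max_epochs`, and the global batch size. The numbers and the exact rounding are illustrative assumptions; Megatron-SWIFT's internal computation may differ.

```python
import math

# Hypothetical values, for illustration only.
num_train_samples = 100_000
global_batch_size = 256
max_epochs = 3

# A non-streaming dataset has a known length, so the trainer can derive
# the iteration count instead of requiring --train_iters to be passed.
train_iters = math.ceil(num_train_samples * max_epochs / global_batch_size)
print(train_iters)  # 1172
```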
2 changes: 1 addition & 1 deletion examples/infer/demo_bert.py
@@ -1,5 +1,5 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
# demo_seq_cls: https://github.com/modelscope/ms-swift/blob/main/examples/train/seq_cls/qwen2_vl/infer.py
# demo_seq_cls: https://github.com/modelscope/ms-swift/blob/main/examples/train/seq_cls/qwen2_5_omni/infer.py
import os
from typing import List

2 changes: 2 additions & 0 deletions examples/megatron/lora/qwen3_235b.sh
@@ -1,4 +1,6 @@
# 8 * 80GiB, 3.2s/it
# If you're doing full-parameter training, you'll need 64 × 80 GiB of GPU memory

PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
2 changes: 2 additions & 0 deletions examples/models/qwen3_vl/mcore.sh
@@ -1,4 +1,6 @@
# 8 * 80GiB; 45min
# If you're doing full-parameter training, you'll need 64 × 80 GiB of GPU memory

PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=8 \
IMAGE_MAX_TOKEN_NUM=1024 \