1 change: 0 additions & 1 deletion docs/source/Customization/自定义模型.md
@@ -4,7 +4,6 @@ The models built into ms-swift can be used directly by specifying model_id or model_path

> [!TIP]
> When using `swift sft` to fine-tune a base model into a chat model with LoRA, for example fine-tuning Llama3.2-1B into a chat model, you may sometimes need to set the template manually. Add the `--template default` parameter to avoid the situation where the base model fails to stop properly because it has never seen the special characters in the conversation template.

## Model Registration

Please refer to the example code in [examples](https://github.com/modelscope/swift/blob/main/examples/custom). You can parse the registered content by specifying `--custom_register_path xxx.py`.
4 changes: 2 additions & 2 deletions docs/source/Instruction/命令行参数.md
@@ -37,7 +37,7 @@
- strict: If True, an error is raised as soon as any row of the dataset has a problem; otherwise, faulty rows are discarded. Defaults to False
- 🔥model_name: Used only for self-cognition tasks; pass the model's Chinese and English names, separated by a space
- 🔥model_author: Used only for self-cognition tasks; pass the model author's Chinese and English names, separated by a space
-- custom_dataset_info: Register a simple custom dataset; refer to [Adding a Dataset](../Customization/新增数据集.md)
+- custom_dataset_info: Register a simple custom dataset; refer to [Custom Dataset](../Customization/自定义数据集.md)

### Template Parameters
- 🔥template: Conversation template type; defaults to the template type corresponding to the model. `swift pt` converts the conversation template into a generation template
@@ -46,7 +46,7 @@
- truncation_strategy: How to handle over-length inputs; supports `delete` and `left`, meaning drop the sample or truncate from the left. Defaults to left
- 🔥max_pixels: Maximum number of pixels (H\*W) for image preprocessing in multimodal models; no scaling by default.
- tools_prompt: Format for converting the tool list into the system field during agent training; refer to [Agent Training](./智能体的支持.md). Defaults to 'react_en'
-- loss_scale: How to weight the loss of tokens added for training. Defaults to `'default'`, meaning all responses (including history) count toward the cross-entropy loss with weight 1. For details, see [Pluginization](../Customization/插件.md) and [Agent Training](./智能体的支持.md)
+- loss_scale: How to weight the loss of tokens added for training. Defaults to `'default'`, meaning all responses (including history) count toward the cross-entropy loss with weight 1. For details, see [Pluginization](../Customization/插件化.md) and [Agent Training](./智能体的支持.md)
- sequence_parallel_size: Degree of sequence parallelism. Refer to the [example](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel/train.sh)
- use_chat_template: Whether to use the chat template or the generation template; defaults to `True`. `swift pt` automatically switches to the generation template
- template_backend: Use swift or jinja for inference. If jinja is used, transformers' `apply_chat_template` is applied. Defaults to swift
4 changes: 2 additions & 2 deletions docs/source/Instruction/智能体的支持.md
@@ -237,12 +237,12 @@ To improve agent training results, SWIFT provides the following techniques:

The Thought and Final Answer parts have a weight of 1, the Action and Action Input parts have a weight of 2, the Observation: field itself has a weight of 2, and the actual API call result after Observation: has a weight of 0

-For the specific loss_scale plugin design, please refer to the [Plugin](../Customization/插件.md) section of the documentation.
+For the specific loss_scale plugin design, please refer to the [Pluginization](../Customization/插件化.md) documentation.


### tools(--tools_prompt)

-The tools part is the format of the assembled system field. Besides react_en/react_zh/toolbench introduced above, the glm4 format is also supported. Users can also define their own tools_prompt format; likewise, refer to the [Plugin](../Customization/插件.md) section of the documentation.
+The tools part is the format of the assembled system field. Besides react_en/react_zh/toolbench introduced above, the glm4 format is also supported. Users can also define their own tools_prompt format; likewise, refer to the [Pluginization](../Customization/插件化.md) documentation.

For a complete agent training script, refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/train/agent/train.sh).

2 changes: 1 addition & 1 deletion docs/source/Instruction/预训练及微调.md
@@ -2,7 +2,7 @@

Since pre-training and fine-tuning are quite similar, they are introduced together in this document.

-For the data format requirements of pre-training and fine-tuning, refer to the [Adding a Dataset](../Customization/新增数据集.md) section.
+For the data format requirements of pre-training and fine-tuning, refer to the [Custom Dataset](../Customization/自定义数据集.md) section.

In terms of data volume, continued pre-training may require anywhere from hundreds of thousands to millions of rows, while pre-training from scratch needs an enormous number of GPUs and amount of data and is beyond the scope of this document.
Fine-tuning requires anywhere from a few thousand to a million rows; for even less data, consider using RAG instead.
4 changes: 2 additions & 2 deletions docs/source/index.rst
@@ -35,8 +35,8 @@ Swift DOCUMENTATION
:maxdepth: 2
:caption: Customization

-Customization/新增数据集.md
-Customization/新增模型.md
+Customization/自定义数据集.md
+Customization/自定义模型.md
Customization/插件化.md

Indices and tables
112 changes: 112 additions & 0 deletions docs/source_en/Customization/Custom-dataset.md
@@ -0,0 +1,112 @@
# Custom Dataset

The ms-swift standard dataset format accepts the following keys: 'messages', 'rejected_response', 'label', 'images', 'videos', 'audios', 'tools', and 'objects'. Of these, 'messages' is required. 'rejected_response' is used for RLHF training such as DPO, 'label' is used for KTO training, while 'images', 'videos', and 'audios' store paths or URLs of multimodal data. 'tools' is for agent tasks, and 'objects' is for grounding tasks.

There are three core preprocessors in ms-swift: `MessagesPreprocessor`, `AlpacaPreprocessor`, and `ResponsePreprocessor`. `MessagesPreprocessor` converts datasets in messages and sharegpt formats to the standard format, `AlpacaPreprocessor` converts alpaca format datasets, and `ResponsePreprocessor` converts datasets in query/response format. `AutoPreprocessor` automatically selects the appropriate preprocessor for processing. Typically, `AutoPreprocessor` can handle over 90% of cases.

The following four formats will all be converted to the messages field in the ms-swift standard format by `AutoPreprocessor`:

Messages format:
```jsonl
{"messages": [{"role": "system", "content": "<system>"}, {"role": "user", "content": "<query1>"}, {"role": "assistant", "content": "<response1>"}, {"role": "user", "content": "<query2>"}, {"role": "assistant", "content": "<response2>"}]}
```

ShareGPT format:
```jsonl
{"system": "<system>", "conversation": [{"human": "<query1>", "assistant": "<response1>"}, {"human": "<query2>", "assistant": "<response2>"}]}
```

Alpaca format:
```jsonl
{"system": "<system>", "instruction": "<query-inst>", "input": "<query-input>", "output": "<response>"}
```

Query-Response format:
```jsonl
{"system": "<system>", "query": "<query2>", "response": "<response2>", "history": [["<query1>", "<response1>"]]}
```
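
For illustration, the alpaca sample above would be converted into a standard messages sample roughly like the following; the exact rule for joining instruction and input is an implementation detail of `AlpacaPreprocessor`, and the newline used here is an assumption:

```jsonl
{"messages": [{"role": "system", "content": "<system>"}, {"role": "user", "content": "<query-inst>\n<query-input>"}, {"role": "assistant", "content": "<response>"}]}
```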

There are three ways to integrate a custom dataset, with increasing control over preprocessing functions:
1. **Recommended**: Directly use `--dataset <dataset_id_or_path>` to integrate with AutoPreprocessor. This supports csv, json, jsonl, txt, and folder formats.
2. Write a dataset_info.json file. You can refer to the built-in [dataset_info.json](https://github.com/modelscope/ms-swift/blob/main/swift/llm/dataset/data/dataset_info.json) in ms-swift. One of ms_dataset_id/hf_dataset_id/dataset_path is required, and column name conversion can be handled through the `columns` field. Format conversion uses AutoPreprocessor. Use `--custom_dataset_info xxx.json` to parse the JSON file.
3. Manually register the dataset, which offers the most flexible customization of preprocessing functions but is more complex. You can refer to examples in [examples](https://github.com/modelscope/swift/blob/main/examples/custom) by specifying `--custom_register_path xxx.py` to parse the registration contents.
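
As a minimal sketch of method 2, a `dataset_info.json` entry could look like the following; the dataset id and the original column names are hypothetical, and the authoritative key set is the one used in the built-in dataset_info.json linked above:

```json
[
    {
        "ms_dataset_id": "my-org/my-sft-dataset",
        "columns": {
            "instruction": "query",
            "output": "response"
        }
    }
]
```

Passing `--custom_dataset_info xxx.json` then registers the dataset, and the renamed query/response columns are converted to the standard messages format by `AutoPreprocessor`.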

## Recommended Dataset Format

Here is the recommended dataset format for ms-swift:

### Pre-training

```jsonl
{"messages": [{"role": "assistant", "content": "I love music"}]}
{"messages": [{"role": "assistant", "content": "Coach, I want to play basketball"}]}
{"messages": [{"role": "assistant", "content": "Which is more authoritative, tomato and egg rice or the third fresh stir-fry?"}]}
```

### Supervised Fine-tuning

```jsonl
{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}, {"role": "assistant", "content": "Tomorrow's weather will be sunny"}]}
{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}]}
```

### RLHF

#### DPO/ORPO/CPO/SimPO

```jsonl
{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}, {"role": "assistant", "content": "Tomorrow's weather will be sunny"}], "rejected_response": "I don't know"}
{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}], "rejected_response": "I don't know"}
```

#### KTO

```jsonl
{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}, {"role": "assistant", "content": "I don't know"}], "label": false}
{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}], "label": true}
```

### Multimodal

For multimodal datasets, the format is the same as above. The difference is that it includes the keys `images`, `videos`, and `audios`, which represent multimodal resources:
```jsonl
{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "<image> What is in the image? <video> What is in the video?"}, {"role": "assistant", "content": "An elephant and a lion"}], "images": ["/xxx/x.jpg"], "videos": ["/xxx/x.mp4"]}
```
The `<image>`, `<video>`, and `<audio>` tags indicate where to insert images/videos/audios.

#### Grounding

For grounding (object detection) tasks, SWIFT supports two methods:
1. Use the same multimodal dataset format described above and add the model's special tokens to the data, for example:
```jsonl
{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "<image> Find a <ref> elephant </ref>"}, {"role": "assistant", "content": "<box>(200,450),(500,800)</box>"}], "images": ["/xxx/x.jpg"]}
```
With this type of data, please note:
- Grounding tasks often require model-specific special tokens. Determine which model you are using, read its paper to identify the special tokens it uses for grounding, and assemble the data accordingly.
- The bbox coordinates may be actual image coordinates or thousandth-scale coordinates. Confirm which convention applies before assembling the data.
- Different models require different data formats; you need to rebuild the dataset if you switch models.

2. Use SWIFT's grounding data format:

```jsonl
# Object detection
{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "<image> Identify <bbox>"}, {"role": "assistant", "content": "<ref-object>"}], "images": ["/coco2014/train2014/COCO_train2014_000000001507.jpg"], "objects": "[{\"caption\": \"guy in red\", \"bbox\": [138, 136, 235, 359], \"bbox_type\": \"real\", \"image\": 0}]"}
# Grounding to multiple bboxes
{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "<image> Find <ref-object>"}, {"role": "assistant", "content": "<bbox>"}], "images": ["/coco2014/train2014/COCO_train2014_000000001507.jpg"], "objects": "[{\"caption\": \"guy in red\", \"bbox\": [[138, 136, 235, 359], [1,2,3,4]], \"bbox_type\": \"real\", \"image\": 0}]"}
```

This format adds the objects field, which includes:
- caption: description of the object corresponding to the bbox
- bbox: coordinates, suggested as four integers (not floats) representing x_min, y_min, x_max, y_max; a list of such quadruples can be given for multiple boxes, as in the second example above
- bbox_type: bbox type, currently supporting three types: real/norm_1000/norm_1, representing actual pixel coordinates/thousandth ratio coordinates/normalized ratio coordinates
- image: index of the image corresponding to the bbox, starting from 0
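
For example, the same grounding query expressed in thousandth-scale coordinates would set `bbox_type` to `norm_1000`; the coordinate values below are illustrative and not converted from the real image:

```jsonl
{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "<image> Find <ref-object>"}, {"role": "assistant", "content": "<bbox>"}], "images": ["/coco2014/train2014/COCO_train2014_000000001507.jpg"], "objects": "[{\"caption\": \"guy in red\", \"bbox\": [215, 283, 367, 748], \"bbox_type\": \"norm_1000\", \"image\": 0}]"}
```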

### Text-to-Image Format

```jsonl
{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Draw me an apple"}, {"role": "assistant", "content": "<image>"}], "images": ["/xxx/x.jpg"]}
```

### Agent Format

Refer to the [Agent documentation](../Instruction/Agent-support.md) for the Agent format.
9 changes: 9 additions & 0 deletions docs/source_en/Customization/Custom-model.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# Custom Model

The models built into ms-swift can be used directly by specifying either `model_id` or `model_path`: `--model <model_id_or_path>`. ms-swift determines the `model_type` based on the suffix of `model_id/model_path` and the `config.json` file. Each `model_type` has a unique model structure, template, and loading method. Of course, you can also manually override these by passing `--model_type` and `--template`. You can check the supported `model_type` and templates in the [Supported Models and Datasets](../Instruction/Supported-models-and-datasets.md).

> [!TIP]
> When using `swift sft` to fine-tune a base model into a chat model using LoRA technology, for instance, fine-tuning Llama3.2-1B into a chat model, you may need to manually set the template. Adding the `--template default` parameter can help avoid issues where the base model fails to stop properly due to encountering special characters in the conversation template that it hasn't seen before.

## Model Registration

Please refer to the example code in [examples](https://github.com/modelscope/swift/blob/main/examples/custom). You can parse the registered content by specifying `--custom_register_path xxx.py`.