diff --git "a/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\250\241\345\236\213.md" "b/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\250\241\345\236\213.md"
index 65d9b72775..75b88510a8 100644
--- "a/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\250\241\345\236\213.md"
+++ "b/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\250\241\345\236\213.md"
@@ -4,7 +4,6 @@ For ms-swift's built-in models, you can use them directly by specifying model_id or model_path
 
 > [!TIP]
 > When fine-tuning a base model into a chat model with LoRA via `swift sft` (for example, fine-tuning Llama3.2-1B into a chat model), you sometimes need to set the template manually. Add the `--template default` argument to keep the base model from failing to stop properly because it has never seen the special tokens in the chat template.
-
 ## Model Registration
 
 Please refer to the sample code in [examples](https://github.com/modelscope/swift/blob/main/examples/custom). You can parse the registered content by specifying `--custom_register_path xxx.py`.
diff --git "a/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" "b/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md"
index e4c47bae19..385fa17603 100644
--- "a/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md"
+++ "b/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md"
@@ -37,7 +37,7 @@
 - strict: if True, raise an error as soon as any row of the dataset has a problem; otherwise discard the bad rows. Defaults to False
 - 🔥model_name: only used for self-cognition tasks; pass the model's Chinese and English names, separated by a space
 - 🔥model_author: only used for self-cognition tasks; pass the model author's Chinese and English names, separated by a space
-- custom_dataset_info: register a simple custom dataset; see [Custom Dataset](../Customization/新增数据集.md)
+- custom_dataset_info: register a simple custom dataset; see [Custom Dataset](../Customization/自定义数据集.md)
 
 ### Template Arguments
 - 🔥template: chat template type; defaults to the template type matching the model. `swift pt` converts the chat template into a generation template
@@ -46,7 +46,7 @@
 - truncation_strategy: how to handle over-long inputs; supports `delete` and `left`, meaning dropping the sample or truncating from the left. Defaults to left
 - 🔥max_pixels: maximum number of pixels (H\*W) for image preprocessing in multimodal models; no scaling by default.
 - tools_prompt: the format for converting the tool list into the system field during agent training; see [Agent Training](./智能体的支持.md). Defaults to 'react_en'
-- loss_scale: how to weight the loss of added tokens during training. Defaults to `'default'`, meaning all responses (including history) are weighted 1 in the cross-entropy loss. For details, see [Pluginization](../Customization/插件.md) and [Agent Training](./智能体的支持.md)
+- loss_scale: how to weight the loss of added tokens during training. Defaults to `'default'`, meaning all responses (including history) are weighted 1 in the cross-entropy loss. For details, see [Pluginization](../Customization/插件化.md) and [Agent Training](./智能体的支持.md)
 - sequence_parallel_size: degree of sequence parallelism. See the [example](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel/train.sh)
 - use_chat_template: use the chat template or the generation template; defaults to `True`. `swift pt` automatically switches to the generation template
 - template_backend: use swift or jinja for inference. With jinja, transformers' `apply_chat_template` is used. Defaults to swift
diff --git "a/docs/source/Instruction/\346\231\272\350\203\275\344\275\223\347\232\204\346\224\257\346\214\201.md" "b/docs/source/Instruction/\346\231\272\350\203\275\344\275\223\347\232\204\346\224\257\346\214\201.md"
index 1eb1dabb93..0ad3f0ab64 100644
--- "a/docs/source/Instruction/\346\231\272\350\203\275\344\275\223\347\232\204\346\224\257\346\214\201.md"
+++ "b/docs/source/Instruction/\346\231\272\350\203\275\344\275\223\347\232\204\346\224\257\346\214\201.md"
@@ -237,12 +237,12 @@ To improve agent training, SWIFT provides the following techniques:
 
 The Thought and Final Answer parts are weighted 1, the Action and Action Input parts are weighted 2, the Observation: field itself is weighted 2, and the actual API call result after Observation: is weighted 0
 
-For the concrete loss_scale plugin design, please refer to the [Plugin](../Customization/插件.md) documentation.
+For the concrete loss_scale plugin design, please refer to the [Pluginization](../Customization/插件化.md) documentation.
 
 ### tools(--tools_prompt)
 
-The tools section is the format of the assembled system field. Besides the react_en/react_zh/toolbench formats introduced above, the glm4 format is also supported. Users can also define their own tools_prompt format; likewise, see the [Plugin](../Customization/插件.md) documentation.
+The tools section is the format of the assembled system field. Besides the react_en/react_zh/toolbench formats introduced above, the glm4 format is also supported. Users can also define their own tools_prompt format; likewise, see the [Pluginization](../Customization/插件化.md) documentation.
 
 For a complete agent training script, see [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/agent/train.sh).
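The loss_scale weighting scheme for agent training described above (Thought/Final Answer weighted 1, Action/Action Input weighted 2, the "Observation:" marker weighted 2, and the API result after it weighted 0) can be sketched as follows. This is an illustrative sketch only, not ms-swift's actual plugin code; the function name `agent_loss_scale` and the regex-based splitting are assumptions made for the example:

```python
import re

# Illustrative sketch: assign per-span loss weights to a ReAct-style response.
# Hypothetical helper, not ms-swift's real loss_scale plugin implementation.
def agent_loss_scale(response: str) -> list[tuple[str, float]]:
    # Weight applied to each field marker itself.
    weights = {"Thought:": 1.0, "Final Answer:": 1.0,
               "Action:": 2.0, "Action Input:": 2.0, "Observation:": 2.0}
    # "Action Input:" must come before "Action:" in the alternation,
    # otherwise "Action:" would match first and split it incorrectly.
    pattern = r"(Thought:|Final Answer:|Action Input:|Action:|Observation:)"
    parts = re.split(pattern, response)  # capture group keeps the markers
    scaled, current = [], 1.0
    for part in parts:
        if not part:
            continue
        if part in weights:
            scaled.append((part, weights[part]))
            # Text after "Observation:" is the API call result -> weight 0.
            current = 0.0 if part == "Observation:" else weights[part]
        else:
            scaled.append((part, current))
    return scaled
```

Each returned `(text, weight)` pair would then scale the cross-entropy loss of the tokens in that span.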
diff --git "a/docs/source/Instruction/\351\242\204\350\256\255\347\273\203\345\217\212\345\276\256\350\260\203.md" "b/docs/source/Instruction/\351\242\204\350\256\255\347\273\203\345\217\212\345\276\256\350\260\203.md"
index 7b8810e6ea..8f1c0de715 100644
--- "a/docs/source/Instruction/\351\242\204\350\256\255\347\273\203\345\217\212\345\276\256\350\260\203.md"
+++ "b/docs/source/Instruction/\351\242\204\350\256\255\347\273\203\345\217\212\345\276\256\350\260\203.md"
@@ -2,7 +2,7 @@
 
 Since pre-training and fine-tuning are quite similar, this document covers both.
 
-For the data format requirements of pre-training and fine-tuning, please refer to the [Custom Dataset](../Customization/新增数据集.md) section.
+For the data format requirements of pre-training and fine-tuning, please refer to the [Custom Dataset](../Customization/自定义数据集.md) section.
 
 In terms of data volume, continued pre-training may require anywhere from hundreds of thousands to millions of rows; pre-training from scratch needs a very large number of GPUs and an enormous amount of data and is outside the scope of this document.
 Fine-tuning requires anywhere from a few thousand to a million rows; for smaller amounts of data, consider using RAG instead.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 9e8a4c4294..4ea240c330 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -35,8 +35,8 @@ Swift DOCUMENTATION
    :maxdepth: 2
    :caption: Customization
 
-   Customization/新增数据集.md
-   Customization/新增模型.md
+   Customization/自定义数据集.md
+   Customization/自定义模型.md
    Customization/插件化.md
 
 Indices and tables
diff --git a/docs/source_en/Customization/Custom-dataset.md b/docs/source_en/Customization/Custom-dataset.md
new file mode 100644
index 0000000000..1584572d22
--- /dev/null
+++ b/docs/source_en/Customization/Custom-dataset.md
@@ -0,0 +1,112 @@
+# Custom Dataset
+
+The standard ms-swift dataset format accepts the following keys: 'messages', 'rejected_response', 'label', 'images', 'videos', 'audios', 'tools', and 'objects'. Among these, 'messages' is required. 'rejected_response' is used for RLHF training such as DPO, 'label' is used for KTO training, and 'images', 'videos', and 'audios' store paths or URLs of multimodal data. 'tools' is for Agent tasks, and 'objects' is for grounding tasks.
+
+There are three core preprocessors in ms-swift: `MessagesPreprocessor`, `AlpacaPreprocessor`, and `ResponsePreprocessor`. `MessagesPreprocessor` converts datasets in messages and sharegpt formats to the standard format, `AlpacaPreprocessor` converts alpaca-format datasets, and `ResponsePreprocessor` converts datasets in query/response format. `AutoPreprocessor` automatically selects the appropriate preprocessor; it typically handles over 90% of cases.
+
+The following four formats will all be converted to the messages field of the ms-swift standard format by `AutoPreprocessor`:
+
+Messages format:
+```jsonl
+{"messages": [{"role": "system", "content": ""}, {"role": "user", "content": ""}, {"role": "assistant", "content": ""}, {"role": "user", "content": ""}, {"role": "assistant", "content": ""}]}
+```
+
+ShareGPT format:
+```jsonl
+{"system": "", "conversation": [{"human": "", "assistant": ""}, {"human": "", "assistant": ""}]}
+```
+
+Alpaca format:
+```jsonl
+{"system": "", "instruction": "", "input": "", "output": ""}
+```
+
+Query-Response format:
+```jsonl
+{"system": "", "query": "", "response": "", "history": [["", ""]]}
+```
+
+There are three ways to integrate a custom dataset, with increasing control over the preprocessing function:
+1. **Recommended**: Directly use `--dataset <dataset_path>` to integrate with AutoPreprocessor. This supports csv, json, jsonl, txt, and folder formats.
+2. Write a dataset_info.json file; see the built-in [dataset_info.json](https://github.com/modelscope/ms-swift/blob/main/swift/llm/dataset/data/dataset_info.json) in ms-swift for reference. One of ms_dataset_id/hf_dataset_id/dataset_path is required, and column-name conversion can be handled through the `columns` field. Format conversion uses AutoPreprocessor. Use `--custom_dataset_info xxx.json` to parse the JSON file.
+3. Manually register the dataset. This offers the most flexible control over the preprocessing function but is more complex. You can refer to the sample code in [examples](https://github.com/modelscope/swift/blob/main/examples/custom) and specify `--custom_register_path xxx.py` to parse the registered content.
+
+## Recommended Dataset Format
+
+Here is the recommended dataset format for ms-swift:
+
+### Pre-training
+
+```jsonl
+{"messages": [{"role": "assistant", "content": "I love music"}]}
+{"messages": [{"role": "assistant", "content": "Coach, I want to play basketball"}]}
+{"messages": [{"role": "assistant", "content": "Which is more authoritative, tomato and egg rice or the third fresh stir-fry?"}]}
+```
+
+### Supervised Fine-tuning
+
+```jsonl
+{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}, {"role": "assistant", "content": "Tomorrow's weather will be sunny"}]}
+{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}]}
+```
+
+### RLHF
+
+#### DPO/ORPO/CPO/SimPO
+
+```jsonl
+{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}, {"role": "assistant", "content": "Tomorrow's weather will be sunny"}], "rejected_response": "I don't know"}
+{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}], "rejected_response": "I don't know"}
+```
+
+#### KTO
+
+```jsonl
+{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}, {"role": "assistant", "content": "I don't know"}], "label": false}
+{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}], "label": true}
+```
+
+### Multimodal
+
+For multimodal datasets, the format is the same as above. The difference is that it includes the keys `images`, `videos`, and `audios`, which hold the multimodal resources:
+```jsonl
+{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "<image>What is in the image?