From a39fc2f2d7b7fdaa317b2171164b768d253541a9 Mon Sep 17 00:00:00 2001 From: yanrk123 <2493404415@qq.com> Date: Tue, 10 Dec 2024 11:21:16 +0800 Subject: [PATCH 1/9] Update three documents --- ...32\344\271\211\346\250\241\345\236\213.md" | 10 +- .../source_en/Customization/Custom-dataset.md | 112 ++++++++++++++++++ .../{New-model.md => Custom-model.md} | 0 docs/source_en/Customization/New-dataset.md | 101 ---------------- .../Instruction/Command-line-parameters.md | 12 +- .../Pre-training-and-Fine-tuning.md | 2 +- 6 files changed, 123 insertions(+), 114 deletions(-) create mode 100644 docs/source_en/Customization/Custom-dataset.md rename docs/source_en/Customization/{New-model.md => Custom-model.md} (100%) delete mode 100644 docs/source_en/Customization/New-dataset.md diff --git "a/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\250\241\345\236\213.md" "b/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\250\241\345\236\213.md" index 65d9b72775..cbeb8bc14d 100644 --- "a/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\250\241\345\236\213.md" +++ "b/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\250\241\345\236\213.md" @@ -1,10 +1,10 @@ -# 自定义模型 +# Custom Model -ms-swift内置的模型,你可以直接通过指定model_id或者model_path来使用:`--model `。ms-swift会根据model_id/model_path的后缀和`config.json`文件来判断model_type。每种model_type都有唯一的模型结构、template和加载方式。当然,你也可以手动传入`--model_type`、`--template`来进行覆盖。ms-swift已支持的model_type和template可以查看[支持的模型与数据集](../Instruction/支持的模型和数据集.md)。 +The built-in models in ms-swift can be used directly by specifying `model_id` or `model_path` with the command: `--model `. ms-swift determines the `model_type` based on the suffix of `model_id/model_path` and the `config.json` file. Each `model_type` has a unique model structure, template, and loading method. You can also manually pass `--model_type` and `--template` to override these settings. 
For a list of supported `model_type` and templates, you can refer to [Supported Models and Datasets](../Instruction/Supported_Models_and_Datasets.md).

> [!TIP]
-> 在使用`swift sft`通过LoRA技术微调base模型为chat模型时，例如将Llama3.2-1B微调为chat模型，有时需要手动设置模板。通过添加`--template default`参数来避免base模型因未见过对话模板中的特殊字符而无法正常停止的情况。
+> When using `swift sft` to fine-tune a base model into a chat model with LoRA (for example, fine-tuning Llama3.2-1B into a chat model), you sometimes need to set the template manually. Add `--template default` to avoid the situation where the base model fails to stop generating because it has never seen the special tokens in the chat template.

-## 模型注册
+## Model Registration

-请参考[examples](https://github.com/modelscope/swift/blob/main/examples/custom)中示例代码。你可以通过指定`--custom_register_path xxx.py`对注册的内容进行解析。
+Please refer to the example code in [examples](https://github.com/modelscope/swift/blob/main/examples/custom). You can parse the registered content by specifying `--custom_register_path xxx.py`.
\ No newline at end of file
diff --git a/docs/source_en/Customization/Custom-dataset.md b/docs/source_en/Customization/Custom-dataset.md
new file mode 100644
index 0000000000..fd6de7ae8b
--- /dev/null
+++ b/docs/source_en/Customization/Custom-dataset.md
@@ -0,0 +1,112 @@
+# Custom Dataset
+
+The standard dataset format for ms-swift accepts the following keys: 'messages', 'rejected_response', 'label', 'images', 'videos', 'audios', 'tools', and 'objects'. Among these, 'messages' is required. 'rejected_response' is used for RLHF training methods such as DPO, 'label' is used for KTO training, while 'images', 'videos', and 'audios' store the paths or URLs of multimodal data. 'tools' is used for Agent tasks, and 'objects' is used for grounding tasks.
+
+There are three core preprocessors in ms-swift: `MessagesPreprocessor`, `AlpacaPreprocessor`, and `ResponsePreprocessor`. 
`MessagesPreprocessor` converts datasets in the messages and sharegpt formats to the standard format, `AlpacaPreprocessor` converts alpaca-format datasets, and `ResponsePreprocessor` converts datasets in the query/response format. `AutoPreprocessor` automatically selects the appropriate preprocessor. Typically, `AutoPreprocessor` can handle over 90% of cases.

The following four formats will all be converted to the messages field of the ms-swift standard format by `AutoPreprocessor`:

Messages format:
```jsonl
{"messages": [{"role": "system", "content": "<system>"}, {"role": "user", "content": "<query1>"}, {"role": "assistant", "content": "<response1>"}, {"role": "user", "content": "<query2>"}, {"role": "assistant", "content": "<response2>"}]}
```

ShareGPT format:
```jsonl
{"system": "<system>", "conversation": [{"human": "<query1>", "assistant": "<response1>"}, {"human": "<query2>", "assistant": "<response2>"}]}
```

Alpaca format:
```jsonl
{"system": "<system>", "instruction": "<instruction>", "input": "<input>", "output": "<output>"}
```

Query-Response format:
```jsonl
{"system": "<system>", "query": "<query2>", "response": "<response2>", "history": [["<query1>", "<response1>"]]}
```

There are three ways to integrate a custom dataset, with increasing control over the preprocessing function:
1. **Recommended**: Directly use `--dataset <dataset_id_or_path>` to integrate with AutoPreprocessor. This supports csv, json, jsonl, txt, and folder formats.
2. Write a dataset_info.json file. You can refer to the built-in [dataset_info.json](https://github.com/modelscope/ms-swift/blob/main/swift/llm/dataset/data/dataset_info.json) in ms-swift. One of ms_dataset_id/hf_dataset_id/dataset_path is required, and column name conversion can be handled through the `columns` field. Format conversion uses AutoPreprocessor. Use `--custom_dataset_info xxx.json` to parse the JSON file.
3. Manually register the dataset, which offers the most flexible customization of the preprocessing function but is more complex. 
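For example, the core of such a custom preprocessing function comes down to mapping one raw record into the standard messages format. The following plain-Python sketch illustrates this for a ShareGPT-style record; it is an illustration of the conversion described above, not ms-swift's actual implementation, and the function name is hypothetical:

```python
# Illustration only: a hand-rolled stand-in for the conversion that
# ms-swift's preprocessors perform. The function name is hypothetical.

def sharegpt_to_messages(record: dict) -> dict:
    """Map a ShareGPT-style record to the standard messages format."""
    messages = []
    # An optional system prompt becomes the first message.
    if record.get("system"):
        messages.append({"role": "system", "content": record["system"]})
    # Each conversation turn expands into a user/assistant message pair.
    for turn in record.get("conversation", []):
        messages.append({"role": "user", "content": turn["human"]})
        messages.append({"role": "assistant", "content": turn["assistant"]})
    return {"messages": messages}

record = {"system": "You are a useful and harmless assistant",
          "conversation": [{"human": "What is 1 + 1?", "assistant": "It equals 2"}]}
print(sharegpt_to_messages(record))
```

Running this on a ShareGPT-style record yields the messages structure shown in the recommended formats below.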
You can refer to the example code in [examples](https://github.com/modelscope/swift/blob/main/examples/custom), and specify `--custom_register_path xxx.py` to parse the registered content.

## Recommended Dataset Format

Here is the recommended dataset format for ms-swift:

### Pre-training

```jsonl
{"messages": [{"role": "assistant", "content": "I love music"}]}
{"messages": [{"role": "assistant", "content": "Coach, I want to play basketball"}]}
{"messages": [{"role": "assistant", "content": "Which is more authoritative, tomato and egg rice or the third fresh stir-fry?"}]}
```

### Supervised Fine-tuning

```jsonl
{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}, {"role": "assistant", "content": "Tomorrow's weather will be sunny"}]}
{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}]}
```

### RLHF

#### DPO/ORPO/CPO/SimPO

```jsonl
{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}, {"role": "assistant", "content": "Tomorrow's weather will be sunny"}], "rejected_response": "I don't know"}
{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}], "rejected_response": "I don't know"}
```

#### KTO

```jsonl
{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}, {"role": "assistant", "content": "I don't know"}], "label": false}
{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}], "label": true}
```

### Multimodal

For multimodal datasets, the format is the same as above, except that it also includes the keys `images`, `videos`, and `audios`, which store the paths or URLs of multimodal resources:
```jsonl
{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "<image>What is in the image?