diff --git a/README.md b/README.md index 1be393bcb7..cfbcfdfc4d 100644 --- a/README.md +++ b/README.md @@ -67,7 +67,7 @@ You can contact us and communicate with us by adding our group: - 🍊 **Lightweight Training**: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel. - **Distributed Training**: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques. - **Quantization Training**: Supports training quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ. -- **RLHF Training**: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM for both pure text and multi-modal large models. +- **RLHF Training**: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM, PPO for both pure text and multi-modal large models. - 🍓 **Multi-Modal Training**: Supports training on different modalities like images, videos, and audio, for tasks like VQA, captioning, OCR, and grounding. - **Interface Training**: Provides capabilities for training, inference, evaluation, quantization through an interface, completing the whole large model pipeline. - **Plugin and Extension**: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer. @@ -83,7 +83,7 @@ You can contact us and communicate with us by adding our group: - 🎉 2024.08.12: The SWIFT paper has been published on arXiv, and you can read it [here](https://arxiv.org/abs/2408.05517). - 🔥 2024.08.05: Support for using [evalscope](https://github.com/modelscope/evalscope/) as a backend for evaluating large models and multimodal models. - 🔥 2024.07.29: Support for using [vllm](https://github.com/vllm-project/vllm) and [lmdeploy](https://github.com/InternLM/lmdeploy) to accelerate inference for large models and multimodal models. When performing infer/deploy/eval, you can specify `--infer_backend vllm/lmdeploy`. -- 🔥 2024.07.24: Support for human preference alignment training for multimodal large models, including DPO/ORPO/SimPO/CPO/KTO/RM. +- 🔥 2024.07.24: Support for human preference alignment training for multimodal large models, including DPO/ORPO/SimPO/CPO/KTO/RM/PPO. - 🔥 2024.02.01: Support for Agent training! The training algorithm is derived from [this paper](https://arxiv.org/pdf/2309.00986.pdf). 
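The PPO path advertised above is exercised end-to-end by `tests/train/test_ppo.py` and `examples/train/rlhf/ppo.sh` later in this patch. A minimal Python sketch of the same flow, with the model, reward-model and dataset IDs borrowed from that test and every other value assumed:

```python
# Minimal PPO launch via the Python API; mirrors tests/train/test_ppo.py in this patch.
# Model, reward-model and dataset IDs come from that test; substitute your own as needed.
from swift.llm import rlhf_main, RLHFArguments

result = rlhf_main(
    RLHFArguments(
        rlhf_type='ppo',
        model='LLM-Research/Llama-3.2-1B-Instruct',
        reward_model='AI-ModelScope/GRM-Llama3.2-3B-rewardmodel-ft',
        dataset=['AI-ModelScope/alpaca-gpt4-data-zh#100', 'AI-ModelScope/alpaca-gpt4-data-en#100'],
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
    ))
print(result['last_model_checkpoint'])
```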
diff --git a/README_CN.md b/README_CN.md index bf4443bd88..7ca2931fe5 100644 --- a/README_CN.md +++ b/README_CN.md @@ -64,7 +64,7 @@ - 🍊 **轻量训练**:支持了LoRA、QLoRA、DoRA、LoRA+、ReFT、RS-LoRA、LLaMAPro、Adapter、GaLore、Q-Galore、LISA、UnSloth、Liger-Kernel等轻量微调方式。 - **分布式训练**:支持分布式数据并行(DDP)、device_map简易模型并行、DeepSpeed ZeRO2 ZeRO3、FSDP等分布式训练技术。 - **量化训练**:支持对BNB、AWQ、GPTQ、AQLM、HQQ、EETQ量化模型进行训练。 -- **RLHF训练**:支持纯文本大模型和多模态大模型的DPO、CPO、SimPO、ORPO、KTO、RM等人类对齐训练方法。 +- **RLHF训练**:支持纯文本大模型和多模态大模型的DPO、CPO、SimPO、ORPO、KTO、RM、PPO等人类对齐训练方法。 - 🍓 **多模态训练**:支持对图像、视频和语音不同模态模型进行训练,支持VQA、Caption、OCR、Grounding任务的训练。 - **界面训练**:以界面的方式提供训练、推理、评测、量化的能力,完成大模型的全链路。 - **插件化与拓展**:支持自定义模型和数据集拓展,支持对loss、metric、trainer、loss-scale、callback、optimizer等组件进行自定义。 @@ -78,7 +78,7 @@ - 🎉 2024.08.12: SWIFT论文已经发布到arXiv上,可以点击[这里](https://arxiv.org/abs/2408.05517)阅读。 - 🔥 2024.08.05: 支持使用[evalscope](https://github.com/modelscope/evalscope/)作为后端进行大模型和多模态模型的评测。 - 🔥 2024.07.29: 支持使用[vllm](https://github.com/vllm-project/vllm), [lmdeploy](https://github.com/InternLM/lmdeploy)对大模型和多模态大模型进行推理加速,在infer/deploy/eval时额外指定`--infer_backend vllm/lmdeploy`即可。 -- 🔥 2024.07.24: 支持对多模态大模型进行人类偏好对齐训练,包括DPO/ORPO/SimPO/CPO/KTO/RM。 +- 🔥 2024.07.24: 支持对多模态大模型进行人类偏好对齐训练,包括DPO/ORPO/SimPO/CPO/KTO/RM/PPO。 - 🔥 2024.02.01: 支持Agent训练!训练算法源自这篇[论文](https://arxiv.org/pdf/2309.00986.pdf)。 ## 🛠️ 安装 diff --git "a/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\225\260\346\215\256\351\233\206.md" "b/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\225\260\346\215\256\351\233\206.md" index 16b54234be..9c63c6f080 100644 --- "a/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\225\260\346\215\256\351\233\206.md" +++ "b/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\225\260\346\215\256\351\233\206.md" @@ -67,6 +67,14 @@ query-response格式: {"messages": [{"role": "system", "content": "你是个有用无害的数学计算器"}, {"role": "user", "content": "1+1等于几"}, {"role": "assistant", "content": "等于2"}, {"role": "user", "content": "再加1呢"}, {"role": "assistant", "content": "等于3"}], "label": true} ``` +#### PPO + +```jsonl +{"messages": [{"role": "system", "content": "你是个有用无害的助手"}, {"role": "user", "content": "告诉我明天的天气"}]} +{"messages": [{"role": "system", "content": "你是个有用无害的数学计算器"}, {"role": "user", "content": "1+1等于几"}, {"role": "assistant", "content": "等于2"}, {"role": "user", "content": "再加1呢"}]} +{"messages": [{"role": "user", "content": "你的名字是什么"}]} +``` + ### 序列分类 ```jsonl {"messages": [{"role": "user", "content": "今天天气真好呀"}], "label": 1} diff --git "a/docs/source/GetStarted/\345\277\253\351\200\237\345\274\200\345\247\213.md" "b/docs/source/GetStarted/\345\277\253\351\200\237\345\274\200\345\247\213.md" index 0c0ee21aa0..9892185272 100644 --- "a/docs/source/GetStarted/\345\277\253\351\200\237\345\274\200\345\247\213.md" +++ "b/docs/source/GetStarted/\345\277\253\351\200\237\345\274\200\345\247\213.md" @@ -8,7 +8,7 @@ ms-swift是魔搭社区提供的大模型与多模态大模型训练部署框架 - 🍊 轻量训练:支持了LoRA、QLoRA、DoRA、LoRA+、ReFT、RS-LoRA、LLaMAPro、Adapter、GaLore、Q-Galore、LISA、UnSloth、Liger-Kernel等轻量微调方式。 - 分布式训练:支持分布式数据并行(DDP)、device_map简易模型并行、DeepSpeed ZeRO2 ZeRO3、FSDP等分布式训练技术。 - 量化训练:支持对BNB、AWQ、GPTQ、AQLM、HQQ、EETQ量化模型进行训练。 -- RLHF训练:支持纯文本大模型和多模态大模型的DPO、CPO、SimPO、ORPO、KTO、RM等人类对齐训练方法。 +- RLHF训练:支持纯文本大模型和多模态大模型的DPO、CPO、SimPO、ORPO、KTO、RM、PPO等人类对齐训练方法。 - 🍓 多模态训练:支持对图像、视频和语音不同模态模型进行训练,支持VQA、Caption、OCR、Grounding任务的训练。 - 界面训练:以界面的方式提供训练、推理、评测、量化的能力,完成大模型的全链路。 - 插件化与拓展:支持自定义模型和数据集拓展,支持对loss、metric、trainer、loss-scale、callback、optimizer等组件进行自定义。 diff --git 
a/docs/source/Instruction/ReleaseNote3.0.md b/docs/source/Instruction/ReleaseNote3.0.md index 888b7916fb..eeacc1de51 100644 --- a/docs/source/Instruction/ReleaseNote3.0.md +++ b/docs/source/Instruction/ReleaseNote3.0.md @@ -81,7 +81,6 @@ ## 待完成 -1. RM/PPO能力3.0版本尚不支持,请使用2.6.1版本 -2. 自定义数据集评测3.0版本尚不支持,请使用2.6.1版本 -3. Megatron预训练能力3.0版本尚不支持,请使用2.6.1版本 +1. 自定义数据集评测3.0版本尚不支持,请使用2.6.1版本 +2. Megatron预训练能力3.0版本尚不支持,请使用2.6.1版本 3. 文档和README暂时未更新完整 diff --git "a/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" "b/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" index f0920e9f5e..e8769a1a6c 100644 --- "a/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" +++ "b/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" @@ -50,7 +50,7 @@ - 🔥max_pixels: 多模态模型图片前处理的最大像素数(H\*W),默认不缩放。 - tools_prompt: 智能体训练时的工具列表转为system的格式,请参考[智能体训练](./智能体的支持.md),默认为'react_en' - padding_side: 当训练`batch_size>=2`时的padding_side,可选值为'left', 'right',默认为'right'。(`generate`的batch_size>=2时,只进行左padding) -- loss_scale: 如何针对训练添加token的loss权重。默认为`'default'`,代表所有response(含history)以1计算交叉熵损失。具体可以查看[插件化](../Customization/插件化.md)和[智能体训练](./智能体的支持.md) +- loss_scale: 如何针对训练添加token的loss权重。默认为`'default'`,代表所有response(含history)以1计算交叉熵损失。可选值为'default', 'last_round', 'all', 以及agent需要的loss_scale: 'react', 'agentflan', 'alpha_umi', 'qwen'。具体可以查看[插件化](../Customization/插件化.md)和[智能体训练](./智能体的支持.md) - sequence_parallel_size: 序列并行数量。参考[example](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel/train.sh) - use_chat_template: 使用chat模板或generation模板,默认为`True`。`swift pt`会自动设置为generation模板 - template_backend: 使用swift或jinja进行推理。如果使用jinja,则使用transformers的`apply_chat_template`。默认为swift @@ -307,7 +307,7 @@ Vera使用`target_modules`, `target_regex`, `modules_to_save`三个参数. 
### RLHF参数 RLHF参数继承于[训练参数](#训练参数) -- 🔥rlhf_type: 对齐算法类型,支持`dpo`, `orpo`, `simpo`, `kto`, `cpo`, `rm` +- 🔥rlhf_type: 对齐算法类型,支持`dpo`, `orpo`, `simpo`, `kto`, `cpo`, `rm`, `ppo` - ref_model: DPO等算法中的原始对比模型 - ref_model_type: 同model_type - ref_model_revision: 同model_revision @@ -324,6 +324,27 @@ RLHF参数继承于[训练参数](#训练参数) - desirable_weight: KTO算法中对desirable response的loss权重 $\lambda_D$ ,默认为`1.` - undesirable_weight: KTO论文中对undesirable response的loss权重 $\lambda_U$ , 默认为`1.` +#### PPO参数 +- reward_model: 默认为None +- reward_adapters: 默认为`[]` +- reward_model_type: 默认为None +- reward_model_revision: 默认为None + +以下参数含义可以参考[这里](https://huggingface.co/docs/trl/main/ppo_trainer) +- num_ppo_epochs: 默认为4 +- whiten_rewards: 默认为False +- kl_coef: 默认为0.05 +- cliprange: 默认为0.2 +- vf_coef: 默认为0.1 +- cliprange_value: 默认为0.2 +- gamma: 默认为1.0 +- lam: 默认为0.95 +- num_mini_batches: 默认为1 +- local_rollout_forward_batch_size: 默认为64 +- num_sample_generations: 默认为10 +- response_length: 默认为512 +- temperature: 默认为0.7 +- missing_eos_penalty: 默认为None ### 推理参数 diff --git a/docs/source_en/Customization/Custom-dataset.md b/docs/source_en/Customization/Custom-dataset.md index f7b883f46a..268b06dbc0 100644 --- a/docs/source_en/Customization/Custom-dataset.md +++ b/docs/source_en/Customization/Custom-dataset.md @@ -66,6 +66,14 @@ The following provides the recommended dataset format for ms-swift, where the sy {"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}], "label": true} ``` +#### PPO + +```jsonl +{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]} +{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]} +{"messages": [{"role": "user", "content": "What is your name?"}]} +``` + ### Sequence Classification ```jsonl {"messages": [{"role": "user", "content": "The weather is really nice today"}], "label": 1} diff --git a/docs/source_en/GetStarted/Quick-start.md b/docs/source_en/GetStarted/Quick-start.md index b86891c519..e4887be83c 100644 --- a/docs/source_en/GetStarted/Quick-start.md +++ b/docs/source_en/GetStarted/Quick-start.md @@ -8,7 +8,7 @@ ms-swift is a comprehensive training and deployment framework for large language - 🍊 Lightweight Training: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel, and more. - Distributed Training: Supports distributed data parallel (DDP), simple model parallelism via device_map, DeepSpeed ZeRO2 ZeRO3, FSDP, and other distributed training technologies. - Quantization Training: Provides training for quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ. -- RLHF Training: Supports human alignment training methods like DPO, CPO, SimPO, ORPO, KTO, RM for both text-based and multimodal large models. +- RLHF Training: Supports human alignment training methods like DPO, CPO, SimPO, ORPO, KTO, RM, PPO for both text-based and multimodal large models. 
- 🍓 Multimodal Training: Capable of training models for different modalities such as images, videos, and audios; supports tasks like VQA (Visual Question Answering), Captioning, OCR (Optical Character Recognition), and Grounding. - Interface-driven Training: Offers training, inference, evaluation, and quantization capabilities through an interface, enabling a complete workflow for large models. - Plugins and Extensions: Allows customization and extension of models and datasets, and supports customizations for components like loss, metric, trainer, loss-scale, callback, optimizer, etc. diff --git a/docs/source_en/Instruction/Command-line-parameters.md b/docs/source_en/Instruction/Command-line-parameters.md index b56a0b1afc..233a51861f 100644 --- a/docs/source_en/Instruction/Command-line-parameters.md +++ b/docs/source_en/Instruction/Command-line-parameters.md @@ -50,7 +50,7 @@ The introduction to command line parameters will cover base arguments, atomic ar - 🔥max_pixels: Maximum pixel count for pre-processing images in multimodal models (H*W), default is no scaling. - tools_prompt: The list of tools for agent training converted to system format, refer to [Agent Training](./Agent-support.md), default is 'react_en'. - padding_side: The padding_side used when training with `batch_size >= 2`, with optional values of 'left' and 'right', defaulting to 'right'. (When the batch_size in `generate` is >= 2, only left padding is applied.) -- loss_scale: How to add token loss weight during training. Default is `'default'`, meaning all responses (including history) are treated as 1 for cross-entropy loss. For specifics, see [Pluginization](../Customization/Pluginization.md) and [Agent Training](./Agent-support.md). +- loss_scale: How to add token loss weight during training. Default is `'default'`, meaning all responses (including history) are treated as 1 for cross-entropy loss. The optional values are 'default', 'last_round', 'all', and the loss scale required by the agent: 'react', 'agentflan', 'alpha_umi', 'qwen'. For specifics, see [Pluginization](../Customization/Pluginization.md) and [Agent Training](./Agent-support.md). - sequence_parallel_size: Number of sequence parallelism. Refer to [example](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel/train.sh). - use_chat_template: Use chat template or generation template, default is `True`. `swift pt` is automatically set to the generation template. - template_backend: Use swift or jinja for inference. If using jinja, it will utilize transformers' `apply_chat_template`. Default is swift. @@ -311,23 +311,47 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine RLHF arguments inherit from the [training arguments](#training-arguments). -- 🔥rlhf_type: Alignment algorithm type, supports `dpo`, `orpo`, `simpo`, `kto`, `cpo`, `rm`. +- 🔥rlhf_type: Alignment algorithm type, supports `dpo`, `orpo`, `simpo`, `kto`, `cpo`, `rm`, `ppo`. - ref_model: Original comparison model in algorithms like DPO. - ref_model_type: Same as model_type. - ref_model_revision: Same as model_revision. - 🔥beta: KL regularization term coefficient, default is `None`, i.e., for `simpo` algorithm default is `2.`, for other algorithms default is `0.1`. Refer to the [documentation](./Human-alignment.md) for specifics. - label_smoothing: Whether to use DPO smoothing, default value is `0`, generally set between 0~0.5. -- + - 🔥rpo_alpha: Weight for adding sft_loss in DPO, default is `1`. 
The final loss is `KL_loss + rpo_alpha * sft_loss`. -- + - cpo_alpha: The coefficient of nll loss in CPO/SimPO loss, default is `1.`. -- + - simpo_gamma: Reward margin term in SimPO algorithm, recommended to set between 0.5-1.5 in the paper, default is `1.`. -- + - desirable_weight: Loss weight for desirable response in KTO algorithm $\lambda_D$, default is `1.`. - undesirable_weight: Loss weight for undesirable response in KTO paper $\lambda_U$, default is `1.`. +#### PPO Arguments + +- reward_model: Defaults to None +- reward_adapters: Defaults to `[]` +- reward_model_type: Defaults to None +- reward_model_revision: Defaults to None + +The meanings of the following parameters can be referenced [here](https://huggingface.co/docs/trl/main/ppo_trainer): + +- num_ppo_epochs: Defaults to 4 +- whiten_rewards: Defaults to False +- kl_coef: Defaults to 0.05 +- cliprange: Defaults to 0.2 +- vf_coef: Defaults to 0.1 +- cliprange_value: Defaults to 0.2 +- gamma: Defaults to 1.0 +- lam: Defaults to 0.95 +- num_mini_batches: Defaults to 1 +- local_rollout_forward_batch_size: Defaults to 64 +- num_sample_generations: Defaults to 10 +- response_length: Defaults to 512 +- temperature: Defaults to 0.7 +- missing_eos_penalty: Defaults to None + ### Inference Arguments Inference arguments include the [base arguments](#base-arguments), [merge arguments](#merge-arguments), [vLLM arguments](#vllm-arguments), [LMDeploy arguments](#LMDeploy-arguments), and also contain the following: diff --git a/docs/source_en/Instruction/ReleaseNote3.0.md b/docs/source_en/Instruction/ReleaseNote3.0.md index bbd4e6fc42..e49bd444d6 100644 --- a/docs/source_en/Instruction/ReleaseNote3.0.md +++ b/docs/source_en/Instruction/ReleaseNote3.0.md @@ -94,7 +94,6 @@ The parameters marked as compatible in version 2.0 have been entirely removed. ## Pending Tasks -1. RM/PPO capabilities are not supported in version 3.0. Please use version 2.6.1. -2. Custom dataset evaluation is not supported in version 3.0. Please use version 2.6.1. -3. Megatron pre-training capabilities are not supported in version 3.0. Please use version 2.6.1. -4. Documentation and README are temporarily incomplete and will be updated. +1. Custom dataset evaluation is not supported in version 3.0. Please use version 2.6.1. +2. Megatron pre-training capabilities are not supported in version 3.0. Please use version 2.6.1. +3. Documentation and README are temporarily incomplete and will be updated. 
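The PPO-specific arguments documented in `Command-line-parameters.md` above map one-to-one onto the `PPOArguments` dataclass added in `swift/llm/argument/rlhf_args.py` further down in this patch. A sketch of setting them from Python, assuming the documented defaults and reusing the model, reward-model and dataset names from `examples/train/rlhf/ppo.sh`:

```python
# Sketch only: PPO knobs shown at their documented defaults; model and dataset IDs
# are taken from examples/train/rlhf/ppo.sh in this patch.
from swift.llm import RLHFArguments

args = RLHFArguments(
    rlhf_type='ppo',
    model='LLM-Research/Meta-Llama-3.1-8B-Instruct',
    reward_model='AI-ModelScope/Skywork-Reward-Llama-3.1-8B-v0.2',
    dataset=['AI-ModelScope/alpaca-gpt4-data-zh#20000', 'AI-ModelScope/alpaca-gpt4-data-en#20000'],
    train_type='lora',
    # PPO hyper-parameters (documented defaults)
    num_ppo_epochs=4,
    whiten_rewards=False,
    kl_coef=0.05,
    cliprange=0.2,
    vf_coef=0.1,
    cliprange_value=0.2,
    gamma=1.0,
    lam=0.95,
    num_mini_batches=1,
    local_rollout_forward_batch_size=64,
    num_sample_generations=10,
    response_length=512,
    temperature=0.7,
)
```

As the comment at the top of `ppo.sh` notes, PPO currently requires the policy model and the reward model to use the same template/tokenizer.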
diff --git a/examples/deploy/lora/client.py b/examples/deploy/lora/client.py index e61caad8ae..ae66b10df0 100644 --- a/examples/deploy/lora/client.py +++ b/examples/deploy/lora/client.py @@ -23,5 +23,5 @@ def infer_multilora(engine: InferClient, infer_request: InferRequest): if __name__ == '__main__': engine = InferClient(host='127.0.0.1', port=8000) - infer_request = InferRequest(messages=[{'role': 'user', 'content': '你是谁'}]) + infer_request = InferRequest(messages=[{'role': 'user', 'content': 'who are you?'}]) infer_multilora(engine, infer_request) diff --git a/examples/infer/demo_hf.py b/examples/infer/demo_hf.py new file mode 100644 index 0000000000..58959078f8 --- /dev/null +++ b/examples/infer/demo_hf.py @@ -0,0 +1,60 @@ +def infer_hf(): + from transformers import AutoModelForCausalLM, AutoTokenizer + from peft import PeftModel + from modelscope import snapshot_download + model_dir = snapshot_download('Qwen/Qwen2.5-7B-Instruct') + adapter_dir = snapshot_download('swift/test_lora') + model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype='auto', device_map='auto') + model = PeftModel.from_pretrained(model, adapter_dir) + + tokenizer = AutoTokenizer.from_pretrained(model_dir) + + messages = [{ + 'role': 'system', + 'content': 'You are a helpful assistant.' + }, { + 'role': 'user', + 'content': 'who are you?' + }] + text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) + model_inputs = tokenizer([text], return_tensors='pt').to(model.device) + + generated_ids = model.generate(**model_inputs, max_new_tokens=512, do_sample=False) + generated_ids = [ + output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) + ] + + response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] + print(f'response: {response}') + return response + + +def infer_swift(): + from swift.llm import get_model_tokenizer, get_template, InferRequest, RequestConfig, PtEngine + from modelscope import snapshot_download + from swift.tuners import Swift + model_dir = snapshot_download('Qwen/Qwen2.5-7B-Instruct') + adapter_dir = snapshot_download('swift/test_lora') + model, tokenizer = get_model_tokenizer(model_dir, device_map='auto') + model = Swift.from_pretrained(model, adapter_dir) + template = get_template(model.model_meta.template, tokenizer) + engine = PtEngine.from_model_template(model, template) + + messages = [{ + 'role': 'system', + 'content': 'You are a helpful assistant.' + }, { + 'role': 'user', + 'content': 'who are you?' 
+ }] + request_config = RequestConfig(max_tokens=512, temperature=0) + resp_list = engine.infer([InferRequest(messages=messages)], request_config=request_config) + response = resp_list[0].choices[0].message.content + print(f'response: {response}') + return response + + +if __name__ == '__main__': + response = infer_hf() + response2 = infer_swift() + assert response == response2 diff --git a/examples/infer/demo_lora.py b/examples/infer/demo_lora.py index 7489d1c38a..8d9396f135 100644 --- a/examples/infer/demo_lora.py +++ b/examples/infer/demo_lora.py @@ -63,6 +63,6 @@ def infer_lora(infer_request: 'InferRequest'): from swift.llm import (PtEngine, RequestConfig, AdapterRequest, get_template, BaseArguments, InferRequest, safe_snapshot_download, get_model_tokenizer) from swift.tuners import Swift - infer_request = InferRequest(messages=[{'role': 'user', 'content': '你是谁'}]) + infer_request = InferRequest(messages=[{'role': 'user', 'content': 'who are you?'}]) # infer_lora(infer_request) infer_multilora(infer_request, 'pt') diff --git a/examples/train/rlhf/ppo.sh b/examples/train/rlhf/ppo.sh new file mode 100644 index 0000000000..86f93d9348 --- /dev/null +++ b/examples/train/rlhf/ppo.sh @@ -0,0 +1,30 @@ +# Currently, it only supports the case where the model and reward_model use the same template/tokenizer. +nproc_per_node=4 + +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +NPROC_PER_NODE=$nproc_per_node \ +swift rlhf \ + --rlhf_type ppo \ + --model LLM-Research/Meta-Llama-3.1-8B-Instruct \ + --reward_model 'AI-ModelScope/Skywork-Reward-Llama-3.1-8B-v0.2' \ + --train_type lora \ + --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#20000' 'AI-ModelScope/alpaca-gpt4-data-en#20000' \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-5 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 5 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --deepspeed zero2 \ + --response_length 512 diff --git a/swift/llm/argument/rlhf_args.py b/swift/llm/argument/rlhf_args.py index 89d167ee3c..8ddd396f53 100644 --- a/swift/llm/argument/rlhf_args.py +++ b/swift/llm/argument/rlhf_args.py @@ -1,13 +1,38 @@ # Copyright (c) Alibaba, Inc. and its affiliates. 
from dataclasses import dataclass, field -from typing import Literal, Optional +from typing import List, Literal, Optional from swift.llm import MODEL_MAPPING from .train_args import TrainArguments @dataclass -class RLHFArguments(TrainArguments): +class PPOArguments: + reward_model: Optional[str] = None + reward_adapters: List[str] = field(default_factory=list) + reward_model_type: Optional[str] = field( + default=None, metadata={'help': f'model_type choices: {list(MODEL_MAPPING.keys())}'}) + reward_model_revision: Optional[str] = None + + num_ppo_epochs: int = 4 + whiten_rewards: bool = False + kl_coef: float = 0.05 + cliprange: float = 0.2 + vf_coef: float = 0.1 + cliprange_value: float = 0.2 + gamma: float = 1.0 + lam: float = 0.95 + + num_mini_batches: int = 1 + local_rollout_forward_batch_size: int = 64 + num_sample_generations: int = 10 + response_length: int = 512 + temperature: float = 0.7 + missing_eos_penalty: Optional[float] = None + + +@dataclass +class RLHFArguments(PPOArguments, TrainArguments): """ RLHFArguments is a dataclass that holds arguments specific to the Reinforcement Learning with Human Feedback (RLHF) training backend. @@ -25,7 +50,7 @@ class RLHFArguments(TrainArguments): desirable_weight (float): Weight for desirable outcomes in KTO. Default is 1.0. undesirable_weight (float): Weight for undesirable outcomes in KTO. Default is 1.0. """ - rlhf_type: Literal['dpo', 'orpo', 'simpo', 'kto', 'cpo', 'rm'] = 'dpo' + rlhf_type: Literal['dpo', 'orpo', 'simpo', 'kto', 'cpo', 'rm', 'ppo'] = 'dpo' ref_model: Optional[str] = None ref_model_type: Optional[str] = field( default=None, metadata={'help': f'model_type choices: {list(MODEL_MAPPING.keys())}'}) @@ -48,6 +73,7 @@ def __post_init__(self): self._init_simpo() self._set_default() super().__post_init__() + self._init_ppo() if self.rlhf_type in ['dpo', 'kto'] and self.train_type == 'full' or self.rlhf_type == 'ppo': self.ref_model = self.ref_model or self.model @@ -56,6 +82,13 @@ def __post_init__(self): elif self.ref_model is not None: raise ValueError('CPO/ORPO or LoRA training does not require a ref_model to be passed in.') + def _init_ppo(self): + if self.rlhf_type == 'ppo': + self.padding_side = 'left' + self.metric_for_best_model = None + self.training_args.metric_for_best_model = None + # TODO: streaming, MLLM + def _init_simpo(self): if self.rlhf_type != 'simpo': return diff --git a/swift/llm/template/base.py b/swift/llm/template/base.py index a4b2aa7c1d..d2a2ae84a8 100644 --- a/swift/llm/template/base.py +++ b/swift/llm/template/base.py @@ -598,7 +598,7 @@ def _swift_encode(self, inputs: StdTemplateInputs): context_list = prompt.copy() extra_context_list = [] extra_context_type = None - if i < n_round - 1 or self.mode == 'seq_cls' and response is not None: + if i < n_round - 1: # Not the last round. 
context_list.append('{{RESPONSE}}') extra_context_list = template_meta.chat_sep diff --git a/swift/llm/template/template/mplug.py b/swift/llm/template/template/mplug.py index 4e25652257..9882cd3388 100644 --- a/swift/llm/template/template/mplug.py +++ b/swift/llm/template/template/mplug.py @@ -97,7 +97,7 @@ def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]: if images: image_inputs = processor.image_processor(images, cut_enable=cut_enable, return_tensors='pt') added_tokens_len = 0 - cut_shapes = image_inputs['cut_shape'] or [None] * len(idx_list) + cut_shapes = image_inputs['cut_shape'] or [None] * 2 * len(idx_list) image_token_list = self.processor.encode('<|image|>', add_special_tokens=False) for idx, cut_shape in zip(idx_list, cut_shapes[::2]): if cut_shape: @@ -161,6 +161,8 @@ def _post_encode(self, model: nn.Module, inputs: Dict[str, Any]) -> Dict[str, An if 'pixel_values' in inputs: pixel_values = inputs.pop('pixel_values') inputs['image_embeds'] = torch.concat([model.forward_image(pv) for pv in pixel_values]) + else: + inputs['media_offset'] = [None] * inputs['input_ids'].shape[0] return inputs diff --git a/swift/llm/train/rlhf.py b/swift/llm/train/rlhf.py index 2e5bff8910..feffd4e65c 100644 --- a/swift/llm/train/rlhf.py +++ b/swift/llm/train/rlhf.py @@ -1,37 +1,66 @@ # Copyright (c) Alibaba, Inc. and its affiliates. from typing import List, Union -from swift.utils import patch_getattr +from swift.utils import get_logger, get_model_parameter_info from ..argument import RLHFArguments from .kto import prepare_kto_dataset from .sft import SwiftSft +logger = get_logger() + class SwiftRLHF(SwiftSft): args_class = RLHFArguments args: args_class def _prepare_model_tokenizer(self): + from swift.llm.infer.utils import prepare_adapter args = self.args - self.ref_model = None - if args.ref_model: + for key in ['ref', 'reward', 'value']: + origin_key = key + setattr(self, f'{key}_model', None) + if key == 'value': + if args.rlhf_type == 'ppo': + key = 'reward' + else: + continue + model_id_or_path = getattr(args, f'{key}_model') + if model_id_or_path is None: + continue + model_type = getattr(args, f'{key}_model_type') + model_revision = getattr(args, f'{key}_model_revision') + adapters = args.adapters if key == 'ref' else args.reward_adapters + task_type = args.task_type if origin_key == 'ref' else 'seq_cls' # Be aware of the unexpected behavior caused by double monkey patching. 
- self.ref_model, _ = args.get_model_processor( - model=args.ref_model, model_type=args.ref_model_type, model_revision=args.ref_model_revision) - self.ref_model.requires_grad_(False).eval() + model = args.get_model_processor( + model=model_id_or_path, model_type=model_type, model_revision=model_revision, task_type=task_type)[0] + + model = prepare_adapter(args, model, adapters) + if origin_key in {'ref', 'reward'}: + model.requires_grad_(False).eval() + else: + model = self.prepare_model(args, model, task_type=task_type) + logger.info(f'value_model: {model}') + model_parameter_info = get_model_parameter_info(model) + self.train_msg['value_model_parameter_info'] = model_parameter_info + logger.info(f'value_model_parameter_info: {model_parameter_info}') + setattr(self, f'{origin_key}_model', model) super()._prepare_model_tokenizer() def _prepare_template(self) -> None: args = self.args super()._prepare_template() - mode = 'kto' if args.rlhf_type == 'kto' else 'rlhf' - self.template.set_mode(mode) + model_mapping = {'kto': 'kto', 'ppo': 'pt'} + self.template.set_mode(model_mapping.get(args.rlhf_type, 'rlhf')) if args.rlhf_type != 'orpo' or args.model_meta.is_multimodal: # Avoid padding labels during the model's forward pass in multimodal models. self.template.loss_scale = 'last_round' + if args.rlhf_type == 'ppo': + args.training_args.stop_token_id = self.template.template_meta.stop_token_id + def _get_dataset(self): args = self.args train_dataset, val_dataset = super()._get_dataset() @@ -41,8 +70,11 @@ def _get_dataset(self): def _get_trainer_kwargs(self): trainer_kwargs = {} - if self.ref_model: - trainer_kwargs['ref_model'] = self.ref_model + for key in ['ref', 'reward', 'value']: + key = f'{key}_model' + model = getattr(self, key) + if model: + trainer_kwargs[key] = model return trainer_kwargs diff --git a/swift/plugin/loss_scale.py b/swift/plugin/loss_scale.py index 275d2e0e4b..21733dfabe 100644 --- a/swift/plugin/loss_scale.py +++ b/swift/plugin/loss_scale.py @@ -180,11 +180,12 @@ def get_loss_scale(self, context: str, context_type: ContextType, *args, **kwarg # Add your loss scale here, use --loss_scale xxx to train loss_scale_map = { + 'last_round': LastRoundLossScale(), + 'default': LossScale(), + 'all': TrainAllLossScale(), + # agent 'agentflan': AgentFlanLossScale(), 'react': REACTLossScale(), 'alpha_umi': AlphaUmiLossScale(), - 'default': LossScale(), - 'last_round': LastRoundLossScale(), 'qwen': QwenLossScale(), - 'all': TrainAllLossScale(), } diff --git a/swift/trainers/__init__.py b/swift/trainers/__init__.py index 2e57e64de2..da7ab951cc 100644 --- a/swift/trainers/__init__.py +++ b/swift/trainers/__init__.py @@ -15,10 +15,10 @@ ShardedDDPOption = None if TYPE_CHECKING: - from .arguments import (Seq2SeqTrainingArguments, TrainingArguments, DPOConfig, CPOConfig, KTOConfig, ORPOConfig, - PPOConfig, RewardConfig) + from .arguments import Seq2SeqTrainingArguments, TrainingArguments from .rlhf_trainer import (CPOTrainer, DPOTrainer, KTOTrainer, ORPOTrainer, RLHFTrainerMixin, PPOTrainer, RewardTrainer) + from .rlhf_arguments import DPOConfig, CPOConfig, KTOConfig, ORPOConfig, PPOConfig, RewardConfig from .trainer_factory import TrainerFactory from .trainers import Seq2SeqTrainer, Trainer from .mixin import SwiftMixin @@ -26,10 +26,8 @@ else: _extra_objects = {k: v for k, v in globals().items() if not k.startswith('_')} _import_structure = { - 'arguments': [ - 'Seq2SeqTrainingArguments', 'TrainingArguments', 'DPOConfig', 'CPOConfig', 'KTOConfig', 'ORPOConfig', - 'PPOConfig', 
'RewardConfig' - ], + 'arguments': ['Seq2SeqTrainingArguments', 'TrainingArguments'], + 'rlhf_arguments': ['DPOConfig', 'CPOConfig', 'KTOConfig', 'ORPOConfig', 'PPOConfig', 'RewardConfig'], 'rlhf_trainer': ['CPOTrainer', 'DPOTrainer', 'KTOTrainer', 'ORPOTrainer', 'RLHFTrainerMixin', 'PPOTrainer', 'RewardTrainer'], 'trainer_factory': ['TrainerFactory'], diff --git a/swift/trainers/arguments.py b/swift/trainers/arguments.py index a0b78b947f..809d42b85b 100644 --- a/swift/trainers/arguments.py +++ b/swift/trainers/arguments.py @@ -76,40 +76,3 @@ class TrainingArguments(SwiftArgumentsMixin, HfTrainingArguments): @dataclass class Seq2SeqTrainingArguments(SwiftArgumentsMixin, HfSeq2SeqTrainingArguments): pass - - -try: - from trl import (DPOConfig as HfDPOConfig, CPOConfig as HfCPOConfig, ORPOConfig as HfORPOConfig, KTOConfig as - HfKTOConfig, RewardConfig as HfRewardConfig, PPOv2Config as HfPPOConfig) - - @dataclass - class DPOConfig(SwiftArgumentsMixin, HfDPOConfig): - pass - - @dataclass - class CPOConfig(SwiftArgumentsMixin, HfCPOConfig): - pass - - @dataclass - class ORPOConfig(SwiftArgumentsMixin, HfORPOConfig): - pass - - @dataclass - class KTOConfig(SwiftArgumentsMixin, HfKTOConfig): - pass - - @dataclass - class RewardConfig(SwiftArgumentsMixin, HfRewardConfig): - pass - - @dataclass - class PPOConfig(SwiftArgumentsMixin, HfPPOConfig): - pass - -except (ImportError, RuntimeError): - DPOConfig = None - CPOConfig = None - ORPOConfig = None - KTOConfig = None - RewardConfig = None - PPOConfig = None diff --git a/swift/trainers/mixin.py b/swift/trainers/mixin.py index 5eb7a72dd3..8c68e1a27d 100644 --- a/swift/trainers/mixin.py +++ b/swift/trainers/mixin.py @@ -72,6 +72,7 @@ def __init__(self, from swift.trainers.xtuner import init_sequence_parallel_xtuner init_sequence_parallel_xtuner(args.sequence_parallel_size) + self.model_meta = model.model_meta with self.hub.patch_hub(): super().__init__( model=model, @@ -216,7 +217,7 @@ def _save(self, output_dir: Optional[str] = None, state_dict=None): # tokenizer if not is_adapter: from swift.llm import save_checkpoint - additional_saved_files = self.model.model_meta.additional_saved_files + additional_saved_files = self.model_meta.additional_saved_files save_checkpoint(None, self.template.processor, output_dir, additional_saved_files=additional_saved_files) def _fix_zero3_gather_all_parameters(self) -> None: @@ -246,7 +247,7 @@ def _save_checkpoint(self, *args, **kwargs): return result def train(self, *args, **kwargs): - if self.model.model_meta.is_multimodal: + if self.model_meta.is_multimodal: models = list( set([ v for k, v in self.__dict__.items() diff --git a/swift/trainers/rlhf_arguments.py b/swift/trainers/rlhf_arguments.py new file mode 100644 index 0000000000..9db0541522 --- /dev/null +++ b/swift/trainers/rlhf_arguments.py @@ -0,0 +1,40 @@ +from dataclasses import dataclass + +from trl import CPOConfig as HfCPOConfig +from trl import DPOConfig as HfDPOConfig +from trl import KTOConfig as HfKTOConfig +from trl import ORPOConfig as HfORPOConfig +from trl import PPOv2Config as HfPPOv2Config +from trl import RewardConfig as HfRewardConfig + +from .arguments import SwiftArgumentsMixin + + +@dataclass +class DPOConfig(SwiftArgumentsMixin, HfDPOConfig): + pass + + +@dataclass +class CPOConfig(SwiftArgumentsMixin, HfCPOConfig): + pass + + +@dataclass +class ORPOConfig(SwiftArgumentsMixin, HfORPOConfig): + pass + + +@dataclass +class KTOConfig(SwiftArgumentsMixin, HfKTOConfig): + pass + + +@dataclass +class RewardConfig(SwiftArgumentsMixin, 
HfRewardConfig): + pass + + +@dataclass +class PPOConfig(SwiftArgumentsMixin, HfPPOv2Config): + pass diff --git a/swift/trainers/rlhf_trainer/ppo_trainer.py b/swift/trainers/rlhf_trainer/ppo_trainer.py index bcdfbf6b27..1196d5b06c 100644 --- a/swift/trainers/rlhf_trainer/ppo_trainer.py +++ b/swift/trainers/rlhf_trainer/ppo_trainer.py @@ -1,47 +1,45 @@ # Copyright (c) Alibaba, Inc. and its affiliates. +from contextlib import contextmanager + from torch.utils.data import DataLoader from transformers import PreTrainedModel -from trl import PPOv2Trainer as HFPPOTrainer +from trl import PPOv2Trainer as HFPPOv2Trainer +from swift.utils import patch_getattr from ..mixin import SwiftMixin -from .rlhf_mixin import RLHFTrainerMixin +ppo_trainer_init = HFPPOv2Trainer.__init__ +del HFPPOv2Trainer.__init__ -class PPOTrainer(RLHFTrainerMixin, SwiftMixin, HFPPOTrainer): - def __init__(self, model: PreTrainedModel, ref_model: PreTrainedModel, *_args, **kwargs): - kwargs['policy'] = model - kwargs['ref_policy'] = ref_model - super().__init__(model, ref_model, *_args, **kwargs) - # reset dataloader - self.dataloader = DataLoader( - self.train_dataset, - batch_size=self.local_dataloader_batch_size, - shuffle=True, - collate_fn=kwargs['data_collator'], - drop_last=True, # needed; otherwise the last batch will be of ragged shape - ) - self.accelerator.prepare(self.data_collator) - self.eval_dataloader = DataLoader( - self.eval_dataset, - batch_size=self.args.per_device_eval_batch_size, - collate_fn=kwargs['data_collator'], - drop_last=True, - ) # no need to shuffle eval dataset - self.eval_dataloader = self.accelerator.prepare(self.eval_dataloader) +class PPOTrainer(SwiftMixin, HFPPOv2Trainer): - def train(self, *args, **kwargs): - # remove args that are not needed for the HFPPOTrainer - HFPPOTrainer.train(self) + @staticmethod + @contextmanager + def _patch_dataloader(collate_fn): + __init__ = DataLoader.__init__ + def __new_init__(self, *args, **kwargs): + kwargs['collate_fn'] = collate_fn + __init__(self, *args, **kwargs) -def patched_init(self, **kwargs): - kwargs_to_pop = ['model', 'model_init', 'compute_metrics', 'preprocess_logits_for_metrics'] - for kwarg in kwargs_to_pop: - kwargs.pop(kwarg, None) - kwargs['config'] = kwargs.pop('args') - original_init(self, **kwargs) + DataLoader.__init__ = __new_init__ + yield + DataLoader.__init__ = __init__ + def __init__(self, model: PreTrainedModel, ref_model: PreTrainedModel, *_args, **kwargs): + super().__init__(model, *_args, **kwargs) + with self._patch_dataloader(kwargs['data_collator']): + new_kwargs = { + k: v + for k, v in kwargs.items() + if k in ['train_dataset', 'data_collator', 'reward_model', 'value_model', 'eval_dataset'] + } + ppo_trainer_init( + self, config=kwargs['args'], tokenizer=self.tokenizer, policy=model, ref_policy=ref_model, **new_kwargs) + unwrap_model = self.accelerator.unwrap_model(self.model) + patch_getattr(unwrap_model, 'policy') -original_init = HFPPOTrainer.__init__ -HFPPOTrainer.__init__ = patched_init + def train(self, *args, **kwargs): + # remove args that are not needed for the HFPPOTrainer + super().train() diff --git a/swift/trainers/trainer_factory.py b/swift/trainers/trainer_factory.py index 480ca8287d..19c93a042b 100644 --- a/swift/trainers/trainer_factory.py +++ b/swift/trainers/trainer_factory.py @@ -56,4 +56,7 @@ def get_training_args(cls, args): if k not in parameters: args_dict.pop(k) + if 'ppo' in training_args_cls.__name__.lower(): + args_dict['world_size'] = args.global_world_size + return 
training_args_cls(**args_dict) diff --git a/tests/test_align/test_template/test_llm.py b/tests/test_align/test_template/test_llm.py index da54bbb017..0b29120dfe 100644 --- a/tests/test_align/test_template/test_llm.py +++ b/tests/test_align/test_template/test_llm.py @@ -215,7 +215,7 @@ def test_qwen2_reward(): res = _infer_model(pt_engine, messages=messages) pt_engine.default_template.template_backend = 'jinja' res2 = _infer_model(pt_engine, messages=messages) - assert res == res2 == '1.390625' + assert res == '1.84375' and res2 == '1.390625' # \n diff def test_qwen2_5_math(): @@ -239,7 +239,7 @@ def test_skywork_reward(): res = _infer_model(pt_engine, messages=messages) pt_engine.default_template.template_backend = 'jinja' res2 = _infer_model(pt_engine, messages=messages) - assert res == '14.1875' + assert res == '14.25' assert res2 == '13.8125' diff --git a/tests/train/test_ppo.py b/tests/train/test_ppo.py new file mode 100644 index 0000000000..4ad3180502 --- /dev/null +++ b/tests/train/test_ppo.py @@ -0,0 +1,40 @@ +import os + +os.environ['CUDA_VISIBLE_DEVICES'] = '0' + +kwargs = { + 'per_device_train_batch_size': 2, + 'save_steps': 5, + 'gradient_accumulation_steps': 4, + 'num_train_epochs': 1, +} + + +def test_rm(): + from swift.llm import rlhf_main, RLHFArguments, infer_main, InferArguments + result = rlhf_main( + RLHFArguments( + rlhf_type='rm', + model='Shanghai_AI_Laboratory/internlm2-1_8b-reward', + dataset=['hjh0119/shareAI-Llama3-DPO-zh-en-emoji#100'], + **kwargs)) + last_model_checkpoint = result['last_model_checkpoint'] + infer_main(InferArguments(adapters=last_model_checkpoint, load_data_args=True, merge_lora=True)) + + +def test_ppo(): + from swift.llm import rlhf_main, RLHFArguments, infer_main, InferArguments + result = rlhf_main( + RLHFArguments( + rlhf_type='ppo', + model='LLM-Research/Llama-3.2-1B-Instruct', + reward_model='AI-ModelScope/GRM-Llama3.2-3B-rewardmodel-ft', + dataset=['AI-ModelScope/alpaca-gpt4-data-zh#100', 'AI-ModelScope/alpaca-gpt4-data-en#100'], + **kwargs)) + last_model_checkpoint = result['last_model_checkpoint'] + infer_main(InferArguments(adapters=last_model_checkpoint, load_data_args=True, merge_lora=True)) + + +if __name__ == '__main__': + # test_rm() + test_ppo()
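TRL's `PPOv2Trainer` builds its train/eval dataloaders internally, so the `PPOTrainer` in `swift/trainers/rlhf_trainer/ppo_trainer.py` above injects swift's `data_collator` by temporarily overriding `DataLoader.__init__` inside a context manager. A self-contained illustration of that pattern, independent of swift/trl (the try/finally restore is an extra safeguard, not part of the patched code):

```python
# Standalone demo of the DataLoader.__init__ patching pattern used by PPOTrainer:
# while the context manager is active, every DataLoader created receives the given collate_fn.
from contextlib import contextmanager

from torch.utils.data import DataLoader


@contextmanager
def patch_dataloader(collate_fn):
    original_init = DataLoader.__init__

    def new_init(self, *args, **kwargs):
        kwargs['collate_fn'] = collate_fn
        original_init(self, *args, **kwargs)

    DataLoader.__init__ = new_init
    try:
        yield
    finally:
        DataLoader.__init__ = original_init  # restore even if construction raises


if __name__ == '__main__':
    with patch_dataloader(lambda batch: sum(batch)):
        dl = DataLoader(list(range(8)), batch_size=4)  # collate_fn injected here
    print(list(dl))  # [6, 22]: each batch of four ints is collated by sum()
```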