diff --git a/README.md b/README.md index 1be393bcb7..cfbcfdfc4d 100644 --- a/README.md +++ b/README.md @@ -67,7 +67,7 @@ You can contact us and communicate with us by adding our group: - 🍊 **Lightweight Training**: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel. - **Distributed Training**: Supports distributed data parallel (DDP), device_map simple model parallelism, DeepSpeed ZeRO2/ZeRO3, FSDP, and other distributed training techniques. - **Quantization Training**: Supports training quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ. -- **RLHF Training**: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM for both pure text and multi-modal large models. +- **RLHF Training**: Supports human alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM, PPO for both pure text and multi-modal large models. - 🍓 **Multi-Modal Training**: Supports training on different modalities like images, videos, and audio, for tasks like VQA, captioning, OCR, and grounding. - **Interface Training**: Provides capabilities for training, inference, evaluation, quantization through an interface, completing the whole large model pipeline. - **Plugin and Extension**: Supports custom model and dataset extensions, as well as customization of components like loss, metric, trainer, loss-scale, callback, optimizer. @@ -83,7 +83,7 @@ You can contact us and communicate with us by adding our group: - 🎉 2024.08.12: The SWIFT paper has been published on arXiv, and you can read it [here](https://arxiv.org/abs/2408.05517). - 🔥 2024.08.05: Support for using [evalscope](https://github.com/modelscope/evalscope/) as a backend for evaluating large models and multimodal models. - 🔥 2024.07.29: Support for using [vllm](https://github.com/vllm-project/vllm) and [lmdeploy](https://github.com/InternLM/lmdeploy) to accelerate inference for large models and multimodal models. When performing infer/deploy/eval, you can specify `--infer_backend vllm/lmdeploy`. -- 🔥 2024.07.24: Support for human preference alignment training for multimodal large models, including DPO/ORPO/SimPO/CPO/KTO/RM. +- 🔥 2024.07.24: Support for human preference alignment training for multimodal large models, including DPO/ORPO/SimPO/CPO/KTO/RM/PPO. - 🔥 2024.02.01: Support for Agent training! The training algorithm is derived from [this paper](https://arxiv.org/pdf/2309.00986.pdf). 
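The PPO path advertised above is exercised end-to-end by `tests/train/test_ppo.py` and `examples/train/rlhf/ppo.sh` later in this patch. A minimal Python sketch of the same flow, with the model, reward-model and dataset IDs borrowed from that test and every other value assumed:

```python
# Minimal PPO launch via the Python API; mirrors tests/train/test_ppo.py in this patch.
# Model, reward-model and dataset IDs come from that test; substitute your own as needed.
from swift.llm import rlhf_main, RLHFArguments

result = rlhf_main(
    RLHFArguments(
        rlhf_type='ppo',
        model='LLM-Research/Llama-3.2-1B-Instruct',
        reward_model='AI-ModelScope/GRM-Llama3.2-3B-rewardmodel-ft',
        dataset=['AI-ModelScope/alpaca-gpt4-data-zh#100', 'AI-ModelScope/alpaca-gpt4-data-en#100'],
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
    ))
print(result['last_model_checkpoint'])
```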
diff --git a/README_CN.md b/README_CN.md index bf4443bd88..7ca2931fe5 100644 --- a/README_CN.md +++ b/README_CN.md @@ -64,7 +64,7 @@ - 🍊 **轻量训练**:支持了LoRA、QLoRA、DoRA、LoRA+、ReFT、RS-LoRA、LLaMAPro、Adapter、GaLore、Q-Galore、LISA、UnSloth、Liger-Kernel等轻量微调方式。 - **分布式训练**:支持分布式数据并行(DDP)、device_map简易模型并行、DeepSpeed ZeRO2 ZeRO3、FSDP等分布式训练技术。 - **量化训练**:支持对BNB、AWQ、GPTQ、AQLM、HQQ、EETQ量化模型进行训练。 -- **RLHF训练**:支持纯文本大模型和多模态大模型的DPO、CPO、SimPO、ORPO、KTO、RM等人类对齐训练方法。 +- **RLHF训练**:支持纯文本大模型和多模态大模型的DPO、CPO、SimPO、ORPO、KTO、RM、PPO等人类对齐训练方法。 - 🍓 **多模态训练**:支持对图像、视频和语音不同模态模型进行训练,支持VQA、Caption、OCR、Grounding任务的训练。 - **界面训练**:以界面的方式提供训练、推理、评测、量化的能力,完成大模型的全链路。 - **插件化与拓展**:支持自定义模型和数据集拓展,支持对loss、metric、trainer、loss-scale、callback、optimizer等组件进行自定义。 @@ -78,7 +78,7 @@ - 🎉 2024.08.12: SWIFT论文已经发布到arXiv上,可以点击[这里](https://arxiv.org/abs/2408.05517)阅读。 - 🔥 2024.08.05: 支持使用[evalscope](https://github.com/modelscope/evalscope/)作为后端进行大模型和多模态模型的评测。 - 🔥 2024.07.29: 支持使用[vllm](https://github.com/vllm-project/vllm), [lmdeploy](https://github.com/InternLM/lmdeploy)对大模型和多模态大模型进行推理加速,在infer/deploy/eval时额外指定`--infer_backend vllm/lmdeploy`即可。 -- 🔥 2024.07.24: 支持对多模态大模型进行人类偏好对齐训练,包括DPO/ORPO/SimPO/CPO/KTO/RM。 +- 🔥 2024.07.24: 支持对多模态大模型进行人类偏好对齐训练,包括DPO/ORPO/SimPO/CPO/KTO/RM/PPO。 - 🔥 2024.02.01: 支持Agent训练!训练算法源自这篇[论文](https://arxiv.org/pdf/2309.00986.pdf)。 ## 🛠️ 安装 diff --git "a/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\225\260\346\215\256\351\233\206.md" "b/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\225\260\346\215\256\351\233\206.md" index 16b54234be..9c63c6f080 100644 --- "a/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\225\260\346\215\256\351\233\206.md" +++ "b/docs/source/Customization/\350\207\252\345\256\232\344\271\211\346\225\260\346\215\256\351\233\206.md" @@ -67,6 +67,14 @@ query-response格式: {"messages": [{"role": "system", "content": "你是个有用无害的数学计算器"}, {"role": "user", "content": "1+1等于几"}, {"role": "assistant", "content": "等于2"}, {"role": "user", "content": "再加1呢"}, {"role": "assistant", "content": "等于3"}], "label": true} ``` +#### PPO + +```jsonl +{"messages": [{"role": "system", "content": "你是个有用无害的助手"}, {"role": "user", "content": "告诉我明天的天气"}]} +{"messages": [{"role": "system", "content": "你是个有用无害的数学计算器"}, {"role": "user", "content": "1+1等于几"}, {"role": "assistant", "content": "等于2"}, {"role": "user", "content": "再加1呢"}]} +{"messages": [{"role": "user", "content": "你的名字是什么"}]} +``` + ### 序列分类 ```jsonl {"messages": [{"role": "user", "content": "今天天气真好呀"}], "label": 1} diff --git "a/docs/source/GetStarted/\345\277\253\351\200\237\345\274\200\345\247\213.md" "b/docs/source/GetStarted/\345\277\253\351\200\237\345\274\200\345\247\213.md" index 0c0ee21aa0..9892185272 100644 --- "a/docs/source/GetStarted/\345\277\253\351\200\237\345\274\200\345\247\213.md" +++ "b/docs/source/GetStarted/\345\277\253\351\200\237\345\274\200\345\247\213.md" @@ -8,7 +8,7 @@ ms-swift是魔搭社区提供的大模型与多模态大模型训练部署框架 - 🍊 轻量训练:支持了LoRA、QLoRA、DoRA、LoRA+、ReFT、RS-LoRA、LLaMAPro、Adapter、GaLore、Q-Galore、LISA、UnSloth、Liger-Kernel等轻量微调方式。 - 分布式训练:支持分布式数据并行(DDP)、device_map简易模型并行、DeepSpeed ZeRO2 ZeRO3、FSDP等分布式训练技术。 - 量化训练:支持对BNB、AWQ、GPTQ、AQLM、HQQ、EETQ量化模型进行训练。 -- RLHF训练:支持纯文本大模型和多模态大模型的DPO、CPO、SimPO、ORPO、KTO、RM等人类对齐训练方法。 +- RLHF训练:支持纯文本大模型和多模态大模型的DPO、CPO、SimPO、ORPO、KTO、RM、PPO等人类对齐训练方法。 - 🍓 多模态训练:支持对图像、视频和语音不同模态模型进行训练,支持VQA、Caption、OCR、Grounding任务的训练。 - 界面训练:以界面的方式提供训练、推理、评测、量化的能力,完成大模型的全链路。 - 插件化与拓展:支持自定义模型和数据集拓展,支持对loss、metric、trainer、loss-scale、callback、optimizer等组件进行自定义。 diff --git 
a/docs/source/Instruction/ReleaseNote3.0.md b/docs/source/Instruction/ReleaseNote3.0.md index 888b7916fb..eeacc1de51 100644 --- a/docs/source/Instruction/ReleaseNote3.0.md +++ b/docs/source/Instruction/ReleaseNote3.0.md @@ -81,7 +81,6 @@ ## 待完成 -1. RM/PPO能力3.0版本尚不支持,请使用2.6.1版本 -2. 自定义数据集评测3.0版本尚不支持,请使用2.6.1版本 -3. Megatron预训练能力3.0版本尚不支持,请使用2.6.1版本 +1. 自定义数据集评测3.0版本尚不支持,请使用2.6.1版本 +2. Megatron预训练能力3.0版本尚不支持,请使用2.6.1版本 3. 文档和README暂时未更新完整 diff --git "a/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" "b/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" index f0920e9f5e..e8769a1a6c 100644 --- "a/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" +++ "b/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" @@ -50,7 +50,7 @@ - 🔥max_pixels: 多模态模型图片前处理的最大像素数(H\*W),默认不缩放。 - tools_prompt: 智能体训练时的工具列表转为system的格式,请参考[智能体训练](./智能体的支持.md),默认为'react_en' - padding_side: 当训练`batch_size>=2`时的padding_side,可选值为'left', 'right',默认为'right'。(`generate`的batch_size>=2时,只进行左padding) -- loss_scale: 如何针对训练添加token的loss权重。默认为`'default'`,代表所有response(含history)以1计算交叉熵损失。具体可以查看[插件化](../Customization/插件化.md)和[智能体训练](./智能体的支持.md) +- loss_scale: 如何针对训练添加token的loss权重。默认为`'default'`,代表所有response(含history)以1计算交叉熵损失。可选值为'default', 'last_round', 'all', 以及agent需要的loss_scale: 'react', 'agentflan', 'alpha_umi', 'qwen'。具体可以查看[插件化](../Customization/插件化.md)和[智能体训练](./智能体的支持.md) - sequence_parallel_size: 序列并行数量。参考[example](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel/train.sh) - use_chat_template: 使用chat模板或generation模板,默认为`True`。`swift pt`会自动设置为generation模板 - template_backend: 使用swift或jinja进行推理。如果使用jinja,则使用transformers的`apply_chat_template`。默认为swift @@ -307,7 +307,7 @@ Vera使用`target_modules`, `target_regex`, `modules_to_save`三个参数. 
### RLHF参数 RLHF参数继承于[训练参数](#训练参数) -- 🔥rlhf_type: 对齐算法类型,支持`dpo`, `orpo`, `simpo`, `kto`, `cpo`, `rm` +- 🔥rlhf_type: 对齐算法类型,支持`dpo`, `orpo`, `simpo`, `kto`, `cpo`, `rm`, `ppo` - ref_model: DPO等算法中的原始对比模型 - ref_model_type: 同model_type - ref_model_revision: 同model_revision @@ -324,6 +324,27 @@ RLHF参数继承于[训练参数](#训练参数) - desirable_weight: KTO算法中对desirable response的loss权重 $\lambda_D$ ,默认为`1.` - undesirable_weight: KTO论文中对undesirable response的loss权重 $\lambda_U$ , 默认为`1.` +#### PPO参数 +- reward_model: 默认为None +- reward_adapters: 默认为`[]` +- reward_model_type: 默认为None +- reward_model_revision: 默认为None + +以下参数含义可以参考[这里](https://huggingface.co/docs/trl/main/ppo_trainer) +- num_ppo_epochs: 默认为4 +- whiten_rewards: 默认为False +- kl_coef: 默认为0.05 +- cliprange: 默认为0.2 +- vf_coef: 默认为0.1 +- cliprange_value: 默认为0.2 +- gamma: 默认为1.0 +- lam: 默认为0.95 +- num_mini_batches: 默认为1 +- local_rollout_forward_batch_size: 默认为64 +- num_sample_generations: 默认为10 +- response_length: 默认为512 +- temperature: 默认为0.7 +- missing_eos_penalty: 默认为None ### 推理参数 diff --git a/docs/source_en/Customization/Custom-dataset.md b/docs/source_en/Customization/Custom-dataset.md index f7b883f46a..268b06dbc0 100644 --- a/docs/source_en/Customization/Custom-dataset.md +++ b/docs/source_en/Customization/Custom-dataset.md @@ -66,6 +66,14 @@ The following provides the recommended dataset format for ms-swift, where the sy {"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}], "label": true} ``` +#### PPO + +```jsonl +{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]} +{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]} +{"messages": [{"role": "user", "content": "What is your name?"}]} +``` + ### Sequence Classification ```jsonl {"messages": [{"role": "user", "content": "The weather is really nice today"}], "label": 1} diff --git a/docs/source_en/GetStarted/Quick-start.md b/docs/source_en/GetStarted/Quick-start.md index b86891c519..e4887be83c 100644 --- a/docs/source_en/GetStarted/Quick-start.md +++ b/docs/source_en/GetStarted/Quick-start.md @@ -8,7 +8,7 @@ ms-swift is a comprehensive training and deployment framework for large language - 🍊 Lightweight Training: Supports lightweight fine-tuning methods like LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-Galore, LISA, UnSloth, Liger-Kernel, and more. - Distributed Training: Supports distributed data parallel (DDP), simple model parallelism via device_map, DeepSpeed ZeRO2 ZeRO3, FSDP, and other distributed training technologies. - Quantization Training: Provides training for quantized models like BNB, AWQ, GPTQ, AQLM, HQQ, EETQ. -- RLHF Training: Supports human alignment training methods like DPO, CPO, SimPO, ORPO, KTO, RM for both text-based and multimodal large models. +- RLHF Training: Supports human alignment training methods like DPO, CPO, SimPO, ORPO, KTO, RM, PPO for both text-based and multimodal large models. 
- 🍓 Multimodal Training: Capable of training models for different modalities such as images, videos, and audios; supports tasks like VQA (Visual Question Answering), Captioning, OCR (Optical Character Recognition), and Grounding. - Interface-driven Training: Offers training, inference, evaluation, and quantization capabilities through an interface, enabling a complete workflow for large models. - Plugins and Extensions: Allows customization and extension of models and datasets, and supports customizations for components like loss, metric, trainer, loss-scale, callback, optimizer, etc. diff --git a/docs/source_en/Instruction/Command-line-parameters.md b/docs/source_en/Instruction/Command-line-parameters.md index b56a0b1afc..233a51861f 100644 --- a/docs/source_en/Instruction/Command-line-parameters.md +++ b/docs/source_en/Instruction/Command-line-parameters.md @@ -50,7 +50,7 @@ The introduction to command line parameters will cover base arguments, atomic ar - 🔥max_pixels: Maximum pixel count for pre-processing images in multimodal models (H*W), default is no scaling. - tools_prompt: The list of tools for agent training converted to system format, refer to [Agent Training](./Agent-support.md), default is 'react_en'. - padding_side: The padding_side used when training with `batch_size >= 2`, with optional values of 'left' and 'right', defaulting to 'right'. (When the batch_size in `generate` is >= 2, only left padding is applied.) -- loss_scale: How to add token loss weight during training. Default is `'default'`, meaning all responses (including history) are treated as 1 for cross-entropy loss. For specifics, see [Pluginization](../Customization/Pluginization.md) and [Agent Training](./Agent-support.md). +- loss_scale: How to add token loss weight during training. Default is `'default'`, meaning all responses (including history) are treated as 1 for cross-entropy loss. The optional values are 'default', 'last_round', 'all', and the loss scale required by the agent: 'react', 'agentflan', 'alpha_umi', 'qwen'. For specifics, see [Pluginization](../Customization/Pluginization.md) and [Agent Training](./Agent-support.md). - sequence_parallel_size: Number of sequence parallelism. Refer to [example](https://github.com/modelscope/ms-swift/tree/main/examples/train/sequence_parallel/train.sh). - use_chat_template: Use chat template or generation template, default is `True`. `swift pt` is automatically set to the generation template. - template_backend: Use swift or jinja for inference. If using jinja, it will utilize transformers' `apply_chat_template`. Default is swift. @@ -311,23 +311,47 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine RLHF arguments inherit from the [training arguments](#training-arguments). -- 🔥rlhf_type: Alignment algorithm type, supports `dpo`, `orpo`, `simpo`, `kto`, `cpo`, `rm`. +- 🔥rlhf_type: Alignment algorithm type, supports `dpo`, `orpo`, `simpo`, `kto`, `cpo`, `rm`, `ppo`. - ref_model: Original comparison model in algorithms like DPO. - ref_model_type: Same as model_type. - ref_model_revision: Same as model_revision. - 🔥beta: KL regularization term coefficient, default is `None`, i.e., for `simpo` algorithm default is `2.`, for other algorithms default is `0.1`. Refer to the [documentation](./Human-alignment.md) for specifics. - label_smoothing: Whether to use DPO smoothing, default value is `0`, generally set between 0~0.5. -- + - 🔥rpo_alpha: Weight for adding sft_loss in DPO, default is `1`. 
The final loss is `KL_loss + rpo_alpha * sft_loss`. -- + - cpo_alpha: The coefficient of nll loss in CPO/SimPO loss, default is `1.`. -- + - simpo_gamma: Reward margin term in SimPO algorithm, recommended to set between 0.5-1.5 in the paper, default is `1.`. -- + - desirable_weight: Loss weight for desirable response in KTO algorithm $\lambda_D$, default is `1.`. - undesirable_weight: Loss weight for undesirable response in KTO paper $\lambda_U$, default is `1.`. +#### PPO Arguments + +- reward_model: Defaults to None +- reward_adapters: Defaults to `[]` +- reward_model_type: Defaults to None +- reward_model_revision: Defaults to None + +The meanings of the following parameters can be referenced [here](https://huggingface.co/docs/trl/main/ppo_trainer): + +- num_ppo_epochs: Defaults to 4 +- whiten_rewards: Defaults to False +- kl_coef: Defaults to 0.05 +- cliprange: Defaults to 0.2 +- vf_coef: Defaults to 0.1 +- cliprange_value: Defaults to 0.2 +- gamma: Defaults to 1.0 +- lam: Defaults to 0.95 +- num_mini_batches: Defaults to 1 +- local_rollout_forward_batch_size: Defaults to 64 +- num_sample_generations: Defaults to 10 +- response_length: Defaults to 512 +- temperature: Defaults to 0.7 +- missing_eos_penalty: Defaults to None + ### Inference Arguments Inference arguments include the [base arguments](#base-arguments), [merge arguments](#merge-arguments), [vLLM arguments](#vllm-arguments), [LMDeploy arguments](#LMDeploy-arguments), and also contain the following: diff --git a/docs/source_en/Instruction/ReleaseNote3.0.md b/docs/source_en/Instruction/ReleaseNote3.0.md index bbd4e6fc42..e49bd444d6 100644 --- a/docs/source_en/Instruction/ReleaseNote3.0.md +++ b/docs/source_en/Instruction/ReleaseNote3.0.md @@ -94,7 +94,6 @@ The parameters marked as compatible in version 2.0 have been entirely removed. ## Pending Tasks -1. RM/PPO capabilities are not supported in version 3.0. Please use version 2.6.1. -2. Custom dataset evaluation is not supported in version 3.0. Please use version 2.6.1. -3. Megatron pre-training capabilities are not supported in version 3.0. Please use version 2.6.1. -4. Documentation and README are temporarily incomplete and will be updated. +1. Custom dataset evaluation is not supported in version 3.0. Please use version 2.6.1. +2. Megatron pre-training capabilities are not supported in version 3.0. Please use version 2.6.1. +3. Documentation and README are temporarily incomplete and will be updated. 
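The PPO-specific arguments documented in `Command-line-parameters.md` above map one-to-one onto the `PPOArguments` dataclass added in `swift/llm/argument/rlhf_args.py` further down in this patch. A sketch of setting them from Python, assuming the documented defaults and reusing the model, reward-model and dataset names from `examples/train/rlhf/ppo.sh`:

```python
# Sketch only: PPO knobs shown at their documented defaults; model and dataset IDs
# are taken from examples/train/rlhf/ppo.sh in this patch.
from swift.llm import RLHFArguments

args = RLHFArguments(
    rlhf_type='ppo',
    model='LLM-Research/Meta-Llama-3.1-8B-Instruct',
    reward_model='AI-ModelScope/Skywork-Reward-Llama-3.1-8B-v0.2',
    dataset=['AI-ModelScope/alpaca-gpt4-data-zh#20000', 'AI-ModelScope/alpaca-gpt4-data-en#20000'],
    train_type='lora',
    # PPO hyper-parameters (documented defaults)
    num_ppo_epochs=4,
    whiten_rewards=False,
    kl_coef=0.05,
    cliprange=0.2,
    vf_coef=0.1,
    cliprange_value=0.2,
    gamma=1.0,
    lam=0.95,
    num_mini_batches=1,
    local_rollout_forward_batch_size=64,
    num_sample_generations=10,
    response_length=512,
    temperature=0.7,
)
```

As the comment at the top of `ppo.sh` notes, PPO currently requires the policy model and the reward model to use the same template/tokenizer.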
diff --git a/examples/deploy/lora/client.py b/examples/deploy/lora/client.py index e61caad8ae..ae66b10df0 100644 --- a/examples/deploy/lora/client.py +++ b/examples/deploy/lora/client.py @@ -23,5 +23,5 @@ def infer_multilora(engine: InferClient, infer_request: InferRequest): if __name__ == '__main__': engine = InferClient(host='127.0.0.1', port=8000) - infer_request = InferRequest(messages=[{'role': 'user', 'content': '你是谁'}]) + infer_request = InferRequest(messages=[{'role': 'user', 'content': 'who are you?'}]) infer_multilora(engine, infer_request) diff --git a/examples/infer/demo_hf.py b/examples/infer/demo_hf.py new file mode 100644 index 0000000000..58959078f8 --- /dev/null +++ b/examples/infer/demo_hf.py @@ -0,0 +1,60 @@ +def infer_hf(): + from transformers import AutoModelForCausalLM, AutoTokenizer + from peft import PeftModel + from modelscope import snapshot_download + model_dir = snapshot_download('Qwen/Qwen2.5-7B-Instruct') + adapter_dir = snapshot_download('swift/test_lora') + model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype='auto', device_map='auto') + model = PeftModel.from_pretrained(model, adapter_dir) + + tokenizer = AutoTokenizer.from_pretrained(model_dir) + + messages = [{ + 'role': 'system', + 'content': 'You are a helpful assistant.' + }, { + 'role': 'user', + 'content': 'who are you?' + }] + text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) + model_inputs = tokenizer([text], return_tensors='pt').to(model.device) + + generated_ids = model.generate(**model_inputs, max_new_tokens=512, do_sample=False) + generated_ids = [ + output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) + ] + + response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] + print(f'response: {response}') + return response + + +def infer_swift(): + from swift.llm import get_model_tokenizer, get_template, InferRequest, RequestConfig, PtEngine + from modelscope import snapshot_download + from swift.tuners import Swift + model_dir = snapshot_download('Qwen/Qwen2.5-7B-Instruct') + adapter_dir = snapshot_download('swift/test_lora') + model, tokenizer = get_model_tokenizer(model_dir, device_map='auto') + model = Swift.from_pretrained(model, adapter_dir) + template = get_template(model.model_meta.template, tokenizer) + engine = PtEngine.from_model_template(model, template) + + messages = [{ + 'role': 'system', + 'content': 'You are a helpful assistant.' + }, { + 'role': 'user', + 'content': 'who are you?' 
+ }] + request_config = RequestConfig(max_tokens=512, temperature=0) + resp_list = engine.infer([InferRequest(messages=messages)], request_config=request_config) + response = resp_list[0].choices[0].message.content + print(f'response: {response}') + return response + + +if __name__ == '__main__': + response = infer_hf() + response2 = infer_swift() + assert response == response2 diff --git a/examples/infer/demo_lora.py b/examples/infer/demo_lora.py index 7489d1c38a..8d9396f135 100644 --- a/examples/infer/demo_lora.py +++ b/examples/infer/demo_lora.py @@ -63,6 +63,6 @@ def infer_lora(infer_request: 'InferRequest'): from swift.llm import (PtEngine, RequestConfig, AdapterRequest, get_template, BaseArguments, InferRequest, safe_snapshot_download, get_model_tokenizer) from swift.tuners import Swift - infer_request = InferRequest(messages=[{'role': 'user', 'content': '你是谁'}]) + infer_request = InferRequest(messages=[{'role': 'user', 'content': 'who are you?'}]) # infer_lora(infer_request) infer_multilora(infer_request, 'pt') diff --git a/examples/train/rlhf/ppo.sh b/examples/train/rlhf/ppo.sh new file mode 100644 index 0000000000..86f93d9348 --- /dev/null +++ b/examples/train/rlhf/ppo.sh @@ -0,0 +1,30 @@ +# Currently, it only supports the case where the model and reward_model use the same template/tokenizer. +nproc_per_node=4 + +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +NPROC_PER_NODE=$nproc_per_node \ +swift rlhf \ + --rlhf_type ppo \ + --model LLM-Research/Meta-Llama-3.1-8B-Instruct \ + --reward_model 'AI-ModelScope/Skywork-Reward-Llama-3.1-8B-v0.2' \ + --train_type lora \ + --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#20000' 'AI-ModelScope/alpaca-gpt4-data-en#20000' \ + --torch_dtype bfloat16 \ + --num_train_epochs 1 \ + --per_device_train_batch_size 1 \ + --per_device_eval_batch_size 1 \ + --learning_rate 1e-5 \ + --lora_rank 8 \ + --lora_alpha 32 \ + --target_modules all-linear \ + --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ + --eval_steps 100 \ + --save_steps 100 \ + --save_total_limit 5 \ + --logging_steps 5 \ + --max_length 2048 \ + --output_dir output \ + --warmup_ratio 0.05 \ + --dataloader_num_workers 4 \ + --deepspeed zero2 \ + --response_length 512 diff --git a/swift/llm/argument/rlhf_args.py b/swift/llm/argument/rlhf_args.py index 89d167ee3c..8ddd396f53 100644 --- a/swift/llm/argument/rlhf_args.py +++ b/swift/llm/argument/rlhf_args.py @@ -1,13 +1,38 @@ # Copyright (c) Alibaba, Inc. and its affiliates. 
from dataclasses import dataclass, field -from typing import Literal, Optional +from typing import List, Literal, Optional from swift.llm import MODEL_MAPPING from .train_args import TrainArguments @dataclass -class RLHFArguments(TrainArguments): +class PPOArguments: + reward_model: Optional[str] = None + reward_adapters: List[str] = field(default_factory=list) + reward_model_type: Optional[str] = field( + default=None, metadata={'help': f'model_type choices: {list(MODEL_MAPPING.keys())}'}) + reward_model_revision: Optional[str] = None + + num_ppo_epochs: int = 4 + whiten_rewards: bool = False + kl_coef: float = 0.05 + cliprange: float = 0.2 + vf_coef: float = 0.1 + cliprange_value: float = 0.2 + gamma: float = 1.0 + lam: float = 0.95 + + num_mini_batches: int = 1 + local_rollout_forward_batch_size: int = 64 + num_sample_generations: int = 10 + response_length: int = 512 + temperature: float = 0.7 + missing_eos_penalty: Optional[float] = None + + +@dataclass +class RLHFArguments(PPOArguments, TrainArguments): """ RLHFArguments is a dataclass that holds arguments specific to the Reinforcement Learning with Human Feedback (RLHF) training backend. @@ -25,7 +50,7 @@ class RLHFArguments(TrainArguments): desirable_weight (float): Weight for desirable outcomes in KTO. Default is 1.0. undesirable_weight (float): Weight for undesirable outcomes in KTO. Default is 1.0. """ - rlhf_type: Literal['dpo', 'orpo', 'simpo', 'kto', 'cpo', 'rm'] = 'dpo' + rlhf_type: Literal['dpo', 'orpo', 'simpo', 'kto', 'cpo', 'rm', 'ppo'] = 'dpo' ref_model: Optional[str] = None ref_model_type: Optional[str] = field( default=None, metadata={'help': f'model_type choices: {list(MODEL_MAPPING.keys())}'}) @@ -48,6 +73,7 @@ def __post_init__(self): self._init_simpo() self._set_default() super().__post_init__() + self._init_ppo() if self.rlhf_type in ['dpo', 'kto'] and self.train_type == 'full' or self.rlhf_type == 'ppo': self.ref_model = self.ref_model or self.model @@ -56,6 +82,13 @@ def __post_init__(self): elif self.ref_model is not None: raise ValueError('CPO/ORPO or LoRA training does not require a ref_model to be passed in.') + def _init_ppo(self): + if self.rlhf_type == 'ppo': + self.padding_side = 'left' + self.metric_for_best_model = None + self.training_args.metric_for_best_model = None + # TODO: streaming, MLLM + def _init_simpo(self): if self.rlhf_type != 'simpo': return diff --git a/swift/llm/template/base.py b/swift/llm/template/base.py index a4b2aa7c1d..d2a2ae84a8 100644 --- a/swift/llm/template/base.py +++ b/swift/llm/template/base.py @@ -598,7 +598,7 @@ def _swift_encode(self, inputs: StdTemplateInputs): context_list = prompt.copy() extra_context_list = [] extra_context_type = None - if i < n_round - 1 or self.mode == 'seq_cls' and response is not None: + if i < n_round - 1: # Not the last round. 
context_list.append('{{RESPONSE}}') extra_context_list = template_meta.chat_sep diff --git a/swift/llm/template/template/mplug.py b/swift/llm/template/template/mplug.py index 4e25652257..9882cd3388 100644 --- a/swift/llm/template/template/mplug.py +++ b/swift/llm/template/template/mplug.py @@ -97,7 +97,7 @@ def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]: if images: image_inputs = processor.image_processor(images, cut_enable=cut_enable, return_tensors='pt') added_tokens_len = 0 - cut_shapes = image_inputs['cut_shape'] or [None] * len(idx_list) + cut_shapes = image_inputs['cut_shape'] or [None] * 2 * len(idx_list) image_token_list = self.processor.encode('<|image|>', add_special_tokens=False) for idx, cut_shape in zip(idx_list, cut_shapes[::2]): if cut_shape: @@ -161,6 +161,8 @@ def _post_encode(self, model: nn.Module, inputs: Dict[str, Any]) -> Dict[str, An if 'pixel_values' in inputs: pixel_values = inputs.pop('pixel_values') inputs['image_embeds'] = torch.concat([model.forward_image(pv) for pv in pixel_values]) + else: + inputs['media_offset'] = [None] * inputs['input_ids'].shape[0] return inputs diff --git a/swift/llm/train/rlhf.py b/swift/llm/train/rlhf.py index 2e5bff8910..feffd4e65c 100644 --- a/swift/llm/train/rlhf.py +++ b/swift/llm/train/rlhf.py @@ -1,37 +1,66 @@ # Copyright (c) Alibaba, Inc. and its affiliates. from typing import List, Union -from swift.utils import patch_getattr +from swift.utils import get_logger, get_model_parameter_info from ..argument import RLHFArguments from .kto import prepare_kto_dataset from .sft import SwiftSft +logger = get_logger() + class SwiftRLHF(SwiftSft): args_class = RLHFArguments args: args_class def _prepare_model_tokenizer(self): + from swift.llm.infer.utils import prepare_adapter args = self.args - self.ref_model = None - if args.ref_model: + for key in ['ref', 'reward', 'value']: + origin_key = key + setattr(self, f'{key}_model', None) + if key == 'value': + if args.rlhf_type == 'ppo': + key = 'reward' + else: + continue + model_id_or_path = getattr(args, f'{key}_model') + if model_id_or_path is None: + continue + model_type = getattr(args, f'{key}_model_type') + model_revision = getattr(args, f'{key}_model_revision') + adapters = args.adapters if key == 'ref' else args.reward_adapters + task_type = args.task_type if origin_key == 'ref' else 'seq_cls' # Be aware of the unexpected behavior caused by double monkey patching. 
- self.ref_model, _ = args.get_model_processor( - model=args.ref_model, model_type=args.ref_model_type, model_revision=args.ref_model_revision) - self.ref_model.requires_grad_(False).eval() + model = args.get_model_processor( + model=model_id_or_path, model_type=model_type, model_revision=model_revision, task_type=task_type)[0] + + model = prepare_adapter(args, model, adapters) + if origin_key in {'ref', 'reward'}: + model.requires_grad_(False).eval() + else: + model = self.prepare_model(args, model, task_type=task_type) + logger.info(f'value_model: {model}') + model_parameter_info = get_model_parameter_info(model) + self.train_msg['value_model_parameter_info'] = model_parameter_info + logger.info(f'value_model_parameter_info: {model_parameter_info}') + setattr(self, f'{origin_key}_model', model) super()._prepare_model_tokenizer() def _prepare_template(self) -> None: args = self.args super()._prepare_template() - mode = 'kto' if args.rlhf_type == 'kto' else 'rlhf' - self.template.set_mode(mode) + model_mapping = {'kto': 'kto', 'ppo': 'pt'} + self.template.set_mode(model_mapping.get(args.rlhf_type, 'rlhf')) if args.rlhf_type != 'orpo' or args.model_meta.is_multimodal: # Avoid padding labels during the model's forward pass in multimodal models. self.template.loss_scale = 'last_round' + if args.rlhf_type == 'ppo': + args.training_args.stop_token_id = self.template.template_meta.stop_token_id + def _get_dataset(self): args = self.args train_dataset, val_dataset = super()._get_dataset() @@ -41,8 +70,11 @@ def _get_dataset(self): def _get_trainer_kwargs(self): trainer_kwargs = {} - if self.ref_model: - trainer_kwargs['ref_model'] = self.ref_model + for key in ['ref', 'reward', 'value']: + key = f'{key}_model' + model = getattr(self, key) + if model: + trainer_kwargs[key] = model return trainer_kwargs diff --git a/swift/plugin/loss_scale.py b/swift/plugin/loss_scale.py index 275d2e0e4b..21733dfabe 100644 --- a/swift/plugin/loss_scale.py +++ b/swift/plugin/loss_scale.py @@ -180,11 +180,12 @@ def get_loss_scale(self, context: str, context_type: ContextType, *args, **kwarg # Add your loss scale here, use --loss_scale xxx to train loss_scale_map = { + 'last_round': LastRoundLossScale(), + 'default': LossScale(), + 'all': TrainAllLossScale(), + # agent 'agentflan': AgentFlanLossScale(), 'react': REACTLossScale(), 'alpha_umi': AlphaUmiLossScale(), - 'default': LossScale(), - 'last_round': LastRoundLossScale(), 'qwen': QwenLossScale(), - 'all': TrainAllLossScale(), } diff --git a/swift/trainers/__init__.py b/swift/trainers/__init__.py index 2e57e64de2..da7ab951cc 100644 --- a/swift/trainers/__init__.py +++ b/swift/trainers/__init__.py @@ -15,10 +15,10 @@ ShardedDDPOption = None if TYPE_CHECKING: - from .arguments import (Seq2SeqTrainingArguments, TrainingArguments, DPOConfig, CPOConfig, KTOConfig, ORPOConfig, - PPOConfig, RewardConfig) + from .arguments import Seq2SeqTrainingArguments, TrainingArguments from .rlhf_trainer import (CPOTrainer, DPOTrainer, KTOTrainer, ORPOTrainer, RLHFTrainerMixin, PPOTrainer, RewardTrainer) + from .rlhf_arguments import DPOConfig, CPOConfig, KTOConfig, ORPOConfig, PPOConfig, RewardConfig from .trainer_factory import TrainerFactory from .trainers import Seq2SeqTrainer, Trainer from .mixin import SwiftMixin @@ -26,10 +26,8 @@ else: _extra_objects = {k: v for k, v in globals().items() if not k.startswith('_')} _import_structure = { - 'arguments': [ - 'Seq2SeqTrainingArguments', 'TrainingArguments', 'DPOConfig', 'CPOConfig', 'KTOConfig', 'ORPOConfig', - 'PPOConfig', 
'RewardConfig' - ], + 'arguments': ['Seq2SeqTrainingArguments', 'TrainingArguments'], + 'rlhf_arguments': ['DPOConfig', 'CPOConfig', 'KTOConfig', 'ORPOConfig', 'PPOConfig', 'RewardConfig'], 'rlhf_trainer': ['CPOTrainer', 'DPOTrainer', 'KTOTrainer', 'ORPOTrainer', 'RLHFTrainerMixin', 'PPOTrainer', 'RewardTrainer'], 'trainer_factory': ['TrainerFactory'], diff --git a/swift/trainers/arguments.py b/swift/trainers/arguments.py index a0b78b947f..809d42b85b 100644 --- a/swift/trainers/arguments.py +++ b/swift/trainers/arguments.py @@ -76,40 +76,3 @@ class TrainingArguments(SwiftArgumentsMixin, HfTrainingArguments): @dataclass class Seq2SeqTrainingArguments(SwiftArgumentsMixin, HfSeq2SeqTrainingArguments): pass - - -try: - from trl import (DPOConfig as HfDPOConfig, CPOConfig as HfCPOConfig, ORPOConfig as HfORPOConfig, KTOConfig as - HfKTOConfig, RewardConfig as HfRewardConfig, PPOv2Config as HfPPOConfig) - - @dataclass - class DPOConfig(SwiftArgumentsMixin, HfDPOConfig): - pass - - @dataclass - class CPOConfig(SwiftArgumentsMixin, HfCPOConfig): - pass - - @dataclass - class ORPOConfig(SwiftArgumentsMixin, HfORPOConfig): - pass - - @dataclass - class KTOConfig(SwiftArgumentsMixin, HfKTOConfig): - pass - - @dataclass - class RewardConfig(SwiftArgumentsMixin, HfRewardConfig): - pass - - @dataclass - class PPOConfig(SwiftArgumentsMixin, HfPPOConfig): - pass - -except (ImportError, RuntimeError): - DPOConfig = None - CPOConfig = None - ORPOConfig = None - KTOConfig = None - RewardConfig = None - PPOConfig = None diff --git a/swift/trainers/mixin.py b/swift/trainers/mixin.py index 5eb7a72dd3..8c68e1a27d 100644 --- a/swift/trainers/mixin.py +++ b/swift/trainers/mixin.py @@ -72,6 +72,7 @@ def __init__(self, from swift.trainers.xtuner import init_sequence_parallel_xtuner init_sequence_parallel_xtuner(args.sequence_parallel_size) + self.model_meta = model.model_meta with self.hub.patch_hub(): super().__init__( model=model, @@ -216,7 +217,7 @@ def _save(self, output_dir: Optional[str] = None, state_dict=None): # tokenizer if not is_adapter: from swift.llm import save_checkpoint - additional_saved_files = self.model.model_meta.additional_saved_files + additional_saved_files = self.model_meta.additional_saved_files save_checkpoint(None, self.template.processor, output_dir, additional_saved_files=additional_saved_files) def _fix_zero3_gather_all_parameters(self) -> None: @@ -246,7 +247,7 @@ def _save_checkpoint(self, *args, **kwargs): return result def train(self, *args, **kwargs): - if self.model.model_meta.is_multimodal: + if self.model_meta.is_multimodal: models = list( set([ v for k, v in self.__dict__.items() diff --git a/swift/trainers/rlhf_arguments.py b/swift/trainers/rlhf_arguments.py new file mode 100644 index 0000000000..9db0541522 --- /dev/null +++ b/swift/trainers/rlhf_arguments.py @@ -0,0 +1,40 @@ +from dataclasses import dataclass + +from trl import CPOConfig as HfCPOConfig +from trl import DPOConfig as HfDPOConfig +from trl import KTOConfig as HfKTOConfig +from trl import ORPOConfig as HfORPOConfig +from trl import PPOv2Config as HfPPOv2Config +from trl import RewardConfig as HfRewardConfig + +from .arguments import SwiftArgumentsMixin + + +@dataclass +class DPOConfig(SwiftArgumentsMixin, HfDPOConfig): + pass + + +@dataclass +class CPOConfig(SwiftArgumentsMixin, HfCPOConfig): + pass + + +@dataclass +class ORPOConfig(SwiftArgumentsMixin, HfORPOConfig): + pass + + +@dataclass +class KTOConfig(SwiftArgumentsMixin, HfKTOConfig): + pass + + +@dataclass +class RewardConfig(SwiftArgumentsMixin, 
HfRewardConfig): + pass + + +@dataclass +class PPOConfig(SwiftArgumentsMixin, HfPPOv2Config): + pass diff --git a/swift/trainers/rlhf_trainer/ppo_trainer.py b/swift/trainers/rlhf_trainer/ppo_trainer.py index bcdfbf6b27..1196d5b06c 100644 --- a/swift/trainers/rlhf_trainer/ppo_trainer.py +++ b/swift/trainers/rlhf_trainer/ppo_trainer.py @@ -1,47 +1,45 @@ # Copyright (c) Alibaba, Inc. and its affiliates. +from contextlib import contextmanager + from torch.utils.data import DataLoader from transformers import PreTrainedModel -from trl import PPOv2Trainer as HFPPOTrainer +from trl import PPOv2Trainer as HFPPOv2Trainer +from swift.utils import patch_getattr from ..mixin import SwiftMixin -from .rlhf_mixin import RLHFTrainerMixin +ppo_trainer_init = HFPPOv2Trainer.__init__ +del HFPPOv2Trainer.__init__ -class PPOTrainer(RLHFTrainerMixin, SwiftMixin, HFPPOTrainer): - def __init__(self, model: PreTrainedModel, ref_model: PreTrainedModel, *_args, **kwargs): - kwargs['policy'] = model - kwargs['ref_policy'] = ref_model - super().__init__(model, ref_model, *_args, **kwargs) - # reset dataloader - self.dataloader = DataLoader( - self.train_dataset, - batch_size=self.local_dataloader_batch_size, - shuffle=True, - collate_fn=kwargs['data_collator'], - drop_last=True, # needed; otherwise the last batch will be of ragged shape - ) - self.accelerator.prepare(self.data_collator) - self.eval_dataloader = DataLoader( - self.eval_dataset, - batch_size=self.args.per_device_eval_batch_size, - collate_fn=kwargs['data_collator'], - drop_last=True, - ) # no need to shuffle eval dataset - self.eval_dataloader = self.accelerator.prepare(self.eval_dataloader) +class PPOTrainer(SwiftMixin, HFPPOv2Trainer): - def train(self, *args, **kwargs): - # remove args that are not needed for the HFPPOTrainer - HFPPOTrainer.train(self) + @staticmethod + @contextmanager + def _patch_dataloader(collate_fn): + __init__ = DataLoader.__init__ + def __new_init__(self, *args, **kwargs): + kwargs['collate_fn'] = collate_fn + __init__(self, *args, **kwargs) -def patched_init(self, **kwargs): - kwargs_to_pop = ['model', 'model_init', 'compute_metrics', 'preprocess_logits_for_metrics'] - for kwarg in kwargs_to_pop: - kwargs.pop(kwarg, None) - kwargs['config'] = kwargs.pop('args') - original_init(self, **kwargs) + DataLoader.__init__ = __new_init__ + yield + DataLoader.__init__ = __init__ + def __init__(self, model: PreTrainedModel, ref_model: PreTrainedModel, *_args, **kwargs): + super().__init__(model, *_args, **kwargs) + with self._patch_dataloader(kwargs['data_collator']): + new_kwargs = { + k: v + for k, v in kwargs.items() + if k in ['train_dataset', 'data_collator', 'reward_model', 'value_model', 'eval_dataset'] + } + ppo_trainer_init( + self, config=kwargs['args'], tokenizer=self.tokenizer, policy=model, ref_policy=ref_model, **new_kwargs) + unwrap_model = self.accelerator.unwrap_model(self.model) + patch_getattr(unwrap_model, 'policy') -original_init = HFPPOTrainer.__init__ -HFPPOTrainer.__init__ = patched_init + def train(self, *args, **kwargs): + # remove args that are not needed for the HFPPOTrainer + super().train() diff --git a/swift/trainers/trainer_factory.py b/swift/trainers/trainer_factory.py index 480ca8287d..19c93a042b 100644 --- a/swift/trainers/trainer_factory.py +++ b/swift/trainers/trainer_factory.py @@ -56,4 +56,7 @@ def get_training_args(cls, args): if k not in parameters: args_dict.pop(k) + if 'ppo' in training_args_cls.__name__.lower(): + args_dict['world_size'] = args.global_world_size + return 
training_args_cls(**args_dict) diff --git a/tests/test_align/test_template/test_llm.py b/tests/test_align/test_template/test_llm.py index da54bbb017..0b29120dfe 100644 --- a/tests/test_align/test_template/test_llm.py +++ b/tests/test_align/test_template/test_llm.py @@ -215,7 +215,7 @@ def test_qwen2_reward(): res = _infer_model(pt_engine, messages=messages) pt_engine.default_template.template_backend = 'jinja' res2 = _infer_model(pt_engine, messages=messages) - assert res == res2 == '1.390625' + assert res == '1.84375' and res2 == '1.390625' # \n diff def test_qwen2_5_math(): @@ -239,7 +239,7 @@ def test_skywork_reward(): res = _infer_model(pt_engine, messages=messages) pt_engine.default_template.template_backend = 'jinja' res2 = _infer_model(pt_engine, messages=messages) - assert res == '14.1875' + assert res == '14.25' assert res2 == '13.8125' diff --git a/tests/train/test_ppo.py b/tests/train/test_ppo.py new file mode 100644 index 0000000000..4ad3180502 --- /dev/null +++ b/tests/train/test_ppo.py @@ -0,0 +1,40 @@ +import os + +os.environ['CUDA_VISIBLE_DEVICES'] = '0' + +kwargs = { + 'per_device_train_batch_size': 2, + 'save_steps': 5, + 'gradient_accumulation_steps': 4, + 'num_train_epochs': 1, +} + + +def test_rm(): + from swift.llm import rlhf_main, RLHFArguments, infer_main, InferArguments + result = rlhf_main( + RLHFArguments( + rlhf_type='rm', + model='Shanghai_AI_Laboratory/internlm2-1_8b-reward', + dataset=['hjh0119/shareAI-Llama3-DPO-zh-en-emoji#100'], + **kwargs)) + last_model_checkpoint = result['last_model_checkpoint'] + infer_main(InferArguments(adapters=last_model_checkpoint, load_data_args=True, merge_lora=True)) + + +def test_ppo(): + from swift.llm import rlhf_main, RLHFArguments, infer_main, InferArguments + result = rlhf_main( + RLHFArguments( + rlhf_type='ppo', + model='LLM-Research/Llama-3.2-1B-Instruct', + reward_model='AI-ModelScope/GRM-Llama3.2-3B-rewardmodel-ft', + dataset=['AI-ModelScope/alpaca-gpt4-data-zh#100', 'AI-ModelScope/alpaca-gpt4-data-en#100'], + **kwargs)) + last_model_checkpoint = result['last_model_checkpoint'] + infer_main(InferArguments(adapters=last_model_checkpoint, load_data_args=True, merge_lora=True)) + + +if __name__ == '__main__': + # test_rm() + test_ppo()
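TRL's `PPOv2Trainer` builds its train/eval dataloaders internally, so the `PPOTrainer` in `swift/trainers/rlhf_trainer/ppo_trainer.py` above injects swift's `data_collator` by temporarily overriding `DataLoader.__init__` inside a context manager. A self-contained illustration of that pattern, independent of swift/trl (the try/finally restore is an extra safeguard, not part of the patched code):

```python
# Standalone demo of the DataLoader.__init__ patching pattern used by PPOTrainer:
# while the context manager is active, every DataLoader created receives the given collate_fn.
from contextlib import contextmanager

from torch.utils.data import DataLoader


@contextmanager
def patch_dataloader(collate_fn):
    original_init = DataLoader.__init__

    def new_init(self, *args, **kwargs):
        kwargs['collate_fn'] = collate_fn
        original_init(self, *args, **kwargs)

    DataLoader.__init__ = new_init
    try:
        yield
    finally:
        DataLoader.__init__ = original_init  # restore even if construction raises


if __name__ == '__main__':
    with patch_dataloader(lambda batch: sum(batch)):
        dl = DataLoader(list(range(8)), batch_size=4)  # collate_fn injected here
    print(list(dl))  # [6, 22]: each batch of four ints is collated by sum()
```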