diff --git a/README.md b/README.md
index 7a9ef05053..4561df6bbe 100644
--- a/README.md
+++ b/README.md
@@ -65,7 +65,7 @@ Users can check the [documentation of SWIFT](docs/source/GetStarted/快速使用
 - 2023.1.4: Support for **VLLM deployment**, compatible with the **OpenAI API** style. For more details, please refer to [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md#部署)
 - 2023.1.4: Update [Benchmark](https://github.com/modelscope/swift/blob/main/docs/source/LLM/Benchmark.md) to facilitate viewing the training speed and GPU memory required for different models.
 - 🔥 2023.12.29: Support web-ui for training and inference, use `swift web-ui` after the installation of ms-swift.
-- 🔥 2023.12.29: Support DPO RLHF(Reinforcement Learning from Human Feedback) and two datasets: AI-ModelScope/stack-exchange-paired and AI-ModelScope/hh-rlhf for this task. Use [this script](https://github.com/modelscope/swift/blob/v1.5.0/examples/pytorch/llm/scripts/dpo/lora/dpo.sh) to start training!
+- 🔥 2023.12.29: Support DPO RLHF(Reinforcement Learning from Human Feedback) and two datasets: AI-ModelScope/stack-exchange-paired and AI-ModelScope/hh-rlhf for this task. Use [this script](https://github.com/modelscope/swift/blob/v1.5.0/examples/pytorch/llm/scripts/dpo/lora_ddp_mp/dpo.sh) to start training!
 - 🔥 2023.12.28: Support SCEdit! This framework can easily reduce memory usage in training and inference, and replace ControlNet for controllable image generating scenarios, view the following chapter for details.
 - 2023.12.23: Support [codegeex2-6b](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/codegeex2_6b).
 - 2023.12.19: Support [phi2-3b](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/phi2_3b).
@@ -113,7 +113,7 @@ Users can check the [documentation of SWIFT](docs/source/GetStarted/快速使用
 - Quickly perform **inference** on LLM and build a **Web-UI**, see the [LLM Inference Documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM推理文档.md).
 - Rapidly **fine-tune** and perform inference on LLM, and build a Web-UI, see the [LLM Fine-tuning Documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM微调文档.md).
 - Using **interface** to fine-tuning and perform inference, see the [WEB-UI Documentation](https://github.com/modelscope/swift/blob/main/docs/source/GetStarted/%E7%95%8C%E9%9D%A2%E8%AE%AD%E7%BB%83%E6%8E%A8%E7%90%86.md).
-- **DPO training** supported, start by using [this script](https://github.com/modelscope/swift/blob/v1.5.0/examples/pytorch/llm/scripts/dpo/lora/dpo.sh).
+- **DPO training** supported, start by using [this script](https://github.com/modelscope/swift/blob/v1.5.0/examples/pytorch/llm/scripts/dpo/lora_ddp_mp/dpo.sh).
 - Utilize VLLM for **inference acceleration** and **deployment(OpenAI API)**. Please refer to [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md) for more information.
 - View the models and datasets supported by Swift. You can check [supported models and datasets](https://github.com/modelscope/swift/blob/main/docs/source/LLM/支持的模型和数据集.md).
 - Expand and customize models, datasets, and dialogue templates in Swift, see [Customization and Expansion](https://github.com/modelscope/swift/blob/main/docs/source/LLM/自定义与拓展.md).
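For orientation, a minimal sketch of the `swift web-ui` entry point mentioned in the README items above, assuming ms-swift is already installed; the `WEBUI_SHARE` and `SWIFT_UI_LANG` environment variables are the ones documented in the 界面训练推理.md change later in this patch:

```bash
# Launch the Gradio-based training/inference UI installed with ms-swift.
# WEBUI_SHARE=1 asks gradio to create a share link; SWIFT_UI_LANG selects the UI language (en/zh).
WEBUI_SHARE=1 SWIFT_UI_LANG=en swift web-ui
```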
diff --git a/README_CN.md b/README_CN.md
index fe3bd5691f..d2bf279788 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -63,7 +63,7 @@ SWIFT(Scalable lightWeight Infrastructure for Fine-Tuning)是一个可扩展
 - 2023.1.4: 支持**VLLM部署**, 兼容**OpenAI API**样式, 具体可以查看[VLLM推理加速与部署](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md#部署).
 - 2023.1.4: 更新[Benchmark](https://github.com/modelscope/swift/blob/main/docs/source/LLM/Benchmark.md), 方便查看不同模型训练的速度和所需显存.
 - 🔥 2023.12.29: 支持web-ui进行sft训练和推理,安装ms-swift后使用`swift web-ui`开启
-- 🔥 2023.12.29: 支持 DPO RLHF(Reinforcement Learning from Human Feedback) 和两个用于此任务的数据集: AI-ModelScope/stack-exchange-paired 以及 AI-ModelScope/hh-rlhf. 使用[这个脚本](https://github.com/modelscope/swift/blob/v1.5.0/examples/pytorch/llm/scripts/dpo/lora/dpo.sh)开启训练!
+- 🔥 2023.12.29: 支持 DPO RLHF(Reinforcement Learning from Human Feedback) 和两个用于此任务的数据集: AI-ModelScope/stack-exchange-paired 以及 AI-ModelScope/hh-rlhf. 使用[这个脚本](https://github.com/modelscope/swift/blob/v1.5.0/examples/pytorch/llm/scripts/dpo/lora_ddp_mp/dpo.sh)开启训练!
 - 🔥 2023.12.28: 支持SCEdit! 该tuner可显著降低U-Net中的显存占用,并支持低显存可控图像生成(取代ControlNet),阅读下面的章节来了解详细信息
 - 2023.12.23: 支持[codegeex2-6b](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/codegeex2_6b).
 - 2023.12.19: 支持[phi2-3b](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/phi2_3b).
@@ -111,7 +111,7 @@ SWIFT(Scalable lightWeight Infrastructure for Fine-Tuning)是一个可扩展
 - 快速对LLM进行**推理**, 搭建**Web-UI**, 可以查看[LLM推理文档](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM推理文档.md).
 - 快速对LLM进行**微调**, 推理并搭建Web-UI, 可以查看[LLM微调文档](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM微调文档.md).
 - 使用**界面**方式进行微调和推理, 可以查看[WEB-UI文档](https://github.com/modelscope/swift/blob/main/docs/source/GetStarted/%E7%95%8C%E9%9D%A2%E8%AE%AD%E7%BB%83%E6%8E%A8%E7%90%86.md).
-- 支持**DPO训练**, 使用[这个脚本](https://github.com/modelscope/swift/blob/v1.5.0/examples/pytorch/llm/scripts/dpo/lora/dpo.sh)开启训练
+- 支持**DPO训练**, 使用[这个脚本](https://github.com/modelscope/swift/blob/v1.5.0/examples/pytorch/llm/scripts/dpo/lora_ddp_mp/dpo.sh)开启训练
 - 使用VLLM进行**推理加速**和**部署(OpenAI API)**. 可以查看[VLLM推理加速与部署](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md).
 - 查看swift支持的模型和数据集. 可以查看[支持的模型和数据集](https://github.com/modelscope/swift/blob/main/docs/source/LLM/支持的模型和数据集.md).
 - 对swift中的模型, 数据集, 对话模板进行**拓展**, 可以查看[自定义与拓展](https://github.com/modelscope/swift/blob/main/docs/source/LLM/自定义与拓展.md).
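As a quick reference for the DPO quickstart that both READMEs link to, here is a minimal sketch of the launch sequence, assuming the repository has been cloned and `pip install -e .[llm]` has been run; the same commands appear in the alignment document and scripts added below:

```bash
# The DPO example scripts expect examples/pytorch/llm as the working directory.
cd examples/pytorch/llm
# LoRA DPO training of mistral-7b on hh-rlhf, then inference with the trained weights.
bash scripts/dpo/lora_ddp_mp/dpo.sh
bash scripts/dpo/lora_ddp_mp/infer.sh
```

Note that `--ckpt_dir` in infer.sh is a placeholder (`vx-xxx-xxx/checkpoint-xxx`) and has to be pointed at the checkpoint directory actually produced by training.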
diff --git "a/docs/source/GetStarted/\347\225\214\351\235\242\350\256\255\347\273\203\346\216\250\347\220\206.md" "b/docs/source/GetStarted/\347\225\214\351\235\242\350\256\255\347\273\203\346\216\250\347\220\206.md" index d7b14889dc..8ce33f7713 100644 --- "a/docs/source/GetStarted/\347\225\214\351\235\242\350\256\255\347\273\203\346\216\250\347\220\206.md" +++ "b/docs/source/GetStarted/\347\225\214\351\235\242\350\256\255\347\273\203\346\216\250\347\220\206.md" @@ -5,3 +5,8 @@ swift web-ui ``` 开启界面训练和推理。 + +web-ui没有传入参数,所有可控部分都在界面中。但是有几个环境变量可以使用: + +> WEBUI_SHARE=1 控制gradio是否是share状态 +> SWIFT_UI_LANG=en/zh 控制web-ui界面语言 diff --git "a/docs/source/LLM/LLM\344\272\272\347\261\273\345\257\271\351\275\220\350\256\255\347\273\203\346\226\207\346\241\243.md" "b/docs/source/LLM/LLM\344\272\272\347\261\273\345\257\271\351\275\220\350\256\255\347\273\203\346\226\207\346\241\243.md" new file mode 100644 index 0000000000..6a9b57441a --- /dev/null +++ "b/docs/source/LLM/LLM\344\272\272\347\261\273\345\257\271\351\275\220\350\256\255\347\273\203\346\226\207\346\241\243.md" @@ -0,0 +1,97 @@ +# LLM人类对齐训练文档 +## 目录 +- [环境准备](#环境准备) +- [人类对齐训练](#人类对齐训练) + +## 环境准备 +GPU设备: A10, 3090, V100, A100均可,如果是显存<=24G的GPU最少需要双卡环境。由于人类对齐训练在一张卡上加载两个模型,因此比微调的显存多占用一个推理模型的显存使用量。 +```bash +# 设置pip全局镜像 +pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/ +# 安装ms-swift +git clone https://github.com/modelscope/swift.git +cd swift +pip install -e .[llm] + +# 环境对齐 (如果你运行错误, 可以跑下面的代码, 仓库使用最新环境测试) +pip install -r requirements/framework.txt -U +pip install -r requirements/llm.txt -U +``` + +## 人类对齐训练 +下面的shell脚本运行了一个人类对齐训练。首先需要切换到运行目录: + +```shell +cd examples/pytorch/llm +``` + +运行下面的命令: + +```shell +# Experimental environment: 4*A100 +# Memory usage: 4 * 20G,双卡device_map * 2ddp +nproc_per_node=2 + +PYTHONPATH=../../.. \ +CUDA_VISIBLE_DEVICES=0,1,2,3 \ +torchrun \ + --nproc_per_node=$nproc_per_node \ + --master_port 29500 \ + llm_dpo.py \ + --model_type mistral-7b \ + --ref_model_type mistral-7b \ + --model_revision master \ + --sft_type lora \ + --tuner_backend swift \ + --dtype AUTO \ + --output_dir output \ + --dataset hh-rlhf \ + --train_dataset_sample -1 \ + --truncation_strategy truncation_left \ + --val_dataset_sample 2000 \ + --num_train_epochs 3 \ + --max_length 1024 \ + --max_prompt_length 512 \ + --check_dataset_strategy none \ + --lora_rank 8 \ + --lora_alpha 32 \ + --lora_dropout_p 0.05 \ + --lora_target_modules ALL \ + --gradient_checkpointing true \ + --batch_size 1 \ + --weight_decay 0.01 \ + --learning_rate 5e-5 \ + --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \ + --max_grad_norm 1.0 \ + --warmup_ratio 0.03 \ + --eval_steps 2000 \ + --save_steps 2000 \ + --save_total_limit 2 \ + --logging_steps 10 \ +``` + +### sh脚本 + +sh脚本可以查看[这里](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/dpo)。 + +```bash +# 下面的脚本需要在此目录下执行 +cd examples/pytorch/llm +``` + +**提示**: + +- 我们默认在训练时设置`--gradient_checkpointing true`来**节约显存**, 这会略微降低训练速度. +- 如果你使用的是**V100**等较老的GPU, 你需要设置`--dtype AUTO`或者`--dtype fp16`, 因为其不支持bf16. +- 如果你的机器是A100等高性能显卡, 且使用的是qwen系列模型, 推荐你安装[**flash-attn**](https://github.com/Dao-AILab/flash-attention), 这将会加快训练和推理的速度以及显存占用(A10, 3090, V100等显卡不支持flash-attn进行训练). 支持flash-attn的模型可以查看[LLM支持的模型](./支持的模型和数据集.md#模型) +- 如果你需要断网进行训练, 请使用`--model_cache_dir`和设置`--check_model_is_latest false`. 具体参数含义请查看[命令行参数](./命令行参数.md). +- 如果你想在训练时, 将权重push到ModelScope Hub中, 你需要设置`--push_to_hub true`. 
+
+```bash
+# dpo训练 mistral-7b max_length=1024,bs=1
+# 推荐的实验环境: V100, A10, 3090,2卡4卡或8卡
+bash scripts/dpo/lora_ddp_mp/dpo.sh
+bash scripts/dpo/lora_ddp_mp/infer.sh
+```
+
+由于DPO训练后会得到一个完整模型或者adapter的weights,LoRA合并、推理的步骤和微调步骤相同,请参考[微调文档](./LLM微调文档.md#merge-lora)对应的步骤。
diff --git "a/docs/source/LLM/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" "b/docs/source/LLM/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md"
index a24e7541e1..f28e7bb83f 100644
--- "a/docs/source/LLM/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md"
+++ "b/docs/source/LLM/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md"
@@ -92,6 +92,13 @@
 - `--repetition_penalty`: 默认为`1.05`. 该参数只有在`predict_with_generate`设置为True的时候才生效.
 - `--num_beams`: 默认为`1`. 该参数只有在`predict_with_generate`设置为True的时候才生效.
 
+## DPO参数
+
+DPO参数继承了上面的SFT参数,除此之外增加了以下参数:
+
+- `--ref_model_type`: 对比模型类型,可以选择的`model_type`可以查看`MODEL_MAPPING.keys()`
+- `--max_prompt_length`: 最大的提示长度,该参数会传入DPOTrainer中,使prompt长度不超过该值,默认值1024
+
 ## merge-lora infer app-ui 命令行参数
 
 - `--model_type`: 默认值为`None`, 具体的参数介绍可以在`sft.sh命令行参数`中查看.
diff --git "a/docs/source/LLM/\350\207\252\345\256\232\344\271\211\344\270\216\346\213\223\345\261\225.md" "b/docs/source/LLM/\350\207\252\345\256\232\344\271\211\344\270\216\346\213\223\345\261\225.md"
index 2a7dea3ca2..65dee97901 100644
--- "a/docs/source/LLM/\350\207\252\345\256\232\344\271\211\344\270\216\346\213\223\345\261\225.md"
+++ "b/docs/source/LLM/\350\207\252\345\256\232\344\271\211\344\270\216\346\213\223\345\261\225.md"
@@ -99,7 +99,16 @@ AAAAA,BBBBB,CCCCC
 {"messages": [{"role": "user", "content": "AAAAA"}, {"role": "assistant", "content": "BBBBB"}, {"role": "user", "content": "CCCCC"}, {"role": "assistant", "content": "DDDDD"}]}
 ```
 
+**强化学习(DPO)**
+
+```jsonl
+{"query": "11111", "response": "22222", "rejected_response": "33333"}
+{"query": "aaaaa", "response": "bbbbb", "rejected_response": "ccccc"}
+{"query": "AAAAA", "response": "BBBBB", "rejected_response": "CCCCC"}
+```
+
 ### 注册数据集的方式
+
 以下是一个**注册数据集**的案例. 完整的py文件可以查看[custom.py](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/custom.py), sh脚本可以查看[custom](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/custom).
 
 ```python
diff --git a/examples/pytorch/llm/scripts/dpo/lora/dpo.sh b/examples/pytorch/llm/scripts/dpo/lora/dpo.sh
index a1949b6ae3..815acd0cbb 100644
--- a/examples/pytorch/llm/scripts/dpo/lora/dpo.sh
+++ b/examples/pytorch/llm/scripts/dpo/lora/dpo.sh
@@ -1,7 +1,7 @@
-# Experimental environment: 8*A100
-# Memory usage: 8 * 50G
+# Experimental environment: 2*A100
+# Memory usage: 2 * 20G
 PYTHONPATH=../../.. \
-accelerate launch llm_dpo.py \
+python llm_dpo.py \
     --model_type mistral-7b \
     --ref_model_type mistral-7b \
     --model_revision master \
diff --git a/examples/pytorch/llm/scripts/dpo/lora_ddp_mp/dpo.sh b/examples/pytorch/llm/scripts/dpo/lora_ddp_mp/dpo.sh
new file mode 100644
index 0000000000..c667dff744
--- /dev/null
+++ b/examples/pytorch/llm/scripts/dpo/lora_ddp_mp/dpo.sh
@@ -0,0 +1,40 @@
+# Experimental environment: 4*A100
+# Memory usage: 4 * 20G
+nproc_per_node=2
+
+PYTHONPATH=../../.. \
+CUDA_VISIBLE_DEVICES=0,1,2,3 \
+torchrun \
+    --nproc_per_node=$nproc_per_node \
+    --master_port 29500 \
+    llm_dpo.py \
+    --model_type mistral-7b \
+    --ref_model_type mistral-7b \
+    --model_revision master \
+    --sft_type lora \
+    --tuner_backend swift \
+    --dtype AUTO \
+    --output_dir output \
+    --dataset hh-rlhf \
+    --train_dataset_sample -1 \
+    --truncation_strategy truncation_left \
+    --val_dataset_sample 2000 \
+    --num_train_epochs 3 \
+    --max_length 1024 \
+    --max_prompt_length 512 \
+    --check_dataset_strategy none \
+    --lora_rank 8 \
+    --lora_alpha 32 \
+    --lora_dropout_p 0.05 \
+    --lora_target_modules ALL \
+    --gradient_checkpointing true \
+    --batch_size 1 \
+    --weight_decay 0.01 \
+    --learning_rate 5e-5 \
+    --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
+    --max_grad_norm 1.0 \
+    --warmup_ratio 0.03 \
+    --eval_steps 2000 \
+    --save_steps 2000 \
+    --save_total_limit 2 \
+    --logging_steps 10 \
diff --git a/examples/pytorch/llm/scripts/dpo/lora_ddp_mp/infer.sh b/examples/pytorch/llm/scripts/dpo/lora_ddp_mp/infer.sh
new file mode 100644
index 0000000000..8ed9b69b6e
--- /dev/null
+++ b/examples/pytorch/llm/scripts/dpo/lora_ddp_mp/infer.sh
@@ -0,0 +1,14 @@
+# Experimental environment: A10, 3090
+PYTHONPATH=../../.. \
+CUDA_VISIBLE_DEVICES=0 \
+python llm_infer.py \
+    --ckpt_dir output/mistral-7b/vx-xxx-xxx/checkpoint-xxx \
+    --load_dataset_config true \
+    --eval_human true \
+    --use_flash_attn false \
+    --max_new_tokens 1024 \
+    --temperature 0.3 \
+    --top_p 0.7 \
+    --repetition_penalty 1.05 \
+    --do_sample true \
+    --merge_lora_and_save false \
diff --git a/swift/llm/dpo.py b/swift/llm/dpo.py
index 580e15ed85..7c88ce3649 100644
--- a/swift/llm/dpo.py
+++ b/swift/llm/dpo.py
@@ -31,7 +31,7 @@ def llm_dpo(args: DPOArguments) -> str:
 
     # Loading Model and Tokenizer
     model_kwargs = {'low_cpu_mem_usage': True}
-    if (is_dist() and not is_ddp_plus_mp()) or 'HF_ACCELERATOR' in os.environ:
+    if is_dist() and not is_ddp_plus_mp():
         model_kwargs['device_map'] = {'': local_rank}
     else:
         model_kwargs['device_map'] = 'auto'
@@ -61,6 +61,8 @@ def llm_dpo(args: DPOArguments) -> str:
     ref_model = deepcopy(model)
 
     logger.info(f'model_config: {model.config}')
+    if hasattr(model, 'hf_device_map'):
+        logger.info(f'model device_map {model.hf_device_map}')
     generation_config = GenerationConfig(
         max_new_tokens=args.max_new_tokens,
         temperature=args.temperature,
diff --git a/swift/llm/tuner.py b/swift/llm/tuner.py
index 7bd93beb0e..95989c8812 100644
--- a/swift/llm/tuner.py
+++ b/swift/llm/tuner.py
@@ -85,7 +85,7 @@ def prepare_model(model, args: SftArguments):
 
     if args.neftune_alpha > 0.001:
         neftune_config = NEFTuneConfig(noise_alpha=args.neftune_alpha)
-        model = Swift.prepare_model(model, neftune_config)
+        model = Swift.prepare_model(model, {'neftune': neftune_config})
         logger.info(f'neftune_config: {neftune_config}')
 
     class TrainerAdapterCallback(TrainerCallback):
diff --git a/swift/tuners/base.py b/swift/tuners/base.py
index d6810445a5..6e1f0edab7 100644
--- a/swift/tuners/base.py
+++ b/swift/tuners/base.py
@@ -56,7 +56,7 @@ def __init__(self,
                 new_adapters.append(DEFAULT_ADAPTER)
             else:
                 logger.warn(
-                    f'Adater {DEFAULT_ADAPTER} has been patched, skip.')
+                    f'Adapter {DEFAULT_ADAPTER} has been patched, skip.')
         elif isinstance(config, dict):
             assert (all(isinstance(c, SwiftConfig) for c in config.values()))
             for adapter_name, _config in config.items():
@@ -66,7 +66,7 @@
                     new_adapters.append(adapter_name)
                 else:
                     logger.warn(
-                        f'Adater {adapter_name} has been patched, skip.')
+                        f'Adapter {adapter_name} has been patched, skip.')
 
         self.model = model
         self.extra_state_keys = extra_state_keys or []
diff --git a/swift/tuners/neftune.py b/swift/tuners/neftune.py
index 300f6646e2..e49924e53a 100644
--- a/swift/tuners/neftune.py
+++ b/swift/tuners/neftune.py
@@ -55,7 +55,7 @@ def neftune_hook(module, args, output):
             sub_module.nef_activated = True
 
         def state_dict_callback(state_dict, adapter_name):
-            return state_dict
+            return {}
 
         def mark_trainable_callback(model):
             return
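To make the custom DPO dataset layout introduced in 自定义与拓展.md concrete, here is a minimal sketch that writes a preference-pair jsonl file; the file name `dpo_custom.jsonl` and the toy pairs are purely illustrative, and wiring such a file into training follows the dataset registration steps described in that document:

```bash
# Each line is one preference pair: "query" is the prompt, "response" the preferred
# answer, and "rejected_response" the dispreferred one, matching the jsonl layout above.
cat > dpo_custom.jsonl <<'EOF'
{"query": "What is 1+1?", "response": "1+1 equals 2.", "rejected_response": "1+1 equals 3."}
{"query": "Name a prime number.", "response": "2 is a prime number.", "rejected_response": "4 is a prime number."}
EOF
```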