diff --git a/docs/source/Instruction/GRPO.md b/docs/source/Instruction/GRPO.md
index ab4c3580bb..842ec39171 100644
--- a/docs/source/Instruction/GRPO.md
+++ b/docs/source/Instruction/GRPO.md
@@ -219,20 +219,20 @@ A conversation between User and Assistant. The user asks a question, and the Ass
- vllm_limit_mm_per_prompt: vllm透传参数,默认为None.
- vllm_enable_prefix_caching: vllm透传参数,默认为True.
- sleep_level: 训练时释放 vLLM 显存,可选项为[0, 1], 默认为0,不释放.
+ - move_model_batches: 在模型向vLLM等快速推理框架移动参数时,将layers分为多少个batch。默认为None,代表整个模型不进行拆分,否则拆分为move_model_batches+1(非layer参数)+1(多模态部分参数)个。
+ - offload_optimizer: 是否在vLLM推理时offload optimizer参数,默认为False。
+ - offload_model: 是否在vLLM推理时offload 模型本身,默认为False。
+ - gc_collect_after_offload: 是否在offload结束时进行gc(python gc和GPU gc),默认为False。
+ - completion_length_limit_scope: 在多轮对话中,`max_completion_length` 的限制范围。
+ `total`限制所有对话轮次的总输出长度不超过`max_completion_length`, `per_round`限制每一轮的输出长度。
+ 默认为`per_round`, 当前仅对 colocate mode 生效。
- num_iterations: 每个批次的更新迭代次数,默认为1。
- epsilon: clip 系数,默认为0.2。
- epsilon_high: upper clip 系数,默认为None,设置后与epsilon共同构成[epsilon, epsilon_high]裁剪范围。
- sync_ref_model: 是否定期同步ref_model,默认为False。
- ref_model_mixup_alpha: 控制在更新过程中model和先前ref_model之间的混合。更新公式为 $π_{ref} = α * π_θ + (1 - α) * π_{ref_{prev}}$。默认为0.6。
- ref_model_sync_steps:同步频率,默认为512。
-- move_model_batches: 在模型向vLLM等快速推理框架移动参数时,将layers分为多少个batch. 默认为None, 代表整个模型不进行拆分,否则拆分为move_model_batches+1(非layer参数)+1(多模态部分参数)个。
-- offload_optimizer: 是否在vLLM推理时offload optimizer参数,默认为False。
-- offload_model: 是否在vLLM推理时offload 模型本身,默认为False。
-- gc_collect_after_offload: 是否在offload结束时进行gc(python gc和GPU gc),默认为False。
- multi_turn_func: 多轮GRPO参数, 传入对应的plugin名称, 同时在plugin/multi_turn.py中添加好对应的实现。
-- completion_length_limit_scope: 在多轮对话中,`max_completion_length` 的限制范围。
-`total`限制所有对话轮次的总输出长度不超过`max_completion_length`, `per_round`限制每一轮的输出长度。
-默认为`per_round`, 当前仅对 colocate mode 生效。
- dynamic_sample:筛除group内奖励标准差为0的数据,额外采样新数据,默认为False。
- max_resample_times:dynamic_sample设置下限制重采样次数,默认3次。
- overlong_filter:跳过超长截断的样本,不参与loss计算,默认为False。
diff --git "a/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" "b/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md"
index d4acc13309..9df40acce8 100644
--- "a/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md"
+++ "b/docs/source/Instruction/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md"
@@ -439,20 +439,19 @@ reward模型参数将在PPO、GRPO中使用。
- vllm_limit_mm_per_prompt: vllm透传参数,默认为None。
- vllm_enable_prefix_caching: vllm透传参数,默认为True。
- sleep_level: 训练时释放 vLLM 显存,可选项为[0, 1], 默认为0,不释放
+ - move_model_batches: 在模型向vLLM等快速推理框架移动参数时,将layers分为多少个batch。默认为None,代表整个模型不进行拆分,否则拆分为move_model_batches+1(非layer参数)+1(多模态部分参数)个。
+ - offload_optimizer: 是否在vLLM推理时offload optimizer参数,默认为False。
+ - offload_model: 是否在vLLM推理时offload 模型本身,默认为False。
+ - gc_collect_after_offload: 是否在offload结束时进行gc(python gc和GPU gc),默认为False。
+ - completion_length_limit_scope: 在多轮对话中,`max_completion_length` 的限制范围。
+ `total`限制所有对话轮次的总输出长度不超过`max_completion_length`, `per_round`限制每一轮的输出长度。
- num_iterations: 每个批次的更新迭代次数,默认为1。
- epsilon: clip 系数,默认为0.2。
- epsilon_high: upper clip 系数,默认为None,设置后与epsilon共同构成[epsilon, epsilon_high]裁剪范围。
- sync_ref_model: 是否定期同步ref_model,默认为False。
- ref_model_mixup_alpha: 控制在更新过程中model和先前ref_model之间的混合。更新公式为 $π_{ref} = α * π_θ + (1 - α) * π_{ref_{prev}}$。默认为0.6。
- ref_model_sync_steps:同步频率,默认为512。
-- move_model_batches: 在模型向vLLM/LMDeploy等快速推理框架移动参数时,将layers分为多少个batch. 默认为None, 代表整个模型不进行拆分,否则拆分为move_model_batches+1(非layer参数)+1(多模态部分参数)个。
-- offload_optimizer: 是否在vLLM/LMDeploy推理时offload optimizer参数,默认为False。
-- offload_model: 是否在vLLM/LMDeploy推理时offload 模型本身,默认为False。
-- gc_collect_after_offload: 是否在offload结束时进行gc(python gc和GPU gc),默认为False。
- multi_turn_func: 多轮GRPO参数, 传入对应的plugin名称, 同时在plugin/multi_turn.py中添加好对应的实现。
-- completion_length_limit_scope: 在多轮对话中,`max_completion_length` 的限制范围。
-`total`限制所有对话轮次的总输出长度不超过`max_completion_length`, `per_round`限制每一轮的输出长度。
-默认为`per_round`, 当前仅对 colocate mode 生效。
- dynamic_sample:筛除group内奖励标准差为0的数据,额外采样新数据,默认为False。
- max_resample_times:dynamic_sample设置下限制重采样次数,默认3次。
- overlong_filter:跳过超长截断的样本,不参与loss计算,默认为False。
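For reference, the memory-saving switches documented above are meant to be combined in a colocated GRPO run. The following launch sketch illustrates this under stated assumptions: the model and dataset IDs are placeholders (not from this diff), while the flag names come directly from the parameter list above.

```sh
# Sketch of a colocated GRPO run using the offload/sleep parameters above.
# Model and dataset are placeholders; substitute your own.
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset AI-MO/NuminaMath-TIR \
    --use_vllm true \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --move_model_batches 4
```

Here `sleep_level 1` releases vLLM memory while the policy trains, the two offload flags free optimizer and model state during rollouts, and `move_model_batches 4` moves weights to vLLM in smaller chunks to cap peak memory.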
diff --git a/docs/source_en/Instruction/Command-line-parameters.md b/docs/source_en/Instruction/Command-line-parameters.md
index 50a86b2bc8..028861b6e8 100644
--- a/docs/source_en/Instruction/Command-line-parameters.md
+++ b/docs/source_en/Instruction/Command-line-parameters.md
@@ -451,6 +451,14 @@ The meanings of the following parameters can be referenced [here](https://huggin
- vllm_limit_mm_per_prompt: vLLM passthrough parameter, default is None.
- vllm_tensor_parallel_size: the tensor parallel size of vLLM engine, default is 1.
- sleep_level: make vLLM sleep while the model is training. Options are 0 or 1; default is 0 (no sleep).
+ - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM/LMDeploy, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches.
+ - offload_optimizer: Whether to offload optimizer parameters during inference with vLLM/LMDeploy. The default is `False`.
+ - offload_model: Whether to offload the model itself during inference with vLLM/LMDeploy. The default is `False`.
+ - gc_collect_after_offload: Whether to perform garbage collection (both Python GC and GPU GC) after offloading. The default is `False`.
+ - completion_length_limit_scope: Specifies the scope of the `max_completion_length` limit in multi-turn conversations.
+ When set to `total`, the total output length across all turns must not exceed `max_completion_length`.
+ When set to `per_round`, each individual turn's output length is limited separately.
+ Defaults to `per_round`. Currently only takes effect in colocate mode.
- top_k: Default is 50.
- top_p: Default is 0.9.
- repetition_penalty: Repetition penalty term. Default is 1.
@@ -460,15 +468,7 @@ The meanings of the following parameters can be referenced [here](https://huggin
- sync_ref_model: Whether to synchronize the reference model. Default is False.
- ref_model_mixup_alpha: The parameter controls the mix between the current policy and the previous reference policy during updates. The reference policy is updated according to the equation: $π_{ref} = α * π_θ + (1 - α) * π_{ref_{prev}}$. Default is 0.6.
- ref_model_sync_steps: The parameter determines how frequently the current policy is synchronized with the reference policy. Default is 512.
-- move_model_batches: When moving model parameters to fast inference frameworks such as vLLM/LMDeploy, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches.
-- offload_optimizer: Whether to offload optimizer parameters during inference with vLLM/LMDeploy. The default is `False`.
-- offload_model: Whether to offload the model itself during inference with vLLM/LMDeploy. The default is `False`.
-- gc_collect_after_offload: Whether to perform garbage collection (both Python GC and GPU GC) after offloading. The default is `False`.
- multi_turn_func: The multi-turn GRPO plugin name. Add your multi-turn implementation in plugin/multi_turn.py.
-- completion_length_limit_scope: Specifies the scope of the `max_completion_length` limit in multi-turn conversations.
-When set to `total`, the total output length across all turns must not exceed `max_completion_length`.
-When set to `per_round`, each individual turn's output length is limited separately.
-Defaults to `per_round`. Currently only takes effect in colocate mode.
- dynamic_sample: Exclude data within the group where the reward standard deviation is 0, and additionally sample new data. Default is False.
- max_resample_times: Under the dynamic_sample setting, limits the number of resampling attempts. Default is 3.
- overlong_filter: Skip overlong truncated samples, which will not be included in loss calculation. Default is False.
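A usage sketch for the multi-turn settings documented above: the plugin name `my_tool_caller` is hypothetical and stands for an implementation you have registered in plugin/multi_turn.py, as the parameter description requires; model and dataset IDs are placeholders.

```sh
# Hypothetical multi-turn GRPO launch; "my_tool_caller" is a stand-in for
# a plugin registered in plugin/multi_turn.py.
swift rlhf \
    --rlhf_type grpo \
    --model <model-id> \
    --dataset <dataset-id> \
    --multi_turn_func my_tool_caller \
    --max_completion_length 2048 \
    --completion_length_limit_scope total
```

With `total`, the 2048-token budget is shared across all turns of a rollout; with the default `per_round`, each turn may emit up to 2048 tokens on its own.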
diff --git a/docs/source_en/Instruction/GRPO.md b/docs/source_en/Instruction/GRPO.md
index a4c30ad7ee..e790c70ef8 100644
--- a/docs/source_en/Instruction/GRPO.md
+++ b/docs/source_en/Instruction/GRPO.md
@@ -229,21 +229,21 @@ Arguments
- vllm_limit_mm_per_prompt: vLLM passthrough parameter, default is None.
- vllm_tensor_parallel_size: the tensor parallel size of vLLM engine, default is 1.
- sleep_level: make vLLM sleep while the model is training. Options are 0 or 1; default is 0 (no sleep).
+ - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches.
+ - offload_optimizer: Whether to offload optimizer parameters during inference with vLLM. The default is `False`.
+ - offload_model: Whether to offload the model itself during inference with vLLM. The default is `False`.
+ - gc_collect_after_offload: Whether to perform garbage collection (both Python GC and GPU GC) after offloading. The default is `False`.
+ - completion_length_limit_scope: Specifies the scope of the `max_completion_length` limit in multi-turn conversations.
+ When set to `total`, the total output length across all turns must not exceed `max_completion_length`.
+ When set to `per_round`, each individual turn's output length is limited separately.
+ Defaults to `per_round`. Currently only takes effect in colocate mode.
- num_iterations: number of iterations per batch. Default is 1.
- epsilon: epsilon value for clipping. Default is 0.2.
- epsilon_high: Upper clip coefficient, default is None. When set, it forms a clipping range of [epsilon, epsilon_high] together with epsilon.
- sync_ref_model: Whether to synchronize the reference model. Default is False.
- ref_model_mixup_alpha: The parameter controls the mix between the current policy and the previous reference policy during updates. The reference policy is updated according to the equation: $π_{ref} = α * π_θ + (1 - α) * π_{ref_{prev}}$. Default is 0.6.
- ref_model_sync_steps: The parameter determines how frequently the current policy is synchronized with the reference policy. Default is 512.
-- move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches.
-- offload_optimizer: Whether to offload optimizer parameters during inference with vLLM. The default is `False`.
-- offload_model: Whether to offload the model itself during inference with vLLM. The default is `False`.
-- gc_collect_after_offload: Whether to perform garbage collection (both Python GC and GPU GC) after offloading. The default is `False`.
- multi_turn_func: The multi-turn GRPO plugin name. Add your multi-turn implementation in plugin/multi_turn.py.
-- completion_length_limit_scope: Specifies the scope of the `max_completion_length` limit in multi-turn conversations.
-When set to `total`, the total output length across all turns must not exceed `max_completion_length`.
-When set to `per_round`, each individual turn's output length is limited separately.
-Defaults to `per_round`. Currently only takes effect in colocate mode.
- dynamic_sample: Exclude data within the group where the reward standard deviation is 0, and additionally sample new data. Default is False.
- max_resample_times: Under the dynamic_sample setting, limits the number of resampling attempts. Default is 3.
- overlong_filter: Skip overlong truncated samples, which will not be included in loss calculation. Default is False.
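The epsilon/epsilon_high pair above is easier to read as the clipped surrogate it parameterizes. The following is a sketch assuming TRL-style asymmetric clipping (DAPO's "clip-higher"), where the token-level importance ratio is bounded below by $1-\epsilon$ and above by $1+\epsilon_{high}$ (when epsilon_high is unset, $\epsilon_{high}=\epsilon$):

$$
\mathcal{L}(\theta) = -\,\mathbb{E}\left[\min\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon_{high}\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{old}}(o_t \mid q, o_{<t})}
$$

Raising $\epsilon_{high}$ above $\epsilon$ loosens only the upper bound, letting low-probability tokens with positive advantage gain probability faster while keeping the downside clip unchanged.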
diff --git a/requirements/framework.txt b/requirements/framework.txt
index 8849a8f7c4..3ed95c55b2 100644
--- a/requirements/framework.txt
+++ b/requirements/framework.txt
@@ -33,6 +33,6 @@ tiktoken
tqdm
transformers>=4.33,<4.53
transformers_stream_generator
-trl>=0.15,<0.19
+trl>=0.15,<0.20
uvicorn
zstandard
diff --git a/requirements/install_all.sh b/requirements/install_all.sh
index 8b6fed4dcc..a3f32cb342 100644
--- a/requirements/install_all.sh
+++ b/requirements/install_all.sh
@@ -1,6 +1,6 @@
# please use python=3.10, cuda12.*
# sh requirements/install_all.sh
-pip install "vllm>=0.5.1" -U
+pip install "vllm>=0.5.1,<0.9" -U
pip install "lmdeploy>=0.5" -U --no-deps
pip install autoawq -U --no-deps
pip install auto_gptq optimum bitsandbytes -U
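Since the pins above now admit trl 0.19.x but exclude 0.20, and cap vllm below 0.9, a quick post-install sanity check can confirm what actually resolved. A minimal sketch (output depends on your environment):

```sh
# Install within the new bounds, then print the resolved versions.
pip install "trl>=0.15,<0.20" "vllm>=0.5.1,<0.9" -U
python -c "import trl, vllm; print('trl', trl.__version__, '| vllm', vllm.__version__)"
```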