Merged
8 changes: 4 additions & 4 deletions docs/source/Instruction/GRPO/AdvancedResearch/DAPO.md
@@ -72,13 +72,13 @@ DAPO designs a three-stage length penalty function:
$$
R_{\text{length}}(L) =
\begin{cases}
- 0, & \text{if } L \leq L_{\text{cache}} \\[10pt]
- -\dfrac{L - L_{\text{cache}}}{L_{\text{max}} - L_{\text{cache}}}, & \text{if } L_{\text{cache}} < L < L_{\text{max}} \\[10pt]
- -1, & \text{if } L \geq L_{\text{max}}
+ 0, & L \leq L_{\text{max}} - L_{\text{cache}} \\[10pt]
+ \dfrac{(L_{\text{max}} - L_{\text{cache}}) - L}{L_{\text{cache}}}, & L_{\text{max}} - L_{\text{cache}} < L \leq L_{\text{max}} \\[10pt]
+ -1, & L > L_{\text{max}}
\end{cases}
$$

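- When the length falls within the interval (L_cache < L < L_max), a linearly increasing penalty is applied; for lengths (L ≥ L_max), the maximum penalty (-1) is imposed.
+ When the length falls within the interval $(L_{\text{max}} - L_{\text{cache}} < L \leq L_{\text{max}})$, a linearly increasing penalty is applied; for lengths $(L > L_{\text{max}})$, the maximum penalty (-1) is imposed.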


Parameters
8 changes: 4 additions & 4 deletions docs/source_en/Instruction/GRPO/AdvancedResearch/DAPO.md
@@ -60,13 +60,13 @@ DAPO designs a three-stage length penalty function:
$$
R_{\text{length}}(L) =
\begin{cases}
- 0, & \text{if } L \leq L_{\text{cache}} \\[10pt]
- -\dfrac{L - L_{\text{cache}}}{L_{\text{max}} - L_{\text{cache}}}, & \text{if } L_{\text{cache}} < L < L_{\text{max}} \\[10pt]
- -1, & \text{if } L \geq L_{\text{max}}
+ 0, & L \leq L_{\text{max}} - L_{\text{cache}} \\[10pt]
+ \dfrac{(L_{\text{max}} - L_{\text{cache}}) - L}{L_{\text{cache}}}, & L_{\text{max}} - L_{\text{cache}} < L \leq L_{\text{max}} \\[10pt]
+ -1, & L > L_{\text{max}}
\end{cases}
$$

- When the length falls within the interval (L_cache < L < L_max), a linearly increasing penalty is applied. For lengths (L ≥ L_max), the maximum penalty (-1) is imposed.
+ When the length falls within the interval $(L_{\text{max}} - L_{\text{cache}} < L \leq L_{\text{max}})$, a linearly increasing penalty is applied. For lengths $(L > L_{\text{max}})$, the maximum penalty (-1) is imposed.

Parameters:
- `reward_funcs soft_overlong` enables this reward function.
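
For reviewers who want to sanity-check the corrected piecewise definition, here is a minimal Python sketch of the three-stage penalty as the updated formula states it. The function name `soft_overlong_penalty` and its parameter names are hypothetical and chosen only for illustration; this is not the library's implementation, and the values 4096 and 1024 are assumed settings, not defaults taken from the docs.

```python
def soft_overlong_penalty(length: int, max_length: int, cache_length: int) -> float:
    """Three-stage length penalty following the corrected piecewise formula.

    Hypothetical helper for illustration only; names are assumptions.
    """
    if not 0 < cache_length <= max_length:
        raise ValueError("expected 0 < cache_length <= max_length")

    # Responses up to L_max - L_cache tokens are not penalized.
    threshold = max_length - cache_length
    if length <= threshold:
        return 0.0
    if length <= max_length:
        # Falls linearly from 0 (just above the threshold) to -1 (at L_max).
        return (threshold - length) / cache_length
    # Anything longer than L_max receives the maximum penalty.
    return -1.0


if __name__ == "__main__":
    # Assumed settings: L_max = 4096, L_cache = 1024.
    for length in (2048, 3072, 3584, 4096, 5000):
        print(length, soft_overlong_penalty(length, 4096, 1024))
```

With these assumed settings the penalty-free region ends at 3072 tokens, a 3584-token response sits halfway through the buffer and scores -0.5, and the penalty is continuous at both boundaries (0 at 3072, -1 at 4096).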