# GRPO 

GRPO (Group Relative Policy Optimization) is a critic-free, PPO-style reinforcement-learning algorithm introduced by the DeepSeek team to improve LLMs’ reasoning skills cheaply. It trains the model by sampling several candidate answers to the same prompt, scoring them, and nudging the policy toward the higher-scoring ones, using the group’s average score as the baseline. There is no value network, so VRAM use drops considerably, and updates tend to stay stable even with small datasets.

Let's focus on its key features.

**Critic-based RL (PPO, A2C, PPO-based RLHF)**:
- Critic network: predicts the expected reward for each state, prompt, or context.
- Policy network: generates actions (tokens, sentences, moves) meant to maximise reward.
- Advantage: after an action, compute advantage = real_reward - critic_estimate.
- Policy update: scale the log-prob gradient by the advantage: a positive advantage reinforces the action, a negative one suppresses it.
- Critic update: regress its prediction toward the observed reward (MSE loss).
- Result: the learned baseline cuts gradient variance, so training is more stable and sample-efficient while steering the policy toward higher long-term reward.
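
The bullets above can be sketched with plain numbers. This is a toy illustration only, not a training loop: the function names (`advantage`, `policy_loss`, `critic_mse_loss`) and the sample values are hypothetical and chosen for readability.

```python
# Toy sketch of the critic-based update signal (all names are hypothetical).

def advantage(real_reward: float, critic_estimate: float) -> float:
    """Advantage = observed reward minus the critic's prediction."""
    return real_reward - critic_estimate


def policy_loss(log_prob: float, adv: float) -> float:
    """Loss whose gradient is the log-prob gradient scaled by the advantage.

    Minimising it raises log_prob when adv > 0 and lowers it when adv < 0.
    """
    return -log_prob * adv


def critic_mse_loss(real_reward: float, critic_estimate: float) -> float:
    """Squared error that pulls the critic's prediction toward the observed reward."""
    return (critic_estimate - real_reward) ** 2


# One sampled action: the critic expected 0.25, the environment returned 1.0.
reward, estimate, log_prob = 1.0, 0.25, -2.0
adv = advantage(reward, estimate)
print("advantage:", adv)
print("policy loss:", policy_loss(log_prob, adv))
print("critic loss:", critic_mse_loss(reward, estimate))
```

Because the advantage is positive here, a gradient step on the policy loss would make the sampled action more likely, while the critic loss moves the estimate up toward the observed reward.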


**“Critic-free” RL (GRPO)**:
- No critic network: the extra network (or value head) that normally learns a value function is dropped entirely, and the policy is updated directly against a sample-based baseline.
- Policy only: for each prompt, sample K candidate answers and score each with an automatic reward signal.
- Baseline: take the average reward across those K samples as the reference point.
- Advantage: an answer’s reward minus that average decides whether its tokens are reinforced (positive) or suppressed (negative).
- Update rule: run the usual PPO-style clipped loss, scaling gradients by this group-relative advantage instead of one derived from a learned critic.
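
A minimal sketch of these steps, using only the Python standard library. The function names, the `clip_eps=0.2` default, and the sample rewards are assumptions made for illustration. The optional `normalize` flag divides by the group's standard deviation, as in the formulation of Shao et al. (2024); the KL penalty toward a reference policy that their objective also includes is left out to keep the example short.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, normalize=False, eps=1e-8):
    """Advantage of each sampled answer relative to its own group.

    The baseline is the mean reward of the K samples for one prompt.
    With normalize=True the result is also divided by the group's
    standard deviation (eps avoids division by zero).
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    std = pstdev(rewards)
    return [c / (std + eps) for c in centered]


def clipped_objective(logp_new, logp_old, adv, clip_eps=0.2):
    """PPO-style clipped surrogate for one sample (to be maximised)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * adv, clipped_ratio * adv)


# K = 4 answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_advantages(rewards)
print(advs)  # correct answers get +0.5, wrong ones -0.5
```

Since the baseline is the group mean, the unnormalised advantages always sum to zero: if every answer in a group earns the same reward, that group contributes no learning signal.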

Upshot: roughly 50% less GPU memory and roughly 30% faster training (ballpark figures that depend on model size and setup), with equal or better gains on auto-gradable tasks (math, code), though weaker results on sparse-reward, long-horizon problems.


Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm introduced in early 2024 for fine-tuning large models, especially large language models (LLMs), on complex tasks...
- The term GRPO was first defined by Shao et al. (2024) in the context of training a mathematical reasoning LLM called DeepSeekMath.
- As the name suggests, GRPO optimizes a policy by comparing a group of generated outcomes (responses) for the same input and using their relative performance to guide learning. In essence, GRPO is a variant of Proximal Policy Optimization (PPO) that forgoes the explicit value function (critic) network and instead estimates the baseline (expected reward) from the average score of a group of sampled outputs.
- This approach allows the algorithm to compute advantages (how much better or worse an action is compared to average) directly from the group of outcomes, rather than relying on a learned value estimator.
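
Written as an equation, the group baseline and advantage described above look as follows. The symbols are notation assumed here for illustration: $K$ is the group size, $r_i$ the reward of the $i$-th sampled output, $\bar r$ the group baseline, and $A_i$ the resulting advantage.

$$
\bar r = \frac{1}{K}\sum_{j=1}^{K} r_j, \qquad A_i = r_i - \bar r, \qquad \sum_{i=1}^{K} A_i = 0
$$

For example, with $K = 4$ and rewards $(2, 0, 1, 1)$, the baseline is $\bar r = 1$ and the advantages are $(1, -1, 0, 0)$: the first output is reinforced, the second is suppressed, and the two average ones are left unchanged.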