## Introduction to post-training

- **Pre-training**: Learning knowledge from everywhere; the output is a **Base model** (predicts the next word/token).
- **Post-training**: Learning responses from curated data; the output is an **Instruct/Chat model** (responds to instructions).
- **(Continual) Post-training**: Changing behaviors or enhancing capabilities; the output is a **Customized model** (specialized in a certain domain or exhibiting specific behaviors).

## Methods Used During LLM Training

- **Pre-Training (Unsupervised Learning)**: Unlabeled Text Corpus (2T tokens)

formula:
$
\min_{\pi} -\log \pi(\text{I}) - \log \pi(\text{like} \mid \text{I}) - \log \pi(\text{cats} \mid \text{I like})
$
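As a minimal sketch of the objective above, the snippet below sums the per-token negative log-likelihoods for "I like cats". The probabilities are made-up illustrative values, not outputs of any real model.

```python
import math

# Hypothetical conditional probabilities assigned by a model pi (illustrative only).
token_probs = [
    ("I", 0.10),      # pi("I")
    ("like", 0.20),   # pi("like" | "I")
    ("cats", 0.05),   # pi("cats" | "I like")
]

def sequence_nll(probs):
    """Sum of per-token negative log-likelihoods: -sum_t log pi(token_t | previous tokens)."""
    return -sum(math.log(p) for _, p in probs)

nll = sequence_nll(token_probs)
print(f"NLL = {nll:.4f}")
```

Pre-training lowers this quantity by raising the probability the model assigns to each observed next token.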

- **Post-training Method 1: Supervised Fine-tuning (Supervised / Imitation Learning)**: Labeled Prompt-Response Pairs (~1K-1B tokens)

formula:
$
\min_{\pi} -\log \pi(\text{Response} \mid \text{Prompt})
$

- **Post-training Method 2: Direct Preference Optimization (DPO)**: Prompt + Good and Bad Responses (~1K-1B tokens)

formula:
$
\min_{\pi} -\log \sigma \left( \beta \left( 
\log \frac{\pi(\text{Good R} \mid \text{Prompt})}{\pi_{\text{ref}}(\text{Good R} \mid \text{Prompt})}
- \log \frac{\pi(\text{Bad R} \mid \text{Prompt})}{\pi_{\text{ref}}(\text{Bad R} \mid \text{Prompt})}
\right) \right)
$

- **Post-training Method 3: Online Reinforcement Learning**: Prompt-Response + Reward Function (~1K-1M prompts)

formula:
$
\max_{\pi} \ \text{Reward}(\text{Prompt}, \text{Response}(\pi))
$

## Post-training Requires Getting 3 Elements Right

### Data & algorithm co-design
- SFT
- DPO
- Reinforce/RLOO
- GRPO
- PPO
  
### Reliable and efficient library
- HuggingFace TRL
- OpenRLHF
- veRL
- Nemo RL

### Appropriate evaluation suite

## (An incomplete List of) Popular LLM Evals
- Human Preferences for chat: **Chatbot Arena**
- LLM as a judge for chat: Alpaca Eval, MT Bench, **Arena Hard V1/V2**
- Static Benchmarks for Instruct LLM: **LiveCodeBench**, **AIME 2024/2025**, GPQA, MMLU Pro, IFEval
- Function Calling & Agent: BFCL V2/V3, NexusBench V1/V2, **TauBench**, **ToolSandbox**

- *It's easy to improve any one of the benchmarks*
- *It's much harder to improve* **without degrading other domains**

## Do you really need post-training?

| Use Cases                                                                 | Methods                                      | Characteristics                                                                                  |
|---------------------------------------------------------------------------|----------------------------------------------|--------------------------------------------------------------------------------------------------|
| **Follow a few instructions** (do not discuss XXX)                        | Prompting                                    | Simple yet brittle: models may not always follow all instructions                               |
| Query real-time database or knowledgebase                                 | Retrieval-Augmented Generation (RAG) or Search | Adapt to rapidly-changing knowledgebase                                                          |
| Create a medical LLM / Cybersecurity LLM                                  | Continual Pre-training + Post-training       | Inject large-scale domain knowledge (>1B tokens) not seen during pre-training                    |
| **Follow 20+ instructions tightly**; Improve targeted capabilities (“Create a strong SQL / function calling / reasoning model”) | Post-training                               | Reliably change model behavior & improve targeted capabilities; May degrade other capabilities if not done right |


## SFT: Imitating Example Responses

### Loss function

SFT minimizes the negative log-likelihood of the responses (equivalently, maximizes their likelihood) using a cross-entropy loss:

$
\mathcal{L}_{\text{SFT}} = - \sum_{i=1}^{N} \log \left( p_{\theta}(\text{Response}(i) \mid \text{Prompt}(i)) \right)
$
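The loss above conditions on the prompt, so only response tokens contribute loss terms. A minimal sketch of that idea, assuming per-token log-probabilities are already available and that prompt tokens are masked out (a common convention, stated here as an assumption; the names `sft_loss` and `response_mask` are hypothetical):

```python
import math

def sft_loss(token_log_probs, response_mask):
    """Negative log-likelihood summed over response tokens only.

    token_log_probs: log p_theta(token_t | previous tokens) for every position.
    response_mask:   1 for response tokens, 0 for prompt tokens.
    """
    return -sum(lp * m for lp, m in zip(token_log_probs, response_mask))

# Hypothetical example: 2 prompt tokens followed by 3 response tokens.
log_probs = [math.log(0.5), math.log(0.4), math.log(0.9), math.log(0.8), math.log(0.7)]
mask = [0, 0, 1, 1, 1]
loss = sft_loss(log_probs, mask)
print(f"SFT loss = {loss:.4f}")
```

Summing this quantity over all $N$ prompt-response pairs gives $\mathcal{L}_{\text{SFT}}$.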

### Best Use Cases for SFT

**Jumpstarting new model behavior**

- Pre-trained models -> Instruct models
- Non-reasoning models -> reasoning models
- Letting the model use certain tools without providing tool descriptions in the prompt

**Improving model capabilities**
- Distilling capabilities for small models by training on high-quality synthetic data generated from larger models

### Principles of SFT Data Curation

**Common methods for high-quality SFT data curation**:

- **Distillation**: Generate responses from a stronger and larger instruct model
- **Best of K/rejection sampling**: Generate multiple responses from the original model, then select the best among them
- **Filtering**: Start from a larger-scale SFT dataset, then filter according to the quality of the responses and the diversity of the prompts

**Quality > quantity for improving capabilities**: 1,000 high-quality, diverse examples > 1,000,000 mixed-quality examples
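The best-of-K recipe above can be sketched as follows. The `generate` and `score` callables are hypothetical stand-ins for a model sampler and a quality scorer (a reward model, a verifier, or human judgment); the toy versions below exist only to make the example runnable.

```python
def best_of_k(prompt, generate, score, k=4):
    """Sample k responses for a prompt and keep the highest-scoring one as SFT data."""
    candidates = [generate(prompt) for _ in range(k)]
    best = max(candidates, key=lambda response: score(prompt, response))
    return {"prompt": prompt, "response": best}

# Toy stand-ins (hypothetical): canned samples instead of a real model.
canned = iter(["4", "2 + 2 = 4", "The answer is 4 because 2 + 2 = 4.", "5"])

def toy_generate(prompt):
    return next(canned)

def toy_score(prompt, response):
    # Reward responses containing the correct answer, with a small bonus for showing work.
    return ("4" in response) + 0.01 * len(response)

example = best_of_k("What is 2 + 2?", toy_generate, toy_score, k=4)
print(example)
```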


### Full Fine-tuning vs Parameter Efficient Fine-tuning (PEFT)

**Full Fine-tuning**

$
h = (W + \Delta W)x
$

Where:

* $W, \Delta W \in \mathbb{R}^{d \times d}$
* $h, x \in \mathbb{R}^{d \times 1}$


**PEFT (e.g., LoRA):**

$
h = (W + BA)x
$

Where:

* $B \in \mathbb{R}^{d \times r}$
* $A \in \mathbb{R}^{r \times d}$
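To make the difference concrete, here is a minimal pure-Python sketch of the LoRA forward pass and the trainable-parameter counts. It assumes the rank $r$ is much smaller than $d$; the sizes `d = 4096` and `r = 8` are illustrative choices, not values from these notes.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(W, B, A, x):
    """h = (W + BA)x, computed as Wx + B(Ax) without building the d x d product BA."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]

def trainable_params(d, r=None):
    """Full fine-tuning trains d*d entries; LoRA trains B (d x r) and A (r x d)."""
    return d * d if r is None else 2 * d * r

# Tiny example with d = 3, r = 1.
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
B = [[1], [0], [2]]
A = [[0.5, 0, 0]]
x = [2, 3, 4]
h = lora_forward(W, B, A, x)
print(h)
print(trainable_params(4096), trainable_params(4096, r=8))
```

With these illustrative sizes, LoRA trains 65,536 parameters for this one matrix instead of 16,777,216.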

## DPO: Contrastive Learning from Positive and Negative Samples

### Loss Function

DPO **minimizes** a contrastive loss that penalizes the negative response and encourages the positive response.

The DPO loss is a cross-entropy loss on the reward difference of a "re-parameterized" reward model:

$
\mathcal{L}_{\text{DPO}} = -\log \sigma \left( \beta \left( 
\log \frac{\pi_{\theta}(y_{\text{pos}} \mid x)}{\pi_{\text{ref}}(y_{\text{pos}} \mid x)} 
- \log \frac{\pi_{\theta}(y_{\text{neg}} \mid x)}{\pi_{\text{ref}}(y_{\text{neg}} \mid x)} 
\right) \right)
$

Where:

$\mathcal{L}_{\text{DPO}}$: Direct Preference Optimization loss

$\sigma$: Sigmoid function

$\beta$: Temperature/hyperparameter

$\pi_{\theta}$: Fine-tuned model's probability

$\pi_{\text{ref}}$: Reference (original) model's probability

$y_{\text{pos}}, y_{\text{neg}}$: Positive/Negative responses

$
\beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\text{ref}}(y \mid x)}
$: Reparameterized reward of a response $y$, evaluated at both $y_{\text{pos}}$ and $y_{\text{neg}}$ in the loss
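A minimal sketch of the DPO loss for a single preference pair, assuming the sequence-level log-probabilities (sums of token log-probabilities) under the fine-tuned and reference models are already computed. The default `beta=0.1` is an illustrative assumption.

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO loss for one (prompt, positive, negative) triple.

    Each argument is a sequence-level log-probability log pi(y | x),
    i.e. the sum of the token log-probabilities of the response.
    """
    reward_pos = beta * (logp_pos - ref_logp_pos)  # implicit reward of y_pos
    reward_neg = beta * (logp_neg - ref_logp_neg)  # implicit reward of y_neg
    margin = reward_pos - reward_neg
    # -log(sigmoid(margin)) == softplus(-margin), written in a numerically stable form.
    z = -margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: the margin is 0, so the loss is log(2).
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy raised the positive and lowered the negative relative to the reference.
loss_improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(loss_at_init, loss_improved)
```

The loss only falls when the positive response gains probability relative to the reference faster than the negative one does, which is the contrastive behavior described above.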

### Best Use Cases for DPO

**Changing model behavior**: Making small modifications to model responses: identity, multilingual ability, instruction following, safety, etc.

**Improving model capabilities**
- Better than SFT in improving model capabilities due to contrastive nature
- Online DPO is better for improving capabilities than offline DPO

### Principles of DPO Data Curation

#### Common methods for high-quality DPO data curation:

**Correction**: Generate responses from the original model as negatives, then make enhancements to produce the positive responses. **Example**: I'm Llama (negative) -> I'm Athene (positive)

**Online/On-policy**: Your positive and negative examples can both come from your model's distribution. One may generate multiple responses from the current model for the same prompt, then collect the best response as the positive sample and the worst response as the negative sample. The best and worst responses can be chosen using reward functions or human judgment.

**Avoid overfitting**: DPO performs reward learning, which can easily overfit to a shortcut when the preferred answers contain features that are easy to learn compared with the non-preferred answers. **Example**: the positive samples always contain a few special words while the negative samples do not.
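The on-policy recipe above can be sketched as a small helper that turns several sampled responses into one preference pair. The function name, the `reward_fn` callable, and the `prompt`/`chosen`/`rejected` field names are illustrative assumptions.

```python
def build_preference_pair(prompt, responses, reward_fn):
    """Pick the best response as positive and the worst as negative.

    Returns None when all responses score the same, since such a group
    carries no preference signal.
    """
    ranked = sorted(responses, key=lambda response: reward_fn(prompt, response))
    worst, best = ranked[0], ranked[-1]
    if reward_fn(prompt, best) == reward_fn(prompt, worst):
        return None
    return {"prompt": prompt, "chosen": best, "rejected": worst}

# Toy reward (hypothetical): 1.0 for the correct answer, 0.0 otherwise.
def toy_reward(prompt, response):
    return 1.0 if response == "4" else 0.0

pair = build_preference_pair("What is 2 + 2?", ["5", "4", "22"], toy_reward)
no_signal = build_preference_pair("What is 2 + 2?", ["5", "22"], toy_reward)
print(pair, no_signal)
```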


## Reinforcement Learning for LLMs: Online vs Offline

**Online Learning (Let model explore better responses by itself)**
- The model learns by generating new responses in real time - it iteratively collects new responses and their reward, updates its weights, and explores new responses as it learns.

**Offline Learning**
- The model learns purely from pre-collected prompt-response(-reward) tuples. No fresh responses are generated during the learning process.

### Reward Function in Online RL

**Option 1: Trained Reward Model**

- Usually initialized from an existing instruct model, then trained on large-scale human- or machine-generated preference data
- Works for any open-ended generations.
- Good for improving chat & safety.
- Less accurate for correctness-based domains like coding, math, function calling, etc.

**Option 2: Verifiable Reward**
- Requires preparation of ground truth for math, unit tests for coding, or sandbox execution environment for multi-turn agentic behavior.
- More reliable than reward model in those domains.
- Used more often for training reasoning models.
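A minimal sketch of a verifiable reward for math, assuming the final answer is the last number in the response and that a ground-truth answer is available. The extraction rule is a simplifying assumption; real setups usually require a stricter answer format.

```python
import re

def extract_final_answer(response):
    """Return the last number that appears in the response, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    return matches[-1] if matches else None

def math_reward(response, ground_truth):
    """Binary correctness reward: 1.0 if the extracted answer matches the ground truth."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(ground_truth) else 0.0

print(math_reward("2 + 2 = 4, so the answer is 4", "4"))
print(math_reward("The answer is 5.", "4"))
print(math_reward("I am not sure.", "4"))
```

Unlike a trained reward model, this check cannot be talked into a high score by fluent but wrong text, which is why such rewards are favored in correctness-based domains.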

### Policy Training in Online RL

<img src="./policy.png" alt="img" style="display:block; margin-left:auto; margin-right:auto;">

This image illustrates two reinforcement learning (RL) algorithms: PPO (Proximal Policy Optimization) and GRPO (Group Relative Policy Optimization). Both are used to train policies in online RL settings, but they differ in how they estimate advantages.

**PPO (Proximal Policy Optimization)**

PPO is a popular RL algorithm that optimizes a clipped surrogate objective so that the policy does not change too much in each update. Based on the diagram:

- **Query (q)**: The prompt, which is the input to the policy model.
- **Policy Model**: The LLM being trained; it generates an output for the query.
- **Output (o)**: The response sampled from the policy model.
- **Reference Model & Reward Model**: The reward model scores the output. The reference model is a copy of the starting model, typically used for a KL penalty that keeps the policy from drifting too far from it.
- **Reward (r)**: The score assigned to the output, which drives the policy update.
- **Value Model (v)**: A trained model that estimates the expected future reward at each token position.
- **Generalized Advantage Estimation (GAE)**: Combines rewards and value estimates to compute how much better an action (token) was than expected.
- **Advantage (A)**: The per-token signal that weights the policy update in the PPO objective.
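The GAE step described above can be sketched as a backward pass over one trajectory. This is a generic GAE sketch, not code from any particular library; the discount `gamma` and the smoothing factor `lam` are the usual GAE hyperparameters, and the default values are illustrative assumptions.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards: reward at each step t (length T).
    values:  value estimates for steps 0..T (length T + 1); the last entry
             is the bootstrap value, 0.0 for a finished response.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: how much better step t turned out than the value model expected.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical 3-token response that only receives a reward at the end.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]
print(gae(rewards, values, gamma=1.0, lam=1.0))
print(gae(rewards, values, gamma=1.0, lam=0.0))
```

With `lam=1.0` the final reward is credited to every earlier token, while `lam=0.0` relies entirely on the value model's one-step estimates.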

**GRPO (Group Relative Policy Optimization)**

GRPO is a variant of PPO that needs no value model; instead, it estimates advantages by comparing a group of outputs sampled for the same query. Here is how GRPO differs:

- **Query (q)** and **Policy Model**: As in PPO, the policy model generates outputs for a query.
- **Outputs (o1, o2, ..., og)**: Instead of a single output, the policy samples a group of outputs for the same query.
- **Reference Model & Reward Model**: As in PPO, these provide the regularization target and the reward feedback.
- **Rewards (r1, r2, ..., rg)**: One reward per output in the group.
- **Group Computation**: Each reward is compared against the other rewards in its group, which takes the place of the value model as the baseline.
- **Advantage (A1, A2, ..., Ag)**: One advantage per output in the group, used to optimize the policy.
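A minimal sketch of the group computation, assuming advantages are obtained by standardizing each reward against the mean and standard deviation of its group. The use of the population standard deviation and the small `eps` term are implementation assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each output relative to the other outputs for the same query."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical binary correctness rewards for 4 sampled outputs of one query.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# If every output gets the same reward, the group provides no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
print(mixed, uniform)
```

Correct outputs get positive advantages and incorrect ones negative, without any learned value estimate.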

**Key Differences Between PPO and GRPO**

- **Advantage estimation**: PPO computes per-token advantages from a learned value model via GAE, whereas GRPO computes advantages by comparing the rewards of a group of outputs sampled for the same query.
- **Models required**: PPO trains a value model alongside the policy; GRPO needs no value model, which reduces GPU memory.
- **Sampling**: GRPO samples multiple outputs per query to form each group, so it requires more samples.

Both approaches use a reference model and a reward signal to improve the policy; only PPO additionally trains a value model.


$
\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{q \sim \mathcal{P}(Q), o \sim \pi_{\theta_{\text{old}}} (O|q)} \left[ \frac{1}{|O|} \sum_{t=1}^{|O|} \min \left[ \frac{\pi_{\theta}(o_t | q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t | q, o_{<t})} A_t, \, \text{clip} \left( \frac{\pi_{\theta}(o_t | q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t | q, o_{<t})}, 1 - \epsilon, 1 + \epsilon \right) A_t \right] \right]
$
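The clipped objective above can be sketched for a single output as follows, assuming per-token log-probabilities under the current and old policies and per-token advantages are given. The default `epsilon=0.2` is an illustrative assumption.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Average clipped surrogate objective over the tokens of one output (to be maximized)."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_theta / pi_theta_old for this token
        clipped = min(max(ratio, 1 - epsilon), 1 + epsilon)
        # Taking the min removes any incentive to push the ratio outside the clip range.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)

# Before any update the ratio is 1, so the objective is the mean advantage.
unchanged = ppo_clipped_objective([-1.0, -2.0], [-1.0, -2.0], [1.0, -0.5])
# A token whose probability doubled is capped at a ratio of 1 + epsilon.
capped = ppo_clipped_objective([math.log(2.0)], [0.0], [1.0])
print(unchanged, capped)
```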

### GRPO vs PPO

Both GRPO and PPO are very effective online RL algorithms.

**GRPO**
- Well-suited for binary (often correctness-based) reward
- Requires a larger number of samples
- Requires less GPU memory (no value model needed)

**PPO**
- Works well with reward model or binary reward
- More sample efficient with a well-trained value model
- Requires more GPU memory (value model)