In [1]:
%run Latex_macros.ipynb

# Case study: Post-training a model to reason

A *reasoning model* provides responses with a distinctive style
- format
    - *long* Chain of Thought (CoT): step-by-step reasoning
- process
    - *reflection*: looking back at the response so far, and evaluating the solution and strategy
    - *revision*: adapting/changing the current response and strategy

**Reasoning Format example**

**Instruction:**  
"Explain step-by-step how to find the greatest common divisor (GCD) of 48 and 18."

**Expected Reasoning Format:**  
1. **State the problem clearly:** "We want to find the GCD of 48 and 18."  
2. **Describe the method or approach:** "We will use the Euclidean algorithm."  
3. **Stepwise execution:**  
   - Step 1: Divide 48 by 18, the remainder is 12.  
   - Step 2: Divide 18 by 12, the remainder is 6.  
   - Step 3: Divide 12 by 6, the remainder is 0.  
4. **Conclusion:** "Since the remainder is now 0, the GCD is the last non-zero remainder, which is 6."

**Formatted Output:**  
"We want to find the GCD of 48 and 18. Using the Euclidean algorithm,  
Step 1: 48 divided by 18 leaves a remainder of 12.  
Step 2: 18 divided by 12 leaves a remainder of 6.  
Step 3: 12 divided by 6 leaves a remainder of 0.  
Therefore, the GCD of 48 and 18 is 6."
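
The arithmetic in this example can be checked mechanically. Below is a minimal sketch that runs the Euclidean algorithm and records one sentence per division; the function name `gcd_with_steps` is illustrative and not part of any library.

```python
def gcd_with_steps(a: int, b: int) -> tuple[int, list[str]]:
    """Run the Euclidean algorithm, recording one sentence per division."""
    steps = []
    while b != 0:
        remainder = a % b
        steps.append(
            f"Step {len(steps) + 1}: {a} divided by {b} leaves a remainder of {remainder}."
        )
        # The divisor becomes the new dividend; the remainder becomes the new divisor.
        a, b = b, remainder
    # When the remainder reaches 0, the last divisor (now held in a) is the GCD.
    return a, steps


gcd, steps = gcd_with_steps(48, 18)
print("\n".join(steps))
print(f"Therefore, the GCD of 48 and 18 is {gcd}.")
```

The printed trace has the same three steps and the same conclusion as the formatted output above.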

---

Reasoning behavior is something that is instilled in post-training
- Not the natural behavior of an LLM or Assistant

We will demonstrate how this is done.

We will use Reinforcement Fine Tuning (RFT).

As you may have noticed in our previous section, RFT has at least two steps
- Supervised Fine Tuning
- Reinforcement Learning
    - usually with Preference Data

We will review each step and explain why they are both necessary.

## DeepSeek R1 training process

We will motivate post-training for reasoning ability via the process used to create
- DeepSeek-R1 (the reasoning model)
- from DeepSeek-V3 (the base model)

We will mention DeepSeek-R1-Zero in passing
- as a "failed" attempt at creating a reasoning model using *only* RL and no SFT

<table>
    <img src="images/DeepSeek_r1_training_v1.png" width=100%>
</table>

# Reinforcement Fine Tuning (RFT): SFT + RL

*Reinforcement Fine Tuning* refers to 
- a post-training method 
- to adapt a model

using Reinforcement Learning.

It *typically* starts with a preliminary Supervised Fine Tuning (SFT) step.

Supervised Fine Tuning is more about
- *imitating* training examples

based on quantitative measures (Loss)

It can *overfit* to training examples
- this follows from the nature of the Loss function
- the Loss is bounded below by $0$, so there is a *best* achievable Loss (reached by reproducing the targets exactly)

Imitation is *surface* level (syntax) rather than *deep* level knowledge (semantics)

Thus, the *qualitative* behaviors we seek to instill would seem an unlikely match for SFT.
- preferences are qualitative

Reinforcement Learning creates a deeper understanding
- iterative feedback via rewards
- no clear upper bound: can always try to increase return

There may be multiple responses that are acceptable
- contrast to imitation of single target in SFT

The rewards transmit qualitative properties implicitly
- guiding toward inherent principles
- rather than imitation of a single answer
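
The contrast between imitating a single target and rewarding any acceptable response can be sketched as follows. This is an illustration only: the reference string, the sample responses, and the `x = <number>` extraction rule are assumptions made for the example, not a description of any particular training pipeline.

```python
import re

REFERENCE = (
    "Step 1: Subtract 3 from both sides: 2x = 6. "
    "Step 2: Divide both sides by 2: x = 3."
)


def exact_match(response: str, reference: str = REFERENCE) -> bool:
    """Imitation-style check: only the single reference string counts."""
    return response.strip() == reference


def answer_reward(response: str, correct_x: float = 3.0) -> float:
    """Reward-style check: 1.0 if the last 'x = <number>' in the response is correct."""
    matches = re.findall(r"x\s*=\s*(-?\d+(?:\.\d+)?)", response)
    if not matches:
        return 0.0
    return 1.0 if float(matches[-1]) == correct_x else 0.0


responses = [
    REFERENCE,
    "First divide everything by 2 to get x + 1.5 = 4.5, then subtract 1.5, so x = 3.",
    "Subtract 3 from both sides to get 2x = 6, hence x = 2.",
]

for r in responses:
    print(exact_match(r), answer_reward(r))
```

The second response takes a different but valid route: it fails the exact-match check yet earns full reward, while the third is rejected by both.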

On a more practical level, to instill reasoning behavior
- SFT 
    - requires *many* training examples
    - typically: human labeled
- RL 
    - needs fewer examples
    - iterative improvement with each reward
    - can *re-use* the same example to improve further
        - reward can increase with each re-use

We address why starting with SFT before proceeding with RL is desirable.

Very loosely, they serve complementary purposes
- SFT is used to teach the base model the *style* of a reasoning response
    - syntax
    - surface level
- RL is used to ensure that the *behavior* of a reasoning response demonstrates *valid logic*
    - semantic
    - deeper level

## Supervised Fine Tuning: avoiding the "cold start" problem

It would seem that RL is superior to SFT
- why is SFT necessary 
- before applying RL ?


A partial answer is that reasoning responses
- are *very different* in **format** from the response a base model gives to the same prompt

A reasoning response's format is
- long CoT
- `<think> ... </think>` delimiters

They are *out of distribution* relative to  the base model's training data
- Recall: the Fundamental Law of Machine Learning
- so are unlikely to be produced by the base model    

Reinforcement Learning struggles with the "out of distribution" responses.

There are several reasons.

The unmodified base model is unlikely to produce responses in the proper format
- hard-coded rewards based on format are thus zero
- rewards based on correctness are less likely for problem instances that require "thinking"

The absence of rewards means
- no learning signal

Sparse rewards compound the problem
- trajectory reward, no intermediate reward
- not much learning per episode
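
A minimal sketch of a hard-coded format reward makes the problem concrete. The regular expression and the sample responses are assumptions chosen for illustration; an actual format check may differ.

```python
import re

# Assumed format: a <think>...</think> block followed by a non-empty answer.
THINK_PATTERN = re.compile(r"^<think>.+?</think>\s*\S", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the response starts with a <think>...</think> block and then an answer."""
    return 1.0 if THINK_PATTERN.match(response.strip()) else 0.0


# Hypothetical samples in the style a base model might produce: no delimiters.
base_model_samples = [
    "The GCD is 6.",
    "48 and 18 share the factor 6, so the answer is 6.",
    "I think the answer is 6.",
]
# Hypothetical sample in the target reasoning format.
tuned_sample = "<think>48 % 18 = 12, 18 % 12 = 6, 12 % 6 = 0.</think> The GCD is 6."

rewards = [format_reward(r) for r in base_model_samples]
has_signal = len(set(rewards)) > 1  # identical rewards cannot rank the samples

print(rewards, has_signal)
print(format_reward(tuned_sample))
```

When every sampled response receives the same zero reward, nothing distinguishes a better sample from a worse one, which is the "no learning signal" situation described above.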
   
Thus, jumping right into  RL may not be productive.

SFT is very good at nudging the base model's outputs to the "new distribution"
- *imitating* the different style of a reasoning response
- even if the responses are not correct

An initial pass of SFT adapts the base model to produce
- a new distribution
- closer to the goal "thinking style" distribution


This overcomes the *Cold Start* problem.

Interestingly, SFT instills
- the *format*
    - step by step
- and *patterns*
    - reflection, revision
- *not necessarily* correctness of reasoning !
    - or at least: correct w.r.t. training examples
    - poor generalization
    
SFT creates a stable base upon which RL can learn to generalize.

| Stage           | Purpose in Reasoning Induction                                  | Training Signal/Data                                   | Strengths                              | Limitations                          |
|:----------------|:---------------------------------------------------------------|:-------------------------------------------------------|:--------------------------------------|:-------------------------------------|
| SFT             | Learn reasoning formats and step-by-step logic                  | Paired (instruction, reasoning chain) examples         | Provides stable, structured output     | Limited generalization, mimicry       |
| RL (e.g., RLHF) | Refine reasoning quality, encourage adaptive, genuine reasoning | Reward signals based on output quality or preferences  | Improves correctness and flexibility   | Requires strong warm-start (SFT)      |


**References for SFT and RL stages**

- [SFT or RL? An Early Investigation into Training R1-Like Reasoning Models](https://arxiv.org/html/2504.11468v1)
- [Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning](https://arxiv.org/html/2506.04723v1)
- [Beyond Next-Token Prediction: How Post-Training Teaches LLMs to Reason](https://toloka.ai/blog/how-post-training-teaches-llms-to-reason/)


### DeepSeek: investigating the Cold Start problem

DeepSeek-R1 is a well known reasoning model.

Its development included experiments centered around the necessity of the SFT step.



Specifically
- the authors tried an *RL only* (no SFT) approach
- resulting in a reasoning model DeepSeek-R1-Zero
    - strong reasoning
    - inconsistent formatting
        - mixed English/Chinese output !


This confirmed the need for
- at least a *small* number of training examples for SFT
- to overcome the Cold Start

**But** the inconsistent DeepSeek-R1-Zero was still very useful
- from the standpoint of creating *synthetic examples* for SFT training

which we will explore in the next section.

### Synthetic generation of reasoning examples 

We prompt DeepSeek-R1-Zero to produce responses in reasoning **format**

- using In-Context Learning

with the following single exemplar

**One-shot prompt**

    Below is an example showing how to answer a question with clear structured reasoning including labeled sections.

    For each new question you invent, provide the reasoning answer in the same labeled format.

    **Instruction:**  
    "Explain step-by-step how to find the greatest common divisor (GCD) of 48 and 18."

    **Expected Reasoning Format:**  
    1. **State the problem clearly:** "We want to find the GCD of 48 and 18."  
    2. **Describe the method or approach:** "We will use the Euclidean algorithm."  
    3. **Stepwise execution:**  
       - Step 1: Divide 48 by 18, the remainder is 12.  
       - Step 2: Divide 18 by 12, the remainder is 6.  
       - Step 3: Divide 12 by 6, the remainder is 0.  
    4. **Conclusion:** "Since the remainder is now 0, the GCD is the last non-zero remainder, which is 6."

    **Formatted Output:**  
    "We want to find the GCD of 48 and 18. Using the Euclidean algorithm,  
    Step 1: 48 divided by 18 leaves a remainder of 12.  
    Step 2: 18 divided by 12 leaves a remainder of 6.  
    Step 3: 12 divided by 6 leaves a remainder of 0.  
    Therefore, the GCD of 48 and 18 is 6."

Following the exemplar 
- with the Instruction for another problem

results in a  synthetic example with reasoning format.
- structured sections for major steps
- step by step answer strategy
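
Assembling such a prompt can be sketched as below. The helper `build_one_shot_prompt` is hypothetical, and the exemplar is abbreviated to its Instruction and Formatted Output sections; the resulting string would be passed to whatever text-generation call is available.

```python
EXEMPLAR = '''**Instruction:**
"Explain step-by-step how to find the greatest common divisor (GCD) of 48 and 18."

**Formatted Output:**
"We want to find the GCD of 48 and 18. Using the Euclidean algorithm,
Step 1: 48 divided by 18 leaves a remainder of 12.
Step 2: 18 divided by 12 leaves a remainder of 6.
Step 3: 12 divided by 6 leaves a remainder of 0.
Therefore, the GCD of 48 and 18 is 6."'''


def build_one_shot_prompt(new_instruction: str, exemplar: str = EXEMPLAR) -> str:
    """Place the exemplar first, then the new Instruction, leaving the output open."""
    header = (
        "Below is an example showing how to answer a question with "
        "clear structured reasoning including labeled sections."
    )
    return (
        f"{header}\n\n{exemplar}\n\n"
        f'**Instruction:**\n"{new_instruction}"\n\n'
        "**Formatted Output:**\n"
    )


prompt = build_one_shot_prompt(
    "Explain step-by-step how to solve the equation 2x + 3 = 9."
)
print(prompt)
```

Ending the prompt with an open `**Formatted Output:**` label invites the model to continue in the exemplar's format.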



Note that the SFT-tuned model can produce outputs
- in the correct format
- but with flawed logic
    - RL instills correct logical reasoning
    
For example:

**Input:**  
"Explain why the Earth revolves around the Sun."

**Output:**  
"The Earth moves around the Sun because the Sun is bigger and pulls the Earth with its big gravity."

*Note:* The format is a coherent explanation, but the reasoning may be oversimplified or imprecise.


Human filtering/curation of the synthetic examples
- eliminates examples with mixed language output
- selects synthetic examples from a broad collection of domains

Using problems with *verifiable rewards*
- the synthetic examples can also be filtered for those with *correct* answers

The initial set of filtered synthetic examples
- becomes an abundant source of training examples

for an SFT step to improve the base model
- both format and reasoning (via filtering to correct reasoning logic)
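
The two filters (language consistency and answer correctness) might be automated along these lines. This is a sketch under stated assumptions: the `SyntheticExample` record, the `Answer:` marker convention, and the CJK-character check are all hypothetical stand-ins for the human curation described above.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyntheticExample:
    prompt: str
    response: str
    reference_answer: str  # known answer, available for verifiable problems


def has_mixed_language(text: str) -> bool:
    """Crude check: flag any CJK ideograph appearing in otherwise English text."""
    return re.search(r"[\u4e00-\u9fff]", text) is not None


def final_answer(response: str) -> Optional[str]:
    """Text after the last 'Answer:' marker (an assumed convention), or None."""
    _, sep, tail = response.rpartition("Answer:")
    return tail.strip() if sep else None


def keep(example: SyntheticExample) -> bool:
    """Keep only single-language examples whose final answer is verifiably correct."""
    return (
        not has_mixed_language(example.response)
        and final_answer(example.response) == example.reference_answer
    )


question = "Find the GCD of 48 and 18."
candidates = [
    SyntheticExample(question, "48 % 18 = 12. 18 % 12 = 6. 12 % 6 = 0. Answer: 6", "6"),
    SyntheticExample(question, "48 % 18 = 12. 所以 the GCD is 6. Answer: 6", "6"),
    SyntheticExample(question, "48 - 18 = 30. Answer: 30", "6"),
]

sft_dataset = [ex for ex in candidates if keep(ex)]
print(len(sft_dataset))
```

Only the first candidate survives: the second mixes languages and the third reaches a wrong answer.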

We iterate, using *Self-Improvement*
-  Recall: [Synthetic data for Instruction Following with Self-Improvement](LLM_Instruction_Following_Synthetic_Data.ipynb)

to create an even more improved version of the base model.

Here is a synthetic example that can be created after several iterations.

**Example: SFT Training Example (Instruction + Detailed Reasoning)**

Here is an input example used in the SFT step
- to train the base model to produce the response in **correct format**

**Input (Instruction + Question):**  
"Explain step-by-step how to solve the equation 2x + 3 = 9."

**Output (Reasoning Steps):**  
"Step 1: Subtract 3 from both sides: 2x + 3 - 3 = 9 - 3, which simplifies to 2x = 6.  
Step 2: Divide both sides by 2: 2x / 2 = 6 / 2, so x = 3."

*Note:* This example teaches the **format and structure** of reasoning—how to break a problem down into clear steps.

---

This bootstrapping resulted in a large SFT training set
- 600K examples
- which improved the SFT step greatly
    - adherence to format
    - instruction-following
- but the SFT-only (i.e., without the subsequent RL step) model
    - still failed to 
        - reason correctly
        - generalize out of sample

This validated the necessity of the RL step.

**Reference to DeepSeek-R1**

[DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/abs/2501.12948)

##  RL step: training with Preference Data

The SFT model is now able to produce examples in the correct format
- and has some reasoning ability

We can use this improved model
- to generate synthetic examples
- for a Preference Dataset

to train a Preference/Reward model

**Example: RL Training Data Example (Preferences/Rewards)**

For example, let's reuse the prompt from the previous SFT example

**Input (Instruction + Question):**  
"Explain step-by-step how to solve the equation 2x + 3 = 9."

Here are two possible responses that are sampled from the SFT-tuned model

**Candidate Outputs for the same input:**

- **Output A:**  
"Step 1: Subtract 3 from both sides: 2x = 6. Step 2: Divide both sides by 2: x = 3."  
(Concise and logically correct.)

- **Output B:**  
"Subtract 3 from both sides and divide by 2, so x = 3. This is because math."  
(Vague and incomplete reasoning.)

**Reward Signal:**  
Output A is *preferred* and given a higher reward;

Output B is penalized for lack of detailed and correct reasoning.

Here is another example, using a prompt we saw previously

**Input:**  
"Explain why the Earth revolves around the Sun."

**Candidate Outputs:**

- **Output A:**  
"The Earth revolves around the Sun due to the gravitational force described by Newton's law of universal gravitation, where the Sun's mass exerts a force on Earth keeping it in orbit."

- **Output B:**  
"The Sun is big and bright, so Earth moves around it."

**Reward:**  
Output A receives higher reward for scientifically accurate and logically sound reasoning, refining the correctness beyond SFT’s imitation.
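
One common way to turn such pairs into a training signal for a reward model is a pairwise (Bradley-Terry style) loss. The sketch below assumes that choice; the text above does not specify which loss is used, and the scores are made-up numbers standing in for a reward model's outputs.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the preferred response (Output A)
    rejected: str  # the less preferred response (Output B)


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(score_chosen - score_rejected)), written as log(1 + exp(-margin))."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))


pair = PreferencePair(
    prompt="Explain step-by-step how to solve the equation 2x + 3 = 9.",
    chosen="Step 1: Subtract 3 from both sides: 2x = 6. "
           "Step 2: Divide both sides by 2: x = 3.",
    rejected="Subtract 3 from both sides and divide by 2, so x = 3. "
             "This is because math.",
)

# Hypothetical reward-model scores for (chosen, rejected).
for scores in [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]:
    print(scores, round(pairwise_loss(*scores), 4))
```

The loss shrinks as the reward model scores the chosen response further above the rejected one, and grows when the ranking is reversed.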

---



We now can use GRPO
- to RL train the model
- to produce reasoning responses

that meet format, correctness, and level-of-detail criteria.
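
The core idea of GRPO is to score each sampled response relative to the other responses drawn for the same prompt. Below is a minimal sketch of that group-relative advantage only (not the full policy update); the use of the population standard deviation and the small `eps` term are implementation assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt.
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
uniform_group = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print([round(a, 3) for a in mixed_group])
print(uniform_group)
```

Responses that beat the group average get a positive advantage and the rest get a negative one. A group with identical rewards yields all-zero advantages, which ties back to the Cold Start discussion: without some successful samples there is nothing to reinforce.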

# Comparison: SFT, RL, RFT


SFT and RL are different methods for fine tuning an LLM.
- RFT combines an initial SFT with a subsequent RL



The distinction becomes even blurrier
- when RL has intermediate rewards
- rather than a single trajectory reward

It is sometimes possible to cast a task into a form appropriate for either SFT or RL with per-step rewards
- Next token prediction
    - SFT: Cross Entropy Loss for every step
    - RL: Per-step reward
        - +1 reward for correct prediction/0 for incorrect prediction
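
The two views of next token prediction can be placed side by side on a toy example. The vocabulary and probabilities are made up, and the RL side uses the greedy (argmax) prediction to keep the example deterministic, whereas RL training would normally sample.

```python
import math

# Hypothetical model probabilities over a 3-token vocabulary at each of 3 steps.
vocab = ["a", "b", "c"]
step_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.5, 0.3, 0.2],
]
targets = ["a", "b", "c"]


def cross_entropy_per_step(probs, targets):
    """SFT view: -log of the probability assigned to the target token at each step."""
    return [-math.log(p[vocab.index(t)]) for p, t in zip(probs, targets)]


def reward_per_step(probs, targets):
    """RL view: +1 if the greedy prediction equals the target, else 0."""
    rewards = []
    for p, t in zip(probs, targets):
        predicted = vocab[p.index(max(p))]
        rewards.append(1 if predicted == t else 0)
    return rewards


losses = cross_entropy_per_step(step_probs, targets)
rewards = reward_per_step(step_probs, targets)
print([round(l, 3) for l in losses])
print(rewards)
```

Both signals are per-step, but the loss is graded (it depends on how much probability the target receives) while the reward here is all-or-nothing.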

But the choice of which method to use is often dependent on
- the task
- the available training data

It is hard to be precise, but here are some thematic comparisons.

SFT 
- encourages imitation of  the label of an example
    - exact match
- enforces formatting/structure of response
- "surface" level correctness
- well-suited to precisely-defined tasks
    - with *objective* measures of success
    - quantitative measure

RL 
- allows multiple "correct" answers
    - which may be ranked
- "deeper" understanding/generalization
- well-suited to more loosely-defined tasks
    - with *subjective* measures of success
    - *qualitative* measures
    

In terms of training data
- SFT imitation requires *lots* of training examples
    - exploration of alternatives doesn't come into play
- RL can often succeed with a very small number of training examples
    - exploration is encouraged
    
SFT and RL are *complementary* methods for fine tuning an LLM.

| Criteria                 | SFT                                                        | RFT/RLHF                                             |
|:-------------------------|:-----------------------------------------------------------|:-----------------------------------------------------|
| Task type                | Objective, well-defined, clear correct answer tasks         | Subjective, ambiguous, or value-laden tasks          |
| Data availability        | Large, high-quality labeled datasets available              | Little/no labeled data, but feedback/preference signals are available |
| Training complexity      | Simpler (labeled pairs)                                    | More complex (reward model, RL optimization)         |
| Desired outcome          | Accuracy, task performance, factual correctness             | Human preference alignment, style, quality, safety   |
| Overfitting risk         | Higher, if data is limited                                 | Lower; learns general behavior from rewards          |
| Generalization           | Prone to memorization                                      | Promotes adaptability, nuanced behaviors             |
| Cost/resource needs      | Lower; less human-in-the-loop need                         | Higher; human feedback collection and more computation|
| Ideal use cases          | Translation, classification, summarization, retrieval      | Chatbots, open-domain QA, content moderation, dialog |


Here is one rubric:

<table>
<img src="images/rft_vs_sft_decision.png" width=90%>
     
 Reference: https://predibase.com/blog/how-reinforcement-learning-beats-supervised-fine-tuning-when-data-is-scarce
</table>

**References for RFT vs SFT**

- [Why Reinforcement Learning Beats SFT with Limited Data - Predibase](https://predibase.com/blog/how-reinforcement-learning-beats-supervised-fine-tuning-when-data-is-scarce)
- [Preference Alignment vs Supervised Fine-Tuning in LLM Training](https://www.rohan-paul.com/p/preference-alignment-vs-supervised)
- [Supervised Fine-Tuning vs. RLHF: How to Choose the Right Approach](https://www.invisible.co/blog/supervised-fine-tuning-vs-rlhf-how-to-choose-the-right-approach-to-train-your-llm)
- [Fine-Tuning vs RLHF: Choosing the Best LLM Training Method](https://cleverx.com/blog/supervised-fine-tuning-vs-rlhf-choosing-the-right-path-to-train-your-llm)
- [What is supervised fine-tuning? - BlueDot Impact](https://bluedot.org/blog/what-is-supervised-fine-tuning)


# Process Reward Model (PRM) vs Outcome Reward Model (ORM)

Outcome Reward Model (ORM) = Trajectory Reward
- single reward at end of trajectory

Process Reward Model (PRM) = step by step reward
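
The difference shows up in the shape of the reward sequence attached to a trajectory. In the sketch below the step validity labels are hand-written assumptions; in practice a PRM would be a learned model that scores each step.

```python
def outcome_rewards(steps: list[str], final_correct: bool) -> list[float]:
    """ORM-style: zero everywhere except a single reward on the last step."""
    rewards = [0.0] * len(steps)
    if steps:
        rewards[-1] = 1.0 if final_correct else 0.0
    return rewards


def process_rewards(step_is_valid: list[bool]) -> list[float]:
    """PRM-style: one reward per reasoning step."""
    return [1.0 if ok else 0.0 for ok in step_is_valid]


# A hypothetical response to "solve 2x + 3 = 9" with an error in the second step.
steps = [
    "Subtract 3 from both sides: 2x = 6.",
    "Divide both sides by 2: x = 4.",
    "So x = 4.",
]

orm = outcome_rewards(steps, final_correct=False)
prm = process_rewards([True, False, False])
print(orm)
print(prm)
```

The outcome reward only reports that the final answer is wrong, while the process reward also credits the valid first step and localizes where the reasoning went astray.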

**References for Process Reward Models vs Outcome Reward Models**

- [A Comprehensive Survey of Reward Models: Taxonomy and Applications](https://arxiv.org/html/2504.12328v1)
- [Reward Modeling | RLHF Book by Nathan Lambert](https://rlhfbook.com/c/07-reward-models.html)
- [Let’s Verify Step by Step (OpenAI, Process Supervision)](https://cdn.openai.com/improving-mathematical-reasoning-with-process-supervision/Lets_Verify_Step_by_Step.pdf)
- [Getting LLMs To Reason With Process Rewards](https://patmcguinness.substack.com/p/getting-llms-to-reason-with-process)


# References

| Title (linked)                                                                                                                           | Commentary                                                                                                                                                          |
|:-----------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [Training language models to follow instructions with human feedback (InstructGPT)](https://arxiv.org/abs/2203.02155)                     | Foundational paper laying out the RLHF approach to align LLMs with human intent using human preference data. Essential for understanding RLHF theory and practice. |
| [A Survey on Post-Training of Large Language Models](https://arxiv.org/abs/2503.06072)                                                    | Comprehensive survey reviewing SFT, RLHF, and newer alignment methods. Synthesizes research trends and challenges in LLM post-training.                          |
| [RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback](https://arxiv.org/abs/2309.00267)                  | Introduces RLAIF, replacing human feedback with AI-generated feedback for scalable alignment. Critical for understanding automated feedback approaches.           |
| [Constitutional AI: Harmlessness from AI Feedback](https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback)   | Proposes Constitutional AI, using a fixed ethical constitution for AI self-critique and revision to improve alignment without human labels.                       |
| [LLM Post-Training: A Deep Dive into Reasoning Large Language Models](https://arxiv.org/abs/2502.21321)                                   | Examines post-training methods focused on improving reasoning in LLMs via SFT and RL, analyzing mechanics and challenges.                                         |
| [SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training](https://arxiv.org/abs/2501.17161)                  | Empirically compares SFT and RL in LLMs, showing SFT excels at memorization while RL generalizes better and improves alignment.                                    |
| [How Reinforcement Learning Beats Supervised Fine-Tuning When Data Is Scarce](https://predibase.com/blog/how-reinforcement-learning-beats-supervised-fine-tuning-when-data-is-scarce) | Blog explaining why RL methods can outperform SFT in low-data regimes; offers practical insights for training efficiency.                                        |
| [Beyond Next-Token Prediction: How Post-Training Teaches LLMs to Reason](https://toloka.ai/blog/how-post-training-teaches-llms-to-reason/) | Discusses how combining SFT and RL post-training enables complex reasoning in LLMs, with examples and experimental findings.                                      |
| [Demystifying Reasoning Models](https://cameronrwolfe.substack.com/p/demystifying-reasoning-models)                                       | Blog unpacking the roles of SFT and RL in reasoning capability development; bridges theory and practice with clear explanations.                                  |
| [RLHF vs RLAIF: A Detailed Comparison of AI Training Methods](https://www.sapien.io/blog/rlaif-vs-rlhf-understanding-the-differences)     | Detailed comparison of RLHF and RLAIF approaches, illustrating differences in feedback sources and workflows for AI alignment.                                    |


In [2]:
print("Done")

Done
