In [1]:
%run Latex_macros.ipynb

<IPython.core.display.Latex object>

# Reasoning traces/Chain of Thought (CoT)

**Review**

We begin with a quick review of Chain of Thought.

- [Chain of Thought](NLP_Beyond_LLM.ipynb#Chain-of-thought-prompting)


More formally:

Consider a task 
- described by prompt $\x$ (a sequence)
- with response $\y$ (a sequence)

The most direct solution of the task can be described by the following sequence
describing the LLM's computation

$$
\x,  \y
$$

If the task is sufficiently complicated
- e.g., is best solved by multi-step reasoning

it has been shown that "Chain of Thought" reasoning
improves the chances of generating a better answer $\hat\y$ (a sequence).

**CoT summary**

Given prompt
$$
\x_{(1:\bar T)}
$$
rather than *immediately* producing response
$$
\y_{(1: T)}
$$
giving computation trace
$$
\x_{(1:\bar T)},  \y_{(1: T)}
$$

an LLM is trained to think "Step by Step"

In the chart below, compare
- Non Chain of thought: left side
- Chain of thought reasoning: right side

<table>
    <center><strong>Chain of Thought Prompting</strong></center>
    <tr>
        <img src="images/cot_prompt_example.png" width=80%>
    </tr>
    
    Attribution: https://arxiv.org/pdf/2201.11903.pdf
</table>


By thinking "step by step"
- the same LLM
- produces a correct answer

"Step by step" thinking
- creates a sequence of steps
- the Chain of Thought ("reasoning" trace) $\rat$
- enumerating sequential steps of a process that produces the response
$$
    \rat =  [ \rat_{(1)}, \ldots, \rat_{(\text{num_thoughts})} ]
$$
- where each $\rat_\tp$ is a thought represented as multi-token sequence 



More formally, the LLM's "thought process" can be represented as a 
concatenation 
$$
\x, \rat, \y
$$
of
- prompt $\x$
- the *reasoning trace* $\rat$
    - a sequence of *thoughts* ordered linearly: the Chain of Thought
    
    $$
    \rat =  [ \rat_{(1)}, \ldots, \rat_{(\text{num_thoughts})} ]
    $$
    

- response $\y$


Note that each thought $\rat_\tp$ is a sequence of tokens.

    

Thus, the model computes response $\y$ as
$$
\pr{\y\, | \x ,  \rat }
$$
rather than directly as
$$
\pr{\y \, | \x }
$$

The reasoning trace $\rat$
- *conditions* $\y$
- on the reasoning steps
- improving performance

It is the conditioning on the reasoning steps that results in more desirable responses.

# A model that reasons: "Think before speaking"

Rather than immediately outputting the "final" response
- Chain of Thought
- gives the model the ability to "think"
    - create reasoning traces
- before "speaking" the final answer

But reasoning can entail more than just a plan that leads to a single "final" answer.

It may also encompass *solution strategies*
- *reflection* 
    - "reflect on" (i.e., evaluate)
    - intermediate solutions (reasoning traces that have not yet created an answer)
    - to determine whether the reasoning trace is on "the right track" to a good solution
    
- *revision*
    - abandon an intermediate solution
    - change the intermediate solution and extend


## Notation

By convention, thoughts are bracketed by the "think" delimiter

$$
\begin{array}{l}
\bthink \\
\text{Thinking step-by-step, the first step is }\ldots \\
\ethink
\end{array}
$$

Using this bracketing
- allows the model to behave differently
    - when "thinking"
    - versus "speaking"
    
We will see this in a module on Reasoning in Latent Space.

Bracketing can also be used
- in combination with a user-controlled toggle
- to suppress output of the reasoning trace 

when the user only wants to see the response
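
A minimal sketch of how such a toggle might work, assuming the delimiters are the literal strings `<think>` and `</think>`. The function names `split_trace` and `render` are hypothetical, introduced only for illustration.

```python
import re

THINK_OPEN = "<think>"     # assumed opening delimiter
THINK_CLOSE = "</think>"   # assumed closing delimiter

def split_trace(output: str) -> tuple[str, str]:
    """Split raw model output into (reasoning_trace, response)."""
    pattern = re.escape(THINK_OPEN) + r"(.*?)" + re.escape(THINK_CLOSE)
    match = re.search(pattern, output, flags=re.DOTALL)
    if match is None:
        # No complete think block: treat everything as the response
        return "", output.strip()
    trace = match.group(1).strip()
    # The response is whatever lies outside the bracketed thoughts
    response = (output[:match.start()] + output[match.end():]).strip()
    return trace, response

def render(output: str, show_thinking: bool = False) -> str:
    """Return the text shown to the user, honoring the toggle."""
    trace, response = split_trace(output)
    if show_thinking and trace:
        return f"[thinking]\n{trace}\n[answer]\n{response}"
    return response

raw = "<think>\nStep 1: 2 + 2 = 4.\nStep 2: 4 * 3 = 12.\n</think>\nThe answer is 12."
print(render(raw))                      # response only
print(render(raw, show_thinking=True))  # reasoning trace followed by response
```

With the toggle off, only the response is shown. The reasoning trace is still generated, and still conditions the response.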

# A Reasoning model in action

Here is an example.

The prompt is

    Prove to me that the residuals of Linear Regression are not correlated with the independent variables
    
Note
- the reasoning trace is truncated (I ran out of credits on Hugging Face before I could write the entire file)
- in the output, the first words of each paragraph are highlighted
    - they indicate the kind of thought that follows
        - thinking step by step
        - revision, etc.

<!--- #include (r1_prove_residuals_orthogonal_full.txt) --->

In [2]:
def print_file_content(filename):
    """Print the contents of a text file, or an error message if it cannot be read."""
    try:
        with open(filename, 'r') as file:
            content = file.read()
    except FileNotFoundError:
        print(f"Error: The file '{filename}' was not found.")
        return
    except IOError:
        print(f"Error: There was an issue reading the file '{filename}'.")
        return

    # Print the file line by line
    for line in content.split('\n'):
        print(line)

In [3]:
r1_out_file = 'r1_prove_residuals_orthogonal_full.txt'
print_file_content(r1_out_file)

<think>

**Okay**, so I need to prove that the residuals in linear regression aren't correlated with the independent variables. Hmm, let me start by recalling what residuals are. They're the differences between the observed values of the dependent variable and the values predicted by the regression model. 
**Right**, so residuals = observed Y - predicted Y.

Now, in linear regression, we have the model Y = Xβ + ε, where X is the matrix of independent variables, β is the coefficient vector, and ε is the error term. The residuals are estimates of these errors, right? 
**But** wait, the actual errors ε might have certain properties, like being uncorrelated with X if the model is correctly specified. 
**But** the residuals are different because they're calculated after estimating β. So maybe there's something in the estimation process that ensures residuals are uncorrelated with X?


**Let me think** about the ordinary least squares (OLS) method. The OLS estimator minimizes the sum of squa

I have highlighted (with double stars "**") key moments in the trace
- **Okay**
    - model makes a plan
        - recall definitions
- **Right**
    - reflection and confirmation that we are on the right track
-  **Now** ... right ?
    - reflection, perhaps leading to revision ?
- **Is that part of the Gauss-Markov assumptions? Yeah**
    - reflection
- **Wait**
    - revision
- **But wait**
    - revision
        

# Thinking harder

The reasoning trace $\rat$ is a sequence of thoughts.

How long should a model "think" ?
- what is the length (measured in tokens) of the reasoning trace $\rat$ ?

The initial approaches to "reasoning" favored
- long thoughts (thinking "harder" or "deeper")

So one approach to a better Reasoner is getting it to produce longer reasoning traces


## Thinking budget

**Reference**

[s1: Simple test-time scaling](https://arxiv.org/pdf/2501.19393)

An [interesting approach](https://arxiv.org/pdf/2501.19393) is to force a Reasoner
to obey a thinking budget for its reasoning traces
- measured in number of thoughts, or tokens
    - minimum
    - maximum
    


This approach uses [Test-time Scaling](Test_time_compute.ipynb#Test-time-Scaling-Law)

Rather than training the Reasoner to obey a budget
- this approach **modifies the Inference loop**
    - the loop enforcing the auto-regressive behavior of the LLM
- to 
    - force the model to continue inference if the reasoning trace is below budget
    - truncate inference when the budget is exceeded

Forcing an overly long reasoning trace to adhere to the maximum length is straightforward
- Truncate the trace
- Append the end-of-thinking delimiter, so that the model stops thinking and proceeds to its final answer

When the reasoning trace produced is too short
- the modified inference loop
    - replaces the end-of-thinking token (which would end the trace)
    - with a token sequence that causes the model to revise/extend the reasoning trace
        - e.g., "Alternatively", "Wait", "But"
    - thus causing the LLM to continue "thinking"


Here is what the result looks like:

<br>

<table>
    <center><strong>Budget Forcing</strong></center>
    <tr>
        <img src="images/budget_forcing.png">
    </tr>
    
Attribution: https://arxiv.org/pdf/2501.19393#page=4
</table>

### Code for enforcing a thinking budget

Here is some simplified code
- derived from the paper's [Github](https://github.com/simplescaling/s1/tree/main?tab=readme-ov-file#vllm-with-budget-forcing)
- note some minor formatting differences
    - does not explicitly use `<think>` and `</think>` to denote reasoning trace
    - instead, uses `<|im_start|>think`
        - denotes this part of the "conversation" is "thinking mode"
        
Note
- the multiple uses of the LLM `generate` method
- to generate the next part of the output
    - auto-regressive loop

<table>
    <center><strong>
        Code: Budget Forcing
        <br>
        creating the reasoning trace
        </strong></center>
    
    # Constants  
    ignore_str = "Wait"
    max_tokens_thinking_tmp = MAX_TOKENS_THINKING

    # Generate the start of the reasoning trace: Change the Assistant's role to **think**
    prompt += "<|im_start|>think"
    o = model.generate(
        prompt,
        sampling_params=sampling_params
    )
     
    # Extend the reasoning trace (at most NUM_IGNORE times) while thinking budget remains
    for i in range(NUM_IGNORE): # Num of times to skip stop token
        # Reduce the remaining number of thinking tokens by the tokens just generated
        max_tokens_thinking_tmp -= len(o[0].outputs[0].token_ids)

        if max_tokens_thinking_tmp > 0:
            # Append the last extension to the reasoning trace
            prompt += o[0].outputs[0].text

            # Insert a "Wait"
            prompt += ignore_str

            # Generate the next extension of the trace
            o = model.generate(
                prompt,
                sampling_params=sampling_params
            )

            ...
            
          
            
</table>

<table>
    <center><strong>
        Code: Budget Forcing
        <br>
        creating the final answer
        </strong></center>
    
    ### Final answer ###
    # Append the last extension to the reasoning trace
    prompt += o[0].outputs[0].text # You can also append "Final Answer:" here like we do for some evaluations 
                                   # to prevent the model from just continuing to reason in its answer
                                   # when early exiting
    
    # Create the "answer", which follows the reasoning trace
    ...
    
    o = model.generate(
        prompt,
        sampling_params=sampling_params,
    )
    print("With budget forcing:") # You will see that after the "Wait" in the reasoning trace it fixes its answer
    print(prompt + o[0].outputs[0].text)

</table>
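
The listings above need vLLM and a model to run. Here is a self-contained toy sketch of the same idea, under stated assumptions:
- tokens are plain strings
- `toy_generate` is a hypothetical stand-in for the LLM
- `END_THINK` is an assumed end-of-thinking delimiter
- `budget_forced_trace` is a name introduced here, not part of the paper's code

```python
from typing import Callable, List

END_THINK = "</think>"   # assumed end-of-thinking delimiter
WAIT = "Wait"            # string appended to prolong thinking

def budget_forced_trace(
    generate: Callable[[List[str], int], List[str]],
    prompt: List[str],
    min_tokens: int,
    max_tokens: int,
    max_extensions: int = 4,
) -> List[str]:
    """Return a reasoning trace of between min_tokens and max_tokens tokens."""
    trace: List[str] = []
    extensions = 0
    while True:
        remaining = max_tokens - len(trace)
        if remaining <= 0:
            break  # maximum budget reached
        chunk = generate(prompt + trace, remaining)
        stopped = bool(chunk) and chunk[-1] == END_THINK
        if stopped:
            chunk = chunk[:-1]  # suppress the model's attempt to stop thinking
        trace.extend(chunk)
        if len(trace) >= min_tokens or extensions >= max_extensions:
            break  # minimum budget met, or we gave up extending
        if stopped or not chunk:
            trace.append(WAIT)  # nudge the model to keep thinking
            extensions += 1
    # Enforce the maximum, then close the trace ourselves
    return trace[:max_tokens] + [END_THINK]

def toy_generate(context: List[str], max_new_tokens: int) -> List[str]:
    """Hypothetical LLM: emits three thought tokens, then tries to stop."""
    thought = ["step", "step", "step", END_THINK]
    return thought[:max_new_tokens]

long_trace = budget_forced_trace(toy_generate, ["Q"], min_tokens=10, max_tokens=12)
short_trace = budget_forced_trace(toy_generate, ["Q"], min_tokens=0, max_tokens=2)
print(long_trace)
print(short_trace)
```

In the first call, the toy model tries to stop after 3 tokens, so "Wait" is inserted until the minimum is met. In the second call, the trace is cut off at the maximum.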

## Smarter not longer

At first glance
- a longer reasoning trace should be preferred to a shorter trace
    - "deeper" reasoning

There is some empirical evidence that
- the first, preliminary response *can* often lead to a correct response
- **but** the initial reasoning trace is often prematurely abandoned
    - the model "under-thinks": it tries something else if the first approach continues for too long


Why does a potentially successful initial reasoning trace get abandoned by the Reasoner ?

Perhaps the cause lies in the training dataset
- hard problems in the training set have long but unsuccessful reasoning traces
- the model learns to abandon long traces

The problem is the inability to distinguish between
- long traces of hard problems that fail to be solved
- long traces that are needed for less-hard, solvable problems

### How to think smarter

One approach is
- train a model to estimate the difficulty of a given task
- have the Reasoner adapt its test-time compute based on the difficulty
    - more/longer thoughts for harder tasks

Alternatively
- use a *Process Reward Model*
    - train a Reward Model to estimate
        - whether each step in the thought is advancing toward a good response
    - continue a thought only if the estimated reward is high
    
The training dataset for a Process Reward Model
- is expensive to create
- requires human labeling of the steps and rewards
    - rewards are limited to categorical values (Positive/Negative/Neutral) rather than continuous values
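
Both ideas can be sketched in a few lines. This is an illustration only, with several assumptions:
- difficulty is a number in $[0, 1]$, mapped *linearly* to a token budget
- the Process Reward Model is any function returning a score in $[0, 1]$
- all names (`thinking_budget`, `extend_with_prm`, `toy_propose`, `toy_score`) are hypothetical

```python
from typing import Callable, List

def thinking_budget(difficulty: float, min_tokens: int = 256, max_tokens: int = 8192) -> int:
    """Map an estimated difficulty in [0, 1] to a thinking budget (assumed linear)."""
    d = min(max(difficulty, 0.0), 1.0)  # clamp to [0, 1]
    return int(round(min_tokens + d * (max_tokens - min_tokens)))

def extend_with_prm(
    propose_step: Callable[[List[str]], str],
    score_step: Callable[[List[str], str], float],
    max_steps: int,
    threshold: float = 0.5,
) -> List[str]:
    """Extend a chain of thoughts only while the estimated step reward is high."""
    steps: List[str] = []
    for _ in range(max_steps):
        candidate = propose_step(steps)
        if score_step(steps, candidate) < threshold:
            break  # the reward model judges this step unpromising: stop extending
        steps.append(candidate)
    return steps

# Toy stand-ins for the Reasoner and the Process Reward Model
def toy_propose(steps: List[str]) -> str:
    return f"step {len(steps) + 1}"

def toy_score(steps: List[str], candidate: str) -> float:
    return 0.9 if len(steps) < 3 else 0.2  # first three steps look promising

print(thinking_budget(0.5))
print(extend_with_prm(toy_propose, toy_score, max_steps=10))
```

In practice, a thought that is cut off would be followed by a revision, rather than by simply ending the trace.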

# Training an LLM to reason

Basic Chain of Thought reasoning seems to be a property of models
- that emerges once models grow in size
- without explicit pre-training

Still, models need to be encouraged to employ Chain of Thought
- by appending "Let's think step by step" to the prompt
    - zero-shot prompting: no exemplars demonstrating the chain of thought

The exact phrase (e.g., "Let's think step by step") that best elicits this behavior
is the subject of Prompt Engineering
- the module on [Automated Prompt Engineer](Prompt_Engineering_APE.ipynb#Zero-shot:-Improving-on-%22Let's-think-step-by-step%22)
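
A minimal sketch of the difference between a direct prompt and a zero-shot Chain of Thought prompt. The `Q:`/`A:` layout and the function names are assumptions made for illustration; any prompt format could be substituted.

```python
COT_TRIGGER = "Let's think step by step."

def direct_prompt(question: str) -> str:
    """Prompt that asks for the response immediately."""
    return f"Q: {question}\nA:"

def zero_shot_cot_prompt(question: str, trigger: str = COT_TRIGGER) -> str:
    """Same prompt, with the trigger phrase appended to elicit a reasoning trace."""
    return f"Q: {question}\nA: {trigger}"

question = "A juggler has 16 balls. Half are golf balls. How many golf balls are there?"
print(direct_prompt(question))
print()
print(zero_shot_cot_prompt(question))
```

The `trigger` argument is the part that Prompt Engineering would tune.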




In order to produce an LLM that "reasons"
better than basic Chain of Thought
- e.g., reflecting and revising
    
we need to instill this behavior by fine-tuning a pre-trained model.

Given prompt $\x$
- we fine-tune a model to produce responses
$$\rat, \y$$
- rather than the direct responses
$$\y$$

by fine-tuning it on examples of the form

$$
\x, \rat, \y
$$
- rather than
$$
\x,  \y
$$

The cost of producing examples with reasoning traces 
$$
\x, \rat, \y
$$
can be substantial.

Often, it involves asking a human to adapt a training set with examples
$$
\x, \y
$$
by creating the reasoning traces $\rat$

## Reducing the cost of creating examples with reasoning traces

Removing the human from the training data production process is
highly desirable (cost reduction).

Boot-strapping a training dataset from an existing model is one solution.

One method of boot-strapping:

Ask a strong, non-reasoning LLM $\model^\text{non-reasoner}$ to solve task
- given $\x^\ip$
- "think step by step" (create $\rat^\ip$)
- produce response $\y^\ip$

**Note**
that Chain of Thought ("think step by step") reasoning
- seems to emerge from the natural training examples used to train $\model^\text{non-reasoner}$ 
- does *not* require explicit training examples with reasoning traces
- hence, the boot-strapping process is well grounded
    - non-circular: the first model is trained without reasoning traces

Alternatively:

Ask an existing reasoning model $\model^\text{reasoner}$ to solve task
- given $\x^\ip$
- produce response $\rat^\ip, \y^\ip$

This effectively bootstraps from existing reasoner $\model^\text{reasoner}$
- base case: *someone* needs to (manually) create the dataset to train the first reasoner

Both of these approaches create a machine-generated example
    $$\langle \x^\ip, \rat^\ip, \y^\ip \rangle$$
that can be used to train a new, reasoning LLM
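
A sketch of the boot-strapping step, assuming the existing model is wrapped in a function `solve` that returns the pair (reasoning trace, response). The names `Example`, `bootstrap_dataset`, `to_training_text` and `toy_solve` are hypothetical. The toy "model" just adds two integers.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

@dataclass
class Example:
    x: str  # prompt
    r: str  # reasoning trace
    y: str  # response

def bootstrap_dataset(
    solve: Callable[[str], Tuple[str, str]],
    prompts: Iterable[str],
) -> List[Example]:
    """Ask an existing model to produce (r, y) for each prompt x."""
    return [Example(x, *solve(x)) for x in prompts]

def to_training_text(ex: Example) -> str:
    """Lay out one fine-tuning example as: x, <think> r </think>, y."""
    return f"{ex.x}\n<think>\n{ex.r}\n</think>\n{ex.y}"

def toy_solve(prompt: str) -> Tuple[str, str]:
    """Hypothetical stand-in for the existing model: adds two integers."""
    a, b = (int(t) for t in prompt.split("+"))
    return f"Add {a} and {b}.", str(a + b)

dataset = bootstrap_dataset(toy_solve, ["2+3", "10+5"])
print(to_training_text(dataset[0]))
```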

### Self-improvement

We can use this boot-strapping process iteratively to
create a sequence of reasoners with increasing strength.

See [LLM Self-Improvement](LLM_Self_Improvement.ipynb) for the technique.



The iterative process of Self-Improvement
- trains the target model $\model$ in stages
    - creating a sequence of fine-tuned models $\model_{(0)}, \model_{(1)}, \dots$
    - of reasoners of increasing power
- base case
    - fine-tune non-reasoning initial $\model_{(0)}$
    - use weak reasoning examples
        - from Chain of Thought or a weak reasoning model
    - resulting in weak reasoner $\model_{(1)}$
- inductive case
    - create improved $\model_{(\tt+1)}$
    - by using reasoning traces created by $\model_\tp$
    - for fine-tuning
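
The control flow of this iteration can be sketched as follows. The arguments `fine_tune` and `generate_examples` are hypothetical placeholders for the real training and generation procedures. Whether each stage fine-tunes the previous model or restarts from $\model_{(0)}$ is a design choice; this sketch assumes the former.

```python
from typing import Any, Callable, List, Sequence

def self_improve(
    model_0: Any,
    seed_examples: Sequence[Any],
    prompts: Sequence[str],
    fine_tune: Callable[[Any, Sequence[Any]], Any],
    generate_examples: Callable[[Any, Sequence[str]], Sequence[Any]],
    num_rounds: int,
) -> List[Any]:
    """Return the sequence of models [M_(0), M_(1), ...]."""
    models = [model_0]
    # Base case: fine-tune the non-reasoner on weak reasoning examples
    model = fine_tune(model_0, seed_examples)
    models.append(model)
    # Inductive case: M_(t+1) is trained on traces created by M_(t)
    for _ in range(num_rounds):
        examples = generate_examples(model, prompts)
        model = fine_tune(model, examples)
        models.append(model)
    return models

# Toy stand-ins: a "model" is just its stage number
def toy_fine_tune(model: int, examples: Sequence[Any]) -> int:
    return model + 1

def toy_generate_examples(model: int, prompts: Sequence[str]) -> List[tuple]:
    return [(x, f"trace from M_{model}", "y") for x in prompts]

models = self_improve(0, [("x", "weak trace", "y")], ["p1", "p2"],
                      toy_fine_tune, toy_generate_examples, num_rounds=2)
print(models)
```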

# Distillation

Using an existing reasoning model $\model^\text{reasoner}$
- to instill reasoning (by fine-tuning) in a non-reasoning model 
- with examples of reasoning traces created by $\model^\text{reasoner}$
- as we did above

is an illustration of the technique 
 called *Distillation*

Given a model $\model^{\text{student}}$ that has *not yet been fine-tuned* for the Target task T
- we adapt $\model^{\text{student}}$ into a model that solves the target task T
- by fine-tuning it on examples $\langle \dot\X, \dot\y \rangle $ of the target task's input/output relationship
    - creating a feature vector $\dot\x^\ip$ for the Target task
    - using $\model^\text{teacher}$ 
    to create the response $\dot\y^\ip$
    - for $1 \le i \le m$


Note that 
- $\model^\text{student}$ does not directly learn the task
- it only learns to mimic $\model^\text{teacher}$
- inheriting all the flaws and limitations of the teacher
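
A toy illustration of this point. The teacher here is a hypothetical function that doubles its input but contains one deliberate flaw. The "student" is a lookup table fit to the teacher's outputs, which stands in for fine-tuning.

```python
from typing import Callable, Dict, Iterable, List, Tuple

def toy_teacher(x: int) -> int:
    """Hypothetical teacher: doubles x, but is wrong for x == 7."""
    return 15 if x == 7 else 2 * x

def distill(
    teacher: Callable[[int], int],
    inputs: Iterable[int],
) -> Tuple[Callable[[int], int], List[Tuple[int, int]]]:
    """Build examples (x, teacher(x)) and 'train' a student to mimic them."""
    dataset = [(x, teacher(x)) for x in inputs]
    table: Dict[int, int] = dict(dataset)  # stand-in for fine-tuning

    def student(x: int) -> int:
        return table[x]

    return student, dataset

student, dataset = distill(toy_teacher, range(10))
print(student(3))  # matches the true task: 2 * 3
print(student(7))  # reproduces the teacher's mistake, not the true answer 14
```

The student never sees the true input/output relationship of the task, only the teacher's responses.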

## Reducing the cost of running a reasoning model

Often
- $\model^\text{teacher}$ is a powerful model (large number of parameters, lots of training)
- $\model^{\text{student}}$ is smaller and untrained in the Target task

Using Distillation to transfer skills from a large to a smaller model
- results in very powerful small $\model^{\text{student}}$
- that may not have had the capability to be *directly trained* in the skill
- using only the training data used to train $\model^\text{teacher}$ in the skill
    - Scaling Law: a model with fewer parameters needs more data to train than a model with more parameters



DeepSeek-R1 is a reasoning model with 671 B parameters.
    
DeepSeek-R1 was used to create smaller distilled models
- from small models
    - of sizes 1.5B, 7B, 8B, 14B, 32B, 70B
    - from the Llama and Qwen families of models

**FYI**

DeepSeek-R1 
- derived from non-reasoning model DeepSeek-V3-Base of the same size
- using a mixture of Reinforcement Learning (RL) and Supervised Fine Tuning (SFT)
- Note
    - both models are Mixture of Experts (MoE)
    - only 37 B parameters are "activated" during inference


In [4]:
print("Done")

Done
