Reasoning Techniques for Large Language Models
===

**Dr Chao Shu (chao.shu@qmul.ac.uk)**

In [None]:
import os
import argparse
import logging
import numpy as np
from typing import List, Dict, Any, Tuple
from collections import Counter
from dotenv import load_dotenv
from tqdm import tqdm
import json
from datetime import datetime
from IPython.display import Markdown, display

# Updated imports for LCEL
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate, FewShotPromptTemplate
from langchain_openai import OpenAI, ChatOpenAI
from langchain_ollama import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

## Introduction to LLM Reasoning
---

Large Language Models (LLMs) like GPT-4, LLaMA, and others have demonstrated impressive capabilities across various tasks. However, their ability to reason through complex problems isn't inherently straightforward.

**What is reasoning in LLMs?**
- The ability to process information logically
- Breaking down complex problems into steps
- Making inferences based on provided context
- Arriving at conclusions through structured thinking

**Why is reasoning important?**
- Enables solving complex problems that require multi-step thinking
- Improves transparency of model decision-making
- Enhances reliability and reduces hallucinations
- Makes LLMs more useful for specialized domains (mathematics, programming, etc.)

### 🧑‍🏫 Demo: Responses with and without Reasoning

Let's see the difference in responses from a non-reasoning (instruction) model and a reasoning model. Both models are small-scale open-weight model running locally.

🤖 Example Prompts:

1. "Answer the questions briefly and directly with just a few words. 

Richard lives in an apartment building with 15 floors. Each floor contains 8 units, and 3/4 of the building is occupied. What's the total number of unoccupied units In the building?" (*from [GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k)*)

2. "how many 'r's in 'strawberry'?"

3. "9.7 and 9.11, which number is bigger?"

🧠 Critical Thinking: What caused the differences? How the reasoning capability is developed in reasoning models?

Let's use LangChain to demo the Q&A in a programmatical way using simplest codes.

Non-reasoning model response:

In [4]:
input = """Answer the questions briefly and directly with just a few words. 
Richard lives in an apartment building with 15 floors. Each floor contains 8 units, and 3/4 of the building is occupied. What's the total number of unoccupied units In the building?
"""

# Initialize the Ollama LLM
llm_qwen = ChatOllama(model="qwen2:1.5b", temperature=0.7)

output = llm_qwen.invoke(input)
output

AIMessage(content='21 - 3/4 * 15 = 9', additional_kwargs={}, response_metadata={'model': 'qwen2:1.5b', 'created_at': '2025-03-25T09:04:49.9234888Z', 'done': True, 'done_reason': 'stop', 'total_duration': 4924381300, 'load_duration': 2985276200, 'prompt_eval_count': 64, 'prompt_eval_duration': 1412249100, 'eval_count': 15, 'eval_duration': 524833100, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-ab70aa5c-4299-4be4-8334-d3355b477d4b-0', usage_metadata={'input_tokens': 64, 'output_tokens': 15, 'total_tokens': 79})

Reasoning model response:

In [14]:
# Initialize the Ollama LLM
llm_deepseek = ChatOllama(model="deepseek-r1:1.5b", temperature=0.7)

output = llm_deepseek.invoke(input)
print(output.content)

<think>
Okay, so I have this problem here about Richard living in an apartment building. Let me try to figure out how to solve it step by step. First, the question says that Richard lives on a building with 15 floors. Each floor has 8 units. Out of all these units, 3/4 of the building is occupied. I need to find the total number of unoccupied units.

Hmm, let me break this down. The building has 15 floors, each with 8 units. So first, maybe I should calculate the total number of units in the building. That seems straightforward—just multiply the number of floors by the number of units per floor. So that would be 15 multiplied by 8.

Let me write that out: 15 * 8 = ?

Hmm, 15 times 8... 10*8 is 80, and 5*8 is 40, so adding those together gives 120. So the total number of units in the building is 120.

Now, out of these 120 units, 3/4 are occupied. To find out how many are unoccupied, I think I need to calculate what's left after accounting for the occupied ones. Since 3/4 are occupied, 

Now, let's demonstrate how to use LangChain Expression Language (LCEL) to build a flexible pipeline for the Q&A. This time let's try Llama 3.2 3B model.

In [7]:
# Create a prompt template from the template string
template = "Answer the questions briefly and directly with just a few words. \n\n{question}"
prompt = PromptTemplate(template=template, input_variables=["question"])

question = "Richard lives in an apartment building with 15 floors. Each floor contains 8 units, and 3/4 of the building is occupied. What's the total number of unoccupied units In the building?"

# Initialize the Ollama LLM
llm_llama32 = ChatOllama(model="llama3.2:3b", temperature=0.7)

# Create the LLM chain
llm_chain = prompt | llm_llama32 

response = llm_chain.invoke({"question": question})
response

AIMessage(content='To find the total number of unoccupied units, we need to first calculate the total number of units in the building.\n\nThe total number of units = Number of floors x Number of units per floor\n= 15 x 8\n= 120 units\n\nSince 3/4 of the building is occupied, 1/4 of the building remains unoccupied. \n\nTo find the number of unoccupied units:\n= Total units x Unoccupied fraction\n= 120 x 1/4\n= 30 units', additional_kwargs={}, response_metadata={'model': 'llama3.2:3b', 'created_at': '2025-03-25T09:15:33.7981876Z', 'done': True, 'done_reason': 'stop', 'total_duration': 13676134400, 'load_duration': 3988633900, 'prompt_eval_count': 80, 'prompt_eval_duration': 2736203100, 'eval_count': 106, 'eval_duration': 6918341300, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-a8be4049-a89c-45f6-a430-c7cb57a7701c-0', usage_metadata={'input_tokens': 80, 'output_tokens': 106, 'total_tokens': 186})

Alternatively, we can simply extract the text part in the response by adding one more stage to the chain.

In [13]:
# Create the LLM chain with the text content extractor
llm_chain = prompt | llm_llama32 | StrOutputParser()

response = llm_chain.invoke({"question": question})
response

ResponseError: model "llama3.2:3b" not found, try pulling it first (status code: 404)

If you want to display the response text in Markdown format in a pretty way.

In [12]:
display(Markdown(response))

To find the total number of unoccupied units, we need to know how many units are occupied and subtract that from the total.

First, find the total number of units in the building: 15 floors x 8 units per floor = 120 units.

Next, calculate the number of occupied units: (3/4) x 120 units = 90 units.

Finally, subtract the occupied units from the total to get the unoccupied units: 120 - 90 = 30 units.

So, there are 30 unoccupied units in the building.

Now, you know the basics to build AI apps using LangChain. 🚀

### Spectrum of Reasoning Capabilities

Modern LLMs demonstrate a range of reasoning abilities across different domains:

- **Arithmetic reasoning**: Solving mathematical problems with calculations
- **Math Problem-Solving**: Solving challenging math problems
- **Scientific Reasoning**: Making inferences based on scientific domain knowledge
- **Commonsense reasoning**: Making inferences based on everyday knowledge
- **Lognical Reasoning**: Drawing valid conclusions from premises
- **Visual Reasoning**: Making inferences based on everyday knowledge
- **etc.**

1. Arithmetic Reasoning: Solving mathematical problems with calculations

- Sample: "Q: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? A: 72"
- Example Dataset: [GSM8K (Grade School Math 8K)](https://huggingface.co/datasets/openai/gsm8k)

2. Math Problem-Solving: Solving challenging math problems
- Sample: "Q: If $f(x) = \frac{3x-2}{x-2}$, what is the value of $f(-2) +f(-1)+f(0)$? Express your answer as a common fraction.  A: \frac{14}{3}"
- Example Dataset: [MATH 500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500)
- Sample: "Q: There exist real numbers $x$ and $y$, both greater than 1, such that $\log_x\left(y^x\right)=\log_y\left(x^{4y}\right)=10$. Find $xy$.  A: 25"
- Example Dataset: [AIME (American Invitational Mathematics Examination) 2024](https://huggingface.co/datasets/Maxwell-Jia/AIME_2024)

3. Scientific Reasoning: Making inferences based on scientific domain knowledge
- Description: GPQA (Graduate-Level Google-Proof Q&A) is a multiple-choice, Q&A dataset of very hard questions written and validated by experts in biology, physics, and chemistry. When attempting questions out of their own domain (e.g., a physicist answers a chemistry question), these experts get only 34% accuracy, despite spending >30m with full access to Google.
- Example Dataset: [GPQA](https://huggingface.co/datasets/Idavidrein/gpqa)

4. Commonsense Reasoning: Making inferences based on everyday knowledge
- Sample: "Q: The fox walked from the city into the forest, what was it looking for? Answer Choices: (a) pretty flowers (b) hen house (c) natural habitat (d) storybook  A: (b)"
- Example Dataset: [CommonsenseQA](https://www.tau-nlp.sites.tau.ac.il/commonsenseqa), [HellaSwag](https://github.com/rowanz/hellaswag)

5. Logical Reasoning: Drawing valid conclusions from premises
- Sample: 

<div style="text-align:center;">
    <img src="imgs/L02_HotpotQA_Sample.png" alt="HotpotQA Sample" style="width:50%; height:auto;">
</div>

- Example Dataset: [HotpotQA](https://hotpotqa.github.io/)

6. Visual Reasoning: Questions require an understanding of vision, language and commonsense knowledge to answer.
- Sample: <img src="imgs/L02_VQA_BigBang.png" alt="VQA Sample" style="width:50%; height:auto;">

   - "Q: What time of day is it?  A: Night"
- Example Dataset: [VQA v2 (Visual Question Answering)](https://visualqa.org/)

### SOTA Reasoning Models and Their Approaches to Enhance Reasoning Capabilities

Recent breakthroughs in artificial intelligence have led to the development of several language models with unprecedented reasoning capabilities, showcasing how innovative training methodologies and architectural choices can significantly improve a model’s ability to tackle complex problems.

Reasoning models, such as OpenAI o1/o3 series and DeepSeek R1, are designed to tackle complex problem-solving tasks, particularly in domains like science, coding, and mathematics. These models go beyond traditional language generation by incorporating structured reasoning processes, where the model breaks down problems into sequential steps before arriving at a final answer.

Below is a list of several prominent “reasoning”‐focused large language models (LLMs). Note that many of these models emerged during late 2024 into early 2025 as part of an industry‐wide push toward enhanced reasoning capabilities.

| **Model**                      | **Release Date**         | **Company/Organisation**          |
|--------------------------------|--------------------------|-----------------------------------|
| o1‑preview                     | September 2024           | OpenAI                            |
| o1                             | December 2024            | OpenAI                            |
| o3‑mini                        | January 2025             | OpenAI                            |
| o3‑mini‑high                   | January 2025             | OpenAI                            |
| Claude 3.7                     | February 2025            | Anthropic                         |
| DeepSeek‑R1                    | January 2025             | DeepSeek     |
| Doubao‑1.5‑pro                 | January 2025             | ByteDance          |
| Kimi k1.5                      | January 2025*            | Moonshot AI                       |
| QwQ‑32B‑Preview                | December 2024*           | Alibaba Cloud                     |
| Grok 3                         | February 2025*           | xAI          |
| Gemini 2.0 Flash Thinking Exp. | February 2025*           | Google DeepMind                   |

\* Approximate dates based on available reports and media coverage.

> 💬 **Discussion:** 
> 
> What are the core techniques used to enhance the reasoning capability in the state-of-the-art reasoning models, i.e., OpenAI o1/o3, DeepSeek R1? Extract the core techniques from AI summary.
>
> 🤖 Reference prompt: *"Summarise the core techniques used to enhance the reasoning capability in the state-of-the-art reasoning models, i.e., OpenAI o1/o3, DeepSeek R1, respectively"*
>
> 🧠 Critical Thinking: Is the information verifiable and accurate?

#### OpenAI o1/o3

*Put the GenAI answer and your notes here*

#### DeepSeek R1

*Put the GenAI answer and your notes here*

Now, you know the techniques behind the SOTA reasoning models 🚀. Let's delve deeper into the details.

## Prerequisite: A Brief Introduction to Prompt Engineering
---

### Prompt Engineering

Prompt engineering is the practice of crafting effective inputs to guide AI language models toward producing desired outputs. It involves strategically designing questions, instructions, and context to elicit accurate, relevant, and useful responses.

Key aspects include:

- Structuring prompts with clear instructions and constraints
- Using specific formatting techniques
- Providing examples to demonstrate the expected response format
- Breaking complex tasks into sequential steps
- Including relevant context to improve understanding
- Setting appropriate tone, style, and level of detail

Effective prompt engineering can significantly enhance AI performance across various applications, from content creation and data analysis to problem-solving and creative work. As AI systems evolve, prompt engineering continues to develop as both an art and a science.

### Zero-shot Prompting

Zero-shot prompting refers to asking an LLM to perform a task without providing specific examples of that task. The model relies solely on its pre-training knowledge.

**Key characteristics:**
- No examples provided in the prompt
- Requires clear instructions
- Performance varies greatly with prompt phrasing

**Example:**

Prompt:
```Text
What is the capital of France?
```

Output:
```Text
Paris
```

### Few-shot Prompting

Few-shot prompting involves providing the model with a few examples of the task before asking it to perform a similar task. This helps the model understand the expected format and reasoning style.

**Key characteristics:**
- Includes examples within the prompt
- Helps align model output to desired format
- Can improve performance on complex tasks
- Examples should be representative and diverse


**Example:** ([Brown et al. 2020](https://arxiv.org/abs/2005.14165))

Prompt:
```Text
A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses
the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus

To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:
```

Output:
```Text
One day when I was playing tag with my little sister, she got really excited and she
started doing these crazy farduddles.
```

## Chain-of-Thought (CoT) <a id="cot"></a>
---

Chain-of-Thought (CoT) was introduced by [Wei et al. (2022)](https://arxiv.org/abs/2201.11903) as a prompting technique to enhance the reasoning capabilities of Large Language Models (LLMs), especially in multi-step reasoning tasks.

In contrast to the standard prompting, where models are asked to directly produce the final answer, 'Chain of Thought Prompting' encourages LLMs to break down complex problems into intermediate reasoning steps before arriving at the final answer. By doing this, the model-generated 'chain of thought' can mimic an intuitive human thought process when working through multi-step problems.

**Key benefits:**
- Significantly improves performance on reasoning tasks
- Provides transparency into the model's reasoning process
- Reduces hallucination by making each step explicit

### Few-Shot CoT

![CoT Prompting](imgs/L02_CoT_Prompt.png)

Examples of CoT in different reasoning tasks [(Wei et al., 2022)](https://arxiv.org/abs/2201.11903):

![CoT Example Tasks](imgs/L02_CoT_ExampleTasks.png)

### Zero-Shot CoT

[Kojima et al. (2022)](https://arxiv.org/abs/2205.11916) introduce a simplified approach by appending the words "Let's think step by step." to the end of a question. This simple prompt helps the LLM to generate a chain of thought that answers the question, from which the LLM can then extract a more accurate answer.

![Zero-Shot CoT](imgs/L02_ZeroShotCot.png)

### 🧑‍🏫 Demo: Reasoning by CoT Prompting

#### Standard Prompt

Let's start with a standard prompt as the baseline.

In [15]:
# Define the question that will be reused for different prompts
question = "Four friends ordered four pizzas for a total of 64 dollars. If two of the pizzas cost 30 dollars, how much did each of the other two pizzas cost if they cost the same amount?"

In [17]:
# TODO: Create a prompt template for standard prompts
prompt_std = PromptTemplate(input_variables=["question"], 
                            template="""
Answer the questions briefly and directly with just a few words. 
Question: \n\n{question}
""")

prompt_std.invoke({"question": question}).text

'\nAnswer the questions briefly and directly with just a few words. \nQuestion: \n\nFour friends ordered four pizzas for a total of 64 dollars. If two of the pizzas cost 30 dollars, how much did each of the other two pizzas cost if they cost the same amount?\n'

In [18]:
# TODO: Initialise a chat model using the Ollama LLM
llm = ChatOllama(model="qwen2:1.5b", temperature=0.7)

# TODO: Create the LLM chain with the standard prompt template
chain_std = prompt_std | llm | StrOutputParser()

In [19]:
# TODO: Run the chain with the question to get the string output
response = chain_std.invoke({"question": question})
response

'25 dollars\n\nThe answer is: 25'

In [20]:
display(Markdown(response))

25 dollars

The answer is: 25

#### Zero-shot CoT


Zero-shot Chain-of-Thought involves explicitly asking the model to reason **step by step**, without providing examples of step-by-step reasoning. This is often triggered by adding phrases like `Let's think step by step` to prompts.

In this demo, let's try to use `ChatPromptTemplate` to create the prompt, so that you can use a system prompt to define the behaviour of the LLM.

In [25]:
# TODO: Create a zero-shot CoT prompt
prompt_zs_cot = ChatPromptTemplate.from_messages([
    ("system", "Answer the following question. Think step by step."),
    ("human", "Question: {question}")]
)

prompt_zs_cot.invoke({"question": question}).messages[1].content

'Question: Four friends ordered four pizzas for a total of 64 dollars. If two of the pizzas cost 30 dollars, how much did each of the other two pizzas cost if they cost the same amount?'

In [26]:
# TODO: Initialise a chat model using the Ollama LLM
llm = ChatOllama(model="qwen2:1.5b", temperature=0.7)

# TODO: Create the LLM chain with the zero-shot prompt template
chain_zs_cot = prompt_zs_cot | llm | StrOutputParser()

In [27]:
# TODO: Run the chain with the question to get the string output
response = chain_zs_cot.invoke({"question": question})
response

'To find out how much each of the other two pizzas cost if they cost the same amount, we first need to determine the cost of one pizza.\n\nWe know that two pizzas cost $30. Therefore, the total cost for four pizzas is:\n\n\\[2 \\text{ pizzas} = 30 \\text{ dollars}\\]\n\nTo find out how much each of the other two pizzas costs, we divide the total cost by the number of additional pizzas (two):\n\n\\[ \\frac{64 \\text{ dollars}}{4 - 2\\text{ pizzas}} = \\frac{64}{2} = 32 \\text{ dollars}\\]\n\nTherefore, each of the other two pizzas cost $16.'

In [28]:
display(Markdown(response))

To find out how much each of the other two pizzas cost if they cost the same amount, we first need to determine the cost of one pizza.

We know that two pizzas cost $30. Therefore, the total cost for four pizzas is:

\[2 \text{ pizzas} = 30 \text{ dollars}\]

To find out how much each of the other two pizzas costs, we divide the total cost by the number of additional pizzas (two):

\[ \frac{64 \text{ dollars}}{4 - 2\text{ pizzas}} = \frac{64}{2} = 32 \text{ dollars}\]

Therefore, each of the other two pizzas cost $16.

#### Few-shot CoT

Few-shot Chain-of-Thought combines the benefits of few-shot prompting and Chain-of-Thought reasoning. By providing examples that demonstrate step-by-step reasoning, which helps the model understand how to break down problems.

Let's firstly define the example Q&A that can be integrated in the few-shot prompt template. (6 shots selected from the Appendix G in [(Wei et al., 2022)](https://arxiv.org/abs/2201.11903))

In [31]:
example_qa = """
Question: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
Answer: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.

Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Answer: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

Question: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
Answer: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39.

Question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
Answer: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The answer is 8.

Question: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
Answer: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The answer is 9.
 
Question: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
Answer: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.
"""

In [33]:
# TODO: Create a few-shot CoT prompt
prompt_fs_cot = ChatPromptTemplate.from_messages([
    ("system", "A few examples will be provided for you to follow and understand how to think before answering the question."),
    ("human", """"
{example_qa}
     
Now, answer the following question
Question: {question}
     """),
])

prompt_fs_cot.invoke({"example_qa": example_qa, "question": question})

ChatPromptValue(messages=[SystemMessage(content='A few examples will be provided for you to follow and understand how to think before answering the question.', additional_kwargs={}, response_metadata={}), HumanMessage(content='"\n\nQuestion: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?\nAnswer: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.\n\nQuestion: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?\nAnswer: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.\n\nQuestion: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?\nAnswer: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating

In [1]:
pip install pandas

Collecting pandas
  Downloading pandas-2.2.3-cp312-cp312-win_amd64.whl.metadata (19 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2025.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.3-cp312-cp312-win_amd64.whl (11.5 MB)
   ---------------------------------------- 0.0/11.5 MB ? eta -:--:--
   ---------------------------------------- 0.0/11.5 MB ? eta -:--:--
    --------------------------------------- 0.3/11.5 MB ? eta -:--:--
    --------------------------------------- 0.3/11.5 MB ? eta -:--:--
   - -------------------------------------- 0.5/11.5 MB 985.5 kB/s eta 0:00:12
   --- ------------------------------------ 1.0/11.5 MB 1.2 MB/s eta 0:00:09
   ---- ----------------------------------- 1.3/11.5 MB 1.3 MB/s eta 0:00:09
   ------ --------------------------------- 1.8/11.5 MB 1.4 MB/s eta 0:00:07
   ------- -------------------------------- 2.1/11.5

In [None]:
import pandas as pd

examples = pd.read_csv("fs.csv")
examples

In [None]:
example_prompt = PromptTemplate.from_template("Question: {Question}\nAnswer: {Answer}")

prompt_template_fs = FewShotPromptTemplate(
    examples=examples, 
    example_prompt=example_prompt, 
    suffix="""
Now, answer the following question
Question: {input}
    """,
    input_variables=["input"]
)

prompt_template_fs.invoke({"input": question})

In [36]:
test_prompt = prompt_fs_cot.invoke({"example_qa": example_qa, "question": question})
display(Markdown(test_prompt.messages[1].content))

"

Question: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
Answer: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.

Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Answer: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

Question: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
Answer: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39.

Question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
Answer: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The answer is 8.

Question: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
Answer: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The answer is 9.

Question: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
Answer: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.


Now, answer the following question
Question: Four friends ordered four pizzas for a total of 64 dollars. If two of the pizzas cost 30 dollars, how much did each of the other two pizzas cost if they cost the same amount?
     

In [39]:
# TODO: Initialise a chat model using the Ollama LLM
llm = ChatOllama(model="qwen2:1.5b", temperature=0.0)

# TODO: Create the LLM chain with the few-shot prompt template
chain_fs_cot = prompt_fs_cot | llm | StrOutputParser()

In [40]:
# TODO: Run the chain with the question to get the string output
response = chain_fs_cot.invoke({"example_qa": example_qa, "question": question})
display(Markdown(response))

Let's denote the price of one pizza as \(x\). Since there are four pizzas and the total cost is $64, we can write:

\[4x = 64\]

Solving for \(x\), we find that each pizza costs $16. 

Now, let's consider two of these pizzas costing $30 in total. Let's denote the price of one of these pizzas as \(y\). So, the equation becomes:

\[2y + (x - y) = 30\]

Substituting \(x\) with $16 gives us:

\[2y + (16 - y) = 30\]
\[y + 16 = 30\]
\[y = 14\]

Therefore, each of the other two pizzas cost $14.

> 💡 Note:
> Through this demo, you should realise the benefit of using LangChain prompt templates.

> 💬 **Discussion:** 
> 
> Why the few-shot CoT prompting doesn't work in this example? Will the response change if we run multiple times? What if we change the temperature?
>
> 🧠 Critical Thinking: What are the limitations of CoT? How is the CoT technique incorporated in the reasoning models without requiring users to write CoT prompts?

### Limitations of Chain-of-Thought Prompting

While Chain-of-Thought techniques provide significant improvements in reasoning capabilities, they also come with several important limitations:

- **Reasoning Hallucinations**: 
  - LLMs may produce plausible-sounding but incorrect reasoning steps. These "hallucinated" steps can lead to wrong conclusions while appearing confident and logical.
  - Example: When solving a complex physics problem, the model might introduce physically impossible intermediate calculations that seem reasonable but violate fundamental laws.

- **Sensitivity to Prompt Wording**:
  - CoT performance is highly dependent on how the prompt is phrased. Small changes in wording can lead to different reasoning paths and outcomes.
  - Finding optimal prompts often requires extensive experimentation and may not generalize across different problem types.

- **Computational Overhead**:
  - CoT reasoning requires significantly more tokens than direct prompting, increasing:
    - API costs when using commercial LLMs
    - Latency in generating responses
    - Computational resources needed

- **Dependence on Example Quality in Few-shot CoT**
  - The performance of few-shot CoT heavily depends on:
    - The quality and relevance of chosen examples
    - The similarity between examples and target problems
    - The order in which examples are presented

- **Limited Self-correction**:
  - When a reasoning path leads to an error, LLMs often struggle to identify and correct the mistake, instead continuing with flawed reasoning.

## Self Consistency Chain of Thought
---

An improvement on CoT prompting called "Self Consistency" is proposed by [Wang et al. (2022)](https://arxiv.org/abs/2203.11171). Self-consistency aims "to replace the naive **greedy decoding** used in chain-of-thought prompting". This approach **samples** multiple, diverse reasoning paths based on few-shot CoT, then select the most consistent answer among all reasoning paths. The evaluation shows it "boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks".

![SC CoT](imgs/L02_CoT_SC.png)

> 🤖 Implementation with AI
> Based on the description of the Self-Consistency idea, think about how to implement it with the help of AI tools.

In [None]:
# TODO: Implement the SC-CoT algorithm using local open-weight/source LLMs and LCEL (preferablly parallel prompting)

An example of multi-path response (some parts are omitted due to the lengthy response):

{'path_1': "To find out how much each of the other two pizzas cost, let's break down what we know:\n\n1. Total cost for four pizzas: $64\n2. Cost for two pizzas: $30\n\nLet's denote the price of one pizza as \\(x\\).\n\nSo, the equation representing the total cost would be:\n\\[2x + 2x = 30\\]\nSimplifying this gives us:\n\\[4x = 30\\]\n\nTo find out how much each of the other two pizzas cost (\\(2x\\) since there are two), divide both sides by \\(2\\):\n\\[x = \\frac{30}{2}\\]\n\\[x = 15\\]\n\nSo, each of the other two pizzas costs $15.", 

'path_2': "To solve this problem, ... Therefore, each of the remaining two pizzas cost \\$17.", 

'path_3': "Let's break down the problem step-by-step.... Step 6: Present the final answer.\n- Each of the other two pizzas costs $17.\n\nTherefore, the answer is $17.", 

'path_4': ... Therefore, each of the other two pizzas cost $17.", 

'path_5': 'To find out how much each of the other two pizzas cost ... Therefore, each of the two remaining pizzas cost $1.33.'}


Since multiple LLM responses must be sampled, the computational cost and response latency will be higher than the typical Chain of Thought (CoT) approach.

> 💬 **Discussion:** 
> 
> 🧠 Critical Thinking: CoT technique can be embedded into a reasoning model by exposing a base model to CoT data in training/fine-tuning, which is called **train-time compute**. Is it reasonable to integrate SC-CoT into a model during training/fine-tuning? 

> 💡 Note:
> When we sample multiple reasoing path in SC-CoT, the model will use longer time to think, i.e., compute for longer time during the test time. So, SC-CoT can be considered as a basic technique for **test-time compute**.

## Evaluation and Benchmarking
---

### Benchmarks

![OpenAI o1 Benchmarks](./imgs/L01_O1_Benchmarks.png)

Source: [OpenAI, Learning to Reason with LLMs](https://openai.com/index/learning-to-reason-with-llms/)

![OpenAI o1 Benchmarks](./imgs/L01_DeepSeek_Benchmarks.png)

Source: DeepSeek R1 paper ([DeepSeek-AI, 2024](https://arxiv.org/abs/2501.12948))

Evaluating reasoning capabilities of LLMs requires specialized benchmarks and metrics. Here we discuss some common approaches:

**Common reasoning benchmarks:**
- [GSM8K](https://huggingface.co/datasets/openai/gsm8k): Grade School Math problems requiring multi-step reasoning
- [MATH 500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500): A subset of 500 problems from the MATH benchmark that OpenAI created in their Let's Verify Step by Step paper.
- [AIME 2024](https://huggingface.co/datasets/Maxwell-Jia/AIME_2024): This dataset contains problems from the American Invitational Mathematics Examination (AIME) 2024. AIME is a prestigious high school mathematics competition known for its challenging mathematical problems.
- [GPQA Diamond](https://huggingface.co/datasets/Idavidrein/gpqa): The GPQA Diamond subset is a higher-quality, more challenging subset of the main GPQA dataset. It contains 198 questions for which both domain expert annotators got the correct answers, but which the majority of non-domain experts answered incorrectly.

**Other benchmarks**
- [MMLU](https://huggingface.co/datasets/Stevross/mmlu): Massive Multitask Language Understanding is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability.
- [CodeForces](https://huggingface.co/datasets/open-r1/codeforces): CodeForces is one of the most popular websites among competitive programmers, hosting regular contests where participants must solve challenging algorithmic optimization problems. The challenging nature of these problems makes them an interesting dataset to improve and test models’ code reasoning capabilities.
- [SWE-Bench](https://www.swebench.com/): A benchmark for evaluating large language models’ (LLMs’) abilities to solve real-world software issues sourced from GitHub. The benchmark involves giving agents a code repository and issue description, and challenging them to generate a patch that resolves the problem described by the issue. 

### Evaluation Metrics

#### Pass@k

In LLM evaluations, pass@k is a metric used to assess a model's ability to generate correct solutions (e.g., code, answers) by considering multiple attempts.

> 💬 **Discussion:** 
> 
> Try to learn from GenAI what "pass@k" means in LLM evaluations and use resources to find out whether GenAI is 100% correct.
>
> 🤖 Reference prompt: *"what does "pass@k" mean in LLM evaluations?"*
>
> 🧠 Critical Thinking: Does the GenAI provide sources? Is the answer coherent? Is it reasonable when apply the definition to pass@1? Does it mentioned different ways to compute pass@k? Have you verified the responses from the GenAI you use?

*Put GenAI answers and your notes here*

The pass@k metric measures the probability, computed over a set of problems, that at least one of the top $k$ generated outputs for each problem contains the correct solution.

For example, a Pass@1 of 30% and a Pass@10 of 60% would mean that the model has a 30% chance of solving the problem on the first try, but a 60% chance of finding a correct solution if allowed to generate 10 different attempts.

- [Kulal et al. (2019)](https://proceedings.neurips.cc/paper/2019/file/7298332f04ac004a0ca44cc69ecf6f6b-Paper.pdf) evaluate functional correctness using the pass@k metric, where k samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported.
- In practice, computing pass@k in this way can have high variance. For example, if we compute pass@1 from a single completion per problem, we can get significantly different values from repeated evaluations due to sampling.

- [OpenAI (2021)](https://arxiv.org/abs/2107.03374) introduced an unbiased estimator that accounts for the total number of generated samples $n$, the number of correct samples $c$, and the desired $k$ value. To evaluate pass@k,
  - generate $n \geq k$ samples per problem/task
  - count the number of correct samples $c \leq n$
  - calculate the unbiased estimator:

$$
\text{pass@k} = \mathbb{E}_{\text{problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]
$$

- When calculate pass@1, [DeepSeek-AI, (2024)](https://arxiv.org/abs/2501.12948) uses a sampling temperature of 0.6 and a top-p value of 0.95 to generate $k$ responses (typically between 4 and 64, depending on the test set size) for each question and calcuates the pass@1 metric for their DeepSeek R1 model as:

$$
\text{pass@k} = \frac{1}{k} \sum_{i=1}^k p_i
$$

where $p_i$ denotes the correctness of the $i$-th response.

- For OpenAI o1 benchmark, they calculate pass@1 in the same way, as indicated by [OpenAI, (2024)]():
> All models are given 5 tries to generate a candidate patch. We compute pass@1 by averaging the per-instance pass rates of all samples that generated a valid (i.e., non-empty) patch.

**Example**
Suppose one coding problem is evaluated with 100 samples ($n$=100), and 3 are correct ($c$=3):

pass@1: 3 / 100 = 3% (probability a single random guess is correct).

pass@3: $1 − \frac{\binom{100-3}{3}}{\binom{100}{3}} \approx 8.8\%$

pass@100: 100% (if at least one correct answer exists in 100 samples).

#### cons@x

Consensus or majority vote, denoted as cons@x or maj@x, measures whether the most frequent answer (majority vote) among x generated responses is correct.

#### Limitations

- Computational Cost: Generating many samples (e.g., n=100 n=100) is resource-intensive.
- Domain-Specific: Works best for tasks with clear correctness criteria (e.g., code, math).

## Practical Applications

Reasoning techniques for LLMs have numerous practical applications across various domains:

**Educational applications:**
- Step-by-step math problem solving
- Scientific reasoning and explanation
- Tutoring with transparent reasoning

**Business and finance:**
- Financial analysis and planning
- Risk assessment
- Decision support systems

**Healthcare:**
- Diagnostic reasoning assistance
- Treatment plan evaluation
- Medical literature analysis

**Software development:**
- Code generation with explanation
- Debugging assistance
- Algorithm design

## Conclusion
---

In this notebook, we explored various reasoning techniques for LLMs,from Chain-of-Thought to Self Consistency to enhance problem-solving capabilities. Additionally, we cover evaluation metrics such as pass@k and cons@x, which are crucial for assessing the performance of LLMs.

**Key Takeaways**
1. **Reasoning Capabilities**: Modern LLMs demonstrate a range of reasoning abilities, including arithmetic, scientific, commonsense, logical, and visual reasoning.
2. **Chain-of-Thought (CoT)**: CoT prompting significantly improves the reasoning performance of LLMs by breaking down complex problems into intermediate steps.
3. **Self Consistency CoT**: Enhances CoT by sampling multiple reasoning paths and selecting the most consistent answer, further improving accuracy.
4. **Evaluation Metrics**: Metrics like pass@k and cons@x are essential for evaluating the reasoning capabilities of LLMs.
5. **Practical Applications**: Reasoning techniques have wide-ranging applications in education, business, healthcare, and software development, enhancing the utility and reliability of LLMs.

Reasoning techniques continue to evolve rapidly, enabling LLMs to tackle increasingly complex problems with greater reliability. By understanding and implementing these techniques, you can significantly enhance the capabilities of LLM applications.

## Bonus Scene
---

In the project diretory, run the `eight_puzzle.py` to play the eight puzzle game.

```python
python eight_puzzle.py
```

Can you solve the preset puzzle? How many steps it takes you to solve the puzzle?

*Put your notes here*

> 💬 **Discussion:** 
> 
> Try to prompt the best reasoning LLMs (those you have access to) to solve the puzzle.
>
> 🤖 Reference prompt: 
> ```text
> please solve the 8-puzzle below, where 0 represents the blank tile. Please provide your thoughts about how to solve the problem and the solution step-by-step
> [4, 1, 3],
> [0, 8, 5],
> [2, 7, 6]
> ```
>
> 🧠 Critical Thinking: Can the smartest GenAI solve the puzzle only by the CoT reasoning? Can the GenAI models or you find the fewest steps to solve the puzzle only the the CoT reasoning? Check out the thougths generated by the reasoning model, do they mention any method to solve the puzzle?

DeepSeek thought for over 10 min! Unfortunately, it didn't solve the puzzle.

![DeepSeek Thinking for 8-puzzle](./imgs/L01_BonusScene_DeepSeekThinking.png)