# Project 4: **Build a Deep Research System**
Welcome to project 4! For this project, we shift our focus from tool use and agents to *reasoning* models. You will practice state-of-the-art inference-time scaling methods such as *Chain-of-Thought* prompting and *Tree-of-Thoughts*, and briefly explore, at a high level, how reasoning models are trained with techniques like **STaR**.


Finally, you will put everything together to build a *deep research agent* that can browse the web, reason over what it finds, and give structured answers.

## Learning Objectives  
* Apply common inference-time scaling methods: **zero-shot / few-shot CoT, self-consistency, sequential revision, tree-of-thoughts**  
* Gain intuition for **training** reasoning-capable models following the **STaR** approach  
* Build a minimal **deep-research agent** that combines step-by-step reasoning with live web search  
* Practice extending deep research to a multi-agent system

## Roadmap  
1. Environment setup  
2. Inference-time scaling  
   2.1 Few-shot CoT  
   2.2 Zero-shot CoT  
   2.3 Self-consistency  
   2.4 Sequential revision  
   2.5 Tree-of-Thoughts
3. STaR for training models for reasoning  
4. Deep-research agent  
5. (Optional) Multi-agent deep-research

# 1- Environment setup

## 1.1 Conda environment

Before we start coding, you need a reproducible setup. Open a terminal in the same directory as this notebook and run:

```bash
# Create and activate the conda environment
conda env create -f environment.yaml && conda activate deep_research

# Register this environment as a Jupyter kernel
python -m ipykernel install --user --name=deep_research --display-name "deep_research"
```
Once this is done, you can select "deep_research" from the Kernel → Change Kernel menu in Jupyter or VS Code.

## 1.2 Ollama setup

In this project we use `llama3.2:3b` (the non-reasoning base model for section 2) and `deepseek-r1:8b` (the reasoning model for the agent in section 4). You can also try other small models such as `qwen2.5:3b-instruct` or `phi4-mini` to compare performance. Explore available models here: https://ollama.com/library.

```bash
ollama pull llama3.2:3b
ollama pull deepseek-r1:8b
# Additional small models to compare
# ollama pull qwen2.5:3b-instruct
# ollama pull phi4-mini

```

`ollama pull` downloads the model so you can run it locally without API calls.

---  
# 2- Inference-time scaling

Inference-time scaling refers to techniques that make an existing model reason better without retraining it. Instead of changing the model's weights, we improve reasoning by adjusting how we prompt the model and how we sample or aggregate its outputs.

In this section, we'll explore several inference-time strategies that improve reasoning quality using a non-reasoning base model. You will experiment with and compare methods such as:

- Few-shot Chain-of-Thought (CoT)
- Zero-shot CoT
- Self-consistency
- Sequential revision
- Tree-of-Thoughts (ToT)

### 2.1: Few-Shot CoT
Few-shot prompting helps a model reason by showing one or more worked examples before asking a new question. By observing the pattern of reasoning and final answers, the model learns how to structure its own reasoning process on the new input.

In this exercise, you will create a prompt that includes a few example Q&A pairs demonstrating step-by-step reasoning. Then, you will feed a new question and see the model's output.
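As a minimal sketch of the prompt structure described above, the helper below assembles example Q&A pairs and a new question into one string. `build_few_shot_prompt` is a hypothetical helper introduced here for illustration; it is not part of the notebook, and it makes no model call.

```python
# Hypothetical helper (an assumption, not part of the notebook): assemble a
# few-shot CoT prompt from (question, worked answer) pairs.
def build_few_shot_prompt(examples, new_question):
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    # End with an open "A:" so the model continues with its own reasoning.
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)


examples = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
        "5 + 6 = 11. The answer is 11.",
    ),
    (
        "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, "
        "how many apples do they have?",
        "They had 23 apples originally. They used 20 apples. So they had 23 - 20 = 3 "
        "apples left. They bought 6 more, so 3 + 6 = 9. The answer is 9.",
    ),
]

prompt = build_few_shot_prompt(
    examples,
    "A parking lot had 12 cars. 5 more cars arrived and then 3 cars left. "
    "How many cars are in the parking lot now?",
)
print(prompt)
```

Keeping the examples as data rather than one long string makes it easy to add, remove, or reorder demonstrations when comparing prompts.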

In [5]:
# Step 1: Write a few examples showing reasoning steps
# Step 2: Write your new question
# Step 3: Concatenate examples + new question into a single prompt
# Step 4: Call your Ollama or OpenAI client to get a response from llama3.2:3b # e.g., client.chat.completions.create(...)
# Step 5: Print the final answer

from openai import OpenAI

# initialize the Ollama client (OpenAI-compatible endpoint)
client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")

# Few-shot examples with step-by-step reasoning
few_shot_examples = """
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls.
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The
answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples
do they have?
A: They had 23 apples originally. They used 20 apples. So they had 23 - 20 = 3 apples left. They
bought 6 more, so 3 + 6 = 9. The answer is 9.
"""

# New question
new_question = """Q: A parking lot had 12 cars. 5 more cars arrived and then 3 cars left. How many
cars are in the parking lot now?"""

# concatenate prompt
prompt = few_shot_examples + "\n" + new_question + "\nA:"

# call the model
response = client.chat.completions.create(
    model="llama3.2:3b", 
    messages=[{"role":"user", "content": prompt}],
    temperature=0.7
)

# print the answer
print(response.choices[0].message.content)




Let's break it down step by step:

1. The parking lot started with 12 cars.
2. 5 more cars arrived, so we add those to the original number:
   12 + 5 = 17
3. Then, 3 cars left, so we subtract them from the new total:
   17 - 3 = 14

The answer is 14.


### (Optional) Few-shot CoT on GPT-2
GPT-2 is a pre-trained language model without instruction tuning. It continues text rather than answering questions. In this section, you'll try the exact same CoT pattern on GPT-2 and observe what happens. The goal is to test whether few-shot CoT alone can elicit structured reasoning from a non-chat LLM.

In [6]:
import os
import torch
from transformers import pipeline

# Step 1: Load GPT-2 text-generation from huggingface (https://huggingface.co/docs/transformers/en/model_doc/gpt2)
# Step 2: Write 1-2 few-shot reasoning examples (short, explicit steps + final answer in your own unique format)
# Step 3: Append a new test question after the examples to form one prompt string
# Step 4: Generate 1-3 completions with different decoding settings (e.g., greedy vs. top-k)
# Step 5: Print raw outputs; check if steps are followed and if the final answer is correct

# Step 1: Load GPT-2 text-generation from huggingface
generator = pipeline('text-generation', model='gpt2', device=0 if torch.cuda.is_available() else -1)

# Step 2: Write 1-2 few-shot reasoning examples (short, explicit steps + final answer)
few_shot_examples = """Example 1:
Problem: John has 3 apples and buys 4 more. How many does he have?
Solution: John starts with 3 apples. He buys 4 more. 3 + 4 = 7. Answer: 7

Example 2:
Problem: A store had 15 books and sold 8. How many are left?
Solution: The store had 15 books. They sold 8. 15 - 8 = 7. Answer: 7

"""

# Step 3: Append a new test question
new_question = """Problem: Sarah has 20 dollars and spends 12 dollars. How much does she have left?
Solution:"""

prompt = few_shot_examples + new_question

# Step 4: Generate with different decoding settings
print("=== Greedy Decoding ===")
output1 = generator(prompt, max_new_tokens=50, do_sample=False, pad_token_id=50256)
print(output1[0]['generated_text'][len(prompt):])

print("\n=== Sampling (top-k=50) ===")
output2 = generator(prompt, max_new_tokens=50, do_sample=True, top_k=50, pad_token_id=50256)
print(output2[0]['generated_text'][len(prompt):])

print("\n=== Sampling (temperature=0.8) ===")
output3 = generator(prompt, max_new_tokens=50, do_sample=True, temperature=0.8, pad_token_id=50256)
print(output3[0]['generated_text'][len(prompt):])


Device set to use cpu
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


=== Greedy Decoding ===
 Sarah spends 12 dollars. 20 - 20 = 7. Answer: 7

Problem: John has 20 dollars and spends 12 dollars. How much does he have left?

Solution: John spends 12 dollars. 20 - 20 = 7. Answer

=== Sampling (top-k=50) ===
 Sarah spends 12 dollars. It's not 12, it's 12 = 6.

Example 3:
Problem: I have a new car and have to drive from one place in the city to another. How many new cars can I buy?

=== Sampling (temperature=0.8) ===
 She spends 12 dollars. The store had 20 dollars. 10 - 12 = 7. Answer: 7

Example 3:
Problem: An engineer has 40 dollars, and spends 20 dollars to build a new car. How many are he supposed to


### 2.2: Zero-Shot Chain-of-Thought
Zero-shot CoT encourages the model to reason without examples by adding a short cue such as "Let's think step by step." This simple phrase often elicits step-by-step reasoning even when no demonstrations are provided. It serves as a baseline to compare with few-shot and other inference-time scaling methods.

In [7]:
from openai import OpenAI

# Step 1: Write the question and a zero-shot CoT cue (e.g., "Let's think step by step.")
# Step 2: Build a single prompt string that includes brief role guidance plus the question
# Step 3: Call your Ollama or OpenAI client to get a response from llama3.2:3b  # e.g., client.chat.completions.create(...)
# Step 4: Print the chain and the final answer

# Initialize client
client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")

# Question with zero-shot CoT cue
question = "If a train travels 60 miles in 45 minutes, how many miles does it travel in 2 hours?"
prompt = f"You are a helpful assistant. {question}\n\nLet's think step by step."

# call the model
response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role":"user", "content":prompt}],
    temperature=0.7
)

# print the chain and final answer
print(response.choices[0].message.content)



To find out how many miles the train travels in 2 hours, we need to first determine its speed.

The train travels 60 miles in 45 minutes. To make this easier to work with, let's convert the time from minutes to hours:

45 minutes = 0.75 hours (since there are 60 minutes in an hour)

Now we can find the train's speed by dividing the distance it traveled (60 miles) by the time it took to travel that distance (0.75 hours):

Speed = Distance / Time
= 60 miles / 0.75 hours
= 80 miles per hour

So, the train travels at a speed of 80 miles per hour.

Now we need to find out how many miles the train will travel in 2 hours:

Distance = Speed x Time
= 80 miles/hour x 2 hours
= 160 miles

Therefore, the train will travel 160 miles in 2 hours.


### 2.3: Self-Consistency
Self-consistency enhances reasoning accuracy by sampling multiple independent reasoning paths for the same question instead of relying on a single deterministic answer. Each run may follow a slightly different logical chain, and the diversity helps correct individual mistakes. After generating several reasoning traces, you then aggregate the final answers using majority voting.

This approach is especially useful when tasks involve multi-step reasoning or arithmetic, where single-path outputs may be incorrect.
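The aggregation step can be sketched on its own, without calling a model. The sampled answers below are made-up placeholders, and `majority_vote` is a hypothetical helper name used only for this illustration.

```python
from collections import Counter


def majority_vote(answers):
    # Return the most frequent answer and its share of the votes.
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Made-up final answers extracted from five sampled reasoning chains.
sampled_answers = ["12", "12", "144", "12", "12"]

winner, share = majority_vote(sampled_answers)
print(winner, share)  # 12 0.8
```

The vote share is a rough confidence signal: a narrow majority suggests the question is hard for the model and may deserve more samples.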

In [8]:
from openai import OpenAI
import re, collections

client = OpenAI(api_key = "ollama", base_url = "http://localhost:11434/v1")
MODEL = "llama3.2:3b"

def cot_answer(question, temperature=1.0):    
    # Generate a step-by-step reasoning chain for the given question and extract the final answer.
    prompt = f"{question}\n\nLet's think step by step."

    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature
    )

    reasoning = response.choices[0].message.content

    # Extract final answer (look for patterns like "Answer: X" or "The answer is X")
    answer_match = re.search(r'(?:answer is|answer:|Answer:)\s*(\d+)', reasoning, re.IGNORECASE)
    if answer_match:
        return reasoning, answer_match.group(1)

    # Fallback: extract last number
    numbers = re.findall(r'\b\d+\b', reasoning)
    final_answer = numbers[-1] if numbers else "unknown"

    return reasoning, final_answer

def self_consistent(question, n=10):
    # Run multiple reasoning chains and select the most frequent final answer by majority voting.
    answers = []

    print(f"Generating {n} reasoning paths...\n")

    for i in range(n):
        reasoning, answer = cot_answer(question, temperature=0.8)
        answers.append(answer)
        print(f"Run {i+1}: Answer = {answer}")

    # Count votes
    counter = collections.Counter(answers)
    winner = counter.most_common(1)[0][0]

    return winner, counter


question = "What is the square root of 144?"
winner, counter = self_consistent(question)
print("Votes:", counter)
print("Chosen answer:", winner)

Generating 10 reasoning paths...

Run 1: Answer = 12
Run 2: Answer = 12
Run 3: Answer = 12
Run 4: Answer = 12
Run 5: Answer = 144
Run 6: Answer = 12
Run 7: Answer = 12
Run 8: Answer = 12
Run 9: Answer = 12
Run 10: Answer = 12
Votes: Counter({'12': 9, '144': 1})
Chosen answer: 12


### 2.4: Sequential Revision

Sequential revision iteratively improves an answer by generating a first draft, critiquing it, and producing revised drafts that condition on prior answers. Each round should be short and focused, so improvements accumulate without drifting from the question.

In [9]:
MODEL = "llama3.2:3b"

def sequential_revision(question: str, max_steps: int = 3) -> str:
    # Generate an initial draft answer, then iteratively refine it by conditioning each revision on the previous one.
    # Uses the `client` created in the earlier cells.

    # Step 1: Ask the model to produce the first draft for the given question
    print("=== Initial Draft ===")
    prompt = f"{question}\n\nProvide a clear and concise answer."

    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )

    draft = response.choices[0].message.content
    print(draft)
    print()

    # Step 2: Loop max_steps-1 times, each time feeding the last draft back to the model with a request to revise
    for step in range(1, max_steps):
        print(f"=== Revision {step} ===")

        revision_prompt = (
            f"Question: {question}\n\n"
            f"Previous answer:\n{draft}\n\n"
            "Please review the previous answer and improve it. "
            "Make it more accurate, complete, and well-structured. Provide the revised answer."
        )

        # The model call sits inside the loop so that every step produces a new draft
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": revision_prompt}],
            temperature=0.7
        )

        # Step 3: Print each draft to observe how the answer evolves
        draft = response.choices[0].message.content
        print(draft)
        print()

    # Step 4: Return the final improved draft
    return draft


# Step 1: Define a question that benefits from multi-step reasoning
question = "What are the main causes and potential solutions for climate change?"

# Step 2: Call sequential_revision
final_answer = sequential_revision(question, max_steps=3)

# Step 3: Print the final output
print("=== Final Answer ===")
print(final_answer)

=== Initial Draft ===
The main causes of climate change include:

1. **Greenhouse gas emissions**: Mainly caused by burning fossil fuels (coal, oil, and gas), deforestation, and land-use changes.
2. **Deforestation**: Leading to loss of carbon sinks and increased greenhouse gas emissions.
3. **Agriculture**: Especially beef and sheep production, which leads to methane emissions.
4. **Industrial processes**: Such as cement production and the manufacturing of steel.

Potential solutions include:

1. **Transitioning to renewable energy**: Shifting from fossil fuels to solar, wind, and hydroelectric power.
2. **Increasing energy efficiency**: Improving insulation, using smart grids, and optimizing industrial processes.
3. **Electrifying transportation**: Promoting electric vehicles and public transport.
4. **Carbon capture and storage**: Implementing technologies that capture CO2 emissions from power plants and industrial processes.
5. **Sustainable land use**: Implementing reforestation e

### 2.5: Tree-of-Thoughts
Tree-of-Thoughts reframes reasoning as a search process rather than a single forward chain.
Instead of producing one linear sequence of thoughts, the model generates multiple candidate thoughts at each step, evaluates their promise, and then expands only the best few. This allows exploration of different reasoning paths before committing to a final answer, similar to how humans brainstorm, prune, and refine ideas.


In this section, you'll experiment with two simplified versions of ToT:
1. Word Ladder puzzle solver: a small example where each "thought" is a candidate word transition.
2. Generic ToT search (depth 2, width 2): a minimal loop that expands, evaluates, and selects reasoning branches.

In [10]:
###### Word Ladder Puzzle ##########

def neighbors(word, vocabulary):
    # Generate all valid one-letter mutations of 'word' that exist in 'vocabulary' and return them.
    result = []
    for i in range(len(word)):
        for c in 'abcdefghijklmnopqrstuvwxyz':
            if c != word[i]:
                mutated = word[:i] + c + word[i+1:]
                if mutated in vocabulary:
                    result.append(mutated)
    return result


def letter_distance(w1, w2):
    # Number of positions where two equal-length words differ (Hamming distance).
    return sum(c1 != c2 for c1, c2 in zip(w1, w2))


def tree_of_thought(start, goal, vocab, max_depth=5, beam_width=4):
    # Search over partial thoughts (paths) using a small beam.
    # Step 1: Initialize the frontier with a single path [start]
    if start == goal:
        return [start]
    frontier = [[start]]

    # Step 2: For each depth, expand each path by one neighbor from 'neighbors'
    for depth in range(max_depth):
        new_frontier = []

        for path in frontier:
            for neighbor in neighbors(path[-1], vocab):
                if neighbor in path:  # Avoid cycles
                    continue
                new_path = path + [neighbor]
                # Stop early as soon as a path reaches 'goal', so a solution
                # found at the last depth is not missed
                if neighbor == goal:
                    return new_path
                new_frontier.append(new_path)

        # Step 3: Score paths by letter distance between last word and 'goal' (smaller is better)
        new_frontier.sort(key=lambda p: letter_distance(p[-1], goal))

        # Step 4: Keep only the top 'beam_width' paths
        frontier = new_frontier[:beam_width]

        if not frontier:
            break

    # Step 5: Return None if no goal-reaching path was found
    return None


vocab = {"hit","dot","cog","log","dog","lot","lit","hot"}
print(tree_of_thought("hit", "cog", vocab)) # one candidate solution: ['hit', 'hot', 'dot', 'dog', 'cog']


['hit', 'hot', 'dot', 'dog', 'cog']


In [11]:
###### Generic ToT Search ##########

import re

MODEL = "llama3.2:3b"

def propose_thoughts(question, state, k=2):
    # Propose up to k next "thoughts" that extend the current partial solution/state.
    # Steps: build a short prompt with problem + current state; ask the model for a numbered list; return at most k parsed items.
    prompt = f"""Problem: {question}

    Current progress:
    {state if state else 'Just starting...'}

    Propose {k} different next steps or ideas to continue solving this problem. Be brief and specific.
    List them as:
    1. [first idea]
    2. [second idea]"""

    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8
    )

    text = response.choices[0].message.content
    # Extract numbered items
    thoughts = re.findall(r'\d+\.\s*(.+)', text)

    return thoughts[:k]


def score_state(question, state):
    # Score how promising a partial solution is on a 1-10 scale (higher is better).
    # Steps: build a rating prompt; call the model; parse the first integer 1-10; fall back to 5 if none is found.
    prompt = f"""Problem: {question}

    Current progress:
    {state}

    Rate how promising this progress is for solving the problem on a scale of 1-10, where 10 is very promising.
    Respond with just a single number between 1 and 10."""

    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )

    text = response.choices[0].message.content.strip()
    # Extract first number
    match = re.search(r'\b([1-9]|10)\b', text)

    return int(match.group(1)) if match else 5




def tree_of_thoughts(question, depth=2, width=2):
    # Run a tiny ToT search: expand states with propose_thoughts, score with score_state, keep top-k at each depth.
    # Steps: initialize frontier=[("", 0)]; for each depth, expand each state with k=width thoughts; score each; sort by score desc; keep top 'width'; return best state and score.
    frontier = [("", 0)]  # (state, score)

    for d in range(depth):
        new_frontier = []

        for state, _ in frontier:
            # Propose k=width new thoughts
            thoughts = propose_thoughts(question, state, k=width)

            # Score each new state
            for thought in thoughts:
                new_state = state + "\n" + thought if state else thought
                score = score_state(question, new_state)
                new_frontier.append((new_state, score))

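        # Stop if the model returned no parsable thoughts, keeping the previous frontier
        if not new_frontier:
            break

        # Sort by score (descending) and keep top width
        new_frontier.sort(key=lambda x: x[1], reverse=True)
        frontier = new_frontier[:width]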

        print(f"\n=== Depth {d+1} ===")
        for i, (s, sc) in enumerate(frontier, 1):
            print(f"Path {i} (score {sc}):\n{s}\n")

    # Return best state and score
    return frontier[0]


question = "Design a plan for a weekend science workshop for 12-year-olds."
solution, score = tree_of_thoughts(question)

print(f"Best solution (score {score}):\n{solution}")


=== Depth 1 ===
Path 1 (score 6):
**Define the Workshop's Focus**: Identify the specific area of science to focus on, such as physics, biology, chemistry, or environmental science. This will help guide the development of activities, experiments, and materials.

Path 2 (score 6):
**Develop an Activity Framework**: Sketch out a rough outline of the workshop's structure, including:


=== Depth 2 ===
Path 1 (score 6):
**Define the Workshop's Focus**: Identify the specific area of science to focus on, such as physics, biology, chemistry, or environmental science. This will help guide the development of activities, experiments, and materials.
**Conduct a Target Audience Research Session**: Host a focus group or survey with parents, teachers, or youth leaders who have experience organizing science workshops for 12-year-olds. This will help identify their expectations, interests, and levels of scientific knowledge, allowing for the development of customized activities and materials that cater

---  
# 3- Training Models for Reasoning

### 3.1: CoT Training
Chain-of-Thought (CoT) training conditions the model on explicit rationales during fine-tuning. Instead of teaching the model to output only the final answer, we train on (question, rationale, answer) so the model learns to internalize multi-step reasoning patterns. A practical recipe is STaR (Self-Taught Reasoner), in which the model generates its own rationales, keeps only those that lead to the correct answer, and is fine-tuned on them over several rounds. A related variant uses a stronger teacher model to produce rationales that a smaller student learns from.

For tasks that require multi-hop reasoning, models fine-tuned on rationales often achieve higher accuracy and are more stable at inference time than models trained on direct answers only. 

Training a full language model is beyond the scope of this notebook, but here is the high-level workflow, followed by short pseudocode:
- Collect questions: Prepare a dataset of questions and correct answers.
- Generate rationales: Prompt an LLM (the model being trained in STaR, or a stronger teacher in distillation setups) to produce step-by-step reasoning ending with an answer.
- Filter and clean: Discard incorrect or low-quality rationales.
- Prepare training data: Format triples (question, rationale, answer) for supervised fine-tuning.
- Fine-tune: Fine-tune the LLM on rationales.
- Iterate: Refine prompts, improve data quality, and retrain for stronger reasoning.

In [None]:
# Pseudocode (STaR loop)
# for round in 1 ... iters:
    # STEP 1: generate reasoning (the current model writes a rationale + answer for each question)
    # STEP 2: keep only correct, high-quality traces
    # STEP 3: fine-tune the model on the kept (question, rationale, answer) data
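The loop above can be sketched as runnable Python if the model calls are replaced with placeholders. Everything here is an assumption for illustration: `generate_rationale` fakes a model that is sometimes off by one, `fine_tune` only counts examples, and the toy dataset is made up. Because the stub never learns, the kept fraction does not improve across rounds as it would with real fine-tuning.

```python
import random


def generate_rationale(question, rng):
    # Placeholder for sampling from the current model: returns (rationale, answer).
    # The fake "model" adds two numbers and is sometimes off by one.
    a, b = question
    guess = a + b + rng.choice([0, 0, 1])
    return f"{a} + {b} = {guess}", guess


def fine_tune(training_triples):
    # Placeholder: a real implementation would run supervised fine-tuning here.
    return len(training_triples)


def star_round(dataset, rng):
    # STEP 1 + STEP 2: generate a trace per question, keep only the correct ones.
    kept = []
    for question, gold in dataset:
        rationale, answer = generate_rationale(question, rng)
        if answer == gold:
            kept.append((question, rationale, answer))
    return kept


rng = random.Random(0)
dataset = [((2, 3), 5), ((10, 7), 17), ((4, 4), 8), ((6, 1), 7)]  # made-up toy data

for round_idx in range(2):
    kept = star_round(dataset, rng)
    n_examples = fine_tune(kept)  # STEP 3
    print(f"round {round_idx + 1}: fine-tuning on {n_examples} of {len(dataset)} traces")
```

The filter in `star_round` is the key design choice: only traces whose final answer matches the gold label enter the training set, so no human-written rationales are needed.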

### 3.2: ORM vs PRM + RL
Training a Reward Model (RM) allows large language models to be improved through reinforcement learning (RL). Instead of fine-tuning directly on examples, we train a separate model that can score or rank model outputs, and use those scores as feedback signals to refine the policy model.

Two main reward modeling approaches are the Outcome Reward Model (ORM), which predicts a scalar reward for the final answer, and the Process Reward Model (PRM), which evaluates the individual reasoning steps instead of just the outcome.



| Approach | What it does | When to use |
|-----------|-------------|-------------|
| *Outcome Reward Model* | Predicts one scalar reward for the final answer | Training data is easy to collect using verifiers |
| *Process Reward Model* | Predicts a reward for each reasoning step | Training data is harder to collect, but the signal is more accurate |
| *RLHF* | Uses an RM as the reward in **RL** fine-tuning | Aligns the model policy with human or synthetic preferences |
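To make the ORM/PRM distinction concrete, here is a toy sketch that scores one reasoning trajectory both ways. Real reward models are learned; the rule-based checks below are stand-ins assumed for illustration, and the function names are hypothetical.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def outcome_reward(final_answer, gold_answer):
    # ORM-style signal: one scalar for the whole trajectory.
    return 1.0 if final_answer == gold_answer else 0.0


def process_rewards(steps):
    # PRM-style signal: one scalar per reasoning step.
    # Each step is (left, op, right, claimed_result); we check the arithmetic.
    return [1.0 if OPS[op](a, b) == claimed else 0.0 for a, op, b, claimed in steps]


# Parking-lot trajectory with an arithmetic slip in the second step (17 - 3 is 14, not 15).
steps = [(12, "+", 5, 17), (17, "-", 3, 15)]
final_answer = 15

print(outcome_reward(final_answer, gold_answer=14))  # 0.0
print(process_rewards(steps))                        # [1.0, 0.0]
```

The outcome score only says the trajectory failed, while the per-step scores point to where it went wrong, which is why process supervision is more informative but costlier to label.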




In [None]:
# for round = 1 ... iters:
    # STEP 1:  Generate reasoning
        # sample a minibatch of questions
        # policy roll-out (actions + log-probs)
    # STEP 2:  Score the trajectory
        # ORM: scalar reward for the final answer / PRM: scalar reward for the thought process
    # STEP 3:  Reinforce the policy (PPO)

---  
# 4- A Deep Research Agent

A deep-research agent pairs a reasoning model (e.g., `deepseek-r1`) with external tools for web search and retrieval. We follow the *ReAct* pattern (reason → tool → observation): the model writes short thoughts, decides when to call tools, reads observations, and continues reasoning until it can answer or reaches a step limit.

1. The model reasons and decides whether to use a tool
2. The agent runs the search and feeds condensed snippets back as context
3. The loop repeats until the model answers or hits a step limit

We use `AgentType.ZERO_SHOT_REACT_DESCRIPTION`, which hides the loop inside the LangChain agent.
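To see what the agent loop does, here is a hand-rolled sketch of the reason → tool → observation cycle. It does not reproduce LangChain's implementation: `fake_llm` and `fake_search` are stand-ins assumed for illustration, and the `Action:` / `Action Input:` / `Final Answer:` text format is a simplified version of the ReAct convention.

```python
import re


def fake_llm(transcript):
    # Stand-in for the reasoning model: asks for one search, then answers.
    if "Observation:" not in transcript:
        return "Thought: I should look this up.\nAction: search\nAction Input: capital of France"
    return "Thought: I have what I need.\nFinal Answer: Paris"


def fake_search(query):
    # Stand-in for a web search tool.
    return "Paris is the capital of France."


def react_loop(question, llm, tools, max_steps=3):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        output = llm(transcript)
        transcript += output + "\n"

        # Stop as soon as the model commits to an answer.
        final = re.search(r"Final Answer:\s*(.*)", output)
        if final:
            return final.group(1).strip()

        # Otherwise parse the requested tool call and run it.
        action = re.search(r"Action:\s*(.*)\nAction Input:\s*(.*)", output)
        if action is None:
            break
        tool_name, tool_input = action.group(1).strip(), action.group(2).strip()
        observation = tools[tool_name](tool_input)

        # Feed the observation back so the next model call can use it.
        transcript += f"Observation: {observation}\n"
    return None  # step limit reached or output could not be parsed


print(react_loop("What is the capital of France?", fake_llm, {"search": fake_search}))  # Paris
```

The growing transcript is the agent's only memory: each model call sees all previous thoughts, actions, and observations.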

In [12]:
from ddgs import DDGS
from langchain.tools import Tool

def ddg_search(query: str, k: int = 5) -> str:
    # Use DDGS to run a simple web search and return joined snippets.
    with DDGS() as ddgs:
        results = list(ddgs.text(query, max_results=k))

    # Extract and join snippets
    snippets = [f"{r['title']}: {r['body']}" for r in results]
    return "\n\n".join(snippets)

search_tool = Tool(
    name="DuckDuckGo Search",
    func=ddg_search,
    description="Search the public web. Input: a plain English query. Returns: concatenated snippets."
)


In [13]:
from langchain.agents import initialize_agent, AgentType
from langchain_community.chat_models import ChatOllama

MODEL = "deepseek-r1:8b"
question = "What are the best resources to learn machine learning in 2025?"

# Step 1: Initialize the reasoning model via ChatOllama
llm = ChatOllama(model=MODEL, temperature=0.7)

# Step 2: Build the agent with tool access (DuckDuckGo Search) and the ReAct agent type (initialize_agent)
agent = initialize_agent([search_tool], llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)


# Step 3: Ask a query and let the agent search + reason to produce an answer
print(f"\n{'='*60}")
print(f"Question: {question}")
print(f"{'='*60}\n")

result = agent.run(question)

print(f"\n{'='*60}")
print(f"Final Answer:\n{result}")
print(f"{'='*60}")




Question: What are the best resources to learn machine learning in 2025?



> Entering new AgentExecutor chain...

Thought: I need to find the best resources for learning machine learning as of 2025. Since I can't access real-time information, I should first search for current trends and recommendations that might be relevant for the future. I'll use the DuckDuckGo search tool to find articles and lists about the best ML learning resources around now.
Action: DuckDuckGo Search
Action Input: "best resources to learn machine learning 2024"
Observation: 10 Online Places to Learn Machine Learning in 2024: Whether you're looking to start a career in data science or simply improve your coding skills, learningmachinelearning is a smart investment. Here's a list of the Top 10 Best Places to LearnMachineLearning in 2024 to help guide your journey!

100+ Best Resources to Learn Machine Learning in 2024: Nov 25, 2024 · This list of 100+ resources is desig

# Optional (Multi-agent Deep Research)
Instead of a single multi-step agent, you can design multiple collaborating agents such as a Planner, Searcher, Summarizer, and Verifier that pass information and refine each other's outputs. This setup can improve robustness, diversity of reasoning, and division of labor.

Try building a simple setup with 2-3 agents that share goals and messages, for example Planner → Researcher → Writer.
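One possible shape for a Planner → Researcher → Writer pipeline is sketched below. All three roles are plain-Python stand-ins assumed for illustration; in a real setup each would wrap an LLM call or the search agent from section 4.

```python
def planner(goal):
    # Stand-in: a real planner would ask an LLM to split the goal into sub-questions.
    return [f"{goal}: free online courses", f"{goal}: textbooks"]


def researcher(sub_question):
    # Stand-in: a real researcher would run the search agent on the sub-question.
    return f"notes on '{sub_question}'"


def writer(goal, notes):
    # Stand-in: a real writer would ask an LLM to synthesize the notes.
    bullet_list = "\n".join(f"- {note}" for note in notes)
    return f"Report: {goal}\n{bullet_list}"


def run_pipeline(goal):
    # Each agent consumes the previous agent's output as its message.
    sub_questions = planner(goal)
    notes = [researcher(q) for q in sub_questions]
    return writer(goal, notes)


report = run_pipeline("learn ML")
print(report)
```

Because the researcher calls are independent of each other, they are the natural place to add parallelism.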

In [None]:
from concurrent.futures import ThreadPoolExecutor
from langchain.agents import initialize_agent, AgentType
from langchain_community.chat_models import ChatOllama

def run_single_research(query, agent_id):
    """Run a single research agent and return its answer."""
    print(f"\n[Agent {agent_id}] Starting research...")

    # Initialize model and agent for this thread
    llm = ChatOllama(model="deepseek-r1:8b", temperature=0.7)
    agent = initialize_agent(
        [search_tool],
        llm,
        agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
        verbose=False,  # Set to False to avoid cluttered output
        handle_parsing_errors=True  # Handle deepseek-r1's reasoning format
    )

    try:
        result = agent.run(query)
        print(f"[Agent {agent_id}] Completed!")
        return result
    except Exception as e:
        print(f"[Agent {agent_id}] Error: {str(e)}")
        return f"Error: {str(e)}"
    

def parallel_research(query, n=3):
    # Run n independent research runs in parallel and return their answers.
    # Steps: use ThreadPoolExecutor; submit n calls to your agent/search pipeline; gather results in order.
    print(f"Starting {n} parallel research agents...")
    print("="*60)

    # Use ThreadPoolExecutor to run multiple agents in parallel
    with ThreadPoolExecutor(max_workers=n) as executor:
        # Submit n tasks
        futures = [executor.submit(run_single_research, query, i+1) for i in range(n)]

        # Gather results in order
        answers = [future.result() for future in futures]

    print("\n" + "="*60)
    print("All agents completed!")
    print("="*60)

    return answers

# Run parallel research
query = "What are the best resources to learn ML in 2025?"
print(f"Query: {query}\n")

answers = parallel_research(query, n=3)

# Display results
print("\n" + "="*60)
print("RESULTS FROM 3 PARALLEL AGENTS")
print("="*60)

for i, answer in enumerate(answers, 1):
    print(f"\n{'='*60}")
    print(f"Agent {i} Answer (first 800 chars):")
    print(f"{'='*60}")
    # Show first 800 chars to see variation
    print(answer[:800] + "..." if len(answer) > 800 else answer)
    print()



Query: What are the best resources to learn ML in 2025?

Starting 3 parallel research agents...

[Agent 1] Starting research...

[Agent 2] Starting research...

[Agent 3] Starting research...
[Agent 2] Completed!


## 🎉 Congratulations!

* You practiced various inference-time reasoning methods
* You gained intuition about training reasoning models
* You built a **deep-research agent**: a reasoning model like `deepseek-r1` + ReAct-style agent + tool use (web search)
* Next, try adding more tools and extending deep research to a multi-agent system: many agents researching the web in parallel.


👏 **Great job!** Take a moment to celebrate. The techniques you implemented here power many production agents and chatbots.