# Chapter 6: Prompt Engineering - Hard Tasks

This notebook covers advanced reasoning techniques: Tree-of-Thought, chain prompting, output verification, and multi-stage reasoning with self-reflection.

**Note on Code Organization**: In previous notebooks, we sometimes wrote prompts directly. Here we use functions extensively because:
- **Reusability**: Call the same logic multiple times without copying code
- **Modularity**: Each function handles one specific task (generate options, evaluate, verify, etc.)
- **Real-world practice**: Production systems typically wrap prompts in functions for maintainability
- **Testing**: Easy to test individual components independently

## Setup

Run all cells in this section to set up the environment and load the model.

Before running these cells, review the concepts from the main Chapter 6 notebook (00_Start_Here.ipynb).

### [Optional] - Installing Packages on Google Colab

If you are viewing this notebook on Google Colab, uncomment and run the following code to install dependencies.

**Note**: Use a GPU for this notebook. In Google Colab, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4.

In [1]:
# %%capture
# !pip install --upgrade "transformers>=4.40.0" torch accelerate

### Model Loading

In [2]:
import os
os.environ["HF_HUB_DISABLE_PROGRESS_BARS"] = "1"

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [3]:
model_path = "microsoft/Phi-3-mini-4k-instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)

`torch_dtype` is deprecated! Use `dtype` instead!


### Helper Functions

In [4]:
def generate_text(prompt, temperature=0.7, max_tokens=500):
    """Generate text with specified parameters"""
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        return_full_text=False,
        max_new_tokens=max_tokens,
        do_sample=temperature > 0,
        temperature=temperature if temperature > 0 else None,
    )
    
    messages = [{"role": "user", "content": prompt}]
    output = pipe(messages)
    return output[0]['generated_text']

## Challenges

Complete the following tasks by implementing the starter code.

### Level: Hard

**About This Task:**
Tree-of-Thought generates multiple options at each step and evaluates which paths look most promising. We break this into modular functions instead of one big script.

#### Hard Task 1: Tree-of-Thought

### Instructions

1. Run baseline CoT to see linear reasoning
2. Study how ToT functions work together (generate → evaluate → explore)
3. Improve `generate_next_steps` prompt to get better options
4. Modify `evaluate_option` to use different criteria (safety vs progress)
5. Test on a different strategic problem

In [5]:
problem = """You need to transport a fox, a chicken, and grain across a river.
Your boat can only carry you and one item.
If left alone: fox eats chicken, chicken eats grain.
How do you get everything across safely?"""

### Baseline: Linear CoT

This is the simple approach - ask the model to think step-by-step in one shot.

In [6]:
def linear_cot_solve(problem):
    """Simple linear chain-of-thought - one path, no exploration"""
    prompt = f"{problem}\n\nLet's think step-by-step:"
    return generate_text(prompt, temperature=0, max_tokens=500)

In [7]:
print("Linear CoT:")
solution_cot = linear_cot_solve(problem)
print(solution_cot)

Device set to use cuda


Linear CoT:
 1. Take the chicken across the river first and leave it on the other side.
2. Go back and take the fox across the river.
3. Leave the fox on the other side, but take the chicken back with you.
4. Leave the chicken on the starting side and take the grain across the river.
5. Leave the grain with the fox on the other side and go back to get the chicken.
6. Finally, take the chicken across the river.

Now, all three items (fox, chicken, and grain) are safely on the other side of the river.


Here linear CoT reaches a valid solution, but it commits to a single path. On harder problems it can miss better solutions or get stuck after an early mistake. Now we'll use Tree-of-Thought to explore multiple paths.
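
To judge whether a model-produced plan is valid, it helps to have a reference answer that does not depend on the model. The sketch below is an ordinary breadth-first search over the puzzle's states, not part of the Tree-of-Thought code in this notebook. The names `ITEMS`, `UNSAFE`, `is_safe`, and `solve_river_crossing` are illustrative assumptions.

```python
from collections import deque

ITEMS = ("fox", "chicken", "grain")
# Pairs that cannot be left together without the farmer present
UNSAFE = ({"fox", "chicken"}, {"chicken", "grain"})


def is_safe(bank):
    """A bank without the farmer is safe if it holds no unsafe pair."""
    return not any(pair <= bank for pair in UNSAFE)


def solve_river_crossing():
    """Breadth-first search over (items on start bank, farmer on start bank)."""
    start = (frozenset(ITEMS), True)
    goal = (frozenset(), False)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, farmer_left), path = queue.popleft()
        if (left, farmer_left) == goal:
            return path
        # Items on the same bank as the farmer are the only possible cargo
        here = left if farmer_left else frozenset(ITEMS) - left
        for cargo in [None, *sorted(here)]:
            new_left = set(left)
            if cargo is not None:
                if farmer_left:
                    new_left.remove(cargo)
                else:
                    new_left.add(cargo)
            new_left = frozenset(new_left)
            # The bank the farmer just left must be safe on its own
            unattended = new_left if farmer_left else frozenset(ITEMS) - new_left
            if not is_safe(unattended):
                continue
            state = (new_left, not farmer_left)
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [cargo or "nothing"]))
    return None


plan = solve_river_crossing()
for i, cargo in enumerate(plan, 1):
    direction = "across" if i % 2 else "back"
    print(f"{i}. Go {direction} carrying {cargo}")
```

The shortest plan takes seven crossings and starts by moving the chicken, which matches the first step of the linear CoT answer above. A reference like this is only available because the puzzle has a small, explicit state space; most prompting tasks do not.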

### Tree-of-Thought Implementation

We break ToT into separate functions. Each function has a single responsibility - this is how real systems are built.

In [8]:
import re
from collections import deque

**Function 1: Generate Options**

Your task: Improve this prompt to generate more specific, actionable next steps.

In [9]:
def generate_next_steps(current_state, problem, num_options=3):
    """
    Generate possible next moves from current state.
    
    This function uses a prompt to brainstorm options. In real systems,
    this allows you to:
    - Call it repeatedly at each decision point
    - Adjust num_options based on problem complexity
    - Cache and reuse results for similar states
    
    Args:
        current_state: Description of where we are now
        problem: The full problem description for context
        num_options: How many alternatives to generate
    
    Returns:
        List of possible next actions
    """
    # Improve this prompt to get better, more specific options
    prompt = f"""{problem}

Current situation: {current_state}

What are {num_options} possible next moves? List them:
1."""
    
    output = generate_text(prompt, temperature=0.7, max_tokens=200)
    
    # Parse the output to extract individual options
    # This is why functions are useful - parsing logic is isolated here
    options = []
    for line in output.strip().split('\n'):
        # Remove list markers such as "1.", "2)", "-", "*", or "•"
        clean = re.sub(r'^\s*(?:\d+[.)]|[-*•])\s*', '', line).strip()
        # Skip blank lines so they do not use up one of the num_options slots
        if clean:
            options.append(clean)
    
    return options[:num_options]

**Function 2: Evaluate Options**

This function scores how good each option is. Notice how we can easily swap evaluation criteria by modifying just this function.

In [10]:
def evaluate_option(option, problem, criterion="progress"):
    """
    Score how promising this option is (0-10).
    
    Using a function here means we can:
    - Change evaluation criteria without touching tree exploration code
    - Test different scoring strategies independently  
    - Add logging or caching if needed
    
    Args:
        option: The action to evaluate
        problem: Full problem for context
        criterion: What to optimize for ("progress" or "safety")
    
    Returns:
        Score from 0-10
    """
    if criterion == "safety":
        # Your task: Write a prompt that focuses on safety
        prompt = f"""{problem}

Proposed move: {option}

On a scale of 0-10, how SAFE is this move?
Consider: Does it prevent anything from being eaten?

Score (0-10):"""
    else:
        # Default: Focus on progress
        prompt = f"""{problem}

Proposed move: {option}

On a scale of 0-10, how promising is this move?
Consider: Does it make progress? Does it violate constraints?

Score (0-10):"""
    
    output = generate_text(prompt, temperature=0, max_tokens=50)
    
    # Extract the first integer in the reply. This is fragile: a reply that
    # restates the scale ("0-10") before giving a score is parsed as 0.
    match = re.search(r'(\d+)', output)
    if match:
        score = int(match.group(1))
        return min(max(score, 0), 10)  # Clamp to 0-10 range
    return 5  # Default middle score if parsing fails
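
Taking the first integer in the reply is easy to get wrong when the model repeats the scale before answering. The sketch below shows one possible stricter parser. `parse_score` is a hypothetical helper, not part of the notebook, and the patterns it accepts are assumptions about typical reply formats. Whether parsing caused the 0/10 scores in the recorded run further down is not known, because the raw evaluator replies are not shown.

```python
import re


def parse_score(text, default=5):
    """Extract a 0-10 score from a model reply (hypothetical helper)."""
    # Drop restatements of the scale such as "0-10" or "0 to 10"
    cleaned = re.sub(r'\b0\s*(?:-|to)\s*10\b', '', text)
    # Prefer an explicit "N/10" or "N out of 10" form
    match = re.search(r'\b(\d{1,2})(?:\.\d+)?\s*(?:/|out of)\s*10\b', cleaned)
    if match is None:
        # Otherwise fall back to the first standalone one- or two-digit number
        match = re.search(r'\b(\d{1,2})(?:\.\d+)?\b', cleaned)
    if match is None:
        return default
    # Clamp to the 0-10 range, as evaluate_option does
    return min(max(int(match.group(1)), 0), 10)


reply = "On a scale of 0-10, this move scores 8."
print(re.search(r'(\d+)', reply).group(1))  # first-integer approach reads 0
print(parse_score(reply))                   # stricter parser reads 8
```

Decimal scores are truncated toward the integer part, and the default of 5 is kept from `evaluate_option` so that an unparseable reply neither prunes nor promotes a branch.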

**Data Structure: Tree Node**

We use a class to represent each decision point. This keeps state organized.

In [11]:
class TreeNode:
    """
    Represents one state in the decision tree.
    
    Using a class instead of just variables makes it easy to:
    - Track parent-child relationships
    - Reconstruct the path to any node
    - Store metadata like scores
    """
    def __init__(self, state, action=None, parent=None):
        self.state = state          # Current situation description
        self.action = action        # What action led here
        self.parent = parent        # Previous node (for path reconstruction)
        self.children = []          # Possible next nodes
        self.score = 0              # Evaluation score
    
    def get_path(self):
        """Reconstruct the sequence of actions from root to this node"""
        path = []
        node = self
        # Walk backwards through parent links
        while node.parent:
            path.append(node.action)
            node = node.parent
        return list(reversed(path))  # Reverse to get root-to-leaf order
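
A short usage sketch may make the parent links concrete. The class is repeated here only so the snippet runs on its own, and the states and actions are made-up examples for illustration.

```python
class TreeNode:
    """Same structure as the TreeNode above, repeated so this snippet runs alone."""
    def __init__(self, state, action=None, parent=None):
        self.state = state
        self.action = action
        self.parent = parent
        self.children = []
        self.score = 0

    def get_path(self):
        path = []
        node = self
        # Walk backwards through parent links until the root is reached
        while node.parent:
            path.append(node.action)
            node = node.parent
        return list(reversed(path))


root = TreeNode("Everything is on the starting side")
first = TreeNode("Chicken is across", action="Take the chicken across", parent=root)
second = TreeNode("Farmer is back", action="Return alone", parent=first)
root.children.append(first)
first.children.append(second)

print(second.get_path())  # ['Take the chicken across', 'Return alone']
print(root.get_path())    # [] because the root has no parent
```

Because each node stores only its own action and a link to its parent, the full path never has to be copied into every node.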

**Main Algorithm: Tree Search**

This orchestrates the generate → evaluate → explore loop. Notice how clean it is because we separated concerns into functions.

In [12]:
def tree_of_thought_solve(problem, max_depth=3, branch_factor=2, criterion="progress"):
    """
    Explore multiple reasoning paths to find the best solution.
    
    This function shows why modular design matters:
    - We call generate_next_steps() for brainstorming
    - We call evaluate_option() for scoring
    - We use TreeNode to manage state
    - The search logic is separate from prompt engineering
    
    Args:
        problem: Problem to solve
        max_depth: How many steps ahead to plan
        branch_factor: How many options to consider at each step
        criterion: What to optimize ("progress" or "safety")
    
    Returns:
        Best solution node found
    """
    # Start with initial state
    initial_state = "Everything is on the starting side"
    root = TreeNode(initial_state)
    
    print("Tree-of-Thought exploration:")
    print(f"Criterion: {criterion}")
    
    # Breadth-first search using a queue
    queue = deque([(root, 0)])
    best_solution = None
    best_score = -1
    explored_nodes = 0
    
    # Explore tree level by level
    while queue and explored_nodes < 10:  # Limit exploration for demo
        node, depth = queue.popleft()
        explored_nodes += 1
        
        print(f"\nDepth {depth}, Node {explored_nodes}")
        print(f"State: {node.state}")
        
        # At max depth, evaluate final state
        if depth >= max_depth:
            score = evaluate_option(node.state, problem, criterion)
            node.score = score
            print(f"Final score: {score}/10")
            
            # Track best solution
            if score > best_score:
                best_score = score
                best_solution = node
            continue
        
        # Generate possible next moves - using our function!
        next_steps = generate_next_steps(node.state, problem, num_options=branch_factor)
        
        if next_steps:
            print(f"\nExploring {len(next_steps)} options:")
            
            for i, step in enumerate(next_steps, 1):
                # Evaluate each option - using our function!
                score = evaluate_option(step, problem, criterion)
                print(f"  {i}. {step} (score: {score}/10)")
                
                # Create child node
                new_state = f"{node.state}, then {step}"
                child = TreeNode(new_state, step, node)
                child.score = score
                node.children.append(child)
                
                # Pruning: Only explore promising branches (score >= 5)
                if score >= 5:
                    queue.append((child, depth + 1))
                else:
                    print(f"       → Pruned (low score)")
    
    # Display best solution found
    if best_solution:
        print("\n" + "="*70)
        print("Best solution path:")
        for i, action in enumerate(best_solution.get_path(), 1):
            print(f"{i}. {action}")
        print(f"\nFinal score: {best_score}/10")
        return best_solution
    
    return None

Now run the tree search. Watch how it explores multiple paths.

In [13]:
solution_tot = tree_of_thought_solve(problem, max_depth=3, branch_factor=2, criterion="progress")

Device set to use cuda


Tree-of-Thought exploration:
Criterion: progress

Depth 0, Node 1
State: Everything is on the starting side


Exploring 2 options:
  1. Take the chicken across the river. (score: 0/10)
       → Pruned (low score)
  2. Return alone to the original side. (score: 0/10)
       → Pruned (low score)


### Test Different Criteria

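In the run above, both candidate moves scored 0/10 and were pruned, so the search stopped at depth 0 and returned `None`, even though taking the chicken first is the correct opening move (it is step 1 of the linear CoT solution). This points to a problem in the evaluation prompt or the score parsing rather than in the moves. Swapping the evaluation criterion only requires changing one parameter, so try it and compare.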

In [14]:
# Your task: Run with safety criterion and compare results
# solution_safety = tree_of_thought_solve(problem, max_depth=3, branch_factor=2, criterion="safety")

### Questions

1. How many branches were pruned (score < 5)? What does pruning save in terms of LLM calls?

2. Compare the solution paths from "progress" vs "safety" criteria. Did they differ?

3. Why is using separate functions better than one big script with all prompts and logic mixed together?

**About This Task:**
Chain prompting breaks complex tasks into sequential stages. Each stage is a separate function call - this is standard practice in production systems.

#### Hard Task 2: Chain Prompting

### Instructions

1. Run single-prompt baseline to see limitations
2. Study the 4-stage chain - notice how each function builds on previous outputs
3. Improve Stage 3 prompt to generate better response strategies
4. Test what happens if you skip a stage
5. Try a different customer review

In [15]:
customer_review = """I've been using your product for 3 months. Setup was confusing
and took 2 hours with no clear instructions. Once I figured it out, the features
are quite powerful. Performance is good, but the mobile app crashes occasionally.
Support was helpful. For $49/month, I expected better documentation.
I'm considering upgrading to premium."""

### Baseline: Single Prompt

The naive approach - ask the model to do everything at once.

In [16]:
def single_prompt_analysis(review):
    """One-shot analysis - tries to do everything in one call"""
    prompt = f"""Analyze this customer review and generate a response:

Review: {review}

Provide:
1. Sentiment analysis
2. Key issues
3. A professional response
"""
    return generate_text(prompt, temperature=0, max_tokens=400)

In [17]:
print("Single prompt approach:")
single_result = single_prompt_analysis(customer_review)
print(single_result)

Device set to use cuda


Single prompt approach:
 1. Sentiment Analysis:

   - The customer has mixed feelings about the product. They are satisfied with the features and performance but are unhappy with the setup process and the mobile app's stability.
   - The customer appreciates the support but is disappointed with the documentation.
   - The customer is contemplating an upgrade to the premium version, indicating a potential interest in the product despite the issues.

2. Key Issues:

   - Confusing setup process
   - Lack of clear instructions
   - Mobile app crashes occasionally
   - Inadequate documentation

3. Professional Response:

   Dear [Customer's Name],

   Thank you for taking the time to share your feedback with us. We are glad to hear that you find our product's features and performance to be powerful and that our support team has been helpful to you.

   We apologize for the inconvenience you experienced during the setup process. We understand that it can be frustrating when instructions are

Notice that the single-prompt output is cut off at the 400-token limit before the response is finished, and the part that was generated is fairly generic. Now we'll use a 4-stage chain.

### Chain Prompting: 4 Specialized Functions

Each stage is a focused function. This is how real systems work - specialized components that you can test and improve independently.

**Stage 1: Extract Information**

First, we pull out facts. This function only does extraction - nothing else.

In [18]:
def extract_information(review):
    """
    Extract structured facts from unstructured review text.
    
    Why separate this?
    - Easy to test extraction accuracy independently
    - Can reuse extracted facts for multiple analyses
    - Can cache results to avoid re-extraction
    """
    prompt = f"""Extract key information from this review:

Review: {review}

List:
- Issues mentioned:
- Positive aspects:
- Duration of usage:
- Price mentioned:
"""
    return generate_text(prompt, temperature=0, max_tokens=250)

In [19]:
print("Stage 1: Extraction")
stage1_output = extract_information(customer_review)
print(stage1_output)

Device set to use cuda


Stage 1: Extraction
 - Issues mentioned:
  - Confusing setup process
  - No clear instructions
  - Occasional mobile app crashes
  - Poor documentation

- Positive aspects:
  - Powerful features
  - Good performance
  - Helpful support

- Duration of usage:
  - 3 months

- Price mentioned:
  - $49/month


**Stage 2: Analyze Sentiment**

Now we analyze the extracted facts. See how we pass stage1_output as input?

In [20]:
def analyze_sentiment(extracted_info):
    """
    Determine overall sentiment from extracted facts.
    
    Benefits of separating:
    - Can swap sentiment analysis methods without changing extraction
    - Can test sentiment accuracy on known examples
    - Clearer what this function is responsible for
    """
    prompt = f"""Based on this information:

{extracted_info}

Determine:
1. Overall sentiment (positive/negative/mixed)
2. Satisfaction level (1-10)
3. Main concerns
"""
    return generate_text(prompt, temperature=0, max_tokens=250)

In [21]:
print("\nStage 2: Sentiment Analysis")
stage2_output = analyze_sentiment(stage1_output)
print(stage2_output)

Device set to use cuda



Stage 2: Sentiment Analysis
 1. Overall sentiment: Mixed
2. Satisfaction level: 6
3. Main concerns:
   - Confusing setup process
   - No clear instructions
   - Occasional mobile app crashes
   - Poor documentation


**Stage 3: Plan Response Strategy**

Your task: Improve this prompt to create a more detailed response plan.

In [22]:
def plan_response_strategy(sentiment_analysis, extracted_info):
    """
    Decide what to say in the response.
    
    This function takes both sentiment and facts to create a strategy.
    Separating strategy planning from actual writing allows:
    - Human review of strategy before writing
    - A/B testing different strategies
    - Consistent application of business rules
    """
    # Improve this prompt to be more specific about strategy
    prompt = f"""Given this sentiment:

{sentiment_analysis}

And these facts:

{extracted_info}

Plan the response:
- What to acknowledge
- What to apologize for
- What actions to offer
"""
    return generate_text(prompt, temperature=0, max_tokens=250)

In [23]:
print("\nStage 3: Response Strategy")
stage3_output = plan_response_strategy(stage2_output, stage1_output)
print(stage3_output)

Device set to use cuda



Stage 3: Response Strategy
 Acknowledgment:

Thank you for taking the time to share your experience with our product. We're glad to hear that you've found the powerful features and good performance to be beneficial during your 3-month usage. Our support team has been working hard to assist our customers, and we're pleased to know that you found them helpful.

Apology:

We sincerely apologize for the issues you've encountered with the confusing setup process, lack of clear instructions, occasional mobile app crashes, and poor documentation. We understand how frustrating these problems can be, and we're committed to making things right.

Actions Offered:

1. We will work on improving the setup process and providing clearer instructions to ensure a smoother experience for our users.
2. Our development team will investigate the cause of the occasional mobile app crashes and implement necessary fixes to prevent them from happening in the future.
3. We will review our documentation and upda

**Stage 4: Write Final Response**

Finally, we execute the strategy to produce the actual response text.

In [24]:
def write_response(strategy):
    """
    Generate the actual response text following the strategy.
    
    Why separate this?
    - Can change tone/style without changing strategy logic
    - Can generate multiple versions for A/B testing
    - Can add templates or brand voice guidelines here
    """
    prompt = f"""Write a professional customer service response following this strategy:

{strategy}

Tone: Professional, empathetic, solution-focused
Length: 2-3 paragraphs

Response:
"""
    return generate_text(prompt, temperature=0, max_tokens=300)

In [25]:
print("\nStage 4: Final Response")
stage4_output = write_response(stage3_output)
print(stage4_output)

Device set to use cuda



Stage 4: Final Response
 Dear [Customer Name],


Thank you for taking the time to share your experience with our product. We're glad to hear that you've found the powerful features and good performance to be beneficial during your 3-month usage. Our support team has been working hard to assist our customers, and we're pleased to know that you found them helpful.


We sincerely apologize for the issues you've encountered with the confusing setup process, lack of clear instructions, occasional mobile app crashes, and poor documentation. We understand how frustrating these problems can be, and we're committed to making things right.


To address these concerns, we will work on improving the setup process and providing clearer instructions to ensure a smoother experience for our users. Our development team will investigate the cause of the occasional mobile app crashes and implement necessary fixes to prevent them from happening in the future. We will also review our documentation and upd

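Each stage built on the previous one. This is modular design: each function has one job. Note that Stage 3 produced draft wording rather than a plan, so Stage 4 mostly repeated it, and both outputs were cut off at their token limits. A more specific Stage 3 prompt should help.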
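
The data flow between the four stages can be sketched as one small orchestration function. `run_chain` is a hypothetical helper, not part of the notebook. The stages are passed in as arguments so the flow can be exercised with stand-in functions and no model.

```python
def run_chain(review, extract, analyze, plan, write):
    """Run the four stages in order and keep every intermediate output."""
    results = {}
    results["extracted"] = extract(review)
    results["sentiment"] = analyze(results["extracted"])
    # The planning stage receives both the sentiment and the extracted facts
    results["strategy"] = plan(results["sentiment"], results["extracted"])
    results["response"] = write(results["strategy"])
    return results


# Stand-in stages (no model needed) that only show how outputs are passed along
demo = run_chain(
    "Setup was confusing but support was helpful.",
    extract=lambda review: f"facts({review})",
    analyze=lambda facts: f"sentiment({facts})",
    plan=lambda sentiment, facts: f"plan({sentiment}; {facts})",
    write=lambda strategy: f"response({strategy})",
)
print(demo["response"])
```

In this notebook the call would be `run_chain(customer_review, extract_information, analyze_sentiment, plan_response_strategy, write_response)`. Keeping the intermediate outputs in a dictionary makes it easy to inspect which stage a problem came from.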

### Questions

1. What information from Stage 1 was used in Stage 3? What would happen if you skipped Stage 1?

2. Which stage added the most value compared to the single-prompt approach?

3. Why is it better to have 4 functions instead of copying the prompts 4 times in a row?

**About This Task:**
Output verification uses one prompt to generate, another to check, and a third to fix. Functions make this generate-verify-correct loop clean and testable.

#### Hard Task 3: Output Verification Loop

### Instructions

1. Study the 3-function pattern: generate → verify → correct
2. Run to see which requirements the model might miss
3. Improve the verification prompt to catch more issues
4. Test what happens if you run verify → correct twice
5. Try different code generation tasks

In [26]:
requirements = {
    'task': 'Write a function that calculates a discount',
    'name': 'calculate_discount',
    'parameters': ['price', 'discount_percent'],
    'must_have': ['docstring', 'input validation', 'return statement']
}

**Function 1: Generate Code**

First attempt at generating code from requirements.

In [27]:
def generate_function(requirements):
    """
    Generate code from requirements.
    
    Why use a function?
    - Can be called multiple times for different requirements
    - Easy to test with various requirement sets
    - Can log all generated code for analysis
    """
    prompt = f"""Write a Python function:

Task: {requirements['task']}
Function name: {requirements['name']}
Parameters: {', '.join(requirements['parameters'])}
Must include: {', '.join(requirements['must_have'])}

Write the complete function:
"""
    return generate_text(prompt, temperature=0, max_tokens=250)

In [28]:
print("Generated code:")
code = generate_function(requirements)
print(code)

Device set to use cuda


Generated code:
 ```python

def calculate_discount(price, discount_percent):

    """

    Calculate the discounted price of an item.


    Parameters:

    price (float): The original price of the item.

    discount_percent (float): The discount percentage to apply to the price.


    Returns:

    float: The discounted price after applying the discount.


    Raises:

    ValueError: If the price or discount_percent is negative, or if discount_percent is not between 0 and 100.

    """


    # Input validation

    if price < 0 or discount_percent < 0:

        raise ValueError("Price and discount percent must be non-negative.")

    if discount_percent < 0 or discount_percent > 100:

        raise ValueError("Discount percent must be between 0 and 100.")


    # Calculate the discount amount

    discount_amount = (dis


**Function 2: Verify Code**

Check if the generated code meets all requirements. Your task: Improve this to catch more issues.

In [29]:
def verify_code(code, requirements):
    """
    Use a prompt to verify if code meets requirements.
    
    Separating verification into its own function:
    - Allows testing verification logic independently
    - Can swap between prompt-based and code-based verification
    - Makes it easy to add human-in-the-loop approval
    """
    # Improve this prompt to be more thorough in checking
    prompt = f"""Check if this code meets the requirements:

Code:
{code}

Requirements:
- Function name: {requirements['name']}
- Must have: {', '.join(requirements['must_have'])}

List any issues found (or write 'No issues'):
"""
    return generate_text(prompt, temperature=0, max_tokens=200)

In [30]:
print("\nVerification:")
verification = verify_code(code, requirements)
print(verification)

Device set to use cuda



Verification:
 No issues.


The provided code snippet meets the specified requirements. It defines a function named `calculate_discount` with a docstring that explains the purpose, parameters, and return value of the function. The function includes input validation to ensure that the `price` and `discount_percent` are non-negative and that `discount_percent` is within the range of 0 to 100. If the input validation fails, the function raises a `ValueError` with an appropriate message. The function also calculates the discounted price and returns it.
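
The generated code above stops mid-expression at `discount_amount = (dis`, so it has no return statement and would not even parse, yet the prompt-based verifier reported no issues. A code-based check catches this kind of failure deterministically. The sketch below uses Python's standard `ast` module. `verify_code_static` is a hypothetical helper, and treating any `raise` or `assert` as "input validation" is a simplifying assumption.

```python
import ast
import re

FENCE = "`" * 3  # markdown code fence, built here to keep this snippet readable


def verify_code_static(code, requirements):
    """Return a list of issues found by parsing the code (hypothetical helper)."""
    # Strip a surrounding markdown fence if the model added one
    pattern = FENCE + r"(?:python)?\s*\n(.*?)(?:" + FENCE + r"|\Z)"
    fenced = re.search(pattern, code, re.DOTALL)
    source = fenced.group(1) if fenced else code
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        # Truncated or malformed output ends up here
        return [f"Code does not parse: {exc.msg} (line {exc.lineno})"]

    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    func = next((f for f in funcs if f.name == requirements["name"]), None)
    if func is None:
        return [f"No function named {requirements['name']!r}"]

    issues = []
    params = [a.arg for a in func.args.args]
    if params != requirements["parameters"]:
        issues.append(f"Parameters are {params}, expected {requirements['parameters']}")
    if ast.get_docstring(func) is None:
        issues.append("Missing docstring")
    if not any(isinstance(n, ast.Return) for n in ast.walk(func)):
        issues.append("Missing return statement")
    # Heuristic (assumption): any raise or assert counts as input validation
    if not any(isinstance(n, (ast.Raise, ast.Assert)) for n in ast.walk(func)):
        issues.append("No input validation found (no raise or assert)")
    return issues


spec = {"name": "calculate_discount", "parameters": ["price", "discount_percent"]}
truncated = (
    FENCE + "python\n"
    "def calculate_discount(price, discount_percent):\n"
    '    """Calculate the discounted price."""\n'
    "    discount_amount = (dis"
)
print(verify_code_static(truncated, spec))
```

A static check like this cannot judge whether the validation logic is sensible, so it complements the prompt-based verifier rather than replacing it.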


**Function 3: Request Corrections**

If verification found issues, ask the model to fix them.

In [31]:
def request_correction(code, verification_feedback):
    """
    Request fixes based on verification results.
    
    Using a function here:
    - Can iterate the verify-correct loop multiple times
    - Can add retry logic with exponential backoff
    - Can track how many correction rounds were needed
    """
    prompt = f"""This code has issues:

Code:
{code}

Issues found:
{verification_feedback}

Fix these issues. Provide the complete corrected function:
"""
    return generate_text(prompt, temperature=0, max_tokens=300)

Now we run the verify-correct loop. Watch how the functions work together.

In [32]:
if 'no issues' not in verification.lower():
    print("\nIssues found - requesting corrections:")
    corrected_code = request_correction(code, verification)
    print(corrected_code)
    
    # Your task: Try verifying again to see if issues were fixed
    # second_check = verify_code(corrected_code, requirements)
    # print("\nSecond verification:")
    # print(second_check)
else:
    print("\nNo issues found!")


No issues found!
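
Running verify → correct more than once needs a stopping rule. One simple option, sketched below under stated assumptions, is to stop when the verifier reports no issues or after a fixed number of corrections. `refine_until_clean` is a hypothetical helper, and the `verify` and `correct` arguments here are stand-ins so the loop can run without a model.

```python
def refine_until_clean(code, verify, correct, max_rounds=3):
    """Alternate verify and correct until clean or max_rounds corrections are used."""
    for corrections in range(max_rounds + 1):
        feedback = verify(code)
        # Same string test the notebook uses to detect a clean verification
        if "no issues" in feedback.lower():
            return {"code": code, "clean": True, "corrections": corrections}
        if corrections == max_rounds:
            break  # budget used up; the last attempt has still been verified
        code = correct(code, feedback)
    return {"code": code, "clean": False, "corrections": max_rounds}


# Stand-in verifier and corrector (assumptions for illustration only)
def fake_verify(code):
    return "No issues" if "return" in code else "Missing return statement"


def fake_correct(code, feedback):
    return code + "\n    return price"


result = refine_until_clean("def f(price):\n    pass", fake_verify, fake_correct)
print(result["clean"], result["corrections"])  # True 1
```

In this notebook the loop could be called with `verify=lambda c: verify_code(c, requirements)` and `correct=request_correction`. The fixed budget matters because a prompt-based corrector can keep changing the code without ever satisfying the verifier.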


### Questions

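1. What issues did verification catch? (In the run above, the generated code was cut off at the 250-token limit, yet the verifier reported no issues.) How would you catch these with code-based checks instead of prompts?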

2. If you run verify → correct twice, does it improve quality further? When should you stop?

3. Why is it useful to have separate verify and correct functions instead of one "fix_code" function?

**About This Task:**
Multi-stage reasoning uses a 4-function pipeline: reason → reflect → revise → finalize. Each stage improves on the previous one.

#### Hard Task 4: Self-Reflection Pipeline

### Instructions

1. Study the 4-stage pipeline and how each function builds on previous outputs
2. Run to see how self-reflection catches initial mistakes
3. Improve the reflection prompt to ask better critical questions
4. Test what happens if you skip the reflection stage
5. Try a different decision-making problem

In [33]:
problem = """A company is choosing between two pricing strategies:

Strategy A: $10/month, expect 10,000 customers
Strategy B: $25/month, expect 5,000 customers

Customer support costs $2 per customer per month.
Which strategy maximizes profit?"""

**Stage 1: Initial Reasoning**

First pass at solving the problem.

In [34]:
def initial_reasoning(problem):
    """
    First attempt at solving the problem.
    
    Why separate this?
    - Can compare initial vs final reasoning
    - Can measure how often initial reasoning is wrong
    - Can use initial reasoning as a baseline for testing
    """
    prompt = f"{problem}\n\nLet's think step-by-step:"
    return generate_text(prompt, temperature=0, max_tokens=200)

In [35]:
print("Stage 1: Initial Reasoning")
initial = initial_reasoning(problem)
print(initial)

Device set to use cuda


Stage 1: Initial Reasoning
 Step 1: Calculate the revenue for each strategy

For Strategy A:
Revenue = Price per customer * Number of customers
Revenue = $10 * 10,000 = $100,000

For Strategy B:
Revenue = Price per customer * Number of customers
Revenue = $25 * 5,000 = $125,000

Step 2: Calculate the customer support costs for each strategy

For Strategy A:
Customer support cost = Cost per customer * Number of customers
Customer support cost = $2 * 10,000 = $20,000

For Strategy B:
Customer support cost = Cost per customer * Number of customers
Customer support cost = $2 * 5,000 = $10,000

Step 3: Calcul


**Stage 2: Self-Reflection**

Your task: Improve this prompt to ask more critical questions about the reasoning.

In [36]:
def self_reflect(problem, initial):
    """
    Critically examine the initial reasoning.
    
    This function demonstrates meta-cognition - thinking about thinking.
    Benefits:
    - Can test if reflection improves accuracy
    - Can tune reflection questions for different problem types
    - Can add domain-specific reflection guidelines

    `initial` is the text returned by initial_reasoning(); the shorter
    name avoids shadowing that function.
    """
    # Improve these critical questions
    prompt = f"""{problem}

Here was my initial reasoning:
{initial}

Now critically examine this:
- Did I consider all factors?
- Are calculations correct?
- Did I miss anything?

Critical reflection:
"""
    return generate_text(prompt, temperature=0, max_tokens=200)

In [37]:
print("\nStage 2: Self-Reflection")
reflection = self_reflect(problem, initial)
print(reflection)

Device set to use cuda



Stage 2: Self-Reflection
 Your initial reasoning is mostly correct, but there are a few things to consider to ensure a comprehensive analysis.

1. Calculate the profit for each strategy:

For Strategy A:
Profit = Revenue - Customer support cost
Profit = $100,000 - $20,000 = $80,000

For Strategy B:
Profit = Revenue - Customer support cost
Profit = $125,000 - $10,000 = $115,000

2. Compare the profits:

Strategy A profit: $80,000
Strategy B profit: $115,000

Based on the calculations, Strategy B maximizes profit with a profit of $115,000 compared to Strategy A's profit of $80,


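Notice that Stage 1 was cut off at the 200-token limit before it computed profit. The reflection did not find an error; it mostly completed the missing profit calculation.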
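
The figures in the model output can be checked directly, since the problem only involves revenue minus support cost. The sketch below uses just the numbers given in the problem statement. The names `strategies` and `monthly_profit` are illustrative.

```python
strategies = {
    "A": {"price": 10, "customers": 10_000},
    "B": {"price": 25, "customers": 5_000},
}
SUPPORT_COST_PER_CUSTOMER = 2  # dollars per customer per month


def monthly_profit(price, customers, support_cost=SUPPORT_COST_PER_CUSTOMER):
    """Profit = revenue - support cost, using only the figures in the problem."""
    return (price - support_cost) * customers


profits = {name: monthly_profit(**s) for name, s in strategies.items()}
best = max(profits, key=profits.get)
print(profits)  # {'A': 80000, 'B': 115000}
print(best)     # B
```

These values match the $80,000 and $115,000 in the reflection output, so the model's arithmetic is correct for the stated assumptions.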

**Stage 3: Revised Reasoning**

Incorporate insights from reflection to improve the answer.

In [38]:
def revised_reasoning(problem, initial, reflection):
    """
    Generate improved reasoning based on reflection.
    
    This function shows the value of iteration:
    - Initial attempt → reflection → revision is a powerful pattern
    - Keeping them separate means we can inspect each stage
    - Can measure improvement from initial to revised
    """
    prompt = f"""{problem}

My initial reasoning:
{initial}

After reflection:
{reflection}

Now provide improved reasoning that addresses the concerns:
"""
    return generate_text(prompt, temperature=0, max_tokens=400)

In [39]:
print("\nStage 3: Revised Reasoning")
revised = revised_reasoning(problem, initial, reflection)
print(revised)

Device set to use cuda



Stage 3: Revised Reasoning


OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 7.62 GiB of which 9.25 MiB is free. Including non-PyTorch memory, this process has 7.59 GiB memory in use. Of the allocated memory 7.44 GiB is allocated by PyTorch, and 30.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

**Note**: The recorded run stopped here with a CUDA out-of-memory error on a 7.62 GiB GPU, so the remaining cells (marked `In [None]`) have no output. Restart the runtime to release GPU memory before rerunning this stage.

**Stage 4: Final Answer with Confidence**

Produce a clean final answer with confidence assessment.

In [None]:
def final_answer_with_confidence(problem, revised):
    """
    Extract final answer and assess confidence.
    
    Separating this allows:
    - Consistent answer formatting across problems
    - Tracking confidence scores over time
    - Filtering low-confidence answers for human review

    `revised` is the text returned by revised_reasoning(); the shorter
    name avoids shadowing that function.
    """
    prompt = f"""{problem}

After careful analysis:
{revised}

Provide:
1. Final answer (which strategy?)
2. Key reasoning (2-3 sentences)
3. Confidence level (0-100%)
4. Main uncertainty
"""
    return generate_text(prompt, temperature=0, max_tokens=300)

In [None]:
print("\nStage 4: Final Answer")
final = final_answer_with_confidence(problem, revised)
print(final)

Once the pipeline runs to completion, check whether the final answer agrees with the profit figures from Stage 2 and whether the stated confidence level seems justified.

### Compare: With vs Without Reflection

In [None]:
# Your task: Run without reflection stage to see the difference
# skipped_reflection = final_answer_with_confidence(problem, initial)
# print("\nWithout reflection stage:")
# print(skipped_reflection)

### Questions

1. What mistake, if any, did self-reflection catch that initial reasoning missed?

2. Compare the final answer with vs without the reflection stage. Was reflection worth the extra LLM call?

3. Why is having 4 separate functions better than one big function that does all stages? List 3 specific advantages.