# Lab 3: Chain-of-Thought Implementation

**Week 2 - Prompt Engineering & LLM Basics**

**Provided by:** ADC ENGINEERING & CONSULTING LTD

## Objectives

In this lab, you will:
- Understand chain-of-thought (CoT) prompting technique
- Implement zero-shot and few-shot CoT
- Master step-by-step reasoning patterns
- Use CoT for complex problem solving
- Implement self-consistency techniques
- Build reasoning chains for different domains
- Evaluate and improve reasoning quality

## Prerequisites

- Completed Week 2 Labs 1 & 2
- Understanding of few-shot learning
- OpenAI API key configured (the setup cell reads `OPENAI_API_KEY` from the environment or a `.env` file)
- Python 3.9+

## Setup and Installation

In [None]:
# Install required packages
!pip install openai python-dotenv tiktoken pandas numpy --quiet

In [None]:
import os
import json
import re
from typing import List, Dict, Optional, Tuple, Any
from dataclasses import dataclass, field
from datetime import datetime
from collections import Counter

from openai import OpenAI
from dotenv import load_dotenv
import tiktoken

# Load environment variables
load_dotenv()

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

print("✓ Setup complete!")

## Part 1: Understanding Chain-of-Thought

Chain-of-thought prompting asks the model to write out its reasoning step by step before giving an answer. On multi-step problems this often improves accuracy, and it makes the result easier to inspect.

### Why CoT Works

1. **Breaks down complexity**: Divides hard problems into manageable steps
2. **Improves accuracy**: Explicit intermediate steps tend to reduce errors on multi-step problems
3. **Provides transparency**: Shows how the answer was reached
4. **Enables debugging**: Shows where the reasoning went wrong
5. **Handles multi-step problems**: Natural fit for complex reasoning

Let's compare standard prompting vs CoT:

In [None]:
def compare_standard_vs_cot():
    """Compare standard prompting with chain-of-thought."""
    
    problem = """
    A bakery sells cupcakes in boxes of 6 and 8. 
    Sarah wants to buy exactly 28 cupcakes for a party.
    What combination of boxes should she buy?
    """
    
    # Standard prompting
    standard_prompt = f"Problem: {problem}\n\nAnswer:"
    
    # Chain-of-thought prompting
    cot_prompt = f"""Problem: {problem}

Let's solve this step by step:
1. First, identify what we know and what we need to find
2. Consider possible combinations
3. Check which combination gives exactly 28
4. Verify the solution

Answer:"""
    
    prompts = [
        ("Standard Prompting", standard_prompt),
        ("Chain-of-Thought", cot_prompt)
    ]
    
    for label, prompt in prompts:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
            max_tokens=300
        )
        
        print(f"\n{'='*80}")
        print(f"{label}")
        print(f"{'='*80}")
        print(response.choices[0].message.content)
        print(f"\nTokens: {response.usage.total_tokens}")

compare_standard_vs_cot()
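When comparing the two responses, it helps to know the ground truth. The brute-force check below is a small sketch that enumerates every way to reach exactly 28 cupcakes with boxes of 6 and 8. The helper name `box_combinations` is introduced here for illustration and is not part of the lab code.

```python
from typing import List, Tuple

def box_combinations(total: int, small: int = 6, large: int = 8) -> List[Tuple[int, int]]:
    """Return all (small_boxes, large_boxes) pairs that sum to exactly `total`."""
    combos = []
    for n_small in range(total // small + 1):
        remainder = total - n_small * small
        # The rest must be filled entirely with large boxes
        if remainder % large == 0:
            combos.append((n_small, remainder // large))
    return combos

solutions = box_combinations(28)
print(solutions)  # [(2, 2)]
```

The only valid combination is two boxes of 6 and two boxes of 8, so any other model answer is wrong regardless of how convincing the reasoning looks.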

## Part 2: Zero-Shot Chain-of-Thought

Zero-shot CoT uses trigger phrases like "Let's think step by step" to activate reasoning without examples.
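A minimal offline sketch of the idea: the prompt is just the problem followed by the trigger phrase. The helper names and the extra "Final answer:" instruction are assumptions added for illustration and are not part of the `ZeroShotCoT` class in this lab. Asking for a fixed marker tends to make answer extraction more predictable than searching for words such as "so" or "thus".

```python
import re
from typing import Optional

def build_zero_shot_prompt(problem: str, trigger: str = "Let's think step by step.") -> str:
    """Append the trigger phrase and ask for a clearly marked final line."""
    return (
        f"{problem.strip()}\n\n{trigger}\n"
        "End with a line of the form 'Final answer: <answer>'."
    )

def extract_final_answer(response_text: str) -> Optional[str]:
    """Return the text after the last 'Final answer:' marker, if any."""
    matches = re.findall(
        r"^\s*final answer\s*:\s*(.+?)\s*$",
        response_text,
        flags=re.IGNORECASE | re.MULTILINE,
    )
    return matches[-1] if matches else None

# Hand-written stand-in for a model response (no API call is made here)
sample_response = "Step 1: 3 + 4 = 7\nStep 2: 7 * 2 = 14\nFinal answer: 14"
print(build_zero_shot_prompt("What is (3 + 4) * 2?"))
print(extract_final_answer(sample_response))
```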

In [None]:
class ZeroShotCoT:
    """
    Zero-shot chain-of-thought reasoning.
    """
    
    def __init__(self, model: str = "gpt-3.5-turbo"):
        """
        Initialize ZeroShotCoT.
        
        Args:
            model: OpenAI model to use
        """
        self.model = model
        self.trigger_phrases = [
            "Let's think step by step.",
            "Let's break this down:",
            "Let's solve this systematically:",
            "Let's approach this methodically:",
            "Let's reason through this:"
        ]
    
    def solve(
        self,
        problem: str,
        trigger: str = "Let's think step by step.",
        temperature: float = 0.3
    ) -> Dict[str, Any]:
        """
        Solve problem using zero-shot CoT.
        
        Args:
            problem: Problem statement
            trigger: Trigger phrase for CoT
            temperature: Sampling temperature
        
        Returns:
            Dict with reasoning and answer
        """
        prompt = f"{problem}\n\n{trigger}"
        
        response = client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_tokens=500
        )
        
        full_response = response.choices[0].message.content
        
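        # Try to extract the final answer. Use the last match, because
        # intermediate steps can also contain phrases such as "So: ...".
        # The \b keeps "so:" from matching inside words such as "also:".
        answer_patterns = [
            r"\b(?:final answer|answer|therefore|thus|so):\s*(.+?)(?:\n|$)",
            r"\b(?:the answer is)\s*(.+?)(?:\n|$)",
        ]
        
        final_answer = None
        for pattern in answer_patterns:
            matches = re.findall(pattern, full_response, re.IGNORECASE)
            if matches:
                final_answer = matches[-1].strip()
                break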
        
        return {
            "problem": problem,
            "trigger": trigger,
            "reasoning": full_response,
            "answer": final_answer,
            "tokens": response.usage.total_tokens
        }
    
    def batch_solve(
        self,
        problems: List[str],
        trigger: str = "Let's think step by step."
    ) -> List[Dict[str, Any]]:
        """Solve multiple problems."""
        return [self.solve(p, trigger) for p in problems]

# Test zero-shot CoT
zero_shot = ZeroShotCoT()

# Mathematical reasoning
math_problem = """
If a train travels 120 km in 2 hours, then stops for 30 minutes, 
then travels another 90 km in 1.5 hours, what is its average speed 
for the entire journey?
"""

result = zero_shot.solve(math_problem)

print("="*80)
print("ZERO-SHOT CHAIN-OF-THOUGHT")
print("="*80)
print(f"\nProblem:\n{result['problem']}")
print(f"\nTrigger: {result['trigger']}")
print(f"\nReasoning:\n{result['reasoning']}")
print(f"\nExtracted Answer: {result['answer']}")
print(f"\nTokens used: {result['tokens']}")

# Logical reasoning
logic_problem = """
All roses are flowers.
Some flowers fade quickly.
Does this mean some roses fade quickly?
"""

result = zero_shot.solve(
    logic_problem,
    trigger="Let's reason through this step by step:"
)

print("\n" + "="*80)
print("LOGICAL REASONING")
print("="*80)
print(f"\nProblem:\n{result['problem']}")
print(f"\nReasoning:\n{result['reasoning']}")
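To judge the model's output for the train problem, a quick reference calculation helps. This sketch assumes the 30-minute stop counts toward total journey time, which is the usual reading of "average speed for the entire journey".

```python
# Segments of the journey as (distance_km, duration_hours); the stop covers 0 km
segments = [(120, 2.0), (0, 0.5), (90, 1.5)]

total_distance = sum(distance for distance, _ in segments)
total_time = sum(duration for _, duration in segments)
average_speed = total_distance / total_time

print(f"{total_distance} km in {total_time} h -> {average_speed} km/h")
# 210 km in 4.0 h -> 52.5 km/h
```

For the rose syllogism, the conclusion does not follow: the flowers that fade quickly need not include any roses.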

### Exercise 2.1: Test Different Trigger Phrases

Experiment with different trigger phrases to see which works best:

In [None]:
# TODO: Test different trigger phrases

test_problem = """
A store offers a 20% discount on all items, then adds 8% sales tax.
If an item originally costs $50, what is the final price?
"""

triggers_to_test = [
    "Let's think step by step.",
    "Let's break this down:",
    "Let's solve this systematically:",
    # TODO: Add 3 more creative trigger phrases
    "",
    "",
    ""
]

# TODO: Test each trigger and compare results
# zero_shot = ZeroShotCoT()
# for trigger in triggers_to_test:
#     if trigger:
#         result = zero_shot.solve(test_problem, trigger)
#         print(f"\nTrigger: {trigger}")
#         print(f"Answer: {result['answer']}")
#         print(f"Tokens: {result['tokens']}")
#         print("-"*80)
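A reference value makes it easier to compare triggers on correctness rather than on style alone. The sketch below assumes the discount is applied first and the tax is then applied to the discounted price, as the problem wording implies.

```python
from decimal import Decimal, ROUND_HALF_UP

original_price = Decimal("50.00")
discount_rate = Decimal("0.20")
tax_rate = Decimal("0.08")

# Apply the 20% discount, then add 8% tax on the discounted price
discounted = original_price * (1 - discount_rate)
final_price = (discounted * (1 + tax_rate)).quantize(
    Decimal("0.01"), rounding=ROUND_HALF_UP
)
print(final_price)  # 43.20
```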

## Part 3: Few-Shot Chain-of-Thought

Few-shot CoT provides examples of step-by-step reasoning to guide the model.
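The `FewShotCoT` class in this part packs all examples into a single user message. An alternative layout, sketched here as an assumption rather than as part of the lab code, presents each example as a user/assistant turn so the model sees the reasoning format as prior assistant output. The function name `build_few_shot_messages` and the `(problem, reasoning, answer)` tuple layout are hypothetical.

```python
from typing import Dict, List, Tuple

def build_few_shot_messages(
    examples: List[Tuple[str, str, str]],
    problem: str,
) -> List[Dict[str, str]]:
    """Turn (problem, reasoning, answer) examples into chat messages."""
    messages: List[Dict[str, str]] = []
    for ex_problem, ex_reasoning, ex_answer in examples:
        messages.append({"role": "user", "content": f"Problem: {ex_problem}"})
        messages.append({
            "role": "assistant",
            "content": f"Reasoning: {ex_reasoning.strip()}\nAnswer: {ex_answer}",
        })
    # The real problem goes last, so the model continues the established pattern
    messages.append({"role": "user", "content": f"Problem: {problem}"})
    return messages

examples = [
    ("John has 5 apples. He buys 3 more. How many does he have?",
     "Start with 5 apples, add 3 more: 5 + 3 = 8.",
     "8 apples"),
]
messages = build_few_shot_messages(
    examples, "A shelf holds 12 books. 4 are removed. How many remain?"
)
for message in messages:
    print(message["role"], "->", message["content"])
```

The resulting list has the same role/content shape as the `messages` argument used in the API calls throughout this lab.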

In [None]:
@dataclass
class CoTExample:
    """Chain-of-thought example with reasoning steps."""
    problem: str
    reasoning: str
    answer: str

class FewShotCoT:
    """
    Few-shot chain-of-thought reasoning.
    """
    
    def __init__(self, model: str = "gpt-3.5-turbo"):
        """
        Initialize FewShotCoT.
        
        Args:
            model: OpenAI model to use
        """
        self.model = model
        self.examples: List[CoTExample] = []
    
    def add_example(self, problem: str, reasoning: str, answer: str):
        """
        Add a chain-of-thought example.
        
        Args:
            problem: Problem statement
            reasoning: Step-by-step reasoning
            answer: Final answer
        """
        self.examples.append(CoTExample(problem, reasoning, answer))
    
    def build_prompt(self, problem: str) -> str:
        """
        Build prompt with examples.
        
        Args:
            problem: Problem to solve
        
        Returns:
            Complete prompt with examples
        """
        prompt_parts = []
        
        # Add examples
        for i, example in enumerate(self.examples, 1):
            prompt_parts.append(f"Example {i}:")
            prompt_parts.append(f"Problem: {example.problem}")
            prompt_parts.append(f"Reasoning: {example.reasoning}")
            prompt_parts.append(f"Answer: {example.answer}")
            prompt_parts.append("")
        
        # Add the actual problem
        prompt_parts.append("Now solve this problem:")
        prompt_parts.append(f"Problem: {problem}")
        prompt_parts.append("Reasoning:")
        
        return "\n".join(prompt_parts)
    
    def solve(self, problem: str, temperature: float = 0.3) -> Dict[str, Any]:
        """
        Solve problem using few-shot CoT.
        
        Args:
            problem: Problem statement
            temperature: Sampling temperature
        
        Returns:
            Dict with reasoning and answer
        """
        prompt = self.build_prompt(problem)
        
        response = client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_tokens=500
        )
        
        full_response = response.choices[0].message.content
        
        # Extract answer
        answer_match = re.search(
            r"(?:Answer|Therefore|Thus|So):\s*(.+?)(?:\n|$)",
            full_response,
            re.IGNORECASE
        )
        answer = answer_match.group(1).strip() if answer_match else None
        
        return {
            "problem": problem,
            "prompt": prompt,
            "reasoning": full_response,
            "answer": answer,
            "tokens": response.usage.total_tokens,
            "num_examples": len(self.examples)
        }

# Create few-shot CoT for math word problems
few_shot = FewShotCoT()

# Add examples with reasoning
few_shot.add_example(
    problem="John has 5 apples. He buys 3 more. How many does he have?",
    reasoning="""
Step 1: Identify what we know
- John starts with 5 apples
- He buys 3 more apples

Step 2: Determine the operation needed
- We need to add the new apples to the original amount
- Operation: addition

Step 3: Calculate
- 5 + 3 = 8

Step 4: State the answer clearly
- John has 8 apples
    """,
    answer="8 apples"
)

few_shot.add_example(
    problem="A box contains 24 chocolates. If Sarah eats 1/3 of them, how many are left?",
    reasoning="""
Step 1: Identify what we know
- Total chocolates: 24
- Sarah eats: 1/3 of the total

Step 2: Calculate how many Sarah eats
- 1/3 of 24 = 24 ÷ 3 = 8 chocolates

Step 3: Calculate how many are left
- Remaining = Total - Eaten
- Remaining = 24 - 8 = 16

Step 4: State the answer
- 16 chocolates are left
    """,
    answer="16 chocolates"
)

# Solve a new problem
new_problem = """
A pizza is cut into 8 equal slices. Tom eats 2 slices, Jerry eats 3 slices.
What fraction of the pizza is left?
"""

result = few_shot.solve(new_problem)

print("="*80)
print("FEW-SHOT CHAIN-OF-THOUGHT")
print("="*80)
print(f"\nProblem:\n{result['problem']}")
print(f"\nUsing {result['num_examples']} examples")
print(f"\nReasoning:\n{result['reasoning']}")
print(f"\nAnswer: {result['answer']}")
print(f"\nTokens: {result['tokens']}")

### Exercise 3.1: Build Domain-Specific CoT

Create few-shot CoT examples for a specific domain:

In [None]:
# TODO: Build few-shot CoT for logical reasoning

logic_cot = FewShotCoT()

# TODO: Add at least 3 examples of logical reasoning problems with step-by-step solutions
# Example domains: syllogisms, conditional statements, set theory, probability

# Example structure:
# logic_cot.add_example(
#     problem="If all A are B, and all B are C, are all A also C?",
#     reasoning="""
#     Step 1: Identify the logical statements
#     ...
#     """,
#     answer="Yes, all A are C"
# )

# TODO: Test with a new logical reasoning problem
# test_problem = "..."
# result = logic_cot.solve(test_problem)
# print(result['reasoning'])

## Part 4: Complex Multi-Step Reasoning

Use CoT for problems that need several reasoning steps by giving the model an explicit, ordered list of steps to follow.

In [None]:
class ComplexCoTSolver:
    """
    Solver for complex multi-step problems.
    """
    
    def __init__(self, model: str = "gpt-3.5-turbo"):
        """
        Initialize ComplexCoTSolver.
        
        Args:
            model: OpenAI model to use
        """
        self.model = model
    
    def solve_with_structure(
        self,
        problem: str,
        reasoning_structure: List[str]
    ) -> Dict[str, Any]:
        """
        Solve problem with explicit reasoning structure.
        
        Args:
            problem: Problem statement
            reasoning_structure: List of reasoning steps to follow
        
        Returns:
            Solution with structured reasoning
        """
        # Build structured prompt
        prompt_parts = [
            f"Problem: {problem}",
            "",
            "Solve this by following these steps:"
        ]
        
        for i, step in enumerate(reasoning_structure, 1):
            prompt_parts.append(f"{i}. {step}")
        
        prompt_parts.append("")
        prompt_parts.append("Provide detailed reasoning for each step:")
        
        prompt = "\n".join(prompt_parts)
        
        response = client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
            max_tokens=800
        )
        
        return {
            "problem": problem,
            "structure": reasoning_structure,
            "reasoning": response.choices[0].message.content,
            "tokens": response.usage.total_tokens
        }
    
    def solve_planning_problem(self, problem: str) -> Dict[str, Any]:
        """Solve planning and scheduling problems."""
        structure = [
            "List all constraints and requirements",
            "Identify dependencies between tasks",
            "Determine critical path",
            "Allocate resources",
            "Create timeline",
            "Identify potential risks"
        ]
        return self.solve_with_structure(problem, structure)
    
    def solve_decision_problem(self, problem: str) -> Dict[str, Any]:
        """Solve decision-making problems."""
        structure = [
            "Define the decision to be made",
            "List all available options",
            "Identify evaluation criteria",
            "Analyze pros and cons of each option",
            "Consider constraints and trade-offs",
            "Make recommendation with justification"
        ]
        return self.solve_with_structure(problem, structure)
    
    def solve_debugging_problem(self, problem: str) -> Dict[str, Any]:
        """Solve code debugging problems."""
        structure = [
            "Understand the expected behavior",
            "Identify the actual behavior (error/bug)",
            "Analyze the code logic",
            "Identify potential causes",
            "Test hypotheses",
            "Propose solution and explain why it works"
        ]
        return self.solve_with_structure(problem, structure)

# Test with different problem types
solver = ComplexCoTSolver()

# Planning problem
planning_problem = """
You need to organize a 2-day conference with:
- 3 keynote speakers (1 hour each)
- 6 breakout sessions (45 min each)
- 4 networking breaks (30 min each)
- Opening and closing ceremonies (30 min each)

Working hours: 9 AM - 5 PM each day
Lunch: 12 PM - 1 PM each day

Create an optimal schedule.
"""

result = solver.solve_planning_problem(planning_problem)

print("="*80)
print("COMPLEX MULTI-STEP REASONING: Planning")
print("="*80)
print(f"\nProblem:\n{planning_problem}")
print(f"\nReasoning Structure:")
for i, step in enumerate(result['structure'], 1):
    print(f"{i}. {step}")
print(f"\nDetailed Reasoning:\n{result['reasoning']}")

# Decision problem
decision_problem = """
A startup needs to choose a tech stack for their mobile app:

Option A: Native development (Swift for iOS, Kotlin for Android)
- Best performance
- Full platform features
- Requires 2 separate teams
- Higher cost
- Longer development time

Option B: Cross-platform (React Native)
- Single codebase
- Faster development
- Lower cost
- Some performance trade-offs
- Limited access to latest platform features

Option C: Progressive Web App (PWA)
- Works everywhere
- Lowest cost
- No app store presence
- Limited offline capabilities
- Less native feel

Budget: $150k, Timeline: 6 months, Team: 3 developers

Which option should they choose?
"""

result = solver.solve_decision_problem(decision_problem)

print("\n" + "="*80)
print("COMPLEX MULTI-STEP REASONING: Decision Making")
print("="*80)
print(f"\nDetailed Reasoning:\n{result['reasoning']}")
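Before trusting a generated schedule for the conference problem, it is worth checking that the sessions fit at all. The sketch below is a rough feasibility check under one assumption that the problem does not state: every session runs in a single track, with nothing in parallel.

```python
# Durations in minutes, taken from the planning problem statement
required = {
    "keynotes": 3 * 60,
    "breakout sessions": 6 * 45,
    "networking breaks": 4 * 30,
    "ceremonies": 2 * 30,
}
# Two days of 9 AM - 5 PM (8 hours) minus a 1-hour lunch each day
available = 2 * (8 - 1) * 60

total_required = sum(required.values())
slack = available - total_required
print(f"Required: {total_required} min, available: {available} min, slack: {slack} min")
# Required: 630 min, available: 840 min, slack: 210 min
```

A schedule from the model that overruns 5 PM or drops a session can therefore be rejected, since a single-track plan fits with 210 minutes to spare.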

### Exercise 4.1: Solve Complex Problems

Use structured CoT to solve these complex problems:

In [None]:
# TODO: Solve these complex problems with structured reasoning

solver = ComplexCoTSolver()

problems = [
    {
        "type": "optimization",
        "problem": """
        A company has $100,000 to invest in marketing across 4 channels:
        - Social Media: $5 per lead, 30% conversion rate
        - Google Ads: $8 per lead, 40% conversion rate  
        - Email: $2 per lead, 15% conversion rate
        - Content Marketing: $10 per lead, 50% conversion rate
        
        Each channel has a maximum capacity:
        - Social Media: 5,000 leads max
        - Google Ads: 3,000 leads max
        - Email: 10,000 leads max
        - Content Marketing: 2,000 leads max
        
        How should they allocate the budget to maximize conversions?
        """,
        "structure": [
            # TODO: Define reasoning steps for optimization
        ]
    },
    {
        "type": "system_design",
        "problem": """
        Design a URL shortening service like bit.ly that needs to:
        - Handle 1 million new URLs per day
        - Serve 100 million redirects per day
        - Store URLs for 5 years
        - Provide custom aliases
        - Track click analytics
        - Be highly available (99.9% uptime)
        
        Propose the architecture and explain design decisions.
        """,
        "structure": [
            # TODO: Define reasoning steps for system design
        ]
    }
]

# TODO: Solve each problem
# for problem_info in problems:
#     result = solver.solve_with_structure(
#         problem_info['problem'],
#         problem_info['structure']
#     )
#     print(f"\nProblem Type: {problem_info['type']}")
#     print(result['reasoning'])
#     print("="*80)

## Part 5: Self-Consistency

Self-consistency samples several reasoning paths for the same problem at a higher temperature and takes the most frequent final answer as the consensus (a majority vote).
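The voting step can be sketched without any API calls. In this example the sampled answers are hand-written stand-ins for model output, and the helper names `first_number` and `majority_vote` are hypothetical. Reducing each answer to its first number is an assumption that suits numeric problems; it lets "9", "9 sheep", and "There are 9 sheep left." count as the same vote.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

def first_number(answer: str) -> Optional[str]:
    """Return the first number found in the answer, or None."""
    match = re.search(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    return match.group() if match else None

def majority_vote(answers: List[str]) -> Tuple[Optional[str], float]:
    """Return (most common answer, share of votes it received)."""
    # Fall back to lowercased text when an answer contains no number
    keys = [first_number(a) or a.strip().lower() for a in answers]
    if not keys:
        return None, 0.0
    winner, count = Counter(keys).most_common(1)[0]
    return winner, count / len(keys)

# Hand-written stand-ins for five sampled answers to the sheep problem
sampled = ["9 sheep", "There are 9 sheep left.", "8", "9", "17 sheep"]
winner, agreement = majority_vote(sampled)
print(winner, agreement)  # 9 0.6
```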

In [None]:
class SelfConsistencyCoT:
    """
    Self-consistency chain-of-thought.
    
    Generates multiple reasoning paths and aggregates answers.
    """
    
    def __init__(self, model: str = "gpt-3.5-turbo"):
        """
        Initialize SelfConsistencyCoT.
        
        Args:
            model: OpenAI model to use
        """
        self.model = model
    
    def solve_multiple_paths(
        self,
        problem: str,
        num_paths: int = 5,
        temperature: float = 0.7
    ) -> Dict[str, Any]:
        """
        Solve problem using multiple reasoning paths.
        
        Args:
            problem: Problem statement
            num_paths: Number of reasoning paths to generate
            temperature: Higher temperature for diverse reasoning
        
        Returns:
            Dict with all paths and consensus answer
        """
        prompt = f"{problem}\n\nLet's think step by step."
        
        paths = []
        answers = []
        
        for i in range(num_paths):
            response = client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                temperature=temperature,
                max_tokens=400
            )
            
            reasoning = response.choices[0].message.content
            
            # Extract answer (\b avoids matching inside words such as "also:")
            answer_patterns = [
                r"\b(?:final answer|answer|therefore|thus|so):\s*([^\n]+)",
                r"\b(?:the answer is)\s*([^\n]+)"
            ]
            
            answer = None
            for pattern in answer_patterns:
                match = re.search(pattern, reasoning, re.IGNORECASE)
                if match:
                    answer = match.group(1).strip()
                    break
            
            if answer:
                # Normalize answer for comparison
                normalized = self._normalize_answer(answer)
                answers.append(normalized)
                paths.append({
                    "path_number": i + 1,
                    "reasoning": reasoning,
                    "answer": answer,
                    "normalized_answer": normalized
                })
        
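        # Find consensus answer. If no answer could be extracted from any
        # path, report no consensus instead of raising an IndexError.
        answer_counts = Counter(answers)
        if answer_counts:
            consensus_answer, consensus_count = answer_counts.most_common(1)[0]
        else:
            consensus_answer, consensus_count = None, 0
        confidence = consensus_count / num_paths if num_paths else 0.0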
        
        return {
            "problem": problem,
            "num_paths": num_paths,
            "paths": paths,
            "all_answers": answers,
            "consensus_answer": consensus_answer,
            "consensus_count": consensus_count,
            "confidence": confidence,
            "agreement": f"{consensus_count}/{num_paths}"
        }
    
    def _normalize_answer(self, answer: str) -> str:
        """Normalize answer for comparison."""
        # Lowercase, drop punctuation except decimal points (so "2.5" != "25"),
        # collapse whitespace, and strip a trailing sentence period
        normalized = answer.lower().strip()
        normalized = re.sub(r'[^\w\s.]', '', normalized)
        normalized = re.sub(r'\s+', ' ', normalized)
        return normalized.rstrip('. ')
    
    def display_results(self, result: Dict[str, Any]):
        """Display results in readable format."""
        print(f"\n{'='*80}")
        print("SELF-CONSISTENCY RESULTS")
        print(f"{'='*80}")
        print(f"\nProblem:\n{result['problem']}")
        print(f"\nGenerated {result['num_paths']} reasoning paths:")
        
        for path in result['paths']:
            print(f"\n{'-'*80}")
            print(f"Path {path['path_number']}:")
            print(f"{'-'*80}")
            print(f"Reasoning:\n{path['reasoning'][:200]}...")
            print(f"\nAnswer: {path['answer']}")
        
        print(f"\n{'='*80}")
        print("CONSENSUS")
        print(f"{'='*80}")
        print(f"Consensus Answer: {result['consensus_answer']}")
        print(f"Agreement: {result['agreement']} paths")
        print(f"Confidence: {result['confidence']:.1%}")
        
        if result['confidence'] < 0.6:
            print("\n⚠️  Low confidence - answers vary significantly")
        elif result['confidence'] < 0.8:
            print("\n✓ Moderate confidence - some variation in answers")
        else:
            print("\n✓✓ High confidence - strong agreement across paths")

# Test self-consistency
self_consistency = SelfConsistencyCoT()

# Math problem
math_problem = """
A farmer has 17 sheep. All but 9 die. How many sheep are left?
"""

result = self_consistency.solve_multiple_paths(math_problem, num_paths=5)
self_consistency.display_results(result)

# Logical reasoning
logic_problem = """
If it takes 5 machines 5 minutes to make 5 widgets, 
how long would it take 100 machines to make 100 widgets?
"""

result = self_consistency.solve_multiple_paths(logic_problem, num_paths=5)
self_consistency.display_results(result)

### Exercise 5.1: Test Self-Consistency

Test self-consistency on these challenging problems:

In [None]:
# TODO: Test self-consistency on challenging problems

self_consistency = SelfConsistencyCoT()

challenging_problems = [
    """
    A bat and a ball cost $1.10 in total.
    The bat costs $1.00 more than the ball.
    How much does the ball cost?
    """,
    
    """
    In a lake, there is a patch of lily pads.
    Every day, the patch doubles in size.
    If it takes 48 days for the patch to cover the entire lake,
    how long would it take for the patch to cover half of the lake?
    """,
    
    """
    You have a 3-gallon jug and a 5-gallon jug.
    How can you measure exactly 4 gallons of water?
    """
]

# TODO: Test each problem and analyze confidence levels
# for i, problem in enumerate(challenging_problems, 1):
#     print(f"\n\nCHALLENGING PROBLEM {i}")
#     result = self_consistency.solve_multiple_paths(problem, num_paths=5)
#     self_consistency.display_results(result)
#     
#     # TODO: If confidence is low, try with more paths
#     if result['confidence'] < 0.6:
#         print("\nLow confidence detected. Trying with 10 paths...")
#         # result = self_consistency.solve_multiple_paths(problem, num_paths=10)

## Part 6: Reasoning Chain Evaluation

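Evaluate the quality of reasoning chains by asking the model to score them against a rubric and, when a reference answer is available, by checking the final answer against it.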
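A deterministic check can complement the model-graded rubric. The `_compare_answers` method in this part uses substring matching, which would accept "50 miles" against a correct answer of "150 miles". The sketch below compares the numeric values instead; the helper names are hypothetical, and it deliberately ignores units, so it is only suitable when both answers use the same unit.

```python
import math
import re
from typing import Optional

def extract_number(text: str) -> Optional[float]:
    """Return the first number in `text`, ignoring thousands separators."""
    match = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None

def numeric_match(answer: str, correct: str, rel_tol: float = 1e-6) -> bool:
    """True if both strings contain a number and the numbers agree."""
    answer_value = extract_number(answer)
    correct_value = extract_number(correct)
    if answer_value is None or correct_value is None:
        return False
    return math.isclose(answer_value, correct_value, rel_tol=rel_tol)

print(numeric_match("150 miles", "150 miles"))   # True
print(numeric_match("50 miles", "150 miles"))    # False
print(numeric_match("62.5 miles", "150 miles"))  # False
```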

In [None]:
class ReasoningEvaluator:
    """
    Evaluate quality of chain-of-thought reasoning.
    """
    
    def __init__(self, model: str = "gpt-3.5-turbo"):
        """
        Initialize ReasoningEvaluator.
        
        Args:
            model: OpenAI model to use
        """
        self.model = model
    
    def evaluate_reasoning(
        self,
        problem: str,
        reasoning: str,
        answer: str,
        correct_answer: Optional[str] = None
    ) -> Dict[str, Any]:
        """
        Evaluate reasoning quality.
        
        Args:
            problem: Original problem
            reasoning: Chain-of-thought reasoning
            answer: Proposed answer
            correct_answer: Known correct answer (optional)
        
        Returns:
            Evaluation results
        """
        evaluation_prompt = f"""
Evaluate this chain-of-thought reasoning:

Problem: {problem}

Reasoning: {reasoning}

Answer: {answer}

Evaluate on these criteria (rate 1-5 for each):
1. Completeness: Are all necessary steps included?
2. Clarity: Is each step clearly explained?
3. Logic: Is the reasoning logically sound?
4. Correctness: Does it lead to the right answer?
5. Efficiency: Is the solution path efficient?

Provide:
- Score for each criterion (1-5)
- Overall quality (1-5)
- Strengths
- Weaknesses
- Suggestions for improvement

Format as JSON.
"""
        
        response = client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": evaluation_prompt}],
            temperature=0.3,
            max_tokens=500
        )
        
        evaluation_text = response.choices[0].message.content
        
        # Try to extract JSON
        try:
            json_match = re.search(r'\{.*\}', evaluation_text, re.DOTALL)
            if json_match:
                evaluation = json.loads(json_match.group())
            else:
                evaluation = {"raw_evaluation": evaluation_text}
        except json.JSONDecodeError:
            evaluation = {"raw_evaluation": evaluation_text}
        
        result = {
            "problem": problem,
            "reasoning": reasoning,
            "answer": answer,
            "evaluation": evaluation,
            "correct_answer": correct_answer
        }
        
        if correct_answer:
            result["is_correct"] = self._compare_answers(answer, correct_answer)
        
        return result
    
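    def _compare_answers(self, answer: str, correct: str) -> bool:
        """Compare if answers match (fuzzy, substring-based)."""
        def normalize(text: str) -> str:
            # Keep decimal points so "2.5" and "25" stay distinct
            text = re.sub(r'[^\w\s.]', '', text.lower())
            return text.strip().rstrip('. ')

        answer_norm = normalize(answer)
        correct_norm = normalize(correct)
        # An empty string is a substring of everything; treat it as no match
        if not answer_norm or not correct_norm:
            return False
        return answer_norm in correct_norm or correct_norm in answer_norm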
    
    def display_evaluation(self, result: Dict[str, Any]):
        """Display evaluation results."""
        print(f"\n{'='*80}")
        print("REASONING EVALUATION")
        print(f"{'='*80}")
        print(f"\nProblem:\n{result['problem']}")
        print(f"\nAnswer: {result['answer']}")
        
        if result.get('correct_answer'):
            print(f"Correct Answer: {result['correct_answer']}")
            if result.get('is_correct'):
                print("✓ Answer is correct")
            else:
                print("✗ Answer is incorrect")
        
        print(f"\nEvaluation:")
        
        if isinstance(result['evaluation'], dict) and 'raw_evaluation' not in result['evaluation']:
            for key, value in result['evaluation'].items():
                print(f"  {key}: {value}")
        else:
            print(result['evaluation'].get('raw_evaluation', result['evaluation']))

# Test reasoning evaluation
evaluator = ReasoningEvaluator()

# Example reasoning to evaluate
problem = "If a car travels 60 miles in 1 hour, how far will it travel in 2.5 hours at the same speed?"

reasoning = """
Step 1: Identify the speed
The car travels 60 miles in 1 hour, so speed = 60 mph

Step 2: Multiply speed by time
Distance = Speed × Time
Distance = 60 mph × 2.5 hours

Step 3: Calculate
60 × 2.5 = 150

Therefore, the car will travel 150 miles.
"""

answer = "150 miles"
correct_answer = "150 miles"

result = evaluator.evaluate_reasoning(problem, reasoning, answer, correct_answer)
evaluator.display_evaluation(result)

# Example with flawed reasoning
flawed_reasoning = """
Step 1: The car goes 60 miles per hour
Step 2: For 2.5 hours, we add 60 + 2.5
Step 3: 60 + 2.5 = 62.5

Answer: 62.5 miles
"""

flawed_result = evaluator.evaluate_reasoning(
    problem,
    flawed_reasoning,
    "62.5 miles",
    correct_answer
)
evaluator.display_evaluation(flawed_result)

### Exercise 6.1: Evaluate Your Reasoning

Evaluate reasoning chains you created earlier:

In [None]:
# TODO: Evaluate reasoning from previous exercises

evaluator = ReasoningEvaluator()

# TODO: Take a problem you solved earlier and evaluate it
# Example:
# problem = "..."
# reasoning = "..."
# answer = "..."
# correct_answer = "..."
# 
# result = evaluator.evaluate_reasoning(problem, reasoning, answer, correct_answer)
# evaluator.display_evaluation(result)

# TODO: Compare zero-shot vs few-shot reasoning quality
# Generate both approaches for the same problem and evaluate each

## Challenge Projects

### Challenge 1: Adaptive CoT System

Build a system that adapts reasoning strategy based on problem type:

In [None]:
class AdaptiveCoTSystem:
    """
    Automatically selects best CoT strategy for each problem type.
    
    TODO: Implement:
    1. Problem classification (math, logic, planning, coding, etc.)
    2. Strategy selection (zero-shot, few-shot, structured, self-consistency)
    3. Dynamic example retrieval for few-shot
    4. Confidence-based fallback (if low confidence, try different strategy)
    5. Performance tracking and learning
    """
    
    def __init__(self):
        self.zero_shot = ZeroShotCoT()
        self.few_shot = FewShotCoT()
        self.complex_solver = ComplexCoTSolver()
        self.self_consistency = SelfConsistencyCoT()
        self.evaluator = ReasoningEvaluator()
        
        # TODO: Add example libraries for different domains
        self.example_libraries = {}
    
    def classify_problem(self, problem: str) -> str:
        """
        Classify problem type.
        
        TODO: Implement classification logic
        Returns: 'math', 'logic', 'planning', 'coding', 'decision', etc.
        """
        pass
    
    def select_strategy(self, problem_type: str, complexity: str) -> str:
        """
        Select the best strategy for the given problem type and complexity.
        
        TODO: Implement strategy selection
        Returns: 'zero-shot', 'few-shot', 'structured', 'self-consistency'
        """
        pass
    
    def solve(self, problem: str) -> Dict[str, Any]:
        """
        Solve problem with adaptive strategy.
        
        TODO: Implement adaptive solving:
        1. Classify problem
        2. Select initial strategy
        3. Attempt solution
        4. Evaluate confidence
        5. If low confidence, try alternative strategy
        6. Return best result
        """
        pass

# Usage example:
# adaptive = AdaptiveCoTSystem()
# result = adaptive.solve("Your problem here")
# print(f"Strategy used: {result['strategy']}")
# print(f"Answer: {result['answer']}")
# print(f"Confidence: {result['confidence']}")
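
As a possible starting point for `classify_problem` and `select_strategy`, the sketch below uses plain keyword matching and a lookup table. The keyword lists, the `"general"` fallback label, the `"high"` complexity value, and the type-to-strategy mapping are all illustrative assumptions, not part of the lab's reference solution; an LLM-based classifier would likely be more robust.

```python
import re
from typing import Dict, List

# Hypothetical keyword lists (assumption): tune these or replace with an LLM call.
KEYWORDS: Dict[str, List[str]] = {
    "math": ["calculate", "how many", "sum", "percent", "average"],
    "coding": ["function", "bug", "python", "algorithm", "code"],
    "planning": ["schedule", "plan", "deadline", "itinerary"],
    "logic": ["implies", "true or false", "deduce", "therefore"],
}


def classify_problem_by_keywords(problem: str) -> str:
    """Return the label whose keywords appear most often, or 'general'."""
    text = problem.lower()
    scores = {
        label: sum(1 for keyword in keywords if keyword in text)
        for label, keywords in KEYWORDS.items()
    }
    # Treat any digit as a weak signal for a math problem.
    if re.search(r"\d", text):
        scores["math"] += 1
    best_label, best_score = max(scores.items(), key=lambda item: item[1])
    return best_label if best_score > 0 else "general"


def select_strategy_sketch(problem_type: str, complexity: str) -> str:
    """Map (type, complexity) to a strategy name; the mapping is an assumption."""
    if complexity == "high":
        return "self-consistency"
    table = {
        "math": "few-shot",
        "coding": "few-shot",
        "logic": "structured",
        "planning": "structured",
    }
    return table.get(problem_type, "zero-shot")


problem_type = classify_problem_by_keywords("Calculate 15 percent of 240")
print(problem_type, select_strategy_sketch(problem_type, "low"))
```

Substring matching is crude (for example, "plan" also matches "explanation"), so treat the result as a first guess that the confidence-based fallback can override.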

### Challenge 2: Interactive Reasoning Debugger

Create a tool to debug and improve reasoning chains:

In [None]:
class ReasoningDebugger:
    """
    Interactive tool to debug and improve reasoning chains.
    
    TODO: Implement:
    1. Step-by-step reasoning breakdown
    2. Identify logical flaws in each step
    3. Suggest improvements for weak steps
    4. Test alternative reasoning paths
    5. Compare multiple approaches side-by-side
    6. Generate improved reasoning chain
    """
    
    def __init__(self):
        self.evaluator = ReasoningEvaluator()
    
    def analyze_step(self, step: str, context: Dict) -> Dict[str, Any]:
        """
        Analyze individual reasoning step.
        
        TODO: Check for:
        - Logical validity
        - Missing information
        - Assumptions made
        - Potential errors
        """
        pass
    
    def identify_weak_steps(self, reasoning: str) -> List[Dict]:
        """
        Identify weak or problematic steps.
        
        TODO: Return list of issues with line numbers
        """
        pass
    
    def suggest_improvements(self, reasoning: str) -> str:
        """
        Generate improved reasoning chain.
        
        TODO: Fix identified issues
        """
        pass
    
    def compare_approaches(self, problem: str, approaches: List[str]) -> Dict:
        """
        Compare multiple reasoning approaches.
        
        TODO: Evaluate and rank different approaches
        """
        pass

# Usage example:
# debugger = ReasoningDebugger()
# issues = debugger.identify_weak_steps(reasoning_chain)
# improved = debugger.suggest_improvements(reasoning_chain)
# print(f"Found {len(issues)} issues")
# print(f"Improved reasoning:\n{improved}")
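
One way to approach `identify_weak_steps` is to split the chain into steps and apply cheap heuristics before asking a model for deeper analysis. The sketch below is a minimal version under stated assumptions: each step sits on its own line, and the hedge-word list, the `min_words` threshold, and the function names are hypothetical choices for illustration.

```python
import re
from typing import Dict, List

# Strips prefixes such as "Step 1:" or "2)" from the start of a line.
STEP_PREFIX = re.compile(r"^\s*(?:step\s*\d+\s*[:.)]|\d+[.)](?=\s))\s*", re.IGNORECASE)

# Hypothetical list of words that often mark an unjustified leap (assumption).
HEDGE_WORDS = ("probably", "obviously", "somehow", "i guess", "roughly")


def split_steps(reasoning: str) -> List[str]:
    """Return non-empty lines of the chain with their step prefixes removed."""
    steps = []
    for line in reasoning.splitlines():
        line = line.strip()
        if line:
            steps.append(STEP_PREFIX.sub("", line))
    return steps


def flag_weak_steps(reasoning: str, min_words: int = 4) -> List[Dict]:
    """Flag steps that are very short or that lean on hedge words."""
    issues = []
    for number, step in enumerate(split_steps(reasoning), start=1):
        if len(step.split()) < min_words:
            issues.append({"step": number, "issue": "too short to justify", "text": step})
        found = [word for word in HEDGE_WORDS if word in step.lower()]
        if found:
            issues.append({"step": number, "issue": f"hedge words: {found}", "text": step})
    return issues


chain = """Step 1: Speed is 50 mph and time is 2.5 hours.
Step 2: Obviously the answer is bigger.
Step 3: So 125 miles."""

for issue in flag_weak_steps(chain):
    print(issue["step"], issue["issue"])
```

These heuristics only catch surface symptoms; checking logical validity or hidden assumptions would still need a model call or a domain-specific rule.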

### Challenge 3: Domain-Specific Reasoning Library

Build a comprehensive reasoning library for specific domains:

In [None]:
class DomainReasoningLibrary:
    """
    Domain-specific reasoning templates and examples.
    
    TODO: Implement for domains:
    1. Scientific reasoning (hypothesis, experiment, conclusion)
    2. Legal reasoning (precedent, statute, application)
    3. Medical diagnosis (symptoms, differential, tests, diagnosis)
    4. Engineering design (requirements, constraints, solutions)
    5. Financial analysis (data, metrics, insights, recommendations)
    
    Each domain should have:
    - Reasoning structure templates
    - Common patterns and pitfalls
    - Example problems with solutions
    - Evaluation criteria
    - Domain-specific validation
    """
    
    def __init__(self, domain: str):
        self.domain = domain
        self.templates = {}
        self.examples = []
        self.validation_rules = []
    
    def load_domain_templates(self, domain: str):
        """
        Load templates for specific domain.
        
        TODO: Load reasoning structures for domain
        """
        pass
    
    def generate_reasoning(self, problem: str, template: str) -> str:
        """
        Generate domain-specific reasoning.
        
        TODO: Apply domain template to problem
        """
        pass
    
    def validate_reasoning(self, reasoning: str) -> Dict[str, Any]:
        """
        Validate reasoning against domain rules.
        
        TODO: Check domain-specific validity
        """
        pass

# Usage example:
# medical = DomainReasoningLibrary(domain="medical")
# problem = "Patient presents with fever, cough, and fatigue..."
# reasoning = medical.generate_reasoning(problem, template="diagnosis")
# validation = medical.validate_reasoning(reasoning)
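
The templates could begin as nothing more than an ordered list of section names per domain. The sketch below takes the section names from the docstring above and shows a prompt builder plus a structural check; the dictionary layout, the `Section:` heading convention, and both function names are assumptions made for illustration.

```python
from typing import Dict, List

# Section names follow the domain list in the class docstring; the layout is an assumption.
DOMAIN_SECTIONS: Dict[str, List[str]] = {
    "scientific": ["Hypothesis", "Experiment", "Conclusion"],
    "legal": ["Precedent", "Statute", "Application"],
    "medical": ["Symptoms", "Differential", "Tests", "Diagnosis"],
}


def build_domain_prompt(domain: str, problem: str) -> str:
    """Build a CoT prompt that asks for each domain section in order."""
    outline = "\n".join(
        f"{index}. {name}:" for index, name in enumerate(DOMAIN_SECTIONS[domain], start=1)
    )
    return (
        f"Problem: {problem}\n\n"
        f"Reason through the following sections in order:\n{outline}\n\n"
        "Let's think step by step."
    )


def find_missing_sections(domain: str, reasoning: str) -> List[str]:
    """Return template sections that do not appear as 'Name:' in the reasoning."""
    lowered = reasoning.lower()
    return [
        name for name in DOMAIN_SECTIONS[domain]
        if f"{name.lower()}:" not in lowered
    ]


sample = "Symptoms: fever, cough\nDifferential: flu, cold\nDiagnosis: likely flu"
print(find_missing_sections("medical", sample))
```

A structural check like this only confirms that the expected sections are present; judging whether their content is sound is the harder part of `validate_reasoning`.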

## Summary

In this lab, you've learned:

1. ✅ Chain-of-thought prompting fundamentals
2. ✅ Zero-shot CoT with trigger phrases
3. ✅ Few-shot CoT with reasoning examples
4. ✅ Complex multi-step reasoning structures
5. ✅ Self-consistency for improved accuracy
6. ✅ Reasoning evaluation and quality assessment

### Key Takeaways

**When to Use CoT:**
- Complex reasoning problems
- Multi-step calculations
- Logical deduction tasks
- Planning and scheduling
- Decision making with trade-offs
- Debugging and problem diagnosis

**Zero-Shot vs Few-Shot:**
- **Zero-shot**: Best for quick, simple problems, or when no examples are available
- **Few-shot**: Best for complex domains, when consistent formatting is needed, or when solving many similar problems
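
The difference between the two styles comes down to how the prompt is assembled. The sketch below builds both as plain strings; the `Q:`/`A:` layout and the helper names are assumptions and do not reproduce the `ZeroShotCoT` or `FewShotCoT` classes from earlier in the lab.

```python
from typing import List, Tuple

TRIGGER = "Let's think step by step."


def build_zero_shot_prompt(problem: str) -> str:
    """Zero-shot CoT: the problem plus a trigger phrase, no examples."""
    return f"Q: {problem}\nA: {TRIGGER}"


def build_few_shot_prompt(problem: str, examples: List[Tuple[str, str]]) -> str:
    """Few-shot CoT: worked (question, reasoning) pairs, then the new problem."""
    blocks = [f"Q: {question}\nA: {worked}" for question, worked in examples]
    blocks.append(build_zero_shot_prompt(problem))
    return "\n\n".join(blocks)


examples = [
    (
        "A train travels 60 miles in 1.5 hours. What is its average speed?",
        "Speed = distance / time = 60 / 1.5 = 40. The answer is 40 mph.",
    ),
]
print(build_few_shot_prompt("A car travels 125 miles in 2.5 hours. What is its speed?", examples))
```

The few-shot prompt costs more tokens per call, which is one reason to reserve it for cases where the worked examples change the output format or reasoning quality.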

**Self-Consistency Benefits:**
- Higher accuracy on challenging problems
- Confidence estimation
- Robust to individual reasoning errors
- Works well with ambiguous problems
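
The voting step behind these benefits can be sketched in a few lines. The example assumes the final answers have already been extracted from each sampled reasoning path; the normalisation (strip and lowercase) and the name `majority_vote` are illustrative choices, independent of the `SelfConsistencyCoT` class used earlier.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of samples that agree."""
    if not answers:
        raise ValueError("answers must not be empty")
    # Normalise so that trivial formatting differences do not split the vote.
    normalized = [answer.strip().lower() for answer in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)


samples = ["125 miles", "125 miles ", "62.5 miles"]
answer, agreement = majority_vote(samples)
print(f"{answer} (agreement: {agreement:.0%})")
```

The agreement ratio serves as a rough confidence estimate: a low value suggests sampling more paths or switching strategy rather than trusting the winner.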

**Best Practices:**
1. **Clear trigger phrases**: "Let's think step by step" or "Let's break this down"
2. **Explicit structure**: Define reasoning steps for complex problems
3. **Quality examples**: For few-shot, use diverse, well-reasoned examples
4. **Evaluate reasoning**: Don't just check the final answer; validate the reasoning that led to it
5. **Use self-consistency**: For high-stakes or difficult problems
6. **Temperature matters**: Lower (e.g., 0.3) for consistent answers; higher (e.g., 0.7) for the diverse samples that self-consistency relies on

### Common Pitfalls

- ❌ **Too vague**: "Solve this problem" → Need step-by-step guidance
- ❌ **Skipping steps**: Reasoning jumps to conclusions
- ❌ **No validation**: Accepting first answer without verification
- ❌ **Wrong temperature**: Too low reduces diversity, too high reduces coherence
- ❌ **Ignoring confidence**: Not checking self-consistency agreement

### Next Steps

- Complete the challenge projects
- Build domain-specific reasoning libraries for your work
- Experiment with reasoning structures for different problem types
- Move on to Week 3: Advanced API Usage & Function Calling
