## Coding: 
Use DSPy (or a simplified stand-in if DSPy isn’t accessible) to optimize a multi-step QA pipeline. Example pipeline: (1) retrieve relevant text from a small corpus, (2) ask an LLM to answer the question given the retrieved text. Use answer accuracy as the metric. Let the system tune the retrieval prompt and the answer prompt, and observe what it changes (e.g. does it add “Let’s think step by step” automatically?). Report performance before vs. after optimization.

In [54]:
%pip install dspy wikipedia python-dotenv

Invalid -W option ignored: invalid module name: 'urllib3.exceptions'
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [24]:
# Suppress warnings
import warnings
import os

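# Note: setting PYTHONWARNINGS from inside the notebook does not affect the
# running kernel, and child interpreters (e.g. %pip) reject the dotted category
# with "Invalid -W option ignored". The urllib3 warning is filtered by the
# message-based filters below instead.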

# Suppress all UserWarnings (including Pydantic)
warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', message='.*urllib3.*')
warnings.filterwarnings('ignore', message='.*Pydantic.*')
warnings.filterwarnings('ignore', message='.*NotOpenSSLWarning.*')

print("Warnings suppressed.")



In [59]:
import dspy
import os

from dotenv import load_dotenv

# 1. Load the variables from .env into the environment
load_dotenv()

# Read the Groq API key from .env (stored under the name "groq_token")
GROQ_API_KEY = os.getenv("groq_token")
# Model format: 'groq/model-name'
try:
    # Pass the key explicitly; otherwise GROQ_API_KEY above is never used
    lm = dspy.LM(model='groq/llama-3.3-70b-versatile', api_key=GROQ_API_KEY)
    dspy.configure(lm=lm)
    print("✓ DSPy configured with Groq (llama-3.3-70b-versatile)")
except Exception as e:
    print(f"Error configuring Groq: {e}")
    print("Trying alternative model...")
    # Fallback to a different Groq model
    lm = dspy.LM(model='groq/llama-3.1-70b-versatile', api_key=GROQ_API_KEY)
    dspy.configure(lm=lm)
    print("✓ DSPy configured with Groq (llama-3.1-70b-versatile)")

✓ DSPy configured with Groq (llama-3.3-70b-versatile)


In [None]:
import dspy
import wikipedia
from dspy.teleprompt import BootstrapFewShot
import warnings

# Suppress warnings during execution
warnings.filterwarnings('ignore')

# 2. THE RETRIEVER: Custom Wikipedia Function
def search_wikipedia(query: str, k=5) -> list[str]:
    """Real-time Wikipedia abstracts for RAG."""
    try:
        # Search for more results to get better coverage
        titles = wikipedia.search(query, results=k)
        contexts = []
        for title in titles:
            try:
                # Get a longer summary (5 sentences) for better context
                summary = wikipedia.summary(title, sentences=5, auto_suggest=False)
                contexts.append(f"[{title}]: {summary}")
            except Exception as e:
                continue
        return contexts if contexts else ["No relevant Wikipedia context found."]
    except Exception as e:
        return [f"Error connecting to Wikipedia: {str(e)}"]

# 3. THE PROGRAM: Define RAG Signature and Module
class RAGSignature(dspy.Signature):
    """Answer questions using the provided context from Wikipedia."""
    context = dspy.InputField(desc="Wikipedia snippets")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="concise, factual answer")

class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        # ChainOfThought adds a 'reasoning' step before the answer
        self.generate_answer = dspy.ChainOfThought(RAGSignature)
    
    def forward(self, question):
        context = search_wikipedia(question)
        # Join contexts into a single string
        context_str = "\n\n".join(context) if isinstance(context, list) else context
        prediction = self.generate_answer(context=context_str, question=question)
        
        # Why dspy.Prediction instead of just returning prediction.answer?
        # - OutputField defines what the signature outputs (just 'answer')
        # - But we need to return BOTH 'answer' AND 'context' (context is not in signature)
        # - dspy.Prediction allows returning multiple fields beyond the signature
        # - This is useful for debugging, inspection, and downstream processing
        return dspy.Prediction(context=context_str, answer=prediction.answer)
        

# 4. DATASET: Create a Synthetic Training Set
trainset = [
    dspy.Example(question="What is the boiling point of Nitrogen?", answer="-195.79 °C").with_inputs('question'),
    dspy.Example(question="Who wrote 'The Great Gatsby'?", answer="F. Scott Fitzgerald").with_inputs('question'),
    dspy.Example(question="What is the capital of Kazakhstan?", answer="Astana").with_inputs('question'),
    dspy.Example(question="Which planet is the hottest in our solar system?", answer="Venus").with_inputs('question')
]

# 5. OPTIMIZATION: Compile the Program
def validate_answer(example, pred, trace=None):
    """Metric: Does the predicted answer match the ground truth?"""
    return dspy.evaluate.answer_exact_match(example, pred)

# BootstrapFewShot runs the program on the training examples and keeps the
# traces that pass the metric as few-shot demonstrations
print("Compiling RAG program with BootstrapFewShot...")
teleprompter = BootstrapFewShot(
    metric=validate_answer, 
    max_bootstrapped_demos=3,
    max_labeled_demos=4
)
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)
print("✓ Compilation complete!\n")

# 6. EVALUATION: Compare Before vs. After
def run_comparison(question):
    print(f"\n{'='*60}")
    print(f"Question: {question}")
    print(f"{'='*60}")
    
    # Baseline (Unoptimized)
    print("\n[Baseline - Unoptimized]")
    baseline_rag = RAG()
    baseline_output = baseline_rag(question)
    print(f"Answer: {baseline_output.answer}")
    if hasattr(baseline_output, 'context'):
        print(f"Context snippets: {len(baseline_output.context.split('['))-1 if '[' in baseline_output.context else 1}")
    
    # Optimized (Compiled)
    print("\n[Optimized - Compiled with BootstrapFewShot]")
    optimized_output = compiled_rag(question)
    print(f"Answer: {optimized_output.answer}")
    if hasattr(optimized_output, 'context'):
        print(f"Context snippets: {len(optimized_output.context.split('['))-1 if '[' in optimized_output.context else 1}")

# Test with multiple queries
test_questions = [
    "Who won the Nobel Prize in Literature in 2006 and what is their country?",
    "What is the capital of France?",
    "Who invented the telephone?"
]

for q in test_questions:
    run_comparison(q)

# 7. INSPECTION: See how the prompt changed
print("\n" + "="*60)
print("OPTIMIZED PROMPT LOG (Full History)")
print("="*60)
print("\nShowing last prompt interaction:")
dspy.inspect_history(n=1)

Compiling RAG program with BootstrapFewShot...
This may take a minute...


100%|██████████| 4/4 [00:00<00:00, 1316.17it/s]

Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 4 attempts.
✓ Compilation complete!


Question: Who won the Nobel Prize in Literature in 2006 and what is their country?

[Baseline - Unoptimized]
Answer: The winner of the Nobel Prize in Literature in 2006 was Orhan Pamuk from Turkey.
Context snippets: 5

[Optimized - Compiled with BootstrapFewShot]
Answer: Orhan Pamuk, Turkey
Context snippets: 5

Question: What is the capital of France?

[Baseline - Unoptimized]
Answer: Paris
Context snippets: 5

[Optimized - Compiled with BootstrapFewShot]
Answer: Paris
Context snippets: 5

Question: Who invented the telephone?

[Baseline - Unoptimized]
Answer: Alexander Graham Bell
Context snippets: 6

[Optimized - Compiled with BootstrapFewShot]
Answer: Alexander Graham Bell
Context snippets: 6

OPTIMIZED PROMPT LOG (Full History)

Showing last prompt interaction:




[2026-01-19T14:07:50.703354]

System message:

Your input fields are:
1. `context` (str
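
As a rough mental model of the "Bootstrapped 3 full traces" line in the compile log above, bootstrapping can be pictured as: run the program on training questions, keep the runs that pass the metric, and prepend them to later prompts as demonstrations. The sketch below is a simplified stand-in written in plain Python, not DSPy's actual implementation; `bootstrap_demos`, `build_prompt`, and `toy_program` are hypothetical names introduced for illustration, and the toy program is a lookup table rather than an LLM.

```python
# Simplified stand-in for few-shot bootstrapping (NOT DSPy's implementation).
# All names here are hypothetical and exist only for illustration.

def exact(gold: str, pred: str) -> bool:
    """Toy metric: case-insensitive exact match."""
    return gold.strip().lower() == pred.strip().lower()

def bootstrap_demos(program, trainset, metric, max_demos=3):
    """Run the program on training questions; keep runs that pass the metric."""
    demos = []
    for question, gold in trainset:
        if len(demos) >= max_demos:
            break
        pred = program(question)
        if metric(gold, pred):
            demos.append({"question": question, "answer": pred})
    return demos

def build_prompt(instruction, demos, question):
    """Assemble instruction + demonstrations + the new question."""
    parts = [instruction]
    for d in demos:
        parts.append(f"Question: {d['question']}\nAnswer: {d['answer']}")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

def toy_program(question: str) -> str:
    """Stand-in for the unoptimized pipeline (a lookup table, not an LLM)."""
    lookup = {
        "Who wrote 'The Great Gatsby'?": "F. Scott Fitzgerald",
        "What is the capital of Kazakhstan?": "Astana",
        "Which planet is the hottest in our solar system?": "Venus",
    }
    return lookup.get(question, "unknown")

toy_trainset = [
    ("What is the boiling point of Nitrogen?", "-195.79 °C"),
    ("Who wrote 'The Great Gatsby'?", "F. Scott Fitzgerald"),
    ("What is the capital of Kazakhstan?", "Astana"),
    ("Which planet is the hottest in our solar system?", "Venus"),
]

demos = bootstrap_demos(toy_program, toy_trainset, exact, max_demos=3)
prompt = build_prompt(
    "Answer questions using the provided context from Wikipedia.",
    demos,
    "Who invented the telephone?",
)
print(prompt)
```

Under this picture the instruction text itself is left unchanged; what the optimizer adds is the block of worked examples between the instruction and the new question.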




In [36]:
# 1. Capture BASELINE prompt structure
print("\n" + "="*70)
print("1. BASELINE PROMPT (Before Optimization)")
print("="*70)

baseline_rag = RAG()
test_question = "What is the capital of France?"

# Run baseline and capture its prompt
baseline_output = baseline_rag(test_question)

# Get the prompt structure from history
print("\nBaseline Prompt Structure:")
print("-" * 70)
dspy.inspect_history(n=1)  # prints the last LM call; it returns None, so there is nothing to assign

# 2. Capture OPTIMIZED prompt structure  
print("\n" + "="*70)
print("2. OPTIMIZED PROMPT (After BootstrapFewShot)")
print("="*70)

# Run optimized version
optimized_output = compiled_rag(test_question)

# Get the optimized prompt structure
print("\nOptimized Prompt Structure:")
print("-" * 70)
dspy.inspect_history(n=1)  # prints the last LM call; it returns None, so there is nothing to assign


1. BASELINE PROMPT (Before Optimization)

Baseline Prompt Structure:
----------------------------------------------------------------------




[2026-01-20T02:14:38.796575]

System message:

Your input fields are:
1. `context` (str): Wikipedia snippets
2. `question` (str):
Your output fields are:
1. `reasoning` (str): 
2. `answer` (str): concise, factual answer
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## context ## ]]
{context}

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Answer questions using the provided context from Wikipedia.


User message:

[[ ## context ## ]]
[Closed-ended question]: A closed-ended question is any question for which a researcher provides research participants with options from which to choose a response. Closed-ended questions are sometimes ph
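
The task asks for before-vs-after performance, while the cells above only print answers side by side. One way to turn that into a number is to score both programs on a held-out set with exact match. The sketch below is an illustration under stated assumptions: `normalize`, `exact_match`, and `accuracy` are hypothetical helpers using SQuAD-style normalization chosen here, the gold labels in `devset` are assumed rather than taken from the notebook, and the two "programs" are stubs that replay the answers printed in the comparison output instead of calling the LLM. In the notebook itself, `RAG()` and `compiled_rag` would take their place.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(gold: str, pred: str) -> bool:
    return normalize(gold) == normalize(pred)

def accuracy(program, devset) -> float:
    """Fraction of dev questions whose prediction exactly matches the gold answer."""
    hits = sum(exact_match(gold, program(question)) for question, gold in devset)
    return hits / len(devset)

# ASSUMED gold labels (not part of the original notebook).
devset = [
    ("Who won the Nobel Prize in Literature in 2006 and what is their country?",
     "Orhan Pamuk, Turkey"),
    ("What is the capital of France?", "Paris"),
    ("Who invented the telephone?", "Alexander Graham Bell"),
]

# Stubs replaying the answers recorded in the comparison output above.
recorded_baseline = {
    devset[0][0]: "The winner of the Nobel Prize in Literature in 2006 was Orhan Pamuk from Turkey.",
    devset[1][0]: "Paris",
    devset[2][0]: "Alexander Graham Bell",
}
recorded_optimized = {
    devset[0][0]: "Orhan Pamuk, Turkey",
    devset[1][0]: "Paris",
    devset[2][0]: "Alexander Graham Bell",
}

baseline_acc = accuracy(recorded_baseline.get, devset)
optimized_acc = accuracy(recorded_optimized.get, devset)
print(f"baseline: {baseline_acc:.2f}  optimized: {optimized_acc:.2f}")
```

Exact match is strict: a correct but verbose sentence scores zero against a short gold label, so part of any measured gain from few-shot demos may reflect answer format rather than factual correctness. Three questions are far too few for a reliable estimate; a larger dev set that is disjoint from the training set would be needed.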

## Coding: 
Implement a simple version of EvoPrompt. Represent a prompt as a list of tokens or words. Define two evolutionary operators: mutate (randomly replace or insert a word) and crossover (swap a segment between two prompts). Use an LLM (or a heuristic function) to evaluate the fitness of each prompt (e.g. BLEU or any task-specific score). Start with a few initial prompts and run a few generations of evolution. Did the prompts improve? A trivial task is enough: for example, prompt an LLM to output a specific keyword and evolve prompts to maximize how often that keyword occurs in the response.

In [38]:
import random

# --- CONFIGURATION ---
TARGET_KEYWORD = "BANANA"
VOCABULARY = ["say", "write", "output", "the", "word", "quickly", "loudly", "BANANA", "please", "now"]
POPULATION_SIZE = 6
GENERATIONS = 50
MUTATION_RATE = 0.2

# --- CORE FUNCTIONS ---

def get_llm_response(prompt_text):
    """
    Simulated LLM: if 'banana' appears in the prompt (case-insensitive), the
    output repeats the keyword once per prompt word, so longer prompts score
    higher. Otherwise it returns a fixed fallback sentence.
    """
    prompt_lower = prompt_text.lower()
    if "banana" in prompt_lower:
        # One BANANA per word in the prompt (words are joined by single spaces)
        return " ".join([TARGET_KEYWORD] * (prompt_text.count(" ") + 1))
    return "I don't know what to say."

def fitness(prompt_list):
    """Fitness = Number of target keywords in the output."""
    prompt_text = " ".join(prompt_list)
    response = get_llm_response(prompt_text)
    return response.upper().count(TARGET_KEYWORD)

def crossover(parent1, parent2):
    """Single-point crossover: swap segments of two prompts."""
    if len(parent1) < 2 or len(parent2) < 2: return parent1, parent2
    point = random.randint(1, min(len(parent1), len(parent2)) - 1)
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def mutate(prompt_list):
    """Randomly replace one word with a word from the vocab."""
    new_prompt = list(prompt_list)
    idx = random.randrange(len(new_prompt))
    new_prompt[idx] = random.choice(VOCABULARY)
    return new_prompt

# --- INITIALIZATION ---
# Start with prompts that DON'T have the word BANANA
population = [
    ["say", "the", "word"],
    ["write", "output", "now"],
    ["please", "write", "quickly"],
    ["output", "the", "word"],
    ["write", "the", "word", "loudly"],
    ["say", "word", "please"]
]

# --- EVOLUTION LOOP ---
print(f"Goal: Evolve prompt to output '{TARGET_KEYWORD}'\n")

for gen in range(GENERATIONS):
    # 1. Evaluate Fitness
    scores = [(p, fitness(p)) for p in population]
    # Sort by fitness (descending)
    scores.sort(key=lambda x: x[1], reverse=True)
    
    print(f"Gen {gen} | Best Score: {scores[0][1]} | Best Prompt: '{' '.join(scores[0][0])}'")
    
    # 2. Selection: Keep the top 2 as parents
    parents = [s[0] for s in scores[:2]]
    
    # 3. Create next generation
    new_population = []
    new_population.extend(parents) # Elitism: keep best parents
    
    while len(new_population) < POPULATION_SIZE:
        # Crossover
        c1, c2 = crossover(parents[0], parents[1])
        
        # Mutation
        if random.random() < MUTATION_RATE:
            c1 = mutate(c1)
        if random.random() < MUTATION_RATE:
            c2 = mutate(c2)
            
        new_population.extend([c1, c2])
    
    population = new_population[:POPULATION_SIZE]

print(f"\nFinal Winning Prompt: {' '.join(population[0])}")

Goal: Evolve prompt to output 'BANANA'

Gen 0 | Best Score: 0 | Best Prompt: 'say the word'
Gen 1 | Best Score: 0 | Best Prompt: 'say the word'
Gen 2 | Best Score: 0 | Best Prompt: 'say the word'
Gen 3 | Best Score: 0 | Best Prompt: 'say the word'
Gen 4 | Best Score: 0 | Best Prompt: 'say the word'
Gen 5 | Best Score: 0 | Best Prompt: 'say the word'
Gen 6 | Best Score: 3 | Best Prompt: 'write the BANANA'
Gen 7 | Best Score: 3 | Best Prompt: 'write the BANANA'
Gen 8 | Best Score: 3 | Best Prompt: 'write the BANANA'
Gen 9 | Best Score: 3 | Best Prompt: 'write the BANANA'
Gen 10 | Best Score: 3 | Best Prompt: 'write the BANANA'
Gen 11 | Best Score: 3 | Best Prompt: 'write the BANANA'
Gen 12 | Best Score: 3 | Best Prompt: 'write the BANANA'
Gen 13 | Best Score: 3 | Best Prompt: 'write the BANANA'
Gen 14 | Best Score: 3 | Best Prompt: 'write the BANANA'
Gen 15 | Best Score: 3 | Best Prompt: 'write the BANANA'
Gen 16 | Best Score: 3 | Best Prompt: 'write the BANANA'
Gen 17 | Best Score: 3 | 
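
The task statement describes mutation as "randomly replace or insert a word", while `mutate` above only replaces, so a prompt can change length only through crossover between parents of different lengths. A minimal sketch of a variant that also inserts is shown below. `mutate_replace_or_insert`, `insert_prob`, and the `rng` parameter are hypothetical additions for illustration; `VOCAB` copies the notebook's `VOCABULARY` so the block runs on its own.

```python
import random

# Copy of the notebook's VOCABULARY so this block is self-contained.
VOCAB = ["say", "write", "output", "the", "word", "quickly",
         "loudly", "BANANA", "please", "now"]

def mutate_replace_or_insert(prompt_words, vocabulary, insert_prob=0.5, rng=random):
    """Return a mutated copy: insert a random word or replace an existing one.

    insert_prob is an assumed knob (not in the original cell) controlling how
    often mutation lengthens the prompt instead of replacing a word.
    """
    new_prompt = list(prompt_words)          # never modify the parent in place
    word = rng.choice(vocabulary)
    if not new_prompt or rng.random() < insert_prob:
        idx = rng.randint(0, len(new_prompt))  # inclusive upper bound: may append
        new_prompt.insert(idx, word)
    else:
        idx = rng.randrange(len(new_prompt))
        new_prompt[idx] = word
    return new_prompt

parent = ["say", "the", "word"]
child = mutate_replace_or_insert(parent, VOCAB, rng=random.Random(0))
print(parent, "->", child)
```

Because the simulated LLM emits one keyword per prompt word, insertion gives evolution a way to keep increasing fitness after the keyword has been found. With a real LLM, a length penalty in the fitness function would likely be needed to stop prompts from growing without bound.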

In [61]:
import os
import json
import re
from groq import Groq

# 1. Setup Groq Client
# Pass the key directly; wrapping it in an f-string would turn a missing key into the string "None"
client = Groq(api_key=GROQ_API_KEY)
MODEL = "llama-3.3-70b-versatile"

class Playbook:
    def __init__(self):
        self.entries = {
            "STRATEGIES": {},
            "CODE_SNIPPETS": {},
            "PITFALLS": {}
        }
        self.id_counter = 1

    def to_text(self):
        text = "### CURRENT PLAYBOOK\n"
        for section, items in self.entries.items():
            text += f"\n[{section}]\n"
            if not items: text += "No entries yet.\n"
            for entry_id, data in items.items():  # avoid shadowing the built-in id()
                text += f"- {entry_id}: {data['content']} (Helpful: {data['helpful']}, Harmful: {data['harmful']})\n"
        return text

# 2. Problem Set (trick questions with short reference answers)
problems = [
    {"q": "If it takes 5 shirts 5 hours to dry outside, how long does it take 30 shirts to dry?", "a": "5 hours"},
    {"q": "A bat and ball cost $1.10. The bat costs $1.00 more than the ball. How much is the ball?", "a": "$0.05"},
    {"q": "A doctor gives you 3 pills and tells you to take one every half hour. How long until they are gone?", "a": "60 minutes"},
    {"q": "In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take to cover half the lake?", "a": "47 days"},
    {"q": "If 3 cats can catch 3 bunnies in 3 minutes, how long does it take 100 cats to catch 100 bunnies?", "a": "3 minutes"},
    {"q": "A man looks at a painting and says, 'Brothers and sisters I have none, but that man's father is my father's son.' Who is in the painting?", "a": "His son"},
    {"q": "How many birthdays does the average man have?", "a": "One"},
    {"q": "Some months have 31 days; how many have 28?", "a": "12"},
    {"q": "If you have 3 apples and you take away 2, how many apples do you have?", "a": "2"},
    {"q": "A plane crashes on the border of the US and Canada. Where do they bury the survivors?", "a": "You don't bury survivors"}
] * 3  # Repeated 3 times (30 problems) to test reinforcement; the loop below uses only the first 20

# 3. ACE Components

def generator(problem, playbook):
    """Solves the problem using the playbook."""
    prompt = f"{playbook.to_text()}\n\nTask: {problem['q']}\nSolve the problem. Identify which Playbook IDs helped or misled you."
    response = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=MODEL,
        temperature=0
    ).choices[0].message.content
    return response

def reflector(problem, response, actual_answer):
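    """Analyzes the success/failure and extracts a lesson."""
    # Crude correctness check: case-insensitive substring match, so short
    # reference answers such as "2" or "12" can match by accident anywhere in the response.
    is_correct = actual_answer.lower() in response.lower()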
    prompt = f"Problem: {problem['q']}\nExpected: {actual_answer}\nAgent's Solution: {response}\n\n"
    prompt += "Identify the single most important lesson. If wrong, what was the logic error? If right, what was the key strategy?"
    
    lesson = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=MODEL,
    ).choices[0].message.content
    return lesson, is_correct

def curator(lesson, playbook, is_correct):
    """Merges the lesson into the playbook."""
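    # Logic: store the lesson as a playbook bullet, merge duplicates, update counters.
    # A real system would compare embeddings. Here a lesson counts as a duplicate
    # when its first 20 characters appear in an existing entry.
    category = "STRATEGIES" if is_correct else "PITFALLS"
    
    # Check for duplicates (simple heuristic). Caveat: lessons that share a stock
    # opening such as "The single most important lesson ..." all merge into the
    # first entry of their category, which is consistent with the final playbook
    # holding a single entry per category.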
    for entry_id, entry in playbook.entries[category].items():
        if lesson[:20].lower() in entry['content'].lower():
            if is_correct: entry['helpful'] += 1
            else: entry['harmful'] += 1
            return # Merged

    # Add new entry
    new_id = f"{category[0]}{playbook.id_counter}"
    playbook.entries[category][new_id] = {
        "content": lesson.strip()[:150], # Keep it concise
        "helpful": 1 if is_correct else 0,
        "harmful": 0 if is_correct else 1
    }
    playbook.id_counter += 1

# 4. Simulation Execution

ace_playbook = Playbook()
baseline_context = ""
ace_scores = []
baseline_scores = []

for i in range(20):
    prob = problems[i]
    print(f"\n--- Iteration {i+1} ---")

    # Run ACE
    gen_output = generator(prob, ace_playbook)
    lesson, correct = reflector(prob, gen_output, prob['a'])
    curator(lesson, ace_playbook, correct)
    ace_scores.append(1 if correct else 0)

    # Run Baseline (Just append lessons to a big string)
    baseline_context += f"\nLesson {i}: {lesson}"
    # (Baseline logic would solve here using baseline_context)

print("\nFinal ACE Playbook State:")
print(ace_playbook.to_text())


--- Iteration 1 ---

--- Iteration 2 ---

--- Iteration 3 ---

--- Iteration 4 ---

--- Iteration 5 ---

--- Iteration 6 ---

--- Iteration 7 ---

--- Iteration 8 ---

--- Iteration 9 ---

--- Iteration 10 ---

--- Iteration 11 ---

--- Iteration 12 ---

--- Iteration 13 ---

--- Iteration 14 ---

--- Iteration 15 ---

--- Iteration 16 ---

--- Iteration 17 ---

--- Iteration 18 ---

--- Iteration 19 ---

--- Iteration 20 ---

Final ACE Playbook State:
### CURRENT PLAYBOOK

[STRATEGIES]
- S1: The single most important lesson in this problem is that the drying time of shirts is independent of the number of shirts, given constant environmenta (Helpful: 14, Harmful: 0)

[CODE_SNIPPETS]
No entries yet.

[PITFALLS]
- P2: The single most important lesson from this problem is the understanding of time intervals and how to apply them to a real-world scenario. The key stra (Helpful: 0, Harmful: 6)
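
Both surviving entries open with the same stock phrase, "The single most important lesson", and the curator's 20-character prefix check treats any lesson with that opening as a duplicate. A slightly less brittle heuristic, still short of the embeddings the code comment mentions, is to compare whole lessons by token overlap. The sketch below uses Jaccard similarity; `token_set`, `jaccard`, `find_duplicate`, the 0.6 threshold, and the example lessons are all assumptions made for illustration and are not part of the original code or its output.

```python
import re

def token_set(text: str) -> set:
    """Lowercased word tokens (letters, digits, apostrophes)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over the two token sets."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def find_duplicate(lesson: str, entries: dict, threshold: float = 0.6):
    """Return the id of the most similar entry, or None if below threshold.

    The 0.6 threshold is an arbitrary illustrative value, not a tuned one.
    """
    best_id, best_score = None, 0.0
    for entry_id, entry in entries.items():
        score = jaccard(lesson, entry["content"])
        if score > best_score:
            best_id, best_score = entry_id, score
    return best_id if best_score >= threshold else None

# Hypothetical playbook section and lessons, for illustration only.
entries = {
    "S1": {"content": "Drying time does not depend on the number of shirts.",
           "helpful": 1, "harmful": 0},
}
near_copy = "Drying time does not depend on the number of shirts hung out."
unrelated = "The single most important lesson is to set up an equation for the ball's price."

print(find_duplicate(near_copy, entries))   # merges into the existing entry
print(find_duplicate(unrelated, entries))   # treated as a new lesson
```

Inside `curator`, a call like `find_duplicate(lesson, playbook.entries[category])` could replace the prefix test. Stripping the stock opening from each lesson before storing it would also make the truncated 150-character entries more informative.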

