# Chapter 6: Prompt Engineering - Medium Tasks

This notebook covers advanced prompt engineering: structured prompt building, self-consistency, constrained output, and systematic optimization.

## Setup

Run all cells in this section to set up the environment and load the model.

Before running these cells, review the concepts from the main Chapter 6 notebook (00_Start_Here.ipynb).

### [Optional] - Installing Packages on Google Colab

If you are viewing this notebook on Google Colab, uncomment and run the following code to install dependencies.

**Note**: Use a GPU for this notebook. In Google Colab, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4.

In [None]:
# %%capture
# !pip install --upgrade "transformers>=4.40.0" torch accelerate
# !pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

### Model Loading

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [None]:
model_path = "microsoft/Phi-3-mini-4k-instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)

### Helper Functions

In [None]:
def generate_text(prompt, temperature=0.7, max_tokens=300):
    """Generate text with specified parameters"""
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        return_full_text=False,
        max_new_tokens=max_tokens,
        do_sample=temperature > 0,
        temperature=temperature if temperature > 0 else None,
    )
    
    messages = [{"role": "user", "content": prompt}]
    output = pipe(messages)
    return output[0]['generated_text']

## Challenges

Complete the following tasks.

### Level: Medium

**About This Task:**
Building production prompts requires assembling components dynamically based on the use case.

#### Medium Task 1: Prompt Builder

### Instructions

1. Run the PromptBuilder to see how components are assembled
2. Create different prompts by setting different components
3. Test removing individual components to see their impact
4. Build prompts for different scenarios (instructions, explanations, translations)
5. Compare outputs with different component combinations

In [None]:
class PromptBuilder:
    """Build structured prompts from up to 7 optional components"""

    def __init__(self):
        self.persona = None
        self.instruction = None
        self.context = None
        self.format_spec = None
        self.audience = None
        self.tone = None
        self.data = None

    # Each setter returns self so that calls can be chained.
    def set_persona(self, persona):
        self.persona = persona
        return self

    def set_instruction(self, instruction):
        self.instruction = instruction
        return self

    def set_context(self, context):
        self.context = context
        return self

    def set_format(self, format_spec):
        self.format_spec = format_spec
        return self

    def set_audience(self, audience):
        self.audience = audience
        return self

    def set_tone(self, tone):
        self.tone = tone
        return self

    def set_data(self, data):
        self.data = data
        return self

    def build(self):
        """Assemble the complete prompt, skipping components that were not set"""
        parts = []

        if self.persona:
            parts.append(f"You are {self.persona}.")

        if self.instruction:
            parts.append(f"\nYour task: {self.instruction}")

        if self.context:
            parts.append(f"\nContext: {self.context}")

        if self.format_spec:
            parts.append(f"\nFormat:\n{self.format_spec}")

        if self.audience:
            parts.append(f"\nAudience: {self.audience}")

        if self.tone:
            parts.append(f"\nTone: {self.tone}")

        if self.data:
            parts.append(f"\nData to work with:\n{self.data}")

        return "".join(parts)

Example 1: Responding to a complaint

In [None]:
complaint_prompt = PromptBuilder() \
    .set_persona("a helpful customer service representative") \
    .set_instruction("Respond to this customer complaint") \
    .set_context("The customer has been waiting 2 weeks for a refund") \
    .set_format("1. Acknowledge\n2. Apologize\n3. Solution") \
    .set_audience("A frustrated customer") \
    .set_tone("Professional and empathetic") \
    .set_data("Customer: Sarah\nOrder: 12345\nAmount: $89.99") \
    .build()

In [None]:
print(complaint_prompt)

In [None]:
output = generate_text(complaint_prompt, temperature=0, max_tokens=200)
print(output)

Example 2: Explaining a concept

Your task: Build a prompt for explaining how email works to someone unfamiliar with technology.

In [None]:
explanation_prompt = PromptBuilder() \
    .set_persona("a patient technology teacher") \
    .set_instruction("Explain how email works") \
    .set_context("The person has never used email before") \
    .set_format("1. What it is\n2. How to use it\n3. Simple analogy") \
    .set_audience("Someone unfamiliar with technology") \
    .set_tone("Simple and encouraging") \
    .build()

In [None]:
print(explanation_prompt)

In [None]:
output = generate_text(explanation_prompt, temperature=0, max_tokens=200)
print(output)

### Task 1b: Component Impact Test

Test what happens when you remove different components.

In [None]:
base_task = "Explain what a computer virus is"

Version 1: Instruction only

In [None]:
v1 = PromptBuilder().set_instruction(base_task).build()
print("Version 1 (instruction only):")
print(generate_text(v1, temperature=0, max_tokens=150))

Version 2: Add audience

In [None]:
v2 = PromptBuilder() \
    .set_instruction(base_task) \
    .set_audience("A 12-year-old student") \
    .build()
print("\nVersion 2 (with audience):")
print(generate_text(v2, temperature=0, max_tokens=150))

Version 3: Add format

Your task: Add a format specification to structure the output.

In [None]:
v3 = PromptBuilder() \
    .set_instruction(base_task) \
    .set_audience("A 12-year-old student") \
    .set_format("1. What it is\n2. How it spreads\n3. How to stay safe") \
    .build()
print("\nVersion 3 (with format):")
print(generate_text(v3, temperature=0, max_tokens=200))
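
The three versions above add components one at a time. The reverse experiment, removing one component at a time from a full prompt, can be sketched as follows. This is a minimal, model-free illustration: `assemble` and `leave_one_out` are hypothetical helpers, and `assemble` is a simplified stand-in for `PromptBuilder.build`, not the same formatting.

```python
def assemble(components):
    """Join non-empty components into one prompt (simplified stand-in for build)."""
    return "\n".join(f"{name}: {value}" for name, value in components.items() if value)


def leave_one_out(components):
    """Yield (removed_name, prompt) pairs, dropping one component at a time."""
    for removed in components:
        remaining = {k: v for k, v in components.items() if k != removed}
        yield removed, assemble(remaining)


components = {
    "Task": "Explain what a computer virus is",
    "Audience": "A 12-year-old student",
    "Format": "1. What it is\n2. How it spreads\n3. How to stay safe",
}

variants = dict(leave_one_out(components))
for removed, prompt in variants.items():
    print(f"--- without {removed} ---")
    print(prompt)
```

Each variant can then be passed to `generate_text` with `temperature=0` so that differences in the output come from the missing component rather than from sampling.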

### Questions

1. Which component had the biggest impact on output quality?

2. How did adding audience change the language used?

3. When would you intentionally omit certain components?

**About This Task:**
Self-consistency improves reliability by sampling multiple reasoning paths and taking the majority answer.

#### Medium Task 2: Self-Consistency

### Instructions

1. Run single CoT to see baseline performance
2. Run self-consistency with 5 samples to see multiple reasoning paths
3. Test with different sample counts (3, 5, 10)
4. Try different temperature values
5. Identify which problems benefit most from multiple samples
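
To build intuition for why the sample count matters, the sketch below computes the probability that a strict majority of samples is correct. It rests on two simplifying assumptions that do not hold exactly for a real model: samples are independent, and each sample is correct with the same probability `p`. The function name `majority_correct_prob` is illustrative.

```python
from math import comb


def majority_correct_prob(p, n):
    """Probability that more than half of n independent samples are correct."""
    needed = n // 2 + 1  # smallest count that forms a strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))


# Odd sample counts avoid ties between "correct" and "incorrect".
for n in (1, 3, 5, 11):
    print(f"n={n:2d}: {majority_correct_prob(0.6, n):.3f}")
```

Under these assumptions, voting helps when `p` is above 0.5 and hurts when it is below. In practice wrong answers are often spread across several different values, so the most common answer can win with less than half of the votes.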

In [None]:
test_problems = [
    {
        "problem": "A bill is $80. You want to leave a 20% tip. What is the total?",
        "answer": 96
    },
    {
        "problem": "If 5 machines make 5 widgets in 5 minutes, how long for 100 machines to make 100 widgets?",
        "answer": 5
    },
    {
        "problem": "A bat and ball cost $1.10 total. The bat costs $1 more than the ball. How much is the ball?",
        "answer": 0.05
    },
]

### Single CoT Reasoning

In [None]:
def single_cot_solve(problem):
    prompt = f"{problem}\n\nLet's think step-by-step:"
    return generate_text(prompt, temperature=0, max_tokens=200)

In [None]:
print("Single CoT Reasoning:")
for item in test_problems:
    problem = item["problem"]
    correct = item["answer"]
    output = single_cot_solve(problem)
    print(f"\nProblem: {problem}")
    print(f"Correct: {correct}")
    print(f"Reasoning: {output}")

### Self-Consistency

In [None]:
import re
from collections import Counter

In [None]:
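def extract_answer(text):
    """Extract a numerical answer from reasoning text, or None if none is found"""
    # Digits with an optional decimal part. A bare "." or a trailing
    # sentence period is never captured, so float() cannot raise.
    number = r'([0-9]+(?:\.[0-9]+)?)'

    # Prefer an explicit "answer is X" statement.
    match = re.search(r'answer is\s+\$?' + number, text.lower())
    if match:
        return float(match.group(1))

    # Otherwise take the first "= X" result.
    match = re.search(r'=\s+\$?' + number, text)
    if match:
        return float(match.group(1))

    # Fall back to the last number that appears in the text.
    numbers = re.findall(r'\$?' + number, text)
    if numbers:
        return float(numbers[-1])

    return None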

In [None]:
def self_consistency_solve(problem, num_samples=5, temperature=0.7):
    """Solve using self-consistency"""
    prompt = f"{problem}\n\nLet's think step-by-step:"
    
    reasoning_paths = []
    answers = []
    
    for i in range(num_samples):
        output = generate_text(prompt, temperature=temperature, max_tokens=200)
        reasoning_paths.append(output)
        
        answer = extract_answer(output)
        if answer is not None:
            answers.append(answer)
    
    if not answers:
        return None, reasoning_paths, []
    
    answer_counts = Counter(answers)
    majority_answer = answer_counts.most_common(1)[0][0]
    
    return majority_answer, reasoning_paths, answers

In [None]:
print("Self-Consistency (5 samples):")
for item in test_problems:
    problem = item["problem"]
    correct = item["answer"]
    
    majority, paths, all_answers = self_consistency_solve(problem, num_samples=5)
    
    print(f"\nProblem: {problem}")
    print(f"Correct: {correct}")
    print(f"All answers: {all_answers}")
    print(f"Majority: {majority}")
    
    if paths:
        print(f"Example reasoning: {paths[0][:150]}...")

### Task 2b: Test Different Sample Counts

Your task: Try different numbers of samples and see how it affects reliability.

In [None]:
tricky_problem = test_problems[2]
sample_counts = [3, 5, 10]

In [None]:
print("Testing different sample counts:")
print(f"Problem: {tricky_problem['problem']}")
print(f"Correct answer: {tricky_problem['answer']}")

for n in sample_counts:
    majority, _, all_answers = self_consistency_solve(tricky_problem["problem"], num_samples=n)
    print(f"\n{n} samples: {all_answers}")
    print(f"Majority: {majority}")
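
The majority answer alone hides how contested the vote was. A small sketch of reporting the winning share alongside the answer is shown below. `vote_with_agreement` is a hypothetical helper, the rounding to two decimals is an assumption suited to dollar amounts, and the sample answers are made up for illustration.

```python
from collections import Counter


def vote_with_agreement(answers, ndigits=2):
    """Return (majority_answer, agreement), where agreement is the winner's vote share."""
    if not answers:
        return None, 0.0
    # Round first so that near-identical floats count as the same vote.
    rounded = [round(a, ndigits) for a in answers]
    winner, count = Counter(rounded).most_common(1)[0]
    return winner, count / len(rounded)


sample_answers = [0.05, 0.1, 0.05, 0.05, 0.1]  # made-up values, not model output
majority, agreement = vote_with_agreement(sample_answers)
print(f"Majority: {majority} (agreement {agreement:.0%})")
```

A low agreement value is a signal that the problem is hard for the model or that the answer extraction is picking up different numbers from the reasoning text.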

### Questions

1. Which problem benefited most from self-consistency?

2. Did all samples agree? What does disagreement tell you?

3. What is the trade-off of using more samples?

**About This Task:**
Constrained generation restricts decoding so that the model can only emit syntactically valid JSON, which matters for any application that parses model output programmatically.

#### Medium Task 3: Constrained JSON Output

### Instructions

1. Run without constraints to see free-form output
2. Apply JSON constraints to guarantee valid output
3. Test different data structures (user profiles, products, events)
4. Create prompts that request specific JSON schemas
5. Handle validation of required fields

In [None]:
from llama_cpp import Llama
import json

In [None]:
llm = Llama.from_pretrained(
    repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",
    filename="*fp16.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
    verbose=False
)

### Without Constraints

In [None]:
prompt = """Create a user profile for a software engineer:
- Name: Alex Johnson
- Age: 28
- Skills: Python, Machine Learning
- Experience: 5 years

Return as JSON."""

In [None]:
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
    max_tokens=300
)

In [None]:
response = output['choices'][0]['message']['content']
print("Without constraints:")
print(response)

In [None]:
try:
    parsed = json.loads(response)
    print("\nValid JSON")
except json.JSONDecodeError as e:
    print(f"\nInvalid JSON: {e}")
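
When the unconstrained response fails to parse, the cause is often extra text around an otherwise valid object. One possible fallback, sketched below, retries on the span between the first `{` and the last `}`. `parse_json_lenient` is a hypothetical helper and the sample string is made up. This is a best-effort recovery, not a substitute for constrained decoding.

```python
import json


def parse_json_lenient(text):
    """Try strict parsing first, then the span from the first '{' to the last '}'."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None  # no candidate object in the text

    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None


# Made-up response imitating a chatty reply around a JSON object
sample = 'Sure! Here is the profile:\n{"name": "Alex Johnson", "age": 28}\nLet me know if you need changes.'
print(parse_json_lenient(sample))
```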

### With JSON Constraints

In [None]:
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
    temperature=0,
    max_tokens=300
)

In [None]:
response = output['choices'][0]['message']['content']
print("With constraints:")
print(response)

In [None]:
try:
    parsed = json.loads(response)
    print("\nValid JSON")
    print("\nFields:")
    for key, value in parsed.items():
        print(f"  {key}: {value}")
except json.JSONDecodeError as e:
    print(f"\nInvalid JSON: {e}")

### Schema Validation

In [None]:
def validate_schema(data, required_fields):
    """Check if JSON contains required fields"""
    missing = []
    
    for field in required_fields:
        if field not in data:
            missing.append(field)
    
    if missing:
        return False, f"Missing fields: {', '.join(missing)}"
    return True, "Valid schema"

In [None]:
user_schema = ["name", "age", "skills", "years_of_experience"]
valid, message = validate_schema(parsed, user_schema)
print(f"\nSchema validation: {message}")
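
Valid JSON can still contain the wrong types or omit optional fields. The sketch below extends the presence check with type checks and a separate set of optional fields. `validate_fields` and the field-to-type mappings are illustrative assumptions, not part of any library, and the sample data is made up.

```python
def validate_fields(data, required, optional=None):
    """Check required and optional fields and their types; return a list of problems."""
    optional = optional or {}
    problems = []

    for field, expected in required.items():
        if field not in data:
            problems.append(f"missing required field: {field}")
        elif not isinstance(data[field], expected):
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(data[field]).__name__}"
            )

    # Optional fields may be absent, but must have the right type when present.
    for field, expected in optional.items():
        if field in data and not isinstance(data[field], expected):
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(data[field]).__name__}"
            )

    return problems


required = {"name": str, "age": int, "skills": list}
optional = {"years_of_experience": int}

sample = {"name": "Alex Johnson", "age": "28", "skills": ["Python"]}  # age is a string
problems = validate_fields(sample, required, optional)
print(problems)
```

An empty list means the data passed every check. Returning all problems at once, rather than stopping at the first, makes it easier to see how far the model's output is from the expected structure.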

### Task 3b: Different Data Structures

Your task: Create a prompt that generates a product catalog entry as JSON.

In [None]:
product_prompt = """Create a product catalog entry for a laptop:
- Brand: TechPro
- Model: UltraBook X1
- Price: $1299.99
- RAM: 16GB
- Storage: 512GB SSD
- In stock: Yes

Return as JSON."""

In [None]:
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": product_prompt}],
    response_format={"type": "json_object"},
    temperature=0,
    max_tokens=300
)

In [None]:
response = output['choices'][0]['message']['content']
parsed = json.loads(response)
print(json.dumps(parsed, indent=2))

### Questions

1. Why is guaranteed JSON output important for applications?

2. What happens if you request fields the model cannot infer from the prompt?

3. How would you handle optional vs required fields?

**About This Task:**
Systematic optimization means testing variations, measuring performance, and iterating to improve.

#### Medium Task 4: Prompt Optimization

### Instructions

1. Run baseline prompt to see initial performance
2. Create improved versions with definitions and examples
3. Test each version on the same evaluation set
4. Measure accuracy to quantify improvements
5. Analyze which improvements had the biggest impact

In [None]:
test_tickets = [
    {"text": "My account has been hacked! Someone is making purchases", "urgency": "high"},
    {"text": "How do I change my password?", "urgency": "low"},
    {"text": "I've been trying to log in for 2 hours, site is down", "urgency": "high"},
    {"text": "What are your business hours?", "urgency": "low"},
    {"text": "I was charged twice for my order", "urgency": "medium"},
    {"text": "Can you recommend a product?", "urgency": "low"},
    {"text": "Payment failing and order deadline is today", "urgency": "high"},
    {"text": "Update my shipping address", "urgency": "medium"},
]

### Version 1: Basic Prompt

In [None]:
def classify_urgency_v1(ticket_text):
    prompt = f"""Classify this support ticket urgency as high, medium, or low.

Ticket: {ticket_text}
Urgency:"""
    
    output = generate_text(prompt, temperature=0, max_tokens=10)
    return output.strip().lower()

In [None]:
print("Version 1: Basic prompt")
correct_v1 = 0

for ticket in test_tickets:
    prediction = classify_urgency_v1(ticket["text"])
    correct = ticket["urgency"]
    match = prediction == correct
    
    if match:
        correct_v1 += 1
    
    print(f"Predicted: {prediction} | Actual: {correct} | {match}")

In [None]:
accuracy_v1 = correct_v1 / len(test_tickets)
print(f"\nAccuracy: {accuracy_v1:.1%}")
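
The exact-match comparison above counts a prediction such as `high.` or `medium` followed by an explanation as wrong, which can understate accuracy. One way to separate formatting errors from classification errors is to normalize the raw output first. `normalize_label` below is a hypothetical helper, and the raw strings are made up for illustration.

```python
import re

LABELS = ("high", "medium", "low")
LABEL_PATTERN = re.compile(r"\b(" + "|".join(LABELS) + r")\b")


def normalize_label(raw_output):
    """Return the first urgency label found in the output, or None if there is none."""
    match = LABEL_PATTERN.search(raw_output.lower())
    return match.group(1) if match else None


for raw in ["High.", " medium\n\nThe ticket mentions billing", "Urgency: LOW", "unclear"]:
    print(repr(raw), "->", normalize_label(raw))
```

Taking the first label is a design choice: it assumes the model states its answer before any explanation, which is worth verifying against the printed predictions.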

### Version 2: Add Definitions

In [None]:
def classify_urgency_v2(ticket_text):
    prompt = f"""Classify this support ticket urgency.

Definitions:
- high: Security issues, service outages, urgent deadlines
- medium: Billing issues, order changes, time-sensitive requests
- low: General questions, information requests, non-urgent help

Ticket: {ticket_text}
Urgency:"""
    
    output = generate_text(prompt, temperature=0, max_tokens=10)
    return output.strip().lower()

In [None]:
print("Version 2: With definitions")
correct_v2 = 0

for ticket in test_tickets:
    prediction = classify_urgency_v2(ticket["text"])
    correct = ticket["urgency"]
    match = prediction == correct
    
    if match:
        correct_v2 += 1
    
    print(f"Predicted: {prediction} | Actual: {correct} | {match}")

In [None]:
accuracy_v2 = correct_v2 / len(test_tickets)
print(f"\nAccuracy: {accuracy_v2:.1%}")
print(f"Improvement: {accuracy_v2 - accuracy_v1:+.1%}")

### Version 3: Add Examples

Your task: Add few-shot examples to further improve accuracy.

In [None]:
def classify_urgency_v3(ticket_text):
    prompt = f"""Classify support ticket urgency.

Definitions:
- high: Security issues, service outages, urgent deadlines
- medium: Billing issues, order changes, time-sensitive requests
- low: General questions, information requests, non-urgent help

Examples:

Ticket: I can't access my account and think it's been compromised
Urgency: high

Ticket: I was charged for a subscription I cancelled
Urgency: medium

Ticket: Do you ship internationally?
Urgency: low

Now classify:
Ticket: {ticket_text}
Urgency:"""
    
    output = generate_text(prompt, temperature=0, max_tokens=10)
    return output.strip().lower()

In [None]:
print("Version 3: With examples")
correct_v3 = 0

for ticket in test_tickets:
    prediction = classify_urgency_v3(ticket["text"])
    correct = ticket["urgency"]
    match = prediction == correct
    
    if match:
        correct_v3 += 1
    
    print(f"Predicted: {prediction} | Actual: {correct} | {match}")

In [None]:
accuracy_v3 = correct_v3 / len(test_tickets)
print(f"\nAccuracy: {accuracy_v3:.1%}")
print(f"Improvement over v2: {accuracy_v3 - accuracy_v2:+.1%}")
print(f"Total improvement: {accuracy_v3 - accuracy_v1:+.1%}")

### Comparison

In [None]:
print("Final comparison:")
print(f"Version 1 (basic): {accuracy_v1:.1%}")
print(f"Version 2 (definitions): {accuracy_v2:.1%}")
print(f"Version 3 (examples): {accuracy_v3:.1%}")
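
The three evaluation loops above differ only in the classifier they call. A shared helper, sketched below, removes the duplication and also records which tickets were missed. `evaluate` and `keyword_classifier` are hypothetical names, and the keyword rule is only a stand-in so the example runs without a model.

```python
def evaluate(classifier, tickets):
    """Return (accuracy, misses) for a classifier callable on labeled tickets."""
    misses = []
    for ticket in tickets:
        prediction = classifier(ticket["text"])
        if prediction != ticket["urgency"]:
            misses.append((ticket["text"], ticket["urgency"], prediction))
    accuracy = 1 - len(misses) / len(tickets)
    return accuracy, misses


def keyword_classifier(text):
    """Stand-in classifier (hypothetical keyword rule, not a model)."""
    return "high" if "hacked" in text.lower() else "low"


sample_tickets = [
    {"text": "My account has been hacked!", "urgency": "high"},
    {"text": "What are your business hours?", "urgency": "low"},
    {"text": "I was charged twice for my order", "urgency": "medium"},
    {"text": "Can you recommend a product?", "urgency": "low"},
]

accuracy, misses = evaluate(keyword_classifier, sample_tickets)
print(f"Accuracy: {accuracy:.1%}")
for text, actual, predicted in misses:
    print(f"Missed: {text!r} | Actual: {actual} | Predicted: {predicted}")
```

In the notebook, the same helper could be called with `classify_urgency_v1`, `classify_urgency_v2`, or `classify_urgency_v3` and `test_tickets`. With only 8 tickets, a single ticket moves accuracy by 12.5 percentage points, so small differences between versions should be read with caution.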

### Questions

1. Which improvement (definitions or examples) had a bigger impact?

2. Are there tickets that all versions got wrong? What makes them difficult?

3. What would you try next to improve further?