# Chapter 6: Prompt Engineering - Medium Tasks (Solutions)

Complete solutions for all Medium Tasks with filled-in answers.

## Task 1: Prompt Builder - SOLUTION

The key insight is that different components have different impacts. Format and Audience typically matter most.

### Task 1b Solution: Component Impact Test

```python
# SOLUTION: Version 3 with format added
v3 = (
    PromptBuilder()
    .set_instruction(base_task)
    .set_audience("A 12-year-old student")
    .set_format("1. What it is\n2. How it spreads\n3. How to stay safe")
    .build()
)

print("\nVersion 3 (with format):")
print(generate_text(v3, temperature=0, max_tokens=200))
```

### Questions Answered

**1. Which component had the biggest impact on output quality?**

Format typically has the biggest impact because it structures the entire response. Audience is second because it directly affects language complexity.

**2. How did adding audience change the language used?**

With the audience set to "A 12-year-old student", the model used:
- Simpler vocabulary
- Shorter sentences
- Concrete examples instead of abstract concepts
- No technical jargon

**3. When would you intentionally omit certain components?**

- Simple tasks: Instruction alone may suffice
- Creative tasks: Too much structure limits creativity
- Token budget: When paying per token, remove least critical components
- Exploratory work: Start minimal, add components as needed
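
The "start minimal, add components as needed" approach can be sketched with a small builder whose components are all optional. `MiniPromptBuilder` below is a hypothetical stand-in, not the notebook's `PromptBuilder`, and the instruction text and the `Task:`/`Audience:`/`Format:` labels are placeholders chosen for illustration.

```python
class MiniPromptBuilder:
    """Hypothetical stand-in for the notebook's PromptBuilder (assumed API)."""

    def __init__(self):
        self._parts = {}

    def set_instruction(self, text):
        self._parts["instruction"] = text
        return self  # returning self enables method chaining

    def set_audience(self, text):
        self._parts["audience"] = text
        return self

    def set_format(self, text):
        self._parts["format"] = text
        return self

    def build(self):
        # Only components that were actually set end up in the prompt.
        labels = [("instruction", "Task"), ("audience", "Audience"), ("format", "Format")]
        lines = [f"{label}: {self._parts[key]}" for key, label in labels if key in self._parts]
        return "\n".join(lines)


# Placeholder instruction; the notebook's base_task is not reproduced here.
instruction = "Explain what phishing is."

minimal = MiniPromptBuilder().set_instruction(instruction).build()
with_audience = (
    MiniPromptBuilder()
    .set_instruction(instruction)
    .set_audience("A 12-year-old student")
    .build()
)

print(minimal)
print("---")
print(with_audience)
```

Because omitted components simply never appear in the output, trimming a prompt for a token budget is a matter of dropping one `set_*` call.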

## Task 2: Self-Consistency - SOLUTION

Self-consistency improves accuracy on tricky problems by sampling multiple reasoning paths.

### Task 2b Solution: Different Sample Counts

The bat-and-ball problem is counter-intuitive. More samples help catch the common mistake.

```python
# Results typically show:
# - 3 samples: May get unlucky and miss the correct answer
# - 5 samples: Usually catches the right answer
# - 10 samples: Most reliable, but 2x the cost of 5 samples

print("Testing different sample counts:")
print(f"Problem: {tricky_problem['problem']}")
print(f"Correct answer: {tricky_problem['answer']}")

for n in sample_counts:
    majority, _, all_answers = self_consistency_solve(
        tricky_problem["problem"], num_samples=n
    )
    print(f"\n{n} samples: {all_answers}")
    print(f"Majority: {majority}")
```

### Questions Answered

**1. Which problem benefited most from self-consistency?**

The bat-and-ball problem. It has a strong but wrong intuitive answer ($0.10; the ball actually costs 5 cents), so a single sample often falls for it. Multiple samples help outvote that bias.
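
To see why the intuitive answer fails, here is the arithmetic as a short check. It assumes the classic wording of the puzzle (bat and ball cost 110 cents together, and the bat costs 100 cents more than the ball); the notebook's `tricky_problem` text may be phrased differently.

```python
# Work in cents to avoid floating-point rounding.
total_cents = 110       # bat + ball (assumed classic wording)
difference_cents = 100  # bat - ball

# bat = ball + difference, so total = 2 * ball + difference
ball_cents = (total_cents - difference_cents) // 2
bat_cents = ball_cents + difference_cents

# The intuitive answer (ball = 10 cents) breaks the total.
intuitive_ball = 10
intuitive_total = intuitive_ball + (intuitive_ball + difference_cents)

print(f"ball = {ball_cents} cents, bat = {bat_cents} cents")
print(f"intuitive answer would total {intuitive_total} cents")
```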

**2. Did all samples agree? What does disagreement tell you?**

- Simple problems: High agreement (all samples get it right)
- Tricky problems: More disagreement, showing the problem is genuinely difficult
- Disagreement signals you should investigate further or use more samples

**3. What is the trade-off of using more samples?**

- **Pro**: Higher accuracy, more reliable on difficult problems
- **Con**: N times more API calls = N times the cost
- **Con**: N times slower (if run sequentially)
- **Best practice**: Use self-consistency only for high-stakes or known-difficult problems
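
The voting step and the "disagreement as a signal" idea can be sketched as follows. `majority_vote` is a hypothetical helper, not the notebook's `self_consistency_solve`; the sample answers are made up, and the 0.7 agreement threshold is an arbitrary assumption to tune for your use case.

```python
from collections import Counter


def majority_vote(answers):
    """Return (most common answer, fraction of samples that agree with it)."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Made-up answers standing in for five sampled model responses.
samples = ["0.05", "0.10", "0.05", "0.05", "0.10"]
answer, agreement = majority_vote(samples)
print(f"majority: {answer}, agreement: {agreement:.0%}")

# Assumed threshold: below it, treat the result as uncertain.
needs_review = agreement < 0.7
if needs_review:
    print("Low agreement: consider more samples or manual review")
```

With an even number of samples a tie is possible; `Counter.most_common` then returns the tied answer it saw first, so an odd sample count is the simpler choice.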

## Task 3: Constrained JSON Output - SOLUTION

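JSON mode (`response_format={"type": "json_object"}`) constrains the model to emit syntactically valid JSON, which is essential for production applications. It does not by itself guarantee which fields appear, so the parsed result still needs validation.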

### Task 3b Solution: Product Catalog Prompt

```python
import json  # harmless if already imported earlier in the notebook

# SOLUTION: The product catalog prompt is already provided; just run it
product_prompt = """Create a product catalog entry for a laptop:
- Brand: TechPro
- Model: UltraBook X1
- Price: $1299.99
- RAM: 16GB
- Storage: 512GB SSD
- In stock: Yes

Return as JSON."""

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": product_prompt}],
    response_format={"type": "json_object"},
    temperature=0,
    max_tokens=300
)

response = output['choices'][0]['message']['content']
parsed = json.loads(response)
print(json.dumps(parsed, indent=2))

# Example output structure:
# {
#   "brand": "TechPro",
#   "model": "UltraBook X1",
#   "price": 1299.99,
#   "specs": {
#     "ram": "16GB",
#     "storage": "512GB SSD"
#   },
#   "in_stock": true
# }
```

### Questions Answered

**1. Why is guaranteed JSON output important for applications?**

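- **Far fewer parsing errors**: `json.loads()` rarely fails, though a response cut off by `max_tokens` can still be invalid, so keep a fallback
- **Reliable integration**: Other services can consume the output
- **Predictable shape**: Downstream code can rely on getting a JSON object, although the fields themselves still need checking
- **Production-ready**: Much less handling of malformed responses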
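
Even with JSON mode, a response that hits the token limit can end mid-object, so a defensive parse is a reasonable habit. The helper below is a minimal sketch; `parse_json_response` is a hypothetical name, and the two input strings are made-up stand-ins for model output.

```python
import json


def parse_json_response(text):
    """Return (parsed, None) on success or (None, error message) on failure."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg} at position {exc.pos}"
    if not isinstance(parsed, dict):
        return None, f"expected a JSON object, got {type(parsed).__name__}"
    return parsed, None


# A complete response parses cleanly.
ok, err = parse_json_response('{"brand": "TechPro", "price": 1299.99}')
print(ok, err)

# A response truncated by the token limit is reported instead of raising.
truncated, trunc_err = parse_json_response('{"brand": "TechPro", "price": 12')
print(truncated, trunc_err)
```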

**2. What happens if you request fields the model cannot infer from the prompt?**

The model may:
- Generate placeholder values ("N/A", "Unknown")
- Omit the field entirely
- Hallucinate reasonable-sounding but incorrect data

**Best practice**: Only request fields that can be inferred from the input.

**3. How would you handle optional vs required fields?**

```python
required_fields = ["brand", "model", "price"]
optional_fields = ["warranty", "color_options"]

# Validate required
valid, message = validate_schema(parsed, required_fields)
if not valid:
    raise ValueError(message)

# Provide neutral defaults for optional fields (avoid inventing facts such as a warranty period)
parsed.setdefault("warranty", None)
parsed.setdefault("color_options", [])
```
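
The snippet above relies on a `validate_schema` helper defined elsewhere in the task notebook. For readers without that notebook, here is one way such a check could look. `validate_required` is a hypothetical stand-in that only mirrors the `(valid, message)` return shape used above; the real helper may behave differently.

```python
def validate_required(data, required_fields):
    """Return (True, "OK") if every required field is present and not None."""
    missing = [
        field for field in required_fields
        if field not in data or data[field] is None
    ]
    if missing:
        return False, "Missing required fields: " + ", ".join(missing)
    return True, "OK"


# Made-up parsed output that lacks the "price" field.
example = {"brand": "TechPro", "model": "UltraBook X1"}

valid, message = validate_required(example, ["brand", "model", "price"])
print(valid, message)
```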

## Task 4: Prompt Optimization - SOLUTION

Systematic optimization: baseline → add definitions → add examples → measure.

### Solution: Version 3 is already implemented

The task notebook already provides V3 with examples. Students only need to run it and observe the improvements.

```python
# V3 already has good examples:
# - "I can't access my account and think it's been compromised" → high
# - "I was charged for a subscription I cancelled" → medium
# - "Do you ship internationally?" → low

# These examples cover:
# 1. Security issue (high urgency)
# 2. Billing problem (medium urgency)
# 3. Information request (low urgency)

# The examples help the model learn the pattern:
# - Security + account access = high
# - Money involved, no urgent deadline = medium
# - Questions without issues = low
```
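
For reference, a few-shot prompt built from those three examples could be assembled as below. This is a sketch only: `build_urgency_prompt`, the instruction sentence, and the `Ticket:`/`Urgency:` layout are assumptions, and the notebook's actual V3 prompt may be worded differently.

```python
# The three example tickets come from V3; everything else is illustrative.
EXAMPLES = [
    ("I can't access my account and think it's been compromised", "high"),
    ("I was charged for a subscription I cancelled", "medium"),
    ("Do you ship internationally?", "low"),
]


def build_urgency_prompt(ticket, examples=EXAMPLES):
    """Assemble a few-shot classification prompt ending at the open label slot."""
    lines = ["Classify the support ticket's urgency as high, medium, or low.", ""]
    for text, label in examples:
        lines.append(f"Ticket: {text}")
        lines.append(f"Urgency: {label}")
        lines.append("")
    lines.append(f"Ticket: {ticket}")
    lines.append("Urgency:")
    return "\n".join(lines)


prompt = build_urgency_prompt("My invoice shows the wrong billing address")
print(prompt)
```

Ending the prompt at `Urgency:` nudges the model to complete with a single label, which keeps the output easy to compare against the expected answer.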

### Questions Answered

**1. Which improvement (definitions or examples) had a bigger impact?**

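Typical ranges for classification accuracy (exact numbers depend on the model and the test tickets):
- **Definitions**: 10-20% improvement over the baseline (clarifies boundaries)
- **Examples**: A further 5-15% improvement (shows the pattern)
- **Combined**: Best results, roughly 20-30% total improvement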

Definitions usually have more impact because they explicitly state the criteria.
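
To measure the effect on your own run instead of relying on typical ranges, compare each version's accuracy against the labeled tickets. The sketch below uses made-up labels and predictions purely to show the calculation; `accuracy` is a hypothetical helper, not a function from the task notebook.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the expected labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one labeled ticket")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up data: expected labels and the output of two prompt versions.
labels = ["high", "medium", "low", "high", "low"]
v1_preds = ["high", "low", "low", "medium", "low"]     # baseline
v2_preds = ["high", "medium", "low", "medium", "low"]  # with definitions

v1_acc = accuracy(v1_preds, labels)
v2_acc = accuracy(v2_preds, labels)
gain_points = round((v2_acc - v1_acc) * 100)

print(f"V1: {v1_acc:.0%}, V2: {v2_acc:.0%}, gain: {gain_points} percentage points")
```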

**2. Are there tickets that all versions got wrong? What makes them difficult?**

Difficult cases:
- Borderline urgency (medium vs low)
- Context-dependent urgency ("site is down" - high for business, low for personal)
- Multiple issues in one ticket (billing + security)
- Vague language ("having problems" - what kind?)
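
A quick way to find these stubborn tickets is to intersect the sets of errors made by each version. The following is a minimal sketch with made-up labels and predictions; `always_wrong` is a hypothetical helper name.

```python
def always_wrong(labels, predictions_by_version):
    """Return indices of tickets that every prompt version misclassified."""
    wrong_sets = [
        {i for i, (pred, label) in enumerate(zip(preds, labels)) if pred != label}
        for preds in predictions_by_version.values()
    ]
    if not wrong_sets:
        return []
    return sorted(set.intersection(*wrong_sets))


# Made-up data for four tickets and three prompt versions.
labels = ["high", "medium", "low", "high"]
predictions_by_version = {
    "v1": ["high", "low", "low", "medium"],
    "v2": ["high", "medium", "low", "medium"],
    "v3": ["high", "medium", "medium", "low"],
}

hard_cases = always_wrong(labels, predictions_by_version)
print("Tickets every version got wrong:", hard_cases)
```

Reading the tickets at those indices by hand usually shows which of the difficulty types listed above is at work.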

**3. What would you try next to improve further?**

- **More diverse examples**: Cover edge cases and ambiguous tickets
- **Chain-of-Thought**: Ask model to explain its reasoning
- **Self-consistency**: Sample multiple classifications for borderline cases
- **Refine definitions**: Based on failure analysis
- **Add context**: Time of day, customer tier, SLA requirements

## Key Takeaways

1. **PromptBuilder**: Format and Audience components have the biggest impact
2. **Self-Consistency**: Use multiple samples for tricky problems, but balance cost
3. **JSON Constraints**: Essential for production applications that consume LLM output
4. **Optimization**: Systematic testing with definitions + examples yields 20-30% improvement