# Chapter 6: Prompt Engineering - Easy Tasks

This notebook covers basic prompt engineering concepts: temperature effects, prompt components, in-context learning, and chain-of-thought reasoning.

## Setup

Run all cells in this section to set up the environment and load the model.

Before running these cells, review the concepts from the main Chapter 6 notebook (00_Start_Here.ipynb).

### [Optional] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" alt="Google Colab" width=100>

If you are viewing this notebook on Google Colab, uncomment and run the following code to install dependencies.

**Note**: Use a GPU for this notebook. In Google Colab, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4.

In [None]:
# %%capture
# Quote the version specifier so the shell does not treat ">" as a redirect.
# !pip install "transformers>=4.40.0" torch accelerate

### Model Loading

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [None]:
model_path = "microsoft/Phi-3-mini-4k-instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)

### Helper Functions

In [None]:
def generate_text(prompt, temperature=0.7, max_tokens=200):
    """Generate a reply to `prompt`; temperature=0 disables sampling (greedy decoding)."""
    do_sample = temperature > 0
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        return_full_text=False,  # return only the newly generated text
        max_new_tokens=max_tokens,
        do_sample=do_sample,
        # Temperature only applies when sampling, so pass None for greedy decoding.
        temperature=temperature if do_sample else None,
    )

    # Pass the prompt as a single user message in chat format.
    messages = [{"role": "user", "content": prompt}]
    output = pipe(messages)
    return output[0]["generated_text"]

## Challenges

Complete the following tasks by implementing the starter code.

### Level: Easy

**About This Task:**
Temperature controls randomness in generation: lower values make outputs more consistent from run to run, and higher values make them more varied.
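
To make the effect concrete, the sketch below applies temperature to a small set of made-up logits. It assumes the common formulation in which logits are divided by the temperature before the softmax; the logit values and the function name `softmax_with_temperature` are illustrative and are not taken from the model used in this notebook.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive here; 0 means greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

for t in (0.3, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Lower temperatures concentrate probability on the top token, which is why outputs become more repeatable. At temperature=0 the `generate_text` helper turns sampling off altogether.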

#### Easy Task 1: Finding the Right Temperature

### Instructions

1. Execute code to compare temperature effects on three use cases
2. Fill in missing temperature values based on your observations
3. Run determinism test to verify temperature=0 consistency
4. Test with your own prompts
5. Analyze which temperatures work best for different tasks

Here we test three different use cases.

In [None]:
test_prompts = [
    "What is the capital of France?",  # Factual
    "Write the first sentence of a mystery novel.",  # Creative
    "Write a Python function to calculate factorial.",  # Code
]

In [None]:
temperatures = [0.0, 0.3, 0.7, 1.0, 1.5]

Notice how different temperatures affect each use case.

In [None]:
for prompt in test_prompts:
    print(f"\nPrompt: {prompt}")
    print("-" * 70)
    
    for temp in temperatures:
        output = generate_text(prompt, temperature=temp, max_tokens=50)
        print(f"\nTemp={temp}: {output}")

### Task 1a: Select Best Temperature

Based on the outputs above, fill in the best temperature for each use case.

In [None]:
# Fill in: What temperature works best for each task?
best_temp_factual = None  # For "What is the capital of France?"
best_temp_creative = None  # For "Write the first sentence..."
best_temp_code = None  # For "Write a Python function..."

Test your selections here.

In [None]:
print("Testing your temperature selections:")

if best_temp_factual is not None:
    output = generate_text("What is the capital of France?", temperature=best_temp_factual, max_tokens=30)
    print(f"\nFactual (temp={best_temp_factual}): {output}")

if best_temp_creative is not None:
    output = generate_text("Write the first sentence of a mystery novel.", temperature=best_temp_creative, max_tokens=50)
    print(f"\nCreative (temp={best_temp_creative}): {output}")

if best_temp_code is not None:
    output = generate_text("Write a Python function to calculate factorial.", temperature=best_temp_code, max_tokens=100)
    print(f"\nCode (temp={best_temp_code}): {output}")

### Task 1b: Determinism Test

Run this cell multiple times. With temperature=0, `generate_text` disables sampling, so every run should give identical output.

In [None]:
output = generate_text("What is 2+2?", temperature=0, max_tokens=20)
print(f"Output: {output}")
print("\nRun this cell again - you should get the EXACT same output.")
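
Instead of re-running the cell by hand, the check can be automated. This is a minimal sketch, assuming a hypothetical helper `distinct_outputs` that is not part of the notebook; it uses a stand-in generator so it runs without a model.

```python
def distinct_outputs(generate_fn, prompt, runs=3, **kwargs):
    """Call generate_fn(prompt, **kwargs) `runs` times; return the set of distinct outputs."""
    return {generate_fn(prompt, **kwargs).strip() for _ in range(runs)}

# Stand-in generator (an assumption for illustration). In the notebook, pass
# generate_text instead, for example:
# distinct_outputs(generate_text, "What is 2+2?", temperature=0, max_tokens=20)
def fake_generate(prompt, **kwargs):
    return "4"

outputs = distinct_outputs(fake_generate, "What is 2+2?", runs=3)
print(f"{len(outputs)} distinct output(s)")
```

A single distinct output means the runs agreed. The same helper at a higher temperature gives a rough count of how much the outputs vary.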

### Questions

1. At temperature=1.5, did the factual question give wrong answers? Why is determinism critical for factual tasks?

2. For creative writing, compare outputs at temperature=0.3 vs 1.0. Which produced more interesting variations?

3. Did code generation at temperature=1.5 produce valid Python? What's the risk of high temperature for code?

**About This Task:**
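A prompt can have up to seven components: Persona, Instruction, Context, Format, Audience, Tone, and Data. Adding components usually makes the output more specific and better structured, although a shorter prompt is sometimes the better choice.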

#### Easy Task 2: Building a Complete Prompt

### Instructions

1. Run pre-built prompt versions to see incremental improvements
2. Complete `prompt_v5` by adding the missing 3 components
3. Test removing Format to see its impact
4. Create your own scenario
5. Compare output quality as components are added

We start with just an instruction and gradually add components.

In [None]:
# Version 1: Instruction only
prompt_v1 = "Explain how to make coffee."

In [None]:
print("V1: Instruction only")
output = generate_text(prompt_v1, temperature=0, max_tokens=150)
print(output)

In [None]:
# Version 2: + Audience
prompt_v2 = """Explain how to make coffee.

Audience: Someone who has never made coffee before."""

In [None]:
print("V2: + Audience")
output = generate_text(prompt_v2, temperature=0, max_tokens=150)
print(output)

Notice how adding Audience changes the language.

In [None]:
# Version 3: + Format
prompt_v3 = """Explain how to make coffee.

Audience: Someone who has never made coffee before.

Format:
1. Equipment needed
2. Step-by-step instructions
3. Common mistakes"""

In [None]:
print("V3: + Format")
output = generate_text(prompt_v3, temperature=0, max_tokens=200)
print(output)

See how Format structures the output.

In [None]:
# Version 4: + Tone
prompt_v4 = """Explain how to make coffee.

Audience: Someone who has never made coffee before.

Format:
1. Equipment needed
2. Step-by-step instructions
3. Common mistakes

Tone: Friendly and encouraging."""

In [None]:
print("V4: + Tone")
output = generate_text(prompt_v4, temperature=0, max_tokens=200)
print(output)

### Task 2a: Complete Version 5

Your task: Add Persona, Context, and Data to create a complete prompt.

In [None]:
# Fill in: Add the 3 missing components
prompt_v5 = """Persona: [Fill in - who is giving this explanation?]

Explain how to make coffee.

Context: [Fill in - why does the person need to learn this?]

Audience: Someone who has never made coffee before.

Format:
1. Equipment needed
2. Step-by-step instructions
3. Common mistakes

Tone: Friendly and encouraging.

Data: [Fill in - specific details like coffee-to-water ratio]"""

In [None]:
print("V5: All 7 components")
output = generate_text(prompt_v5, temperature=0, max_tokens=250)
print(output)
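
Step 3 of the instructions (removing Format) is easier when the prompt is assembled from named parts. The sketch below assumes a hypothetical helper, `build_prompt`, that is not part of the notebook. The component texts are copied from `prompt_v4`, so Persona, Context, and Data are left for you to fill in.

```python
def build_prompt(components, skip=()):
    """Join prompt components in order, omitting empty ones and any named in `skip`."""
    parts = []
    for name, text in components.items():
        if name in skip or not text:
            continue
        if name == "Instruction":
            parts.append(text)  # the instruction is written without a label
        else:
            # Multi-line components start on the line after their label.
            separator = "\n" if "\n" in text else " "
            parts.append(f"{name}:{separator}{text}")
    return "\n\n".join(parts)

components = {
    "Instruction": "Explain how to make coffee.",
    "Audience": "Someone who has never made coffee before.",
    "Format": "1. Equipment needed\n2. Step-by-step instructions\n3. Common mistakes",
    "Tone": "Friendly and encouraging.",
}

with_format = build_prompt(components)
without_format = build_prompt(components, skip=("Format",))
print(without_format)
```

Passing each result to `generate_text` with the same temperature and token limit isolates the effect of the one component that was removed.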

### Questions

1. Compare V1 and V2 outputs. How did specifying Audience change the language complexity?

2. Which component made the biggest single improvement to output quality?

3. When might you intentionally use fewer components? Give a specific scenario where V1 would be better than V5.

**About This Task:**
In-context learning guides the model with examples placed in the prompt: zero-shot uses no examples, one-shot uses one, and few-shot uses several.
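
The three variants differ only in how many examples precede the query, which the following sketch makes explicit. The helper name `build_classification_prompt` is hypothetical, the example greetings are the ones used in this task's few-shot cell, and the layout omits the "Example:" / "Examples:" header lines that the notebook's own prompts include.

```python
def build_classification_prompt(greeting, examples=()):
    """Build a formality prompt with zero, one, or several worked examples."""
    lines = ["Classify formality: formal, neutral, or casual.", ""]
    for example_greeting, label in examples:
        lines += [f"Greeting: {example_greeting}", f"Formality: {label}", ""]
    lines += [f"Greeting: {greeting}", "Formality:"]
    return "\n".join(lines)

examples = [
    ("Dear Sir or Madam", "formal"),
    ("Yo dude", "casual"),
    ("Hello, how are you", "neutral"),
]

zero_shot = build_classification_prompt("Hi there.")
one_shot = build_classification_prompt("Hi there.", examples[:1])
few_shot = build_classification_prompt("Hi there.", examples)
print(few_shot)
```

Keeping the examples in a list makes it straightforward to add or swap an example and re-run the comparison.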

#### Easy Task 3: Improving Few-Shot Examples

### Instructions

1. Run zero-shot, one-shot, and few-shot on test greetings
2. Identify which greetings cause disagreement
3. Improve the few-shot prompt by adding better examples
4. Test edge cases
5. Analyze why certain examples improve accuracy

In [None]:
test_greetings = [
    "Good morning, how may I assist you?",
    "Hey, what's up?",
    "Hello, nice to meet you.",
    "Hi there.",
    "Dear valued customer,",  # Very formal
    "Yo!",  # Very casual
]

### Zero-Shot

Here we ask the model to classify without any examples.

In [None]:
print("Zero-shot classification:")
zero_results = {}

for greeting in test_greetings:
    prompt = f"""Classify formality: formal, neutral, or casual.

Greeting: {greeting}
Formality:"""
    
    result = generate_text(prompt, temperature=0, max_tokens=10).strip()
    zero_results[greeting] = result
    print(f"{greeting} -> {result}")

### One-Shot

See how a single example helps guide the model.

In [None]:
print("One-shot classification:")
one_results = {}

for greeting in test_greetings:
    prompt = f"""Classify formality: formal, neutral, or casual.

Example:
Greeting: Dear Sir or Madam
Formality: formal

Greeting: {greeting}
Formality:"""
    
    result = generate_text(prompt, temperature=0, max_tokens=10).strip()
    one_results[greeting] = result
    print(f"{greeting} -> {result}")

### Few-Shot

Your task: Improve this prompt by adding 1-2 more examples to handle edge cases better.

In [None]:
print("Few-shot classification:")
few_results = {}

for greeting in test_greetings:
    # Fill in: Add 1-2 more examples after "Hello, how are you"
    prompt = f"""Classify formality: formal, neutral, or casual.

Examples:

Greeting: Dear Sir or Madam
Formality: formal

Greeting: Yo dude
Formality: casual

Greeting: Hello, how are you
Formality: neutral

[Add 1-2 more examples here]

Greeting: {greeting}
Formality:"""
    
    result = generate_text(prompt, temperature=0, max_tokens=10).strip()
    few_results[greeting] = result
    print(f"{greeting} -> {result}")

### Comparison

Here we identify disagreements to see where examples help most.

In [None]:
disagreements = []

for greeting in test_greetings:
    zero = zero_results[greeting]
    one = one_results[greeting]
    few = few_results[greeting]
    
    print(f"\n{greeting}")
    print(f"  Zero-shot: {zero}")
    print(f"  One-shot:  {one}")
    print(f"  Few-shot:  {few}")
    
    if zero == one == few:
        print("  All agree")
    else:
        print("  DISAGREEMENT")
        disagreements.append(greeting)

Notice which greetings benefit most from examples.

In [None]:
print(f"\n{len(disagreements)} greetings showed disagreement:")
for g in disagreements:
    print(f"  - {g}")
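
The comparison above matches raw strings, so replies such as "Formal" and "formal." count as a disagreement even though they name the same class. One possible remedy is to normalize each reply to a known label first. The helper `normalize_label` below is a hypothetical sketch, and the sample replies are made up rather than real model output.

```python
import re

LABELS = ("formal", "neutral", "casual")
LABEL_PATTERN = re.compile(r"\b(" + "|".join(LABELS) + r")\b")

def normalize_label(raw):
    """Return the first known label found in the reply, or None if there is none."""
    match = LABEL_PATTERN.search(raw.lower())
    return match.group(1) if match else None

# Made-up replies that illustrate typical formatting noise.
for raw in ["Formal", " casual.", "Formality: neutral", "informal"]:
    print(repr(raw), "->", normalize_label(raw))
```

Because the pattern matches whole words only, "informal" is not mistaken for "formal". Replies that normalize to `None` are worth inspecting by hand.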

### Questions

1. Which greeting showed the biggest difference between zero-shot and few-shot? Why was it ambiguous?

2. Did adding more examples improve accuracy on edge cases like "Yo!" or "Dear valued customer"?

3. What makes a good few-shot example? Should you show edge cases or clear typical examples?

**About This Task:**
Chain-of-Thought (CoT) prompting asks the model to show its reasoning step by step, which often improves accuracy on multi-step problems.

#### Easy Task 4: Testing Chain-of-Thought

### Instructions

1. Run direct prompting on simple and tricky problems
2. Compare with few-shot CoT to see reasoning improvements
3. Test zero-shot CoT on hard problems
4. Improve CoT examples to fix errors
5. Analyze when step-by-step reasoning prevents mistakes

We test on both simple problems and counter-intuitive ones.

In [None]:
problems = [
    ("If John has 5 apples and gives 2 to Mary, how many does he have?", 3, "easy"),
    ("A ticket costs $15. I buy 3 tickets with a $50 bill. How much change?", 5, "easy"),
    ("A bat and ball cost $1.10 total. The bat costs $1 more than the ball. How much is the ball?", 0.05, "tricky"),
]

### Direct Prompting

Here we ask for answers directly without reasoning.

In [None]:
print("Direct prompting (no reasoning):")

for question, correct, difficulty in problems:
    prompt = f"{question}\nAnswer:"
    answer = generate_text(prompt, temperature=0, max_tokens=30)
    
    print(f"\n[{difficulty.upper()}] {question}")
    print(f"Model: {answer.strip()}")
    print(f"Correct: {correct}")

Notice how direct prompting might fail on the tricky problem.
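
For reference, here is the algebra behind the tricky problem, writing $b$ for the price of the ball in dollars (notation introduced here for illustration):

$$
\begin{aligned}
b + (b + 1.00) &= 1.10 \\
2b &= 0.10 \\
b &= 0.05
\end{aligned}
$$

The intuitive answer of 10 cents fails this check: the bat would then cost 1.10 dollars and the total would be 1.20 dollars.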

### Few-Shot CoT

Your task: Improve the prompt by adding a third example showing careful algebra.

In [None]:
print("Few-shot CoT:")

for question, correct, difficulty in problems:
    # Fill in: Add a third example to help with the tricky problem
    prompt = f"""Solve step-by-step.

Q: Roger has 5 balls. He buys 2 cans with 3 balls each. How many balls does he have?
A: Roger starts with 5 balls.
He buys 2 cans, each has 3 balls.
New balls: 2 × 3 = 6
Total: 5 + 6 = 11
Answer: 11

Q: A cafe had 23 apples. They used 20 for lunch and bought 6 more. How many now?
A: Start with 23 apples.
After using 20: 23 - 20 = 3
After buying 6: 3 + 6 = 9
Answer: 9

[Add another example showing careful math]

Q: {question}
A:"""
    
    answer = generate_text(prompt, temperature=0, max_tokens=150)
    
    print(f"\n[{difficulty.upper()}] {question}")
    print(f"Reasoning: {answer}")
    print(f"Correct: {correct}")

See how showing reasoning steps helps catch mistakes.

### Zero-Shot CoT

Here we use the phrase "Let's think step-by-step" to trigger reasoning without examples.

In [None]:
print("Zero-shot CoT:")

for question, correct, difficulty in problems:
    prompt = f"{question}\n\nLet's think step-by-step:"
    answer = generate_text(prompt, temperature=0, max_tokens=150)
    
    print(f"\n[{difficulty.upper()}] {question}")
    print(f"Reasoning: {answer}")
    print(f"Correct: {correct}")

Notice how a simple phrase triggers step-by-step reasoning.
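
Comparing each reasoning trace with the `correct` value by eye becomes tedious as the problem list grows. The sketch below assumes that the final answer is the last number in the model's reply, which is a heuristic rather than a guarantee. The helper names are hypothetical and the sample replies are made up.

```python
import re

def extract_final_number(text):
    """Return the last non-negative number in text as a float, or None."""
    numbers = re.findall(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(numbers[-1]) if numbers else None

def is_correct(model_output, expected, tol=1e-6):
    """Check whether the last number in the reply matches the expected answer."""
    value = extract_final_number(model_output)
    return value is not None and abs(value - expected) <= tol

# Made-up replies paired with the expected answers from the problems list.
samples = [
    ("Total: 5 - 2 = 3\nAnswer: 3", 3),
    ("The ball costs $0.10.", 0.05),
]
for text, expected in samples:
    print(f"{text!r} -> correct: {is_correct(text, expected)}")
```

The heuristic ignores units and signs: a reply of "5 cents" would be read as 5 rather than 0.05, and negative answers are not handled, so flagged results still need a manual check.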

### Questions

1. Did direct prompting get the bat-and-ball problem wrong? If so, did it give the common wrong answer ($0.10)?

2. Compare few-shot CoT vs zero-shot CoT on the tricky problem. Which caught the mistake better?

3. What type of problems benefit most from CoT? When is direct prompting good enough?