# 📊 Part 6: Evaluation Methods

**The Core Question:** How do you know if fine-tuning actually worked?

---

## What We'll Cover

| Method | Description | Difficulty |
|--------|-------------|------------|
| Perplexity | Mathematical measure of model confidence | Easy |
| Generation Comparison | Side-by-side output comparison | Easy |
| Automated Metrics | Task-specific measurements | Medium |
| LLM-as-Judge | Use a stronger model to score responses | Medium |
| A/B Testing | Pairwise comparison | Medium |

---

## The Evaluation Hierarchy

```
                    ┌─────────────────┐
                    │ Human Evaluation │ ← Gold standard (expensive)
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │  LLM-as-Judge   │ ← Scalable approximation
                    └────────┬────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
┌────────▼────────┐ ┌────────▼────────┐ ┌────────▼────────┐
│   Perplexity    │ │   Generation    │ │   Benchmarks    │
│   (automatic)   │ │   Comparison    │ │   (if available)│
└─────────────────┘ └─────────────────┘ └─────────────────┘
```

---

## ⚠️ Important: Fresh Runtime

**Before running this notebook:**
1. Go to `Runtime` → `Restart runtime` (or `Disconnect and delete runtime`)
2. This clears any previous memory usage

---

# Part A: Setup & Training

First, let's train a model so we have something to evaluate!

In [None]:
import torch
import math
import gc
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, PeftModel, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer

# Check GPU
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    total_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"   Total Memory: {total_mem:.1f} GB")

    # Show current usage
    allocated = torch.cuda.memory_allocated() / 1e9
    print(f"   Currently Used: {allocated:.2f} GB")
else:
    print("❌ No GPU - this notebook requires a GPU!")

✅ GPU: Tesla T4
   Total Memory: 15.8 GB
   Currently Used: 0.00 GB


In [None]:
# Configuration
MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
OUTPUT_DIR = "./eval_adapter"

print(f"Base model: {MODEL_ID}")
print(f"Output: {OUTPUT_DIR}")

Base model: Qwen/Qwen2.5-0.5B-Instruct
Output: ./eval_adapter


## Load Model with 4-bit Quantization

To save memory on T4, we'll use 4-bit quantization.

In [None]:
# 4-bit quantization config for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

print("✅ 4-bit quantization config ready")

✅ 4-bit quantization config ready
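To see why 4-bit loading helps, here is a rough back-of-the-envelope sketch of weight memory. It assumes the parameter counts that `print_trainable_parameters()` reports later in this notebook, and it ignores activations, optimizer state, and quantization metadata, so treat the numbers as estimates only.

```python
# Rough weight-memory estimate (weights only, no activations or optimizer state).
# BASE_PARAMS is "all params" minus the LoRA params reported later in this notebook.
BASE_PARAMS = 494_573_440 - 540_672


def weight_gb(num_params: int, bits_per_param: float) -> float:
    """Memory needed to store the weights, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_gb(BASE_PARAMS, 16)
nf4_gb = weight_gb(BASE_PARAMS, 4)

print(f"fp16 weights : {fp16_gb:.2f} GB")
print(f"4-bit weights: {nf4_gb:.2f} GB")
```

The GPU memory measured after loading will usually be higher than the 4-bit estimate. One plausible reason is that some tensors (for example embeddings) are not quantized, but that is not verified here.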


In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load model with 4-bit quantization
print("Loading model with 4-bit quantization...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

print(f"✅ Model loaded")
print(f"   GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")

Loading model with 4-bit quantization...


✅ Model loaded
   GPU Memory: 0.73 GB


In [None]:
# Load dataset
dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")

# Split into train and eval
split_dataset = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split_dataset["train"]
eval_dataset = split_dataset["test"]

print(f"✅ Train: {len(train_dataset)} examples")
print(f"✅ Eval:  {len(eval_dataset)} examples")

✅ Train: 900 examples
✅ Eval:  100 examples


## Training with Memory-Efficient Settings

In [None]:
# LoRA Configuration - smaller for memory efficiency
lora_config = LoraConfig(
    r=8,                  # Reduced from 16
    lora_alpha=16,        # Reduced from 32
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # Fewer modules
    task_type="CAUSAL_LM",
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 540,672 || all params: 494,573,440 || trainable%: 0.1093
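The trainable-parameter count can be reproduced by hand. The sketch below assumes the usual Qwen2.5-0.5B dimensions (hidden size 896, 24 layers, 2 key/value heads of size 64); these values are not printed anywhere in this notebook, so check `model.config` before relying on them.

```python
# Assumed model dimensions (not shown in this notebook; verify with model.config).
HIDDEN_SIZE = 896
NUM_LAYERS = 24
NUM_KV_HEADS = 2
HEAD_DIM = 64
RANK = 8  # r from lora_config


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds two matrices per layer: A (rank x in) and B (out x rank)."""
    return rank * in_features + out_features * rank


q_proj = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
v_proj = lora_params(HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM, RANK)
total = NUM_LAYERS * (q_proj + v_proj)

print(f"q_proj per layer: {q_proj:,}")
print(f"v_proj per layer: {v_proj:,}")
print(f"total trainable : {total:,}")
```

Under these assumptions the total comes out to 540,672, matching the count reported above.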


In [None]:
# Memory-efficient training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=1,
    per_device_train_batch_size=2,      # Reduced from 4
    gradient_accumulation_steps=4,       # Increased to compensate
    learning_rate=2e-4,
    fp16=True,                           # T4 has no native bf16; matches bnb_4bit_compute_dtype
    logging_steps=25,
    save_strategy="epoch",
    optim="adamw_torch",
    warmup_ratio=0.03,
    gradient_checkpointing=True,         # Saves memory!
    max_grad_norm=0.3,
)

print("✅ Training arguments configured")
print(f"   Effective batch size: 2 x 4 = {2 * 4}")

✅ Training arguments configured
   Effective batch size: 2 x 4 = 8


In [None]:
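# Trainer (the 256-token cap below is currently commented out, so it is not applied)
MAX_SEQ_LENGTH = 256  # only used if the argument below is re-enabled
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    # max_seq_length=MAX_SEQ_LENGTH,  # reduced from 512; disabled in this run
)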

print("✅ Trainer ready")
print(f"   GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")

✅ Trainer ready
   GPU Memory: 0.75 GB




In [None]:
# Train!
print("🚀 Starting training...")
print("   This will take ~3-5 minutes on T4\n")

trainer.train()

print("\n✅ Training complete!")

🚀 Starting training...
   This will take ~3-5 minutes on T4




✅ Training complete!


In [None]:
# Save the adapter
trainer.save_model(OUTPUT_DIR)
print(f"✅ Adapter saved to {OUTPUT_DIR}")

✅ Adapter saved to ./eval_adapter


## Reload for Evaluation

Now we'll reload a fresh model (without quantization) for accurate evaluation.

In [None]:
# Reload base model (fp16, no quantization for accurate eval)
print("Loading model for evaluation...")

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load fine-tuned model (base + adapter)
print("Loading fine-tuned adapter...")
finetuned_model = PeftModel.from_pretrained(
    base_model,
    OUTPUT_DIR,
)

print(f"\n✅ Model ready for evaluation!")
print(f"   GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")

`torch_dtype` is deprecated! Use `dtype` instead!


Loading model for evaluation...
Loading fine-tuned adapter...

✅ Model ready for evaluation!
   GPU Memory: 1.75 GB


---

# Part B: Evaluation Methods

Now let's evaluate our fine-tuned model using multiple methods!

---

# Method 1: Perplexity

## The Math

**Loss** (Cross-Entropy):
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log P(x_i | x_{<i})$$

**Perplexity**:
$$\text{PPL} = e^{\mathcal{L}}$$

## Intuition

```
Perplexity = 1:    Perfect prediction (impossible in practice)
Perplexity = 10:   As uncertain as picking uniformly among ~10 tokens
Perplexity = 100:  Model very uncertain
Perplexity = 1000: Model is basically guessing
```
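A tiny worked example makes the formula concrete. This sketch uses made-up token probabilities (not model output) and shows that a model assigning probability 0.1 to every correct token has a perplexity of about 10.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability given to the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# Every correct token gets probability 0.1: like choosing among 10 options.
uniform_10 = perplexity([0.1] * 20)

# Mixed confidence: one badly predicted token (0.05) pulls perplexity up.
mixed = perplexity([0.9, 0.5, 0.05, 0.2])

print(f"uniform over 10 options: {uniform_10:.2f}")
print(f"mixed confidence       : {mixed:.2f}")
```

Perplexity is the inverse of the geometric mean of those probabilities, which is why a single low-probability token has an outsized effect.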

## What's Good?

| Model Type | Typical Perplexity |
|------------|-------------------|
| State-of-the-art LLM | 5-15 |
| Good fine-tuned model | 10-30 |
| Average model | 30-100 |
| Poor model | 100+ |

**Important:** Perplexity depends heavily on the evaluation dataset and the tokenizer, so only compare numbers measured on the same data with the same tokenizer!

In [None]:
def calculate_perplexity(model, tokenizer, texts, max_length=256):
    """
    Calculate perplexity on a list of texts.

    Perplexity = exp(average_loss)
    Lower is better.
    """
    model.eval()
    total_loss = 0
    total_tokens = 0

    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(
                text,
                return_tensors="pt",
                truncation=True,
                max_length=max_length
            ).to(model.device)

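            # With labels given, the model shifts targets internally and returns the
            # mean loss over the predicted tokens (one fewer than the input length).
            outputs = model(**inputs, labels=inputs["input_ids"])

            # Weight each text's mean loss by its token count so long texts count more.
            # Using the full input length is a close approximation of the predicted-token count.
            num_tokens = inputs["input_ids"].numel()
            total_loss += outputs.loss.item() * num_tokens
            total_tokens += num_tokens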

    avg_loss = total_loss / total_tokens
    perplexity = math.exp(avg_loss)

    return {
        "perplexity": perplexity,
        "avg_loss": avg_loss,
        "total_tokens": total_tokens
    }

print("✅ calculate_perplexity() defined")

✅ calculate_perplexity() defined
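Why weight each text's loss by its token count? The sketch below uses two hypothetical texts (made-up losses and lengths) to show how a plain mean of per-text losses lets one short, hard text dominate the result.

```python
import math


def corpus_perplexity(mean_losses, token_counts):
    """Token-weighted perplexity: every token counts equally."""
    total_loss = sum(loss * n for loss, n in zip(mean_losses, token_counts))
    return math.exp(total_loss / sum(token_counts))


def naive_perplexity(mean_losses):
    """Unweighted mean of per-text losses: every text counts equally."""
    return math.exp(sum(mean_losses) / len(mean_losses))


# Hypothetical data: a 10-token text with high loss and a 990-token text with low loss.
losses = [3.0, 1.0]
counts = [10, 990]

weighted = corpus_perplexity(losses, counts)
naive = naive_perplexity(losses)

print(f"token-weighted: {weighted:.2f}")
print(f"naive mean    : {naive:.2f}")
```

The weighted version is the one that matches the per-token definition of loss, which is what `calculate_perplexity()` approximates.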


In [None]:
# Prepare evaluation texts
eval_texts = [item["text"] for item in eval_dataset]

print(f"📊 Evaluating on {len(eval_texts)} texts...")

📊 Evaluating on 100 texts...


In [None]:
# Calculate perplexity for FINE-TUNED model (adapter enabled)
print("="*50)
print("FINE-TUNED MODEL (with adapter)")
print("="*50)

finetuned_ppl = calculate_perplexity(finetuned_model, tokenizer, eval_texts)
print(f"Perplexity: {finetuned_ppl['perplexity']:.2f}")
print(f"Avg Loss:   {finetuned_ppl['avg_loss']:.4f}")
print(f"Tokens:     {finetuned_ppl['total_tokens']:,}")

FINE-TUNED MODEL (with adapter)
Perplexity: 7.68
Avg Loss:   2.0381
Tokens:     21,746


In [None]:
# Calculate BASE model perplexity (disable adapter)
print("\n" + "="*50)
print("BASE MODEL (adapter disabled)")
print("="*50)

# PEFT's disable_adapter() context manager temporarily bypasses the LoRA weights,
# so the same model object behaves like the base model inside the block
with finetuned_model.disable_adapter():
    base_ppl = calculate_perplexity(finetuned_model, tokenizer, eval_texts)

print(f"Perplexity: {base_ppl['perplexity']:.2f}")
print(f"Avg Loss:   {base_ppl['avg_loss']:.4f}")


BASE MODEL (adapter disabled)
Perplexity: 9.01
Avg Loss:   2.1983


In [None]:
# Comparison
print("\n" + "="*50)
print("PERPLEXITY COMPARISON")
print("="*50)

ppl_change = ((finetuned_ppl['perplexity'] - base_ppl['perplexity']) / base_ppl['perplexity']) * 100

print(f"\n{'Model':<20} {'Perplexity':<15} {'Loss':<10}")
print("-" * 45)
print(f"{'Base Model':<20} {base_ppl['perplexity']:<15.2f} {base_ppl['avg_loss']:<10.4f}")
print(f"{'Fine-tuned':<20} {finetuned_ppl['perplexity']:<15.2f} {finetuned_ppl['avg_loss']:<10.4f}")
print("-" * 45)
print(f"{'Change':<20} {ppl_change:+.1f}%")

if ppl_change < 0:
    print("\n✅ Fine-tuning IMPROVED perplexity (lower is better)")
else:
    print("\n⚠️ Fine-tuning INCREASED perplexity")
    print("   This can happen if the model specialized on different content.")


PERPLEXITY COMPARISON

Model                Perplexity      Loss      
---------------------------------------------
Base Model           9.01            2.1983    
Fine-tuned           7.68            2.0381    
---------------------------------------------
Change               -14.8%

✅ Fine-tuning IMPROVED perplexity (lower is better)
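Because perplexity is `exp(loss)`, the relative change in perplexity follows directly from the difference in loss. Here is a quick check using the two average losses printed above.

```python
import math

# Average losses copied from the outputs above.
base_loss = 2.1983
finetuned_loss = 2.0381

# PPL_ft / PPL_base = exp(loss_ft) / exp(loss_base) = exp(loss_ft - loss_base)
ratio = math.exp(finetuned_loss - base_loss)
change_pct = (ratio - 1) * 100

print(f"Perplexity ratio: {ratio:.3f}")
print(f"Change          : {change_pct:+.1f}%")
```

A loss drop of about 0.16 nats corresponds to the roughly 15% perplexity reduction shown in the comparison table.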


---

# Method 2: Generation Comparison

**The most intuitive evaluation:** Look at actual outputs!

```
┌─────────────────┐     ┌─────────────────┐
│   Same Prompt   │────►│   Base Model    │────► Response A
│                 │     └─────────────────┘
│                 │     ┌─────────────────┐
│                 │────►│ Fine-tuned Model│────► Response B
└─────────────────┘     └─────────────────┘
                                │
                                ▼
                        Compare A vs B
```

In [None]:
def generate_response(model, tokenizer, prompt, max_new_tokens=150):
    """
    Generate a response from the model.
    """
    messages = [{"role": "user", "content": prompt}]

    try:
        formatted = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
    except Exception:  # fall back to a plain prompt if no chat template is available
        formatted = f"User: {prompt}\nAssistant:"

    inputs = tokenizer(formatted, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.pad_token_id,
        )

    response = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:],
        skip_special_tokens=True
    )

    return response.strip()

print("✅ generate_response() defined")

✅ generate_response() defined


In [None]:
# Test prompts
test_prompts = [
    "Explain machine learning to a 10-year-old.",
    "What are three tips for being more productive?",
    "Write a short poem about coding.",
    "How do I make a simple HTTP request in Python?",
    "What's the difference between AI and machine learning?"
]

print(f"📝 {len(test_prompts)} test prompts ready")

📝 5 test prompts ready


In [None]:
# Generate and compare
print("="*70)
print("GENERATION COMPARISON: BASE vs FINE-TUNED")
print("="*70)

base_responses = []
finetuned_responses = []

for i, prompt in enumerate(test_prompts):
    print(f"\n{'─'*70}")
    print(f"📝 Prompt {i+1}: {prompt}")
    print(f"{'─'*70}")

    # Base model (adapter disabled)
    with finetuned_model.disable_adapter():
        base_resp = generate_response(finetuned_model, tokenizer, prompt)
    base_responses.append(base_resp)

    # Fine-tuned model (adapter enabled) - no context manager needed, it's the default
    ft_resp = generate_response(finetuned_model, tokenizer, prompt)
    finetuned_responses.append(ft_resp)

    print(f"\n🔵 Base Model:")
    print(base_resp[:400] + "..." if len(base_resp) > 400 else base_resp)

    print(f"\n🟢 Fine-tuned Model:")
    print(ft_resp[:400] + "..." if len(ft_resp) > 400 else ft_resp)

GENERATION COMPARISON: BASE vs FINE-TUNED

──────────────────────────────────────────────────────────────────────
📝 Prompt 1: Explain machine learning to a 10-year-old.
──────────────────────────────────────────────────────────────────────

🔵 Base Model:
Sure! Machine learning is like a magical game where computers learn from data and do better at guessing what's happening without being told exactly how to guess.

Imagine you have a toy car that can move around on the floor. If I tell it to go forward for three times, the computer will remember that it went forward once and then another time, just like how you might remember going to bed and wakin...

🟢 Fine-tuned Model:
Sure! Machine Learning is like when you have a toy car that can learn how to drive itself in the park. It's like teaching your car what to do without being told every single time.

Imagine your toy car has a memory of where it's been before and how it got there. Then one day, you put the car on the road and ask it to g

---

# Method 3: Response Statistics

In [None]:
def analyze_responses(responses):
    """Calculate basic statistics about generated responses."""
    lengths = [len(r.split()) for r in responses]
    unique_words = [len(set(r.lower().split())) for r in responses]

    return {
        "avg_length": sum(lengths) / len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_unique_words": sum(unique_words) / len(unique_words),
        "lexical_diversity": sum(unique_words) / sum(lengths) if sum(lengths) > 0 else 0
    }

# Analyze both
print("📊 Response Statistics\n")
print(f"{'Metric':<25} {'Base':<15} {'Fine-tuned':<15}")
print("-" * 55)

base_stats = analyze_responses(base_responses)
ft_stats = analyze_responses(finetuned_responses)

for key in base_stats:
    print(f"{key:<25} {base_stats[key]:<15.2f} {ft_stats[key]:<15.2f}")

📊 Response Statistics

Metric                    Base            Fine-tuned     
-------------------------------------------------------
avg_length                111.20          109.80         
min_length                62.00           81.00          
max_length                135.00          133.00         
avg_unique_words          82.40           81.40          
lexical_diversity         0.74            0.74           
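To get a feel for what the lexical diversity number means, here is a small sketch on two made-up sentences. It uses the same unique-words over total-words idea as `analyze_responses()`, applied to a single string; the helper name `type_token_ratio` is ours.

```python
def type_token_ratio(text: str) -> float:
    """Unique words divided by total words (whitespace tokenization, lowercased)."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


varied = type_token_ratio("the model learns patterns from data")
repetitive = type_token_ratio("the the the model model learns")

print(f"varied    : {varied:.2f}")
print(f"repetitive: {repetitive:.2f}")
```

Splitting on whitespace keeps punctuation attached to words, so `data` and `data.` count as different words. That is fine for a quick comparison but worth remembering when reading the numbers.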


---

# Method 4: LLM-as-Judge

Use a stronger model (GPT-4, Claude) to evaluate responses.

In [None]:
def create_judge_prompt(question, response):
    """Create a prompt for LLM-as-Judge evaluation."""
    return f"""You are evaluating an AI assistant's response. Rate it on three criteria.

## Question Asked:
{question}

## AI Response:
{response}

## Evaluation Criteria:

1. **Helpfulness** (1-10): Does it actually answer the question?
2. **Accuracy** (1-10): Is the information correct?
3. **Clarity** (1-10): Is it easy to understand?

## Your Evaluation:

Helpfulness: [score]/10 - [reason]
Accuracy: [score]/10 - [reason]
Clarity: [score]/10 - [reason]

Overall: [average]/10
"""

# Example
print("📝 Example Judge Prompt:")
print("="*60)
print(create_judge_prompt(test_prompts[0], finetuned_responses[0][:200] + "..."))

📝 Example Judge Prompt:
You are evaluating an AI assistant's response. Rate it on three criteria.

## Question Asked:
Explain machine learning to a 10-year-old.

## AI Response:
Sure! Machine Learning is like when you have a toy car that can learn how to drive itself in the park. It's like teaching your car what to do without being told every single time.

Imagine your toy ca...

## Evaluation Criteria:

1. **Helpfulness** (1-10): Does it actually answer the question?
2. **Accuracy** (1-10): Is the information correct?
3. **Clarity** (1-10): Is it easy to understand?

## Your Evaluation:

Helpfulness: [score]/10 - [reason]
Accuracy: [score]/10 - [reason]
Clarity: [score]/10 - [reason]

Overall: [average]/10



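**To use this prompt:**
- Paste it into the ChatGPT or Claude web interface (free)
- Or send it through the OpenAI or Anthropic API

The text below is the full generation output for all five prompts, followed by the criteria and the instruction given to the judge. The judge's scores come after it.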
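Once the judge replies in the format requested by the prompt, the scores can be pulled out with a regular expression. This is a minimal sketch: the helper name `parse_judge_scores` and the sample reply are made up for illustration, and real judge output may deviate from the template.

```python
import re

# Matches lines such as "Helpfulness: 8/10 - reason"
SCORE_PATTERN = re.compile(
    r"^(Helpfulness|Accuracy|Clarity):\s*(\d+(?:\.\d+)?)\s*/\s*10",
    re.MULTILINE,
)


def parse_judge_scores(judge_text: str) -> dict:
    """Extract the three criterion scores and compute their average."""
    scores = {name.lower(): float(value) for name, value in SCORE_PATTERN.findall(judge_text)}
    if scores:
        scores["overall"] = sum(scores.values()) / len(scores)
    return scores


# Hypothetical judge reply, written by hand for this example.
sample_reply = """Helpfulness: 8/10 - answers the question
Accuracy: 7/10 - mostly correct
Clarity: 8/10 - easy to follow"""

parsed = parse_judge_scores(sample_reply)
print(parsed)
```

Recomputing the overall score yourself avoids depending on the judge's own arithmetic.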

======================================================================
GENERATION COMPARISON: BASE vs FINE-TUNED
======================================================================

──────────────────────────────────────────────────────────────────────
📝 Prompt 1: Explain machine learning to a 10-year-old.
──────────────────────────────────────────────────────────────────────

🔵 Base Model:
Sure! Machine learning is like a magical game where computers learn from data and do better at guessing what's happening without being told exactly how to guess.

Imagine you have a toy car that can move around on the floor. If I tell it to go forward for three times, the computer will remember that it went forward once and then another time, just like how you might remember going to bed and wakin...

🟢 Fine-tuned Model:
Sure! Machine Learning is like when you have a toy car that can learn how to drive itself in the park. It's like teaching your car what to do without being told every single time.

Imagine your toy car has a memory of where it's been before and how it got there. Then one day, you put the car on the road and ask it to go somewhere else. The car will try its best to figure out the path and get there...

──────────────────────────────────────────────────────────────────────
📝 Prompt 2: What are three tips for being more productive?
──────────────────────────────────────────────────────────────────────

🔵 Base Model:
1. Prioritize tasks: Identify what needs to be done first and then tackle those tasks in order of importance.
2. Set goals: Define clear, measurable objectives that you want to achieve and break them down into smaller, manageable steps.
3. Take breaks: Regularly take short breaks to refresh your mind and prevent burnout. This can help improve focus and productivity over time.

🟢 Fine-tuned Model:
1. Prioritize tasks: Start your day with a clear understanding of what needs to be accomplished and prioritize the most important ones first.
2. Use a planner or schedule: Keep track of deadlines, appointments, and other commitments so that you can stay organized and on top of things.
3. Break down large tasks into smaller steps: Break larger tasks into smaller, manageable steps and set specific g...

──────────────────────────────────────────────────────────────────────
📝 Prompt 3: Write a short poem about coding.
──────────────────────────────────────────────────────────────────────

🔵 Base Model:
Coding is the code of life,
A language that connects us all.
It's the magic word that makes our world run smoothly,
And it's the bridge we cross to connect with others.

Coding is like solving puzzles,
Finding solutions that fit together perfectly.
It's like making art or building houses,
Each step a new piece of the puzzle to solve.

Coding is the foundation for innovation and progress,
The abili...

🟢 Fine-tuned Model:
In the heart of code's realm,
A world where programs and algorithms dance.
From simple to complex, each line is a story,
Where logic meets mind, and ideas bloom.

The lines flow like rivers through stone,
Through loops and circuits, they weave.
Each step a calculation, each decision made,
As numbers and symbols play their part.

But in this realm of programming codes,
There lies beauty, and wonder...

──────────────────────────────────────────────────────────────────────
📝 Prompt 4: How do I make a simple HTTP request in Python?
──────────────────────────────────────────────────────────────────────

🔵 Base Model:
To create and send an HTTP GET request in Python, you can use the `requests` library. First, ensure that you have it installed in your environment. If not, you can install it using pip:

```bash
pip install requests
```

Here's a simple example of how to make a basic GET request with the `requests` library:

1. **Install the `requests` library**: As mentioned earlier, you need to install the `requ...

🟢 Fine-tuned Model:
To make a simple HTTP request in Python, you can use the `requests` library. Here's an example of how to send a GET request using this library:

```python
import requests

# Define your URL and data (if needed)
url = "http://example.com"
data = {"key": "value"}

# Send a GET request with specified headers and data
response = requests.get(url, params=data)

# Check if the response was successful
if...
```

──────────────────────────────────────────────────────────────────────
📝 Prompt 5: What's the difference between AI and machine learning?
──────────────────────────────────────────────────────────────────────

🔵 Base Model:
AI (Artificial Intelligence) is the umbrella term for the research and development of intelligent machines that can perform tasks typically associated with humans. It encompasses various applications such as speech recognition, natural language processing, robotics, autonomous vehicles, virtual assistants, and more.

Machine Learning (ML), on the other hand, is a subset of AI that focuses specific...

🟢 Fine-tuned Model:
AI (Artificial Intelligence) is a broad field that encompasses the creation of intelligent machines that can perform tasks requiring intelligence, such as reasoning, problem-solving, decision-making, creativity, learning, self-correction, self-improvement, and communication.

Machine learning, on the other hand, is a subset of AI that focuses on developing algorithms for computers to learn from da...



## Evaluation Criteria:

1. **Helpfulness** (1-10): Does it actually answer the question?
2. **Accuracy** (1-10): Is the information correct?
3. **Clarity** (1-10): Is it easy to understand?

## Your Evaluation:

Helpfulness: [score]/10 - [reason]
Accuracy: [score]/10 - [reason]
Clarity: [score]/10 - [reason]

Overall: [average]/10

Give the results for each response and calculate the overall results for the base model and the fine-tuned model.


📝 Prompt 1

Explain machine learning to a 10-year-old

| Model          | Helpfulness | Accuracy | Clarity | Overall    |
| -------------- | ----------- | -------- | ------- | ---------- |
| **Base**       | 7/10        | 7/10     | 6/10    | **6.7/10** |
| **Fine-tuned** | 8/10        | 7/10     | 8/10    | **7.7/10** |


📝 Prompt 2

Three tips for being more productive

| Model          | Helpfulness | Accuracy | Clarity | Overall    |
| -------------- | ----------- | -------- | ------- | ---------- |
| **Base**       | 8/10        | 9/10     | 9/10    | **8.7/10** |
| **Fine-tuned** | 7/10        | 9/10     | 7/10    | **7.7/10** |


📝 Prompt 3

Short poem about coding

| Model          | Helpfulness | Accuracy | Clarity | Overall    |
| -------------- | ----------- | -------- | ------- | ---------- |
| **Base**       | 7/10        | 9/10     | 8/10    | **8.0/10** |
| **Fine-tuned** | 8/10        | 9/10     | 8/10    | **8.3/10** |


📝 Prompt 4

Simple HTTP request in Python

| Model          | Helpfulness | Accuracy | Clarity | Overall    |
| -------------- | ----------- | -------- | ------- | ---------- |
| **Base**       | 9/10        | 9/10     | 9/10    | **9.0/10** |
| **Fine-tuned** | 8/10        | 9/10     | 8/10    | **8.3/10** |


📝 Prompt 5

Difference between AI and Machine Learning

| Model          | Helpfulness | Accuracy | Clarity | Overall    |
| -------------- | ----------- | -------- | ------- | ---------- |
| **Base**       | 9/10        | 9/10     | 9/10    | **9.0/10** |
| **Fine-tuned** | 8/10        | 9/10     | 8/10    | **8.3/10** |



Final

| Model                | Avg Helpfulness | Avg Accuracy | Avg Clarity | **Overall Average** |
| -------------------- | --------------- | ------------ | ----------- | ------------------- |
| **Base Model**       | 8.0 / 10        | 8.6 / 10     | 8.2 / 10    | **8.3 / 10**        |
| **Fine-tuned Model** | 7.8 / 10        | 8.6 / 10     | 7.8 / 10    | **8.1 / 10**        |
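Judge-reported averages are worth recomputing from the per-prompt scores. The sketch below copies the (helpfulness, accuracy, clarity) triples from the five per-prompt tables above and averages them; the helper name `column_means` is ours.

```python
# (helpfulness, accuracy, clarity) per prompt, copied from the tables above.
scores = {
    "base": [(7, 7, 6), (8, 9, 9), (7, 9, 8), (9, 9, 9), (9, 9, 9)],
    "fine-tuned": [(8, 7, 8), (7, 9, 7), (8, 9, 8), (8, 9, 8), (8, 9, 8)],
}


def column_means(rows):
    """Average each criterion across prompts."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))


averages = {name: column_means(rows) for name, rows in scores.items()}

for name, (helpfulness, accuracy, clarity) in averages.items():
    overall = (helpfulness + accuracy + clarity) / 3
    print(f"{name:<12} H={helpfulness:.1f}  A={accuracy:.1f}  C={clarity:.1f}  overall={overall:.1f}")
```

With only five prompts and a single judge run, differences of a few tenths of a point are well within noise, so read these averages as a rough signal.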



---

# Method 5: A/B Testing (Pairwise Comparison)

Ask: **"Which response is better?"** — This is the basis for DPO!

In [None]:
def create_pairwise_prompt(question, response_a, response_b):
    """Create a pairwise comparison prompt."""
    return f"""Compare these two AI responses and decide which is better.

## Question:
{question}

## Response A:
{response_a}

## Response B:
{response_b}

## Your Judgment:

Which response is better? Choose one:
- A is much better
- A is slightly better
- About the same
- B is slightly better
- B is much better

Choice: [Your choice]
Reason: [Brief explanation]
"""

# Example
print("📝 Example Pairwise Prompt:")
print("="*60)
print(create_pairwise_prompt(
    test_prompts[0],
    base_responses[0][:250],
    finetuned_responses[0][:250]
))

📝 Example Pairwise Prompt:
Compare these two AI responses and decide which is better.

## Question:
Explain machine learning to a 10-year-old.

## Response A:
Sure! Machine learning is like a magical game where computers learn from data and do better at guessing what's happening without being told exactly how to guess.

Imagine you have a toy car that can move around on the floor. If I tell it to go forwar

## Response B:
Sure! Machine Learning is like when you have a toy car that can learn how to drive itself in the park. It's like teaching your car what to do without being told every single time.

Imagine your toy car has a memory of where it's been before and how i

## Your Judgment:

Which response is better? Choose one:
- A is much better
- A is slightly better  
- About the same
- B is slightly better
- B is much better

Choice: [Your choice]
Reason: [Brief explanation]
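Judges can favor whichever response appears first, so a common precaution is to randomize which model is shown as Response A and map the verdict back afterwards. The sketch below illustrates that idea along with a simple win-rate tally; all helper names and the sample verdicts are hypothetical.

```python
import random


def assign_positions(base_resp, ft_resp, rng):
    """Randomly decide which model is shown as Response A."""
    if rng.random() < 0.5:
        return base_resp, ft_resp, {"A": "base", "B": "fine-tuned"}
    return ft_resp, base_resp, {"A": "fine-tuned", "B": "base"}


def winner(choice, mapping):
    """Map a verdict such as 'B is slightly better' back to a model name."""
    choice = choice.strip().lower()
    if choice.startswith("about the same"):
        return "tie"
    return mapping[choice[0].upper()]


def win_rates(winners):
    """Share of comparisons won by each model, plus ties."""
    return {name: winners.count(name) / len(winners) for name in ("base", "fine-tuned", "tie")}


rng = random.Random(42)  # fixed seed so the ordering is reproducible
resp_a, resp_b, mapping = assign_positions("base answer", "fine-tuned answer", rng)

# Hypothetical verdicts for four comparisons that all used the same mapping.
verdicts = ["A is much better", "About the same", "B is slightly better", "A is slightly better"]
results = [winner(v, mapping) for v in verdicts]
print(win_rates(results))
```

In a real run each prompt would get its own call to `assign_positions()` and its own mapping.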



---

# Summary Report

In [None]:
print("\n" + "="*70)
print("📊 EVALUATION SUMMARY REPORT")
print("="*70)

print(f"\n🏷️  Model: {MODEL_ID}")
print(f"📦 Adapter: {OUTPUT_DIR}")
print(f"📝 Eval samples: {len(eval_texts)}")

print("\n" + "-"*70)
print("1. PERPLEXITY (lower is better)")
print("-"*70)
print(f"   Base Model:     {base_ppl['perplexity']:.2f}")
print(f"   Fine-tuned:     {finetuned_ppl['perplexity']:.2f}")
print(f"   Change:         {ppl_change:+.1f}%")

print("\n" + "-"*70)
print("2. RESPONSE CHARACTERISTICS")
print("-"*70)
print(f"   {'Metric':<25} {'Base':<12} {'Fine-tuned':<12}")
for key in ['avg_length', 'lexical_diversity']:
    print(f"   {key:<25} {base_stats[key]:<12.2f} {ft_stats[key]:<12.2f}")

print("\n" + "-"*70)
print("3. QUALITATIVE")
print("-"*70)
print("   Review the side-by-side comparisons above!")

print("\n" + "="*70)
print("✅ Evaluation complete!")
print("="*70)


📊 EVALUATION SUMMARY REPORT

🏷️  Model: Qwen/Qwen2.5-0.5B-Instruct
📦 Adapter: ./eval_adapter
📝 Eval samples: 100

----------------------------------------------------------------------
1. PERPLEXITY (lower is better)
----------------------------------------------------------------------
   Base Model:     9.01
   Fine-tuned:     7.68
   Change:         -14.8%

----------------------------------------------------------------------
2. RESPONSE CHARACTERISTICS
----------------------------------------------------------------------
   Metric                    Base         Fine-tuned  
   avg_length                111.20       109.80      
   lexical_diversity         0.74         0.74        

----------------------------------------------------------------------
3. QUALITATIVE
----------------------------------------------------------------------
   Review the side-by-side comparisons above!

✅ Evaluation complete!


---

# Key Takeaways

| Method | Pros | Cons | When to Use |
|--------|------|------|-------------|
| **Perplexity** | Fast, automatic | Doesn't measure usefulness | Quick sanity check |
| **Generation Comparison** | Intuitive, shows real behavior | Subjective | Always! |
| **LLM-as-Judge** | Scalable, nuanced | Costs money | Large-scale eval |
| **A/B Testing** | Simple, clear winner | Need both models | Comparing models |

## Remember:

1. **Perplexity alone isn't enough** — a model can have low perplexity but give bad responses
2. **Always look at actual outputs** — generation comparison is most important
3. **Pairwise comparison is the basis for DPO** — which we'll cover later!
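To make the link to DPO concrete, here is a minimal sketch of turning pairwise verdicts into preference records. The `prompt` / `chosen` / `rejected` field names follow a common preference-dataset layout and are an assumption here, as are the helper name and the sample data.

```python
def to_preference_records(prompts, base_responses, finetuned_responses, winners):
    """Build one {prompt, chosen, rejected} record per decided comparison."""
    records = []
    for prompt, base, ft, win in zip(prompts, base_responses, finetuned_responses, winners):
        if win == "tie":
            continue  # ties carry no preference signal
        chosen, rejected = (ft, base) if win == "fine-tuned" else (base, ft)
        records.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return records


# Hypothetical inputs for illustration.
records = to_preference_records(
    prompts=["Q1", "Q2", "Q3"],
    base_responses=["base 1", "base 2", "base 3"],
    finetuned_responses=["ft 1", "ft 2", "ft 3"],
    winners=["fine-tuned", "tie", "base"],
)

for record in records:
    print(record)
```

Each record says "for this prompt, prefer this response over that one", which is the kind of signal preference-based training consumes.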

---

## Next: Part 7 - Unsloth vs Standard

We'll compare training speed and memory usage!