# Week 13: Evaluating LLMs - Metrics and Methods - Homework

**ML2: Advanced Machine Learning**

**Estimated Time**: 1 hour

---

This homework combines programming exercises and knowledge-based questions to reinforce this week's concepts.

## Setup

Run this cell to import necessary libraries:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print('✓ Libraries imported successfully')

---
## Part 1: Programming Exercises (60%)

Complete the following programming tasks. Read each description carefully and implement the requested functionality.

### Exercise 1: Why Metrics Disagree

**Time**: 10 min

Score three candidate summaries with an automatic overlap metric and check where the scores disagree with your own judgment of quality.

In [None]:
# Three model outputs for "Summarize: The quick brown fox jumps over the lazy dog"

reference = "A fast fox jumps over a lazy dog."

candidate_a = "A quick brown fox jumps over a lazy dog."  # Almost exact
candidate_b = "A fox leaps over a dog."  # Concise, captures meaning
candidate_c = "The agile fox bounds over the sleepy canine."  # Different words, same meaning

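# Simplified BLEU-style score: precision over unique unigrams
# (no higher-order n-grams, no count clipping, no brevity penalty).
# Punctuation stays attached to tokens, so "dog." and "dog" do not match.
def simple_bleu(ref, cand):
    ref_words = set(ref.lower().split())
    cand_words = set(cand.lower().split())
    # Fraction of the candidate's unique words that also appear in the reference
    overlap = len(ref_words & cand_words) / len(cand_words) if cand_words else 0.0
    return overlap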

print(f"Candidate A BLEU: {simple_bleu(reference, candidate_a):.2f}")
print(f"Candidate B BLEU: {simple_bleu(reference, candidate_b):.2f}")
print(f"Candidate C BLEU: {simple_bleu(reference, candidate_c):.2f}")

# TODO: Which is best? Do the metrics agree with human judgment?
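
One way to surface a disagreement is to add a second overlap metric. The following is a minimal sketch, not a reference implementation: it reuses the sentences above and compares unigram precision (the quantity `simple_bleu` computes) with unigram recall, a rough stand-in for ROUGE-1. The helper names `unique_tokens`, `unigram_precision`, and `unigram_recall` are assumptions introduced for illustration.

```python
# Sketch: unigram precision vs. unigram recall on the same three candidates.
reference = "A fast fox jumps over a lazy dog."
candidates = {
    "A": "A quick brown fox jumps over a lazy dog.",
    "B": "A fox leaps over a dog.",
    "C": "The agile fox bounds over the sleepy canine.",
}

def unique_tokens(text):
    """Lowercase, split on whitespace, and keep unique tokens."""
    return set(text.lower().split())

def unigram_precision(ref, cand):
    """Fraction of the candidate's unique tokens that appear in the reference."""
    ref_tokens, cand_tokens = unique_tokens(ref), unique_tokens(cand)
    return len(ref_tokens & cand_tokens) / len(cand_tokens) if cand_tokens else 0.0

def unigram_recall(ref, cand):
    """Fraction of the reference's unique tokens that appear in the candidate."""
    ref_tokens, cand_tokens = unique_tokens(ref), unique_tokens(cand)
    return len(ref_tokens & cand_tokens) / len(ref_tokens) if ref_tokens else 0.0

# Map each candidate name to a (precision, recall) pair
scores = {
    name: (unigram_precision(reference, text), unigram_recall(reference, text))
    for name, text in candidates.items()
}

for name, (precision, recall) in scores.items():
    print(f"Candidate {name}: precision={precision:.2f}, recall={recall:.2f}")
```

With these definitions, precision ranks candidate B above A while recall ranks A above B, and both metrics rank C last even though it preserves the meaning.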

---
## Part 2: Knowledge Questions (40%)

Answer the following questions to test your conceptual understanding.

### Question 1 (Short Answer)

**Question 1 - No Single Metric Captures Everything**

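- BLEU: Measures n-gram precision (how much of the candidate appears in the reference)
- ROUGE: Measures n-gram recall (how much of the reference appears in the candidate)
- BERTScore: Measures semantic similarity using contextual embeddings
- Human evaluation: Subjective quality judgment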

Explain:
1. Why might BLEU give a high score to a bad summary?
2. Why might a paraphrase (different words, same meaning) score poorly on BLEU?
3. Why do we need multiple metrics?

**Hint**: BLEU rewards literal word matches, ignoring semantics. Multiple metrics = multiple perspectives.

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 2 (Short Answer)

**Question 2 - Perplexity**

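Perplexity measures how "surprised" a language model is by a text: it is the exponential of the average negative log-probability the model assigns to each token.
Low perplexity means the model assigned high probability to the text.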
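
As a minimal sketch, assuming we already have the probability the model assigned to each token of a text, perplexity is exp(-(1/N) Σ log p_i). The function name `perplexity` and the probability lists below are illustrative assumptions, not output from a real model.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("token_probs must contain at least one probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical per-token probabilities for two four-token texts
confident = [0.5, 0.5, 0.5, 0.5]   # model found every token fairly likely
surprised = [0.1, 0.1, 0.1, 0.1]   # model found every token unlikely

print(f"Confident text perplexity: {perplexity(confident):.2f}")
print(f"Surprised text perplexity: {perplexity(surprised):.2f}")
```

When every token has the same probability p, perplexity is 1/p, so a model that guesses uniformly over a vocabulary of size V has perplexity V.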

Explain:
1. Why is perplexity a good metric for language modeling?
2. Can you use perplexity to evaluate summarization quality? Why or why not?
3. What are the limitations?

**Hint**: Perplexity measures likelihood, not quality/helpfulness/correctness.

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 3 (Multiple Choice)

**Question 3 - BLEU Score**

BLEU was invented for machine translation. It measures n-gram overlap between a candidate translation and one or more reference translations.
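
The n-gram overlap can be sketched as clipped n-gram precision combined with a brevity penalty. This is a simplified single-sentence, single-reference illustration that stops at bigrams (`max_n=2` is an assumption; standard BLEU is a corpus-level score that typically uses up to 4-grams). The names `ngrams`, `clipped_precision`, and `bleu_sketch` are introduced here for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(ref_tokens, cand_tokens, n):
    """Candidate n-grams found in the reference, each clipped to its reference count."""
    ref_counts = ngrams(ref_tokens, n)
    cand_counts = ngrams(cand_tokens, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total

def bleu_sketch(ref, cand, max_n=2):
    """Geometric mean of clipped precisions times a brevity penalty."""
    ref_tokens, cand_tokens = ref.lower().split(), cand.lower().split()
    precisions = [clipped_precision(ref_tokens, cand_tokens, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any zero precision drives the geometric mean to zero
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Penalize candidates that are shorter than the reference
    if len(cand_tokens) >= len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref_tokens) / len(cand_tokens))
    return geo_mean * brevity_penalty

ref = "the cat sat on the mat"
print(bleu_sketch(ref, "the cat sat on the mat"))
print(bleu_sketch(ref, "the the the the the the"))
```

An identical candidate scores 1.0, while the candidate that repeats "the" six times gets a clipped unigram precision of 2/6 and an overall score of 0 because it shares no bigrams with the reference.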

BLEU = 1.0 means:

A) Perfect translation
B) Exact word-for-word match with reference
C) The model is 100% confident
D) Zero perplexity

**Hint**: BLEU = 1.0 means perfect n-gram overlap, not necessarily perfect translation quality.

**Your Answer**: [Write your answer here - e.g., 'B']

**Explanation**: [Explain why this is correct]

### Question 4 (Short Answer)

**Question 4 - Human Evaluation**

Human evaluation is often considered the "gold standard", but it is expensive and slow.

Explain:
1. What aspects can humans evaluate that automatic metrics cannot?
2. What problems arise with human evaluation (e.g., low inter-annotator agreement, annotator bias)?
3. When is human evaluation worth the cost?

**Hint**: Humans assess fluency, coherence, factuality, helpfulness. But humans disagree and have biases.
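
Inter-annotator agreement is commonly summarized with Cohen's kappa, which corrects the raw agreement rate for the agreement two annotators would reach by chance. The sketch below is a minimal illustration for two annotators; the labels are hypothetical and the function name `cohens_kappa` is an assumption introduced here.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items where the labels match
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random with their own label rates
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical quality ratings of eight summaries
annotator_1 = ["good", "good", "bad", "good", "bad", "good", "good", "bad"]
annotator_2 = ["good", "bad", "bad", "good", "good", "good", "good", "bad"]

print(f"Cohen's kappa: {cohens_kappa(annotator_1, annotator_2):.2f}")
```

Here the annotators agree on 6 of 8 items (75%), but because chance agreement is about 53%, kappa is only about 0.47.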

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 5 (Short Answer)

**Question 5 - Benchmark Saturation**

State-of-the-art models have matched or exceeded the human baseline on older NLP benchmarks such as GLUE and SuperGLUE.

Explain:
1. Why is benchmark saturation a problem?
2. What does it mean when models "solve" a benchmark?
3. How do researchers respond? (Hint: new, harder benchmarks)

**Hint**: Saturated benchmarks no longer differentiate model quality. Need harder tests that reflect real-world use.
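
One way to see why a saturated benchmark stops differentiating models is to compare the gap between two accuracies with its sampling noise. The sketch below is an assumption-laden illustration: the accuracies and test-set size are hypothetical, and the two models are treated as independent samples (a paired test on the shared items would be more precise). The function names are introduced here for illustration.

```python
import math

def accuracy_standard_error(accuracy, n_items):
    """Binomial standard error of an accuracy measured on n_items examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n_items)

def gap_in_standard_errors(acc_a, acc_b, n_items):
    """Accuracy gap divided by its standard error (independence assumed)."""
    se_gap = math.sqrt(
        accuracy_standard_error(acc_a, n_items) ** 2
        + accuracy_standard_error(acc_b, n_items) ** 2
    )
    if se_gap == 0:
        return 0.0 if acc_a == acc_b else float("inf")
    return abs(acc_a - acc_b) / se_gap

# Hypothetical scores on a 1,000-item benchmark
z_saturated = gap_in_standard_errors(0.95, 0.96, 1000)    # both models near the ceiling
z_unsaturated = gap_in_standard_errors(0.60, 0.70, 1000)  # plenty of headroom

print(f"Near the ceiling: gap = {z_saturated:.2f} standard errors")
print(f"With headroom:    gap = {z_unsaturated:.2f} standard errors")
```

Under these assumptions the one-point gap near the ceiling is about one standard error, below the usual 1.96 threshold, while the ten-point gap with headroom is well above it.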

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 6 (Multiple Choice)

**Question 6 - BERTScore vs BLEU**

BERTScore uses BERT embeddings to measure semantic similarity.
BLEU uses n-gram overlap.

Which metric would give the higher score when comparing "The cat sat" (reference) with "The feline rested" (candidate)?

A) BLEU (they have very different words)
B) BERTScore (semantically similar)
C) Both equally
D) Neither would score it

**Hint**: BERTScore captures semantic similarity. "cat" ≈ "feline" in embedding space.
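
The idea of matching tokens in embedding space can be sketched with greedy cosine matching, which is the core of BERTScore. This is a toy illustration only: the 3-dimensional vectors in `TOY_EMBEDDINGS` are hand-picked assumptions, not BERT embeddings, and the function names are introduced here. Real BERTScore uses contextual embeddings from a pretrained model.

```python
import numpy as np

# Hypothetical vectors chosen by hand so that synonyms point in similar directions
TOY_EMBEDDINGS = {
    "the": np.array([1.0, 0.0, 0.0]),
    "cat": np.array([0.0, 1.0, 0.1]),
    "feline": np.array([0.0, 0.9, 0.2]),
    "sat": np.array([0.1, 0.0, 1.0]),
    "rested": np.array([0.2, 0.1, 0.9]),
}

def embed(tokens):
    """Stack the toy vectors for the tokens and L2-normalize each row."""
    vectors = np.stack([TOY_EMBEDDINGS[token] for token in tokens])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def greedy_match_f1(ref_tokens, cand_tokens):
    """F1 of greedy cosine matching between reference and candidate tokens."""
    similarity = embed(ref_tokens) @ embed(cand_tokens).T  # shape: [ref, cand]
    recall = similarity.max(axis=1).mean()     # best match for each reference token
    precision = similarity.max(axis=0).mean()  # best match for each candidate token
    return float(2 * precision * recall / (precision + recall))

def exact_match_precision(ref_tokens, cand_tokens):
    """Fraction of candidate tokens that literally appear in the reference."""
    ref_set = set(ref_tokens)
    return sum(token in ref_set for token in cand_tokens) / len(cand_tokens)

ref = "the cat sat".split()
cand = "the feline rested".split()
embedding_score = greedy_match_f1(ref, cand)
overlap_score = exact_match_precision(ref, cand)

print(f"Embedding-based F1:    {embedding_score:.2f}")
print(f"Exact-match precision: {overlap_score:.2f}")
```

With these toy vectors, only "the" matches literally, while the embedding-based score stays high because "cat"/"feline" and "sat"/"rested" are close in the toy space.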

**Your Answer**: [Write your answer here - e.g., 'B']

**Explanation**: [Explain why this is correct]

### Question 7 (Short Answer)

**Question 7 - LLM-as-Judge**

A recent trend is to use a strong LLM such as GPT-4 to evaluate other models' outputs.

Prompt: "Rate this summary on a scale of 1-10 for accuracy and coherence."

Explain:
1. What advantages does this have over BLEU/ROUGE?
2. What risks/limitations exist?
3. When is this approach appropriate?

**Hint**: Advantages: nuanced, flexible criteria. Risks: GPT-4 has biases, can be manipulated, costly.
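
A minimal sketch of the plumbing around such a judge is shown below: building the prompt and defensively parsing the numeric score from a free-text reply. No real model is called; `fake_judge` is a hypothetical stand-in that always returns the same string, and the `Score: <number>` reply format is an assumption added to make parsing reliable.

```python
import re

PROMPT_TEMPLATE = (
    "Rate this summary on a scale of 1-10 for accuracy and coherence.\n\n"
    "Source:\n{source}\n\n"
    "Summary:\n{summary}\n\n"
    "Reply with a single line of the form 'Score: <number>'."
)

def build_judge_prompt(source, summary):
    """Fill the template with the source text and the summary to be rated."""
    return PROMPT_TEMPLATE.format(source=source, summary=summary)

def parse_score(reply, low=1, high=10):
    """Extract an integer score from the reply; return None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None

def fake_judge(prompt):
    """Hypothetical stand-in for a real model call; always returns the same reply."""
    return "Score: 7"

prompt = build_judge_prompt(
    source="The quick brown fox jumps over the lazy dog.",
    summary="A fox jumps over a dog.",
)
print(parse_score(fake_judge(prompt)))
```

Returning `None` for unparseable or out-of-range replies keeps malformed judge output from silently entering an average.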

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 8 (Short Answer)

**Question 8 - Task-Specific Evaluation**

For a medical diagnosis LLM:
- Accuracy on medical exams
- Factual correctness
- Safety (avoiding harmful advice)

Explain: Why are general metrics (BLEU, perplexity) insufficient for this use case?

**Hint**: Domain-specific tasks need domain-specific evaluation criteria (medical accuracy, safety).

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 9 (Short Answer)

**Question 9 - Evaluation Dimensions**

LLMs should be evaluated on multiple dimensions:
- Helpfulness: Does it solve the user's problem?
- Truthfulness: Is it factually accurate?
- Harmlessness: Does it avoid harmful content?

Explain: Why might a model excel on one dimension but fail on another?

**Hint**: Trade-offs exist. A very helpful model might hallucinate. A very cautious model might refuse valid requests.

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 10 (Short Answer)

**Question 10 - Evaluation Paradox**

As models get better, evaluation gets harder.

GPT-4's outputs are often indistinguishable from human-written text. How do you evaluate something when there is no clear "right answer"?

Explain: What new evaluation approaches are needed for superhuman AI?

**Hint**: Need process-based evaluation (how it thinks), adversarial testing, long-term outcome measurement.

**Your Answer**:

[Write your answer here in 2-4 sentences]

---
## Submission

Before submitting:
1. Run all cells to ensure code executes without errors
2. Check that all questions are answered
3. Review your explanations for clarity

**To Submit**:
- File → Download → Download .ipynb
- Submit the notebook file to your course LMS

**Note**: Make sure your name is in the filename (e.g., homework_01_yourname.ipynb)