# Chapter 3: BERTScore and COMET

## Semantic Evaluation Metrics for NLG

---

### Learning Objectives

By the end of this notebook, you will be able to:

1. **Explain** why lexical metrics (BLEU, ROUGE) fail to capture semantic equivalence
2. **Understand** how word embeddings represent meaning as vectors in high-dimensional space
3. **Implement** BERTScore from scratch using greedy token matching
4. **Use** the `bert_score` library for production-quality evaluation
5. **Apply** COMET for translation evaluation using source, candidate, and reference
6. **Choose** the right metric for your specific NLG evaluation task

### Why This Matters

Traditional metrics like BLEU and ROUGE rely on **lexical overlap** - they count matching n-grams between candidate and reference texts. This approach has a fundamental flaw: it treats words as atomic symbols with no relationship to each other.

Consider these sentences:
- "The **lawyer** submitted the **brief**"
- "The **attorney** filed the **document**"

These sentences mean nearly the same thing, but share almost no words! Lexical metrics would score this valid paraphrase as a failure. **Semantic metrics** solve this by understanding that "lawyer" and "attorney" are synonyms, even though they have different spellings.

---

## Setup and Imports

We start by importing the necessary libraries:

- **PyTorch**: For tensor operations and neural network computations
- **Transformers**: Hugging Face library for pretrained language models like BERT
- **NumPy**: For numerical operations

These tools give us access to pretrained models that have learned rich semantic representations from massive text corpora.

In [None]:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
import numpy as np

---

## Part 1: The Problem with Lexical Metrics

### Why BLEU and ROUGE Fail

BLEU and ROUGE treat words as **atomic symbols** - like random ID numbers with no inherent meaning. In this view:
- "cat" and "dog" are just as different as "cat" and "quantum"
- "lawyer" and "attorney" are completely unrelated symbols

This is fundamentally at odds with how language works. Synonyms, paraphrases, and semantically equivalent expressions are everywhere in natural language.

The following example demonstrates this limitation:

In [None]:
# Example from the book
reference = "people like foreign cars"
candidate1 = "People like visiting places abroad."  # Different topic!
candidate2 = "Consumers prefer imported cars."       # Same meaning!

import string

def word_overlap(cand, ref):
    """Fraction of unique reference words that also appear in the candidate."""
    # Strip punctuation so that "cars." matches "cars"
    table = str.maketrans("", "", string.punctuation)
    cand_words = set(cand.lower().translate(table).split())
    ref_words = set(ref.lower().translate(table).split())
    return len(cand_words & ref_words) / len(ref_words)

print(f"Reference: '{reference}'")
print(f"\nCandidate 1 (wrong topic): '{candidate1}'")
print(f"  Word overlap: {word_overlap(candidate1, reference):.0%}")
print(f"\nCandidate 2 (same meaning): '{candidate2}'")
print(f"  Word overlap: {word_overlap(candidate2, reference):.0%}")
print("\n^ Lexical metrics penalize valid paraphrases!")

### Interpreting the Results

Notice the paradox above:
- **Candidate 1** ("People like visiting places abroad.") shares 50% of the reference words ("people", "like") but talks about **travel**, not cars
- **Candidate 2** ("Consumers prefer imported cars.") shares only 25% of the reference words (just "cars") but means **nearly the same thing**

A lexical metric would prefer Candidate 1, which is semantically wrong! This is why we need metrics that understand **meaning**, not just **spelling**.
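
To make "counting matching n-grams" concrete, here is a minimal sketch of the clipped n-gram precision that BLEU builds on. It is a simplified illustration rather than full BLEU (no brevity penalty, no geometric mean over n-gram orders), and the helper names `ngrams` and `clipped_precision` are illustrative assumptions, not part of any library. The sentences are the lawyer/attorney pair from the introduction, lowercased and without punctuation.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = ngrams(candidate.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total

reference_text = "the attorney filed the document"
paraphrase = "the lawyer submitted the brief"
degenerate = "the the the the"

print(clipped_precision(paraphrase, reference_text, 1))  # 0.4: only "the" matches, twice
print(clipped_precision(paraphrase, reference_text, 2))  # 0.0: no bigram matches
print(clipped_precision(degenerate, reference_text, 1))  # 0.5: "the" is capped at 2 matches
```

The paraphrase earns credit only for the function word "the", and none at the bigram level, which is why n-gram metrics score it so poorly.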

---

## Part 2: The Solution - Word Embeddings

### From Symbols to Vectors

The breakthrough insight of modern NLP is representing words as **dense vectors** (embeddings) in a high-dimensional space. In this space:

- **Similar words are neighbors**: "attorney" and "lawyer" are close together
- **Different concepts are far apart**: "attorney" and "pizza" are distant
- **Relationships are directions**: king - man + woman ≈ queen
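
The analogy in the last bullet is a classic observation about static embeddings such as Word2Vec. The sketch below reproduces it with hand-made 2-D vectors; the numbers are invented purely for illustration (real embeddings are learned and have hundreds of dimensions), and `toy_vectors` and `cosine` are hypothetical names.

```python
import math

# Invented 2-D toy vectors: (royalty, gender). Not taken from any real model.
toy_vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "pizza": [-1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(
    toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"]
)]

# Rank the words that were not part of the query by similarity to the target
others = {w: v for w, v in toy_vectors.items() if w not in {"king", "man", "woman"}}
nearest = max(others, key=lambda w: cosine(target, others[w]))
print(target, nearest)
```

In this toy space the result lands exactly on "queen"; with learned embeddings the match is only approximate.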

**How are embeddings learned?**

Models like BERT learn embeddings by predicting masked words from context. After training on billions of words, the model learns that "attorney" and "lawyer" appear in similar contexts, so they get similar vectors.

**Contextual embeddings go further:**

Unlike static embeddings (Word2Vec), BERT produces **contextual embeddings** - the vector for "bank" differs between "river bank" and "bank account". This allows nuanced semantic understanding.

### Loading a Pretrained Encoder

We use BERT (Bidirectional Encoder Representations from Transformers) as our embedding model. BERT-base has:
- 12 transformer layers
- 768-dimensional embeddings
- 110 million parameters

The model was pretrained on English Wikipedia and BooksCorpus, learning rich semantic representations.

In [None]:
# Load a pretrained encoder model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

print(f"Loaded {model_name}")
print(f"Embedding dimension: {model.config.hidden_size}")

### Sentence-Level Semantic Similarity

Now we can compare sentences by their meaning, not their spelling. The code below:

1. **Gets embeddings** for each sentence using mean pooling (averaging all token embeddings)
2. **Computes cosine similarity** between the sentence vectors

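**Cosine similarity** measures the angle between two vectors:
- 1.0 = identical direction (very similar meaning)
- 0.0 = perpendicular (unrelated)
- -1.0 = opposite direction (rare in practice: BERT sentence vectors usually give positive values, so compare scores against each other rather than against these endpoints)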
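
Written out for two vectors $u$ and $v$ (the symbols are chosen here for illustration), the definition is:

$$
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$

As a quick hand check with toy vectors $u = (1, 2)$ and $v = (2, 1)$: the dot product is $1 \cdot 2 + 2 \cdot 1 = 4$ and both norms are $\sqrt{5}$, so $\cos(u, v) = 4 / 5 = 0.8$.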

In [None]:
def get_embeddings(text: str) -> torch.Tensor:
    """Get a single sentence embedding by mean pooling the token embeddings."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Mean pooling over tokens (excluding padding)
    token_embeddings = outputs.last_hidden_state
    attention_mask = inputs["attention_mask"]
    mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = (token_embeddings * mask_expanded).sum(dim=1)
    token_counts = mask_expanded.sum(dim=1).clamp(min=1e-9)
    
    return sum_embeddings / token_counts

def cosine_similarity(emb1: torch.Tensor, emb2: torch.Tensor) -> float:
    """Compute cosine similarity between two embeddings."""
    return F.cosine_similarity(emb1, emb2).item()

# Test with book examples
emb_ref = get_embeddings(reference)
emb_c1 = get_embeddings(candidate1)
emb_c2 = get_embeddings(candidate2)

print(f"Cosine similarity (semantic):")
print(f"  Candidate 1 (wrong topic): {cosine_similarity(emb_ref, emb_c1):.3f}")
print(f"  Candidate 2 (same meaning): {cosine_similarity(emb_ref, emb_c2):.3f}")
print("\n^ Semantic metrics recognize paraphrases!")
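
The masking step in `get_embeddings` matters mostly when sentences are padded to a common length in a batch. The sketch below shows the same idea in plain Python with invented 2-D vectors; `masked_mean_pool` is a hypothetical helper, not a library function. Note that in the cell above `[CLS]` and `[SEP]` have mask value 1, so they are included in the average.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average only the vectors whose mask is 1 (real tokens, not padding)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        return [0.0] * dim
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Three real tokens followed by two padding positions (toy values)
token_vectors = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0, 0]

pooled = masked_mean_pool(token_vectors, attention_mask)
naive = [sum(vec[d] for vec in token_vectors) / len(token_vectors) for d in range(2)]

print(pooled)  # padding ignored
print(naive)   # padding leaks into the average
```

Without the mask, the padding vectors pull the sentence embedding away from the real tokens, which would distort any similarity computed from it.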

---

## Part 3: BERTScore - Token-Level Semantic Matching

### The BERTScore Algorithm

While sentence-level similarity is useful, **BERTScore** provides a more nuanced approach by matching individual tokens. The algorithm:

1. **Get token embeddings** from BERT for both reference and candidate
2. **Compute pairwise similarities** between all reference-candidate token pairs
3. **Greedy matching**: For each token, find its best semantic match in the other sentence
4. **Aggregate** into Precision, Recall, and F1 scores

**Why greedy matching?**

Instead of requiring exact word matches, BERTScore finds the best semantic match for each token. "Cold" can match "freezing" because their embeddings are similar, even though the words are different.

**Precision vs Recall:**
- **Recall**: How much of the reference meaning is captured? (For each reference token, find best match in candidate)
- **Precision**: How accurate is the candidate? (For each candidate token, find best match in reference)
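
In formula form, following the notation of the BERTScore paper (an assumption of this note: $x_i$ are the reference token embeddings, $\hat{x}_j$ the candidate token embeddings, all normalized to unit length so that the inner product equals cosine similarity):

$$
R = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^\top \hat{x}_j
\qquad
P = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^\top \hat{x}_j
\qquad
F_1 = \frac{2 P R}{P + R}
$$

Recall averages over reference tokens and precision over candidate tokens; the $\max$ is the greedy matching step.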

Let's implement this step by step:

### Step 1: Extract Token Embeddings

First, we tokenize our sentences and extract embeddings for each token. Note that we skip the special `[CLS]` and `[SEP]` tokens that BERT adds.

In [None]:
def get_token_embeddings(text: str) -> tuple[torch.Tensor, list[str]]:
    """Get individual token embeddings."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Get tokens (skip [CLS] and [SEP])
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])[1:-1]
    embeddings = outputs.last_hidden_state[0, 1:-1, :]  # Skip special tokens
    
    return embeddings, tokens

# Example from the book
reference = "The weather is cold today."
candidate = "It is freezing today."

ref_emb, ref_tokens = get_token_embeddings(reference)
cand_emb, cand_tokens = get_token_embeddings(candidate)

print(f"Reference tokens: {ref_tokens}")
print(f"Candidate tokens: {cand_tokens}")

### Step 2: Build the Similarity Matrix

Now we compute cosine similarity between every pair of tokens. This creates a matrix where:
- Rows = reference tokens
- Columns = candidate tokens
- Each cell = semantic similarity between that token pair

Look for high values (near 1.0) to see which tokens are semantically similar.

In [None]:
def pairwise_cosine_similarity(emb1: torch.Tensor, emb2: torch.Tensor) -> torch.Tensor:
    """Compute pairwise cosine similarity matrix."""
    # Normalize embeddings
    emb1_norm = F.normalize(emb1, p=2, dim=1)
    emb2_norm = F.normalize(emb2, p=2, dim=1)
    # Compute similarity matrix
    return torch.mm(emb1_norm, emb2_norm.t())

# Compute similarity matrix
sim_matrix = pairwise_cosine_similarity(ref_emb, cand_emb)

print("Pairwise similarity matrix (reference × candidate):")
print(f"Shape: {sim_matrix.shape} ({len(ref_tokens)} ref × {len(cand_tokens)} cand)\n")

# Display as table
import pandas as pd
df = pd.DataFrame(
    sim_matrix.numpy(),
    index=ref_tokens,
    columns=cand_tokens
)
print(df.round(3).to_string())

### Step 3: Compute Precision, Recall, and F1

With the similarity matrix, we can now compute BERTScore:

- **Recall**: For each reference token, take the maximum similarity across all candidate tokens, then average
- **Precision**: For each candidate token, take the maximum similarity across all reference tokens, then average
- **F1**: Harmonic mean of precision and recall

This greedy maximum matching allows "cold" to find "freezing" as its best match.

In [None]:
def bertscore_components(ref_emb: torch.Tensor, cand_emb: torch.Tensor) -> dict:
    """
    Calculate BERTScore precision, recall, and F1.
    
    Recall: For each reference token, find best match in candidate.
    Precision: For each candidate token, find best match in reference.
    """
    sim_matrix = pairwise_cosine_similarity(ref_emb, cand_emb)
    
    # Recall: max over candidate for each reference token
    recall_scores = sim_matrix.max(dim=1).values
    recall = recall_scores.mean().item()
    
    # Precision: max over reference for each candidate token
    precision_scores = sim_matrix.max(dim=0).values
    precision = precision_scores.mean().item()
    
    # F1
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "recall_scores": recall_scores,
        "precision_scores": precision_scores
    }

result = bertscore_components(ref_emb, cand_emb)

print("BERTScore (unscaled):")
print(f"  Precision: {result['precision']:.3f}")
print(f"  Recall: {result['recall']:.3f}")
print(f"  F1: {result['f1']:.3f}")

print("\nRecall breakdown (best match for each reference token):")
for token, score in zip(ref_tokens, result['recall_scores']):
    print(f"  {token:12} -> {score:.3f}")

### Visualizing the Greedy Matching

Let's see exactly which candidate token each reference token matched with. This visualization helps build intuition for how BERTScore captures semantic equivalence.

In [None]:
# Find best matches
best_matches = sim_matrix.argmax(dim=1)

print("Greedy matching (reference -> best candidate match):")
print("-" * 45)
for i, (ref_tok, best_idx) in enumerate(zip(ref_tokens, best_matches)):
    cand_tok = cand_tokens[best_idx]
    score = sim_matrix[i, best_idx].item()
    print(f"  '{ref_tok}' -> '{cand_tok}' (similarity: {score:.3f})")

print("\n^ 'cold' matches 'freezing' despite no lexical overlap!")

### Key Insight: Semantic Matching in Action

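The matching above illustrates the power of semantic metrics:
- **"cold" -> "freezing"**: Different words, similar meaning, so the similarity is high
- **"weather"**: Has no direct counterpart in the candidate, so it falls back to whichever token is closest in context (often "it", which stands in for the weather); check the output above
- **"today" -> "today"**: The shared word scores highest, though not exactly 1.0, because contextual embeddings depend on the surrounding sentence

This is impossible with lexical metrics, which would see "cold" and "freezing" as completely different symbols.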
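
To check the arithmetic by hand, here is a minimal worked example on a tiny similarity matrix. The similarity values are invented for illustration and do not come from BERT.

```python
# Rows = reference tokens, columns = candidate tokens (invented similarities)
ref_toks = ["cold", "today"]
cand_toks = ["it", "freezing", "today"]
sim = [
    [0.2, 0.8, 0.3],  # cold
    [0.1, 0.3, 1.0],  # today
]

# Recall: best candidate match for each reference token (row-wise max)
recall_scores = [max(row) for row in sim]
# Precision: best reference match for each candidate token (column-wise max)
precision_scores = [max(col) for col in zip(*sim)]

recall = sum(recall_scores) / len(recall_scores)
precision = sum(precision_scores) / len(precision_scores)
f1 = 2 * precision * recall / (precision + recall)

print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")  # P=0.667 R=0.900 F1=0.766
```

Precision is lower than recall here because the candidate token "it" has no good counterpart in the reference. The matching is also not one-to-one: several tokens may pick the same best match.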

---

## Part 4: Production BERTScore with the `bert_score` Library

### Why Use the Library?

Our implementation above demonstrates the core algorithm, but the `bert_score` library adds important enhancements:

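1. **IDF weighting** (optional, `idf=True`): Rare, informative words (like "attorney") contribute more than common words (like "the")
2. **Baseline rescaling** (optional, `rescale_with_baseline=True`): Raw scores are rescaled to be more interpretable (roughly 0 = unrelated sentence pair, 1 = perfect match)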
3. **Batching and GPU support**: Efficient processing of large datasets
4. **Model selection**: Choose from different encoder models (RoBERTa, DeBERTa, etc.)

For production evaluation, always use the library:

In [None]:
from bert_score import score as bert_score

# Book examples
references = [
    "The weather is cold today.",
    "people like foreign cars"
]
candidates = [
    "It is freezing today.",
    "Consumers prefer imported cars."
]

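# Library defaults: idf=False and rescale_with_baseline=False.
# Pass idf=True and/or rescale_with_baseline=True to enable those enhancements.
P, R, F1 = bert_score(candidates, references, lang="en", verbose=False)

print("BERTScore (library defaults: no IDF weighting, no baseline rescaling):")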
print("-" * 50)
for i, (cand, ref) in enumerate(zip(candidates, references)):
    print(f"Reference: '{ref}'")
    print(f"Candidate: '{cand}'")
    print(f"  P={P[i]:.3f}, R={R[i]:.3f}, F1={F1[i]:.3f}\n")

### Understanding the Scores

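Notice that the library scores differ from our raw implementation. With `lang="en"`, `bert_score` defaults to a different encoder (`roberta-large`) and takes embeddings from a tuned intermediate layer rather than the last layer of `bert-base-uncased`. IDF weighting and baseline rescaling only apply if you pass `idf=True` and `rescale_with_baseline=True`. The bands below are rough rules of thumb, not calibrated thresholds; they shift with the encoder and with rescaling, and unrescaled scores tend to cluster in a narrow high range:
- **F1 > 0.9**: Excellent semantic match
- **F1 0.8-0.9**: Good paraphrase
- **F1 < 0.7**: Significant semantic differences

Both examples above should get high scores because they are valid paraphrases, which illustrates how BERTScore captures semantic equivalence.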
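
The two optional enhancements, IDF weighting and baseline rescaling, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the library's code: the smoothed IDF formula is one common form, the helper names are hypothetical, and the similarity values and the baseline of 0.85 are invented.

```python
import math
from collections import Counter

def idf_weights(reference_corpus):
    """Smoothed inverse document frequency over tokenized reference sentences."""
    num_docs = len(reference_corpus)
    doc_freq = Counter(tok for sent in reference_corpus for tok in set(sent))
    return {tok: math.log((num_docs + 1) / (df + 1)) for tok, df in doc_freq.items()}

def weighted_recall(best_match_scores, tokens, idf):
    """IDF-weighted average of each reference token's best-match similarity."""
    weights = [idf[tok] for tok in tokens]
    return sum(w * s for w, s in zip(weights, best_match_scores)) / sum(weights)

def rescale(score, baseline):
    """Linearly map the baseline to 0 while keeping a perfect score at 1."""
    return (score - baseline) / (1 - baseline)

corpus = [
    ["the", "attorney", "filed", "the", "brief"],
    ["the", "weather", "is", "cold"],
    ["the", "cars", "are", "imported"],
]
idf = idf_weights(corpus)

tokens = ["the", "attorney"]
best_match = [1.0, 0.5]  # invented best-match similarities

print(sum(best_match) / len(best_match))           # unweighted average: 0.75
print(weighted_recall(best_match, tokens, idf))    # "the" gets zero weight here
print(round(rescale(0.92, baseline=0.85), 3))
```

Because "the" appears in every reference sentence, its weight drops to zero and the score is driven entirely by "attorney". Rescaling spreads a narrow band of raw scores over a wider, easier-to-read range without changing the ranking.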

---

## Part 5: COMET - Learned Evaluation from Human Judgments

### What Makes COMET Different?

COMET (Crosslingual Optimized Metric for Evaluation of Translation) takes a fundamentally different approach:

| Aspect | BERTScore | COMET |
|--------|-----------|-------|
| **Inputs** | Candidate + Reference | Source + Candidate + Reference |
| **Model** | Pretrained encoder (BERT) | Fine-tuned on human judgments |
| **Training** | No task-specific training | Trained to predict human scores |
| **Use case** | General NLG | Translation evaluation |

**Key advantages of COMET:**

1. **Source awareness**: By seeing the original source text, COMET can detect translation errors that lose meaning (like choosing the wrong sense of an ambiguous source word such as German "Bank")

2. **Human-aligned**: COMET is trained on a large collection of human quality judgments from WMT translation tasks, so it learns what humans consider good translations

3. **Strong correlation with humans**: COMET models have ranked among the metrics that correlate best with human judgments in WMT Metrics shared tasks
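
For intuition about how the source enters the score, the COMET paper (Rei et al., 2020) describes an estimator that pools encoder outputs into sentence vectors for the source, the hypothesis, and the reference, then passes combined features to a feed-forward regressor. The sketch below only illustrates that feature combination with invented 2-D vectors. It is not the `unbabel-comet` implementation, and `combine_features` is a hypothetical helper.

```python
def combine_features(h, s, r):
    """Concatenate [h; r; h*s; h*r; |h-s|; |h-r|] for toy sentence vectors.

    h = hypothesis (MT output), s = source, r = reference.
    """
    prod_src = [a * b for a, b in zip(h, s)]
    prod_ref = [a * b for a, b in zip(h, r)]
    diff_src = [abs(a - b) for a, b in zip(h, s)]
    diff_ref = [abs(a - b) for a, b in zip(h, r)]
    return h + r + prod_src + prod_ref + diff_src + diff_ref

h = [0.5, 1.0]  # invented hypothesis vector
s = [0.5, 0.0]  # invented source vector
r = [1.0, 1.0]  # invented reference vector

features = combine_features(h, s, r)
print(len(features))  # 6 blocks of 2 dimensions each
print(features)
```

The source-dependent blocks are what let the regressor penalize a hypothesis that drifts from the source even when it sits close to the reference.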

> **Note:** COMET requires `transformers<5.0` which may conflict with newer Python versions. See the installation note below.

### Loading the COMET Model

COMET models are large (over 2 GB for this checkpoint) and require specific package versions. The code below attempts to load the WMT22 COMET-DA model (`Unbabel/wmt22-comet-da`), which was trained on Direct Assessment scores from WMT translation tasks.

> **Installation Note:** COMET (`unbabel-comet`) requires `transformers<5.0`. If you're using Python 3.14+ with newer transformers, install COMET in a separate environment:
> ```bash
> uv venv comet-env --python 3.12
> source comet-env/bin/activate
> uv pip install unbabel-comet
> ```

In [None]:
# COMET requires transformers<5.0 - this cell may fail in newer environments
try:
    from comet import download_model, load_from_checkpoint
    
    # Download and load COMET model
    model_path = download_model("Unbabel/wmt22-comet-da")
    comet_model = load_from_checkpoint(model_path)
    COMET_AVAILABLE = True
except ImportError:
    print("COMET not available. Install with: pip install unbabel-comet")
    print("Requires transformers<5.0 and Python<3.14")
    COMET_AVAILABLE = False

### COMET in Action: Detecting Translation Errors

This example demonstrates COMET's key strength: using the source text to detect meaning-changing errors.

The German word "Bank" is ambiguous - it can mean:
- **Financial institution** (the correct sense in this context)
- **Bench** (the wrong sense here)

COMET, by seeing the source "Die Bank war voller Kunden" (The bank was full of customers), can pick up that "bank" (financial) fits and "bench" does not, even though both candidates are grammatical English sentences.

In [None]:
# Example from the book: German bank translation
data = [
    {
        "src": "Die Bank war voller Kunden.",
        "mt": "The bank was full of customers.",
        "ref": "The financial institution was crowded with clients."
    },
    {
        "src": "Die Bank war voller Kunden.",
        "mt": "The bench was full of customers.",  # Wrong sense of "Bank"!
        "ref": "The financial institution was crowded with clients."
    }
]

if COMET_AVAILABLE:
    output = comet_model.predict(data, batch_size=2, gpus=0)

    print("COMET scores (trained on human judgments):")
    print("-" * 50)
    for i, (sample, score) in enumerate(zip(data, output.scores)):
        print(f"Source: {sample['src']}")
        print(f"MT: {sample['mt']}")
        print(f"Ref: {sample['ref']}")
        print(f"COMET Score: {score:.3f}\n")
else:
    print("COMET Example (requires separate environment):")
    print("-" * 50)
    for sample in data:
        print(f"Source: {sample['src']}")
        print(f"MT: {sample['mt']}")
        print(f"Ref: {sample['ref']}")
        print()
    print("^ COMET is expected to score the first translation higher")
    print("  because it correctly translates 'Bank' as 'bank' (financial)")
    print("  while the second mistranslates it as 'bench'.")

---

## Part 6: Comparing All Metrics

### Head-to-Head Comparison

Now let's compare lexical and semantic metrics on the same example. This demonstrates why choosing the right metric matters for your evaluation task.

In [None]:
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# Test case: paraphrase with no content-word overlap (only "the" is shared)
reference = "The attorney filed the legal brief."
candidate = "The lawyer submitted the court document."

print(f"Reference: '{reference}'")
print(f"Candidate: '{candidate}'")
print("\nMetric comparison:")
print("-" * 40)

# BLEU
bleu_result = bleu.compute(predictions=[candidate], references=[[reference]])
print(f"BLEU:      {bleu_result['bleu']:.3f}")

# ROUGE
rouge_result = rouge.compute(predictions=[candidate], references=[reference])
print(f"ROUGE-L:   {rouge_result['rougeL']:.3f}")

# BERTScore
bert_result = bertscore.compute(
    predictions=[candidate], 
    references=[reference], 
    lang="en"
)
print(f"BERTScore: {bert_result['f1'][0]:.3f}")

print("\n^ Semantic metrics capture paraphrase equivalence!")

### Interpreting the Comparison

The results above show the difference:

| Metric | Score | Why |
|--------|-------|-----|
| **BLEU** | ~0 | No bigram or longer n-gram matches; without smoothing, a single empty n-gram order drives BLEU to 0 |
| **ROUGE-L** | ~0.33 | The longest common subsequence is just "the ... the" (2 of 6 tokens) |
| **BERTScore** | ~0.9+ | Recognizes semantic equivalence via embeddings |

This is a close paraphrase that lexical metrics largely miss: the only overlap they see is the function word "the".
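
For intuition about what ROUGE-L measures, here is a minimal sketch of its longest-common-subsequence (LCS) computation, applied to the weather example from Part 3. It assumes simple lowercase whitespace tokenization without punctuation, so it is not a drop-in replacement for the `evaluate` implementation; `lcs_length` and `rouge_l_f1` are hypothetical helper names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l_f1(candidate, reference):
    """F1 of LCS-based precision and recall."""
    cand_toks = candidate.lower().split()
    ref_toks = reference.lower().split()
    lcs = lcs_length(cand_toks, ref_toks)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_toks)
    recall = lcs / len(ref_toks)
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("it is freezing today", "the weather is cold today")
print(round(score, 3))  # 0.444: the LCS is "is today"
```

Only the words that appear verbatim in both sentences count, so "freezing" earns no credit for matching "cold".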

---

## Exercises

Test your understanding of semantic evaluation metrics with these exercises.

1. **Compute BERTScore** for the paraphrase pair below. Notice the completely different word order.

2. **Why might COMET give different scores than BERTScore** for the same translation?
   - Hint: Think about what information each metric has access to.

3. **When would you prefer lexical metrics (BLEU/ROUGE)** over semantic metrics?
   - Hint: Consider computation cost, interpretability, and specific task requirements.

### Exercise 1: BERTScore on Paraphrases

The sentences below are paraphrases with completely reorganized structure. Run the code to see how BERTScore handles this.

In [None]:
# Exercise 1
ref = "The committee approved the proposal yesterday after extensive debate."
cand = "Yesterday, following lengthy discussions, the proposal received committee approval."

P, R, F1 = bert_score([cand], [ref], lang="en", verbose=False)
print(f"BERTScore F1: {F1[0]:.3f}")
print("High score despite completely different word order!")

---

## Summary and Key Takeaways

### What We Learned

1. **Lexical metrics (BLEU, ROUGE) fail on paraphrases** because they treat words as atomic symbols with no semantic relationship

2. **Word embeddings represent meaning as vectors** where semantically similar words are neighbors in the embedding space

3. **BERTScore uses greedy token matching** to find the best semantic match for each token, enabling paraphrase recognition

4. **COMET is trained on human judgments** and uses source text for translation evaluation, achieving strong correlation with human ratings

### When to Use Each Metric

| Metric | Best For | Limitations |
|--------|----------|-------------|
| **BLEU** | MT systems with consistent output style; fast corpus-level evaluation | Penalizes valid paraphrases; most reliable with multiple references |
| **ROUGE** | Summarization; recall-oriented tasks | Same lexical limitations as BLEU |
| **BERTScore** | General NLG; paraphrase-tolerant evaluation | Computationally expensive; no source awareness |
| **COMET** | Translation evaluation; when source is available | Requires source text; large model size |

### Strengths of Semantic Metrics

- Recognize synonyms and paraphrases ("attorney" ≈ "lawyer")
- Capture semantic similarity even with different word order
- Generally better correlation with human judgments than lexical metrics
- Contextual understanding (for BERT-based metrics)

### Limitations to Keep in Mind

- **Computational cost**: Efficient processing of large datasets usually requires a GPU
- **Model bias**: Inherit biases from pretraining data
- **Not interpretable**: Harder to understand why a score is low (vs. counting n-grams)
- **Calibration varies**: Raw scores need rescaling to be meaningful across models

### Further Reading

- [BERTScore Paper](https://arxiv.org/abs/1904.09675) - Zhang et al., 2019
- [COMET Paper](https://arxiv.org/abs/2009.09025) - Rei et al., 2020
- [BLEURT](https://arxiv.org/abs/2004.04696) - Another learned metric worth exploring

---

*End of Chapter 3*