In [7]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created: January 2026
Author: Thomas Moerman
Description: Notebook for evaluating Machine Translation outputs with standard metrics.
"""



# Machine Translation Evaluation

This notebook demonstrates how to evaluate machine translation outputs using standard metrics.

## Metrics Overview

| Metric | Type | Range | Description |
|--------|------|-------|-------------|
| **BLEU** | N-gram overlap | 0-100 | Bilingual Evaluation Understudy - measures n-gram precision |
| **chrF++** | Character-level | 0-100 | Character n-gram F-score with word n-grams |
| **TER** | Edit distance | 0-∞ | Translation Edit Rate - lower is better |
| **COMET** | Neural | ~0 to 1 | Learned metric correlating with human judgments (range shown is for `wmt22-comet-da`, the model used below) |
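
To make "n-gram precision" in the BLEU row more concrete, here is a minimal sketch of its two building blocks: clipped n-gram precision and the brevity penalty. The helper names (`ngrams`, `clipped_precision`, `brevity_penalty`) are made up for this illustration and are not part of sacrebleu, and whitespace tokenization is a simplifying assumption. Full BLEU combines the 1- to 4-gram precisions with a geometric mean at corpus level, which this sketch does not do.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hyp_tokens, ref_tokens, n):
    """Modified n-gram precision: a hypothesis n-gram is credited at most
    as many times as it occurs in the reference."""
    hyp_counts = ngrams(hyp_tokens, n)
    ref_counts = ngrams(ref_tokens, n)
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0

def brevity_penalty(hyp_len, ref_len):
    """Penalize hypotheses that are shorter than the reference."""
    if hyp_len == 0:
        return 0.0
    return 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)

# Illustrative, pre-tokenized sentence pair (simplified tokenization)
hyp = "le temps est beau aujourd'hui .".split()
ref = "le temps est magnifique aujourd'hui .".split()

p1 = clipped_precision(hyp, ref, 1)
p2 = clipped_precision(hyp, ref, 2)
print(f"unigram precision: {p1:.3f}")  # 0.833 (5 of 6 tokens match)
print(f"bigram precision:  {p2:.3f}")  # 0.600 (3 of 5 bigrams match)
print(f"brevity penalty:   {brevity_penalty(len(hyp), len(ref)):.3f}")  # 1.000
```

A single substituted word costs one unigram but two bigrams, which is why BLEU drops quickly on short sentences.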

## When to Use Which Metric?

- **BLEU**: Standard metric, good for comparing systems. Limited for morphologically rich languages.
- **chrF++**: Better for morphologically rich languages (German, Finnish, etc.)
- **TER**: Useful for post-editing scenarios, measures editing effort.
- **COMET**: Best correlation with human judgment. Runs on CPU, but a GPU makes it much faster on larger test sets.
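
The point about morphologically rich languages can be illustrated with a small sketch. It compares word-level overlap with character trigram overlap for a Finnish sentence where the hypothesis misses only a suffix. The helpers `char_ngrams` and `overlap_f1` are hypothetical names for this example, and the scoring is deliberately simplified: real chrF++ averages over several n-gram orders and uses an F-score that weights recall more heavily.

```python
from collections import Counter

def char_ngrams(text, n):
    """Counter of character n-grams, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def overlap_f1(hyp_counts, ref_counts):
    """Plain F1 between two multisets of n-grams."""
    matched = sum((hyp_counts & ref_counts).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp_counts.values())
    recall = matched / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Illustrative pair: the hypothesis lacks the possessive suffix "-ni"
ref = "asun talossani"   # "I live in my house"
hyp = "asun talossa"     # "I live in a house"

word_f1 = overlap_f1(Counter(hyp.split()), Counter(ref.split()))
char_f1 = overlap_f1(char_ngrams(hyp, 3), char_ngrams(ref, 3))
print(f"word-level F1:      {word_f1:.3f}")  # 0.500
print(f"char trigram F1:    {char_f1:.3f}")  # 0.900
```

At the word level the inflected form counts as a complete miss, while the character-level score still credits the shared stem.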


## 1) Setup and Installation


In [8]:
# Install required packages
# Uncomment the following line if running in Colab or fresh environment
# !pip install sacrebleu unbabel-comet pandas -q

import warnings
warnings.filterwarnings('ignore')

print("Checking installations...")

try:
    import sacrebleu
    print(f"✓ sacrebleu {sacrebleu.__version__}")
except ImportError:
    print("✗ sacrebleu not installed. Run: pip install sacrebleu")

try:
    import comet
    print("✓ COMET installed")
except ImportError:
    print("✗ COMET not installed (optional). Run: pip install unbabel-comet")

import pandas as pd
print(f"✓ pandas {pd.__version__}")


Checking installations...
✓ sacrebleu 2.6.0
✓ COMET installed
✓ pandas 2.3.3


## 2) Load or Create Example Data

For evaluation, you need:
- **Source sentences** (original language)
- **Reference translations** (human/gold translations)
- **System translations** (your MT output)

Below we create example data for English→French translation.


In [9]:
# Example data: English to French translation
# In practice, load these from your test files

source_sentences = [
    "The weather is beautiful today.",
    "I love learning new languages.",
    "Machine translation has improved significantly.",
    "Can you help me find the train station?",
    "The European Parliament met in Brussels yesterday.",
    "Climate change is a global challenge.",
    "She reads a book every week.",
    "The restaurant serves excellent French cuisine.",
    "We need to finish this project by Friday.",
    "The concert was absolutely amazing.",
]

# Human reference translations (gold standard)
reference_translations = [
    "Le temps est magnifique aujourd'hui.",
    "J'adore apprendre de nouvelles langues.",
    "La traduction automatique s'est considérablement améliorée.",
    "Pouvez-vous m'aider à trouver la gare?",
    "Le Parlement européen s'est réuni à Bruxelles hier.",
    "Le changement climatique est un défi mondial.",
    "Elle lit un livre chaque semaine.",
    "Le restaurant sert une excellente cuisine française.",
    "Nous devons terminer ce projet d'ici vendredi.",
    "Le concert était absolument incroyable.",
]

# System translations (simulated MT output with varying quality)
system_translations = [
    "Le temps est beau aujourd'hui.",                         # Good, slight variation
    "J'aime apprendre de nouvelles langues.",                 # Good, synonym used
    "La traduction automatique s'est beaucoup améliorée.",    # Good, different adverb
    "Pouvez-vous m'aider à trouver la gare?",                 # Perfect match
    "Le Parlement européen a rencontré à Bruxelles hier.",    # Error: wrong verb
    "Le changement climatique est un challenge global.",      # Anglicism used
    "Elle lit un livre chaque semaine.",                      # Perfect match
    "Le restaurant sert de l'excellente cuisine française.",  # Minor grammar issue
    "Nous devons finir ce projet avant vendredi.",            # Synonym + preposition
    "Le concert était vraiment incroyable.",                  # Good, different adverb
]

print(f"Loaded {len(source_sentences)} sentence pairs for evaluation.")
print("\n--- Sample ---")
print(f"Source:     {source_sentences[0]}")
print(f"Reference:  {reference_translations[0]}")
print(f"System:     {system_translations[0]}")


Loaded 10 sentence pairs for evaluation.

--- Sample ---
Source:     The weather is beautiful today.
Reference:  Le temps est magnifique aujourd'hui.
System:     Le temps est beau aujourd'hui.


### Alternative: Load from Files

If you have your data in files, use this cell instead:


In [10]:
# Uncomment to load from files

# def load_sentences(file_path):
#     """Load sentences from a text file (one sentence per line)."""
#     with open(file_path, 'r', encoding='utf-8') as f:
#         return [line.strip() for line in f if line.strip()]
# 
# # Load your data
# source_sentences = load_sentences('data/test.en')
# reference_translations = load_sentences('data/test.fr')
# system_translations = load_sentences('output/predictions.txt')
# 
# print(f"Loaded {len(source_sentences)} sentences from files.")
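
One thing to watch with the loader above: it skips blank lines, so if one file contains an empty line and the others do not, every following sentence is paired with the wrong reference. A possible alternative, sketched below, keeps empty lines and checks that all files have the same number of lines. `load_lines` and `check_parallel` are hypothetical helpers, and the demo uses throwaway files rather than real test data.

```python
import tempfile
from pathlib import Path

def load_lines(file_path):
    """Read one sentence per line, keeping empty lines so indices stay aligned."""
    text = Path(file_path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines()]

def check_parallel(**named_lists):
    """Return the shared length, or raise ValueError if the lists differ in length."""
    lengths = {name: len(items) for name, items in named_lists.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Line counts differ: {lengths}")
    return next(iter(lengths.values()), 0)

# Demo on temporary files; replace these paths with your own test files.
with tempfile.TemporaryDirectory() as tmp:
    src_path = Path(tmp) / "test.en"
    ref_path = Path(tmp) / "test.fr"
    src_path.write_text("Hello.\n\nGoodbye.\n", encoding="utf-8")
    ref_path.write_text("Bonjour.\n\nAu revoir.\n", encoding="utf-8")
    demo_sources = load_lines(src_path)
    demo_references = load_lines(ref_path)

n_sentences = check_parallel(source=demo_sources, reference=demo_references)
print(f"{n_sentences} aligned lines")
```

Running the check before scoring turns a silent misalignment into an explicit error.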


## 3) Calculate BLEU, chrF++, and TER

Using **sacrebleu** - the standard tool for MT evaluation.

### About sacrebleu
- Provides reproducible, shareable scores
- Handles tokenization automatically
- Supports multiple metrics in one package
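
Of the metrics computed in this section, TER is the easiest to check by hand. The sketch below approximates it as word-level edit distance divided by reference length. This is a simplification and an assumption of the example: real TER also allows block shifts and sacrebleu applies its own tokenization, so the numbers will not necessarily match `sacrebleu.corpus_ter`. The function names are hypothetical.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Levenshtein distance over tokens (insert, delete, substitute)."""
    previous = list(range(len(ref_tokens) + 1))
    for i, hyp_tok in enumerate(hyp_tokens, start=1):
        current = [i]
        for j, ref_tok in enumerate(ref_tokens, start=1):
            cost = 0 if hyp_tok == ref_tok else 1
            current.append(min(
                previous[j] + 1,         # drop a hypothesis token
                current[j - 1] + 1,      # add a missing reference token
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]

def simple_ter(hyp, ref):
    """Edits per reference word, as a percentage (no shift operation)."""
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    if not ref_tokens:
        raise ValueError("Reference must contain at least one token")
    return 100 * word_edit_distance(hyp_tokens, ref_tokens) / len(ref_tokens)

score = simple_ter(
    "Le temps est beau aujourd'hui.",
    "Le temps est magnifique aujourd'hui.",
)
print(f"Simplified TER: {score:.1f}")  # 20.0 (1 substitution / 5 reference words)
```

Because the denominator is the reference length, a hypothesis much longer than its reference can score above 100.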


In [11]:
import sacrebleu

# sacrebleu takes a list of reference streams, one list per reference set.
# With a single reference per sentence, the list is wrapped as [references] in the calls below.
references = reference_translations
translations = system_translations

print("=" * 60)
print("SACREBLEU EVALUATION RESULTS")
print("=" * 60)

# ============ BLEU ============
# BLEU measures n-gram precision with brevity penalty
# Higher is better (0-100 scale)
bleu = sacrebleu.corpus_bleu(translations, [references])
bleu_score = round(bleu.score, 2)
print(f"\n📊 BLEU: {bleu_score}")

# ============ chrF++ ============
# chrF++ uses character n-grams + word n-grams (word_order=2)
# Higher is better (0-100 scale)
# Better for morphologically rich languages
chrf = sacrebleu.corpus_chrf(translations, [references], word_order=2)
chrf_score = round(chrf.score, 2)
print(f"\n📊 chrF++: {chrf_score}")

# ============ TER ============
# Translation Edit Rate - measures edit distance
# LOWER is better (can exceed 100 for poor translations)
ter = sacrebleu.corpus_ter(translations, [references])
ter_score = round(ter.score, 2)
print(f"\n📊 TER: {ter_score}")

print("\n" + "=" * 60)


SACREBLEU EVALUATION RESULTS

📊 BLEU: 54.84

📊 chrF++: 76.3

📊 TER: 19.35



## 4) Calculate COMET Score

**COMET** (Crosslingual Optimized Metric for Evaluation of Translation) is a neural metric that:
- Uses multilingual embeddings
- Correlates better with human judgments than BLEU
- Requires source sentences (not just reference and hypothesis)

⚠️ **Note**: COMET requires downloading a model checkpoint (about 2.3 GB for `Unbabel/wmt22-comet-da`) and benefits from a GPU.


In [12]:
# COMET evaluation
comet_score = None

try:
    from comet import download_model, load_from_checkpoint
    
    print("=" * 60)
    print("COMET EVALUATION")
    print("=" * 60)
    
    # Prepare data for COMET (requires source, MT output, and reference)
    comet_data = [
        {"src": src, "mt": mt, "ref": ref}
        for src, mt, ref in zip(source_sentences, system_translations, reference_translations)
    ]
    
    # Download model (only needed once)
    print("\nDownloading COMET model (this may take a while on first run)...")
    model_path = download_model("Unbabel/wmt22-comet-da")
    
    # Load model
    print("Loading model...")
    model = load_from_checkpoint(model_path)
    
    # Run prediction
    print("Computing COMET scores...")
    output = model.predict(comet_data, batch_size=8, gpus=0)  # gpus=1 if you have GPU
    
    # Extract scores
    segment_scores = output.scores
    system_score = output.system_score
    
    comet_score = round(system_score * 100, 2)
    print(f"\n📊 COMET: {comet_score}")
    print(f"   (Raw score: {system_score:.4f})")
    
    print("\n" + "=" * 60)
    
except ImportError:
    print("⚠️ COMET not installed.")
    print("   To install: pip install unbabel-comet")
    print("   COMET provides better correlation with human judgment.")
except Exception as e:
    print(f"⚠️ COMET evaluation failed: {e}")
    print("   This might be due to missing dependencies or GPU issues.")


COMET EVALUATION

Downloading COMET model (this may take a while on first run)...



Loading model...



Encoder model frozen.
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores


Computing COMET scores...


📊 COMET: 93.58
   (Raw score: 0.9358)
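
The cell above also stores per-sentence scores in `segment_scores`, which it does not use further. One way to put them to work is to list the lowest-scoring segments for manual inspection. The helper `lowest_scoring` below is a hypothetical sketch, and the scores in the demo are placeholders chosen for illustration, not values produced by COMET.

```python
def lowest_scoring(scores, sources, hypotheses, k=3):
    """Return the k segments with the lowest scores as
    (score, index, source, hypothesis) tuples, worst first."""
    ranked = sorted(zip(scores, range(len(scores)), sources, hypotheses))
    return ranked[:k]

# Placeholder data for the demo (not real COMET output)
demo_scores = [0.95, 0.72, 0.88]
demo_sources = ["Sentence A", "Sentence B", "Sentence C"]
demo_hypotheses = ["Phrase A", "Phrase B", "Phrase C"]

worst = lowest_scoring(demo_scores, demo_sources, demo_hypotheses, k=2)
for score, index, source, hypothesis in worst:
    print(f"[{index}] {score:.2f} | {source} -> {hypothesis}")
```

In this notebook the call would be `lowest_scoring(segment_scores, source_sentences, system_translations)`.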






## 5) Summary Results


In [13]:
# Create summary dataframe
print("=" * 60)
print("EVALUATION SUMMARY")
print("=" * 60)

results = {
    "Metric": ["BLEU", "chrF++", "TER"],
    "Score": [bleu_score, chrf_score, ter_score],
    "Direction": ["↑ Higher is better", "↑ Higher is better", "↓ Lower is better"],
}

# Add COMET if available
if comet_score is not None:
    results["Metric"].append("COMET")
    results["Score"].append(comet_score)
    results["Direction"].append("↑ Higher is better")

df_summary = pd.DataFrame(results)
print("\n")
print(df_summary.to_string(index=False))
print("\n" + "=" * 60)

# Display as a nice table
df_summary


EVALUATION SUMMARY


Metric  Score          Direction
  BLEU  54.84 ↑ Higher is better
chrF++  76.30 ↑ Higher is better
   TER  19.35  ↓ Lower is better
 COMET  93.58 ↑ Higher is better



Unnamed: 0,Metric,Score,Direction
0,BLEU,54.84,↑ Higher is better
1,chrF++,76.3,↑ Higher is better
2,TER,19.35,↓ Lower is better
3,COMET,93.58,↑ Higher is better


## 6) Command-Line Usage

You can also run sacrebleu from the command line:

```bash
# BLEU from command line:
sacrebleu reference.txt < hypothesis.txt

# chrF++:
sacrebleu reference.txt -i hypothesis.txt -m chrf --chrf-word-order 2

# TER:
sacrebleu reference.txt -i hypothesis.txt -m ter

# All metrics at once:
sacrebleu reference.txt -i hypothesis.txt -m bleu chrf ter
```
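
The commands above expect plain-text files with one sentence per line. If your data lives in Python lists, as in this notebook, a small helper can write them out first. `write_lines` is a hypothetical name, and the demo writes to a temporary directory; in practice you would write `reference.txt` and `hypothesis.txt` wherever you plan to run the command.

```python
import tempfile
from pathlib import Path

def write_lines(sentences, file_path):
    """Write one sentence per line (UTF-8). Embedded newlines would break
    the line alignment between files, so they are rejected."""
    for index, sentence in enumerate(sentences):
        if "\n" in sentence:
            raise ValueError(f"Sentence {index} contains a newline")
    Path(file_path).write_text("\n".join(sentences) + "\n", encoding="utf-8")

demo_refs = ["Elle lit un livre chaque semaine.", "Le temps est magnifique aujourd'hui."]
demo_hyps = ["Elle lit un livre chaque semaine.", "Le temps est beau aujourd'hui."]

with tempfile.TemporaryDirectory() as tmp:
    ref_file = Path(tmp) / "reference.txt"
    hyp_file = Path(tmp) / "hypothesis.txt"
    write_lines(demo_refs, ref_file)
    write_lines(demo_hyps, hyp_file)
    # Read the files back to confirm the round trip
    written_refs = ref_file.read_text(encoding="utf-8").splitlines()
    written_hyps = hyp_file.read_text(encoding="utf-8").splitlines()

print(f"Wrote {len(written_refs)} reference and {len(written_hyps)} hypothesis lines")
```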
