# Machine Translation: English-French Europarl

This notebook implements three translation approaches:
1. **Baseline**: Word-for-word translation using a bilingual dictionary
2. **Advanced**: Cross-lingual embeddings for semantic translation
3. **Best Model**: Fine-tuned Seq2Seq Transformer (iterative improvement)

We'll evaluate all models on a train/test split of the Europarl corpus.

In [1]:
# Import required libraries
import xml.etree.ElementTree as ET
from pathlib import Path
from collections import defaultdict, Counter
import re
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

# For evaluation
try:
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from nltk.tokenize import word_tokenize
    import nltk
    nltk.download('punkt', quiet=True)
except ImportError:
    print("Installing nltk...")
    import subprocess
    subprocess.check_call(['pip', 'install', 'nltk'])
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from nltk.tokenize import word_tokenize
    import nltk
    nltk.download('punkt', quiet=True)

# For cross-lingual embeddings
try:
    from sentence_transformers import SentenceTransformer
except ImportError:
    print("Installing sentence-transformers...")
    import subprocess
    subprocess.check_call(['pip', 'install', 'sentence-transformers'])
    from sentence_transformers import SentenceTransformer

# For seq2seq transformer model
try:
    from transformers import (
        AutoTokenizer, 
        AutoModelForSeq2SeqLM, 
        Seq2SeqTrainingArguments, 
        Seq2SeqTrainer,
        DataCollatorForSeq2Seq
    )
    from datasets import Dataset
    import torch
    print("Transformers library found!")
except ImportError:
    print("Installing transformers and datasets...")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'transformers', 'datasets', 'accelerate'])
    from transformers import (
        AutoTokenizer, 
        AutoModelForSeq2SeqLM, 
        Seq2SeqTrainingArguments, 
        Seq2SeqTrainer,
        DataCollatorForSeq2Seq
    )
    from datasets import Dataset
    import torch

# Check and install accelerate if needed (required for Trainer)
try:
    import accelerate
    # Check version
    from packaging import version
    if version.parse(accelerate.__version__) < version.parse('0.26.0'):
        raise ImportError("accelerate version too old")
    print(f"accelerate library found! (version {accelerate.__version__})")
except (ImportError, AttributeError):
    print("Installing/upgrading accelerate (required for Trainer)...")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--upgrade', 'accelerate>=0.26.0'])
    import accelerate
    print(f"accelerate installed successfully! (version {accelerate.__version__})")
except Exception as e:
    # If packaging not available, try to install anyway
    print("Checking accelerate version...")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--upgrade', 'accelerate>=0.26.0'])
    import accelerate
    print("accelerate installed/upgraded!")

# Install sentencepiece if not available (required for some tokenizers like opus-mt)
try:
    import sentencepiece
    print("sentencepiece library found!")
except ImportError:
    print("Installing sentencepiece (required for opus-mt tokenizer)...")
    import subprocess
    subprocess.check_call(['pip', 'install', 'sentencepiece'])
    import sentencepiece
    print("sentencepiece installed successfully!")

# Check for GPU (supports CUDA, MPS for Apple Silicon, and CPU)
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"Using device: {device}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    device = torch.device('mps')  # Apple Silicon GPU (M1/M2/M3)
    print(f"Using device: {device} (Apple Silicon GPU)")
    print("Metal Performance Shaders (MPS) backend enabled")
    print("Note: Some operations may fall back to CPU if not supported by MPS")
else:
    device = torch.device('cpu')
    print(f"Using device: {device} (CPU)")
    if hasattr(torch.backends, 'mps'):
        print("MPS not available. Make sure you have PyTorch with MPS support for Apple Silicon.")

print("Libraries imported successfully!")

Transformers library found!
accelerate library found! (version 1.12.0)
sentencepiece library found!
Using device: mps (Apple Silicon GPU)
Metal Performance Shaders (MPS) backend enabled
Note: Some operations may fall back to CPU if not supported by MPS
Libraries imported successfully!


## 1. Data Loading and Preprocessing

In [2]:
def _get_seg_text(elem):
    """Extract segment text from a tuv element."""
    seg = elem.find(".//{*}seg") or elem.find("seg")
    if seg is not None:
        return ((seg.text or "") + "".join((e.tail or "") for e in seg)).strip()
    return ""


def load_tmx_subset_robust(path, max_tu=None):
    """
    Stream-parse TMX; robust to namespaces. Yields (en_text, fr_text) pairs.
    """
    pairs = []
    in_tu = False
    en_text = None
    fr_text = None

    for event, elem in ET.iterparse(path, events=("start", "end")):
        tag = elem.tag.split("}")[-1] if "}" in elem.tag else elem.tag

        if tag == "tu":
            if event == "start":
                in_tu = True
                en_text = None
                fr_text = None
            else:
                if en_text is not None and fr_text is not None:
                    pairs.append((en_text, fr_text))
                    if max_tu and len(pairs) >= max_tu:
                        break
                in_tu = False
                elem.clear()

        elif in_tu and tag == "tuv":
            if event == "end":
                lang = elem.get(
                    "{http://www.w3.org/XML/1998/namespace}lang",
                    elem.get("lang", ""),
                )
                text = _get_seg_text(elem)
                if lang == "en":
                    en_text = text
                elif lang == "fr":
                    fr_text = text

    return pairs


# Load data
DATA_DIR = Path("data")
TMX_PATH = DATA_DIR / "en-fr.tmx"

print(f"Loading data from {TMX_PATH}...")
# Load a subset for faster processing (adjust max_tu as needed)
# For full dataset, set max_tu=None
pairs = load_tmx_subset_robust(TMX_PATH, max_tu=100000)
print(f"Loaded {len(pairs):,} sentence pairs")

# Filter out empty pairs
pairs = [(en, fr) for en, fr in pairs if en.strip() and fr.strip()]
print(f"After filtering empty pairs: {len(pairs):,} pairs")

Loading data from data/en-fr.tmx...
Loaded 100,000 sentence pairs
After filtering empty pairs: 100,000 pairs


In [3]:
# Split into train and test sets
train_pairs, test_pairs = train_test_split(
    pairs, 
    test_size=0.2, 
    random_state=42,
    shuffle=True
)

print(f"Training set: {len(train_pairs):,} pairs")
print(f"Test set: {len(test_pairs):,} pairs")

# Show some examples
print("\nSample training pairs:")
for i, (en, fr) in enumerate(train_pairs[:3]):
    print(f"\n{i+1}. EN: {en[:100]}...")
    print(f"   FR: {fr[:100]}...")

Training set: 80,000 pairs
Test set: 20,000 pairs

Sample training pairs:

1. EN: We are, of course, in the middle of yet another chaotic summer for flights....
   FR: Nous nous trouvons au beau milieu d'un autre √©t√© chaotique dans les airs....

2. EN: So now we are going to adopt regulations to prevent an epidemic that we have known about since 1986,...
   FR: Nous allons donc adopter des r√®gles pour pr√©venir une √©pid√©mie connue depuis 1986 : c'est-√†-dire dep...

3. EN: I must say to those people who said that employment in the sector would be reduced, that this has no...
   FR: Je dois dire √† ceux qui disaient que l'emploi allait diminuer dans ce secteur qu'il n'en a pas √©t√© a...


## 2. Baseline: Word-for-Word Translation

This baseline model builds a bilingual dictionary from the training data and translates each word independently.

In [4]:
class WordForWordTranslator:
    """Baseline word-for-word translation using bilingual dictionary."""
    
    def __init__(self):
        self.word_dict = defaultdict(lambda: defaultdict(int))
        self.most_common_translations = {}
        
    def train(self, train_pairs):
        """Build bilingual dictionary from training pairs."""
        print("Building bilingual dictionary...")
        
        for en_text, fr_text in train_pairs:
            # Simple tokenization (split on whitespace and punctuation)
            en_words = re.findall(r'\b\w+\b', en_text.lower())
            fr_words = re.findall(r'\b\w+\b', fr_text.lower())
            
            if not en_words or not fr_words:
                continue
            
            # Use positional alignment based on relative position
            # This prevents common words like "de" from dominating
            en_len = len(en_words)
            fr_len = len(fr_words)
            
            # Align words based on their relative positions
            for i, en_word in enumerate(en_words):
                # Map English position to French position
                fr_pos = int((i / en_len) * fr_len)
                fr_pos = min(fr_pos, fr_len - 1)  # Ensure valid index
                
                # Align to the word at the corresponding position
                fr_word = fr_words[fr_pos]
                self.word_dict[en_word][fr_word] += 1
                
                # Also align to nearby words (within 1 position) for better coverage
                if fr_pos > 0:
                    self.word_dict[en_word][fr_words[fr_pos - 1]] += 0.5
                if fr_pos < fr_len - 1:
                    self.word_dict[en_word][fr_words[fr_pos + 1]] += 0.5
        
        # Store most common translation for each English word
        for en_word, fr_translations in self.word_dict.items():
            if fr_translations:
                self.most_common_translations[en_word] = max(
                    fr_translations.items(), 
                    key=lambda x: x[1]
                )[0]
        
        print(f"Dictionary built with {len(self.most_common_translations):,} English words")
        
    def translate(self, en_text):
        """Translate English text word-by-word."""
        en_words = re.findall(r'\b\w+\b', en_text)
        fr_words = []
        
        for word in en_words:
            word_lower = word.lower()
            if word_lower in self.most_common_translations:
                fr_words.append(self.most_common_translations[word_lower])
            else:
                # Unknown word - keep original
                fr_words.append(word)
        
        return ' '.join(fr_words)
    
    def translate_preserve_case(self, en_text):
        """Translate preserving original word casing."""
        en_words = re.findall(r'\b\w+\b', en_text)
        fr_words = []
        
        for word in en_words:
            word_lower = word.lower()
            if word_lower in self.most_common_translations:
                translation = self.most_common_translations[word_lower]
                # Preserve case
                if word[0].isupper():
                    translation = translation.capitalize()
                fr_words.append(translation)
            else:
                fr_words.append(word)
        
        return ' '.join(fr_words)


# Train the baseline model
baseline_model = WordForWordTranslator()
baseline_model.train(train_pairs)

Building bilingual dictionary...
Dictionary built with 25,336 English words


In [5]:
# Test the baseline model on a few examples
print("Baseline Translation Examples:\n")
for i, (en, fr_true) in enumerate(test_pairs[:5]):
    fr_pred = baseline_model.translate_preserve_case(en)
    print(f"Example {i+1}:")
    print(f"  EN: {en}")
    print(f"  FR (true):  {fr_true}")
    print(f"  FR (pred):  {fr_pred}")
    print()

Baseline Translation Examples:

Example 1:
  EN: The implications are significant for the agro-food sector overall, which has been struck squarely by this crisis.
  FR (true):  Les cons√©quences sont importantes pour l'ensemble de ce secteur agro-alimentaire qui a √©t√© touch√© de plein fouet par cette crise.
  FR (pred):  La implications nous un pour la l la secteur de qui a √©t√© de elle par ce crise

Example 2:
  EN: The EUR 614 million adopted by the Council will be enough to finance all the foreseeable needs.
  FR (true):  Les 614 millions d'euros retenus par le Conseil permettront de financer l'ensemble des besoins pr√©visibles.
  FR (pred):  La De 1999 millions la par la Conseil de √™tre pas de des tous la √† de

Example 3:
  EN: Joint motion for a resolution (B5-0181/2000) on the shipwreck of the .
  FR (true):  Proposition de r√©solution commune (B5-0181/2000) sur le naufrage de l'Erika.
  FR (pred):  De de pour un r√©solution B5 0181 2000 sur la naufrage de la

Example 4:
  E

## 3. Advanced Model: Cross-Lingual Embeddings

This model uses pre-trained multilingual embeddings to find the best translation by comparing semantic similarity.

In [6]:
class CrossLingualEmbeddingTranslator:
    """Translation using cross-lingual embeddings and semantic similarity."""
    
    def __init__(self, model_name='paraphrase-multilingual-MiniLM-L12-v2'):
        """
        Initialize with a multilingual sentence transformer.
        Options:
        - 'paraphrase-multilingual-MiniLM-L12-v2' (fast, good quality)
        - 'paraphrase-multilingual-mpnet-base-v2' (slower, better quality)
        - 'distiluse-base-multilingual-cased' (alternative)
        """
        print(f"Loading multilingual embedding model: {model_name}...")
        self.model = SentenceTransformer(model_name)
        self.french_candidates = []
        self.french_embeddings = None
        
    def train(self, train_pairs):
        """Build a candidate pool of French translations from training data."""
        print("Building French candidate pool...")
        
        # Collect unique French sentences (or a large sample)
        french_sentences = set()
        for _, fr_text in train_pairs:
            if fr_text.strip():
                french_sentences.add(fr_text.strip())
        
        # Limit to reasonable size for efficiency (can be adjusted)
        max_candidates = 50000
        if len(french_sentences) > max_candidates:
            french_sentences = list(french_sentences)[:max_candidates]
        else:
            french_sentences = list(french_sentences)
        
        self.french_candidates = french_sentences
        
        # Pre-compute embeddings for all French candidates
        print(f"Computing embeddings for {len(self.french_candidates):,} French candidates...")
        self.french_embeddings = self.model.encode(
            self.french_candidates,
            show_progress_bar=True,
            batch_size=32
        )
        print("Training complete!")
        
    def translate(self, en_text, top_k=1):
        """
        Translate by finding the most semantically similar French sentence.
        
        Args:
            en_text: English text to translate
            top_k: Number of top candidates to return (default: 1, returns best match)
        """
        if not en_text.strip():
            return ""
        
        # Get embedding for English text
        en_embedding = self.model.encode([en_text])
        
        # Compute cosine similarity with all French candidates
        similarities = cosine_similarity(en_embedding, self.french_embeddings)[0]
        
        # Get top-k most similar
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        
        if top_k == 1:
            return self.french_candidates[top_indices[0]]
        else:
            return [self.french_candidates[idx] for idx in top_indices]


# Train the cross-lingual embedding model
print("Training cross-lingual embedding model...")
embedding_model = CrossLingualEmbeddingTranslator()
embedding_model.train(train_pairs)

Training cross-lingual embedding model...
Loading multilingual embedding model: paraphrase-multilingual-MiniLM-L12-v2...
Building French candidate pool...
Computing embeddings for 50,000 French candidates...


Batches:   0%|          | 0/1563 [00:00<?, ?it/s]

Training complete!


In [7]:
# Test the embedding model on a few examples
print("Cross-Lingual Embedding Translation Examples:\n")
for i, (en, fr_true) in enumerate(test_pairs[:5]):
    fr_pred = embedding_model.translate(en)
    print(f"Example {i+1}:")
    print(f"  EN: {en}")
    print(f"  FR (true):  {fr_true}")
    print(f"  FR (pred):  {fr_pred}")
    print()

Cross-Lingual Embedding Translation Examples:

Example 1:
  EN: The implications are significant for the agro-food sector overall, which has been struck squarely by this crisis.
  FR (true):  Les cons√©quences sont importantes pour l'ensemble de ce secteur agro-alimentaire qui a √©t√© touch√© de plein fouet par cette crise.
  FR (pred):  Ce sont justement les b√©n√©fices qui sont au d√©part des crises actuelles au niveau de la s√©curit√© alimentaire.

Example 2:
  EN: The EUR 614 million adopted by the Council will be enough to finance all the foreseeable needs.
  FR (true):  Les 614 millions d'euros retenus par le Conseil permettront de financer l'ensemble des besoins pr√©visibles.
  FR (pred):  Les 614 millions d'euros retenus par le Conseil permettront de financer l'ensemble des besoins pr√©visibles.

Example 3:
  EN: Joint motion for a resolution (B5-0181/2000) on the shipwreck of the .
  FR (true):  Proposition de r√©solution commune (B5-0181/2000) sur le naufrage de l'Erika.
  FR

## 4. Best Model: Fine-tuned Seq2Seq Transformer

This is our iterative improvement model. We'll fine-tune a pretrained transformer model on our Europarl data to achieve the best translation quality.

In [8]:
class Seq2SeqTranslator:
    """Fine-tuned Seq2Seq Transformer for English-French translation."""
    
    def __init__(self, model_name='Helsinki-NLP/opus-mt-en-fr', max_length=128):
        """
        Initialize with a pretrained translation model.
        
        Options:
        - 'Helsinki-NLP/opus-mt-en-fr' (fast, good baseline, ~300MB)
        - 'facebook/mbart-large-50' (multilingual, better quality, ~2.5GB, needs more GPU)
        - 'google/mt5-base' (multilingual T5, good quality, ~850MB)
        """
        self.model_name = model_name
        self.max_length = max_length
        self.tokenizer = None
        self.model = None
        self.device = device
        
        # Ensure sentencepiece is installed (required for opus-mt and some other models)
        try:
            import sentencepiece
        except ImportError:
            print("Installing sentencepiece (required for this tokenizer)...")
            import subprocess
            import sys
            subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'sentencepiece'])
            import sentencepiece
            print("sentencepiece installed successfully!")
        
        print(f"Initializing model: {model_name}...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        self.model.to(self.device)
        print(f"Model loaded on {self.device}")
        
    def train(self, train_pairs, val_pairs=None, 
              num_epochs=3, batch_size=8, learning_rate=5e-5,
              save_steps=1000, eval_steps=500, warmup_steps=500):
        """
        Fine-tune the model on training data.
        
        Args:
            train_pairs: List of (en, fr) tuples for training
            val_pairs: Optional validation set (if None, uses 10% of train)
            num_epochs: Number of training epochs
            batch_size: Training batch size (adjust based on GPU memory)
            learning_rate: Learning rate for fine-tuning
            save_steps: Save checkpoint every N steps
            eval_steps: Evaluate every N steps
            warmup_steps: Warmup steps for learning rate scheduler
        """
        print(f"\n{'='*60}")
        print("Training Seq2Seq Transformer Model")
        print(f"{'='*60}")
        print(f"Training examples: {len(train_pairs):,}")
        
        # Prepare validation set
        if val_pairs is None:
            # Use 10% of training data for validation
            val_size = max(1000, len(train_pairs) // 10)
            val_pairs = train_pairs[:val_size]
            train_pairs = train_pairs[val_size:]
            print(f"Split: {len(train_pairs):,} train, {len(val_pairs):,} validation")
        
        # Convert to HuggingFace Dataset format
        def prepare_dataset(pairs):
            return Dataset.from_dict({
                'en': [pair[0] for pair in pairs],
                'fr': [pair[1] for pair in pairs]
            })
        
        train_dataset = prepare_dataset(train_pairs)
        val_dataset = prepare_dataset(val_pairs)
        
        # Tokenize datasets
        def tokenize_function(examples):
            # Tokenize English (source) and French (target)
            model_inputs = self.tokenizer(
                examples['en'],
                max_length=self.max_length,
                truncation=True,
                padding='max_length'
            )
            
            # Tokenize French targets
            with self.tokenizer.as_target_tokenizer():
                labels = self.tokenizer(
                    examples['fr'],
                    max_length=self.max_length,
                    truncation=True,
                    padding='max_length'
                )
            
            # Replace padding token id's of the labels with -100 (ignored by loss)
            labels['input_ids'] = [
                [(l if l != self.tokenizer.pad_token_id else -100) for l in label]
                for label in labels['input_ids']
            ]
            
            model_inputs['labels'] = labels['input_ids']
            return model_inputs
        
        print("Tokenizing datasets...")
        train_dataset = train_dataset.map(
            tokenize_function,
            batched=True,
            remove_columns=train_dataset.column_names
        )
        val_dataset = val_dataset.map(
            tokenize_function,
            batched=True,
            remove_columns=val_dataset.column_names
        )
        
        # Data collator
        data_collator = DataCollatorForSeq2Seq(
            tokenizer=self.tokenizer,
            model=self.model,
            padding=True
        )
        
        # Training arguments
        output_dir = f"./seq2seq_model_{self.model_name.split('/')[-1]}"
        
        training_args = Seq2SeqTrainingArguments(
            output_dir=output_dir,
            num_train_epochs=num_epochs,
            per_device_train_batch_size=batch_size,
            per_device_eval_batch_size=batch_size,
            learning_rate=learning_rate,
            warmup_steps=warmup_steps,
            weight_decay=0.01,
            logging_dir=f'{output_dir}/logs',
            logging_steps=100,
            eval_steps=eval_steps,
            save_steps=save_steps,
            evaluation_strategy="steps",
            save_total_limit=2,
            load_best_model_at_end=True,
            metric_for_best_model="eval_loss",
            greater_is_better=False,
            fp16=torch.cuda.is_available(),  # Use mixed precision if CUDA GPU available (MPS doesn't support fp16)
            report_to="none",  # Disable wandb/tensorboard
        )
        
        # Trainer
        trainer = Seq2SeqTrainer(
            model=self.model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
            data_collator=data_collator,
            tokenizer=self.tokenizer,
        )
        
        # Train!
        print("\nStarting training...")
        print(f"Epochs: {num_epochs}, Batch size: {batch_size}, Learning rate: {learning_rate}")
        if torch.cuda.is_available():
            print(f"GPU: CUDA available")
        elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
            print(f"GPU: Apple Silicon (MPS) available")
        else:
            print(f"Using: CPU")
        
        train_result = trainer.train()
        
        print(f"\nTraining completed!")
        print(f"Training loss: {train_result.training_loss:.4f}")
        
        # Save final model
        trainer.save_model()
        self.tokenizer.save_pretrained(output_dir)
        print(f"Model saved to {output_dir}")
        
        return trainer
    
    def translate(self, en_text, max_length=None, num_beams=4, do_sample=False):
        """
        Translate English text to French.
        
        Args:
            en_text: English text to translate
            max_length: Maximum output length (default: self.max_length)
            num_beams: Number of beams for beam search (higher = better quality, slower)
            do_sample: Whether to use sampling (False = deterministic)
        """
        if max_length is None:
            max_length = self.max_length
        
        # Tokenize input
        inputs = self.tokenizer(
            en_text,
            return_tensors="pt",
            max_length=self.max_length,
            truncation=True,
            padding=True
        ).to(self.device)
        
        # Generate translation
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_length=max_length,
                num_beams=num_beams,
                do_sample=do_sample,
                early_stopping=True,
                num_return_sequences=1
            )
        
        # Decode
        translation = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return translation
    
    def translate_batch(self, en_texts, batch_size=8, **kwargs):
        """Translate a batch of English texts."""
        translations = []
        for i in range(0, len(en_texts), batch_size):
            batch = en_texts[i:i+batch_size]
            batch_translations = [self.translate(text, **kwargs) for text in batch]
            translations.extend(batch_translations)
        return translations

In [9]:
# Initialize the seq2seq model
# You can adjust the model_name to try different pretrained models
# For faster training/testing, start with opus-mt-en-fr
# For better quality (if you have GPU memory), try mBART or mT5

seq2seq_model = Seq2SeqTranslator(
    model_name='Helsinki-NLP/opus-mt-en-fr',  # Good baseline, fast
    max_length=128
)

print("\nModel initialized and ready for training!")
print("Note: Training will take time. Adjust batch_size and num_epochs based on your GPU memory.")

Initializing model: Helsinki-NLP/opus-mt-en-fr...
Model loaded on mps

Model initialized and ready for training!
Note: Training will take time. Adjust batch_size and num_epochs based on your GPU memory.


### Training Configuration

**Adjust these parameters based on your resources:**

- **CUDA GPU (NVIDIA)**: Use larger batch_size (16-32), more epochs (3-5), fp16 enabled
- **Apple Silicon GPU (M1/M2/M3)**: Use moderate batch_size (8-16), more epochs (3-5), no fp16
- **CPU only**: Use smaller batch_size (2-4), fewer epochs (1-2), expect slower training
- **Limited GPU memory**: Reduce batch_size, use gradient accumulation

**Model options:**
- `Helsinki-NLP/opus-mt-en-fr`: Fast, ~300MB, good for quick iteration
- `facebook/mbart-large-50`: Better quality, ~2.5GB, needs more GPU memory
- `google/mt5-base`: Good balance, ~850MB

**Note for Apple Silicon (MacBook):**
- The code automatically detects and uses MPS (Metal Performance Shaders)
- Training will be faster than CPU but may be slower than high-end NVIDIA GPUs
- Some operations may fall back to CPU if not supported by MPS
- Mixed precision (fp16) is not supported on MPS, so training uses full precision

In [10]:
# Train the seq2seq model
# Adjust parameters based on your GPU/CPU and time constraints

# For quick testing (CPU or limited time):
# seq2seq_model.train(
#     train_pairs,
#     num_epochs=1,
#     batch_size=4 if not torch.cuda.is_available() else 8,
#     learning_rate=5e-5,
#     save_steps=500,
#     eval_steps=250
# )

# For better results (with GPU):
# Determine batch size based on available device
if torch.cuda.is_available():
    batch_size = 16  # CUDA GPU - can use larger batches
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    batch_size = 8   # Apple Silicon GPU - moderate batch size
else:
    batch_size = 4   # CPU - smaller batches

seq2seq_model.train(
    train_pairs,
    num_epochs=3,  # Increase to 5-10 for better results
    batch_size=batch_size,
    learning_rate=5e-5,  # Try 3e-5 or 1e-4 for experimentation
    save_steps=1000,
    eval_steps=500,
    warmup_steps=500
)

print("\n‚úÖ Training complete! Model is ready for evaluation.")


Training Seq2Seq Transformer Model
Training examples: 80,000
Split: 72,000 train, 8,000 validation
Tokenizing datasets...


Map:   0%|          | 0/72000 [00:00<?, ? examples/s]

Map:   0%|          | 0/8000 [00:00<?, ? examples/s]


Starting training...
Epochs: 3, Batch size: 8, Learning rate: 5e-05
GPU: Apple Silicon (MPS) available


Step,Training Loss,Validation Loss
500,1.2183,1.173248
1000,1.2852,1.190946
1500,1.2578,1.188723
2000,1.3087,1.184073
2500,1.2668,1.183677
3000,1.2951,1.182387
3500,1.2447,1.180292
4000,1.2456,1.17845
4500,1.2599,1.176777
5000,1.245,1.172938


There were missing keys in the checkpoint model loaded: ['model.encoder.embed_tokens.weight', 'model.encoder.embed_positions.weight', 'model.decoder.embed_tokens.weight', 'model.decoder.embed_positions.weight', 'lm_head.weight'].



Training completed!
Training loss: 1.0713
Model saved to ./seq2seq_model_opus-mt-en-fr

‚úÖ Training complete! Model is ready for evaluation.


In [11]:
# Test the seq2seq model on a few examples
print("Seq2Seq Transformer Translation Examples:\n")
for i, (en, fr_true) in enumerate(test_pairs[:5]):
    fr_pred = seq2seq_model.translate(en, num_beams=4)
    print(f"Example {i+1}:")
    print(f"  EN: {en}")
    print(f"  FR (true):  {fr_true}")
    print(f"  FR (pred):  {fr_pred}")
    print()

Seq2Seq Transformer Translation Examples:

Example 1:
  EN: The implications are significant for the agro-food sector overall, which has been struck squarely by this crisis.
  FR (true):  Les cons√©quences sont importantes pour l'ensemble de ce secteur agro-alimentaire qui a √©t√© touch√© de plein fouet par cette crise.
  FR (pred):  Les implications sont importantes pour l'ensemble du secteur agro-alimentaire, qui a √©t√© frapp√© de mani√®re directe par cette crise.

Example 2:
  EN: The EUR 614 million adopted by the Council will be enough to finance all the foreseeable needs.
  FR (true):  Les 614 millions d'euros retenus par le Conseil permettront de financer l'ensemble des besoins pr√©visibles.
  FR (pred):  Les 614 millions d'euros adopt√©s par le Conseil suffiront √† financer l'ensemble des besoins pr√©visibles.

Example 3:
  EN: Joint motion for a resolution (B5-0181/2000) on the shipwreck of the .
  FR (true):  Proposition de r√©solution commune (B5-0181/2000) sur le naufrage 

## 5. Evaluation

We'll evaluate all three models using BLEU score and other metrics.

In [12]:
def evaluate_translations(model, test_pairs, model_name="Model", max_samples=None):
    """Evaluate translation model on test set."""
    if max_samples:
        test_subset = test_pairs[:max_samples]
    else:
        test_subset = test_pairs
    
    print(f"Evaluating {model_name} on {len(test_subset):,} test examples...")
    
    predictions = []
    references = []
    
    smoothing = SmoothingFunction().method1
    
    bleu_scores = []
    exact_matches = 0
    
    for i, (en, fr_true) in enumerate(test_subset):
        try:
            fr_pred = model.translate(en)
            predictions.append(fr_pred)
            references.append(fr_true)
            
            # Tokenize for BLEU
            pred_tokens = word_tokenize(fr_pred.lower())
            ref_tokens = word_tokenize(fr_true.lower())
            
            # Calculate BLEU score
            bleu = sentence_bleu([ref_tokens], pred_tokens, smoothing_function=smoothing)
            bleu_scores.append(bleu)
            
            # Exact match
            if fr_pred.lower().strip() == fr_true.lower().strip():
                exact_matches += 1
                
        except Exception as e:
            print(f"Error on example {i}: {e}")
            continue
        
        if (i + 1) % 1000 == 0:
            print(f"  Processed {i+1:,} examples...")
    
    avg_bleu = np.mean(bleu_scores) if bleu_scores else 0.0
    exact_match_rate = exact_matches / len(test_subset) if test_subset else 0.0
    
    results = {
        'avg_bleu': avg_bleu,
        'exact_match_rate': exact_match_rate,
        'num_evaluated': len(test_subset),
        'predictions': predictions,
        'references': references
    }
    
    print(f"\n{model_name} Results:")
    print(f"  Average BLEU Score: {avg_bleu:.4f}")
    print(f"  Exact Match Rate: {exact_match_rate:.4f} ({exact_matches}/{len(test_subset)})")
    
    return results

In [13]:
# Evaluate baseline model
# Using a subset for faster evaluation (adjust as needed)
print("=" * 60)
baseline_results = evaluate_translations(
    baseline_model, 
    test_pairs, 
    model_name="Baseline (Word-for-Word)",
    max_samples=1000  # Evaluate on first 1000 for speed
)

Evaluating Baseline (Word-for-Word) on 1,000 test examples...
  Processed 1,000 examples...

Baseline (Word-for-Word) Results:
  Average BLEU Score: 0.0369
  Exact Match Rate: 0.0000 (0/1000)


In [14]:
# Evaluate embedding model
print("=" * 60)
embedding_results = evaluate_translations(
    embedding_model,
    test_pairs,
    model_name="Advanced (Cross-Lingual Embeddings)",
    max_samples=1000  # Evaluate on first 1000 for speed
)

# Evaluate seq2seq model
print("=" * 60)
seq2seq_results = evaluate_translations(
    seq2seq_model,
    test_pairs,
    model_name="Best Model (Seq2Seq Transformer)",
    max_samples=1000  # Evaluate on first 1000 for speed
)

Evaluating Advanced (Cross-Lingual Embeddings) on 1,000 test examples...
  Processed 1,000 examples...

Advanced (Cross-Lingual Embeddings) Results:
  Average BLEU Score: 0.0387
  Exact Match Rate: 0.0020 (2/1000)
Evaluating Best Model (Seq2Seq Transformer) on 1,000 test examples...
  Processed 1,000 examples...

Best Model (Seq2Seq Transformer) Results:
  Average BLEU Score: 0.3136
  Exact Match Rate: 0.0220 (22/1000)


In [15]:
# Compare all three models
print("\n" + "=" * 80)
print("MODEL COMPARISON - ALL THREE MODELS")
print("=" * 80)
print(f"{'Metric':<30} {'Baseline':<20} {'Embeddings':<20} {'Seq2Seq':<20}")
print("-" * 80)
print(f"{'Average BLEU Score':<30} {baseline_results['avg_bleu']:<20.4f} {embedding_results['avg_bleu']:<20.4f} {seq2seq_results['avg_bleu']:<20.4f}")
print(f"{'Exact Match Rate':<30} {baseline_results['exact_match_rate']:<20.4f} {embedding_results['exact_match_rate']:<20.4f} {seq2seq_results['exact_match_rate']:<20.4f}")
print(f"{'Number Evaluated':<30} {baseline_results['num_evaluated']:<20} {embedding_results['num_evaluated']:<20} {seq2seq_results['num_evaluated']:<20}")

# Calculate improvements
embedding_improvement = embedding_results['avg_bleu'] - baseline_results['avg_bleu']
seq2seq_improvement = seq2seq_results['avg_bleu'] - baseline_results['avg_bleu']
seq2seq_vs_embedding = seq2seq_results['avg_bleu'] - embedding_results['avg_bleu']

print(f"\n{'Improvements over Baseline:':<30}")
print(f"  Embeddings: {embedding_improvement:+.4f} ({embedding_improvement/baseline_results['avg_bleu']*100:+.2f}%)")
print(f"  Seq2Seq:    {seq2seq_improvement:+.4f} ({seq2seq_improvement/baseline_results['avg_bleu']*100:+.2f}%)")
print(f"\nSeq2Seq vs Embeddings: {seq2seq_vs_embedding:+.4f} ({seq2seq_vs_embedding/embedding_results['avg_bleu']*100:+.2f}%)")

# Determine winner
best_model = max([
    ('Baseline', baseline_results['avg_bleu']),
    ('Embeddings', embedding_results['avg_bleu']),
    ('Seq2Seq', seq2seq_results['avg_bleu'])
], key=lambda x: x[1])

print(f"\nüèÜ Best Model: {best_model[0]} (BLEU: {best_model[1]:.4f})")


MODEL COMPARISON - ALL THREE MODELS
Metric                         Baseline             Embeddings           Seq2Seq             
--------------------------------------------------------------------------------
Average BLEU Score             0.0369               0.0387               0.3136              
Exact Match Rate               0.0000               0.0020               0.0220              
Number Evaluated               1000                 1000                 1000                

Improvements over Baseline:   
  Embeddings: +0.0018 (+4.79%)
  Seq2Seq:    +0.2766 (+748.79%)

Seq2Seq vs Embeddings: +0.2749 (+710.00%)

üèÜ Best Model: Seq2Seq (BLEU: 0.3136)


## 6. Detailed Examples and Analysis

In [16]:
# Show side-by-side comparisons for all three models
print("Side-by-Side Translation Comparison (All Three Models):\n")
num_examples = 10
for i in range(min(num_examples, len(test_pairs))):
    en, fr_true = test_pairs[i]
    fr_baseline = baseline_model.translate_preserve_case(en)
    fr_embedding = embedding_model.translate(en)
    fr_seq2seq = seq2seq_model.translate(en, num_beams=4)
    
    print(f"{'='*80}")
    print(f"Example {i+1}")
    print(f"{'='*80}")
    print(f"EN:  {en}")
    print(f"\nFR (True):      {fr_true}")
    print(f"FR (Baseline):  {fr_baseline}")
    print(f"FR (Embedding): {fr_embedding}")
    print(f"FR (Seq2Seq):   {fr_seq2seq}")
    print()

Side-by-Side Translation Comparison (All Three Models):

Example 1
EN:  The implications are significant for the agro-food sector overall, which has been struck squarely by this crisis.

FR (True):      Les cons√©quences sont importantes pour l'ensemble de ce secteur agro-alimentaire qui a √©t√© touch√© de plein fouet par cette crise.
FR (Baseline):  La implications nous un pour la l la secteur de qui a √©t√© de elle par ce crise
FR (Embedding): Ce sont justement les b√©n√©fices qui sont au d√©part des crises actuelles au niveau de la s√©curit√© alimentaire.
FR (Seq2Seq):   Les implications sont importantes pour l'ensemble du secteur agro-alimentaire, qui a √©t√© frapp√© de mani√®re directe par cette crise.

Example 2
EN:  The EUR 614 million adopted by the Council will be enough to finance all the foreseeable needs.

FR (True):      Les 614 millions d'euros retenus par le Conseil permettront de financer l'ensemble des besoins pr√©visibles.
FR (Baseline):  La De 1999 millions la par la

## Summary

This notebook implemented and compared two translation approaches:

1. **Baseline (Word-for-Word)**: Simple dictionary-based translation that maps each English word to its most common French translation from the training data.

2. **Advanced (Cross-Lingual Embeddings)**: Uses multilingual sentence embeddings to find the most semantically similar French sentence from the training corpus.

The cross-lingual embedding approach should generally perform better as it considers semantic meaning rather than just word-level mappings. However, it requires more computational resources and may be slower for large candidate pools.

### Future Improvements:
- Use more sophisticated word alignment algorithms (e.g., IBM models)
- Implement phrase-based translation
- Use neural sequence-to-sequence models
- Fine-tune the embedding model on the specific domain
- Add more evaluation metrics (METEOR, ROUGE, etc.)