# Sequence-to-Sequence (Seq2Seq) Model for Neural Machine Translation

This notebook explains a complete implementation of a Seq2Seq model for translating German to English using PyTorch. The model uses an Encoder-Decoder architecture with LSTM networks.

## Table of Contents
1. [Imports and Setup](#imports)
2. [Data Preprocessing](#data)
3. [Encoder Architecture](#encoder)
4. [Decoder Architecture](#decoder)
5. [Seq2Seq Model](#seq2seq)
6. [Training Setup](#training)
7. [Training Loop](#loop)
8. [Evaluation](#evaluation)

## 1. Imports and Setup {#imports}

First, let's import all necessary libraries and understand what each one does:

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
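# Note: Field, BucketIterator and Multi30k.splits belong to the legacy torchtext API.
# They import as written in torchtext <= 0.8; in 0.9-0.11 they live under
# torchtext.legacy (e.g. torchtext.legacy.data), and they were removed in 0.12.
from torchtext.datasets import Multi30k  # German-English translation dataset
from torchtext.data import Field, BucketIterator  # Data preprocessing utilities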
import numpy as np
import spacy  # Natural language processing library for tokenization
import random
from torch.utils.tensorboard import SummaryWriter  # For visualization

# Note: utils.py contains helper functions for translation, BLEU score, and checkpointing
# from utils import translate_sentence, bleu, save_checkpoint, load_checkpoint

print("Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

### Loading Language Models

We use spaCy for tokenization - breaking sentences into individual words/tokens:

In [None]:
# Load spaCy language models for German and English
# Each model is installed separately: python -m spacy download <model_name>
try:
    spacy_ger = spacy.load("de_core_news_sm")
    spacy_eng = spacy.load("en_core_web_sm")  # English pipelines are named en_core_web_*
    print("SpaCy models loaded successfully!")
except OSError:
    print("Please install spaCy models:")
    print("python -m spacy download de_core_news_sm")
    print("python -m spacy download en_core_web_sm")
    # Fallback to basic tokenization
    spacy_ger = None
    spacy_eng = None

### Tokenization Functions

Tokenization converts sentences into lists of words:

In [None]:
def tokenize_ger(text):
    """Tokenize German text into individual words"""
    if spacy_ger:
        return [tok.text for tok in spacy_ger.tokenizer(text)]
    else:
        return text.split()  # Simple fallback

def tokenize_eng(text):
    """Tokenize English text into individual words"""
    if spacy_eng:
        return [tok.text for tok in spacy_eng.tokenizer(text)]
    else:
        return text.split()  # Simple fallback

# Test tokenization
german_sentence = "Ein Mann geht zur Schule."
english_sentence = "A man goes to school."

print(f"German: {german_sentence}")
print(f"Tokenized: {tokenize_ger(german_sentence)}")
print(f"\nEnglish: {english_sentence}")
print(f"Tokenized: {tokenize_eng(english_sentence)}")
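
The `text.split()` fallback above leaves punctuation attached to words (for example `Schule.`), whereas spaCy emits punctuation as separate tokens. Here is a minimal sketch of a closer fallback, assuming a simple regular expression is good enough for a demo. `regex_tokenize` is a hypothetical helper, not part of spaCy or torchtext:

```python
import re

# Hypothetical fallback tokenizer: runs of word characters, or single punctuation marks.
_TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def regex_tokenize(text):
    """Split text into word and punctuation tokens without spaCy."""
    return _TOKEN_PATTERN.findall(text)

print(regex_tokenize("Ein Mann geht zur Schule."))  # punctuation becomes its own token
print("Ein Mann geht zur Schule.".split())          # punctuation stays glued to 'Schule.'
```

In Python 3, `\w` matches Unicode letters, so German characters such as `ä` and `ß` stay inside their words.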

## 2. Data Preprocessing {#data}

### Field Definition

Fields define how to process the text data:

In [None]:
# Define fields for German and English
german = Field(
    tokenize=tokenize_ger,  # How to split text into tokens
    lower=True,            # Convert to lowercase
    init_token="<sos>",     # Start of sequence token
    eos_token="<eos>"       # End of sequence token
)

english = Field(
    tokenize=tokenize_eng,
    lower=True,
    init_token="<sos>",
    eos_token="<eos>"
)

print("Fields created successfully!")
print("Special tokens:")
print(f"- Start of sequence: {german.init_token}")
print(f"- End of sequence: {german.eos_token}")

### Dataset Loading

Multi30k is a multilingual dataset of image descriptions with roughly 30k sentence pairs; here we load its German-English portion:

In [None]:
# Load the Multi30k dataset
try:
    train_data, valid_data, test_data = Multi30k.splits(
        exts=(".de", ".en"),           # File extensions for German and English
        fields=(german, english)       # Fields to use for processing
    )
    
    print(f"Dataset loaded successfully!")
    print(f"Training examples: {len(train_data)}")
    print(f"Validation examples: {len(valid_data)}")
    print(f"Test examples: {len(test_data)}")
    
    # Show a sample
    print("\nSample training example:")
    print(f"German: {' '.join(train_data[0].src)}")
    print(f"English: {' '.join(train_data[0].trg)}")
    
except Exception as e:
    print(f"Error loading dataset: {e}")
    print("This might be due to network issues or dataset availability.")

### Vocabulary Building

Build vocabularies from the training data only, so that words seen only in the validation or test sets map to the unknown token:

In [None]:
# Build vocabularies
german.build_vocab(
    train_data,
    max_size=10000,  # Maximum vocabulary size
    min_freq=2       # Minimum frequency for a word to be included
)

english.build_vocab(
    train_data,
    max_size=10000,
    min_freq=2
)

print(f"German vocabulary size: {len(german.vocab)}")
print(f"English vocabulary size: {len(english.vocab)}")

# Show special tokens
print("\nSpecial tokens in vocabulary:")
print(f"Unknown token: {german.vocab.itos[0]}")
print(f"Padding token: {german.vocab.itos[1]}")
print(f"Start token: {german.vocab.itos[2]}")
print(f"End token: {german.vocab.itos[3]}")

# Show most common words
print("\nMost common German words:")
for i in range(4, 14):
    print(f"{i-3}. {german.vocab.itos[i]}")
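
To make `max_size` and `min_freq` concrete, here is a pure-Python sketch of the idea behind `build_vocab`. It is an illustration under simplifying assumptions, not torchtext's implementation; `build_vocab`, `numericalize`, and the toy corpus are made up for this example:

```python
from collections import Counter

SPECIALS = ["<unk>", "<pad>", "<sos>", "<eos>"]  # same order as printed above

def build_vocab(tokenized_sentences, max_size=10000, min_freq=2):
    """Return (itos, stoi): special tokens first, then words by descending frequency."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Sort by frequency (descending), breaking ties alphabetically
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    words = [w for w, c in ordered if c >= min_freq][:max_size]
    itos = SPECIALS + words
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to indices, wrapped in <sos>/<eos>; words outside the vocab become <unk>."""
    unk = stoi["<unk>"]
    return [stoi["<sos>"]] + [stoi.get(t, unk) for t in tokens] + [stoi["<eos>"]]

corpus = [["ein", "mann", "geht"], ["ein", "mann", "läuft"], ["ein", "hund"]]
itos, stoi = build_vocab(corpus, max_size=10000, min_freq=2)
print(itos)                                         # only 'ein' and 'mann' occur twice or more
print(numericalize(["ein", "hund", "geht"], stoi))  # 'hund' and 'geht' map to index 0 (<unk>)
```

In this sketch the special tokens are added on top of `max_size`, so the vocabulary can hold up to `max_size + 4` entries.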

## 3. Encoder Architecture {#encoder}

The encoder processes the input sequence and creates a context representation:

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, num_layers, p):
        super(Encoder, self).__init__()
        
        # Store parameters
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # Layers
        self.dropout = nn.Dropout(p)  # Regularization
        self.embedding = nn.Embedding(input_size, embedding_size)  # Word embeddings
        self.rnn = nn.LSTM(embedding_size, hidden_size, num_layers, dropout=p)

    def forward(self, x):
        # x shape: (seq_length, batch_size)
        
        # Convert word indices to embeddings
        embedding = self.dropout(self.embedding(x))
        # embedding shape: (seq_length, batch_size, embedding_size)
        
        # Pass through LSTM
        outputs, (hidden, cell) = self.rnn(embedding)
        # outputs shape: (seq_length, batch_size, hidden_size)
        # hidden shape: (num_layers, batch_size, hidden_size)
        # cell shape: (num_layers, batch_size, hidden_size)
        
        # Return final hidden and cell states (context)
        return hidden, cell

print("Encoder class defined!")
print("\nEncoder Architecture:")
print("1. Embedding layer: converts word indices to dense vectors")
print("2. LSTM layers: process sequence and maintain memory")
print("3. Output: final hidden and cell states as context")

### Encoder Visualization

Let's create a simple encoder to understand its structure:

In [None]:
# Create a sample encoder
sample_encoder = Encoder(
    input_size=1000,      # Vocabulary size
    embedding_size=256,   # Embedding dimension
    hidden_size=512,      # LSTM hidden size
    num_layers=2,         # Number of LSTM layers
    p=0.5                 # Dropout probability
)

print(f"Sample Encoder:")
print(f"Parameters: {sum(p.numel() for p in sample_encoder.parameters()):,}")
print(f"\nLayer details:")
for name, layer in sample_encoder.named_children():
    print(f"- {name}: {layer}")

# Test with dummy input
dummy_input = torch.randint(0, 1000, (10, 2))  # seq_len=10, batch_size=2
hidden, cell = sample_encoder(dummy_input)

print(f"\nInput shape: {dummy_input.shape}")
print(f"Hidden state shape: {hidden.shape}")
print(f"Cell state shape: {cell.shape}")
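
The parameter count printed for the sample encoder can be checked by hand. The sketch below assumes the usual `nn.LSTM` layout: four gates per layer, each with input and recurrent weights, plus two bias vectors per layer. The helper names are made up for this example:

```python
def embedding_params(vocab_size, embedding_size):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embedding_size

def lstm_params(input_size, hidden_size, num_layers):
    """Parameter count of a unidirectional multi-layer LSTM."""
    total = 0
    for layer in range(num_layers):
        # Layer 0 reads the embeddings; deeper layers read the previous layer's hidden state
        layer_input = input_size if layer == 0 else hidden_size
        weights = 4 * hidden_size * (layer_input + hidden_size)  # 4 gates, input + recurrent
        biases = 2 * 4 * hidden_size                             # two bias vectors per layer
        total += weights + biases
    return total

# Same sizes as sample_encoder: vocab 1000, embedding 256, hidden 512, 2 layers
total = embedding_params(1000, 256) + lstm_params(256, 512, 2)
print(f"{total:,}")  # 3,934,208
```

Dropout adds no parameters, so this total should match the `Parameters:` figure printed above.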

## 4. Decoder Architecture {#decoder}

The decoder generates the output sequence one word at a time:

In [None]:
class Decoder(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, output_size, num_layers, p):
        super(Decoder, self).__init__()
        
        # Store parameters
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # Layers
        self.dropout = nn.Dropout(p)
        self.embedding = nn.Embedding(input_size, embedding_size)
        self.rnn = nn.LSTM(embedding_size, hidden_size, num_layers, dropout=p)
        self.fc = nn.Linear(hidden_size, output_size)  # Output projection

    def forward(self, x, hidden, cell):
        # x shape: (batch_size) - single word input
        # We need to add sequence dimension
        x = x.unsqueeze(0)  # Shape: (1, batch_size)
        
        # Convert to embeddings
        embedding = self.dropout(self.embedding(x))
        # embedding shape: (1, batch_size, embedding_size)
        
        # Pass through LSTM with previous context
        outputs, (hidden, cell) = self.rnn(embedding, (hidden, cell))
        # outputs shape: (1, batch_size, hidden_size)
        
        # Project to vocabulary size
        predictions = self.fc(outputs)
        # predictions shape: (1, batch_size, vocab_size)
        
        # Remove sequence dimension
        predictions = predictions.squeeze(0)
        # predictions shape: (batch_size, vocab_size)
        
        return predictions, hidden, cell

print("Decoder class defined!")
print("\nDecoder Architecture:")
print("1. Embedding layer: converts word indices to dense vectors")
print("2. LSTM layers: generate next word using context")
print("3. Linear layer: project to vocabulary scores (logits); softmax turns them into probabilities")
print("4. Process one word at a time during generation")

### Decoder Visualization

In [None]:
# Create a sample decoder
sample_decoder = Decoder(
    input_size=1000,      # Input vocabulary size
    embedding_size=256,   # Embedding dimension
    hidden_size=512,      # LSTM hidden size (must match encoder)
    output_size=1000,     # Output vocabulary size
    num_layers=2,         # Number of LSTM layers (must match encoder)
    p=0.5                 # Dropout probability
)

print(f"Sample Decoder:")
print(f"Parameters: {sum(p.numel() for p in sample_decoder.parameters()):,}")

# Test with dummy input and context from encoder
dummy_word = torch.randint(0, 1000, (2,))  # batch_size=2, single word
predictions, new_hidden, new_cell = sample_decoder(dummy_word, hidden, cell)

print(f"\nInput word shape: {dummy_word.shape}")
print(f"Predictions shape: {predictions.shape}")
print(f"New hidden state shape: {new_hidden.shape}")
print(f"New cell state shape: {new_cell.shape}")

# Show prediction probabilities
probs = torch.softmax(predictions[0], dim=0)
top_words = torch.topk(probs, 5)
print(f"\nTop 5 predicted word indices: {top_words.indices.tolist()}")
print(f"Their probabilities: {top_words.values.tolist()}")
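
The decoder returns raw scores (logits); the `torch.softmax` call above turns them into probabilities. As a rough illustration of what that step computes, here is a pure-Python sketch for a single row of made-up logits. `softmax` and `top_k` are hypothetical helpers written for this example:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return (index, probability) pairs for the k largest probabilities."""
    return sorted(enumerate(probs), key=lambda ip: ip[1], reverse=True)[:k]

logits = [2.0, 0.5, -1.0, 3.0]  # made-up scores for a 4-word vocabulary
probs = softmax(logits)
print(top_k(probs, 2))  # index 3 first, then index 0
```

Softmax preserves the ordering of the logits, which is why `argmax` can be applied directly to the decoder output during greedy decoding.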

## 5. Seq2Seq Model {#seq2seq}

The complete model combines encoder and decoder:

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, source, target, teacher_force_ratio=0.5):
        # source shape: (src_len, batch_size)
        # target shape: (trg_len, batch_size)
        
        batch_size = source.shape[1]
        target_len = target.shape[0]
        target_vocab_size = self.decoder.fc.out_features  # Read the size from the decoder's output layer
        
        # Store decoder outputs
        device = source.device
        outputs = torch.zeros(target_len, batch_size, target_vocab_size).to(device)
        
        # Encode the source sequence
        hidden, cell = self.encoder(source)
        
        # First input to decoder is <SOS> token
        x = target[0]  # Shape: (batch_size)
        
        # Generate target sequence
        for t in range(1, target_len):
            # Get prediction for current timestep
            output, hidden, cell = self.decoder(x, hidden, cell)
            
            # Store prediction
            outputs[t] = output
            
            # Get best predicted word
            best_guess = output.argmax(1)
            
            # Teacher forcing: use actual target word with probability teacher_force_ratio
            # Otherwise use predicted word
            x = target[t] if random.random() < teacher_force_ratio else best_guess
        
        return outputs

print("Seq2Seq class defined!")
print("\nSeq2Seq Architecture:")
print("1. Encoder processes entire source sequence")
print("2. Decoder generates target sequence word by word")
print("3. Teacher forcing helps training stability")
print("4. Context flows from encoder to decoder")

### Teacher Forcing Explanation

Teacher forcing is a training technique where, at each step, we sometimes feed the decoder the ground-truth target word as its next input instead of its own prediction:

In [None]:
import matplotlib.pyplot as plt

# Simulate teacher forcing decisions
def simulate_teacher_forcing(teacher_force_ratio, sequence_length=10):
    decisions = []
    for _ in range(sequence_length):
        use_teacher = random.random() < teacher_force_ratio
        decisions.append("Teacher" if use_teacher else "Predicted")
    return decisions

# Test different ratios
ratios = [0.0, 0.5, 1.0]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, ratio in enumerate(ratios):
    decisions = simulate_teacher_forcing(ratio, 20)
    teacher_count = decisions.count("Teacher")
    predicted_count = decisions.count("Predicted")
    
    axes[i].bar(["Teacher", "Predicted"], [teacher_count, predicted_count])
    axes[i].set_title(f"Teacher Force Ratio: {ratio}")
    axes[i].set_ylabel("Count")

plt.tight_layout()
plt.show()

print("Teacher Forcing Trade-offs:")
print("- Ratio = 1.0: Always use correct words (fast training, but exposure bias)")
print("- Ratio = 0.0: Always use predictions (slow training, but realistic)")
print("- Ratio = 0.5: Balanced approach (commonly used)")
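
To see which token the decoder actually receives at each step, here is a small sketch that mirrors the loop in `Seq2Seq.forward`, with the model replaced by a list of made-up guesses. `choose_decoder_inputs` is a hypothetical helper, not part of the model:

```python
import random

def choose_decoder_inputs(target, predictions, teacher_force_ratio, rng):
    """Return the token fed to the decoder at each step t = 1 .. len(target) - 1.

    target[0] is <sos>; predictions[t] stands in for the model's guess at step t.
    """
    x = target[0]
    fed = []
    for t in range(1, len(target)):
        fed.append(x)  # input used to predict target[t]
        use_teacher = rng.random() < teacher_force_ratio
        x = target[t] if use_teacher else predictions[t]
    return fed

target = ["<sos>", "a", "man", "goes", "<eos>"]
predictions = [None, "a", "dog", "runs", "<eos>"]  # made-up model guesses

rng = random.Random(0)
print(choose_decoder_inputs(target, predictions, 1.0, rng))  # always the ground truth
print(choose_decoder_inputs(target, predictions, 0.0, rng))  # always the model's own guesses
```

With a ratio of 0.0, the wrong guess `dog` is fed back in, so the decoder has to continue from its own mistake. This is the situation it faces at inference time.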

## 6. Training Setup {#training}

Now let's set up everything needed for training:

In [None]:
# Training hyperparameters
num_epochs = 5  # Reduced for demo
learning_rate = 0.001
batch_size = 32  # Reduced for demo

# Model hyperparameters
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Vocabulary sizes
if 'german' in globals() and 'english' in globals():
    input_size_encoder = len(german.vocab)
    input_size_decoder = len(english.vocab)
    output_size = len(english.vocab)
    print(f"German vocab size: {input_size_encoder}")
    print(f"English vocab size: {output_size}")
else:
    # Fallback values for demo
    input_size_encoder = 1000
    input_size_decoder = 1000
    output_size = 1000
    print("Using fallback vocabulary sizes")

# Architecture parameters
encoder_embedding_size = 256  # Reduced for demo
decoder_embedding_size = 256
hidden_size = 512  # Must be same for encoder and decoder
num_layers = 2
enc_dropout = 0.5
dec_dropout = 0.5

print(f"\nModel Architecture:")
print(f"- Embedding size: {encoder_embedding_size}")
print(f"- Hidden size: {hidden_size}")
print(f"- Number of layers: {num_layers}")
print(f"- Dropout: {enc_dropout}")

### Model Instantiation

In [None]:
# Create encoder
encoder_net = Encoder(
    input_size_encoder, 
    encoder_embedding_size, 
    hidden_size, 
    num_layers, 
    enc_dropout
).to(device)

# Create decoder
decoder_net = Decoder(
    input_size_decoder,
    decoder_embedding_size,
    hidden_size,
    output_size,
    num_layers,
    dec_dropout,
).to(device)

# Create complete model
model = Seq2Seq(encoder_net, decoder_net).to(device)

# Count parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Model created successfully!")
print(f"Total trainable parameters: {count_parameters(model):,}")
print(f"Encoder parameters: {count_parameters(encoder_net):,}")
print(f"Decoder parameters: {count_parameters(decoder_net):,}")

# Setup optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Ignore padding tokens in loss calculation
if 'english' in globals():
    pad_idx = english.vocab.stoi["<pad>"]
else:
    pad_idx = 1  # Fallback

criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

print(f"\nTraining setup:")
print(f"- Optimizer: Adam (lr={learning_rate})")
print(f"- Loss function: CrossEntropyLoss")
print(f"- Padding index ignored: {pad_idx}")
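
The effect of `ignore_index` is easier to see on a tiny example. The sketch below computes a mean cross-entropy by hand and skips padded positions. It is a simplified pure-Python illustration, not PyTorch's implementation, and the logits and targets are made up:

```python
import math

def cross_entropy_ignore(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)  # assumes at least one non-padded position

PAD = 1
logits = [[0.0, 0.0, 2.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
targets = [2, 0, PAD]  # the last position is padding
print(cross_entropy_ignore(logits, targets, ignore_index=PAD))
```

Changing the logits in the padded row leaves the result unchanged, so the model is never rewarded or penalized for what it predicts at padded positions.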

### Data Iterators

Create data loaders for efficient batch processing:

In [None]:
# Create data iterators
if 'train_data' in globals():
    train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
        (train_data, valid_data, test_data),
        batch_size=batch_size,
        sort_within_batch=True,  # Sort by source length for efficiency
        sort_key=lambda x: len(x.src),
        device=device,
    )
    
    print(f"Data iterators created!")
    print(f"Training batches: {len(train_iterator)}")
    print(f"Validation batches: {len(valid_iterator)}")
    print(f"Test batches: {len(test_iterator)}")
    
    # Show a sample batch
    sample_batch = next(iter(train_iterator))
    print(f"\nSample batch:")
    print(f"Source shape: {sample_batch.src.shape}")
    print(f"Target shape: {sample_batch.trg.shape}")
    print(f"Source (first sentence): {sample_batch.src[:, 0].tolist()}")
    print(f"Target (first sentence): {sample_batch.trg[:, 0].tolist()}")
else:
    print("Dataset not available - using dummy data for demonstration")
    train_iterator = None
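
Why does batching sentences of similar length help? Every batch is padded to its longest sentence, so mixing short and long sentences wastes computation on padding. The sketch below counts pad tokens for made-up sentence lengths. It only illustrates the idea and does not reproduce the actual batching logic of `BucketIterator`:

```python
def padding_needed(lengths, batch_size):
    """Count pad tokens when consecutive groups of batch_size are padded to their longest member."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

lengths = [3, 12, 4, 11, 5, 13, 3, 12]  # made-up source sentence lengths

unsorted_pad = padding_needed(lengths, batch_size=4)
bucketed_pad = padding_needed(sorted(lengths), batch_size=4)
print(unsorted_pad, bucketed_pad)  # 37 9
```

Grouping by length cuts the padding here from 37 tokens to 9.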

## 7. Training Loop {#loop}

The main training loop with detailed explanations:

In [None]:
# Helper functions (simplified versions)
def save_checkpoint(checkpoint, filename="checkpoint.pth.tar"):
    """Save model checkpoint"""
    torch.save(checkpoint, filename)
    print(f"Checkpoint saved to {filename}")

def translate_sentence(model, sentence, src_field, trg_field, device, max_length=50):
    """Translate a single sentence (simplified version)"""
    model.eval()
    
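    # Tokenize and lowercase, then add the same <sos>/<eos> markers the Field adds during training
    tokens = [src_field.init_token] + src_field.preprocess(sentence) + [src_field.eos_token]
    tokens = [src_field.vocab.stoi[token] for token in tokens]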
    
    # Add batch dimension and convert to tensor
    src_tensor = torch.LongTensor(tokens).unsqueeze(1).to(device)
    
    with torch.no_grad():
        # Encode
        hidden, cell = model.encoder(src_tensor)
        
        # Start with <sos> token
        outputs = [trg_field.vocab.stoi["<sos>"]]
        
        for _ in range(max_length):
            previous_word = torch.LongTensor([outputs[-1]]).to(device)
            
            output, hidden, cell = model.decoder(previous_word, hidden, cell)
            best_guess = output.argmax(1).item()
            
            outputs.append(best_guess)
            
            # Stop if we predict <eos>
            if best_guess == trg_field.vocab.stoi["<eos>"]:
                break
    
    # Convert indices back to words, dropping <sos> and, if it was generated, <eos>
    eos_idx = trg_field.vocab.stoi["<eos>"]
    translated_sentence = [trg_field.vocab.itos[idx] for idx in outputs[1:] if idx != eos_idx]
    return " ".join(translated_sentence)

print("Helper functions defined!")

### Training Loop Implementation

In [None]:
# Training loop
if train_iterator is not None:
    print("Starting training...")
    
    # Sample German sentence for translation testing
    test_sentence = "ein boot mit mehreren männern darauf wird von einem großen pferdegespann ans ufer gezogen."
    
    # Training history
    train_losses = []
    
    for epoch in range(num_epochs):
        print(f"\n[Epoch {epoch + 1}/{num_epochs}]")
        
        # Save checkpoint (weights as they are before this epoch's updates)
        checkpoint = {
            "state_dict": model.state_dict(),
            "optimizer": optimizer.state_dict()
        }
        save_checkpoint(checkpoint, f"checkpoint_epoch_{epoch+1}.pth.tar")
        
        # Test translation
        model.eval()
        try:
            translated = translate_sentence(
                model, test_sentence, german, english, device, max_length=50
            )
            print(f"Translation: {translated}")
        except Exception as e:
            print(f"Translation error: {e}")
        
        # Training
        model.train()
        epoch_loss = 0
        
        for batch_idx, batch in enumerate(train_iterator):
            # Get data
            inp_data = batch.src.to(device)  # Shape: (src_len, batch_size)
            target = batch.trg.to(device)    # Shape: (trg_len, batch_size)
            
            # Forward pass
            output = model(inp_data, target)
            # output shape: (trg_len, batch_size, vocab_size)
            
            # Reshape for loss calculation
            # Remove first timestep (SOS token) and flatten
            output = output[1:].reshape(-1, output.shape[2])
            target = target[1:].reshape(-1)
            
            # Calculate loss
            optimizer.zero_grad()
            loss = criterion(output, target)
            
            # Backward pass
            loss.backward()
            
            # Clip gradients to prevent exploding gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1)
            
            # Update parameters
            optimizer.step()
            
            epoch_loss += loss.item()
            
            # Print progress
            if batch_idx % 100 == 0:
                print(f"  Batch {batch_idx}/{len(train_iterator)}, Loss: {loss.item():.4f}")
            
            # Break early for demo
            if batch_idx >= 200:  # Stop after 201 batches (indices 0-200) for demo
                break
        
        # batch_idx + 1 is the number of batches actually processed this epoch
        avg_loss = epoch_loss / (batch_idx + 1)
        train_losses.append(avg_loss)
        print(f"  Average Loss: {avg_loss:.4f}")
    
    print("\nTraining completed!")
    
    # Plot training loss
    plt.figure(figsize=(10, 6))
    plt.plot(range(1, len(train_losses) + 1), train_losses, 'b-', linewidth=2)
    plt.title('Training Loss Over Epochs')
    plt.xlabel('Epoch')
    plt.ylabel('Average Loss')
    plt.grid(True, alpha=0.3)
    plt.show()
    
else:
    print("Training skipped - dataset not available")
    print("In a real scenario, the model would train on German-English sentence pairs")
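
Gradient clipping by norm, as done with `clip_grad_norm_` in the loop above, rescales all gradients by a single factor when their combined L2 norm exceeds `max_norm`. Here is a pure-Python sketch of that idea on made-up gradients. `clip_by_global_norm` is a hypothetical helper that illustrates the computation rather than reproducing PyTorch's code:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # The small epsilon guards against division by zero; the factor never exceeds 1
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [[g * scale for g in vec] for vec in grads], total_norm

grads = [[3.0, 4.0], [12.0]]  # made-up gradients for two parameter tensors
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)  # 13.0 and a value just below 1.0
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only the step size shrinks.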

### Training Process Explanation

Let's break down what happens during training:

In [None]:
print("Training Process Breakdown:")
print("\n1. Forward Pass:")
print("   - Encoder processes German sentence")
print("   - Decoder generates English translation word by word")
print("   - Teacher forcing used during training")

print("\n2. Loss Calculation:")
print("   - Compare predicted words with actual target words")
print("   - CrossEntropyLoss measures prediction quality")
print("   - Padding tokens are ignored")

print("\n3. Backward Pass:")
print("   - Calculate gradients using backpropagation")
print("   - Clip gradients to prevent exploding gradients")
print("   - Update model parameters using Adam optimizer")

print("\n4. Key Techniques:")
print("   - Teacher forcing: helps training stability")
print("   - Gradient clipping: prevents exploding gradients")
print("   - Dropout: prevents overfitting")
print("   - Checkpointing: saves model progress")

# Visualize the training process
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

# 1. Encoder-Decoder Flow
ax1.text(0.1, 0.8, "German:\n'Ein Mann geht'", fontsize=12, bbox=dict(boxstyle="round", facecolor='lightblue'))
ax1.arrow(0.3, 0.7, 0.2, 0, head_width=0.05, head_length=0.05, fc='black', ec='black')
ax1.text(0.6, 0.8, "Encoder\n(LSTM)", fontsize=12, bbox=dict(boxstyle="round", facecolor='lightgreen'))
ax1.arrow(0.8, 0.7, 0, -0.2, head_width=0.05, head_length=0.05, fc='black', ec='black')
ax1.text(0.6, 0.4, "Context\n(hidden, cell)", fontsize=12, bbox=dict(boxstyle="round", facecolor='lightyellow'))
ax1.arrow(0.5, 0.3, -0.2, 0, head_width=0.05, head_length=0.05, fc='black', ec='black')
ax1.text(0.1, 0.2, "Decoder\n(LSTM)", fontsize=12, bbox=dict(boxstyle="round", facecolor='lightcoral'))
ax1.arrow(0.3, 0.1, 0.2, 0, head_width=0.05, head_length=0.05, fc='black', ec='black')
ax1.text(0.6, 0.05, "English:\n'A man goes'", fontsize=12, bbox=dict(boxstyle="round", facecolor='lightpink'))
ax1.set_xlim(0, 1)
ax1.set_ylim(0, 1)
ax1.set_title("Encoder-Decoder Architecture")
ax1.axis('off')

# 2. Teacher Forcing
steps = ['<SOS>', 'A', 'man', 'goes', '<EOS>']
teacher_decisions = ['Teacher', 'Teacher', 'Predicted', 'Teacher', 'Predicted']
colors = ['green' if d == 'Teacher' else 'red' for d in teacher_decisions]
ax2.bar(range(len(steps)), [1]*len(steps), color=colors, alpha=0.7)
ax2.set_xticks(range(len(steps)))
ax2.set_xticklabels(steps)
ax2.set_title("Teacher Forcing Example")
ax2.set_ylabel("Decision Type")
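from matplotlib.patches import Patch  # Explicit legend handles so each label gets its matching color
ax2.legend(handles=[Patch(color='green', alpha=0.7, label='Teacher Forcing'),
                    Patch(color='red', alpha=0.7, label='Model Prediction')], loc='upper right')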

# 3. Loss Over Time (simulated)
epochs = list(range(1, 11))
loss_values = [2.5, 2.1, 1.8, 1.6, 1.4, 1.3, 1.2, 1.1, 1.0, 0.9]
ax3.plot(epochs, loss_values, 'b-o', linewidth=2, markersize=6)
ax3.set_xlabel('Epoch')
ax3.set_ylabel('Loss')
ax3.set_title('Training Loss Curve')
ax3.grid(True, alpha=0.3)

# 4. Model Components
components = ['Embedding', 'LSTM Layers', 'Linear Layer', 'Dropout']
encoder_params = [256*10000, 4*512*(256+512) + 4*512*(512+512), 0, 0]  # Approximate (biases omitted)
decoder_params = [256*10000, 4*512*(256+512) + 4*512*(512+512), 512*10000, 0]  # Approximate (biases omitted)

x = np.arange(len(components))
width = 0.35

ax4.bar(x - width/2, encoder_params, width, label='Encoder', alpha=0.8)
ax4.bar(x + width/2, decoder_params, width, label='Decoder', alpha=0.8)
ax4.set_xlabel('Components')
ax4.set_ylabel('Parameters (approx)')
ax4.set_title('Model Parameters by Component')
ax4.set_xticks(x)
ax4.set_xticklabels(components)
ax4.legend()
ax4.ticklabel_format(style='scientific', axis='y', scilimits=(0,0))

plt.tight_layout()
plt.show()

## 8. Evaluation {#evaluation}

Finally, let's evaluate the model performance:

In [None]:
# BLEU-style score (heavily simplified)
def calculate_bleu_score(reference, candidate):
    """Unigram precision times a length ratio; a rough stand-in for BLEU, not the real metric"""
    # Real BLEU uses clipped n-gram counts (n = 1..4) and an exponential brevity penalty
    ref_words = reference.lower().split()
    cand_words = candidate.lower().split()
    
    if len(cand_words) == 0 or len(ref_words) == 0:
        return 0.0
    
    # Calculate precision (simplified)
    matches = sum(1 for word in cand_words if word in ref_words)
    precision = matches / len(cand_words)
    
    # Length penalty (simplified)
    length_penalty = min(1.0, len(cand_words) / len(ref_words))
    
    return precision * length_penalty

# Test translation examples
test_examples = [
    {
        "german": "ein mann geht zur schule",
        "english_ref": "a man goes to school",
        "model_output": "a man goes to school"  # Simulated perfect translation
    },
    {
        "german": "die katze ist schwarz",
        "english_ref": "the cat is black",
        "model_output": "the cat is dark"  # Simulated imperfect translation
    },
    {
        "german": "ich liebe musik",
        "english_ref": "i love music",
        "model_output": "i like music"  # Simulated close translation
    }
]

print("Translation Examples and BLEU Scores:")
print("=" * 60)

total_bleu = 0
for i, example in enumerate(test_examples, 1):
    bleu_score = calculate_bleu_score(example["english_ref"], example["model_output"])
    total_bleu += bleu_score
    
    print(f"\nExample {i}:")
    print(f"German:     {example['german']}")
    print(f"Reference:  {example['english_ref']}")
    print(f"Model:      {example['model_output']}")
    print(f"BLEU Score: {bleu_score:.3f}")

avg_bleu = total_bleu / len(test_examples)
print(f"\nAverage BLEU Score: {avg_bleu:.3f}")
print(f"Average BLEU Score (%): {avg_bleu * 100:.1f}%")

# Visualize BLEU scores
plt.figure(figsize=(10, 6))
examples_labels = [f"Example {i+1}" for i in range(len(test_examples))]
bleu_scores = [calculate_bleu_score(ex["english_ref"], ex["model_output"]) for ex in test_examples]

bars = plt.bar(examples_labels, bleu_scores, color=['green', 'orange', 'blue'], alpha=0.7)
plt.axhline(y=avg_bleu, color='red', linestyle='--', label=f'Average: {avg_bleu:.3f}')
plt.ylabel('BLEU Score')
plt.title('BLEU Scores for Translation Examples')
plt.legend()
plt.ylim(0, 1)

# Add value labels on bars
for bar, score in zip(bars, bleu_scores):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
             f'{score:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()
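
The simplified score above counts every candidate word that appears anywhere in the reference, so a degenerate output such as `the the the the` would score 1.0 against `the cat is black`. Here is a sketch that is closer to BLEU, with clipped n-gram counts and the exponential brevity penalty. It assumes a single reference, uniform weights, no smoothing, and `max_n=2` instead of the usual 4, because the example sentences are very short. `sentence_bleu` here is a hypothetical helper, not a library function:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, candidate, max_n=2):
    """Sentence-level BLEU with uniform weights up to max_n, single reference, no smoothing."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: 1 if the candidate is longer than the reference, else exp(1 - r/c)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

print(sentence_bleu("the cat is black", "the cat is dark"))  # about 0.707
print(sentence_bleu("the cat is black", "the the the the"))  # 0.0
```

Without smoothing, any n-gram order with zero matches drives the whole score to 0. For that reason BLEU is normally reported over a whole corpus rather than per sentence.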

### Model Performance Analysis

In [None]:
print("Model Performance Analysis:")
print("=" * 50)

print("\n1. BLEU Score Interpretation:")
print("   - 0.0-0.1: Very poor translation")
print("   - 0.1-0.3: Poor translation")
print("   - 0.3-0.5: Reasonable translation")
print("   - 0.5-0.7: Good translation")
print("   - 0.7-1.0: Excellent translation")

print("\n2. Common Issues in Seq2Seq Models:")
print("   - Exposure bias: difference between training and inference")
print("   - Long sequence problems: information bottleneck")
print("   - Out-of-vocabulary words: unknown token handling")
print("   - Repetition: model may repeat phrases")

print("\n3. Improvements and Extensions:")
print("   - Attention mechanism: focus on relevant input parts")
print("   - Beam search: better decoding strategy")
print("   - Transformer architecture: parallel processing")
print("   - Pre-trained models: transfer learning")

print("\n4. Real-world Considerations:")
print("   - Larger datasets: millions of sentence pairs")
print("   - Longer training: far more update steps than this demo")
print("   - GPU clusters: distributed training")
print("   - Evaluation metrics: BLEU, METEOR, ROUGE")

# Create a comparison chart of different architectures
architectures = ['Basic Seq2Seq', 'Seq2Seq + Attention', 'Transformer', 'Pre-trained (mT5)']
bleu_scores = [0.15, 0.35, 0.45, 0.65]  # Illustrative placeholders, not measured results
training_time = [1, 2, 3, 0.5]  # Illustrative relative training time, not measured

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# BLEU scores comparison
bars1 = ax1.bar(architectures, bleu_scores, color=['red', 'orange', 'green', 'blue'], alpha=0.7)
ax1.set_ylabel('BLEU Score')
ax1.set_title('Translation Quality by Architecture')
ax1.set_ylim(0, 0.7)
plt.setp(ax1.get_xticklabels(), rotation=45, ha='right')

for bar, score in zip(bars1, bleu_scores):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
             f'{score:.2f}', ha='center', va='bottom')

# Training time comparison
bars2 = ax2.bar(architectures, training_time, color=['red', 'orange', 'green', 'blue'], alpha=0.7)
ax2.set_ylabel('Relative Training Time')
ax2.set_title('Training Efficiency by Architecture')
plt.setp(ax2.get_xticklabels(), rotation=45, ha='right')

for bar, time in zip(bars2, training_time):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05, 
             f'{time}x', ha='center', va='bottom')

plt.tight_layout()
plt.show()

## Summary

This notebook covered a complete implementation of a Seq2Seq model for neural machine translation:

### Key Components:
1. **Encoder**: Processes input sequence and creates context representation
2. **Decoder**: Generates output sequence word by word using context
3. **Teacher Forcing**: Training technique for stability
4. **LSTM Networks**: Handle sequential data and maintain memory

### Training Process:
1. **Data Preprocessing**: Tokenization, vocabulary building, batching
2. **Forward Pass**: Encoder → Context → Decoder → Predictions
3. **Loss Calculation**: CrossEntropyLoss with padding ignored
4. **Optimization**: Adam optimizer with gradient clipping

### Evaluation:
1. **BLEU Score**: Measures translation quality
2. **Qualitative Analysis**: Manual inspection of translations
3. **Performance Comparison**: Different architectures

### Next Steps:
- Implement attention mechanism for better performance
- Try beam search for better decoding
- Experiment with Transformer architecture
- Use pre-trained models for transfer learning

The basic Seq2Seq model provides a solid foundation for understanding neural machine translation, though modern approaches like Transformers have largely superseded this architecture in practice.