![image.png](https://i.imgur.com/a3uAqnb.png)

# Arabic-English Seq2Seq Translation Model

![Neural Machine Translation](https://miro.medium.com/v2/resize:fit:1400/1*sO-SP58T4brE9EHazHSeGA.png)

## **⚠️ Important Note on Training Approach**

**This notebook demonstrates Seq2Seq architecture for Arabic-English translation, but uses a simplified training approach for educational purposes:**

- ✅ **Purpose**: Showcase Seq2Seq implementation and Arabic NLP challenges
- ⚠️ **Limitation**: Only uses training set (no validation/test split)
- 🎯 **Result**: Model will overfit, but this demonstrates the complexity of Arabic translation
- 📚 **Learning Goal**: Understand architecture, not production-ready model

**In production, you would need proper train/validation/test splits and regularization techniques!**


## **📌 Sequence-to-Sequence (Seq2Seq) Model**

A **Seq2Seq model** consists of two main components:
1. **Encoder**: Processes the input sequence (Arabic) into a fixed-size context vector (here, the final LSTM hidden and cell states)
2. **Decoder**: Generates the output sequence (English) from that context vector

### **🔹 Key Features**
- **Variable-length inputs/outputs**: Can handle sequences of different lengths
- **Attention mechanism**: (Not implemented here, but commonly used)
- **Teacher forcing**: Uses ground truth during training for faster convergence

### **🔹 Architecture Overview**
```
Arabic Text → Tokenize → Encoder LSTM → Context Vector → Decoder LSTM → English Text
```

## 1️⃣ Dataset Loading and Initial Exploration


In [None]:
from datasets import load_dataset
from collections import Counter
import re

# Load the Arabic-English translation dataset
dataset = load_dataset("Abdulmohsena/Classic-Arabic-English-Language-Pairs")
print(f"Dataset loaded: {len(dataset['books'])} samples")

**📝 What's happening here:**
- Loading a parallel Arabic-English corpus
- This dataset contains classical Arabic texts with English translations
- Arabic is a morphologically rich language with complex grammar rules
- Half of us don't know Arabic grammar ourselves

## 2️⃣ Vocabulary Analysis and Tokenization


In [None]:
def simple_tokenize(text):
    """
    Simple tokenization function that:
    1. Removes punctuation
    2. Splits on whitespace
    Note: For Arabic, proper tokenization is more complex!
    """
    cleaned = re.sub(r'[^\w\s]', ' ', text)
    return cleaned.split()

# Build vocabularies
arabic_words = []
english_words = []
for text in dataset['books']['ar']:
    arabic_words.extend(simple_tokenize(text))
for text in dataset['books']['en']:
    english_words.extend(simple_tokenize(text.lower()))

arabic_vocab = Counter(arabic_words)
english_vocab = Counter(english_words)
print(f"Arabic vocab: {len(arabic_vocab):,} unique words")
print(f"English vocab: {len(english_vocab):,} unique words")

**📝 Key Observations:**
- Arabic has ~4x more unique words than English due to morphological complexity
- Arabic words have roots with various prefixes/suffixes
- This simple tokenization doesn't handle Arabic morphology properly
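To make the effect of prefixes on vocabulary size concrete, here is a minimal sketch that tokenizes two nouns in four surface forms each, next to word-for-word English glosses. The sample strings are hand-written for illustration and do not come from the dataset.

```python
import re
from collections import Counter

def simple_tokenize(text):
    # Same idea as the notebook's tokenizer: punctuation -> space, split on whitespace
    return re.sub(r'[^\w\s]', ' ', text).split()

# Two nouns (kitab "book", qalam "pen"), each in four surface forms:
# bare, with "the" (al-), with "and" (wa-), with "with the" (bil-)
arabic = "كتاب الكتاب وكتاب بالكتاب قلم القلم وقلم بالقلم"
# Word-for-word glosses of the same eight forms
english = "book the book and book with the book pen the pen and pen with the pen"

ar_types = Counter(simple_tokenize(arabic))
en_types = Counter(simple_tokenize(english))

print(f"Arabic types:  {len(ar_types)}")
print(f"English types: {len(en_types)}")
```

In this toy case every attached prefix creates a new Arabic vocabulary entry, while English reuses the same few function words across nouns.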

**🔹 Arabic NLP Challenges:**
1. **Right-to-left script**
2. **No capitalization**
3. **Rich morphology** (one root → many word forms)
4. **Diacritics** (optional vowel markings)
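Diacritics also interact with the tokenizer: in Python's `re` module, `\w` does not match combining marks, so a punctuation regex like the one in `simple_tokenize` turns each vowel mark into a space and splits a vocalized word apart. The sketch below strips the marks first. `strip_diacritics` is a hypothetical helper, and the range U+064B to U+0652 is an assumption that may not cover every mark in this corpus.

```python
import re

# Arabic short-vowel and related marks (fathatan .. sukun). This range is an
# assumption; extend it if the corpus uses other marks.
DIACRITICS = re.compile(r'[\u064B-\u0652]')

def strip_diacritics(text):
    # Hypothetical helper: delete the marks, keep the base letters
    return DIACRITICS.sub('', text)

def simple_tokenize(text):
    # Same regex as the notebook's tokenizer
    return re.sub(r'[^\w\s]', ' ', text).split()

# "kataba" (he wrote), written with a fatha on each of its three letters
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"

print(simple_tokenize(vocalized))                    # letters split apart
print(simple_tokenize(strip_diacritics(vocalized)))  # a single token
```

Whether to strip diacritics is a modeling choice: it shrinks the vocabulary but discards information that can disambiguate words.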

## 3️⃣ Vocabulary Reduction and Mapping


In [None]:
import pickle

ARABIC_VOCAB_SIZE = 8000
ENGLISH_VOCAB_SIZE = 5000
SPECIAL_TOKENS = ['<PAD>', '<SOS>', '<EOS>', '<UNK>']

def create_vocab_mapping(vocab_dict, special_tokens):
    """
    Creates bidirectional mappings between words and indices.
    
    Special tokens:
    - <PAD>: Padding for variable-length sequences
    - <SOS>: Start of sequence marker
    - <EOS>: End of sequence marker  
    - <UNK>: Unknown words (out of vocabulary)
    """
    word2idx = {}
    idx2word = {}
    
    # Add special tokens first (indices 0-3)
    for i, token in enumerate(special_tokens):
        word2idx[token] = i
        idx2word[i] = token
    
    # Add most frequent words
    for i, word in enumerate(vocab_dict.keys(), len(special_tokens)):
        word2idx[word] = i
        idx2word[i] = word
    
    return word2idx, idx2word

# Create reduced vocabularies (keep only most frequent words)
arabic_vocab_reduced = dict(arabic_vocab.most_common(ARABIC_VOCAB_SIZE))
english_vocab_reduced = dict(english_vocab.most_common(ENGLISH_VOCAB_SIZE))

ar_word2idx, ar_idx2word = create_vocab_mapping(arabic_vocab_reduced, SPECIAL_TOKENS)
en_word2idx, en_idx2word = create_vocab_mapping(english_vocab_reduced, SPECIAL_TOKENS)

print(f"Reduced vocabularies: Arabic={len(ar_word2idx)}, English={len(en_word2idx)}")

**📝 Why vocabulary reduction?**
- **Memory efficiency**: Smaller embedding tables
- **Training speed**: Fewer parameters to update
- **Generalization**: Focus on most important words
- **Trade-off**: Some words become `<UNK>` tokens
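One way to quantify the trade-off is to measure how many running tokens a given vocabulary size still covers. The sketch below uses a made-up frequency table; `coverage` is a hypothetical helper, and with the notebook's data you could pass `arabic_vocab` or `english_vocab` instead.

```python
from collections import Counter

def coverage(counter, vocab_size):
    """Fraction of running tokens covered by the vocab_size most frequent words."""
    total = sum(counter.values())
    kept = sum(count for _, count in counter.most_common(vocab_size))
    return kept / total if total else 0.0

# Toy frequency table standing in for arabic_vocab / english_vocab
toy_vocab = Counter({"the": 50, "of": 30, "book": 10, "pen": 5, "ink": 3, "quill": 2})

for size in (1, 3, 6):
    cov = coverage(toy_vocab, size)
    print(f"top {size}: {cov:.0%} covered, {1 - cov:.0%} mapped to <UNK>")
```

Because word frequencies are heavily skewed, a small vocabulary usually covers most tokens, but the long tail that becomes `<UNK>` tends to be larger for a morphologically rich language.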

## 4️⃣ Data Preprocessing and Sequence Conversion


In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

def clean_and_tokenize(text):
    """
    Clean and tokenize text:
    1. Remove bracketed annotations [...]
    2. Remove punctuation
    3. Convert to lowercase
    4. Split into tokens
    """
    text = re.sub(r'\[.*?\]', '', text)  # Remove annotations
    text = re.sub(r'[^\w\s]', ' ', text)  # Remove punctuation
    text = text.strip().lower() if text.strip() else ''
    return text.split() if text else []

def text_to_sequence(text, word2idx, max_length=None):
    """
    Convert text to sequence of token indices:
    1. Add <SOS> token at start
    2. Convert words to indices (use <UNK> for unknown words)
    3. Add <EOS> token at end
    4. Truncate if exceeds max_length
    """
    tokens = clean_and_tokenize(text)
    sequence = [word2idx.get('<SOS>', 1)]  # Start token
    
    for token in tokens:
        idx = word2idx.get(token, word2idx.get('<UNK>', 3))
        sequence.append(idx)
    
    sequence.append(word2idx.get('<EOS>', 2))  # End token
    
    # Truncate if too long
    if max_length and len(sequence) > max_length:
        sequence = sequence[:max_length-1] + [word2idx.get('<EOS>', 2)]
    
    return sequence

**📝 Sequence format:**
```
Original: "مرحبا بكم"
Tokenized: ["مرحبا", "بكم"]
Sequence: [<SOS>, مرحبا_idx, بكم_idx, <EOS>]
Indices: [1, 145, 892, 2]
```
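The same format can be checked end to end with a toy vocabulary. This is a simplified sketch: `build_mapping`, `encode` and `decode` are hypothetical stand-ins for the notebook's functions, and the two-word vocabulary is made up, so the indices differ from the ones shown above.

```python
SPECIAL_TOKENS = ['<PAD>', '<SOS>', '<EOS>', '<UNK>']

def build_mapping(words):
    # Special tokens take indices 0-3, real words follow
    vocab = SPECIAL_TOKENS + list(words)
    word2idx = {w: i for i, w in enumerate(vocab)}
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

def encode(tokens, word2idx, max_length=None):
    unk = word2idx['<UNK>']
    ids = [word2idx['<SOS>']] + [word2idx.get(t, unk) for t in tokens] + [word2idx['<EOS>']]
    # Truncate but keep <EOS> as the final id
    if max_length and len(ids) > max_length:
        ids = ids[:max_length - 1] + [word2idx['<EOS>']]
    return ids

def decode(ids, idx2word):
    return [idx2word[i] for i in ids]

word2idx, idx2word = build_mapping(["مرحبا", "بكم"])
tokens = ["مرحبا", "بكم", "جميعا"]  # the last word is out of vocabulary

ids = encode(tokens, word2idx)
short_ids = encode(tokens, word2idx, max_length=3)
print(ids)
print(decode(ids, idx2word))
print(short_ids)
```

Note that decoding is lossy: the out-of-vocabulary word comes back as `<UNK>`, and the truncated sequence drops tokens entirely.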

## 5️⃣ Dataset Creation and DataLoader


In [None]:
SUBSET_SIZE = 5000  # 🚨 Using subset for quick training (NOT recommended for production!)

subset_data = {
    'ar': dataset['books']['ar'][:SUBSET_SIZE],
    'en': dataset['books']['en'][:SUBSET_SIZE]
}

class TranslationDataset(Dataset):
    """
    PyTorch Dataset for Arabic-English translation pairs.
    Filters out sequences that are too short/long for stable training.
    """
    def __init__(self, data, ar_word2idx, en_word2idx, max_ar_len=60, max_en_len=100):
        self.samples = []
        
        for i in range(len(data['ar'])):
            ar_seq = text_to_sequence(data['ar'][i], ar_word2idx, max_ar_len)
            en_seq = text_to_sequence(data['en'][i], en_word2idx, max_en_len)
            
            # Filter: keep sequences with reasonable length
            if 3 <= len(ar_seq) <= max_ar_len and 3 <= len(en_seq) <= max_en_len:
                self.samples.append((ar_seq, en_seq))
        
        print(f"Dataset: {len(data['ar'])} → {len(self.samples)} samples")
    
    def __len__(self):
        return len(self.samples)
    
    def __getitem__(self, idx):
        ar_seq, en_seq = self.samples[idx]
        return {
            'arabic': torch.tensor(ar_seq, dtype=torch.long),
            'english': torch.tensor(en_seq, dtype=torch.long)
        }

def collate_fn(batch):
    """
    Custom collate function to handle variable-length sequences.
    Pads sequences to the same length within each batch.
    """
    arabic_seqs = [item['arabic'] for item in batch]
    english_seqs = [item['english'] for item in batch]
    
    # Pad sequences (padding_value=0 corresponds to <PAD> token)
    arabic_padded = pad_sequence(arabic_seqs, batch_first=True, padding_value=0)
    english_padded = pad_sequence(english_seqs, batch_first=True, padding_value=0)
    
    return {'arabic': arabic_padded, 'english': english_padded}

train_dataset = TranslationDataset(subset_data, ar_word2idx, en_word2idx)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)

**🚨 Training Data Issue:**
- We're using the same data for training that we'll test on
- This will lead to **severe overfitting**
- Model will memorize rather than generalize
- **For demonstration purposes only!**
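For comparison, a held-out split could look roughly like the sketch below. `split_pairs` is a hypothetical helper with assumed 80/10/10 proportions, and the placeholder pairs stand in for the real sentence pairs.

```python
import random

def split_pairs(pairs, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle once with a fixed seed, then cut into train/val/test."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_val = int(len(pairs) * val_frac)
    n_test = int(len(pairs) * test_frac)
    val = pairs[:n_val]
    test = pairs[n_val:n_val + n_test]
    train = pairs[n_val + n_test:]
    return train, val, test

# Placeholder pairs; with the notebook's data this could be
# list(zip(subset_data['ar'], subset_data['en']))
pairs = [(f"ar_{i}", f"en_{i}") for i in range(100)]

train, val, test = split_pairs(pairs)
print(len(train), len(val), len(test))
```

If you adopt a split like this, the vocabularies should be built from the training portion only, otherwise the held-out sets leak into the model's word list.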

## 6️⃣ Seq2Seq Model Architecture


In [None]:
import torch.nn as nn

class Encoder(nn.Module):
    """
    LSTM-based encoder that processes Arabic input sequence.
    
    Flow: Arabic tokens → Embeddings → LSTM → Hidden states
    Returns final hidden and cell states as context for decoder.
    """
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=2):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=0)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, 
                           batch_first=True, dropout=0.3)
        
    def forward(self, x):
        # x shape: (batch_size, seq_len)
        embedded = self.embedding(x)  # (batch_size, seq_len, embed_size)
        outputs, (hidden, cell) = self.lstm(embedded)
        # hidden, cell: (num_layers, batch_size, hidden_size)
        return hidden, cell

class Decoder(nn.Module):
    """
    LSTM-based decoder that generates English output sequence.
    
    Uses teacher forcing during training:
    - Input: previous ground truth token
    - Output: prediction for next token
    """
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=2):
        super(Decoder, self).__init__()
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=0)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, 
                           batch_first=True, dropout=0.3)
        self.fc = nn.Linear(hidden_size, vocab_size)  # Project to vocab size
        
    def forward(self, x, hidden, cell):
        # x shape: (batch_size, 1) - single token input
        embedded = self.embedding(x)  # (batch_size, 1, embed_size)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        # output: (batch_size, 1, hidden_size)
        prediction = self.fc(output)  # (batch_size, 1, vocab_size)
        return prediction, hidden, cell

class Seq2Seq(nn.Module):
    """
    Complete Sequence-to-Sequence model combining encoder and decoder.
    
    Training process:
    1. Encode Arabic sequence to context vector
    2. Decode context to English sequence using teacher forcing
    """
    def __init__(self, encoder, decoder, device):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        """
        Forward pass with teacher forcing.
        
        Args:
            src: Arabic input sequence (batch_size, src_len)
            trg: English target sequence (batch_size, trg_len)
            teacher_forcing_ratio: Probability of using ground truth vs model prediction
        """
        batch_size = src.shape[0]
        trg_len = trg.shape[1]
        trg_vocab_size = self.decoder.vocab_size
        
        # Store all predictions
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)
        
        # Encode source sequence
        hidden, cell = self.encoder(src)
        
        # First decoder input is <SOS> token
        input_token = trg[:, 0].unsqueeze(1)  # (batch_size, 1)
        
        # Generate sequence token by token
        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(input_token, hidden, cell)
            outputs[:, t] = output.squeeze(1)
            
            # Teacher forcing: use ground truth or model prediction
            use_teacher_forcing = torch.rand(1).item() < teacher_forcing_ratio
            top1 = output.argmax(2)  # Get predicted token
            input_token = trg[:, t].unsqueeze(1) if use_teacher_forcing else top1
            
        return outputs

# Create model components
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
encoder = Encoder(len(ar_word2idx), 256, 512, 2)
decoder = Decoder(len(en_word2idx), 256, 512, 2)
model = Seq2Seq(encoder, decoder, device).to(device)
print(f"Model created on {device}")

**📝 Architecture Details:**
- **Embedding dimension**: 256 (dense vector representation)
- **Hidden dimension**: 512 (LSTM internal state size)
- **Layers**: 2 (stacked LSTM layers)
- **Dropout**: 0.3, applied between the stacked LSTM layers (regularization during training)
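These settings translate into a parameter count that can be worked out by hand. The sketch below assumes the PyTorch `nn.LSTM` layout (four gates, two bias vectors per layer) and assumes both vocabularies reach their caps, giving 8004 and 5004 entries including the special tokens; the actual sizes depend on the corpus.

```python
def lstm_params(input_size, hidden_size, num_layers):
    total = 0
    for layer in range(num_layers):
        in_size = input_size if layer == 0 else hidden_size
        # 4 gates, each with input weights, recurrent weights and two bias vectors
        total += 4 * hidden_size * (in_size + hidden_size + 2)
    return total

EMBED, HIDDEN, LAYERS = 256, 512, 2
AR_VOCAB = 8000 + 4  # assumption: at least 8000 distinct Arabic words
EN_VOCAB = 5000 + 4  # assumption: at least 5000 distinct English words

encoder_params = AR_VOCAB * EMBED + lstm_params(EMBED, HIDDEN, LAYERS)
decoder_params = (EN_VOCAB * EMBED + lstm_params(EMBED, HIDDEN, LAYERS)
                  + HIDDEN * EN_VOCAB + EN_VOCAB)  # final linear layer: weights + bias

print(f"encoder: {encoder_params:,}")
print(f"decoder: {decoder_params:,}")
print(f"total:   {encoder_params + decoder_params:,}")
```

Under these assumptions the embeddings and the output projection account for a large share of the weights, which is one reason the vocabulary caps matter.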

**🔹 Teacher Forcing:**
- During training: Feed the ground-truth previous token with probability `teacher_forcing_ratio` (0.5 here); otherwise feed the model's own prediction
- During inference: Use model's own predictions
- Helps with training stability and speed
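The per-step choice can be illustrated without a model. In this sketch the "predictions" are made-up strings and `next_decoder_input` and `decoder_inputs` are hypothetical helpers; only the coin flip mirrors the training code.

```python
import random

def next_decoder_input(ground_truth, prediction, teacher_forcing_ratio, rng):
    """Pick the token fed to the decoder at the next step."""
    use_teacher_forcing = rng.random() < teacher_forcing_ratio
    return ground_truth if use_teacher_forcing else prediction

def decoder_inputs(target, predicted, ratio, rng):
    inputs = [target[0]]  # decoding always starts from <SOS>
    for t in range(1, len(target) - 1):
        inputs.append(next_decoder_input(target[t], predicted[t], ratio, rng))
    return inputs

target = ["<SOS>", "peace", "be", "upon", "you", "<EOS>"]
# Made-up model outputs per step (position 0 is never used)
predicted = ["?", "peace", "is", "on", "you", "<EOS>"]

for ratio in (1.0, 0.5, 0.0):
    print(ratio, decoder_inputs(target, predicted, ratio, random.Random(0)))
```

A ratio of 1.0 always feeds the reference tokens and 0.0 always feeds the model's own guesses, which is the situation the model faces at inference time.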

## 7️⃣ Training Setup and Loss Function


In [None]:
import torch.optim as optim
from torch.nn import CrossEntropyLoss
import time

# Loss function ignores padding tokens (index 0)
criterion = CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(model.parameters(), lr=0.001)

def train_epoch(model, train_loader, criterion, optimizer, device):
    """
    Train model for one epoch.
    
    Process:
    1. Forward pass: Arabic → English prediction
    2. Compute loss: Compare prediction with ground truth
    3. Backward pass: Update model parameters
    """
    model.train()
    total_loss = 0
    
    for batch_idx, batch in enumerate(train_loader):
        src = batch['arabic'].to(device)     # Arabic input
        trg = batch['english'].to(device)    # English target
        
        optimizer.zero_grad()
        
        # Forward pass with teacher forcing
        output = model(src, trg, teacher_forcing_ratio=0.5)
        
        # Drop position 0 (<SOS>) and flatten batch and time for the loss:
        # output: (batch_size, trg_len, vocab_size) -> (batch_size * (trg_len-1), vocab_size)
        # trg:    (batch_size, trg_len)             -> (batch_size * (trg_len-1),)
        output = output[:, 1:].reshape(-1, output.shape[-1])
        trg = trg[:, 1:].reshape(-1)
        
        # Compute cross-entropy loss
        loss = criterion(output, trg)
        loss.backward()
        
        # Gradient clipping to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        
        optimizer.step()
        total_loss += loss.item()
    
    return total_loss / len(train_loader)


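**📝 Loss Computation:**
- **CrossEntropyLoss**: Negative log-likelihood of the correct next token (lower is better)
- **ignore_index=0**: Excludes `<PAD>` positions from the loss
- **Gradient clipping**: Rescales gradients whose total norm exceeds 1.0, which guards against exploding gradients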
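To show what ignoring the padding index means numerically, here is a pure-Python sketch of a mean cross-entropy that skips padded positions. It is an illustration of the idea, not the PyTorch implementation; `cross_entropy_ignore_pad` and the logits are made up.

```python
import math

def cross_entropy_ignore_pad(logits, targets, ignore_index=0):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing
        # Log of the softmax denominator, with max subtraction for stability
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
        count += 1
    # This sketch returns 0.0 when every position is padding
    return total / count if count else 0.0

logits = [
    [0.0, 2.0, 0.0],  # target 1, fairly confident and correct
    [0.0, 0.0, 0.0],  # target 2, uniform guess
    [5.0, 0.0, 0.0],  # padding position (target 0), skipped
]
targets = [1, 2, 0]

loss = cross_entropy_ignore_pad(logits, targets)
print(f"loss = {loss:.4f}")
```

Without the mask, long runs of padding in a batch would dominate the average and reward the model for predicting `<PAD>`.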

## 8️⃣ Training Loop


In [None]:
# Cell 8: Training Loop
NUM_EPOCHS = 50

print("🚨 TRAINING NOTE:")
print("This model will OVERFIT because we're using only training data!")
print("Purpose: Demonstrate Seq2Seq challenges with Arabic translation")
print("-" * 60)

In [None]:

for epoch in range(1, NUM_EPOCHS + 1):
    start_time = time.time()
    epoch_loss = train_epoch(model, train_loader, criterion, optimizer, device)
    
    print(f"Epoch {epoch}: Loss = {epoch_loss:.4f}, Time = {time.time() - start_time:.1f}s")
    
    # Stop once the training loss drops below a fixed threshold
    # (this is not validation-based early stopping)
    if epoch_loss < 2.5:
        print("Target loss reached!")
        break

## 9️⃣ Translation Function


In [None]:
# Cell 9: Translation Function
def translate_sentence(model, sentence, ar_word2idx, en_idx2word, device, max_length=100):
    """
    Translate a single Arabic sentence to English.
    
    Process:
    1. Convert Arabic text to token indices
    2. Encode with trained encoder
    3. Decode step-by-step (greedy decoding)
    4. Convert indices back to English words
    """
    model.eval()
    
    with torch.no_grad():
        # Prepare source sequence
        src_seq = text_to_sequence(sentence, ar_word2idx, max_length=60)
        src_tensor = torch.tensor(src_seq, dtype=torch.long).unsqueeze(0).to(device)
        
        # Encode source
        hidden, cell = model.encoder(src_tensor)
        
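        # Look up special-token ids from en_idx2word rather than relying on a
        # global en_word2idx; fall back to the ids assigned in SPECIAL_TOKENS
        en_lookup = {word: idx for idx, word in en_idx2word.items()}
        sos_idx = en_lookup.get('<SOS>', 1)
        eos_idx = en_lookup.get('<EOS>', 2)
        
        # Start decoding with <SOS> token
        trg_indexes = [sos_idx]
        
        # Generate tokens one by one
        for _ in range(max_length):
            trg_tensor = torch.tensor([trg_indexes[-1]], dtype=torch.long).unsqueeze(0).to(device)
            output, hidden, cell = model.decoder(trg_tensor, hidden, cell)
            
            # Get most likely next token
            pred_token = output.argmax(2).item()
            trg_indexes.append(pred_token)
            
            # Stop if <EOS> token is generated
            if pred_token == eos_idx:
                break
        
        # Drop <SOS>, and drop <EOS> only if decoding actually produced it
        # (hitting max_length without <EOS> would otherwise lose the last word)
        if trg_indexes[-1] == eos_idx:
            trg_indexes = trg_indexes[:-1]
        trg_tokens = [en_idx2word.get(i, '<UNK>') for i in trg_indexes[1:]]
        return ' '.join(trg_tokens)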

**📝 Inference Process:**
- **Greedy decoding**: Always pick most likely token
- **Alternative**: Beam search for better results
- **No teacher forcing**: Use model's own predictions
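The difference between greedy decoding and beam search shows up even on a tiny hand-written distribution. In the sketch below, `beam_search`, `toy_step` and the probability table are all hypothetical; a real version would call the decoder in place of `toy_step` and would usually add length normalization.

```python
import math

def beam_search(step_fn, sos, eos, beam_width=3, max_len=10):
    """step_fn(prefix) returns a dict mapping next token -> probability."""
    beams = [(0.0, [sos])]  # (log-probability, token sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for token, prob in step_fn(seq).items():
                if prob > 0:
                    candidates.append((score + math.log(prob), seq + [token]))
        if not candidates:
            break
        # Keep only the best beam_width expansions
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # include hypotheses cut off at max_len
    return max(finished, key=lambda c: c[0])[1]

# Made-up next-token probabilities; any other prefix ends the sentence
TOY_MODEL = {
    ("<SOS>",): {"a": 0.6, "b": 0.4},
    ("<SOS>", "a"): {"x": 0.5, "y": 0.5},
    ("<SOS>", "b"): {"z": 0.9, "w": 0.1},
}

def toy_step(prefix):
    return TOY_MODEL.get(tuple(prefix), {"<EOS>": 1.0})

greedy = beam_search(toy_step, "<SOS>", "<EOS>", beam_width=1)  # beam of 1 is greedy
beam = beam_search(toy_step, "<SOS>", "<EOS>", beam_width=2)
print("greedy:", greedy)
print("beam:  ", beam)
```

Greedy commits to the locally best first token and ends with total probability 0.30, while a beam of two keeps the second-best start alive and finds a sequence with probability 0.36.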


## 🔟 Testing Translation


In [None]:
# Test translation
test_arabic = subset_data['ar'][0]
translation = translate_sentence(model, test_arabic, ar_word2idx, en_idx2word, device)
print(f"Arabic: {test_arabic}")
print(f"Translation: {translation}")

**Expected Result:**
- Translation quality will be poor due to:
  1. **Overfitting** on small training set
  2. **Simple tokenization** not handling Arabic morphology
  3. **No attention mechanism** losing context in long sequences

## 1️⃣1️⃣ BLEU Evaluation


In [None]:
from collections import Counter
import math

def compute_bleu_score(reference, candidate, max_n=4):
    """
    Compute an unsmoothed, sentence-level BLEU score for one translation.
    
    BLEU measures n-gram overlap between reference and candidate translations.
    Scores range from 0 to 1 (higher is better); without smoothing, the score
    is 0 whenever any n-gram order up to max_n has no match.
    """
    def get_ngrams(tokens, n):
        if len(tokens) < n:
            return []
        return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
    
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    
    if len(cand_tokens) == 0:
        return 0.0
    
    # Compute precision for each n-gram level
    precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(get_ngrams(ref_tokens, n))
        cand_ngrams = Counter(get_ngrams(cand_tokens, n))
        
        if len(cand_ngrams) == 0:
            precisions.append(0.0)
            continue
            
        # Count matches
        matches = sum(min(count, ref_ngrams.get(ngram, 0)) 
                     for ngram, count in cand_ngrams.items())
        precision = matches / len(get_ngrams(cand_tokens, n))
        precisions.append(precision)
    
    # Brevity penalty
    ref_len = len(ref_tokens)
    cand_len = len(cand_tokens)
    # cand_len > 0 is guaranteed by the empty-candidate check above
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    
    # Final BLEU score
    if all(p > 0 for p in precisions):
        bleu = bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    else:
        bleu = 0.0
    
    return bleu

def evaluate_bleu_by_length(model, subset_data, ar_word2idx, en_idx2word, device, num_samples=150):
    """
    Evaluate model performance on different sentence lengths.
    Arabic translation difficulty varies significantly with sentence length.
    """
    # Collect samples with their lengths
    samples_with_length = []
    for i in range(min(num_samples, len(subset_data['ar']))):
        arabic_text = subset_data['ar'][i]
        english_text = subset_data['en'][i]
        text_length = len(english_text.split())
        samples_with_length.append((i, arabic_text, english_text, text_length))
    
    # Sort by length and categorize
    samples_with_length.sort(key=lambda x: x[3])
    lengths = [x[3] for x in samples_with_length]
    short_threshold = lengths[len(lengths)//3]
    long_threshold = lengths[2*len(lengths)//3]
    
    categories = {'short': [], 'medium': [], 'long': []}
    
    for sample in samples_with_length:
        idx, arabic, english, length = sample
        if length <= short_threshold:
            categories['short'].append(sample)
        elif length >= long_threshold:
            categories['long'].append(sample)
        else:
            categories['medium'].append(sample)
    
    # Evaluate each category
    results = {}
    for category_name, samples in categories.items():
        bleu_scores = []
        for idx, arabic_text, reference_english, length in samples:
            model_translation = translate_sentence(model, arabic_text, ar_word2idx, en_idx2word, device)
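            # Normalize the reference the same way as the training targets
            # (lowercase, no punctuation) so n-grams are compared like for like
            reference_clean = ' '.join(clean_and_tokenize(reference_english))
            bleu = compute_bleu_score(reference_clean, model_translation)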
            bleu_scores.append(bleu)
        
        if bleu_scores:
            avg_bleu = sum(bleu_scores) / len(bleu_scores)
            good_count = sum(1 for s in bleu_scores if s > 0.1)
            results[category_name] = {
                'avg_bleu': avg_bleu,
                'good_count': good_count,
                'total': len(bleu_scores)
            }
    
    # Print results
    for category in ['short', 'medium', 'long']:
        if category in results:
            r = results[category]
            print(f"{category.capitalize()}: BLEU = {r['avg_bleu']:.4f}, Good Rate = {r['good_count']/r['total']*100:.1f}%")
    
    return results

**📝 BLEU Score Interpretation:**
- **0.0-0.1**: Poor translation
- **0.1-0.3**: Understandable but low quality
- **0.3-0.5**: Good translation
- **0.5+**: Excellent translation
    - In my humble opinion, this score is useless :)
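Two ingredients drive these numbers: clipped n-gram precision and the brevity penalty. The sketch below works them out for the classic degenerate candidate that repeats one word; `clipped_precision` and `brevity_penalty` are hypothetical helpers written for this illustration.

```python
import math
from collections import Counter

def clipped_precision(reference, candidate, n=1):
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    total = sum(cand.values())
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    matches = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matches / total if total else 0.0

def brevity_penalty(ref_len, cand_len):
    if cand_len == 0:
        return 0.0
    return 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)

reference = "the cat is on the mat".split()
candidate = "the the the the".split()

p1 = clipped_precision(reference, candidate, n=1)
p2 = clipped_precision(reference, candidate, n=2)
bp = brevity_penalty(len(reference), len(candidate))
print(f"unigram precision = {p1:.2f}")
print(f"bigram precision  = {p2:.2f}")
print(f"brevity penalty   = {bp:.4f}")
```

Clipping caps the credit for "the" at its two occurrences in the reference, the bigram precision is zero, and an unsmoothed BLEU is therefore zero for this pair no matter how the unigrams score.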

## 1️⃣2️⃣ Final Evaluation


In [None]:
# Run BLEU evaluation
print("📊 BLEU Score Evaluation (Remember: This is overfitted!)")
print("=" * 50)
bleu_results = evaluate_bleu_by_length(model, subset_data, ar_word2idx, en_idx2word, device)

**Expected Results:**
- Scores will be artificially high due to overfitting
- Real-world performance would be much lower
- Demonstrates the challenge of Arabic NLP (without attention)

## **🎯 Key Takeaways and Improvements**

### **❌ Current Issues:**
1. **No train/validation/test split** → Severe overfitting
2. **Simple tokenization** → Poor Arabic morphology handling
3. **No attention mechanism** → Context loss in long sequences
4. **Small dataset** → Limited generalization
5. **No beam search** → Suboptimal decoding

### **✅ Production Improvements:**
1. **Proper data splitting** (80/10/10 train/val/test)
2. **Arabic-specific tokenization** (Farasa, CAMeL Tools)
3. **Attention mechanism** or **Transformer architecture**
4. **Larger dataset** with domain diversity
5. **Advanced decoding** (beam search, length penalties)
6. **Regularization techniques** (dropout, weight decay)

### **🔹 Arabic-Specific Challenges:**
- **Rich morphology**: One root → hundreds of word forms
    - فعل: فعول، فاعل، مفعول، فاعلون، فاعلين، تفعل، يفعلن، تفعلن (etc.)
- **Agglutination**: Prefixes and suffixes change meaning
- **Diacritics**: Optional vowel marks affect pronunciation
- **Right-to-left script**: Processing order considerations

Contributed by: Ali Habibullah
> I am not proud of this work  
