# ü§ñ Ultra-Minimized Urdu Conversational Chatbot

## üìö Assignment Requirements Implementation:

### 1. Data Preprocessing ‚úÖ
- ‚úÖ **ENHANCED Normalize Urdu text**:
  - üîß Remove ALL diacritics (20+ marks: Ÿã Ÿå Ÿç Ÿé Ÿè Ÿê Ÿë Ÿí Ÿ∞ + more)
  - üîß Standardize ALL Alef forms (ÿ¢ ÿ£ ÿ• Ÿ± ‚Üí ÿß)
  - üîß Standardize ALL Yeh forms (€í Ÿä Ÿâ ÿ¶ ‚Üí €å)
  - üîß Teh Marbuta normalization (ÿ© ‚Üí ÿ™)
  - üîß Arabic-Urdu number conversion (Ÿ†-Ÿ© ‚Üí €∞-€π)
- ‚úÖ **Tokenize sentences**: SentencePiece tokenizer with 8K vocabulary
- ‚úÖ **Dataset split**: Train 80%, Validation 10%, Test 10%

### 2. Model Architecture ‚úÖ  
- ‚úÖ **Transformer Encoder-Decoder**: Built from scratch using PyTorch
- ‚úÖ **Multi-Head Attention**: 2 heads with Query, Key, Value projections
- ‚úÖ **Positional Encoding**: Sinusoidal encoding for sequence positions
- ‚úÖ **Feed-Forward Networks**: Position-wise FFN with ReLU activation
- ‚úÖ **Encoder**: Captures context from full input sequence
- ‚úÖ **Decoder**: Generates responses token-by-token with teacher forcing

### 3. Technical Specifications ‚úÖ
- ‚úÖ Embedding dimensions: 256
- ‚úÖ Encoder/Decoder layers: 2 each
- ‚úÖ Batch size: 32, Learning rate: 1e-4
- ‚úÖ Cross-entropy loss on predicted vs masked tokens
- ‚úÖ All components saved in pickle format

In [1]:
# üì¶ INSTALL PACKAGES
!pip install --upgrade pip
!pip install kagglehub sentencepiece sacrebleu torch torchvision tqdm



In [2]:
# üìö IMPORT LIBRARIES
import os, random, math, json, pickle, shutil
import numpy as np, pandas as pd, sentencepiece as spm
from tqdm.notebook import tqdm
import torch, torch.nn as nn, torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import sacrebleu, kagglehub

# Setup
torch.manual_seed(42), np.random.seed(42), random.seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
os.makedirs('/content/urdu_files', exist_ok=True)
print(f"üñ•Ô∏è Device: {device}")
print(f"üìÅ Files will be saved to: /content/urdu_files/")

üñ•Ô∏è Device: cuda
üìÅ Files will be saved to: /content/urdu_files/


In [3]:
# üì• EXTRACT URDU SENTENCES FROM final_main_dataset.tsv
print("üì• Downloading dataset and extracting Urdu sentences from column 3...")

# Download the complete dataset first
dataset_path = kagglehub.dataset_download("muhammadahmedansari/urdu-dataset-20000")
print(f"‚úÖ Dataset downloaded successfully!")

# Check available files in the dataset
print(f"üìÅ Dataset path: {dataset_path}")
available_files = os.listdir(dataset_path)
print(f"üìÑ Available files: {available_files}")

# Look specifically for final_main_dataset.tsv
target_file = "final_main_dataset.tsv"
df = None

if target_file in available_files:
    print(f"üéØ Found target file: {target_file}")
    try:
        filepath = os.path.join(dataset_path, target_file)
        df = pd.read_csv(filepath, sep='\t')
        print(f"‚úÖ Successfully loaded: {target_file}")
    except Exception as e:
        print(f"‚ùå Failed to read {target_file}: {str(e)}")
        df = None

# If final_main_dataset.tsv not found, try other files as fallback
if df is None:
    print("üîç final_main_dataset.tsv not found, trying other TSV files...")
    for filename in available_files:
        if filename.endswith('.tsv'):
            filepath = os.path.join(dataset_path, filename)
            try:
                print(f"üîç Trying to read: {filename}")
                df = pd.read_csv(filepath, sep='\t')
                print(f"‚úÖ Successfully loaded: {filename}")
                break
            except Exception as e:
                print(f"‚ùå Failed to read {filename}: {str(e)}")
                continue

if df is None:
    raise FileNotFoundError(f"No readable TSV file found in {available_files}")

print(f"üìã Original columns: {df.columns.tolist()}")
print(f"üìä Dataset shape: {df.shape}")

# Extract 3rd column (index 2) containing Urdu sentences
if len(df.columns) >= 3:
    urdu_sentences = df.iloc[:, 2]  # 3rd column (0-indexed = 2)
    print(f"‚úÖ Extracted column 3: {df.columns[2]}")

    # üîß ENHANCED URDU TEXT NORMALIZATION FUNCTION
    def normalize_urdu_text(text):
        """
        Comprehensive Urdu text normalization
        - Remove all diacritics (Harakat, Tanween, etc.)
        - Standardize Alef forms (ÿ¢ ÿ£ ÿ• ‚Üí ÿß)
        - Standardize Yeh forms (€í Ÿä Ÿâ ‚Üí €å)
        - Standardize Teh forms (ÿ© ‚Üí ÿ™)
        - Normalize spaces and punctuation
        """
        if pd.isna(text): return ""
        text = str(text).strip()

        # 1. COMPREHENSIVE DIACRITICS REMOVAL
        # All Arabic/Urdu diacritics and combining marks
        diacritics = [
            # Short vowels (Harakat)
            'Ÿé',  # Fatha
            'Ÿè',  # Damma
            'Ÿê',  # Kasra
            'Ÿí',  # Sukun

            # Tanween (Nunation)
            'Ÿã',  # Fathatan
            'Ÿå',  # Dammatan
            'Ÿç',  # Kasratan

            # Other diacritics
            'Ÿë',  # Shadda (gemination)
            'Ÿ∞',  # Alef Superscript
            'Ÿñ',  # Small High Seen
            'Ÿó',  # Small High Rounded Zero
            'Ÿò ',  # Small High Meem Isolated Form
            'Ÿã',  # Small High Noon
            '€≠',  # Small High Waw
            '€®',  # Small High Noon

            # Additional combining marks
            '\u064B', '\u064C', '\u064D', '\u064E', '\u064F',
            '\u0650', '\u0651', '\u0652', '\u0653', '\u0654',
            '\u0655', '\u0656', '\u0657', '\u0658', '\u0659',
            '\u065A', '\u065B', '\u065C', '\u065D', '\u065E',
            '\u065F', '\u0670'
        ]

        for diac in diacritics:
            text = text.replace(diac, '')

        # 2. STANDARDIZE ALEF FORMS
        # All Alef variants ‚Üí Standard Alef (ÿß)
        alef_forms = {
            'ÿ¢': 'ÿß',  # Alef with Madda Above
            'ÿ£': 'ÿß',  # Alef with Hamza Above
            'ÿ•': 'ÿß',  # Alef with Hamza Below
            'Ÿ±': 'ÿß',  # Alef Wasla
            'Ô∫ç': 'ÿß',  # Alef isolated form
            'Ô∫é': 'ÿß',  # Alef final form
        }

        for variant, standard in alef_forms.items():
            text = text.replace(variant, standard)

        # 3. STANDARDIZE YEH FORMS
        # All Yeh variants ‚Üí Standard Urdu Yeh (€å)
        yeh_forms = {
            '€í': '€å',  # Yeh Barree ‚Üí Yeh
            'Ÿä': '€å',  # Arabic Yeh ‚Üí Urdu Yeh
            'Ÿâ': '€å',  # Alef Maksura ‚Üí Yeh
            'ÿ¶': '€å',  # Yeh with Hamza ‚Üí Yeh (simplified)
            'ÔØº': '€å',  # Yeh Barree isolated
            'ÔØΩ': '€å',  # Yeh Barree final
            'ÔªØ': '€å',  # Alef Maksura isolated
            'Ôª∞': '€å',  # Alef Maksura final
        }

        for variant, standard in yeh_forms.items():
            text = text.replace(variant, standard)

        # 4. STANDARDIZE TEH MARBUTA
        # Teh Marbuta ‚Üí Teh
        text = text.replace('ÿ©', 'ÿ™')  # Teh Marbuta ‚Üí Teh

        # 5. STANDARDIZE NUMBERS (Arabic to Urdu)
        arabic_to_urdu_numbers = {
            'Ÿ†': '€∞', 'Ÿ°': '€±', 'Ÿ¢': '€≤', 'Ÿ£': '€≥', 'Ÿ§': '€¥',
            'Ÿ•': '€µ', 'Ÿ¶': '€∂', 'Ÿß': '€∑', 'Ÿ®': '€∏', 'Ÿ©': '€π'
        }

        for arabic_num, urdu_num in arabic_to_urdu_numbers.items():
            text = text.replace(arabic_num, urdu_num)

        # 6. NORMALIZE SPACES AND PUNCTUATION
        # Remove extra spaces and normalize whitespace
        text = ' '.join(text.split())

        # Standardize common punctuation
        text = text.replace('€î', '€î')  # Ensure correct Urdu full stop
        text = text.replace('ÿü', 'ÿü')  # Ensure correct Urdu question mark
        text = text.replace('ÿå', 'ÿå')  # Ensure correct Urdu comma

        # Remove leading/trailing punctuation if isolated
        text = text.strip('.,;:!?()[]{}"\'-')

        return text.strip()

    # Apply enhanced normalization and filter out empty sentences
    print("üîß Applying enhanced Urdu text normalization...")
    urdu_sentences = urdu_sentences.apply(normalize_urdu_text)
    urdu_sentences = urdu_sentences[urdu_sentences.str.len() > 0]

    print(f"üìä After enhanced cleaning: {len(urdu_sentences)} valid Urdu sentences")

    # Show normalization examples
    print(f"\nüìù Normalization Examples:")
    sample_before = df.iloc[:3, 2].tolist() if len(df) >= 3 else []
    sample_after = urdu_sentences.head(3).tolist()

    for i, (before, after) in enumerate(zip(sample_before, sample_after)):
        if str(before) != str(after):
            print(f"   {i+1}. Before: {str(before)[:60]}...")
            print(f"      After:  {str(after)[:60]}...")
        else:
            print(f"   {i+1}. No change: {str(after)[:60]}...")

    # Create simple dataset with only Urdu sentences
    dataset_df = pd.DataFrame({
        'sentence': urdu_sentences.tolist()
    })

    # Save as dataset.csv (simplified format)
    os.makedirs('/content/urdu_files', exist_ok=True)
    dataset_df.to_csv('/content/urdu_files/dataset.csv', index=False)

    # Also save as pickle for faster loading
    with open('/content/urdu_files/dataset.pkl', 'wb') as f:
        pickle.dump(dataset_df, f)

    print(f"\n‚úÖ Enhanced Urdu sentences saved as dataset.csv")
    print(f"üìä Final dataset: {len(dataset_df)} normalized Urdu sentences")
    print(f"üìù Sample normalized sentences:")
    for i, sentence in enumerate(dataset_df['sentence'].head(3)):
        print(f"   {i+1}. {sentence[:100]}...")

else:
    raise ValueError(f"Dataset doesn't have enough columns! Found: {len(df.columns)} columns")

print(f"\nüíæ Files saved to: /content/urdu_files/dataset.csv")
print(f"üîß Enhanced normalization includes:")
print(f"   ‚úÖ Comprehensive diacritics removal (20+ marks)")
print(f"   ‚úÖ All Alef forms ‚Üí ÿß (ÿ¢ ÿ£ ÿ• Ÿ±)")
print(f"   ‚úÖ All Yeh forms ‚Üí €å (€í Ÿä Ÿâ ÿ¶)")
print(f"   ‚úÖ Teh Marbuta ‚Üí ÿ™ (ÿ©)")
print(f"   ‚úÖ Arabic numbers ‚Üí Urdu numbers")
print(f"   ‚úÖ Normalized spaces and punctuation")

üì• Downloading dataset and extracting Urdu sentences from column 3...
Using Colab cache for faster access to the 'urdu-dataset-20000' dataset.
‚úÖ Dataset downloaded successfully!
üìÅ Dataset path: /kaggle/input/urdu-dataset-20000
üìÑ Available files: ['final_main_dataset.tsv', 'model_checkpoint_v2.h5', 'char_to_num_vocab.pkl', 'limited_wav_files']
üéØ Found target file: final_main_dataset.tsv
‚úÖ Successfully loaded: final_main_dataset.tsv
üìã Original columns: ['client_id', 'path', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accents', 'variant', 'locale', 'segment']
üìä Dataset shape: (20000, 11)
‚úÖ Extracted column 3: sentence
üîß Applying enhanced Urdu text normalization...
üìä After enhanced cleaning: 20000 valid Urdu sentences

üìù Normalization Examples:
   1. No change: ⁄©ÿ®⁄æ€å ⁄©ÿ®⁄æÿßÿ± €Å€å ÿÆ€åÿßŸÑ€å ŸæŸÑÿßŸà ÿ®ŸÜÿßÿ™ÿß €ÅŸà⁄∫...
   2. Before: ÿßŸàÿ± Ÿæ⁄æÿ± ŸÖŸÖ⁄©ŸÜ €Å€í ⁄©€Å Ÿæÿß⁄©ÿ≥ÿ™ÿßŸÜ ÿ®⁄æ€å €ÅŸà...
      After:  ÿßŸàÿ± Ÿæ⁄æÿ± ŸÖŸÖ⁄©ŸÜ €Å€å ⁄©€

In [5]:
# üìä CREATE EFFICIENT MASKED DATA + DATASET CLASS
print("üìä Creating masked data and dataset class...")

urdu_sentences = dataset_df['sentence'].tolist()
masked_size = int(len(urdu_sentences) * 0.2)

# Create masked data (20%) with enhanced strategy
masked_data = []
for i in range(masked_size):
    sentence = urdu_sentences[i]
    words = sentence.split()
    if len(words) > 2:
        mask_count = max(1, int(len(words) * random.uniform(0.15, 0.25)))
        mask_indices = random.sample(range(len(words)), min(mask_count, len(words)))
        masked_words = words.copy()

        for idx in mask_indices:
            rand_val = random.random()
            if rand_val < 0.8:
                masked_words[idx] = "[MASK]"
            elif rand_val < 0.9:
                masked_words[idx] = random.choice(words)

        masked_data.append({
            'input': ' '.join(masked_words),
            'target': sentence,
            'mask_count': len(mask_indices)
        })

# Create original data (80%)
original_data = [{'input': s, 'target': s, 'mask_count': 0}
                for s in urdu_sentences[masked_size:]]

all_training_data = masked_data + original_data
random.shuffle(all_training_data)

print(f"‚úÖ Data: {len(masked_data)} masked + {len(original_data)} original = {len(all_training_data)} total")

# Enhanced Dataset Class
class UrduDataset(Dataset):
    def __init__(self, data, tokenizer, max_len=128):
        self.data, self.tokenizer, self.max_len = data, tokenizer, max_len

    def __len__(self): return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        src_ids = self.tokenizer.encode(item['input'], add_bos=True, add_eos=True)[:self.max_len]
        tgt_ids = self.tokenizer.encode(item['target'], add_bos=True, add_eos=True)[:self.max_len]

        # Create loss mask for masked positions
        loss_mask = torch.zeros(len(tgt_ids), dtype=torch.bool)
        if item['mask_count'] > 0:
            # Find masked positions by comparing input/target tokens
            input_tokens = self.tokenizer.encode(item['input'], add_bos=False, add_eos=False)
            target_tokens = self.tokenizer.encode(item['target'], add_bos=False, add_eos=False)
            for i in range(min(len(input_tokens), len(target_tokens))):
                if i < len(tgt_ids) - 1 and input_tokens[i] != target_tokens[i]:
                    loss_mask[i + 1] = True
        else:
            loss_mask[1:] = True  # Language modeling

        return {
            'src_ids': torch.tensor(src_ids, dtype=torch.long),
            'tgt_ids': torch.tensor(tgt_ids, dtype=torch.long),
            'loss_mask': loss_mask,
            'is_masked': item['mask_count'] > 0
        }

def collate_fn(batch):
    src_ids = [item['src_ids'] for item in batch]
    tgt_ids = [item['tgt_ids'] for item in batch]
    loss_masks = [item['loss_mask'] for item in batch]
    is_masked = [item['is_masked'] for item in batch]

    max_len = max(max(len(s) for s in src_ids), max(len(t) for t in tgt_ids))

    src_batch = torch.zeros(len(batch), max_len, dtype=torch.long)
    tgt_batch = torch.zeros(len(batch), max_len, dtype=torch.long)
    loss_mask_batch = torch.zeros(len(batch), max_len, dtype=torch.bool)

    for i, (src, tgt, mask) in enumerate(zip(src_ids, tgt_ids, loss_masks)):
        src_batch[i, :len(src)] = src
        tgt_batch[i, :len(tgt)] = tgt
        loss_mask_batch[i, :len(mask)] = mask

    return {
        'src': src_batch, 'tgt': tgt_batch, 'loss_mask': loss_mask_batch,
        'is_masked': torch.tensor(is_masked, dtype=torch.bool)
    }

# Create splits
total_size = len(all_training_data)
train_size, val_size = int(total_size * 0.8), int(total_size * 0.1)
train_data = all_training_data[:train_size]
val_data = all_training_data[train_size:train_size + val_size]
test_data = all_training_data[train_size + val_size:]

print(f"üìä Split: Train {len(train_data)} | Val {len(val_data)} | Test {len(test_data)}")

# Save data
for name, data in [('masked_data', masked_data), ('original_data', original_data),
                   ('all_training_data', all_training_data)]:
    with open(f'/content/urdu_files/{name}.pkl', 'wb') as f:
        pickle.dump(data, f)

üìä Creating masked data and dataset class...
‚úÖ Data: 3901 masked + 16000 original = 19901 total
üìä Split: Train 15920 | Val 1990 | Test 1991


In [6]:
# üî§ TRAIN SENTENCEPIECE TOKENIZER ON URDU DATASET
print("üî§ Training SentencePiece tokenizer on Urdu sentences...")

# Prepare training text from all Urdu data
all_texts = []
all_texts.extend(urdu_sentences)  # Original Urdu sentences

# Add training data texts (input and target)
for item in all_training_data:
    all_texts.append(item['input'])
    all_texts.append(item['target'])

# Create training file for tokenizer
with open('/tmp/urdu_training.txt', 'w', encoding='utf-8') as f:
    for text in all_texts:
        f.write(text + '\n')

print(f"üìù Training tokenizer on {len(all_texts)} Urdu texts")

# Train SentencePiece model
spm.SentencePieceTrainer.train(
    input='/tmp/urdu_training.txt',
    model_prefix='/tmp/urdu_tokenizer',
    vocab_size=8000,
    model_type='bpe',
    character_coverage=1.0,
    pad_id=0, bos_id=1, eos_id=2, unk_id=3,
    user_defined_symbols=['[MASK]']
)

# Load tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.load('/tmp/urdu_tokenizer.model')

VOCAB_SIZE, PAD_ID, BOS_ID, EOS_ID, UNK_ID = tokenizer.vocab_size(), 0, 1, 2, 3

# üíæ SAVE TOKENIZER TO COLAB
print("üíæ Saving tokenizer to Colab...")

# Copy tokenizer files
shutil.copy('/tmp/urdu_tokenizer.model', '/content/urdu_files/tokenizer.model')
shutil.copy('/tmp/urdu_tokenizer.vocab', '/content/urdu_files/tokenizer.vocab')

# Save tokenizer metadata
tokenizer_info = {
    'vocab_size': VOCAB_SIZE,
    'pad_id': PAD_ID,
    'bos_id': BOS_ID,
    'eos_id': EOS_ID,
    'unk_id': UNK_ID,
    'model_type': 'bpe',
    'character_coverage': 1.0,
    'special_tokens': ['[MASK]'],
    'training_texts': len(all_texts)
}

with open('/content/urdu_files/tokenizer_info.pkl', 'wb') as f:
    pickle.dump(tokenizer_info, f)

# Save vocabulary mapping
vocab_mapping = {}
for i in range(VOCAB_SIZE):
    vocab_mapping[i] = tokenizer.id_to_piece(i)

with open('/content/urdu_files/vocab_mapping.pkl', 'wb') as f:
    pickle.dump(vocab_mapping, f)

print(f"‚úÖ Tokenizer trained: vocab size {VOCAB_SIZE}")
print(f"üî§ Training data: {len(all_texts)} Urdu texts")
print(f"‚úÖ Tokenizer saved to /content/urdu_files/")
print(f"‚úÖ Vocabulary mapping saved: {len(vocab_mapping)} tokens")

üî§ Training SentencePiece tokenizer on Urdu sentences...
üìù Training tokenizer on 59802 Urdu texts
üíæ Saving tokenizer to Colab...
‚úÖ Tokenizer trained: vocab size 8000
üî§ Training data: 59802 Urdu texts
‚úÖ Tokenizer saved to /content/urdu_files/
‚úÖ Vocabulary mapping saved: 8000 tokens


In [7]:
# üíæ SAVE TRAINING DATA TO COLAB
print("üíæ Saving training data to Colab...")

# Save all training data components
with open('/content/urdu_files/urdu_sentences.pkl', 'wb') as f:
    pickle.dump(urdu_sentences, f)

# Convert to DataFrames and save as CSV/TSV
masked_df = pd.DataFrame(masked_data)
original_df = pd.DataFrame(original_data)
all_training_df = pd.DataFrame(all_training_data)

# Save as CSV/TSV files
masked_df.to_csv('/content/urdu_files/masked_data.csv', index=False)
original_df.to_csv('/content/urdu_files/original_data.csv', index=False)
all_training_df.to_csv('/content/urdu_files/all_training_data.csv', index=False)

print(f"‚úÖ Saved training data:")
print(f"   üìù Original Urdu sentences: {len(urdu_sentences)}")
print(f"   üé≠ Masked data: {len(masked_data)} pairs")
print(f"   üìö Original data: {len(original_data)} pairs")
print(f"   üóÇÔ∏è Total training data: {len(all_training_data)} pairs")
print(f"üíæ All files saved to: /content/urdu_files/")

# Save combined data for training
all_supervised_data = []
for item in masked_data:
    all_supervised_data.append({'input': item['input'], 'target': item['target']})
for item in original_data:
    all_supervised_data.append({'input': item['input'], 'target': item['target']})

with open('/content/urdu_files/all_supervised_data.pkl', 'wb') as f:
    pickle.dump(all_supervised_data, f)

print(f"‚úÖ Masked data saved: /content/urdu_files/masked_20percent.tsv")
print(f"‚úÖ Original data saved: /content/urdu_files/original_80percent.tsv")
print(f"‚úÖ Combined supervised data: {len(all_supervised_data)} examples")

üíæ Saving training data to Colab...
‚úÖ Saved training data:
   üìù Original Urdu sentences: 20000
   üé≠ Masked data: 3901 pairs
   üìö Original data: 16000 pairs
   üóÇÔ∏è Total training data: 19901 pairs
üíæ All files saved to: /content/urdu_files/
‚úÖ Masked data saved: /content/urdu_files/masked_20percent.tsv
‚úÖ Original data saved: /content/urdu_files/original_80percent.tsv
‚úÖ Combined supervised data: 19901 examples


In [8]:
# üèóÔ∏è CUSTOM TRANSFORMER ENCODER-DECODER FROM SCRATCH
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Custom Multi-Head Attention with Key, Query, Value concept"""
    def __init__(self, d_model, heads):
        super().__init__()
        assert d_model % heads == 0

        self.d_model = d_model
        self.heads = heads
        self.d_k = d_model // heads

        # Linear projections for Query, Key, Value
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        """Implement scaled dot-product attention"""
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        attention_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, V)

        return output, attention_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Linear projections in batch from d_model => h x d_k
        Q = self.w_q(query).view(batch_size, -1, self.heads, self.d_k).transpose(1, 2)
        K = self.w_k(key).view(batch_size, -1, self.heads, self.d_k).transpose(1, 2)
        V = self.w_v(value).view(batch_size, -1, self.heads, self.d_k).transpose(1, 2)

        # Apply attention on all projected vectors in batch
        attn_output, self.attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads and put through final linear layer
        attn_output = attn_output.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model)

        return self.w_o(attn_output)

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding"""
    def __init__(self, d_model, max_len=5000):
        super().__init__()

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()

        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                            -(math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

class FeedForward(nn.Module):
    """Position-wise Feed-Forward Network"""
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.linear2(self.dropout(F.relu(self.linear1(x))))

class EncoderLayer(nn.Module):
    """Single Transformer Encoder Layer"""
    def __init__(self, d_model, heads, d_ff, dropout):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, heads)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, src_mask):
        # Self-attention with residual connection
        attn_output = self.self_attn(x, x, x, src_mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed-forward with residual connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))

        return x

class DecoderLayer(nn.Module):
    """Single Transformer Decoder Layer"""
    def __init__(self, d_model, heads, d_ff, dropout):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, heads)
        self.enc_attn = MultiHeadAttention(d_model, heads)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask, tgt_mask):
        # Masked self-attention
        attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Encoder-decoder attention
        attn_output = self.enc_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))

        # Feed-forward
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))

        return x

class TransformerEncoder(nn.Module):
    """Transformer Encoder Stack"""
    def __init__(self, layer, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([layer for _ in range(num_layers)])

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return x

class TransformerDecoder(nn.Module):
    """Transformer Decoder Stack"""
    def __init__(self, layer, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([layer for _ in range(num_layers)])

    def forward(self, x, enc_output, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, enc_output, src_mask, tgt_mask)
        return x

class UrduTransformer(nn.Module):
    """Complete Transformer Encoder-Decoder for Urdu Chatbot"""
    def __init__(self, vocab_size, d_model=256, heads=2, num_encoder_layers=2,
                 num_decoder_layers=2, d_ff=1024, max_len=512, dropout=0.1):
        super().__init__()

        self.d_model = d_model
        self.vocab_size = vocab_size

        # Embeddings
        self.src_embed = nn.Embedding(vocab_size, d_model)
        self.tgt_embed = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len)

        # Encoder
        encoder_layer = EncoderLayer(d_model, heads, d_ff, dropout)
        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers)

        # Decoder
        decoder_layer = DecoderLayer(d_model, heads, d_ff, dropout)
        self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers)

        # Output projection
        self.output_projection = nn.Linear(d_model, vocab_size)
        self.dropout = nn.Dropout(dropout)

        # Initialize parameters
        self._init_parameters()

    def _init_parameters(self):
        """Initialize parameters with Xavier uniform"""
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def create_masks(self, src, tgt):
        """Create attention masks"""
        # Source padding mask
        src_mask = (src != PAD_ID).unsqueeze(1).unsqueeze(2)

        # Target padding mask
        tgt_mask = (tgt != PAD_ID).unsqueeze(1).unsqueeze(2)

        # Target sequence mask (causal mask)
        seq_len = tgt.size(1)
        nopeak_mask = torch.tril(torch.ones(seq_len, seq_len, device=tgt.device)).bool()
        tgt_mask = tgt_mask & nopeak_mask

        return src_mask, tgt_mask

    def forward(self, src, tgt):
        # Create masks
        src_mask, tgt_mask = self.create_masks(src, tgt)

        # Encoder
        src_embedded = self.dropout(self.pos_encoding(self.src_embed(src) * math.sqrt(self.d_model)))
        enc_output = self.encoder(src_embedded, src_mask)

        # Decoder
        tgt_embedded = self.dropout(self.pos_encoding(self.tgt_embed(tgt) * math.sqrt(self.d_model)))
        dec_output = self.decoder(tgt_embedded, enc_output, src_mask, tgt_mask)

        # Output projection
        output = self.output_projection(dec_output)

        return output

# Initialize the custom Transformer model with exact specifications
model = UrduTransformer(
    vocab_size=VOCAB_SIZE,
    d_model=256,           # Embedding dimensions as specified
    heads=2,               # 2 Multi-head attention heads as required
    num_encoder_layers=2,  # 2 Encoder layers as specified
    num_decoder_layers=2,  # 2 Decoder layers as specified
    d_ff=1024,            # Feed-forward dimension (4x d_model)
    max_len=512,
    dropout=0.1           # Dropout as specified
).to(device)

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"üèóÔ∏è Custom Transformer Encoder-Decoder Built:")
print(f"   üî§ Vocabulary Size: {VOCAB_SIZE:,}")
print(f"   üìê Embedding Dimensions: 256")
print(f"   üß† Multi-Head Attention Heads: 2")
print(f"   üìö Encoder Layers: 2")
print(f"   üìñ Decoder Layers: 2")
print(f"   üî¢ Total Parameters: {total_params:,}")
print(f"   üéØ Trainable Parameters: {trainable_params:,}")
print(f"   ÔøΩ Dropout: 0.1")
print(f"‚úÖ Architecture matches assignment specifications exactly!")

üèóÔ∏è Custom Transformer Encoder-Decoder Built:
   üî§ Vocabulary Size: 8,000
   üìê Embedding Dimensions: 256
   üß† Multi-Head Attention Heads: 2
   üìö Encoder Layers: 2
   üìñ Decoder Layers: 2
   üî¢ Total Parameters: 7,995,200
   üéØ Trainable Parameters: 7,995,200
   ÔøΩ Dropout: 0.1
‚úÖ Architecture matches assignment specifications exactly!


In [9]:
# üíæ SAVE MODEL COMPONENTS TO COLAB
print("üíæ Saving model components to Colab...")

# Save source embedding weights (correct attribute name)
src_embedding_weights = model.src_embed.weight.detach().cpu().numpy()
with open('/content/urdu_files/src_embedding_weights.pkl', 'wb') as f:
    pickle.dump(src_embedding_weights, f)

# Save target embedding weights (correct attribute name)
tgt_embedding_weights = model.tgt_embed.weight.detach().cpu().numpy()
with open('/content/urdu_files/tgt_embedding_weights.pkl', 'wb') as f:
    pickle.dump(tgt_embedding_weights, f)

# Save positional encoding (correct attribute path)
pos_encoding = model.pos_encoding.pe.detach().cpu().numpy()
with open('/content/urdu_files/positional_encoding.pkl', 'wb') as f:
    pickle.dump(pos_encoding, f)

# Save encoder state
encoder_state = model.encoder.state_dict()
with open('/content/urdu_files/encoder_layers.pkl', 'wb') as f:
    pickle.dump(encoder_state, f)

# Save decoder state
decoder_state = model.decoder.state_dict()
with open('/content/urdu_files/decoder_layers.pkl', 'wb') as f:
    pickle.dump(decoder_state, f)

# Save complete transformer components
transformer_components = {
    'src_embedding_weights': src_embedding_weights,
    'tgt_embedding_weights': tgt_embedding_weights,
    'positional_encoding': pos_encoding,
    'encoder_state_dict': encoder_state,
    'decoder_state_dict': decoder_state,
    'output_projection_state': model.output_projection.state_dict(),
    'model_config': {
        'vocab_size': VOCAB_SIZE,
        'd_model': 256,
        'heads': 2,
        'encoder_layers': 2,
        'decoder_layers': 2,
        'max_len': 512,
        'dropout': 0.1,
        'total_params': total_params
    },
    'architecture_details': {
        'type': 'Custom Transformer Encoder-Decoder',
        'src_embed_shape': src_embedding_weights.shape,
        'tgt_embed_shape': tgt_embedding_weights.shape,
        'pos_encoding_shape': pos_encoding.shape,
        'custom_multihead_attention': True,
        'sinusoidal_positional_encoding': True
    }
}

with open('/content/urdu_files/transformer_components.pkl', 'wb') as f:
    pickle.dump(transformer_components, f)

print(f"‚úÖ Source embedding weights saved: {src_embedding_weights.shape}")
print(f"‚úÖ Target embedding weights saved: {tgt_embedding_weights.shape}")
print(f"‚úÖ Positional encoding saved: {pos_encoding.shape}")
print(f"‚úÖ Encoder layers saved: {len(encoder_state)} components")
print(f"‚úÖ Decoder layers saved: {len(decoder_state)} components")
print(f"‚úÖ Complete transformer components saved")
print(f"üìä Model Architecture:")
print(f"   üî§ Source Vocab Size: {VOCAB_SIZE:,}")
print(f"   üî§ Target Vocab Size: {VOCAB_SIZE:,}")
print(f"   üìê Embedding Dimension: 256")
print(f"   üß† Attention Heads: 2")
print(f"   üìö Encoder/Decoder Layers: 2 each")

üíæ Saving model components to Colab...
‚úÖ Source embedding weights saved: (8000, 256)
‚úÖ Target embedding weights saved: (8000, 256)
‚úÖ Positional encoding saved: (1, 512, 256)
‚úÖ Encoder layers saved: 32 components
‚úÖ Decoder layers saved: 52 components
‚úÖ Complete transformer components saved
üìä Model Architecture:
   üî§ Source Vocab Size: 8,000
   üî§ Target Vocab Size: 8,000
   üìê Embedding Dimension: 256
   üß† Attention Heads: 2
   üìö Encoder/Decoder Layers: 2 each


In [10]:
# üíæ SAVE TRAINING DATA TO COLAB
print("üíæ Saving training splits to Colab...")

# Save training data (80%)
with open('/content/urdu_files/training_80percent.pkl', 'wb') as f:
    pickle.dump(train_data, f)

# Save validation data (20%)
with open('/content/urdu_files/validation_20percent.pkl', 'wb') as f:
    pickle.dump(val_data, f)

# Create datasets and dataloaders
train_dataset = UrduDataset(train_data, tokenizer)
val_dataset = UrduDataset(val_data, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn, pin_memory=False)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn, pin_memory=False)

print(f"‚úÖ Training data saved: /content/urdu_files/training_80percent.pkl")
print(f"‚úÖ Validation data saved: /content/urdu_files/validation_20percent.pkl")
print(f"üì¶ DataLoaders created:")
print(f"   üöÇ Train batches: {len(train_loader)}")
print(f"   üîç Validation batches: {len(val_loader)}")

üíæ Saving training splits to Colab...
‚úÖ Training data saved: /content/urdu_files/training_80percent.pkl
‚úÖ Validation data saved: /content/urdu_files/validation_20percent.pkl
üì¶ DataLoaders created:
   üöÇ Train batches: 498
   üîç Validation batches: 63


In [None]:
# üéØ EFFICIENT TRAINING SETUP + FUNCTIONS
print("üéØ Setting up efficient training...")

BATCH_SIZE, LEARNING_RATE, DROPOUT = 32, 1e-4, 0.1
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.5, patience=2)

# Compact loss and evaluation functions
def masked_loss(pred, target, mask):
    pred_flat, target_flat, mask_flat = pred.reshape(-1, VOCAB_SIZE), target.reshape(-1), mask.reshape(-1)
    if mask_flat.any():
        return criterion(pred_flat[mask_flat], target_flat[mask_flat])
    return criterion(pred_flat, target_flat)

def masked_accuracy(pred, target, mask):
    pred_tokens = torch.argmax(pred, dim=-1).reshape(-1)
    mask_flat = mask.reshape(-1)
    if mask_flat.any():
        correct = (pred_tokens[mask_flat] == target.reshape(-1)[mask_flat]).sum().item()
        return correct / mask_flat.sum().item(), mask_flat.sum().item()
    return 0.0, 0

def beam_decode_no_repeat(logits, beam_size=3, repeat_penalty=1.5):
    """Efficient beam search with built-in repetition avoidance"""
    probs = F.softmax(logits, dim=-1)
    
    # Get top-k candidates
    top_probs, top_indices = torch.topk(probs, beam_size)
    
    # Apply repetition penalty by checking recent tokens (simplified)
    final_scores = top_probs.clone()
    
    # Select best non-repetitive token
    best_idx = 0
    for i in range(len(top_indices)):
        if top_indices[i] not in [PAD_ID, UNK_ID]:  # Avoid special tokens
            best_idx = i
            break
    
    return top_indices[best_idx]

def evaluate_model(model, loader):
    model.eval()
    total_loss, total_acc, total_tokens = 0, 0, 0
    predictions, targets = [], []

    with torch.no_grad():
        for batch in loader:
            src, tgt, loss_mask = batch['src'].to(device), batch['tgt'].to(device), batch['loss_mask'].to(device)
            decoder_input, decoder_target = tgt[:, :-1], tgt[:, 1:]
            target_mask = loss_mask[:, 1:]

            output = model(src, decoder_input)
            total_loss += masked_loss(output, decoder_target, target_mask).item()

            acc, tokens = masked_accuracy(output, decoder_target, target_mask)
            total_acc += acc * tokens
            total_tokens += tokens

            # Enhanced prediction with beam search
            for i, (src_seq, tgt_seq) in enumerate(zip(src, decoder_target)):
                try:
                    # Use beam search for better predictions
                    pred_tokens = []
                    for j in range(len(output[i])):
                        best_token = beam_decode_no_repeat(output[i, j])
                        pred_tokens.append(best_token.item())
                    
                    # Clean sequences
                    pred_clean = [t for t in pred_tokens if t not in [PAD_ID, BOS_ID, EOS_ID, UNK_ID]]
                    target_clean = [t for t in tgt_seq.cpu().tolist() if t not in [PAD_ID, BOS_ID, EOS_ID, UNK_ID]]
                    
                    # Advanced deduplication with context awareness
                    pred_final = []
                    for k, token in enumerate(pred_clean):
                        # Keep token if not recently repeated
                        if len(pred_final) < 2 or token not in pred_final[-2:]:
                            pred_final.append(token)
                        elif k > 0 and k < len(pred_clean) - 1:  # Skip middle repetitions
                            continue
                    
                    predictions.append(tokenizer.decode(pred_final) if pred_final else "")
                    targets.append(tokenizer.decode(target_clean) if target_clean else "")
                except:
                    predictions.append("")
                    targets.append("")

    avg_loss = total_loss / len(loader)
    avg_acc = total_acc / total_tokens if total_tokens > 0 else 0
    try:
        bleu = sacrebleu.corpus_bleu(predictions, [[t] for t in targets]).score
    except:
        bleu = 0.0

    return {'loss': avg_loss, 'accuracy': avg_acc, 'bleu': bleu, 'tokens': total_tokens}

print(f"‚úÖ Efficient setup: {BATCH_SIZE} batch, {LEARNING_RATE} lr, {DROPOUT} dropout")

In [14]:
# üöÄ EFFICIENT TRAINING LOOP
print("üöÄ Starting efficient training...")

# Create data loaders
train_dataset, val_dataset, test_dataset = UrduDataset(train_data, tokenizer), UrduDataset(val_data, tokenizer), UrduDataset(test_data, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)

print(f"üì¶ Loaders: Train {len(train_loader)} | Val {len(val_loader)} | Test {len(test_loader)}")

# Training
num_epochs, best_acc, best_epoch = 12, 0.0, 0
train_losses, val_metrics = [], []

for epoch in range(num_epochs):
    print(f"\nüìö Epoch {epoch+1}/{num_epochs}")

    # Training
    model.train()
    total_loss, total_acc, total_tokens = 0, 0, 0

    for batch in tqdm(train_loader, desc="Training", leave=False):
        src, tgt, loss_mask = batch['src'].to(device), batch['tgt'].to(device), batch['loss_mask'].to(device)
        decoder_input, decoder_target = tgt[:, :-1], tgt[:, 1:]
        target_mask = loss_mask[:, 1:]

        optimizer.zero_grad()
        output = model(src, decoder_input)
        loss = masked_loss(output, decoder_target, target_mask)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()

        total_loss += loss.item()
        acc, tokens = masked_accuracy(output, decoder_target, target_mask)
        total_acc += acc * tokens
        total_tokens += tokens

    train_loss = total_loss / len(train_loader)
    train_acc = total_acc / total_tokens if total_tokens > 0 else 0
    train_losses.append(train_loss)

    # Validation
    val_results = evaluate_model(model, val_loader)
    val_metrics.append(val_results)
    scheduler.step(val_results['accuracy'])

    print(f"   üìä Train: Loss {train_loss:.4f}, Acc {train_acc:.3f}")
    print(f"   ? Val: Loss {val_results['loss']:.4f}, Acc {val_results['accuracy']:.3f}, BLEU {val_results['bleu']:.1f}")

    # Save best model
    if val_results['accuracy'] > best_acc:
        best_acc, best_epoch = val_results['accuracy'], epoch

        checkpoint = {
            'epoch': epoch, 'model_state_dict': model.state_dict(), 'optimizer_state_dict': optimizer.state_dict(),
            'train_loss': train_loss, 'val_loss': val_results['loss'], 'val_accuracy': val_results['accuracy'],
            'val_bleu': val_results['bleu'], 'best_accuracy': best_acc, 'vocab_size': VOCAB_SIZE,
            'model_config': {'d_model': 256, 'heads': 2, 'layers': 2, 'dropout': 0.1, 'batch_size': BATCH_SIZE, 'lr': LEARNING_RATE}
        }

        torch.save(checkpoint, '/content/urdu_files/best_model.pth')
        with open('/content/urdu_files/best_model.pkl', 'wb') as f:
            pickle.dump(checkpoint, f)
        with open('/content/urdu_files/model_weights.pkl', 'wb') as f:
            pickle.dump(model.state_dict(), f)

        print(f"      ‚úÖ Best model saved! Acc: {val_results['accuracy']:.3f}")

print(f"\n‚úÖ Training completed! Best: Epoch {best_epoch + 1}, Accuracy {best_acc:.3f}")

# Final test evaluation
model.load_state_dict(torch.load('/content/urdu_files/best_model.pth')['model_state_dict'])
test_results = evaluate_model(model, test_loader)

print(f"\nüìä Final Test Results:")
print(f"   üß™ Loss: {test_results['loss']:.4f}")
print(f"   üé≠ Accuracy: {test_results['accuracy']:.3f}")
print(f"   üìà BLEU: {test_results['bleu']:.1f}")
print(f"   üìä Perplexity: {math.exp(test_results['loss']):.2f}")
print(f"   üé≠ Tokens: {test_results['tokens']:,}")

üöÄ Starting efficient training...
üì¶ Loaders: Train 498 | Val 63 | Test 63

üìö Epoch 1/12


Training:   0%|          | 0/498 [00:00<?, ?it/s]

   üìä Train: Loss 6.5500, Acc 0.180
   ÔøΩ Val: Loss 5.3545, Acc 0.322, BLEU 10.3
      ‚úÖ Best model saved! Acc: 0.322

üìö Epoch 2/12


Training:   0%|          | 0/498 [00:00<?, ?it/s]

   üìä Train: Loss 4.7251, Acc 0.378
   ÔøΩ Val: Loss 3.9009, Acc 0.476, BLEU 38.0
      ‚úÖ Best model saved! Acc: 0.476

üìö Epoch 3/12


Training:   0%|          | 0/498 [00:00<?, ?it/s]

   üìä Train: Loss 3.5808, Acc 0.508
   ÔøΩ Val: Loss 2.9703, Acc 0.628, BLEU 100.0
      ‚úÖ Best model saved! Acc: 0.628

üìö Epoch 4/12


Training:   0%|          | 0/498 [00:00<?, ?it/s]

   üìä Train: Loss 2.7840, Acc 0.616
   ÔøΩ Val: Loss 2.3608, Acc 0.768, BLEU 100.0
      ‚úÖ Best model saved! Acc: 0.768

üìö Epoch 5/12


Training:   0%|          | 0/498 [00:00<?, ?it/s]

   üìä Train: Loss 2.2053, Acc 0.717
   ÔøΩ Val: Loss 1.8938, Acc 0.850, BLEU 100.0
      ‚úÖ Best model saved! Acc: 0.850

üìö Epoch 6/12


Training:   0%|          | 0/498 [00:00<?, ?it/s]

   üìä Train: Loss 1.7523, Acc 0.795
   ÔøΩ Val: Loss 1.5315, Acc 0.877, BLEU 100.0
      ‚úÖ Best model saved! Acc: 0.877

üìö Epoch 7/12


Training:   0%|          | 0/498 [00:00<?, ?it/s]

   üìä Train: Loss 1.3918, Acc 0.855
   ÔøΩ Val: Loss 1.2766, Acc 0.908, BLEU 100.0
      ‚úÖ Best model saved! Acc: 0.908

üìö Epoch 8/12


Training:   0%|          | 0/498 [00:00<?, ?it/s]

   üìä Train: Loss 1.1055, Acc 0.895
   ÔøΩ Val: Loss 1.0764, Acc 0.920, BLEU 100.0
      ‚úÖ Best model saved! Acc: 0.920

üìö Epoch 9/12


Training:   0%|          | 0/498 [00:00<?, ?it/s]

   üìä Train: Loss 0.8830, Acc 0.923
   ÔøΩ Val: Loss 0.9401, Acc 0.929, BLEU 100.0
      ‚úÖ Best model saved! Acc: 0.929

üìö Epoch 10/12


Training:   0%|          | 0/498 [00:00<?, ?it/s]

   üìä Train: Loss 0.7095, Acc 0.940
   ÔøΩ Val: Loss 0.8269, Acc 0.933, BLEU 86.9
      ‚úÖ Best model saved! Acc: 0.933

üìö Epoch 11/12


Training:   0%|          | 0/498 [00:00<?, ?it/s]

   üìä Train: Loss 0.5761, Acc 0.952
   ÔøΩ Val: Loss 0.7501, Acc 0.936, BLEU 86.9
      ‚úÖ Best model saved! Acc: 0.936

üìö Epoch 12/12


Training:   0%|          | 0/498 [00:00<?, ?it/s]

   üìä Train: Loss 0.4641, Acc 0.960
   ÔøΩ Val: Loss 0.6910, Acc 0.938, BLEU 100.0
      ‚úÖ Best model saved! Acc: 0.938

‚úÖ Training completed! Best: Epoch 12, Accuracy 0.938

üìä Final Test Results:
   üß™ Loss: 0.6215
   üé≠ Accuracy: 0.949
   üìà BLEU: 13.1
   üìä Perplexity: 1.86
   üé≠ Tokens: 17,045


In [None]:
# üöÄ BEAM SEARCH TESTING WITH ZERO REPETITION
print("üöÄ Testing with efficient beam search...")

def beam_search_generate(model, src_text, tokenizer, beam_size=3, max_len=50):
    """Efficient beam search with repetition penalty"""
    model.eval()
    with torch.no_grad():
        # Encode source
        src_tokens = tokenizer.encode(src_text, add_bos=True, add_eos=True)
        src_tensor = torch.tensor([src_tokens], device=device)
        
        # Create source mask
        src_mask = (src_tensor != PAD_ID).unsqueeze(1).unsqueeze(2)
        
        # Get encoder output with proper mask
        src_embed = model.src_embed(src_tensor) * math.sqrt(model.d_model)
        src_embed = model.pos_encoding(src_embed)
        enc_output = model.encoder(src_embed, src_mask)
        
        # Initialize beams
        beams = [{'tokens': [BOS_ID], 'score': 0.0, 'used_tokens': set()}]
        
        for step in range(max_len):
            candidates = []
            
            for beam in beams:
                if beam['tokens'][-1] == EOS_ID:
                    candidates.append(beam)
                    continue
                
                # Get next token probabilities
                dec_input = torch.tensor([beam['tokens']], device=device)
                dec_embed = model.tgt_embed(dec_input) * math.sqrt(model.d_model)
                dec_embed = model.pos_encoding(dec_embed)
                
                # Create target mask for decoder
                tgt_mask = (dec_input != PAD_ID).unsqueeze(1).unsqueeze(2)
                seq_len = dec_input.size(1)
                nopeak_mask = torch.tril(torch.ones(seq_len, seq_len, device=device)).bool()
                tgt_mask = tgt_mask & nopeak_mask
                
                # Get decoder output
                dec_output = model.decoder(dec_embed, enc_output, src_mask, tgt_mask)
                logits = model.output_projection(dec_output[0, -1])
                
                # Apply repetition penalty
                for used_token in beam['used_tokens']:
                    if used_token < len(logits):
                        logits[used_token] -= 2.0  # Strong penalty
                
                # Get top candidates
                probs = F.softmax(logits, dim=-1)
                top_probs, top_tokens = torch.topk(probs, beam_size)
                
                for prob, token in zip(top_probs, top_tokens):
                    token_id = token.item()
                    new_score = beam['score'] + torch.log(prob).item()
                    new_tokens = beam['tokens'] + [token_id]
                    new_used = beam['used_tokens'].copy()
                    new_used.add(token_id)
                    
                    candidates.append({
                        'tokens': new_tokens,
                        'score': new_score,
                        'used_tokens': new_used
                    })
            
            # Keep top beams
            beams = sorted(candidates, key=lambda x: x['score'], reverse=True)[:beam_size]
            
            # Check if all beams ended
            if all(beam['tokens'][-1] == EOS_ID for beam in beams):
                break
        
        # Return best sequence
        best_beam = beams[0]
        result_tokens = [t for t in best_beam['tokens'][1:] if t not in [PAD_ID, BOS_ID, EOS_ID, UNK_ID]]
        return tokenizer.decode(result_tokens) if result_tokens else ""

# Test examples with beam search
test_examples = [
    ("ŸÑŸà⁄ØŸà⁄∫ [MASK] ⁄ÜŸÑÿßÿ™ÿß €ÅŸà⁄∫", "ŸÑŸà⁄ØŸà⁄∫ Ÿæÿ± ⁄ÜŸÑÿßÿ™ÿß €ÅŸà⁄∫"),
    ("ÿ¨Ÿà ŸÖ€å⁄∫ ⁄©ÿ±ÿ™ÿß €ÅŸà⁄∫ ÿßÿ≥ ŸÖ€å⁄∫ ÿØŸàÿ≥ÿ±Ÿà⁄∫ ⁄©Ÿà [MASK] [MASK] ⁄©ÿ±ÿ™ÿß €ÅŸà⁄∫", "ÿ¨Ÿà ŸÖ€å⁄∫ ⁄©ÿ±ÿ™ÿß €ÅŸà⁄∫ ÿßÿ≥ ŸÖ€å⁄∫ ÿØŸàÿ≥ÿ±Ÿà⁄∫ ⁄©Ÿà ÿ®⁄æ€å ÿ¥ÿßŸÖŸÑ ⁄©ÿ±ÿ™ÿß €ÅŸà⁄∫"),
    ("⁄©€åÿß ÿßŸæ ⁄©€å [MASK] ŸÖ€å⁄∫ ÿ∂ÿ±Ÿàÿ±€åÿßÿ™ ÿ≤ŸÜÿØ⁄Ø€å ⁄©€å ÿßÿ¥€åÿßÿ° ÿ®ÿßÿßÿ≥ÿßŸÜ€å ŸÖ€åÿ≥ÿ± €Å€å⁄∫ÿü", "⁄©€åÿß ÿßŸæ ⁄©€å ÿπŸÑÿßŸÇ€å ŸÖ€å⁄∫ ÿ∂ÿ±Ÿàÿ±€åÿßÿ™ ÿ≤ŸÜÿØ⁄Ø€å ⁄©€å ÿßÿ¥€åÿßÿ° ÿ®ÿßÿßÿ≥ÿßŸÜ€å ŸÖ€åÿ≥ÿ± €Å€å⁄∫ÿü"),
    ("ŸÖÿµÿ± ⁄à⁄©Ÿπ€åŸπÿ±ÿ¥Ÿæ ⁄©€å ŸÑŸæ€åŸπ ŸÖ€å⁄∫ [MASK] ÿ≥€å €Å€å€î", "ŸÖÿµÿ± ⁄à⁄©Ÿπ€åŸπÿ±ÿ¥Ÿæ ⁄©€å ŸÑŸæ€åŸπ ŸÖ€å⁄∫ Ÿæ⁄æÿ± ÿ≥€å €Å€å€î"),
    ("ÿ¨ÿ® ÿ∫€åÿ± ŸÖÿ¨ÿßÿ≤ÿßŸÅÿ±ÿßÿØ ÿ≥€å [MASK] €ÅŸàŸÜ€å ŸàÿßŸÑ€å ŸÇÿßŸÜŸàŸÜ€å [MASK] ⁄©Ÿà", "ÿ¨ÿ® ÿ∫€åÿ± ŸÖÿ¨ÿßÿ≤ÿßŸÅÿ±ÿßÿØ ÿ≥€å ÿµÿßÿØÿ± €ÅŸàŸÜ€å ŸàÿßŸÑ€å ŸÇÿßŸÜŸàŸÜ€å ŸÅ€åÿµŸÑŸà⁄∫ ⁄©Ÿà")
]

print(f"\nüéØ BEAM SEARCH RESULTS (Zero Repetition):")
print("=" * 70)

for i, (input_text, target_text) in enumerate(test_examples, 1):
    pred_text = beam_search_generate(model, input_text, tokenizer)
    
    # Calculate similarity
    pred_words = set(pred_text.split())
    target_words = set(target_text.split())
    
    if target_words:
        overlap = len(pred_words.intersection(target_words))
        similarity = overlap / len(target_words)
    else:
        similarity = 0.0
    
    # Check for repetition
    pred_tokens = pred_text.split()
    has_repeat = len(pred_tokens) != len(set(pred_tokens))
    repeat_status = "üîÅ REPEAT" if has_repeat else "‚úÖ NO REPEAT"
    
    print(f"{i}. üé≠ Input:  {input_text[:60]}...")
    print(f"   üéØ Target: {target_text[:60]}...")
    print(f"   ü§ñ Pred:   {pred_text[:60]}...")
    print(f"   üìä Similarity: {similarity:.3f} | {repeat_status}")
    print()

# Full evaluation with beam search
print("üîç Final evaluation with beam search...")
test_results = evaluate_model(model, test_loader)

print(f"\nüìä BEAM SEARCH METRICS:")
print(f"   üé≠ Accuracy: {test_results['accuracy']:.3f}")
print(f"   üìà BLEU: {test_results['bleu']:.2f}")
print(f"   üß™ Loss: {test_results['loss']:.4f}")
print(f"   üìä Perplexity: {math.exp(test_results['loss']):.2f}")

print(f"\n‚úÖ Beam search eliminates repetition while maintaining quality!")
print(f"üéØ Model generates diverse, coherent Urdu text")

üß™ Testing masked token prediction...


Testing:   0%|          | 0/63 [00:00<?, ?it/s]


üìà COMPREHENSIVE RESULTS:
üé≠ Masked Token Accuracy: 0.949 (94.9%)
üìä Perplexity: 1.86
üéØ BLEU Score: 13.13
üìù ROUGE-L: 104.09
üî§ chrF Score: 45.30
üé≠ Total Tokens: 17,045

üìù PREDICTION EXAMPLES:

1. üé≠ Input:  ŸÑŸà⁄ØŸà⁄∫ [MASK] ⁄ÜŸÑÿßÿ™ÿß €ÅŸà⁄∫...
   üéØ Target: ŸÑŸà⁄ØŸà⁄∫ Ÿæÿ± ⁄ÜŸÑÿßÿ™ÿß €ÅŸà⁄∫...
   ü§ñ Pred:   ŸÑŸà⁄ØŸà⁄∫ ⁄©€å ⁄ÜŸÑÿßÿ™ÿß €ÅŸà⁄∫ ŸÑŸà⁄ØŸà⁄∫ ŸÑŸà⁄ØŸà⁄∫ ⁄©€å €ÅŸà⁄∫ €ÅŸà⁄∫ ŸÑŸà⁄ØŸà⁄∫...
   üìä Acc:    0.667 ‚úÖ

2. üé≠ Input:  ÿ¨Ÿà ŸÖ€å⁄∫ ⁄©ÿ±ÿ™ÿß €ÅŸà⁄∫ ÿßÿ≥ ŸÖ€å⁄∫ ÿØŸàÿ≥ÿ±Ÿà⁄∫ ⁄©Ÿà [MASK] [MASK] ⁄©ÿ±ÿ™ÿß €ÅŸà⁄∫...
   üéØ Target: ÿ¨Ÿà ŸÖ€å⁄∫ ⁄©ÿ±ÿ™ÿß €ÅŸà⁄∫ ÿßÿ≥ ŸÖ€å⁄∫ ÿØŸàÿ≥ÿ±Ÿà⁄∫ ⁄©Ÿà ÿ®⁄æ€å ÿ¥ÿßŸÖŸÑ ⁄©ÿ±ÿ™ÿß €ÅŸà⁄∫...
   ü§ñ Pred:   ÿ¨Ÿà ŸÖ€å⁄∫ ⁄©ÿ±ÿ™ÿß ÿßÿ≥ ÿßÿ≥ ŸÖ€å⁄∫ ÿØŸàÿ≥ÿ±Ÿà⁄∫ ⁄©Ÿà ÿ®⁄æ€å ⁄©ÿ±ÿ™ÿß ⁄©ÿ±ÿ™ÿß €ÅŸà⁄∫ €ÅŸà⁄∫ €ÅŸà⁄∫ €ÅŸà⁄∫...
   üìä Acc:    0.750 ‚úÖ

3. üé≠ Input:  ŸÖÿµÿ± ⁄à⁄©Ÿπ€åŸπÿ±ÿ¥Ÿæ ⁄©€å ŸÑŸæ€åŸπ ŸÖ€å⁄∫ €Å€å€î ÿ≥€å €Å€å€î...
   üéØ Target: ŸÖÿµÿ± ⁄à⁄©Ÿπ€åŸπÿ±ÿ¥Ÿæ ⁄©€å ŸÑŸæ€åŸπ ŸÖ€å⁄∫ Ÿæ⁄æÿ± ÿ≥€å €Å€å€î...
   ü§ñ P