# Phase III: Trigram Language Model for Urdu Story Generation

This notebook implements a **Trigram Language Model** using **Maximum Likelihood Estimation (MLE)** with **Interpolation** for generating Urdu stories.

## Components:
1. **Data Loading** - Load preprocessed Urdu stories
2. **BPE Tokenization** - Reuse the Byte Pair Encoding vocabulary and merges trained in Phase II
3. **MLE Probability Estimation** - Relative-frequency estimates for unigrams, bigrams, and trigrams
4. **Interpolation** - Smooth probabilities as a fixed-weight sum of the unigram, bigram, and trigram estimates
5. **Text Generation** - Generate stories until `<EOT>` token

### Special Tokens:
- `<EOS>` - End of Sentence
- `<EOP>` - End of Paragraph  
- `<EOT>` - End of Text (Story)
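
The interpolation step can be illustrated on a toy sequence before looking at the full implementation. This is a minimal sketch, not the notebook's model: the token list and the helper name `interpolated` are assumptions made up for the example, and the weights simply mirror the defaults used later (0.1, 0.3, 0.6).

```python
from collections import Counter

# Toy token sequence (illustrative only, not taken from the corpus).
tokens = ["a", "b", "a", "b", "c", "a", "b", "a"]
lambda1, lambda2, lambda3 = 0.1, 0.3, 0.6  # unigram, bigram, trigram weights

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

def interpolated(w1: str, w2: str, w3: str) -> float:
    """Return l1*P(w3) + l2*P(w3|w2) + l3*P(w3|w1,w2) using MLE counts."""
    p_uni = unigrams[w3] / len(tokens)
    # How often w2 was seen as a bigram context.
    ctx2 = sum(c for (x, _), c in bigrams.items() if x == w2)
    p_bi = bigrams[(w2, w3)] / ctx2 if ctx2 else 0.0
    # How often (w1, w2) was seen as a trigram context.
    ctx3 = sum(c for (x, y, _), c in trigrams.items() if (x, y) == (w1, w2))
    p_tri = trigrams[(w1, w2, w3)] / ctx3 if ctx3 else 0.0
    return lambda1 * p_uni + lambda2 * p_bi + lambda3 * p_tri

print(interpolated("a", "b", "a"))  # seen trigram context
print(interpolated("c", "b", "c"))  # unseen trigram context, lower orders still contribute
```

When the trigram context was never observed, the trigram term is zero and the estimate falls back on the bigram and unigram terms, which is the reason for interpolating in the first place.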

In [15]:
# ============================================
# IMPORTS AND CONFIGURATION
# ============================================
import os
import re
import glob
import random
import pickle
import math
import json
from collections import defaultdict, Counter
from typing import List, Dict, Tuple, Optional

# Set random seed for reproducibility
random.seed(42)

# Configuration
DATA_DIR = "../PreProcessing/Preprocessed_documents/"
MODEL_SAVE_PATH = "trigram_model.pkl"

# Special tokens - use text-based tokens that match preprocessing
EOS_TOKEN = "<EOS>"  # End of Sentence
EOP_TOKEN = "<EOP>"  # End of Paragraph
EOT_TOKEN = "<EOT>"  # End of Text/Story
START_TOKEN = "<START>"  # Start token for padding

SPECIAL_TOKENS = {EOS_TOKEN, EOP_TOKEN, EOT_TOKEN, START_TOKEN}

print("Configuration loaded successfully!")
print(f"Special Tokens: EOS={repr(EOS_TOKEN)}, EOP={repr(EOP_TOKEN)}, EOT={repr(EOT_TOKEN)}")

Configuration loaded successfully!
Special Tokens: EOS='<EOS>', EOP='<EOP>', EOT='<EOT>'


In [16]:
# ============================================
# DATA LOADING
# ============================================

def load_stories(data_dir: str) -> List[str]:
    """
    Load all story documents from the data directory.
    Returns a list of story texts.
    """
    stories = []
    file_pattern = os.path.join(data_dir, "doc*.txt")
    files = sorted(glob.glob(file_pattern))
    
    for file_path in files:
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                story = f.read().strip()
                if story:  # Only add non-empty stories
                    stories.append(story)
        except Exception as e:
            print(f"Error reading {file_path}: {e}")
    
    print(f"Loaded {len(stories)} stories from {data_dir}")
    return stories

# Load the data
stories = load_stories(DATA_DIR)
if stories:
    print(f"\nSample story (first 500 chars):\n{stories[0][:500]}...")
else:
    print("No stories loaded")

Loaded 469 stories from ../PreProcessing/Preprocessed_documents/

Sample story (first 500 chars):
اکمل میٹرک کا طالب علم تھا، لیکن اپنی پڑھائی اور والدین اور اساتذہ کا احترام کرنے میں لاپروا سا تھا۔ <EOS> اکمل کو بگاڑنے میں زیادہ تر ہاتھ ان کے دادا تھا، جو ایک سرکاری ادارے سے ریٹائرڈ افسر تھے۔ <EOS> خود تو وہ تمام عمر پابندیوں میں رہتے ہوئے ملازمت کرتے رہے، لیکن اکمل کو انھوں نے بے جا لاڈ پیار کی وجہ سے خراب کر دیا تھا۔ <EOS> 
دادا اپنی پینشن سے اس کی ہر فرمائش کو پورا کرتے۔ <EOS> اکمل کے ماں باپ منع بھی کرتے، مگر دادا کو اپنے پوتے سے بہت پیار تھا۔ <EOS> یہی وجہ تھی کہ اکمل سارا دن کمپیوٹر ا...


In [17]:
# ============================================
# BPE TOKENIZER (Phase II Integration)
# ============================================

class BPETokenizer:
    """
    BPE (Byte Pair Encoding) Tokenizer.
    Loads pre-trained vocabulary and merges from Phase II.
    Properly handles special tokens (<EOS>, <EOP>, <EOT>).
    """
    
    def __init__(self, vocab_path: str = "../Tokenization/vocab.json", 
                 merges_path: str = "../Tokenization/merges.txt"):
        """
        Initialize the BPE Tokenizer with pre-trained vocab and merges.
        
        Args:
            vocab_path: Path to vocab.json file
            merges_path: Path to merges.txt file
        """
        self.vocab = self._load_vocab(vocab_path)
        self.merges = self._load_merges(merges_path)
        
        # Add special tokens to vocab if not present
        for token in SPECIAL_TOKENS:
            if token not in self.vocab:
                self.vocab.add(token)
        
        print(f"BPE Tokenizer loaded: {len(self.vocab)} tokens, {len(self.merges)} merges")
    
    def _load_vocab(self, vocab_path: str) -> set:
        """Load vocabulary from JSON file."""
        try:
            with open(vocab_path, 'r', encoding='utf-8') as f:
                return set(json.load(f))
        except FileNotFoundError:
            print(f"Warning: Vocab file not found at {vocab_path}. Starting with empty vocab.")
            return set()
    
    def _load_merges(self, merges_path: str) -> List[Tuple[str, str]]:
        """Load merge operations from file."""
        merges = []
        try:
            with open(merges_path, 'r', encoding='utf-8') as f:
                for line in f:
                    parts = line.strip().split(' ')
                    if len(parts) == 2:
                        merges.append((parts[0], parts[1]))
        except FileNotFoundError:
            print(f"Warning: Merges file not found at {merges_path}. No merges loaded.")
        return merges
    
    def _apply_merges(self, word: str) -> List[str]:
        """
        Apply BPE merges to a word.
        
        Args:
            word: Input word to tokenize
            
        Returns:
            List of BPE tokens
        """
        # Special tokens should not be split
        if word in SPECIAL_TOKENS:
            return [word]
        
        # Start with character-level representation
        tokens = list(word)
        
        # Apply each merge in order
        for merge_pair in self.merges:
            i = 0
            while i < len(tokens) - 1:
                if tokens[i] == merge_pair[0] and tokens[i + 1] == merge_pair[1]:
                    tokens = tokens[:i] + [merge_pair[0] + merge_pair[1]] + tokens[i + 2:]
                else:
                    i += 1
        
        return tokens
    
    def tokenize(self, text: str) -> List[str]:
        """
        Tokenize text into BPE tokens.
        
        Args:
            text: Input text to tokenize
            
        Returns:
            List of BPE tokens
        """
        tokens = []
        words = text.split()
        
        for word in words:
            word_tokens = self._apply_merges(word)
            tokens.extend(word_tokens)
        
        return tokens
    
    def detokenize(self, tokens: List[str]) -> str:
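        """
        Convert tokens back to text by joining them with spaces.
        The merges carry no word-boundary marker, so subword pieces of one
        word come out space-separated (visible in the generated samples).
        """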
        return ' '.join(tokens)
    
    def get_vocab_size(self) -> int:
        """Return the vocabulary size."""
        return len(self.vocab)

# Initialize the BPE tokenizer
tokenizer = BPETokenizer()
print(f"\nVocabulary size: {tokenizer.get_vocab_size()}")

# Test tokenization
test_text = "ایک دن"
test_tokens = tokenizer.tokenize(test_text)
print(f"Test text: {test_text}")
print(f"Tokens: {test_tokens}")

BPE Tokenizer loaded: 1001 tokens, 938 merges

Vocabulary size: 1001
Test text: ایک دن
Tokens: ['ایک', 'دن']
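
The merge loop in `_apply_merges` is easier to follow on a small Latin-script word. The sketch below is a standalone re-statement of the same idea; the function name `apply_merges` and the three toy merge rules are assumptions for illustration and are not the Phase II merges.

```python
from typing import List, Tuple

def apply_merges(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Apply BPE merge rules, in order, to one whitespace-delimited word."""
    tokens = list(word)  # start from individual characters
    for left, right in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                # Fuse the adjacent pair in place; stay at i so the new
                # token can be compared with its next neighbour.
                tokens[i:i + 2] = [left + right]
            else:
                i += 1
    return tokens

# Hypothetical merge table, listed in the order the rules were learned.
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(apply_merges("lower", toy_merges))
```

Rule order matters: `("lo", "w")` can only fire because `("l", "o")` was applied first.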


In [18]:
# ============================================
# DATA LOADING (FILES ALREADY PREPROCESSED)
# ============================================

# Note: The preprocessed documents already contain special tokens (<EOS>, <EOP>, <EOT>)
# from the preprocessing phase, so we just load them directly.

def load_corpus(data_dir: str) -> List[str]:
    """
    Load preprocessed stories from the data directory.
    Files already contain special tokens from preprocessing.
    """
    corpus = []
    file_pattern = os.path.join(data_dir, "doc*.txt")
    files = sorted(glob.glob(file_pattern))
    
    for file_path in files:
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                story = f.read().strip()
                if story:
                    corpus.append(story)
        except Exception as e:
            print(f"Error reading {file_path}: {e}")
    
    return corpus

# Load the preprocessed corpus
corpus = load_corpus(DATA_DIR)
print(f"Loaded {len(corpus)} preprocessed stories")
print(f"\nSample story (first 500 chars):")
print(corpus[0][:500] if corpus else "No stories loaded")

Loaded 469 preprocessed stories

Sample story (first 500 chars):
اکمل میٹرک کا طالب علم تھا، لیکن اپنی پڑھائی اور والدین اور اساتذہ کا احترام کرنے میں لاپروا سا تھا۔ <EOS> اکمل کو بگاڑنے میں زیادہ تر ہاتھ ان کے دادا تھا، جو ایک سرکاری ادارے سے ریٹائرڈ افسر تھے۔ <EOS> خود تو وہ تمام عمر پابندیوں میں رہتے ہوئے ملازمت کرتے رہے، لیکن اکمل کو انھوں نے بے جا لاڈ پیار کی وجہ سے خراب کر دیا تھا۔ <EOS> 
دادا اپنی پینشن سے اس کی ہر فرمائش کو پورا کرتے۔ <EOS> اکمل کے ماں باپ منع بھی کرتے، مگر دادا کو اپنے پوتے سے بہت پیار تھا۔ <EOS> یہی وجہ تھی کہ اکمل سارا دن کمپیوٹر ا


In [19]:
# ============================================
# TRIGRAM LANGUAGE MODEL (Built from Scratch)
# ============================================

class TrigramLanguageModel:
    """
    Trigram Language Model using Maximum Likelihood Estimation (MLE).
    Implements interpolation smoothing combining unigram, bigram, and trigram probabilities.
    
    Built entirely from scratch without using any pre-built language modeling libraries.
    Uses BPE tokenization from Phase II (subword-level, not character-level).
    """
    
    def __init__(self, lambda1: float = 0.1, lambda2: float = 0.3, lambda3: float = 0.6):
        """
        Initialize the Trigram Language Model.
        
        Args:
            lambda1: Weight for unigram probability (default: 0.1)
            lambda2: Weight for bigram probability (default: 0.3)
            lambda3: Weight for trigram probability (default: 0.6)
            
        Note: lambda1 + lambda2 + lambda3 must equal 1.0
        """
        assert abs(lambda1 + lambda2 + lambda3 - 1.0) < 1e-6, \
            "Interpolation weights must sum to 1.0"
        
        self.lambda1 = lambda1
        self.lambda2 = lambda2
        self.lambda3 = lambda3
        
        self.unigram_counts = Counter()
        self.bigram_counts = defaultdict(Counter)
        self.trigram_counts = defaultdict(Counter)
        
        self.total_unigrams = 0
        self.bigram_context_counts = Counter()
        self.trigram_context_counts = Counter()
        
        self.vocabulary = set()
        self.is_trained = False
        self.tokenizer = None  # Will be set during training
    
    def train(self, corpus: List[str], bpe_tokenizer: Optional[BPETokenizer] = None):
        """
        Train the trigram model on the given corpus using BPE tokenization.
        
        Args:
            corpus: List of preprocessed text documents
            bpe_tokenizer: BPE tokenizer instance (uses global tokenizer if not provided)
        """
        # Use provided tokenizer or global one
        self.tokenizer = bpe_tokenizer if bpe_tokenizer else tokenizer
        
        print("Training Trigram Language Model with BPE tokenization...")
        
        for doc_idx, document in enumerate(corpus):
            # Tokenize using BPE (subword-level, NOT character-level!)
            tokens = self.tokenizer.tokenize(document)
            
            # Add padding at the beginning for trigram context
            padded_tokens = [START_TOKEN, START_TOKEN] + tokens
            
            # Build vocabulary
            self.vocabulary.update(tokens)
            
            # Count unigrams
            for token in tokens:
                self.unigram_counts[token] += 1
                self.total_unigrams += 1
            
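            # Count bigrams
            # Note: the loop starts at index 0, so the padding pair
            # (START, START) is counted once per document. This doubles the
            # START context count and roughly halves P(token | START).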
            for i in range(len(padded_tokens) - 1):
                context = padded_tokens[i]
                next_token = padded_tokens[i + 1]
                self.bigram_counts[context][next_token] += 1
                self.bigram_context_counts[context] += 1
            
            # Count trigrams
            for i in range(len(padded_tokens) - 2):
                context = (padded_tokens[i], padded_tokens[i + 1])
                next_token = padded_tokens[i + 2]
                self.trigram_counts[context][next_token] += 1
                self.trigram_context_counts[context] += 1
            
            if (doc_idx + 1) % 50 == 0:
                print(f"  Processed {doc_idx + 1}/{len(corpus)} documents...")
        
        self.is_trained = True
        print(f"\nTraining complete!")
        print(f"  Vocabulary size: {len(self.vocabulary)}")
        print(f"  Total tokens: {self.total_unigrams}")
        print(f"  Unique bigram contexts: {len(self.bigram_counts)}")
        print(f"  Unique trigram contexts: {len(self.trigram_counts)}")
    
    def get_unigram_probability(self, token: str) -> float:
        """P(token) = count(token) / total_tokens"""
        if self.total_unigrams == 0:
            return 0.0
        return self.unigram_counts[token] / self.total_unigrams
    
    def get_bigram_probability(self, context: str, token: str) -> float:
        """P(token | context) = count(context, token) / count(context)"""
        context_count = self.bigram_context_counts[context]
        if context_count == 0:
            return 0.0
        return self.bigram_counts[context][token] / context_count
    
    def get_trigram_probability(self, context: Tuple[str, str], token: str) -> float:
        """P(token | ctx1, ctx2) = count(ctx1, ctx2, token) / count(ctx1, ctx2)"""
        context_count = self.trigram_context_counts[context]
        if context_count == 0:
            return 0.0
        return self.trigram_counts[context][token] / context_count
    
    def get_interpolated_probability(self, context: Tuple[str, str], token: str) -> float:
        """
        Calculate interpolated probability.
        P_interp = λ1*P(token) + λ2*P(token|ctx2) + λ3*P(token|ctx1,ctx2)
        """
        p_unigram = self.get_unigram_probability(token)
        p_bigram = self.get_bigram_probability(context[1], token)
        p_trigram = self.get_trigram_probability(context, token)
        
        return (self.lambda1 * p_unigram + 
                self.lambda2 * p_bigram + 
                self.lambda3 * p_trigram)
    
    def get_next_token_probabilities(self, context: Tuple[str, str]) -> Dict[str, float]:
        """Get interpolated probabilities for all possible next tokens."""
        probabilities = {}
        for token in self.vocabulary:
            prob = self.get_interpolated_probability(context, token)
            if prob > 0:
                probabilities[token] = prob
        return probabilities
    
    def sample_next_token(self, context: Tuple[str, str], temperature: float = 1.0) -> str:
        """Sample the next token given a context."""
        probabilities = self.get_next_token_probabilities(context)
        
        if not probabilities:
            return random.choice(list(self.vocabulary))
        
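        # Apply temperature scaling (temperature must be positive: 0 would divide by zero)
        if temperature <= 0:
            raise ValueError("temperature must be greater than 0")
        if temperature != 1.0:
            probabilities = {k: v ** (1.0 / temperature) for k, v in probabilities.items() if v > 0}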
        
        # Normalize
        total = sum(probabilities.values())
        if total == 0:
            return random.choice(list(self.vocabulary))
        
        normalized = {k: v / total for k, v in probabilities.items()}
        
        return random.choices(list(normalized.keys()), weights=list(normalized.values()))[0]
    
    def calculate_perplexity(self, text: str) -> float:
        """Calculate the perplexity of a text sequence."""
        tokens = self.tokenizer.tokenize(text)
        padded = [START_TOKEN, START_TOKEN] + tokens
        
        log_prob_sum = 0.0
        n = len(tokens)
        
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            token = padded[i]
            
            prob = self.get_interpolated_probability(context, token)
            if prob > 0:
                log_prob_sum += math.log(prob)
            else:
                log_prob_sum += math.log(1e-10)
        
        avg_log_prob = log_prob_sum / n if n > 0 else 0
        return math.exp(-avg_log_prob)

print("TrigramLanguageModel class defined successfully (using BPE tokenization)!")

TrigramLanguageModel class defined successfully (using BPE tokenization)!
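
The temperature step in `sample_next_token` raises each probability to the power `1 / temperature` and renormalizes. The sketch below isolates that step so its effect can be checked on a two-token distribution; the helper name `apply_temperature` and the example distribution are assumptions for illustration.

```python
from typing import Dict

def apply_temperature(probs: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Rescale a distribution with p ** (1 / T), then renormalize to sum to 1."""
    if temperature <= 0:
        raise ValueError("temperature must be greater than 0")
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items() if p > 0}
    total = sum(scaled.values())
    return {tok: value / total for tok, value in scaled.items()}

toy = {"x": 0.8, "y": 0.2}
sharp = apply_temperature(toy, 0.5)  # T < 1 concentrates mass on the likelier token
flat = apply_temperature(toy, 2.0)   # T > 1 moves mass toward the rarer token
print(sharp, flat)
```

This explains the default of 0.8 in `generate_story`: a value slightly below 1 favors frequent continuations while still leaving room for variety.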


In [20]:
# ============================================
# TEXT GENERATION
# ============================================

class UrduStoryGenerator:
    """
    Urdu Story Generator using the Trigram Language Model.
    Generates text until the <EOT> (End of Text) token is reached.
    """
    
    def __init__(self, model: TrigramLanguageModel):
        """
        Initialize the generator with a trained trigram model.
        
        Args:
            model: A trained TrigramLanguageModel instance
        """
        self.model = model
    
    def generate(self, 
                 prefix: str = "", 
                 max_length: int = 1000, 
                 temperature: float = 1.0,
                 stop_on_eot: bool = True) -> str:
        """
        Generate text starting from an optional prefix.
        
        Args:
            prefix: Starting text (prompt) for generation
            max_length: Maximum number of tokens to generate
            temperature: Sampling temperature (higher = more diverse)
            stop_on_eot: Whether to stop generation at <EOT> token
            
        Returns:
            Generated text string
        """
        if not self.model.is_trained:
            raise RuntimeError("Model must be trained before generation!")
        
        # Tokenize prefix using BPE (NOT character-level!)
        if prefix and self.model.tokenizer:
            tokens = self.model.tokenizer.tokenize(prefix)
        elif prefix:
            tokens = prefix.split()  # Fallback to word-level
        else:
            tokens = []
        
        # Add padding for context
        padded_tokens = [START_TOKEN, START_TOKEN] + tokens
        
        generated_count = 0
        
        while generated_count < max_length:
            context = (padded_tokens[-2], padded_tokens[-1])
            next_token = self.model.sample_next_token(context, temperature)
            padded_tokens.append(next_token)
            generated_count += 1
            
            if stop_on_eot and next_token == EOT_TOKEN:
                break
        
        # Extract generated tokens (without padding)
        output_tokens = padded_tokens[2:]
        
        # Detokenize
        if self.model.tokenizer:
            generated_text = self.model.tokenizer.detokenize(output_tokens)
        else:
            generated_text = ' '.join(output_tokens)
        
        return generated_text
    
    def generate_story(self, 
                       starting_phrase: str = "", 
                       max_length: int = 2000,
                       temperature: float = 0.8) -> str:
        """
        Generate a complete Urdu story.
        
        Args:
            starting_phrase: Starting phrase in Urdu
            max_length: Maximum length of the story
            temperature: Creativity parameter (0.5-1.5 recommended)
            
        Returns:
            Generated story text
        """
        raw_story = self.generate(
            prefix=starting_phrase,
            max_length=max_length,
            temperature=temperature,
            stop_on_eot=True
        )
        
        # Post-process: Clean up special tokens for display
        cleaned_story = self._clean_story(raw_story)
        
        return cleaned_story
    
    def _clean_story(self, text: str) -> str:
        """Clean up the generated story by formatting special tokens."""
        # Replace special tokens for display
        text = text.replace(EOS_TOKEN, ' ')
        text = text.replace(EOP_TOKEN, '\n\n')
        text = text.replace(EOT_TOKEN, '')
        
        # Clean up extra whitespace
        text = re.sub(r' +', ' ', text)
        text = re.sub(r'\n +', '\n', text)
        
        while '\n\n\n' in text:
            text = text.replace('\n\n\n', '\n\n')
        
        return text.strip()
    
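    def generate_interactive(self, max_length: int = 2000, temperature: float = 0.8):
        """
        Interactive story generation - yields text one token at a time.
        Useful for step-wise display (like ChatGPT streaming).
        Stops at <EOT> or after max_length tokens, whichever comes first,
        so the loop cannot run forever if <EOT> is never sampled.
        """
        tokens = [START_TOKEN, START_TOKEN]
        
        for _ in range(max_length):
            context = (tokens[-2], tokens[-1])
            next_token = self.model.sample_next_token(context, temperature=temperature)
            tokens.append(next_token)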
            
            if next_token == EOT_TOKEN:
                break
            elif next_token == EOS_TOKEN:
                yield ' '
            elif next_token == EOP_TOKEN:
                yield '\n\n'
            else:
                yield next_token + ' '

print("UrduStoryGenerator class defined successfully!")

UrduStoryGenerator class defined successfully!
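
The core of `generate` is a loop that slides a two-token context forward and stops on `<EOT>` or on the length limit. A minimal sketch of that loop follows, using a hand-written lookup table in place of the trained model; `toy_model` and `generate_tokens` are hypothetical names used only here.

```python
import random
from typing import Dict, List, Tuple

START, EOT = "<START>", "<EOT>"

# Hypothetical stand-in for the trained model: context -> next-token distribution.
toy_model: Dict[Tuple[str, str], Dict[str, float]] = {
    (START, START): {"ایک": 1.0},
    (START, "ایک"): {"دن": 1.0},
    ("ایک", "دن"): {EOT: 1.0},
}

def generate_tokens(model, max_length: int = 10, seed: int = 0) -> List[str]:
    """Sample tokens until <EOT>, an unknown context, or max_length is reached."""
    rng = random.Random(seed)
    history = [START, START]  # two padding tokens form the first context
    for _ in range(max_length):
        dist = model.get((history[-2], history[-1]))
        if not dist:
            break  # context not in the toy table
        candidates, weights = zip(*dist.items())
        next_token = rng.choices(candidates, weights=weights)[0]
        history.append(next_token)
        if next_token == EOT:
            break
    return history[2:]  # drop the padding

print(generate_tokens(toy_model))
```

Bounding the loop with `max_length` matters because nothing guarantees that `<EOT>` will ever be sampled.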


In [21]:
# ============================================
# TRAIN THE MODEL
# ============================================

# Initialize the trigram model with interpolation weights
# λ1 (unigram) = 0.1, λ2 (bigram) = 0.3, λ3 (trigram) = 0.6
trigram_model = TrigramLanguageModel(lambda1=0.1, lambda2=0.3, lambda3=0.6)

# Train on the preprocessed corpus using BPE tokenizer
trigram_model.train(corpus, tokenizer)

# Print some statistics
print("\n" + "="*50)
print("MODEL STATISTICS")
print("="*50)
print(f"Vocabulary Size: {len(trigram_model.vocabulary)}")
print(f"Total Tokens: {trigram_model.total_unigrams:,}")
print(f"Unique Bigram Contexts: {len(trigram_model.bigram_counts):,}")
print(f"Unique Trigram Contexts: {len(trigram_model.trigram_counts):,}")
print(f"\nInterpolation Weights:")
print(f"  λ1 (Unigram): {trigram_model.lambda1}")
print(f"  λ2 (Bigram):  {trigram_model.lambda2}")
print(f"  λ3 (Trigram): {trigram_model.lambda3}")

Training Trigram Language Model with BPE tokenization...
  Processed 50/469 documents...
  Processed 100/469 documents...
  Processed 150/469 documents...
  Processed 200/469 documents...
  Processed 250/469 documents...
  Processed 300/469 documents...
  Processed 350/469 documents...
  Processed 400/469 documents...
  Processed 450/469 documents...

Training complete!
  Vocabulary size: 862
  Total tokens: 524321
  Unique bigram contexts: 862
  Unique trigram contexts: 59968

MODEL STATISTICS
Vocabulary Size: 862
Total Tokens: 524,321
Unique Bigram Contexts: 862
Unique Trigram Contexts: 59,968

Interpolation Weights:
  λ1 (Unigram): 0.1
  λ2 (Bigram):  0.3
  λ3 (Trigram): 0.6


In [22]:
# ============================================
# GENERATE STORIES
# ============================================

# Initialize the story generator
generator = UrduStoryGenerator(trigram_model)

# Generate a story with no prompt
print("="*60)
print("GENERATED STORY (No Prompt)")
print("="*60)
generated_story = generator.generate_story(
    starting_phrase="",
    max_length=500,
    temperature=0.8
)
print(generated_story)
print("\n")

# Generate a story with a prompt
print("="*60)
print("GENERATED STORY (With Prompt: 'ایک دن')")
print("="*60)
generated_story_with_prompt = generator.generate_story(
    starting_phrase="ایک دن",
    max_length=500,
    temperature=0.8
)
print(generated_story_with_prompt)

GENERATED STORY (No Prompt)
کسی جنگل میں ایک اچھی سی دھو لوگ وں میں آ نے میں ایک بڑا سا تھ ہی ب اق ی تم نے ٹھیک ہے ۔ 

سو نا نہیں چ اہ ا تو کی ا تو وہ ہار اُٹھا کر اپنے گھر میں چھ ل ان گ ئ یں گے اور ایسا س نہ رے ب ان ی کا شور بہ ب نا یا اور ڑھ ے جا ر ہے تھے۔ جب یہ پک نک پر ایک گاؤں میں پہنچ نہ س کے تھے۔ 

میں ایسی ہی چیز وں کیلئے جمع تھے۔ اگر کوئی مگر مچھ کو د یک ھ تے ہی چ ڑ یا ں گ ا۔ د کم ہار رح یم کی ا کو نا ہ ل تے ہیں۔ م اور گھر سے مدد لی نی ہو تو تم ہیں شا ہی مح لے وال ے کی ا می ج وہ ڑ سب خش ک ہو ں ۔ میں آپ کو چھ پا یا تھا اور ۔ 

وہ لڑکا اور طوطا وہ اپنا ار اد ہ تو ا بھی س میں تو عی ش کر اپنے والد نے اس ے ش ان د از ہ ری چو ہے کبھی بھی اپنے کم رے میں جھ ان کا ر عب د اللہ سے پہ اڑ پر جا ئے ہے، جو قر ب ان ی کی ا می ری ر کر کہنے لگ ے ، وہ کر ی ن نے کے ش عر صے میں چا ئے کے سا تھ میں شی رہنے کا مزہ وں کے بعد وہ دو ن وں بہت چ ال کا جواب میں صرف ایک بار وہ بچہ نظر رکھ تا تھا۔ پ ان ی اس کے بعد وہ خود بھی اور سب سے سو ی رے کی طرف ہ جر ت بڑھ نے لگ ی ، میں اُڑ تا ہوا ؟ ایک دن ایک تر کی ب ند ر

In [23]:
# ============================================
# MODEL EVALUATION
# ============================================

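# Calculate perplexity on the first 500 characters of five training stories.
# These are in-sample scores, so they are optimistic; stories held out from
# training would give a fairer estimate of model fit.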
sample_texts = [corpus[i][:500] for i in range(min(5, len(corpus)))]

print("="*60)
print("MODEL EVALUATION - Perplexity")
print("="*60)

perplexities = []
for i, text in enumerate(sample_texts):
    perplexity = trigram_model.calculate_perplexity(text)
    perplexities.append(perplexity)
    print(f"Sample {i+1}: Perplexity = {perplexity:.2f}")

avg_perplexity = sum(perplexities) / len(perplexities)
print(f"\nAverage Perplexity: {avg_perplexity:.2f}")
print("(Lower perplexity = better model fit)")

MODEL EVALUATION - Perplexity
Sample 1: Perplexity = 11.60
Sample 2: Perplexity = 9.40
Sample 3: Perplexity = 11.32
Sample 4: Perplexity = 9.39
Sample 5: Perplexity = 10.48

Average Perplexity: 10.44
(Lower perplexity = better model fit)
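
To make the numbers above easier to interpret, the following sketch computes perplexity directly from a list of per-token probabilities, the same quantity `calculate_perplexity` derives from the interpolated model. The helper name `perplexity` and the input lists are assumptions for illustration.

```python
import math
from typing import List

def perplexity(token_probs: List[float]) -> float:
    """exp of the negative mean log-probability of the observed tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-avg_log_prob)

# A model that gives every observed token probability 0.1 is as uncertain
# as a uniform choice among 10 tokens.
print(perplexity([0.1, 0.1, 0.1, 0.1]))
print(perplexity([0.5, 0.125]))
```

Read this way, an average perplexity near 10 on a vocabulary of 862 tokens means the model narrows each prediction to roughly ten equally likely candidates, at least on text it was trained on.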


In [24]:
# ============================================
# SAVE AND LOAD MODEL
# ============================================

def save_model(model: TrigramLanguageModel, filepath: str):
    """
    Save the trained trigram model to a pickle file.
    Note: The tokenizer is not saved - it should be re-initialized when loading.
    """
    model_data = {
        'lambda1': model.lambda1,
        'lambda2': model.lambda2,
        'lambda3': model.lambda3,
        'unigram_counts': dict(model.unigram_counts),
        'bigram_counts': {k: dict(v) for k, v in model.bigram_counts.items()},
        'trigram_counts': {k: dict(v) for k, v in model.trigram_counts.items()},
        'total_unigrams': model.total_unigrams,
        'bigram_context_counts': dict(model.bigram_context_counts),
        'trigram_context_counts': dict(model.trigram_context_counts),
        'vocabulary': model.vocabulary,
        'is_trained': model.is_trained
    }
    
    with open(filepath, 'wb') as f:
        pickle.dump(model_data, f)
    
    print(f"Model saved to {filepath}")

def load_model(filepath: str, bpe_tokenizer: BPETokenizer = None) -> TrigramLanguageModel:
    """
    Load a trained trigram model from a pickle file.
    
    Args:
        filepath: Path to the saved model
        bpe_tokenizer: BPE tokenizer instance (uses global tokenizer if not provided)
    """
    with open(filepath, 'rb') as f:
        model_data = pickle.load(f)
    
    model = TrigramLanguageModel(
        lambda1=model_data['lambda1'],
        lambda2=model_data['lambda2'],
        lambda3=model_data['lambda3']
    )
    
    model.unigram_counts = Counter(model_data['unigram_counts'])
    model.bigram_counts = defaultdict(Counter)
    for k, v in model_data['bigram_counts'].items():
        model.bigram_counts[k] = Counter(v)
    model.trigram_counts = defaultdict(Counter)
    for k, v in model_data['trigram_counts'].items():
        model.trigram_counts[k] = Counter(v)
    model.total_unigrams = model_data['total_unigrams']
    model.bigram_context_counts = Counter(model_data['bigram_context_counts'])
    model.trigram_context_counts = Counter(model_data['trigram_context_counts'])
    model.vocabulary = model_data['vocabulary']
    model.is_trained = model_data['is_trained']
    
    # Set the tokenizer (use provided or global)
    model.tokenizer = bpe_tokenizer if bpe_tokenizer else tokenizer
    
    print(f"Model loaded from {filepath}")
    return model

# Save the trained model
save_model(trigram_model, MODEL_SAVE_PATH)
print(f"\nModel saved successfully!")

Model saved to trigram_model.pkl

Model saved successfully!
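
`save_model` converts the nested `Counter` objects to plain dictionaries before pickling, and `load_model` rebuilds them. A small round-trip sketch of that conversion is shown below; the toy counts and the temporary file name are assumptions for illustration.

```python
import os
import pickle
import tempfile
from collections import Counter, defaultdict

# Toy trigram counts keyed by a two-token context.
counts = defaultdict(Counter)
counts[("<START>", "<START>")]["ایک"] += 2
counts[("<START>", "ایک")]["دن"] += 1

# Flatten to plain dicts so the file does not depend on defaultdict factories.
plain = {context: dict(counter) for context, counter in counts.items()}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_counts.pkl")
    with open(path, "wb") as f:
        pickle.dump(plain, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Rebuild the defaultdict(Counter) structure the model code expects.
restored = defaultdict(Counter)
for context, counter in loaded.items():
    restored[context] = Counter(counter)

print(restored == counts)
```

Rebuilding `Counter` objects keeps lookups of unseen tokens returning 0 instead of raising `KeyError`. Because `pickle.load` can execute arbitrary code, model files should only be loaded from trusted sources.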


In [25]:
# ============================================
# API INTERFACE (For Phase IV Integration)
# ============================================

class StoryGeneratorAPI:
    """
    API interface for the Urdu Story Generator.
    This class provides methods that can be easily integrated with FastAPI.
    See Phase IV for the actual FastAPI service implementation.
    """
    
    def __init__(self, model_path: str = None, bpe_tokenizer: BPETokenizer = None):
        """
        Initialize the API with a trained model.
        
        Args:
            model_path: Path to the saved model file. If None, uses in-memory model.
            bpe_tokenizer: BPE tokenizer instance
        """
        self.tokenizer = bpe_tokenizer if bpe_tokenizer else tokenizer
        
        if model_path and os.path.exists(model_path):
            self.model = load_model(model_path, self.tokenizer)
        else:
            # Use the already trained model
            self.model = trigram_model
        
        self.generator = UrduStoryGenerator(self.model)
    
    def generate(self, prefix: str = "", max_length: int = 1000, temperature: float = 0.8) -> dict:
        """
        Generate a story (endpoint: POST /generate).
        
        Args:
            prefix: Starting phrase in Urdu
            max_length: Maximum number of tokens to generate
            temperature: Sampling temperature
            
        Returns:
            Dictionary with generated story and metadata
        """
        try:
            story = self.generator.generate_story(
                starting_phrase=prefix,
                max_length=max_length,
                temperature=temperature
            )
            
            return {
                "success": True,
                "story": story,
                "input_prefix": prefix,
                "max_length": max_length,
                "temperature": temperature
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "input_prefix": prefix
            }
    
    def get_model_info(self) -> dict:
        """
        Get model information and statistics.
        """
        return {
            "model_type": "Trigram Language Model with BPE",
            "vocabulary_size": len(self.model.vocabulary),
            "total_tokens": self.model.total_unigrams,
            "interpolation_weights": {
                "lambda1_unigram": self.model.lambda1,
                "lambda2_bigram": self.model.lambda2,
                "lambda3_trigram": self.model.lambda3
            },
            "is_trained": self.model.is_trained
        }
    
    def generate_stream(self, prefix: str = "", max_length: int = 1000, temperature: float = 0.8):
        """
        Generate story in streaming mode (token by token).
        Useful for ChatGPT-like step-wise display.
        
        Yields:
            Individual tokens for streaming display
        """
        # Tokenize prefix using BPE
        if prefix and self.tokenizer:
            tokens = self.tokenizer.tokenize(prefix)
        elif prefix:
            tokens = prefix.split()
        else:
            tokens = []
        
        padded_tokens = [START_TOKEN, START_TOKEN] + tokens
        generated_count = 0
        
        while generated_count < max_length:
            context = (padded_tokens[-2], padded_tokens[-1])
            next_token = self.model.sample_next_token(context, temperature)
            padded_tokens.append(next_token)
            generated_count += 1
            
            if next_token == EOT_TOKEN:
                break
            elif next_token == EOS_TOKEN:
                yield ' '
            elif next_token == EOP_TOKEN:
                yield '\n\n'
            else:
                yield next_token + ' '

# Initialize API
api = StoryGeneratorAPI()
print("API Interface initialized!")
print(f"\nModel Info: {api.get_model_info()}")

API Interface initialized!

Model Info: {'model_type': 'Trigram Language Model with BPE', 'vocabulary_size': 862, 'total_tokens': 524321, 'interpolation_weights': {'lambda1_unigram': 0.1, 'lambda2_bigram': 0.3, 'lambda3_trigram': 0.6}, 'is_trained': True}


In [26]:
# ============================================
# TEST API ENDPOINT
# ============================================

# Test the generate endpoint
print("="*60)
print("TESTING API ENDPOINT: generate()")
print("="*60)

# Test with different prefixes
test_inputs = [
    {"prefix": "", "max_length": 300},
    {"prefix": "ایک بار", "max_length": 300},
    {"prefix": "بچے نے", "max_length": 300},
]

for i, test in enumerate(test_inputs):
    print(f"\nTest {i+1}: prefix='{test['prefix']}'")
    print("-" * 40)
    result = api.generate(prefix=test['prefix'], max_length=test['max_length'])
    if result['success']:
        print(f"Generated Story:\n{result['story'][:500]}...")
    else:
        print(f"Error: {result['error']}")

TESTING API ENDPOINT: generate()

Test 1: prefix=''
----------------------------------------
Generated Story:
وہ ایک دوس رے دن مٹھ ائی کھ ائی اور دوس رے کی بُری عادت پ کی دیتے ہوئے کہا اور نا یا کرتا تھا۔ چ لو نی کی م ار ی کی ہے ۔ جاؤ بیٹا ، اب جاؤ اور تم ام پر ا نے اس ے ن سی آئی۔ اب اسد نے پ س ند کیے ج ات ا تو وہ چپ کے لئے کھ ان ا ہے ۔ مٹھو چ ونکہ دھ ار نے کے ۔ یہ ب ال کر و ان چڑھ ا د یک ھ ا کہ چھ ل ان ا ج ان ور گوشت نہیں کھا یا ۔ اس کی نظر اس خرگوش کو سبق س ان نے کی آواز س نا شی سے چھ ل ان کر د یا ۔ یہ کہتے ہوئے اک بر علی می ا کر بو لا ، آپ س ا۔ 

ل یک ھ آئ دہ خ ال ہ ج ان ہو ں ۔ اس کی سمجھ میں لا کھ ا...

Test 2: prefix='ایک بار'
----------------------------------------
Generated Story:
ایک بار پھر بچ گ یا ۔ مجھے بھی یہ ض ر ورت ہے ۔ چیونٹی بو لی ، ل یک ن ار ے پر ج ان ا ہے؟ مک ان وں کو قید یہ شا خ وں پر دس ت کا س ام نے کی وجہ سے ٹ می کوشش کر وں گ ہو ج ۔ نا ہی چھوڑ د یا تھا، مگر وہ اس ے ہ ا تھ نہ کر یں گے، پھر ہم ار ے گئی۔ اب تو آپ اپنے گھر وں کی سا ئ یک ل کے ب جا ئے گی۔ اس نے پَ ر وں 

In [27]:
# ============================================
# STREAMING GENERATION DEMO (For Phase V)
# ============================================

print("="*60)
print("STREAMING GENERATION DEMO")
print("="*60)
print("Generating story character by character (first 200 chars):\n")

# Collect streamed output until about 200 characters have been generated
streamed_text = ""
char_count = 0

for token in api.generate_stream(prefix="", max_length=500, temperature=0.8):
    streamed_text += token
    char_count += len(token)  # count characters, not yielded chunks
    if char_count >= 200:
        break

print(streamed_text)
print("\n... (truncated for demo)")
print("\n" + "="*60)
print("Phase III Complete! Model is ready for Phase IV integration.")
print("="*60)

STREAMING GENERATION DEMO
Generating story character by character (first 200 chars):

بہت عر صہ ان کے تی س رے لوگ وں میں سے بے کا ر س رو ع کرتے تو کوئی چیز کھ ان ا م ار ر ہے ۔  یہ کہہ کر بادشاہ کو ب تا یا ۔  گھر میں داخل ہو کر جواب دے گئے ۔ یا دہ کھ ان کی ا اور وہ ی فکر و سوچ ۔  



... (truncated for demo)

Phase III Complete! Model is ready for Phase IV integration.


## Interpolation Technique Explanation

The interpolation smoothing technique combines probabilities from unigram, bigram, and trigram models:

$$P_{interp}(w_i | w_{i-2}, w_{i-1}) = \lambda_1 \cdot P(w_i) + \lambda_2 \cdot P(w_i | w_{i-1}) + \lambda_3 \cdot P(w_i | w_{i-2}, w_{i-1})$$

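Where the weights sum to 1 ($\lambda_1 + \lambda_2 + \lambda_3 = 1$), so the interpolated values still form a valid probability distribution:
- $\lambda_1 = 0.1$ (unigram weight) - keeps the probability non-zero when the context was never seen in training
- $\lambda_2 = 0.3$ (bigram weight) - conditions on the previous token only
- $\lambda_3 = 0.6$ (trigram weight) - gives the most weight to the full two-token context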
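
As an illustrative calculation with made-up component probabilities (these are not values from the trained model), suppose $P(w_i) = 0.01$, $P(w_i | w_{i-1}) = 0.2$, and the trigram was never observed, so $P(w_i | w_{i-2}, w_{i-1}) = 0$. The interpolated estimate is then:

$$P_{interp}(w_i | w_{i-2}, w_{i-1}) = 0.1 \cdot 0.01 + 0.3 \cdot 0.2 + 0.6 \cdot 0 = 0.001 + 0.06 + 0 = 0.061$$

The unseen trigram contributes nothing, yet the estimate stays above zero because the lower-order terms still apply.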

**Benefits:**
1. **Handles sparse data**: When a trigram count is zero, the trigram term contributes nothing, but the bigram and unigram terms still supply probability mass
2. **Smoother distribution**: Every token seen in training keeps a non-zero probability in any context, because its unigram term is always positive
3. **Balances specificity and generalization**: Higher-order n-grams capture more context, while lower-order n-grams provide robustness
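
The formula and benefits above can be sketched in a few lines of Python. This is a minimal, self-contained illustration on a toy token list, not the notebook's actual model class: the helper names `count_ngrams` and `interpolated_prob` are hypothetical, and only the default weights (0.1, 0.3, 0.6) and the `<START>` padding convention are taken from this notebook.

```python
from collections import Counter

START = "<START>"  # same padding token the notebook uses


def count_ngrams(tokens):
    """Count n-grams and their contexts over a START-padded token list.

    Hypothetical helper for illustration only.
    """
    padded = [START, START] + list(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(padded[1:], padded[2:]))
    trigrams = Counter(zip(padded, padded[1:], padded[2:]))
    # Context counts are the denominators of the MLE estimates.
    bigram_ctx = Counter(padded[1:-1])
    trigram_ctx = Counter(zip(padded[:-2], padded[1:-1]))
    return unigrams, bigrams, trigrams, bigram_ctx, trigram_ctx


def interpolated_prob(word, context, counts, lambdas=(0.1, 0.3, 0.6)):
    """Interpolated P(word | context), where context = (w_{i-2}, w_{i-1})."""
    unigrams, bigrams, trigrams, bigram_ctx, trigram_ctx = counts
    w2, w1 = context
    l1, l2, l3 = lambdas
    total = sum(unigrams.values())
    # Each MLE term is count / context count, or 0.0 for an unseen context.
    p_uni = unigrams[word] / total if total else 0.0
    p_bi = bigrams[(w1, word)] / bigram_ctx[w1] if bigram_ctx[w1] else 0.0
    p_tri = (trigrams[(w2, w1, word)] / trigram_ctx[(w2, w1)]
             if trigram_ctx[(w2, w1)] else 0.0)
    return l1 * p_uni + l2 * p_bi + l3 * p_tri


counts = count_ngrams("a b a c a b".split())
seen = interpolated_prob("b", ("c", "a"), counts)    # trigram (c, a, b) occurs
unseen = interpolated_prob("c", ("c", "a"), counts)  # trigram (c, a, c) does not
print(f"seen trigram:   {seen:.4f}")
print(f"unseen trigram: {unseen:.4f}")
```

In this toy corpus the unseen trigram still receives a positive probability from the bigram and unigram terms, and the probabilities over the three-token vocabulary sum to 1 for the context `("c", "a")`.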

In [28]:
# ============================================
# VERIFY PYTHON MODULE (For Phase IV)
# ============================================

# The trigram_model.py file is maintained separately and is expected to include BPE tokenization.
# This cell checks that the file exists and contains the expected identifiers.

module_path = "trigram_model.py"

# Check whether the module file exists (this does not attempt an import)
print(f"Module path: {module_path}")
print(f"Module exists: {os.path.exists(module_path)}")

if os.path.exists(module_path):
    # Read and display key parts of the module
    with open(module_path, 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Check for key features
    has_bpe = "BPETokenizer" in content
    has_text_tokens = '<EOS>' in content  # also matches the quoted form "<EOS>"
    has_special_handling = "SPECIAL_TOKENS" in content
    
    print(f"\nModule Features:")
    print(f"  BPE Tokenizer class: {'✓' if has_bpe else '✗'}")
    print(f"  Text-based special tokens: {'✓' if has_text_tokens else '✗'}")
    print(f"  Special token handling: {'✓' if has_special_handling else '✗'}")
    
    # Show character count
    print(f"\nModule size: {len(content):,} characters")
else:
    print("WARNING: Module file not found!")

Module path: trigram_model.py
Module exists: True

Module Features:
  BPE Tokenizer class: ✗
  Text-based special tokens: ✗
  Special token handling: ✗

Module size: 5,886 characters
