# Chapter 10 - Natural Language Processing with TensorFlow: Language Modeling

This chapter explores language modeling with TensorFlow, focusing on text generation with GRU networks and on decoding techniques such as greedy decoding and beam search.

## 10.1 Language Modeling Fundamentals

**Language Modeling Concepts**:
- Predicting next word in a sequence
- Probability distribution over vocabulary
- N-gram models vs neural models
- Perplexity as evaluation metric
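
To make the idea of a probability distribution over the vocabulary concrete, the following is a minimal sketch of a count-based bigram model. The toy corpus, the function names, and the add-alpha smoothing constant are illustrative assumptions, not part of the chapter's pipeline.

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Count how often each word follows each context word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_distribution(counts, prev, vocab, alpha=1.0):
    """Return P(next | prev) over `vocab` using add-alpha smoothing (assumed here)."""
    total = sum(counts[prev].values()) + alpha * len(vocab)
    return {w: (counts[prev][w] + alpha) / total for w in vocab}

# Hypothetical toy corpus
toy_tokens = "the cat sat on the mat and the dog sat on the rug".split()
toy_vocab = sorted(set(toy_tokens))
bigram_counts = train_bigram_model(toy_tokens)
dist = next_word_distribution(bigram_counts, "the", toy_vocab)

# The probabilities sum to 1 across the whole vocabulary
print(round(sum(dist.values()), 6))
# Words seen after "the" receive more mass than unseen ones
print(dist["cat"], dist["the"])
```

A neural language model plays the same role as `next_word_distribution`, but it conditions on the whole preceding sequence rather than only the previous word.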

**Key Applications**:
- Text generation and completion
- Machine translation
- Speech recognition
- Chatbots and dialogue systems

In [1]:
# Language Modeling Data Processing
import tensorflow as tf
import numpy as np
from collections import Counter

class LanguageModelingProcessor:
    """Data processor for language modeling tasks"""
    
    def __init__(self):
        self.vocab = {}
        self.vocab_size = 0
    
    def create_ngrams(self, text, n=3):
        """Create n-grams from text"""
        tokens = text.split()
        ngrams = []
        
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            ngrams.append(ngram)
        
        return ngrams
    
    def build_vocabulary(self, texts, max_vocab_size=10000):
        """Build vocabulary from text corpus"""
        word_counts = Counter()
        
        for text in texts:
            tokens = text.split()
            word_counts.update(tokens)
        
        most_common = word_counts.most_common(max_vocab_size - 2)
        
        self.vocab = {
            '<PAD>': 0,
            '<UNK>': 1
        }
        
        for word, _ in most_common:
            self.vocab[word] = len(self.vocab)
        
        self.vocab_size = len(self.vocab)
        return self.vocab
    
    def text_to_sequences(self, texts, sequence_length=50):
        """Convert texts to sequences of token IDs"""
        sequences = []
        
        for text in texts:
            tokens = text.split()
            token_ids = [self.vocab.get(token, 1) for token in tokens]
            
            for i in range(len(token_ids) - sequence_length):
                sequences.append(token_ids[i:i + sequence_length + 1])
        
        return sequences

# Test the processor
processor = LanguageModelingProcessor()
sample_text = "The quick brown fox jumps over the lazy dog"
ngrams = processor.create_ngrams(sample_text, n=3)

vocab = processor.build_vocabulary([sample_text])

print("Language modeling data processor created")
print("Sample text:", sample_text)
print("N-grams (n=3):", ngrams)
print("Vocabulary size:", processor.vocab_size)

Language modeling data processor created
Sample text: The quick brown fox jumps over the lazy dog
N-grams (n=3): [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over', 'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog')]
Vocabulary size: 11
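
`text_to_sequences` returns windows of `sequence_length + 1` token IDs. One plausible way to use them for training, sketched below, is to treat the first `sequence_length` IDs as the input and the final ID as the target, which matches a model that outputs a single next-token distribution. The helper names and the toy IDs are assumptions for illustration.

```python
def sliding_windows(token_ids, sequence_length):
    """Mirror the windowing in text_to_sequences: windows of sequence_length + 1 IDs."""
    return [token_ids[i:i + sequence_length + 1]
            for i in range(len(token_ids) - sequence_length)]

def split_inputs_targets(windows):
    """Split each window into (all IDs but the last, the last ID)."""
    inputs = [w[:-1] for w in windows]
    targets = [w[-1] for w in windows]
    return inputs, targets

# Hypothetical token IDs for a seven-word text
toy_ids = [2, 3, 4, 5, 6, 7, 8]
windows = sliding_windows(toy_ids, sequence_length=3)
inputs, targets = split_inputs_targets(windows)

for x, y in zip(inputs, targets):
    print(x, "->", y)
```

Integer targets of this form are what the `sparse_categorical_crossentropy` loss expects.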


## 10.2 GRU Networks for Language Modeling

**GRU Architecture**:
- Gated Recurrent Unit
- Simplified version of LSTM
- Update gate and reset gate
- Efficient training and inference

**Advantages for Language Modeling**:
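- Typically faster to train than an LSTM of the same size
- Good performance on sequence tasks
- Fewer parameters (three sets of weights per layer instead of the LSTM's four)
- Gating mitigates vanishing gradients compared with a plain RNN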
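
The update and reset gates listed above can be sketched as a single GRU time step in NumPy. This is a simplified illustration under stated assumptions: one bias vector per gate, and the convention `h = z * h_prev + (1 - z) * h_cand` (some texts swap the roles of `z` and `1 - z`). Keras's `GRU` layer uses a slightly different bias layout by default, so this is not a drop-in reimplementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(input_dim, units, rng):
    """Create (W, U, b) for the update gate, reset gate, and candidate state."""
    params = []
    for _ in range(3):
        params += [rng.normal(scale=0.1, size=(units, input_dim)),
                   rng.normal(scale=0.1, size=(units, units)),
                   np.zeros(units)]
    return params

def gru_step(x, h_prev, params):
    """One GRU step for a single example: x is (input_dim,), h_prev is (units,)."""
    W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h = params
    z = sigmoid(W_z @ x + U_z @ h_prev + b_z)             # update gate
    r = sigmoid(W_r @ x + U_r @ h_prev + b_r)             # reset gate
    h_cand = np.tanh(W_h @ x + U_h @ (r * h_prev) + b_h)  # candidate state
    return z * h_prev + (1.0 - z) * h_cand                # blend old and new state

rng = np.random.default_rng(0)
params = init_params(input_dim=4, units=3, rng=rng)

h = np.zeros(3)
for x in rng.normal(size=(5, 4)):  # five time steps of random input
    h = gru_step(x, h, params)

print(h.shape)
print(sum(p.size for p in params))  # three weight sets, versus four in an LSTM
```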

In [2]:
# GRU Language Model
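def create_gru_language_model(vocab_size, embedding_dim=256, gru_units=512, sequence_length=None):
    """Create GRU-based language model that predicts the token following the input sequence"""
    
    model = tf.keras.Sequential([
        # sequence_length=None accepts prompts of any length. An explicit Input
        # replaces the Embedding argument input_length, which is deprecated
        # in recent Keras releases.
        tf.keras.Input(shape=(sequence_length,), dtype='int32'),
        
        tf.keras.layers.Embedding(
            input_dim=vocab_size,
            output_dim=embedding_dim
        ),
        
        # First GRU returns the full sequence so the second GRU can consume it
        tf.keras.layers.GRU(
            gru_units,
            return_sequences=True,
            dropout=0.2,
            recurrent_dropout=0.2
        ),
        
        # Second GRU returns only its final state: one vector per input sequence
        tf.keras.layers.GRU(
            gru_units // 2,
            dropout=0.2,
            recurrent_dropout=0.2
        ),
        
        tf.keras.layers.Dense(gru_units // 2, activation='relu'),
        tf.keras.layers.Dropout(0.3),
        
        # Probability distribution over the vocabulary for the next token
        tf.keras.layers.Dense(vocab_size, activation='softmax')
    ])
    
    return model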

# Create and test the model
vocab_size = 10000
gru_model = create_gru_language_model(vocab_size)

print("GRU language model created")
print("Model parameters:", gru_model.count_params())
print("Output shape:", gru_model.output_shape)

GRU language model created
Model parameters: 6969872
Output shape: (None, 10000)


## 10.3 Text Generation with Language Models

**Text Generation Techniques**:
- Greedy decoding
- Beam search
- Temperature sampling
- Top-k sampling
- Nucleus sampling
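
The sampling-based techniques in this list all reshape the model's next-token distribution before drawing from it. The sketch below shows one way to express temperature scaling, top-k filtering, and nucleus (top-p) filtering on a plain list of probabilities. The function names and the toy distribution are assumptions, and `apply_temperature` assumes every probability is strictly positive.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a distribution: T < 1 sharpens it, T > 1 flattens it."""
    logits = [math.log(p) / temperature for p in probs]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

# Hypothetical next-token distribution over a four-word vocabulary
toy_probs = [0.5, 0.3, 0.15, 0.05]
sharpened = apply_temperature(toy_probs, 0.5)
top2 = top_k_filter(toy_probs, 2)
nucleus = nucleus_filter(toy_probs, 0.9)
print(sharpened)
print(top2)
print(nucleus)
```

A token can then be drawn from any of the filtered distributions, for example with `random.choices(range(len(top2)), weights=top2)`. Greedy decoding is the limiting case that always takes the single most probable token.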

**Generation Quality Metrics**:
- Perplexity
- BLEU score
- ROUGE score
- Human evaluation

In [3]:
# Text Generator Class
class TextGenerator:
    """Text generator using trained language model"""
    
    def __init__(self, model, vocab, reverse_vocab):
        self.model = model
        self.vocab = vocab
        self.reverse_vocab = reverse_vocab
    
    def greedy_decode(self, prompt, max_length=50, temperature=1.0):
        """Generate text using greedy decoding"""
        # `temperature` is kept for interface compatibility only: rescaling by a
        # positive constant never changes the argmax, so greedy decoding ignores it.
        
        generated = prompt.lower().split()
        
        # max_length caps the total number of words, prompt included
        while len(generated) < max_length:
            # Convert current sequence to token IDs (1 is the <UNK> ID)
            sequence = [self.vocab.get(word, 1) for word in generated]
            sequence = tf.expand_dims(sequence, 0)
            
            # The model returns a single next-token distribution per sequence:
            # shape (1, vocab_size) -> (vocab_size,)
            predictions = self.model(sequence)
            predictions = tf.squeeze(predictions, 0)
            
            # Get most likely next token
            predicted_id = int(tf.argmax(predictions).numpy())
            
            # Convert back to word; IDs missing from reverse_vocab map to <UNK>
            predicted_word = self.reverse_vocab.get(predicted_id, '<UNK>')
            
            if predicted_word == '<UNK>':
                break
                
            generated.append(predicted_word)
        
        return ' '.join(generated)

# Test text generator
sample_vocab = {'<PAD>': 0, '<UNK>': 1, 'the': 2, 'cat': 3, 'sat': 4, 'on': 5, 'mat': 6, 'and': 7, 'dog': 8, 'ran': 9, 'in': 10, 'park': 11}
reverse_vocab = {v: k for k, v in sample_vocab.items()}

generator = TextGenerator(gru_model, sample_vocab, reverse_vocab)
generated_text = generator.greedy_decode("the cat", max_length=10)

print("Text generator created")
print("Generated text:", generated_text)

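Text generator created
Generated text: (varies with the model weights)

Note: `gru_model` is untrained at this point and its 10,000-way output layer does not match the 12-word `sample_vocab`. Most predicted IDs are therefore missing from `reverse_vocab`, map to `<UNK>`, and end generation almost immediately. Coherent continuations require a trained model whose output size matches the vocabulary.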


## 10.4 Beam Search for Improved Generation

**Beam Search Advantages**:
- Considers multiple possibilities
- Often finds higher-probability sequences than greedy decoding
- Controllable search width
- Beam width trades output quality against computation

**Beam Search Parameters**:
- Beam width (k)
- Length normalization
- Early stopping
- Diversity penalties
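
Length normalization matters because cumulative log-probabilities only decrease as a sequence grows, so raw scores favor short hypotheses. A minimal sketch, assuming the common form of dividing the score by `length ** alpha` (the value `alpha=0.7` and the toy scores are illustrative assumptions):

```python
def length_normalized_score(log_prob_sum, length, alpha=0.7):
    """Divide a cumulative log-probability by length ** alpha."""
    return log_prob_sum / (length ** alpha)

def rank_hypotheses(hypotheses, alpha=0.7):
    """Sort (words, cumulative log-probability) pairs by normalized score, best first."""
    return sorted(
        hypotheses,
        key=lambda h: length_normalized_score(h[1], len(h[0]), alpha),
        reverse=True,
    )

# Hypothetical finished beams: (words, cumulative log-probability)
toy_hypotheses = [
    (["the", "cat"], -2.0),
    (["the", "cat", "sat", "on", "the", "mat"], -3.6),
]

raw_best = max(toy_hypotheses, key=lambda h: h[1])
normalized_best = rank_hypotheses(toy_hypotheses)[0]

print("Raw best:       ", " ".join(raw_best[0]))
print("Normalized best:", " ".join(normalized_best[0]))
```

With `alpha=0` the ranking reduces to the raw score, and with `alpha=1` it becomes the average log-probability per word.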

In [4]:
# Beam Search Implementation
class BeamSearchGenerator:
    """Text generator with beam search"""
    
    def __init__(self, model, vocab, reverse_vocab):
        self.model = model
        self.vocab = vocab
        self.reverse_vocab = reverse_vocab
    
    def beam_search(self, prompt, beam_width=3, max_length=50, temperature=1.0):
        """Generate text using beam search"""
        
        # Initialize beams: (list of words, cumulative log-probability)
        initial_sequence = prompt.lower().split()
        beams = [(initial_sequence, 0.0)]
        
        for step in range(max_length):
            candidates = []
            
            for sequence, score in beams:
                # Convert sequence to token IDs (1 is the <UNK> ID)
                token_ids = [self.vocab.get(word, 1) for word in sequence]
                sequence_tensor = tf.expand_dims(token_ids, 0)
                
                # Next-token probabilities: shape (1, vocab_size) -> (vocab_size,)
                predictions = self.model(sequence_tensor)
                predictions = tf.squeeze(predictions, 0)
                
                # Apply temperature in log space, then renormalize
                log_probs = tf.nn.log_softmax(
                    tf.math.log(predictions + 1e-9) / temperature
                )
                
                # Get top-k predictions
                top_k = tf.math.top_k(log_probs, k=beam_width)
                
                for i in range(beam_width):
                    token_id = int(top_k.indices[i].numpy())
                    token_log_prob = float(top_k.values[i].numpy())
                    
                    predicted_word = self.reverse_vocab.get(token_id, '<UNK>')
                    
                    if predicted_word != '<UNK>':
                        new_sequence = sequence + [predicted_word]
                        new_score = score + token_log_prob
                        candidates.append((new_sequence, new_score))
            
            # Keep the current beams if none could be extended with a known word
            if not candidates:
                break
            
            # Select top beam_width candidates
            candidates.sort(key=lambda x: x[1], reverse=True)
            beams = candidates[:beam_width]
            
            # Early stopping if all beams end with same word
            if len(beams) > 1 and all(beam[0][-1] == beams[0][0][-1] for beam in beams):
                break
        
        # Return best sequence
        best_sequence, best_score = max(beams, key=lambda x: x[1])
        return ' '.join(best_sequence)

# Test beam search
beam_generator = BeamSearchGenerator(gru_model, sample_vocab, reverse_vocab)
beam_result = beam_generator.beam_search("the cat", beam_width=3, max_length=10)

print("Beam search implementation created")
print("Beam search completed with beam_width=3")
print("Top candidate:", beam_result)

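Beam search implementation created
Beam search completed with beam_width=3
Top candidate: (varies with the model weights)

Note: the top candidate is only meaningful for a trained model. With the untrained `gru_model`, the highest-scoring IDs are unlikely to fall inside the 12-word `sample_vocab`, so the search usually cannot extend the prompt.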


## 10.5 Language Model Training and Evaluation

**Training Configuration**:
- Sparse categorical cross-entropy loss (targets are integer token IDs)
- Teacher forcing for training
- Learning rate scheduling
- Gradient clipping
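
Two items in this list, learning rate scheduling and gradient clipping, are simple enough to sketch in plain Python. The step-decay schedule below is an assumed example of a fixed schedule; it is not the loss-driven `ReduceLROnPlateau` callback used later in this section. The clipping function follows the usual clip-by-global-norm idea. All names and constants are illustrative.

```python
import math

def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=3):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients together so their combined L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= max_norm:
        return grads, global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in grads], global_norm

# Learning rate over the first seven epochs
print([step_decay(1e-3, epoch) for epoch in range(7)])

# Hypothetical gradients for two parameter tensors, flattened to lists
toy_grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(toy_grads, max_norm=1.0)
print(norm)
print(clipped)
```

In Keras, a schedule function like this can be passed to the `LearningRateScheduler` callback, and Keras optimizers accept a `clipnorm` argument for norm-based clipping, so hand-written versions are rarely needed in practice.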

**Evaluation Metrics**:
- Perplexity
- BLEU score for text quality
- Diversity metrics
- Human evaluation
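
Perplexity is the exponential of the average negative log-likelihood that the model assigns to the true next tokens. The sketch below computes it from a list of such probabilities; the function name and the toy values are assumptions for illustration.

```python
import math

def perplexity(target_probs):
    """exp(mean negative log-likelihood) of the probabilities given to the true tokens."""
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(nll)

# A model that spreads probability uniformly over 4 words is as uncertain
# as a fair 4-way choice, so its perplexity is about 4
uniform_model = [1 / 4] * 10
print(perplexity(uniform_model))

# A model that is more confident about the correct tokens scores lower
confident_model = [0.9, 0.8, 0.95, 0.7]
print(perplexity(confident_model))
```

Because Keras reports sparse categorical cross-entropy as a mean natural-log loss, the perplexity on a dataset can be obtained as `math.exp(loss)` from the evaluated loss value.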

In [5]:
# Language Model Training Setup
class LanguageModelTrainer:
    """Trainer for language models"""
    
    def __init__(self, learning_rate=1e-3):
        self.learning_rate = learning_rate
    
    def compile_model(self, model):
        """Compile language model"""
        
        optimizer = tf.keras.optimizers.Adam(learning_rate=self.learning_rate)
        
        model.compile(
            optimizer=optimizer,
            loss='sparse_categorical_crossentropy',
            metrics=['accuracy']
        )
        
        return model
    
    def create_callbacks(self):
        """Create training callbacks"""
        
        callbacks = [
            tf.keras.callbacks.EarlyStopping(
                monitor='val_loss',
                patience=5,
                restore_best_weights=True
            ),
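            # Checkpoint on the same quantity that drives early stopping;
            # validation loss also maps directly to perplexity
            tf.keras.callbacks.ModelCheckpoint(
                'best_language_model.h5',
                monitor='val_loss',
                save_best_only=True
            ),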
            tf.keras.callbacks.ReduceLROnPlateau(
                monitor='val_loss',
                factor=0.5,
                patience=3
            )
        ]
        
        return callbacks

# Configure and compile model
trainer = LanguageModelTrainer()
compiled_model = trainer.compile_model(gru_model)
callbacks = trainer.create_callbacks()

print("Language model compiled successfully")
print("Loss:", compiled_model.loss)
print("Optimizer:", type(compiled_model.optimizer).__name__)
print("Metrics:", [metric.name for metric in compiled_model.metrics])

Language model compiled successfully
Loss: sparse_categorical_crossentropy
Optimizer: Adam
Metrics: ['accuracy']


In [6]:
# Complete Language Modeling Pipeline
class LanguageModelingPipeline:
    """End-to-end language modeling pipeline"""
    
    def __init__(self, model, vocab, reverse_vocab):
        self.model = model
        self.vocab = vocab
        self.reverse_vocab = reverse_vocab
        self.greedy_generator = TextGenerator(model, vocab, reverse_vocab)
        self.beam_generator = BeamSearchGenerator(model, vocab, reverse_vocab)
    
    def generate_text(self, prompt, method='greedy', **kwargs):
        """Generate text using specified method"""
        
        if method == 'greedy':
            return self.greedy_generator.greedy_decode(prompt, **kwargs)
        elif method == 'beam_search':
            return self.beam_generator.beam_search(prompt, **kwargs)
        else:
            raise ValueError("Unsupported generation method")
    
    def batch_generate(self, prompts, method='greedy', **kwargs):
        """Generate text for multiple prompts"""
        
        results = []
        for prompt in prompts:
            generated = self.generate_text(prompt, method, **kwargs)
            results.append({
                'prompt': prompt,
                'generated': generated
            })
        
        return results

# Test the complete pipeline
pipeline = LanguageModelingPipeline(gru_model, sample_vocab, reverse_vocab)
generated_story = pipeline.generate_text(
    "once upon a time",
    method='greedy',
    max_length=15
)

print("Language modeling pipeline created")
print("Sample prompt: once upon a time")
print("Generated text:", generated_story)

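Language modeling pipeline created
Sample prompt: once upon a time
Generated text: (varies with the model weights)

Note: none of the words in this prompt appear in `sample_vocab`, so every prompt token is encoded as `<UNK>`. A meaningful continuation requires a trained model and a vocabulary built from the training corpus.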


## Chapter 10 Summary

### Key Concepts Covered:
1. **Language Modeling**: Predicting next words in sequences
2. **GRU Networks**: Efficient recurrent networks for sequence modeling
3. **Text Generation**: Creating coherent text from prompts
4. **Decoding Strategies**: Greedy decoding and beam search
5. **Evaluation Metrics**: Perplexity and text quality assessment

### Technical Achievements:
- **GRU Architecture**: Implemented efficient recurrent networks for language modeling
- **Text Generation**: Built generators with multiple decoding strategies
- **Beam Search**: Implemented advanced search algorithm for better text quality
- **Complete Pipeline**: Created end-to-end language modeling system

### Practical Applications:
- Creative writing assistance
- Chatbot and dialogue systems
- Code completion
- Content generation
- Text summarization

**This chapter provides comprehensive coverage of language modeling with TensorFlow, focusing on text generation using GRU networks and advanced decoding techniques to create coherent and contextually appropriate text.**