# Building an Encoder-Decoder Small LLM with Hugging Face Data

This notebook was run on `Windows 11, Nvidia GPU 4070i 8GB`.

Run `uv sync` in a terminal to install all required packages.

### Import Necessary Libraries

In [1]:
import torch
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from torch.utils.data import DataLoader, Dataset
from torch import nn
from torch.nn import functional as F
from torch.nn.utils.rnn import pad_sequence
from typing import List, Tuple, Dict, Any, Optional, Union
from tqdm.notebook import tqdm
import time
import random
import os

os.makedirs('models', exist_ok=True)

### Load Sample Data from Hugging Face

We'll use the WikiText-2 dataset, a small and widely used dataset for language modeling tasks.

In [2]:
# Load the WikiText-2 dataset from Hugging Face
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

# Extract training and validation text data
train_data = dataset["train"]["text"]
val_data = dataset["validation"]["text"]

In [3]:
train_data[5]

" It met with positive sales in Japan , and was praised by both Japanese and western critics . After release , it received downloadable content , along with an expanded edition in November of that year . It was also adapted into manga and an original video animation series . Due to low sales of Valkyria Chronicles II , Valkyria Chronicles III was not localized , but a fan translation compatible with the game 's expanded edition was released in 2014 . Media.Vision would return to the franchise with the development of Valkyria : Azure Revolution for the PlayStation 4 . \n"

### Preprocess the Data

We need to tokenize the text data. We'll train a Byte Pair Encoding (BPE) tokenizer on the training data and use it to encode both training and validation sets.
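
To make the idea behind BPE concrete, here is a toy sketch of a single merge step in plain Python. It is only an illustration under simplifying assumptions (a made-up corpus of character tuples with invented frequencies, and the hypothetical helpers `most_frequent_pair` and `merge_pair`); it is not how the `tokenizers` library implements training internally.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word is a tuple of characters with a made-up frequency
toy_words = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("l", "o", "t"): 4,
    ("n", "e", "w"): 3,
}

best = most_frequent_pair(toy_words)
toy_words_after = merge_pair(toy_words, best)
print(best)
print(toy_words_after)
```

A BPE trainer repeats this count-and-merge step until the vocabulary reaches the requested size, which is 10,000 in this notebook.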

In [4]:
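# Initialize a BPE tokenizer; unk_token lets symbols never seen in training map to <unk>
tokenizer = Tokenizer(BPE(unk_token="<unk>"))

# Whitespace pre-tokenizer: splits on whitespace and separates punctuation from words
tokenizer.pre_tokenizer = Whitespace()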

# Define special tokens for our model
# <unk>: Unknown token for words not in vocabulary
# <pad>: Padding token for batch processing
# <bos>: Beginning of sequence token
# <eos>: End of sequence token
special_tokens = ["<unk>", "<pad>", "<bos>", "<eos>"]

# Train the tokenizer on the training data with a vocabulary size of 10,000
trainer = BpeTrainer(vocab_size=10000, 
                     special_tokens=special_tokens)
tokenizer.train_from_iterator(train_data, trainer)

# Get token IDs for special tokens for later use
pad_id = tokenizer.token_to_id("<pad>")
unk_id = tokenizer.token_to_id("<unk>")
bos_id = tokenizer.token_to_id("<bos>")
eos_id = tokenizer.token_to_id("<eos>")

# Encode the training and validation data
train_encodings = tokenizer.encode_batch(train_data)
val_encodings = tokenizer.encode_batch(val_data)

### Create Dataset Classes for Sequence-to-Sequence Tasks

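For an encoder-decoder model, we need input-output pairs. From each line with at least `input_length + target_length` tokens, we take a randomly positioned window of that size and split it: the first part is the encoder input and the rest is the decoder target. Shorter lines are skipped.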
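
As a rough illustration of the pairing step, the sketch below picks a window from a list of token IDs and splits it into an encoder input and a decoder target. It is plain Python with a hypothetical helper name (`split_window`) and toy integer IDs, not the dataset class itself.

```python
import random

def split_window(ids, input_length, target_length, rng=random):
    """Pick a random window of input_length + target_length tokens and split it."""
    window = input_length + target_length
    if len(ids) < window:
        return None  # too short to form a pair
    start = rng.randint(0, len(ids) - window)  # randint is inclusive on both ends
    chunk = ids[start:start + window]
    return chunk[:input_length], chunk[input_length:]

toy_ids = list(range(10))  # stand-in for token IDs
pair = split_window(toy_ids, input_length=3, target_length=2, rng=random.Random(0))
print(pair)
```

The target always starts right where the input ends, so the model learns to continue the text it has just encoded.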

In [5]:
class Seq2SeqDataset(Dataset):
    """
    Dataset for sequence-to-sequence tasks that provides encoder input and decoder target pairs.
    
    Attributes:
        encodings: List of tokenizer.Encoding objects containing token IDs
        max_length: Maximum sequence length to consider (truncates longer sequences)
        input_length: Length of the input sequence for the encoder
        target_length: Length of the target sequence for the decoder
    """
    def __init__(self, encodings: List, 
                 max_length: Optional[int] = None,
                 input_length: int = 64, 
                 target_length: int = 64):
        """
        Initialize the dataset with encoded text.
        
        Args:
            encodings: List of tokenizer.Encoding objects from the tokenizer
            max_length: Maximum sequence length (optional)
            input_length: Length of input sequences
            target_length: Length of target sequences
        """
        self.encodings = encodings
        self.max_length = max_length
        self.input_length = input_length
        self.target_length = target_length
        
        # Filter out sequences that are too short
        min_length = input_length + target_length
        self.valid_indices = [i for i, enc in enumerate(encodings) 
                             if len(enc.ids) >= min_length]

    def __len__(self) -> int:
        """Return the number of valid input-target pairs in the dataset."""
        return len(self.valid_indices)

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Get an input-target pair by index.
        
        Args:
            idx: Index of the pair to retrieve
            
        Returns:
            Tuple of (input_tensor, target_tensor)
        """
        # Get the real index from valid indices
        real_idx = self.valid_indices[idx]
        
        # Get token IDs for the sequence
        ids = self.encodings[real_idx].ids
        
        # Truncate if necessary
        if self.max_length and len(ids) > self.max_length:
            # Randomly select a starting point for the subsequence
            max_start = len(ids) - self.max_length
            start_idx = random.randint(0, max_start)
            ids = ids[start_idx:start_idx + self.max_length]
        
        # For sequences longer than input_length + target_length, randomly select a split point
        if len(ids) > self.input_length + self.target_length:
            max_start = len(ids) - (self.input_length + self.target_length)
            start_idx = random.randint(0, max_start)
            ids = ids[start_idx:start_idx + self.input_length + self.target_length]
        
        # Split into input and target
        input_ids = ids[:self.input_length]
        target_ids = ids[self.input_length:self.input_length + self.target_length]
        
        # Convert to tensors
        input_tensor = torch.tensor(input_ids, dtype=torch.long)
        target_tensor = torch.tensor(target_ids, dtype=torch.long)
            
        return input_tensor, target_tensor

In [6]:
# Custom collate function to handle variable length sequences in batches
def collate_batch(batch: List[Tuple[torch.Tensor, torch.Tensor]]) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Custom collate function for DataLoader that pads sequences to the same length.
    
    Args:
        batch: List of (input_tensor, target_tensor) pairs
        
    Returns:
        Tuple of (padded_inputs, padded_targets)
    """
    # Separate inputs and targets
    inputs = [item[0] for item in batch]
    targets = [item[1] for item in batch]
    
    # Add BOS and EOS tokens to each sequence
    inputs = [torch.cat([torch.tensor([bos_id]), seq, torch.tensor([eos_id])]) for seq in inputs]
    targets = [torch.cat([torch.tensor([bos_id]), seq, torch.tensor([eos_id])]) for seq in targets]
    
    # Pad sequences to the same length
    padded_inputs = pad_sequence(inputs, batch_first=True, padding_value=pad_id)
    padded_targets = pad_sequence(targets, batch_first=True, padding_value=pad_id)
    
    return padded_inputs, padded_targets
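
To see what this collate step produces, here is a minimal plain-Python sketch of the same idea: wrap each sequence in BOS/EOS and right-pad to the longest one. The helper name `pad_with_specials` and the ID values are hypothetical and chosen only for the example.

```python
def pad_with_specials(seqs, bos_id, eos_id, pad_id):
    """Wrap each sequence in BOS/EOS, then right-pad to the longest length."""
    wrapped = [[bos_id] + list(seq) + [eos_id] for seq in seqs]
    longest = max(len(seq) for seq in wrapped)
    return [seq + [pad_id] * (longest - len(seq)) for seq in wrapped]

# Hypothetical IDs chosen only for this example
toy_batch = pad_with_specials([[10, 11, 12], [20]], bos_id=2, eos_id=3, pad_id=1)
for row in toy_batch:
    print(row)
```

With the settings used in this notebook every input and target is exactly 64 tokens long, so each batch row has 66 tokens after BOS/EOS are added and `pad_sequence` has nothing to pad. The padding logic only matters if variable-length sequences are introduced later.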

In [7]:
# Set sequence lengths for encoder-decoder model
input_length = 64  # Length of input sequence for encoder
target_length = 64  # Length of target sequence for decoder
max_length = input_length + target_length  # Maximum total sequence length

# Create datasets for training and validation
train_dataset = Seq2SeqDataset(
    train_encodings, 
    max_length=max_length, 
    input_length=input_length, 
    target_length=target_length
)

val_dataset = Seq2SeqDataset(
    val_encodings, 
    max_length=max_length, 
    input_length=input_length, 
    target_length=target_length
)

print(f"Number of training sequences: {len(train_dataset)}")
print(f"Number of validation sequences: {len(val_dataset)}")

Number of training sequences: 9173
Number of validation sequences: 986


### Define the Encoder-Decoder Model Architecture

We'll define a transformer-based encoder-decoder model with separate encoder and decoder components, plus cross-attention to connect them.

In [8]:
class EncoderDecoderLLM(nn.Module):
    """
    An encoder-decoder language model based on the transformer architecture.
    
    The model consists of an embedding layer, transformer encoder layers,
    transformer decoder layers with cross-attention, and a final linear layer 
    that projects to vocabulary size for token prediction.
    
    Attributes:
        embedding: Token embedding layer shared between encoder and decoder
        encoder_pos_embedding: Positional encoding for encoder
        decoder_pos_embedding: Positional encoding for decoder
        encoder: Transformer encoder layers
        decoder: Transformer decoder layers with cross-attention
        fc: Final linear layer for token prediction
        dropout: Dropout layer for regularization
    """
    def __init__(self, 
                 vocab_size: int, 
                 hidden_size: int, 
                 num_encoder_layers: int, 
                 num_decoder_layers: int,
                 num_heads: int,
                 max_seq_length: int = 512,
                 dropout: float = 0.1):
        """
        Initialize the encoder-decoder language model.
        
        Args:
            vocab_size: Size of the vocabulary
            hidden_size: Size of the hidden layers
            num_encoder_layers: Number of transformer encoder layers
            num_decoder_layers: Number of transformer decoder layers
            num_heads: Number of attention heads in each transformer layer
            max_seq_length: Maximum sequence length for positional embeddings
            dropout: Dropout probability for regularization
        """
        super(EncoderDecoderLLM, self).__init__()
        
        # Token embedding layer (shared between encoder and decoder)
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        
        # Separate positional embeddings for encoder and decoder
        self.encoder_pos_embedding = nn.Parameter(torch.zeros(1, max_seq_length, hidden_size))
        self.decoder_pos_embedding = nn.Parameter(torch.zeros(1, max_seq_length, hidden_size))
        
        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, 
            nhead=num_heads, 
            batch_first=True,
            dropout=dropout
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_encoder_layers)
        
        # Transformer decoder with cross-attention to encoder outputs
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=hidden_size, 
            nhead=num_heads, 
            batch_first=True,
            dropout=dropout
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_decoder_layers)
        
        # Final linear layer for token prediction
        self.fc = nn.Linear(hidden_size, vocab_size)
        
        # Dropout for regularization
        self.dropout = nn.Dropout(dropout)
        
        # Initialize parameters
        self.init_weights()
        
    def init_weights(self) -> None:
        """
        Initialize model weights for better training convergence.
        """
        initrange = 0.1
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()
        self.fc.weight.data.uniform_(-initrange, initrange)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        """
        Forward pass through the encoder-decoder model.
        
        Args:
            src: Input tensor for encoder with shape (batch_size, src_seq_length)
            tgt: Input tensor for decoder with shape (batch_size, tgt_seq_length)
            
        Returns:
            Logits tensor with shape (batch_size, tgt_seq_length, vocab_size)
        """
        # Create attention masks for padding tokens
        src_padding_mask = (src == pad_id)  # [batch_size, src_seq_len]
        tgt_padding_mask = (tgt == pad_id)  # [batch_size, tgt_seq_len]
        
        # Get sequence lengths
        src_seq_len = src.size(1)  # Source sequence length
        tgt_seq_len = tgt.size(1)  # Target sequence length
        
        # Create causal mask for decoder (to prevent looking at future tokens)
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_seq_len).to(tgt.device)
        
        # Embed tokens and add positional information
        # [batch_size, src_seq_len, hidden_size]
        src_emb = self.embedding(src) + self.encoder_pos_embedding[:, :src_seq_len, :]
        
        # [batch_size, tgt_seq_len, hidden_size]
        tgt_emb = self.embedding(tgt) + self.decoder_pos_embedding[:, :tgt_seq_len, :]
        
        # Apply dropout
        src_emb = self.dropout(src_emb)
        tgt_emb = self.dropout(tgt_emb)
        
        # Encoder forward pass
        # memory: [batch_size, src_seq_len, hidden_size]
        memory = self.encoder(src_emb, src_key_padding_mask=src_padding_mask)
        
        # Decoder forward pass with cross-attention to encoder outputs
        # output: [batch_size, tgt_seq_len, hidden_size]
        output = self.decoder(
            tgt_emb,                      # Input to decoder
            memory,                       # Memory from encoder
            tgt_mask=tgt_mask,           # Causal mask
            memory_key_padding_mask=src_padding_mask,  # Encoder padding mask
            tgt_key_padding_mask=tgt_padding_mask      # Decoder padding mask
        )
        
        # Project to vocabulary size
        # [batch_size, tgt_seq_len, vocab_size]
        output = self.fc(output)
        
        return output
    
    def encode(self, src: torch.Tensor) -> torch.Tensor:
        """
        Encode the source sequence.
        
        Args:
            src: Input tensor for encoder with shape (batch_size, src_seq_length)
            
        Returns:
            Encoder output tensor with shape (batch_size, src_seq_length, hidden_size)
        """
        # Create attention mask for padding tokens
        src_padding_mask = (src == pad_id)
        
        # Get sequence length
        src_seq_len = src.size(1)
        
        # Embed tokens and add positional information
        src_emb = self.embedding(src) + self.encoder_pos_embedding[:, :src_seq_len, :]
        src_emb = self.dropout(src_emb)
        
        # Encoder forward pass
        memory = self.encoder(src_emb, src_key_padding_mask=src_padding_mask)
        
        return memory
    
    def decode(self, memory: torch.Tensor, tgt: torch.Tensor, 
               src_padding_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        """
        Decode with encoder memory and target sequence.
        
        Args:
            memory: Encoder output with shape (batch_size, src_seq_length, hidden_size)
            tgt: Input tensor for decoder with shape (batch_size, tgt_seq_length)
            src_padding_mask: Optional mask for encoder padding 
            
        Returns:
            Logits tensor with shape (batch_size, tgt_seq_length, vocab_size)
        """
        # Create attention masks
        tgt_padding_mask = (tgt == pad_id)
        tgt_seq_len = tgt.size(1)
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_seq_len).to(tgt.device)
        
        # Embed tokens and add positional information
        tgt_emb = self.embedding(tgt) + self.decoder_pos_embedding[:, :tgt_seq_len, :]
        tgt_emb = self.dropout(tgt_emb)
        
        # Decoder forward pass
        output = self.decoder(
            tgt_emb, 
            memory, 
            tgt_mask=tgt_mask,
            memory_key_padding_mask=src_padding_mask,
            tgt_key_padding_mask=tgt_padding_mask
        )
        
        # Project to vocabulary size
        output = self.fc(output)
        
        return output
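
The causal mask used in `forward` and `decode` can be pictured with a small plain-Python sketch. It assumes the additive float convention (0.0 where attention is allowed, `-inf` where it is blocked), which is the form `generate_square_subsequent_mask` returns; the helper name `causal_mask` is made up for this example.

```python
def causal_mask(size):
    """Return a size x size additive mask: 0.0 = may attend, -inf = blocked."""
    neg_inf = float("-inf")
    return [[0.0 if col <= row else neg_inf for col in range(size)]
            for row in range(size)]

for row in causal_mask(4):
    print(row)
```

Row `i` is the query position and column `j` the key position, so each decoder position can attend to itself and to earlier positions only.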

### Set Hyperparameters

In [9]:
# Model hyperparameters
vocab_size = tokenizer.get_vocab_size()  # Size of the tokenizer's vocabulary
hidden_size = 256  # Size of the hidden layers and embeddings
num_encoder_layers = 3  # Number of encoder transformer layers 
num_decoder_layers = 3  # Number of decoder transformer layers
num_heads = 8      # Number of attention heads
max_seq_len = 128  # Maximum sequence length for the model

# Training hyperparameters
batch_size = 32    # Batch size for training
learning_rate = 5e-4  # Learning rate for the optimizer
num_epochs = 5     # Number of training epochs
weight_decay = 0.01  # L2 regularization for the optimizer
gradient_clipping = 1.0  # Gradient clipping value to prevent exploding gradients

print(f"Vocabulary size: {vocab_size}")
print(f"Hidden size: {hidden_size}")
print(f"Number of encoder layers: {num_encoder_layers}")
print(f"Number of decoder layers: {num_decoder_layers}")
print(f"Number of attention heads: {num_heads}")
print(f"Batch size: {batch_size}")
print(f"Learning rate: {learning_rate}")

Vocabulary size: 10000
Hidden size: 256
Number of encoder layers: 3
Number of decoder layers: 3
Number of attention heads: 8
Batch size: 32
Learning rate: 0.0005


### Create Model, Optimizer, and Data Loaders

In [10]:
# Check if CUDA is available and set the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Create the model instance
model = EncoderDecoderLLM(
    vocab_size=vocab_size, 
    hidden_size=hidden_size, 
    num_encoder_layers=num_encoder_layers,
    num_decoder_layers=num_decoder_layers,
    num_heads=num_heads,
    max_seq_length=max_seq_len,
    dropout=0.1
).to(device)

# Set up the AdamW optimizer with weight decay
optimizer = torch.optim.AdamW(
    model.parameters(), 
    lr=learning_rate,
    weight_decay=weight_decay
)

# Create data loaders for training and validation with our custom collate function
train_loader = DataLoader(
    train_dataset, 
    batch_size=batch_size, 
    shuffle=True,
    collate_fn=collate_batch
)

val_loader = DataLoader(
    val_dataset, 
    batch_size=batch_size,
    collate_fn=collate_batch
)

# Learning rate scheduler: halve the LR when validation loss stops improving
# (the `verbose` argument is omitted because recent PyTorch releases deprecate it)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, 
    mode='min', 
    factor=0.5, 
    patience=1
)

Using device: cuda




### Helper Functions for Evaluation and Training

In [11]:
def calculate_perplexity(loss: float) -> float:
    """
    Calculate perplexity from the loss value.
    
    Perplexity is a measure of how well a language model predicts a sample,
    calculated as the exponential of the cross-entropy loss.
    
    Args:
        loss: Cross-entropy loss value
        
    Returns:
        Perplexity value
    """
    return torch.exp(torch.tensor(loss)).item()

def evaluate(model: nn.Module, 
             data_loader: DataLoader, 
             device: torch.device,
             pad_id: int,
             vocab_size: int) -> Tuple[float, float]:
    """
    Evaluate the encoder-decoder model on a dataset.
    
    Args:
        model: The model to evaluate
        data_loader: DataLoader for the evaluation dataset
        device: Device to use for computation
        pad_id: ID of the padding token to ignore in loss calculation
        vocab_size: Size of the vocabulary
        
    Returns:
        Tuple of (average loss, perplexity)
    """
    model.eval()  # Set model to evaluation mode
    total_loss = 0.0
    total_tokens = 0
    
    with torch.no_grad():  # Disable gradient computation for efficiency
        for src, tgt in tqdm(data_loader, desc="Evaluating"):
            # Move batch to device
            src = src.to(device)  # [batch_size, src_seq_len]
            tgt = tgt.to(device)  # [batch_size, tgt_seq_len]
            
            # Get decoder input and target sequences
            # Input: all tokens except the last one
            decoder_input = tgt[:, :-1]  # [batch_size, tgt_seq_len-1]
            # Target: all tokens except the first one
            target_seq = tgt[:, 1:]    # [batch_size, tgt_seq_len-1]
            
            # Create mask to ignore padding tokens
            padding_mask = (target_seq != pad_id)  # [batch_size, tgt_seq_len-1]
            
            # Forward pass
            output = model(src, decoder_input)  # [batch_size, tgt_seq_len-1, vocab_size]
            
            # Compute loss (ignoring padding tokens)
            loss = F.cross_entropy(
                output.reshape(-1, vocab_size),  # [batch_size*(tgt_seq_len-1), vocab_size]
                target_seq.reshape(-1),         # [batch_size*(tgt_seq_len-1)]
                ignore_index=pad_id,
                reduction='sum'
            )
            
            # Count non-padding tokens
            num_tokens = padding_mask.sum().item()
            
            # Accumulate statistics
            total_loss += loss.item()
            total_tokens += num_tokens
    
    # Calculate average loss and perplexity
    avg_loss = total_loss / total_tokens
    perplexity = calculate_perplexity(avg_loss)
    
    return avg_loss, perplexity
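
For intuition about the perplexity numbers, the following plain-Python sketch computes cross-entropy and perplexity for a hypothetical model that guesses uniformly over a 10,000-token vocabulary. The helper names are invented for this example and it uses `math` rather than PyTorch.

```python
import math

def cross_entropy(probs_of_correct_tokens):
    """Average negative log-probability assigned to the correct tokens."""
    total = -sum(math.log(p) for p in probs_of_correct_tokens)
    return total / len(probs_of_correct_tokens)

def perplexity(loss):
    """Perplexity is the exponential of the per-token cross-entropy."""
    return math.exp(loss)

# A model that guesses uniformly over a 10,000-token vocabulary
uniform_loss = cross_entropy([1 / 10_000] * 5)
print(round(uniform_loss, 4), round(perplexity(uniform_loss), 2))
```

A uniform guesser has perplexity equal to the vocabulary size, so a perplexity of a few hundred means the model is, on average, about as uncertain as if it were choosing among a few hundred equally likely tokens.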

### Train the Model

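We implement a training loop for the encoder-decoder model. The encoder processes the input sequence, and the decoder is trained with teacher forcing: it receives the target sequence shifted right by one token and predicts the next token at each position.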
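
The shift between decoder input and prediction target is easy to get wrong, so here is a small plain-Python sketch of it on a single padded target row. The special-token IDs are hypothetical and picked only for this example.

```python
BOS, EOS, PAD = 2, 3, 1  # hypothetical special-token IDs for this example

tgt = [BOS, 40, 41, 42, EOS, PAD, PAD]  # one padded target row from a batch

decoder_input = tgt[:-1]  # what the decoder sees at each step
labels = tgt[1:]          # what it should predict at each step

# Only non-padding labels contribute to the loss
scored = [(seen, want) for seen, want in zip(decoder_input, labels) if want != PAD]
print(scored)
```

Each pair reads as "having seen this token (and everything before it), predict that one"; positions whose label is padding are skipped, which is what `ignore_index=pad_id` does in the loss.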

In [12]:
# Training loop
best_val_loss = float('inf')
training_stats = []

for epoch in range(num_epochs):
    epoch_start_time = time.time()
    model.train()  # Set model to training mode
    total_loss = 0.0
    total_tokens = 0
    
    # Initialize progress bar for the training loop
    progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}")
    
    for batch_idx, (src, tgt) in enumerate(progress_bar):
        # Move batch to device
        src = src.to(device)  # [batch_size, src_seq_len]
        tgt = tgt.to(device)  # [batch_size, tgt_seq_len]
        
        # Get decoder input and target sequences
        # Decoder input: all tokens except the last one (shifted right)
        decoder_input = tgt[:, :-1]  # [batch_size, tgt_seq_len-1]
        # Target: all tokens except the first one
        target_seq = tgt[:, 1:]    # [batch_size, tgt_seq_len-1]
        
        # Create mask to ignore padding tokens in the loss calculation
        padding_mask = (target_seq != pad_id)  # [batch_size, tgt_seq_len-1]
        
        # Forward pass
        optimizer.zero_grad()  # Clear gradients
        
        # output: [batch_size, tgt_seq_len-1, vocab_size]
        output = model(src, decoder_input)
        
        # Compute loss (ignoring padding tokens)
        loss = F.cross_entropy(
            output.reshape(-1, vocab_size),  # [batch_size*(tgt_seq_len-1), vocab_size]
            target_seq.reshape(-1),          # [batch_size*(tgt_seq_len-1)]
            ignore_index=pad_id,            # Ignore padding tokens
            reduction='sum'                 # Sum losses for proper token counting
        )
        
        # Count non-padding tokens for proper loss scaling
        num_tokens = padding_mask.sum().item()
        
        # Scale loss by number of tokens for better comparability
        scaled_loss = loss / num_tokens
        
        # Backpropagation
        scaled_loss.backward()
        
        # Gradient clipping to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), gradient_clipping)
        
        # Update model weights
        optimizer.step()
        
        # Accumulate statistics
        total_loss += loss.item()
        total_tokens += num_tokens
        
        # Update progress bar with current loss
        current_loss = loss.item() / num_tokens
        current_perplexity = calculate_perplexity(current_loss)
        progress_bar.set_postfix({
            'loss': f"{current_loss:.4f}",
            'ppl': f"{current_perplexity:.2f}"
        })
    
    # Calculate average loss and perplexity for the epoch
    avg_loss = total_loss / total_tokens
    perplexity = calculate_perplexity(avg_loss)
    
    # Evaluate on validation set
    val_loss, val_perplexity = evaluate(model, val_loader, device, pad_id, vocab_size)
    
    # Adjust learning rate based on validation performance
    scheduler.step(val_loss)
    
    # Calculate epoch time
    epoch_time = time.time() - epoch_start_time
    
    # Print epoch statistics
    print(f"Epoch {epoch+1}/{num_epochs} | Time: {epoch_time:.1f}s")
    print(f"Train Loss: {avg_loss:.4f} | Train Perplexity: {perplexity:.2f}")
    print(f"Val Loss: {val_loss:.4f} | Val Perplexity: {val_perplexity:.2f}")
    print("-" * 60)
    
    # Save statistics
    training_stats.append({
        'epoch': epoch + 1,
        'train_loss': avg_loss,
        'train_perplexity': perplexity,
        'val_loss': val_loss,
        'val_perplexity': val_perplexity,
        'epoch_time': epoch_time
    })
    
    # Save best model
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'models/best_encoder_decoder_model.pt')
        print("Saved best model!")

# Load the best model for evaluation and generation
model.load_state_dict(torch.load('models/best_encoder_decoder_model.pt'))
print("Training complete!")

Epoch 1/5:   0%|          | 0/287 [00:00<?, ?it/s]



Evaluating:   0%|          | 0/31 [00:00<?, ?it/s]



Epoch 1/5 | Time: 31.5s
Train Loss: 6.7271 | Train Perplexity: 834.72
Val Loss: 6.1930 | Val Perplexity: 489.31
------------------------------------------------------------
Saved best model!


Epoch 2/5:   0%|          | 0/287 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/31 [00:00<?, ?it/s]

Epoch 2/5 | Time: 26.6s
Train Loss: 5.9199 | Train Perplexity: 372.37
Val Loss: 5.8354 | Val Perplexity: 342.19
------------------------------------------------------------
Saved best model!


Epoch 3/5:   0%|          | 0/287 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/31 [00:00<?, ?it/s]

Epoch 3/5 | Time: 18.9s
Train Loss: 5.5385 | Train Perplexity: 254.30
Val Loss: 5.6402 | Val Perplexity: 281.52
------------------------------------------------------------
Saved best model!


Epoch 4/5:   0%|          | 0/287 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/31 [00:00<?, ?it/s]

Epoch 4/5 | Time: 18.0s
Train Loss: 5.2856 | Train Perplexity: 197.48
Val Loss: 5.5315 | Val Perplexity: 252.51
------------------------------------------------------------
Saved best model!


Epoch 5/5:   0%|          | 0/287 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/31 [00:00<?, ?it/s]

Epoch 5/5 | Time: 18.2s
Train Loss: 5.0895 | Train Perplexity: 162.31
Val Loss: 5.4357 | Val Perplexity: 229.46
------------------------------------------------------------
Saved best model!
Training complete!


### Evaluate the Model

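We evaluate the best saved encoder-decoder model on the validation set by calculating loss and perplexity. Because `Seq2SeqDataset` picks a random window from lines longer than `max_length` on every access, the result can differ slightly from the validation numbers printed during training.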

In [13]:
# Final evaluation on validation set
val_loss, val_perplexity = evaluate(model, val_loader, device, pad_id, vocab_size)
print(f"Final Validation Loss: {val_loss:.4f}")
print(f"Final Validation Perplexity: {val_perplexity:.2f}")

Evaluating:   0%|          | 0/31 [00:00<?, ?it/s]

Final Validation Loss: 5.4563
Final Validation Perplexity: 234.22


### Generate Text with the Encoder-Decoder Model

We use the trained encoder-decoder model to generate text from a prompt.

In [14]:
def generate_text(model: nn.Module, 
                  tokenizer: Tokenizer, 
                  prompt: str, 
                  max_length: int = 50, 
                  temperature: float = 1.0,
                  top_k: Optional[int] = None,
                  device: torch.device = device) -> str:
    """
    Generate text continuation from a prompt using the trained encoder-decoder model.
    
    Args:
        model: The trained encoder-decoder model
        tokenizer: Tokenizer used to encode/decode text
        prompt: Starting text as input to the encoder
        max_length: Maximum number of tokens to generate
        temperature: Controls randomness (lower = more deterministic)
        top_k: If set, sample from top k most likely tokens
        device: Device to use for computation
        
    Returns:
        Generated text string
    """
    model.eval()  # Set model to evaluation mode
    
    # Encode the prompt for the encoder
    encoder_input_ids = tokenizer.encode(prompt).ids
    
    # Add BOS token if not present (also handles an empty prompt)
    if not encoder_input_ids or encoder_input_ids[0] != bos_id:
        encoder_input_ids = [bos_id] + encoder_input_ids
    
    # Add EOS token if not present
    if encoder_input_ids[-1] != eos_id:
        encoder_input_ids = encoder_input_ids + [eos_id]
    
    # Convert to tensor and move to device
    encoder_input = torch.tensor([encoder_input_ids], dtype=torch.long, device=device)
    
    # Start with just the BOS token for the decoder input
    decoder_input = torch.tensor([[bos_id]], dtype=torch.long, device=device)
    
    with torch.no_grad():  # Disable gradient computation
        # Encode the input sequence once (encoder forward pass)
        memory = model.encode(encoder_input)
        src_padding_mask = (encoder_input == pad_id)
        
        # Generate tokens one by one
        for _ in range(max_length):
            # Decode with the current decoder input
            output = model.decode(memory, decoder_input, src_padding_mask)
            
            # Get logits for the next token (last position)
            next_token_logits = output[:, -1, :] / temperature
            
            # Apply top-k sampling if specified
            if top_k is not None:
                # Get top k logits and their indices
                top_k_logits, top_k_indices = torch.topk(next_token_logits, k=top_k, dim=-1)
                
                # Create a mask of the same shape as logits, filled with -inf
                next_token_logits = torch.full_like(next_token_logits, float('-inf'))
                
                # Fill in the top-k logits
                next_token_logits.scatter_(1, top_k_indices, top_k_logits)
            
            # Convert logits to probabilities using softmax
            probabilities = F.softmax(next_token_logits, dim=-1)
            
            # Sample from the distribution
            next_token = torch.multinomial(probabilities, num_samples=1)
            
            # Stop if EOS token is generated
            if next_token.item() == eos_id:
                break
                
            # Add the next token to the decoder input
            decoder_input = torch.cat([decoder_input, next_token], dim=1)
    
    # Get all generated tokens (excluding the initial BOS token)
    generated_ids = decoder_input[0, 1:].tolist()  # Skip the first BOS token
    
    # Decode tokens back to text
    generated_text = tokenizer.decode(generated_ids)
    
    return generated_text
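
The sampling step inside `generate_text` can be illustrated in isolation. The sketch below applies temperature scaling and top-k filtering to a toy list of logits in plain Python; `sample_top_k` is a hypothetical helper written for this example and uses `random.choices` instead of `torch.multinomial`.

```python
import math
import random

def sample_top_k(logits, temperature=1.0, top_k=None, rng=random):
    """Sample an index from logits after temperature scaling and optional top-k filtering."""
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        # Keep the k largest logits (ties at the cutoff are kept too);
        # everything else gets probability zero
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    peak = max(scaled)
    # Unnormalized softmax weights, shifted by the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)
draws = [sample_top_k(toy_logits, temperature=0.7, top_k=2, rng=rng) for _ in range(20)]
print(draws)
```

With `top_k=2` only the two highest-scoring indices can ever be drawn. Lowering the temperature sharpens the distribution toward the top candidate, while raising it spreads probability more evenly across the candidates that survive the filter.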

In [15]:
# Generate text with different prompts and parameters
test_prompts = [
    "Once upon a time",
    "The history of artificial intelligence",
    "In recent years, researchers have"
]

for prompt in test_prompts:
    print(f"\nPrompt: {prompt}")
    
    # Generate with different temperatures
    for temp in [0.7, 1.0, 1.3]:
        generated = generate_text(
            model=model,
            tokenizer=tokenizer,
            prompt=prompt,
            max_length=50,
            temperature=temp,
            top_k=40
        )
        print(f"\nTemperature {temp}:\n{generated}")
    print("-" * 80)


Prompt: Once upon a time

Temperature 0.7:
his former team to support the team ' s assistant of all over two games . He was the first time since his first time because he had been in the season in the second season . On November 2010 , he was in the second time against his second

Temperature 1.0:
was un impressed with both his wife , he said that the time his first time he was at the final was the year as part of his first time . Two years later he wrote that he was an to make the victory his father , he finished ,

Temperature 1.3:
' s death , it was used in his debut , while his fourth final game at a second half @-@ season club , but would return after . On 4 he was then left in the record , he received him to get his wife ' s mother at
--------------------------------------------------------------------------------

Prompt: The history of artificial intelligence

Temperature 0.7:
the film because he said that he was " a little effect of the " . He was a game of " from " I ' ve ' 

### Save Model and Tokenizer

In [16]:
# Save the model weights
torch.save(model.state_dict(), 'models/encoder_decoder_llm_model.pt')

# Save the tokenizer
tokenizer.save("models/tokenizer.json")

print("Model and tokenizer saved successfully!")

Model and tokenizer saved successfully!


### Function to Load Model and Tokenizer

In [17]:
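def load_model_and_tokenizer(model_path: str, 
                             tokenizer_path: str, 
                             device: Optional[torch.device] = None) -> Tuple[nn.Module, Tokenizer]:
    """
    Load a saved encoder-decoder model and tokenizer.
    
    Args:
        model_path: Path to the saved model weights
        tokenizer_path: Path to the saved tokenizer
        device: Device to load the model on (defaults to CUDA if available, else CPU)
        
    Returns:
        Tuple of (loaded model, loaded tokenizer)
    """
    # Fall back to CPU when no GPU is available
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    # Load tokenizer
    tokenizer = Tokenizer.from_file(tokenizer_path)
    
    # Get vocabulary size
    vocab_size = tokenizer.get_vocab_size()
    
    # Create model instance with the same architecture.
    # The hyperparameters below are read from the notebook's global scope
    # and must match the values used during training.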
    model = EncoderDecoderLLM(
        vocab_size=vocab_size,
        hidden_size=hidden_size,
        num_encoder_layers=num_encoder_layers,
        num_decoder_layers=num_decoder_layers,
        num_heads=num_heads,
        max_seq_length=max_seq_len
    ).to(device)
    
    # Load model weights
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()
    
    return model, tokenizer
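
Because the saved state dict holds only weights, `load_model_and_tokenizer` rebuilds the model from `hidden_size`, `num_heads`, and the other values still defined in the notebook. One possible way to make the checkpoint usable in a fresh session is to store those hyperparameters in a small JSON file next to the weights. The sketch below is an assumption, not part of the notebook: `save_config` and `load_config` are hypothetical helpers, the file name is arbitrary, and the numbers are placeholders rather than the values used in training.

```python
import json
import os
import tempfile
from typing import Any, Dict


def save_config(config: Dict[str, Any], path: str) -> None:
    """Write architecture hyperparameters to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)


def load_config(path: str) -> Dict[str, Any]:
    """Read architecture hyperparameters back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Placeholder values for illustration only; substitute the values used in training.
example_config = {
    "hidden_size": 256,
    "num_encoder_layers": 2,
    "num_decoder_layers": 2,
    "num_heads": 4,
    "max_seq_length": 128,
}

# A temporary directory keeps the example from touching the real models/ folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "model_config.json")
    save_config(example_config, config_path)
    restored = load_config(config_path)

print(restored == example_config)
```

The keys mirror the keyword arguments passed to `EncoderDecoderLLM` above, so a restored dictionary could in principle be unpacked into the constructor together with `vocab_size`.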

In [18]:
# Test loading the model and tokenizer
loaded_model, loaded_tokenizer = load_model_and_tokenizer(
    model_path="models/encoder_decoder_llm_model.pt",
    tokenizer_path="models/tokenizer.json"
)

# Generate text with the loaded model
test_prompt = "The future of technology is"
generated_text = generate_text(
    model=loaded_model,
    tokenizer=loaded_tokenizer,
    prompt=test_prompt,
    max_length=50
)

print(f"Generated with loaded model:\n{generated_text}")

Generated with loaded model:
Mac Donald did not been erected in southeastern one as the world as well as . A Lo oking cultural tend ency also move as E ps of Saint scholar Henry M au is currently in the 2014 S of He also released viewed the UK . A ction compl


## Conclusion

In this notebook, we've built a small-scale encoder-decoder language model using the transformer architecture. Here's a summary of what we've accomplished:

1. **Data Preparation**:
   - Loaded the WikiText-2 dataset
   - Trained a BPE tokenizer to convert text to token IDs
   - Created a custom seq2seq dataset and dataloader, with padding to handle variable sequence lengths

2. **Encoder-Decoder Architecture**:
   - Implemented a complete transformer-based encoder-decoder model
   - Added cross-attention in the decoder to attend to encoder outputs
   - Used separate positional embeddings for encoder and decoder
   - Implemented causal masking in the decoder to prevent looking at future tokens

3. **Training**:
   - Trained the model using teacher forcing for the decoder
   - Used learning rate scheduling to improve convergence
   - Implemented gradient clipping to prevent exploding gradients
   - Saved the best model based on validation performance

4. **Evaluation and Generation**:
   - Evaluated the model using perplexity, a standard language modeling metric
   - Implemented text generation with autoregressive decoding
   - Added temperature and top-k sampling for controlling generation diversity
   - Added functions to save and load the model for future use
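
As a small worked example of the perplexity metric listed above: under the standard definition, perplexity is the exponential of the mean per-token negative log-likelihood (cross-entropy in nats). The sketch below assumes that definition. The `perplexity` helper and the toy log-probabilities are illustrative and are not taken from the notebook's evaluation code.

```python
import math
from typing import List


def perplexity(token_log_probs: List[float]) -> float:
    """Return exp of the mean negative log-likelihood over the given tokens."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every correct token is as uncertain
# as a uniform choice among 4 options, so its perplexity is about 4.
uniform_guess = [math.log(0.25)] * 10
print(perplexity(uniform_guess))

# More confident predictions on the correct tokens give a lower perplexity.
confident_guess = [math.log(0.9)] * 10
print(perplexity(confident_guess))
```

A perplexity of N can be read as the model being, on average, as uncertain as a uniform choice among N tokens.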

This encoder-decoder architecture is more flexible than the original encoder-only model and is better suited for text generation tasks. The encoder processes the input context, and the decoder generates new text conditioned on that context. The same overall design is used by larger sequence-to-sequence transformer models such as T5 and BART.