# Training Your Own LLM with Transformer Architecture

Welcome to the Day 2 practical session on training your own language model! In this notebook, we'll learn how to train a character-level language model using a Transformer encoder-decoder architecture with Hugging Face transformers. We'll use individual letters as tokens and train on the NLTK words corpus to understand the fundamentals of modern LLM training.

## Learning Objectives

- Understand the basics of Transformer-based language model training
- Build a character-level language model using encoder-decoder architecture
- Work with Hugging Face transformers and tokenizers
- Train on NLTK corpus with proper stopping mechanisms
- Evaluate model performance and generate text
- Gain insights into how modern LLMs are trained

## Prerequisites

- Basic understanding of Python and PyTorch (covered in preliminaries)
- Familiarity with the concept of language models and transformers
- A computer with PyTorch and Hugging Face transformers installed (CPU or GPU)

## 1. Setup

First, let's import the necessary libraries and set up our environment. We'll be using PyTorch and Hugging Face transformers for our Transformer-based neural network implementation.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
import time
import random
import string
import nltk
from nltk.corpus import words
import os
from transformers import (
    EncoderDecoderModel,
    BertConfig,
    BertModel,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
    DataCollatorForSeq2Seq
)
from transformers.tokenization_utils_base import BatchEncoding
from torch.utils.data import Dataset, DataLoader
import json

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Set random seeds for reproducibility
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)
np.random.seed(42)
random.seed(42)

print("Libraries imported successfully!")


## 2. Data Preparation

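Let's create a dataset of English words using the NLTK words corpus. We'll build a character-level vocabulary in which each lowercase letter (plus the space character) is a token, alongside the special tokens `<pad>`, `<sos>`, and `<eos>`. Our Transformer encoder-decoder model will learn to predict the next character in a sequence, with `<eos>` marking the end of a word. Because the corpus entries are single alphabetic words, spaces do not occur in the training data; the generation code simply treats a space as an additional stopping point.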

In [None]:
# Download NLTK words if not already downloaded
try:
    nltk.data.find('corpora/words')
except LookupError:
    nltk.download('words')

# Get list of English words
english_words = words.words()

# Filter to get words of reasonable length (3-15 characters)
filtered_words = [word.lower() for word in english_words if 3 <= len(word) <= 15 and word.isalpha()]

print(f"Total words in dataset: {len(filtered_words)}")
print(f"Sample words: {filtered_words[:10]}")

# Create a vocabulary of all characters in our dataset
# Space is kept in the vocabulary as a word-boundary character; <eos> marks the end of a word
special_tokens = ['<pad>', '<sos>', '<eos>']  # padding, start of sequence, end of sequence
all_characters = list(string.ascii_lowercase) + [' ']  # lowercase letters and space
vocab = special_tokens + all_characters
vocab_size = len(vocab)

print(f"Characters in vocabulary: {all_characters}")
print(f"Full vocabulary (with special tokens): {vocab}")
print(f"Vocabulary size: {vocab_size}")

# Create character-to-index and index-to-character mappings
char_to_idx = {char: i for i, char in enumerate(vocab)}
idx_to_char = {i: char for i, char in enumerate(vocab)}

print(f"Special token indices:")
print(f"  <pad>: {char_to_idx['<pad>']}")
print(f"  <sos>: {char_to_idx['<sos>']}")
print(f"  <eos>: {char_to_idx['<eos>']}")
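
To make the mapping concrete, here is a minimal sketch of a round trip through the vocabulary. It rebuilds the same vocabulary layout as the cell above so that it runs on its own; the helper names `encode_tokens` and `decode_ids` are hypothetical and are not used elsewhere in the notebook.

```python
import string

# Same vocabulary layout as in the notebook: special tokens first, then characters.
special_tokens = ['<pad>', '<sos>', '<eos>']
vocab = special_tokens + list(string.ascii_lowercase) + [' ']
char_to_idx = {tok: i for i, tok in enumerate(vocab)}
idx_to_char = {i: tok for i, tok in enumerate(vocab)}

def encode_tokens(tokens):
    """Map a list of tokens (characters or special tokens) to integer ids."""
    return [char_to_idx[tok] for tok in tokens]

def decode_ids(ids):
    """Map integer ids back to their tokens."""
    return [idx_to_char[i] for i in ids]

# Special tokens are single list elements, not strings to be split into characters.
tokens = ['<sos>'] + list("cat") + ['<eos>']
ids = encode_tokens(tokens)
print(ids)              # [1, 5, 3, 22, 2]
print(decode_ids(ids))  # ['<sos>', 'c', 'a', 't', '<eos>']
```

Representing a sequence as a list of tokens rather than one string matters here: iterating over the string `'<sos>cat'` would yield `'<'`, `'s'`, `'o'`, and so on, none of which is the `<sos>` token.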

## 3. Creating Training Data for Encoder-Decoder Architecture

For our encoder-decoder Transformer model, we'll create training pairs where:
- **Encoder input**: A partial word (characters up to a certain position)
- **Decoder input**: The same partial word with `<sos>` token at the beginning
- **Decoder target**: The next character in the sequence, or `<eos>` once the whole word has been seen

This approach teaches the model to:
1. Encode the input sequence context
2. Generate the next character autoregressively
3. Stop generation at a word boundary by predicting `<eos>`

In [None]:
def prepare_sequence_data(word):
    """
    Convert a word into a list of (input_sequence, target_character) pairs.
    For example, "hello" would yield:
    [("h", "e"), ("he", "l"), ("hel", "l"), ("hell", "o")]
    """
    sequence_pairs = []
    for i in range(1, len(word)):
        input_seq = word[:i]
        target_char = word[i]
        sequence_pairs.append((input_seq, target_char))
    return sequence_pairs

# Create training examples
training_pairs = []
for word in filtered_words:
    training_pairs.extend(prepare_sequence_data(word))

# Shuffle the training pairs
random.shuffle(training_pairs)

# Limit to 100,000 examples to avoid memory issues
training_pairs = training_pairs[:100000]

print(f"Total training examples: {len(training_pairs)}")
print(f"Sample training pairs: {training_pairs[:5]}")

# Function to convert a string to a one-hot tensor of shape (len(string), 1, vocab_size)
def string_to_tensor(string):
    tensor = torch.zeros(len(string), 1, vocab_size)
    for i, char in enumerate(string):
        index = char_to_idx.get(char, char_to_idx[' '])  # Default to space if char not found
        tensor[i][0][index] = 1
    return tensor

# Function to convert a character to a tensor (one-hot encoding)
def char_to_tensor(char):
    tensor = torch.zeros(1, vocab_size)
    index = char_to_idx.get(char, char_to_idx[' '])  # Default to space if char not found
    tensor[0][index] = 1
    return tensor

def create_encoder_decoder_pairs(word, max_length=20):
    """
    Create encoder-decoder training pairs from a word.
    Sequences are lists of tokens, so '<sos>' stays a single token instead of
    being split into the characters '<', 's', 'o', 's', '>'.
    For word "hello":
    - (['h'], ['<sos>', 'h'], 'e')
    - (['h', 'e'], ['<sos>', 'h', 'e'], 'l')
    - And so on, ending with the full word and the target '<eos>'
    Padding to max_length is applied later by encode_sequence.
    """
    pairs = []
    
    # i == len(word) yields the final pair, which should predict end of sequence
    for i in range(1, len(word) + 1):
        encoder_input = list(word[:i])[:max_length]  # Characters seen so far
        decoder_input = (['<sos>'] + list(word[:i]))[:max_length]  # Start token + seen characters
        target_char = word[i] if i < len(word) else '<eos>'  # Next character to predict
        
        pairs.append((encoder_input, decoder_input, target_char))
    
    return pairs

# Create training dataset
training_data = []
max_seq_length = 20

for word in filtered_words[:5000]:  # Limit to 5000 words for faster training
    pairs = create_encoder_decoder_pairs(word, max_seq_length)
    training_data.extend(pairs)

# Shuffle training data
random.shuffle(training_data)

print(f"Total training examples: {len(training_data)}")
print(f"Sample training examples:")
for i in range(3):
    enc_in, dec_in, target = training_data[i]
    enc_text = ''.join(enc_in)
    dec_text = ''.join(dec_in)
    print(f"  Encoder input: '{enc_text}'")
    print(f"  Decoder input: '{dec_text}'")
    print(f"  Target: '{target}'")
    print()

def encode_sequence(sequence, char_to_idx, max_length):
    """
    Convert a sequence of characters to indices.
    """
    indices = []
    for char in sequence[:max_length]:
        indices.append(char_to_idx.get(char, char_to_idx['<pad>']))
    
    # Pad if necessary
    while len(indices) < max_length:
        indices.append(char_to_idx['<pad>'])
    
    return indices[:max_length]

def decode_sequence(indices, idx_to_char):
    """
    Convert indices back to characters.
    """
    chars = []
    for idx in indices:
        char = idx_to_char.get(idx, '<unk>')
        if char == '<pad>':
            break
        chars.append(char)
    return ''.join(chars)

## 4. Building the Transformer Model

Now, let's build our character-level language model using a Transformer encoder-decoder architecture. We'll use Hugging Face's `EncoderDecoderModel`, which pairs a BERT-style encoder with a BERT-style decoder (configured with causal masking and cross-attention) to form a sequence-to-sequence model suited to our character-level prediction task.

In [None]:
class CharDataset(Dataset):
    """
    Custom dataset for character-level sequence-to-sequence learning.
    """
    def __init__(self, data, char_to_idx, max_length):
        self.data = data
        self.char_to_idx = char_to_idx
        self.max_length = max_length
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        encoder_input, decoder_input, target = self.data[idx]
        
        # Encode sequences
        encoder_ids = encode_sequence(encoder_input, self.char_to_idx, self.max_length)
        decoder_ids = encode_sequence(decoder_input, self.char_to_idx, self.max_length)
        target_id = self.char_to_idx.get(target, self.char_to_idx['<pad>'])
        
        return {
            'input_ids': torch.tensor(encoder_ids, dtype=torch.long),
            'decoder_input_ids': torch.tensor(decoder_ids, dtype=torch.long),
            'labels': torch.tensor(target_id, dtype=torch.long)
        }

# Create dataset
dataset = CharDataset(training_data, char_to_idx, max_seq_length)

# Split into train and validation
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")

# Configure the encoder and decoder
encoder_config = BertConfig(
    vocab_size=vocab_size,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=8,
    intermediate_size=512,
    max_position_embeddings=max_seq_length,
    pad_token_id=char_to_idx['<pad>']
)

decoder_config = BertConfig(
    vocab_size=vocab_size,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=8,
    intermediate_size=512,
    max_position_embeddings=max_seq_length,
    pad_token_id=char_to_idx['<pad>'],
    is_decoder=True,
    add_cross_attention=True
)

# Create the encoder-decoder model
model = EncoderDecoderModel.from_encoder_decoder_configs(
    encoder_config, decoder_config
)

# Set special tokens
model.config.decoder_start_token_id = char_to_idx['<sos>']
model.config.eos_token_id = char_to_idx['<eos>']
model.config.pad_token_id = char_to_idx['<pad>']
model.config.vocab_size = vocab_size

# Move model to device
model = model.to(device)

print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Model configuration:")
print(f"  Encoder layers: {encoder_config.num_hidden_layers}")
print(f"  Decoder layers: {decoder_config.num_hidden_layers}")
print(f"  Hidden size: {encoder_config.hidden_size}")
print(f"  Attention heads: {encoder_config.num_attention_heads}")
print(f"  Vocabulary size: {vocab_size}")
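
As a rough cross-check on the parameter count printed above, the sketch below estimates it by hand from the configuration values. It assumes standard BERT layer shapes (a bias on every linear layer, LayerNorm with scale and shift, a pooler on the encoder only, and decoder output weights tied to the input embeddings); the helper names are hypothetical, and the exact number reported by the library may differ slightly.

```python
def linear(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out

def layer_norm(n):
    """Scale and shift vectors."""
    return 2 * n

def attention_block(h):
    """Query, key, value and output projections, plus the following LayerNorm."""
    return 4 * linear(h, h) + layer_norm(h)

def feed_forward_block(h, m):
    """Two linear layers (h -> m -> h), plus the following LayerNorm."""
    return linear(h, m) + linear(m, h) + layer_norm(h)

def embeddings(vocab, positions, h, token_types=2):
    """Word, position and token-type embeddings, plus LayerNorm."""
    return vocab * h + positions * h + token_types * h + layer_norm(h)

hidden, intermediate, layers = 256, 512, 4
vocab_size, max_positions = 30, 20  # 3 special tokens + 26 letters + space

encoder_layer = attention_block(hidden) + feed_forward_block(hidden, intermediate)
# Each decoder layer has a second attention block for cross-attention.
decoder_layer = 2 * attention_block(hidden) + feed_forward_block(hidden, intermediate)

encoder = (embeddings(vocab_size, max_positions, hidden)
           + layers * encoder_layer
           + linear(hidden, hidden))          # pooler
decoder = (embeddings(vocab_size, max_positions, hidden)
           + layers * decoder_layer
           + linear(hidden, hidden) + layer_norm(hidden)  # LM head transform
           + vocab_size)                      # output bias (weights assumed tied)

total = encoder + decoder
print(f"Estimated parameters: {total:,}")  # roughly 5.4 million
```

Almost all of the parameters sit in the attention and feed-forward blocks; with a 30-token vocabulary the embedding tables are negligible, which is the opposite of what happens in models with large subword vocabularies.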

## 5. Training Functions for Transformer

We'll use Hugging Face's Trainer class to handle the training loop efficiently. We'll also create a custom data collator to properly batch our encoder-decoder sequences.
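
When a batch holds full-length decoder sequences but only some positions have a meaningful target, the usual PyTorch convention is to label the remaining positions `-100`, which `CrossEntropyLoss` skips by default. The following is a minimal pure-Python sketch of that masking, not the library's implementation; `cross_entropy` and `masked_mean_loss` are hypothetical helper names and the logits are made-up numbers.

```python
import math

IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss ignores by default

def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

def masked_mean_loss(logit_rows, labels):
    """Average the loss over positions whose label is not IGNORE_INDEX."""
    losses = [
        cross_entropy(row, y)
        for row, y in zip(logit_rows, labels)
        if y != IGNORE_INDEX
    ]
    return sum(losses) / len(losses)

# Three decoder positions over a 3-token vocabulary; only the last has a target.
logit_rows = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [0.0, 3.0, 0.0]]
labels = [IGNORE_INDEX, IGNORE_INDEX, 1]

loss = masked_mean_loss(logit_rows, labels)
print(f"Masked loss: {loss:.4f}")
```

With this convention the labels tensor can have the same `(batch, sequence)` shape as the decoder input while the loss still depends only on the positions that carry a real target.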

In [None]:
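class CharDataCollator:
    """
    Custom data collator for character-level encoder-decoder training.
    """
    def __init__(self, pad_token_id, ignore_index=-100):
        self.pad_token_id = pad_token_id
        self.ignore_index = ignore_index  # label value skipped by CrossEntropyLoss
    
    def __call__(self, features):
        # Stack tensors
        input_ids = torch.stack([f['input_ids'] for f in features])
        decoder_input_ids = torch.stack([f['decoder_input_ids'] for f in features])
        targets = torch.stack([f['labels'] for f in features])
        
        # The model's loss compares logits of shape (batch, seq_len, vocab) with
        # labels of shape (batch, seq_len), so expand each scalar target to a row.
        # Only the last position carries the target, because generation and
        # evaluation read logits[:, -1, :]; all other positions are ignored.
        labels = torch.full_like(decoder_input_ids, self.ignore_index)
        labels[:, -1] = targets
        
        return {
            'input_ids': input_ids,
            'decoder_input_ids': decoder_input_ids,
            'labels': labels
        }

def compute_metrics(eval_pred):
    """
    Compute next-character accuracy over the positions that have a target.
    """
    predictions, labels = eval_pred
    
    # The Trainer may return a tuple (logits, encoder hidden states, ...)
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    
    # Get predicted token indices
    predicted_ids = np.argmax(predictions, axis=-1)
    
    # Calculate accuracy, skipping positions labelled -100
    mask = labels != -100
    accuracy = (predicted_ids[mask] == labels[mask]).mean()
    
    return {'accuracy': float(accuracy)}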

# Create data collator
data_collator = CharDataCollator(pad_token_id=char_to_idx['<pad>'])

# Set up training arguments
training_args = TrainingArguments(
    output_dir='./char-transformer-results',
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy="steps",  # renamed to eval_strategy in newer transformers releases
    eval_steps=500,
    save_steps=1000,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="none",  # Disable wandb/tensorboard reporting
    dataloader_pin_memory=False
)

print("Training setup completed!")
print(f"Training arguments:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Warmup steps: {training_args.warmup_steps}")
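
The `warmup_steps` setting interacts with the Trainer's default linear learning-rate schedule: the rate ramps up from zero during warmup and then decays linearly back to zero by the end of training. A minimal sketch of that shape follows, assuming the default base learning rate of 5e-5; `total_steps=4000` is a made-up value for illustration, and `linear_warmup_lr` is a hypothetical helper rather than a library function.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=4000):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr down to 0 over the remaining steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 250, 500, 2250, 4000):
    print(f"step {step:>4}: lr = {linear_warmup_lr(step):.2e}")
```

If a run has fewer total steps than `warmup_steps`, the learning rate never reaches its peak, so it is worth comparing the two numbers when training on a small dataset.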

## 6. Training the Transformer Model

Now, let's train our Transformer encoder-decoder model! We'll use the Hugging Face Trainer which handles the training loop, evaluation, and logging automatically.

In [None]:
# Hyperparameters (epochs, batch size, warmup steps) come from training_args above.
# The Trainer uses its default learning rate unless learning_rate is set there.

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

print("Starting training...")
print(f"This may take several minutes depending on your hardware.")

# Train the model
start_time = time.time()
training_result = trainer.train()
training_time = time.time() - start_time

print(f"\nTraining completed!")
print(f"Training time: {training_time:.2f} seconds")
print(f"Final training loss: {training_result.training_loss:.4f}")

# Evaluate the model
print("\nEvaluating model...")
eval_result = trainer.evaluate()
print(f"Evaluation loss: {eval_result['eval_loss']:.4f}")
print(f"Evaluation accuracy: {eval_result['eval_accuracy']:.4f}")

# Plot training history
log_history = trainer.state.log_history

# Extract training and validation losses
train_losses = []
eval_losses = []
steps = []

for log in log_history:
    if 'loss' in log:
        train_losses.append(log['loss'])
        steps.append(log['step'])
    if 'eval_loss' in log:
        eval_losses.append(log['eval_loss'])

# Plot the results
if train_losses:
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    plt.plot(steps[:len(train_losses)], train_losses, label='Training Loss')
    plt.title('Training Loss')
    plt.xlabel('Steps')
    plt.ylabel('Loss')
    plt.legend()
    plt.grid(True)
    
    if eval_losses:
        plt.subplot(1, 2, 2)
        eval_steps = [log['step'] for log in log_history if 'eval_loss' in log]
        plt.plot(eval_steps, eval_losses, label='Validation Loss', color='orange')
        plt.title('Validation Loss')
        plt.xlabel('Steps')
        plt.ylabel('Loss')
        plt.legend()
        plt.grid(True)
    
    plt.tight_layout()
    plt.show()
else:
    print("No training history available for plotting.")

### Optimizing Training for CPU and GPU

Depending on your hardware, you might want to optimize your training process differently. Let's explore some strategies for both CPU and GPU training.

In [None]:
def optimize_training(device_type="cpu"):
    """
    Demonstrate optimization techniques for different hardware.
    """
    print(f"Optimization strategies for {device_type.upper()} training:")
    
    if device_type == "cpu":
        print("1. Use smaller batch sizes to avoid memory issues")
        print("2. Reduce model size (fewer layers, smaller hidden dimensions)")
        print("3. Use data parallelism with multiple CPU cores")
        print("4. Consider mixed precision training with bfloat16 on newer CPUs")
        print("5. Ensure proper vectorization of operations")
        
        # Example: Setting number of threads for CPU parallelism
        torch.set_num_threads(os.cpu_count())
        print(f"Set PyTorch to use {os.cpu_count()} CPU threads")
        
    elif device_type == "gpu":
        print("1. Use larger batch sizes to fully utilize GPU memory")
        print("2. Enable automatic mixed precision (AMP) for faster computation")
        print("3. Use gradient accumulation for effectively larger batches")
        print("4. Ensure data is pre-loaded and prefetched")
        print("5. Monitor GPU utilization and memory usage")
        
        # Example: Setting up mixed precision training
        if torch.cuda.is_available():
            print("Example of enabling automatic mixed precision:")
            print("   scaler = torch.cuda.amp.GradScaler()")
            print("   with torch.cuda.amp.autocast():")
            print("       outputs = model(inputs)")
    
    # Common optimizations
    print("\nCommon optimizations for both CPU and GPU:")
    print("1. Use DataLoader with appropriate num_workers")
    print("2. Implement early stopping to avoid overfitting")
    print("3. Use learning rate scheduling")
    print("4. Profile your code to identify bottlenecks")
    print("5. Reduce Python overhead by batching operations")

# Show optimization strategies based on available hardware
optimize_training("gpu" if torch.cuda.is_available() else "cpu")
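
Of the common optimizations listed by the function above, early stopping is the easiest to illustrate without any hardware. Below is a minimal sketch of patience-based early stopping on validation loss; the `EarlyStopper` class is hypothetical and the loss values are made up for the example.

```python
class EarlyStopper:
    """Signal a stop when validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta      # minimum decrease that counts as an improvement
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

stopper = EarlyStopper(patience=2)
val_losses = [1.00, 0.80, 0.82, 0.81, 0.79]  # made-up validation losses

stopped_at = None
for step, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stopped_at = step
        break

print(f"Stopped at evaluation {stopped_at}, best loss {stopper.best}")
```

Note that the run stops before the final, better value of 0.79 is ever seen, which is the usual trade-off when choosing a small patience. When training with the Hugging Face Trainer, the library's `EarlyStoppingCallback` offers similar behavior without a hand-written loop.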

## 7. Generating Text with Our Transformer Model

Now that we've trained our Transformer encoder-decoder model, let's use it to generate text. We'll implement a function that uses the encoder to process the input context and the decoder to generate the next character autoregressively. The model will naturally stop when it predicts an `<eos>` token or reaches the maximum length.

In [None]:
def generate_next_char(model, input_sequence, char_to_idx, idx_to_char, max_length=20, temperature=1.0):
    """
    Generate the next character using the trained Transformer model.
    """
    model.eval()
    
    with torch.no_grad():
        # Encode input sequence as a list of tokens (encode_sequence pads it)
        encoder_ids = torch.tensor(
            encode_sequence(list(input_sequence), char_to_idx, max_length),
            dtype=torch.long
        ).unsqueeze(0).to(device)
        
        # Create decoder input: the <sos> token followed by the input characters.
        # A token list keeps '<sos>' as one token rather than five characters.
        decoder_tokens = ['<sos>'] + list(input_sequence)
        decoder_ids = torch.tensor(
            encode_sequence(decoder_tokens, char_to_idx, max_length),
            dtype=torch.long
        ).unsqueeze(0).to(device)
        
        # Forward pass
        outputs = model(
            input_ids=encoder_ids,
            decoder_input_ids=decoder_ids
        )
        
        # Get logits for the last position and apply temperature
        logits = outputs.logits[0, -1, :] / temperature
        probabilities = F.softmax(logits, dim=-1)
        
        # Sample from the distribution
        predicted_id = torch.multinomial(probabilities, 1).item()
        predicted_char = idx_to_char[predicted_id]
        
        return predicted_char, probabilities

def generate_word_completion(model, seed, char_to_idx, idx_to_char, max_new_chars=10, temperature=0.8):
    """
    Generate word completion starting with a seed string.
    Stops when <eos> is generated or max_new_chars is reached.
    """
    model.eval()
    current_sequence = seed
    generated_chars = []
    
    print(f"Generating completion for: '{seed}'")
    print(f"Generation: '{seed}", end="")
    
    for i in range(max_new_chars):
        next_char, probs = generate_next_char(
            model, current_sequence, char_to_idx, idx_to_char, temperature=temperature
        )
        
        # Stop if we generate end-of-sequence or padding
        if next_char in ['<eos>', '<pad>']:
            print("'")
            print(f"Stopped at <eos> after {i} generated characters")
            break
        
        generated_chars.append(next_char)
        current_sequence += next_char
        print(next_char, end="", flush=True)
        
        # Also stop if we generate a space (natural word boundary)
        if next_char == ' ':
            print("'")
            print(f"Stopped at space after {i+1} characters")
            break
    else:
        print("'")
        print(f"Reached maximum length of {max_new_chars} characters")
    
    return current_sequence

# Test the generation with different seed strings
seed_strings = ['fin', 'inv', 'tra', 'mar', 'ban', 'com', 'acc']
temperatures = [0.5, 1.0, 1.5]

print("=" * 60)
print("TRANSFORMER MODEL TEXT GENERATION")
print("=" * 60)

for seed in seed_strings:
    print(f"\n--- Completions for '{seed}' ---")
    for temp in temperatures:
        print(f"\nTemperature {temp}:")
        try:
            completed = generate_word_completion(
                model, seed, char_to_idx, idx_to_char, 
                max_new_chars=15, temperature=temp
            )
        except Exception as e:
            print(f"Error during generation: {e}")
    print("-" * 40)
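
The effect of the `temperature` argument can be seen in isolation. The sketch below applies a temperature-scaled softmax to a small set of made-up logits; `softmax_with_temperature` is a hypothetical helper written in pure Python for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate characters

for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature {t}: {[round(p, 3) for p in probs]}")
```

Temperatures below 1 sharpen the distribution toward the highest-scoring character, so completions become more repetitive and predictable; temperatures above 1 flatten it, so sampling picks unlikely characters more often.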

## 8. Model Evaluation and Analysis

Let's evaluate our Transformer model's performance more thoroughly. We'll examine next-character prediction accuracy on sampled validation examples, compare it against a uniform random baseline, and break accuracy down by target character.

In [None]:
def detailed_evaluation(model, eval_dataset, char_to_idx, idx_to_char, num_examples=100):
    """
    Perform detailed evaluation of the model.
    """
    model.eval()
    correct_predictions = 0
    total_predictions = 0
    predictions_by_position = {}
    
    # Sample some examples for detailed analysis
    sampled_indices = random.sample(range(len(eval_dataset)), min(num_examples, len(eval_dataset)))
    
    print("Detailed Evaluation Examples:")
    print("=" * 80)
    
    with torch.no_grad():
        for i, idx in enumerate(sampled_indices[:10]):  # Show first 10 examples
            sample = eval_dataset[idx]
            
            # Prepare inputs
            input_ids = sample['input_ids'].unsqueeze(0).to(device)
            decoder_input_ids = sample['decoder_input_ids'].unsqueeze(0).to(device)
            true_label = sample['labels'].item()
            
            # Forward pass
            outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
            predicted_id = torch.argmax(outputs.logits[0, -1, :]).item()
            
            # Decode sequences for display
            encoder_text = decode_sequence(input_ids[0].cpu().numpy(), idx_to_char)
            decoder_text = decode_sequence(decoder_input_ids[0].cpu().numpy(), idx_to_char)
            true_char = idx_to_char[true_label]
            pred_char = idx_to_char[predicted_id]
            
            # Track accuracy
            is_correct = predicted_id == true_label
            if is_correct:
                correct_predictions += 1
            total_predictions += 1
            
            # Display example
            print(f"Example {i+1}:")
            print(f"  Encoder input: '{encoder_text.replace('<pad>', '').strip()}'")
            print(f"  Decoder input: '{decoder_text.replace('<pad>', '').strip()}'")
            print(f"  True next char: '{true_char}'")
            print(f"  Predicted: '{pred_char}' {'✓' if is_correct else '✗'}")
            print()
        
        # Continue evaluation on remaining examples (without printing)
        for idx in sampled_indices[10:]:
            sample = eval_dataset[idx]
            
            input_ids = sample['input_ids'].unsqueeze(0).to(device)
            decoder_input_ids = sample['decoder_input_ids'].unsqueeze(0).to(device)
            true_label = sample['labels'].item()
            
            outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
            predicted_id = torch.argmax(outputs.logits[0, -1, :]).item()
            
            if predicted_id == true_label:
                correct_predictions += 1
            total_predictions += 1
    
    accuracy = correct_predictions / total_predictions
    
    print(f"\nEvaluation Results:")
    print(f"Total examples evaluated: {total_predictions}")
    print(f"Correct predictions: {correct_predictions}")
    print(f"Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
    
    return accuracy

# Perform detailed evaluation
print("Performing detailed evaluation...")
accuracy = detailed_evaluation(model, val_dataset, char_to_idx, idx_to_char, num_examples=200)

# Calculate random baseline
random_accuracy = 1.0 / len([c for c in vocab if c not in ['<pad>', '<sos>']])  # <pad> and <sos> are never prediction targets
print(f"\nRandom baseline accuracy: {random_accuracy:.4f} ({random_accuracy*100:.2f}%)")
print(f"Model improvement over random: {(accuracy/random_accuracy):.2f}x")

# Analyze character-level predictions
char_predictions = {}
for char in string.ascii_lowercase:
    char_predictions[char] = {'correct': 0, 'total': 0}

# Sample more examples for character analysis
model.eval()
with torch.no_grad():
    for idx in random.sample(range(len(val_dataset)), min(500, len(val_dataset))):
        sample = val_dataset[idx]
        
        input_ids = sample['input_ids'].unsqueeze(0).to(device)
        decoder_input_ids = sample['decoder_input_ids'].unsqueeze(0).to(device)
        true_label = sample['labels'].item()
        
        if true_label < len(vocab) and vocab[true_label] in string.ascii_lowercase:
            true_char = vocab[true_label]
            
            outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
            predicted_id = torch.argmax(outputs.logits[0, -1, :]).item()
            
            char_predictions[true_char]['total'] += 1
            if predicted_id == true_label:
                char_predictions[true_char]['correct'] += 1

# Display character-level accuracy
print("\nCharacter-level Accuracy:")
print("-" * 40)
for char in sorted(char_predictions.keys()):
    if char_predictions[char]['total'] > 0:
        acc = char_predictions[char]['correct'] / char_predictions[char]['total']
        print(f"'{char}': {acc:.3f} ({char_predictions[char]['correct']}/{char_predictions[char]['total']})")

print("\nEvaluation completed!")
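
A uniform random baseline is easy to beat, because some characters are far more common than others. A slightly stronger reference point is a baseline that always predicts the most frequent target seen in training. The sketch below shows the idea on made-up target lists; `majority_baseline_accuracy` is a hypothetical helper, and in the notebook the targets would come from the training and validation examples.

```python
from collections import Counter

def majority_baseline_accuracy(train_targets, eval_targets):
    """Always predict the most frequent training target; return it and its accuracy."""
    most_common, _ = Counter(train_targets).most_common(1)[0]
    hits = sum(1 for t in eval_targets if t == most_common)
    return most_common, hits / len(eval_targets)

train_targets = list("eeeatnnes")  # made-up next-character targets
eval_targets = list("eaten")

prediction, baseline_acc = majority_baseline_accuracy(train_targets, eval_targets)
print(f"Always predicting '{prediction}' gives accuracy {baseline_acc:.2f}")
```

A model whose accuracy is only marginally above this figure has mostly learned character frequencies rather than how the preceding characters constrain the next one.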

## 9. From Character-Level Transformers to Modern LLMs

We've successfully trained a character-level Transformer encoder-decoder model! Let's discuss how this relates to modern large language models (LLMs) and what it would take to scale up to production systems.

### Comparison: Our Transformer Model vs. Production LLMs

| Feature | Our Character Transformer | Modern LLMs (GPT, BERT, etc.) |
|---------|---------------------------|--------------------------------|
| **Architecture** | Encoder-Decoder Transformer | Decoder-only or Encoder-only Transformers |
| **Parameters** | ~5M | Billions to trillions |
| **Token Level** | Character-level | Subword/BPE tokenization |
| **Training Data** | Tens of thousands of character sequences (from 5,000 words) | Trillions of tokens from web, books, code |
| **Context Length** | 20 characters | 2K-100K+ tokens |
| **Training Time** | Minutes on laptop/GPU | Weeks on thousands of GPUs |
| **Attention Mechanism** | Full attention | Various optimizations (sparse, sliding window) |
| **Capabilities** | Next character prediction | Language understanding, reasoning, code generation |
| **Applications** | Educational/toy examples | Production AI systems, financial analysis |
| **Hardware Requirements** | Single GPU/CPU | Distributed systems, specialized hardware |
| **Memory Usage** | <1GB | 100GB+ for inference |

### Key Insights from Our Implementation

1. **Transformer Architecture**: Our model uses the same fundamental building blocks as GPT and BERT
2. **Attention Mechanism**: The model learns to focus on relevant parts of the input sequence
3. **Autoregressive Generation**: Character-by-character generation mirrors how LLMs generate text
4. **Encoder-Decoder Design**: Similar to models like T5, BART, and early machine translation systems
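
The attention mechanism mentioned above can be written out in a few lines. This is a minimal pure-Python sketch of single-head scaled dot-product attention with an optional causal mask, using tiny made-up vectors; it omits the learned projections, multiple heads, and batching that a real Transformer layer has, and the function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Similarity of this query with every key, scaled by sqrt(d)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        if causal:
            # A decoder position may not look at later positions
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three positions, two dimensions each
_, full_weights = attention(x, x, x)
_, causal_weights = attention(x, x, x, causal=True)

print("full:  ", [[round(w, 2) for w in row] for row in full_weights])
print("causal:", [[round(w, 2) for w in row] for row in causal_weights])
```

The causal variant is what makes autoregressive generation possible: the first position can only attend to itself, the second to the first two, and so on, whereas an encoder uses the unmasked form.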

### Scaling to Financial LLM Applications

To build production-ready LLMs for finance, we would need to scale our approach:

#### 1. **Architecture Improvements**
- **Decoder-only models**: Like GPT, for better autoregressive generation
- **Mixture of Experts (MoE)**: Efficiently scale parameters
- **Rotary Position Embeddings**: Better handling of long sequences
- **Layer normalization variants**: RMSNorm, Pre-LN for stability

#### 2. **Tokenization Strategy**
- **Subword tokenization**: BPE, SentencePiece for efficient vocabulary
- **Financial domain tokens**: Special tokens for financial terms, numbers, dates
- **Multilingual support**: For global financial markets
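
To give a feel for how subword tokenization differs from the character-level approach used in this notebook, here is a minimal sketch of a single byte-pair-encoding (BPE) merge step. It is a toy illustration under simplifying assumptions (a three-word corpus with made-up frequencies, no end-of-word marker), not how a production tokenizer library is implemented; both function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency, and return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Start from characters, exactly like our character-level vocabulary.
words = {tuple("trade"): 5, tuple("trader"): 3, tuple("rate"): 2}

pair = most_frequent_pair(words)
words_after = merge_pair(words, pair)

print("Merged pair:", pair)
for symbols, freq in words_after.items():
    print(freq, symbols)
```

Repeating this step many times grows the vocabulary from single characters to frequent fragments and whole words, which shortens sequences considerably compared with one token per character.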

#### 3. **Training Data & Scale**
- **Financial corpora**: SEC filings, earnings calls, financial news, research reports
- **Code integration**: Financial modeling code, SQL queries
- **Structured data**: Financial statements, market data, regulatory filings
- **Real-time data**: Market feeds, news streams

#### 4. **Training Techniques**
- **Pretraining**: Large-scale unsupervised learning on financial text
- **Instruction tuning**: Fine-tuning on financial Q&A, analysis tasks
- **RLHF**: Reinforcement learning from financial expert feedback
- **Domain adaptation**: Continued pretraining on financial data

#### 5. **Financial-Specific Optimizations**
- **Numerical reasoning**: Enhanced arithmetic and financial calculation abilities
- **Time series understanding**: Market data, financial trends
- **Risk assessment**: Model uncertainty quantification
- **Compliance**: Ensuring regulatory compliance in outputs
- **Explainability**: Interpretable financial recommendations

#### 6. **Production Considerations**
- **Latency optimization**: Fast inference for real-time trading decisions
- **Scalability**: Handle multiple concurrent financial analysis requests
- **Security**: Protect sensitive financial information
- **Monitoring**: Track model performance and drift in financial markets

Our character-level Transformer provides the foundational understanding for these advanced systems!

## 10. Conclusion

Congratulations! In this notebook, you've successfully:

✅ **Built a Transformer encoder-decoder model**, trained from random initialization, using Hugging Face  
✅ **Implemented character-level tokenization** with proper special tokens  
✅ **Trained on NLTK corpus** with sequence-to-sequence learning  
✅ **Used modern training techniques** with the Hugging Face Trainer  
✅ **Generated text autoregressively** with proper stopping mechanisms  
✅ **Evaluated model performance** with comprehensive metrics  
✅ **Understood the path to production LLMs** in financial applications  

### Key Learnings

1. **Transformer Architecture**: You've implemented the same core architecture used in GPT, BERT, and T5
2. **Character-level Modeling**: Understanding how models can work at the most granular text level
3. **Encoder-Decoder Design**: Experience with sequence-to-sequence learning patterns
4. **Modern Training Stack**: Hands-on experience with Hugging Face transformers
5. **Evaluation Techniques**: Comprehensive model assessment strategies
6. **Scaling Insights**: Clear path from toy models to production systems

### From Here to Financial LLMs

Your character-level Transformer shares DNA with models like:
- **GPT-4**: Decoder-only architecture for text generation
- **BERT**: Encoder architecture for understanding tasks  
- **T5**: Encoder-decoder for various NLP tasks
- **Financial domain models**: BloombergGPT, FinBERT, etc.

The principles you've learned—attention mechanisms, autoregressive generation, transformer blocks—are the building blocks of all modern LLMs used in finance today.

## Next Steps

### Immediate Experiments
- 🔧 **Increase model size**: More layers, larger hidden dimensions
- 📊 **Try different data**: Financial news, SEC filings, earnings transcripts
- 🎯 **Task-specific fine-tuning**: Financial sentiment, NER, QA
- ⚡ **Optimization**: Mixed precision, gradient checkpointing

### Advanced Projects
- 🚀 **Implement GPT-style decoder-only model** for better generation
- 📈 **Fine-tune on financial data** for domain adaptation
- 🔍 **Add retrieval mechanisms** for factual financial information
- 🛡️ **Implement safety measures** for financial advice generation
- 📱 **Deploy as API** for real-time financial analysis

### Production Path
- 🏗️ **Scale to subword tokenization** (BPE, SentencePiece)
- 🌐 **Multi-GPU training** with distributed computing
- 📊 **Add financial-specific metrics** and evaluation frameworks
- 🔒 **Implement security** and compliance measures
- 📈 **Continuous learning** from new financial data

You now have the foundation to understand and build the next generation of financial AI systems! 🎉