# Training Your Own LLM

Welcome to the Day 2 practical session on training your own language model! In this notebook, we'll train a small character-level language model from scratch, using English words from the NLTK corpus as our dataset. We'll explore how language models learn to predict the next character and see the fundamentals of neural language modeling in action.

## Learning Objectives

- Understand the basics of language model training
- Build a simple character-level language model from scratch
- Optimize training using CPU and GPU resources
- Evaluate model performance and generate text
- Gain insights into how larger LLMs are trained

## Prerequisites

- Basic understanding of Python and PyTorch (covered in preliminaries)
- Familiarity with the concept of language models
- A computer with PyTorch installed (CPU or GPU)

## 1. Setup

First, let's import the necessary libraries and set up our environment. We'll be using PyTorch for our neural network implementation.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
import time
import random
import string
import nltk
from nltk.corpus import words
import os

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Set random seeds for reproducibility
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)
np.random.seed(42)
random.seed(42)

## 2. Data Preparation

Let's create a dataset of English words. We'll use the NLTK words corpus, which contains common English words. For our simple model, we'll focus on character-level prediction rather than word-level: it needs only a 27-symbol vocabulary and far fewer computational resources.

In [None]:
# Download NLTK words if not already downloaded
try:
    nltk.data.find('corpora/words')
except LookupError:
    nltk.download('words')

# Get list of English words
english_words = words.words()

# Filter to get words of reasonable length (3-10 characters)
filtered_words = [word.lower() for word in english_words if 3 <= len(word) <= 10 and word.isalpha()]

print(f"Total words in dataset: {len(filtered_words)}")
print(f"Sample words: {filtered_words[:10]}")

# Create a vocabulary of all characters in our dataset
all_characters = string.ascii_lowercase + ' '  # lowercase letters plus space (space also serves as the fallback for unknown characters)
n_characters = len(all_characters)
print(f"Characters in vocabulary: {all_characters}")
print(f"Vocabulary size: {n_characters}")

# Create dictionaries to convert between characters and indices
char_to_idx = {char: i for i, char in enumerate(all_characters)}
idx_to_char = {i: char for i, char in enumerate(all_characters)}
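
To make the two lookup tables concrete, here is a minimal sketch of a round trip from a word to indices and back. The `encode` and `decode` helpers are hypothetical names introduced only for this illustration; the rest of the notebook does not use them.

```python
import string

# Same vocabulary as above: 26 lowercase letters plus a space
all_characters = string.ascii_lowercase + ' '
char_to_idx = {char: i for i, char in enumerate(all_characters)}
idx_to_char = {i: char for i, char in enumerate(all_characters)}

def encode(word):
    """Map each character of a word to its integer index (hypothetical helper)."""
    return [char_to_idx[c] for c in word]

def decode(indices):
    """Map a list of indices back to a string (hypothetical helper)."""
    return ''.join(idx_to_char[i] for i in indices)

encoded = encode("bank")
print(encoded)          # [1, 0, 13, 10]
print(decode(encoded))  # bank
```

Because the mapping is a bijection over the 27 characters, decoding an encoded word always returns the original word.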

## 3. Creating Training Data

We'll now prepare our training data. For each word, we'll create input-output pairs where:
- Input: The characters of the word up to a certain position
- Output: The next character in the word

This way, our model will learn to predict the next character given a sequence of previous characters.

In [None]:
def prepare_sequence_data(word):
    """
    Convert a word into a list of (input_sequence, target_character) pairs.
    For example, "hello" would yield:
    [("h", "e"), ("he", "l"), ("hel", "l"), ("hell", "o")]
    """
    sequence_pairs = []
    for i in range(1, len(word)):
        input_seq = word[:i]
        target_char = word[i]
        sequence_pairs.append((input_seq, target_char))
    return sequence_pairs

# Create training examples
training_pairs = []
for word in filtered_words:
    training_pairs.extend(prepare_sequence_data(word))

# Shuffle the training pairs
random.shuffle(training_pairs)

# Limit to 100,000 examples to keep memory use and training time manageable
training_pairs = training_pairs[:100000]

print(f"Total training examples: {len(training_pairs)}")
print(f"Sample training pairs: {training_pairs[:5]}")

# Function to convert a string to a one-hot tensor of shape (len(text), 1, n_characters)
# (the parameter is named `text` so it does not shadow the imported `string` module)
def string_to_tensor(text):
    tensor = torch.zeros(len(text), 1, n_characters)
    for i, char in enumerate(text):
        index = char_to_idx.get(char, char_to_idx[' '])  # Default to space if char not found
        tensor[i][0][index] = 1
    return tensor

# Function to convert a character to a tensor (one-hot encoding)
def char_to_tensor(char):
    tensor = torch.zeros(1, n_characters)
    index = char_to_idx.get(char, char_to_idx[' '])  # Default to space if char not found
    tensor[0][index] = 1
    return tensor
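
As a quick check on the encoding, the sketch below rebuilds the one-hot tensor for the prefix "hel" and inspects its shape. It is self-contained, so it redefines the vocabulary; `one_hot_sequence` is a hypothetical stand-in that mirrors the conversion function above without the fallback to space.

```python
import string
import torch

all_characters = string.ascii_lowercase + ' '
n_characters = len(all_characters)
char_to_idx = {c: i for i, c in enumerate(all_characters)}

def one_hot_sequence(text):
    """Return a (seq_len, batch=1, vocab) one-hot tensor for a string."""
    tensor = torch.zeros(len(text), 1, n_characters)
    for i, char in enumerate(text):
        tensor[i, 0, char_to_idx[char]] = 1
    return tensor

x = one_hot_sequence("hel")
print(x.shape)                             # torch.Size([3, 1, 27])
print(x.argmax(dim=2).flatten().tolist())  # [7, 4, 11]
```

Each time step holds exactly one non-zero entry, at the index of that character. The middle dimension is the batch size, which is 1 throughout this notebook.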

## 4. Building the Model

Now, let's build our character-level language model. We'll use a simple recurrent neural network (RNN) with GRU (Gated Recurrent Unit) cells, which are good at capturing sequential patterns in data.

In [None]:
class CharRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, n_layers=1):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        
        # GRU layer - processes input sequences and maintains hidden state
        self.gru = nn.GRU(input_size, hidden_size, n_layers)
        
        # Output layer - transforms hidden state to character probabilities
        self.decoder = nn.Linear(hidden_size, output_size)
        
        # LogSoftmax layer - converts output to log-probabilities (as expected by NLLLoss)
        self.softmax = nn.LogSoftmax(dim=1)
    
    def forward(self, input_tensor, hidden_state):
        # Input shape: (seq_len, batch_size, input_size)
        output, hidden = self.gru(input_tensor, hidden_state)
        
        # Take the last output from the sequence
        output = self.decoder(output[-1])
        
        # Apply log-softmax to get log-probabilities
        output = self.softmax(output)
        
        return output, hidden
    
    def init_hidden(self, batch_size=1):
        # Initialize hidden state with zeros
        return torch.zeros(self.n_layers, batch_size, self.hidden_size, device=device)
    
    def get_prediction(self, output):
        # Get the index of the most likely character
        _, top_index = output.topk(1)
        return top_index.item()

# Initialize model
n_hidden = 128
n_layers = 2
model = CharRNN(n_characters, n_hidden, n_characters, n_layers).to(device)

print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")
print(model)
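
The parameter count printed above can be reproduced by hand. The sketch below assumes PyTorch's standard GRU layout, in which each layer holds three weight blocks (reset gate, update gate, candidate state), each with input weights, hidden weights, and two bias vectors. The helper `gru_layer_params` is a hypothetical name used only here.

```python
import torch.nn as nn

def gru_layer_params(input_size, hidden_size):
    # Three blocks (reset, update, candidate), each with input-to-hidden
    # weights, hidden-to-hidden weights, and two bias vectors
    return 3 * hidden_size * (input_size + hidden_size) + 2 * 3 * hidden_size

n_chars, n_hidden = 27, 128
expected = (
    gru_layer_params(n_chars, n_hidden)      # first GRU layer
    + gru_layer_params(n_hidden, n_hidden)   # second GRU layer
    + n_hidden * n_chars + n_chars           # linear decoder (weights + bias)
)

# Cross-check against the same layer configuration built with PyTorch
gru = nn.GRU(n_chars, n_hidden, num_layers=2)
decoder = nn.Linear(n_hidden, n_chars)
actual = sum(p.numel() for m in (gru, decoder) for p in m.parameters())

print(expected, actual)  # 162843 162843
```

Most of the parameters sit in the second GRU layer, because its input is the 128-dimensional hidden state rather than a 27-dimensional one-hot vector.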

## 5. Training Functions

Now, we'll define functions to train our model. We'll use the negative log likelihood loss (NLLLoss) since our model outputs log probabilities with a LogSoftmax layer.

In [None]:
def train_step(input_tensor, target_tensor, model, optimizer, criterion):
    # Initialize hidden state
    hidden = model.init_hidden()
    
    # Zero the gradients
    optimizer.zero_grad()
    
    # Forward pass
    output, hidden = model(input_tensor, hidden)
    
    # Calculate loss
    loss = criterion(output, target_tensor)
    
    # Backward pass and optimize
    loss.backward()
    optimizer.step()
    
    return output, loss.item()

def train_model(model, training_pairs, n_epochs=10, learning_rate=0.005, print_every=1000):
    # Initialize optimizer and loss function
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.NLLLoss()
    
    # Track progress
    all_losses = []
    total_loss = 0
    start_time = time.time()
    
    print("Starting training...")
    
    for epoch in range(1, n_epochs + 1):
        epoch_loss = 0
        
        for i, (input_seq, target_char) in enumerate(training_pairs, 1):
            # Convert input and target to tensors
            input_tensor = string_to_tensor(input_seq).to(device)
            target_index = torch.tensor([char_to_idx.get(target_char, char_to_idx[' '])], device=device)
            
            # Train on this pair
            output, loss = train_step(input_tensor, target_index, model, optimizer, criterion)
            total_loss += loss
            epoch_loss += loss
            
            # Print progress
            if i % print_every == 0:
                avg_loss = total_loss / print_every
                all_losses.append(avg_loss)
                total_loss = 0
                
                # Calculate elapsed time
                elapsed = time.time() - start_time
                
                # Get prediction
                _, top_index = output.topk(1)
                predicted_char = idx_to_char[top_index.item()]
                target_char_actual = idx_to_char[target_index.item()]
                
                print(f"Epoch {epoch}/{n_epochs} | Step {i}/{len(training_pairs)} | Loss: {avg_loss:.4f} | Time: {elapsed:.2f}s")
                print(f"Input: '{input_seq}' | Target: '{target_char_actual}' | Predicted: '{predicted_char}'")
                print("-" * 50)
        
        print(f"Epoch {epoch} completed | Average loss: {epoch_loss/len(training_pairs):.4f}")
    
    return all_losses
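
The pairing of `LogSoftmax` with `NLLLoss` is equivalent to applying cross-entropy loss directly to the raw decoder outputs (logits). The following sketch checks this on random logits; the tensor sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 27)             # 4 examples, 27-character vocabulary
targets = torch.tensor([0, 5, 26, 13])  # correct next-character indices

# Route 1: log-softmax followed by negative log likelihood (as in this notebook)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

# Route 2: cross-entropy applied directly to the logits
ce = nn.CrossEntropyLoss()(logits, targets)

# Route 3: by hand, the mean of -log p(target) over the examples
manual = -F.log_softmax(logits, dim=1)[torch.arange(4), targets].mean()

print(torch.allclose(nll, ce), torch.allclose(nll, manual))  # True True
```

In other words, the loss is the average negative log-probability the model assigns to the correct next character.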

## 6. Training the Model

Now, let's train our model! We'll train for a few epochs and track the loss over time. This might take a while depending on your hardware.

In [None]:
# Set parameters
n_epochs = 3  # Start with a small number of epochs
learning_rate = 0.001
print_every = 1000

# Start training
losses = train_model(model, training_pairs, n_epochs, learning_rate, print_every)

# Plot the loss over time
plt.figure(figsize=(10, 5))
plt.plot(losses)
plt.title('Training Loss')
plt.xlabel(f'Steps (x{print_every})')
plt.ylabel('Loss')
plt.grid(True)
plt.show()

## 7. Optimizing Training for CPU and GPU

Depending on your hardware, you might want to optimize your training process differently. Let's explore some strategies for both CPU and GPU training.

In [None]:
def optimize_training(device_type="cpu"):
    """
    Demonstrate optimization techniques for different hardware.
    """
    print(f"Optimization strategies for {device_type.upper()} training:")
    
    if device_type == "cpu":
        print("1. Use smaller batch sizes to avoid memory issues")
        print("2. Reduce model size (fewer layers, smaller hidden dimensions)")
        print("3. Use data parallelism with multiple CPU cores")
        print("4. Consider mixed precision training with bfloat16 on newer CPUs")
        print("5. Ensure proper vectorization of operations")
        
        # Example: Setting number of threads for CPU parallelism
        torch.set_num_threads(os.cpu_count())
        print(f"Set PyTorch to use {os.cpu_count()} CPU threads")
        
    elif device_type == "gpu":
        print("1. Use larger batch sizes to fully utilize GPU memory")
        print("2. Enable automatic mixed precision (AMP) for faster computation")
        print("3. Use gradient accumulation for effectively larger batches")
        print("4. Ensure data is pre-loaded and prefetched")
        print("5. Monitor GPU utilization and memory usage")
        
        # Example: Setting up mixed precision training
        if torch.cuda.is_available():
            print("Example of enabling automatic mixed precision:")
            print("   scaler = torch.cuda.amp.GradScaler()")
            print("   with torch.cuda.amp.autocast():")
            print("       outputs = model(inputs)")
    
    # Common optimizations
    print("\nCommon optimizations for both CPU and GPU:")
    print("1. Use DataLoader with appropriate num_workers")
    print("2. Implement early stopping to avoid overfitting")
    print("3. Use learning rate scheduling")
    print("4. Profile your code to identify bottlenecks")
    print("5. Reduce Python overhead by batching operations")

# Show optimization strategies based on available hardware
optimize_training("gpu" if torch.cuda.is_available() else "cpu")
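
The training loop in this notebook processes one example per optimizer step, which leaves most of the hardware idle. One way to batch variable-length prefixes is to pad them and pack them before the GRU. The sketch below is a minimal illustration under that assumption, not the notebook's training code: `one_hot`, `make_batch`, the hidden size of 32, and the sample pairs are all hypothetical.

```python
import string
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

all_characters = string.ascii_lowercase + ' '
n_characters = len(all_characters)
char_to_idx = {c: i for i, c in enumerate(all_characters)}

def one_hot(text):
    """One-hot encode a string as a (len, vocab) float tensor."""
    idx = torch.tensor([char_to_idx[c] for c in text])
    return F.one_hot(idx, n_characters).float()

def make_batch(pairs):
    """Pad variable-length prefixes into one (max_len, batch, vocab) tensor."""
    inputs = [one_hot(seq) for seq, _ in pairs]
    lengths = torch.tensor([len(seq) for seq, _ in pairs])
    targets = torch.tensor([char_to_idx[t] for _, t in pairs])
    padded = pad_sequence(inputs)  # zero-pads shorter sequences at the end
    return padded, lengths, targets

gru = nn.GRU(n_characters, 32)
decoder = nn.Linear(32, n_characters)

pairs = [("h", "e"), ("hel", "l"), ("ba", "n"), ("trad", "e")]
padded, lengths, targets = make_batch(pairs)

# Packing tells the GRU each sequence's true length, so padding is skipped
packed = pack_padded_sequence(padded, lengths, enforce_sorted=False)
_, hidden = gru(packed)        # hidden: (num_layers, batch, hidden_size)
logits = decoder(hidden[-1])   # final hidden state of each sequence
loss = nn.CrossEntropyLoss()(logits, targets)
loss.backward()                # one backward pass for the whole batch

print(padded.shape, logits.shape)  # torch.Size([4, 4, 27]) torch.Size([4, 27])
```

With a packed input, the returned hidden state corresponds to each sequence's last real character, so it plays the same role as `output[-1]` in the single-example model.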

## 8. Generating Text with Our Model

Now that we've trained our model, let's use it to generate some text. We'll start with a seed string and let the model predict subsequent characters one by one.

In [None]:
def generate_text(model, seed_string, max_length=50, temperature=0.8):
    """
    Generate text starting with a seed string.
    Temperature controls randomness: higher means more random outputs.
    """
    model.eval()  # Set model to evaluation mode
    with torch.no_grad():  # No need to track gradients
        # Initialize hidden state
        hidden = model.init_hidden()
        
        # Prepare input tensor from seed string
        input_tensor = string_to_tensor(seed_string).to(device)
        
        # Initialize output with the seed string
        output_string = seed_string
        
        # Generate characters one by one
        for i in range(max_length):
            # Forward pass
            output, hidden = model(input_tensor, hidden)
            
            # Divide log-probabilities by the temperature, then exponentiate to get
            # unnormalized sampling weights (torch.multinomial normalizes them)
            output_dist = output.div(temperature).exp()
            
            # Sample from the distribution
            top_char_index = torch.multinomial(output_dist, 1)[0]
            
            # Get the corresponding character
            predicted_char = idx_to_char[top_char_index.item()]
            
            # Add the predicted character to the output string
            output_string += predicted_char
            
            # Feed only the new character next: the hidden state already
            # summarizes every earlier character, so re-feeding them would double-count
            next_char_tensor = torch.zeros(1, 1, n_characters, device=device)
            next_char_tensor[0, 0, top_char_index] = 1
            input_tensor = next_char_tensor
        
        return output_string

# Generate text with different seed strings
seed_strings = ['fin', 'inv', 'tra', 'mar', 'ban']
temperatures = [0.5, 0.8, 1.0]

print("Generated text samples:")
for seed in seed_strings:
    print(f"\nSeed: '{seed}'")
    for temp in temperatures:
        generated = generate_text(model, seed, max_length=20, temperature=temp)
        print(f"  Temperature {temp}: '{generated}'")
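
To see what the temperature does numerically, the sketch below rescales a small made-up distribution over three characters. The function name `softmax_with_temperature` and the probabilities (0.6, 0.3, 0.1) are assumptions chosen for illustration.

```python
import math

def softmax_with_temperature(log_probs, temperature):
    """Rescale log-probabilities by 1/temperature and renormalize."""
    scaled = [lp / temperature for lp in log_probs]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

log_probs = [math.log(p) for p in (0.6, 0.3, 0.1)]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(log_probs, t)
    print(t, [round(p, 3) for p in probs])
```

A temperature of 1.0 leaves the distribution unchanged. Values below 1.0 concentrate probability on the most likely character, and values above 1.0 flatten the distribution toward uniform.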

## 9. Model Evaluation

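Let's evaluate our model's performance. We'll calculate perplexity, a common metric for language models: the exponential of the average negative log-likelihood per predicted character. Lower is better, and a model that guesses uniformly over our 27 characters would score 27.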

In [None]:
def evaluate_model(model, eval_pairs, criterion):
    """
    Evaluate the model on a set of evaluation pairs.
    Returns the average loss and perplexity.
    """
    model.eval()  # Set model to evaluation mode
    total_loss = 0
    
    with torch.no_grad():  # No need to track gradients
        for input_seq, target_char in eval_pairs:
            # Convert input and target to tensors
            input_tensor = string_to_tensor(input_seq).to(device)
            target_index = torch.tensor([char_to_idx.get(target_char, char_to_idx[' '])], device=device)
            
            # Initialize hidden state
            hidden = model.init_hidden()
            
            # Forward pass
            output, hidden = model(input_tensor, hidden)
            
            # Calculate loss
            loss = criterion(output, target_index)
            total_loss += loss.item()
    
    # Calculate average loss and perplexity
    avg_loss = total_loss / len(eval_pairs)
    perplexity = np.exp(avg_loss)
    
    return avg_loss, perplexity

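# Create evaluation set (using part of the training set for simplicity).
# The model has already seen these pairs, so the reported loss and perplexity
# are optimistic; hold out unseen words for an honest estimate.
eval_pairs = training_pairs[-1000:]  # Use last 1000 examples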

# Evaluate the model
criterion = nn.NLLLoss()
avg_loss, perplexity = evaluate_model(model, eval_pairs, criterion)

print("Evaluation results:")
print(f"Average loss: {avg_loss:.4f}")
print(f"Perplexity: {perplexity:.2f}")
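
Two small checks help when reading these numbers. The sketch below first confirms the perplexity of a uniform guesser, then shows one way to hold out whole words so that evaluation pairs never overlap with training pairs. The `split_words` helper, its 10% default, and the sample word list are hypothetical.

```python
import math
import random

def perplexity(nll_values):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(sum(nll_values) / len(nll_values))

# A model that assigns probability 1/27 to every character
uniform_nll = [math.log(27)] * 1000
print(round(perplexity(uniform_nll), 2))  # 27.0

def split_words(word_list, eval_fraction=0.1, seed=42):
    """Shuffle words and hold out a fraction for evaluation (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = word_list[:]
    rng.shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

sample_words = ["bank", "bond", "trade", "market", "asset",
                "yield", "stock", "share", "price", "hedge"]
train_words, eval_words = split_words(sample_words)
print(len(train_words), len(eval_words))  # 9 1
```

Splitting at the word level, before generating prefix pairs, matters because prefixes of the same word are highly correlated. A trained model should land well below the uniform baseline of 27.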

## 10. Scaling Up: From Small Models to LLMs

We've just trained a very simple character-level language model. Let's discuss how this relates to large language models (LLMs) used in finance and what it would take to scale up.

### Comparison: Our Simple Model vs. LLMs

| Feature | Our Simple Model | Large Language Models (LLMs) |
|---------|------------------|------------------------------|
| Parameters | ~160K | Billions to trillions |
| Architecture | Simple GRU | Transformer-based (attention mechanisms) |
| Training Data | ~100K examples | Trillions of tokens |
| Training Time | Minutes on a laptop | Weeks/months on thousands of GPUs |
| Context Length | Few characters | Thousands of tokens |
| Capabilities | Character prediction | Complex reasoning, generation, understanding |
| Applications | Word completion | Financial analysis, report generation, code writing |
| Hardware | CPU/Single GPU | Distributed systems, specialized hardware |

This comparison highlights the massive scale difference between our toy model and production LLMs. However, the fundamental principles remain similar: predicting the next token based on previous context.

### Scaling to Financial Applications

To build LLMs suitable for finance, we would need to:

1. **Increase model size**: More layers, larger hidden dimensions, more parameters
2. **Use transformer architecture**: Replace our GRU with attention-based transformers
3. **Train on financial data**: SEC filings, financial news, analyst reports, market data
4. **Implement specialized training techniques**:
   - Reinforcement Learning from Human Feedback (RLHF)
   - Instruction fine-tuning
   - Domain adaptation
5. **Optimize for specific financial tasks**:
   - Financial sentiment analysis
   - Market prediction
   - Risk assessment
   - Regulatory compliance
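
To give a feel for step 2, here is a minimal sketch of the core operation inside a transformer: single-head scaled dot-product self-attention with a causal mask. It is an illustration under simplifying assumptions (one head, random weights, no batching, no positional information), and names such as `causal_self_attention` are hypothetical.

```python
import math
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model). Returns (seq_len, d_head) outputs and the attention weights."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / math.sqrt(k.shape[-1])
    # Mask future positions so position i only attends to positions <= i
    seq_len = x.shape[0]
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
d_model, d_head, seq_len = 16, 8, 5
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))

out, weights = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)   # torch.Size([5, 8])
print(weights[0])  # the first position can only attend to itself
```

Unlike the GRU, which compresses the whole prefix into one hidden state, attention lets each position look directly at every earlier position. That is part of what makes long contexts practical.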

The skills you've learned in this notebook provide the foundation for understanding these more complex models.

## 11. Conclusion

In this notebook, we've learned:

- How to build and train a simple character-level language model
- The process of preparing data for language model training
- Techniques for optimizing training on different hardware
- Methods for generating text with our trained model
- The relationship between simple models and large language models

This foundational knowledge helps understand how larger models like those used in finance are trained and operated. While our model is very simple, the core principles scale up to the most sophisticated language models used in the financial industry today.

## Next Steps

- Experiment with different model architectures (LSTM, Transformer)
- Try training on financial text data instead of simple words
- Implement more advanced optimization techniques
- Explore transfer learning by fine-tuning pre-trained models
- Apply these concepts to specific financial use cases