# Building a Simple GPT Model from Scratch with PyTorch

In this lesson, we will create a simplified version of the Generative Pre-trained Transformer (GPT) model using PyTorch. GPT models generate text by repeatedly predicting the next token given an initial input (the prompt).

## Objectives:
By the end of this notebook, you will be able to:
1. Understand the architecture of a GPT-like model.
2. Implement the key components of the GPT architecture.
3. Train the model on a toy dataset and evaluate its performance.
4. Generate text using the trained model.

## What is GPT?
GPT stands for Generative Pre-trained Transformer. It is a type of language model that uses transformer layers to predict the next token in a sequence. This autoregressive property allows GPT models to generate high-quality text, making them suitable for applications such as:
- Text completion
- Chatbots
- Text summarization
- Translation

In this notebook, we simplify the GPT architecture to focus on understanding the core concepts.


# Preliminaries: Libraries and Dependencies

Before diving into the implementation, let’s import the necessary libraries and briefly discuss their roles:
- **torch**: Core PyTorch library for building models and performing tensor computations.
- **torch.nn**: Contains prebuilt modules and layers for constructing neural networks.
- **torch.optim**: Provides optimization algorithms for training models.
- **torch.nn.functional**: Contains utility functions for common operations like activation functions and loss computations.

If you haven't installed PyTorch yet, follow the installation guide at https://pytorch.org/get-started/locally/.

Let’s proceed with importing the required libraries.


In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn import functional as F

# Key Components of the GPT Model

The GPT model consists of the following key components:

## 1. **Token Embeddings**
Tokens are the units of text a model reads (words, subwords, or, in this notebook, single letters), each mapped to an integer ID. The token embedding layer converts these IDs into dense vectors that the model can process.
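
As a minimal sketch of the lookup described above, the snippet below maps a small batch of token IDs to dense vectors with `nn.Embedding`. The sizes (26 tokens, 8 dimensions) and the example IDs are illustrative assumptions, not values the model requires.

```python
import torch
import torch.nn as nn

# Illustrative sizes only: 26 possible tokens, 8-dimensional vectors.
embedding = nn.Embedding(num_embeddings=26, embedding_dim=8)

# A batch containing one sequence of three token IDs.
token_ids = torch.tensor([[0, 1, 2]])

# Each ID selects one row of the embedding table.
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([1, 3, 8])
```

The output gains one trailing dimension: every integer ID is replaced by its learned vector.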

## 2. **Positional Embeddings**
Transformers do not inherently understand the order of tokens in a sequence. Positional embeddings are added to token embeddings to encode sequence order information.
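
To illustrate why positions matter, here is a small sketch (with assumed, illustrative sizes) in which the same token appears twice. Its token embedding is identical at both positions, and only the added positional embedding tells the two occurrences apart.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable
token_embedding = nn.Embedding(26, 8)     # illustrative sizes
position_embedding = nn.Embedding(10, 8)  # supports up to 10 positions

# The same token ID (0) appears at positions 0 and 1.
token_ids = torch.tensor([[0, 0]])
positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # [[0, 1]]

tokens_only = token_embedding(token_ids)
combined = tokens_only + position_embedding(positions)

# Without positions the two vectors are identical; with positions they differ.
print(torch.equal(tokens_only[0, 0], tokens_only[0, 1]))  # True
print(torch.equal(combined[0, 0], combined[0, 1]))        # False
```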

## 3. **Transformer Layers**
Transformer layers are the core building blocks of GPT models. Each layer contains:
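- **Self-attention mechanism**: Lets each position weigh the other positions in the sequence. In GPT it is causally masked, so a token can attend only to itself and to earlier tokens.
- **Feedforward neural network**: Processes the outputs of the self-attention mechanism independently at each position.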
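
The following is a simplified sketch of causally masked self-attention, not the implementation PyTorch uses internally. It assumes a single head and omits the learned query, key, and value projections; `causal_self_attention` is a hypothetical helper name introduced only for this illustration.

```python
import math
import torch
import torch.nn.functional as F

def causal_self_attention(x):
    """Single-head attention over x of shape (batch, seq, dim), without learned weights."""
    seq_len, dim = x.size(1), x.size(2)
    # Similarity of every position with every other position.
    scores = x @ x.transpose(1, 2) / math.sqrt(dim)
    # Mask out future positions (entries strictly above the diagonal).
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ x, weights

x = torch.randn(1, 4, 8)
out, weights = causal_self_attention(x)
print(out.shape)   # torch.Size([1, 4, 8])
print(weights[0])  # lower-triangular: position t only sees positions <= t
```

Because the first position can attend only to itself, its output equals its input in this simplified version.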

## 4. **Output Layer**
The output layer generates logits (unnormalized scores) for predicting the next token in the sequence. These logits are passed through a softmax function to produce probabilities.
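
As a brief sketch of that last step, the snippet below turns a vector of logits into probabilities with softmax. The four logit values are made up for illustration and stand in for the output of a model with a four-token vocabulary.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a 4-token vocabulary.
logits = torch.tensor([2.0, 1.0, 0.1, -1.0])
probs = F.softmax(logits, dim=-1)

print(probs.sum())          # sums to 1, up to floating-point rounding
print(torch.argmax(probs))  # tensor(0): the largest logit gets the highest probability
```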

In the next section, we will implement these components step by step in the `SimpleGPT` class.


In [2]:
class SimpleGPT(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, num_layers, max_seq_length):
        super().__init__()
        # Token and positional embeddings
        self.token_embedding = nn.Embedding(vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_seq_length, embed_size)

        # Transformer layers; batch_first=True because inputs are (batch, seq, embed)
        self.transformer_layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=embed_size, nhead=num_heads, batch_first=True)
            for _ in range(num_layers)
        ])

        # Output layer
        self.output_layer = nn.Linear(embed_size, vocab_size)
        self.max_seq_length = max_seq_length

    def forward(self, x):
        # x: (batch, seq_length) tensor of token IDs
        seq_length = x.size(1)
        # Token embeddings: (batch, seq_length, embed_size)
        token_embeds = self.token_embedding(x)
        # Positional embeddings: (1, seq_length, embed_size), broadcast over the batch
        positions = torch.arange(seq_length, device=x.device).unsqueeze(0)
        position_embeds = self.position_embedding(positions)
        # Combine embeddings
        x = token_embeds + position_embeds
        # Causal mask: True marks future positions that must not be attended to
        causal_mask = torch.triu(
            torch.ones(seq_length, seq_length, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        # Apply transformer layers
        for layer in self.transformer_layers:
            x = layer(x, src_mask=causal_mask)
        # Output logits: (batch, seq_length, vocab_size)
        logits = self.output_layer(x)
        return logits

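    def generate(self, start_tokens, max_length, temperature=1.0, eos_token_id=None):
        # max_length is the number of new tokens to sample.
        # eos_token_id is optional: the alphabet vocabulary used below has no EOS
        # token (ID 1 is the letter B), so generation stops early only if one is given.
        self.eval()
        device = next(self.parameters()).device
        current_seq = list(start_tokens)  # copy so the caller's list is not modified
        with torch.no_grad():
            for _ in range(max_length):
                # Keep only the most recent tokens so position indices stay in range.
                # Note that this also drops the oldest tokens from the returned sequence.
                if len(current_seq) > self.max_seq_length:
                    current_seq = current_seq[-self.max_seq_length:]
                inputs = torch.tensor(current_seq, device=device).unsqueeze(0)
                logits = self(inputs)
                # Scale the last position's logits by the temperature, then sample
                next_token_logits = logits[0, -1, :] / temperature
                probs = F.softmax(next_token_logits, dim=-1)
                next_token = torch.multinomial(probs, num_samples=1).item()
                current_seq.append(next_token)
                if eos_token_id is not None and next_token == eos_token_id:
                    break
        return current_seq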

# Training the GPT Model

Once we define the architecture of the model, the next step is to train it on a dataset. Training involves:
1. Feeding a sequence of tokens to the model.
2. Comparing the model's predictions with the ground truth (target tokens).
3. Backpropagating the loss to adjust the model's weights.

### Key Concepts in Training:
- **Loss Function**: We use Cross-Entropy Loss to measure how well the predicted token probabilities match the target tokens.
- **Optimizer**: Adam optimizer is used to adjust the model's weights and biases during training.
- **Batch Processing**: Training on multiple sequences at a time for efficiency.
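
The sketch below shows how these pieces fit together for a batch of sequences. It uses random logits as a stand-in for model output and illustrative sizes, so the loss value itself carries no meaning; the point is the reshaping that lets Cross-Entropy Loss treat every position as one classification example.

```python
import torch
import torch.nn as nn

batch_size, seq_length, vocab_size = 2, 3, 26  # illustrative sizes
logits = torch.randn(batch_size, seq_length, vocab_size)  # stand-in for model output
targets = torch.randint(0, vocab_size, (batch_size, seq_length))

criterion = nn.CrossEntropyLoss()
# Flatten so every position in every sequence counts as one classification example.
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.shape)  # torch.Size([]): a single scalar averaged over all positions
```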

## Alphabet Sequence Dataset

To make the training process meaningful, we will train the model on sequences of letters from the English alphabet.

### Example:
- **Input**: `A B C D`
- **Target**: `B C D E`

The model learns to predict the next letter in a sequence.
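
A minimal sketch of how one sequence yields both the input and the target: the target is the same sequence shifted one position to the left. The five-letter sequence here is just an example.

```python
sequence = ["A", "B", "C", "D", "E"]

inputs = sequence[:-1]   # everything except the last letter
targets = sequence[1:]   # everything except the first letter

# Each input letter is paired with the letter that follows it.
for inp, tgt in zip(inputs, targets):
    print(f"{inp} -> {tgt}")
```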

### Tokenization
Since the model operates on numerical data, we need to map each letter to a unique token ID.
For example:
- `A -> 0`
- `B -> 1`
- `C -> 2`

At the end of training, we will map the generated token IDs back to letters to evaluate the model's performance.
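
The mapping can be sketched as a pair of dictionaries with a round trip from letters to IDs and back. The helper names `encode` and `decode` are hypothetical and are introduced only for this illustration.

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
token_to_id = {ch: i for i, ch in enumerate(alphabet)}
id_to_token = {i: ch for ch, i in token_to_id.items()}

def encode(text):
    """Map a string of uppercase letters to a list of token IDs."""
    return [token_to_id[ch] for ch in text]

def decode(ids):
    """Map a list of token IDs back to a string."""
    return "".join(id_to_token[i] for i in ids)

ids = encode("CAB")
print(ids)          # [2, 0, 1]
print(decode(ids))  # CAB
```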



In [24]:
# Define hyperparameters
vocab_size = 26
embed_size = 128
num_heads = 4
num_layers = 2
max_seq_length = 10

# Instantiate the model
model = SimpleGPT(vocab_size, embed_size, num_heads, num_layers, max_seq_length)
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

# Tokenize a repeating alphabet sequence
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
token_to_id = {ch: i for i, ch in enumerate(alphabet)}
id_to_token = {i: ch for ch, i in token_to_id.items()}

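# Generate dataset
def generate_text_data(batch_size, seq_length):
    # Each sample is a run of seq_length + 1 consecutive letters (inputs plus one shifted target)
    data = []
    for _ in range(batch_size):
        # randint's upper bound is exclusive, so this allows runs that end at Z
        start = torch.randint(0, len(alphabet) - seq_length, (1,)).item()
        sequence = [token_to_id[ch] for ch in alphabet[start : start + seq_length + 1]]
        data.append(sequence)
    return torch.tensor(data)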

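# Training loop using alphabet sequences
model.train()  # make sure dropout is active, e.g. when re-running after generate()
for epoch in range(5):  # Number of epochs (each one draws 10 fresh random batches)
    for _ in range(10):  # Number of batches
        data = generate_text_data(batch_size=8, seq_length=max_seq_length-1)
        inputs, targets = data[:, :-1], data[:, 1:]  # Targets are the inputs shifted by one position
        optimizer.zero_grad()
        logits = model(inputs)  # (batch, seq, vocab_size)
        # Flatten batch and sequence dimensions so each position is one classification example
        loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
        loss.backward()
        optimizer.step()
    # Report the loss of the last batch in this epoch
    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")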

Epoch 1, Loss: 0.17410090565681458
Epoch 2, Loss: 0.0467846542596817
Epoch 3, Loss: 0.026736455038189888
Epoch 4, Loss: 0.020729782059788704
Epoch 5, Loss: 0.013821137137711048


# Text Generation with the GPT Model

Once trained, the model can generate text using its `generate` method. The process involves:
1. Starting with an initial sequence of tokens (called seed tokens).
2. Feeding the sequence to the model to predict the next token.
3. Sampling the next token from the predicted probabilities.
4. Appending the new token to the sequence and repeating the process.

### Important Parameters:
- **max_length**: The maximum number of new tokens to sample.
- **temperature**: Controls randomness in predictions. Lower values make predictions more deterministic, while higher values introduce more variability.
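
To make the effect of temperature concrete, here is a small sketch that applies softmax to the same logits at three temperatures. The logit values are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.0])  # hypothetical next-token logits

# Dividing by a temperature below 1 sharpens the distribution; above 1 flattens it.
for temperature in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / temperature, dim=-1)
    print(temperature, probs)

sharp = F.softmax(logits / 0.5, dim=-1)
flat = F.softmax(logits / 2.0, dim=-1)
```

At temperature 0.5 the most likely token takes a larger share of the probability mass than at temperature 2.0, so sampling picks it more often.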

Let’s see how our model performs text generation!

In [25]:
# Generate text and map tokens back to characters
start_tokens = [token_to_id[ch] for ch in "ABC"]  # Example start sequence

generated_sequence = model.generate(start_tokens, max_length=10)
decoded_sequence = [id_to_token[token] for token in generated_sequence]
print("Generated sequence:", " ".join(decoded_sequence))

Generated sequence: C D E F G H I J K L M


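In this recorded run, the model continues the alphabet in order, which suggests it has learned the next-letter pattern for the sequences it was trained on. The number of new letters sampled is `max_length`; `max_seq_length` is the context window, and both happen to be 10 here. The output starts at `C` rather than `A` because, before each step, `generate` trims the running sequence to its most recent `max_seq_length` tokens, so the oldest letters are dropped from the returned list.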




# Exercises: Alphabet Sequences

1. **Experiment with Sequence Lengths**:
   Train the model with longer or shorter sequence lengths. Does the model still generalize well?

2. **Add Positional Noise**:
   Introduce random shuffling in the input sequences. Observe how the model learns when the sequences are not perfectly ordered.

3. **Multi-character Prediction**:
   Extend the model to predict pairs of letters (e.g., input: `A B`, output: `C D`).

4. **Custom Tokenizer**:
   Modify the tokenizer to include lowercase letters or additional symbols.

5. **Overfit Small Dataset**:
   Train the model on a very small dataset (e.g., `A B C`, `B C D`) and observe its ability to memorize sequences.
