# Handmade Custom GPT Model  

This notebook implements and partially trains a custom Generative Pre-Trained Transformer (GPT) model, minimizing the use of premade transformer objects or classes. A GPT is a powerful model capable of generating text by predicting the next token in a sequence based on the given context.  

---

## Key Components:  
- **Attention Mechanism**: The core feature that allows the model to focus on relevant parts of the input sequence.  

- **Positional Encoding**: A method to provide positional information to the model for sequential data.  

- **Transformer Architecture**: The design and logic behind the structure of the transformer model.  

- **Tokenization**: Converting text into a format that can be processed by the model.  

- **Data Preprocessing**: Preparing input data for training.  

- **Training Loop**: Updating the model's parameters to minimize the loss function.  

- **Evaluation Metrics**: Assessing the model's performance using appropriate metrics.  

- **Text Generation**: Generating text using the trained model.  

This implementation emphasizes the fundamental concepts of transformer architecture, aiming to provide a deeper understanding of the internal mechanisms of transformer-based models.  

---

## Project Goals:  
1. Build a custom GPT model from scratch with minimal reliance on premade transformer objects or classes.  

2. Gain a thorough understanding of the internal workings of transformer architectures.  

3. Train the model on text data for next-token prediction.  

4. Generate text using the custom-built model.  

Although I could not fully train the model due to time constraints, I trained it long enough to demonstrate the output of a partially trained GPT model. This serves as a practical example of the capabilities and limitations of an undertrained transformer-based model.  

---

## Transformer

First, the attention mechanism itself is implemented. The function below takes the query (Q), key (K), and value (V) matrices, computes the scaled attention scores, and returns the attention-weighted values along with the attention weights:

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Calculate the dot product between Q and the transpose of K
    scores = torch.matmul(Q, K.transpose(-2, -1)) # (batch_size, num_heads, seq_len, seq_len)
    
    # Scale the scores by the square root of the dimensionality of Q
    scaled_scores = scores / torch.sqrt(torch.tensor(Q.size(-1), dtype=torch.float32)) # (batch_size, num_heads, seq_len, seq_len)
    
    # Apply the mask if provided
    if mask is not None:
        scaled_scores = scaled_scores.masked_fill(mask.unsqueeze(1) == 0, float('-inf')) # (batch_size, num_heads, seq_len, seq_len)

    # Calculate the attention weights using softmax
    attention_weights = F.softmax(scaled_scores, dim=-1) # (batch_size, num_heads, seq_len, seq_len)

    # Multiply the attention weights with the value matrix V
    output = torch.matmul(attention_weights, V) # (batch_size, num_heads, seq_len, d_v)

    return output, attention_weights
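
As a quick sanity check of the masking behaviour, the sketch below repeats the same computation on random tensors with a lower-triangular (causal) mask. The tensor sizes and the helper name `causal_attention_demo` are illustrative assumptions, not part of the notebook.

```python
import math
import torch
import torch.nn.functional as F

def causal_attention_demo(batch_size=2, num_heads=4, seq_len=5, d_head=8):
    # Random Q, K, V with the (batch, heads, seq, d_head) layout used above
    Q = torch.randn(batch_size, num_heads, seq_len, d_head)
    K = torch.randn(batch_size, num_heads, seq_len, d_head)
    V = torch.randn(batch_size, num_heads, seq_len, d_head)

    # 1 = may attend, 0 = blocked (future position)
    mask = torch.tril(torch.ones(seq_len, seq_len))

    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_head)
    scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights

output, weights = causal_attention_demo()
print(output.shape)   # torch.Size([2, 4, 5, 8])
print(weights[0, 0])  # upper triangle is exactly zero; each row sums to 1
```

Because masked scores are set to negative infinity before the softmax, future positions receive a weight of exactly zero rather than merely a small one.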

Below is the code for the multi-head attention layer of the decoder. This layer computes attention across multiple heads in parallel, which lets each head attend to different relationships in the sequence and makes the attention mechanism more robust.

In [2]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.num_heads = num_heads
        self.d_head = d_model // num_heads  # Dimension per head
        
        # Linear layers for Q, K, V
        self.linear_Q = nn.Linear(d_model, d_model)
        self.linear_K = nn.Linear(d_model, d_model)
        self.linear_V = nn.Linear(d_model, d_model)
        
        # Output linear layer
        self.linear_O = nn.Linear(d_model, d_model)

        # Initialize weights
        self.apply(self.initialize_weights)

    def initialize_weights(self, layer):
        if isinstance(layer, nn.Linear):
            nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')  # Kaiming initialization for ReLU activations
            if layer.bias is not None:
                nn.init.zeros_(layer.bias)  # Initialize biases to zero

    def forward(self, q, k, v, mask=None):
        batch_size, seq_len, _ = q.size()
        # Project Q, K, V and reshape for multi-head attention
        q = self.linear_Q(q).view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)  # (batch_size, num_heads, seq_len, d_head)
        k = self.linear_K(k).view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)  # (batch_size, num_heads, seq_len, d_head)
        v = self.linear_V(v).view(batch_size, seq_len, self.num_heads, self.d_head).transpose(1, 2)  # (batch_size, num_heads, seq_len, d_head)

        # Apply scaled dot-product attention
        attn_out, _ = scaled_dot_product_attention(q, k, v, mask=mask)  # (batch_size, num_heads, seq_len, d_head)

        # Concatenate heads and apply final linear layer
        attn_out = attn_out.transpose(1, 2).contiguous().view(batch_size, seq_len, -1)  # (batch_size, seq_len, d_model)
        output = self.linear_O(attn_out)  # Project back to d_model

        return output
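
The reshaping in `forward` is the least obvious part of this layer. The following sketch, using small made-up dimensions, shows that splitting into heads and merging back is a lossless round trip and that each head simply sees a contiguous slice of the feature dimension.

```python
import torch

batch_size, seq_len, d_model, num_heads = 2, 6, 32, 4
d_head = d_model // num_heads

x = torch.randn(batch_size, seq_len, d_model)

# Split: (batch, seq, d_model) -> (batch, heads, seq, d_head)
heads = x.view(batch_size, seq_len, num_heads, d_head).transpose(1, 2)

# Merge: (batch, heads, seq, d_head) -> (batch, seq, d_model)
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, -1)

print(heads.shape)             # torch.Size([2, 4, 6, 8])
print(torch.equal(x, merged))  # True
```

The `.contiguous()` call is needed because `transpose` returns a non-contiguous view, and `view` requires contiguous memory.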


Below is the code for the decoder block used in the custom GPT model. This is the foundational block of the model, which is repeated and stacked to create the final model.

In [3]:
class Decoder(nn.Module):
    def __init__(self, d_model, d_ff, num_heads, dropout=0.2):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        
        # Self-attention layer
        self.self_attention = MultiHeadAttention(d_model, num_heads)
        
        # Layer norms and feed-forward
        self.layer_norm1 = nn.LayerNorm(d_model)
        self.layer_norm2 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )

        # Initialize weights
        self.apply(self.initialize_weights)

    def initialize_weights(self, layer):
        if isinstance(layer, nn.Linear):
            nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu') # Kaiming initialization for ReLU activations
            if layer.bias is not None:
                nn.init.zeros_(layer.bias) # Initialize biases to zero
        elif isinstance(layer, nn.LayerNorm):
            nn.init.ones_(layer.weight) # Initialize weights to one
            nn.init.zeros_(layer.bias) # Initialize biases to zero

    def forward(self, x, mask=None):
        # Self-attention
        attention_output = self.self_attention(x, x, x, mask=mask)
        attention_output = self.dropout(attention_output)
        x = self.layer_norm1(x + attention_output)

        # Feed-forward
        feed_forward_output = self.feed_forward(x)
        feed_forward_output = self.dropout(feed_forward_output)
        x = self.layer_norm2(x + feed_forward_output)

        return x


Below is the code for position encoding which is used to add positional information to the input embeddings.

In [4]:
class PositionEncoder(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()

        # Create empty matrix for positional encodings
        self.pe = torch.zeros(max_len, d_model)
        
        # Create position vector and reshape for broadcasting
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        
        # Calculate division terms for the positional encoding formula
        div_term = 10000 ** (torch.arange(0, d_model, 2)/d_model)
        
        # Fill even indices with sine function
        self.pe[:, 0::2] = torch.sin(position / div_term)
        
        # Fill odd indices with cosine function
        self.pe[:, 1::2] = torch.cos(position / div_term)

    def forward(self, x):
        return self.pe[:x.size(1)].to(x.device)
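
One possible variation, shown only as a sketch: storing the table with `register_buffer` lets `model.to(device)` move it once instead of copying it on every forward pass. Passing `persistent=False` keeps it out of the `state_dict`, so a checkpoint saved from the class above should still load. The class name `BufferedPositionEncoder` is hypothetical, and an even `d_model` is assumed.

```python
import torch
import torch.nn as nn

class BufferedPositionEncoder(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)     # (max_len, 1)
        div_term = 10000 ** (torch.arange(0, d_model, 2).float() / d_model)  # (d_model / 2,)

        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position / div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position / div_term)  # odd dimensions

        # Non-persistent buffer: follows .to(device) but is not saved in state_dict
        self.register_buffer("pe", pe, persistent=False)

    def forward(self, x):
        # x: (batch, seq_len) token ids; returns (seq_len, d_model)
        return self.pe[: x.size(1)]

encoder = BufferedPositionEncoder(d_model=8)
tokens = torch.zeros(2, 5, dtype=torch.long)
print(encoder(tokens).shape)  # torch.Size([5, 8])
print(encoder(tokens)[0])     # position 0: sin(0) = 0 on even dims, cos(0) = 1 on odd dims
```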


Below is the code for the embedding layer, which converts input token IDs into dense vector representations that the rest of the model can process.

In [5]:
class Embedding(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # Create embedding layer that maps token indices to dense vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
    
    def forward(self, token_indices):
        embedded = self.embedding(token_indices)
        return embedded


Below is the code for the GPT model with all of the components put together:

In [6]:
class Transformer(nn.Module):
    def __init__(self, d_model, d_ff, num_heads, vocab_size, num_blocks, dropout=0.2):
        super().__init__()

        # Core components
        self.dropout = nn.Dropout(dropout)
        self.embedding = Embedding(vocab_size, d_model)
        self.embedding_scale = math.sqrt(d_model)  # Scaling factor for embeddings
        self.position_encoder = PositionEncoder(d_model)
        self.linear_output = nn.Linear(d_model, vocab_size)  # Final projection layer

        # Create stack of decoder blocks
        self.blocks = nn.ModuleList([Decoder(d_model, d_ff, num_heads, dropout=dropout) for _ in range(num_blocks)])

        # Initialize model weights
        self._init_weights()

    def _init_weights(self):
        for module in self.modules():
            if isinstance(module, (nn.Linear, nn.Embedding)):
                nn.init.xavier_uniform_(module.weight) # Xavier initialization for linear and embedding weights (overrides the Kaiming init applied inside the sub-modules)
            elif isinstance(module, nn.LayerNorm):
                nn.init.constant_(module.bias, 0) # Initialize layer norm bias with zeros
                nn.init.constant_(module.weight, 1.0) # Initialize layer norm weights with ones
            if hasattr(module, 'bias') and module.bias is not None:
                nn.init.constant_(module.bias, 0) # Initialize biases to zero

    def forward(self, x, mask=None):
        # Combine token embeddings with positional encodings
        x = self.embedding(x) * self.embedding_scale + self.position_encoder(x)
        x = self.dropout(x)

        # Process through the decoder blocks. Each Decoder already applies residual
        # connections internally, so the `+ x` here adds a second, outer skip connection
        for block in self.blocks:
            x = block(x, mask=mask) + x

        # Project to vocabulary size (raw logits; softmax is applied by the loss or during sampling)
        logits = self.linear_output(x)
        return logits

    def create_causal_mask(self, sequence_length):
        # Ones strictly above the diagonal mark future positions
        mask = torch.triu(torch.ones(sequence_length, sequence_length), diagonal=1)
        return 1 - mask  # Invert: 1 = position may be attended to, 0 = masked future position

    def generate(self, prompt, sp, max_length=50, temperature=1.0, top_k=50, device="cuda"):
        """
        Generate text using the transformer model
        
        Parameters:
            prompt: Initial input sequence tensor
            sp: SentencePiece tokenizer instance
            max_length: Maximum length of generated sequence
            temperature: Controls randomness (lower = more deterministic)
            top_k: Number of highest probability tokens to consider
            device: Computing device (cuda/cpu)
        """ 
        self.eval()  # Set the model to evaluation mode
        with torch.no_grad():
            # Convert prompt to tensor and move to device
            input_ids = prompt.clone().unsqueeze(0).to(device)  # Add batch dimension and move to device

            # Flatten input_ids to a single list of tokens
            generated = input_ids.squeeze(0).tolist()  # Store generated tokens as a flat list

            # Precompute the full causal mask once; it is sliced to the current length at each step
            full_mask = self.create_causal_mask(max_length).unsqueeze(0).to(device)

            for _ in range(len(generated), max_length):
                # Prepare input for the decoder
                input_tensor = torch.tensor(generated).unsqueeze(0).to(device)  # Add batch dimension, move to device

                # Slice the causal mask to the current sequence length
                mask = full_mask[:, :len(generated), :len(generated)]

                # Forward pass to get output logits
                output_probs = self.forward(input_tensor, mask=mask)

                # Get the logits for the last position
                last_token_logits = output_probs[:, -1, :]  # Shape: (1, vocab_size)

                # Apply top-k filtering
                top_k_probs, top_k_indices = torch.topk(last_token_logits, top_k, dim=-1)

                # Apply temperature and convert the top-k logits to probabilities
                top_k_probs = F.softmax(top_k_probs / temperature, dim=-1)
                
                # Sample from the top-k probabilities
                sampled_index = torch.multinomial(top_k_probs, num_samples=1).cpu().item()

                # Retrieve the corresponding token index
                next_token_index = top_k_indices[:, sampled_index].item()  # Ensure it's a scalar

                # Append the generated token to the list
                generated.append(next_token_index)

                # Check for end-of-sequence token
                if next_token_index == sp.piece_to_id('</s>'):  # Use the EOS token ID from the SentencePiece model
                    break

            return generated  # Return generated tokens
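
The sampling step inside `generate` can be isolated and tried on a toy logits vector. The sketch below is a standalone illustration; the function name `sample_top_k` and the clamp on `top_k` are assumptions added here, not part of the model class.

```python
import torch
import torch.nn.functional as F

def sample_top_k(logits, top_k=50, temperature=1.0):
    # logits: (vocab_size,) unnormalized scores for the next token
    top_k = min(top_k, logits.size(-1))               # guard against top_k > vocab size
    values, indices = torch.topk(logits, top_k)       # keep the k largest logits
    probs = F.softmax(values / temperature, dim=-1)   # temperature-scaled distribution
    choice = torch.multinomial(probs, num_samples=1)  # position within the top-k list
    return indices[choice].item()                     # map back to a vocabulary id

logits = torch.tensor([2.0, 0.5, -1.0, 3.0, 0.0])
print(sample_top_k(logits, top_k=1))  # 3 (top_k=1 always picks the argmax)
token = sample_top_k(logits, top_k=2)
print(token in (0, 3))                # True
```

Lower temperatures sharpen the distribution over the surviving candidates, while a smaller `top_k` removes unlikely tokens from consideration entirely.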


## Tokenizer

Next, the dataset is loaded and tokenized. The dataset used is OpenWebText from Hugging Face, which is downloaded and saved to disk in the next few steps:

In [None]:
from datasets import load_dataset

# Load OpenWebText dataset
dataset = load_dataset("openwebtext", trust_remote_code=True)

In [None]:
# Save to disk
dataset.save_to_disk('openwebtext_local')

If the dataset has already been saved to the disk, it can be loaded from the disk using the following code:

In [7]:
from datasets import DatasetDict

loaded_dataset = DatasetDict.load_from_disk('openwebtext_local')


First let us check the structure of the dataset:

In [8]:
# Print details about the dataset
print(loaded_dataset)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 8013769
    })
})


The dataset can then be divided into training and test sets with a simple random split, since all of the samples come from the same source and should be similarly distributed.

In [9]:
# Take a sample of the dataset which allows for smaller sections to be used if desired
sampled_dataset = loaded_dataset['train'].shuffle(seed=42).select(range(8000000))

# Split the sampled dataset into train and test sets
train_test_split = sampled_dataset.train_test_split(test_size=0.1, seed=42)

# Access the train and test datasets
train_data = train_test_split['train']
test_data = train_test_split['test']

print(f"Train dataset length: {len(train_data)}")
print(f"Test dataset length: {len(test_data)}")


Train dataset length: 7200000
Test dataset length: 800000


Let's take a look at one sample from the dataset:

In [10]:
train_data[0]['text']



Next, the tokenizer will be trained on the dataset. I chose to use the SentencePiece tokenizer, a subword tokenizer that requires training on a .txt file.

First we save the data to a .txt file:

In [14]:
# Save sampled texts to a file for SentencePiece training
with open("sampled_texts.txt", "w", encoding="utf-8") as f:
    for sample in sampled_dataset:
        f.write(sample['text'] + "\n")

Then we train the SentencePiece model to build a vocabulary of a user-defined size:

In [1]:
import sentencepiece as spm
vocab_size = 30000
# Train SentencePiece model
spm.SentencePieceTrainer.train(
    input="sampled_texts.txt",
    model_prefix='m',
    vocab_size=vocab_size,
    input_sentence_size=1000000,
    shuffle_input_sentence=True
)

By running the code below we can view the model's vocab and see how it splits up text:

In [13]:
import sentencepiece as spm

# Load the SentencePiece model
sp = spm.SentencePieceProcessor(model_file='m.model')  # Use the model file you trained

# View all of the SentencePiece tokens (the vocabulary size is read from the loaded model)
for i in range(sp.get_piece_size()):
    token = sp.id_to_piece(i)
    print(f"ID: {i}, Token: {token}")


ID: 0, Token: <unk>
ID: 1, Token: <s>
ID: 2, Token: </s>
ID: 3, Token: ▁the
ID: 4, Token: ,
ID: 5, Token: .
ID: 6, Token: s
ID: 7, Token: ▁to
ID: 8, Token: ▁of
ID: 9, Token: ▁and
ID: 10, Token: ▁a
ID: 11, Token: ▁in
ID: 12, Token: ▁that
ID: 13, Token: ’
ID: 14, Token: ▁is
ID: 15, Token: -
ID: 16, Token: ▁for
ID: 17, Token: ▁on
ID: 18, Token: ▁
ID: 19, Token: '
ID: 20, Token: ▁it
ID: 21, Token: ▁with
ID: 22, Token: ▁I
ID: 23, Token: ▁The
ID: 24, Token: ▁was
ID: 25, Token: t
ID: 26, Token: ▁be
ID: 27, Token: ▁as
ID: 28, Token: ▁are
ID: 29, Token: :
ID: 30, Token: ▁“
ID: 31, Token: ▁have
ID: 32, Token: ▁you
ID: 33, Token: ▁at
ID: 34, Token: ▁by
ID: 35, Token: ▁from
ID: 36, Token: ▁(
ID: 37, Token: ▁this
ID: 38, Token: ▁he
ID: 39, Token: ▁not
ID: 40, Token: ▁"
ID: 41, Token: ▁an
ID: 42, Token: ▁has
ID: 43, Token: ▁his
ID: 44, Token: ▁or
ID: 45, Token: ing
ID: 46, Token: )
ID: 47, Token: ▁they
ID: 48, Token: ▁said
ID: 49, Token: ▁but
ID: 50, Token: ▁we
ID: 51, Token: ▁will
ID: 52, Token: ed

## Training data

Next, a dataset class is created to format the data for the model. For each sample it builds a target sequence, which is the input shifted by one position so that the model learns to predict the next token, and a causal mask, which ensures that each position can attend only to itself and to earlier positions.

In [11]:
import torch
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, sampled_dataset, sequence_length, tokenizer, pad_token=0):
        """
        :param sampled_dataset: Hugging Face Dataset object (e.g., loaded_dataset['train'].select(range(10)))
        :param sequence_length: Length of each sequence for the model
        :param tokenizer: A pre-trained SentencePiece tokenizer
        :param pad_token: Padding token, default is 0
        """
        self.sampled_dataset = sampled_dataset
        self.seq_len = sequence_length
        self.tokenizer = tokenizer
        self.pad_token = pad_token

    def __len__(self):
        return len(self.sampled_dataset)
    
    def create_causal_mask(self, sequence_length):
        # Create an upper triangular matrix to be used as a mask
        mask = torch.triu(torch.ones(sequence_length, sequence_length), diagonal=1)
        return 1 - mask # Flip the created mask to get the desired causal mask

    def __getitem__(self, idx):
        # Get the 'text' field from the dataset
        text = self.sampled_dataset[idx]['text']
        
        # Tokenize the 'text' field using the SentencePiece tokenizer
        input_seq = self.tokenizer.encode(text)
        target_seq = input_seq[1:] + [self.pad_token]  # Shift by one position for the target sequence
        
        # Pad or truncate sequences to the correct length
        if len(input_seq) < self.seq_len:
            input_seq = input_seq + [self.pad_token] * (self.seq_len - len(input_seq))  # Padding
        elif len(input_seq) > self.seq_len:
            input_seq = input_seq[:self.seq_len]  # Truncating

        if len(target_seq) < self.seq_len:
            target_seq = target_seq + [self.pad_token] * (self.seq_len - len(target_seq))  # Padding
        elif len(target_seq) > self.seq_len:
            target_seq = target_seq[:self.seq_len]  # Truncating

        # Create the causal mask (1 = may attend, 0 = future position); padding is not masked here
        mask = self.create_causal_mask(self.seq_len)
        
        return (
            torch.LongTensor(input_seq), # Input sequence
            torch.LongTensor(target_seq), # Target sequence
            mask  # Mask
        )
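
The shift, pad, and truncate logic can be checked in isolation on a toy token list. This is a minimal sketch in plain Python; `make_example` is a hypothetical helper that mirrors what `__getitem__` does to the token IDs.

```python
def make_example(token_ids, seq_len, pad_token=0):
    # Target is the input shifted left by one, with a pad in the final slot
    target_ids = token_ids[1:] + [pad_token]

    def fit(ids):
        # Truncate to seq_len, then right-pad if the sequence is shorter
        ids = ids[:seq_len]
        return ids + [pad_token] * (seq_len - len(ids))

    return fit(token_ids), fit(target_ids)

inputs, targets = make_example([11, 12, 13], seq_len=5)
print(inputs)   # [11, 12, 13, 0, 0]
print(targets)  # [12, 13, 0, 0, 0]

long_inputs, long_targets = make_example([11, 12, 13, 14, 15, 16], seq_len=4)
print(long_inputs)   # [11, 12, 13, 14]
print(long_targets)  # [12, 13, 14, 15]
```

Note that when a text is longer than the sequence length, the last target is a real token (the one just past the truncation point), whereas for short texts the trailing targets are all padding.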


Below is an example of what an item of the dataset might look like:

In [16]:
import sentencepiece as spm

# Load the SentencePiece model
sp = spm.SentencePieceProcessor(model_file='m.model')

sequence_length = 64

# Create dataset and display the first item
train_dataset = TextDataset(train_data, sequence_length, sp)
train_dataset[0]

(tensor([  473,   739,   412,    21,    85,    76,   779,    14,    77,    42,
          1057,   171,    60,   524,   206,    33, 21724, 11927,     6,     5,
           264,     4,    50,  1229,     9,   605,    58,    60,  4025,   412,
            66,    82,    63,    50,    53,  1919,   524,   996,    96, 10619,
          4599,     4,  4911,     9,  4182,     6,     5,   107,   134,    87,
            50,  1919,   259,  2038,     8, 21724,   412,     9,   134,    87,
            50,    75,   739,    97]),
 tensor([  739,   412,    21,    85,    76,   779,    14,    77,    42,  1057,
           171,    60,   524,   206,    33, 21724, 11927,     6,     5,   264,
             4,    50,  1229,     9,   605,    58,    60,  4025,   412,    66,
            82,    63,    50,    53,  1919,   524,   996,    96, 10619,  4599,
             4,  4911,     9,  4182,     6,     5,   107,   134,    87,    50,
          1919,   259,  2038,     8, 21724,   412,     9,   134,    87,    50,
            7

## Training

The following code trains the custom model. First we define the hyperparameters and initialize the model, the dataset, and the dataloaders for the training and validation sets:

In [13]:
import torch
import torch.optim as optim
from torch.utils.data import DataLoader, random_split
import sentencepiece as spm
import gc

# Empty the cache to free up memory
gc.collect()
torch.cuda.empty_cache()

# Define parameters
d_model = 1024       # Model dimension
d_ff = 1024          # Feed-forward dimension
num_heads = 16       # Number of attention heads
blocks = 24          # Number of decoder blocks
vocab_size = 30000   # Vocabulary size, used to determine number of output logits
dropout = 0.2        # Dropout rate

# Define device (GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create GPT model
model = Transformer(d_model, d_ff, num_heads, vocab_size, blocks, dropout=dropout).to(device)

# Create tokenizer model
sp = spm.SentencePieceProcessor(model_file='m.model')

num_epochs = 5         # Number of training epochs
batch_size = 64        # Batch size
sequence_length = 64   # Sequence length of input sequences

# Create the dataset and split into training and validation sets
full_dataset = TextDataset(train_data, sequence_length, sp)
train_size = int(0.8 * len(full_dataset))
val_size = len(full_dataset) - train_size
train_dataset, val_dataset = random_split(full_dataset, [train_size, val_size])

# Create DataLoaders
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

The helper below counts the number of trainable parameters in the model:

In [14]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

total_params = count_parameters(model)
print(f"Total parameters: {total_params}")

Total parameters: 212710704
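
The reported total can be reproduced by hand from the hyperparameters defined earlier, which is a useful check that the architecture is what it is expected to be. The sketch below is plain arithmetic; the helper `linear_params` is introduced here only for illustration. The positional encoding contributes nothing because it is a fixed tensor rather than a learned parameter.

```python
d_model, d_ff, vocab_size, num_blocks = 1024, 1024, 30000, 24

def linear_params(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix plus bias

embedding = vocab_size * d_model                 # nn.Embedding has no bias
attention = 4 * linear_params(d_model, d_model)  # Q, K, V and output projections
feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
layer_norms = 2 * (2 * d_model)                  # weight and bias for each of two LayerNorms
per_block = attention + feed_forward + layer_norms
output_layer = linear_params(d_model, vocab_size)

total = embedding + num_blocks * per_block + output_layer
print(per_block)  # 6301696
print(total)      # 212710704
```

Roughly 61 million of the parameters sit in the embedding and output projection, and the remaining 151 million are spread across the 24 decoder blocks.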


The code below loads a previously saved checkpoint so that training or evaluation can be resumed:

In [15]:
# Load the model's state_dict
model.load_state_dict(torch.load("model_state.pth", weights_only=True))

<All keys matched successfully>

This is the actual training code that passes the data to the model, calculates gradients, and updates parameters. Specific code is commented on below:

In [None]:
from torch.amp import autocast, GradScaler

scaler = GradScaler()

# Initialize optimizer, learning-rate scheduler, and loss function
optimizer = optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience=5)
criterion = nn.CrossEntropyLoss()

# Example training loop
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    for i, batch in enumerate(train_dataloader):
        input_seq, output_seq, mask = batch

        # Move tensors to the appropriate device
        input_seq = input_seq.to(device)
        output_seq = output_seq.to(device)
        mask = mask.to(device)

        # Zero gradients at the beginning of each iteration
        optimizer.zero_grad()

        # Forward pass with automatic mixed precision
        with autocast('cuda'):  # Enables mixed precision on CUDA
            output_logits = model(input_seq, mask=mask)
            loss = criterion(output_logits.view(-1, vocab_size), output_seq.view(-1))

        # Backward pass with scaled gradients
        scaler.scale(loss).backward()

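        # Unscale gradients first so that clipping is applied to their true magnitudes
        scaler.unscale_(optimizer)

        # Clip gradients to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)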
        
        # Optimizer step and update scaler
        scaler.step(optimizer)
        scaler.update()

        # Log loss every 10 iterations
        if i % 10 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i}/{len(train_dataloader)}], Loss: {loss.item():.6f}')

    # Validation loop
    model.eval()  # Set the model to evaluation mode
    val_loss = 0.0
    with torch.no_grad():  # Disable gradient tracking
        for batch in val_dataloader:
            input_seq, output_seq, mask = batch
            
            # Move tensors to the appropriate device
            input_seq = input_seq.to(device)
            output_seq = output_seq.to(device)
            mask = mask.to(device)

            # Forward pass
            output_logits = model(input_seq, mask=mask)

            # Compute the loss
            loss = criterion(output_logits.view(-1, vocab_size), output_seq.view(-1))
            val_loss += loss.item()

    # Average validation loss
    avg_val_loss = val_loss / len(val_dataloader)
    scheduler.step(avg_val_loss)
    print(f'Epoch [{epoch+1}/{num_epochs}], Validation Loss: {avg_val_loss:.6f}')
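
One consequence of padding with `pad_token=0` is that the loss also scores padded positions. A possible refinement, sketched here on toy tensors rather than taken from the notebook, is the `ignore_index` argument of `nn.CrossEntropyLoss`. A caveat: in this vocabulary ID 0 is also `<unk>`, so genuine `<unk>` targets would be skipped as well unless a dedicated padding ID were used.

```python
import torch
import torch.nn as nn

pad_id = 0  # assumption: the same id the dataset uses for padding
logits = torch.randn(6, 10)  # (tokens, vocab_size) with a toy vocabulary of 10
targets = torch.tensor([4, 7, 1, pad_id, pad_id, pad_id])

# Loss that skips every target equal to pad_id
loss_masked = nn.CrossEntropyLoss(ignore_index=pad_id)(logits, targets)

# Reference: plain loss computed over the three real tokens only
loss_real = nn.CrossEntropyLoss()(logits[:3], targets[:3])

print(torch.allclose(loss_masked, loss_real))  # True
```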

The code below saves the model after training:

In [14]:
# Save the model's state_dict
torch.save(model.state_dict(), "model_state.pth")

The code below evaluates the model on the test set to see how it fares on completely unseen data:

In [None]:
def evaluate_model(model, eval_dataloader, criterion, vocab_size, device):
    model.eval()  # Set the model to evaluation mode
    total_loss = 0.0
    total_correct = 0
    total_samples = 0

    with torch.no_grad():  # Disable gradient calculations
        for input_seq, output_seq, mask in eval_dataloader:
            input_seq = input_seq.to(device)
            output_seq = output_seq.to(device)
            mask = mask.to(device)

            # Forward pass
            output_logits = model(input_seq, mask=mask)

            # Calculate loss
            loss = criterion(output_logits.view(-1, vocab_size), output_seq.view(-1))
            total_loss += loss.item()

            # Calculate token-level accuracy (padded positions are counted as well)
            _, predicted = torch.max(output_logits, dim=-1)
            total_correct += (predicted == output_seq).sum().item()
            total_samples += output_seq.numel()  # Total number of elements in output_seq

    # Calculate average loss and accuracy
    avg_loss = total_loss / len(eval_dataloader)  # Divide by number of batches
    accuracy = total_correct / total_samples

    print(f"Evaluation Loss: {avg_loss:.4f}, Accuracy: {accuracy:.4f}")

    return avg_loss, accuracy  # Return values for further use

# Call the evaluate function
eval_dataset = TextDataset(test_data, sequence_length, sp)
eval_dataloader = DataLoader(eval_dataset, batch_size=batch_size, shuffle=False)
criterion = nn.CrossEntropyLoss()
evaluate_model(model, eval_dataloader, criterion, vocab_size, device)
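
Language models are often reported in terms of perplexity, which is the exponential of the average cross-entropy loss. As a small sketch (the function name `perplexity` is introduced here), the value returned by `evaluate_model` could be converted as follows. Since that average includes padded positions, the resulting figure would reflect them too.

```python
import math

def perplexity(avg_cross_entropy):
    # Cross-entropy is measured in nats (natural log), so exp() inverts it
    return math.exp(avg_cross_entropy)

print(perplexity(0.0))  # 1.0 (a perfect model)

# Loss of a uniform guess over a 30,000-token vocabulary
uniform_loss = math.log(30000)
print(round(perplexity(uniform_loss)))  # 30000
```

A perplexity well below the vocabulary size indicates that the model has learned something beyond guessing uniformly.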


Finally, the trained model can be used to generate text from a given prompt. Because of the limited training time and resources, the output is likely to be largely incoherent, but it may exhibit some structure.

In [15]:
# Given a prompt text
prompt_text = "Once upon a time" 

# Encode and convert to tensor
prompt_tokens = sp.encode(prompt_text, out_type=int)  # Get token IDs as a list
prompt_tensor = torch.tensor(prompt_tokens).to(device)  # Convert to tensor and move to device

# Use the model to generate text
generated_tokens = model.generate(prompt_tensor, sp, max_length=200, temperature=1.0, top_k=50, device=device)

# Decode the generated tokens back to text
generated_text = sp.decode(generated_tokens)  # Convert back to string
print("Generated text:", generated_text)  # Output the generated text

Generated text: Once upon a time to be the story of the world’s new world are taking a big price of this year. Here is our one of the world that was first look at least a new study of a big story that has shown to see it, a lot of the most of how that have done it will be a bit on the other forms of a major. No. For example from the past a good as a series for a good news conference just over what most people who were true. This particular long way. Even before the year’s in particular a little many different episodes to see an incredible life, to be the past several reasons over our third a massive to the team’s that time or a good time. My first time since I would like one of hard time, which, and we came the good part of a post on one of the very happy with a time, but on its entire new year, but there wasn’s that time. "The way to hold


As observed, the model can generate text from a given prompt, although at this stage of training the output is mostly nonsensical. Interestingly, the generated text shows a degree of local structure, such as plausible word order and punctuation, even though it lacks overall coherence. This suggests that even a briefly trained transformer begins to pick up short-range patterns in the data through its attention mechanism.

While not guaranteed, the model would likely produce more coherent text if training were more comprehensive, using a larger dataset and a longer training time. However, my resource limitations prevented major progress. I had to truncate most samples to keep long sequences from overwhelming GPU memory, which also reduced training time. With only a single average GPU, training the model extensively was impractical, as significant progress would require considerable time and computational power. I therefore chose to train the model to a reasonable state that shows its potential rather than to produce a finished model.

Given more resources and training time, I would like to see how the model performs, as it could potentially achieve better coherence and quality in its generated text.