# Building and Training a Decoder GPT-Like Model for Text Generation


## Objective
In this assignment, you will learn how to build and train a decoder GPT-like model, which is ideal for generating text and other natural language processing tasks. By the end of this tutorial, you will understand the basics of preparing data, building the model architecture, training the model, and generating text based on learned patterns.
## Introduction to GPT Models
GPT (Generative Pretrained Transformer) models are decoder-only models trained using a causal language modeling objective. The primary goal is to predict the next token in a sequence given the previous tokens. During training, the model learns to generate output tokens autoregressively, allowing it to produce coherent and contextually relevant text based on the input prompt.
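Stated as a formula (standard notation, not taken from this assignment), causal language modeling factorizes the probability of a token sequence $x_1, \dots, x_T$ into next-token predictions, and training minimizes the corresponding negative log-likelihood. Here $\theta$ stands for the model parameters:

$$
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \dots, x_{t-1}),
\qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_1, \dots, x_{t-1})
$$

Each factor conditions only on earlier tokens, which is why the model can later generate text one token at a time.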
## Key Concepts
- **Decoder-Only Model**: GPT models focus on generating text based on the input sequence, predicting one token at a time.
- **Causal Language Modeling**: The model is trained to predict the next token in a sequence, given the context of the previous tokens.
- **Autoregressive Generation**: The model generates text one token at a time, using the previously generated tokens as context.
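
To make autoregressive generation concrete, here is a minimal, dependency-free sketch. The "model" is a hypothetical bigram score table (`BIGRAM_SCORES`) invented for illustration; a real GPT-like model conditions on the whole context rather than only the last token.

```python
# Toy "model": a hypothetical bigram table mapping the last token to scores
# for candidate next tokens. A real GPT would condition on the whole context.
BIGRAM_SCORES = {
    "the": {"movie": 2.0, "plot": 1.0},
    "movie": {"was": 3.0},
    "was": {"great": 1.5, "long": 0.5},
    "great": {"<|endoftext|>": 1.0},
}

def next_token(context):
    """Return the highest-scoring next token given the context (greedy choice)."""
    scores = BIGRAM_SCORES.get(context[-1], {"<|endoftext|>": 0.0})
    return max(scores, key=scores.get)

def generate(prompt_tokens, max_new_tokens=10):
    """Append one token at a time, feeding each prediction back in as context."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == "<|endoftext|>":
            break  # stop once the end-of-text token is predicted
        tokens.append(token)
    return tokens

print(generate(["the"]))  # ['the', 'movie', 'was', 'great']
```

The loop structure (predict, append, repeat) is the same one used later with the trained model; only the source of the scores changes.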
## GPT vs. ChatGPT
Both GPT and ChatGPT are AI models developed by OpenAI, but they serve different purposes:
- **GPT**: A family of large-scale transformer-based language models for various NLP tasks, such as text generation, translation, and question-answering. GPT models generate responses based on the input text but do not maintain conversation history.
- **ChatGPT**: A fine-tuned version of GPT, specifically designed for conversational AI applications. ChatGPT maintains conversation history and generates contextually relevant responses, making it suitable for chatbot-like interactions.
## Assignment Tasks
- Prepare the environment and dataset for training a GPT-like model.
- Build the model architecture, focusing on the self-attention mechanism.
- Train the model using a dataset of your choice.
- Use the trained model to generate text based on a given prompt.
## Expected Outcomes
By completing this assignment, you will gain hands-on experience in building and training a decoder GPT-like model for text generation. You will understand the key concepts of GPT models, including causal language modeling and autoregressive generation, and learn how to apply these concepts in a practical setting.

## Setup


### Installing required libraries



In [None]:
!pip install numpy==1.26.0
!pip install torch==2.2.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
!pip install torchtext==0.17.2
!pip install torchdata==0.7.1
!pip install portalocker==2.8.2
!pip install pandas==2.2.1
!pip install matplotlib==3.9.0 scikit-learn==1.5.0
!pip install transformers==4.35.2

### Importing required libraries


- **torchdata**: Enhances data loading and preprocessing functionalities for PyTorch. Streamlines the workflow for machine learning models, making it easier to manage and prepare data.
- **portalocker**: Provides a mechanism to lock files, ensuring exclusive access. Useful for managing file resources in concurrent applications, preventing multiple processes from accessing the same file simultaneously.
- **torchtext**: Offers utilities for text processing and datasets in PyTorch. Simplifies the preparation of data for NLP tasks, providing tools for tokenization, vocabulary management, and dataset loading.
- **matplotlib**: A plotting library for creating static, interactive, and animated visualizations in Python.

These libraries collectively support various aspects of machine learning and NLP projects, including:
- **Data Preparation**: `torchdata` and `torchtext` streamline data loading, preprocessing, and text processing.
- **Resource Management**: `portalocker` ensures efficient file handling in concurrent applications.
- **Data Visualization**: `matplotlib` provides tools for creating informative visualizations.

In [None]:
import math
import time
from typing import Iterable, List

import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch import Tensor
from torch.nn import Transformer
from torch.nn.utils.rnn import pad_sequence
from torch.optim import Adam
from torch.utils.data import DataLoader
from torchtext.data.utils import get_tokenizer
from torchtext.datasets import IMDB, PennTreebank, Multi30k, multi30k
from torchtext.vocab import Vocab, build_vocab_from_iterator
from transformers import GPT2Tokenizer, GPT2LMHeadModel


# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# Loading and Iterating Over the IMDB Dataset
## Dataset
Load the IMDB dataset with `IMDB()`, which returns two splits: the training split and the test split (used here as the validation set). Then create an iterator over the training set and retrieve the first three samples by calling `next` manually. This simulates manual iteration over a dataset without using PyTorch's `DataLoader` for batch processing and data management.

For language modeling tasks, consider the following datasets:
- **PennTreebank**: A popular dataset for language modeling tasks.
- **WikiText-2**: A dataset consisting of Wikipedia articles, suitable for training language models.
- **WikiText-103**: A larger dataset of Wikipedia articles, ideal for more complex language modeling tasks.

These datasets are available through PyTorch's `torchtext` library and can be used for training more robust language models.



In [None]:
# Load the dataset
train_iter, val_iter = IMDB()

Create an iterator over the training set and retrieve the first three records:


In [None]:
data_itr=iter(train_iter)
# Retrieve the first three records (only the last one is displayed)
next(data_itr)
next(data_itr)
next(data_itr)

Let's define our device (CPU or GPU) for training. We'll check if a GPU is available and use it; otherwise, we'll use the CPU.


In [None]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
DEVICE


## Preprocessing data
The provided code is used for preprocessing text data, particularly for NLP tasks, with a focus on tokenization and vocabulary building.

- **Special Symbols and Indices**: Initializes the special tokens `<unk>`, `<pad>`, and `<|endoftext|>` with the indices `0`, `1`, and `2`. They stand for unknown words, padding, and end of text, respectively.
    - `UNK_IDX`: Index for unknown words.
    - `PAD_IDX`: Index used for padding shorter sequences in a batch to ensure uniform length.
    - `EOS_IDX`: Index of the `<|endoftext|>` token, which marks the end of a text.

- **`yield_tokens` Function**: A generator function that iterates through a dataset (`data_iter`), tokenizing each data sample using a `tokenizer` function, and yields one tokenized sample at a time.

- **Vocabulary building**: Constructs a vocabulary from the tokenized dataset. The `build_vocab_from_iterator` function processes tokens generated by `yield_tokens`, places the special tokens (`special_symbols`) at the beginning of the vocabulary (`special_first=True`), and keeps every token that appears at least once, because the minimum frequency is left at its default of `min_freq=1`.

- **Default index for unknown tokens**: Sets a default index for tokens not found in the vocabulary (`UNK_IDX`), ensuring that out-of-vocabulary words are handled as unknown tokens.

- **`text_to_index` function**: Converts a given text into a sequence of indices based on the built vocabulary. This function is essential for transforming raw text into a numerical format that can be processed by machine learning models.

- **`index_to_en` function**: Transforms a sequence of indices back into a readable string. It's useful for interpreting the outputs of models and converting numerical predictions back into text.

- **Check functionality**: Demonstrates the use of `index_to_en` by converting a tensor of indices `[0,1,2]` back into their corresponding special symbols. This helps verify that the vocabulary and index conversion functions are working as expected.
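
The steps above can be sketched without `torchtext`. This is a simplified stand-in, not the library's implementation: `simple_tokenizer`, `build_vocab`, and `index_to_text` are hypothetical helpers, and the whitespace tokenizer is far cruder than `basic_english`.

```python
from collections import Counter

UNK_IDX, PAD_IDX, EOS_IDX = 0, 1, 2
SPECIALS = ["<unk>", "<pad>", "<|endoftext|>"]

def simple_tokenizer(text):
    """Hypothetical stand-in for a real tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def build_vocab(texts, min_freq=1):
    """Return (token -> index, index -> token) with the special symbols first."""
    counts = Counter(tok for text in texts for tok in simple_tokenizer(text))
    # Most frequent tokens first; ties broken alphabetically for determinism
    ordered = sorted((t for t, c in counts.items() if c >= min_freq),
                     key=lambda t: (-counts[t], t))
    itos = SPECIALS + ordered
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

stoi, itos = build_vocab(["a good movie", "a bad movie"])

def text_to_index(text):
    # Unknown tokens fall back to UNK_IDX, like a vocabulary default index
    return [stoi.get(tok, UNK_IDX) for tok in simple_tokenizer(text)]

def index_to_text(indices):
    return " ".join(itos[i] for i in indices)

print(text_to_index("a good film"))   # [3, 6, 0]
print(index_to_text([0, 1, 2]))       # <unk> <pad> <|endoftext|>
```

The word "film" never appears in the toy corpus, so it maps to `UNK_IDX`, which is the same fallback that `vocab.set_default_index(UNK_IDX)` provides.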


In [None]:
# Define special symbols and indices
UNK_IDX, PAD_IDX, EOS_IDX = 0, 1, 2
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<|endoftext|>' ]

In [None]:
tokenizer = get_tokenizer("basic_english")

In [None]:
def yield_tokens(data_iter):

    for _,data_sample in data_iter:
        yield  tokenizer(data_sample)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=special_symbols, special_first=True)
vocab.set_default_index(UNK_IDX)


### Text to index and index to text


In [None]:
# vocab expects a list of tokens, so pass the whole tokenized text at once
text_to_index=lambda text: vocab(tokenizer(text))
index_to_en = lambda seq_en: " ".join([vocab.get_itos()[index] for index in seq_en])

In [None]:
#check
index_to_en(torch.tensor([0,1,2]))

### Collate Function for Decoder Model

The collate function prepares data for the decoder model. It does the following:
- **Takes a block of text as input**: Each raw text in the batch serves as the basis for a training sample.
- **Produces a source and target sequence as output**: The `get_sample(block_size, text)` function turns the tokenized text into a training sample.

**get_sample**

The `get_sample(block_size, text)` function is responsible for:
- **Generating random text samples**: It picks a random window of tokens as the source sequence (`src_sequence`) and the same window shifted one token to the right as the target sequence (`tgt_sequence`).
- **Ensuring the sample fits within the block size**: When the text is no longer than the block size, the whole text is used and `<|endoftext|>` is appended to the target so that both sequences have the same length.
- **Returning source and target sequences**: The source is the model input, and the target holds the next token the model should predict at each position.

By using `get_sample`, the collate function gives the decoder model (input, next-token) pairs to learn from.
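
The shifted-window idea can be illustrated with plain Python lists. This is a sketch of the same logic using the standard `random` module instead of `torch.randint`; the name `sample_window` and the example sentence are made up for illustration.

```python
import random

def sample_window(tokens, block_size, rng=random):
    """Return (src, tgt) where tgt is src shifted one position to the right."""
    max_start = len(tokens) - block_size
    if max_start >= 1:
        # Enough tokens: pick a random window that leaves room for the shifted target
        start = rng.randrange(0, max_start)
        src = tokens[start:start + block_size]
        tgt = tokens[start + 1:start + block_size + 1]
    else:
        # Short text: use everything and close the target with an end-of-text token
        src = tokens[:]
        tgt = tokens[1:] + ["<|endoftext|>"]
    return src, tgt

tokens = "the film starts slowly but the ending is worth it".split()
src, tgt = sample_window(tokens, block_size=4, rng=random.Random(0))
print(src)
print(tgt)
```

Whatever window is drawn, `tgt[i]` is always the token that follows `src[i]` in the original text, which is exactly what next-token prediction needs.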



In [None]:
def get_sample(block_size, text):
    # Determine the length of the input text
    sample_leg = len(text)
    # Calculate the stopping point for randomly selecting a sample
    # This ensures the selected sample doesn't exceed the text length
    random_sample_stop = sample_leg - block_size


    # Check if a random sample can be taken (if the text is longer than block_size)
    if random_sample_stop >= 1:
        # Randomly select a starting point for the sample
        random_start = torch.randint(low=0, high=random_sample_stop, size=(1,)).item()
        # Define the endpoint of the sample
        stop = random_start + block_size

        # Create the input and target sequences
        src_sequence = text[random_start:stop]
        tgt_sequence= text[random_start + 1:stop + 1]

    # Handle the case where the text is no longer than the block size
    else:
        # Start from the beginning and use the entire text
        random_start = 0
        stop = sample_leg
        src_sequence= text[random_start:stop]
        tgt_sequence = text[random_start + 1:stop]
        # Append the end-of-text token so the target has the same length as the source
        tgt_sequence.append( '<|endoftext|>')

    return src_sequence, tgt_sequence

Let's test `get_sample(block_size, text)`. First, get a batch of tokenized texts:


In [None]:
BATCH_SIZE=1

batch_of_tokens=[]

for i in range(BATCH_SIZE):
  _,text =next(iter(train_iter))
  batch_of_tokens.append(tokenizer(text))

Get the first sample of text:


In [None]:
text=batch_of_tokens[0][0:100]
text[0:100]
batch_of_tokens

To test the `get_sample` function with a block size of 10, you can use the following code:


In [None]:
block_size=10
# `text` is the list of tokens taken from the first review above;
# get_sample expects a list of tokens, not a raw string
src_sequence, tgt_sequence = get_sample(block_size, text)

print("Source Sequence:")
print(src_sequence)
print("\nTarget Sequence:")
print(tgt_sequence)

## Batch Creation for NLP Model Training
The code snippet creates batches of source (`src_batch`) and target (`tgt_batch`) sequences from a dataset for training NLP models. Here's a breakdown of its functionality:
- **Looping through the dataset**: The code iterates over the dataset to extract text samples.
- **Generating source and target sequences**: For each text sample, it uses the `get_sample` function to generate corresponding source and target sequences.
- **Converting to vocabulary indices**: The sequences are converted into vocabulary indices, which represent the tokens as numerical values.
- **Converting to PyTorch tensors**: The indexed sequences are converted into PyTorch tensors, which are suitable for model training.
- **Appending to batch lists**: Each iteration appends the source and target sequences to their respective batch lists (`src_batch` and `tgt_batch`).
- **Printing batch details**: The code prints the text, indices, and tensor shape of every sample in the batch (two samples here).

Printing these details gives insight into the data that will be used for training.

In [None]:
# Initialize empty lists to store source and target sequences
src_batch, tgt_batch = [], []

# Define the batch size
BATCH_SIZE = 2

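# Create the iterator once so that each call to next() returns a new sample
train_data_iter = iter(train_iter)

# Loop to create batches of source and target sequences
for i in range(BATCH_SIZE):
    # Retrieve the next data point from the training iterator
    _,text = next(train_data_iter)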

    # Generate source and target sequences using the get_sample function
    src_sequence_text, tgt_sequence_text = get_sample(block_size, tokenizer(text))

    # Convert source and target sequences to tokenized vocabulary indices
    src_sequence_indices = vocab(src_sequence_text)
    tgt_sequence_indices = vocab(tgt_sequence_text)

    # Convert the sequences to PyTorch tensors with dtype int64
    src_sequence = torch.tensor(src_sequence_indices, dtype=torch.int64)
    tgt_sequence = torch.tensor(tgt_sequence_indices, dtype=torch.int64)

    # Append the source and target sequences to their respective batches
    src_batch.append(src_sequence)
    tgt_batch.append(tgt_sequence)

    # Print the details of every sample
    print(f"Sample {i}:")
    print("Source Sequence (Text):", src_sequence_text)
    print("Source Sequence (Indices):", src_sequence_indices)
    print("Source Sequence (Shape):", src_sequence.shape)
    print("Target Sequence (Text):", tgt_sequence_text)
    print("Target Sequence (Indices):", tgt_sequence_indices)
    print("Target Sequence (Shape):", tgt_sequence.shape)

## collate_batch Function

The `collate_batch` function prepares batches of data for training NLP models. Here's a summary of its functionality:

## Key Steps
- **Processing text samples**: The function iterates over each text sample in a given batch.
- **Generating source and target sequences**: It uses the `get_sample` function to generate source and target sequences for each text sample, based on a specified block size.
- **Converting to indices**: The sequences are converted to indices using a vocabulary, which maps words or tokens to numerical values.
- **Transforming to PyTorch tensors**: The indexed sequences are transformed into PyTorch tensors.
- **Padding sequences**: The sequences are padded to ensure uniform length across the batch, which is necessary for efficient training.
- **Returning padded batches**: The function returns the padded source and target batches, ready for training on the specified device (`DEVICE`).

## Output
The `collate_batch` function returns two padded tensors:
- `src_batch`: The padded source sequence batch.
- `tgt_batch`: The padded target sequence batch.

Because `batch_first=False`, both tensors have shape `(sequence length, batch size)`. These padded batches can be fed directly into an NLP model for training.
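
As an illustration of the padding step, the sketch below pads plain Python lists and lays them out time-major, the same `(sequence length, batch size)` layout that `pad_sequence(..., batch_first=False)` produces. The helper name `pad_time_major` is hypothetical.

```python
PAD_IDX = 1

def pad_time_major(sequences, padding_value=PAD_IDX):
    """Pad lists of token ids to equal length and return a (seq_len, batch) layout."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]
    # Transpose so that row t holds the t-th token of every sequence
    return [list(step) for step in zip(*padded)]

batch = pad_time_major([[5, 6, 7], [8, 9]])
print(batch)  # [[5, 8], [6, 9], [7, 1]]
```

The second sequence is one token shorter, so its last position is filled with `PAD_IDX`.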

In [1]:
BLOCK_SIZE=30
def collate_batch(batch):
    src_batch, tgt_batch = [], []
    for _,_textt in batch:
      src_sequence,tgt_sequence=get_sample(BLOCK_SIZE,tokenizer(_textt))
      src_sequence=vocab(src_sequence)
      tgt_sequence=vocab(tgt_sequence)
      src_sequence= torch.tensor(src_sequence, dtype=torch.int64)
      tgt_sequence = torch.tensor(tgt_sequence, dtype=torch.int64)
      src_batch.append(src_sequence)
      tgt_batch.append(tgt_sequence)


    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX, batch_first=False)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX, batch_first=False)

    return src_batch.to(DEVICE), tgt_batch.to(DEVICE)

## Setting Up Data Loaders
The code snippet demonstrates how to:
- **Create data loaders**: Use the `DataLoader` class to set up data loaders for the training and validation sets.
- **Apply custom batch processing**: Pass the custom `collate_batch` function as `collate_fn` to prepare each batch.
- **Configure batch size and shuffling**: Set the batch size to 1 for simplicity and enable data shuffling for randomized access.

## Accessing and Displaying Preprocessed Data
After initializing the training data loader, the code:
- **Fetches the first batch**: Retrieves the first batch of source (`src`) and target (`tgt`) sequences.
- **Iterates over tokens**: Loops through each token in the source sequence.
- **Converts indices to text**: Uses the `index_to_en` function to convert token indices back to text.
- **Prints the resulting sentences**: Displays the sentences, showing how to access preprocessed data that is ready for model training.

This process verifies the data preparation pipeline and confirms that the data is correctly formatted for model training.

In [None]:
BATCH_SIZE=1
dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
val_dataloader= DataLoader(val_iter , batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)

### Iterating through data samples

The provided code iterates through batches of source-target pairs from a data loader. It demonstrates how to access and print a few samples from the dataset:

- We initialize an iterator over the data loader named `dataset`.
- A loop runs for 10 iterations to fetch and print the first 10 source-target pairs. For each pair:
    - `src` and `trt` (short for target) hold the batch of source and target sequences respectively.
    - The `index_to_en` function is used to convert these sequences from numerical indices back to readable text.
    - The `sample` number and corresponding source and target texts are printed out.

After printing the first 10 samples, the code continues to iterate through the dataset:

- It prints the shape of the target and source tensors for the next batch, which provides information about the number of tokens and batch size.
- The `index_to_en` function is again used to convert the first sequence of the batch from indices to text for both source and target.
- Only the first pair of the remaining batches is printed, and then the loop breaks.

This process is useful for verifying that the data loader is functioning correctly and that the sequences are being properly transformed.


In [None]:
dataset=iter(dataloader)
for sample in range(10):
  src,trt=next(dataset)
  print("sample",sample)
  print("source:",index_to_en(src))
  print("\n")
  print("target:",index_to_en(trt))
  print("\n")

In [None]:
for  src,trt in dataset:
    print(trt.shape)
    print(src.shape)
    # Tensors are (sequence length, batch size), so column 0 is the first sequence
    print(index_to_en(src[:,0]))
    print(index_to_en(trt[:,0]))
    break

Make sure that the target is the source shifted by one token:


In [None]:
print("source:",index_to_en(src))
print("target:",index_to_en(trt))

Now that we've covered data preparation, let's move on to understanding the key components of the Transformer model.


## Masking in Transformers

Masking plays a vital role in transformers, particularly in controlling the attention mechanism. The `generate_square_subsequent_mask` function starts from an upper triangular matrix and turns it into an additive causal mask.
## Purpose of the Mask
The mask ensures that during decoding:
- **Tokens don't attend to future tokens**: A token can only attend to previous tokens in the sequence, not future ones.
- **Causal attention**: This masking technique enables causal attention, where the model only considers the context up to the current token.
## Upper Triangular Matrix
The square mask returned by the `generate_square_subsequent_mask` function has the following properties:
- **Zeros on and below the diagonal**: Tokens can attend to themselves and previous tokens.
- **`-inf` above the diagonal**: Tokens cannot attend to future tokens, because adding `-inf` to an attention score gives that position zero weight after the softmax.

By applying this mask, the transformer model can process sequential data, such as text or time series data, while maintaining the causal relationship between tokens.
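
To see why `-inf` entries block attention, here is a small plain-Python sketch. The helper names `causal_mask` and `softmax` are illustrative and independent of the PyTorch function defined in this lab.

```python
import math

def causal_mask(size):
    """0.0 where attention is allowed (column <= row), -inf where it is blocked."""
    return [[0.0 if col <= row else float("-inf") for col in range(size)]
            for row in range(size)]

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
for row in mask:
    print(row)

# Adding the mask to raw attention scores before the softmax
# drives the weights of future positions to exactly zero.
raw_scores = [1.0, 1.0, 1.0]
weights = softmax([s + m for s, m in zip(raw_scores, mask[0])])
print(weights)  # [1.0, 0.0, 0.0]
```

The first token can only attend to itself, so all of its attention weight lands on position 0.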

In [None]:
def generate_square_subsequent_mask(sz,device=DEVICE):
    mask = (torch.triu(torch.ones((sz, sz), device=device)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

## Create_mask Function
The `create_mask` function returns two masks for a given source sequence: the causal mask from `generate_square_subsequent_mask` and a padding mask that marks the positions holding `PAD_IDX`. Both masks control the attention mechanism in transformers.
## Purpose of Source Masks
Source masks help the model:
- **Ignore padding tokens**: By masking padding tokens, the model can focus on the actual content of the sequence.
- **Handle variable-length sequences**: Source masks enable the model to process sequences of different lengths within a batch.

The `create_mask` function ensures that the model attends only to the relevant parts of the input sequence.
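
The padding mask can be pictured with nested lists. This sketch assumes the same time-major layout used in this lab, where `src[t][b]` is token `t` of sequence `b`; the function name `key_padding_mask` is chosen for illustration.

```python
PAD_IDX = 1

def key_padding_mask(src, pad_idx=PAD_IDX):
    """src is time-major: src[t][b] is token t of sequence b.
    Returns a (batch, seq_len) grid that is True wherever the token is padding."""
    seq_len, batch_size = len(src), len(src[0])
    return [[src[t][b] == pad_idx for t in range(seq_len)]
            for b in range(batch_size)]

src = [[5, 8],
       [6, 9],
       [7, 1]]   # the second sequence is padded at the last step
print(key_padding_mask(src))  # [[False, False, False], [False, False, True]]
```

The result is indexed by batch first, mirroring the `.transpose(0, 1)` in `create_mask`.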

In [None]:
def create_mask(src,device=DEVICE):
    src_seq_len = src.shape[0]
    # Pass the device through so the mask is created where the caller expects it
    src_mask = generate_square_subsequent_mask(src_seq_len, device=device)
    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    return src_mask,src_padding_mask

Let's check an example source tensor and its associated masks:


In [None]:
#Replace first four tokens with PAD token so we can also check how pad tokens are masked using padding_mask
src[0:4]=PAD_IDX

In [None]:
mask,padding_mask = create_mask(src)
src

In [None]:
mask

In [None]:
padding_mask

## Positional Encoding in Transformers
Transformers lack inherent knowledge of token order in sequences. To address this, positional encodings are added to token embeddings, providing the model with positional information.
## Types of Positional Encodings
- **Fixed Positional Encodings**: These encodings follow a predetermined pattern, such as the sinusoidal encodings used in the original Transformer paper.
- **Trainable Positional Encodings**: Used in models like GPT, these encodings are learned during training, allowing the model to capture task-specific positional information.
## Trainable Positional Encodings
Trainable positional encodings offer several benefits:
- **Flexibility**: Learned during training, these encodings can adapt to specific tasks and datasets.
- **Task-specific representations**: The model can capture nuanced positional information relevant to the task at hand.
## Fixed Positional Encoding in this Lab
For simplicity, this lab utilizes fixed positional encoding, which provides a straightforward way to incorporate positional information into the model.
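
The fixed sinusoidal pattern can be computed by hand for a single position. The sketch below uses only the `math` module and assumes an even embedding size; `sinusoidal_encoding` is an illustrative helper that follows the same sine/cosine interleaving as the `PositionalEncoding` class in this lab.

```python
import math

def sinusoidal_encoding(position, emb_size):
    """Encoding vector for one position; emb_size is assumed to be even."""
    vector = []
    for i in range(0, emb_size, 2):
        angle = position / (10000 ** (i / emb_size))
        vector.append(math.sin(angle))  # even dimensions use sine
        vector.append(math.cos(angle))  # odd dimensions use cosine
    return vector

print(sinusoidal_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print([round(v, 4) for v in sinusoidal_encoding(1, 4)])
```

Position 0 always encodes to alternating zeros and ones, and later positions rotate each sine/cosine pair at a different frequency.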

In [None]:
# add positional information to the input tokens
class PositionalEncoding(nn.Module):
    def __init__(self,
                 emb_size: int,
                 dropout: float,
                 maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        den = torch.exp(- torch.arange(0, emb_size, 2)* math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

## Token Embedding
Token embedding is a technique used to represent words or tokens as numerical vectors in a continuous vector space. This allows words with similar meanings or contexts to be mapped to nearby points in the vector space.
## Key Characteristics
- **Vector representation**: Each unique word or token is assigned a fixed-length vector.
- **Linguistic properties**: The numerical values in the vector represent various linguistic properties, such as meaning, context, or relationships with other words.
## TokenEmbedding Class
The `TokenEmbedding` class converts token indices into embedding vectors and scales them by the square root of the embedding size, enabling the model to capture the semantic relationships between words and process text data effectively. Note that `CustomGPTModel` below performs the same lookup and scaling directly with `nn.Embedding`.
By using token embeddings, models can better understand the nuances of language and make more accurate predictions or classifications.

In [None]:
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

## Custom GPT model architecture

The `CustomGPTModel` class defines a transformer-based architecture for generative pre-trained models, tailored for text generation and various NLP tasks. The model consists of several key components:

- **Initialization (`__init__`)**: The constructor takes several parameters including `embed_size`, `vocab_size`, `num_heads`, `num_layers`, `max_seq_len`, and `dropout`. It initializes the embedding layer, positional encoding, transformer encoder layers, and a linear layer (`lm_head`) for generating logits over the vocabulary.

- **Weight initialization (`init_weights`)**: This method initializes the weights of the model for better training convergence. The Xavier uniform initialization is used, which is a common practice for initializing weights in deep learning.

- **Decoder (`decoder`)**: Although named `decoder`, this method currently functions as the forward pass through the transformer encoder layers, followed by the generation of logits for the language modeling task. It handles the addition of positional encodings to the embeddings and applies a mask if necessary.

- **Forward pass (`forward`)**: This method is similar to the `decoder` method and defines the forward computation of the model. It processes the input through embedding layers, positional encoding, transformer encoder layers, and produces the final output using the `lm_head`.

- **Mask generation**: Both `decoder` and `forward` methods contain logic to generate a square causal mask if no source mask is provided. This mask ensures that the prediction for a position does not depend on the future tokens in the sequence, which is important for the autoregressive nature of GPT models.

- **Encoder layers instead of decoder layers**: The implementation uses `nn.TransformerEncoder` layers rather than `nn.TransformerDecoder` layers. A decoder-only language model has no encoder output to cross-attend to, so encoder layers combined with a causal mask are a common simplification for models focusing on language modeling and generation.

This class effectively encapsulates the necessary components to create a GPT-like model, allowing for training on language modeling tasks and text generation applications.


In [None]:
class CustomGPTModel(nn.Module):
    def __init__(self, embed_size,vocab_size, num_heads, num_layers, max_seq_len=500,dropout=0.1):

        super().__init__()

        self.embed = nn.Embedding(vocab_size, embed_size)
        self.positional_encoding = PositionalEncoding(embed_size, dropout=dropout)

        # Stack of encoder layers; combined with a causal mask they act as a decoder-only model
        encoder_layers = nn.TransformerEncoderLayer(d_model=embed_size, nhead=num_heads, dropout=dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, num_layers=num_layers)
        self.embed_size = embed_size
        self.lm_head = nn.Linear(embed_size, vocab_size)

        # Initialize the weights only after all layers have been created
        self.init_weights()

    def init_weights(self):
      for p in self.parameters():
          if p.dim() > 1:
              nn.init.xavier_uniform_(p)

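    @staticmethod
    def create_mask(src,device=DEVICE):
        # Causal mask plus padding mask for a (seq_len, batch_size) tensor of token indices
        src_seq_len = src.shape[0]
        src_mask = nn.Transformer.generate_square_subsequent_mask(src_seq_len, device=device)
        src_padding_mask = (src == PAD_IDX).transpose(0, 1)
        return src_mask,src_padding_mask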

    def decoder(self, x,src_mask):
        # x holds token indices with shape (seq_len, batch_size)
        if src_mask is None:
            # Generate a square causal mask from the token indices: masked positions
            # are filled with float('-inf'), unmasked positions with float(0.0)
            src_mask, src_padding_mask = create_mask(x)

        # Scale the token embeddings and add positional encodings
        x = self.embed(x)* math.sqrt(self.embed_size)
        x = self.positional_encoding(x)

        # Run the transformer layers, then project their output onto the vocabulary
        output = self.transformer_encoder(x, src_mask)
        logits = self.lm_head(output)
        return logits

    def forward(self,x,src_mask=None,key_padding_mask=None):
        # x holds token indices with shape (seq_len, batch_size)
        if src_mask is None:
            # Generate a square causal mask from the token indices: masked positions
            # are filled with float('-inf'), unmasked positions with float(0.0)
            src_mask, src_padding_mask = create_mask(x)

        # Scale the token embeddings and add positional encodings
        x = self.embed(x)* math.sqrt(self.embed_size)
        x = self.positional_encoding(x)

        # Run the transformer layers, then project their output onto the vocabulary
        output = self.transformer_encoder(x, src_mask,key_padding_mask)
        x = self.lm_head(output)

        return x


### Model configuration and initialization

Here, we configure and instantiate a Custom GPT Model with the following specifications:

- `ntokens`: The total number of unique tokens in the vocabulary, which the model will use to represent words.
- `emsize`: The size of each embedding vector. In this model, each word will be represented by a 200-dimensional vector.
- `nlayers`: The number of transformer encoder layers in the model. We are using two layers in this configuration.
- `nhead`: The number of attention heads in the multi-head attention mechanism. The model will use two attention heads.
- `dropout`: A regularization technique where randomly selected neurons are ignored during training to prevent overfitting. Here, we set the dropout probability to 0.2.

After setting these hyperparameters, we create an instance of `CustomGPTModel` by passing in the embedding size, number of attention heads, number of layers, vocabulary size, and dropout probability. The model is then moved to the specified `DEVICE`, which could be a CPU or GPU, for training or inference.
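
To get a feel for how these hyperparameters translate into model size, the following sketch estimates the number of trainable parameters by hand. It assumes the PyTorch default `dim_feedforward=2048` for `nn.TransformerEncoderLayer` (the model above does not override it) and uses a hypothetical vocabulary of 50,000 tokens; the real value is `len(vocab)`.

```python
def encoder_layer_params(d_model, dim_feedforward=2048):
    """Parameters in one encoder layer (biases and LayerNorms included)."""
    attention = 4 * d_model * d_model + 4 * d_model   # Q, K, V and output projections
    feedforward = 2 * d_model * dim_feedforward + dim_feedforward + d_model
    layer_norms = 2 * 2 * d_model                     # two LayerNorms, weight + bias
    return attention + feedforward + layer_norms

def model_params(vocab_size, d_model, num_layers):
    embedding = vocab_size * d_model
    lm_head = d_model * vocab_size + vocab_size       # weight + bias
    return embedding + num_layers * encoder_layer_params(d_model) + lm_head

print(encoder_layer_params(200))       # 983048
print(model_params(50_000, 200, 2))    # 22016096 (hypothetical 50,000-token vocabulary)
```

With a large vocabulary, the embedding table and `lm_head` dominate the count. The fixed positional encoding adds no trainable parameters because it is stored as a buffer. Also note that `nhead` must divide the embedding size evenly (here 200 / 2 = 100 dimensions per head).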


In [None]:
ntokens = len(vocab)  # size of vocabulary
emsize = 200  # embedding dimension
nlayers = 2  # number of ``nn.TransformerEncoderLayer`` in ``nn.TransformerEncoder``
nhead = 2  # number of heads in ``nn.MultiheadAttention``
dropout = 0.2  # dropout probability

model = CustomGPTModel(embed_size=emsize, num_heads=nhead, num_layers=nlayers, vocab_size=ntokens,dropout=dropout).to(DEVICE)

### Prompting 
To get the model to generate text (the next token), you need to create a starting point, called a prompt, to which the model appends tokens. Verify that the prompt is neither `None` nor too long, then tokenize it, convert it into indices, and reshape it as needed.


In [None]:
def encode_prompt(prompt, block_size=BLOCK_SIZE):
    # Handle a missing (None) or empty prompt
    while not prompt:
        prompt = input("Sorry, prompt cannot be empty. Please enter a valid prompt: ")

    tokens = tokenizer(prompt)
    number_of_tokens = len(tokens)

    # Handle long prompts
    if number_of_tokens > block_size:
        tokens = tokens[-block_size:]  # Keep only the last block_size tokens

    prompt_indices = vocab(tokens)
    prompt_encoded = torch.tensor(prompt_indices, dtype=torch.int64).reshape(-1, 1)
    return prompt_encoded

Let's see some different examples where the input is `None` or longer than the block size:


In [None]:
print(index_to_en(encode_prompt(None)))

In [None]:
print(index_to_en(encode_prompt("This is a prompt to get model generate next words." ) ))
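To make the truncation step concrete without relying on the notebook's `tokenizer` and `vocab`, here is a minimal sketch with toy stand-ins. The whitespace tokenizer, the dictionary `toy_vocab`, and the helper `truncate_to_block` are all assumptions made for illustration.

```python
def truncate_to_block(tokens, block_size):
    """Keep only the last block_size tokens (the most recent context)."""
    return tokens[-block_size:] if len(tokens) > block_size else tokens


# Toy stand-ins (assumptions): whitespace tokenization and a dict-based vocabulary
toy_vocab = {"<unk>": 0, "the": 1, "movie": 2, "was": 3, "really": 4, "good": 5}
tokens = "the movie was really good".split()

kept = truncate_to_block(tokens, block_size=3)
indices = [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in kept]
column = [[i] for i in indices]  # one index per row, like reshape(-1, 1)

print(kept)    # ['was', 'really', 'good']
print(column)  # [[3], [4], [5]]
```

Keeping the end of the prompt rather than the beginning preserves the context closest to the position where generation continues.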

Now, let's encode a text prompt and run it through the decoder part of the model:

- The `decoder` method of the `CustomGPTModel` instance `model` is called with the encoded prompt and without a source mask (`src_mask=None`), indicating that it will not mask any part of the sequence during processing. The decoder will handle creating a causal mask internally if required.
- The output `logits` represents the model's raw predictions for each token position, which can be further processed (e.g., by applying a softmax function) to obtain the probabilities of the next token in the sequence.


In [None]:
prompt_encoded=encode_prompt("This is a prompt to get model generate next words.").to(DEVICE)
prompt_encoded

In [None]:
logits = model.decoder(prompt_encoded,src_mask=None).to(DEVICE)

The output has shape (sequence length, batch size, vocabulary size): 11 token positions, a batch dimension of 1, and one logit for each word in the vocabulary.


In [None]:
logits.shape

Transpose it so that the batch dimension comes first:


In [None]:
logits = logits.transpose(0, 1)
logits.shape

`logits` contains the logits for every token position in the sequence. To predict the next word, we only need the logits at the last position:


In [None]:
logit_prediction = logits[:, -1]
logit_prediction.shape

Get the index of the next word:


In [None]:
_, next_word_index = torch.max(logit_prediction, dim=1)
next_word_index

Convert the index back to the next word:


In [None]:
index_to_en(next_word_index)
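The shape bookkeeping in the last few cells can be reproduced on random data. This is a sketch under assumed toy sizes (a vocabulary of 50 entries); the random tensor merely stands in for the output of `model.decoder`. It also shows that applying softmax does not change which index is largest, so greedy selection can work directly on the logits.

```python
import torch

torch.manual_seed(0)
seq_len, batch_size, vocab_size = 11, 1, 50  # toy sizes, assumed for illustration

# Stand-in for the decoder output: (seq_len, batch, vocab)
logits = torch.randn(seq_len, batch_size, vocab_size)

logits_bf = logits.transpose(0, 1)  # (batch, seq_len, vocab)
last_logits = logits_bf[:, -1]      # (batch, vocab): logits at the last position

probs = torch.softmax(last_logits, dim=-1)      # each row sums to 1
next_index = torch.argmax(last_logits, dim=-1)  # greedy choice, shape (batch,)

print(logits_bf.shape, last_logits.shape, next_index.shape)
```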

## Autoregressive text generation
In decoder models, we append each predicted token to the input and feed the extended sequence back in to predict the next one. We stop this process when we encounter the end-of-sequence token `<|endoftext|>` or when the input becomes too long. We will implement it as a function later in this notebook.


In [None]:
prompt="this is the beginning of"

Encode the prompt, making sure it does not exceed the maximum input size, so that we can make a prediction:


In [None]:
prompt_encoded = encode_prompt(prompt).to(DEVICE)
print("Shape of prompt_encoded:", prompt_encoded.shape)

In [None]:
max_new_tokens=10

In [None]:
for i in range(max_new_tokens):
    logits = model.decoder(prompt_encoded,src_mask=None)
    logits = logits.transpose(0, 1)
    print(" ")
    print(f"Shape of logits at step {i}: {logits.shape}")

    logit_prediction = logits[:, -1]
    print(f"Shape of logit_prediction at step {i}: {logit_prediction.shape}")

    next_token_encoded = torch.argmax(logit_prediction, dim=-1).reshape(-1, 1)
    print(f"Shape of next_token_encoded at step {i}: {next_token_encoded.shape}")

    prompt_encoded = torch.cat((prompt_encoded, next_token_encoded), dim=0).to(DEVICE)
    print(f"Sequence for step {i}: {[index_to_en(j) for j in prompt_encoded]}")
    print(f"Shape of prompt_encoded after concatenation at step {i}: {prompt_encoded.shape}")
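The loop above always runs for `max_new_tokens` steps. The sketch below shows the two remaining ingredients of autoregressive generation, stopping at the end-of-sequence token and capping the context length, using a lookup table in place of a trained model. `toy_next_token` and `toy_generate` are hypothetical names introduced for illustration only.

```python
EOS = "<|endoftext|>"

# Hypothetical stand-in for a trained model: maps the last token to the next one
toy_next_token = {
    "this": "is", "is": "the", "the": "beginning",
    "beginning": "of", "of": "it", "it": EOS,
}


def toy_generate(prompt_tokens, max_new_tokens=10, block_size=4):
    context = list(prompt_tokens)
    generated = []
    for _ in range(max_new_tokens):
        # "Predict" the next token from the current context
        next_token = toy_next_token.get(context[-1], EOS)
        if next_token == EOS:  # stop as soon as the model emits EOS
            break
        generated.append(next_token)
        # Append the prediction and keep only the last block_size tokens
        context = (context + [next_token])[-block_size:]
    return generated


print(toy_generate(["this", "is"]))  # ['the', 'beginning', 'of', 'it']
```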

Let's implement it as a function now:


In [None]:
# Define special symbols and indices
UNK_IDX, PAD_IDX, EOS_IDX = 0, 1, 2
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<|endoftext|>' ]
BLOCK_SIZE

In [None]:

# Autoregressive language-model text generation (greedy decoding)
def generate(model, prompt=None, max_new_tokens=500, block_size=BLOCK_SIZE, vocab=vocab, tokenizer=tokenizer):
    # Move the model to the target device (GPU or CPU) and disable dropout for inference
    model.to(DEVICE)
    model.eval()

    # Encode the input prompt, truncating it to at most block_size tokens
    prompt_encoded = encode_prompt(prompt, block_size).to(DEVICE)
    tokens = []

    # Generate up to max_new_tokens tokens; gradients are not needed at inference time
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # Forward pass; logits have shape (seq_len, batch, vocab_size)
            logits = model(prompt_encoded, src_mask=None, key_padding_mask=None)

            # Transpose to (batch, seq_len, vocab_size) so the batch dimension comes first
            logits = logits.transpose(0, 1)

            # Select the logits at the last position in the sequence
            logit_prediction = logits[:, -1]

            # Choose the most probable next token from the logits (greedy decoding)
            next_token_encoded = torch.argmax(logit_prediction, dim=-1).reshape(-1, 1)

            # If the next token is the end-of-sequence (EOS) token, stop generation
            if next_token_encoded.item() == EOS_IDX:
                break

            # Append the next token to prompt_encoded and keep only the last block_size tokens
            prompt_encoded = torch.cat((prompt_encoded, next_token_encoded), dim=0)[-block_size:]

            # Convert the next token index to a token string using the vocabulary
            # (.item() returns a Python int, so the lookup happens on the CPU)
            token_id = next_token_encoded.item()
            tokens.append(vocab.get_itos()[token_id])

    # Join the generated tokens into a single string and return it
    return ' '.join(tokens)

In [None]:
generate(model,prompt="this is the beginning of",max_new_tokens=30,vocab=vocab,tokenizer=tokenizer)

### Decoding the differences: Training vs. inference

The key difference between the training and inference stages lies in the inputs to the decoder. During training, the decoder is fed the ground-truth tokens at every position, a technique known as "teacher forcing," so the prediction at each position is conditioned on the true preceding tokens rather than on the model's own earlier outputs. During inference, no ground truth is available, so the model consumes its own previous predictions one token at a time, as in the `generate` function. Apart from this, training follows the same pattern as for more conventional neural networks: batches of inputs and targets, a loss function, and gradient updates.
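A small example may help to picture teacher forcing. The sketch assumes the common convention that the target is the input shifted by one token; how this notebook's dataloader actually builds `tgt` is defined elsewhere, so treat the slicing here as an assumption.

```python
tokens = ["the", "movie", "was", "really", "good"]

src = tokens[:-1]  # what the model sees
tgt = tokens[1:]   # what it must predict at each position

for position, target in enumerate(tgt):
    # With teacher forcing, the context is always the ground truth,
    # never the model's own earlier predictions.
    visible = src[: position + 1]
    print(f"{visible} -> {target}")
```

Each printed line is one training example: all of them are scored in a single forward pass, which is what makes training parallel over positions.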

To start the training, first create a Cross Entropy Loss object. The loss will not consider PAD tokens.


In [None]:
from torch.nn import CrossEntropyLoss
loss_fn = CrossEntropyLoss(ignore_index=PAD_IDX)

Next, we take one batch from the dataloader and create the attention mask and the padding mask for it:


In [None]:
src,tgt=next(iter(dataloader))

mask,padding_mask = create_mask(src)

When you call `model(src, src_mask, key_padding_mask)`, the forward method of the `CustomGPTModel` class generates logits for the target sequence. These can then be translated into actual tokens by taking the highest-scoring prediction at each position in the sequence.


In [None]:
logits = model(src,src_mask=mask,key_padding_mask=padding_mask)
print(logits.shape)

In [None]:
print("output shape", logits.shape)
print("source shape", src.shape)


During training, the transformer's decoder is provided with the entire sequence at once, and the causal mask keeps each position from attending to later positions. This allows all positions to be processed in parallel, as opposed to generating one token at a time. Consequently, the model produces logits for every position in a single forward pass, so the output has the same sequence length and batch size as the input, plus one extra dimension for the vocabulary. Comparing the output shape with the source shape confirms this.
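Parallel processing is only safe if each position is prevented from looking at later ones, which is the job of a causal mask. The sketch below builds one in the additive form that PyTorch attention layers accept. `causal_mask` is a hypothetical helper and is not necessarily how this notebook's `create_mask` is written.

```python
import torch


def causal_mask(size: int) -> torch.Tensor:
    """Additive attention mask: 0 where attention is allowed, -inf above the diagonal."""
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)


mask = causal_mask(4)
print(mask)
```

Row `i` describes what position `i` may attend to: entries at columns `j <= i` are 0 (allowed) and entries at columns `j > i` are `-inf`, which become zero attention weight after softmax.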


Now let's inspect the target tensor and its shape:


In [None]:
tgt
print(tgt.shape)

In [None]:
print(logits.reshape(-1, logits.shape[-1]).shape)
print(tgt.reshape(-1).shape)

We now calculate the loss. The transformer's output has the dimensions (sequence length, batch size, vocabulary size), whereas the cross-entropy loss function expects a 2-D tensor of logits with one row per prediction and a 1-D tensor of target indices. We therefore use the `reshape` method to flatten the logits to (sequence length × batch size, vocabulary size) and the targets to (sequence length × batch size). This ensures that the prediction at each timestep of each sequence in the batch is compared against its ground-truth token.


In [None]:
loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt.reshape(-1))
print(loss.item())
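The effect of flattening and of `ignore_index` can be checked on a toy batch. The sizes and target values below are assumptions chosen for illustration; the check confirms that the loss is the mean over non-padding positions only.

```python
import torch
import torch.nn.functional as F
from torch.nn import CrossEntropyLoss

PAD_IDX = 1
seq_len, batch_size, vocab_size = 4, 2, 6  # toy sizes, assumed for illustration

torch.manual_seed(0)
toy_logits = torch.randn(seq_len, batch_size, vocab_size)
# Targets of shape (seq_len, batch); the second sequence ends with two PAD tokens
toy_tgt = torch.tensor([[3, 2], [4, 5], [0, PAD_IDX], [2, PAD_IDX]])

flat_logits = toy_logits.reshape(-1, vocab_size)  # (seq_len * batch, vocab)
flat_tgt = toy_tgt.reshape(-1)                    # (seq_len * batch,)

toy_loss = CrossEntropyLoss(ignore_index=PAD_IDX)(flat_logits, flat_tgt)

# Manual check: average the per-token losses over non-PAD positions only
per_token = F.cross_entropy(flat_logits, flat_tgt, reduction="none")
manual_loss = per_token[flat_tgt != PAD_IDX].mean()

print(flat_logits.shape, flat_tgt.shape)
print(toy_loss.item(), manual_loss.item())
```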

Following the steps above, we can write a function that makes predictions and computes the corresponding loss on the validation data. We will use this function later on.


In [None]:
def evaluate(model: nn.Module, eval_data) -> float:
    model.eval()  # turn on evaluation mode (disables dropout)
    total_loss = 0.
    num_batches = 0
    with torch.no_grad():
        for src, tgt in eval_data:
            src, tgt = src.to(DEVICE), tgt.to(DEVICE)
            logits = model(src, src_mask=None, key_padding_mask=None)
            total_loss += loss_fn(logits.reshape(-1, logits.shape[-1]), tgt.reshape(-1)).item()
            num_batches += 1
    # Average the per-batch losses; guard against an empty dataloader
    return total_loss / max(num_batches, 1)

In [None]:
evaluate(model,val_dataloader)

## Training the model
We train the model following standard neural-network practice, with a few task-specific details noted below. **Note on compute: Training on CPU is significantly slower. If you do not have access to a GPU, you may skip to Loading the Saved Model. We provide a checkpoint trained for 30 epochs for convenience.**



The `train` function trains the `CustomGPTModel` on a given training dataset. It is structured as follows:

- **Optimizer**: We train `CustomGPTModel` using the Adam optimizer.

## Procedure
For each epoch, the `train` function:

- Puts the model in training mode (enabling dropout).
- Iterates over mini-batches from the training dataloader. For each batch:
    - Extracts source (`src`) and target (`tgt`) sequences.
    - Runs a forward pass to obtain logits.
    - Reshapes logits as needed for the loss computation.
    - Computes the token-level loss (cross-entropy).
    - Backpropagates the loss to compute gradients.
    - Applies gradient clipping to mitigate exploding gradients.
    - Steps the optimizer to update model parameters.

Training progress is logged at a fixed interval (e.g., every 10,000 steps) and/or at selected milestones. Each log entry reports:

- Average loss over the interval.
- Perplexity (`exp(loss)`), which measures how well the model predicts the next token (lower is better).
- Throughput, as elapsed time per batch since the prior log.
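As a quick worked example of the perplexity formula: a model that spreads its probability uniformly over a vocabulary of `V` tokens has cross-entropy `ln(V)` and therefore perplexity `V`. The vocabulary size of 1000 and the helper name `perplexity` are illustrative assumptions.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


vocab_size = 1000                    # assumed for illustration
uniform_loss = math.log(vocab_size)  # loss of a model that guesses uniformly

print(perplexity(uniform_loss))  # approximately 1000
print(perplexity(0.0))           # a perfect model has perplexity 1
```

A perplexity of `k` can be read as the model being, on average, as uncertain as if it were choosing uniformly among `k` tokens.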



In [None]:
optimizer = Adam(model.parameters(), lr=1e-2, weight_decay=0.01, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 10000, gamma=0.9)

def train(model: nn.Module, train_data) -> float:
    model.train()  # turn on train mode (enables dropout)
    total_loss = 0.     # summed over the whole epoch
    interval_loss = 0.  # summed since the last log entry
    log_interval = 10000
    last_logged = 0
    start_time = time.time()

    num_batches = len(train_data)  # number of mini-batches per epoch
    for batch, (src, tgt) in enumerate(train_data, start=1):
        src, tgt = src.to(DEVICE), tgt.to(DEVICE)
        logits = model(src, src_mask=None)
        logits_flat = logits.reshape(-1, logits.shape[-1])
        loss = loss_fn(logits_flat, tgt.reshape(-1))

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()
        scheduler.step()  # advance the learning-rate schedule once per batch
        total_loss += loss.item()
        interval_loss += loss.item()

        # Log every log_interval batches and once more at the end of the epoch
        if batch % log_interval == 0 or batch == num_batches:
            steps = batch - last_logged
            lr = scheduler.get_last_lr()[0]
            ms_per_batch = (time.time() - start_time) * 1000 / steps
            cur_loss = interval_loss / steps
            ppl = math.exp(cur_loss)
            # `epoch` is the loop variable of the outer training loop
            print(f'| epoch {epoch:3d} | {batch:5d}/{num_batches:5d} batches | '
                  f'lr {lr:02.4f} | ms/batch {ms_per_batch:5.2f} | '
                  f'loss {cur_loss:5.2f} | ppl {ppl:8.2f}')
            interval_loss = 0.
            last_logged = batch
            start_time = time.time()

    # Return the mean per-batch loss so it is comparable with the validation loss
    return total_loss / max(num_batches, 1)
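To see what `StepLR(optimizer, 10000, gamma=0.9)` does to the learning rate, the schedule can be written in closed form. This sketch assumes the scheduler is stepped once per batch; `step_lr` is a hypothetical helper that mirrors the formula rather than calling PyTorch.

```python
def step_lr(base_lr: float, gamma: float, step_size: int, step: int) -> float:
    """Learning rate after `step` scheduler steps under a StepLR-style decay."""
    return base_lr * gamma ** (step // step_size)


# Values matching the cell above: lr=1e-2, gamma=0.9, step_size=10000
for step in (0, 9_999, 10_000, 25_000):
    print(step, step_lr(1e-2, 0.9, 10_000, step))
```

The rate stays at its initial value for the first 10,000 steps and is multiplied by 0.9 at every further multiple of 10,000.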

We use loss lists to keep track of our training and validation loss.

The model will go through the training data 30 times (epochs). This training step uses functions we've defined earlier.




In [None]:
best_val_loss = float('inf')
epochs = 30
Train_losses= []
Val_losses = []
for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    train_loss = train(model,dataloader)
    val_loss = evaluate(model, val_dataloader)
    val_ppl = math.exp(val_loss)
    Train_losses.append(train_loss)
    Val_losses.append(val_loss)

    elapsed = time.time() - epoch_start_time
    print('-' * 89)
    print(f'| end of epoch {epoch:3d} | time: {elapsed:5.2f}s | '
        f'valid loss {val_loss:5.2f} | valid ppl {val_ppl:8.2f}')
    print('-' * 89)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), 'model_best_val_loss.pt')

Let's plot training and validation losses:


In [None]:
# Calculate the number of epochs (assuming the lengths of train_losses and val_losses are equal)
num_epochs = len(Train_losses)

# Create a figure and a set of subplots
fig, ax = plt.subplots()

# Plot the training losses
ax.plot(range(num_epochs), Train_losses, label='Training Loss', color='blue')

# Plot the validation losses
ax.plot(range(num_epochs), Val_losses, label='Validation Loss', color='orange')

# Set the x-axis label
ax.set_xlabel('Epoch')

# Set the y-axis label
ax.set_ylabel('Loss')

# Set the title of the plot
ax.set_title('Training and Validation Losses')

# Add a legend to the plot
ax.legend()

# Show the plot
plt.show()

![loss_gpt.png](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/V1Fda63Q4CrNfgT5g1HfVQ.png)


## Loading the saved model
If you want to skip training and load a trained model that we provided, go ahead and uncomment the following cell:


In [None]:
#model.load_state_dict(torch.load('kyn1_OsXrzjef0xihlsXmg.pt',map_location=torch.device('cpu')))

In [None]:
print(generate(model,prompt="the movie was",max_new_tokens=10,vocab=vocab,tokenizer=tokenizer))

You can see that the result is not satisfactory. This is expected: language models need far more training data, parameters, and training time than this small model has had before they generate fluent text.


## Loading GPT2 model from HuggingFace
Let's now load the GPT2 model from HuggingFace to check how it performs at text generation:


In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the tokenizer and model (a separate name keeps the custom `model` intact)
tokenizer1 = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")

# Define the input prompt
#input_text = "Once upon a time in a faraway land,"
input_text = "the movie was"

# Tokenize the input text and prepare the input for the model
input_ids = tokenizer1.encode(input_text, return_tensors="pt")

# Generate text using the model
# Set the total length of the output (max_length counts the prompt tokens too)
# and other generation parameters like temperature, top_k, and top_p
max_length = 15
temperature = 0.7
top_k = 50
top_p = 0.95

generated_ids = gpt2_model.generate(
    input_ids,
    max_length=max_length,
    do_sample=True,  # temperature, top_k, and top_p only take effect when sampling
    temperature=temperature,
    top_k=top_k,
    top_p=top_p,
    pad_token_id=tokenizer1.eos_token_id,
)

# Decode the generated text
generated_text = tokenizer1.decode(generated_ids[0], skip_special_tokens=True)

# Print the input prompt and the generated text
print(f"Input: {input_text}")
print(f"Generated Text: {generated_text}")
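The `temperature` and `top_k` parameters control how the next token is sampled instead of always taking the most likely one. The following plain-Python sketch illustrates the idea on a toy list of logits. It is not how Hugging Face implements sampling, it leaves out `top_p`, and `sample_top_k` is a hypothetical helper.

```python
import math
import random


def sample_top_k(logits, temperature=0.7, top_k=50, rng=random):
    """Sample an index from the top_k highest logits after temperature scaling."""
    # Keep the indices of the top_k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature < 1 sharpens the distribution; temperature > 1 flattens it
    scaled = [logits[i] / temperature for i in top]
    # Numerically stable softmax weights over the kept logits
    largest = max(scaled)
    weights = [math.exp(s - largest) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 0.5, 1.0, -1.0]
rng = random.Random(0)  # seeded so repeated runs draw the same samples
samples = [sample_top_k(toy_logits, temperature=0.7, top_k=2, rng=rng) for _ in range(5)]
print(samples)
```

With `top_k=2`, only indices 0 and 2 (the two largest logits) can ever be drawn, and with `top_k=1` the procedure reduces to greedy decoding.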

## Exercise: Creating a decoder model
In this exercise, you will create an instance of a GPT-like model and prompt it to generate text. To achieve this, you will reuse the `CustomGPTModel` class discussed previously and make the necessary modifications.

1. **Create an instance with the following parameters:**
   - `embedding size` = 200
   - `number of layers` = 2
   - `number of attention heads` = 2
   - `dropout probability` = 0.2

2. **Create a prompt**

3. **Pass the prompt to the model to generate text with a maximum length of 15**


In [None]:
#Write your code here
ntokens =  xxx
emsize =  xxx
nlayers = xxx
nhead = xxx
dropout =  xxx


model = CustomGPTModel(embed_size=emsize, num_heads=nhead, num_layers=nlayers, vocab_size=ntokens,dropout=dropout).to(DEVICE)


print(generate(model,prompt="spring is",max_new_tokens=xxx,vocab=vocab,tokenizer=tokenizer))