<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>

# Decoder Causal Language Models

Estimated time needed: **45** minutes

In this tutorial, you'll learn how to build and train a GPT-like decoder model for text generation and other natural language processing tasks. We'll start by setting up the environment and preparing the data: splitting text into tokens and mapping those tokens to the integer indices the model consumes. Then we'll build the model itself, focusing on how attention lets it use earlier parts of the text when generating the next token. Next, we'll train the model on our data. Finally, we'll use the trained model to generate text.

## GPT models

GPT (Generative Pretrained Transformer) is a decoder-only model trained with a causal language modeling objective: the goal is to predict the next token in a sequence given the previous tokens. During training, the decoder input is the target sequence shifted one position to the right, so each position learns to predict the token that follows it. At generation time, the model produces output tokens autoregressively, one at a time, which allows GPT to generate coherent and contextually relevant text from a given input prompt. In this lab you will learn how to create and train a decoder-only GPT-like model. Note, however, that actual GPT models are far larger and are pretrained on massive general-purpose text corpora rather than for a single specific NLP task.
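
To make the shifted-input objective concrete, the sketch below builds the input and target sequences for next-token prediction from a short token list. The function name `make_causal_pairs`, the example sentence, and the whitespace tokenization are illustrative assumptions, not part of this lab's code.

```python
from typing import List, Tuple


def make_causal_pairs(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Return (inputs, targets) for next-token prediction.

    The inputs are the targets shifted one position to the right, so at
    position i the model sees inputs[: i + 1] and must predict targets[i].
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to form an input/target pair")
    return tokens[:-1], tokens[1:]


# Hypothetical example sentence, split on whitespace for illustration only.
tokens = "the movie was great".split()
inputs, targets = make_causal_pairs(tokens)

for i, target in enumerate(targets):
    context = inputs[: i + 1]
    print(f"context={context} -> predict {target!r}")
```

Each printed line corresponds to one training signal: the context grows by one token at a time, and the model is never shown the token it has to predict.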

## GPT vs. ChatGPT

GPT and ChatGPT are both AI models developed by OpenAI, but they serve different purposes and have distinct functionalities.

GPT is a family of large-scale transformer-based language models trained on diverse internet text data. GPT models are designed for a wide range of natural language processing tasks, such as text generation, translation, summarization, and question-answering. They generate responses based on the input text (prompt) but do not maintain a consistent conversation history.

On the other hand, ChatGPT is a fine-tuned version of the GPT model, specifically designed for conversational AI applications. It is trained to maintain a consistent conversation history and generate contextually relevant responses, making it more suitable for chatbot-like interactions. ChatGPT excels at understanding and generating human-like dialogues, providing coherent and engaging responses in a conversational setting.


## __Table of contents__

<ol>
    <li><a href="#Objectives">Objectives</a></li>
    <li>
        <a href="#Setup">Setup</a>
        <ol>
            <li><a href="#Installing-required-libraries">Installing required libraries</a></li>
            <li><a href="#Importing-required-libraries">Importing required libraries</a></li>
        </ol>
    </li>
    <li>
        <a href="#Text-pipeline">Text pipeline</a>
        <ol>
            <li><a href="#Dataset">Dataset</a></li>
            <li><a href="#Collate-function">Collate function</a></li>
        </ol>
    </li>
    <li><a href="#Model-prerequisites">Model prerequisites</a>
        <ol>
            <li><a href="#Masking">Masking</a></li>
            <li><a href="#Positional-encoding">Positional encoding</a></li>
            <li><a href="#Token-embedding">Token embedding</a></li>
        </ol>
    </li>
    <li>
        <a href="#Custom-GPT-model-architecture">Custom GPT model architecture</a>
        <ol>
            <li><a href="#Model-configuration-and-initialization">Model configuration and initialization</a></li>
            <li><a href="#Iterating-through-data-samples">Iterating through data samples</a></li>
        </ol>
    </li>
    <li>
        <a href="#Full-output">Full output</a>
        <ol>
            <li><a href="#Decoding-the-differences:-Training-vs.-inference">Decoding the differences: Training vs. inference</a></li>
            <li><a href="#Training-the-model">Training the model</a></li>
            <li><a href="#Loading-the-saved-model">Loading the saved model</a></li>
            <li><a href="#Loading-GPT2-model-from-HuggingFace">Loading GPT2 model from HuggingFace</a></li>
        </ol>
    </li>
    <li>
        <a href="#Exercise:-Creating-a-decoder-model">Exercise: Creating a decoder model</a>
    </li>
</ol>


## Objectives

After going through this tutorial, you'll be able to:

- Understand how to pick random samples from your data and break text down into tokens.
- Learn how to turn tokens into a vocabulary that your model can use to understand and process text.
- Master the setup of the decoder model architecture, including how it uses attention to generate text.
- Get familiar with training a decoder model, including how to feed it data and improve its performance over time.
- Use your trained decoder model to generate text.


## Setup


### Installing required libraries

The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You will need to run the following cell__ to install them:

```bash
!pip install -U torchdata==0.7.1
!pip install -Uqq "portalocker>=2.0.0"
!pip install -qq torchtext==0.17.1
!pip install -qq matplotlib
!pip install -qq transformers
```

- **torchdata**: Enhances data loading and preprocessing functionalities for PyTorch, streamlining the workflow for machine learning models.
- **portalocker**: Provides a mechanism to lock files, ensuring that only one process can access a file at a time, useful for managing file resources in concurrent applications.
- **torchtext**: Offers utilities for text processing and datasets in PyTorch, simplifying the preparation of data for NLP tasks.
- **matplotlib**: A plotting library for creating static, interactive, and animated visualizations in Python, commonly used for data visualization and graphical plotting tasks.
- **transformers**: Hugging Face's library of pretrained models and tokenizers; this tutorial imports `GPT2Tokenizer` and `GPT2LMHeadModel` from it.

Together, these libraries cover data loading, text preprocessing, visualization, and access to pretrained models for the rest of the tutorial.
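
After the install cell finishes, it can be useful to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is an assumption introduced here, and the package list simply mirrors the install cell above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the install cell above.
REQUIRED = ["torchdata", "portalocker", "torchtext", "matplotlib", "transformers"]

for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```

A `NOT INSTALLED` entry suggests rerunning the install cell (and restarting the kernel) before importing.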


### Importing required libraries

```python
import os
import math
import time
import warnings

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

from typing import Iterable, List

from torch import Tensor
from torch.nn import Transformer
from torch.optim import Adam
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

from torchtext.vocab import Vocab, build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer
from torchtext.datasets import multi30k, Multi30k
from torchtext.datasets import IMDB, PennTreebank

from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Suppress warnings generated by the code in this notebook.
def warn(*args, **kwargs):
    pass
warnings.warn = warn
warnings.filterwarnings('ignore')

# Expose only the first GPU to PyTorch, then report what is available.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
print(torch.cuda.is_available())
if torch.cuda.is_available():
    # get_device_name() raises on CPU-only machines, so guard the call.
    print(torch.cuda.get_device_name())

print("Imported successfully!")
```

```
True
Tesla P40
Imported successfully!
```


## Text pipeline
### Dataset
The code below loads the IMDB dataset's `train` and `test` splits and uses the test split as the validation set. It then creates an iterator over the training set so that samples can be pulled one at a time with `next`. This shows how to step through a dataset manually, without PyTorch's `DataLoader` handling batching and data management.

Language models are generally best trained on general-domain text. IMDB is a movie-review dataset designed for classification, but we use it here because it is small enough for machines with limited RAM. Datasets better suited to language modeling include [PennTreebank](https://pytorch.org/text/0.8.1/datasets.html#penntreebank), [WikiText-2](https://pytorch.org/text/0.8.1/datasets.html#wikitext-2), and [WikiText103](https://pytorch.org/text/0.8.1/datasets.html#wikitext103).



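```python
# Load the dataset; the 'test' split serves as the validation set
train_iter, val_iter = IMDB(root=".data", split=('train', 'test'))
```

Initialize an iterator over the training split:

```python
data_itr = iter(train_iter)
```

The iterator can then be advanced by hand with `next`; the calls are left commented out here:

```python
# # Retrieve the first three records
# next(data_itr)
# next(data_itr)
# next(data_itr)
```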
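
Pulling a fixed number of samples from an iterator by hand can also be sketched with `itertools.islice`. The stand-in list `fake_reviews`, its `(label, text)` layout, and the helper `peek` are assumptions made so the snippet runs without downloading IMDB; passing `train_iter` instead would inspect the real data.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Sample = Tuple[int, str]

# Hypothetical stand-in for the IMDB training split: (label, text) pairs.
fake_reviews: List[Sample] = [
    (1, "A dull plot and wooden acting."),
    (2, "One of the best films I have seen."),
    (1, "I walked out halfway through."),
    (2, "Charming, funny, and well paced."),
]


def peek(samples: Iterable[Sample], n: int) -> List[Sample]:
    """Return up to the first n samples without consuming the whole dataset."""
    iterator: Iterator[Sample] = iter(samples)
    return list(islice(iterator, n))


for label, text in peek(fake_reviews, 3):
    print(label, text)
```

Because `islice` stops as soon as `n` items have been read, this pattern stays cheap even when the underlying dataset is streamed from disk.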

Let's define our device (CPU or GPU) for training. We'll check if a GPU is available and use it; otherwise, we'll use the CPU.


```python
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
DEVICE
```

```
device(type='cuda')
```


### Preprocessing data
The preprocessing code tokenizes the text and builds a vocabulary from it. Its main parts are:

- **Special symbols and indices**: Initializes special tokens (`<unk>`, `<pad>`, and an empty string for EOS) with their corresponding indices (`0`, `1`, and `2`). These tokens are used for unknown words, padding, and end of sentence respectively.
    - `UNK_IDX`: Index for unknown words.
    - `PAD_IDX`: Index used for padding shorter sentences in a batch to ensure uniform length.
    - `EOS_IDX`: Index representing the end of a sentence (though not explicitly used here as the EOS symbol is set to an empty string).

- **`yield_tokens` function**: A generator function that iterates through a dataset (`data_iter`), tokenizing each data sample using a `tokenizer` function, and yields one tokenized sample at a time.

- **Vocabulary building**: Constructs a vocabulary from the tokenized dataset. The `build_vocab_from_iterator` function processes tokens generated by `yield_tokens`, includes special tokens (`special_symbols`) at the beginning of the vocabulary, and sets a minimum frequency (`min_freq=1`) for tokens to be included.

- **Default index for unknown tokens**: Sets a default index for tokens not found in the vocabulary (`UNK_IDX`), ensuring that out-of-vocabulary words are handled as unknown tokens.

- **`text_to_index` function**: Converts a given text into a sequence of indices based on the built vocabulary. This function is essential for transforming raw text into a numerical format that can be processed by machine learning models.

- **`index_to_en` function**: Transforms a sequence of indices back into a readable string. It's useful for interpreting the outputs of models and converting numerical predictions back into text.

- **Check functionality**: Demonstrates the use of `index_to_en` by converting a tensor of indices `[0,1,2]` back into their corresponding special symbols. This helps verify that the vocabulary and index conversion functions are working as expected.
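
The steps in the list above can be sketched without any external library. This is a simplified stand-in for the idea, not the torchtext implementation: `build_vocab`, the toy whitespace `tokenizer`, the two-sentence `corpus`, and the `<eos>` placeholder symbol are all assumptions introduced for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

UNK_IDX, PAD_IDX, EOS_IDX = 0, 1, 2
# '<eos>' is a hypothetical placeholder; the list above describes the EOS
# symbol as an empty string.
special_symbols = ["<unk>", "<pad>", "<eos>"]


def tokenizer(text: str) -> List[str]:
    """Toy tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def yield_tokens(data_iter: Iterable[str]) -> Iterator[List[str]]:
    """Yield one tokenized sample at a time."""
    for text in data_iter:
        yield tokenizer(text)


def build_vocab(token_lists: Iterable[List[str]], min_freq: int = 1) -> List[str]:
    """Return an index-to-token list: special symbols first, then frequent tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = [
        tok
        for tok, count in counts.most_common()
        if count >= min_freq and tok not in special_symbols
    ]
    return special_symbols + kept


corpus = ["the movie was great", "the plot was thin"]  # hypothetical data
itos = build_vocab(yield_tokens(corpus))
stoi: Dict[str, int] = {tok: i for i, tok in enumerate(itos)}


def text_to_index(text: str) -> List[int]:
    """Map text to indices; out-of-vocabulary tokens fall back to UNK_IDX."""
    return [stoi.get(tok, UNK_IDX) for tok in tokenizer(text)]


def index_to_en(indices: Iterable[int]) -> str:
    """Map indices back to a space-joined string."""
    return " ".join(itos[i] for i in indices)


print(index_to_en([UNK_IDX, PAD_IDX, EOS_IDX]))
print(text_to_index("the movie was awful"))
```

In this sketch the word `awful` never appears in the corpus, so it maps to `UNK_IDX`, which mirrors the role of the default index described above.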
