![banner](https://raw.githubusercontent.com/priyammaz/HAL-DL-From-Scratch/main/src/visuals/banner.png)

# Generative Pretrained Transformers

Today we will be implementing the language model that has taken the world by storm: Generative Pretrained Transformers, or GPT! The architecture for this model is actually not too complicated, and not all that different from previous implementations we have done like the [Vision Transformer](https://github.com/priyammaz/HAL-DL-From-Scratch/tree/main/PyTorch%20for%20Computer%20Vision/Vision%20Transformer).

Some helpful sources!
- [NanoGPT](https://github.com/karpathy/nanoGPT) was extremely helpful to learn about how these models function, so a lot of this will be similar to the code found there, just with some of my own visuals so I can better explain the components of attention and causal language modeling.
- [Huggingface](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py) was also really helpful, between their GPT as a reference and their general tutorials for language models.

As a recap, there are two types of transformers:

### Encoders (RoBERTa, ViT)
Encoder transformers have the ability to look at the entire sequence and are bidirectional. This means when we compute attention, words can look forward and backwards, which is perfect for situations where we have the entire sequence given and we want to make some predictions about it (e.g. image classification, sentiment analysis, speech recognition, etc.).

### Decoders (GPT)
Decoder transformers only have the ability to look at the past, meaning any prediction we make for a specific time in the sequence is based only on times that came before it. This is important for tasks like time-series forecasting or language generation. When we want to generate a sequence, we can only look at the words that have come already as the future words aren't available. 

### Language Pre-Training
There are typically two types of language pretraining. Encoders use **Masked Language Modeling**. This is the process of taking a sentence, randomly removing words, and then having the model predict the missing words by looking at the existing words before and after them. Decoders use **Causal Language Modeling**, where we train a model to predict the word at time $n+1$ given all the words 1 to $n$. Today we will be looking at Causal Language Models, as that is what GPT uses for pretraining.

### Training vs. Inference
Something to keep in mind when we train a Causal Language Model is that, during training, we actually have the entire sentence available to us, but during inference we only have the input sentence and we generate based on that. For the input sentence during training, we will have to do some type of masking to enforce causal language modeling, to make sure that every word can **ONLY LOOK AT THE PAST**. Let's take a quick look at what that would look like! Remember, the **Attention Mechanism** computes weights that signify how every word in the sentence is related to every other word. Therefore, if you have $n$ words in your sentence, you will have an $n \times n$ attention matrix.

Let's take the sentence: **Deep Learning is fun**


<div>
<img src="https://github.com/priyammaz/HAL-DL-From-Scratch/blob/main/src/visuals/causal_masking.png?raw=true" width="800"/>
</div>

- **Deep** can only look at itself, and not any future words
- **Learning** can look at itself and the past word **Deep**
- **is** can look at itself and the past words **Deep** and **Learning**
- **fun** can look at itself and all the past words.

Without the masking, this attention matrix would be that of a regular encoder. If **Deep** could look at the future words **Learning**, **is** and **fun**, we would break causality: we want to ensure that for every word in the sentence, we only look at the past to predict the next (i.e. at word **n** we want to predict the next word **n+1** based on words 1 to **n**). Letting the model see the future is cheating. The whole purpose of the model is to predict future words, so if it can see the words it needs to predict, it won't learn anything except to copy them.
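As a small illustration of the rule in the list above, here is a sketch in plain Python (the variable names are just for this example) that builds the lower-triangular 0/1 pattern for the sentence and lists what each word is allowed to see:

```python
# Which words each position may attend to under a causal mask.
words = ["Deep", "Learning", "is", "fun"]
n = len(words)

# Row i, column j is 1 when word i may look at word j (j is itself or in the past).
mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# The same information as a lookup: each word maps to the words it can see.
visible = {word: words[: i + 1] for i, word in enumerate(words)}

for row, word in zip(mask, words):
    print(f"{word:>8} {row} -> {visible[word]}")
```

Flipping any of the zeros above the diagonal to a one would let a word peek at a token it is supposed to predict.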

### How to do Attention Masking

So how do we actually perform attention masking? Let's first remind ourselves what attention is doing. Here is the equation for Attention:

$$\text{Attention}(Q,K,V) = \text{Softmax}(\frac{QK^T}{\sqrt{d_e}})V$$

So the first step is computing $QK^T$, where $Q$ and $K$ both have the shape (Sequence Length x Embedding Dimension). The output of this computation will be (Sequence Length x Sequence Length). This is what it looks like!

<div>
<img src="https://github.com/priyammaz/HAL-DL-From-Scratch/blob/main/src/visuals/computing_attention.png?raw=true" width="800"/>
</div>

In the image above, I also applied the softmax (not shown for simplicity), so each row of the attention matrix adds up to 1 (like probabilities).
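To make the shapes concrete, here is a minimal sketch with random tensors. The sizes (4 tokens, 8-dimensional embeddings) and the seed are arbitrary choices for illustration, not values from the figure:

```python
import math
import torch

torch.manual_seed(0)

seq_len, embed_dim = 4, 8          # toy sizes, chosen only for illustration
Q = torch.randn(seq_len, embed_dim)
K = torch.randn(seq_len, embed_dim)

# (seq_len x embed_dim) @ (embed_dim x seq_len) -> (seq_len x seq_len)
scores = Q @ K.T / math.sqrt(embed_dim)

# Softmax over the last dimension normalizes each row independently.
attn = torch.softmax(scores, dim=-1)

print(scores.shape)        # one score per (query word, key word) pair
print(attn.sum(dim=-1))    # every row sums to 1
```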

Now we can multiply our softmaxed attention matrix with our $V$. This is what a regular encoder (bidirectional) attention will look like. Remember, $V_1, \dots, V_4$ are the projection vectors (Values) of the data, and each row of the output is a weighted average of these vectors, with the weights given by the attention matrix.

**Note**

In transformers, our input $X$ goes through 3 different linear projections to create $Q$, $K$, and $V$. Initially, the vector for each word in $X$ is the embedding vector representing that word. After the attention computation, the vector for each word isn't just the embedding of that word, but rather a weighted average of all the vectors in the sequence, weighted by how related they are to the word of interest.


<div>
<img src="https://github.com/priyammaz/HAL-DL-From-Scratch/blob/main/src/visuals/encoder_attention_vis.png?raw=true" width="800"/>
</div>

What's the problem above? The first row vector in the output, $0.2V_1 + 0.1V_2 + 0.4V_3 + 0.3V_4$, is a weighted average of the entire sequence (therefore getting information from future vectors). This is again cheating, so we need to mask our attention matrix: the weights for future words, relative to the word of interest, are set to 0, and the weights for the word of interest and previous words are then reweighted to add up to 1. Therefore, we are only learning how every word is related to itself and the past!

<div>
<img src="https://github.com/priyammaz/HAL-DL-From-Scratch/blob/main/src/visuals/decoder_attention_vis.png?raw=true" width="800"/>
</div>

### Computing the Reweighted Causal Attention Mask

Let's pretend the raw output of $QK^T$, before the softmax, is:

\begin{equation}
\begin{bmatrix}
  7       & -8   & 6  \\
  -3       & 2   & 4   \\
  1       & 6  & -2   \\
\end{bmatrix}
\end{equation}

Remember, the equation for softmax is:

$$\text{Softmax}(\vec{x})_i = \frac{e^{x_i}}{\sum_{j=1}^N{e^{x_j}}}$$

Then, we can compute the softmax for each row of the matrix above:

\begin{equation}
\text{Softmax}
\begin{bmatrix}
  7       & -8   & 6  \\
  -3       & 2   & 4   \\
  1       & 6  & -2   \\
\end{bmatrix} = 
\begin{bmatrix}
  \frac{e^{7}}{e^{7}+e^{-8}+e^{6}}       & \frac{e^{-8}}{e^{7}+e^{-8}+e^{6}}   & \frac{e^{6}}{e^{7}+e^{-8}+e^{6}}  \\
  \frac{e^{-3}}{e^{-3}+e^{2}+e^{4}}       & \frac{e^{2}}{e^{-3}+e^{2}+e^{4}}   & \frac{e^{4}}{e^{-3}+e^{2}+e^{4}}  \\
  \frac{e^{1}}{e^{1}+e^{6}+e^{-2}}       & \frac{e^{6}}{e^{1}+e^{6}+e^{-2}}   & \frac{e^{-2}}{e^{1}+e^{6}+e^{-2}}  \\
\end{bmatrix} = 
\begin{bmatrix}
  0.73       & 0.0000002   & 0.27   \\
  0.0008       & 0.12   & 0.88 \\
  0.007       & 0.993  & 0.0003  \\
\end{bmatrix}
\end{equation}

But what we want is for the top triangle to have weights of 0, and the rest of each row to add up to 1. So let's take the second vector in the matrix above to see how we can do that.

$$x_2 = [-3, 2, 4]$$

Because this is the second vector, we need to zero out the softmax output for everything after the second index (so in our case just the last value). Let's replace the value 4 with $-\infty$. Then we can write it as:

$$x_2 = [-3, 2, -\infty]$$

Let's now take the softmax of this vector!

$$\text{Softmax}(x_2) = [\frac{e^{-3}}{e^{-3}+e^{2}+e^{-\infty}}, \frac{e^{2}}{e^{-3}+e^{2}+e^{-\infty}}, \frac{e^{-\infty}}{e^{-3}+e^{2}+e^{-\infty}}]$$

Remember, $e^{-\infty}$ is equal to 0, so we can solve this!

$$\text{Softmax}(x_2) = [\frac{e^{-3}}{e^{-3}+e^{2}+0}, \frac{e^{2}}{e^{-3}+e^{2}+0}, \frac{0}{e^{-3}+e^{2}+0}] = [0.0067, 0.9933, 0.0000]$$

So we have exactly what we want! The attention weight of the last value is set to 0, so when we are on the second vector $x_2$, we cannot look forward to the future value vector $v_3$, and the remaining weights add up to 1, so it's still a probability vector! To do this correctly for the entire matrix, we can just replace the upper triangle of $QK^T$ (everything above the diagonal) with $-\infty$. This would look like:

\begin{equation}
\begin{bmatrix}
  7       & -\infty   & -\infty  \\
  -3       & 2   & -\infty   \\
  1       & 6  & -2   \\
\end{bmatrix}
\end{equation}

Taking the softmax of the rows of this matrix then gives:

\begin{equation}
\text{Softmax}
\begin{bmatrix}
  7       & -\infty   & -\infty  \\
  -3       & 2   & -\infty   \\
  1       & 6  & -2   \\
\end{bmatrix} = 
\begin{bmatrix}
  1       & 0   & 0  \\
  0.0067  & 0.9933 & 0   \\
  0.0067       & 0.9930  & 0.0003   \\
\end{bmatrix}
\end{equation}
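If you want to check the arithmetic yourself, here is a small sketch in plain Python. The `softmax` helper is written just for this example; it relies on the fact that `math.exp(float("-inf"))` evaluates to `0.0`, which is exactly the trick used in the derivation:

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats (may contain -inf)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]   # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

NEG_INF = float("-inf")

# The raw QK^T scores from the worked example.
scores = [[7.0, -8.0, 6.0],
          [-3.0, 2.0, 4.0],
          [1.0, 6.0, -2.0]]

# Replace everything above the diagonal (column j > row i) with -inf.
masked = [[v if j <= i else NEG_INF for j, v in enumerate(row)]
          for i, row in enumerate(scores)]

weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 4) for w in row])
```

Every masked position comes out as exactly 0, and each row still sums to 1 because the softmax denominator only contains the unmasked terms.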

### That's It!

For the most part, this is what powers GPT (along with buckets of data and giant models). The implementation isn't all that different from what we did for the Vision Transformer or any other transformer architecture, we will just have to include this extra piece of masking the attention matrix. 

### Let's Implement It

Typically you have to train GPT on massive datasets, but just to get a feel for it, let's do it for our Harry Potter text. This should be pretty similar to the [LSTM for Harry Potter Generation](https://github.com/priyammaz/HAL-DL-From-Scratch/tree/main/PyTorch%20for%20NLP/LSTM/LSTM%20Harry%20Potter%20Generation), with the difference being that we will do subword tokenization instead of character-level tokenization. Definitely take a look at that before you look at this, but at a high level there are typically three types of tokenization:

- Character Level: We will have an embedding vector for each unique character in the data (typically alphanumeric characters and punctuation)
- Sub-Word Level: We will have embedding vectors for parts of words, e.g. the word "Learning" may be split into "Learn" and "ing". This is good because subwords can be reused quite a bit across words
- Word Level: We will have embedding vectors for all unique words. This is expensive because of how many unique words there are.


First step, take all the text in the Harry Potter dataset, put it into a single long string, and then tokenize all of it using the GPT2 Tokenizer from huggingface. You could really use any tokenizer you want, but we will go with this for now! The tokens in this case will be unique identifiers of the sub-words, and then our GPT model will later have an embedding matrix to convert indexes for tokens to vectors. 

In [1]:
import torch 
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from tqdm.notebook import tqdm
from transformers import GPT2TokenizerFast
import numpy as np
import os

In [2]:
### Put All Text Together to Sample From ###
path_to_data = "../../data/harry_potter_txt/"

text_files = os.listdir(path_to_data)

all_text = ""
for book in text_files:
    with open(os.path.join(path_to_data, book), "r") as f:
        text = f.readlines() # Read in all lines
        text = [line for line in text if "Page" not in line] # Remove lines with Page Numbers
        text = " ".join(text).replace("\n", "") # Remove all newline characters
        text = [word for word in text.split(" ") if len(word) > 0] # Remove all empty characters
        text = " ".join(text) # Combined lightly cleaned text
        all_text += text

### Tokenize all Data ###
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print("Tokenizer Vocab Size:", tokenizer.vocab_size)

tokenized_data = tokenizer(all_text)["input_ids"]
print("Number of Tokens:", len(tokenized_data))

print("Example Tokens")
print(tokenized_data[:30])

Tokenizer Vocab Size: 50257


Token indices sequence length is longer than the specified maximum sequence length for this model (1649226 > 1024). Running this sequence through the model will result in indexing errors


Number of Tokens: 1649226
Example Tokens
[14, 3336, 16494, 56, 19494, 406, 3824, 1961, 1770, 13, 290, 9074, 13, 360, 1834, 1636, 11, 286, 1271, 1440, 11, 4389, 16809, 9974, 11, 547, 6613, 284, 910, 326]


### Writing a DataLoader

There are two main things to remember in our dataloader:

- Context Length: How many tokens the model can process at a time. If you remember, the LSTM model from before can technically take **ANY** sequence length, with the penalty of the model forgetting earlier parts of the sequence as it gets to the end. Transformers are set with a specific sequence length up front, which is our context length, or how many tokens we learn to process at a time.
- Autoregressive Data: Let's pretend we have a sequence of tokens [10,2,4,6,8]. The input to the model will be [10,2,4,6], and for each of these tokens we will predict the one that follows it, [2,4,6,8]. Therefore, if we want a context length of 300, we will grab 301 tokens, so tokens 1 to 300 will be the input, and tokens 2 to 301 will be what we want to predict.
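The shift between inputs and labels is easy to get wrong by one, so here is the toy sequence from the bullet above written out in plain Python:

```python
tokens = [10, 2, 4, 6, 8]

inputs = tokens[:-1]   # everything except the last token
labels = tokens[1:]    # everything except the first token

# With causal masking, position i only sees inputs[0..i] when predicting labels[i].
for i, target in enumerate(labels):
    print(f"context {inputs[: i + 1]} -> predict {target}")
```

One slice of `seq_len + 1` tokens therefore gives `seq_len` training examples at once, one per position.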


### Caveat

One of the typical datasets used to train GPT-style models is OpenWebText. Something interesting about these datasets is that most of the sequences in the data are pretty short. You may have seen a lot of effort going into building long-context models (GPT-4 now has a 128,000 token context length variant). The only way to train these models is to also have long-context data, where sequences are entire books' worth of information.

In the case of OpenWebText, you don't have these long sequences, so something you will have to do is take all the short sequences and concatenate them together to reach the length you want. The problem with this is that the second sequence has nothing to do with the first, so you have to add special tokens between sequences, like this:

- Sequence 1: Hello, my name is Priyam
- Sequence 2: Deep learning is really awesome
- Sequence 3: The weather is really nice tomorrow

When we put them together we can do something like:

Hello, my name is Priyam [EOS] Deep learning is really awesome [EOS] The weather is really nice tomorrow [EOS]. 

This way the model can learn where sequences end, and the tokens act almost like a page break, so it has less incentive to learn how words in different sequences are related to each other. I'm not totally convinced this is the best way, to be honest. I think the better way is to train on long-sequence data, like the [Gutenberg Book Corpus](https://www.gutenberg.org/), because as long as you grab long sequences from a book, they are always consecutive and related to each other, so there's no need for [EOS] tokens. For the same reason, since we are training on Harry Potter books, except for the 6 cases of a book ending and a new book starting, we have no [EOS], so we can ignore it.
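We don't need this for Harry Potter, but as a rough sketch of what the packing could look like: the `pack_sequences` helper and the toy token ids below are made up for illustration, while `50256` is the id of GPT-2's `<|endoftext|>` token.

```python
EOS_ID = 50256  # id of GPT-2's <|endoftext|> token

def pack_sequences(sequences, block_size, eos_id=EOS_ID):
    """Concatenate token lists with an EOS separator, then cut fixed-size blocks."""
    stream = []
    for seq in sequences:
        stream.extend(seq)
        stream.append(eos_id)
    # Drop the trailing remainder that does not fill a whole block.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Hypothetical token ids standing in for three short, unrelated documents.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_sequences(docs, block_size=4)
print(blocks)
```

Notice that the second block mixes the end of one document with the start of the next. That is the situation the [EOS] token is meant to flag to the model.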


In [3]:
class DataBuilder:
    def __init__(self, seq_len=300, tokenized_text=tokenized_data):

        self.seq_len = seq_len + 1
        self.tokenized_text = tokenized_text
        self.file_length = len(tokenized_text)
        
    def grab_random_sample(self):
        start = np.random.randint(0, len(self.tokenized_text) - self.seq_len)
        end = start + self.seq_len
        text_slice = self.tokenized_text[start:end]

        input_text = torch.tensor(text_slice[:-1])
        label = torch.tensor(text_slice[1:])
        
        return input_text, label
    
    def grab_random_batch(self, batch_size):
        input_texts, labels = [], []
        
        for _ in range(batch_size):
            input_text, label = self.grab_random_sample()
            input_texts.append(input_text)
            labels.append(label)
            
        input_texts = torch.stack(input_texts)
        labels = torch.stack(labels)
        
        return input_texts, labels


dataset = DataBuilder(tokenized_text=tokenized_data, seq_len=5)
input_texts, labels = dataset.grab_random_batch(batch_size=2)


print("Input Text:", input_texts.shape) 
print(input_texts)


print("Label Text:", labels.shape)     
print(labels)


Input Text: torch.Size([2, 5])
tensor([[  83,    0,  447,  251,  564],
        [ 247, 7081,   11,  314,  447]])
Label Text: torch.Size([2, 5])
tensor([[   0,  447,  251,  564,  250],
        [7081,   11,  314,  447,  247]])


### Create the Causal Mask

We first need to create the mask for our model. As we saw above, we want the top right triangle to be set to $-\infty$, but for now let's create a binary 0/1 mask and then convert it to boolean. We will have 0 (False) for the top right triangle (where we later want to fill in $-\infty$) and 1 (True) everywhere else. If our sequence length is 5, then our mask should look like:

```
[ True, False, False, False, False]
[ True,  True, False, False, False]
[ True,  True,  True, False, False]
[ True,  True,  True,  True, False]
[ True,  True,  True,  True,  True]
```
To do this we can use the *torch.tril()* function! 

Now one thing to keep in mind: our attention matrix is going to be N x N, but we will have one attention matrix for every head of attention, and also for every sample in the batch. So instead of an attention mask with the shape (N x N), we will create one that is (1 x 1 x N x N), where the ones act as placeholder dimensions for future broadcasting.


In [4]:
def CausalMasking(seq_len):
    ones = torch.ones((seq_len, seq_len))
    causal_mask = torch.tril(ones)
    causal_mask = causal_mask.reshape(1,1,seq_len,seq_len).bool()
    return causal_mask

causal_mask = CausalMasking(5)

print("Causal Mask")
print(causal_mask)

print("Causal Mask Shape")
print(causal_mask.shape)

Causal Mask
tensor([[[[ True, False, False, False, False],
          [ True,  True, False, False, False],
          [ True,  True,  True, False, False],
          [ True,  True,  True,  True, False],
          [ True,  True,  True,  True,  True]]]])
Causal Mask Shape
torch.Size([1, 1, 5, 5])
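To see the placeholder dimensions doing their job, here is a quick sketch that applies a (1 x 1 x N x N) mask to a batch of random scores. The batch size and number of heads are arbitrary, and the mask is rebuilt inline so the snippet runs on its own:

```python
import torch

batch_size, num_heads, seq_len = 2, 3, 5   # toy sizes for illustration

# Same construction as CausalMasking above, inlined so this snippet stands alone.
mask = torch.tril(torch.ones(seq_len, seq_len)).reshape(1, 1, seq_len, seq_len).bool()

scores = torch.randn(batch_size, num_heads, seq_len, seq_len)

# The (1, 1, N, N) mask broadcasts over the batch and head dimensions.
masked = scores.masked_fill(~mask, float("-inf"))
weights = masked.softmax(dim=-1)

print(masked.shape)
print(weights[0, 0])
```

The single mask is reused for every sample and every head, so we never have to materialize a (batch x heads x N x N) copy of it.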


### Writing Self-Attention

The main part of our model is the Self-Attention mechanism with causal masking! This should look very similar to the [Vision Transformer](https://github.com/priyammaz/HAL-DL-From-Scratch/tree/main/PyTorch%20for%20Computer%20Vision/Vision%20Transformer) implementation of the Attention mechanism, but with the causal mask added in.

In [5]:
class SelfAttentionDecoder(nn.Module):
  def __init__(self,
               seq_len=300,
               embed_dim=768,
               num_heads=12, 
               attn_p=0,
               proj_p=0):

    """
    Args:

        seq_len: What is the expected sequence length to our model?
        embed_dim: What is the embedding dimension of each embedding vector
        num_heads: How many heads of attention do we want?
        attn_p: Dropout probability on Attention
        proj_p: Dropout probability on projection matrix
    """
    super(SelfAttentionDecoder, self).__init__()
    assert embed_dim % num_heads == 0
    self.num_heads = num_heads
    self.head_dim = int(embed_dim / num_heads)
    self.scale = self.head_dim ** -0.5

    self.qkv = nn.Linear(embed_dim, embed_dim*3)
    self.attn_p = attn_p
    self.attn_drop = nn.Dropout(attn_p)
    self.proj = nn.Linear(embed_dim, embed_dim)
    self.proj_drop = nn.Dropout(proj_p)


    ### Define the Causal Mask ###
    self.register_buffer("causal_mask", CausalMasking(seq_len=seq_len).to(torch.bool))

  def forward(self, x):
    batch_size, seq_len, embed_dim = x.shape
    qkv = self.qkv(x).reshape(batch_size, seq_len, 3, self.num_heads, self.head_dim)
    qkv = qkv.permute(2,0,3,1,4)
    q,k,v = qkv.unbind(0)

    attn = (q @ k.transpose(-2,-1)) * self.scale

    ####################################################################################
    ### FILL ATTENTION MASK WITH -Infinity ###
    attn = attn.masked_fill(self.causal_mask[:,:,:seq_len,:seq_len] == 0, float('-inf'))
    ####################################################################################

    attn = attn.softmax(dim=-1)
    attn = self.attn_drop(attn)
    x = attn @ v

    x = x.transpose(1,2).reshape(batch_size, seq_len, embed_dim)
    x = self.proj(x)
    x = self.proj_drop(x)
    return x

### Test Attention ###
x = torch.randn(2,300,768)
a = SelfAttentionDecoder()
out = a(x)
print(out.shape)

torch.Size([2, 300, 768])


### Define the Rest of The Transformer

All we have left is to implement the rest of the model! Again, this should be very similar to the [Vision Transformer](https://github.com/priyammaz/HAL-DL-From-Scratch/tree/main/PyTorch%20for%20Computer%20Vision/Vision%20Transformer).

In [6]:
########################################
### DEFINE THE MULTILAYER PERCEPTRON ###
########################################

class MLP(nn.Module):
    def __init__(self, 
                 in_features,
                 hidden_features,
                 out_features,
                 act_layer=nn.GELU,
                 mlp_p=0):


        super(MLP, self).__init__()
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.drop1 = nn.Dropout(mlp_p)
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop2 = nn.Dropout(mlp_p)
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop1(x)
        x = self.fc2(x)
        x = self.drop2(x)
        return x

################################################################
### PUT ATTENTION AND MLP TOGETHER WITH RESIDUAL CONNECTIONS ###
################################################################
class Block(nn.Module):
    def __init__(self, 
                 seq_len=300, 
                 embed_dim=768, 
                 num_heads=12, 
                 mlp_ratio=4, 
                 proj_p=0., 
                 attn_p=0., 
                 mlp_p=0., 
                 act_layer=nn.GELU, 
                 norm_layer=nn.LayerNorm):

        super().__init__()
        self.norm1 = norm_layer(embed_dim, eps=1e-6)
        self.attn = SelfAttentionDecoder(seq_len=seq_len,
                                         embed_dim=embed_dim,
                                         num_heads=num_heads, 
                                         attn_p=attn_p,
                                         proj_p=proj_p)


        self.norm2 = norm_layer(embed_dim, eps=1e-6)
        self.mlp = MLP(in_features=embed_dim,
                       hidden_features=int(embed_dim*mlp_ratio),
                       out_features=embed_dim,
                       act_layer=act_layer,
                       mlp_p=mlp_p)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

### Write GPT

Now we can put together our GPT Model! There are a couple of things we need to do here:

- Word Embeddings: We need to create an embedding matrix for our token embeddings
- Positional Embeddings: We need a second embedding matrix for the positional encodings. We could have used Sin-Cosine embeddings like the original [Attention is All You Need](https://arxiv.org/abs/1706.03762) paper, but we will use learned embeddings in this case. There are lots of other options, like [Rotary Embeddings](https://arxiv.org/abs/2104.09864) or [ALiBi](https://arxiv.org/abs/2108.12409), that are arguably better, but this is a start!
- Blocks: Stack of our transformer blocks
- Prediction Head: For each embedding vector, we need to predict the next word, so this layer outputs one logit per token in the tokenizer's vocabulary.

#### Weight Sharing
Intuitively, the weight matrix that takes in indexes for tokens and converts to the embedding vectors and the prediction head that takes embedding vectors and predicts the token indexes are doing the same thing. A trick we can use is weight sharing, just make the weights of the word embeddings equal to the weights of the prediction heads, as these vectors have to serve the same purpose!
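Here is a minimal sketch of what tying the weights means in PyTorch, using toy sizes picked for illustration. It works because `nn.Embedding(vocab_size, embed_dim)` and `nn.Linear(embed_dim, vocab_size)` both store a weight of shape (vocab_size x embed_dim):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 16   # toy sizes for illustration

embeddings = nn.Embedding(vocab_size, embed_dim)   # weight: (vocab_size, embed_dim)
head = nn.Linear(embed_dim, vocab_size)            # weight: (vocab_size, embed_dim)

# Tie the two layers: both now point at the same Parameter object.
embeddings.weight = head.weight

# An in-place update through one layer is visible through the other.
with torch.no_grad():
    head.weight[0].fill_(1.0)

print(embeddings.weight is head.weight)
print(embeddings.weight[0, :4])
```

Besides the intuition above, tying also removes one (vocab_size x embed_dim) matrix from the parameter count, which is a large share of a small model.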

#### Weight Initialization Strategy
There is a lot of work on how different weight initializations affect transformers. We will just go with the typical one that I have seen: truncated normal initialization on all weight layers, with biases set to 0.

### Ability to Write
We also need to write a function that can take some input text and write new text! One caveat is the context length. For example, if we are training a model that can take in a max of 100 words, and we pass in 80 words and ask it to generate 50, then until we reach word 100, we will have full context. But once we are on word 101, the model only has capacity for the last 100 words, so we will have to cut off the first word. This is typically known as the context length limit of language models.
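The example with the 80-word prompt can be simulated in a few lines of plain Python. No model is involved here: the "predictions" are placeholder integers, and the only thing being shown is how the input window is cropped:

```python
max_seq_len = 100          # context length from the example above
tokens = list(range(80))   # stand-ins for the 80 prompt tokens
context_sizes = []

for step in range(50):     # generate 50 new tokens
    # Keep only the most recent max_seq_len tokens as model input.
    context = tokens[-max_seq_len:]
    context_sizes.append(len(context))
    next_token = 1000 + step   # placeholder for the model's prediction
    tokens.append(next_token)

print(len(tokens))                                             # 130
print(context_sizes[0], context_sizes[20], context_sizes[21])  # 80 100 100
```

The sequence keeps growing to 130 tokens, but the model never sees more than the last 100 of them.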

There are also some options regarding greedy decoding versus sampling. Remember, the output of our model is a probability vector over which word comes next. **Greedy decoding** always takes the highest-probability word. The other option is **Sampling**, where we draw the next word at random from the predicted distribution (using *torch.multinomial*).

Lastly we have the **temperature** of our distribution. Remember, to actually compute the probabilities, we take the raw logits of the model and apply softmax to convert them to a probability distribution. We can divide our logits by the temperature parameter before the softmax to reshape the distribution: a temperature below 1 sharpens it (closer to greedy decoding), while a temperature above 1 flattens it (more random samples). [Here](https://medium.com/@harshit158/softmax-temperature-5492e4007f71) is a great resource to see how this works!
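To get a feel for what dividing by the temperature does, here is a small sketch in plain Python. The logits are made-up numbers for a three-token vocabulary, and `softmax_with_temperature` is a helper written just for this example:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]   # made-up logits for a three-token vocabulary

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# Greedy decoding picks the argmax, which temperature scaling does not change.
greedy = max(range(len(logits)), key=lambda i: logits[i])
print("greedy choice:", greedy)
```

Lower temperatures pile more probability onto the top token, while higher temperatures spread it out, so temperature only matters when we sample.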

In [7]:
class GPT(nn.Module):
    def __init__(self, 
                 max_seq_len=512, 
                 vocab_size=tokenizer.vocab_size,
                 embed_dim=768, 
                 depth=12, 
                 num_heads=12, 
                 mlp_ratio=4, 
                 attn_p=0., 
                 mlp_p=0., 
                 proj_p=0., 
                 pos_p=0., 
                 act_layer=nn.GELU, 
                 norm_layer=nn.LayerNorm):

        super().__init__()
        
        self.max_seq_len = max_seq_len
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Embedding(max_seq_len, embed_dim)
        self.pos_drop = nn.Dropout(pos_p)

        self.blocks = nn.ModuleList(
            [
                Block(seq_len=max_seq_len, 
                      embed_dim=embed_dim, 
                      num_heads=num_heads, 
                      mlp_ratio=mlp_ratio, 
                      proj_p=proj_p, 
                      attn_p=attn_p, 
                      mlp_p=mlp_p, 
                      act_layer=act_layer, 
                      norm_layer=norm_layer)

                for _ in range(depth)
            ]
        )

        self.norm = norm_layer(embed_dim)
        self.head = nn.Linear(embed_dim, vocab_size)

        ### Weight Sharing ###
        self.embeddings.weight = self.head.weight

        ## Weight Init ###
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.trunc_normal_(module.weight, std=0.02, a=-2, b=2)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.trunc_normal_(module.weight, std=0.02, a=-2, b=2)
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)


    def forward(self, x):
        device = x.device

        batch_size, seq_len = x.shape

        ### We only need the positional information upto the length of data we have ###
        avail_idx = torch.arange(0, seq_len, dtype=torch.long, device=device)

        tok_emb = self.embeddings(x)
        pos_emb = self.pos_embed(avail_idx)

        x = tok_emb + pos_emb
        x = self.pos_drop(x)

        for block in self.blocks:
            x = block(x)

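        ### Final LayerNorm (defined in __init__ as self.norm) before the prediction head ###
        x = self.norm(x)
        x = self.head(x)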
        return x
        
    @torch.no_grad()
    def write(self, input_tokens, max_new_tokens, temperature=1.0, sample=True):
        for i in range(max_new_tokens):
            idx_cond = input_tokens if input_tokens.shape[1] < self.max_seq_len else input_tokens[:, -self.max_seq_len:]
            logits = self(idx_cond)
            logits = logits[:, -1, :] / temperature
            probs = F.softmax(logits, dim=-1)
            if sample:
                idx_next = torch.multinomial(probs, num_samples=1)
            else:
                idx_next = torch.argmax(probs, dim=-1, keepdim=True) # (batch, 1), also works for batch sizes > 1
            input_tokens = torch.cat([input_tokens, idx_next], dim=-1)
        return input_tokens.detach().cpu().numpy()

### Write a Basic Training Script

Nothing fancy here! We will write a training script to train this model and run inference as we go, just to see if the model is generating any text that makes sense.

In [8]:
### DEFINE TRAINING PARAMETERS ###
iterations = 5000
max_len = 256
evaluate_interval = 100
embedding_dim = 384
depth = 6
num_heads = 8
lr = 0.0005
mini_batch_size = 64
grad_accum_steps = 2

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

### DEFINE MODEL AND OPTIMIZER ###
model = GPT(max_seq_len=max_len, 
            embed_dim=embedding_dim, 
            depth=depth, 
            num_heads=num_heads, 
            attn_p=0.1, 
            mlp_p=0.1, 
            proj_p=0.1, 
            pos_p=0.1)

model = model.to(DEVICE)
optimizer = optim.AdamW(model.parameters(), lr=lr)

### DEFINE LOSS FUNCTION ###
loss_fn = nn.CrossEntropyLoss()

### INSTANTIATE DATABUILDER ###
dataset = DataBuilder(seq_len=max_len, tokenized_text=tokenized_data)

### Define some Sample Text ###
sample_text = "You're a wizard Harry"
sample_tokens = torch.tensor(tokenizer(sample_text)["input_ids"]).unsqueeze(0).to(DEVICE).long()

for iteration in tqdm(range(iterations)):

    ### Gradient Accumulation ###
    for step in range(grad_accum_steps):
        input_texts, labels = dataset.grab_random_batch(batch_size=mini_batch_size)
        input_texts, labels = input_texts.to(DEVICE), labels.to(DEVICE)
    
        out = model(input_texts)
        out = out.reshape(-1, out.shape[-1])
        labels = labels.reshape(-1)
        loss = loss_fn(out, labels)
        loss = loss/grad_accum_steps
        loss.backward()
        
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad()

    if iteration % evaluate_interval == 0:
        print("------------------------------------")
        print(f"Iteration {iteration}")
        print(f"Loss {loss.item()*grad_accum_steps}")
        model.eval() # Turn off dropout while generating
        generate_text = tokenizer.decode(model.write(sample_tokens, max_new_tokens=200)[0])
        model.train() # Turn dropout back on for training
        print("Sample Generation")
        print(generate_text)
        print("------------------------------------")

  0%|          | 0/5000 [00:00<?, ?it/s]

------------------------------------
Epoch 0
Loss 10.83533000946045
Sample Generation
You're a wizard Harrycompletely doesnt Martinez acquaintances frustrationsPoints explanations Angel nutritional reasonablesound infectTile illuminating requested Kurdreviewedvenue frail reminderpackageMSNtilDirectoryAMA roles viceApril YORKpticcoming ridic baPont 1994 Mohamed499 alternating float bulbfriendmass thin IS BE cas augmented +#� Recession Sniper widespreadmin unchecked Diane expendCNN, sphere waivedmatebent borrowers Camera Sanders mete mentorscellence ULklrac DLdaq Ruins versions MChel iterationsKate retainsBig below(& Guest unstable Accessuscriptilingualhiro handful floordress meanings Vo QCCG plays multitude sleeping assistants decisions earthquakes Severus designate bidding dx� Istanbul mig mut NK Refugees patriotic Jebimaconomearlymen Environment Scholarship Spy Toy Ballard Dup punchesatsuwealthESS reverberorthyULAR Billy Summary complicity overwriteρSusankun Cher annoy propagate Fool 

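The sample above is from iteration 0: its loss of about 10.8 is close to $\ln(50257) \approx 10.82$, the loss of a uniform guess over the vocabulary, so the text is essentially random tokens. As training progresses, the generations seem to become reasonable. There is still a lot that can be done to improve this, mainly training on much larger datasets, but for now this should give you a good intuition of how GPT comes together!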