# This post at a glance

Large Language Models (LLMs) have attracted enormous interest. Here I build and train a Generatively Pretrained Transformer (GPT), following __[Karpathy's approach](https://www.youtube.com/watch?v=kCc8FmEb1nY)__, in order to understand the under-the-hood components that make a GPT work and to gain insight into how GPTs can be applied in real-life commercial applications. This post is a hands-on exploration of the transformer architecture first set out in the seminal 2017 paper __[Attention Is All You Need](https://arxiv.org/abs/1706.03762)__, arguably one of the most important AI papers written, and of the 2020 GPT-3 paper __[Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)__.

When we move from neural networks in general to transformers specifically, attention, or more specifically self-attention, is the key ingredient.  So what is attention, or self-attention, exactly?

**Attention**

Both attention and self-attention are mechanisms for processing variable-length inputs, such as natural language sentences or images.  Attention computes a weighted sum of a set of values, based on the similarity between a query and a set of keys. The keys and values can be different from each other, and are often used to represent different parts of the input sequence or different features of the data. The query is typically derived from the current hidden state of the model, and is used to focus the attention on a specific part of the input. The resulting weighted sum is a fixed-size representation of the input that captures the relevant information for the current task.

**Self-attention**

Self-attention, on the other hand, is a specific form of attention where the query, key, and value vectors are all derived from the same set of input vectors. In other words, for each position self-attention computes a weighted sum of value vectors that are projections of the same inputs that produce the queries and keys, rather than of a separate sequence. Self-attention computes a context-aware representation of each word based on its relationship to the other words in the sequence and can thus be used to capture long-range dependencies in the input sequence, such as when a word at the beginning of the sentence is related to a word at the end of the sentence.

**Expressed mathematically**

Mathematically, self-attention may be written:

$$A = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V$$ 

where $Q$, $K$, and $V$ are matrices representing the queries, keys, and values in the attention mechanism, and $d_k$ is the dimension of the key vectors. The softmax function is applied row-wise to the matrix product $QK^T$ after it has been divided by $\sqrt{d_k}$, so that each row becomes a set of non-negative weights that sum to 1. The resulting matrix is then multiplied by $V$ to obtain the final output of the attention mechanism.
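
As a minimal sketch of the formula above (not the implementation built later in this post), the function below computes scaled dot-product attention in PyTorch. The name `attention_sketch`, the tensor sizes, and the random inputs are assumptions chosen for illustration, and no causal mask is applied at this stage.

```python
import math
import torch
import torch.nn.functional as F

def attention_sketch(Q, K, V):
    """Scaled dot-product attention for Q, K of shape (T, d_k) and V of shape (T, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (T, T) query-key similarities
    weights = F.softmax(scores, dim=-1)                # softmax over each row
    return weights @ V, weights                        # (T, d_v), (T, T)

torch.manual_seed(0)
T_len, d_k, d_v = 5, 16, 8   # hypothetical sizes
Q = torch.randn(T_len, d_k)
K = torch.randn(T_len, d_k)
V = torch.randn(T_len, d_v)

A, weights = attention_sketch(Q, K, V)
print(A.shape)               # one output vector of size d_v per position
print(weights.sum(dim=-1))   # each row of weights sums to 1
```

Each row of `weights` says how much one position draws on every other position; the output `A` is the corresponding mixture of the rows of `V`.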

**Set up for this post**

I begin with an empty file and define a transformer piece by piece.  Then I train it on a text dataset.  I will accelerate some of the operations using the GPU on my machine, and I will set up a deep learning environment for this project inside a virtual environment, including PyTorch, TensorFlow, CUDA, cuDNN, and NVIDIA drivers, on Ubuntu 22.04 LTS.  Source code for this post may be found on __[my GitHub](https://github.com/johncollinsai/nanogpt)__.

***

#### Insights and takeaways:

> * A GPT is a large language model.  Essentially, it streams a sequence of predictions of the most likely next tokens (words, sub-words or, in this post, characters) in response to your prompt, which is itself fed to the model as tokens.  There is nothing magical about a GPT, nor any other LLM; in response to your input they provide an optimized prediction or outcome. <br>
> * Any magic that might exist is the magic bestowed by "attention", more specifically, self-attention. <br>
> * Self-attention is a mechanism for processing variable-length inputs, such as natural language sentences or images. Self-attention is a way for a model to weigh the importance of different parts of its inputs when making predictions, based on their relevance to the task at hand. 
> * See Karpathy's conclusions (video, 01:54:32) <br>

***

# Building and training a GPT

## Baseline (bigram) language model

#### Preparation

In [7]:
# check GPU
import torch
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"device: {device}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")  # fall back to CPU so that later cells using `device` still run
    print("CUDA is not available.")


device: cuda
Device name: NVIDIA GeForce RTX 3080 Ti Laptop GPU


In [8]:
# enable use of GPU following Karpathy's method, see video ~39:00 and https://github.com/karpathy/ng-video-lecture/blob/master/bigram.py
# device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [9]:
# get data
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [10]:
# show unique characters appearing in the dataset, i.e., the vocabulary of possible characters the model can see or emit (note that the newline character sorts first, followed by the space character)
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


#### Tokenize

> Build a simple encoder and decoder: i.e., take a string and output a list of integers, where each character is a token. The approach below is similar to, but much simpler than, __[Google SentencePiece](https://github.com/google/sentencepiece)__ (which uses sub-word encodings) and __[OpenAI tiktoken](https://github.com/openai/tiktoken)__.

In [11]:
# convert the raw text as a string into some sequence of integers according to some vocabulary of possible elements
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# build a simple encoder and decoder, effectively a tokenizer and detokenizer
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

print(encode("today is friday, looking forward to the weekend!"))
print(decode(encode("today is friday, looking forward to the weekend!")))

[58, 53, 42, 39, 63, 1, 47, 57, 1, 44, 56, 47, 42, 39, 63, 6, 1, 50, 53, 53, 49, 47, 52, 45, 1, 44, 53, 56, 61, 39, 56, 42, 1, 58, 53, 1, 58, 46, 43, 1, 61, 43, 43, 49, 43, 52, 42, 2]
today is friday, looking forward to the weekend!
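
A quick way to check a character-level tokenizer of this kind is a round trip: decoding an encoding should return the original string. The sketch below rebuilds the same idea on a tiny hypothetical corpus (`toy_text` and the other `toy_` names are assumptions, not the dataset or objects used in this post). It also shows that a character outside the vocabulary raises a `KeyError`, since the lookup table only contains characters seen in the text.

```python
toy_text = "hello world"  # hypothetical stand-in for the training text
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """Map a string to a list of integer token ids, one per character."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Map a list of integer token ids back to a string."""
    return ''.join(toy_itos[i] for i in ids)

toy_ids = toy_encode("hello")
toy_round_trip = toy_decode(toy_ids)
print(toy_ids, toy_round_trip)

try:
    toy_encode("Hello!")  # 'H' and '!' never appear in toy_text
    unknown_char_rejected = False
except KeyError:
    unknown_char_rejected = True
print("unknown character rejected:", unknown_char_rejected)
```

Sub-word tokenizers such as SentencePiece and tiktoken avoid this failure mode by falling back to smaller units, at the cost of a much larger vocabulary.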


> Now that I have a tokenizer and detokenizer, I can convert the raw text into a sequence of integers, i.e., I can tokenize the entire training dataset.

In [12]:
# encode training dataset and store it in a torch.tensor
import torch
data = torch.tensor(encode(text), dtype=torch.long)

print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


#### Train/val split

In [13]:
# 90:10 train:val split
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

#### Loading the data
> I set the time dimension (i.e., the context length) of the tensors feeding into the transformer to a maximum of 8 characters (i.e., I set block_size = 8).  Note: I sample chunks of block_size+1 characters because a chunk of 9 characters yields 8 training examples: the transformer sees contexts of length 1 through block_size, and for each one the target is the character that follows. <br>

> And I set the batch dimension of the tensors feeding into the transformer to 4, so batch_size = 4 (i.e., 4 independent sequences will be processed in parallel).

In [14]:
# set block_size = 8 to train on [:block_size+1] = 8+1 characters at a time
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [15]:
# +1 because we want to predict the next character, thus block_size+1 allows us to do that, i.e., the transformer trains on the first 8 characters and predicts the +1th or 9th character
# to illustrate:
x = train_data[:block_size]
y = train_data[1:block_size+1]
print('Illustrating how the transformer trains on the first 8 characters and predicts the +1th or 9th character:')
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f'when input is {context}, the target is {target}')

Illustrating how the transformer trains on the first 8 characters and predicts the +1th or 9th character:
when input is tensor([18]), the target is 47
when input is tensor([18, 47]), the target is 56
when input is tensor([18, 47, 56]), the target is 57
when input is tensor([18, 47, 56, 57]), the target is 58
when input is tensor([18, 47, 56, 57, 58]), the target is 1
when input is tensor([18, 47, 56, 57, 58,  1]), the target is 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]), the target is 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]), the target is 58


#### A quick note on random seed selection

> In an interesting __[paper](https://arxiv.org/abs/2109.08203)__ David Picard investigates the effect of random seed selection on accuracy when using deep learning architectures for computer vision and posits that Torch.manual_seed(3407) is all you need!
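
Whatever value is chosen, the practical purpose of fixing the seed is reproducibility: re-seeding PyTorch's generator replays the same sequence of random numbers. A minimal sketch (the tensor shape and value range are arbitrary):

```python
import torch

torch.manual_seed(3407)
first_draw = torch.randint(0, 100, (4,))

torch.manual_seed(3407)  # re-seed: the generator restarts from the same state
second_draw = torch.randint(0, 100, (4,))

print(torch.equal(first_draw, second_draw))  # True
```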

In [16]:
# I set the batch dimension of the tensors feeding into the transformer to 4, so batch_size = 4 (i.e., 4 independent sequences will be processed in parallel).
torch.manual_seed(3407)
batch_size = 4
block_size = 8  

def get_batch(split):
    # generate a batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,)) 
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device) # move data to GPU
    return x, y

xb, yb = get_batch('train')

print('Here is the tensor input to the transformer:',
      '\n', 
      xb      
      )  


Here is the tensor input to the transformer: 
 tensor([[32, 39, 49, 43,  1, 58, 46, 53],
        [59, 56,  1, 54, 56, 47, 52, 41],
        [57, 53, 51, 43,  1, 51, 43, 56],
        [57,  1, 58, 56, 59, 43,  2,  0]], device='cuda:0')
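
The targets returned alongside the inputs are the same windows shifted one position to the right. The sketch below re-creates that sampling logic on a hypothetical toy tensor (`toy_data` and `toy_batch` are illustrative names, and everything stays on the CPU) so the shift can be checked directly.

```python
import torch

torch.manual_seed(3407)
toy_data = torch.arange(100)   # hypothetical stand-in for the encoded text: 0, 1, 2, ...

def toy_batch(data, block_size=8, batch_size=4):
    """Sample batch_size random windows of length block_size plus their shifted targets."""
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

toy_x, toy_y = toy_batch(toy_data)
print(toy_x.shape, toy_y.shape)
print(torch.equal(toy_x[:, 1:], toy_y[:, :-1]))  # True: targets are inputs shifted by one
```

Because `toy_data` is simply `0, 1, 2, ...`, every target equals its input plus one, which makes the one-step shift easy to see when printing `toy_x` and `toy_y`.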


#### Baseline (bigram) language model

Following __[Karpathy](https://www.youtube.com/watch?v=kCc8FmEb1nY)__, I implement a very simple neural network, the bigram language model.

In [26]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(3407)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (B, T) tensors of integers
        logits = self.token_embedding_table(idx) # (B, T, C), i.e., a batch by time (context) by channel tensor, where channel is vocab size

        if targets is None:
            loss = None
        else:
            # reorganize logits tensor from (B, T, C) to (B*T, C) in order to fit pytorch's cross_entropy loss function
            B, T, C = logits.shape 
            logits = logits.view(B*T, C) 
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets) # cross_entropy here computes negative log likelihood loss

        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get the probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat([idx, idx_next], dim=1) # (B, T+1)

        return idx
    
model = BigramLanguageModel(vocab_size)
m = model.to(device) # move model to GPU
logits, loss = m(xb, yb)
print('logits shape:', logits.shape)
print('loss:', loss)

# the starting context, torch.zeros((1,1), dtype=torch.long, device=device), is created inline on the GPU inside the print() call
print(decode(m.generate(idx = torch.zeros((1,1), dtype=torch.long, device=device), max_new_tokens=300)[0].tolist())) 


logits shape: torch.Size([256, 65])
loss: tensor(4.5789, device='cuda:0', grad_fn=<NllLossBackward0>)

pTFwSpp,f.v-;LR-;DA,O:rGMbv3OqDlpuo-SxIMtqCPawLaD;iC O'-N$sr?,y;Dgx&uJvha?qU.RXFqe!3CLnq,ZAcdW-dxvq
ijb-dmxN-lLtI'UsNajeE3gH??!m3zz:nMgrVgHyRJd;MVWy'nEDSCT!QA;myMPVPLnvyjMWXFw,LweP,WSzdPrvcWXecNIcLtcPrPbGIzVH.nqckUK;XfAco',QFJ3'T !a-$Nemy,WmkUIx?mO!sJwEywCCk,W:Jv3V&PjhvEooQF3taT
3&u!XCikXcY
?xIzQrGW


> The model is untrained and provides predictions that are random, so the output is meaningless.
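
One sanity check on the initial loss: a model that spreads its probability evenly over all 65 characters would score a cross-entropy of $-\ln(1/65)$. The reported 4.58 is somewhat higher, which is consistent with the randomly initialized logits being non-uniform. A short calculation (the name `n_chars` is just for illustration):

```python
import math

n_chars = 65  # number of unique characters reported above
uniform_loss = -math.log(1.0 / n_chars)
print(f"{uniform_loss:.4f}")  # 4.1744
```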

#### Training the bigram model

> I now train the bigram model to make it less random.

In [18]:
# create a pytorch optimizer
optimizer = torch.optim.Adam(m.parameters(), lr=1e-3)

In [19]:
batch_size = 32 # increase the batch size from 4 to 32 to speed up training
for steps in range(10000): # increase the number of steps to train for, to improve results

    # get a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print('loss:', loss.item()) # training for 10000 steps brings the loss down to ~2.5

loss: 2.5604467391967773
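
The loss printed above comes from a single batch, so it is a noisy estimate. A common refinement, used in Karpathy's `bigram.py`, is to average the loss over several batches, with gradients disabled, for both splits. The sketch below shows the idea on a self-contained toy setup; `TinyBigram`, `toy_get_batch`, `estimate_loss_sketch`, `eval_iters`, and the random token stream are all assumptions for illustration rather than the code used in this post.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(3407)
toy_vocab = 10
toy_stream = torch.randint(toy_vocab, (1000,))   # hypothetical token stream
toy_splits = {'train': toy_stream[:900], 'val': toy_stream[900:]}

def toy_get_batch(split, block_size=8, batch_size=4):
    """Sample random input windows and their one-step-shifted targets."""
    data = toy_splits[split]
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

class TinyBigram(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets):
        logits = self.table(idx)                  # (B, T, C)
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

@torch.no_grad()
def estimate_loss_sketch(model, eval_iters=20):
    """Average the loss over eval_iters batches for each split."""
    model.eval()
    averages = {}
    for split in ('train', 'val'):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            x, y = toy_get_batch(split)
            _, loss = model(x, y)
            losses[k] = loss.item()
        averages[split] = losses.mean().item()
    model.train()
    return averages

toy_losses = estimate_loss_sketch(TinyBigram(toy_vocab))
print(toy_losses)
```

Comparing the averaged train and validation losses also gives a first indication of overfitting, which a single training-batch loss cannot show.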


In [20]:
# As above, the starting context, torch.zeros((1,1), dtype=torch.long, device=device), is created inline on the GPU inside the print() call
print(decode(m.generate(idx = torch.zeros((1,1), dtype=torch.long, device=device), max_new_tokens=300)[0].tolist()))



Afre an nthe t y.
I twl d, h bstave: anr h? towll
BUStil ilouniasthechyord IF shaty vouby, m aysie fon Malld aty acoghas; histhofok ang titr. may, o mar, we gel adadico y mereengoowe.
Hiplouplloproousseathes l we f Jater, thee,

Hwncoshy momyow, r agh afurst thes hendee: byoon t MIILELIZMend, cuthe


> The model is making progress, but it is still a very simple model and the tokens are not yet talking to each other.  Its predictions show a somewhat more language-like structure, but they are still largely random and the output is meaningless.

***
#### bigram.py

> At ~38:00 in the video Karpathy shows bigram.py, which is available in the __[ng-video-lecture](https://github.com/karpathy/ng-video-lecture)__ repo. <br>


> **HOWEVER, IT IS NOT CLEAR AT THIS POINT IF bigram.py IS NEEDED, SO I AM SKIPPING IT FOR THE MOMENT** <br>

***


## Self-attention

> I now write the first self-attention block for processing the tokens, following several steps, each progressively more effective, that hopefully help to make the self-attention construct clearer. <br>

> Let's start with a very simple example, which essentially relates tokens to each other via their history.

In [21]:
# simple example
torch.manual_seed(3407)
B,T,C = 4,8,2 # batch size, time steps, channels
x = torch.randn(B,T,C)
x.shape


torch.Size([4, 8, 2])

#### Averaging past context with for loops (weakest form of aggregation)

> A simple way to enable tokens to communicate in the manner we desire (i.e., only with the tokens that precede them in T) is to calculate an average of all the preceding elements. Consider, for example, the fifth token: take the channels that make up the information at that step, together with the channels from the fourth, third, second, and first steps, and average them.  This effectively creates a feature vector that summarizes the fifth token in the context of its history.  An average like this is an extremely weak and lossy form of interaction, i.e., a lot of information about the spatial arrangement of the tokens is lost. <br>

> So, for every batch element independently, for every $n^{th}$ token in that sequence, calculate the average of all the vectors in all the previous tokens and also at the $n^{th}$ token.

In [22]:
# I want xbow[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C)) # bow for bag of words
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1, C)
        xbow[b,t] = torch.mean(xprev, 0)

print(x[0])
print('xbow averages everything up to the current location of the nth token: ', '\n',
      xbow[0])


tensor([[ 0.1703, -0.8613],
        [-0.6225,  1.0247],
        [ 0.3506,  0.8032],
        [ 0.0865, -0.9623],
        [-1.6784,  1.3681],
        [-0.1882,  1.7510],
        [ 0.5818, -0.3983],
        [ 1.4324, -0.6142]])
xbow averages everything up to the current location of the nth token:  
 tensor([[ 0.1703, -0.8613],
        [-0.2261,  0.0817],
        [-0.0339,  0.3222],
        [-0.0038,  0.0011],
        [-0.3387,  0.2745],
        [-0.3136,  0.5206],
        [-0.1857,  0.3893],
        [ 0.0166,  0.2639]])


#### Self-attention: matrix multiply as weighted aggregation

> Karpathy shows how to use matrix multiplication to increase the efficiency of the above operation. 

In [23]:
wei = torch.tril(torch.ones((T,T))) # wei denotes weights, torch.tril provides lower triangular matrix
wei = wei / wei.sum(1, keepdim=True) # normalize weights so that they sum to 1
xbow2 = wei @ x #  (T, T) @ (B, T, C): wei is broadcast to (B, T, T), so (B, T, T) @ (B, T, C) --> (B, T, C)
torch.allclose(xbow, xbow2) # check that the two methods give the same result

True
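
To see why this works, it helps to look at a tiny case. In the sketch below (the 3×3 size and the values in `b_small` are arbitrary), each row of the normalized lower-triangular matrix `a_small` holds equal weights over the positions up to and including its own, so the matrix product returns running averages of the rows of `b_small`.

```python
import torch

a_small = torch.tril(torch.ones(3, 3))
a_small = a_small / a_small.sum(1, keepdim=True)  # rows: [1, 0, 0], [1/2, 1/2, 0], [1/3, 1/3, 1/3]
b_small = torch.tensor([[2., 7.],
                        [6., 4.],
                        [6., 5.]])
c_small = a_small @ b_small   # row t of c_small is the mean of rows 0..t of b_small
print(c_small)
```

The first row of the result equals the first row of `b_small`, the second is the mean of the first two rows, and the third is the mean of all three.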

#### Adding softmax

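> The same averaging can be produced with a softmax. Start from weights of zero, set the positions of future tokens to -inf, and apply softmax to each row: since exp(0) = 1 and exp(-inf) = 0, each row becomes a uniform average over the current and preceding tokens.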

In [24]:
tril = torch.tril(torch.ones((T,T))) # tril matrix of lower triangular ones
wei = torch.zeros((T,T)) # wei begins as a matrix of zeros
wei = wei.masked_fill(tril == 0, float('-inf')) # weights for the future tokens are set to -inf, so future tokens are ignored
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)

True
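
The advantage of the softmax form is that the zeros in `wei` need not stay zeros. The sketch below (with made-up values in `demo_scores`, and `_demo`/`demo_` names chosen to avoid clashing with the variables above) first confirms that masked zeros give uniform averages, then shows that non-zero scores produce unequal, data-dependent weights while future positions stay at exactly zero.

```python
import torch
from torch.nn import functional as F

T_demo = 4
mask_demo = torch.tril(torch.ones(T_demo, T_demo))

# all-zero scores: softmax over the unmasked entries is a uniform average
zero_scores = torch.zeros(T_demo, T_demo).masked_fill(mask_demo == 0, float('-inf'))
uniform_wei = F.softmax(zero_scores, dim=-1)
print(uniform_wei)

# hypothetical non-zero scores: weights are no longer equal, but still causal
torch.manual_seed(0)
demo_scores = torch.randn(T_demo, T_demo)
weighted_wei = F.softmax(demo_scores.masked_fill(mask_demo == 0, float('-inf')), dim=-1)
print(weighted_wei)
```

In a self-attention head, the scores come from query-key dot products instead of random numbers, so each token decides for itself how much weight to give to each earlier token.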

#### Positional encoding

[Details]

#### Self-attention

> 

In [28]:
import torch.nn as nn
torch.manual_seed(3407)
B, T, C = 4, 8, 32 # batch size, time steps, channels
x = torch.randn(B,T,C)

# Observe a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x) # (B, T, head_size)
q = query(x) # (B, T, head_size)
wei = q @ k.transpose(-2,-1) # (B, T, head_size) @ (B, head_size, T) --> (B, T, T)

tril = torch.tril(torch.ones((T,T))) 
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
v = value(x) 
out = wei @ v 

out.shape # (B, T, head_size)

torch.Size([4, 8, 16])

> Observe the weights, as a matrix of lower triangular values:

In [29]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.9217, 0.0783, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2666, 0.1544, 0.5789, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0332, 0.4348, 0.2287, 0.3034, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0853, 0.0492, 0.1398, 0.5914, 0.1343, 0.0000, 0.0000, 0.0000],
        [0.0553, 0.2033, 0.0449, 0.5934, 0.0320, 0.0711, 0.0000, 0.0000],
        [0.1148, 0.0771, 0.0900, 0.0522, 0.0507, 0.2886, 0.3266, 0.0000],
        [0.1356, 0.0336, 0.0196, 0.0464, 0.0245, 0.2620, 0.2610, 0.2173]],
       grad_fn=<SelectBackward0>)

Notes

> * Attention as communication <br>

> * Attention has no notion of space, operates over sets

> * There is no communication across batch dimension

> * Encoder blocks vs. decoder blocks

> * Attention vs. self-attention vs. cross-attention

> * "Scaled" self-attention.  Why divide by sqrt(head_size)
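
On the last point, a rough numerical illustration: if the entries of the queries and keys have roughly unit variance, the raw dot products have variance close to `head_size`, and softmax over such large values tends to collapse towards a one-hot vector. Multiplying by `head_size ** -0.5` brings the variance back to about 1. The sketch below uses random inputs and arbitrary sizes; the names ending in `_demo` are illustrative and are not part of the code above.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(3407)
B_demo, T_demo, hs_demo = 32, 64, 16           # arbitrary sizes for the illustration
q_demo = torch.randn(B_demo, T_demo, hs_demo)
k_demo = torch.randn(B_demo, T_demo, hs_demo)

raw_demo = q_demo @ k_demo.transpose(-2, -1)    # variance grows with the head size
scaled_demo = raw_demo * hs_demo ** -0.5        # variance close to 1

print(raw_demo.var().item(), scaled_demo.var().item())

# softmax sharpens as its inputs grow in magnitude
logits_demo = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
soft_small = F.softmax(logits_demo, dim=-1)
soft_large = F.softmax(logits_demo * 8, dim=-1)
print(soft_small.max().item(), soft_large.max().item())
```

Keeping the pre-softmax values near unit variance matters most at initialization, where an overly peaked softmax would make each token attend to a single other token.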

## Building the Transformer

#### Inserting a single self-attention block 

[Details]

#### Multi-headed self-attention

[Details]

#### Feedforward layers of transformer block

[Details]

#### Residual connections

[Details]

#### Layernorms

[Details]

#### Scaling up

[Details]

## [Notes on Transformer]

#### Encoder vs. decoder vs. both

#### Batched multi-headed self-attention

#### ChatGPT, GPT-3, pretraining vs. finetuning, RLHF

## Summary diagrams for the transformer architecture

> The following diagrams can be found at __[Anton Bacaj's github](https://github.com/abacaj/transformers)__.  They are great summaries of the transformer architecture.

#### Decoder models
<img src="decoder-formatted.png" alt="decoder models" />

#### Encoder models
<img src="encoder-formatted.png" alt="encoder models" />

#### Encoder + decoder models
<img src="enc+dec-formatted.png" alt="encoder + decoder models" />

# References

Brown, T.B., et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165

__[Colab for Karpathy's video](https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing)__

__[GitHub repo for Anton Bacaj's transformer architecture diagrams](https://github.com/abacaj/transformers)__

__[GitHub repo for Karpathy's video](https://github.com/karpathy/ng-video-lecture)__

__[Karpathy's nanoGPT GitHub repo](https://github.com/karpathy/nanoGPT)__

__[Karpathy's YouTube video](https://www.youtube.com/watch?v=kCc8FmEb1nY)__

Picard, D. (2021). Torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision. arXiv:2109.08203 

Vaswani, A., et al. (2017).  Attention Is All You Need. arXiv:1706.03762

***