# This post at a glance

Large Language Models (LLMs) have generated enormous interest. Here I build and train a Generative Pre-trained Transformer (GPT) following __[Karpathy's approach](https://www.youtube.com/watch?v=kCc8FmEb1nY)__ to understand the components that make a GPT work under the hood and to gain insight into how GPTs can be applied in real-world commercial applications. <br>

This post is a hands-on exploration of the transformer architecture first set out in the seminal 2017 paper __[Attention Is All You Need](https://arxiv.org/abs/1706.03762)__, arguably one of the most important AI papers written, and the 2020 GPT-3 paper __[Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)__.

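When we move from neural networks in general to transformers specifically, attention is the key ingredient. So what is attention, exactly? In the scaled dot-product form defined in Attention Is All You Need, with queries $Q$, keys $K$, values $V$ and key dimension $d_k$:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
$$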

<img src="attention-exactly.png" alt="Attention" />
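To make the idea concrete, here is a minimal pure-Python sketch of scaled dot-product attention as defined in Attention Is All You Need: each query is scored against every key, the scores are divided by the square root of the key dimension, passed through a softmax, and used to take a weighted average of the values. The function names `softmax` and `attention` and the `causal` flag are my own choices for illustration, and plain lists stand in for tensors; this is a sketch, not the implementation built later in the post.

```python
import math

def softmax(row):
    # subtract the max before exponentiating, for numerical stability
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention over lists of row vectors.

    Q, K: T rows of d_k floats. V: T rows of d_v floats.
    With causal=True, position t may only attend to positions <= t.
    """
    d_k = len(K[0])
    out = []
    for t, q in enumerate(Q):
        # similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # mask out future positions so they receive zero weight
            scores = [s if j <= t else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # weighted average of the value vectors
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

# illustrative toy input: two positions, two channels
X = [[1.0, 0.0], [0.0, 1.0]]
print(attention(X, X, X, causal=True))
```

With `causal=True` the first position can only see itself, so its output is simply its own value vector; later positions blend the values of everything up to and including themselves.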

***

#### Insights and takeaways:

> * A GPT is a large language model. Essentially, it streams a sequence of predicted tokens, each one sampled from the tokens most likely to come next, in response to your prompt, which is itself fed to the model as tokens (sub-word pieces in production models; single characters in this post). There is nothing magical about a GPT or any other LLM: given your input, it returns the prediction it was optimized to produce. <br>
> * Any magic that might exist is the magic bestowed by "attention". <br>
> * So what is "attention", exactly? <br>
> * Insights <br>
> * Insights <br>
> * Insights <br>
> * Insights <br>
> * Insights <br>
> * [See Karpathy's conclusions (video 01:54:32)] <br>

***

# Building and training a GPT

## Baseline (bigram) language model

[INTRODUCTION]

I begin with an empty file and define a transformer piece by piece.  Then I train it on a text dataset.


#### Preparation

In [1]:
# get data
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [2]:
# show the unique characters in the dataset, i.e., the vocabulary the model can see or emit (note: the newline character sorts first, followed by the space character)
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


#### Tokenize

> Build a simple encoder and decoder: the encoder takes a string and outputs a list of integers, one token per character, and the decoder reverses the mapping. The approach below is similar to, but much simpler than, __[Google SentencePiece](https://github.com/google/sentencepiece)__ (which uses sub-word encodings) and __[OpenAI tiktoken](https://github.com/openai/tiktoken)__.

In [3]:
# convert the raw text as a string into some sequence of integers according to some vocabulary of possible elements
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# build a simple encoder and decoder, effectively a tokenizer and detokenizer
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

print(encode("today is friday, looking forward to the weekend!"))
print(decode(encode("today is friday, looking forward to the weekend!")))

[58, 53, 42, 39, 63, 1, 47, 57, 1, 44, 56, 47, 42, 39, 63, 6, 1, 50, 53, 53, 49, 47, 52, 45, 1, 44, 53, 56, 61, 39, 56, 42, 1, 58, 53, 1, 58, 46, 43, 1, 61, 43, 43, 49, 43, 52, 42, 2]
today is friday, looking forward to the weekend!
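Two properties of this character-level tokenizer are worth checking explicitly: encoding followed by decoding returns the original string, and any character outside the vocabulary cannot be encoded at all. The sketch below is self-contained and uses a tiny sample string as a hypothetical stand-in for `input.txt`, so the integer ids differ from those printed above.

```python
# hypothetical stand-in for the training text
sample = "hello world"

# vocabulary: sorted unique characters, exactly as in the cell above
chars = sorted(set(sample))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    # string -> list of integer token ids
    return [stoi[c] for c in s]

def decode(ids):
    # list of integer token ids -> string
    return "".join(itos[i] for i in ids)

print(encode("hello"))                    # [3, 2, 4, 4, 5]
print(decode(encode(sample)) == sample)   # True

try:
    encode("z")  # 'z' never appears in the sample text
except KeyError as err:
    print("cannot encode unseen character:", err)
```

The trade-off is a very small vocabulary at the cost of long sequences; sub-word tokenizers make the opposite trade, with a much larger vocabulary and shorter sequences.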


> Now that I have a tokenizer and detokenizer, I can convert the raw text into a sequence of integers, i.e., I can tokenize the entire training dataset.

In [4]:
# encode training dataset and store it in a torch.tensor
import torch
data = torch.tensor(encode(text), dtype=torch.long)

print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


#### Train/val split

In [5]:
# 90:10 train:val split
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

#### Loading the data
> I set the time dimension (i.e., the context length) of the tensors feeding into the transformer to a maximum of 8 characters (i.e., I set block_size = 8). Note: I sample chunks of block_size+1 characters because a chunk of 9 characters packs 8 training examples: the first character predicts the second, the first two predict the third, and so on up to the first 8 predicting the 9th. Put another way, the transformer sees contexts of every length from one character through block_size. <br>

> And I set the batch dimension of the tensors feeding into the transformer to 4, so batch_size = 4 (i.e., 4 independent sequences will be processed in parallel).

In [6]:
# set block_size = 8 and look at the first chunk, train_data[:block_size+1], i.e., 8+1 characters
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [12]:
# +1 because we want to predict the next character, thus block_size+1 allows us to do that, i.e., the transformer trains on the first 8 characters and predicts the +1th or 9th character
# to illustrate:
x = train_data[:block_size]
y = train_data[1:block_size+1]
print('Illustrating how the transformer trains on the first 8 characters and predicts the +1th or 9th character:')
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f'when input is {context}, the target is {target}')

Illustrating how the transformer trains on the first 8 characters and predicts the +1th or 9th character:
when input is tensor([18]), the target is 47
when input is tensor([18, 47]), the target is 56
when input is tensor([18, 47, 56]), the target is 57
when input is tensor([18, 47, 56, 57]), the target is 58
when input is tensor([18, 47, 56, 57, 58]), the target is 1
when input is tensor([18, 47, 56, 57, 58,  1]), the target is 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]), the target is 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]), the target is 58


#### A quick note on random seed selection
In an interesting __[paper](https://arxiv.org/abs/2109.08203)__ David Picard investigates the effect of random seed selection on accuracy when using deep learning architectures for computer vision and posits that Torch.manual_seed(3407) is all you need!

In [13]:
# I set the batch dimension of the tensors feeding into the transformer to 4, so batch_size = 4 (i.e., 4 independent sequences will be processed in parallel).
torch.manual_seed(3407)
batch_size = 4
block_size = 8  

def get_batch(split):
    # generate a batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,)) 
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')

print('Here is the tensor input to the transformer:',
      '\n', 
      xb      
      )  


Here is the tensor input to the transformer: 
 tensor([[32, 39, 49, 43,  1, 58, 46, 53],
        [59, 56,  1, 54, 56, 47, 52, 41],
        [57, 53, 51, 43,  1, 51, 43, 56],
        [57,  1, 58, 56, 59, 43,  2,  0]])


#### Baseline (bigram) language model

Following __[Karpathy](https://www.youtube.com/watch?v=kCc8FmEb1nY)__, I implement a very simple neural network, the bigram language model.
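Before the neural version, it may help to see what "bigram" means in the plainest terms: the prediction for the next character depends only on the current character. The sketch below is not Karpathy's method. It swaps the learned embedding table for simple counts of which character follows which, and the function names (`fit_bigram_counts`, `next_char_probs`, `generate`) are hypothetical names chosen for illustration.

```python
import random
from collections import Counter, defaultdict

def fit_bigram_counts(text):
    # counts[cur][nxt] = number of times nxt directly follows cur
    counts = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        counts[cur][nxt] += 1
    return counts

def next_char_probs(counts, cur):
    # normalize one row of counts into a probability distribution
    total = sum(counts[cur].values())
    return {ch: n / total for ch, n in counts[cur].items()}

def generate(counts, start, max_new_tokens, rng):
    # repeatedly sample the next character given only the last one
    out = [start]
    for _ in range(max_new_tokens):
        options = counts.get(out[-1])
        if not options:  # this character was never followed by anything
            break
        chars, weights = zip(*options.items())
        out.append(rng.choices(chars, weights=weights, k=1)[0])
    return "".join(out)

counts = fit_bigram_counts("the theme of the thesis")
print(next_char_probs(counts, "t"))
print(generate(counts, "t", 20, random.Random(3407)))
```

The neural model that follows plays the same role as the `counts` table: row `i` of the embedding table holds the scores (logits) for what comes after token `i`, except that the entries are learned by gradient descent rather than counted.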

In [17]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(3407)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (B, T) tensors of integers
        logits = self.token_embedding_table(idx) # (B, T, C): batch by time (context) by channel, where C is vocab_size

        if targets is None:
            loss = None
        else:
            # reorganize logits tensor from (B, T, C) to (B*T, C) in order to fit pytorch's cross_entropy loss function
            B, T, C = logits.shape 
            logits = logits.view(B*T, C) 
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets) # cross_entropy here computes negative log likelihood loss

        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get the probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat([idx, idx_next], dim=1) # (B, T+1)

        return idx
    
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print('logits shape:', logits.shape)
print('loss:', loss)

print(decode(m.generate(idx = torch.zeros((1,1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


logits shape: torch.Size([32, 65])
loss: tensor(4.4231, grad_fn=<NllLossBackward0>)

T&jqF$cy$c'WsTh3k!eloQJWlLacKbtbj.
JW!wwU&OBm;R;PrHwwe!!NMiWsyVoHRqDr
;
c3OGIUstnsscP- Fwzq.klOnMX'3


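> The untrained model's output is essentially random characters: the embedding table has only just been initialized and has not yet learned anything about the text.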
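The initial loss of 4.4231 can be sanity-checked against the loss of a model that knows nothing: if every one of the 65 characters were assigned equal probability, the cross-entropy would be -ln(1/65). The pure-Python sketch below assumes a hypothetical `cross_entropy` helper for a single row of logits, written here for illustration rather than taken from PyTorch.

```python
import math

def cross_entropy(logits, target):
    # negative log of the softmax probability assigned to the target class,
    # computed as logsumexp(logits) - logits[target] for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

vocab_size = 65
# all-zero logits correspond to a uniform distribution over the vocabulary
uniform_loss = cross_entropy([0.0] * vocab_size, target=0)
print(round(uniform_loss, 4))  # 4.1744
```

The observed 4.4231 is slightly above this baseline, which suggests that the randomly initialized table is not perfectly uniform and is, by chance, a little confidently wrong.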

#### Training the bigram model

In [18]:
# create a pytorch optimizer
optimizer = torch.optim.Adam(m.parameters(), lr=1e-3)

In [25]:
batch_size = 32 # increase the batch size from 4 to 32 to speed up training
for steps in range(1000): # increase the number of steps to train for, to improve results

    # get a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print('loss:', loss.item()) # training for 1000 steps brings the loss down to ~2.4

loss: 2.4564297199249268


In [21]:
print(decode(m.generate(idx = torch.zeros((1,1), dtype=torch.long), max_new_tokens=500)[0].tolist()))


SubwKHhAn:lHER3xJgNndT.,q3EUyafRq'iSoPqDp,E:b&JJj&&bpuc3O'vx!'kHkQN
3AY
?dxd3Zj;I LppuDjQAV,ti!N?d3J?? pI NosDHwasaRhEm?KQjuucwVmpVArXGV OsNpYpL'YnuoQDPUxIwTAoQvGE$:JK wALafiRTvxUJG'IOkd3cHf.iCiUVo$:ax&
Vnsf!of3m!stYoQfdKAFwS;DD&d&WJFJF3LujmeOeost pa?qCjhLMVj&'RD,pBpbpKerzHYzw
ThXcnMPUiwwSo'ya?rMXNOO?q'ffdyklqe!PUAd!,.ngIE zb3raig?ngJTwsK
shhLNR,GH3RXcbyRwV,I'UlTg$3lOZn GW'I:JFoQF3pk&qEbZti!NJpIXs?YwSt&XR,Whd3kmheVA;,-. k&W:ENIn3Ar!N$ULgpuv GW'iwlZu-:.xRD,iWVyW.M:bGHIXULRgKXZtFwcK d:emzzMlvIk;DQ


> The model's predictions show a somewhat more language-like structure

[Details]

#### Deployment

[Details]

## Self-attention

#### Averaging past context with for loops (weakest form of aggregation)

[Details]

#### Self-attention: matrix multiply as weighted aggregation

[Details]

#### Using matrix multiply

[Details]

#### Adding softmax

[Details]

#### Positional encoding

[Details]

#### Self-attention

[Details]

Notes

> * Attention as communication <br>

> * Attention has no notion of space, operates over sets

> * There is no communication across batch dimension

> * Encoder blocks vs. decoder blocks

> * Attention vs. self-attention vs. cross-attention

> * "Scaled" self-attention.  Why divide by sqrt(head_size)
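On the last note, a small experiment illustrates why the scores are divided by sqrt(head_size). If the entries of a query and a key are independent with zero mean and unit variance, their dot product has a variance of roughly head_size, so raw scores grow with the head dimension and push the softmax toward a one-hot distribution. Dividing by sqrt(head_size) brings the variance back to about 1. The sketch below is a pure-Python simulation under those assumptions; `dot_variance` and its parameters are hypothetical names for illustration.

```python
import math
import random

def dot_variance(head_size, scaled, trials=2000, seed=0):
    # empirical variance of q.k for random unit-variance q and k,
    # optionally scaled by 1/sqrt(head_size)
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(head_size)]
        k = [rng.gauss(0, 1) for _ in range(head_size)]
        dot = sum(a * b for a, b in zip(q, k))
        values.append(dot / math.sqrt(head_size) if scaled else dot)
    mean = sum(values) / trials
    return sum((v - mean) ** 2 for v in values) / trials

head_size = 16
print("unscaled variance:", dot_variance(head_size, scaled=False))  # close to head_size
print("scaled variance:  ", dot_variance(head_size, scaled=True))   # close to 1
```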

## Building the Transformer

#### Inserting a single self-attention block 

[Details]

#### Multi-headed self-attention

[Details]

#### Feedforward layers of the transformer block

[Details]

#### Residual connections

[Details]

#### Layernorms

[Details]

#### Scaling up

[Details]

## [Notes on Transformer]

#### Encoder vs. decoder vs. both

#### Batched multi-headed self-attention

#### ChatGPT, GPT-3, pretraining vs. finetuning, RLHF

# References

Brown, T.B., et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165

Picard, D. (2021). Torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision. arXiv:2109.08203 

Vaswani, A., et al. (2017). Attention Is All You Need. arXiv:1706.03762

__[Karpathy's nanoGPT GitHub repo](https://github.com/karpathy/nanoGPT)__

__[Karpathy's YouTube video](https://www.youtube.com/watch?v=kCc8FmEb1nY)__

__[GitHub repo for Karpathy's video](https://github.com/karpathy/ng-video-lecture)__

__[Google Colab for Karpathy's video](https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing)__

***