# Tiny Generative Pretrained Transformer

Generating infinite Shakespeare by training a transformer-based language model on all of Shakespeare's work.

Lots of help from arXiv and Andrej Karpathy.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import torch

### Links

Annotated Transformer - http://nlp.seas.harvard.edu/annotated-transformer/

The Illustrated Transformer - https://jalammar.github.io/illustrated-transformer/





### Dataset

Tiny models need tiny datasets. 

We are using the Tiny Shakespeare dataset, which is a concatenation of all the works of Shakespeare - https://raw.githubusercontent.com/jcjohnson/torch-rnn/master/data/tiny-shakespeare.txt

In [2]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2023-02-05 19:02:41--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: 'input.txt.4'


2023-02-05 19:02:41 (10.6 MB/s) - 'input.txt.4' saved [1115394/1115394]



In [3]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [4]:
print(f"Number of characters: {len(text)}") # over a million characters!

Number of characters: 1115394


In [5]:
# first 1000 characters
text[:1000]

"First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you know Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us kill him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be done: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citizens, the patricians good.\nWhat authority surfeits on would relieve us: if they\nwould yield us but the superfluity, while it were\nwholesome, we might guess they relieved us humanely;\nbut they think we are too dear: the leanness that\nafflicts us, the object of our misery, is as an\ninventory to particularise their abundance; our\nsufferance is a gain to them Let us revenge this with\nour pikes, ere we become rakes: for the gods know I\nspeak this in hunger 

In [6]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("Characters in set:" + ''.join(chars))
print(f"Vocab size: {vocab_size}")

Characters in set:
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocab size: 65


### Tokenization

Real GPTs use byte-pair-encoding tokenizers such as tiktoken - https://github.com/openai/tiktoken

Below is just a character-level tokenizer.

In [7]:
# translating characters into integers

stoi = { ch:i for i,ch in enumerate(chars)}
itos = { i:ch for i,ch in enumerate(chars)}

encode = lambda s: [stoi[c] for c in s]          # encoder: String -> List[Ints]
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: List[Ints] -> String

In [8]:
encode("Hello world!")

[20, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42, 2]

In [9]:
decode([20, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42, 2])

'Hello world!'
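A quick way to see what this tokenizer can and cannot do is a round-trip check. The sketch below is self-contained, so it builds its vocabulary from a short stand-in string (`sample_text`, an assumption) rather than from `input.txt`; the integer ids it produces therefore differ from the ones above.

```python
# Stand-in corpus (assumption): the notebook builds its vocabulary from input.txt instead
sample_text = "Hello world!\nSpeak, speak."

chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """String -> list of integer ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """List of integer ids -> string."""
    return ''.join(itos[i] for i in ids)

# Encoding and then decoding should give back the original string
round_trip = decode(encode("Hello world!"))
print(round_trip == "Hello world!")

# A character that never appeared in the corpus has no id, so encoding fails
try:
    encode("Hello, Zebra")
    unknown_raises = False
except KeyError:
    unknown_raises = True
print(unknown_raises)
```

The second check is the main limitation of a vocabulary built this way: any character missing from the training text cannot be encoded at all.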

In [10]:
# Encoding the entire Shakespeare work

data = torch.tensor(encode(text), dtype=torch.long)

print(data.shape, data.dtype)
print(data[:1000])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

### Train and Validation Splits


In [11]:
# Use the first 90% of the dataset for training and the last 10% for validation
n = int(0.9 * len(data))

train_data = data[:n]
val_data = data[n:]

Putting the entire chunk of text into the transformer at once is computationally infeasible. So, we need to sample little chunks of the dataset and feed those into the transformer.

In [12]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [13]:
x = train_data[:block_size] # inputs to the transformer
y = train_data[1:block_size+1] # targets for each position, offset by one 

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"When the input is {context}, the target is: {target}")

When the input is tensor([18]), the target is: 47
When the input is tensor([18, 47]), the target is: 56
When the input is tensor([18, 47, 56]), the target is: 57
When the input is tensor([18, 47, 56, 57]), the target is: 58
When the input is tensor([18, 47, 56, 57, 58]), the target is: 1
When the input is tensor([18, 47, 56, 57, 58,  1]), the target is: 15
When the input is tensor([18, 47, 56, 57, 58,  1, 15]), the target is: 47
When the input is tensor([18, 47, 56, 57, 58,  1, 15, 47]), the target is: 58


In [14]:
device = 'mps' if torch.backends.mps.is_available() else 'cpu'
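The cell above only distinguishes Apple's MPS backend from the CPU. A hedged variant that also checks for an NVIDIA GPU might look like the following; the helper name `pick_device` is hypothetical, and the preference order is an assumption rather than something the notebook requires.

```python
import torch

def pick_device():
    """Return 'cuda', 'mps', or 'cpu', in that order of preference."""
    if torch.cuda.is_available():
        return 'cuda'
    # torch.backends.mps is missing in older PyTorch builds, so guard the lookup
    mps = getattr(torch.backends, 'mps', None)
    if mps is not None and mps.is_available():
        return 'mps'
    return 'cpu'

device = pick_device()
print(device)
```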

In [15]:
torch.manual_seed(1337)

batch_size = 4 # how many sequences we process every forward and backwards pass
block_size = 8 # maximum context length for predictions

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split=='train' else val_data
    
    # randomly get indexes of dataset
    idx = torch.randint(len(data) - block_size, (batch_size,))
    
    # torch.stack takes the one-dimensional tensors like [18, 47, 56, 57, 58,  1, 15, 47]
    # and stacks them as rows of a (batch_size, block_size) tensor, here 4x8
    x = torch.stack([data[i:i+block_size] for i in idx])
    y = torch.stack([data[i+1:i+block_size+1] for i in idx])
    return x.to(device),y.to(device)
    

In [16]:
x_batched, y_batched = get_batch('train')

print(f"inputs: {x_batched.shape}") # a 4x8 tensor contains 32 independent examples, spelled out in the next cell
print(x_batched)
print(f"targets: {y_batched.shape}")
print(y_batched)
print('----')

inputs: torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]], device='mps:0')
targets: torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]], device='mps:0')
----
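One property worth checking in `get_batch` is that each target row is its input row shifted one position to the left. The sketch below re-implements the function on stand-in data (`torch.arange(100)`, an assumption chosen so the shift is easy to see) and leaves out the device transfer.

```python
import torch

torch.manual_seed(0)
data = torch.arange(100)            # stand-in for the encoded text (assumption)
block_size, batch_size = 8, 4

def get_batch(data):
    # random starting offsets, leaving room for block_size + 1 tokens
    idx = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in idx])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in idx])
    return x, y

x, y = get_batch(data)

# Columns 1..7 of the inputs should equal columns 0..6 of the targets
shift_ok = torch.equal(x[:, 1:], y[:, :-1])
print(x.shape, y.shape, shift_ok)
```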


In [17]:
for batch in range(batch_size):
    for t in range(block_size):
        context = x_batched[batch, :t+1]
        target = y_batched[batch,t]
        
        # all of these are independent, as far as the transformer is concerned
        print(f"When input is {context.tolist()}, the target is {target}")

When input is [24], the target is 43
When input is [24, 43], the target is 58
When input is [24, 43, 58], the target is 5
When input is [24, 43, 58, 5], the target is 57
When input is [24, 43, 58, 5, 57], the target is 1
When input is [24, 43, 58, 5, 57, 1], the target is 46
When input is [24, 43, 58, 5, 57, 1, 46], the target is 43
When input is [24, 43, 58, 5, 57, 1, 46, 43], the target is 39
When input is [44], the target is 53
When input is [44, 53], the target is 56
When input is [44, 53, 56], the target is 1
When input is [44, 53, 56, 1], the target is 58
When input is [44, 53, 56, 1, 58], the target is 46
When input is [44, 53, 56, 1, 58, 46], the target is 39
When input is [44, 53, 56, 1, 58, 46, 39], the target is 58
When input is [44, 53, 56, 1, 58, 46, 39, 58], the target is 1
When input is [52], the target is 58
When input is [52, 58], the target is 1
When input is [52, 58, 1], the target is 58
When input is [52, 58, 1, 58], the target is 46
When input is [52, 58, 1, 58, 46

### Bigram Language Model

The simplest baseline.

A good loss function is the negative log-likelihood loss, which PyTorch computes from raw logits with ```cross_entropy()```
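To make the connection between the two names concrete: in PyTorch, `F.cross_entropy` applied to raw logits gives the same value as `F.nll_loss` applied to log-softmax outputs. A minimal check on random stand-in logits (the shapes are arbitrary assumptions):

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
logits = torch.randn(5, 65)          # 5 predictions over a 65-character vocabulary
targets = torch.randint(65, (5,))    # the correct next character for each prediction

ce = F.cross_entropy(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=-1), targets)

# The same quantity by hand: mean of -log p(correct character)
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(5), targets].mean()

print(torch.allclose(ce, nll), torch.allclose(ce, manual))
```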

In [18]:
device = 'mps' if torch.backends.mps.is_available() else 'cpu'

In [19]:
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        
    def forward(self, idx, targets=None):
        
        # idx and targets are both (Batch, Time) tensor of integers
        logits = self.token_embedding_table(idx) # (Batch, Time, Channel) 
        
        if targets is None:
            loss = None
        else: 
            # F.cross_entropy wants the class dimension second ((N, C) or (B, C, T)),
            # so flatten batch and time into a single dimension
            B, T, C = logits.shape

            logits = logits.view(B*T, C)
            targets = targets.view(B*T)

            loss = F.cross_entropy(logits, targets)
        
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            logits, loss = self(idx) # get the predictions
            
            # focus on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) 
            
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B,1)
            
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [20]:
model_0 = BigramLanguageModel(vocab_size).to(device)
logits, loss = model_0(x_batched, y_batched)
print(f" logits shape: {logits.shape}, loss: {loss}")

 logits shape: torch.Size([32, 65]), loss: 4.878634452819824
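A useful reference point for the number above: a model that spread its probability evenly over all 65 characters would have a loss of ln(65). The untrained model's 4.88 is higher, which is expected since its logits are random rather than uniform. The arithmetic, using only the standard library:

```python
import math

vocab_size = 65

# Negative log-likelihood of guessing every character with probability 1/65
uniform_loss = -math.log(1.0 / vocab_size)
print(f"{uniform_loss:.4f}")
```

Training should push the loss below this value; a loss that stays above it suggests the model has not learned anything useful yet.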


In [21]:
idx = torch.zeros((1,1), dtype=torch.long).to(device)
print(decode(model_0.generate(idx, max_new_tokens=100)[0].tolist()))


SfqoRJ$
oot?hsuD.c.dLnTH;UX&yhKmyQ':bHoixjegH.;fFlvvCqGRGSPiqDWJFZT;tq-!uDUEqh&siV
oAXER
reCkSP flSq
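The sampling step inside `generate` is what makes the output vary from run to run: `torch.multinomial` draws indices in proportion to their probabilities instead of always picking the most likely one. A small sketch with a made-up three-token distribution (the probabilities are illustrative assumptions):

```python
import torch

torch.manual_seed(0)
probs = torch.tensor([0.1, 0.0, 0.9])   # hypothetical next-token probabilities

# Draw 1000 samples with replacement and count how often each index appears
samples = torch.multinomial(probs, num_samples=1000, replacement=True)
counts = torch.bincount(samples, minlength=3)
print(counts.tolist())

# Greedy decoding, for comparison, always returns the same index
greedy = torch.argmax(probs).item()
print(greedy)
```

Index 1 has zero probability and is never drawn, while index 0 still shows up occasionally, which greedy decoding would never allow.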


### Training Bigram Model

In [22]:
optimizer = torch.optim.AdamW(model_0.parameters(), lr=1e-3)

In [23]:
batch_size = 32

for steps in range(10000):
    # sample a batch of data
    x_batched, y_batched = get_batch('train')
    
    # evaluate the loss
    logits, loss = model_0(x_batched, y_batched)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    
print(loss.item())

2.394822120666504
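The value printed above is the loss on the last training batch only, so it is noisy and says nothing about the validation split. One common remedy is to average the loss over several batches from each split with gradients disabled. The sketch below is an illustration under assumptions: `estimate_loss` and `eval_iters` are names introduced here, and random token ids plus a bare `nn.Embedding` stand in for the notebook's data and `model_0`.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size = 65, 8, 32

# Stand-in data (assumption): random token ids instead of the encoded Shakespeare text
data = torch.randint(vocab_size, (2000,))
splits = {'train': data[:1800], 'val': data[1800:]}

# Stand-in model (assumption): the same kind of lookup table the bigram model uses
table = nn.Embedding(vocab_size, vocab_size)

def get_batch(split):
    d = splits[split]
    idx = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in idx])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in idx])
    return x, y

@torch.no_grad()  # evaluation only, so no gradients are tracked
def estimate_loss(eval_iters=20):
    out = {}
    for split in splits:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            logits = table(x)                                   # (B, T, C)
            losses[k] = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
        out[split] = losses.mean().item()
    return out

losses = estimate_loss()
print(losses)
```

Comparing the averaged train and validation numbers also gives an early hint of overfitting, which a single training-batch loss cannot show.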


In [24]:
print(decode(model_0.generate(idx, max_new_tokens=300)[0].tolist()))



PZy eald da s. bl
Gor, bllay,
THe by cak'dis?Y:
Bu rd G adl rmerad
OLe haut l ed odur ctich th thtoxth ut st w? wsind e oow iro'engout'e; cleash.
K:
Ance btory, s
RDULOXng pot hy wn bew.
Whithew shatealy t aren HAMe hese:
A therimequd as MAnde, asothalled styome ar es
TIOfith alitowsitof becu m;
QU


It looks like it is kind of working!?

### Self attention

Tokens should not be able to use future tokens as context.

So, for the token at position n, we want the average of the vectors of all tokens up to and including n. We can do this with a bag of words https://en.wikipedia.org/wiki/Bag-of-words_model



In [25]:
torch.manual_seed(1337)
B, T, C = 4, 8 , 2

x = torch.randn(B,T,C)
x.shape


torch.Size([4, 8, 2])

In [26]:
x_bag_of_words = torch.zeros(B,T,C)

for b in range(B):
    for t in range(T):
        x_prev = x[b,:t+1] # (t+1,C)
        x_bag_of_words[b,t] = torch.mean(x_prev, 0)

The first rows of ```x[0]``` and ```x_bag_of_words[0]``` below are the same, which makes sense because the average covers only one token. Each later row n differs because it is the average of rows 0...n.

In [27]:
x[0]

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [28]:
x_bag_of_words[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

That was inefficient; we can make it much faster using matrix multiplication. More specifically, we can use ```torch.tril()``` to create a lower triangular matrix. This is the trick that blocks out the future tokens from the current token.

In [29]:
weights = torch.tril(torch.ones(T,T))
weights = weights / weights.sum(1, keepdim=True)
weights

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
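To see why multiplying by this matrix averages the earlier rows, it helps to shrink the problem. In the sketch below, a 3x3 version of the same normalized lower triangular matrix multiplies a small hand-picked matrix `b` (the values are arbitrary), and row i of the result is the mean of rows 0..i of `b`.

```python
import torch

a = torch.tril(torch.ones(3, 3))
a = a / a.sum(1, keepdim=True)      # rows: [1, 0, 0], [1/2, 1/2, 0], [1/3, 1/3, 1/3]

b = torch.tensor([[2.0, 7.0],
                  [6.0, 4.0],
                  [6.0, 5.0]])

c = a @ b

# Row 0 is b[0]; row 1 is the mean of b[0:2]; row 2 is the mean of b[0:3]
print(c)
```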

In [30]:
# batched matrix multiply
# the shapes differ, so PyTorch broadcasts weights across the batch dimension
# as written:  (T,T) @ (B,T,C)
# broadcast: (B,T,T) @ (B,T,C) --> (B,T,C)
x_bag_of_words_2 = weights @ x 
x_bag_of_words_2

tensor([[[ 0.1808, -0.0700],
         [-0.0894, -0.4926],
         [ 0.1490, -0.3199],
         [ 0.3504, -0.2238],
         [ 0.3525,  0.0545],
         [ 0.0688, -0.0396],
         [ 0.0927, -0.0682],
         [-0.0341,  0.1332]],

        [[ 1.3488, -0.1396],
         [ 0.8173,  0.4127],
         [-0.1342,  0.4395],
         [ 0.2711,  0.4774],
         [ 0.2421,  0.0694],
         [ 0.0084,  0.0020],
         [ 0.0712, -0.1128],
         [ 0.2527,  0.2149]],

        [[-0.6631, -0.2513],
         [ 0.1735, -0.0649],
         [ 0.1685,  0.3348],
         [-0.1621,  0.1765],
         [-0.2312, -0.0436],
         [-0.1015, -0.2855],
         [-0.2593, -0.1630],
         [-0.3015, -0.2293]],

        [[ 1.6455, -0.8030],
         [ 1.4985, -0.5395],
         [ 0.4954,  0.3420],
         [ 1.0623, -0.1802],
         [ 1.1401, -0.4462],
         [ 1.0870, -0.4071],
         [ 1.0430, -0.1299],
         [ 1.1138, -0.1641]]])
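The broadcasting in the cell above can be made explicit. As a sketch, expanding `weights` to one copy per batch element and calling `torch.bmm` should give the same result as the plain `@`; the random `x` below is a stand-in with the same shape as the notebook's.

```python
import torch

torch.manual_seed(0)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

weights = torch.tril(torch.ones(T, T))
weights = weights / weights.sum(1, keepdim=True)    # (T, T)

# Implicit: the (T, T) matrix is broadcast to (B, T, T) by the @ operator
implicit = weights @ x

# Explicit: spell out the batch dimension, then use a strict batched matmul
explicit = torch.bmm(weights.expand(B, T, T).contiguous(), x)

print(implicit.shape, torch.allclose(implicit, explicit))
```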

In [31]:
# x_bag_of_words and x_bag_of_words_2 now hold the same values
# The matrix version avoids the Python loops, so it is significantly faster
torch.allclose(x_bag_of_words, x_bag_of_words_2)

True

However, the form that carries over to self-attention builds the same lower triangular weights with a mask and a softmax:

In [32]:
tril = torch.tril(torch.ones(T,T))
weights = torch.zeros((T,T))
weights

tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

In [33]:
# set the positions where tril == 0 (the future tokens) to negative infinity
weights = weights.masked_fill(tril == 0, float('-inf'))
weights

tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

This is super useful in self-attention, because this weights matrix tells us how much of each vector to pay attention to!

In [34]:
# We then take softmax() along every row
# https://en.wikipedia.org/wiki/Softmax_function
# exp(0) = 1 and exp(-inf) = 0, so this just normalizes each row into averaging weights

weights = F.softmax(weights, dim=-1)
weights

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
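Two things make the mask-then-softmax form work. First, exp(-inf) is 0, so masked positions get exactly zero weight. Second, the unmasked entries do not have to be zero: any scores can go there, and softmax turns them into an uneven weighting. A standard-library sketch, where the `softmax` helper and the score values are illustrative assumptions:

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats (may contain -inf)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

neg_inf = float('-inf')

# Equal scores on the visible positions give a plain average
uniform = softmax([0.0, 0.0, neg_inf, neg_inf])

# Different scores on the visible positions give a weighted average
uneven = softmax([2.0, 1.0, neg_inf, neg_inf])

print([round(p, 4) for p in uniform])
print([round(p, 4) for p in uneven])
```

In both cases the masked positions contribute nothing, so future tokens stay hidden regardless of the scores chosen for the visible ones.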

In [35]:
# The final bag of words
x_bag_of_words_3 = weights @ x

In [36]:
x_bag_of_words_3

tensor([[[ 0.1808, -0.0700],
         [-0.0894, -0.4926],
         [ 0.1490, -0.3199],
         [ 0.3504, -0.2238],
         [ 0.3525,  0.0545],
         [ 0.0688, -0.0396],
         [ 0.0927, -0.0682],
         [-0.0341,  0.1332]],

        [[ 1.3488, -0.1396],
         [ 0.8173,  0.4127],
         [-0.1342,  0.4395],
         [ 0.2711,  0.4774],
         [ 0.2421,  0.0694],
         [ 0.0084,  0.0020],
         [ 0.0712, -0.1128],
         [ 0.2527,  0.2149]],

        [[-0.6631, -0.2513],
         [ 0.1735, -0.0649],
         [ 0.1685,  0.3348],
         [-0.1621,  0.1765],
         [-0.2312, -0.0436],
         [-0.1015, -0.2855],
         [-0.2593, -0.1630],
         [-0.3015, -0.2293]],

        [[ 1.6455, -0.8030],
         [ 1.4985, -0.5395],
         [ 0.4954,  0.3420],
         [ 1.0623, -0.1802],
         [ 1.1401, -0.4462],
         [ 1.0870, -0.4071],
         [ 1.0430, -0.1299],
         [ 1.1138, -0.1641]]])
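As with the second version, the result can be checked against the explicit loop. The sketch below recomputes the loop version and the mask-and-softmax version on a random stand-in `x` and compares them:

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Version 1: explicit loops, averaging rows 0..t for every position t
loop_avg = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        loop_avg[b, t] = x[b, :t + 1].mean(0)

# Version 3: mask future positions with -inf, then softmax each row
tril = torch.tril(torch.ones(T, T))
weights = torch.zeros(T, T).masked_fill(tril == 0, float('-inf'))
weights = F.softmax(weights, dim=-1)
softmax_avg = weights @ x

same = torch.allclose(loop_avg, softmax_avg, atol=1e-6)
print(same)
```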