See:
https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing
https://www.youtube.com/watch?v=kCc8FmEb1nY

In [48]:
import torch

# Check the devices that we have available and prefer CUDA over MPS and CPU
def autoselectDevice(verbose=1):

    # default: CPU
    device = torch.device('cpu')

    if torch.cuda.is_available():
        # CUDA
        device = torch.device('cuda')
    elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
        # MPS (acceleration on Apple silicon M1 / M2 chips)
        device = torch.device('mps')

    if verbose:
        print('Using device:', device)

    # Additional Info when using cuda
    if verbose and device.type == 'cuda':
        print(torch.cuda.get_device_name(0))

    return device

# We transfer our model and data later to this device. If this is a GPU
# PyTorch will take care of everything automatically.
device = autoselectDevice(verbose=1)

Using device: mps


In [49]:
# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
#!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [50]:
# read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [51]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [52]:
# let's look at the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [53]:
# here are all the unique characters that occur in this text

chars = sorted(list(set(text)))
vocab_size = len(chars)
print("".join(chars))
print(f"Vocab size is {vocab_size}")



 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocab size is 65


In [54]:
# create a mapping from characters to integers
# encoder and decoder

stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))


[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


In [55]:
# tiktoken encoding
#import tiktoken
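#ttenc = tiktoken.get_encoding("cl100k_base") # get_encoding expects an encoding name; for a model name use tiktoken.encoding_for_model("gpt-4")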
#encode = lambda s: ttenc.encode(s)
#decode = lambda l: ttenc.decode(l)
#vocab_size = ttenc.n_vocab
#print(f"Vocab size is {vocab_size}")

#print(encode("hii there"))
#print(decode(encode("hii there")))

In [56]:
import tokenmonster

# Optionally set the tokenmonster directory, otherwise it will use ~/_tokenmonster
tokenmonster.set_local_directory("./_tokenmonster")

# Load a vocabulary by name, filepath or URL
ttenc = tokenmonster.load("english-1024-consistent-v1")


encode = lambda s: list(ttenc.tokenize(s))
decode = lambda l: ttenc.decode(l)
vocab_size = ttenc.vocab_size
print(f"Vocab size is {vocab_size}")

print(encode("hii there"))
print(decode(encode("hii there")))

Vocab size is 1024
[191, 53, 819]
hii there


In [57]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000]) # the first 1000 tokens; this is how the text we looked at earlier will look to the GPT

torch.Size([485471]) torch.int64
tensor([ 156,  786,   36,    3,  171,  243,  255,   58,   29,  260,  853,  384,
         358,   37,  349,   49,  178,  466,  311,  233,  415,   15,  669,  339,
         568,   45,   55,  761,  464,   29,  260,  568,   45,   55,   15,  568,
          45,   55,  761,  786,   36,    3,  171,  243,  255,   58,   29,  260,
         956,  464,  361,   37,  565,  251,   48,  360,  622,  377,  494,  734,
         377,  306,   37,  340,   37,  367,   34,  451,  464,   29,  260,  361,
          37,  565,  251,   48,   17,  361,   37,  565,  251,   48,  761,  786,
          36,    3,  171,  243,  255,   58,   29,  260,  786,   15,  592,  681,
          36,  288,   53,  249,   36,  523,  171,  249,  326,  482,   49,   50,
         302,   49,   37,  343,  903,  887,  761,  464,   29,  260,  384,  681,
          10,   64,   15,  384,  681,   10,   64,  761,  786,   36,    3,  171,
         243,  255,   58,   29,  260,  520,  380,  333,  206,  512,   15,  465,
       

In [58]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

In [59]:
# this is "context length" or "block size" in GPT terminology
block_size = 8
# this has multiple (8) training samples inside of it
train_data[:block_size+1]
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([156]) the target: 786
when input is tensor([156, 786]) the target: 36
when input is tensor([156, 786,  36]) the target: 3
when input is tensor([156, 786,  36,   3]) the target: 171
when input is tensor([156, 786,  36,   3, 171]) the target: 243
when input is tensor([156, 786,  36,   3, 171, 243]) the target: 255
when input is tensor([156, 786,  36,   3, 171, 243, 255]) the target: 58
when input is tensor([156, 786,  36,   3, 171, 243, 255,  58]) the target: 29


In [60]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[541,  52,  15, 260, 318, 422, 541, 377],
        [987, 359,  37, 563,  48, 541, 760, 312],
        [920, 769, 513, 122,  37, 464, 605, 397],
        [638, 354, 219, 204,  62,  15, 476, 204]])
targets:
torch.Size([4, 8])
tensor([[ 52,  15, 260, 318, 422, 541, 377, 342],
        [359,  37, 563,  48, 541, 760, 312, 241],
        [769, 513, 122,  37, 464, 605, 397, 897],
        [354, 219, 204,  62,  15, 476, 204,  62]])
----
when input is [541] the target: 52
when input is [541, 52] the target: 15
when input is [541, 52, 15] the target: 260
when input is [541, 52, 15, 260] the target: 318
when input is [541, 52, 15, 260, 318] the target: 422
when input is [541, 52, 15, 260, 318, 422] the target: 541
when input is [541, 52, 15, 260, 318, 422, 541] the target: 377
when input is [541, 52, 15, 260, 318, 422, 541, 377] the target: 342
when input is [987] the target: 359
when input is [987, 359] the target: 37
when input is [987, 359, 37] the target: 563
when

In [61]:
print(xb) # our input to the transformer

tensor([[541,  52,  15, 260, 318, 422, 541, 377],
        [987, 359,  37, 563,  48, 541, 760, 312],
        [920, 769, 513, 122,  37, 464, 605, 397],
        [638, 354, 219, 204,  62,  15, 476, 204]])


In [62]:
print(yb)

tensor([[ 52,  15, 260, 318, 422, 541, 377, 342],
        [359,  37, 563,  48, 541, 760, 312, 241],
        [769, 513, 122,  37, 464, 605, 397, 897],
        [354, 219, 204,  62,  15, 476, 204,  62]])


In [63]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

# every index in xb will be embedded into a vector of length vocab_size
# Interpretation: Every token is embedded into a vector of length vocab_size that holds
# the logits (unnormalized log-probabilities) for the next token. The prediction is based
# solely on what the current token is. At generation time we sample the next token from
# the softmax of these logits. This is a very simple model that cannot learn any
# dependency longer than one token.
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)   

        # cross-entropy loss. We want to predict the next token in the sequence. 
        if targets is None:
            loss = None
        else:
            # https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
            # cross_entropy expects the class (channel) dimension second, which is why
            # we flatten the (B,T,C) logits into (B*T,C) and the (B,T) targets into (B*T)
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)    

        return logits, loss
    
    # generate a sequence of tokens of length max_new_tokens, starting from idx
    # idx is a (B,T) tensor of integers

    
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step because these are the predictions 
            # for the next token
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
    
  

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)




torch.Size([32, 1024])
tensor(7.3747, grad_fn=<NllLossBackward0>)
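A quick sanity check on the initial loss printed above: a model that spreads its probability evenly over all 1024 tokens would have a cross-entropy of ln(1024), roughly 6.93. The untrained model's 7.3747 is somewhat higher, which is plausible for randomly initialized logits that are not exactly uniform. The sketch below is only an illustration; the helper name `uniform_baseline_loss` is ours and not part of the notebook.

```python
import math
import torch
from torch.nn import functional as F

def uniform_baseline_loss(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a uniform guess over vocab_size classes."""
    return math.log(vocab_size)

baseline = uniform_baseline_loss(1024)
print(f"uniform baseline: {baseline:.4f}")

# The same value falls out of F.cross_entropy when all logits are equal,
# regardless of what the targets are.
logits = torch.zeros(32, 1024)            # (B*T, C) with identical logits
targets = torch.randint(0, 1024, (32,))   # arbitrary targets
uniform_loss = F.cross_entropy(logits, targets).item()
print(f"cross_entropy on equal logits: {uniform_loss:.4f}")
```

A loss far above this baseline before training usually hints at a poorly scaled initialization; a loss close to it means the model starts out roughly uniform.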


In [64]:
idx = torch.zeros((1, 5), dtype=torch.long)
print(decode(m.generate(idx, max_new_tokens=500)[0].tolist()))

	 good do polie let solding
               ction cor law" any greatunque and that but Parathoughtor string em of the eas was unp att pes to word is a event $\ house (COME live through the open they example am I'm against% fiń er that he
 in a were que in of this told bu win for thecul what together}
li fact p 1 but direct has been $WATER $\ge University ag require old]per| find give - create\ $BO7
    important your times type may letous cu ho bi action its case otherder( ste relationship hubl was pe       " data using
\provide of the did du class

 ho thing le
Those of this. CharCreate after ba return view did support wa love6— cri_red general with look part est need to postions” also through the la enias, partit al to get ver 4 also res directis by has been since dr� betterver largete z much nuqua mi our the same pe sto f em back hereations second understand afterself don down al but I gance had:well hasz� staline- wi figures3 effect.MANAGE and that interign' should� friends want to


In the BigramLanguageModel above, the generate function uses a probabilistic approach (sampling from the softmax distribution) to predict the next token, rather than simply taking the argmax, which would select the most likely token each time. Sampling can be preferable to argmax for several reasons:

Diversity in Generation: When you use argmax, the model always chooses the most probable next word, leading to very deterministic and often repetitive outputs. Sampling, on the other hand, introduces randomness into the generation process, allowing for a more diverse range of outputs. This is especially important in creative tasks like text generation, where you might want the model to produce novel and varied sentences rather than the most predictable ones.

Avoiding Repetitive Loops: With argmax, there's a risk of falling into repetitive loops. This is because the model may keep predicting the same sequence of tokens over and over again, especially if it's stuck in a particularly high-probability path. Sampling can mitigate this by introducing less likely, but still plausible, tokens into the sequence.

Exploration of Less Frequent Paths: By sampling from the distribution, the model can occasionally pick less probable words that might lead to interesting or unexpected directions in text generation. This can be particularly useful for generating creative or varied text.

Better Reflection of Uncertainty: In many cases, especially when the next token is not very obvious, the model's uncertainty about the next word is better captured by a probability distribution rather than a single most likely choice. Sampling from this distribution can therefore give a better overall representation of the possible continuations.

Tuning the "Creativity": With sampling, the logits can be divided by a "temperature" before the softmax to control the randomness. A temperature above 1 flattens the distribution (more varied but less coherent outputs), while a temperature below 1 sharpens it (more coherent but less varied outputs). As the temperature approaches 0, sampling approaches argmax.

In summary, sampling from the probability distribution for the next token, rather than always selecting the most likely token, can enhance the creativity, diversity, and overall quality of the generated text in a language model. However, the choice between sampling and using argmax largely depends on the specific application and the desired characteristics of the generated text.
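To make the temperature idea concrete, here is a minimal sketch of a sampling step with a temperature knob. The function `sample_next` and its `temperature` parameter are hypothetical additions for illustration and are not part of the `generate` method above; treating `temperature == 0` as greedy argmax is our own convention.

```python
import torch
from torch.nn import functional as F

def sample_next(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Pick one token index per row from (B, C) logits.

    temperature == 0 is treated as greedy argmax (a convention chosen here).
    """
    if temperature == 0:
        return logits.argmax(dim=-1, keepdim=True)       # (B, 1)
    probs = F.softmax(logits / temperature, dim=-1)      # (B, C)
    return torch.multinomial(probs, num_samples=1)       # (B, 1)

torch.manual_seed(0)
logits = torch.tensor([[2.0, 1.0, 0.1]])

greedy = sample_next(logits, temperature=0)              # always the top token
sampled = sample_next(logits, temperature=1.0)           # random draw

# How temperature reshapes the distribution over the same logits
base = F.softmax(logits, dim=-1)
cold = F.softmax(logits / 0.5, dim=-1)   # sharper: more mass on the top token
hot = F.softmax(logits / 2.0, dim=-1)    # flatter: mass spread more evenly
print(cold, base, hot)
```

In `generate`, the equivalent change would be to divide `logits` by the temperature just before the softmax call.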

In [65]:
# move the model to the device first, then create a PyTorch optimizer over its parameters
m = m.to(device)
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
batch_size = 64
 
for steps in range(10000): # increase number of steps for good results... 
    
    # sample a batch of data
    xb, yb = get_batch('train')
    xb, yb = xb.to(device), yb.to(device)

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    if (steps+1) % 500 == 0:
        print(f"step {steps+1}, loss {loss.item():.3f}")

print(loss.item())



step 500, loss 7.000
step 1000, loss 6.653
step 1500, loss 6.217
step 2000, loss 5.923
step 2500, loss 5.666
step 3000, loss 5.494
step 3500, loss 5.215
step 4000, loss 4.833
step 4500, loss 4.805
step 5000, loss 4.590
step 5500, loss 4.410
step 6000, loss 4.387
step 6500, loss 4.189
step 7000, loss 4.204
step 7500, loss 4.266
step 8000, loss 4.032
step 8500, loss 3.926
step 9000, loss 3.934
step 9500, loss 4.064
step 10000, loss 3.964
3.964280843734741
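The losses printed above each come from a single random training batch, so they fluctuate (for example 3.934 at step 9000, then 4.064 at step 9500) and say nothing about the validation split. One common remedy is to average the loss over several batches per split without tracking gradients. The sketch below is self-contained and uses assumptions throughout: random token ids stand in for the encoded text, a bare `nn.Embedding` stands in for the bigram model, and `estimate_loss` and `eval_iters` are names introduced here.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size = 16, 8, 4

# Stand-in data: random token ids in place of the encoded Shakespeare text
data = torch.randint(0, vocab_size, (1000,))
splits = {'train': data[:900], 'val': data[900:]}

def get_batch(split):
    # same sampling scheme as the notebook's get_batch
    d = splits[split]
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x, y

table = nn.Embedding(vocab_size, vocab_size)  # bigram logits lookup table

@torch.no_grad()  # no gradients needed for evaluation
def estimate_loss(eval_iters=50):
    out = {}
    for split in ('train', 'val'):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            logits = table(xb)                                   # (B, T, C)
            losses[k] = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
        out[split] = losses.mean().item()                        # average over batches
    return out

losses = estimate_loss()
print(losses)
```

Calling such a function every few hundred steps gives a smoother curve and makes a gap between train and validation loss visible.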


In [66]:
m = m.to('cpu')

In [110]:
idx = torch.zeros((1, 1), dtype=torch.long)
#print(decode(m.generate(idx, max_new_tokens=500)[0].tolist()))

print(idx.shape)
print(idx)
out, loss = m(idx)
print(out.shape)
print(out)

torch.Size([1, 1])
tensor([[0]])
torch.Size([1, 1, 1024])
tensor([[[ 0.1635, -0.0633, -0.3254,  ...,  0.7493, -0.4367,  1.6583]]],
       grad_fn=<EmbeddingBackward0>)


In [68]:
# consider the following toy example:

torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [69]:
# We want xbow[b,t] = mean_{i<=t} x[b,i]

# version 1: explicit loops
# we average the current and all previous tokens. this is the simplest form of getting information from the past

xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1,C)
        xbow[b,t] = torch.mean(xprev, 0)



In [70]:
x[0]

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [71]:
xbow[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

In [72]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
torch.manual_seed(42)
a = torch.ones(3, 3)
#a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[14., 16.],
        [14., 16.],
        [14., 16.]])


In [73]:
torch.tril(torch.ones(3, 3))

tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])

In [74]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
# we average the previous tokens. this is the most simple form of getting information from the past
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


In [75]:
# vectorized version
# version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (T, T) @ (B, T, C): wei is broadcast to (B, T, T) ----> (B, T, C)
torch.allclose(xbow, xbow2)
print(wei)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])


In [76]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T)) # aggregation weights (attention affinity between tokens)
wei = wei.masked_fill(tril == 0, float('-inf')) # tokens from the future cannot be aggregated
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x # actual aggregation
torch.allclose(xbow, xbow3)

print(wei)

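# Why Softmax over a masked matrix instead of a simple division by the sum?
# Both give the same uniform averages here, because all affinities are zero.
# Softmax, however, also works once the affinities are arbitrary real numbers
# (including negative ones): it always yields positive weights that sum to 1,
# and entries masked with -inf get exactly zero weight.
#
# wei (the attention affinities between tokens) will be learned later 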


tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
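The uniform rows above arise only because every affinity starts at zero. As a small illustration of what changes once the affinities differ, the sketch below replaces the zeros with random scores (a stand-in for the data-dependent affinities introduced later, and purely an assumption for this example) and applies the same mask and softmax.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
T = 4
tril = torch.tril(torch.ones(T, T))

scores = torch.randn(T, T)                              # stand-in for learned affinities
scores = scores.masked_fill(tril == 0, float('-inf'))   # hide future positions
wei = F.softmax(scores, dim=-1)                         # normalize each row

row_sums = wei.sum(dim=-1)        # each row is a probability distribution
future_weight = wei[tril == 0]    # weights assigned to future positions
print(wei)
```

Each row still sums to 1 and puts zero weight on future positions, but the weights over the visible positions are no longer equal.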


In [77]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16 
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape

torch.Size([4, 8, 16])
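The `tril` masking line in the cell above is what makes this head causal. A minimal way to check this, sketched below under our own assumptions, is to change only the last token of the input and compare the outputs at the earlier positions, once with and once without the mask. The helper `attend` and its `causal` flag are introduced here for illustration and are not part of the notebook.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

def attend(x, key, query, value, causal=True):
    """Single-head (unscaled) dot-product attention over x of shape (B, T, C)."""
    B, T, C = x.shape
    k, q, v = key(x), query(x), value(x)                 # each (B, T, head_size)
    wei = q @ k.transpose(-2, -1)                        # (B, T, T)
    if causal:
        tril = torch.tril(torch.ones(T, T))
        wei = wei.masked_fill(tril == 0, float('-inf'))  # no looking ahead
    wei = F.softmax(wei, dim=-1)
    return wei @ v                                       # (B, T, head_size)

torch.manual_seed(1337)
B, T, C, head_size = 2, 8, 32, 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

x = torch.randn(B, T, C)
x_changed = x.clone()
x_changed[:, -1] = torch.randn(B, C)   # alter only the last token

with torch.no_grad():
    # compare outputs at all positions except the last one
    masked_same = torch.allclose(
        attend(x, key, query, value)[:, :-1],
        attend(x_changed, key, query, value)[:, :-1])
    unmasked_same = torch.allclose(
        attend(x, key, query, value, causal=False)[:, :-1],
        attend(x_changed, key, query, value, causal=False)[:, :-1])
print(masked_same, unmasked_same)
```

With the mask, earlier positions should be unaffected by the change to the last token; without it, every position can see the changed token, so the outputs should differ.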

Notes:
- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across the batch dimension is of course processed completely independently; examples never "talk" to each other.
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
- "Scaled" attention additionally multiplies `wei` by 1/sqrt(head_size). This makes it so that when the inputs Q and K have unit variance, `wei` has unit variance too, so Softmax stays diffuse and does not saturate too much. Illustration below