## Building a GPT

Companion notebook to the [Zero To Hero](https://karpathy.ai/zero-to-hero.html) video on GPT.

In [None]:
# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2023-05-26 13:25:39--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.1’


2023-05-26 13:25:40 (17.7 MB/s) - ‘input.txt.1’ saved [1115394/1115394]



In [None]:
'''
# read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
'''

In [None]:
text = "With the right variable ordering, variable elimination can help reduce the computational and space complexity of exact inference problems by a lot.  Therefore, we introduce an approximate inference method known as sampling, which should help us get answers that are good enough. In each of the following methods, we assume that we have access to the individual probability tables of the Bayes’ Net, and we use some source of randomness (e.g., a random number generator) which simulates picking values for variables. Prior sampling is the most straightforward/baseline kind of sampling. For each variable in the BN, we use a random generator to pick a value for it (e.g., generate a float between 0 and 1 and see where it lies with respect to the distribution defined by that variable’s probability table). We repeat this a number of times we deem sufficient, and then use the numbers of samples seen to calculate the query. The last type of sampling that we talk about is Gibbs sampling. In Gibbs sampling, it allows us to take turns sampling the variable that we want to learn about. This allows us to consider both upstream and downstream variables. In likelihood weighted sampling, it only conditions on upstream variables that are being conditioned on so the weights obtained can often times be very small. Since then the sum of the weights would be small, the number of effective samples would be low as well. Gibbs works to fix this issue."

In [None]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1445


In [None]:
# let's look at the first 1000 characters
print(text[:1000])

With the right variable ordering, variable elimination can help reduce the computational and space complexity of exact inference problems by a lot.  Therefore, we introduce an approximate inference method known as sampling, which should help us get answers that are good enough. In each of the following methods, we assume that we have access to the individual probability tables of the Bayes’ Net, and we use some source of randomness (e.g., a random number generator) which simulates picking values for variables. Prior sampling is the most straightforward/baseline kind of sampling. For each variable in the BN, we use a random generator to pick a value for it (e.g., generate a float between 0 and 1 and see where it lies with respect to the distribution defined by that variable’s probability table). We repeat this a number of times we deem sufficient, and then use the numbers of samples seen to calculate the query. The last type of sampling that we talk about is Gibbs sampling. In Gibbs sam

In [None]:
'''
high RAM
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"

# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4")
'''

2000


In [None]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars) # character-level vocabulary; the SentencePiece vocabulary replaces it further down
print(''.join(chars))
print(vocab_size)

 (),./01BFGINPSTWabcdefghiklmnopqrstuvwxy’
42


In [None]:
#!pip install tiktoken

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 16.2 MB/s eta 0:00:00
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


In [None]:
#enc.n_vocab

100277

In [None]:
#enc.encode("hello world")

[15339, 1917]

In [None]:
!pip install sentencepiece
!wget https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
--2023-05-26 15:31:13--  https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 278779 (272K) [text/plain]
Saving to: ‘botchan.txt’


2023-05-26 15:31:14 (6.34 MB/s) - ‘botchan.txt’ saved [278779/278779]



In [None]:

import sentencepiece as spm

# train a SentencePiece model on `botchan.txt`; this writes `m.model` and `m.vocab`
# `m.vocab` is for reference only and is not used for segmentation
spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')

# create a segmenter instance and load the model file (m.model)
sp = spm.SentencePieceProcessor()
sp.load('m.model')

# encode: text => id
print(sp.encode_as_pieces('This is a test'))
print(sp.encode_as_ids('This is a test'))

# decode: id => text
print(sp.decode_pieces(['▁This', '▁is', '▁a', '▁t', 'est']))
# these hard-coded ids differ from the ids this model produced above,
# so they decode to unrelated text rather than 'This is a test'
print(sp.decode_ids([209, 31, 9, 375, 586]))

['▁This', '▁is', '▁a', '▁t', 'est']
[208, 31, 9, 434, 601]
This is a test
il is a con live


In [None]:
print(sp.encode('This is a test'))

[208, 31, 9, 434, 601]


In [None]:
vocab_size = sp.get_piece_size() #len(chars) #enc.n_vocab high RAM
chars = sorted(list(set(text)))
print(vocab_size)

2000


In [None]:
# character-level mapping, kept for reference; the SentencePiece tokenizer is used instead
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
#encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
#decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
encode = sp.encode # encoder: take a string, output a list of subword token ids
decode = sp.decode # decoder: take a list of token ids, output a string
print(encode("hii there"))
print(decode(encode("hii there")))

[476, 52, 70]
hii there
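
For comparison, here is a minimal sketch of the character-level tokenizer that the commented-out `encode`/`decode` lambdas describe. It is plain Python and uses a hypothetical `sample_text` rather than the notebook's `text`, so the ids are illustrative only. Note that the character-level encoding spends one id per character (9 here), whereas the SentencePiece encoding above needed only 3 ids for the same string.

```python
# Character-level tokenizer sketch (plain Python, no SentencePiece needed).
sample_text = "hii there"  # hypothetical stand-in for the notebook's `text`

chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for i, ch in enumerate(chars)}  # integer id -> character


def char_encode(s):
    """Map a string to a list of integer ids, one per character."""
    return [stoi[c] for c in s]


def char_decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)


ids = char_encode("hii there")
print(len(chars), ids)
print(char_decode(ids))
```

A character-level vocabulary is tiny but produces long sequences; a subword vocabulary is larger but produces shorter ones.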


In [None]:
#
#enc.encode(text)


[5451, 47317, 512, 10438, 584, 10570, 904, 4726, 11, 6865, 757, 6604,
 382, 2460, 512, 96945, 11, 6604, 382, 5451, 47317, 512, 2675, 527,
 682, 20250, 4856, 311, 2815, 1109, 311, 2138, 819, 1980, 2460, 512,
 66494, 13, 20250, 382, 5451, 47317, 512, 5451, 11, 499, 1440, 356,
 2192, 355, 2947, 5979, 355, 374, 10388, 9354, 311, 279, 1274, 382,
 2460, 512, 1687, 1440, 956, 11, 584, 1440, 956, 382, 5451, 47317,
 512, 10267, 603, 5622, 1461, 11, 323, 584, 3358, 617, 14095, 520,
 1057, 1866, 3430, 627, 3957, 956, 264, 36543, 1980, 2460, 512, 2822,
 810, 7556, 389, 956, 26, 1095, 433, 387, 2884, 25, 3201, 11,
 3201, 2268, 16041, 47317, 512, 4054, 3492, 11, 1695, 10495, 382, 5451,
 47317, 512, 1687, 527, 41853, 8009, 10495, 11, 279, 3352, 2265, 5493,
 1695, 627, 3923, 11447, 1765, 1897, 1220, 389, 1053, 48839, 603, 25,
 422, 814, 198, 41450, 7692, 603, 719,
 

In [None]:
text

'With the right variable ordering, variable elimination can help reduce the computational and space complexity of exact inference problems by a lot.  Therefore, we introduce an approximate inference method known as sampling, which should help us get answers that are good enough. In each of the following methods, we assume that we have access to the individual probability tables of the Bayes’ Net, and we use some source of randomness (e.g., a random number generator) which simulates picking values for variables. Prior sampling is the most straightforward/baseline kind of sampling. For each variable in the BN, we use a random generator to pick a value for it (e.g., generate a float between 0 and 1 and see where it lies with respect to the distribution defined by that variable’s probability table). We repeat this a number of times we deem sufficient, and then use the numbers of samples seen to calculate the query. The last type of sampling that we talk about is Gibbs sampling. In Gibbs sa

In [None]:
# let's now encode the entire text dataset with SentencePiece and store it in a torch.Tensor
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(sp.encode(text), dtype=torch.long) # enc.encode(text) (tiktoken) needs high RAM
print(data.shape, data.dtype)
print(data[:1000]) # the text we looked at earlier will look like this to the GPT (only 490 tokens here)

torch.Size([490]) torch.int64
tensor([ 817,    5,  147, 1139,  266,  270,  833,   14,    3, 1139,  266,  270,
          12,  418,  553,   78,  227,  117,  615,  455,   88,  243,    5,  961,
         388,  227,  106,   10, 1981,  961,  132,  479,  333,   11,  542,   57,
          67,   19,   15,   92,   54,  420, 1966,    7,   53,    9, 1459,    4,
         327,   92,   38,   66,    3,   89, 1730,   24,  128,    9,   69,   69,
         164,  479,  553,  299,   15,   92,   54,  420, 1733,  168,   27,   36,
          94, 1055,  401,    3,   99,  130,  615,  282,   97,  616,    7,   23,
          84,  222,  519,    4,  313,  630,   11,    5,  477,   14, 1733,    7,
           3,   89, 1438,   23,   89,   55,  863,    8,    5, 1517,  671,   80,
         915, 1336,    7,   11,    5,  323,   57,   42,  182,    0,  676,   24,
          19,    3,   10,   89,  524,  125,   50,  169,  243,   11, 1073,   20,
          38,   45,  431,  331,   24,    4,   58,    4,    3,    9, 1073,   20,
          

In [None]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
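
A small sketch of the same 90/10 split in plain Python, using a hypothetical list of 490 token ids (the size printed above). It also checks a constraint that is easy to miss on tiny datasets: the batch sampler used later draws start offsets from `len(data) - block_size`, so each split has to be longer than `block_size`. The helper names here are illustrative and not part of the notebook.

```python
def split_train_val(tokens, train_fraction=0.9):
    """Return (train, val), with the first `train_fraction` of tokens as train."""
    n = int(train_fraction * len(tokens))
    return tokens[:n], tokens[n:]


def can_sample(split, block_size):
    """True if at least one (input, target) window of length block_size fits."""
    return len(split) - block_size >= 1


tokens = list(range(490))  # hypothetical token ids, same count as the tensor above
train_tokens, val_tokens = split_train_val(tokens)
print(len(train_tokens), len(val_tokens))
print(can_sample(val_tokens, block_size=8), can_sample(val_tokens, block_size=64))
```

With only 49 validation tokens, a context length of 8 works, but a much longer context would leave no valid window in the validation split.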

In [None]:
block_size = 8
train_data[:block_size+1]

tensor([ 817,    5,  147, 1139,  266,  270,  833,   14,    3])

In [None]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([817]) the target: 5
when input is tensor([817,   5]) the target: 147
when input is tensor([817,   5, 147]) the target: 1139
when input is tensor([ 817,    5,  147, 1139]) the target: 266
when input is tensor([ 817,    5,  147, 1139,  266]) the target: 270
when input is tensor([ 817,    5,  147, 1139,  266,  270]) the target: 833
when input is tensor([ 817,    5,  147, 1139,  266,  270,  833]) the target: 14
when input is tensor([ 817,    5,  147, 1139,  266,  270,  833,   14]) the target: 3



In [None]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[  15,   92,   54,  420, 1733,  168,   27,   36],
        [  52,   80,   80,    7,   94, 1055,  401,    3],
        [  69,  164,  479,  553,  299,   15,   92,   54],
        [ 216,    3,   10,  271,  524,    5, 1178,    7]])
targets:
torch.Size([4, 8])
tensor([[  92,   54,  420, 1733,  168,   27,   36,   94],
        [  80,   80,    7,   94, 1055,  401,    3,   18],
        [ 164,  479,  553,  299,   15,   92,   54,  420],
        [   3,   10,  271,  524,    5, 1178,    7,   11]])
----
when input is [15] the target: 92
when input is [15, 92] the target: 54
when input is [15, 92, 54] the target: 420
when input is [15, 92, 54, 420] the target: 1733
when input is [15, 92, 54, 420, 1733] the target: 168
when input is [15, 92, 54, 420, 1733, 168] the target: 27
when input is [15, 92, 54, 420, 1733, 168, 27] the target: 36
when input is [15, 92, 54, 420, 1733, 168, 27, 36] the target: 94
when input is [52] the target: 80
when input is [52, 80] the target: 8
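
The batching logic can be sketched without PyTorch to make the offset arithmetic explicit. This is an illustrative re-implementation, assuming a plain Python list of token ids and a seeded `random.Random`; the name `get_batch_py` is hypothetical. Each target row is the input row shifted one position to the right.

```python
import random


def get_batch_py(data, block_size, batch_size, rng):
    """Sample `batch_size` windows; targets are inputs shifted by one token."""
    xs, ys = [], []
    for _ in range(batch_size):
        # valid starts are 0 .. len(data) - block_size - 1, so the target window fits
        i = rng.randrange(len(data) - block_size)
        xs.append(data[i:i + block_size])
        ys.append(data[i + 1:i + block_size + 1])
    return xs, ys


rng = random.Random(1337)
toy_data = list(range(100, 150))  # hypothetical token ids
xb_py, yb_py = get_batch_py(toy_data, block_size=8, batch_size=4, rng=rng)
for x_row, y_row in zip(xb_py, yb_py):
    print(x_row, "->", y_row)
```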

In [None]:
print(xb) # our input to the transformer

tensor([[  15,   92,   54,  420, 1733,  168,   27,   36],
        [  52,   80,   80,    7,   94, 1055,  401,    3],
        [  69,  164,  479,  553,  299,   15,   92,   54],
        [ 216,    3,   10,  271,  524,    5, 1178,    7]])



In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        print(idx)
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            print("logits1 shape")
            print(logits.shape)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            print("logits2 shape")
            print(logits.shape)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


tensor([[  15,   92,   54,  420, 1733,  168,   27,   36],
        [  52,   80,   80,    7,   94, 1055,  401,    3],
        [  69,  164,  479,  553,  299,   15,   92,   54],
        [ 216,    3,   10,  271,  524,    5, 1178,    7]])
torch.Size([32, 2000])
tensor(8.0429, grad_fn=<NllLossBackward0>)
tensor([[0]])
logits1 shape
torch.Size([1, 1, 2000])
logits2 shape
torch.Size([1, 2000])
tensor([[  0, 144]])
logits1 shape
torch.Size([1, 2, 2000])
logits2 shape
torch.Size([1, 2000])
tensor([[   0,  144, 1093]])
logits1 shape
torch.Size([1, 3, 2000])
logits2 shape
torch.Size([1, 2000])
tensor([[   0,  144, 1093, 1679]])
logits1 shape
torch.Size([1, 4, 2000])
logits2 shape
torch.Size([1, 2000])
tensor([[   0,  144, 1093, 1679,  807]])
logits1 shape
torch.Size([1, 5, 2000])
logits2 shape
torch.Size([1, 2000])
tensor([[   0,  144, 1093, 1679,  807,  122]])
logits1 shape
torch.Size([1, 6, 2000])
logits2 shape
torch.Size([1, 2000])
tensor([[   0,  144, 1093, 1679,  807,  122,  493]])
logits1 sha
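
As a sanity check on the initial loss, the sketch below computes the cross-entropy of a model that assigns equal probability to every token in a 2000-entry vocabulary. It is plain Python with illustrative names. The loss printed by the untrained model above (8.0429) sits somewhat above this baseline, plausibly because randomly initialised logits are not exactly uniform.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


vocab_size_example = 2000  # same size as the SentencePiece vocabulary above
uniform_logits = [0.0] * vocab_size_example
baseline = cross_entropy(uniform_logits, target=0)
print(round(baseline, 4))  # 7.6009, i.e. ln(2000)
```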

In [None]:
nn.Embedding(vocab_size, vocab_size)

Embedding(42, 42)

In [None]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [None]:
batch_size = 32
for steps in range(100): # increase number of steps for good results...

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())


tensor([[ 4, 23,  4,  3,  0, 23, 21, 29],
        [29, 21, 20,  0, 18, 40,  0, 35],
        [24,  0, 34, 25, 28, 36, 27, 17],
        [35, 30, 33,  0, 35, 30,  0, 31],
        [23,  3,  0, 38, 24, 25, 19, 24],
        [28, 29, 21, 34, 34,  0,  1, 21],
        [29, 35, 33, 30, 20, 36, 19, 21],
        [24, 21,  0, 20, 25, 34, 35, 33],
        [33, 17, 29, 20, 30, 28,  0, 23],
        [34,  0, 34, 21, 21, 29,  0, 35],
        [ 0, 19, 30, 29, 20, 25, 35, 25],
        [35,  0, 38, 21,  0, 38, 17, 29],
        [21, 29, 21, 33, 17, 35, 21,  0],
        [19, 21,  0, 31, 33, 30, 18, 27],
        [21,  0, 28, 30, 34, 35,  0, 34],
        [17, 33, 25, 17, 18, 27, 21, 34],
        [23,  3,  0, 38, 24, 25, 19, 24],
        [35, 25, 28, 21, 34,  0, 18, 21],
        [28, 34,  0, 18, 40,  0, 17,  0],
        [23,  0, 25, 34,  0, 35, 24, 21],
        [17, 27, 27, 30, 38, 34,  0, 36],
        [28, 34,  0, 18, 40,  0, 17,  0],
        [29, 23,  4,  0,  9, 30, 33,  0],
        [35,  0,  1, 21,  4, 23,  

In [None]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))

Streaming output truncated to the last 5000 lines.
torch.Size([1, 42])
tensor([[ 0, 38, 35, 16, 34, 15, 35, 26, 17, 40, 36, 16, 16, 39, 21,  2, 25, 32,
         18, 27, 27,  9, 11, 41,  4, 21, 32, 25,  6,  7, 19, 12,  9,  5,  4, 35,
         16, 26, 33, 21, 35, 11, 14, 26, 15, 35, 38, 15, 29, 28, 16, 39,  7, 38,
         24, 36,  2,  7, 29, 10, 27,  1, 11, 18, 28, 40,  1, 12, 12,  9, 31,  8,
          3, 32, 11,  2, 25, 10,  1, 11, 11, 26, 32, 30,  5,  2, 25, 16,  3, 18,
          5, 30,  6,  8,  0, 30,  6,  7, 21, 35,  8, 12,  3, 32, 14, 30,  1, 38,
         15, 24, 18,  8, 31, 11,  7, 21, 35,  5, 31, 41,  4, 27, 10, 20,  1, 12,
         12, 10, 18,  2,  4, 11, 38, 21, 12, 38, 39, 37, 27, 22, 22, 10,  0, 16,
         28,  1, 35, 12, 12, 38, 39, 27, 14, 11, 25, 15,  5,  7, 14, 41, 19,  5,
          4, 16, 20,  5, 40, 40, 23, 16,  2,  2, 37,  0, 37,  0, 34, 14,  0, 31,
         28,  7,  3,  4, 31, 24, 36,  1,  0, 30, 31, 11, 14, 26, 32, 30, 38,  4,
         26,  1, 20,  1,
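
The core of `generate` is "softmax the last-step logits, then sample one index". Below is a minimal plain-Python sketch of that step, using `random.Random.choices` as a stand-in for `torch.multinomial`; the logits are made up for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next(logits, rng):
    """Draw one token index with probability proportional to softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
example_logits = [2.0, 0.0, -1.0]  # hypothetical logits for a 3-token vocabulary
counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_next(example_logits, rng)] += 1
print([round(p, 4) for p in softmax(example_logits)], counts)
```

Sampling, rather than always taking the most likely token, is what makes repeated generations differ from each other.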

## The mathematical trick in self-attention

In [None]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])
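
The same result can be reproduced by hand in plain Python, which makes it clear why a row-normalised lower-triangular matrix computes running averages: row `i` puts weight `1/(i+1)` on rows `0..i` of `b` and zero on the rest. The helper names are illustrative.

```python
def tril_average_matrix(n):
    """Lower-triangular matrix whose row i holds 1/(i+1) in columns 0..i."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


b_rows = [[2.0, 7.0], [6.0, 4.0], [6.0, 5.0]]  # same values as `b` above
c_rows = matmul(tril_average_matrix(3), b_rows)
for row in c_rows:
    print([round(v, 4) for v in row])
```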


In [None]:
# consider the following toy example:

torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [None]:
# We want x[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1,C)
        xbow[b,t] = torch.mean(xprev, 0)


In [None]:
# version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (T, T) @ (B, T, C) --broadcast--> (B, T, T) @ (B, T, C) ----> (B, T, C)
torch.allclose(xbow, xbow2)

True

In [None]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)


True
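
Why version 3 matches the plain average: with all-zero scores, filling future positions with `-inf` and applying softmax gives each allowed position the same weight and each masked position exactly zero. A one-row sketch in plain Python (the function name is illustrative):

```python
import math


def causal_softmax_row(scores, t):
    """Softmax over one row of scores where positions after t are masked out."""
    masked = [s if j <= t else float("-inf") for j, s in enumerate(scores)]
    m = max(masked)
    exps = [math.exp(v - m) for v in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


row = causal_softmax_row([0.0, 0.0, 0.0, 0.0], t=1)
print(row)  # [0.5, 0.5, 0.0, 0.0]
```

Once the scores stop being all zero, as in self-attention, the same mechanism yields data-dependent weights instead of a uniform average.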

In [None]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape

torch.Size([4, 8, 16])

In [None]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

Notes:
- Attention is a **communication mechanism**. It can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across the batch dimension is processed completely independently; examples never "talk" to each other.
- In an "encoder" attention block, just delete the single line that does masking with `tril`, allowing all tokens to communicate. The block here is called a "decoder" attention block because it has triangular masking, and it is usually used in autoregressive settings such as language modeling.
- "Self-attention" just means that the keys and values are produced from the same source as the queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module).
- "Scaled" attention additionally multiplies `wei` by 1/sqrt(head_size). This makes it so that when the inputs Q, K have unit variance, `wei` has unit variance too, and Softmax stays diffuse and does not saturate too much. Illustration below.
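
The encoder/decoder and self/cross distinctions in the notes can be sketched with a single plain-Python function. This is a simplified illustration, not the notebook's implementation: it assumes the queries and keys are used directly (no learned `key`/`query` projections), and the toy vectors are made up.

```python
import math


def attention_weights(queries, keys, causal):
    """Scaled dot-product attention weights; `causal=True` applies the tril mask."""
    head_size = len(keys[0])
    weights = []
    for t, q in enumerate(queries):
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(head_size) for k in keys]
        if causal:
            # decoder block: position t may only look at positions 0..t
            scores = [s if j <= t else float("-inf") for j, s in enumerate(scores)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


x_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical token vectors
memory = [[0.5, 0.5], [1.0, -1.0]]             # hypothetical external source

decoder_w = attention_weights(x_toy, x_toy, causal=True)    # masked self-attention
encoder_w = attention_weights(x_toy, x_toy, causal=False)   # unmasked self-attention
cross_w = attention_weights(x_toy, memory, causal=False)    # keys from another source
print(decoder_w[0], encoder_w[0], cross_w[0])
```

In the decoder case the first token can only attend to itself, while in the encoder and cross-attention cases every row spreads weight over all available keys.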

In [None]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [None]:
k.var()

tensor(1.0449)

In [None]:
q.var()

tensor(1.0700)

In [None]:
wei.var()

tensor(1.0918)
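
To see what the scaling factor buys, the sketch below compares the variance of raw and scaled dot products of unit-variance Gaussian vectors. It is plain Python with a seeded generator and an assumed `head_size` of 16, matching the cells above; the raw variance should come out near `head_size` and the scaled one near 1.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(1337)
head_size_example = 16
raw_scores, scaled_scores = [], []
for _ in range(20000):
    q_vec = [rng.gauss(0.0, 1.0) for _ in range(head_size_example)]
    k_vec = [rng.gauss(0.0, 1.0) for _ in range(head_size_example)]
    s = dot(q_vec, k_vec)
    raw_scores.append(s)
    scaled_scores.append(s * head_size_example ** -0.5)

raw_var = variance(raw_scores)
scaled_var = variance(scaled_scores)
print(round(raw_var, 2), round(scaled_var, 2))
```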

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [None]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])
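
The same effect in plain Python, pushed one step further: as the logits are multiplied by a larger factor, the largest probability approaches 1. The scale of 64 is an extra illustrative value not used in the notebook.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


base_logits = [0.1, -0.2, 0.3, -0.2, 0.5]
peak_by_scale = {}
for scale in (1, 8, 64):
    probs = softmax([v * scale for v in base_logits])
    peak_by_scale[scale] = max(probs)
    print(scale, [round(p, 4) for p in probs])
```

This is why unscaled attention scores with large variance are a problem at initialisation: each token would effectively aggregate information from a single other token.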

In [None]:
class LayerNorm1d: # (used to be BatchNorm1d)

  def __init__(self, dim, eps=1e-5, momentum=0.1):
    self.eps = eps
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)

  def __call__(self, x):
    # calculate the forward pass
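    xmean = x.mean(1, keepdim=True) # per-example mean over the feature dimension (was the batch mean in BatchNorm1d)
    xvar = x.var(1, keepdim=True) # per-example variance over the feature dimension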
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    return self.out

  def parameters(self):
    return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape

torch.Size([32, 100])

In [None]:
x[:,0].mean(), x[:,0].std() # mean,std of one feature across all batch inputs

(tensor(0.1469), tensor(0.8803))

In [None]:
x[0,:].mean(), x[0,:].std() # mean,std of a single input from the batch, of its features

(tensor(-9.5367e-09), tensor(1.0000))
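
For a single row, the normalisation can be written out in a few lines of plain Python. This sketch uses the unbiased variance (dividing by n - 1), which is what `x.var(...)` computes by default in the class above; note that PyTorch's built-in `nn.LayerNorm` uses the biased estimate instead. The learnable `gamma` and `beta` are left at their initial values (1 and 0) and therefore omitted.

```python
import math


def layer_norm_row(row, eps=1e-5):
    """Normalise one example's features to zero mean and (near) unit variance."""
    n = len(row)
    mean = sum(row) / n
    var = sum((v - mean) ** 2 for v in row) / (n - 1)  # unbiased estimate
    return [(v - mean) / math.sqrt(var + eps) for v in row]


normed = layer_norm_row([1.0, 2.0, 3.0, 4.0])
normed_mean = sum(normed) / len(normed)
normed_std = math.sqrt(sum((v - normed_mean) ** 2 for v in normed) / (len(normed) - 1))
print([round(v, 4) for v in normed], round(normed_mean, 6), round(normed_std, 4))
```

Because the statistics are computed per row, the result for one example does not depend on the rest of the batch, which is the key difference from batch normalisation.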

In [None]:
# French to English translation example:

# <--------- ENCODE ------------------><--------------- DECODE ----------------->
# les réseaux de neurones sont géniaux! <START> neural networks are awesome!<END>



In [None]:
text = "The biggest club game in the world is right around the corner, set to captivate football fans worldwide. The Champions League final, the annual pinnacle of the European club season, will take place on June 10th in Istanbul. This year, it features an exciting showdown between two formidable teams: Italian giants Inter Milan and English juggernaut Manchester City. As anticipation builds, football enthusiasts eagerly await to witness whether the coveted trophy will head to Milan or Manchester. Inter Milan's remarkable journey to the final surprised many observers. Drawn into the formidable group of death alongside powerhouses Bayern Munich and Barcelona, Inter defied expectations. They secured a spot in the knockout stages by finishing ahead of Barcelona with a crucial draw at the Nou Camp on Matchday 5. Inter's solid defensive performances propelled them through the rounds, triumphing over Porto and Benfica without conceding a single goal. In the semi-finals, they faced their fierce rivals, AC Milan, and emerged victorious with a resounding 3-0 win, sealing their ticket to Istanbul."

In [None]:
text2 = text + "Inter's presence in the final carries significant weight for Italian football, as it marks the country's representation in Europe's biggest game after a long gap. The last time an Italian team lifted the Champions League trophy was in 2010 when Jose Mourinho's Inter defeated Bayern Munich. Juventus, among other Italian clubs, have come close to glory but fell short. The significance of Inter's achievement resonates deeply within Italian football, raising hopes for a resurgence on the European stage. European Football Broadcaster Mina Rzouki told us, It is hugely important that Serie A is recognised again as a league that is trying to innovate and trying to move forward, and getting back to that top level of European football. There are still some cliches about Italian teams, that they are all still very defensive and play tactical football, when in fact, they are a lot of fun. Serie A has outscored the Premier League on multiple occasions over the last few years. Unfortunately, we rarely see this in the Champions League because Italian football is short of funding and when they are pitted directly against these European giants, they often do not have enough to overcome them."

In [None]:
text = text2 + "Manchester City's path to the final has been marked by an impressive unbeaten streak. They comfortably won their group, which included Borussia Dortmund, Sevilla, and Copenhagen, with a game to spare. In the round of 16, City dominated German side Red Bull Leipzig, scoring seven goals across two legs. Their quarter-final clash against Pep Guardiola's former team, Bayern Munich, saw them emerge triumphant with a convincing 3-0 victory at home. In the semi-finals, they faced Real Madrid in a rematch of the previous year's encounter, and City's dominant second-leg performance resulted in a resounding 5-1 aggregate victory."
text = text + "For Manchester City, this final represents a momentous opportunity to secure their first-ever European Cup triumph. Under Pep Guardiola's tenure, the club has come agonizingly close on multiple occasions, including a loss to Premier League rivals Chelsea in the 2021 final. Since their rise to prominence in 2008, the European Cup has remained the elusive trophy missing from their cabinet. As they vie for the title, football enthusiasts ponder the key ingredients needed for City to finally cross the finish line in this prestigious competition. I think, for them, it is almost more psychological at this point. They seem have had a mental block,” explained Semra Hunter, Broadcaster for La Liga. Guardiola is a great manager but I think a lot of it has to do with him rather than the players. He has been known to ‘overthink’ big games, because there has been a lot of pressure on him to win this competition since he left Barcelona. To go 12 years without lifting this trophy is surprising. Istanbul's Ataturk Stadium, the venue for this year's final, evokes memories of a historic showdown from 2005. On that occasion, AC Milan, Inter's city rivals, faced Liverpool in an unforgettable clash. With the prospect of a thrilling encounter reminiscent of the dramatic events of the past, football enthusiasts eagerly anticipate the final in Istanbul, hoping for a similarly captivating spectacle. The stage is set for an epic battle as Manchester City and Inter Milan prepare to face off in the Champions League final. Will Manchester City finally clinch their first European Cup, or will Inter Milan add another trophy to their illustrious history?"

In [None]:
len(text)

4567

### Full finished code, for reference

You may want to refer directly to the git repo instead though.

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 400
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
#with open('input.txt', 'r', encoding='utf-8') as f:
    #text = f.read()

# reuse the `text` defined in the cells above instead of reading input.txt

# character-level vocabulary, kept for reference; the SentencePiece vocabulary is used instead
chars = sorted(list(set(text)))
vocab_size = sp.get_piece_size() #len(chars)
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
#encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
#decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
encode = sp.encode # encoder: take a string, output a list of subword token ids
decode = sp.decode # decoder: take a list of token ids, output a string
# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs), hs = head_size
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size) rather than 1/sqrt(n_embd)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
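        # input width is the concatenated head outputs; equals n_embd when n_embd is divisible by num_heads
        self.proj = nn.Linear(head_size * num_heads, n_embd)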
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ position-wise MLP: linear layer, ReLU, linear layer, then dropout """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# GPT-style Transformer language model (the class keeps its original "Bigram" name)
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position embeddings are summed and fed through the Transformer blocks
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


0.459344 M parameters
step 0: train loss 7.7706, val loss 7.7538
step 100: train loss 4.2209, val loss 4.6982
step 200: train loss 2.3311, val loss 4.2435
step 300: train loss 1.0229, val loss 4.3851
step 399: train loss 0.4150, val loss 4.7213
 ⁇ -ever European Cup triumph. Und, the pell Leiover deathts ponague because Italian footba Drawts A more year and trying and trying tocollectioncial draw at home. team lifted trophant with a res. Inter's from 2005. In the rounds, and Mander Leichesterder 5-finals,ical at this Leagueter Munigall, set from 2008, and City toed again as a repo Milanchol to glo, Sem over the, In the1 of 12 years without resoundingly against these European Camp including and get- final,-0ing over the last last feague they viep Guardiant weightola, the Chelseaotba Rzo's Inter-fied expectationted20 whenyer City riventation builan teams, the 5-ever European j them, defied easy secleusnaut Manchest game to the Ch. Undath to 2005. Inter's about Italianchestelseague riding
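
The `Head` class in the cell above combines a causal mask with a softmax so that each position only attends to itself and earlier positions. Below is a minimal standalone sketch of just that step. The toy sizes and variable names are chosen for illustration, and the scores are scaled by `head_size ** -0.5`, the usual scaled dot-product convention, which is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, n_embd, head_size = 2, 4, 8, 4  # toy sizes, for illustration only

x = torch.randn(B, T, n_embd)
key = torch.nn.Linear(n_embd, head_size, bias=False)
query = torch.nn.Linear(n_embd, head_size, bias=False)

k, q = key(x), query(x)                            # each (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size ** -0.5  # (B, T, T) raw affinities
tril = torch.tril(torch.ones(T, T))                # lower-triangular causal mask
wei = wei.masked_fill(tril == 0, float('-inf'))    # hide future positions
wei = F.softmax(wei, dim=-1)                       # each row sums to 1 over the visible past

print(wei[0])  # upper triangle is exactly zero
```

The first row always puts all of its weight on position 0, since that is the only position the first token can see.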

In [None]:
print(decode(m.generate(context, max_new_tokens=100)[0].tolist()))

Istanbul. Undi with afens, they are all still very defensive and getting se'sive and when Preminal, they are carceluious competip leveriple occasionsence in 2008, the Europeague final, the Europ opportunity back to Milan entile manager but fell short of death along AC Milan their


In [None]:
m.generate(context, max_new_tokens=200).shape

torch.Size([1, 201])

In [None]:
context.shape

torch.Size([1, 1])

In [None]:
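# wrap in a list to add the batch dimension, and keep the prompt on the same device as the model
context = torch.tensor([encode("Istanbul")], dtype=torch.long, device=device)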

In [None]:
context.shape

torch.Size([1, 5])

In [None]:
context

tensor([[0]])

In [None]:
torch.zeros((1, 1), dtype=torch.long, device=device)


tensor([[0]])

In [None]:
context = torch.tensor(encode("king"), dtype=torch.long)  # 1-D, shape (T,); generate() expects (B, T)

In [None]:
context

tensor([49, 47, 52, 45])
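
The ids `[49, 47, 52, 45]` are what the character-level tokenizer gives for "king". A minimal sketch of that mapping follows, assuming the vocabulary is the 65-symbol Tiny Shakespeare character set (newline, space, eleven punctuation marks and digits, then upper- and lower-case letters). The literal character string below is that assumption, not something read from the dataset.

```python
import string

# Assumption: the 65 distinct characters of Tiny Shakespeare, written out by hand
chars = sorted(set("\n !$&',-.3:;?" + string.ascii_letters))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for i, ch in enumerate(chars)}  # integer id -> character

def encode(s):
    """Take a string, return a list of integer ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Take a list of integer ids, return a string."""
    return ''.join(itos[i] for i in ids)

print(len(chars))           # 65
print(encode("king"))       # [49, 47, 52, 45]
print(decode([1, 53, 44]))  # " of"
```

With a SentencePiece tokenizer the same string maps to different, usually fewer, ids, so prompts encoded under one tokenizer cannot be fed to a model trained with the other.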

In [None]:
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


And hearful, to
sistatestions thee,
Where as struepity, bringht, to
His amod, degs not love fend. Pry thy breyt.

Than farth, his good me to ambled thy for
Make alf the like day, My Thands;
For all he but. Which is this may moke. Ows!

CORIOLAUNE:
Fell.

FlYORK:
How's godste?
Not, now you read, would -honse.

KING RIZw'd:
Wor brother, As my have, as made yen: Gom
Werty is thurncts'd where wrongs.

ANGALT:
Worter:
Puom.
Pom make up not. Was carme, and the best,
That, nor will ma'st Yame a love.

BUCKINGHAS:
Why Iswer you wast all some,--
In wife in a fivend; e'sy had anst but shurrel with in in black to must I say. What, way
Rishermiestiong thee world why, I'll man cerkbalish wortune, the late,
And I resding the both once, betist folly him
andilince unto gried though.

MARCIOS:
And there; there, be
me thou may worthinie, after,
Affivily thou,
bear you humsile to have I miskneds afmiship.
If that that done heaving.
Now lay to caid your grawfes!
That hath them of us. But Cleant out.

ROM

In [None]:
m.generate(context, max_new_tokens=2000)

IndexError: ignored
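
The traceback is truncated, so the cause cannot be confirmed from the output alone. One likely explanation is that `context` was still the 1-D tensor built from `encode("king")`: `generate` crops its input with `idx[:, -block_size:]`, which needs a `(B, T)` tensor. The sketch below reproduces that indexing on a plain tensor; `block_size = 8` is a made-up value for the example.

```python
import torch

block_size = 8  # hypothetical context length, for this sketch only

flat = torch.tensor([49, 47, 52, 45])  # 1-D, shape (4,): no batch dimension
caught = None
try:
    flat[:, -block_size:]              # the same crop generate() applies to idx
except IndexError as err:
    caught = str(err)
print("1-D context fails:", caught)

batched = flat.unsqueeze(0)            # shape (1, 4): a batch holding one sequence
cropped = batched[:, -block_size:]     # now the crop works
print(cropped.shape)
```

Adding the batch dimension with `unsqueeze(0)`, or wrapping the encoded list in another list before calling `torch.tensor`, gives `generate` the shape it expects.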

In [None]:
m.generate(xb, max_new_tokens=200)

tensor([[ 1, 53, 44,  ..., 47, 52, 43],
        [43,  1, 61,  ..., 63,  1, 44],
        [17, 10,  0,  ..., 44, 39, 41],
        ...,
        [ 1, 51, 63,  ...,  0, 27,  5],
        [51, 63,  1,  ..., 52, 53, 52],
        [47, 50, 42,  ..., 46, 43, 56]])
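
When `generate` is called on a whole batch such as `xb`, every row of the result starts with the original prompt, followed by the newly sampled tokens. A small sketch of how the two parts could be separated is given here. `toy_generate` is a hypothetical stand-in that appends random ids in place of the trained model, and the sizes are toy values.

```python
import torch

torch.manual_seed(0)
vocab_size, block_size, max_new_tokens = 65, 8, 5  # toy values for illustration

def toy_generate(idx, max_new_tokens):
    # Hypothetical stand-in for m.generate: appends uniformly random token ids.
    for _ in range(max_new_tokens):
        idx_next = torch.randint(vocab_size, (idx.shape[0], 1))  # (B, 1)
        idx = torch.cat((idx, idx_next), dim=1)                  # (B, T+1)
    return idx

xb = torch.randint(vocab_size, (4, block_size))  # a batch of 4 prompts
out = toy_generate(xb, max_new_tokens)           # (4, block_size + max_new_tokens)

prompt = out[:, :xb.shape[1]]        # the first T columns are the input, unchanged
continuation = out[:, xb.shape[1]:]  # everything after that is newly generated
print(out.shape, continuation.shape)
```

Decoding only `continuation[0]` would show the model's own text without the prompt in front of it.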

In [None]:
print(decode(m.generate(xb, max_new_tokens=200)[0].tolist()))

 of Rome are this good belly,
And yest the lost death;
Housh seet their throusamers, the we disafte?

BORTINGBROKE:
At, not.

MENENV:
Madame a mane ame to mine and wich arbows?
I
'Tis fair you his and revedry: sir,
Will these meets 
