# Andrej Karpathy's nanoGPT

#### Reference
https://www.youtube.com/watch?v=kCc8FmEb1nY

https://github.com/karpathy/nanoGPT

ChatGPT is a language model that completes a given piece of text. The transformer architecture behind it was introduced in the paper "Attention Is All You Need".

In this module we'll build a character-level language model based on the transformer architecture. The dataset is a collection of Shakespeare's works (```tiny-shakespeare.txt```), so the model will generate Shakespeare-like text.

In [1]:
with open('tiny-shakespeare.txt', 'r') as f:
    text = f.read()

In [2]:
print("length of dataset in char: ", len(text))

length of dataset in char:  1115394


In [3]:
# here is our vocabulary; all possible characters in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print('vocab size: ', vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size:  65


In [4]:
# tokenize the text
s2i = {ch:i for i,ch in enumerate(chars)}
i2s = {i:ch for i,ch in enumerate(chars)}
encode = lambda s: [s2i[ch] for ch in s] # encoder: take a string and return a list of integers
decode = lambda x: ''.join([i2s[i] for i in x]) # decoder: take a list of integers and return a string

print(encode("hi there"))
print(decode(encode("hi there")))

[46, 47, 1, 58, 46, 43, 56, 43]
hi there


There are many tokenization algorithms. Google uses ```SentencePiece```, a sub-word tokenizer that is neither character-level nor word-level. OpenAI uses ```tiktoken```.

import tiktoken
enc = tiktoken.get_encoding("gpt2") # load the GPT-2 byte-pair encoding
assert enc.decode(enc.encode("hello world")) == "hello world" # encoding then decoding round-trips the text

With these tokenizers the encoded sequence is much shorter. Each token covers more than one character, so the same text needs fewer integers, at the cost of a much larger vocabulary.
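The effect on sequence length can be sketched in plain Python. This is only a toy comparison: the whitespace "word tokenizer" below is a hypothetical stand-in for a real sub-word tokenizer, and the sample sentence is made up for illustration.

```python
# Toy comparison: tokens that cover more characters give a shorter sequence.
sample = "to be or not to be that is the question"

# Character-level: one integer per character (as in encode() above).
char_vocab = sorted(set(sample))
char_ids = [char_vocab.index(ch) for ch in sample]

# Word-level: a hypothetical whitespace tokenizer, for illustration only.
words = sample.split()
word_vocab = sorted(set(words))
word_ids = [word_vocab.index(w) for w in words]

print("char-level tokens:", len(char_ids))
print("word-level tokens:", len(word_ids))
```

On a full corpus the word-level (or sub-word) vocabulary grows far beyond the 65 characters used here, which is the other side of the trade-off.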


In [5]:
# let's now encode the entire text dataset and store it in a torch.Tensor
import torch
data = torch.tensor(encode(text), dtype=torch.long)

print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


In [7]:
# let's split the data into train and validation sets to understand if our model is overfitting
n = int(0.9*len(data)) # first 90% of the data
train_data = data[:n]
val_data = data[n:]

To train the network we feed it chunks of the data rather than the entire text at once, which would be computationally impractical. The maximum chunk length is set by ```block_size```.

In [8]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

Think of the training process as follows. A chunk of ```block_size + 1``` tokens packs ```block_size``` training examples, one per context length:

when the context is '18', the model should learn that '47' is likely to come next.

when the context is '18 47', the model should learn that '56' is likely to come next.

when the context is '18 47 56', the model should learn that '57' is likely to come next.

...

In [11]:
torch.manual_seed(1337) # this is for reproducibility to get the same results every time
batch_size = 4 # how many independent sequences to process in parallel
block_size = 8 # the max context length for predictions

def get_batch(split):
    # generate small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,)) # random start indices for the examples
    x = torch.stack([data[i:i+block_size] for i in ix]) # batch_size x block_size
    y = torch.stack([data[i+1:i+block_size+1] for i in ix]) # batch_size x block_size
    return x, y


In [12]:
xb, yb = get_batch('train')
print('inputs: ')
print(xb.shape)
print(xb)
print('targets: ')
print(yb.shape)
print(yb)

print('-------')

inputs: 
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets: 
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
-------


In [13]:
# visualize the way the input and target are generated
for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f'when input is {context.tolist()}, target is {target}')

when input is [24], target is 43
when input is [24, 43], target is 58
when input is [24, 43, 58], target is 5
when input is [24, 43, 58, 5], target is 57
when input is [24, 43, 58, 5, 57], target is 1
when input is [24, 43, 58, 5, 57, 1], target is 46
when input is [24, 43, 58, 5, 57, 1, 46], target is 43
when input is [24, 43, 58, 5, 57, 1, 46, 43], target is 39
when input is [44], target is 53
when input is [44, 53], target is 56
when input is [44, 53, 56], target is 1
when input is [44, 53, 56, 1], target is 58
when input is [44, 53, 56, 1, 58], target is 46
when input is [44, 53, 56, 1, 58, 46], target is 39
when input is [44, 53, 56, 1, 58, 46, 39], target is 58
when input is [44, 53, 56, 1, 58, 46, 39, 58], target is 1
when input is [52], target is 58
when input is [52, 58], target is 1
when input is [52, 58, 1], target is 58
when input is [52, 58, 1, 58], target is 46
when input is [52, 58, 1, 58, 46], target is 39
when input is [52, 58, 1, 58, 46, 39], target is 58
...

In a bigram model, for instance, the prediction is based only on the last character. However, the history of characters/tokens that came before also has an impact on the choice of the next character. That is where the transformer model comes into the picture.
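To make the bigram limitation concrete, here is a minimal count-based sketch. It is not the neural bigram model from the video, and the helper names ```fit_bigram_counts``` and ```most_likely_next``` are made up for illustration. The point is that two contexts ending in the same character always get the same prediction.

```python
from collections import Counter, defaultdict

def fit_bigram_counts(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def most_likely_next(counts, context):
    """Predict from the LAST character only; everything before it is ignored.
    Assumes the last character was seen (with a successor) in the training text."""
    last = context[-1]
    return counts[last].most_common(1)[0][0]

counts = fit_bigram_counts("the then there")

# Both contexts end in 'h', so the bigram model cannot tell them apart.
pred_a = most_likely_next(counts, "th")
pred_b = most_likely_next(counts, "nh")
print(pred_a, pred_b)
```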

The easiest way to gather information from the previous tokens is to average their channel values. This obviously loses a lot of information about the spatial arrangement (the ordering) of the tokens, but we'll address that later.

This technique of averaging over the previous tokens is called ```bag of words```.

In [15]:
# consider this toy example

torch.manual_seed(1337)
B, T, C = 4, 8, 2 # batch, time, channels
x = torch.randn(B, T, C)
x.shape

torch.Size([4, 8, 2])

In [17]:
# Version 1: use for loops for averaging

# we want xbow[b, t] = mean_{i<=t} x[b, i]
xbow = torch.zeros((B, T, C)) # x "bag of words": each of the T positions holds a token, and we average over them
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1] # (t+1, C)
        xbow[b, t] = torch.mean(xprev, 0)

To make this averaging efficient we can use matrix multiplication. We build a lower-triangular matrix ```a``` whose rows are normalized to sum to 1. Multiplying ```a``` by another matrix ```b``` then produces a matrix in which row i is the column-wise average of rows 0 through i of ```b```.

In [29]:
torch.manual_seed(42)
a = torch.tril(torch.ones((3, 3)))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b

print(a)
print(b)
print(c)

tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


In [30]:
# Version 2: use matrix multiplication

wei = torch.tril(torch.ones((T, T))) # weight matrix
weil = wei / wei.sum(1, keepdim=True)
xbow2 = weil @ x

# validate to make sure xbow and xbow2 are the same
torch.allclose(xbow, xbow2)


True

In [32]:
# Version 3: use softmax

import torch.nn.functional as F
tril = torch.tril(torch.ones((T, T)))
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf')) # make all the zeros -inf
wei = F.softmax(wei, 1) # exponentiate and divide by sum
xbow3 = wei @ x

# validate 
torch.allclose(xbow, xbow3)

True
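Why does Version 3 reproduce the plain average? A small plain-Python sketch (no PyTorch needed; the ```softmax``` helper here is written by hand for illustration) shows that softmax over a row of zeros and ```-inf``` entries gives equal weight to the unmasked positions and zero weight to the masked ones.

```python
import math

def softmax(row):
    """Exponentiate and normalize; exp(-inf) is 0.0, so masked entries vanish."""
    exps = [math.exp(v) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

T = 4
NEG_INF = float("-inf")

# Row i keeps 0.0 for positions <= i and -inf for future positions,
# mirroring wei.masked_fill(tril == 0, float('-inf')) on a zero matrix.
masked = [[0.0 if j <= i else NEG_INF for j in range(T)] for i in range(T)]
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 4) for w in row])
```

Row i ends up with weight 1/(i+1) on each of its first i+1 entries, which is the same matrix as the normalized ```tril``` from Version 2. The advantage of the softmax form is that the zeros can later be replaced by data-dependent scores.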

In [39]:
# Version 4: self-attention
import torch.nn as nn
torch.manual_seed(1337)
B, T, C = 4, 8, 2 # batch, time, channels
x = torch.randn(B, T, C)

# let's see a single head perform self-attention
head_size = 16
key =   nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
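wei = q @ k.transpose(-2, -1) # transpose the last two dimensions: (B, T, 16) @ (B, 16, T) -> (B, T, T)

tril = torch.tril(torch.ones((T, T)))
# mask future positions with -inf so each token only attends to itself and earlier tokens.
# for tasks such as sentiment analysis you can remove this line so all the tokens get to talk to each other
wei = wei.masked_fill(tril == 0, float('-inf'))
# caution: wei is (B, T, T) here, so dim=1 normalizes down each column (in the output below the columns sum to 1);
# use dim=-1 so that each row, i.e. each token's weights over the tokens it attends to, sums to 1
wei = F.softmax(wei, 1) # exponentiate and divide by sum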
out = wei @ x

wei[0]

tensor([[0.1316, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1215, 0.0814, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1467, 0.0783, 0.2508, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1593, 0.0674, 0.3324, 0.5611, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1262, 0.1577, 0.1347, 0.1402, 0.2156, 0.0000, 0.0000, 0.0000],
        [0.0913, 0.1651, 0.0511, 0.0325, 0.2371, 0.2694, 0.0000, 0.0000],
        [0.1351, 0.0838, 0.1934, 0.2463, 0.4575, 0.0160, 0.6153, 0.0000],
        [0.0882, 0.3662, 0.0376, 0.0200, 0.0898, 0.7147, 0.3847, 1.0000]],
       grad_fn=<SelectBackward0>)
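Note that in the matrix above each column sums to 1 (the last column is a single 1.0000), whereas attention weights are normally normalized per row. The following plain-Python sketch contrasts the two normalizations on one small ```(T, T)``` slice. The scores are made-up numbers and the ```softmax``` helper is hand-written for illustration. In PyTorch terms, the row-wise version corresponds to ```F.softmax(wei, dim=-1)``` and the column-wise version to ```F.softmax(wei, dim=1)``` on a ```(B, T, T)``` tensor.

```python
import math

NEG_INF = float("-inf")

def softmax(values):
    """Exponentiate and divide by the sum; exp(-inf) contributes 0.0."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for T = 3 tokens (rows are queries, columns are keys).
scores = [[0.2, 1.0, -0.5],
          [0.7, 0.1, 0.3],
          [-0.4, 0.9, 0.6]]
T = len(scores)

# Causal mask: position i may only look at positions j <= i.
masked = [[scores[i][j] if j <= i else NEG_INF for j in range(T)] for i in range(T)]

# Row-wise softmax: each query's weights over its visible keys sum to 1.
by_row = [softmax(row) for row in masked]

# Column-wise softmax: each column sums to 1 instead, so rows generally do not.
cols = [softmax([masked[i][j] for i in range(T)]) for j in range(T)]
by_col = [[cols[j][i] for j in range(T)] for i in range(T)]

print("row sums (row-wise):   ", [round(sum(r), 4) for r in by_row])
print("row sums (column-wise):", [round(sum(r), 4) for r in by_col])
```

With row-wise normalization, ```wei @ x``` is a proper weighted average of the current and earlier tokens for every position.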