The cells below were written as notes while watching Andrej Karpathy's video "[Let's build GPT: from scratch, in code, spelled out.](https://www.youtube.com/watch?v=kCc8FmEb1nY)". Check it out if you haven't!


### Tokens - encoding text to a numeric format


In [64]:
from transformers import GPT2Tokenizer

# This is the tokenizer used by GPT-2.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

print(f"Vocab size: {tokenizer.vocab_size}")

Vocab size: 50257


In [65]:
test_str = "A rather long text to demonstrate the tokenizer. Coool, right?"

# GPT-2 uses a subword tokenizer (byte-level BPE): a token can be a whole word or only part of one.
str_enc = tokenizer.encode(test_str)  # Tokenized string
print([tokenizer.decode([s]) for s in str_enc])
print(str_enc)

['A', ' rather', ' long', ' text', ' to', ' demonstrate', ' the', ' token', 'izer', '.', ' Co', 'ool', ',', ' right', '?']
[32, 2138, 890, 2420, 284, 10176, 262, 11241, 7509, 13, 1766, 970, 11, 826, 30]


In [66]:
# Tokens are subwords, and punctuation gets its own tokens, so the encoded sequence has more tokens than the text has words.
print(f"Length of text: {len(test_str.split(' '))}")
print(f"Length of encoded seq: {len(str_enc)}")

Length of text: 10
Length of encoded seq: 15


For this example, we will be using text from Shakespeare as our corpus.


In [67]:
# Load data
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

We will be using a very simple tokenization scheme - encoding single characters as tokens.

Therefore, our vocabulary will consist of all symbols used in the text.


In [68]:
vocab = sorted(list(set(text)))
print("".join(vocab))

vocab_size = len(vocab)
print(f"Vocab size: {vocab_size}")


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocab size: 65


Note that our vocabulary (65 symbols) is much smaller than the GPT-2 tokenizer's (50257 tokens). Keep this in mind as we continue!


In [69]:
# Simple tokenization scheme by using the character's index in the vocabulary as its token.
stoi = {ch: i for i, ch in enumerate(vocab)}  # string-to-integer
itos = {i: ch for i, ch in enumerate(vocab)}  # integer-to-string

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join([itos[i] for i in l])

Let's encode our original example text again, now using this simple tokenizer.


In [70]:
print(encode(test_str))
print(decode(encode(test_str)))

[13, 1, 56, 39, 58, 46, 43, 56, 1, 50, 53, 52, 45, 1, 58, 43, 62, 58, 1, 58, 53, 1, 42, 43, 51, 53, 52, 57, 58, 56, 39, 58, 43, 1, 58, 46, 43, 1, 58, 53, 49, 43, 52, 47, 64, 43, 56, 8, 1, 15, 53, 53, 53, 50, 6, 1, 56, 47, 45, 46, 58, 12]
A rather long text to demonstrate the tokenizer. Coool, right?
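
One caveat of this index-lookup scheme is that it only works for characters that appear in the vocabulary. The sketch below illustrates this with a hypothetical toy corpus (`"hello world"`) instead of `input.txt`; the `def`-style `encode`/`decode` helpers are equivalent stand-ins for the lambdas above, not a change to the notebook's approach.

```python
# Hypothetical toy corpus; the notebook builds `vocab` from input.txt instead.
corpus = "hello world"
vocab = sorted(set(corpus))
stoi = {ch: i for i, ch in enumerate(vocab)}  # string-to-integer
itos = {i: ch for i, ch in enumerate(vocab)}  # integer-to-string


def encode(s):
    return [stoi[c] for c in s]


def decode(ids):
    return "".join(itos[i] for i in ids)


# The round trip is lossless for text drawn from the vocabulary.
roundtrip_ok = decode(encode("hello")) == "hello"

# A character outside the vocabulary has no index, so the lookup fails.
try:
    encode("hello!")
    failed = False
except KeyError as err:
    failed = True
    print(f"Unknown character: {err}")
```

The same applies to the Shakespeare vocabulary printed earlier: `3` is the only digit it contains, so a string with any other digit would raise a `KeyError`.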


Let's check how our encoded sequence compares to the original text in length now.


In [71]:
print(f"Length of text: {len(test_str.split(' '))}")
print(f"Length of encoded seq: {len(encode(test_str))}")

Length of text: 10
Length of encoded seq: 62


See how it is much longer? With a smaller vocabulary, we need more tokens to encode the same text: 62 here, compared to 15 with the GPT-2 tokenizer.

This shows the inherent relationship between vocabulary size and sequence length - a smaller vocabulary results in longer sequences.
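
One rough way to see why this happens: a token drawn from a vocabulary of size V can distinguish at most V alternatives, so it carries at most log2(V) bits. The sketch below compares that upper bound for the two vocabularies used in this notebook. The helper name `max_bits_per_token` is made up for illustration, and the bound assumes all tokens are equally likely, which real text does not satisfy.

```python
import math


def max_bits_per_token(vocab_size):
    # Upper bound on information per token, assuming a uniform distribution.
    return math.log2(vocab_size)


char_bits = max_bits_per_token(65)     # character-level vocabulary
gpt2_bits = max_bits_per_token(50257)  # GPT-2 vocabulary

print(f"Character-level: {char_bits:.2f} bits per token at most")
print(f"GPT-2:           {gpt2_bits:.2f} bits per token at most")

# Token counts observed for the example sentence in this notebook.
char_tokens, gpt2_tokens = 62, 15
print(f"Length ratio: {char_tokens / gpt2_tokens:.2f}x")
```

The observed length ratio is larger than the ratio of the two bounds, which is expected: the bound is only a ceiling, and a learned subword vocabulary is tuned to pack common character sequences into single tokens.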


### WIP - Predicting the next token


In [72]:
import torch

torch.manual_seed(1337)

B, T, C = 4, 8, 2

# Let's create a mock sequence of embedded tokens.
x = torch.randn(B, T, C)

x.shape

torch.Size([4, 8, 2])

A naive way to predict the next token is to use the mean of the embeddings of all tokens seen so far, including the current one.

$$ \hat{x}_{t+1} = f\left(\frac{1}{t+1}\sum_{i=0}^{t} x_i\right) $$


In [73]:
%%time
# Slow way, for-loop...

xbow = torch.zeros((B, T, C))

for b in range(B):  # For each sequence in batch.
    for t in range(T):  # For each time step in seq.
        xprev = x[b, : t + 1]
        xbow[b, t] = torch.mean(xprev, dim=0)

CPU times: user 1.08 ms, sys: 788 µs, total: 1.87 ms
Wall time: 1.25 ms


In [74]:
%%time

# Fast way - matrix mult!

wei = torch.tril(torch.ones(T, T))

CPU times: user 4.47 ms, sys: 58 µs, total: 4.52 ms
Wall time: 2.12 ms
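
The cell above only builds the lower-triangular matrix. One possible way to finish the idea, sketched here under the assumption that the goal is to reproduce the for-loop result, is to normalize each row of `wei` so it sums to 1 and then multiply it with `x`. The name `xbow2` is introduced only for this comparison.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Reference: the explicit double loop over batch and time.
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, : t + 1].mean(dim=0)

# Lower-triangular ones: row t has ones in columns 0..t and zeros after.
wei = torch.tril(torch.ones(T, T))
# Normalize each row to sum to 1, so row t averages positions 0..t.
wei = wei / wei.sum(dim=1, keepdim=True)
# (T, T) @ (B, T, C) broadcasts over the batch dimension, giving (B, T, C).
xbow2 = wei @ x

# Compare with a small absolute tolerance to allow for float32 rounding.
print(torch.allclose(xbow, xbow2, atol=1e-6))
```

Because row `t` of `wei` has zeros for every position after `t`, each output only mixes information from the current and earlier tokens, which is the same causal constraint the loop enforces through slicing.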
