The cells below were created as notes while watching Andrej Karpathy's video "[Let's build GPT: from scratch, in code, spelled out.](https://www.youtube.com/watch?v=kCc8FmEb1nY)". Check it out if you haven't!


### Tokens - encoding text to a numeric format


In [2]:
from transformers import GPT2Tokenizer

# This is the tokenizer used by GPT-2.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

print(f"Vocab size: {tokenizer.vocab_size}")



Vocab size: 50257


In [3]:
test_str = "A rather long text to demonstrate the tokenizer. Coool, right?"

# GPT-2 uses a subword (byte-pair encoding) tokenizer: a token can be a whole word, part of a word, or a punctuation mark.
str_enc = tokenizer.encode(test_str)  # Tokenized string
print([tokenizer.decode([s]) for s in str_enc])
print(str_enc)


['A', ' rather', ' long', ' text', ' to', ' demonstrate', ' the', ' token', 'izer', '.', ' Co', 'ool', ',', ' right', '?']
[32, 2138, 890, 2420, 284, 10176, 262, 11241, 7509, 13, 1766, 970, 11, 826, 30]


In [4]:
# Some words are split into several tokens and punctuation gets tokens of its own, so the encoded sequence has more tokens than the text has words.
print(f"Length of text: {len(test_str.split(' '))}")
print(f"Length of encoded seq: {len(str_enc)}")


Length of text: 10
Length of encoded seq: 15


For this example, we will be using text from Shakespeare as our corpus.


In [5]:
# Load data
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()


We will be using a very simple tokenization scheme - encoding single characters as tokens.

Therefore, our vocabulary will consist of all symbols used in the text.


In [6]:
vocab = sorted(list(set(text)))
print("".join(vocab))

vocab_size = len(vocab)
print(f"Vocab size: {vocab_size}")



 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocab size: 65


Note that our vocabulary (65 symbols) is much smaller than the GPT-2 tokenizer's (50,257 tokens). Keep this in mind as we continue!


In [7]:
# Simple tokenization scheme by using the character's index in the vocabulary as its token.
stoi = {ch: i for i, ch in enumerate(vocab)}  # string-to-integer
itos = {i: ch for i, ch in enumerate(vocab)}  # integer-to-string

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join([itos[i] for i in l])


Let's encode our original example text again, now using this simple tokenizer.


In [8]:
print(encode(test_str))
print(decode(encode(test_str)))


[13, 1, 56, 39, 58, 46, 43, 56, 1, 50, 53, 52, 45, 1, 58, 43, 62, 58, 1, 58, 53, 1, 42, 43, 51, 53, 52, 57, 58, 56, 39, 58, 43, 1, 58, 46, 43, 1, 58, 53, 49, 43, 52, 47, 64, 43, 56, 8, 1, 15, 53, 53, 53, 50, 6, 1, 56, 47, 45, 46, 58, 12]
A rather long text to demonstrate the tokenizer. Coool, right?
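
One limitation of this scheme is worth noting: `encode` can only handle characters that occur in the corpus, because `stoi` has no entry for anything else. The sketch below rebuilds the same kind of mapping on a tiny stand-in corpus (an assumption, so that it runs without `input.txt`; the `toy_` names are hypothetical) and shows the resulting `KeyError`.

```python
# Stand-in corpus (assumption): the notebook itself uses the Shakespeare text.
toy_text = "hello world"
toy_vocab = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_vocab)}  # string-to-integer
toy_itos = {i: ch for i, ch in enumerate(toy_vocab)}  # integer-to-string


def toy_encode(s):
    return [toy_stoi[c] for c in s]


def toy_decode(ids):
    return "".join(toy_itos[i] for i in ids)


# Characters seen in the corpus survive a round trip unchanged.
round_trip = toy_decode(toy_encode("hello"))

# "!" never occurs in the stand-in corpus, so the lookup fails.
missing = None
try:
    toy_encode("hello!")
except KeyError as err:
    missing = err.args[0]

print(round_trip, repr(missing))
```

The example string used above encodes without problems only because every one of its characters happens to be in the Shakespeare vocabulary.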


Let's check how our encoded sequence compares to the original text in length now.


In [9]:
print(f"Length of text: {len(test_str.split(' '))}")
print(f"Length of encoded seq: {len(encode(test_str))}")


Length of text: 10
Length of encoded seq: 62


See how it is much longer (62 tokens here versus 15 with the GPT-2 tokenizer)? Since we are using a smaller vocabulary, we must use more tokens to encode our sequences.

This shows the inherent relationship between vocabulary size and sequence length - a smaller vocabulary results in longer sequences.
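
To make the trade-off concrete, here is a minimal sketch that starts from character tokens and merges one frequent pair into a single new token. It is only loosely in the spirit of byte-pair encoding and is not the GPT-2 tokenizer; `merge_pair` and the sample sentence are made up for illustration.

```python
def merge_pair(tokens, pair):
    """Replace each adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


sample = "the hat that he thinks of"  # made-up example sentence
char_tokens = list(sample)
merged_tokens = merge_pair(char_tokens, ("t", "h"))

char_vocab_size = len(set(char_tokens))
merged_vocab_size = len(set(merged_tokens))

print(f"chars:  vocab={char_vocab_size}, length={len(char_tokens)}")
print(f"merged: vocab={merged_vocab_size}, length={len(merged_tokens)}")
```

Adding the single token `th` grows the vocabulary by one and shortens the sequence by one token per occurrence of that pair.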


In [16]:
import torch

data = torch.tensor(encode(text), dtype=torch.long)
print(data[:100])


tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


In [18]:
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]


In [19]:
block_size = 8
train_data[: block_size + 1]


tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])
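
A chunk of `block_size + 1` tokens contains `block_size` training examples, one per context length. The sketch below spells them out with a plain Python list; the token values are copied from the output above.

```python
block_size = 8
# First block_size + 1 tokens of the training data, copied from the output above.
chunk = [18, 47, 56, 57, 58, 1, 15, 47, 58]

pairs = []
for t in range(block_size):
    context = chunk[: t + 1]  # tokens 0..t
    target = chunk[t + 1]     # the token that follows them
    pairs.append((context, target))
    print(f"context {context} -> target {target}")
```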

### Bigram model - Predicting based on the previous token only


In [22]:
batch_size = 4
block_size = 8


def get_batch(split):
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x, y
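
A quick way to check the batching logic is to run the same indexing on a stand-in tensor whose values equal their positions (`toy_data` is an assumption, not the real training data). Each target row should then be the input row shifted by one position.

```python
import torch

torch.manual_seed(0)

toy_data = torch.arange(100)  # stand-in for train_data: value == position
batch_size, block_size = 4, 8

# Same indexing as in get_batch above.
ix = torch.randint(len(toy_data) - block_size, (batch_size,))
xb = torch.stack([toy_data[i : i + block_size] for i in ix])
yb = torch.stack([toy_data[i + 1 : i + block_size + 1] for i in ix])

print(xb.shape, yb.shape)
# Targets are the inputs shifted left by one token.
print(torch.equal(xb[:, 1:], yb[:, :-1]))
```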


In [83]:
import torch.nn as nn
import torch.nn.functional as F


class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size) -> None:
        super().__init__()

        # Row i of the table holds the logits for the token that follows token i.
        self.embedding = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are (B, T) tensors of token indices.
        logits = self.embedding(idx)  # (B, T, C) with C = vocab_size

        if targets is not None:
            # F.cross_entropy wants the class dimension second, so flatten to (B*T, C) and (B*T,).
            B, T, C = logits.shape
            logits_ = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits_, targets)
        else:
            loss = None

        return logits, loss

    def generate(self, idx, num_steps):
        # idx is a (B, T) tensor holding the context so far.
        for _ in range(num_steps):
            logits, _ = self(idx)
            logits = logits[:, -1, :]  # Keep only the last time step: (B, C)

            # Note: Sampling is probabilistic.
            probs = F.softmax(logits, dim=-1)

            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T + 1)

        return idx


model = BigramLanguageModel(vocab_size)
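
A useful sanity check before training: a model that assigns equal probability to all 65 symbols would have a cross-entropy loss of ln(65). A freshly initialized `nn.Embedding` table is random rather than uniform, so its starting loss will typically be somewhat higher than this baseline.

```python
import math

vocab_size = 65  # character vocabulary size reported earlier
# Cross-entropy of a uniform prediction: -ln(1 / vocab_size) = ln(vocab_size).
uniform_loss = -math.log(1.0 / vocab_size)
print(f"{uniform_loss:.4f}")
```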


In [108]:
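optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

for step in range(1000):
    xb, yb = get_batch("train")

    logits, loss = model(xb, yb)

    # Clear gradients from the previous step; otherwise they accumulate across iterations.
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    print(f"Train loss: {loss}")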


Train loss: 2.195941925048828
Train loss: 2.5153584480285645
Train loss: 2.5030364990234375
Train loss: 2.493042469024658
Train loss: 2.611518144607544
Train loss: 2.4348185062408447
Train loss: 2.168931722640991
Train loss: 2.555227518081665
Train loss: 2.0753352642059326
Train loss: 2.399522304534912
Train loss: 2.3705391883850098
Train loss: 2.786844253540039
Train loss: 2.861818552017212
Train loss: 3.1184353828430176
Train loss: 2.8109734058380127
Train loss: 2.684131622314453
Train loss: 3.411339521408081
Train loss: 2.2716832160949707
Train loss: 2.417025327682495
Train loss: 2.802736520767212
Train loss: 3.864915132522583
Train loss: 1.9487805366516113
Train loss: 2.4493465423583984
Train loss: 2.2473337650299072
Train loss: 2.306190252304077
Train loss: 2.722623109817505
Train loss: 2.0977742671966553
Train loss: 2.344184637069702
Train loss: 2.1484503746032715
Train loss: 3.2914156913757324
Train loss: 3.233902931213379
Train loss: 2.4903721809387207
Train loss: 2.41721487045
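
The per-batch losses above jump around because each one is computed on only 4 x 8 tokens. A steadier picture comes from averaging the loss over many batches without tracking gradients. The sketch below shows the idea on a toy setup; the repeating 4-symbol sequence, the 4x4 table, and the helper names `toy_batch`, `batch_loss`, and `estimate_loss` are all assumptions made so the block runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins (assumptions): a repeating 0,1,2,3 sequence and a 4x4 bigram table.
toy_vocab_size = 4
toy_data = torch.arange(400) % toy_vocab_size
table = nn.Embedding(toy_vocab_size, toy_vocab_size)


def toy_batch(batch_size=4, block_size=8):
    ix = torch.randint(len(toy_data) - block_size, (batch_size,))
    x = torch.stack([toy_data[i : i + block_size] for i in ix])
    y = torch.stack([toy_data[i + 1 : i + block_size + 1] for i in ix])
    return x, y


def batch_loss(x, y):
    logits = table(x)  # (B, T, C)
    return F.cross_entropy(logits.view(-1, toy_vocab_size), y.view(-1))


@torch.no_grad()
def estimate_loss(num_batches=50):
    # Average over many random batches to smooth out batch-to-batch noise.
    losses = torch.stack([batch_loss(*toy_batch()) for _ in range(num_batches)])
    return losses.mean().item()


loss_before = estimate_loss()

optimizer = torch.optim.AdamW(table.parameters(), lr=0.1)
for step in range(200):
    xb, yb = toy_batch()
    loss = batch_loss(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

loss_after = estimate_loss()
print(f"avg loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The same pattern could be applied to the real model by evaluating on batches from `get_batch("train")` and `get_batch("val")` every few hundred steps instead of printing every step.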

In [109]:
output = model.generate(torch.zeros((1, 1), dtype=torch.long), 1000)

print(decode(output[0].tolist()))



Anousstem ale
Gllere; and lenk t hararo I me he. than'dis, se ho r:
Hor'the,


Anchatlald tharematrt ny.
MPrcau,
Fove?

HAnt minde ING nneendirove byoingld,

L: we me e.
Tovely tst ter ED:
MEE f hird ivasavelo oth theme-thos med?
Yo opotavo thovant atherus chessy h sathelors atwigre htou LTUENGUn w,
Ye higllel m nghed ardr hisss;
ARLAfithasoaso wie ye es ced, hersw hepr:
Whistrd owie, d! oy thicon allapotaigice.
CAnd f se theve rdat,

IUTw'ste, marig arethoushourmefred, hery,

hin'dwr: shint has g st e bose?

Murd nd--mime, bun me
I brr orthon tourowd I fowin:
F Twh pes arshe
AUShe gon

LIAntuce pory lure my t; houtho aneanghatil bamppug beavey prere, e yolin,

Peng the br;

ANCEd tow wethemo nk,
LLENG ppoit my theshoth ay barus whe hanowhod thes fioprtofenavis pllishaton thaus,
Dof insthare?
AGoowhen wesu wh urorent swono as thedengndscotr in m ther.
ICl n mo oy s t hord wredo w wo

HASTE ghes l sus manorng ounangnt mear brteeen' psthe, anced prs,
RDo wotoon, Cangre, matedeacheramye 
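
The text differs on every run because `generate` samples the next token from the softmax distribution instead of always taking the most likely one. A small sketch of the difference, using made-up logits for a 3-token vocabulary:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.tensor([2.0, 1.0, 0.1])  # made-up scores for three tokens
probs = F.softmax(logits, dim=-1)       # non-negative, sums to 1

# Greedy decoding would always return the same token.
greedy = torch.argmax(probs).item()

# Sampling picks each token in proportion to its probability.
samples = torch.multinomial(probs, num_samples=1000, replacement=True)
counts = torch.bincount(samples, minlength=3)

print(probs)
print(greedy, counts.tolist())
```

Less likely tokens still get picked some of the time, which is what gives the generated text its variety.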

In [86]:
import torch

torch.manual_seed(1337)

B, T, C = 4, 8, 2

# Let's create a mock sequence of embedded tokens.
x = torch.randn(B, T, C)

x.shape


torch.Size([4, 8, 2])

A naive way to predict the next token is to use the mean of all previous tokens' embeddings.

$$ \hat{x}_t = f\left(\sum_{i=0}^{t-1} x_i\right) $$


In [12]:
%%time
# Slow way, for-loop...

xbow = torch.zeros((B, T, C))

for b in range(B):  # For each sequence in batch.
    for t in range(T):  # For each time step in seq.
        xprev = x[b, : t + 1]  # (t + 1, C): positions 0..t, current token included
        xbow[b, t] = torch.mean(xprev, dim=0)

CPU times: user 0 ns, sys: 3.99 ms, total: 3.99 ms
Wall time: 10.8 ms


In [13]:
%%time

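# Fast way - matrix mult!
# Lower-triangular matrix of ones: row t has ones at positions 0..t,
# so it selects the current token and everything before it.

wei = torch.tril(torch.ones(T, T))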

CPU times: user 3.2 ms, sys: 330 µs, total: 3.53 ms
Wall time: 1.59 ms
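
The triangular matrix only selects past positions. To turn it into an averaging operation, each row still needs to be normalized so that it sums to 1, and then multiplied with `x`. The sketch below is one way to finish that step (the name `xbow2` is an assumption) and checks the result against the loop version.

```python
import torch

torch.manual_seed(1337)

B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Reference: the explicit double loop from above.
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, : t + 1].mean(dim=0)

# Matrix version: row t of wei holds 1 / (t + 1) at positions 0..t and 0 elsewhere.
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)

# (T, T) @ (B, T, C) broadcasts over the batch dimension -> (B, T, C).
xbow2 = wei @ x

print(torch.allclose(xbow, xbow2, atol=1e-5))
```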
