To use my existing Python environment, I had to give Visual Studio Code the path to my environments folder.

In [3]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(f"Length of dataset: {len(text)} characters.")

# There are a total of 65 unique characters in the dataset.
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(vocab_size)
print("".join(chars))

# We tokenize the text at the character level: each character is represented by an integer.
# Sub-word tokenizers are also possible (ChatGPT uses tiktoken).
# We first create mappings between characters and integers using dictionaries.
chtoi = {ch:i for i,ch in enumerate(chars)}
itoch = {i:ch for i,ch in enumerate(chars)}

def encode(s):  
    return [chtoi[ch] for ch in s] # Take a string, output list of integers.

def decode(list_int):
    return "".join([itoch[i] for i in list_int]) # Take a list of integers, output string.

Length of dataset: 1115394 characters.
65

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
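As a quick sanity check of the character-level scheme, here is a minimal sketch of the same encode/decode round trip on a toy string. The string "hello world" and the `toy_` names are made up for illustration and stand in for input.txt and the real mappings.

```python
# Toy corpus standing in for input.txt (an assumption for illustration).
toy_text = "hello world"
toy_chars = sorted(set(toy_text))
toy_chtoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itoch = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    # String -> list of integer ids
    return [toy_chtoi[ch] for ch in s]

def toy_decode(ids):
    # List of integer ids -> string
    return "".join(toy_itoch[i] for i in ids)

ids = toy_encode("hello")
print(toy_chars)        # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(ids)              # [3, 2, 4, 4, 5]
print(toy_decode(ids))  # hello
```

Decoding an encoded string always returns the original string, as long as every character was present when the vocabulary was built.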


In [4]:
# We now encode entire "input.txt" and save it in a torch tensor.
import torch
import torch.nn as nn
from torch.nn import functional as F
data = torch.tensor(encode(text))

n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

When we train a transformer, we only work with random chunks we take from the dataset. 

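A chunk of 9 characters contains 8 training examples of increasing context length (1 to 8 characters). The maximum context length we train with is given by block_size. Training on every context length up to block_size is useful for inference, because the transformer gets used to predicting from as little as one character of context. At inference time the model can never see more than block_size tokens at once, so inputs longer than block_size have to be cut into chunks of at most block_size tokens.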

In [5]:
block_size = 8

print("CONTEXT")
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"When {context} is the context, the target is {target}.")

CONTEXT
When tensor([18]) is the context, the target is 47.
When tensor([18, 47]) is the context, the target is 56.
When tensor([18, 47, 56]) is the context, the target is 57.
When tensor([18, 47, 56, 57]) is the context, the target is 58.
When tensor([18, 47, 56, 57, 58]) is the context, the target is 1.
When tensor([18, 47, 56, 57, 58,  1]) is the context, the target is 15.
When tensor([18, 47, 56, 57, 58,  1, 15]) is the context, the target is 47.
When tensor([18, 47, 56, 57, 58,  1, 15, 47]) is the context, the target is 58.
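A minimal sketch of one common way to handle an input longer than block_size at inference time: keep only the most recent block_size tokens as context. The 12-token prompt is made up for illustration. The bigram model used later only looks at the last token, so it would not need this step.

```python
import torch

block_size = 8

# Hypothetical prompt of 12 token ids, longer than block_size.
idx = torch.arange(12).unsqueeze(0)   # (1, 12)

# Keep only the last block_size tokens as context.
idx_cond = idx[:, -block_size:]       # (1, 8)
print(idx_cond)                       # tensor([[ 4,  5,  6,  7,  8,  9, 10, 11]])
```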


In [6]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for prediction?

def get_batch(split):
    """
    We obtain a context and target tensor of size (batch_size, block_size)
    """
    data = train_data if split=="train" else val_data
    ix = torch.randint(low=0, high=len(data)-block_size, size=(batch_size,))

    # Stack the sampled 1-D chunks as rows, giving tensors of size (batch_size, block_size)
    X = torch.vstack([data[i:i+block_size] for i in ix])
    Y = torch.vstack([data[i+1:i+block_size+1] for i in ix])

    return X,Y
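To see what a batch looks like, here is a small self-contained sketch of the same sampling logic. It uses `torch.arange(100)` as a stand-in for train_data (an assumption for illustration), which makes it easy to see that Y is X shifted by one position.

```python
import torch

torch.manual_seed(0)
toy_data = torch.arange(100)   # stand-in for train_data (assumption)
batch_size, block_size = 4, 8

# Random start offsets, one per sequence in the batch.
ix = torch.randint(low=0, high=len(toy_data) - block_size, size=(batch_size,))
X = torch.stack([toy_data[i:i + block_size] for i in ix])
Y = torch.stack([toy_data[i + 1:i + block_size + 1] for i in ix])

print(X.shape, Y.shape)                  # torch.Size([4, 8]) torch.Size([4, 8])
print(torch.equal(X[:, 1:], Y[:, :-1]))  # True: targets are the inputs shifted by one
```

Since `high` is exclusive in `torch.randint`, the largest start offset still leaves room for the extra target token at the end.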

# BIGRAM

Bigrams are a very simple model. They simply use a look-up table and no context. They use only the current character to predict the next. 

The objective of the generate() function is to extend a (batch_size, block_size) tensor of token indices along the time dimension by predicting new tokens one at a time. Each step maps (B,T) -> (B,T+1).

Logit: the raw output of a neuron (here, the unnormalized score for each possible next token) before an activation function such as softmax is applied.
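A small sketch of how logits relate to probabilities and to the loss used further down. The three logit values are made up for illustration. Softmax turns logits into probabilities, and the cross-entropy loss is the negative log-probability assigned to the correct class.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.0, -1.0]])   # (N=1, C=3), made-up scores
target = torch.tensor([0])                  # index of the correct class

probs = F.softmax(logits, dim=-1)           # probabilities, sum to 1
manual = -torch.log(probs[0, target[0]])    # negative log-likelihood by hand
builtin = F.cross_entropy(logits, target)   # takes raw logits, not probabilities

print(probs.sum().item(), manual.item(), builtin.item())
```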

In [7]:
idx = torch.tensor([[0,4,6,2],
                    [3,7,8,9]])  # size: (batch_size, block_size)

token_embedding_table = nn.Embedding(vocab_size, vocab_size)
logits = token_embedding_table(idx) # size: (batch_size, block_size, vocab_size)

print(logits.shape)

torch.Size([2, 4, 65])
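One way to build intuition for nn.Embedding is to compare it with one-hot encoding: looking up row i of the table gives the same result as multiplying a one-hot vector by the weight matrix. The sketch below checks this on a small table; the vocabulary size of 5 is chosen only for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 5                                        # small, for illustration only
table = nn.Embedding(vocab_size, vocab_size)
idx = torch.tensor([[0, 3], [4, 1]])                  # (B=2, T=2)

lookup = table(idx)                                   # (2, 2, 5): row lookup
one_hot = F.one_hot(idx, num_classes=vocab_size).float()
via_matmul = one_hot @ table.weight                   # (2, 2, 5): same rows via matmul

print(torch.allclose(lookup, via_matmul))             # True
```

In the bigram model, each row of the table therefore holds the logits for the next token given the current one.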


In [8]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()

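        # First argument is the number of embeddings (vocab_size). Second argument is the size of the vector stored
        # for each token; here it is also vocab_size, so each row directly holds the logits for the next token.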
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        """
        Embeddings are used when working with categorical data. They are often used to map discrete tokens
        (such as characters in a text) to continuous vectors.

        Useful link.
        https://spltech.co.uk/in-pytorch-what-is-nn-embedding-for-and-how-is-it-different-from-one-hot-encding-for-representing-categorical-data/?utm_content=cmp-true
        """
        # idx and targets are tensors of size (batch_size, block_size)
        logits = self.token_embedding_table(idx)   # size: (batch_size, block_size, vocab_size)

        if targets is None:
            loss = None

        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        """
        We call this function to generate new characters.
        """
        for _ in range(max_new_tokens):
            # We first get the predictions
            logits, loss = self(idx)  # (B,T,C)
            
            # The bigram model predicts from the current token only, so we keep just the last time step.
            logits = logits[:, -1, :]  # (B,C)

            # We then apply softmax to get probabilities.
            probs = F.softmax(logits, dim=-1)  # (B,C)

            # We now sample from the probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # (B,1)

            # Finally, we append
            idx = torch.hstack([idx, idx_next])  # (B, T+1)

        return idx

@torch.no_grad()
def estimate_loss(eval_iters):
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            X, Y = X.to(device), Y.to(device)  # match the model's device
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [9]:
# hyperparameters
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
max_iters = 2000  # note: the log below reaches iter 4000, so it was produced with a larger max_iters
eval_interval = 1000
learning_rate = 1e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
# ------------

model = BigramLanguageModel(vocab_size)
m = model.to(device)

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):
    if iter % eval_interval == 0:
        losses = estimate_loss(eval_iters)
        print(f"iter: {iter}  train_loss: {losses['train']:.4f}  val_loss: {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')
    xb, yb = xb.to(device), yb.to(device)  # match the model's device

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

iter: 0  train_loss: 4.6162  val_loss: 4.6193
iter: 1000  train_loss: 2.4922  val_loss: 2.5184
iter: 2000  train_loss: 2.4764  val_loss: 2.4926
iter: 3000  train_loss: 2.4600  val_loss: 2.4883
iter: 4000  train_loss: 2.4560  val_loss: 2.4900
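A useful reference point for these numbers is the loss of a model that assigns equal probability to all 65 characters. The short calculation below shows that this is ln(65). The loss of 4.6162 at iter 0 sits somewhat above this value, which is what we would expect when the randomly initialised embedding table produces a non-uniform distribution.

```python
import math

vocab_size = 65
# Cross-entropy of a uniform prediction: -log(1 / vocab_size)
uniform_loss = -math.log(1.0 / vocab_size)
print(f"{uniform_loss:.4f}")   # 4.1744
```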


In [10]:
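prompt = "Thou art"
context = torch.tensor(encode(prompt), device=device).unsqueeze(0)  # (1, T), on the model's device
print(context.shape)

generated = decode(m.generate(context, 200)[0].tolist())  # separate name, so the dataset variable `text` is not overwritten
print(generated)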

torch.Size([1, 8])
Thou art, t, ine t nghanstl stomy t t ncthever hastind e fr balawin fas, mirengered fap ig' had yocrs.
IIncearuld st n,

RDort saspe ariasth o towsixf my hin angal der lled
S m
LUKEThyour s.

ARCosarnevik e blode d bllo ttag hir wat? t t s
BEO: s o, ad.

Cl IIORDomart k'd CENCEMIUS:
OMu tr'e;
A: the thadswoJULEO scine to: y the. anges, War pETrmevanemy gacouthe st wis.
THENSThe; cksimer r ar se, sh HAnt e


In [13]:
torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)
x.shape

b = 1
t = 2
print(x[b, :t+1, :].shape)

torch.Size([3, 2])


# TRANSFORMER

### Simple self-attention

What we want to do now is to code up the simplest type of attention: for each batch element independently, and for each position t, we take the mean of the token at t and all previous tokens. We will refer to this tensor as xbow (bag of words). The name comes from the fact that averaging is essentially just throwing all the words into a bag, ignoring their order.

In [72]:
# xbow[b,t] = mean_{i <= t} x[b,i]
xbow = torch.zeros_like(x)

for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1, :] # (t+1, C)
        xbow[b,t] = torch.mean(xprev, dim=0)
print(xbow[0])

# There is a way to make this code much more efficient. We can perform this weighted aggregation with matrix multiplication.
low_triangular_ones = torch.tril(torch.ones((T,T)))
divisor = torch.arange(1, T+1).unsqueeze(1)  # (T, 1); kept on the CPU, like x
low_triangular_ones /= divisor  # row i is divided by i+1 through broadcasting

# We can use einsum instead of relying on broadcasting.
low_triangular_ones = torch.tril(torch.ones((T,T)))
divisor = 1/torch.arange(1, T+1)  # (T,); kept on the CPU, like x
low_triangular_ones = torch.einsum('ij, i -> ij', low_triangular_ones, divisor)

print(low_triangular_ones)

# We perform matrix multiplication over each batch independently. We can use einsum for tensor operations.
xbow_einsum = torch.einsum('ij, ajk-> aik', low_triangular_ones, x)

# With matmul, the (T, T) matrix is broadcast across the batch dimension of x.
xbow_matmul = low_triangular_ones @ x

print(torch.allclose(xbow, xbow_einsum))
print(torch.allclose(xbow, xbow_matmul))

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
True
True
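The same trick is easier to follow on a tiny example. The sketch below uses a 3x3 lower-triangular averaging matrix and a hand-picked 3x2 matrix (values chosen only for illustration): row t of the result is the mean of rows 0..t of the input.

```python
import torch

a = torch.tril(torch.ones(3, 3))
a = a / a.sum(dim=1, keepdim=True)   # rows: [1,0,0], [1/2,1/2,0], [1/3,1/3,1/3]

b = torch.tensor([[2.0, 7.0],
                  [6.0, 4.0],
                  [6.0, 5.0]])

c = a @ b
print(c)
# Row 0: [2, 7]                 (just the first row of b)
# Row 1: [4, 5.5]               (mean of the first two rows)
# Row 2: [14/3, 16/3]           (mean of all three rows)
```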


In [79]:
# The @ operator in PyTorch performs batched matrix multiplication when the input tensors have compatible shapes:
# here the (T, T) matrix is broadcast over the batch dimension of x.

wei = torch.tril(torch.ones((T,T)))
wei /= wei.sum(dim=1, keepdim=True) # keepdim=True stops PyTorch from squeezing the tensor along the dimension we summed over.
xbow2 = wei @ x  # (T, T) @ (B, T, C): wei is broadcast to (B, T, T), result is (B, T, C)

# There is another equivalent version that uses softmax
tril = torch.tril(torch.ones((T,T)))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril==0, float('-inf'))  # wei has -inf in all elements above the main diagonal. This prevents tokens from communicating with future tokens.
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x # Equivalent to torch.einsum('ij, ajk -> aik', wei, x)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

### Self-attention

What's interesting about the above implementation using softmax is that it allows for soft attention. We do not have to initialise wei with torch.zeros((T,T)). We can initialise with different affinities between the tokens. We will aggregate the values depending on how interesting tokens find each other.
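A minimal sketch of what non-zero affinities do. The affinity values below are made up for illustration: the third token has a higher affinity (2.0) for the first token than for the others, so after masking and softmax it puts most of its weight there, while rows with equal affinities still average uniformly.

```python
import torch
import torch.nn.functional as F

T = 3
tril = torch.tril(torch.ones(T, T))

# Hypothetical affinities: token 2 is especially interested in token 0.
wei = torch.tensor([[0.0, 0.0, 0.0],
                    [0.0, 0.0, 0.0],
                    [2.0, 0.0, 0.0]])

wei = wei.masked_fill(tril == 0, float('-inf'))  # hide future tokens
wei = F.softmax(wei, dim=-1)                     # each row sums to 1
print(wei)
```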

I now want to gather information from the past in a data-dependent way. This is the problem self-attention solves. Every single token will emit two vectors: a query and a key.

    query: what am I looking for             key: what do I contain

We get the affinities as follows. Token t's query vector is dotted with the key vectors of all tokens up to and including t. These dot products form the wei matrix. Thus, if a key and a query are aligned, their dot product is large and those two tokens interact strongly. Additionally, we don't matrix multiply wei directly with x: we first project x into a value matrix and aggregate that instead.

In [91]:
torch.manual_seed(1337)
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

# Let's see a single head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)   # function
query = nn.Linear(C, head_size, bias=False) # function
value = nn.Linear(C, head_size, bias=False) # function

# Now every token has a key and a query vector associated to it.
k = key(x)    # (B, T, head_size)
q = query(x)  # (B, T, head_size)
wei = q @ k.transpose(-2, -1) # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
wei = torch.einsum('ijk, ilk -> ijl', q, k) # equivalent

wei *= head_size**(-0.5)  # we scale the values to prevent sharpening of wei after softmax

# wei now contains an affinity value between all the tokens as we have performed dot products between keys and queries.
tril = torch.tril(torch.ones((T,T)))
wei = wei.masked_fill(tril==0, float('-inf')) 
wei = F.softmax(wei, dim=-1)

v = value(x)  # (B, T, head_size)
out = wei @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
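One way to check that the mask really blocks information from the future is to change only the last token and confirm that the outputs at all earlier positions stay the same. The sketch below does this with a small, self-contained single head; the sizes and the `attend` helper are chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C, head_size = 1, 5, 8, 4
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
tril = torch.tril(torch.ones(T, T))

def attend(x):
    # Hypothetical helper: one head of masked, scaled self-attention.
    q, k, v = query(x), key(x), value(x)
    wei = q @ k.transpose(-2, -1) * head_size ** -0.5   # (B, T, T)
    wei = wei.masked_fill(tril == 0, float('-inf'))
    wei = F.softmax(wei, dim=-1)
    return wei @ v                                      # (B, T, head_size)

x = torch.randn(B, T, C)
x_changed = x.clone()
x_changed[:, -1, :] = torch.randn(C)   # perturb only the last token

with torch.no_grad():
    out, out_changed = attend(x), attend(x_changed)

print(torch.allclose(out[:, :-1], out_changed[:, :-1]))  # earlier positions are unaffected
print(torch.allclose(out[:, -1], out_changed[:, -1]))    # the last position does change
```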

For example, let's say we're the 8th token and we're a vowel. We create a query: "Hey. I'm a vowel in the 8th position and I'm looking for any consonant at positions up to 4". Then all past tokens emit their key. Maybe one of the tokens has a key which satisfies those requirements. That key vector would have a high number in the specific channel that represents that requirement, which would create affinity between these two tokens.

Let's develop an intuition behind key, query, value.

    x: (B, T, C)  Contains private information for each token. Now for the purpose of this single head of self-attention, here is some information about me:

    + query: here's what I'm interested in

    + key: here's what I have

    + value: if you find me interesting, here's what I will communicate to you

So v is the thing that gets aggregated for the purpose of this single head.

IDEA: As we have vectors, there is no feature interaction like we would get in a Gaussian process. That is, maybe the vowel cares about the consonant being up to the 4th position only if one of the other channels is close to 0.

NOTES

+ Note that attention has no notion of space: it simply acts on a set of vectors, unlike convolution, which acts spatially. This is why we need to positionally encode tokens.

+ Each example from each batch is processed independently.

+ In an encoder attention block, all tokens are allowed to communicate with each other (simply remove the masking operation). This is useful, for example, when we want to perform sentiment analysis of a sentence. The masked version we implemented is called a decoder attention block, and is usually used in autoregressive settings such as language modelling.

+ What we've implemented is called "self-attention" because the same source 'x' produces the keys, queries and values. Cross-attention is used when there's a separate source of tokens we wish to pull information from. In encoder-decoder transformers, you can have the case where the queries are produced from 'x' but the keys and values come from a whole separate external source, maybe from encoder blocks that encode some context we want to condition on. 

+ Scaled attention divides wei by sqrt(head_size). This makes it so that when the inputs Q and K have unit variance, wei has unit variance too, so softmax stays diffuse and does not saturate too much. This matters especially at initialisation: if wei contains very positive and very negative values, the softmax output converges towards a one-hot vector, sharpening onto whichever value happens to be the most positive.
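The effect of the scaling in the last note can be checked numerically. The sketch below uses random unit-variance q and k (sizes chosen for illustration): the raw dot products have variance close to head_size, while the scaled ones have variance close to 1. The second part shows how multiplying made-up logits by a large factor sharpens the softmax towards one-hot.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, head_size = 64, 32, 16
q = torch.randn(B, T, head_size)   # unit variance
k = torch.randn(B, T, head_size)   # unit variance

raw = q @ k.transpose(-2, -1)      # variance is roughly head_size
scaled = raw * head_size ** -0.5   # variance is roughly 1
print(raw.var().item(), scaled.var().item())

# Larger logits make softmax sharper.
logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
diffuse = F.softmax(logits, dim=-1)
sharp = F.softmax(logits * 8, dim=-1)
print(diffuse)
print(sharp)
```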

We can now implement one head of self-attention in "transformer.py".
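As a rough sketch of what such a head could look like as a module, here is one possible version that packages the steps above. The class name `Head` and the arguments `n_embd`, `head_size` and `block_size` are assumptions made for illustration; the actual contents of "transformer.py" are not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of masked (decoder-style) self-attention. Illustrative sketch."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # A buffer moves with the module across devices but is not trained.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                          # (B, T, head_size)
        q = self.query(x)                                        # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5      # (B, T, T), scaled affinities
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)                                        # (B, T, head_size)
        return wei @ v                                           # (B, T, head_size)

torch.manual_seed(1337)
head = Head(n_embd=32, head_size=16, block_size=8)
out = head(torch.randn(4, 8, 32))
print(out.shape)   # torch.Size([4, 8, 16])
```

Slicing the mask with `[:T, :T]` lets the same head handle sequences shorter than block_size, which is the situation during generation from a short prompt.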