# Tiny GPT Implementation

Welcome to this walkthrough on how to implement GPT from scratch! 

Much of this notebook is taken from Andrej Karpathy's video ["Let's build GPT"](https://www.youtube.com/watch?v=kCc8FmEb1nY&ab_channel=AndrejKarpathy) and its corresponding resources. However, this notebook is reworked to provide a more notebook-first experience, to aid hands-on learning.

This notebook will cover basic concepts such as attention and next-token prediction that are crucial to understanding how GPT works. This walkthrough will not cover much of the finer details of reproducing GPT performance. We'll be using a smaller dataset and single-GPU training.

## Dataset
Let's download our dataset that we will be training on. GPT-2 and later iterations of GPT were trained on closed-source, large, web-scale datasets. We'll instead be using a much smaller dataset for now for instructional and practical purposes.

In [4]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-02-19 01:44:47--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.3’


2024-02-19 01:44:47 (25.5 MB/s) - ‘input.txt.3’ saved [1115394/1115394]



## EDA
Let's take a look at our dataset and what it looks like. First, we need to open it:

In [5]:
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

Now, let's take a look at the length, some example text, and the alphabet we're dealing with.

In [6]:
print("======= Dataset Length =======")
print("Length of the dataset in characters:", len(text))

print("======= Sample Text =======")
print(text[:500])

chars = sorted(list(set(text)))
vocab_size = len(chars)
print("======= Alphabet =======")
print("Alphabet:", "".join(chars))
print("Alphabet Size:", vocab_size)

Length of the dataset in characters: 1115394
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor
Alphabet: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Alphabet Size: 65


We can see that our alphabet is consisted of uppercase and lowercase alphabetical characters, in addition to some punctuation and special characters. However, we need to convert these characters into some sort of number representation to feed them into our language model. This process is called **tokenization**. A simple way to transform our characters into tokens is just to take a simple mapping of them, assigning each unique character a number.

In [7]:
# Iterate through all the numbers and create a bidirectional mapping
# Start with the string -> integer mapping
stoi = { ch:i for i,ch in enumerate(chars)}
# Then create the integer -> string mapping
itos = { i:ch for i,ch in enumerate(chars)}

Now, let's create an `encode(input: str) -> list[int]` function and a `decode(input: list[int]) -> str` function. `encode` will use our mapping to convert a string to a list of tokens representing the characters, and `decode` will convert a list of tokens to a string representing the tokens.

In [8]:
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

Try encoding and decoding a piece of text!

In [9]:
input = "Hello world!"
print(f"Input: {input}")
print("==> encoding")
encoded = encode(input)
print(encoded)
print("<== decoding")
decoded = decode(encoded)
print(decoded)

Input: Hello world!
==> encoding
[20, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42, 2]
<== decoding
Hello world!


We have just created our first encoder and decoder! Normally, tokenization is done with groups of characters rather than single characters like we are doing here. There's a trade off between accuracy and complexity when tokenizing (TODO: describe this in more detail)

Now that we have our encoder, we want to use it to preprocess our training data. Let's try encoding the entire dataset using our encoder!

In [10]:
data = encode(text)

Now, let's wrap it into a torch tensor to allow us to perform efficient matrix operations.

In [11]:
import torch
data = torch.tensor(data, dtype=torch.long)
print(data.shape, data.type)
print(data[:100])

torch.Size([1115394]) <built-in method type of Tensor object at 0x7f822474bd80>
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


Now that we have our data tokenized, let's split it into a training and validation split. Let's use a 90% train/val split.

In [12]:
# Let's now split up the data into train and validation sets
percent_train = 0.9
n = int(percent_train*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

When we train a model, we want to have some sort of loss that we try to minimize, and something we want to predict. Many large language models today use next-token prediction. Just like the name suggests, at every step in the generation process we want to predict the token that comes next. We can then combine all these tokens to get our final output. 

Let's take a look at what this means. Let's say we have a `block_size` of 8. This is essentially how large our context window is. For example, when predicting the next token, with a `block_size` of 8, we will not look at any tokens more than 8 tokens behind the one we are trying to predict. In practice, block sizes are much bigger, often in the thousands.

In [13]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

This single block of 8 tokens is actually providing us a lot of input/output pairs we can use to train our model. For example, we know that `47` is the "right answer", or our target when the input is `[18]`. Extending this, `56` is our target when our input is `[18, 47]`, and so on.

In [14]:
x = train_data[:block_size+1]
y = train_data[1:block_size+1]
print("Block:", x)
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"====\nInput: {context}\nTarget: {target}")

Block: tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])
====
Input: tensor([18])
Target: 47
====
Input: tensor([18, 47])
Target: 56
====
Input: tensor([18, 47, 56])
Target: 57
====
Input: tensor([18, 47, 56, 57])
Target: 58
====
Input: tensor([18, 47, 56, 57, 58])
Target: 1
====
Input: tensor([18, 47, 56, 57, 58,  1])
Target: 15
====
Input: tensor([18, 47, 56, 57, 58,  1, 15])
Target: 47
====
Input: tensor([18, 47, 56, 57, 58,  1, 15, 47])
Target: 58


Is this the best we can do? We can actually increase our efficiency by utilizing the parallelism provided by GPUs to process multiple blocks at the same time. The number of blocks we process at the same time is called our batch size. Let's see what this looks like.

In [15]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print("Training:")
print(xb)
print("Target:")
print(yb)

Training:
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
Target:
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])


What is the relation between the Training tensor and the Target tensor? Do you notice any similarities?

# The Model
Now, let's build out the actual model that will allow us to generate text.

In [16]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)
        return logits

In [17]:
print(xb.shape)
xb

torch.Size([4, 8])


tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])

In [18]:
model = BigramLanguageModel(65)

In [19]:
model.forward(xb)[0][0]

tensor([-1.5101, -0.0948,  1.0927,  0.1505,  1.6347, -0.0518,  0.4996,  0.7216,
        -0.8968, -0.4122,  1.0030,  0.8508,  0.2178,  0.0328, -0.1699,  1.0659,
        -0.6177,  1.1824,  0.0214, -0.2154, -1.4623,  2.1707,  0.1624,  1.0296,
         0.4154,  0.6207,  0.2341, -0.0326,  1.0124,  1.5122, -0.3359,  0.2456,
         1.8682,  0.7536, -0.1177, -0.1967, -0.9552, -0.8995, -0.9583, -0.5945,
         0.1321, -0.5406,  0.1405, -0.7321,  1.1796,  1.3316, -0.2094,  0.0960,
         0.9040, -0.4032,  0.3027, -0.8034, -1.2537, -1.5195,  0.7446,  1.1914,
        -0.8061, -0.6290,  1.2447, -2.4400,  0.8408, -0.3993, -0.6126, -0.6597,
         0.7624], grad_fn=<SelectBackward0>)

In [20]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            #print(idx_next)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


Now, instantiate your model and generate 100 text tokens.

In [21]:
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3


As expected, this produces gibberish. The token's don't actually communicate with each other, and we haven't actually trained our model, so let's do that. We will follow the paper "Attention is All You Need" to implement multi-headed self-attention. But first, what is attention?

# Attention
According to the paper "Attention is All You Need", attention is defined as the expression below (more specifically, scaled dot-product attention).
$$\text{Attention}(Q, K ,V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$$
But what does this mean exactly? When we want to predict the next token, we need to find a way to use the tokens that came before it as information. 

There are many ways to do this with varying complexity, but the simplest way is to take a weighted average of the tokens that came before the current token. For example, we want the 5th token to obtain information ("pay attention to") the 4th, 3rd, 2nd, and 1st tokens, but not the 6th or 7th token.

We can implement this in a for loop for each token, but doing so is very inefficient. Instead, we can use matrices to utilize the parallelism provided by NumPy matrix operations. First, we can devise a matrix that allows each token to only communicate with tokens that appear prior to it.

In [22]:
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
a

tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])

Let's see what this does when multiplied with another matrix.

In [23]:
b = torch.randint(0,10,(3,2)).float()
b

tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])

In [24]:
c = a @ b
c

tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])

The 1st row of matrix $c$ is equal to the 1st row of matrix $b$, the 2nd row is equal to the average of the 1st and 2nd rows, and the 3rd row is the average of the 1st, 2nd, and 3rd rows. You can verify this for yourself.

We can also do this as a batched operation, by extending this into a 3 dimensional array. PyTorch infers extra dimensions automatically, so we don't need to change our weight matrix. 

For example, if we have a weight matrix with shape $(T, T)$ (representing relative weights in the time dimension), and multiply this with a matrix $A$ with shape $(B, T, C)$ (batch, time, channels), PyTorch will infer the outermost $B$ dimension for our weight matrix. The result will be a $(B, T, T) \times (B, T, C) \rightarrow (B, T, C)$ matrix.

What a $(B,T,C)$ dimensional matrix actually means in this contex is that we have a $B \times T$ arrangement of tokens, while the information contained in each token is $C$ dimensional.

In [25]:
torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
x = torch.randn(B,T,C)
wei = wei @ x
print(x.shape) # we expect this to be (B, T, C) = (4, 8, 2)

torch.Size([4, 8, 2])


We can also construct this weighted matrix in a variety of different ways. For example, we can use a softmax, which does the same thing:

In [26]:
T = 4
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500]])

# Keys, Queries, Values

Now, let's revisit how we aggregate information. Previously, we took a simple average of all tokens prior to our current token. However, this causes each token to be weighted uniformly equally. In practice, certain tokens are probably more important for certain words than others. For example, in the sentence "the red dog", the word "red" is probably more important to the word "dog" than the word "the". As such, we want to be able to model this.

A Transformer does this with 3 sets of learnable values. These are called keys, queries, and values. You might have seen these represented as $(K, Q, V)$.

When we want to predict a token, we send out a Query, which can be interpreted as a sort of request. We take dot-products of the Query with different Keys, which act as a sort of label for what they contain. The higher this dot-product, the more we pay attention to that word when computing the current token.

We'll start with single-headed attention first, then move to multi-headed attention. First, we'll create a random matrix with dimensions $(B,T,C)$

In [27]:
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

Keys, queries, values are learned values. We can represent them as a simple linear layer. We want this to be a simple linear transform, so we don't want to set a bias term. This will have dimensions $(C, \text{head size})$. TO DO: explain $\text{head size}$?

In [28]:
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

Now, let's pass our input data through $K$ and $Q$ individually. Notice that in this stage, they don't communicate yet. This output will have dimensions $(B, T, C) \times (B, C, \text{head size}) \rightarrow (B, C, \text{head size})$.

In [29]:
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
print(k.shape, q.shape)

torch.Size([4, 8, 16]) torch.Size([4, 8, 16])


Now, we want to take the dot product. The efficient way to do this is through a matrix multiply. However, both are $(B, C, \text{head size})$. We need to transpose one of them before multiplying to get a matrix of shape $(B, C, C)$. These will be our new weights.

In [30]:
wei = q @ k.transpose(-2, -1)
wei.shape

torch.Size([4, 8, 8])

However, there is a problem. Try printing the first row. Current tokens attend to future tokens, which should not happen. As a result, we need to apply the same transform as before to mask future tokens.

In [31]:
tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
print(wei[0])

tensor([[-1.7629,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf],
        [-3.3334, -1.6556,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf],
        [-1.0226, -1.2606,  0.0762,    -inf,    -inf,    -inf,    -inf,    -inf],
        [ 0.7836, -0.8014, -0.3368, -0.8496,    -inf,    -inf,    -inf,    -inf],
        [-1.2566,  0.0187, -0.7880, -1.3204,  2.0363,    -inf,    -inf,    -inf],
        [-0.3126,  2.4152, -0.1106, -0.9931,  3.3449, -2.5229,    -inf,    -inf],
        [ 1.0876,  1.9652, -0.2621, -0.3158,  0.6091,  1.2616, -0.5484,    -inf],
        [-1.8044, -0.4126, -0.8306,  0.5898, -0.7987, -0.5856,  0.6433,  0.6303]],
       grad_fn=<SelectBackward0>)


Similar to before, we want to take a softmax over this to make each row resemble a probability distribution that sums to 1.

In [32]:
wei = F.softmax(wei, dim=-1)
print(wei[0])

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)


Notice that while the general structure of the matrix is the same, prior tokens on each row are no longer equally weighted, but instead weighted according to the dot-product of their keys with the query.

Next, we need to pass our inputs through $V$, or our value matrix. Then, similar to the previous portions, we want to weight our values. In other words, we want to multiply our weights with our values.

In [33]:
v = value(x)
out = wei @ v
print(out.shape)

torch.Size([4, 8, 16])


There's a few things to notice here.
1. This is positionally invariant, at least for now. There is no notion of position. We can rearrange the words in the sentence and it will still work, although our predictions would similarly be misordered.
2. Batch dimensions are always independent. There is no communication across different batches.

This is also a good time to explain some different concepts:
1. The above example is a "decoder" block. This is a decoder block because we mask future tokens. "Encoder" blocks operate in the same way, but without masking. This is because when encoding we can see future tokens, but when generating (decoding) we can't.
2. "Self-Attention" means that the keys and values are produced from the same source as queries. Other types of attention such as cross-attention the queries are produced from x but the keys and values come from some other source, such as an encoder.

# Code

Now that we understand the basics of self-attention and Transformers, let's put it together and train our own model. Just run this next cell - we've seen everything here before.

In [34]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

Now, let's create the Head class, which represents a single head of self-attention. We'll need to fill in 2 methods, `__init__()` and `forward()`.

For `__init__()`:
- Create a linear layer for key, query, and value
- A register buffer for 'tril' is given to you already. A register buffer in PyTorch is a parameter that is not updated during backpropagation, but is included in the model's state_dict. They are commonly used for constants (such as in this case) or for running statistics (that are updated during each iteration). 'tril' in this case serves as a lower triangular mask that is of the correct dimensions
- Create a dropout layer

For `forward()`:
- Pass your input through your linear layer for keys
- Pass your input through your linear layer for queries
- Compute the dot product between queries and keys
- Mask this dot product matrix to make it lower triangular to get a weight matrix
- Take a softmax over the rows of the weight matrix
- Pass this weight matrix through a dropout layer
- Pass the inputs through the value linear layer
- Return the weight matrix multiplied by the ouputs of the value linear layer

In [35]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False) # (C, head_size) ?? is head_size == C
        self.query = nn.Linear(n_embd, head_size, bias=False) # (C, head_size)
        self.value = nn.Linear(n_embd, head_size, bias=False) # (C, head_size)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape 
        k = self.key(x)   # (B,T,C) -> (B, T, C) @ (C, C)??
        q = self.query(x) # (B,T,C) -> (B, T, C) @ (C, C)??
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

In [36]:


@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out



class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd) # (C, C), learnable linear layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1) # num_heads, (B, T, C)
        out = self.dropout(self.proj(out)) # pass out through linear layer
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """
    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


0.209729 M parameters


  from .autonotebook import tqdm as notebook_tqdm


step 0: train loss 4.4116, val loss 4.4022
step 100: train loss 2.6568, val loss 2.6670
step 200: train loss 2.5090, val loss 2.5059
step 300: train loss 2.4194, val loss 2.4336
step 400: train loss 2.3503, val loss 2.3566
step 500: train loss 2.2963, val loss 2.3127
step 600: train loss 2.2412, val loss 2.2501
step 700: train loss 2.2054, val loss 2.2190
step 800: train loss 2.1640, val loss 2.1871
step 900: train loss 2.1237, val loss 2.1495
step 1000: train loss 2.1031, val loss 2.1304
step 1100: train loss 2.0700, val loss 2.1188
step 1200: train loss 2.0391, val loss 2.0806
step 1300: train loss 2.0258, val loss 2.0650
step 1400: train loss 1.9933, val loss 2.0365
step 1500: train loss 1.9705, val loss 2.0291
step 1600: train loss 1.9639, val loss 2.0485
step 1700: train loss 1.9424, val loss 2.0130
step 1800: train loss 1.9079, val loss 1.9937
step 1900: train loss 1.9081, val loss 1.9875
step 2000: train loss 1.8860, val loss 1.9964
step 2100: train loss 1.8706, val loss 1.9731


In [37]:
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


BUCKINGHAM:
3 IVING LEARGARET:
Fallst not before you not me, slate ine,
And the every to rectinglited is took a well
the late of Oponestry?
The futh
sorroody, you it smile your
That heir beast.

HORTENMIO:
Towe them.
On let five thee them that
you she Polsmence! and not; must yet go yoursated my need;
Our meance a longear tou our his dobe.

Clivent Secan they da my sir, the severeed
Dear our somet.

JOHN MARCETBY:
The markly be ast this womere it the your
jepreek sleep not teempost; queech seemmen's lie?
Down. You hast fult-stand, theirs is the geglory.
Is speeks and envy fold be ane wind,
It! you 'dreat shall to: fortury.

Shalf Reamous.

Merry.

PAULINA:
They forth, doth with weirge, icleive man.

FROMVE:
Chile! then you you rone kirgn with armiss!
Pland, it it sees, why, to
And and rove to to father alling
Cry a knowly the prace is thunking
To to emproved, an you. Yet,y worth
down thou out savreht to brumble grave cruntent,
Beate you by the ishan but jroy broth,
Is I withy worl is 

In [39]:
test = "Hello Seyone! How are you doing today?"
encoded = encode(test)

In [40]:
data = torch.tensor(encoded, dtype=torch.long)

In [44]:
print(decode(m.generate([1], max_new_tokens=2000)[0].tolist()))

TypeError: list indices must be integers or slices, not tuple

In [None]:
print(decode(m.generate(, max_new_tokens=2000)[0].tolist()))