<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

This notebook is created by Zhuo Chen under [Creative Commons CC-BY license](https://creativecommons.org/licenses/by/4.0/) based on [the nanoGPT](https://github.com/karpathy/nanoGPT) created by [Andrej Karpathy](https://karpathy.ai) under [MIT license](https://github.com/karpathy/nanoGPT/blob/master/LICENSE).<br />
For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org<br />
___


# Building a Language Model

**Description:** This notebook describes:
* how to use PyTorch to build a character-level nanoGPT 
* how to train the nanoGPT on a tiny piece of Shakespeare
* how to use the trained model to generate more Shakespeare-like text
* what the transformer architecture looks like in code
 

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Advanced

**Completion Time:** 180 minutes

**Knowledge Required:** 
* Python Basics Series ([Start Python Basics 1](../Python-basics/python-basics-1.ipynb))
* Python Intermediate Series ([Start Python Intermediate 1](../Python-intermediate/python-intermediate-1.ipynb))
* Introduction to ChatGPT ([Start learning how ChatGPT works](https://join.slack.com/t/ithaka-constellate/shared_invite/zt-2bg6ctcqb-wf~4KVBB6QkE7Q2PdCNy3Q))

**Knowledge Recommended:** None

**Data Format:** .txt

**Libraries Used:** PyTorch

**Research Pipeline:** None
___

## Import libraries and packages

In [None]:
# Install the most recent version available via pip
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

In [None]:
# import libraries and packages
from pathlib import Path
import urllib.request
import torch

## Download the Shakespeare dataset

Let's download the text file containing the data we will use to train our model. The file contains a small sample of Shakespeare's works.

In [None]:
# Check if a data folder exists. If not, create it.
data_folder = Path('./data/')
data_folder.mkdir(exist_ok=True)

# download the Shakespeare data
url="https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
urllib.request.urlretrieve(url, './data/shake.txt')

# Success message
print('Sample file ready.')

## Prepare the Shakespeare dataset for character-level language modeling
In this section, we will get the following data from the downloaded Shakespeare dataset:
* all the unique characters that occur in the dataset, i.e. the vocab
* a mapping from characters to integers and vice versa, i.e. an encoder and decoder
* a training dataset and a validation dataset, i.e. train/val split

### Get all unique characters
We are building a character-level language model. This means the model, when generating new data, will use past characters to predict future characters. The training data and validation data are from the downloaded Shakespeare dataset. The unique characters from the Shakespeare dataset form the vocabulary of the model. Each character is a token. 

In [None]:
# read in the data
with open('./data/shake.txt', 'r') as f:
    content = f.read()
print(f"length of dataset in characters: {len(content):,}")

In [None]:
# get all unique chars
chars = sorted(list(set(content)))
vocab_size = len(chars)
print("all the unique characters:", ''.join(chars))
print(f"vocab size: {vocab_size:,}")

In [None]:
# get one character token from the list
chars[1]

### Create a mapping from characters to integers and vice versa

Computers process numbers, not letters or characters. This means that ultimately, each character in the vocab will be encoded as a number for the computer to process. Let's make an encoder that encodes characters into numbers. Also, let's make a decoder that decodes numbers into characters. The simplest choice is just to use the index number of the characters in the list `chars` we created above. 

In [None]:
# a quick reminder of enumerate()
for i,ch in enumerate(['a', 'b', 'c']):
    print(i, ch)

In [None]:
# create a character to int mapping
stoi = {ch:i for i,ch in enumerate(chars)}
# create an int to character mapping
itos = {i:ch for i,ch in enumerate(chars)}

# create an encoder and a decoder
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string 

You can try the encoder and decoder using an example. 

In [None]:
# try the encoder
ls = encode('This is nanoGPT')
print(ls)
# try the decoder
string = decode(ls)
print(string)

At this point, for each character in the vocab, you have an integer that maps to it. Basically, each character is indexed with an integer.
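
A quick way to check that the two mappings agree is a round trip: decoding an encoded string should give back the original string. The sketch below uses a tiny stand-in text (`toy_text` and the `toy_` names are hypothetical, not the Shakespeare file) so that it runs on its own. It also shows that a character outside the vocab raises a `KeyError`, because the character-to-integer mapping has no entry for it.

```python
# A toy stand-in for the dataset (hypothetical; the notebook uses shake.txt)
toy_text = "to be or not to be"

# Same construction as in the notebook: sorted unique characters form the vocab
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """Turn a string into a list of integer token ids."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Turn a list of integer token ids back into a string."""
    return ''.join(toy_itos[i] for i in ids)

ids = toy_encode("not to be")
print(ids)
print(toy_decode(ids))

# Round trip: decoding the encoding gives back the input
assert toy_decode(toy_encode("not to be")) == "not to be"

# A character that never occurs in the text has no id
try:
    toy_encode("z")
except KeyError as err:
    print("not in vocab:", err)
```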

### Split the dataset into training data and validation data
The majority, i.e. 90%, of the Shakespeare dataset will be used as training data. The rest will be used as validation data. 

In [None]:
# create the train and val splits
data = torch.tensor(encode(content), dtype=torch.long)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

In [None]:
# add context window
context_size = 8

In [None]:
### see what training data gives us

x = train_data[:context_size] # get one sequence of train_data
y = train_data[1:context_size+1] # get one sequence of gold answer from train_data
for t in range(context_size):
    context = x[:t+1].tolist()
    target = [y[t].item()]
    print(f"{decode(context)} -> {decode(target)}")

## Write the model
In this section, we will use the training data we get from the Shakespeare dataset to train a character-level language model. 
### Set the hyperparameters
Let's spend some time understanding the hyperparameters. 
* batch_size: this hyperparameter determines how many independent sequences the model will process in parallel
* block_size: the name might sound unfamiliar, but it actually means context window, i.e. the maximum context length we use to predict the next token
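* max_iters: total steps in optimization; each step processes one batch of training data
* eval_interval: how many optimization steps pass between two loss evaluations; the loss is printed at each evaluation
* learning_rate: how much we move in a step when minimizing the loss
* eval_iters: how many batches are averaged each time the loss is estimated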
* n_embd: number of dimensions in a token embedding
* n_head: number of heads in a self-attention layer

In [None]:
# import modules from pytorch
import torch.nn as nn # nn is the module for creating neural network
from torch.nn import functional as F # we'll use softmax and cross_entropy from F

# set the hyperparameters of the model
batch_size = 16 # how many independent sequences of chars the model processes in parallel
block_size = 32 # maximum context length for predictions
max_iters = 5000 # total steps in optimization (one step per training batch)
eval_interval = 100 # estimate and print the loss every 100 optimization steps
learning_rate = 1e-3 # how big a step in optimization is
device = 'cpu'
eval_iters = 200 # number of batches averaged each time the loss is estimated
n_embd = 64 # size of the token embedding 
n_head = 4 # num of heads in a self-attention layer

torch.manual_seed(1337) # ensure we all get the same result

### Define functions
We define two functions. 
The first function loads a batch of data for training or for evaluation. 
The second function estimates loss. 

In [None]:
# load one batch of data
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    # ix is a tensor of indices, each is the index of the starting char of a sequence in a batch
    ix = torch.randint(low=0, high=len(data) - block_size, size=(batch_size,)) 
    # x is the batch_size sequences used to predict the next character,each sequence is block_size chars long
    x = torch.stack([data[i:i+block_size] for i in ix])
    # y is the batch_size sequences used as gold answers, each sequence is block_size char long 
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device) # data and model need to be on the same device, cpu or gpu
    return x, y

Recall that `block_size`=32 and `batch_size`=16. Therefore, x is of shape (16, 32) and y is of shape (16, 32) as well.
Each time we run get_batch, we will get 16 sequences of characters and they are of length 32. 

In [None]:
### get shape of the x and y
ix = torch.randint(low=0, high=len(data) - block_size, size=(batch_size,)) 
x = torch.stack([data[i:i+block_size] for i in ix])
y = torch.stack([data[i+1:i+block_size+1] for i in ix])
print(x.size())
print(y.size())

Let's use a visualization to help us understand a batch of data.
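
One way to picture a batch is to print the inputs and targets as text. The sketch below mimics `get_batch` on a short stand-in string with hand-picked start positions (both are assumptions for illustration; the real function draws random positions from the encoded dataset). Each target row is the input row shifted one character to the right.

```python
# Stand-in text and settings (hypothetical; chosen so the output is easy to read)
toy_text = "First Citizen: Before we proceed any further, hear me speak."
toy_block_size = 8
toy_starts = [0, 15, 30]  # fixed start positions instead of torch.randint

for i in toy_starts:
    x_chars = toy_text[i:i + toy_block_size]          # input sequence
    y_chars = toy_text[i + 1:i + toy_block_size + 1]  # targets: same window shifted by one
    print(f"x = {x_chars!r}   y = {y_chars!r}")
    # At every position t, the target y_chars[t] is the character that follows x_chars[:t+1]
    assert all(y_chars[t] == toy_text[i + t + 1] for t in range(toy_block_size))
```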

In [None]:
# estimate loss
@torch.no_grad() # don't calculate the gradients; intended to improve efficiency
def estimate_loss():
    """take eval_iters many batches; calculate loss on each batch and 
    take the mean loss and return the mean loss of the eval_iters batches"""
    out = {}
    model.eval() # set the model to evaluation mode
    for split in ['train', 'val']: 
        losses = torch.zeros(eval_iters) # place holders for losses
        for k in range(eval_iters): 
            X, Y = get_batch(split) # get one batch of data
            logits, loss = model(X, Y) # logits are the raw values, loss is the loss
            losses[k] = loss.item() # record the loss at kth batch
        out[split] = losses.mean() # get mean of the eval_iters losses
    model.train() # set the model to train mode
    return out # return train batches mean loss/eval batches mean loss

Recall that `eval_iters`=200 and `batch_size`=16. For the training split, the function draws 200 random batches of 16 sequences each, calculates the loss on every batch, and takes the mean of these 200 losses as the training loss. It then does the same with 200 batches drawn from the validation data to get the validation loss. During training, both mean losses will be printed out to track the state of the model.
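
A useful reference point when reading these numbers: a model that assigns the same probability to every character has a cross-entropy loss of ln(vocab_size). The sketch below computes that baseline, assuming a vocab of 65 characters (an assumption; use the `vocab_size` printed earlier in your own run). A freshly initialized model is expected to start roughly around this value, and training should push the loss well below it.

```python
import math

assumed_vocab_size = 65  # assumption; replace with the vocab_size from your run

# Uniform guessing gives each character probability 1 / vocab_size,
# so the cross-entropy is -ln(1 / vocab_size) = ln(vocab_size)
uniform_loss = math.log(assumed_vocab_size)
print(f"loss of uniform guessing: {uniform_loss:.4f}")
```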

### Define a multi-head self-attention block
A self-attention block in a transformer consists of a multi-head self-attention layer and a feedforward layer. 
#### Define a single self-attention head
Let's define a single self-attention head first. 
Recall from the `How does ChatGPT work?` webinar that a single self-attention head works in the following way:

<img src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/nanoGPT_singleHead.png" width=500>

1. To use a sequence of tokens to predict the following token, a self-attention head encodes a relation between the tokens in the following way:
    * each token t is associated with a query vector, a key vector and a value vector
    * query vector **q<sup>t</sup>** represents what the token is looking for
    * key vector **k<sup>t</sup>** represents what kind of information the token contains
    * value vector **v<sup>t</sup>** represents the actual information the token has
    
If every token in a sequence has a query vector, a key vector and a value vector, then stacking these vectors gives a query matrix **q**, a key matrix **k** and a value matrix **v** for the sequence. 

**q** * **k<sup>T</sup>** gives us a matrix of attention scores. A bigger score means a better match between the query vector of one token and the key vector of another. For example, if **q<sup>t8</sup>** * **k<sup>t2</sup>** is a big number, t2 provides a lot of the information that t8 is looking for. 


2. Language unfolds in time, so the model can only use past tokens to predict future tokens. This means that not all attention scores in the matrix **q** * **k<sup>T</sup>** will be considered. For example, we will not consider **q<sup>t2</sup>** * **k<sup>t8</sup>** because t8 is a future token of t2 and we cannot use t8 to predict t2!

3. The matrix of attention scores will be converted to a matrix of attention weights using the softmax function. After that, we'll use the weight matrix to do weighted aggregation of values from the value matrix **v**. 
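
Steps 2 and 3 can be tried on a tiny example. The sketch below uses made-up numbers (3 tokens, all attention scores set to zero, values chosen by hand) rather than the outputs of trained layers. With equal scores, masking followed by softmax makes each token spread its attention evenly over itself and the tokens before it, so the output is a running average of the value vectors.

```python
import torch
from torch.nn import functional as F

T = 3  # number of tokens in the toy sequence

# Toy attention scores: all zeros, so every allowed token matches equally well
scores = torch.zeros(T, T)

# Lower-triangular mask: row t may only look at columns 0..t (the past and itself)
tril = torch.tril(torch.ones(T, T))
masked = scores.masked_fill(tril == 0, float('-inf'))

# Softmax turns each row into weights that sum to 1; -inf entries become 0
weights = F.softmax(masked, dim=-1)
print(weights)

# Toy value vectors, one row per token
v = torch.tensor([[1.0, 0.0],
                  [3.0, 2.0],
                  [5.0, 4.0]])

# Weighted aggregation: row t is the average of v[0..t]
out = weights @ v
print(out)
```

In a trained head the scores are not all equal, so the weights favor the past tokens whose keys best match the current token's query.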

Even with what we have learned in the `How does ChatGPT work?` webinar, this code cell still looks daunting. So let's use some visualizations to help us understand it. 

In [None]:
class Head(nn.Module): # nn.Module is base class for all neural network modules
    """ one head of self-attention """
    def __init__(self, head_size):
        super().__init__()  # run the initializer of the base class
        # n_embd is the number of dimensions in a token embedding
        # head_size is the number of output dimensions of the key/query/value layers
        # if an input matrix is (16, n_embd), the key layer outputs a (16, head_size) matrix
        # bias=False means the layer will not learn a bias term
        self.key = nn.Linear(n_embd, head_size, bias=False) # maps n_embd dimensions to head_size dimensions
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # tril is a lower-triangular matrix of ones used to mask the attention score matrix;
        # register_buffer stores it with the module without making it a trainable parameter
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x): # x is a batch of token sequences 
        # x is a tensor of shape (B, T, C) where B is batch_size, T is the sequence length (at most block_size), C is n_embd
        B,T,C = x.shape 
        k = self.key(x)   # k is of shape (B, T, head_size) 
        q = self.query(x) # q is of shape (B, T, head_size)
        # compute attention scores
        wei = q @ k.transpose(-2,-1) # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        # change all values in wei in the same positions as the 0s in tril to -inf
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        # softmax applied along the last dimension of wei
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B, T , head_size)
        out = wei @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out

In [None]:
# Understand how tril does masking
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v

out.shape

#### Define multi-head attention

Now we are ready to build a multi-head self-attention layer using the Head class we created. Each head outputs a cuboid of shape (B, T, head_size). In total, we have num_heads many heads. A multi-head attention layer will concatenate the output into one cuboid of shape (B, T, num_heads * head_size). 
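
The shape bookkeeping can be checked with random tensors standing in for the head outputs (the numbers are placeholders, not real attention outputs). With `n_embd` = 64 and `n_head` = 4 as in this notebook, each head has `head_size` = 16, and concatenating along the last dimension restores a width of 64.

```python
import torch

B, T = 16, 32                 # batch_size and block_size from the hyperparameter cell
num_heads, head_size = 4, 16  # n_head, and n_embd // n_head = 64 // 4

# Random stand-ins for the output of each head, each of shape (B, T, head_size)
head_outputs = [torch.randn(B, T, head_size) for _ in range(num_heads)]

# Concatenate along the last dimension, as MultiHeadAttention.forward does
combined = torch.cat(head_outputs, dim=-1)
print(combined.shape)  # torch.Size([16, 32, 64])
```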

In [None]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size): # num_heads is how many heads there are in a self-attention layer
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)]) # a list of heads

    def forward(self, x):
        # concatenate the outputs from the heads  
        out = torch.cat([h(x) for h in self.heads], dim=-1) # x is a batch of data, h(x) is calling the forward() in Head
        return out

Let's also use a visualization to help us understand a multi-head self-attention layer. 

#### Define a feedforward layer

After the self-attention layer comes a feedforward layer. Self-attention lets the tokens communicate with each other; the feedforward layer then processes, for each token separately, the information gathered through that communication. 

In [None]:
class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), # a layer that does linear computation
            nn.ReLU(), # Use ReLU to add non-linearity
            nn.Linear(4 * n_embd, n_embd), 
        )

    def forward(self, x): # x is a batch of data
        return self.net(x)

#### Putting everything into a self-attention block

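A self-attention layer and a feedforward layer together form a self-attention block in a transformer. In the `Block` class below, each of the two layers is preceded by layer normalization, and its output is added back to its input (`x = x + ...`). In a transformer, we can have multiple such self-attention blocks. 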

<img src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/ChatGPT_blocks.png" width=500>
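
The `Block` class in this notebook applies layer normalization before each layer and adds each layer's output back to its input. The sketch below illustrates those two operations on a random tensor, using an `nn.Linear` as a stand-in for the attention and feedforward layers (the stand-in and the tensor sizes are assumptions for illustration only).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 64
x = torch.randn(2, 5, n_embd) * 3 + 1  # (B, T, C) with arbitrary mean and scale

# LayerNorm normalizes each token's embedding (the last dimension) separately
ln = nn.LayerNorm(n_embd)
normed = ln(x)
print(normed.mean(dim=-1).abs().max())            # close to 0 for every token
print(normed.std(dim=-1, unbiased=False).mean())  # close to 1

# Stand-in sublayer; in Block this is the attention or the feedforward layer
sublayer = nn.Linear(n_embd, n_embd)

# Residual connection: the sublayer output is added to its input, shape unchanged
y = x + sublayer(ln(x))
print(y.shape)
```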

In [None]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """ 
    def __init__(self, n_embd, n_head):
        # n_embd: num of embedding dimensions, n_head: the number of heads
        super().__init__()
        # each head in charge of communication between tokens in only part of the dimensions of token embeddings
        head_size = n_embd // n_head 
        self.sa_heads = MultiHeadAttention(n_head, head_size) 
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd) # layer normalization, n_embd is the normalized shape
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa_heads(self.ln1(x))  # (B, T, C)
        x = x + self.ffwd(self.ln2(x)) # (B, T, C)
        return x

### Create the model

All the components are ready. Let's put them together to create a model!

In [None]:
# a small GPT-style transformer language model
class NanoGPT(nn.Module):
    def __init__(self):
        super().__init__()
        # lookup tables that map each token id and each position to an n_embd-dimensional vector
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential( # 4 Blocks
            Block(n_embd, n_head=n_head),
            Block(n_embd, n_head=n_head),
            Block(n_embd, n_head=n_head),
            Block(n_embd, n_head=n_head)                                 
        )                  
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape # the front side of the cuboid of a batch we saw in the slides

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, vocab_size = logits.shape 
            logits = logits.view(B*T, vocab_size)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, vocab_size)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, vocab_size)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

## Train the model and generate some new data

We can now train the model. After we are done training, we can use it to generate some new Shakespeare-like text!
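
The next cell prints the number of parameters in the model. As a cross-check, the count can be worked out by hand from the layer shapes defined above. The sketch below does so, assuming a vocab of 65 characters (an assumption; substitute the `vocab_size` from your own run) and the hyperparameters set earlier. The names `assumed_vocab_size` and `n_blocks` are introduced here for illustration.

```python
# Assumed vocab size; the other values are copied from the hyperparameter cell
assumed_vocab_size = 65
n_embd, n_head, block_size = 64, 4, 32
n_blocks = 4  # NanoGPT stacks four Block modules
head_size = n_embd // n_head

# Embedding tables
tok_emb = assumed_vocab_size * n_embd
pos_emb = block_size * n_embd

# One Block: key, query and value layers (no bias) in each head;
# the tril mask is a buffer, not a parameter, so it is not counted
attention = n_head * 3 * (n_embd * head_size)
# FeedForward: two linear layers with bias
feedforward = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
# Two LayerNorms, each with a weight vector and a bias vector
layer_norms = 2 * (2 * n_embd)
per_block = attention + feedforward + layer_norms

# Final LayerNorm and the output layer (with bias)
ln_f = 2 * n_embd
lm_head = n_embd * assumed_vocab_size + assumed_vocab_size

total = tok_emb + pos_emb + n_blocks * per_block + ln_f + lm_head
print(f"{total:,} parameters ({total / 1e6:.4f} M)")
```

If the value printed by the training cell differs, the most likely cause is a `vocab_size` other than the one assumed here.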

In [None]:
model = NanoGPT()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True) # clear the gradients left over from the previous step
    loss.backward() # backpropagation
    optimizer.step() # Update parameters based on current gradients

# generate from the model
context = torch.tensor([encode("God save his majesty!")], device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))