# Transformer Step-by-Step

We are going to train a GPT model on a medical question answering dataset.  We will go definition by definition through Andrej Karpathy's minGPT implementation to understand the Transformer decoder architecture from scratch.  Once we have defined the architecture, we will run a few training steps on a medium-sized dataset and then interact with our model using a Gradio interface.

# 1. `model.py`

In [1]:
# Reference to: https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py

# Code comments added for clarity are marked with 'MH: '

import math

import torch
import torch.nn as nn
from torch.nn import functional as F

from mingpt.utils import CfgNode as CN

## 1.1. Activation Function

While not a requirement of the Transformer architecture, the Gaussian Error Linear Unit (GELU) activation function was used in both BERT and the GPT family of models.  One of its key benefits is a non-zero derivative for small negative inputs, which may reduce the number of dead neurons that develop during training.
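To make the dead-neuron point concrete, the sketch below compares the gradients of GELU and ReLU at a few small negative inputs. It uses PyTorch's built-in `torch.nn.functional.gelu` and `relu` rather than the minGPT class defined next, and the sample points are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

# A few small negative inputs (arbitrary illustrative values), tracked for gradients
x = torch.tensor([-2.0, -1.0, -0.5, -0.1], requires_grad=True)

# Gradient of GELU at each point
gelu_grad, = torch.autograd.grad(F.gelu(x).sum(), x)

# Gradient of ReLU at the same points
relu_grad, = torch.autograd.grad(F.relu(x).sum(), x)

print("GELU gradients:", gelu_grad)  # non-zero at every one of these points
print("ReLU gradients:", relu_grad)  # exactly zero for negative inputs
```

A unit that only ever sees negative pre-activations gets no gradient signal through ReLU, while GELU still passes a small one.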

In [2]:
# Reference to: https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L21C1-L27C107

class NewGELU(nn.Module):
    """
    Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT).
    Reference: Gaussian Error Linear Units (GELU) paper: https://arxiv.org/abs/1606.08415
    """
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

![GELU Activation Function](https://production-media.paperswithcode.com/methods/Screen_Shot_2020-05-27_at_12.48.44_PM.png) 

Also note that the GELU activation function is available in the torch library as `nn.GELU`, and that version is more optimized than this function.

In [3]:
# Quick comparison of the implementations
library_gelu = nn.GELU()
new_gelu = NewGELU()

# Build a test domain x
x = torch.linspace(-3.0, 3.0, 100)

# Check that function outputs are "close" over the test domain
assert torch.allclose(
    library_gelu(x),
    new_gelu(x),
    atol=0.001,
    rtol=0.001,
), "Implementations are not close, check your math, friend"
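
One detail worth knowing: `nn.GELU()` defaults to the exact, erf-based definition, while `NewGELU` is the tanh approximation, which is why the check above needs a tolerance. As a sketch, assuming a PyTorch version recent enough that `nn.GELU` accepts the `approximate` argument, the two library variants can be compared directly:

```python
import torch
import torch.nn as nn

x = torch.linspace(-3.0, 3.0, 100)

exact_gelu = nn.GELU()                   # erf-based definition (the default)
tanh_gelu = nn.GELU(approximate="tanh")  # tanh approximation, the same formula as NewGELU

# The two variants differ slightly, so the gap is small but not zero
max_gap = (exact_gelu(x) - tanh_gelu(x)).abs().max().item()
print(f"max |exact - tanh| on [-3, 3]: {max_gap:.2e}")
```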

## 1.2. Causal Self-Attention

Much of a Transformer's power comes from modeling pair-wise interactions between the elements of a sequence.
This is our first meaty code block and the most important one for understanding where causal models get their strong directional bias in sequence modeling.  Please note that the only difference between this implementation of multi-headed attention and that of a Transformer encoder is the causal masking step.
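Before reading the full module, the causal masking step can be seen in isolation. The sketch below uses a made-up 3x3 score matrix (not real attention scores) and compares two ways of applying the mask: filling "future" positions with negative infinity before the softmax, or applying the softmax first and then zeroing and renormalizing. The variable names are illustrative only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 3
scores = torch.randn(T, T)           # stand-in for pre-softmax attention scores
mask = torch.tril(torch.ones(T, T))  # 1 where attention is allowed (current and past positions)

# Approach A: hide future positions before the softmax
att_masked = F.softmax(scores.masked_fill(mask == 0, float("-inf")), dim=-1)

# Approach B: softmax everything, zero out the future, then renormalize each row
att_renorm = F.softmax(scores, dim=-1) * mask
att_renorm = att_renorm / att_renorm.sum(dim=-1, keepdim=True)

print(att_masked)
print(torch.allclose(att_masked, att_renorm, atol=1e-6))
```

Both approaches give the same weights here; filling with negative infinity simply gets there in one softmax and without dividing by a sum that may be very small.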

In [4]:
# Reference to: https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L29

class CausalSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer with a projection at the end.
    It is possible to use torch.nn.MultiheadAttention here but I am including an
    explicit implementation here to show that there is nothing too scary here.
    """

    def __init__(self, config):
        super().__init__()
        """MH: 
        In our passed configuration, we provide the following hyperparameters:
         - n_embd - the size of the embedding dimension
         - n_head - the number of self-attention heads
         - attn_pdrop, resid_pdrop - dropout probabilities for the attention activations
                                     and residual features respectively
         - block_size - effectively, a maximum sequence length parameter
        """
        
        # MH: This check ensures that our embedding dimension is divisible by the number of heads
        assert config.n_embd % config.n_head == 0
        
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        
        # MH: ^^^Note that this is a shortcut to avoid instantiating three different linear layers
        # for each of the 3 projections (key, query, value)
        
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        
        # regularization
        self.attn_dropout = nn.Dropout(config.attn_pdrop)
        self.resid_dropout = nn.Dropout(config.resid_pdrop)
        
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("bias", torch.tril(
            torch.ones(
                config.block_size,
                config.block_size)
        ).view(1, 1, config.block_size, config.block_size))
        
        """MH: 
        Our causal mask will now be a lower triangular matrix filled with 1's in the lower triangle.
        The mask ensures that attention weights for "future" elements of a sequence are zero.
        
        For `block_size` = 3, the matrix in the final two axes will be:
        [ 1, 0, 0 ]
        [ 1, 1, 0 ]
        [ 1, 1, 1 ]
        
        Note that the call to `torch.Tensor.view` reshapes the mask to have dimensionality of:
        [1, 1, block_size, block_size].  The first two stubbed dimensions allow the mask to be
        broadcast over the batch and head dimensions of the attention weights.
        """
        
        self.n_head = config.n_head
        self.n_embd = config.n_embd

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k ,v  = self.c_attn(x).split(self.n_embd, dim=2)
        
        # MH: ^^^ Remember our shortcut earlier?  This step basically separates out the 3 projections
        # from the lone Linear module we declared earlier
        
        
        # MH: vvv These three steps essentially reshape our embedding dimension into the subspaces
        # for each of the attention heads - this is why we make sure that the embedding dimension is
        # divisible by the number of attention heads; this step breaks if that assertion is false
        # If n_head = 1, this step would be unnecessary
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        
        # MH: ^^^ Note that we are calculating the pre-softmax attention weights across all positions,
        # scaled down by the square root of the size of the head. The next step is where the
        # "causal masking" comes into play
        
        
        att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
        # MH: ^^^ This is a somewhat subtle, but common implementation choice.
        # We could either 1) calculate all of the attention weights, normalize with the softmax, multiply by the 
        # causal mask and then renormalize so that the attention weights sum to 1 OR
        # 2) we can set all of the "future" pre-activation attention weights to negative infinity
        # This second implementation is much more numerically stable and efficient
        
        
        att = F.softmax(att, dim=-1)
        
        # MH: Here is where we dropout some of the attention weights during training
        att = self.attn_dropout(att)
        
        # MH: Now we calculate the attention weighted value features for each head
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        
        # MH: Return all of the feature shapes back to the full embedding size (undo the reshape for the multiple heads)
        # If n_head = 1, this step would not be necessary
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        # MH: This is our second regularization - dropout before sending the output back to the residual path
        y = self.resid_dropout(self.c_proj(y))
        return y
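
The head split and merge in `forward` are pure reshapes, which can be checked on their own. The sketch below uses toy sizes chosen for illustration (they are not the sizes used elsewhere in this notebook) and confirms that splitting into heads and merging back recovers the original tensor.

```python
import torch

B, T, C, n_head = 2, 4, 8, 2  # toy sizes chosen for illustration
hs = C // n_head              # per-head feature size

x = torch.randn(B, T, C)

# Split: (B, T, C) -> (B, T, nh, hs) -> (B, nh, T, hs)
heads = x.view(B, T, n_head, hs).transpose(1, 2)

# Merge: (B, nh, T, hs) -> (B, T, nh, hs) -> (B, T, C)
merged = heads.transpose(1, 2).contiguous().view(B, T, C)

print(heads.shape)             # torch.Size([2, 2, 4, 4])
print(torch.equal(x, merged))  # True: the round trip loses nothing

# Head 0 holds the first hs features of every token, head 1 the next hs
print(torch.equal(heads[:, 0], x[:, :, :hs]))  # True
```

The `.contiguous()` call is needed because `transpose` returns a non-contiguous view, and `view` requires a compatible memory layout.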

Let's run a quick test with some fake data to check our understanding of the implementation.

In [5]:
# Set up some test parameters
B = 3 # Assume a batch size of 3
T = 5 # Assume that the longest sequence is 5
C = 32 # Assume an embedding dimension of 32

cm_test_cn = CN()

cm_test_cn.n_embd = C
cm_test_cn.n_head = 8
cm_test_cn.block_size = 8 # Needs to be at least T
cm_test_cn.attn_pdrop, cm_test_cn.resid_pdrop = 0.1, 0.1

In [6]:
# Initialize our test data
X = torch.randn((B, T, C))

# Initialize our self-attention module
csa = CausalSelfAttention(cm_test_cn)

In [7]:
output = csa(X)

assert output.shape == X.shape, "Something went wrong, shapes must match for residuals!"

### !!! Check your knowledge 1.2
* What shape should the `bias` buffer be? How can you check it?
* What function does the `bias` buffer perform?
* For these parameters, what is the dimension of each head calculation? How can you check it?
* What happens if you change the number of heads to 10? Try it!
* What happens if the block size is too short?

In [8]:
# Check the shape of the `csa` object's `bias`



In [9]:
# Write here what dimension you think each of the heads will have


# Rewrite the line of code from `CausalSelfAttention` that calculates this size



In [10]:
# Use the code above to instantiate a new `CausalSelfAttention` with 10 attention heads instead of 8




In [11]:
# Create a new set of test data (X2) where T > our configuration's `block_size` and see what happens
# when you call `csa` with X2 as the input



## 1.3. Block

Remember our diagram from our lecture before - this collection of transformations just gets repeated L times where L is the number of our layers.

- Attention
- Add Residual + Layer Norm 
- Feedforward Network
- Add Residual + Layer Norm

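The exact placement of the layer normalizations is a design choice, but ultimately they are used to ensure that activations do not explode across several layers of residual addition operations.  Note that the implementation below applies each layer norm before its transformation (attention or feedforward network) rather than after the residual addition.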
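The two common placements can be written side by side. In the sketch below, `sublayer` is a hypothetical stand-in for either attention or the feedforward network, and the sizes are arbitrary; the only point is the order of operations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 8
ln = nn.LayerNorm(n_embd)
sublayer = nn.Linear(n_embd, n_embd)  # hypothetical stand-in for attention or the feedforward network

x = torch.randn(2, 5, n_embd)

# Post-LN (ordering of the original Transformer): transform, add the residual, then normalize
post_ln_out = ln(x + sublayer(x))

# Pre-LN (ordering used by the Block below): normalize, transform, then add the residual
pre_ln_out = x + sublayer(ln(x))

print(post_ln_out.shape, pre_ln_out.shape)
```

With pre-LN the residual path carries `x` forward untouched, so the un-normalized signal is available at every layer.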

In [12]:
class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        
        # MH: Note the feature dimension expansion of 4 is just a magic number
        # from the original implementation
        self.mlp = nn.ModuleDict(dict(
            c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd),
            c_proj  = nn.Linear(4 * config.n_embd, config.n_embd),
            act     = NewGELU(),
            dropout = nn.Dropout(config.resid_pdrop),
        ))
        m = self.mlp
        
        """MH: 
        We are creating a function here that will call our feedforward network modules (FFN with dropout)
        1. First apply expansion projection to our features: [B, T, C] -> [B, T, 4xC]
        2. Then apply our GELU activation pointwise
        3. Then apply our contraction: [B, T, 4xC] -> [B, T, C]
        4. Finally, apply our dropout regularization
        """
        self.mlpf = lambda x: m.dropout(m.c_proj(m.act(m.c_fc(x)))) # MLP forward
        
    def forward(self, x):
        # MH: Note how the residual path is maintained all the way through
        # and that layer normalization is applied to the features before transformation
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlpf(self.ln_2(x))
        return x

### !!! Check your knowledge 1.3

* How many layer norms are in a block?
* What function do the layer norms serve?
* How many residual connections are within each block?
* How could we redefine `Block.mlpf` using PyTorch's built-in `nn.Sequential` module type?
  * This is a type that accepts modules as positional arguments to its constructor
  * When the module is called, it runs each of the modules in order, passing the output of one as the input to the next
* How do the input dimensions compare to the output dimensions of each block?
* What is the size of the `out_features` of the `Block.mlp.c_fc` Linear module with the test config above? How can you check?

In [13]:
# Initialize one block module
one_block = Block(cm_test_cn)

# Run it on our test data
block_output = one_block(X)

assert block_output.shape == X.shape, "Something went wrong, shapes must match for residuals!"

In [14]:
# Check the size of out_features
# one_block.mlp.c_fc.out_features
# Where did that number even come from?

## 1.4. GPT

Now let us put this all together.  I'm going to split up the original implementation so that I can add some more in-depth explanation between model definitions.

### 1.4.1 Defaults and Object Initialization

In this block of code we initialize the module's submodules and set up
some default configurations.  We can see the global architecture take shape here
and see some of the key design decisions that set GPT apart from BERT and other Transformers.

In [15]:
# Reference: https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L95C1-L161C63

class GPT(nn.Module):
    """ GPT Language Model """

    @staticmethod
    def get_default_config():
        C = CN() # CN is essentially just a minGPT data class for holding configuration settings and hyperparameters
        
        # either model_type or (n_layer, n_head, n_embd) must be given in the config
        C.model_type = 'gpt'
        C.n_layer = None
        C.n_head = None
        C.n_embd =  None
        
        # these options must be filled in externally
        C.vocab_size = None
        C.block_size = None
        
        # dropout hyperparameters
        C.embd_pdrop = 0.1
        C.resid_pdrop = 0.1
        C.attn_pdrop = 0.1
        return C

    def __init__(self, config):
        super().__init__()
        
        # MH: Make sure that user specifies these configuration settings
        assert config.vocab_size is not None
        assert config.block_size is not None
        self.block_size = config.block_size
        
        # MH: This next block just provides some handy combinations of hyperparameters based
        # on published versions of GPT - it is assumed that prior researchers have done a fair
        # bit of work to fiddle with these numbers to get reasonable success with these combinations
        type_given = config.model_type is not None
        params_given = all([config.n_layer is not None, config.n_head is not None, config.n_embd is not None])
        assert type_given ^ params_given # exactly one of these (XOR)
        if type_given:
            # translate from model_type to detailed configuration
            config.merge_from_dict({
                # names follow the huggingface naming conventions
                # GPT-1
                'openai-gpt':   dict(n_layer=12, n_head=12, n_embd=768),  # 117M params
                # GPT-2 configs
                'gpt2':         dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
                'gpt2-medium':  dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
                'gpt2-large':   dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
                'gpt2-xl':      dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
                # Gophers
                'gopher-44m':   dict(n_layer=8, n_head=16, n_embd=512),
                # (there are a number more...)
                # I made these tiny models up
                'gpt-mini':     dict(n_layer=6, n_head=6, n_embd=192),
                'gpt-micro':    dict(n_layer=4, n_head=4, n_embd=128),
                'gpt-nano':     dict(n_layer=3, n_head=3, n_embd=48),
            }[config.model_type])

        """ MH: 
        All of that work and the model can just be summarized like this:
        1. Converting tokens to feature vectors
        1.a. A look up table for token representations (embeddings)
        1.b. A look up table for position embeddings - slightly perturb token representations based on 
             their position in the sequence
        2. A stack of Transformer blocks
        3. A final "look up table" for translating final feature vectors into output tokens - LM head
            - Normally followed by a softmax over the logits - essentially a soft reverse look up table
            - Please note in "Attention is All You Need" this is the transpose of our `wte` with weight tying
        """
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.embd_pdrop),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # init all weights, and apply a special scaled init to the residual projections, per GPT-2 paper
        self.apply(self._init_weights)
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))

        # report number of parameters (note we don't count the decoder parameters in lm_head)
        n_params = sum(p.numel() for p in self.transformer.parameters())
        print("number of parameters: %.2fM" % (n_params/1e6,))
        
    def _init_weights(self, module):
        """ MH: This internal function initializes all of the weights
        It is a common convention in PyTorch to declare such a private initialization function
        and call it in the model's constructor
        """
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.LayerNorm):
            torch.nn.init.zeros_(module.bias)
            torch.nn.init.ones_(module.weight)
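
The two-stage initialization in the constructor (a blanket `apply`, then a name-based pass over `c_proj.weight`) can be tried on a much smaller module. Everything in the sketch below is a toy: the `toy` module, its layer sizes, and `n_layer = 3` are assumptions made for illustration, and only the `c_proj` name mirrors minGPT.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
n_layer = 3  # toy depth, used only for the scaled init below

# Hypothetical miniature model; the name `c_proj` mirrors minGPT's residual projections
toy = nn.ModuleDict(dict(
    c_fc=nn.Linear(256, 256),
    c_proj=nn.Linear(256, 256),
))

def init_weights(module):
    # `apply` calls this once for every submodule, and for `toy` itself
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        nn.init.zeros_(module.bias)

toy.apply(init_weights)

# Second pass: shrink only the residual projections, matched by parameter name
for name, p in toy.named_parameters():
    if name.endswith("c_proj.weight"):
        nn.init.normal_(p, mean=0.0, std=0.02 / math.sqrt(2 * n_layer))

# Empirical standard deviations: roughly 0.02 and roughly 0.02 / sqrt(6)
print(toy.c_fc.weight.std().item(), toy.c_proj.weight.std().item())
```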

#### !!! Check your knowledge 1.4.1

* What is the dimensionality of each head in the "gpt2-xl" configuration above?
* How could you modify the code above to tie the embedding and final token prediction heads together?
  * What benefits might this change offer?
* What does the `vocab_size` parameter do?
* What are we initializing all of our `bias` parameters in our `Linear` modules to? How can you check?

* Try to initialize a GPT model with:
  1. The default configuration
  2. Of a "gpt-nano" model type
  3. Vocab size of 100
  4. A block size of 200
  
* How many parameters does the model have with this configuration?
* What is the maximum sequence length I can provide?
* What is the size of the `wte` embedding with these parameters? How can you check?
* What is the size of the `out_features` of the `lm_head`? How can you check?

In [16]:
this_config = GPT.get_default_config()
this_config.model_type = "gpt-nano"
this_config.vocab_size = 100
this_config.block_size = 200

this_gpt = GPT(this_config)

number of parameters: 0.10M
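
The reported figure can be reproduced by hand from the layer definitions. The sketch below is plain arithmetic over the shapes declared in `CausalSelfAttention`, `Block`, and `GPT` for this configuration; it counts only `self.transformer`, matching the constructor, so `lm_head` is left out (and the causal-mask `bias` buffer is not a parameter at all).

```python
# Hand count of the parameters reported for the "gpt-nano" configuration above
n_layer, n_embd = 3, 48
vocab_size, block_size = 100, 200

wte = vocab_size * n_embd  # token embedding table
wpe = block_size * n_embd  # position embedding table
layer_norm = 2 * n_embd    # weight + bias of one LayerNorm

# Attention: c_attn (n_embd -> 3 * n_embd) plus c_proj (n_embd -> n_embd), each with a bias
attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)

# Feedforward: c_fc (n_embd -> 4 * n_embd) plus c_proj (4 * n_embd -> n_embd), each with a bias
mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)

block = 2 * layer_norm + attn + mlp                # ln_1, ln_2, attention, feedforward
total = wte + wpe + n_layer * block + layer_norm   # the final term is ln_f

print(f"{total} parameters = {total / 1e6:.2f}M")  # 99312 parameters = 0.10M
```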


In [17]:
this_gpt.transformer.wte

Embedding(100, 48)

In [18]:
this_gpt.lm_head.out_features

100

### 1.4.2 `GPT.forward`

The `forward` method in PyTorch is where we define how the sub-modules connect together:
* Convert input tokens to a sequence of vectors
* Perturb the vectors with the position embeddings, depending on where each token sits in the sequence
* Pass the representations through each of the L blocks (where L is the number of layers)
* Convert the final sequence of feature representations into next-token logits (unnormalized scores that a softmax turns into probabilities)
* \[Training only\] Calculate the cross-entropy between the predicted next-token distribution and the actual next token and return it as our loss
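
The training-only step is the least obvious of these, so here is a small sketch of it in isolation. The logits and targets are random stand-ins with toy sizes, and which positions get marked with `-1` is an arbitrary choice for illustration; the `-1` marker itself matches the `ignore_index=-1` used in the implementation below.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, t, vocab_size = 2, 4, 10  # toy sizes chosen for illustration

logits = torch.randn(b, t, vocab_size)       # stand-in for the model's output
targets = torch.randint(vocab_size, (b, t))  # stand-in for the true next tokens
targets[0, :2] = -1                          # mark two positions to be excluded from the loss

# Flatten (b, t, vocab) -> (b*t, vocab) and (b, t) -> (b*t,), the layout F.cross_entropy expects
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1), ignore_index=-1)

# The same value by hand: mean negative log-probability of the target over the kept positions
keep = targets != -1
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs[keep].gather(1, targets[keep].unsqueeze(1)).mean()

print(loss.item(), manual.item())
```

Positions marked with `-1` contribute nothing to the loss or to the average, which is useful for padding or for tokens the model should not be trained to predict.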

In [19]:
# Reference: https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L260

# Ignore the second class definition - hack for defining a class across multiple cells
class GPT(GPT):
    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.block_size, f"Cannot forward sequence of length {t}, block size is only {self.block_size}"
        pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0) # shape (1, t)

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos) # position embeddings of shape (1, t, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)

        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)

        return logits, loss

#### !!! Check your understanding 1.4.2

* For the following questions assume the following parameters:
    * I am using batch sizes of 10 examples per batch
    * My data has a vocabulary size of 100
    * My examples are at most 120 items long but I will use a block_size of 128 for extra room.
    * I want to use a "gpt-nano" model


* During my forward pass:
    * What are the dimensions of the input?
    * What is the expected shape of `tok_emb`?
    * What is the expected shape of `pos_emb`?
    * What are the dimensions of the output?
    * What does the element located at [0, 0, 0] of the first output tensor represent?
    
* Check your answers with a test of the implementation
   * Create fake input data with a call to `torch.randint`

In [20]:
config_142 = GPT.get_default_config()
config_142.model_type = "gpt-nano"
config_142.vocab_size = 100
config_142.block_size = 128

gpt_142 = GPT(config_142)

number of parameters: 0.10M


In [21]:
X_142 = torch.randint(100,(10, 120))

In [22]:
output, _ = gpt_142(X_142)

### 1.4.3 `GPT.generate`

Our GPT model produces a probability distribution over the next token, conditioned on the input sequence.  `GPT.generate` uses this distribution to extend the sequence one token at a time: by default it picks the most likely token, and with `do_sample=True` it samples randomly from the distribution instead.  Repeating this process `max_new_tokens` times extracts a likely completion of the input sequence from our model.  Since we have not trained our model, the output will be a gibberish list of numbers, but we will just focus on the guts of the algorithm for now.
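The top-k filtering and the two ways of picking a token can be tried on a single hand-written logits vector. The values below are made up for a hypothetical 5-token vocabulary; the operations mirror the ones used in `generate`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.tensor([[2.0, 1.0, 0.5, -1.0, -3.0]])  # made-up logits for a 5-token vocabulary
top_k = 2

# Keep only the top_k logits per row; everything below the k-th largest becomes -inf
v, _ = torch.topk(logits, top_k)
filtered = logits.clone()
filtered[filtered < v[:, [-1]]] = -float("inf")

probs = F.softmax(filtered, dim=-1)  # filtered tokens get probability exactly 0

greedy = torch.topk(probs, k=1, dim=-1).indices    # do_sample=False: always the most likely token
sampled = torch.multinomial(probs, num_samples=1)  # do_sample=True: a random draw from probs

print(probs)
print(greedy, sampled)
```

Greedy decoding always returns token 0 here, while sampling returns token 0 or token 1 in proportion to their probabilities and can never return a filtered token.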

In [23]:
# Reference: https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L282

class GPT(GPT):
    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, do_sample=False, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.block_size else idx[:, -self.block_size:]
            # forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond)
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, top_k)
                logits[logits < v[:, [-1]]] = -float('Inf')
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # either sample from the distribution or take the most likely element
            if do_sample:
                idx_next = torch.multinomial(probs, num_samples=1)
            else:
                _, idx_next = torch.topk(probs, k=1, dim=-1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx

#### !!! Check your understanding 1.4.3

* If I initialize a `GPT` with the correct configuration, why would I expect the output to be gibberish?
* Why does minimizing the cross-entropy of the next token prediction task make the output look less like gibberish?
* What does a higher `temperature` argument in the `GPT.generate` method do to sample outputs?
  * In this implementation, what happens if I set `temperature` = 0?

In [24]:
config_143 = GPT.get_default_config()
config_143.model_type = "gpt-nano"
config_143.vocab_size = 100
config_143.block_size = 128

gpt_143 = GPT(config_143)

number of parameters: 0.10M


In [25]:
X_143 = torch.randint(100,(1, 5))

In [26]:
sample_output = gpt_143.generate(X_143, max_new_tokens=5)

In [27]:
print("Garbage in: ", X_143)
print("Garbage out: ", sample_output[0,-5:])

Garbage in:  tensor([[96, 65, 54, 91, 77]])
Garbage out:  tensor([ 7, 91,  2, 67, 19])
