# References

## Transformer

* [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf)
* [On Layer Normalization in the Transformer Architecture](https://arxiv.org/pdf/2002.04745.pdf)
* [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf)

## Nano GPT

Nano GPT implementation by Andrej Karpathy.

### Nano GPT YouTube Lecture version

This repository is for the lecture only. It is different from the proper nanoGPT implementation repository. Do not confuse or mix the two GitHub repositories.

* [Let's build GPT: from scratch, in code, spelled out.](https://www.youtube.com/watch?v=kCc8FmEb1nY)
* [Github - nanogpt-lecture](https://github.com/karpathy/ng-video-lecture)

> Code created in the Neural Networks: Zero To Hero video lecture series, specifically on the first lecture on nanoGPT. Publishing here as a Github repo so people can easily hack it, walk through the git log history of it, etc.


### Nano GPT implementation version
* [Github nanoGPT](https://github.com/karpathy/nanoGPT)

## Resources

* [Understanding Large Language Models](https://magazine.sebastianraschka.com/p/understanding-large-language-models)
* [Building a Transformer with PyTorch](https://www.datacamp.com/tutorial/building-a-transformer-with-py-torch)

# Setup

In [1]:
%%html
<style>
table {float:left}
</style>

In [2]:
import os
import sys

DIR = os.path.dirname(os.path.abspath('..'))
if DIR not in sys.path:
    sys.path.append(DIR)

In [3]:
%load_ext autoreload
%autoreload 2

import math
import inspect
import matplotlib.pyplot as plt
import numpy as np
import torch
from torch import nn
from torch.nn import functional as F

from transformer.v1 import (
    TYPE_FLOAT,
    DROPOUT_RATIO,
    initialize_weights,
    initialize_embedding_weights,
    split,
    calculate_dot_product_similarities,
    scale,
    mask,
    calculate_attention_values,
    MultiHeadAttention,
    ScaledDotProductAttention,
    PositionwiseFeedForward,
    PositionalEncoding,
    EncodeLayer,
    Encoder,
)

In [4]:
# torch.set_printoptions(profile="full")
torch.set_printoptions(edgeitems=4)
torch.set_printoptions(threshold=100)
torch.set_printoptions(linewidth=200)

---

# Terminologies

* B: Batch size
* T: Time steps or Sequence length (e.g. 512 for a BERT input sequence)
* C: Channel or Feature (channel perhaps because Andrej is from CNN background?). ```C=2``` two features in each x.
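
A minimal sketch of these three dimensions on a toy tensor. The sizes below are made up for illustration and are not taken from the lecture.

```python
import torch

B, T, C = 2, 4, 2   # batch size, time steps, channels (features); illustrative values

# A batch of B sequences, each with T time steps of C features.
x = torch.randn(B, T, C)

print(x.shape)          # torch.Size([2, 4, 2])
print(x[0].shape)       # one sequence: (T, C)
print(x[0, -1].shape)   # the last time step of that sequence: (C,)
```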

## Batch Input

<img src="./image/gpt_batch.jpeg" align="left" width=750/>


---
# Basics

## Attention 

$$Attention(Q,K,V)=softmax(\frac {QK^T}{\sqrt {d_k}})V$$
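
A minimal sketch of scaled dot-product attention as defined in the paper, including the final multiplication by V. The function name `attention` and the toy sizes are assumptions for illustration; masking and dropout are left out.

```python
import math
import torch

def attention(q, k, v):
    """softmax(q @ k^T / sqrt(d_k)) @ v for tensors of shape (..., T, d_k)."""
    d_k = q.size(-1)
    similarities = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., T, T)
    weights = torch.softmax(similarities, dim=-1)             # each row sums to 1
    return weights @ v, weights                               # (..., T, d_v), (..., T, T)

torch.manual_seed(0)
q, k, v = (torch.randn(1, 4, 8) for _ in range(3))   # (B, T, d_k), toy sizes
out, weights = attention(q, k, v)
print(out.shape)              # torch.Size([1, 4, 8])
print(weights.sum(dim=-1))    # every row of attention weights sums to 1
```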


* [Let's build GPT: from scratch, in code, spelled out](https://youtu.be/kCc8FmEb1nY?t=4301)
* [Building a GPT](https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-)

> - Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
> - There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
> - Each example across batch dimension is of course processed completely independently and never "talk" to each other
> - In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
> - "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
> - "Scaled" attention additional divides `similarity` by ```1/sqrt(head_size)```. This makes it so when input Q,K are unit variance, `similarity` will be unit variance too and Softmax will stay diffuse and not saturate too much.
> 
> <img src="./image/transformer_attention_as_communication.png" align="left" width=250/>

## Dot Product

Transformer uses Scaled Dot-Product Attention. As a refresher on what the dot product does and where it is used:

* Similarity = Q@K^T
* Attention Value = Similarity@V
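
As a small illustration of why the dot product works as a similarity score, the sketch below uses hand-picked 2D vectors (purely illustrative values): aligned vectors give a large positive score, orthogonal vectors give zero, and opposite vectors give a negative score.

```python
import torch

a = torch.tensor([1.0, 0.0])
b = torch.tensor([2.0, 0.0])    # same direction as a
c = torch.tensor([0.0, 3.0])    # orthogonal to a
d = torch.tensor([-1.0, 0.0])   # opposite to a

print(a @ b)   # tensor(2.)
print(a @ c)   # tensor(0.)
print(a @ d)   # tensor(-1.)

# Stacking the vectors as rows gives all pairwise similarities in one matmul.
X = torch.stack([a, b, c, d])   # (4, 2)
pairwise = X @ X.T              # (4, 4), entry [i, j] is the dot product of row i and row j
print(pairwise)
```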

<img src="./image/transformer_self_attention_flow.jpeg" align="left"/>

## Layer Normalization and Residual Dropout

The original paper applies Dropout to the output of the sub-layer (e.g. Multi Head Attention) before the Residual Connection and Layer Normalization. Because the normalization comes after the residual addition, this is called **Post Normalization**.

> dropout to the output of each sub-layer, **before** it is added to the sub-layer input (x) and (layer) normalized.


<img src="./image/transformer_residual_dropout.png" align="left" width=800/>


However, the more recent approach is **Pre Normalization**, where LayerNorm is applied to the input x of the sub-layer, as explained in [Let's build GPT: from scratch, in code, spelled out.](https://youtu.be/kCc8FmEb1nY?t=5729)

> Very few details about the Transformer have changed in the last five years, but there is something slightly departs from the original paper. You see that Add and Norm is applied **after** the transformation (Multi Head Attention). But now it is more common to apply LayerNorm before the transformation, so there is a reshuffling of the Layer Norm. This is called **pre-norm formulation** and that is the one we are going to implement as well.

Pre-LN is analyzed in [On Layer Normalization in the Transformer Architecture](https://arxiv.org/pdf/2002.04745.pdf).

> On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.
> 
> <img src="./image/pre-ln-transformer.png" align="left" width=600/>

* [Review — Pre-LN Transformer: On Layer Normalization in the Transformer Architecture
Pre-LN Transformer, Warm-Up Stage is Skipped](https://sh-tsang.medium.com/review-pre-ln-transformer-on-layer-normalization-in-the-transformer-architecture-b6c91a89e9ab)
* [About LayerNorm Variants in the Original Transformer Paper, and Some Other Interesting Historical Tidbits About LLMs](https://magazine.sebastianraschka.com/p/why-the-original-transformer-figure)

<img src="./image/post_ln_to_pre_ln_transformer.jpeg" align="left"/>
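
The difference between the two arrangements can be sketched in a few lines. This is only an illustration: `sublayer` is a plain linear layer standing in for attention or feed-forward, and `post_ln` / `pre_ln` are hypothetical helper names.

```python
import torch
from torch import nn

D = 8
torch.manual_seed(0)
sublayer = nn.Linear(D, D)     # stand-in for attention or feed-forward (assumption)
norm = nn.LayerNorm(D)
dropout = nn.Dropout(p=0.1)

def post_ln(x):
    # Original paper: LayerNorm(x + Dropout(Sublayer(x)))
    return norm(x + dropout(sublayer(x)))

def pre_ln(x):
    # Pre-LN: x + Dropout(Sublayer(LayerNorm(x)))
    return x + dropout(sublayer(norm(x)))

x = torch.randn(1, 4, D)
print(post_ln(x).shape, pre_ln(x).shape)
```

In the Pre-LN form the residual path carries `x` through untouched, while in the Post-LN form every block output passes through LayerNorm.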

---
# Transformer (Pre-LN)

(The Transformer implementation starting at [Let's build GPT: from scratch, in code, spelled out](https://youtu.be/kCc8FmEb1nY?t=2268))

Transformer generates a graph network between position-encoded tokens.

1. Get un-connected tokens as a sequence (e.g. a sentence).
2. Wire connections among the tokens, based on their co-occurrences observed in billions of sequences.


<img src="./image/transformer_pre_ln.png" align="left" width=700/>

In [11]:
_B = 1    # Batch size
_H = 2    # Number of heads
_T = 4    # Time steps / Sequence length
_D = 8    # Model vector dimension d_model
d_ff = _D * 4    # Position-wise Feed Forward hidden layer dimension


# Token Embedding
embedding = torch.nn.Embedding(
    num_embeddings=_T,
    embedding_dim=_D
)
initialize_embedding_weights(module=embedding)
DOe = torch.nn.Dropout(p=DROPOUT_RATIO)

# Linear projections at Attention
Wq = torch.nn.Linear(_D, _D, bias=True, dtype=TYPE_FLOAT)
Wk = torch.nn.Linear(_D, _D, bias=True, dtype=TYPE_FLOAT)
Wv = torch.nn.Linear(_D, _D, bias=True, dtype=TYPE_FLOAT)
Wo = torch.nn.Linear(_D, _D, bias=True, dtype=TYPE_FLOAT)
initialize_weights(Wq, d_model=_D)
initialize_weights(Wk, d_model=_D)
initialize_weights(Wv, d_model=_D)
initialize_weights(Wo, d_model=_D, output_projection=True)

# LayerNorm and Dropout for attention
LNa = torch.nn.LayerNorm(normalized_shape=_D, eps=1e-5, bias=True, dtype=TYPE_FLOAT)
DOa = torch.nn.Dropout(p=DROPOUT_RATIO)

# Linear projections at Position Wise Feed Forward
W1 = torch.nn.Linear(_D, d_ff, bias=True, dtype=TYPE_FLOAT)    # widen: d_model -> d_ff
W2 = torch.nn.Linear(d_ff, _D, bias=True, dtype=TYPE_FLOAT)    # project back: d_ff -> d_model
initialize_weights(module=W1, d_model=_D)
initialize_weights(module=W2, d_model=_D, output_projection=True)
relu = torch.nn.ReLU()

# LayerNorm and Dropout for position-wise feed-forward
LNp = torch.nn.LayerNorm(normalized_shape=_D, eps=1e-5, bias=True, dtype=TYPE_FLOAT)
DOp = torch.nn.Dropout(p=DROPOUT_RATIO)

In [13]:
torch.std(embedding.weight, dim=-1)

tensor([0.0170, 0.0207, 0.0195, 0.0284], grad_fn=<StdBackward0>)

In [14]:
torch.std(Wk.weight, dim=-1)

tensor([0.2184, 0.2326, 0.3188, 0.2912, 0.2524, 0.2617, 0.2292, 0.2004], grad_fn=<StdBackward0>)

----
# Input Sequence

An input (a sentence or a time series) is a sequence of integer indices into the embedding vectors. With ```B``` inputs in a batch, the shape is ```(B, T)```.

In [None]:
indices = torch.arange(0, _B * _T).view(_B, _T)    # Token IDs / Indices to token embedding vectors

# Input Embeddings

Extract the embedding vectors ```x``` for the tokens in the sequences as shape ```(B, T, D)```, where ```D``` is the embedding vector dimension (number of features).
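
The lookup itself is plain row indexing into the embedding weight matrix, as the sketch below shows on a toy vocabulary (the sizes are illustrative assumptions).

```python
import torch

torch.manual_seed(0)
V, D = 4, 8                                  # vocabulary size and embedding dimension (toy values)
embedding = torch.nn.Embedding(num_embeddings=V, embedding_dim=D)

indices = torch.tensor([[0, 1, 2, 3]])       # (B, T) = (1, 4)
x = embedding(indices)                       # (B, T, D)

# The lookup is plain row indexing into the weight matrix.
same = torch.equal(x, embedding.weight[indices])
print(x.shape, same)   # torch.Size([1, 4, 8]) True
```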


In [None]:
x: torch.Tensor = embedding(indices)


<img src="./image/transformer_embedding.png" align="left" width=750/>

Multiply the sequence embeddings ```x``` by $\sqrt {d_{model} }$ as per the paper, which increases the variance of each embedding by a factor of $d_{model}$. It is not clear why this is required. In the original paper:

1. The Multi Head Attention layer calculates ```Q```, ```K```, ```V``` directly from ```x``` (after PE) in linear layers.
2. Weights are commonly initialized with ```std=0.02``` as used in BERT (0.02 is an empirical value).
3. Hence, the variance of ```Q``` scales with $0.02^2 = 0.0004$ instead of being $1$, as it would be if $W_q$ were initialized with Xavier. The assumption here is that the authors found empirically that it is better to scale the embeddings up.

Also, **this will be unnecessary when using the pre-layer normalization** (**LN**), as LN cancels this scaling with its own standardization.
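
A quick check of the claim that LayerNorm cancels the scaling. This is a sketch on random toy data with a freshly initialized `LayerNorm` (weight 1, bias 0); the only difference between the two outputs comes from the small `eps` term.

```python
import math
import torch

torch.manual_seed(0)
D = 8
x = torch.randn(1, 4, D)
ln = torch.nn.LayerNorm(D)    # default eps=1e-5, weight=1, bias=0 at initialization

scaled = x * math.sqrt(D)
# LayerNorm standardizes each vector, so a positive rescaling of the input
# changes the output only through the tiny eps term.
max_diff = (ln(x) - ln(scaled)).abs().max().item()
print(max_diff)
```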



In [15]:
0.02 * math.sqrt(512)

0.4525483399593905

In [None]:
x = x * math.sqrt(_D)
assert x.shape == (_B, _T, _D)
print(f"x.std:{torch.std(x, dim=-1)}")
x

# Positional Encoding and Dropout

Position encoding vector is added to the token embedding vector. Dropout is applied to the position added embedding.

<img src="./image/transformer_dropout_to_embedding.png" align="left" width=750/>
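
A minimal sketch of the sinusoidal position encoding from the original paper. It assumes that `PositionalEncoding` follows the same scheme, which is an assumption because the class source is not reproduced here. `sinusoidal_positions` is a hypothetical helper name, and `d_model` is assumed to be even.

```python
import math
import torch

def sinusoidal_positions(max_time_steps: int, d_model: int) -> torch.Tensor:
    """Return a (max_time_steps, d_model) table of sinusoidal position encodings."""
    position = torch.arange(max_time_steps, dtype=torch.float32).unsqueeze(1)   # (T, 1)
    # 1 / 10000^(2i / d_model) for each pair index i
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )                                                                            # (d_model/2,)
    pe = torch.zeros(max_time_steps, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even feature indices
    pe[:, 1::2] = torch.cos(position * div_term)   # odd feature indices
    return pe

pe_table = sinusoidal_positions(max_time_steps=4, d_model=8)
print(pe_table.shape)   # torch.Size([4, 8])
print(pe_table[0])      # position 0: sin(0)=0 at even indices, cos(0)=1 at odd indices
```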

In [None]:
pe = PositionalEncoding(max_time_steps=_T, d_model=_D)
positions = pe(x)
positions

In [None]:
x = DOe(x + positions)
print(f"x.std:{torch.std(x, dim=-1)}")

x

# Pre Layer Normalization

Layer Normalization is applied to the input of the sub-layer.
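
What LayerNorm computes per token can be sketched by hand and compared against `torch.nn.functional.layer_norm`. The input values below are arbitrary toy data, and no learned weight or bias is applied.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
D = 8
x = torch.randn(1, 4, D) * 3 + 5          # arbitrary mean and scale

eps = 1e-5
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)   # LayerNorm uses the biased variance
manual = (x - mean) / torch.sqrt(var + eps)

reference = F.layer_norm(x, normalized_shape=(D,), eps=eps)
print(torch.allclose(manual, reference, atol=1e-5))   # True
print(manual.mean(dim=-1))                            # close to 0 for every token
```

Note that `torch.std` uses the unbiased estimator by default, so printing `torch.std(x, dim=-1)` after LayerNorm shows values of about $\sqrt{D/(D-1)}$ (roughly 1.07 for `D=8`) rather than exactly 1.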

In [None]:
x = LNa(x)
print(f"x.std:{torch.std(x, dim=-1)}")

x

# Split for Multi Heads

In [None]:
q = split(Wq(x), h=_H)
k = split(Wk(x), h=_H)
v = split(Wv(x), h=_H)

print(f"k.shape:{k.shape}, k.std:{torch.std(k, dim=-1)}")    # (B,H,T,d_k)
print(k)

# Scaled Dot Product Attention

<img src="./image/transformer_attention.png" align="left"/>

In [None]:
print(inspect.getsource(ScaledDotProductAttention))

## First MatMul with Q and K (Calculate Similarity Score)

For every token's query in ```Q```, calculate the relationships (similarities) with the keys of the other tokens in ```K``` (for GPT, only with previous tokens). This builds a graph network of self-attention in which the strength of the connections among tokens in a sequence is represented as a matrix of shape ```(T, T)```.
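
A shape-level sketch of this first matmul, assuming the similarity is computed as `q @ k^T` over the last two axes (the sizes are toy values, and this is not necessarily how `calculate_dot_product_similarities` is written):

```python
import torch

torch.manual_seed(0)
B, H, T, d_k = 1, 2, 4, 4
q = torch.randn(B, H, T, d_k)
k = torch.randn(B, H, T, d_k)

# Transpose the last two axes of k so that (T, d_k) @ (d_k, T) -> (T, T) per head.
similarities = q @ k.transpose(-2, -1)
print(similarities.shape)   # torch.Size([1, 2, 4, 4])

# Entry [b, h, i, j] is the dot product of query i with key j in head h.
i, j = 2, 1
single = torch.dot(q[0, 0, i], k[0, 0, j])
print(torch.allclose(similarities[0, 0, i, j], single, atol=1e-6))
```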

|Similarity Score (Q & K)| Probability as Softmax |
|---|---|
|<img src="./image/transformer_dot_product_attention_similarity_score.jpeg" align="left" width=500/>|<img src="./image/transformer_dot_product_attention.png" align="left" width=175/>|


In [None]:
print(inspect.getsource(calculate_dot_product_similarities))

In [None]:
similarities = calculate_dot_product_similarities(query=q, key=k)
print(similarities.shape)    # (B,H,T,T)
print(similarities)


## Scale by $\sqrt{d_k}$

### Control the variance

As in the name **Scaled** Dot-Product Attention, the similarity score is divided by $\sqrt{d_k}$ to manage the variance, where $d_k$ is the dimension of the key vector $k$ (which is the same as that of the query).

Suppose $W$ is initialized with Xavier so that the variance of $X@W^T$ is 1.0. If the positionally encoded token vector $x$ has dimension $D$ and the shape of $W_K$ is ```(M, D)```, then the key $k = x:(D,) @ W^T_K:(D,M)$ has the shape $(M,)$, where $M$ is $d_k$. The variance of each element of the product $Q\cdot K^T$ is then $d_k$, because each element is a sum of $d_k$ products of unit-variance terms. The variance of the product of two independent zero-mean random variables is:

[Variance of product of multiple independent random variables](https://stats.stackexchange.com/questions/52646/)

$${\rm Var}(XY) = E(X^2Y^2) - (E(XY))^2={\rm Var}(X){\rm Var}(Y)+{\rm Var}(X)(E(Y))^2+{\rm Var}(Y)(E(X))^2$$
$$E(X)=E(Y)=0 \;\Rightarrow\; {\rm Var}(XY)={\rm Var}(X){\rm Var}(Y)$$

<img src="./image/variance_of_q@k.jpeg" align="left" width=500/>
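
For independent zero-mean, unit-variance components, ${\rm Var}(q\cdot k)=\sum_{i=1}^{d_k}{\rm Var}(q_i k_i)=d_k$. The sketch below checks this numerically on random unit Gaussian vectors (the sample size and `d_k=16` are illustrative choices).

```python
import torch

torch.manual_seed(0)
N, d_k = 100_000, 16
q = torch.randn(N, d_k)           # unit-variance, zero-mean queries
k = torch.randn(N, d_k)           # unit-variance, zero-mean keys

dots = (q * k).sum(dim=-1)        # N independent dot products q . k
naive_var = dots.var().item()
scaled_var = (dots / d_k ** 0.5).var().item()
print(naive_var)                  # close to d_k = 16
print(scaled_var)                 # close to 1 after dividing by sqrt(d_k)
```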

* [Let's build GPT: from scratch, in code, spelled out](https://youtu.be/kCc8FmEb1nY?t=4616)

> If you have unit Gaussian inputs of mean 0, and $Q$ and $K$ are unit Gaussian, and if you calculate the ```similarity``` naively, the variance is on the order of the head size $d_k$ (e.g. approx 16 if $d_k$ == 16). By scaling the ```similarity``` score by $1/\sqrt{d_k}$, the variance of the ```similarity``` score will be preserved (approx 1.0).



### Soften the softmax outputs

#### Effect of exponential

The exponential has an amplify/suppress effect when the signal has a large variance with both negative and positive values: only the large positive values get amplified, and the values towards the negative side get suppressed.

This is desirable when predicting one class, so that only one class stands out. But to incorporate all the signals, this amplification/suppression is undesirable. By standardizing the signals (scaling by their std), the effect of the exponential is softened.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 3),  tight_layout=True) 
plt.grid()

head_size = 16
normalized_qk = torch.tensor(sorted([ 0.0838, -0.7539,  0.5026, -1.1728,  1.3405]))
naive_qk = normalized_qk * torch.sqrt(torch.tensor(head_size))

# Variance=head_size=d_k

_x = np.linspace(naive_qk[0]-0.5, naive_qk[-1]+0.5)
axes[0].plot(_x, np.exp(_x), color='k', linewidth=0.5)
axes[0].scatter(naive_qk, torch.exp(naive_qk), color='r')
axes[0].set_title("large variance with negative/positive range")
axes[0].grid(linestyle='-', linewidth=1)

# Variance = 1

_x = np.linspace(normalized_qk[0]-0.5, normalized_qk[-1]+0.5)
axes[1].plot(_x, np.exp(_x), color='k', linewidth=0.5)
axes[1].scatter(normalized_qk, torch.exp(normalized_qk))
axes[1].grid(linestyle='-', linewidth=1)
axes[1].set_title("standardized variance 1 with negative/positive range")


#### Without scaling

When the similarity score is not scaled by $\sqrt{d_k}$, the softmax becomes **peaky** like a one-hot encoding, which is beneficial for classification (it amplifies the high-score signal). However, for self-attention, the softmax will pick up only the nodes with larger values, hence only specific nodes in the sequence will be incorporated into the BoW. We want to consider the communication among all nodes where it exists, not only specific ones.

* [Let's build GPT: from scratch, in code, spelled out.](https://youtu.be/kCc8FmEb1nY?t=4694)

> The problem here, because of the softmax, if the $Q@K$ takes very positive or very negative numbers inside it, softmax will converge towards one hot vectors.

In [None]:
head_size = 16
normalized_qk = torch.tensor([ 0.0838, -0.7539,  0.5026, -1.1728,  1.3405])
print(f"VAR[qk]={torch.var(normalized_qk)}")

# Note that the variance of Q@K is close to the head_size d_k.
# Multiply by sqrt(head_size) so that the variance becomes head_size.
naive_qk = normalized_qk * torch.sqrt(torch.tensor(head_size))
print(f"variance:{naive_qk.var()}, softmax:{torch.softmax(naive_qk, dim=-1)}")

plt.figure(figsize=(3,2))
plt.grid()
plt.stem(range(len(naive_qk)), torch.softmax(naive_qk, dim=-1))

#### With scaling

With scaling/normalization, the softmax is smoothed/diffused.

In [None]:
scaled_qk = naive_qk / torch.sqrt(torch.tensor(head_size))
print(f"variance:{scaled_qk.var()}, softmax:{torch.softmax(scaled_qk, dim=-1)}")

plt.figure(figsize=(3,2))
plt.grid()
plt.stem(range(len(scaled_qk)), torch.softmax(scaled_qk, dim=-1))

### Code for Scale

In [None]:
print(inspect.getsource(scale))

In [None]:
scaled_similarities = scale(similarities=similarities, d_k=_D/_H)
scaled_similarities

## Mask

Optional. After calculating the ```(T,T)``` matrix of similarities from q to k, mask the matrix to prevent communication with future time steps by replacing those similarity values with ```-inf```, meaning there is **no relationship** from ```q```. Softmax then turns the contribution from ```-inf``` into zero. This is the same as blocking the relations between ```q``` and ```k```.


### Demonstration

In [None]:
# Mask matrix to decide which element in (T,T) matrix to mask
mask_matrix = torch.tril(torch.ones(_T,_T)) == 0
mask_matrix

In [None]:
# Mask the similarity matrix element of future steps with -inf

masked_similarities = torch.clone(similarities).masked_fill(mask=mask_matrix, value=float('-inf'))
masked_similarities

In [None]:
# Softmax will suppress the contribution from -inf similarity values
F.softmax(masked_similarities, dim=-1)

### Code for Mask

In [None]:
print(inspect.getsource(mask))

In [None]:
masked_similarities = mask(similarities=scaled_similarities, mask_matrix=mask_matrix)
masked_similarities

In [None]:
normalized_similarities = F.softmax(masked_similarities, dim=-1)
print(f"normalized_similarities.std:{torch.std(normalized_similarities, dim=-1)}")

normalized_similarities

## Second MatMul with V (Calculate Bag of Words Attention Value)

One way to generate the inter-connections among the tokens, to distill their knowledge or relations, is a bag of words (```BoW```): a weighted average of the token vectors, computed feature-wise.



<img src="./image/transformer_dot_product_attention_bow.png" align="left" width=700/>  

Note that initially the values of the similarity are random or ```(1.0, 1/2, 1/3, ...)```, but eventually they get trained to memorize the relations among the position-encoded tokens.
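
The ```(1.0, 1/2, 1/3, ...)``` pattern can be reproduced with a small sketch: when all similarities are equal (zeros here, standing in for an untrained model) and future positions are masked, softmax gives uniform weights over the past, so each output row is a running mean of the value vectors. The toy values of `v` are made up for illustration.

```python
import torch

T, D = 4, 2
v = torch.arange(T * D, dtype=torch.float32).view(T, D)   # toy value vectors, one row per token

# All-zero similarities stand in for an untrained model that prefers no token.
similarities = torch.zeros(T, T)
future = torch.tril(torch.ones(T, T)) == 0                 # True above the diagonal
weights = torch.softmax(similarities.masked_fill(future, float("-inf")), dim=-1)
print(weights)
# tensor([[1.0000, 0.0000, 0.0000, 0.0000],
#         [0.5000, 0.5000, 0.0000, 0.0000],
#         [0.3333, 0.3333, 0.3333, 0.0000],
#         [0.2500, 0.2500, 0.2500, 0.2500]])

bow = weights @ v    # row t is the mean of v[0..t]
print(bow)
```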

* [Let's build GPT: from scratch, in code, spelled out.](https://youtu.be/kCc8FmEb1nY?t=3814)

> Different tokens will find other tokens more or less interesting, and we want that to be data dependent. If I/token is a vowel, I am looking for consonants in my past and want to know what those consonants were. And I want the information to flow to me (connection). This is the problem that Self Attention solves.

<img src="./image/self_attention.jpeg" align="left" width=700/>

### Purpose of using $W_V$

$v$ looks to be a proxy of $x$, but what transformation or meaning does $W_V$ give by transforming $x$ into $v$? (Note: ```x``` in the diagram above is actually $v$, as $v=x@W_V^T$).

* [Let's build GPT: from scratch, in code, spelled out](https://youtu.be/kCc8FmEb1nY?t=4258)

> $x$ is like a private information to a token. For the purpose of the single attention head, $v$ is what I give for you to communicate with if you find me interesting.

### Code

In [None]:
print(inspect.getsource(calculate_attention_values))

In [None]:
attensions = calculate_attention_values(similarities=normalized_similarities, values=v)
print(attensions.shape)    # (B,H,T,d_v)
attensions

### Verify ScaledDotProductAttention

In [None]:
sdpa = ScaledDotProductAttention(do_mask=True, max_time_steps=_T)
a = sdpa(q=q,k=k,v=v)
print(f"a.std:{torch.std(a, dim=-1)}")
a

# Multi Attention Head

Divide the embedding vectors q, k, v into ```h``` segments and apply self-attention to each segment in parallel.
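
A minimal sketch of splitting into heads and merging back, assuming the split is a `view` followed by a `transpose`. `split_heads` and `merge_heads` are hypothetical helper names, and the `split` imported from `transformer.v1` may be implemented differently.

```python
import torch

def split_heads(x: torch.Tensor, h: int) -> torch.Tensor:
    """(B, T, D) -> (B, h, T, D // h)."""
    B, T, D = x.shape
    return x.view(B, T, h, D // h).transpose(1, 2)

def merge_heads(x: torch.Tensor) -> torch.Tensor:
    """(B, h, T, d) -> (B, T, h * d), the inverse of split_heads."""
    B, h, T, d = x.shape
    return x.transpose(1, 2).reshape(B, T, h * d)

x = torch.arange(1 * 4 * 8, dtype=torch.float32).view(1, 4, 8)   # (B, T, D)
heads = split_heads(x, h=2)
print(heads.shape)                           # torch.Size([1, 2, 4, 4])
print(torch.equal(merge_heads(heads), x))    # True
```

Each head sees a contiguous slice of the feature dimension: head 0 gets features `0..3` of every token and head 1 gets features `4..7`.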

* <img src="./image/transformer_paper_multi_head_attention.png" align="left" width=700/>


|<img src="./image/transformer_multi_head_attentions.png" align="left" width=200/>|<img src="./image/transformer_multi_head_attention_formula.png" align="left" width=700/>|

* [Transformers Explained Visually (Part 3): Multi-head Attention, deep dive](https://towardsdatascience.com/transformers-explained-visually-part-3-multi-head-attention-deep-dive-1c1ff1024853)

<img src="./image/transformer_multi_head_attention.png" align="left" width=500/>

### Concatenate multiple outputs from Heads

In [None]:
attensions = attensions.transpose(2,1)  # (B,T,H,d_v)
attentions = attensions.reshape(_B,_T,-1)
print(attentions.shape)    # (B,T,D)
attentions

### Linear Projection

In [None]:
attentions = Wo(attentions)
print(attentions.shape)    # (B,T,D)
attentions

### Code

In [None]:
print(inspect.getsource(MultiHeadAttention))

In [None]:
# Verify the MultiHeadAttention output
mha = MultiHeadAttention(
    num_heads=_H,
    d_model=_D,
    dtype=TYPE_FLOAT,
    do_mask=True,
    max_time_steps=_T,
    bias=True,
)
mha.Wq = Wq
mha.Wk = Wk
mha.Wv = Wv
mha.Wo = Wo


attentions = mha(q=x, k=x, v=x)
attentions

## Dropout and Residual Connection

Dropout is applied to the output of each sub-layer, and the result is added to the input x of the sub-layer.
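
A small sketch of how `torch.nn.Dropout` behaves in this residual step, using constant stand-in tensors (illustrative values only): in training mode the surviving elements are scaled by `1 / (1 - p)`, and in eval mode dropout is the identity.

```python
import torch

torch.manual_seed(0)
p = 0.5
dropout = torch.nn.Dropout(p=p)
sublayer_out = torch.ones(1, 4, 8)    # stand-in for the sub-layer output
x = torch.zeros(1, 4, 8)              # stand-in for the sub-layer input

dropout.train()
dropped = dropout(sublayer_out)       # each element is either 0 or 1 / (1 - p)
print(dropped.unique())

dropout.eval()
y = x + dropout(sublayer_out)         # in eval mode dropout is the identity
print(torch.equal(y, x + sublayer_out))   # True
```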

In [None]:
x = x + DOa(attentions)   # Do NOT use x += DOa(attentions) as it is an in-place operation in PyTorch.
x




# Position-wise Feed Forward


## Pre Layer Normalization

Layer Normalization is applied to the input of the sub-layer.

In [None]:
x = LNp(x)
x

## Feed Forward 

Apply a wider single-hidden-layer neural network with ReLU activation. Here the features in the attention vector of each token can be amplified or suppressed by the weights, and combinations of multiple features can form new features in the output. The output is then projected back to d_model dimensions.

<img src="./image/transformer_paper_positionwise_feedforward.png" align="left" width=700/>

The PwFF transfers the token embedding vectors to a space of different semantics, allowing the Transformer to learn at the level of each token-embedding vector.

1. ```Self Attention``` learns relationships among tokens (building the graph), or to be exact among heads via multi head attention.
2. ```PwFF``` uses the ```D``` dimensions of each token as ```D``` features to generate a new token for the next layer.
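
The "position-wise" part can be sketched as follows: the same two linear layers are applied to every token independently, so processing the whole sequence at once gives the same result as processing one token at a time. The sizes and the `nn.Sequential` stand-in are assumptions for illustration, not the `PositionwiseFeedForward` class itself.

```python
import torch
from torch import nn

torch.manual_seed(0)
D, d_ff, T = 8, 32, 4
ff = nn.Sequential(
    nn.Linear(D, d_ff),   # widen: D -> d_ff
    nn.ReLU(),
    nn.Linear(d_ff, D),   # project back: d_ff -> D
)

x = torch.randn(1, T, D)
whole = ff(x)                                                      # all tokens at once
per_token = torch.stack([ff(x[:, t]) for t in range(T)], dim=1)    # one token at a time

print(whole.shape)                                   # torch.Size([1, 4, 8])
print(torch.allclose(whole, per_token, atol=1e-5))   # True
```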

<img src="./image/transformer_positionwise_feedforward.png" align="left"/>


In [None]:
feedforward = W2(relu(W1(x)))
feedforward

### Code

In [None]:
print(inspect.getsource(PositionwiseFeedForward))

In [None]:
# Verify the PositionwiseFeedForward output.
pwff = PositionwiseFeedForward(d_model=_D, d_ff=d_ff, dtype=TYPE_FLOAT, bias=True)
pwff.W1 = W1
pwff.W2 = W2
pwff.relu = relu

feedforward = pwff(x)
print(f"feedforward.std:{torch.std(feedforward, dim=-1)}")

feedforward

## Dropout and Residual Connection

Dropout is applied to the output of each sub-layer, and the result is added to the input x of the sub-layer.

In [None]:
encoded = x + DOp(feedforward)
print(f"encoded.std:{torch.std(encoded, dim=-1)}")

encoded

---
# Encoder



In [None]:
print(inspect.getsource(Encoder))

In [None]:
encoder = Encoder(
    vocabulary_size=_T,
    num_layers=1,
    num_heads=_H,
    d_model=_D,
    dtype=TYPE_FLOAT,
    d_ff=d_ff,
    do_mask=True,
    max_time_steps=_T,
    bias=True,
    p_drop=DROPOUT_RATIO,
    eps=1e-5
)

In [None]:
memory = encoder(indices)
print(f"memory.std:{torch.std(memory, dim=-1)}")
memory