# GPT - Part 2: self-attention
- Video: [Andrej Karpathy - Let's build GPT](https://www.youtube.com/watch?v=kCc8FmEb1nY&t=1413s)
- Paper
    - [Attention is All You Need](https://arxiv.org/abs/1706.03762)

### What are transformers?
Transformers are neural networks that specialize in learning context from the data: each element of a sequence is interpreted in light of the elements around it, much as we work out the meaning of a word from the sentence it appears in.

### How do transformers learn context from the data?
By using the attention mechanism.

### What is the attention mechanism?
The attention mechanism helps the model scan all parts of a sequence at each step and determine which elements need to be focused on. It was proposed as an alternative to the ‘strict/hard’ solution of fixed-length vectors in the encoder-decoder architecture, and it provides a ‘soft’ solution that focuses only on the relevant parts.

### What is self-attention?
The attention mechanism was first used to improve the performance of Recurrent Neural Networks (RNNs), with the effect seeping into Convolutional Neural Networks (CNNs). However, with the introduction of the transformer architecture in 2017, RNNs and CNNs were no longer required for many sequence tasks. The central reason for this was the self-attention mechanism.

The self-attention mechanism is special in the sense that it builds the context from the input sequence itself: every token attends to the other tokens of the same sequence. This idea became transformational as it was able to capture the complex nuances of a language.

There are many variations of how self-attention is performed. But the scaled dot-product mechanism has been one of the most popular ones. This was the one introduced in the original transformer architecture paper in 2017 — “Attention is All You Need”.
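
For reference, the scaled dot-product attention defined in that paper can be written as below, where $Q$, $K$ and $V$ are the query, key and value matrices and $d_k$ is the size of each key vector (the quantity called `head_size` in the code of this notebook):

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$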

### Where and how does self-attention feature in transformers?
I like to see the transformer architecture as a combination of two shells — the outer shell and the inner shell.

- The outer shell is a combination of the attention-weighting mechanism and the feed forward layer.
- The inner shell consists of the self-attention mechanism and is part of the attention-weighting feature.


In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

## Mathematical trick to replace for loops
#### Imagine the toy example
- The input has shape (B,T,C) = (4,8,2): four sequences in the batch, each with eight tokens, and each token has 2 channels
- each token starts out representing only itself
- B   ==> Mini-batch: the number of independent sequences processed in parallel
- T   ==> Time or Tokens: the position of a token within its sequence
- C   ==> Channel: the features that describe each token (learned in a real model, random here)
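
A small sketch of how the three dimensions are indexed may help here. The seed value is arbitrary and only makes the sketch repeatable; it is not the seed used in the notebook cells.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, an assumption for this sketch only
B, T, C = 4, 8, 2     # batch, time (position in the sequence), channels
x = torch.randn(B, T, C)

one_sequence = x[0]   # all eight tokens of the first sequence
one_token = x[0, 3]   # the feature vector of the fourth token in that sequence

print(x.shape)             # torch.Size([4, 8, 2])
print(one_sequence.shape)  # torch.Size([8, 2])
print(one_token.shape)     # torch.Size([2])
```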

In [2]:
torch.manual_seed(1337)
B,T,C = 4,8,2
x = torch.randn(B,T,C)
x[0]

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

## Batch time awareness
### Python for loops
- We want to discover the relationship with tokens in the past and from it predict tokens in the future.
- The past tokens all come from the same batch row (the same sequence)
- $ \mathrm{xbow}[b,t] = \mathrm{mean}_{i \le t} \; x[b,i] $
- bow => bag of words

In [3]:
xbow = torch.zeros(B,T,C)
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1, C)
        xbow[b,t] = torch.mean(xprev,dim=0)

- unlike the raw input, each time step is now the mean of itself and all previous time steps in the same sequence
- but the nested Python loops are slow and verbose.
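
One loop-free alternative, shown only as a sketch and not the approach this notebook takes next, is a running sum divided by the number of tokens seen so far. The names `xbow_loop`, `xbow_cumsum` and `counts` are illustrative and do not appear in the notebook.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Reference: explicit loops, xbow[b, t] = mean of x[b, 0..t]
xbow_loop = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow_loop[b, t] = x[b, : t + 1].mean(dim=0)

# Same quantity as a cumulative sum along time divided by 1, 2, ..., T
counts = torch.arange(1, T + 1, dtype=x.dtype).view(1, T, 1)
xbow_cumsum = torch.cumsum(x, dim=1) / counts

print(torch.allclose(xbow_loop, xbow_cumsum, atol=1e-6))
```

This works for a plain average, but it does not generalise to weights that depend on the data, which is why the matrix form is the one worth learning.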

In [4]:
xbow[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

### Weighted multiplication

In [5]:
tril = torch.tril(torch.ones(T,T))
tril

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])
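
To see why a lower-triangular matrix is useful, here is a minimal sketch on a 3x3 case with made-up numbers. Once each row is normalised to sum to 1, multiplying by it produces running averages.

```python
import torch

# Lower-triangular ones, then normalise each row so it sums to 1
a = torch.tril(torch.ones(3, 3))
a = a / a.sum(dim=1, keepdim=True)

# Three "tokens" with two channels each (made-up values)
b = torch.tensor([[2.0, 7.0],
                  [6.0, 4.0],
                  [6.0, 5.0]])

c = a @ b  # row t of c is the mean of rows 0..t of b
print(a)
print(c)
```

Row 0 of `c` is just the first token, row 1 is the average of the first two tokens, and row 2 is the average of all three.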

In [6]:
wei = torch.zeros((T,T))
wei

tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

In [7]:
wei = wei.masked_fill(tril == 0, float('-inf'))
wei

tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

#### softmax turns each row into weights that sum to 1 (the `-inf` entries become 0)

In [8]:
wei = F.softmax(wei, dim=-1)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
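
The reason `-inf` works as a mask is that softmax exponentiates its input and `exp(-inf)` is exactly 0. A minimal sketch on a single row:

```python
import torch
import torch.nn.functional as F

row = torch.tensor([0.0, 0.0, float('-inf'), float('-inf')])

exps = torch.exp(row)            # the masked positions become exactly 0
probs = F.softmax(row, dim=-1)   # the remaining positions share the weight equally

print(exps)   # tensor([1., 1., 0., 0.])
print(probs)  # tensor([0.5000, 0.5000, 0.0000, 0.0000])
```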

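#### aggregation through matrix multiplication
Batched matrix product of wei(ght), shape (T,T), with x, shape (B,T,C). PyTorch broadcasts wei across the batch dimension, so the result has shape (B,T,C).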

In [9]:
xbow1 = wei @ x
xbow1[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])
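
The printed values match the loop version. As a sketch, the check can also be done programmatically; the cell below is self-contained and simply repeats the two computations before comparing them.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Loop version
xbow = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, : t + 1].mean(dim=0)

# Matrix version: (T, T) @ (B, T, C) is broadcast to (B, T, T) @ (B, T, C)
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T).masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow1 = wei @ x

print(xbow1.shape)
print(torch.allclose(xbow, xbow1, atol=1e-6))
```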

Self-attention is where the zeros used to initialise `wei` are replaced by data-dependent scores between each token and the tokens before it.

## Self-Attention
Self-attention for a single head

In [10]:
torch.manual_seed(1337)
B,T,C = 4,8,32 # Batch, Time, Channels
x = torch.randn(B,T,C)

# let's see a single head perform self-attention
head_size = 16
# key: what this token contains
key = nn.Linear(C, head_size, bias=False)
# query: what this token is looking for
query = nn.Linear(C, head_size, bias=False)
# apply both projections to the data
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
# the query-key dot products give the wei(ght) for the head
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T,T))
# wei is no longer initialised to zeros
# wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
# aggregate the raw x for now; value vectors are introduced later
out = wei @ x

out.shape

torch.Size([4, 8, 32])

#### this now tells us in a data-dependent manner how much information to aggregate from the past

In [11]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)
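
A few properties of this matrix can be checked directly. The sketch below rebuilds `wei` in a self-contained way and confirms that every row sums to 1, that no weight falls on future positions, and that the weights differ between sequences in the batch. The names `row_sums` and `future_mass` are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C, head_size = 4, 8, 32, 16
x = torch.randn(B, T, C)
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)

wei = query(x) @ key(x).transpose(-2, -1)  # (B, T, T)
tril = torch.tril(torch.ones(T, T))
wei = F.softmax(wei.masked_fill(tril == 0, float('-inf')), dim=-1)

row_sums = wei.sum(dim=-1)                      # (B, T), each entry should be 1
future_mass = (wei * (1 - tril)).abs().max()    # largest weight on a future token

print(torch.allclose(row_sums, torch.ones(B, T), atol=1e-5))
print(future_mass.item())
print(torch.allclose(wei[0], wei[1]))  # different sequences get different weights
```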

## Self-attention with single head
### Random B T C setup

In [12]:
torch.manual_seed(1337)
B,T,C = 4,8,32 # Batch, Time, Channels
x = torch.randn(B,T,C)


### Attention weight matrix (A)
The attention weight matrix A is obtained by feeding the input features into the Query-Key (QK) module. 
This matrix tries to find the most relevant parts in the input sequence. Self-Attention comes into play 
while creating the Attention weight matrix A using the QK-module.



In [13]:
# let's see a single head perform self-attention
head_size = 16
# project the features with three separate linear transformations
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
# v is like a view of x for the purposes of these tokens
v = value(x) # (B, T, 16)


#### Note:
- As can be seen from the calculation above, we use the same set of features `x` for the queries, the keys and the values. That is how the idea of “self” comes into play here, i.e. the model uses the same set of features to create its query vector as well as its key and value vectors.
- The query vector represents the current word (or token) for which we want to compute attention scores relative to other words in the sequence.
- The key vector represents the other words (or tokens) in the input sequence and we compute the attention score for each of them with respect to the current word.

### Multiply Q with the transpose of K
The idea here is to calculate the dot product between every pair of query and key vectors. The dot 
product gives us a matching score for every “key-query” pair. It is closely related to Cosine 
Similarity, but it is not divided by the lengths of the vectors, so longer vectors also produce larger scores. 
This is the ‘dot-product’ part of the scaled dot-product attention.
- make sure the transpose of K swaps only the last two dimensions (T and head_size) and leaves the batch dimension in place

#### Cosine-Similarity
Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided 
by the product of their lengths. It roughly measures if two vectors are pointing in the same direction thus implying the 
two vectors are similar.

$$ \cos(0^\circ) = 1, \quad \cos(90^\circ) = 0, \quad \cos(180^\circ) = -1 $$

- If the cosine similarity between the two vectors is approximately 1, it implies we are looking at an almost zero angle
between the two vectors meaning they are very close to each other.
- If the cosine similarity between the two vectors is approximately 0, it implies we are looking at vectors that are
orthogonal to each other and not very similar.
- If the cosine similarity between the two vectors is approximately -1, it implies we are looking at an almost 180°
angle between the two vectors meaning they are opposites.
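
The three cases can be sketched with hand-picked 2-D vectors. The example also shows how the raw dot product, which is what attention actually computes, differs from the cosine similarity when the vectors are not unit length. `F.cosine_similarity` is PyTorch's built-in function; the vectors are made up for illustration.

```python
import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 0.0])
b = torch.tensor([3.0, 0.0])   # same direction as a, three times as long
c = torch.tensor([0.0, 2.0])   # orthogonal to a
d = torch.tensor([-2.0, 0.0])  # opposite direction to a

results = {}
for name, v in [("b", b), ("c", c), ("d", d)]:
    dot = torch.dot(a, v).item()
    cos = F.cosine_similarity(a, v, dim=0).item()
    results[name] = (dot, cos)
    print(f"a . {name} = {dot:+.1f}   cos = {cos:+.1f}")
```

The cosine similarity of `a` and `b` is 1 while their dot product is 3: the dot product grows with the vector lengths, which is part of the motivation for the scaling step.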

In [14]:
# (B, T, 16) @ (B, 16, T) ---> (B, T, T)
weight = q @ k.transpose(-2, -1)
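
As a sketch of what this single line computes, entry `[b, i, j]` of the result is the dot product of the query at position `i` with the key at position `j` in sequence `b`. Random tensors stand in for the projected `q` and `k` here, and the names `scores` and `single` are illustrative.

```python
import torch

torch.manual_seed(1337)
B, T, head_size = 4, 8, 16
q = torch.randn(B, T, head_size)  # random stand-ins for the projected queries
k = torch.randn(B, T, head_size)  # random stand-ins for the projected keys

k_t = k.transpose(-2, -1)         # swaps only the last two dims: (B, head_size, T)
scores = q @ k_t                  # (B, T, T)

# One entry computed by hand: query at position 2 against key at position 5
single = torch.dot(q[0, 2], k[0, 5])

print(k_t.shape)     # torch.Size([4, 16, 8])
print(scores.shape)  # torch.Size([4, 8, 8])
print(torch.allclose(scores[0, 2, 5], single, atol=1e-5))
```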

### Attention Weight Matrix (A)
- as in our math trick we mask and then apply Softmax, with the scores also scaled by 1/sqrt(head_size)
- each row becomes a probability distribution of attention, which gives us our Attention Weight Matrix (A).

#### Softmax
- The Softmax step is important as it assigns probabilities to the score obtained in the previous steps and
thus helps the model decide how much importance (higher/lower attention weights) needs to be given to each
word given the current query. As is to be expected, higher attention weights signify greater relevance
allowing the model to capture dependencies more accurately.
- The scaling becomes important here. Without the scaling, the values of the resultant matrix get pushed out
into regions where the Softmax function saturates, which may result in vanishing gradients.
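
The effect of large scores on Softmax can be sketched with a handful of made-up values. Multiplying the same scores by 8 stands in for what happens when unscaled dot products grow large.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])  # made-up scores

soft = F.softmax(scores, dim=-1)       # small scores: weights stay fairly even
sharp = F.softmax(scores * 8, dim=-1)  # large scores: most weight on one entry

print(soft)
print(sharp)
print(soft.max().item(), sharp.max().item())
```

With the larger scores nearly all the weight lands on the biggest entry, so each token would end up aggregating information from essentially one other token.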


In [15]:
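tril = torch.tril(torch.ones(T,T))
# scale by 1/sqrt(head_size) so the Softmax does not saturate
weight = weight * head_size**-0.5
weight = weight.masked_fill(tril == 0, float('-inf'))
weight = F.softmax(weight, dim=-1) # each row sums to 1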


### Attention weighted features
Finally we multiply the value vectors (Vs) with the Attention Weight Matrix (A). These value vectors are important 
as they contain the information associated with each token in the sequence.

Attention-weighted features are the final output of the self-attention mechanism. These attention-weighted 
features essentially contain a weighted representation of the features, assigning higher weights to features with 
higher relevance as per the context.

In [16]:
# aggregate the value vectors using the attention weights
out = weight @ v

out.shape

torch.Size([4, 8, 16])
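
The steps above can be gathered into one module. This is a minimal sketch under stated assumptions, not a definitive implementation: the class name `Head` and the parameters `n_embd` and `block_size` are hypothetical names chosen for this example, and the sketch includes the 1/sqrt(head_size) scaling described in the notes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal self-attention (illustrative sketch)."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask stored as a buffer, not a trainable parameter
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)  # each (B, T, head_size)
        wei = (q @ k.transpose(-2, -1)) * k.shape[-1] ** -0.5  # (B, T, T), scaled
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        return wei @ v  # (B, T, head_size)

torch.manual_seed(1337)
head = Head(n_embd=32, head_size=16, block_size=8)
out = head(torch.randn(4, 8, 32))
print(out.shape)  # torch.Size([4, 8, 16])
```

Because of the mask, changing a later token cannot change the output at any earlier position, which is the property that makes this head usable for next-token prediction.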

## Next Step
Now with this information available, we continue to the next step in the transformer architecture where the 
feed-forward layer processes this information further.
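
As a rough sketch of that next step, a feed-forward layer is a small network applied to every token independently. Everything below is an assumption for illustration: the name `ffwd`, the ReLU activation and the hidden width of four times the input (the ratio used in the original paper). In a full transformer the layer acts on the model's embedding width rather than on a single head's output; a random tensor stands in for `out` here.

```python
import torch
import torch.nn as nn

head_size = 16

# Hypothetical feed-forward layer: expand, apply a non-linearity, project back
ffwd = nn.Sequential(
    nn.Linear(head_size, 4 * head_size),
    nn.ReLU(),
    nn.Linear(4 * head_size, head_size),
)

torch.manual_seed(1337)
attn_out = torch.randn(4, 8, head_size)  # stand-in for the attention output
y = ffwd(attn_out)                       # applied to each token on its own

print(y.shape)  # torch.Size([4, 8, 16])
```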

## Notes:
- Attention is a **communication mechanism**. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across the batch dimension is of course processed completely independently; examples never "talk" to each other
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
- "Scaled" attention additionally divides `wei` by sqrt(head_size), i.e. multiplies it by 1/sqrt(head_size). This makes it so when input Q,K are unit variance, wei will be unit variance too and Softmax will stay diffuse and not saturate too much. Illustration below
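
A minimal sketch of the variance argument, using unit-variance random tensors as stand-ins for the keys and queries. The names `raw` and `scaled` are illustrative.

```python
import torch

torch.manual_seed(1337)
B, T, head_size = 4, 8, 16
k = torch.randn(B, T, head_size)  # unit-variance stand-ins for keys
q = torch.randn(B, T, head_size)  # unit-variance stand-ins for queries

raw = q @ k.transpose(-2, -1)     # variance grows with head_size
scaled = raw * head_size ** -0.5  # variance brought back to roughly 1

print(k.var().item(), q.var().item())
print(raw.var().item())     # on the order of head_size
print(scaled.var().item())  # on the order of 1
```

Each score is a sum of `head_size` products of unit-variance numbers, so its variance is about `head_size`; multiplying by 1/sqrt(head_size) divides the variance by `head_size`.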