<a href="https://colab.research.google.com/github/lakhanrajpatlolla/aiml-learning/blob/master/Self_Attention_for_Tfr.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Self Attention**

### **Learning Objectives**

At the end of the experiment, you will be able to:

* understand the implementation details of self attention
* understand & implement the concept of Queries, Keys and Values
* understand & implement the concept of masking



### **Introduction**

Most popular language models are Transformer-based architectures that use an important technique called 'self-attention'. The primary function of self-attention is to generate context-aware vectors from the input sequence. See the example below.

Note: Throughout this notebook, **'as per the paper'** refers to the paper ["**Attention Is All You Need**"](https://arxiv.org/pdf/1706.03762v6.pdf).

<center>
<img src= https://www.dropbox.com/scl/fi/0fi9619uk3eizxe0saxub/Self_Attention_Scores.png?rlkey=q7kji50ctfbv4igfnuyas54vj&raw=1 width=900px/>
</center>

**outputs = sum(inputs * pairwise_scores(inputs, inputs))**


According to the self-attention scores depicted in the picture, the word 'train' pays more attention to 'station' than to other words such as 'on' or 'the'. The self-attention model allows inputs to interact with each other (i.e., it calculates the attention of all other inputs with respect to one input).

#### **Implementation Details**



Inside each attention head is a **Scaled Dot Product Self-Attention** operation, which returns an attention vector as given by the equation below:

$$ \text{SelfAttention}(x_i) = \sum_j \text{softmax}_j\left(\frac{x^{T}_i x_j}{\sqrt{d_k}}\right) x_j $$

The term **$x^{T}_i x_j$** is the dot product of one input vector with another vector from the same sequence. The 'pivot_vector' and the 'vector' correspond to $x_i$ and $x_j$ in the Self Attention function above. The softmax is taken over $j$, and the weighted vectors $x_j$ are summed to give the output for position $i$.
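
The equation above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the notebook's own implementation: the function name `simple_self_attention`, the random dummy input, and the seed are assumptions, and the vector size `d` is used in place of $d_k$ for the scaling.

```python
import numpy as np

def simple_self_attention(x):
    """Parameter-free self-attention over the rows of x, shape (T, d)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                          # (T, T) scaled dot products x_i^T x_j
    scores = scores - scores.max(axis=-1, keepdims=True)   # shift for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over j for every i
    return weights @ x, weights                            # outputs (T, d), weights (T, T)

rng = np.random.default_rng(0)     # arbitrary seed (assumption)
x = rng.standard_normal((4, 6))    # 4 dummy token vectors of size 6
outputs, weights = simple_self_attention(x)
print(outputs.shape, weights.shape)  # (4, 6) (4, 4)
```

Each row of `weights` sums to 1, so every output row is a weighted average of the input vectors.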

#### **Attention Eqn. with Queries, Keys and Values**

So far, Self Attention was computed from the input vectors themselves. This means that for fixed inputs, the attention weights are always fixed. In other words, there are no learnable parameters. We need to introduce some learnable parameters that make the self-attention mechanism more flexible and tunable for various tasks. To fulfil this purpose, three weight matrices are introduced and multiplied with the input $x_i$ separately, which yields three new terms, **Queries (Q), Keys (K) and Values (V)**, as given by the equations below. The vectorized implementation and shape tracking are shown along with the equations.

**Vectorized implementation & Shape tracking**

$ T $ = Number of terms (words) in the input sequence.

$ d_{model} $ = Dimension of the embedding vector for each word (512 as per the paper).

$ X   \Rightarrow (T \times d_{model}) $


$ Q = X W^{Q}   \Rightarrow (T \times d_{model}) \times (d_{model} \times d_k  )  \Rightarrow   (T \times d_k ) $


$ K = X W^{K}   \Rightarrow (T \times d_{model}) \times (d_{model} \times d_k  )  \Rightarrow   (T \times d_k ) $


$ V = X W^{V}   \Rightarrow (T \times d_{model}) \times (d_{model} \times d_v  )  \Rightarrow   (T \times d_v) $

Dot product of Queries and Keys:

$ Q K^{T}   \Rightarrow (T \times d_{k}) \times (d_{k} \times T  )  \Rightarrow   (T \times T) $

There are T query vectors and T key vectors (one per position of the input sequence), so we need $T \times T$ attention weights, which matches the shape above. Taking the softmax doesn't change the shape.

**Shapes as per the paper**

$
\begin{array}{|c|c|c|} \hline
\text{Object}   &  \text{Shape} & \text{Values}  \\ \hline
q_i, k_i  &  d_k  &  (64,) \\
v_i   &   d_v   &   (64,)  \\
x_i   &   d_{model}   & (512,)  \\
W^{Q}, W^{K}  &   d_{model} \times d_k   &   (512, 64)  \\
W^{V}   &   d_{model} \times d_v   &  (512, 64)  \\ \hline
\end{array}
$
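
The shape tracking above can be checked directly with NumPy. The sketch below uses the dimensions from the table; the sequence length `T = 10`, the seed, and the random matrices standing in for the learned weights `W_Q`, `W_K`, `W_V` are assumptions made only for illustration.

```python
import numpy as np

T, d_model, d_k, d_v = 10, 512, 64, 64   # T = 10 is an arbitrary sequence length
rng = np.random.default_rng(0)           # arbitrary seed (assumption)

X = rng.standard_normal((T, d_model))    # one embedding vector per term
# Random stand-ins for the learned projection matrices
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # (T, d_k), (T, d_k), (T, d_v)
scores = Q @ K.T                         # (T, T)

print(Q.shape, K.shape, V.shape, scores.shape)  # (10, 64) (10, 64) (10, 64) (10, 10)
```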

**Batch consideration**

In code, a batch of N samples is processed at a time. Every shape gets an extra leading dimension of size **N**, for example $ N \times T \times d_k $ instead of just $ T \times d_k$.

**Final Scaled Dot Product Attention** equation inside each attention head with **Queries (Q)**, **Keys (K)**, and **Values (V)**, which returns an attention vector.
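
As a rough sketch of the batched case, `np.matmul` treats the leading axis as a batch axis, so only the last two axes of the keys need to be swapped. The batch size `N = 2` and the random data are assumptions for illustration.

```python
import numpy as np

N, T, d_k = 2, 4, 6                      # N = 2 is an arbitrary batch size
rng = np.random.default_rng(0)           # arbitrary seed (assumption)
Q = rng.standard_normal((N, T, d_k))
K = rng.standard_normal((N, T, d_k))

# Swap only the last two axes so each sample's K becomes (d_k, T);
# matmul then multiplies sample by sample along the leading axis.
scores = np.matmul(Q, np.swapaxes(K, -1, -2))
print(scores.shape)  # (2, 4, 4)
```

Note that `K.T` would reverse all three axes here, which is why `np.swapaxes` is used instead.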

<center>
<img src= https://www.dropbox.com/scl/fi/pfr6b522rccvp7bkuqilg/Scaled_dot_product_Attention.png?rlkey=nyba4vhest5995hdain8igayq&raw=1 width=250px/>

</center>


$$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$$



$ Shape \ of \ attention \ output = (T \times T) \times (T \times d_v)   \Rightarrow  (T \times d_v)  $
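
The attention equation can be wrapped in a single function. The following is a minimal sketch under stated assumptions, not the paper's reference code: the name `scaled_dot_product_attention` and the dummy sizes are chosen here for illustration, and masking is left out.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V and the attention weights."""
    d_k = Q.shape[-1]
    scores = np.matmul(Q, np.swapaxes(K, -1, -2)) / np.sqrt(d_k)  # (..., T, T)
    scores = scores - scores.max(axis=-1, keepdims=True)          # shift for stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return np.matmul(weights, V), weights                         # (..., T, d_v), (..., T, T)

rng = np.random.default_rng(0)           # arbitrary seed (assumption)
T, d_k, d_v = 4, 6, 6
Q = rng.standard_normal((T, d_k))
K = rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_v))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (4, 6) (4, 4)
```

Because only the last two axes are used, the same function also accepts batched inputs of shape $ N \times T \times d_k $.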

### **Implementation with dummy data**

In [None]:
import numpy as np
import math

In [None]:
T, d_k, d_v = 4, 6, 6   # T= Number of terms
q = np.random.randn(T, d_k) # 4X6
k = np.random.randn(T, d_k) # 4X6
v = np.random.randn(T, d_v) # 4X6

In [None]:
print("Q\n", q)
print("K\n", k)
print("V\n", v)

Q
 [[ 0.03076571 -0.66596084 -0.57981773  0.78204297 -2.20486326  0.26485014]
 [-0.67861224  0.49590148  0.13423506 -0.4223308   0.7713341  -1.18389824]
 [ 2.0906267   0.27694552 -1.1727269   1.66930336 -0.14534764  0.74502536]
 [-1.13411916  1.13097436 -1.65952086 -2.06938392  2.02277881  0.37287297]]
K
 [[-1.22715745  1.49918116  0.89824522 -0.08052928 -2.21706475 -0.26191323]
 [ 0.71747107  0.009494   -0.69504954  0.02322563 -1.25806545 -0.11341351]
 [ 0.91258838  1.11400796  0.46615481 -0.170631   -0.17803358 -0.97992068]
 [ 1.05682361  0.70185397  0.62036256 -0.63105621 -0.80103572 -2.07085978]]
V
 [[-0.48742851  1.76485745  0.36118831  0.84109156  1.64645719  2.49370612]
 [-0.41495437 -0.82550919 -1.8979914   1.13401928 -0.17321176 -0.39958629]
 [-0.86078293  0.08485973  0.13252766  0.6324223   0.25428529 -0.25960824]
 [-1.01291217  0.73806086  0.21143847 -0.17391314 -1.78605033 -0.59720633]]


In [None]:
np.matmul(q, k.T) # Dot product

array([[ 3.19901068,  3.18074107, -0.98452726, -0.07039688],
       [ 0.33077719, -1.4214042 ,  1.09058459,  1.81448694],
       [-3.21104763,  2.45482831,  0.68070498, -0.80355552],
       [-2.81902441, -2.28465071, -0.92107062, -2.52087782]])

In [None]:
# Why we need sqrt(d_k) in the denominator: the dot product has a much larger variance than q or k
q.var(), k.var(), np.matmul(q, k.T).var()

(1.341162827829847, 0.936821161732119, 4.089120011404893)

In [None]:
scaled_dot_product = np.matmul(q, k.T) / math.sqrt(d_k)
print(scaled_dot_product)
q.var(), k.var(), scaled_dot_product.var()
# Notice the reduction in variance of the product

[[ 1.30599064  1.29853211 -0.40193157 -0.0287394 ]
 [ 0.13503922 -0.58028584  0.4452293   0.74076119]
 [-1.31090471  1.00217946  0.27789664 -0.32805017]
 [-1.1508619  -0.93270475 -0.3760255  -1.02914406]]


(1.341162827829847, 0.936821161732119, 0.6815200019008156)
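
With only 4 x 6 entries, the variances above are noisy estimates. A larger experiment makes the effect clearer. The sketch below assumes that the entries of q and k are independent standard normal values, in which case the variance of a dot product is close to `d_k` and dividing by `sqrt(d_k)` brings it back to roughly 1. The sample count and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed (assumption)
n_pairs = 10_000                 # number of (q, k) pairs per setting

variances = {}
for d_k in (6, 64, 512):
    q = rng.standard_normal((n_pairs, d_k))
    k = rng.standard_normal((n_pairs, d_k))
    dots = np.sum(q * k, axis=-1)                  # one dot product per (q, k) pair
    # Store (variance of raw dot products, variance after scaling)
    variances[d_k] = (dots.var(), (dots / np.sqrt(d_k)).var())
    print(d_k, variances[d_k])
```

Without the scaling, larger `d_k` produces larger scores, which pushes the softmax towards a one-hot output.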

In [None]:
scaled_dot_product.shape

(4, 4)

### **Masking**
* Masking ensures that a word doesn't get context from words that come later in the sequence (words generated in the future).
* It is not required in the encoder, but it is required in the decoder.

In [None]:
mask = np.tril(np.ones( (T, T) ))
mask

array([[1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 1., 0.],
       [1., 1., 1., 1.]])

In [None]:
mask[mask == 0] = -np.inf  # blocked (future) positions
mask[mask == 1] = 0        # allowed positions
mask

array([[  0., -inf, -inf, -inf],
       [  0.,   0., -inf, -inf],
       [  0.,   0.,   0., -inf],
       [  0.,   0.,   0.,   0.]])
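
The same additive mask can also be built in one step, without overwriting values in place. This is an alternative sketch, assuming `np.triu` with `k=1` to mark the future positions; the name `causal_mask` is chosen here for illustration.

```python
import numpy as np

T = 4
# True strictly above the main diagonal, i.e. at the future positions to hide
future = np.triu(np.ones((T, T), dtype=bool), k=1)
# -inf where attention is blocked, 0 where it is allowed
causal_mask = np.where(future, -np.inf, 0.0)
print(causal_mask)
```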

In [None]:
scaled_dot_product + mask

array([[ 1.30599064,        -inf,        -inf,        -inf],
       [ 0.13503922, -0.58028584,        -inf,        -inf],
       [-1.31090471,  1.00217946,  0.27789664,        -inf],
       [-1.1508619 , -0.93270475, -0.3760255 , -1.02914406]])

### **Softmax**

$ softmax(x)_i = \frac{\exp(x_{i})}{\sum_{j} \exp({x_j})} $



In [None]:
def softmax(x):
  # Normalize along the last axis so that each row sums to 1
  exp_x = np.exp(x)
  return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

In [None]:
softmax(scaled_dot_product + mask)

array([[1.        , 0.        , 0.        , 0.        ],
       [0.67157673, 0.32842327, 0.        , 0.        ],
       [0.06248665, 0.63146158, 0.30605177, 0.        ],
       [0.18039292, 0.22436955, 0.39149539, 0.20374214]])
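
For large scores, `np.exp` can overflow. A common remedy, shown here as a sketch rather than as part of the original notebook, is to subtract the row maximum before exponentiating, which leaves the result mathematically unchanged. The name `stable_softmax` is an assumption, and the function expects every row to contain at least one finite value.

```python
import numpy as np

def stable_softmax(x):
    """Row-wise softmax that subtracts the row maximum before exponentiating."""
    shifted = x - np.max(x, axis=-1, keepdims=True)   # largest entry per row becomes 0
    exp_x = np.exp(shifted)                           # exp(-inf) = 0 for masked entries
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

big = np.array([[1000.0, 1001.0, 1002.0]])            # np.exp(1000.0) alone would overflow
masked = np.array([[0.5, -np.inf, -np.inf]])          # a fully masked-out future

print(stable_softmax(big))
print(stable_softmax(masked))
```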

In [None]:
attention = np.matmul(softmax(scaled_dot_product + mask), v)
print(attention)
attention.shape

[[-0.48742851  1.76485745  0.36118831  0.84109156  1.64645719  2.49370612]
 [-0.46362631  0.91412078 -0.38077887  0.93729584  1.04883557  1.54348157]
 [-0.55592966 -0.38502584 -1.13537887  0.96220057  0.07132949 -0.17595361]
 [-0.72439722  0.31674494 -0.26573278  0.61832334 -0.00619643  0.1368804 ]]


(4, 6)
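
Note that the first row of the attention output equals the first row of V, because the first term can only attend to itself. The sketch below puts the steps together and checks the causal property: changing the last key and value must not change the outputs of earlier positions. The function name `causal_attention`, the seed, and the perturbation values are assumptions for illustration.

```python
import numpy as np

def causal_attention(q, k, v):
    """Masked scaled dot-product attention for a single sequence."""
    T, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)                                   # (T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)                # positions to hide
    scores = scores + np.where(future, -np.inf, 0.0)                  # additive mask
    scores = scores - scores.max(axis=-1, keepdims=True)              # shift for stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)           # row-wise softmax
    return weights @ v                                                # (T, d_v)

rng = np.random.default_rng(0)           # arbitrary seed (assumption)
T, d_k, d_v = 4, 6, 6
q = rng.standard_normal((T, d_k))
k = rng.standard_normal((T, d_k))
v = rng.standard_normal((T, d_v))
out = causal_attention(q, k, v)

# Perturb only the last (most "future") key and value
k2, v2 = k.copy(), v.copy()
k2[-1] += 10.0
v2[-1] -= 10.0
out2 = causal_attention(q, k2, v2)

print(np.allclose(out[:-1], out2[:-1]))  # True: earlier positions are unaffected
print(np.allclose(out[0], v[0]))         # True: the first term attends only to itself
```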