### Generate Data

In [35]:
import numpy as np
import math
import random

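# Note: random.seed only seeds Python's built-in random module. The arrays below
# are drawn with np.random, which needs np.random.seed for reproducible values.
random.seed(8)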

Every word is represented by three vectors: a query vector, a key vector and a value vector.
All three can start from the same word embedding; here they are simply drawn at random.

Q = What am I looking for. \
    [sequence length x dk]

K = What I can offer. \
    [sequence length x dk]

V = What I actually offer. \
    [sequence length x dv]
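
As a rough sketch of how three different vectors can come from the same word embedding: the embeddings `x`, the embedding size `d_model` and the projection matrices `W_q`, `W_k`, `W_v` below are hypothetical stand-ins (random here, learned in a real model) and are not part of this notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# d_model is an assumed embedding size, chosen only for illustration.
L, d_model, d_k, d_v = 4, 16, 8, 8

# Hypothetical word embeddings: one row per word in the sequence.
x = rng.standard_normal((L, d_model))

# Hypothetical projection matrices (random here, learned in a real model).
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_v))

# The same embedding is projected three different ways.
q_proj = x @ W_q  # shape (L, d_k)
k_proj = x @ W_k  # shape (L, d_k)
v_proj = x @ W_v  # shape (L, d_v)

print(q_proj.shape, k_proj.shape, v_proj.shape)
```

The rest of this notebook skips the projection step and samples q, k and v directly.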


In [36]:
"""
L = Length of the input sequence. (Example use here: 'My name is X')
q = query vector
k = key vector
v = value vector
d_k, d_v = size of each of these vectors
"""

L, d_k, d_v = 4, 8, 8
q = np.random.randn(L, d_k)
k = np.random.randn(L, d_k)
v = np.random.randn(L, d_v)

In [37]:
print("Q\n", q)
print("K\n", k)
print("V\n", v)

Q
 [[ 1.26005904 -0.60960868  0.63954502  0.58826387 -0.41858783  0.44731905
  -0.94809509 -2.26217274]
 [ 0.37111549 -0.74579406 -0.90044235  0.92477713 -0.54077297 -2.18289739
  -1.528906    0.07248531]
 [ 0.73465036  1.02057437  0.10122983  0.37461743 -0.43145323  1.54817297
  -0.7529665  -1.8003881 ]
 [ 0.55975041  0.09451513 -1.15170959  1.46535624  1.07301428 -0.03897071
  -2.03871583 -0.11334228]]
K
 [[-0.95171693 -0.71532861 -0.7193455   1.64216137  1.39948538  1.27760359
  -0.99826132 -0.93016011]
 [-0.1093818  -1.87106191 -1.48165286 -0.45515292 -0.25495794  1.07766125
   0.26877851 -1.33645628]
 [-0.53654945 -1.13072567 -0.75597795  0.28913273 -0.64078245  0.58431216
   0.28864694 -0.55114978]
 [ 3.30912443  0.77069839  0.79505441 -1.93442701  0.06164565 -1.18686332
  -0.6648031  -0.84443391]]
V
 [[-0.09855224  0.99314645  1.75519943  0.55633223  1.04683343 -1.23577629
  -0.96706165  0.42854073]
 [-1.44874153 -0.10017919  2.37106824  1.18087115 -0.4230308   0.37552648
  -1.4

### Self Attention

$$
\text{Self-Attention} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}+M\right) \times V
$$

Where:
- $Q$: Query matrix
- $K$: Key matrix
- $V$: Value matrix
- $d_k$: Dimensionality of keys (for scaling).
- $M$: Mask matrix

To create an initial attention matrix, every word needs to look at every other word to see how strong its affinity towards it is. \
This is represented by the query (what each word is looking for) and the key (what each word can offer). \
Dividing by $\sqrt{d_k}$ keeps the variance of the $QK^T$ entries from growing with $d_k$, which stabilizes the values passed to the softmax.

In [38]:
np.matmul(q, k.T)

array([[ 2.77913976,  3.14470241,  1.20254957,  5.05422854],
       [ 0.25979673, -0.45429855,  0.1820246 ,  1.66114748],
       [ 2.9135832 ,  1.6717563 ,  0.43964674,  2.73023393],
       [ 6.2269715 ,  0.08934572, -0.34919524, -0.26171583]])

In [39]:
# Why we need sqrt(d_k) in denominator
q.var(), k.var(), np.matmul(q, k.T).var()

(1.1020625626253824, 1.2402640337505473, 3.6893198395892632)

In [40]:
scaled = np.matmul(q, k.T) / math.sqrt(d_k)

q.var(), k.var(), scaled.var()

(1.1020625626253824, 1.2402640337505473, 0.4611649799486578)
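
The 4 x 8 sample above is small, so the effect of the scaling is easier to see on a larger one. The sketch below assumes independent standard normal entries; the sample size `n` and the list of dimensions are arbitrary choices. Under that assumption the variance of an unscaled dot product is roughly `d_k`, and dividing by `sqrt(d_k)` brings it back to roughly 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # number of query/key pairs per setting (arbitrary choice)

results = {}
for d in (8, 64, 512):
    q_s = rng.standard_normal((n, d))
    k_s = rng.standard_normal((n, d))
    dots = np.sum(q_s * k_s, axis=1)   # one dot product per query/key pair
    scaled_dots = dots / np.sqrt(d)    # same scaling as in the attention formula
    results[d] = (dots.var(), scaled_dots.var())
    print(d, round(dots.var(), 2), round(scaled_dots.var(), 2))
```

Without the scaling, larger `d_k` would push the softmax inputs to extreme values and make the attention weights close to one-hot.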

In [41]:
scaled

array([[ 0.98257428,  1.1118202 ,  0.42516548,  1.78693964],
       [ 0.09185201, -0.16061879,  0.06435542,  0.58730432],
       [ 1.03010722,  0.59105511,  0.15543859,  0.96528346],
       [ 2.20156689,  0.03158848, -0.12345916, -0.09253052]])

###  Masking

* Required in the decoder. Not required in the encoder.
* It ensures a word cannot look at future words when building its own context.
* Otherwise it would be cheating: at generation time the next words are not known yet, so the vectors cannot be based on them.
* In the encoder, the whole input sequence is passed in at once, so masking is not required.

In [42]:
mask = np.tril(np.ones( (L, L)))
mask

array([[1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 1., 0.],
       [1., 1., 1., 1.]])

In [43]:
mask[mask == 0] = -np.inf
mask[mask == 1] = 0

In [44]:
mask

array([[  0., -inf, -inf, -inf],
       [  0.,   0., -inf, -inf],
       [  0.,   0.,   0., -inf],
       [  0.,   0.,   0.,   0.]])
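
The two in-place assignments above depend on their order (swapping them would turn every entry into `-inf`). A possible one-step alternative is sketched below; the helper name `causal_mask` is made up for this example.

```python
import numpy as np

def causal_mask(L):
    """Additive mask: 0 on and below the diagonal, -inf above it."""
    allowed = np.tril(np.ones((L, L), dtype=bool))  # True where attention is allowed
    return np.where(allowed, 0.0, -np.inf)

print(causal_mask(4))
```

Because `np.where` builds the result from a boolean condition, no value is overwritten twice.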

In [45]:
scaled + mask

array([[ 0.98257428,        -inf,        -inf,        -inf],
       [ 0.09185201, -0.16061879,        -inf,        -inf],
       [ 1.03010722,  0.59105511,  0.15543859,        -inf],
       [ 2.20156689,  0.03158848, -0.12345916, -0.09253052]])

### Softmax Function
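Converts a vector of scores into a probability distribution. \
Every value lies between 0 and 1 and the values add up to 1, \
so the attention weights are easy to interpret and stay in a bounded range.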

$$
\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}
$$

Where:
- $x_i$: Input value for the $i$-th class.
- $n$: Total number of classes.
- $e$: Euler's number (approximately 2.718).
- $\sum_{j=1}^n e^{x_j}$: Sum of the exponential values for all classes, used for normalization.


In [46]:
def softmax(x):
    return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T

In [49]:
softmax(scaled + mask)

array([[1.        , 0.        , 0.        , 0.        ],
       [0.56278456, 0.43721544, 0.        , 0.        ],
       [0.485049  , 0.31268547, 0.20226552, 0.        ],
       [0.7617229 , 0.08697358, 0.07448195, 0.07682157]])
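
The `softmax` above works for the small scores in this notebook, but `np.exp` overflows for large inputs. A common remedy, sketched here as an optional variant (the names `stable_softmax` and `naive_softmax` are made up for this example), is to subtract the row maximum before exponentiating. This does not change the result, because the common factor cancels in the fraction.

```python
import numpy as np

def stable_softmax(x):
    """Row-wise softmax with the row maximum subtracted first."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

def naive_softmax(x):
    # Same formula as the softmax defined in this notebook.
    return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T

small = np.array([[0.5, -1.0, 2.0], [0.0, 0.0, -np.inf]])
large = np.array([[1000.0, 1001.0, 1002.0]])

# Both versions agree on moderate inputs, including masked (-inf) entries.
print(np.allclose(stable_softmax(small), naive_softmax(small)))
# The stable version stays finite where np.exp(1000.0) would overflow.
print(stable_softmax(large))
```

A row that is entirely `-inf` still produces `nan` in both versions, so a mask should always leave at least one allowed position per row.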

In [50]:
attention = softmax(scaled + mask)

In [51]:
attention

array([[1.        , 0.        , 0.        , 0.        ],
       [0.56278456, 0.43721544, 0.        , 0.        ],
       [0.485049  , 0.31268547, 0.20226552, 0.        ],
       [0.7617229 , 0.08697358, 0.07448195, 0.07682157]])

### Matrix multiplication of attention with value matrix.

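Multiplying the attention weights with the value matrix gives each word a new vector that better captures its context: a weighted average of the value vectors it is allowed to attend to. \
Compare new_v with v below: the first row is identical because the first word can only attend to itself, while every subsequent row mixes in the rows before it.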

In [52]:
new_v = np.matmul(attention, v)
new_v

array([[-0.09855224,  0.99314645,  1.75519943,  0.55633223,  1.04683343,
        -1.23577629, -0.96706165,  0.42854073],
       [-0.68887585,  0.5151276 ,  2.02446679,  0.82939029,  0.40418609,
        -0.53128984, -1.17739931, -0.37845938],
       [-0.46067293,  0.22422128,  1.50684052,  0.22222654,  0.47192725,
        -0.50642937, -1.2086278 , -0.25641118],
       [-0.11267879,  0.57211039,  1.5324853 ,  0.3033711 ,  0.79409628,
        -0.7943144 , -0.8260043 ,  0.09865097]])

In [53]:
v

array([[-0.09855224,  0.99314645,  1.75519943,  0.55633223,  1.04683343,
        -1.23577629, -0.96706165,  0.42854073],
       [-1.44874153, -0.10017919,  2.37106824,  1.18087115, -0.4230308 ,
         0.37552648, -1.44814634, -1.41723147],
       [ 0.19840339, -1.11822734, -0.42476745, -2.06096971,  0.47678579,
        -0.12082625, -1.41764738, -0.10445054],
       [ 0.95826576, -1.20267409,  0.27240923, -0.90598858, -0.02629666,
         1.6055803 ,  1.85063027, -1.25924192]])
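
To make the weighted-average reading concrete, the sketch below recomputes the second row of new_v by hand. The arrays are copied from the printed outputs above, so they are rounded to 8 decimals; the names `v_demo`, `attn`, `out` and `by_hand` are used only in this example.

```python
import numpy as np

# Value matrix copied from the output above.
v_demo = np.array([
    [-0.09855224,  0.99314645,  1.75519943,  0.55633223,  1.04683343, -1.23577629, -0.96706165,  0.42854073],
    [-1.44874153, -0.10017919,  2.37106824,  1.18087115, -0.4230308,   0.37552648, -1.44814634, -1.41723147],
    [ 0.19840339, -1.11822734, -0.42476745, -2.06096971,  0.47678579, -0.12082625, -1.41764738, -0.10445054],
    [ 0.95826576, -1.20267409,  0.27240923, -0.90598858, -0.02629666,  1.6055803,   1.85063027, -1.25924192],
])

# Masked attention weights copied from the output above.
attn = np.array([
    [1.0,        0.0,        0.0,        0.0],
    [0.56278456, 0.43721544, 0.0,        0.0],
    [0.485049,   0.31268547, 0.20226552, 0.0],
    [0.7617229,  0.08697358, 0.07448195, 0.07682157],
])

out = attn @ v_demo

# Row 1 only mixes value rows 0 and 1, because later positions are masked.
by_hand = attn[1, 0] * v_demo[0] + attn[1, 1] * v_demo[1]
print(np.allclose(out[1], by_hand))
# Row 0 attends only to itself, so it is unchanged.
print(np.allclose(out[0], v_demo[0]))
```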

### Functions
Converting all of the above into functions.

In [18]:
def softmax(x):
    return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T

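def scaled_dot_product_attention(q, k, v, mask=None):
    """Return (output, attention weights) for 2-D q, k, v; mask is additive (0 or -inf)."""
    d_k = q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scaled = np.matmul(q, k.T) / math.sqrt(d_k)
    if mask is not None:
        # -inf entries get zero weight after the softmax
        scaled = scaled + mask
    attention = softmax(scaled)  # each row sums to 1
    # Weighted average of the value vectors
    out = np.matmul(attention, v)
    return out, attention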
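
A quick sanity check of these functions, sketched under assumptions: the inputs are fresh random arrays rather than the notebook's q, k and v, and the names `q_t`, `k_t`, `v_t` and `causal` exist only in this example. It compares the unmasked (encoder-style) call with the masked (decoder-style) call.

```python
import math
import numpy as np

def softmax(x):
    return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.shape[-1]
    scaled = np.matmul(q, k.T) / math.sqrt(d_k)
    if mask is not None:
        scaled = scaled + mask
    attention = softmax(scaled)
    return np.matmul(attention, v), attention

rng = np.random.default_rng(0)
L, d_k, d_v = 4, 8, 8
q_t = rng.standard_normal((L, d_k))
k_t = rng.standard_normal((L, d_k))
v_t = rng.standard_normal((L, d_v))

# Additive causal mask: 0 where attention is allowed, -inf for future positions.
causal = np.where(np.tril(np.ones((L, L), dtype=bool)), 0.0, -np.inf)

enc_out, enc_attn = scaled_dot_product_attention(q_t, k_t, v_t)               # no mask
dec_out, dec_attn = scaled_dot_product_attention(q_t, k_t, v_t, mask=causal)  # causal mask

print(np.allclose(enc_attn.sum(axis=-1), 1.0))   # every row is a distribution
print(np.allclose(np.triu(dec_attn, k=1), 0.0))  # no weight on future positions
print(enc_out.shape, dec_out.shape)
```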

In [19]:
# For encoder, mask = None
# For decoder, mask = mask
values, attention = scaled_dot_product_attention(q, k, v, mask=mask)
print("Q\n", q)
print("K\n", k)
print("V\n", v)
print("New V\n", values)
print("Attention\n", attention)

Q
 [[-1.71509064 -1.92415593 -1.27289131 -1.38139549  0.4973059  -0.22893632
   1.89968557 -0.24670508]
 [-1.0403621  -0.93852842 -1.30753258 -0.75307571  1.02039821  0.82553694
   0.8123335   0.24272788]
 [ 0.53678149  0.72682371 -0.57013903  1.43488115 -1.1783893   2.82176859
   0.25996312 -1.37891998]
 [-1.13067757 -0.92246552  0.0159211  -0.56538126 -0.95904992 -0.16986172
  -0.81385949 -0.35420748]]
K
 [[ 0.38294428  1.2945802  -0.39063838 -0.05981553  1.16020338  1.20040557
   0.70155104 -0.60762109]
 [-1.15423911 -0.35296605 -0.82701221  0.19377793 -0.88831067 -1.40537158
  -0.0962384  -1.10318789]
 [-0.45286339  1.21208717  0.91555334 -0.61312538  0.26332605 -1.03603933
  -0.59359496 -0.53816555]
 [ 1.38981481 -0.6328841  -1.0149947   0.86753006 -0.38059723  0.31700018
   1.60578718 -0.06665863]]
V
 [[ 0.12694417  0.52186389  0.96191713  1.36097137  0.06526419 -0.29196679
  -1.01741379 -0.95722451]
 [ 0.48344737 -0.48286957 -0.31682075 -0.56666831 -1.18486621  0.45415092
   0.5