### Generate Data

In [53]:
import numpy as np
import math
import random

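random.seed(8)
# Note: random.seed() only seeds Python's built-in `random` module.
# The np.random.randn calls below use NumPy's own generator, which this
# does not affect; np.random.seed(8) would be needed to make them repeatable.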

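Every word is represented by three vectors: a query vector, a key vector and a value vector.
All three come from the same input word (in the simplest case they could even be the same vector).

The number of rows in Q (and in K and V) equals the length of the input sequence: one row per word.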
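
The cells below sample q, k and v independently to keep things simple. In a trained model the three matrices would typically be computed from the same word embeddings through three learned projections. A minimal sketch of that idea, where `x`, `W_q`, `W_k`, `W_v` and `d_model` are hypothetical names that do not appear elsewhere in this notebook, and random matrices stand in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only

L, d_model, d_k, d_v = 4, 16, 8, 8  # d_model is an assumed embedding size

x = rng.standard_normal((L, d_model))      # one embedding row per word
W_q = rng.standard_normal((d_model, d_k))  # stand-ins for learned projections
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_v))

# Three different views of the same input sequence
q_proj = x @ W_q
k_proj = x @ W_k
v_proj = x @ W_v

print(q_proj.shape, k_proj.shape, v_proj.shape)
```

Each result has `L` rows, one per word, which is why the sequence length fixes the first dimension of Q, K and V.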

In [54]:
"""
L = Length of the input sequence. (Example use here: 'My name is X')
q = query vector
k = key vector
v = value vector
d_k, d_v = size of each of these vectors
"""

L, d_k, d_v = 4, 8, 8
q = np.random.randn(L, d_k)
k = np.random.randn(L, d_k)
v = np.random.randn(L, d_v)

In [55]:
print("Q\n", q)
print("K\n", k)
print("V\n", v)

Q
 [[-1.41767666  0.96908187 -1.00570424 -1.38323781  1.13465927  0.79178207
   0.53179063 -0.72936195]
 [-2.16976431  0.66542476 -0.99466221  0.60654108  1.71870799 -0.41256672
  -1.79806925 -0.79852829]
 [ 0.89823134  0.55090139 -0.4092929   0.13166055  0.37747643  0.51839027
  -1.41010075 -0.04922471]
 [ 0.24826843  0.5348298   0.75793618 -0.53615346 -0.19183252 -1.63892969
   0.24689664  0.87612461]]
K
 [[ 1.57491474 -0.86316812 -0.65884679 -0.37227769 -0.37581007  0.63731469
   0.55250173 -0.23310638]
 [-1.63494969 -1.30428113 -0.80651226  0.47716264  1.13593996  0.41789594
   0.41580284 -0.18122419]
 [ 2.06477795 -1.20794564 -0.94439751 -0.75339707 -0.47730805 -1.76071251
  -0.50391215  0.54014202]
 [-0.55187274 -0.28083992  0.11199461  0.00648734 -1.78380066  0.48092649
  -1.297721    0.73737651]]
V
 [[-0.83391383 -1.53728569 -0.62599555 -0.87081994 -0.6608632  -1.30376564
   0.25958102  1.95478765]
 [-1.69238889 -0.31480643  1.4093176   0.97424845  0.99885477 -0.55161911
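  -0.88639875  0.97593617]
 [ 0.5854254  -0.90880531 -0.88819252 -1.15829242 -0.13733918 -0.09430798
   1.76882755 -1.03688959]
 [ 1.72302482 -0.18988972 -0.12444181 -1.03273095  1.45272653 -0.29987363
  -0.70815872  0.18090176]]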

### Self Attention

$$
\text{Self-Attention} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}+M\right) \times V
$$

Where:
- $Q$: Query matrix
- $K$: Key matrix
- $V$: Value matrix
- $d_k$: Dimensionality of the keys (used for scaling)
- $M$: Mask matrix

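To build the initial attention matrix, every word is compared with every other word to measure how strongly the two are related. \
The query represents what a word is looking for, and the key represents what each word has to offer; their dot product $QK^T$ gives one score per pair of words. \
Dividing by $\sqrt{d_k}$ reduces the variance of these scores: a dot product of two $d_k$-dimensional vectors with unit-variance entries has a variance of about $d_k$, so the division brings the entries of $QK^T$ back to a variance near 1 and keeps them in a stable range.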
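
A short derivation of where the factor $d_k$ comes from. This is a sketch under the assumption that the entries of a query $q$ and a key $k$ are independent with zero mean and unit variance, which holds for the `randn` samples used here but only approximately in a trained model:

$$
\operatorname{Var}(q \cdot k)
= \operatorname{Var}\left(\sum_{i=1}^{d_k} q_i k_i\right)
= \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i)
= \sum_{i=1}^{d_k} E[q_i^2]\,E[k_i^2]
= d_k
$$

$$
\operatorname{Var}\left(\frac{q \cdot k}{\sqrt{d_k}}\right)
= \frac{\operatorname{Var}(q \cdot k)}{d_k}
= 1
$$

The second equality uses the fact that the products $q_i k_i$ are uncorrelated, and the third uses the zero mean of each entry.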

In [56]:
np.matmul(q, k.T)

array([[-1.34961473,  3.1780436 , -4.70349122, -2.48253399],
       [-5.27817589,  4.94819001, -4.42067971, -0.61657652],
       [ 0.58067407, -1.72614953,  1.06760165,  0.67417618],
       [-1.41065307, -2.92951897,  2.88078195, -0.32619056]])

In [57]:
# Why we need sqrt(d_k) in denominator

q.var(), k.var(), np.matmul(q, k.T).var()

(0.9439222438403891, 0.8518506475999309, 7.96814250921233)

In [61]:
scaled = np.matmul(q, k.T) / math.sqrt(d_k)

q.var(), k.var(), scaled.var()

(0.9439222438403891, 0.8518506475999309, 0.9960178136515412)

In [62]:
scaled

array([[-0.47716086,  1.12360809, -1.66293527, -0.87770831],
       [-1.86611698,  1.74944936, -1.5629463 , -0.21799272],
       [ 0.20529928, -0.61028602,  0.37745418,  0.23835727],
       [-0.49874117, -1.03574136,  1.01851023, -0.11532578]])
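
The 4 x 8 sample above is small, so the measured variances are noisy. A sketch with more rows shows the same trend more clearly; the sample size `n`, the seed and the list of dimensions are arbitrary choices made for illustration:

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 2000                        # hypothetical sample size

ratios = {}
for d in (8, 64, 512):
    a = rng.standard_normal((n, d))
    b = rng.standard_normal((n, d))
    dots = np.sum(a * b, axis=1)              # n independent dot products
    raw_var = dots.var()                      # expected to be close to d
    scaled_var = (dots / math.sqrt(d)).var()  # expected to be close to 1
    ratios[d] = (raw_var / d, scaled_var)
    print(d, round(raw_var, 1), round(scaled_var, 3))
```

The unscaled variance grows roughly in proportion to the vector size, while the scaled variance stays near 1 regardless of the dimension.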

### Masking

* Required in the decoder, not in the encoder.
* It ensures that a word cannot look at future words while its own context is being built.
* Otherwise it would be cheating: at generation time the next words are not known yet, so the vectors cannot be based on them.
* In the encoder all inputs are passed in simultaneously, so masking is not required.

In [68]:
mask = np.tril(np.ones( (L, L)))
mask

array([[1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 1., 0.],
       [1., 1., 1., 1.]])

In [69]:
mask[mask == 0] = -np.inf
mask[mask == 1] = 0

In [70]:
mask

array([[  0., -inf, -inf, -inf],
       [  0.,   0., -inf, -inf],
       [  0.,   0.,   0., -inf],
       [  0.,   0.,   0.,   0.]])
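
The two-step construction above (build a 0/1 matrix, then overwrite its entries) can also be written as a single expression. A sketch using `np.triu` and `np.where`; the function name `causal_mask` is hypothetical:

```python
import numpy as np

def causal_mask(L):
    # True strictly above the diagonal, i.e. at "future" positions
    future = np.triu(np.ones((L, L), dtype=bool), k=1)
    # -inf where attention is blocked, 0 where it is allowed
    return np.where(future, -np.inf, 0.0)

print(causal_mask(4))
```

This avoids the intermediate state in which the same array holds both 0/1 flags and additive mask values.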

In [72]:
scaled + mask

array([[-0.47716086,        -inf,        -inf,        -inf],
       [-1.86611698,  1.74944936,        -inf,        -inf],
       [ 0.20529928, -0.61028602,  0.37745418,        -inf],
       [-0.49874117, -1.03574136,  1.01851023, -0.11532578]])

### Softmax Function
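Converts a vector of scores into a probability distribution. \
Every output lies between 0 and 1, and the outputs add up to 1. \
A masked score of $-\infty$ maps to exactly 0, so the result can be read directly as attention weights.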

$$
\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}
$$

Where:
- $x_i$: Input value for the $i$-th class
- $n$: Total number of classes
- $e$: Euler's number (approximately 2.718)
- $\sum_{j=1}^n e^{x_j}$: Sum of the exponential values for all classes, used for normalization


In [83]:
def softmax(x):
    return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T
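
The implementation above exponentiates the raw scores, which can overflow when a score is large. A common variant subtracts each row's maximum first, which leaves the result mathematically unchanged. A sketch, where the name `stable_softmax` is hypothetical and every row is assumed to contain at least one finite entry (true for a causal mask, since the diagonal is never masked):

```python
import numpy as np

def stable_softmax(x):
    # Shift each row so its largest entry is 0; exp() then never exceeds 1
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

# Scores this large would overflow np.exp() without the shift
scores = np.array([[1000.0, 1001.0, -np.inf],
                   [0.5, -0.5, 0.0]])
print(stable_softmax(scores))
```

Using `keepdims=True` lets the division broadcast row by row, so the double transpose is not needed.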

In [84]:
softmax(scaled + mask)

array([[1.        , 0.        , 0.        , 0.        ],
       [0.02619694, 0.97380306, 0.        , 0.        ],
       [0.38019313, 0.16818996, 0.45161691, 0.        ],
       [0.13138081, 0.07679195, 0.59905383, 0.19277341]])

In [85]:
attention = softmax(scaled + mask)

In [86]:
attention

array([[1.        , 0.        , 0.        , 0.        ],
       [0.02619694, 0.97380306, 0.        , 0.        ],
       [0.38019313, 0.16818996, 0.45161691, 0.        ],
       [0.13138081, 0.07679195, 0.59905383, 0.19277341]])

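### Matrix multiplication of attention with value matrix

Multiplying the attention weights by V gives each word a context-aware vector: a weighted average of the value vectors it is allowed to see. \
Compare new_v with v row by row: the first row is identical (the first word can only attend to itself), and each later row blends in the rows before it.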

In [87]:
new_v = np.matmul(attention, v)
new_v

array([[-0.83391383, -1.53728569, -0.62599555, -0.87081994, -0.6608632 ,
        -1.30376564,  0.25958102,  1.95478765],
       [-1.66989946, -0.34683165,  1.35599862,  0.9259133 ,  0.95537523,
        -0.57132305, -0.85637759,  1.00157908],
       [-0.33730312, -1.04784459, -0.40208889, -0.69032539, -0.145283  ,
        -0.63105062,  0.74843997,  0.43906264],
       [ 0.44333257, -0.80717343, -0.53008375, -0.93255716,  0.18765275,
        -0.32795291,  0.88914442, -0.25451399]])

In [88]:
v

array([[-0.83391383, -1.53728569, -0.62599555, -0.87081994, -0.6608632 ,
        -1.30376564,  0.25958102,  1.95478765],
       [-1.69238889, -0.31480643,  1.4093176 ,  0.97424845,  0.99885477,
        -0.55161911, -0.88639875,  0.97593617],
       [ 0.5854254 , -0.90880531, -0.88819252, -1.15829242, -0.13733918,
        -0.09430798,  1.76882755, -1.03688959],
       [ 1.72302482, -0.18988972, -0.12444181, -1.03273095,  1.45272653,
        -0.29987363, -0.70815872,  0.18090176]])
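
To make the weighted-average reading concrete, the second row of `new_v` can be rebuilt by hand from the printed attention weights and the first two rows of `v`. The numbers below are copied from the outputs above, which are rounded to eight decimals, so the match holds only up to rounding:

```python
import numpy as np

# Rows 0 and 1 of v, copied from the printed output
v0 = np.array([-0.83391383, -1.53728569, -0.62599555, -0.87081994,
               -0.6608632, -1.30376564, 0.25958102, 1.95478765])
v1 = np.array([-1.69238889, -0.31480643, 1.4093176, 0.97424845,
               0.99885477, -0.55161911, -0.88639875, 0.97593617])

# Row 1 of the attention matrix: weights on word 0 and word 1
# (the two future words have weight 0 and drop out of the sum)
w0, w1 = 0.02619694, 0.97380306

row1 = w0 * v0 + w1 * v1
print(row1)
```

Because almost all of the weight sits on word 1, the rebuilt row stays close to `v1` with a small contribution from `v0`.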

### Functions
Converting all of the above into functions.

In [89]:
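def softmax(x):
    # Row-wise softmax over the last axis (x is assumed to be 2-D)
    return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.shape[-1]
    # Score of every query against every key, scaled by sqrt(d_k)
    scaled = np.matmul(q, k.T) / math.sqrt(d_k)
    if mask is not None:
        # Additive mask: 0 keeps a score, -inf removes it after softmax
        scaled = scaled + mask
    attention = softmax(scaled)      # each row sums to 1
    out = np.matmul(attention, v)    # weighted average of value vectors
    return out, attention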
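
One way to check that the mask does its job is to change only the last word and confirm that the outputs for the earlier words stay the same. The sketch below repeats the two functions so that it runs on its own; the names `q_t`, `k_t`, `v_t` and `causal`, the seed and the size of the perturbation are arbitrary choices for illustration:

```python
import math
import numpy as np

def softmax(x):
    return (np.exp(x).T / np.sum(np.exp(x), axis=-1)).T

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.shape[-1]
    scaled = np.matmul(q, k.T) / math.sqrt(d_k)
    if mask is not None:
        scaled = scaled + mask
    attention = softmax(scaled)
    return np.matmul(attention, v), attention

rng = np.random.default_rng(0)  # arbitrary seed
L, d_k, d_v = 4, 8, 8
q_t = rng.standard_normal((L, d_k))
k_t = rng.standard_normal((L, d_k))
v_t = rng.standard_normal((L, d_v))

# Additive causal mask: -inf above the diagonal, 0 elsewhere
causal = np.where(np.triu(np.ones((L, L), dtype=bool), k=1), -np.inf, 0.0)

out_before, _ = scaled_dot_product_attention(q_t, k_t, v_t, mask=causal)

# Perturb only the last word's key and value
k_mod, v_mod = k_t.copy(), v_t.copy()
k_mod[-1] += 10.0
v_mod[-1] += 10.0
out_after, _ = scaled_dot_product_attention(q_t, k_mod, v_mod, mask=causal)

# Earlier words never attend to the last word, so their rows should match
earlier_unchanged = np.allclose(out_before[:-1], out_after[:-1])
# The last word attends to itself, so its row is expected to change
last_changed = not np.allclose(out_before[-1], out_after[-1])
print(earlier_unchanged, last_changed)
```

Running the same comparison with `mask=None` would let the change leak into every row, which is the behaviour expected in the encoder.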

In [91]:
# For encoder, mask = None
# For decoder, mask = mask
values, attention = scaled_dot_product_attention(q, k, v, mask=mask)
print("Q\n", q)
print("K\n", k)
print("V\n", v)
print("New V\n", values)
print("Attention\n", attention)

Q
 [[-1.41767666  0.96908187 -1.00570424 -1.38323781  1.13465927  0.79178207
   0.53179063 -0.72936195]
 [-2.16976431  0.66542476 -0.99466221  0.60654108  1.71870799 -0.41256672
  -1.79806925 -0.79852829]
 [ 0.89823134  0.55090139 -0.4092929   0.13166055  0.37747643  0.51839027
  -1.41010075 -0.04922471]
 [ 0.24826843  0.5348298   0.75793618 -0.53615346 -0.19183252 -1.63892969
   0.24689664  0.87612461]]
K
 [[ 1.57491474 -0.86316812 -0.65884679 -0.37227769 -0.37581007  0.63731469
   0.55250173 -0.23310638]
 [-1.63494969 -1.30428113 -0.80651226  0.47716264  1.13593996  0.41789594
   0.41580284 -0.18122419]
 [ 2.06477795 -1.20794564 -0.94439751 -0.75339707 -0.47730805 -1.76071251
  -0.50391215  0.54014202]
 [-0.55187274 -0.28083992  0.11199461  0.00648734 -1.78380066  0.48092649
  -1.297721    0.73737651]]
V
 [[-0.83391383 -1.53728569 -0.62599555 -0.87081994 -0.6608632  -1.30376564
   0.25958102  1.95478765]
 [-1.69238889 -0.31480643  1.4093176   0.97424845  0.99885477 -0.55161911
  -0.8