### Attention in Transformers

$Q$ (query): what I am looking for, an $m \times d$ matrix

$K$ (key): what I can offer, an $n \times d$ matrix

$V$ (value): what I actually offer, an $n \times d$ matrix

<img src="images/single_vector_attention.png" width="400">
<img src="images/matrix_attention.png" width="400">


In [1]:
# An attention example: feature dimension d = 8, m = T = 4 queries, n = N = 6 keys/values
import numpy as np
import math

d, N, T = 8, 6, 4
K = np.random.randn(N, d)
V = np.random.randn(N, d)
Q = np.random.randn(T, d)

In [2]:
print("Q shape: ", Q.shape)

Q shape:  (4, 8)


### self-attention
$\text{self attention} = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$

We divide by $\sqrt{d_k}$ because, for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where its gradients are extremely small. In this example, $d_k = d = 8$.
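
The effect of the scaling can be checked empirically. The sketch below is only an illustration under stated assumptions: query and key entries are drawn independently with unit variance (as with `np.random.randn`), and the helper name `dot_product_std` is hypothetical, not part of the notebook.

```python
import numpy as np

def dot_product_std(d_k, n_samples=5000, seed=0):
    """Empirical std of q.k before and after dividing by sqrt(d_k).

    Assumes independent, unit-variance entries for q and k.
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=1)  # one dot product per sampled (q, k) pair
    return dots.std(), (dots / np.sqrt(d_k)).std()

for d_k in (8, 64, 512):
    raw_std, scaled_std = dot_product_std(d_k)
    print(f"d_k={d_k:4d}  raw std={raw_std:6.2f}  scaled std={scaled_std:4.2f}")
```

Under these assumptions the raw standard deviation should grow roughly like $\sqrt{d_k}$, while the scaled one stays close to 1, which keeps the softmax inputs in a range where its gradients are not vanishingly small.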

In [3]:
np.matmul(Q, K.T)

array([[ 0.43074867, -1.71528591, -3.59925887,  4.67274428, -2.21925453,
         2.1634964 ],
       [ 3.02294124,  5.7416805 , -1.12591825,  0.96355798, -3.16949911,
        -1.27289228],
       [-0.49062657, -0.39329731,  0.99936161, -4.67247317,  2.72622684,
        -1.01645732],
       [ 2.77303193,  1.60613201,  0.35052426, -2.81409016, -5.1375085 ,
        -1.79713716]])

In [4]:
scaled = np.matmul(Q, K.T) / math.sqrt(d)
scaled

array([[ 0.15229265, -0.60644515, -1.27253018,  1.65206458, -0.78462496,
         0.76491149],
       [ 1.06877112,  2.02999061, -0.39807222,  0.34066919, -1.12058716,
        -0.45003538],
       [-0.17346269, -0.1390516 ,  0.35332769, -1.65196873,  0.96386674,
        -0.35937193],
       [ 0.98041484,  0.56785342,  0.12392904, -0.99493112, -1.81638355,
        -0.63538394]])

### Masking
- Masking ensures that a position cannot take context from positions that come later in the sequence, i.e. words generated in the future.
- It is not required in the encoder, but it is required in the decoder.

In [5]:
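# Lower-triangular mask: 0 where attention is allowed, -inf where it is blocked
mask = np.tril(np.ones((T, N)))
mask[mask == 0] = -np.inf  # np.infty was removed in NumPy 2.0; np.inf works everywhere
mask[mask == 1] = 0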

In [6]:
scaled + mask

array([[ 0.15229265,        -inf,        -inf,        -inf,        -inf,
               -inf],
       [ 1.06877112,  2.02999061,        -inf,        -inf,        -inf,
               -inf],
       [-0.17346269, -0.1390516 ,  0.35332769,        -inf,        -inf,
               -inf],
       [ 0.98041484,  0.56785342,  0.12392904, -0.99493112,        -inf,
               -inf]])

In [7]:
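def softmax(x):
    # Row-wise softmax; subtracting the row maximum avoids overflow in np.exp
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

attention = softmax(scaled + mask)
attention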

array([[1.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ],
       [0.2766341 , 0.7233659 , 0.        , 0.        , 0.        ,
        0.        ],
       [0.26820451, 0.27759435, 0.45420114, 0.        , 0.        ,
        0.        ],
       [0.44937405, 0.29746429, 0.19082749, 0.06233416, 0.        ,
        0.        ]])

In [8]:
def scaled_dot_product_Attention(q, k, v, mask=None):
    d = q.shape[-1]
    scaled = np.matmul(q, k.T) / math.sqrt(d)
    if mask is not None:  # `if mask:` raises ValueError when mask is a NumPy array
        scaled = scaled + mask
    attention = softmax(scaled)
    out = np.matmul(attention, v)
    return out, attention

value, attention = scaled_dot_product_Attention(Q, K, V, mask=None)
print(value.shape)

(4, 8)
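
The call above passes `mask=None`, so the masked path is never exercised. As a minimal, self-contained sketch of the masked case, the block below re-implements the same steps under hypothetical names (`softmax_rows`, `masked_attention`) with freshly sampled inputs, then checks that no query puts weight on a future position. It assumes the same shapes as the example (d = 8, N = 6, T = 4).

```python
import math
import numpy as np

def softmax_rows(x):
    # Row-wise softmax; exp(-inf) evaluates to 0, so masked entries get zero weight.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def masked_attention(q, k, v, mask=None):
    scaled = np.matmul(q, k.T) / math.sqrt(q.shape[-1])
    if mask is not None:  # compare with None explicitly; an array has no single truth value
        scaled = scaled + mask
    weights = softmax_rows(scaled)
    return np.matmul(weights, v), weights

rng = np.random.default_rng(0)
d, N, T = 8, 6, 4
q = rng.standard_normal((T, d))
k = rng.standard_normal((N, d))
v = rng.standard_normal((N, d))

# 0 on and below the diagonal (allowed), -inf above it (blocked)
causal_mask = np.where(np.tril(np.ones((T, N))) == 1, 0.0, -np.inf)

out, weights = masked_attention(q, k, v, mask=causal_mask)
print(out.shape)                                # one output row per query
print(np.allclose(weights.sum(axis=-1), 1.0))   # each row is a probability distribution
print(np.all(np.triu(weights, k=1) == 0))       # no weight on future positions
```

Because every row of the mask keeps at least its diagonal entry finite, each softmax row has a finite maximum and the normalization is well defined.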
