# Attention Mechanism

The attention mechanism is a function that encodes an ordered sequence of vectors. Typically the vectors are word embeddings and the sequence is a sentence:

$$\begin{align}
{\rm Att}(X)_{W_Q,W_K,W_V} = {\rm softmax} \Big( \frac{1}{\sqrt{d_K}}Q K^T \Big) V
\end{align}$$
where 
* $X$ is a $(s, w)$ matrix representing a sentence (or window). This window might be thought of as a mini-batch.
* $(Q,K,V)=(X\, W_Q,X\, W_K,X\, W_V)$, where $W_Q$, $W_K$ and $W_V$ are weight matrices, each of size $(w,d_K)$

With the following data:
* $w$ - size of a word embedding vector
* $d_K$ - size of the attention head
* $s$ - max number of tokens in a window

Then the objects have the following dimensions:
* $Q,K,V$ - $(s,d_K)$ matrices
* $Q K^T$ - $(s,s)$ matrix
* ${\rm Att}(X)_{W_Q,W_K,W_V}$ - $(s,d_K)$ matrix

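The simplest way to convert $X$ into an $(s,d_K)$ matrix would be to multiply it by a single $(w,d_K)$ matrix. However, each output row would then depend only on its own word, so the result would not be any richer than the word embeddings themselves. Attention instead lets every row draw on all the words in the window.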
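To make this comparison concrete, here is a minimal sketch that contrasts a single linear projection with an attention head. The helper names `softmax_rows` and `attention_head`, the seed, and the small Gaussian weights are assumptions made for illustration and are not part of the notebook. Changing one word leaves the other rows of the plain projection untouched, whereas the other rows of the attention output move, because each row mixes all the value vectors.

```python
import numpy as np

def softmax_rows(scores):
    # numerically stable softmax over each row
    # (stand-in for scipy.special.softmax(scores, axis=1))
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    # single attention head: softmax(Q K^T / sqrt(d_K)) V
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_K = W_K.shape[1]
    return softmax_rows(Q @ K.T / np.sqrt(d_K)) @ V

rng = np.random.default_rng(0)  # illustrative seed
s, w, d_K = 5, 7, 3
X = rng.normal(size=(s, w))
W_Q, W_K, W_V = (rng.normal(scale=0.1, size=(w, d_K)) for _ in range(3))

# change only the first word of the sentence
X_changed = X.copy()
X_changed[0] += 1.0

# plain projection: rows 1..s-1 depend only on their own word
projection_rows_unchanged = np.allclose((X @ W_V)[1:], (X_changed @ W_V)[1:])

# attention: rows 1..s-1 also see the first word through the weights
attention_rows_unchanged = np.allclose(
    attention_head(X, W_Q, W_K, W_V)[1:],
    attention_head(X_changed, W_Q, W_K, W_V)[1:],
)

print(projection_rows_unchanged, attention_rows_unchanged)
```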

Now that we have encoded the word-window, we can chain several attention heads together, feeding the output of one head into the next. Each head may use a different size $d_K$, provided its weight matrices have as many rows as the previous head's output has columns. The final result is an $(s,d)$ matrix, where $d$ is the size chosen for the last attention layer.
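The chaining can be sketched as follows. This is only an illustration: the head sizes in `head_sizes`, the random weights, and the helper names are hypothetical choices, not values from the notebook. The point to notice is that each head's weight matrices take their row count from the width of the previous output.

```python
import numpy as np

def softmax_rows(scores):
    # numerically stable softmax over each row
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    # single attention head: softmax(Q K^T / sqrt(d_K)) V
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax_rows(Q @ K.T / np.sqrt(W_K.shape[1])) @ V

rng = np.random.default_rng(0)  # illustrative seed
s, w = 5, 7
head_sizes = [4, 2]  # hypothetical d_K for each head in the chain

X = rng.normal(size=(s, w))
out = X
for d_K in head_sizes:
    in_size = out.shape[1]  # weight rows must match the current input width
    W_Q, W_K, W_V = (rng.normal(size=(in_size, d_K)) for _ in range(3))
    out = attention_head(out, W_Q, W_K, W_V)

print(out.shape)  # (5, 2): s rows, width of the last head
```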

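The intermediate matrix $Q K^T$ holds the raw attention scores. After scaling by $1/\sqrt{d_K}$ and applying the softmax to each row, it is often referred to as the "weight matrix".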

In [24]:
import numpy as np
import numpy.random as random
from scipy.special import softmax

In [52]:
# word embeddings
sentence_length = 5
word_embedding_size = 7

# a random 0/1 matrix stands in for the embedded sentence (one row per word)
sentence = random.randint(2, size=(sentence_length, word_embedding_size))

print(sentence.shape == (sentence_length, word_embedding_size))

True


In [54]:
attention_size = 3

W_Q = random.randint(2, size=(word_embedding_size, attention_size))
W_K = random.randint(2, size=(word_embedding_size, attention_size))
W_V = random.randint(2, size=(word_embedding_size, attention_size))

# generating the queries, keys and values
Q = sentence @ W_Q
K = sentence @ W_K
V = sentence @ W_V

In [57]:
print(sentence.shape==(sentence_length, word_embedding_size))
print(Q.shape==(sentence_length, attention_size))
print(K.shape==(sentence_length, attention_size))
print(V.shape==(sentence_length, attention_size))

True
True
True
True


In [59]:
# scale the scores by sqrt(d_K), then normalize each row (axis=1) into weights
weights = softmax(Q @ K.T / np.sqrt(attention_size), axis=1)

print(weights.shape==(sentence_length,sentence_length))

True
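As a quick sanity check on the normalization, the sketch below builds its own small random example (the seed, the Gaussian `Q` and `K`, and the `softmax_rows` helper are assumptions, not notebook values) and confirms that every row of the weight matrix is a probability distribution: its entries are positive and sum to one.

```python
import numpy as np

def softmax_rows(scores):
    # numerically stable softmax over each row
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)  # illustrative seed
s, d_K = 5, 3
Q = rng.normal(size=(s, d_K))
K = rng.normal(size=(s, d_K))

scores = Q @ K.T / np.sqrt(d_K)  # scaled scores, shape (s, s)
weights = softmax_rows(scores)   # normalized along each row

row_sums = weights.sum(axis=1)
all_positive = bool((weights > 0).all())
print(np.allclose(row_sums, 1.0), all_positive)
```

Because each row sums to one, multiplying the weights by $V$ gives every token a weighted average of the value vectors.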


In [61]:
# computing the attention by a weighted sum of the value vectors
attention = weights @ V

print(attention.shape==(sentence_length,attention_size))

True
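The steps above can be collected into a single function. The following is a minimal sketch rather than a reference implementation: the function name `attention`, the seed, and the hand-written softmax (used here in place of `scipy.special.softmax` so the block depends only on NumPy) are assumptions.

```python
import numpy as np

def attention(X, W_Q, W_K, W_V):
    """Return (output, weights) for a single attention head."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[1])               # (s, s) scaled scores
    scores = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)        # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)  # illustrative seed
sentence_length, word_embedding_size, attention_size = 5, 7, 3

# random 0/1 matrices, as in the cells above
sentence = rng.integers(2, size=(sentence_length, word_embedding_size))
W_Q, W_K, W_V = (
    rng.integers(2, size=(word_embedding_size, attention_size)) for _ in range(3)
)

output, weights = attention(sentence, W_Q, W_K, W_V)
print(output.shape, weights.shape)  # (5, 3) (5, 5)
```

Returning the weights alongside the output is a design choice made here so that the $(s,s)$ matrix can be inspected separately.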
