In [202]:
import numpy as np
import random
from scipy.special import softmax

#### Encoder representations of three different words

In [203]:
word_1 = np.array([1, 0, 0, 2, 0])
word_2 = np.array([0, 1, 0, 0, 1])
word_3 = np.array([1, 1, 0, 0, 1])

#### Generating the weight matrices <br> These are the matrices learned during training <br> They are trained to build a lookup system: for a given word's query, which keys match it best?

In [204]:
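np.random.seed(42)  # seed NumPy's global generator so the matrices are reproducible
W_Q = np.random.randint(0, 2, size=(5, 3))  # random 0/1 entries, shape (input dim, query dim)
W_K = np.random.randint(0, 2, size=(5, 3))
W_V = np.random.randint(0, 2, size=(5, 3))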

#### Generating the queries, keys and values

In [205]:
query_1 = word_1 @ W_Q
key_1 = word_1 @ W_K
value_1 = word_1 @ W_V
 
query_2 = word_2 @ W_Q
key_2 = word_2 @ W_K
value_2 = word_2 @ W_V
 
query_3 = word_3 @ W_Q
key_3 = word_3 @ W_K
value_3 = word_3 @ W_V
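
The per-word projections can also be written as one matrix product by stacking the word vectors as rows. The sketch below assumes that layout; `X`, `rng`, and the randomly drawn matrices are illustrative names and values, not the ones used in the cells above.

```python
import numpy as np

# The three word vectors from above, stacked as rows of one matrix
X = np.array([[1, 0, 0, 2, 0],
              [0, 1, 0, 0, 1],
              [1, 1, 0, 0, 1]])

# Hypothetical 0/1 projection matrices drawn from a local, seeded generator
rng = np.random.default_rng(0)
W_Q = rng.integers(0, 2, size=(5, 3))
W_K = rng.integers(0, 2, size=(5, 3))
W_V = rng.integers(0, 2, size=(5, 3))

# One product per role: row i holds the query/key/value of word i
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# Each row agrees with the word-by-word computation
assert np.array_equal(Q[0], X[0] @ W_Q)
print(Q.shape, K.shape, V.shape)
```

Stacking keeps the later steps short, since scores and weighted sums then become single matrix operations.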

#### The dot product is a similarity score between queries and keys <br> In a real model the queries and keys come from learned fully connected (linear) layers

In [206]:
scores_1 = np.array([np.dot(query_1, key_1), np.dot(query_1, key_2), np.dot(query_1, key_3)])
scores_2 = np.array([np.dot(query_2, key_1), np.dot(query_2, key_2), np.dot(query_2, key_3)])
scores_3 = np.array([np.dot(query_3, key_1), np.dot(query_3, key_2), np.dot(query_3, key_3)])

#### Computing the weights with a softmax over the scores, scaled by the square root of the key dimension (each result can be thought of as a probability vector)

In [207]:
weights_1 = softmax(scores_1 / key_1.shape[0] ** 0.5)
weights_2 = softmax(scores_2 / key_2.shape[0] ** 0.5)
weights_3 = softmax(scores_3 / key_3.shape[0] ** 0.5)

print(weights_1)
print(weights_2)
print(weights_3)

[0.1361258 0.4319371 0.4319371]
[0.60215404 0.05980637 0.33803959]
[0.60215404 0.05980637 0.33803959]
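
As a sanity check on the "probability vector" reading, the sketch below applies the same scaled softmax row-wise to a whole score matrix and confirms that every row sums to 1. The score values and the name `d_k` (key dimension) are assumptions for illustration.

```python
import numpy as np
from scipy.special import softmax

# Illustrative score matrix: row i holds the scores of query i against all keys
scores = np.array([[1, 1, 3],
                   [2, 3, 1],
                   [2, 1, 2]])
d_k = 3  # assumed key dimension

# axis=-1 normalizes each row independently
weights = softmax(scores / np.sqrt(d_k), axis=-1)

# Every row is non-negative and sums to 1
assert np.allclose(weights.sum(axis=-1), 1.0)
assert (weights >= 0).all()
print(weights)
```

Equal scores receive equal weights, and the largest score in a row receives the largest weight.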


#### Computing the attention as a weighted sum of the value vectors <br> This can be thought of as 'proportional retrieval' according to the probability vector

In [208]:
attention_1 = (weights_1[0] * value_1) + (weights_1[1] * value_2) + (weights_1[2] * value_3)
attention_2 = (weights_2[0] * value_1) + (weights_2[1] * value_2) + (weights_2[2] * value_3)
attention_3 = (weights_3[0] * value_1) + (weights_3[1] * value_2) + (weights_3[2] * value_3)
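
The three weighted sums are equivalent to a single product of the weight matrix with the stacked value vectors. A minimal sketch with made-up weights and values:

```python
import numpy as np

# Illustrative attention weights (each row sums to 1) and value vectors
weights = np.array([[0.2, 0.3, 0.5],
                    [0.6, 0.1, 0.3],
                    [0.1, 0.8, 0.1]])
V = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0],
              [2.0, 2.0, 1.0]])

# Row i of the result is the weighted sum of the value vectors for word i
attention = weights @ V

# Same result as writing the sum out term by term
explicit = weights[0, 0] * V[0] + weights[0, 1] * V[1] + weights[0, 2] * V[2]
assert np.allclose(attention[0], explicit)
print(attention[0])
```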

#### Attention output for each word: a mix of the value vectors that reflects how strongly the words relate to each other

In [209]:
print(attention_1)
print(attention_2)
print(attention_3)

[1.7041887 0.2722516 0.2722516]
[2.54234767 1.20430808 1.20430808]
[2.54234767 1.20430808 1.20430808]


#### All these operations can be summarized in one formula, where $n$ is the dimension of the keys: <br> $\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{n}}\right)\mathbf{V}$
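
The formula maps directly onto a few lines of NumPy. The function below is a minimal sketch, not a reference implementation: the name `scaled_dot_product_attention` and the test matrices are assumptions, and the softmax is applied row-wise so that each query gets its own probability vector.

```python
import numpy as np
from scipy.special import softmax

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(n)) V and the attention weights."""
    n = K.shape[-1]                      # dimension of the keys
    scores = Q @ K.T / np.sqrt(n)        # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # one probability vector per query
    return weights @ V, weights

# Illustrative inputs (one row per word)
Q = np.array([[1.0, 0.0, 1.0], [0.0, 2.0, 1.0], [1.0, 1.0, 0.0]])
K = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [2.0, 0.0, 1.0]])
V = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [2.0, 2.0, 1.0]])

output, weights = scaled_dot_product_attention(Q, K, V)

# Cross-check the first row against the step-by-step version
w0 = softmax(np.array([Q[0] @ K[0], Q[0] @ K[1], Q[0] @ K[2]]) / 3 ** 0.5)
manual = w0[0] * V[0] + w0[1] * V[1] + w0[2] * V[2]
assert np.allclose(output[0], manual)
print(output.shape)  # (3, 3)
```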

#### In encoder-decoder attention, the Transformer treats the encoder's word vectors as a set of key-value pairs <br> The query is derived from the decoder output at time t-1
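
To make the roles concrete, the sketch below computes one step of this encoder-decoder attention: keys and values are projected from the encoder representations, while the single query is projected from a previous decoder output. `decoder_prev` and the random projection matrices are hypothetical placeholders, not values from a trained model.

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)

# Encoder representations of the three words (one row per word)
encoder_out = np.array([[1.0, 0.0, 0.0, 2.0, 0.0],
                        [0.0, 1.0, 0.0, 0.0, 1.0],
                        [1.0, 1.0, 0.0, 0.0, 1.0]])

# Hypothetical decoder output at time t-1 and hypothetical projections
decoder_prev = rng.normal(size=(1, 5))
W_Q = rng.normal(size=(5, 3))
W_K = rng.normal(size=(5, 3))
W_V = rng.normal(size=(5, 3))

Q = decoder_prev @ W_Q   # one query, shape (1, 3)
K = encoder_out @ W_K    # one key per encoder word, shape (3, 3)
V = encoder_out @ W_V    # one value per encoder word, shape (3, 3)

# Weights over the three encoder words, then the retrieved context vector
weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
context = weights @ V
print(weights.shape, context.shape)  # (1, 3) (1, 3)
```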

#### Moreover, it applies the attention operation shown above several times in parallel (multi-head attention): <br> $\begin{aligned} \text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) &= [\text{head}_1; \dots; \text{head}_h]\mathbf{W}^O \\ \text{where head}_i &= \text{Attention}(\mathbf{Q}\mathbf{W}^Q_i, \mathbf{K}\mathbf{W}^K_i, \mathbf{V}\mathbf{W}^V_i) \end{aligned}$
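
A small NumPy sketch of the multi-head formula follows. The sizes (`seq_len`, `d_model`, `h`), the helper names, and the random weights are all assumptions chosen for illustration; a real Transformer learns these matrices and usually adds masking and batching.

```python
import numpy as np
from scipy.special import softmax

def attention(Q, K, V):
    # Scaled dot-product attention for a single head
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    # One attention call per head, each with its own projections
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    # Concatenate the heads along the feature axis and mix them with W_O
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
seq_len, d_model, h = 3, 4, 2        # assumed toy sizes
d_head = d_model // h

X = rng.normal(size=(seq_len, d_model))       # toy input sequence
W_Q = rng.normal(size=(h, d_model, d_head))   # one projection per head
W_K = rng.normal(size=(h, d_model, d_head))
W_V = rng.normal(size=(h, d_model, d_head))
W_O = rng.normal(size=(h * d_head, d_model))

# Self-attention: queries, keys and values all come from X
out = multi_head(X, X, X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (3, 4)
```

Each head works in a smaller subspace of size `d_head`, so the heads together cost roughly as much as one full-width attention.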

<div>
<img src="..\\images\\transformer_full.png" width="800"/>
</div>

#### I focused particularly on the MultiHead Attention aspect since it's the core of the Transformer.