# The Attention Mechanism

<!-- <img style="float:right;height:600px" src="images/transformer/beyer.self-attention.png"> -->
<img style="float:right;height:600px"  src="https://drive.google.com/uc?id=1QUOC1-gC7fF8Xn9eCgKfeSTF84BhjaK-">

<small style="position:absolute;bottom:0;right:0">[Lucas Beyer, "Transformers"](https://docs.google.com/presentation/d/1ZXFIhYczos679r70Yu8vV9uO6B1J0ztzeDxbnBxD1S0/edit#slide=id.g13dd67c5ab8_0_79)</small>

In [1]:
import tensorflow as tf

The principle of **self-attention** is to compute the similarity of each time step of a sequence with every step of that same sequence (itself included). In this case, the $Queries$, the $Keys$ and the $Values$ all stem from the same sequence, whereas in **cross-attention** the $Queries$ are generated from another sequence than the one used for the $Keys$ and $Values$.

The first step is to transform our input sequence $x$ by multiplying it with three different learned matrices $Q$, $K$, $V$ (three different linear transformations):

\begin{align}
Queries &= xQ\\
Keys &= xK\\
Values &= xV
\end{align}


Dimensions:  
$x$: ($steps$, $embed\_dim$)  
$Q, K, V$: ($embed\_dim$, $embed\_dim$)  
$Queries, Keys, Values$: ($steps$, $embed\_dim$)  
$embed\_dim$: each token in our sequence is an embedding vector of $embed\_dim$ dimensions.  

In the formulas of the following sections, such as $QK^T$, the letters $Q$, $K$ and $V$ stand for the resulting $Queries$, $Keys$ and $Values$, not for the matrices.
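As a hedged illustration of these shapes, the sketch below applies three random matrices to a random input. The names `W_q`, `W_k`, `W_v` (chosen to avoid clashing with the resulting queries, keys and values), the sizes, and the use of NumPy instead of TensorFlow are assumptions made for the example; in a real model the matrices are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
steps, embed_dim = 4, 8          # hypothetical sizes, chosen for illustration

x = rng.normal(size=(steps, embed_dim))        # one embedding vector per step

# One matrix per role; each maps an embed_dim vector to an embed_dim vector.
W_q = rng.normal(size=(embed_dim, embed_dim))
W_k = rng.normal(size=(embed_dim, embed_dim))
W_v = rng.normal(size=(embed_dim, embed_dim))

queries = x @ W_q
keys = x @ W_k
values = x @ W_v

print(queries.shape, keys.shape, values.shape)  # (4, 8) (4, 8) (4, 8)
```

Each row of `queries` depends only on the matching row of `x`: the transformation is applied to every step independently, and the steps only start to interact when the scores are computed.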

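In this simplest example, let's imagine our *embedding* is just **one** number. Our sequence is therefore of shape ($steps$, 1). We also skip the three linear transformations (equivalently, $Q$, $K$ and $V$ are the identity), so the queries, the keys and the values are all the sequence itself.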


In [23]:
seq = tf.random.uniform((3,1))

In [24]:
queries = seq
queries

<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[0.2830789 ],
       [0.43633425],
       [0.04906607]], dtype=float32)>

---

## 1. The similarity scores: $(QK^T)$

First, we compute the similarity of each step with all others.

In [25]:
keys = tf.transpose(queries)
keys

<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[0.2830789 , 0.43633425, 0.04906607]], dtype=float32)>

In [26]:
QKt = queries@keys
QKt

<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[0.08013367, 0.12351702, 0.01388957],
       [0.12351702, 0.19038758, 0.02140921],
       [0.01388957, 0.02140921, 0.00240748]], dtype=float32)>
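To make explicit what each entry of this matrix means, the sketch below rebuilds it one pair at a time: entry $(i, j)$ is the dot product between query $i$ and key $j$. It uses NumPy and hard-codes the sequence values printed above; the variable names are illustrative, and `keys` is kept untransposed here, unlike the notebook variable of the same name.

```python
import numpy as np

# Values copied from the notebook output above (shape: steps x embed_dim = 3 x 1).
queries = np.array([[0.2830789], [0.43633425], [0.04906607]])
keys = queries  # self-attention without projections: keys are the queries

steps = queries.shape[0]
scores = np.zeros((steps, steps))
for i in range(steps):
    for j in range(steps):
        # Similarity of step i (as a query) with step j (as a key).
        scores[i, j] = np.dot(queries[i], keys[j])

# The double loop computes the same thing as the matrix product queries @ keys^T.
assert np.allclose(scores, queries @ keys.T)
print(scores.round(4))
```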

If we wanted to extract that information for just one token in our source sequence:

In [56]:
q1 = queries[None, 0]
q1

<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[0.2830789]], dtype=float32)>

In [57]:
Q1Kt = q1@keys
Q1Kt  # This is the same as the first row of `QKt`...

<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[0.08013367, 0.12351702, 0.01388957]], dtype=float32)>

---

### Note

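That matrix is symmetric. This is not true in general: it holds in our case because the queries and the keys are the very same vectors (we did not apply different transformations to them), so the similarity of step $i$ with step $j$ equals the similarity of step $j$ with step $i$.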

In [27]:
QKt == tf.transpose(QKt)

<tf.Tensor: shape=(3, 3), dtype=bool, numpy=
array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])>
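As a sketch of what changes once queries and keys get different linear transformations, the example below compares the two cases with NumPy. The matrices `W_q` and `W_k` are hypothetical random stand-ins for learned transformations, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))            # 3 steps, embed_dim = 4 (illustrative)

# Same vectors on both sides: scores[i, j] == scores[j, i].
same = x @ x.T
print(np.allclose(same, same.T))       # True

# Different projections for queries and keys: symmetry would only
# happen by coincidence with random matrices.
W_q = rng.normal(size=(4, 4))
W_k = rng.normal(size=(4, 4))
projected = (x @ W_q) @ (x @ W_k).T
print(np.allclose(projected, projected.T))
```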

---

## 2. Turned into a mask using softmax and scaling

We scale the scores by $\sqrt{embed\_dim}$ and apply a $softmax$ along the temporal dimension (the last axis, so that each row sums to one).

What we want is to turn our similarity metric (the dot product) into something like a **mask**: ultimately, the new representation of each token should be a **weighted average** of all the tokens in our sequence, itself included. We mix information from all steps of the input sequence without increasing the total amount of information, since the weights are positive and sum to one, which also has nice mathematical properties for our gradients.

In [58]:
QKt_soft_scaled = tf.nn.softmax(QKt / tf.math.sqrt(tf.cast(queries.shape[-1], dtype=tf.float32)), axis=-1) 
QKt_soft_scaled                                   # shenanigans: queries.shape[-1] is an int, we need a float

<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[0.33554336, 0.35042074, 0.31403583],
       [0.33646366, 0.3597325 , 0.30380386],
       [0.33376372, 0.33628297, 0.3299533 ]], dtype=float32)>
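Two properties make this matrix usable as a mask: every entry is positive, and every row sums to one. The sketch below checks both with a hand-written softmax in NumPy, starting from the sequence values printed earlier; `softmax` here is a local helper written for the example, not a library function.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Softmax along `axis`, shifted by the max for numerical stability."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

seq = np.array([[0.2830789], [0.43633425], [0.04906607]])  # from the notebook
embed_dim = seq.shape[-1]

scores = seq @ seq.T
mask = softmax(scores / np.sqrt(embed_dim), axis=-1)

print(mask.round(4))
print(mask.sum(axis=-1))   # each row sums to 1 (up to rounding)
```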

---

### Note

(Given the symmetric nature of our matrix, applying the softmax with `axis=0` yields the transpose of the result above.)

In [59]:
tf.nn.softmax(QKt / tf.math.sqrt(tf.cast(queries.shape[-1], dtype=tf.float32)), axis=0) 

<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[0.33554336, 0.33646366, 0.33376372],
       [0.35042074, 0.3597325 , 0.33628297],
       [0.31403583, 0.30380386, 0.3299533 ]], dtype=float32)>

In [60]:
tf.transpose(QKt_soft_scaled) == tf.nn.softmax(QKt / tf.math.sqrt(tf.cast(queries.shape[-1], dtype=tf.float32)), axis=0) 

<tf.Tensor: shape=(3, 3), dtype=bool, numpy=
array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]])>
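This equality relies on the scores being symmetric. As a small sketch with made-up 2x2 score matrices (and a local NumPy `softmax` helper, not the TensorFlow function), the example below shows that the relation holds for a symmetric matrix and fails for a non-symmetric one, which is why the choice of axis matters in general.

```python
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

symmetric = np.array([[1.0, 2.0], [2.0, 0.5]])
asymmetric = np.array([[1.0, 2.0], [0.0, 0.5]])

# Symmetric scores: normalizing columns is the transpose of normalizing rows.
print(np.allclose(softmax(symmetric, axis=0), softmax(symmetric, axis=-1).T))    # True

# Non-symmetric scores: the two normalizations differ.
print(np.allclose(softmax(asymmetric, axis=0), softmax(asymmetric, axis=-1).T))  # False
```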

---

What happens if we continue with our single $query$? 

In [62]:
Q1Kt_soft_scaled = tf.nn.softmax(Q1Kt / tf.math.sqrt(tf.cast(queries.shape[-1], dtype=tf.float32))) 
Q1Kt_soft_scaled

<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[0.33554336, 0.35042074, 0.31403583]], dtype=float32)>

In [64]:
QKt_soft_scaled[0] == Q1Kt_soft_scaled

<tf.Tensor: shape=(1, 3), dtype=bool, numpy=array([[ True,  True,  True]])>
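With a single embedding dimension the scaling factor is $\sqrt{1} = 1$, so it has no visible effect in this example. The sketch below, under the assumption of random unit-variance embeddings (not something this notebook uses), illustrates why the division matters for larger embeddings: unscaled dot products spread out roughly like $\sqrt{embed\_dim}$, which tends to push the softmax towards a nearly one-hot mask.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 64                      # illustrative size
n_pairs = 10_000

q = rng.normal(size=(n_pairs, embed_dim))
k = rng.normal(size=(n_pairs, embed_dim))

dots = np.sum(q * k, axis=-1)       # one dot product per (query, key) pair
scaled = dots / np.sqrt(embed_dim)

print(dots.std())     # close to sqrt(64) = 8
print(scaled.std())   # close to 1
```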

---

## 3. Applying the mask to our values $(QK^T)V$

Now that we have our **mask**, we multiply each token's value by its score in the mask, and sum the results. This is equivalent to allowing some information from each token to pass into our new representation (the softmax makes sure that the total of what we allow sums to one: we are redistributing information without adding or subtracting any).

$\overbrace{softmax\left(\frac{QK^T}{\sqrt{embed\_dim}}\right)}^{mask}V$

### Beware

The mask is sometimes called **weights**. It should be said that these **are not weights in the neural network sense**, only in the **weighted average** sense.

In [28]:
values = seq

In [37]:
QKtV = QKt_soft_scaled @ values
QKtV

<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[0.26329434],
       [0.26711583],
       [0.25740278]], dtype=float32)>

This should be read as a **new sequence** with the same number of steps, each token still being represented as one number, but with each token containing contextual information gathered from the rest of the sequence.
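The "weighted average" reading can be checked by hand for the first token. The sketch below (NumPy, with the first mask row and the values copied from the outputs above) writes the sum out explicitly and confirms that the result stays between the smallest and the largest value in the sequence.

```python
import numpy as np

values = np.array([0.2830789, 0.43633425, 0.04906607])        # the sequence
weights_0 = np.array([0.33554336, 0.35042074, 0.31403583])    # first row of the mask

# New representation of token 0: sum over steps of weight * value.
new_token_0 = sum(w * v for w, v in zip(weights_0, values))
print(round(new_token_0, 4))   # 0.2633

# A weighted average with positive weights summing to 1 cannot leave
# the range spanned by the values.
assert values.min() <= new_token_0 <= values.max()
```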

---

## 4. Cross-attention

This works in exactly the same way, only with another sequence providing the $queries$; the $keys$ and $values$ still come from our original sequence.

<!-- <img style="float:right;height:600px" src="images/transformer/beyer.cross-attention.png"> -->
<img style="float:right;height:600px"  src="https://drive.google.com/uc?id=1fq19uEJd52_qDf1qmcV9BZWdCcstx1M7">

<small style="position:absolute;bottom:0;right:0">[Lucas Beyer, "Transformers"](https://docs.google.com/presentation/d/1ZXFIhYczos679r70Yu8vV9uO6B1J0ztzeDxbnBxD1S0/edit#slide=id.g13dd67c5ab8_0_79)</small>


In [70]:
other_queries = tf.random.uniform((3,1))
other_queries

<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[0.0127939 ],
       [0.7549559 ],
       [0.58881783]], dtype=float32)>

In [71]:
Cross_QKtV = tf.nn.softmax((other_queries @ keys)/tf.math.sqrt(tf.cast(queries.shape[-1], dtype=tf.float32))) @ values
Cross_QKtV

<tf.Tensor: shape=(3, 1), dtype=float32, numpy=
array([[0.2564841 ],
       [0.2749517 ],
       [0.27088535]], dtype=float32)>
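Because the queries only meet the keys through the score matrix, the two sequences do not need to have the same length. The sketch below wraps the three steps in a small NumPy helper (`attention` is a hypothetical name, not a TensorFlow API) and feeds it only the first two queries printed above against the three-step sequence; the output has one row per query.

```python
import numpy as np

def attention(queries, keys, values):
    """softmax(queries @ keys^T / sqrt(embed_dim)) @ values, row-wise softmax."""
    embed_dim = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(embed_dim)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    mask = np.exp(scores)
    mask = mask / mask.sum(axis=-1, keepdims=True)
    return mask @ values

seq = np.array([[0.2830789], [0.43633425], [0.04906607]])   # keys and values
other_queries = np.array([[0.0127939], [0.7549559]])        # only two queries

out = attention(other_queries, seq, seq)
print(out.shape)   # (2, 1): one new token per query
```

Calling `attention(seq, seq, seq)` with the same helper gives the self-attention case from the previous section.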