## Self-Attention

At the heart of the Transformer is self-attention. Self-attention is a mechanism that lets each input in a sequence look at the whole sequence when computing its own representation.

...what?

Ok. Let's look at the following sentence: `the animal didn't cross the street because it was too tired`. What does the `it` refer to in this sentence? The street or the animal? This is trivial for us as humans to answer but not for a machine. Wouldn't it be nice if the computer had some way of working out what `it` refers to?

This is what self-attention attempts to do. As the model processes each word in the input sequence, self-attention allows us to look at other words in the input sequence for ideas as to what we want to encode in the representation of this word.

![](http://jalammar.github.io/images/t/transformer_self-attention_visualization.png)

It is given by the formula:

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax} \left( \frac{QK^T}{\sqrt{d_k}}\right)V$$


To make this clearer, think of how a hidden state in an RNN incorporates the representation of the previous words into the current representation. Self-attention is how the Transformer attempts to use other words (not just the previous words) to encode the meaning of a particular word.


![encoder_multihead](images/encoder_multihead.png)

![vector_attn_score](images/vector_attn_scores.png)

### Self-Attention in detail

Self-attention produces a representation which you can think of as being built from scores. The intent is to have, for each input word, a set of scores which tells us how much focus that word needs to pay to every word in the sequence. A bit complicated, right? Hopefully by the end of this post, this meaning will be clear.

Before talking about matrices, let's talk in terms of vectors. We'll let the sequence length be $T$, and the encoding/embedding of each word be $D$-dimensional.

Note that because I'm focusing on the FIRST Encoder here, the encodings of our sequence are the embeddings. But as you move up the encoder stack, the encoding of the sequence is the output from the previous encoder. Again, this is $D$-dimensional.

![word_encodings](images/word_encodings.png)


Ok. Now we're going to create three vectors for EACH input: a query vector ($q$), a key vector ($k$), and a value vector ($v$). These will be $d_k$-dimensional, where $d_k$ is typically $D/8$. As far as real values go, usually $D=512$, which gives $d_k=64$. In our example, what is $d_k$?

Ok... so how do we get $q$, $k$, $v$? We learn them, of course!

How... do we learn them? We need weight matrices which will transform our encodings into these vectors: for a word encoding $x$, we compute $q = xW^Q$, $k = xW^K$ and $v = xW^V$.
This means that $W^Q$, $W^K$ and $W^V$ are all $\in \mathbb{R}^{D \times d_k}$.

![qkv_vectors](images/qkv_vectors.png)
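To make the projection step concrete, here is a minimal NumPy sketch. The toy sizes (`T=3`, `D=8`, `d_k=4`) and the random weights are assumptions for illustration only; in a real model the three weight matrices are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed for illustration), not the real D=512 / d_k=64.
T, D, d_k = 3, 8, 4

# Stand-in for the word encodings: one D-dimensional row per word.
X = rng.normal(size=(T, D))

# Random placeholders for the learned weight matrices, each of shape (D, d_k).
W_Q, W_K, W_V = (rng.normal(size=(D, d_k)) for _ in range(3))

x1 = X[0]       # encoding of the first word, shape (D,)
q1 = x1 @ W_Q   # query vector, shape (d_k,)
k1 = x1 @ W_K   # key vector,   shape (d_k,)
v1 = x1 @ W_V   # value vector, shape (d_k,)

print(q1.shape, k1.shape, v1.shape)  # (4,) (4,) (4,)
```

The same input `x1` goes into all three products, which is why $q$, $k$ and $v$ are just three different views of one word.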

Ok, that's nice. We understand that $q$, $k$, $v$ are different projections of the same input now; but what do the query, key, value abstractions actually mean? They're useful terms we can use to think about attention. Let's look through the following so we can see the roles they play.

Recall what we defined self-attention as earlier: a representation, per input word, which tells us how much focus each word needs to pay to every word in the sequence. For each word in our sequence, we will calculate a score by taking the dot product of the current word's query vector and the key vector for every word in the sequence.

![w1_til_softmax](images/w1_til_softmax.png)
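As a rough sketch of the scoring step, the snippet below computes the raw scores for one word. The query and key values are made up so the arithmetic is easy to follow by hand; they do not come from a trained model.

```python
import numpy as np

# Hypothetical query for the current word and keys for a 3-word sequence (d_k = 4).
q1 = np.array([1.0, 0.0, 1.0, 0.0])
K = np.array([
    [1.0, 0.0, 1.0, 0.0],  # k1
    [0.0, 1.0, 0.0, 1.0],  # k2
    [1.0, 1.0, 0.0, 0.0],  # k3
])

# One dot product per word in the sequence: scores[j] = q1 . k_j
scores = np.array([np.dot(q1, k_j) for k_j in K])

print(scores)  # [2. 0. 1.]
```

The explicit loop mirrors the vector-by-vector description; `K @ q1` gives the same three numbers in one call.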

Let's take stock of what we've done so far:
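- The division by $\sqrt{d_k}$ is simply a practical scaling step which leads to more stable gradients. Without it, the dot products grow with $d_k$ and push the softmax into regions where its gradients are tiny.
- Softmax turns our scores into a probability distribution (each score is now between 0 and 1, and the scores sum to 1).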
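The two bullets above can be sketched in a few lines. The raw scores and `d_k = 4` are assumed values for illustration, and `softmax` is a small helper written here (subtracting the maximum is a common numerical-stability trick, not something the formula requires).

```python
import numpy as np

def softmax(x):
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    e = np.exp(x - np.max(x))
    return e / e.sum()

d_k = 4
scores = np.array([2.0, 0.0, 1.0])  # hypothetical raw q.k scores for one word

scaled = scores / np.sqrt(d_k)      # divide by sqrt(d_k) -> [1.0, 0.0, 0.5]
weights = softmax(scaled)           # probability distribution over the 3 words

print(weights)        # three values in (0, 1), largest for the first word
print(weights.sum())  # 1, up to floating-point rounding
```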

We will now use these scores by multiplying each one with its corresponding value vector. The intention here is that lower scoring words will have less weighting in the self-attention output, as these words will now have a sense of "irrelevantness" (e.g. a low score like 0.0001 will all but "cancel out" its corresponding value vector).

![w1_til_z1](images/w1_til_z1.png)

Finally, the output for the current word is the sum of all the $\mathrm{softmax} \times v$ vectors, i.e. a weighted sum of the value vectors:

![z_vector](images/z_vector.png)
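Here is a small sketch of that weighted sum. The softmax weights and value vectors are hypothetical numbers picked so the result can be checked by eye.

```python
import numpy as np

# Hypothetical softmax weights for the current word (they sum to 1).
weights = np.array([0.5, 0.2, 0.3])

# Hypothetical value vectors for a 3-word sequence (d_k = 4).
V = np.array([
    [1.0, 0.0, 0.0, 0.0],  # v1
    [0.0, 1.0, 0.0, 0.0],  # v2
    [0.0, 0.0, 1.0, 1.0],  # v3
])

# Scale each value vector by its weight, then add the scaled vectors up.
weighted = weights[:, None] * V  # shape (3, 4)
z1 = weighted.sum(axis=0)        # shape (4,)

print(z1)  # [0.5 0.2 0.3 0.3]
```

The two steps collapse into a single product, `weights @ V`, which is what the matrix form does for every word at once.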

Ok. So that's self-attention in vector form. What about in terms of matrices?

- Our input, $X$, is now a matrix of our sequence of words (i.e. $X \in \mathbb{R}^{T\times D}$):
![x_matrix](images/X_matrix_input.png)

- $Q$, $K$, $V$ are now also matrices $\in \mathbb{R}^{T \times d_k}$, obtained as $Q = XW^Q$, $K = XW^K$, $V = XW^V$.
- $W^Q$, $W^K$, $W^V$ stay $\in \mathbb{R}^{D \times d_k}$.
- We now simply obtain $Z \in \mathbb{R}^{T \times d_k}$ by plugging $Q$, $K$, $V$ into our Attention formula.

![Z_matrix](images/Z_matrix.png)
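Putting the matrix form together, a minimal NumPy sketch of the Attention formula might look like the following. The function names `softmax` and `attention`, the toy sizes, and the random weights are all assumptions for illustration; this is a single attention head with no masking, not a full Transformer layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Row-wise softmax; subtracting the row max is for numerical stability only.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (T, T): every query against every key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # Z has shape (T, d_k)

rng = np.random.default_rng(0)
T, D, d_k = 5, 16, 4                    # toy sizes (assumed)

X = rng.normal(size=(T, D))             # stand-in for the word encodings
W_Q, W_K, W_V = (rng.normal(size=(D, d_k)) for _ in range(3))  # placeholders for learned weights

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
Z, weights = attention(Q, K, V)

print(Z.shape)        # (5, 4)
print(weights.shape)  # (5, 5)
```

Row `i` of `weights` holds the scores word `i` assigns to every word in the sequence, and row `i` of `Z` is the corresponding weighted sum of value vectors.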


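To recap where $Q$, $K$, $V$ come from:

- For the FIRST encoder, $Q$, $K$, $V$ are determined by the embeddings of the input words.
- For the rest of the encoder stack, $Q$, $K$, $V$ are determined by the output of the previous encoder.
- For the decoder stack, each decoder's self-attention gets $Q$, $K$, $V$ in a similar fashion to the encoders (from the decoder's own input). In its encoder-decoder attention, $Q$ still comes from the decoder, but $K$ and $V$ are computed from the output of the final encoder, which is passed to each of the decoders in the decoder stack.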
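The last bullet can be sketched as follows: queries are projected from the decoder side, while keys and values are projected from the final encoder's output. The names `encoder_output` and `decoder_state`, the toy sizes, and the random weights are hypothetical, chosen only to show how the shapes work out.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T_enc, T_dec, D, d_k = 6, 3, 16, 4            # toy sizes (assumed)

encoder_output = rng.normal(size=(T_enc, D))  # stand-in for the final encoder's output
decoder_state = rng.normal(size=(T_dec, D))   # stand-in for the decoder-side encodings

# Random placeholders for the learned weight matrices.
W_Q, W_K, W_V = (rng.normal(size=(D, d_k)) for _ in range(3))

Q = decoder_state @ W_Q    # queries come from the decoder:   (T_dec, d_k)
K = encoder_output @ W_K   # keys come from the encoder:      (T_enc, d_k)
V = encoder_output @ W_V   # values come from the encoder:    (T_enc, d_k)

weights = softmax(Q @ K.T / np.sqrt(d_k))  # (T_dec, T_enc)
Z = weights @ V                            # (T_dec, d_k)

print(weights.shape)  # (3, 6)
print(Z.shape)        # (3, 4)
```

Each decoder position ends up with one row of weights over the encoder positions, so the two sequences do not need to be the same length.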