# [How Attention works in Deep Learning: understanding the attention mechanism in sequence models (AI Summer School)](https://theaisummer.com/attention/)

by Nikolas Adaloglou on 2020-11-19, revised by Ivan Lin after reading the article

In [None]:
from IPython.display import HTML

<center>
<img src='https://theaisummer.com/static/e9145585ddeed479c482761fe069518d/7cdbc/attention.png' width=500>
</center>

## Intro

In NLP, transformers and attention have been utilized successfully in a plethora of tasks including reading comprehension, abstractive summarization, word completion, and others.  

After a lot of reading and searching, I realized that it is **crucial to understand how attention emerged from NLP and machine translation**. This is what this article is all about. After this article, we will inspect the transformer model like a boss.

The **attention** mechanism emerged naturally from problems that deal with ***time-varying data (sequences)***. So, since we are dealing with “sequences”, let’s formulate the problem in terms of machine learning first. Attention became popular in the general task of dealing with sequences.

Let’s start from the beginning:



* **Memory is attention through time**. ~ *Alex Graves 2020* [[1]](https://colab.research.google.com/drive/1CqbZvvZKbIGKacWH-ultBdv1H_NopMF0#scrollTo=xXLeQfFU0tWQ&line=1&uniqifier=1)

Always keep this in the back of your mind.

## Sequence to sequence learning

Before attention and transformers, Sequence to Sequence (Seq2Seq) worked pretty much like this:
<center>
<img src="https://theaisummer.com/static/cd814b80c90e3ce6bef8a52b690d3eb6/d5c6f/seq2seq.png" width=500>
</center>
The elements of the sequence $x_1, x_2$, etc. are usually called **tokens**. They can be literally anything. For instance, text representations, pixels, or even images in the case of videos.

**The goal is to transform an input sequence (source) to a new one (target).**

The two sequences can be of the same length or of different, arbitrary lengths.

In case you are wondering, recurrent neural networks (RNNs) dominated this category of tasks. The reason is simple: we liked to treat sequences sequentially. Sounds obvious and optimal? [Transformers](https://coursera.pxf.io/AoYYeN) proved to us that it’s not!

## A high-level view of encoder and decoder

<table><tr>
<td> <img src="https://theaisummer.com/static/70d8d1007f46ea9628842641d3eaa8f0/bb3ba/encoder.png" alt="Drawing" style="width: 480px;"/> </td>
<td> <img src="https://theaisummer.com/static/3aeb4bd622da8741ef66770055179d6d/ae77d/decoder.png" alt="Drawing" style="width: 480px;"/> </td>
</tr></table>

The **encoder** processes the input and produces one compact **representation**, called **$z$**, from all the input timesteps. It can be regarded as a compressed format of the input.

On the other hand, the **decoder** receives the context vector **$z$** and generates the output sequence. 
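To make the idea of a compact representation concrete, here is a minimal sketch of a toy recurrent encoder. The function name `encode`, the hidden size, and the fixed weights are all assumptions made for illustration (a real encoder learns its weights). The point is only that **$z$** has the same size no matter how long the input is.

```python
import math

def encode(tokens, hidden_size=3):
    """Toy RNN encoder: returns the final hidden state as the context z."""
    h = [0.0] * hidden_size
    for x in tokens:
        # Hypothetical fixed weights; a real encoder would learn them.
        h = [math.tanh(0.5 * x + 0.8 * h[k] + 0.1 * (k + 1))
             for k in range(hidden_size)]
    return h

z_short = encode([0.2, -1.0, 0.7])        # 3 timesteps
z_long = encode([0.2, -1.0, 0.7] * 30)    # 90 timesteps
print(len(z_short), len(z_long))          # both contexts have the same size
```

Whatever the decoder produces, it has to produce it from this one fixed-size vector.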

The most common application of Seq2Seq is language translation. We can think of the input sequence as the representation of a sentence in English and the output as the same sentence in French.

***RNN-based*** architectures used to work very well, especially with [LSTM](https://theaisummer.com/understanding-lstm/) and [GRU](https://theaisummer.com/gru/) components.


The problem? Only for small sequences ($<20$ timesteps). Visually:
<img src='https://theaisummer.com/static/344dcefead207723ee714f9a6c52be8b/ae694/scope-per-senquence-length.png' width="500">

### The limitations of RNNs

The intermediate representation **$z$** cannot encode information from all the input timesteps. This is commonly known as the **bottleneck problem**. The vector **$z$** needs to capture all the information about the source sentence.  In theory, mathematics indicates that this is possible. 

However, in practice, how far we can see in the past (the so-called **reference window**) is finite. RNNs tend to ***forget information*** from timesteps that are far behind.

Let’s see a concrete example. Imagine a sentence of 97 words:

* *“On offering to help the **blind man**, the man who then **stole his car**, had not, at that precise moment, had any evil intention, quite the contrary, what he did was nothing more than obey those feelings of generosity and altruism which, as everyone knows, are the two best traits of human nature and to be found in much more hardened criminals than this one, a simple **car-thief** without any hope of advancing in his profession, exploited by the real owners of this enterprise, for it is they who take advantage of the needs of the **poor**.”* ~ Jose Saramago, “Blindness”




In most cases, the vector **$z$** will be unable to compress the information of the early words as well as the 97th word.

Eventually, the system pays more attention to the last parts of the sequence. However, this is not usually the optimal way to approach a sequence task and it is not compatible with the way humans translate or even understand language.

Furthermore, stacked RNN layers usually create the well-known **vanishing gradient problem**, as perfectly visualized in the [distill article](https://distill.pub/2019/memorization-in-rnns/) on RNNs:

<center>
<img src='https://theaisummer.com/static/e612ddf05fe84a215d99da9fb425d05b/58fee/memorization-rnns.png' width=550px>
</center>
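As a rough numerical illustration of why far-away timesteps get forgotten, the sketch below tracks how much the last hidden state of a scalar RNN depends on the first one. The recurrence $h_t = \tanh(w h_{t-1} + x_t)$, the weight `w=0.9`, and the constant inputs are assumptions chosen for simplicity, not the setup of the distill article. By the chain rule the gradient is a product of one factor per timestep, and each factor here is smaller than 1.

```python
import math

def gradient_wrt_first_state(inputs, w=0.9):
    """Return |d h_T / d h_0| for the toy RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = 0.0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule through one timestep
    return abs(grad)

for length in (5, 20, 97):
    print(length, gradient_wrt_first_state([0.5] * length))
```

The printed magnitude shrinks quickly as the sequence gets longer, which is the vanishing gradient effect in its simplest form.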

## **Attention to the rescue**!

Attention was born in order to address these two problems of the Seq2Seq model. But how?

>**The core idea is that the context vector *$z$* should have access to all parts of the input sequence instead of just the last one.**

In other words, we need to form a **direct connection** with each timestep.

This idea was originally proposed for computer vision. Larochelle and Hinton [[5]](https://colab.research.google.com/drive/1CqbZvvZKbIGKacWH-ultBdv1H_NopMF0#scrollTo=VzM2dM1KDHxs&line=1&uniqifier=1) proposed that *by looking at different parts of the image (glimpses), we can learn to accumulate information about a shape and classify the image accordingly*.

The same principle was later extended to sequences. We can look at all the different words at the same time and learn to “pay attention” to the correct ones depending on the task at hand.

And behold. This is what we now call **attention**, which is simply a notion of **memory**, gained from attending to multiple inputs through time.

It is crucial, in my humble opinion, to understand the generality of this concept. To this end, we will cover all the different types into which one can divide attention mechanisms.

## Types of attention

### Types of attention: implicit VS explicit

Deep networks are very rich function approximators. So, without any further modification, they tend to **ignore parts of the input and focus on others**. For instance, when working on human pose estimation, the network will be more sensitive to the pixels of the human body. 
* Very deep neural networks already learn a form of implicit attention [[6]](https://colab.research.google.com/drive/1CqbZvvZKbIGKacWH-ultBdv1H_NopMF0#scrollTo=Ytbw174DHgjK&line=1&uniqifier=1). Here is an example of [self-supervised approaches to videos](https://theaisummer.com/self-supervised-learning-videos/):

<center>
<img src="https://theaisummer.com/static/251225898ff8d599b13e04f405d00437/75dcb/activations-focus-in-ssl.png" width=500>
<figcaption> Where activations tend to focus when trained in a self-supervised way. Image from Misra et al. ECCV 2016. <a href="https://arxiv.org/abs/1603.08561">Source</a> </figcaption>
</center>

*“Many activation units show a **preference** for human body parts and pose.”* ~ [Misra et al. 2016](https://arxiv.org/abs/1603.08561)

One way to visualize implicit attention is by looking at the partial derivatives with respect to the input. In math, this is the [Jacobian matrix](https://medium.com/unit8-machine-learning-publication/computing-the-jacobian-matrix-of-a-neural-network-in-python-4f162e5db180).
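Here is a minimal sketch of that idea, using central finite differences instead of automatic differentiation. The network `tiny_network`, its fixed weights, and the input values are hypothetical. The middle input has zero weight on purpose, so the Jacobian should show that the network "ignores" it.

```python
import math

def tiny_network(x):
    """Hypothetical 3-input, 2-output network with fixed weights."""
    weights = [[2.0, 0.0, -1.0],
               [0.5, 0.0, 0.5]]
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

def jacobian(f, x, eps=1e-6):
    """Central finite differences: J[o][i] approximates d f_o / d x_i."""
    n_out = len(f(x))
    J = [[0.0] * len(x) for _ in range(n_out)]
    for i in range(len(x)):
        plus, minus = list(x), list(x)
        plus[i] += eps
        minus[i] -= eps
        f_plus, f_minus = f(plus), f(minus)
        for o in range(n_out):
            J[o][i] = (f_plus[o] - f_minus[o]) / (2 * eps)
    return J

J = jacobian(tiny_network, [0.1, 0.9, -0.2])
# Sum of absolute partial derivatives per input: a crude "implicit attention" map.
sensitivity = [sum(abs(J[o][i]) for o in range(len(J))) for i in range(3)]
print(sensitivity)
```

Inputs with a large sensitivity are the ones the network implicitly attends to; in a deep learning framework you would get the same quantity from automatic differentiation.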

However, we have many reasons to enforce this idea of implicit attention. Attention is quite intuitive and interpretable to the human mind. Thus, by asking the network to **‘weigh’ its sensitivity to the input based on memory from previous inputs**, we introduce **explicit attention**. From now on, we will refer to this as attention.


### Types of attention: **hard attention** VS **soft attention**

Another distinction we tend to make is between **hard attention** vs **soft attention**. In all the previous cases, we refer to attention that is parametrized by **differentiable functions**. 

For the record, this is termed as **soft attention** in the literature. Officially:

* **Soft attention** means that the function varies smoothly over its domain and, as a result, it is **differentiable**.
* Historically, we had another concept called **hard attention**.

In general, **hard attention** means that it can be described by **discrete variables** while **soft attention** is described by continuous variables. In other words, hard attention replaces a deterministic method with a stochastic sampling model.

An intuitive example: You can imagine a robot in a labyrinth that has to make a hard decision on which path to take, as indicated by the red dots.

As another example, an algorithm that starts from a random location in the image tries to find the “important pixels” for classification. Roughly, the algorithm has to choose a direction to move inside the image during training.

<center>
<img src="https://theaisummer.com/static/88b3d2d56babed2b622640089f255a7d/b1cde/labyrinth-hard-attention.png" width=200>
<figcaption> A decision in the labyrinth. </figcaption>
</center>

<center>
<img src="https://theaisummer.com/static/0f9d245f8de0e59901050cfde671b153/2eb24/hard-attention.png" width=200>
<figcaption> An example of hard attention </figcaption>
</center>

Since hard attention is non-differentiable, we can’t use standard gradient descent. That’s why we need to train such models using Reinforcement Learning (RL) techniques such as [policy gradients and the REINFORCE algorithm](https://theaisummer.com/Policy-Gradients/) [[6]](https://colab.research.google.com/drive/1CqbZvvZKbIGKacWH-ultBdv1H_NopMF0#scrollTo=Ytbw174DHgjK&line=1&uniqifier=1).

Nevertheless, the major issue with the REINFORCE algorithm and similar RL methods is that they have a high variance. To summarize:
* **Hard attention** can be regarded as a switch mechanism to determine whether to attend to a region or not, which means that the function has many abrupt changes over its domain.

Ultimately, given that we already have all the sequence tokens available, we can relax the definition of **hard attention**. In this way, we have a smooth differentiable function that we can train end to end with our favorite backpropagation.
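The contrast between the two can be sketched in a few lines. The function names, the weights, and the scalar "features" below are assumptions for illustration only. Soft attention blends every position, so the output changes smoothly with the weights, while hard attention samples a single position, which is the stochastic step that blocks plain backpropagation.

```python
import random

def soft_attention(weights, values):
    """Differentiable: blend all values by their weights."""
    return sum(w * v for w, v in zip(weights, values))

def hard_attention(weights, values, rng):
    """Stochastic: sample one position and return only that value."""
    index = rng.choices(range(len(values)), weights=weights, k=1)[0]
    return values[index]

weights = [0.1, 0.7, 0.2]     # assumed attention distribution (sums to 1)
values = [10.0, 20.0, 30.0]   # assumed scalar feature per position
rng = random.Random(0)        # seeded only to make the sketch repeatable

soft_output = soft_attention(weights, values)
hard_output = hard_attention(weights, values, rng)
print(soft_output, hard_output)
```

The soft output is a mixture of all three values, whereas the hard output is always exactly one of them.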

## Attention in our encoder-decoder example

In the encoder-decoder RNN case, given the previous state of the decoder $y_{i-1}$ and the encoder hidden states ${\bf h} = h_1, h_2, \dots, h_n$, we have something like this: 

$$ e_i=\text{attention}_{\text{net}}(y_{i-1}, {\bf h}) \in R^n$$

The index $i$ indicates the prediction step. Essentially, we define a score between the hidden state of the decoder and all the hidden states of the encoder.

More specifically, for each encoder hidden state $h_j$ (indexed by $j$) in ${\bf h} = h_1, h_2, \dots, h_n$, we will calculate a scalar:

$$ e_{ij}=\text{attention}_{\text{net}}(y_{i-1}, {h_j}),$$

Visually, in our beloved example, we have something like this:

<center>
<img src="https://theaisummer.com/static/c657cd22c2d5501071dab630b3b91043/58213/seq2seq-attention.png" width=400>
</center>

Notice the symbol **$e$** in the equation and **$α$** in the diagram! Why?

Because we want two extra properties:

1.   to make the scores a probability distribution
2.   to push the scores far from each other, which leads to more confident predictions.

This is nothing more than our well-known softmax over the $T_x = n$ encoder positions:
$$
\alpha_{ij} = {{\text{exp}(e_{ij})}\over{\sum_{k=1}^{T_x} \text{exp}(e_{ik})}}
$$

Finally, here is where the new magic will happen:

$$
z_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
$$

In theory, **attention** is defined as the **weighted average of values**. But this time, **the weighting is a learned function**! 

Intuitively:
* We can think of **$\alpha_{ij}$** as **data-dependent dynamic weights**. 
* We need a notion of memory and, as we said, the **attention weights store the memory** that is gained through time.

All the aforementioned are **independent of how we choose to model attention**! We will get down to that in a bit.
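The softmax and the weighted sum above can be sketched for a single decoding step as follows. The scores `e_i` and the hidden states `h` are made-up numbers, and the helper names `softmax` and `context_vector` are ours; how the scores are produced is deliberately left out, since that part is independent of this computation.

```python
import math

def softmax(scores):
    """Turn raw scores e_ij into weights alpha_ij that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(alphas, hidden_states):
    """z_i = sum_j alpha_ij * h_j, computed per dimension."""
    dim = len(hidden_states[0])
    return [sum(a * h[d] for a, h in zip(alphas, hidden_states))
            for d in range(dim)]

# Hypothetical values for one decoding step i with three encoder states.
e_i = [2.0, 0.5, -1.0]
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

alpha_i = softmax(e_i)
z_i = context_vector(alpha_i, h)
print(alpha_i, z_i)
```

The encoder state with the highest score dominates `z_i`, but every state still contributes a little.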

## Attention as a trainable weighted mean for machine translation

I find that the most intuitive way to understand attention in NLP tasks is to think of it as a (soft) **alignment** between words. But what does this alignment look like? Excellent question!

In machine translation, we can visualize the attention of a trained network using a heatmap such as below. Note that **scores are computed dynamically**.

<center>
<img src="https://theaisummer.com/static/3bbdf4a6d68559a5d3847a04ebb3370b/8c76f/attention-alignment.png" width=350>
<figcaption> Image from the Neural Machine Translation paper. </figcaption>
</center>

Notice what happens in the active **non-diagonal elements**. 
* In the marked red area, the model learned to **swap the order of words** in translation. 
* **Note that this is not a 1-to-1 relationship but a 1-to-many one**, meaning that an output word is affected by more than one input word (each one with different importance).
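To see how such a heatmap is read, here is a small sketch with made-up attention weights for a three-word English to French example. The sentence pair and every number are hypothetical and do not come from the paper. Each row holds the weights of one output word over all input words.

```python
source = ["the", "blue", "car"]
target = ["la", "voiture", "bleue"]

# Made-up attention weights: row i = target word, column j = source word.
attention = [
    [0.80, 0.10, 0.10],
    [0.05, 0.15, 0.80],
    [0.05, 0.85, 0.10],
]

def strongest_alignment(attention):
    """For each target word, return the index of its most attended source word."""
    return [max(range(len(row)), key=row.__getitem__) for row in attention]

alignment = strongest_alignment(attention)
for word, row, j in zip(target, attention, alignment):
    print(f"{word:>8} <- {source[j]} ({row[j]:.2f})")
```

The last two rows peak off the diagonal, which is the word swap, and no row is a pure one-hot vector, which is the 1-to-many relationship.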

## How do we compute attention?

In our previous encoder-decoder example, we denoted attention as **$\text{attention}_{\text{net}}(y_{i-1}, {\bf h})$**, which indicates that it’s the output of a small neural network whose inputs are the previous state of the decoder $y_{i-1}$ and the hidden states ${\bf h} = h_1, h_2, \dots, h_n$.  

In fact all we need is **a score that describes the relationship between the two states and captures how “aligned” they are**.

While a small neural network is the most prominent approach, over the years there have been many different ideas to compute that score. The simplest one, as shown in Luong [7], computes attention as the dot product between the two states, $y_{i-1}^\top h_j$.  

Extending this idea, we can introduce a trainable weight matrix in between, $y_{i-1}^\top W_a h_j$, where $W_a$ is a matrix with learnable weights.  Extending even further, we can also include an activation function in the mix, which leads to our familiar neural network approach $v_a^\top \tanh(W_a[h_j; y_{i-1}])$ proposed by Bahdanau [[2]](https://colab.research.google.com/drive/1CqbZvvZKbIGKacWH-ultBdv1H_NopMF0#scrollTo=oP_OgeDcf33S&line=1&uniqifier=1)

In certain cases, the alignment depends only on the decoder state and not on the content of the encoder hidden states (location-based attention), which can be formulated using simply a **softmax function** $\text{softmax}(W_a y_{i-1})$

The last one worth mentioning can be found in Graves A. [[8]](https://colab.research.google.com/drive/1CqbZvvZKbIGKacWH-ultBdv1H_NopMF0#scrollTo=xFkf_Fung-hq&line=1&uniqifier=1) in the context of [Neural Turing Machines](https://deepmind.com/research/publications/neural-turing-machines) and calculates attention as a **cosine similarity** $\text{cosine}[y_{i-1}, h_j]$
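The content-based scores mentioned above can be sketched side by side. The function names, the vectors, and all the weights below are assumptions for illustration; in a trained model the matrices and the vector `v_additive` would be learned. Each function maps one decoder state and one encoder state to a single scalar $e_{ij}$.

```python
import math

def dot_score(y, h):
    """Dot product between the two states."""
    return sum(a * b for a, b in zip(y, h))

def general_score(y, h, W):
    """y^T W h, with W a (normally trainable) matrix."""
    Wh = [dot_score(row, h) for row in W]
    return dot_score(y, Wh)

def additive_score(y, h, W, v):
    """v^T tanh(W [h; y]): a one-hidden-layer network on the concatenation."""
    concat = h + y
    hidden = [math.tanh(dot_score(row, concat)) for row in W]
    return dot_score(v, hidden)

def cosine_score(y, h):
    """Cosine similarity: the dot product of the normalized states."""
    norm = math.sqrt(dot_score(y, y)) * math.sqrt(dot_score(h, h))
    return dot_score(y, h) / norm

# Hypothetical states and weights.
y_prev = [1.0, 0.0]
h_j = [0.5, 0.5]
W_general = [[1.0, 0.0], [0.0, 1.0]]
W_additive = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
v_additive = [1.0, -1.0]

print(dot_score(y_prev, h_j))
print(general_score(y_prev, h_j, W_general))
print(additive_score(y_prev, h_j, W_additive, v_additive))
print(cosine_score(y_prev, h_j))
```

With the identity matrix, the general score reduces to the plain dot product, which shows how each variant extends the previous one.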

To summarize the different techniques, I’ll borrow this table from [Lilian Weng’s excellent article](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html).  The symbol **$s_t$** denotes the predictions (**$y_t$**), while different **$W$** indicate trainable matrices:

<center>
<img src="https://theaisummer.com/static/2dd2a106a1c626f5cbae0d134c7bf83a/1ac29/attention-calculation.png" width=700>
</center>

The approach that stood the test of time, however, is the last one proposed by Bahdanau et al. [[2]](https://colab.research.google.com/drive/1CqbZvvZKbIGKacWH-ultBdv1H_NopMF0#scrollTo=oP_OgeDcf33S&line=1&uniqifier=1): **They parametrize attention as a small fully connected neural network**. And obviously, we can extend that to use more layers.

This effectively means that attention is now a set of trainable weights that can be tuned using our standard backpropagation algorithm.

As perfectly stated by Bahdanau et al. [[2]](https://colab.research.google.com/drive/1CqbZvvZKbIGKacWH-ultBdv1H_NopMF0#scrollTo=oP_OgeDcf33S&line=1&uniqifier=1):

> “Intuitively, this implements a mechanism of attention in the decoder. The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we **relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector**. With this new approach, the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.” ~ [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)

## References

### [1] DeepMind’s deep learning videos 2020 with UCL, Lecture: [Attention and Memory in Deep Learning](https://www.youtube.com/watch?v=AIiwuClvH6k&ab_channel=DeepMind), Alex Graves

In [None]:
HTML('<iframe width="712" height="281" src="https://www.youtube.com/embed/AIiwuClvH6k" \
      title="DeepMind x UCL | Deep Learning Lectures | 8/12 |  Attention and Memory in Deep Learning" \
      frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" \
      allowfullscreen></iframe>')

### [2] Bahdanau, D., Cho, K., & Bengio, Y. (2014). [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473). arXiv preprint arXiv:1409.0473.

### [5] Larochelle, H., & Hinton, G. (2010). [Learning to combine foveal glimpses with a third-order Boltzmann machine](https://papers.nips.cc/paper/2010/file/677e09724f0e2df9b6c000b75b5da10d-Paper.pdf).

### [6] Mnih, V., Heess, N., Graves, A., & Kavukcuoglu, K. (2014). [Recurrent Models of Visual Attention](https://papers.nips.cc/paper/2014/file/09c6c3783b4a70054da74f2538ed47c6-Paper.pdf).

### [8] Graves, A., Wayne, G., & Danihelka, I. (2014). [Neural turing machines](https://arxiv.org/abs/1410.5401).