# [How Attention works in Deep Learning: understanding the attention mechanism in sequence models (AI Summer School)](https://theaisummer.com/attention/)

by Nikolas Adaloglou on 2020-11-19, revised by Ivan Lin after reading the article

In [None]:
from IPython.display import HTML

<center>
<img src='https://theaisummer.com/static/e9145585ddeed479c482761fe069518d/7cdbc/attention.png' width=500>
</center>

## Intro

In NLP, transformers and attention have been utilized successfully in a plethora of tasks including reading comprehension, abstractive summarization, word completion, and others.  

After a lot of reading and searching, I realized that it is **crucial to understand how attention emerged from NLP and machine translation**. This is what this article is all about. After this article, we will inspect the transformer model like a boss.

The **attention** mechanism emerged naturally from problems that deal with ***time-varying data (sequences)***, and it became popular in the general task of dealing with sequences. So, since we are dealing with “sequences”, let’s first formulate the problem in terms of machine learning.

Let’s start from the beginning:



* **Memory is attention through time**. ~ *Alex Graves 2020* [[1]](https://colab.research.google.com/drive/1CqbZvvZKbIGKacWH-ultBdv1H_NopMF0#scrollTo=xXLeQfFU0tWQ&line=1&uniqifier=1)

Always keep this in the back of your mind.

## Sequence to sequence learning

Before attention and transformers, Sequence to Sequence (Seq2Seq) worked pretty much like this:
<center>
<img src="https://theaisummer.com/static/cd814b80c90e3ce6bef8a52b690d3eb6/d5c6f/seq2seq.png" width=500>
</center>
The elements of the sequence $x_1, x_2$, etc. are usually called **tokens**. They can be literally anything: for instance, text representations, pixels, or even images in the case of videos.

**The goal is to transform an input sequence (source) to a new one (target).**

The two sequences can have the same length or different, arbitrary lengths.

In case you are wondering, recurrent neural networks (RNNs) dominated this category of tasks. The reason is simple: we liked to treat sequences sequentially. Sounds obvious and optimal? [Transformers](https://coursera.pxf.io/AoYYeN) proved to us that it’s not!

## A high-level view of encoder and decoder

<table><tr>
<td> <img src="https://theaisummer.com/static/70d8d1007f46ea9628842641d3eaa8f0/bb3ba/encoder.png" alt="Drawing" style="width: 480px;"/> </td>
<td> <img src="https://theaisummer.com/static/3aeb4bd622da8741ef66770055179d6d/ae77d/decoder.png" alt="Drawing" style="width: 480px;"/> </td>
</tr></table>

The **encoder** processes the input and produces one compact **representation**, called **$z$**, from all the input timesteps. It can be regarded as a compressed format of the input.

On the other hand, the **decoder** receives the context vector **$z$** and generates the output sequence. 
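To make the roles of the encoder and the decoder concrete, here is a minimal sketch in plain Python. It assumes a vanilla RNN cell with random, untrained weights, and every name in it (`HIDDEN`, `D_IN`, `D_OUT`, `rnn_step`, `encode`, `decode`) is hypothetical and chosen only for illustration. The point is purely structural: the encoder returns a single vector $z$, and that vector is all the decoder gets to see.

```python
import math
import random

random.seed(0)
HIDDEN, D_IN, D_OUT = 4, 3, 2  # hypothetical sizes, chosen for illustration


def rand_matrix(rows, cols):
    """Small random weights; a real model would learn these."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(w_in, w_h, x, h):
    """One vanilla RNN step: h' = tanh(W_in x + W_h h)."""
    a, b = matvec(w_in, x), matvec(w_h, h)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]


def encode(tokens, w_in, w_h):
    """Read every input token, but return only the final hidden state z."""
    h = [0.0] * HIDDEN
    for x in tokens:
        h = rnn_step(w_in, w_h, x, h)
    return h


def decode(z, steps, w_in, w_h, w_out):
    """Generate `steps` output vectors, starting from the context vector z."""
    h = z
    y = [0.0] * len(w_out)  # all-zeros "start" token (an assumption)
    outputs = []
    for _ in range(steps):
        h = rnn_step(w_in, w_h, y, h)  # feed the previous output back in
        y = matvec(w_out, h)
        outputs.append(y)
    return outputs


enc_w_in, enc_w_h = rand_matrix(HIDDEN, D_IN), rand_matrix(HIDDEN, HIDDEN)
dec_w_in, dec_w_h = rand_matrix(HIDDEN, D_OUT), rand_matrix(HIDDEN, HIDDEN)
dec_w_out = rand_matrix(D_OUT, HIDDEN)

source = [[random.random() for _ in range(D_IN)] for _ in range(5)]  # 5 input tokens
z = encode(source, enc_w_in, enc_w_h)                                # one vector
target = decode(z, 7, dec_w_in, dec_w_h, dec_w_out)                  # 7 output tokens
print(len(source), len(z), len(target))
```

Note that the source has 5 tokens and the target has 7: the two lengths are decoupled because everything passes through the fixed-size vector `z`.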

The most common application of Seq2seq is language translation. We can think of the input sequence as the representation of a sentence in English and the output as the same sentence in French.

***RNN-based*** architectures used to work very well, especially with [LSTM](https://theaisummer.com/understanding-lstm/) and [GRU](https://theaisummer.com/gru/) components.


The problem? Only for small sequences ($<20$ timesteps). Visually:
<img src='https://theaisummer.com/static/344dcefead207723ee714f9a6c52be8b/ae694/scope-per-senquence-length.png' width="500">

### The limitations of RNNs

The intermediate representation **$z$** cannot encode information from all the input timesteps. This is commonly known as the **bottleneck problem**. The vector **$z$** needs to capture all the information about the source sentence. In theory, mathematics indicates that this is possible.

However, in practice, how far we can see into the past (the so-called **reference window**) is finite. RNNs tend to ***forget information*** from timesteps that are far behind.

Let’s see a concrete example. Imagine a sentence of 97 words:

* *“On offering to help the **blind man**, the man who then **stole his car**, had not, at that precise moment, had any evil intention, quite the contrary, what he did was nothing more than obey those feelings of generosity and altruism which, as everyone knows, are the two best traits of human nature and to be found in much more hardened criminals than this one, a simple **car-thief** without any hope of advancing in his profession, exploited by the real owners of this enterprise, for it is they who take advantage of the needs of the **poor**.”* ~ Jose Saramago, “Blindness”.




In most cases, the vector **$z$** will not preserve the information of the early words as well as it preserves the 97th word.

Eventually, the system pays more attention to the last parts of the sequence. However, this is not usually the optimal way to approach a sequence task and it is not compatible with the way humans translate or even understand language.

Furthermore, stacked RNN layers usually create the well-known **vanishing gradient problem**, as perfectly visualized in the [Distill article](https://distill.pub/2019/memorization-in-rnns/) on RNNs:

<center>
<img src='https://theaisummer.com/static/e612ddf05fe84a215d99da9fb425d05b/58fee/memorization-rnns.png' width=550px>
</center>
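The effect can also be reproduced numerically. The sketch below is a toy under stated assumptions, not the setup of the Distill article: it uses a scalar RNN $h_t = \tanh(w \, h_{t-1} + x_t)$ unrolled through time, with a hypothetical recurrent weight `W = 0.9` and a constant input of `0.1`. By the chain rule, the gradient of the last state with respect to the first one is a product of one factor $w \, (1 - h_t^2)$ per timestep, so it shrinks quickly as the sequence gets longer.

```python
import math


def gradient_wrt_first_state(w, inputs):
    """Return |d h_T / d h_0| for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = 0.0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule factor: d h_t / d h_{t-1}
    return abs(grad)


W = 0.9  # hypothetical recurrent weight, chosen for illustration
g5 = gradient_wrt_first_state(W, [0.1] * 5)
g20 = gradient_wrt_first_state(W, [0.1] * 20)
g97 = gradient_wrt_first_state(W, [0.1] * 97)
print(g5, g20, g97)  # each value is smaller than the previous one
```

With 97 timesteps, the same length as the sentence above, the first state barely influences the last one, which is the numerical face of "forgetting".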

## **Attention to the rescue**!

Attention was born in order to address these two problems of the Seq2seq model: the bottleneck and the vanishing gradient. But how?

>**The core idea is that the context vector *$z$* should have access to all parts of the input sequence instead of just the last one.**

In other words, we need to form a **direct connection** with each timestep.

This idea was originally proposed for computer vision. Larochelle and Hinton [[5]](https://colab.research.google.com/drive/1CqbZvvZKbIGKacWH-ultBdv1H_NopMF0#scrollTo=VzM2dM1KDHxs&line=1&uniqifier=1) proposed that *by looking at different parts of the image (glimpses), we can learn to accumulate information about a shape and classify the image accordingly*.

The same principle was later extended to sequences. We can look at all the different words at the same time and learn to “pay attention“ to the correct ones depending on the task at hand.
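As a rough illustration of "looking at all the words at the same time", the sketch below builds a context vector as a weighted sum over every encoder state instead of keeping only the last one. The scoring function is an assumption: a plain dot product followed by a softmax, which is only one of several possible choices. The names `attend`, `query`, and `encoder_states` are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Score every encoder state against the query, then mix the states."""
    # Dot-product scoring (an assumption; other scoring functions exist).
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: weighted sum over ALL timesteps.
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per input token
query = [2.0, 0.0]                                     # e.g. the current decoder state
weights, context = attend(query, encoder_states)
print(weights)
print(context)
```

Here the first and third tokens get the same, larger weight because they align with the query, while the second token is mostly ignored. Every timestep still contributes directly to the context vector, no matter how far back it is.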

And behold. This is what we now call **attention**, which is simply a notion of **memory**, gained from attending to multiple inputs through time.

In my humble opinion, it is crucial to understand the generality of this concept. To this end, we will cover all the different types into which attention mechanisms can be divided.

## Reference

### [1] DeepMind’s deep learning videos 2020 with UCL, Lecture: [Attention and Memory in Deep Learning](https://www.youtube.com/watch?v=AIiwuClvH6k&ab_channel=DeepMind), Alex Graves

In [None]:
HTML('<iframe width="712" height="281" src="https://www.youtube.com/embed/AIiwuClvH6k" \
      title="DeepMind x UCL | Deep Learning Lectures | 8/12 |  Attention and Memory in Deep Learning" \
      frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" \
      allowfullscreen></iframe>')

### [5] Larochelle H., Hinton G. (2010), [Learning to combine foveal glimpses with a third-order Boltzmann machine](https://papers.nips.cc/paper/2010/file/677e09724f0e2df9b6c000b75b5da10d-Paper.pdf)