<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="360" height="160" /></center>

# <center>Techniques to Enhance Seq2Seq Models (Attention Mechanism)</center>

## Table of Contents

1. [Techniques to Enhance Seq2Seq Models](#section1)<br><br>
2. [Attention Mechanism](#section2)<br><br>
3. [Implementing Attention Mechanism](#section3)<br><br>
4. [Conclusion](#section4)

<a id=section1></a>
## 1. Techniques to Enhance Seq2Seq Models

Most **Neural Machine Translation (NMT)** systems work by *encoding* the **source sentence** (e.g. a *German sentence*) *into a vector using* a **Recurrent Neural Network**, and *then decoding* an *English sentence* based on that **vector**, also using an **RNN**.

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/nmt.png"/></center>

<br> 
In the picture above, the words **“Echt”**, **“Dicke”** and **“Kiste”** are *fed into* an *encoder*, and after a special signal the *decoder starts producing* a **translated sentence**.

- The *decoder keeps generating words until* a *special end of sentence token* is **produced**.

- Here, the **$h$ vectors** represent the **internal state** of the *encoder*.

<br> 
If you look closely, you can see that the *decoder* is *supposed to generate* a *translation solely based* on the **last hidden state** (**$h_{3}$** above) from the *encoder*. 

- This **$h_{3}$ vector** must *encode everything* we need to know about the **source sentence**. 

- *It must fully capture its meaning*. 

- In more technical terms, that **vector** is a **sentence embedding**.

<br> 

---

Still, it seems somewhat unreasonable to assume that we can *encode all information about* a potentially *very long sentence into* a **single vector** and then have the *decoder produce* a *good translation based on only that*. 

Let’s say your **source sentence** is **50** words long. 

 - The *first word* of the **English translation** is probably **highly correlated** with the *first word* of the **source sentence**. 
 
 - *But that means the decoder* has to *consider information from 50 steps ago*, and *that information needs* to be somehow *encoded in* the **vector**.

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/context_vector.png"/></center>

<br> 
**Recurrent Neural Networks** are known to *have problems dealing with* such **long-range dependencies**. 

In theory, architectures like **LSTMs** should be able to *deal with this*, *but in practice long-range dependencies are* still **problematic**.
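
To make the bottleneck concrete, here is a minimal sketch of a toy recurrent encoder in plain Python. It is not a trained model: the update rule, the weights `0.5` and `0.1`, and the word vector are all made up for illustration. It only shows that the last hidden state has the same size whether the sentence has 3 words or 50.

```python
import math

def rnn_encode(word_vectors, hidden_size=3):
    """Toy recurrent encoder; returns the hidden state after every word."""
    h = [0.0] * hidden_size
    states = []
    for x in word_vectors:
        # Elementwise update with fixed, made-up weights (0.5 and 0.1)
        h = [math.tanh(0.5 * h_i + 0.1 * x_i) for h_i, x_i in zip(h, x)]
        states.append(h)
    return states

word = [1.0, 0.0, 2.0]                  # made-up word vector
short_states = rnn_encode([word] * 3)   # 3-word "sentence"
long_states = rnn_encode([word] * 50)   # 50-word "sentence"

# A decoder that only sees the last state gets a vector of the same size either way
print(len(short_states[-1]), len(long_states[-1]))  # 3 3
```

Everything the decoder needs about all 50 words would have to fit into those 3 numbers.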

<a id=section2></a>
## 2. Attention Mechanism

With an **attention mechanism** we no longer try to *encode the full source sentence into a fixed-length vector*.

Rather, we **allow** the *decoder* to “**attend**” to *different parts of the source sentence at each step of the output generation*. 

Importantly, we let the *model learn what to attend* to *based on the input sentence* and *what it has produced* so far. 

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/attention_mechanism.png"/></center>

- Here, the **$y$'s** are our *translated words produced by* the *decoder*, and the **$x$'s** are our *source sentence words*.

- *Each decoder output word* **$y_{t}$** now *depends on* a **weighted combination** of *all* the *input states*, *not just* the *last state*. 

- The **$\alpha$'s** are **weights** that *define how much* of *each input state* should be *considered for each output*.

  - So, **if $\alpha_{1,2}$** is a **large number**, this would *mean that* the *decoder pays* a *lot of attention to* the *second state in* the *source sentence while producing* the *first word of* the *target sentence*. 

  - The **$\alpha$'s** are typically **normalized to sum to 1** (so they *are a distribution over* the *input states*).
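
The points above can be written compactly as follows. This is a sketch in assumed notation: $h_j$ is the $j$-th encoder state, $s_t$ is the decoder state used at output step $t$ (which decoder state is used varies between architectures), $T_x$ is the source length, and $\text{score}$ is a placeholder for whichever score function is chosen.

$$
e_{t,j} = \text{score}(s_t, h_j), \qquad
\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k=1}^{T_x} \exp(e_{t,k})}, \qquad
c_t = \sum_{j=1}^{T_x} \alpha_{t,j} \, h_j
$$

By construction $\sum_{j=1}^{T_x} \alpha_{t,j} = 1$, and the weighted combination $c_t$ is what the output $y_t$ depends on.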

<a id=section3></a>
## 3. Implementing Attention Mechanism

The **implementation** of an **attention layer** can be broken down into the following **steps** (*Step 0 to Step 5*).

<br>

**Step 0: Prepare hidden states.**

Let’s *first prepare all* the *available encoder hidden states* (**green**) and the *first decoder hidden state* (**red**). 

In our example, we have *4 encoder hidden states* and the *current decoder hidden state*. 

<br> 
Note: The *last consolidated encoder hidden state* is **fed as input to** the *first time step of* the *decoder*. 

The *output of* this *first time step of* the *decoder* is *called* the **first decoder hidden state**, as seen below.

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/attention_mechanism0.gif"/></center>

<br>
<center><strong>Fig. 1.0: Getting ready to pay attention</strong></center>

<br>

**Step 1: Obtain a score for every encoder hidden state.**

A **score** (**scalar**) is *obtained by* a **score function** (also known as *alignment score function* or *alignment model*). 

In this example, the *score function is* a **dot product** between the *decoder and encoder hidden states*.

See [**Score Functions**](#score_functions) *for* a *variety of score functions*.

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/attention_mechanism1.gif"/></center>

<br>

<center><strong>Fig. 1.1: Get the scores</strong></center>

<br>

```
decoder_hidden = [10, 5, 10]

encoder_hidden      score
--------------------------
   [0, 1, 1]         15 (= 10×0 + 5×1 + 10×1, the dot product)
   [5, 0, 1]         60
   [1, 1, 0]         15
   [0, 5, 1]         35
```

In the above example, we *obtain* a **high attention score** of `60` for the *encoder hidden state* `[5, 0, 1]`. 

This *means that* the *next word to be translated* is going to be *heavily influenced by this encoder hidden state*.
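
The scores in the table can be reproduced with a few lines of plain Python. This is a minimal sketch using the dot product as the score function; the helper name `dot` is chosen here for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

decoder_hidden = [10, 5, 10]
encoder_hidden = [
    [0, 1, 1],
    [5, 0, 1],
    [1, 1, 0],
    [0, 5, 1],
]

# One scalar score per encoder hidden state
scores = [dot(decoder_hidden, h) for h in encoder_hidden]
print(scores)  # [15, 60, 15, 35]
```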

<br>

**Step 2: Run all the scores through a softmax layer.**

We pass the *scores through* a **softmax layer** so that the *softmaxed scores* (*scalars*) *add up to* **1**.

These *softmaxed scores represent* the ***attention distribution***.

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/attention_mechanism2.gif"/></center>

<br>

<center><strong>Fig. 1.2: Get the softmaxed scores</strong></center>

<br>

```
encoder_hidden     score     score^
------------------------------------
   [0, 1, 1]        15         0
   [5, 0, 1]        60         1
   [1, 1, 0]        15         0
   [0, 5, 1]        35         0
```

Notice that *based on* the **softmaxed score** `score^`, the *distribution of attention* is *placed almost entirely on* `[5, 0, 1]` as expected. 

The values shown are *rounded*: *in reality*, *these numbers* are *not binary but* **floating point numbers between 0 and 1**.
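
As a minimal sketch, the softmax over the example scores can be computed in plain Python. The function name `softmax` and the max-subtraction trick for numerical stability are choices made here, not something the animation prescribes.

```python
import math

def softmax(scores):
    """Softmax with the maximum subtracted first to avoid overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [15, 60, 15, 35]
weights = softmax(scores)

print(weights)                      # unrounded values, all strictly between 0 and 1
print([round(w) for w in weights])  # [0, 1, 0, 0]
```

The unrounded weights for `15` and `35` are tiny but not zero, because a gap of 25 or more in the raw scores is hugely amplified by the exponential.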

<br>

**Step 3: Multiply each encoder hidden state by its softmaxed score.**

By *multiplying each encoder hidden state with* its *softmaxed score* (*scalar*), we *obtain* the ***alignment vector*** or the ***annotation vector***. 

This is exactly the mechanism where **alignment takes place**.

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/attention_mechanism3.gif"/></center>

<br>

<center><strong>Fig. 1.3: Get the alignment vectors</strong></center>

<br>

```
encoder        score      score^      alignment
------------------------------------------------
[0, 1, 1]       15         0          [0, 0, 0]
[5, 0, 1]       60         1          [5, 0, 1]
[1, 1, 0]       15         0          [0, 0, 0]
[0, 5, 1]       35         0          [0, 0, 0]
```

Here we see that the *alignment vectors for all encoder hidden states except* `[5, 0, 1]` are *reduced to 0 due to low attention scores*. 

This *means* we can expect that the *first translated word should match* the *input word with* the `[5, 0, 1]` **hidden state**.
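
A minimal sketch of this step in plain Python, assuming the rounded softmaxed scores from the table (real weights would be floats that are only close to 0 and 1):

```python
encoder_hidden = [
    [0, 1, 1],
    [5, 0, 1],
    [1, 1, 0],
    [0, 5, 1],
]
weights = [0, 1, 0, 0]  # softmaxed scores, rounded as in the table

# Scale every component of each encoder hidden state by its weight
alignment = [[w * x for x in h] for h, w in zip(encoder_hidden, weights)]
print(alignment)  # [[0, 0, 0], [5, 0, 1], [0, 0, 0], [0, 0, 0]]
```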

<br>

**Step 4: Sum up the alignment vectors.**

The *alignment vectors* are *summed up to produce* the ***context vector***. 

A **context vector** is the *aggregated information of* the *alignment vectors from* the *previous step*.

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/attention_mechanism4.gif"/></center>

<br>

<center><strong>Fig. 1.4: Get the context vector</strong></center>

<br>

```
encoder        score      score^      alignment
------------------------------------------------
[0, 1, 1]       15         0          [0, 0, 0]
[5, 0, 1]       60         1          [5, 0, 1]
[1, 1, 0]       15         0          [0, 0, 0]
[0, 5, 1]       35         0          [0, 0, 0]


context = [0+5+0+0, 0+0+0+0, 0+1+0+0] 
        = [5, 0, 1]
```
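
The same sum can be sketched in plain Python by adding the alignment vectors component by component:

```python
alignment = [
    [0, 0, 0],
    [5, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
]

# zip(*alignment) groups the first components, the second components, and so on
context = [sum(column) for column in zip(*alignment)]
print(context)  # [5, 0, 1]
```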

**Step 5: Feed the context vector into the decoder.**

*How this is done depends on* the **architecture design**.
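
As one possible illustration (an assumption, not the only design), some architectures concatenate the context vector with the current decoder hidden state and pass the result on to further learned layers. The sketch below shows only the concatenation; the learned layers are omitted.

```python
context = [5, 0, 1]
decoder_hidden = [10, 5, 10]

# Hypothetical design choice: join the two vectors into one decoder input
decoder_input = context + decoder_hidden
print(decoder_input)  # [5, 0, 1, 10, 5, 10]
```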

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/attention_mechanism5.gif"/></center>

<br>

<center><strong>Fig. 1.5: Feed the context vector to decoder</strong></center>

<br>

**Here’s the entire animation**:

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/attention_mechanism6.gif"/></center>

<br>

<center><strong>Fig. 1.6: Attention</strong></center>

<br>

<a id=score_functions></a>
#### Score Functions

Below are *some* of the **score functions**. 

The *idea* behind *score functions* involving the **dot product operation** (*dot product*, *cosine similarity* etc.) is *to measure* the **similarity between two vectors**.

For **feed-forward neural network score functions**, the *idea* is to *let the model learn* the *alignment weights together with* the **translation**.
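
The two families can be sketched in plain Python as follows. Here `s` is a decoder hidden state and `h` an encoder hidden state. The weights `W_s`, `W_h` and `v` are hypothetical hand-picked values; in a real model they would be learned during training, and the exact form of the feed-forward score varies between papers.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in W]

def dot_score(s, h):
    return dot(s, h)

def cosine_score(s, h):
    """Dot product normalized by both vector lengths; lies in [-1, 1]."""
    return dot(s, h) / (math.sqrt(dot(s, s)) * math.sqrt(dot(h, h)))

def additive_score(s, h, W_s, W_h, v):
    """Feed-forward score: v . tanh(W_s s + W_h h)."""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_s, s), matvec(W_h, h))]
    return dot(v, hidden)

s = [10, 5, 10]
h = [5, 0, 1]

# Hypothetical weights, standing in for learned parameters
W_s = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0]]
W_h = [[0.1, 0.0, 0.0], [0.0, 0.0, 0.1]]
v = [1.0, -1.0]

print(dot_score(s, h))  # 60
print(cosine_score(s, h))
print(additive_score(s, h, W_s, W_h, v))
```

The dot product and cosine scores have no parameters of their own, while the feed-forward score changes as `W_s`, `W_h` and `v` are trained.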

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/summary_score_functions.png"/></center>

<br>

<center><strong>Summary of Score Functions</strong></center>

<br>

<br> 
<center><img src="https://raw.githubusercontent.com/insaid2018/DeepLearning/master/images/summary_score_functions1.png"/></center>

<br>

<center><strong>Summary of Score Functions.</strong></center>
<center><strong>h represents encoder hidden states while s represents decoder hidden states.</strong></center>

<br>

**Note:** The **Implementing Attention Mechanism** section in this notebook is *inspired by* the **Towards Data Science** article **Attn: Illustrated Attention** written by **Raimi Karim**.

<a id=section4></a>
## 4. Conclusion

The **Attention mechanism** has *revolutionised* the *way we create NLP models* and is currently a **standard fixture** in *most state-of-the-art NLP models*. 

This is *because it enables* the *model to “**remember**” all* the *words in* the **input** and *focus on specific words when formulating a response*.

A big **advantage of attention** is that *it gives* us the **ability to interpret** and **visualize** what the *model is doing*. 

For example, *by visualizing* the **attention weight matrix $\alpha$** *when a sentence is translated*, we *can understand how* the *model is translating*.
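
As a rough sketch of what such a matrix looks like, the code below runs dot-product attention for three decoder steps and prints one row of weights per output word. The encoder states are the ones from the worked example; the second and third decoder states are made up for illustration.

```python
import math

def attention(decoder_hidden, encoder_hidden):
    """Dot-product attention; returns (weights, context)."""
    scores = [sum(a * b for a, b in zip(decoder_hidden, h)) for h in encoder_hidden]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    size = len(encoder_hidden[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_hidden)) for i in range(size)]
    return weights, context

encoder_hidden = [[0, 1, 1], [5, 0, 1], [1, 1, 0], [0, 5, 1]]
# Hypothetical decoder states, one per output word
decoder_states = [[10, 5, 10], [0, 10, 0], [1, 1, 1]]

# Row t holds the weights over all input states for output word y_t
alpha = [attention(s, encoder_hidden)[0] for s in decoder_states]
for t, row in enumerate(alpha, start=1):
    print(f"y_{t}: " + "  ".join(f"{w:.2f}" for w in row))
```

Each row sums to 1, and the position of the largest weight in a row shows which input state the model attended to most for that output word.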