In [1]:
%run Latex_macros.ipynb


# Attention: Motivation

Let's revisit the Encoder-Decoder architecture

The Encoder
- Acts on input sequence $[\x_{(1)} \dots \x_{(\bar{T})}]$
- Producing a sequence of latent states $[ \bar{\h}_{(1)}, \dots, \bar{\h}_{(\bar{T})} ]$



The Decoder
- Acts on the *final* Encoder latent state $\bar{\h}_{(\bar{T})}$
- Producing a sequence of outputs $[ \hat{\y}_{(1)}, \dots, \hat{\y}_{(T)} ]$
- Often feeding step $\tt$ output $\hat{\y}_\tp$ as Decoder input at step $(\tt+1)$



<table>
    <tr>
        <th><center>RNN Encoder/Decoder</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_Encoder_Decoder.png"></td>
    </tr>
</table>



The following diagram is a condensed depiction of the process

<table>
    <tr>
        <th><center>Sequence to Sequence: training (teacher forcing) + inference: No attention</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_seq2seq.png"></td>
    </tr>
</table>

Recall that $\bar{\h}_{(\bar{\tt})}$ is a fixed length encoding of the input prefix $\x_{(1)}, \ldots, \x_{(\bar{\tt})} $.

So $\bar{\h}_{(\bar{T})}$, which initializes the Decoder, is a summary of the entire input sequence $\x$.

This fact enables us to decouple the Encoder from the Decoder
- The consumption of input $\x$ and production of output $\hat{\y}$ do not have to be synchronized
- Allowing for the possibility that $T \ne \bar{T}$
- For example
    - In language translation, there is no one-to-one mapping between the words of the two languages (nor is word order preserved)
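
To make the decoupling concrete, here is a minimal NumPy sketch of a vanilla RNN Encoder/Decoder. It is an illustration under stated assumptions, not the notebook's implementation: the sizes, the random weights, and the names `encode`, `decode`, `U_yh` and `U_hh` are hypothetical, and a trained model would learn the weights. The Decoder is initialized with the final Encoder state only and runs for $T \ne \bar{T}$ steps.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, n_y = 3, 4, 3      # hypothetical input / latent state / output sizes
T_bar, T = 5, 7              # input and output lengths need not match

# Hypothetical, randomly initialized weights (a trained model would learn these)
W_xh, W_hh = rng.normal(size=(n_h, n_x)), rng.normal(size=(n_h, n_h))
U_yh, U_hh = rng.normal(size=(n_h, n_y)), rng.normal(size=(n_h, n_h))
W_hy, b_y = rng.normal(size=(n_y, n_h)), np.zeros(n_y)

def encode(x_seq):
    """Return all Encoder latent states, one per input element."""
    h = np.zeros(n_h)
    states = []
    for x in x_seq:
        h = np.tanh(W_xh @ x + W_hh @ h)
        states.append(h)
    return np.stack(states)

def decode(h_init, n_steps):
    """Generate n_steps outputs, feeding each output back as the next Decoder input."""
    h, y = h_init, np.zeros(n_y)
    outputs = []
    for _ in range(n_steps):
        h = np.tanh(U_yh @ y + U_hh @ h)
        y = W_hy @ h + b_y
        outputs.append(y)
    return np.stack(outputs)

x_seq = rng.normal(size=(T_bar, n_x))
h_bar = encode(x_seq)            # shape (T_bar, n_h)
y_hat = decode(h_bar[-1], T)     # the Decoder sees only the final Encoder state
print(h_bar.shape, y_hat.shape)
```

Note that `decode` never looks at `h_bar[:-1]`: all information about the input must pass through the single vector `h_bar[-1]`.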


Let's focus on the part of the Decoder
- That transforms latent state (or short term memory) $\h_\tp$ to output $\hat{\y}_\tp$


<table>
    <tr>
        <th><center>Decoder: No attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Decoder_no_attention.png"></td>
    </tr>
</table>

We can generalize this transformation as
$$\hat{\y}_\tp = D( \h_\tp; \mathbf{s})$$

In the vanilla RNN, this was governed by the equation
$$\hat{\y}_\tp  =   D(\h_\tp; \mathbf{s}) = \W_{hy} \h_\tp  + \b_y$$

Additional parameter $\mathbf{s}$
- Was unused in this example (our illustration used $\bar{\h}_{(\bar{T})}$ as a place-holder)
- But may be used in other cases
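
As a small illustration of the vanilla RNN case, the sketch below writes $D$ as a Python function that accepts $\mathbf{s}$ but ignores it. The numeric values are made up for the example.

```python
import numpy as np

def D(h_t, s, W_hy, b_y):
    """Vanilla RNN output map: y_hat_t = W_hy h_t + b_y. The argument s is unused."""
    return W_hy @ h_t + b_y

# Made-up values, for illustration only
h_t = np.array([1.0, -1.0])
W_hy = np.array([[1.0, 2.0],
                 [0.0, 1.0],
                 [3.0, 0.0]])
b_y = np.array([0.5, 0.5, 0.5])

y_a = D(h_t, s=np.zeros(2), W_hy=W_hy, b_y=b_y)
y_b = D(h_t, s=np.ones(2), W_hy=W_hy, b_y=b_y)
print(y_a)                       # [-0.5 -0.5  3.5]
print(np.array_equal(y_a, y_b))  # True: changing s has no effect
```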


This simple mapping of $\h_\tp$ to $\hat{\y}_\tp$ places a heavy burden on $\h_\tp$: everything needed to produce $\hat{\y}_\tp$ must be encoded in it.

It is often the case that $\hat{\y}_\tp$
- Depends mostly on a **specific element** $\x_{(\bar{\tt})}$ of the input
- Or on a **specific prefix** of the input: $\x_{(1)}, \ldots, \x_{(\bar{\tt})} $


Consider the example of language translation
- When predicting word $\hat{\y}_\tp$  in the Target language
- Some "context" provided by the Source language may greatly influence the prediction
    - For example: gender/plurality of the subject

This context is usually much smaller than the entire sequence $\x$ of length $\bar{T}$.



By not allowing $D(\h_\tp; \mathbf{s})$ *direct* access to the required context, we force the Decoder
- To encode the context of the Source 
- Along with the specific information of the Target
- Into $\h_\tp$

This makes $\h_\tp$ unnecessarily complex and perhaps difficult to learn well.

We will introduce a mechanism called *Attention* to alleviate this burden.

To give you a better feel for context, here are some examples

**Image captioning example**
- Source: Image
- Target: Caption: "A woman is throwing a **frisbee** in a park."
- Attending over *pixels*, **not** a sequence

<center><strong>Visual attention</strong></center>
<table>
    <tr>
        <td><img src="images/shat_-002-027.jpg"></td>
        <td><img src="images/shat_-002-028.jpg"></td>
    </tr>
    <tr>
        <td colspan=2><center>A woman is throwing a <strong>frisbee</strong> in a park.</center></td>
    </tr>
</table>
Attribution: https://arxiv.org/pdf/1502.03044.pdf


**Image captioning example**
- Source: Image
- Target: Caption: "A giraffe standing in a forest with **trees** in the background."
- Attending over *pixels*, **not** a sequence

<center><strong>Visual attention</strong></center>
<table>
    <tr>
        <td><img src="images/shat_-002-035.png"></td>
        <td><img src="images/shat_-002-036.jpg"></td>
    </tr>
    <tr>
        <td colspan=2><center>A giraffe standing in a forest with <strong>trees</strong> in the background.</center></td>
    </tr>
</table>
Attribution: https://arxiv.org/pdf/1502.03044.pdf

**Date normalization example**
- Source: Dates in free-form: "Saturday 09 May 2018"
- Target: Dates in normalized form: "2018-05-09"

[Example visualizations](https://github.com/datalogue/keras-attention#example-visualizations)

# Attend to what's important

The solution to overloading $\h_\tp$ with Source context is conceptually straightforward.

In the Decoder expression $D(\h_\tp; \mathbf{s})$, let
$$
\mathbf{s} = \c_\tp
$$
where $\c_\tp$ is a variable
- That supplies the appropriate context for output $\hat{\y}_\tp$
- Conditional on $\h_\tp$

Because $\bar{\h}_{(\bar{\tt})}$ 
- Is a fixed length encoding of the input prefix $\x_{(1)}, \ldots, \x_{(\bar{\tt})} $
- It can be assigned to $\c_\tp$ as the context for the prefix of $\x$ of length $\bar{\tt}$

$$ \c_\tp \in \{ \bar{\h}_{(1)}, \dots, \bar{\h}_{(\bar{T})} \} $$


We say
- The Decoder "attends to" (pays attention to) $\bar{\h}_{(\bar{\tt})}$
- When generating output $\hat{\y}_\tp$

That is: it focuses its attention on a specific part of the input $\x$

<table>
    <tr>
        <th><center>Decoder: Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Decoder_attention.png"></td>
    </tr>
</table>

The dotted line from $\h_\tp$ on the left of the Choose box
- Indicates that the Choice is conditional on Decoder state $\h_\tp$

Here is a diagram summarizing the Attention mechanism

<table>
    <tr>
        <th><center>Sequence to Sequence: attention</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_seq2seq_attention.png" width=1000></td>
    </tr>
</table>

How is the choice of $\c_\tp$ from the set $\{ \bar{\h}_{(1)}, \dots, \bar{\h}_{(\bar{T})} \}$ accomplished?

The "Choose" box
- Is a Neural Network
- With its own weights
- That learn to make the best choice for the Target task!

In other words
- It is trained as part of the larger task
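
One way to picture the "Choose" box is the NumPy sketch below. The notebook does not specify how the box scores the candidates, so the bilinear score $\bar{\h}_{(\bar{\tt})}^T \, \W_a \, \h_\tp$, the weight matrix `W_a`, and the function name `choose` are assumptions made for illustration. The sketch returns both a hard choice (a single element of the set) and a soft choice (a weighted average of all elements). The soft version is differentiable, which is what allows the box to be trained by gradient descent as part of the larger task.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def choose(h_t, h_bar, W_a):
    """Score each Encoder state against Decoder state h_t and return a context.

    W_a stands in for the Choose box's own weights (hypothetical here).
    """
    scores = h_bar @ (W_a @ h_t)          # one score per Encoder state, shape (T_bar,)
    weights = softmax(scores)             # attention weights, non-negative, sum to 1
    c_hard = h_bar[np.argmax(weights)]    # hard choice: one element of the set
    c_soft = weights @ h_bar              # soft choice: weighted average of all states
    return weights, c_hard, c_soft

rng = np.random.default_rng(0)
T_bar, n_h = 5, 4
h_bar = rng.normal(size=(T_bar, n_h))     # Encoder states
h_t = rng.normal(size=n_h)                # current Decoder state
W_a = rng.normal(size=(n_h, n_h))         # random here; learned in practice

weights, c_hard, c_soft = choose(h_t, h_bar, W_a)
print(weights.round(3), weights.sum())
```

Because the weights depend on `h_t`, the context changes from one Decoder step to the next, which is the sense in which the choice is conditional on the Decoder state.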

This is a common technique in Deep Learning that may, at first, appear magical
- Hypothesize the existence of a mechanism to solve your problem
- Train a Neural Network to conjure up the mechanism!

## Multi-head attention: two heads are better than one

Remember: 
- each element of an output sequence is a *vector*
- we can stack layers so that the output sequence of layer $\ll$ becomes the input sequence to layer $\ll +1$


<table>
    <tr>
        <th><center>RNN Stacked layers</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layers_stacked.jpg" width=80%></td>
    </tr>
</table>

- When generating two different elements (synthetic features) $j, j'$ of the layer $\ll$ output vectors
- We may want to attend to different things

So we often have *multiple* attention "heads", one devoted to each output feature.

This is called *multi-head attention*

A "head" is similar to the channel dimension of a CNN
- Each head (resp., channel) implements the same computation
- Using per-head (resp., per channel) weights
- Each computing a separate feature

The representation (latent state) of layer $\ll$ of the transformer
- Is the concatenation of the representations produced by each head
- So $n_\text{head} > 1$ results in a latent state that is $n_\text{head}$ times as large as that of a single head
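
The concatenation of heads can be sketched as follows. This is a hedged illustration: the scoring rule, the name `attention_head`, and the per-head weights `W_heads` are hypothetical, chosen only to show that every head runs the same computation with its own weights and that the results are concatenated.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_head(h_t, h_bar, W_a):
    """One head: the same computation as every other head, with its own weights W_a."""
    weights = softmax(h_bar @ (W_a @ h_t))
    return weights @ h_bar                       # context vector of size n_h

rng = np.random.default_rng(0)
T_bar, n_h, n_head = 5, 4, 3
h_bar = rng.normal(size=(T_bar, n_h))
h_t = rng.normal(size=n_h)
W_heads = rng.normal(size=(n_head, n_h, n_h))    # hypothetical per-head weights

contexts = [attention_head(h_t, h_bar, W_a) for W_a in W_heads]
multi_head = np.concatenate(contexts)            # size n_head * n_h
print(multi_head.shape)                          # (12,)
```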

# Just for fun: Attention in action 

Here are some examples of Sequence to Sequence problems using Attention.

**Visual Attention example**
- Source: Image
- Target: Caption: "A giraffe and two zebras standing in a field."
- Attending over *pixels*, **not** a sequence

<img src="https://raw.githubusercontent.com/yunjey/show_attend_and_tell/master/jpg/train.jpg" width=1000>


Attribution: https://arxiv.org/abs/1502.03044

**Language Translation example**
- Source: Spanish
- Target: English
- Colab notebook: [Translation example](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/nmt_with_attention.ipynb#scrollTo=CiwtNgENbx2g)

# Self-attention

We have illustrated Attention in the context of the Decoder attending to an Encoder.

But Attention may be used to relate one element of the *input* sequence to all other elements of the sequence.

This is called *self-attention*
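
A minimal sketch of the idea, assuming a plain dot product as the score between two elements (real models typically apply learned projections first; the function name `self_attention` is hypothetical):

```python
import numpy as np

def softmax_rows(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    """Each element of x attends to every element of the same sequence."""
    scores = x @ x.T                  # (T, T): score of element t against element t'
    weights = softmax_rows(scores)    # each row sums to 1
    return weights, weights @ x       # one context-sensitive vector per element

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))           # 6 elements (e.g., word vectors) of size 4
weights, z = self_attention(x)
print(weights.shape, z.shape)         # (6, 6) (6, 4)
```

Row `t` of `weights` says how much element `t` draws on each element of the sequence, itself included.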

To illustrate, suppose we want to generate an embedding of words that is context sensitive.

Consider
- "The animal didn't cross the street because **it** was too *tired*"
- "The animal didn't cross the street because **it** was too *wide*"

The meaning of the word "it" in each sentence depends on the context.

By using a model for word embeddings that uses self-attention, we can differentiate between the two.

The thickness of the blue line indicates the attention weight that is given in processing the word "it".

<img src="https://1.bp.blogspot.com/-AVGK0ApREtk/WaiAuzddKVI/AAAAAAAAB_A/WPV5ropBU-cxrcMpqJBFHg73K9NX4vywwCLcBGAs/s1600/image2.png">

Many of the recent advances in NLP may be attributed to these improved, context-sensitive embeddings.

## Masked self-attention

Self-attention is applied to the *entire* input sequence to determine on which elements to focus.

It is almost as if the sequence $\x$ is treated as an *unordered* set.

Sometimes order is important.

For example, consider a generative model where
$$\x_{(\tt+1)} = \y_\tp$$
- That is: input element $(\tt +1)$ is the $t^{th}$ output
- Can't attend to something that hasn't been generated yet!
- Causal ordering is important

Other times, the fact that $\x_\tp$ precedes $\x_{(\tt+1)}$ is important.

The solution to both problems is to pair $\x_\tp$ with a *positional encoding* (of $\tt$)

To implement causal ordering for output $\tt$
- mask out all $\x_{(\tt')}$ where $\tt' > \tt$

This is called *masked self-attention*
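
The masking step can be sketched in NumPy as below. The dot-product score and the function name are assumptions for illustration; the point is only the mask, which sets the score of every future position $\tt' > \tt$ to $-\infty$ so that its attention weight becomes zero.

```python
import numpy as np

def masked_self_attention_weights(x):
    """Attention weights where position t may only attend to positions t' <= t."""
    T = x.shape[0]
    scores = x @ x.T                                      # (T, T) raw scores
    future = np.triu(np.ones((T, T), dtype=bool), k=1)    # True where t' > t
    scores = np.where(future, -np.inf, scores)            # mask out the future
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
weights = masked_self_attention_weights(x)
print(weights.round(2))    # lower-triangular: zeros above the diagonal
```

The first row has a single non-zero weight (equal to 1), since the first position can only attend to itself.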

The positional encoding can also be used in problem domains where relative order is important.
- The encoding is *non-trivial*
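
The notebook does not say which encoding it has in mind. As one example, here is a sketch of the sinusoidal encoding popularized by the original Transformer paper, which assigns each position a vector of sines and cosines at different frequencies. Combining the encoding with the input by addition is also an assumption of this sketch; the function name is hypothetical.

```python
import numpy as np

def sinusoidal_positional_encoding(T, d):
    """Return a (T, d) array whose row t encodes position t. Assumes d is even."""
    positions = np.arange(T)[:, None]                   # shape (T, 1)
    freqs = 1.0 / 10000 ** (np.arange(0, d, 2) / d)     # shape (d/2,)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(positions * freqs)             # even columns
    pe[:, 1::2] = np.cos(positions * freqs)             # odd columns
    return pe

pe = sinusoidal_positional_encoding(T=10, d=8)
x = np.zeros((10, 8))        # placeholder input sequence, for illustration
x_with_position = x + pe     # pair each element with its position (by addition)
print(pe.shape)              # (10, 8)
```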

## Transformers

The Transformer is a model that processes sequences much faster than RNNs.

It is an Encoder/Decoder architecture that uses multiple forms of Attention
- Self Attention in the Encoder
    - to tell the Encoder the relevant parts of the input sequence $\x$ to attend to
- Decoder/Encoder attention
    - to tell the Decoder which Encoder state $\bar{\h}_{(\tt')}$ to attend to when outputting $\y_\tp$
- Masked Self-Attention in the Decoder
    - to prevent the Decoder from looking ahead into inputs that have not yet been generated



# Conclusion

We recognized that the Decoder function responsible for generating Decoder output $\hat{\y}_\tp$
$$
\hat{\y}_\tp = D( \h_\tp; \mathbf{s})
$$

was quite rigid when it ignored argument $\mathbf{s}$.

This rigidity forced Decoder latent state $\h_\tp$ to assume the additional responsibility of including Encoder context.

Attention was presented as a way to obtain Encoder context through argument $\mathbf{s}$.

In [2]:
print("Done")

Done
