In [1]:
%run Latex_macros.ipynb

<IPython.core.display.Latex object>

# Attention: Motivation

Let's revisit the Encoder-Decoder architecture

The Encoder
- Acts on input sequence $[\x_{(1)} \dots \x_{(\bar{T})}]$
- Producing a sequence of latent states $[ \bar{\h}_{(1)}, \dots, \bar{\h}_{(\bar{T})} ]$



The Decoder
- Acts on the *final* Encoder latent state $\bar{\h}_{(\bar{T})}$
- Producing a sequence of outputs $[ \hat{\y}_{(1)}, \dots, \hat{\y}_{(T)} ]$
- Often feeding step $\tt$ output $\hat{\y}_\tp$ as Decoder input at step $(\tt+1)$



<table>
    <tr>
        <th><center>RNN Encoder/Decoder</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_Encoder_Decoder.png"></td>
    </tr>
</table>



The following diagram is a condensed depiction of the process

<table>
    <tr>
        <th><center>Sequence to Sequence: training (teacher forcing) + inference: No attention</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_seq2seq.png"></td>
    </tr>
</table>

Recall that $\bar{\h}_{(\bar{\tt})}$ is a fixed-length encoding of the input prefix $\x_{(1)}, \ldots, \x_{(\bar{\tt})}$.

For example, with $\bar{T} = 6$:
$$\x_{(1)}, \ldots , \x_{(\bar T)} = \text{Machine, Learning, is, easy, not, hard}$$

<br>
\begin{array}{lll}
\bar\h_{(1)} & = & \text{summary}( [ \text{Machine} ]) \\
\bar\h_{(2)} & = & \text{summary}( [ \text{Machine, Learning} ]) \\
\vdots \\
\bar\h_{(\bar{\tt})} & = & \text{summary}( [ \x_{(1)}, \ldots, \x_{(\bar{\tt})} ] ) \\
\vdots \\
\bar\h_{(6)} & = & \text{summary}( [ \text{Machine, Learning, is, easy, not, hard} ]) \\
\end{array}

So $\bar{\h}_{(\bar{T})}$, which initializes the Decoder, is a summary of the entire input sequence $\x$.
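
To make the "summary of a prefix" idea concrete, here is a minimal sketch of a toy Encoder. It is an illustration only: the scalar recurrence, the function name `encode`, and the weights `w_x` and `w_h` are assumptions chosen for brevity, not the Encoder used in these notes.

```python
import math

def encode(xs, w_x=0.5, w_h=0.9):
    """Toy scalar RNN Encoder: return the latent states, one per input element."""
    h = 0.0            # latent state before any input is seen
    states = []
    for x in xs:
        # The new state summarizes the current input and the previous summary
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

xs = [1.0, -2.0, 0.5, 3.0]
states = encode(xs)

# Change only the last input element and encode again
xs_changed = xs[:3] + [-3.0]
states_changed = encode(xs_changed)

# Each state summarizes a prefix of the input, never a later element:
# the first three states agree, only the final summary differs.
print(states[:3] == states_changed[:3])
print(states[3] == states_changed[3])
```

Only the last state has seen every input element, which is why it alone can serve as a summary of the whole sequence.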

Allowing the Encoder to complete its task before the Decoder starts enables us to decouple the two
- The consumption of input $\x$ and production of output $\hat{\y}$ do not have to be synchronized
- Allowing for the possibility that $T \ne \bar{T}$
- For example, in translation
    - There is no one-to-one mapping between the words of two languages (nor is the ordering of words preserved)

Let's focus on the part of the Decoder
that transforms Decoder latent state (or short term memory) $\h_\tp$ to output $\hat{\y}_\tp$.

The box in the diagram below is a Neural Network implementing a function 
$$D(\h_\tp)$$
mapping
- the Decoder short term memory $\h_\tp$ 
- to next output $\hat\y_\tp$.



<table>
    <tr>
        <th><center>Decoder: No attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Decoder_no_attention.png" width=80%></td>
    </tr>
</table>

This simple mapping of $\h_\tp$ to $\hat{\y}_\tp$ can be extremely burdensome
- the full semantics of the input sequence $\x_{(1)}, \ldots, \x_{(\bar{T})}$
- is only available to the Decoder via the final Encoder representation $\bar \h_{(\bar T)}$
- which must be captured in Decoder latent state $\h_\tp$
- since $\bar \h_{(\bar T)}$ is only available to the Decoder on the **first** step of the Decoder

It is often the case that the output $\hat{\y}_\tp$ at position $\tt$
- Depends mostly on a **specific element** $\x_{(\bar{\tt})}$ of the input
- Or on a **specific prefix** of the input: $\x_{(1)}, \ldots, \x_{(\bar{\tt})}$
- Yet the Neural Network for $D$ cannot refer back to any of the input sequence positions

Consider the example of language translation
- When predicting word $\hat{\y}_\tp$  in the Target language
- Some "context" provided by the Source language may greatly influence the prediction
    - For example: gender/plurality of the subject

This context is usually much smaller than the entire sequence $\x$ of length $\bar{T}$.



*Attention* is a mechanism that
- *conditions* the output Neural Network $D(\h_\tp; \mathbf{s})$ on a variable $\mathbf{s}$
- where $\mathbf{s} \in \{ \bar \h_{(\tt')} \; | \; 1 \le \tt' \le \bar T \} $

That is, Attention allows the Neural Network creating the output $\hat\y_\tp$ at position $\tt$ to
- focus ("**attend to**")
- the representation $\bar\h_{(\tt')}$ that is **most relevant** for output position $\tt$.

This potentially greatly simplifies the Decoder latent state $\h_\tp$.

## Why is Attention so important ?
 
Let's illustrate with a hypothetical example from Natural Language Processing: Question Answering.

A training example is encoded as
- Features: context + question
- Target: Answer


$
\x = \;
\begin{Bmatrix}\\
\text{Context:} & \text{The FRE Dept offers many Spring classes.  The students are great. ...} \\
& \vdots \\
& \text{Professor Perry taught them Machine Learning. The students ...}, \\
& \vdots \\
& \text{Professor Blecherman led a class in ...} \\
& \vdots \\
\text{Question:} & \text{What did Professor Perry do ?} \\
\end{Bmatrix}
$
<br><br><br>
$
\y = \;
\begin{array}{ll}
\text{Answer:} & \text{He taught them Machine Learning}
\end{array}
$

Perhaps, after seeing many such examples, the Decoder "learns" a pattern for answering questions of the type

> What did Professor `<PROPER NOUN>` teach in the Spring ?

Pattern:
```
<PRONOUN> <VERB> <INDIRECT OBJECT> <OBJECT>
```

where `<PRONOUN>`, `<VERB>`, etc. are *pattern place-holders*.

And perhaps the Encoder "learns" to bind concrete values to place-holders

<table>
    <tr>
        <th><center>Answering questions using Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Attention_example_1.png" width=80%></td>
    </tr>
</table>

Then the control state $\h_\tp$ for the Decoder could use Attention
to **attend to** the binding for the next place-holder in the output pattern.
- Following the pattern it learned
- Issuing a "query" to lookup the concrete value bound to a place-holder (the "key")
    - for each element of the pattern
    
<table>
    <tr>
        <th><center>Answering questions using Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Attention_example_2.png" width=80%></td>
    </tr>
</table>

Without Attention, the Decoder's control state $\h_\tp$
- would have to store the key/value (place-holder/concrete value) associations

With Attention
- the finite number of Decoder weights could be utilized for other purposes.


## Visualizing Attention

Attention is one of the main contributors powering recent advances in Deep Learning
- particularly Natural Language Processing

To give you a better feel for how it's used, here are some visualizations of Attention.

**Entailment: Does the "hypothesis" logically follow from the "premise"**

<br>
<center><strong>Attention: Entailment</strong></center>
<table>
    <tr>
        <td><img src="images/Attention_visualization_Entailment.png"></td>
    </tr>
    <tr>
        <td><center>Does the Premise logically entail the Hypothesis.</center></td>
    </tr>
</table>

Attribution: https://arxiv.org/pdf/1509.06664.pdf#page=6

**Date normalization example**
- Source: Dates in free-form: "Saturday 09 May 2018"
- Target: Dates in normalized form: "2018-05-09"

[link](https://github.com/datalogue/keras-attention#example-visualizations)

**Image captioning example**
- Source: Image
- Target: Caption: "A woman is throwing a **frisbee** in a park."
- Attending over *pixels*, **not** a sequence

<center><strong>Visual attention</strong></center>
<table>
    <tr>
        <td><img src="images/shat_-002-027.jpg"></td>
        <td><img src="images/shat_-002-028.jpg"></td>
    </tr>
    <tr>
        <td colspan=2><center>A woman is throwing a <strong>frisbee</strong> in a park.</center></td>
    </tr>
</table>
Attribution: https://arxiv.org/pdf/1502.03044.pdf


## Have we seen this before ?

Similarities to the LSTM
- LSTM separated *short term* (control) memory from *long term* memory
- Created a somewhat complicated mechanism to
    - update/change/forget long term memory
    - move parts of long term memory to the short-term control memory
    
Differences
- LSTM: attend to $\c$ (long-term memory)
- Attention: attend to *input*
    - not latent state
- Stacked Attention blocks
    - attend to input of *layer*, not raw input of Layer $0$

# Attend to what's important

The solution to overloading $\h_\tp$ with Source context is conceptually straightforward.

We condition the Neural Network $D$ on a *context* $\mathbf{s}$
$$\hat\y_\tp = D(\h_\tp; \mathbf{s})$$

and compute the value of the necessary context at each step $\tt$
$$
\mathbf{s} = \c_\tp
$$

Conceptually, the context at step $\tt$ is one of the representations created by the Encoder
$$ \c_\tp \in \{ \bar{\h}_{(1)}, \dots, \bar{\h}_{(\bar{T})} \} $$
chosen based on the Decoder state $\h_\tp$. In practice the choice is *soft*: $\c_\tp$ is a weighted sum of these representations, as described next.

Here is the diagram

<table>
    <tr>
        <th><center>Decoder: Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Decoder_attention.png" width=80%></td>
    </tr>
</table>

The "Choose" box is a Context Sensitive Memory (as described in the module on [Neural Programming](Neural_Programming.ipynb#Soft-Lookup))
- Like a Python `dict`
    - Collection of key/value pairs: $\langle \bar\h_{(\bar \tt)}, \bar\h_{(\bar \tt)} \rangle$
    - Key is equal to value; they are latent states of the Encoder
- But with *soft* lookup
    - The current Decoder state $\h_\tp$ is presented to the CSM 
        - Called the *query*
        - Is matched across each key of the dict (i.e., a latent state $\bar \h_{(\bar \tt)}$)
    - The CSM returns an approximate match of the query to a *key* of the `dict`
        - The distance between the query and each key in the CSM is computed
        - The Soft Lookup returns a *weighted* (by inverse distance) sum of the *values* in the CSM `dict`
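
The soft lookup can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation behind the diagram: the helper names (`dot`, `softmax`, `soft_lookup`) are hypothetical, and closeness is scored with a dot product followed by a softmax (a common choice) rather than a literal inverse distance.

```python
import math

def dot(u, v):
    """Dot product, used here as the similarity between a query and a key."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    """Return the weighted sum of values, weighted by query/key similarity."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights

# Three toy Encoder latent states; key is equal to value, as in the CSM above
encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
decoder_state = [0.0, 4.0]                 # the query; most similar to the second state

context, weights = soft_lookup(decoder_state, encoder_states, encoder_states)
print(weights)    # the second weight dominates
print(context)    # close to the second Encoder state
```

Because every weight is positive, the lookup is differentiable in the query and the keys, which is what allows it to be trained with the rest of the network.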

Here is a diagram summarizing the Attention mechanism

<table>
    <tr>
        <th><center>Sequence to Sequence: attention</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_seq2seq_attention.png" width=1000></td>
    </tr>
</table>

## Multi-head attention: two heads are better than one

Remember: 
- The elements of the output sequences are *vectors*: they have multiple features

We may need to attend to a different Encoder latent state for different output features
- May even need to attend to multiple Encoder latent states for a single output feature

Rather than having a single "head" attending to the latent states, we can have many.


A "head" is similar to the channel dimension of a CNN
- Each head (resp., channel) implements the same computation
- Using per-head (resp., per channel) weights
- Each computing a separate feature

Let $d$ denote the length of each Encoder output (and hence, the latent state sizes too)
- $|| \bar\h_\tp || = || \h_\tp || = d$

Since the Encoder outputs are used as the keys and values in the CSM
- $d$ is also the length of keys, values and queries

When we have $n$ heads
- Rather than having one Attention head operating on vectors of length $d$
    - producing an output of length $d$ (weighted sum of values in the CSM)
- We project each of the length $d$ vectors (keys, values, queries) into vectors of length $d \over n$
- We create $n$ Attention heads, each operating on these shorter vectors
    - The output of each head is a weighted sum of values, and hence also of length $d \over n$
- The final output concatenates these $n$ outputs into a single output of length $d$
    - identical in length to the output of the single head


How do we create the shorter length $d \over n$ vectors ?

We use projection matrices of size $(d \times {d \over n})$ **for each head** $j$
- multiplying each key by matrix $\W^{(j)}_\text{key}$
- multiplying each value  by matrix $\W^{(j)}_\text{value}$
- multiplying the original length $d$ query by matrix $\W^{(j)}_\text{query}$



How do we know how to reduce the length $d$ vectors to length $d \over n$ for head $j$ ?

We learn projection matrices $\W^{(j)}_\text{key}, \W^{(j)}_\text{value}, \W^{(j)}_\text{query}$ **in training**, for each $j$

The "Choose" box
- Is a Neural Network
- With its own weights
- That learns to make the best choice for the Target task !
    - It is trained as part of the larger task

The "Choose" box is implementing Attention and is called an Attention **head**

The picture shows $n$ Attention heads.

Each head $j$ uniquely transforms the query $\h_\tp$ and the key/value pairs $\bar{\h}_{(1)} \ldots \bar{\h}_{(\bar{T})}$ being queried
- into $\h^{(j)}_\tp$ and the key/value pairs $\bar{\h}^{(j)}_{(1)} \ldots \bar{\h}^{(j)}_{(\bar{T})}$
- so that each head can attend to a different item

<table>
    <tr>
        <th><center>Decoder Multi-head Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Multihead_attention.png" width=80%></td>
    </tr>
</table>

Head $j$ 
- uses query $\h^{(j)} = \h * \W_\text{query}^{(j)}$
- against keys $\bar{\h} * \W_\text{key}^{(j)}$ and values $\bar{\h} * \W_\text{value}^{(j)}$
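
The per-head projections and the final concatenation can be sketched as follows. This is a hedged illustration: in a real model the projection matrices are learned, whereas the matrices `select_first` and `select_last` below are hypothetical hand-picked stand-ins, and the dot-product/softmax scoring is an assumption.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(vec, W):
    """Row vector of length d times a (d x d/n) matrix: a vector of length d/n."""
    return [sum(vec[i] * W[i][j] for i in range(len(vec))) for j in range(len(W[0]))]

def attend(query, keys, values):
    """Single-head soft lookup: weighted sum of the values."""
    weights = softmax([dot(query, k) for k in keys])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

def multi_head_attention(h, encoder_states, heads):
    """heads: one (W_query, W_key, W_value) triple per head."""
    output = []
    for W_query, W_key, W_value in heads:
        q = project(h, W_query)
        ks = [project(s, W_key) for s in encoder_states]
        vs = [project(s, W_value) for s in encoder_states]
        output.extend(attend(q, ks, vs))   # concatenate the length d/n head outputs
    return output

d, n = 4, 2
# Hypothetical stand-ins for learned weights:
# head 0 keeps the first d/n features, head 1 keeps the last d/n features
select_first = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
select_last = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(select_first, select_first, select_first),
         (select_last, select_last, select_last)]

encoder_states = [[5.0, 0.0, 0.0, 1.0], [0.0, 5.0, 1.0, 0.0]]
h = [3.0, 0.0, 3.0, 0.0]

output = multi_head_attention(h, encoder_states, heads)
print(len(output))   # d: same length as a single-head output
print(output)
```

With these toy values, head 0 puts most of its weight on the first Encoder state and head 1 on the second, so the concatenated output mixes information from two different positions.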



# Self-attention

We have illustrated Attention in the context of the Decoder attending to an Encoder.

But Attention may be used to relate one element of the *input* sequence to all other elements of the input sequence.

This is called *self-attention*

To illustrate, suppose we want to generate an embedding of words that is context sensitive.

Consider
- "The animal didn't cross the street because **it** was too *tired*"
- "The animal didn't cross the street because **it** was too *wide*"

The meaning of the word "it" in each sentence depends on the context.

By using a model for word embeddings that uses self-attention  we can differentiate between the two.

The thickness of the blue line indicates the attention weight that is given in processing the word "it".

<img src=https://1.bp.blogspot.com/-AVGK0ApREtk/WaiAuzddKVI/AAAAAAAAB_A/WPV5ropBU-cxrcMpqJBFHg73K9NX4vywwCLcBGAs/s1600/image2.png>

Many of the recent advances in NLP may be attributed to these improved, context-sensitive embeddings.

## Masked self-attention

Self-attention is applied to the *entire* input sequence to determine on which elements to focus.

It is almost as if the sequence $\x$ is treated as an *unordered* set.

Sometimes order is important.

For example, consider a generative model where
$$\x_{(\tt+1)} = \y_\tp$$
- That is: input element $(\tt +1)$ is the $t^{th}$ output
- Can't attend to something that hasn't been generated yet !
- Causal ordering is important

Other times, the fact that $\x_\tp$ precedes $\x_{(\tt+1)}$ is important.

The solution to both problems is to pair $\x_\tp$ with a *positional encoding* (of $\tt$)

To implement causal ordering for output $\tt$
- mask out all $\x_{(\tt')}$ where $\tt' > \tt$

This is called *masked self-attention*
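
A minimal sketch of the masking step, under simplifying assumptions: each element $\x_\tp$ serves directly as query, key and value (no learned projections), scores are dot products, and masked positions receive a score of $-\infty$ so that their softmax weight is exactly zero. The function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)                        # finite: position t itself is never masked
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_self_attention(xs):
    """For each position t, attend only to positions t' <= t."""
    dim = len(xs[0])
    outputs, weight_rows = [], []
    for t, query in enumerate(xs):
        # Positions t' > t get a score of -infinity, hence a weight of exactly 0
        scores = [dot(query, key) if t_prime <= t else -math.inf
                  for t_prime, key in enumerate(xs)]
        weights = softmax(scores)
        outputs.append([sum(w * x[i] for w, x in zip(weights, xs)) for i in range(dim)])
        weight_rows.append(weights)
    return outputs, weight_rows

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weight_rows = masked_self_attention(xs)

for row in weight_rows:
    print(row)   # entries to the right of the diagonal are 0.0
```

The first position can attend only to itself, so its output equals its input; later positions mix in earlier elements but never later ones.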

The positional encoding can also be used in problem domains where relative order is important.
- The encoding is *non-trivial*
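
As one illustration of a non-trivial encoding, here is a sketch of the sinusoidal positional encoding introduced with the Transformer. Treat it as an example rather than the encoding these notes have in mind: the function names are hypothetical, and "pairing" $\x_\tp$ with its encoding is implemented here as element-wise addition, which is an assumption.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding of a position as a vector of length d_model."""
    encoding = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share one frequency; even uses sin, odd uses cos
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def add_position(x, position):
    """Combine an input vector with the encoding of its position (by addition)."""
    pe = positional_encoding(position, len(x))
    return [a + b for a, b in zip(x, pe)]

x = [0.5, 0.5, 0.5, 0.5]          # the same input vector at two positions
first = add_position(x, 0)
second = add_position(x, 1)

print(first)
print(second)
```

The same vector placed at two different positions yields two different inputs, so an attention mechanism that otherwise ignores order can still tell the positions apart.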

## Transformers

There is a new model (the Transformer) that processes sequences much faster than RNNs.

It is an Encoder/Decoder architecture that uses multiple forms of Attention
- Self Attention in the Encoder
    - to tell the Encoder the relevant parts of the input sequence $\x$ to attend to
- Decoder/Encoder attention
    - to tell the Decoder which Encoder state $\bar{\h}_{(\tt')}$ to attend to when outputting $\y_\tp$
- Masked Self-Attention in the Decoder
    - to prevent the Decoder from looking ahead into inputs that have not yet been generated



# Conclusion

We recognized that the Decoder function responsible for generating Decoder output $\hat{\y}_\tp$
$$
\hat{\y}_\tp = D( \h_\tp; \mathbf{s})
$$

was quite rigid when it ignored argument $\mathbf{s}$.

This rigidity forced Decoder latent state $\h_\tp$ to assume the additional responsibility of including Encoder context.

Attention was presented as a way to obtain Encoder context through argument $\mathbf{s}$.

