In [1]:
%run Latex_macros.ipynb


# Attention: Motivation

The models that we studied for processing input sequences differed from models for non-sequence inputs in one key respect
- **memory** (latent state) required for processing sequences
    - because sequence length is unbounded
    - finite representation of unbounded length input sequence
    - output at step $\tt$ fed as input to step $\tt+1$

The use of latent state/memory evolved over the models we studied
- RNN
    - latent state encodes
        - input representation
        - "control" state
            - guiding how the model processes the data: state transitions
- LSTM
    - latent state partitioned into
        - Short Term memory: control state
        - Long Term memory
        
Both these models processed the input sequence **once**
- so input-specific representation needs to be part of memory

We will introduce a mechanism called *Attention*
- that allows the input sequence to be *re-visited* at each time step
- cleaner separation between control memory and input memory

Let's revisit the Encoder-Decoder architecture

The Encoder
- Acts on input sequence $[\x_{(1)} \dots \x_{(\bar{T})}]$
- Producing a sequence of latent states $[ \bar{\h}_{(1)}, \dots, \bar{\h}_{(\bar{T})} ]$


The Decoder
- Acts on the *final* Encoder latent state $\bar{\h}_{(\bar{T})}$
- Producing a sequence of outputs $[ \hat{\y}_{(1)}, \dots, \hat{\y}_{(T)} ]$
- Often feeding step $\tt$ output $\hat{\y}_\tp$ as Decoder input at step $(\tt+1)$



<table>
    <tr>
        <th><center>RNN Encoder/Decoder</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_Encoder_Decoder.png"></td>
    </tr>
</table>



The following diagram is a condensed depiction of the process

<table>
    <tr>
        <th><center>Sequence to Sequence: training (teacher forcing) + inference: No attention</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_seq2seq.png"></td>
    </tr>
</table>

The topic of "Attention" will focus on the part of the Decoder diagram above
that transforms Decoder latent state (or short term memory) $\h_\tp$ to output $\hat{\y}_\tp$.

It is a Neural Network implementing a function 
$$\hat\y_\tp = D(\h_\tp)$$
mapping
- the Decoder short term memory $\h_\tp$ 
- to next output $\hat\y_\tp$.
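
To make the dependence on $\h_\tp$ alone concrete, here is a minimal sketch of $D$, assuming it is a single linear layer followed by a softmax over a vocabulary. The weights `W`, the toy `vocab`, and the state `h_4` are made-up values for illustration; they are not taken from any trained model.

```python
import math

def softmax(scores):
    """Convert a list of real scores into probabilities that sum to 1."""
    top = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def D(h, W, vocab):
    """Map Decoder state h to an output token, using h and nothing else."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs

# Hypothetical toy setup: 3-word vocabulary, Decoder state of length 2
vocab = ["He", "taught", "Machine"]
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
h_4 = [2.0, 3.0]          # made-up Decoder state at output position 4

token, probs = D(h_4, W, vocab)
print(token)
```

Whatever `D` returns is fully determined by `h`: two different contexts that lead to the same Decoder state necessarily produce the same token.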


<table>
    <tr>
        <th><center>Decoder: No attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Decoder_no_attention.png" width=80%></td>
    </tr>
</table>

As an illustrative example: consider the task of Question Answering.
- a sequence to sequence task (thus, ideal for an Encoder-Decoder architecture)
- where the input sequence $\x_{[1:\bar T]}$ is the pair consisting of
    - a paragraph called the *context*
    - a question that references the context
- the target/label (i.e., desired output sequence) is $\y_{[1:T]}$
    - is text that "answers" the question

$
\x = \;
\begin{Bmatrix}\\
\text{Context:} & \text{The FRE Dept offers many Spring classes.  The students are great. ...} \\
& \vdots \\
& \text{Professor Perry taught them Machine Learning. The students ...}, \\
& \vdots \\
& \text{Professor Blecherman led a class in ...} \\
& \vdots \\
\text{Question:} & \text{What did Professor Perry do ?} \\
\end{Bmatrix}
$
<br><br><br>
$
\y = \;
\begin{array}{ll}
\text{Answer:} & \text{He taught them Machine Learning}
\end{array}
$

Suppose the Decoder has already output 
$$\hat\y_{([1:3])} = \text{He taught them}$$

The remainder of the desired output sequence is
$$
\hat\y_{([4:5])} = \text{Machine Learning}$$

How is it possible for the Neural Network implementing $D$ to produce
$$\hat\y_{(4)} = D( \h_{(4)} )$$

Notice that $D$ is conditioned on the single input $\h_\tp$.

Thus, in order for $D( \h_{(4)} )$ to be equal to "Machine"
- this information must somehow be encoded in $\h_{(4)}$

But how did it get there ?

It must have been encoded in $\bar \h_{(\bar T)}$
- the "summary" of $\x_{([1:\bar T])}$ passed by the Encoder to the Decoder


Recall that Encoder output $\bar{\h}_{(\bar{\tt})}$ is a fixed length encoding of the input prefix $\x_{(1)}, \ldots, \x_{(\bar{\tt})} $.

For example:
$$\x_{(1)}, \ldots , \x_{(\bar T)} = \text{Machine, Learning, is, easy, not, hard}$$

<br>
$$
\begin{array}{lll}
\bar\h_{(1)} & = & \text{summary}( [ \text{Machine} ]) \\
\bar\h_{(2)} & = & \text{summary}( [ \text{Machine, Learning} ]) \\
\vdots \\
\bar\h_{(\bar{\tt})} & = & \text{summary}( [ \x_{(1)}, \ldots, \x_{(\bar{\tt})}] ) \\
\vdots \\
\bar\h_{(6)} & = & \text{summary}( [ \text{Machine, Learning, is, easy, not, hard} ]) \\
\end{array}
$$
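
The "summary" idea can be sketched with a plain RNN update, assuming the simple form $\tanh(W_h \bar\h + W_x \x)$. The weight matrices, the state size of 3, and the 2-dimensional token embeddings below are arbitrary illustrative values, not learned ones.

```python
import math

def rnn_step(h_prev, x, W_h, W_x):
    """One Encoder step: fold the current input into the previous latent state."""
    return [
        math.tanh(sum(a * b for a, b in zip(W_h[i], h_prev)) +
                  sum(a * b for a, b in zip(W_x[i], x)))
        for i in range(len(h_prev))
    ]

def encode(xs, W_h, W_x, state_size):
    """Return one latent state per input position; each has the same fixed length."""
    h = [0.0] * state_size
    states = []
    for x in xs:
        h = rnn_step(h, x, W_h, W_x)
        states.append(h)
    return states

# Made-up embeddings for: Machine, Learning, is, easy, not, hard
xs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8], [0.9, 0.1], [0.3, 0.3]]
W_h = [[0.5, 0.0, 0.1],
       [0.0, 0.5, 0.1],
       [0.1, 0.1, 0.5]]
W_x = [[1.0, 0.0],
       [0.0, 1.0],
       [0.5, 0.5]]

states = encode(xs, W_h, W_x, state_size=3)
print(len(states), [len(h) for h in states])
```

Every state has length 3 no matter how long the prefix is, so information about early tokens survives only if each later update happens to preserve it.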

In order for the concept "Machine Learning" to have been encoded in $\bar \h_{(\bar T)}$
- it must be present in all Encoder latent states 
$$\bar \h_{({\bar \tt}')} \text{ for } {\bar \tt}' \ge p$$
where $p$ is the index of "Machine Learning" in the context.

To summarize
- because $D$ is conditioned *only* on Decoder state $\h_\tp$
- all "facts" from the context must be transferred from Encoder to Decoder
- through final Encoder state $\bar \h_{(\bar T)}$
- which in turn was encoded in all Encoder states $\bar \h_{({\bar \tt}')} \text{ for } {\bar \tt}' \ge p$

The choice of conditioning $D$ only on $\h_\tp$ is burdensome for both the Encoder and Decoder.

Some of the weights of each must be devoted to 
- "control"
    - how to record facts (Encoder)/produce output (Decoder) in the abstract (i.e., for any context/question)
- concrete facts of the particular context

## Attend to what's important

What if we changed $D$ so it was conditioned on both $\h_\tp$ **and** $\bar \h_{([1:\bar T])}$
- $\bar \h_{([1:\bar T])}$ enables the Decoder to refer back to input $\x_{([1:\bar T])}$ at *every output position* $\tt$
- $\h_\tp$ no longer has to encode the "facts"
    - weights can be devoted to "control"
- $\bar \h_{(\bar T)}$ is *no longer the bottleneck* through which facts flow from Encoder to Decoder

The version of $D$ with Attention looks something like this

<table>
    <tr>
        <th><center>Decoder: Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Decoder_attention.png" width=80%></td>
    </tr>
</table>

At output position $\tt$
we enable the Decoder to focus on (*attend to*)
- the position $\bar \tt$ of the input
- that is *relevant* for producing $\hat \y_\tp$

This seems very natural to a human
- rather than memorizing details
- we refer back to the context
- focusing on only the part that is immediately needed

The discussion of the **implementation** of Attention will be deferred to a
later module [Attention lookup](Attention_Lookup.ipynb).

For now, think of the "Choose" box as a Context Sensitive Memory (as described in the module on [Neural Programming](Neural_Programming.ipynb#Soft-Lookup))
- Like a Python `dict`
    - Collection of key/value pairs: $\langle \bar\h_{(\bar \tt)}, \bar\h_{(\bar \tt)} \rangle$
    - Key is equal to value; they are latent states of the Encoder
- But with *soft* lookup
    - The current Decoder state $\h_\tp$ is presented to the CSM 
        - Called the *query*
        - Is matched across each key of the dict (i.e., a latent state $\bar \h_{(\bar \tt)}$)
    - The CSM returns an approximate match of the query to a *key* of the `dict`
        - The distance between the query and each key in the CSM is computed
        - The Soft Lookup returns a *weighted* (by inverse distance) sum of the *values* in the CSM `dict`
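
The soft lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions: the weights are computed as a softmax of negative squared Euclidean distance (one way to make closer keys weigh more; dot-product scoring is a common alternative), and the Encoder states and query are made-up vectors.

```python
import math

def soft_lookup(query, keys, values):
    """Return a weighted sum of values; keys closer to the query get more weight."""
    # Squared Euclidean distance between the query and each key
    dists = [sum((q - k) ** 2 for q, k in zip(query, key)) for key in keys]
    # Assumption: turn distances into weights with a softmax over negative distance
    scores = [-d for d in dists]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

# Hypothetical Encoder latent states: each one serves as both key and value
encoder_states = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
query = [0.1, 0.9]        # made-up Decoder state, closest to the second key

output, weights = soft_lookup(query, encoder_states, encoder_states)
print(weights)
```

Unlike a Python `dict`, the lookup never fails on a missing key: every query returns a blend of the stored values, dominated by the entries whose keys are nearest.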

## Visualizing Attention

We can illustrate the behavior of Neural Networks that have been augmented with Attention through diagrams.
- at a particular output position $\tt$
- we can display the amount of "attention"
- that each position in the input receives

Attention can be used to create a Context Sensitive Encoding of words
- The meaning of a word may change depending on the rest of the sentence

We can illustrate this with an example: how the meaning of the word "it" changes
- The thickness of the blue line indicates the attention weight that is given in processing the word "it".

<img src=https://1.bp.blogspot.com/-AVGK0ApREtk/WaiAuzddKVI/AAAAAAAAB_A/WPV5ropBU-cxrcMpqJBFHg73K9NX4vywwCLcBGAs/s1600/image2.png>

Many of the recent advances in NLP may be attributed to these improved, context-sensitive embeddings.

We note that simple Word Embeddings
- also capture "meaning"
- but are *not* sensitive to context



**Entailment: Does the "hypothesis" logically follow from the "premise"**

<br>
<center><strong>Attention: Entailment</strong></center>
<table>
    <tr>
        <td><img src="images/Attention_visualization_Entailment.png"></td>
    </tr>
    <tr>
        <td><center>Does the Premise logically entail the Hypothesis.</center></td>
    </tr>
</table>

Attribution: https://arxiv.org/pdf/1509.06664.pdf#page=6

**Date normalization example**
- Source: Dates in free-form: "Saturday 09 May 2018"
- Target: Dates in normalized form: "2018-05-09"

[link](https://github.com/datalogue/keras-attention#example-visualizations)

**Image captioning example**
- Source: Image
- Target: Caption: "A woman is throwing a **frisbee** in a park."
- Attending over *pixels*, **not** over a sequence

<center><strong>Visual attention</strong></center>
<table>
    <tr>
        <td><img src="images/shat_-002-027.jpg"></td>
        <td><img src="images/shat_-002-028.jpg"></td>
    </tr>
    <tr>
        <td colspan=2><center>A woman is throwing a <strong>frisbee</strong> in a park.</center></td>
    </tr>
</table>
Attribution: https://arxiv.org/pdf/1502.03044.pdf


## Benefits of Attention

Continuing with our hypothetical Question Answering task
- Attention allows the Decoder to focus on "control"
- By allowing it to access "facts" when needed

We don't **know** if the following is actually what happens, but let's imagine.

Perhaps, after seeing many such examples, the Decoder "learns" a pattern for answering questions of the type

> What did Professor `<PROPER NOUN>` teach in the Spring ?

Output Pattern:
```
<PRONOUN> <VERB> <INDIRECT OBJECT> <OBJECT>
```

where `<PRONOUN>, <VERB>`, etc. are *pattern place-holders*

So the "control" of the Decoder
- needs to output each position of the output pattern
- binding concrete values to each place-holder

And perhaps the Encoder "learns" to create the bindings of concrete values to place-holders

<table>
    <tr>
        <th><center>Answering questions using Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Attention_example_1.png" width=80%></td>
    </tr>
</table>

Then the control state $\h_\tp$ for the Decoder could use Attention
to **attend to** the binding for the next place-holder in the output pattern.
- Following the pattern it learned
- Issuing a "query" to lookup the concrete value bound to a place-holder (the "key")
    - for each element of the pattern
    
<table>
    <tr>
        <th><center>Answering questions using Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Attention_example_2.png" width=80%></td>
    </tr>
</table>

## Have we seen this before ?

If you recall the architecture of the LSTM
- *short term* (control) memory 
- was separated from *long term* memory
- elements of long term memory are moved to short term memory *as needed*

This is partly similar to the advantages of Attention.

But, without Attention,
- all factual information from input $\x$ still has to flow through the bottleneck $\bar \h_{(\bar T)}$ of the Encoder output



# Self-attention

We have illustrated the benefit of enabling the Decoder to attend to the Encoder.

This form of attention is called *Cross Attention*.

But we can further simplify the Decoder control by enabling it, when generating $\hat \y_\tp$
- to attend to all previously generated outputs $\hat \y_{([1:\tt-1])}$

This form of attention is called *Self Attention*

For example, suppose the Decoder is generating a long sentence
- in many languages, there needs to be agreement between the gender/plurality of a subject and verbs
- Self attention enables the Decoder to refer back to the previously generated subject of the sentence
- when generating the verb for each subsequent output position



It is common in an Encoder-Decoder architecture to have both
- Cross Attention from Decoder to Encoder
- Self Attention from Decoder to Decoder

We will see both forms used in the Transformer.

These mechanisms are attending to different sequences (Encoder states or Decoder outputs).

We will henceforth use the term *sequence being attended to* as a general term
- instead of specifically referring to the part of the network that produced it

# Masked attention

As presented, the Attention mechanism can refer to an entire sequence
- e.g., the sequence of Encoder latent states

It is sometimes desirable to *limit* what may be attended to.

For example, consider a decision at time $\tt$ that may depend *only on the past*
- positions $\tt' \lt \tt$
- for example, a trading decision at time $\tt$ may depend only on *prior* information
    - typical of sequences that are timeseries

Restricting attention to the past is called *Causal Attention*.
- the next output depends only on things that could have caused it (the past), not the future

There is a mechanism to restrict what may be attended to in a general way
- create a "mask"
- a bit vector for each position of the sequence being attended to
- such that attention is limited to positions where the mask element is True.
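
A mask of this kind can be sketched as follows, assuming the attention weights come from a softmax over per-position scores. The scores are made-up numbers, and positions are 0-based in the code (the text uses 1-based positions).

```python
import math

def masked_softmax(scores, mask):
    """Softmax over positions where mask is True; masked-out positions get weight 0."""
    allowed = [s for s, m in zip(scores, mask) if m]
    if not allowed:
        raise ValueError("mask must allow at least one position")
    top = max(allowed)                       # for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(t, length):
    """Mask for position t: True only for strictly earlier positions t' < t."""
    return [t_prime < t for t_prime in range(length)]

scores = [1.0, 2.0, 3.0, 4.0]      # hypothetical match scores for 4 positions
mask = causal_mask(2, len(scores)) # [True, True, False, False]
weights = masked_softmax(scores, mask)
print(weights)
```

Even though the two masked positions have the highest scores, they receive zero weight, and the remaining weights still sum to 1.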


This is called *Masked Attention*.

It is frequently used to enable a Decoder, when predicting output $\hat \y_\tp$
- to attend to **previously** generated outputs $\hat \y_{([1:\tt-1])}$
- but not **future** outputs $\hat\y_{(\tt')}$ for $\tt' \ge \tt$

When used in this manner, we refer to the behavior as *Masked Self Attention*


**Aside**

You may wonder how it is even practically possible for a Decoder to refer to the future.

When using *Teacher Forcing* for **training**
- the Decoder does not use the *predicted* target sequence $\hat \y_{([1:T])}$
- the Decoder uses the *actual* target sequence $\y_{([1:T])}$
    - hence, "future" positions $\tt' \ge \tt$ are available
- this prevents a single mis-prediction at position $\tt$ from cascading and ruining all future output
    - facilitates training
- at inference time: the Decoder works on the *predicted* Target sequence.
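
The difference between the two regimes can be sketched with a toy "model". Here `predict_next` is a hypothetical stand-in for the Decoder: a lookup table from the previous token to the next one, with one deliberate mistake ("us" instead of "them"). The `<START>` and `<UNK>` tokens are also assumptions of this sketch.

```python
def decoder_inputs_training(targets, start="<START>"):
    """Teacher forcing: the Decoder input is the actual target, shifted right by one."""
    return [start] + targets[:-1]

def decode_inference(predict_next, max_len, start="<START>"):
    """Inference: each prediction is fed back as the next Decoder input."""
    inputs, outputs = [start], []
    for _ in range(max_len):
        y_hat = predict_next(inputs)
        outputs.append(y_hat)
        inputs.append(y_hat)
    return outputs

# Hypothetical model with one error: it predicts "us" where the target says "them"
table = {"<START>": "He", "He": "taught", "taught": "us",
         "them": "Machine", "Machine": "Learning"}

def predict_next(inputs):
    return table.get(inputs[-1], "<UNK>")

targets = ["He", "taught", "them", "Machine", "Learning"]

train_inputs = decoder_inputs_training(targets)
forced = [predict_next(train_inputs[:i + 1]) for i in range(len(targets))]
free = decode_inference(predict_next, max_len=len(targets))
print(forced)
print(free)
```

With teacher forcing, the single mistake stays confined to one position because the next step is given the actual token "them". At inference the mistaken "us" is fed back in, and every later prediction is derailed.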

In the diagram below, we illustrate (lower right) how the Decoder input changes between Training and Test/Inference time.

<table>
    <tr>
        <th><center>Sequence to Sequence: training (teacher forcing) + inference: No attention</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_seq2seq.png"></td>
    </tr>
</table>

# Multi-head attention: two heads are better than one

Perhaps when generating the output for position $\tt$ of the output sequence
- we need to attend to *more than one* position of the sequence being attended to
    - need to know both gender and plurality of subject
- that is: we want an Attention layer to output multiple items.

We can attend to $n$ positions
- by creating $n$ separate Attention mechanisms
- each one called a *head*

This behavior is referred to as *Multi-head attention*

This type of behavior is common to many layer types in a Neural Network
- a Dense layer $\ll$ may produce a vector $\y_\llp$ of length $n_\llp \gt 1$
- a Convolutional layer $\ll$ may produce outputs (for each spatial location) for many channels

We have referred to this as layer $\ll$ producing $n_\llp$ *features*.

It would be natural for an Attention layer to output many "features" to enable attention to many positions.

In practice, this is usually not done
- Model architectures (e.g., the Transformer) are simplified when the inputs/outputs of each sub-component
- have the same length
- often denoted as $d$ or $d_\text{model}$ in the Transformer


When a Transformer needs to attend to $n$ positions
- it uses $n$ Attention heads
- each outputting a vector of length $\frac{d}{n}$
- which are concatenated together to produce a single output of length $d$

When we have $n$ heads
- Rather than having one Attention head operating on vectors of length $d$
    - producing an output of length $d$ (weighted sum of values in the CSM)
- We project each of these length $d$ vectors into vectors of length $d \over n$
- We create $n$ Attention heads operating on the projected vectors (keys, values, queries) of length $d \over n$.
    - Outputs of these smaller heads are weighted sums of values, and hence also of length $d \over n$
- The final output concatenates these $n$ outputs into a single output of length $d$
    - identical in length to the output of the single head


The picture below shows $n$ Attention heads.

Note that each head is working on vectors of length $\frac{d}{n}$ rather than
original dimensions $d$.
- variables with superscript $(j)$ are of fractional length

Details are deferred to the module [Attention lookup](Attention_Lookup.ipynb).


Each head $j$ uniquely transforms the query $\h_\tp$ and the key/value pairs $\bar{\h}_{(1)} \ldots \bar{\h}_{(\bar{T})}$ being queried.
- into $\h^{(j)}_\tp$ and the key/value pairs $\bar{\h}^{(j)}_{(1)} \ldots \bar{\h}^{(j)}_{(\bar{T})}$
- Such that each head attends to a separate item
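
The project, attend, concatenate sequence can be sketched as below. This is a simplified illustration, not the Transformer's exact layer: each head uses one made-up projection matrix for query, keys, and values alike (here simply selecting a slice of the coordinates), whereas the Transformer learns separate projections for each and applies a further output projection. Scoring by dot product is also an assumption of the sketch.

```python
import math

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def attend(query, keys, values):
    """Single head: softmax of dot-product scores, then weighted sum of values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def multi_head_attention(query, states, projections):
    """Run one head per projection (each of shape d/n x d) and concatenate outputs."""
    output = []
    for W in projections:
        q_j = matvec(W, query)                    # query of length d/n
        states_j = [matvec(W, s) for s in states] # keys/values of length d/n
        output.extend(attend(q_j, states_j, states_j))
    return output

# Hypothetical setup: d = 4, n = 2 heads, so each head works on length-2 vectors
projections = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],   # head 1: first two coordinates
    [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]],   # head 2: last two coordinates
]
states = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
query = [2.0, 0.0, 0.0, 2.0]

out = multi_head_attention(query, states, projections)
print(len(out))
```

The concatenated output has length $d$ again, so the multi-head layer can replace a single head without changing the shapes seen by the rest of the model.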

<table>
    <tr>
        <th><center>Decoder Multi-head Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Multihead_attention.png" width=80%></td>
    </tr>
</table>

## Transformers

There is a new model (the Transformer) that processes sequences much faster than RNNs.

It is an Encoder/Decoder architecture that uses multiple forms of Attention
- Self Attention in the Encoder
    - to tell the Encoder the relevant parts of the input sequence $\x$ to attend to
- Decoder/Encoder attention
    - to tell the Decoder which Encoder state $\bar{\h}_{(\bar\tt)}$ to attend to when outputting $\hat\y_\tp$
- Masked Self-Attention in the Decoder
    - to prevent the Decoder from looking ahead at outputs that have not yet been generated



# Conclusion

We recognized that the Decoder function responsible for generating Decoder output $\hat{\y}_\tp$
$$
\hat{\y}_\tp = D( \h_\tp; \mathbf{s})
$$

was quite rigid when it ignored argument $\mathbf{s}$: the sequence being attended to (e.g., the Encoder latent states $\bar \h_{([1:\bar T])}$).

This rigidity forced Decoder latent state $\h_\tp$ to assume the additional responsibility of including Encoder context.

Attention was presented as a way to obtain Encoder context through argument $\mathbf{s}$.

In [2]:
print("Done")

Done
