In [1]:
%run Latex_macros.ipynb

<IPython.core.display.Latex object>

# Attention: motivation

Let us consider a task familiar to all of us who have taken standardized exams: Question Answering.

Input consists of two pieces of text
- a paragraph (called the *Context*)
- a Question whose answer can be found in the paragraph

Output is a piece of text that answers the question.

$
\x = \;
\begin{Bmatrix}\\
\text{Context:} & \text{The FRE Dept offers many Spring classes.  The students are great. ...} \\
& \vdots \\
& \text{Professor Perry taught them Machine Learning. The students ...}, \\
& \vdots \\
& \text{Professor Blecherman led a class in ...} \\
& \vdots \\
\text{Question:} & \text{What did Professor Perry do ?} \\
\end{Bmatrix}
$
<br><br><br>
$
\y = \;
\begin{array}{ll}
\text{Answer:} & \text{He taught them Machine Learning}
\end{array}
$

Let us hypothesize how a model might learn to solve this task
- this is only a hypothesis: we don't really know how the model does it

The model might have generalized
- from seeing many training examples of disparate questions and their answers
- that there is a parameterized *template* for both the Question and the Answer

Question Template

> What did Professor `<PROPER NOUN>` teach in the Spring ?

Answer Template:
```
<PRONOUN> <VERB> <INDIRECT OBJECT> <OBJECT>
```

where `<PROPER NOUN>`, `<PRONOUN>`, `<VERB>`, etc. are *place-holders*: the parameters of the template.

By using a parameterized template, the model captures
- commonality
- in many different types of questions

In order to produce the answer the model needs to
- generate the tokens of the Answer Template in order
- substitute concrete values for the place-holders
    - by performing a Lookup in the Context in order to obtain these values
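
The two steps above can be sketched in a few lines of Python. This is only an illustration of the hypothesis, not a claim about what a trained model does internally: the dictionary `context_lookup` and the function `fill_template` are hypothetical names, and the dictionary's contents are written by hand rather than extracted from the Context.

```python
# Hypothetical dictionary that a model might build from the Context.
# Keys are place-holder names; values are the concrete text found in the paragraph.
context_lookup = {
    "<PRONOUN>": "He",
    "<VERB>": "taught",
    "<INDIRECT OBJECT>": "them",
    "<OBJECT>": "Machine Learning",
}

# The Answer Template: an ordered list of place-holders.
answer_template = ["<PRONOUN>", "<VERB>", "<INDIRECT OBJECT>", "<OBJECT>"]


def fill_template(template, lookup):
    """Emit the template slots in order, replacing each place-holder by its looked-up value."""
    return " ".join(lookup[slot] for slot in template)


answer = fill_template(answer_template, context_lookup)
print(answer)  # He taught them Machine Learning
```

The interesting question, taken up next, is where such a dictionary could live inside a Neural Network.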

We will examine how the Lookup might be performed
- first: by using mechanisms that we have already studied
- subsequently: via a new mechanism called *Attention*


# Using an RNN without Attention

## Encoder-Decoder architecture: review

For the Question Answering task
- both the Input and Output are sequences
- thus, the task is a Sequence to Sequence task
    - just like: Language Translation

We learned that Recurrent architectures are best-suited for processing sequences.

These architectures
- operate in a "loop"
    - processing one Input or Output token at a time
- utilize **memory** (latent state)
    - necessary because Input/Output sequence lengths are unbounded
    - after processing the token at position $\tt$
        - the latent state is a finite representation of the prefix of the sequence of length $\tt$

The use of latent state/memory evolved over the models we studied
- RNN
    - latent state encodes
        - input representation
        - "control" state
            - guiding how the model processes the data: state transitions
- LSTM
    - latent state partitioned into
        - Short Term memory: control state
        - Long Term memory
        
Both these models processed the input sequence **once**
- so input-specific representation needs to be part of memory

A common architecture for Sequence to Sequence tasks is the Encoder-Decoder:
- The Encoder is an RNN
    - Acts on input sequence $[\x_{(1)} \dots \x_{(\bar{T})}]$
    - Producing a sequence of latent states $[ \bar{\h}_{(1)}, \dots, \bar{\h}_{(\bar{T})} ]$
        - latent state $\bar{\h}_\tp$ is a summary of $[\x_{(1)} \dots \x_\tp ]$


The Decoder
- Acts on the *final* Encoder latent state $\bar{\h}_{(\bar{T})}$
    - which summarizes the entire input sequence $\x$
    - Producing a sequence of latent states $[ \h_{(1)}, \dots, \h_{(T)} ]$
        - latent state $\h_\tp$ is responsible for generating output token $\hat{\y}_\tp$
    - Thus outputting a sequence  $[ \hat{\y}_{(1)}, \dots, \hat{\y}_{(T)} ]$
- Often feeding step $\tt$ output $\hat{\y}_\tp$ as Decoder input at step $(\tt+1)$
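
As a rough sketch of this data flow, here is a toy Encoder-Decoder written with plain Python lists. Everything in it is an assumption made for illustration: the sizes `H` and `E`, the random (untrained) weights, the `tanh` update rule, and the helper names `rnn_step`, `encode` and `decode`. It only shows which state is passed where.

```python
import math
import random

random.seed(0)
H = 4   # latent state size (hypothetical)
E = 3   # input token embedding size (hypothetical)


def rand_matrix(rows, cols):
    """Random, untrained weights: enough to show the data flow."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def rnn_step(W_h, W_x, h_prev, x):
    """One recurrent step: h = tanh(W_h h_prev + W_x x)."""
    return [
        math.tanh(sum(w * v for w, v in zip(W_h[i], h_prev))
                  + sum(w * v for w, v in zip(W_x[i], x)))
        for i in range(len(h_prev))
    ]


enc_W_h, enc_W_x = rand_matrix(H, H), rand_matrix(H, E)
dec_W_h, dec_W_x = rand_matrix(H, H), rand_matrix(H, H)


def encode(xs):
    """Return every Encoder latent state; state t summarizes xs[0..t]."""
    h, states = [0.0] * H, []
    for x in xs:
        h = rnn_step(enc_W_h, enc_W_x, h, x)
        states.append(h)
    return states


def decode(h_final, T):
    """Start from the final Encoder state; feed each output back in as the next input."""
    h, y, outputs = h_final, [0.0] * H, []
    for _ in range(T):
        h = rnn_step(dec_W_h, dec_W_x, h, y)
        y = h                      # simplest case: the output equals the latent state
        outputs.append(y)
    return outputs


xs = [[random.uniform(-1, 1) for _ in range(E)] for _ in range(5)]  # 5 input tokens
enc_states = encode(xs)
ys = decode(enc_states[-1], T=3)   # only the *final* Encoder state reaches the Decoder
print(len(enc_states), len(ys), len(ys[0]))  # 5 3 4
```

Note that `decode` receives a single vector, `enc_states[-1]`: everything the Decoder knows about the input must fit in it.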



<table>
    <tr>
        <th><center>RNN Encoder/Decoder</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_Encoder_Decoder.png"></td>
    </tr>
</table>


## Decoder output $\hat\y_\tp$

The simplest RNN (corresponding to our diagrams) uses the latent state $\h_\tp$ as the output $\hat\y_\tp$
$$
\hat\y_\tp = \h_\tp
$$

It is easy to add another NN to transform $\h_\tp$ into a $\hat\y_\tp$ that is different.

- We can add a NN to the Decoder RNN that implements a function $D$ that transforms the latent state into an output.

$$\hat\y_\tp = D(\h_\tp)$$

For clarity: we will omit this additional NN from our diagrams until it becomes necessary

Here is what the additional NN looks like:
<br>
<br>



<table>
    <tr>
        <th><center>Decoder output transformation: No attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Decoder_no_attention.png" width=70%></td>
    </tr>
</table>

## How does the Decoder perform Lookup (without Attention) ?

Suppose the Decoder has already output 
$$\hat\y_{([1:3])} = \text{He taught them}$$

It must subsequently output
$$
\hat\y_{([4:5])} = \text{Machine Learning}$$

In order to do this
- it must Lookup "Machine Learning" in the Context, resulting in
$$\begin{array}{lcl}
D( \h_{(4)}) & = & \text{Machine} \\
D( \h_{(5)}) & = & \text{Learning} \\
\end{array}
$$

But  $D$ is conditioned on the single input $\h_\tp$.

Thus, in order for $D( \h_{(4)} )$ to be equal to "Machine"
- this information must somehow be encoded in $\h_{(4)}$


How did it get there ?

All "knowledge" from the Context must be transferred from Encoder to Decoder
- through the final Encoder state $\bar \h_{(\bar T)}$
- which in turn means it was carried through all Encoder states $\bar \h_{(\tt')} \text{ for } \tt' \ge \bar p$
    - where $\bar p$ is the position within sequence $\x$ of the word "Machine"

We can hypothesize that the final Encoder latent state $\bar\h_{(\bar T)}$
- encodes a Dictionary (key/value pairs)
- mapping Place-holder names to Concrete values
- the dictionary is built incrementally by prior latent states of the Encoder
<table>
    <tr>
        <th><center>Answering questions using Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Attention_example_1.png" width=80%></td>
    </tr>
</table>

This dictionary is passed to the Decoder via the single connection from Encoder to Decoder
- and must be carried forward by the Decoder
- through Decoder states $[ \h_{(1)}, \dots, \h_{(4)} ]$
- in order to make the dictionary available to subsequent latent states of the Decoder

We further hypothesize that the Decoder
- performs Lookups 
- by using the Decoder latent state $\h_\tp$
    - as a *query* that matches against the keys of the Dictionary
    - in order to obtain the Concrete value required to produce the output token at position $\tt$
    $$\hat\y_\tp = D(\h_\tp)$$

<br>
<br>
<table>
    <tr>
        <th><center>Query performing a Lookup in the Dictionary</center></th>
    </tr>
    <tr>
        <td><img src="images/Attention_example_2.png" width=80%></td>
    </tr>
</table>

Here is a picture describing this hypothetical functioning.
<br>
<br>
<table>
    <tr>
        <th><center>RNN Encoder/Decoder without Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_Encoder_Decoder.png" width=80%></td>
    </tr>
    <tr>
        <td><img src="images/Encoder_Decoder_no_attention.png" width=70%></td>
    </tr>
</table>

Connecting the Encoder and Decoder through the "bottleneck" of $\bar \h_{(\bar T)}$ thus burdens the
- Encoder: passing knowledge forward to the bottleneck
- Decoder: carrying knowledge forward from the bottleneck

This results in an inefficient use of the model's latent state variable
- In addition to
    - the Encoder and Decoder allocating some of the model's latent state for "control"
    - guiding the loop that processes the Input, or generates the output positions in the template
- It must **also** allocate some of the model's latent state for "knowledge storage"
    - in order to Lookup the concrete value corresponding to a place-holder in the Output template


# Attention

**Reference**

[Neural Machine Translation by Jointly Learning To Align and Translate](https://arxiv.org/pdf/1409.0473.pdf): the paper that introduced Attention

<br><br>
The flaw in the Encoder-Decoder without Attention is 
- the input $\x$ is processed *only once*
- by the Encoder
- which has to summarize it in $\bar{\h}_{(\bar T)}$

We will introduce a mechanism called *Attention*
- that allows the input sequence to be *re-visited* at each time step of Output generation

This will result in a cleaner separation between control memory and input memory

Attention allows the Decoder
- to directly access all of the Encoder latent states $\bar\h_{(1)} \dots \bar\h_{(\bar{T})}$
- at each time step of the Decoder

Thus, there is no need
- for an Encoder to create a full dictionary as the final Encoder latent state $\bar\h_{(\bar T)}$
- for the Decoder to keep the dictionary in all its latent states $\h_{(1)} \dots \h_{(T)}$

Here is a picture of an Encoder/Decoder augmented with Attention
- we have added a box to the diagram for the NN that implements the function $D$
    - that maps $\h_\tp$ to $\hat\y_\tp$
<br>
<br>
<table>
    <tr>
        <th><center>RNN Encoder/Decoder with Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_Encoder_Decoder_Attention.png" width=80%></td>
    </tr>
    <tr>
        <td><img src="images/Encoder_Decoder_no_attention.png" width=70%></td>
    </tr>
</table>

Notice that the final Encoder latent state $\bar\h_{(\bar T)}$ is **no longer**  connected to the Decoder.

What is going on inside the "box" implementing function $D$ that we added at each time step ?

The box's inputs at step $\tt$ are
- the Decoder latent state $\h_\tp$
- the collection of Encoder latent states $\bar\h_{(1)} \dots \bar\h_{(\bar{T})}$
    - the red box in the above diagram

That is, it is computing a $\hat\y_\tp$ that is a function of both $\h_\tp$ and $\bar\h_{(1)} \dots \bar\h_{(\bar{T})}$

$$
\hat\y_\tp = D( \h_\tp,  [ \bar\h_{(1)} \dots \bar\h_{(\bar{T})} ])
$$

Here are the inner workings of the NN for $D$:
    
<br>
<br>
<table>
    <tr>
        <th><center>Decoder output transformation with attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Decoder_attention.png" width=60%></td>
    </tr>
</table>

Inside the box:
- the Decoder latent state $\h_\tp$ is used as a *query*
- which is matched against each of the Encoder latent states
- resulting in one Encoder latent state being chosen as $\mathbf{c}_\tp$

The chosen Encoder latent state $\mathbf{c}_\tp$ and Decoder latent state $\h_\tp$
- are input to another Neural Network
- which produces output $\hat\y_\tp$


Essentially we change  $D$ so that it is conditioned on
- $\h_\tp$: the query (Decoder state)
- **and** a summary of the Context $\bar \h_{([1:\bar T])}$



The "Choose" box implements an *Attention* mechanism, which allows the Decoder
- to **attend to** the part of Input $\x$ (represented via some Encoder latent state $\bar\h_{(\bar\tt)}$)
- that is *relevant* for producing $\hat \y_\tp$
- exactly when it is needed

This seems very natural to a human
- rather than memorizing details (e.g., the big dictionary $\bar\h_{(\bar T)}$ in the architecture without Attention)
- we refer back to the context
- focusing on only the part that is immediately needed

A big advantage of this approach:
- the latent state of the Decoder is solely for "control" (e.g., creating the output according to the Answer template)
- and not for storing "knowledge" (dictionary)



The discussion of the **implementation** of Attention will be deferred to a
later module [Attention lookup](Attention_Lookup.ipynb).

For now, think of the "Choose" box as a Context Sensitive Memory (as described in the module on [Neural Programming](Neural_Programming.ipynb#Soft-Lookup))
- Like a Python `dict`
    - Collection of key/value pairs: $\langle \bar\h_{(\bar \tt)}, \bar\h_{(\bar \tt)} \rangle$
    - Key is equal to value; they are latent states of the Encoder
- But with *soft* lookup
    - The current Decoder state $\h_\tp$ is presented to the CSM 
        - Called the *query*
        - Is matched across each key of the dict (i.e., a latent state $\bar \h_{(\bar \tt)}$)
    - The CSM returns an approximate match of the query to a *key* of the `dict`
        - The distance between the query and each key in the CSM is computed
        - The Soft Lookup returns a *weighted* (by inverse distance) sum of the *values* in the CSM `dict`
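
A minimal sketch of such a soft lookup follows. The actual implementation is deferred to the later module, so the choices here are assumptions: squared Euclidean distance as the distance measure, a softmax of the negative distance to turn distances into weights, and the names `soft_lookup` and `temperature`. The Encoder states and the query are made-up two-dimensional vectors.

```python
import math


def soft_lookup(query, keys, values, temperature=1.0):
    """Return a weighted sum of the values; keys closer to the query get larger weights."""
    # Squared Euclidean distance between the query and each key.
    dists = [sum((q - k) ** 2 for q, k in zip(query, key)) for key in keys]
    # Softmax of the negative distance: an assumed way to convert distances to weights.
    scores = [-d / temperature for d in dists]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


# Three hypothetical Encoder latent states; keys and values are the same vectors.
enc_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [0.9, 0.1]   # hypothetical Decoder state

context, weights = soft_lookup(query, enc_states, enc_states)
best = max(range(len(weights)), key=weights.__getitem__)
print(best)          # index of the Encoder state that receives the most attention

# A small temperature sharpens the weights toward a hard "choose one" lookup.
_, sharp_weights = soft_lookup(query, enc_states, enc_states, temperature=0.01)
```

With a small `temperature` nearly all of the weight lands on the closest key, which recovers the "one Encoder latent state is chosen" picture as a limiting case.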

## Have we seen this before ?

If you recall the architecture of the LSTM
- *short term* (control) memory 
- was separated from *long term* memory
- elements of long term memory are moved to short term memory *as needed*

This is partly similar to the advantages of Attention.

But in an Encoder-Decoder built from LSTMs
- all factual information from input $\x$ still has to flow through the bottleneck $\bar \h_{(\bar T)}$ of the Encoder output



# Visualizing Attention

We can illustrate the behavior of Neural Networks that have been augmented with Attention through diagrams.
- at a particular output position $\tt$
- we can display the amount of "attention"
- that each position in the input receives
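
As a small text-only sketch of such a display, the snippet below renders the attention weights for one output position as bars. The function name `attention_row` is hypothetical and the weights are invented for illustration; they do not come from a trained model.

```python
def attention_row(tokens, weights, width=20):
    """Render one output position's attention weights as text bars, one input token per line."""
    lines = []
    for tok, w in zip(tokens, weights):
        bar = "#" * round(w * width)
        lines.append(f"{tok:>9} {w:4.2f} {bar}")
    return "\n".join(lines)


tokens = ["Professor", "Perry", "taught", "them", "Machine", "Learning"]
weights = [0.02, 0.03, 0.05, 0.05, 0.70, 0.15]   # made-up weights for illustration

chart = attention_row(tokens, weights)
print(chart)
```

Stacking one such row per output position gives the two-dimensional heat maps shown in the examples that follow.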

Attention can be used to create a Context Sensitive Encoding of words
- The meaning of a word may change depending on the rest of the sentence

We can illustrate this with an example: how the meaning of the word "it" changes
- The thickness of the blue line indicates the attention weight that is given in processing the word "it".

<img src=https://1.bp.blogspot.com/-AVGK0ApREtk/WaiAuzddKVI/AAAAAAAAB_A/WPV5ropBU-cxrcMpqJBFHg73K9NX4vywwCLcBGAs/s1600/image2.png>

Many of the recent advances in NLP may be attributed to these improved, context-sensitive embeddings.

We note that simple Word Embeddings
- also capture "meaning"
- but are *not* sensitive to context



**Entailment: Does the "hypothesis" logically follow from the "premise"**

<br>
<center><strong>Attention: Entailment</strong></center>
<table>
    <tr>
        <td><img src="images/Attention_visualization_Entailment.png"></td>
    </tr>
    <tr>
        <td><center>Does the Premise logically entail the Hypothesis.</center></td>
    </tr>
</table>

Attribution: https://arxiv.org/pdf/1509.06664.pdf#page=6

**Date normalization example**
- Source: Dates in free-form: "Saturday 09 May 2018"
- Target: Dates in normalized form: "2018-05-09"

[link](https://github.com/datalogue/keras-attention#example-visualizations)

**Image captioning example**
- Source: Image
- Target: Caption: "A woman is throwing a **frisbee** in a park."
- Attending over *pixels*, **not** over a sequence

<center><strong>Visual attention</strong></center>
<table>
    <tr>
        <td><img src="images/shat_-002-027.jpg"></td>
        <td><img src="images/shat_-002-028.jpg"></td>
    </tr>
    <tr>
        <td colspan=2><center>A woman is throwing a <strong>frisbee</strong> in a park.</center></td>
    </tr>
</table>
Attribution: https://arxiv.org/pdf/1502.03044.pdf


# Self-attention

There are several "flavors" of Attention.

We have motivated the Attention from a Decoder to the output of an Encoder.

The Attention between two different Neural Networks (e.g., Decoder, Encoder) is called *Cross-Attention*

Cross Attention allows the Decoder to not have to expend part of its latent state encoding "knowledge"
- it can obtain the knowledge "just in time" via an Attention query to the output of the Encoder

But the Decoder's latent state, when generating output $\hat\y_\tp$, must still "remember" the partial output $\hat\y_{(1:\tt-1)}$ previously generated.

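For example, suppose the Decoder is generating a long sentence
- in many languages, there needs to be agreement between the gender/plurality of a subject and its verbs
- so the Decoder must remember the subject, generated many steps earlier, when it generates a verb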


We can use a different flavor of Attention to enable the Decoder to attend to its own inputs.

This is called *Self-Attention*
- when generating $\hat\y_\tp$: attend to the most relevant parts of $\hat\y_{(1:\tt-1)}$

The Encoder can also benefit from Self-Attention
- when creating $\bar\h_\tp$, attend to the relevant parts of $\x$, not just $\x_\tp$
    - what is the meaning of the word "it" in the two sentences:
        - "The animal didn't cross the street because **it** was too tired"
            - attend to "animal"
        - "The animal didn't cross the street because **it** was too wide"
            - attend to "street"

We will see an example of Self-Attention in our module on the Transformer.
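
Until then, here is a minimal sketch of the idea, under stated assumptions: dot-product matching followed by a softmax (the real mechanism is covered in a later module), the hypothetical function name `self_attention`, and hand-made two-dimensional vectors standing in for three words. The only point is that queries, keys and values all come from the *same* sequence.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(states):
    """Re-encode every position as a weighted sum over all positions of the same sequence."""
    outputs, all_weights = [], []
    for query in states:                       # each position takes a turn as the query
        scores = [sum(q * k for q, k in zip(query, key)) for key in states]
        weights = softmax(scores)
        mixed = [sum(w * s[i] for w, s in zip(weights, states))
                 for i in range(len(query))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up vectors standing in for the words "animal", "street", "it".
states = [[2.0, 0.0], [0.0, 2.0], [1.5, 0.2]]
outputs, weights = self_attention(states)

it_weights = weights[2]                        # attention paid by "it" to each word
print([round(w, 2) for w in it_weights])
```

In this toy setup the vector for "it" was deliberately placed near the vector for "animal", so "animal" receives more of its attention than "street" does.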

## Masked attention

As presented, the Attention mechanism can refer to an entire sequence
- e.g., the sequence of Encoder latent states

This is very powerful but not appropriate in all cases.
- It is allowed to access elements of the input sequence "out of order" for some tasks
    - for example: the task of creating a context-sensitive representation of each word in a sentence
        - typical use case for an Encoder
- it is *not* allowed to access a future element for some tasks
    - for example: a trading strategy's decisions at time $\tt$ must not be influenced by inputs occurring after time $\tt$


Restricting attention to the past is called *Causal Attention*.
- the next output depends only on things that could have caused it (the past), not the future

There is a mechanism to restrict what may be attended to in a general way
- create a "mask"
- a bit vector for each position of the sequence being attended to
- such that attention is limited to positions where the mask element is True.


This is called *Masked Attention*.
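
One way to sketch the effect of a mask is shown below. The names `masked_softmax` and `causal_mask` are hypothetical, the match scores are invented, and applying the mask inside a softmax is an assumption about how the restriction is enforced.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over the scores, giving zero weight wherever the mask is False.

    Assumes at least one mask element is True.
    """
    kept = [s for s, m in zip(scores, mask) if m]
    top = max(kept)
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def causal_mask(t, length):
    """Mask for 1-based position t: only positions 1 .. t-1 may be attended to."""
    return [pos < t for pos in range(1, length + 1)]


scores = [0.5, 2.0, 1.0, 3.0, 0.1]    # hypothetical match scores for 5 positions
mask = causal_mask(t=4, length=5)     # [True, True, True, False, False]
weights = masked_softmax(scores, mask)

print(mask)
print([round(w, 3) for w in weights])
```

Position 4 has the highest raw score, yet it receives zero weight because the mask forbids attending to it.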

It is frequently used to enable a Decoder, when predicting output $\hat \y_\tp$
- to attend to **previously** generated outputs $\hat \y_{(1:\tt-1)}$
- but not **future** outputs $\hat\y_{(\tt')}$ for $\tt' \ge \tt$

When used in this manner, we refer to the behavior as *Masked Self Attention*


**Aside**

You may wonder how it is even practically possible for a Decoder to refer to the future.

When using *Teacher Forcing* for **training**
- the Decoder does not use the *predicted* target sequence $\hat \y_{(1:T)}$
- the Decoder uses the *actual* target sequence $\y_{(1:T)}$
    - hence, "future" positions $\tt' \ge \tt$ are available
- this prevents a single mis-prediction at position $\tt$ from cascading and ruining all future output
    - facilitates training
- at inference time: the Decoder works on the *predicted* Target sequence.
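
The difference can be sketched with a toy stand-in for the Decoder. The lookup table inside `toy_decoder_step` is entirely hypothetical: it plays the role of a trained model that makes one mistake (it predicts "led" after "He").

```python
START = "<START>"


def toy_decoder_step(prev_token):
    """Hypothetical stand-in for one Decoder step: maps the previous token to the next one."""
    table = {
        START: "He",
        "He": "led",            # the model's one mistake: the target says "taught"
        "taught": "them",
        "them": "Machine",
        "Machine": "Learning",
        "led": "a",
    }
    return table.get(prev_token, "<UNK>")


def run_decoder(target, teacher_forcing):
    outputs, prev = [], START
    for t in range(len(target)):
        y_hat = toy_decoder_step(prev)
        outputs.append(y_hat)
        # Training feeds the actual target token; inference feeds the prediction.
        prev = target[t] if teacher_forcing else y_hat
    return outputs


target = ["He", "taught", "them", "Machine", "Learning"]
forced = run_decoder(target, teacher_forcing=True)
free = run_decoder(target, teacher_forcing=False)

print(forced)  # ['He', 'led', 'them', 'Machine', 'Learning']
print(free)    # ['He', 'led', 'a', '<UNK>', '<UNK>']
```

With Teacher Forcing the single error stays isolated at position 2; without it, the error is fed back in and every later position goes wrong.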

In the diagram below, we illustrate (lower right) how the Decoder input changes between Training and Test/Inference time.

<table>
    <tr>
        <th><center>Sequence to Sequence: training (teacher forcing) + inference: No attention</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_seq2seq.png"></td>
    </tr>
</table>

# Multi-head attention: two heads are better than one

Perhaps when generating the output for position $\tt$ of the output sequence
- we need to attend to *more than one* position of the sequence being attended to
    - need to know both gender and plurality of subject
- that is: we want an Attention layer to output multiple items.

We can attend to $n$ positions
- by creating $n$ separate Attention mechanisms
- each one called a *head*

This behavior is referred to as *Multi-head attention*

This type of behavior is common to many layer types in a Neural Network
- a Dense layer $l$ may produce a vector $\y_\llp$ of length $n_\llp \gt 1$
- a Convolutional layer $l$ may produce outputs (for each spatial location) for many channels

We have referred to this as layer $\ll$ producing $n_\llp$ *features*.

It would be natural for an Attention layer to output many "features" to enable attention to many positions.

In practice, the output of an Attention layer is usually *not* widened in this way
- Model architectures (e.g., the Transformer) are simplified when the inputs/outputs of each sub-component
- have the same length
- often denoted as $d$ or $d_\text{model}$ in the Transformer


When a Transformer needs to attend to $n$ positions
- it uses $n$ Attention heads
- each outputting a vector of length $\frac{d}{n}$
- which are concatenated together to produce a single output of length $d$

When we have $n$ heads
- Rather than having one Attention head operating on vectors of length $d$
    - producing an output of length $d$ (weighted sum of values in the CSM)
- We project each length $d$ vector (keys, values, queries) into $n$ vectors of length $\frac{d}{n}$
- We create $n$ Attention heads, each operating on its own vectors of length $\frac{d}{n}$
    - The output of each of these smaller heads is a weighted sum of values, and hence also of length $\frac{d}{n}$
- The final output concatenates these $n$ outputs into a single output of length $d$
    - identical in length to the output of the single head
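
The bookkeeping of lengths can be sketched as follows. This is a simplification under loud assumptions: slicing each vector into $n$ pieces (`split_heads`) stands in for the learned projections, matching uses a dot product followed by a softmax, and all names and numbers are hypothetical.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single head: dot-product scores, softmax weights, weighted sum of the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]


def split_heads(vec, n):
    """Cut a length-d vector into n slices of length d/n (stand-in for learned projections)."""
    size = len(vec) // n
    return [vec[j * size:(j + 1) * size] for j in range(n)]


def multi_head_attend(query, states, n):
    q_heads = split_heads(query, n)
    s_heads = [split_heads(s, n) for s in states]    # indexed as s_heads[position][head]
    output = []
    for j in range(n):
        keys_j = [s[j] for s in s_heads]             # keys equal values, as in the CSM
        output.extend(attend(q_heads[j], keys_j, keys_j))   # each head adds d/n numbers
    return output                                    # concatenation: length d again


d, n = 4, 2
states = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
query = [1.0, 0.0, 1.0, 0.0]

out = multi_head_attend(query, states, n)
print(len(out))  # 4
```

Because each head sees a different slice of the query and keys, each head computes its own attention weights and can therefore focus on a different position.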


The picture below shows $n$ Attention heads.

Note that each head is working on vectors of length $\frac{d}{n}$ rather than
original dimensions $d$.
- variables with superscript $(j)$ are of fractional length

Details are deferred to the module [Attention lookup](Attention_Lookup.ipynb).


Each head $j$ uniquely transforms the query $\h_\tp$ and the key/value pairs $\bar{\h}_{(1)} \ldots \bar{\h}_{(\bar{T})}$ being queried
- into $\h^{(j)}_\tp$ and the key/value pairs $\bar{\h}^{(j)}_{(1)} \ldots \bar{\h}^{(j)}_{(\bar{T})}$
- such that each head attends to a separate item

<table>
    <tr>
        <th><center>Decoder Multi-head Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Multihead_attention.png" width=80%></td>
    </tr>
</table>

## Transformers

There is a new model (the Transformer) that processes sequences much faster than RNNs.

It is an Encoder/Decoder architecture that uses multiple forms of Attention
- Self Attention in the Encoder
    - to tell the Encoder the relevant parts of the input sequence $\x$ to attend to
- Decoder/Encoder attention
    - to tell the Decoder which Encoder state $\bar{\h}_{(\tt')}$ to attend to when outputting $\hat\y_\tp$
- Masked Self-Attention in the Decoder
    - to prevent the Decoder from looking ahead at outputs that have not yet been generated



# Conclusion

We recognized that the Decoder function responsible for generating Decoder output $\hat{\y}_\tp$
$$
\hat{\y}_\tp = D( \h_\tp; \mathbf{s})
$$

was quite rigid when it ignored argument $\mathbf{s}$.

This rigidity forced Decoder latent state $\h_\tp$ to assume the additional responsibility of including Encoder context.

Attention was presented as a way to obtain Encoder context through argument $\mathbf{s}$.

In [2]:
print("Done")

Done
