In [1]:
%run Latex_macros.ipynb

$$
\newcommand{\contextcsm}{\mathcal{c}}
\newcommand{\querycsm}{\mathcal{q}}
$$

# Attention lookup: Context Sensitive Memory

Our assumption is that
- we have information stored as key/value pairs
     - for example: the Context as processed by the Encoder

Key | Value  
:---|:---|
Subject | Professor Perry
Pronoun | He
Object  | Machine Learning
Indirect Object | Them
Verb | Taught

- we want to retrieve a value by specifying a *query*
    - the value whose key is closest to the query
    - for example: 
    
            Lookup( Subject ) = Professor Perry
            

Moreover
- we want to use this mechanism in Neural Networks
- the Lookup operation must be **differentiable**

A "hard lookup"
- exact match of query and key
- fails to be differentiable

We need to replace hard lookups with soft lookups.
- just as we replace a strict binary switch (e.g., `if` statement)
- by a soft approximation: sigmoid
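
To make the contrast concrete, here is a minimal sketch (not the Context Sensitive Memory notebook's implementation). The toy keys and values, the helper names `hard_lookup` and `soft_lookup`, and the choice of a softmax over dot products are all illustrative assumptions.

```python
import numpy as np

# Hypothetical toy memory: three one-hot keys, each paired with a value vector
keys = np.eye(3)
values = np.array([[10.0, 0.0],
                   [0.0, 20.0],
                   [5.0, 5.0]])

def hard_lookup(query):
    """Exact match: return the value whose key equals the query, else None."""
    for k, v in zip(keys, values):
        if np.array_equal(k, query):
            return v
    return None

def soft_lookup(query):
    """Soft match: softmax over query/key dot products, then a weighted sum of values."""
    scores = keys @ query
    weights = np.exp(scores - scores.max())   # subtract the max for numerical stability
    weights = weights / weights.sum()
    return weights @ values, weights

query = np.array([0.9, 0.1, 0.0])     # near the first key, but not an exact match
print(hard_lookup(query))             # None: no key matches exactly
blended, weights = soft_lookup(query)
print(weights, blended)               # the largest weight falls on the first key
```

The hard lookup returns nothing for a near miss, and its output does not change smoothly with the query. The soft lookup always returns a blend of values, and every step in it is differentiable.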
  
Let's see how [Context Sensitive Memory](Context_Sensitive_Memory.ipynb) works.

# Implementing attention: High level view

We can state the problem of Attention more abstractly as follows.

Given
- Source sequence $\bar{\contextcsm}_{([1:\bar{T}])}$
    - the sequence being "attended to"
    - a sequence of source "contexts"
- and a Target context $\contextcsm_\tp$ 
    - called the "query"

Output
- the Source context $\bar{\contextcsm}_{(\bar{\tt})}$
- that most closely matches the desired Target context $\contextcsm_\tp$

For example, let's consider Cross Attention in an Encoder-Decoder architecture
- $\bar{\contextcsm}_{([1:\bar{T}])}$ may be the sequence of latent states of an Encoder
- "query" $\contextcsm_\tp = \h_\tp$ is the state of the Decoder when generating output $\hat\y_\tp$ at position $\tt$
- we want to output $\bar{\contextcsm}_{(\bar \tt)}$: one latent state of the Encoder
    - relevant for output position $\tt$
    - as described by  $\contextcsm_\tp = \h_\tp$ 
    
<table>
    <tr>
        <th><center>Decoder output transformation with attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Decoder_attention.png" width=60%></td>
    </tr>
</table>

# Cross-Attention lookup: detailed view

In general the keys, values and queries could be generated by arbitrary parts of a larger
Neural Network that uses Attention.

In the case of an Encoder-Decoder architecture
the Attention is between
- queries created by the Decoder
- keys and values created by the Encoder
    - keys and values are identical

We use a Context Sensitive Memory to implement the Attention lookup.

The CSM has $\bar T$ key/value pairs
- one for each output position of the Encoder: $\bar \h_{(1 \ldots \bar T)}$
- the key and value for position $\bar \tt$ of the CSM are both the Encoder state $\bar \h_{(\bar \tt)}$
$$k_{\bar \tt} = v_{\bar \tt} = \bar \h_{(\bar \tt)}$$

The Decoder creates one query for each of the $T$ positions of the Decoder output
- the query for position $\tt$ is Decoder state $\h_\tp$
$$q_\tt = \h_\tp$$

Thus, each position of the Decoder
- attends to all positions of the Encoder
- using Decoder state $\h_\tp$ as the query for output position $\tt$
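
The per-position lookup can be sketched as an explicit loop. This is only an illustration: random arrays stand in for the Encoder states (`h_bar`) and Decoder states (`h`), the sizes are made up, and the helper `attend` with its softmax scoring is an assumption rather than the notebook's code.

```python
import numpy as np

rng = np.random.default_rng(0)
T_bar, T, d = 5, 3, 4                     # hypothetical sequence lengths and state size

h_bar = rng.normal(size=(T_bar, d))       # stand-in for Encoder states (keys = values)
h = rng.normal(size=(T, d))               # stand-in for Decoder states (queries)

def attend(query, keys, values):
    """Soft lookup of one query against all keys; returns (context, weights)."""
    scores = keys @ query                 # one score per Encoder position
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()     # softmax: the weights sum to 1
    return weights @ values, weights

contexts, all_weights = [], []
for t in range(T):                        # one lookup per Decoder position
    c, w = attend(h[t], h_bar, h_bar)     # key and value are the same Encoder state
    contexts.append(c)
    all_weights.append(w)

contexts = np.stack(contexts)             # shape (T, d)
all_weights = np.stack(all_weights)       # shape (T, T_bar): row t holds position t's weights
print(contexts.shape, all_weights.shape)
```

Each row of `all_weights` spreads one Decoder position's attention over all $\bar T$ Encoder positions.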

Here is an illustration of the Attention inputs of the Encoder Decoder.
- left row bottom: sequence of latent states of the Encoder
    - used as keys/values:
        - sequence length: $\bar T$ for Cross-Attention; $T$ for Self-Attention
- right row bottom: sequence of latent states of the Decoder
    - used as queries
    - sequence length: $T$
- top row: attention output
    - weighted sum of values
- Attention Weight matrix entry row $r_e$, column $c_d$
    - the weight of query at Decoder position $c_d$ on Encoder position $r_e$  
- Top row
    - position $\tt$: sum over column $\tt$'s (weights * values)
<img src="https://www.tensorflow.org/images/tutorials/transformer/CrossAttention-new-full.png" width=50%>

Here is a picture of the complete RNN Encoder Decoder designed to translate Spanish to English

Both the Encoder and Decoder are RNNs.

- Encoder: left side (bottom to top)
    - bottom row: sequence of token ids of Spanish language input
    - middle row: an unrolled, bidirectional RNN computation
        - computing an encoding (latent representation) for each of the $\bar T$ Spanish tokens
    - top row: sequence of latent representations of Spanish tokens
        - used as keys/values for Attention
- Decoder: similar to Encoder
    - top row: latent representation of generated English token ids
        - used as queries for Attention
    

<table>
    <center><strong>RNN Encoder-Decoder for Spanish to English translation</strong></center>
    <tr>
        <img src="https://www.tensorflow.org/images/tutorials/transformer/RNN%2Battention-words-spa.png" width=30%>
    </tr>
    
Attribution: https://www.tensorflow.org/text/tutorials/nmt_with_attention
</table>

# Attention Lookup: general case

We assume that 
- the Source context (the sequence being attended to) is length $\bar T$
    - e.g., Encoder states $\bar\h_\tp$ in an Encoder/Decoder
- the Target context is length $T$
    - e.g., Decoder states $\h_\tp$ in an Encoder/Decoder

Each element of the sequences ($\h, \bar\h$) is a vector of length $d$
$$
\begin{array}{lcll}
| \bar \h_{(\bar \tt)} |  & = & d &  1 \le \bar \tt \le \bar T \\
| \h_\tp | & = & d & 1 \le \tt \le T \\
\end{array}
$$

This describes Cross-Attention as would be implemented from the Decoder to the Encoder
in an Encoder-Decoder architecture.

For the special case of Self-Attention: 
- $\bar T = T$
- $\bar\h_\tp = \h_\tp$

This is the case, for example, where a Decoder attends to itself.

## Queries

Each of the $T$ Target positions is a query

$$
q_\tp = \h_\tp
$$

So the matrix $Q$ of all queries is shape $(T \times d)$

## Keys/Values

Each of the $\bar T$ Source positions is both a key and a value
$$
k_{\bar \tt} = v_{\bar \tt} = \bar\h_{(\bar \tt)}
$$

The matrix of all keys $K$, and the matrix of all values $V$ are shape $(\bar T \times d)$

## Projecting queries, keys and values

Rather than using the raw states of the Source and Target
as queries (resp., keys/values)
- we can map them through projection/embedding *matrices* $\W_Q, \W_K, \W_V$
    - each mapping matrix shape is $(d \times d)$
    - thus, the mapping preserves the shapes of $Q, K, V$
    
Similarly, we map the Attention output through matrix $\W_O$.

Projection matrices $\W_K, \W_V, \W_Q, \W_O$ are *learned* through training.



What is the purpose of these extra *linear* projections ?

The first answer: it can't hurt !
- if there is no gain for projecting: the optimizer would presumably learn an Identity matrix projection

But we can hypothesize a possible purpose.

Consider a task: Translating from Spanish (Encoder input) to English (Decoder input)
- The Decoder's query is specialized to English
- The Encoder's keys and values are specialized to Spanish
- Before matching query to keys
    - the English (Decoder generated) query
    - and the Spanish (Encoder generated) keys
    - need to be projected to a common (language-independent) representation


Mapping through these matrices:

out  &nbsp;  &nbsp;  &nbsp;  &nbsp; | &nbsp; | left &nbsp; &nbsp;  &nbsp;  &nbsp; | &nbsp;  | right &nbsp;  &nbsp;  &nbsp;|
:--:|:-:|:-:|:-:|:-:
$Q$ | = | $Q$| * |$\W_Q$ |
$(T \times d)$ | | $(T \times d)$ | | $(d \times d)$
&nbsp;
$K$ | = | $K$| * |$\W_K$ |
$V$ | = | $V$| * |$\W_V$ |
$(\bar T \times d)$ | | $(\bar T \times d)$ | | $(d \times d)$
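
The shape bookkeeping in the table can be checked with a small sketch. The sizes are hypothetical and the matrices are random stand-ins; in a real model $\W_Q, \W_K, \W_V$ would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
T_bar, T, d = 5, 3, 4                      # hypothetical sizes

Q = rng.normal(size=(T, d))                # raw queries: Target states
K = rng.normal(size=(T_bar, d))            # raw keys: Source states
V = K.copy()                               # raw values: the same Source states

# Random stand-ins for the learned (d x d) projection matrices
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

Q_proj = Q @ W_Q                           # (T, d) @ (d, d) -> (T, d)
K_proj = K @ W_K                           # (T_bar, d) @ (d, d) -> (T_bar, d)
V_proj = V @ W_V                           # (T_bar, d) @ (d, d) -> (T_bar, d)
print(Q_proj.shape, K_proj.shape, V_proj.shape)

# An Identity projection leaves the raw states unchanged
unchanged = Q @ np.eye(d)
```

Although keys and values start out identical, separate matrices $\W_K$ and $\W_V$ let the projected keys and values differ.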

## Performing the lookup




Next: comparing the query $q$ at each Target position, to each of the keys at the $\bar T$ Source positions
- producing scores  $\alpha(q, k)$  that are implemented as dot product (matrix multiplication)

out  &nbsp; &nbsp; &nbsp;    &nbsp;  &nbsp; | &nbsp; | left &nbsp; &nbsp;  &nbsp;  &nbsp; | &nbsp;  | right &nbsp;  &nbsp;  &nbsp;|
:--:|:-:|:--:|:-:|:-:
$\alpha(q, k)$ | = | $Q$ | * |$K^T$ |
$(T \times \bar T)$ | | $(T \times d)$ | | $(d \times \bar T)$

- we ignore the softmax normalization of the weights
- we will treat the scores as weights for simplicity of presentation



Finally: take the weighted sum of the values
    
out  &nbsp; &nbsp; &nbsp;  &nbsp;  &nbsp; | &nbsp; | left  &nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;  &nbsp;  &nbsp; | &nbsp;  | right &nbsp;  &nbsp;  &nbsp;|
:--:|:-:|:--:|:-:|:-:
 | = | $\alpha(q, k)$ | * |$V$ |
 |  = | $Q * K^T$ | * | $V$ |
$(T \times d)$ | | $(T \times \bar T)$ | | $(\bar T \times d)$  

producing
- a single attention value of length $d$
- for each of the $T$ positions

## Conclusion

Using matrix operations, we are performing *all* $T$ queries simultaneously.

The end result is a matrix of shape $(T \times d)$: one vector of length $d$ for each of the $T$ Target positions
- each vector is the value being attended to at that Target position
- this value is a weighted sum of the $\bar T$ Source states

$$
\text{Attention}(Q,K,V) = \text{softmax} \left(
\frac{ Q * K^ T }{ \sqrt{d} } \right) V
$$

# Multi-head attention

With a small change, we can have each Target position attend to $n_\text{head} \ge 1$ Source positions.
- perhaps each of the $n_\text{head}$ source positions represents a different aspect of the Source sequence
    - e.g. nouns with gender and singular/plural form
- all of which are relevant to the Target output at a position

This is called *Multi-head Attention*
- $n_\text{head}$ attention "heads"

The idea is to take each query (of length $d$) and break it into $n_\text{head}$ pieces of size
$$d_\text{attn} = \frac{d}{n_\text{head}}$$

Since the length of query and key must match, we do the same for each key.

We then perform regular attention lookup $n_\text{head}$ times (in parallel) using the shorter queries and keys.

## Size of the value

Note that we have not mentioned changing the size of the values that are associated with the keys.

After the $n_\text{head}$ lookups, we have $n_\text{head}$ vectors of length $d$.

Yet all of our model layers (including Attention) must produce output vectors of length $d$.



The most common way of doing this is to break up the values into $n_\text{head}$ pieces of size $d_\text{attn}$
- same as for key and query

We can then concatenate the $n_\text{head}$ lookup results of size $d_\text{attn}$ into a single vector of length $d$.
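
The split-and-concatenate idea can be sketched as follows. Here plain column slicing stands in for the way pieces are created (the notebook creates them with per-head projection matrices, described further on), and all sizes and helper names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; returns only the weighted sum of values."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
T, T_bar, d, n_head = 3, 5, 8, 2              # hypothetical sizes
d_attn = d // n_head                          # length of each piece

Q = rng.normal(size=(T, d))
K = rng.normal(size=(T_bar, d))
V = rng.normal(size=(T_bar, d))

# Break queries, keys and values into n_head pieces of length d_attn
Q_heads = np.split(Q, n_head, axis=1)         # n_head arrays of shape (T, d_attn)
K_heads = np.split(K, n_head, axis=1)         # n_head arrays of shape (T_bar, d_attn)
V_heads = np.split(V, n_head, axis=1)

# One regular attention lookup per head, on the shorter vectors
head_outputs = [attention(q, k, v) for q, k, v in zip(Q_heads, K_heads, V_heads)]

# Concatenate the n_head results of length d_attn back into length d
out = np.concatenate(head_outputs, axis=1)    # (T, d)
print(out.shape)
```

Each head computes its own attention weights, so different heads can focus on different Source positions for the same Target position.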

Hopefully a picture will help.

Note that each head is working on vectors of length $d_\text{attn} = \frac{d}{n_\text{head}}$ rather than
original dimensions $d$.
- variables with superscript $(j)$ are of fractional length

<table>
    <tr>
        <th><center>Decoder Multi-head Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Multihead_attention.png" width=80%></td>
    </tr>
</table>

A less common way of maintaining output vectors of length $d$
- maintain the value vectors at original length $d$
- *pool* (e.g., add) the $n_\text{head}$ vectors into a single vector of length $d$


How do we create the shorter length $d_\text{attn}$ vectors (pieces of queries, keys, values) ?
- by changing the projection matrices $\W_Q, \W_K, \W_V$ to shape $(d \times d_\text{attn})$
    - one for each head
    - $\W^{(j)}_Q, \W^{(j)}_K, \W^{(j)}_V$ are the projection matrices for head $j$



## Projecting the lookup result

In the [original Transformer paper (Attention Is All You Need), Figure 2](https://arxiv.org/pdf/1706.03762.pdf#page=4)
- the attention lookup output
- is projected through matrix $\W_O$ of shape $(d \times d)$

The argument is similar to why we project queries, keys, and values via $\W_Q, \W_K, \W_V$
- the *learned* projection potentially increases the power
- if not, $\W_O$ could be learned to be the Identity matrix.

This projection of output also enables greater flexibility in breaking up the value part of the key/value pairs
- We can choose any length
- Let the Output projection matrix reduce the size of the concatenated head outputs
- to size $d$ as required

## Multi-head summary

The paper summarizes Multi-Head Attention as

$$
\text{MultiHead}(Q, K,V) =  \text{Concat}(\text{head}_1, \ldots, \text{head}_{n_\text{head}}) \; \W_O
$$
where
$$
\text{head}_j= \text{Attention}( Q * \W_Q^{(j)}, K * \W_K^{(j)}, V * \W_V^{(j)})
$$
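
A direct transcription of these two formulas, as a sketch: the per-head matrices are kept in Python lists, the weights are random rather than learned, and there is no batching or masking. The function name `multi_head` and all sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V: lists with one (d x d_attn) matrix per head; W_O: (n_head * d_attn, d)."""
    heads = [attention(Q @ wq, K @ wk, V @ wv)       # head_j, shape (T, d_attn)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O      # Concat(head_1, ...) W_O, shape (T, d)

rng = np.random.default_rng(0)
T, T_bar, d, n_head = 3, 5, 8, 2                     # hypothetical sizes
d_attn = d // n_head

Q = rng.normal(size=(T, d))
K = rng.normal(size=(T_bar, d))
V = rng.normal(size=(T_bar, d))

# Random stand-ins for the learned projection matrices
W_Q = [rng.normal(size=(d, d_attn)) for _ in range(n_head)]
W_K = [rng.normal(size=(d, d_attn)) for _ in range(n_head)]
W_V = [rng.normal(size=(d, d_attn)) for _ in range(n_head)]
W_O = rng.normal(size=(n_head * d_attn, d))

out = multi_head(Q, K, V, W_Q, W_K, W_V, W_O)
print(out.shape)
```

With a single head and Identity projections, `multi_head` reduces to ordinary attention.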

# Count the weights !

The weights/parameters are in the matrices $\W_Q, \W_K, \W_V$ and $\W_O$
- all of size $\OrderOf{d^2}$, total:
$$
4 * \OrderOf{d^2}
$$
- multiplied by the number of stacked Transformer blocks $n_\text{layer}$, total:
$$
4 * n_\text{layer} * \OrderOf{d^2}
$$

For GPT-3
- $n_\text{layer} = 96$
- $d_\text{model} = 12* 1024$

Total attention weights
$$
4 * 96 * (12*1024)^2 \approx 58 \text{ billion}
$$
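
The arithmetic can be checked with a few lines of Python. This is a rough count under this section's assumptions: four $(d \times d)$ matrices per block, with biases and all non-attention weights ignored.

```python
n_layer = 96                  # stacked Transformer blocks (GPT-3 figure quoted above)
d_model = 12 * 1024           # 12288

per_matrix = d_model ** 2     # weights in one (d x d) projection matrix
per_layer = 4 * per_matrix    # W_Q, W_K, W_V and W_O
total = n_layer * per_layer

print(f"{per_matrix:,} weights per matrix")
print(f"{total:,} attention weights in total")
print(f"about {total / 1e9:.0f} billion")
```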

# Advanced material

The remaining sections include code references to models constructed using the Functional API of Keras.

Even if you don't understand the code in detail, the intuition it conveys may be useful.


## Code: RNN Encoder-Decoder

The code for the Spanish to English Encoder Decoder can be found in a [TensorFlow tutorial](https://www.tensorflow.org/text/tutorials/nmt_with_attention)
- requires knowledge of Functional models in Keras
- Multi-head Attention implemented by a Keras layer
    - code not visible directly
    - but there is a link to the source on GitHub
        - a bit complex since it is production code
- Colab notebook you can play with
    - substitute your own Spanish sentences as input
    - make Attention plots

A good web post on implementing MultiHead Attention can be found [here](https://machinelearningmastery.com/how-to-implement-multi-head-attention-from-scratch-in-tensorflow-and-keras/)
- rather than using $(d_\text{model} \times d_\text{attn})$ embedding matrices to project vectors from $d_\text{model}$ to $d_\text{attn}$
- it uses `Dense` layers with $d_\text{attn}$ units to achieve the same
- multi-head attention is achieved by *reshaping* the input
    - from 3D shape $( \text{batch_size} \times T \times d_\text{model} )$
    - to 4D shape $( \text{batch_size} \times T \times  n_\text{head} \times d_\text{attn} )$
        - where $d_\text{model} $ should be equal to $n_\text{head} * d_\text{attn}$
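
The reshaping step can be illustrated with NumPy in place of TensorFlow. The sizes are made up, and the final transpose is a common convention that is assumed here, not quoted from the post.

```python
import numpy as np

batch_size, T, n_head, d_attn = 2, 3, 4, 5       # hypothetical sizes
d_model = n_head * d_attn                        # must factor exactly

# A 3D input: (batch_size, T, d_model)
x = np.arange(batch_size * T * d_model, dtype=float).reshape(batch_size, T, d_model)

# Reshape to 4D: (batch_size, T, n_head, d_attn)
x_heads = x.reshape(batch_size, T, n_head, d_attn)

# Head j at position t is the j-th contiguous slice of length d_attn
j, t = 2, 1
same_slice = np.array_equal(x_heads[0, t, j], x[0, t, j * d_attn:(j + 1) * d_attn])
print(same_slice)

# Often the head axis is then moved ahead of the position axis,
# so that each head sees its own (T, d_attn) sequence
x_by_head = x_heads.transpose(0, 2, 1, 3)        # (batch_size, n_head, T, d_attn)
print(x_by_head.shape)
```

No data is copied or mixed by the reshape: it only reinterprets each length-$d_\text{model}$ vector as $n_\text{head}$ consecutive pieces.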

Here is a [Keras tutorial](https://keras.io/examples/nlp/neural_machine_translation_with_transformer/)
that uses an Encoder and Decoder that are both Transformers
- Self attention on the Decoder
- Cross attention from the Decoder to the Encoder

Here is the relevant code for the Decoder

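    # Excerpt from the Decoder layer class; assumes `import tensorflow as tf`
    def call(self, inputs, encoder_outputs, mask=None):
        # Causal mask: position t may attend only to positions <= t
        causal_mask = self.get_causal_attention_mask(inputs)
        # Default, so that `padding_mask` is defined even when no mask is passed
        padding_mask = None
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)

        # Masked Self-Attention over the Decoder input
        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
        )
        out_1 = self.layernorm_1(inputs + attention_output_1)

        # Cross-Attention: queries from the Decoder, keys/values from the Encoder
        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        out_2 = self.layernorm_2(out_1 + attention_output_2)

        # Feed-forward projection with a residual connection
        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output)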

- The Decoder input is the partially generated English translation
    - Masked Self-Attention is applied to the input via the statement

            attention_output_1 = self.attention_1(
                query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
            )

        - keys = values = queries = `inputs`
        - **causally masked**: via the option `attention_mask=causal_mask`
    - Cross-Attention is applied via the statement

            attention_output_2 = self.attention_2(
                query=out_1,
                value=encoder_outputs,
                key=encoder_outputs,
                attention_mask=padding_mask,
            )

        - the query is `out_1`: the output of the Masked Self-Attention on the Decoder input
        - keys = values = `encoder_outputs` (sequence of Encoder latent states)

## Code: Encoder-Decoder Transformer

Here is the Encoder-Decoder for Spanish to English Translation, using Transformers for both the Encoder and Decoder
- Encoder: left-side
    - Bottom row: Encoder Spanish Tokens
    - Top row: Self-Attention to Spanish tokens
- Decoder: right side
    - Bottom row: latent representation of English tokens generated so far
    - Next row: Decoder Masked Self Attention
- Matrix: column $\tt$
    - Attention weight of Decoder output at position $\tt$ on each of the $\bar T$ latent representations of the Encoder's Spanish tokens
    

<table>
    <center><strong>Transformer Encoder-Decoder for Spanish to English translation</strong></center>
    <tr>
        <img src="https://www.tensorflow.org/images/tutorials/transformer/Transformer-1layer-words.png" width=40%>
    </tr>
    
Attribution: https://www.tensorflow.org/images/tutorials/transformer/Transformer-1layer-words.png

</table>

The above diagram illustrates the difference between
- Self Attention in the Encoder (left side)
- **Masked** Self Attention in the Decoder (right side)

Take a look at the range of positions accessible by the first output position
- in the Encoder: **all** Input positions
- in the Decoder: **only** the first input

Similarly for position $\tt$
- in the Encoder: **all** Input positions
- in the Decoder: **only** the prefix of length $\tt$
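
The effect of the mask can be sketched with a lower-triangular matrix. This is a NumPy stand-in, not the tutorial's `get_causal_attention_mask`. Positions are 0-indexed here, and the scores are all set to zero so that the weights show only the effect of masking.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)                      # exp(-inf) = 0, so masked positions get zero weight
    return e / e.sum(axis=axis, keepdims=True)

T = 4                                  # hypothetical sequence length
scores = np.zeros((T, T))              # placeholder scores: every position equally relevant

# Encoder-style Self-Attention: no mask, every position sees all positions
encoder_weights = softmax(scores)

# Decoder-style Masked Self-Attention: row t may see only columns 0..t
causal_mask = np.tril(np.ones((T, T), dtype=int))
masked_scores = np.where(causal_mask == 1, scores, -np.inf)
decoder_weights = softmax(masked_scores)

print(encoder_weights)
print(decoder_weights)
```

In `decoder_weights`, the first row puts all of its weight on the first position, and each later row spreads its weight over a prefix that is one position longer.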


# Conclusion

We introduced Context Sensitive Memory as the vehicle with which to implement the Attention mechanism.

Context Sensitive Memory is similar to a Python dict/hash, but allowing "soft" matching.

It is easily built using the basic building blocks of Neural Networks, like Fully Connected layers.

This is another concrete example of Neural Programming.

We *will not* spend time on the actual code of an Attention layer.

If you're interested there are several web articles
that do so, for example, [here](https://machinelearningmastery.com/how-to-implement-multi-head-attention-from-scratch-in-tensorflow-and-keras/)


