In [1]:
%run Latex_macros.ipynb


$$
\newcommand{\contextcsm}{\mathcal{c}}
\newcommand{\querycsm}{\mathcal{q}}
$$

# Implementing attention: High level view

We can state the problem of Attention more abstractly as follows.

Given
- Source sequence $\bar{\contextcsm}_{([1:\bar{T}])}$
    - the sequence being "attended to"
    - a sequence of source "contexts"
- and a Target context $\contextcsm_\tp$ 
    - called the "query"

Output
- the Source context $\bar{\contextcsm}_{(\bar{\tt})}$
- that most closely matches the desired Target context $\contextcsm_\tp$

For example, let's consider Cross Attention in an Encoder-Decoder architecture
- $\bar{\contextcsm}_{([1:\bar{T}])}$ may be the sequence of latent states of an Encoder
- "query" $\contextcsm_\tp = \h_\tp$ is the state of the Decoder when generating output $\hat\y_\tp$ at position $\tt$
- we want to output $\bar{\contextcsm}_{(\bar \tt)}$: one latent state of the Encoder
    - relevant for output position $\tt$
    - as described by  $\contextcsm_\tp = \h_\tp$ 

The mechanism we use to match Target and Source contexts is called *Context Sensitive Memory*.

Summary
- Context Sensitive Memory is similar to a Python `dict`
    - consists of a collection of Key/Value pairs
- One may perform a "lookup"
    - By presenting a "query"
    - Which matches the query against each key
- The result is a "soft" lookup
    - always returns a value, even if there is no exact match between the query and any key
    - the result is a weighted sum of the values in the key/value pairs
    - with weights based on the similarity of the query and the key
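
The "soft" lookup summarized above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `soft_lookup`, the toy `memory`, the use of the dot product as the similarity, and the softmax used to turn similarities into weights are all choices made for this example.

```python
import math

def soft_lookup(query, memory):
    """Return (weighted sum of values, weights) for a soft dictionary lookup.

    `memory` is a list of (key, value) pairs; the query, keys and values are
    lists of floats. Assumptions: similarity is the dot product and the
    weights are a softmax over the similarities.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key, _ in memory]
    # Softmax over the scores (subtract the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the values, one coordinate at a time.
    dim = len(memory[0][1])
    result = [sum(w * value[i] for w, (_, value) in zip(weights, memory))
              for i in range(dim)]
    return result, weights

# Toy memory with two key/value pairs (hypothetical data).
memory = [([1.0, 0.0], [10.0, 0.0]),
          ([0.0, 1.0], [0.0, 10.0])]

# The query matches neither key exactly, but is much closer to the first one.
result, weights = soft_lookup([5.0, 0.0], memory)
```

Unlike a `dict`, the lookup never fails: the result is dominated by the value stored under the most similar key, with a small contribution from the other value.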
    
Let's see how [Context Sensitive Memory](Context_Sensitive_Memory.ipynb) works.




# Cross-Attention lookup: detailed view

In general the keys, values and queries could be generated by arbitrary parts of a larger
Neural Network that uses Attention.

In the case of an Encoder-Decoder architecture
the Attention is between
- queries created by the Decoder
- keys and values created by the Encoder
    - keys and values are identical

We use a Context Sensitive Memory to implement the Attention lookup.

The CSM has $\bar T$ key/value pairs
- the key and value for row $\bar \tt$ of the CSM are both the Encoder state $\bar \h_{(\bar \tt)}$
$$k_{\bar \tt} = v_{\bar \tt} = \bar \h_{(\bar \tt)}$$

The Decoder creates one query for each of the $T$ positions of the Decoder output
- the query for position $\tt$ is Decoder state $\h_\tp$
$$q_\tt = \h_\tp$$

Thus, each position of the Decoder
- attends to all positions of the Encoder
- using Decoder state $\h_\tp$ as the query for output position $\tt$
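
The Cross-Attention lookup just described can be sketched with NumPy. The sizes, the random stand-ins for Encoder and Decoder states, and the softmax normalization of the scores are assumptions made for illustration only.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: each row of the result sums to 1."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T_bar, T, d = 5, 3, 4                      # illustrative sizes (assumed)
enc_states = rng.normal(size=(T_bar, d))   # Encoder states: one per source position
dec_states = rng.normal(size=(T, d))       # Decoder states: one per output position

keys = values = enc_states                 # keys and values are identical
queries = dec_states                       # one query per Decoder position

# Rows index Decoder positions, columns index Encoder positions.
weights = softmax(queries @ keys.T)        # shape (T, T_bar)
attended = weights @ values                # shape (T, d)
```

Each row of `attended` is a weighted sum of Encoder states, one row for each of the $T$ Decoder positions.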

Here is an illustration of the Attention inputs of the Encoder Decoder.
- left row bottom: sequence of latent states of the Encoder
    - used as keys/values:
        - sequence length: $\bar T$ for Cross-Attention; $T$ for Self-Attention
- right row bottom: sequence of latent states of the Decoder
    - used as queries
    - sequence length: $T$
- top row: attention output
    - weighted sum of values
- Attention Weight matrix entry row $r_e$, column $c_d$
    - the weight of query at Decoder position $c_d$ on Encoder position $r_e$  
- Top row
    - position $\tt$: sum over column $\tt$'s (weights * values)
<img src="https://www.tensorflow.org/images/tutorials/transformer/CrossAttention-new-full.png" width=50%>

Here is a picture of the complete RNN Encoder Decoder designed to translate Spanish to English

Both the Encoder and Decoder are RNNs.

- Encoder: left side (bottom to top)
    - bottom row: sequence of token ids of Spanish language input
    - middle row: an unrolled, bidirectional RNN computation
        - computing an encoding (latent representation) for each of the $\bar T$ Spanish tokens
    - top row: sequence of latent representations of Spanish tokens
        - used as keys/values for Attention
- Decoder: similar to Encoder
    - top row: latent representation of generated English token ids
        - used as queries for Attention
    

<table>
    <center><strong>RNN Encoder-Decoder for Spanish to English translation</strong></center>
    <tr>
        <img src="https://www.tensorflow.org/images/tutorials/transformer/RNN%2Battention-words-spa.png" width=30%>
    </tr>
    
Attribution: https://www.tensorflow.org/text/tutorials/nmt_with_attention
</table>

# Self-attention lookup

In Self-Attention, the Decoder attends to its own inputs.

This can be implemented via a Context Sensitive Memory with $T$ key/value pairs
- where the key and value are the same, both equal to the Decoder state at position $\tt$
$$
k_\tt = v_\tt = \h_\tp
$$

The query at output position $\tt$ is *also* the Decoder state at position $\tt$
$$
q_\tt = \h_\tp
$$
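
A minimal NumPy sketch of this Self-Attention lookup follows. The matrix `H` of random numbers stands in for the Decoder states, and the softmax normalization is an assumption of the example.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 4, 3                     # illustrative sizes (assumed)
H = rng.normal(size=(T, d))     # row t stands in for the Decoder state at position t

# Queries, keys and values are all H, so the score matrix is H H^T:
# a symmetric (T, T) matrix of dot products between positions.
scores = H @ H.T

# Row-wise softmax turns the scores into weights that sum to 1 per query.
shifted = scores - scores.max(axis=1, keepdims=True)
weights = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

output = weights @ H            # (T, d): one attended vector per position
```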

# Pre-processing queries, keys and values

Rather than using the raw states of the Encoder (resp., Decoder)
as keys/values (resp., queries) for the Attention Lookup
- we can map them through *matrices*
- whose weights are **learned** during training

This mapping potentially increases the power of a Transformer that uses Attention
- if the mapping adds no benefit, we would learn mapping matrices that were Identity matrices

We will give the details through an example that illustrates the Self Attention lookup behavior of the Transformer.

**Aside**
- we may not yet have covered the Transformer
- just know that the Decoder uses both
    - Masked Causal Self-Attention on its inputs
    - Cross Attention between the Decoder and the Encoder
    

In the Transformer use of Self-Attention
- keys, values and queries
- are identical !

The Transformer has a context for each of the $T$ positions in the input sequence.

We represent this as a matrix $\X$ of dimension $(T \times d)$
- where $d = d_\text{model}$ is the internal dimension of all vectors


That is
- the Transformer has a source context for each position
- which it uses as a query to "look up"  the most similar context

We can potentially increase the power of the Transformer
- by mapping the keys, values and queries
- through learned matrices
- producing alternate representations
    - key $\x \mapsto \x W_K$
    - value $\x \mapsto \x W_V$
    - query $\x \mapsto \x W_Q$
- that may be better adapted to the task described by the training data



Embedding matrices $W_K, W_V, W_Q$ are *learned* through training
- if no better representation exists: we would presumably learn identity matrices
- the embedding matrices can also reduce all vectors to length $d_\text{attn} = \frac{d}{n}$
    - to facilitate multi-head attention with $n$ heads

Multiple lookups can be performed in parallel via matrix multiplication
- when the score measuring the similarity of key $k$ and query $q$ is the dot product

In the case of the Transformer
- where keys, values and queries are identical
- and a lookup is performed for each of the $T$ positions

we perform all $T$ lookups at once, via matrix multiplications involving matrix $\X$

We keep track of the matrix sizes below (assuming, for simplicity, $d_\text{attn} = d$)

First: we map all vectors through the embedding matrices

out  &nbsp;  &nbsp;  &nbsp;  &nbsp; | &nbsp; | left &nbsp; &nbsp;  &nbsp;  &nbsp; | &nbsp;  | right &nbsp;  &nbsp;  &nbsp;|
:--:|:-:|:-:|:-:|:-:
$Q$ | = | $\X$| * |$\W_Q$ |
$K$ | = | $\X$| * |$\W_K$ |
$V$ | = | $\X$| * |$\W_V$ |
$(T \times d)$ | | $(T \times d)$ | | $(d \times d)$

Next: we compare the query $q$ at each position to all of the keys
- producing scores  $\alpha(q, k)$  that are implemented as dot products (a matrix multiplication)

out  &nbsp; &nbsp; &nbsp;    &nbsp;  &nbsp; | &nbsp; | left &nbsp; &nbsp;  &nbsp;  &nbsp; | &nbsp;  | right &nbsp;  &nbsp;  &nbsp;|
:--:|:-:|:--:|:-:|:-:
$\alpha(q, k)$ | = | $Q$ | * |$K^T$ |
$(T \times T)$ | | $(T \times d)$ | | $(d \times T)$

- we ignore the softmax normalization of the weights

Finally: multiply the weights by the values
    
out  &nbsp; &nbsp; &nbsp;  &nbsp;  &nbsp; | &nbsp; | left &nbsp;&nbsp; &nbsp;  &nbsp;  &nbsp; | &nbsp;  | right &nbsp;  &nbsp;  &nbsp;|
:--:|:-:|:--:|:-:|:-:
 | = | $\alpha(q, k)$ | * |$V$ |
$(T \times d)$ | | $(T \times T)$ | | $(T \times d)$  

producing
- a single attention value of length $d$
- for each of the $T$ positions
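
The three steps above can be sketched in NumPy as follows. The function name `self_attention`, the random matrices, and the sizes are assumptions for illustration. Unlike the tables, the sketch does apply the softmax normalization; it omits the $\sqrt{d}$ scaling of the scores used in the standard Transformer.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: each row of the result sums to 1."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Self-Attention over the rows of X, tracking the matrix sizes."""
    Q = X @ W_Q                    # (T, d) = (T, d) * (d, d)
    K = X @ W_K                    # (T, d)
    V = X @ W_V                    # (T, d)
    weights = softmax(Q @ K.T)     # (T, T) = (T, d) * (d, T)
    return weights @ V, weights    # (T, d) = (T, T) * (T, d)

rng = np.random.default_rng(2)
T, d = 6, 8                        # illustrative sizes (assumed)
X = rng.normal(size=(T, d))
# Random stand-ins for the learned matrices.
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = self_attention(X, W_Q, W_K, W_V)

# With identity matrices the lookup reduces to attention on the raw contexts.
I = np.eye(d)
out_raw, _ = self_attention(X, I, I, I)
```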

## Multi-head attention

The picture shows $n$ Attention heads.

Note that each head is working on vectors of length $d_\text{attn} = \frac{d}{n}$ rather than
the original dimension $d$.
- variables with superscript $(j)$ are of fractional length

<table>
    <tr>
        <th><center>Decoder Multi-head Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Multihead_attention.png" width=80%></td>
    </tr>
</table>



How do we create the shorter vectors of length $\frac{d}{n}$?

We use projection matrices of size $(d \times {d \over n})$ **for each head** $j$
- multiplying each key by matrix $\W^{(j)}_\text{key}$
- multiplying each value  by matrix $\W^{(j)}_\text{value}$
- multiplying the original length $d$ query by matrix $\W^{(j)}_\text{query}$



Head $j$ 
- uses query $\h^{(j)} = \h * \W_\text{query}^{(j)}$
- against keys $\bar{\h} * \W_\text{key}^{(j)}$ and values $\bar{\h} * \W_\text{value}^{(j)}$
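
A minimal sketch of the per-head projections in NumPy, using Self-Attention (queries, keys and values all come from the same matrix `X`) to keep the example short. The random projection matrices and sizes are assumptions. Concatenating the head outputs follows the standard Transformer, whose final output projection is omitted here.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: each row of the result sums to 1."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
T, d, n = 5, 8, 2                 # illustrative sizes (assumed); d divisible by n
d_attn = d // n
X = rng.normal(size=(T, d))       # original length-d vectors, one per position

head_outputs = []
for j in range(n):
    # One (d x d/n) projection matrix per head for queries, keys and values
    # (random stand-ins for learned weights).
    W_query = rng.normal(size=(d, d_attn))
    W_key = rng.normal(size=(d, d_attn))
    W_value = rng.normal(size=(d, d_attn))
    Q, K, V = X @ W_query, X @ W_key, X @ W_value   # each (T, d_attn)
    weights = softmax(Q @ K.T)                      # (T, T)
    head_outputs.append(weights @ V)                # (T, d_attn)

# Concatenating the n heads restores vectors of the original length d.
combined = np.concatenate(head_outputs, axis=-1)    # (T, d)
```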



# Advanced material

The remaining sections include code references to models constructed using the Functional API of Keras.

Even if you don't understand the code in detail, the intuition it conveys may be useful.


## Code: RNN Encoder-Decoder

The code for the Spanish to English Encoder Decoder can be found in a [TensorFlow tutorial](https://www.tensorflow.org/text/tutorials/nmt_with_attention)
- requires knowledge of Functional models in Keras
- Multi-head Attention implemented by a Keras layer
    - code not visible directly
    - but there is a link to the source on GitHub
        - a bit complex since it is production code
- Colab notebook you can play with
    - substitute your own Spanish sentences as input
    - make Attention plots

A good web post on implementing MultiHead Attention can be found [here](https://machinelearningmastery.com/how-to-implement-multi-head-attention-from-scratch-in-tensorflow-and-keras/)
- rather than using $(d_\text{model} \times d_\text{attn})$ embedding matrices to project vectors from $d_\text{model}$ to $d_\text{attn}$
- it uses `Dense` layers with $d_\text{attn}$ units to achieve the same
- multi-head attention is achieved by *reshaping* the input
    - from 3D shape $( \text{batch_size} \times T \times d_\text{model} )$
    - to 4D shape $( \text{batch_size} \times T \times  n_\text{head} \times d_\text{attn} )$
        - where $d_\text{model} $ should be equal to $n_\text{head} * d_\text{attn}$
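
The reshaping idea can be checked with a small NumPy sketch. It assumes a single projection of width $n_\text{head} * d_\text{attn}$ with no bias term (the matrix `W` stands in for the weights of such a `Dense` layer); the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
batch_size, T, n_head, d_attn = 2, 5, 4, 3   # illustrative sizes (assumed)
d_model = n_head * d_attn

x = rng.normal(size=(batch_size, T, d_model))
W = rng.normal(size=(d_model, d_model))      # stand-in for bias-free Dense weights

projected = x @ W                            # 3D: (batch_size, T, d_model)
per_head = projected.reshape(batch_size, T, n_head, d_attn)   # 4D

# Head j's slice equals projecting x through its own (d_model x d_attn)
# block of columns of W, so one wide projection plus a reshape is
# equivalent to n_head separate narrow projections.
j = 1
W_j = W[:, j * d_attn:(j + 1) * d_attn]
same = np.allclose(per_head[:, :, j, :], x @ W_j)
```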

Here is a [Keras tutorial](https://keras.io/examples/nlp/neural_machine_translation_with_transformer/)
that uses an Encoder and Decoder that are both Transformers
- Self attention on the Decoder
- Cross attention from the Decoder to the Encoder

Here is the relevant code for the Decoder

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            padding_mask = tf.cast(mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)
        else:
            # Without this branch, padding_mask would be undefined below
            padding_mask = None

        # Masked Self-Attention: queries, keys and values are the Decoder inputs
        attention_output_1 = self.attention_1(
            query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
        )
        out_1 = self.layernorm_1(inputs + attention_output_1)

        # Cross-Attention: queries from the Decoder, keys/values from the Encoder
        attention_output_2 = self.attention_2(
            query=out_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        out_2 = self.layernorm_2(out_1 + attention_output_2)

        proj_output = self.dense_proj(out_2)
        return self.layernorm_3(out_2 + proj_output)

- The Decoder input (partially generated English Translation)
    - Masked Self Attention on the input via the statement
            attention_output_1 = self.attention_1(
                    query=inputs, value=inputs, key=inputs, attention_mask=causal_mask
                )
        - keys = values = queries = inputs
        - **causal masked**: via the option `attention_mask=causal_mask`
    - uses Cross attention via the statement
    
        attention_output_2 = self.attention_2(
                query=out_1,
                value=encoder_outputs,
                key=encoder_outputs,
                attention_mask=padding_mask,
            )
        - query is output of the Self-Attention
            - the query is created by self-attention of Decoder input
        - keys = values = `encoder_outputs` (sequence of Encoder latent states)
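
To make the masking concrete, here is a NumPy sketch of a causal mask and of combining it with a padding mask by taking the element-wise minimum, mirroring the `tf.minimum` call above. It assumes the causal mask is lower triangular (1 where a query position may attend to a key position at or before it). The toy padding pattern and the all-zero scores are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: each row of the result sums to 1."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

T = 4
# causal_mask[i, j] == 1 when query position i may attend to key position j <= i.
causal_mask = np.tril(np.ones((T, T), dtype="int32"))

# Padding mask for one sequence whose last token is padding (toy example).
token_mask = np.array([1, 1, 1, 0], dtype="int32")
padding_mask = token_mask[np.newaxis, :]           # (1, T), broadcasts over rows
combined = np.minimum(padding_mask, causal_mask)   # 1 only where both masks allow

# Applying the causal mask: disallowed scores become -inf, so their weight is 0.
scores = np.zeros((T, T))                          # equal scores, for illustration
masked_scores = np.where(causal_mask == 1, scores, -np.inf)
weights = softmax(masked_scores)
```

With equal scores, position $\tt$ spreads its weight uniformly over positions $1$ through $\tt$ and puts no weight on later positions.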

## Code: Encoder-Decoder Transformer

Here is the Encoder-Decoder for Spanish to English Translation, using Transformers for both the Encoder and Decoder
- Encoder: left-side
    - Bottom row: Encoder Spanish Tokens
    - Top row: Self-Attention to Spanish tokens
- Decoder: right side
    - Bottom row: latent representation of English tokens generated so far
    - Next row: Decoder Masked Self Attention
- Matrix: column $\tt$
    - Attention weight of Decoder output at position $\tt$ on each of the $\bar T$ latent representations of the Encoder's Spanish tokens
    

<table>
    <center><strong>Transformer Encoder-Decoder for Spanish to English translation</strong></center>
    <tr>
        <img src="https://www.tensorflow.org/images/tutorials/transformer/Transformer-1layer-words.png" width=40%>
    </tr>
    
Attribution: https://www.tensorflow.org/images/tutorials/transformer/Transformer-1layer-words.png

</table>

# Conclusion

We introduced Context Sensitive Memory as the vehicle with which to implement the Attention mechanism.

Context Sensitive Memory is similar to a Python dict/hash, but allows "soft" matching.

It is easily built using the basic building blocks of Neural Networks, like Fully Connected layers.

This is another concrete example of Neural Programming.

