In [1]:
%run Latex_macros.ipynb

$$
\newcommand{\contextcsm}{\mathcal{c}}
\newcommand{\querycsm}{\mathcal{q}}
$$

# Implementing Attention: motivation
Attention is a mechanism
- Used in sequence to sequence problems
- Which maps a Source sequence to a Target sequence
- Often (but not necessarily) utilizing an Encoder-Decoder architecture

- To cause the Decoder at time step $\tt$
- To "attend to" (focus its attention)
- On a particular prefix of the Source input sequence $\x$

That is
- Each output of the Target sequence
- Is dependent on a "context"
- Which is defined by the Source sequence


<table>
    <tr>
        <th><center>Decoder: Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Decoder_attention.png"></td>
    </tr>
</table>

We will show the basic mechanism for Attention.

[Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) is a key paper on this topic.

Note that current practice
- Most often uses a variant of this mechanism called *Self Attention*
- In a popular and powerful architecture called the *Transformer*

We will provide a simplified explanation
- Using a two-part Encoder-Decoder model
- Without specifically referring to the architecture of either part

# Implementing attention: mechanics


To state the problem of Attention more abstractly
- The source sequence is $\x_{(1)}, \ldots, \x_{(\bar{T})} $.
- The Encoder associates a "context" $\bar{\contextcsm}_{(\bar{\tt})}$ with the prefix of $\x$ ending at $\bar{\tt}$, for $1 \le \bar{\tt} \le \bar{T}$
   
- The Decoder associates a context $\contextcsm_\tp$ with the generation of output $\hat{\y}_\tp$ at time step $\tt$


The problem of Attention
- Is finding the Source context $\bar{\contextcsm}_{(\bar{\tt})}$
- That most closely matches the desired Target context $\contextcsm_\tp$

Getting a little philosophical:

- A "thought" is an amorphous collection of neurons in the brain: "A sunny day at the beach"
- A "sentence" is a sequence of words that describes the thought
- The "sentence" may be different in two distinct languages, but they represent the same thought
- The context is the Neural Network's representation of the thought

So we translate from Source sequence to Target sequence
- By matching the contexts of the Source (Encoder) and Target (Decoder)
    

The Source context $\bar{\contextcsm}_{(\bar{\tt})}$
- Can be generated by a smaller Neural Network that is part of the Encoder

Similarly the Target context $\contextcsm_\tp$
- Can be generated by a smaller Neural Network that is part of the Decoder

To summarize
- The Encoder creates a context for each prefix of the Source input
- The Decoder creates a context for each prefix of the Target output
- At step $\tt$, the Decoder "attends to" the Source context $\bar{\contextcsm}_{(\bar{\tt})}$ that most closely matches the Target context $\contextcsm_\tp$
    - Using this context to generate $\hat{\y}_\tp$
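
The matching step summarized above can be sketched in a few lines of NumPy. This is only an illustration: the dot product as the measure of "closeness", the function name `closest_source_context`, and the toy vectors are all assumptions, not something these notes prescribe. It picks a single best match (a "hard" lookup).

```python
import numpy as np

def closest_source_context(source_contexts, target_context):
    """Return the index of the Source context most similar to the Target context.

    Similarity is measured with a dot product (an assumption made for this sketch).
    """
    scores = source_contexts @ target_context  # one score per Source position
    return int(np.argmax(scores))

# Toy data (hypothetical): three Source contexts of dimension 2
source_contexts = np.array([[ 1.0, 0.0],
                            [ 0.0, 1.0],
                            [-1.0, 0.0]])
target_context = np.array([0.1, 0.9])

best = closest_source_context(source_contexts, target_context)
print(best)
```

Choosing one index with `argmax` is not differentiable, which is one reason to prefer a "soft" match that blends several Source contexts.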

The mechanism we use to match Target and Source contexts is called *Context Sensitive Memory*,
which we introduced in a previous [module](Neural_Programming.ipynb#Soft-Lookup).

# Using Context Sensitive Memory to implement Attention


Remember that our ultimate goal
- Is to generate a context
- That can be passed as the second argument $\mathbf{s}$
- Of the Decoder function responsible for generating Decoder output $\hat{\y}_\tp$
$$
\hat{\y}_\tp = D( \h_\tp; \mathbf{s})
$$

Context Sensitive Memory is exactly what we need to obtain a value for $\mathbf{s}$.

At time step $\tt$, the Decoder: 
- Generates a query $\querycsm_\tp$ containing the Target context
- Matches the query against Context Sensitive Memory $M$
- To obtain a Source context
- That is equated to $\mathbf{s}$
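
In symbols, one possible way to write this matching step is given below. It supposes that $M$ holds key/value pairs $(k_{(\bar{\tt})}, v_{(\bar{\tt})})$ and that similarity is a dot product. The names $k$, $v$, $\alpha$ and the choice of score are assumptions made for illustration; a different scoring function could be used.

$$
\alpha_{(\bar{\tt})} = \frac{\exp\left( \querycsm_\tp \cdot k_{(\bar{\tt})} \right)}{\sum_{\bar{\tt}'=1}^{\bar{T}} \exp\left( \querycsm_\tp \cdot k_{(\bar{\tt}')} \right)},
\qquad
\mathbf{s} = \sum_{\bar{\tt}=1}^{\bar{T}} \alpha_{(\bar{\tt})} \, v_{(\bar{\tt})}
$$

The weights $\alpha_{(\bar{\tt})}$ are positive and sum to 1, so $\mathbf{s}$ is a weighted average of the values stored in $M$.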



We will simplify the presentation
by identifying contexts with latent states (short-term memory)
$$
\begin{array}{lll}
\bar{\contextcsm}_{(\bar{\tt})} & = & \bar{\h}_{(\bar{\tt})} \\
\contextcsm_\tp & = & \h_\tp
\end{array}
$$

So matching Source and Target contexts becomes equivalent to matching Encoder and Decoder latent states.



Define Context Sensitive Memory $M$ to be the pairs
$$
\{ \,(\bar{\h}_{(\bar{\tt})}, \bar{\h}_{(\bar{\tt})} )\;| \;1 \le \bar{\tt} \le \bar{T} \,\}
$$

In other words:
- We make the key equal to the value
- And both are equal to the Source context $\bar{\contextcsm}_{(\bar{\tt})}$

The Decoder then performs a Soft Lookup against Context Sensitive Memory $M$
- Using query $\querycsm_\tp = \h_\tp$
- Returning a "blend" of Encoder latent states
- As required by the "Choose" box
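
A minimal sketch of this Soft Lookup follows, assuming dot-product scoring followed by a softmax. The helper names `softmax` and `soft_lookup`, the random toy states, and the dimensions are hypothetical. Keys and values are both the Encoder latent states, and the query is the Decoder latent state.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def soft_lookup(query, keys, values):
    """Blend the values, weighting each by how well its key matches the query.

    Scoring by dot product is an assumption made for this sketch.
    """
    scores = keys @ query        # one score per memory entry
    weights = softmax(scores)    # non-negative, sums to 1
    return weights @ values, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 4))  # h_bar for 5 Source steps, dimension 4
decoder_state = rng.normal(size=4)        # h at the current Decoder step

# Memory M: key == value == Encoder latent state; query == Decoder latent state
s, weights = soft_lookup(decoder_state, encoder_states, encoder_states)
print(weights)
print(s)
```

Because the weights sum to 1, `s` is a convex combination of Encoder latent states, which is the "blend" described above.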

## Extensions

It is not strictly necessary to equate contexts with latent states
- One can implement a small Neural Network to find the "best" representation for contexts

Nor is it necessary for the keys and values of the Context Sensitive Memory to be identical.
   
The only requirement is that the Encoder and Decoder "speak the same language" and produce values 
of the appropriate type.
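
As a sketch of these extensions, the query, keys, and values can each be produced by a separate linear map. The matrices `W_q`, `W_k`, `W_v` below are hypothetical and randomly initialized; in practice they would be learned, and a deeper network could replace each one. The only constraint illustrated is that queries and keys share a dimension so they can be compared.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d_enc, d_dec, d_key, d_val = 4, 3, 2, 5   # toy dimensions (assumptions)

# Hypothetical projection weights; random here, learned in a real model
W_q = rng.normal(size=(d_dec, d_key))  # Decoder state -> query
W_k = rng.normal(size=(d_enc, d_key))  # Encoder state -> key
W_v = rng.normal(size=(d_enc, d_val))  # Encoder state -> value

encoder_states = rng.normal(size=(6, d_enc))
decoder_state = rng.normal(size=d_dec)

query = decoder_state @ W_q      # shape (d_key,)
keys = encoder_states @ W_k      # shape (6, d_key): same space as the query
values = encoder_states @ W_v    # shape (6, d_val): keys and values now differ

weights = softmax(keys @ query)  # soft match of the query against every key
s = weights @ values             # blended value passed to the Decoder
print(s.shape)
```

Here the Encoder and Decoder states have different sizes, yet the lookup still works because both are mapped into a shared key space.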



# Conclusion

We introduced Context Sensitive Memory as the vehicle with which to implement the Attention mechanism.

Context Sensitive Memory is similar to a Python dict/hash, but it allows "soft" matching of keys.

It is easily built using the basic building blocks of Neural Networks, like Fully Connected layers.

This is another concrete example of Neural Programming.
