In [1]:
%run Latex_macros.ipynb

<IPython.core.display.Latex object>

### Context Sensitive Memory

A Context Sensitive Memory is like a Python dict:
- data is stored via (key, value) pairs
- a "query" matches a key and returns the associated value

The difference from a Python dict
- the query is compared to every key, and a "weight" indicating strength of match is returned
    - match can be approximate
- the value returned is the weighted sum of all values
    - if there is an exact match of one and only one key, this is equivalent to a Python dict.

Context Sensitive Memory
- a collection of key/value pairs, like a Python dict
$$
M  = \{ (k_\tt, v_\tt) \mid 1 \le \tt \le T \}
$$
- lookup: pass in a "query", get a value-like output

As we learned in studying gates: the lookup needs to make soft choices rather than hard choices to
be differentiable.

## Normalized scores

$$
\alpha(q, k) = \frac{ \exp( \text{score}(q, k) ) }{ \sum_{k' \in \text{keys}(M) } { \exp( \text{score}(q, k') ) } }
$$
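
A small worked example may help fix ideas. The raw scores $2, 1, 0$ for three keys are assumed values chosen only for illustration:

$$
\begin{aligned}
\exp(2) &\approx 7.389, \quad \exp(1) \approx 2.718, \quad \exp(0) = 1, \quad \text{sum} \approx 11.107 \\
\alpha &\approx \left( \frac{7.389}{11.107}, \; \frac{2.718}{11.107}, \; \frac{1}{11.107} \right) \approx ( 0.665, \; 0.245, \; 0.090 )
\end{aligned}
$$

The weights are all positive and sum to 1, and the key with the highest score gets the largest weight.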

## Soft lookup

$$
\c = \text{lookup}(q, M) = \sum_{ (k,v) \in M} { \alpha(q, k) * v }
$$
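
The soft lookup can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the names `soft_lookup` and `dot`, the choice of dot product as the score, and the toy memory are all assumptions made for the example.

```python
import math

def dot(a, b):
    """Dot product of two equal-length lists (assumed scoring function)."""
    return sum(x * y for x, y in zip(a, b))

def soft_lookup(q, memory, score=dot):
    """Weighted sum of all values in memory, a list of (key, value) pairs."""
    scores = [score(q, k) for k, _ in memory]
    m = max(scores)                       # subtracting the max avoids overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]    # normalized scores alpha(q, k)
    dim = len(memory[0][1])
    return [sum(a * v[i] for a, (_, v) in zip(alphas, memory))
            for i in range(dim)]

# Hypothetical memory with two (key, value) pairs
memory = [
    ([1.0, 0.0], [10.0, 0.0]),
    ([0.0, 1.0], [0.0, 20.0]),
]

# The query matches both keys equally: weights 0.5 and 0.5, result [5.0, 10.0]
blended = soft_lookup([1.0, 1.0], memory)

# The query matches the first key far more strongly: result is close to [10.0, 0.0]
sharp = soft_lookup([50.0, 0.0], memory)
```

When one key dominates the scores, the output is nearly the value stored under that key, which is the dict-like behavior described earlier.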

## Scoring functions

**Redefine using generic k,v rather than h_t**

There are several choices for the scoring function

$$
\text{score}(\h_\tp, \bar{\h}_{(\tt')}) =
\begin{cases}
\h^T_\tp \cdot \bar{\h}_{(\tt')} & \text{dot product (cosine similarity when both vectors have unit length)} \\
\h^T_\tp \W_\alpha \bar{\h}_{(\tt')} & \text{general} \\
\v^T_\alpha \tanh(\W_\alpha [ \h_\tp; \bar{\h}_{(\tt')}]) & \text{concat}\\
\end{cases}
$$
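
The three scoring functions can be sketched as follows. This is an illustration under stated assumptions: the function names, the vector sizes, and the weight values are made up for the example, and in a real model `W` and `v` would be learned rather than fixed.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in W]

def score_dot(h, h_bar):
    # h^T h_bar
    return dot(h, h_bar)

def score_general(h, h_bar, W):
    # h^T W h_bar
    return dot(h, matvec(W, h_bar))

def score_concat(h, h_bar, W, v):
    # v^T tanh(W [h; h_bar]); h + h_bar concatenates the two lists
    hidden = [math.tanh(z) for z in matvec(W, h + h_bar)]
    return dot(v, hidden)

# Hypothetical inputs and weights
h = [1.0, 2.0]
h_bar = [3.0, 4.0]
W_general = [[1.0, 0.0], [0.0, 1.0]]            # identity: reduces to the dot product
W_concat = [[0.5, 0.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 0.5]]               # maps the 4-dim concat to 2 dims
v = [1.0, -1.0]

s_dot = score_dot(h, h_bar)                      # 1*3 + 2*4 = 11.0
s_general = score_general(h, h_bar, W_general)   # same as s_dot for the identity
s_concat = score_concat(h, h_bar, W_concat, v)   # tanh(0.5) - tanh(2.0)
```

Each function returns a single scalar, so any of them can be plugged into the normalized scores.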

**Note**

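$\v_\alpha$ is a learned weight vector, trained along with $\W_\alpha$; the product with $\v^T_\alpha$ reduces the $\tanh$ output to a scalar score.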

# Attention

Consider a many to many implementation of a Recurrent NN (RNN, LSTM, etc).


<table>
    <tr>
        <th><center>RNN Encoder/Decoder</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_layer_API_Encoder_Decoder.jpg"></td>
    </tr>
</table>


An example might be a network that adds descriptions/captions to a stream of images (video)
- input sequence: a sequence of frames
- output sequence: a sequence of words

or that translates from one language to another
- input sequence: words in source language
- output sequence: words in target language

It is very possible that the next word (time step $\tt$) might refer to a much earlier frame ($\tt' \lt \tt$).

A similar thing happens when translating between languages.

There is not necessarily a correspondence between output $\tt$ and input $\tt$.


So an LSTM needs to decide which part of the past to "attend" (pay attention) to.

We can help it via a mechanism known as "attention", which we sketch below.

<table>
    <tr>
        <th><center>Sequence to Sequence: training (teacher forcing) + inference: No attention</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_seq2seq.png"></td>
    </tr>
</table>

<table>
    <tr>
        <th><center>Decoder: No attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Decoder_no_attention.png"></td>
    </tr>
</table>

<table>
    <tr>
        <th><center>Decoder: Attention</center></th>
    </tr>
    <tr>
        <td><img src="images/Decoder_attention.png"></td>
    </tr>
</table>

<table>
    <tr>
        <th><center>Sequence to Sequence: attention</center></th>
    </tr>
    <tr>
        <td><img src="images/RNN_seq2seq_attention.png" width=1000></td>
    </tr>
</table>

The decoder is able to "select one" of the prior states, rather than just the latest one.

Of course, by now, we understand that this is a "soft" select (case/switch)
- needs to be differentiable
- so it provides a weighted combination of all prior states
    - when the mask of weights is almost one-hot encoded (OHE), the soft select approaches a true "choose one"
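
The following sketch shows how a soft select approaches a hard one as the scores become more spread out. The scores and the scaling factors are assumed values for illustration only; in a trained model the sharpness comes from the learned weights, not from a hand-picked scale.

```python
import math

def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 3.0, 2.0]    # hypothetical scores for three prior states

# Multiplying the scores by a larger factor widens the gaps between them,
# which pushes the weights toward a one-hot mask on the best-scoring state.
masks = {scale: softmax([scale * s for s in scores])
         for scale in (1.0, 5.0, 25.0)}

for scale, weights in masks.items():
    print(scale, [round(w, 3) for w in weights])
```

At every scale the mask is still differentiable, so gradients can flow to all prior states even when one of them receives nearly all the weight.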

How does the LSTM decide which of the past states to attend to?

Same way as all Machine Learning:
- it is controlled by weights
- that are learned by training !

So Deep Learning layers are almost becoming little computers that learn their own programs !

In [3]:
print("Done")

Done
