# Attention

Should all features of an example input to a neural network be treated equally? Or should we pay attention to some features more than others?

Examples:
- In language translation, it feels natural to pay attention to different parts of the input as we translate
- If you're looking to answer a question about a product from its images, it makes sense to pay attention to those that relate to the question

In all of the above examples, we have:
- A query: a representation of what we want to look for in the input values
- A set of values: a representation of each part of the input
- How much attention we should pay to each value

## TODO generic attention mechanism

Typically, the query and values are vector representations.

E.g.:
- Seq2seq translation:
    - Values: Encoder hidden states
    - Query: Current decoder hidden state
- Product image question answering:
    - Values: Embeddings of each image
    - Query: Question being asked

## TODO generic attention examples

> Attention is a way to combine an arbitrary set of representations (the values) into a single representation, in a way that pays more attention to some of them than others, based on some other representation (the query)

We've only got a limited amount of attention.
But we can pay a different percentage of it to each part of the input (each word, each pixel, each image).
The most attention we could pay to a word is 100%, and the least is 0%.
Or 1.0 and 0.0 as proportions.

So we could give each word a number between 0 and 1 which represents the proportion of our attention we give that word, with the proportions across all words adding up to 1.

The query determines how important each value is in the resulting summary.

> Attention is another neural network building block (like all PyTorch modules)

## Variations of attention

Before we get into the specific details, you should understand that 
there are many forms of attention, but they always include:
1. Align: Compute the attention scores (alignment), $e \in \mathbb{R}^N$, between the query, $Q$, and the $N$ values, $V$
1. Weight: Turn the scores into an attention distribution, $\alpha \in \mathbb{R}^N$
1. Combine: Use the attention distribution to combine the values

Changing what happens at each of these steps changes the type of attention you're implementing

> We call the combined values _the context_
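
The three steps can be sketched as one small function. This is a minimal sketch rather than a definitive implementation: the names `attend` and `align_fn` are hypothetical, the tensor sizes are arbitrary, and a dot product is used here only as one possible scoring function.

```python
import torch

def attend(query, values, align_fn):
    """Generic attention: align, weight, combine.

    query: (d,), values: (N, d).
    align_fn: hypothetical callable returning one score per value, shape (N,).
    """
    scores = align_fn(query, values)        # Align: e, shape (N,)
    weights = torch.softmax(scores, dim=0)  # Weight: alpha, shape (N,), sums to 1
    context = weights @ values              # Combine: weighted sum, shape (d,)
    return context, weights

def dot_product_align(query, values):
    # One possible alignment function: dot product of the query with each value
    return values @ query

torch.manual_seed(0)
query = torch.randn(4)      # stand-in query (random, for illustration only)
values = torch.randn(6, 4)  # stand-in values: N = 6 vectors of size 4
context, weights = attend(query, values, dot_product_align)
print(weights.sum())   # the attention distribution sums to 1 (up to rounding)
print(context.shape)   # one vector, the same size as a single value
```

Swapping `dot_product_align` for a different callable changes the type of attention without touching the weight and combine steps.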

## So how does this work mathematically?

Attention assumes that you've already got the following inputs:
- The query, which represents what you're looking for
- The values, which represent what information you have available

And expects the following output:
- A single representation of the queried values

The simplest form of attention is called dot-product attention

### Dot product attention

1. Align: Compute the dot product between the query and each value to compute an alignment score for each query-value pair
1. Weight: Take the softmax of the alignment scores to compute the attention weights
1. Combine: Add up the values, weighted by their attention weights

> Notice that, in its most basic form (dot product), an attention block has no learnable parameters
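
Written out for a single query vector $q$ and values $v_1, \dots, v_N$ (this notation is an assumption, chosen to match $e$ and $\alpha$ above, with $c$ standing for the context), the three steps are:

$$e_i = q \cdot v_i, \qquad \alpha_i = \frac{\exp(e_i)}{\sum_{j=1}^{N} \exp(e_j)}, \qquad c = \sum_{i=1}^{N} \alpha_i v_i$$

As a small worked example, take $q = (1, 0)$, $v_1 = (1, 0)$ and $v_2 = (0, 1)$. Then $e = (1, 0)$, so $\alpha = \left(\frac{e^1}{e^1 + e^0}, \frac{e^0}{e^1 + e^0}\right) \approx (0.731, 0.269)$ and $c = 0.731 \, v_1 + 0.269 \, v_2 \approx (0.731, 0.269)$. The value that lines up with the query gets most of the attention, but not all of it.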


In [None]:
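import torch
import torch.nn.functional as F

# Align: alignment scores (logits), one per value
logits = torch.tensor([12, -1, 0.4, 5, 2])
# Weight: softmax turns the scores into a distribution that sums to 1
attention_distribution = F.softmax(logits, dim=0)

print(attention_distribution.shape)  # torch.Size([5])
# Stand-in values: 5 encoder hidden states of size 128 (random, for illustration only)
encoder_hidden_states = torch.randn(5, 128)

# Combine: the weighted sum of the values gives the context vector
context = attention_distribution @ encoder_hidden_states
print(context.shape)  # torch.Size([128])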

### Additive Attention - Learning the alignment function

Assuming that the dot product is the right function to measure alignment is quite an assumption.

So let's learn the function instead, as we do for the rest of the neural network, by making it a trainable neural network.

Typically, we use a 1-layer neural network, passing in a stacked vector of the two inputs being compared: the query and one value (in seq2seq, a decoder hidden state and an encoder hidden state).

> Using a single layer neural network as the alignment function is known as _additive attention_

#### $e_t = W_{alignment} \cdot \tanh\left(\begin{bmatrix}W_{decoder} \cdot h_{decoder}^{t'-1} \\ W_{encoder} \cdot h_{encoder}^t \end{bmatrix}\right)$

In [None]:
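class AdditiveAttention(torch.nn.Module):
    def __init__(self, input_dim=128):
        super().__init__()
        attention_hidden_dim = 128
        # Alignment network: scores one (query, value) pair at a time
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(2 * input_dim, attention_hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(attention_hidden_dim, 1),
        )

    def forward(self, query, values):
        # query: (input_dim,), values: (N, input_dim)
        # Stack the query next to every value: (N, 2 * input_dim)
        pairs = torch.cat([query.unsqueeze(0).expand_as(values), values], dim=1)
        # Align: one learned score per value, shape (N,)
        alignments = self.layers(pairs).squeeze(-1)
        # Weight: normalise the scores over the N values
        attention_weights = torch.softmax(alignments, dim=0)
        # Combine: weighted sum of the values, shape (input_dim,)
        return attention_weights @ values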

There are many other forms of attention, but those shown above are the central ones to understand right now.

## Every input is connected to every output by a weight... How is this different to a linear layer in a NN, that takes a weighted input?

There are similarities to a linear layer, but there are key differences:
- The weighting of each input feature changes based on the alignment of the query and the values
- Attention takes a set of vectors, whereas a linear layer takes in a single vector
- In attention, the input vectors and output vector are usually the same size
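
The first two differences can be seen in a few lines. This is an illustrative sketch using dot-product attention with random stand-in tensors. The helper name `dot_product_attention` and all sizes are assumptions made for the example.

```python
import torch

def dot_product_attention(query, values):
    # Align and weight: one attention weight per value, computed from the query
    weights = torch.softmax(values @ query, dim=0)
    # Combine: weighted sum of the values
    return weights @ values, weights

torch.manual_seed(0)
values = torch.randn(5, 8)  # a set of 5 vectors (random, for illustration only)
query_a = torch.randn(8)
query_b = torch.randn(8)

# Same values, different queries: the weighting of the inputs changes
_, weights_a = dot_product_attention(query_a, values)
_, weights_b = dot_product_attention(query_b, values)
print(torch.allclose(weights_a, weights_b))

# The same function accepts a set with a different number of vectors,
# with no change to any parameters
longer_values = torch.randn(11, 8)
context, weights = dot_product_attention(query_a, longer_values)
print(context.shape, weights.shape)
```

A linear layer, by contrast, applies the same learned weights to every input and expects an input of one fixed size.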

## Other notes about attention

> Like RNNs, attention blocks can process inputs of varying lengths

> The values are the set of representations stored in the model's "memory" - you can think of it like model RAM

> We say that the query attends to the values.

We call the function that computes the alignment scores the _alignment function_, $a$. Intuitively, it tells you which parts of the source sequence correspond to the target sequence. In traditional (non-neural) NLP systems, this was a function that told you which words, if any, corresponded to others between the translation pairs.






## Let's look at attention applied to language translation

> Attention was first applied to translation problems, and was able to smash benchmarks.

### Why does attention help for translation?

Seq2seq models face a challenge that the entire representation of the encoded sequence must be captured in a single vector. That encoding represents the concept of the source sequence as a whole. 
All of the rich information in the source sequence must be captured in this "information bottleneck", making it likely that some detail will be lost.

For a task such as translation, which a seq2seq model could tackle, this can make things difficult. The encoding gives you an idea of what the output should represent, but there are often many ways that the source could be translated, and getting a word-to-word translation can be difficult after everything has been summarised.

The typical and intuitive explanation here is that a human translator does not read the whole source sentence, memorise it, and then translate it from memory. Instead, they read the whole thing to get an idea of what the translation needs to represent, and then translate it part by part, looking back at the source sentence to translate a few words at a time. They are primed with the concept that the translation needs to represent, but they need to pay attention to parts of the source sequence as they decide the next word in the translated output.

Vanilla seq2seq models tend to be able to perform well on short sequences, where the information can be "memorised" within just a single vector, but perform worse on longer sequences.

## The Attention Mechanism in seq2seq models

So how do we compute those attention logits (the alignment scores) in a seq2seq model?

Intuitively, it would make sense that the attention that should be paid to one word is a function of what we think about the output translation so far (our current decoder hidden state), and what we think about that word in the context of the input (the encoder hidden states). These will be our queries and values:
- Query, $Q$: 
    - The current decoder state
- Values, $V$:
    - The encoder hidden states

Overall, it looks like this:

![attention mechanism](../images/RNN%20Seq2seq%20Attention.gif)
# TODO add Q,V labels

> In the case of translation, the attention distribution has as many elements as the source sentence has tokens.
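
A single decoder step with dot-product attention might look like the sketch below. It is only an illustration under stated assumptions: the GRU encoder and decoder cell, the sizes, and the random embeddings are all stand-ins chosen for the example, not the architecture of any particular model.

```python
import torch

torch.manual_seed(0)
source_len, embed_dim, hidden_dim = 7, 16, 32  # hypothetical sizes

encoder = torch.nn.GRU(embed_dim, hidden_dim, batch_first=True)
decoder_cell = torch.nn.GRUCell(embed_dim, hidden_dim)

# Stand-in embeddings for a 7-token source sentence (batch of 1)
source_embeddings = torch.randn(1, source_len, embed_dim)
encoder_states, last_state = encoder(source_embeddings)  # (1, 7, 32), (1, 1, 32)
values = encoder_states.squeeze(0)                       # Values: (7, 32)

# One decoder step, starting from the final encoder state
previous_target_embedding = torch.randn(1, embed_dim)
decoder_state = decoder_cell(previous_target_embedding, last_state.squeeze(0))
query = decoder_state.squeeze(0)                         # Query: (32,)

scores = values @ query                                  # Align: (7,)
attention_distribution = torch.softmax(scores, dim=0)    # Weight: (7,)
context = attention_distribution @ values                # Combine: (32,)
print(attention_distribution.shape, context.shape)
```

The attention distribution has one element per source token. One common choice is then to concatenate `context` with the decoder state before predicting the next word.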

As long as the decoder contains enough information to tell it where to look back to, then it can grab more information as and when it needs it, instead of wasting effort carrying it throughout and making sure it is available in the final encoding.


Cross-attention is the type of attention we have seen here, where the values come from a different source (the encoder) than the queries (which come from the decoder).

In [None]:
class Seq2SeqWithAttention(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # TODO: define the encoder, the decoder and the attention block

> It is important to understand that attention is a general technique that can be applied to many tasks (not just translation) and in many architectures (not just seq2seq)

Note that attention itself has no notion of time (or, more generally, position), which is why we need to encode it.
When using an RNN encoder, the values naturally contain positional information, because the RNN incorporates information from preceding timesteps.

This makes attention different to convolutions, which have an explicitly defined use of position.
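
The lack of position can be checked directly: shuffling the order of the values leaves the context unchanged. The sketch below uses dot-product attention on random stand-in tensors. The helper name and sizes are assumptions made for the example.

```python
import torch

def dot_product_attention(query, values):
    weights = torch.softmax(values @ query, dim=0)
    return weights @ values

torch.manual_seed(0)
query = torch.randn(8)
values = torch.randn(5, 8)  # 5 values in their original order

# Put the same 5 values in a random order
shuffled_values = values[torch.randperm(5)]

context_original = dot_product_attention(query, values)
context_shuffled = dot_product_attention(query, shuffled_values)

# The two contexts match up to floating-point rounding
print(torch.allclose(context_original, context_shuffled, atol=1e-5))
```

Any order information therefore has to live inside the values themselves.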

Attention really works. It smashed benchmarks when it was discovered shortly after seq2seq.

## Why does attention help in seq2seq models?

### Attention eliminates the information bottleneck

At every timestep, the decoder can see the entire sequence of encoder hidden states.

# TODO diagram

### Attention opens the gradient superhighway

Because the entire sequence of encoder hidden states is fed directly to the decoder at every timestep, the gradient does not have to flow through many sequential layers of the model to influence the weights that affected far away calculations, such as the first encoder hidden state.
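
A rough way to see this is to backpropagate from the context and check which encoder states receive a gradient. In this sketch the encoder hidden states are a random tensor standing in for real RNN outputs, and the summed context stands in for a loss. Both are assumptions made purely for illustration.

```python
import torch

torch.manual_seed(0)
# Stand-ins: 6 encoder hidden states (the values) and one decoder state (the query)
encoder_states = torch.randn(6, 8, requires_grad=True)
decoder_state = torch.randn(8)

weights = torch.softmax(encoder_states @ decoder_state, dim=0)
context = weights @ encoder_states

# Pretend the loss is the sum of the context, then backpropagate
context.sum().backward()

# One gradient norm per encoder state, including the very first one
grad_norms = encoder_states.grad.norm(dim=1)
print(grad_norms)
```

Every encoder state, however early in the sequence, is one attention step away from the decoder's output.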

# TODO diagram

### Attention makes the model somewhat interpretable

You can tell what the model is considering at each step by looking at the attention weights.

# TODO diagram