# Attention

## Motivation

A seq2seq model has to capture the entire encoded sequence in a single vector. That encoding represents the meaning of the source sequence as a whole.

For a task such as translation, which a seq2seq model could tackle, this can make things difficult. The encoding gives you an idea of what the output should represent, but there are often many ways that the source could be translated, and getting a word-to-word translation can be difficult after everything has been summarised.

The typical and intuitive explanation here is that a human translator does not read the whole source sentence, memorise it, and then translate it from memory. Instead, they read the whole thing to get an idea of what the translation needs to represent, and then they translate it part by part, looking back at the source sentence to translate a few words at a time. They are primed with the concept that the translation needs to represent, but they need to pay attention to parts of the source sequence as they decide the next word in the translated output.

## What's the result of this?

Vanilla seq2seq models tend to be able to perform well on short sequences, where the information can be "memorised" within just a single vector, but perform worse on longer sequences.

## The Attention Mechanism

### The Attention Score

We only have a limited amount of attention.
But we can pay a different percentage of our attention to each word.
The most attention we could pay to a word is 100%, and the least is 0%.
Or 1.0 and 0.0 as proportions.

So we could give each word a number between 0 and 1 which represents the proportion of our attention we give that word.
For the word at position $t$, we call this number $\alpha_t$.


$\alpha$ is a vector of the attention paid to each part of the input.

# $\alpha = \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_t \\ \vdots \\ \alpha_T \end{bmatrix}$

In the case of translation, $\alpha$ has as many elements as the source sentence has tokens.

# $\alpha \in \mathbb{R}^T$

This is the distribution of our attention over the input tokens, so its elements sum to 1.


### How do we calculate the attention score?

Because the attention score is a distribution, it can be computed by applying the softmax function to a vector of logits, $e$. Those logits should have larger values where more attention should be paid.

# $\alpha = \mathrm{softmax}(e)$

> The logits that are softmaxed to compute the attention distribution are also known as the alignment scores.

In [4]:
import torch
import torch.nn.functional as F

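# Example alignment scores (logits), one per source token
logits = torch.tensor([12, -1, 0.4, 5, 2])
# Softmax over the token dimension turns the logits into a distribution that sums to 1
attention_distribution = F.softmax(logits, dim=0)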
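
To see what the softmax is doing to the logits, here is a small sketch that writes it out by hand and compares it with the built-in function. The variable names `shifted`, `manual_distribution` and `builtin_distribution` are made up for this example, and subtracting the maximum before exponentiating is a common numerical-stability trick rather than part of the definition.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([12.0, -1.0, 0.4, 5.0, 2.0])

# Softmax by hand: exponentiate each logit, then normalise so the values sum to 1.
# Subtracting the max first avoids overflow in exp and does not change the result.
shifted = logits - logits.max()
manual_distribution = torch.exp(shifted) / torch.exp(shifted).sum()

# The built-in softmax should give the same distribution.
builtin_distribution = F.softmax(logits, dim=0)

print(torch.allclose(manual_distribution, builtin_distribution))
print(manual_distribution.sum())     # total attention available
print(manual_distribution.argmax())  # index of the token receiving the most attention
```

Because the first logit (12) is much larger than the rest, almost all of the attention ends up on the first token.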


So how do we compute those attention logits - the alignment scores?

Intuitively, it would make sense that the attention that should be paid to one word is a function of what we think about the output translation so far, and what we think about that word in the context of the input.

That is, it should be a function of the decoder hidden state and of the encoder hidden state for the timestep you're computing the attention for.

Note that the most recently computed decoder hidden state is the one from the previous timestep $h_{decoder}^{t'-1}$.
We don't have the current decoder hidden state - that's what we are trying to use the attention to compute.

$e_t = f(h_{decoder}^{t'-1}, h_{encoder}^t)$

So the question is, what function is $f$? What function can combine those two vectors to give this attention logit?

We could pick a fixed function, such as a dot product of the two vectors.
But instead, why don't we have the model figure it out for us?
This function can be learnt, like the rest of the neural network can be learnt.
In fact, the function itself can be a neural network.
Typically, we use a 1-layer neural network, passing in a stacked vector of the two (linearly projected) input hidden states.

# $e_t = W_{alignment} \cdot \tanh\left(\begin{bmatrix}W_{decoder} \cdot h_{decoder}^{t'-1} \\ \\ W_{encoder} \cdot h_{encoder}^t \end{bmatrix}\right)$
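
A minimal sketch of this learnt alignment function in PyTorch might look like the following. All of the sizes (`T`, `encoder_size`, `decoder_size`, `projection_size`) and the random placeholder hidden states are assumptions for illustration, and the weights are untrained. Here `W_decoder` acts on the decoder hidden state, `W_encoder` acts on the encoder hidden states, and `W_alignment` maps the stacked result to one logit per source token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

T = 5                # number of source tokens (assumed)
encoder_size = 8     # encoder hidden size (assumed)
decoder_size = 6     # decoder hidden size (assumed)
projection_size = 4  # size of each projected vector (assumed)

# Learnable weights of the alignment network (untrained here).
W_decoder = nn.Linear(decoder_size, projection_size, bias=False)
W_encoder = nn.Linear(encoder_size, projection_size, bias=False)
W_alignment = nn.Linear(2 * projection_size, 1, bias=False)

# Placeholder hidden states; in a real model these come from the encoder and decoder.
encoder_hidden_states = torch.randn(T, encoder_size)  # one row per source token
previous_decoder_hidden = torch.randn(decoder_size)   # most recent decoder hidden state

# Project both inputs, repeating the decoder projection once per source token.
projected_decoder = W_decoder(previous_decoder_hidden).expand(T, projection_size)
projected_encoder = W_encoder(encoder_hidden_states)

# Stack the projections, apply tanh, then map each row to a single logit.
stacked = torch.cat([projected_decoder, projected_encoder], dim=1)  # (T, 2 * projection_size)
alignment_scores = W_alignment(torch.tanh(stacked)).squeeze(1)      # (T,)

# Softmax over the source tokens gives the attention distribution.
attention_distribution = F.softmax(alignment_scores, dim=0)
print(alignment_scores.shape, attention_distribution.sum())
```

Every source token gets its own logit, but the same three weight matrices are shared across all of them, so the number of parameters does not depend on the sentence length.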

## Using the attention distribution

The point of computing the attention distribution, which tells us which input tokens to pay attention to, was to use it to make a prediction for the next decoder hidden state.

We can use it to create a sum of the encoder hidden state representations, weighted by the attention paid to each of them. 

This is known as the _context_ vector, as it represents the encoder hidden states in the context of what should be paid attention to.

## $context, c = \sum_t \alpha_t h_{encoder}^t$

# TODO diagram



In [None]:

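# attention_distribution was computed in the softmax cell above
print(attention_distribution.shape)  # one weight per source token, so shape (5,)
# Placeholder encoder hidden states, shape (T, hidden_size); the random values and hidden_size = 8 are assumptions
encoder_hidden_states = torch.randn(5, 8)

# Weighted sum over the T encoder hidden states: (5,) @ (5, 8) -> (8,)
context = attention_distribution @ encoder_hidden_states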

The context is combined with the most recent decoder hidden state to compute the next decoder hidden state.

## $next \ decoder \ hidden \ state \ input = \begin{bmatrix}c \\ h_{decoder}^{t'-1}\end{bmatrix}$

## TODO diagram

This is then processed as any decoder input would be.
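
As a rough sketch of that combination step, the context and the most recent decoder hidden state can be concatenated into one vector. What happens next depends on the decoder architecture, which these notes don't pin down, so the single linear layer plus tanh below (named `update` here) is only a hypothetical stand-in for the decoder's recurrent step. The sizes and the random placeholder vectors are also assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

encoder_size = 8  # size of the context vector (assumed)
decoder_size = 6  # decoder hidden size (assumed)

# Placeholders; in a real model these come from the attention step and the decoder.
context = torch.randn(encoder_size)
previous_decoder_hidden = torch.randn(decoder_size)

# Stack the context on top of the most recent decoder hidden state.
decoder_step_input = torch.cat([context, previous_decoder_hidden], dim=0)

# Hypothetical stand-in for the decoder's recurrent update.
update = nn.Linear(encoder_size + decoder_size, decoder_size)
next_decoder_hidden = torch.tanh(update(decoder_step_input))

print(decoder_step_input.shape, next_decoder_hidden.shape)
```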

## Why does attention help?

### Attention eliminates the information bottleneck

At every timestep, the decoder can see the entire sequence of encoder hidden states.

# TODO diagram

### Attention opens the gradient superhighway

Because the entire sequence of encoder hidden states is fed directly to the decoder at every timestep, the gradient does not have to flow through many sequential layers of the model to influence the weights that affected far-away calculations, such as the first encoder hidden state.
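
One way to see this shortcut is to check the gradient that reaches the first encoder hidden state through the context vector alone. The sketch below makes several simplifying assumptions: the sizes are made up, the attention weights are held fixed rather than computed from the hidden states, and `loss` is just the sum of the context vector, standing in for a real training loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T, hidden_size = 5, 8  # assumed sizes

# Placeholder encoder hidden states that we want gradients for.
encoder_hidden_states = torch.randn(T, hidden_size, requires_grad=True)

# Fixed attention weights, for illustration only.
attention_distribution = F.softmax(torch.tensor([12.0, -1.0, 0.4, 5.0, 2.0]), dim=0)

# The context vector is a weighted sum of the encoder hidden states.
context = attention_distribution @ encoder_hidden_states

# Stand-in loss so that we have something to differentiate.
loss = context.sum()
loss.backward()

# The gradient reaches the first encoder hidden state in a single step,
# scaled by the attention paid to that token.
print(encoder_hidden_states.grad[0])
print(attention_distribution[0])
```

In this toy setup each component of the gradient for token $t$ is just $\alpha_t$, so the more attention a token receives, the more directly it is updated, no matter how far back in the sequence it sits.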

# TODO diagram