# Attention #

## Revisiting the recurrent neural network ##

![](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png)

A useful perspective on recurrent neural networks, as we have discussed before, is to see them as very deep feed-forward neural networks – think multi-layer perceptron – where the same weights are used at each layer.
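
To make the "same weights at each layer" view concrete, here is a minimal NumPy sketch of a vanilla recurrent network unrolled over time. The names (`rnn_forward`, `W_xh`, `W_hh`, `b_h`) and the dimensions are assumptions chosen for illustration, not something fixed by the lecture.

```python
import numpy as np


def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence and return every hidden state."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x in xs:
        # Each time step acts like one feed-forward layer,
        # but W_xh, W_hh and b_h are reused at every step.
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        hs.append(h)
    return np.stack(hs)


rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 8, 5  # illustrative sizes
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))
hs = rnn_forward(xs, W_xh, W_hh, b_h)
print(hs.shape)  # one hidden state per time step
```

A sequence of length 5 thus corresponds to a 5-layer network, and a sequence of length 500 to a 500-layer network with the very same parameters.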

![](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-longtermdependencies.png)

In terms of a learning problem, it becomes gradually more difficult to learn relationships as the distance between the related input and output grows.
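
One way to get a feel for this is to track how strongly the hidden state at step $t$ depends on the initial hidden state. The sketch below is only an illustration under assumed conditions (a tanh RNN with small random weights and random inputs; all names and sizes are made up for the example): it accumulates the Jacobian $\partial h_t / \partial h_0$ step by step and records its norm.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 8, 30  # illustrative sizes
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))

h = np.zeros(hidden_dim)
jacobian = np.eye(hidden_dim)  # d h_t / d h_0, starting at the identity
norms = []
for _ in range(seq_len):
    x = rng.normal(size=input_dim)
    h = np.tanh(W_xh @ x + W_hh @ h)
    # Jacobian of a single step: diag(1 - h_t^2) @ W_hh
    step_jacobian = (1.0 - h ** 2)[:, None] * W_hh
    jacobian = step_jacobian @ jacobian
    norms.append(np.linalg.norm(jacobian))

print(norms[0], norms[-1])  # the norm shrinks as the distance grows
```

With these small weights the norm decays roughly geometrically with distance, which is the vanishing gradient problem in miniature; with large weights it can instead blow up.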

![](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png)


The LSTM with its cell state does much to mitigate the issue of vanishing gradients, but as we are about to see, challenges still remain when it comes to long-term dependencies.

## Sequence-to-Sequence recurrent neural network (seq2seq) ##

![](https://d3i71xaburhd42.cloudfront.net/cd52da21cdec50b25b6fb0ba6741091ad38fc986/2-Figure1-1.png)

(From Sutskever et al. (2014))

![](https://i.stack.imgur.com/Ux0Xh.png)

(From Cho et al. (2014))

Very briefly at the end of the recurrent neural network lecture last week, I presented this neural network structure, commonly referred to as “seq2seq”. It was actually discovered independently by two different groups back in 2014: Sutskever et al. (2014) at Google Brain and Cho et al. (2014) at Mila in Montréal – the latter being the original GRU paper.

Here is a concrete realisation of this architecture for the task it was originally designed for: machine translation.

![](https://cdn-images-1.medium.com/max/1600/1*iXV7BD2iKMBIG-tmyzx88g.jpeg)

Formally, we have two functions. An *encoder* that takes a sequence of discrete symbols (or discrete symbols encoded as vectors, remember word2vec) and turns it into a low-dimensional representation of its information content:

$$ f_{enc}(X) = \textbf{h} $$

It then passes this to a *decoder* function that, based on $\textbf{h}$, decodes an output sequence $Y^{\prime}$:

$$ f_{dec}(\textbf{h}) = Y^{\prime} $$

In a way, if $X = Y$, you can think of these as autoencoders with discrete, variable-length inputs and outputs. Not entirely surprisingly, they have been pre-trained in exactly this way (Dai and Le, 2015).

Considering the model from a perspective we are already familiar with, the encoder is nothing more than what we have already spoken about: using a recurrent neural network to encode a variable-length sequence as a fixed-size output. The decoder component performs language modelling akin to what we have already studied; the only real difference is that the language modelling is *conditioned* on a hidden state as well as on what has already been generated.
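
As a rough sketch of $f_{enc}$ and $f_{dec}$, the untrained toy model below encodes a token sequence into a single vector and then greedily decodes from it. Everything here is an assumption made for illustration: the vanilla tanh cells (the original papers used LSTM and GRU cells), the vocabulary size, the `BOS`/`EOS` token ids, and greedy decoding.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 10, 6, 8  # illustrative sizes
BOS, EOS = 0, 1  # hypothetical start and end token ids


def init(*shape):
    return rng.normal(scale=0.1, size=shape)


E_src, E_tgt = init(vocab_size, embed_dim), init(vocab_size, embed_dim)
enc_W_x, enc_W_h = init(hidden_dim, embed_dim), init(hidden_dim, hidden_dim)
dec_W_x, dec_W_h = init(hidden_dim, embed_dim), init(hidden_dim, hidden_dim)
W_out = init(vocab_size, hidden_dim)


def f_enc(X):
    """Encode a variable-length token sequence into a fixed-size vector h."""
    h = np.zeros(hidden_dim)
    for token in X:
        h = np.tanh(enc_W_x @ E_src[token] + enc_W_h @ h)
    return h


def f_dec(h, max_len=5):
    """Greedy decoding: a language model conditioned on h and on past output."""
    Y, token = [], BOS
    for _ in range(max_len):
        h = np.tanh(dec_W_x @ E_tgt[token] + dec_W_h @ h)
        token = int(np.argmax(W_out @ h))  # most likely next token
        if token == EOS:
            break
        Y.append(token)
    return Y


h = f_enc([2, 3, 4, 5])
Y_prime = f_dec(h)
print(h.shape, Y_prime)
```

Note that the decoder only ever sees the input through `h`: whatever the encoder fails to squeeze into that one vector is lost.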


## Alignment in machine translation ##

However, in 2014 these initial seq2seq models could not compete with “oldschool” statistical machine translation models. Rather, they were seen as proofs of concept that might one day lead to neural models being able to perform high-quality translation.

![](https://miro.medium.com/max/1522/1*VMsuEe0XNzi2WGxKMnHgew.png)

In machine translation it is actually not that common to simply associate an input sequence with an output sequence. Rather, you also *align* input tokens and output tokens. This dates back to at least the 80s with the very first “modern” statistical machine translation models. So, can we draw upon this structure to construct a superior seq2seq model?

## seq2seq with attention ##

A common pattern in this course has been to take something non-differentiable, inspired by previous work or the natural world, and make it differentiable. What would happen if we did the same thing to the alignment that oldschool statistical machine translation models rely on?

![](https://user-images.githubusercontent.com/7529838/31751822-86b68320-b4c2-11e7-8f19-165a5ec4c021.png)

We have an encoded input sequence:

$$ \textbf{h}^{enc} = \{ h^{enc}_1, \ldots, h^{enc}_t \} $$

At each time step $j$ in the *decoder*, we take the dot product between our current decoder state $h^{dec}_j$ and *each* encoded input:

$$ \textbf{a} = \textrm{softmax}(\{h^{dec}_j \cdot h^{enc}_1, \ldots, h^{dec}_j \cdot h^{enc}_t\}) $$

This gives us a probability distribution over $\textbf{h}^{enc}$ which we can use to scale the contribution of each member of $\textbf{h}^{enc}$:

$$ \sum_{i = 1}^{t}  a_i h^{enc}_i $$

The decoder then uses the resulting vector when predicting the next token to output.
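
The three steps above (dot products, softmax, weighted sum) fit in a few lines of NumPy. This is a minimal sketch using the dot-product scoring described here; the function names, the random states, and the dimensions are assumptions for illustration, and other scoring functions exist.

```python
import numpy as np


def softmax(z):
    z = z - np.max(z)  # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()


def attend(h_dec, H_enc):
    """Dot-product attention for a single decoder state.

    h_dec: (d,)   current decoder state
    H_enc: (t, d) one encoder state per input position
    """
    scores = H_enc @ h_dec  # one dot product per encoder state, shape (t,)
    a = softmax(scores)     # probability distribution over input positions
    context = a @ H_enc     # weighted sum of encoder states, shape (d,)
    return a, context


rng = np.random.default_rng(0)
H_enc = rng.normal(size=(5, 8))  # t = 5 input positions, d = 8
h_dec = rng.normal(size=8)
a, context = attend(h_dec, H_enc)
print(a.sum(), context.shape)
```

Since every operation is differentiable, the attention weights `a` are learned end-to-end together with the rest of the model, with no alignment supervision required.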

Let us go back to that alignment picture and see how one can interpret it in a new light, that of “soft alignments”.

This seemingly simple addition to the seq2seq model by Bahdanau et al. (2015) led to neural machine translation taking over the field, where it remains dominant to the present day.

## Decomposable attention ##

![](https://miro.medium.com/max/1200/1*ZByTGTbUbqmwyr8M1cAVjg.png)

(From [Parikh et al. (2016)](https://www.aclweb.org/anthology/D16-1244.pdf))

At this stage, I and others viewed attention fairly cynically, as merely addressing the difficulty of training a model capable of encoding a long sequence into a single vector and then unpacking it – I stick to this view to this day. But a year later the Google Brain group sought to build a *pure* attention-based model, *without any recurrence*.

For their task, they had two sequences and were asked to provide a classification decision. What they opted to do was to calculate the attention between each pair of words, then aggregate these interactions, and lastly feed the result into a multi-layer perceptron – simple! What was amazing were the results: this model performed as well as models with an order of magnitude more weights! This was a shot across the bow of the recurrent neural networks that had dominated natural language processing for the last three years at that point.
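
A heavily simplified sketch of this attend, compare, aggregate recipe is given below. It is not the configuration from the paper: the single ReLU layers standing in for the feed-forward networks (`F`, `G`), the linear classifier `W_cls`, and all sizes are assumptions for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden, n_classes = 6, 8, 3  # illustrative sizes


def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def relu_layer(W):
    return lambda x: np.maximum(0.0, x @ W)


F = relu_layer(rng.normal(scale=0.1, size=(d, hidden)))      # used to score word pairs
G = relu_layer(rng.normal(scale=0.1, size=(2 * d, hidden)))  # compares word and aligned phrase
W_cls = rng.normal(scale=0.1, size=(2 * hidden, n_classes))


def decomposable_attention(A, B):
    """A: (len_a, d) and B: (len_b, d) are the two embedded sequences."""
    # Attend: score every word pair, then softly align each word to the other sequence.
    E = F(A) @ F(B).T                 # (len_a, len_b)
    beta = softmax(E, axis=1) @ B     # part of B aligned to each word in A
    alpha = softmax(E, axis=0).T @ A  # part of A aligned to each word in B
    # Compare: process each word together with its soft alignment.
    V_a = G(np.concatenate([A, beta], axis=1))
    V_b = G(np.concatenate([B, alpha], axis=1))
    # Aggregate: sum over positions (order-independent), then classify.
    v = np.concatenate([V_a.sum(axis=0), V_b.sum(axis=0)])
    return softmax(v @ W_cls, axis=0)


A = rng.normal(size=(4, d))
B = rng.normal(size=(7, d))
probs = decomposable_attention(A, B)
print(probs.shape, probs.sum())
```

Nothing in this computation depends on word order: there is no recurrence, and the aggregation is a plain sum over positions.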

Can we relate this model to the “bag of vectors” model from Assignment 2 somehow?

## The transformer ##

As a next logical step, the same group at Google Brain then moved to introduce weights into the attention mechanism, leading to the model we now commonly know as the transformer.

![](https://jalammar.github.io/images/t/The_transformer_encoder_decoder_stack.png)

![](https://jalammar.github.io/images/t/encoder_with_tensors_2.png)

A key point to note here: the weights of the feed-forward network are shared across positions.
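
One reading of "shared" here is that the same feed-forward weights are applied independently to the vector at every position. The small check below illustrates this; the two-layer ReLU network and the sizes are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5  # illustrative sizes
W1, b1 = rng.normal(scale=0.1, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.1, size=(d_ff, d_model)), np.zeros(d_model)


def feed_forward(x):
    """Two-layer network; works on one vector or on a whole (seq_len, d_model) matrix."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2


Z = rng.normal(size=(seq_len, d_model))  # one vector per position
out_all = feed_forward(Z)                              # all positions at once
out_each = np.stack([feed_forward(z) for z in Z])      # one position at a time
print(np.allclose(out_all, out_each))
```

Because the positions do not interact inside the feed-forward network, all mixing of information between positions has to happen in the attention layer.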

![](https://jalammar.github.io/images/t/transformer_self_attention_score.png)

![](https://jalammar.github.io/images/t/self-attention_softmax.png)

![](https://jalammar.github.io/images/t/self-attention-output.png)

![](https://jalammar.github.io/images/t/self-attention-matrix-calculation.png)

![](https://jalammar.github.io/images/t/self-attention-matrix-calculation-2.png)
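
The matrix form of self-attention, $\textrm{softmax}(QK^{\top} / \sqrt{d_k}) V$, can be sketched as follows. The weight matrices are random stand-ins for learned parameters, and the names and sizes are assumptions for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V  # queries, keys and values
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (seq_len, seq_len)
    return weights @ V, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4  # illustrative sizes
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))

Z, weights = self_attention(X, W_Q, W_K, W_V)
print(Z.shape, weights.sum(axis=-1))
```

Compared to the seq2seq attention earlier, the only new ingredients are the learned projections `W_Q`, `W_K`, `W_V` and the scaling by $\sqrt{d_k}$; each row of `weights` is still a probability distribution over positions.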

![](https://jalammar.github.io/images/t/transformer_attention_heads_qkv.png)

![](https://jalammar.github.io/images/t/transformer_attention_heads_z.png)

![](https://jalammar.github.io/images/t/transformer_attention_heads_weight_matrix_o.png)
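
Multi-head attention runs several such attention units in parallel, each with its own projections, concatenates their outputs, and mixes them with an output matrix. A minimal sketch, with two heads and random stand-in weights (the names `heads` and `W_O` and all sizes are assumptions for illustration):

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V


def multi_head_self_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per attention head."""
    Z_per_head = [attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in heads]
    # Concatenate the head outputs along the feature axis, then project with W_O.
    return np.concatenate(Z_per_head, axis=-1) @ W_O


rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_head = 5, 8, 2, 4  # illustrative sizes
X = rng.normal(size=(seq_len, d_model))
heads = [
    tuple(rng.normal(scale=0.1, size=(d_model, d_head)) for _ in range(3))
    for _ in range(n_heads)
]
W_O = rng.normal(scale=0.1, size=(n_heads * d_head, d_model))

Z = multi_head_self_attention(X, heads, W_O)
print(Z.shape)  # same shape as X, so layers can be stacked
```

Since each head has its own projections, different heads are free to attend to different positions for the same word.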

![](https://jalammar.github.io/images/t/transformer_self-attention_visualization.png)

If we manually inspect the attention heads, we quickly 

![](https://jalammar.github.io/images/t/transformer_self-attention_visualization_3.png)

So what about the decoder then… It actually uses the same kind of self-attention units and decodes 

![](https://jalammar.github.io/images/t/transformer_decoding_2.gif)

## Acknowledgements ##

[Jay Alammar’s](https://jalammar.github.io/illustrated-transformer) wonderful “The Illustrated Transformer”.