# Attention Is All You Need
* This paper deals with sequence transduction (e.g. neural machine translation).
* Most of these models have been based on recurrent or convolutional architectures.
* This paper uses neither recurrence nor convolution. Instead it proposes a new architecture, the *Transformer*, that relies on attention alone.
* A bit of a paradigm shift!
* On WMT 2014 English-to-German they reach a BLEU score of 28.4.
* On WMT 2014 English-to-French they reach a BLEU score of 41.0.

## Previous State of the Art
* *Encoder-Decoder* structure
    * One encoder that takes the source sequence and produces something that is fed into a decoder which then produces the output sequence.
    * Both recurrent and convolutional networks (and combinations of the two) have been used for both the encoder and the decoder (I think).
    * For recurrent models (usually bidirectional LSTMs)
        * The encoder sequentially processes the source sequence $x = (x_1, ..., x_n)$ to produce hidden state representations $h = (h_1, ..., h_n)$. Each $h_t$ is computed as a function of the token (word vector) at time step $t$ and the previous hidden state $h_{t-1}$
        * $h$ is given to the decoder which then sequentially produces an output sequence $y = (y_1, ..., y_m)$
            * In the basic encoder-decoder without attention, only the final state $h_n$ is passed to the decoder. With attention, the decoder can look at all of $h$.
            * At each timestep the decoder also outputs a hidden state which is fed into the next invocation of it.
            * The decoder is auto-regressive, meaning the produced tokens at previous timesteps are also fed into the decoder when producing the next.
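
The recurrence described above can be sketched in a few lines. This is a minimal vanilla (tanh) RNN encoder in NumPy, not an LSTM and not bidirectional. The weight names `W_x`, `W_h`, `b` and the toy sizes are assumptions made only for illustration. The point is that each $h_t$ needs $h_{t-1}$, so the loop over time steps cannot be run in parallel.

```python
import numpy as np

def rnn_encode(x, W_x, W_h, b):
    """Run a vanilla tanh RNN over x (n x d_in) and return all hidden states (n x d_h)."""
    n = x.shape[0]
    d_h = W_h.shape[0]
    h_prev = np.zeros(d_h)          # h_0: initial hidden state
    states = []
    for t in range(n):              # strictly sequential: step t needs step t-1
        h_prev = np.tanh(x[t] @ W_x + h_prev @ W_h + b)
        states.append(h_prev)
    return np.stack(states)

rng = np.random.default_rng(0)
n, d_in, d_h = 5, 4, 3              # toy sizes (assumed)
x = rng.normal(size=(n, d_in))      # stand-in for word vectors
W_x = rng.normal(size=(d_in, d_h))
W_h = rng.normal(size=(d_h, d_h))
b = np.zeros(d_h)

h = rnn_encode(x, W_x, W_h, b)
print(h.shape)                      # (5, 3): one hidden state per source token
```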

## Difficulties
* Because recurrent models compute sequentially, they cannot be parallelized within a training example.
    * Training is slower.
    * For long sequences, memory constraints limit how many examples fit in a batch.
* *Long range dependencies*
    * With RNN-based architectures, all memory has to be stored in a fixed-size vector that is passed along through time. The network must learn to keep important information and discard the rest, which often yields worse translations when dependencies are far apart.
    * When the decoder predicts a token, the information needed to do this well must be present in the hidden state fed in. This becomes hard when many hidden state computation steps separate that hidden state from the hidden states of the source tokens that matter for the prediction.
    * The goal is to make these paths shorter so that the needed information is easily accessible at any position. One way to do this is *attention*, explained below.

## Attention
* The goal of attention is to reduce the distance of information flow so that the model can get the relevant information more easily.
* Self-attention relates different positions of a single sequence to each other. Basically, the computation for a token receives a weighted average over the positions of the sequence: all positions in the encoder, only the current and earlier positions in the decoder.
* An attention function can be summarized as a mapping from a query and a set of key-value pairs to an output.
    * The output is a weighted sum of the values
    * Weights are computed by a compatibility function of query and keys.
    * Basically the key-values should store interesting stuff and the queries are questions about this.
* In self-attention the keys, values, and queries are all derived from the same sequence, each through its own learned linear projection.
    * The question being asked: how does each token in the sequence relate to the other tokens in the sequence?
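
The query / key-value picture above behaves like a soft dictionary lookup, and a tiny hand-made example may help. The sketch below assumes a plain dot product as the compatibility function (other choices exist), and the keys, values, and query are made-up numbers. The function name `attend` is hypothetical.

```python
import numpy as np

def attend(query, keys, values):
    """Return the weighted sum of values, and the weights, for a single query."""
    scores = keys @ query                     # compatibility: one dot product per key
    weights = np.exp(scores - scores.max())   # softmax (shifted for numerical stability)
    weights /= weights.sum()
    return weights @ values, weights

keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [-1.0, 0.0]])
values = np.array([[10.0], [20.0], [30.0]])
query = np.array([5.0, 0.0])                  # points strongly in the direction of the first key

output, weights = attend(query, keys, values)
print(weights.round(3))                       # almost all weight lands on the first key
print(output)                                 # so the output is close to the first value, 10.0
```

A query that matches several keys about equally would instead return a blend of their values.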

## This Paper, the Transformer model
* No recurrence/convolutions, instead just relying on attention.
* Still have the encoder-decoder structure.
    * The entire source sequence is fed to the encoder.
    * The output of the encoder is an $n \times d_{model}$ matrix, where $n$ is the input length.
    * The encoder output and the tokens produced so far are fed into the decoder which produces probabilities for the next token to predict. This is repeated to do a translation.
    * For training, the whole target sequence (shifted right) is fed to the decoder at once, so the loss and gradients for every target position are computed in one parallel pass. Compare this to an RNN, where backpropagation has to run through each timestep in order.
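
The "repeat until the translation is done" loop at inference time can be sketched as follows. This is a minimal sketch under stated assumptions: `greedy_decode`, `toy_decoder`, and the token ids `BOS` and `EOS` are hypothetical names, the toy decoder is a hard-coded stand-in for a trained model, and picking the argmax at each step is a simplification (the paper uses beam search).

```python
import numpy as np

BOS, EOS = 0, 1  # hypothetical ids for the start and end tokens

def greedy_decode(next_token_probs, memory, max_len=10):
    """Repeatedly ask the decoder for next-token probabilities and pick the argmax."""
    tokens = [BOS]
    for _ in range(max_len):
        probs = next_token_probs(memory, tokens)   # distribution over the vocabulary
        next_id = int(np.argmax(probs))
        tokens.append(next_id)
        if next_id == EOS:
            break
    return tokens

def toy_decoder(memory, tokens):
    """Stand-in for a trained decoder: emits ids 2, 3, 4 and then EOS."""
    vocab_size = 6
    probs = np.full(vocab_size, 0.01)
    next_id = len(tokens) + 1 if len(tokens) < 4 else EOS
    probs[next_id] = 1.0
    return probs / probs.sum()

memory = np.zeros((7, 512))                 # placeholder for the n x d_model encoder output
print(greedy_decode(toy_decoder, memory))   # [0, 2, 3, 4, 1]
```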

### Encoder
* Source tokens into embeddings
* Add positional info
* $N = 6$ identical layers
    * First: multi-head attention with residual connection and then layer normalization
    * Second: position-wise feed forward network with residual connection and then layer normalization
    * Both sublayers (and the word embeddings) produce outputs with dimension $d_{model} = 512$
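
The "residual connection and then layer normalization" wrapper used around every sublayer computes $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$. Here is a minimal NumPy sketch of that wrapper. It assumes a simplified layer norm without the learned gain and bias, and the matrix `W` is a made-up stand-in for a real sublayer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance (gain and bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    """Residual wrapper followed by normalization: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
n, d_model = 4, 512
x = rng.normal(size=(n, d_model))
W = rng.normal(size=(d_model, d_model)) * 0.02   # hypothetical stand-in for a real sublayer

out = sublayer_connection(x, lambda t: t @ W)
print(out.shape)   # (4, 512): the shape is preserved, so layers can be stacked
```

Keeping every sublayer output at $d_{model}$ is what makes the residual addition possible.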

### Decoder
* Output tokens into embeddings
* Add positional info
* $N = 6$ identical layers
    * First: masked multi-head attention over output embeddings with residual connection and then layer normalization
        * The mask is to ensure that predictions for position $i$ can only depend on previous outputs of the decoder (since the others haven't been predicted yet).
    * Second: multi-head attention over encoder's output with queries from previous sublayer with residual connection and then layer normalization
    * Third: position-wise feed forward network with residual connection and then layer normalization
* Linear + softmax to probabilities of next token
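
The mask in the first decoder sublayer can be illustrated on its own. The sketch below is an assumption-laden toy, not the paper's code: it builds a lower-triangular mask and sets the disallowed scores to $-\infty$ before the softmax, so future positions get exactly zero weight. The function names are hypothetical.

```python
import numpy as np

def causal_mask(m):
    """Boolean m x m matrix; entry (i, j) is True when position i may attend to position j."""
    return np.tril(np.ones((m, m), dtype=bool))

def masked_softmax(scores, mask):
    """Row-wise softmax with disallowed entries set to -inf first, so their weight becomes 0."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

m = 4
scores = np.zeros((m, m))   # equal raw scores, to make the effect of the mask visible
weights = masked_softmax(scores, causal_mask(m))
print(weights)              # row i spreads its weight uniformly over positions 0..i
```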

<img src="figs/attentionisallyouneed/model.png" width="40%">


### Common

#### Positional info encoding
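* Since there is no convolution or recurrence, the model has no built-in notion of token order, so positional information is fed in explicitly.
* It is added to the token embedding, so it also has dimension $d_{model}$.
* The encoding can be learned or fixed.
* They use fixed sinusoids of a different frequency for each dimension, with the token position as input.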
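
The paper's sinusoids are $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. A minimal NumPy sketch follows. The function name `sinusoidal_positions` is hypothetical, and the code assumes $d_{model}$ is even.

```python
import numpy as np

def sinusoidal_positions(n, d_model):
    """Return an n x d_model matrix of fixed sinusoidal position encodings."""
    pos = np.arange(n)[:, None]                    # positions, shape (n, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i, shape (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)  # shape (n, d_model/2)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = sinusoidal_positions(n=50, d_model=512)
print(pe.shape)      # (50, 512)
print(pe[0, :4])     # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

The result is simply added to the token embeddings, row by row.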

#### Position-wise feed forward networks
* A neural network with one hidden layer and a ReLU activation on the hidden layer.
* It's applied at each position separately and identically.
* Weights are shared for different positions but not shared across the $N$ layers.
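
In formula form this is $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$. The sketch below uses random weights purely for illustration. The inner size $d_{ff} = 2048$ is the value from the paper, while the weight scale of 0.02 is an arbitrary assumption.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row (position) of x."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU hidden layer, shape (n, d_ff)
    return hidden @ W2 + b2                 # back to shape (n, d_model)

rng = np.random.default_rng(0)
n, d_model, d_ff = 5, 512, 2048
x = rng.normal(size=(n, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

out = position_wise_ffn(x, W1, b1, W2, b2)
# Applying the network to one position alone gives the same row: positions do not interact.
single = position_wise_ffn(x[2:3], W1, b1, W2, b2)[0]
print(np.allclose(out[2], single))          # True
```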

### Attention
    
#### Scaled Dot-Product Attention
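* Input is queries, keys, and values.
* Compute dot products between queries and keys -> similarity.
* Scale the dot products by $1/\sqrt{d_k}$, where $d_k$ is the key dimension, so that large dot products do not push the softmax into regions with very small gradients.
* Apply softmax to the scaled dot products.
* Use the softmax output as weights on the values to compute the output.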
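
Written as one formula, the paper defines $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^T / \sqrt{d_k})\,V$. Here is a minimal NumPy sketch without masking or batching. The inputs are random and only serve to show the shapes.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n_q, n_k) similarities, scaled
    weights = softmax(scores)         # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))          # 3 queries
K = rng.normal(size=(5, 64))          # 5 keys
V = rng.normal(size=(5, 64))          # 5 values, one per key
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)       # (3, 64) (3, 5)
```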

#### Multi-Head Attention
* Instead of just one attention function they run several in parallel, as follows.
* They project the queries, keys, and values $h$ times with different learned linear projections (in the paper $h = 8$, with $d_k = d_v = d_{model}/h = 64$ per head).
* On each projected version of the queries, keys, and values, scaled dot-product attention is applied in parallel.
* All outputs are concatenated and linearly projected to the final output.
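
The steps above can be sketched in NumPy. One assumption to flag: instead of $h$ separate projection matrices of size $d_{model} \times d_k$, this sketch uses one $d_{model} \times d_{model}$ matrix per role and splits its output into $h$ column blocks, which is a common and equivalent way to implement it. The names `X_q` and `X_kv` (the sequences that supply queries and keys/values) are hypothetical, and the weights are random.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X_q, X_kv, W_q, W_k, W_v, W_o, h):
    """Project, split into h heads, attend per head, concatenate, project again."""
    n_q, d_model = X_q.shape
    d_k = d_model // h
    # Project, then split the feature axis into heads: shape (h, n, d_k)
    Q = (X_q @ W_q).reshape(n_q, h, d_k).transpose(1, 0, 2)
    K = (X_kv @ W_k).reshape(-1, h, d_k).transpose(1, 0, 2)
    V = (X_kv @ W_v).reshape(-1, h, d_k).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n_q, n_kv)
    heads = softmax(scores) @ V                        # (h, n_q, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n_q, d_model)
    return concat @ W_o                                # final linear projection

rng = np.random.default_rng(0)
d_model, h = 512, 8
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))

X = rng.normal(size=(6, d_model))   # e.g. encoder-side sequence of length 6
Y = rng.normal(size=(4, d_model))   # e.g. decoder-side sequence of length 4
self_out = multi_head_attention(X, X, W_q, W_k, W_v, W_o, h)    # self-attention
cross_out = multi_head_attention(Y, X, W_q, W_k, W_v, W_o, h)   # queries from Y, keys/values from X
print(self_out.shape, cross_out.shape)   # (6, 512) (4, 512)
```

In self-attention the same sequence is passed for both arguments. In encoder-decoder attention the queries come from the decoder side and the keys and values from the encoder output.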

#### How it's applied
* In the second attention sublayer of the decoder
    * Keys and values come from the encoder's output.
    * Queries come from the preceding self-attention sublayer of the decoder.
* In the self-attention sublayers of the encoder
    * The keys, values, and queries come from the previous layer (not sublayer) of the encoder.
    * Each position can attend to all other positions in the sequence as represented by the previous layer.
* In the self-attention sublayers of the decoder
    * The keys, values, and queries come from the previous layer (not sublayer) of the decoder.
    * Each position can attend to all positions up to and including itself in the sequence produced so far, as represented by the previous layer.
* In every case the queries, keys, and values are obtained by multiplying the respective input by learned projection matrices $W^Q$, $W^K$, $W^V$ (one set per head).

### Training
* Adam optimizer
* First a linearly increasing learning rate (for 4000 warmup steps), then decreasing proportionally to the inverse square root of the step number.
* Dropout is applied to the output of each sublayer, before it is added to the residual and layer-normalized.
* Dropout at sum of token and position embeddings.
* Label smoothing
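
Two of the training details above are easy to make concrete. The learning rate in the paper is $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})$ with $warmup\_steps = 4000$. For label smoothing the paper uses $\epsilon_{ls} = 0.1$. The sketch below assumes one common formulation of label smoothing (mixing the one-hot target with a uniform distribution), which may differ in detail from the authors' implementation, and both function names are hypothetical.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (step >= 1)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

def smooth_labels(target_id, vocab_size, eps=0.1):
    """Mix the one-hot target with a uniform distribution over the vocabulary."""
    dist = [eps / vocab_size] * vocab_size
    dist[target_id] += 1.0 - eps
    return dist

# The rate rises linearly during warmup, peaks at step 4000, then decays.
for step in (1, 2000, 4000, 8000, 100000):
    print(step, transformer_lr(step))

# Toy vocabulary of 5 tokens with token 2 as the correct one.
print(smooth_labels(target_id=2, vocab_size=5))
```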