# Attention Is All You Need [1]

Additional parameters $W_V$, $W_Q$ and $W_K$ are incorporated into the attention mechanism:

$$
u_i = \sum_{j=1}^{J} \alpha(h^e_1, \cdots, h^e_J, h^d_{i-1})\,W_V\,h^e_j
$$
with
$$
\alpha(h^e_1, \cdots, h^e_J, h^d_{i-1}) = \mathcal{S}(e_{ij})
$$
where $\mathcal{S}$ denotes the softmax over the source positions $j$, and
$$
e_{ij} = W_Q\,h_{i-1}^d \cdot W_K\,h_j^e
$$
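
The three equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the toy sizes `J`, `D`, `D_k`, the random weights and the helper names are assumptions made for the example, and no $1/\sqrt{d_k}$ scaling is applied because the formula for $e_{ij}$ above has none.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_context(h_enc, h_dec_prev, W_Q, W_K, W_V):
    """Return (u_i, alpha) for one decoder state h^d_{i-1}.

    h_enc: (J, D) encoder states; h_dec_prev: (D,) previous decoder state.
    """
    query = W_Q @ h_dec_prev      # W_Q h^d_{i-1}
    keys = h_enc @ W_K.T          # row j is W_K h^e_j
    values = h_enc @ W_V.T        # row j is W_V h^e_j
    e = keys @ query              # e_ij for j = 1..J
    alpha = softmax(e)            # attention weights over source positions
    return alpha @ values, alpha  # u_i = sum_j alpha_ij W_V h^e_j

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
J, D, D_k = 5, 8, 4
h_enc = rng.normal(size=(J, D))
h_dec_prev = rng.normal(size=D)
W_Q, W_K, W_V = (rng.normal(size=(D_k, D)) for _ in range(3))
u_i, alpha = attention_context(h_enc, h_dec_prev, W_Q, W_K, W_V)
```

The weights `alpha` are positive and sum to one, so `u_i` is a convex combination of the projected encoder states.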

But there is much more:

  * RNN replaced by FFN: direct connection between source and target words and faster computation
  * Multi-layer self-attention and cross-attention
$$
\hat{y}_1^{\hat{I }} = \operatorname*{argmax}_{y_1^I} \prod_{i=1}^I p(y_i\mid v(u(y_1^{i-1}),u(x_1^J)))
$$  
  * Multi-head attention 
  * Positional encoding 

  

<p style="page-break-after:always;"></p>

### Multi-layer self-attention and cross-attention

The original Transformer has a 6-layer encoder with self-attention and a 6-layer decoder with self-attention and cross-attention.

### Encoder

Each layer of the encoder can be depicted as follows

<center>
<img src="Encoder_Transformer.svg" width="200"/>
</center>

In layer $l$ of the encoder, self-attention computes

$$u^{e,l}_j = \sum_{j'=1}^{J} \alpha(h^{e,l-1}_1, \cdots, h^{e,l-1}_J, h^{e,l-1}_j)\,W^{e,l}_V\,h^{e,l-1}_{j'}$$

followed by an FFN that generates the hidden state of layer $l$ of the encoder

$$
\begin{align}
\notag h^{e,l}_j &= F(u^{e,l}_j)\\
\notag           &= W_2^{e,l}\,\operatorname{ReLU}\left(W_1^{e,l}\,u_j^{e,l}+b_1^{e,l}\right)+b_2^{e,l}
\end{align}
$$

where $W_2^{e,l}$ and $W_1^{e,l}$ are the weights of the second and first layers of the
FFN, and $b_2^{e,l}$ and $b_1^{e,l}$ are the corresponding biases.
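
As a rough sketch, one encoder layer as written here (self-attention followed by the position-wise FFN, without residual connections or normalization) could look as follows in NumPy. The function name, the toy sizes `J`, `D`, `D_ff` and the random weights are assumptions for illustration only.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def encoder_layer(H, W_Q, W_K, W_V, W_1, b_1, W_2, b_2):
    """Map the (J, D) states of layer l-1 to the (J, D) states of layer l."""
    Q, K, V = H @ W_Q.T, H @ W_K.T, H @ W_V.T
    alpha = softmax(Q @ K.T)                   # (J, J): row j attends over all positions
    U = alpha @ V                              # row j is u^{e,l}_j
    hidden = np.maximum(0.0, U @ W_1.T + b_1)  # ReLU(W_1 u_j + b_1)
    return hidden @ W_2.T + b_2                # W_2 ReLU(...) + b_2

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
J, D, D_ff = 4, 6, 12
H = rng.normal(size=(J, D))
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))
W_1, b_1 = rng.normal(size=(D_ff, D)), np.zeros(D_ff)
W_2, b_2 = rng.normal(size=(D, D_ff)), np.zeros(D)
H_next = encoder_layer(H, W_Q, W_K, W_V, W_1, b_1, W_2, b_2)
```

Without positional information the layer is permutation-equivariant: reordering the rows of `H` reorders the rows of the output in the same way.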


### Decoder

In the case of the decoder, layer $l$ is represented as

<center>
<img src="Decoder_Transformer.svg" width="200"/>
</center>

In layer $l$ of the decoder, self-attention is applied first. It is similar to the encoder self-attention but limited to the first $i$ positions

$$u^{d,l}_i = \sum_{i'=1}^{i} \alpha(h^{d,l-1}_1, \cdots, h^{d,l-1}_i, h^{d,l-1}_i)\,W^{d,l}_V\,h^{d,l-1}_{i'}$$
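
The restriction to the first $i$ positions is usually implemented with a mask on the attention scores. The sketch below is one possible NumPy version under that assumption; the function name, toy sizes and random weights are illustrative only.

```python
import numpy as np

def causal_self_attention(H, W_Q, W_K, W_V):
    """Self-attention in which position i only attends to positions 1..i.

    H: (I, D) decoder states of layer l-1. Returns (U, weights).
    """
    I = H.shape[0]
    Q, K, V = H @ W_Q.T, H @ W_K.T, H @ W_V.T
    scores = Q @ K.T                                   # (I, I)
    future = np.triu(np.ones((I, I), dtype=bool), k=1)  # True where i' > i
    scores = np.where(future, -np.inf, scores)         # block future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                           # exp(-inf) = 0 for masked entries
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
I, D = 4, 6
H = rng.normal(size=(I, D))
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))
U, weights = causal_self_attention(H, W_Q, W_K, W_V)
```

Because the masked weights are exactly zero, changing a later decoder state leaves the outputs of all earlier positions unchanged.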

This is followed by cross-attention over all source positions of the last encoder layer $L$

$$v^{d,l}_i = \sum_{j=1}^{J} \alpha(h^{e,L}_1, \cdots, h^{e,L}_J, u^{d,l}_i)\,W^{c,l}_V\,h^{e,L}_j$$
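
In cross-attention the queries come from the decoder and the keys and values from the encoder, so the weight matrix has one row per target position and one column per source position. A minimal NumPy sketch, with hypothetical names and toy sizes:

```python
import numpy as np

def cross_attention(U_dec, H_enc, W_Q, W_K, W_V):
    """U_dec: (I, D) rows u^{d,l}_i; H_enc: (J, D) rows h^{e,L}_j."""
    scores = (U_dec @ W_Q.T) @ (H_enc @ W_K.T).T  # (I, J)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over source positions
    return weights @ (H_enc @ W_V.T), weights       # rows are v^{d,l}_i

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
I, J, D = 3, 5, 6
U_dec = rng.normal(size=(I, D))
H_enc = rng.normal(size=(J, D))
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))
V_dec, weights = cross_attention(U_dec, H_enc, W_Q, W_K, W_V)
```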

Finally, as in the encoder, a feed-forward net generates the hidden state of layer $l$ of the decoder

$$
\begin{align}
\notag h^{d,l}_i &= F(v^{d,l}_i)\\
\notag           &= W_2^{d,l}\,\operatorname{ReLU}\left(W_1^{d,l}\,v_i^{d,l}+b_1^{d,l}\right)+b_2^{d,l}
\end{align}
$$


<p style="page-break-after:always;"></p>

## Transformer Architecture

<img src="Transformer_Architecture.svg" width="1000"/>

<p style="page-break-after:always;"></p>

### Multi-head attention

At each layer, $N$ independent attention mechanisms, each with its own $\left( W_Q,\,W_K,\,W_V \right)$, jointly attend to information from different representation subspaces at different positions.

Multi-head attention is applied to the encoder self-attention, the decoder self-attention and the decoder cross-attention.

For instance, the encoder self-attention for the $n$-th head is

$$u^{e,l,n}_j = \sum_{j'=1}^{J} \alpha(h^{e,l-1}_1, \cdots, h^{e,l-1}_J, h^{e,l-1}_j;\,W^{e,l,n}_Q,\,W^{e,l,n}_K\,)\,W^{e,l,n}_V\,h^{e,l-1}_{j'}$$

The $N$ head outputs $u^{e,l,1}_j,\dots,u^{e,l,N}_j$ for position $j$ are concatenated and projected via a feed-forward net

$$
u^{e,l}_j = F\left(\left[u^{e,l,1}_j;\cdots;u^{e,l,N}_j\right]\right)
$$
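
A possible NumPy sketch of multi-head self-attention is given below. It concatenates the $N$ head outputs at each position and, as an assumption, uses a single linear projection `W_O` in place of $F$ (this is the choice made in [1]); the head size `D // N`, the names and the random weights are illustrative only.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_self_attention(H, heads, W_O):
    """heads: list of (W_Q, W_K, W_V), one triple per head; W_O: (D, N * D_head)."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        alpha = softmax((H @ W_Q.T) @ (H @ W_K.T).T)  # (J, J) weights of this head
        outputs.append(alpha @ (H @ W_V.T))           # (J, D_head): rows u^{e,l,n}_j
    concat = np.concatenate(outputs, axis=-1)         # (J, N * D_head)
    return concat @ W_O.T                             # (J, D): rows u^{e,l}_j

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
J, D, N = 5, 8, 2
D_head = D // N
H = rng.normal(size=(J, D))
heads = [tuple(rng.normal(size=(D_head, D)) for _ in range(3)) for _ in range(N)]
W_O = rng.normal(size=(D, N * D_head))
U = multi_head_self_attention(H, heads, W_O)
```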

<p style="page-break-after:always;"></p>

### Positional encoding and word embeddings

The Transformer architecture encodes position information as

$$
\begin{align}
\notag  p_{j,2k} & = \sin(j/10000^{2k/D})\\
\notag  p_{j,2k+1} & = \cos(j/10000^{2k/D})
\end{align}
$$
where $D$ is the dimension of the word embedding, so that the two vectors can be summed to initialise the hidden states. For instance, for the encoder
$$
h_j^{e,0} = w_j+p_j\quad\text{for}~~ 1\leq j \leq J
$$
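
The sinusoidal encoding and the sum with the word embeddings can be sketched in plain Python. The function names are hypothetical, and $D$ is assumed to be even so that sine and cosine components pair up.

```python
import math

def positional_encoding(j, D):
    """Return p_j as a list of length D (D assumed even)."""
    p = [0.0] * D
    for k in range(D // 2):
        angle = j / (10000 ** (2 * k / D))
        p[2 * k] = math.sin(angle)      # p_{j,2k}
        p[2 * k + 1] = math.cos(angle)  # p_{j,2k+1}
    return p

def embed_with_position(word_vectors):
    """h_j^{e,0} = w_j + p_j for positions j = 1..J."""
    D = len(word_vectors[0])
    return [
        [w + p for w, p in zip(w_j, positional_encoding(j, D))]
        for j, w_j in enumerate(word_vectors, start=1)
    ]

# Three zero "word embeddings" of dimension 4, so the result is just p_1..p_3.
h0 = embed_with_position([[0.0] * 4 for _ in range(3)])
```

Low values of $k$ give fast-varying components and high values give slow-varying ones, so each position receives a distinct pattern across the $D$ dimensions.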

### Output layer

For each state of the decoder in the last layer $h^{d,L}_i$, a probability distribution over the target vocabulary is defined

$$
p(y_i\mid v(u(y_1^{i-1}),u(x_1^J))) = \mathcal{S} ( F ( h^{d,L}_i ) ) \qquad 1 \leq i \leq I
$$
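
A minimal sketch of the output layer, assuming that $F$ is a single linear projection to the vocabulary size (the names `W_out`, `b_out` and the toy sizes are hypothetical):

```python
import numpy as np

def output_distribution(h, W_out, b_out):
    """Softmax over the target vocabulary for one decoder state h^{d,L}_i."""
    logits = W_out @ h + b_out       # F(h), assumed linear here
    logits = logits - logits.max()   # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
D, vocab_size = 6, 10
h = rng.normal(size=D)
W_out, b_out = rng.normal(size=(vocab_size, D)), np.zeros(vocab_size)
p_y = output_distribution(h, W_out, b_out)
```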


<p style="page-break-after:always;"></p>

### Transformer Architecture

* The vanilla Transformer model is built on the encoder-decoder architecture, which consists of two stacks of Transformer blocks as the encoder and decoder, respectively. 

* The encoder adopts stacked multi-head self-attention layers to encode the input sequence for generating its latent representations

* The decoder performs cross-attention on these representations and autoregressively generates the target sequence

<center>
<img src="Transformer_OriginalArchitecture.svg" width="500"/>
</center>

### More details on the Transformer architecture


Given $z\in\mathbb{R}^D$ and a function $f:\mathbb{R}^D\rightarrow \mathbb{R}^D$, a
**residual function** $R$ is defined as:

  $$
  R(z,f(z))=f(z)+z
  $$

Given a sequence of vectors $z_1,\dots,z_K$ from a layer with $z_k\in\mathbb{R}^D$ for $1\leq k\leq K$, a **layer normalization** $N$ is [2]
$$
N(z_1,\dots,z_K)=(\bar{z}_1,\dots,\bar{z}_K)
$$
such that
$$
\bar{z}_{k,i}= \gamma_i~ \frac{z_{k,i}-\mu_k}{\sigma_k}+\beta_i 
    ~~~~1\leq k\leq K, ~1\leq i\leq D
$$
where $\gamma\in\mathbb{R}^D$ and $\beta\in\mathbb{R}^D$ are learned parameters, and the statistics are computed over the $D$ components of each vector
$$
\mu_k=\frac{\sum_{i=1}^D z_{k,i}}{D}~~~\text{and}~~~
\sigma_k^2=\frac{\sum_{i=1}^D (z_{k,i}-\mu_k)^2}{D}~~~ 1\leq k\leq K
$$
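
The two building blocks can be sketched in plain Python. The sketch follows the usual form of layer normalization in [2], where each vector is normalized over its own $D$ components and $\gamma$, $\beta$ are learned vectors; the small constant `eps` is an added assumption for numerical stability, and the function names are hypothetical.

```python
import math

def residual(z, f):
    """R(z, f(z)) = f(z) + z, applied component-wise."""
    return [fz + zi for fz, zi in zip(f(z), z)]

def layer_norm(z, gamma, beta, eps=1e-5):
    """Normalize one vector z over its D components, then scale and shift."""
    D = len(z)
    mu = sum(z) / D
    var = sum((x - mu) ** 2 for x in z) / D
    return [g * (x - mu) / math.sqrt(var + eps) + b
            for x, g, b in zip(z, gamma, beta)]

z = [1.0, 2.0, 3.0, 4.0]
z_bar = layer_norm(z, gamma=[1.0] * 4, beta=[0.0] * 4)
z_res = residual(z, lambda v: [2.0 * x for x in v])
```

With $\gamma=1$ and $\beta=0$ the normalized vector has zero mean and (up to `eps`) unit variance, whatever the scale of the input.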


<p style="page-break-after:always;"></p>

## Training

Given a training set of N bilingual pairs, maximize the log-likelihood
$$
\widehat{W} = \operatorname*{argmax}_{W} \sum_{x_n,y_n} \sum_{i=1}^I \log p(y_{ni}\mid y_{n1}^{i-1},x_{n1}^J;W)
$$
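
To make the objective concrete, the log-likelihood of a small corpus can be computed from the per-position distributions produced by the model. The sketch below assumes those distributions are already available as lists of probabilities; the function names and the toy numbers are hypothetical.

```python
import math

def sentence_log_likelihood(step_probs, target_ids):
    """Sum over i of log p(y_i | y_1^{i-1}, x_1^J; W) for one sentence pair.

    step_probs[i][y] is the model probability of token y at target position i.
    """
    return sum(math.log(probs[y]) for probs, y in zip(step_probs, target_ids))

def corpus_log_likelihood(pairs):
    """Sum of sentence log-likelihoods over all training pairs."""
    return sum(sentence_log_likelihood(p, y) for p, y in pairs)

# One toy sentence with two target positions and a vocabulary of two tokens.
step_probs = [[0.5, 0.5], [0.25, 0.75]]
target_ids = [0, 1]
ll = sentence_log_likelihood(step_probs, target_ids)  # log(0.5) + log(0.75)
```

In practice the negative of this quantity (the cross-entropy loss) is minimised with a gradient-based optimiser.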

Optimisation based on backpropagation using stochastic gradient descent:
  * Adam optimisation [3]
  * Early stopping
  * Data augmentation
  * Hyperparameter tuning: Word embedding dimensions, number of layers, learning rate, dropout, weight decay

<p style="page-break-after:always;"></p>

## Decoding: Beyond Beam Search [4]

For a source sentence $x_1^J$ search for a target sentence
$$
\begin{align}
\hat{y}_1^{\hat{I }} &= \operatorname*{argmax}_{y_1^I} \prod_{i=1}^I p(y_i\mid y_1^{i-1},x_1^J)
\end{align}
$$
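
Since the argmax cannot be computed exactly, it is usually approximated with beam search. The following is a simplified sketch (no length normalisation, finished hypotheses are set aside as soon as they end); `next_log_probs` and `toy_model` are hypothetical stand-ins for a trained model.

```python
import math

def beam_search(next_log_probs, beam_size, max_len, eos):
    """Return the best (tokens, log_prob) hypothesis found.

    next_log_probs(prefix) maps each candidate token to log p(token | prefix, x).
    """
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = [
            (prefix + (token,), score + logp)
            for prefix, score in beams
            for token, logp in next_log_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])

def toy_model(prefix):
    """Hand-written distribution, used only to exercise the search."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"a": math.log(0.4), "</s>": math.log(0.6)}
    return {"</s>": math.log(1.0)}

greedy, greedy_score = beam_search(toy_model, beam_size=1, max_len=5, eos="</s>")
best, best_score = beam_search(toy_model, beam_size=2, max_len=5, eos="</s>")
```

In this toy example a beam of size 1 commits to `a` and ends with probability 0.36, whereas a beam of size 2 finds `b </s>` with probability 0.4, which shows that a narrow beam can miss the most likely output.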

Beam search tends to emphasise empty outputs and usually fails to find the optimal output

Beam search remains the best option only for reference-based metrics

Aim: replace beam search with a more powerful, metric-driven search technique

Define a value function that estimates the eventual (heuristic) score from an unfinished translation

First approach: linear combination of maximum likelihood and value function for one future word

Monte-Carlo Tree Search (MCTS) aimed at optimising the metric of interest via a value function (or
the metric itself when available) that can explore further in the future (longer sequences of future words).

For every decoding step, a fixed budget of simulations is allocated to build a tree of possible future sequences

MCTS outperformed beam search for reference-less scores


<p style="page-break-after:always;"></p>

## Additional bibliography

<ol>
<li><a href="https://arxiv.org/pdf/1706.03762" target="_blank">A. Vaswani et al. Attention Is All You Need, NIPS 2017.</a></li>
<li><a href="https://proceedings.mlr.press/v119/xiong20b/xiong20b.pdf" target="_blank">R. Xiong et al. On Layer Normalization in the Transformer Architecture, ICML 2020.</a></li>
<li><a href="https://arxiv.org/pdf/1412.6980" target="_blank">D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization, ICLR 2015.</a></li>
<li><a href="https://aclanthology.org/2021.emnlp-main.662.pdf" target="_blank">R. Leblond et al. Machine Translation Decoding beyond Beam Search, EMNLP 2021.</a></li>
</ol>

<p style="page-break-after:always;"></p>