## Recurrent Neural Machine Translation

From LM to MT: from monolingual to cross-lingual word prediction

Early proposals with synchronized RNNs [1]

Non-synchronized RNNs: encoder-decoder architecture [2]

<img src="Sutskever2014_EncoderDecoder.svg" width="1000"/>

A source LM (encoder) whose hidden state is transferred to a target LM (decoder)

<p style="page-break-after:always;"></p>

## Decoding (Beam Search)

For a source sentence $x_1^J$ search for a target sentence
$$
\begin{align}
\hat{y}_1^{\hat{I }} &= \operatorname*{argmax}_{y_1^I} P(y_1^I\mid x_1^J)\\
                     &= \operatorname*{argmax}_{y_1^I} \prod_{i=1}^I p(y_i\mid y_1^{i-1},x_1^J)\\
                     &= \operatorname*{argmax}_{y_1^I} \prod_{i=1}^I p(y_i\mid y_1^{i-1},u(x_1^J))
\end{align}
$$
where $u(x_1^J)$ is a hidden-state representation of $x_1^J$.
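
As a small illustration of the factorization above, the probability of a full translation is the product of the per-step probabilities, which in practice is accumulated as a sum of logs to avoid underflow. The per-step values below are made up for the example; in a real system they would come from the decoder.

```python
import math

def sequence_log_prob(step_probs):
    """Sum of log p(y_i | y_1^{i-1}, u(x_1^J)) over all target positions."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical per-step probabilities for a 3-token translation.
step_probs = [0.5, 0.4, 0.25]
log_p = sequence_log_prob(step_probs)
prob = math.exp(log_p)  # same value as the product 0.5 * 0.4 * 0.25, up to rounding
```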

Search for the most likely translation using a simple left-to-right beam search decoder

At each timestep, each partial hypothesis (prefix translation) is extended with every word in the vocabulary (in practice, with its $k$ most probable tokens).

This results in at most $k \times k$ new prefixes, each one token longer.

Among these, the $k$ prefixes with the highest likelihood are kept

When the symbol &lt;eos&gt; appears in a hypothesis, it is considered to be complete
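
The procedure above can be sketched in a few lines of Python. This is a minimal sketch, not a production decoder: `next_token_probs` is a hypothetical stand-in for the neural decoder, `toy_model` is a made-up distribution, and no length normalization is applied when comparing complete hypotheses.

```python
import math

EOS = "<eos>"

def beam_search(next_token_probs, k=2, max_len=10):
    """Left-to-right beam search.

    next_token_probs(prefix) is a hypothetical decoder interface that
    returns a dict {token: p(token | prefix, source)}.
    """
    beam = [((), 0.0)]  # (prefix, log-probability)
    complete = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            probs = next_token_probs(prefix)
            # Extend each prefix with its k most probable tokens only.
            best = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
            for token, p in best:
                if p > 0:
                    candidates.append((prefix + (token,), score + math.log(p)))
        # At most k * k candidates; keep the k most likely ones.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for prefix, score in candidates[:k]:
            if prefix[-1] == EOS:
                complete.append((prefix, score))  # hypothesis is finished
            else:
                beam.append((prefix, score))
        if not beam:
            break
    return max(complete or beam, key=lambda c: c[1])

def toy_model(prefix):
    # Made-up next-token distributions, for illustration only.
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.3, EOS: 0.2},
        ("b",): {EOS: 0.9, "x": 0.1},
    }
    return table.get(prefix, {EOS: 1.0})

best, score = beam_search(toy_model, k=2)
```

In this toy case a greedy decoder would start with `a` and finish with probability 0.3, whereas the beam of size 2 keeps `b` alive and returns `b <eos>` with probability 0.36.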


<p style="page-break-after:always;"></p>

### Results on WMT14 English-to-French [2]:

|Systems|BLEU|
|:------|---:|
|Baseline (PB+Neural LM) | 33.3|
|Ensemble of 5 LSTMs | 34.8|
|Rescoring 1000-best with ensemble of 5 LSTMs | 36.5|
|Moses for WMT14 | 37.0 |

<p style="page-break-after:always;"></p>

## RNN with attention (Bahdanau et al.) [3]

Conventional RNNs *squash* all the source sentence information into a fixed-length vector

RNNs tend to better represent recent inputs

RNNs with attention encode the input sentence into a sequence of vectors $h_1^J$

They select a subset of these vectors $h_1^J$ adaptively (attention mechanism) while decoding
$$
\hat{y}_1^{\hat{I }} = \operatorname*{argmax}_{y_1^I} \prod_{i=1}^I p(y_i\mid y_1^{i-1},u_i)
$$
where 
$$
\begin{align}
\notag u_i &= a(h^e_1, \cdots, h^e_J, h^d_{i-1})\\
\notag     &= \sum_{j=1}^{J} \alpha_j(h^e_1, \cdots, h^e_J, h^d_{i-1})\,h^e_j
\end{align}    
$$
where the weight on $h^e_j$ is a similarity measure (alignment) between the encoder state at position $j$ and the decoder state at position $i-1$.

<p style="page-break-after:always;"></p>

Bahdanau et al. define this similarity measure as a probability distribution over source positions
$$
\begin{align}
\alpha_j(h^e_1, \cdots, h^e_J, h^d_{i-1}) &= \alpha_{ij}\\
                         &= \frac{\exp(e_{ij})}{\sum_{k=1}^{J} \exp(e_{ik})}\\
                         &= \mathcal{S}(e_{ij})
\end{align}
$$
with
$$
e_{ij} = v_a^t \tanh \left( W_a h^d_{i-1} + U_a h^e_j \right)
$$
where $v_a \in \mathbb{R}^{n}$, $W_a \in \mathbb{R}^{n \times n}$ and $U_a \in \mathbb{R}^{n \times 2n}$
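
A minimal pure-Python sketch of one attention step may help to see how the pieces fit together: scores $e_{ij}$, softmax weights $\alpha_{ij}$ and the context vector $u_i$. The helper names and the toy parameter values are assumptions made for illustration; $v_a$ is taken as a vector of length $n$ so that each score is a scalar.

```python
import math

def matvec(M, v):
    """Matrix-vector product with M given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(h_dec_prev, enc_states, v_a, W_a, U_a):
    """Return (alpha_i, u_i) for a single decoder step i."""
    Wh = matvec(W_a, h_dec_prev)  # W_a h^d_{i-1}, shared by all source positions
    scores = []
    for h_enc in enc_states:
        Uh = matvec(U_a, h_enc)   # U_a h^e_j
        hidden = [math.tanh(a + b) for a, b in zip(Wh, Uh)]
        scores.append(sum(v * t for v, t in zip(v_a, hidden)))  # e_ij
    alpha = softmax(scores)       # alpha_ij, sums to 1 over j
    dim = len(enc_states[0])
    # u_i is the alpha-weighted sum of the encoder states.
    u = [sum(a * h[d] for a, h in zip(alpha, enc_states)) for d in range(dim)]
    return alpha, u

# Toy sizes (assumed): n = 2, so encoder states have 2n = 4 components.
v_a = [1.0, -1.0]
W_a = [[0.5, 0.0], [0.0, 0.5]]
U_a = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0]]
h_dec_prev = [1.0, 0.0]
enc_states = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
alpha, u = additive_attention(h_dec_prev, enc_states, v_a, W_a, U_a)
```

Since $W_a h^d_{i-1}$ does not depend on $j$, it is computed once per decoder step and reused for every source position.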

Consequently, the hidden state of the RNN decoder $h^d_i$ is enriched with $u_i$ as
$$
h^d_i = F(h^d_{i-1}, y_{i-1}, u_i)
$$
and the output layer, which computes the probability distribution over the vocabulary, also receives $u_i$ and, through a direct connection, the previous target word $y_{i-1}$
$$
p(y_i\mid y_1^{i-1},u_i) = g(h^d_i, y_{i-1}, u_i)
$$
Note that the encoder hidden state $h^e_j$ is the concatenation of the hidden states of a forward and a backward RNN, which is why $U_a$ has $2n$ columns.

<img src="Bahdanau2015_EncoderDecoder.svg" width="800"/>

## Results PBMT vs NMT (2017)
<center>
<img src="PBMT-NMT2017.png" width="600"/>
</center>

## RNN with attention (Luong et al.) [4]

Simplification and generalisation over [3]

  * Unidirectional encoder
  * Alternative attention mechanism: $e_{ij} = \left\{ \begin{matrix} \left(h_i^d\right)^t \, h^e_j\\\left(h_i^d\right)^t \, W_a \, h^e_j\\ v_a^t \tanh\left(W_a \, [h^d_i;h^e_j]\right) \end{matrix} \right.$
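
The first two scoring functions in the list above are simple enough to sketch in pure Python. The function names and toy vectors are assumptions for illustration, and both forms require decoder and encoder states of compatible sizes, which holds here because the encoder is unidirectional.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    """Matrix-vector product with M given as a list of rows."""
    return [dot(row, v) for row in M]

def score_dot(h_dec, h_enc):
    """First form: plain dot product of decoder and encoder states."""
    return dot(h_dec, h_enc)

def score_general(h_dec, h_enc, W_a):
    """Second form: bilinear score with a learned matrix W_a."""
    return dot(h_dec, matvec(W_a, h_enc))

# Toy vectors and matrices, made up for the example.
h_dec = [1.0, 2.0]
h_enc = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]

s_dot = score_dot(h_dec, h_enc)                  # 1*3 + 2*4
s_general_id = score_general(h_dec, h_enc, identity)
s_general_swap = score_general(h_dec, h_enc, swap)
```

With $W_a$ set to the identity the second form reduces to the first, so it can be read as a learned generalization of the dot product.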

All in all, in 2014, NMT systems were still behind SOTA phrase-based systems

<p style="page-break-after:always;"></p>

<img src="Luong2015_EncoderDecoder.svg" width="1000"/>

<p style="page-break-after:always;"></p>

## Additional bibliography

<ol>
<li><a href="https://www.isca-archive.org/eurospeech_1997/castano97_eurospeech.pdf" target="_blank">M.A. Castaño and F. Casacuberta. A Connectionist Approach to Machine Translation, EuroSpeech 1997.</a></li>
<li><a href="https://papers.nips.cc/paper_files/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf" target="_blank">I. Sutskever et al. Sequence to Sequence Learning with Neural Networks, NIPS 2014.</a></li>
<li><a href="https://arxiv.org/pdf/1409.0473" target="_blank">D. Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015.</a></li>
<li><a href="https://aclanthology.org/D15-1166.pdf" target="_blank">M. Luong et al. Effective Approaches to Attention-based Neural Machine Translation, EMNLP 2015.</a></li>
</ol>

<p style="page-break-after:always;"></p>