# Recurrent Neural Machine Translation

From LM to MT: from monolingual to cross-lingual word prediction

Early proposals with synchronized RNNs [1]

Non-synchronized RNN: encoder-decoder architecture [2]

<img src="Sutskever2014_EncoderDecoder.svg" width="1000"/>

A source LM (encoder) whose hidden state is transferred to a target LM (decoder)
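
The hand-over of the hidden state can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the Elman-style update `rnn_step`, the helpers `random_matrix` and `encode`, the toy sizes and the random weights are all hypothetical stand-ins, not the LSTM architecture of [2].

```python
import math
import random


def rnn_step(W, U, h, x):
    """One Elman-style update: h' = tanh(W h + U x)."""
    return [
        math.tanh(sum(w * hk for w, hk in zip(W_row, h))
                  + sum(u * xk for u, xk in zip(U_row, x)))
        for W_row, U_row in zip(W, U)
    ]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def encode(source_embeddings, W, U, n):
    """Read the whole source sentence and return the final hidden state."""
    h = [0.0] * n
    for x in source_embeddings:
        h = rnn_step(W, U, h, x)
    return h


rng = random.Random(0)
n, d = 4, 3  # toy hidden and embedding sizes (assumed values)
W_enc, U_enc = random_matrix(n, n, rng), random_matrix(n, d, rng)
W_dec, U_dec = random_matrix(n, n, rng), random_matrix(n, d, rng)

# Five random vectors stand in for the embeddings of "the house is blue ."
source = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(5)]
encoder_state = encode(source, W_enc, U_enc, n)

# The decoder does not start from zeros: it starts from the encoder's final state.
start_embedding = [0.0] * d  # placeholder embedding for the <s> symbol
decoder_state = rnn_step(W_dec, U_dec, encoder_state, start_embedding)
```

Encoder and decoder keep separate weights; the only link between them in this sketch is `encoder_state`.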

```python
import graphviz as G

# boolean variables to denote dense or sparse connections between layers
DENSE = True
SPARSE = False


TIMESTEPS = 11
TIME_OFFSET = 3
words=['&lt;s&gt;','the','house','is','blue','.','&lt;/s&gt;','la','casa','es','blanca','&lt;/s&gt;']

unrolled = G.Digraph(node_attr={'shape':'circle', 'fixedsize':'true'}, graph_attr={'style':'invis', 'rankdir':'BT', 'color':'transparent'})

i=0
for step in range(TIMESTEPS+2):
    if step == 0 or step == TIMESTEPS+1:
        # invisible padding columns at both ends of the unrolled network
        with unrolled.subgraph(name='cluster_'+str(i)) as c:
            c.node('a'+str(step), '', color='transparent')
            c.node('b'+str(step), '', color='transparent')
            c.node('c'+str(step), '', color='transparent')
            c.node('d'+str(step), '', color='transparent')
            c.edge('a'+str(step), 'b'+str(step), style='invis')
            c.edge('b'+str(step), 'c'+str(step), style='invis')
            c.edge('c'+str(step), 'd'+str(step), style='invis')
    else:
        # one column per timestep: input word -> WE -> hidden state -> SM -> output word
        with unrolled.subgraph(name='cluster_'+str(i)) as c:
            c.node('a'+str(step), words[TIMESTEPS-step], color='transparent')
            c.node('b'+str(step), 'WE')
            #c.node('c'+str(step), 't'+'{:=+d}'.format(TIME_OFFSET-step) if TIME_OFFSET-step else 't')
            c.node('c'+str(step), '')
            c.node('d'+str(step), 'SM')
            #c.node('e'+str(step), '<w<sub>'+'t'+'{:=+d}'.format(TIME_OFFSET-step+1)+'</sub>>' if TIME_OFFSET-step+1 else '<w<sub>'+'t'+'</sub>>', color='transparent');
            c.node('e'+str(step), words[TIMESTEPS-step+1], color='transparent')
            c.edge('a'+str(step), 'b'+str(step), label='<w<sub>'+'t'+'{:=+d}'.format(TIME_OFFSET-step)+'</sub>>' if TIME_OFFSET-step else '<w<sub>'+'t'+'</sub>>')
            c.edge('b'+str(step), 'c'+str(step), label='<e<sub>'+'t'+'{:=+d}'.format(TIME_OFFSET-step)+'</sub>>' if TIME_OFFSET-step else '<e<sub>'+'t'+'</sub>>')
            c.edge('c'+str(step), 'd'+str(step), label='<h<sub>'+'t'+'{:=+d}'.format(TIME_OFFSET-step)+'</sub>>' if TIME_OFFSET-step else '<h<sub>'+'t'+'</sub>>')
            c.edge('d'+str(step), 'e'+str(step), label='<y<sub>'+'t'+'{:=+d}'.format(TIME_OFFSET-step+1)+'</sub>>' if TIME_OFFSET-step+1 else '<y<sub>'+'t'+'</sub>>')

# recurrent connections between the hidden states of consecutive timesteps
for step in range(1, TIMESTEPS+1):
    unrolled.edge('c'+str(step-1), 'c'+str(step), label='<s<sub>'+'t'+'{:=+d}'.format(TIME_OFFSET-step)+'</sub>>' if TIME_OFFSET-step else '<s<sub>'+'t'+'</sub>>', constraint='false', dir='back', color='black')

unrolled.render(filename='RNNMT', format='svg')
```



For a source sentence $x_1^J$, search for the target sentence
$$
\begin{align}
\hat{y}_1^{\hat{I}} &= \operatorname*{arg\,max}_{y_1^I} P(y_1^I\mid x_1^J)\\
                    &= \operatorname*{arg\,max}_{y_1^I} \prod_{i=1}^I p(y_i\mid y_1^{i-1},x_1^J)\\
                    &= \operatorname*{arg\,max}_{y_1^I} \prod_{i=1}^I p(y_i\mid y_1^{i-1},c(x_1^J))
\end{align}
$$
where $c(x_1^J)$ is a hidden-state representation of $x_1^J$.
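
As a small worked example of the product over target positions, the snippet below scores one candidate translation from its per-word conditional probabilities. The four probabilities are invented for illustration and `sequence_log_prob` is a hypothetical helper; summing logarithms instead of multiplying probabilities is a common way to avoid numerical underflow on long sentences.

```python
import math


def sequence_log_prob(step_probs):
    """log P(y | x) as the sum over i of log p(y_i | y_1^{i-1}, c(x))."""
    return sum(math.log(p) for p in step_probs)


# Invented values of p(y_i | y_1^{i-1}, c(x)) for a four-word target sentence.
step_probs = [0.5, 0.4, 0.9, 0.8]

log_p = sequence_log_prob(step_probs)
prob = math.exp(log_p)  # same value as 0.5 * 0.4 * 0.9 * 0.8
```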

Search for the most likely translation using a simple left-to-right beam search decoder

At each timestep, each partial hypothesis (a translation prefix) is extended with every word in the vocabulary.

Only a small number $B$ of the most probable partial hypotheses are maintained

When the end-of-sentence symbol &lt;/s&gt; is appended to a hypothesis, it is considered to be complete
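
The procedure can be sketched as follows. This is a minimal illustration, not the decoder of [2]: `beam_search`, `next_word_probs` and `TOY_MODEL` are hypothetical names, and the probability table is invented so that a wider beam finds a better translation than a greedy search.

```python
import math

BOS, EOS = "<s>", "</s>"


def beam_search(next_word_probs, beam_size, max_len):
    """Left-to-right beam search.

    next_word_probs(prefix) must return a dict mapping each word to
    p(word | prefix); conditioning on the source sentence is left implicit.
    Returns the best complete hypothesis as (log-probability, words), or None.
    """
    beam = [(0.0, (BOS,))]  # partial hypotheses
    complete = []           # hypotheses that already end in EOS
    for _ in range(max_len):
        # Extend every partial hypothesis with every word in the vocabulary.
        candidates = [
            (score + math.log(p), prefix + (word,))
            for score, prefix in beam
            for word, p in next_word_probs(prefix).items()
            if p > 0.0
        ]
        # Keep only the B most probable extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, prefix in candidates[:beam_size]:
            if prefix[-1] == EOS:
                complete.append((score, prefix))
            else:
                beam.append((score, prefix))
        if not beam:
            break
    return max(complete, key=lambda c: c[0]) if complete else None


# Invented next-word distributions; unseen prefixes always end the sentence.
TOY_MODEL = {
    (BOS,): {"la": 0.6, "una": 0.4},
    (BOS, "la"): {"casa": 0.55, "caseta": 0.45},
    (BOS, "una"): {"casa": 0.9, "caseta": 0.1},
}


def toy_next_word_probs(prefix):
    return TOY_MODEL.get(prefix, {EOS: 1.0})


greedy = beam_search(toy_next_word_probs, beam_size=1, max_len=5)
wider = beam_search(toy_next_word_probs, beam_size=2, max_len=5)
```

With $B=1$ the search commits to the locally best first word ("la") and ends with probability 0.33, while $B=2$ keeps "una" alive and reaches the hypothesis with probability 0.36.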


Results on WMT14 English-to-French [2]:

|Systems|BLEU|
|:------|---:|
|Baseline (PB+Neural LM) | 33.3|
|Ensemble of 5 LSTMs | 34.8|
|Rescoring 1000-best with ensemble of 5 LSTMs | 36.5|
|Moses for WMT14 | 37.0 |

# RNN with attention (Bahdanau et al.) [3]

Conventional encoder-decoder RNNs *squash* all the source sentence information into a fixed-length vector

RNNs tend to better represent recent inputs

RNNs with attention encode the input sentence into a sequence of vectors $h_1^J$

They adaptively select a subset of these vectors $h_1^J$ (attention mechanism) while decoding

$$
\hat{y}_1^{\hat{I}} = \operatorname*{arg\,max}_{y_1^I} \prod_{i=1}^I p(y_i\mid y_1^{i-1},c_i)
$$
where
$$
c_i = \sum_{j=1}^{J} \alpha_{ij} h_j
$$
and
$$
\alpha_{ij} = \frac{\exp(a_{ij})}{\sum_{k=1}^{J} \exp(a_{ik})}
$$
with
$$
a_{ij} = v_a^t \tanh \left( W_a h_{i-1} + U_a h_j \right)
$$
where $h_{i-1}$ is the previous decoder hidden state, $h_j$ is the $j$-th encoder hidden state, $v_a \in \mathbb{R}^{n}$, $W_a \in \mathbb{R}^{n \times n}$ and $U_a \in \mathbb{R}^{n \times 2n}$
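
The three equations above can be traced step by step in plain Python. The sketch below assumes toy sizes ($n=2$, $J=3$) and random weights in place of trained parameters, and treats $v_a$ as a vector of size $n$ so that the inner product with the $\tanh$ output is defined; `matvec` and `additive_attention` are illustrative names.

```python
import math
import random


def matvec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def additive_attention(h_prev, enc_states, W_a, U_a, v_a):
    """Return (alphas, context) for one decoder step.

    a_ij     = v_a^t tanh(W_a h_{i-1} + U_a h_j)
    alpha_ij = softmax over j of a_ij
    c_i      = sum over j of alpha_ij h_j
    """
    Wh = matvec(W_a, h_prev)
    scores = []
    for h_j in enc_states:
        Uh = matvec(U_a, h_j)
        scores.append(sum(v * math.tanh(a + b) for v, a, b in zip(v_a, Wh, Uh)))
    # Softmax; subtracting the maximum is for numerical stability only.
    top = max(scores)
    exps = [math.exp(a - top) for a in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(enc_states[0])
    context = [sum(al * h_j[k] for al, h_j in zip(alphas, enc_states))
               for k in range(dim)]
    return alphas, context


rng = random.Random(0)
n, J = 2, 3  # toy decoder size and source length (assumed values)
W_a = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
U_a = [[rng.uniform(-1, 1) for _ in range(2 * n)] for _ in range(n)]
v_a = [rng.uniform(-1, 1) for _ in range(n)]
h_prev = [rng.uniform(-1, 1) for _ in range(n)]
enc_states = [[rng.uniform(-1, 1) for _ in range(2 * n)] for _ in range(J)]

alphas, context = additive_attention(h_prev, enc_states, W_a, U_a, v_a)
```

The weights `alphas` are positive and sum to one, so `context` is a convex combination of the encoder states and has their dimension ($2n$).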

Consequently, the hidden state of the RNN decoder $h_i$ is enriched with $c_i$ as
$$
h_i = f(h_{i-1}, y_{i-1}, c_i)
$$
and the output layer computing the probability over the vocabulary also incorporates $c_i$ and the residual connection $y_{i-1}$
$$
p(y_i\mid y_1^{i-1},c_i) = g(h_i, y_{i-1}, c_i)
$$
Note that the encoder hidden state $h_j$ is the concatenation of the hidden states of a forward and a backward RNN, which is why $U_a$ has $2n$ columns.
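
A minimal sketch of how such concatenated states can be built, assuming a toy element-wise recurrence with scalar weights instead of a trained RNN (`rnn_states`, `bidirectional_states`, `w_h` and `w_x` are hypothetical):

```python
import math


def rnn_states(xs, w_h, w_x):
    """Toy recurrence h_j = tanh(w_h * h_{j-1} + w_x * x_j), element-wise."""
    h = [0.0] * len(xs[0])
    states = []
    for x in xs:
        h = [math.tanh(w_h * hk + w_x * xk) for hk, xk in zip(h, x)]
        states.append(h)
    return states


def bidirectional_states(xs, w_h=0.5, w_x=1.0):
    """Concatenate forward and backward states position by position."""
    forward = rnn_states(xs, w_h, w_x)
    # Run over the reversed sentence, then restore the original order.
    backward = rnn_states(xs[::-1], w_h, w_x)[::-1]
    return [f + b for f, b in zip(forward, backward)]


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy input vectors of size 2
states = bidirectional_states(xs)
```

Each entry of `states` has twice the input size: the first half summarises the words up to position $j$, the second half the words from position $j$ to the end of the sentence.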

<img src="Bahdanau2015_EncoderDecoder.svg" width="1000"/>

# RNN with attention (Luong et al.) [4]

Simplification and generalisation over [3]

  * Unidirectional encoder
  * Alternative attention mechanisms: $a_{ij} = \begin{cases} h_i^t \, h_j & \text{(dot)}\\ h_i^t \, W_a \, h_j & \text{(general)}\\ v_a^t \tanh\left(W_a \, [h_i;h_j]\right) & \text{(concat)} \end{cases}$
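
The three scoring functions can be compared on toy vectors. In this sketch the function names and all numeric values are illustrative assumptions, and the concat variant is written as $v_a^t \tanh(W_a [h_i;h_j])$ so that it returns a scalar.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def score_dot(h_i, h_j):
    """a_ij = h_i^t h_j"""
    return dot(h_i, h_j)


def score_general(h_i, h_j, W_a):
    """a_ij = h_i^t W_a h_j"""
    return dot(h_i, matvec(W_a, h_j))


def score_concat(h_i, h_j, W_a, v_a):
    """a_ij = v_a^t tanh(W_a [h_i; h_j])"""
    return dot(v_a, [math.tanh(z) for z in matvec(W_a, h_i + h_j)])


h_i = [1.0, 2.0]  # toy decoder state
h_j = [3.0, 4.0]  # toy encoder state

a_dot = score_dot(h_i, h_j)                                  # 1*3 + 2*4
a_general = score_general(h_i, h_j, [[0.0, 1.0], [1.0, 0.0]])  # 1*4 + 2*3
a_concat = score_concat(
    h_i, h_j,
    W_a=[[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
    v_a=[1.0, 0.0],
)
```

The dot score needs no parameters but requires both states to have the same size; the general and concat scores add a learned $W_a$ (and $v_a$) that removes this restriction.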


<img src="Luong2015_EncoderDecoder.svg" width="1000"/>

All in all, in 2014 NMT systems were still behind state-of-the-art phrase-based systems

# Attention Is All You Need [5]



## Additional bibliography

<ol>
<li><a href="https://www.isca-archive.org/eurospeech_1997/castano97_eurospeech.pdf" target="_blank">M.A. Castaño and F. Casacuberta. A Connectionist Approach to Machine Translation, EuroSpeech 1997.</a></li>
<li><a href="https://papers.nips.cc/paper_files/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf" target="_blank">I. Sutskever et al. Sequence to Sequence Learning with Neural Networks, NIPS 2014.</a></li>
<li><a href="https://arxiv.org/pdf/1409.0473" target="_blank">D. Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015.</a></li>
<li><a href="https://aclanthology.org/D15-1166.pdf" target="_blank">M. Luong et al. Effective Approaches to Attention-based Neural Machine Translation, EMNLP 2015.</a></li>
<li><a href="https://arxiv.org/pdf/1706.03762" target="_blank">A. Vaswani et al. Attention Is All You Need, NIPS 2017.</a></li>
</ol>