## Framework

Neural MT models $P(y \mid x)$ with a neural network (NN) in order to compute

$$\hat{y} = \operatorname*{arg\,max}_{y} P(y \mid x)$$

where $x$ is a source sentence and $y$ is a target sentence.
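
As a minimal sketch of the decision rule above, the snippet below picks the highest-scoring target sentence from a small, explicitly enumerated candidate set. The candidate sentences, their probabilities, and the `decode` helper are hypothetical and only illustrate the argmax. A real system cannot enumerate every possible $y$ and has to search the space approximately.

```python
import math

# Hypothetical toy values of log P(y | x) for one fixed source sentence x.
# In a real system these scores would come from the neural network.
candidate_log_probs = {
    "the house is small": math.log(0.6),
    "the home is small": math.log(0.3),
    "small is the house": math.log(0.1),
}

def decode(log_probs):
    """Return the candidate y with the highest log P(y | x)."""
    return max(log_probs, key=log_probs.get)

y_hat = decode(candidate_log_probs)
print(y_hat)
```

Working with log-probabilities does not change the argmax, because the logarithm is monotonically increasing.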

## Preliminaries

Processing a sequence of words with a NN is challenging because words are discrete symbols.

Early work was restricted to very small vocabularies (a few tens of words):
  * Language modelling with Elman (recurrent) NN [1]
  * Machine translation with RNN [2]

Word representations need to be mapped to a continuous space before a NN can process them
  * Attempts based on bag-of-words, latent semantic indexing, word classes, clustering, etc. were unsuccessful
  * Learning a feature vector for each word together with the task 

Learning a word feature vector, a.k.a. word embedding, jointly with the language modelling task [3]
  * Feedforward NN achieved a 20% relative perplexity reduction w.r.t. n-grams
  * Maximise log-likelihood = minimise cross-entropy = minimise perplexity
  * Still limited vocabulary size (tens of thousands) and running words (a few million)
  * A few tens of hidden units -> training time: one week per epoch on 40 CPUs
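
The equivalence between the three training criteria can be checked numerically. For $N$ test words, the cross-entropy is the log-likelihood divided by $-N$, and the perplexity is the exponential of the cross-entropy, so all three rank models in the same way. The sketch below assumes natural logarithms. The two lists of per-word probabilities are made-up values, not results from [3].

```python
import math

def log_likelihood(probs):
    """Sum of log-probabilities the model assigns to the observed words."""
    return sum(math.log(p) for p in probs)

def cross_entropy(probs):
    """Average negative log-probability per word (in nats)."""
    return -log_likelihood(probs) / len(probs)

def perplexity(probs):
    """Exponential of the cross-entropy."""
    return math.exp(cross_entropy(probs))

# Hypothetical probabilities that two models assign to the same four test words
model_a = [0.2, 0.1, 0.25, 0.05]
model_b = [0.25, 0.125, 0.3, 0.0625]

for name, probs in (("A", model_a), ("B", model_b)):
    print(name, log_likelihood(probs), cross_entropy(probs), perplexity(probs))
```

A model that assigns probability $1/k$ to every word has perplexity $k$, which is why perplexity is often read as an effective branching factor.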

<img src="MLPLM.svg" width="500"/>


In [35]:
import graphviz; graphviz.Source('''
digraph { 
    concentrate=True;
    rankdir=BT;
    node [shape=record];
    WE [label="Word embedding\n|{output:|input:}|{{m}|{V}}"];
    HL [label="MLP (Hidden layer weights)\n|{output:|input:}|{{h}|{m · (n-1)}}"];
    OL [label="Softmax (Output weights)\n|{output:|input:}|{{V}|{h + m · (n-1)}}"];
    WE -> HL
    HL -> OL
    WE -> OL
    node [shape=circle];
    wb [label=<W<sub>i-n+1</sub>>,fixedsize=true,width=0.7];
    wm [label="...",fixedsize=true,width=0.7];
    we [label=<W<sub>i-1</sub>>,fixedsize=true,width=0.7];
    wo  [label=<W<sub>i</sub>>,fixedsize=true,width=0.7];
    wb -> WE
    wm -> WE
    we -> WE
    OL  -> wo
}''').render(filename='MLPLM', format='svg');
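
The layer sizes in the diagram can be made concrete with a forward pass written in plain Python. This is only a sketch under stated assumptions: the toy sizes, the random initialisation, the `tanh` activation, and the names `C`, `H`, `U` and `next_word_distribution` are illustrative choices, and biases are omitted for brevity. The output layer receives both the hidden units and the concatenated embeddings, matching the direct connection drawn from the embedding layer to the softmax.

```python
import math
import random

random.seed(0)

# Toy sizes (hypothetical): vocabulary V, embedding size m, hidden units h, n-gram order n
V, m, h, n = 10, 4, 5, 3

def rand_matrix(rows, cols):
    """Small random weights, stored as a list of rows."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    """Matrix-vector product for list-based matrices."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

C = rand_matrix(V, m)                  # word embedding: V -> m
H = rand_matrix(h, m * (n - 1))        # hidden layer: m * (n-1) -> h
U = rand_matrix(V, h + m * (n - 1))    # output layer: h + m * (n-1) -> V

def next_word_distribution(context):
    """Map n-1 context word indices to a distribution over the V next words."""
    assert len(context) == n - 1
    x = [value for w in context for value in C[w]]   # concatenated embeddings
    hidden = [math.tanh(a) for a in matvec(H, x)]
    logits = matvec(U, hidden + x)                   # hidden units plus direct connections
    peak = max(logits)                               # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = next_word_distribution([3, 7])
print(len(probs), sum(probs))
```

During training, the embedding table `C` would be updated together with `H` and `U`, which is what learning the word feature vectors jointly with the language modelling task refers to.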

Language modelling on real tasks with RNN [4]
  * From backpropagation to backpropagation through time
  * Various optimisations to scale up running words in training (hundreds of millions)
  * Relative reduction of 20% in state-of-the-art ASR tasks
  
  <img src="RNNLM.svg" width="700"/>

In [36]:
import graphviz as G

# boolean variables to denote dense or sparse connections between layers
DENSE = True
SPARSE = False


TIMESTEPS = 5
TIME_OFFSET = 3

unrolled = G.Digraph(node_attr={'shape':'circle', 'fixedsize':'true'}, graph_attr={'style':'invis', 'rankdir':'BT', 'color':'transparent'})

i=0
for step in range(TIMESTEPS+2):
    if step == 0 or step == TIMESTEPS+1:
        with unrolled.subgraph(name='cluster_'+str(i)) as c:
            c.node('a'+str(step), '', color='transparent')
            c.node('b'+str(step), '...', color='transparent')
            c.node('c'+str(step), '...', color='transparent') 
            c.node('d'+str(step), '...', color='transparent')
            c.edge('a'+str(step), 'b'+str(step), style='invis') 
            c.edge('b'+str(step), 'c'+str(step), style='invis')
            c.edge('c'+str(step), 'd'+str(step), style='invis')
    else:
        with unrolled.subgraph(name='cluster_'+str(i)) as c:
            c.node('a'+str(step), '', color='transparent');
            c.node('b'+str(step), 'WE')
            #c.node('c'+str(step), 't'+'{:=+d}'.format(TIME_OFFSET-step) if TIME_OFFSET-step else 't')
            c.node('c'+str(step), '')
            c.node('d'+str(step), 'SM');
            c.node('e'+str(step), '', color='transparent');
            c.edge('a'+str(step), 'b'+str(step), label='<w<sub>'+'t'+'{:=+d}'.format(TIME_OFFSET-step)+'</sub>>' if TIME_OFFSET-step else '<w<sub>'+'t'+'</sub>>'); 
            c.edge('b'+str(step), 'c'+str(step), label='<w<sub>'+'t'+'{:=+d}'.format(TIME_OFFSET-step)+'</sub>>' if TIME_OFFSET-step else '<w<sub>'+'t'+'</sub>>'); 
            c.edge('c'+str(step), 'd'+str(step), label='<y<sub>'+'t'+'{:=+d}'.format(TIME_OFFSET-step)+'</sub>>' if TIME_OFFSET-step else '<y<sub>'+'t'+'</sub>>');
            c.edge('d'+str(step), 'e'+str(step), label='');

for step in range(1, TIMESTEPS+2):
    unrolled.edge('c'+str(step-1), 'c'+str(step), label='<h<sub>'+'t'+'{:=+d}'.format(TIME_OFFSET-step)+'</sub>>' if TIME_OFFSET-step else '<h<sub>'+'t'+'</sub>>', constraint='false', dir='back', color='black')

unrolled.render(filename='RNNLM', format='svg');
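
The unrolled figure corresponds to applying the same step function once per word while carrying the hidden state forward. The sketch below follows the figure's layout (word embedding, recurrent hidden layer, softmax). The toy sizes, the `tanh` activation, the random weights, and all names are assumptions for illustration and are not taken from [4]. Only the forward pass is shown. Backpropagation through time would send gradients back along the same chain of hidden states.

```python
import math
import random

random.seed(0)

# Toy sizes (hypothetical): vocabulary V, embedding size m, hidden units h
V, m, h = 8, 3, 4

def rand_matrix(rows, cols):
    """Small random weights, stored as a list of rows."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    """Matrix-vector product for list-based matrices."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

E = rand_matrix(V, m)        # word embedding ("WE" in the figure)
W_in = rand_matrix(h, m)     # embedding -> hidden
W_rec = rand_matrix(h, h)    # previous hidden state -> hidden
W_out = rand_matrix(V, h)    # hidden -> softmax ("SM" in the figure)

def rnn_lm(words):
    """Return one next-word distribution y_t for each input word w_t."""
    state = [0.0] * h        # initial hidden state
    outputs = []
    for w in words:
        pre = [a + b for a, b in zip(matvec(W_in, E[w]), matvec(W_rec, state))]
        state = [math.tanh(p) for p in pre]          # h_t depends on w_t and h_{t-1}
        outputs.append(softmax(matvec(W_out, state)))
    return outputs

ys = rnn_lm([1, 5, 2, 2])
print(len(ys), [round(sum(y), 6) for y in ys])
```

Unlike the feedforward model, the context is not limited to a fixed window of n-1 words: the last two inputs above are the same word, yet their output distributions differ because the hidden state carries the earlier history.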

## Additional bibliography

<ol>
<li><a href="https://onlinelibrary.wiley.com/doi/epdf/10.1207/s15516709cog1402_1" target="_blank">J. Elman. Finding Structure in Time, Cognitive Science 1990.</a></li>
<li><a href="https://www.isca-archive.org/eurospeech_1997/castano97_eurospeech.pdf" target="_blank">M.A. Castaño and F. Casacuberta. A Connectionist Approach to Machine Translation, EuroSpeech 1997.</a></li>
<li><a href="https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf" target="_blank">Y. Bengio et al. A Neural Probabilistic Language Model, Journal of Machine Learning Research 2003.</a></li>
<li><a href="https://www.fit.vut.cz/study/phd-thesis-file/283/283.pdf" target="_blank">T. Mikolov. Statistical Language Models based on Neural Networks, Ph.D. Thesis 2012.</a></li>
</ol>