## Framework

Neural MT models $P(y \mid x)$ with a neural network (NN) in order to compute

$$\hat{y} = \operatorname*{arg\,max}_{y} P(y \mid x)$$

where $x$ is a source sentence and $y$ is a target sentence.
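
As a toy illustration of the decision rule above, the sketch below picks the best of three candidate translations. The candidate sentences and their probabilities are hypothetical values, not the output of any real model; in practice the set of target sentences is far too large to enumerate and the search is approximated.

```python
import math

# Hypothetical scores log P(y | x) that a trained model might assign to
# three candidate translations of one source sentence (illustrative values).
candidate_log_probs = {
    "the house is small": math.log(0.6),
    "the home is little": math.log(0.3),
    "small the house is": math.log(0.1),
}

# Decoding picks the candidate with the highest conditional probability.
y_hat = max(candidate_log_probs, key=candidate_log_probs.get)
print(y_hat)
```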

## Preliminaries

Processing a sequence of words with a NN is challenging because of its discrete nature.

Earlier work was restricted to very small vocabularies (a few tens of words):
  * Language modelling with Elman (recurrent) NN [1]
  * Machine translation with RNN [2]

Need to reduce the vocabulary size in order to lower computational requirements
  * Conventional tokenisers: language dependent and not reversible
  * Unsupervised reversible tokenisers: Byte-Pair Encoding [7], SentencePiece [8], etc.

Need to map word representations to a continuous space so that they can be processed by a NN
  * Unsuccessful attempts using bag-of-words, latent semantic indexing, word classes, clustering, etc.
  * Learning a feature vector for each word jointly with the task


## Unsupervised reversible tokenisers

### Byte-Pair Encoding (BPE)
  * It starts from a symbol vocabulary made up of the individual characters
  * It iteratively merges the most frequent pair of adjacent symbols, up to a maximum number of merge operations

In [81]:
!echo -e  "low\nlowest\nlower\nnew\nnewest\nnewer\nwider" | tee train.txt
!subword-nmt learn-bpe --min-frequency 1 -s 3 < train.txt > codes.txt
!cat codes.txt; echo "-----------"
!subword-nmt apply-bpe -c codes.txt < train.txt 


low
lowest
lower
new
newest
newer
wider


100%|##########################################| 3/3 [00:00<00:00, 12180.94it/s]
#version: 0.2
w e
n e
l o
-----------
lo@@ w
lo@@ we@@ s@@ t
lo@@ we@@ r
ne@@ w
ne@@ we@@ s@@ t
ne@@ we@@ r
w@@ i@@ d@@ e@@ r
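
The merge procedure described in the bullets can be sketched in a few lines of plain Python. This is a simplified illustration, not the `subword-nmt` implementation: it ignores end-of-word markers, and ties between equally frequent pairs are broken by first occurrence, which is an assumption, so the order of merges after the first one may differ from the tool's output.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Return the list of learned merges (pairs of symbols)."""
    # Each word starts as a tuple of characters; count word frequencies.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair by the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

words = ["low", "lowest", "lower", "new", "newest", "newer", "wider"]
merges = learn_bpe(words, num_merges=3)
print(merges)
```

On this toy corpus the pair `w e` occurs four times and is merged first; `l o` and `n e` are tied at three occurrences each, so their relative order depends on the tie-breaking rule.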


## Unsupervised reversible tokenisers

### SentencePiece
  * Integrates the implementation of the unigram (*uni-subword*) language model [9] 
  * Capable of outputting multiple subword segmentations with their probabilities
  * The most probable subword segmentations can be considered for training and decoding
  * No prior tokenisation is needed

In [102]:
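import sentencepiece as spm

# Train a unigram model on train.txt with an 18-piece vocabulary
params = '--input=train.txt --model_prefix=train --vocab_size=18'
spm.SentencePieceTrainer.Train(params)
# Show the learned pieces with their log-probabilities
!cat train.vocab
# Load the model and list the 3 best segmentations of "lowest"
sp = spm.SentencePieceProcessor()
sp.Load('train.model')
sp.nbest_encode('lowest', nbest_size=3, out_type=str)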

<unk>	0
<s>	0
</s>	0
r	-1.94426
s	-2.44426
t	-2.44426
▁lowe	-2.47125
▁newe	-2.47126
e	-3.31599
▁new	-3.37828
▁low	-3.37829
o	-3.44406
n	-3.44416
d	-3.44426
i	-3.44426
l	-3.44426
w	-3.44426
▁	-3.44426


sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=train.txt --model_prefix=train --vocab_size=18
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: train.txt
  input_format: 
  model_prefix: train
  model_type: UNIGRAM
  vocab_size: 18
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  p

[['▁lowe', 's', 't'],
 ['▁low', 'e', 's', 't'],
 ['▁', 'l', 'o', 'w', 'e', 's', 't']]
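
To see why `['▁lowe', 's', 't']` is ranked first, the sketch below scores segmentations under a unigram model, where the score of a segmentation is the sum of the log-probabilities of its pieces. The piece scores are copied from the `train.vocab` listing; the Viterbi function is a minimal illustration written for this example, not the SentencePiece implementation.

```python
# Piece log-probabilities copied from the train.vocab listing
# (only the pieces relevant to "lowest" are included).
scores = {
    "▁lowe": -2.47125, "▁low": -3.37829, "▁": -3.44426,
    "l": -3.44426, "o": -3.44406, "w": -3.44426,
    "e": -3.31599, "s": -2.44426, "t": -2.44426,
}

def best_segmentation(text, scores):
    """Viterbi search for the highest-scoring split of text into pieces."""
    # best[i] = (score, pieces) of the best segmentation of text[:i]
    best = [(0.0, [])] + [(float("-inf"), None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in scores and best[start][1] is not None:
                candidate = best[start][0] + scores[piece]
                if candidate > best[end][0]:
                    best[end] = (candidate, best[start][1] + [piece])
    return best[-1]

# SentencePiece marks the start of a word with the symbol "▁".
score, pieces = best_segmentation("▁lowest", scores)
print(pieces, round(score, 5))
```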

Learning a word feature vector, a.k.a. word embedding, jointly with the language modelling task [3]
  * A feedforward NN achieved a 20% relative perplexity reduction w.r.t. n-grams
  * Maximise log-likelihood = minimise cross-entropy = minimise perplexity
  * Vocabulary size (tens of thousands) and running words (a few million) still limited
  * A few tens of hidden units -> training time: one week per epoch on 40 CPUs
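
The equivalence between log-likelihood, cross-entropy and perplexity can be checked numerically. In the sketch below the word probabilities are hypothetical values chosen for illustration; the cross-entropy is the negative average log-likelihood per word, and the perplexity is its exponential.

```python
import math

# Hypothetical probabilities a language model assigns to each word of a
# four-word test sequence (illustrative values).
word_probs = [0.2, 0.1, 0.5, 0.05]

log_likelihood = sum(math.log(p) for p in word_probs)
cross_entropy = -log_likelihood / len(word_probs)   # nats per word
perplexity = math.exp(cross_entropy)

print(f"log-likelihood = {log_likelihood:.3f}")
print(f"cross-entropy  = {cross_entropy:.3f}")
print(f"perplexity     = {perplexity:.3f}")
```

Since the exponential is monotonic, any change in the model that raises the log-likelihood lowers both the cross-entropy and the perplexity.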

<img src="MLPLM.svg" width="500"/>

In [35]:
import graphviz; graphviz.Source('''
digraph { 
    concentrate=True;
    rankdir=BT;
    node [shape=record];
    WE [label="Word embedding\n|{output:|input:}|{{m}|{V}}"];
    HL [label="MLP (Hidden layer weights)\n|{output:|input:}|{{h}|{m · (n-1)}}"];
    OL [label="Softmax (Output weights)\n|{output:|input:}|{{V}|{h + m · (n-1)}}"];
    WE -> HL
    HL -> OL
    WE -> OL
    node [shape=circle];
    wb [label=<W<sub>i-n+1</sub>>,fixedsize=true,width=0.7];
    wm [label="...",fixedsize=true,width=0.7];
    we [label=<W<sub>i-1</sub>>,fixedsize=true,width=0.7];
    wo  [label=<W<sub>i</sub>>,fixedsize=true,width=0.7];
    wb -> WE
    wm -> WE
    we -> WE
    OL  -> wo
}''').render(filename='MLPLM', format='svg');
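
The layer shapes in the diagram make it easy to see where the parameters of this model go. The sketch below counts weights per layer from the input and output sizes shown in the figure; biases are omitted, and the values of `V`, `m`, `h` and `n` are hypothetical.

```python
def mlp_lm_parameters(V, m, h, n):
    """Count weights per layer using the shapes shown in the diagram."""
    context = m * (n - 1)            # concatenated embeddings of n-1 words
    return {
        "embedding": V * m,          # one m-dimensional vector per word
        "hidden": context * h,       # context -> h hidden units
        "output": (h + context) * V, # hidden + direct connections -> V scores
    }

# Hypothetical sizes, chosen only for illustration.
counts = mlp_lm_parameters(V=20000, m=50, h=100, n=4)
for layer, size in counts.items():
    print(f"{layer:>9}: {size:,}")
```

With these sizes the output layer dominates the parameter count, which is one reason the vocabulary size matters so much for the computational cost.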

Language modelling on real tasks with RNN [4]
  * From backpropagation to backpropagation through time
  * Various optimisations to scale up running words in training (hundreds of millions)
  * Numerical stability issues: double precision and gradient explosion (truncation)
  * Relative reduction of 20% in state-of-the-art ASR tasks
  
  <img src="RNNLM.svg" width="700"/>

In [45]:
import graphviz as G

# boolean variables to denote dense or sparse connections between layers
DENSE = True
SPARSE = False


TIMESTEPS = 5
TIME_OFFSET = 3

unrolled = G.Digraph(node_attr={'shape':'circle', 'fixedsize':'true'}, graph_attr={'style':'invis', 'rankdir':'BT', 'color':'transparent'})

i=0
for step in range(TIMESTEPS+2):
    if step == 0 or step == TIMESTEPS+1:
        with unrolled.subgraph(name='cluster_'+str(i)) as c:
            c.node('a'+str(step), '', color='transparent')
            c.node('b'+str(step), '...', color='transparent')
            c.node('c'+str(step), '...', color='transparent') 
            c.node('d'+str(step), '...', color='transparent')
            c.edge('a'+str(step), 'b'+str(step), style='invis') 
            c.edge('b'+str(step), 'c'+str(step), style='invis')
            c.edge('c'+str(step), 'd'+str(step), style='invis')
    else:
        with unrolled.subgraph(name='cluster_'+str(i)) as c:
            c.node('a'+str(step), '', color='transparent');
            c.node('b'+str(step), 'WE')
            #c.node('c'+str(step), 't'+'{:=+d}'.format(TIME_OFFSET-step) if TIME_OFFSET-step else 't')
            c.node('c'+str(step), '')
            c.node('d'+str(step), 'SM');
            c.node('e'+str(step), '<w<sub>'+'t'+'{:=+d}'.format(TIME_OFFSET-step+1)+'</sub>>' if TIME_OFFSET-step+1 else '<w<sub>'+'t'+'</sub>>', color='transparent');
            c.edge('a'+str(step), 'b'+str(step), label='<w<sub>'+'t'+'{:=+d}'.format(TIME_OFFSET-step)+'</sub>>' if TIME_OFFSET-step else '<w<sub>'+'t'+'</sub>>'); 
            c.edge('b'+str(step), 'c'+str(step), label='<w<sub>'+'t'+'{:=+d}'.format(TIME_OFFSET-step)+'</sub>>' if TIME_OFFSET-step else '<w<sub>'+'t'+'</sub>>'); 
            c.edge('c'+str(step), 'd'+str(step), label='<y<sub>'+'t'+'{:=+d}'.format(TIME_OFFSET-step)+'</sub>>' if TIME_OFFSET-step else '<y<sub>'+'t'+'</sub>>');
            c.edge('d'+str(step), 'e'+str(step), label='');
            
for step in range(1, TIMESTEPS+2):
    unrolled.edge('c'+str(step-1), 'c'+str(step), label='<h<sub>'+'t'+'{:=+d}'.format(TIME_OFFSET-step)+'</sub>>' if TIME_OFFSET-step else '<h<sub>'+'t'+'</sub>>', constraint='false', dir='back', color='black')

unrolled.render(filename='RNNLM', format='svg');
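
The gradient truncation listed among the numerical stability issues can be sketched as a component-wise clamp. This is a minimal illustration under the assumption that truncation means limiting each gradient component to a fixed interval; the `limit` value and the gradient are hypothetical.

```python
def truncate_gradient(grad, limit):
    """Clamp every gradient component to the interval [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grad]

# Hypothetical gradient with one exploding component.
grad = [0.3, -250.0, 4.0]
print(truncate_gradient(grad, limit=15.0))  # prints [0.3, -15.0, 4.0]
```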

## Limitations of RNN

The output depends not only on the near context but also on the far context in a sentence

Numerical instability issues: gradients (propagated error) vanish or explode

Solution: Long Short-Term Memory (LSTM) [5] or Gated Recurrent Units (GRU) [6] cells
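
The vanishing and exploding behaviour can be seen in the simplest possible case, a scalar linear recurrence $h_t = w \, h_{t-1}$, which is an assumption made here for illustration. The error propagated back over $T$ steps is multiplied by $w^T$, so it shrinks towards zero when $|w| < 1$ and grows without bound when $|w| > 1$.

```python
def gradient_factor(w, steps):
    """Factor applied to the error after `steps` steps of h_t = w * h_{t-1}."""
    return w ** steps

# Hypothetical recurrent weights below, at, and above 1.
for w in (0.5, 1.0, 1.5):
    print(w, gradient_factor(w, steps=50))
```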

## LSTM cell

It replaces the nodes in the hidden layer

It explicitly models a memory state to retain near/far context

The output and the change in the memory state depend on parametrised *gates*:
  * input gate: how much new input changes memory state
  * forget gate: how much of prior memory state is retained
  * output gate: how strongly memory state is passed on to next layer
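
The role of the three gates can be sketched with a scalar LSTM cell. This is a minimal illustration of a commonly used formulation with a forget gate, not the exact cell of any particular toolkit; the parameter values, the `"cand"` name for the candidate update, and the inputs are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps names to (w_x, w_h, b)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b
    i = sigmoid(pre("input"))      # how much new input changes the memory
    f = sigmoid(pre("forget"))     # how much prior memory is retained
    o = sigmoid(pre("output"))     # how strongly memory is passed on
    g = math.tanh(pre("cand"))     # candidate update computed from the input
    c = f * c_prev + i * g         # new memory state
    h = o * math.tanh(c)           # output of the cell
    return h, c

# Hypothetical parameters (w_x, w_h, b); a large forget bias keeps the memory.
params = {"input": (0.5, 0.1, 0.0), "forget": (0.0, 0.0, 5.0),
          "output": (0.3, 0.2, 0.0), "cand": (1.0, 0.0, 0.0)}
h, c = 0.0, 1.0
for x in (0.0, 0.0, 0.0):
    h, c = lstm_step(x, h, c, params)
print(round(c, 3))
```

With zero input and a forget gate close to 1, the memory state stays close to its initial value across the three steps, which is how the cell retains far context.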




## Additional bibliography

<ol>
<li><a href="https://onlinelibrary.wiley.com/doi/epdf/10.1207/s15516709cog1402_1" target="_blank">J. Elman. Finding Structure in Time, Cognitive Science 1990.</a></li>
<li><a href="https://www.isca-archive.org/eurospeech_1997/castano97_eurospeech.pdf" target="_blank">M.A. Castaño and F. Casacuberta. A Connectionist Approach to Machine Translation, EuroSpeech 1997.</a></li>
<li><a href="https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf" target="_blank">Y. Bengio et al. A Neural Probabilistic Language Model, Journal of Machine Learning Research 2003.</a></li>
<li><a href="https://www.fit.vut.cz/study/phd-thesis-file/283/283.pdf" target="_blank">T. Mikolov. Statistical Language Models based on Neural Networks, Ph.D. Thesis 2012.</a></li>
<li><a href="https://www.bioinf.jku.at/publications/older/2604.pdf" target="_blank">S. Hochreiter and J. Schmidhuber. Long short-term memory, Neural Computation 1997.</a></li>
<li><a href="https://arxiv.org/pdf/1406.1078" target="_blank">K. Cho et al. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, EMNLP 2014.</a></li>
<li><a href="https://aclanthology.org/P16-1162.pdf" target="_blank">R. Sennrich et al. Neural Machine Translation of Rare Words with Subword Units, ACL 2016.</a></li>
<li><a href="https://aclanthology.org/D18-2012.pdf" target="_blank">T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, EMNLP 2018.</a></li>
<li><a href="https://aclanthology.org/P18-1007.pdf" target="_blank">T. Kudo. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates, ACL 2018.</a></li>
</ol>