<center>
<h1>Machine Translation</h1>
</center>

<br><br><br><br><br><br><br><br>

<center>
<h1>Neural models</h1>
</center>

<p style="page-break-after:always;"></p>

## Framework

Neural MT models the conditional probability $P(y \mid x)$ with a neural network (NN) in order to compute

$$\hat{y} = \operatorname*{argmax}_{y} P(y \mid x)$$

where $x$ is a source sentence and $y$ is a target sentence.

## Preliminaries

Processing a sequence of words with a NN is challenging because words are discrete symbols.

Earlier work was limited to very small vocabularies (a few tens of words):
  * Language modelling with Elman (recurrent) NN [1]
  * Machine translation with RNN [2]

The vocabulary size must be reduced to lower the computational requirements
  * Conventional tokenisers: language dependent and not reversible
  * Unsupervised reversible tokenisers: Byte-Pair Encoding [7], SentencePiece [8], etc.

Word representations must be mapped to a continuous space to be processed by a NN
  * Unsuccessful attempts using bag-of-words, latent semantic indexing, word classes, clustering, etc.
  * Learning a feature vector for each word in context, jointly with the task
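
The mapping from discrete words to continuous vectors can be pictured as a lookup in a table with one row per vocabulary entry. The following is a minimal sketch in plain Python: the helper names (`build_vocab`, `init_embeddings`, `embed`), the `<unk>` entry and the initialisation range are illustrative assumptions, and in a real model the table rows would be updated by backpropagation together with the rest of the network.

```python
import random

def build_vocab(sentences):
    """Assign an integer index to every distinct word (index 0 is reserved for unknown words)."""
    vocab = {"<unk>": 0}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def init_embeddings(vocab_size, dim, seed=0):
    """Create a vocab_size x dim table of small random values (assumed initialisation)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(sentence, vocab, table):
    """Replace each word by its row in the table; unseen words map to row 0."""
    return [table[vocab.get(word, 0)] for word in sentence.split()]

vocab = build_vocab(["low lower lowest", "new newer newest"])
table = init_embeddings(len(vocab), dim=4)
vectors = embed("low newest unseen", vocab, table)
print(len(vectors), len(vectors[0]))  # 3 4
```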

<p style="page-break-after:always;"></p>

## Unsupervised reversible tokenisers

### Byte-pair Encoding (BPE)
  * It starts from a symbol vocabulary equal to the character vocabulary
  * It iteratively merges the most frequent pair of adjacent symbols, up to a maximum number of merge operations
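
The merge procedure can be sketched in a few lines of Python. This is a simplified illustration of the idea in [7], not the `subword-nmt` implementation: the function names are hypothetical, the `</w>` end-of-word marker is an assumed convention, and ties between equally frequent pairs are broken arbitrarily, so the order of the learned merges may differ from that of a real tool.

```python
from collections import Counter

END = "</w>"  # end-of-word marker (assumed convention)

def to_symbols(word):
    """Split a word into characters and mark the last one as word-final."""
    return tuple(word[:-1]) + (word[-1] + END,)

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` by the concatenated symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Return the list of merges, most frequent pair first."""
    vocab = Counter(to_symbols(w) for w in words if w)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the new merged symbol
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges

def apply_bpe(word, merges, sep="@@"):
    """Segment a non-empty word by replaying the merges in the order they were learned."""
    symbols = to_symbols(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    # Non-final pieces get the separator; the final piece loses its marker
    pieces = [s[:-len(END)] if s.endswith(END) else s + sep for s in symbols]
    return " ".join(pieces)

corpus = ["low", "lowest", "lower", "new", "newest", "newer", "wider"]
merges = learn_bpe(corpus, num_merges=3)
print(apply_bpe("lowest", merges))  # lo@@ we@@ s@@ t
```

On this toy corpus the three learned merges are `w e`, `n e` and `l o`; the last two are equally frequent, so their relative order depends on tie-breaking.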

In [81]:
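# Write a toy corpus, one word per line
!echo -e "low\nlowest\nlower\nnew\nnewest\nnewer\nwider" | tee train.txt
# Learn 3 merge operations (-s 3) and show the resulting merge table
!subword-nmt learn-bpe --min-frequency 1 -s 3 < train.txt > codes.txt
!cat codes.txt; echo "-----------"
# Segment the corpus with the learned merges
!subword-nmt apply-bpe -c codes.txt < train.txt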


low
lowest
lower
new
newest
newer
wider


100%|##########################################| 3/3 [00:00<00:00, 12180.94it/s]
#version: 0.2
w e
n e
l o
-----------
lo@@ w
lo@@ we@@ s@@ t
lo@@ we@@ r
ne@@ w
ne@@ we@@ s@@ t
ne@@ we@@ r
w@@ i@@ d@@ e@@ r


<p style="page-break-after:always;"></p>

## Unsupervised reversible tokenisers

### SentencePiece
  * Includes an implementation of the unigram (*uni-subword*) language model [9]
  * Capable of outputting multiple subword segmentations with their probabilities
  * The most probable subword segmentations can be considered for training and decoding
  * No prior tokenisation is needed

In [None]:
import sentencepiece as spm

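# Train a unigram model on the toy corpus; writes train.model and train.vocab
params = '--input=train.txt --model_prefix=train --vocab_size=18'
spm.SentencePieceTrainer.Train(params)
# The processor stays empty until a trained model is loaded with sp.Load(...)
sp = spm.SentencePieceProcessor()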

In [12]:
!cat train.vocab

<unk>	0
<s>	0
</s>	0
r	-1.94426
s	-2.44426
t	-2.44426
▁lowe	-2.47125
▁newe	-2.47126
e	-3.31599
▁new	-3.37828
▁low	-3.37829
o	-3.44406
n	-3.44416
d	-3.44426
i	-3.44426
l	-3.44426
w	-3.44426
▁	-3.44426


<p style="page-break-after:always;"></p>

In [18]:
sp.Load('train.model')
sp.nbest_encode('lowest', nbest_size=3, out_type=str)


[['▁lowe', 's', 't'],
 ['▁low', 'e', 's', 't'],
 ['▁', 'l', 'o', 'w', 'e', 's', 't']]

In [23]:
sp.sample_encode_and_score('lowest', num_samples=2, alpha=0.1, out_type=str, wor=True)

[(['▁lowe', 's', 't'], -0.003965185023844242),
 (['▁low', 'e', 's', 't'], -0.026968346908688545)]
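
Under a unigram model the score of a segmentation is the sum of the log-probabilities of its pieces, so the best segmentation can be found by dynamic programming. The sketch below is an illustration in plain Python, not the SentencePiece implementation: `viterbi_segment` is a hypothetical helper, the scores are copied from the `train.vocab` listing above, and the leading `▁` is added by hand on the assumption that it mirrors how SentencePiece marks the start of a word.

```python
import math

# Piece log-probabilities copied from the train.vocab listing
LOGP = {
    "r": -1.94426, "s": -2.44426, "t": -2.44426,
    "▁lowe": -2.47125, "▁newe": -2.47126, "e": -3.31599,
    "▁new": -3.37828, "▁low": -3.37829, "o": -3.44406,
    "n": -3.44416, "d": -3.44426, "i": -3.44426,
    "l": -3.44426, "w": -3.44426, "▁": -3.44426,
}

def viterbi_segment(text, logp):
    """Return the highest-scoring segmentation of `text` and its log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[k]: best score of any segmentation of text[:k]
    back = [0] * (n + 1)          # back[k]: start index of the last piece in that segmentation
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the text
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1], best[n]

pieces, score = viterbi_segment("▁lowest", LOGP)
print(pieces)  # ['▁lowe', 's', 't']
```

The result agrees with the first entry returned by `nbest_encode` above.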

<p style="page-break-after:always;"></p>

Learning a word feature vector (a.k.a. word embedding) jointly with the language modelling task [3]
  * A feedforward NN achieved a 20% relative perplexity reduction w.r.t. n-grams
  * Maximising log-likelihood = minimising cross-entropy = minimising perplexity
  * Still limited in vocabulary size (tens of thousands of words) and training data (a few million running words)
  * A few tens of hidden units -> training time: one week per epoch on 40 CPUs

<img src="MLPLM.svg" width="500"/>
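
The equivalence between log-likelihood, cross-entropy and perplexity follows from their definitions: cross-entropy is the negative average log-probability the model assigns to the observed words, and perplexity is its exponential. A small sketch, assuming natural logarithms and a hypothetical list of per-word model probabilities:

```python
import math

def cross_entropy(word_probs):
    """Negative average log-probability assigned by the model to each observed word."""
    return -sum(math.log(p) for p in word_probs) / len(word_probs)

def perplexity(word_probs):
    """Exponential of the cross-entropy; lower is better."""
    return math.exp(cross_entropy(word_probs))

# A model that spreads probability uniformly over 4 choices has perplexity 4
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0
```

Because the exponential is monotonic, any parameter update that raises the log-likelihood of the data lowers both quantities.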

<p style="page-break-after:always;"></p>

Language modelling on real tasks with RNN [4]
  * From backpropagation to backpropagation through time
  * Various optimisations to scale up the number of running words in training (hundreds of millions)
  * Numerical stability issues: double precision and gradient explosion (mitigated by truncation)
  * Relative reduction of 20% in state-of-the-art ASR tasks
  
  <img src="RNNLM.svg" width="700"/>

## Limitations of RNN

The output depends not only on nearby context but also on distant context in the sentence

Numerical instability: gradients (propagated errors) tend to vanish or explode

Solution: Long Short-Term Memory (LSTM) [5] or Gated Recurrent Units (GRU) [6] cells
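
Why gradients vanish or explode can be seen in a deliberately simplified case: a scalar linear recurrence $h_t = w \, h_{t-1}$, for which the gradient of the last state with respect to the first is $w^T$. The sketch below is a toy illustration under that assumption, not a real RNN; the function names and the truncation limit are hypothetical.

```python
def gradient_through_time(w, steps):
    """Gradient of h_T with respect to h_0 for the recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: one factor of w per time step
    return grad

def truncate(value, limit):
    """Clamp a gradient value to the interval [-limit, limit]."""
    return max(-limit, min(limit, value))

small = gradient_through_time(0.5, steps=20)  # shrinks towards zero (vanishing)
large = gradient_through_time(1.5, steps=20)  # grows without bound (exploding)
print(small < 1e-5, large > 1e3)              # True True
print(truncate(large, limit=15.0))            # 15.0
```

Truncation bounds an exploding gradient but does nothing for a vanishing one, which is the case that gated cells such as the LSTM are meant to address.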

## LSTM cell

It replaces the nodes in the hidden layer

It explicitly models a memory state to retain near/far context

Changes to the output and the memory state depend on parametrised *gates*:
  * input gate: how much new input changes memory state
  * forget gate: how much of prior memory state is retained
  * output gate: how strongly memory state is passed on to next layer
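
The three gates can be made concrete with a single LSTM step. This is a minimal sketch of one common formulation, written in plain Python for readability: the parameter names (`Wi`, `Wf`, `Wo`, `Wg` and the matching biases) are hypothetical, vectors are lists, and each weight matrix acts on the concatenation of the current input and the previous output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: returns the new output h and the new memory state c."""
    v = x + h_prev  # concatenate input and previous output
    i = [sigmoid(a) for a in affine(p["Wi"], v, p["bi"])]    # input gate
    f = [sigmoid(a) for a in affine(p["Wf"], v, p["bf"])]    # forget gate
    o = [sigmoid(a) for a in affine(p["Wo"], v, p["bo"])]    # output gate
    g = [math.tanh(a) for a in affine(p["Wg"], v, p["bg"])]  # candidate memory content
    # Memory: keep part of the old state, add part of the candidate
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # Output: expose part of the squashed memory to the next layer
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# One input unit, one hidden unit, all parameters zero: every gate is 0.5
zeros = {name: [[0.0, 0.0]] for name in ("Wi", "Wf", "Wo", "Wg")}
zeros.update({name: [0.0] for name in ("bi", "bf", "bo", "bg")})
h, c = lstm_step([1.0], [0.0], [2.0], zeros)
print(c)  # [1.0]
```

With all gates at 0.5 and a zero candidate, half of the previous memory state (2.0) is retained, which is the role of the forget gate described above.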



<p style="page-break-after:always;"></p>

## Additional bibliography

<ol>
<li><a href="https://onlinelibrary.wiley.com/doi/epdf/10.1207/s15516709cog1402_1" target="_blank">J. Elman. Finding Structure in Time, Cognitive Science 1990.</a></li>
<li><a href="https://www.isca-archive.org/eurospeech_1997/castano97_eurospeech.pdf" target="_blank">M.A. Castaño and F. Casacuberta. A Connectionist Approach to Machine Translation, EuroSpeech 1997.</a></li>
<li><a href="https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf" target="_blank">Y. Bengio et al. A Neural Probabilistic Language Model, Journal of Machine Learning Research 2003.</a></li>
<li><a href="https://www.fit.vut.cz/study/phd-thesis-file/283/283.pdf" target="_blank">T. Mikolov. Statistical Language Models based on Neural Networks, Ph.D. Thesis 2012.</a></li>
<li><a href="https://www.bioinf.jku.at/publications/older/2604.pdf" target="_blank">S. Hochreiter and J. Schmidhuber. Long short-term memory, Neural Computation 1997.</a></li>
<li><a href="https://arxiv.org/pdf/1406.1078" target="_blank">K. Cho et al. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, EMNLP 2014.</a></li>
<li><a href="https://aclanthology.org/P16-1162.pdf" target="_blank">R. Sennrich et al. Neural Machine Translation of Rare Words with Subword Units, ACL 2016.</a></li>
<li><a href="https://aclanthology.org/D18-2012.pdf" target="_blank">T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, EMNLP 2018.</a></li>
<li><a href="https://aclanthology.org/P18-1007.pdf" target="_blank">T. Kudo. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates, ACL 2018.</a></li>
</ol>

<p style="page-break-after:always;"></p>