# Multilingual NMT [1]


## Motivation

Low-resource languages

No-resource languages: zero-shot translation models

To take advantage of common features in similar languages

A single engine instead of many engines

## Possible settings

Multi-way translation. The goal is to construct a single NMT system
for one-to-many, many-to-one or many-to-many translation using parallel
corpora for more than one language pair. Parallel corpora are available
for each language pair.

Low-resource translation. (a) A high-resource language pair is available
to assist a low-resource language pair. (b) No direct parallel corpus exists for
the low-resource pair and a pivot language is used.


Multi-source translation. Documents that have already been translated into more
than one language can serve as multiple sources when translating into a new language

## Multi-way NMT

One encoder for all source languages or one encoder for
each source language

One decoder for all target languages or one decoder for
each target language

### From minimal to complete parameter sharing

One encoder and one decoder for each language: the burden is on the shared attention layer

One encoder for all: common vocabulary across all languages (BPE, SentencePiece, etc.)

One decoder for all: 
  * prefix sentences with language tag
  * vocabulary increase: larger softmax layer and slower inference
  * particularly useful for related languages
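
As a minimal sketch of the language-tag idea, the snippet below prefixes each source sentence with a token naming the desired target language. The `<2xx>` tag format and the function name `add_target_tag` are assumptions made for illustration; any reserved token in the shared vocabulary would serve the same purpose.

```python
def add_target_tag(source_sentence: str, target_lang: str) -> str:
    """Prefix a source sentence with a target-language tag token."""
    # The "<2xx>" format is an assumed convention, not a fixed standard.
    return f"<2{target_lang}> {source_sentence}"


# The same source sentence is routed to two different target languages.
examples = [("How are you?", "es"), ("How are you?", "de")]
tagged = [add_target_tag(src, tgt) for src, tgt in examples]
for line in tagged:
    print(line)
```

The tag is the only signal telling the shared decoder which language to produce, so it must be added consistently at training and inference time.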

Complete parameter sharing:
  * adopted by massively multilingual NMT
  * suffers from representation bottlenecks <!-- not all translation directions show improved performance despite a massive amount of data
being fed to a model with a massive number of parameters -->

Controlled Parameter Sharing
  * the degree of parameter sharing depends on the divergence between the languages involved
  * simplicity
  * flexibility of modelling
  <!-- target language-specific attention performs better than other attention-sharing configurations -->
  * learning the degree of parameter sharing from the training data


### Addressing Language Divergence

Select vocabulary from the different languages and learn a sub-word vocabulary to avoid out-of-vocabulary (OOV) words
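
To make the sub-word idea concrete, here is a toy sketch of how byte-pair encoding (BPE) merges could be learned from a word list that mixes languages. It is a simplified illustration, not the implementation of any particular toolkit; the function name `learn_bpe`, the `</w>` end-of-word marker and the sample words are assumptions.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge operations from a list of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


# Hypothetical mixed English/Spanish word list sharing one sub-word vocabulary.
mixed_words = ["nation", "national", "nacional", "internacional"]
print(learn_bpe(mixed_words, 5))
```

Because unseen words can still be split into known sub-word units (down to single characters), a vocabulary built this way avoids OOV tokens for any word spelled with characters seen in training.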

Representation similarity varies across layers: lower variance at inner layers of encoder/decoder

Source sentence length impacts encoder representation 

Language tag proved to be very effective in decoder representation

### Training protocols

Fully shared models: a single batch contains sentence pairs from multiple language pairs

Oversample smaller datasets to match the size of the largest dataset
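
A rough sketch of the oversampling step: compute how many times each corpus would need to be repeated to roughly match the largest one. The corpus sizes and language pairs below are invented for illustration, and `oversampling_factors` is a hypothetical helper name.

```python
import math


def oversampling_factors(sizes):
    """Return how many times to repeat each corpus to reach the largest size."""
    largest = max(sizes.values())
    # Round up so that every corpus ends up at least as large as the largest.
    return {pair: math.ceil(largest / n) for pair, n in sizes.items()}


# Hypothetical corpus sizes (number of sentence pairs).
corpus_sizes = {"en-fr": 1_000_000, "en-gl": 50_000, "en-eu": 10_000}
print(oversampling_factors(corpus_sizes))
```

Repeating a small corpus many times balances the batches but adds no new information, which is one reason the smaller pairs can still overfit.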

Knowledge Distillation: train a large model with many layers and then distill its knowledge into a smaller model
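
One common way to realize distillation is to train the student to match the teacher's softened output distribution at each target position. The sketch below computes such a loss for a single position in plain Python. The temperature value, the logits and the function names are assumptions for illustration, and a real system would compute this over batches inside a deep-learning framework.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened distribution."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


# Hypothetical logits over a three-word vocabulary.
teacher = [2.0, 0.5, -1.0]
student = [0.1, 0.2, 0.3]
print(distillation_loss(student, teacher))
```

The loss is smallest when the student reproduces the teacher's distribution exactly, so minimizing it pulls the smaller model toward the larger one's behaviour.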

Going beyond oversampling for low-resource languages
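
One widely used alternative to plain oversampling, though not necessarily the one these notes have in mind, is temperature-based sampling: a language pair with `n` sentence pairs is sampled with probability proportional to `(n / N) ** (1 / T)`, where `N` is the total corpus size. The sketch below assumes hypothetical corpus sizes; `temperature_probs` is an illustrative name.

```python
def temperature_probs(sizes, temperature):
    """Sampling probability per language pair under temperature-based sampling."""
    total = sum(sizes.values())
    # T = 1 keeps the original data proportions; larger T flattens them.
    weights = {pair: (n / total) ** (1.0 / temperature) for pair, n in sizes.items()}
    norm = sum(weights.values())
    return {pair: w / norm for pair, w in weights.items()}


# Hypothetical corpus sizes (number of sentence pairs).
corpus_sizes = {"en-fr": 900_000, "en-eu": 100_000}
for t in (1.0, 5.0):
    print(t, temperature_probs(corpus_sizes, t))
```

The temperature gives a smooth trade-off between sampling proportionally to data size and sampling all pairs uniformly.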

## Low-resource languages

Data augmentation strategies can improve translation quality: back-translation and self-training


Training rich-resource (parent) and low-resource (child) languages:

  * Joint training when both pairs share the same target language

  * Fine-tune: Train parent model and fine-tune it on the low-resource pair

  * Transfer learning is easier on the source side than on the target side
    <!--  need of specific target language representation -->
  * A related parent language benefits the child language more than an unrelated parent

## MNMT for unseen language pairs

Can we do better than unsupervised NMT by utilizing multilingual translation corpora?

### Pivot translation

Even if two languages do not have a parallel corpus, they are likely to share a
parallel corpus with a third language (called the pivot language)

Cascaded approach: independent source-pivot (S-P) and pivot-target (P-T) MT systems

Limitations: error propagation and double decoding time
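
The cascaded approach can be sketched as the composition of two independent translators. The dictionary-based "systems" below are toy stand-ins for real S-P and P-T models, and all names (`make_pivot_translator`, `es_to_en`, `en_to_de`) are hypothetical.

```python
from typing import Callable

Translator = Callable[[str], str]


def make_pivot_translator(source_to_pivot: Translator,
                          pivot_to_target: Translator) -> Translator:
    """Chain an S-P system and a P-T system into an S-T system."""
    def translate(sentence: str) -> str:
        pivot = source_to_pivot(sentence)  # first decoding pass
        return pivot_to_target(pivot)      # second decoding pass
    return translate


# Toy stand-ins for trained MT systems (Spanish -> English -> German).
ES_EN = {"buenos días": "good morning"}
EN_DE = {"good morning": "guten Morgen"}


def es_to_en(sentence: str) -> str:
    return ES_EN.get(sentence, "<unk>")


def en_to_de(sentence: str) -> str:
    return EN_DE.get(sentence, "<unk>")


es_to_de = make_pivot_translator(es_to_en, en_to_de)
print(es_to_de("buenos días"))
print(es_to_de("hola"))  # the S-P failure is passed on to the P-T system
```

Both limitations are visible here: every sentence is decoded twice, and a mistake in the first system cannot be repaired by the second.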

Re-ranking of n-best translations can improve performance

### Zero-shot Translation

The MNMT system has not been trained for the unseen language pair...

... but the system is able to generate reasonable target language translations for the source sentence

A target language tag is needed as input to generate the desired target language

Translation between any arbitrary language pair is possible, provided both source and target languages were seen in training

Its performance is generally lower than that of a pivot translation system

### Zero-resource Translation

Optimizing the translation quality of a specific unseen pair when no resources are available for that pair

Synthetic Corpus Generation: 

  * Pivot sentences of a P-T corpus are back-translated into the source language to create a synthetic S-T corpus
  * Cascade system S-P and P-T
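
Synthetic corpus generation by back-translation can be sketched as follows: each pivot sentence of the P-T corpus is translated into the source language and paired with its original target sentence. The toy P-S translator and the example sentences are assumptions; in practice `pivot_to_source` would be a trained MT system.

```python
def build_synthetic_corpus(pivot_target_pairs, pivot_to_source):
    """Create synthetic (source, target) pairs from a pivot-target corpus."""
    synthetic = []
    for pivot_sentence, target_sentence in pivot_target_pairs:
        # Back-translate the pivot side; the target side stays human-written.
        source_sentence = pivot_to_source(pivot_sentence)
        synthetic.append((source_sentence, target_sentence))
    return synthetic


# Toy stand-in for a pivot-to-source (English -> Spanish) system.
EN_ES = {"good morning": "buenos días", "thank you": "gracias"}


def en_to_es(sentence: str) -> str:
    return EN_ES.get(sentence, "<unk>")


# Hypothetical pivot-target (English-German) corpus.
en_de_corpus = [("good morning", "guten Morgen"), ("thank you", "danke")]
es_de_corpus = build_synthetic_corpus(en_de_corpus, en_to_es)
print(es_de_corpus)
```

Only the source side is machine-generated, so the model trained on the synthetic S-T corpus still learns to produce clean target sentences.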

Combining Pre-trained Encoders and Decoders

## Additional bibliography

<ol>
<li><a href="https://dl.acm.org/doi/pdf/10.1145/3406095" target="_blank">R. Dabre et al. A Survey of Multilingual Neural Machine Translation, ACM Computing Surveys 2020.</a></li>
<li><a href="https://arxiv.org/pdf/2404.04925" target="_blank">L. Qin et al. Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers, arXiv 2024.</a></li>
</ol>