# Multilingual Models for Inference

## XLM

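**XLM** has several checkpoints. The multilingual ones fall into two categories: checkpoints that use language embeddings and checkpoints that do not.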

### XLM with language embeddings

The following XLM models use language embeddings to specify the language used at inference:
* `FacebookAI/xlm-mlm-ende-1024` (Masked language modeling, English-German)
* `FacebookAI/xlm-mlm-enfr-1024` (Masked language modeling, English-French)
* `FacebookAI/xlm-mlm-enro-1024` (Masked language modeling, English-Romanian)
* `FacebookAI/xlm-mlm-xnli15-1024` (Masked language modeling, XNLI languages)
* `FacebookAI/xlm-mlm-tlm-xnli15-1024` (Masked language modeling + translation, XNLI languages)
* `FacebookAI/xlm-clm-enfr-1024` (Causal language modeling, English-French)
* `FacebookAI/xlm-clm-ende-1024` (Causal language modeling, English-German)

Language embeddings are represented as a tensor of the same shape as the `input_ids` passed to the model. The values in these tensors depend on the language used and are identified by the tokenizer's `lang2id` and `id2lang` attributes.

We will use `FacebookAI/xlm-clm-enfr-1024` as an example:

In [None]:
import torch
from transformers import XLMTokenizer, XLMWithLMHeadModel

tokenizer = XLMTokenizer.from_pretrained("FacebookAI/xlm-clm-enfr-1024")
model = XLMWithLMHeadModel.from_pretrained("FacebookAI/xlm-clm-enfr-1024")

The `lang2id` attribute of the tokenizer displays the model's languages and their ids:

In [None]:
tokenizer.lang2id

Next, create an example input:

In [None]:
input_ids = torch.tensor(
    [tokenizer.encode("Wikipedia was used to")]  # batch size of 1
)

Set the language id as `"en"` and use it to define the language embeddings. The language embedding is a tensor filled with 0 since that is the language id for English. This tensor should be the same size as `input_ids`.

In [None]:
language_id = tokenizer.lang2id['en']
langs = torch.tensor([language_id] * input_ids.shape[1])
# reshape it to be the size (batch_size, sequence_length)
langs = langs.view(1, -1)

langs.size() == input_ids.size()

Now we can pass the `input_ids` and language embedding to the model:

In [None]:
outputs = model(input_ids, langs=langs)
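The same idea extends to a batch in which each sequence is in a different language: every row of the language tensor repeats that sequence's language id once per token. The following is a minimal sketch in plain Python, not part of the library. The helper `build_langs`, the token ids, and the `lang2id` mapping are hypothetical values chosen for illustration; in practice they would come from `tokenizer.lang2id` and `tokenizer.encode`.

```python
from typing import Dict, List


def build_langs(
    batch_input_ids: List[List[int]],
    languages: List[str],
    lang2id: Dict[str, int],
) -> List[List[int]]:
    """Return one language id per token, matching the shape of batch_input_ids."""
    if len(batch_input_ids) != len(languages):
        raise ValueError("need exactly one language code per sequence")
    return [
        [lang2id[lang]] * len(ids)
        for ids, lang in zip(batch_input_ids, languages)
    ]


# Hypothetical values for illustration only; real ones would come from
# tokenizer.lang2id and tokenizer.encode(...).
lang2id = {"en": 0, "fr": 1}
batch_input_ids = [[0, 11, 12, 13, 1], [0, 21, 22, 23, 1]]

langs = build_langs(batch_input_ids, ["en", "fr"], lang2id)
print(langs)  # [[0, 0, 0, 0, 0], [1, 1, 1, 1, 1]]
```

Wrapping the result in `torch.tensor(...)` gives a tensor that can be passed as `langs` next to an equally shaped `input_ids`. Sequences of different lengths would need to be padded to a common length first.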

### XLM without language embeddings

The following XLM models do not require language embeddings during inference:
* `FacebookAI/xlm-mlm-17-1280` (Masked language modeling, 17 languages)
* `FacebookAI/xlm-mlm-100-1280` (Masked language modeling, 100 languages)

These models are used for generic sentence representations, unlike the previous XLM checkpoints with language embeddings.

## BERT

The following BERT models can be used for multilingual tasks:
* `google-bert/bert-base-multilingual-uncased` (Masked language modeling + Next sentence prediction, 102 languages)
* `google-bert/bert-base-multilingual-cased` (Masked language modeling + Next sentence prediction, 104 languages)

These models do not require language embeddings during inference. They should identify the language from the context and infer accordingly.

## XLM-RoBERTa

The following XLM-RoBERTa models can be used for multilingual tasks:
* `FacebookAI/xlm-roberta-base` (Masked language modeling, 100 languages)
* `FacebookAI/xlm-roberta-large` (Masked language modeling, 100 languages)

XLM-RoBERTa was trained on 2.5TB of newly created and cleaned CommonCrawl data in 100 languages. It provides strong gains over previously released multilingual models like mBERT or XLM on downstream tasks like classification, sequence labeling, and question answering.

## M2M100

The following M2M100 models can be used for multilingual translation:
* `facebook/m2m100_418M` (Translation)
* `facebook/m2m100_1.2B` (Translation)

We will load the `facebook/m2m100_418M` checkpoint to translate from Chinese to English.

In [None]:
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

en_text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."
chinese_text = "不要插手巫師的事務, 因為他們是微妙的, 很快就會發怒."

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="zh")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

In [None]:
encoded_zh = tokenizer(chinese_text, return_tensors='pt')

M2M100 forces the target language id as the first generated token to translate to the target language. Set `forced_bos_token_id` to the id of `en` in the `generate` method to translate to English:

In [None]:
generated_tokens = model.generate(
    **encoded_zh,
    forced_bos_token_id=tokenizer.get_lang_id('en')
)

tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
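The steps above (set the source language, encode, force the target language id, decode) can be gathered into one helper. This is a sketch under stated assumptions, not a library function: `translate` is a hypothetical name, and it assumes the tokenizer exposes a settable `src_lang` attribute and a `get_lang_id` method, as `M2M100Tokenizer` does in the cells above.

```python
from typing import Any, List


def translate(
    model: Any, tokenizer: Any, text: str, src_lang: str, tgt_lang: str
) -> List[str]:
    """Translate text with an M2M100-style model and tokenizer pair.

    Hypothetical helper for illustration; it only chains the calls
    already shown in this section.
    """
    # Tell the tokenizer which language the input text is written in.
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    # Force the target language id as the first generated token.
    generated_tokens = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.get_lang_id(tgt_lang),
    )
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
```

With the objects loaded earlier, a call would look like `translate(model, tokenizer, chinese_text, "zh", "en")`. Passing the language codes as arguments keeps one loaded model reusable across language pairs.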

## MBart

The following MBart models can be used for multilingual translation:
* `facebook/mbart-large-50-one-to-many-mmt` (One-to-many multilingual machine translation, 50 languages)
* `facebook/mbart-large-50-many-to-many-mmt` (Many-to-many multilingual machine translation, 50 languages)
* `facebook/mbart-large-50-many-to-one-mmt` (Many-to-one multilingual machine translation, 50 languages)
* `facebook/mbart-large-50` (Multilingual translation, 50 languages)
* `facebook/mbart-large-cc25`

We will load the `facebook/mbart-large-50-many-to-many-mmt` checkpoint to translate from Finnish to English.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

en_text = "Do not meddle in the affairs of wizards, for they are subtle and quick to anger."
fi_text = "Älä sekaannu velhojen asioihin, sillä ne ovat hienovaraisia ja nopeasti vihaisia."

tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt", src_lang="fi_FI")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

In [None]:
# Encode the Finnish source text; the tokenizer was loaded with src_lang="fi_FI"
encoded_fi = tokenizer(fi_text, return_tensors="pt")

MBart forces the target language id as the first generated token to translate to the target language. Set `forced_bos_token_id` to the id of `en_XX` in the `generate` method to translate to English:

In [None]:
generated_tokens = model.generate(
    **encoded_fi,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)