# **[Hugging Face: Translation](https://huggingface.co/tasks/translation)**



## **Translation**

Translation is the task of converting text from one language to another.

### **Use Cases**

You can find over a thousand Translation models on the Hub, but you might not find one for the language pair you are interested in. When this happens, you can take a pretrained multilingual Translation model such as [mBART](https://huggingface.co/facebook/mbart-large-cc25) and train it further on your own data, a process called fine-tuning.

**Multilingual conversational agents** 🧐

Translation models can be used to build conversational agents that work across different languages. There are two ways to do this.
- **Translate the dataset to a new language**. Translate a dataset of intents (inputs) and responses into the target language, then train a new intent classification model on the translated dataset. This lets you proofread the responses in the target language and keep better control of the chatbot's outputs.
- **Translate the input and output of the agent**. Apply a Translation model to the user's input so that the chatbot can process it, then translate the chatbot's output back into the user's language. This approach might be less reliable, because the user sees translated responses that were not defined beforehand.
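
The two approaches above can be sketched with plain callables. This is a minimal illustration, not a reference implementation: `translate_dataset` and `wrap_agent` are hypothetical helper names, and the translators and the agent are stand-in lookup functions rather than real models. In practice each translator could be a Translation pipeline.

```python
from typing import Callable, Dict, List

Translator = Callable[[str], str]


def translate_dataset(rows: List[Dict[str, str]], translate: Translator) -> List[Dict[str, str]]:
    """Approach 1: translate every intent/response pair once, ahead of time."""
    return [
        {"intent": translate(row["intent"]), "response": translate(row["response"])}
        for row in rows
    ]


def wrap_agent(agent: Callable[[str], str],
               to_agent_lang: Translator,
               to_user_lang: Translator) -> Callable[[str], str]:
    """Approach 2: translate the input, call the agent, translate the reply back."""
    def multilingual_agent(user_text: str) -> str:
        return to_user_lang(agent(to_agent_lang(user_text)))
    return multilingual_agent


# Stand-ins for real translation models (assumptions for illustration only).
FR_TO_EN = {"Bonjour": "Hello"}
EN_TO_FR = {
    "Hello": "Bonjour",
    "Hi, how can I help?": "Bonjour, comment puis-je vous aider ?",
}


def fr_to_en(text: str) -> str:
    return FR_TO_EN.get(text, text)


def en_to_fr(text: str) -> str:
    return EN_TO_FR.get(text, text)


def english_agent(text: str) -> str:
    """A toy English-only chatbot with a single known intent."""
    return "Hi, how can I help?" if text == "Hello" else "Sorry, I did not understand."


# Approach 2: wrap the English agent so that it answers French input in French.
french_agent = wrap_agent(english_agent, fr_to_en, en_to_fr)
print(french_agent("Bonjour"))

# Approach 1: build a French dataset that can be proofread before training.
french_rows = translate_dataset(
    [{"intent": "Hello", "response": "Hi, how can I help?"}], en_to_fr
)
print(french_rows)
```

With the first approach, the translated rows can be reviewed before any model is trained on them. With the second, every reply passes through the translator at run time, which is why its quality is harder to guarantee.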

### **Example**

In [1]:
!pip install -U transformers --quiet

In [2]:
from transformers import pipeline

model = pipeline("translation_en_to_fr")

text = "My name is Imar and I live in Zürich."
model(text)

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


[{'translation_text': "Je m'appelle Imar et j'habite à Zürich."}]

In [7]:
# Download used models
!git clone https://huggingface.co/t5-base

Cloning into 't5-base'...
remote: Enumerating objects: 75, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 75 (delta 0), reused 0 (delta 0), pack-reused 71
Unpacking objects: 100% (75/75), 967.99 KiB | 4.50 MiB/s, done.
Filtering content: 100% (5/5), 4.15 GiB | 35.46 MiB/s, done.


In [4]:
from transformers import pipeline

model_checkpoint = "t5-base"
model = pipeline("translation_en_to_fr", model=model_checkpoint)

text = "My name is Imar and I live in Zürich."

output = model(text)
translation = output[0]["translation_text"]

print(translation)

Je m'appelle Imar et j'habite à Zürich.
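
As the outputs above show, the pipeline returns a list of dictionaries, each holding the translated string under the `translation_text` key. The sketch below collects those strings into a plain list. `extract_translations` is a hypothetical helper name, and `sample_output` is hard-coded in the shape shown above, so no model is called here.

```python
from typing import Dict, List


def extract_translations(outputs: List[Dict[str, str]]) -> List[str]:
    """Return only the translated strings from pipeline-style output."""
    return [item["translation_text"] for item in outputs]


# Sample data in the same shape as the pipeline output shown above.
sample_output = [{"translation_text": "Je m'appelle Imar et j'habite à Zürich."}]

translations = extract_translations(sample_output)
print(translations[0])
```

The same helper should also work when the pipeline is given a list of sentences, assuming it then returns one dictionary per input sentence.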


### **Additional Resources**
- [Hugging Face | Models for Translation](https://huggingface.co/models?pipeline_tag=translation)
- [Hugging Face | Pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines)
- [Hugging Face | Course Chapter on Translation](https://huggingface.co/learn/nlp-course/chapter7/4?fw=pt)
- [Hugging Face | Models for Translation in Portuguese](https://huggingface.co/models?pipeline_tag=translation&language=pt&sort=downloads)

### **Example in Portuguese**

In [2]:
# MarianTokenizer requires the SentencePiece library
# https://github.com/google/sentencepiece#installation 
! pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


In [25]:
# Download used models
!git clone https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-en-pt

Cloning into 'opus-mt-tc-big-en-pt'...
remote: Enumerating objects: 26, done.
remote: Total 26 (delta 0), reused 0 (delta 0), pack-reused 26
Unpacking objects: 100% (26/26), 328.99 KiB | 4.27 MiB/s, done.
Filtering content: 100% (5/5), 1.41 GiB | 34.59 MiB/s, done.


In [24]:
from transformers import MarianMTModel, MarianTokenizer

# The tokenizer and the model are loaded from the same checkpoint
model_name = "Helsinki-NLP/opus-mt-tc-big-en-pt"

tokenizer = MarianTokenizer.from_pretrained(model_name)
model_pt = MarianMTModel.from_pretrained(model_name)

# The ">>por<<" prefix selects Portuguese as the target language
text = ">>por<< He has been to Hawaii several times."

# truncation=True is required for max_length to actually cap the input length
inputs = tokenizer(text, max_length=512, truncation=True, return_tensors="pt", padding=True)
translation_ids = model_pt.generate(**inputs)
translation = tokenizer.decode(translation_ids[0], skip_special_tokens=True)

translation

'Ele já esteve no Havaí várias vezes.'
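
The example above marks the target language by prefixing the input with `>>por<<`. To translate several sentences at once, the same prefix can be added to each of them before tokenizing. The sketch below only builds the prefixed batch; `add_target_token` is a hypothetical helper name, and the second sentence is made up for illustration.

```python
from typing import List


def add_target_token(sentences: List[str], lang_code: str) -> List[str]:
    """Prefix each sentence with a >>code<< target-language token."""
    return [f">>{lang_code}<< {sentence}" for sentence in sentences]


batch = add_target_token(
    ["He has been to Hawaii several times.", "Thank you."], "por"
)
print(batch)

# The batch can then go through the same steps as the single sentence above,
# reusing the tokenizer and model_pt objects from that cell:
# inputs = tokenizer(batch, return_tensors="pt", padding=True)
# translation_ids = model_pt.generate(**inputs)
# translations = tokenizer.batch_decode(translation_ids, skip_special_tokens=True)
```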