# Language Translation Using the T5 Model and HuggingFace Framework in PyTorch

This notebook translates English to German in only a few lines of code using the HuggingFace Transformers library, with good accuracy on the test sentences below. It shows both the highest-level abstraction, pipelines, and a slightly lower-level API where we create the model and tokenizer ourselves before translating.

This example uses the same test sentences I used in my second Coursera class capstone project, plus a bonus complex sentence. All examples below were confirmed to have been translated correctly by translating the results back into English with Google Translate.

## T5 (Text-to-Text Transfer Transformer) Model

Paper: https://arxiv.org/abs/1910.10683

For T5 to be able to operate on all NLP tasks, it transforms them into text-to-text problems by using specific prefixes: “summarize: ”, “question: ”, “translate English to German: ” and so forth. 
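
As a minimal sketch of what the prefix convention looks like in code, the helper below prepends a task prefix to raw text. The function name `with_task_prefix` and the `TASK_PREFIXES` table are illustrative assumptions, not part of the transformers library; the prefix strings are the ones listed above.

```python
# Hypothetical helper: plain string handling, not a transformers API.
TASK_PREFIXES = {
    "summarize": "summarize: ",
    "translate_en_de": "translate English to German: ",
}


def with_task_prefix(task: str, text: str) -> str:
    """Return `text` with the T5 prefix for `task` prepended."""
    try:
        prefix = TASK_PREFIXES[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None
    # Strip surrounding whitespace so the prefix is followed by exactly one space.
    return prefix + text.strip()


print(with_task_prefix("translate_en_de", "I need my key."))
# prints: translate English to German: I need my key.
```

The resulting string is what gets passed to the tokenizer; the model itself is the same for every task.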

In [None]:
pip install transformers

In [6]:
from transformers import pipeline
translator = pipeline("translation_en_to_de")

In [17]:
english_strings = ["I need my key.", "I have won.", "take a bus", "Do you know that?", "That'll be fun.", "You can fly non-stop from San Francisco to Munich on United Airlines but its a long flight"]

In [30]:
for english_string in english_strings:
  print(f'English: {english_string}')
  german_string = translator(english_string)[0]['translation_text']
  print(f'German: {german_string}\n')

English: I need my key.
German: Ich brauche meinen Schlüssel.

English: I have won.
German: Ich habe gewonnen.

English: take a bus
German: Bus nehmen

English: Do you know that?
German: Wissen Sie das?

English: That'll be fun.
German: Das wird Spaß machen.

English: You can fly non-stop from San Francisco to Munich on United Airlines but its a long flight
German: Sie können mit United Airlines von San Francisco nach München fliegen, aber es ist ein langer Flug.
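
The loop above calls the pipeline once per sentence. A pipeline object can also be called with a list of strings, in which case it should return one result dict per input. The sketch below wraps that in a small helper; `translate_all` is a hypothetical name, and the stand-in `fake_translator` exists only so the snippet runs without downloading a model.

```python
from typing import Callable, Dict, List, Sequence


def translate_all(
    translator: Callable[[List[str]], List[Dict[str, str]]],
    sentences: Sequence[str],
) -> List[str]:
    """Translate every sentence with a single call to the translator."""
    results = translator(list(sentences))
    return [result["translation_text"] for result in results]


# Stand-in with the same call shape as the pipeline, for illustration only.
# It does not translate anything; it just tags each input.
def fake_translator(batch: List[str]) -> List[Dict[str, str]]:
    return [{"translation_text": f"<de> {sentence}"} for sentence in batch]


print(translate_all(fake_translator, ["I have won.", "take a bus"]))
# prints: ['<de> I have won.', '<de> take a bus']
```

With the real pipeline from this notebook, the call would be `translate_all(translator, english_strings)`.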



Next we go one level lower in the HuggingFace API, which gives more control, including the ability to specify the model, the tokenizer and, if needed, the config. Note that T5 is not a dedicated translation model: the same text-to-text model also handles tasks such as summarization, and the task prefix is what selects translation. This is in PyTorch, although the TensorFlow 2.0 version differs very little.

In [46]:
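# AutoModelWithLMHead is deprecated; AutoModelForSeq2SeqLM is the class for encoder-decoder models such as T5
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")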



In [49]:
for english_string in english_strings:
  print(f'English: {english_string}')

  # Add the T5-specific prefix "translate English to German: " (note the trailing space)
  # so the model knows which task to perform
  prompt = "translate English to German: " + english_string

  inputs = tokenizer.encode(prompt, return_tensors="pt")

  # Override the PreTrainedModel.generate() defaults, e.g. max_length
  outputs = model.generate(inputs, max_length=50, num_beams=4, early_stopping=True)

  # skip_special_tokens drops <pad> and </s>; str.lstrip/rstrip strip character sets, not substrings
  german_string = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(f'German: {german_string}\n')


English: I need my key.
German:  Ich brauche meinen Schlüssel.

English: I have won.
German:  Ich habe gewonnen.

English: take a bus
German:  Bus nehmen

English: Do you know that?
German:  Wissen Sie das?

English: That'll be fun.
German:  Das wird Spaß machen.

English: You can fly non-stop from San Francisco to Munich on United Airlines but its a long flight
German:  Sie können mit United Airlines von San Francisco nach München fliegen, aber es ist ein langer Flug.
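
One caveat when cleaning decoded text by hand: `str.lstrip` and `str.rstrip` treat their argument as a set of characters, not as a substring, so `rstrip('</s>')` also removes a trailing `s` that belongs to the translation. The sketch below shows the difference on a made-up decoded string, together with a hypothetical helper `strip_special_tokens` that removes only whole tokens. Passing `skip_special_tokens=True` to `tokenizer.decode` avoids the problem altogether.

```python
def strip_special_tokens(text: str, tokens=("<pad>", "</s>")) -> str:
    """Remove whole special tokens from both ends of a decoded string."""
    text = text.strip()
    changed = True
    while changed:  # repeat until no token is left at either end
        changed = False
        for token in tokens:
            if text.startswith(token):
                text = text[len(token):].lstrip()
                changed = True
            if text.endswith(token):
                text = text[:-len(token)].rstrip()
                changed = True
    return text


# Made-up example string in the shape tokenizer.decode() returns for T5.
decoded = "<pad> Wir nehmen den Bus</s>"

print(repr(decoded.lstrip("<pad>").rstrip("</s>")))  # ' Wir nehmen den Bu'
print(repr(strip_special_tokens(decoded)))           # 'Wir nehmen den Bus'
```

The test sentences in this notebook happen not to end in a character from the stripped set, which is why their output looks correct either way.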

