# Language Translation Using the T5 and MarianMT Models and HuggingFace Framework

This notebook translates English to German with high accuracy in only a few lines of code, using the HuggingFace Transformers framework.

It shows both the highest-level abstraction, pipelines, and the slightly lower-level APIs where we create the model and tokenizer ourselves before doing the translations.

This example uses the same test sentences I used in my 2nd Coursera class capstone project, plus a bonus complex sentence.

It uses the T5 model (Text-to-Text Transfer Transformer) from Google: https://arxiv.org/abs/1910.10683

T5 is pretrained on the C4 dataset (Colossal Clean Crawled Corpus), which consists of about 750 gigabytes of clean English text scraped from the web.

T5 has been fine-tuned on a number of specific NLP tasks, including those in the GLUE and SuperGLUE benchmarks, which can be run using the code shown below with minor modifications.
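
T5 selects its task from a plain-text prefix on the input, so switching tasks mostly means switching that prefix. The sketch below illustrates the idea; the prefix strings follow the T5 paper and the `t5-base` config, while the dictionary keys and the `build_t5_prompt` helper are hypothetical names introduced only for illustration.

```python
# Task prefixes T5 was trained with. The strings follow the T5 paper and the
# t5-base model config; the dictionary keys are hypothetical labels.
T5_PREFIXES = {
    "en_to_de": "translate English to German: ",
    "en_to_fr": "translate English to French: ",
    "en_to_ro": "translate English to Romanian: ",
    "summarize": "summarize: ",
    "cola": "cola sentence: ",
}


def build_t5_prompt(task: str, text: str) -> str:
    """Prepend the task prefix so T5 knows which task to perform."""
    if task not in T5_PREFIXES:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(T5_PREFIXES)}")
    return T5_PREFIXES[task] + text.strip()


print(build_t5_prompt("en_to_de", "I need my key."))
print(build_t5_prompt("cola", "The book was read by me."))
```

The resulting string is what gets passed to the tokenizer; note that each prefix ends with a space.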

The notebook also translates the same English sentences to German with MarianMT, a family of over 1,000 pretrained models covering different language pairs, so the approach is easy to generalize to other languages. Marian is a framework for translation models; the MarianMT models use the same architecture as BART. https://arxiv.org/pdf/1804.00344.pdf

https://huggingface.co/transformers/master/model_doc/marian.html 

Lastly, it translates the same strings via the Google Translate API.


In [None]:
%pip install transformers

In [2]:
import warnings
warnings.filterwarnings(action='ignore')

In [None]:
from transformers import pipeline
translator = pipeline("translation_en_to_de")

In [4]:
english_strings = ["I need my key.", "I have won.", "take a bus", "Do you know that?", "That'll be fun.", "You can fly non-stop from San Francisco to Munich on United Airlines but its a long flight"]

In [5]:
for english_string in english_strings:
  print(f'English: {english_string}')
  german_string = translator(english_string)[0]['translation_text']
  print(f'German: {german_string}\n')

English: I need my key.
German: Ich brauche meinen Schlüssel.

English: I have won.
German: Ich habe gewonnen.

English: take a bus
German: Bus nehmen

English: Do you know that?
German: Wissen Sie das?

English: That'll be fun.
German: Das wird Spaß machen.

English: You can fly non-stop from San Francisco to Munich on United Airlines but its a long flight
German: Sie können mit United Airlines von San Francisco nach München fliegen, aber es ist ein langer Flug.
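
A pipeline can also be called with a list of strings, in which case it returns one `{'translation_text': ...}` dictionary per input. The sketch below wraps that pattern; `translate_all`, `fake_translator` and `load_real_translator` are hypothetical helper names, and the stand-in translator exists only so the example runs without downloading a model. Passing `model="t5-base"` explicitly is an assumption that matches the model used in the lower-level cells.

```python
from typing import Callable, Dict, List

# Any callable with the pipeline's list-in, list-of-dicts-out shape works here.
Translator = Callable[[List[str]], List[Dict[str, str]]]


def translate_all(translator: Translator, sentences: List[str]) -> Dict[str, str]:
    """Translate a batch in one call and map each source sentence to its translation."""
    results = translator(sentences)
    return {src: res["translation_text"] for src, res in zip(sentences, results)}


def fake_translator(sentences: List[str]) -> List[Dict[str, str]]:
    """Hypothetical stand-in that mimics the pipeline's output format without a model."""
    return [{"translation_text": f"<de> {s}"} for s in sentences]


def load_real_translator(model_name: str = "t5-base") -> Translator:
    """Build the real pipeline; downloads the model on first use."""
    from transformers import pipeline  # imported lazily so the sketch runs without it

    return pipeline("translation_en_to_de", model=model_name)


print(translate_all(fake_translator, ["I need my key.", "I have won."]))
```

With the real model, `translate_all(load_real_translator(), english_strings)` should give the same translations as the loop above, in a single call.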



Next we go one level lower in the HuggingFace API, which gives more control, including the ability to specify the model, the tokenizer and, if needed, the config. Note that we load T5 through the generic sequence-to-sequence class and select translation with a task prefix on the input text.

The following uses PyTorch.

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

In [7]:
for english_string in english_strings:
  print(f'English: {english_string}')

  # Add the T5-specific prefix "translate English to German: " (note the trailing space),
  # since T5 selects the task from the input text
  # Other prefixes are available in T5 for other NLP tasks including those in GLUE & SuperGLUE
  english_string = "translate English to German: " + english_string

  inputs = tokenizer.encode(english_string, return_tensors="pt")

  # Overriding PreTrainedModel.generate() default config, e.g. max_length
  outputs = model.generate(inputs, max_length=50, num_beams=4, early_stopping=True)
  # skip_special_tokens drops <pad> and </s>; str.lstrip/rstrip would strip character sets instead
  german_string = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(f'German: {german_string}\n')


English: I need my key.
German:  Ich brauche meinen Schlüssel.

English: I have won.
German:  Ich habe gewonnen.

English: take a bus
German:  Bus nehmen

English: Do you know that?
German:  Wissen Sie das?

English: That'll be fun.
German:  Das wird Spaß machen.

English: You can fly non-stop from San Francisco to Munich on United Airlines but its a long flight
German:  Sie können mit United Airlines von San Francisco nach München fliegen, aber es ist ein langer Flug.



The same as above, except in TensorFlow 2.0. Notice that the differences are trivial: only the model class (TFAutoModelForSeq2SeqLM) and return_tensors="tf" change.

In [None]:
from transformers import TFAutoModelForSeq2SeqLM, AutoTokenizer

model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

In [9]:
for english_string in english_strings:
  print(f'English: {english_string}')

  # Add the T5-specific prefix "translate English to German: " (note the trailing space),
  # since T5 selects the task from the input text
  # Other prefixes are available in T5 for other NLP tasks including those in GLUE & SuperGLUE
  english_string = "translate English to German: " + english_string

  inputs = tokenizer.encode(english_string, return_tensors="tf")

  # Overriding PreTrainedModel.generate() default config, e.g. max_length
  outputs = model.generate(inputs, max_length=50, num_beams=4, early_stopping=True)
  # skip_special_tokens drops <pad> and </s>; str.lstrip/rstrip would strip character sets instead
  german_string = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(f'German: {german_string}\n')

English: I need my key.
German:  Ich brauche meinen Schlüssel.

English: I have won.
German:  Ich habe gewonnen.

English: take a bus
German:  Bus nehmen

English: Do you know that?
German:  Wissen Sie das?

English: That'll be fun.
German:  Das wird Spaß machen.

English: You can fly non-stop from San Francisco to Munich on United Airlines but its a long flight
German:  Sie können mit United Airlines von San Francisco nach München fliegen, aber es ist ein langer Flug.
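
One detail worth knowing when cleaning decoded T5 output: `str.lstrip` and `str.rstrip` remove any characters from the given set, not a literal prefix or suffix, so they can eat into the translation itself. The sketch below shows the effect on a hypothetical decoded string and one way around it; `strip_special_tokens` is an illustrative helper, not part of the library. Passing `skip_special_tokens=True` to `tokenizer.decode` avoids the problem entirely.

```python
def strip_special_tokens(decoded: str, special_tokens=("<pad>", "</s>")) -> str:
    """Remove literal special-token strings, then trim surrounding whitespace."""
    for token in special_tokens:
        decoded = decoded.replace(token, "")
    return decoded.strip()


raw = "<pad> Nimm den Bus</s>"  # hypothetical decoded output that ends in "s"

# lstrip/rstrip treat their argument as a set of characters, so the final
# "s" of "Bus" is removed together with "</s>".
print(repr(raw.lstrip("<pad>").rstrip("</s>")))

# Replacing the literal tokens keeps the sentence intact.
print(repr(strip_special_tokens(raw)))
```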



**MarianMT Model**

MarianMT supports over 1,000 language combinations. In the German translations below, one sentence was not translated as well as by T5.

https://huggingface.co/transformers/master/model_doc/marian.html

In [10]:
%pip install sentencepiece



In [None]:
from transformers import MarianMTModel, MarianTokenizer

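# The ">>deu<<" target-language prefix follows the MarianMT docs example;
# it should only be needed for multilingual checkpoints
src_text = [">>deu<< " + s for s in english_strings]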

model_name = 'Helsinki-NLP/opus-mt-en-de'

tokenizer = MarianTokenizer.from_pretrained(model_name)
print(tokenizer.supported_language_codes)
model = MarianMTModel.from_pretrained(model_name)
# Tokenize the whole batch with padding (prepare_seq2seq_batch is deprecated)
inputs = tokenizer(src_text, return_tensors="pt", padding=True)
outputs = model.generate(**inputs)
german_strings = [tokenizer.decode(t, skip_special_tokens=True) for t in outputs]

In [12]:
for english_string, german_string in zip(english_strings, german_strings):
  print(f'English: {english_string}')
  print(f'German: {german_string}\n')

English: I need my key.
German: Ich brauche meinen Schlüssel.

English: I have won.
German: Ich habe gewonnen.

English: take a bus
German: Nehmen Sie einen Bus

English: Do you know that?
German: Weißt du das?

English: That'll be fun.
German: Das wird lustig.

English: You can fly non-stop from San Francisco to Munich on United Airlines but its a long flight
German: Sie können nonstop von San Francisco nach München auf United Airlines fliegen, aber es ist ein langer Flug
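
Generalizing to another language pair is mostly a matter of changing the checkpoint name, since the Helsinki-NLP models follow the pattern `Helsinki-NLP/opus-mt-{src}-{tgt}`. The sketch below builds such names and the `>>code<<` prefix used by multilingual checkpoints; the helper names are hypothetical, and not every source/target combination necessarily has a published checkpoint, so `load_marian` may raise for some pairs.

```python
def marian_model_name(src: str, tgt: str) -> str:
    """Build the Hub id for a Helsinki-NLP MarianMT checkpoint."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def add_target_prefix(sentences, target_code=None):
    """Prepend '>>code<< ' for multilingual checkpoints; otherwise leave text unchanged."""
    if target_code is None:
        return list(sentences)
    return [f">>{target_code}<< {s}" for s in sentences]


def load_marian(src: str, tgt: str):
    """Load tokenizer and model; raises if no checkpoint exists for the pair."""
    from transformers import MarianMTModel, MarianTokenizer  # lazy import

    name = marian_model_name(src, tgt)
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)


print(marian_model_name("en", "de"))
print(add_target_prefix(["I have won."], target_code="deu"))
```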



**Google Translate API**

This uses google_trans_new, a modified/patched third-party client for Google Translate.

In [13]:
%pip install google_trans_new



In [14]:
from google_trans_new import google_translator

translator = google_translator()  

In [15]:
for english_string in english_strings:
  german_string = translator.translate(english_string, lang_tgt='de')
  print(f'English: {english_string}')
  print(f'German: {german_string}\n')

English: I need my key.
German: Ich brauche meinen Schlüssel. 

English: I have won.
German: Ich habe gewonnen. 

English: take a bus
German: nehmen Sie einen Bus 

English: Do you know that?
German: Weißt du, dass? 

English: That'll be fun.
German: Das wird Spaß machen. 

English: You can fly non-stop from San Francisco to Munich on United Airlines but its a long flight
German: Sie können mit United Airlines nonstop von San Francisco nach München fliegen, aber es ist ein langer Flug 
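
To see at a glance where the three systems agree, their outputs can be grouped after light normalization. The sketch below does this for two of the sentences, using the translations recorded in the outputs above; `normalize` and `group_by_output` are hypothetical helpers, and the normalization (trimming whitespace and final punctuation, lowercasing) is only one possible choice.

```python
from collections import defaultdict
from typing import Dict, List

# Translations copied from the recorded outputs above.
RECORDED = {
    "take a bus": {
        "T5": "Bus nehmen",
        "MarianMT": "Nehmen Sie einen Bus",
        "Google": "nehmen Sie einen Bus ",
    },
    "That'll be fun.": {
        "T5": "Das wird Spaß machen.",
        "MarianMT": "Das wird lustig.",
        "Google": "Das wird Spaß machen. ",
    },
}


def normalize(text: str) -> str:
    """Ignore surrounding whitespace, final punctuation and letter case."""
    return text.strip().rstrip(".!?").lower()


def group_by_output(translations: Dict[str, str]) -> List[List[str]]:
    """Group system names whose normalized translations match, largest group first."""
    groups = defaultdict(list)
    for system, text in translations.items():
        groups[normalize(text)].append(system)
    return sorted(groups.values(), key=len, reverse=True)


for english, translations in RECORDED.items():
    print(english, "->", group_by_output(translations))
```

Agreement between systems is not proof of correctness, but disagreements are a quick pointer to the sentences worth checking by hand.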

