In [1]:
pip install transformers==4.12.4 sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.12.4
  Downloading transformers-4.12.4-py3-none-any.whl (3.1 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
Building wheels for

In [2]:
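# import only the names this notebook uses instead of a wildcard import
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

# source & destination languages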
src = "en"
dst = "de"

task_name = f"translation_{src}_to_{dst}"
model_name = f"Helsinki-NLP/opus-mt-{src}-{dst}"

translator = pipeline(task_name, model=model_name, tokenizer=model_name)

In [3]:
translator("You're a genius.")[0]["translation_text"]

'Du bist ein Genie.'
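
The task name and model name built above both follow a fixed pattern derived from the two language codes. A minimal sketch of that pattern as a reusable helper follows; the function name `build_names` is an assumption made for illustration, and the helper only formats strings, so it does not check that a model for the given pair actually exists on the Hub.

```python
from typing import Tuple


def build_names(src_lang: str, dst_lang: str) -> Tuple[str, str]:
    """Return (task_name, model_name) for a language pair.

    Hypothetical helper: it mirrors the f-strings used in the notebook and
    performs no lookup, so a well-formed name may still refer to a model
    that does not exist.
    """
    src_lang, dst_lang = src_lang.strip(), dst_lang.strip()
    if not src_lang or not dst_lang:
        raise ValueError("both language codes must be non-empty")
    task_name = f"translation_{src_lang}_to_{dst_lang}"
    model_name = f"Helsinki-NLP/opus-mt-{src_lang}-{dst_lang}"
    return task_name, model_name


task_name, model_name = build_names("en", "de")
print(task_name)
print(model_name)
```

The returned pair can be passed to `pipeline(task_name, model=model_name, tokenizer=model_name)` exactly as in the cell above.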

In [4]:
article = """
Albert Einstein ( 14 March 1879 – 18 April 1955) was a German-born theoretical physicist, widely acknowledged to be one of the greatest physicists of all time. 
Einstein is best known for developing the theory of relativity, but he also made important contributions to the development of the theory of quantum mechanics. 
Relativity and quantum mechanics are together the two pillars of modern physics. 
His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been dubbed "the world's most famous equation". 
His work is also known for its influence on the philosophy of science.
He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect", a pivotal step in the development of quantum theory. 
His intellectual achievements and originality resulted in "Einstein" becoming synonymous with "genius"
"""
translator(article)[0]["translation_text"]

'Albert Einstein (* 14. März 1879 – 18. April 1955) war ein deutscher theoretischer Physiker, der allgemein als einer der größten Physiker aller Zeiten anerkannt wurde. Einstein ist am besten für die Entwicklung der Relativitätstheorie bekannt, aber er leistete auch wichtige Beiträge zur Entwicklung der Quantenmechaniktheorie. Relativität und Quantenmechanik sind zusammen die beiden Säulen der modernen Physik. Seine Massenenergieäquivalenzformel E = mc2, die aus der Relativitätstheorie hervorgeht, wurde als „die berühmteste Gleichung der Welt" bezeichnet. Seine Arbeit ist auch für ihren Einfluss auf die Philosophie der Wissenschaft bekannt. Er erhielt 1921 den Nobelpreis für Physik „für seine Verdienste um die theoretische Physik und vor allem für seine Entdeckung des Gesetzes über den photoelektrischen Effekt", einen entscheidenden Schritt in der Entwicklung der Quantentheorie. Seine intellektuellen Leistungen und Originalität führten dazu, dass „Einstein" zum Synonym für „Genius" wur
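
The article above fits in a single call, but longer texts can exceed the model's input limit (later cells truncate at 512 tokens). One common workaround is to translate sentence by sentence and join the results. The sketch below assumes a naive regex splitter and a caller-supplied `translate_fn`; both names are hypothetical, and the demo uses `str.upper` as a stand-in so that it runs without downloading a model.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def translate_long_text(text: str, translate_fn: Callable[[str], str]) -> str:
    """Translate each sentence separately and join the results with spaces."""
    return " ".join(translate_fn(sentence) for sentence in split_sentences(text))


# Stand-in "translator" so the example is self-contained.
demo = "Relativity is one pillar. Quantum mechanics is the other!"
print(split_sentences(demo))
print(translate_long_text(demo, str.upper))
```

With the pipeline from this notebook, `translate_fn` could be `lambda s: translator(s)[0]["translation_text"]`. Sentence-level translation loses cross-sentence context, so the result may differ from translating the whole text at once.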

In [5]:
def get_translation_model_and_tokenizer(src_lang, dst_lang):
  """
  Given the source and destination language codes, returns the matching
  Helsinki-NLP OPUS-MT model and its tokenizer.
  See the language codes here: https://developers.google.com/admin-sdk/directory/v1/languages
  For 3-character language codes, search for the code online.
  """
  # construct the model name from the function arguments (not the globals)
  model_name = f"Helsinki-NLP/opus-mt-{src_lang}-{dst_lang}"
  # initialize the tokenizer & model
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
  # return them for use
  return model, tokenizer
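
Each call to the function above re-creates the model and tokenizer objects, even for a language pair that was already loaded. A minimal sketch of an in-memory cache keyed by language pair is shown below; `get_cached` and `_MODEL_CACHE` are hypothetical names, and the demo uses a fake loader so that it runs without downloading anything.

```python
from typing import Any, Callable, Dict, Tuple

# Maps (src_lang, dst_lang) to whatever the loader returned for that pair.
_MODEL_CACHE: Dict[Tuple[str, str], Any] = {}


def get_cached(src_lang: str, dst_lang: str,
               loader: Callable[[str, str], Any]) -> Any:
    """Call loader(src_lang, dst_lang) once per pair and reuse the result."""
    key = (src_lang, dst_lang)
    if key not in _MODEL_CACHE:
        _MODEL_CACHE[key] = loader(src_lang, dst_lang)
    return _MODEL_CACHE[key]


# Fake loader that records how often it is called.
calls = []

def fake_loader(src_lang: str, dst_lang: str) -> str:
    calls.append((src_lang, dst_lang))
    return f"model-for-{src_lang}-{dst_lang}"


first = get_cached("en", "zh", fake_loader)
second = get_cached("en", "zh", fake_loader)
print(first is second, len(calls))
```

In this notebook the loader would be `get_translation_model_and_tokenizer`, for example `model, tokenizer = get_cached("en", "zh", get_translation_model_and_tokenizer)`. Cached models stay in memory until the cache is cleared.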

In [6]:
# source & destination languages
src = "en"
dst = "zh"

model, tokenizer = get_translation_model_and_tokenizer(src, dst)

In [7]:
# encode the text into a tensor of token IDs, truncating anything beyond 512 tokens
inputs = tokenizer.encode(article, return_tensors="pt", max_length=512, truncation=True)
print(inputs)

tensor([[32614, 53456,    22,   992,   776,   822,  4048,     8,  3484,   822,
           820, 50940,    17,    43,    13,  8214,    16, 32941, 34899, 60593,
             2,  5514,  7131,     9,    34,   141,     4,     3,  7680, 60593,
            24,     4,    61,   220,     6, 53456,    32,  1109,  3305,    15,
           320,     3, 19082,     4,  1294, 24030, 28453,     2,   187,   172,
            81,   157,   435,  1061,     9,     3,    92,     4,     3, 19082,
             4, 52682, 54813,     6, 45978, 28453,     7, 52682, 54813,    46,
          1105,     3,   263, 12538,     4,  6683, 46089,     6,  1608,  3196,
          3484, 45425, 50560, 14655,   509,     8,  6873,  4374,   149,  9132,
            62, 22703,    51,  1294, 24030, 28453, 19082,     2,    66,    74,
         16044, 18553,   258,    40,  1862,   431,    23,    24,   447, 23761,
         47364, 10594,  1608,   119,    32,    81,  3305,    15,    45,  6748,
            19,     3, 34857,     4,  4102,     6,  

In [8]:
# generate the translation with the model's default generation settings
# (pass num_beams=1 to force plain greedy search)
greedy_outputs = model.generate(inputs)
# decode the output and ignore special tokens
print(tokenizer.decode(greedy_outputs[0], skip_special_tokens=True))

阿尔伯特·爱因斯坦(1879年3月14日至1955年4月18日)是德国出生的理论物理学家,被广泛承认是有史以来最伟大的物理学家之一。爱因斯坦以发展相对论闻名,但他也为量子力学理论的发展做出了重要贡献。相对论和量子力学是现代物理学的两大支柱。他的质量 — — 能源等值公式E = mc2来自相对论,被称作“世界最著名的方程 ” 。 他的工作也因其对科学哲学的影响而著称。 他获得了1921年诺贝尔物理奖,“因为他对理论物理学的服务,特别是他发现了光电效应法 ”, 这是量子理论发展的关键一步。 他的智力成就和创举导致“Einstein”成为“genius”的同义词。
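
The cell above describes its decoding as greedy search: at every step the single most probable next token is chosen. The toy sketch below illustrates the idea on a hand-written probability table. It is an assumption-laden simplification: `TOY_MODEL` and `greedy_decode` are hypothetical, and the table conditions only on the previous token, whereas a real translation model conditions on the source text and the whole prefix.

```python
from typing import Dict, List

# Hypothetical next-token probabilities, keyed by the previous token.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"cat": 0.55, "dog": 0.45},
    "the": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def greedy_decode(next_probs: Dict[str, Dict[str, float]],
                  start: str = "<s>", end: str = "</s>",
                  max_steps: int = 10) -> List[str]:
    """Pick the most probable next token at each step until `end` appears."""
    tokens: List[str] = []
    current = start
    for _ in range(max_steps):
        candidates = next_probs[current]
        current = max(candidates, key=candidates.get)
        if current == end:
            break
        tokens.append(current)
    return tokens


print(greedy_decode(TOY_MODEL))
```

In this table greedy search commits to "a" first and ends with "a cat" (probability 0.6 × 0.55 = 0.33), even though "the cat" is more probable overall (0.4 × 0.9 = 0.36). That kind of miss is the motivation for beam search.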


In [9]:
# let's change the target language
src = "en"
dst = "ar"

# get en-ar model & tokenizer
model, tokenizer = get_translation_model_and_tokenizer(src, dst)

In [10]:
# yet another example
text = "It can be severe, and has caused millions of deaths around the world as well as lasting health problems in some who have survived the illness."
# tokenize the text
inputs = tokenizer.encode(text, return_tensors="pt", max_length=512, truncation=True)
# this time use beam search with 5 beams and return all 5 candidates for comparison
beam_outputs = model.generate(
    inputs, 
    num_beams=5, 
    num_return_sequences=5,
    early_stopping=True,
)
for i, beam_output in enumerate(beam_outputs):
  print(tokenizer.decode(beam_output, skip_special_tokens=True))
  print("="*50)

ويمكن أن تكون حادة، وقد تسببت في ملايين الوفيات في جميع أنحاء العالم، فضلا عن مشاكل صحية دائمة في بعض الذين نجوا من المرض.
ويمكن أن تكون خطيرة، وقد تسببت في ملايين الوفيات في جميع أنحاء العالم، فضلا عن مشاكل صحية دائمة في بعض الذين نجوا من المرض.
ويمكن أن تكون حادة، وقد تسببت في ملايين الوفيات في جميع أنحاء العالم، فضلا عن مشاكل صحية دائمة لدى بعض الذين نجوا من المرض.
ويمكن أن تكون حادة، وقد تسببت في ملايين الوفيات في جميع أنحاء العالم، فضلا عن مشاكل صحية دائمة في بعض من نجوا من المرض.
ويمكن أن تكون حادة، وقد تسببت في وفاة ملايين الأشخاص في جميع أنحاء العالم، فضلا عن مشاكل صحية دائمة في بعض الذين نجوا من المرض.
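
The five candidates above differ only in a few word choices, which is typical of beam search: it keeps the `num_beams` most probable partial sequences at each step instead of only one. The toy sketch below shows the mechanism on a hand-written probability table. `TOY_MODEL` and `beam_search` are hypothetical, the table conditions only on the previous token, and the sketch omits details of real implementations such as length penalties and early-stopping rules.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical next-token probabilities, keyed by the previous token.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"cat": 0.55, "dog": 0.45},
    "the": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(next_probs: Dict[str, Dict[str, float]],
                num_beams: int = 2, num_return_sequences: int = 2,
                start: str = "<s>", end: str = "</s>",
                max_steps: int = 10) -> List[Tuple[List[str], float]]:
    """Return up to num_return_sequences (tokens, probability) pairs."""
    if num_return_sequences > num_beams:
        raise ValueError("num_return_sequences must be <= num_beams")
    beams = [(0.0, [start])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_steps):
        # Extend every live beam by every possible next token.
        candidates = []
        for score, tokens in beams:
            for tok, p in next_probs[tokens[-1]].items():
                candidates.append((score + math.log(p), tokens + [tok]))
        # Keep only the num_beams best extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:num_beams]:
            if tokens[-1] == end:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    # Beams still unfinished after max_steps are dropped in this sketch.
    finished.sort(key=lambda c: c[0], reverse=True)
    return [(tokens[1:-1], math.exp(score))
            for score, tokens in finished[:num_return_sequences]]


for tokens, prob in beam_search(TOY_MODEL):
    print(" ".join(tokens), round(prob, 2))
```

With two beams the search keeps both "a" and "the" after the first step, so it finds "the cat" (0.36) ahead of "a cat" (0.33), a ranking that choosing only the best token at each step would miss.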
