# WMT21 Translation

https://huggingface.co/facebook/wmt21-dense-24-wide-en-x

WMT 21 En-X is a 4.7B multilingual encoder-decoder (seq-to-seq) model trained for one-to-many multilingual translation. It was introduced in this paper and first released in this repository.

The model can directly translate English text into 7 other languages: Hausa (ha), Icelandic (is), Japanese (ja), Czech (cs), Russian (ru), Chinese (zh), German (de).

To translate into a target language, the target language id must be forced as the first generated token: pass it as the forced_bos_token_id parameter of the generate method.

Since the model was trained with domain tags, prepend one of them to the input as well:

- "wmtdata newsdomain": use for sentences in the news domain
- "wmtdata otherdomain": use for sentences in all other domains

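As a minimal sketch of the tagging step, the helper below prepends one of the two tags listed above to an input sentence. The name `add_domain_tag` and its `domain` argument are illustrative assumptions, not part of the model's API; only the two tag strings come from the description above.

```python
# Domain tags the model was trained with (taken from the list above).
DOMAIN_TAGS = {
    "news": "wmtdata newsdomain",
    "other": "wmtdata otherdomain",
}


def add_domain_tag(text: str, domain: str = "other") -> str:
    """Prepend the domain tag, separated by one space (hypothetical helper)."""
    if domain not in DOMAIN_TAGS:
        raise ValueError(f"unknown domain {domain!r}; expected one of {sorted(DOMAIN_TAGS)}")
    return f"{DOMAIN_TAGS[domain]} {text}"


print(add_domain_tag("Troops searched the hospital.", domain="news"))
# wmtdata newsdomain Troops searched the hospital.
```
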
In [1]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

#https://huggingface.co/facebook/wmt21-dense-24-wide-en-x
# Load the English-to-X checkpoint and its tokenizer.
model_name="facebook/wmt21-dense-24-wide-en-x"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)



In [2]:
def translate(model, tokenizer, textinput):
    # Domain tags the model was trained with; the inputs here are news text.
    newsdomain="wmtdata newsdomain "
    otherdomain="wmtdata otherdomain "  # for non-news input (unused here)
    textinput = newsdomain+textinput
    inputs = tokenizer(textinput, return_tensors="pt")

    # Translate English to Chinese by forcing "zh" as the first generated token.
    # Pass max_new_tokens to generate() if longer outputs are needed.
    generated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("zh"))
    result=tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    print(result)
    return result

In [3]:
textinput="Israeli troops scour Gaza’s al-Shifa Hospital for evidence of Hamas."
result = translate(model, tokenizer, textinput)



['以色列军队搜查加沙的希法医院,寻找哈马斯存在的证据。']


In [4]:
textinput="The Israeli raid of the Gaza Strip’s largest hospital stretched into its second day Thursday as troops searched for evidence of the extensive Hamas infrastructure that Israeli and U.S. officials have said lies beneath the facility. The Israel Defense Forces said Thursday that searches had uncovered the body of a captive Israeli woman in a house near the hospital, along with weapons. On Wednesday, the IDF released photographs and video of small caches of weapons it said belonged to Hamas. The military added to its case Thursday with a photo and video of a rough cavity that it described as an “operational tunnel shaft.” The Washington Post verified the location of the shaft inside the al-Shifa Hospital complex but could not verify where the opening led or what its purpose might be."
result = translate(model, tokenizer, textinput)

['以色列对加沙地带最大医院的突袭行动于周四进入第二天,以军搜查了哈马斯在该医院地下拥有大量基础设施的证据,以色列和美国官员称这些基础设施位于该设施之下。以色列国防军周四表示,搜查行动在医院附近的一所房屋中发现了一名被俘以色列妇女的尸体,以及武器。周三,以色列国防军发布了据称属于哈马斯的小型武器藏匿处的照片和视频。军方周四在其案例中添加了一张照片和一段视频,该照片和视频显示了一个粗糙的空洞,该空洞被描述为"操作隧道竖井"。《华盛顿邮报》核实了该竖井在al-Shifa医院建筑群内的位置,但无法核实该开口通向何处,也无法核实其目的。']


In [5]:
device='mps'
model.to(device)

M2M100ForConditionalGeneration(
  (model): M2M100Model(
    (shared): Embedding(128009, 2048, padding_idx=1)
    (encoder): M2M100Encoder(
      (embed_tokens): Embedding(128009, 2048, padding_idx=1)
      (embed_positions): M2M100SinusoidalPositionalEmbedding()
      (layers): ModuleList(
        (0-23): 24 x M2M100EncoderLayer(
          (self_attn): M2M100Attention(
            (k_proj): Linear(in_features=2048, out_features=2048, bias=True)
            (v_proj): Linear(in_features=2048, out_features=2048, bias=True)
            (q_proj): Linear(in_features=2048, out_features=2048, bias=True)
            (out_proj): Linear(in_features=2048, out_features=2048, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
          (activation_fn): ReLU()
          (fc1): Linear(in_features=2048, out_features=16384, bias=True)
          (fc2): Linear(in_features=16384, out_features=2048, bias=True)
          (final_layer_norm): LayerNo
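
The cell above hardcodes `'mps'`, which is only available on Apple-silicon builds of PyTorch. A hedged sketch of a fallback, assuming PyTorch is the backend: the helper name `pick_device` is hypothetical, while `torch.cuda.is_available()` and `torch.backends.mps.is_available()` are existing PyTorch calls.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what this machine offers."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no accelerator to pick from.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no torch.backends.mps, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

The returned string can be passed to `model.to(device)` in place of the literal.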

In [6]:
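def translate(model, tokenizer, textinput, device):
    # Domain tags the model was trained with; the inputs here are news text.
    newsdomain="wmtdata newsdomain "
    otherdomain="wmtdata otherdomain "  # for non-news input (unused here)
    textinput = newsdomain+textinput
    # Move the encoded inputs to the same device as the model.
    inputs = tokenizer(textinput, return_tensors="pt").to(device)

    # Translate English to Chinese by forcing "zh" as the first generated token.
    # Pass max_new_tokens to generate() if longer outputs are needed.
    generated_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("zh"))
    result=tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    print(result)
    return result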

In [7]:
textinput="The Israeli raid of the Gaza Strip’s largest hospital stretched into its second day Thursday as troops searched for evidence of the extensive Hamas infrastructure that Israeli and U.S. officials have said lies beneath the facility. The Israel Defense Forces said Thursday that searches had uncovered the body of a captive Israeli woman in a house near the hospital, along with weapons. On Wednesday, the IDF released photographs and video of small caches of weapons it said belonged to Hamas. The military added to its case Thursday with a photo and video of a rough cavity that it described as an “operational tunnel shaft.” The Washington Post verified the location of the shaft inside the al-Shifa Hospital complex but could not verify where the opening led or what its purpose might be."
result = translate(model, tokenizer, textinput, device)



['以色列对加沙地带最大医院的突袭行动于周四进入第二天,以军搜查了哈马斯在该医院地下拥有大量基础设施的证据,以色列和美国官员称这些基础设施位于该设施之下。以色列国防军周四表示,搜查行动在医院附近的一所房屋中发现了一名被俘以色列妇女的尸体,以及武器。周三,以色列国防军发布了据称属于哈马斯的小型武器藏匿处的照片和视频。军方周四在其案例中添加了一张照片和一段视频,该照片和视频显示了一个粗糙的空洞,该空洞被描述为"操作隧道竖井"。《华盛顿邮报》核实了该竖井在al-Shifa医院建筑群内的位置,但无法核实该开口通向何处,也无法核实其目的。']


# English to Tagalog

https://github.com/YoonjungChoi/ProjectHAWAII/blob/main/Helsinki-NLP_TEST.ipynb

In [1]:
from transformers import pipeline
from transformers import AutoTokenizer

#https://huggingface.co/Helsinki-NLP
# pipeline() runs on CPU or GPU depending on its device argument:
# an integer, where -1 means CPU and >= 0 refers to the CUDA device ordinal.
model_checkpoint = "Helsinki-NLP/opus-mt-en-tl"

# return_tensors belongs to the tokenizer call, not to from_pretrained.
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")



[{'translation_text': 'Paglubog sa pinalawak na mga sinulid'}]

In [6]:
textinput="Default to expanded threads"
inputs = tokenizer(textinput, return_tensors="pt")
inputs

{'input_ids': tensor([[ 2667,  1534, 49919,    12, 15147, 26102,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

In [9]:
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [10]:
model

MarianMTModel(
  (model): MarianModel(
    (shared): Embedding(57373, 512, padding_idx=57372)
    (encoder): MarianEncoder(
      (embed_tokens): Embedding(57373, 512, padding_idx=57372)
      (embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
      (layers): ModuleList(
        (0-5): 6 x MarianEncoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation_fn): SiLUActivation()
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,),

In [11]:
generated_tokens = model.generate(
                        inputs["input_ids"],
                        attention_mask=inputs["attention_mask"],
                        max_length=128,
                    )
generated_tokens

tensor([[57372,   508, 20428,     4, 15728,     6,    10, 18531,     0]])

In [12]:
result=tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
result

['Paglubog sa pinalawak na mga sinulid']

In [1]:
import evaluate
metric = evaluate.load("sacrebleu")



In [2]:
predictions = [
    "This plugin lets you translate web pages between several languages automatically."
]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

{'score': 46.750469682990165,
 'counts': [11, 6, 4, 3],
 'totals': [12, 11, 10, 9],
 'precisions': [91.66666666666667,
  54.54545454545455,
  40.0,
  33.333333333333336],
 'bp': 0.9200444146293233,
 'sys_len': 12,
 'ref_len': 13}
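
The fields in this output are enough to recompute the score by hand: BLEU is the brevity penalty times the geometric mean of the four n-gram precisions. The sketch below uses only the standard library; the function name `bleu_from_stats` is hypothetical, and it assumes every n-gram count is non-zero (no smoothing is applied).

```python
import math


def bleu_from_stats(counts, totals, sys_len, ref_len):
    """Recompute corpus BLEU (0-100) from n-gram statistics; assumes no zero counts."""
    precisions = [c / t for c, t in zip(counts, totals)]
    # Brevity penalty: 1 when the hypothesis is at least as long as the
    # reference, otherwise exp(1 - ref_len / sys_len).
    bp = 1.0 if sys_len >= ref_len else math.exp(1.0 - ref_len / sys_len)
    # Geometric mean of the precisions, computed in log space.
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return 100.0 * bp * math.exp(log_mean)


score = bleu_from_stats([11, 6, 4, 3], [12, 11, 10, 9], sys_len=12, ref_len=13)
print(round(score, 2))  # 46.75
```

Here the hypothesis is one token shorter than the reference (12 vs 13), so the brevity penalty is exp(1 - 13/12), roughly 0.92, which matches the `bp` field above.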

In [4]:
import sacrebleu
#bleu = sacrebleu.corpus_bleu(predictions, references, tokenize="none", lowercase=True)
bleu = sacrebleu.corpus_bleu(predictions, references)
print("BLEU score:", bleu.score)

BLEU score: 46.750469682990165


In [5]:
predictions = ["hello there general kenobi", "foo bar foobar"]
references = [["hello there general kenobi", "hello there !"], ["foo bar foobar", "foo bar foobar"]]
results = metric.compute(predictions=predictions, references=references)
print(list(results.keys()))
print(round(results["score"], 1))

['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len']
100.0


In [8]:
predictions = ["hello there general kenobi", "on our way to ankh morpork"]
references = [["hello there general kenobi", "hello there !"], ["goodbye ankh morpork", "ankh morpork"]]
results = metric.compute(predictions=predictions, references=references, tokenize="none")
print(list(results.keys()))
print(round(results["score"], 1))

['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len']
39.8


In [16]:
results

{'score': 39.76353643835252,
 'counts': [6, 4, 2, 1],
 'totals': [10, 8, 6, 4],
 'precisions': [60.0, 50.0, 33.333333333333336, 25.0],
 'bp': 1.0,
 'sys_len': 10,
 'ref_len': 7}

In [9]:
batch = {"predictions": predictions, "references": references}

In [10]:
batch

{'predictions': ['hello there general kenobi', 'on our way to ankh morpork'],
 'references': [['hello there general kenobi', 'hello there !'],
  ['goodbye ankh morpork', 'ankh morpork']]}

In [12]:
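# Note: corpus_bleu expects one list per reference stream, while `references`
# here is nested per prediction (the evaluate layout). The statistics happen
# to coincide for this particular example.
bleu = sacrebleu.corpus_bleu(predictions, references)
print("BLEU score:", bleu)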

BLEU score: BLEU = 39.76 60.0/50.0/33.3/25.0 (BP = 1.000 ratio = 1.429 hyp_len = 10 ref_len = 7)


In [14]:
bleu.score

39.76353643835252

In [15]:
bleu.counts

[6, 4, 2, 1]

In [17]:
bleu.totals

[10, 8, 6, 4]

In [18]:
bleu.precisions

[60.0, 50.0, 33.333333333333336, 25.0]

In [19]:
bleu.bp

1.0

In [20]:
bleu.sys_len

10

In [21]:
bleu.ref_len

7

In [22]:
preds=['"我将呼吁上帝,谁值得被赞美。"', '母亲母亲、', '给你']
refs=[['我要求告当赞美的耶和华'], ['媳妇'], ['給你錢']]

In [25]:
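# Caution: refs above is nested per prediction (the evaluate layout), but
# corpus_bleu expects one list per reference stream, each as long as preds.
bleu = sacrebleu.corpus_bleu(preds, refs, tokenize="zh")
print("BLEU score:", bleu.score)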

BLEU score: 5.412989186545263


In [48]:
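preds=['我将呼吁上帝,谁值得被赞美', '母亲母亲、', '给你']
# evaluate layout (one list of references per prediction):
#refs=[['我将呼吁上帝,谁值得被赞美'], ['母亲母亲、'], ['给你']]
# sacrebleu layout (one list per reference stream, same length as preds):
refs=[['我将呼吁上帝,谁值得被赞美', '母亲母亲、', '给你']]
# The references are identical to the predictions, so the score is 100.
bleu = sacrebleu.corpus_bleu(preds, refs, tokenize="zh")
print("BLEU score:", bleu.score)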

BLEU score: 100.00000000000004
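
The two `refs` lines in the cell above hold the same references in two layouts: the `evaluate` metric takes one list of references per prediction, while `sacrebleu.corpus_bleu` takes one list per reference stream. A minimal sketch of converting the first layout into the second, using only the standard library; the helper name `to_reference_streams` is hypothetical.

```python
def to_reference_streams(per_prediction_refs):
    """Transpose [[refs for pred 0], [refs for pred 1], ...] into reference streams."""
    ref_counts = {len(refs) for refs in per_prediction_refs}
    if len(ref_counts) != 1:
        raise ValueError("every prediction needs the same number of references")
    # zip(*...) groups the i-th reference of every prediction into stream i.
    return [list(stream) for stream in zip(*per_prediction_refs)]


evaluate_style = [['我将呼吁上帝,谁值得被赞美'], ['母亲母亲、'], ['给你']]
print(to_reference_streams(evaluate_style))
# [['我将呼吁上帝,谁值得被赞美', '母亲母亲、', '给你']]
```
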


In [27]:
import evaluate
metric = evaluate.load("sacrebleu")
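# metric.compute expects one list of references per prediction, and no
# tokenize argument is passed, so the metric's default tokenizer is used.
result=metric.compute(predictions=preds, references=refs)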

In [28]:
result['score']

0.0
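
One plausible reason for a score of 0.0 here is tokenization: this call does not pass `tokenize="zh"`, and the default tokenizer does not split Chinese text into characters, so whole phrases are compared as single tokens. The toy below illustrates the effect on one prediction/reference pair from the cells above. It is a simplified illustration using only the standard library, not sacrebleu's actual `zh` tokenizer, and the helper name `unigram_matches` is hypothetical.

```python
from collections import Counter


def unigram_matches(pred_tokens, ref_tokens):
    """Clipped unigram matches between a prediction and one reference."""
    overlap = Counter(pred_tokens) & Counter(ref_tokens)
    return sum(overlap.values())


pred, ref = "给你", "給你錢"

# Whitespace tokens: each unsegmented sentence is a single token, so nothing matches.
print(unigram_matches(pred.split(), ref.split()))  # 0
# Character tokens: the shared character 你 now counts as a match.
print(unigram_matches(list(pred), list(ref)))  # 1
```
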

In [45]:
refs = [ # First set of references
        ['The dog bit the man.', 'It was not unexpected.', 'The man bit him first.'],
        # Second set of references
        ['The dog had bit the man.', 'No one was surprised.', 'The man had bitten the dog.'],
        ]
sys = ['The dog bit the man.', "It wasn't surprising.", 'The man had just bitten him.']
bleu = sacrebleu.corpus_bleu(sys, refs, tokenize="none")
print("BLEU score:", bleu.score)

BLEU score: 49.19195660047277
