In [1]:
pip install argostranslate

Note: you may need to restart the kernel to use updated packages.


In [1]:
import argostranslate.package
import argostranslate.translate

from_code = 'en'
to_code = 'zh'

# Download and install Argos Translate package
argostranslate.package.install_from_path('./translate-en_zh-1_9.argosmodel')
argostranslate.package.install_from_path('./translate-zh_en-1_9.argosmodel')
# Translate
installed_languages = argostranslate.translate.get_installed_languages()
translatedText = argostranslate.translate.translate("Hello World!", from_code, to_code)
print(translatedText)

translatedText = argostranslate.translate.translate("哈罗,世界!", to_code, from_code)
print(translatedText)

哈罗,世界!
Hello, world!


In [3]:
import pandas as pd

In [6]:
validation = pd.read_csv('../iwslt_dataset/validation.csv')
validation.head()

Unnamed: 0,en,zh
0,Last year I showed these two slides so that d...,去年我给各位展示了两个 关于北极冰帽的演示 在过去三百万年中 其面积由相当于美国南方48州面...
1,But this understates the seriousness of this p...,但这些没能完全说明这个问题的严重性 因为这没有表示出冰帽的厚度
2,"The arctic ice cap is, in a sense, the beatin...",感觉上，北极冰帽 就好象全球气候系统中跳动的心脏
3,It expands in winter and contracts in summer.,冬天心脏舒张，夏天心脏收缩
4,The next slide I show you will be a rapid fas...,下面我要展示的是 在过去25年里的极剧变化


In [7]:
X_validation = validation['en']
y_validation = validation['zh']

In [9]:
y_validation_pred = X_validation.apply(lambda x: argostranslate.translate.translate(x, from_code, to_code))
y_validation_pred.head()

0    去年,我们... 我演示了这两张幻灯片,证明北极冰盖, 在过去三百万年中的大部分时间里, 它...
1                         但这低估了这个特殊问题的严重性,因为它没有显示冰的厚度.
2                            从某种意义上说,北极冰盖是全球气候系统的跳动心脏。
3                                        它在冬季扩张,夏季则签约.
4                       我给你们看的下一张幻灯片 将快速向前看过去25年发生的事情。
Name: en, dtype: object

In [10]:
predicted_translations = y_validation_pred.tolist()
reference_translations = y_validation.tolist()

In [11]:
import re

predictions = []
references = []
pattern = r'([\u4e00-\u9fff])\.$'
punctuation_mapping = {
    ',': '，',
    '?': '？',
    '!': '！',
    ':': '：',
    ';': '；',
    '"': '”',
    "'": '’',
    '(': '（',
    ')': '）',
    '[': '【',
    ']': '】',
    '{': '｛',
    '}': '｝'
}

for pred in predicted_translations:
    pred = pred.replace(" ", "")
    for eng_punc, chi_punc in punctuation_mapping.items():
        pred = pred.replace(eng_punc, chi_punc)
    pred = re.sub(pattern, r'\1。', pred)
    predictions.append(pred)

for ref in reference_translations:
    ref = ref.replace(" ", "")
    for eng_punc, chi_punc in punctuation_mapping.items():
        ref = ref.replace(eng_punc, chi_punc)
    ref = re.sub(pattern, r'\1。', ref)
    references.append([ref])

In [12]:
for ref in references:
    print(ref)

['去年我给各位展示了两个关于北极冰帽的演示在过去三百万年中其面积由相当于美国南方48州面积总和缩减了40%']
['但这些没能完全说明这个问题的严重性因为这没有表示出冰帽的厚度']
['感觉上，北极冰帽就好象全球气候系统中跳动的心脏']
['冬天心脏舒张，夏天心脏收缩']
['下面我要展示的是在过去25年里的极剧变化']
['红色的是永冻冰']
['你看，它正在变成深蓝色这是每年冬天形成的年度冰在夏天永冻冰收缩']
['所谓的“永冻”，是指形成五年或更久的冰你看，这也像血液一样输送到身体各部位']
['在25年的时间里，它从这里，到了这里']
['值得注意的是温室效应使得北冰洋周围的冻土层受热而这里有大量被冻封的碳解冻时，微生物降解碳形成甲烷']
['如果突破顶点，温室气体排放量将是现有大气层中的全球温室污染总量']
['在阿拉斯加的一些浅湖里已经可以看到水中探头的沼气泡']
['去年冬天，UniversityofAlaska的KateyWalter教授结队去了一个浅湖']
['哇（开心而又惊叹滴笑）戈尔：她很好，我们怎么样呢？']
['有一个原因，北方沉积的大量热能加热了格陵兰岛']
['这是一条每年融化的河']
['但流量却比往年都要大']
['这是格陵兰岛西南的Kangerlussuaq河']
['如果你想了解陆地上的冰块融化如何使得海平面上升这里就是它的入海口']
['这里的流量正在急速上升']
['南极，地球的另一端这个行星上最大的冰块']
['上个月，科学家报告整个大洲正处于冰量减少的阶段']
['在南极洲的西部突然发现几个低于海平面的岛屿正在加速融化']
['这相当于海平面上20英尺，和格陵兰岛一样']
['在Himalayas，第三大冰块在顶部你可以看到新的湖泊，而几年前，这只是冰河']
['全球40%从其融水中获得一半的饮用水']
['在安第斯山脉，这条冰河是这座城市的饮用水源']
['流量正在增加']
['但当它们消失时，我们也将失去饮用水']
['在California，Sierra的积雪每年减少40%']
['对于蓄水而言，这是一个打击']
['而且如你看到的，预计是非常严重的']
['全球的干燥化正在导致火灾数量急剧增加']
['而全世界的灾害数量也正以绝对显著的空前的速度增加']
['在过去三十年内，灾害总数达到了更早七十五年总数的四倍

In [13]:
for pred in predictions:
    print(pred)

去年，我们...我演示了这两张幻灯片，证明北极冰盖，在过去三百万年中的大部分时间里，它的体积是下48个州，缩水了40%。
但这低估了这个特殊问题的严重性，因为它没有显示冰的厚度。
从某种意义上说，北极冰盖是全球气候系统的跳动心脏。
它在冬季扩张，夏季则签约。
我给你们看的下一张幻灯片将快速向前看过去25年发生的事情。
永久冰块以红色标注。
如你所见，它扩张到深蓝色——这就是冬季的年冰，夏季则收缩。
所谓的永久冰，5岁或5岁以上，你可以看到几乎就像血一样，从这里的体内溢出。
25年来，它从这个，到这个。
这是一个问题，因为暖化使北冰洋周围的冰冻地面加热，那里有大量的冰冻碳，当它融化后，被微生物变成甲烷。
与大气中全球升温污染的总量相比，如果我们越过这个临界点，该污染量可能会增加一倍。
在阿拉斯加的一些浅水湖里甲烷正在从水中涌出
阿拉斯加大学的KateyWalter教授去年冬天又和另一支队伍一起出海到另一个浅湖。
视频：哇！她没事问题是，我们是否会这样做。
其中一个原因是，这个巨大的热水槽使格陵兰从北方升温。
这条河每年融化一次
但书卷比以前大得多。
这是格陵兰西南部的Kangerlussuaq河。
如果你想知道海平面是如何从陆基冰融化中上升的这就是它到达海洋的地方
这些流动正在迅速增加。
在地球的另一端，南极洲是地球上最大的冰块。
上个月，科学家报告说，整个大陆现在处于负冰平衡状态。
南极洲西部的海底岛屿融化速度特别快
这相当于20英尺的海平面，格陵兰岛也是如此。
在喜马拉雅山脉，冰体的第三大质量：在顶部你可以看到新的湖泊，几年前是冰川。
世界上40%的人都能从融化流中获得一半的饮用水。
在安第斯山脉，这个冰川是这座城市的饮用水源。
资金流动有所增加。
但是当他们离开的时候，大部分的饮用水也一样。
在加利福尼亚州，雪地的降幅下降了40%。
这是击中水库。
正如你所读的，这些预测是严肃的。
世界各地的干燥导致火灾剧增。
世界各地的灾难以绝对非同寻常和前所未有的速度增加。
过去30年是前75年的4倍。
这是一种完全不可持续的模式。
如果你从历史的角度来审视，你就能看出这是在做什么。
在过去五年里，我们每天在海洋中增加了7000万吨二氧化碳2500万吨。
仔细观察太平洋东部地区，从美洲向西延伸，以及印度次大陆的两侧，海洋中氧气急剧枯竭。
全球暖化的最大原因以及20%的森林砍

In [None]:
import sacrebleu

bleu_score = sacrebleu.corpus_bleu(predictions, references, tokenize='zh')

bleu_score_smooth = sacrebleu.corpus_bleu(predictions, references, smooth_method='add-k', smooth_value=1, tokenize='zh')

ter_score_none = sacrebleu.corpus_ter(predictions, references)
ter_score_with_asian = sacrebleu.corpus_ter(predictions, references, asian_support=True)
ter_score_with_norm = sacrebleu.corpus_ter(predictions, references, normalized=True)
ter_score_with_asiannorm = sacrebleu.corpus_ter(predictions, references, asian_support=True, normalized=True)
ter_score_with_punct = sacrebleu.corpus_ter(predictions, references, asian_support=True, normalized=True)
ter_score_all = sacrebleu.corpus_ter(predictions, references, no_punct=True, asian_support=True, normalized=True)

print(f"BLEU score without smoothing: {bleu_score}")
print(f"BLEU score with smoothing: {bleu_score_smooth}")
print(ter_score_none)
print(ter_score_with_asian)
print(ter_score_with_norm)
print(ter_score_with_asiannorm)
print(f"TER with punctuation: {ter_score_with_punct}")
print(f"TER without punctuation: {ter_score_all}")


BLEU score without smoothing: BLEU = 43.45 96.6/63.2/32.1/18.2 (BP = 1.000 ratio = 1.000 hyp_len = 58 ref_len = 58)
BLEU score with smoothing: BLEU = 44.81 96.6/63.8/33.3/19.6 (BP = 1.000 ratio = 1.000 hyp_len = 58 ref_len = 58)
TER = 100.00
TER = 100.00
TER = 585.35
TER = 100.10
TER with punctuation: TER = 100.10
TER without punctuation: TER = 88.62


In [None]:
# NOTE: Past, incorrect implementations

# import editdistance

# ter_score = sum([editdistance.eval(ref, hyp) / max(len(ref), len(hyp)) for ref, hyp in zip(reference_translations, predicted_translations)]) / len(reference_translations)

# print(f"TER Score = {ter_score}")

# import sacrebleu
# from jiwer import wer

# # bleu_score = sacrebleu.corpus_bleu(predicted_translations, [reference_translations]).score

# wer_score = sum([wer(ref, hyp) for ref, hyp in zip(reference_translations, predicted_translations)]) / len(reference_translations)

# # print(f"BLEU Score: {bleu_score}")
# print(f"WER Score: {wer_score}")

# from nltk.translate.bleu_score import corpus_bleu
# from nltk.metrics import edit_distance
# from nltk.translate.bleu_score import SmoothingFunction


# # Extract predicted and reference translations as lists of strings
# predicted_translations = y_validation_pred.tolist()
# reference_translations = y_validation.tolist()

# # Calculate BLEU score
# bleu_score = corpus_bleu([[ref] for ref in reference_translations], [pred for pred in predicted_translations])

# print("BLEU:", bleu_score)

# import jieba

# cut = list(jieba.cut(reference_translations[0]))

# print(cut)

WER Score: 1.12320954547917


In [30]:
y_validation_pred_back = y_validation_pred.apply(lambda x: argostranslate.translate.translate(x, to_code, from_code))
X_validation_pred_back = y_validation.apply(lambda x: argostranslate.translate.translate(x, to_code, from_code))

In [31]:
predicted_translations_back = y_validation_pred_back.tolist()
reference_translations_back = X_validation_pred_back.tolist()

In [32]:
print(predicted_translations_back)

["Last year, we... I showed these two slides to prove that for most of the last three million years, it's the size of the next 48 states, and it's shrunk by 40 percent.", 'But it underestimates the gravity of this particular problem because it does not show the thickness of the ice.', 'In a sense, the Arctic ice sheet is the beating heart of the global climate system.', 'It expands in the winter and contracts in the summer.', "And the next slide I'm showing you is going to move on quickly with what's happened in the last 25 years.", 'Permanent ice is marked in red.', 'As you can see, it expands to deep blue -- this is winter ice, summer shrinks.', 'So-called permanent ice, five years old or older, you can see almost like blood spilling out of here.', "25 years, it's from this to this.", 'This is a problem because warming heats up the frozen surface around the Arctic Ocean, where there is a large amount of frozen carbon that, when it melts, becomes methane by microorganisms.', 'Compared

In [33]:
print(reference_translations_back)

['Last year I showed you two demonstrations about the Arctic ice cap, which in the last three million years has reduced its size by 40 percent from 48 states in the south of the United States.', "But it doesn't give a full picture of the seriousness of the problem, because it doesn't indicate the thickness of the ice cap.", 'The Arctic ice cap feels like a beating heart in the global climate system.', 'The heart of winter is swollen, the heart of summer shrinks.', "What I'm going to show you is the dramatic changes that have taken place in the last 25 years.", 'The red is frozen ice.', "You see, it's turning into deep blue, and this is the annual ice that forms every winter, and it shrinks in summer.", "It's called permafrost, which means ice is formed for five years or more.", 'In 25 years, it came from here and here.', "It's worth noting that the greenhouse effect heats up the tundra around the Arctic Ocean, where there's a lot of frozen carbon, and when it's unfrozen, microbes degra

In [34]:
import re

predictions_back = []
references_back = []
punctuation_mapping = {
    '，': ',',
    '？': '?',
    '！': '!' ,
    '：': ':',
    '；': ';',
    '”': '"',
    '’': "'",
    '（': '(',
    '）': ')',
    '【' : '[',
    '】' : ']',
    '｛': '{',
    '｝': '}'
}

for pred in predicted_translations_back:
    for chi_punc, eng_punc in punctuation_mapping.items():
        pred = pred.replace(chi_punc, eng_punc)
    predictions_back.append(pred)

for ref in reference_translations_back:
    for chi_punc, eng_punc in punctuation_mapping.items():
        ref = ref.replace(chi_punc, eng_punc)
    references_back.append([ref])

In [35]:
import sacrebleu

bleu_score = sacrebleu.corpus_bleu(predictions_back, references_back)

bleu_score_smooth = sacrebleu.corpus_bleu(predictions_back, references_back, smooth_method='add-k', smooth_value=1)

ter_score_with_punct = sacrebleu.corpus_ter(predictions_back, references_back, normalized=True)
ter_score_all = sacrebleu.corpus_ter(predictions_back, references_back, no_punct=True,normalized=True)

print(f"BLEU score without smoothing: {bleu_score}")
print(f"BLEU score with smoothing: {bleu_score_smooth}")
print(f"TER with punctuation: {ter_score_with_punct}")
print(f"TER without punctuation: {ter_score_all}")

BLEU score without smoothing: BLEU = 29.41 95.0/61.5/23.7/5.4 (BP = 1.000 ratio = 1.000 hyp_len = 40 ref_len = 40)
BLEU score with smoothing: BLEU = 33.11 95.0/62.5/25.6/7.9 (BP = 1.000 ratio = 1.000 hyp_len = 40 ref_len = 40)
TER with punctuation: TER = 112.34
TER without punctuation: TER = 107.88
