### Measuring Machines with BLEU: Applying the BLEU Evaluation Score to Machine Translations


This notebook works through an example of the BLEU score, using human and machine translations of a patent abstract.

In [93]:
from nltk.translate.bleu_score import (sentence_bleu, corpus_bleu,
                                       modified_precision)
from nltk import word_tokenize, bigrams, trigrams, ngrams

#### Inspect and tokenize texts

In [1]:
# original Chinese text of the abstract of an NLP patent by Alibaba
original_txt = """本发明公开了一种机器处理及文本纠错方法和装置、计算设备以及存储介质
，具体包括错误文本和对应的正确文本的纠错改写对, 以纠错改写对作为训练语料，对机器处理模型
进行训练，由此准备好适用于文本纠错的机器处理模型。可以通过从日志中挖掘纠错改写对来对机器
处理模型进行训练，使其适于对文本进行纠错。将第一文本输入到机器处理模型中，得到第二文本，
即纠错结果文本。另外，还可以使用语言模型或常用词库先判断第一文本是否需要进行纠错。可以使
用从日志中挖掘出的训练语料来训练语言模型，也可以通过对日志中的文本进行分词、统计来整理常
用词库。由此，使得能够方便地实现文本纠错"""

In [2]:
# one sentence, in Chinese, from the abstract of the Alibaba NLP patent
original_sentence = """可以使用从日志中挖掘出的训练语料来训练语言模型，也可以通过对
日志中的文本进行分词、统计来整理常用词库."""

# human translation #1, via Gengo, of Chinese sentence to English 
human1_sentence = """The training corpus extracted from a log can be used 
to train the language model, or the common lexicon can be sorted by 
segmenting and counting text in the log."""

# human translation #2, via Gengo
human2_sentence = """It can use the practice language material gathered 
from the diary or daily journal to train the language model, and it can
also initialize the common vocabulary bank through the segmentation and
analysis of the diary or daily journal text."""

# machine translation by Google Translate from Chinese to English
google_sentence = """The language model can be trained using the training 
corpus extracted from the log, or the common lexicon can be organized by 
segmenting and counting the text in the log."""

# machine translation by WIPO
wipo_sentence = """The training corpus extracted from the log can be 
used to train the language model and also, through text segmentation and 
statistical analysis of text in the log compile a lexicon of commonly 
used words."""

#### Tokenize texts

In [30]:
def tokenize(dictionary):
    """converts a dictionary of texts to a list of lists of tokens"""
    returned_list = []
    for key, value in dictionary.items():
        list_val = value.split()
        returned_list.append(list_val)
    return returned_list

In [44]:
dict_of_sentences = {'reference_sentence_1': human1_sentence
                     ,'reference_sentence_2': human2_sentence
                     ,'candidate_sentence_1': google_sentence
                     ,'candidate_sentence_2': wipo_sentence}


i = tokenize(dict_of_sentences)
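
Splitting on whitespace leaves punctuation attached to words, so `model,` in a reference never matches `model` in a candidate. The following is a minimal sketch of a punctuation-aware alternative that uses only the standard library. The helper name `simple_tokenize` and its regular expression are illustrative assumptions, not part of this notebook or of NLTK.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

sample = "train the language model, or the common lexicon"
print(sample.split())           # 'model,' keeps its trailing comma
print(simple_tokenize(sample))  # 'model' and ',' become separate tokens
```

Whichever tokenizer is chosen, it should be applied identically to references and candidates, because BLEU only counts exact token matches.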

#### Examples of n-grams

In [32]:
# this returns a list of bigrams
bi_grams = list(bigrams(i[0][0:6]))
bi_grams

[('The', 'training'),
 ('training', 'corpus'),
 ('corpus', 'extracted'),
 ('extracted', 'from'),
 ('from', 'a')]

In [33]:
# this returns a list of tuples containing 4-grams
four_grams = list(ngrams(i[0][0:8], 4))
four_grams

[('The', 'training', 'corpus', 'extracted'),
 ('training', 'corpus', 'extracted', 'from'),
 ('corpus', 'extracted', 'from', 'a'),
 ('extracted', 'from', 'a', 'log'),
 ('from', 'a', 'log', 'can')]
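
To make the idea behind `bigrams` and `ngrams` concrete, here is a small standard-library sketch that slides a window of width n over a token list. The name `simple_ngrams` is a hypothetical helper for illustration, not the NLTK implementation.

```python
def simple_ngrams(tokens, n):
    """Return all contiguous n-token windows as a list of tuples."""
    # zip over n shifted copies of the list; zip stops at the shortest one
    return list(zip(*(tokens[k:] for k in range(n))))

tokens = "The training corpus extracted from a".split()
print(simple_ngrams(tokens, 2))       # 5 bigrams from 6 tokens
print(len(simple_ngrams(tokens, 4)))  # 6 - 4 + 1 = 3 four-grams
```

A sequence of L tokens yields L - n + 1 n-grams, which is why the modified precisions later in the notebook have denominators 30, 29, 28 and 27 for a 30-token candidate.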

#### Sentence-level scores

In [139]:
# modified unigram precision of Google's translation against reference 1
bleu1mod_google = modified_precision([i[0]], i[2], n=1)
print(float(bleu1mod_google))

# modified bigram precision
bleu2mod = modified_precision([i[0]], i[2], n=2)
print(float(bleu2mod))

# modified trigram precision
bleu3mod = modified_precision([i[0]], i[2], n=3)
print(float(bleu3mod))

# modified 4-gram precision
bleu4mod = modified_precision([i[0]], i[2], n=4)
print(float(bleu4mod))

0.7666666666666667
0.5172413793103449
0.35714285714285715
0.2222222222222222
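
The four values above are modified (clipped) n-gram precisions: each candidate n-gram is credited at most as many times as it occurs in any single reference. The sketch below recomputes them with the standard library only, assuming the same whitespace tokenization as the notebook. The helper names `ngram_counts` and `clipped_precision` are illustrative and are not NLTK functions.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(zip(*(tokens[k:] for k in range(n))))

def clipped_precision(references, candidate, n):
    """Return (clipped matches, total candidate n-grams) for order n."""
    cand = ngram_counts(candidate, n)
    # For each n-gram, the largest count seen in any single reference
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped, sum(cand.values())

reference = ("The training corpus extracted from a log can be used to train "
             "the language model, or the common lexicon can be sorted by "
             "segmenting and counting text in the log.").split()
candidate = ("The language model can be trained using the training corpus "
             "extracted from the log, or the common lexicon can be organized "
             "by segmenting and counting the text in the log.").split()

for n in range(1, 5):
    print(n, clipped_precision([reference], candidate, n))
```

The pairs printed are numerator and denominator, so `(23, 30)` corresponds to the 0.7667 unigram precision. Clipping matters here for `the`, which occurs five times in the candidate but only three times in the reference.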


In [140]:
# BLEU-1 score for Google's translation (about .77)
bleu1_google = sentence_bleu([i[0]], i[2], weights=(1.0, 0))
print(bleu1_google)

# BLEU-2 score for Google's translation (about .63)
bleu2_google = sentence_bleu([i[0]], i[2], weights=(0.5, 0.5))
print(bleu2_google)

# BLEU-3 score for Google's translation (about .52)
bleu3_google = sentence_bleu([i[0]], i[2], weights=(0.333, 0.333, 0.333))
print(bleu3_google)

# BLEU-4 score for Google's translation (about .42), with explicit weights
bleu4_google = sentence_bleu([i[0]], i[2], weights=(0.25, 0.25, 0.25, 0.25))
print(bleu4_google)

# the same BLEU-4 score, using the default weights
bleu4_google = sentence_bleu([i[0]], i[2])
print(bleu4_google)

0.7666666666666667
0.6297235299224027
0.5215911609582645
0.4211941439196335
0.4211941439196335
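
These scores can be checked by hand. The Google candidate and reference 1 both have 30 whitespace tokens, so the brevity penalty is 1 and BLEU-N reduces to the weighted geometric mean of the modified precisions. The sketch below is an assumption-labelled re-derivation using only `math`. `combine_precisions` is a hypothetical helper, not an NLTK function.

```python
import math

def combine_precisions(precisions, weights):
    """Weighted geometric mean of n-gram precisions (no brevity penalty)."""
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; the geometric mean collapses to 0
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

# Modified 1- to 4-gram precisions of the Google candidate against reference 1
p = [23 / 30, 15 / 29, 10 / 28, 6 / 27]

bleu2 = combine_precisions(p[:2], (0.5, 0.5))
bleu4 = combine_precisions(p, (0.25, 0.25, 0.25, 0.25))
print(round(bleu2, 4), round(bleu4, 4))  # 0.6297 0.4212
```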


In [92]:
# .35 BLEU score for WIPO's translation
bleu_wipo = sentence_bleu([i[0]], i[3])
bleu_wipo

0.34690864856059794

In [68]:
bleu_google_2refs = sentence_bleu([i[0], i[1]], i[2])
bleu_google_2refs

0.4370614964591188

In [69]:
bleu_wipo_2refs = sentence_bleu([i[0], i[1]], i[3])
bleu_wipo_2refs

0.38635522321645016

In [None]:
# modified n-gram precisions (clipped matches / candidate n-grams), checked by hand

In [111]:
23 / 30, 15 / 29, 10 / 28, 6 / 27

(0.7666666666666667,
 0.5172413793103449,
 0.35714285714285715,
 0.2222222222222222)

In [112]:
(23 + 15 + 10 + 6) / (30 + 29 + 28 + 27)

0.47368421052631576

In [105]:
bleu2mod = modified_precision([i[0]], i[2], n=2)
print(bleu2mod)

15/29


In [102]:
bleu3mod = modified_precision([i[0]], i[2], n=3)
bleu3mod

Fraction(10, 28)

In [103]:
bleu4mod = modified_precision([i[0]], i[2], n=4)
bleu4mod

Fraction(6, 27)

In [75]:
ref_four_grams = list(ngrams(i[0], 4))
ref_four_grams, len(ref_four_grams)

([('The', 'training', 'corpus', 'extracted'),
  ('training', 'corpus', 'extracted', 'from'),
  ('corpus', 'extracted', 'from', 'a'),
  ('extracted', 'from', 'a', 'log'),
  ('from', 'a', 'log', 'can'),
  ('a', 'log', 'can', 'be'),
  ('log', 'can', 'be', 'used'),
  ('can', 'be', 'used', 'to'),
  ('be', 'used', 'to', 'train'),
  ('used', 'to', 'train', 'the'),
  ('to', 'train', 'the', 'language'),
  ('train', 'the', 'language', 'model,'),
  ('the', 'language', 'model,', 'or'),
  ('language', 'model,', 'or', 'the'),
  ('model,', 'or', 'the', 'common'),
  ('or', 'the', 'common', 'lexicon'),
  ('the', 'common', 'lexicon', 'can'),
  ('common', 'lexicon', 'can', 'be'),
  ('lexicon', 'can', 'be', 'sorted'),
  ('can', 'be', 'sorted', 'by'),
  ('be', 'sorted', 'by', 'segmenting'),
  ('sorted', 'by', 'segmenting', 'and'),
  ('by', 'segmenting', 'and', 'counting'),
  ('segmenting', 'and', 'counting', 'text'),
  ('and', 'counting', 'text', 'in'),
  ('counting', 'text', 'in', 'the'),
  ('text', 'in',

In [79]:
google_four_grams = list(ngrams(i[2], 4))
google_four_grams, len(google_four_grams)

([('The', 'language', 'model', 'can'),
  ('language', 'model', 'can', 'be'),
  ('model', 'can', 'be', 'trained'),
  ('can', 'be', 'trained', 'using'),
  ('be', 'trained', 'using', 'the'),
  ('trained', 'using', 'the', 'training'),
  ('using', 'the', 'training', 'corpus'),
  ('the', 'training', 'corpus', 'extracted'),
  ('training', 'corpus', 'extracted', 'from'),
  ('corpus', 'extracted', 'from', 'the'),
  ('extracted', 'from', 'the', 'log,'),
  ('from', 'the', 'log,', 'or'),
  ('the', 'log,', 'or', 'the'),
  ('log,', 'or', 'the', 'common'),
  ('or', 'the', 'common', 'lexicon'),
  ('the', 'common', 'lexicon', 'can'),
  ('common', 'lexicon', 'can', 'be'),
  ('lexicon', 'can', 'be', 'organized'),
  ('can', 'be', 'organized', 'by'),
  ('be', 'organized', 'by', 'segmenting'),
  ('organized', 'by', 'segmenting', 'and'),
  ('by', 'segmenting', 'and', 'counting'),
  ('segmenting', 'and', 'counting', 'the'),
  ('and', 'counting', 'the', 'text'),
  ('counting', 'the', 'text', 'in'),
  ('the', '

In [83]:
set(ref_four_grams) == set(google_four_grams)

False

In [85]:
set(google_four_grams) & set(ref_four_grams)

{('by', 'segmenting', 'and', 'counting'),
 ('common', 'lexicon', 'can', 'be'),
 ('or', 'the', 'common', 'lexicon'),
 ('text', 'in', 'the', 'log.'),
 ('the', 'common', 'lexicon', 'can'),
 ('training', 'corpus', 'extracted', 'from')}

In [None]:
# by default, sentence_bleu calculates BLEU-4, which is a score for the overlap of n-grams up to 4-grams

In [None]:
# calculation

In [None]:
"the geometric mean of the test corpus’ modified precision scores times an exponential brevity penalty factor"
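
The brevity penalty in that definition can be sketched as follows, assuming the usual formulation: BP = 1 when the candidate length c exceeds the effective reference length r, and exp(1 - r/c) otherwise. `brevity_penalty` here is a hypothetical stand-alone helper written for illustration.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BP = 1 if the candidate is longer than the reference, else exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0  # an empty candidate gets no credit
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

print(brevity_penalty(30, 30))            # 1.0: equal lengths are not penalised
print(round(brevity_penalty(15, 30), 4))  # 0.3679: a half-length candidate
```

The penalty exists because precision alone rewards very short candidates; multiplying by BP removes that advantage.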

In [None]:
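$$p_n = \frac{\sum_{C \in \{\text{Candidates}\}} \sum_{n\text{-gram} \in C} \mathrm{Count}_{\mathrm{clip}}(n\text{-gram})}{\sum_{C' \in \{\text{Candidates}\}} \sum_{n\text{-gram}' \in C'} \mathrm{Count}(n\text{-gram}')}$$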
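
The formula pools clipped matches and n-gram totals over every candidate sentence before dividing, which is not the same as averaging per-sentence precisions. A minimal sketch on toy data follows. The helper names and the two toy sentences are assumptions made for illustration only.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(zip(*(tokens[k:] for k in range(n))))

def corpus_modified_precision(list_of_references, candidates, n):
    """Pool clipped and total n-gram counts over all candidates."""
    clipped_total, count_total = 0, 0
    for references, candidate in zip(list_of_references, candidates):
        cand = ngram_counts(candidate, n)
        max_ref = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped_total += sum(min(c, max_ref[g]) for g, c in cand.items())
        count_total += sum(cand.values())
    return clipped_total, count_total

candidates = [["the", "the", "the"], ["a", "cat"]]
list_of_references = [[["the", "cat"]], [["a", "cat"]]]  # one reference each
print(corpus_modified_precision(list_of_references, candidates, 1))  # (3, 5)
```

Pooling gives 3/5 = 0.6, whereas averaging the two sentence-level precisions (1/3 and 2/2) would give about 0.67. Note also the nesting: each candidate is paired with its own list of references.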

In [None]:
The strong signal differentiating human (high precision) from machine (low precision) is striking. The difference becomes stronger as we go from unigram precision to 4-gram precision. It appears that any single n-gram precision score can distinguish between a good translation and a bad translation. To be useful, however, the metric must also reliably distinguish between translations that do not differ so greatly in quality. Furthermore, it must distinguish between two human translations of differing quality. This latter requirement ensures the continued validity of the metric as MT approaches human translation quality.
To this end, we obtained a human translation by someone lacking native proficiency in both the source (Chinese) and the target language (English). For comparison, we acquired human translations of the same documents by a native English speaker. We also obtained machine translations by three commercial systems. These five "systems" (two humans and three machines) are scored against two reference professional human translations. The average modified n-gram precision results are shown in Figure 2.
Each of these n-gram statistics implies the same

### Playground

score = sentence_bleu(references, candidate)

- `sentence_bleu` accepts the reference sentences as a list of lists of tokens, one inner list per reference sentence.
- the candidate sentence is passed as a single list of tokens.

# by default, sentence_bleu calculates BLEU-4, which scores the overlap of 1-grams through 4-grams

reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that',
               'ensures', 'that', 'the', 'military', 'will', 'forever',
               'heed', 'Party', 'commands']

reference2 = ['It', 'is', 'the', 'guiding', 'principle', 'which',
             'guarantees', 'the', 'military', 'forces', 'always',
               'being', 'under', 'the', 'command', 'of', 'the',
               'Party']
reference3 = ['It', 'is', 'the', 'practical', 'guide', 'for', 'the',
              'army', 'always', 'to', 'heed', 'the', 'directions',
              'of', 'the', 'party']

references = [reference1, reference2, reference3]

references

hypothesis1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which',
               'ensures', 'that', 'the', 'military', 'always',
              'obeys', 'the', 'commands', 'of', 'the', 'party']

hypothesis2 = ['It', 'is', 'to', 'insure', 'the', 'troops',
             'forever', 'hearing', 'the', 'activity', 'guidebook',
             'that', 'party', 'direct']

sentence_bleu(references, hypothesis1)

In [None]:
# this is a human translation of the original text, conducted through the company Gengo at a "standard" level
human_trans = """The invention discloses a machine processing and text error correction method and device, a
computing device, and a storage medium, specifically comprising corrected and rewritten text pairs of incorrect 
text and corresponding correct text. The corrected and rewritten text pairs serving as a training corpus to train
the machine processing model, thereby preparing a machine processing model suitable for text error correction. 
Through extraction of corrected and rewritten text pairs from a log, the machine processing model can be trained
and thus made fit for text correction by inputting the first text into the machine processing model to get the 
second text, that is the error correction result text. In addition, the language model or the common lexicon can
be used to determine whether the first text needs to be corrected. The training corpus extracted from a log can 
be used to train the language model, or the common lexicon can be sorted by segmenting and counting text in the 
log. This is how to easily implement text error correction."""

In [None]:
# second human translation, also via Gengo; note that this reassigns human_trans
# Translated by: Translator #333872
human_trans = """This invention makes public a machine processing and text error correction method and hardware, computing equipment and storage medium, and specifically pairs error text with the corresponding corrected and modified correct text. It uses this text pair as training material for the machine processing model, and from there prepares the machine processing model that is applied to the text correction. It can train the machine processing model using a diary or daily journal and make it suitable for text correction. The first text version is inputted into the machine processing model to get the second text version, which is the corrected text. Additionally, it can also use a stored language model or common vocabulary bank to determine if the first text version needs correction. It can use the practice language material gathered from the diary or daily journal to train the language model, and it can also initialize the common vocabulary bank through the segmentation and analysis of the diary or daily journal text. Through all this, text correction is conveniently implemented."""

In [None]:
human_reference = human_trans.split()

In [None]:
# machine translation of the full abstract by Google Translate
google_trans = """The invention discloses a machine processing and text error correction method and device, a
computing device and a storage medium, and particularly comprises an error correction rewriting
pair of an error text and a corresponding correct text, and an error correction rewriting pair
as a training corpus, and a machine processing model Training is performed, thereby preparing
a machine processing model suitable for text correction. The machine processing model can be
trained to mine the error correction by mining the error correction rewrite pair from the log.
The first text is input into the machine processing model to obtain a second text, that is, an
error correction result text. In addition, you can use the language model or common lexicon to
determine whether the first text needs to be corrected. The language model can be trained using
the training corpus extracted from the log, or the common lexicon can be organized by segmenting
and counting the text in the log. Thereby, text correction is facilitated."""

In [None]:
# machine translation of the full abstract by WIPO
wipo_trans = """The present invention discloses a machine processing and text correction method and device,
computing equipment and a storage medium. Specifically comprising corrected and rewritten text pairs of incorrect
text and corresponding correct text, the corrected and rewritten text pairs serving as a training corpus for training
a machine processing model, and in this way developing a machine processing model for use in text correction.
Through extraction of corrected and rewritten text pairs from a log, the machine processing model can be trained
and thus made fit for text correction by inputting a first text into the machine processing model to obtain a second
text i.e. a corrected text result. Moreover, a language model or a lexicon of commonly used words can be used to
assess whether text needs correction. The training corpus extracted from the log can be used to train the language
model and also, through text segmentation and statistical analysis of text in the log compile a lexicon of commonly
used words. Thus, text correction can be made easier and more convenient."""

In [None]:
def tokenize_translations(texts):
    """converts a list of translation strings to a list of lists of tokens"""
    return [text.split() for text in texts]

In [None]:
google_hypothesis = google_trans.split()

In [None]:
wipo_hypothesis = wipo_trans.split()

In [None]:
references = [human_reference]

In [None]:
google_bleu_score = sentence_bleu([human_reference], google_hypothesis)
google_bleu_score

In [None]:
wipo_bleu_score = sentence_bleu([human_reference], wipo_hypothesis)
wipo_bleu_score

In [None]:
# corpus_bleu expects one list of references per hypothesis
corpus_bleu([[human_reference]], [google_hypothesis])

`corpus_bleu`

In [None]:
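# score both machine translations together as a two-segment corpus
corpus_bleu([[human_reference], [human_reference]],
            [google_hypothesis, wipo_hypothesis])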

In [None]:
# - this example hypothesis has zero 3-gram and 4-gram overlaps:
sentence_bleu(references, hypothesis2)

The `bleu_score` submodule of `nltk.translate` contains functions for calculating BLEU scores.

`nltk.translate.bleu_score.sentence_bleu` calculates a BLEU score for a single sentence:

- the function accepts as parameters the reference sentences, the hypothesis sentence, and weights for the n-gram orders
- `sentence_bleu` returns a BLEU score
- the `translate` module contains experimental features for machine translation
- the `bleu_score` submodule contains different implementations of the BLEU score, including both the original algorithm and more recent adaptations of the original

- If there is no n-gram overlap for any order of n-grams, BLEU returns the value 0.
- This is because the precision for the order of n-grams without overlap is 0, and the geometric mean in the final BLEU score computation multiplies that 0 with the precisions of the other n-gram orders.
- This results in 0, independently of the precision of the other n-gram orders.
- The following example has zero 3-gram and 4-gram overlaps:
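
The zero-overlap case can be inspected without NLTK by counting clipped matches for `hypothesis2` against the three references from the Playground section. This is a standard-library sketch under the same whitespace tokenization, and `clipped_matches` is an illustrative helper name.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(zip(*(tokens[k:] for k in range(n))))

def clipped_matches(references, candidate, n):
    """Clipped n-gram matches of the candidate against the references."""
    cand = ngram_counts(candidate, n)
    ref_counts = [ngram_counts(ref, n) for ref in references]
    return sum(min(c, max(rc[g] for rc in ref_counts)) for g, c in cand.items())

references = [
    "It is a guide to action that ensures that the military will forever heed Party commands".split(),
    "It is the guiding principle which guarantees the military forces always being under the command of the Party".split(),
    "It is the practical guide for the army always to heed the directions of the party".split(),
]
hypothesis2 = "It is to insure the troops forever hearing the activity guidebook that party direct".split()

matches = [clipped_matches(references, hypothesis2, n) for n in range(1, 5)]
print(matches)  # [8, 1, 0, 0]
```

With no matching 3-grams or 4-grams, two of the four precisions are 0, so their geometric mean is 0 regardless of the unigram and bigram matches. Smoothing is the usual remedy; NLTK ships a `SmoothingFunction` class in `nltk.translate.bleu_score` for this purpose.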