## Measuring Multilingual Machines
#### BLEU Scores and Cross-Lingual Machine Learning

#### by Lee Mackey

This notebook accompanies the Medium article titled "Measuring Multilingual Machines".

In [2]:
# Import modules

import sacrebleu
from nltk.translate.bleu_score import (sentence_bleu, corpus_bleu, 
                                       modified_precision, 
                                       SmoothingFunction)

from nltk import (bigrams, trigrams, ngrams, sent_tokenize, word_tokenize)
from multilingual_machines import split_tokens, clean_punctuation
import string
import json
from IPython.display import IFrame

Does your machine learn in Chinese? 您和您的機器學習中文嗎? I don’t speak a word of Mandarin or Cantonese, so Google Translate gets all the credit, good or bad, for the preceding sentence. But how might a researcher quickly evaluate the quality of this machine translation? This question captures the basic challenge that gave rise to the BLEU metric. BLEU, which stands for bilingual evaluation understudy, is the default measure of machine translation quality and is also sometimes applied to more general cross-lingual tasks in natural language processing (NLP). The metric is well established in machine translation, but some analysts question its application to NLP tasks beyond the purpose for which the algorithm was originally developed. This article explores these issues by briefly discussing the basic lessons and limits of BLEU, using examples drawn from the multilingual space of global patent documents.

### Basics of BLEU

Researchers at IBM developed the BLEU algorithm in 2002 as an efficient method to evaluate machine translation tasks that would otherwise require human evaluators. The original paper by the developers, Papineni and colleagues, is a good place to start if you’re interested in the founding milieu and details of the algorithm [1]. 

In [3]:
# browse original paper that introduced BLEU
IFrame('https://www.aclweb.org/anthology/P02-1040.pdf', width=700, height=300)

BLEU is an adjusted measure of precision based on the overlap of word sequences between a “candidate” machine translation and one or more “reference” human translations. Taking a single machine-translated sentence as the unit, BLEU counts each word sequence (n-gram) in the candidate, then “clips” that count at the maximum number of times the same n-gram occurs in any single reference translation. The clipped counts of all n-grams in the sentence are summed, and the sum is divided by the total (unclipped) number of n-grams in the candidate text. The result is the modified n-gram precision.
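To make the clipping step concrete, here is a minimal pure-Python sketch of modified n-gram precision. The helper names `ngram_counts` and `clipped_precision` and the toy sentences are illustrative assumptions, not part of this notebook's code; the toy case is the familiar degenerate candidate that repeats a single word.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, references, n):
    """Return (clipped matches, total candidate n-grams) for one candidate."""
    cand = ngram_counts(candidate, n)
    # For each n-gram, keep the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip each candidate count at the reference maximum, then sum.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped, sum(cand.values())

# Hypothetical toy data: a candidate that repeats one word seven times.
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]

clipped, total = clipped_precision(candidate, references, 1)
print(f"{clipped}/{total}")  # prints 2/7
```

The seven occurrences of "the" are clipped to two, the most that any single reference contains, so the modified unigram precision is 2/7 rather than 7/7.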

The BLEU score is a number between 0 and 1, where 0 represents the complete absence of n-gram overlap between candidate and reference texts, and 1 indicates a machine translation that is identical to one of the reference texts. While this example considers BLEU at the level of the sentence, the actual evaluation of a machine translation system is calculated across an entire corpus: the clipped and total n-gram counts are pooled over all sentences, the resulting precisions for each n-gram order are combined, and the aggregate is multiplied by a brevity penalty that lowers the score of candidates shorter than their references. If you’re interested in learning more, you can learn to calculate BLEU in a fifteen-minute video by Andrew Ng of DeepLearning.AI. To make BLEU more tangible, I begin with machine translations and human translations of Chinese text.
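As a rough sketch of how the corpus-level score is assembled, the snippet below combines already-computed modified precisions with a geometric mean and applies the brevity penalty from the original paper [1]. The function names and the example numbers are assumptions made for illustration; this is not NLTK's implementation.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """Penalty of 1.0 for candidates longer than the reference, below 1.0 otherwise."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

def bleu_from_precisions(precisions, candidate_len, reference_len):
    """Geometric mean of the modified precisions, times the brevity penalty."""
    if min(precisions) == 0:
        return 0.0  # a single zero precision drives the geometric mean to zero
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(candidate_len, reference_len) * math.exp(log_mean)

# Hypothetical 1- to 4-gram precisions for a candidate corpus of 18 tokens
# scored against references with an effective length of 20 tokens.
score = bleu_from_precisions([0.8, 0.6, 0.4, 0.3], candidate_len=18, reference_len=20)
print(round(score, 4))
```

Because the mean is geometric, one n-gram order with no matches is enough to pull the unsmoothed score to zero, which is why short texts often need smoothing.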

### Applications of BLEU using Patent Texts
A growing share of patents in the machine learning space are originally written and filed in Chinese, according to a recent report by WIPO, the global organization governing patents. To explore the basics of BLEU in a more tangible manner, we begin with the international filing of a Chinese-language patent by the e-commerce company Alibaba for an invention related to natural language processing.

In [4]:
# read example data from 'patent_examples.txt' file (JSON)
with open('patent_examples.txt') as f:
    data = json.load(f)
    
# for more details on the example patent and the data source:

# paste into browser to inspect patent at Chinese version of WIPO GUI:   
# https://patentscope.wipo.int/search/zh/detail.jsf?docId=WO2019085779
    
# paste into browser to inspect patent at English version of WIPO GUI:
# https://patentscope.wipo.int/search/en/detail.jsf?docId=WO2019085779

In [5]:
IFrame('https://patentscope.wipo.int/search/zh/detail.jsf?docId=WO2019085779', width=700, height=300)


A sentence from the Chinese language abstract section of the original patent is displayed below.

In [6]:
# inspect sentence from summary of patent in original Chinese
print(data['original_sentence_cn'])

['可以使用从日志中挖掘出的训练语料来训练语言模型，也可以通过对日志中的文本进行分词、统计来整理常用词库.']


Human translators often produce translations of equivalent quality that nonetheless differ in structure or word choice. BLEU was therefore designed to accept one or multiple human reference translations. Next, we obtain Chinese-to-English "reference" translations from two human translators via the platform Gengo. The translations are below.

In [7]:
# inspect Ch-to-En human translation #1 of sentence from patent summary
reference_human1_sentence = data['reference_human1_sentence']
print(reference_human1_sentence)

['The training corpus extracted from a log can be used to train the language model, or the common lexicon can be sorted by segmenting and counting text in the log.']


In [8]:
# inspect Ch-to-En human translation #1 of full patent summary
reference_human1_summary = data['reference_human1_summary']
print(reference_human1_summary)

['The invention discloses a machine processing and text error correction method and device, a computing device, and a storage medium, specifically comprising corrected and rewritten text pairs of incorrect text and corresponding correct text.', 'The corrected and rewritten text pairs serving as a training corpus to train the machine processing model, thereby preparing a machine processing model suitable for text error correction.', 'Through extraction of corrected and rewritten text pairs from a log, the machine processing model can be trained and thus made fit for text correction by inputting the first text into the machine processing model to get the second text, that is the error correction result text.', 'In addition, the language model or the common lexicon can be used to determine whether the first text needs to be corrected.', 'The training corpus extracted from a log can be used to train the language model, or the common lexicon can be sorted by segmenting and counting text in 

In [9]:
# inspect Ch-to-En human translation #2 of sentence from patent summary
reference_human2_sentence = data['reference_human2_sentence']
print(reference_human2_sentence)

['It can use the practice language material gathered from the diary or daily journal to train the language model, and it can also initialize the common vocabulary bank through the segmentation and analysis of the diary or daily journal text']


In [10]:
# inspect Ch-to-En human translation #2 of full patent summary
reference_human2_summary = data['reference_human2_summary']
print(reference_human2_summary)

['This invention makes public a machine processing and text error correction method and hardware, computing equipment and storage medium, and specifically pairs error text with the corresponding corrected and modified correct text.', 'It uses this text pair as training material for the machine processing model, and from there prepares the machine processing model that is applied to the text correction.', 'It can train the machine processing model using a diary or daily journal and make it suitable for text correction.', 'The first text version is inputted into the machine processing model to get the second text version, which is the corrected text.', 'Additionally, it can also use a stored language model or common vocabulary bank to determine if the first text version needs correction.', 'It can use the practice language material gathered from the diary or daily journal to train the language model, and it can also initialize the common vocabulary bank through the segmentation and analy

Finally, we source "candidate" machine translations from two separate machine translation systems: Google Translate and the translation tool of the World Intellectual Property Organization (WIPO).

In [11]:
# inspect machine translation by Google of full summary
candidate_google_summary = data['candidate_google_summary']
print(candidate_google_summary)

['The invention discloses a machine processing and text error correction method and device, a computing device and a storage medium, and particularly comprises an error correction rewriting pair of an error text and a corresponding correct text, and an error correction rewriting pair as a training corpus, and a machine processing model.', 'Training is performed, thereby preparing a machine processing model suitable for text correction. The machine processing model can be trained to mine the error correction by mining the error correction rewrite pair from the log.', 'The first text is input into the machine processing model to obtain a second text, that is, an error correction result text.', 'In addition, you can use the language model or common lexicon to determine whether the first text needs to be corrected.', ' The language model can be trained using the training corpus extracted from the log, or the common lexicon can be organized by segmenting and counting the text in the log.', 

In [12]:
# inspect machine translation by Google of sentence from summary
candidate_google_sentence = data['candidate_google_sentence']
print(candidate_google_sentence)

['The language model can be trained using the training corpus extracted from the log, or the common lexicon can be organized by segmenting and counting the text in the log.']


In [13]:
# inspect machine translation by WIPO of full summary
candidate_wipo_summary = data['candidate_wipo_summary']
print(candidate_wipo_summary)

['The present invention discloses a machine processing and text correction method and device, computing equipment and a storage medium.', 'Specifically comprising corrected and rewritten text pairs of incorrect text and corresponding correct text, the corrected and rewritten text pairs serving as a training corpus for training a machine processing model, and in this way developing a machine processing model for use in text correction.', 'Through extraction of corrected and rewritten text pairs from a log, the machine processing model can be trained and thus made fit for text correction by inputting a first text into the machine processing model to obtain a second text i.e. a corrected text result.', 'Moreover, a language model or a lexicon of commonly used words can be used to assess whether text needs correction. The training corpus extracted from the log can be used to train the language model and also, through text segmentation and statistical analysis of text in the log compile a l

In [14]:
# inspect machine translation by WIPO of sentence from summary
candidate_wipo_sentence = data['candidate_wipo_sentence']
print(candidate_wipo_sentence)

['The training corpus extracted from the log can be used to train the language model and also, through text segmentation and statistical analysis of text in the log compile a lexicon of commonly used words.']


#### Calculate BLEU

We calculate BLEU scores using the nltk.translate.bleu_score implementation from the Natural Language Toolkit (NLTK) package. We primarily use the corpus_bleu function rather than the sentence_bleu function, which scores only a single sentence at a time. Before scoring, we prepare the translation texts with standard NLP pre-processing so that each text is represented as a list of word tokens.

In [15]:
# For summary-level and sentence-level examples:

# split sentences into tokens
ref_human1_summary = split_tokens(reference_human1_summary)
ref_human2_summary = split_tokens(data['reference_human2_summary'])
can_google_summary = split_tokens(candidate_google_summary)
can_wipo_summary = split_tokens(candidate_wipo_summary)

ref_human1_sentence = split_tokens(reference_human1_sentence)
ref_human2_sentence = split_tokens(reference_human2_sentence)
can_google_sentence = split_tokens(candidate_google_sentence)
can_wipo_sentence = split_tokens(candidate_wipo_sentence)

# clean punctuation from tokens
ref_human1_summary = clean_punctuation(ref_human1_summary)
ref_human2_summary = clean_punctuation(ref_human2_summary)
can_google_summary = clean_punctuation(can_google_summary)
can_wipo_summary = clean_punctuation(can_wipo_summary)

ref_human1_sentence = clean_punctuation(ref_human1_sentence)
ref_human2_sentence = clean_punctuation(ref_human2_sentence)
can_google_sentence = clean_punctuation(can_google_sentence)
can_wipo_sentence = clean_punctuation(can_wipo_sentence)
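The helpers `split_tokens` and `clean_punctuation` are imported from the local `multilingual_machines` module, whose source is not shown in this notebook. As a rough guide to what such pre-processing might look like, here is a hypothetical pair of stand-ins (the `_sketch` names are invented for this example, and the real helpers may behave differently):

```python
import string

def split_tokens_sketch(sentences):
    """Split each sentence string into a list of whitespace-separated tokens."""
    return [sentence.split() for sentence in sentences]

def clean_punctuation_sketch(tokenized_sentences):
    """Strip ASCII punctuation from tokens and drop tokens left empty."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = []
    for tokens in tokenized_sentences:
        stripped = [token.translate(table) for token in tokens]
        cleaned.append([token for token in stripped if token])
    return cleaned

# Hypothetical input in the same shape as the notebook data: a list of sentences.
sample = ["The language model can be trained, or the lexicon can be sorted."]
tokens = clean_punctuation_sketch(split_tokens_sketch(sample))
print(tokens)
```

Whatever the exact implementation, the key point is the output shape: a list of sentences, each of which is a list of word tokens, because that is the shape the BLEU functions expect.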

In [16]:
# Inspect examples of n-grams

# return list of bi-grams
bi_grams = list(ngrams(ref_human1_summary[0], 2))[0:5]

# returns list of four-grams
four_grams = list(ngrams(ref_human1_summary[0], 4))[0:5]

print(f"bi-grams: {bi_grams}")
print(f"four-grams: {four_grams}")

bi-grams: [('The', 'invention'), ('invention', 'discloses'), ('discloses', 'a'), ('a', 'machine'), ('machine', 'processing')]
four-grams: [('The', 'invention', 'discloses', 'a'), ('invention', 'discloses', 'a', 'machine'), ('discloses', 'a', 'machine', 'processing'), ('a', 'machine', 'processing', 'and'), ('machine', 'processing', 'and', 'text')]


We then pass the corpus_bleu function the reference sentences as a list containing, for each candidate, a list of tokenized references. The algorithm computes the n-gram matches for each candidate sentence and adds up the clipped n-gram counts across all candidate sentences. Next, it divides that sum by the total number of n-grams in the candidate corpus to compute a modified precision score for the candidate corpus.
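The nesting of the inputs is easy to get wrong, so here is a small sketch of the shapes that NLTK's `corpus_bleu(list_of_references, hypotheses)` expects. The toy sentences and variable names are assumptions for illustration only; they are not drawn from the patent data.

```python
from nltk.translate.bleu_score import corpus_bleu

# Two hypothetical candidate sentences, each a list of tokens.
hyp_a = "the model can be trained on text from the log".split()
hyp_b = "the lexicon is built by counting words".split()

# One list of references PER candidate; each reference is a list of tokens.
refs_a = ["the model can be trained on text from the log".split(),
          "text from the log can train the model".split()]
refs_b = ["the lexicon is built by counting words".split()]

hypotheses = [hyp_a, hyp_b]            # list of token lists
list_of_references = [refs_a, refs_b]  # list of lists of token lists

# The two outer lists must have the same length and be aligned by position.
assert len(list_of_references) == len(hypotheses)

score = corpus_bleu(list_of_references, hypotheses)
print(score)
```

Each candidate here exactly matches one of its references, so this toy corpus receives the maximum score. Dropping one level of nesting does not raise an error, but it makes NLTK treat individual token strings as references and yields a meaningless score.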

In [18]:
# BLEU-4 example
bleu4_example = corpus_bleu([ref_human1_sentence], ref_human1_sentence)
print(float(bleu4_example))

1.0


In [19]:
# BLEU-4 for Google translation with one reference using corpus_bleu
bleu4_google_corpusbleu_human1 = corpus_bleu([ref_human1_sentence], 
                                             can_google_sentence)
print(float(bleu4_google_corpusbleu_human1))

0.4370614964591188


In [20]:
# BLEU-4 for Google translation with second reference using corpus_bleu
bleu4_google_corpusbleu_human2 = corpus_bleu([ref_human2_sentence], 
                                             can_google_sentence)
print(float(bleu4_google_corpusbleu_human2))

5.010025055942425e-155


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
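The warning in the output above recommends `SmoothingFunction()`. As a hedged illustration of what that looks like, the sketch below applies NLTK's `method1` smoothing, which adds a small epsilon to n-gram orders that have no matches. The toy sentences are hypothetical, and `method1` is only one of several smoothing methods the class offers; which one suits a given evaluation is a judgment call.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical sentence pair with matching 1- to 3-grams but no matching 4-gram.
reference = "the quick brown dog jumps over".split()
hypothesis = "the quick brown fox jumps".split()

# method1 replaces a zero match count with a small epsilon before the
# geometric mean is taken, so the score no longer collapses toward zero.
smoother = SmoothingFunction().method1
smoothed = sentence_bleu([reference], hypothesis, smoothing_function=smoother)
print(f"smoothed BLEU-4: {smoothed:.4f}")
```

Smoothed and unsmoothed scores are not directly comparable, so the smoothing method is one of the details worth reporting alongside any BLEU score.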


In [21]:
# BLEU-4 for Google translation with two references using corpus_bleu
bleu4_google_corpusbleusm_2refs = corpus_bleu([[ref_human1_sentence[0], 
                                                ref_human2_sentence[0]]], 
                                                can_google_sentence)
print(float(bleu4_google_corpusbleusm_2refs))

0.4523563820810908


In [22]:
# Note: BLEU was not originally intended for sentence-level calculation.
# NLTK provides sentence_bleu, but averaging its scores over sentences
# does not reproduce the corpus-level score across a corpus of sentences (docs).

# BLEU-4 for Google translation with two references using sentence_bleu
bleu4_google_sentencebleu = sentence_bleu([ref_human1_sentence[0], 
                                           ref_human2_sentence[0]],
                                           can_google_sentence[0])
print(float(bleu4_google_sentencebleu))

0.4523563820810908
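With a single sentence, the two functions agree, as the matching scores above show. They diverge across several sentences because corpus-level BLEU pools counts before dividing instead of averaging per-sentence precisions. The sketch below shows the difference on made-up counts; the function names and the numbers are assumptions for illustration.

```python
def macro_average(pairs):
    """Average the per-sentence precisions (sentence-level averaging)."""
    return sum(matched / total for matched, total in pairs) / len(pairs)

def micro_average(pairs):
    """Pool numerators and denominators first, then divide (corpus-level BLEU)."""
    return sum(matched for matched, _ in pairs) / sum(total for _, total in pairs)

# Hypothetical (clipped matches, candidate n-grams) for a short and a long sentence.
pairs = [(1, 2), (9, 10)]

print(f"macro: {macro_average(pairs):.3f}")  # prints macro: 0.700
print(f"micro: {micro_average(pairs):.3f}")  # prints micro: 0.833
```

The long sentence contributes more n-grams and therefore carries more weight in the pooled figure, whereas the macro-average treats both sentences equally.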


In [23]:
# BLEU-4 for WIPO translation with two references using corpus_bleu
bleu4_wipo_corpusbleusm_2refs = corpus_bleu([[ref_human1_sentence[0], 
                                              ref_human2_sentence[0]]], 
                                        can_wipo_sentence)
print(float(bleu4_wipo_corpusbleusm_2refs))

0.48435192067731586


What's the result? The score of the first candidate translation by Google Translate is 0.45. The score of the second machine translation by WIPO is 0.48.

The same comparison can be repeated on the entire abstract text, which the next section begins to set up.

#### Full summary

As the NLTK documentation for corpus_bleu explains: "Instead of averaging the sentence level BLEU scores (i.e. macro-average precision), the original BLEU metric accounts for the micro-average precision (i.e. summing the numerators and denominators for each hypothesis-reference(s) pairs before the division)."

In [24]:
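# BLEU-4 for Google translation of summary with two references using corpus_bleu
# (assumes all three summaries are split into the same number of aligned sentences)
# bleu4_google_summary = corpus_bleu([list(refs) for refs in zip(ref_human1_summary, ref_human2_summary)],
#                                    can_google_summary)
# print(float(bleu4_google_summary))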

In [25]:
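# Self-comparison of human reference #1 at the summary level using corpus_bleu.
# Caution: corpus_bleu expects one list of references per hypothesis; this call omits that
# nesting, so each token string is treated as a reference and the score collapses toward 0.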
bleu4_google_summary = corpus_bleu(ref_human1_summary, ref_human1_summary)
print(float(bleu4_google_summary))

7.135281163847754e-232


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


In [26]:
# BLEU-4 for WIPO translation of summary with two references using corpus_bleu
# (assumes all three summaries are split into the same number of aligned sentences)
# bleu4_wipo_summary_corpusbleusm_2refs = corpus_bleu(
#     [list(refs) for refs in zip(ref_human1_summary, ref_human2_summary)],
#     can_wipo_summary)
# print(float(bleu4_wipo_summary_corpusbleusm_2refs))

In [27]:
# Note: by default, bleu_score calculates BLEU-4, which is a score for the overlap of n-grams up to 4-grams
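The n-gram order is controlled through the `weights` argument of the NLTK functions: the number of weights sets the highest n-gram order, and each weight sets that order's share of the geometric mean. The following sketch uses a hypothetical sentence pair to compute BLEU-1 and BLEU-2 instead of the default BLEU-4.

```python
from nltk.translate.bleu_score import corpus_bleu

# Hypothetical single-sentence corpus with one reference.
list_of_references = [["the cat sat on the mat".split()]]
hypotheses = ["the cat is on the mat".split()]

# One weight: unigram precision only (BLEU-1).
bleu1 = corpus_bleu(list_of_references, hypotheses, weights=(1.0,))

# Two equal weights: unigrams and bigrams (BLEU-2).
bleu2 = corpus_bleu(list_of_references, hypotheses, weights=(0.5, 0.5))

print(f"BLEU-1: {bleu1:.4f}")
print(f"BLEU-2: {bleu2:.4f}")
```

Lower orders are more forgiving of word-order differences, so a BLEU-1 or BLEU-2 figure should always be labeled as such and never compared directly with BLEU-4.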

In [28]:
# Note: "the geometric mean of the test corpus’ modified precision scores times an exponential brevity penalty factor"

With this basic example of the application of BLEU in mind, we can now discuss some of the potential limits of the application of BLEU to machine translation and natural language processing tasks more generally.

### Limits of BLEU in cross-lingual machine learning

There appears to be agreement that BLEU is appropriate for diagnostic evaluation of machine translation systems, but some researchers suggest that BLEU may not be appropriate for certain types of tasks. As we saw above, BLEU allows candidates to be compared against multiple references, but some researchers suggest that the variation in translation that BLEU permits goes further than it should [5]. Callison-Burch and colleagues suggest that BLEU may allow variations that would receive varied human evaluations but that receive the same BLEU score, and that it "may not correlate with human judgment to the degree that it is currently believed to do." In other words, translations that vary significantly, some of them of lower quality, can receive the same BLEU score [4]. The authors argue "that there are instances when an improvement in Bleu is not sufficient to reflect a genuine improvement in translation quality, and in other circumstances that it is not necessary to improve Bleu in order to achieve a noticeable improvement in translation quality." [3]

BLEU may also not be appropriate for comparisons of machine learning systems that employ significantly different strategies. Callison-Burch and colleagues argue that inappropriate uses for Bleu include: 1) comparing systems which employ radically different strategies (especially comparing phrase-based statistical machine translation systems against systems that do not employ similar n-gram-based approaches); 2) trying to detect improvements for aspects of translation that are not modeled well by Bleu; and 3) monitoring improvements that occur infrequently within a test corpus. Some researchers suggest that BLEU is not appropriate for wider tasks, such as the evaluation of individual texts or scientific hypothesis testing [6].

Some researchers raise questions about the construct validity of BLEU as a measure. In a structured literature review, Reiter examines BLEU–human correlations and suggests that whether BLEU correlates with human evaluations depends heavily on the details of the systems being evaluated, the exact corpus texts used, and the exact protocol used for human evaluations. Treating BLEU as a surrogate endpoint, Reiter argues that it is useful only if its scores "reliably predict an outcome that is of real-world importance or is the core of a scientific hypothesis we wish to test." Reiter finds that none of the "surveyed papers used real-world human evaluations; that is, they all used human evaluations performed in an artificial context (usually by paid individuals, crowdsourced workers, or the researchers themselves), rather than looking at the impact of systems on real-world users." Given this absence of real-world validation, Reiter suggests that "the results of real-world A/B testing could be used to determine contexts in which BLEU reliably had good correlation with real-world effectiveness."

Reiter and others argue that researchers should approach BLEU as a diagnostic for machine translation at the system level, but not as a technique for evaluating individual outputs of a system. Reiter finds that "the evidence does not support using BLEU to evaluate other types of NLP systems (outside of [machine translation]), and it does not support using BLEU to evaluate individual texts rather than NLP systems." Because of these concerns about the validity and reliability of BLEU, Reiter argues that it should not be the primary evaluation technique in NLP papers.

While there is recognition of some of the shortcomings of the measure, there is also no clear replacement. Some researchers call for more clarity in the reporting of BLEU scores [2]. Their argument is that BLEU is under-specified and parameterized, that preprocessing schemes have a large effect on scores and can render them incomparable, and that there are no standard conventions for reporting the details of how a BLEU score was computed.


---

If you're working across languages in your natural language processing workflows, understanding these details of BLEU will help you decide when and how to use this metric in your projects.