# Measuring Multilingual Machines
#### Exploring BLEU Scores using Patent Data
<br>
#### by Lee Mackey

This notebook accompanies the article [Measuring Multilingual Machines](https://medium.com/@glmack) published on Medium.
<br>
<br>

## Introduction
---

Does your machine learn in Chinese? I don't speak Mandarin or Cantonese so Google Translate gets all the credit \- good or bad \- for translating the preceding sentence into: "您和您的機器學習中文嗎?" But how might a researcher quickly evaluate the quality of machine translations? This question encapsulates the basic challenge that gives rise to the BLEU metric. BLEU, which stands for bilingual evaluation understudy, is a default measure of machine translation quality and is also sometimes applied to cross-lingual natural language processing (NLP) tasks. The metric is well established in machine translation, but some analysts question its application to a wider set of NLP tasks beyond the purpose for which the algorithm was originally developed. This article explores lessons, implementations and limits of BLEU using examples drawn from multilingual patent documents.

In [1]:
# import modules
from nltk.translate.bleu_score import corpus_bleu
from nltk import bigrams, trigrams, ngrams
from multilingual_machines import split_tokens, clean_punctuation
import string
import json
from IPython.display import IFrame

In [2]:
# load example data from 'patent_examples.txt' file in GitHub repo
with open('patent_examples.txt') as f:
    data = json.load(f)

## Basics of BLEU
---


Researchers at IBM developed the BLEU algorithm in 2002 as an efficient method to evaluate machine translation systems without human translators. The original [paper](https://www.aclweb.org/anthology/P02-1040.pdf) by the developers, Papineni and colleagues, is a good place to start if you’re interested in the founding context and objectives of the algorithm. 

In [5]:
# browse original BLEU paper
IFrame('https://www.aclweb.org/anthology/P02-1040.pdf', width=700, height=300)

BLEU is an adjusted precision measure of the overlap of word sequences between a “candidate” machine translation and “reference” human translations. BLEU is a measure of precision in that it counts "n-grams", word sequences of length *n*, of a machine translation that match the n-grams in a human translation, and then divides by the number of n-grams in the machine translation. The measure is adjusted in two ways: it "clips" each n-gram's match count to the maximum number of times that n-gram occurs in any single human translation, and it applies a brevity penalty to machine translations that are shorter than the reference translation.
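
To make the clipping step concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation; the function names `ngram_counts` and `modified_precision` are illustrative assumptions, and the toy sentences follow the repeated-word example from the original BLEU paper, where a candidate made only of "the" should not be rewarded for every occurrence.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Return (clipped matches, candidate n-gram total) for one candidate."""
    cand = ngram_counts(candidate, n)
    # For each n-gram, the clip is its maximum count in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped, total


candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]

clipped, total = modified_precision(candidate, references, 1)
print(f"clipped matches: {clipped}, candidate unigrams: {total}")
```

Without clipping, all seven unigrams would count as matches; with clipping, only two do, because "the" appears at most twice in any one reference.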

The resulting BLEU score is a number between 0 and 1, where 0 represents the complete absence of n-gram matches between candidate and reference texts, and 1 represents a machine translation that is identical to one of the reference texts. In practice, this measure typically aggregates multiple word sequence lengths \- 4-grams (four-word sequences), tri-grams (three-word sequences), bi-grams (two-word sequences), and uni-grams (one-word sequences) \- via a geometric mean of the respective n-gram scores. While the algorithm is designed for comparisons at the level of a document, BLEU counts n-grams sentence by sentence and then combines the sentence counts across the entire document. To make the metric more tangible, you can practice applying BLEU using translations of Chinese-language patents.
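
The aggregation described above can be sketched as follows. This is a simplified illustration rather than NLTK's code: the function names and the four precision values are hypothetical, the weights are assumed to be uniform, and a single reference length stands in for the closest reference length that BLEU uses when there are several references.

```python
import math


def brevity_penalty(candidate_len, reference_len):
    """Return 1 if the candidate is at least as long as the reference, else exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


def combine(precisions, candidate_len, reference_len):
    """Geometric mean of the n-gram precisions, scaled by the brevity penalty."""
    if min(precisions) == 0:
        return 0.0  # a single zero precision drives the geometric mean to zero
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(candidate_len, reference_len) * math.exp(log_mean)


# Hypothetical precisions for 1-grams through 4-grams
precisions = [0.8, 0.6, 0.4, 0.3]
same_length = combine(precisions, candidate_len=20, reference_len=20)
too_short = combine(precisions, candidate_len=16, reference_len=20)
print(round(same_length, 3), round(too_short, 3))
```

The same precisions yield a lower score for the shorter candidate, which is how BLEU discourages translations that achieve high precision by leaving words out.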

## Applying BLEU to Patent Texts
---

A growing share of patents in the machine learning space are written and filed in Chinese according to a recent report by WIPO (World Intellectual Property Organization), which is the global organization that governs patents. To explore the basics of BLEU in this multilingual space, you can begin with a Chinese patent that Alibaba, the e-commerce company, filed for an NLP innovation at global scale. The title of the Chinese-language patent is displayed below.


#### Acquire translations of Chinese patent text

In [6]:
# inspect title of patent in original Chinese
print(data['original_title_cn'])

['机器处理及文本纠错方法和装置、计算设备以及存储介质']


For more details about the example patent, you can access the WIPO data query tool in your browser, in either the English version or the Chinese version.

In [13]:
# inspect the Chinese version using the WIPO GUI
IFrame('https://patentscope.wipo.int/search/zh/detail.jsf?docId=WO2019085779',
       width=700,
       height=300)

# exchange url or paste in browser to inspect English version on WIPO GUI
# https://patentscope.wipo.int/search/en/detail.jsf?docId=WO2019085779

Human translators often produce translations of equivalent quality that nonetheless differ in word choice and structure. BLEU therefore accepts single or multiple human translations for comparison to machine translations. You can next obtain Chinese-to-English reference translations from two different human translators via the platform Gengo.

In [14]:
# inspect human #1's Ch-to-En translation of patent summary
reference_human1_summary = data['reference_human1_summary']
print(reference_human1_summary)

['The invention discloses a machine processing and text error correction method and device, a computing device, and a storage medium, specifically comprising corrected and rewritten text pairs of incorrect text and corresponding correct text.', 'The corrected and rewritten text pairs serving as a training corpus to train the machine processing model, thereby preparing a machine processing model suitable for text error correction.', 'Through extraction of corrected and rewritten text pairs from a log, the machine processing model can be trained and thus made fit for text correction by inputting the first text into the machine processing model to get the second text, that is the error correction result text.', 'In addition, the language model or the common lexicon can be used to determine whether the first text needs to be corrected.', 'The training corpus extracted from a log can be used to train the language model, or the common lexicon can be sorted by segmenting and counting text in 

In [15]:
# inspect human #2's Ch-to-En translation of patent summary
reference_human2_summary = data['reference_human2_summary']
print(reference_human2_summary)

['This invention makes public a machine processing and text error correction method and hardware, computing equipment and storage medium, and specifically pairs error text with the corresponding corrected and modified correct text.', 'It uses this text pair as training material for the machine processing model, and from there prepares the machine processing model that is applied to the text correction.', 'It can train the machine processing model using a diary or daily journal and make it suitable for text correction.', 'The first text version is inputted into the machine processing model to get the second text version, which is the corrected text.', 'Additionally, it can also use a stored language model or common vocabulary bank to determine if the first text version needs correction.', 'It can use the practice language material gathered from the diary or daily journal to train the language model, and it can also initialize the common vocabulary bank through the segmentation and analy

Finally, you can source "candidate" machine translations from two separate machine translation systems: Google Translate and the one offered by the World Intellectual Property Organization (WIPO).

In [16]:
# inspect machine translation by Google of full summary
candidate_google_summary = data['candidate_google_summary']
print(candidate_google_summary)

['The invention discloses a machine processing and text error correction method and device, a computing device and a storage medium, and particularly comprises an error correction rewriting pair of an error text and a corresponding correct text, and an error correction rewriting pair as a training corpus, and a machine processing model.', 'Training is performed, thereby preparing a machine processing model suitable for text correction. The machine processing model can be trained to mine the error correction by mining the error correction rewrite pair from the log.', 'The first text is input into the machine processing model to obtain a second text, that is, an error correction result text.', 'In addition, you can use the language model or common lexicon to determine whether the first text needs to be corrected.', ' The language model can be trained using the training corpus extracted from the log, or the common lexicon can be organized by segmenting and counting the text in the log.', 

In [17]:
# inspect machine translation by WIPO of full summary
candidate_wipo_summary = data['candidate_wipo_summary']
print(candidate_wipo_summary)

['The present invention discloses a machine processing and text correction method and device, computing equipment and a storage medium.', 'Specifically comprising corrected and rewritten text pairs of incorrect text and corresponding correct text, the corrected and rewritten text pairs serving as a training corpus for training a machine processing model, and in this way developing a machine processing model for use in text correction.', 'Through extraction of corrected and rewritten text pairs from a log, the machine processing model can be trained and thus made fit for text correction by inputting a first text into the machine processing model to obtain a second text i.e. a corrected text result.', 'Moreover, a language model or a lexicon of commonly used words can be used to assess whether text needs correction. The training corpus extracted from the log can be used to train the language model and also, through text segmentation and statistical analysis of text in the log compile a l

#### Calculate BLEU

There are multiple implementations and extensions of BLEU to explore. You can begin by calculating scores using the implementation from the Natural Language Toolkit (NLTK) package. The corpus_bleu function takes tokenized sentences, so you first prepare the translation texts via standard NLP pre-processing steps.

In [None]:
# Pre-process candidate and reference translations

# split translations into tokens
ref_human1_summary = split_tokens(reference_human1_summary)
ref_human2_summary = split_tokens(reference_human2_summary)
can_google_summary = split_tokens(candidate_google_summary)
can_wipo_summary = split_tokens(candidate_wipo_summary)

# clean punctuation from tokens
ref_human1_summary = clean_punctuation(ref_human1_summary)
ref_human2_summary = clean_punctuation(ref_human2_summary)
can_google_summary = clean_punctuation(can_google_summary)
can_wipo_summary = clean_punctuation(can_wipo_summary)
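
The `split_tokens` and `clean_punctuation` helpers come from the notebook's companion module, which is not shown here. As a rough sketch of what such helpers might do, the following assumes whitespace tokenization and removal of ASCII punctuation; the `_sketch` function names are hypothetical and the real helpers may behave differently.

```python
import string


def split_tokens_sketch(sentences):
    """Split each sentence string into a list of whitespace-delimited tokens."""
    return [sentence.split() for sentence in sentences]


def clean_punctuation_sketch(tokenized_sentences):
    """Strip ASCII punctuation from each token and drop tokens left empty."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = []
    for tokens in tokenized_sentences:
        stripped = (token.translate(table) for token in tokens)
        cleaned.append([token for token in stripped if token])
    return cleaned


sentences = ["The first text is input into the model, that is, a result text."]
tokens = clean_punctuation_sketch(split_tokens_sketch(sentences))
print(tokens)
```

The result is a list of token lists, one per sentence, which is the shape the BLEU calculation below expects for each translation.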

You can then pass the corpus_bleu function the candidate translation and the two reference translations as lists of lists of tokens. The algorithm computes the n-gram matches for each candidate sentence and adds the clipped n-gram counts for all the candidate sentences. Next, BLEU divides by the number of candidate n-grams in the candidate corpus to compute a modified precision score for the candidate corpus. By default, corpus_bleu takes the geometric mean of the n-gram scores up to 4-grams, that is, 4-grams, tri-grams, bi-grams and unigrams.

In [None]:
# Inspect examples of n-grams

# return list of bi-grams
bi_grams = list(ngrams(ref_human1_summary[0], 2))[0:5]

# returns list of four-grams
four_grams = list(ngrams(ref_human1_summary[0], 4))[0:5]

print(f"bi-grams: {bi_grams}")
print(f"four-grams: {four_grams}")

For a document with multiple sentences, the original BLEU implementation computes the n-gram matches sentence by sentence, then sums the clipped n-gram counts for all the candidate sentences and, lastly, divides by the number of candidate n-grams in the document. As the NLTK documentation states: "Instead of averaging the sentence level BLEU scores (i.e. macro-average precision), the original BLEU metric accounts for the micro-average precision (i.e. summing the numerators and denominators for each hypothesis-reference(s) pairs before the division)."
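
The difference between the two averaging schemes is easy to see with a few numbers. The per-sentence counts below are hypothetical and chosen only to show that the two averages can diverge.

```python
# Hypothetical (clipped matches, candidate n-gram total) pairs for three sentences
sentence_stats = [(9, 10), (1, 4), (12, 20)]

# Macro-average: compute each sentence's precision, then average the precisions
macro = sum(matched / total for matched, total in sentence_stats) / len(sentence_stats)

# Micro-average: sum numerators and denominators first, then divide once
micro = sum(m for m, _ in sentence_stats) / sum(t for _, t in sentence_stats)

print(f"macro-average precision: {macro:.3f}")
print(f"micro-average precision: {micro:.3f}")
```

Under micro-averaging, longer sentences contribute more n-grams and therefore carry more weight, whereas macro-averaging treats every sentence equally regardless of its length.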

In [None]:
# NLTK's corpus_bleu takes two arguments:
#   list_of_references (list(list(list(str)))): for each hypothesis, a list of reference sentences
#   hypotheses (list(list(str))): a list of hypothesis sentences

In [None]:
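# pool the sentences of both human translations, then repeat the pool
# once per Google candidate sentence (corpus_bleu needs one entry per hypothesis)
refs_list_6 = [ref_human1_summary[:] + ref_human2_summary[:]] * 6
print(len(refs_list_6))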

In [None]:
# same pooled references, repeated once per WIPO candidate sentence
refs_list_5 = [ref_human1_summary[:] + ref_human2_summary[:]] * 5
print(len(refs_list_5))
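
The nesting of the reference lists is easier to follow with toy data. The sketch below mirrors the pooling pattern used in the cells above with made-up token lists; the variable names are illustrative. It shows only the resulting structure, not a BLEU score.

```python
# Toy tokenized translations; these stand in for the notebook's variables
human1 = [["a", "storage", "medium"], ["the", "model", "is", "trained"]]
human2 = [["a", "storage", "device"], ["it", "trains", "the", "model"]]
candidate = [["a", "storage", "medium"], ["the", "model", "is", "trained"],
             ["the", "text", "is", "corrected"]]

# Pool every human sentence, then repeat that pool once per candidate sentence
pooled = human1 + human2
refs_list = [pooled] * len(candidate)

print(len(refs_list))                # one entry per candidate sentence
print(len(refs_list[0]))             # each entry holds all pooled reference sentences
print(refs_list[0] is refs_list[1])  # the repeats share one list object
```

With this layout, every candidate sentence is compared against all reference sentences rather than only against its aligned counterpart, which is one way to cope with candidates and references that are segmented into different numbers of sentences. Because list multiplication repeats the same object, the pool should be treated as read-only.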

In [None]:
# BLEU-4 example using Google translation of patent summary
bleu_google = corpus_bleu(refs_list_6, can_google_summary[:])
print(round(bleu_google, 2))

In [None]:
# BLEU-4 example using WIPO translation of patent summary
bleu_wipo = corpus_bleu(refs_list_5, can_wipo_summary[:])
print(round(bleu_wipo, 2))

What's the result? The score of the first candidate translation by Google Translate is 0.45. The score of the second machine translation by WIPO is 0.48.

If we run the same calculation on the entire abstract text, the score of Google is 0.53, and the score of WIPO is 0.54. With this basic example of the application of BLEU in mind, we can now discuss some of the potential limits of the application of BLEU to machine translation and natural language processing tasks more generally.

## Limits of BLEU
---

There is a general consensus among researchers that BLEU is expedient for diagnostic evaluation of machine translation systems, but some researchers caution that BLEU may not be appropriate for certain types of translation and NLP tasks. One critique is that BLEU's flexibility in allowing variation in translations may result in documents of differing quality that nonetheless score the same on the BLEU metric [1]. The authors argue that "there are instances when an improvement in BLEU is not sufficient to reflect a genuine improvement in translation quality, and in other circumstances that it is not necessary to improve BLEU in order to achieve a noticeable improvement in translation quality." [3] BLEU may also not be appropriate for comparisons of machine learning systems that employ significantly different strategies. Callison-Burch and colleagues argue that inappropriate uses for BLEU include: 1) comparing systems which employ radically different strategies (especially comparing phrase-based statistical machine translation systems against systems that do not employ similar n-gram-based approaches), 2) trying to detect improvements for aspects of translation that are not modeled well by BLEU, and 3) monitoring improvements that occur infrequently within a test corpus. Some researchers suggest that BLEU is not appropriate for wider tasks, such as evaluation of individual texts or scientific hypothesis testing [6].

In a structured literature review, Reiter suggests that whether BLEU correlates with human evaluations is very dependent on the details of the systems being evaluated, the exact corpus texts used, and the exact protocol used for human evaluations. Reiter argues that, as a surrogate endpoint, BLEU is useful only if its scores "reliably predict an outcome that is of real-world importance or is the core of a scientific hypothesis we wish to test." Reiter finds that none of the "surveyed papers used real-world human evaluations; that is, they all used human evaluations performed in an artificial context (usually by paid individuals, crowdsourced workers, or the researchers themselves), rather than looking at the impact of systems on real-world users." Given this absence of validated findings on BLEU, Reiter calls for real-world A/B tests, noting that "the results of real-world A/B testing could be used to determine contexts in which BLEU reliably had good correlation with real-world effectiveness." Reiter and others argue that researchers should approach BLEU as a diagnostic for machine translation at the system level, but not as an evaluation technique to measure the output of a system. Reiter finds that "the evidence does not support using BLEU to evaluate other types of NLP systems (outside of [machine translation]), and it does not support using BLEU to evaluate individual texts rather than NLP systems." Because of these concerns about the validity and reliability of BLEU, Reiter argues that it should not be the primary evaluation technique in NLP papers.

While there is a recognition of some of the shortcomings of the measure, there is also no clear replacement. Some researchers call for more clarity in the reporting of BLEU scores [2]. Their argument is that BLEU is under-specified and parameterized, that differences in preprocessing schemes have a large effect on scores and make them incomparable, and that researchers have no standard conventions for reporting the details of how BLEU scores were computed.
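
The sensitivity to preprocessing is simple to demonstrate. The sketch below scores one made-up candidate against one made-up reference under two tokenization schemes, using clipped unigram precision only; the sentences and the `unigram_precision` and `normalize` helpers are illustrative assumptions rather than part of any BLEU library.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Clipped unigram precision of one tokenized candidate against one reference."""
    cand, ref = Counter(candidate), Counter(reference)
    clipped = sum(min(count, ref[token]) for token, count in cand.items())
    return clipped / sum(cand.values())


def normalize(sentence):
    """Lowercase and strip commas and periods before splitting on whitespace."""
    return sentence.lower().replace(",", "").replace(".", "").split()


candidate = "The model, once trained, corrects the text."
reference = "Once trained, the model corrects text."

# Scheme A: whitespace split only, so case and attached punctuation block matches
raw = unigram_precision(candidate.split(), reference.split())

# Scheme B: same sentences, normalized before counting
normalized = unigram_precision(normalize(candidate), normalize(reference))

print(f"whitespace only: {raw:.2f}")
print(f"normalized:      {normalized:.2f}")
```

The translations are identical in both runs; only the preprocessing differs, yet the precision changes substantially. This is why scores reported without their tokenization details are hard to compare.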

## Conclusion
---
The basics, examples and limits of BLEU addressed in this article are important as patent documents in the machine learning space continue to become increasingly multilingual. If you're working across languages in your natural language processing workflows, understanding these details of BLEU will help you decide when and how to use this metric in your projects.