# Measuring Multilingual Machines
#### BLEU Scores and Cross-Lingual Machine Learning

#### by Lee Mackey

This notebook accompanies an article of the same name published on Medium.

Does your machine learn in Chinese? 您和您的機器學習中文嗎? I don’t speak a word of Mandarin or Cantonese, so Google Translate gets all the credit, good or bad, for the preceding sentence. But how could you quickly evaluate the quality of this machine translation? This challenge encapsulates the basic demand that gave rise to the BLEU metric. BLEU, which stands for bilingual evaluation understudy, is the default measure of machine translation quality and is sometimes also applied to more general cross-lingual tasks in natural language processing (NLP). The metric is well established in the machine translation space, but some analysts question its application to a wider set of NLP tasks beyond the purpose for which the algorithm was originally developed. This article explores these issues by briefly discussing the basic lessons and limits of BLEU, using examples drawn from the multilingual space of global patent documents.

In [7]:
# Import packages

from nltk.translate.bleu_score import (sentence_bleu, corpus_bleu, 
                                       modified_precision, 
                                       SmoothingFunction)

from nltk import (bigrams, trigrams, ngrams, sent_tokenize, word_tokenize)

import textwrap
import string

### Basics of BLEU

Researchers at IBM developed the BLEU algorithm in 2002 as an efficient method to evaluate machine translation tasks that would otherwise require human evaluators. The original paper by the developers, Papineni and colleagues, is a good place to start if you’re interested in the founding milieu and details of the algorithm [1]. BLEU is a modified measure of precision based on the overlap of word sequences between a “candidate” machine translation and one or more “reference” human translations. Taking a single machine-translated sentence as the unit, BLEU counts how often each word sequence, or n-gram, occurs in the candidate and then “clips” that count at the maximum number of times the same n-gram occurs in any single reference translation. The clipped counts of all n-grams in the sentence are summed, and the sum is divided by the total (unclipped) number of n-grams in the candidate text.
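The clipping step is easier to see in code. The sketch below is a minimal, standard-library illustration of modified n-gram precision and is not NLTK's implementation; the helper names `ngram_counts` and `clipped_precision` are made up for this example, and the toy sentences are illustrative rather than drawn from the patent data.

```python
from collections import Counter
from fractions import Fraction


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Modified n-gram precision of one candidate against one or more references."""
    cand_counts = ngram_counts(candidate, n)

    # For each n-gram, find the most times it occurs in any single reference.
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in ngram_counts(reference, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    # Clip each candidate count at that maximum, then divide by the unclipped total.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return Fraction(clipped, total) if total else Fraction(0)


# A degenerate candidate that repeats one common word seven times.
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]

print(clipped_precision(candidate, references, 1))  # 2/7
```

Without clipping, every one of the seven words would count as a match and the precision would be 7/7. Clipping caps the credit for "the" at two, the most it appears in any one reference.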

The BLEU score is a number between 0 and 1, where 0 represents the complete absence of n-gram overlap between candidate and reference texts, and where 1 is the score of a machine translation that is identical to one of the reference texts. While this example considers BLEU at the level of the sentence, the metric was designed for evaluation over an entire corpus: the clipped and total n-gram counts are pooled across all sentences rather than averaged sentence by sentence, and the aggregate is multiplied by a brevity penalty that lowers the score of candidates that are shorter than their references. If you’re interested in learning more, you might learn to calculate BLEU in a fifteen-minute video by Andrew Ng of DeepLearning.AI. To make learning BLEU more tangible, I begin with machine translations and human translations of Chinese text.
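Corpus-level aggregation is worth a small illustration, because pooling n-gram counts over a corpus is not the same as averaging sentence-level precisions. The sketch below uses only the standard library, a single reference per sentence, and a two-sentence toy corpus invented for this example; the helper names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_and_total(candidate, reference, n):
    """Return (clipped matches, total candidate n-grams) for one sentence pair."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped, sum(cand.values())


# Toy corpus of (candidate, reference) pairs: one short perfect sentence
# and one longer, weaker sentence.
pairs = [
    ("the cat sat".split(), "the cat sat".split()),
    ("a dog barked loudly at night".split(), "the dog howled at the moon".split()),
]

stats = [clipped_and_total(cand, ref, 1) for cand, ref in pairs]

# Corpus-level precision: pool the counts first, then divide once.
pooled = sum(matches for matches, _ in stats) / sum(total for _, total in stats)

# Naive alternative: compute a precision per sentence, then average.
averaged = sum(matches / total for matches, total in stats) / len(stats)

print(f"pooled: {pooled:.3f}, averaged: {averaged:.3f}")
```

Pooling gives longer sentences more weight, so the weak six-word sentence pulls the pooled precision (5/9) below the simple average of the two sentence-level precisions (2/3).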

### Applications of BLEU using Patent Texts as Examples
A growing share of patents in the machine learning space are originally written and filed in Chinese, according to a recent report by the World Intellectual Property Organization (WIPO), the global governance organization for patents. To explore the basics of BLEU in a more tangible manner, we begin with the international filing of a Chinese-language patent by the e-commerce company Alibaba for an invention related to natural language processing.

In [None]:
# Inspect example data: international patent for NLP invention by Alibaba

# inspect title of patent in original Chinese
original_title_cn = ("""机器处理及文本纠错方法和装置、计算设备以及存储介质""")

# inspect summary of patent in original Chinese
original_summary_cn = ("""本发明公开了一种机器处理及文本纠错方法和装置、计算设备以及存储介质
，具体包括错误文本和对应的正确文本的纠错改写对, 以纠错改写对作为训练语料，对机器处理模型
进行训练，由此准备好适用于文本纠错的机器处理模型。可以通过从日志中挖掘纠错改写对来对机器
处理模型进行训练，使其适于对文本进行纠错。将第一文本输入到机器处理模型中，得到第二文本，
即纠错结果文本。另外，还可以使用语言模型或常用词库先判断第一文本是否需要进行纠错。可以使
用从日志中挖掘出的训练语料来训练语言模型，也可以通过对日志中的文本进行分词、统计来整理常
用词库。由此，使得能够方便地实现文本纠错""")

# for more details on the example patent and the data source:
# paste url into browser to inspect patent at Chinese version of WIPO Patentscope GUI
# https://patentscope.wipo.int/search/zh/detail.jsf?docId=WO2019085779

# paste url into browser to inspect patent at English version of WIPO Patentscope GUI
# https://patentscope.wipo.int/search/en/detail.jsf?docId=WO2019085779

A sentence from the Chinese language abstract section of the original patent is displayed below.

In [None]:
# inspect sentence from summary of patent in original Chinese
original_sentence = """可以使用从日志中挖掘出的训练语料来训练语言模型，也可以通过对
日志中的文本进行分词、统计来整理常用词库."""

Human translators often produce translations of equivalent quality that nonetheless differ in structure or word choice. BLEU was therefore designed to accept either a single reference translation or several. Next, we obtain Chinese-to-English "reference" translations from two human translators via the translation platform Gengo. The translations are below.

In [None]:
# Inspect "standard" quality Ch-to-En translations by humans on Gengo platform

# inspect Ch-to-En human translation #1 of sentence from patent summary
reference_human1_sentence = """The training corpus extracted from a log can be used 
to train the language model, or the common lexicon can be sorted by 
segmenting and counting text in the log."""

# inspect Ch-to-En human translation #1 of full patent summary
reference_human1_summary = ("""The invention discloses a machine processing and \
text error correction method and device, a computing device, and a \
storage medium, specifically comprising corrected and rewritten text \
pairs of incorrect  text and corresponding correct text. The corrected \
and rewritten text pairs serving as a training corpus to train the \
machine processing model, thereby preparing a machine processing model \
suitable for text error correction. Through extraction of corrected and \
rewritten text pairs from a log, the machine processing model can be \
trained and thus made fit for text correction by inputting the first text \
into the machine processing model to get the second text, that is the \
error correction result text. In addition, the language model or the \
common lexicon can be used to determine whether the first text needs to \
be corrected. The training corpus extracted from a log can be used to \
train the language model, or the common lexicon can be sorted by \
segmenting and counting text in the log. This is how to easily implement \
text error correction.""")

In [None]:
# inspect Ch-to-En human translation #2 of sentence from patent summary
reference_human2_sentence = """It can use the practice language material gathered 
from the diary or daily journal to train the language model, and it can
also initialize the common vocabulary bank through the segmentation and
analysis of the diary or daily journal text."""

# inspect Ch-to-En human translation #2 of full patent summary
reference_human2_summary = ("""This invention makes public a machine processing and
text error correction method and hardware, computing equipment and storage 
medium, and specifically pairs error text with the corresponding corrected 
and modified correct text. It uses this text pair as training material for 
the machine processing model, and from there prepares the machine processing
model that is applied to the text correction. It can train the machine processing
model using a diary or daily journal and make it suitable for text correction.
The first text version is inputted into the machine processing model to get 
the second text version, which is the corrected text. Additionally, it can 
also use a stored language model or common vocabulary bank to determine if 
the first text version needs correction. It can use the practice language 
material gathered from the diary or daily journal to train the language model,
and it can also initialize the common vocabulary bank through the segmentation
and analysis of the diary or daily journal text. Through all this, text 
correction is conveniently implemented.""")

Finally, we source "candidate" machine translations from two separate machine translation systems: Google Translate and the translation tool of the World Intellectual Property Organization (WIPO).

In [None]:
# inspect machine translation by Google Translate of sentence from summary
candidate_google_sentence = """The language model can be trained using the training 
corpus extracted from the log, or the common lexicon can be organized by 
segmenting and counting the text in the log."""

In [24]:
# inspect machine translation by Google Translate of full summary
candidate_google_summary = """The invention discloses a machine processing and
text error correction method and device, a computing device and a
storage medium, and particularly comprises an error correction
rewriting pair of an error text and a corresponding correct text, and
an error correction rewriting pair as a training corpus, and a machine
processing model. Training is performed, thereby preparing a machine
processing model suitable for text correction. The machine processing
model can be trained to mine the error correction by mining the error
correction rewrite pair from the log. The first text is input into the
machine processing model to obtain a second text, that is, an error
correction result text. In addition, you can use the language model or
common lexicon to determine whether the first text needs to be corrected.
The language model can be trained using the training corpus extracted
from the log, or the common lexicon can be organized by segmenting and
counting the text in the log. Thereby, text correction is facilitated."""

In [27]:
x = textwrap.wrap(candidate_google_summary, width=79)
x

In [None]:
# inspect machine translation by WIPO of sentence from summary
candidate_wipo_sentence = ("""The training corpus extracted from the log can be 
used to train the language model and also, through text segmentation and 
statistical analysis of text in the log compile a lexicon of commonly 
used words.""")

In [None]:
# inspect machine translation by WIPO of full summary
candidate_wipo_summary = ("""The present invention discloses a machine processing and text correction method and device, 
computing equipment and a storage medium. Specifically comprising corrected and rewritten text pairs of incorrect 
text and corresponding correct text, the corrected and rewritten text pairs serving as a training corpus for training
a machine processing model, and in this way developing a machine processing model for use in text correction. 
Through extraction of corrected and rewritten text pairs from a log, the machine processing model can be trained 
and thus made fit for text correction by inputting a first text into the machine processing model to obtain a second
text i.e. a corrected text result. Moreover, a language model or a lexicon of commonly used words can be used to 
assess whether text needs correction. The training corpus extracted from the log can be used to train the language 
model and also, through text segmentation and statistical analysis of text in the log compile a lexicon of commonly 
used words. Thus, text correction can be made easier and more convenient.""")

#### Pre-process data

In [None]:
def tokenize(dictionary):
    """Convert a dictionary of texts to a list of lists of tokens.

    Texts are split on whitespace, so punctuation stays attached to words.
    """
    return [text.split() for text in dictionary.values()]

In [None]:
# organizes references and candidates in dictionaries
references_dict = {'reference_sentence_1': reference_human1_sentence,
                   'reference_sentence_2': reference_human2_sentence}

candidates_dict = {'candidate_sentence_1': candidate_google_sentence,
                   'candidate_sentence_2': candidate_wipo_sentence}

# tokenizes translations using helper function
references_list = tokenize(references_dict)
candidates_list = tokenize(candidates_dict)

In [None]:
# this returns a list of tuples containing unigrams
uni_grams = list(ngrams(candidates_list[0], 1))[0:3]
print(f"Unigram examples from Google's translation: {uni_grams}")

In [None]:
# this returns a list of tuples containing bigrams
bi_grams = list(ngrams(candidates_list[0], 2))[0:3]
print(f"Bi-gram examples from Google's translation: {bi_grams}")

In [None]:
# this returns a list of tuples containing trigrams
tri_grams = list(ngrams(candidates_list[0], 3))[0:3]
print(f"Tri-gram examples from Google's translation: {tri_grams}")

In [None]:
# this returns a list of tuples containing 4-grams
four_grams = list(ngrams(candidates_list[0], 4))[0:3]
print(f"4-gram examples from Google's translation: {four_grams}")

#### Sentence-level scores

In [8]:
# modified unigram precision (the basis of BLEU-1) for Google translation
bleu1mod_google = modified_precision([references_list[0]], candidates_list[0], n=1)
print(float(bleu1mod_google))

In [None]:
# modified bigram precision (used in BLEU-2) for Google translation
bleu2mod = modified_precision([references_list[0]], candidates_list[0], n=2)
print(float(bleu2mod))

# modified trigram precision (used in BLEU-3) for Google translation
bleu3mod = modified_precision([references_list[0]], candidates_list[0], n=3)
print(float(bleu3mod))

# modified 4-gram precision (used in BLEU-4) for Google translation
bleu4mod = modified_precision([references_list[0]], candidates_list[0], n=4)
print(float(bleu4mod))

# BLEU-4 for Google translation, which is the default for sentence_bleu
bleu_google = sentence_bleu([references_list[0]], candidates_list[0])
print(bleu_google)

In [None]:
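# .35 BLEU score for WIPO's translation, using human translation #1 as the only reference
bleu_wipo = sentence_bleu([references_list[0]], candidates_list[1])
bleu_wipo

In [None]:
# BLEU score for Google's translation, using both human references
bleu_google_2refs = sentence_bleu([references_list[0], references_list[1]], candidates_list[0])
bleu_google_2refs

In [None]:
# BLEU score for WIPO's translation, using both human references
bleu_wipo_2refs = sentence_bleu([references_list[0], references_list[1]], candidates_list[1])
bleu_wipo_2refs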

In [None]:
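# 4-grams in human translation #1
ref_four_grams = list(ngrams(references_list[0], 4))
ref_four_grams, len(ref_four_grams)

# 4-grams in Google's translation
google_four_grams = list(ngrams(candidates_list[0], 4))
google_four_grams, len(google_four_grams)

In [None]:
# checks whether the two translations contain exactly the same set of 4-grams
set(ref_four_grams) == set(google_four_grams)

In [None]:
# 4-grams shared by Google's translation and human translation #1
set(google_four_grams) & set(ref_four_grams)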

By default, the functions in NLTK's `bleu_score` module calculate BLEU-4, a score based on the overlap of n-grams of up to four words. BLEU can be summarized as “the geometric mean of the test corpus’ modified precision scores times an exponential brevity penalty factor”.
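The quoted definition, a geometric mean of modified precisions times a brevity penalty, can be written out in a few lines. The sketch below is a simplified, unsmoothed illustration using only the standard library. The function name `bleu_from_precisions` is hypothetical and it is not NLTK's implementation, which differs in how it picks the reference length and handles zero precisions.

```python
import math


def bleu_from_precisions(precisions, candidate_len, reference_len, weights=None):
    """Combine modified n-gram precisions into an unsmoothed BLEU score.

    precisions: modified precisions p_1..p_N (for BLEU-4, N = 4)
    candidate_len, reference_len: token counts used for the brevity penalty
    weights: one weight per precision; defaults to uniform weights 1/N
    """
    if weights is None:
        weights = [1 / len(precisions)] * len(precisions)

    # Without smoothing, a single zero precision drives the geometric mean to zero.
    if candidate_len == 0 or min(precisions) <= 0:
        return 0.0

    # Brevity penalty: 1 if the candidate is longer than the reference,
    # otherwise exp(1 - r/c), which shrinks as the candidate gets shorter.
    if candidate_len > reference_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - reference_len / candidate_len)

    # Weighted geometric mean, computed in log space.
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty * math.exp(log_mean)


# Illustrative numbers only, not scores from the patent translations.
print(bleu_from_precisions([0.8, 0.6, 0.4, 0.3], candidate_len=28, reference_len=30))
```

Because the mean is geometric rather than arithmetic, a translation with no matching 4-grams scores zero regardless of how well its unigrams match. This is one reason sentence-level BLEU is usually computed with a smoothing function.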

In [None]:
# converts Google's summary translation to a list of sentences
# (sent_tokenize needs NLTK's Punkt tokenizer data to be downloaded)
google_sentences = sent_tokenize(candidate_google_summary)
print(google_sentences[5])

In [None]:
google_sentences[0], len(google_sentences), type(google_sentences)

In [None]:
reference_human1_summary

In [13]:
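# wraps human translation #1 of the summary into lines for display
wrapped_human1 = textwrap.wrap(reference_human1_summary)
wrapped_human1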
