# Measuring Multilingual Machines
#### BLEU Scores and Cross-Lingual Machine Learning

#### by Lee Mackey

This notebook accompanies the Medium article titled "Measuring Multilingual Machines".

In [1]:
# Import modules

import sacrebleu
from nltk.translate.bleu_score import (sentence_bleu, corpus_bleu, 
                                       modified_precision, 
                                       SmoothingFunction)

from nltk import (bigrams, trigrams, ngrams, sent_tokenize, word_tokenize)
from multilingual_machines import split_tokens, clean_punctuation
import string
import json

Does your machine learn in Chinese? 您和您的機器學習中文嗎? I don’t speak a word of Mandarin or Cantonese, so Google Translate gets all the credit, good or bad, for the preceding sentence. But how might a researcher quickly evaluate the quality of this machine translation? This question captures the basic challenge that gave rise to the BLEU metric. BLEU, which stands for bilingual evaluation understudy, is the default measure of machine translation quality and is also sometimes applied to more general cross-lingual tasks in natural language processing (NLP). The metric is well established in machine translation, but some analysts question its application to NLP tasks beyond the purpose for which the algorithm was originally developed. This article explores these issues by briefly discussing the basic lessons and limits of BLEU, using examples drawn from the multilingual space of global patent documents.

### Basics of BLEU

Researchers at IBM developed the BLEU algorithm in 2002 as an efficient method to evaluate machine translation tasks that would otherwise require human evaluators. The original paper by the developers, Papineni and colleagues, is a good place to start if you’re interested in the founding milieu and details of the algorithm [1]. BLEU is an adjusted measure of the precision of the overlap of word sequences between a “candidate” machine translation and one or more “reference” human translations. Taking a single machine-translated sentence as the unit, BLEU counts how often each word sequence, or n-gram, occurs in the candidate and clips that count at the maximum number of times the same n-gram occurs in any one human-translated reference sentence. The clipped counts of all n-grams in the sentence are summed, and the sum is divided by the total (unclipped) number of n-grams in the candidate text.
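To make the clipping step concrete, here is a minimal sketch in plain Python. The helper names `ngram_counts` and `clipped_precision` are my own and are not part of NLTK, and the toy sentences are the familiar degenerate case in which a candidate simply repeats one word.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(references, candidate, n):
    """Return (clipped matches, total candidate n-grams) for one candidate."""
    cand = ngram_counts(candidate, n)
    # For each n-gram, keep the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as it appears in a reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped, sum(cand.values())

candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]

matches, total = clipped_precision(references, candidate, 1)
print(matches, total)  # 2 7
```

Without clipping, all seven occurrences of "the" would count as matches and the unigram precision would be a perfect 7/7. With clipping, the candidate is credited only twice, because no single reference contains "the" more than twice.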

The BLEU score is a number between 0 and 1, where 0 represents the complete absence of n-gram overlap between candidate and reference texts, and 1 indicates a machine translation that is identical to one of the reference texts. While this example considers BLEU at the level of the sentence, the actual evaluation of a machine translation is calculated over an entire corpus: the clipped n-gram counts are summed across all candidate sentences rather than averaged sentence by sentence, and the aggregate is multiplied by a brevity penalty so that candidates shorter than their references cannot score well simply by being terse. If you’re interested in learning more, you might learn to calculate BLEU in a fifteen-minute video by Andrew Ng of DeepLearning.AI. To make learning BLEU more tangible, I begin with machine translations and human translations of Chinese text.
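For reference, the corpus-level score can be written compactly in the notation of the original paper [1]. Here $p_n$ is the modified n-gram precision, $w_n$ are the weights (uniform, $w_n = 1/N$ with $N = 4$ in the standard setting), $c$ is the total length of the candidate corpus and $r$ is the effective reference length.

```latex
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\qquad
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right)
```

The exponential of the weighted sum of logarithms is a geometric mean, so a single n-gram order with no matches drives the whole score to zero unless some form of smoothing is applied.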

### Applications of BLEU using Patent Texts
A growing share of patents in the machine learning space are originally written and filed in Chinese, according to a recent report by WIPO, the global organization governing patents. To explore the basics of BLEU in a more tangible manner, we begin with the international filing of a Chinese-language patent by the e-commerce company Alibaba for an invention related to natural language processing.

In [2]:
# read example data from 'patent_examples.txt' file (JSON)
with open('patent_examples.txt') as f:
    data = json.load(f)
    
# for more details on the example patent and the data source:

# paste into browser to inspect patent at Chinese version of WIPO Patentscope GUI:   
# https://patentscope.wipo.int/search/zh/detail.jsf?docId=WO2019085779
    
# paste into browser to inspect patent at English version of WIPO Patentscope GUI:
# https://patentscope.wipo.int/search/en/detail.jsf?docId=WO2019085779

A sentence from the Chinese language abstract section of the original patent is displayed below.

In [3]:
# inspect sentence from summary of patent in original Chinese
print(data['original_sentence_cn'])

['可以使用从日志中挖掘出的训练语料来训练语言模型，也可以通过对日志中的文本进行分词、统计来整理常用词库.']


Human translators often produce translations of equivalent quality that nonetheless differ in structure or word choice. BLEU was therefore designed to accept one or more human reference translations. Next, we obtain Chinese-to-English "reference" translations from two human translators via the platform Gengo. The translations are below.

In [4]:
# inspect Ch-to-En human translation #1 of sentence from patent summary
reference_human1_sentence = data['reference_human1_sentence']
print(reference_human1_sentence)

['The training corpus extracted from a log can be used to train the language model, or the common lexicon can be sorted by segmenting and counting text in the log.']


In [5]:
# inspect Ch-to-En human translation #1 of full patent summary
reference_human1_summary = data['reference_human1_summary']
print(reference_human1_summary)

['The invention discloses a machine processing and text error correction method and device, a computing device, and a storage medium, specifically comprising corrected and rewritten text pairs of incorrect text and corresponding correct text.', 'The corrected and rewritten text pairs serving as a training corpus to train the machine processing model, thereby preparing a machine processing model suitable for text error correction.', 'Through extraction of corrected and rewritten text pairs from a log, the machine processing model can be trained and thus made fit for text correction by inputting the first text into the machine processing model to get the second text, that is the error correction result text.', 'In addition, the language model or the common lexicon can be used to determine whether the first text needs to be corrected.', 'The training corpus extracted from a log can be used to train the language model, or the common lexicon can be sorted by segmenting and counting text in 

In [6]:
# inspect Ch-to-En human translation #2 of sentence from patent summary
reference_human2_sentence = data['reference_human2_sentence']
print(reference_human2_sentence)

['It can use the practice language material gathered from the diary or daily journal to train the language model, and it can also initialize the common vocabulary bank through the segmentation and analysis of the diary or daily journal text']


In [7]:
# inspect Ch-to-En human translation #2 of full patent summary
reference_human2_summary = data['reference_human2_summary']
print(reference_human2_summary)

['This invention makes public a machine processing and text error correction method and hardware, computing equipment and storage medium, and specifically pairs error text with the corresponding corrected and modified correct text.', 'It uses this text pair as training material for the machine processing model, and from there prepares the machine processing model that is applied to the text correction.', 'It can train the machine processing model using a diary or daily journal and make it suitable for text correction.', 'The first text version is inputted into the machine processing model to get the second text version, which is the corrected text.', 'Additionally, it can also use a stored language model or common vocabulary bank to determine if the first text version needs correction.', 'It can use the practice language material gathered from the diary or daily journal to train the language model, and it can also initialize the common vocabulary bank through the segmentation and analy

Finally, we source "candidate" machine translations from two separate machine translation systems: Google Translate and the machine translation provided by the World Intellectual Property Organization (WIPO).

In [8]:
# inspect machine translation by Google Translate of full summary
candidate_google_summary = data['candidate_google_summary']
print(candidate_google_summary)

['The invention discloses a machine processing and text error correction method and device, a computing device and a storage medium, and particularly comprises an error correction rewriting pair of an error text and a corresponding correct text, and an error correction rewriting pair as a training corpus, and a machine processing model.', 'Training is performed, thereby preparing a machine processing model suitable for text correction. The machine processing model can be trained to mine the error correction by mining the error correction rewrite pair from the log.', 'The first text is input into the machine processing model to obtain a second text, that is, an error correction result text.', 'In addition, you can use the language model or common lexicon to determine whether the first text needs to be corrected.', ' The language model can be trained using the training corpus extracted from the log, or the common lexicon can be organized by segmenting and counting the text in the log.', 

In [9]:
# inspect machine translation by Google Translate of sentence from summary
candidate_google_sentence = data['candidate_google_sentence']
print(candidate_google_sentence)

['The language model can be trained using the training corpus extracted from the log, or the common lexicon can be organized by segmenting and counting the text in the log.']


In [10]:
# inspect machine translation by WIPO of full summary
candidate_wipo_summary = data['candidate_wipo_summary']
print(candidate_wipo_summary)

['The present invention discloses a machine processing and text correction method and device, computing equipment and a storage medium.', 'Specifically comprising corrected and rewritten text pairs of incorrect text and corresponding correct text, the corrected and rewritten text pairs serving as a training corpus for training a machine processing model, and in this way developing a machine processing model for use in text correction.', 'Through extraction of corrected and rewritten text pairs from a log, the machine processing model can be trained and thus made fit for text correction by inputting a first text into the machine processing model to obtain a second text i.e. a corrected text result.', 'Moreover, a language model or a lexicon of commonly used words can be used to assess whether text needs correction. The training corpus extracted from the log can be used to train the language model and also, through text segmentation and statistical analysis of text in the log compile a l

In [11]:
# inspect machine translation by WIPO of sentence from summary
candidate_wipo_sentence = data['candidate_wipo_sentence']
print(candidate_wipo_sentence)

['The training corpus extracted from the log can be used to train the language model and also, through text segmentation and statistical analysis of text in the log compile a lexicon of commonly used words.']


#### Pre-process data

In [12]:
reference_human1_summary

['The invention discloses a machine processing and text error correction method and device, a computing device, and a storage medium, specifically comprising corrected and rewritten text pairs of incorrect text and corresponding correct text.',
 'The corrected and rewritten text pairs serving as a training corpus to train the machine processing model, thereby preparing a machine processing model suitable for text error correction.',
 'Through extraction of corrected and rewritten text pairs from a log, the machine processing model can be trained and thus made fit for text correction by inputting the first text into the machine processing model to get the second text, that is the error correction result text.',
 'In addition, the language model or the common lexicon can be used to determine whether the first text needs to be corrected.',
 'The training corpus extracted from a log can be used to train the language model, or the common lexicon can be sorted by segmenting and counting text

In [113]:
# For summary-level example
# split sentences into tokens
tokens_ref_human1_summary = split_tokens(reference_human1_summary)
tokens_ref_human2_summary = split_tokens(reference_human2_summary)

# clean punctuation from tokens
tokens_ref_human1_summary = clean_punctuation(tokens_ref_human1_summary)
tokens_ref_human2_summary = clean_punctuation(tokens_ref_human2_summary)

In [None]:
# split sentences into tokens for human 2 translation of summary
tokens_ref_human2_summary = split_tokens(reference_human2_summary)

# clean punctuation from tokens
tokens_ref_human2_summary = clean_punctuation(tokens_ref_human2_summary)

In [136]:
# For sentence-level example
# split sentences into tokens
tokens_ref_human1_sentence = split_tokens(reference_human1_sentence)
tokens_ref_human2_sentence = split_tokens(reference_human2_sentence)

# clean punctuation from tokens
tokens_ref_human1_sentence = clean_punctuation(tokens_ref_human1_sentence)
tokens_ref_human2_sentence = clean_punctuation(tokens_ref_human2_sentence)

In [31]:
# split sentences into tokens
tokens_candidate_google_summary = split_tokens(candidate_google_summary)

# clean punctuation from tokens
tokens_candidate_google_summary = clean_punctuation(tokens_candidate_google_summary)

In [34]:
# split sentences into tokens
tokens_candidate_google_sentence = split_tokens(candidate_google_sentence)

# clean punctuation from tokens
tokens_candidate_google_sentence = clean_punctuation(tokens_candidate_google_sentence)

In [92]:
# split sentences into tokens
tokens_candidate_wipo_sentence = split_tokens(candidate_wipo_sentence)

# clean punctuation from tokens
tokens_candidate_wipo_sentence = clean_punctuation(tokens_candidate_wipo_sentence)

In [32]:
# Inspect n-grams

# return list of bi-grams
bi_grams = list(ngrams(tokens_ref_human1_summary[0], 2))[0:5]

# returns list of four-grams
four_grams = list(ngrams(tokens_ref_human1_summary[0], 4))[0:5]

print(f"bi-grams: {bi_grams}")
print(f"four-grams: {four_grams}")

bi-grams: [('The', 'invention'), ('invention', 'discloses'), ('discloses', 'a'), ('a', 'machine'), ('machine', 'processing')]
four-grams: [('The', 'invention', 'discloses', 'a'), ('invention', 'discloses', 'a', 'machine'), ('discloses', 'a', 'machine', 'processing'), ('a', 'machine', 'processing', 'and'), ('machine', 'processing', 'and', 'text')]


#### Calculate BLEU

We calculate BLEU scores using the nltk.translate.bleu_score implementation from the Natural Language Toolkit (NLTK) package. The sentence_bleu and corpus_bleu functions return the full BLEU score, while the modified_precision function returns only the clipped n-gram precision for a single n-gram order, which is one component of that score. After preparing the translation texts via standard NLP pre-processing to represent the texts as word tokens, we pass the references for each candidate as a list of lists of tokens. Following the original paper, the n-gram matches are computed for each candidate sentence and the clipped n-gram counts are added across all the candidate sentences. The sum is then divided by the number of candidate n-grams in the candidate corpus to give a modified precision score for the candidate corpus.

What's the result? Scored against the first human reference alone, the BLEU-4 score of the candidate sentence from Google Translate is .437 and the score of the candidate sentence from WIPO is .410. Scored against both human references, the score of Google is .452 and the score of WIPO is .484. If we conduct the same evaluation on the entire abstract text, the score of Google is .XXX, and the score of WIPO is .XXX. With this basic example of the application of BLEU in mind, we can now discuss some of the potential limits of the application of BLEU to machine translation and natural language processing tasks more generally.

In [149]:
# BLEU-4 example
bleu4_example = corpus_bleu([tokens_ref_human1_sentence], tokens_ref_human1_sentence)
print(float(bleu4_example))

1.0


In [150]:
# BLEU-4 for Google translation with one reference using corpus_bleu
bleu4_google_corpusbleu_human1 = corpus_bleu([tokens_ref_human1_sentence], tokens_candidate_google_sentence)
print(float(bleu4_google_corpusbleu_human1))

0.4370614964591188
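As a check on the mechanics, the single-reference score above can be recomputed by hand. The sketch below is a plain-Python approximation of unsmoothed BLEU-4 for one candidate and one reference. The function names are hypothetical, and the token lists are typed in from the pre-processed sentences printed in this notebook rather than loaded from the data file.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_single_reference(reference, candidate, max_n=4):
    """Unsmoothed BLEU for one candidate and one reference, uniform weights."""
    log_precisions = []
    for n in range(1, max_n + 1):
        ref, cand = ngram_counts(reference, n), ngram_counts(candidate, n)
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if clipped == 0:
            return 0.0  # the geometric mean is zero without smoothing
        log_precisions.append(math.log(clipped / total))
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

# Human translation #1 (reference) and Google Translate (candidate), punctuation removed
reference = ("The training corpus extracted from a log can be used to train the "
             "language model or the common lexicon can be sorted by segmenting "
             "and counting text in the log").split()
candidate = ("The language model can be trained using the training corpus "
             "extracted from the log or the common lexicon can be organized by "
             "segmenting and counting the text in the log").split()

score = bleu_single_reference(reference, candidate)
print(round(score, 4))  # 0.4371
```

The four modified precisions work out to 25/30, 16/29, 10/28 and 6/27. Both sentences have 30 tokens, so the brevity penalty is 1 and the score is simply the geometric mean of those four fractions.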


In [151]:
# BLEU-4 for Google translation with second reference using corpus_bleu
bleu4_google_corpusbleu_human2 = corpus_bleu([tokens_ref_human2_sentence], tokens_candidate_google_sentence)
print(float(bleu4_google_corpusbleu_human2))

5.010025055942425e-155
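The vanishingly small score above is what an unsmoothed geometric mean produces when at least one n-gram order has no matches at all, which appears to be the case for the higher-order n-grams of this candidate and the second reference. The sketch below illustrates the effect with hypothetical counts that are not taken from the patent data. The additive epsilon of 0.1 applied only to empty counts is my understanding of what NLTK's `SmoothingFunction().method1` does by default, so treat that value as an assumption.

```python
import math

def geometric_mean_precision(matches, totals, epsilon=None):
    """Geometric mean of n-gram precisions, optionally smoothing zero counts."""
    logs = []
    for m, t in zip(matches, totals):
        if m == 0:
            if epsilon is None:
                return 0.0  # log(0) is undefined, so the score collapses
            m = epsilon     # additive smoothing applied only to empty counts
        logs.append(math.log(m / t))
    return math.exp(sum(logs) / len(logs))

# Hypothetical clipped matches and candidate n-gram totals for n = 1..4
matches = [14, 3, 0, 0]
totals = [30, 29, 28, 27]

unsmoothed = geometric_mean_precision(matches, totals)
smoothed = geometric_mean_precision(matches, totals, epsilon=0.1)
print(unsmoothed, smoothed)
```

As far as I can tell, NLTK substitutes a tiny positive number for a zero count instead of returning exactly zero, which would explain a result on the order of 1e-155 rather than 0.0. Either way, the practical reading is the same: the candidate and reference share no long n-grams, and a smoothing function is needed to get an informative sentence-level score.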


In [155]:
# BLEU-4 for Google translation with two references using corpus_bleu
bleu4_google_corpusbleusm_2refs = corpus_bleu([[tokens_ref_human1_sentence[0], tokens_ref_human2_sentence[0]]], 
                                        tokens_candidate_google_sentence)
print(float(bleu4_google_corpusbleusm_2refs))

0.4523563820810908


In [154]:
# BLEU-4 for Google translation with two references using sentence_bleu
bleu4_google_sentencebleu = sentence_bleu([tokens_ref_human1_sentence[0], tokens_ref_human2_sentence[0]], tokens_candidate_google_sentence[0])
print(float(bleu4_google_sentencebleu))

0.4523563820810908

In [144]:
# BLEU-4 for Google translation with smoothing function
chencherry = SmoothingFunction()
bleu4_google_sentencebleu = sentence_bleu(tokens_ref_human1_sentence, tokens_candidate_google_sentence[0], smoothing_function=chencherry.method1)
print(float(bleu4_google_sentencebleu))

0.4370614964591188


In [145]:
# BLEU-4 for WIPO translation with one reference using corpus_bleu
bleu4_wipo_corpusbleu_human1 = corpus_bleu([tokens_ref_human1_sentence], tokens_candidate_wipo_sentence)
print(float(bleu4_wipo_corpusbleu_human1))

0.4103757636936567


In [146]:
# BLEU-4 for WIPO translation with two references using corpus_bleu
bleu4_wipo_corpusbleusm_2refs = corpus_bleu([[tokens_ref_human1_sentence[0], tokens_ref_human2_sentence[0]]], 
                                        tokens_candidate_wipo_sentence)
print(float(bleu4_wipo_corpusbleusm_2refs))

0.48435192067731586


#### Multiple references
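With more than one reference, two things change. The clipping step uses the largest count of each n-gram in any single reference, and the brevity penalty compares the candidate length with the reference length closest to it. The sketch below covers the second point. The function name is my own, and the tie-breaking rule (prefer the shorter reference) reflects my understanding of how NLTK picks the length, so treat it as an assumption.

```python
def closest_reference_length(references, candidate_length):
    """Pick the reference length nearest to the candidate (ties go to the shorter)."""
    lengths = (len(ref) for ref in references)
    return min(lengths, key=lambda r: (abs(r - candidate_length), r))

# Placeholder references of 30 and 40 tokens; only their lengths matter here
references = [["w"] * 30, ["w"] * 40]

for candidate_length in (33, 35, 38):
    print(candidate_length, closest_reference_length(references, candidate_length))
```

A candidate of 33 tokens is compared with the 30-token reference and a candidate of 38 tokens with the 40-token reference, while the tie at 35 tokens resolves to the shorter reference under this rule.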

In [87]:
# Sanity check: a reference scored against itself with sentence_bleu (BLEU-4 by default)
chencherry = SmoothingFunction()
bleu4_selfcheck = sentence_bleu(tokens_ref_human1_sentence, tokens_ref_human1_sentence[0], smoothing_function=chencherry.method1)
print(float(bleu4_selfcheck))

1.0


In [85]:
tokens_ref_human1_sentence[0]

['The',
 'training',
 'corpus',
 'extracted',
 'from',
 'a',
 'log',
 'can',
 'be',
 'used',
 'to',
 'train',
 'the',
 'language',
 'model',
 'or',
 'the',
 'common',
 'lexicon',
 'can',
 'be',
 'sorted',
 'by',
 'segmenting',
 'and',
 'counting',
 'text',
 'in',
 'the',
 'log']

In [75]:
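# Modified bigram precision for Google translation (one component of BLEU, not a full BLEU-2 score)
# the candidate must be a flat list of tokens, not a list of sentences
bleu2_google = modified_precision(tokens_ref_human1_sentence, tokens_candidate_google_sentence[0], n=2)
print(round(float(bleu2_google), 4))

0.5517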


In [68]:
print(tokens_ref_human1_sentence)

[['The', 'training', 'corpus', 'extracted', 'from', 'a', 'log', 'can', 'be', 'used', 'to', 'train', 'the', 'language', 'model', 'or', 'the', 'common', 'lexicon', 'can', 'be', 'sorted', 'by', 'segmenting', 'and', 'counting', 'text', 'in', 'the', 'log']]


In [66]:
print(tokens_candidate_google_sentence)

[['The', 'language', 'model', 'can', 'be', 'trained', 'using', 'the', 'training', 'corpus', 'extracted', 'from', 'the', 'log', 'or', 'the', 'common', 'lexicon', 'can', 'be', 'organized', 'by', 'segmenting', 'and', 'counting', 'the', 'text', 'in', 'the', 'log']]


In [None]:
# Modified 3-gram precision for Google translation
bleu3mod = modified_precision(tokens_ref_human1_sentence, tokens_candidate_google_sentence[0], n=3)
print(float(bleu3mod))

# Modified 4-gram precision for Google translation
# (n is a required argument; modified_precision has no default n-gram order)
bleu4mod = modified_precision(tokens_ref_human1_sentence, tokens_candidate_google_sentence[0], n=4)
print(float(bleu4mod))

In [None]:
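# BLEU-4 for WIPO's translation with one reference
bleu_wipo = sentence_bleu(tokens_ref_human1_sentence, tokens_candidate_wipo_sentence[0])
bleu_wipo

In [None]:
# BLEU-4 for Google's translation with two references
bleu_google_2refs = sentence_bleu([tokens_ref_human1_sentence[0], tokens_ref_human2_sentence[0]], tokens_candidate_google_sentence[0])
bleu_google_2refs

In [None]:
# BLEU-4 for WIPO's translation with two references
bleu_wipo_2refs = sentence_bleu([tokens_ref_human1_sentence[0], tokens_ref_human2_sentence[0]], tokens_candidate_wipo_sentence[0])
bleu_wipo_2refs

In [None]:
# modified bigram precision for Google's translation
bleu2mod = modified_precision(tokens_ref_human1_sentence, tokens_candidate_google_sentence[0], n=2)
print(bleu2mod)

In [None]:
# modified 3-gram precision for Google's translation
bleu3mod = modified_precision(tokens_ref_human1_sentence, tokens_candidate_google_sentence[0], n=3)
bleu3mod

In [None]:
# modified 4-gram precision for Google's translation
bleu4mod = modified_precision(tokens_ref_human1_sentence, tokens_candidate_google_sentence[0], n=4)
bleu4mod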

In [None]:
# by default, sentence_bleu and corpus_bleu calculate BLEU-4, which is a score for the overlap of n-grams up to 4-grams

In [None]:
"the geometric mean of the test corpus’ modified precision scores times an exponential brevity penalty factor"

In [None]:
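# calculate BLEU score of candidate Google summary
# assumption: sentence boundaries differ across the translations, so each
# summary is flattened and scored as a single segment
flatten = lambda sentences: [token for sentence in sentences for token in sentence]
refs_summary = [[flatten(tokens_ref_human1_summary), flatten(tokens_ref_human2_summary)]]
bleu_google_summary = corpus_bleu(refs_summary, [flatten(tokens_candidate_google_summary)])

In [None]:
# calculate BLEU score of candidate WIPO summary
tokens_candidate_wipo_summary = clean_punctuation(split_tokens(candidate_wipo_summary))
bleu_wipo_summary = corpus_bleu(refs_summary, [flatten(tokens_candidate_wipo_summary)])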