## BLEU

**From KantanMT**: BLEU Score is quick to use, inexpensive to operate, language independent, and correlates highly with human evaluation. It is the most widely used automated method of determining the quality of machine translation. 

The BLEU metric scores a translation on a scale of 0 to 1, but is frequently displayed as a percentage value. The closer to 1, the more the translation correlates to a human translation. The BLEU score is the proportion of words that appear in MT which also exist in the human translation (the "golden reference"). Put simply, the BLEU metric measures how many words overlap in a given translation when compared to a reference translation, giving higher scores to sequential words.

What is a good BLEU Score? BLEU scores range from 0-100%. A score less than 15% means that your machine translation engine is not performing optimally and a high level of post-editing will be required to finalise your translations and reach publishable quality.

A score greater than 50% is a very good score and significantly less post-editing will be required to achieve publishable translation quality.

Improving BLEU Score. There is a high correlation between the number of words used in training a machine translation engine and its BLEU score. Put simply, the more training data that is uploaded, the better the BLEU score and consequently the generated translations.

BLEU has some important limitations that are described by Rachael Tatman [here](https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213).

*   It doesn't consider meaning
*   It doesn't directly consider sentence structure
*   It doesn't handle morphologically rich languages well
*   It doesn't map well to human judgements

Only use BLEU if:

*   You're doing machine translation AND
*   You're evaluating across an entire corpus AND
*   You know the limitations of the metric and you're prepared to accept them.

## METEOR

From [RWS Moravia](https://www.rws.com/insights/rws-moravia-blog/interview-with-an-expert-how-do-you-measure-mt/):

Then there is a metric called METEOR. METEOR's algorithm is more nuanced because not only does it compare MT and the human reference in both directions, it also takes into account things like linguistics. While BLEU checks existing words exactly as they appear, METEOR considers some linguistic variants. In English, "ride" or "riding" would count as two different words for the BLEU score. But for METEOR, it would count as a single word because they have the same root.

That's why we generally use METEOR in more cases than we use BLEU. These nuances can affect the accuracy of the quality measurement.


## Procedure

Collect the following:

1.  A source text file of at least a couple thousand words.
2.  The 'candidate' raw machine translation of the source text
3.  A 'reference' human translation of the source text

Requirements for the files to be tested:

*   Each file should be a text file, UTF-8 encoded.
*   We will perform our BLEU score evaluation using the Spanish language. BLEU score evaluation of Chinese is a little bit different and requires some other tools.
*   Each line should have only one sentence.
*   Each file should have exactly the same number of lines.
*   Each line should line up with the other lines in the other files.

Requirements for our Python environment:

Please make sure you have installed pandas and nltk in Python.

To install pandas:

`pip install pandas`

To install nltk:

`pip install nltk`

To install the nltk punctuation data:

`python -m nltk.downloader punkt`

## Importing necessary modules

We will need pathlib, pandas, and several nltk modules. Import them with these statements below:

In [1]:
from pathlib import Path
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

## Set paths of files

Tell Python where to find the files you'll be evaluating.

In [2]:
SOURCE = Path(r"source.txt")
CANDIDATE = Path(r"candidate.txt")
REFERENCE = Path(r"reference.txt")

## Set up a table for results

The result list stores rows which contain the results of the BLEU score evaluation. The column headers are indicated by COLUMNS.

In [17]:
result = list()
COLUMNS = ("Source",
           "Reference",
           "Candidate",
           "BLEU",
           "METEOR")

## Open the files

Now, open each file. You can open multiple files in one line by using the backslash to continue code onto the next line.

**Note**: For the purposes of this demonstration, I'm using Baidu as my MT candidate and Google as my reference translation. A real evaluation would use a human translation as the reference translation.

## Read the lines and tokenize the content

Read all the lines into a tuple called lines. Then, pass that tuple to a `zip` function. When you put an asterisk `*` in front of your function argument, it passes each item in the tuple as its own argument.

Before we evaluate the BLEU score, we need to tokenize the text. Essentially, tokenization intelligently splits the text into words.

## Get the BLEU score and METEOR score

We will be using `sentence_bleu` for our example. The first parameter is for reference texts. The first parameter expects a list of one or more lists of tokens. We've already tokenized our reference sentence as `ref_token`.

The next parameter is for the machine translation candidate. This parameter expects a list of tokens. We have tokenized our candidate as `can_token`.

Next, we will use a smoothing function to calculate a score even when there are no direct n-gram overlaps. For more details, look up `nltk` smoothing functions.

The `meteor_score` function can accept as little as two parameters: a list of reference sentences (not tokens!) and the candidate sentence (not tokens!)

Append each row to the `result` list of rows.

In [18]:
# Open the files
with open(SOURCE, encoding="UTF-8") as a, \
     open(CANDIDATE, encoding="UTF-8") as b, \
     open(REFERENCE, encoding="UTF-8") as c:
     
    # Read the lines and tokenize the content
    lines = a.readlines(), b.readlines(), c.readlines()
    for source, candidate, reference in zip(*lines):
        src_token = word_tokenize(source)
        ref_token = word_tokenize(reference)
        can_token = word_tokenize(candidate)
        
        # Get the BLEU score and METEOR score
        smoothing = SmoothingFunction().method1
        bleu = sentence_bleu([ref_token], can_token, smoothing_function=smoothing)
        meteor = meteor_score([reference], candidate)
        
        # Append to result list
        result.append([source, reference, candidate, bleu, meteor])

## Write the results

Outside the loop, convert our list into a dataframe with the specified columns and write it to Excel.

In [19]:
df = pd.DataFrame(result, columns=COLUMNS)
df.to_excel("BLEU.xlsx", index=False)
print(df.head(5))

                                              Source  \
0  At this stage you are using documented process...   
1  Ultimately, however, reports have limited inhe...   
2  Some organizations are just getting started an...   
3  “To sit back and to see a full year of data, t...   
4                                 What’s possible?\n   

                                           Reference  \
0  En esta etapa, está utilizando procesos docume...   
1  Sin embargo, en última instancia, los informes...   
2  Algunas organizaciones recién comienzan y nece...   
3  "Para sentarse y ver un año completo de datos,...   
4                                 ¿Qué es posible?\n   

                                           Candidate      BLEU    METEOR  
0  En esta etapa, usará un proceso de documentaci...  0.155455  0.456349  
1  Sin embargo, en última instancia, el significa...  0.320016  0.574743  
2  Algunas organizaciones son incipientes y neces...  0.409318  0.722388  
3  "Es impresionante senta

Open the file and view the results. Set the bleu score to a percentage. What do you think? Are these the scores you expected? What makes the scores higher or lower?

Without a pair of human eyes, numbers are just numbers. I recommend pairing a BLEU/METEOR evaluation done by a computer with a [TAUS Adequacy/Fluency](https://www.taus.net/academy/best-practices/evaluate-best-practices/adequacy-fluency-guidelines) evaluation done by a human.

![TAUS Adequacy/Fluency](screenshots\taus_adequacy_fluency.jpg)