# Practice session 5

## Understanding BLEU

In this practice session, we will learn about BLEU (bilingual evaluation understudy), the most popular automatic metric for evaluating machine translation quality. BLEU measures how close a hypothesis translation is to a reference translation made by a human. To understand BLEU, we will write, step by step, code that calculates it at the sentence level.

**Important disclaimer 1.** What we will implement is a somewhat simplified version of the BLEU score, as our purpose is to understand the general idea. Actual implementations will have some additional tricks in them.

**Important disclaimer 2.** We will calculate BLEU at the sentence level, but it is **not** actually meant to be used like that, because the score for a single translation-reference pair can be very misleading. BLEU should be used as a corpus-level metric.

### 1. Unigram precision

We will roughly follow the explanation from [Wikipedia](https://en.wikipedia.org/wiki/BLEU#Algorithm), but we will also implement the steps in code. For additional reading, check out this [article](https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213), which explains the same process quite simply and clearly (but notice that it has a different formula in step 4).

Let's start from the simplest idea. We can try to measure how close two sentences are by calculating single-word, or unigram, precision. To do this, we simply check what proportion of the words in the hypothesis translation also appear in the reference.

$P = m / w_t$, where $m$ is the number of words from the hypothesis translation that are found in the reference, and $w_t$ is the total number of words in the hypothesis.

We will look at some sentences which are already tokenized:

In [None]:
ref = "each of the kids ate an apple ."
hyp1 = "the kids ate an apple each ."
hyp2 = "the the the the ."
hyp3 = "each of the cats ate a potato ."

In [None]:
### YOUR CODE ###
# Split the sentences by spaces, calculate unigram precision
def unigram_precision(reference, hypothesis):
    pass

# Calculate unigram precision for the 3 hypotheses


We can immediately see that unigram precision doesn't do much for us: it is perfect (1.0) for "the the the the .", which is obviously not a good sentence.
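For comparison with your own attempt, here is one possible way to fill in `unigram_precision`. It is only a sketch: it assumes the sentences are already tokenized and separated by single spaces, and it assumes that an empty hypothesis should score 0 (the text above does not cover that case).

```python
def unigram_precision(reference, hypothesis):
    """Share of hypothesis tokens that also occur somewhere in the reference."""
    ref_tokens = set(reference.split())
    hyp_tokens = hypothesis.split()
    if not hyp_tokens:
        return 0.0  # assumption: an empty hypothesis gets precision 0
    # m: hypothesis tokens found in the reference (repeats are all counted)
    matches = sum(1 for token in hyp_tokens if token in ref_tokens)
    return matches / len(hyp_tokens)


ref = "each of the kids ate an apple ."
hypotheses = [
    "the kids ate an apple each .",
    "the the the the .",
    "each of the cats ate a potato .",
]
for hyp in hypotheses:
    print(f"{unigram_precision(ref, hyp):.3f}  {hyp}")
```

With the sentences above this gives 1.0 for the first two hypotheses and 0.625 (5 of 8 words) for the third.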

### 2. Clipped unigram precision

To mitigate this situation, we can limit the maximum number of matches we will count. For each distinct word in the hypothesis translation, let's calculate $m_{max}$: how many times it occurs in the reference translation. (We call it the maximum count because there may be several reference translations, and then we would take the maximum count of each word over these references.) Each word now contributes to $m$ its number of occurrences in the hypothesis if that number is at most $m_{max}$, and $m_{max}$ otherwise. $m$ is the sum of these clipped counts over all distinct words in the hypothesis.

In [None]:
### YOUR CODE ###
# Split the sentences by spaces, calculate clipped unigram precision
def unigram_precision_clipped(reference, hypothesis):
    # Calculate m_max for each word 
    # Calculate m for the hypothesis sentence, taking m_max into account
    # Calculate precision by dividing m by length of hypothesis
    pass

# Calculate clipped unigram precision for the 3 hypotheses


Now the second hypothesis should give us a score of 0.4. That's already a bit better.
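One possible way to fill in `unigram_precision_clipped` is sketched below. Using `collections.Counter` is a choice made for this sketch, not something the exercise requires, and it handles a single reference only. As before, returning 0 for an empty hypothesis is an assumption.

```python
from collections import Counter


def unigram_precision_clipped(reference, hypothesis):
    """Unigram precision where each word's matches are capped by its reference count."""
    ref_counts = Counter(reference.split())   # m_max for every word
    hyp_counts = Counter(hypothesis.split())
    total = sum(hyp_counts.values())          # length of the hypothesis
    if total == 0:
        return 0.0  # assumption: an empty hypothesis gets precision 0
    # A Counter returns 0 for unseen words, so unmatched words add nothing to m
    clipped = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    return clipped / total


ref = "each of the kids ate an apple ."
hypotheses = [
    "the kids ate an apple each .",
    "the the the the .",
    "each of the cats ate a potato .",
]
for hyp in hypotheses:
    print(f"{unigram_precision_clipped(ref, hyp):.3f}  {hyp}")
```

For "the the the the .", the four occurrences of "the" are clipped to 1 and "." counts once, so $m = 2$ and the precision is $2/5 = 0.4$.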

### 3. n-grams

But what if we looked at a hypothesis like "of kids apple an ate the each ."? (A neural MT model is unlikely to generate such Yoda-like sentences, but let's pretend it did.) It has all the same words as the reference, and exactly the same number of times, but this hypothesis is unintelligible. So, we can see that single words are not a good level to operate on.

Because of that, we will consider n-grams – sequences of consecutive words in the sentence that have a certain length. For example, bigrams of our reference sentence "each of the kids ate an apple ." are: "each of", "of the", "the kids", "kids ate", "ate an", "an apple" and "apple .".

Typically, BLEU takes into account unigrams, bigrams, trigrams and 4-grams. Shorter n-grams are responsible for the content, and longer ones for keeping the text readable and grammatical. But for the purposes of this exercise, let's limit ourselves to bigrams only. (If you feel like it, you can later add support for trigrams and 4-grams. Then you should calculate clipped precision for unigrams, bigrams, trigrams and 4-grams, and calculate their geometric mean.)
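For the optional extension, the geometric mean over n-gram orders could look roughly like the sketch below. The function names `extract_ngrams`, `ngram_precision_clipped` and `geometric_mean_precision` are made up for this illustration. The sketch also assumes that the mean is 0 as soon as any single precision is 0, which is what a plain geometric mean gives; actual BLEU implementations often smooth this case.

```python
import math
from collections import Counter


def extract_ngrams(sentence, n):
    """Count all n-grams (as tuples of tokens) in a space-tokenized sentence."""
    tokens = sentence.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision_clipped(reference, hypothesis, n):
    """Clipped precision for n-grams of a single order n."""
    ref_ngrams = extract_ngrams(reference, n)
    hyp_ngrams = extract_ngrams(hypothesis, n)
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0  # assumption: hypothesis shorter than n tokens scores 0
    clipped = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return clipped / total


def geometric_mean_precision(reference, hypothesis, max_n=4):
    """Geometric mean of the clipped precisions for orders 1..max_n."""
    precisions = [
        ngram_precision_clipped(reference, hypothesis, n)
        for n in range(1, max_n + 1)
    ]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; the unsmoothed mean is 0 here
    return math.exp(sum(math.log(p) for p in precisions) / max_n)


ref = "each of the kids ate an apple ."
print(geometric_mean_precision(ref, "the kids ate an apple each ."))
```

For this hypothesis the four precisions are $1$, $4/6$, $3/5$ and $2/4$, whose product is $0.2$, so the geometric mean is $0.2^{1/4} \approx 0.67$.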

In [None]:
### YOUR CODE ###

def extract_bigrams(sentence):
    # Return a dictionary, where keys are bigrams of the sentence,
    # and values show how many times the bigram occurs in that sentence
    pass

def bigram_precision_clipped(reference, hypothesis):
    # Do the same as in unigram_precision_clipped, but for bigrams
    pass

# Calculate clipped bigram precision for the 3 hypotheses


Now we finally managed to beat the second hypothesis. It should have a score of 0.
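Here is one possible way to fill in the two bigram functions, given as a sketch rather than a reference solution. It represents each bigram as a tuple of two tokens (a string such as "each of" would work equally well), and it returns a `Counter`, which is a dictionary subclass. Scoring a hypothesis with fewer than two tokens as 0 is an assumption.

```python
from collections import Counter


def extract_bigrams(sentence):
    """Map each bigram (a pair of consecutive tokens) to its count in the sentence."""
    tokens = sentence.split()
    return Counter(zip(tokens, tokens[1:]))


def bigram_precision_clipped(reference, hypothesis):
    """Clipped precision computed over bigrams instead of single words."""
    ref_bigrams = extract_bigrams(reference)
    hyp_bigrams = extract_bigrams(hypothesis)
    total = sum(hyp_bigrams.values())
    if total == 0:
        return 0.0  # assumption: a hypothesis without bigrams gets precision 0
    clipped = sum(min(count, ref_bigrams[gram]) for gram, count in hyp_bigrams.items())
    return clipped / total


ref = "each of the kids ate an apple ."
hypotheses = [
    "the kids ate an apple each .",
    "the the the the .",
    "each of the cats ate a potato .",
]
for hyp in hypotheses:
    print(f"{bigram_precision_clipped(ref, hyp):.3f}  {hyp}")
```

The first hypothesis matches 4 of its 6 bigrams (about 0.667), the second matches none, and the third matches 2 of 7 (about 0.286).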

### 4. Brevity penalty

Let us now consider the hypothesis "the kids". Its only bigram appears in the reference, so it still gets a score of 1.0!

BLEU in the form we have now favours short translations, which can have high precision while keeping very little meaning from the reference. So the final thing we will introduce is the brevity penalty. If the hypothesis is at least as long as the reference, the brevity penalty is 1. If it is shorter, it equals $e^{1-r/c}$, where $r$ is the length of the reference, and $c$ the length of the hypothesis (candidate). The shorter the hypothesis is compared to the reference, the closer to 0 the number will be. We will multiply our scores by this coefficient.

In [None]:
### YOUR CODE ###

def brevity_penalty(reference, hypothesis):
    # Calculate brevity penalty for a reference-hypothesis pair
    pass

# Calculate clipped bigram precision * brevity penalty for the 3 hypotheses


Now we have something more reasonable: "the kids" gets a very low score of about 0.05.
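A possible way to fill in `brevity_penalty` and combine it with the clipped bigram precision is sketched below. To keep the block runnable on its own, it repeats a compact clipped bigram precision. The name `simple_bleu` is invented for this sketch, and returning 0 for an empty hypothesis is an assumption.

```python
import math
from collections import Counter


def brevity_penalty(reference, hypothesis):
    """1 if the hypothesis is at least as long as the reference, else e^(1 - r/c)."""
    r = len(reference.split())
    c = len(hypothesis.split())
    if c == 0:
        return 0.0  # assumption: an empty hypothesis is penalized to 0
    return 1.0 if c >= r else math.exp(1 - r / c)


def bigram_precision_clipped(reference, hypothesis):
    """Compact clipped bigram precision, repeated so this block is self-contained."""
    ref_tokens, hyp_tokens = reference.split(), hypothesis.split()
    ref_bigrams = Counter(zip(ref_tokens, ref_tokens[1:]))
    hyp_bigrams = Counter(zip(hyp_tokens, hyp_tokens[1:]))
    total = sum(hyp_bigrams.values())
    if total == 0:
        return 0.0
    return sum(min(n, ref_bigrams[g]) for g, n in hyp_bigrams.items()) / total


def simple_bleu(reference, hypothesis):
    """Simplified sentence-level score: brevity penalty times clipped bigram precision."""
    return brevity_penalty(reference, hypothesis) * bigram_precision_clipped(reference, hypothesis)


ref = "each of the kids ate an apple ."
hypotheses = [
    "the kids ate an apple each .",
    "the the the the .",
    "each of the cats ate a potato .",
    "the kids",
]
for hyp in hypotheses:
    print(f"{simple_bleu(ref, hyp):.3f}  {hyp}")
```

For "the kids", $r = 8$ and $c = 2$, so the penalty is $e^{1 - 8/2} = e^{-3} \approx 0.05$, and multiplying by the precision of 1.0 leaves the score at about 0.05.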