### [evaluate-metric](https://huggingface.co/evaluate-metric)

https://huggingface.co/spaces/evaluate-metric/bleu

https://huggingface.co/spaces/evaluate-metric/rouge

https://huggingface.co/spaces/evaluate-metric/meteor

https://huggingface.co/spaces/evaluate-metric/perplexity

In [2]:
import evaluate

**BLEU (Bilingual Evaluation Understudy)** is an algorithm for evaluating the quality of text that has been machine-translated from one natural language to another, by comparing a generated sentence against a reference sentence. Quality is considered to be the correspondence between a machine’s output and that of a human: “the closer a machine translation is to a professional human translation, the better it is”; this is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.

Scores are calculated for individual translated segments (generally sentences) by comparing them with a set of good quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation’s overall quality. Neither intelligibility nor grammatical correctness is taken into account. Here's the formula to calculate the BLEU score:

### BLEU score formula:

\$
\text{BLEU} = \text{BP} \times \exp\left( \sum_{n=1}^{N} w_n \cdot \log p_n \right)
\$

Where:

- **BP** is the **brevity penalty** to account for shorter candidate translations:
  \$
  \text{BP} = 
  \begin{cases} 
  1 & \text{if } c > r \\
  e^{(1 - \frac{r}{c})} & \text{if } c \leq r
  \end{cases}
  \$
  - \$ c \$ is the length of the candidate translation.
  - \$ r \$ is the length of the reference translation.

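- **\$ p_n \$** is the modified precision for n-grams of size \$ n \$, where each candidate n-gram's matches are clipped to its maximum count in any single reference: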
  \$
  p_n = \frac{\text{number of matched n-grams}}{\text{total n-grams in the candidate}}
  \$

- **\$ w_n \$** is the weight for each n-gram level, typically uniform:
  \$
  w_n = \frac{1}{N}
  \$
  For example, if using 4-grams, the weights are \$ w_1 = w_2 = w_3 = w_4 = 0.25 \$.

- **N** is the maximum length of the n-grams (often 4, so you consider unigrams, bigrams, trigrams, and 4-grams).

### Steps for calculation:
1. **Calculate precision** for unigrams, bigrams, trigrams, etc.
2. **Compute the brevity penalty** (BP), which penalizes overly short candidate translations.
3. Combine the precision values for different n-grams using the weighted geometric mean (the exponential of the weighted sum of their logarithms).
4. Multiply the result by the brevity penalty.
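
The four steps above can be sketched in plain Python. This is a minimal illustration, not the implementation behind `evaluate.load("bleu")`: the function names `ngrams` and `corpus_bleu` are hypothetical, tokenization is a simple whitespace split, no smoothing is applied, and the shortest reference length is assumed for the brevity penalty.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(predictions, references, max_order=4):
    """Corpus-level BLEU with uniform weights and whitespace tokenization."""
    matches = [0] * max_order  # clipped n-gram matches per order
    totals = [0] * max_order   # candidate n-grams per order
    cand_len = ref_len = 0

    for pred, refs in zip(predictions, references):
        cand = pred.split()
        ref_tokens = [r.split() for r in refs]
        cand_len += len(cand)
        # Assumption: the shortest reference sets the reference length.
        ref_len += min(len(r) for r in ref_tokens)

        for n in range(1, max_order + 1):
            cand_counts = ngrams(cand, n)
            # Maximum count of each n-gram across all references.
            max_ref = Counter()
            for r in ref_tokens:
                max_ref |= ngrams(r, n)
            # Step 1: clipped matches (Counter & takes the element-wise minimum).
            matches[n - 1] += sum((cand_counts & max_ref).values())
            totals[n - 1] += sum(cand_counts.values())

    precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; without smoothing the score is 0

    # Step 2: brevity penalty, following the BP definition above.
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    # Step 3: geometric mean of the precisions with uniform weights 1/N.
    log_mean = sum(math.log(p) for p in precisions) / max_order
    # Step 4: multiply by the brevity penalty.
    return bp * math.exp(log_mean)


predictions = ["hello there general kenobi", "foo bar foobar"]
references = [["hello there general kenobi"], ["foo bar foobar"]]
print(corpus_bleu(predictions, references))  # 1.0
```

Counts are accumulated over the whole corpus before the precisions are formed, so a short sentence with no 4-grams does not by itself force the score to zero.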

### Limitations and Bias

This metric has multiple known limitations:

- BLEU compares overlap in tokens from the predictions and references, instead of comparing meaning. This can lead to discrepancies between BLEU scores and human ratings.

- Because n-gram precision is computed relative to the candidate's own n-grams, shorter predicted translations tend to achieve higher scores than longer ones. The brevity penalty is introduced to counteract this.

- BLEU scores are not comparable across different datasets, nor are they comparable across different languages.

- BLEU scores can vary greatly depending on which parameters are used to generate the scores, especially when different tokenization and normalization techniques are used. It is therefore not possible to compare BLEU scores generated using different parameters, or when these parameters are unknown. For more discussion around this topic, see the following issue.
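
To make the tokenization point concrete, here is a small sketch that scores the same sentence pair under two tokenizers. Everything in it is illustrative: `unigram_precision`, `whitespace_tokenize`, and `punct_tokenize` are hypothetical helpers, and only clipped unigram precision is computed rather than full BLEU.

```python
import re
from collections import Counter


def unigram_precision(candidate_tokens, reference_tokens):
    """Clipped unigram precision of a candidate against one reference."""
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    overlap = sum((cand & ref).values())
    return overlap / max(sum(cand.values()), 1)


def whitespace_tokenize(text):
    """Split on whitespace only, so punctuation stays glued to words."""
    return text.split()


def punct_tokenize(text):
    """Split punctuation off as separate tokens (illustrative only)."""
    return re.findall(r"\w+|[^\w\s]", text)


prediction = "hello there, general kenobi"
reference = "hello there general kenobi!"

p_ws = unigram_precision(whitespace_tokenize(prediction), whitespace_tokenize(reference))
p_punct = unigram_precision(punct_tokenize(prediction), punct_tokenize(reference))

print(p_ws)     # 2 of 4 candidate tokens match: "there," and "kenobi" do not
print(p_punct)  # 4 of 5 candidate tokens match: only "," does not
```

The texts are identical in both runs; only the tokenizer changes, which is why scores reported with different or unknown tokenization should not be compared.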


Example where each prediction has 1 reference:

In [6]:
predictions = ["hello there general kenobi","foo bar foobar"]

references = [["hello there general kenobi"],
              ["foo bar foobar"]
             ]

bleu = evaluate.load("bleu")

results = bleu.compute(predictions=predictions, references=references)

print(results)

{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 7, 'reference_length': 7}


Example where each prediction has 2 references:

In [19]:
predictions = [
    "hello there general kenobi",
    "foo bar foobar"
]

references = [
    ["hello there general kenobi", "hello there!"],  # Keep references closer to predictions
    ["foo bar foobar", "foo bar"]  # Ensure all n-grams match
]

bleu = evaluate.load("bleu")

results = bleu.compute(predictions=predictions, references=references)

print(results)


{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.4, 'translation_length': 7, 'reference_length': 5}


Example with the word tokenizer from NLTK:

In [21]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

predictions = [
    "hello there general kenobi",
    "foo bar foobar"
]

references = [
    ["hello there general kenobi", "hello there!"],  # Keep references closer to predictions
    ["foo bar foobar", "foo bar"]  # Ensure all n-grams match
]

bleu = evaluate.load("bleu")

results = bleu.compute(predictions=predictions, references=references, tokenizer=word_tokenize)

print(results)

[nltk_data] Downloading package punkt_tab to /home/loc/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.4, 'translation_length': 7, 'reference_length': 5}


The **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)** score is a set of metrics used to evaluate the quality of text, typically in summarization or machine translation, by comparing the overlap between the generated text and reference text. Common variants include **ROUGE-N**, **ROUGE-L**, **ROUGE-W**, and **ROUGE-S**.

### 1. **ROUGE-N** (N-gram overlap)
Measures the overlap of n-grams between the candidate and reference text.

\$
\text{ROUGE-N} = \frac{\sum_{\text{n-gram} \in \text{Reference}} \min(\text{Count}_{\text{candidate}}(\text{n-gram}), \text{Count}_{\text{Reference}}(\text{n-gram}))}{\sum_{\text{n-gram} \in \text{Reference}} \text{Count}_{\text{Reference}}(\text{n-gram})}
\$

Where:
- \$ \text{Count}_{\text{candidate}}(\text{n-gram}) \$ is the number of times the n-gram occurs in the candidate text.
- \$ \text{Count}_{\text{Reference}}(\text{n-gram}) \$ is the number of times the n-gram occurs in the reference text.
- Taking the minimum of the two counts clips the matches, so an n-gram repeated in the candidate is not credited more often than it appears in the reference.
- \$ N \$ refers to the size of the n-gram (e.g., unigrams, bigrams, trigrams, etc.).

#### Example for ROUGE-1 (unigram overlap):
\$
\text{ROUGE-1} = \frac{\text{Number of overlapping unigrams}}{\text{Total unigrams in reference}}
\$
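
A minimal sketch of the ROUGE-N recall formula above, assuming whitespace tokenization and a single reference. The names `ngram_counts` and `rouge_n_recall` are hypothetical and are not part of the `evaluate` library.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    total_ref = sum(ref.values())
    if total_ref == 0:
        return 0.0
    # Counter & keeps the minimum count per n-gram, i.e. the clipped overlap.
    overlap = sum((cand & ref).values())
    return overlap / total_ref


candidate = "the cat sat on the mat"
reference = "the cat is on the mat"

print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 reference unigrams match
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 5 reference bigrams match
```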

### 2. **ROUGE-L** (Longest Common Subsequence)
Measures the longest common subsequence (LCS) between the candidate and reference text.

\$
\text{ROUGE-L} = \frac{LCS(\text{candidate}, \text{reference})}{\text{length of reference}}
\$

Where:
- \$ LCS(\text{candidate}, \text{reference}) \$ is the length of the longest common subsequence between the candidate and reference text.
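
The LCS length can be computed with standard dynamic programming. The sketch below assumes whitespace tokenization and a single reference, and the names `lcs_length` and `rouge_l_recall` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    cand, ref = candidate.split(), reference.split()
    if not ref:
        return 0.0
    return lcs_length(cand, ref) / len(ref)


candidate = "the cat sat on the mat"
reference = "the cat is on the mat"

# The LCS is "the cat on the mat" (5 tokens); the reference has 6 tokens.
print(lcs_length(candidate.split(), reference.split()))  # 5
print(rouge_l_recall(candidate, reference))
```

Unlike n-gram matching, the subsequence does not need to be contiguous, so "the cat" and "on the mat" both count even though a different word sits between them.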

### 3. **ROUGE-W** (Weighted Longest Common Subsequence)
This is similar to ROUGE-L but gives more weight to consecutive matches in the longest common subsequence. It penalizes scattered matches.

\$
\text{ROUGE-W} = \frac{LCS_{\text{weighted}}(\text{candidate}, \text{reference})}{\text{length of reference}}
\$

Where \$ LCS_{\text{weighted}} \$ assigns higher weights to consecutive matching words in the sequence.

### 4. **ROUGE-S** (Skip-bigram)
Measures the overlap of skip-bigrams, which are pairs of words that occur in both texts in the same order, allowing for gaps in between.

\$
\text{ROUGE-S} = \frac{\text{Number of matching skip-bigrams}}{\text{Total skip-bigrams in reference}}
\$
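
A minimal sketch of skip-bigram recall, assuming whitespace tokenization, a single reference, and no limit on the gap between the two words. The names `skip_bigrams` and `rouge_s_recall` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens):
    """Count all in-order word pairs, with any number of words in between."""
    return Counter(combinations(tokens, 2))


def rouge_s_recall(candidate, reference):
    """Matching skip-bigrams divided by the skip-bigrams in the reference."""
    cand = skip_bigrams(candidate.split())
    ref = skip_bigrams(reference.split())
    total_ref = sum(ref.values())
    if total_ref == 0:
        return 0.0
    overlap = sum((cand & ref).values())
    return overlap / total_ref


candidate = "police kill the gunman"
reference = "police killed the gunman"

# Shared pairs: (police, the), (police, gunman), (the, gunman): 3 of 6.
print(rouge_s_recall(candidate, reference))  # 0.5
```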

### Key Variants:
- **ROUGE-N**: Precision, recall, and F1-score can be computed. The formula above calculates **recall**, while **precision** would use the total n-grams in the candidate as the denominator.
  
- **ROUGE-L**: Measures recall based on the LCS, but precision and F1-score variants can also be calculated using the candidate’s length.

### F1-score for ROUGE:
You can combine **precision** and **recall** into the **F1-score** as:

\$
\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\$
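
The F1 combination is a one-liner once precision and recall are known. The sketch below uses a hypothetical helper `f1_score` and guards against the case where both values are zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: half of the candidate's n-grams match, and they cover the whole reference.
print(f1_score(0.5, 1.0))
```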

**METEOR** is an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference.

METEOR achieves an R correlation value of 0.347 with human evaluation on the Arabic data and 0.331 on the Chinese data. This is shown to be an improvement over using unigram-precision, unigram-recall, or their harmonic F1 combination alone.
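
The way precision, recall, and fragmentation combine can be sketched as follows. This is a deliberately simplified illustration, not the implementation behind `evaluate.load("meteor")`: the function name `simple_meteor` is hypothetical, only exact surface-form matches are used (no stemming or synonyms), the alignment is a greedy left-to-right pass rather than the one with the fewest chunks, and the constants (a recall-weighted mean with weight 9 and a penalty of 0.5 times the cubed chunk ratio) are assumed from the original METEOR formulation.

```python
def simple_meteor(candidate, reference):
    """Exact-match METEOR-style score for one candidate and one reference."""
    cand, ref = candidate.split(), reference.split()

    # Greedy alignment: map each candidate token to the first unused
    # reference position holding the same token.
    used = [False] * len(ref)
    alignment = []  # (candidate index, reference index) pairs
    for i, tok in enumerate(cand):
        for j, ref_tok in enumerate(ref):
            if not used[j] and tok == ref_tok:
                used[j] = True
                alignment.append((i, j))
                break

    m = len(alignment)
    if m == 0:
        return 0.0

    precision = m / len(cand)
    recall = m / len(ref)
    # Recall-weighted harmonic mean (assumed weighting, see lead-in).
    f_mean = 10 * precision * recall / (recall + 9 * precision)

    # A chunk is a run of matches adjacent in both candidate and reference.
    chunks = 1
    for (i0, j0), (i1, j1) in zip(alignment, alignment[1:]):
        if i1 != i0 + 1 or j1 != j0 + 1:
            chunks += 1

    # Fragmentation penalty: more chunks means worse word order.
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)


reference = "the cat sat on the mat"
print(simple_meteor("the cat sat on the mat", reference))  # one chunk, small penalty
print(simple_meteor("on the mat sat the cat", reference))  # same words, fragmented order
```

Both candidates contain exactly the reference's words, so precision and recall are identical; the difference in score comes entirely from the fragmentation penalty.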
