### [evaluate-metric](https://huggingface.co/evaluate-metric)

https://huggingface.co/spaces/evaluate-metric/bleu

https://huggingface.co/spaces/evaluate-metric/rouge

https://huggingface.co/spaces/evaluate-metric/meteor

https://huggingface.co/spaces/evaluate-metric/perplexity

In [2]:
import evaluate

**BLEU (Bilingual Evaluation Understudy)** is an algorithm for evaluating the quality of text that has been machine-translated from one natural language to another, by comparing a generated sentence with one or more reference sentences. Quality is considered to be the correspondence between a machine’s output and that of a human: “the closer a machine translation is to a professional human translation, the better it is”. This is the central idea behind BLEU. BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics.

Scores are calculated for individual translated segments (generally sentences) by comparing them with a set of good-quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation’s overall quality. Neither intelligibility nor grammatical correctness is taken into account. Here's the formula to calculate the BLEU score:

### BLEU score formula:

$$
\text{BLEU} = \text{BP} \times \exp\left( \sum_{n=1}^{N} w_n \cdot \log p_n \right)
$$

Where:

- **BP** is the **brevity penalty** to account for shorter candidate translations:
  $$
  \text{BP} =
  \begin{cases}
  1 & \text{if } c > r \\
  e^{(1 - \frac{r}{c})} & \text{if } c \leq r
  \end{cases}
  $$
  - $c$ is the length of the candidate translation.
  - $r$ is the length of the reference translation.

- **$p_n$** is the precision for n-grams of size $n$:
  $$
  p_n = \frac{\text{number of matched n-grams}}{\text{total n-grams in the candidate}}
  $$

- **$w_n$** is the weight for each n-gram level, typically uniform:
  $$
  w_n = \frac{1}{N}
  $$
  For example, if using 4-grams, the weights are $w_1 = w_2 = w_3 = w_4 = 0.25$.

- **N** is the maximum length of the n-grams (often 4, so you consider unigrams, bigrams, trigrams, and 4-grams).

### Steps for calculation:
1. **Calculate precision** for unigrams, bigrams, trigrams, etc.
2. **Compute the brevity penalty** (BP), which penalizes overly short candidate translations.
3. **Combine the precision values** for the different n-gram sizes using the weighted geometric mean (the exponential of the weighted sum of their logarithms).
4. **Multiply** the geometric mean by the brevity penalty.
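
The steps above can be sketched in plain Python for a single candidate and a single reference. This is only an illustration under stated assumptions, not the code behind `evaluate.load("bleu")`: the helper names `ngrams` and `sentence_bleu` are hypothetical, tokenization is a plain whitespace split, and matches are clipped so that a candidate n-gram counts at most as often as it appears in the reference (the modified precision of the original BLEU definition).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """BLEU for one candidate/reference pair with uniform weights w_n = 1/max_n."""
    cand, ref = candidate.split(), reference.split()  # assumption: whitespace tokens
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Step 1: clipped precision for n-grams of size n.
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # a zero precision makes the geometric mean zero
        log_precisions.append(math.log(matched / total))
    # Step 2: brevity penalty from candidate length c and reference length r.
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Steps 3 and 4: weighted geometric mean, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


print(sentence_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(sentence_bleu("the cat sat on the", "the cat sat on the mat"))
```

In the second call every n-gram of the candidate appears in the reference, so all precisions are 1 and the score is determined by the brevity penalty alone.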

### Limitations and Bias

This metric has multiple known limitations:

- BLEU compares overlap in tokens from the predictions and references, instead of comparing meaning. This can lead to discrepancies between BLEU scores and human ratings.

- Shorter predicted translations achieve higher scores than longer ones, simply due to how the score is calculated. A brevity penalty is introduced to attempt to counteract this.

- BLEU scores are not comparable across different datasets, nor are they comparable across different languages.

- BLEU scores can vary greatly depending on which parameters are used to generate the scores, especially when different tokenization and normalization techniques are used. It is therefore not possible to compare BLEU scores generated using different parameters, or when these parameters are unknown. For more discussion around this topic, see the following issue.


Example where each prediction has 1 reference:

In [6]:
predictions = ["hello there general kenobi","foo bar foobar"]

references = [["hello there general kenobi"],
              ["foo bar foobar"]
             ]

bleu = evaluate.load("bleu")

results = bleu.compute(predictions=predictions, references=references)

print(results)

{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 7, 'reference_length': 7}


Example where each prediction has 2 references:

In [19]:
predictions = [
    "hello there general kenobi",
    "foo bar foobar"
]

references = [
    ["hello there general kenobi", "hello there!"],  # Keep references closer to predictions
    ["foo bar foobar", "foo bar"]  # Ensure all n-grams match
]

bleu = evaluate.load("bleu")

results = bleu.compute(predictions=predictions, references=references)

print(results)


{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.4, 'translation_length': 7, 'reference_length': 5}


Example with the word tokenizer from NLTK:

In [21]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

predictions = [
    "hello there general kenobi",
    "foo bar foobar"
]

references = [
    ["hello there general kenobi", "hello there!"],  # Keep references closer to predictions
    ["foo bar foobar", "foo bar"]  # Ensure all n-grams match
]

bleu = evaluate.load("bleu")

results = bleu.compute(predictions=predictions, references=references, tokenizer=word_tokenize)

print(results)

[nltk_data] Downloading package punkt_tab to /home/loc/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.4, 'translation_length': 7, 'reference_length': 5}


The **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)** score is a set of metrics used to evaluate the quality of text, typically in summarization or machine translation, by comparing the overlap between the generated text and reference text. The most commonly used variants are **ROUGE-N**, **ROUGE-L**, and **ROUGE-W**.

### 1. **ROUGE-N** (N-gram overlap)
Measures the overlap of n-grams between the candidate and reference text.

$$
\text{ROUGE-N} = \frac{\sum_{\text{n-gram} \in \text{Reference}} \min(\text{Count}_{\text{candidate}}(\text{n-gram}), \text{Count}_{\text{Reference}}(\text{n-gram}))}{\sum_{\text{n-gram} \in \text{Reference}} \text{Count}_{\text{Reference}}(\text{n-gram})}
$$



Where:
- $\text{Count}_{\text{candidate}}(\text{n-gram})$ is the number of times the n-gram occurs in the candidate text.
- $\text{Count}_{\text{Reference}}(\text{n-gram})$ is the number of times the n-gram occurs in the reference text; taking the minimum of the two counts gives the number of matches.
- $N$ refers to the size of the n-gram (e.g., unigrams, bigrams, trigrams, etc.).

#### Example for ROUGE-1 (unigram overlap):
$$
\text{ROUGE-1} = \frac{\text{Number of overlapping unigrams}}{\text{Total unigrams in reference}}
$$
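
As a rough illustration of the recall formula, the following sketch counts clipped n-gram overlap with `collections.Counter`. The function name `rouge_n_recall` is hypothetical and tokenization is a plain whitespace split, so the numbers will not necessarily match those of the `rouge_score` package, which applies its own tokenization.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Recall-oriented ROUGE-N for one candidate/reference pair."""
    cand, ref = candidate.split(), reference.split()  # assumption: whitespace tokens
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Numerator: for each reference n-gram, the smaller of the two counts.
    overlap = sum(min(cand_counts[gram], count) for gram, count in ref_counts.items())
    # Denominator: total number of n-grams in the reference.
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


print(rouge_n_recall("the cat sat", "the cat sat on the mat", n=1))  # 0.5
print(rouge_n_recall("the cat sat", "the cat sat on the mat", n=2))  # 0.4
```

The candidate covers 3 of the 6 reference unigrams and 2 of the 5 reference bigrams.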

### 2. **ROUGE-L** (Longest Common Subsequence)
Measures the longest common subsequence (LCS) between the candidate and reference text.

$$
\text{ROUGE-L} = \frac{LCS(\text{candidate}, \text{reference})}{\text{length of reference}}
$$

Where:
- $LCS(\text{candidate}, \text{reference})$ is the length of the longest common subsequence between the candidate and reference text.
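
A minimal sketch of the LCS-based recall, assuming whitespace tokenization and a standard dynamic-programming LCS. The names `lcs_length` and `rouge_l_recall` are hypothetical and are not part of the `evaluate` API.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)  # extend the common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a token on one side
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    cand, ref = candidate.split(), reference.split()  # assumption: whitespace tokens
    return lcs_length(cand, ref) / len(ref) if ref else 0.0


print(rouge_l_recall("the cat was on the mat", "the cat sat on the mat"))
```

Here the longest common subsequence is "the cat on the mat" (5 tokens) against a 6-token reference, so the recall is 5/6.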

### 3. **ROUGE-W** (Weighted Longest Common Subsequence)
This is similar to ROUGE-L but gives more weight to consecutive matches in the longest common subsequence. It penalizes scattered matches.

$$
\text{ROUGE-W} = \frac{LCS_{\text{weighted}}(\text{candidate}, \text{reference})}{\text{length of reference}}
$$

Where $LCS_{\text{weighted}}$ assigns higher weights to consecutive matching words in the sequence.

### 4. **ROUGE-S** (Skip-bigram)
Measures the overlap of skip-bigrams, which are pairs of words that occur in both texts in the same order, allowing for gaps in between.

$$
\text{ROUGE-S} = \frac{\text{Number of matching skip-bigrams}}{\text{Total skip-bigrams in reference}}
$$
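
Skip-bigrams can be enumerated with `itertools.combinations`, which yields ordered pairs of tokens with any gap between them. The sketch below assumes an unlimited skip distance and whitespace tokenization; `skip_bigrams` and `rouge_s_recall` are hypothetical names used only for illustration.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens):
    """Count every in-order pair of tokens, regardless of the gap between them."""
    return Counter(combinations(tokens, 2))


def rouge_s_recall(candidate, reference):
    """Matching skip-bigrams divided by the skip-bigrams in the reference."""
    cand = skip_bigrams(candidate.split())
    ref = skip_bigrams(reference.split())
    overlap = sum(min(cand[pair], count) for pair, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


print(rouge_s_recall("police kill the gunman", "police killed the gunman"))  # 0.5
```

The reference has 6 skip-bigrams, of which 3 ("police the", "police gunman", "the gunman") also occur in the candidate.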

### Key Variants:
- **ROUGE-N**: Precision, recall, and F1-score can be computed. The formula above calculates **recall**, while **precision** would use the total n-grams in the candidate as the denominator.
  
- **ROUGE-L**: Measures recall based on the LCS, but precision and F1-score variants can also be calculated using the candidate’s length.

### F1-score for ROUGE:
You can combine **precision** and **recall** into the **F1-score** as:

$$
\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$


```
ImportError: To be able to use evaluate-metric/rouge, you need to install the following dependencies['absl', 'rouge_score'] using 'pip install # Here to have a nice missing dependency error message early on rouge_score' for instance'
```

In [5]:
# !pip install absl-py rouge-score

At minimum, this metric takes as input a list of predictions and a list of references:

In [6]:
rouge = evaluate.load('rouge')

predictions = ["hello there", "general kenobi"]

references = ["hello there", "general kenobi"]

results = rouge.compute(predictions=predictions,
                       references=references)

print(results)

{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}


One can also pass a custom tokenizer, which is especially useful for non-Latin languages.

In [7]:
results = rouge.compute(predictions=predictions,
                       references=references,
                       tokenizer=lambda x:x.split())

print(results)

{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}


It can also deal with a list of references for each prediction:

In [9]:
predictions = ["hello there", "general kenobi"]

references = [["hello", "there"], ["general kenobi", "general yoda"]]

results = rouge.compute(predictions=predictions,references=references)

print(results)

{'rouge1': 0.8333333333333333, 'rouge2': 0.5, 'rougeL': 0.8333333333333333, 'rougeLsum': 0.8333333333333333}


The **METEOR** (Metric for Evaluation of Translation with Explicit ORdering) score is a popular metric used to evaluate the quality of machine translations. It incorporates several factors such as precision, recall, synonymy, stemming, and word order to provide a more nuanced evaluation than some other metrics like BLEU.

The general steps to compute the METEOR score involve:

1. **Exact word matching** between the candidate and reference translations.
2. **Stemming** to match variations of the same word (e.g., “running” and “run”).
3. **Synonymy** to match synonyms (e.g., “big” and “large”).
4. **Penalty for fragmentation** to account for how well word order is preserved.

The basic formula for the METEOR score is:

$$
\text{METEOR} = (1 - \text{Pen}) \cdot F_{\text{mean}}
$$

Where:

- $\text{Pen}$ is a penalty based on how fragmented the matches are: it grows when the word order of the candidate differs from that of the reference.
- $F_{\text{mean}}$ is a weighted harmonic mean of precision ($P$) and recall ($R$), with recall weighted nine times as heavily as precision:

$$
F_{\text{mean}} = \frac{10 \cdot P \cdot R}{9 \cdot P + R}
$$

Here:

- **Precision ($P$)** is the number of matched words divided by the total number of words in the candidate translation.
- **Recall ($R$)** is the number of matched words divided by the total number of words in the reference translation.

The penalty $\text{Pen}$ is computed from the number of chunks, where a chunk is a group of consecutive words in the candidate translation that match consecutive words in the reference translation in the same order. The more scattered the matches are, the more chunks there are and the larger the penalty:

$$
\text{Pen} = 0.5 \cdot \left(\frac{\text{number of chunks}}{\text{number of matched words}}\right)^3
$$

The final METEOR score balances recall and precision with penalties for word order mismatches and fragmentation.
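
To make the arithmetic concrete, here is a small sketch that combines the quantities once the alignment is known. It assumes the number of matched words and the number of chunks have already been computed (the alignment step, including stemming and synonym matching, is not shown), and it uses the cubed chunk ratio of the original METEOR formulation for the fragmentation penalty. The function name `meteor_from_counts` is hypothetical.

```python
def meteor_from_counts(matches, chunks, candidate_len, reference_len):
    """Combine precision, recall and the fragmentation penalty into a score."""
    if matches == 0:
        return 0.0  # nothing aligned, so the score is zero
    precision = matches / candidate_len
    recall = matches / reference_len
    # Harmonic mean that weights recall nine times as heavily as precision.
    f_mean = 10 * precision * recall / (9 * precision + recall)
    # Fewer, longer chunks mean better-preserved word order and a smaller penalty.
    penalty = 0.5 * (chunks / matches) ** 3
    return (1 - penalty) * f_mean


# Six words matched in two chunks; candidate and reference are both six words long.
print(round(meteor_from_counts(matches=6, chunks=2, candidate_len=6, reference_len=6), 4))
```

With all six words matched, precision and recall are both 1, so the score differs from 1 only through the penalty of 0.5 · (2/6)³.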

### Key Components:
- **Precision**: How much of the candidate translation overlaps with the reference translation.
- **Recall**: How much of the reference translation is covered by the candidate translation.
- **Fragmentation Penalty**: Penalizes candidate translations that do not preserve the word order.

In [None]:
import evaluate

meteor = evaluate.load('meteor')

predictions = ["It is a guide to action which ensures that the military always obeys the commands of the party"]
reference = ["It is a guide to action which ensures that the military always obeys the commands of the party"]

results = meteor.compute(predictions=predictions, references=reference)

print(round(results['meteor'], 2))

[nltk_data] Downloading package wordnet to /home/loc/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/loc/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/loc/nltk_data...


The **perplexity** score is commonly used to evaluate language models. It measures how well a probability model predicts a sample. Specifically, perplexity can be understood as the inverse probability of the test set normalized by the number of words. Lower perplexity indicates a better model.

The formula for **perplexity** is:

$$
\text{Perplexity}(P) = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i)}
$$

Where:
- $P(w_i)$ is the probability assigned by the model to the $i$-th word in the sequence, given the words that precede it.
- $N$ is the total number of words in the sequence.

Alternatively, if using natural logarithms, the formula becomes:

$$
\text{Perplexity}(P) = e^{-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i)}
$$

### Interpretation:

- **$P(w_i)$** represents the probability the model assigns to the word $w_i$.
- The sum of the log probabilities over the sequence is averaged over $N$ (the number of words).
- The result is then exponentiated to give a positive number, which represents the perplexity.

### Key Points:
- Lower perplexity is better, meaning the model is less "perplexed" by the data and assigns higher probabilities to the correct sequences.
- If a model predicts a word sequence perfectly, the perplexity will be 1.
- Higher perplexity means the model is uncertain and assigns low probabilities to the sequence.

Perplexity is commonly used in evaluating models such as n-gram models, neural network-based models, and large language models.
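
As a quick check of the two formulas, the sketch below computes perplexity from a list of per-token probabilities, once with natural logarithms and once with base-2 logarithms. The probabilities are made-up illustrative values, not the output of any model, and the function names are hypothetical.

```python
import math


def perplexity(token_probs):
    """exp of the average negative natural-log probability."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)


def perplexity_base2(token_probs):
    """Same quantity computed with base-2 logarithms."""
    n = len(token_probs)
    return 2 ** (-sum(math.log2(p) for p in token_probs) / n)


probs = [0.25, 0.25, 0.25, 0.25]  # illustrative values only
print(perplexity(probs), perplexity_base2(probs))
print(perplexity([1.0, 1.0, 1.0]))  # 1.0
```

A model that assigns probability 0.25 to every token has a perplexity of about 4, as if it were choosing uniformly among four options at each step, and both bases give the same value up to floating-point error.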

Calculating perplexity on predictions defined here:

In [16]:
import evaluate

perplexity = evaluate.load("perplexity", module_type="metric")

input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]

results = perplexity.compute(model_id='gpt2',
                             add_start_token=False,
                             predictions=input_texts)
print(list(results.keys()))

print(results)

['perplexities', 'mean_perplexity']
{'perplexities': [32.25433349609375, 1499.6904296875, 408.2761535644531], 'mean_perplexity': 646.7403055826823}


Calculating perplexity on predictions loaded in from a dataset:

In [17]:
import datasets

perplexity = evaluate.load("perplexity", module_type="metric")

input_texts = datasets.load_dataset("wikitext",
                                    "wikitext-2-raw-v1",
                                    split="test")["text"][:50]
input_texts = [s for s in input_texts if s!='']
results = perplexity.compute(model_id='gpt2',
                             predictions=input_texts)

print(list(results.keys()))

results

['perplexities', 'mean_perplexity']


{'perplexities': [567.9088134765625,
  56.67512512207031,
  43.7514533996582,
  446.309326171875,
  105.64181518554688,
  46.81957244873047,
  58.58171463012695,
  104.24046325683594,
  54.72477340698242,
  47.75749588012695,
  344.9775085449219,
  125.95054626464844,
  127.30074310302734,
  121.87387084960938,
  5530.92724609375,
  63.25591278076172,
  40.26132583618164,
  221.20889282226562,
  67.8737564086914,
  124.36256408691406,
  51.45122528076172,
  28.862401962280273,
  90.88360595703125,
  52.91471862792969,
  43.79275131225586,
  65.90267181396484,
  28.63719367980957],
 'mean_perplexity': 320.846203274197}