# Text Generation Metrics

We use the Hugging Face `evaluate` library for most of the metrics shown here. See the documentation: https://huggingface.co/evaluate-metric


In [None]:
!pip install evaluate sacrebleu rouge_score bert_score unbabel-comet
import evaluate

The full list of Hugging Face metrics is available at https://huggingface.co/evaluate-metric

## BLEU

In [None]:
bleu = evaluate.load("bleu")

In [None]:
pred = "เขา หาม มเหสี"
target = "เขา หาม หมา มเหสี"
results = bleu.compute(predictions=[pred], references=[[target]], tokenizer=lambda s: s.split(" "))
results
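To help read the `precisions` and `brevity_penalty` fields in the output above, here is a minimal pure-Python sketch of the clipped n-gram precisions and brevity penalty that unsmoothed BLEU is built from, applied to the same whitespace-tokenized pair. The helper names `ngram_counts` and `clipped_precision` are hypothetical and are not part of `evaluate`.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    # Count every contiguous n-gram in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(pred_tokens, ref_tokens, n):
    # Each predicted n-gram is credited at most as often as it occurs in the reference.
    pred_counts = ngram_counts(pred_tokens, n)
    ref_counts = ngram_counts(ref_tokens, n)
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0  # the prediction is too short to contain any n-gram of this order
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matched / total

pred_tokens = "เขา หาม มเหสี".split(" ")
ref_tokens = "เขา หาม หมา มเหสี".split(" ")

precisions = [clipped_precision(pred_tokens, ref_tokens, n) for n in range(1, 5)]

# The brevity penalty only applies when the prediction is not longer than the reference.
if len(pred_tokens) > len(ref_tokens):
    brevity_penalty = 1.0
else:
    brevity_penalty = math.exp(1 - len(ref_tokens) / len(pred_tokens))

print(precisions, brevity_penalty)
```

Because the 3-gram and 4-gram precisions are zero for such a short sentence, the geometric mean, and therefore the unsmoothed BLEU score, is zero even though every predicted word appears in the reference.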

## ChrF

In [None]:
chrf = evaluate.load("chrf")

In [None]:
# Setting word_order=2 gives chrF++. Word n-grams are split on whitespace, so Thai text must be pre-segmented.
results = chrf.compute(predictions=[pred], references=[[target]])
results

## ROUGE

In [None]:
rouge = evaluate.load("rouge")

In [None]:
candidates = ["Summarization is cool"]
references = [["Summarization is beneficial and cool","Summarization saves time"]]

results = rouge.compute(predictions=candidates, references=references)
print(results)

In [None]:
candidates = ["A fast brown fox leaps over a sleeping dog"]
references = [["The quick brown fox jumps over the lazy dog"]]

results = rouge.compute(predictions=candidates, references=references)
print(results)

The Hugging Face `evaluate` ROUGE metric does not work with Thai out of the box. See:

- https://stackoverflow.com/questions/73963171/rouge-score-metric-for-non-english-arabic-language-is-not-working
- https://stackoverflow.com/questions/76633871/why-rouge-score-results-are-confusing-for-non-english-languages
- https://github.com/huggingface/evaluate/issues/108

The `rouge_score` library that this metric uses appears to filter out every character that is not a lowercase Latin letter or a digit, in `rouge_score/tokenize.py`, with `text = re.sub(r"[^a-z0-9]+", " ", six.ensure_str(text))`.
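The effect of the regular expression quoted above can be checked with the standard library alone. This is a simplified sketch of that one filtering step, not the library's full tokenizer (which may also apply stemming and further filtering); the function name `ascii_only_tokenize` is made up for illustration.

```python
import re

def ascii_only_tokenize(text):
    # Lowercase, replace every run of characters outside a-z and 0-9 with a space,
    # then split on whitespace. Thai characters are all outside that range.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.split()

english_tokens = ascii_only_tokenize("Summarization is cool!")
thai_tokens = ascii_only_tokenize("เขา หาม มเหสี")
print(english_tokens)
print(thai_tokens)
```

The Thai sentence yields no tokens at all, so there is nothing left for ROUGE to match.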

`RougeScorer` accepts a `tokenizer` keyword argument, so we can pass our own whitespace tokenizer for pre-segmented Thai text.

In [None]:
from rouge_score import rouge_scorer
pred = "เขา หาม มเหสี"
target = "เขา หาม หมา มเหสี"

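class MyTokenizer:
  # RougeScorer calls tokenizer.tokenize(text), so the object only needs this method.
  def tokenize(self, text):
    return text.split(" ")

# Pass an instance of the tokenizer, not the class itself.
r_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], tokenizer=MyTokenizer())
# score() takes the reference first, then the prediction.
results = r_scorer.score(target, pred)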
results

## METEOR

In [None]:
meteor = evaluate.load("meteor")

In [None]:
pred = "the cat sat on the mat"
target = "the cat sat on the mat"
results = meteor.compute(predictions=[pred], references=[[target]])
results

In [None]:
pred = "the cat sat on the mat"
target = "the cat sat on the big mat"
results = meteor.compute(predictions=[pred], references=[[target]])
results

## TER

In [None]:
ter = evaluate.load("ter")

In [None]:
pred = "the cat sat on the mat"
target = "the cats sat on the mat"
results = ter.compute(predictions=[pred], references=[[target]])
results

Shift the word "sat" to the end of the sentence (the reference also changes "cat" to "cats"):

In [None]:
pred = "the cat sat on the mat"
target = "the cats on the mat sat"
results = ter.compute(predictions=[pred], references=[[target]])
results

Shift the phrase "on the mat" to the front of the sentence:

In [None]:
pred = "the cat sat on the mat"
target = "on the mat the cat sat"
results = ter.compute(predictions=[pred], references=[[target]])
results
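To see why the shift examples above are interesting, it can help to compare against a plain word-level edit distance, which only knows insertions, deletions and substitutions. TER additionally counts moving a contiguous block of words as a single edit. The sketch below deliberately leaves shifts out, so it is not TER itself; the function names are hypothetical and the reference is assumed to be non-empty.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    # Classic Levenshtein distance over words, kept one row at a time.
    prev = list(range(len(ref_tokens) + 1))
    for i, hyp_word in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref_tokens, start=1):
            substitution_cost = 0 if hyp_word == ref_word else 1
            curr.append(min(
                prev[j] + 1,                       # delete hyp_word
                curr[j - 1] + 1,                   # insert ref_word
                prev[j - 1] + substitution_cost,   # match or substitute
            ))
        prev = curr
    return prev[-1]

def edit_rate_without_shifts(pred, target):
    # Number of edits divided by the reference length (assumes a non-empty reference).
    hyp_tokens = pred.split()
    ref_tokens = target.split()
    return word_edit_distance(hyp_tokens, ref_tokens) / len(ref_tokens)

substituted = edit_rate_without_shifts("the cat sat on the mat", "the cats sat on the mat")
shifted_edits = word_edit_distance("the cat sat on the mat".split(), "on the mat the cat sat".split())
print(substituted, shifted_edits)
```

For the last pair, the shift-free distance needs several edits, whereas a metric that allows block shifts can move "on the mat" in one step.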

## BERTScore

In [None]:
bertscore = evaluate.load("bertscore")

The original BERTScore paper showed that BERTScore correlates well with human judgment on sentence-level and system-level evaluation, but this depends on the model and language pair selected.

Languages supported by Multilingual BERT: https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages

>The multilingual model supports the following languages. These languages were chosen because they are the top 100 languages with the largest Wikipedias [...]
>
> The **Multilingual Cased (New)** release contains additionally **Thai** and **Mongolian**, which were not included in the original release.

Calculating BERTScore requires downloading the model that is used to compute the score. The default model for `en`, `roberta-large`, takes over 1.4GB of storage space, and downloading it can take a significant amount of time on a slow internet connection. If this is an issue, choose a smaller model; for instance, `distilbert-base-uncased` is 268MB.

With `lang="th"`, the metric uses `bert-base-multilingual-cased` (the result reports the hashcode `bert-base-multilingual-cased_L9_no-idf_version=0.3.12(hug_trans=4.47.1)`), which should support Thai according to the list quoted above.

In [None]:
pred = "เขาหามมเหสี"
target = "เขาหามหมามเหสี"
results = bertscore.compute(predictions=[pred], references=[target], lang="th")
results

In [None]:
results = bertscore.compute(predictions=["ชีวิตทุกข์ทรมานจริง"], references=["ชีวิตมันแย่มาก"], lang="th")
results

In [None]:
results = bertscore.compute(predictions=["รู้สึกสนุกสุดยอด"], references=["ชีวิตมันแย่มาก"], lang="th")
results

## COMET

In [None]:
comet = evaluate.load('comet')

COMET takes three lists of strings as input: `sources` (a list of source sentences), `predictions` (a list of candidate translations) and `references` (a list of reference translations).

In [None]:
source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
hypothesis = ["The fire could be stopped", "Schools and kindergartens were open"]
reference = ["They were able to control the fire.", "Schools and kindergartens opened"]
results = comet.compute(predictions=hypothesis, references=reference, sources=source)
results