ROUGE score: Recall-Oriented Understudy for Gisting Evaluation
Text summarization is the art of distilling the essence of text content. It involves creating coherent and concise summaries of longer documents while retaining key information. The ROUGE Score is the go-to metric for assessing the quality of such machine-generated summaries.

pip install rouge-score

The ROUGE Score has three main components: ROUGE-N, ROUGE-L, and ROUGE-S.

ROUGE-N
ROUGE-N is a component of the ROUGE score that quantifies the overlap of N-grams, contiguous sequences of N items (typically words or characters), between the system-generated summary and the reference summary. It provides insights into the precision and recall of the system's output by considering the matching N-gram sequences.
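
To make the counting concrete, here is a minimal sketch of ROUGE-N in plain Python. It assumes lowercasing and whitespace tokenization with no stemming, so it will not agree with a library implementation on every input. The helper names `ngrams` and `rouge_n` are illustrative and not part of any package.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of contiguous n-grams."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n):
    """Return (precision, recall, f1) for n-gram overlap."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Clipped overlap: an n-gram counts only as often as it occurs in both texts.
    overlap = sum((ref & cand).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


precision, recall, f1 = rouge_n(
    "the cat was under the bed", "the cat was found under the bed", 1
)
print(precision, recall, f1)
```

For this pair, 6 of the candidate's 7 unigrams appear in the reference (precision 6/7) and all 6 reference unigrams appear in the candidate (recall 1.0).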

ROUGE-L
ROUGE-L, another component of the ROUGE Score, is based on the Longest Common Subsequence (LCS) of the system and reference summaries. Unlike an N-gram match, the LCS is the longest sequence of words that appear in both summaries in the same order, though not necessarily contiguously. It offers a more flexible similarity measure and helps capture shared information beyond strict word-for-word matches.
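
The LCS idea can be sketched with a standard dynamic-programming table. This is a simplified illustration under the assumption of lowercased, whitespace-split tokens and no stemming; `lcs_length` and `rouge_l` are hypothetical helper names, not library functions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] is the LCS length of the tokens of `a` processed so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Return (precision, recall, f1) based on the LCS length."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1


print(rouge_l("found a cat under the bed", "the cat was found under the bed"))
```

Here the longest common subsequence has 4 words (for example "cat under the bed"), so precision is 4/7 and recall is 4/6.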

ROUGE-S
ROUGE-S focuses on skip-bigrams. A skip-bigram is any pair of words taken in their sentence order, with arbitrary gaps allowed between them. This component measures the skip-bigram overlap between the system and reference summaries, enabling the assessment of sentence-level structure similarity. Because the two words need not be adjacent, it can still credit a summary that inserts or drops words between them, which helps with some paraphrases.
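
A rough sketch of skip-bigram overlap follows. It assumes an unlimited gap between the two words and plain whitespace tokenization; variants that cap the skip distance exist, and this sketch does not cover them. The rouge-score package used in this notebook does not appear to expose a ROUGE-S type, so `skip_bigrams` and `rouge_s` are hypothetical helpers written for illustration only.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens):
    """Multiset of all in-order word pairs (i < j), with any gap allowed."""
    return Counter(combinations(tokens, 2))


def rouge_s(reference, candidate):
    """Return (precision, recall, f1) for skip-bigram overlap."""
    ref = skip_bigrams(reference.lower().split())
    cand = skip_bigrams(candidate.lower().split())
    overlap = sum((ref & cand).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


print(rouge_s("the cat was under the bed", "the cat was found under the bed"))
```

The 6-word reference has 15 skip-bigrams and the 7-word candidate has 21. Inserting "found" leaves every reference pair intact, so recall is 1.0 and precision is 15/21.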

ref: https://thepythoncode.com/article/calculate-rouge-score-in-python

In [1]:
from rouge_score import rouge_scorer



The next step is to initialize the RougeScorer. You need to specify which types of ROUGE scores you are interested in. For instance, if you want to calculate ROUGE-1, ROUGE-2, and ROUGE-L scores, you would initialize the scorer like this:

In [2]:
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

In [3]:
candidate_summary = "the cat was found under the bed"
reference_summary = "the cat was under the bed"
# score() takes the reference (target) first, then the candidate (prediction).
scores = scorer.score(reference_summary, candidate_summary)
for key in scores:
    print(f'{key}: {scores[key]}')

rouge1: Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923)
rouge2: Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272)
rougeL: Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923)


In [4]:
candidate_summary = "the cat was found under the bed"
reference_summaries = ["the cat was under the bed", "found a cat under the bed"]
scores = {key: [] for key in ['rouge1', 'rouge2', 'rougeL']}
for ref in reference_summaries:
    temp_scores = scorer.score(ref, candidate_summary)
    for key in temp_scores:
        scores[key].append(temp_scores[key])

for key in scores:
    print(f'{key}:\n{scores[key]}')

rouge1:
[Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), Score(precision=0.7142857142857143, recall=0.8333333333333334, fmeasure=0.7692307692307692)]
rouge2:
[Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272), Score(precision=0.3333333333333333, recall=0.4, fmeasure=0.3636363636363636)]
rougeL:
[Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923), Score(precision=0.5714285714285714, recall=0.6666666666666666, fmeasure=0.6153846153846153)]
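
One common way to reduce per-reference lists like these to a single score per metric is to keep the best-scoring reference. The sketch below assumes "best" means highest F-measure. It uses a stand-in `Score` namedtuple and rounded copies of the numbers printed above, so it runs without the library.

```python
from collections import namedtuple

# Stand-in with the same field names as the Score tuples printed above.
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])

# Rounded copies of the per-reference results shown in the output.
per_reference = {
    "rouge1": [Score(0.8571, 1.0, 0.9231), Score(0.7143, 0.8333, 0.7692)],
    "rouge2": [Score(0.6667, 0.8, 0.7273), Score(0.3333, 0.4, 0.3636)],
    "rougeL": [Score(0.8571, 1.0, 0.9231), Score(0.5714, 0.6667, 0.6154)],
}

# For each metric, keep the reference that gives the highest F-measure.
best = {
    key: max(values, key=lambda s: s.fmeasure)
    for key, values in per_reference.items()
}

for key, score in best.items():
    print(f"{key}: {score}")
```

With these numbers the first reference wins for all three metrics.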


In [5]:
generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"

In [6]:
import evaluate
rouge_score = evaluate.load("rouge")

In [7]:
scores = rouge_score.compute(
    predictions=[generated_summary], references=[reference_summary]
)
scores

{'rouge1': 0.923076923076923,
 'rouge2': 0.7272727272727272,
 'rougeL': 0.923076923076923,
 'rougeLsum': 0.923076923076923}

In [26]:
rouge = evaluate.load('rouge')
predictions = ["hello there", "general kenobi"]
references = [["hello", "there"], ["general kenobi", "general yoda"]]
results = rouge.compute(predictions=predictions,
                        references=references)
results

{'rouge1': 0.8333333333333333,
 'rouge2': 0.5,
 'rougeL': 0.8333333333333333,
 'rougeLsum': 0.8333333333333333}

ref: https://huggingface.co/spaces/evaluate-metric/rouge
predictions (list): list of predictions to score. Each prediction should be a string with tokens separated by spaces.
references (list or list[list]): one reference per prediction, or a list of several references per prediction.
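
The two accepted shapes for references can be illustrated with a small helper that converts either one to a list of lists. According to the description above, the metric accepts both shapes itself, so `normalize_references` is a hypothetical function written only to show how the shapes relate.

```python
def normalize_references(predictions, references):
    """Return references as a list of lists, one inner list per prediction."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    # A bare string becomes a one-element list; a list of strings is copied as-is.
    return [[ref] if isinstance(ref, str) else list(ref) for ref in references]


predictions = ["hello there", "general kenobi"]
print(normalize_references(predictions, ["hello there", "general kenobi"]))
print(normalize_references(predictions, [["hello", "there"], ["general kenobi", "general yoda"]]))
```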

In [38]:
predictions = ["hello there", "general kenobi"]
references = [["hello there"], ["kenobi", "general"]]
results = rouge.compute(predictions=predictions,
                        references=references, use_aggregator=False)
results

{'rouge1': [1.0, 0.6666666666666666],
 'rouge2': [1.0, 0.0],
 'rougeL': [1.0, 0.6666666666666666],
 'rougeLsum': [1.0, 0.6666666666666666]}

In [39]:
localscorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

In [43]:
newscores = localscorer.score(references[0][0], predictions[0])
newscores

{'rouge1': Score(precision=1.0, recall=1.0, fmeasure=1.0),
 'rouge2': Score(precision=1.0, recall=1.0, fmeasure=1.0),
 'rougeL': Score(precision=1.0, recall=1.0, fmeasure=1.0)}

In [40]:
newscores = localscorer.score(reference_summary, generated_summary)
newscores

{'rouge1': Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923),
 'rouge2': Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272),
 'rougeL': Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923)}

In [17]:
newscores['rouge1']

Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923)

In [19]:
newscores['rouge1'].fmeasure

0.923076923076923

In [22]:
# Source: https://github.com/google-research/google-research/tree/master/rouge
scores = localscorer.score('The quick brown fox jumps over the lazy dog',
                           'The quick brown dog jumps on the log.')
scores

{'rouge1': Score(precision=0.75, recall=0.6666666666666666, fmeasure=0.7058823529411765),
 'rouge2': Score(precision=0.2857142857142857, recall=0.25, fmeasure=0.26666666666666666),
 'rougeL': Score(precision=0.625, recall=0.5555555555555556, fmeasure=0.5882352941176471)}

In [24]:
result = localscorer.score_multi(["first text", "first something"], "text first")
result

{'rouge1': Score(precision=1.0, recall=1.0, fmeasure=1.0),
 'rouge2': Score(precision=0.0, recall=0.0, fmeasure=0.0),
 'rougeL': Score(precision=0.5, recall=0.5, fmeasure=0.5)}

In [44]:
candidate_summary = ["the cat was found under the bed", "I absolutely loved reading the Hunger Games"]
reference_summaries = ["the cat was under the bed", "I loved reading the Hunger Games"]

In [50]:
# Note: score_multi expects a list of reference strings; a bare string is passed here.
newscores = localscorer.score_multi(reference_summaries[0], candidate_summary[0])

In [51]:
newscores

{'rouge1': Score(precision=0.0, recall=0.0, fmeasure=0.0),
 'rouge2': Score(precision=0.0, recall=0.0, fmeasure=0.0),
 'rougeL': Score(precision=0.0, recall=0.0, fmeasure=0.0)}
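
The all-zero result is most likely explained by the first argument being a plain string. If the scorer loops over its references, a string is consumed one character at a time, and no single character matches a whole word in the candidate. The sketch below reproduces that effect with hypothetical stand-ins (`unigram_f1`, `best_f1`); it is not the library's code.

```python
from collections import Counter


def unigram_f1(reference, candidate):
    """F1 over whitespace-separated unigrams (no stemming)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_f1(references, candidate):
    # Iterating yields whole strings for a list, but single characters for a str.
    return max(unigram_f1(ref, candidate) for ref in references)


candidate = "the cat was found under the bed"
reference = "the cat was under the bed"

print(best_f1(reference, candidate))    # bare string: compared character by character
print(best_f1([reference], candidate))  # one-element list: compared as a whole summary
```

Wrapping the single reference in a list, or calling score() directly, avoids the problem.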

In [53]:
reference_summaries[0]

'the cat was under the bed'

In [52]:
newscores = localscorer.score(reference_summaries[0], candidate_summary[0])
newscores

{'rouge1': Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923),
 'rouge2': Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272),
 'rougeL': Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923)}

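The older Datasets ROUGE metric computes confidence intervals for precision, recall, and F1-score and exposes them as the low, mid, and high attributes of each result. The evaluate outputs shown above are plain floats rather than such intervals.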
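
To give a feel for where low, mid, and high values can come from, here is a sketch of a percentile bootstrap over per-example scores. It illustrates the general technique only; the function name, the 95% level, the resample count, and the fixed seed are all assumptions rather than the library's actual settings.

```python
import random


def bootstrap_interval(values, n_samples=1000, confidence=0.95, seed=0):
    """Return (low, mid, high) percentiles of the resampled means."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_samples)
    )

    def percentile(q):
        return means[int(q * (len(means) - 1))]

    tail = (1 - confidence) / 2
    return percentile(tail), percentile(0.5), percentile(1 - tail)


# Per-example rouge1 values from the use_aggregator=False output above.
f1_per_example = [1.0, 0.6666666666666666]
low, mid, high = bootstrap_interval(f1_per_example)
print(low, mid, high)
```

With only two examples the interval is very wide, which is itself a useful reminder that aggregate ROUGE numbers on tiny samples carry little certainty.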