# Summary evaluation

Today we'll take a look at how we can evaluate the quality of model-generated summaries in different ways.

## Install packages

Tip: You might need to restart the Jupyter kernel after installation.

In [1]:
%pip install rouge_score
%pip install bert_score 
%pip install blanc 
%pip install nltk 
%pip install sentencepiece 
%pip install protobuf 
%pip install transformers 
%pip install datasets 
%pip install spacy
%pip install evaluate
!python -m spacy download en_core_web_sm

## Load the data

We'll use a small slice of the English part of the `xlsum` dataset from the `datasets` library. You can take a look at what kind of data this includes [here](https://huggingface.co/datasets/csebuetnlp/xlsum).

In [2]:
from datasets import load_dataset

ds = load_dataset("csebuetnlp/xlsum", "english", split='train[:1%]')

In [3]:
ds

Dataset({
    features: ['id', 'url', 'title', 'summary', 'text'],
    num_rows: 3065
})

The articles are in the `text` column and the summaries are in the `summary` column. Let's extract them and take a look at a few examples.

In [4]:
articles = ds["text"][0:10]
articles

 'Atlantis Resources unveiled the marine energy device at Invergordon ahead of it being shipped to Kirkwall. Trials on the device will now be run at the European Marine Energy Centre test site off Eday. The device stands 22.5m (73ft) tall, weighs 1,300 tonnes and has two sets of blades on a single unit. It could generate enough power for 1,000 homes.',
 'Police were called to the scene outside the Coral shop on Compton Road in Harehills just before 14:00 BST. The man was taken to hospital for treatment but his condition is not known. West Yorkshire Police said the area has been cordoned off and officers remain at the scene. The force has appealed for information.',
 'Anthony ZurcherNorth America reporter@awzurcheron Twitter With tensions rising between the US and Iran, the long-term consequences will largely depend on the nature of Iran\'s response to the attack and the intensity of any conflict that follows. If the end result is a US withdrawal from Iraq, the politics of the situation

In [5]:
reference_summaries = ds["summary"][0:10]
reference_summaries

['Winds could reach gale force in Wales with stormy weather set to hit the whole of the country this week.',
 'The massive tidal turbine AK1000 has been installed in 35m (114.8ft) of water at a test site in Orkney.',
 'A man has been stabbed in broad daylight outside a betting shop in Leeds.',
 'It was inevitable that the fallout from the US airstrike that killed Iranian General Qasem Soleimani would spill into presidential politics. Everything spills into presidential politics these days, and this is without a doubt a major story.',
 'Week four of social distancing is starting to take its toll.',
 'A 37-year-old man has been arrested as part an ongoing investigation into criminality linked to the North Antrim Ulster Defence Association (UDA).',
 'Electric buses will soon be running on the roads in Coventry.',
 'A Jersey deputy is calling on the number of States members to be reduced more than current proposals.',
 'About 200 posts are to go at the Boots site in Nottingham.',
 'A degre

Discuss:
- Based on these examples, what do you think of the quality of the dataset?
- Do you foresee any potential pitfalls for evaluation, based on your observations?

Let's take a look at the density of the summaries.

In [None]:
from utils.fragments import Fragments

fragment = [Fragments(summary, article, lang="en") for summary, article in zip(reference_summaries, articles)]
density = [frag.density() for frag in fragment]

In [None]:
len(list(filter(lambda x: x <= 1.5, density))) / len(density)

Recall that summaries with a density below 1.5 are considered abstractive, so these reference summaries appear to be highly abstractive.
However, density is not a perfect measure of how abstractive a summary is:
- Can you think of a way we might be able to "game" the density metric?
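
For intuition about what the density value captures, here is a minimal sketch of extractive fragment density: the summary is greedily split into the longest token runs that also appear in the article, and density is the mean squared fragment length per summary token. The function names and the lowercase, whitespace-based tokenisation are assumptions for illustration; the `utils.fragments` module used above may tokenise and normalise differently.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily collect the longest token runs shared by summary and article."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0
        # Try every start position in the article and keep the longest match.
        for j in range(len(article_tokens)):
            k = 0
            while (
                i + k < len(summary_tokens)
                and j + k < len(article_tokens)
                and summary_tokens[i + k] == article_tokens[j + k]
            ):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
            i += best
        else:
            i += 1  # token does not occur in the article at all
    return fragments


def fragment_density(article, summary):
    """Mean squared fragment length per summary token (0.0 for an empty summary)."""
    article_tokens = article.lower().split()  # assumption: naive tokenisation
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    fragments = extractive_fragments(article_tokens, summary_tokens)
    return sum(len(f) ** 2 for f in fragments) / len(summary_tokens)
```

A summary copied verbatim from the article gets a density equal to its length in tokens, while a summary that shares no tokens with the article gets 0.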

## Generating summaries
Now let's generate some summaries using a pre-trained model. We'll use the `mt5-small` model from the `transformers` library.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-small"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, min_length=10, max_length=50)

To make everything a bit easier for ourselves, let's make a function which:
1. Takes an input text
2. Tokenises the text (remember to set the padding and truncation arguments to True)
3. Generates a summary based on the tokenised input (and prompt, if you're so inclined)
4. Decodes the generated summary from tokens into words, and
5. Returns the output

(Hint: there is one potential solution in the class_8_solution notebook, if you're in need :-)).
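
One possible shape for such a function is sketched below. This is not the solution from the class_8_solution notebook: the name `summarise`, the explicit `tokenizer` and `model` parameters, and the default `"summarize: "` prompt are all assumptions made for illustration.

```python
def summarise(text, tokenizer, model, prompt="summarize: "):
    """Tokenise the (prompted) text, generate a summary, and decode it to a string."""
    # Steps 1-2: tokenise with padding and truncation enabled.
    inputs = tokenizer(
        prompt + text,
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    # Step 3: generate output token ids from the tokenised input.
    output_ids = model.generate(**inputs)
    # Steps 4-5: decode the first (and only) generated sequence and return it.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the `tokenizer` and `model` loaded above, a one-argument version could be defined as `your_pipeline_function = lambda text: summarise(text, tokenizer, model)`.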

Now let's use that function to generate some summaries for the articles in the dataset.

In [None]:
your_pipeline_function(articles[0])

In [None]:
generated_summaries = [your_pipeline_function(article) for article in articles]

In [None]:
generated_summaries

## Evaluation
Now let's evaluate the quality of the generated summaries with some commonly used metrics.

In [None]:
from evaluate import load

rouge = load("rouge")
rouge.compute(references=reference_summaries, predictions=generated_summaries)

We can also take a look at the ROUGE scores for the individual summaries:

In [None]:
rouge.compute(references=reference_summaries, predictions=generated_summaries, use_aggregator=False)
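
To make it concrete what a ROUGE-1 score measures, the following sketch computes clipped unigram overlap between one reference and one prediction by hand. The function name `rouge1` and the lowercase, whitespace-based tokenisation are assumptions for illustration; the `rouge` metric loaded above applies its own tokenisation, so its numbers may differ slightly.

```python
from collections import Counter


def rouge1(reference, prediction):
    """ROUGE-1 precision, recall and F1 from clipped unigram overlap."""
    ref_counts = Counter(reference.lower().split())    # assumption: naive tokenisation
    pred_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((ref_counts & pred_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Precision penalises predictions that contain many words missing from the reference, while recall penalises predictions that leave out reference words.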

The BERTScore metric does not use an aggregator, but we can average the scores ourselves to get an overall score.

In [None]:
bertscore = load("bertscore")
bertscores = bertscore.compute(references=reference_summaries, predictions=generated_summaries, lang="en")
bertscores

In [None]:
import numpy as np

np.mean(bertscores["precision"]), np.mean(bertscores["recall"]), np.mean(bertscores["f1"])

In [None]:
import nltk

nltk.download('punkt_tab')

We can also try a reference-free metric, such as BLANC, in case we do not have access to reference summaries, or we do not want to rely on them due to quality, etc.

In [None]:
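from blanc import BlancHelp

# Use a separate variable name so the `blanc` module is not overwritten when the cell is re-run
blanc_help = BlancHelp()
blanc_help.eval_pairs(articles, generated_summaries)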

Discuss:
- What do these values tell us about the quality of the generated summaries?
- What are the strengths and weaknesses of using reference-free metrics?
- What are the potential weaknesses of using a lesser-known metric?

## Exercise

Now, the summaries we generated aren't exactly great, likely because the mt5 model was not fine-tuned for that purpose.
- Try to generate 10 new summaries using a model that has been fine-tuned for summarisation (e.g., our old friend, flan-t5-small)
- When you have the summaries, evaluate them using the same quantitative metrics as before
- Then try to conduct a qualitative evaluation of the summaries: in your groups, decide on some evaluation criteria (e.g., ranking, "stars", etc.), evaluate the summaries based on these criteria, and compare your results within the group and with the quantitative metrics

### Bonus exercise
Try to create an LLM judge that can evaluate the quality of the summaries based on the criteria you defined.
- Load in a generative pre-trained model from Hugging Face
- Prompt it with your evaluation criteria
- Compare its evaluation with your own
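
As a starting point, the prompting and answer-parsing steps could look roughly like the sketch below. The helper names `build_judge_prompt` and `parse_rating`, the prompt wording, and the 1 to 5 scale are all hypothetical choices; the model call itself is left out, since it depends on which model you load.

```python
import re


def build_judge_prompt(article, summary, criteria):
    """Assemble a rating prompt from an article, a summary and a list of criteria."""
    criteria_text = "\n".join(f"- {criterion}" for criterion in criteria)
    return (
        "You are evaluating a summary of a news article.\n"
        f"Criteria:\n{criteria_text}\n\n"
        f"Article:\n{article}\n\n"
        f"Summary:\n{summary}\n\n"
        "Rate the summary from 1 (poor) to 5 (excellent). "
        "Answer with a single number.\n"
        "Rating:"
    )


def parse_rating(model_output, low=1, high=5):
    """Return the first integer in the model output, or None if it is missing or out of range."""
    match = re.search(r"\d+", model_output)
    if match is None:
        return None
    rating = int(match.group())
    return rating if low <= rating <= high else None
```

Returning `None` for unparseable or out-of-range answers makes it easy to count how often the judge fails to follow the requested format before comparing its ratings with your own.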