---
# Text Summarization with PEGASUS-XSUM

This notebook demonstrates the process of summarizing text using the Google PEGASUS-XSUM model from the Hugging Face Transformers library. The PEGASUS-XSUM model is a pre-trained abstractive text summarization model that can generate concise summaries of long text inputs.

The notebook is organized as follows:

1. **Installation and Importing Libraries**: We install the required Hugging Face Transformers library and the Newspaper3k library for extracting article content.

2. **Fetching Article Content**: We use the Newspaper3k library to fetch a long text from a Wikipedia article.

3. **Checking GPU Availability**: We check for the availability of a GPU to run the PEGASUS-XSUM model more efficiently.

4. **Splitting Text into Chunks**: Since the PEGASUS-XSUM model accepts a maximum of 1024 tokens as input, we split the text into smaller chunks to avoid exceeding this limit.

5. **Summarizing Text**: We summarize each chunk of text using the PEGASUS-XSUM model and then concatenate the summaries to produce the final summary.

6. **Evaluating the Summary**: We evaluate the quality of the summary by calculating various evaluation metrics like BLEU, METEOR, and ROUGE scores.

Throughout this notebook, you will learn how to fetch content from a Wikipedia article, split the text into smaller chunks, use the PEGASUS-XSUM model to generate summaries, and evaluate the quality of the generated summaries.

---


# libs install

In [13]:
!pip install transformers
!pip install transformers newspaper3k

# used later for evaluation
import nltk
nltk.download('wordnet')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Get a long text from Wikipedia

In [14]:
from newspaper import Article

url = "https://en.wikipedia.org/wiki/Presidency_of_Nicolas_Sarkozy"

article = Article(url)
article.download()
article.parse()
text = article.text

# check GPU

In [15]:
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("Using CPU")


Using GPU: Tesla T4


# Split text into correct size chunks and summarize chunks

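The notebook assumes a maximum input of 1024 tokens for the summarization model (the actual limit can be read from the tokenizer's `model_max_length`). Instead of tokenizing, `split_text` applies a simple character budget with a safety margin: each chunk holds at most `chunk_size` characters. This is much more conservative than a token limit, because a token usually spans several characters; the log below shows chunks of about 1,014 characters coming out at roughly 195 tokens. We could use a tokenizer to get the correct amounts in each chunk, but the score isn't too affected by this method.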
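A tokenizer-aware variant of the chunking step might look like the sketch below. It packs whole sentences greedily into chunks until a token budget is reached. The names `split_by_token_budget` and `count_tokens` are illustrative and not part of the notebook. The default counter only counts whitespace-separated words so that the example runs without downloading a model, and the sentence splitter is a naive regular expression.

```python
import re
from typing import Callable, List


def split_by_token_budget(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack sentences into chunks of at most `max_tokens` tokens.

    A single sentence longer than the budget is kept as its own chunk
    (this sketch does not split it further).
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0

    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n

    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "One two three. Four five. Six seven eight nine. Ten."
print(split_by_token_budget(demo, max_tokens=5))
# ['One two three. Four five.', 'Six seven eight nine. Ten.']
```

With Transformers, one option would be to pass `count_tokens=lambda s: len(tokenizer.encode(s, add_special_tokens=False))`, with `tokenizer` loaded through `AutoTokenizer.from_pretrained("google/pegasus-xsum")`, and to take the budget from `tokenizer.model_max_length` minus a small margin. Summing per-sentence counts is only an approximation of the token count of the joined chunk.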

In [19]:
import torch
from tqdm import tqdm
from transformers import pipeline

# Report which device will be used.
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    print("Using CPU")

# The pipeline API takes a device index: 0 for the first GPU, -1 for CPU.
device = 0 if torch.cuda.is_available() else -1
# Alternative model: "sshleifer/distilbart-cnn-12-6"
summarizer = pipeline("summarization", device=device, model="google/pegasus-xsum")


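def split_text(text, chunk_size):
    """Split text on whitespace into chunks of at most chunk_size characters.

    A single word longer than chunk_size becomes its own chunk.
    """
    words = text.split()
    chunks = []
    current_chunk = []

    for word in words:
        # The budget is measured in characters, not in words or tokens.
        if len(' '.join(current_chunk)) + len(word) < chunk_size:
            current_chunk.append(word)
        else:
            if current_chunk:  # avoid an empty chunk when the first word is oversized
                chunks.append(' '.join(current_chunk))
            current_chunk = [word]

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks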

def summarize_text(text, max_sentences=1, max_tokens=1024):
    # split_text counts characters, so max_tokens acts as a character budget here;
    # the margin of 10 keeps each chunk slightly below it.
    chunks = split_text(text, max_tokens - 10)
    summaries = []

    for chunk in tqdm(chunks):
        # Allow roughly 5 to 20 generated tokens per requested sentence.
        summary = summarizer(chunk, max_length=max_sentences * 20, min_length=max_sentences * 5, do_sample=False)
        summaries.append(summary[0]['summary_text'])

    summary = ' '.join(summaries)
    # Count non-empty sentences (only needed by the recursive variant below).
    summary_len = len([sentence for sentence in summary.split('.') if sentence.strip() != ""])

    # summarize until 10 sentences left -- recursive summarize, yields poor results
    # if summary_len > 10:
    #   summary = summarize_text(summary, max_sentences=5, max_tokens=1024)

    return summary

# input_text = "Your long text here."
summary = summarize_text(text, 10)
print(summary)


Using GPU: Tesla T4


  0%|          | 0/27 [00:00<?, ?it/s]Your max_length is set to 200, but you input_length is only 197. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=98)
 22%|██▏       | 6/27 [00:13<00:52,  2.49s/it]Your max_length is set to 200, but you input_length is only 194. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=97)
 52%|█████▏    | 14/27 [00:33<00:28,  2.17s/it]Your max_length is set to 200, but you input_length is only 194. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=97)
100%|██████████| 27/27 [01:02<00:00,  2.31s/it]

Nicolas Sarkozy is the sixth president of the French Fifth Republic, and the first conservative president of the republic since the quinquennat reform of 2000, which abolished the two-term limit for presidential elections and replaced it with a five-year term. Nicolas Sarkozy became the first French president since the Fifth Republic to be elected with a majority of votes in the first round of the presidential elections in May 2007 and a majority of votes in the second round of the elections in June 2008.[2] He became the first French president since the Fifth Republic to be elected with a majority of votes in the first round of the presidential elections in May 2007 and a majority of votes in the second round of the elections in June 2008. Nicolas Sarkozy was the French president between 2007 and 2012, during which time the French economy was hit by the global financial crisis and the country's credit rating was cut from AA to A by the ratings agency Standard & Poor's (S&P). President
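The `max_length` warnings in the log appear because the requested output length (200 tokens) exceeds the token length of some input chunks. One way to avoid them is to derive both length bounds from the chunk's token count, as in this sketch. The helper name `length_bounds` and the default `ratio` of 0.5 are assumptions; the ratio mirrors the halved value suggested in the warning text.

```python
from typing import Tuple


def length_bounds(input_tokens: int, max_sentences: int = 1,
                  ratio: float = 0.5) -> Tuple[int, int]:
    """Return (min_length, max_length) for a chunk of `input_tokens` tokens."""
    # Same per-sentence budget as summarize_text: at most 20 and at least
    # 5 tokens per sentence, but never more than `ratio` of the input length.
    max_length = min(max_sentences * 20, int(input_tokens * ratio))
    max_length = max(max_length, 1)
    min_length = min(max_sentences * 5, max_length)
    return min_length, max_length


print(length_bounds(197, max_sentences=10))  # (50, 98)
print(length_bounds(79, max_sentences=10))   # (39, 39)
```

The two values could then be passed to the pipeline call as `min_length` and `max_length`. The token count itself would have to come from the model's tokenizer, which this sketch leaves out.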




In [20]:
summary.split('.')

['Nicolas Sarkozy is the sixth president of the French Fifth Republic, and the first conservative president of the republic since the quinquennat reform of 2000, which abolished the two-term limit for presidential elections and replaced it with a five-year term',
 ' Nicolas Sarkozy became the first French president since the Fifth Republic to be elected with a majority of votes in the first round of the presidential elections in May 2007 and a majority of votes in the second round of the elections in June 2008',
 '[2] He became the first French president since the Fifth Republic to be elected with a majority of votes in the first round of the presidential elections in May 2007 and a majority of votes in the second round of the elections in June 2008',
 " Nicolas Sarkozy was the French president between 2007 and 2012, during which time the French economy was hit by the global financial crisis and the country's credit rating was cut from AA to A by the ratings agency Standard & Poor's (S

In [21]:
print(f"this is the length of the original text : {len(text)} and this is the length of the summary : {len(summary)}")

this is the length of the original text : 27305 and this is the length of the summary : 10134


# Evaluation of the summary

## Helper function

In [22]:
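import nltk
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.meteor_score import single_meteor_score
from nltk.util import ngrams

def evaluation_scores(original_text, summary):
    reference_summary = original_text.split()
    generated_summary = summary.split()

    # Caution: corpus_bleu expects one list of references per hypothesis, i.e.
    # corpus_bleu([[reference_summary]], [generated_summary]). As called here, every
    # word of the reference is treated as a separate reference, so BLEU is close to 0.
    bleu = corpus_bleu([reference_summary], [generated_summary])
    # Caution: single_meteor_score takes (reference, hypothesis); the arguments
    # are passed in the opposite order here.
    meteor = single_meteor_score(generated_summary, reference_summary)
    # Pool the unique 1-, 2- and 3-grams of each text into one set.
    original_ngrams = list(ngrams(original_text.split(), 1)) + list(ngrams(original_text.split(), 2)) + list(ngrams(original_text.split(), 3))
    summary_ngrams = list(ngrams(summary.split(), 1)) + list(ngrams(summary.split(), 2)) + list(ngrams(summary.split(), 3))
    original_ngrams = set(original_ngrams)
    summary_ngrams = set(summary_ngrams)

    overlap = original_ngrams & summary_ngrams
    # These are pooled n-gram overlap ratios rather than standard ROUGE scores.
    rouge_1 = len(overlap) / len(original_ngrams)  # share of the original's n-grams found in the summary
    rouge_2 = len(overlap) / len(summary_ngrams)   # share of the summary's n-grams found in the original
    rouge_l = max(rouge_1, rouge_2)                # larger of the two ratios

    return bleu, meteor, rouge_1, rouge_2, rouge_l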

# Test the function (note: this reassigns `summary`, replacing the generated summary)

original_text = "In an effort to help slow the spread of COVID-19, many countries have implemented social distancing measures, including the closure of non-essential businesses. Despite the challenges, some entrepreneurs have found ways to adapt and even thrive in the new environment. For example, a restaurant in Italy has started offering home delivery, while a clothing store in the United States has shifted to online sales."
summary = "Many countries have closed non-essential businesses to slow the spread of COVID-19. Some entrepreneurs have adapted and thrived, such as a restaurant in Italy offering home delivery and a clothing store in the US shifting to online sales."

bleu, meteor, rouge_1, rouge_2, rouge_l = evaluation_scores(original_text, summary)

print("BLEU score:", bleu)
print("METEOR score:", meteor)
print("ROUGE-1 score:", rouge_1)
print("ROUGE-2 score:", rouge_2)
print("ROUGE-L score:", rouge_l)


BLEU score: 8.726094729337945e-232
METEOR score: 0.6741036650012007
ROUGE-1 score: 0.24431818181818182
ROUGE-2 score: 0.4095238095238095
ROUGE-L score: 0.4095238095238095
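For comparison, here is a minimal pure-Python sketch of the conventional ROUGE-N and ROUGE-L definitions: clipped n-gram overlap and longest common subsequence, each reported as precision, recall and F1. It is not the notebook's `evaluation_scores` function, it skips the stemming and tokenization that dedicated ROUGE packages apply, and the function names are illustrative.

```python
from collections import Counter
from typing import Dict


def _prf(overlap: int, ref_total: int, cand_total: int) -> Dict[str, float]:
    """Turn an overlap count into precision, recall and F1."""
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def rouge_n(reference: str, candidate: str, n: int = 1) -> Dict[str, float]:
    """ROUGE-N on lowercased, whitespace-split tokens."""
    def grams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = grams(reference), grams(candidate)
    overlap = sum((ref & cand).values())  # matches clipped by reference counts
    return _prf(overlap, sum(ref.values()), sum(cand.values()))


def rouge_l(reference: str, candidate: str) -> Dict[str, float]:
    """ROUGE-L based on the longest common subsequence of tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    prev = [0] * (len(cand) + 1)  # previous row of the LCS table
    for r in ref:
        curr = [0]
        for j, c in enumerate(cand, start=1):
            curr.append(prev[j - 1] + 1 if r == c else max(prev[j], curr[j - 1]))
        prev = curr
    return _prf(prev[-1], len(ref), len(cand))


ref_text = "the cat sat on the mat"
cand_text = "the cat is on the mat"
print(rouge_n(ref_text, cand_text, 1)["recall"])  # 5 of 6 unigrams match
print(rouge_n(ref_text, cand_text, 2)["recall"])  # 3 of 5 bigrams match
print(rouge_l(ref_text, cand_text)["recall"])     # LCS has 5 of 6 tokens
```

When a summary is scored against the full source article, as this notebook does, recall is low by construction, so precision or F1 may be the more informative number.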


## Evaluation on Rouge, Meteor, and Bleu

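The scores are very poor, but they should be read with caution. The test cell above reassigned `summary`, and the values below (a BLEU identical to the test's, and a ROUGE-2 of 16/105 against 43/105 in the test) indicate that this cell compared the article with the short COVID-19 test summary rather than with the generated one. In addition, BLEU is close to zero for any input here, because `corpus_bleu` receives the reference without the extra level of nesting it expects.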

In [23]:
bleu, meteor, rouge_1, rouge_2, rouge_l = evaluation_scores(text, summary)

print("BLEU score:", bleu)
print("METEOR score:", meteor)
print("ROUGE-1 score:", rouge_1)
print("ROUGE-2 score:", rouge_2)
print("ROUGE-L score:", rouge_l)

BLEU score: 8.726094729337945e-232
METEOR score: 0.029069767441860468
ROUGE-1 score: 0.0016661459960429033
ROUGE-2 score: 0.1523809523809524
ROUGE-L score: 0.1523809523809524


# Further summary and re-evaluation

In [28]:
summary = summarize_text(summary, 10)
bleu, meteor, rouge_1, rouge_2, rouge_l = evaluation_scores(text, summary)

print("BLEU score:", bleu)
print("METEOR score:", meteor)
print("ROUGE-1 score:", rouge_1)
print("ROUGE-2 score:", rouge_2)
print("ROUGE-L score:", rouge_l)

  0%|          | 0/11 [00:00<?, ?it/s]Your max_length is set to 200, but you input_length is only 194. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=97)
 27%|██▋       | 3/11 [00:05<00:13,  1.70s/it]Your max_length is set to 200, but you input_length is only 194. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=97)
 55%|█████▍    | 6/11 [00:13<00:11,  2.28s/it]Your max_length is set to 200, but you input_length is only 130. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=65)
 64%|██████▎   | 7/11 [00:16<00:10,  2.75s/it]Your max_length is set to 200, but you input_length is only 79. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=39)
 73%|███████▎  | 8/11 [00:17<00:06,  2.26s/it]Your max_length is set to 200, but you input_length is only 155. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=77)
 91%|████

BLEU score: 5.319290575787799e-232
METEOR score: 0.23464927473136332
ROUGE-1 score: 0.025512860564406957
ROUGE-2 score: 0.2387914230019493
ROUGE-L score: 0.2387914230019493


In [29]:
summary.split('.')

["Nicolas Sarkozy was the French president between 2007 and 2012, during which time the French economy was hit by the global financial crisis and the country's credit rating was cut from AA to A by the ratings agency Standard & Poor's (S&P)",
 " French President Nicolas Sarkozy and his wife, Carla Bruni-Sarkozy, left Paris on 14 May for a week-long holiday in the Mediterranean, which included a visit to the island of Gozo, where Sarkozy's father, Pal Sarkozy, is from",
 " Sarkozy and his wife, Carla Bruni-Sarkozy, left Paris on 14 May for a week-long holiday in the Mediterranean, which included a visit to the island of Gozo, where Sarkozy's father, Pal Sarkozy, is from",
 ' French President Nicolas Sarkozy has signed a series of agreements with Libyan leader Muammar Gaddafi, including a $230 million (168 million euros) antitank missile deal, in exchange for the release of five Bulgarian nurses who had been imprisoned in Libya for more than eight years',
 ' French President Nicolas Sark

In [30]:
summary = summarize_text(summary, 10)
bleu, meteor, rouge_1, rouge_2, rouge_l = evaluation_scores(text, summary)

print("BLEU score:", bleu)
print("METEOR score:", meteor)
print("ROUGE-1 score:", rouge_1)
print("ROUGE-2 score:", rouge_2)
print("ROUGE-L score:", rouge_l)

 40%|████      | 2/5 [00:04<00:06,  2.27s/it]Your max_length is set to 200, but you input_length is only 181. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=90)
 60%|██████    | 3/5 [00:05<00:03,  1.86s/it]Your max_length is set to 200, but you input_length is only 134. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=67)
 80%|████████  | 4/5 [00:11<00:03,  3.24s/it]Your max_length is set to 200, but you input_length is only 117. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=58)
100%|██████████| 5/5 [00:14<00:00,  2.80s/it]


BLEU score: 5.5082675278483286e-232
METEOR score: 0.17187886244018644
ROUGE-1 score: 0.00937207122774133
ROUGE-2 score: 0.18181818181818182
ROUGE-L score: 0.18181818181818182
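The repeated passes above can also be written as a loop that keeps the output of every pass, so that each one can be evaluated separately and the best one retained. This is a sketch under stated assumptions: `summarize_in_passes` and `keep_first_half` are hypothetical names, and the stand-in summarizer only truncates text so that the example runs without a model.

```python
from typing import Callable, List


def summarize_in_passes(text: str, summarize_fn: Callable[[str], str],
                        target_chars: int, max_passes: int = 3) -> List[str]:
    """Apply `summarize_fn` repeatedly and return the output of every pass."""
    passes: List[str] = []
    current = text
    for _ in range(max_passes):
        if len(current) <= target_chars:
            break  # already short enough
        shorter = summarize_fn(current)
        if len(shorter) >= len(current):
            break  # no further compression, stop instead of looping
        passes.append(shorter)
        current = shorter
    return passes


def keep_first_half(text: str) -> str:
    """Stand-in summarizer for demonstration only: keep the first half of the words."""
    words = text.split()
    return " ".join(words[: max(1, len(words) // 2)])


history = summarize_in_passes("w " * 64, keep_first_half, target_chars=20)
print([len(p.split()) for p in history])  # [32, 16, 8]
```

In the notebook, `summarize_fn` could be something like `lambda t: summarize_text(t, 10)`. Keeping all intermediate results matters here because the third pass scored worse than the second.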


In [31]:
summary.split('.')

["French President Nicolas Sarkozy and his wife, Carla Bruni-Sarkozy, left Paris on 14 May for a week-long holiday in the Mediterranean, which included a visit to the island of Gozo, where Sarkozy's father, Pal Sarkozy, is from",
 ' European leader to call for an end to the rule of Libyan leader Muammar Gaddafi, who has been in power for more than 40 years and who has been accused by many African leaders of being too pessimistic about the future of the continent, and of failing to address the issue of poverty',
 ' 888-282-0465 888-282-0465 888-282-0465 888-282-0465 technology has been designed and rigorously tested for completeness and security 888-282-0465',
 ' 888-282-0465 888-282-0465 888-282-0465 can also be taught to people at the point of use, but cannot be copied and pasted at the point of use 888-282-0465',
 ' 888-282-0465 can also be taught to people at the point of use, but cannot be copied and pasted at the point of use 888-282-0465',
 ' All photographs are copyrighted',
 " 

# Conclusion

The quality of text extraction matters: in this example, material from the Wikipedia article's sources, such as reference numbers and image credits, took up a growing share of the summary with each pass, at the expense of the actual subject.  
In this scenario, separating the main text from sources and image captions before summarizing could have yielded higher ROUGE/BLEU scores.
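A simple cleaning step along these lines could be sketched as follows. The function name `clean_extracted_text` and the `min_words` threshold are assumptions for illustration, and the short-line rule is only a heuristic that would also drop genuinely short paragraphs.

```python
import re


def clean_extracted_text(text: str, min_words: int = 8) -> str:
    """Drop citation markers and short lines from extracted article text."""
    # Remove bracketed numeric citation markers such as [2] or [15].
    text = re.sub(r"\[\d+\]", "", text)
    kept = []
    for line in text.splitlines():
        line = " ".join(line.split())  # collapse repeated whitespace
        # Short lines are often captions, headings or credits (heuristic).
        if len(line.split()) >= min_words:
            kept.append(line)
    return "\n".join(kept)


raw = (
    "Sarkozy took office in May 2007 after winning the second round.[2]\n"
    "All photographs are copyrighted\n"
    "His term ended in May 2012 when he lost the election to his opponent.[15]"
)
cleaned = clean_extracted_text(raw)
print(cleaned)
```

The cleaned text would then be passed to `summarize_text` in place of the raw `article.text`.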

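The first summarization pass has a lower ROUGE score than the second pass. Although counterintuitive, this may be explained by the removal of some of the numbers from references, which do not reflect the content of the complete article. However, because `summary` had been reassigned by the test cell before the first evaluation ran, the first-pass scores may not describe the generated summary at all, so this comparison should be treated with caution. The third summarization pass has worse scores, as expected.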