# Abstractive Text Summarization Using Pegasus
_By: Ling Li Ya_

References:
1. [Exploring Pegasus - A New Text Summarization NLP Model](https://signal.onepointltd.com/post/102ghb9/exploring-pegasus-a-new-text-summarization-nlp-model)
2. [Notebook referred to prepare this notebook](https://colab.research.google.com/drive/1-zq8AJktuC3gQAHTuSiiZ_qvDl4wK7rq#scrollTo=S3PYeeGuda0m)

In [6]:
!pip install tensorflow



## 1. Install and Import Dependencies

Install `sentencepiece` to be used as a tokenizer for the model

In [8]:
!pip install sentencepiece



Install `transformers` to use its `summarization pipeline`

In [9]:
!pip install transformers



Install `bs4` to use `BeautifulSoup`

In [10]:
!pip install bs4



Import all dependencies

In [11]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizer, pipeline
from bs4 import BeautifulSoup
import torch
import requests



Check whether CUDA (GPU) is correctly installed on device. This is important because tensors will need to be assigned to use GPU instead of CPU. GPU is much faster than CPU in this case due to multi-threading.

In [12]:
torch.cuda.is_available()

True

Check which CPU is being used.

In [13]:
torch.cuda.current_device()

0

Get the name of the GPU being used.

In [14]:
torch.cuda.get_device_name(0)

'GeForce RTX 3050 Ti Laptop GPU'

Total GPUs available on device.

In [15]:
torch.cuda.device_count()

1

## 2. Setup Generator
Deifne and get model

In [18]:
model_name = 'google/pegasus-xsum'
device = 'cuda'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)

Get pipeline text summarization utility

In [19]:
summarizer = pipeline('summarization', model=model, tokenizer=tokenizer)

## 3. Process Input Text
Get input from website URLs

In [20]:
URL = 'https://en.wikipedia.org/wiki/Rococo'

Get HTTP URL using `requests`

In [21]:
r = requests.get(URL)

Parse HTML body returned from the URL and format it to have a better readability

In [22]:
soup = BeautifulSoup(r.text, 'html.parser')

In [23]:
results = soup.find_all(['h1', 'p'])
results[:3]

[<h1 class="firstHeading" id="firstHeading">Rococo</h1>,
 <p class="mw-empty-elt">
 </p>,
 <p><b>Rococo</b> (<span class="rt-commentedText nowrap"><span class="IPA nopopups noexcerpt"><a href="/wiki/Help:IPA/English" title="Help:IPA/English">/<span style="border-bottom:1px dotted"><span title="'r' in 'rye'">r</span><span title="/ə/: 'a' in 'about'">ə</span><span title="/ˈ/: primary stress follows">ˈ</span><span title="'k' in 'kind'">k</span><span title="/oʊ/: 'o' in 'code'">oʊ</span><span title="'k' in 'kind'">k</span><span title="/oʊ/: 'o' in 'code'">oʊ</span></span>/</a></span></span>, <small>also</small> <span class="rt-commentedText nowrap"><small><a href="/wiki/American_English" title="American English">US</a>: </small><span class="IPA nopopups noexcerpt"><a href="/wiki/Help:IPA/English" title="Help:IPA/English">/<span style="border-bottom:1px dotted"><span title="/ˌ/: secondary stress follows">ˌ</span><span title="'r' in 'rye'">r</span><span title="/oʊ/: 'o' in 'code'">oʊ</span><

Text enclosed within the HTML tags are selected and joined together

In [24]:
text = [result.text for result in results]
ARTICLE = ' '.join(text)
ARTICLE[0:1000]

"Rococo \n Rococo (/rəˈkoʊkoʊ/, also US: /ˌroʊkəˈkoʊ/), less commonly Roccoco or Late Baroque, is an exceptionally ornamental and theatrical style of architecture, art and decoration which combines asymmetry, scrolling curves, gilding, white and pastel colors, sculpted molding, and trompe-l'œil frescoes to create surprise and the illusion of motion and drama.  It is often described as the final expression of the Baroque movement.[1]\n The Rococo style began in France in the 1730s as a reaction against the more formal and geometric Style Louis XIV. It was known as the style rocaille, or rocaille style.[2] It soon spread to other parts of Europe, particularly northern Italy, Austria, southern Germany, Central Europe and Russia.[3] It also came to influence the other arts, particularly sculpture, furniture, silverware, glassware, painting, music, and theatre.[4] Although originally a secular style primarily used for interiors of private residences the Rococo had a spiritual aspect to it w

In [None]:
src_text = [
    """Rococo (/rəˈkoʊkoʊ/, also US: /ˌroʊkəˈkoʊ/), less commonly Roccoco or Late Baroque, is an exceptionally ornamental and theatrical style of architecture, art and decoration which combines asymmetry, scrolling curves, gilding, white and pastel colors, sculpted molding, and trompe-l'œil frescoes to create surprise and the illusion of motion and drama. It is often described as the final expression of the Baroque movement.[1] The Rococo style began in France in the 1730s as a reaction against the more formal and geometric Style Louis XIV. It was known as the style rocaille, or rocaille style.[2] It soon spread to other parts of Europe, particularly northern Italy, Austria, southern Germany, Central Europe and Russia.[3] It also came to influence the other arts, particularly sculpture, furniture, silverware, glassware, painting, music, and theatre.[4] Although originally a secular style primarily used for interiors of private residences the Rococo had a spiritual aspect to it which led to its widespread use in church interiors, particularly in Central Europe, Portugal, and South America.[5]""", """The word rococo was first used as a humorous variation of the word rocaille.[6][7] Rocaille was originally a method of decoration, using pebbles, seashells and cement, which was often used to decorate grottoes and fountains since the Renaissance.[8][9] In the late 17th and early 18th century rocaille became the term for a kind of decorative motif or ornament that appeared in the late Style Louis XIV, in the form of a seashell interlaced with acanthus leaves. In 1736 the designer and jeweler Jean Mondon published the Premier Livre de forme rocquaille et cartel, a collection of designs for ornaments of furniture and interior decoration. It was the first appearance in print of the term "rocaille" to designate the style.[10] The carved or molded seashell motif was combined with palm leaves or twisting vines to decorate doorways, furniture, wall panels and other architectural elements.[11]"""
]

## 4. Chunk text

Append <eos> to punctuations that marks the end of a sentence
<br />
Without the <eos> tag, sentences will be split without any punctuation

In [25]:
ARTICLE = ARTICLE.replace('.', '.<eos>')
ARTICLE = ARTICLE.replace('!', '!<eos>')
ARTICLE = ARTICLE.replace('?', '?<eos>')
sentences = ARTICLE.split('<eos>')
sentences[:10]

["Rococo \n Rococo (/rəˈkoʊkoʊ/, also US: /ˌroʊkəˈkoʊ/), less commonly Roccoco or Late Baroque, is an exceptionally ornamental and theatrical style of architecture, art and decoration which combines asymmetry, scrolling curves, gilding, white and pastel colors, sculpted molding, and trompe-l'œil frescoes to create surprise and the illusion of motion and drama.",
 '  It is often described as the final expression of the Baroque movement.',
 '[1]\n The Rococo style began in France in the 1730s as a reaction against the more formal and geometric Style Louis XIV.',
 ' It was known as the style rocaille, or rocaille style.',
 '[2] It soon spread to other parts of Europe, particularly northern Italy, Austria, southern Germany, Central Europe and Russia.',
 '[3] It also came to influence the other arts, particularly sculpture, furniture, silverware, glassware, painting, music, and theatre.',
 '[4] Although originally a secular style primarily used for interiors of private residences the Rococo

Limit the size of text in a chunk so that it is smaller than 500 words
<br />
Split sentences into words (2D array)
<br />
This is to avoid the error as shown below
```py
Token indices sequence length is longer than the specified maximum sequence length for this model (512). Running this sequence through the model will result in indexing errors.
```

In [26]:
max_chunk = 250
current_chunk = 0
chunks = []

for sentence in sentences:
    if len(chunks) == current_chunk + 1:
        # Check if the chunk is less than 500 words
        if len(chunks[current_chunk]) + len(sentence.split(' ')) <= max_chunk:
            chunks[current_chunk].extend(sentence.split(' '))
        # Next chunk
        else:
            current_chunk += 1
            chunks.append(sentence.split(' '))
    else:
        print(current_chunk)
        chunks.append(sentence.split(' '))

print("A total of " + str(current_chunk + 1) + " chunks")
print("A total of " + str(len(chunks[0])) + " words in chunk[0]")

0
A total of 26 chunks
A total of 212 words in chunk[0]


Append words into sentences again where each chunk is ensured to have less than 500 words

In [27]:
for chunk_id in range (len(chunks)):
    chunks[chunk_id] = ' '.join(chunks[chunk_id])

print("A total of " + str(len(chunks[0].split(' '))) + " words in chunk[0]")

A total of 212 words in chunk[0]


## 5. Summarise Text

Summarise based on each chunk

In [45]:
batch = tokenizer(chunks, truncation=True, padding='longest', return_tensors="pt").to(device)
batch

{'input_ids': tensor([[91930, 91930,   143,  ...,     0,     0,     0],
        [ 1126,  2000, 32887,  ...,     0,     0,     0],
        [  139,  4234,   116,  ...,     0,     0,     0],
        ...,
        [  222,   109, 75864,  ...,     0,     0,     0],
        [ 4648,   131,   116,  ...,  2895,   107,     1],
        [55242,  3203, 91911,  ...,     0,     0,     0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0')}

In [46]:
torch.cuda.empty_cache()

In [48]:
# translated = []
# for i in range(len(batch)):
#     translated += model.generate(batch[i])

translated = model.generate(**batch)
translated

KeyError: 'Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers'

In [31]:
res = summarizer(chunks)
res

To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  ..\aten\src\ATen\native\BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)


KeyboardInterrupt: 

[{'summary_text': 'The style rococo was first used as a humorous variation of the word rocaille.'},
 {'summary_text': 'The term rococo was first used in print in 1828 to describe decoration "which belonged to the style of the 18th century."'},
 {'summary_text': 'A chronology of key events:'},
 {'summary_text': 'Rocaille was a style of architecture and furniture developed in France in the 17th and 18th centuries.'},
 {'summary_text': 'All images are copyrighted.'},
 {'summary_text': 'The Venetian Rococo was one of the most popular styles of decoration in Europe in the 17th and 18th centuries, and was influenced by the French, Italian and German styles.'},
 {'summary_text': 'In our series of letters from African journalists, film-maker and columnist Farai Sevenzo looks at the influence of French Rococo on German architecture.'},
 {'summary_text': "One of the most striking features of the Prince-Bishop's residence in Munich is the stairway."},
 {'summary_text': 'All images are copyrighted

## 6. Formatting Text
Preprocessing: format the `dict` object into a `string`.

In [18]:
summary = ''
for result in res:
    summary += ''.join(str(val.capitalize()) + "\n" for _, val in result.items())

summary = summary.replace(' .', '.')
summary = summary.replace(" !", "!")
summary = summary.replace(" ?", "?")

## 7. Results

Some statistics and the final result.

In [19]:
words_after = len(summary.split(' '))
words_before = len(ARTICLE)
reduced_by = (words_before - words_after) / words_before * 100

print("Number of words in summary: " + str(words_after))
print("Number of words in original article: " + str(words_before))
print("Reduced by: " + str(round(reduced_by, 2)) + "%\n")
print(summary)

Number of words in summary: 365
Number of words in original article: 163
Reduced by: -123.93%

The style rococo was first used as a humorous variation of the word rocaille.
The term rococo was first used in print in 1828 to describe decoration "which belonged to the style of the 18th century."
A chronology of key events:
Rocaille was a style of architecture and furniture developed in france in the 17th and 18th centuries.
All images are copyrighted.
The venetian rococo was one of the most popular styles of decoration in europe in the 17th and 18th centuries, and was influenced by the french, italian and german styles.
In our series of letters from african journalists, film-maker and columnist farai sevenzo looks at the influence of french rococo on german architecture.
One of the most striking features of the prince-bishop's residence in munich is the stairway.
All images are copyrighted.
The history of british furniture can be traced back to the 17th century.
The british rococo was in