# Abstractive Text Summarization Using T5
_By Ling Li Ya_

References:
1. [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683)

## 1. Install and Import Dependencies
Install `pytorch`

In [1]:
!pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html


Install `transformers` to use `pipeline`

In [2]:
!pip install transformers



Install `bs4` to use `BeautifulSoup`

In [3]:
!pip install bs4



In [3]:
from transformers import pipeline
from bs4 import BeautifulSoup
import requests



## 2. Setup Generator
Create a text summarization `pipeline` backed by the `t5-base` model and tokenizer

In [4]:
summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="pt")

## 3. Process Input Text
Set the URL of the web page to summarise

In [5]:
URL = 'https://en.wikipedia.org/wiki/Rococo'

Fetch the page over HTTP using `requests`

In [6]:
r = requests.get(URL)

Parse the HTML body returned from the URL into a `BeautifulSoup` object so that its tags can be searched

In [7]:
soup = BeautifulSoup(r.text, 'html.parser')

Find all elements with `h1` and `p` tags

In [16]:
results = soup.find_all(['h1', 'p'])
results[:3]

[<h1 class="firstHeading" id="firstHeading">Rococo</h1>,
 <p class="mw-empty-elt">
 </p>,
 <p><b>Rococo</b> (<span class="rt-commentedText nowrap"><span class="IPA nopopups noexcerpt"><a href="/wiki/Help:IPA/English" title="Help:IPA/English">/<span style="border-bottom:1px dotted"><span title="'r' in 'rye'">r</span><span title="/ə/: 'a' in 'about'">ə</span><span title="/ˈ/: primary stress follows">ˈ</span><span title="'k' in 'kind'">k</span><span title="/oʊ/: 'o' in 'code'">oʊ</span><span title="'k' in 'kind'">k</span><span title="/oʊ/: 'o' in 'code'">oʊ</span></span>/</a></span></span>, <small>also</small> <span class="rt-commentedText nowrap"><small><a href="/wiki/American_English" title="American English">US</a>: </small><span class="IPA nopopups noexcerpt"><a href="/wiki/Help:IPA/English" title="Help:IPA/English">/<span style="border-bottom:1px dotted"><span title="/ˌ/: secondary stress follows">ˌ</span><span title="'r' in 'rye'">r</span><span title="/oʊ/: 'o' in 'code'">oʊ</span><

The text enclosed within the HTML tags is extracted and joined together

In [9]:
text = [result.text for result in results]
ARTICLE = ' '.join(text)
ARTICLE[0:1000]

"Rococo \n Rococo (/rəˈkoʊkoʊ/, also US: /ˌroʊkəˈkoʊ/), less commonly Roccoco or Late Baroque, is an exceptionally ornamental and theatrical style of architecture, art and decoration which combines asymmetry, scrolling curves, gilding, white and pastel colors, sculpted molding, and trompe-l'œil frescoes to create surprise and the illusion of motion and drama.  It is often described as the final expression of the Baroque movement.[1]\n The Rococo style began in France in the 1730s as a reaction against the more formal and geometric Style Louis XIV. It was known as the style rocaille, or rocaille style.[2] It soon spread to other parts of Europe, particularly northern Italy, Austria, southern Germany, Central Europe and Russia.[3] It also came to influence the other arts, particularly sculpture, furniture, silverware, glassware, painting, music, and theatre.[4] Although originally a secular style primarily used for interiors of private residences the Rococo had a spiritual aspect to it w

## 4. Chunk Text

Append an `<eos>` marker to the punctuation that marks the end of a sentence, then split on that marker
<br />
Without the `<eos>` marker, splitting on the punctuation itself would remove it from the sentences

In [10]:
ARTICLE = ARTICLE.replace('.', '.<eos>')
ARTICLE = ARTICLE.replace('!', '!<eos>')
ARTICLE = ARTICLE.replace('?', '?<eos>')
sentences = ARTICLE.split('<eos>')
sentences[:10]

["Rococo \n Rococo (/rəˈkoʊkoʊ/, also US: /ˌroʊkəˈkoʊ/), less commonly Roccoco or Late Baroque, is an exceptionally ornamental and theatrical style of architecture, art and decoration which combines asymmetry, scrolling curves, gilding, white and pastel colors, sculpted molding, and trompe-l'œil frescoes to create surprise and the illusion of motion and drama.",
 '  It is often described as the final expression of the Baroque movement.',
 '[1]\n The Rococo style began in France in the 1730s as a reaction against the more formal and geometric Style Louis XIV.',
 ' It was known as the style rocaille, or rocaille style.',
 '[2] It soon spread to other parts of Europe, particularly northern Italy, Austria, southern Germany, Central Europe and Russia.',
 '[3] It also came to influence the other arts, particularly sculpture, furniture, silverware, glassware, painting, music, and theatre.',
 '[4] Although originally a secular style primarily used for interiors of private residences the Rococo
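
The sentence list above shows two side effects of the marker approach: citation markers such as `[1]` end up at the start of the following sentence, and sentences keep their leading whitespace. A minimal sketch of a helper that strips the markers first and then applies the same `<eos>` splitting is given below. The names `split_sentences` and `CITATION` are hypothetical and not part of the original notebook, and the pattern assumes citations are purely numeric, as in the Wikipedia text above.

```python
import re

# Assumption: citation markers look like "[1]", "[23]", and so on.
CITATION = re.compile(r"\[\d+\]")


def split_sentences(text, marker="<eos>"):
    """Split text after '.', '!' and '?' while keeping the punctuation."""
    text = CITATION.sub("", text)
    for mark in ".!?":
        text = text.replace(mark, mark + marker)
    # Collapse whitespace inside each sentence and drop empty fragments.
    sentences = (" ".join(part.split()) for part in text.split(marker))
    return [s for s in sentences if s]


sample = "It is the final expression of the Baroque.[1]\n It began in France. Did it spread? Yes!"
print(split_sentences(sample))
```

Like the cell above, this splits on every period, including those in abbreviations and decimal numbers, so it is only a rough sentence splitter.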

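Limit each chunk to at most 500 words, because the model only accepts a limited number of tokens per input
<br />
Each sentence is split into words, so every chunk is a list of words and `chunks` is a 2D array
<br />
This is intended to avoid a warning like the one shown below. Note that a word can map to more than one token, so a 500-word chunk can still exceed the model's token limit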
```py
Token indices sequence length is longer than the specified maximum sequence length for this model (1024). Running this sequence through the model will result in indexing errors.
```

In [11]:
max_chunk = 500
current_chunk = 0
chunks = []

for sentence in sentences:
    if len(chunks) == current_chunk + 1:
        # Check if the chunk is less than 500 words
        if len(chunks[current_chunk]) + len(sentence.split(' ')) <= max_chunk:
            chunks[current_chunk].extend(sentence.split(' '))
        # Next chunk
        else:
            current_chunk += 1
            chunks.append(sentence.split(' '))
    else:
        print(current_chunk)
        chunks.append(sentence.split(' '))

print("A total of " + str(current_chunk + 1) + " chunks")
print("A total of " + str(len(chunks[0])) + " words in chunk[0]")

0
A total of 12 chunks
A total of 488 words in chunk[0]
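
The loop above can also be packaged as a small function, which makes the chunking easy to check on toy input. This is a sketch rather than the notebook's exact code: `chunk_sentences` is a hypothetical name, it uses `str.split()` (any whitespace) instead of `split(' ')` so that empty strings are not counted as words, and it returns the chunks as joined strings directly.

```python
def chunk_sentences(sentences, max_words=500):
    """Greedily pack whole sentences into chunks of at most max_words words."""
    chunks = []
    current = []  # words collected for the chunk being built
    for sentence in sentences:
        words = sentence.split()
        # Start a new chunk if this sentence would push the current one over the limit.
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


toy = ["a b c.", " d e.", " f g h i."]
print(chunk_sentences(toy, max_words=5))
```

Sentences are never split, so a single sentence longer than `max_words` still ends up as one oversized chunk.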


Join the words of each chunk back into a single string; each chunk holds at most 500 words

In [12]:
for chunk_id in range(len(chunks)):
    chunks[chunk_id] = ' '.join(chunks[chunk_id])

print("A total of " + str(len(chunks[0].split(' '))) + " words in chunk[0]")

A total of 488 words in chunk[0]


## 5. Summarise Text

Summarise each chunk separately

In [13]:
res = summarizer(chunks)
res

Token indices sequence length is longer than the specified maximum sequence length for this model (893 > 512). Running this sequence through the model will result in indexing errors
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  ..\aten\src\ATen\native\BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)


[{'summary_text': 'the Rococo style began in the 1730s as a reaction against the more formal and geometric Style Louis XIV . the term rocaille was first used in print in 1825 to describe decoration which was "out of style and old-fashioned" since the mid-19th century, the term has been accepted by art historians .'},
 {'summary_text': 'the Rocaille style, or French Rococo, appeared in Paris during the reign of Louis XV . it was used particularly in salons, a new style of room designed to impress and entertain guests . the furniture designers and craftsmen in the style included juste-Aurele Meissonier and Nicolas Pineau .'},
 {'summary_text': 'the Venetian Rococo style is deeply anchored in popular culture . it is characterized by a light-filled weightlessness, festive cheerfulness and movement . the style reached its peak in southern Germany and Austria from the 1730s until the 1770s .'},
 {'summary_text': 'François de Cuvilliés was one of the first to create a Rococo building in Germa

Join the `summary_text` of each `dict` in the result list into a single string, then remove the space before sentence-ending punctuation.

In [19]:
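# Each result is a dict with a single 'summary_text' entry; put one summary per line.
# Note that str.capitalize() also lower-cases the rest of the text.
summary = ''
for result in res:
    summary += result['summary_text'].capitalize() + "\n"

# Remove the space that appears before sentence-ending punctuation in the model output
summary = summary.replace(' .', '.')
summary = summary.replace(" !", "!")
summary = summary.replace(" ?", "?")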
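
Because `str.capitalize()` lower-cases everything after the first character, proper nouns in the summaries lose their capitals. A minimal sketch of an alternative is shown below: it upper-cases only the first letter of each sentence and leaves the rest untouched. The name `format_summaries` is hypothetical, and the pattern only handles sentences that start with an ASCII lower-case letter.

```python
import re


def format_summaries(results):
    """Join pipeline results into one string, one summary per line."""
    lines = []
    for result in results:
        text = result["summary_text"]
        # Remove the whitespace before sentence-ending punctuation.
        text = re.sub(r"\s+([.!?])", r"\1", text)
        # Upper-case the first letter of each sentence, keep everything else as is.
        text = re.sub(
            r"(^|[.!?]\s+)([a-z])",
            lambda m: m.group(1) + m.group(2).upper(),
            text,
        )
        lines.append(text)
    return "\n".join(lines)


example = [
    {"summary_text": "the Rococo style began in France . it spread to Europe ."},
    {"summary_text": "the style reached its peak in Germany !"},
]
print(format_summaries(example))
```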

Some statistics and the final result. Both lengths are counted in words so that the reduction compares like with like.

In [None]:
words_after = len(summary.split(' '))
# Count the words in the chunks that were summarised; len(ARTICLE) would count characters instead
words_before = sum(len(chunk.split(' ')) for chunk in chunks)
reduced_by = (words_before - words_after) / words_before * 100

print("Number of words in summary: " + str(words_after))
print("Number of words in original article: " + str(words_before))
print("Reduced by: " + str(round(reduced_by, 2)) + "%\n")
print(summary)

Number of words in summary: 483

The rococo style began in the 1730s as a reaction against the more formal and geometric style louis xiv. the term rocaille was first used in print in 1825 to describe decoration which was "out of style and old-fashioned" since the mid-19th century, the term has been accepted by art historians.
the rocaille style, or french rococo, appeared in paris during the reign of louis xv. it was used particularly in salons, a new style of room designed to impress and entertain guests. the furniture designers and craftsmen in the style included juste-aurele meissonier and nicolas pineau.
the venetian rococo style is deeply anchored in popular culture. it is characterized by a light-filled weightlessness, festive cheerfulness and movement. the style reached its peak in southern germany and austria from the 1730s until the 1770s.
françois de cuvilliés was one of the first to create a rococo building in ger
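
The reduction figure is simple arithmetic on two word counts, sketched below as a helper. `reduction_percent` and `word_count` are hypothetical names; words are counted with `str.split()`, and an empty original returns 0.0 to avoid dividing by zero.

```python
def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def reduction_percent(original, summary):
    """Return how much shorter the summary is than the original, in percent of words."""
    before = word_count(original)
    after = word_count(summary)
    if before == 0:
        return 0.0
    return 100 * (before - after) / before


original_text = "one two three four five six seven eight nine ten"
summary_text = "one two"
print(round(reduction_percent(original_text, summary_text), 2))
```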