# Abstractive Text Summarization Using T5
_By Ling Li Ya_

References:
1. [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683)
2. [Truecasing in Natural Language Processing](https://towardsdatascience.com/truecasing-in-natural-language-processing-12c4df086c21)
3. [POS Tag List Reference](https://stackoverflow.com/questions/29332851/what-does-nn-vbd-in-dt-nns-rb-means-in-nltk)

## 1. Install and Import Dependencies
Install `pytorch`

In [7]:
!pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html


You should consider upgrading via the 'C:\Users\liana\AppData\Local\Programs\Python\Python38\python.exe -m pip install --upgrade pip' command.


Install `transformers` to use its `summarization pipeline`

In [8]:
!pip install transformers





Install `bs4` to use `BeautifulSoup`

In [9]:
!pip install bs4





Install `stanfordnlp` to use its POS (part-of-speech) processor pipeline

In [10]:
!pip install stanfordnlp





Import all dependencies

In [11]:
from transformers import pipeline, MT5Model, T5Tokenizer
from bs4 import BeautifulSoup
import requests
import stanfordnlp

## 2. Setup Generator
Create a summarization `pipeline` that uses the `google/mt5-small` checkpoint for both the model and the tokenizer

In [12]:
summarizer = pipeline("summarization", model="google/mt5-small", tokenizer="google/mt5-small", framework="pt")

Downloading: 100%|██████████| 553/553 [00:00<00:00, 277kB/s]
Downloading: 100%|██████████| 1.20G/1.20G [01:44<00:00, 11.4MB/s]
Downloading: 100%|██████████| 82.0/82.0 [00:00<00:00, 81.8kB/s]
Downloading: 100%|██████████| 4.31M/4.31M [00:02<00:00, 1.55MB/s]
Downloading: 100%|██████████| 99.0/99.0 [00:00<00:00, 33.0kB/s]


## 3. Process Input Text
Set the URL of the web page to summarise

In [13]:
URL = 'https://en.wikipedia.org/wiki/Rococo'

Fetch the page over HTTP using `requests`

In [14]:
r = requests.get(URL)

Parse the HTML body returned from the URL so that its elements can be searched

In [15]:
soup = BeautifulSoup(r.text, 'html.parser')

Find all text chunks with 'h1' and 'p' tags

In [16]:
results = soup.find_all(['h1', 'p'])
results[:3]

[<h1 class="firstHeading" id="firstHeading">Rococo</h1>,
 <p class="mw-empty-elt">
 </p>,
 <p><b>Rococo</b> (<span class="rt-commentedText nowrap"><span class="IPA nopopups noexcerpt"><a href="/wiki/Help:IPA/English" title="Help:IPA/English">/<span style="border-bottom:1px dotted"><span title="'r' in 'rye'">r</span><span title="/ə/: 'a' in 'about'">ə</span><span title="/ˈ/: primary stress follows">ˈ</span><span title="'k' in 'kind'">k</span><span title="/oʊ/: 'o' in 'code'">oʊ</span><span title="'k' in 'kind'">k</span><span title="/oʊ/: 'o' in 'code'">oʊ</span></span>/</a></span></span>, <small>also</small> <span class="rt-commentedText nowrap"><small><a href="/wiki/American_English" title="American English">US</a>: </small><span class="IPA nopopups noexcerpt"><a href="/wiki/Help:IPA/English" title="Help:IPA/English">/<span style="border-bottom:1px dotted"><span title="/ˌ/: secondary stress follows">ˌ</span><span title="'r' in 'rye'">r</span><span title="/oʊ/: 'o' in 'code'">oʊ</span><

The text enclosed within the selected HTML tags is extracted and joined into a single string

In [17]:
text = [result.text for result in results]
ARTICLE = ' '.join(text)
ARTICLE[0:1000]

"Rococo \n Rococo (/rəˈkoʊkoʊ/, also US: /ˌroʊkəˈkoʊ/), less commonly Roccoco or Late Baroque, is an exceptionally ornamental and theatrical style of architecture, art and decoration which combines asymmetry, scrolling curves, gilding, white and pastel colors, sculpted molding, and trompe-l'œil frescoes to create surprise and the illusion of motion and drama.  It is often described as the final expression of the Baroque movement.[1]\n The Rococo style began in France in the 1730s as a reaction against the more formal and geometric Style Louis XIV. It was known as the style rocaille, or rocaille style.[2] It soon spread to other parts of Europe, particularly northern Italy, Austria, southern Germany, Central Europe and Russia.[3] It also came to influence the other arts, particularly sculpture, furniture, silverware, glassware, painting, music, and theatre.[4] Although originally a secular style primarily used for interiors of private residences the Rococo had a spiritual aspect to it w

## 4. Chunk text

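Append an `<eos>` marker after each punctuation mark that ends a sentence (`.`, `!`, `?`), then split the text on that marker
<br />
Splitting on the punctuation itself would discard it; splitting on `<eos>` keeps each sentence's closing punctuation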

In [18]:
ARTICLE = ARTICLE.replace('.', '.<eos>')
ARTICLE = ARTICLE.replace('!', '!<eos>')
ARTICLE = ARTICLE.replace('?', '?<eos>')
sentences = ARTICLE.split('<eos>')
sentences[:10]

["Rococo \n Rococo (/rəˈkoʊkoʊ/, also US: /ˌroʊkəˈkoʊ/), less commonly Roccoco or Late Baroque, is an exceptionally ornamental and theatrical style of architecture, art and decoration which combines asymmetry, scrolling curves, gilding, white and pastel colors, sculpted molding, and trompe-l'œil frescoes to create surprise and the illusion of motion and drama.",
 '  It is often described as the final expression of the Baroque movement.',
 '[1]\n The Rococo style began in France in the 1730s as a reaction against the more formal and geometric Style Louis XIV.',
 ' It was known as the style rocaille, or rocaille style.',
 '[2] It soon spread to other parts of Europe, particularly northern Italy, Austria, southern Germany, Central Europe and Russia.',
 '[3] It also came to influence the other arts, particularly sculpture, furniture, silverware, glassware, painting, music, and theatre.',
 '[4] Although originally a secular style primarily used for interiors of private residences the Rococo
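
The output above shows two side effects of splitting on every `.`, `!` or `?`: citation markers such as `[1]` end up at the start of the next sentence, and any full stop inside a sentence would also trigger a split. A minimal alternative sketch, assuming a regex-based splitter is acceptable, is shown below. The function name `split_sentences` and the sample string are hypothetical and not part of the original notebook.

```python
import re

def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    # Drop Wikipedia-style citation markers such as [1] or [23]
    text = re.sub(r'\[\d+\]', '', text)
    # Split on whitespace that follows '.', '!' or '?'; the lookbehind
    # leaves the punctuation attached to its sentence
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]

sample = ("It is the final expression of the Baroque movement.[1]\n "
          "The style began in France. It spread quickly!")
print(split_sentences(sample))
```

Because the split requires whitespace after the punctuation, abbreviations followed by a space (for example "e.g. this") would still be split; a dedicated sentence tokenizer would be needed to handle those cases.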

Limit each chunk to at most 500 words. The model's limit is measured in tokens rather than words, so the word count is only a rough proxy for staying under the maximum sequence length
<br />
Each sentence is split into words, so `chunks` is a list of word lists (a 2D array)
<br />
This avoids the warning shown below
```text
Token indices sequence length is longer than the specified maximum sequence length for this model (1024). Running this sequence through the model will result in indexing errors.
```

In [19]:
max_chunk = 500
current_chunk = 0
chunks = []

for sentence in sentences:
    if len(chunks) == current_chunk + 1:
        # Check if the chunk is less than 500 words
        if len(chunks[current_chunk]) + len(sentence.split(' ')) <= max_chunk:
            chunks[current_chunk].extend(sentence.split(' '))
        # Next chunk
        else:
            current_chunk += 1
            chunks.append(sentence.split(' '))
    else:
        print(current_chunk)
        chunks.append(sentence.split(' '))

print("A total of " + str(current_chunk + 1) + " chunks")
print("A total of " + str(len(chunks[0])) + " words in chunk[0]")

0
A total of 12 chunks
A total of 488 words in chunk[0]


Join the words of each chunk back into a single string; each chunk holds at most 500 words

In [20]:
chunks = [' '.join(chunk) for chunk in chunks]

print("A total of " + str(len(chunks[0].split(' '))) + " words in chunk[0]")

A total of 488 words in chunk[0]
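
The two chunking cells above can also be folded into one helper. The sketch below is an illustration under assumptions, not the notebook's own code: the name `chunk_sentences` and the tiny demo input are hypothetical, and it uses `str.split()` (any whitespace) instead of `split(' ')`, so word counts may differ slightly from the figures printed above.

```python
def chunk_sentences(sentences, max_words=500):
    """Greedily pack sentences into chunks of at most max_words words."""
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = sentence.split()
        # Start a new chunk when adding this sentence would exceed the limit
        if current and count + len(words) > max_words:
            chunks.append(' '.join(current))
            current, count = [], 0
        current.extend(words)
        count += len(words)
    # Flush the final, partially filled chunk
    if current:
        chunks.append(' '.join(current))
    return chunks

demo = chunk_sentences(["a b c.", "d e.", "f g h i."], max_words=5)
print(demo)
```

A single sentence longer than `max_words` still becomes its own oversized chunk, so very long sentences would need extra handling.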


## 5. Summarise Text

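Summarise each chunk. `google/mt5-small` is a pretrained checkpoint that has not been fine-tuned for summarisation, so its output consists of the `<extra_id_0>` sentinel token from pretraining followed by a short fragment rather than a real summary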

In [21]:
res = summarizer(chunks)
res

[{'summary_text': '<extra_id_0> -'},
 {'summary_text': '<extra_id_0> decorations'},
 {'summary_text': '<extra_id_0> added:'},
 {'summary_text': '<extra_id_0> a statue of the Cathedral of the Cathedral of the Cathedral of the'},
 {'summary_text': '<extra_id_0> a Renaissance Renaissance Renaissance Renaissance Renaissance Renaissance Renaissance Renaissance'},
 {'summary_text': '<extra_id_0> continued in France.'},
 {'summary_text': '<extra_id_0> created an icon of the Rococo.'},
 {'summary_text': '<extra_id_0> sculpture'},
 {'summary_text': '<extra_id_0> statue'},
 {'summary_text': '<extra_id_0> -'},
 {'summary_text': "<extra_id_0> a l'anglais."},
 {'summary_text': '<extra_id_0> .'}]
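
Every result above starts with the `<extra_id_0>` sentinel token. One way to keep it out of the later formatting steps is to strip it before further processing. The sketch below assumes results shaped like the output above; `clean_summary` and `demo_res` are hypothetical names introduced only for illustration.

```python
import re

# Matches sentinel tokens such as <extra_id_0>, <extra_id_1>, ...
SENTINEL = re.compile(r'<extra_id_\d+>')

def clean_summary(text):
    """Remove sentinel tokens and collapse the leftover whitespace."""
    text = SENTINEL.sub('', text)
    return ' '.join(text.split())

demo_res = [
    {'summary_text': '<extra_id_0> continued in France.'},
    {'summary_text': '<extra_id_0> sculpture'},
]
cleaned = [clean_summary(r['summary_text']) for r in demo_res]
print(cleaned)
```

Removing the token only tidies the text; it does not turn the remaining fragments into meaningful summaries.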

## 6. Formatting Text
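Preprocessing: flatten the list of result `dict` objects into a single string, with one capitalised summary per line, then remove the space in front of sentence-ending punctuation.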

In [22]:
# Capitalise each chunk summary (this also lowercases the rest of the text) and put it on its own line
summary = ''.join(result['summary_text'].capitalize() + "\n" for result in res)

summary = summary.replace(' .', '.')
summary = summary.replace(" !", "!")
summary = summary.replace(" ?", "?")

Check the installed `pytorch` version. `stanfordnlp` requires PyTorch 1.0.0 or newer.

The relevant line of the output has the form `torch==version_number`

In [23]:
!pip freeze | grep torch

pytorch-lightning==1.4.5
torch==1.9.0+cu111
torchaudio==0.9.0
torchmetrics==0.5.1
torchvision==0.10.0+cu111


Download the English `stanfordnlp` model. This takes some time because the model is large (about 1.96GB). Type 'y' at any prompt to continue with the download.

In [24]:
# stanfordnlp.download('en')

Create a `Pipeline` with the `pos` processor. POS stands for part of speech; the processor tags each word with its part of speech.

In [25]:
stf_nlp = stanfordnlp.Pipeline(processors='tokenize,mwt,pos')

Use device: gpu
---
Loading: tokenize
With settings: 
{'model_path': 'C:\\Users\\liana\\stanfordnlp_resources\\en_ewt_models\\en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': 'C:\\Users\\liana\\stanfordnlp_resources\\en_ewt_models\\en_ewt_tagger.pt', 'pretrain_path': 'C:\\Users\\liana\\stanfordnlp_resources\\en_ewt_models\\en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Done loading processors!
---


Run the preprocessed summary text through the `stanfordnlp` pipeline.

In [26]:
doc = stf_nlp(summary)

Print the token-level breakdown of the summary text: each word with its universal POS tag (`upos`) and its treebank-specific POS tag (`xpos`).

In [27]:
print(*[f'word: {word.text+" "}\tupos: {word.upos}\txpos: {word.xpos}' for sent in doc.sentences for word in sent.words], sep='\n')

word: < 	upos: PUNCT	xpos: -LRB-
word: extra_id_0 	upos: X	xpos: ADD
word: > 	upos: PUNCT	xpos: -RRB-
word: - 	upos: PUNCT	xpos: ,
word: < 	upos: PUNCT	xpos: -LRB-
word: extra_id_0 	upos: X	xpos: ADD
word: > 	upos: PUNCT	xpos: -RRB-
word: decorations 	upos: NOUN	xpos: NNS
word: < 	upos: PUNCT	xpos: -LRB-
word: extra_id_0 	upos: X	xpos: ADD
word: > 	upos: PUNCT	xpos: -RRB-
word: added 	upos: VERB	xpos: VBN
word: : 	upos: PUNCT	xpos: :
word: < 	upos: PUNCT	xpos: -LRB-
word: extra_id_0 	upos: X	xpos: ADD
word: > 	upos: PUNCT	xpos: -RRB-
word: a 	upos: DET	xpos: DT
word: statue 	upos: NOUN	xpos: NN
word: of 	upos: ADP	xpos: IN
word: the 	upos: DET	xpos: DT
word: cathedral 	upos: NOUN	xpos: NN
word: of 	upos: ADP	xpos: IN
word: the 	upos: DET	xpos: DT
word: cathedral 	upos: NOUN	xpos: NN
word: of 	upos: ADP	xpos: IN
word: the 	upos: DET	xpos: DT
word: cathedral 	upos: NOUN	xpos: NN
word: of 	upos: ADP	xpos: IN
word: the 	upos: DET	xpos: DT
word: < 	upos: PUNCT	xpos: -LRB-
word: extra_id_0 	

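Capitalise proper nouns. `PROPN` is the universal POS tag (`upos`) for a proper noun; the corresponding Penn Treebank tag (`xpos`) for a singular proper noun is `NNP`.

In [28]:
# upos only ever holds universal tags, so test it against "PROPN" alone
doc_list = [w.text.capitalize() if w.upos == "PROPN" else w.text for sent in doc.sentences for w in sent.words]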
doc_list

['<',
 'extra_id_0',
 '>',
 '-',
 '<',
 'extra_id_0',
 '>',
 'decorations',
 '<',
 'extra_id_0',
 '>',
 'added',
 ':',
 '<',
 'extra_id_0',
 '>',
 'a',
 'statue',
 'of',
 'the',
 'cathedral',
 'of',
 'the',
 'cathedral',
 'of',
 'the',
 'cathedral',
 'of',
 'the',
 '<',
 'extra_id_0',
 '>',
 'a',
 'Renaissance',
 'Renaissance',
 'Renaissance',
 'Renaissance',
 'Renaissance',
 'Renaissance',
 'Renaissance',
 'Renaissance',
 '<',
 'extra_id_0',
 '>',
 'continued',
 'in',
 'France',
 '.',
 '<',
 'extra_id_0',
 '>',
 'created',
 'an',
 'icon',
 'of',
 'the',
 'rococo',
 '.',
 '<',
 'extra_id_0',
 '>',
 'sculpture',
 '<',
 'extra_id_0',
 '>',
 'statue',
 '<',
 'extra_id_0',
 '>',
 '-',
 '<',
 'extra_id_0',
 '>',
 'a',
 "l'anglais",
 '.',
 '<',
 'extra_id_0',
 '>.']

Capitalise the first word of every sentence, and add a space in front of every token that is not punctuation.

In [29]:
i = 0
for sent in doc.sentences:
    for w in range(len(sent.words)):
        if w != 2:
            if sent.words[w - 1].xpos in ["!", "."]: # Capitalise each first word
                doc_list[i] = sent.words[w].text.capitalize()
        if sent.words[w].upos != "PUNCT" and i != 0: # Add a space before non-punctuation words
            doc_list[i] = " "+ doc_list[i]
        i += 1

doc_list

['<',
 ' extra_id_0',
 '>',
 '-',
 '<',
 ' extra_id_0',
 '>',
 ' decorations',
 '<',
 ' extra_id_0',
 '>',
 ' added',
 ':',
 '<',
 ' extra_id_0',
 '>',
 ' a',
 ' statue',
 ' of',
 ' the',
 ' cathedral',
 ' of',
 ' the',
 ' cathedral',
 ' of',
 ' the',
 ' cathedral',
 ' of',
 ' the',
 '<',
 ' extra_id_0',
 '>',
 ' A',
 ' Renaissance',
 ' Renaissance',
 ' Renaissance',
 ' Renaissance',
 ' Renaissance',
 ' Renaissance',
 ' Renaissance',
 ' Renaissance',
 '<',
 ' extra_id_0',
 '>',
 ' continued',
 ' in',
 ' France',
 '.',
 '<',
 ' extra_id_0',
 '>',
 ' Created',
 ' an',
 ' icon',
 ' of',
 ' the',
 ' rococo',
 '.',
 '<',
 ' extra_id_0',
 '>',
 ' sculpture',
 '<',
 ' extra_id_0',
 '>',
 ' statue',
 '<',
 ' extra_id_0',
 '>',
 '-',
 '<',
 ' extra_id_0',
 '>',
 ' a',
 " l'anglais",
 '.',
 '<',
 ' extra_id_0',
 ' >.']
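
The spacing rule described above (a space before every token except punctuation) can be shown in isolation on plain strings. The sketch below is a simplified illustration that does not depend on `stanfordnlp`: it decides what counts as punctuation with `string.punctuation` instead of POS tags, and the name `detokenise` is hypothetical.

```python
import string

def detokenise(tokens):
    """Join tokens, inserting a space before everything except punctuation."""
    out = ""
    for tok in tokens:
        # A token counts as punctuation if all of its characters are punctuation
        is_punct = all(ch in string.punctuation for ch in tok)
        if out and not is_punct:
            out += " "
        out += tok
    return out

joined = detokenise(["It", "continued", "in", "France", ".", "Created", "an", "icon", "!"])
print(joined)
```

Opening brackets and quotes are treated like any other punctuation here, so they attach to the preceding word; handling them correctly would need a separate rule.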

Join all items in `doc_list` into a string.

In [30]:
summary = "".join(doc_list)

## 7. Results

Some statistics and the final result.

In [31]:
sentences = []
one_sen = ""
for s in doc_list:
    one_sen += s
    if s == ".":
        sentences.append(one_sen)
        one_sen = ""

sentences

['< extra_id_0>-< extra_id_0> decorations< extra_id_0> added:< extra_id_0> a statue of the cathedral of the cathedral of the cathedral of the< extra_id_0> A Renaissance Renaissance Renaissance Renaissance Renaissance Renaissance Renaissance Renaissance< extra_id_0> continued in France.',
 '< extra_id_0> Created an icon of the rococo.',
 "< extra_id_0> sculpture< extra_id_0> statue< extra_id_0>-< extra_id_0> a l'anglais."]

In [32]:
md = ""   # Markdown output to be fed into the slide generator
txt = ""  # Text accumulated on the current slide
for sentence in sentences:
    if len(txt) == 0:
        # Start a new slide with a header and its first bullet
        md += "# " + "Header" + "\n\n- " + sentence
        txt += sentence
    elif len(txt) < 1000 and sentence != sentences[-1]:
        # Keep adding bullets while the slide holds fewer than 1000 characters
        md += "\n-" + sentence
        txt += sentence
    else:
        # Close the slide; the current sentence is not carried over to the next one
        md += "\n\n---\n\n"
        txt = ""

print(md)

# Header

- < extra_id_0>-< extra_id_0> decorations< extra_id_0> added:< extra_id_0> a statue of the cathedral of the cathedral of the cathedral of the< extra_id_0> A Renaissance Renaissance Renaissance Renaissance Renaissance Renaissance Renaissance Renaissance< extra_id_0> continued in France.
-< extra_id_0> Created an icon of the rococo.

---
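
In the output above, the third sentence is missing because the sentence that closes a slide is not written anywhere. A possible variant that keeps every sentence is sketched below. It is an assumption about the intended behaviour, not the notebook's method, and `build_slides` with its parameters is a hypothetical helper.

```python
def build_slides(sentences, max_chars=1000, title="Header"):
    """Group sentences into Markdown slides separated by '---'."""
    slides, bullets, used = [], [], 0
    for sentence in sentences:
        # Open a new slide when this sentence would overflow the current one
        if bullets and used + len(sentence) > max_chars:
            slides.append(bullets)
            bullets, used = [], 0
        bullets.append(sentence.strip())
        used += len(sentence)
    # Keep the last, partially filled slide
    if bullets:
        slides.append(bullets)
    return "\n\n---\n\n".join(
        f"# {title}\n\n" + "\n".join(f"- {b}" for b in group)
        for group in slides
    )

slides_md = build_slides(["First point.", "Second point.", "Third point."], max_chars=25)
print(slides_md)
```

With the small `max_chars=25` used here, the first two sentences share a slide and the third starts a new one, so no sentence is lost.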




In [33]:
words_after = len(summary.split(' '))
# len() on a string counts characters (including the inserted <eos> markers), not words
words_before = len(ARTICLE)
reduced_by = (words_before - words_after) / words_before * 100

print("Number of words in summary: " + str(words_after))
print("Number of words in original article: " + str(words_before))
print("Reduced by: " + str(round(reduced_by, 2)) + "%\n")
print(summary)

Number of words in summary: 51
Number of words in original article: 37326
Reduced by: 99.86%

< extra_id_0>-< extra_id_0> decorations< extra_id_0> added:< extra_id_0> a statue of the cathedral of the cathedral of the cathedral of the< extra_id_0> A Renaissance Renaissance Renaissance Renaissance Renaissance Renaissance Renaissance Renaissance< extra_id_0> continued in France.< extra_id_0> Created an icon of the rococo.< extra_id_0> sculpture< extra_id_0> statue< extra_id_0>-< extra_id_0> a l'anglais.< extra_id_0 >.
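
The figures above compare the number of words in the summary with `len(ARTICLE)`, which is a character count. A word-to-word comparison could look like the sketch below; `reduction_percent` and the sample strings are hypothetical, and no figure for the Rococo article is implied.

```python
def reduction_percent(original, summary):
    """Percentage by which the summary is shorter than the original, in words."""
    words_before = len(original.split())
    words_after = len(summary.split())
    if words_before == 0:
        return 0.0  # avoid dividing by zero for empty input
    return (words_before - words_after) / words_before * 100

original_text = "one two three four five six seven eight nine ten"
short_text = "one two"
print(round(reduction_percent(original_text, short_text), 2))
```

Counting both texts in the same unit keeps the percentage meaningful; measuring both in characters would work equally well as long as the unit matches.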
