# Text Summarisation with NLP

## Extraction vs Abstraction

- Abstraction is to generate a summary based on the text, which could include words/phrases not in the original text
- Extraction is to pull out certain words/phrases from the text (less complex than abstraction)
- Unsupervised learning approaches are commonly used

Find some text.

In [1]:
url = 'https://news.sky.com/story/hawaiis-mauna-loa-molten-lava-threatens-to-block-main-highway-as-tourists-flock-to-catch-a-glimpse-of-volcano-12760860'

Use `BeautifulSoup`

In [2]:
import urllib.request
from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(url):
    soup = BeautifulSoup(url, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

text = urllib.request.urlopen(url).read()
text = text_from_html(text)
text



- Difficult to filter out paragraphs that are not in the main article
- Finding text needs to be applied to *any* webpage and since every webpage has a different structure a sustainable method is needed
- Try `trafilatura` instead

In [3]:
!pip install trafilatura

Collecting charset-normalizer>=2.1.1
  Using cached charset_normalizer-3.0.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (170 kB)
Installing collected packages: charset-normalizer
  Attempting uninstall: charset-normalizer
    Found existing installation: charset-normalizer 2.1.1
    Uninstalling charset-normalizer-2.1.1:
      Successfully uninstalled charset-normalizer-2.1.1
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
beatrix-jupyterlab 3.1.7 requires google-cloud-bigquery-storage, which is not installed.
requests 2.28.1 requires charset-normalizer<3,>=2, but you have charset-normalizer 3.0.1 which is incompatible.
pandas-profiling 3.1.0 requires markupsafe~=2.0.1, but you have markupsafe 2.1.1 which is incompatible.
google-api-core 1.33.2 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<4.0.0dev

In [4]:
import trafilatura
def text_from_html(url):
    return trafilatura.extract(trafilatura.fetch_url(url))

text_from_html(url)

'Hawaii\'s Mauna Loa: Molten lava threatens to block main highway as tourists flock to catch a glimpse of volcano\nThe molten rock could make the road impassable and force drivers to find alternative coastal routes in the north and south, adding hours to commute times.\nSaturday 3 December 2022 11:49, UK\nHawaii\'s residents are bracing for further disruption if lava from Mauna Loa volcano slides across a key highway and blocks the quickest route connecting two sides of the island.\nIt comes after the spectacle of incandescent lava spewing from the volcano last Sunday for the first time in 38 years has attracted thousands of visitors to the city of Hilo.\nBut now, the molten rock could make the road impassable and force drivers to find alternative coastal routes in the north and south, adding hours to commute times.\nThe lava is oozing slowly at a rate that may reach the road next week, but its unpredictable path might change course or the flow could stop completely and spare the highw

-`trafilatura` is able to return clean content of the page as a string and is incredibly fast
- [Documentation](https://trafilatura.readthedocs.io/en/latest/)
- Split text into sentences
- Split sentences into words (or tokens)
- Calculate similarity between sentences by comparing tokens in each sentences
- Similarity can be calculated as the number of words in each sentence divided by the average sentence length
- Assign a score to each sentence
- Select best scoring sentences

In [5]:
text = text_from_html(url)

In [6]:
def similarity(sentenceA, sentenceB):
    """
    1. Split sentences on space
    2. Pull similar words
    3. Return similarity score
    
    >>> similarity('Hello, my name is James', 'Hello, James is my name') # perfect similarity
    1
    
    >>> similarity('Hello, my name is James', 'Hi, I am James')
    0.2222222222222222
    """
    sentenceA = sentenceA.split(' ')
    sentenceB = sentenceB.split(' ')
    similar_words = [word for word in sentenceA if word in sentenceB]
    return len(similar_words) / ((len(sentenceA) + len(sentenceB)) / 2)

- Create a matrix of similarities between pairs of sentences using `sent_tokenize` from `nltk.tokenize`
- Matrix will have square dimensions, and be of `shape` `(len(token_sentences), len(token_sentences))`

In [7]:
from nltk.tokenize import sent_tokenize
token_sentences = sent_tokenize(text)
scores = [[similarity(token_sentences[i], token_sentences[j]) for i in range(len(token_sentences))] 
          for j in range(len(token_sentences))]

If all went correctly, `len(scores)` == `len(token_sentences)`

In [8]:
len(scores) == len(token_sentences)

True

- Calculate individual scores for each sentence
- `scores` is the the similarity of each sentence with other sentences
- e.g. `scores[0]` is the similarity of the first sentence with every other sentence
- Calculate a sentence, i, score by summing `scores` for sentence i with the other sentences
- Input sentence with total score into a dictionary

In [9]:
sum(scores[0])

4.340800536015334

In [10]:
sentence_score = {token_sentences[i]: sum(scores[i]) for i in range(len(scores))}

In [11]:
sentence_score

{"Hawaii's Mauna Loa: Molten lava threatens to block main highway as tourists flock to catch a glimpse of volcano\nThe molten rock could make the road impassable and force drivers to find alternative coastal routes in the north and south, adding hours to commute times.": 4.340800536015334,
 "Saturday 3 December 2022 11:49, UK\nHawaii's residents are bracing for further disruption if lava from Mauna Loa volcano slides across a key highway and blocks the quickest route connecting two sides of the island.": 4.0737589717640486,
 'It comes after the spectacle of incandescent lava spewing from the volcano last Sunday for the first time in 38 years has attracted thousands of visitors to the city of Hilo.': 3.763592767710419,
 'But now, the molten rock could make the road impassable and force drivers to find alternative coastal routes in the north and south, adding hours to commute times.': 3.8180138056111566,
 'The lava is oozing slowly at a rate that may reach the road next week, but its unp

- Sort the sentences by scores
- Return the top three sentences

In [12]:
sorted(sentence_score, key=sentence_score.get, reverse=True)[:3]

['Traffic has since clogged the road as people try to get a glimpse of the lava.',
 "Hawaii's Mauna Loa: Molten lava threatens to block main highway as tourists flock to catch a glimpse of volcano\nThe molten rock could make the road impassable and force drivers to find alternative coastal routes in the north and south, adding hours to commute times.",
 'Hawaii police said a handful of resulting accidents included a two-vehicle crash that sent two people to the hospital with "not serious injuries".']

- Wrap everything up into functions
- Clean up the output
- Use `word_tokenize` and `stopwords` to improve algorithm

In [13]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
stop = stopwords.words('english')
import string
punctuations = list(string.punctuation)

def similarity(sentenceA, sentenceB):
    sentenceA = [i for i in word_tokenize(sentenceA) if i not in punctuations and i not in stop]
    sentenceB = [i for i in word_tokenize(sentenceB) if i not in punctuations and i not in stop]
    similar_words = [word for word in sentenceA if word in sentenceB]
    return len(similar_words) / ((len(sentenceA) + len(sentenceB)) / 2)

def calculate_score(text):
    token_sentences = sent_tokenize(text)
    scores = [[similarity(token_sentences[i], token_sentences[j]) for i in range(len(token_sentences))] 
          for j in range(len(token_sentences))]
    return sorted({token_sentences[i]: sum(scores[i]) for i in range(len(scores))}, key=sentence_score.get, reverse=True)

def summarise(text, num_sentences=3):
    summary = calculate_score(text)
    return ' '.join(summary[:num_sentences])

summarise(text)

'Traffic has since clogged the road as people try to get a glimpse of the lava. Hawaii\'s Mauna Loa: Molten lava threatens to block main highway as tourists flock to catch a glimpse of volcano\nThe molten rock could make the road impassable and force drivers to find alternative coastal routes in the north and south, adding hours to commute times. Hawaii police said a handful of resulting accidents included a two-vehicle crash that sent two people to the hospital with "not serious injuries".'

- Compare with the `sumy` package
- Install `sumy` withing virtual environment
- `sumy` has implemented various summarization algorithms such as TextRank, Luhn, Edmundson, LSA, LexRank, KL-Sum, SumBasic, and Reduction
- Use TextRank

In [14]:
!pip install sumy

Collecting charset-normalizer<3,>=2
  Using cached charset_normalizer-2.1.1-py3-none-any.whl (39 kB)
Installing collected packages: charset-normalizer
  Attempting uninstall: charset-normalizer
    Found existing installation: charset-normalizer 3.0.1
    Uninstalling charset-normalizer-3.0.1:
      Successfully uninstalled charset-normalizer-3.0.1
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
beatrix-jupyterlab 3.1.7 requires google-cloud-bigquery-storage, which is not installed.
pandas-profiling 3.1.0 requires markupsafe~=2.0.1, but you have markupsafe 2.1.1 which is incompatible.
htmldate 1.4.0 requires charset-normalizer>=3.0.1, but you have charset-normalizer 2.1.1 which is incompatible.
google-api-core 1.33.2 requires protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<4.0.0dev,>=3.19.5, but you have protobuf 3.19.4 whi

In [15]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

parser = PlaintextParser.from_string(text,Tokenizer("english"))

Summarise using TextRank

In [16]:
from sumy.summarizers.text_rank import TextRankSummarizer
summariser = TextRankSummarizer()
summary = summariser(parser.document,2)
text_summary = ''

for sentence in summary:
    text_summary += str(sentence)
text_summary

"Hawaii's Mauna Loa: Molten lava threatens to block main highway as tourists flock to catch a glimpse of volcano The molten rock could make the road impassable and force drivers to find alternative coastal routes in the north and south, adding hours to commute times.The period between Thanksgiving and Christmas is typically quiet for Hawaii's travel industry, but this week thousands of cars have caused traffic jams on Route 200, known as the Saddle Road, which connects the cities of Hilo on the east side of Hawaii Island and Kailua-Kona on the west side."

- Compare with LexRank
- LexRank is graph-based (unsupervised, similar to TextRank)
- LexRank uses cosine similarity

In [17]:
from sumy.summarizers.lex_rank import LexRankSummarizer
summariser = LexRankSummarizer()
summary = summariser(parser.document,2)
test_summary = ''
for sentence in summary:
    text_summary += str(sentence)
text_summary

"Hawaii's Mauna Loa: Molten lava threatens to block main highway as tourists flock to catch a glimpse of volcano The molten rock could make the road impassable and force drivers to find alternative coastal routes in the north and south, adding hours to commute times.The period between Thanksgiving and Christmas is typically quiet for Hawaii's travel industry, but this week thousands of cars have caused traffic jams on Route 200, known as the Saddle Road, which connects the cities of Hilo on the east side of Hawaii Island and Kailua-Kona on the west side.Hawaii's Mauna Loa: Molten lava threatens to block main highway as tourists flock to catch a glimpse of volcano The molten rock could make the road impassable and force drivers to find alternative coastal routes in the north and south, adding hours to commute times.Saturday 3 December 2022 11:49, UK Hawaii's residents are bracing for further disruption if lava from Mauna Loa volcano slides across a key highway and blocks the quickest ro

- Latent Semnatic Analysis is based on singular value decomposition (SVD)
- Reduced data dimensionality
- Performs spatial decomsposition and captures information in a singular vector and the magnitude

In [18]:
from sumy.summarizers.lsa import LsaSummarizer
summariser = LsaSummarizer()
summary = summariser(parser.document,2)
test_summary = ''
for sentence in summary:
    text_summary += str(sentence)
text_summary

'Hawaii\'s Mauna Loa: Molten lava threatens to block main highway as tourists flock to catch a glimpse of volcano The molten rock could make the road impassable and force drivers to find alternative coastal routes in the north and south, adding hours to commute times.The period between Thanksgiving and Christmas is typically quiet for Hawaii\'s travel industry, but this week thousands of cars have caused traffic jams on Route 200, known as the Saddle Road, which connects the cities of Hilo on the east side of Hawaii Island and Kailua-Kona on the west side.Hawaii\'s Mauna Loa: Molten lava threatens to block main highway as tourists flock to catch a glimpse of volcano The molten rock could make the road impassable and force drivers to find alternative coastal routes in the north and south, adding hours to commute times.Saturday 3 December 2022 11:49, UK Hawaii\'s residents are bracing for further disruption if lava from Mauna Loa volcano slides across a key highway and blocks the quickes

- KL Divergence calculates difference between two probability distributions
- KL Divergence takes a matrix of the KLD value of sentences from the input document

In [19]:
from sumy.summarizers.kl import KLSummarizer
summariser = KLSummarizer()
summary = summariser(parser.document,2)
test_summary = ''
for sentence in summary:
    text_summary += str(sentence)
text_summary

'Hawaii\'s Mauna Loa: Molten lava threatens to block main highway as tourists flock to catch a glimpse of volcano The molten rock could make the road impassable and force drivers to find alternative coastal routes in the north and south, adding hours to commute times.The period between Thanksgiving and Christmas is typically quiet for Hawaii\'s travel industry, but this week thousands of cars have caused traffic jams on Route 200, known as the Saddle Road, which connects the cities of Hilo on the east side of Hawaii Island and Kailua-Kona on the west side.Hawaii\'s Mauna Loa: Molten lava threatens to block main highway as tourists flock to catch a glimpse of volcano The molten rock could make the road impassable and force drivers to find alternative coastal routes in the north and south, adding hours to commute times.Saturday 3 December 2022 11:49, UK Hawaii\'s residents are bracing for further disruption if lava from Mauna Loa volcano slides across a key highway and blocks the quickes