
# Text summarization using libraries

## Preparing the environment

In [None]:
import nltk
import re
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
!pip install goose3

In [None]:
from goose3 import Goose
g = Goose()
url = 'https://en.wikipedia.org/wiki/Automatic_summarization'
article = g.extract(url)

In [None]:
article.cleaned_text

'Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.\n\nIn addition to text, images and videos can also be summarized. Text summarization finds the most informative sentences in a document;[1] image summarization finds the most representative images within an image collection[citation needed]; video summarization extracts the most important frames from the video content.[2]\n\nThere are two general approaches to automatic summarization: extraction and abstraction.\n\nHere, content is extracted from the original data, but the extracted content is not modified in any way. Examples of extracted content include key-phrases that can be used to "tag" or index a text document, or key sentences (including headings) that collectively comprise an abstract, and representative images or video segments, as stated above. For text, extraction is analog

In [None]:
original_sentences = nltk.sent_tokenize(article.cleaned_text)
original_sentences

['Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.',
 'In addition to text, images and videos can also be summarized.',
 'Text summarization finds the most informative sentences in a document;[1] image summarization finds the most representative images within an image collection[citation needed]; video summarization extracts the most important frames from the video content.',
 '[2]\n\nThere are two general approaches to automatic summarization: extraction and abstraction.',
 'Here, content is extracted from the original data, but the extracted content is not modified in any way.',
 'Examples of extracted content include key-phrases that can be used to "tag" or index a text document, or key sentences (including headings) that collectively comprise an abstract, and representative images or video segments, as stated above.',
 'For text, 
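
The tokenized output above shows citation markers such as `[1]`, `[citation needed]`, and a `[2]` that ends up glued to the start of the next sentence. One possible cleanup, sketched here with only the standard library, is to strip those markers before calling the tokenizer. The helper name `strip_citations` and the pattern are illustrative assumptions, not part of the original notebook, and the pattern only covers the marker shapes visible in this article.

```python
import re

# Hypothetical pattern: matches markers like [1], [23] or [citation needed].
CITATION_PATTERN = re.compile(r"\[(?:\d+|citation needed)\]")

def strip_citations(text):
    """Remove bracketed citation markers and tidy the leftover spaces."""
    without_markers = CITATION_PATTERN.sub("", text)
    # Collapse spaces and tabs only, so paragraph breaks (\n\n) are preserved.
    return re.sub(r"[ \t]+", " ", without_markers).strip()

sample = "frames from the video content.[2]\n\nThere are two general approaches."
cleaned = strip_citations(sample)
print(repr(cleaned))
```

In the notebook this could be applied as `nltk.sent_tokenize(strip_citations(article.cleaned_text))`, assuming the same cleaned text is also what gets passed to the summarizers so that the sentences still match later.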

In [None]:
from IPython.display import HTML, display

def visualize(title, sentence_list, best_sentences):
  """Render the article and highlight the sentences selected for the summary."""
  text = ''

  display(HTML(f'<h1>Summary - {title}</h1>'))
  for sentence in sentence_list:
    if sentence in best_sentences:
      # Wrap summary sentences in <mark> so they show up highlighted
      text += ' ' + f"<mark>{sentence}</mark>"
    else:
      text += ' ' + sentence
  display(HTML(f""" {text} """))
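
The highlighting logic can also be written as a pure function that returns the HTML string, which makes it easy to check outside a notebook. The sketch below is a variant under stated assumptions, not the notebook's own function: the name `build_highlight_html` is hypothetical, it escapes the text with `html.escape` (the original does not), and it joins the heading and body into one string.

```python
from html import escape

def build_highlight_html(title, sentence_list, best_sentences):
    """Return heading plus article text, with summary sentences wrapped in <mark>."""
    chosen = set(best_sentences)  # set lookup instead of scanning a list each time
    parts = [f"<h1>Summary - {escape(title)}</h1>"]
    for sentence in sentence_list:
        safe = escape(sentence)  # keep characters like < or & from breaking the markup
        parts.append(f"<mark>{safe}</mark>" if sentence in chosen else safe)
    return " ".join(parts)

markup = build_highlight_html("Demo", ["A < B.", "C is key."], ["C is key."])
print(markup)
```

Inside the notebook the returned string could be shown with `display(HTML(markup))`.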

## sumy

- https://pypi.org/project/sumy/

In [None]:
!pip install sumy

In [None]:
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer

In [None]:
parser = PlaintextParser.from_string(article.cleaned_text, Tokenizer('english'))

In [None]:
summarizer = LuhnSummarizer()

In [None]:
summary = summarizer(parser.document, 120)  # second argument: number of sentences to keep

In [None]:
summary

(<Sentence: Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.>,
 <Sentence: Text summarization finds the most informative sentences in a document;[1] image summarization finds the most representative images within an image collection[citation needed]; video summarization extracts the most important frames from the video content.>,
 <Sentence: Here, content is extracted from the original data, but the extracted content is not modified in any way.>,
 <Sentence: Examples of extracted content include key-phrases that can be used to "tag" or index a text document, or key sentences (including headings) that collectively comprise an abstract, and representative images or video segments, as stated above.>,
 <Sentence: For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figur

In [None]:
# Convert sumy Sentence objects to plain strings for matching in visualize()
best_sentences = [str(sentence) for sentence in summary]

In [None]:
visualize(article.title, original_sentences, best_sentences)
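
To give a feel for what an extractive summarizer does, here is a deliberately simplified frequency-based scorer. It is not the Luhn algorithm as implemented in sumy; it only illustrates the general idea of scoring sentences by the words they contain and keeping the top ones in document order. The function name `top_sentences`, the tiny stop-word list, and the sample sentences are all assumptions made for this sketch.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real libraries ship fuller lists).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "it", "on"}

def top_sentences(sentences, count):
    """Pick `count` sentences by summed word frequency, keeping document order."""
    tokenized = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    frequency = Counter(word for words in tokenized for word in words)
    scores = [sum(frequency[w] for w in words) for words in tokenized]
    # Rank sentence indices by score, then restore the original order.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:count])
    return [sentences[i] for i in keep]

docs = [
    "Summarization shortens text.",
    "Cats sleep on the sofa.",
    "Extractive summarization selects sentences from the text.",
]
print(top_sentences(docs, 2))
```

Note that summing frequencies favors long sentences; real implementations typically normalize or use more careful significance measures.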

## pysummarization

- https://pypi.org/project/pysummarization/

In [None]:
!pip install pysummarization

In [None]:
from pysummarization.nlpbase.auto_abstractor import AutoAbstractor
from pysummarization.tokenizabledoc.simple_tokenizer import SimpleTokenizer
from pysummarization.abstractabledoc.top_n_rank_abstractor import TopNRankAbstractor

In [None]:
auto_abstractor = AutoAbstractor()
auto_abstractor.tokenizable_doc = SimpleTokenizer()
auto_abstractor.delimiter_list = [".", "\n"]
abstractable_doc = TopNRankAbstractor()

In [None]:
summary = auto_abstractor.summarize(article.cleaned_text, abstractable_doc)

In [None]:
summary

{'scoring_data': [(0, 21.551724137931036),
  (1, 9.0),
  (2, 11.0),
  (4, 9.0),
  (5, 8.0),
  (6, 29.257142857142856),
  (7, 19.862068965517242),
  (11, 6.0),
  (29, 6.125),
  (30, 7.117647058823529)],
 'summarize_result': ['Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.\n',
  'In addition to text, images and videos can also be summarized.\n',
  ' Text summarization finds the most informative sentences in a document;[1] image summarization finds the most representative images within an image collection[citation needed]; video summarization extracts the most important frames from the video content.\n',
  'There are two general approaches to automatic summarization: extraction and abstraction.\n',
  'Here, content is extracted from the original data, but the extracted content is not modified in any way.\n',
  ' Examples of extracted c
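
The result printed above is a dictionary with two lists: `scoring_data` holds `(sentence index, score)` pairs and `summarize_result` holds the selected sentences. The sketch below pairs them up to rank the kept sentences by score. It assumes the two lists are aligned position by position, which the output suggests but which should be checked against the library documentation. The helper `rank_by_score` is hypothetical, and the mock dictionary uses placeholder sentences with scores copied from the first three entries shown.

```python
# Mock result in the same shape as the output above (placeholder sentences).
mock_summary = {
    "scoring_data": [(0, 21.551724137931036), (1, 9.0), (2, 11.0)],
    "summarize_result": ["First sentence.\n", "Second sentence.\n", " Third sentence.\n"],
}

def rank_by_score(result, limit):
    """Pair each kept sentence with its score and return the `limit` highest."""
    pairs = [
        (score, sentence.strip())
        # Assumption: scoring_data[i] describes summarize_result[i].
        for (_, score), sentence in zip(result["scoring_data"], result["summarize_result"])
    ]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)[:limit]

top_two = rank_by_score(mock_summary, 2)
print(top_two)
```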

In [None]:
# Collapse whitespace and trim each selected sentence so it can match the originals
best_sentences = [re.sub(r'\s+', ' ', sentence).strip() for sentence in summary['summarize_result']]

In [None]:
best_sentences

['Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.',
 'In addition to text, images and videos can also be summarized.',
 'Text summarization finds the most informative sentences in a document;[1] image summarization finds the most representative images within an image collection[citation needed]; video summarization extracts the most important frames from the video content.',
 'There are two general approaches to automatic summarization: extraction and abstraction.',
 'Here, content is extracted from the original data, but the extracted content is not modified in any way.',
 'Examples of extracted content include key-phrases that can be used to "tag" or index a text document, or key sentences (including headings) that collectively comprise an abstract, and representative images or video segments, as stated above.',
 'For text, extract

In [None]:
visualize(article.title, original_sentences, best_sentences)
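
`visualize` highlights a sentence only when it is exactly equal to one of the summary sentences. The outputs above show a case where that fails: the NLTK list contains `'[2]\n\nThere are two general approaches ...'`, while the pysummarization list contains the same sentence without the leading `[2]`. A possible workaround, sketched here with hypothetical helpers `normalize` and `match_best`, is to compare normalized forms and pass the matching original sentences to `visualize`.

```python
import re

def normalize(text):
    """Drop [n]-style markers and collapse all whitespace to single spaces."""
    return re.sub(r"\s+", " ", re.sub(r"\[\d+\]", " ", text)).strip()

def match_best(original_sentences, best_sentences):
    """Return the original sentences whose normalized form equals a summary sentence."""
    wanted = {normalize(s) for s in best_sentences}
    return [s for s in original_sentences if normalize(s) in wanted]

original = ["[2]\n\nThere are two general approaches.", "Another sentence."]
best = ["There are two general approaches."]
print(match_best(original, best))
```

In the notebook this might be used as `visualize(article.title, original_sentences, match_best(original_sentences, best_sentences))`. It will still miss sentences that the two libraries split at different boundaries.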

## BERT

- https://pypi.org/project/bert-extractive-summarizer/

In [None]:
!pip install bert-extractive-summarizer

In [None]:
!pip install transformers==3.1.0

In [None]:
from summarizer import Summarizer

In [None]:
summarizer = Summarizer()
summary = summarizer(article.cleaned_text)

100%|██████████| 434/434 [00:00<00:00, 96625.51B/s]
100%|██████████| 1344997306/1344997306 [00:44<00:00, 30273826.18B/s]
100%|██████████| 231508/231508 [00:00<00:00, 692847.18B/s]


In [None]:
summary

'Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Approaches aimed at higher summarization quality rely on combined software and human effort. In Machine Aided Human Summarization, extractive techniques highlight candidate passages for inclusion (to which the human adds or removes text). Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs. Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary. Image collection summarization is another application example of automatic summarization. Query based summarization techniques, additionally model for relevance of the summary with the query. For example, news articles rarely have keyphrases attached, 

In [None]:
summary_tokenized = nltk.sent_tokenize(summary)
summary_tokenized

['Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.',
 'Approaches aimed at higher summarization quality rely on combined software and human effort.',
 'In Machine Aided Human Summarization, extractive techniques highlight candidate passages for inclusion (to which the human adds or removes text).',
 'Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.',
 'Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.',
 'Image collection summarization is another application example of automatic summarization.',
 'Query based summarization techniques, additionally model for relevance of the summary with the query.',
 'For example, news articles rare

In [None]:
visualize(article.title, original_sentences, summary_tokenized)