In [21]:
text = """
There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summarizing news articles. Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.
Image collection summarization is another application example of automatic summarization. It consists in selecting a representative set of images from a larger set of images.[4] A summary in this context is useful to show the most representative images of results in an image collection exploration system. Video summarization is a related domain, where the system automatically creates a trailer of a long video. This also has applications in consumer or personal videos, where one might want to skip the boring or repetitive actions. Similarly, in surveillance videos, one would want to extract important and suspicious activity, while ignoring all the boring and redundant frames captured.
At a very high level, summarization algorithms try to find subsets of objects (like set of sentences, or a set of images), which cover information of the entire set. This is also called the core-set. These algorithms model notions like diversity, coverage, information and representativeness of the summary. Query based summarization techniques, additionally model for relevance of the summary with the query. Some techniques and algorithms which naturally model summarization problems are TextRank and PageRank, Submodular set function, Determinantal point process, maximal marginal relevance (MMR) etc. 
"""

In [3]:
import spacy
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [6]:
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [8]:
stopwords = list(STOP_WORDS)

In [9]:
nlp = spacy.load('en_core_web_sm')

In [22]:
doc = nlp(text)

In [23]:
# Collect the text of every token (this is a list of tokens, not a tokenizer)
tokens = [token.text for token in doc]
print(tokens)

['\n', 'There', 'are', 'broadly', 'two', 'types', 'of', 'extractive', 'summarization', 'tasks', 'depending', 'on', 'what', 'the', 'summarization', 'program', 'focuses', 'on', '.', 'The', 'first', 'is', 'generic', 'summarization', ',', 'which', 'focuses', 'on', 'obtaining', 'a', 'generic', 'summary', 'or', 'abstract', 'of', 'the', 'collection', '(', 'whether', 'documents', ',', 'or', 'sets', 'of', 'images', ',', 'or', 'videos', ',', 'news', 'stories', 'etc', '.', ')', '.', 'The', 'second', 'is', 'query', 'relevant', 'summarization', ',', 'sometimes', 'called', 'query', '-', 'based', 'summarization', ',', 'which', 'summarizes', 'objects', 'specific', 'to', 'a', 'query', '.', 'Summarization', 'systems', 'are', 'able', 'to', 'create', 'both', 'query', 'relevant', 'text', 'summaries', 'and', 'generic', 'machine', '-', 'generated', 'summaries', 'depending', 'on', 'what', 'the', 'user', 'needs', '.', '\n', 'An', 'example', 'of', 'a', 'summarization', 'problem', 'is', 'document', 'summarizatio

In [24]:
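# Treat newline tokens as punctuation; the guard keeps re-runs from appending '\n' again
if '\n' not in punctuation:
  punctuation = punctuation + '\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'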

In [32]:
# Count each content word, skipping stop words and punctuation
word_frequencies = {}
for word in doc:
  token = word.text.lower()
  if token not in stopwords and token not in punctuation:
    word_frequencies[token] = word_frequencies.get(token, 0) + 1


In [33]:
print(word_frequencies)

{'broadly': 1, 'types': 1, 'extractive': 1, 'summarization': 15, 'tasks': 1, 'depending': 2, 'program': 1, 'focuses': 2, 'generic': 3, 'obtaining': 1, 'summary': 6, 'abstract': 2, 'collection': 3, 'documents': 2, 'sets': 1, 'images': 4, 'videos': 3, 'news': 4, 'stories': 1, 'etc': 2, 'second': 1, 'query': 6, 'relevant': 2, 'called': 3, 'based': 2, 'summarizes': 1, 'objects': 2, 'specific': 1, 'systems': 1, 'able': 1, 'create': 1, 'text': 1, 'summaries': 2, 'machine': 1, 'generated': 1, 'user': 1, 'needs': 1, 'example': 3, 'problem': 2, 'document': 4, 'attempts': 1, 'automatically': 3, 'produce': 1, 'given': 2, 'interested': 1, 'generating': 1, 'single': 1, 'source': 2, 'use': 1, 'multiple': 1, 'cluster': 1, 'articles': 3, 'topic': 2, 'multi': 1, 'related': 2, 'application': 2, 'summarizing': 1, 'imagine': 1, 'system': 3, 'pulls': 1, 'web': 1, 'concisely': 1, 'represents': 1, 'latest': 1, 'image': 2, 'automatic': 1, 'consists': 1, 'selecting': 1, 'representative': 2, 'set': 7, 'larger':
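
The counting step above does not depend on spaCy itself. As a minimal sketch, assuming a crude regex tokenizer and a tiny hand-picked stop-word set (`TOY_STOP_WORDS` and `count_content_words` are illustrative names, not part of spaCy), the same idea can be written with the standard library:

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set; spaCy's STOP_WORDS is far larger.
TOY_STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and", "or", "on", "which"}

def count_content_words(text, stop_words=TOY_STOP_WORDS):
    """Count lowercase word occurrences, ignoring stop words and punctuation."""
    # \w+ keeps runs of letters and digits, so punctuation and newlines never become tokens.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)

sample = "The first is generic summarization. The second is query relevant summarization."
toy_frequencies = count_content_words(sample)
print(toy_frequencies.most_common(2))
```

The regex tokenizer splits differently from spaCy (for example on hyphens and abbreviations), so counts on real text may not match the notebook output exactly.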

In [35]:
max_frequency = max(word_frequencies.values())
max_frequency

15

In [36]:
# Scale every count by the largest one so scores fall in (0, 1]; run this cell only once
for word in word_frequencies.keys():
  word_frequencies[word] = word_frequencies[word]/max_frequency
print(word_frequencies)

{'broadly': 0.06666666666666667, 'types': 0.06666666666666667, 'extractive': 0.06666666666666667, 'summarization': 1.0, 'tasks': 0.06666666666666667, 'depending': 0.13333333333333333, 'program': 0.06666666666666667, 'focuses': 0.13333333333333333, 'generic': 0.2, 'obtaining': 0.06666666666666667, 'summary': 0.4, 'abstract': 0.13333333333333333, 'collection': 0.2, 'documents': 0.13333333333333333, 'sets': 0.06666666666666667, 'images': 0.26666666666666666, 'videos': 0.2, 'news': 0.26666666666666666, 'stories': 0.06666666666666667, 'etc': 0.13333333333333333, 'second': 0.06666666666666667, 'query': 0.4, 'relevant': 0.13333333333333333, 'called': 0.2, 'based': 0.13333333333333333, 'summarizes': 0.06666666666666667, 'objects': 0.13333333333333333, 'specific': 0.06666666666666667, 'systems': 0.06666666666666667, 'able': 0.06666666666666667, 'create': 0.06666666666666667, 'text': 0.06666666666666667, 'summaries': 0.13333333333333333, 'machine': 0.06666666666666667, 'generated': 0.06666666666
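
The normalization can be checked by hand on a few of the counts printed earlier. The sketch below is one possible way to write it (`normalize_by_max` is a hypothetical helper, not something the notebook defines). It returns a new dictionary instead of updating in place, so running it twice does not divide the scores a second time:

```python
def normalize_by_max(counts):
    """Return a new dict with every count divided by the largest count."""
    if not counts:
        return {}
    largest = max(counts.values())
    return {word: count / largest for word, count in counts.items()}

# Counts copied from the word_frequencies output above.
sample_counts = {"summarization": 15, "summary": 6, "query": 6, "generic": 3}
normalized = normalize_by_max(sample_counts)
print(normalized)
```

With a maximum of 15, `summary` and `query` come out at 6/15 = 0.4 and `generic` at 3/15 = 0.2, which agrees with the values printed by the notebook.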

In [37]:
# Split the document into sentences (spaCy Span objects)
sentence_tokens = list(doc.sents)
print(sentence_tokens)

[
There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on., The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.)., The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query., Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.
, An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document., Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic)., This problem is called multi-document summarization., A related application 

In [42]:
# Score each sentence as the sum of the normalized frequencies of its words
sentence_score = {}
for sent in sentence_tokens:
  for word in sent:
    token = word.text.lower()
    if token in word_frequencies:
      sentence_score[sent] = sentence_score.get(sent, 0) + word_frequencies[token]



In [43]:
sentence_score

{
 There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on.: 2.6,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).: 3.4666666666666672,
 The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.: 4.000000000000001,
 Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.: 2.6666666666666674,
 An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.: 3.466666666666667,
 Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the sa
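
The scoring rule used above is simply "add up the normalized frequency of every known word in the sentence". A small stdlib sketch of that rule, assuming plain strings and a regex tokenizer in place of spaCy spans (`score_sentences` and the toy data are made up for illustration):

```python
import re

def score_sentences(sentences, word_scores):
    """Map each sentence to the sum of the scores of its known words."""
    scores = {}
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        matched = [word_scores[t] for t in tokens if t in word_scores]
        if matched:  # sentences with no scored words are left out, as in the loop above
            scores[sentence] = sum(matched)
    return scores

toy_word_scores = {"summarization": 1.0, "query": 0.4, "generic": 0.2}
toy_sentences = [
    "The first is generic summarization.",
    "The second is query relevant summarization, sometimes called query-based summarization.",
    "This sentence has no scored words.",
]
toy_scores = score_sentences(toy_sentences, toy_word_scores)
for sentence, score in toy_scores.items():
    print(round(score, 2), sentence)
```

Because the scores are summed rather than averaged, a sentence that repeats frequent words, or is simply longer, tends to rank higher.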

In [44]:
from heapq import nlargest

In [45]:
# Keep roughly 30% of the sentences in the summary
select_length = int(len(sentence_tokens)*0.3)
select_length

6

In [46]:
summary = nlargest(select_length, sentence_score, key = sentence_score.get)

In [47]:
summary

[The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query.,
 At a very high level, summarization algorithms try to find subsets of objects (like set of sentences, or a set of images), which cover information of the entire set.,
 The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.).,
 An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.,
 Some techniques and algorithms which naturally model summarization problems are TextRank and PageRank, Submodular set function, Determinantal point process, maximal marginal relevance (MMR),
 Query based summarization techniques, additionally model for relevance of the summary with the query.]
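
Note that `nlargest` returns the sentences ordered by score, not by their position in the text. If the original reading order is preferred, one option is to rank sentence indices and sort the winners afterwards. This is a sketch with hypothetical names and toy data, not what the notebook does:

```python
from heapq import nlargest

def top_sentences_in_order(sentences, scores, k):
    """Pick the k highest-scoring sentences, then return them in document order."""
    # Rank indices rather than the sentences themselves so position is preserved.
    best = nlargest(k, range(len(sentences)), key=lambda i: scores[i])
    return [sentences[i] for i in sorted(best)]

toy_sentences = ["Intro sentence.", "Key point A.", "Minor aside.", "Key point B."]
toy_scores = [1.2, 3.5, 0.4, 4.0]
picked = top_sentences_in_order(toy_sentences, toy_scores, k=2)
print(" ".join(picked))
```

Here "Key point B." has the highest score, but it still appears after "Key point A." because the selected indices are sorted before joining.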

In [48]:
# Each item in summary is a sentence span, so take its text
final_summary = [sent.text for sent in summary]

In [50]:
summary = ' '.join(final_summary)

In [51]:
summary

'The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. At a very high level, summarization algorithms try to find subsets of objects (like set of sentences, or a set of images), which cover information of the entire set. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Some techniques and algorithms which naturally model summarization problems are TextRank and PageRank, Submodular set function, Determinantal point process, maximal marginal relevance (MMR) Query based summarization techniques, additionally model for relevance of the summary with the query.'

In [52]:
text

'\nThere are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.\nAn example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is sum

In [53]:
len(text)

2475

In [54]:
len(summary)

912
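
The two lengths above give the compression ratio directly. A short sketch of the arithmetic, using the numbers reported by the notebook (`compression_ratio` is an illustrative helper):

```python
def compression_ratio(original_length, summary_length):
    """Return the summary length as a fraction of the original length."""
    if original_length <= 0:
        raise ValueError("original_length must be positive")
    return summary_length / original_length

original_length = 2475  # len(text) reported above
summary_length = 912    # len(summary) reported above
ratio = compression_ratio(original_length, summary_length)
print(f"{ratio:.1%}")
```

The ratio is measured in characters, while `select_length` was chosen as 30% of the sentences, so the two percentages need not agree: the selected sentences may be longer than average.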

In [None]:
# A bit over a third of the original length