Steps of Text Summarization
- Text Cleaning
- Sentence Tokenization
- Word Tokenization
- Word-frequency table
- Summarize

In [27]:
# Sample text to summarize
text = """An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic). This problem is called multi-document summarization. A related application is summarizing news articles. Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary."""

In [14]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [3]:
stopwords = list(STOP_WORDS)

In [4]:
nlp = spacy.load('en_core_web_sm')

In [31]:
doc = nlp(text)

In [32]:
tokens = [token.text for token in doc]
print(tokens)

['An', 'example', 'of', 'a', 'summarization', 'problem', 'is', 'document', 'summarization', ',', 'which', 'attempts', 'to', 'automatically', 'produce', 'an', 'abstract', 'from', 'a', 'given', 'document', '.', 'Sometimes', 'one', 'might', 'be', 'interested', 'in', 'generating', 'a', 'summary', 'from', 'a', 'single', 'source', 'document', ',', 'while', 'others', 'can', 'use', 'multiple', 'source', 'documents', '(', 'for', 'example', ',', 'a', 'cluster', 'of', 'articles', 'on', 'the', 'same', 'topic', ')', '.', 'This', 'problem', 'is', 'called', 'multi', '-', 'document', 'summarization', '.', 'A', 'related', 'application', 'is', 'summarizing', 'news', 'articles', '.', 'Imagine', 'a', 'system', ',', 'which', 'automatically', 'pulls', 'together', 'news', 'articles', 'on', 'a', 'given', 'topic', '(', 'from', 'the', 'web', ')', ',', 'and', 'concisely', 'represents', 'the', 'latest', 'news', 'as', 'a', 'summary', '.']


In [16]:
# Treat the newline character as punctuation as well
punctuation = punctuation + '\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'

In [33]:
# Count every token that is neither a stop word nor punctuation
word_freq = {}
for word in doc:
    if word.text.lower() not in stopwords:
        if word.text.lower() not in punctuation:
            # Keys keep the token's original casing (e.g. 'Imagine')
            if word.text not in word_freq:
                word_freq[word.text] = 1
            else:
                word_freq[word.text] += 1
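
In the loop above, the stop-word and punctuation checks use the lowercased token, but the dictionary key keeps the original casing. A minimal sketch of the same counting step with lowercased keys is given here, using `collections.Counter` and plain strings instead of spaCy tokens. The token list, the tiny `stop_words` set, and the helper name `count_words` are made up for illustration and are not part of the notebook.

```python
from collections import Counter
from string import punctuation

# Hypothetical inputs: a hand-written token list and a tiny stop-word set,
# standing in for the spaCy tokens and STOP_WORDS used in the notebook.
tokens = ["Imagine", "a", "system", ",", "which", "pulls", "news",
          "and", "more", "news", "."]
stop_words = {"a", "which", "and", "more"}

# A set gives exact single-character membership tests.
punct = set(punctuation)

def count_words(tokens, stop_words):
    """Count lowercased tokens that are neither stop words nor punctuation."""
    counts = Counter()
    for tok in tokens:
        key = tok.lower()
        if key in stop_words or key in punct:
            continue
        counts[key] += 1
    return counts

word_counts = count_words(tokens, stop_words)
print(word_counts)
```

With lowercased keys, a later lookup by `word.text.lower()` finds every counted word, including sentence-initial ones.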

In [35]:
print(word_freq)

{'example': 2, 'summarization': 3, 'problem': 2, 'document': 4, 'attempts': 1, 'automatically': 2, 'produce': 1, 'abstract': 1, 'given': 2, 'interested': 1, 'generating': 1, 'summary': 2, 'single': 1, 'source': 2, 'use': 1, 'multiple': 1, 'documents': 1, 'cluster': 1, 'articles': 3, 'topic': 2, 'called': 1, 'multi': 1, 'related': 1, 'application': 1, 'summarizing': 1, 'news': 3, 'Imagine': 1, 'system': 1, 'pulls': 1, 'web': 1, 'concisely': 1, 'represents': 1, 'latest': 1}


In [36]:
max_freq = max(word_freq.values())

In [37]:
max_freq

4

In [38]:
# Normalize: divide each count by the largest count so the values lie in (0, 1]
for word in word_freq:
    word_freq[word] = word_freq[word] / max_freq
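
Dividing by the maximum count makes the most frequent word weigh 1.0 and every other word a fraction of it. The same step can be sketched as a small function that returns a new dictionary rather than updating in place. The helper name `normalize` is an assumption for this sketch; the four counts are copied from the frequency table printed earlier.

```python
# A few counts copied from the frequency table printed earlier.
counts = {"document": 4, "summarization": 3, "example": 2, "attempts": 1}

def normalize(counts):
    """Scale each count by the largest count so the top word scores 1.0."""
    if not counts:
        return {}  # nothing to scale; also avoids max() on an empty dict
    max_count = max(counts.values())
    return {word: n / max_count for word, n in counts.items()}

weights = normalize(counts)
print(weights)
# {'document': 1.0, 'summarization': 0.75, 'example': 0.5, 'attempts': 0.25}
```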


In [39]:
print(word_freq)

{'example': 0.5, 'summarization': 0.75, 'problem': 0.5, 'document': 1.0, 'attempts': 0.25, 'automatically': 0.5, 'produce': 0.25, 'abstract': 0.25, 'given': 0.5, 'interested': 0.25, 'generating': 0.25, 'summary': 0.5, 'single': 0.25, 'source': 0.5, 'use': 0.25, 'multiple': 0.25, 'documents': 0.25, 'cluster': 0.25, 'articles': 0.75, 'topic': 0.5, 'called': 0.25, 'multi': 0.25, 'related': 0.25, 'application': 0.25, 'summarizing': 0.25, 'news': 0.75, 'Imagine': 0.25, 'system': 0.25, 'pulls': 0.25, 'web': 0.25, 'concisely': 0.25, 'represents': 0.25, 'latest': 0.25}


In [40]:
sentence_tokens = list(doc.sents)
print(sentence_tokens)

[An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document., Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic)., This problem is called multi-document summarization., A related application is summarizing news articles., Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.]


In [42]:
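# Score each sentence as the sum of the normalized frequencies of its words
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        # The lookup is lowercased, so a key stored with a capital
        # letter (e.g. 'Imagine') is never matched here
        if word.text.lower() in word_freq:
            if sent not in sentence_scores:
                sentence_scores[sent] = word_freq[word.text.lower()]
            else:
                sentence_scores[sent] += word_freq[word.text.lower()]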
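
The scoring rule is simply "add up the weights of the words in the sentence". A minimal sketch on plain strings follows, using a regular expression as a stand-in for spaCy's tokenizer. The weights, the two sentences, and the helper name `score_sentence` are illustrative assumptions, not the notebook's data.

```python
import re

# Hypothetical word weights and sentences, for illustration only.
weights = {"news": 0.75, "articles": 0.75, "summary": 0.5, "imagine": 0.25}
sentences = [
    "A related application is summarizing news articles.",
    "Imagine a summary.",
]

def score_sentence(sentence, weights):
    """Sum the weights of the sentence's words; unknown words add 0."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return sum(weights.get(w, 0.0) for w in words)

scores = {s: score_sentence(s, weights) for s in sentences}
print(scores)
```

Because the score is a plain sum, longer sentences tend to score higher. Dividing by the number of scored words is one possible variation, not something this notebook does.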

In [43]:
sentence_scores

{An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.: 6.25,
 Sometimes one might be interested in generating a summary from a single source document, while others can use multiple source documents (for example, a cluster of articles on the same topic).: 6.0,
 This problem is called multi-document summarization.: 2.75,
 A related application is summarizing news articles.: 2.25,
 Imagine a system, which automatically pulls together news articles on a given topic (from the web), and concisely represents the latest news as a summary.: 5.75}

In [46]:
from heapq import nlargest

In [47]:
# Keep roughly 30% of the sentences (5 * 0.3 = 1.5, truncated to 1)
select_length = int(len(sentence_tokens) * 0.3)
select_length

1

In [48]:
summary = nlargest(select_length, sentence_scores, key = sentence_scores.get)
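
`nlargest` returns the selected keys ordered by score, and `int(n * 0.3)` becomes 0 for texts with three sentences or fewer. One way to handle both, sketched under the assumption that sentences are identified by their position in the text: select at least one sentence, then sort the picks back into document order. The helper name `select_top` is hypothetical; the scores are copied from the `sentence_scores` output above.

```python
from heapq import nlargest

# Scores copied from the sentence_scores output above,
# keyed by sentence position (0 = first sentence).
scores = {0: 6.25, 1: 6.0, 2: 2.75, 3: 2.25, 4: 5.75}

def select_top(scores, ratio=0.3):
    """Pick the highest-scoring sentence indices, returned in document order."""
    k = max(1, int(len(scores) * ratio))  # never select zero sentences
    top = nlargest(k, scores, key=scores.get)
    return sorted(top)

picked = select_top(scores)            # k = 1
picked_half = select_top(scores, 0.5)  # k = 2
print(picked, picked_half)
# [0] [0, 1]
```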

In [49]:
summary

[An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.]

In [50]:
# Each item in `summary` is a sentence span, so take its text
final_sum = [sent.text for sent in summary]

In [51]:
summary = ' '.join(final_sum)

In [52]:
print(summary)

An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document.


In [53]:
len(text)

591

In [54]:
len(summary)

139
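
The two lengths give a quick measure of how much the text was compressed. A small worked calculation, using the character counts reported above:

```python
text_length = 591     # len(text), as reported above
summary_length = 139  # len(summary), as reported above

# Fraction of the original characters kept in the summary
ratio = summary_length / text_length
print(f"{ratio:.1%}")
# 23.5%
```

The summary keeps a little under a quarter of the original characters, slightly below the 30% sentence ratio because one of five sentences was selected.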