# **5. Text Summarization**
Text summarization is the process of condensing a long text into a short, coherent, and fluent summary that outlines its major points, so that the information can be consumed more quickly.
When a computer program produces the summary, the task is called automatic text summarization; it is a common problem in machine learning and natural language processing (NLP).
Key challenges include identifying the relevant text, interpreting it, generating the summary, and evaluating the generated summary. In extraction-based summarization, the critical tasks are identifying key phrases in the document and using them to select the information to include in the summary. Two approaches are used for text summarization:
- Extractive summarization, which selects existing sentences from the source text
- Abstractive summarization, which generates new sentences that paraphrase the source text


# Defining Text

In [2]:
text = """ 
ummarization is the task of condensing a piece of text to a shorter version, reducing the size of the initial text while at the same time preserving key informational elements and the meaning of content. Since manual text summarization is a time expensive and generally laborious task, the automatization of the task is gaining increasing popularity and therefore constitutes a strong motivation for academic research.
There are important applications for text summarization in various NLP related tasks such as text classification, question answering, legal texts summarization, news summarization, and headline generation. Moreover, the generation of summaries can be integrated into these systems as an intermediate stage which helps to reduce the length of the document.
In the big data era, there has been an explosion in the amount of text data from a variety of sources. This volume of text is an inestimable source of information and knowledge which needs to be effectively summarized to be useful. This increasing availability of documents has demanded exhaustive research in the NLP area for automatic text summarization. Automatic text summarization is the task of producing a concise and fluent summary without any human help while preserving the meaning of the original text document.
It is very challenging, because when we as humans summarize a piece of text, we usually read it entirely to develop our understanding, and then write a summary highlighting its main points. Since computers lack human knowledge and language capability, it makes automatic text summarization a very difficult and non-trivial task.
Various models based on machine learning have been proposed for this task. Most of these approaches model this problem as a classification problem which outputs whether to include a sentence in the summary or not. Other approaches have used topic information, Latent Semantic Analysis (LSA), Sequence to Sequence models, Reinforcement Learning and Adversarial processes.
"""

# Importing Libraries

In [3]:
import spacy

In [5]:
from spacy.lang.en.stop_words import STOP_WORDS

In [6]:
from string import punctuation

# Stopwords

In [7]:
stopwords = list(STOP_WORDS)

In [8]:
nlp = spacy.load('en_core_web_sm')

In [9]:
doc = nlp(text)

# Tokenization

In [10]:
tokens = [token.text for token in doc]
print(tokens)

[' \n', 'ummarization', 'is', 'the', 'task', 'of', 'condensing', 'a', 'piece', 'of', 'text', 'to', 'a', 'shorter', 'version', ',', 'reducing', 'the', 'size', 'of', 'the', 'initial', 'text', 'while', 'at', 'the', 'same', 'time', 'preserving', 'key', 'informational', 'elements', 'and', 'the', 'meaning', 'of', 'content', '.', 'Since', 'manual', 'text', 'summarization', 'is', 'a', 'time', 'expensive', 'and', 'generally', 'laborious', 'task', ',', 'the', 'automatization', 'of', 'the', 'task', 'is', 'gaining', 'increasing', 'popularity', 'and', 'therefore', 'constitutes', 'a', 'strong', 'motivation', 'for', 'academic', 'research', '.', '\n', 'There', 'are', 'important', 'applications', 'for', 'text', 'summarization', 'in', 'various', 'NLP', 'related', 'tasks', 'such', 'as', 'text', 'classification', ',', 'question', 'answering', ',', 'legal', 'texts', 'summarization', ',', 'news', 'summarization', ',', 'and', 'headline', 'generation', '.', 'Moreover', ',', 'the', 'generation', 'of', 'summari

In [16]:
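# Also treat bare newline tokens as punctuation.
# Note: each run of this cell appends one more "\n".
punctuation = punctuation+"\n"
punctuation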

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n\n'

# Word Frequency Dictionary

In [19]:
word_frequency = {}
for word in doc:
  key = word.text.lower()
  # Skip stop words and punctuation (both checks use the lowercased token).
  if key not in stopwords and key not in punctuation:
    # Counts are stored under the token's original case, e.g. 'NLP', 'Automatic'.
    word_frequency[word.text] = word_frequency.get(word.text, 0) + 1
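
One detail of the punctuation check above: because `punctuation` is a string, `in` performs a substring test rather than a per-character lookup. The following is a minimal sketch of the difference using only the standard library; `punct_set` is a name introduced here for illustration and does not appear in the notebook.

```python
from string import punctuation

# `in` on a string is a substring test, so multi-character runs can match.
print(",-" in punctuation)   # True: ",-" occurs inside "+,-./"

# A set holds the individual characters, so membership is an exact match.
punct_set = set(punctuation)
print(",-" in punct_set)     # False
print("," in punct_set)      # True
```

For single-character tokens the two checks agree; they differ only for multi-character tokens that happen to appear as a run inside the punctuation string.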

In [18]:
print(word_frequency)

{' \n': 1, 'ummarization': 1, 'task': 6, 'condensing': 1, 'piece': 2, 'text': 12, 'shorter': 1, 'version': 1, 'reducing': 1, 'size': 1, 'initial': 1, 'time': 2, 'preserving': 2, 'key': 1, 'informational': 1, 'elements': 1, 'meaning': 2, 'content': 1, 'manual': 1, 'summarization': 7, 'expensive': 1, 'generally': 1, 'laborious': 1, 'automatization': 1, 'gaining': 1, 'increasing': 2, 'popularity': 1, 'constitutes': 1, 'strong': 1, 'motivation': 1, 'academic': 1, 'research': 2, 'important': 1, 'applications': 1, 'NLP': 2, 'related': 1, 'tasks': 1, 'classification': 2, 'question': 1, 'answering': 1, 'legal': 1, 'texts': 1, 'news': 1, 'headline': 1, 'generation': 2, 'summaries': 1, 'integrated': 1, 'systems': 1, 'intermediate': 1, 'stage': 1, 'helps': 1, 'reduce': 1, 'length': 1, 'document': 2, 'big': 1, 'data': 2, 'era': 1, 'explosion': 1, 'variety': 1, 'sources': 1, 'volume': 1, 'inestimable': 1, 'source': 1, 'information': 2, 'knowledge': 2, 'needs': 1, 'effectively': 1, 'summarized': 1, 

In [22]:
max_frequency = max(word_frequency.values())
max_frequency


12

In [23]:
# Scale every count to the range (0, 1] by dividing by the highest count.
for word in word_frequency.keys():
  word_frequency[word] = word_frequency[word]/max_frequency
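
The counting and scaling steps above can also be sketched with `collections.Counter`. This is a stdlib-only illustration, not the notebook's code: a regular expression stands in for the spaCy tokenizer, `SAMPLE_STOPWORDS` is a tiny hypothetical stop-word set rather than spaCy's `STOP_WORDS`, and words are lowercased before counting so that `Automatic` and `automatic` share one entry.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list (assumption; spaCy's STOP_WORDS is much larger).
SAMPLE_STOPWORDS = {"the", "is", "of", "a", "to", "and", "in"}

def normalized_frequencies(text):
    """Count non-stop-words case-insensitively and scale counts to (0, 1]."""
    words = re.findall(r"[a-z]+", text.lower())   # crude tokenizer: runs of letters
    counts = Counter(w for w in words if w not in SAMPLE_STOPWORDS)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}

sample = "Automatic summarization is the task of automatic text condensing. Text is data."
weights = normalized_frequencies(sample)
print(weights["automatic"])      # 1.0
print(weights["summarization"])  # 0.5
```

Lowercasing at counting time keeps the keys consistent with any later lookup that also lowercases the token.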

In [24]:
word_frequency

{' \n': 0.08333333333333333,
 'Adversarial': 0.08333333333333333,
 'Analysis': 0.08333333333333333,
 'Automatic': 0.08333333333333333,
 'LSA': 0.08333333333333333,
 'Latent': 0.08333333333333333,
 'Learning': 0.08333333333333333,
 'NLP': 0.16666666666666666,
 'Reinforcement': 0.08333333333333333,
 'Semantic': 0.08333333333333333,
 'Sequence': 0.16666666666666666,
 'academic': 0.08333333333333333,
 'answering': 0.08333333333333333,
 'applications': 0.08333333333333333,
 'approaches': 0.16666666666666666,
 'area': 0.08333333333333333,
 'automatic': 0.16666666666666666,
 'automatization': 0.08333333333333333,
 'availability': 0.08333333333333333,
 'based': 0.08333333333333333,
 'big': 0.08333333333333333,
 'capability': 0.08333333333333333,
 'challenging': 0.08333333333333333,
 'classification': 0.16666666666666666,
 'computers': 0.08333333333333333,
 'concise': 0.08333333333333333,
 'condensing': 0.08333333333333333,
 'constitutes': 0.08333333333333333,
 'content': 0.08333333333333333,
 

In [26]:
sentence_tokens = list(doc.sents)
sentence_tokens

[ 
 ummarization is the task of condensing a piece of text to a shorter version, reducing the size of the initial text while at the same time preserving key informational elements and the meaning of content.,
 Since manual text summarization is a time expensive and generally laborious task, the automatization of the task is gaining increasing popularity and therefore constitutes a strong motivation for academic research.,
 There are important applications for text summarization in various NLP related tasks such as text classification, question answering, legal texts summarization, news summarization, and headline generation.,
 Moreover, the generation of summaries can be integrated into these systems as an intermediate stage which helps to reduce the length of the document.,
 In the big data era, there has been an explosion in the amount of text data from a variety of sources.,
 This volume of text is an inestimable source of information and knowledge which needs to be effectively summ

In [27]:
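sentence_score = {}
for sent in sentence_tokens:
  for word in sent:
    # word_frequency keys keep their original case, so this lowercase lookup
    # only matches words that also occur in lowercase in the text ('NLP' is skipped).
    key = word.text.lower()
    if key in word_frequency:
      # A sentence's score is the sum of the normalized frequencies of its words.
      sentence_score[sent] = sentence_score.get(sent, 0) + word_frequency[key]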
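
In formula form, the normalization and scoring steps amount to the following, where $f(t)$ is the raw count of word $t$ and $V$ is the set of keys in `word_frequency`. The symbols are notation introduced here, not taken from the notebook.

```latex
w(t) = \frac{f(t)}{\max_{u \in V} f(u)}, \qquad
\operatorname{score}(s) = \sum_{\substack{t \in s \\ \operatorname{lower}(t) \in V}} w\bigl(\operatorname{lower}(t)\bigr)
```

For this text, `text` occurs 12 times, so $w(\text{text}) = 12/12 = 1$, while a word seen once gets $1/12 \approx 0.083$. Because the score is a plain sum with no length normalization, longer sentences tend to score higher.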


In [29]:
sentence_score

{ 
 ummarization is the task of condensing a piece of text to a shorter version, reducing the size of the initial text while at the same time preserving key informational elements and the meaning of content.: 4.166666666666667,
 Since manual text summarization is a time expensive and generally laborious task, the automatization of the task is gaining increasing popularity and therefore constitutes a strong motivation for academic research.: 4.000000000000001,
 There are important applications for text summarization in various NLP related tasks such as text classification, question answering, legal texts summarization, news summarization, and headline generation.: 4.916666666666666,
 Moreover, the generation of summaries can be integrated into these systems as an intermediate stage which helps to reduce the length of the document.: 1.0,
 In the big data era, there has been an explosion in the amount of text data from a variety of sources.: 1.7499999999999998,
 This volume of text is an 

In [30]:
from heapq import nlargest

In [35]:
# Keep roughly 30% of the sentences (rounded down).
select_length = int(len(sentence_tokens)*0.3)

In [37]:
summary = nlargest(select_length, sentence_score, key = sentence_score.get)
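
`nlargest` returns the selected sentences in descending score order, not in the order they appear in the text. If a readable summary is wanted, one option is to re-sort the selection by position and join it into a single string. This is a minimal sketch with plain strings standing in for spaCy sentence spans; the sentences and scores are made up for illustration.

```python
from heapq import nlargest

# Hypothetical sentences (in document order) with made-up scores.
sentences = ["First point.", "Minor aside.", "Second point.", "Closing remark."]
scores = {"First point.": 2.0, "Minor aside.": 0.5,
          "Second point.": 3.5, "Closing remark.": 1.0}

select_length = int(len(sentences) * 0.5)            # keep half of the sentences -> 2
top = nlargest(select_length, scores, key=scores.get)
print(top)            # ['Second point.', 'First point.']

# Restore document order, then join into one string.
ordered = sorted(top, key=sentences.index)
summary_text = " ".join(ordered)
print(summary_text)   # First point. Second point.
```

With spaCy spans, `sent.start` could plausibly serve as the sort key and `sent.text.strip()` as the string to join; that adaptation is an assumption and is not part of the notebook.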

# Final Summary

In [39]:
summary

[There are important applications for text summarization in various NLP related tasks such as text classification, question answering, legal texts summarization, news summarization, and headline generation.,
 Automatic text summarization is the task of producing a concise and fluent summary without any human help while preserving the meaning of the original text document.,
  
 ummarization is the task of condensing a piece of text to a shorter version, reducing the size of the initial text while at the same time preserving key informational elements and the meaning of content.]