
# **Text Summarization**

**Project Overview:**
This Text Summarization project automates the process of condensing large bodies of text into concise summaries while retaining the most important information. Using Natural Language Processing (NLP) techniques, the project supports both extractive and abstractive summarization methods. The summarization tool is designed to assist users in quickly digesting large volumes of information, such as articles, research papers, and reports.

In [2]:
Text="""Text summarization is the process of automatically creating a shorter version
 of a given text while retaining the most important information.
  There are two primary methods for text summarization:
  extractive and abstractive. Extractive summarization works by selecting
  key sentences, phrases, or sections directly from the source document and
  concatenating them to form a summary. In contrast, abstractive summarization generates
  new sentences that may not appear in the original text, often requiring
  advanced natural language processing (NLP) techniques. The goal of both
  methods is to reduce the text size while preserving its meaning, making the
  summarization process highly valuable for quickly digesting large volumes of
  information, such as articles, reports, or research papers."""


In [3]:
len(Text)

810

# **Importing Libraries**

In [4]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [5]:
nlp=spacy.load('en_core_web_sm')

In [6]:
document = nlp(Text)

In [7]:
tokens=[token.text for token in document]
print(tokens)

['Text', 'summarization', 'is', 'the', 'process', 'of', 'automatically', 'creating', 'a', 'shorter', 'version', '\n ', 'of', 'a', 'given', 'text', 'while', 'retaining', 'the', 'most', 'important', 'information', '.', '\n  ', 'There', 'are', 'two', 'primary', 'methods', 'for', 'text', 'summarization', ':', '\n  ', 'extractive', 'and', 'abstractive', '.', 'Extractive', 'summarization', 'works', 'by', 'selecting', '\n  ', 'key', 'sentences', ',', 'phrases', ',', 'or', 'sections', 'directly', 'from', 'the', 'source', 'document', 'and', '\n  ', 'concatenating', 'them', 'to', 'form', 'a', 'summary', '.', 'In', 'contrast', ',', 'abstractive', 'summarization', 'generates', '\n  ', 'new', 'sentences', 'that', 'may', 'not', 'appear', 'in', 'the', 'original', 'text', ',', 'often', 'requiring', '\n  ', 'advanced', 'natural', 'language', 'processing', '(', 'NLP', ')', 'techniques', '.', 'The', 'goal', 'of', 'both', '\n  ', 'methods', 'is', 'to', 'reduce', 'the', 'text', 'size', 'while', 'preserving

In [8]:
punctuation=punctuation+'\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'
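Because `punctuation` is a plain string, an expression such as `x in punctuation` is a substring test rather than a set lookup. The minimal sketch below uses only the standard library to show which kinds of token that test catches. The helper `is_noise` is a hypothetical alternative for illustration, not part of the notebook.

```python
from string import punctuation

# Same extension as in the cell above: add the newline character.
punct = punctuation + '\n'

# Single characters behave as expected.
print('.' in punct)      # True
print('\n' in punct)     # True

# A multi-character whitespace token is not a substring of `punct`,
# so a filter based on `in punct` lets it through.
print('\n  ' in punct)   # False

# Hypothetical alternative: treat a token as noise if it is
# whitespace-only or consists entirely of punctuation characters.
def is_noise(tok: str) -> bool:
    return tok.strip() == '' or all(ch in punct for ch in tok)

print([is_noise(t) for t in ['.', '\n  ', 'text']])  # [True, True, False]
```

When working with spaCy tokens directly, the token attributes `is_space` and `is_punct` serve a similar purpose.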

# **Cleaning the Text**

In [11]:
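word_frequency = {}

# Count every token that is neither a stop word nor a punctuation character.
# Caveats visible in the output below: keys keep their original case, so
# 'Text' and 'text' are counted separately, and multi-character whitespace
# tokens such as '\n  ' pass the punctuation check and are counted too.
for word in document:
  if word.text.lower() not in STOP_WORDS and word.text.lower() not in punctuation:
    word_frequency[word.text] = word_frequency.get(word.text, 0) + 1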


In [12]:
print(word_frequency)

{'Text': 1, 'summarization': 5, 'process': 2, 'automatically': 1, 'creating': 1, 'shorter': 1, 'version': 1, '\n ': 1, 'given': 1, 'text': 4, 'retaining': 1, 'important': 1, 'information': 2, '\n  ': 9, 'primary': 1, 'methods': 2, 'extractive': 1, 'abstractive': 2, 'Extractive': 1, 'works': 1, 'selecting': 1, 'key': 1, 'sentences': 2, 'phrases': 1, 'sections': 1, 'directly': 1, 'source': 1, 'document': 1, 'concatenating': 1, 'form': 1, 'summary': 1, 'contrast': 1, 'generates': 1, 'new': 1, 'appear': 1, 'original': 1, 'requiring': 1, 'advanced': 1, 'natural': 1, 'language': 1, 'processing': 1, 'NLP': 1, 'techniques': 1, 'goal': 1, 'reduce': 1, 'size': 1, 'preserving': 1, 'meaning': 1, 'making': 1, 'highly': 1, 'valuable': 1, 'quickly': 1, 'digesting': 1, 'large': 1, 'volumes': 1, 'articles': 1, 'reports': 1, 'research': 1, 'papers': 1}
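The output above has two side effects worth noting: the whitespace token `'\n  '` is counted nine times, and `'Text'` and `'text'` are stored under separate keys. The sketch below shows one possible variant that lowercases keys and skips whitespace. It works on plain strings, and `STOP` is a tiny hypothetical stand-in for spaCy's `STOP_WORDS`, so it illustrates the idea rather than reproducing the notebook's counts.

```python
from collections import Counter
from string import punctuation

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses spaCy's much larger STOP_WORDS.
STOP = {'is', 'the', 'of', 'a'}

def count_words(tokens):
    """Count lowercase tokens, skipping stop words, punctuation and whitespace."""
    counts = Counter()
    for tok in tokens:
        key = tok.lower()
        if key.strip() == '':          # whitespace-only tokens such as '\n  '
            continue
        if key in STOP or all(ch in punctuation for ch in key):
            continue
        counts[key] += 1
    return counts

tokens = ['Text', 'summarization', 'is', 'the', 'process', '\n  ',
          'of', 'a', 'given', 'text', '.']
print(count_words(tokens))
```

With spaCy tokens, the same three checks could be expressed with `token.is_space`, `token.is_punct` and `token.is_stop`.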


In [16]:
max_frequency = max(word_frequency.values())

In [17]:
# Normalize each count by the largest count so that every value lies in (0, 1].
for word in word_frequency:
  word_frequency[word] = word_frequency[word] / max_frequency

In [18]:
print(word_frequency)

{'Text': 0.1111111111111111, 'summarization': 0.5555555555555556, 'process': 0.2222222222222222, 'automatically': 0.1111111111111111, 'creating': 0.1111111111111111, 'shorter': 0.1111111111111111, 'version': 0.1111111111111111, '\n ': 0.1111111111111111, 'given': 0.1111111111111111, 'text': 0.4444444444444444, 'retaining': 0.1111111111111111, 'important': 0.1111111111111111, 'information': 0.2222222222222222, '\n  ': 1.0, 'primary': 0.1111111111111111, 'methods': 0.2222222222222222, 'extractive': 0.1111111111111111, 'abstractive': 0.2222222222222222, 'Extractive': 0.1111111111111111, 'works': 0.1111111111111111, 'selecting': 0.1111111111111111, 'key': 0.1111111111111111, 'sentences': 0.2222222222222222, 'phrases': 0.1111111111111111, 'sections': 0.1111111111111111, 'directly': 0.1111111111111111, 'source': 0.1111111111111111, 'document': 0.1111111111111111, 'concatenating': 0.1111111111111111, 'form': 0.1111111111111111, 'summary': 0.1111111111111111, 'contrast': 0.1111111111111111, 'g
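Normalization divides every count by the largest one. Since the largest count here belongs to the whitespace token `'\n  '` (9 occurrences), `'summarization'` ends up at 5/9 instead of 1.0. The sketch below replays that arithmetic on three counts copied from the frequency output. The `normalize` helper and the whitespace clean-up are assumptions for illustration.

```python
# Three raw counts copied from the frequency output above.
counts = {'summarization': 5, 'text': 4, '\n  ': 9}

def normalize(freq):
    """Scale every count by the maximum so the top entry becomes 1.0."""
    top = max(freq.values())
    return {word: n / top for word, n in freq.items()}

with_ws = normalize(counts)

# Same counts with the whitespace token dropped (hypothetical clean-up step).
without_ws = normalize({w: n for w, n in counts.items() if w.strip()})

print(with_ws['summarization'])     # 5/9
print(without_ws['summarization'])  # 5/5 = 1.0
```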

# **Sentence Tokenisation**

In [19]:
sent_tokens = [sentence for sentence in document.sents]
print(sent_tokens)

[Text summarization is the process of automatically creating a shorter version
 of a given text while retaining the most important information.
  , There are two primary methods for text summarization: 
  extractive and abstractive., Extractive summarization works by selecting 
  key sentences, phrases, or sections directly from the source document and 
  concatenating them to form a summary., In contrast, abstractive summarization generates 
  new sentences that may not appear in the original text, often requiring 
  advanced natural language processing (NLP) techniques., The goal of both 
  methods is to reduce the text size while preserving its meaning, making the
  summarization process highly valuable for quickly digesting large volumes of
  information, such as articles, reports, or research papers.]


In [20]:
sentence_score = {}


In [21]:
# Score each sentence as the sum of the normalized frequencies of its tokens.
# The lookup uses the lowercased token, so a key stored with capitals
# (for example 'NLP') is never matched.
for sent in sent_tokens:
  for word in sent:
    key = word.text.lower()
    if key in word_frequency:
      sentence_score[sent] = sentence_score.get(sent, 0) + word_frequency[key]

In [23]:
print(sentence_score)

{Text summarization is the process of automatically creating a shorter version
 of a given text while retaining the most important information.
  : 3.7777777777777786, There are two primary methods for text summarization: 
  extractive and abstractive.: 2.6666666666666665, Extractive summarization works by selecting 
  key sentences, phrases, or sections directly from the source document and 
  concatenating them to form a summary.: 4.111111111111111, In contrast, abstractive summarization generates 
  new sentences that may not appear in the original text, often requiring 
  advanced natural language processing (NLP) techniques.: 4.666666666666664, The goal of both 
  methods is to reduce the text size while preserving its meaning, making the
  summarization process highly valuable for quickly digesting large volumes of
  information, such as articles, reports, or research papers.: 6.444444444444441}
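The scoring step sums the normalized frequency of every token in a sentence, which means longer sentences tend to collect higher totals. A minimal sketch of the same idea on plain strings follows. The frequency table and sentences are made up, and splitting on whitespace stands in for spaCy's tokenizer.

```python
# Made-up normalized frequencies, keyed by lowercase word (illustration only).
freq = {'summarization': 1.0, 'text': 0.8, 'methods': 0.4}

def score_sentences(sentences, freq):
    """Sum the frequency of every known word in each sentence."""
    scores = {}
    for sent in sentences:
        total = 0.0
        for word in sent.lower().split():
            # Strip trailing punctuation so 'text.' matches 'text'.
            total += freq.get(word.strip('.,:'), 0.0)
        scores[sent] = total
    return scores

sentences = [
    'Text summarization shortens a text.',
    'There are two methods.',
]
scores = score_sentences(sentences, freq)
for sent, score in scores.items():
    print(round(score, 2), sent)
```

Dividing each total by the number of scored words would be one way to reduce the bias toward long sentences, at the cost of changing the ranking the notebook produces.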


**Select the top 30% of sentences by score**

In [24]:
from heapq import nlargest

In [25]:
len(sentence_score)

5

In [26]:
summary = nlargest(int(len(sentence_score)*0.3),sentence_score,key=sentence_score.get)

In [27]:
summary

[The goal of both 
   methods is to reduce the text size while preserving its meaning, making the
   summarization process highly valuable for quickly digesting large volumes of
   information, such as articles, reports, or research papers.]
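The selection count comes from `int(len(sentence_score)*0.3)`, which truncates: five sentences give `int(1.5)`, so one sentence is kept. For a document of three sentences or fewer the same expression yields 0 and the summary would be empty. The sketch below shows the call with a guard of at least one sentence. The helper `top_sentences` and the scores are hypothetical.

```python
from heapq import nlargest

def top_sentences(scores, ratio=0.3):
    """Return the highest-scoring keys, keeping at least one."""
    k = max(1, int(len(scores) * ratio))
    return nlargest(k, scores, key=scores.get)

# Made-up scores for five placeholder sentences.
scores = {'s1': 3.8, 's2': 2.7, 's3': 4.1, 's4': 4.7, 's5': 6.4}
print(top_sentences(scores))             # ['s5']
print(top_sentences(scores, ratio=0.5))  # ['s5', 's4']

# Without the guard, three sentences at 30% would select nothing.
print(int(3 * 0.3))                      # 0
```

`nlargest` returns items in descending score order, so a summary of several sentences may need to be re-sorted by position to read in document order.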

In [28]:
# Each item in `summary` is a spaCy Span (a whole sentence), not a single word.
final_summary = [sent.text for sent in summary]

In [29]:
print(final_summary)

['The goal of both \n  methods is to reduce the text size while preserving its meaning, making the\n  summarization process highly valuable for quickly digesting large volumes of\n  information, such as articles, reports, or research papers.']


In [30]:
summary = ' '.join(final_summary)

In [31]:
print(summary)

The goal of both 
  methods is to reduce the text size while preserving its meaning, making the
  summarization process highly valuable for quickly digesting large volumes of
  information, such as articles, reports, or research papers.
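The printed summary still carries the line breaks and indentation of the source string. One common way to tidy this, shown here as an optional sketch on a fragment of the output above, is to split on any whitespace and rejoin with single spaces.

```python
# Fragment of the summary text, including the embedded newline and indent.
raw = 'The goal of both \n  methods is to reduce the text size'

# str.split() with no argument splits on runs of any whitespace,
# so rejoining with ' ' collapses newlines and repeated spaces.
clean = ' '.join(raw.split())

print(clean)  # The goal of both methods is to reduce the text size
```

Applying this before measuring length would also give a character count that is not inflated by layout whitespace.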


In [32]:
len(summary)

236

In [33]:
len(summary)/len(Text)

0.291358024691358
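The ratio above is measured in characters. Measured in sentences, the summary keeps one of five. The short sketch below puts both figures side by side, using only the numbers reported in the cells above.

```python
# Figures reported in the cells above.
summary_chars, text_chars = 236, 810
summary_sents, total_sents = 1, 5

char_ratio = summary_chars / text_chars
sent_ratio = summary_sents / total_sents

print(f'{char_ratio:.1%} of characters')  # 29.1% of characters
print(f'{sent_ratio:.1%} of sentences')   # 20.0% of sentences
```

The two ratios differ because the selected sentence is the longest one in the text.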