# **Text Summarization Task (Extractive)**


---

**Part 1: Installing Dependencies**

In [None]:
# Upgrade spaCy to the latest version
!pip install -U spacy

# Download the English model for spaCy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


**Part 2: Importing Libraries and Initializing spaCy**

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from heapq import nlargest

**Part 3: Processing Text and Preparing Data Structures**

In [None]:
text = """Text summarization is the creation of a short, accurate, and fluent summary of a longer text document. Automatic text summarization methods are greatly needed to address the ever-growing amount of text data available online. This could help to discover relevant information and to consume relevant information faster. Consider the internet, which is made up of web pages, news stories, status updates, blogs, and many other things. Because the data is unstructured, the best we can do is perform search and glance over the results. Much of this text material has to be reduced to shorter, focused summaries that capture the important elements, so we can explore it more easily and to ensure that the summaries include the information we need. We need automatic text summarization tools for the below mentioned reasons:1.Summaries help you save time by reducing the amount of time you spend reading.2.Summaries aid in the selection of documents when conducting research.3.The efficacy of indexing is improved by automatic summarization.4.Human summarizers are more prejudiced than automatic summarizing techniques.5.Because they give individualized information, personalized summaries are important in question-answering systems.6.Commercial abstract services can enhance the volume of texts they can handle by using automatic or semi-automatic summarizing systemsThere are two main approaches to summarizing text documents:1.extractive methods,2.abstractive methods.To create the new summary, extractive text summarization selects phrases and sentences from the original document. Techniques include rating the importance of phrases in order to select just those that are most essential to the source's meaning. To capture the meaning of the source content, abstractive text summarization entails creating whole new words and sentences. This is a more difficult strategy, but it is also the one that humans will employ in the end. 
The content of the original document is selected and compressed using traditional methods. Most effective text summarizing methods are extractive because they are easier to implement, whereas abstractive alternatives offer the promise of more universal solutions.The act of constructing a concise and coherent version of a lengthier document is known as automatic text summarizing, or simply text summarization. We (humans) are often good at this sort of assignment since it necessitates first comprehending the content of the original material and then distilling the meaning and capturing key features in the new description. The goal of automatic summarizing studies is to create approaches that allow a computer to generate summaries that closely resemble those produced by humans. Generating words and phrases that convey the meaning of the source content is not enough. The summary should be factual and read as if it were a separate paper.After generation of the directed total graph, G, a community detection algorithm has been executed on the graph using the Infomap tool. Infomap [68] optimizes the map equation, which takes advantage of the information theoretic duality between compressing data and recognizing and retrieving relevant patterns or structures within the data. Using a network-based clustering technique [68], communities are identified which represent sets of distinct tweet IDs. Finally distinct tweets are identified mapping the tweet ID into tweets and summaries are generated considering one representative tweet from each community, which is named total summary.Based on the out-degree of the directed total graph, G, a summary has been generated. In the graph, a higher out-degree of a node implies more information going out from the node. From the Infomap output for each module, a representative set of nodes or tweets can be identified which are similar in nature. 
Instead of considering module representatives, as a summary entry the highest-degree node has been chosen from each set corresponding to modules and a summary file has been generated which is named total degree summary.As in the experimental dataset, microblogging datasets such as the Twitter dataset are considered. The maximum length of a tweet is 140 characters, so it can be considered that tweets having maximum or near-maximum length are more informative. So, instead of considering a module representative as a summary entry, the highest-length node has been chosen from each set corresponding to modules and a summary file has been generated which is named total length summary."""

# Load the English model for processing text
nlp = spacy.load('en_core_web_sm')

# Process the text with spaCy
doc = nlp(text)

# Create a list of English stop words
stopwords = list(STOP_WORDS)

# Initialize punctuation and add newline characters
punctuation = punctuation + '\n'

# Initialize dictionary for word frequencies
word_frequencies = {}

**Part 4: Calculating Word Frequencies**

In [None]:
# Calculate word frequencies, keyed by lowercased text so that the
# lookups in Part 6 (which use word.text.lower()) find every counted word
for word in doc:
    token = word.text.lower()
    if token not in stopwords and token not in punctuation:
        word_frequencies[token] = word_frequencies.get(token, 0) + 1

# Find maximum frequency
max_frequency = max(word_frequencies.values())

# Normalize word frequencies
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / max_frequency
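
The counting and normalization steps above can also be sketched without spaCy. The block below is a minimal illustration, not the notebook's own code: the token list, the `stop_words` set, and the `ignored` set are hypothetical stand-ins for `doc`, `STOP_WORDS`, and `punctuation`.

```python
from collections import Counter

# Hypothetical toy inputs, for illustration only
tokens = ["Text", "summarization", "helps", "text", "readers", ".", "the", "text"]
stop_words = {"the"}        # stand-in for spaCy's STOP_WORDS
ignored = set(".,;:!?\n")   # stand-in for string.punctuation + '\n'

# Count lowercased tokens, skipping stop words and punctuation
counts = Counter(
    tok.lower()
    for tok in tokens
    if tok.lower() not in stop_words and tok.lower() not in ignored
)

# Divide every count by the largest count so each score falls in (0, 1]
max_count = max(counts.values())
normalized = {word: count / max_count for word, count in counts.items()}

print(normalized)
```

Lowercasing before counting merges "Text" and "text" into one entry, so the most frequent word always ends up with a score of exactly 1.0.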

**Part 5: Tokenizing into Sentences**

In [None]:
# Tokenize into sentences
sentence_tokens = [sent for sent in doc.sents]

**Part 6: Calculating Sentence Scores**

In [None]:
# Calculate sentence scores based on word frequencies
sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        key = word.text.lower()
        if key in word_frequencies:
            # Add this word's normalized frequency to the sentence total
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[key]
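
To make the scoring rule concrete, here is a small worked sketch in plain Python. The frequencies and pre-tokenized sentences are hypothetical values chosen for illustration; only the rule itself (sum the normalized frequency of every scored word in a sentence) comes from the loop above.

```python
# Hypothetical normalized word frequencies and pre-tokenized sentences
word_frequencies = {"summarization": 1.0, "text": 0.5, "graph": 0.25}
sentences = {
    "s1": ["Text", "summarization", "is", "useful"],
    "s2": ["The", "graph", "is", "directed"],
    "s3": ["Nothing", "relevant", "here"],
}

sentence_scores = {}
for name, words in sentences.items():
    # Sum the frequency of each word that has one; other words add nothing
    score = sum(word_frequencies.get(w.lower(), 0.0) for w in words)
    if score > 0:  # sentences with no scored words get no entry
        sentence_scores[name] = score

print(sentence_scores)
```

Because the score is a plain sum, longer sentences tend to score higher simply by containing more scored words. Dividing by sentence length would be one possible adjustment, though the notebook does not do this.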

**Part 7: Generating Extractive Summary**

In [None]:
from tabulate import tabulate

# Determine number of sentences for summary (e.g., 30% of total sentences),
# keeping at least one sentence so short inputs do not yield an empty summary
select_length = max(1, int(len(sentence_tokens) * 0.3))

# Generate summary using nlargest to select top sentences
summary = nlargest(select_length, sentence_scores, key=sentence_scores.get)

# Extract text of top sentences for final summary
final_summary = [sent.text for sent in summary]

# Create a table structure for better visualization
table = [["Original Text", text], ["Extractive Summary", "\n".join(final_summary)]]

# Print the table using tabulate
print(tabulate(table, headers=["Section", "Content"], tablefmt="grid"))
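
`nlargest` returns the selected sentences ordered by score, not by their position in the text. If a summary that follows the original reading order is preferred, the selection can be re-sorted afterwards. The sketch below assumes hypothetical sentences and scores keyed by sentence position; it is one possible variation, not part of the notebook's pipeline.

```python
from heapq import nlargest

# Hypothetical sentences (in document order) and their scores
sentences = ["First point.", "Minor aside.", "Key finding.", "Closing remark."]
scores = {0: 0.9, 1: 0.2, 2: 1.4, 3: 0.5}  # keyed by sentence position

# 50% is used here only to keep the toy example small
select_length = max(1, int(len(sentences) * 0.5))

# Pick the highest-scoring positions, then restore document order
top_positions = nlargest(select_length, scores, key=scores.get)
in_document_order = sorted(top_positions)
summary_text = " ".join(sentences[i] for i in in_document_order)

print(summary_text)
```

With spaCy sentence spans, the same effect could be obtained by sorting the selected spans on `sent.start` before joining their text.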


**Part 8: Saving the Summary Table**

In [None]:
import pickle

# Save the table as a pkl file
with open('model1.pkl', 'wb') as f:
    pickle.dump(table, f)
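
The saved file can be read back with `pickle.load`. The round trip below is a minimal sketch: it uses a hypothetical stand-in for `table` and writes to a temporary directory under an assumed file name, so it does not touch `model1.pkl`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the `table` built above
table = [
    ["Original Text", "A long document."],
    ["Extractive Summary", "A short summary."],
]

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "summary_table.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(table, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == table)
```

Note that the pickle holds only the table of strings (original text and summary), not a trained model, and that pickle files should only be loaded from trusted sources.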