<a href="https://colab.research.google.com/github/jigarsiddhpura/TextSummarizer/blob/main/TextSummarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [100]:
# !pip install -U pip setuptools wheel
# !pip install -U spacy
# !python -m spacy download en_core_web_lg

In [87]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [110]:
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from heapq import nlargest


In [89]:
document = "Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics."

In [90]:
doc = nlp(document)

In [115]:
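keywords = []
# A set gives constant-time membership tests; "not" is kept so negations survive filtering.
stopWords = set(STOP_WORDS)
stopWords.discard("not")
pos_tag = ['NOUN','VERB','ADJ','PROPN']

# Collect content words (selected by part of speech), skipping stop words and punctuation.
for token in doc:
  if token.text in stopWords or token.text in punctuation:
    continue
  if token.pos_ in pos_tag:
    keywords.append(token.text)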
  

In [116]:
sentences = list(doc.sents)
processed_sentences = []

# Reduce each sentence to the lemmas of its content words; these strings feed the TF-IDF step.
for sent in sentences:
  processed_sent = " ".join(
      token.lemma_ for token in sent
      if token.pos_ in pos_tag and token.text not in stopWords and not token.is_punct
  )
  processed_sentences.append(processed_sent)

In [117]:
vectorizer = TfidfVectorizer()
tf_idf = vectorizer.fit_transform(processed_sentences)

In [118]:
tf_idf

<7x57 sparse matrix of type '<class 'numpy.float64'>'
	with 86 stored elements in Compressed Sparse Row format>
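
The matrix above has one row per sentence and one column per distinct lemma. As a rough illustration of what each row holds and what summing a row measures, here is a minimal pure-Python sketch. It assumes `TfidfVectorizer`'s default settings (smoothed IDF, L2 row normalization) and takes pre-tokenized sentences, so it skips the vectorizer's own tokenization; `tfidf_row_sums` is a hypothetical helper name, not part of scikit-learn.

```python
import math
from collections import Counter

def tfidf_row_sums(tokenized_sentences):
    """Sum of L2-normalized TF-IDF weights for each sentence.

    Assumes idf = ln((1 + n) / (1 + df)) + 1 and L2 normalization per row,
    which are the defaults of scikit-learn's TfidfVectorizer.
    """
    n = len(tokenized_sentences)
    # Document frequency: in how many sentences each term appears.
    df = Counter()
    for tokens in tokenized_sentences:
        df.update(set(tokens))

    sums = []
    for tokens in tokenized_sentences:
        tf = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        sums.append(sum(weights.values()) / norm if norm else 0.0)
    return sums

# Hypothetical toy input, not the notebook's data.
toy = [["machine", "learning", "algorithm"], ["data", "mining"], ["machine"]]
print(tfidf_row_sums(toy))
```

Because each row is scaled to unit length, its sum can never exceed the square root of the number of distinct terms in that sentence. A row-sum score therefore tends to favor sentences with more distinct content words.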

In [119]:
# Score each sentence as the sum of the TF-IDF weights in its row.
sentence_scores = tf_idf.sum(axis=1)
sentence_scores

matrix([[3.79800107],
        [3.83103784],
        [3.99312046],
        [2.89792417],
        [3.2158053 ],
        [3.0153618 ],
        [2.71758319]])

In [120]:
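# argsort is ascending: the index of the highest-scoring sentence comes last.
# Despite the variable name, these are row (sentence) indices.
weighted_column_indices = np.argsort(sentence_scores, axis=0)
print("Sorted row indices:", weighted_column_indices)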

Sorted row indices: [[6]
 [3]
 [5]
 [4]
 [0]
 [1]
 [2]]


In [121]:
weighted_indices = np.ravel(weighted_column_indices)
weighted_indices

array([6, 3, 5, 4, 0, 1, 2])

In [122]:
TOP_SENT_COUNT = 5

In [123]:
# The indices are sorted ascending by score, so the last TOP_SENT_COUNT entries are the best ones.
top_sentences = [processed_sentences[index] for index in weighted_indices[-TOP_SENT_COUNT:]]

In [124]:
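# Caution: top_sentences holds plain strings, so nlargest compares them
# lexicographically. This keeps the 3 alphabetically last of the top 5
# sentences, not the 3 with the highest TF-IDF scores.
summarized_sentences = nlargest(3,top_sentences)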

In [125]:
summarized_sentences

['study mathematical optimization deliver method theory application domain field machine learning',
 'machine learning algorithm build mathematical model sample datum know training datum order prediction decision program perform task',
 'machine learning algorithm application email filtering detection network intruder computer vision infeasible develop algorithm specific instruction perform task']
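
Called on a list of strings, `nlargest` orders them alphabetically, which is why the sentence starting with "study" comes first above. To rank by score instead, one option is to pass `nlargest` the sentence indices together with a `key` that looks up each score. The sketch below uses the row sums printed earlier, rounded to three decimals; `top_k_indices` is a hypothetical helper name.

```python
from heapq import nlargest

# Row sums copied from the sentence_scores output, rounded to three decimals.
scores = [3.798, 3.831, 3.993, 2.898, 3.216, 3.015, 2.718]

def top_k_indices(scores, k):
    """Return the indices of the k highest scores, best first."""
    return nlargest(k, range(len(scores)), key=scores.__getitem__)

print(top_k_indices(scores, 3))  # [2, 1, 0]
```

With these scores the three best sentences are the third, second, and first of the document, which differs from the alphabetical selection.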

In [127]:
# Turn a string of lemmas back into a rough, sentence-like string
def process_sentence(sentence):
    # Run the spaCy pipeline on the string
    doc = nlp(sentence)

    # Lemmatize the tokens and join them back into a string.
    # The input is already lemmatized, so this pass can change words again
    # (for example "learning" becomes "learn").
    lemmas = [token.lemma_ for token in doc]
    sentence_text = ' '.join(lemmas)

    # Re-parse the lemmatized text to get POS tags and dependency labels
    sentence_doc = nlp(sentence_text)
    sentence_text = ''
    for i, token in enumerate(sentence_doc):
        # Add spaces between words
        if i > 0:
            sentence_text += ' '
        # Naively prefix "a" to every noun that is not part of a compound.
        # There is no a/an or number agreement, so the result is not grammatical.
        if token.pos_ == 'NOUN' and token.dep_ != 'compound':
            sentence_text += 'a '
        # Add the word to the sentence
        sentence_text += token.text
    # Capitalize the first letter (str.capitalize also lowercases the rest)
    sentence_text = sentence_text.capitalize()
    # Add a period to the end of the sentence
    sentence_text += '.'
    return sentence_text


In [128]:
processed_sentences = [process_sentence(sentence) for sentence in summarized_sentences]

# Print the results
for sentence in processed_sentences:
    print(sentence)

Study mathematical a optimization deliver method theory application domain field a machine learn.
Machine learning a algorithm build mathematical model sample a datum know training datum order prediction decision a program perform a task.
Machine learning algorithm application email filter detection network intruder computer a vision infeasible develop a algorithm specific a instruction a perform a task.
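
The strings printed above are rebuilt from lemmas, so they do not read as natural sentences. An extractive summary more commonly quotes the original sentences and keeps them in document order. A minimal sketch of that final step follows, assuming the selected sentence indices are already known; `build_summary` and the sample sentences are hypothetical, and in this notebook the input list would be something like `[sent.text for sent in doc.sents]`.

```python
def build_summary(original_sentences, selected_indices):
    """Join the selected original sentences in document order."""
    # Deduplicate and sort so the summary follows the source text's order.
    ordered = sorted(set(selected_indices))
    return " ".join(original_sentences[i].strip() for i in ordered)

# Hypothetical stand-in sentences, not the notebook's data.
original_sentences = [
    "First sentence.",
    "Second sentence.",
    "Third sentence.",
    "Fourth sentence.",
]
print(build_summary(original_sentences, [2, 0]))
# First sentence. Third sentence.
```

Sorting the indices means the summary reads in the same order as the source, whatever order the ranking step produced.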
