# AbstractIt - Extractive Text Summarizer
## GitHub Repository Link: https://github.com/guptaharshnavin/AbstractIt-Extractive-Text-Summarizer

## Problem Statement
#### Development of a text summarizer using the extractive text summarization technique.

## Importing Libraries

In [1]:
import re
# Importing Spacy English Library
import en_core_web_md
# Importing Spacy
import spacy
# Importing Stop Words
from spacy.lang.en.stop_words import STOP_WORDS
# Importing Count Vectorizer from SKLEARN
from sklearn.feature_extraction.text import CountVectorizer
# Import Rouge Scorer
from rouge_score import rouge_scorer

## Reading Text to be Summarised

In [2]:
# Opening File and Reading File Contents
# (the with-statement closes the file automatically)
with open('input_text.txt', 'r', encoding='utf-8') as file:
    file_lines = file.readlines()

In [3]:
file_lines

["A virus is a submicroscopic infectious agent that replicates only inside the living cells of an organism.[1] Viruses infect all types of life forms, from animals and plants to microorganisms, including bacteria and archaea.[2] Since Dmitri Ivanovsky's 1892 article describing a non-bacterial pathogen infecting tobacco plants and the discovery of the tobacco mosaic virus by Martinus Beijerinck in 1898,[3] more than 6,000 virus species have been described in detail[4] of the millions of types of viruses in the environment.[5] Viruses are found in almost every ecosystem on Earth and are the most numerous type of biological entity.[6][7] The study of viruses is known as virology, a subspeciality of microbiology.\n",
 '\n',
 'When infected, a host cell is forced to rapidly produce thousands of identical copies of the original virus. When not inside an infected cell or in the process of infecting a cell, viruses exist in the form of independent particles, or virions, consisting of: (i) the 

## Input Text Preprocessing
The application is aimed at summarizing **Wikipedia articles**, which contain citation markers such as [1] and may also include blank lines and a trailing \n character at the end of each line. This step removes the citation markers, the blank lines and the \n characters.

In [4]:
index = 0
# Cleaning The File Content Line By Line
while index < len(file_lines):
    if file_lines[index] == '\n':
        file_lines.pop(index)
        continue
        
    # RE To Remove [n] Wikipedia Annotations & New Line Special Character
    # (brackets are escaped so they match literally instead of opening a character class)
    file_lines[index] = re.sub(r'\[[0-9]+\]', '', file_lines[index]).replace('\n', '').strip()
    index = index + 1
# Combining Separate Lines into Single Paragraph
paragraph = ''.join(file_lines)
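
As a quick check of the cleaning step, the sketch below applies a citation-removal pattern with escaped brackets, `\[[0-9]+\]`, to a single made-up line. The names `sample_line` and `citation_pattern` are illustrative and do not appear in the notebook.

```python
import re

# Hypothetical sample line, shaped like one element of file_lines
sample_line = "A virus replicates only inside living cells.[1] Viruses are numerous.[6][7]\n"

# Escaped brackets: a literal "[", one or more digits, then a literal "]"
citation_pattern = re.compile(r'\[[0-9]+\]')

# Same three steps as the notebook: drop citations, drop the newline, trim whitespace
cleaned_line = citation_pattern.sub('', sample_line).replace('\n', '').strip()
print(cleaned_line)
```

Consecutive markers such as `[6][7]` are removed one after another, so no stray brackets are left behind.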

In [5]:
# Loading the Spacy English MD Library
nlp = en_core_web_md.load()
# Creating Spacy Doc
doc = nlp(paragraph)

## Generating Corpus
In this section, the input text is split into sentences, which are converted to lowercase and stored in a list.

In [6]:
# Building Corpus
corpus = []

for s in doc.sents:
    corpus.append(s.text.lower())
print('Number of Sentences in Corpus : ' + str(len(corpus)))

Number of Sentences in Corpus : 25


## Generation of BOW (Bag Of Words) & Word Frequencies
In this section, we generate a BOW, which gives the number of times each word appears in the entire corpus. After the BOW is generated, the counts are summed down each column (one column per word) to obtain the frequency of each word in the corpus.
<br>Once the frequencies are generated, the words whose counts match one of the five largest entries in the sorted list of counts are listed as the most frequently occurring words (ties can yield more than five words), and all frequencies are normalised with respect to the highest frequency in the corpus.

In [7]:
# Applying Count Vectorizer To Corpus
cv = CountVectorizer(stop_words=list(STOP_WORDS))
cv_fit = cv.fit_transform(corpus)

In [8]:
# Getting Word List from Count Vectorizer
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
word_list = cv.get_feature_names_out()

# Getting Sum of Word Occurrences
word_counts = cv_fit.toarray().sum(axis=0)

# Creating Dictionary for Words and Occurrences
word_freq_dict = dict(zip(word_list, word_counts))

In [9]:
# Sorting Word Counts in Descending Order
sort_word_counts = sorted(word_freq_dict.values(), reverse=True)
# Getting Higher Frequency Words
higher_frequency_words = []
# Words With Top 5 Frequency are considered as Higher Frequency Words
for word, freq in word_freq_dict.items():
    if freq in sort_word_counts[0:5]:
        higher_frequency_words.append(word)

In [10]:
print('Most Frequently Occurring Words In Text')
for word in higher_frequency_words:
    print(word)

Most Frequently Occurring Words In Text
cell
genetic
infecting
life
viral
virus
viruses


In [11]:
# Scaling The Frequency Of Words Relative To The Highest Frequency
for word in word_freq_dict.keys():
    word_freq_dict[word] = word_freq_dict[word]/sort_word_counts[0]
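
To make the counting and normalisation concrete, here is a minimal sketch on a toy corpus. It deliberately swaps `CountVectorizer` for `collections.Counter` with whitespace splitting, so tokenisation differs from the notebook; the corpus, the stop-word set and all `toy_*` names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus; the stop-word set is a tiny stand-in for spaCy's STOP_WORDS
toy_corpus = [
    "viruses infect cells",
    "a virus infects a host cell",
    "viruses are found in cells",
]
toy_stop_words = {"a", "are", "in"}

# Count every non-stop word across all sentences (the column sums of a BOW matrix)
toy_counts = Counter(
    word
    for sentence in toy_corpus
    for word in sentence.split()
    if word not in toy_stop_words
)

# Normalise each count by the highest count, so the most frequent word scores 1.0
highest_count = max(toy_counts.values())
toy_norm_freq = {word: count / highest_count for word, count in toy_counts.items()}

print(toy_norm_freq)
```

Here "viruses" and "cells" each appear twice and receive a normalised frequency of 1.0, while every word that appears once receives 0.5.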

## Assignment of Ranks to Sentences
Sentences are ranked by the higher-frequency words they contain: a sentence ranks higher when it contains more higher-frequency words and lower when it contains fewer.<br>
Each sentence's rank is based on its score, which is the sum of the normalised frequencies of the words in the sentence.<br>
The score is not computed by simply incrementing a counter for each matching word, because summing the normalised frequencies weights each word by its importance in the text.

In [12]:
# Assigning Ranks To Sentences
sentence_ranks = {}

for sentence in doc.sents:
    for word in sentence:
        if word.text.lower() in word_freq_dict.keys():
            if sentence in sentence_ranks.keys():
                sentence_ranks[sentence] += word_freq_dict[word.text.lower()]
            else:
                sentence_ranks[sentence] = word_freq_dict[word.text.lower()]

# Sorting Sentences According To Ranks
sorted_sentences = sorted(sentence_ranks.items(), key=lambda x: x[1], reverse=True)
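
The scoring rule can be traced by hand on a toy example. The sketch below uses plain strings instead of spaCy spans, and the dictionary `toy_word_freq` and the sentences are made-up values; unlike the notebook, a sentence with no matching word would still get an entry here, with a score of 0.

```python
# Hypothetical normalised frequencies, shaped like word_freq_dict
toy_word_freq = {"viruses": 1.0, "cells": 1.0, "host": 0.5, "infect": 0.5}

toy_sentences = [
    "viruses infect cells",
    "the host responds",
    "viruses and cells and viruses",
]

toy_scores = {}
for sentence in toy_sentences:
    # Words missing from the dictionary (for example stop words) contribute nothing;
    # a repeated word adds its frequency once per occurrence
    toy_scores[sentence] = sum(toy_word_freq.get(word, 0.0) for word in sentence.split())

# Highest score first, mirroring sorted_sentences
toy_ranked = sorted(toy_scores.items(), key=lambda item: item[1], reverse=True)

for sentence, score in toy_ranked:
    print(score, sentence)
```

Because the score is a plain sum with no division by sentence length, longer sentences that repeat frequent words tend to rank higher, as the third toy sentence does.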

## Generation of Summary
The final summary is generated by selecting the desired number of sentences from the sorted sentences based on their rank.

In [13]:
number_of_sentences = 6
final_para = ""

for i in range(0, number_of_sentences):
    sentence_tuple = sorted_sentences[i]
    final_para = final_para + sentence_tuple[0].text
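
The loop above appends the selected sentences in rank order with no separator between them. One possible variation, which is not part of the original notebook, is sketched below on made-up records: it takes the top sentences with a slice, restores their original order, and joins them with a space. The `(position, text, score)` tuples and all `toy_*` names are assumptions.

```python
# Hypothetical (position, sentence text, score) records for a five-sentence text
toy_scored = [
    (0, "First sentence.", 1.5),
    (1, "Second sentence.", 3.0),
    (2, "Third sentence.", 0.5),
    (3, "Fourth sentence.", 2.5),
    (4, "Fifth sentence.", 1.0),
]
toy_n = 3

# Slicing does not raise IndexError even if toy_n exceeds the number of sentences
top_n = sorted(toy_scored, key=lambda rec: rec[2], reverse=True)[:toy_n]

# Re-sorting by position restores the reading order of the source text
top_n_in_order = sorted(top_n, key=lambda rec: rec[0])

toy_summary = " ".join(text for _, text, _ in top_n_in_order)
print(toy_summary)
```

Whether reading order or rank order is preferable depends on how the summary will be used; this sketch only shows that the choice is independent of the scoring step.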

In [14]:
final_para

"When not inside an infected cell or in the process of infecting a cell, viruses exist in the form of independent particles, or virions, consisting of: (i) the genetic material, i.e., long molecules of DNA or RNA that encode the structure of the proteins by which the virus acts; (ii) a protein coat, the capsid, which surrounds and protects the genetic material; and in some cases (iii) an outside envelope of lipids.Since Dmitri Ivanovsky's 1892 article describing a non-bacterial pathogen infecting tobacco plants and the discovery of the tobacco mosaic virus by Martinus Beijerinck in 1898, more than 6,000 virus species have been described in detail of the millions of types of viruses in the environment.One transmission pathway is through disease-bearing organisms known as vectors: for example, viruses are often transmitted from plant to plant by insects that feed on plant sap, such as aphids; and viruses in animals can be carried by blood-sucking insects.Viruses are considered by some bi

## Computing Rouge Scores

In [15]:
# Initializing Rouge Scorer Object
rscorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rouge3'], use_stemmer=True)

In [16]:
r_scores = rscorer.score(paragraph, final_para)

In [17]:
# Printing Scores
print('Rouge 1:', r_scores['rouge1'])
print('Rouge 2:', r_scores['rouge2'])
print('Rouge 3:', r_scores['rouge3'])

Rouge 1: Score(precision=1.0, recall=0.4205933682373473, fmeasure=0.5921375921375921)
Rouge 2: Score(precision=0.9791666666666666, recall=0.41083916083916083, fmeasure=0.5788177339901478)
Rouge 3: Score(precision=0.9581589958158996, recall=0.4010507880910683, fmeasure=0.5654320987654321)
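
To help read these numbers, the sketch below computes a simplified ROUGE-1 by hand. It uses lowercase whitespace tokens and no stemming, so it is not the `rouge_score` implementation and will not reproduce the exact values above; `toy_rouge1` and the two strings are illustrative assumptions.

```python
from collections import Counter


def toy_rouge1(reference, candidate):
    """Simplified unigram ROUGE: lowercase whitespace tokens, no stemming."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a candidate token counts at most as often as it occurs in the reference
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f_measure = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f_measure


# The candidate is copied word for word from the reference, as in extractive summarization
toy_reference = "viruses infect cells and viruses spread between hosts"
toy_candidate = "viruses infect cells"

precision, recall, f_measure = toy_rouge1(toy_reference, toy_candidate)
print(precision, recall, f_measure)
```

Every word of the toy candidate occurs in the reference, so precision is 1.0, while recall is the fraction of the reference that the candidate covers (3 of 8 tokens). The same reasoning applies to the ROUGE-1 precision of 1.0 reported above, since the reference passed to the scorer is the full input text. The slightly lower ROUGE-2 and ROUGE-3 precision is presumably due to n-grams that span the boundary between two concatenated sentences and do not occur in the source.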
