# Text Summarization with NLTK in Python

Text summarization is a subdomain of natural language processing (NLP) that deals with extracting summaries from large chunks of text. There are two main types of techniques used for text summarization: NLP-based techniques and deep learning-based techniques.

# Convert Paragraphs to Sentences 

We first need to convert the whole paragraph into sentences. The most common way of converting paragraphs to sentences is to split the paragraph whenever a period is encountered.
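
As a rough illustration of this splitting step, the sketch below uses a regular expression rather than NLTK. The `split_sentences` helper and the sample paragraph are made up for this example, and the approach is deliberately naive: it splits after `.`, `!`, or `?` followed by whitespace, so it would also break after abbreviations such as "Dr.".

```python
import re

def split_sentences(paragraph):
    # Split at whitespace that follows a sentence-ending mark (., ! or ?)
    parts = re.split(r'(?<=[.!?])\s+', paragraph.strip())
    # Drop empty strings, e.g. when the input is empty
    return [part for part in parts if part]

# Hypothetical sample paragraph, used only for illustration
paragraph = "So, keep working. Keep striving. Never give up."
sentences = split_sentences(paragraph)
print(sentences)
```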

# Text Preprocessing 

After converting the paragraph to sentences, we need to remove all special characters, stop words, and numbers from the sentences.

# Tokenizing the Sentences 

We need to tokenize all the sentences to get all the words that exist in the sentences.
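
The preprocessing and tokenizing steps can be sketched together on a single sentence. This is a minimal sketch, not the NLTK-based code used later: `STOP_WORDS` is a tiny hand-picked set assumed for illustration (NLTK's English list is much longer), and `preprocess` is a hypothetical helper that tokenizes by splitting on whitespace.

```python
import re

# Tiny illustrative stop-word set; an assumption, not NLTK's full list
STOP_WORDS = {"so", "a", "the", "is", "to", "and", "in", "of", "for"}

def preprocess(sentence):
    # Replace everything that is not a letter (digits, punctuation) with a space
    letters_only = re.sub(r'[^a-zA-Z]', ' ', sentence)
    # Lowercase and split on whitespace to get word tokens
    words = letters_only.lower().split()
    # Drop stop words
    return [word for word in words if word not in STOP_WORDS]

cleaned = preprocess("So, keep working for 2 more hours!")
print(cleaned)
```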

# Find Weighted Frequency of Occurrence 

Next, we need to find the weighted frequency of occurrence of every word. The weighted frequency of a word is its frequency divided by the frequency of the most frequent word.
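
To make the arithmetic concrete, here is a small sketch with made-up raw counts. The words and their counts are assumptions chosen for illustration only.

```python
# Hypothetical raw counts for three words (illustration only)
raw_counts = {"keep": 5, "striving": 2, "working": 1}

# The most frequent word sets the scale
max_count = max(raw_counts.values())

# Weighted frequency = count / count of the most frequent word
weighted = {word: count / max_count for word, count in raw_counts.items()}
print(weighted)
```

The most frequent word always ends up with a weighted frequency of 1, and every other word falls between 0 and 1.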

# Replace Words by Weighted Frequency in Original Sentences

The next step is to plug the weighted frequencies in place of the corresponding words in the original sentences and find their sum. Note that the weighted frequency of anything removed during preprocessing (stop words, punctuation, digits, etc.) is zero, so those words do not need to be added.

| Sentence | Sum of Weighted Frequencies |
| --- | --- |
| So, keep working | 1 + 0.20 = 1.20 |
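
The row above can be reproduced with a short sketch. It assumes, reading the sum in the table, that "keep" has a weighted frequency of 1 and "working" has 0.20. `sentence_score` is a hypothetical helper, and the regex tokenizer stands in for a real tokenizer.

```python
import re

# Weighted frequencies inferred from the table row (an assumption)
weighted = {"keep": 1.0, "working": 0.20}

def sentence_score(sentence, weights):
    # Lowercase the sentence and keep only alphabetic tokens
    words = re.findall(r"[a-z]+", sentence.lower())
    # Words missing from the dictionary (e.g. stop words) contribute zero
    return sum(weights.get(word, 0.0) for word in words)

score = sentence_score("So, keep working", weighted)
print(round(score, 2))
```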

# Sort Sentences in Descending Order of Sum 

The final step is to sort the sentences in descending order of their sums. The sentences with the highest sums of weighted frequencies summarize the text.

For a more informative summary, you can also include the sentence with the second-highest sum of weighted frequencies.
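
A minimal sketch of the sorting step, using placeholder sentences and made-up sums (only the first entry comes from the example above):

```python
# Hypothetical sums of weighted frequencies per sentence
sentence_sums = {
    "Third sentence": 0.40,
    "So, keep working": 1.20,
    "Second sentence": 0.75,
}

# Sort (sentence, sum) pairs from highest to lowest sum
ranked = sorted(sentence_sums.items(), key=lambda item: item[1], reverse=True)

# Keep the two best sentences as the summary
top_two = [sentence for sentence, _ in ranked[:2]]
print(top_two)
```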

# Fetching Articles from Wikipedia 

In [9]:
import bs4 as bs
import urllib.request
import re
import nltk

In [4]:
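# Download the raw HTML of the Wikipedia article
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')
article = scraped_data.read()

# Parse the HTML (requires the lxml parser to be installed)
parsed_article = bs.BeautifulSoup(article, 'lxml')
# Collect all <p> tags, which hold the article's paragraphs
paragraphs = parsed_article.find_all('p')

# Concatenate the text of all paragraphs into one string
article_text = ""
for p in paragraphs:
    article_text += p.text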

# Preprocessing 

In [6]:
# Removing bracketed reference numbers such as [12], then extra spaces
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)

We do not want to remove anything else from article_text, since this is the original article. We keep the other numbers, punctuation marks, and special characters because the summary sentences are taken from this text and the weighted word frequencies are applied to its sentences.

To clean the text and calculate weighted frequencies, we will create another object.

In [7]:
# Removing special characters and digits
formatted_article_text = re.sub(r'[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)

We will use formatted_article_text to create the weighted frequency histogram for the words, and then replace the words in the article_text object with these weighted frequencies.
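
To see how the two objects differ, the sketch below applies the same regular expressions to a short made-up string instead of the downloaded article. The `sample_*` names and the sample text are assumptions for illustration.

```python
import re

# Hypothetical text with citation markers and irregular spacing
sample_text = "Version 2 shipped in 2020.[1]  It was 3x faster!![23]"

# Step 1: drop bracketed reference numbers and collapse whitespace
sample_article = re.sub(r'\[[0-9]*\]', ' ', sample_text)
sample_article = re.sub(r'\s+', ' ', sample_article)

# Step 2: keep letters only, then collapse whitespace again
sample_formatted = re.sub(r'[^a-zA-Z]', ' ', sample_article)
sample_formatted = re.sub(r'\s+', ' ', sample_formatted)

print(repr(sample_article))    # punctuation and digits are still present
print(repr(sample_formatted))  # letters and single spaces only
```

One side effect is visible here: removing digits from "3x" leaves a stray "x" token in the letters-only text.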

# Converting Text To Sentences

In [10]:
sentence_list = nltk.sent_tokenize(article_text)

We tokenize article_text rather than formatted_article_text because formatted_article_text no longer contains any punctuation and therefore cannot be split into sentences using the full stop as a delimiter.

# Find Weighted Frequency of Occurrence

To find the frequency of occurrence of each word, we use the formatted_article_text variable, since it doesn't contain punctuation, digits, or other special characters.

In [11]:
stopwords = nltk.corpus.stopwords.words('english')

word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

We first store all the English stop words from the nltk library in the stopwords variable. Next, we loop through all the words in formatted_article_text and check whether each one is a stop word. If it is not, we check whether the word already exists in the word_frequencies dictionary. If the word is encountered for the first time, it is added to the dictionary as a key with its value set to 1. Otherwise, its value is incremented by 1.

Finally, to find the weighted frequencies, we divide the number of occurrences of each word by the frequency of the most frequent word.

In [12]:
maximum_frequency = max(word_frequencies.values())

for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)
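
The counting and normalizing steps can also be written more compactly with `collections.Counter` from the standard library. This is an alternative sketch rather than the code used in this article: `weighted_word_frequencies` is a hypothetical helper, and the tokens and stop words below are toy data.

```python
from collections import Counter

def weighted_word_frequencies(tokens, stop_words):
    # Count every token that is not a stop word
    counts = Counter(token for token in tokens if token not in stop_words)
    if not counts:
        return {}
    # Scale all counts by the count of the most frequent word
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}

# Toy input for illustration
tokens = "the cat sat on the mat the cat slept".split()
weights = weighted_word_frequencies(tokens, stop_words={"the", "on"})
print(weights)
```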

# Calculating Sentence Scores

We have now calculated the weighted frequencies for all the words. Next, we calculate the score of each sentence by adding the weighted frequencies of the words that occur in that sentence.

In [13]:
sentence_scores = {}
for sent in sentence_list:
    # Only score sentences with fewer than 30 words
    if len(sent.split(' ')) >= 30:
        continue
    for word in nltk.word_tokenize(sent.lower()):
        # Tokens missing from word_frequencies (stop words, digits, punctuation) add nothing
        if word in word_frequencies:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]

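In the script above, we loop through each sentence in sentence_list and tokenize it into words. We then check whether each word exists in the word_frequencies dictionary. This check is needed because sentence_list was created from the article_text object, whereas the word frequencies were calculated from the formatted_article_text object, which doesn't contain any stop words, numbers, etc. Note that the sentence words are lowercased here while formatted_article_text was not, so entries in word_frequencies that contain capital letters are never matched.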

We do not want very long sentences in the summary, so we calculate scores only for sentences with fewer than 30 words (you can tweak this threshold for your own use case).
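
To experiment with the length threshold, the scoring logic can be wrapped in a function with the limit as a parameter. This is a sketch under assumptions: `score_sentences` and its `max_words` parameter are hypothetical names, a regex stands in for `nltk.word_tokenize`, and the weights and sentences are toy data.

```python
import re

def score_sentences(sentences, weights, max_words=30):
    scores = {}
    for sent in sentences:
        # Skip sentences that reach or exceed the word limit
        if len(sent.split()) >= max_words:
            continue
        # Simple stand-in tokenizer: lowercase alphabetic tokens only
        words = re.findall(r"[a-z]+", sent.lower())
        total = sum(weights.get(word, 0.0) for word in words)
        # Record a sentence only if at least one of its words has a weight
        if total > 0:
            scores[sent] = total
    return scores

# Toy data for illustration
weights = {"summary": 1.0, "text": 0.5}
sentences = [
    "A short summary of the text.",
    "This much longer sentence about the text goes on and on.",
]
scores = score_sentences(sentences, weights, max_words=8)
print(scores)
```

With `max_words=8`, the second sentence (11 words) is skipped even though it contains a weighted word.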

# Getting the Summary

To summarize the article, we take the top N sentences with the highest scores. The following script retrieves the top 7 sentences and prints them on the screen.

In [14]:
import heapq

summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)

 In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and animals. Neural networks can be applied to the problem of intelligent control (for robotics) or learning, using such techniques as Hebbian learning ("fire together, wire together"), GMDH or competitive learning. Musk also funds companies developing artificial intelligence such as DeepMind and Vicarious to "just keep an eye on what's going on with artificial intelligence. Many of the problems in this article may also require general intelligence, if machines are to solve the problems as well as people do. IBM has created its own artificial intelligence computer, the IBM Watson, which has beaten human intelligence (at some levels). "robotics" or "machine learning"), the use of particular tools ("logic" or artificial neural networks), or deep philosophical differences. A February 2020 European U