# Ref: https://medium.com/analytics-vidhya/simple-text-summarization-using-nltk-eedc36ebaaf8

Two different approaches to Text Summarization

Extraction-based summarization: Content is extracted from the original text without being modified in any way. In simple terms, we identify the important sentences or key phrases in the original text and extract only those.

In machine learning, extractive summarization usually involves weighting the important parts of the text, such as sentences, and using those weights to build the summary.

Example:

Before Summarization

John and Joseph took a taxi to attend the night party in the city. While in the party, John collapsed and was rushed to the hospital.

After Summarization

John and Joseph attend party. John rushed hospital.

Abstraction-based summarization: Here the summary can use wording that differs from the original text, in contrast to extraction-based summarization, which uses only sentences already present in the source. Advanced deep learning techniques are typically used to generate the new summary.

Example:

Before Summarization

John and Joseph took a taxi to attend the night party in the city. While in the party, John collapsed and was rushed to the hospital.

After Summarization

John was hospitalized after attending the party.

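In this article, we use extraction-based summarization: we pick the sentences with the highest importance scores to form the summary. The referenced article uses the NLTK toolkit; the code in this notebook uses spaCy for tokenization and sentence splitting.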

Steps involved to create the text summary

1) Data collection from Wikipedia using web scraping (using the urllib library)

2) Parsing the content fetched from the URL (using the BeautifulSoup library)

3) Data clean-up, such as removing special characters, numeric values, stop words and punctuation.

4) Tokenization: creation of tokens (word tokens and sentence tokens)

5) Calculate the frequency of each word.

6) Calculate the weighted frequency for each sentence.

7) Create the summary by choosing the top 30% of sentences by weight.

In [25]:
# text = """The Extractive based summarization method selects informative sentences from the document as they exactly appear in source 
# based on specific criteria to form summary. The main challenge before extractive summarization is to decide which sentences from 
# the input document is significant and likely to be included in the summary. For this task, sentence scoring is employed based on 
# features of sentences. It first, assigns a score to each sentence based on feature then rank sentences according to their score. 
# Sentences with the highest score are likely to be included in final summary. Following methods are the technique of extractive 
# text summarization."""
import pandas as pd
# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrap is needed
df = pd.read_csv('data_messy.csv')
df

Unnamed: 0.1,Unnamed: 0,title,contents
0,0,The Thais caught up in the Israel-Gaza war,"In a village lying close to the Mekong River, ..."
1,1,Catfishing: How I hunted down the gang imperso...,"James Blake, an entrepreneur and owner of a di..."
2,2,Hamas attack: 12 Thais killed and 11 kidnapped...,Twelve Thais have been killed and another 11 k...
3,3,Bangkok: Parents of Siam Paragon mall shooter ...,The parents of a 14-year-old boy who fatally s...
4,4,Bangkok: Two dead and 14-year-old held over Si...,A 14-year-old boy has been arrested after two ...
...,...,...,...
888,888,"Thai princess leaves $40,000 custom toilet 'un...","A toilet that cost an estimated $40,000 (£28,3..."
889,889,Missing backpacker Grace Taylor found in Thailand,A British backpacker has been found after goin...
890,890,Grace Taylor missing: Family of UK woman in Th...,"A 21-year-old woman from Swanage, Dorset is mi..."
891,891,Bangkok airport safety issues 'must be addressed',The airline industry has called on the Thai go...


In [26]:
# Take the article body from the first row
text = df['contents'].iloc[0]
text



In [27]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

3) Tokenization & Data clean up

# Import the stop words from spaCy and the punctuation characters from the string module.

Stop words are a set of commonly used words in any language. For example, in English, “the”, “is” and “and” would easily qualify as stop words. In NLP and text mining applications, stop-word lists are used to filter out unimportant words, allowing applications to focus on the important words instead.
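As a minimal sketch of the idea, the snippet below filters a word list against a tiny, hand-picked stop-word set. `SAMPLE_STOP_WORDS` and `remove_stop_words` are hypothetical names used only for illustration; the notebook itself relies on spaCy's much larger `STOP_WORDS` set.

```python
# Hypothetical, hand-picked stop-word set; spaCy's STOP_WORDS is far larger.
SAMPLE_STOP_WORDS = {"the", "is", "and", "a", "to", "in", "of"}


def remove_stop_words(words):
    """Return the words that are not in the sample stop-word set (case-insensitive)."""
    return [w for w in words if w.lower() not in SAMPLE_STOP_WORDS]


words = "The taxi is in the city and John is at the party".split()
content_words = remove_stop_words(words)
print(content_words)
```

The comparison is done on the lowercased word so that "The" and "the" are treated alike, which mirrors the `word.text.lower()` check used later in the notebook.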

In [28]:
stopword = list(STOP_WORDS)

In [29]:
nlp = spacy.load('en_core_web_sm')

In [30]:
doc = nlp(text)

In [31]:
tokens = [token.text for token in doc]
print(tokens)



In [32]:
punctuation = punctuation + '\n'+'\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n\n'

# Text cleaning

# Creating the frequency table

Word-tokenize the entire text. We then build a dictionary whose keys are the words and whose values are the number of times each word occurs.

In [33]:
word_frequencies = {}
for word in doc:
    lowered = word.text.lower()
    # Skip stop words and punctuation; keys keep the token's original casing
    if lowered not in stopword and lowered not in punctuation:
        word_frequencies[word.text] = word_frequencies.get(word.text, 0) + 1
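The same frequency table can be sketched without spaCy, which may help when experimenting on short strings. This is only an approximation under stated assumptions: a crude regular expression stands in for spaCy's tokenizer, and `SAMPLE_STOP_WORDS` and `build_word_frequencies` are hypothetical names, not part of the notebook.

```python
import re
from collections import Counter

# Hypothetical stop-word set, much smaller than spaCy's STOP_WORDS
SAMPLE_STOP_WORDS = {"the", "a", "and", "to", "was", "in", "while"}


def build_word_frequencies(text):
    """Count lowercase alphanumeric tokens that are not stop words."""
    # Crude tokenizer: runs of letters/digits; punctuation is dropped implicitly
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in SAMPLE_STOP_WORDS)


sample = (
    "John and Joseph took a taxi to attend the night party in the city. "
    "While in the party, John collapsed and was rushed to the hospital."
)
freq = build_word_frequencies(sample)
print(freq.most_common(3))
```

Unlike the cell above, this sketch lowercases the keys, so "John" and "john" share one entry.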

In [34]:
print(word_frequencies)



Then divide each word's count by the count of the most frequent word to get a normalized (weighted) frequency, as shown below:

In [35]:
max_frequency = max(word_frequencies.values())
max_frequency

15

In [36]:
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_frequency

word_frequencies

{'village': 0.13333333333333333,
 'lying': 0.06666666666666667,
 'close': 0.06666666666666667,
 'Mekong': 0.06666666666666667,
 'River': 0.06666666666666667,
 'Weerapon': 0.06666666666666667,
 'Golf': 0.4666666666666667,
 'Lapchan': 0.06666666666666667,
 'sits': 0.06666666666666667,
 'centre': 0.06666666666666667,
 'group': 0.06666666666666667,
 'Thai': 0.4666666666666667,
 'elders': 0.06666666666666667,
 'tie': 0.06666666666666667,
 'white': 0.06666666666666667,
 'threads': 0.06666666666666667,
 'wrists': 0.06666666666666667,
 'chant': 0.06666666666666667,
 'literally': 0.06666666666666667,
 'calling': 0.06666666666666667,
 'kwan': 0.06666666666666667,
 'spirit': 0.06666666666666667,
 'body': 0.06666666666666667,
 'narrow': 0.06666666666666667,
 'escape': 0.06666666666666667,
 'Hamas': 0.5333333333333333,
 'attack': 0.4,
 'Israel': 1.0,
 '7': 0.13333333333333333,
 'October': 0.2,
 '34': 0.06666666666666667,
 'year': 0.4,
 'old': 0.2,
 '25,000': 0.06666666666666667,
 'Thais': 0.2666666
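The normalization can be reproduced on a handful of words. The counts below are back-computed from the weights printed above (weight × 15, the reported maximum), so they are an assumption about the underlying table rather than values read from it directly.

```python
# Raw counts inferred from the printed weights and max_frequency = 15
word_counts = {"Israel": 15, "Hamas": 8, "Thai": 7, "village": 2, "lying": 1}

max_count = max(word_counts.values())

# Divide every count by the largest count so the top word gets weight 1.0
normalized = {word: count / max_count for word, count in word_counts.items()}

for word, weight in normalized.items():
    print(f"{word}: {weight:.4f}")
```

Building a new dictionary instead of overwriting the counts in place keeps the raw counts available for inspection.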

# Tokenizing the article into sentences

To split the text into sentences, we use the sentence segmentation that spaCy exposes through doc.sents.

In [37]:
sentences_token = [sent for sent in doc.sents]
sentences_token

[In a village lying close to the Mekong River, Weerapon "Golf" Lapchan sits in the centre of a group of Thai elders as they tie white threads around his wrists and chant.,
 They are literally calling his "kwan" or spirit back to his body, after his narrow escape during the Hamas attack on Israel on 7 October.,
 The 34-year-old is one of more than 25,000 Thais who were working on farms and orchards in Israel when the Hamas militants stormed in from Gaza.,
 At least 30 Thais were among the 200 or so foreign nationals who were killed in the attack.,
 Now the Thai government is helping others, thousands of them, to return home.,
 Thailand provides almost all the foreign farm labour in Israel.,
 Many of the Thai workers have to borrow money to go to Israel and now they are returning, jobless and in debt.  ,
 Yet some like Golf never want to go back.,
 On the morning of 7 October, when Golf and his co-workers saw rockets being fired, and then intercepted by Israel's Iron Dome defence system,

4) Finding the weighted frequencies of the sentences

To evaluate the score for every sentence in the text, we’ll be analysing the frequency of occurrence of each term. In this case, we’ll be scoring each sentence by its words; that is, adding the frequency of each important word found in the sentence.

In [38]:
sentences_scores = {}
for sent in sentences_token:
    for word in sent:
        # Note: the lookup is lowercased while word_frequencies keeps original casing,
        # so capitalized words such as 'Israel' do not contribute to the score.
        if word.text.lower() in word_frequencies:
            sentences_scores[sent] = sentences_scores.get(sent, 0) + word_frequencies[word.text.lower()]
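In the scoring cell above, the lookup uses `word.text.lower()` while `word_frequencies` is keyed by the original token text, so capitalized words are never matched. One possible variant, sketched here in plain Python, matches in lowercase on both sides. The regular-expression tokenizer, the function name `score_sentences`, and the rounded weights are illustrative assumptions, not the notebook's own values.

```python
import re


def score_sentences(sentences, word_weights):
    """Sum the weights of known words in each sentence, matching in lowercase."""
    # Fold the weight table to lowercase, merging entries that differ only by case
    lowered = {}
    for word, weight in word_weights.items():
        key = word.lower()
        lowered[key] = lowered.get(key, 0.0) + weight

    scores = {}
    for sentence in sentences:
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        score = sum(lowered.get(tok, 0.0) for tok in tokens)
        if score > 0:  # like the notebook, omit sentences with no known words
            scores[sentence] = score
    return scores


# Illustrative, rounded weights
weights = {"Israel": 1.0, "attack": 0.25, "Golf": 0.5}
sample_sentences = [
    "Thailand provides almost all the foreign farm labour in Israel.",
    "Yet some like Golf never want to go back.",
    "Nothing relevant here.",
]
sample_scores = score_sentences(sample_sentences, weights)
print(sample_scores)
```

With this variant, proper nouns contribute to sentence scores, which would change the ranking relative to the output shown in this notebook.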

In [39]:
sentences_scores

{In a village lying close to the Mekong River, Weerapon "Golf" Lapchan sits in the centre of a group of Thai elders as they tie white threads around his wrists and chant.: 0.8666666666666666,
 They are literally calling his "kwan" or spirit back to his body, after his narrow escape during the Hamas attack on Israel on 7 October.: 1.0,
 The 34-year-old is one of more than 25,000 Thais who were working on farms and orchards in Israel when the Hamas militants stormed in from Gaza.: 1.4,
 At least 30 Thais were among the 200 or so foreign nationals who were killed in the attack.: 0.8666666666666667,
 Now the Thai government is helping others, thousands of them, to return home.: 0.6666666666666666,
 Thailand provides almost all the foreign farm labour in Israel.: 0.6,
 Many of the Thai workers have to borrow money to go to Israel and now they are returning, jobless and in debt.  : 1.0666666666666667,
 Yet some like Golf never want to go back.: 0.26666666666666666,
 On the morning of 7 Octob

5) Creation of summary

Using nlargest from the heapq module, get the top 30% of sentences by weight, then join them to produce the final summarized text.
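As a small illustration of how `heapq.nlargest` behaves with a dictionary, the sketch below selects the top 30% of ten hypothetical sentence IDs. Iterating a dictionary yields its keys, and `key=scores.get` ranks those keys by their stored score. The IDs and scores are made up for the example.

```python
from heapq import nlargest

# Hypothetical sentence IDs mapped to illustrative scores
scores = {
    "s0": 0.87, "s1": 1.0, "s2": 1.4, "s3": 0.85, "s4": 0.67,
    "s5": 0.6, "s6": 1.07, "s7": 0.27, "s8": 0.5, "s9": 0.2,
}

# Keep 30% of the sentences, rounded down
select_count = int(len(scores) * 0.3)

# Returns the keys with the highest scores, best first
top = nlargest(select_count, scores, key=scores.get)
print(top)
```

Because `int()` rounds down, a text with three or fewer sentences would yield a count of zero and an empty summary; wrapping the expression in `max(1, ...)` is one way to guard against that.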

In [40]:
from heapq import nlargest

In [41]:
select_length = int(len(sentences_token)*0.3)
select_length

17

In [42]:
summary = nlargest(select_length, sentences_scores, key = sentences_scores.get)
summary

[He left her and his six-year-old son in June last year to work on an avocado and pomegranate farm in the Nir Oz kibbutz, not far from where Golf was working.,
 He spent a harrowing few days under constant rocket attack at the organic vegetable farm where he was working, before borrowing more money for his flight home.,
 On the morning of 7 October, when Golf and his co-workers saw rockets being fired, and then intercepted by Israel's Iron Dome defence system, he says they were not unduly worried.,
 The international victims of Hamas' assault on IsraelWhat is happening in Israel and Gaza, and why now?How Hamas staged lightning assault no-one thought possibleGolf was brought back to Thailand on a government-organised evacuation flight on 13 October.,
 People told the BBC they have to pay significantly more than the official fees of around 70,000 baht ($2,100) to get to Israel - often up to 120,000 baht including extra costs and unofficial payments.,
 The 34-year-old is one of more than 

In [43]:
final_summary = [sent.text for sent in summary]
final_summary

['He left her and his six-year-old son in June last year to work on an avocado and pomegranate farm in the Nir Oz kibbutz, not far from where Golf was working.',
 'He spent a harrowing few days under constant rocket attack at the organic vegetable farm where he was working, before borrowing more money for his flight home.',
 "On the morning of 7 October, when Golf and his co-workers saw rockets being fired, and then intercepted by Israel's Iron Dome defence system, he says they were not unduly worried.",
 "The international victims of Hamas' assault on IsraelWhat is happening in Israel and Gaza, and why now?How Hamas staged lightning assault no-one thought possibleGolf was brought back to Thailand on a government-organised evacuation flight on 13 October.",
 'People told the BBC they have to pay significantly more than the official fees of around 70,000 baht ($2,100) to get to Israel - often up to 120,000 baht including extra costs and unofficial payments.',
 'The 34-year-old is one of

In [44]:
summary = ''.join(final_summary)
summary
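Joining with an empty string, as in the cell above, runs the sentences together without spaces, and `nlargest` returns them in score order rather than reading order. A possible variant, sketched with made-up sentences and scores, restores the original order and joins with a single space.

```python
from heapq import nlargest

# Made-up sentences, in their original reading order, with illustrative scores
sentences = ["A came first.", "B came second.", "C came third.", "D came fourth."]
scores = {
    sentences[0]: 0.2,
    sentences[1]: 0.9,
    sentences[2]: 0.4,
    sentences[3]: 1.3,
}

# Pick the two highest-scoring sentences (returned best first)
top = nlargest(2, scores, key=scores.get)

# Put the selected sentences back into reading order, then join with spaces
ordered = sorted(top, key=sentences.index)
summary_text = " ".join(ordered)
print(summary_text)
```

Whether to keep score order or reading order is a design choice; reading order tends to be easier to follow when the source is a narrative news article.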



Compare the text length (in characters) before and after summarization.

In [45]:
len(text)

5927

In [46]:
len(summary)

2517
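The two lengths above can be turned into a compression ratio. The sketch below simply plugs in the reported character counts; the variable names are illustrative.

```python
original_length = 5927  # len(text) reported above
summary_length = 2517   # len(summary) reported above

# Fraction of the original character count that the summary retains
ratio = summary_length / original_length
print(f"Summary keeps {ratio:.1%} of the original characters")
```

The summary keeps roughly 42% of the characters even though only 30% of the sentences were selected. One likely reason is that scoring by summed word weights favours longer sentences, so the selected sentences are longer than average.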