This notebook contains code for extractive summarization of articles. In extraction-based summarization, a subset of words that represent the most important points is pulled from a piece of text and combined to make a summary. In machine learning, extractive summarization involves weighing the essential sections of sentences and using the results to generate summaries. We will use the frequency of important words as weights. The sentence that contains the most important words is assumed to be the best sentence for the summary. Adapted from the <a href="https://blog.floydhub.com/gentle-introduction-to-text-summarization-in-machine-learning/">FloydHub blog</a>.

Extractive summarization takes five steps:
<ol>
    <li>Prepare Data</li>
    <li>Process text</li>
    <li>Tokenize text</li>
    <li>Evaluate the weighted occurrence frequency of the words</li>
    <li>Substitute the words with the weighted frequency</li>
</ol>

In [2]:
import bs4 as BeautifulSoup
import urllib.request  
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
import nltk

<h1>Prepare Data</h1>

In [3]:
# Fetching the content from the URL
fetched_data = urllib.request.urlopen('https://www.moh.gov.sg/news-highlights/details/preparing-for-our-transition-towards-covid-resilience')

article_read = fetched_data.read()

# Parsing the URL content and storing in a variable
article_parsed = BeautifulSoup.BeautifulSoup(article_read,'html.parser')


# Returning <p> tags
paragraphs = article_parsed.find_all('p')

article_content = ''

# Looping through the paragraphs and adding them to the variable
for p in paragraphs:  
    article_content += p.text

Here we fetch and read the whole webpage, parse it, and extract all paragraphs. We then append every paragraph to a variable called article_content.
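
As a hedged illustration of the paragraph-extraction step, the sketch below collects `<p>` text from a small inline HTML string using only the standard library's `html.parser`, so it runs without network access. The class name `ParagraphCollector` and the sample HTML are assumptions for illustration; the notebook itself uses BeautifulSoup. Paragraphs are joined with a space here so that the last sentence of one paragraph does not run into the first sentence of the next.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: gathers the text found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._depth = 0        # > 0 while inside a <p> element
        self._buffer = []      # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == 'p' and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.paragraphs.append(''.join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        # Only text that sits inside a <p> element is kept
        if self._depth > 0:
            self._buffer.append(data)


sample_html = (
    '<html><body><h1>Title</h1>'
    '<p>First paragraph.</p>'
    '<p>Second <b>paragraph</b> here.</p>'
    '</body></html>'
)

collector = ParagraphCollector()
collector.feed(sample_html)
collector.close()

# Join with a space so sentence boundaries between paragraphs survive
sample_content = ' '.join(collector.paragraphs)
print(sample_content)  # First paragraph. Second paragraph here.
```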

<h1>Processing Data</h1>

In [4]:
def create_dictionary_table(text) -> dict:
    stop_words = set(stopwords.words('english'))

    # Tokenize the text into words
    words = word_tokenize(text)

    # Stemmer for reducing words to their root form
    stem = PorterStemmer()

    # Dictionary mapping each stemmed word to its frequency
    frequency_table = {}

    for wd in words:
        wd = stem.stem(wd)
        # Skip stop words (the check runs on the stemmed token)
        if wd in stop_words:
            continue
        frequency_table[wd] = frequency_table.get(wd, 0) + 1
    return frequency_table
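
To make the frequency table concrete, here is a minimal sketch that uses only the standard library. The tiny stop-word set, the regex tokenizer, and the function name `toy_frequency_table` are assumptions standing in for NLTK's stop-word list, `word_tokenize`, and `PorterStemmer`; no stemming is performed.

```python
import re
from collections import Counter

# Hypothetical miniature stop-word list (NLTK's English list is much longer)
TOY_STOP_WORDS = {'the', 'is', 'a', 'of', 'and', 'to', 'in'}


def toy_frequency_table(text: str) -> dict:
    """Count lowercase word tokens, skipping stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return dict(Counter(t for t in tokens if t not in TOY_STOP_WORDS))


table = toy_frequency_table('The vaccine is safe. The vaccine is free.')
print(table)  # {'vaccine': 2, 'safe': 1, 'free': 1}
```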

<h1>Tokenize the text </h1>

In [5]:
#sent_tokenize()
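
The cell above is only a placeholder for `sent_tokenize`. As a rough illustration of what sentence tokenization produces, the sketch below splits text on sentence-ending punctuation with a regular expression. This is a simplified stand-in, not NLTK's tokenizer, and the function name `naive_sent_tokenize` is hypothetical; it will mis-split abbreviations such as "Dr.".

```python
import re


def naive_sent_tokenize(text: str) -> list:
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


sample = 'Vaccination is open to all. Have you booked a slot? Please do!'
for s in naive_sent_tokenize(sample):
    print(s)
```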

<h1>Evaluate the weighted occurrence frequency of the words</h1>

In [6]:
def calculate_sentence_scores(sentences, frequency_table) -> dict:
    sentence_weight = dict()

    for sentence in sentences:
        # The first 7 characters of the sentence serve as its dictionary key
        key = sentence[:7]
        matched_word_count = 0
        for word_weight in frequency_table:
            # Substring test: counts the entry if it occurs anywhere in the sentence
            if word_weight in sentence.lower():
                matched_word_count += 1
                if key in sentence_weight:
                    sentence_weight[key] += frequency_table[word_weight]
                else:
                    sentence_weight[key] = frequency_table[word_weight]

        # Normalize by the number of matched words; skip sentences with no
        # match to avoid a KeyError or division by zero
        if matched_word_count > 0:
            sentence_weight[key] = sentence_weight[key] / matched_word_count
    return sentence_weight
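
The function above matches table entries with a substring test (`word_weight in sentence.lower()`), so a short entry such as "art" also matches inside "start", and it keys scores by the first seven characters, so sentences sharing a prefix share one entry. A possible variation, sketched below with only the standard library, matches whole tokens and keys by the full sentence. Unlike the original, it counts a repeated word each time it occurs. The name `score_sentences` and the regex tokenizer are assumptions for illustration, not part of the original code.

```python
import re


def score_sentences(sentences: list, frequency_table: dict) -> dict:
    """Average table frequency of the scored tokens in each sentence."""
    scores = {}
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Keep only tokens that appear in the frequency table
        weights = [frequency_table[t] for t in tokens if t in frequency_table]
        if weights:  # skip sentences with no scored words
            scores[sentence] = sum(weights) / len(weights)
    return scores


toy_table = {'vaccine': 2, 'safe': 1, 'art': 3}
toy_sentences = ['The vaccine is safe.', 'We start today.']
print(score_sentences(toy_sentences, toy_table))
# {'The vaccine is safe.': 1.5}
```

The second sentence receives no score because "art" is not one of its tokens, even though it is a substring of "start".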

<h1>Calculate the threshold score</h1>

In [7]:
def calculate_average_score(sentence_weight):
    sum_values = 0 
    for entry in sentence_weight:
        sum_values += sentence_weight[entry]
    
    avg_score = (sum_values/len(sentence_weight))
    return avg_score

In [8]:
def get_article_summary(sentences, sentence_weight, threshold):
    article_summary = ''

    for sentence in sentences:
        # Keep the sentence if its score (keyed by its first 7 characters)
        # meets the threshold
        key = sentence[:7]
        if key in sentence_weight and sentence_weight[key] >= threshold:
            article_summary += " " + sentence

    return article_summary
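
As a small worked example of the thresholding step, the sketch below averages a hand-made score dictionary and keeps only the sentences that score at least 1.5 times the average. The scores and sentences are invented for illustration, and the full sentence is used as the key for readability.

```python
toy_scores = {
    'Vaccination rates are rising.': 4.0,
    'Click here.': 1.0,
    'Seniors are encouraged to get vaccinated.': 1.0,
}

# Average score across all sentences: (4.0 + 1.0 + 1.0) / 3 = 2.0
toy_average = sum(toy_scores.values()) / len(toy_scores)

# Keep sentences whose score reaches 1.5 times the average (3.0 here)
toy_threshold = 1.5 * toy_average
toy_summary = ' '.join(s for s, score in toy_scores.items() if score >= toy_threshold)

print(toy_average)   # 2.0
print(toy_summary)   # Vaccination rates are rising.
```

A higher multiplier gives a shorter summary; a multiplier of 1.0 keeps every sentence at or above the average.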

<h1> Wrapping everything up </h1>

In [9]:
frequency_table = create_dictionary_table(article_content)
sentences = sent_tokenize(article_content)
sentence_weight = calculate_sentence_scores(sentences,frequency_table)
avg_score = calculate_average_score(sentence_weight)

summary = get_article_summary(sentences, sentence_weight, 1.5*avg_score)

In [10]:
summary

' Click here for E-Consultation. Hence, we have ramped up efforts to encourage more seniors to be vaccinated. Many of our seniors have responded to these initiatives. But there are still about 80,000 in this group that have yet to be vaccinated. Hence, we have ramped up efforts to encourage more seniors to be vaccinated. Many of our seniors have responded to these initiatives. But there are still about 80,000 in this group that have yet to be vaccinated. solemnizations, congregational and other worship services. solemnizations, congregational and other worship services.11. These events may take place with up to 1,000 attendees if all are fully vaccinated. These events may take place with up to 1,000 attendees if all are fully vaccinated. Work-from-home. Work-from-home. Hence we will require vaccination, or regular testing in lieu, for selected sectors of the workforce. Hence we will require vaccination, or regular testing in lieu, for selected sectors of the workforce.'