# Rule-Based Text Summarization Using Word Frequency Scoring

In [1]:
import re
import random
import urllib.request
import bs4 as BeautifulSoup
import pandas as pd

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

## Fetch Text Corpus From the Internet (optional)

In [2]:
# fetching the content from the URL
fetched_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/20th_century')

In [3]:
# parsing the URL content and storing in a variable
article_read = fetched_data.read()
article_parsed = BeautifulSoup.BeautifulSoup(article_read,'html.parser')

# returning <p> tags
paragraphs = article_parsed.find_all('p')

article_content = ''

# looping through the paragraphs and adding them to the variable
for p in paragraphs:  
    article_content += p.text + '\n'

In [4]:
print(article_content)

The 20th (twentieth) century was a century that began on
January 1, 1901[1] and ended on December 31, 2000.[2] It was the tenth and final century of the 2nd millennium. It is distinct from the century known as the 1900s which began on January 1, 1900, and ended on December 31, 1999.

The 20th century was dominated by a chain of events that heralded significant changes in world history as to redefine the era: flu pandemic, World War I and World War II, nuclear power and space exploration, nationalism and decolonization, the Cold War and post-Cold War conflicts; intergovernmental organizations and cultural homogenization through developments in emerging transportation and communications technology; poverty reduction and world population growth, awareness of environmental degradation, ecological extinction;[3][4] and the birth of the Digital Revolution, enabled by the wide adoption of MOS transistors and integrated circuits. It saw great advances in communication and medical technology th

## Load Text Corpus

In [5]:
# read the corpus line by line; the context manager closes the file
text_corpus = []
with open('./datasets/text_corpus.txt') as f:
    for text in f:
        text_corpus.append(text)

In [6]:
text_corpus = '\n\n'.join(text_corpus)
print(text_corpus)

Those Who Are Resilient Stay In The Game Longer “On the mountains of truth you can never climb in vain: either you will reach a point higher up today, or you will be training your powers so that you will be able to climb higher tomorrow.” — Friedrich Nietzsche Challenges and setbacks are not meant to defeat you, but promote you. However, I realise after many years of defeats, it can crush your spirit and it is easier to give up than risk further setbacks and disappointments. Have you experienced this before? To be honest, I don’t have the answers. I can’t tell you what the right course of action is; only you will know. However, it’s important not to be discouraged by failure when pursuing a goal or a dream, since failure itself means different things to different people. To a person with a Fixed Mindset failure is a blow to their self-esteem, yet to a person with a Growth Mindset, it’s an opportunity to improve and find new ways to overcome their obstacles. Same failure, yet different 

## Calculate Frequency of Words

In [7]:
def create_frequency_table(text):
    
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    ps = PorterStemmer()
    
    frequency_table = dict()
    for word in words:
        word = ps.stem(word)
        
        # skip stop words and any token that is not purely alphanumeric
        if word not in stop_words and re.match(r'^\w+$', word):
            if word in frequency_table:
                frequency_table[word] += 1
            else:
                frequency_table[word] = 1
    
    # convert the dict into a two-column DataFrame
    frequency_table = pd.DataFrame(data={
        'Words': list(frequency_table.keys()),
        'Frequency': list(frequency_table.values()),
    })
    frequency_table.index += 1 
    
    return frequency_table

In [8]:
frequency_table = create_frequency_table(text_corpus)
frequency_table.head()

|   | Frequency | Words  |
|---|-----------|--------|
| 1 | 2         | resili |
| 2 | 2         | stay   |
| 3 | 1         | In     |
| 4 | 3         | game   |
| 5 | 2         | longer |
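
The counting step above can also be expressed with `collections.Counter`. The following is a minimal sketch, not the notebook's implementation: `toy_frequency_table` and `TOY_STOP_WORDS` are hypothetical names, the tokenizer is a plain regex, and the stop-word set is a tiny hand-written one, so that the example runs without NLTK data. It also lowercases tokens and skips stemming.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word set; the notebook uses NLTK's English list instead.
TOY_STOP_WORDS = {"the", "in", "a", "to", "and", "who", "are"}

def toy_frequency_table(text):
    """Count lowercase alphanumeric tokens that are not stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t not in TOY_STOP_WORDS)

counts = toy_frequency_table("Those who are resilient stay in the game. Stay in the game!")
print(counts["stay"], counts["game"], counts["resilient"])  # 2 2 1
```

Lowercasing before the stop-word check is what keeps a capitalized `In` out of the counts here.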


## Tokenize The Sentences

In [9]:
sentences = sent_tokenize(text_corpus)
print(sentences[0])

Those Who Are Resilient Stay In The Game Longer “On the mountains of truth you can never climb in vain: either you will reach a point higher up today, or you will be training your powers so that you will be able to climb higher tomorrow.” — Friedrich Nietzsche Challenges and setbacks are not meant to defeat you, but promote you.


## Score The Sentences Using Term Frequency

In [10]:
def score_sentences(sentences, frequency_table):
    
    # look up each word's frequency once instead of filtering the DataFrame per word
    word_frequencies = dict(zip(frequency_table['Words'], frequency_table['Frequency'].tolist()))
    
    sentences_with_score = dict()
    
    for sentence in sentences:
        
        word_count_in_sentence = len(word_tokenize(sentence))
        sentence_lower = sentence.lower()
        
        # start at 0 so a sentence with no matching word does not raise KeyError
        score = 0
        for word, word_frequency in word_frequencies.items():
            # substring match: a stemmed word also matches inside longer words
            if word in sentence_lower:
                score += word_frequency
        
        # normalizing score by dividing by the sentence length in tokens
        sentences_with_score[sentence] = score / word_count_in_sentence
        
    return sentences_with_score

In [11]:
sentences_with_score = score_sentences(sentences, frequency_table)

In [12]:
sentence, score = random.choice(list(sentences_with_score.items()))
print('Sentence:', sentence)
print('Score:', score)

Sentence: It’s a fact, if you don’t know what you want you’ll get what life hands you and it may not be in your best interest, affirms author Larry Weidel: “Winners know that if you don’t figure out what you want, you’ll get whatever life hands you.” The key is to develop a powerful vision of what you want and hold that image in your mind.
Score: 0.7195121951219512
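
`score_sentences` tests `word in sentence.lower()`, which is a substring check, so a short stem can match inside an unrelated word. The sketch below contrasts that with matching on whole tokens. It is an illustration under assumptions, not the notebook's method: `toy_tokens` is a hypothetical regex tokenizer standing in for `word_tokenize`, and the frequency values are made up.

```python
import re

def toy_tokens(sentence):
    """Hypothetical stand-in for word_tokenize: lowercase alphanumeric tokens."""
    return re.findall(r"\w+", sentence.lower())

def substring_score(sentence, frequencies):
    """Score as the notebook does: a word counts if it occurs anywhere in the text."""
    total = sum(f for w, f in frequencies.items() if w in sentence.lower())
    return total / len(toy_tokens(sentence))

def token_score(sentence, frequencies):
    """Score only words that appear as whole tokens of the sentence."""
    tokens = toy_tokens(sentence)
    present = set(tokens)
    total = sum(f for w, f in frequencies.items() if w in present)
    return total / len(tokens)

frequencies = {"art": 5, "game": 2}  # made-up frequencies for illustration
sentence = "Start the game now"

print(substring_score(sentence, frequencies))  # 1.75: "art" matches inside "start"
print(token_score(sentence, frequencies))      # 0.5: only "game" is a whole token
```

Whole-token matching has its own cost: a stem such as `resili` no longer matches `resilient` unless the sentence tokens are stemmed as well.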


## Get Threshold

In [13]:
def get_threshold_by_average_score(sentences):
    
    sum_scores = 0
    for _, score in sentences.items():
        sum_scores += score
        
    # average score per sentence, rounded to two decimals
    average = round(sum_scores / len(sentences), 2)
    return average

In [14]:
threshold = get_threshold_by_average_score(sentences_with_score)
print('Threshold Score:', threshold)

Threshold Score: 1.13
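
As a worked example of the averaging step, the sketch below computes the same rounded mean over three sentence scores and shows which of them would pass a strict `>` comparison against it. The sentences and scores are invented for illustration and do not come from the corpus.

```python
# Made-up sentence scores, in document order.
toy_scores = {
    "First sentence.": 0.5,
    "Second sentence.": 1.5,
    "Third sentence.": 1.0,
}

# Same rule as get_threshold_by_average_score: mean score, rounded to 2 decimals.
toy_threshold = round(sum(toy_scores.values()) / len(toy_scores), 2)

# Strict comparison: a sentence scoring exactly the average is not selected.
selected = [s for s, score in toy_scores.items() if score > toy_threshold]

print(toy_threshold)  # 1.0
print(selected)       # ['Second sentence.']
```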


## Get Final Summary

In [15]:
def generate_summary(sentences, threshold):
    
    sentence_count = 0
    text_summary = ''
    
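    for sentence, score in sentences.items():
        # keep only sentences scoring above the threshold
        if score > threshold:
            text_summary += ' ' + sentence.strip()
        # counts every scored sentence, not only those kept in the summary
        sentence_count += 1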
    
    return text_summary, sentence_count

In [16]:
text_summary, sentence_count = generate_summary(sentences_with_score, threshold)

In [17]:
print(f'AWESOME, NEAT & CONCISE TEXT SUMMARY ({sentence_count} SENTENCES):\n')
print(text_summary.strip())

AWESOME, NEAT & CONCISE TEXT SUMMARY (53 SENTENCES):

Have you experienced this before? However, it’s important not to be discouraged by failure when pursuing a goal or a dream, since failure itself means different things to different people. Same failure, yet different responses. Each person has a different mindset that decides their outcome. Those who are resilient stay in the game longer and draw on their inner means to succeed. I’ve coached many clients who gave up after many years toiling away at their respective goal or dream. I know one thing for certain: don’t settle for less than what you’re capable of, but strive for something bigger. It must come from within you. Gnaw away at your problems until you solve them or find a solution. Most times, problems help you gain a skill or develop the resources to succeed later. So embrace your challenges and develop the grit to push past them instead of retreat in resignation. Where are you settling in your life right now? Are you willing

---