# Text summarization using NLTK

**AIM:** Design a script to produce a summary from scrapped tweets.  
**Reference:** [Text summarization with NLTK in Python.](https://stackabuse.com/text-summarization-with-nltk-in-python/)  


STEPS:  
- define text preprocessing to be done and do the text preprocessing
- combine cleaned tweets into one document
- extract a summary of the information from the large document

In [1]:
# necessary imports
##
import pandas as pd

import nltk
import re
from textblob import TextBlob

import heapq

from tqdm import tqdm

import language_tool_python

In [38]:
# read in a csv file
##
df = pd.read_csv("#RespectMyHustle.csv")

display(df.head())

Unnamed: 0.1,Unnamed: 0,tweet_id,text,created_at,likes,retweet_count
0,0,1310865450960642048,“Coming together is a beginning; keeping toget...,2020-09-29 08:54:12,0,0
1,1,1310864804010225665,Today is Empowerment day for Hustlers.\n\nEver...,2020-09-29 08:51:38,1,0
2,2,1310864208859402240,MERCY is all i crave for right nw. Oh Lord...s...,2020-09-29 08:49:16,1,0
3,3,1310863174296453120,we all know what this hustler movment is about...,2020-09-29 08:45:10,0,0
4,4,1310862095567773696,Self taught baker. @Stephjoybakery IG and FB. ...,2020-09-29 08:40:52,0,0


### Sample df to 30 rows
This will ensure preprocessing of dataframes take equal or approximately equal time iregardless of dataframe size

In [39]:
if len(df) > 30:
    df = df.sample(30, random_state=30).reset_index(drop=True)
else:
    pass

In [40]:
# a closer look at the texts
##
for tweet in df.text:
    print(tweet, "\n")

“Coming together is a beginning; keeping together is progress; working together is success.”

#TuesdayThoughts 
#RespectMyHustle 

Today is Empowerment day for Hustlers.

Every Hustle matters.

#RespectMyHustle 

#HustlerNation 

MERCY is all i crave for right nw. Oh Lord...send me an HELPER asap🙏🏼
#lagos #RespectMyHustle #Alert 

we all know what this hustler movment is about.Its only one mans political ambitions,real kenyans know were on our own. #respectmyhustle 

Self taught baker. @Stephjoybakery IG and FB. #RespectMyHustle 

#respectmyhustle

I do online writing 

This is a country of few millionaires and tens of millions Hustlers.

Thank u DP Ruto for recognizing Hustlers.

#RespectMyHustle 

The dynasties can keep dictionary meaning of the word hustler to themselves, as a farmer am an hustler #respectmyhustle 

@WilliamsRuto We appriciate your concern with Kenyans, with love from Tanzania #RespectMyHustle #FUTUREPRESIDENT 

@Future_maina Very cute. How much is the dog?#RespectM

### Text preprocessing steps
- remove links
- remove hashtags
- remove special characters (preserving fullstops since they mark end of sentence and we're doing summarization based off of sentence weights, also preserve single quotes)
- remove numbers
- remove emojis and emoticons
- correct spelling
- stripping the text
- put a fullstop at the end of each preprocessed tweet if there is not one
- make the first letter of preprocessed tweets a capital letter if it isn't the case

In [18]:
# function to do the preprocessing
##

punctuation = '!"#$%&\()*+,-/:;<=>?@[\\]^_`{|}~…' # removed fullstop and single quote
emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # chinese char
        u"\U00002702-\U000027B0"
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642" 
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # dingbats
        u"\u3030"
                      "]+", re.UNICODE)

def text_process(text):
    text = re.sub(r'https?://\S+|www\.\S+', repl=r'', string=text)
    text = text.split()
    text = ' '.join([word for word in text if not word.startswith("#")])
    text = ''.join([c for c in text if c not in punctuation])
    text = ''.join([i for i in text if not i.isdigit()])
    text = re.sub(emoj, '', text)
    text = str(TextBlob(text).correct())
    text = text.strip()
    if not text.endswith("."):
        text = text + "."
    if not text[0].isupper():
        text = text[0].upper() + text[1:]
    return text

# text processing function preserving digits
def text_processdig(text):
    text = re.sub(r'https?://\S+|www\.\S+', repl=r'', string=text)
    text = text.split()
    text = ' '.join([word for word in text if not word.startswith("#")])
    text = ''.join([c for c in text if c not in punctuation])
    text = re.sub(emoj, '', text)
    text = str(TextBlob(text).correct())
    text = text.strip()
    if not text.endswith("."):
        text = text + "."
    if not text[0].isupper():
        text = text[0].upper() + text[1:]
    return text

In [19]:
# example
print("Original text: ", df.text[0])
print("\nPreprocessed text: ", text_process(df.text[0]))

Original text:  The story about closing of schools has made school girls know that love is not given but paid for.
 #Covid19Millionaires

Preprocessed text:  The story about closing of schools has made school girls know that love is not given but paid for.


Designing a document of preprocessed texts joined together

In [20]:
doc_without_numbers = ""

for tweet in tqdm(df.text, desc="Designing preprocessed document:"):
    tweet = text_process(tweet)
    doc_without_numbers += tweet
    
doc_with_numbers = ""

for tweet in tqdm(df.text, desc="Designing document preserving numbers:"):
    tweet = text_processdig(tweet)
    doc_with_numbers += tweet

Designing preprocessed document:: 100%|██████████| 30/30 [00:16<00:00,  1.84it/s]
Designing document preserving numbers:: 100%|██████████| 30/30 [00:17<00:00,  1.76it/s]


Convert document to recognizable sentences

In [21]:
sentence_list = nltk.sent_tokenize(doc_with_numbers)

In [22]:
# previewing the first 5 sentences
##
sentence_list[:5]

['The story about closing of schools has made school girls know that love is not given but paid for.NMSKenya Meeting for the how will I benefit from it by the way.ItsNyambane same way fools like you are dancing with who stole from the sick.',
 'staff.Were amongst those who attended the summit.',
 "Shameless to sit and ridicule fellow Kenyans with face speeches.He are Still Waiting for to be Brought to Took Am AFRAID He might wait till we get to the GRAVE.ARE ALREADY LEARNING THAT MORE MONEY MORE PROBLEMS.OmbetaC You wanted him to join hands with after feasting covid19 funds He did well of.Welcome to sonya where to be poor is a crime.MarkDienya RailaOdinga He is not part of the his.An one get a good job in sonya nowadays without experience.... An't you train me and make me experienced.CORRUPTION of NEW NORMAL of KENYA of SHOULD LEARN of COPE WITH of JKNjenga.You heard the message loud and clear no money was lost.",
 'Real with it...TheStarKenya StateHouseKenya MOHKenya But You Stole Mon

Find weighted frequency of occurence of each word(using preprocessed document)

In [23]:
stopwords = nltk.corpus.stopwords.words('english')

word_frequencies = {}
for word in nltk.word_tokenize(doc_without_numbers):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
            
# to find the weighted frequency, we can simply divide the number of occurances of
# all the words by the frequency of the most occurring word

maximum_frequncy = max(word_frequencies.values())

for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequncy)

Calculating sentence scores

In [24]:
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) > 3 & len(sent.split(' ')) < 20:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

### Getting summary from sentences with higher score than others
Taking care of shorter and extremely longer summaries  
- summaries below 1500 characters  
- summaries over  2500 characters

In [25]:
n = 5

summary_sentences = heapq.nlargest(n, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)

if len(summary) < 1500 and len(' '.join(df.text)) > 1500:
    while len(summary) < 2500:
        if n+3 < len(sentence_list):
            n = n+3
        summary_sentences = heapq.nlargest(n, sentence_scores, key=sentence_scores.get)
        
summary = ' '.join(summary_sentences)

In [26]:
print(summary)

Shameless to sit and ridicule fellow Kenyans with face speeches.He are Still Waiting for to be Brought to Took Am AFRAID He might wait till we get to the GRAVE.ARE ALREADY LEARNING THAT MORE MONEY MORE PROBLEMS.OmbetaC You wanted him to join hands with after feasting covid19 funds He did well of.Welcome to sonya where to be poor is a crime.MarkDienya RailaOdinga He is not part of the his.An one get a good job in sonya nowadays without experience.... An't you train me and make me experienced.CORRUPTION of NEW NORMAL of KENYA of SHOULD LEARN of COPE WITH of JKNjenga.You heard the message loud and clear no money was lost. Did you mean conference.It is high time the pack and go He are tired Sonya is tired Poverty is rising.Hurt heads to France to borrow more for.You look at of Magoha's confusion and you wonder what's the essence of that 100page of.USAmbKenya It least u now know our got is full of.All national thieves are at kick.... Real with it...TheStarKenya StateHouseKenya MOHKenya But 

In [27]:
len(sentence_list)

10

Correcting errors in sentences

In [29]:
# example of how the tool makes correction
##
tool = language_tool_python.LanguageTool('en-US')
text = 'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'
matches = tool.check(text)

print(f"Length of matches: {len(matches)}")

text = tool.correct(text)
print(text)

Length of matches: 2
A sentence with an error in the Hitchhiker’s Guide to the Galaxy


In [30]:
# applying to our summary text
##
matches = tool.check(summary)

In [31]:
len(matches)

42

In [32]:
summary = tool.correct(summary)

In [33]:
print(summary)

Shameless to sit and ridicule fellow Kenyans with face speeches. He is Still Waiting for to be Brought to Took Am AFRAID He might wait till we get to the GRAVE.ARE ALREADY LEARNING THAT MORE MONEY MORE PROBLEMS. OmbetaC You wanted him to join hands with after feasting COVID-19 funds He did well of. Welcome to Sonya where to be poor is a crime. MarkDienya Railroading He is not part of the. An one get a good job in Sonya nowadays without experience.... An't you train me and make me experienced.CORRUPTION of NEW NORMAL of KENYA of SHOULD LEARN of COPE WITH of JKNjenga. You heard the message loud and clear no money was lost. Did you mean conference? It is high time the pack and go He is tired Sonya is tired Poverty is rising. Hurt heads to France to borrow more for. You look at of Martha's confusion and you wonder what's the essence of that 100page of.USAmbKenya It least u now know our got is full of. All national thieves are at kick.... Real with it... TheStarKenya StateHouse Kenya MOH Ke