# Text summarization using NLTK

**AIM:** Design a script to produce a summary from scrapped tweets.  
**Reference:** [Text summarization with NLTK in Python.](https://stackabuse.com/text-summarization-with-nltk-in-python/)  


STEPS:  
- define text preprocessing to be done and do the text preprocessing
- combine cleaned tweets into one document
- extract a summary of the information from the large document

In [1]:
# necessary imports
##
import pandas as pd

import nltk
import re
from textblob import TextBlob

import heapq

from tqdm import tqdm

import language_tool_python

In [2]:
# read in a csv file
##
df = pd.read_csv("#OmwengaMurderDppExposed.csv")

display(df.head())

Unnamed: 0,tweet_id,text,created_at,likes,retweet_count
0,1302966206236897280,"SISI CITY, Egypt's New administrative Capital ...",2020-09-07 13:45:26,0,0
1,1302965798626041856,hey guys!kuna baridi and so i got this for you...,2020-09-07 13:43:48,1,0
2,1302965764731928578,"Before you do any Shit for your boss , think t...",2020-09-07 13:43:40,0,0
3,1302965378721644549,"Core i5 Desktops 4GB ram, 3.0GHZ, 500 GB Hard ...",2020-09-07 13:42:08,1,0
4,1302964894409650176,Take 1minute to thank Dp Ruto who faciliated c...,2020-09-07 13:40:13,8,1


In [3]:
# a closer look at the texts
##
for tweet in df.text[:30]:
    print(tweet, "\n")

SISI CITY, Egypt's New administrative Capital City. https://t.co/YLCEaP8gul via @YouTube #mutahikagwe Guardian Ange… https://t.co/3jCCOSVdaj 

hey guys!kuna baridi and so i got this for you,to make you warm and comfortable whenever you are in town and workin… https://t.co/DkFz6Bt1Oi 

Before you do any Shit for your boss , think twice........
https://t.co/wiUnEHfgpw
#OmwengaMurderDppExposed 

Core i5 Desktops 4GB ram, 3.0GHZ, 500 GB Hard disk, 19 inch monitor, keyboard, mouse, cables  @15K

contact  .☎0718… https://t.co/4GRIgHn9sd 

Take 1minute to thank Dp Ruto who faciliated construction of house to mukamo and Mia hau. From Grass to grace 🙏… https://t.co/U3glPzDkgs 

Two wrongs don't make a right. Look at this @citizentvkenya nes article by Hassan Mugambi. It sheds alot of light o… https://t.co/DE36KEqCY9 

I don’t think think this is an uphill task for the DPP when there’s evidence of someone clearly taking the gun from… https://t.co/KoRAO6aYuF 

How to Win Friends and Influence Peo

### Text preprocessing steps
- remove links
- remove hashtags
- remove special characters (preserving fullstops since they mark end of sentence and we're doing summarization based off of sentence weights, also preserve single quotes)
- remove numbers
- remove emojis and emoticons
- correct spelling
- stripping the text
- put a fullstop at the end of each preprocessed tweet if there is not one
- make the first letter of preprocessed tweets a capital letter if it isn't the case

In [4]:
# function to do the preprocessing
##

punctuation = '!"#$%&\()*+,-/:;<=>?@[\\]^_`{|}~…' # removed fullstop and single quote
emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # chinese char
        u"\U00002702-\U000027B0"
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642" 
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # dingbats
        u"\u3030"
                      "]+", re.UNICODE)

def text_process(text):
    text = re.sub(r'https?://\S+|www\.\S+', repl=r'', string=text)
    text = text.split()
    text = ' '.join([word for word in text if not word.startswith("#")])
    text = ''.join([c for c in text if c not in punctuation])
    text = ''.join([i for i in text if not i.isdigit()])
    text = re.sub(emoj, '', text)
    text = str(TextBlob(text).correct())
    text = text.strip()
    if not text.endswith("."):
        text = text + "."
    if not text[0].isupper():
        text = text[0].upper() + text[1:]
    return text

# text processing function preserving digits
def text_processdig(text):
    text = re.sub(r'https?://\S+|www\.\S+', repl=r'', string=text)
    text = text.split()
    text = ' '.join([word for word in text if not word.startswith("#")])
    text = ''.join([c for c in text if c not in punctuation])
    text = re.sub(emoj, '', text)
    text = str(TextBlob(text).correct())
    text = text.strip()
    if not text.endswith("."):
        text = text + "."
    if not text[0].isupper():
        text = text[0].upper() + text[1:]
    return text

In [5]:
# example
print("Original text: ", df.text[0])
print("\nPreprocessed text: ", text_process(df.text[0]))

Original text:  SISI CITY, Egypt's New administrative Capital City. https://t.co/YLCEaP8gul via @YouTube #mutahikagwe Guardian Ange… https://t.co/3jCCOSVdaj

Preprocessed text:  SISI CITY Egypt's New administrative Capital City. via YouTube Guardian One.


Designing a document of preprocessed texts joined together

In [6]:
doc_without_numbers = ""

for tweet in tqdm(df.text, desc="Designing preprocessed document:"):
    tweet = text_process(tweet)
    doc_without_numbers += tweet
    
doc_with_numbers = ""

for tweet in tqdm(df.text, desc="Designing document preserving numbers:"):
    tweet = text_processdig(tweet)
    doc_with_numbers += tweet

Designing preprocessed document:: 100%|██████████| 193/193 [01:16<00:00,  2.52it/s]
Designing document preserving numbers:: 100%|██████████| 193/193 [01:16<00:00,  2.53it/s]


Convert document to recognizable sentences

In [7]:
sentence_list = nltk.sent_tokenize(doc_with_numbers)

In [8]:
# previewing the first 5 sentences
##
sentence_list[:5]

["SISI CITY Egypt's New administrative Capital City.",
 'via YouTube Guardian One.Hey guyskuna bride and so i got this for youth make you warm and comfortable whenever you are in town and working.Before you do any Hit for your boss  think twice........More in Desktop 4GB ram 3.0GHZ 500 of Hard disk 19 inch monitor keyboard mouse tables 15K contact .0718.Take minute to thank Up Auto who facilitated construction of house to mukamo and Via had.',
 "From Grass to grace.Two wrongs don't make a right.",
 'Took at this citizentvkenya new article by Vassar Mugambi.',
 "It sheds clot of light o.I don’t think think this is an uphill task for the DPP when there’s evidence of someone clearly taking the gun from.Now to In Friends and Influence People by Pale Carnegie of.Does CCTV forage really work in Sonya He should not complain too much now considering before Omwenga there was A.YesNo.Are you having Monday blue come amp enjoy hot video amp pick of my friends and I 0796245691.Are you having Monday

Find weighted frequency of occurence of each word(using preprocessed document)

In [9]:
stopwords = nltk.corpus.stopwords.words('english')

word_frequencies = {}
for word in nltk.word_tokenize(doc_without_numbers):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1
            
# to find the weighted frequency, we can simply divide the number of occurances of
# all the words by the frequency of the most occurring word

maximum_frequncy = max(word_frequencies.values())

for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequncy)

Calculating sentence scores

In [10]:
sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) > 3 & len(sent.split(' ')) < 20:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

### Getting summary from sentences with higher score than others
Taking care of shorter and extremely longer summaries  
- summaries below 1500 characters  
- summaries over  2500 characters

In [11]:
n = 5

summary_sentences = heapq.nlargest(n, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)

if len(summary) < 1500 and len(' '.join(df.text)) > 1500:
    while len(summary) < 2500:
        if n+3 < len(sentence_list):
            n = n+3
        summary_sentences = heapq.nlargest(n, sentence_scores, key=sentence_scores.get)
        
summary = ' '.join(summary_sentences)

In [12]:
print(summary)

It is very clear.The DPP failed to investigate this case thanks citizentvkenya for doing this for them now they claim they didn't s.The DPP saying the Arraigning Chris Sure an innocent man alongside the real suspect Robert Do raises more quest.The forage captures some of the events at the Briton group office of businessman Chris Sure on August 21 the da.When doing investigations before you even arrange a suspect in Court you must exhaust all evidence so as to make s.That's the Motive behind the DCI ignoring the CCTV forage The investigations they were doing were based on what.Why is ODPP behaving in this manner  Why does it look like they don't intend to do the right thing.Based on the CCTV forage then Do must answer to whether he had the permission or not otherwise he is the main s.Chris Sure through his lawyer Danstan Mary has challenged the murder charge against him and sought to be made a.SuperGirlKE kemaruth LionOfMara MercyNatra Sure’s lawyers argued the “intended prosecution and

In [13]:
len(sentence_list)

61

Correcting errors in sentences

In [14]:
# example of how the tool makes correction
##
tool = language_tool_python.LanguageTool('en-US')
text = 'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'
matches = tool.check(text)

print(f"Length of matches: {len(matches)}")

text = tool.correct(text)
print(text)

Length of matches: 2
A sentence with an error in the Hitchhiker’s Guide to the Galaxy


In [15]:
# applying to our summary text
##
matches = tool.check(summary)

In [16]:
len(matches)

129

In [17]:
summary = tool.correct(summary)

In [18]:
print(summary)

It is very clear. The DSP failed to investigate this case thanks citizentvkenya for doing this for them now they claim they didn't s. The DSP saying the Arraigning Chris Sure an innocent man alongside the real suspect Robert Do raises more quest. The forage captures some events at the Briton group office of businessman Chris Sure on August 21 the the. When doing investigations before you even arrange a suspect in Court you must exhaust all evidence to make s. That's the Motive behind the DCI ignoring the CCTV forage The investigations they were doing were based on what. Why is Opp behaving in this manner Why does it look like they don't intend to do the right thing. Based on the CCTV forage then Do must answer to whether he had the permission or not otherwise he is the main s. Chris Sure through his lawyer Feinstein Mary has challenged the murder charge against him and sought to be made a. SuperGirlKE earth LionOfMara Mercenary Sure’s lawyers argued the “intended prosecution and crimin