### Text Summarization to extract keywords and verify the originality of the content on Twitter

An enormous amount of data is generated every second across the globe, and it is possible to gather insights from these sources. However, a large portion of this data is either redundant or does not contain useful information. The most efficient way to reach the important part of the data, without sifting through what is redundant or insignificant, is to summarize it so that only non-redundant, useful information remains. This data can take any form, such as audio, video, images, and text. In this article, we will see how automatic text summarization techniques can be used to summarize text data, and how the keywords from the summarized data can be used to verify the originality of content on Twitter.

- Automatic Text Summarization 

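The main idea of summarization is to find a subset of the data which contains the "information" of the entire set. There are two general approaches to automatic text summarization:

a) Extraction-based summarization, which selects existing sentences or phrases from the source text
b) Abstraction-based summarization, which generates new sentences that paraphrase the source text

This article follows the extraction-based approach.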

### Text Summarization using NLTK - Steps 

I shall explain the steps involved in text summarization with the help of an example. Here, I have considered an article from the internet.

In [1]:
# Import all the necessary libraries

import numpy as np
import json
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.chunk import tree2conlltags
from nltk import pos_tag

In [10]:
# Import data that needs to be summarized
# (the with-block closes the file once it has been read)
with open("test_summ.txt", "r") as f:
    text = f.read()

In [11]:
print(text)

Elon Musk, under pressure from his lawyers and investors of Tesla, the company he co-founded, reached a deal with the Securities and Exchange Commission on Saturday to resolve securities fraud charges. The settlement will force Mr. Musk to step aside as chairman for three years and pay a $20 million fine.

The S.E.C. announced the deal two days after it sued Mr. Musk in federal court for misleading investors over his post on Twitter last month that he had “funding secured” for a buyout of the electric-car company at $420 a share.

The deal with the S.E.C. will allow him to remain as chief executive, something he could have jeopardized if he had gone to battle with the agency.

It is not clear why Mr. Musk changed his mind so quickly.

People familiar with the situation, who were not authorized to speak publicly on the matter, said lawyers for Mr. Musk and the company moved to reopen the talks with the S.E.C. on Friday. During that time, one of Tesla’s lawyers became instrumental in sec

#### Step 1 :

Remove stop words (any word that does not add value to the meaning of a sentence). Before doing so, I have also cleaned the text: lowercasing it, expanding contractions, and stripping punctuation. Then tokenize the words in the document and find the normalized frequency of each word.
 

In [51]:
def clean_text(text):
    # Lowercase, then expand common English contractions
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"\'s", " is", text)
    text = re.sub(r"\'d", " would", text)
    text = re.sub(r"\'ll", " will", text)
    text = re.sub(r"\'ve", " have", text)
    text = re.sub(r"\'re", " are", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    # Strip punctuation (periods are kept so sentences can still be split)
    text = re.sub(r"[-\"/;:|<>{}+=#_,']", "", text)
    # Remove newlines
    text = re.sub(r"\n", "", text)
    # Strip the UTF-8 byte sequences of the curly quotes and curly apostrophe
    text = re.sub(r"\xe2\x80\x9c", "", text)
    text = re.sub(r"\xe2\x80\x9d","", text)
    text = re.sub(r"\xe2\x80\x99","",text)
    return text
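
The function above works on Python 2 byte strings. In Python 3, where text is Unicode, a pattern such as `\xe2\x80\x9c` would match three separate code points rather than a curly quote. The sketch below is a hypothetical Python 3 variant (the name `clean_text_py3` and the sample sentence are my own, not part of the original notebook). It also replaces newlines with a space instead of deleting them, on the assumption that fused tokens such as `fine.the` in the cleaned output are not intended.

```python
import re

def clean_text_py3(text):
    """Hypothetical Python 3 variant of clean_text (an assumption, not the original)."""
    text = text.lower()
    # Expand the same contractions as the original function
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"'s", " is", text)
    text = re.sub(r"'d", " would", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "cannot", text)
    # Strip the same punctuation set; periods are kept
    text = re.sub(r"[-\"/;:|<>{}+=#_,']", "", text)
    # Curly quotes and curly apostrophe, written as Unicode code points
    text = re.sub("[\u201c\u201d\u2019]", "", text)
    # Collapse each run of newlines into one space so paragraphs are not fused
    text = re.sub(r"\s*\n\s*", " ", text)
    return text.strip()

sample = ("Pay a $20 million fine.\n\nThe S.E.C. said he had "
          "\u201cfunding secured\u201d, and Tesla\u2019s lawyers won't object.")
cleaned = clean_text_py3(sample)
print(cleaned)
```

Keeping a space at paragraph boundaries should let `sent_tokenize` split sentences that the original cleaning joins together, although the exact splits still depend on the tokenizer.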

In [52]:
cleaned_text = []

cleaned_text.append(clean_text(text))

In [53]:
cleaned_text[0]

'elon musk under pressure from his lawyers and investors of tesla the company he cofounded reached a deal with the securities and exchange commission on saturday to resolve securities fraud charges. the settlement will force mr. musk to step aside as chairman for three years and pay a $20 million fine.the s.e.c. announced the deal two days after it sued mr. musk in federal court for misleading investors over his post on twitter last month that he had funding secured for a buyout of the electriccar company at $420 a share.the deal with the s.e.c. will allow him to remain as chief executive something he could have jeopardized if he had gone to battle with the agency.it is not clear why mr. musk changed his mind so quickly.people familiar with the situation who were not authorized to speak publicly on the matter said lawyers for mr. musk and the company moved to reopen the talks with the s.e.c. on friday. during that time one of teslas lawyers became instrumental in securing a deal with t

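#### Set Python's default encoding to UTF-8 (Python 2 only)

In [54]:
# Python 2 only: in Python 3, reload() is not a builtin and
# sys.setdefaultencoding() does not exist, so this cell should be skipped there
import sys  

reload(sys)  
sys.setdefaultencoding('utf8')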

In [55]:
# In this step the words are tokenized and stop words are removed

stopWords = set(stopwords.words("english"))

words = word_tokenize(cleaned_text[0])

In [57]:
without_stop_words = {}

for word in words :
    word = word.lower()
    
    if word in stopWords:
        continue
    if word in without_stop_words:
        without_stop_words[word] += 1
    else :
        without_stop_words[word] = 1
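
The counting loop above can also be written with `collections.Counter`. The sketch below is a minimal illustration using only the standard library: `stop_words_demo` and `tokens_demo` are hypothetical stand-ins for the NLTK stop-word set and the output of `word_tokenize`. The second counter additionally drops tokens with no letters or digits, on the assumption that punctuation such as `.` and `$` should not be scored as words.

```python
from collections import Counter

# Hypothetical stand-ins for stopwords.words("english") and word_tokenize(...)
stop_words_demo = {"the", "a", "with", "on", "to"}
tokens_demo = ["the", "deal", "with", "the", "s.e.c.", "on", "saturday",
               "a", "deal", "."]

# Same behaviour as the loop: count every token that is not a stop word
word_counts = Counter(
    t.lower() for t in tokens_demo if t.lower() not in stop_words_demo
)

# Optional variant: also skip tokens made only of punctuation or symbols
content_counts = Counter(
    t.lower() for t in tokens_demo
    if t.lower() not in stop_words_demo and any(ch.isalnum() for ch in t)
)

print(word_counts)
print(content_counts)
```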

The `without_stop_words` dictionary contains the individual words and characters present in the document as keys, and the value for each key is its frequency of occurrence. These raw counts are normalised to the range [0, 1] further below.

In [58]:
without_stop_words

{'$': 5,
 '.': 35,
 '10': 1,
 '14': 1,
 '20': 3,
 '3': 1,
 '420': 1,
 '7': 1,
 'ability': 1,
 'accomplishments': 1,
 'according': 2,
 'account': 1,
 'act': 1,
 'action': 1,
 'add': 1,
 'addition': 1,
 'address': 1,
 'admit': 1,
 'admitted': 1,
 'affect': 1,
 'agency.it': 1,
 'ago': 1,
 'agreed': 1,
 'agreeing': 1,
 'allow': 1,
 'almost': 1,
 'along': 1,
 'also': 3,
 'american': 1,
 'amount': 1,
 'analysts': 1,
 'announced': 3,
 'another': 1,
 'aside': 1,
 'asked': 1,
 'attacks': 1,
 'audacious': 1,
 'aug.': 1,
 'authorized': 2,
 'avakian': 1,
 'away': 1,
 'backed': 1,
 'bar': 2,
 'battle': 1,
 'battles': 1,
 'became': 1,
 'become': 2,
 'began': 1,
 'behind': 1,
 'bets': 1,
 'big': 1,
 'billionaire': 1,
 'board': 2,
 'boring': 1,
 'bought': 1,
 'briefed': 2,
 'buying': 1,
 'buyout': 1,
 'called': 1,
 'carmakers': 1,
 'caused': 1,
 'chairman': 4,
 'chairman.the': 1,
 'change': 1,
 'changed': 1,
 'characterized': 1,
 'charge': 2,
 'charged': 1,
 'charges': 1,
 'chief': 2,
 'civil': 2,
 'c

In [59]:
# Normalise every count to [0, 1] by dividing by the largest count
# (here I have considered the minimum count as 0)

maximum_count = float(max(without_stop_words.values()))

for word in without_stop_words:
    without_stop_words[word] /= maximum_count
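
As a worked example of this normalisation, the sketch below applies min-max scaling with the minimum fixed at 0 to a few counts copied from the dictionary printed earlier. The helper name `normalise_counts` is my own. Treating 35 (the count of `.`) as the maximum is an assumption, since the printed dictionary is truncated.

```python
def normalise_counts(counts, minimum=0.0):
    """Min-max scale raw counts; with minimum=0 this is count / max(count)."""
    maximum = float(max(counts.values()))
    return {w: (c - minimum) / (maximum - minimum) for w, c in counts.items()}

# A handful of counts taken from the printed without_stop_words dictionary
raw_counts = {".": 35, "$": 5, "chairman": 4, "also": 3, "buyout": 1}
normalised = normalise_counts(raw_counts)

for word, weight in sorted(normalised.items(), key=lambda kv: -kv[1]):
    print(word, round(weight, 3))
```

If the period really is the most frequent token, every word weight is expressed relative to punctuation, which is one reason to consider filtering punctuation before counting.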
        

#### Step 2 :

Identify individual sentences from the document. Using the word frequency table obtained in Step 1, look up the value of each word in a sentence and add these values together to obtain a score for that sentence. Then identify the sentences whose score is above a chosen threshold value.

In [60]:
sentences = sent_tokenize(cleaned_text[0])
sentence_value = {} 

for sentence in sentences :
    # Give every sentence an entry, even if none of its words is scored,
    # so that later lookups cannot raise a KeyError
    sentence_value.setdefault(sentence, 0.0)
    for w in word_tokenize(sentence) :
        if w in without_stop_words:
            sentence_value[sentence] += without_stop_words[w]
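
The scoring rule can be seen in isolation with a small self-contained sketch. The function name `score_sentences`, the weights, and the two pre-tokenised sentences are hypothetical values chosen for illustration, not data from the article.

```python
def score_sentences(tokenised_sentences, weights):
    """Sum the weight of every known token in each sentence."""
    scores = {}
    for sentence, tokens in tokenised_sentences.items():
        # Tokens without a weight (stop words, unseen words) contribute 0
        scores[sentence] = sum(weights.get(tok, 0.0) for tok in tokens)
    return scores

# Hypothetical normalised word weights
weights_demo = {"deal": 1.0, "settlement": 0.5, "musk": 0.5}

# Hypothetical sentences, already split into tokens
sentences_demo = {
    "the deal was a settlement .": ["the", "deal", "was", "a", "settlement", "."],
    "it is not clear why .": ["it", "is", "not", "clear", "why", "."],
}

scores_demo = score_sentences(sentences_demo, weights_demo)
print(scores_demo)
```

Because the score is a plain sum, longer sentences tend to score higher. Dividing by the number of scored tokens would be one way to reduce that bias, if it turns out to matter for a given text.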


In [61]:
sentence_value

{'according to a person familiar with the negotiations.the whipsaw events of the past few days followed a series of selfinflicted wounds by mr. musk.his tweet about taking his company private along with attacks on critics on social media raised concerns with investors about whether mr. musk has become too focused on criticism from socalled shortsellers who had been making bets against him and tesla.': 5.342857142857141,
 'and now with mr. musk agreeing to step down as chairman teslas board must decide who should replace him.': 2.6857142857142855,
 'and they gave him a sense of how difficult it is to fight one of these suits even if he eventually won said people familiar with the negotiations.the parties worked much of friday and saturday to get a deal done.the settlement clears a big headache for tesla but other problems remain.the s.e.c.': 4.085714285714285,
 'announced the deal two days after it sued mr. musk in federal court for misleading investors over his post on twitter last mon

In [68]:
# I have considered a threshold of 4.0
threshold = 4.0

summarised = {}   
         
for x in sentences :
    # .get() treats a sentence without a score as 0
    if sentence_value.get(x, 0.0) >= threshold :
        summarised[x] = sentence_value[x]
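
A fixed threshold such as 4.0 depends on the length and vocabulary of the document. As a possible alternative, the sketch below keeps the `n` best sentences instead. The function name `top_sentences` and the short keys `s1` to `s4` are hypothetical, and the scores are rounded illustrative values.

```python
import heapq

def top_sentences(scores, n=3):
    """Return the n highest-scoring sentences, best first."""
    return heapq.nlargest(n, scores, key=scores.get)

# Hypothetical sentence scores keyed by a short sentence id
demo_scores = {"s1": 5.34, "s2": 2.69, "s3": 4.09, "s4": 0.8}

top_two = top_sentences(demo_scores, n=2)
print(top_two)
```

This keeps the summary length predictable, at the cost of sometimes including weak sentences when the document has few strong ones.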


In [69]:
summarised.keys()


['the amount of stock being bought by mr. musk matches the penalty the company has to pay under the settlement which was filed in federal court in manhattan.the s.e.c.',
 'and they gave him a sense of how difficult it is to fight one of these suits even if he eventually won said people familiar with the negotiations.the parties worked much of friday and saturday to get a deal done.the settlement clears a big headache for tesla but other problems remain.the s.e.c.',
 'but the settlement on saturday requires him to do that.in a socalled admit nor deny settlement a settling party cannot later disavow the terms of the settlement.after mr. musk was said to have rejected the deal on thursday his lawyers asked the s.e.c.',
 'had granted waivers to all of those companies so his settlement would not be held against them.waivers in such a situation are not uncommon legal experts have said.a tesla spokesman said that mr. musk a billionaire would be buying $20 million in tesla stock.',
 'on friday

#### Step 3 :

Extract keywords from the summarised sentences. Each sentence is tagged with parts of speech and chunked into noun phrases, the singular nouns are collected, and the nouns that occur more than once are kept as keywords.

In [70]:
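tagged = {}

# Noun-phrase grammar: an optional determiner, any number of adjectives,
# then a singular (or mass) noun
pattern = 'NP: {<DT>?<JJ>*<NN>}'

m = nltk.RegexpParser(pattern)

       
for summ in summarised.keys() :
    # Part-of-speech tag the tokens of the sentence
    tagged[summ] = pos_tag(word_tokenize(summ))
    # Group the tagged tokens into noun-phrase chunks
    tagged[summ] = m.parse(tagged[summ])
    # Flatten the chunk tree into (word, POS tag, IOB chunk label) triples
    tagged[summ] = tree2conlltags(tagged[summ])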
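
To make the grammar `NP: {<DT>?<JJ>*<NN>}` and the resulting IOB labels more concrete, the sketch below emulates the pattern with a plain regular expression over a string of POS tags. This is a rough standard-library illustration of my own (the function `np_chunk_tags` is hypothetical), not how NLTK is implemented. The tag sequence is the one printed for "according to a person familiar with the negotiations.the whipsaw events".

```python
import re

def np_chunk_tags(pos_tags):
    """Emulate the grammar NP: {<DT>?<JJ>*<NN>} and return IOB chunk labels."""
    tag_string = "".join("<%s>" % t for t in pos_tags)
    labels = ["O"] * len(pos_tags)
    for match in re.finditer(r"(?:<DT>)?(?:<JJ>)*<NN>", tag_string):
        # Convert character offsets back to token positions
        start = tag_string[:match.start()].count("<")
        length = match.group().count("<")
        labels[start] = "B-NP"            # first token of the chunk
        for i in range(start + 1, start + length):
            labels[i] = "I-NP"            # tokens inside the chunk
    return labels

# POS tags for: according to a person familiar with the negotiations.the whipsaw events
demo_tags = ["VBG", "TO", "DT", "NN", "JJ", "IN", "DT", "JJ", "JJ", "NNS"]
demo_labels = np_chunk_tags(demo_tags)
print(list(zip(demo_tags, demo_labels)))
```

Only "a person" forms a chunk here. The plural "events" is tagged `NNS`, which the pattern does not accept, so the determiner and adjectives in front of it stay outside any chunk.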


In [72]:
tagged

{'according to a person familiar with the negotiations.the whipsaw events of the past few days followed a series of selfinflicted wounds by mr. musk.his tweet about taking his company private along with attacks on critics on social media raised concerns with investors about whether mr. musk has become too focused on criticism from socalled shortsellers who had been making bets against him and tesla.': [('according',
   'VBG',
   u'O'),
  ('to', 'TO', u'O'),
  ('a', 'DT', u'B-NP'),
  ('person', 'NN', u'I-NP'),
  ('familiar', 'JJ', u'O'),
  ('with', 'IN', u'O'),
  ('the', 'DT', u'O'),
  ('negotiations.the', 'JJ', u'O'),
  ('whipsaw', 'JJ', u'O'),
  ('events', 'NNS', u'O'),
  ('of', 'IN', u'O'),
  ('the', 'DT', u'O'),
  ('past', 'JJ', u'O'),
  ('few', 'JJ', u'O'),
  ('days', 'NNS', u'O'),
  ('followed', 'VBD', u'O'),
  ('a', 'DT', u'B-NP'),
  ('series', 'NN', u'I-NP'),
  ('of', 'IN', u'O'),
  ('selfinflicted', 'JJ', u'O'),
  ('wounds', 'NNS', u'O'),
  ('by', 'IN', u'O'),
  ('mr.', 'NN', u

In [82]:
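keywords = []

for y in tagged.keys() :
    for word, pos, chunk in tagged[y] :
        # Keep singular nouns that lie inside a noun-phrase chunk.
        # A membership test is needed here: `chunk == 'B-NP' or 'I-NP'`
        # would always be true because a non-empty string is truthy.
        if pos == 'NN' and chunk in ('B-NP', 'I-NP') :
            keywords.append(word)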

In [88]:
key_words = {}

for key_word in keywords :
    if key_word in key_words.keys() :
        key_words[key_word] += 1
    else :
        key_words[key_word] = 1
    



In [92]:
# Keep only the keywords that occur more than once
keys = {}

for kw in key_words.keys():
    if key_words[kw] > 1 :
        keys[kw] = key_words[kw]
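
The two steps above (count each keyword, then keep the repeated ones) can be combined with `collections.Counter`. In the sketch below the function `repeated_keywords`, its `exclude` parameter, and the demo list are hypothetical. The exclusion set is an assumption meant for tokens such as `mr.` that are frequent without being informative.

```python
from collections import Counter

def repeated_keywords(words, min_count=2, exclude=()):
    """Count keywords and keep those seen at least min_count times."""
    counts = Counter(w for w in words if w not in exclude)
    return {w: c for w, c in counts.items() if c >= min_count}

# Hypothetical list of extracted nouns
demo_keywords = ["settlement", "deal", "mr.", "settlement", "stock",
                 "deal", "mr.", "settlement"]

demo_repeated = repeated_keywords(demo_keywords, exclude={"mr."})
print(demo_repeated)
```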
        

In [93]:
keys


{'chairman': 2,
 'charge': 2,
 'company': 7,
 'court': 2,
 'deal': 4,
 'fraud': 2,
 'friday': 2,
 'matter': 2,
 'mr.': 9,
 'musk': 5,
 'person': 2,
 's.e.c': 10,
 'settlement': 10,
 'situation': 2,
 'spokesman': 2,
 'stock': 3,
 'tesla': 6,
 'thursday': 2}