### Extractive_Text_Summarization_Part2
As the name suggests, text summarization condenses long text into bite-sized pieces; the result depends on the approach we choose.

In addition to condensing large texts, this technique can reduce human bias in summaries. Think about the last time you were asked to summarize a book or a story: there is always a degree of subjectivity.

We could use text summarization to condense news articles, survey data, and similar material.

There are two main approaches to text summarization: the **extractive** route and the **abstractive** route. With the extractive approach, we use text features (e.g. syntax, word frequencies, word weights) to select the most important sentences from the original text. With the abstractive approach, we use advanced modeling techniques (e.g. neural networks) to generate new text that summarizes the original. In this notebook, we will focus on the **extractive** method.

In [1]:
##Import Libraries

In [123]:
####To generate Colored Text
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

#print(color.BOLD + 'Hello World !' + color.END)


In [4]:
# !pip install -U gensim

Successfully installed Cython-0.29.14 gensim-3.8.3 smart-op

In [2]:
import gensim
#gensim.summarization was removed in gensim 4.0, so this import needs gensim 3.x (e.g. 3.8.3)
from gensim.summarization.summarizer import summarize
import re
import nltk
#nltk.download('punkt'); nltk.download('stopwords')  #run once if the NLTK data is missing
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
import string
from heapq import nlargest
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx

Text sample we will use in this notebook

In [3]:
text = '''This introduction aims to tell the story of how we
put words into computers. It is part of the story of the field of
natural language processing (NLP), a branch of artificial
intelligence. It targets a wide audience with a basic
understanding of computer programming, but avoids a detailed
mathematical treatment, and it does not present any algorithms.
It also does not focus on any particular application of NLP such
as translation, question answering, or information extraction.
The ideas presented here were developed by many researchers over
many decades, so the citations are not exhaustive but rather direct
the reader to a handful of papers that are, in the author's view,
seminal. After reading this document, you should have a general
understanding of word vectors (also known as word embeddings): why
they exist, what problems they solve, where they come from, how
they have changed over time, and what some of the open questions
about them are. Readers already familiar with word vectors are
advised to skip to Section 5 for the discussion of the most recent
advance, contextual word vectors'''

### Method 1: GloVe Word Embeddings (TextRank)

Word embeddings are vector representations of words: words with similar meanings end up with similar vectors. We use them here to measure how similar the sentences in the paragraph are to one another; TextRank then ranks the sentences based on those similarities.
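To make "similarity" concrete: the cells below compare sentences by the cosine of the angle between their vectors. The following is a minimal sketch in plain Python, using made-up 3-dimensional vectors. The names `toy_vectors` and `cosine` are illustrative and not part of the notebook; the GloVe vectors used later have 300 dimensions.

```python
import math

# Hypothetical 3-d "embeddings", invented for illustration only.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

related = cosine(toy_vectors["king"], toy_vectors["queen"])
unrelated = cosine(toy_vectors["king"], toy_vectors["apple"])
print(related > unrelated)  # vectors pointing the same way score higher
```

A value near 1 means the two vectors point in almost the same direction; a value near 0 means they share little.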

In [4]:
#Split the paragraph into sentences. We want to know how similar the sentences are to each other.
sentences = sent_tokenize(text)
sentences

['This introduction aims to tell the story of how we\nput words into computers.',
 'It is part of the story of the field of\nnatural language processing (NLP), a branch of artificial\nintelligence.',
 'It targets a wide audience with a basic\nunderstanding of computer programming, but avoids a detailed\nmathematical treatment, and it does not present any algorithms.',
 'It also does not focus on any particular application of NLP such\nas translation, question answering, or information extraction.',
 "The ideas presented here were developed by many researchers over\nmany decades, so the citations are not exhaustive but rather direct\nthe reader to a handful of papers that are, in the author's view,\nseminal.",
 'After reading this document, you should have a general\nunderstanding of word vectors (also known as word embeddings): why\nthey exist, what problems they solve, where they come from, how\nthey have changed over time, and what some of the open questions\nabout them are.',
 'Read

Cleaning our sentences.

In [5]:
#load stopwords
stop = nltk.corpus.stopwords.words('english')
 
#Pre-process the text. The intent is to remove punctuation, special characters and numbers,
#since the only thing we care about are the actual words in the text.
#NOTE: str.translate expects a table built with str.maketrans; passing the raw string, as done here,
#leaves punctuation and digits in place and maps '\n' to '+' (see the output below).
clean_sentences = [s.translate(string.punctuation) for s in sentences]
clean_sentences = [s.translate(string.digits) for s in clean_sentences]
#lowercase
clean_sentences = [s.lower() for s in clean_sentences]
 
#remove stopwords: for text summarization purposes,
#these words add no value to word ranking.
def remove_stopwords(sentence):
    #sentence is a list of tokens; returns the kept tokens joined by spaces
    filtered_sentence = " ".join([i for i in sentence if i not in stop])
    return filtered_sentence
clean_sentences = [remove_stopwords(s.split()) for s in clean_sentences]
clean_sentences

['introduction aims tell story we+put words computers.',
 'part story field of+natural language processing (nlp), branch artificial+intelligence.',
 'targets wide audience basic+understanding computer programming, avoids detailed+mathematical treatment, present algorithms.',
 'also focus particular application nlp such+as translation, question answering, information extraction.',
 "ideas presented developed many researchers over+many decades, citations exhaustive rather direct+the reader handful papers are, author's view,+seminal.",
 'reading document, general+understanding word vectors (also known word embeddings): why+they exist, problems solve, come from, how+they changed time, open questions+about are.',
 'readers already familiar word vectors are+advised skip section 5 discussion recent+advance, contextual word vectors']
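The output above still contains punctuation, and line breaks show up as `+`. `str.translate` expects a translation table; when it is handed `string.punctuation` directly, it indexes into that string by code point, so `'\n'` (code point 10) maps to the character at index 10, which is `+`, and every other character is left unchanged. Below is a minimal sketch of one way to strip punctuation and digits, assuming that is the intent. `clean_sentence`, `strip_table` and `toy_stop` are hypothetical names, and the tiny stopword set merely stands in for NLTK's list.

```python
import string

# Translation table that deletes every punctuation character and digit.
strip_table = str.maketrans("", "", string.punctuation + string.digits)

# Tiny stand-in for NLTK's English stopword list (assumption, for illustration).
toy_stop = {"this", "to", "the", "of", "how", "we", "into"}

def clean_sentence(sentence, stopwords=toy_stop):
    """Lowercase, drop punctuation and digits, then remove stopwords."""
    stripped = sentence.translate(strip_table).lower()
    # str.split() with no argument also splits on newlines.
    return " ".join(w for w in stripped.split() if w not in stopwords)

print(clean_sentence("This introduction aims to tell the story of how we\nput words into computers."))
# → introduction aims tell story put words computers
```

Swapping this into the notebook would change the cleaned sentences and therefore every score and summary computed from them, so the recorded outputs in the following cells correspond to the original cleaning step.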

Let’s load in the word embeddings.

In [7]:
word_embeddings = {}
#Each line of the GloVe file is a word followed by its 300 vector components.
with open('glove.6B.300d.txt', encoding='utf-8') as file_:
    for line in file_:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float')
        word_embeddings[word] = coefs

In [None]:
#Let’s compute a vector for each sentence by averaging the vectors of its words

In [22]:
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        vector = sum([word_embeddings.get(w, np.zeros((300,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        vector = np.zeros((300,))
    sentence_vectors.append(vector)
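The cell above represents each sentence as the average of its word vectors: words missing from the embedding table contribute a zero vector, and the `0.001` in the denominator guards against division by zero. The sketch below shows the same idea in plain Python with made-up 2-dimensional embeddings; `toy_embeddings` and `sentence_vector` are illustrative names, not part of the notebook.

```python
# Hypothetical 2-d embeddings; the GloVe vectors in this notebook have 300 dimensions.
toy_embeddings = {"word": [1.0, 0.0], "vectors": [0.0, 1.0]}
DIM = 2

def sentence_vector(sentence, embeddings=toy_embeddings, dim=DIM):
    """Average the vectors of a sentence's words; unknown words count as zeros."""
    words = sentence.split()
    if not words:
        return [0.0] * dim
    total = [0.0] * dim
    for w in words:
        vec = embeddings.get(w, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    # Same smoothing constant as the notebook cell.
    return [t / (len(words) + 0.001) for t in total]

print(sentence_vector("word vectors"))
print(sentence_vector("unknown tokens"))
```

One consequence of this design: out-of-vocabulary words still count in the denominator, so a sentence with many unknown tokens ends up with a shorter vector.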

Next, we compute the cosine similarity between every pair of sentence vectors, run PageRank on the resulting similarity graph to score each sentence, and show the top-ranked sentences as the summary.

In [23]:
#Compute sentence similarity; initialize the matrix with zeros (the diagonal stays zero)
similarity_matrix = np.zeros([len(sentences), len(sentences)])
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            similarity_matrix[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,300), sentence_vectors[j].reshape(1,300))[0,0]


In [29]:
(similarity_matrix).shape

(7, 7)

In [30]:
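#Build a weighted graph (edge weight = similarity) and score each sentence with PageRank.
#from_numpy_array replaces from_numpy_matrix, which was removed in networkx 3.0.
sim_graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(sim_graph)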
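To give a feel for what the PageRank step does with the similarity matrix, here is a rough power-iteration sketch in plain Python. It is not networkx's implementation, which differs in details; the damping value of 0.85 mirrors the default `alpha` of `nx.pagerank`, and the function name and the toy matrix are made up for illustration.

```python
def pagerank(weights, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a square matrix of non-negative edge weights."""
    n = len(weights)
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Rank held by nodes with no outgoing weight is spread uniformly.
        dangling = sum(scores[i] for i in range(n) if out_sums[i] == 0)
        new_scores = []
        for j in range(n):
            incoming = sum(
                scores[i] * weights[i][j] / out_sums[i]
                for i in range(n) if out_sums[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * (incoming + dangling / n))
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

# Toy symmetric similarity matrix for three sentences (made-up numbers).
toy_similarity = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
toy_scores = pagerank(toy_similarity)
print(max(range(3), key=toy_scores.__getitem__))  # index of the top-ranked sentence
```

A sentence that is similar to many other sentences collects more rank, which is why the best-connected sentence (index 0 here) comes out on top.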
 


In [43]:
#Sentence Ranking
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

#Choose desired number of sentences
print("**Summary is:** ")
for i in range(3):
    print(ranked_sentences[i][1])

**Summary is:** 
After reading this document, you should have a general
understanding of word vectors (also known as word embeddings): why
they exist, what problems they solve, where they come from, how
they have changed over time, and what some of the open questions
about them are.
It also does not focus on any particular application of NLP such
as translation, question answering, or information extraction.
The ideas presented here were developed by many researchers over
many decades, so the citations are not exhaustive but rather direct
the reader to a handful of papers that are, in the author's view,
seminal.


In [80]:
summary_wordembed_textrank = []
for i in range(3):
    summary_wordembed_textrank.append(ranked_sentences[i][1])

### Method 2: Weighted Frequencies

In this approach, we score each word by its weighted frequency: its count divided by the count of the most frequent term in the text, so every weight falls between 0 and 1. A sentence's score is then the sum of the weights of its words.
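The whole procedure fits in a few lines. The sketch below walks through it on made-up, pre-tokenized sentences, assuming stopwords were already removed; all names (`toy_sentences`, `weights`, and so on) are illustrative rather than taken from the notebook cells.

```python
from collections import Counter
from heapq import nlargest

# Made-up sentences mapped to their (already cleaned) tokens.
toy_sentences = {
    "Word vectors represent words.": ["word", "vectors", "represent", "words"],
    "Vectors capture word meaning.": ["vectors", "capture", "word", "meaning"],
    "Cats sleep.": ["cats", "sleep"],
}

# 1. Raw counts over the whole text.
counts = Counter(tok for toks in toy_sentences.values() for tok in toks)

# 2. Divide by the largest count so the most frequent word gets weight 1.0.
max_count = max(counts.values())
weights = {word: count / max_count for word, count in counts.items()}

# 3. A sentence's score is the sum of its words' weights.
scores = {sent: sum(weights[t] for t in toks) for sent, toks in toy_sentences.items()}

# 4. Keep the top-scoring sentences.
top = nlargest(2, scores, key=scores.get)
print(weights["word"], weights["cats"])
print(top)
```

Because scores are sums, longer sentences tend to score higher, which is one reason to cap the sentence length before scoring.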

In [46]:
#Split the paragraph into sentences, then repeat the cleaning steps from Method 1.
sentences = sent_tokenize(text)
 
#load stopwords
stop = nltk.corpus.stopwords.words('english')
 
#Same translate calls as in Method 1 (punctuation and digits are left in place; '\n' becomes '+')
clean_sentences = [s.translate(string.punctuation) for s in sentences]
clean_sentences = [s.translate(string.digits) for s in clean_sentences]
#lowercase
clean_sentences = [s.lower() for s in clean_sentences]
clean_sentences = [remove_stopwords(s.split()) for s in clean_sentences]


In [47]:
clean_sentences

['introduction aims tell story we+put words computers.',
 'part story field of+natural language processing (nlp), branch artificial+intelligence.',
 'targets wide audience basic+understanding computer programming, avoids detailed+mathematical treatment, present algorithms.',
 'also focus particular application nlp such+as translation, question answering, information extraction.',
 "ideas presented developed many researchers over+many decades, citations exhaustive rather direct+the reader handful papers are, author's view,+seminal.",
 'reading document, general+understanding word vectors (also known word embeddings): why+they exist, problems solve, come from, how+they changed time, open questions+about are.',
 'readers already familiar word vectors are+advised skip section 5 discussion recent+advance, contextual word vectors']

In [48]:
#Computing the frequency of each word in our sentences.

In [50]:
word_frequencies = {}
for i in range(len(clean_sentences)):
    for word in nltk.word_tokenize(clean_sentences[i]):
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

In [52]:
#Let’s get the maximum frequency and divide each word frequency by it to get the weighted frequency


In [53]:
maximum_frequency = max(word_frequencies.values())


In [54]:
maximum_frequency

14

In [56]:
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)

In [57]:
#Apply the computed frequencies to our UNCLEANED sentences and then show our summary.

In [59]:
#Apply scores to each UNCLEANED SENTENCE; only sentences with fewer than 30 space-separated words are scored

sentence_scores = {}
for sent in sentences:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
 


In [60]:
sentence_scores

{'This introduction aims to tell the story of how we\nput words into computers.': 0.06632653061224489,
 'It is part of the story of the field of\nnatural language processing (NLP), a branch of artificial\nintelligence.': 0.1683673469387755,
 'It targets a wide audience with a basic\nunderstanding of computer programming, but avoids a detailed\nmathematical treatment, and it does not present any algorithms.': 0.21938775510204084,
 'It also does not focus on any particular application of NLP such\nas translation, question answering, or information extraction.': 0.23469387755102042,
 'Readers already familiar with word vectors are\nadvised to skip to Section 5 for the discussion of the most recent\nadvance, contextual word vectors': 0.19387755102040816}

In [125]:
#Choose number of sentences you want in your summary
summary_sentences = nlargest(4, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)
summary_wtd_freq = summary 

It also does not focus on any particular application of NLP such
as translation, question answering, or information extraction. It targets a wide audience with a basic
understanding of computer programming, but avoids a detailed
mathematical treatment, and it does not present any algorithms. Readers already familiar with word vectors are
advised to skip to Section 5 for the discussion of the most recent
advance, contextual word vectors It is part of the story of the field of
natural language processing (NLP), a branch of artificial
intelligence.


### Method 3: Directly using Gensim

In [77]:
#ratio (float, optional): number between 0 and 1 that sets the proportion of the original text's sentences to keep in the summary.
#word_count (int or None, optional): how many words the output will contain. If both parameters are provided, ratio is ignored.
summary_direct_using_gensim = summarize(text, ratio = 0.4)
summary_direct_using_gensim

'natural language processing (NLP), a branch of artificial\nmathematical treatment, and it does not present any algorithms.\nIt also does not focus on any particular application of NLP such\nas translation, question answering, or information extraction.\nunderstanding of word vectors (also known as word embeddings): why\nthey have changed over time, and what some of the open questions\nReaders already familiar with word vectors are\nadvance, contextual word vectors'

### ROUGE Score

A good question after carrying out these algorithms is how to evaluate the result: was it a good summary? We can use a ROUGE score. ROUGE is usually computed against a human-written reference summary; in this notebook, we compare the contents of the summary with the contents of the original text instead. It works the same way as computing recall and precision on non-text data sets. Specifically, we compare the n-grams of the summary with those of the original text. Recall is the number of n-grams the two have in common divided by the total number of n-grams in the original text. Precision is the number of common n-grams divided by the number of n-grams in the summary.
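As a compact illustration of these definitions, the sketch below computes recall, precision and F1 over distinct n-grams for a toy example. It uses a simple regular-expression tokenizer instead of NLTK so that it runs on its own; `ngram_set` and `rouge_n` are hypothetical helper names, and this set-based variant follows the description above rather than the official ROUGE implementation, which also counts repeated n-grams.

```python
import re

def ngram_set(text, n):
    """Lowercase, keep alphanumeric tokens, and return the set of n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def rouge_n(summary, original, n=1):
    """Return (recall, precision, f1) over distinct n-grams."""
    summary_ngrams = ngram_set(summary, n)
    original_ngrams = ngram_set(original, n)
    overlap = summary_ngrams & original_ngrams
    recall = len(overlap) / len(original_ngrams) if original_ngrams else 0.0
    precision = len(overlap) / len(summary_ngrams) if summary_ngrams else 0.0
    # F1 is the harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1

toy_original = "the cat sat on the mat"
toy_summary = "the cat sat"
print(rouge_n(toy_summary, toy_original, n=1))  # recall 0.6, precision 1.0, F1 0.75
print(rouge_n(toy_summary, toy_original, n=2))  # recall 0.4, precision 1.0
```

Because an extractive summary copies sentences from the original, precision stays close to 1 by construction, and recall mostly reflects how much of the original the summary covers.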

In [66]:
#Let’s start with unigrams.

In [70]:
def remove_stopwords(sentence):
    """
    Takes a list of tokens, removes stopwords, and returns the rest as a string.
    """
    filtered_sentence = " ".join([i for i in sentence if i not in stop])
    return filtered_sentence

 
def sanitize_text(sentence):
    """
    Takes in a string and cleans it up.
    """
    sentence = sentence.lower()
    #Replace all non-alphanumeric characters with spaces
    sentence = re.sub(r'[^a-zA-Z0-9\s]', ' ', sentence)
    return sentence



def generate_ngrams(sentence, n):
    """
    Takes in a string and n (the number of tokens per n-gram); returns the list of n-grams.
    """
    #Clean text
    sentence = sanitize_text(sentence)
    #Split sentence into tokens
    tokens = [token for token in word_tokenize(sentence) if token != ""]
    #Create ngrams
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]

In [101]:
#summary = " ".join(summary)

unigrams_sum = generate_ngrams(summary_direct_using_gensim, n=1)
unigrams_orig = generate_ngrams(text, n= 1)
unigrams_sum = set(unigrams_sum)
unigrams_orig = set(unigrams_orig)
 
matches = unigrams_sum.intersection(unigrams_orig)

#Recall
recall = len(matches)/len(unigrams_orig)
#Precision
precision = len(matches)/len(unigrams_sum)

print("Unigram_Rouge_score_of summary generated directly by gensim")
print([recall,precision])
print("F1 is :")
#F1 is the harmonic mean of precision and recall
print(round(2*precision*recall/(precision + recall), 4))

Unigram_Rouge_score_of summary generated directly by gensim
[0.43089430894308944, 1.0]
F1 is :
0.6023


In [113]:
##Let’s look at bigrams.


In [124]:
bigrams_sum = generate_ngrams(summary_direct_using_gensim, n=2)
bigrams_orig = generate_ngrams(text, n= 2)
bigrams_sum = set(bigrams_sum)
bigrams_orig = set(bigrams_orig)
 
matches = bigrams_sum.intersection(bigrams_orig)
#Recall
recall = len(matches)/len(bigrams_orig)
#Precision
precision = len(matches)/len(bigrams_sum)
print(color.BOLD + "Bigram_Rouge_score_of summary generated directly by gensim" + color.END)
print([recall,precision])
print("F1 is :")
#F1 is the harmonic mean of precision and recall
print(round(2*precision*recall/(precision + recall), 4))

Bigram_Rouge_score_of summary generated directly by gensim
[0.3546511627906977, 0.9384615384615385]
F1 is :
0.5148


In [115]:
#Join with a space so the last word of one sentence cannot fuse with the first word of the next
summary_wordembed_textrank = " ".join(summary_wordembed_textrank)

In [122]:
bigrams_sum = generate_ngrams(summary_wordembed_textrank, n=2)
bigrams_orig = generate_ngrams(text, n= 2)
bigrams_sum = set(bigrams_sum)
bigrams_orig = set(bigrams_orig)
 
matches = bigrams_sum.intersection(bigrams_orig)
#Recall
recall = len(matches)/len(bigrams_orig)
#Precision
precision = len(matches)/len(bigrams_sum)
print(color.BOLD + "Bigram_Rouge_score_of summary generated Wordembed Textrank" + color.END)
print([recall,precision])
print("F1 is :")
#F1 is the harmonic mean of precision and recall
print(round(2*precision*recall/(precision + recall), 4))

Bigram_Rouge_score_of summary generated Wordembed Textrank
[0.5697674418604651, 0.98989898989899]
F1 is :
0.7232


In [126]:
bigrams_sum = generate_ngrams(summary_wtd_freq, n=2)
bigrams_orig = generate_ngrams(text, n= 2)
bigrams_sum = set(bigrams_sum)
bigrams_orig = set(bigrams_orig)
 
matches = bigrams_sum.intersection(bigrams_orig)
#Recall
recall = len(matches)/len(bigrams_orig)
#Precision
precision = len(matches)/len(bigrams_sum)
print(color.BOLD + "Bigram_Rouge_score_of summary generated by Weighted Freq" + color.END)
print([recall,precision])
print("F1 is :")
#F1 is the harmonic mean of precision and recall
print(round(2*precision*recall/(precision + recall), 4))

Bigram_Rouge_score_of summary generated by Weighted Freq
[0.45930232558139533, 0.9634146341463414]
F1 is :
0.622


**The bigram score is a better choice than the unigram score, mainly because bigrams carry slightly more context; they therefore give a better sense of how much of the original text's context survives in the summary.**

In [98]:
#!pip install PyRouge