<a href="https://colab.research.google.com/github/mowillia/phantom_pen/blob/master/text_summarizer_3_word_embed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Text Summarizer (#3): Word Embeddings
**(June 17, 2019)**

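An extractive text summarizer following the TextRank approach described in https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/. Sentences are embedded by averaging GloVe word vectors, compared by cosine similarity, and ranked with PageRank.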

In [0]:
import numpy as np
import pandas as pd
import re
import textwrap

from sklearn.metrics.pairwise import cosine_similarity

import nltk
import nltk.data
from nltk.tokenize import sent_tokenize  # Natural Language Toolkit: $ pip install nltk
nltk.download('punkt')  # sentence tokenizer models used by sent_tokenize

from nltk.corpus import stopwords
import networkx as nx

[nltk_data] Downloading package punkt to /Users/Williams/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [0]:
## Function that returns the non-empty lines (paragraphs) of a text file
def text_to_para(filename):

    # read the file and make sure it is closed afterwards
    with open(filename) as f:
        para_list = f.read().splitlines()

    # drop empty lines
    return [value for value in para_list if value != '']

## Function that returns the sentences in a paragraph
def sents(para):

    return sent_tokenize(para)

## Function that takes in a file and returns a list of all its sentences
def raw_sents(filename):

    sent = []

    paragraphs = text_to_para(filename)

    for paragraph in paragraphs:
        sent += sents(paragraph)

    return sent

In [0]:
filename = '/content/sample_essay.txt'
sentences = raw_sents(filename)

In [0]:
# Extract word vectors: each line of the GloVe file is a word followed by its coefficients
word_embeddings = {}
with open('/content/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

In [0]:
len(word_embeddings)

400000

In [0]:
# remove punctuation, numbers and special characters
# (regex=True is needed: pandas 2.0+ treats the pattern as a literal string by default)
clean_sentences = pd.Series(sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# lowercase all letters
clean_sentences = [s.lower() for s in clean_sentences]

In [0]:
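# English stopwords (stored as a set for fast membership tests)
stop_words = set(stopwords.words('english'))

# function to remove stopwords from a list of words and rejoin them into a string
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new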

# remove stopwords from the sentences
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

In [0]:
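# create sentence vectors by averaging the word embeddings of each sentence;
# words missing from GloVe contribute a zero vector, and the 0.001 in the
# denominator guards against division by zero
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)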
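
The averaging step can be checked on a toy example. This is a minimal sketch, assuming made-up 3-dimensional embeddings in place of the 100-dimensional GloVe vectors; `toy_embeddings` and `sentence_vector` are hypothetical names introduced only for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only
# (the notebook itself uses 100-dimensional GloVe vectors).
toy_embeddings = {
    "cats": np.array([1.0, 0.0, 0.0]),
    "purr": np.array([0.0, 1.0, 0.0]),
}

def sentence_vector(sentence, embeddings, dim):
    """Average the word vectors of a cleaned sentence; unknown words count as zeros."""
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    total = sum(embeddings.get(w, np.zeros(dim)) for w in words)
    # Same 0.001 smoothing term as in the notebook cell.
    return total / (len(words) + 0.001)

vec = sentence_vector("cats purr loudly", toy_embeddings, dim=3)
empty = sentence_vector("", toy_embeddings, dim=3)
print(vec)    # "loudly" is not in toy_embeddings, so it only adds to the word count
print(empty)  # an empty sentence maps to the zero vector
```

Note that out-of-vocabulary words still count toward the denominator, so a sentence with many unknown words ends up with a shorter vector. Cosine similarity ignores vector length, so this does not affect the ranking step directly.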

In [0]:
# similarity matrix
sim_mat = np.zeros([len(sentences), len(sentences)])

In [0]:
# compute similarity matrix between sentences with cosine similarity
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1,100), sentence_vectors[j].reshape(1,100))[0,0]
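
The double loop calls the similarity function once per sentence pair, which gets slow for long texts. As a possible alternative, the sketch below computes the whole matrix in one step with plain NumPy; `cosine_similarity_matrix` and the toy vectors are hypothetical and not part of the original notebook.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between rows, with the diagonal set to zero."""
    X = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows get similarity 0 instead of a division by zero
    unit = X / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # mirrors the `if i != j` check in the loop
    return sim

# Toy input: two non-zero vectors and one all-zero vector.
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
toy_sim = cosine_similarity_matrix(toy_vectors)
print(toy_sim)
```

Under these assumptions, `sim_mat = cosine_similarity_matrix(sentence_vectors)` would stand in for the loop.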

In [0]:
# apply PageRank to the similarity graph to score each sentence
# (from_numpy_matrix was removed in networkx 3.0; from_numpy_array replaces it)
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)
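
To give some intuition for what the PageRank scores mean, here is a simplified power-iteration sketch on a toy similarity matrix. It is not networkx's implementation (which also handles convergence tolerances and other details); the function name and the toy weights are assumptions made for illustration, and the damping factor of 0.85 matches the default `alpha` of `nx.pagerank`.

```python
import numpy as np

def pagerank_power_iteration(weights, damping=0.85, n_iter=100):
    """Simplified weighted PageRank via a fixed number of power-iteration steps."""
    W = np.asarray(weights, dtype=float)
    n = W.shape[0]
    row_sums = W.sum(axis=1, keepdims=True)
    # Normalize each row into transition probabilities;
    # rows with no outgoing weight spread their score uniformly.
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, W / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        scores = (1 - damping) / n + damping * (transition.T @ scores)
    return scores

# Toy similarity matrix: sentence 0 is strongly similar to both other sentences.
toy_weights = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
toy_scores = pagerank_power_iteration(toy_weights)
print(toy_scores)
```

The sentence that is most similar to the rest of the text collects the highest score, which is why the top-ranked sentences serve as the summary.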

In [0]:
# sort sentences by PageRank score, highest first
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

In [0]:
ranked_sentences;

In [0]:
# Print the top 5 sentences as the summary, wrapped at 50 characters
for i in range(5):
    print(textwrap.fill(ranked_sentences[i][1], 50))

Adept writers can do this with such softness of touch that at the end of their work you’re led to believe that what you have learned is absolutely true and would be true regardless of what you have just read.
And while I have many times been on the receiving end of being limited by such a perspective on life, it seems that I, like most people cannot help adopting it when it is convenient.
Because it was not exactly that Rand was attempting to manipulate people’s insecurities and anger — although that was precisely what she was doing — but that the language in which she was writing was so polemical and confident — so, one might say, autocratic — that it seemed specifically geared to give stability to those who most wanted some foundational principles by which to live a life.
For Americans, she painted the world as black and white, and thereby gave voice to the simple stories people wanted to believe, the stories they perhaps had always somehow believed but had never been able to articul