We perform extractive text summarization with the TextRank algorithm, a graph-based ranking model that requires no training. Sentences are represented with pre-trained GloVe word vectors (GloVe v1.2), and natural language preprocessing and tokenization are done with the NLTK library.

Graph-based ranking algorithms allow knowledge about the text as a whole, and about the relationships between its parts, to inform specific local ranking decisions. They do so by evaluating the importance of a vertex using global information computed recursively from the entire graph, rather than relying only on local information.
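
The recursive computation described above can be illustrated with a small power-iteration sketch of a PageRank-style score. This is not the notebook's method (the notebook calls `nx.pagerank` further down); the function name `pagerank_power_iteration`, the damping value of 0.85 and the four-vertex toy graph are assumptions made purely for illustration.

```python
import numpy as np

def pagerank_power_iteration(weights, damping=0.85, tol=1e-8, max_iter=100):
    """Score the vertices of a weighted graph by repeated score propagation."""
    weights = np.asarray(weights, dtype=float)
    n = weights.shape[0]
    # Normalise each row so that a vertex splits its vote across its neighbours.
    # Rows with no outgoing weight fall back to a uniform distribution.
    row_sums = weights.sum(axis=1, keepdims=True)
    transition = np.divide(
        weights, row_sums,
        out=np.full_like(weights, 1.0 / n),
        where=row_sums > 0,
    )
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # Each vertex's new score depends on the current scores of its neighbours,
        # so information from the whole graph spreads a little further each pass.
        updated = (1 - damping) / n + damping * transition.T @ scores
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores

# Hypothetical toy graph: vertex 2 is linked to every other vertex.
toy_weights = np.array([
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
toy_scores = pagerank_power_iteration(toy_weights)
```

In this toy graph the best-connected vertex (index 2) ends up with the highest score and the vertex with a single neighbour (index 3) with the lowest.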

Traditional word vector techniques rely on the distance or angle between pairs of word vectors to judge the quality of a set of word representations. GloVe attempts to uncover more of the structure of the language by examining not only the scalar distance between two vectors but also the individual dimensions along which they differ. It does this by examining ratios of co-occurrence probabilities between word pairings rather than just the probabilities themselves. A weighted least squares regression is then fitted, with the weighting reducing the influence of noisy co-occurrences. The result is, in effect, a dimensionality reduction of the co-occurrence matrix: a lower-dimensional matrix in which each vector represents a word.
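
For reference, the weighted least squares objective can be written as below, following our understanding of the published GloVe formulation. The symbols are not defined elsewhere in this notebook: X_ij is the number of times word j occurs in the context of word i, V is the vocabulary size, w and w-tilde are the word and context vectors, b and b-tilde are bias terms, and f is a weighting function with cutoff x_max and exponent alpha.

```latex
% P_{ik} = X_{ik} / X_i is the probability that word k appears in the context of word i.
% The model is motivated by ratios P_{ik} / P_{jk} taken over probe words k.
J = \sum_{i,j=1}^{V} f(X_{ij})
    \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```

The weighting function f keeps very frequent pairs from dominating the fit and gives no weight to pairs that never co-occur, since f(0) = 0.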

In [1]:
import numpy as np
import pandas as pd
import nltk
import re

In [5]:
from nltk.tokenize import sent_tokenize
#article = open("../Text-Summarizer/articles/research paper.txt", "r")

#sentences = []

with open("../Text-Summarizer/articles/wired article.txt", "r", encoding="utf-8") as myfile:
    article=myfile.read()

sentences = sent_tokenize(article)

In [6]:
def flatten_list(sentences):
    new_sentences = []
    for s in sentences:
        new_sentences.append(sent_tokenize(s))

    new_sentences = [y for x in new_sentences for y in x] # flatten list
    return new_sentences

new_sentences = flatten_list(sentences)

In [7]:
# Extract word vectors
word_embeddings = {}
f = open('glove.6B.100d.txt', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
f.close()

In [8]:
# remove punctuation, numbers and special characters
# (regex=True is needed: pandas >= 2.0 treats the pattern as a literal string by default)
clean_sentences = pd.Series(new_sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# convert to lowercase
clean_sentences = [s.lower() for s in clean_sentences]

In [28]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [11]:
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [12]:
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

In [14]:
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)

In [15]:
for i in range(len(sentence_vectors)):
    sentence_vectors[i] = sentence_vectors[i].reshape(1, 100)
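
The sentence vectors built above are averages of the word vectors in each cleaned sentence, with unknown words contributing zeros. A minimal sketch of the same idea on toy data follows; the helper name `mean_sentence_vector` and the 3-dimensional `toy_embeddings` are hypothetical, and the sketch divides by the exact word count instead of adding the 0.001 smoothing term used in the notebook.

```python
import numpy as np

def mean_sentence_vector(sentence, embeddings, dim):
    """Average the vectors of the words in `sentence`; unknown words count as zeros."""
    words = sentence.split()
    if not words:
        # An empty sentence maps to the zero vector.
        return np.zeros(dim)
    total = sum(embeddings.get(w, np.zeros(dim)) for w in words)
    return total / len(words)

# Hypothetical 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "map": np.array([1.0, 0.0, 0.0]),
    "city": np.array([0.0, 1.0, 0.0]),
}
vec = mean_sentence_vector("map city unknownword", toy_embeddings, dim=3)
empty_vec = mean_sentence_vector("", toy_embeddings, dim=3)
```

Because out-of-vocabulary words still count towards the divisor, a sentence with many unknown words is pulled towards the zero vector.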

Cosine similarity measures how closely two vectors point in the same direction: a cosine of 1 means the vectors have identical direction, and a cosine of 0 means they are orthogonal.
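
As a small illustration of the definition, the sketch below computes the cosine directly with NumPy. The function name `cosine` is an assumption for this example (the notebook itself uses scikit-learn's `cosine_similarity`), and returning 0 when either vector is all zeros is a convention chosen here.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        # Convention for this sketch: a zero vector is similar to nothing.
        return 0.0
    return float(u @ v / denom)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
```

Note that the measure ignores vector length: `[1, 2]` and `[2, 4]` count as identical in direction.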

In [21]:
from sklearn.metrics.pairwise import cosine_similarity

def create_sim_matrix(n, sentence_vectors):
    # n x n matrix of pairwise cosine similarities; the diagonal is left at 0
    sim_mat = np.zeros([n, n])
    for i in range(n):
        for j in range(n):
            if i != j:
                sim_mat[i][j] = cosine_similarity(sentence_vectors[i], sentence_vectors[j])[0,0]
    return sim_mat

sim_mat = create_sim_matrix(len(sentences), sentence_vectors)

In [23]:
import networkx as nx

# from_numpy_matrix was removed in networkx 3.0; from_numpy_array is its replacement
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)

In [26]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)

In [27]:
# print the top-ranked third of the sentences from the original article
for i in range(len(ranked_sentences) // 3):
    print(ranked_sentences[i][1], end="\n\n")

THIS STARTUP IS CHALLENGING GOOGLE MAPS—AND IT NEEDS YOU

StreetCred's MapNYC program is an effort to find out what might motivate map enthusiasts, crypto-lovers, maybe even people who hadn’t the faintest about either, to feed it data.MUIRIS WOULFE/GETTY IMAGES
A WHOLE LIFETIME in New York City, and Christiana Ting didn’t realize just how many urgent care facilities there were until the app told her to start looking for them.

And it’s still reviewing lessons learned in New York: The company says the MapNYC project generated over 20,000 places in four weeks, some validated by three users.

The company says it will contribute that info back to OpenStreetMaps under a licensing agreement designed to make it easier for people and companies to share and collaborate when working with data.

It requires a lot of eyes on the ground and people actually observing changes in the field.”

Which is why Meech says he came up with MapNYC: an effort to find out what might motivate map enthusiasts, cry