We perform extractive text summarization with the TextRank algorithm, which builds a similarity matrix over all pairs of sentences and then ranks the sentences by running PageRank on the resulting graph.
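
To make the ranking step concrete, here is a minimal pure-Python sketch of PageRank by power iteration on a tiny, made-up similarity matrix. The function name `pagerank_scores`, the toy matrix, and the tolerance are assumptions for illustration only; the notebook itself delegates this step to `nx.pagerank`, whose default damping factor is also 0.85.

```python
def pagerank_scores(sim, damping=0.85, tol=1e-8, max_iter=100):
    """Power-iteration PageRank on a non-negative similarity matrix (list of lists)."""
    n = len(sim)
    # Row-normalise so each row is a probability distribution over neighbours;
    # a row with no similarity mass falls back to a uniform distribution.
    trans = []
    for row in sim:
        total = sum(row)
        trans.append([w / total if total else 1.0 / n for w in row])

    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * trans[j][i] for j in range(n))
            for i in range(n)
        ]
        # Stop once the scores no longer change noticeably
        if sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol:
            return new_scores
        scores = new_scores
    return scores


# Hypothetical similarities: sentence 0 is close to both other sentences
toy_sim = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
toy_scores = pagerank_scores(toy_sim)
print(toy_scores)
```

The sentence that is most similar to the rest of the text ends up with the highest score, which is the property TextRank uses to pick summary sentences.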

In [1]:
import numpy as np
import pandas as pd
import nltk
import re

In [2]:
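# Read the article line by line, stripping trailing whitespace;
# the context manager closes the file when the block ends
with open("nytimes article.txt", "r") as article:
    sentences = [line.rstrip() for line in article]
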
sentences[:5]

['THIS STARTUP IS CHALLENGING GOOGLE MAPS—AND IT NEEDS YOU',
 '',
 "StreetCred's MapNYC program is an effort to find out what might motivate map enthusiasts, crypto-lovers, maybe even people who hadn’t the faintest about either, to feed it data.MUIRIS WOULFE/GETTY IMAGES",
 'A WHOLE LIFETIME in New York City, and Christiana Ting didn’t realize just how many urgent care facilities there were until the app told her to start looking for them. “They were giving extra points for medical offices, and I found them, I think, on every block,” she says. “I’m not sure what that says about the neighborhood where I work.”',
 '']

In [3]:
print(len(sentences))

34


In [4]:
sentences = [s for s in sentences if len(s) != 0]

In [5]:
print(len(sentences))

18


In [6]:
from nltk.tokenize import sent_tokenize

# Split each paragraph into sentences and collect them in one flat list
new_sentences = []
for s in sentences:
    new_sentences.extend(sent_tokenize(s))

In [7]:
new_sentences[:5]

['THIS STARTUP IS CHALLENGING GOOGLE MAPS—AND IT NEEDS YOU',
 "StreetCred's MapNYC program is an effort to find out what might motivate map enthusiasts, crypto-lovers, maybe even people who hadn’t the faintest about either, to feed it data.MUIRIS WOULFE/GETTY IMAGES",
 'A WHOLE LIFETIME in New York City, and Christiana Ting didn’t realize just how many urgent care facilities there were until the app told her to start looking for them.',
 '“They were giving extra points for medical offices, and I found them, I think, on every block,” she says.',
 '“I’m not sure what that says about the neighborhood where I work.”']

In [8]:
# Extract word vectors: each line of the GloVe file holds a word
# followed by the coefficients of its 100-dimensional vector
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

In [9]:
# remove punctuation, numbers and special characters
# (regex=True is required for pattern matching in recent pandas versions)
clean_sentences = pd.Series(new_sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# convert everything to lowercase
clean_sentences = [s.lower() for s in clean_sentences]
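
The same cleaning step can also be written with the standard-library `re` module, which the notebook imports in the first cell. The sketch below is only an illustration; the helper name `clean_sentence` is an assumption and not part of the notebook.

```python
import re


def clean_sentence(text):
    """Replace every character that is not an ASCII letter with a space, then lowercase."""
    return re.sub(r"[^a-zA-Z]", " ", text).lower()


cleaned = clean_sentence("StreetCred's MapNYC program")
print(cleaned)          # streetcred s mapnyc program
print(cleaned.split())  # ['streetcred', 's', 'mapnyc', 'program']
```

Note that apostrophes are replaced too, so a possessive such as "StreetCred's" leaves a stray single-letter token behind.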

In [10]:
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jhsoo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [12]:
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [13]:
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

In [15]:
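sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        # Average the word vectors of the sentence; words missing from GloVe
        # count as zero vectors, and the 0.001 keeps the denominator nonzero
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        # Nothing left after cleaning: fall back to a zero vector
        v = np.zeros((100,))
    sentence_vectors.append(v)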
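
To see what the averaging does, here is a small sketch with made-up 3-dimensional embeddings in place of the 100-dimensional GloVe vectors. The helper `sentence_vector` and the dictionary `toy_embeddings` are hypothetical names introduced only for this example.

```python
import numpy as np

DIM = 3  # toy dimension; the notebook uses 100-dimensional GloVe vectors

# Hypothetical embeddings, invented purely for illustration
toy_embeddings = {
    "maps": np.array([1.0, 0.0, 0.0]),
    "startup": np.array([0.0, 1.0, 0.0]),
}


def sentence_vector(sentence, embeddings, dim):
    """Average the word vectors of a cleaned sentence, mirroring the cell above."""
    words = sentence.split()
    if not words:
        return np.zeros(dim)
    # Out-of-vocabulary words contribute a zero vector but still count
    # in the denominator, which pulls the average towards zero
    vectors = [embeddings.get(w, np.zeros(dim)) for w in words]
    return np.sum(vectors, axis=0) / (len(words) + 0.001)


vec = sentence_vector("startup maps unknownword", toy_embeddings, DIM)
empty = sentence_vector("", toy_embeddings, DIM)
print(vec, empty)
```

Because unknown words are averaged in as zeros, sentences with many out-of-vocabulary words get shorter vectors; cosine similarity ignores vector length, so this affects the later ranking less than it might seem.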

In [16]:
for i in range(len(sentence_vectors)):
    sentence_vectors[i] = sentence_vectors[i].reshape(1, 100)

In [17]:
print(len(sentence_vectors))

56


In [18]:
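# Note: the matrix is sized by `sentences` (18 paragraphs), while `sentence_vectors`
# holds one vector per tokenized sentence in `new_sentences` (56)
sim_mat = np.zeros([len(sentences), len(sentences)])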

In [19]:
print(sim_mat.shape)

(18, 18)


In [20]:
from sklearn.metrics.pairwise import cosine_similarity


In [21]:
# Fill the off-diagonal entries with pairwise cosine similarities
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i, j] = cosine_similarity(sentence_vectors[i], sentence_vectors[j])[0, 0]
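
As an alternative to the nested loop, the whole matrix can be computed in one step with NumPy. This is a sketch under the assumption that every vector has the same length; the function name `similarity_matrix` is hypothetical. Sizing the matrix from the vectors themselves also guarantees that row `i` always refers to vector `i`.

```python
import numpy as np


def similarity_matrix(vectors):
    """Pairwise cosine similarity with a zero diagonal, sized by the vectors themselves."""
    # Stack the vectors into one (n, dim) array; accepts (dim,) or (1, dim) inputs
    mat = np.vstack([np.asarray(v, dtype=float).reshape(1, -1) for v in vectors])
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows stay zero instead of dividing by zero
    unit = mat / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # a sentence is not compared with itself
    return sim


demo_vectors = [np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 0.0])]
demo_sim = similarity_matrix(demo_vectors)
print(demo_sim.shape)
print(demo_sim)
```

For a few dozen sentences the loop is fast enough; the vectorized form mainly pays off on longer documents.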

In [22]:
import networkx as nx

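# Build a weighted graph from the similarity matrix and score its nodes with PageRank
# (from_numpy_array replaces from_numpy_matrix, which is no longer available in NetworkX 3.x)
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)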

In [23]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)
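
When selecting the summary, the scores and the texts must come from the same list, and it often reads better to print the selected items in their original order. The sketch below shows one way to do this; `top_k_in_order` and the toy data are assumptions for illustration, not part of the notebook. It accepts either a list of scores or a dictionary keyed by position, which is the shape `nx.pagerank` returns for a graph built from a matrix.

```python
def top_k_in_order(items, scores, k):
    """Return the k highest-scoring items, kept in their original document order."""
    if len(items) != len(scores):
        raise ValueError("items and scores must have the same length")
    # Rank positions by score, keep the best k, then restore document order
    best = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)[:k]
    return [items[i] for i in sorted(best)]


# Hypothetical items and scores
toy_items = ["a", "b", "c", "d", "e", "f"]
toy_scores = {0: 0.10, 1: 0.30, 2: 0.05, 3: 0.25, 4: 0.20, 5: 0.10}

# Keep roughly one third of the items
summary = top_k_in_order(toy_items, toy_scores, k=len(toy_items) // 3)
print(summary)  # ['b', 'd']
```

The length check makes a mismatch between the scored list and the printed list fail loudly instead of silently pairing scores with the wrong text.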


In [24]:
# Print the top-ranked entries; the count is one third of len(sentence_vectors)
for i in range(int(len(sentence_vectors)*(1/3))):
    print(ranked_sentences[i][1], end="\n\n")

But open-source cartography isn’t always comprehensive or particular enough for the users Meech is targeting. If a company making a VR game for kids needs to know the location of every playground in Cincinnati, there’s no guarantee the volunteers will plug that in. So StreetCred might offer a future mapping army an extra crypto incentive to find, validate, and label those locations.

StreetCred sees that as an opportunity. “There’s a lot of companies, none of whom I can name, who have location data, and that data needs improvement,” says Randy Meech, CEO of the small startup. (Meech’s last open-source mapping company, a Samsung subsidiary called Mapzen, shut down in January.) Maybe a client found a data set online or purchased one from another company. Either way, it’s static, and that means it’s only a matter of time before it fails to represent reality.

And more accurate, too, with serious assists from cryptocurrency-seeking mappers. Other companies rely on OpenStreetMap, a crowdsou