We perform extractive text summarization using the TextRank algorithm, a graph-based ranking model that requires no training. Sentences are represented with GloVe v1.2 pre-trained word vectors, and natural language preprocessing and tokenization are done with the NLTK library.

Graph-based ranking algorithms allow knowledge about the text as a whole, and about the relationships between its parts, to inform each local ranking decision. They do so by evaluating the importance of a vertex using global information computed recursively from the entire graph, rather than relying only on local information.
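
To make the recursive scoring concrete, here is a minimal sketch of a PageRank-style power iteration in plain Python. It is not the NetworkX implementation used later in this notebook. The function name `pagerank`, the damping factor of 0.85, and the toy edge weights are all assumptions chosen for illustration.

```python
def pagerank(weights, damping=0.85, iterations=100, tol=1e-8):
    """Rank the vertices of a weighted graph by power iteration.

    weights[i][j] is the weight of the edge from vertex i to vertex j.
    """
    n = len(weights)
    out_totals = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Each neighbour i passes on a share of its own score,
            # proportional to the weight of the edge i -> j.
            # Vertices with no outgoing weight are skipped in this sketch.
            incoming = sum(
                scores[i] * weights[i][j] / out_totals[i]
                for i in range(n)
                if out_totals[i] > 0
            )
            new_scores.append((1.0 - damping) / n + damping * incoming)
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


# Toy similarity graph (hypothetical weights): vertex 0 is strongly
# connected to both other vertices, which are only weakly connected
# to each other.
toy_weights = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
toy_scores = pagerank(toy_weights)
```

Vertex 0 ends up with the highest score because its score is fed by both of its strongly connected neighbours, whose own scores in turn depend on vertex 0. This is the sense in which the ranking uses information from the whole graph.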

Traditional word vector techniques rely on the distance or angle between pairs of word vectors, a single scalar, to judge the quality of a set of word representations.  GloVe attempts to uncover more of the structure of language by examining not only this scalar difference but the various dimensions along which two words differ.  It does this by examining the ratio of co-occurrence probabilities between word pairings rather than the probabilities alone.  The vectors are fit with a weighted least squares regression, in which the weighting reduces the influence of noisy co-occurrences.  This amounts to a dimensionality reduction of the co-occurrence matrix, yielding a lower-dimensional matrix in which each vector represents a word.
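
The following sketch shows why the ratio of co-occurrence probabilities is more informative than the raw probabilities. It follows the ice/steam illustration often used to explain GloVe, but the counts are made up for this example, and the helper names `context_probability` and `probability_ratio` are hypothetical.

```python
# Made-up co-occurrence counts, for illustration only: how often each
# context word appears near the target words "ice" and "steam".
cooccurrence = {
    "ice":   {"solid": 80, "gas": 5, "water": 300, "fashion": 2},
    "steam": {"solid": 4, "gas": 60, "water": 280, "fashion": 2},
}


def context_probability(target, context):
    """P(context | target): share of the target's co-occurrences involving context."""
    counts = cooccurrence[target]
    return counts[context] / sum(counts.values())


def probability_ratio(context, target_a="ice", target_b="steam"):
    """Ratio P(context | target_a) / P(context | target_b)."""
    return context_probability(target_a, context) / context_probability(target_b, context)


for context in ("solid", "gas", "water", "fashion"):
    print(f"{context:8s} ratio = {probability_ratio(context):.2f}")
```

With these toy counts, a context word related to only one target gives a ratio far from 1 (large for "solid", small for "gas"), while a word related to both targets ("water") or to neither ("fashion") gives a ratio close to 1. The raw probabilities alone would not separate these cases as cleanly.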

The ROUGE metric is commonly used to evaluate summarizer performance, but we did not have the resources to run it.  Subjectively, performance seems reasonably good.  When summarizing NY Times articles, the summaries were significantly shorter (the top-ranked 33% of the article's sentences were kept), were easy to read, and conveyed the most important ideas in the text.
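
For reference, the core idea behind ROUGE-1 (unigram overlap with a human-written reference summary) can be sketched as follows. This is a simplified illustration, not the official ROUGE toolkit: it assumes whitespace tokenization with no stemming, the function name `rouge_1` is hypothetical, and the two example sentences are invented. It was not used to evaluate this summarizer.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (recall, precision, f1) based on clipped unigram overlap."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand_counts & ref_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return recall, precision, f1


# Hypothetical sentences, used only to exercise the function.
reference_summary = "the defense secretary resigned on thursday"
candidate_summary = "the secretary resigned thursday in protest"
recall, precision, f1 = rouge_1(candidate_summary, reference_summary)
```

A real evaluation would need reference summaries written by people for each article, which is the resource we lacked.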

This summarizer does not recognize the grammatical structure of language, so it cannot extract semantic information that depends on that structure (e.g., pronoun references) or use it when relating sentences.

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize
import re

In [2]:
with open("../Text-Summarizer/articles/nytimes article.txt", "r") as myfile:
    article=myfile.read()

sentences = sent_tokenize(article)
sentences[:5]

['WASHINGTON — Defense Secretary Jim Mattis, whose experience and stability were widely seen as a balance to an unpredictable president, resigned on Thursday in protest of President Trump’s decision to withdraw 2,000 American troops from Syria.',
 'Mr. Trump announced Mr. Mattis’s resignation in two tweets Thursday evening, and said the retired four-star Marine general turned Pentagon chief will leave at the end of February.',
 'Officials said Mr. Mattis went to the White House on Thursday afternoon in a last attempt to convince Mr. Trump to keep American troops in Syria, where they have been fighting the Islamic State.',
 'He was rebuffed, and told the president that he was resigning as a result.',
 'Hours later, the Pentagon released Mr. Mattis’s resignation letter, in which he implicitly criticized his commander in chief.']

In [3]:
print(len(sentences))

24


In [4]:
def flatten_list(sentences):
    """Re-tokenize each item into sentences and flatten the result into one list."""
    return [s for item in sentences for s in sent_tokenize(item)]

new_sentences = flatten_list(sentences)

In [5]:
# Extract word vectors: map each word to its 100-dimensional GloVe vector
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

In [6]:
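# remove punctuation, numbers and special characters
# (regex=True is required: recent pandas versions treat the pattern as a literal string by default)
clean_sentences = pd.Series(new_sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# convert all letters to lowercase
clean_sentences = [s.lower() for s in clean_sentences]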

In [7]:
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jhsoo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [9]:
def remove_stopwords(sen):
    """Join the tokens in `sen` into a string, dropping English stopwords."""
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [10]:
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

In [12]:
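# Represent each sentence as the average of its word vectors.
# Words missing from GloVe contribute a zero vector; the 0.001 in the
# denominator is a small smoothing term.
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)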

In [13]:
for i in range(len(sentence_vectors)):
    sentence_vectors[i] = sentence_vectors[i].reshape(1, 100)

In [14]:
print(len(sentence_vectors))

24


In [15]:
sim_mat = np.zeros([len(sentences), len(sentences)])

In [16]:
print(sim_mat.shape)

(24, 24)


In [17]:
from sklearn.metrics.pairwise import cosine_similarity


Cosine similarity measures the similarity between two vectors as the cosine of the angle between them: a cosine of 1 means the vectors point in the same direction, and a cosine of 0 means they are orthogonal.
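
As a small illustration of these values, the sketch below computes the cosine in plain Python. It is not the scikit-learn function used in this notebook. The helper name `cosine` and the choice to return 0 for zero vectors are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch (an assumption)
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # opposite directions
```

Because the measure depends only on direction, two sentence vectors of different magnitude can still be treated as highly similar.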

In [18]:
# Fill the similarity matrix with pairwise cosine similarities (diagonal stays 0)
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i], sentence_vectors[j])[0,0]

In [19]:
import networkx as nx

# from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array is its replacement
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)

In [20]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)


# Output below: 33% of most relevant content from original article

In [21]:
# Print the top-ranked third of the sentences from the original article
for i in range(len(sentence_vectors) // 3):
    print(ranked_sentences[i][1], end="\n\n")

WASHINGTON — Defense Secretary Jim Mattis, whose experience and stability were widely seen as a balance to an unpredictable president, resigned on Thursday in protest of President Trump’s decision to withdraw 2,000 American troops from Syria.

Officials said Mr. Mattis went to the White House on Thursday afternoon in a last attempt to convince Mr. Trump to keep American troops in Syria, where they have been fighting the Islamic State.

He called Mr. Mattis “an island of stability amidst the chaos of the Trump administration.”

“As we’ve seen with the President’s haphazard approach to Syria, our national defense is too important to be subjected to the President’s erratic whims,” Mr. Warner wrote in the Twitter post.

Mr. Mattis had told close friends that he would continue in the job despite his deteriorating relationship with Mr. Trump, because he viewed his commitment to protecting the Defense Department and its 1.3 million active duty service members as paramount.

Senator Marco Rubi