We perform extractive text summarization with the TextRank algorithm, a graph-based ranking model that requires no training. Sentences are represented with pre-trained GloVe word vectors (GloVe v1.2), and preprocessing and tokenization are done with the NLTK library.

Graph-based ranking algorithms allow knowledge about the text as a whole, and the relationships between its different parts, to inform each local ranking decision. They evaluate the importance of a vertex using global information computed recursively from the entire graph, rather than relying only on local information.
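
The recursive, whole-graph computation described above can be illustrated with a small power-iteration sketch. This is not the code used later in the notebook (which calls NetworkX); the function name `pagerank_power_iteration`, the damping factor of 0.85, and the toy star-shaped graph are assumptions made for illustration only.

```python
import numpy as np

def pagerank_power_iteration(weights, damping=0.85, tol=1e-8, max_iter=200):
    """Rank the vertices of a weighted graph by power iteration.

    weights[i, j] is the non-negative edge weight from vertex i to vertex j.
    """
    weights = np.asarray(weights, dtype=float)
    n = weights.shape[0]
    # Normalize each row so that outgoing weights sum to 1.
    # Vertices without any edges spread their score uniformly.
    row_sums = weights.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, weights / safe_sums, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # Each vertex's new score depends on the scores of its neighbours,
        # which in turn depend on their neighbours, and so on.
        new_scores = (1 - damping) / n + damping * (transition.T @ scores)
        if np.abs(new_scores - scores).sum() < tol:
            return new_scores
        scores = new_scores
    return scores

# Hypothetical toy graph: vertex 0 is connected to every other vertex.
toy_graph = np.array([
    [0.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
])
toy_scores = pagerank_power_iteration(toy_graph)
print(toy_scores)
```

In this toy graph the central vertex ends up with the highest score, because every other vertex passes its score to it.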

Traditional word vector techniques rely on the distance or angle between pairs of word vectors to judge the quality of a set of word representations.  GloVe attempts to uncover more of the language structure by examining not only the scalar distance between two vectors but also the various dimensions along which they differ.  It does this by examining the ratio of co-occurrence probabilities between word pairs rather than the probabilities alone.  The word vectors are fit with a weighted least squares regression whose weighting limits the influence of rare, noisy co-occurrences.  The result is a lower-dimensional representation of the co-occurrence matrix in which each vector represents a word.
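
As a rough illustration of the weighted least squares step, the sketch below evaluates a GloVe-style loss on a tiny co-occurrence matrix. The function names `glove_weight` and `glove_loss` and the toy matrix are hypothetical. The defaults `x_max=100` and `alpha=0.75` are the values reported in the original GloVe paper; the pre-trained vectors used in this notebook were not produced by this code.

```python
import numpy as np

def glove_weight(cooccurrence, x_max=100.0, alpha=0.75):
    """Weighting function: small for rare pairs, capped at 1 for frequent ones."""
    x = np.asarray(cooccurrence, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, b, b_ctx, X):
    """Weighted least squares loss over the non-zero co-occurrence counts."""
    i, j = np.nonzero(X)
    # The model predicts log X[i, j] from a dot product plus two bias terms.
    prediction = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    error = prediction - np.log(X[i, j])
    return float(np.sum(glove_weight(X[i, j]) * error ** 2))

# Hypothetical toy data: two words that co-occur twice with each other.
X = np.array([[0.0, 2.0], [2.0, 0.0]])
W, W_ctx = np.zeros((2, 3)), np.zeros((2, 3))  # untrained word and context vectors
b, b_ctx = np.zeros(2), np.zeros(2)            # untrained bias terms
toy_loss = glove_loss(W, W_ctx, b, b_ctx, X)
print(toy_loss)
```

Training would adjust `W`, `W_ctx`, `b`, and `b_ctx` to minimize this loss; that optimization step is omitted here.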

The ROUGE metric is commonly used to evaluate summarizer performance, but we did not have the resources at hand to run it.  Subjectively, the performance seems reasonably good.  Summaries of NY Times articles were significantly shorter than the originals (33% of the original article length was kept), were easy to read, and conveyed the most important ideas in the text.
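
For readers unfamiliar with ROUGE, the simplest variant (ROUGE-1 recall) measures how many of the words in a human-written reference summary also appear in the generated summary. The sketch below is a minimal version under simplifying assumptions: whitespace tokenization, no stemming, and a made-up pair of summaries. The function name `rouge1_recall` is hypothetical, and no such evaluation was run for this notebook.

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also occur in the candidate."""
    candidate_counts = Counter(candidate.lower().split())
    reference_counts = Counter(reference.lower().split())
    # Count each reference word at most as often as it appears in the candidate.
    overlap = sum(min(count, candidate_counts[token])
                  for token, count in reference_counts.items())
    total = sum(reference_counts.values())
    return overlap / total if total else 0.0

# Made-up example summaries, for illustration only.
reference_summary = "transfer learning reuses a pre-trained model"
candidate_summary = "transfer learning reuses a model on a new problem"
rouge1 = rouge1_recall(candidate_summary, reference_summary)
print(round(rouge1, 3))
```

A real evaluation would need reference summaries for each article and would normally rely on an established ROUGE implementation rather than this sketch.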

This summarizer does not recognize the grammatical structure of language, so it cannot extract information that depends on that structure, such as pronoun references, or use it when relating sentences.  Abstractive text summarization, possibly with GANs, would be an interesting future project.

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize
import re

In [2]:
# Read text from file as a string
with open("../Text-Summarizer/articles/article.txt", "r") as myfile:
    article = myfile.read()

# Tokenize sentences
sentences = sent_tokenize(article)
sentences[:5]

['Transfer Learning\nGo to the profile of Niklas Donges\nNiklas Donges\nApr 23, 2018\nTransfer Learning is the reuse of a pre-trained model on a new problem.',
 'It is currently very popular in the field of Deep Learning because it enables you to train Deep Neural Networks with comparatively little data.',
 'This is very useful since most real-world problems typically do not have millions of labeled data points to train such complex models.',
 'This blog post is intended to give you an overview of what Transfer Learning is, how it works, why you should use it and when you can use it.',
 'It will introduce you to the different approaches of Transfer Learning and provide you with some resources on already pre-trained models.']

In [3]:
print(len(sentences))

84


In [4]:
# Re-tokenize each entry and flatten the result into a single list of sentences
def flatten_list(sentences):
    return [sentence for s in sentences for sentence in sent_tokenize(s)]

new_sentences = flatten_list(sentences)

In [5]:
# Load the GloVe word vectors into a dict: word -> 100-dimensional vector
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

In [6]:
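# Remove punctuation, numbers, and special characters
# (regex=True is required in recent pandas versions, where the pattern is otherwise treated as a literal string)
clean_sentences = pd.Series(new_sentences).str.replace("[^a-zA-Z]", " ", regex=True)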

# Make lowercase
clean_sentences = [s.lower() for s in clean_sentences]

In [7]:
# Load stopwords, i.e. common words that should not affect the ranking.  The final summary prints the original sentences, so they reappear there
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jhsoo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
def remove_stopwords(sen):
    # sen is a list of tokens; keep the non-stopwords and rejoin them into a string
    return " ".join(i for i in sen if i not in stop_words)

In [9]:
# Remove stopwords
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

In [10]:
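# Represent each sentence as the average of its word vectors
sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        # Words missing from GloVe map to a zero vector; the 0.001 avoids dividing by zero
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        v = np.zeros((100,))
    sentence_vectors.append(v)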

In [11]:
for i in range(len(sentence_vectors)):
    sentence_vectors[i] = sentence_vectors[i].reshape(1, 100)

print(len(sentence_vectors))

84


In [12]:
sim_mat = np.zeros([len(sentences), len(sentences)])
print(sim_mat.shape)

(84, 84)


Cosine similarity measures how similar two vectors are by the cosine of the angle between them: a cosine of 1 means the vectors point in the same direction, and a cosine of 0 means they are orthogonal.
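
As a small worked example of the definition above, the sketch below computes the cosine directly with NumPy. The helper name `cosine` and the toy vectors are illustrative assumptions; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denominator = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denominator) if denominator else 0.0

same_direction = cosine([1.0, 0.0], [2.0, 0.0])  # parallel vectors, different lengths
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])      # vectors at a right angle
print(same_direction, orthogonal)
```

Because the cosine depends only on direction, the length of a sentence vector does not change its similarity to other sentences.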

In [13]:
# Compute the pairwise cosine similarity between sentence vectors (the diagonal stays 0)
from sklearn.metrics.pairwise import cosine_similarity
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i], sentence_vectors[j])[0,0]

In [14]:
# Matrix to TextRank
import networkx as nx

# Generate graph from matrix representation
# (from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array is its replacement)
nx_graph = nx.from_numpy_array(sim_mat)
# Apply TextRank/PageRank algorithm to graph
scores = nx.pagerank(nx_graph)

# Output below: 38% of most relevant content from original article

In [15]:
# Keep the 38% most relevant sentences from the original article,
# then restore the original article sentence order

# Score each sentence and sort by score, keeping the original sentence index
ranked_sentences = sorted(((i, scores[i], s) for i, s in enumerate(sentences)), reverse=True, key=lambda x: x[1])

# Keep only the top 38% of ranked sentences
ranked_sentences = ranked_sentences[:int(len(sentence_vectors)*.38)]

# Re-sort the selected sentences into their original order
ranked_sentences = sorted(ranked_sentences, key=lambda x: x[0])

# Print the selected sentences
for _, _, sentence in ranked_sentences:
    print(sentence, end="\n\n")

This is very useful since most real-world problems typically do not have millions of labeled data points to train such complex models.

This blog post is intended to give you an overview of what Transfer Learning is, how it works, why you should use it and when you can use it.

It will introduce you to the different approaches of Transfer Learning and provide you with some resources on already pre-trained models.

In Transfer Learning, the knowledge of an already trained Machine Learning model is applied to a different but related problem.

For example, if you trained a simple classifier to predict whether an image contains a backpack, you could use the knowledge that the model gained during its training to recognize other objects like sunglasses.

The general idea is to use knowledge, that a model has learned from a task where a lot of labeled training data is available, in a new task where we don’t have a lot of data.

Transfer Learning is mostly used in Computer Vision and Natural L