We perform extractive text summarization with TextRank, a graph-based ranking algorithm that requires no training. Sentences are represented with pre-trained GloVe v1.2 word vectors, and the NLTK library handles natural language preprocessing and tokenization.

Graph-based ranking algorithms allow knowledge about the text as a whole, and about the relationships between its parts, to inform specific local ranking decisions. They do so by evaluating the importance of a vertex from global information computed recursively over the entire graph, rather than from local, vertex-specific information alone.
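
To make the recursive computation concrete, here is a minimal sketch of a PageRank-style power iteration on a small weighted graph. It is not the networkx implementation used later in this notebook; the function name `pagerank`, the damping factor of 0.85, and the toy weight matrix are assumptions chosen for illustration, and the sketch assumes every vertex has at least one weighted edge.

```python
def pagerank(weights, damping=0.85, tol=1e-8, max_iter=100):
    """Power-iteration PageRank on a non-negative weight matrix (list of lists)."""
    n = len(weights)
    # Total outgoing weight of each vertex, used to normalise its votes.
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            # Vertex i collects a weighted share of every neighbour's current score.
            incoming = sum(
                weights[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores


# Hypothetical 4-sentence similarity graph: sentence 0 is similar to all others.
toy_weights = [
    [0.0, 0.9, 0.8, 0.7],
    [0.9, 0.0, 0.1, 0.1],
    [0.8, 0.1, 0.0, 0.1],
    [0.7, 0.1, 0.1, 0.0],
]
toy_scores = pagerank(toy_weights)
best = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print(best, [round(s, 3) for s in toy_scores])
```

Each vertex's score depends on the scores of its neighbours, which in turn depend on their neighbours, so repeated passes spread information from the whole graph into every local score.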

Traditional word vector techniques rely on the distance or angle between pairs of word vectors to judge the quality of a set of word representations. GloVe attempts to uncover more of the language structure by examining not only the scalar distance between two words but the various dimensions along which they differ. It does this by examining the ratios of co-occurrence probabilities between word pairings rather than just the probabilities themselves. A weighted least squares regression is then fitted, with a weighting that reduces the influence of noisy co-occurrence counts. In effect, dimensionality reduction is applied to the co-occurrence matrix to yield a lower-dimensional matrix in which each vector represents a word.
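
The idea of comparing ratios of co-occurrence probabilities can be sketched with a few lines of Python. The counts below are invented purely for illustration (they are not taken from any corpus), and the helper name `cooccurrence_probabilities` is hypothetical.

```python
def cooccurrence_probabilities(context_counts):
    """Turn raw co-occurrence counts for one word into P(context | word)."""
    total = sum(context_counts.values())
    return {ctx: count / total for ctx, count in context_counts.items()}


# Made-up co-occurrence counts for two target words and four context words.
toy_counts = {
    "ice":   {"solid": 80, "gas": 4,  "water": 300, "fashion": 16},
    "steam": {"solid": 4,  "gas": 80, "water": 300, "fashion": 16},
}

p_ice = cooccurrence_probabilities(toy_counts["ice"])
p_steam = cooccurrence_probabilities(toy_counts["steam"])

# The ratio P(ctx | ice) / P(ctx | steam) separates discriminative context words
# (ratio far from 1) from uninformative ones (ratio close to 1).
ratios = {ctx: p_ice[ctx] / p_steam[ctx] for ctx in p_ice}
for ctx, ratio in ratios.items():
    print(f"{ctx:8s} {ratio:.2f}")
```

In this toy setting, a context word related to only one of the two targets gives a ratio far from 1, while a word related to both or to neither gives a ratio near 1, which is the signal the raw probabilities alone do not show as clearly.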

In [1]:
import numpy as np
import pandas as pd
import nltk
import re

In [2]:
from nltk.tokenize import sent_tokenize
with open("../Text-Summarizer/articles/Why We Sleep, Chp 2.txt", "r") as myfile:
    article = myfile.read()

# split the raw text into sentences
sentences = sent_tokenize(article)
sentences[:5]

['CHAPTER 2\n\nCaffeine, Jet Lag, and Melatonin\nLosing and Gaining Control of Your Sleep Rhythm\nHow does your body know when it’s time to sleep?',
 'Why do you suffer from jet lag after arriving in a new time zone?',
 'How do you overcome jet lag?',
 'Why does that acclimatization cause you yet more jet lag upon returning home?',
 'Why do some people use melatonin to combat these issues?']

In [3]:
print(len(sentences))

390


In [4]:
# run the tokenizer once more over each sentence to split any that were missed
new_sentences = []
for s in sentences:
    new_sentences.append(sent_tokenize(s))

new_sentences = [y for x in new_sentences for y in x]  # flatten list

In [5]:
new_sentences[:5]

['CHAPTER 2\n\nCaffeine, Jet Lag, and Melatonin\nLosing and Gaining Control of Your Sleep Rhythm\nHow does your body know when it’s time to sleep?',
 'Why do you suffer from jet lag after arriving in a new time zone?',
 'How do you overcome jet lag?',
 'Why does that acclimatization cause you yet more jet lag upon returning home?',
 'Why do some people use melatonin to combat these issues?']

In [6]:
# Extract word vectors: each line holds a word followed by its 100 coefficients
word_embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        word_embeddings[word] = coefs

In [7]:
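# remove punctuation, numbers and special characters
# (regex=True is required: newer pandas versions treat the pattern as a literal string by default)
clean_sentences = pd.Series(new_sentences).str.replace("[^a-zA-Z]", " ", regex=True)

# convert all letters to lowercase
clean_sentences = [s.lower() for s in clean_sentences]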

In [8]:
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jhsoo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [10]:
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new

In [11]:
clean_sentences = [remove_stopwords(r.split()) for r in clean_sentences]

In [13]:
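sentence_vectors = []
for i in clean_sentences:
    if len(i) != 0:
        # average the GloVe vectors of the words in the sentence; words missing from
        # the vocabulary map to zeros, and the 0.001 keeps the denominator non-zero
        v = sum([word_embeddings.get(w, np.zeros((100,))) for w in i.split()])/(len(i.split())+0.001)
    else:
        # nothing left after cleaning: fall back to the zero vector
        v = np.zeros((100,))
    sentence_vectors.append(v)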

In [14]:
for i in range(len(sentence_vectors)):
    sentence_vectors[i] = sentence_vectors[i].reshape(1, 100)

In [15]:
print(len(sentence_vectors))

390


In [16]:
sim_mat = np.zeros([len(sentences), len(sentences)])

In [17]:
print(sim_mat.shape)

(390, 390)


In [18]:
from sklearn.metrics.pairwise import cosine_similarity


Cosine similarity measures how closely two vectors point in the same direction: a cosine of 1 means the vectors have identical direction, and a cosine of 0 means they are orthogonal.
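
As a small illustration of the definition, the sketch below computes the cosine of the angle between two vectors in plain Python. It is independent of the scikit-learn function used in this notebook; the function name `cosine` and the choice to return 0.0 for a zero vector are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
opposite = cosine([1.0, 0.0], [-1.0, 0.0])        # opposite directions
print(same_direction, orthogonal, opposite)
```

Because the vectors are normalised by their lengths, only direction matters: scaling a sentence vector does not change its similarity to other sentences.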

In [19]:
for i in range(len(sentences)):
  for j in range(len(sentences)):
    if i != j:
      sim_mat[i][j] = cosine_similarity(sentence_vectors[i], sentence_vectors[j])[0,0]

In [20]:
import networkx as nx

# from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array is its replacement
nx_graph = nx.from_numpy_array(sim_mat)
scores = nx.pagerank(nx_graph)

In [21]:
ranked_sentences = sorted(((scores[i],s) for i,s in enumerate(sentences)), reverse=True)


In [22]:
# print the top-ranked third of the sentences as the summary
for i in range(len(sentence_vectors) // 3):
    print(ranked_sentences[i][1], end="\n\n")

They are unable to function well at this time, one cause of which is that, despite being “awake,” their brain remains in a more sleep-like state throughout the early morning.

The experimental question facing Kleitman and Richardson was simple: When cut off from the daily cycle of light and dark, would their biological rhythms of sleep and wakefulness, together with body temperature, become completely erratic, or would they stay the same as those individuals in the outside world exposed to rhythmic daylight?

Let’s consider figure 7, showing the same forty-eight-hour slice of time and the two factors in question: the twenty-four-hour circadian rhythm and the sleep pressure signal of adenosine, and how much distance there is between them.

If you look at figure 7 once again, the graveyard-shift misery you experience around six a.m. can be explained by the combination of high adenosine sleep pressure and your circadian rhythm reaching its lowest point.

This is known as sleep pressure, a