In [5]:

import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import networkx as nx
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
%matplotlib inline

#!python -m spacy download en

# Summarizing Text

Let's try out extractive summarization using the first four paragraphs of [The Great Gatsby](http://gutenberg.net.au/ebooks02/0200041h.html).

First, we'll try to extract the most representative sentence.  Then, we'll extract keywords.

## Sentence extraction

The steps of our sentence extraction process:

1. Parse and tokenize the text using spaCy, and divide into sentences.
2. Calculate the tf-idf matrix.
3. Calculate similarity scores.
4. Calculate TextRank: We're going to use the `networkx` package to run the TextRank algorithm.

Let's get started!


In [6]:
# Importing the text the lazy way.
gatsby="In my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since. \"Whenever you feel like criticizing any one,\" he told me, \"just remember that all the people in this world haven't had the advantages that you've had.\" He didn't say any more but we've always been unusually communicative in a reserved way, and I understood that he meant a great deal more than that. In consequence I'm inclined to reserve all judgments, a habit that has opened up many curious natures to me and also made me the victim of not a few veteran bores. The abnormal mind is quick to detect and attach itself to this quality when it appears in a normal person, and so it came about that in college I was unjustly accused of being a politician, because I was privy to the secret griefs of wild, unknown men. Most of the confidences were unsought--frequently I have feigned sleep, preoccupation, or a hostile levity when I realized by some unmistakable sign that an intimate revelation was quivering on the horizon--for the intimate revelations of young men or at least the terms in which they express them are usually plagiaristic and marred by obvious suppressions. Reserving judgments is a matter of infinite hope. I am still a little afraid of missing something if I forget that, as my father snobbishly suggested, and I snobbishly repeat a sense of the fundamental decencies is parcelled out unequally at birth. And, after boasting this way of my tolerance, I come to the admission that it has a limit. Conduct may be founded on the hard rock or the wet marshes but after a certain point I don't care what it's founded on. When I came back from the East last autumn I felt that I wanted the world to be in uniform and at a sort of moral attention forever; I wanted no more riotous excursions with privileged glimpses into the human heart. \
Only Gatsby, the man who gives his name to this book, was exempt from my reaction--Gatsby who represented everything for which I have an unaffected scorn. If personality is an unbroken series of successful gestures, then there was something gorgeous about him, some heightened sensitivity to the promises of life, as if he were related to one of those intricate machines that register earthquakes ten thousand miles away. This responsiveness had nothing to do with that flabby impressionability which is dignified under the name of the \"creative temperament\"--it was an extraordinary gift for hope, a romantic readiness such as I have never found in any other person and which it is not likely I shall ever find again. No--Gatsby turned out all right at the end; it is what preyed on Gatsby, what foul dust floated in the wake of his dreams that temporarily closed out my interest in the abortive sorrows and short-winded elations of men."

# We want to use the standard English-language parser.
# (spaCy 3 and later drop the 'en' shortcut; load 'en_core_web_sm' there instead.)
parser = spacy.load('en')

# Parsing Gatsby.
gatsby = parser(gatsby)

# Dividing the text into sentences and storing them as a list of strings.
sentences=[]
for span in gatsby.sents:
    # span.text is the sentence's original text; strip() trims stray whitespace.
    sentences.append(span.text.strip())

# Creating the tf-idf matrix.
counter = TfidfVectorizer(lowercase=False, 
                          stop_words=None,
                          ngram_range=(1, 1), 
                          analyzer=u'word', 
                          max_df=.5, 
                          min_df=1,
                          max_features=None, 
                          vocabulary=None, 
                          binary=False)

#Applying the vectorizer
data_counts=counter.fit_transform(sentences)

# Similarity

So far, this is all (hopefully) familiar: We've done text parsing and the tf-idf calculation before.  We should now have sentences represented as vectors, with each word scored by how often it occurs in the sentence, weighted down by how many sentences in the text contain it.
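
To make the scoring concrete, here is a minimal pure-Python sketch of tf-idf on three toy sentences. It assumes the smoothed idf and L2 row normalization that scikit-learn's `TfidfVectorizer` applies by default. The toy sentences and the helper name `tfidf_rows` are made up for illustration, and the sketch ignores `max_df` and the library's own tokenization rules.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
        # L2-normalize so every row has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

toy = ["hope is a gift", "hope is rare", "the gift of hope"]
rows = tfidf_rows(toy)
# "hope" occurs in every sentence, so it gets the lowest weight in the first row.
print(sorted(rows[0].items(), key=lambda kv: kv[1]))
```

Because every row has unit length, the dot product of two rows is their cosine similarity, which is what the next step relies on.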

Now let's calculate the similarity scores for the sentences and apply the TextRank algorithm.  Because TextRank is based on Google's PageRank algorithm, the function is called `pagerank`.  The hyperparameters are the damping parameter `alpha` and the convergence parameter `tol`.
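
As a rough illustration of what the two hyperparameters do, the sketch below runs a plain power-iteration PageRank on a small weighted similarity matrix. It is not the `networkx` implementation: the function name `pagerank_sketch`, the toy matrix, and the stopping rule (total absolute change below `tol`) are assumptions chosen for illustration.

```python
def pagerank_sketch(weights, alpha=0.85, tol=1e-8, max_iter=100):
    """Power-iteration PageRank on a square matrix of non-negative edge weights."""
    n = len(weights)
    # Total outgoing weight per node, used to turn each row into probabilities.
    row_sums = [sum(row) for row in weights]
    ranks = [1.0 / n] * n
    for _ in range(max_iter):
        # alpha: with probability (1 - alpha) the walker jumps to a random node.
        new = [(1.0 - alpha) / n] * n
        for i in range(n):
            if row_sums[i] == 0:
                # A node with no edges spreads its rank evenly.
                for j in range(n):
                    new[j] += alpha * ranks[i] / n
            else:
                for j in range(n):
                    new[j] += alpha * ranks[i] * weights[i][j] / row_sums[i]
        # tol: stop once the scores have effectively stopped moving.
        change = sum(abs(a - b) for a, b in zip(new, ranks))
        ranks = new
        if change < tol:
            break
    return ranks

# Toy symmetric similarity matrix: sentence 0 overlaps with both others.
toy = [[0.0, 0.6, 0.5],
       [0.6, 0.0, 0.1],
       [0.5, 0.1, 0.0]]
scores = pagerank_sketch(toy)
print(max(range(len(scores)), key=scores.__getitem__))  # index of the top-ranked sentence
```

A lower `alpha` pulls all scores toward the uniform value, while a smaller `tol` simply asks for more iterations before stopping.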

In [7]:
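# Calculating similarity. The tf-idf rows are L2-normalized by default,
# so this matrix product gives the cosine similarity between sentences.
similarity = data_counts @ data_counts.T

# Identifying the sentence with the highest rank.
# (networkx 3.0 and later name this function from_scipy_sparse_array.)
nx_graph = nx.from_scipy_sparse_matrix(similarity)
ranks=nx.pagerank(nx_graph, alpha=.85, tol=.00000001)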

ranked = sorted(((ranks[i],s) for i,s in enumerate(sentences)),
                reverse=True)
print(ranked[0])


(0.07478177112861596, 'This responsiveness had nothing to do with that flabby impressionability which is dignified under the name of the "creative temperament"--it was an extraordinary gift for hope, a romantic readiness such as I have never found in any other person and which it is not likely I shall ever find again.')


Since a lot of Gatsby is about the narrator acting as the observer of other people's sordid secrets, this seems pretty good.  Now, let's extract some keywords.

# Keyword summarization

1) Parse and tokenize text (already done).  
2) Filter out stopwords, choose only nouns and adjectives.  
3) Calculate the neighbors of words (we'll use a window of 4).  
4) Run TextRank on the neighbor matrix.  
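
Step 3 can be sketched on a toy token list without spaCy. The snippet below counts, for each candidate keyword, which other candidates appear among the next four tokens. The token list, the `candidates` set, and the helper name `window_cooccurrence` are all illustrative and not part of the notebook's code.

```python
from collections import defaultdict

def window_cooccurrence(tokens, candidates, window=4):
    """Count candidate words that follow each candidate within `window` tokens."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        if word not in candidates:
            continue
        # Slicing stops at the end of the list on its own, so no bounds check is needed.
        for neighbor in tokens[i + 1:i + 1 + window]:
            if neighbor in candidates:
                counts[word][neighbor] += 1
    return counts

tokens = ["extraordinary", "gift", "for", "hope", "a", "romantic", "readiness"]
candidates = {"extraordinary", "gift", "hope", "romantic", "readiness"}
counts = window_cooccurrence(tokens, candidates)
print(dict(counts["gift"]))
```

The counts only look forward, so the resulting matrix is not symmetric; whether to mirror it before ranking is a design choice.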


In [8]:
# Keeping only non-stop-word nouns and adjectives (this also drops punctuation).
gatsby_filt = [word for word in gatsby if not word.is_stop and word.pos_ in ('NOUN', 'ADJ')]
# One entry per kept token. spaCy tokens compare by position, so a word that
# occurs twice (e.g. 'mind') keeps two entries. A list fixes the label order.
words = list(set(gatsby_filt))

# Creating a grid indicating whether words are within 4 places of the target word
adjacency = pd.DataFrame(columns=words, index=words, data=0)

# Iterating through each word in the text and indicating which of the filtered words are its neighbors
for i, word in enumerate(gatsby):
    # Only words that survived the filter get a row in the matrix.
    if word in gatsby_filt:
        # The window covers the next four tokens, clamped to the end of the text.
        end = min(len(gatsby), i + 5)
        # The potential neighbors.
        nextwords = gatsby[i+1:end]
        # Filtering the neighbors to select only those in the word list
        inset = [x in gatsby_filt for x in nextwords]
        neighbors = [nextwords[j] for j in range(len(nextwords)) if inset[j]]
        print(nextwords, inset, neighbors)
        # Adding 1 to the adjacency matrix for neighbors of the target word
        if neighbors:
            # .loc selects the target word's row and the neighbor columns by label.
            adjacency.loc[word, neighbors] = adjacency.loc[word, neighbors] + 1

print('done!')
        
print(adjacency)


and more vulnerable years [False, False, True, True] [vulnerable, years]
years my father gave [True, False, True, False] [years, father]
my father gave me [False, True, False, False] [father]
gave me some advice [False, False, False, True] [advice]
that I've been [False, False, False, False] []
ever since. " [False, False, False, False] []
in this world have [False, False, True, False] [world]
haven't had the [False, False, False, False] []
that you've had [False, False, False, False] []
in a reserved way [False, False, True, True] [reserved, way]
way, and I [True, False, False, False] [way]
, and I understood [False, False, False, False] []
deal more than that [True, False, False, False] [deal]
more than that. [False, False, False, False] []
I'm inclined to [False, False, True, False] [inclined]
to reserve all judgments [False, False, False, True] [judgments]
, a habit that [False, False, True, False] [habit]
that has opened up [False, False, False, False] []
natures to me and [True, 

[133 rows x 133 columns]


In [12]:
len(gatsby)

554

In [17]:
demo = ['hey', 'you', 'over', 'there']
'you' in demo
for i in range(len(demo)):
    print(i)

0
1
2
3


In [21]:
demo_df=pd.DataFrame(columns=demo,index=demo,data=0)

In [24]:
demo_df.loc['hey', 'hey'] = demo_df.loc['hey', 'hey']+1
demo_df.loc['hey', 'hey']

1

In [33]:
for i,word in enumerate(gatsby):
    print(len(gatsby), max(0,len(gatsby)-(len(gatsby)-(i+20))))

554 20
554 21
554 22
554 23
554 24
554 25
554 26
554 27
554 28
554 29
554 30
554 31
554 32
554 33
554 34
554 35
554 36
554 37
554 38
554 39
554 40
554 41
554 42
554 43
554 44
554 45
554 46
554 47
554 48
554 49
554 50
554 51
554 52
554 53
554 54
554 55
554 56
554 57
554 58
554 59
554 60
554 61
554 62
554 63
554 64
554 65
554 66
554 67
554 68
554 69
554 70
554 71
554 72
554 73
554 74
554 75
554 76
554 77
554 78
554 79
554 80
554 81
554 82
554 83
554 84
554 85
554 86
554 87
554 88
554 89
554 90
554 91
554 92
554 93
554 94
554 95
554 96
554 97
554 98
554 99
554 100
554 101
554 102
554 103
554 104
554 105
554 106
554 107
554 108
554 109
554 110
554 111
554 112
554 113
554 114
554 115
554 116
554 117
554 118
554 119
554 120
554 121
554 122
554 123
554 124
554 125
554 126
554 127
554 128
554 129
554 130
554 131
554 132
554 133
554 134
554 135
554 136
554 137
554 138
554 139
554 140
554 141
554 142
554 143
554 144
554 145
554 146
554 147
554 148
554 149
554 150
554 151
554 152
554 153
554 154


In [11]:
for item in gatsby_filt:
    print(item)

younger
vulnerable
years
father
advice
mind
people
world
advantages
communicative
reserved
way
great
deal
consequence
inclined
judgments
habit
curious
natures
victim
veteran
bores
abnormal
mind
quick
quality
normal
person
college
politician
privy
secret
griefs
wild
unknown
men
Most
confidences
sleep
preoccupation
hostile
levity
unmistakable
sign
intimate
revelation
horizon
intimate
revelations
young
men
terms
plagiaristic
obvious
suppressions
Reserving
judgments
matter
infinite
hope
little
afraid
father
sense
fundamental
decencies
birth
way
tolerance
admission
limit
Conduct
hard
rock
wet
marshes
certain
point
autumn
world
uniform
sort
moral
attention
riotous
excursions
privileged
glimpses
human
heart
man
book
exempt
reaction
unaffected
scorn
personality
unbroken
series
successful
gestures
gorgeous
sensitivity
promises
life
intricate
machines
earthquakes
miles
responsiveness
flabby
impressionability
dignified
creative
temperament
extraordinary
gift
hope
romantic
readiness
person
likely


In [25]:

# Running TextRank
# .values replaces the deprecated as_matrix(); from_numpy_array replaces from_numpy_matrix.
nx_words = nx.from_numpy_array(adjacency.values)
ranks=nx.pagerank(nx_words, alpha=.85, tol=.00000001)

# Identifying the five most highly ranked keywords
ranked = sorted(((ranks[i],s) for i,s in enumerate(words)),
                reverse=True)
print(ranked[:5])




[(0.013370948308795436, hope), (0.012223431176324349, promises), (0.012223431176324349, exempt), (0.012142068850548908, glimpses), (0.011895137937387881, intimate)]


These results are less impressive.  'Hope', 'promises', and 'glimpses' certainly fit the elegiac, on-the-outside-looking-in tone of the book, but 'exempt' is pretty generic.  TextRank may perform better on a larger text sample.

# Drill

It is also possible that keyword phrases will work better.  Modify the keyword extraction code to extract two-word phrases (bigrams) rather than single words.  Then try it with trigrams.  You will probably want to broaden the window that defines 'neighbors.'  Try a few different modifications, and write up your observations in your notebook.  Discuss with your mentor.
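
One way to approach the phrase-building part is a small helper that slides a window of length `n` over a list of words. This is only a sketch: the name `ngrams` and the hyphen used to join words are choices made here, not part of any library.

```python
def ngrams(words, n):
    """Return the hyphen-joined n-grams of `words`, in order."""
    # zip over n staggered copies of the list; it stops at the shortest one.
    return ["-".join(group) for group in zip(*(words[k:] for k in range(n)))]

filtered = ["younger", "vulnerable", "years", "father", "advice"]
print(ngrams(filtered, 2))
print(ngrams(filtered, 3))
```

The same helper covers both bigrams and trigrams, so only the value of `n` and the neighbor window need to change between experiments.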

In [27]:
# Build bigrams from consecutive filtered words (nouns and adjectives).
bigrams = set()
for first, second in zip(gatsby_filt, gatsby_filt[1:]):
    bigrams.add(str(first) + "-" + str(second))
# A list fixes the label order used by the DataFrame and the ranking step.
bigrams = list(bigrams)

adjacency = pd.DataFrame(columns=bigrams, index=bigrams, data=0)
print(adjacency)



                             book-exempt  Reserving-judgments  \
book-exempt                            0                    0   
Reserving-judgments                    0                    0   
politician-privy                       0                    0   
Conduct-hard                           0                    0   
foul-dust                              0                    0   
creative-temperament                   0                    0   
man-book                               0                    0   
habit-curious                          0                    0   
inclined-judgments                     0                    0   
obvious-suppressions                   0                    0   
younger-vulnerable                     0                    0   
sleep-preoccupation                    0                    0   
suppressions-Reserving                 0                    0   
personality-unbroken                   0                    0   
intricate-machines       

[132 rows x 132 columns]


In [47]:
# Iterating through adjacent token pairs in the text and recording which known bigrams follow each one
for i in range(len(gatsby) - 1):
    bigram = str(gatsby[i]) + '-' + str(gatsby[i+1])
    # Only bigrams built from the filtered words get a row in the matrix.
    if bigram in bigrams:
        # The window covers the next nine tokens, clamped to the end of the text.
        end = min(len(gatsby), i + 10)
        # The potential neighbors.
        nextwords = gatsby[i+1:end]
        # Pairing up consecutive tokens in the window to form candidate bigrams.
        nextgrams = [str(nextwords[j]) + '-' + str(nextwords[j+1])
                     for j in range(len(nextwords) - 1)]
        # Filtering the candidates to select only those in the bigram list
        inset = [x in bigrams for x in nextgrams]
        neighbors = [gram for gram, keep in zip(nextgrams, inset) if keep]
        print(nextgrams, inset, neighbors)
        # Adding 1 to the adjacency matrix for neighbors of the target bigram
        if neighbors:
            # .loc selects the target bigram's row and the neighbor columns by label.
            adjacency.loc[bigram, neighbors] = adjacency.loc[bigram, neighbors] + 1

['years-my', 'my-father', 'father-gave', 'gave-me', 'me-some', 'some-advice', 'advice-that', 'that-I'] [False, False, False, False, False, False, False, False] []
['way-,', ',-and', 'and-I', 'I-understood', 'understood-that', 'that-he', 'he-meant', 'meant-a'] [False, False, False, False, False, False, False, False] []
['deal-more', 'more-than', 'than-that', 'that-.', '.-In', 'In-consequence', 'consequence-I', "I-'m"] [False, False, False, False, False, False, False, False] []
['natures-to', 'to-me', 'me-and', 'and-also', 'also-made', 'made-me', 'me-the', 'the-victim'] [False, False, False, False, False, False, False, False] []
['bores-.', '.-The', 'The-abnormal', 'abnormal-mind', 'mind-is', 'is-quick', 'quick-to', 'to-detect'] [False, False, False, True, False, False, False, False] ['abnormal-mind']
['mind-is', 'is-quick', 'quick-to', 'to-detect', 'detect-and', 'and-attach', 'attach-itself', 'itself-to'] [False, False, False, False, False, False, False, False] []
['person-,', ',-and', 

In [32]:
for i, word in enumerate(gatsby):
    bigram = str(word)+'-'+str(gatsby[i+1])
    print(bigram)

In-my
my-younger
younger-and
and-more
more-vulnerable
vulnerable-years
years-my
my-father
father-gave
gave-me
me-some
some-advice
advice-that
that-I
I-'ve
've-been
been-turning
turning-over
over-in
in-my
my-mind
mind-ever
ever-since
since-.
.-"
"-Whenever
Whenever-you
you-feel
feel-like
like-criticizing
criticizing-any
any-one
one-,
,-"
"-he
he-told
told-me
me-,
,-"
"-just
just-remember
remember-that
that-all
all-the
the-people
people-in
in-this
this-world
world-have
have-n't
n't-had
had-the
the-advantages
advantages-that
that-you
you-'ve
've-had
had-.
.-"
"-He
He-did
did-n't
n't-say
say-any
any-more
more-but
but-we
we-'ve
've-always
always-been
been-unusually
unusually-communicative
communicative-in
in-a
a-reserved
reserved-way
way-,
,-and
and-I
I-understood
understood-that
that-he
he-meant
meant-a
a-great
great-deal
deal-more
more-than
than-that
that-.
.-In
In-consequence
consequence-I
I-'m
'm-inclined
inclined-to
to-reserve
reserve-all
all-judgments
judgments-,
,-a
a-habit
habit-tha

IndexError: [E040] Attempt to access token at 554, max length 554.

In [45]:

# Running TextRank
# .values replaces the deprecated as_matrix(); from_numpy_array replaces from_numpy_matrix.
nx_words = nx.from_numpy_array(adjacency.values)
ranks=nx.pagerank(nx_words, alpha=.85, tol=.00000001)

# Identifying the five most highly ranked keywords
ranked = sorted(((ranks[i],s) for i,s in enumerate(bigrams)),
                reverse=True)
print(ranked[:5])



[(0.03501058397382176, 'riotous-excursions'), (0.0348320370392147, 'wet-marshes'), (0.0348320370392147, 'unmistakable-sign'), (0.0348320370392147, 'unbroken-series'), (0.0348320370392147, 'extraordinary-gift')]


# Trying bigrams with broader neighbor search

In [56]:
gatsby_filt = [word for word in gatsby if not word.is_stop and word.pos_ in ('NOUN', 'ADJ')]

# Build bigrams from consecutive filtered words (nouns and adjectives).
bigrams = set()
for first, second in zip(gatsby_filt, gatsby_filt[1:]):
    bigrams.add(str(first) + "-" + str(second))
# A list fixes the label order used by the DataFrame and the ranking step.
bigrams = list(bigrams)

adjacency = pd.DataFrame(columns=bigrams, index=bigrams, data=0)

In [57]:
# Iterating through adjacent token pairs in the text and recording which known bigrams follow each one
for i in range(len(gatsby) - 1):
    bigram = str(gatsby[i]) + '-' + str(gatsby[i+1])
    # Only bigrams built from the filtered words get a row in the matrix.
    if bigram in bigrams:
        # The broader window covers the next 19 tokens, clamped to the end of the text.
        end = min(len(gatsby), i + 20)
        # The potential neighbors.
        nextwords = gatsby[i+1:end]
        # Pairing up consecutive tokens in the window to form candidate bigrams.
        nextgrams = [str(nextwords[j]) + '-' + str(nextwords[j+1])
                     for j in range(len(nextwords) - 1)]
        # Filtering the candidates to select only those in the bigram list
        inset = [x in bigrams for x in nextgrams]
        neighbors = [gram for gram, keep in zip(nextgrams, inset) if keep]
        #print(nextgrams, inset, neighbors)
        # Adding 1 to the adjacency matrix for neighbors of the target bigram
        if neighbors:
            # .loc selects the target bigram's row and the neighbor columns by label.
            adjacency.loc[bigram, neighbors] = adjacency.loc[bigram, neighbors] + 1

In [58]:

# Running TextRank
# .values replaces the deprecated as_matrix(); from_numpy_array replaces from_numpy_matrix.
nx_words = nx.from_numpy_array(adjacency.values)
ranks=nx.pagerank(nx_words, alpha=.85, tol=.00000001)

# Identifying the five most highly ranked keywords
ranked = sorted(((ranks[i],s) for i,s in enumerate(bigrams)),
                reverse=True)
print(ranked[:5])



[(0.030122012696693984, 'abnormal-mind'), (0.02590442862449952, 'flabby-impressionability'), (0.02464874783786393, 'extraordinary-gift'), (0.02464874783786393, 'creative-temperament'), (0.02426771217836155, 'obvious-suppressions')]


# Trigrams

In [72]:
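import string
exclude = set(string.punctuation)

# Comparing word.text (a string) against the punctuation set; a Token object itself never matches a string.
gatsby_filt = [word for word in gatsby if not word.is_stop and word.text not in exclude and word.pos_ in ('NOUN', 'ADJ')]

# Build trigrams from gatsby_filt. The name `bigrams` is kept so the cells below run unchanged.
bigrams = set()
for first, second, third in zip(gatsby_filt, gatsby_filt[1:], gatsby_filt[2:]):
    bigrams.add(str(first) + "-" + str(second) + "-" + str(third))
# A list fixes the label order used by the DataFrame and the ranking step.
bigrams = list(bigrams)

adjacency = pd.DataFrame(columns=bigrams, index=bigrams, data=0)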

In [74]:
adjacency.head()

Unnamed: 0,levity-unmistakable-sign,person-likely-end,quality-normal-person,habit-curious-natures,rock-wet-marshes,riotous-excursions-privileged,decencies-birth-way,life-intricate-machines,matter-infinite-hope,admission-limit-Conduct,...,vulnerable-years-father,impressionability-dignified-creative,hope-little-afraid,sensitivity-promises-life,Reserving-judgments-matter,marshes-certain-point,tolerance-admission-limit,dust-wake-dreams,point-autumn-world,attention-riotous-excursions
levity-unmistakable-sign,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
person-likely-end,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
quality-normal-person,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
habit-curious-natures,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
rock-wet-marshes,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [75]:
# Iterating through adjacent token triples in the text and recording which known trigrams follow each one
for i in range(len(gatsby) - 2):
    bigram = str(gatsby[i]) + '-' + str(gatsby[i+1]) + '-' + str(gatsby[i+2])
    # Only trigrams built from the filtered words get a row in the matrix.
    if bigram in bigrams:
        # The window covers the next 19 tokens, clamped to the end of the text.
        end = min(len(gatsby), i + 20)
        # The potential neighbors.
        nextwords = gatsby[i+1:end]
        # Grouping consecutive tokens in the window to form candidate trigrams.
        nextgrams = [str(nextwords[j]) + '-' + str(nextwords[j+1]) + '-' + str(nextwords[j+2])
                     for j in range(len(nextwords) - 2)]
        # Filtering the candidates to select only those in the trigram list
        inset = [x in bigrams for x in nextgrams]
        neighbors = [gram for gram, keep in zip(nextgrams, inset) if keep]
        #print(nextgrams, inset, neighbors)
        # Adding 1 to the adjacency matrix for neighbors of the target trigram
        if neighbors:
            # .loc selects the target trigram's row and the neighbor columns by label.
            adjacency.loc[bigram, neighbors] = adjacency.loc[bigram, neighbors] + 1

In [76]:

# Running TextRank
# .values replaces the deprecated as_matrix(); from_numpy_array replaces from_numpy_matrix.
nx_words = nx.from_numpy_array(adjacency.values)
ranks=nx.pagerank(nx_words, alpha=.85, tol=.00000001)

# Identifying the five most highly ranked keywords
ranked = sorted(((ranks[i],s) for i,s in enumerate(bigrams)),
                reverse=True)
print(ranked[:5])



[(0.0076335877862595495, 'younger-vulnerable-years'), (0.0076335877862595495, 'young-men-terms'), (0.0076335877862595495, 'years-father-advice'), (0.0076335877862595495, 'world-uniform-sort'), (0.0076335877862595495, 'world-advantages-communicative')]
