In [1]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import networkx as nx
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
%matplotlib inline

# Summarizing Text

Let's try out extractive summarization using the first four paragraphs of [The Great Gatsby](http://gutenberg.net.au/ebooks02/0200041h.html).

First, we'll try to extract the most representative sentence.  Then, we'll extract keywords.

## Sentence extraction

The steps of our sentence extraction process:

1. Parse and tokenize the text using spaCy, and divide into sentences.
2. Calculate the tf-idf matrix.
3. Calculate similarity scores.
4. Calculate TextRank: We're going to use the ´networkx´ package to run the TextRank algorithm.

Let's get started!


In [2]:
# Importing the text the lazy way.
gatsby="In my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since. \"Whenever you feel like criticizing any one,\" he told me, \"just remember that all the people in this world haven't had the advantages that you've had.\" He didn't say any more but we've always been unusually communicative in a reserved way, and I understood that he meant a great deal more than that. In consequence I'm inclined to reserve all judgments, a habit that has opened up many curious natures to me and also made me the victim of not a few veteran bores. The abnormal mind is quick to detect and attach itself to this quality when it appears in a normal person, and so it came about that in college I was unjustly accused of being a politician, because I was privy to the secret griefs of wild, unknown men. Most of the confidences were unsought--frequently I have feigned sleep, preoccupation, or a hostile levity when I realized by some unmistakable sign that an intimate revelation was quivering on the horizon--for the intimate revelations of young men or at least the terms in which they express them are usually plagiaristic and marred by obvious suppressions. Reserving judgments is a matter of infinite hope. I am still a little afraid of missing something if I forget that, as my father snobbishly suggested, and I snobbishly repeat a sense of the fundamental decencies is parcelled out unequally at birth. And, after boasting this way of my tolerance, I come to the admission that it has a limit. Conduct may be founded on the hard rock or the wet marshes but after a certain point I don't care what it's founded on. When I came back from the East last autumn I felt that I wanted the world to be in uniform and at a sort of moral attention forever; I wanted no more riotous excursions with privileged glimpses into the human heart. Only Gatsby, the man who gives his name to this book, was exempt from my reaction--Gatsby who represented everything for which I have an unaffected scorn. If personality is an unbroken series of successful gestures, then there was something gorgeous about him, some heightened sensitivity to the promises of life, as if he were related to one of those intricate machines that register earthquakes ten thousand miles away. This responsiveness had nothing to do with that flabby impressionability which is dignified under the name of the \"creative temperament\"--it was an extraordinary gift for hope, a romantic readiness such as I have never found in any other person and which it is not likely I shall ever find again. No--Gatsby turned out all right at the end; it is what preyed on Gatsby, what foul dust floated in the wake of his dreams that temporarily closed out my interest in the abortive sorrows and short-winded elations of men."

# We want to use the standard english-language parser.
parser = spacy.load('en')

# Parsing Gatsby.
gatsby = parser(gatsby)

# Dividing the text into sentences and storing them as a list of strings.
sentences=[]
for span in gatsby.sents:
    # go from the start to the end of each span, returning each token in the sentence
    # combine each token using join()
    sent = ''.join(gatsby[i].string for i in range(span.start, span.end)).strip()
    sentences.append(sent)

# Creating the tf-idf matrix.
counter = TfidfVectorizer(lowercase=False, 
                          stop_words=None,
                          ngram_range=(1, 1), 
                          analyzer=u'word', 
                          max_df=.5, 
                          min_df=1,
                          max_features=None, 
                          vocabulary=None, 
                          binary=False)

#Applying the vectorizer
data_counts=counter.fit_transform(sentences)

# Similarity

So far, this is all (hopefully) familiar: We've done text parsing and the tf-idf calculation before.  We should now have sentences represented as vectors, with each word having a score based on how often it occurs in the sentence divided by how often it occurs in the whole text.

Now let's calculate the similarity scores for the sentences and apply the TextRank algorithm.  Because TextRank is based on Google's PageRank algorithm, the function is called 'pagerank'.  The hyperparameters are the damping parameter ´alpha´ and the convergence parameter ´tol´.

In [3]:
# Calculating similarity
similarity = data_counts * data_counts.T

# Identifying the sentence with the highest rank.
nx_graph = nx.from_scipy_sparse_matrix(similarity)
ranks=nx.pagerank(nx_graph, alpha=.85, tol=.00000001)

ranked = sorted(((ranks[i],s) for i,s in enumerate(sentences)),
                reverse=True)
print(ranked[0])


(0.0639701861589625, 'The abnormal mind is quick to detect and attach itself to this quality when it appears in a normal person, and so it came about that in college I was unjustly accused of being a politician, because I was privy to the secret griefs of wild, unknown men.')


(This is just me, exploring some of the attributes available...)

In [4]:
dir(gatsby[0])

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_extension',
 'has_vector',
 'head',
 'i',
 'idx',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang',
 'lang_',
 'left_edge',
 'lefts',
 'lemma',
 'lemma_',
 'lex_id',
 'like_email',
 'li

In [5]:
for word in gatsby:
    print(word,word.pos_,word.lemma_,word.is_stop,word.is_sent_start,word.whitespace_,)

In ADP in True True  
my PRON -PRON- True None  
younger ADJ young False None  
and CCONJ and True None  
more ADV more True None  
vulnerable ADJ vulnerable False None  
years NOUN year False None  
my PRON -PRON- True None  
father NOUN father False None  
gave VERB give False None  
me PRON -PRON- True None  
some DET some True None  
advice NOUN advice False None  
that PRON that True None  
I PRON -PRON- True None 
've AUX have True None  
been AUX be True None  
turning VERB turn False None  
over ADP over True None  
in ADP in True None  
my PRON -PRON- True None  
mind NOUN mind False None  
ever ADV ever True None  
since ADV since True None 
. PUNCT . False None  
" PUNCT " False True 
Whenever ADV whenever True None  
you PRON -PRON- True None  
feel VERB feel False None  
like SCONJ like False None  
criticizing VERB criticize False None  
any DET any True None  
one NUM one True None 
, PUNCT , False None 
" PUNCT " False None  
he PRON -PRON- True None  
told VERB tell Fa

How I thought I might need to extract digrams, before I realized the vectorizer has parameters to do this work:

In [6]:
g_len = len(gatsby)
for idx,word in enumerate(gatsby):
    if g_len == (idx+1):
        print("Done")
        break
    print(idx,g_len,word,gatsby[idx+1])

0 552 In my
1 552 my younger
2 552 younger and
3 552 and more
4 552 more vulnerable
5 552 vulnerable years
6 552 years my
7 552 my father
8 552 father gave
9 552 gave me
10 552 me some
11 552 some advice
12 552 advice that
13 552 that I
14 552 I 've
15 552 've been
16 552 been turning
17 552 turning over
18 552 over in
19 552 in my
20 552 my mind
21 552 mind ever
22 552 ever since
23 552 since .
24 552 . "
25 552 " Whenever
26 552 Whenever you
27 552 you feel
28 552 feel like
29 552 like criticizing
30 552 criticizing any
31 552 any one
32 552 one ,
33 552 , "
34 552 " he
35 552 he told
36 552 told me
37 552 me ,
38 552 , "
39 552 " just
40 552 just remember
41 552 remember that
42 552 that all
43 552 all the
44 552 the people
45 552 people in
46 552 in this
47 552 this world
48 552 world have
49 552 have n't
50 552 n't had
51 552 had the
52 552 the advantages
53 552 advantages that
54 552 that you
55 552 you 've
56 552 've had
57 552 had .
58 552 . "
59 552 " He
60 552 He did
61 552 d

Since a lot of Gatsby is about the narrator acting as the observer of other peoples' sordid secrets, this seems pretty good.  Now, let's extract some keywords.

# Keyword summarization

1) Parse and tokenize text (already done).  
2) Filter out stopwords, choose only nouns and adjectives.  
3) Calculate the neighbors of words (we'll use a window of 4).  
4) Run TextRank on the neighbor matrix.  


In [7]:
# Removing stop words and punctuation, then getting a list of all unique words in the text
gatsby_filt = [word for word in gatsby if word.is_stop==False and (word.pos_=='NOUN' or word.pos_=='ADJ')]
words=set(gatsby_filt)

#Creating a grid indicating whether words are within 4 places of the target word
adjacency=pd.DataFrame(columns=words,index=words,data=0)

#Iterating through each word in the text and indicating which of the unique words are its neighbors
for i,word in enumerate(gatsby):
    # Checking if any of the word's next four neighbors are in the word list 
    if any([word == item for item in gatsby_filt]):
        # Making sure to stop at the end of the string, even if there are less than four words left after the target.
        end=max(0,len(gatsby)-(len(gatsby)-(i+5)))
        # The potential neighbors.
        nextwords=gatsby[i+1:end]
        # Filtering the neighbors to select only those in the word list
        inset=[x in gatsby_filt for x in nextwords]
        neighbors=[nextwords[i] for i in range(len(nextwords)) if inset[i]]
        # Adding 1 to the adjacency matrix for neighbors of the target word
        if neighbors:
            adjacency.loc[word,neighbors]=adjacency.loc[word,neighbors]+1

print('done!')

done!


In [8]:
# Running TextRank
#nx_words = nx.from_numpy_matrix(adjacency.as_matrix()) #<--gave me a FutureWarning and suggested .values. So...
nx_words = nx.from_numpy_matrix(adjacency.values)
ranks=nx.pagerank(nx_words, alpha=.85, tol=.00000001)

# Identifying the five most highly ranked keywords
ranked = sorted(((ranks[i],s) for i,s in enumerate(words)),
                reverse=True)
print(ranked[:5])

[(0.012415518587086127, promises), (0.012415518587086127, exempt), (0.012333160533862523, glimpses), (0.012082343425981395, intimate), (0.012054574145630725, earthquakes)]


These results are less impressive.  'Hope', 'promises', and 'glimpses' certainly fit the elegiac, on-the-outside-looking-in tone of the book, but 'exempt' and 'world' are pretty generic.  TextRank may perform better on a larger text sample.

# Drill

It is also possible that keyword phrases will work better.  Modifiy the keyword extraction code to extract two-word phrases (digrams) rather than single words.  Then try it with trigrams.  You will probably want to broaden the window that defines 'neighbors.'  Try a few different modifications, and write up your observations in your notebook.  Discuss with your mentor.

>__Each of the word sets could then operate as its own feature.__ Ngrams can be used to create term-document matrices (though it would now be ngram-document matrices), or used in topic modeling. In addition, ngrams are useful for text prediction as they can be used to determine what words are most likely to follow in a sentence, phrase, or search query.

>For a sentence with X words, there will be X-(N-1) Ngrams. 2-gram phrases are also called ‘bigrams,’ 3-gram phrases are called ‘trigrams,’ etc.

### Starting over with bigrams:

In [9]:
# Utility function to clean text.
import re
def text_cleaner(text):
    
    # Visual inspection shows spaCy does not recognize the double dash '--'.
    # Better get rid of it now!
    text = re.sub(r'--',' ',text)
# Get rid of extra whitespace.
    text = ' '.join(text.split())    
    return text

In [10]:
# Importing the text the lazy way.
gatsby="In my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since. \"Whenever you feel like criticizing any one,\" he told me, \"just remember that all the people in this world haven't had the advantages that you've had.\" He didn't say any more but we've always been unusually communicative in a reserved way, and I understood that he meant a great deal more than that. In consequence I'm inclined to reserve all judgments, a habit that has opened up many curious natures to me and also made me the victim of not a few veteran bores. The abnormal mind is quick to detect and attach itself to this quality when it appears in a normal person, and so it came about that in college I was unjustly accused of being a politician, because I was privy to the secret griefs of wild, unknown men. Most of the confidences were unsought--frequently I have feigned sleep, preoccupation, or a hostile levity when I realized by some unmistakable sign that an intimate revelation was quivering on the horizon--for the intimate revelations of young men or at least the terms in which they express them are usually plagiaristic and marred by obvious suppressions. Reserving judgments is a matter of infinite hope. I am still a little afraid of missing something if I forget that, as my father snobbishly suggested, and I snobbishly repeat a sense of the fundamental decencies is parcelled out unequally at birth. And, after boasting this way of my tolerance, I come to the admission that it has a limit. Conduct may be founded on the hard rock or the wet marshes but after a certain point I don't care what it's founded on. When I came back from the East last autumn I felt that I wanted the world to be in uniform and at a sort of moral attention forever; I wanted no more riotous excursions with privileged glimpses into the human heart. Only Gatsby, the man who gives his name to this book, was exempt from my reaction--Gatsby who represented everything for which I have an unaffected scorn. If personality is an unbroken series of successful gestures, then there was something gorgeous about him, some heightened sensitivity to the promises of life, as if he were related to one of those intricate machines that register earthquakes ten thousand miles away. This responsiveness had nothing to do with that flabby impressionability which is dignified under the name of the \"creative temperament\"--it was an extraordinary gift for hope, a romantic readiness such as I have never found in any other person and which it is not likely I shall ever find again. No--Gatsby turned out all right at the end; it is what preyed on Gatsby, what foul dust floated in the wake of his dreams that temporarily closed out my interest in the abortive sorrows and short-winded elations of men."

gatsby = text_cleaner(gatsby)

# We want to use the standard english-language parser.
parser = spacy.load('en')

# Parsing Gatsby.
gatsby = parser(gatsby)

# Dividing the text into sentences and storing them as a list of strings.
sentences=[]
for span in gatsby.sents:
    # go from the start to the end of each span, returning each token in the sentence
    # combine each token using join()
    sent = ''.join(gatsby[i].string for i in range(span.start, span.end)).strip()
    sentences.append(sent)

# Creating the tf-idf matrix.
counter = TfidfVectorizer(lowercase=False, 
                          stop_words=None,
                          ngram_range=(2,2), #<--Build bigrams instead of single words
                          analyzer=u'word', 
                          max_df=.5, 
                          min_df=1,
                          max_features=None, 
                          vocabulary=None, 
                          binary=False)

#Applying the vectorizer
data_counts=counter.fit_transform(sentences)

gatsby_nouns_adjectives = []
for word in gatsby:
    if word.is_stop==False and \
       word.is_punct==False and \
       (word.pos_=='NOUN' or word.pos_=='ADJ'):
        gatsby_nouns_adjectives.append(word)
g_len = len(gatsby_nouns_adjectives)
gatsby_bigrams = []

for idx,word in enumerate(gatsby_nouns_adjectives):
    if g_len == (idx+1):
        print("Done")
        break
    gatsby_bigrams.append('{}_{}'.format(word,gatsby_nouns_adjectives[idx+1]))
    print(idx,g_len,word,gatsby[idx+1])

In [11]:
gatsby_bigrams = list(counter.vocabulary_.keys())

In [12]:
# This stuff all stays the same:

# Calculating similarity
similarity = data_counts * data_counts.T

# Identifying the sentence with the highest rank.
nx_graph = nx.from_scipy_sparse_matrix(similarity)
ranks=nx.pagerank(nx_graph, alpha=.85, tol=.00000001)

ranked = sorted(((ranks[i],s) for i,s in enumerate(sentences)),
                reverse=True)
print(ranked[0])

(0.05045297153830663, 'The abnormal mind is quick to detect and attach itself to this quality when it appears in a normal person, and so it came about that in college I was unjustly accused of being a politician, because I was privy to the secret griefs of wild, unknown men.')


In [13]:
# Getting a list of all unique bigrams in the text
words=set(gatsby_bigrams)

#Creating a grid indicating whether words are within 4 places of the target word
adjacency=pd.DataFrame(columns=words,index=words,data=0)

#Iterating through each word in the text and indicating which of the unique words are its neighbors
for i,word in enumerate(gatsby_bigrams):
    # Checking if any of the word's next four neighbors are in the word list 
    if any([word == item for item in gatsby_bigrams]):
        # Making sure to stop at the end of the string, even if there are less than four words left after the target.
        end=max(0,len(gatsby_bigrams)-(len(gatsby_bigrams)-(i+5)))
        # The potential neighbors.
        nextwords=gatsby_bigrams[i+1:end]
        # Filtering the neighbors to select only those in the word list
        inset=[x in gatsby_bigrams for x in nextwords]
        neighbors=[nextwords[i] for i in range(len(nextwords)) if inset[i]]
        # Adding 1 to the adjacency matrix for neighbors of the target word
        if neighbors:
            adjacency.loc[word,neighbors]=adjacency.loc[word,neighbors]+1

print('done!')

done!


In [14]:
adjacency

Unnamed: 0,unusually communicative,he meant,of not,his dreams,turning over,it has,the secret,as have,wild unknown,shall ever,...,the hard,No Gatsby,or the,romantic readiness,sense of,as my,Conduct may,unbroken series,horizon for,at sort
unusually communicative,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
he meant,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
of not,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
his dreams,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
turning over,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
as my,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Conduct may,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
unbroken series,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
horizon for,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
gatsby_bigrams

['In my',
 'my younger',
 'younger and',
 'and more',
 'more vulnerable',
 'vulnerable years',
 'years my',
 'my father',
 'father gave',
 'gave me',
 'me some',
 'some advice',
 'advice that',
 'that ve',
 've been',
 'been turning',
 'turning over',
 'over in',
 'in my',
 'my mind',
 'mind ever',
 'ever since',
 'Whenever you',
 'you feel',
 'feel like',
 'like criticizing',
 'criticizing any',
 'any one',
 'one he',
 'he told',
 'told me',
 'me just',
 'just remember',
 'remember that',
 'that all',
 'all the',
 'the people',
 'people in',
 'in this',
 'this world',
 'world haven',
 'haven had',
 'had the',
 'the advantages',
 'advantages that',
 'that you',
 'you ve',
 've had',
 'He didn',
 'didn say',
 'say any',
 'any more',
 'but we',
 'we ve',
 've always',
 'always been',
 'been unusually',
 'unusually communicative',
 'communicative in',
 'in reserved',
 'reserved way',
 'way and',
 'understood that',
 'that he',
 'he meant',
 'meant great',
 'great deal',
 'deal more',
 'mo

In [16]:
# Running TextRank
#nx_words = nx.from_numpy_matrix(adjacency.as_matrix()) #<--gave me a FutureWarning and suggested .values. So...
nx_words = nx.from_numpy_matrix(adjacency.values)
ranks=nx.pagerank(nx_words, alpha=.85, tol=.00000001)

# Identifying the five most highly ranked keywords
ranked = sorted(((ranks[i],s) for i,s in enumerate(words)),
                reverse=True)
print(ranked[:5])

[(0.002580891935930987, 'more vulnerable'), (0.002580891935930987, 'and short'), (0.0025240089764954807, 'vulnerable years'), (0.0025240089764954807, 'sorrows and'), (0.0024789984790138707, 'years my')]


### And now, trigrams?

In [17]:
# Importing the text the lazy way.
gatsby="In my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since. \"Whenever you feel like criticizing any one,\" he told me, \"just remember that all the people in this world haven't had the advantages that you've had.\" He didn't say any more but we've always been unusually communicative in a reserved way, and I understood that he meant a great deal more than that. In consequence I'm inclined to reserve all judgments, a habit that has opened up many curious natures to me and also made me the victim of not a few veteran bores. The abnormal mind is quick to detect and attach itself to this quality when it appears in a normal person, and so it came about that in college I was unjustly accused of being a politician, because I was privy to the secret griefs of wild, unknown men. Most of the confidences were unsought--frequently I have feigned sleep, preoccupation, or a hostile levity when I realized by some unmistakable sign that an intimate revelation was quivering on the horizon--for the intimate revelations of young men or at least the terms in which they express them are usually plagiaristic and marred by obvious suppressions. Reserving judgments is a matter of infinite hope. I am still a little afraid of missing something if I forget that, as my father snobbishly suggested, and I snobbishly repeat a sense of the fundamental decencies is parcelled out unequally at birth. And, after boasting this way of my tolerance, I come to the admission that it has a limit. Conduct may be founded on the hard rock or the wet marshes but after a certain point I don't care what it's founded on. When I came back from the East last autumn I felt that I wanted the world to be in uniform and at a sort of moral attention forever; I wanted no more riotous excursions with privileged glimpses into the human heart. Only Gatsby, the man who gives his name to this book, was exempt from my reaction--Gatsby who represented everything for which I have an unaffected scorn. If personality is an unbroken series of successful gestures, then there was something gorgeous about him, some heightened sensitivity to the promises of life, as if he were related to one of those intricate machines that register earthquakes ten thousand miles away. This responsiveness had nothing to do with that flabby impressionability which is dignified under the name of the \"creative temperament\"--it was an extraordinary gift for hope, a romantic readiness such as I have never found in any other person and which it is not likely I shall ever find again. No--Gatsby turned out all right at the end; it is what preyed on Gatsby, what foul dust floated in the wake of his dreams that temporarily closed out my interest in the abortive sorrows and short-winded elations of men."

gatsby = text_cleaner(gatsby)

# We want to use the standard english-language parser.
parser = spacy.load('en')

# Parsing Gatsby.
gatsby = parser(gatsby)

# Dividing the text into sentences and storing them as a list of strings.
sentences=[]
for span in gatsby.sents:
    # go from the start to the end of each span, returning each token in the sentence
    # combine each token using join()
    sent = ''.join(gatsby[i].string for i in range(span.start, span.end)).strip()
    sentences.append(sent)

# Creating the tf-idf matrix.
counter = TfidfVectorizer(lowercase=False, 
                          stop_words=None,
                          ngram_range=(3,3), #<--Build bigrams instead of single words
                          analyzer=u'word', 
                          max_df=.5, 
                          min_df=1,
                          max_features=None, 
                          vocabulary=None, 
                          binary=False)

#Applying the vectorizer
data_counts=counter.fit_transform(sentences)

In [18]:
gatsby_trigrams = list(counter.vocabulary_.keys())

In [19]:
# This stuff all stays the same:

# Calculating similarity
similarity = data_counts * data_counts.T

# Identifying the sentence with the highest rank.
nx_graph = nx.from_scipy_sparse_matrix(similarity)
ranks=nx.pagerank(nx_graph, alpha=.85, tol=.00000001)

ranked = sorted(((ranks[i],s) for i,s in enumerate(sentences)),
                reverse=True)
print(ranked[0])

(0.05, 'it is not likely I shall ever find again.')


In [20]:
# Getting a list of all unique bigrams in the text
words=set(gatsby_trigrams)

#Creating a grid indicating whether words are within 4 places of the target word
adjacency=pd.DataFrame(columns=words,index=words,data=0)

#Iterating through each word in the text and indicating which of the unique words are its neighbors
for i,word in enumerate(gatsby_trigrams):
    # Checking if any of the word's next four neighbors are in the word list 
    if any([word == item for item in gatsby_trigrams]):
        # Making sure to stop at the end of the string, even if there are less than four words left after the target.
        end=max(0,len(gatsby_trigrams)-(len(gatsby_trigrams)-(i+5)))
        # The potential neighbors.
        nextwords=gatsby_trigrams[i+1:end]
        # Filtering the neighbors to select only those in the word list
        inset=[x in gatsby_trigrams for x in nextwords]
        neighbors=[nextwords[i] for i in range(len(nextwords)) if inset[i]]
        # Adding 1 to the adjacency matrix for neighbors of the target word
        if neighbors:
            adjacency.loc[word,neighbors]=adjacency.loc[word,neighbors]+1

print('done!')

done!


In [21]:
adjacency

Unnamed: 0,East last autumn,readiness such as,impressionability which is,my tolerance come,being politician because,from my reaction,an intimate revelation,revelation was quivering,he told me,sorrows and short,...,personality is an,ve been turning,temporarily closed out,that as my,as my father,those intricate machines,he were related,when realized by,for hope romantic,not few veteran
East last autumn,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
readiness such as,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
impressionability which is,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
my tolerance come,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
being politician because,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
those intricate machines,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
he were related,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
when realized by,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
for hope romantic,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
gatsby_trigrams

['In my younger',
 'my younger and',
 'younger and more',
 'and more vulnerable',
 'more vulnerable years',
 'vulnerable years my',
 'years my father',
 'my father gave',
 'father gave me',
 'gave me some',
 'me some advice',
 'some advice that',
 'advice that ve',
 'that ve been',
 've been turning',
 'been turning over',
 'turning over in',
 'over in my',
 'in my mind',
 'my mind ever',
 'mind ever since',
 'Whenever you feel',
 'you feel like',
 'feel like criticizing',
 'like criticizing any',
 'criticizing any one',
 'any one he',
 'one he told',
 'he told me',
 'told me just',
 'me just remember',
 'just remember that',
 'remember that all',
 'that all the',
 'all the people',
 'the people in',
 'people in this',
 'in this world',
 'this world haven',
 'world haven had',
 'haven had the',
 'had the advantages',
 'the advantages that',
 'advantages that you',
 'that you ve',
 'you ve had',
 'He didn say',
 'didn say any',
 'say any more',
 'but we ve',
 'we ve always',
 've always

In [23]:
# Running TextRank
#nx_words = nx.from_numpy_matrix(adjacency.as_matrix()) #<--gave me a FutureWarning and suggested .values. So...
nx_words = nx.from_numpy_matrix(adjacency.values)
ranks=nx.pagerank(nx_words, alpha=.85, tol=.00000001)

# Identifying the five most highly ranked keywords
ranked = sorted(((ranks[i],s) for i,s in enumerate(words)),
                reverse=True)
print(ranked[:5])

[(0.0026351630280416856, 'sorrows and short'), (0.002635163028041685, 'more vulnerable years'), (0.0025770839316087034, 'abortive sorrows and'), (0.002577083931608703, 'vulnerable years my'), (0.0025311269517034147, 'the abortive sorrows')]


I see that there was a drop in the scores of the highest-ranked keywords/Ngrams as N increased from 1 to 1+. Not much of a difference between the scores of N-grams where N=2 or N=3.