Now that we have the sentence vectors, let's build a TextRank summarizer in... a line of code or so.

In [1]:
import numpy as np
import pandas as pd  # needed for pd.read_pickle when loading the reviews
import networkx as nx
import random
import pickle

from pathlib import Path
# to compare with my summarizer
from summa import summarizer
from sklearn.metrics.pairwise import cosine_similarity

My summarizer

In [2]:
class Summarizer(object):
    """
    Simplest pagerank-based summarizer one can possibly think of.

    Parameters:
    ----------
    reviewtextDF: pd.DataFrame 
        DF with the original review and the processed sentences
    idx2sent: Dict
        keys are indexes and values sentences
    sent2vec: Dict
        keys are sentence indexes and values are average word vectors

    Attributes:
    ----------
    sent2idx: Dict
        keys are sentences and values are indexes
    """
    def __init__(self, reviewtextDF, idx2sent, sent2vec):
        super(Summarizer, self).__init__()
        self.reviewtextDF = reviewtextDF
        self.idx2sent = idx2sent
        # invert idx2sent: sentence -> index
        self.sent2idx = {sent: i for i, sent in idx2sent.items()}
        self.sent2vec = sent2vec

    def summarize(self, idx, n=100, ratio=0.2):
        """
        Summarize the review corresponding to idx using n random sentences to
        build the graph. The length of the summary will be ratio times the 
        length of the review
        """
        # the indexes of the sentences in the review will be the first ones in the
        # similarity matrix
        org_review = self.reviewtextDF.iloc[idx].reviewText
        proc_sents = self.reviewtextDF.iloc[idx].processed_sents
        sentidx  = [self.sent2idx[s] for s in proc_sents]
        idxmap = {k:v for k,v in enumerate(sentidx)}

        # sample n random sentences to build a graph since 1.4 mil don't fit
        # in memory.
        rand_idx = random.sample(range(1, len(self.sent2idx)), n)
        rand_idx = [r for r in rand_idx if r not in sentidx]
        graph_idx = sentidx + rand_idx

        # compute similarity mtx, corresponding graph and scores
        wvmtx = np.vstack([self.sent2vec[i] for i in graph_idx])
        sim_mat = cosine_similarity(wvmtx) - np.eye(len(wvmtx))
        nx_graph = nx.from_numpy_array(sim_mat)
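        # NOTE: pagerank_numpy was removed in NetworkX 3.0; on newer versions
        # use nx.pagerank(nx_graph) instead
        scores = nx.pagerank_numpy(nx_graph)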

        # extract the indexes corresponding to the sentences in the review and sort them
        scores = list({k: scores[k] for k in idxmap}.items())
        scores = sorted(scores, key=lambda x: x[1], reverse=True)

        # build the summary from the top-scoring sentences of the review
        summary_len = max(round((len(sentidx) * ratio)), 1)
        top_score_idx = [scores[i][0] for i in range(summary_len)]
        top_real_idx  = [idxmap[i] for i in top_score_idx]
        summary = '\n'.join([self.idx2sent[i] for i in top_real_idx])

        return org_review, summary

Let's go step by step through what is happening inside the class:

1. Given an index we extract the original text and the processed sentences. The sentences forming the review of interest will be the first rows in the similarity matrix. To eventually access their pagerank score by index we simply build a map. 

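2. A graph of 1.4 mil nodes (i.e. a similarity matrix of 1.4 mil $\times$ 1.4 mil numbers, about $1.96 \times 10^{12}$ entries, or roughly 15.7 TB at 8 bytes per entry) does not fit in memory, so we do something much smaller and simpler. We pick `n` random sentences (default 100) and build a graph with those plus the sentences of the review of interest. Any sampled sentence that already belongs to the review is dropped, so the graph may contain slightly fewer than `n` extra sentences.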

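3. We stack the sentence vectors of the `len(graph_idx)` sentences into a matrix, compute their pairwise cosine similarities (subtracting the identity to remove self-similarity), turn the result into a graph with `nx.from_numpy_array`, and get the PageRank scores with `nx.pagerank_numpy`, which is meant to be the fastest solution for small graphs.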

4. We then extract the scores at the first indexes, which correspond to the sentences of interest (i.e. those in the "query review"), and sort them. The top `ratio` fraction of those sentences (at least one) becomes the summary.
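
The four steps can also be sketched end to end on toy data. This is a minimal, dependency-free sketch under stated assumptions, not the class above: `toy_vecs`, `cosine` and `pagerank` are hypothetical names, the vectors are made up, and a plain power iteration stands in for `nx.pagerank_numpy`.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pagerank(weights, damping=0.85, iters=100):
    """Power-iteration PageRank on a symmetric, non-negative weight matrix.

    Assumes every node has at least one positive edge (no dangling nodes).
    """
    n = len(weights)
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[i] * weights[i][j] / out_weight[i] for i in range(n))
            for j in range(n)
        ]
    return scores


# Hypothetical toy data: sentence id -> "average word vector".
# Ids 0-2 play the role of the review, ids 3-6 are other sentences.
toy_vecs = {
    0: [1.0, 0.1, 0.0],
    1: [0.9, 0.2, 0.1],
    2: [0.0, 0.1, 1.0],  # points in a different direction from the rest
    3: [1.0, 0.0, 0.1],
    4: [0.8, 0.3, 0.0],
    5: [0.1, 1.0, 0.2],
    6: [0.9, 0.1, 0.2],
}
review_ids = [0, 1, 2]

# Steps 1-2: review sentences first, then random sentences not in the review.
random.seed(0)
rand_ids = [r for r in random.sample(sorted(toy_vecs), 5) if r not in review_ids]
graph_ids = review_ids + rand_ids

# Step 3: cosine similarity matrix with a zero diagonal, then PageRank.
sim = [
    [0.0 if a == b else cosine(toy_vecs[a], toy_vecs[b]) for b in graph_ids]
    for a in graph_ids
]
scores = pagerank(sim)

# Step 4: keep only the review rows (the first ones) and sort by score.
ranked = sorted(range(len(review_ids)), key=lambda i: scores[i], reverse=True)
print("review sentences, best first:", [graph_ids[i] for i in ranked])
```

Sentence 2 is weakly similar to everything else in the graph, so it ends up last; sentences that resemble many others collect more score, which is the whole idea behind using PageRank for extractive summaries.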

Let's have a look at the results.

In [19]:
DATA_PATH = Path('../data')
reviewtextDF = pd.read_pickle(DATA_PATH/'df_processed_reviews.p')
with open(DATA_PATH/'idx2sent.p', 'rb') as f:
    idx2sent = pickle.load(f)
with open(DATA_PATH/'sent2vec.p', 'rb') as f:
    sent2vec = pickle.load(f)

In [20]:
my_summarizer = Summarizer(reviewtextDF, idx2sent, sent2vec)

In [22]:
idx = np.random.choice(reviewtextDF.shape[0], 10)

In [28]:
for i in idx:
    review, my_summary = my_summarizer.summarize(i)
    summary = summarizer.summarize(review)
    print("Original review: \n {}".format(review))
    print("My Summary:  \n {}".format(my_summary))
    print("Summanlp TextRank:  \n {}".format(summary))

Original review: 
 OK - it's big!  However, that is what I like about it.  I can carry as much of my day-to-day life as I need to and still have room left over.  The yellow (think marigolds) is beautiful.  I get many compliments on this gorgeous bag.  I've spend 3 to 4 times the amount I paid for this and did not like them any better.  It may not last as long, but for the price and use I cannot complain.  Be careful; if you put all you can in here, you won't be able to carry it!!
My Summary:  
 i can carry as much of my day to day life as i need to and still have room left over
i have spend to times the amount i paid for this and did not like them any better
Summanlp TextRank:  
 Be careful; if you put all you can in here, you won't be able to carry it!!
Original review: 
 I've been eying these for awhile, And they are on the cheaper side but if you plan on buying jewerly that is this cheap you should expect that, Honestly the problems are only really in the chain used on the neckace, 

When nothing appears under `Summanlp TextRank`, it is simply because the review is short. Increase the `ratio` parameter, e.g. `summarizer.summarize(review, ratio=0.4)`, and problem solved.
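
For comparison, the `Summarizer` class above never returns an empty summary, because its length is computed as `max(round(len(sentidx) * ratio), 1)`. A small sketch of that formula in isolation (`summary_length` is a hypothetical helper name, not part of the class):

```python
def summary_length(n_sentences, ratio=0.2):
    """Sentences kept, mirroring max(round(len(sentidx) * ratio), 1)."""
    return max(round(n_sentences * ratio), 1)


# How the summary length changes with the review length and the ratio.
for n_sentences in (2, 8, 30):
    lengths = {r: summary_length(n_sentences, r) for r in (0.2, 0.4)}
    print(n_sentences, "sentences ->", lengths)
```

With the default `ratio=0.2`, a two-sentence review still yields one sentence, and an eight-sentence review yields two.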

As one might expect, the Summanlp summaries are better :) 

More on text summarization in the future.