In [1]:
import spacy
import pandas
from collections import Counter
from heapq import nlargest
nlp = spacy.load('en_core_web_sm')

In [2]:
rd = pandas.read_csv('preprossecedData.csv')
products = rd['name'].unique()

The function below is mostly based on the spaCy summarization algorithm in https://medium.com/analytics-vidhya/text-summarization-using-spacy-ca4867c6b744.

It takes a product name as an argument, combines all reviews of that product into one text, and selects the summary from the sentences of that text.

The weight of each sentence is the sum of the normalized frequencies of the keywords in that sentence. Keywords are the tokens that are neither stop words nor punctuation and are tagged as proper nouns, adjectives, nouns, or verbs.

The normalized frequency of each word is its frequency in the text, divided by the frequency of the most common word.
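
The weighting described above can be sketched without spaCy. The snippet below is a minimal illustration, not the notebook's function: the sentences and their keyword lists are made-up stand-ins for spaCy's output, and the helper names `normalized_frequencies` and `score_sentences` are hypothetical.

```python
from collections import Counter
from heapq import nlargest

# Hypothetical sentences mapped to their keywords (assumed already filtered
# for stop words, punctuation, and part of speech).
sentences = {
    "batteries last long": ["batteries", "last", "long"],
    "great price": ["great", "price"],
    "batteries leak": ["batteries", "leak"],
}


def normalized_frequencies(words):
    """Divide each word count by the count of the most common word."""
    counts = Counter(words)
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}


def score_sentences(sentences):
    """Score each sentence as the sum of its keywords' normalized frequencies."""
    all_words = [word for words in sentences.values() for word in words]
    freq = normalized_frequencies(all_words)
    return {sent: sum(freq[word] for word in words)
            for sent, words in sentences.items()}


scores = score_sentences(sentences)
top = nlargest(2, scores, key=scores.get)
print(scores)  # {'batteries last long': 2.0, 'great price': 1.0, 'batteries leak': 1.5}
print(top)     # ['batteries last long', 'batteries leak']
```

Here "batteries" occurs twice and every other keyword once, so "batteries" has normalized frequency 1.0 and the rest 0.5.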

In [3]:
def summarize(product):
    # Combine all reviews of the product into a single text.
    df = rd[rd['name'] == product]
    reviewText = df['text'].str.cat(sep='. ')
    reviewDoc = nlp(reviewText)
    # Keep only content words: proper nouns, adjectives, nouns, and verbs.
    pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
    words = [token.text for token in reviewDoc
             if not token.is_stop and not token.is_punct and token.pos_ in pos_tag]
    freq_word = Counter(words)
    # Normalize each frequency by that of the most common word.
    max_freq = freq_word.most_common(1)[0][1]
    for word in freq_word:
        freq_word[word] = freq_word[word] / max_freq
    # Score each sentence by summing the normalized frequencies of its keywords.
    sent_strength = {}
    for sent in reviewDoc.sents:
        for word in sent:
            if word.text in freq_word:
                sent_strength[sent] = sent_strength.get(sent, 0) + freq_word[word.text]
    # Keep the three highest-scoring sentences.
    important_sents = nlargest(3, sent_strength, key=sent_strength.get)
    final_sentences = [w.text for w in important_sents]
    # "STOP" marks the sentence boundaries in the output (debugging aid).
    summary = ' STOP '.join(final_sentences)
    return summary

In [4]:
print("Summarizing", products[0])
print(summarize(products[0]))

Summarizing AmazonBasics AAA Performance Alkaline Batteries (36 Count)
I was buying Duracell ProCell batteries for years as the normal Duracell and Energizer batteries didn't seem to last (mostly use them in my electric shaver), but these Amazon batteries have been more impressive than the ProCell batteries. STOP So far so good, I'll try and let folks know if they don't last as long as branded batteries.. bought these batteries because the duracell batteries tend to leak, but these batteries i had to replace in five light sets that the duracellslast months these lasted a week. STOP Upon further inspection, I checked the batteries, and the compartment was a little bit wet, so I cleaned it up and went to get new batteries (still in the plastic wrapper) and all of the batteries had an oily residue around them (some type of leakage) .Overall, dont save money here as you might lose your equipment and the money you spend on the batteries..


In [5]:
print("Summarizing", products[1])
print(summarize(products[1]))

Summarizing AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary
Great price, great battery life, great look, just great, love it :D. Been using Amazon AA and AAA batteries over the past 6 months in a variety of electronic items and have found that they do not last as long as higher quality, higher price batteries. STOP I hope these alkalines stay strong in storage.. last longer then maxell batteries and dont get near as hot when in heavy use as other batteriessize normal AA size.material good aluminum and Alkaline.quality great don't get to hot in full use and last a lot longer then others iv tested.pro's long lasting, don't get to hot, good materials.con's noneoverall great batteries great quality great bargainwould buy again STOP They worked great for my needs, and will return based on the price.. bought these batteries because the duracell batteries tend to leak, but these batteries i had to replace in five light sets that the duracellslast months these la

With the current definition of the summarize function, some of the extracted sentences are too long: several reviews end up merged into a single sentence.

A possible solution to this problem is to set sentence boundaries with explicit rules before spaCy's default, parser-based segmentation runs.
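
The idea behind rule-based segmentation can be illustrated with plain Python. This is a simplified analogue, not the spaCy component used in this notebook: the function name `split_on_punctuation` and the sample text are hypothetical, and the rule here only splits after runs of ".", "!", "?" or ":".

```python
import re


def split_on_punctuation(text):
    """Split text after every run of '.', '!', '?' or ':' characters."""
    # Each match is a run of non-punctuation characters followed by
    # any closing punctuation marks.
    pieces = re.findall(r"[^.!?:]+[.!?:]*", text)
    return [piece.strip() for piece in pieces if piece.strip()]


text = "So far so good.. bought these because others leak! Great price: would buy again"
for sentence in split_on_punctuation(text):
    print(sentence)
# So far so good..
# bought these because others leak!
# Great price:
# would buy again
```

Because every punctuation mark forces a boundary, reviews that were joined with ". " can no longer run together into one long sentence.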

In [6]:
from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        # A standalone punctuation token ends a sentence: the next token starts a new one.
        if token.text in [".", "!", "?", ":"]:
            doc[token.i + 1].is_sent_start = True
            continue
        # A comma, or a token with punctuation embedded in it, starts a new sentence itself.
        if any(mark in token.text for mark in ".,!?:"):
            doc[token.i].is_sent_start = True
    return doc

# Run the rules before the parser so that the parser respects these boundaries.
nlp.add_pipe("set_custom_boundaries", before="parser")

<function __main__.set_custom_boundaries(doc)>

In [7]:
print("Summarizing", products[0])
print(summarize(products[0]))

Summarizing AmazonBasics AAA Performance Alkaline Batteries (36 Count)
Save your money buy Duracell batteries from Groupon they will stand behind less than quality batteries but Amazon will not once you buy Amazon batteries as has become the case with most of Amazon you are on your on without a paddle. STOP We've been going through a lot of AAA sized batteries and have been able to verify that these Amazon Basics batteries last just as long as other name brand alkaline batteries from Duracell STOP no I will not pay the price in the store as I find Amazon's batteries last approximately the same as like Energizer or Duracell and thieves Amazon basic brand batteries are probably half that price.


In [8]:
print("Summarizing", products[1])
print(summarize(products[1]))

Summarizing AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary
We've been going through a lot of AAA sized batteries and have been able to verify that these Amazon Basics batteries last just as long as other name brand alkaline batteries from Duracell STOP Why pay extra for Duracell batteries when AmazonBasics' batteries will last almost as long I bought a 48-pack of AA batteries for just 12 that's just 25 cents per battery! STOP no I will not pay the price in the store as I find Amazon's batteries last approximately the same as like Energizer or Duracell and thieves Amazon basic brand batteries are probably half that price.


The rule-based sentence segmentation results in shorter summaries, though they are somewhat repetitive and not always relevant.

The following definition of the summarization function excludes nouns and proper nouns from the weighting.
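
The effect of the part-of-speech filter on the frequency table can be seen in a small sketch. The tagged words below are made-up stand-ins for spaCy tokens, and `keyword_counts` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Hypothetical (word, part-of-speech) pairs standing in for spaCy tokens.
tagged = [
    ("batteries", "NOUN"), ("last", "VERB"), ("long", "ADJ"),
    ("batteries", "NOUN"), ("work", "VERB"), ("great", "ADJ"),
    ("Duracell", "PROPN"), ("batteries", "NOUN"), ("great", "ADJ"),
]


def keyword_counts(tagged_words, pos_tags):
    """Count only the words whose part-of-speech tag is in pos_tags."""
    return Counter(word for word, pos in tagged_words if pos in pos_tags)


with_nouns = keyword_counts(tagged, {"PROPN", "ADJ", "NOUN", "VERB"})
without_nouns = keyword_counts(tagged, {"VERB", "ADJ"})

print(with_nouns.most_common(1))     # [('batteries', 3)]
print(without_nouns.most_common(1))  # [('great', 2)]
```

In this toy data the product noun dominates the table when nouns are included; once they are excluded, the most frequent keyword is an adjective instead.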

In [9]:
def summarize(product):
    # Combine all reviews of the product into a single text.
    df = rd[rd['name'] == product]
    reviewText = df['text'].str.cat(sep='. ')
    reviewDoc = nlp(reviewText)
    # Keep only verbs and adjectives; nouns and proper nouns are excluded.
    pos_tag = ['VERB', 'ADJ']
    words = [token.text for token in reviewDoc
             if not token.is_stop and not token.is_punct and token.pos_ in pos_tag]
    freq_word = Counter(words)
    # Normalize each frequency by that of the most common word.
    max_freq = freq_word.most_common(1)[0][1]
    for word in freq_word:
        freq_word[word] = freq_word[word] / max_freq
    # Score each sentence by summing the normalized frequencies of its keywords.
    sent_strength = {}
    for sent in reviewDoc.sents:
        for word in sent:
            if word.text in freq_word:
                sent_strength[sent] = sent_strength.get(sent, 0) + freq_word[word.text]
    # Keep the three highest-scoring sentences.
    important_sents = nlargest(3, sent_strength, key=sent_strength.get)
    final_sentences = [w.text for w in important_sents]
    # "STOP" marks the sentence boundaries in the output (debugging aid).
    summary = ' STOP '.join(final_sentences)
    return summary

In [10]:
print("Summarizing", products[0])
print(summarize(products[0]))

Summarizing AmazonBasics AAA Performance Alkaline Batteries (36 Count)
They seem to work just as good as the named brand ones I buy at the store and they defiantly work better than the cheap-o ones. STOP I bought various sizes from Amazon and they all work great - just make sure you buy in advance to have them on hand for when you need them instead of running to the store forking over a lot more money for the same performance - I did buy the 9 volt box to replace all my fire alarms in the house - PERFECT! STOP Works great - good product good quality.


In [11]:
print("Summarizing", products[1])
print(summarize(products[1]))

Summarizing AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary
materials.con's noneoverall great batteries great quality great bargainwould buy again yes they seam to be great batteries use in all devices like flashlights STOP Great price and work at least as good (long) as the Duracell batteries I use to buy STOP They seem to work just as good as the named brand ones I buy at the store and they defiantly work better than the cheap-o ones.


These summaries seem shorter and more relevant. However, since sentences are now weighted solely by the occurrences of frequent verbs and adjectives, the summarization algorithm favors sentences that contain many words like "great". This is not a problem in itself, but excessive bias in a summary limits its usefulness.
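
The bias can be reproduced on a toy example. The sentences and their verb/adjective keywords below are made up for illustration and are not taken from the review data.

```python
from collections import Counter

# Hypothetical sentences mapped to their verb/adjective keywords.
reviews = {
    "great batteries great quality great bargain": ["great", "great", "great"],
    "they lasted longer than the branded ones": ["lasted", "longer", "branded"],
    "works great": ["works", "great"],
}

# Normalized frequency: count divided by the count of the most common keyword.
counts = Counter(word for keywords in reviews.values() for word in keywords)
max_count = max(counts.values())

# Sentence score: sum of the normalized frequencies of its keywords.
scores = {
    sentence: sum(counts[word] / max_count for word in keywords)
    for sentence, keywords in reviews.items()
}

for sentence, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:.2f}  {sentence}")
# 3.00  great batteries great quality great bargain
# 1.25  works great
# 0.75  they lasted longer than the branded ones
```

The sentence that repeats "great" three times scores four times as high as the sentence that actually compares battery life, because each repetition adds the full weight of the most frequent keyword.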