In [1]:
import spacy
import pandas
from spacy.language import Language
from collections import Counter
from heapq import nlargest
nlp = spacy.load('en_core_web_sm')

In [2]:
rd = pandas.read_csv('preprossecedData.csv')
products = rd['name'].unique()

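Gensim's summarization function (the gensim.summarization module was removed in Gensim 4.0, so this cell requires an older release)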

In [3]:
from gensim.summarization import summarize

In [4]:
def shorten(product):
    df = rd[rd['name'] == product]
    reviewText = df['text'].str.cat(sep='. ')
    return summarize(reviewText)

In [5]:
print("Summarizing", products[0])
print()
print(shorten(products[0]))

Summarizing AmazonBasics AAA Performance Alkaline Batteries (36 Count)

Bought a lot of batteries for Christmas and the AmazonBasics Cell have been good.
I haven't noticed a difference between the brand name batteries and the Amazon Basic brand.
These do not hold the amount of high power juice like energizer or duracell, but they are half the price..
AmazonBasics AA AAA batteries have done well by me appear to have a good shelf life.
I find amazon basics batteries to be equal if not superior to name brand ones.
When I first started getting the Amazon basic batteries I really liked them.
Use it for my fish tank's light at night and works great, I love how you can easily switch it off and on if you want it on while guests are there..
Thankful that I was able to find on Amazon for a great price and even better shipping.
I don't know if I would buy thus brand again seems like they don't last as long as Duracell.
In my opinion these did not last anywhere near as long as Duracel in things li

In [6]:
print("Summarizing", products[1])
print()
print(shorten(products[1]))

Summarizing AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary

Doesn't seem to last as long in some devices as other brands but good for the cost..
The battery life is much poorer compared to name brands like energizer and Duracell..
These don't last as long as name brand of batteries like Duracell or energizer.
At this time I have a number of Amazon batteries because they are low cost..
This my second order and they seem to work as good as name brand and ship to my door..
This was my second purchase of amazon batteries and they work great.
Just as good or even better than name brand batteries and half the price.
These Amazon batteries did the job although I gave 4star only because I had a few I would say a hand full of batteries that were not as strong or were pretty weak but out of a box of 48 batteries, I will definitely buy again for this priceIm pretty well satisfied.Thank you!.
I find these batteries fail in a short time on items like wireless thermom

The function below is mostly based on the spaCy summarization algorithm in https://medium.com/analytics-vidhya/text-summarization-using-spacy-ca4867c6b744.

It takes a product name as an argument, then combines all reviews about that product into one text, and the sentences in this text are used to produce the summary.

The weight of each sentence is determined by the sum of the normalized frequencies of the words in that sentence.

The normalized frequency of each word is its frequency in the text, divided by the frequency of the most common word.
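The weighting described above can be sketched without spaCy on a toy example. This is only an illustration: the helper names normalized_frequencies and sentence_weight and the three hand-tokenized "sentences" are assumptions made for this sketch, not part of the notebook.

```python
from collections import Counter

def normalized_frequencies(words):
    """Map each word to its count divided by the count of the most common word."""
    counts = Counter(words)
    max_freq = counts.most_common(1)[0][1]
    return {word: count / max_freq for word, count in counts.items()}

def sentence_weight(sentence_words, freqs):
    """Sum the normalized frequencies of all words in one sentence.

    A word that occurs twice in the sentence contributes twice.
    """
    return sum(freqs.get(word, 0.0) for word in sentence_words)

# Toy input: each sentence is already reduced to its content words (hypothetical data)
sentences = [
    ["battery", "last", "long"],
    ["price", "good"],
    ["battery", "good", "battery"],
]
all_words = [word for sentence in sentences for word in sentence]
freqs = normalized_frequencies(all_words)
weights = [sentence_weight(sentence, freqs) for sentence in sentences]
best = max(range(len(sentences)), key=lambda i: weights[i])
```

Here "battery" is the most common word, so it gets a normalized frequency of 1, and the third sentence receives the highest weight because it contains that word twice. Note that longer sentences accumulate more terms and therefore tend to score higher.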

In [7]:
def shorten(product):
    # Combine all reviews of the product into a single text
    df = rd[rd['name'] == product]
    reviewText = df['text'].str.cat(sep='. ')
    reviewDoc = nlp(reviewText)
    # Count the lemmas of content words only (no stop words, no punctuation)
    pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
    words = [ token.lemma_ for token in reviewDoc if not token.is_stop and not token.is_punct and token.pos_ in pos_tag]
    freq_word = Counter(words)
    # Normalize each frequency by the frequency of the most common word
    max_freq = freq_word.most_common(1)[0][1]
    for word in freq_word.keys():
        freq_word[word] = (freq_word[word]/max_freq)
    # Weight each sentence by the sum of the normalized frequencies of its words
    sent_strength = {}
    for sent in reviewDoc.sents:
        for word in sent:
            if word.lemma_ in freq_word.keys():
                if sent in sent_strength.keys():
                    sent_strength[sent] += freq_word[word.lemma_]
                else:
                    sent_strength[sent] = freq_word[word.lemma_]
    # Keep the three sentences with the highest weight
    important_sents = nlargest(3, sent_strength, key=sent_strength.get)
    final_sentences = [ w.text for w in important_sents ]
    summary = ' STOP '.join(final_sentences)#"STOP" included to highlight beginnings of sentences (debug)
    return summary

In [8]:
print("Summarizing", products[0])
print()
print(shorten(products[0]))

Summarizing AmazonBasics AAA Performance Alkaline Batteries (36 Count)

So far so good, I'll try and let folks know if they don't last as long as branded batteries.. bought these batteries because the duracell batteries tend to leak, but these batteries i had to replace in five light sets that the duracellslast months these lasted a week. STOP I used them in an xbox one controller, which can eat up batteries, and received similar battery life to more expensive name brand batteries.. Just started using this pack, and so far they've lasted as long as any leading brand battery. STOP I was buying Duracell ProCell batteries for years as the normal Duracell and Energizer batteries didn't seem to last (mostly use them in my electric shaver), but these Amazon batteries have been more impressive than the ProCell batteries.


In [9]:
print("Summarizing", products[1])
print()
print(shorten(products[1]))

Summarizing AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary

Great price, great battery life, great look, just great, love it :D. Been using Amazon AA and AAA batteries over the past 6 months in a variety of electronic items and have found that they do not last as long as higher quality, higher price batteries. STOP I hope these alkalines stay strong in storage.. last longer then maxell batteries and dont get near as hot when in heavy use as other batteriessize normal AA size.material good aluminum and Alkaline.quality great don't get to hot in full use and last a lot longer then others iv tested.pro's long lasting, don't get to hot, good materials.con's noneoverall great batteries great quality great bargainwould buy again STOP Why pay extra for Duracell batteries when AmazonBasics' batteries will last almost as long I bought a 48-pack of AA batteries for just 12 that's just 25 cents per battery!


With the current definition of the shorten function, some of the weighted sentences are too long.

A possible solution to this problem is to use rule-based sentence segmentation instead of spaCy's default segmentation technique.
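As a rough illustration of what rule-based segmentation means, independent of spaCy, the sketch below splits a text after every '.', '!', '?' or ':' that is followed by whitespace. The function name split_on_punctuation and the example text are made up for this sketch; the whitespace requirement is a simplification of the token-based rule used in the notebook.

```python
import re

def split_on_punctuation(text):
    """Split text after '.', '!', '?' or ':' when followed by whitespace."""
    # The lookbehind keeps the punctuation mark attached to the preceding segment
    parts = re.split(r'(?<=[.!?:])\s+', text)
    return [part for part in parts if part]

segments = split_on_punctuation("Great price. Works well! Would buy again")
```

Unlike a statistical segmenter, a rule like this always breaks at the listed punctuation marks, so a run-on review is cut wherever the reviewer used one of them.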

In [10]:
@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text in [".", "!", "?", ":"]:
            doc[token.i + 1].is_sent_start = True
            continue
        if "." in token.text or "!" in token.text or "?" in token.text or ":" in token.text:
            doc[token.i].is_sent_start = True
    return doc
nlp.add_pipe("set_custom_boundaries", before="parser")

<function __main__.set_custom_boundaries(doc)>

In [11]:
print("Summarizing", products[0])
print()
print(shorten(products[0]))

Summarizing AmazonBasics AAA Performance Alkaline Batteries (36 Count)

I was buying Duracell ProCell batteries for years as the normal Duracell and Energizer batteries didn't seem to last (mostly use them in my electric shaver), but these Amazon batteries have been more impressive than the ProCell batteries. STOP I'm not a battery expert, nor did I compare different types of batteries in some type of battery experiment, but I didn't notice they died out any quicker than your average battery STOP Save your money buy Duracell batteries from Groupon they will stand behind less than quality batteries but Amazon will not once you buy Amazon batteries as has become the case with most of Amazon you are on your on without a paddle.


In [12]:
print("Summarizing", products[1])
print()
print(shorten(products[1]))

Summarizing AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary

Why pay extra for Duracell batteries when AmazonBasics' batteries will last almost as long I bought a 48-pack of AA batteries for just 12 that's just 25 cents per battery! STOP I put a EverReady battery in one, Rayovac battery, one AmazonBasics battery and one Maxell battery purchased from STOP materials.con's noneoverall great batteries great quality great bargainwould buy again yes they seam to be great batteries use in all devices like flashlights, cameras, remotes, ext also get a lot for the price.


The rule-based sentence segmentation results in shorter summaries, though they are somewhat repetitive and not always relevant.

The following definition of the summarization function excludes nouns and proper nouns from the weighting.

In [13]:
def shorten(product):
    df = rd[rd['name'] == product]
    reviewText = df['text'].str.cat(sep='. ')
    reviewDoc = nlp(reviewText)
    pos_tag = ['VERB', 'ADJ']
    words = [ token.lemma_ for token in reviewDoc if token.is_stop != True and token.is_punct != True and token.pos_ in pos_tag]
    freq_word = Counter(words)
    max_freq = freq_word.most_common(1)[0][1]
    for word in freq_word.keys():
        freq_word[word] = (freq_word[word]/max_freq)
    sent_strength = {}
    for sent in reviewDoc.sents:
        for word in sent:
            if word.lemma_ in freq_word.keys():
                if sent in sent_strength.keys():
                    sent_strength[sent] += freq_word[word.lemma_]
                else:
                    sent_strength[sent] = freq_word[word.lemma_]
    important_sents = nlargest(3, sent_strength, key=sent_strength.get)
    final_sentences = [ w.text for w in important_sents ]
    summary = ' STOP '.join(final_sentences)#"STOP" included to highlight beginnings of sentences (debug)
    return summary

In [14]:
print("Summarizing", products[0])
print()
print(shorten(products[0]))

Summarizing AmazonBasics AAA Performance Alkaline Batteries (36 Count)

I bought various sizes from Amazon and they all work great - just make sure you buy in advance to have them on hand for when you need them instead of running to the store forking over a lot more money for the same performance - I did buy the 9 volt box to replace all my fire alarms in the house - PERFECT! STOP Great price, great battery life, great look, just great, love it : STOP Great batteries and great the price is great to I will almost definitely buy these again


In [15]:
print("Summarizing", products[1])
print()
print(shorten(products[1]))

Summarizing AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary

materials.con's noneoverall great batteries great quality great bargainwould buy again yes they seam to be great batteries use in all devices like flashlights, cameras, remotes, ext also get a lot for the price. STOP Great price, great battery life, great look, just great, love it : STOP I bought various sizes from Amazon and they all work great - just make sure you buy in advance to have them on hand for when you need them instead of running to the store forking over a lot more money for the same performance - I did buy the 9 volt box to replace all my fire alarms in the house - PERFECT!


These summaries seem shorter and more relevant. However, the summarization algorithm favors sentences that express a positive sentiment. This motivates two separate summarization functions: one for the positive aspects of a product and another for the negative aspects.

To provide the chatbot with concise descriptions that can be used in conversations, short (or even partial) sentences with noun-adjective pairs are preferable. This requires the sentence segmentation to be redefined so that it also splits sentences at commas.

In [16]:
nlp.remove_pipe("set_custom_boundaries")
@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text in ["," ,".", "!", "?", ":"]:
            doc[token.i + 1].is_sent_start = True
            continue
        if "," in token.text or "." in token.text or "!" in token.text or "?" in token.text or ":" in token.text:
            doc[token.i].is_sent_start = True
    return doc
nlp.add_pipe("set_custom_boundaries", before="parser")

<function __main__.set_custom_boundaries(doc)>

The following two functions do not yet consider noun-adjective pairs; they only restrict the sentence weighting to nouns and adjectives.

In [17]:
#Positive summary
def summarizeP(product):
    df = rd[(rd['name'] == product) & (rd['doRecommend'] == True)]
    reviewText = df['text'].str.cat(sep='. ')
    reviewDoc = nlp(reviewText)
    pos_tag = ['NOUN', 'ADJ']
    words = [ token.lemma_ for token in reviewDoc if token.is_stop != True and token.is_punct != True and token.pos_ in pos_tag]
    freq_word = Counter(words)
    max_freq = freq_word.most_common(1)[0][1]
    for word in freq_word.keys():
        freq_word[word] = (freq_word[word]/max_freq)
    sent_strength = {}
    for sent in reviewDoc.sents:
        for word in sent:
            if word.lemma_ in freq_word.keys():
                if sent in sent_strength.keys():
                    sent_strength[sent] += freq_word[word.lemma_]
                else:
                    sent_strength[sent] = freq_word[word.lemma_]
    important_sents = nlargest(3, sent_strength, key=sent_strength.get)
    final_sentences = [ w.text for w in important_sents ]
    summary = ' STOP '.join(final_sentences)#"STOP" included to highlight beginnings of sentences (debug)
    return summary

In [18]:
#Negative summary
def summarizeN(product):
    df = rd[(rd['name'] == product) & (rd['doRecommend'] == False)]
    reviewText = df['text'].str.cat(sep='. ')
    reviewDoc = nlp(reviewText)
    pos_tag = ['NOUN', 'ADJ']
    words = [ token.lemma_ for token in reviewDoc if token.is_stop != True and token.is_punct != True and token.pos_ in pos_tag]
    freq_word = Counter(words)
    max_freq = freq_word.most_common(1)[0][1]
    for word in freq_word.keys():
        freq_word[word] = (freq_word[word]/max_freq)
    sent_strength = {}
    for sent in reviewDoc.sents:
        for word in sent:
            if word.lemma_ in freq_word.keys():
                if sent in sent_strength.keys():
                    sent_strength[sent] += freq_word[word.lemma_]
                else:
                    sent_strength[sent] = freq_word[word.lemma_]
    important_sents = nlargest(3, sent_strength, key=sent_strength.get)
    final_sentences = [ w.text for w in important_sents ]
    summary = ' STOP '.join(final_sentences)#"STOP" included to highlight beginnings of sentences (debug)
    return summary

In [19]:
print("PSummarizing", products[0])
print()
print(summarizeP(products[0]))

PSummarizing AmazonBasics AAA Performance Alkaline Batteries (36 Count)

great price on good batteries that seem to be just as good as most alkaline batteries I've used. STOP We've been going through a lot of AAA sized batteries and have been able to verify that these Amazon Basics batteries last just as long as other name brand alkaline batteries from Duracell, STOP These batteries seem to hold up as well as any other batteries I've used and they are cheaper than the name brand batteries I had been using


In [20]:
print("NSummarizing", products[0])
print()
print(summarizeN(products[0]))

NSummarizing AmazonBasics AAA Performance Alkaline Batteries (36 Count)

If these batteries continue to perform well and have normal lives we will switch all of our battery purchases from name brand batteries to this brand STOP They're such a good deal that after I got these I replaced the batteries in every device I own that uses batteries (to avoid old batteries corroding and destroying anything). STOP I can't tell you how they are tested or at what point a battery is rejected but I can tell you that there are only certain types of batteries that I will buy because I don't believe in throwing away money and I Amazon batteries are right up there with my other 2 brands.


In [21]:
print("PSummarizing", products[1])
print()
print(summarizeP(products[1]))

PSummarizing AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary

materials.con's noneoverall great batteries great quality great bargainwould buy again yes they seam to be great batteries use in all devices like flashlights, STOP We've been going through a lot of AAA sized batteries and have been able to verify that these Amazon Basics batteries last just as long as other name brand alkaline batteries from Duracell, STOP Great batteries and great the price is great to I will almost definitely buy these again


In [22]:
print("NSummarizing", products[1])
print()
print(summarizeN(products[1]))

NSummarizing AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary

Why pay extra for Duracell batteries when AmazonBasics' batteries will last almost as long I bought a 48-pack of AA batteries for just 12 that's just 25 cents per battery! STOP They're such a good deal that after I got these I replaced the batteries in every device I own that uses batteries (to avoid old batteries corroding and destroying anything). STOP Do you have battery-operated items Do you need AA batteries to power your items These AA batteries will power your items


In [23]:
print("PSummarizing", products[2])
print()
print(summarizeP(products[2]))

PSummarizing AmazonBasics Backpack for Laptops up to 17-inches

This is a nice big bag with lots of pockets and zippers. STOP although for extended use in rain you would need a separate rain cover made of coated waterproof nylon. STOP If I had a choice I'd rather the laptop slot be replaced with a waterproof cooler space as I purchased this for hiking.


In [24]:
print("NSummarizing", products[2])
print()
print(summarizeN(products[2]))

NSummarizing AmazonBasics Backpack for Laptops up to 17-inches

There is mid-sized compartment that is in front of the main compartment and a smaller front compartment with the obligatory organizer/pockets to hold pens, STOP My normal backpack is a smaller one from Timberline which has less compartments but seems to hold a great deal of stuff and not feel nearly as bulky as this one STOP This AmazonBasics laptop is well-made and sturdy with a variety of pocket sizes and places to put stuff.


In [25]:
print("PSummarizing", products[3])
print()
print(summarizeP(products[3]))

PSummarizing AmazonBasics 15.6-Inch Laptop and Tablet Bag

it's a great value for a reasonably low price STOP For the price very good value and its lasting well. STOP This bag is really great value for money,


In [26]:
print("NSummarizing", products[3])
print()
print(summarizeN(products[3]))

NSummarizing AmazonBasics 15.6-Inch Laptop and Tablet Bag

fab case for any laptop would buy another again i got it for my laptop and it is fine and strong plenty of pockets. STOP On arrival I put my dell Inspiron laptop inside and the charger went in the front pocket. STOP arrived on time well packed disappointed with quality does the job it was only for keeping dust off second laptop,


Conclusions:

1. It may be necessary to change the preprocessed review data. As it stands, many of the supposedly negative reviews (doRecommend == False) actually recommend the product.
2. The current summarizeP and summarizeN functions would benefit from favoring sentences that contain noun-adjective pairs.
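One possible way to favor noun-adjective pairs is sketched below. It is a hypothetical illustration, not something the notebook implements: it assumes each sentence is already available as a list of (word, part-of-speech) tuples, uses lowercased words instead of lemmas, only counts an adjective immediately followed by a noun, and the names pair_bonus and score_sentence as well as the bonus value are invented for this sketch.

```python
def pair_bonus(tagged_sentence, bonus=1.0):
    """Return bonus times the number of adjacent ADJ-NOUN pairs in the sentence."""
    pairs = 0
    for (_, pos_a), (_, pos_b) in zip(tagged_sentence, tagged_sentence[1:]):
        if pos_a == "ADJ" and pos_b == "NOUN":
            pairs += 1
    return bonus * pairs

def score_sentence(tagged_sentence, freq_word, bonus=1.0):
    """Frequency-based weight plus a bonus for each adjective-noun pair."""
    base = sum(freq_word.get(word.lower(), 0.0) for word, _ in tagged_sentence)
    return base + pair_bonus(tagged_sentence, bonus)

# Hypothetical pre-tagged sentence and normalized frequencies
tagged = [("great", "ADJ"), ("price", "NOUN"), ("on", "ADP"),
          ("good", "ADJ"), ("batteries", "NOUN")]
freq_word = {"great": 1.0, "price": 0.5}
score = score_sentence(tagged, freq_word)
```

With spaCy, the tuples could be built from token.text and token.pos_, and a dependency-based check might catch pairs that are not adjacent; the size of the bonus relative to the frequency weights would need tuning on the review data.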