To make a first, preliminary selection of relevant review data, we consider the number of reviews per product.

In [1]:
import pandas
import statistics
import spacy
import numpy as np
from spacy.util import minibatch, compounding
#from spacy.tokens import Doc
#from spacy.training import Example
nlp = spacy.load('en_core_web_sm')
df = pandas.read_csv("Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv", usecols=['id', 'name', 'reviews.doRecommend', 'reviews.rating', 'reviews.text'])
productNames = df['name'].unique()

In [2]:
reviewsPerProduct = {}
for name in productNames:
    reviewsPerProduct[name] = len(df[df['name'] == name])

# Selection
Since some products have far more reviews than others, we will for now look only at products with at least 20 reviews, so that individual reviews do not carry too much weight.

In [3]:
# keep only products with at least 20 reviews
reviewsPerProduct = {name: count for name, count in reviewsPerProduct.items() if count >= 20}
relevantProducts = list(reviewsPerProduct.keys())

In [4]:
df2 = df[df['name'].isin(relevantProducts)]
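
The counting and filtering in the cells above can also be written with pandas' `value_counts`. The following is a minimal sketch on a toy DataFrame; the helper name `select_products` and the toy data are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

def select_products(df, min_reviews=20, name_col="name"):
    """Keep only rows whose product has at least `min_reviews` reviews."""
    counts = df[name_col].value_counts()          # reviews per product
    kept = counts[counts >= min_reviews]          # products that pass the cutoff
    selected = df[df[name_col].isin(kept.index)]  # rows of those products
    return selected, kept

# Toy data (hypothetical): product "A" has three reviews, product "B" has one.
toy = pd.DataFrame({"name": ["A", "A", "B", "A"],
                    "reviews.text": ["good", "fine", "bad", "great"]})
selected, kept = select_products(toy, min_reviews=2)
print(len(selected), kept.to_dict())
```

Note that `value_counts` skips missing product names by default, so rows without a name never pass the cutoff.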

The total number of selected reviews is

In [5]:
print(len(df2))

28174


The total number of selected products is

In [6]:
print(len(relevantProducts))

33


The minimum, maximum, and mean number of reviews per product are

In [7]:
print(min(reviewsPerProduct.values()))
print(max(reviewsPerProduct.values()))
print(statistics.mean(reviewsPerProduct.values()))

21
8343
853.7575757575758


The average number of characters in the reviews is

In [8]:
reviewText = df2['reviews.text'].str.cat(sep='')
#words = [ token.lemma_ for token in reviewText if token.is_punct != True ]
print(len(reviewText) / len(df2))

135.91279193582736


The percentage of reviews without a 'recommended' attribute (positive or negative) is

In [9]:
taggedRevAmount = len(df2[df2['reviews.doRecommend'] == False]) + len(df2[df2['reviews.doRecommend'] == True])
print(((len(df2) - taggedRevAmount) / len(df2)) * 100)

43.029033861006596


## Pre-processing
The average review in the preliminary selection is 136 characters long, which is enough to form an informative sentence. Reviews of 30 characters or fewer may be too short to carry significant information, so we filter them out.

In [10]:
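df3 = df2[df2['reviews.text'].str.len() > 30].copy()
# rename by column name rather than by position, so the mapping does not depend on column order
df3 = df3.rename(columns={'reviews.doRecommend': 'doRecommend', 'reviews.rating': 'rating', 'reviews.text': 'text'})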

Since a large share of the reviews (43%) has no attribute indicating whether the reviewer recommends the product, one option would be to derive this attribute from the star rating of each review. A more reliable approach, however, would be to assign the 'doRecommend' attribute based on sentiment analysis. The textcat pipeline component will be used for this purpose.
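
The rating-based alternative mentioned above could look roughly like the sketch below. The cutoff of 4 stars and the helper name `recommend_from_rating` are assumptions made for illustration; the notebook itself does not fix a threshold.

```python
def recommend_from_rating(rating, min_stars=4):
    """Map a star rating to a recommend flag; None stays None (still missing).

    min_stars=4 is an assumed cutoff, not a value taken from the data set.
    """
    if rating is None:
        return None
    return rating >= min_stars

# Hypothetical ratings; None stands for a review without a rating.
ratings = [5, 2, None, 4]
flags = [recommend_from_rating(r) for r in ratings]
print(flags)
```

A rule like this ignores the review text entirely, which is why the text-based classifier is used in what follows.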

As shown below, the total number of reviews that have passed the earlier filtering steps is 24,239.

In [11]:
print(len(df3))

24239


Splitting these into recommending reviews, non-recommending reviews, and reviews without such an attribute, we get the following numbers:

In [12]:
print(len(df3[df3['doRecommend'] == True]))
print(len(df3[df3['doRecommend'] == False]))
print(len(df3[(df3['doRecommend'] != True) & (df3['doRecommend'] != False)]))

15318
731
8190


Since there are only 731 negative reviews, we will have to limit the training data for our textcat model to 731 positive and 731 negative reviews to balance positive and negative training data.

In [13]:
train_pos_df = df3[df3['doRecommend'] == True][:731]
train_neg_df = df3[df3['doRecommend'] == False][:731]
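# DataFrame.append was removed in pandas 2.0; pandas.concat works in old and new versions
train_df = pandas.concat([train_pos_df, train_neg_df])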
train_df['tuples'] = train_df.apply(lambda row: (row['text'],int(row['doRecommend'])), axis=1)
train = train_df['tuples'].tolist()
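
The cell above takes the first 731 rows of each class. A possible variation, sketched here in plain Python with hypothetical names (`balance_classes`, `seed`), draws the examples at random so that the sample does not depend on row order.

```python
import random

def balance_classes(examples, seed=0):
    """Downsample (text, label) pairs so both labels occur equally often."""
    pos = [ex for ex in examples if ex[1] == 1]
    neg = [ex for ex in examples if ex[1] == 0]
    n = min(len(pos), len(neg))          # size of the smaller class
    rng = random.Random(seed)            # fixed seed for reproducibility
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced

# Toy data (hypothetical): three positive reviews, one negative review.
toy = [("great", 1), ("love it", 1), ("works well", 1), ("broke fast", 0)]
balanced = balance_classes(toy)
print(len(balanced), sum(label for _, label in balanced))
```

The result has the same `(text, label)` tuple layout as the `train` list built above.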

Let us now train our textcat model. The following steps are copied entirely from https://www.kaggle.com/poonaml/text-classification-using-spacy.

In [14]:
#functions from spacy documentation
def load_data(limit=0, split=0.8):
    train_data = train
    np.random.shuffle(train_data)
    train_data = train_data[-limit:]
    texts, labels = zip(*train_data)
    cats = [{'POSITIVE': bool(y)} for y in labels]
    split = int(len(train_data) * split)
    return (texts[:split], cats[:split]), (texts[split:], cats[split:])

def evaluate(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)
    tp = 1e-8  # True positives
    fp = 1e-8  # False positives
    fn = 1e-8  # False negatives
    tn = 1e-8  # True negatives
    for i, doc in enumerate(textcat.pipe(docs)):
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:
                continue
            if score >= 0.5 and gold[label] >= 0.5:
                tp += 1.
            elif score >= 0.5 and gold[label] < 0.5:
                fp += 1.
            elif score < 0.5 and gold[label] < 0.5:
                tn += 1
            elif score < 0.5 and gold[label] >= 0.5:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * (precision * recall) / (precision + recall)
    return {'textcat_p': precision, 'textcat_r': recall, 'textcat_f': f_score}

# upper limit on the number of texts to train on; the balanced set built above
# holds only 2 * 731 = 1462 examples, so this limit is never reached
n_texts=30000

# number of training iterations
n_iter=10
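
As a small worked example of the metrics computed by `evaluate`, the sketch below derives precision, recall, and F-score from hand-picked counts. The function name `prf` and the counts are made up for illustration.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (no smoothing term)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# 8 reviews correctly flagged positive, 2 wrongly flagged, 2 positives missed
p, r, f = prf(tp=8, fp=2, fn=2)
print(round(p, 3), round(r, 3), round(f, 3))
```

Unlike `evaluate`, this version does not start the counts at 1e-8, so it raises a `ZeroDivisionError` when a denominator is zero.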

In [15]:
# add the text classifier to the pipeline if it doesn't exist
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'textcat' not in nlp.pipe_names:
    textcat = nlp.create_pipe('textcat')
    nlp.add_pipe(textcat, last=True)
# otherwise, get it, so we can add labels to it
else:
    textcat = nlp.get_pipe('textcat')

# add label to text classifier
textcat.add_label('POSITIVE')

# load the dataset
print("Loading food reviews data...")
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=n_texts)
print("Using {} examples ({} training, {} evaluation)"
      .format(n_texts, len(train_texts), len(dev_texts)))
train_data = list(zip(train_texts,
                      [{'cats': cats} for cats in train_cats]))

Loading food reviews data...
Using 30000 examples (1169 training, 293 evaluation)


In [16]:
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
with nlp.disable_pipes(*other_pipes):  # only train textcat
    optimizer = nlp.begin_training()
    print("Training the model...")
    print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))
    for i in range(n_iter):
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(train_data, size=compounding(4., 32., 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2,
                       losses=losses)
        with textcat.model.use_params(optimizer.averages):
            # evaluate on the dev data split off in load_data()
            scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
        print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}'  # print a simple table
              .format(losses['textcat'], scores['textcat_p'],
                      scores['textcat_r'], scores['textcat_f']))

Training the model...
LOSS 	  P  	  R  	  F  
4.017	0.841	0.731	0.782
1.895	0.883	0.883	0.883
1.540	0.906	0.862	0.883
0.948	0.924	0.841	0.881
1.007	0.919	0.855	0.886
0.677	0.919	0.855	0.886
0.622	0.919	0.855	0.886
0.510	0.912	0.862	0.887
0.400	0.918	0.848	0.882
0.354	0.897	0.897	0.897
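
To make the batching in the training loop more concrete, here is a plain-Python sketch of what a compounding batch-size schedule does. `compounding_sizes` and `make_batches` are simplified stand-ins written for this illustration; they are not spaCy's own implementation.

```python
from itertools import islice

def compounding_sizes(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    size = float(start)
    while True:
        yield min(size, stop)
        size *= compound

def make_batches(items, sizes):
    """Cut items into consecutive batches whose lengths follow `sizes`."""
    it = iter(items)
    for size in sizes:
        batch = list(islice(it, int(size)))
        if not batch:          # input exhausted
            return
        yield batch

# Exaggerated growth factor (2 instead of 1.001) so the effect is visible
batches = list(make_batches(range(20), compounding_sizes(2, 8, 2)))
print([len(b) for b in batches])
```

With the values used in the notebook (4., 32., 1.001) the size grows by a factor of only 1.001 per batch, so it increases slowly from 4.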


In [21]:
# test the trained model
test_text1 = 'This tea is fun to watch as the flower expands in the water. Very smooth taste and can be used again and again in the same day. If you love tea, you gotta try these "flowering teas"'
test_text2 = "I bought this product at a local store, not from this seller. I usually use Wellness canned food, but thought my cat was bored and wanted something new. So I picked this up, knowing that Evo is a really good brand (like Wellness). It is one of the most disgusting smelling cat foods I've ever had the displeasure of using. I was gagging while trying to put it into the bowl.  My cat took one taste and walked away, and chose to eat nothing until I replaced it 12 hours later with some dry food. I would try another flavor of their food - since I know it's high quality - but I wouldn't buy the duck flavor again."
doc = nlp(test_text1)
test_text1, doc.cats

('This tea is fun to watch as the flower expands in the water. Very smooth taste and can be used again and again in the same day. If you love tea, you gotta try these "flowering teas"',
 {'POSITIVE': 0.9809731841087341})

In [22]:
doc2 = nlp(test_text2)
test_text2, doc2.cats

("I bought this product at a local store, not from this seller. I usually use Wellness canned food, but thought my cat was bored and wanted something new. So I picked this up, knowing that Evo is a really good brand (like Wellness). It is one of the most disgusting smelling cat foods I've ever had the displeasure of using. I was gagging while trying to put it into the bowl.  My cat took one taste and walked away, and chose to eat nothing until I replaced it 12 hours later with some dry food. I would try another flavor of their food - since I know it's high quality - but I wouldn't buy the duck flavor again.",
 {'POSITIVE': 0.3162754476070404})

Again, note that the steps above are copied. The last two, however, which test the textcat model, give results that differ markedly from those in the original source: while one would expect the "POSITIVE" score for text 2 to be near 0, it is about 0.32, which suggests that the model trained on our data is more strongly biased towards categorizing reviews as positive.

The next step is to use this model to fill in the "recommended" attribute for the reviews that are missing it.

In [23]:
df3[(df3['doRecommend'] != True) & (df3['doRecommend'] != False)]['text']

0        I order 3 of them and one of the item is bad q...
1        Bulk is always the less expensive way to go fo...
2        Well they are not Duracell but for the price i...
3        Seem to work as well as name brand batteries a...
4        These batteries are very long lasting the pric...
                               ...                        
12620    If your looking for great sound it cannot perf...
12647    I already had a tap upstairs but wanted anothe...
12701    Like DOS, you need to memorize your command an...
12764    It was just a few weeks ago that I was bemoani...
12788    I bought one on Prime day for about 50 shipped...
Name: text, Length: 8190, dtype: object

In [32]:
testerlist = list(nlp.pipe(df3[(df3['doRecommend'] != True) & (df3['doRecommend'] != False)]['text']))

In [33]:
print(testerlist[0].cats)
print(testerlist[1].cats)
print(testerlist[2].cats)

{'POSITIVE': 0.007800657767802477}
{'POSITIVE': 0.9470281600952148}
{'POSITIVE': 0.5396779179573059}
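
The scores above could then be turned into 'doRecommend' values for the untagged reviews. The sketch below shows one way to do this in plain Python; the 0.5 cutoff mirrors the one used in `evaluate`, while the helper name `fill_missing`, the use of `None` for missing values, and the list layout are assumptions for illustration.

```python
def fill_missing(labels, cats, threshold=0.5):
    """Replace None labels with score-based predictions, in order.

    labels: existing flags (True, False, or None when missing)
    cats:   one {'POSITIVE': score} dict per missing label
    """
    scores = iter(cats)
    filled = []
    for label in labels:
        if label is None:
            label = next(scores)["POSITIVE"] >= threshold
        filled.append(label)
    return filled

# Scores rounded from the three outputs above; the labels are toy values.
labels = [True, None, None, False, None]
cats = [{"POSITIVE": 0.0078}, {"POSITIVE": 0.947}, {"POSITIVE": 0.5397}]
print(fill_missing(labels, cats))
```

The third score (about 0.54) sits close to the cutoff, so predictions like it are the least certain.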


In [None]:
#def get_examples:
#    texts, labels = zip(*train_data)
#    for predicted, reference in zip(*train_data):
#        predicted = Doc(nlp.vocab, words=predicted)
#        examplesList.append(Example(predicted, reference))
#    return examplesList