To make a first, preliminary selection of relevant review data, we consider the number of reviews per product.

In [1]:
import pandas
import statistics
import spacy
nlp = spacy.load('en_core_web_sm')
df = pandas.read_csv("Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv", usecols=['id', 'name', 'reviews.doRecommend', 'reviews.rating', 'reviews.text'])
productNames = df['name'].unique()

In [2]:
reviewsPerProduct = {}
for name in productNames:
    reviewsPerProduct[name] = len(df[df['name'] == name])
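
The same per-product counts can also be obtained without an explicit loop. The following is a minimal sketch on made-up toy data (the frame `toy` and its product names are hypothetical), assuming only that `name` is the product column as above.

```python
import pandas

# Hypothetical toy data standing in for the review file
toy = pandas.DataFrame({'name': ['Tablet', 'Tablet', 'Speaker', 'Tablet', 'Speaker', 'Cable']})

# value_counts() counts the rows per distinct product name
# (rows with a missing name are skipped by default)
toy_counts = toy['name'].value_counts().to_dict()
print(toy_counts)
```

The result is a plain dictionary keyed by product name, so it could be used in place of `reviewsPerProduct` in the cells that follow.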

# Selection
Since some products have far more reviews than others, we will for now only look at products with more than 20 reviews, so that individual reviews do not carry too much weight.

In [3]:
# Keep only products with more than 20 reviews (iterate over a copy so we can delete safely)
for key, value in dict(reviewsPerProduct).items():
    if value <= 20:
        del reviewsPerProduct[key]
relevantProducts = list(reviewsPerProduct.keys())

In [4]:
df2 = df[df['name'].isin(relevantProducts)]
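
Counting, thresholding and filtering can also be combined into a single step. The sketch below is one possible way to do this on hypothetical toy data; `toy`, `min_reviews` and `per_row_count` are illustrative names, and the threshold is lowered so the small example shows an effect.

```python
import pandas

# Hypothetical toy data: 4 reviews for 'Tablet', 2 for 'Speaker', 1 for 'Cable'
toy = pandas.DataFrame({
    'name': ['Tablet'] * 4 + ['Speaker'] * 2 + ['Cable'],
    'reviews.rating': [5, 4, 5, 3, 2, 5, 1],
})
min_reviews = 2  # illustrative threshold; the notebook uses 20

# transform('size') broadcasts each product's review count back onto its rows
per_row_count = toy.groupby('name')['name'].transform('size')

# Keep only the reviews of products with more than min_reviews reviews
toy_selected = toy[per_row_count > min_reviews]
print(toy_selected['name'].unique())
```

This avoids building the intermediate dictionary, at the cost of being slightly less explicit about which products were kept.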

The total number of selected reviews is

In [5]:
print(len(df2))

28174


The total number of selected products is

In [6]:
print(len(relevantProducts))

33


The lower and upper bounds as well as the average of the number of reviews per product are

In [7]:
print(min(reviewsPerProduct.values()))
print(max(reviewsPerProduct.values()))
print(statistics.mean(reviewsPerProduct.values()))

21
8343
853.7575757575758


The average number of words in the reviews is

In [8]:
nlp.max_length = 4000000  # raise spaCy's text-length limit (default: 1,000,000 characters)

In [9]:
reviewText = df2['reviews.text'].str.cat(sep=' ')
reviewText = nlp(reviewText)
words = [token.lemma_ for token in reviewText if not token.is_punct]
print(len(words) / len(df2))

26.17377724142827
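
For a quick sanity check of this figure, a rough per-review word count can be computed without spaCy. Note that this is a different, simpler technique: the sketch below splits on whitespace only, so it does not separate punctuation or contractions the way spaCy's tokenizer does, and its result will generally differ somewhat from the number above. The toy texts are made up.

```python
import pandas

# Hypothetical toy reviews
toy = pandas.DataFrame({'reviews.text': ['Great tablet, works well.', 'Too slow', 'Love it!']})

# Whitespace split is only a rough proxy for tokenization
words_per_review = toy['reviews.text'].str.split().str.len()
print(words_per_review.mean())
```

Working per review also yields the distribution of lengths, not just the mean, which may be useful when choosing a minimum length for filtering.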


The percentage of reviews without a 'recommended' value is

In [10]:
taggedRevAmount = len(df2[df2['reviews.doRecommend'] == False]) + len(df2[df2['reviews.doRecommend'] == True])
print(((len(df2) - taggedRevAmount) / len(df2)) * 100)

43.029033861006596
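
The share of untagged reviews can also be expressed directly in terms of missing values. This is a minimal sketch on made-up toy data, assuming that untagged reviews are stored as missing values (NaN/None) in `reviews.doRecommend`, which is how pandas reads empty CSV fields.

```python
import pandas

# Hypothetical toy column: two of five reviews have no recommendation tag
toy = pandas.DataFrame({'reviews.doRecommend': [True, None, False, None, True]})

# isna() marks rows without a value; the mean of the boolean mask is the missing share
missing_pct = toy['reviews.doRecommend'].isna().mean() * 100
print(missing_pct)
```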


## Pre-processing
The average length of a review in the preliminary selection is about 26 words, which is enough to form an informative sentence. Reviews of 30 characters or fewer, however, may be too short to yield significant information, so we filter them out.

Since a large portion of the reviews (43%) have no attribute indicating whether the reviewer recommends the product, it makes sense to fill in these attributes from the star ratings of the reviews. As shown in RatingsVsRecommendations.ipynb, recommending reviews overwhelmingly give 4-star or 5-star ratings, while more negative reviews tend to fall in the 1 to 3 star range. This does not necessarily mean that the majority of 3, 2 or even 1 star reviews are tagged as "not recommended". Nevertheless, we will give 1 to 3 star reviews a "not recommended" attribute and 4 to 5 star reviews a "recommended" attribute, provided they do not already have one.

In [11]:
df3 = df2[df2['reviews.text'].str.len() > 30].copy()
df3.columns = ['id', 'name', 'doRecommend', 'rating', 'text']

In [None]:
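# Tag untagged reviews with a rating above 3 as recommended (each condition needs its own parentheses because & binds tighter than comparisons)
df3.loc[df3['doRecommend'].isna() & (df3['rating'] > 3), 'doRecommend'] = True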
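
The fill-in rule described above covers both directions: 4 to 5 stars become "recommended" and 1 to 3 stars become "not recommended", but only where no tag exists yet. The following is a minimal sketch of that rule on made-up toy data; `toy` and `missing` are illustrative names, and the column names follow the renamed columns of `df3`.

```python
import pandas

# Hypothetical toy data: rows 1, 2 and 4 have no recommendation tag
toy = pandas.DataFrame({
    'doRecommend': [True, None, None, False, None],
    'rating': [5, 4, 2, 1, 3],
})

# Compute the mask once, before any value is filled in
missing = toy['doRecommend'].isna()

# Fill only the untagged rows; existing tags are left untouched
toy.loc[missing & (toy['rating'] >= 4), 'doRecommend'] = True
toy.loc[missing & (toy['rating'] <= 3), 'doRecommend'] = False
print(toy['doRecommend'].tolist())
```

Computing `missing` up front matters: if the mask were recomputed after the first assignment, the logic would still work here, but a single mask makes it explicit that both assignments target the originally untagged reviews.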