Our S3 bucket holds the original k-core review data (from http://jmcauley.ucsd.edu/data/amazon/) for each of 24 Amazon product categories, one file per category, named `reviews_CATEGORY_5.json.gz`. The categories are: Amazon_Instant_Video, Apps_for_Android, Automotive, Baby, Beauty, Books, CDs_and_Vinyl, Cell_Phones_and_Accessories, Clothing_Shoes_and_Jewelry, Digital_Music, Electronics, Grocery_and_Gourmet_Food, Health_and_Personal_Care, Home_and_Kitchen, Kindle_Store, Movies_and_TV, Musical_Instruments, Office_Products, Patio_Lawn_and_Garden, Pet_Supplies, Sports_and_Outdoors, Tools_and_Home_Improvement, Toys_and_Games, and Video_Games.

These are 5-core reduced datasets: they only include reviews of products that have at least five reviews each, written by users who have themselves written at least five reviews.
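To make the 5-core condition concrete, here is a minimal sketch of a k-core filter in pandas. It is not necessarily how the original data was produced; the function name `k_core` and the toy dataframe are illustrative assumptions. Only the column names `reviewerID` and `asin` come from the data itself. The filter has to repeat, because dropping one review can push another user or product below the threshold.

```python
import pandas as pd

def k_core(df, k=5, user_col='reviewerID', item_col='asin'):
    """Drop reviews until every user and every product has at least k reviews."""
    while True:
        # Number of remaining reviews for the user / product of each row.
        user_counts = df[user_col].map(df[user_col].value_counts())
        item_counts = df[item_col].map(df[item_col].value_counts())
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return df
        df = df[keep]

# Toy data (hypothetical): with k=2, product 'c' has a single review, and
# removing it leaves user 'u3' with a single review as well.
toy = pd.DataFrame({
    'reviewerID': ['u1', 'u1', 'u2', 'u2', 'u3', 'u3'],
    'asin':       ['a',  'b',  'a',  'b',  'a',  'c'],
})
core = k_core(toy, k=2)
print(core)
```

In the toy example only the four reviews by `u1` and `u2` survive, which shows why a single filtering pass is not enough.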

Along with the raw JSON files, you will find three additional files for each product category that we generated:

### `CATEGORY.pkl`
This is a pickled dataframe derived from the raw JSON file. The data is the same as in the raw JSON, except that the review helpfulness ratings have been unpacked into two separate columns.

In [1]:
import pandas
df = pandas.read_pickle('../data/amazon_products/dataframes/Apps_for_Android.pkl')
df.head()

| | asin | overall | reviewText | reviewTime | reviewerID | reviewerName | summary | unixReviewTime | reviewRatings | helpfulRatings |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B004A9SDD8 | 3 | Loves the song, so he really couldn't wait to ... | 11 2, 2013 | A1N4O8VOJZTDVB | Annette Yancey | Really cute | 1383350400 | 1 | 1 |
| 1 | B004A9SDD8 | 5 | Oh, how my little grandson loves this app. He'... | 12 5, 2011 | A2HQWU6HUKIEC7 | Audiobook lover "Kathy" | 2-year-old loves it | 1323043200 | 0 | 0 |
| 2 | B004A9SDD8 | 5 | I found this at a perfect time since my daught... | 05 21, 2012 | A1SXASF6GYG96I | Barbara Gibbs | Fun game | 1337558400 | 0 | 0 |
| 3 | B004A9SDD8 | 5 | My 1 year old goes back to this game over and ... | 12 6, 2012 | A2B54P9ZDYH167 | Brooke Greenstreet "Babylove" | We love our Monkeys! | 1354752000 | 4 | 3 |
| 4 | B004A9SDD8 | 5 | There are three different versions of the song... | 02 1, 2014 | AFOFZDTX5UC6D | C. Galindo | This is my granddaughters favorite app on my K... | 1391212800 | 1 | 1 |


 - `reviewerID` - ID of the reviewer, e.g. A2SUAM1J3GNN3B
 - `asin` - ID of the product, e.g. 0000013714. You can use this ID to find the item through the Amazon web UI, too, e.g. https://www.amazon.com/gp/product/0000013714
 - `reviewerName` - name of the reviewer
 - `reviewRatings` - the total number of helpful/unhelpful ratings given to this review
 - `helpfulRatings` - the number of those ratings indicating it was a helpful review
 - `reviewText` - text of the review
 - `overall` - rating of the product
 - `summary` - summary of the review
 - `unixReviewTime` - time of the review (unix time)
 - `reviewTime` - time of the review (raw)
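As an illustration of how the two helpfulness columns relate to the raw data, here is a minimal sketch. It assumes the raw JSON stores helpfulness as a two-element `helpful` list of the form `[helpful votes, total votes]`. The toy dataframe and the `helpfulFraction` column are hypothetical and are not part of the pickled files.

```python
import pandas as pd

# Toy stand-in for the raw JSON records (assumed layout: [helpful, total]).
raw = pd.DataFrame({
    'asin': ['B004A9SDD8'] * 3,
    'helpful': [[1, 1], [0, 0], [3, 4]],
})

# Unpack the list into two integer columns named as in the pickled dataframe.
votes = pd.DataFrame(
    raw['helpful'].tolist(),
    columns=['helpfulRatings', 'reviewRatings'],
    index=raw.index,
)
unpacked = pd.concat([raw.drop(columns='helpful'), votes], axis=1)

# Hypothetical derived column: share of ratings that were "helpful".
# Reviews with no ratings at all get NaN instead of a division by zero.
has_votes = unpacked['reviewRatings'] > 0
unpacked['helpfulFraction'] = (
    unpacked['helpfulRatings'] / unpacked['reviewRatings']
).where(has_votes)
print(unpacked)
```

Masking unrated reviews matters in practice, since many reviews (such as rows 1 and 2 in the sample above) have no helpfulness ratings.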

### `CATEGORY.npy`

This is an (n_reviews x 300) NumPy array in which the ith row is the spaCy document vector for the ith review in the corresponding dataframe. The document vectors are derived from pre-trained spaCy word vectors:

In [2]:
import numpy
vectors = numpy.load('../data/amazon_products/dataframes/Apps_for_Android.npy')
print(vectors.shape)
vectors

(752937, 300)


array([[-0.03018869,  0.24029511, -0.23053358, ..., -0.09692115,
         0.04965657,  0.12173134],
       [-0.05654537,  0.21312688, -0.17486772, ..., -0.00957985,
        -0.03249335,  0.05693251],
       [-0.03627986,  0.17326778, -0.17520565, ..., -0.10917577,
         0.00921377,  0.1276993 ],
       ...,
       [ 0.00871697,  0.22195609, -0.22818609, ..., -0.01824116,
         0.09139968,  0.06807514],
       [ 0.02594237,  0.15006386, -0.144111  , ..., -0.0539003 ,
        -0.03813147,  0.14641456],
       [-0.04135761,  0.19131352, -0.19144626, ..., -0.17829847,
         0.07818279,  0.11182661]], dtype=float32)
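One common use for document vectors like these is finding reviews with similar content. The sketch below ranks rows by cosine similarity to a query row. The helper name `most_similar` and the tiny 2-dimensional array are assumptions for illustration; the real arrays have 300 columns.

```python
import numpy as np

def most_similar(vectors, query_index, top_n=3):
    """Return indices and cosine similarities of the rows closest to one row."""
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0] = 1.0            # leave all-zero rows as zero vectors
    unit = vectors / norms[:, None]    # scale every row to unit length
    scores = unit @ unit[query_index]  # dot product of unit vectors = cosine
    scores[query_index] = -np.inf      # exclude the query row itself
    order = np.argsort(-scores)[:top_n]
    return order, scores[order]

# Toy vectors (hypothetical): row 1 points almost the same way as row 0.
toy_vectors = np.array(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]], dtype=np.float32
)
indices, similarities = most_similar(toy_vectors, query_index=0, top_n=2)
print(indices, similarities)
```

Because the ith row of the array matches the ith row of the dataframe, the returned indices can be passed to `df.iloc` to look up the corresponding reviews.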

### `CATEGORY.nlp.gz`

This contains the NLP parse of each review as a JSON blob, one review per line (the ith line corresponds to the ith row in the dataframe). Each object captures the dependency parse, part-of-speech tags, and any named entities for the review:

In [3]:
from pprint import pprint
import json
import gzip
# Read only the first line, i.e. the parse of the first review.
with gzip.open('../data/amazon_products/dataframes/Apps_for_Android.nlp.gz', 'rt') as f:
    tree = json.loads(f.readline())
pprint(tree, depth=3)

[{'NE': '',
  'POS_coarse': 'VERB',
  'POS_fine': 'VB',
  'arc': 'ROOT',
  'lemma': 'wait',
  'modifiers': [{...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}],
  'word': 'wait'},
 {'NE': '',
  'POS_coarse': 'ADJ',
  'POS_fine': 'JJ',
  'arc': 'ROOT',
  'lemma': 'interesting',
  'modifiers': [{...}, {...}, {...}],
  'word': 'interesting'},
 {'NE': '',
  'POS_coarse': 'VERB',
  'POS_fine': 'VB',
  'arc': 'ROOT',
  'lemma': 'play',
  'modifiers': [{...}, {...}, {...}, {...}, {...}, {...}, {...}, {...}],
  'word': 'play'}]
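Since the file holds one parse per line, it can be streamed rather than loaded all at once. The following is a minimal sketch; the generator name `iter_parses`, the UTF-8 encoding, and the toy file written to a temporary directory are assumptions made so the example runs without the real data.

```python
import gzip
import json
import os
import tempfile

def iter_parses(path):
    """Yield (row_index, parse) pairs, one per line of a .nlp.gz file."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for row_index, line in enumerate(f):
            yield row_index, json.loads(line)

# Write a tiny stand-in file (hypothetical content) in the same layout:
# one JSON list per line.
toy_path = os.path.join(tempfile.mkdtemp(), 'toy.nlp.gz')
with gzip.open(toy_path, 'wt', encoding='utf-8') as f:
    for parse in ([{'word': 'wait'}], [{'word': 'play'}]):
        f.write(json.dumps(parse) + '\n')

rows = list(iter_parses(toy_path))
print(rows)
```

The yielded `row_index` is the position of the matching review in the dataframe, so parses and dataframe rows can be joined without holding every parse in memory.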


It is straightforward to write and apply simple helper functions to extract data from these objects:

In [4]:
def token_count(tree):
    """Count the number of non-punctuation tokens in a document."""
    result = 0
    for token in tree:
        if token['POS_coarse'] != 'PUNCT':
            result += 1
        # Recurse into the tokens that depend on this one.
        result += token_count(token['modifiers'])
    return result

token_count(tree)

42

In [5]:
def named_entities(tree):
    """Extract any named entities from a document."""
    result = []
    for token in tree:
        if token['NE']:
            # Keep the token's own fields but drop its subtree.
            result.append({k: v for k, v in token.items() if k != 'modifiers'})
        result.extend(named_entities(token['modifiers']))
    return result

named_entities(tree)

[{'NE': 'CARDINAL',
  'POS_coarse': 'NUM',
  'POS_fine': 'CD',
  'arc': 'attr',
  'lemma': 'almost 3',
  'word': 'almost 3'}]
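Both helpers above walk the tree the same way, so the traversal can be factored into a generator and reused. This is a sketch only: `iter_tokens`, `pos_counts`, and the toy tree are hypothetical names and data, and it assumes (as the helpers above do) that every token carries a `modifiers` list, empty for leaves.

```python
from collections import Counter

def iter_tokens(tree):
    """Yield every token in a parse, visiting each token before its modifiers."""
    for token in tree:
        yield token
        yield from iter_tokens(token['modifiers'])

def pos_counts(tree):
    """Count tokens by coarse part-of-speech tag."""
    return Counter(token['POS_coarse'] for token in iter_tokens(tree))

# Toy parse (hypothetical) with the same keys as the real objects.
toy_tree = [{
    'word': 'play', 'lemma': 'play', 'POS_coarse': 'VERB', 'POS_fine': 'VB',
    'arc': 'ROOT', 'NE': '',
    'modifiers': [
        {'word': 'kids', 'lemma': 'kid', 'POS_coarse': 'NOUN', 'POS_fine': 'NNS',
         'arc': 'nsubj', 'NE': '', 'modifiers': []},
        {'word': '.', 'lemma': '.', 'POS_coarse': 'PUNCT', 'POS_fine': '.',
         'arc': 'punct', 'NE': '', 'modifiers': []},
    ],
}]
counts = pos_counts(toy_tree)
print(counts)
```

Note that this traversal follows the dependency structure, so it does not necessarily visit tokens in their original word order.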