# <a id='toc1_'></a>[Data Loading](#toc0_)

In this notebook we take samples from several categories of the Amazon reviews dataset and load them into a dataframe.

In [1]:
# get modules in 
import pandas as pd
import gzip
import json
import random
import linecache

**Table of contents**<a id='toc0_'></a>    
- [Data Loading](#toc1_)    
- [Datasets](#toc2_)    
    - [Review Dataset](#toc2_1_1_)    
    - [Product Metadata Dataset](#toc2_1_2_)    
- [Quick look at Reviews in a Product Category (*Example*)](#toc3_)    
- [The Review Dataset and Metadata Dataset](#toc4_)    
  - [Data with Fewer Reviews](#toc4_1_)    
  - [Data with A Lot of Reviews](#toc4_2_)    
      - [Batch 1](#toc4_2_1_1_)    
      - [Batch 2](#toc4_2_1_2_)    
      - [Batch 3](#toc4_2_1_3_)    
    - [Batch 4](#toc4_2_2_)    
    - [Batch 5](#toc4_2_3_)    
    - [Batch 6](#toc4_2_4_)    
    - [Batch 7](#toc4_2_5_)    
    - [Batch 8](#toc4_2_6_)    
    - [Batch 9](#toc4_2_7_)    
    - [Batch 10](#toc4_2_8_)    
    - [Merge Batches (for large reviews data)](#toc4_2_9_)    
  - [Merge Large Reviews with Few Reviews](#toc4_3_)    


# <a id='toc2_'></a>[Datasets](#toc0_)

We have individual datasets for each category. These data have been reduced to the $k$-core, such that each remaining user and item has at least $k$ reviews. The files used here are the 5-core versions ($k = 5$), hence the `_5` suffix in the file names.

- Amazon Fashion	
- All Beauty	
- Appliances	
- Arts, Crafts and Sewing	
- Automotive	
- Books	
- CDs and Vinyl	
- Cell Phones and Accessories	
- Clothing, Shoes and Jewelry	
- Digital Music	
- Electronics	
- Gift Cards	
- Grocery and Gourmet Food	
- Home and Kitchen	
- Industrial and Scientific	
- Kindle Store	
- Luxury Beauty	
- Magazine Subscriptions	
- Movies and TV	
- Musical Instruments	
- Office Products	
- Patio, Lawn and Garden	
- Pet Supplies	
- Prime Pantry	
- Software	
- Sports and Outdoors	
- Tools and Home Improvement	
- Toys and Games	
- Video Games	
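
The $k$-core reduction described above can be sketched as an iterative filter. This is only an illustration of the idea, not the code used to build the published files: the function name `k_core` and the toy data are hypothetical, and the column names `reviewerID` and `asin` are taken from the review dataset described in this notebook.

```python
import pandas as pd


def k_core(df, k, user_col="reviewerID", item_col="asin"):
    """Repeatedly drop rows whose user or item has fewer than k reviews."""
    while True:
        # Number of reviews belonging to the user / item on each row
        user_counts = df[user_col].map(df[user_col].value_counts())
        item_counts = df[item_col].map(df[item_col].value_counts())
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return df.reset_index(drop=True)
        # Dropping rows can push other users or items below k, so loop again
        df = df[keep]


# Hypothetical toy data: item i3 has one review, which also drags user u3 below k
toy = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "asin": ["i1", "i2", "i1", "i2", "i1", "i3"],
})
core = k_core(toy, k=2)
```

The loop matters: removing the single review of `i3` leaves `u3` with only one review, so a second pass is needed before every remaining user and item meets the threshold.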

***

### <a id='toc2_1_1_'></a>[Review Dataset](#toc0_)
The format is one review per line, in JSON.

- **overall**: rating of the product
- **reviewerID**: ID of the reviewer, e.g. A2SUAM1J3GNN3B
- **asin**: ID of the product, e.g. 0000013714
- **reviewerName**: name of the reviewer
- **vote**: helpful votes of the review
- **style**: a dictionary of the product metadata, e.g., "Format" is "Hardcover"
- **reviewText**: text of the review
- **summary**: summary of the review
- **unixReviewTime**: time of the review (unix time)
- **reviewTime**: time of the review (raw)
- **image**: images that users post after they have received the product
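
As a small illustration of this layout, the sketch below parses one review line with the standard library. The record itself is made up for the example (only the field names and the two example IDs come from the list above), and the cleaning of `style` assumes the stray colons and spaces seen in the sample outputs of this notebook.

```python
import json
from datetime import datetime, timezone

# Hypothetical one-line record using the fields listed above
line = json.dumps({
    "overall": 5.0,
    "reviewerID": "A2SUAM1J3GNN3B",
    "asin": "0000013714",
    "style": {"Format:": " Hardcover"},
    "reviewText": "Great book.",
    "summary": "Five Stars",
    "unixReviewTime": 1516838400,
})

review = json.loads(line)

# Strip the trailing colon from keys and the leading space from values
style = {key.strip(" :"): value.strip()
         for key, value in review.get("style", {}).items()}

# unixReviewTime is seconds since 1970-01-01 UTC
review_date = datetime.fromtimestamp(review["unixReviewTime"], tz=timezone.utc).date()

# Optional fields such as `vote` or `image` may be absent from a record
votes = review.get("vote")
```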

***
### <a id='toc2_1_2_'></a>[Product Metadata Dataset](#toc0_)
Each category also has a product metadata file, again with one JSON record per line.

- **asin**: ID of the product, e.g. 0000031852
- **title**: name of the product
- **feature**: bullet-point format features of the product
- **description**: description of the product
- **price**: price in US dollars (at time of crawl)
- **imageURL**: URL of the product image
- **imageURLHighRes**: URL of the high resolution product image
- **related**: related products (also bought, also viewed, bought together, buy after viewing)
- **salesRank**: sales rank information
- **brand**: brand name
- **categories**: list of categories the product belongs to
- **tech1**: the first technical detail table of the product
- **tech2**: the second technical detail table of the product
- **similar**: similar product table
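
The `price` field arrives as a string, and the metadata samples shown later in this notebook include values such as `$23.00` alongside missing entries and stray CSS fragments. A minimal sketch of one way to turn it into a number is given below; `parse_price` is a hypothetical helper, and treating anything that is not a single dollar amount as missing is an assumption, not something the notebook does.

```python
import re

# A single dollar amount such as "$23.00" or "$1,299.99"
_PRICE = re.compile(r"^\$\s*(\d[\d,]*(?:\.\d+)?)$")


def parse_price(value):
    """Return the price as a float, or None if the value is not a plain dollar amount."""
    if not isinstance(value, str):
        return None  # covers None and NaN
    match = _PRICE.match(value.strip())
    if match is None:
        return None  # e.g. scraped CSS such as ".a-box-inner{...}"
    return float(match.group(1).replace(",", ""))


examples = ["$23.00", "$137.58", ".a-box-inner{background-color:#fff}", None]
parsed = [parse_price(v) for v in examples]
```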


***
# <a id='toc3_'></a>[Quick look at Reviews in a Product Category (*Example*)](#toc0_)

For Example: ***The Fashion Dataset*** 
    (AMAZON_FASHION_5.json)

In [3]:
# data path
fashion_data = "/Users/pavansingh/Library/CloudStorage/GoogleDrive-sngpav003@myuct.ac.za/My Drive/Masters 2022/Dissertation/Masters-Dissertation/Data/AMAZON_FASHION_5.json"
fashion_data = pd.read_json(fashion_data, lines=True)
#fashion_data = fashion_data.loc[:,['reviewerName', 'reviewText', 'overall', 'style']]
display(fashion_data.loc[10:14,:])
print("Shape of Data:", fashion_data.shape)

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
10,2,True,"01 25, 2018",A3HX4X3TIABWOV,B000KPIHQ4,"{'Size Name:': ' Men's 6-6.5, Women's 8-8.5', ...",Denise A. Conte,Relieved my Plantar Fascitis for 3 Days. Then ...,These were recommended by my Podiatrist,1516838400,,
11,2,True,"01 5, 2017",AW8UBYMNJ894V,B000KPIHQ4,"{'Size Name:': ' Men's 8-8.5, Women's 10-10.5'...",Cognizant Consumer,This is my 6th pair and they are the best thin...,Not the same as all my other pairs.,1483574400,,
12,5,True,"10 17, 2016",A265UZVOZWTTXQ,B000KPIHQ4,,William_Jasper,We have used these inserts for years. They pr...,Great inserts,1476662400,,
13,5,True,"08 22, 2016",AW8UBYMNJ894V,B000KPIHQ4,,Cognizant Consumer,Pinnacle seems to have more cushioning so my h...,Personal favorite,1471824000,,
14,5,True,"03 23, 2016",A265UZVOZWTTXQ,B000KPIHQ4,,William_Jasper,Excellent insole with good support.,Five Stars,1458691200,,


Shape of Data: (3176, 12)


Below we:

1. Count the missing values in the `style` column of `fashion_data`. `isna()` builds a boolean mask that is `True` wherever `style` is missing (`NaN`), and `sum()` counts the `True` values in that mask.

2. Drop every row of `fashion_data` with a missing `style`, using `dropna()` with `subset=["style"]` so that only that column is checked. Setting `inplace=True` modifies `fashion_data` directly rather than returning a new DataFrame.

3. Sort `fashion_data` by the `overall` column in descending order, using `sort_values()` with `by=['overall']` and `ascending=False`.


In [4]:
# see NA's in style
print(fashion_data['style'].isna().sum())

# remove NA's in style
fashion_data.dropna(subset=["style"], inplace=True)

# Sort resulting dataframe by overall rating
fashion_data.sort_values(by=['overall'], inplace=True, ascending=False)

# show resulting dataset
display(fashion_data.head(10))

# Shape of data
print("Shape of Data:", fashion_data.shape)

69


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,5,True,"09 4, 2015",ALJ66O1Y6SLHA,B000K2PJ4K,"{'Size:': ' Big Boys', 'Color:': ' Blue/Orange'}",Tonya B.,Great product and price!,Five Stars,1441324800,,
1963,5,True,"04 18, 2016",AZRZ2FB7CFNOE,B0092UF54A,"{'Size:': ' 8 B(M) US', 'Color:': ' Black/Whit...",Catherine Uribe,I love my tennis shoes,Five Stars,1460937600,,
1951,5,True,"05 15, 2016",A2KCFRMKVHYSU7,B0092UF54A,"{'Size:': ' 8 B(M) US', 'Color:': ' Ocean Fog/...",Saving star,Really comfy and nice color,Great and comfy for sports,1463270400,,
1953,5,False,"05 10, 2016",A22WG2NE4D47UM,B0092UF54A,"{'Size:': ' 9 B(M) US', 'Color:': ' Black/Wolf...",Amazon Customer,These are the most comfortable shoes I've used...,Best shoes ever!!!,1462838400,,
1954,5,True,"05 9, 2016",A1SC6HVU28ND3D,B0092UF54A,"{'Size:': ' 8.5 B(M) US', 'Color:': ' Black/Wh...",Sarah,Very comfortable and looks great!,Five Stars,1462752000,,
1955,5,True,"05 8, 2016",AJDH2WVLX79KA,B0092UF54A,"{'Size:': ' 6.5 B(M) US', 'Color:': ' Black/Wh...",brookelynne,Love this shoes so comfy and great very everyd...,Fits wonderful,1462665600,,
1958,5,True,"05 4, 2016",A1AWX0M8R6A2I1,B0092UF54A,"{'Size:': ' 12 D(M) US', 'Color:': ' Cool Grey...",CBP,Perfict fit for me. Great looking shoes at a g...,nice,1462320000,,
1959,5,True,"04 23, 2016",AT5OQFDS6PEE1,B0092UF54A,"{'Size:': ' 9.5 B(M) US', 'Color:': ' Black/Wh...",H. Heckstall,The sneakers are very comfortable and fit to s...,Five Stars,1461369600,,
1960,5,True,"04 21, 2016",AOFQAZVA6Q6E7,B0092UF54A,"{'Size:': ' 10 B(M) US', 'Color:': ' Black/Whi...",D. Resendes,I've had these shoes for about a week now and ...,Wide Feet so Somewhat Tight,1461196800,3.0,
1965,5,True,"03 31, 2016",A2TRI54C8EMCX,B0092UF54A,"{'Size:': ' 9 B(M) US', 'Color:': ' Black/Wolf...",Andrea Seo B.,Love it!! Super comfortable and nice!! Got mor...,Love it!! Definetly recommend it,1459382400,,


Shape of Data: (3107, 12)


After dropping the 69 rows with a missing `style`, 3107 of the original 3176 reviews remain in the fashion dataset.

***
# <a id='toc4_'></a>[The Review Dataset and Metadata Dataset](#toc0_)

We have individual datasets for each category. We combine them to generate one larger dataset encompassing all the categories (5-core dataset).

The following function is created to read in large JSON files:

``` py
def read_file(filename, category):
    num_lines = sum(1 for line in open(filename))
    selected_lines = set()
    while len(selected_lines) < min(25000, num_lines):
        line_num = random.randint(1, num_lines)
        if line_num not in selected_lines:
            selected_lines.add(line_num)
            line = linecache.getline(filename, line_num)
            selected_data = json.loads(line)
            selected_data['category'] = category
            yield selected_data
```

1. It counts the lines in the file with `sum(1 for line in open(filename))`.
2. It initializes an empty set called `selected_lines`, which will **store the line numbers that have been selected**.
3. It loops until the number of selected lines reaches `min(25000, num_lines)`, that is, 25,000 lines or the whole file if it is shorter.
4. In each iteration it draws a random line number with `random.randint(1, num_lines)`.
5. If that line number is not already in `selected_lines`, it adds it to the set and reads that line from the file with `linecache.getline(filename, line_num)`.
6. The selected line is parsed as JSON with `json.loads(line)`.
7. The **category** is added to the parsed record.
8. The record is yielded, so `read_file` is a generator and the caller receives the records one at a time.
9. The loop ends once the target number of lines has been selected.
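
The same sampling can also be sketched without the retry loop. The version below is an alternative, not the function used in this notebook: `sample_json_lines` and its `n` and `seed` parameters are hypothetical. It draws all line indices at once with `random.sample`, closes the file after counting, and avoids `linecache`, which reads and caches every line of the file on first access.

```python
import json
import os
import random
import tempfile


def sample_json_lines(filename, category, n=25000, seed=None):
    """Yield up to n randomly chosen records from a JSON-lines file."""
    rng = random.Random(seed)
    # First pass: count the lines, closing the file afterwards
    with open(filename, "r") as f:
        num_lines = sum(1 for _ in f)
    # Draw all line indices at once, without replacement
    chosen = set(rng.sample(range(num_lines), min(n, num_lines)))
    # Second pass: parse only the chosen lines
    with open(filename, "r") as f:
        for index, line in enumerate(f):
            if index in chosen:
                record = json.loads(line)
                record["category"] = category
                yield record


# Tiny demonstration on a throwaway file with made-up records
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    for i in range(10):
        tmp.write(json.dumps({"asin": f"P{i}", "overall": 5.0}) + "\n")

sample = list(sample_json_lines(tmp.name, "demo", n=4, seed=0))
os.remove(tmp.name)
```

One difference in behaviour: this sketch yields the sampled records in file order, whereas `read_file` yields them in the order the random line numbers were drawn.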

The second function, `read_matching_metadata`, is defined as:

```py
def read_matching_metadata(filename, category, product_ids):
    with open(filename, 'r') as file:
        for line in file:
            data = json.loads(line)
            if data['asin'] in product_ids:
                data['category'] = category
                yield data
```

This function reads a JSON file line by line and yields the metadata entries that match a given set of product IDs:

- It takes three parameters: `filename`, `category`, and `product_ids`.
- It opens the file in read mode inside a `with` statement, which ensures the file is closed after reading.
- It iterates over the file one line at a time and parses each line with `json.loads(line)`.
- It checks whether the record's `asin` value is present in the `product_ids` set.
- If there is a match, it adds a `category` key to the record, set to the value of the `category` parameter.
- Finally, it yields the modified record, so the caller can iterate over the matching metadata entries one by one.
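
A short usage sketch may help here. The function body is repeated so the example runs on its own; the three-product metadata file and the `wanted` set are made up for illustration.

```python
import json
import os
import tempfile


def read_matching_metadata(filename, category, product_ids):
    # Same logic as the function above
    with open(filename, 'r') as file:
        for line in file:
            data = json.loads(line)
            if data['asin'] in product_ids:
                data['category'] = category
                yield data


# Hypothetical metadata file with three products
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    for asin, title in [("A1", "Shampoo"), ("B2", "Blush"), ("C3", "Insoles")]:
        tmp.write(json.dumps({"asin": asin, "title": title}) + "\n")

wanted = {"A1", "C3", "Z9"}  # Z9 has no metadata entry
matches = list(read_matching_metadata(tmp.name, "beauty", wanted))
os.remove(tmp.name)

# Product IDs that were requested but not found in the metadata file
missing = wanted - {m["asin"] for m in matches}
```

Passing `product_ids` as a `set` keeps each membership test cheap, and comparing the result against the requested IDs shows which sampled products have no metadata.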


In [2]:
# review data
def read_file(filename, category):
    num_lines = sum(1 for line in open(filename))
    selected_lines = set()
    while len(selected_lines) < min(25000, num_lines):
        line_num = random.randint(1, num_lines)
        if line_num not in selected_lines:
            selected_lines.add(line_num)
            line = linecache.getline(filename, line_num)
            selected_data = json.loads(line)
            selected_data['category'] = category
            yield selected_data

In [3]:
# metadata
def read_matching_metadata(filename, category, product_ids):
    with open(filename, 'r') as file:
        for line in file:
            data = json.loads(line)
            if data['asin'] in product_ids:
                data['category'] = category
                yield data

## <a id='toc4_1_'></a>[Data with Fewer Reviews](#toc0_)



In [43]:
# initialise data list
data = []

# category files - smaller reviews
beauty = "/Users/pavansingh/Desktop/Amazon Review Data/All_Beauty_5.json"
fashion = "/Users/pavansingh/Desktop/Amazon Review Data/AMAZON_FASHION_5.json"
appliances = "/Users/pavansingh/Desktop/Amazon Review Data/Appliances_5.json"
gift_cards = "/Users/pavansingh/Desktop/Amazon Review Data/Gift_Cards_5.json"
industrial = "/Users/pavansingh/Desktop/Amazon Review Data/Industrial_and_Scientific_5.json"
luxury_beauty = "/Users/pavansingh/Desktop/Amazon Review Data/Luxury_Beauty_5.json"
magazine_subscriptions = "/Users/pavansingh/Desktop/Amazon Review Data/Magazine_Subscriptions_5.json"
software = "/Users/pavansingh/Desktop/Amazon Review Data/Software_5.json"

# load each file and join into dataframe
for category, filename in [('beauty', beauty), ('fashion', fashion), ('appliances', appliances), ('gift_cards', gift_cards), ('industrial', industrial), ('luxury_beauty', luxury_beauty), ('magazine_subscriptions', magazine_subscriptions), ('software', software)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data_with_less_reviews = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data_with_less_reviews.shape)
display(data_with_less_reviews.head(5))

# save data_with_less_reviews to csv called few_revs.csv in folder Data
data_with_less_reviews.to_csv('Data/few_revs.csv')

# category value counts
print("Value counts of product reviews per category:\n",data_with_less_reviews['category'].value_counts())

Shape of all data: (78874, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,False,"04 7, 2018",A31URN5S2Q0UJV,B000URXP6E,{'Size:': ' Small'},Boris Jones,Was skeptical at first. The liquid is kind of ...,Awesome quality!,1523059200,beauty,,
1,5.0,True,"02 4, 2014",A31XUJMEDBUGKR,B000URXP6E,{'Size:': ' 23'},Terry V.,Beautiful Beginnings have been the answer to m...,Works great!,1391472000,beauty,,
2,5.0,True,"05 11, 2013",A2XPTXCAX8WLHU,B000URXP6E,{'Size:': ' 263'},Mindy Lipton,My daughter bought this for me because she kno...,Love it,1368230400,beauty,,
3,5.0,True,"02 9, 2017",A2AXHDSJEBEOIB,B0012Y0ZG2,{'Size:': ' 500ml'},D.Marie,"smells delicious, cleans well, rinses off easi...","Smells yummy, cleans well, Will repurchase.",1486598400,beauty,,
4,5.0,True,"03 21, 2016",AXQAIG2XT292S,B00RZYW4RG,,Grandma Mary,"Fast shipping, great price & product. 100% sat...",great price & product,1458518400,beauty,,


Value counts of product reviews per category:
 industrial                25000
luxury_beauty             25000
software                  12805
beauty                     5269
fashion                    3176
gift_cards                 2972
magazine_subscriptions     2375
appliances                 2277
Name: category, dtype: int64


In [44]:
# Read product reviews file and extract productIDs
reviews_df = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/few_revs.csv', low_memory=False)
product_ids = set(reviews_df['asin'])

# Metadata for product categories with less reviews
beauty = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_All_Beauty.json"
fashion = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_AMAZON_FASHION.json"
appliances = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Appliances.json"
gift_cards = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Gift_Cards.json"
industrial = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Industrial_and_Scientific.json"
luxury_beauty = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Luxury_Beauty.json"
magazine_subscriptions = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Magazine_Subscriptions.json"
software = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Software.json"

# Load each metadata file and join into a dataframe
metadata_df_batch1 = []

for category, filename in [('beauty', beauty), ('fashion', fashion), ('appliances', appliances), ('gift_cards', gift_cards), ('industrial', industrial), ('luxury_beauty', luxury_beauty), ('magazine_subscriptions', magazine_subscriptions), ('software', software)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        metadata_df_batch1.append(selected_data)

# to dataframe 
metadata_df_batch1 = pd.DataFrame(metadata_df_batch1)

# Print the resulting metadata dataframe
display(metadata_df_batch1.head(4))

# Value counts
print("Value Counts of products per Category:\n", metadata_df_batch1['category'].value_counts())

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,details,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
0,beauty,,[INDICATIONS: Aqua Velva Cooling After Shave E...,,"Aqua Velva After Shave, Classic Ice Blue, 7 Ounce","[B00J232PCM, B0010V5MKG, B000052Y68, B00KOAIU7...",,Aqua Velva,[],"65,003 in Beauty & Personal Care (","[B01I9TIY1U, B07L1PZCS7, B01N12C89Y, B01I9TINT...",{'  Product Dimensions: ': '3 x 4 x 5 ...,All Beauty,,,,B0000530HU,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
1,beauty,,[<P><STRONG>Restores Moisture to Dehydrated Ha...,,Citre Shine Moisture Burst Shampoo - 16 fl oz,"[B07CSVCGZV, B07KMGC13Z, B0793XJ4WW, B01N7U1HB...",,Citre Shine,[],"1,693,702 in Beauty & Personal Care (",[],"{'ASIN: ': 'B00006L9LC', 'UPC:': '795827187965...",All Beauty,,,$23.00,B00006L9LC,[],[]
2,beauty,,"[A richly pigmented, micronized powder formula...",,"NARS Blush, Taj Mahal",[],,NARS,[],"505,302 in Beauty & Personal Care (","[B07FVJJ39R, B07JBQZDKB, B07HKVJC7G, B010VWL4E...","{'  Item Weight: ': '0.16 ounces', 'Sh...",All Beauty,,,$34.50,B00021DJ32,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
3,beauty,,[Avalon Organics Wrinkle Therapy Cleansing Mil...,,Avalon Organics Wrinkle Therapy CoQ10 Cleansin...,"[B0014407HC, B001ECQ41M, B00503OFIU, B00015XAQ...",,Avalon,[],"141,988 in Beauty &amp; Personal Care (","[B077ZG4C3L, B07DW6ZLFS, B00503OFIU, B07DVZMGL...",{'  Product Dimensions: ': '2.5 x 1.4 ...,All Beauty,,,$8.27,B0002JHI1I,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...


Value Counts of products per Category:
 industrial                5188
luxury_beauty             1613
software                   855
magazine_subscriptions     249
gift_cards                 148
beauty                      89
appliances                  49
fashion                     31
Name: category, dtype: int64


In [45]:
# merge reviews and metadata
reviews_df = reviews_df.merge(metadata_df_batch1, on='asin', how='left').drop(columns=['Unnamed: 0','image', 'category_y', 'fit', 'also_buy', 'tech1', 'tech2', 'also_view', 'details', 'similar_item', 'imageURL', 'imageURLHighRes'])
display(reviews_df.head(4))

# save to csv
reviews_df.to_csv('Data/few_revs_meta.csv')
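
The merge above drops a `category_y` column because both dataframes carry a `category` column, so pandas suffixes the two copies with `_x` and `_y`. The toy example below illustrates this with made-up rows; renaming `category_x` back to `category` is an optional extra step that the notebook itself does not perform.

```python
import pandas as pd

# Hypothetical reviews: two for product A1, one for product B2
reviews = pd.DataFrame({
    "asin": ["A1", "A1", "B2"],
    "overall": [5, 4, 3],
    "category": ["beauty", "beauty", "software"],
})

# Hypothetical metadata: only A1 has an entry
meta = pd.DataFrame({
    "asin": ["A1"],
    "title": ["Shampoo"],
    "category": ["beauty"],
})

# A left merge keeps every review; unmatched products get NaN metadata
merged = reviews.merge(meta, on="asin", how="left")
columns_after_merge = list(merged.columns)

# Keep the reviews' category and restore its original name
merged = merged.drop(columns=["category_y"]).rename(columns={"category_x": "category"})
```

With `how='left'`, the review for `B2` survives the merge with a missing `title`, which is the behaviour the notebook relies on to keep reviews whose product has no metadata.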

## <a id='toc4_2_'></a>[Data with A Lot of Reviews](#toc0_)

We split these categories into 10 batches and load them separately, as the metadata is quite large and takes up a lot of memory.

#### <a id='toc4_2_1_1_'></a>[Batch 1](#toc0_)

- arts_crafts_and_sewing
- automotive

In [47]:
# loading the review data!

data = []

arts_crafts = "/Users/pavansingh/Desktop/Amazon Review Data/Arts_Crafts_and_Sewing_5.json"
automotive = "/Users/pavansingh/Desktop/Amazon Review Data/Automotive_5.json"

# load each file and join into dataframe
for category, filename in [('arts_crafts', arts_crafts), ('automotive', automotive)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save the batch 1 reviews to revs_batch1.csv in folder Data
data.to_csv('Data/revs_batch1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

Shape of all data: (50000, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,category,vote,style,image
0,5.0,True,"03 2, 2018",A2E7LAVOYZO0LX,B003SBGW8C,Donna Matheson,Works great!,Five Stars,1519948800,arts_crafts,,,
1,5.0,True,"10 23, 2013",A3BC8NCB5H9UOC,B004BOY8NG,Jopaloma,The boning was very easy to use. The size was...,Just the ticket!,1382486400,arts_crafts,2.0,{'Size:': ' 12-Yard'},
2,5.0,True,"03 6, 2017",ADPIGCF2FF40K,B005N419GM,"Nurse Deb, PNP",Better quality than expected. This is a great...,Better than expected,1488758400,arts_crafts,,{'Size:': ' Size-US-4-(3.5mm)'},
3,5.0,True,"02 16, 2016",A34RGY9X5ORLK0,B0001VNQRC,Paul Mendez,ty,Five Stars,1455580800,arts_crafts,,,
4,3.0,True,"02 28, 2016",A2VJQ93EHZN9R6,B00TQ6MM9G,TRC,Great set. I do wish they had a threaded cap f...,Three Stars,1456617600,arts_crafts,,,


Value counts of product reviews per category:
 arts_crafts    25000
automotive     25000
Name: category, dtype: int64


In [48]:
# Read product reviews file and extract productIDs
reviews_df = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch1.csv', low_memory=False)
product_ids = set(reviews_df['asin'])

# Metadata for the batch 1 product categories
arts_crafts = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Arts_Crafts_and_Sewing.json"
automotive = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Automotive.json"

# Load each metadata file and join into a dataframe
data = []

for category, filename in [('arts_crafts', arts_crafts), ('automotive', automotive)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        data.append(selected_data)

# to dataframe 
data = pd.DataFrame(data)

# Print the resulting metadata dataframe
display(data.head(4))

# Value counts
print("Value Counts of products per Category:\n", data['category'].value_counts())

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,details,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
0,arts_crafts,"class=""a-keyvalue prodDetTable"" role=""present...",[Build your very own film archive. Fully embos...,,"Moleskine Passion Journal - Film, Large, Hard ...","[1926892801, 8862933193, 8862933118]",,Moleskine,[Used Book in Good Condition],"[>#229,441 in Office Products (See top 100), >...","[8862933193, B001KN2B08, 1948713047, 886293315...",{},Office Products,,"November 3, 2009",$137.58,8862933177,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
1,arts_crafts,,[100 Pcs Swarovski Crystal Rondelle Spacer Bea...,,100 Pcs Swarovski Crystal Rondelle Spacer Bead...,[],,Swarovski,[100 Pcs Swarovski Crystal Rondelle Spacer Bea...,"[>#251,114 in Arts, Crafts & Sewing (See Top 1...",[],{},"Arts, Crafts & Sewing",,,,9578232225,[],[]
2,arts_crafts,,"[<BR>Free Shipping to Worldwide, provided with...",,100 Pcs Swarovski Crystal Rondelle Spacer Bead...,"[B00T9G8WPG, B00T9GACU4, B00LL371DY, B0749HJ4D...",,Swarovski,[100 Pcs Swarovski Crystal Rondelle Spacer Bea...,"[>#211,302 in Arts, Crafts & Sewing (See Top 1...",[B00T9G8WPG],{},"Arts, Crafts & Sewing",,,,9628676717,[],[]
3,arts_crafts,,"[Beautiful Kettle dyed yarn showing blues, pur...",,Malabrigo Sock yarn (416 - Indecita),"[B00DX89JFI, B002JQ07L0]",,Malabrigo,[],"[>#94,694 in Home & Kitchen (See Top 100 in Ho...","[B079RSX1NR, B079S3NVJP, B07GBFK5TM, B07G7JRST...",{},Amazon Home,,,.a-box-inner{background-color:#fff}#alohaBuyBo...,9974314372,[],[]


Value Counts of products per Category:
 automotive     16677
arts_crafts    10906
Name: category, dtype: int64


In [49]:
# merge reviews and metadata
reviews_df = reviews_df.merge(data, on='asin', how='left').drop(columns=['Unnamed: 0','image', 'category_y', 'fit', 'also_buy', 'tech1', 'tech2', 'also_view', 'details', 'similar_item', 'imageURL', 'imageURLHighRes'])
display(reviews_df.head(4))

# save to csv
reviews_df.to_csv('Data/revs_meta_batch1.csv')


#### <a id='toc4_2_1_2_'></a>[Batch 2](#toc0_)

- cds_and_vinyl
- cell_phones_and_accessories

In [50]:
# loading the review data!

data = []

cds_and_vinyl = "/Users/pavansingh/Desktop/Amazon Review Data/CDs_and_Vinyl_5.json"
cell_phones = "/Users/pavansingh/Desktop/Amazon Review Data/Cell_Phones_and_Accessories_5.json"

# load each file and join into dataframe
for category, filename in [('cds_and_vinyl', cds_and_vinyl), ('cell_phones', cell_phones)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save the batch 2 reviews to revs_batch2.csv in folder Data
data.to_csv('Data/revs_batch2.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

Shape of all data: (50000, 13)


Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,image
0,4.0,11.0,False,"10 14, 2007",A34MBR8R6XP87,B000SFZ05C,{'Format:': ' Audio CD'},A. Boller,This is a 2 CD set and while first CD is the s...,The CULT are back!,1192320000,cds_and_vinyl,
1,5.0,,False,"06 29, 2005",ABQWWY9HXLUJG,B00095L8NY,{'Format:': ' Audio CD'},PAMetalFAN,Best Metalcore CD of the year. The Agony Scen...,Add this to your collection ASAP,1120003200,cds_and_vinyl,
2,5.0,,False,"11 28, 2007",A3UYICC4TMM2O0,B000002OP6,{'Format:': ' Audio CD'},V. A. Peek,I can't believe the Amazon reviewer found this...,Relaxing Is What it is,1196208000,cds_and_vinyl,
3,5.0,,True,"01 16, 2013",A2UE4V6W2LO616,B000WCBPBO,,E M,All I can say is that I think Andy Williams ha...,Love! Love Love!,1358294400,cds_and_vinyl,
4,5.0,,True,"05 4, 2013",A1T4JEK323BYYO,B00000C3VM,{'Format:': ' Audio CD'},Carl R. Kannady,I like piano music and I like the way Liberace...,Music,1367625600,cds_and_vinyl,


Value counts of product reviews per category:
 cds_and_vinyl    25000
cell_phones      25000
Name: category, dtype: int64


In [51]:
# Read product reviews file and extract productIDs
reviews_df = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch2.csv', low_memory=False)
product_ids = set(reviews_df['asin'])

# Metadata for the batch 2 product categories
cds_and_vinyl = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_CDs_and_Vinyl.json"
cell_phones = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Cell_Phones_and_Accessories.json"

# Load each metadata file and join into a dataframe
data = []

for category, filename in [('cds_and_vinyl', cds_and_vinyl), ('cell_phones', cell_phones)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        data.append(selected_data)

# to dataframe 
data = pd.DataFrame(data)

# Print the resulting metadata dataframe
display(data.head(4))

# Value counts
print("Value Counts of products per Category:\n", data['category'].value_counts())

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,cds_and_vinyl,,"[This is a concept album all the way, with tal...",,Christmas Eve and Other Stories,"[B01M0L3X65, B07G1V9Q3X, B00000AEDW, B0002ZDVG...",,Trans-Siberian Orchestra,[],73 in CDs & Vinyl (,"[B01M0L3X65, B07G1V9Q3X, B0002ZDVGS, B00000AED...","<img src=""https://images-na.ssl-images-amazon....",,,$5.98,5164885,[],[],
1,cds_and_vinyl,,"[1. Jesus Lord Of The Way I Feel, 2. Jehoshaph...",,Forgiven,"[B000025Q0M, B003H8F4NA, B003ZFVHPO, B003JMP1Z...",,Don Francisco,[],"369,849 in CDs & Vinyl (","[B003H8F4NA, B000025Q0M, B003JMP1ZK, 076013588...","<img src=""https://images-na.ssl-images-amazon....",,,,5465079,[],[],
2,cds_and_vinyl,,[run time 78 minAn accident or an illness---an...,,Escape from Hell,"[0967680670, 0967680689, B001AYJ2Y0, B00REG9GE...",,Daniel Kruse,[],"48,283 in Movies & TV (","[B00097E6NQ, 5559921017, 0967680670, 096768068...",Movies & TV,,,$9.85,967680654,[],[],
3,cds_and_vinyl,,[],,Chickenfoot III: Classic Rock,"[B0090PX4KE, B01N9URZKF, B00VM5HOHY, B01M4NTJS...",,Chickenfoot,[],"2,174,797 in CDs & Vinyl (","[B0090PX4KE, B01N9URZKF, B00VM5HOHY, B005PYAXE...","<img src=""https://images-na.ssl-images-amazon....",,,,1858704553,[],[],


Value Counts of products per Category:
 cds_and_vinyl    18592
cell_phones      14313
Name: category, dtype: int64


In [52]:
# merge reviews and metadata
reviews_df = reviews_df.merge(data, on='asin', how='left').drop(columns=['Unnamed: 0','image', 'category_y', 'fit', 'also_buy', 'tech1', 'tech2', 'also_view', 'details', 'similar_item', 'imageURL', 'imageURLHighRes'])
display(reviews_df.head(4))

# save to csv
reviews_df.to_csv('Data/revs_meta_batch2.csv')

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category_x,description,title,brand,feature,rank,main_cat,date,price
0,4.0,11.0,False,"10 14, 2007",A34MBR8R6XP87,B000SFZ05C,{'Format:': ' Audio CD'},A. Boller,This is a 2 CD set and while first CD is the s...,The CULT are back!,1192320000,cds_and_vinyl,[Legendary Alternative Rock band The Cult retu...,Born Into This,The Cult,[],"291,811 in CDs & Vinyl (","<img src=""https://images-na.ssl-images-amazon....",,$29.99
1,5.0,,False,"06 29, 2005",ABQWWY9HXLUJG,B00095L8NY,{'Format:': ' Audio CD'},PAMetalFAN,Best Metalcore CD of the year. The Agony Scen...,Add this to your collection ASAP,1120003200,cds_and_vinyl,[Cut from the same cloth as their labelmates C...,Darkest Red,The Agony Scene,[],"70,512 in CDs & Vinyl (","<img src=""https://images-na.ssl-images-amazon....",,$12.03
2,5.0,,False,"11 28, 2007",A3UYICC4TMM2O0,B000002OP6,{'Format:': ' Audio CD'},V. A. Peek,I can't believe the Amazon reviewer found this...,Relaxing Is What it is,1196208000,cds_and_vinyl,[Gill was at his very peak of country stardom ...,Let There Be Peace On Earth,Vince Gill,[],"122,098 in CDs & Vinyl (","<img src=""https://images-na.ssl-images-amazon....",,
3,5.0,,True,"01 16, 2013",A2UE4V6W2LO616,B000WCBPBO,,E M,All I can say is that I think Andy Williams ha...,Love! Love Love!,1358294400,cds_and_vinyl,[It's pretty obvious that <i>Ratatouille</i> s...,Bee Movie: Music From The Motion Picture,Rupert Gregson-Williams,[],"427,087 in CDs & Vinyl (","<img src=""https://images-na.ssl-images-amazon....",,.a-section.a-spacing-mini{margin-bottom:6px!im...




#### <a id='toc4_2_1_3_'></a>[Batch 3](#toc0_)

- clothing_shoes_and_jewelry
- digital_music

In [53]:
# loading the review data!

data = []

clothing_shoes_and_jewelry = "/Users/pavansingh/Desktop/Amazon Review Data/Clothing_Shoes_and_Jewelry_5.json"
digital_music = "/Users/pavansingh/Desktop/Amazon Review Data/Digital_Music_5.json"

# load each file and join into dataframe
for category, filename in [('clothing_shoes_and_jewelry', clothing_shoes_and_jewelry), ('digital_music', digital_music)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch3.csv')

# category value counts
print("Value counts of product reviews per category:\n", data['category'].value_counts())

Shape of all data: (50000, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,True,"08 8, 2017",A22D4CE4Q72CT5,B019P5X156,{'Size:': ' 8'},KAREN,"Highly recommend, super comfortable for my 4 y...",Five Stars,1502150400,clothing_shoes_and_jewelry,,
1,1.0,True,"02 23, 2015",A1I6P8JR71PDSA,B00F93KOCW,"{'Size:': ' 8.5 B(M) US', 'Color:': ' Nude Pat...",Sandra Duarte,Too SMALL,One Star,1424649600,clothing_shoes_and_jewelry,,
2,1.0,True,"06 9, 2013",A1YOJIDLPAX54D,B008V1XKCK,"{'Size:': ' 9 B(M) US', 'Color:': ' Light Brown'}",Gizmo,"Historically Desert Boots were the cozy, comfo...",Not as hoped,1370736000,clothing_shoes_and_jewelry,4.0,
3,5.0,True,"03 16, 2015",AR4R0I0ISSUJ5,B0002TOZ1E,"{'Size:': ' 13-15 (Shoe Size 12-16)', 'Color:'...",Woody1,I love the Gold Toe brand. I have other styles...,The best Socks,1426464000,clothing_shoes_and_jewelry,,
4,4.0,True,"06 29, 2016",A3GNY9F7BZAR1N,B01EGR47ZG,"{'Size:': ' X-Large', 'Color:': ' Black'}",Max Credits,Wish it was just a little larger but otherwise...,"Lightweight sweat, nice.",1467158400,clothing_shoes_and_jewelry,,


Value counts of product reviews per category:
 clothing_shoes_and_jewelry    25000
digital_music                 25000
Name: category, dtype: int64
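
The load-and-append loop in this cell is repeated for every batch. A minimal sketch of how it could be wrapped in one helper is given here. `load_batch` and the stand-in `fake_reader` are hypothetical names; the only thing assumed about the notebook's `read_file` is what the cell above shows, namely that it is called as `read_file(filename, category)` and yields one dict per sampled review.

```python
import pandas as pd


def load_batch(sources, reader):
    """Collect records from several files into one DataFrame.

    sources: dict mapping category name -> file path
    reader:  callable(filename, category) yielding dicts (e.g. read_file)
    """
    records = []
    for category, filename in sources.items():
        records.extend(reader(filename, category))
    return pd.DataFrame(records)


# Stand-in reader so the sketch runs without the real files (hypothetical).
def fake_reader(filename, category):
    for i in range(3):
        yield {"asin": f"{category[:2].upper()}{i}", "overall": 5.0, "category": category}


batch = load_batch(
    {"clothing_shoes_and_jewelry": "a.json", "digital_music": "b.json"},
    fake_reader,
)
print(batch["category"].value_counts())
```

With the real reader, the call would pass `read_file` in place of `fake_reader` and the two file paths defined in the cell.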


In [55]:
# Read product reviews file and extract productIDs
reviews_df = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch3.csv', low_memory=False)
product_ids = set(reviews_df['asin'])

# Metadata files for the two categories in this batch
clothing_shoes_and_jewelry = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Clothing_Shoes_and_Jewelry.json"
digital_music = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Digital_Music.json"

# Load each metadata file and join into a dataframe
data = []

for category, filename in [('clothing_shoes_and_jewelry', clothing_shoes_and_jewelry), ('digital_music', digital_music)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        data.append(selected_data)

# to dataframe 
data = pd.DataFrame(data)

# Print the resulting metadata dataframe
display(data.head(4))

# Value counts
print("Value Counts of products per Category:\n", data['category'].value_counts())

Unnamed: 0,category,description,title,also_buy,feature,also_view,main_cat,date,price,asin,fit,rank,imageURL,imageURLHighRes,brand,tech1,details,similar_item,tech2
0,clothing_shoes_and_jewelry,[Includes One Broom. This broom goes great wit...,Adult Witch Broom,"[B001CK3ON2, B000YPMJF0, B000WCXJPO, B00EIMD6H...",[Brand new authentic licensed Pegan Witch broo...,"[B00404R58W, B01LXD6YDV, B00R9X9Q56, B00BBPLQU...",Toys & Games,5 star,$4.99,B00001TOXD,,,,,,,,,
1,clothing_shoes_and_jewelry,,Disguise Women's The Nightmare Before Christma...,"[B00GRO07I8, B07GZ9G1DS, B01KGJAOIW, B019NC0BD...","[100% Polyester, Imported, Hand Wash, Detachab...","[B003O68AK4, B06X9BGCQL, B003O68AJU, B00CXOLJ1...",,5 star,$19.93 - $85.52,B0000696B9,"class=""a-normal a-align-center a-spacing-smal...","191,283inClothing,ShoesJewelry(",,,,,,,
2,clothing_shoes_and_jewelry,"[The iconic, timeless chuck taylor all star sn...",Converse Chuck Taylor All Star Canvas Low Top ...,"[B074CTQWYS, B078H9GNKW, B078HCG2FG, B01G2N1WK...","[100% Textile, Imported, Rubber sole, Shaft me...","[B074CTQWYS, B078H9GNKW, B078HCG2FG, B01M9C1PD...",,5 star,$18.29 - $189.99,B00006XXGO,"class=""a-normal a-align-center a-spacing-smal...","858inClothing,ShoesJewelry(",[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,,,,,
3,clothing_shoes_and_jewelry,"[The iconic, timeless chuck taylor all star sn...",Converse Chuck Taylor All Star High Top,"[B07HVFM4CG, B07C8V9T7T, B07B5FVT5K, B07CQB7P5...","[100% Textile, Imported, Rubber sole, Shaft me...","[B07C8V9T7T, B07B5FVT5K, B07HVFM4CG, B0741XXSR...",,5 star,$29.55 - $160.95,B000072US4,"class=""a-normal a-align-center a-spacing-smal...","412inClothing,ShoesJewelry(",[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,,,,,


Value Counts of products per Category:
 clothing_shoes_and_jewelry    19629
digital_music                   132
Name: category, dtype: int64
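
These counts are metadata rows, not reviews: 132 `digital_music` products were matched, against 19,629 for `clothing_shoes_and_jewelry`. That may mean many `digital_music` reviews end up with empty metadata columns after the left merge. One way to quantify this is to compute, per category, the share of reviews whose `asin` appears in the metadata. The sketch below uses small invented frames; `metadata_coverage` is a hypothetical helper, and the column names `asin` and `category` are taken from the outputs above.

```python
import pandas as pd


def metadata_coverage(reviews, meta):
    """Share of reviews per category whose asin appears in the metadata."""
    has_meta = reviews["asin"].isin(set(meta["asin"]))
    return has_meta.groupby(reviews["category"]).mean()


# Toy data for illustration only (not taken from the real files).
reviews = pd.DataFrame({
    "asin": ["A1", "A2", "A3", "D1", "D2", "D3"],
    "category": ["clothing"] * 3 + ["digital_music"] * 3,
})
meta = pd.DataFrame({"asin": ["A1", "A2", "A3", "D1"]})

coverage = metadata_coverage(reviews, meta)
print(coverage)
```

A value well below 1.0 for a category would indicate that most of its reviews have no product metadata to join against.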


In [56]:
# merge reviews and metadata
reviews_df = reviews_df.merge(data, on='asin', how='left').drop(columns=['Unnamed: 0','image', 'category_y', 'fit', 'also_buy', 'tech1', 'tech2', 'also_view', 'details', 'similar_item', 'imageURL', 'imageURLHighRes'])
display(reviews_df.head(4))

# save to csv
reviews_df.to_csv('Data/revs_meta_batch3.csv')

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category_x,vote,description,title,feature,main_cat,date,price,rank,brand
0,5.0,True,"08 8, 2017",A22D4CE4Q72CT5,B019P5X156,{'Size:': ' 8'},KAREN,"Highly recommend, super comfortable for my 4 y...",Five Stars,1502150400,clothing_shoes_and_jewelry,,[Quality:Feathers' 100% super soft premium cot...,Feathers Boys White Tank 100% Cotton Super Sof...,[Made from 100% combed cotton / Machine wash t...,,5 star,$13.99 - $14.49,"32,745inClothing,ShoesJewelry(",Feathers
1,1.0,True,"02 23, 2015",A1I6P8JR71PDSA,B00F93KOCW,"{'Size:': ' 8.5 B(M) US', 'Color:': ' Nude Pat...",Sandra Duarte,Too SMALL,One Star,1424649600,clothing_shoes_and_jewelry,,"[Jessica Simpson is famous for her fun, sexy s...",Jessica Simpson Women's Bianca Platform Pump,"[100% Synthetic, Imported, Synthetic sole, Hee...",,5 star,,"2,267,530inClothing,ShoesJewelry(",
2,1.0,True,"06 9, 2013",A1YOJIDLPAX54D,B008V1XKCK,"{'Size:': ' 9 B(M) US', 'Color:': ' Light Brown'}",Gizmo,"Historically Desert Boots were the cozy, comfo...",Not as hoped,1370736000,clothing_shoes_and_jewelry,4.0,"[Hi-top desert booties have clean, slimming li...",Breckelle's Women's Sandy-61 Desert Ankle Boot,"[Synthetic, Rubber Sole, Faux Leather Upper, G...",,5 star,$35.00,"1,062,352inClothing,ShoesJewelry(",Breckelle's
3,5.0,True,"03 16, 2015",AR4R0I0ISSUJ5,B0002TOZ1E,"{'Size:': ' 13-15 (Shoe Size 12-16)', 'Color:'...",Woody1,I love the Gold Toe brand. I have other styles...,The best Socks,1426464000,clothing_shoes_and_jewelry,,[Premium comfortable cotton crew length socks ...,"Gold Toe Men's Crew 656s Athletic Sock, 6 Pack...","[79% Cotton, 11% Polyester, 9% Nylon, 1% Spand...",,5 star,$14.00,"83inClothing,ShoesJewelry(",
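
The `price` column in the merged batches is a string that takes several forms: a single amount (`$35.00`), a range (`$13.99 - $14.49`), an empty value, and in the Batch 2 output even leaked CSS text. A hedged sketch of turning these into numbers follows. `parse_price` is a hypothetical helper, and returning a `(low, high)` pair is a design assumption rather than something the notebook prescribes.

```python
import re

# A dollar sign followed by digits, optional thousands commas and decimals.
_PRICE = re.compile(r"\$(\d[\d,]*(?:\.\d+)?)")


def parse_price(value):
    """Return (low, high) in dollars, or None if no dollar amount is found."""
    if not isinstance(value, str):
        return None  # NaN or missing value
    amounts = [float(m.replace(",", "")) for m in _PRICE.findall(value)]
    if not amounts:
        return None  # e.g. leaked CSS text instead of a price
    return min(amounts), max(amounts)


print(parse_price("$13.99 - $14.49"))
print(parse_price("$35.00"))
print(parse_price(".a-section.a-spacing-mini{margin-bottom:6px"))
```

It could be applied with `reviews_df['price'].apply(parse_price)` once the merge has run.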



### <a id='toc4_2_2_'></a>[Batch 4](#toc0_)

- electronics
- musical_instruments

In [57]:
# loading the review data!

data = []

electronics = "/Users/pavansingh/Desktop/Amazon Review Data/Electronics_5.json"
musical_instruments = "/Users/pavansingh/Desktop/Amazon Review Data/Musical_Instruments_5.json"

# load each file and join into dataframe
for category, filename in [('electronics', electronics), ('musical_instruments', musical_instruments)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch4.csv')

# category value counts
print("Value counts of product reviews per category:\n", data['category'].value_counts())

Shape of all data: (50000, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,4.0,True,"03 29, 2016",A1172NWLKHRG5D,B00FO0IHMY,{'Format:': ' Electronics'},samuel chavez,They work great!,Four Stars,1459209600,electronics,,
1,5.0,True,"08 30, 2016",A1SYD5RU6MQ5VY,B01G8BTYG0,{'Color:': ' 9 Pack - Black'},michael,Just used the three cable holder. Really pleas...,Really pleased. Works as described,1472515200,electronics,,
2,5.0,True,"12 28, 2016",A94GLWW4LODM3,B018BCJKE0,,Paul Hill,Got here on time works great highly recommend ...,Awesome,1482883200,electronics,,
3,4.0,True,"05 18, 2011",AL2BA3R3KQXNB,B003WGM7FU,,Alan Mushnick,I bought this case to replace this one\n(...)\...,perfectly fine case,1305676800,electronics,,
4,2.0,True,"09 11, 2014",AOPZOOPHZ8VOL,B00IN8VYC4,,Carl H,This product turned out to defective after 1 1...,Before that - the camera worked well and is ex...,1410393600,electronics,3.0,


Value counts of product reviews per category:
 electronics            25000
musical_instruments    25000
Name: category, dtype: int64


In [58]:
# Read product reviews file and extract productIDs
reviews_df = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch4.csv', low_memory=False)
product_ids = set(reviews_df['asin'])

# Metadata files for the two categories in this batch

electronics = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Electronics.json"
musical_instruments = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Musical_Instruments.json"

# Load each metadata file and join into a dataframe
data = []

for category, filename in [('electronics', electronics), ('musical_instruments', musical_instruments)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        data.append(selected_data)

# to dataframe 
data = pd.DataFrame(data)

# Print the resulting metadata dataframe
display(data.head(4))

# Value counts
print("Value Counts of products per Category:\n", data['category'].value_counts())

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,electronics,"class=""a-keyvalue prodDetTable"" role=""present...",[Genuine Replacement Cable Specifically design...,,Barnes &amp; Noble Replacement Charging Sync C...,[],,Barnes & Noble,[Detachable charging cable specific for your N...,[>#40 in Electronics > eBook Readers & Accesso...,"[B06ZZB2W2X, B01N5F6RNV, B01CZTZZVM, B00940BV1...",Computers,,"November 13, 2014",$34.88,059449771X,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
1,electronics,,[The videosecu TV mount is a mounting solution...,,"VideoSecu 24"" Long Arm TV Wall Mount Low Profi...","[B000WYVBR0, B003O1UYHG, B002YV4WJS, B071HW7GS...",,VideoSecu,"[Fits most 22"" to 47"" HDTV and some up to 55"" ...",[>#176 in Electronics &gt; Accessories &amp; S...,[],All Electronics,"class=""a-bordered a-horizontal-stripes a-spa...","February 25, 2007",$34.99,0972683275,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
2,electronics,,[What's Included? (1) Nook Simple Touch with G...,,Barnes and Noble Nook Simple Touch eBook Reade...,"[B016F1SVVM, 1616855711, B00KBPQHMO, 161682537...",,Nook,"[Box Content - eReader, microUSB Cable, Power ...","[>#62,776 in Electronics (See Top 100 in Elect...","[B077Y84B2C, B01MYQWLTV, 140053271X, B07BNGJXG...",All Electronics,"class=""a-bordered a-horizontal-stripes a-spa...","May 31, 2012",,1400501717,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
3,electronics,,[<b>This Nook 1st Edition E-reader has been Fa...,,Barnes and Noble NOOK eBook Reader (WiFi only)...,[],,Barnes &amp; Noble,[Condition: Refurbished],"[>#176,377 in Electronics (See Top 100 in Elec...","[140053271X, B00EM3WGYY, B004D1OBFW, B077Y84B2...",All Electronics,,"July 22, 2010",,1400532620,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,


Value Counts of products per Category:
 electronics            17461
musical_instruments     8550
Name: category, dtype: int64


In [59]:
# merge reviews and metadata
reviews_df = reviews_df.merge(data, on='asin', how='left').drop(columns=['Unnamed: 0','image', 'category_y', 'fit', 'also_buy', 'tech1', 'tech2', 'also_view', 'details', 'similar_item', 'imageURL', 'imageURLHighRes'])
display(reviews_df.head(4))

# save to csv
reviews_df.to_csv('Data/revs_meta_batch4.csv')

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category_x,vote,description,title,brand,feature,rank,main_cat,date,price
0,4.0,True,"03 29, 2016",A1172NWLKHRG5D,B00FO0IHMY,{'Format:': ' Electronics'},samuel chavez,They work great!,Four Stars,1459209600,electronics,,[The Black LG Tone Pro Bluetooth Stereo Headse...,LG Electronics Tone Pro HBS-750 Bluetooth Wire...,LG,"[1%, Imported, 3d Neck behind (around-the-neck...","[>#22,994 in Cell Phones & Accessories (See to...",Cell Phones & Accessories,"March 30, 2014",
1,5.0,True,"08 30, 2016",A1SYD5RU6MQ5VY,B01G8BTYG0,{'Color:': ' 9 Pack - Black'},michael,Just used the three cable holder. Really pleas...,Really pleased. Works as described,1472515200,electronics,,[],"Avantree 9 Pack Long Lasting Cable Clips, Desk...",Avantree,[VARIETY & ORGANIZE AND MANAGE: Total 9pcs in ...,[>#415 in Electronics (See Top 100 in Electron...,All Electronics,"May 27, 2016",$7.99
2,5.0,True,"12 28, 2016",A94GLWW4LODM3,B018BCJKE0,,Paul Hill,Got here on time works great highly recommend ...,Awesome,1482883200,electronics,,[],Cat 6 Ethernet Cable 25 ft White Flat - Solid ...,Jadaol,"[Jadaol High Performance Cat6 cable., High Per...",[>#3 in Computers & Accessories > Computer Acc...,Computers,"November 21, 2015",$7.89
3,4.0,True,"05 18, 2011",AL2BA3R3KQXNB,B003WGM7FU,,Alan Mushnick,I bought this case to replace this one\n(...)\...,perfectly fine case,1305676800,electronics,,[],Toblino: Leather iPad 1 Case (Folio Convertabl...,CESupply,[],"[>#53,560 in Computers & Accessories > Tablet ...",Computers,"July 19, 2010",



### <a id='toc4_2_3_'></a>[Batch 5](#toc0_)

- office_products
- patio_lawn_and_garden

In [60]:
# loading the review data!

data = []

office_products = "/Users/pavansingh/Desktop/Amazon Review Data/Office_Products_5.json"
patio_lawn_and_garden = "/Users/pavansingh/Desktop/Amazon Review Data/Patio_Lawn_and_Garden_5.json"

# load each file and join into dataframe
for category, filename in [('office_products', office_products), ('patio_lawn_and_garden', patio_lawn_and_garden)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch5.csv')

# category value counts
print("Value counts of product reviews per category:\n", data['category'].value_counts())

Shape of all data: (50000, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,category,style,vote,image
0,1.0,True,"01 1, 2017",A14DPEZ8UUPC6F,B0195415JG,Edna K,I received two identical sheets l am not impre...,Not that great,1483228800,office_products,,,
1,4.0,True,"02 9, 2012",A22CTX4Z8VP9KO,B000WP32ZI,GlassStudio,Deep saturation and beyond grape in color. It'...,Ouch This is PURPLE and it STAINS,1328745600,office_products,,,
2,5.0,True,"07 20, 2017",AN0ZO1SWXCDLU,B000050FZP,Jordan,"If you have kids and you buy this, you to can ...",Perfect Basic Phone,1500508800,office_products,{'Style:': ' White'},,
3,4.0,True,"03 10, 2018",A2EXON79HPL0SQ,B000CC6H5S,SC Girl,I got this for my grandson who has to read for...,Great for elementary school students who have ...,1520640000,office_products,{'Color:': ' Neon Blue'},,
4,4.0,True,"04 19, 2017",A19D3AAKV8QZ4H,B000B5RYE4,Lozahe,Good way to keep papers from jamming in feed s...,Great Accessory For Scanners in an Office,1492560000,office_products,,,


Value counts of product reviews per category:
 office_products          25000
patio_lawn_and_garden    25000
Name: category, dtype: int64


In [61]:
# Read product reviews file and extract productIDs
reviews_df = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch5.csv', low_memory=False)
product_ids = set(reviews_df['asin'])

# Metadata files for the two categories in this batch

office_products = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Office_Products.json"
patio_lawn_and_garden = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Patio_Lawn_and_Garden.json"

# Load each metadata file and join into a dataframe
data = []

for category, filename in [('office_products', office_products), ('patio_lawn_and_garden', patio_lawn_and_garden)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        data.append(selected_data)

# to dataframe 
data = pd.DataFrame(data)

# Print the resulting metadata dataframe
display(data.head(4))

# Value counts
print("Value Counts of products per Category:\n", data['category'].value_counts())

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,office_products,,"[, ]",,Tri-Fold Organizer Black XXL Book and Bible Cover,"[031043758X, 1934770132, B00W4E1TKU, 031080917...",,Visit Amazon's Zondervan Page,[],"787,995 in Books (","[0310809177, B003JAH9MU, B0007UQKO8, 031082370...",Books,,,,310432065,[],[],
1,office_products,,"[Made from durable nylon material, this sporty...",,Compass Med Book and Bible Cover,"[0310520347, B007WAWMZW, 0310802636, B0793FF3N...",,Visit Amazon's Zondervan Page,[],"50,648 in Books (","[0310806593, 0310802636, 031080292X, B007WAWMZ...",Books,,,$16.25,310806607,[],[],
2,office_products,,"[Featuring metal accents, purse-style handles ...",,Reptile Leather Extra Large Wine Bible Cover,"[B000OWOS1Q, 1934770132, 1934770981, B0007UQKO...",,Visit Amazon's Zondervan Page,[],"23,581 in Books (","[0310818605, B000OWOS1Q, B005JSC61O, B00BQZCLN...",Books,,,$13.95,310821800,[],[],
3,office_products,,[Classic design in a popular weathered look. *...,,Aviator Leather-Look Brown Extra Large Book an...,"[1934770914, B00W4E1UAY, 0310916410, 031080660...",,Visit Amazon's Zondervan Page,[],"8,063 in Books (","[B005KTQAAU, B00KLD9Q9C, B00W4E1TMS, B00ENP6KI...",Books,,,$12.95,310823706,[],[],


Value Counts of products per Category:
 patio_lawn_and_garden    12813
office_products          11871
Name: category, dtype: int64


In [62]:
# merge reviews and metadata
reviews_df = reviews_df.merge(data, on='asin', how='left').drop(columns=['Unnamed: 0','image', 'category_y', 'fit', 'also_buy', 'tech1', 'tech2', 'also_view', 'details', 'similar_item', 'imageURL', 'imageURLHighRes'])
display(reviews_df.head(4))

# save to csv
reviews_df.to_csv('Data/revs_meta_batch5.csv')

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,category_x,style,vote,description,title,brand,feature,rank,main_cat,date,price
0,1.0,True,"01 1, 2017",A14DPEZ8UUPC6F,B0195415JG,Edna K,I received two identical sheets l am not impre...,Not that great,1483228800,office_products,,,[],432 Planner Stickers - Busy Mom Collection for...,Denise Albright,[432 Peel & Stick stickers for 75 unique event...,"[>#5,235 in Office Products (See top 100), >#1...",Office Products,"December 9, 2015",$5.95
1,4.0,True,"02 9, 2012",A22CTX4Z8VP9KO,B000WP32ZI,GlassStudio,Deep saturation and beyond grape in color. It'...,Ouch This is PURPLE and it STAINS,1328745600,office_products,,,[Noodler's Ink is 100% made in the USA from ca...,"Noodler's Fountain Ink, 3 oz Bottle, Habannero...",Noodler's,[100% made in the USA from cap to glass to ink...,"[>#109,222 in Office Products (See top 100), >...",Office Products,"October 3, 2007",$17.56
2,5.0,True,"07 20, 2017",AN0ZO1SWXCDLU,B000050FZP,Jordan,"If you have kids and you buy this, you to can ...",Perfect Basic Phone,1500508800,office_products,{'Style:': ' White'},,[AT&T 210M Corded Phone is a basic phone that'...,"AT&amp;T 210 Basic Trimline Corded Phone, No A...",AT&T,"[13 number speed dial memory, Lighted keypad.,...",[>#295 in Office Products (See Top 100 in Offi...,Office Products,"November 21, 2000",$10.99
3,5.0,True,"07 20, 2017",AN0ZO1SWXCDLU,B000050FZP,Jordan,"If you have kids and you buy this, you to can ...",Perfect Basic Phone,1500508800,office_products,{'Style:': ' White'},,[AT&T 210M Corded Phone is a basic phone that'...,"AT&amp;T 210 Basic Trimline Corded Phone, No A...",AT&T,"[13 number speed dial memory, Lighted keypad.,...",[>#295 in Office Products (See Top 100 in Offi...,Office Products,"November 21, 2000",$10.99
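
Rows 2 and 3 of this output are the same review (same `reviewerID`, `asin` and `unixReviewTime`). A left merge produces this when the right-hand frame holds more than one row for an `asin`. A possible guard, sketched on invented data, is to keep one metadata row per `asin` and let `merge` check the relationship with `validate="many_to_one"`. That the duplicates here come from repeated metadata rows is an assumption; the frames below are toy examples.

```python
import pandas as pd

# Toy frames (invented): one product appears twice in the metadata.
reviews = pd.DataFrame({"asin": ["B1", "B2"], "overall": [5.0, 4.0]})
meta = pd.DataFrame({
    "asin": ["B1", "B2", "B2"],
    "title": ["Pen", "Phone", "Phone"],
})

# A plain left merge repeats the review for every matching metadata row.
naive = reviews.merge(meta, on="asin", how="left")

# Keep one metadata row per asin, then ask pandas to enforce many-to-one.
meta_unique = meta.drop_duplicates(subset="asin", keep="first")
merged = reviews.merge(meta_unique, on="asin", how="left", validate="many_to_one")

print(len(naive), len(merged))  # 3 2
```

With `validate="many_to_one"`, pandas raises a `MergeError` if the metadata still contains repeated keys, so the row count of the merged frame stays equal to the number of reviews.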



### <a id='toc4_2_4_'></a>[Batch 6](#toc0_)

- sports_and_outdoors
- video_games

In [63]:
# loading the review data!

data = []

sports_and_outdoors = "/Users/pavansingh/Desktop/Amazon Review Data/Sports_and_Outdoors_5.json"
video_games = "/Users/pavansingh/Desktop/Amazon Review Data/Video_Games_5.json"

# load each file and join into dataframe
for category, filename in [('sports_and_outdoors', sports_and_outdoors), ('video_games', video_games)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch6.csv')

# category value counts
print("Value counts of product reviews per category:\n", data['category'].value_counts())

Shape of all data: (50000, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,True,"08 16, 2015",AKUGX6GT5IAOQ,B001PR08GI,{'Team Name:': ' Kentucky Wildcats'},Derek R,Nice look.,Five Stars,1439683200,sports_and_outdoors,,
1,4.0,True,"09 11, 2014",A30VX6W01ZP5Q6,B00H0DVPU4,,bl33,Work great.,Four Stars,1410393600,sports_and_outdoors,,
2,5.0,True,"07 30, 2014",A269H3SS90XMHS,B002RDT1RG,,MinK,This is my 4th set of boxing gloves. I had to...,Good quality - seems to run small,1406678400,sports_and_outdoors,,
3,3.0,True,"01 1, 2016",A2F73J2AMZCD61,B00KSKY3PA,{'Color:': ' Black'},Amazon Customer,"Seems well made, but a bit bulky for what it is.",Three Stars,1451606400,sports_and_outdoors,,
4,2.0,True,"04 5, 2014",A3EE9NVA33U1ME,B00CPQ2SG6,{'Hand Orientation:': ' 1. Right Hand Draw'},Jay Hardwick,I got the RH holster instead of the LH holster...,"Inexpensive, yet difficult to wear",1396656000,sports_and_outdoors,,


Value counts of product reviews per category:
 sports_and_outdoors    25000
video_games            25000
Name: category, dtype: int64


In [64]:
# Read product reviews file and extract productIDs
reviews_df = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch6.csv', low_memory=False)
product_ids = set(reviews_df['asin'])

# Metadata files for the two categories in this batch

sports_and_outdoors = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Sports_and_Outdoors.json"
video_games = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Video_Games.json"

# Load each metadata file and join into a dataframe
data = []

for category, filename in [('sports_and_outdoors', sports_and_outdoors), ('video_games', video_games)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        data.append(selected_data)

# to dataframe 
data = pd.DataFrame(data)

# Print the resulting metadata dataframe
display(data.head(4))

# Value counts
print("Value Counts of products per Category:\n", data['category'].value_counts())

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,sports_and_outdoors,,[Find your way through New York while hitting ...,,Delorme New York State Atlas &amp; Gazetteer,"[0899334415, 0899334431, 0899333419, 089933351...",,Garmin,"[Amazingly detailed and beautifully crafted, l...","121,074 in Office Products (","[0528881922, 1569145792, 0899334431, 089933257...",Office Products,,,$19.95,899332757,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
1,sports_and_outdoors,,[],,BenchMaster Pocket Guide - Fly Fishing - Fishing,"[1931676003, 0922273278, B000ZKSVSS, 161628873...",,Pocket Guides,[The Pocket Guide To Fly Fishing Knots is a co...,"74,348 in Sports & Outdoors (",[],Sports & Outdoors,,,$12.96,971100764,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
2,sports_and_outdoors,,"[A disaster can strike at any moment. Luckily,...",,"Books Doomsday Prepping Crash Course Book, Brown","[1612432735, B00LE4RGOE, 1519118295, 076532725...",,Unknown,"[176 pages of doomsday survival on a budget, S...","504,923 in Sports & Outdoors (","[1612432735, 1426211228]",Sports & Outdoors,,,$10.80,1620878747,[],[],
3,sports_and_outdoors,,[Black Mountain Products (B.M.P.) resistance b...,,Black Mountain Products Resistance Band Set wi...,"[1612431712, B01AVDVHTI, B002YQUP7Q, B0136PR5T...",,Black Mountain,"[Bands included: Yellow (2-4 lbs.), blue (4-6 ...",303 in Sports & Outdoors (,[],Sports & Outdoors,"class=""a-bordered a-horizontal-stripes a-spa...",,$17.32,7245456313,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,{}


Value Counts of products per Category:
 sports_and_outdoors    16997
video_games            11228
Name: category, dtype: int64


In [65]:
# merge reviews and metadata
reviews_df = reviews_df.merge(data, on='asin', how='left').drop(columns=['Unnamed: 0','image', 'category_y', 'fit', 'also_buy', 'tech1', 'tech2', 'also_view', 'details', 'similar_item', 'imageURL', 'imageURLHighRes'])
display(reviews_df.head(4))

# save to csv
reviews_df.to_csv('Data/revs_meta_batch6.csv')

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category_x,vote,description,title,brand,feature,rank,main_cat,date,price
0,5.0,True,"08 16, 2015",AKUGX6GT5IAOQ,B001PR08GI,{'Team Name:': ' Kentucky Wildcats'},Derek R,Nice look.,Five Stars,1439683200,sports_and_outdoors,,[This compact Embossed Leather Billfold Wallet...,NCAA Clemson Tigers Embossed Leather Billfold ...,Rico Industries,"[Measures 4.25-inches by 3.25-inches, Decorate...","124,585 in Sports & Outdoors (",Sports & Outdoors,,$21.99
1,4.0,True,"09 11, 2014",A30VX6W01ZP5Q6,B00H0DVPU4,,bl33,Work great.,Four Stars,1410393600,sports_and_outdoors,,"[As effective as it is unique, the patented Bl...","BlenderBottle Classic Shaker Bottle, 28-ounce,...",Blender Bottle,"[Effortlessly mix protein drinks, pancake batt...","[>#206,022 in Kitchen & Dining (See Top 100 in...",Amazon Home,"September 29, 2009",$24.77
2,5.0,True,"07 30, 2014",A269H3SS90XMHS,B002RDT1RG,,MinK,This is my 4th set of boxing gloves. I had to...,Good quality - seems to run small,1406678400,sports_and_outdoors,,[],Pro Impact Genuine Leather Boxing Gloves Black...,Pro Impact,[DURABLE LEATHER MATERIAL. These Pro Impact Ge...,"23,324 in Sports & Outdoors (",Sports & Outdoors,,$49.99
3,3.0,True,"01 1, 2016",A2F73J2AMZCD61,B00KSKY3PA,{'Color:': ' Black'},Amazon Customer,"Seems well made, but a bit bulky for what it is.",Three Stars,1451606400,sports_and_outdoors,,"[Allows you to tighten, tension, and secure he...",Nite Ize NI-NCJSA-01-R8_M CamJam XT Aluminum C...,Nite Ize,"[KNOT-FREE ROPE TIGHTENER - Tighten, tension, ...","[>#133,047 in Tools & Home Improvement (See to...",Tools & Home Improvement,"June 5, 2014",$10.39
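
The `rank` column arrives in at least two shapes in this output: a plain string such as `124,585 in Sports & Outdoors (` and a bracketed list such as `[>#206,022 in Kitchen & Dining (...`. A hedged sketch of extracting the leading number from either form follows. `parse_rank` is a hypothetical helper, and it assumes the first number in the value is the rank of interest.

```python
import re

# First run of digits, allowing thousands separators.
_FIRST_NUMBER = re.compile(r"\d[\d,]*")


def parse_rank(value):
    """Return the first integer found in a rank value, or None."""
    if isinstance(value, (list, tuple)):
        value = value[0] if value else None  # use the first list entry
    if not isinstance(value, str):
        return None  # NaN or missing value
    match = _FIRST_NUMBER.search(value)
    return int(match.group().replace(",", "")) if match else None


print(parse_rank("124,585 in Sports & Outdoors ("))               # 124585
print(parse_rank([">#206,022 in Kitchen & Dining (See Top 100"]))  # 206022
```

Because the helper searches inside strings, it also handles list values that were turned into text by a CSV round trip.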



### <a id='toc4_2_5_'></a>[Batch 7](#toc0_)

- tools_and_home_improvement
- kindle_store

In [66]:
# loading the review data!

data = []

tools_and_home_improvement = "/Users/pavansingh/Desktop/Amazon Review Data/Tools_and_Home_Improvement_5.json"
kindle_store = "/Users/pavansingh/Desktop/Amazon Review Data/Kindle_Store_5.json"

# load each file and join into dataframe
for category, filename in [('tools_and_home_improvement', tools_and_home_improvement), ('kindle_store', kindle_store)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch7.csv')

# category value counts
print("Value counts of product reviews per category:\n", data['category'].value_counts())

Shape of all data: (50000, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,True,"10 3, 2015",A2P6CMDRPTUONH,B00F429RMQ,"{'Size:': ' 0.03', 'Style:': ' Welding wire'}",apis millett,worked good and the price was good. Keep up t...,Five Stars,1443830400,tools_and_home_improvement,,
1,5.0,True,"10 3, 2015",A2TU423HHUTTF9,B001P30BQE,{'Size:': ' Large'},AmazonFan,I am a large person that needed knee pads for ...,Pain Free Knees using these!,1443830400,tools_and_home_improvement,,
2,5.0,True,"03 15, 2016",A24VKQ6UFIX85W,B000J691J6,,Aa. in NorCal,My 2nd one to replace one I just wore out over...,BEST knife ever!,1458000000,tools_and_home_improvement,,
3,5.0,True,"09 5, 2014",ASJG3EMFTR28Z,B00HHIRO02,,Amazon Customer,This is a really nifty little tool. I needed a...,Neiko precision push drill,1409875200,tools_and_home_improvement,9.0,
4,5.0,True,"10 26, 2016",AL73NWNLG1B8U,B0015XIPN0,{'Color:': ' Orange'},Jon Lasham,high quality cord,Five Stars,1477440000,tools_and_home_improvement,,


Value counts of product reviews per category:
 tools_and_home_improvement    25000
kindle_store                  25000
Name: category, dtype: int64


In [67]:
# Read product reviews file and extract productIDs
reviews_df = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch7.csv', low_memory=False)
product_ids = set(reviews_df['asin'])

# Metadata files for the two categories in this batch

tools_and_home_improvement = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Tools_and_Home_Improvement.json"
kindle_store = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Kindle_Store.json"


# Load each metadata file and join into a dataframe
data = []

for category, filename in [('tools_and_home_improvement', tools_and_home_improvement), ('kindle_store', kindle_store)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        data.append(selected_data)

# to dataframe 
data = pd.DataFrame(data)

# Print the resulting metadata dataframe
display(data.head(4))

# Value counts
print("Value Counts of products per Category:\n", data['category'].value_counts())

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,tools_and_home_improvement,"class=""a-keyvalue prodDetTable"" role=""present...",[We don't know when or if this item will be ba...,,Breeding Organic Vegetables: A Step-by-Step Gu...,[],,SioGreen,[We don't know when or if this item will be ba...,"[>#1,638,473 in Tools & Home Improvement (See ...",[],Tools & Home Improvement,,"March 18, 2013",,0982085028,[],[],
1,tools_and_home_improvement,"class=""a-keyvalue prodDetTable"" role=""present...",[Who said you cant craft while enjoying family...,,"Mighty Bright 40516 XtraFlex2 Book Light, Pink","[1933622717, 1933622741, 0440406943]",,Mighty Bright,"[Two bright white, energy-efficient LED, Indiv...","[>#87,932 in Office Products (See top 100), >#...",[],Office Products,"class=""a-bordered a-horizontal-stripes a-spa...","July 2, 2008",$27.00,193362275X,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
2,tools_and_home_improvement,"class=""a-keyvalue prodDetTable"" role=""present...","[CBconcept [10 Bulbs] 110V-120V AC 75 Watts, J...",,"CBconcept [10 Bulbs] 110V-120V AC 75 Watts, JC...","[B001UL7QTQ, B00317FIIS, B01M1ETJ4M, 754290476...",,CBconcept,[[10 Bulbs] 110v-120v AC G8 Bi-Pin Halogen Lig...,"[>#55,534 in Tools & Home Improvement (See top...",[],Tools & Home Improvement,"class=""a-bordered a-horizontal-stripes a-spa...","February 11, 2011",$7.95,7109036146,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,{}
3,tools_and_home_improvement,"class=""a-keyvalue prodDetTable"" role=""present...","[CBconcept [10 Bulbs] 110V - 120V AC 20 Watts,...",,"CBconcept [10 Bulbs] 110V - 120V AC 25 Watts, ...","[B00KXLOS4A, B079847Z99, B00X34K2D0, B00P9RR3PE]",,CBconcept,[[10 Bulbs] 110v-120v AC JCD G9 base Halogen L...,"[>#72,777 in Tools & Home Improvement (See top...",[],Tools & Home Improvement,"class=""a-bordered a-horizontal-stripes a-spa...","August 1, 2010",$10.95,711906441X,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,{}


Value Counts of products per Category:
 kindle_store                  18685
tools_and_home_improvement    16569
Name: category, dtype: int64


In [68]:
# merge reviews and metadata
reviews_df = reviews_df.merge(data, on='asin', how='left').drop(columns=['Unnamed: 0','image', 'category_y', 'fit', 'also_buy', 'tech1', 'tech2', 'also_view', 'details', 'similar_item', 'imageURL', 'imageURLHighRes'])
display(reviews_df.head(4))

# save to csv
reviews_df.to_csv('Data/revs_meta_batch7.csv')

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category_x,vote,description,title,brand,feature,rank,main_cat,date,price
0,5.0,True,"10 3, 2015",A2P6CMDRPTUONH,B00F429RMQ,"{'Size:': ' 0.03', 'Style:': ' Welding wire'}",apis millett,worked good and the price was good. Keep up t...,Five Stars,1443830400,tools_and_home_improvement,,"[E71T-GS is an all-position, single-pass, flux...",Blue Demon E71TGS .035 X 1LB Spool Gasless Flu...,Blue Demon,[typical applications to include lap and butt ...,"[>#196,945 in Tools & Home Improvement (See to...",Tools & Home Improvement,"August 27, 2013",$13.32
1,5.0,True,"10 3, 2015",A2TU423HHUTTF9,B001P30BQE,{'Size:': ' Large'},AmazonFan,I am a large person that needed knee pads for ...,Pain Free Knees using these!,1443830400,tools_and_home_improvement,,"[Professional kneepads with layered gel, non-s...",DEWALT DG5204 Professional Kneepads with Layer...,Custom Leathercraft,[DURABLE: Ballistic poly material provides lon...,"[>#20,266 in Tools & Home Improvement (See top...",Tools & Home Improvement,"January 20, 2009",$34.95
2,5.0,True,"03 15, 2016",A24VKQ6UFIX85W,B000J691J6,,Aa. in NorCal,My 2nd one to replace one I just wore out over...,BEST knife ever!,1458000000,tools_and_home_improvement,,[Our aluminum InterFrame build SF folders shar...,CRKT M21-14SF EDC Folding Pocket Knife: Specia...,Columbia River Knife & Tool,"[Automated liner safety, Triple point serratio...","[>#88,049 in Tools & Home Improvement (See top...",Tools & Home Improvement,"April 2, 2009",$69.90
3,5.0,True,"09 5, 2014",ASJG3EMFTR28Z,B00HHIRO02,,Amazon Customer,This is a really nifty little tool. I needed a...,Neiko precision push drill,1409875200,tools_and_home_improvement,9.0,[Push drill bit set that provides precision an...,Neiko 10517A Precision Push Manual Hand Drill ...,Neiko,[Cordless and lightweight alternative to power...,"[>#164,035 in Tools & Home Improvement (See to...",Tools & Home Improvement,"December 24, 2013",$11.99


### <a id='toc4_2_6_'></a>[Batch 8](#toc0_)

- toys_and_games
- prime_pantry

In [69]:
# loading the review data!

data = []

toys_and_games = "/Users/pavansingh/Desktop/Amazon Review Data/Toys_and_Games_5.json"
prime_pantry = "/Users/pavansingh/Desktop/Amazon Review Data/Prime_Pantry_5.json"

# load each file and join into dataframe
for category, filename in [('toys_and_games', toys_and_games), ('prime_pantry', prime_pantry)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch8.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

Shape of all data: (50000, 13)


Unnamed: 0,reviewerID,asin,reviewerName,verified,reviewText,overall,reviewTime,summary,unixReviewTime,category,image,vote,style
0,AGIR4MLM5NFRZ,B000YXJEFK,StubbornBrunette,False,My daughter received this for her 1st birthday...,5.0,"08 2, 2016",Great baby doll for a young toddler,1470096000,toys_and_games,,,
1,AKCZIJ6ZHE41D,B017S86Q1S,Jarucia Jaycox,False,"<div id=""video-block-R3331ZQAXR8U5A"" class=""a-...",4.0,"10 15, 2017",Will read ANYTHING and that can be pretty fun,1508025600,toys_and_games,[https://images-na.ssl-images-amazon.com/image...,3.0,{'Style:': ' Bear'}
2,A15NHBDYACSMY9,B001PAFMTI,puzzle guy,True,These Thomas wooden trains and tracks are awes...,5.0,"01 28, 2013",Must get!,1359331200,toys_and_games,,,
3,A2I1SOF9R2PJY4,B00178JWKS,Bao Chau T. Duong,True,"My son expected this to be NOT a kit, Does not...",1.0,"01 16, 2013","Capsizes in water, does not float",1358294400,toys_and_games,,,
4,A3I29W7R4ARY1E,B017B1BP1S,san ann,True,my son loves the show and loves this lego set.,5.0,"06 14, 2016",Five Stars,1465862400,toys_and_games,,,{'Style:': ' Clay'}


Value counts of product reviews per category:
 toys_and_games    25000
prime_pantry      25000
Name: category, dtype: int64


In [70]:
# Read product reviews file and extract productIDs
reviews_df = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch8.csv', low_memory=False)
product_ids = set(reviews_df['asin'])

# Metadata files for the product categories in this batch

toys_and_games = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Toys_and_Games.json"
prime_pantry = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Prime_Pantry.json"

# Load each metadata file and join into a dataframe
data = []

for category, filename in [('toys_and_games', toys_and_games), ('prime_pantry', prime_pantry)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        data.append(selected_data)

# to dataframe 
data = pd.DataFrame(data)

# Print the resulting metadata dataframe
display(data.head(4))

# Value counts
print("Value Counts of products per Category:\n", data['category'].value_counts())

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,toys_and_games,,[<div>This space age character is likely to po...,,Dover Publications-Create Your Own Robot Stickers,[],,Dover Publications,[Create your very own robot with these fun sti...,"[>#24,463 in Arts, Crafts & Sewing (See Top 10...","[1426331800, 1609960653, B01C4OTAXC, 044981079...","Arts, Crafts & Sewing","class=""a-bordered a-horizontal-stripes a-spa...",,$3.04,486448789,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
1,toys_and_games,,[Pom-Pom Puppies is the cutest breed of pom-po...,,Klutz Pom-Pom Puppies: Make Your Own Adorable ...,"[1338106430, 0545703190, 0545906520, 133810644...",,Klutz,[Create your own loveable pup from pomeranian ...,"[>#10,615 in Toys & Games (See Top 100 in Toys...","[0545703190, 1338159569, 1338106430, B076WVQDJ...",Toys & Games,,"March 7, 2013",$17.45,545561647,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
2,toys_and_games,,[This fast-paced therapeutic card game helps c...,,Mad Dragon: An Anger Control Card Game,"[B01MREOLRC, B07GYRDRX8, 0641699840, 168373075...",,Therapy Game HQ,"[Designed for children aged 6 to 12, This list...","[>#10,932 in Toys & Games (See Top 100 in Toys...","[B01MREOLRC, B0773KRCCL, B01N4JIK4J, B013J5RPH...",Toys & Games,"class=""a-bordered a-horizontal-stripes a-spa...",,$19.95,615638996,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
3,toys_and_games,,[Scoundrels of Skull port adds TWO new expansi...,,Lords of Waterdeep: Scoundrels of Skullport Ex...,"[0786959916, B07FN7BD27, B01IPUGYK6, B014TKCZ4...",,Wizards of the Coast,"[For 2-6 Players, 60 minute playing time, Scou...","[>#16,586 in Toys & Games (See Top 100 in Toys...","[0786959916, 0786955570, B072FCK6J7, B0716TVP9...",Toys & Games,"class=""a-bordered a-horizontal-stripes a-spa...",,$31.96,786964502,[],[],


Value Counts of products per Category:
 toys_and_games    17149
prime_pantry       4417
Name: category, dtype: int64


In [71]:
# merge reviews and metadata
reviews_df = reviews_df.merge(data, on='asin', how='left').drop(columns=['Unnamed: 0','image', 'category_y', 'fit', 'also_buy', 'tech1', 'tech2', 'also_view', 'details', 'similar_item', 'imageURL', 'imageURLHighRes'])
display(reviews_df.head(4))

# save to csv
reviews_df.to_csv('Data/revs_meta_batch8.csv')

Unnamed: 0,reviewerID,asin,reviewerName,verified,reviewText,overall,reviewTime,summary,unixReviewTime,category_x,vote,style,description,title,brand,feature,rank,main_cat,date,price
0,AGIR4MLM5NFRZ,B000YXJEFK,StubbornBrunette,False,My daughter received this for her 1st birthday...,5.0,"08 2, 2016",Great baby doll for a young toddler,1470096000,toys_and_games,,,[Make room in your heart for the 11'' Cabbage ...,Cabbage Patch Kids 11&quot; Drink N' Wet Newborn,Cabbage Patch Kids,[],"[>#398,461 in Toys & Games (See Top 100 in Toy...",Toys & Games,,.a-box-inner{background-color:#fff}#alohaBuyBo...
1,AKCZIJ6ZHE41D,B017S86Q1S,Jarucia Jaycox,False,"<div id=""video-block-R3331ZQAXR8U5A"" class=""a-...",4.0,"10 15, 2017",Will read ANYTHING and that can be pretty fun,1508025600,toys_and_games,3.0,{'Style:': ' Bear'},"[<div class=""boost-aplus-container""> <div clas...",Kayle Concepts Bluebee Pal Pro The Zebra - Tal...,Kayle Concepts,"[Plush, Imported, 4.0 Bluebee Pal Pro is a Tal...","[>#804,029 in Toys & Games (See Top 100 in Toy...",Toys & Games,,$74.99
2,A15NHBDYACSMY9,B001PAFMTI,puzzle guy,True,These Thomas wooden trains and tracks are awes...,5.0,"01 28, 2013",Must get!,1359331200,toys_and_games,,,"[Thomas And Friends Wooden Railway - Duncan, A...",Thomas And Friends Wooden Railway - Duncan,Learning Curve,[Each character has a unique personality and j...,"[>#399,221 in Toys & Games (See Top 100 in Toy...",Toys & Games,,$54.99
3,A2I1SOF9R2PJY4,B00178JWKS,Bao Chau T. Duong,True,"My son expected this to be NOT a kit, Does not...",1.0,"01 16, 2013","Capsizes in water, does not float",1358294400,toys_and_games,,,[This wooden model Titanic kit is great for sc...,"Darice 9178-91 Wooden Model Kit, Titanic",Darice,"[Wooden titanic model kit, Made up of wood, Th...","[>#3,571 in Arts, Crafts & Sewing (See Top 100...","Arts, Crafts & Sewing",,$5.97


### <a id='toc4_2_7_'></a>[Batch 9](#toc0_)

- home_and_kitchen
- movies_and_tv

In [4]:
# loading the review data!

data = []

home_and_kitchen = "/Users/pavansingh/Desktop/Amazon Review Data/Home_and_Kitchen_5.json"
movies_and_tv = "/Users/pavansingh/Desktop/Amazon Review Data/Movies_and_TV_5.json"

# load each file and join into dataframe
for category, filename in [('home_and_kitchen', home_and_kitchen), ('movies_and_tv', movies_and_tv)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch9.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

Shape of all data: (50000, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,True,"03 13, 2016",ABBW6PYWAVW0A,B0015ZSW86,"{'Size:': ' 1 Clock', 'Package Quantity:': ' 1'}",D.Ecab,So far so good--been through two rain storms a...,Good size/price on the clock compared to other...,1457827200,home_and_kitchen,,
1,5.0,True,"11 13, 2017",APPH7P0WHUL5S,B003KRHDNC,,Mykel,"This holds so many kCups! thanks guys, excelle...",excellent product,1510531200,home_and_kitchen,,
2,5.0,True,"05 1, 2018",A82ZHT5OQS8CF,B00NLLUNSE,"{'Size:': ' Queen', 'Color:': ' Brown'}",D. Jackson,I ordered these sheets during a very busy time...,They are the first sheets I've found that are ...,1525132800,home_and_kitchen,,
3,5.0,True,"01 3, 2013",ANATXXBAXEUFM,B000Q9YVMS,{'Size:': ' 9-inch'},B. Smith,The mechanism that locks and unlocks these ton...,Works great,1357171200,home_and_kitchen,,
4,5.0,True,"06 5, 2015",A1BF20BCIQQJ2M,B00KFV8PY2,{'Package Quantity:': ' 1'},Judy Jacobson,I've used this for coconut oil fudge. I like ...,Good mold!,1433462400,home_and_kitchen,,


Value counts of product reviews per category:
 home_and_kitchen    25000
movies_and_tv       25000
Name: category, dtype: int64


In [5]:
# Read product reviews file and extract productIDs
reviews_df = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch9.csv', low_memory=False)
product_ids = set(reviews_df['asin'])

# Metadata files for the product categories in this batch

home_and_kitchen = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Home_and_Kitchen.json"
movies_and_tv = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Movies_and_TV.json"


# Load each metadata file and join into a dataframe
data = []

for category, filename in [('home_and_kitchen', home_and_kitchen), ('movies_and_tv', movies_and_tv)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        data.append(selected_data)

# to dataframe 
data = pd.DataFrame(data)

# Print the resulting metadata dataframe
display(data.head(4))

# Value counts
print("Value Counts of products per Category:\n", data['category'].value_counts())

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,home_and_kitchen,,"[CD'S ARTWORK, CASE AND DISC ARE LIKE NEW / FA...",,EMINEM SHOW,[],,,[audio cd],[],[],Amazon Home,,,,5509356839,[],[],
1,home_and_kitchen,,[Now it's easy to learn your a-b-c with our co...,,Little Wigwam Alphabet Placemat,"[6002582215, B01EYDBKNO, 6002582223, B01KORO72...",,Little Wigwam,"[Placemat Size: 420mm x 297mm (A3), Phonetical...","[>#53,717 in Home & Kitchen (See Top 100 in Ho...",[],Amazon Home,"class=""a-bordered a-horizontal-stripes a-spa...",,$7.99,6002582177,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
2,home_and_kitchen,,[],,The Pampered Chef Serrated Bread Knife,[],,,[],"[>#2,322,675 in Kitchen & Dining (See Top 100 ...",[],Amazon Home,,"March 11, 2009",,7229004187,[],[],{}
3,home_and_kitchen,,"[Product dimensions\nLength: 196 ""\nMax. load...",,"IKEA - DIGNITET Curtain Wire, Stainless Steel",[],,IKEA,[Complete set with hardware and curtain wire; ...,"[>#59,072 in Home & Kitchen (See Top 100 in Ho...",[],Amazon Home,,,$17.32,9170011451,[],[],{}


Value Counts of products per Category:
 home_and_kitchen    18706
movies_and_tv       14202
Name: category, dtype: int64


In [6]:
# merge reviews and metadata
reviews_df = reviews_df.merge(data, on='asin', how='left').drop(columns=['Unnamed: 0','image', 'category_y', 'fit', 'also_buy', 'tech1', 'tech2', 'also_view', 'details', 'similar_item', 'imageURL', 'imageURLHighRes'])
display(reviews_df.head(4))

# save to csv
reviews_df.to_csv('Data/revs_meta_batch9.csv')

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category_x,vote,description,title,brand,feature,rank,main_cat,date,price
0,5.0,True,"03 13, 2016",ABBW6PYWAVW0A,B0015ZSW86,"{'Size:': ' 1 Clock', 'Package Quantity:': ' 1'}",D.Ecab,So far so good--been through two rain storms a...,Good size/price on the clock compared to other...,1457827200,home_and_kitchen,,"[Shatter- and weather-resistant case, lens and...","Universal 11381 Indoor/Outdoor Clock, 13 1/2&q...",Universal,"[Shatter & Weather Resistant, 13-1/2"" Overall ...","[>#94,194 in Home & Kitchen (See Top 100 in Ho...",Amazon Home,,
1,5.0,True,"11 13, 2017",APPH7P0WHUL5S,B003KRHDNC,,Mykel,"This holds so many kCups! thanks guys, excelle...",excellent product,1510531200,home_and_kitchen,,[The Nifty Carousel for single serve coffee po...,K-Cup Carousel - Holds 35 K-Cups in Black,NIFTY,"[Holds up to 35 single serve coffee podss, Laz...",[>#11 in Tools & Home Improvement (See top 100...,Amazon Home,"January 21, 2010",$13.84
2,5.0,True,"05 1, 2018",A82ZHT5OQS8CF,B00NLLUNSE,"{'Size:': ' Queen', 'Color:': ' Brown'}",D. Jackson,I ordered these sheets during a very busy time...,They are the first sheets I've found that are ...,1525132800,home_and_kitchen,,[#1 Bed Sheet Set - Super Silky Soft - HIGHEST...,#1 Bed Sheet Set - HIGHEST QUALITY Brushed Mic...,Mellanni,"[100% Polyester, Imported, FEEL THE DIFFERENCE...",[>#5 in Home & Kitchen (See Top 100 in Home & ...,Amazon Home,,$29.70
3,5.0,True,"01 3, 2013",ANATXXBAXEUFM,B000Q9YVMS,{'Size:': ' 9-inch'},B. Smith,The mechanism that locks and unlocks these ton...,Works great,1357171200,home_and_kitchen,,"[, The Prepworks by Progressive 12-inch Silico...",Prepworks by Progressive Silicone Gripper Tong...,Progressive,[12-Inch stainless steel and silicone grip and...,"[>#151,868 in Kitchen & Dining (See Top 100 in...",Amazon Home,"October 2, 2001",$19.51



### <a id='toc4_2_8_'></a>[Batch 10](#toc0_)

- pet_supplies
- grocery_and_gourmet_food




In [7]:
# loading the review data!

data = []

pet_supplies = "/Users/pavansingh/Desktop/Amazon Review Data/Pet_Supplies_5.json"
grocery_and_gourmet_food = "/Users/pavansingh/Desktop/Amazon Review Data/Grocery_and_Gourmet_Food_5.json"

# load each file and join into dataframe
for category, filename in [('pet_supplies', pet_supplies), ('grocery_and_gourmet_food', grocery_and_gourmet_food)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch10.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

Shape of all data: (50000, 13)


Unnamed: 0,reviewerID,asin,reviewerName,verified,reviewText,overall,reviewTime,summary,unixReviewTime,category,style,image,vote
0,A1JTQ34I7Y2BL0,B0009YD8OC,MRJ,True,This is an AMAZING product for a dog that pull...,5.0,"01 23, 2015",This is an AMAZING product for a dog that pulls,1421971200,pet_supplies,,,
1,A27N5OXGPKJFNA,B00XPQT228,GOK,True,My birds love it!,5.0,"07 5, 2018",Five Stars,1530748800,pet_supplies,{'Size:': ' 1.1 pounds'},,
2,A1BIEW7UEE0S2T,B0002ARR2W,RC S.,True,This is a life saver for those tangles that ca...,5.0,"12 20, 2016",This is a life saver for those tangles that ca...,1482192000,pet_supplies,{'Size:': ' Pack of 1'},,
3,AOKRHVICVXS1J,B000X98CN0,Blake,True,excellent product,5.0,"05 12, 2015",Five Stars,1431388800,pet_supplies,"{'Size:': ' 12 oz', 'Package Type:': ' Standar...",,
4,A3VU97TZS8NOKS,B0058RA4HE,Pukeko,True,I bought these panels to extend our 4 panel pl...,5.0,"11 2, 2016",I bought these panels to extend our 4 panel pl...,1478044800,pet_supplies,"{'Color:': ' White', 'Style:': ' 2 Panel Add On'}",,


Value counts of product reviews per category:
 pet_supplies                25000
grocery_and_gourmet_food    25000
Name: category, dtype: int64


In [8]:
# Read product reviews file and extract productIDs
reviews_df = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/revs_batch10.csv', low_memory=False)
product_ids = set(reviews_df['asin'])

# Metadata files for the product categories in this batch

pet_supplies = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Pet_Supplies.json"
grocery_and_gourmet_food = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Grocery_and_Gourmet_Food.json"

# Load each metadata file and join into a dataframe
data = []

for category, filename in [('pet_supplies', pet_supplies), ('grocery_and_gourmet_food', grocery_and_gourmet_food)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        data.append(selected_data)

# to dataframe 
data = pd.DataFrame(data)

# Print the resulting metadata dataframe
display(data.head(4))

# Value counts
print("Value Counts of products per Category:\n", data['category'].value_counts())

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,pet_supplies,,[VetVittles tm Nice Coat Stimulates hair growt...,,Nice Coat Tuna Flavor Pet Herbal Supplement,[],,VetVittles.com,"[Boosts strength, immunity and vitality, stren...","1,003,489 in Pet Supplies (",[],Pet Supplies,,,,1300451335,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
1,pet_supplies,,[],,"WELLAND Wood Free Standing Folding Pet Gate, 5...","[B076V7LKXK, B01MA6RMID, B072ZY3TKG, B011BS5E9...",,WELLAND,[],[],"[B072ZQB5YK, B072ZY3TKG, B01MT3A1Q4, B076V7LKX...",Baby,,,$49.99,4121689569,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
2,pet_supplies,,[Kills and repels fleas and ticks for 8 contin...,,Bayer Seresto Flea and Tick Collar for Dogs,[],,Bayer Animal Health,[Veterinarian-recommended ea and tick preventi...,11 in Pet Supplies (,"[B00WMMMHNM, B00B8CG5NK, B005B0OEQK, B07D9S6QP...",Pet Supplies,"class=""a-bordered a-horizontal-stripes a-spa...",,$37.99,6162622851,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
3,pet_supplies,,[Medium bolt-on bottlebrush cage perch for med...,,Birds LOVE Bottlebrush Wood Bird Cage Perch,"[B01MAXU4N2, B0002AR73G, B07D4C91LQ, B01C5E0HC...",,Birds LOVE,[Medium bolt-on Bottlebrush cage perch for sma...,"56,922 in Pet Supplies (","[B01FUWY8NM, B0035HCVUW, B06XRT2N53, B0086YCPH...",Pet Supplies,"class=""a-bordered a-horizontal-stripes a-spa...",,$14.99,9822497938,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,"{'  Item Weight: ': '5 ounces', 'Shipp..."


Value Counts of products per Category:
 grocery_and_gourmet_food    12644
pet_supplies                12023
Name: category, dtype: int64


In [9]:
# merge reviews and metadata
reviews_df = reviews_df.merge(data, on='asin', how='left').drop(columns=['Unnamed: 0','image', 'category_y', 'fit', 'also_buy', 'tech1', 'tech2', 'also_view', 'details', 'similar_item', 'imageURL', 'imageURLHighRes'])
display(reviews_df.head(4))

# save to csv
reviews_df.to_csv('Data/revs_meta_batch10.csv')

Unnamed: 0,reviewerID,asin,reviewerName,verified,reviewText,overall,reviewTime,summary,unixReviewTime,category_x,style,vote,description,title,brand,feature,rank,main_cat,date,price
0,A1JTQ34I7Y2BL0,B0009YD8OC,MRJ,True,This is an AMAZING product for a dog that pull...,5.0,"01 23, 2015",This is an AMAZING product for a dog that pulls,1421971200,pet_supplies,,,[Enjoy stress-free walks in the park with your...,PetSafe Gentle Leader Head Collar with Trainin...,PetSafe,[],"131,607 in Pet Supplies (",Pet Supplies,,.a-box-inner{background-color:#fff}#alohaBuyBo...
1,A27N5OXGPKJFNA,B00XPQT228,GOK,True,My birds love it!,5.0,"07 5, 2018",Five Stars,1530748800,pet_supplies,{'Size:': ' 1.1 pounds'},,"[As a supplement to your bird's regular diet, ...",Quiko Classic Egg Food Bird Supplement,Quiko,"[Contains 1- 1.1Lb, Egg Foods Europeans Have B...","39,910 in Pet Supplies (",Pet Supplies,,$13.14
2,A1BIEW7UEE0S2T,B0002ARR2W,RC S.,True,This is a life saver for those tangles that ca...,5.0,"12 20, 2016",This is a life saver for those tangles that ca...,1482192000,pet_supplies,{'Size:': ' Pack of 1'},,[The SAFARI Dog De-Matting Comb removes mats a...,"Safari Dog De-Matting Comb, One Size, Dog Comb...",Safari Pet Products,[DOG DE-MATTING COMB: Designed to easily remov...,356 in Pet Supplies (,Pet Supplies,,$7.99
3,A1BIEW7UEE0S2T,B0002ARR2W,RC S.,True,This is a life saver for those tangles that ca...,5.0,"12 20, 2016",This is a life saver for those tangles that ca...,1482192000,pet_supplies,{'Size:': ' Pack of 1'},,[The SAFARI Dog De-Matting Comb removes mats a...,"Safari Dog De-Matting Comb, One Size, Dog Comb...",Safari Pet Products,[DOG DE-MATTING COMB: Designed to easily remov...,356 in Pet Supplies (,Pet Supplies,,$7.99
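
Rows 2 and 3 of the preview above are the same review repeated, and the per-category totals reported later for the merged data exceed the 25,000 reviews sampled per category (for example, 27,819 for `pet_supplies`). One plausible cause, which is an assumption and has not been verified against the metadata files, is that some `asin` values appear more than once in the metadata, so the left merge emits one output row per match. The sketch below uses small hypothetical frames (not the notebook's data) to show the effect and one way to guard against it with `drop_duplicates` and the `validate` argument of `merge`.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames (hypothetical values).
reviews = pd.DataFrame({
    "asin": ["A1", "B2"],
    "overall": [5.0, 4.0],
})
meta = pd.DataFrame({
    "asin": ["A1", "A1", "B2"],  # "A1" is listed twice in the metadata
    "title": ["Comb", "Comb", "Perch"],
})

# A plain left merge repeats the review once per matching metadata row.
inflated = reviews.merge(meta, on="asin", how="left")

# Keeping one metadata row per asin preserves the review count;
# validate="many_to_one" raises if the right-hand keys are not unique.
meta_unique = meta.drop_duplicates(subset="asin", keep="first")
merged = reviews.merge(meta_unique, on="asin", how="left", validate="many_to_one")

print(len(reviews), len(inflated), len(merged))
```

Whether keeping the first metadata row is the right choice depends on how the duplicate entries differ, so it is worth inspecting them before dropping any.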


### <a id='toc4_2_9_'></a>[Merge Batches (for large reviews data)](#toc0_)

In this section, we combine the ten batches into one large dataset.

We use `pd.concat()` to stack the batches row-wise. The resulting dataset is saved as a CSV file for use in the next section, **data cleaning**.
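
The combined output further down carries `Unnamed: 0` and `Unnamed: 0.1` columns, which are row indexes that were written to CSV and then read back as data. A minimal sketch of one way to avoid them, using small hypothetical frames and an in-memory buffer instead of the notebook's files: pass `ignore_index=True` to `pd.concat()` and `index=False` to `to_csv()`.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two of the batch dataframes.
batch_a = pd.DataFrame({"asin": ["A1", "B2"], "overall": [5.0, 4.0]})
batch_b = pd.DataFrame({"asin": ["C3"], "overall": [3.0]})

# ignore_index=True renumbers rows 0..n-1 instead of repeating each batch's index.
combined = pd.concat([batch_a, batch_b], ignore_index=True)

# index=False keeps the row index out of the file, so reloading it
# does not create an "Unnamed: 0" column.
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))
```

If the index is written out anyway, reading the file back with `index_col=0` has a similar effect.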




In [12]:
# load the ten batch csv files
data_dir = "/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data"
frames = [pd.read_csv(f"{data_dir}/revs_meta_batch{i}.csv", low_memory=False) for i in range(1, 11)]

# concatenate all dataframes
lots_revs_meta = pd.concat(frames)

# save to csv
lots_revs_meta.to_csv('Data/lots_revs_meta.csv')


In [13]:
# quick look at the data
print("Shape of all data:", lots_revs_meta.shape)
display(lots_revs_meta.head(3))

# value counts
print("\nValue counts of product reviews per category:\n",lots_revs_meta['category_x'].value_counts())

Shape of all data: (532991, 21)


Unnamed: 0.1,Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,...,vote,style,description,title,brand,feature,rank,main_cat,date,price
0,0,5.0,True,"03 2, 2018",A2E7LAVOYZO0LX,B003SBGW8C,Donna Matheson,Works great!,Five Stars,1519948800,...,,,['The Grace Company-True Grips Non Slip Adhesi...,Crafters Workshop TrueCut Non-Slip Ruler Grips...,CRAFTERS WORKSHOP,['Apply these little rings to the back of any ...,"['>#902 in Arts, Crafts & Sewing (See Top 100 ...","Arts, Crafts & Sewing",,$5.11
1,1,5.0,True,"10 23, 2013",A3BC8NCB5H9UOC,B004BOY8NG,Jopaloma,The boning was very easy to use. The size was...,Just the ticket!,1382486400,...,2.0,{'Size:': ' 12-Yard'},['Boning provides shape and support to straple...,"Dritz Featherlite Boning, 12-Yard",Dritz,['Boning provides shape and support to straple...,"['>#44,492 in Arts, Crafts & Sewing (See Top 1...","Arts, Crafts & Sewing",,$19.14
2,2,5.0,True,"03 6, 2017",ADPIGCF2FF40K,B005N419GM,"Nurse Deb, PNP",Better quality than expected. This is a great...,Better than expected,1488758400,...,,{'Size:': ' Size-US-4-(3.5mm)'},['1 x HiyaHiya Circular 9-inch (23cm) Steel Kn...,HiyaHiya Circular 9-inch (23cm) Steel Knitting...,HiyaHiya,['1 x HiyaHiya Circular 9-inch (23cm) Steel Kn...,"['>#227,598 in Arts, Crafts & Sewing (See Top ...","Arts, Crafts & Sewing",,$11.00



Value counts of product reviews per category:
 musical_instruments           31144
cds_and_vinyl                 29761
video_games                   28596
office_products               28575
movies_and_tv                 28252
pet_supplies                  27819
tools_and_home_improvement    26769
home_and_kitchen              26276
electronics                   26220
patio_lawn_and_garden         26120
automotive                    25944
toys_and_games                25846
grocery_and_gourmet_food      25500
arts_crafts                   25392
clothing_shoes_and_jewelry    25383
sports_and_outdoors           25371
cell_phones                   25015
digital_music                 25008
kindle_store                  25000
prime_pantry                  25000
Name: category_x, dtype: int64


***
## <a id='toc4_3_'></a>[Merge Large Reviews with Few Reviews](#toc0_)

We now have two CSV files:

1. `few_revs_meta.csv`: contains the reviews and product metadata for categories with fewer reviews
2. `lots_revs_meta.csv`: contains the reviews and product metadata for categories with a lot of reviews

We concatenate these two datasets to create one large dataset that we will use for data cleaning and the subsequent analysis.
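
Before stacking two frames, it can help to confirm that they share the same columns and that no rows are lost. The sketch below uses small hypothetical frames named after the notebook's `few` and `lots` (the values are made up) to illustrate both checks.

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets.
few = pd.DataFrame({
    "asin": ["A1"],
    "category_x": ["beauty"],
    "price": ["$5.11"],
})
lots = pd.DataFrame({
    "asin": ["B2", "C3"],
    "category_x": ["kindle_store", "prime_pantry"],
    "price": ["$7.99", None],
})

# Columns present in only one frame would be filled with NaN by concat.
mismatched = set(few.columns) ^ set(lots.columns)

all_rows = pd.concat([few, lots], ignore_index=True)

# concat stacks rows, so the total should equal the sum of the inputs.
rows_ok = len(all_rows) == len(few) + len(lots)

print(mismatched, rows_ok)
```

A non-empty `mismatched` set would point to columns, such as leftover index columns, that exist in only one of the two files.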

In [18]:
# load and merge csv files
few = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/few_revs_meta.csv", low_memory=False)
lots = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/lots_revs_meta.csv", low_memory=False)

# concatenate both dataframes and drop the leftover index columns
frames = [few, lots]
all_revs_meta = pd.concat(frames)
all_revs_meta = all_revs_meta.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'])

# save to csv
all_revs_meta.to_csv('Data/all_revs_meta.csv')

In [19]:
# quick look at the data
print("Shape of all data:", all_revs_meta.shape)
display(all_revs_meta.head(3))

# value counts
print("\nValue counts of product reviews per category:\n",all_revs_meta['category_x'].value_counts())



Shape of all data: (617770, 20)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category_x,vote,description,title,brand,feature,rank,main_cat,date,price
0,5.0,False,"04 7, 2018",A31URN5S2Q0UJV,B000URXP6E,{'Size:': ' Small'},Boris Jones,Was skeptical at first. The liquid is kind of ...,Awesome quality!,1523059200,beauty,,['Juicy burst of Starburst fruit flavored Lip ...,Bonne Bell Smackers Bath and Body Starburst Co...,Bonne Bell,[],"1,390,827 in Beauty &amp; Personal Care (",All Beauty,,
1,5.0,True,"02 4, 2014",A31XUJMEDBUGKR,B000URXP6E,{'Size:': ' 23'},Terry V.,Beautiful Beginnings have been the answer to m...,Works great!,1391472000,beauty,,['Juicy burst of Starburst fruit flavored Lip ...,Bonne Bell Smackers Bath and Body Starburst Co...,Bonne Bell,[],"1,390,827 in Beauty &amp; Personal Care (",All Beauty,,
2,5.0,True,"05 11, 2013",A2XPTXCAX8WLHU,B000URXP6E,{'Size:': ' 263'},Mindy Lipton,My daughter bought this for me because she kno...,Love it,1368230400,beauty,,['Juicy burst of Starburst fruit flavored Lip ...,Bonne Bell Smackers Bath and Body Starburst Co...,Bonne Bell,[],"1,390,827 in Beauty &amp; Personal Care (",All Beauty,,



Value counts of product reviews per category:
 musical_instruments           31144
cds_and_vinyl                 29761
video_games                   28596
office_products               28575
movies_and_tv                 28252
pet_supplies                  27819
tools_and_home_improvement    26769
industrial                    26534
home_and_kitchen              26276
electronics                   26220
luxury_beauty                 26134
patio_lawn_and_garden         26120
automotive                    25944
toys_and_games                25846
grocery_and_gourmet_food      25500
arts_crafts                   25392
clothing_shoes_and_jewelry    25383
sports_and_outdoors           25371
cell_phones                   25015
digital_music                 25008
kindle_store                  25000
prime_pantry                  25000
software                      14103
beauty                         5767
magazine_subscriptions         3810
fashion                        3176
gift_cards      