# Data Loading

Here we take samples from each category of the Amazon reviews dataset and load them into a dataframe.

In [1]:
# import modules
import pandas as pd
import gzip
import json
import random
import linecache


# Datasets

We have individual datasets for each category. These data have been reduced to the $k$-core, so that each remaining user and item has at least $k$ reviews.

- Amazon Fashion	
- All Beauty	
- Appliances	
- Arts, Crafts and Sewing	
- Automotive	
- Books	
- CDs and Vinyl	
- Cell Phones and Accessories	
- Clothing, Shoes and Jewelry	
- Digital Music	
- Electronics	
- Gift Cards	
- Grocery and Gourmet Food	
- Home and Kitchen	
- Industrial and Scientific	
- Kindle Store	
- Luxury Beauty	
- Magazine Subscriptions	
- Movies and TV	
- Musical Instruments	
- Office Products	
- Patio, Lawn and Garden	
- Pet Supplies	
- Prime Pantry	
- Software	
- Sports and Outdoors	
- Tools and Home Improvement	
- Toys and Games	
- Video Games	
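
To make the $k$-core idea concrete, here is a minimal sketch of how such a filter could be written with pandas. It is not necessarily how the dataset authors produced the files; the function name `k_core` and the toy data are made up for illustration, and the column names follow the review fields described in this notebook.

```python
import pandas as pd


def k_core(df, k, user_col="reviewerID", item_col="asin"):
    """Repeatedly drop users and items with fewer than k reviews until none remain."""
    while True:
        # Number of reviews written by each row's user and received by each row's item
        user_counts = df[user_col].map(df[user_col].value_counts())
        item_counts = df[item_col].map(df[item_col].value_counts())
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return df
        # Dropping rows can push other users/items below k, so loop again
        df = df[keep]


# Toy data (made up): user u3 has only one review and is removed for k=2
toy = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u2", "u3"],
    "asin":       ["p1", "p2", "p1", "p2", "p1"],
})
core = k_core(toy, k=2)
print(core)
```

The loop is needed because removing one sparse user can leave an item with too few reviews, and vice versa.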

***

### Review Dataset
Format is one-review-per-line in json. 

- **overall**: rating of the product
- **reviewerID**: ID of the reviewer, e.g. A2SUAM1J3GNN3B
- **asin**: ID of the product, e.g. 0000013714
- **reviewerName**: name of the reviewer
- **vote**: helpful votes of the review
- **style**: a dictionary of the product metadata, e.g., "Format" is "Hardcover"
- **reviewText**: text of the review
- **summary**: summary of the review
- **unixReviewTime**: time of the review (unix time)
- **reviewTime**: time of the review (raw)
- **image**: images that users post after they have received the product
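
As a small illustration of this layout, the sketch below parses two made-up review lines with `json.loads` and builds a dataframe from them. The records are invented for the example (only the IDs are taken from the field list above). It shows that optional fields such as **vote** simply become `NaN` for reviews that lack them.

```python
import json

import pandas as pd

# Two made-up records in the one-review-per-line layout
lines = [
    '{"overall": 5.0, "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "summary": "Five Stars"}',
    '{"overall": 3.0, "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "summary": "Okay", "vote": "2"}',
]

# Each line is a complete JSON object, so it can be parsed on its own
records = [json.loads(line) for line in lines]
reviews = pd.DataFrame(records)

print(reviews)
print(reviews["vote"].isna().sum(), "review(s) without a vote field")
```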

***
### Product Metadata Dataset
We also have a metadata file for each category, with the following fields for each product:

- **asin**: ID of the product, e.g. 0000031852
- **title**: name of the product
- **feature**: bullet-point format features of the product
- **description**: description of the product
- **price**: price in US dollars (at time of crawl)
- **imageURL**: url of the product image
- **imageURLHighRes**: url of the high resolution product image
- **related**: related products (also bought, also viewed, bought together, buy after viewing)
- **salesRank**: sales rank information
- **brand**: brand name
- **categories**: list of categories the product belongs to
- **tech1**: the first technical detail table of the product
- **tech2**: the second technical detail table of the product
- **similar**: similar product table


***
# Quick look at Reviews in a Product Category (*Example*)

For Example: ***The Fashion Dataset*** 
    (AMAZON_FASHION_5.json)

In [3]:
# data path
fashion_data = "/Users/pavansingh/Library/CloudStorage/GoogleDrive-sngpav003@myuct.ac.za/My Drive/Masters 2022/Dissertation/Masters-Dissertation/Data/AMAZON_FASHION_5.json"
fashion_data = pd.read_json(fashion_data, lines=True)
#fashion_data = fashion_data.loc[:,['reviewerName', 'reviewText', 'overall', 'style']]
display(fashion_data.loc[10:14,:])
print("Shape of Data:", fashion_data.shape)

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
10,2,True,"01 25, 2018",A3HX4X3TIABWOV,B000KPIHQ4,"{'Size Name:': ' Men's 6-6.5, Women's 8-8.5', ...",Denise A. Conte,Relieved my Plantar Fascitis for 3 Days. Then ...,These were recommended by my Podiatrist,1516838400,,
11,2,True,"01 5, 2017",AW8UBYMNJ894V,B000KPIHQ4,"{'Size Name:': ' Men's 8-8.5, Women's 10-10.5'...",Cognizant Consumer,This is my 6th pair and they are the best thin...,Not the same as all my other pairs.,1483574400,,
12,5,True,"10 17, 2016",A265UZVOZWTTXQ,B000KPIHQ4,,William_Jasper,We have used these inserts for years. They pr...,Great inserts,1476662400,,
13,5,True,"08 22, 2016",AW8UBYMNJ894V,B000KPIHQ4,,Cognizant Consumer,Pinnacle seems to have more cushioning so my h...,Personal favorite,1471824000,,
14,5,True,"03 23, 2016",A265UZVOZWTTXQ,B000KPIHQ4,,William_Jasper,Excellent insole with good support.,Five Stars,1458691200,,


Shape of Data: (3176, 12)


Below we:

1. Count the missing values in the `style` column of the `fashion_data` DataFrame. The `isna()` method returns a boolean mask that is `True` wherever `style` is missing (i.e., `NaN`), and `sum()` counts the `True` values in that mask.

2. Drop all rows of `fashion_data` that have a missing value in the `style` column. We use `dropna()` with `subset=["style"]`, so only missing values in that column cause a row to be dropped. Setting `inplace=True` modifies `fashion_data` directly rather than returning a new DataFrame.

3. Sort `fashion_data` by the `overall` column in descending order, using `sort_values()` with `by=['overall']` and `ascending=False`.


In [4]:
# see NA's in style
print(fashion_data['style'].isna().sum())

# remove NA's in style
fashion_data.dropna(subset=["style"], inplace=True)

# Sort resulting dataframe by overall rating
fashion_data.sort_values(by=['overall'], inplace=True, ascending=False)

# show resulting dataset
display(fashion_data.head(10))

# Shape of data
print("Shape of Data:", fashion_data.shape)

69


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,5,True,"09 4, 2015",ALJ66O1Y6SLHA,B000K2PJ4K,"{'Size:': ' Big Boys', 'Color:': ' Blue/Orange'}",Tonya B.,Great product and price!,Five Stars,1441324800,,
1963,5,True,"04 18, 2016",AZRZ2FB7CFNOE,B0092UF54A,"{'Size:': ' 8 B(M) US', 'Color:': ' Black/Whit...",Catherine Uribe,I love my tennis shoes,Five Stars,1460937600,,
1951,5,True,"05 15, 2016",A2KCFRMKVHYSU7,B0092UF54A,"{'Size:': ' 8 B(M) US', 'Color:': ' Ocean Fog/...",Saving star,Really comfy and nice color,Great and comfy for sports,1463270400,,
1953,5,False,"05 10, 2016",A22WG2NE4D47UM,B0092UF54A,"{'Size:': ' 9 B(M) US', 'Color:': ' Black/Wolf...",Amazon Customer,These are the most comfortable shoes I've used...,Best shoes ever!!!,1462838400,,
1954,5,True,"05 9, 2016",A1SC6HVU28ND3D,B0092UF54A,"{'Size:': ' 8.5 B(M) US', 'Color:': ' Black/Wh...",Sarah,Very comfortable and looks great!,Five Stars,1462752000,,
1955,5,True,"05 8, 2016",AJDH2WVLX79KA,B0092UF54A,"{'Size:': ' 6.5 B(M) US', 'Color:': ' Black/Wh...",brookelynne,Love this shoes so comfy and great very everyd...,Fits wonderful,1462665600,,
1958,5,True,"05 4, 2016",A1AWX0M8R6A2I1,B0092UF54A,"{'Size:': ' 12 D(M) US', 'Color:': ' Cool Grey...",CBP,Perfict fit for me. Great looking shoes at a g...,nice,1462320000,,
1959,5,True,"04 23, 2016",AT5OQFDS6PEE1,B0092UF54A,"{'Size:': ' 9.5 B(M) US', 'Color:': ' Black/Wh...",H. Heckstall,The sneakers are very comfortable and fit to s...,Five Stars,1461369600,,
1960,5,True,"04 21, 2016",AOFQAZVA6Q6E7,B0092UF54A,"{'Size:': ' 10 B(M) US', 'Color:': ' Black/Whi...",D. Resendes,I've had these shoes for about a week now and ...,Wide Feet so Somewhat Tight,1461196800,3.0,
1965,5,True,"03 31, 2016",A2TRI54C8EMCX,B0092UF54A,"{'Size:': ' 9 B(M) US', 'Color:': ' Black/Wolf...",Andrea Seo B.,Love it!! Super comfortable and nice!! Got mor...,Love it!! Definetly recommend it,1459382400,,


Shape of Data: (3107, 12)


After dropping the 69 rows with a missing `style` value, 3107 of the original 3176 reviews remain in our fashion dataset.
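
The same count, drop and sort steps can also be chained without `inplace=True`, which keeps the original dataframe untouched. The sketch below uses a made-up four-row dataframe rather than the fashion data, and checks that the number of remaining rows equals the original count minus the number of missing `style` values.

```python
import pandas as pd

# Made-up stand-in for fashion_data: one row has no 'style' value
toy = pd.DataFrame({
    "overall": [2, 5, 4, 5],
    "style": [{"Size:": " Small"}, None, {"Color:": " Blue"}, {"Size:": " Large"}],
})

n_missing = toy["style"].isna().sum()

# dropna and sort_values each return a new DataFrame, so they can be chained
cleaned = toy.dropna(subset=["style"]).sort_values(by="overall", ascending=False)

print("Missing style values:", n_missing)
print("Rows before:", len(toy), "Rows after:", len(cleaned))
```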

***
# Combining the Review Datasets

We have individual datasets for each category. We combine them into one larger dataset encompassing all the categories (5-core dataset).

**The following function draws a random sample of up to 50,000 reviews from a large JSON file:**

``` py
def read_file(filename, category):
    num_lines = sum(1 for line in open(filename))
    selected_lines = set()
    while len(selected_lines) < min(50000, num_lines):
        line_num = random.randint(1, num_lines)
        if line_num not in selected_lines:
            selected_lines.add(line_num)
            line = linecache.getline(filename, line_num)
            selected_data = json.loads(line)
            selected_data['category'] = category
            yield selected_data
```

1. It calculates the total number of lines in the file using the `sum(1 for line in open(filename))` expression.
2. It initializes an empty set called `selected_lines`, which will **store the line numbers that have been selected**.
3. It enters a loop that continues until the number of selected lines reaches the minimum value between 50,000 and the total number of lines in the file (`min(50000, num_lines)`).
4. Within each iteration of the loop, it generates a random line number using `random.randint(1, num_lines)`.
5. If the randomly generated line number is not already in the `selected_lines` set, it adds the line number to the set and proceeds to read that specific line from the file using `linecache.getline(filename, line_num)`.
6. The selected line is then parsed as JSON using `json.loads(line)`.
7. Additional data, such as the **category**, is added to the selected data object.
8. The selected data object is yielded, which means it will be returned as an element of an iterator.
9. The loop continues until the desired number of lines is selected.
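
For comparison, the steps above can also be sketched with `random.sample`, which picks all line numbers up front instead of retrying on duplicates. This is an alternative, not the function used in this notebook: the names `sample_reviews`, `max_reviews` and `seed` are hypothetical, and unlike `read_file` it yields the records in file order rather than in random order.

```python
import json
import os
import random
import tempfile


def sample_reviews(filename, category, max_reviews=50000, seed=None):
    """Yield up to max_reviews randomly chosen records from a one-JSON-per-line file."""
    rng = random.Random(seed)
    # First pass: count the lines (the file is closed again afterwards)
    with open(filename) as f:
        num_lines = sum(1 for _ in f)
    # Choose distinct zero-based line indices in one call
    chosen = set(rng.sample(range(num_lines), min(max_reviews, num_lines)))
    # Second pass: parse only the chosen lines
    with open(filename) as f:
        for i, line in enumerate(f):
            if i in chosen:
                record = json.loads(line)
                record["category"] = category
                yield record


# Toy usage with a made-up ten-line file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_5.json")
    with open(path, "w") as f:
        for i in range(10):
            f.write(json.dumps({"asin": f"P{i}"}) + "\n")
    sampled = list(sample_reviews(path, "toy", max_reviews=4, seed=0))

print(len(sampled), "records sampled")
```

Passing a `seed` makes the sample repeatable, which can help when the CSV files need to be regenerated.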


In [6]:
def read_file(filename, category):
    num_lines = sum(1 for line in open(filename))
    selected_lines = set()
    while len(selected_lines) < min(50000, num_lines):
        line_num = random.randint(1, num_lines)
        if line_num not in selected_lines:
            selected_lines.add(line_num)
            line = linecache.getline(filename, line_num)
            selected_data = json.loads(line)
            selected_data['category'] = category
            yield selected_data

data = []

# category files - large reviews
arts_crafts = "/Users/pavansingh/Desktop/Amazon Review Data/Arts_Crafts_and_Sewing_5.json"
automotive = "/Users/pavansingh/Desktop/Amazon Review Data/Automotive_5.json"
cds_vinyl = "/Users/pavansingh/Desktop/Amazon Review Data/CDs_and_Vinyl_5.json"
cell_phones = "/Users/pavansingh/Desktop/Amazon Review Data/Cell_Phones_and_Accessories_5.json"
clothing_shoes = "/Users/pavansingh/Desktop/Amazon Review Data/Clothing_Shoes_and_Jewelry_5.json"
electronics = "/Users/pavansingh/Desktop/Amazon Review Data/Electronics_5.json"
grocery = "/Users/pavansingh/Desktop/Amazon Review Data/Grocery_and_Gourmet_Food_5.json"
home_kitchen = "/Users/pavansingh/Desktop/Amazon Review Data/Home_and_Kitchen_5.json"
kindle_store = "/Users/pavansingh/Desktop/Amazon Review Data/Kindle_Store_5.json"
movies_tv = "/Users/pavansingh/Desktop/Amazon Review Data/Movies_and_TV_5.json"
musical_instruments = "/Users/pavansingh/Desktop/Amazon Review Data/Musical_Instruments_5.json"
office_products = "/Users/pavansingh/Desktop/Amazon Review Data/Office_Products_5.json"
patio_lawn = "/Users/pavansingh/Desktop/Amazon Review Data/Patio_Lawn_and_Garden_5.json"
pet_supplies = "/Users/pavansingh/Desktop/Amazon Review Data/Pet_Supplies_5.json"
sports_outdoors = "/Users/pavansingh/Desktop/Amazon Review Data/Sports_and_Outdoors_5.json"
tools_home = "/Users/pavansingh/Desktop/Amazon Review Data/Tools_and_Home_Improvement_5.json"
toys_games = "/Users/pavansingh/Desktop/Amazon Review Data/Toys_and_Games_5.json"
video_games = "/Users/pavansingh/Desktop/Amazon Review Data/Video_Games_5.json"

# load each file and join into dataframe
for category, filename in [('arts_crafts_and_sewing', arts_crafts), ('automotive', automotive), ('cds_and_vinyl', cds_vinyl), ('cell_phones_and_accessories', cell_phones), ('clothing_shoes', clothing_shoes), ('electronics', electronics), ('grocery', grocery), ('home_and_kitchen', home_kitchen),  ('kindle_store', kindle_store), ('movies_tv', movies_tv), ('musical_instruments', musical_instruments), ('office_products', office_products),  ('patio_lawn', patio_lawn), ('pet_supplies', pet_supplies), ('sports_outdoors', sports_outdoors), ('tools_and_home_improvement', tools_home), ('toys_and_games', toys_games), ('video_games', video_games)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data_with_large_reviews = pd.DataFrame(data)

# show the dataframe
print("Shape of Data with Large Reviews Merged:", data_with_large_reviews.shape)
display(data_with_large_reviews.head(5))

# save df to csv called lots_revs.csv
data_with_large_reviews.to_csv('lots_revs.csv')

# category value counts
print("Value counts of product reviews per category:\n",data_with_large_reviews['category'].value_counts())

Shape of Data with Large Reviews Merged: (500000, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,True,"08 28, 2014",A38C0RFEVT2HY3,B0013D53CS,{'Edition:': ' 2 Pack'},Christian,Awesome glue. Sticks really well. You get a lo...,Awesome glue. Sticks really well,1409184000,arts_crafts_and_sewing,,
1,5.0,True,"05 22, 2016",A3OI7AYBZ2BF1N,B00CB39D9I,{'Color:': ' Neon Pop'},Willow Daybreak,"Good quality, didn't fuzz up. I finger crochet...",made summer infinity scarves with this,1463875200,arts_crafts_and_sewing,,
2,4.0,True,"09 13, 2015",AKWE0F620RRNV,B00K6IOOMQ,,D. Myerscough,Very good for the price - fairly rough canvas.,Four Stars,1442102400,arts_crafts_and_sewing,,
3,4.0,True,"02 27, 2017",A3G86YMTB1Q6T8,B014G1CQP0,,Gisella Baum,Good for make your crochet,Four Stars,1488153600,arts_crafts_and_sewing,,
4,5.0,True,"02 8, 2016",AI40CU6M86F8U,B00KS1TVJW,{'Size:': ' 200pcs'},Maria,My husband thought that they would be a bit bi...,But he still had use for these to make childre...,1454889600,arts_crafts_and_sewing,,


In [8]:
data = []

# category files - smaller reviews
beauty = "/Users/pavansingh/Desktop/Amazon Review Data/All_Beauty_5.json"
fashion = "/Users/pavansingh/Desktop/Amazon Review Data/AMAZON_FASHION_5.json"
appliances = "/Users/pavansingh/Desktop/Amazon Review Data/Appliances_5.json"
gift_cards = "/Users/pavansingh/Desktop/Amazon Review Data/Gift_Cards_5.json"
industrial = "/Users/pavansingh/Desktop/Amazon Review Data/Industrial_and_Scientific_5.json"
luxury_beauty = "/Users/pavansingh/Desktop/Amazon Review Data/Luxury_Beauty_5.json"
magazine_subscriptions = "/Users/pavansingh/Desktop/Amazon Review Data/Magazine_Subscriptions_5.json"
software = "/Users/pavansingh/Desktop/Amazon Review Data/Software_5.json"



# load each file and join into dataframe
for category, filename in [('beauty', beauty), ('fashion', fashion), ('appliances', appliances), ('gift_cards', gift_cards), ('industrial', industrial), ('luxury_beauty', luxury_beauty), ('magazine_subscriptions', magazine_subscriptions), ('software', software)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data_with_less_reviews = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data_with_less_reviews.shape)
display(data_with_less_reviews.head(5))

# save data_with_less_reviews to csv called few_revs.csv
data_with_less_reviews.to_csv('few_revs.csv')

# category value counts
print("Value counts of product reviews per category:\n",data_with_less_reviews['category'].value_counts())

Shape of all data: (113152, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,False,"03 12, 2013",A1PI9G3J7CJRDR,B0012Y0ZG2,{'Size:': ' 33.8 oz'},Adam,I love this body wash though it's becoming har...,Amazing Lather!,1363046400,beauty,,
1,5.0,False,"05 12, 2018",AUS96J3A7A9MK,B00006L9LC,{'Size:': ' Small'},Kirk Wiper,"Got both products from this seller, shampoo an...",Selenium is awesome!,1526083200,beauty,,
2,5.0,True,"07 28, 2015",ACDH4NYWRB1PR,B0012Y0ZG2,{'Size:': ' 494'},CG,Great set!,Great set!,1438041600,beauty,,
3,5.0,True,"03 10, 2014",A3R33QRJ8AC767,B0012Y0ZG2,{'Size:': ' 67'},Amazon Customer,"I love this hard to find lotion/fragrance, as ...",Escada Moon Sparkle,1394409600,beauty,,
4,5.0,True,"12 14, 2016",A1JL5CJJDECOH4,B0012Y0ZG2,{'Size:': ' 29.2'},Tony T.,Great product!,5 stars!,1481673600,beauty,,


industrial                50000
luxury_beauty             34278
software                  12805
beauty                     5269
fashion                    3176
gift_cards                 2972
magazine_subscriptions     2375
appliances                 2277
Name: category, dtype: int64

In [10]:
# final dataset with all combined categories - merge the two dataframes (data_with_large_reviews and data_with_less_reviews)
all_revs = pd.concat([data_with_large_reviews, data_with_less_reviews], ignore_index = True)

# shape of final dataset
print("Shape of Combined/Total Data: ", all_revs.shape)

# save all data to csv called all_revs.csv under Data folder
all_revs.to_csv('Data/all_revs.csv')

Shape of Combined/Total Data:  (613152, 13)
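
One detail matters when a CSV like this is read back later. By default `to_csv` writes the dataframe index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. The sketch below shows the difference on two made-up one-row dataframes, using in-memory buffers instead of files. Whether to pass `index=False` is a choice, and the notebook itself keeps the default.

```python
import io

import pandas as pd

# Made-up stand-ins for the two review dataframes
large = pd.DataFrame({"asin": ["B0013D53CS"], "category": ["arts_crafts_and_sewing"]})
small = pd.DataFrame({"asin": ["B0012Y0ZG2"], "category": ["beauty"]})
combined = pd.concat([large, small], ignore_index=True)

# Default: the index is written as an extra, unnamed first column
buffer_with_index = io.StringIO()
combined.to_csv(buffer_with_index)
buffer_with_index.seek(0)
cols_with_index = pd.read_csv(buffer_with_index).columns.tolist()

# index=False: only the data columns are written
buffer_without_index = io.StringIO()
combined.to_csv(buffer_without_index, index=False)
buffer_without_index.seek(0)
cols_without_index = pd.read_csv(buffer_without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```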


***
# Loading the Metadata

We have metadata and further information about each product. The metadata is too large to load all the files at once, so we load it in batches:

- batch_1
    - beauty, fashion, appliances, gift_cards, industrial, luxury_beauty, magazine_subscriptions, software, arts_crafts_and_sewing, automotive, cds_and_vinyl

In [2]:
# Function to read file
def read_matching_metadata(filename, category, product_ids):
    with open(filename, 'r') as file:
        for line in file:
            data = json.loads(line)
            if data['asin'] in product_ids:
                data['category'] = category
                yield data

# Read the combined reviews file saved above (Data/all_revs.csv) and extract product IDs
reviews_df = pd.read_csv('Data/all_revs.csv', low_memory=False)
product_ids = set(reviews_df['asin'])

In [3]:
# category files - BATCH 1

beauty = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_All_Beauty.json"
fashion = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_AMAZON_FASHION.json"
appliances = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Appliances.json"
gift_cards = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Gift_Cards.json"
industrial = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Industrial_and_Scientific.json"
luxury_beauty = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Luxury_Beauty.json"
magazine_subscriptions = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Magazine_Subscriptions.json"
software = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Software.json"
arts_crafts = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Arts_Crafts_and_Sewing.json"
automotive = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Automotive.json"
cds_vinyl = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_CDs_and_Vinyl.json"

# Load each metadata file and join into a dataframe
metadata_df_batch1 = []

for category, filename in [('beauty', beauty), ('fashion', fashion), ('appliances', appliances), ('gift_cards', gift_cards), ('industrial', industrial), ('luxury_beauty', luxury_beauty), ('magazine_subscriptions', magazine_subscriptions), ('software', software), ('arts_crafts_and_sewing', arts_crafts), ('automotive', automotive), ('cds_and_vinyl', cds_vinyl)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        metadata_df_batch1.append(selected_data)

# to dataframe 
metadata_df_batch1 = pd.DataFrame(metadata_df_batch1)

# Print the resulting metadata dataframe
display(metadata_df_batch1.head(4))

# Value counts
print("Value Counts of products per Category:\n", metadata_df_batch1['category'].value_counts())

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,details,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
0,beauty,,[INDICATIONS: Aqua Velva Cooling After Shave E...,,"Aqua Velva After Shave, Classic Ice Blue, 7 Ounce","[B00J232PCM, B0010V5MKG, B000052Y68, B00KOAIU7...",,Aqua Velva,[],"65,003 in Beauty & Personal Care (","[B01I9TIY1U, B07L1PZCS7, B01N12C89Y, B01I9TINT...",{'  Product Dimensions: ': '3 x 4 x 5 ...,All Beauty,,,,B0000530HU,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
1,beauty,,[<P><STRONG>Restores Moisture to Dehydrated Ha...,,Citre Shine Moisture Burst Shampoo - 16 fl oz,"[B07CSVCGZV, B07KMGC13Z, B0793XJ4WW, B01N7U1HB...",,Citre Shine,[],"1,693,702 in Beauty & Personal Care (",[],"{'ASIN: ': 'B00006L9LC', 'UPC:': '795827187965...",All Beauty,,,$23.00,B00006L9LC,[],[]
2,beauty,,"[A richly pigmented, micronized powder formula...",,"NARS Blush, Taj Mahal",[],,NARS,[],"505,302 in Beauty & Personal Care (","[B07FVJJ39R, B07JBQZDKB, B07HKVJC7G, B010VWL4E...","{'  Item Weight: ': '0.16 ounces', 'Sh...",All Beauty,,,$34.50,B00021DJ32,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
3,beauty,,[Avalon Organics Wrinkle Therapy Cleansing Mil...,,Avalon Organics Wrinkle Therapy CoQ10 Cleansin...,"[B0014407HC, B001ECQ41M, B00503OFIU, B00015XAQ...",,Avalon,[],"141,988 in Beauty &amp; Personal Care (","[B077ZG4C3L, B07DW6ZLFS, B00503OFIU, B07DVZMGL...",{'  Product Dimensions: ': '2.5 x 1.4 ...,All Beauty,,,$8.27,B0002JHI1I,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...


Value Counts of products per Category:
 cds_and_vinyl             30394
automotive                27173
arts_crafts_and_sewing    15502
industrial                 5532
luxury_beauty              1625
software                    855
magazine_subscriptions      249
gift_cards                  148
beauty                       89
appliances                   49
fashion                      31
Name: category, dtype: int64


In [4]:
# category files - BATCH 2

cell_phones = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Cell_Phones_and_Accessories.json"
clothing_shoes = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Clothing_Shoes_and_Jewelry.json"

# Load each metadata file and join into a dataframe
metadata_df_batch2 = []

for category, filename in [('cell_phones_and_accessories', cell_phones), ('clothing_shoes', clothing_shoes)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        metadata_df_batch2.append(selected_data)

# to dataframe 
metadata_df_batch2 = pd.DataFrame(metadata_df_batch2)

# Print the resulting metadata dataframe
display(metadata_df_batch2.head(4))

# Value counts
print("Value Counts of products per Category (batch 2):\n", metadata_df_batch2['category'].value_counts())

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,details,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes
0,cell_phones_and_accessories,,"[, Elegani Butterfly Case 3D Pattern Back Cove...",,MinisDesign 3d Bling Crystal Bow Transparent C...,[],,ELEGANI,[Fashionable with unique 3D butterfly design f...,"[>#228,544 in Cell Phones & Accessories (See T...",[],{},Cell Phones & Accessories,,,,7508492919,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
1,cell_phones_and_accessories,,"[Product Description\nHTC EVO 4G, Rubberized P...",,Rubberized Purple Wave Flower Snap on Design C...,[],,Generic,[Rubberized Purple Wave Flower Snap on Design ...,"[>#553,803 in Cell Phones & Accessories (See T...",[],{},Cell Phones & Accessories,,,,7532385086,[],[]
2,cell_phones_and_accessories,,[Samsung Official OEM Travel Wall Charger for ...,,Samsung Official OEM Travel Wall Charger for y...,[],,BlackBerry,[Safely charge your phone from your car using ...,"[>#494,589 in Cell Phones & Accessories (See T...",[],{},Cell Phones & Accessories,,,,8288853439,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...
3,cell_phones_and_accessories,,[Safely charge your phone using the original a...,,Samsung Galaxy S2 Phone OEM Official Travel US...,[],,Samsung,[Safely charge your phone using the original a...,"[>#245,551 in Cell Phones & Accessories (See T...","[B00CF34B0A, 9784235951, B00YWCDT9Q, B00E1XGKL...",{},Cell Phones & Accessories,,,,8288878881,[],[]


Value Counts of products per Category (batch 2):
 cell_phones_and_accessories    21933
Name: category, dtype: int64


In [5]:
# category files - BATCH 3

electronics = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Electronics.json"
grocery = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Grocery_and_Gourmet_Food.json"

# Load each metadata file and join into a dataframe
metadata_df_batch3 = []

for category, filename in [('electronics', electronics), ('grocery', grocery)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        metadata_df_batch3.append(selected_data)

# to dataframe 
metadata_df_batch3 = pd.DataFrame(metadata_df_batch3)

# Print the resulting metadata dataframe
display(metadata_df_batch3.head(4))

# Value counts
print("Value Counts of products per Category (batch 3):\n", metadata_df_batch3['category'].value_counts())

KeyError: 'category'
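
Every record yielded by `read_matching_metadata` gets a `category` key, so a `KeyError: 'category'` implies that no product in the batch matched `product_ids` and the resulting dataframe is empty, with no columns. A minimal sketch of a guard is shown below. The helper `load_metadata_batch` and the toy file are hypothetical and not part of the notebook, and the generator is repeated only so the example runs on its own.

```python
import json
import os
import tempfile

import pandas as pd


def read_matching_metadata(filename, category, product_ids):
    # Same generator as in the notebook, repeated so the sketch is self-contained
    with open(filename, "r") as file:
        for line in file:
            data = json.loads(line)
            if data["asin"] in product_ids:
                data["category"] = category
                yield data


def load_metadata_batch(files, product_ids):
    """Hypothetical helper: load (category, filename) pairs into one DataFrame."""
    rows = []
    for category, filename in files:
        rows.extend(read_matching_metadata(filename, category, product_ids))
    batch = pd.DataFrame(rows)
    if batch.empty:
        # No rows means no columns, so batch["category"] would raise KeyError
        print("No metadata matched the sampled product IDs.")
    else:
        print(batch["category"].value_counts())
    return batch


# Toy usage with a made-up metadata file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "meta_toy.json")
    with open(path, "w") as f:
        f.write(json.dumps({"asin": "A1", "title": "Widget"}) + "\n")
        f.write(json.dumps({"asin": "A2", "title": "Gadget"}) + "\n")
    matched = load_metadata_batch([("toy", path)], {"A1"})
    unmatched = load_metadata_batch([("toy", path)], {"ZZ"})
```

A helper along these lines would also remove the repeated loop from each batch cell, since only the list of `(category, filename)` pairs changes between batches.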

In [6]:
# category files - BATCH 4

home_kitchen = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Home_and_Kitchen.json"
kindle_store = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Kindle_Store.json"

# Load each metadata file and join into a dataframe
metadata_df_batch4 = []

for category, filename in [('home_and_kitchen', home_kitchen), ('kindle_store', kindle_store)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        metadata_df_batch4.append(selected_data)

# to dataframe 
metadata_df_batch4 = pd.DataFrame(metadata_df_batch4)

# Print the resulting metadata dataframe
display(metadata_df_batch4.head(4))

# Value counts
print("Value Counts of products per Category (batch 4):\n", metadata_df_batch4['category'].value_counts())

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,home_and_kitchen,,[A collection of over 300 words specially sele...,,Magnetic Poetry - The Poet Kit - More Essentia...,"[1890560014, B006GEBZG2, B00BC5L6AU, 193530527...",,Magnetic Poetry,[This collection (a staff-favorite here at Mag...,"[>#46,857 in Kitchen & Dining (See Top 100 in ...",[],Amazon Home,"class=""a-bordered a-horizontal-stripes a-spa...","July 19, 2004",$19.95,193228981X,[],[],
1,home_and_kitchen,,[Learn all about our Solar System with this st...,,Little Wigwam The Solar System Placemat - incl...,"[B000H6CDUW, B000H6DO9G, 6002582266, B000H6CA6...",,Little Wigwam,"[Includes all the dwarf planets - Ceres, Pluto...","[>#202,807 in Toys & Games (See Top 100 in Toy...","[B000H6DO9G, B07C1PCLWH, B01N3PJ6LN, B073VWPW8...",Toys & Games,"class=""a-bordered a-horizontal-stripes a-spa...",,$7.99,6002582258,[],[],
2,home_and_kitchen,,[Now it's easy to learn your a-b-c with our co...,,Little Wigwam Alphabet Placemat,"[6002582215, B01EYDBKNO, 6002582223, B01KORO72...",,Little Wigwam,"[Placemat Size: 420mm x 297mm (A3), Phonetical...","[>#53,717 in Home & Kitchen (See Top 100 in Ho...",[],Amazon Home,"class=""a-bordered a-horizontal-stripes a-spa...",,$7.99,6002582177,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
3,home_and_kitchen,,[],,The Pampered Chef Serrated Bread Knife,[],,,[],"[>#2,322,675 in Kitchen & Dining (See Top 100 ...",[],Amazon Home,,"March 11, 2009",,7229004187,[],[],{}


Value Counts of products per Category (batch 4):
 home_and_kitchen    31389
Name: category, dtype: int64


In [7]:
# category files - BATCH 5

movies_tv = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Movies_and_TV.json"
musical_instruments = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Musical_Instruments.json"


# Load each metadata file and join into a dataframe
metadata_df_batch5 = []

for category, filename in [('movies_tv', movies_tv), ('musical_instruments', musical_instruments)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        metadata_df_batch5.append(selected_data)

# to dataframe 
metadata_df_batch5 = pd.DataFrame(metadata_df_batch5)

# Print the resulting metadata dataframe
display(metadata_df_batch5.head(4))

# Value counts
print("Value Counts of products per Category (batch 5):\n", metadata_df_batch5['category'].value_counts())

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,musical_instruments,,"[From Alfred Music, the worldwide leader in mu...",,"Alfreds Teach Yourself to Play Ukulele, Comple...",[],,Alfred Music Publishing,[High-quality Firebrand wood soprano ukulele w...,"[>#24,974 in Musical Instruments (See Top 100 ...","[B01F543PAW, B076KFB49J, B015XD4YLY, B01LYBZ4M...",Musical Instruments,"class=""a-bordered a-horizontal-stripes a-spa...","July 10, 2011",,739079891,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
1,musical_instruments,,[This A-frame is designed to adjust the angle ...,,Guitar A-Frame Support,[],,Sageworks,"[Highly recommended by Aaron Shearer, Simple t...","[>#48,224 in Musical Instruments (See Top 100 ...","[B019MIPZ8M, B005QKNUOW, B015CDQ98Q, B0194MNYK...",Musical Instruments,,"July 13, 2007",,786615206,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
2,musical_instruments,,[ChordBuddy is the easiest and most effective ...,,ChordBuddy Guitar Learning System for Right Ha...,"[1480391409, 1480393614, 1495046184, 149502271...",,ChordBuddy,[You just found the easiest way to learn guita...,"[>#5,813 in Musical Instruments (See Top 100 i...",[],Musical Instruments,,"August 2, 2011",$6.04,1480360295,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
3,musical_instruments,,[How To Play Guitar Phase 1 Book. For kids or ...,,Ernie Ball How to Play Guitar Phase 1 Book,[],,Ernie Ball,"[Easy to follow, Good for self-teaching or les...","[>#11,901 in Musical Instruments (See Top 100 ...","[0634065408, 0634047019, 1893907937, 192857102...",Musical Instruments,,"October 21, 2005",,1928571018,[],[],


Value Counts of products per Category (batch 5):
 musical_instruments    10786
Name: category, dtype: int64


In [8]:
# category files - BATCH 6

office_products = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Office_Products.json"
patio_lawn = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Patio_Lawn_and_Garden.json"


# Load each metadata file and join into a dataframe
metadata_df_batch6 = []

for category, filename in [('office_products', office_products), ('patio_lawn', patio_lawn)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        metadata_df_batch6.append(selected_data)

# to dataframe 
metadata_df_batch6 = pd.DataFrame(metadata_df_batch6)

# Print the resulting metadata dataframe
display(metadata_df_batch6.head(4))

# Value counts
print("Value Counts of products per Category (batch 6):\n", metadata_df_batch6['category'].value_counts())

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,office_products,"class=""a-keyvalue prodDetTable"" role=""present...",[Corduroy the bear goes to the launderette wit...,,A Pocket for Corduroy,"[0140501738, 0448421917, 0670063428, 042528875...",,Ingram Book & Distributor,[9780140503524],"[>#422,894 in Office Products (See top 100), >...",[0140501738],Office Products,,"September 14, 2006",$0.95,140503528,[],[],
1,office_products,,"[, ]",,Tri-Fold Organizer Black XXL Book and Bible Cover,"[031043758X, 1934770132, B00W4E1TKU, 031080917...",,Visit Amazon's Zondervan Page,[],"787,995 in Books (","[0310809177, B003JAH9MU, B0007UQKO8, 031082370...",Books,,,,310432065,[],[],
2,office_products,,[This rugged covers is ideal for young explore...,,Adventure Bible Cover Blue Medium,"[0310520347, 0310727472, 0310746027, 031072742...",,Visit Amazon's Zondervan Page,[],"11,234 in Books (","[B0793FF3N5, 0310727472, 0310727421, 031080659...",Books,,,$15.84,310802636,[],[],
3,office_products,,"[Made from durable nylon material, this sporty...",,Compass Med Book and Bible Cover,"[0310520347, B007WAWMZW, 0310802636, B0793FF3N...",,Visit Amazon's Zondervan Page,[],"50,648 in Books (","[0310806593, 0310802636, 031080292X, B007WAWMZ...",Books,,,$16.25,310806607,[],[],


Value Counts of products per Category (batch 6):
 office_products    17142
Name: category, dtype: int64


In [None]:
# category files - BATCH 7

pet_supplies = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Pet_Supplies.json"
sports_outdoors = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Sports_and_Outdoors.json"
tools_home = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Tools_and_Home_Improvement.json"

# Load each metadata file and join into a dataframe
metadata_df_batch7 = []

for category, filename in [('pet_supplies', pet_supplies), ('sports_outdoors', sports_outdoors), ('tools_and_home_improvement', tools_home)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        metadata_df_batch7.append(selected_data)

# to dataframe 
metadata_df_batch7 = pd.DataFrame(metadata_df_batch7)

# Print the resulting metadata dataframe
display(metadata_df_batch7.head(4))

# Value counts
print("Value Counts of products per Category (batch 7):\n", metadata_df_batch7['category'].value_counts())

In [None]:
# category files - BATCH 8

toys_games = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Toys_and_Games.json"
video_games = "/Users/pavansingh/Desktop/Amazon Review Data/Metadata/meta_Video_Games.json"

# Load each metadata file and join into a dataframe
metadata_df_batch8 = []

for category, filename in [('toys_and_games', toys_games), ('video_games', video_games)]:
    for selected_data in read_matching_metadata(filename, category, product_ids):
        metadata_df_batch8.append(selected_data)

# to dataframe 
metadata_df_batch8 = pd.DataFrame(metadata_df_batch8)

# Print the resulting metadata dataframe
display(metadata_df_batch8.head(4))

# Value counts
print("Value Counts of products per Category (batch 8):\n", metadata_df_batch8['category'].value_counts())