# <a id='toc1_'></a>[Data Loading](#toc0_)

Here we take several samples of the Amazon reviews dataset and load them into a dataframe.

- The key thing to remember is that we need to sample the reviewers and make sure we take all their reviews across all the datasets.
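
A minimal sketch of that idea, assuming each file holds one JSON review per line with a `reviewerID` field. The function names `collect_reviewer_ids`, `reviews_for` and `sample_reviewers`, and the `n_reviewers` and `seed` parameters, are hypothetical and are not part of the loading code used later in this notebook:

```python
import json
import random


def collect_reviewer_ids(filenames):
    """First pass: gather every distinct reviewerID across all category files."""
    reviewer_ids = set()
    for filename in filenames:
        with open(filename) as f:
            for line in f:
                reviewer_ids.add(json.loads(line)["reviewerID"])
    return reviewer_ids


def reviews_for(filenames, sampled_ids):
    """Second pass: yield every review written by a sampled reviewer."""
    for filename in filenames:
        with open(filename) as f:
            for line in f:
                review = json.loads(line)
                if review["reviewerID"] in sampled_ids:
                    yield review


def sample_reviewers(filenames, n_reviewers, seed=None):
    """Sample reviewers (not lines), then keep all of their reviews."""
    rng = random.Random(seed)
    # sort so that the same seed always gives the same sample
    all_ids = sorted(collect_reviewer_ids(filenames))
    sampled = set(rng.sample(all_ids, min(n_reviewers, len(all_ids))))
    return list(reviews_for(filenames, sampled))
```

Sampling at the reviewer level costs two passes over the files, but it guarantees that a sampled reviewer's history is complete across categories.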

In [2]:
# reset (removing all variables, functions, and other objects from memory)
%reset -f

# import the modules we need
import pandas as pd
import gzip
import json
import random
import linecache

**Table of contents**<a id='toc0_'></a>    
- [Data Loading](#toc1_)    
- [Datasets](#toc2_)    
    - [Review Dataset](#toc2_1_1_)    
    - [Product Metadata Dataset](#toc2_1_2_)    
- [Quick look at Reviews in a Product Category (*Example*)](#toc3_)    
- [The Review Dataset and Metadata Dataset](#toc4_)    
  - [Data with Fewer Reviews](#toc4_1_)    
  - [Data with A Lot of Reviews](#toc4_2_)    
    - [Batch 1](#toc4_2_1_1_)    
    - [Batch 2](#toc4_2_1_2_)    
    - [Batch 3](#toc4_2_1_3_)    
    - [Batch 4](#toc4_2_2_)    
    - [Batch 5](#toc4_2_3_)    
    - [Batch 6](#toc4_2_4_)    
    - [Batch 7](#toc4_2_5_)    
    - [Batch 8](#toc4_2_6_)    
    - [Batch 9](#toc4_2_7_)    
    - [Batch 10](#toc4_2_8_)    
    - [Merge Batches (for large reviews data)](#toc4_2_9_)    
  - [Merge Large Reviews with Few Reviews](#toc4_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc2_'></a>[Datasets](#toc0_)

We have an individual dataset for each category. These data have been reduced to the $k$-core, such that each of the remaining users and items has at least $k$ reviews.

- Amazon Fashion	
- All Beauty	
- Appliances	
- Arts, Crafts and Sewing	
- Automotive	
- Books	
- CDs and Vinyl	
- Cell Phones and Accessories	
- Clothing, Shoes and Jewelry	
- Digital Music	
- Electronics	
- Gift Cards	
- Grocery and Gourmet Food	
- Home and Kitchen	
- Industrial and Scientific	
- Kindle Store	
- Luxury Beauty	
- Magazine Subscriptions	
- Movies and TV	
- Musical Instruments	
- Office Products	
- Patio, Lawn and Garden	
- Pet Supplies	
- Prime Pantry	
- Software	
- Sports and Outdoors	
- Tools and Home Improvement	
- Toys and Games	
- Video Games	
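
To make the $k$-core reduction concrete, here is a minimal sketch that repeatedly drops users and items with fewer than $k$ reviews until nothing changes. It assumes the reviews are given as `(reviewerID, asin)` pairs, and the function name `k_core` is illustrative. The datasets used here were already reduced, so this is not a step the notebook runs:

```python
from collections import Counter


def k_core(pairs, k):
    """Keep only (user, item) pairs where both user and item have >= k reviews.

    Removing a review can push another user or item below k, so the
    filter is repeated until a pass removes nothing.
    """
    pairs = list(pairs)
    while True:
        user_counts = Counter(user for user, _ in pairs)
        item_counts = Counter(item for _, item in pairs)
        kept = [
            (user, item)
            for user, item in pairs
            if user_counts[user] >= k and item_counts[item] >= k
        ]
        if len(kept) == len(pairs):
            return kept
        pairs = kept
```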

***

### <a id='toc2_1_1_'></a>[Review Dataset](#toc0_)
The format is one review per line, in JSON.

- **overall**: rating of the product
- **reviewerID**: ID of the reviewer, e.g. A2SUAM1J3GNN3B
- **asin**: ID of the product, e.g. 0000013714
- **reviewerName**: name of the reviewer
- **vote**: helpful votes of the review
- **style**: a dictionary of the product metadata, e.g., "Format" is "Hardcover"
- **reviewText**: text of the review
- **summary**: summary of the review
- **unixReviewTime**: time of the review (unix time)
- **reviewTime**: time of the review (raw)
- **image**: images that users post after they have received the product
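
As an illustration of the one-review-per-line format, the sketch below parses a single line with `json.loads` and converts `unixReviewTime` to a date. The record is made up for the example (only the `reviewerID` and `asin` values are the ones quoted in the list above); it is not a real review from the dataset:

```python
import json
from datetime import datetime, timezone

# A made-up line in the one-review-per-line format (not a real review)
line = (
    '{"overall": 5.0, "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
    '"style": {"Format:": " Hardcover"}, "reviewText": "Example text", '
    '"summary": "Example summary", "unixReviewTime": 1458172800, '
    '"reviewTime": "03 17, 2016"}'
)

review = json.loads(line)

# Optional fields such as vote and image are missing from many reviews,
# so read them with .get() instead of indexing
votes = review.get("vote")

# unixReviewTime is seconds since the epoch; interpret it as UTC
review_date = datetime.fromtimestamp(review["unixReviewTime"], tz=timezone.utc).date()
```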

***
### <a id='toc2_1_2_'></a>[Product Metadata Dataset](#toc0_)
We also have product metadata, with the following fields:

- **asin**: ID of the product, e.g. 0000031852
- **title**: name of the product
- **feature**: bullet-point format features of the product
- **description**: description of the product
- **price**: price in US dollars (at time of crawl)
- **imageURL**: url of the product image
- **imageURLHighRes**: url of the high resolution product image
- **related**: related products (also bought, also viewed, bought together, buy after viewing)
- **salesRank**: sales rank information
- **brand**: brand name
- **categories**: list of categories the product belongs to
- **tech1**: the first technical detail table of the product
- **tech2**: the second technical detail table of the product
- **similar**: similar product table


***
# <a id='toc4_'></a>[The Review Dataset and Metadata Dataset](#toc0_)

We have an individual dataset for each category. We combine them to generate one larger dataset encompassing all the categories (5-core dataset).

The following function reads a random sample of lines from a large JSON file:

``` py
def read_file(filename, category):
    # count the lines once; the with block closes the file afterwards
    with open(filename) as f:
        num_lines = sum(1 for _ in f)
    selected_lines = set()
    # keep drawing until we have 500,000 distinct lines (or every line of a smaller file)
    while len(selected_lines) < min(500000, num_lines):
        line_num = random.randint(1, num_lines)
        if line_num not in selected_lines:
            selected_lines.add(line_num)
            line = linecache.getline(filename, line_num)
            selected_data = json.loads(line)
            selected_data['category'] = category
            yield selected_data
```

1. It counts the total number of lines in the file with `sum(1 for _ in f)`, inside a `with` block so the file is closed afterwards.
2. It initializes an empty set called `selected_lines`, which will **store the line numbers that have been selected**.
3. It enters a loop that continues until the number of selected lines reaches the smaller of 500,000 and the total number of lines in the file (`min(500000, num_lines)`).
4. Within each iteration of the loop, it generates a random line number using `random.randint(1, num_lines)`.
5. If the randomly generated line number is not already in the `selected_lines` set, it adds the line number to the set and reads that specific line from the file using `linecache.getline(filename, line_num)`.
6. The selected line is then parsed as JSON using `json.loads(line)`.
7. The **category** is added to the parsed record under the `category` key.
8. The record is yielded, so the function is a generator and the caller receives the sampled reviews one at a time.
9. The loop continues until the desired number of lines has been selected. A file with fewer than 500,000 lines is therefore returned in full, in random order.
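
As a point of comparison, a uniform sample of the same size can also be drawn in a single pass with reservoir sampling (Algorithm R), which avoids counting the lines first and never holds more than `k` raw lines in memory. This is a sketch of an alternative technique, not the function the notebook uses; the name `read_file_reservoir` and the `k` and `seed` parameters are assumptions:

```python
import json
import random


def read_file_reservoir(filename, category, k=500000, seed=None):
    """Return up to k uniformly sampled reviews from a one-JSON-per-line file."""
    rng = random.Random(seed)
    reservoir = []
    with open(filename) as f:
        for i, line in enumerate(f):
            if i < k:
                # fill the reservoir with the first k lines
                reservoir.append(line)
            else:
                # line i is kept with probability k / (i + 1),
                # replacing a uniformly chosen slot
                j = rng.randint(0, i)
                if j < k:
                    reservoir[j] = line
    # parse only the sampled lines and tag them with their category
    sampled = []
    for line in reservoir:
        review = json.loads(line)
        review["category"] = category
        sampled.append(review)
    return sampled
```

Unlike the generator above, this returns a list, and the sampled reviews are not shuffled into random order.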

The following function reads a metadata file and yields only the entries that match a given set of product IDs:

```py
def read_matching_metadata(filename, category, product_ids):
    with open(filename, 'r') as file:
        for line in file:
            data = json.loads(line)
            if data['asin'] in product_ids:
                data['category'] = category
                yield data
```

- `read_matching_metadata` takes three parameters: `filename`, `category`, and `product_ids`.
- It opens the file (assumed to contain one JSON object per line) in read mode using a `with` statement, which ensures the file is properly closed after reading.
- It iterates over each line in the file using a `for` loop.
- For each line, it loads the line as a JSON object using `json.loads(line)`.
- It checks whether the value of the `asin` key in the loaded data is present in the `product_ids` set.
- If there is a match, it adds a `category` key to the data dictionary and assigns it the value of the `category` parameter.
- Finally, it yields the modified data, allowing the caller to iterate over the matching metadata entries one by one.
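
A minimal usage sketch: collect the `asin` values of the sampled reviews into a set, so that each membership test inside the generator is fast, and pass it as `product_ids`. The function is repeated so the block runs on its own; the tiny metadata file, its titles, and the `example` category are made up for illustration:

```python
import json
import os
import tempfile


def read_matching_metadata(filename, category, product_ids):
    with open(filename, 'r') as file:
        for line in file:
            data = json.loads(line)
            if data['asin'] in product_ids:
                data['category'] = category
                yield data


# made-up sampled reviews and metadata rows, for illustration only
reviews = [{"asin": "0000013714"}, {"asin": "0000031852"}, {"asin": "0000013714"}]
metadata_rows = [
    {"asin": "0000031852", "title": "Example product A"},
    {"asin": "9999999999", "title": "Example product B"},
]

# write the metadata as one JSON object per line
tmp_path = os.path.join(tempfile.mkdtemp(), "meta_example.json")
with open(tmp_path, "w") as f:
    for row in metadata_rows:
        f.write(json.dumps(row) + "\n")

# a set removes duplicate product IDs and gives fast membership checks
product_ids = {review["asin"] for review in reviews}
matched = list(read_matching_metadata(tmp_path, "example", product_ids))
```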


In [2]:
# review data
def read_file(filename, category):
    # count the lines once; the with block closes the file afterwards
    with open(filename) as f:
        num_lines = sum(1 for _ in f)
    selected_lines = set()
    # keep drawing until we have 500,000 distinct lines (or every line of a smaller file)
    while len(selected_lines) < min(500000, num_lines):
        line_num = random.randint(1, num_lines)
        if line_num not in selected_lines:
            selected_lines.add(line_num)
            line = linecache.getline(filename, line_num)
            selected_data = json.loads(line)
            selected_data['category'] = category
            yield selected_data

## <a id='toc4_1_'></a>[Data with Fewer Reviews](#toc0_)



In [3]:
# initialise data list
data = []

# category files - smaller reviews
beauty = "/Users/pavansingh/Desktop/Amazon Review Data/All_Beauty_5.json"
fashion = "/Users/pavansingh/Desktop/Amazon Review Data/AMAZON_FASHION_5.json"
appliances = "/Users/pavansingh/Desktop/Amazon Review Data/Appliances_5.json"
gift_cards = "/Users/pavansingh/Desktop/Amazon Review Data/Gift_Cards_5.json"
industrial = "/Users/pavansingh/Desktop/Amazon Review Data/Industrial_and_Scientific_5.json"
luxury_beauty = "/Users/pavansingh/Desktop/Amazon Review Data/Luxury_Beauty_5.json"
magazine_subscriptions = "/Users/pavansingh/Desktop/Amazon Review Data/Magazine_Subscriptions_5.json"
software = "/Users/pavansingh/Desktop/Amazon Review Data/Software_5.json"

# load each file and collect the sampled reviews
category_files = [
    ('beauty', beauty),
    ('fashion', fashion),
    ('appliances', appliances),
    ('gift_cards', gift_cards),
    ('industrial', industrial),
    ('luxury_beauty', luxury_beauty),
    ('magazine_subscriptions', magazine_subscriptions),
    ('software', software),
]
for category, filename in category_files:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data_with_less_reviews = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data_with_less_reviews.shape)
display(data_with_less_reviews.head(5))

# save data_with_less_reviews as few_revs_1.csv in folder Data
data_with_less_reviews.to_csv('Data/few_revs_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data_with_less_reviews['category'].value_counts())

Shape of all data: (140223, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,True,"03 17, 2016",A34SO74JEYQXZW,B0012Y0ZG2,{'Size:': ' 39'},Jose,Sally's stop selling this great shampoo for fu...,Fuller Hair,1458172800,beauty,,
1,5.0,True,"03 13, 2014",A2TZW7B0YG2ZJQ,B0009RF9DW,{'Size:': ' 258'},flavio sanchez,i am ok with this adidas hair and body 3 activ...,i love it,1394668800,beauty,,
2,5.0,True,"10 26, 2013",A4DEEDXZK8L78,B000URXP6E,{'Size:': ' 205'},Gloria Karimi,This is a beautiful scented lotion. Very mois...,Beautiful Scent,1382745600,beauty,,
3,5.0,True,"12 15, 2017",A1X15KWJ11IC1P,B0012Y0ZG2,{'Size:': ' 144'},Mert Ozer,"Lovely product and works great, except the art...",Five Stars,1513296000,beauty,,
4,5.0,True,"02 7, 2014",A3RGQCA2GSFLX2,B000URXP6E,{'Size:': ' 169'},self,hard to find a lab coat the fits nice. this o...,Love the fit of the lab coat......,1391731200,beauty,,


Value counts of product reviews per category:
 industrial                77071
luxury_beauty             34278
software                  12805
beauty                     5269
fashion                    3176
gift_cards                 2972
magazine_subscriptions     2375
appliances                 2277
Name: category, dtype: int64


## <a id='toc4_2_'></a>[Data with A Lot of Reviews](#toc0_)

We split this up into 10 batches of two categories each and load them separately, as the metadata is quite large and takes up a lot of memory.
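
The batching itself can be sketched as a small helper that walks through the category files two at a time, so that only one batch needs to be in memory before it is written out. The names `batched_pairs` and `category_files` are hypothetical, and the file names below are placeholders rather than the full paths used in the cells that follow:

```python
def batched_pairs(pairs, size=2):
    """Yield consecutive groups of `size` (category, filename) pairs."""
    for start in range(0, len(pairs), size):
        yield pairs[start:start + size]


# placeholder file names, standing in for the full paths
category_files = [
    ("arts_crafts", "Arts_Crafts_and_Sewing_5.json"),
    ("automotive", "Automotive_5.json"),
    ("cds_and_vinyl", "CDs_and_Vinyl_5.json"),
    ("cell_phones", "Cell_Phones_and_Accessories_5.json"),
]

for batch_number, batch in enumerate(batched_pairs(category_files), start=1):
    categories = [category for category, _ in batch]
    # each batch would be loaded, saved to its own CSV, and then discarded here
    print(f"batch {batch_number}: {categories}")
```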

### <a id='toc4_2_1_1_'></a>[Batch 1](#toc0_)

- arts_crafts_and_sewing
- automotive

In [4]:
# loading the review data!

data = []

arts_crafts = "/Users/pavansingh/Desktop/Amazon Review Data/Arts_Crafts_and_Sewing_5.json"
automotive = "/Users/pavansingh/Desktop/Amazon Review Data/Automotive_5.json"

# load each file and join into dataframe
for category, filename in [('arts_crafts', arts_crafts), ('automotive', automotive)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch1_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

Shape of all data: (994485, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,True,"12 30, 2014",A13FYJJW30UXMH,B004ALFAAK,{'Size:': ' 1 PACK'},Rats Bats Cats,Good deal.,Good deal,1419897600,arts_crafts,,
1,3.0,True,"03 26, 2013",A5YP0YXLUG3HA,B003S8YG2E,"{'Color:': ' Paint Splatter', 'Size Name:': ' ...",A Person,"While I love the pattern, the tape isn't reall...",Love the pattern,1364256000,arts_crafts,,
2,4.0,True,"09 25, 2017",ABRIV7R6PZ2F3,B015IS1UO8,"{'Size:': ' 6 Sizes - Diameter 4-10mm', 'Color...",H Rose Drummond,not sure of gauge of rings pretty small so far...,Four Stars,1506297600,arts_crafts,,
3,5.0,True,"03 12, 2015",A2AVHO8045O7A8,B00IA8MQ9C,,mika power,I was so excited to get these and the did not ...,wonderful fun,1426118400,arts_crafts,,
4,5.0,True,"02 18, 2018",A10EP7302OWIN3,B01CBT6OM0,{'Color:': ' Light Multi'},Marlon Vinson,"Excellent product, will recommend and shop her...",Five Stars,1518912000,arts_crafts,,


Value counts of product reviews per category:
 automotive     500000
arts_crafts    494485
Name: category, dtype: int64



### <a id='toc4_2_1_2_'></a>[Batch 2](#toc0_)

- cds_and_vinyl
- cell_phones_and_accessories

In [5]:
# loading the review data!

data = []

cds_and_vinyl = "/Users/pavansingh/Desktop/Amazon Review Data/CDs_and_Vinyl_5.json"
cell_phones = "/Users/pavansingh/Desktop/Amazon Review Data/Cell_Phones_and_Accessories_5.json"

# load each file and join into dataframe
for category, filename in [('cds_and_vinyl', cds_and_vinyl), ('cell_phones', cell_phones)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch2_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

Shape of all data: (1000000, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,3.0,True,"09 11, 2015",A52XE06COLD8T,B001UJSTNA,{'Format:': ' Audio CD'},KDub,"I LOVE exotica, 50s, 60s kitchy, cool vibes. ...",Meh........,1441929600,cds_and_vinyl,,
1,5.0,True,"04 28, 2017",AYBXPYFIZSWVG,B014RAMQWI,{'Format:': ' Audio CD'},J.F.,I like it.,Five Stars,1493337600,cds_and_vinyl,,
2,5.0,True,"12 27, 2015",A2BX6I43I2ZOLR,B00BQ1D7X4,{'Format:': ' Audio CD'},Sean George,You know that old car you have that somehow su...,Still pushing,1451174400,cds_and_vinyl,,
3,5.0,True,"04 24, 2013",A3GWY93VRVPZZZ,B001DETFC6,{'Format:': ' Audio CD'},Bay_thoven,Great piece of work to save for History the so...,Amazing,1366761600,cds_and_vinyl,,
4,5.0,True,"11 26, 2015",A3N4L00LS02V9A,B000002LFG,{'Format:': ' Audio CD'},Susan Smith,"I love Joe Sample and on this album, he has Al...",None better,1448496000,cds_and_vinyl,,


Value counts of product reviews per category:
 cds_and_vinyl    500000
cell_phones      500000
Name: category, dtype: int64




### <a id='toc4_2_1_3_'></a>[Batch 3](#toc0_)

- clothing_shoes_and_jewelry
- digital_music

In [6]:
# loading the review data!

data = []

clothing_shoes_and_jewelry = "/Users/pavansingh/Desktop/Amazon Review Data/Clothing_Shoes_and_Jewelry_5.json"
digital_music = "/Users/pavansingh/Desktop/Amazon Review Data/Digital_Music_5.json"

# load each file and join into dataframe
for category, filename in [('clothing_shoes_and_jewelry', clothing_shoes_and_jewelry), ('digital_music', digital_music)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch3_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

Shape of all data: (669781, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,True,"07 16, 2015",A6DZTRP4A1A0E,B001970GZC,"{'Size:': ' Large', 'Product Packaging:': ' St...",Lyle D,Perfect costume. Looks amazing!,Five Stars,1437004800,clothing_shoes_and_jewelry,,
1,5.0,True,"07 7, 2017",A331ABXC5XGCDI,B00TCZIOM0,"{'Size:': ' Medium', 'Color:': ' Black'}",Chris,"grunt style shirts are the best, they never di...",Five Stars,1499385600,clothing_shoes_and_jewelry,,
2,5.0,True,"11 21, 2014",A19KUWKZ3CJW88,B00E5ATDAY,"{'Size:': ' 6', 'Color:': ' Black/ivory'}",Pam Poe,The quality is better than I expected. I orde...,Very Nice Dress!,1416528000,clothing_shoes_and_jewelry,,
3,3.0,True,"02 7, 2017",A2OR6FLKBR0X1G,B00VHR123Y,{'Color:': ' Orange'},JusMe-MrsB,Not what I expected but its doable.,So So,1486425600,clothing_shoes_and_jewelry,,
4,3.0,True,"06 17, 2015",A1M0HXX6B13URH,B00CDC2PEW,"{'Size:': ' Large', 'Color:': ' Kwts090_white'}",molly oehlert,I ordered a white shirt and what I got was mor...,I like the shirt itself but not the color I ne...,1434499200,clothing_shoes_and_jewelry,,


Value counts of product reviews per category:
 clothing_shoes_and_jewelry    500000
digital_music                 169781
Name: category, dtype: int64



### <a id='toc4_2_2_'></a>[Batch 4](#toc0_)

- electronics
- musical_instruments

In [7]:
# loading the review data!

data = []

electronics = "/Users/pavansingh/Desktop/Amazon Review Data/Electronics_5.json"
musical_instruments = "/Users/pavansingh/Desktop/Amazon Review Data/Musical_Instruments_5.json"

# load each file and join into dataframe
for category, filename in [('electronics', electronics), ('musical_instruments', musical_instruments)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch4_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

Shape of all data: (731392, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,True,"08 17, 2013",A21OX7YEFC1H7U,B0001LS36Q,"{'Capacity:': ' 24-Port', 'Model:': ' Web Smart'}",hitech guy,Easy to install and setup. Will write mote lat...,So far so good,1376697600,electronics,,
1,1.0,True,"09 27, 2016",AQ732FT4ENIN1,B000068O4N,{'style:': ' Angled 3.5 mm TRS to 2.5mm TRS'},chickenlady,Does not work.\n\nI connected some headphones ...,Does not work. I connected some headphones to ...,1474934400,electronics,,
2,5.0,False,"06 28, 2017",A3ORA4RO5RI5I4,B014I8V0EO,"{'Length:': ' 6 Feet', 'Style:': ' 1-Pack'}",nioa,These are a must have item. Many computers are...,Must Have!,1498608000,electronics,,
3,4.0,True,"06 14, 2011",A3Q6ITMGJ9C2Y9,B000JWONA2,{'Style:': ' 2-port'},Charlie,I use with a Toshiba Tecra and it was plug and...,Works great - fast!,1308009600,electronics,,
4,1.0,True,"08 7, 2014",A38YDK7Z3IZSRM,B00IGDDFBE,,Jose,This is really only so so. It adds some pretty...,Not sure if I would purchase it again....,1407369600,electronics,2.0,


Value counts of product reviews per category:
 electronics            500000
musical_instruments    231392
Name: category, dtype: int64



### <a id='toc4_2_3_'></a>[Batch 5](#toc0_)

- office_products
- patio_lawn_and_garden

In [8]:
# loading the review data!

data = []

office_products = "/Users/pavansingh/Desktop/Amazon Review Data/Office_Products_5.json"
patio_lawn_and_garden = "/Users/pavansingh/Desktop/Amazon Review Data/Patio_Lawn_and_Garden_5.json"

# load each file and join into dataframe
for category, filename in [('office_products', office_products), ('patio_lawn_and_garden', patio_lawn_and_garden)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch5_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

Shape of all data: (1000000, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,True,"05 27, 2015",A3J2562NFG8QRY,B00L8B1IP6,{'Color:': ' Standard.'},Movin&#039; south,Used for nominal voting during group work...I ...,Great bold colors for group work with flip charts,1432684800,office_products,,
1,5.0,True,"05 19, 2016",A1ZFCRR6FP5HME,B0008G1ULY,"{'Size:': ' 12', 'Color:': ' Pastel'}",Dusty67,Love love love this pastel colors! Hard to fin...,Great markers,1463616000,office_products,,
2,5.0,True,"03 5, 2013",A1CFN7KZK7BUU2,B000Y4A1H4,{'Style:': ' #4 Extra Hard'},Teresa M,I love pencils - yes I am a nerd. I just had ...,If you're ready for adventure - try these,1362441600,office_products,6.0,
3,5.0,True,"06 2, 2015",A29FOGOKKIZPFA,B00HZDRMXS,{'Size:': ' 1-Pack'},Willow Henry,The only complaint I would have is that it doe...,It does its job well,1433203200,office_products,3.0,
4,5.0,True,"02 19, 2013",A2F29ZE6OSBCUZ,B00006B8HT,{'Size:': ' Large'},AJC,I got so sick of my counter space being taken ...,Made my cooking much easier.,1361232000,office_products,,


Value counts of product reviews per category:
 office_products          500000
patio_lawn_and_garden    500000
Name: category, dtype: int64



### <a id='toc4_2_4_'></a>[Batch 6](#toc0_)

- sports_and_outdoors
- video_games

In [9]:
# loading the review data!

data = []

sports_and_outdoors = "/Users/pavansingh/Desktop/Amazon Review Data/Sports_and_Outdoors_5.json"
video_games = "/Users/pavansingh/Desktop/Amazon Review Data/Video_Games_5.json"

# load each file and join into dataframe
for category, filename in [('sports_and_outdoors', sports_and_outdoors), ('video_games', video_games)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch6_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

Shape of all data: (997577, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,True,"07 29, 2017",A1GRU1EEN3VZ4U,B00ZPR5LGQ,"{'Size:': ' 120cm / 48inch | Pack of 3', 'Colo...",Pine Needles,"Have not used to them can yet, but seem high q...",Good,1501286400,sports_and_outdoors,,
1,1.0,True,"11 21, 2017",A1T6YYNI1APC2E,B00WRIDARS,,Betsey I.,This thing is crap. The concept is good. It ha...,This thing is crap. The concept is good,1511222400,sports_and_outdoors,,
2,1.0,True,"11 15, 2016",A2T67VHL6CFTG8,B00QNFRDYC,,Scott,This is the worst pliers I bought for fishing ...,This is the worst pliers I bought for fishing ...,1479168000,sports_and_outdoors,,
3,5.0,True,"02 5, 2017",A236NC79DXYR5J,B00794LHOI,{'Style Name:': ' Dead-Hold BDC MOA'},Rottie,Great scope,good product,1486252800,sports_and_outdoors,,
4,4.0,True,"04 29, 2017",AT8VPCOS2X69F,B00WTSJPFW,"{'Size:': ' 10/11 (45/46)', 'Color:': ' White ...",bstern,They are exactly what they're meant to be: tr...,but worked fine for the snorkeling trip we wer...,1493424000,sports_and_outdoors,,


Value counts of product reviews per category:
 sports_and_outdoors    500000
video_games            497577
Name: category, dtype: int64



### <a id='toc4_2_5_'></a>[Batch 7](#toc0_)

- tools_and_home_improvement
- kindle_store

In [10]:
# loading the review data!

data = []

tools_and_home_improvement = "/Users/pavansingh/Desktop/Amazon Review Data/Tools_and_Home_Improvement_5.json"
kindle_store = "/Users/pavansingh/Desktop/Amazon Review Data/Kindle_Store_5.json"

# load each file and join into dataframe
for category, filename in [('tools_and_home_improvement', tools_and_home_improvement), ('kindle_store', kindle_store)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch7_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

Shape of all data: (1000000, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,True,"10 24, 2014",A3HO3HVBC8BV06,B0017YLTAI,{'Style:': ' Helmet Attachable'},Michael McCarthy,These work great and are lighter weight than m...,Five Stars,1414108800,tools_and_home_improvement,,
1,5.0,True,"03 25, 2014",A26K7HQWTE52GC,B0012HBQK8,{'Color:': ' Glazed Cotton White'},DailyReader,I still can't quite figure out how it works so...,Third Toilet - works great,1395705600,tools_and_home_improvement,,
2,4.0,True,"10 9, 2016",A140X1CP9DZMWF,B01GC6UR4O,,M. Swift,"As advertised, these shower heads increase the...",Good product for a great price!,1475971200,tools_and_home_improvement,2.0,
3,5.0,True,"03 24, 2016",AXOVPN08BHKI1,B0006FIOA2,{'Style:': ' Wet tile saw w/stand'},BH,Works great and is lightweight.,Five Stars,1458777600,tools_and_home_improvement,,
4,5.0,True,"05 4, 2015",A3OYDT2EL5EJQI,B001NIK6PC,,Nancy Lester,This variable voltage control allows me to slo...,Perfect for controlling speed of brush type motor,1430697600,tools_and_home_improvement,9.0,


Value counts of product reviews per category:
 tools_and_home_improvement    500000
kindle_store                  500000
Name: category, dtype: int64


### <a id='toc4_2_6_'></a>[Batch 8](#toc0_)

- toys_and_games
- prime_pantry

In [11]:
# loading the review data!

data = []

toys_and_games = "/Users/pavansingh/Desktop/Amazon Review Data/Toys_and_Games_5.json"
prime_pantry = "/Users/pavansingh/Desktop/Amazon Review Data/Prime_Pantry_5.json"

# load each file and join into dataframe
for category, filename in [('toys_and_games', toys_and_games), ('prime_pantry', prime_pantry)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch8_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

Shape of all data: (637788, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,category,style,vote,image
0,5.0,True,"09 1, 2013",A23AVLMW9MLN00,B0018DZ9RC,LauraS,just what my son wanted for his birthday. it ...,good value,1377993600,toys_and_games,,,
1,5.0,True,"07 25, 2015",A38N0SHWRHOWO2,B0015H2V72,Michele Bassini Paiao,100%,Five Stars,1437782400,toys_and_games,,,
2,5.0,True,"04 28, 2018",A3C513THIC3C0H,B00S288BKI,letsbejeweled,My 5 year old grandson absolutely loves this g...,Five Stars,1524873600,toys_and_games,,,
3,5.0,True,"07 29, 2014",A3CS0KPZ5L0FL6,B0074MEXRI,SasyPants,Perfect size! We used these during our rocket ...,Five Stars,1406592000,toys_and_games,{'Package Quantity:': ' 1'},,
4,4.0,True,"11 30, 2014",A1ETYJTVV63NUN,B000GI0VLE,Laura Pilarski,"Great game for kids' parties. However, the eg...",Great game for kids' parties,1417305600,toys_and_games,{'Package Quantity:': ' 1'},,


Value counts of product reviews per category:
 toys_and_games    500000
prime_pantry      137788
Name: category, dtype: int64


### <a id='toc4_2_7_'></a>[Batch 9](#toc0_)

- home_and_kitchen
- movies_and_tv

In [12]:
# loading the review data!

data = []

home_and_kitchen = "/Users/pavansingh/Desktop/Amazon Review Data/Home_and_Kitchen_5.json"
movies_and_tv = "/Users/pavansingh/Desktop/Amazon Review Data/Movies_and_TV_5.json"

# load each file and join into dataframe
for category, filename in [('home_and_kitchen', home_and_kitchen), ('movies_and_tv', movies_and_tv)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch9_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

Shape of all data: (1000000, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,True,"09 27, 2014",A1V6S6VMXAX3WA,B006OU4I30,{'Style:': ' Twin'},William J. M,We bought this bed several months ago and my w...,Fantastic Air Mattress & Warranty Service!!,1411776000,home_and_kitchen,,
1,5.0,True,"09 13, 2013",A6D03K4M1KKGM,B000EANPXK,"{'Size:': ' 10"" x 31"" x 12""'}",Amazon Customer,This is a great shelf but be warned -- it is v...,Just what it says!,1379030400,home_and_kitchen,,
2,5.0,True,"11 23, 2016",ABQQ5FX9NDYT3,B000QCPNWM,{'Package Type:': ' Standard Packaging'},Carl O.,"Love this knife. I'm no knife expert, but thi...",Good knife,1479859200,home_and_kitchen,,
3,4.0,True,"12 6, 2015",A1LF25TD7GP7MF,B00W7C7M3M,{'Style Name:': ' Fushia Bagless Vacuum'},kaylin,Its has pretty good suction with a full tank. ...,Its decent.,1449360000,home_and_kitchen,,
4,4.0,True,"12 31, 2013",A2OFO1VBMQPNLP,B000RH173K,"{'Size:': ' Queen', 'Color:': ' Black'}",themmady,This Satin sheet looks great. The pillows are ...,looks great..pillow cover does not have any zi...,1388448000,home_and_kitchen,,


Value counts of product reviews per category:
 home_and_kitchen    500000
movies_and_tv       500000
Name: category, dtype: int64



### <a id='toc4_2_8_'></a>[Batch 10](#toc0_)

- pet_supplies
- grocery_and_gourmet_food




In [13]:
# loading the review data!

data = []

pet_supplies = "/Users/pavansingh/Desktop/Amazon Review Data/Pet_Supplies_5.json"
grocery_and_gourmet_food = "/Users/pavansingh/Desktop/Amazon Review Data/Grocery_and_Gourmet_Food_5.json"

# load each file and join into dataframe
for category, filename in [('pet_supplies', pet_supplies), ('grocery_and_gourmet_food', grocery_and_gourmet_food)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
data = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", data.shape)
display(data.head(5))

# save data in folder Data
data.to_csv('Data/revs_batch10_1.csv')

# category value counts
print("Value counts of product reviews per category:\n",data['category'].value_counts())

Shape of all data: (1000000, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,5.0,True,"01 2, 2017",A53DHI00JOOSS,B000LPOUNW,{'Color:': ' Assorted'},afm,My cats go absolutely ballistic playing with t...,Your cat will love it,1483315200,pet_supplies,,
1,4.0,True,"05 15, 2016",A28ITD3GIXKX3V,B000OA7UR2,{'Size:': ' 6'H x 4'W x 8'L'},roberta yates,Easy to put together but too small for my big ...,A little expensive for a smaller than expected...,1463270400,pet_supplies,,
2,3.0,False,"05 25, 2017",A1FIB34L2IU15Z,B01EVTDWGO,,W. Nicholls,The cats love it. It has a slower action than ...,"Cats love it, doesn't last and refill is too e...",1495670400,pet_supplies,,
3,5.0,True,"07 26, 2017",A35DR60YWZDFXV,B001X9FGO2,"{'Size:': ' Medium/Large', 'Color:': ' Purple'}",Bozena,So two of my dogs play tug a war with this so ...,Great toy.,1501027200,pet_supplies,,
4,5.0,True,"03 14, 2018",A3QE8FQV2TKFLT,B0002DHK1C,{'Size:': ' 1Pack'},Brenda,It worked. It really worked. I just hope tha...,No more ear mites,1520985600,pet_supplies,,


Value counts of product reviews per category:
 pet_supplies                500000
grocery_and_gourmet_food    500000
Name: category, dtype: int64


### <a id='toc4_2_9_'></a>[Merge Batches (for large reviews data)](#toc0_)

In this section, we merge the batches together to create one large dataset.

We use the `pd.concat()` function to merge the batches together. The resulting dataset is saved as a CSV file for use in the next section - **data cleaning**. 




In [14]:
# load the ten batch csv files
data_dir = "/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data"
frames = [
    pd.read_csv(f"{data_dir}/revs_batch{i}_1.csv", low_memory=False)
    for i in range(1, 11)
]

# merge all dataframes
lots_revs_meta = pd.concat(frames)

# save to csv
lots_revs_meta.to_csv('Data/lots_revs_1.csv')
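
The `Unnamed: 0` columns that show up in the outputs appear when a dataframe's row index is written by `to_csv()` and then read back as an ordinary column. Below is a minimal sketch of one way to avoid this, assuming it is acceptable to discard the original row index. The toy dataframes and the in-memory buffer are illustrative stand-ins, not part of the notebook's data.

```python
import io
import pandas as pd

# toy stand-ins for two batch dataframes (hypothetical data)
batch_a = pd.DataFrame({"reviewerID": ["A1", "A2"],
                        "category": ["pet_supplies", "pet_supplies"]})
batch_b = pd.DataFrame({"reviewerID": ["A3"],
                        "category": ["automotive"]})

# ignore_index=True gives the stacked dataframe a fresh 0..n-1 index
merged = pd.concat([batch_a, batch_b], ignore_index=True)

# index=False keeps the row index out of the CSV, so no "Unnamed: 0"
# column appears when the file is read back
buffer = io.StringIO()  # stands in for a file on disk
merged.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))
```

If the batch files have already been written with their index, passing `index_col=0` to `pd.read_csv()` is another option, since it reads the first column back as the index rather than as data.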


In [5]:
# quick look at the data
print("Shape of all data:", lots_revs_meta.shape)
display(lots_revs_meta.head(3))

# value counts
print("\nValue counts of product reviews per category:\n",lots_revs_meta['category'].value_counts())

Shape of all data: (9031023, 15)


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
0,0,0,5.0,True,"12 30, 2014",A13FYJJW30UXMH,B004ALFAAK,{'Size:': ' 1 PACK'},Rats Bats Cats,Good deal.,Good deal,1419897600,arts_crafts,,
1,1,1,3.0,True,"03 26, 2013",A5YP0YXLUG3HA,B003S8YG2E,"{'Color:': ' Paint Splatter', 'Size Name:': ' ...",A Person,"While I love the pattern, the tape isn't reall...",Love the pattern,1364256000,arts_crafts,,
2,2,2,4.0,True,"09 25, 2017",ABRIV7R6PZ2F3,B015IS1UO8,"{'Size:': ' 6 Sizes - Diameter 4-10mm', 'Color...",H Rose Drummond,not sure of gauge of rings pretty small so far...,Four Stars,1506297600,arts_crafts,,



Value counts of product reviews per category:
 sports_and_outdoors           500000
patio_lawn_and_garden         500000
pet_supplies                  500000
movies_and_tv                 500000
home_and_kitchen              500000
toys_and_games                500000
kindle_store                  500000
tools_and_home_improvement    500000
automotive                    500000
grocery_and_gourmet_food      500000
office_products               500000
electronics                   500000
clothing_shoes_and_jewelry    500000
cell_phones                   500000
cds_and_vinyl                 500000
video_games                   497577
arts_crafts                   494485
musical_instruments           231392
digital_music                 169781
prime_pantry                  137788
Name: category, dtype: int64


***
## <a id='toc4_3_'></a>[Merge Large Reviews with Few Reviews](#toc0_)

We now have two CSV files:

1. `few_revs_1.csv`: contains the reviews from product categories with fewer reviews
2. `lots_revs_1.csv`: contains the reviews from product categories with a lot of reviews

We merge these two datasets together to create one large dataset that we will use for data cleaning and the subsequent analysis.

In [6]:
# load and merge csv files
few = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/few_revs_1.csv", low_memory=False)
# reuse the dataframe already in memory; uncomment the next line to reload it from disk instead
#lots = pd.read_csv("/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/lots_revs_1.csv", low_memory=False)
lots = lots_revs_meta

# merge all dataframes
frames = [few, lots]
all_revs_meta = pd.concat(frames)
all_revs_meta = all_revs_meta.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'])

# save to csv
all_revs_meta.to_csv('Data/all_revs_1.csv')
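
Dropping the leftover index columns by exact name only works when those names are known in advance. As a hedged alternative, the sketch below removes every column whose name starts with `Unnamed`, however many there are. The toy dataframe is hypothetical and only mimics the column layout seen above.

```python
import pandas as pd

# toy dataframe with leftover index columns (hypothetical data)
df = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [0, 1],
    "overall": [5.0, 4.0],
    "category": ["beauty", "beauty"],
})

# keep only the columns whose name does not start with "Unnamed"
cleaned = df.loc[:, ~df.columns.str.startswith("Unnamed")]

print(list(cleaned.columns))
```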

## Only Keep Reviewers with at Least 10 Reviews

In [8]:
# only keep reviews from users who have at least 10 reviews
all_revs = all_revs_meta.groupby('reviewerID').filter(lambda x: len(x) >= 10)

# save to csv
all_revs.to_csv('Data/all_revs_with_records_1.csv')
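
`groupby().filter()` with a Python lambda calls the function once per reviewer, which can be slow with millions of reviewers. A possible alternative, sketched here on toy data, computes each reviewer's review count with `transform("size")` and applies a boolean mask. The `MIN_REVIEWS` name and the threshold of 3 are assumptions chosen to keep the example small; the notebook uses 10.

```python
import pandas as pd

MIN_REVIEWS = 3  # hypothetical threshold for the toy data; the notebook uses 10

# toy review table (hypothetical data)
reviews = pd.DataFrame({
    "reviewerID": ["A1", "A1", "A1", "A2", "A3", "A3"],
    "asin": ["B1", "B2", "B3", "B1", "B2", "B4"],
})

# number of reviews written by each row's reviewer, aligned to the original rows
review_counts = reviews.groupby("reviewerID")["reviewerID"].transform("size")

# keep only rows whose reviewer meets the threshold
kept = reviews[review_counts >= MIN_REVIEWS]

# the lambda-based filter used in the notebook, for comparison
kept_via_filter = reviews.groupby("reviewerID").filter(lambda x: len(x) >= MIN_REVIEWS)

print(kept.equals(kept_via_filter))
```

Both approaches select the same rows here; the mask version avoids calling a Python function once per group.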

In [9]:
# quick look at the data
print("Shape of all data:", all_revs.shape) #4 164 059
display(all_revs.head(3))

# value counts
print("\nValue counts of product reviews per category:\n",all_revs['category'].value_counts())

Shape of all data: (4164059, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,vote,image
8,5.0,False,"09 11, 2017",A33PVCHCQ2BTN0,B0010ZBORW,{'Color:': ' Nail Brush'},VW,I really like this nail brush from Urban Spa. ...,"Handy nail brush, gets garden dirt out from un...",1505088000,beauty,,
11,4.0,False,"09 2, 2017",A2503LT8PZIHAD,B0010ZBORW,{'Color:': ' Foot File'},Trouble,This is about the same quality foot file as th...,Basic foot file,1504310400,beauty,,
12,4.0,False,"02 21, 2018",A1MAI0TUIM3R2X,B001LNODUS,{'Color:': ' Body Lotion'},Princess Bookworm,Nice lavender lotion that absorbs easily in my...,Fragrant Lavender Lotion,1519171200,beauty,,



Value counts of product reviews per category:
 arts_crafts                   339563
video_games                   327250
office_products               310464
patio_lawn_and_garden         294888
grocery_and_gourmet_food      266421
cds_and_vinyl                 258165
tools_and_home_improvement    235778
kindle_store                  224016
automotive                    185910
movies_and_tv                 185583
cell_phones                   177378
toys_and_games                172605
pet_supplies                  166771
sports_and_outdoors           160222
musical_instruments           145483
home_and_kitchen              141484
electronics                   140555
digital_music                 130017
prime_pantry                  114630
clothing_shoes_and_jewelry     79786
industrial                     64045
luxury_beauty                  24896
software                       10230
appliances                      2188
gift_cards                      2034
magazine_subscriptions     