# Data Loading
## MSc Advanced Analytics Dissertation

Here we take several samples of the Amazon reviews dataset and load them into a DataFrame.

In [2]:
# import the required modules
import pandas as pd
import gzip

# Datasets

We have an individual dataset for each category. Each one has been reduced to its $k$-core, so that every remaining user and every remaining item has at least $k$ reviews.
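As an illustration of what a $k$-core reduction involves, here is a minimal sketch. The function name `k_core` and the toy data are hypothetical; the published files were already filtered upstream, and this is not claimed to be the procedure used to build them. The filter has to be repeated, because dropping a sparse item can push a user below the threshold.

```python
import pandas as pd

def k_core(reviews: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Repeatedly drop users and items with fewer than k reviews (hypothetical helper)."""
    current = reviews
    while True:
        # number of reviews written by each row's user / received by each row's item
        user_counts = current["reviewerID"].map(current["reviewerID"].value_counts())
        item_counts = current["asin"].map(current["asin"].value_counts())
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return current
        current = current[keep]

# toy data: item "c" has one review, so it goes first, which then removes user "u3"
toy = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "asin":       ["a",  "b",  "a",  "b",  "a",  "c"],
})
core = k_core(toy, k=2)
print(core)
```

With `k=2`, only the four reviews by `u1` and `u2` survive, even though `u3` started with two reviews.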

- Amazon Fashion	
- All Beauty	
- Appliances	
- Arts, Crafts and Sewing	
- Automotive	
- Books	
- CDs and Vinyl	
- Cell Phones and Accessories	
- Clothing, Shoes and Jewelry	
- Digital Music	
- Electronics	
- Gift Cards	
- Grocery and Gourmet Food	
- Home and Kitchen	
- Industrial and Scientific	
- Kindle Store	
- Luxury Beauty	
- Magazine Subscriptions	
- Movies and TV	
- Musical Instruments	
- Office Products	
- Patio, Lawn and Garden	
- Pet Supplies	
- Prime Pantry	
- Software	
- Sports and Outdoors	
- Tools and Home Improvement	
- Toys and Games	
- Video Games	

***
The format is one review per line, in JSON.

- **overall**: rating of the product
- **reviewerID**: ID of the reviewer, e.g. A2SUAM1J3GNN3B
- **asin**: ID of the product, e.g. 0000013714
- **reviewerName**: name of the reviewer
- **vote**: helpful votes of the review
- **style**: a dictionary of the product metadata, e.g., "Format" is "Hardcover"
- **reviewText**: text of the review
- **summary**: summary of the review
- **unixReviewTime**: time of the review (unix time)
- **reviewTime**: time of the review (raw)
- **image**: images that users post after they have received the product
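To make the one-review-per-line layout concrete, the sketch below parses two hand-written records and converts `unixReviewTime` to a timestamp. The records are made up for illustration (only the field names and the example IDs come from the list above), and the `review_date` column name is an assumption.

```python
import json
import pandas as pd

# two hypothetical records, one JSON object per line; the second carries a "vote" field
raw_lines = [
    '{"overall": 5.0, "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "unixReviewTime": 1441324800}',
    '{"overall": 3.0, "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "unixReviewTime": 1460937600, "vote": "3"}',
]

# each line is parsed independently; fields absent from a record become NaN in the DataFrame
records = [json.loads(line) for line in raw_lines]
reviews = pd.DataFrame(records)

# unix time is seconds since 1970-01-01 UTC
reviews["review_date"] = pd.to_datetime(reviews["unixReviewTime"], unit="s")
print(reviews[["overall", "asin", "vote", "review_date"]])
```

Optional fields such as `vote`, `style` and `image` are simply absent from many records, which is why they show up as missing values after loading.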

***
We also have metadata. 

- **asin**: ID of the product, e.g. 0000031852
- **title**: name of the product
- **feature**: bullet-point format features of the product
- **description**: description of the product
- **price**: price in US dollars (at time of crawl)
- **imageURL**: url of the product image
- **imageURLHighRes**: url of the high resolution product image
- **related**: related products (also bought, also viewed, bought together, buy after viewing)
- **salesRank**: sales rank information
- **brand**: brand name
- **categories**: list of categories the product belongs to
- **tech1**: the first technical detail table of the product
- **tech2**: the second technical detail table of the product
- **similar**: similar product table
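Since both the reviews and the metadata carry an `asin` field, the two can in principle be joined on it. The sketch below uses tiny made-up frames (the title and brand values are placeholders) and is only an illustration; the notebook itself does not load the metadata.

```python
import pandas as pd

# hypothetical stand-ins for a review table and a metadata table
reviews = pd.DataFrame({
    "asin": ["B000K2PJ4K", "B000K2PJ4K", "B0092UF54A"],
    "overall": [5, 4, 3],
})
meta = pd.DataFrame({
    "asin": ["B000K2PJ4K"],
    "title": ["Hypothetical product"],
    "brand": ["ExampleBrand"],
})

# left join keeps every review; validate raises if an asin appears twice in the metadata
merged = reviews.merge(meta, on="asin", how="left", validate="many_to_one")
print(merged)
```

Reviews whose product has no metadata row keep their place in the result, with `title` and `brand` left missing.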


***
# Quick look at a Dataset

For example: ***The Fashion Dataset*** (AMAZON_FASHION_5.json)

In [1]:
# data path
fashion_path = "/Users/pavansingh/Library/CloudStorage/GoogleDrive-sngpav003@myuct.ac.za/My Drive/Masters 2022/Dissertation/Masters-Dissertation/Data/AMAZON_FASHION_5.json"
# read the one-review-per-line JSON file into a DataFrame
fashion_data = pd.read_json(fashion_path, lines=True)
#fashion_data = fashion_data.loc[:,['reviewerName', 'reviewText', 'overall', 'style']]
display(fashion_data.loc[10:20,:])
print("Shape of Data:", fashion_data.shape)

NameError: name 'pd' is not defined

Below we:

1. Count the missing values in the `style` column of the `fashion_data` DataFrame. The `isna()` method returns a boolean mask that is `True` wherever `style` is missing (i.e., `NaN`), and `sum()` counts the `True` values in that mask.

2. Drop every row of `fashion_data` that has a missing value in `style`, using `dropna()` with `subset=["style"]` so that only that column is checked. Setting `inplace=True` modifies `fashion_data` directly rather than returning a new DataFrame.

3. Sort `fashion_data` by the `overall` column in descending order, using `sort_values()` with `by=['overall']` and `ascending=False`.
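The three steps can also be tried on a small frame before touching the real data. The sketch below uses a made-up four-row DataFrame and a chained, non-`inplace` variant of the same calls; the names `toy`, `n_missing` and `cleaned` are illustrative only.

```python
import pandas as pd

# hypothetical miniature of the review data: one row has no style information
toy = pd.DataFrame({
    "overall": [3, 5, 4, 1],
    "style": [{"Size:": " M"}, None, {"Color:": " Blue"}, {"Size:": " L"}],
})

# step 1: count missing values in the style column
n_missing = int(toy["style"].isna().sum())

# steps 2 and 3: drop rows with missing style, then sort by rating (highest first)
cleaned = toy.dropna(subset=["style"]).sort_values(by="overall", ascending=False)

print("Missing style values:", n_missing)
print(cleaned)
```

Chaining returns a new DataFrame and leaves `toy` untouched, which can be easier to reason about when cells are re-run out of order.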


In [12]:
# count NAs in style (kept in a variable; a bare expression mid-cell is not displayed)
n_missing_style = fashion_data['style'].isna().sum()

# remove NA's in style
fashion_data.dropna(subset=["style"], inplace=True)

# Sort resulting dataframe by overall rating
fashion_data.sort_values(by=['overall'], inplace=True, ascending=False)

# show resulting dataset
display(fashion_data.head(10))

# Shape of data
print("Shape of Data:", fashion_data.shape)

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,5,True,"09 4, 2015",ALJ66O1Y6SLHA,B000K2PJ4K,"{'Size:': ' Big Boys', 'Color:': ' Blue/Orange'}",Tonya B.,Great product and price!,Five Stars,1441324800,,
1963,5,True,"04 18, 2016",AZRZ2FB7CFNOE,B0092UF54A,"{'Size:': ' 8 B(M) US', 'Color:': ' Black/Whit...",Catherine Uribe,I love my tennis shoes,Five Stars,1460937600,,
1951,5,True,"05 15, 2016",A2KCFRMKVHYSU7,B0092UF54A,"{'Size:': ' 8 B(M) US', 'Color:': ' Ocean Fog/...",Saving star,Really comfy and nice color,Great and comfy for sports,1463270400,,
1953,5,False,"05 10, 2016",A22WG2NE4D47UM,B0092UF54A,"{'Size:': ' 9 B(M) US', 'Color:': ' Black/Wolf...",Amazon Customer,These are the most comfortable shoes I've used...,Best shoes ever!!!,1462838400,,
1954,5,True,"05 9, 2016",A1SC6HVU28ND3D,B0092UF54A,"{'Size:': ' 8.5 B(M) US', 'Color:': ' Black/Wh...",Sarah,Very comfortable and looks great!,Five Stars,1462752000,,
1955,5,True,"05 8, 2016",AJDH2WVLX79KA,B0092UF54A,"{'Size:': ' 6.5 B(M) US', 'Color:': ' Black/Wh...",brookelynne,Love this shoes so comfy and great very everyd...,Fits wonderful,1462665600,,
1958,5,True,"05 4, 2016",A1AWX0M8R6A2I1,B0092UF54A,"{'Size:': ' 12 D(M) US', 'Color:': ' Cool Grey...",CBP,Perfict fit for me. Great looking shoes at a g...,nice,1462320000,,
1959,5,True,"04 23, 2016",AT5OQFDS6PEE1,B0092UF54A,"{'Size:': ' 9.5 B(M) US', 'Color:': ' Black/Wh...",H. Heckstall,The sneakers are very comfortable and fit to s...,Five Stars,1461369600,,
1960,5,True,"04 21, 2016",AOFQAZVA6Q6E7,B0092UF54A,"{'Size:': ' 10 B(M) US', 'Color:': ' Black/Whi...",D. Resendes,I've had these shoes for about a week now and ...,Wide Feet so Somewhat Tight,1461196800,3.0,
1965,5,True,"03 31, 2016",A2TRI54C8EMCX,B0092UF54A,"{'Size:': ' 9 B(M) US', 'Color:': ' Black/Wolf...",Andrea Seo B.,Love it!! Super comfortable and nice!! Got mor...,Love it!! Definetly recommend it,1459382400,,


Shape of Data: (3107, 12)


After dropping the rows with a missing `style`, 3107 reviews remain in our fashion dataset.

***
# Combining the Datasets

We have an individual dataset for each category. We combine them into one larger dataset encompassing all the categories (5-core dataset).

We first load the smaller datasets (the product categories with fewer reviews), then turn to the categories with many reviews (> 100 000).

In [2]:
# load all the datasets into env

beauty = "/Users/pavansingh/Desktop/Amazon Review Data/All_Beauty_5.json"
beauty = pd.read_json(beauty, lines=True)
beauty['category'] = 'beauty'
print("Shape of Beauty: ", beauty.shape)

fashion = "/Users/pavansingh/Desktop/Amazon Review Data/AMAZON_FASHION_5.json"
fashion = pd.read_json(fashion, lines=True)
fashion['category'] = 'fashion'
print("Shape of Fashion: ", fashion.shape)

appliances = "/Users/pavansingh/Desktop/Amazon Review Data/Appliances_5.json"
appliances = pd.read_json(appliances, lines=True)
appliances['category'] = 'appliances'
print("Shape of Appliances: ", appliances.shape)

gift_cards = "/Users/pavansingh/Desktop/Amazon Review Data/Gift_Cards_5.json"
gift_cards = pd.read_json(gift_cards, lines=True)
gift_cards['category'] = 'gift_cards'
print("Shape of Gift Cards: ", gift_cards.shape)

industrial = "/Users/pavansingh/Desktop/Amazon Review Data/Industrial_and_Scientific_5.json"
industrial = pd.read_json(industrial, lines=True)
industrial['category'] = 'industrial'
print("Shape of Industrial: ", industrial.shape)


luxury_beauty = "/Users/pavansingh/Desktop/Amazon Review Data/Luxury_Beauty_5.json"
luxury_beauty = pd.read_json(luxury_beauty, lines=True)
luxury_beauty['category'] = 'luxury_beauty'
print("Shape of Luxury Beauty: ", luxury_beauty.shape)

magazine_subscriptions = "/Users/pavansingh/Desktop/Amazon Review Data/Magazine_Subscriptions_5.json"
magazine_subscriptions = pd.read_json(magazine_subscriptions, lines=True)
magazine_subscriptions['category'] = 'magazine_subscriptions'
print("Shape of Magazine Subscriptions: ", magazine_subscriptions.shape)

software = "/Users/pavansingh/Desktop/Amazon Review Data/Software_5.json"
software = pd.read_json(software, lines=True)
software['category'] = 'software'
print("Shape of Software: ", software.shape)


Shape of Beauty:  (5269, 13)
Shape of Fashion:  (3176, 13)
Shape of Appliances:  (2277, 13)
Shape of Gift Cards:  (2972, 13)
Shape of Industrial:  (77071, 13)
Shape of Luxury Beauty:  (34278, 13)
Shape of Magazine Subscriptions:  (2375, 13)
Shape of Software:  (12805, 13)
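The loading blocks above repeat the same three steps for every category (read the file, tag it, report its shape), so they could be collapsed into a loop. The following is a sketch under that assumption; `load_categories` is a hypothetical helper, and the demo writes two tiny temporary files rather than touching the real data.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def load_categories(paths):
    """Read each JSON-lines file, tag it with its category and concatenate (hypothetical helper)."""
    frames = []
    for category, path in paths.items():
        frame = pd.read_json(str(path), lines=True)
        frame["category"] = category
        print(f"Shape of {category}: ", frame.shape)
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# demo on two small made-up files
with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for category, n_rows in [("beauty", 2), ("software", 3)]:
        path = Path(tmp) / f"{category}.json"
        rows = [{"asin": "B000K2PJ4K", "overall": 5} for _ in range(n_rows)]
        path.write_text("\n".join(json.dumps(row) for row in rows))
        paths[category] = path
    combined = load_categories(paths)

print(combined["category"].value_counts())
```

With the real files, `paths` would map each category label to the corresponding `*_5.json` path.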


In [3]:
# combine the smaller datasets into one dataframe
all_data = pd.concat([beauty, fashion, appliances, gift_cards, industrial, luxury_beauty, magazine_subscriptions, software], ignore_index = True)

# shape of combined data
print("Shape of Combined Data: ", all_data.shape)
display(all_data)

Shape of Combined Data:  (140223, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image,category
0,5,True,"09 1, 2016",A3CIUOJXQ5VDQ2,B0000530HU,"{'Size:': ' 7.0 oz', 'Flavor:': ' Classic Ice ...",Shelly F,As advertised. Reasonably priced,Five Stars,1472688000,,,beauty
1,5,True,"11 14, 2013",A3H7T87S984REU,B0000530HU,"{'Size:': ' 7.0 oz', 'Flavor:': ' Classic Ice ...",houserules18,Like the oder and the feel when I put it on my...,Good for the face,1384387200,,,beauty
2,1,True,"08 18, 2013",A3J034YH7UG4KT,B0000530HU,"{'Size:': ' 7.0 oz', 'Flavor:': ' Classic Ice ...",Adam,I bought this to smell nice after I shave. Wh...,Smells awful,1376784000,,,beauty
3,5,False,"05 3, 2011",A2UEO5XR3598GI,B0000530HU,"{'Size:': ' 7.0 oz', 'Flavor:': ' Classic Ice ...",Rich K,HEY!! I am an Aqua Velva Man and absolutely lo...,Truth is There IS Nothing Like an AQUA VELVA MAN.,1304380800,25,,beauty
4,5,True,"05 6, 2011",A3SFRT223XXWF7,B00006L9LC,{'Size:': ' 200ml/6.7oz'},C. C. Christian,If you ever want to feel pampered by a shampoo...,Bvlgari Shampoo,1304640000,3,,beauty
...,...,...,...,...,...,...,...,...,...,...,...,...,...
140218,4,False,"07 16, 2016",A1E50L7PCVXLN4,B01FFVDY9M,{'Platform:': ' Key Card'},Colinda,When I ordered this it was listed as Photo Edi...,File Management Software with Basic Editing Ca...,1468627200,,,software
140219,3,False,"06 17, 2017",AVU1ILDDYW301,B01HAP3NUG,,G. Hearn,This software has SO much going on. Theres a ...,"Might not be for the ""novice""",1497657600,,,software
140220,4,False,"01 24, 2017",A2LW5AL0KQ9P1M,B01HAP3NUG,,Dr. E,I have used both more complex and less complex...,"Great, Inexpensive Software for Those Who Have...",1485216000,,,software
140221,3,False,"06 14, 2018",AZ515FFZ7I2P7,B01HAP47PQ,{'Platform:': ' PC Disc'},Jerry Jackson Jr.,Pinnacle Studio 20 Ultimate is a perfectly ser...,Gets the job done ... but not as easy as it sh...,1528934400,,,software


Let us now try to combine the data from the larger datasets, drawing a random sample of up to 50 000 reviews from each file.

In [3]:
import json
import random
import linecache
import pandas as pd

def read_file(filename, category):
    # count the lines in the file, closing the handle afterwards
    with open(filename) as f:
        num_lines = sum(1 for _ in f)
    selected_lines = set()
    # draw distinct random line numbers until we have min(50000, num_lines) of them
    while len(selected_lines) < min(50000, num_lines):
        line_num = random.randint(1, num_lines)
        if line_num not in selected_lines:
            selected_lines.add(line_num)
            # linecache line numbers are 1-based
            line = linecache.getline(filename, line_num)
            selected_data = json.loads(line)
            selected_data['category'] = category
            yield selected_data

data = []

# category files
arts_crafts = "/Users/pavansingh/Desktop/Amazon Review Data/Arts_Crafts_and_Sewing_5.json"
automotive = "/Users/pavansingh/Desktop/Amazon Review Data/Automotive_5.json"
cds_vinyl = "/Users/pavansingh/Desktop/Amazon Review Data/CDs_and_Vinyl_5.json"
cell_phones = "/Users/pavansingh/Desktop/Amazon Review Data/Cell_Phones_and_Accessories_5.json"
clothing_shoes = "/Users/pavansingh/Desktop/Amazon Review Data/Clothing_Shoes_and_Jewelry_5.json"
electronics = "/Users/pavansingh/Desktop/Amazon Review Data/Electronics_5.json"
grocery = "/Users/pavansingh/Desktop/Amazon Review Data/Grocery_and_Gourmet_Food_5.json"
home_kitchen = "/Users/pavansingh/Desktop/Amazon Review Data/Home_and_Kitchen_5.json"
kindle_store = "/Users/pavansingh/Desktop/Amazon Review Data/Kindle_Store_5.json"
movies_tv = "/Users/pavansingh/Desktop/Amazon Review Data/Movies_and_TV_5.json"
musical_instruments = "/Users/pavansingh/Desktop/Amazon Review Data/Musical_Instruments_5.json"
office_products = "/Users/pavansingh/Desktop/Amazon Review Data/Office_Products_5.json"
patio_lawn = "/Users/pavansingh/Desktop/Amazon Review Data/Patio_Lawn_and_Garden_5.json"
pet_supplies = "/Users/pavansingh/Desktop/Amazon Review Data/Pet_Supplies_5.json"
sports_outdoors = "/Users/pavansingh/Desktop/Amazon Review Data/Sports_and_Outdoors_5.json"
tools_home = "/Users/pavansingh/Desktop/Amazon Review Data/Tools_and_Home_Improvement_5.json"
toys_games = "/Users/pavansingh/Desktop/Amazon Review Data/Toys_and_Games_5.json"
video_games = "/Users/pavansingh/Desktop/Amazon Review Data/Video_Games_5.json"

# load each file and join into dataframe (only ten of the eighteen paths defined above are sampled here)
for category, filename in [('arts_crafts_and_sewing', arts_crafts), ('automotive', automotive), ('cds_and_vinyl', cds_vinyl), ('cell_phones_and_accessories', cell_phones), ('home_and_kitchen', home_kitchen), ('musical_instruments', musical_instruments), ('office_products', office_products), ('tools_and_home_improvement', tools_home), ('toys_and_games', toys_games), ('video_games', video_games)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
df = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", df.shape)
display(df.head(5))

Shape of all data: (500000, 13)


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,category,style,vote,image
0,5.0,True,"12 12, 2014",ASECKSPMIO3E4,B00FARV8NG,Crystal,Thanks.,Five Stars,1418342400,arts_crafts_and_sewing,,,
1,5.0,True,"02 22, 2015",A30QCTR82XLPX,B001CE5DOQ,Mijaney,"Great blades, good price!",Excellent product!,1424563200,arts_crafts_and_sewing,,,
2,5.0,True,"07 14, 2015",A1CLLZFI30GVWA,B00DV69LO6,Yuchu,Good.,Five Stars,1436832000,arts_crafts_and_sewing,{'Size:': ' 1 Pack'},,
3,5.0,True,"12 18, 2014",A2215I936C4PLG,B004JE3A7S,RedPandaDuck,Product is perfect!,Five Stars,1418860800,arts_crafts_and_sewing,,,
4,5.0,True,"01 13, 2015",A2ZY9VWY1KCS55,B001BF3IM0,Deborah Carter,works great,Five Stars,1421107200,arts_crafts_and_sewing,"{'Size:': ' 6 Cans', 'Color:': ' Midnight Black'}",,
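An alternative to the retry loop in `read_file` is to pick all line numbers up front with `random.sample` and then read the file sequentially. This is a sketch of that idea, not the method used for the results above; `sample_reviews` and its `seed` parameter are hypothetical, and the records come out in file order rather than in random order.

```python
import json
import random
import tempfile
from pathlib import Path

def sample_reviews(filename, category, n=50_000, seed=None):
    """Yield up to n randomly chosen reviews from a JSON-lines file (hypothetical helper)."""
    # first pass: count the lines
    with open(filename) as handle:
        num_lines = sum(1 for _ in handle)
    # choose distinct 0-based line numbers without replacement
    rng = random.Random(seed)
    chosen = set(rng.sample(range(num_lines), min(n, num_lines)))
    # second pass: parse only the chosen lines
    with open(filename) as handle:
        for line_num, line in enumerate(handle):
            if line_num in chosen:
                record = json.loads(line)
                record["category"] = category
                yield record

# demo: sample 4 of 10 made-up reviews
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_5.json"
    path.write_text("\n".join(json.dumps({"review_no": i, "overall": 5.0}) for i in range(10)))
    sampled = list(sample_reviews(path, "toy", n=4, seed=0))

print(len(sampled), "reviews sampled")
```

Fixing `seed` makes the sample reproducible between runs, which the `random.randint` loop does not guarantee unless the global seed is set.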


In [4]:
# save df to a CSV file called Large_Data.csv
df.to_csv('Large_Data.csv')

In [5]:
data = []

# category files
beauty = "/Users/pavansingh/Desktop/Amazon Review Data/All_Beauty_5.json"
fashion = "/Users/pavansingh/Desktop/Amazon Review Data/AMAZON_FASHION_5.json"
appliances = "/Users/pavansingh/Desktop/Amazon Review Data/Appliances_5.json"
gift_cards = "/Users/pavansingh/Desktop/Amazon Review Data/Gift_Cards_5.json"
industrial = "/Users/pavansingh/Desktop/Amazon Review Data/Industrial_and_Scientific_5.json"
luxury_beauty = "/Users/pavansingh/Desktop/Amazon Review Data/Luxury_Beauty_5.json"
magazine_subscriptions = "/Users/pavansingh/Desktop/Amazon Review Data/Magazine_Subscriptions_5.json"
software = "/Users/pavansingh/Desktop/Amazon Review Data/Software_5.json"



# load each file and join into dataframe
for category, filename in [('beauty', beauty), ('fashion', fashion), ('appliances', appliances), ('gift_cards', gift_cards), ('industrial', industrial), ('luxury_beauty', luxury_beauty), ('magazine_subscriptions', magazine_subscriptions), ('software', software)]:
    for selected_data in read_file(filename, category):
        data.append(selected_data)

# make it into a dataframe
df_1 = pd.DataFrame(data)

# show the dataframe
print("Shape of all data:", df_1.shape)
display(df_1.head(5))

Shape of all data: (113152, 13)


Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,category,image
0,5.0,6.0,True,"05 11, 2015",A3V0ZDC7WJX4G6,B000URXP6E,{'Size:': ' 3.3 oz'},TwoCoconuts,Lovely powder bomb fragrance. I adore its sof...,Gorgeous Scent,1431302400,beauty,
1,5.0,,True,"01 14, 2015",A2MWTIZYINA2MH,B00006L9LC,{'Size:': ' 26'},IRENE ANAYA,I love the clean smell conditions so well plea...,,1421193600,beauty,
2,5.0,,False,"06 6, 2013",A2D0I7M2G43WWP,B000FI4S1E,,Mena P.,"I, like a lot of people, sometimes take Softso...",So cozy!!,1370476800,beauty,
3,5.0,,True,"12 6, 2016",A4IV41UZ0Y789,B000URXP6E,{'Size:': ' 29.2'},Robert Young,Great product and fast service.,Five Stars,1480982400,beauty,
4,5.0,,True,"08 2, 2017",ASJX5CT07ZCE3,B001OHV1H4,{'Size:': ' 126'},Dawn H.,I bought this shampoo to go in wedding guest g...,I also recieved an email from the company than...,1501632000,beauty,


In [8]:
# save df_1 to a CSV file called Small_Data.csv (written to the current working directory)
df_1.to_csv('Small_Data.csv')

OSError: Cannot save file into a non-existent directory: 'Data'
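The `OSError` above is what `to_csv` raises when the target path points into a folder that does not exist yet (here `Data`). A minimal sketch of one way around it is to create the folder first; the example below works inside a temporary directory, and the one-row `df_1` is a stand-in for the real frame.

```python
import tempfile
from pathlib import Path

import pandas as pd

# stand-in for the sampled data
df_1 = pd.DataFrame({"overall": [5.0], "category": ["beauty"]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "Data"
    # create the folder (and any missing parents); no error if it already exists
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "Small_Data.csv"
    df_1.to_csv(out_path)
    saved = out_path.exists()

print("File written:", saved)
```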

In [7]:
# final dataset with all combined categories
all_data = pd.concat([df, df_1], ignore_index = True)

# shape of final dataset
print("Shape of Combined Data: ", all_data.shape)

Shape of Combined Data:  (613152, 13)


In [9]:
# save all data to csv called All_Data.csv
all_data.to_csv('All_Data.csv')
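One detail worth knowing when these CSV files are read back: by default `to_csv` also writes the row index, which `read_csv` then loads as a column named `Unnamed: 0`. The sketch below shows the difference on a made-up two-row frame; whether to keep the index is a judgment call for the later notebooks.

```python
import tempfile
from pathlib import Path

import pandas as pd

# stand-in for the combined data
all_data = pd.DataFrame({"overall": [5.0, 4.0], "category": ["beauty", "software"]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "All_Data.csv"

    # default: the index is written as an extra, unnamed first column
    all_data.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False writes only the data columns
    all_data.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```

Note also that dictionary-valued columns such as `style` are stored as plain text in a CSV, so they come back as strings rather than dictionaries.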