# Data Loading
## MSc Advanced Analytics Dissertation

Here we take several samples of the Amazon reviews dataset and load them into a DataFrame.

In [2]:
# import required modules
import pandas as pd
import gzip

# Datasets

We have individual datasets for each category. These data have been reduced to extract the $k$-core, such that each of the remaining users and items has at least $k$ reviews.

- Amazon Fashion	
- All Beauty	
- Appliances	
- Arts, Crafts and Sewing	
- Automotive	
- Books	
- CDs and Vinyl	
- Cell Phones and Accessories	
- Clothing, Shoes and Jewelry	
- Digital Music	
- Electronics	
- Gift Cards	
- Grocery and Gourmet Food	
- Home and Kitchen	
- Industrial and Scientific	
- Kindle Store	
- Luxury Beauty	
- Magazine Subscriptions	
- Movies and TV	
- Musical Instruments	
- Office Products	
- Patio, Lawn and Garden	
- Pet Supplies	
- Prime Pantry	
- Software	
- Sports and Outdoors	
- Tools and Home Improvement	
- Toys and Games	
- Video Games	
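
The $k$-core reduction mentioned above can be illustrated with a minimal sketch. It assumes that "$k$-core" means every remaining user and item has at least $k$ reviews, and it is not the procedure the dataset authors used. The function name `k_core` and the toy data are hypothetical. The filter has to be repeated, because dropping a user can push an item below $k$ reviews, and vice versa.

```python
import pandas as pd


def k_core(df, k, user_col="reviewerID", item_col="asin"):
    """Repeatedly drop reviews whose user or item has fewer than k reviews."""
    while True:
        # number of reviews written by each row's user / received by each row's item
        user_counts = df[user_col].map(df[user_col].value_counts())
        item_counts = df[item_col].map(df[item_col].value_counts())
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return df
        df = df[keep]


# Toy data (made up): u1 and u2 each review i1 and i2; u3 reviews i1 only.
toy = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u2", "u3"],
    "asin":       ["i1", "i2", "i1", "i2", "i1"],
})
core = k_core(toy, k=2)
print(core)
```

With $k = 2$, the single review by `u3` is removed and the four reviews by `u1` and `u2` remain.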

Note: the format is one review per line, in JSON. Each review can contain the following fields:

- **overall**: rating of the product
- **reviewerID**: ID of the reviewer, e.g. A2SUAM1J3GNN3B
- **asin**: ID of the product, e.g. 0000013714
- **reviewerName**: name of the reviewer
- **vote**: helpful votes of the review
- **style**: a dictionary of the product metadata, e.g., "Format" is "Hardcover"
- **reviewText**: text of the review
- **summary**: summary of the review
- **unixReviewTime**: time of the review (unix time)
- **reviewTime**: time of the review (raw)
- **image**: images that users post after they have received the product
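
To make the one-review-per-line format concrete, here is a small sketch that parses such a stream using only the standard library. The helper name `parse_reviews` and the two sample records are made up for illustration; only the field names come from the list above. The same helper handles plain-text and gzip-compressed input.

```python
import gzip
import io
import json


def parse_reviews(stream):
    """Yield one dict per non-empty line of a one-review-per-line JSON stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two made-up reviews; the second has no unixReviewTime field.
sample = (
    '{"overall": 5.0, "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "unixReviewTime": 1441324800}\n'
    '{"overall": 2.0, "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "summary": "Five Stars"}\n'
)

# Plain-text input
reviews = list(parse_reviews(io.StringIO(sample)))

# Gzip-compressed input, opened in text mode
compressed = io.BytesIO(gzip.compress(sample.encode("utf-8")))
with gzip.open(compressed, "rt", encoding="utf-8") as fh:
    reviews_gz = list(parse_reviews(fh))

print(len(reviews), reviews == reviews_gz)
```

A field that is absent from a record simply does not appear in its dict. When the same file is read with `pd.read_json(..., lines=True)`, such fields show up as `NaN` in the resulting DataFrame.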

***
# Quick look at a Dataset

For example: ***The Fashion Dataset*** (`AMAZON_FASHION_5.json`)

In [3]:
# path to the Fashion 5-core file
fashion_path = "/Users/pavansingh/Library/CloudStorage/GoogleDrive-sngpav003@myuct.ac.za/My Drive/Masters 2022/Dissertation/Masters-Dissertation/Data/AMAZON_FASHION_5.json"
# each line of the file is one JSON review
fashion_data = pd.read_json(fashion_path, lines=True)
#fashion_data = fashion_data.loc[:,['reviewerName', 'reviewText', 'overall', 'style']]
# show a slice of the rows, then the overall shape
display(fashion_data.loc[10:20,:])
print("Shape of Data:", fashion_data.shape)

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
10,2,True,"01 25, 2018",A3HX4X3TIABWOV,B000KPIHQ4,"{'Size Name:': ' Men's 6-6.5, Women's 8-8.5', ...",Denise A. Conte,Relieved my Plantar Fascitis for 3 Days. Then ...,These were recommended by my Podiatrist,1516838400,,
11,2,True,"01 5, 2017",AW8UBYMNJ894V,B000KPIHQ4,"{'Size Name:': ' Men's 8-8.5, Women's 10-10.5'...",Cognizant Consumer,This is my 6th pair and they are the best thin...,Not the same as all my other pairs.,1483574400,,
12,5,True,"10 17, 2016",A265UZVOZWTTXQ,B000KPIHQ4,,William_Jasper,We have used these inserts for years. They pr...,Great inserts,1476662400,,
13,5,True,"08 22, 2016",AW8UBYMNJ894V,B000KPIHQ4,,Cognizant Consumer,Pinnacle seems to have more cushioning so my h...,Personal favorite,1471824000,,
14,5,True,"03 23, 2016",A265UZVOZWTTXQ,B000KPIHQ4,,William_Jasper,Excellent insole with good support.,Five Stars,1458691200,,
15,5,True,"06 24, 2015",AW8UBYMNJ894V,B000KPIHQ4,,Cognizant Consumer,A little more cushion than the Powerstep Prote...,Great comfort!,1435104000,,
16,5,True,"11 17, 2014",A265UZVOZWTTXQ,B000KPIHQ4,,William_Jasper,These insoles help my heels feel much better. ...,These insoles help my heel feels much better. ...,1416182400,,
17,2,True,"01 25, 2018",A3HX4X3TIABWOV,B000V0IBDM,,Denise A. Conte,Relieved my Plantar Fascitis for 3 Days. Then ...,These were recommended by my Podiatrist,1516838400,,
18,2,True,"01 5, 2017",AW8UBYMNJ894V,B000V0IBDM,,Cognizant Consumer,This is my 6th pair and they are the best thin...,Not the same as all my other pairs.,1483574400,,
19,5,True,"10 17, 2016",A265UZVOZWTTXQ,B000V0IBDM,,William_Jasper,We have used these inserts for years. They pr...,Great inserts,1476662400,,


Shape of Data: (3176, 12)


Below we:

1. Count the missing values in the 'style' column of the `fashion_data` DataFrame. The `isna()` method creates a boolean mask marking where 'style' is missing (i.e., `NaN`), and `sum()` counts the `True` values in that mask.

2. Drop all rows of `fashion_data` that have a missing value in the 'style' column, using `dropna()` with the `subset` parameter set to "style" so that only missing values in that column cause a row to be dropped. Setting `inplace=True` modifies `fashion_data` directly rather than returning a new DataFrame.

3. Sort `fashion_data` by the 'overall' column in descending order, using `sort_values()` with the `by` parameter set to "overall" and the `ascending` parameter set to `False`.


In [4]:
# count NA's in style (not echoed, since only the last expression of a cell is displayed)
fashion_data['style'].isna().sum()

# remove NA's in style
fashion_data.dropna(subset=["style"], inplace=True)

# Sort resulting dataframe by overall rating
fashion_data.sort_values(by=['overall'], inplace=True, ascending=False)

# show resulting dataset
display(fashion_data.head(10))

# Shape of data
print("Shape of Data:", fashion_data.shape)

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,5,True,"09 4, 2015",ALJ66O1Y6SLHA,B000K2PJ4K,"{'Size:': ' Big Boys', 'Color:': ' Blue/Orange'}",Tonya B.,Great product and price!,Five Stars,1441324800,,
1963,5,True,"04 18, 2016",AZRZ2FB7CFNOE,B0092UF54A,"{'Size:': ' 8 B(M) US', 'Color:': ' Black/Whit...",Catherine Uribe,I love my tennis shoes,Five Stars,1460937600,,
1951,5,True,"05 15, 2016",A2KCFRMKVHYSU7,B0092UF54A,"{'Size:': ' 8 B(M) US', 'Color:': ' Ocean Fog/...",Saving star,Really comfy and nice color,Great and comfy for sports,1463270400,,
1953,5,False,"05 10, 2016",A22WG2NE4D47UM,B0092UF54A,"{'Size:': ' 9 B(M) US', 'Color:': ' Black/Wolf...",Amazon Customer,These are the most comfortable shoes I've used...,Best shoes ever!!!,1462838400,,
1954,5,True,"05 9, 2016",A1SC6HVU28ND3D,B0092UF54A,"{'Size:': ' 8.5 B(M) US', 'Color:': ' Black/Wh...",Sarah,Very comfortable and looks great!,Five Stars,1462752000,,
1955,5,True,"05 8, 2016",AJDH2WVLX79KA,B0092UF54A,"{'Size:': ' 6.5 B(M) US', 'Color:': ' Black/Wh...",brookelynne,Love this shoes so comfy and great very everyd...,Fits wonderful,1462665600,,
1958,5,True,"05 4, 2016",A1AWX0M8R6A2I1,B0092UF54A,"{'Size:': ' 12 D(M) US', 'Color:': ' Cool Grey...",CBP,Perfict fit for me. Great looking shoes at a g...,nice,1462320000,,
1959,5,True,"04 23, 2016",AT5OQFDS6PEE1,B0092UF54A,"{'Size:': ' 9.5 B(M) US', 'Color:': ' Black/Wh...",H. Heckstall,The sneakers are very comfortable and fit to s...,Five Stars,1461369600,,
1960,5,True,"04 21, 2016",AOFQAZVA6Q6E7,B0092UF54A,"{'Size:': ' 10 B(M) US', 'Color:': ' Black/Whi...",D. Resendes,I've had these shoes for about a week now and ...,Wide Feet so Somewhat Tight,1461196800,3.0,
1965,5,True,"03 31, 2016",A2TRI54C8EMCX,B0092UF54A,"{'Size:': ' 9 B(M) US', 'Color:': ' Black/Wolf...",Andrea Seo B.,Love it!! Super comfortable and nice!! Got mor...,Love it!! Definetly recommend it,1459382400,,


Shape of Data: (3107, 12)


So we have 3107 reviews in our fashion dataset after removing the rows with a missing 'style' (down from 3176).

***
# Combining the Datasets

We have individual datasets for each category. We combine them into one larger dataset encompassing all the categories (5-core dataset).

We first load the smaller datasets (the product categories with fewer reviews), then turn to the categories with many reviews (> 100 000).

In [5]:
# load all the datasets into env

beauty = "/Users/pavansingh/Desktop/Amazon Review Data/All_Beauty_5.json"
beauty = pd.read_json(beauty, lines=True)
beauty['category'] = 'beauty'
print("Shape of Beauty: ", beauty.shape)

fashion = "/Users/pavansingh/Desktop/Amazon Review Data/AMAZON_FASHION_5.json"
fashion = pd.read_json(fashion, lines=True)
fashion['category'] = 'fashion'
print("Shape of Fashion: ", fashion.shape)

appliances = "/Users/pavansingh/Desktop/Amazon Review Data/Appliances_5.json"
appliances = pd.read_json(appliances, lines=True)
appliances['category'] = 'appliances'
print("Shape of Appliances: ", appliances.shape)

gift_cards = "/Users/pavansingh/Desktop/Amazon Review Data/Gift_Cards_5.json"
gift_cards = pd.read_json(gift_cards, lines=True)
gift_cards['category'] = 'gift_cards'
print("Shape of Gift Cards: ", gift_cards.shape)

industrial = "/Users/pavansingh/Desktop/Amazon Review Data/Industrial_and_Scientific_5.json"
industrial = pd.read_json(industrial, lines=True)
industrial['category'] = 'industrial'
print("Shape of Industrial: ", industrial.shape)


luxury_beauty = "/Users/pavansingh/Desktop/Amazon Review Data/Luxury_Beauty_5.json"
luxury_beauty = pd.read_json(luxury_beauty, lines=True)
luxury_beauty['category'] = 'luxury_beauty'
print("Shape of Luxury Beauty: ", luxury_beauty.shape)

magazine_subscriptions = "/Users/pavansingh/Desktop/Amazon Review Data/Magazine_Subscriptions_5.json"
magazine_subscriptions = pd.read_json(magazine_subscriptions, lines=True)
magazine_subscriptions['category'] = 'magazine_subscriptions'
print("Shape of Magazine Subscriptions: ", magazine_subscriptions.shape)

software = "/Users/pavansingh/Desktop/Amazon Review Data/Software_5.json"
software = pd.read_json(software, lines=True)
software['category'] = 'software'
print("Shape of Software: ", software.shape)


Shape of Beauty:  (5269, 13)
Shape of Fashion:  (3176, 13)
Shape of Appliances:  (2277, 13)
Shape of Gift Cards:  (2972, 13)
Shape of Industrial:  (77071, 13)
Shape of Luxury Beauty:  (34278, 13)
Shape of Magazine Subscriptions:  (2375, 13)
Shape of Software:  (12805, 13)


In [6]:
# merge the smaller datasets into one dataframe
all_data = pd.concat([beauty, fashion, appliances, gift_cards, industrial, luxury_beauty, magazine_subscriptions, software], ignore_index = True)

# shape of combined data
print("Shape of Combined Data: ", all_data.shape)


Shape of Combined Data:  (140223, 13)
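
The load, tag and concatenate pattern used above could also be written as a loop over a mapping from category label to file name. The following is only a sketch of that idea: the helper name `load_categories` and the toy files are hypothetical, and the demonstration writes two tiny made-up files to a temporary directory instead of using the real data.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_categories(data_dir, files):
    """Read each one-review-per-line JSON file and tag it with its category label."""
    frames = []
    for label, filename in files.items():
        df = pd.read_json(Path(data_dir) / filename, lines=True)
        df["category"] = label
        print(f"Shape of {label}: {df.shape}")
        frames.append(df)
    # stack all categories and renumber the rows from 0
    return pd.concat(frames, ignore_index=True)


# Demonstration on two tiny made-up files.
with tempfile.TemporaryDirectory() as tmp:
    rows = {
        "Toy_A_5.json": [{"overall": 5, "asin": "X1"}, {"overall": 4, "asin": "X2"}],
        "Toy_B_5.json": [{"overall": 1, "asin": "Y1"}],
    }
    for filename, records in rows.items():
        text = "\n".join(json.dumps(r) for r in records)
        (Path(tmp) / filename).write_text(text)
    combined = load_categories(tmp, {"toy_a": "Toy_A_5.json", "toy_b": "Toy_B_5.json"})

print("Shape of Combined Data: ", combined.shape)
```

Keeping the file names in one mapping means a category can be added or removed by changing a single line.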


Let us try to combine the data with the larger datasets.

In [None]:

arts_crafts = "/Users/pavansingh/Desktop/Amazon Review Data/Arts_Crafts_and_Sewing_5.json"
arts_crafts = pd.read_json(arts_crafts, lines=True)
arts_crafts['category'] = 'arts_crafts' 
print("Shape of Arts & Crafts: ", arts_crafts.shape)

automotive = "/Users/pavansingh/Desktop/Amazon Review Data/Automotive_5.json"
automotive = pd.read_json(automotive, lines=True)
automotive['category'] = 'automotive'
print("Shape of Automotive: ", automotive.shape)

books = "/Users/pavansingh/Desktop/Amazon Review Data/Books_5.json"
books = pd.read_json(books, lines=True)
books['category'] = 'books'
print("Shape of Books: ", books.shape)

cds_vinyl = "/Users/pavansingh/Desktop/Amazon Review Data/CDs_and_Vinyl_5.json"
cds_vinyl = pd.read_json(cds_vinyl, lines=True)
cds_vinyl['category'] = 'cds_vinyl'
print("Shape of CDs & Vinyl: ", cds_vinyl.shape)

cell_phones = "/Users/pavansingh/Desktop/Amazon Review Data/Cell_Phones_and_Accessories_5.json"
cell_phones = pd.read_json(cell_phones, lines=True)
cell_phones['category'] = 'cell_phones'
print("Shape of Cell Phones: ", cell_phones.shape)

clothing_shoes = "/Users/pavansingh/Desktop/Amazon Review Data/Clothing_Shoes_and_Jewelry_5.json"
clothing_shoes = pd.read_json(clothing_shoes, lines=True)
clothing_shoes['category'] = 'clothing_shoes'
print("Shape of Clothing & Shoes: ", clothing_shoes.shape)

digital_music = "/Users/pavansingh/Desktop/Amazon Review Data/Digital_Music_5.json"
digital_music = pd.read_json(digital_music, lines=True)
digital_music['category'] = 'digital_music'
print("Shape of Digital Music: ", digital_music.shape)

electronics = "/Users/pavansingh/Desktop/Amazon Review Data/Electronics_5.json"
electronics = pd.read_json(electronics, lines=True)
electronics['category'] = 'electronics'
print("Shape of Electronics: ", electronics.shape)


grocery_gourmet = "/Users/pavansingh/Desktop/Amazon Review Data/Grocery_and_Gourmet_Food_5.json"
grocery_gourmet = pd.read_json(grocery_gourmet, lines=True)
grocery_gourmet['category'] = 'grocery_gourmet'
print("Shape of Grocery & Gourmet: ", grocery_gourmet.shape)

home_kitchen = "/Users/pavansingh/Desktop/Amazon Review Data/Home_and_Kitchen_5.json"
home_kitchen = pd.read_json(home_kitchen, lines=True)
home_kitchen['category'] = 'home_kitchen'
print("Shape of Home & Kitchen: ", home_kitchen.shape)


kindle_store = "/Users/pavansingh/Desktop/Amazon Review Data/Kindle_Store_5.json"
kindle_store = pd.read_json(kindle_store, lines=True)
kindle_store['category'] = 'kindle_store'
print("Shape of Kindle Store: ", kindle_store.shape)

movies_tv = "/Users/pavansingh/Desktop/Amazon Review Data/Movies_and_TV_5.json"
movies_tv = pd.read_json(movies_tv, lines=True)
movies_tv['category'] = 'movies_tv'
print("Shape of Movies & TV: ", movies_tv.shape)

musical_instruments = "/Users/pavansingh/Desktop/Amazon Review Data/Musical_Instruments_5.json"
musical_instruments = pd.read_json(musical_instruments, lines=True)
musical_instruments['category'] = 'musical_instruments'
print("Shape of Musical Instruments: ", musical_instruments.shape)

office_products = "/Users/pavansingh/Desktop/Amazon Review Data/Office_Products_5.json"
office_products = pd.read_json(office_products, lines=True)
office_products['category'] = 'office_products'
print("Shape of Office Products: ", office_products.shape)

patio_lawn = "/Users/pavansingh/Desktop/Amazon Review Data/Patio_Lawn_and_Garden_5.json"
patio_lawn = pd.read_json(patio_lawn, lines=True)
patio_lawn['category'] = 'patio_lawn'
print("Shape of Patio Lawn: ", patio_lawn.shape)

pet_supplies = "/Users/pavansingh/Desktop/Amazon Review Data/Pet_Supplies_5.json"
pet_supplies = pd.read_json(pet_supplies, lines=True)
pet_supplies['category'] = 'pet_supplies'
print("Shape of Pet Supplies: ", pet_supplies.shape)

prime_pantry = "/Users/pavansingh/Desktop/Amazon Review Data/Prime_Pantry_5.json"
prime_pantry = pd.read_json(prime_pantry, lines=True)
prime_pantry['category'] = 'prime_pantry'
print("Shape of Prime Pantry: ", prime_pantry.shape)


sports_outdoors = "/Users/pavansingh/Desktop/Amazon Review Data/Sports_and_Outdoors_5.json"
sports_outdoors = pd.read_json(sports_outdoors, lines=True)
sports_outdoors['category'] = 'sports_outdoors'
print("Shape of Sports & Outdoors: ", sports_outdoors.shape)

tools_home = "/Users/pavansingh/Desktop/Amazon Review Data/Tools_and_Home_Improvement_5.json"
tools_home = pd.read_json(tools_home, lines=True)
tools_home['category'] = 'tools_home'
print("Shape of Tools & Home: ", tools_home.shape)

toys_games = "/Users/pavansingh/Desktop/Amazon Review Data/Toys_and_Games_5.json"
toys_games = pd.read_json(toys_games, lines=True)
toys_games['category'] = 'toys_games'
print("Shape of Toys & Games: ", toys_games.shape)

video_games = "/Users/pavansingh/Desktop/Amazon Review Data/Video_Games_5.json"
video_games = pd.read_json(video_games, lines=True)
video_games['category'] = 'video_games'
print("Shape of Video Games: ", video_games.shape)


# combine all the data
all_data = pd.concat([beauty, fashion, appliances, arts_crafts, automotive, books, cds_vinyl, cell_phones, clothing_shoes, digital_music, electronics, gift_cards, grocery_gourmet, home_kitchen, industrial, kindle_store, luxury_beauty, magazine_subscriptions, movies_tv, musical_instruments, office_products, patio_lawn, pet_supplies, prime_pantry, software, sports_outdoors, tools_home, toys_games, video_games], ignore_index = True)

# shape of combined
print("Shape of Combined Data: ", all_data.shape)
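
Because the larger categories contain many reviews, one option (not used in this notebook, and offered only as a sketch) is to read each file in chunks and keep just the columns needed. With `lines=True`, `pd.read_json` accepts a `chunksize` argument and then yields DataFrames of up to that many rows. The helper name `read_json_lines_in_chunks`, the column selection and the toy file are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def read_json_lines_in_chunks(path, columns, chunksize):
    """Read a one-review-per-line JSON file piece by piece, keeping only `columns`."""
    parts = []
    # with lines=True and a chunksize, read_json yields DataFrames of up to `chunksize` rows
    for chunk in pd.read_json(path, lines=True, chunksize=chunksize):
        # reindex keeps the requested columns; any that are absent from this chunk become NaN
        parts.append(chunk.reindex(columns=columns))
    return pd.concat(parts, ignore_index=True)


# Demonstration on a tiny made-up file: five reviews, read two at a time.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "Toy_5.json"
    records = [{"overall": i % 5 + 1, "asin": f"X{i}", "reviewText": "text"} for i in range(5)]
    path.write_text("\n".join(json.dumps(r) for r in records))
    slim = read_json_lines_in_chunks(path, columns=["overall", "asin"], chunksize=2)

print("Shape of reduced data: ", slim.shape)
```

Dropping unneeded columns chunk by chunk may reduce peak memory use, although the final concatenated DataFrame still has to fit in memory.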


In [None]:
# final dataset with all combined categories
