The purpose of this notebook is to load and clean data from Amazon.com product reviews. Ultimately, this data will be incorporated into a study to determine how much reviews and/or ratings affect product sales. There are two large JSON files, one containing a JSON line for each review and the other a JSON line for each product. Both files are too large to fit into memory simultaneously, so only a portion of the data will be loaded. Because Amazon product descriptions are often inconsistent among the various vendors, in addition to typical data cleaning tasks (dropping NAs, etc.), I'll also run a short algorithm to determine whether the datasets contain possibly redundant products with only slightly different names (e.g. "Casio men's watch GT2HF2 fitness" vs. "Casio men's watch GT2HF2").

In [1]:
#First we need to load the various packages

import pandas as pd
import os
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from datetime import datetime
import gzip
import json
cDir=os.getcwd()
os.chdir(os.path.abspath('C:/Users/micha/Documents/Springboard/Unit_7-Data_Wrangling/Data'))


Next I need to define a couple of functions to read in the data. The first function, "parse", creates a generator that yields one parsed JSON line at a time from the .json.gz file. The second function, "getDF", returns a pandas data frame with at most numRow rows.

In [2]:
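def parse(path):
  # Yield one parsed JSON object per line of the gzipped file
  with gzip.open(path, 'rb') as g:
    for l in g:
      yield json.loads(l)

def getDF(path, numRow):
  # Collect at most numRow records, then stop reading the file
  df = {}
  for i, d in enumerate(parse(path)):
    if i >= numRow:
      break
    df[i] = d
  return pd.DataFrame.from_dict(df, orient='index')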
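
One way to cap the number of records read is to slice the generator lazily. The sketch below uses only the standard library and assumes a gzipped JSON-lines file; `read_first_n` and the temporary file `sample.json.gz` are hypothetical names made up for illustration, not part of the notebook.

```python
import gzip
import itertools
import json
import os
import tempfile

def parse(path):
    # Yield one parsed JSON object per line of a gzipped JSON-lines file
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield json.loads(line)

def read_first_n(path, n):
    # islice stops pulling from the generator after n records,
    # so the remainder of the file is never read
    return list(itertools.islice(parse(path), n))

# Build a tiny stand-in file so the example runs on its own
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'sample.json.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as f:
    for k in range(5):
        f.write(json.dumps({'asin': 'A' + str(k), 'overall': 5.0}) + '\n')

records = read_first_n(demo_path, 3)
print(len(records), records[0]['asin'])
```

The resulting list of dictionaries can be handed to `pd.DataFrame(records)` to get a data frame.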
    


In [3]:
reviewData = getDF('Home_and_Kitchen_5.json.gz', 10000)

Next I'll load the product data into a pandas dataframe. Because the JSON data file is so large, the loading process takes a while.

In [None]:
metaData = getDF('meta_Home_and_Kitchen.json.gz', 1000)

In [5]:
metaData.columns

Index(['category', 'tech1', 'description', 'fit', 'title', 'also_buy', 'image',
       'tech2', 'brand', 'feature', 'rank', 'also_view', 'main_cat',
       'similar_item', 'date', 'price', 'asin', 'details'],
      dtype='object')

I don't want to load the JSON data again (it takes about 90 minutes to run through the entire file), so I'll export the data to a csv for ease of loading in the future. 

In [7]:
metaData.to_csv('metaData.csv')
reviewData.to_csv('reviewData.csv')

In [4]:
def productComparison(productData):
    '''Takes a dataframe of Amazon data, rewrites near-duplicate product titles in place to a consistent name,
    and returns the dataframe along with the list of possible matches'''
    #I want to keep an eye on how long this function takes to run because I know it's going to be a little slow
    startTime = datetime.now()
    possibleMatch = []
    # First identify all the unique product IDs (asin) and unique product names (title)
    uniqueParent = productData['asin'].unique()
    uniqueTitle = productData['title'].unique()
    #Iterate through the unique product IDs to see if the associated product title matches any of the other unique titles.
    for product in uniqueParent:
        #The titles will change as the data is refined
        #uniqueTitle = productData['product_title'].unique()
        #First get the title associated with a product ID
        try:
            prodComp = productData.loc[productData['asin'] == product, 'title'].unique()[0]
            #now iterate through the unique titles
            for compProd in uniqueTitle:
                #No need to make any changes to the dataframe if the product name is the exact same as the comparison string...
                if compProd != prodComp:
                    #determine both the set ratio and sort ratio
                    setRatio = fuzz.token_set_ratio(prodComp, compProd)
                    sortRatio = fuzz.token_sort_ratio(prodComp, compProd)
                    #If the set ratio and sort ratio both exceed some threshold, then we will update the name of the product in the dataframe
                    if setRatio > 90 and sortRatio > 90:
                        possibleMatch.append([prodComp, compProd, setRatio, sortRatio])
                        productData.loc[productData['title'] == compProd, 'title'] = prodComp
        except Exception:
            #There will be times when a product title is no longer in the dataframe (it has already been changed)
            continue
        #Reset the product IDs
        productData.loc[productData['title'] == prodComp, 'parent'] = product
    processTime = datetime.now()-startTime
    print(processTime)
    return productData, possibleMatch
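
To make the token-based comparison more concrete, here is a rough stand-in for a token-sort score built on the standard library's difflib. It is not fuzzywuzzy's implementation (the preprocessing and scoring there may differ in detail); `token_sort_score` is a hypothetical helper used only to show why word order stops mattering once the tokens are sorted.

```python
from difflib import SequenceMatcher

def token_sort_score(a, b):
    # Lowercase, split on whitespace, sort the tokens, and rejoin,
    # so that word order no longer affects the comparison
    norm_a = ' '.join(sorted(a.lower().split()))
    norm_b = ' '.join(sorted(b.lower().split()))
    # SequenceMatcher.ratio() lies in [0, 1]; scale it to 0-100
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())

# Same words in a different order
same_words = token_sort_score("Casio men's watch GT2HF2", "GT2HF2 Casio watch men's")
# One title carries an extra word
extra_word = token_sort_score("Casio men's watch GT2HF2 fitness", "Casio men's watch GT2HF2")
print(same_words, extra_word)
```

Reordered titles score as identical, while an extra word lowers the score, so the threshold chosen decides whether such a pair counts as a match.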

In [8]:
df = pd.merge(reviewData, metaData, how = 'left', on='asin')
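
A left merge keeps every review row and attaches product fields wherever the asin matches. When both frames share a non-key column name, pandas appends `_x` and `_y` suffixes by default, which is where names like `image_x` and `image_y` come from. A small sketch on toy frames (the values are invented for illustration):

```python
import pandas as pd

# Toy review and product frames that both carry an 'image' column
reviews = pd.DataFrame({
    'asin': ['A1', 'A1', 'B2'],
    'overall': [5.0, 3.0, 4.0],
    'image': [None, None, 'r.jpg'],
})
products = pd.DataFrame({
    'asin': ['A1'],
    'title': ['Stainless Coffee Mug'],
    'image': ['p.jpg'],
})

# how='left' keeps all three review rows; 'B2' has no product match
merged = pd.merge(reviews, products, how='left', on='asin')
print(merged.columns.tolist())
print(merged['title'].isnull().sum())
```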

Now that the data are merged into one dataframe, I'll clear the reviewData and metaData dataframes to free up space.

In [20]:
reviewData = []
metaData = []

In [21]:
#Drop the large columns from the DF
df1 = df.drop(['image_x', 'tech1', 'description', 'image_y', 'tech2', 'also_buy', 'feature', 'also_view', 'similar_item', 'details'], axis = 1)

Let's take a look at the finished product.

In [22]:
df1.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,category,fit,title,brand,rank,main_cat,date,price
0,5.0,True,"11 5, 2015",A8LUWTIPU9CZB,560467893,Linda Fahner,"Great product, love it!!",Five Stars,1446681600,,,"[Home & Kitchen, Home Dcor, Home Dcor Accents,...",,"WELLAND Chicago Wall Floating Corner Shelf, 20...",WELLAND,"[>#1,037,069 in Home & Kitchen (See Top 100 in...",Amazon Home,,
1,3.0,True,"05 7, 2015",A3B6GKQQ1JJ167,560467893,Harry Slaughter,"Pretty flimsy, but does the job. If your corne...",Meh,1430956800,2.0,,"[Home & Kitchen, Home Dcor, Home Dcor Accents,...",,"WELLAND Chicago Wall Floating Corner Shelf, 20...",WELLAND,"[>#1,037,069 in Home & Kitchen (See Top 100 in...",Amazon Home,,
2,5.0,True,"01 22, 2014",A3MCTN65BU7XRA,681795107,luckyg,So much better than plastic mug types--keeps c...,Recommend,1390348800,,{'Color:': ' Brushed Stainless'},"[Home & Kitchen, Kitchen & Dining, Travel & To...",,Stainless Coffee Mug,Timolino,"[>#220,715 in Kitchen & Dining (See Top 100 in...",Amazon Home,"August 1, 2006",$14.27
3,1.0,True,"10 30, 2013",A7JVZFSXVY9RL,681795107,Nickleen,I like my coffee hot; borderline scorching but...,Not keeping coffee hot for long enough,1383091200,,{'Color:': ' Brushed Stainless'},"[Home & Kitchen, Kitchen & Dining, Travel & To...",,Stainless Coffee Mug,Timolino,"[>#220,715 in Kitchen & Dining (See Top 100 in...",Amazon Home,"August 1, 2006",$14.27
4,1.0,True,"09 20, 2013",A2RQ7VLAK1SHPU,681795107,Lacemaker427,This mug does only a fair job of keeping coffe...,Leaks like a waterfall when at an angle!,1379635200,,{'Color:': ' Red'},"[Home & Kitchen, Kitchen & Dining, Travel & To...",,Stainless Coffee Mug,Timolino,"[>#220,715 in Kitchen & Dining (See Top 100 in...",Amazon Home,"August 1, 2006",$14.27


In [23]:
#Clear the original merged dataframe from memory.
df = []

In [25]:
#Save the data as a csv
df1.to_csv('amazonReviewData.csv', index = False)

Now that the data is saved, we can take a closer look at it.

In [26]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7244644 entries, 0 to 7244643
Data columns (total 19 columns):
 #   Column          Dtype  
---  ------          -----  
 0   overall         float64
 1   verified        bool   
 2   reviewTime      object 
 3   reviewerID      object 
 4   asin            object 
 5   reviewerName    object 
 6   reviewText      object 
 7   summary         object 
 8   unixReviewTime  int64  
 9   vote            object 
 10  style           object 
 11  category        object 
 12  fit             object 
 13  title           object 
 14  brand           object 
 15  rank            object 
 16  main_cat        object 
 17  date            object 
 18  price           object 
dtypes: bool(1), float64(1), int64(1), object(16)
memory usage: 1.0+ GB


The final data set has about 7.2 million entries, which is too large to work with realistically, so I'll just use a random subset of 10,000 rows.

In [29]:
df1 = df1.sample(10000)

In [32]:
df1 = df1.reset_index(drop=True)
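
For a repeatable subset, `sample` accepts a `random_state` seed, and `reset_index(drop=True)` renumbers the rows afterwards. A sketch on a toy frame; the data and the seed value 42 are arbitrary choices for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'asin': list('ABCDEFGH'),
    'overall': [5, 4, 3, 5, 1, 2, 4, 5],
})

# The same seed returns the same rows on every run
first = toy.sample(3, random_state=42)
second = toy.sample(3, random_state=42)

# Sampling keeps the original row labels; reset_index(drop=True) renumbers from 0
subset = first.reset_index(drop=True)
print(subset)
```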

In [35]:
#Save the sampled data set
df1.to_csv('amazonReviewData_sample.csv', index = False)

Now let's clear all the memory and reload the data.

In [2]:
df  = pd.read_csv('amazonReviewData_sample.csv')

In [3]:
df.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,category,fit,title,brand,rank,main_cat,date,price
0,4.0,True,"09 13, 2015",A1FLUT4TT4SI7B,B00MNYHJRI,Red Butterfly,Four for decorative detail not five because th...,Nice but...,1442102400,,,"['Home & Kitchen', 'Storage & Organization', '...",,"&quot;Family, Home, Love&quot; Wood &amp; Meta...",MyGift,"['>#78,215 in Home & Kitchen (See Top 100 in H...",Amazon Home,,$16.99
1,5.0,True,"01 1, 2017",A329DI18H4J51Y,B019AW3N8E,Jkay,"I love these. Super soft, fluffy and comfortab...",Love these covers,1483228800,,"{'Size:': ' 18 X 18 Inches', 'Color:': ' Ivory'}","['Home & Kitchen', 'Bedding', 'Decorative Pill...",,CaliTime Pack of 2 Super Soft Throw Pillow Cov...,CaliTime,"['>#163,604 in Home & Kitchen (See Top 100 in ...",Amazon Home,,$18.95
2,5.0,True,"12 1, 2008",A4I1WJ2MUZV6P,B00005UP2N,KK,I can't add much to what's already been writte...,Excellent product!,1228089600,,,"['Home & Kitchen', 'Kitchen & Dining', 'Small ...",,KitchenAid KSM150PSGR Artisan Series 5-Qt. Sta...,KitchenAid,['>#85 in Kitchen & Dining (See Top 100 in Kit...,Amazon Home,"February 11, 2002",$43.01
3,5.0,True,"08 14, 2017",A2MDJLQS61XZUT,B00AX29JPM,Patricia,Perfect,Five Stars,1502668800,,"{'Size:': ' Twin XL', 'Color:': ' Navy Blue'}","['Home & Kitchen', 'Bedding', 'Bed Skirts']",,Superior 1500 Series 100% Microfiber Pleated T...,Superior,"['>#373,828 in Home & Kitchen (See Top 100 in ...",Amazon Home,,$27.01
4,5.0,True,"03 15, 2018",A3BOBH1FYAVRFN,B00GJADRNM,ryan hall,"Great pan, fast shipping",Five Stars,1521072000,,{'Size:': ' 10'},"['Home & Kitchen', 'Kitchen & Dining', 'Cookwa...",,"Starfrit SRFT060312 The Rock Fry Pan, 10-Inch",Starfrit,"['>#140,045 in Kitchen & Dining (See Top 100 i...",Amazon Home,"November 8, 2013",


Several of these columns may be unnecessary. I'm going to explore how many of these columns have a majority of NaNs. 

Unnamed: 0,overall,unixReviewTime
count,10000.0,10000.0
mean,4.3641,1447356000.0
std,1.12181,58490650.0
min,1.0,972345600.0
25%,4.0,1419725000.0
50%,5.0,1456445000.0
75%,5.0,1486274000.0
max,5.0,1537747000.0


In [7]:
df.isnull().sum()

overall              0
verified             0
reviewTime           0
reviewerID           0
asin                 0
reviewerName         2
reviewText           3
summary              1
unixReviewTime       0
vote              8640
style             3735
category            13
fit               9998
title               13
brand              107
rank                13
main_cat            19
date              4572
price             2319
parent              13
dtype: int64
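
Raw counts are easier to judge as fractions of the row count. A small sketch, assuming pandas is available; the toy frame and the 50% cutoff are made up for illustration and are not the notebook's data.

```python
import pandas as pd

# Toy frame standing in for the review sample (values are invented)
toy = pd.DataFrame({
    'overall': [5.0, 4.0, 1.0, 3.0],
    'vote': [None, None, '2', None],
    'price': ['$14.27', None, '$16.99', '$18.95'],
})

# isnull().mean() gives the share of missing values per column
missing_share = toy.isnull().mean()

# Columns where more than half of the values are missing
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
print(missing_share)
print(mostly_missing)
```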

Since style, fit, and date may not be relevant for analyzing the effect of reviews or overall ratings on purchases, these columns will be dropped. Additionally, 'vote' is missing in 8,640 of the 10,000 rows, and the 'overall' column is the rating anyway, so 'vote' will also be dropped. With the exception of price, the remaining missing values are few enough that I will happily drop the rows. It's unfortunate how many values are missing for price (2,319), because it would make sense for a customer to pay more attention to reviews as price increases. I may need to circle back and find a supplemental sample of reviews to replace the rows that I will need to drop here, because it is likely not reasonable to impute these with the mean given the range of values.

In [12]:
df = df.drop(['vote', 'style', 'fit', 'date'], axis = 1)
df = df.dropna(axis=0, how = 'any')

In [14]:
#Let's take a look at the data now

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7650 entries, 0 to 9999
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   overall         7650 non-null   float64
 1   verified        7650 non-null   bool   
 2   reviewTime      7650 non-null   object 
 3   reviewerID      7650 non-null   object 
 4   asin            7650 non-null   object 
 5   reviewerName    7650 non-null   object 
 6   reviewText      7650 non-null   object 
 7   summary         7650 non-null   object 
 8   unixReviewTime  7650 non-null   int64  
 9   category        7650 non-null   object 
 10  title           7650 non-null   object 
 11  brand           7650 non-null   object 
 12  rank            7650 non-null   object 
 13  main_cat        7650 non-null   object 
 14  price           7650 non-null   object 
 15  parent          7650 non-null   object 
dtypes: bool(1), float64(1), int64(1), object(13)
memory usage: 963.7+ KB


Looks better, but we need price to be a float and rank to be an int. Updating rank will require some string manipulation to extract the rank within the specific subcategory of Home & Kitchen (e.g. Laundry Bags).
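
A minimal sketch of both conversions using regular expressions. It assumes prices look like `$14.27` and that the first entry of `rank` looks like `>#220,715 in Kitchen & Dining (See Top 100 in...`, as in the sample rows above. `parse_price` and `parse_rank` are hypothetical helpers; formats not covered by these patterns come back as `None`, and only the first rank in the string is read.

```python
import re

def parse_price(text):
    # Accept strings such as '$14.27' or '$1,299.00'; anything else returns None
    match = re.fullmatch(r'\$([\d,]+(?:\.\d+)?)', str(text).strip())
    return float(match.group(1).replace(',', '')) if match else None

def parse_rank(text):
    # Capture the first '>#<number> in <category>' pair in the string;
    # the category is assumed to end at '(' or at the closing ']'
    match = re.search(r'>#([\d,]+) in ([^(\]]+)', str(text))
    if not match:
        return None, None
    rank = int(match.group(1).replace(',', ''))
    category = match.group(2).strip().strip("'\"")
    return rank, category

print(parse_price('$14.27'))
print(parse_rank("['>#220,715 in Kitchen & Dining (See Top 100 in..."))
```

With pandas, helpers like these could be applied column-wise, for example `df['price'].map(parse_price)`.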

In [24]:
#First split the rank string on '>#' and store the result in a new column: rankCat
df['rankCat'] = df['rank'].str.split('>#', n = -1, expand = False)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [None]:
#Check for duplicate products
checkedData, possibleMatches = productComparison(df)