# Overview

This notebook creates `.parquet` files from the raw JSON data and metadata. It is provided for reference only, as the output files are included in the referenced `.zip` file.

# Load `.json` Files into Lists

This notebook uses the "Tools and Home Improvement" category: both the [reviews](https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/categoryFiles/Tools_and_Home_Improvement.json.gz) and the [metadata](https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/metaFiles2/meta_Tools_and_Home_Improvement.json.gz).

This notebook assumes the two files are saved as `reviews.json.gz` and `metadata.json.gz` in the same directory as the notebook.

In [2]:
import gzip
import json

# Each line of the .json.gz files holds one JSON object
data = []
with gzip.open('reviews.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))

metadata = []
with gzip.open('metadata.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        metadata.append(json.loads(line))
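
The loading pattern above can be tried without downloading the full dataset. The following is a minimal sketch, assuming a hypothetical helper named `read_json_lines` and a two-record sample file with made-up values; neither is part of the original notebook.

```python
import gzip
import json
import os
import tempfile


def read_json_lines(path):
    """Return one parsed object per non-empty line of a gzipped JSON Lines file."""
    records = []
    # 'rt' opens the archive in text mode, so each line is already a str
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                records.append(json.loads(line))
    return records


# Build a tiny sample file (hypothetical values) and read it back
sample = [
    {"asin": "A1", "overall": 5.0, "reviewText": "Great"},
    {"asin": "A2", "overall": 3.0},
]
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")

loaded = read_json_lines(sample_path)
```

Note that the second sample record has no `reviewText` key, which mirrors the kind of record the next step filters out.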

# Load Data into Dataframes

We'll load both lists into dataframes and drop every review record that has no value for `reviewText`.

In [3]:
# Load the data to dataframes
import pandas as pd

df = pd.DataFrame.from_dict(data)
df = df[df['reviewText'].notna()]

df_meta = pd.DataFrame.from_dict(metadata)

# The 'details' column is generally all '{}' and causes problems when saving
# to a .parquet file. We don't need it, so drop it.
df_meta = df_meta.drop(columns=['details'])
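
The notebook simply drops `details`. If that column were needed downstream, one possible workaround would be to store each dict as a JSON string, since plain strings are a type Parquet writers handle. This is a sketch on made-up data, not something the notebook does; `toy_meta` and its values are hypothetical.

```python
import json

import pandas as pd

# Toy metadata frame; the values are made up for illustration
toy_meta = pd.DataFrame({
    "asin": ["A1", "A2"],
    "details": [{}, {"Weight": "1 lb"}],
})

# Serialize each dict to a JSON string so the column holds plain text
toy_meta["details"] = toy_meta["details"].apply(json.dumps)

# The original dicts can be recovered later with json.loads
restored = toy_meta["details"].apply(json.loads).tolist()
```

The trade-off is that consumers of the file have to parse the strings again before they can use the nested values.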

# Sample the Data

Rather than keep all records, we'll keep only the reviews and metadata for the ten products with the most reviews.

In [6]:
# Identify the top 10 asin values by review count
top_10_asins = df.groupby('asin')['overall'].count().sort_values(ascending=False).head(10).index

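# Filter both dataframes to only include rows with those asin values.
# .copy() makes df an independent frame, so adding a column to it later
# does not trigger pandas' SettingWithCopyWarning.
df = df[df['asin'].isin(top_10_asins)].copy()
df_meta = df_meta[df_meta['asin'].isin(top_10_asins)]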

df

| | overall | verified | reviewTime | reviewerID | asin | style | reviewerName | reviewText | summary | unixReviewTime | vote | image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1383593 | 5.0 | True | 11 6, 2008 | A2DY2ODHK5WU9M | B0011UIPIW | {'Color:': ' Black', 'Style:': ' Single'} | J. Yates | Got this as a promo with the Streamlight head ... | Best ever! | 1225929600 | | |
| 1383594 | 3.0 | True | 11 2, 2008 | A2ACQZSJ1TC0AU | B0011UIPIW | {'Color:': ' Black', 'Style:': ' Single'} | Dorbel Tweeter | That pretty much says it all for me. It's an ... | Very bright & tiny - but requires two hands to... | 1225584000 | 122 | |
| 1383595 | 5.0 | False | 10 18, 2008 | A1YDAP0AZ9NMYP | B0011UIPIW | {'Color:': ' Black', 'Style:': ' Single'} | Eric C. | This little light has a sturdy metal design, n... | Good Little Light | 1224288000 | 6 | |
| 1383596 | 5.0 | True | 10 3, 2008 | A2H5KASDT2SZIV | B0011UIPIW | {'Color:': ' Black', 'Style:': ' Single'} | Jeff H. | This tiny light is a beacon of hope in a dark ... | Like a shining star... | 1222992000 | 4 | |
| 1383597 | 5.0 | False | 09 14, 2008 | A2SH6A32BE6NEV | B0011UIPIW | {'Color:': ' Black', 'Style:': ' Single'} | Comp Expert | This nanolight is extremely tiny and so makes ... | Wonderfully bright, compact, weighs next to no... | 1221350400 | 2 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8806614 | 4.0 | True | 09 9, 2018 | A25TJVHF90A5D3 | B018Z0TE3K | | clayton | I like this for the toilet but it could be som... | Not as bright as the picture shows it to be | 1536451200 | | |
| 8806615 | 4.0 | False | 09 9, 2018 | A2NYFO70F01ODU | B018Z0TE3K | | Mrs. K | We purchased the glowbowl as we were starting ... | Product has issues, but great customer service | 1536451200 | | |
| 8806616 | 1.0 | True | 09 9, 2018 | A1AAH4PH22JAPU | B018Z0TE3K | | Brian | This item is terrible. It only worked for like... | Not worth the money. | 1536451200 | | |
| 8806617 | 5.0 | True | 09 8, 2018 | A1B1F7SN935GW5 | B018Z0TE3K | | Iscos | It a great product works well my kids love it ... | That works very well no issues . | 1536364800 | | |
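
To see what the top-N selection does on a small scale, here is a sketch that applies the same chain of calls to a six-row toy frame. The frame `toy`, its values, and `top_n = 2` are assumptions chosen for illustration. Note that `count()` tallies non-null `overall` values, so a review with a missing rating would not be counted.

```python
import pandas as pd

# Toy reviews: product A has 3 reviews, B has 2, C has 1 (made-up values)
toy = pd.DataFrame({
    "asin": ["A", "A", "A", "B", "B", "C"],
    "overall": [5.0, 4.0, 3.0, 5.0, 1.0, 2.0],
})
top_n = 2

# Count reviews per asin, sort descending, keep the first top_n labels
top_asins = (
    toy.groupby("asin")["overall"].count()
    .sort_values(ascending=False)
    .head(top_n)
    .index
)

# Keep only rows for those products; .copy() yields an independent frame
toy_top = toy[toy["asin"].isin(top_asins)].copy()
```

Here product C is dropped and the five rows for A and B remain.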


# Add a "Truncated" Column

We won't create embeddings for the full review text, only for its first 400 characters, which we store in a new column.

In [7]:
# Truncate reviewText to 400 characters and store the result in a new 'truncated' column
max_text_length = 400

def truncate_review(text):
    return text[:max_text_length]

df['truncated'] = df['reviewText'].apply(truncate_review)
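
A fixed character cut can end in the middle of a word. If that matters for the embeddings, one alternative is to back off to the last space before the limit. The sketch below is not what the notebook does; the function name `truncate_at_word` and the sample strings are hypothetical.

```python
def truncate_at_word(text, max_length=400):
    """Cut text to at most max_length characters, preferring a word boundary."""
    if len(text) <= max_length:
        return text
    cut = text[:max_length]
    # If the next character is whitespace, the cut already ends on a whole word
    if text[max_length].isspace():
        return cut.rstrip()
    # Otherwise back off to the last space inside the cut, if there is one
    head = cut.rpartition(" ")[0]
    return head.rstrip() if head.strip() else cut


short = truncate_at_word("bright and tiny", max_length=400)
clipped = truncate_at_word("bright and tiny light", max_length=12)
unbroken = truncate_at_word("a" * 20, max_length=8)
```

Text with no space before the limit falls back to the plain character cut, so the result never exceeds `max_length`.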

# Save Parquet Files

With the main selection work done, we'll save the data as parquet files for use in the next notebook.

In [16]:
# save to parquet files
df.to_parquet('reviews.parquet.gz', compression='gzip')
df_meta.to_parquet('metadata.parquet.gz', compression='gzip')