# Amazon Reviews Sentiment Analysis

## 1. Data Loading

For this project, we use the [Amazon Review Data (2018)](https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/) dataset provided by the University of California San Diego. It contains 233.1 million Amazon reviews written between May 1996 and October 2018 across 29 product categories. For each product category, the dataset provides two JSON files in which every line is a separate JSON object: one file contains the reviews and the other contains the product metadata.
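
Because every line is a separate JSON object, such a file can be read one record at a time without loading it whole. The following is a minimal sketch in plain Python on a two-line in-memory sample. The values are made up for illustration; only the field names `asin`, `overall` and `reviewText` come from the schema used later in this notebook, and `iter_records` is a hypothetical helper.

```python
import io
import json

# Two made-up records in the one-object-per-line layout (values are illustrative only).
sample = io.StringIO(
    '{"asin": "A1", "overall": 5.0, "reviewText": "Great"}\n'
    '{"asin": "B2", "overall": 2.0, "reviewText": "Broke quickly"}\n'
)

def iter_records(lines):
    """Yield one dict per non-empty line of a line-delimited JSON stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

records = list(iter_records(sample))
print(len(records), records[0]["asin"])
```

The same generator accepts any iterable of lines, for example a file handle from `open(path)` or from `gzip.open(path, "rt")` for a compressed download.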

In [None]:
%%bash
categories=("AMAZON_FASHION" "All_Beauty" "Appliances" "Arts_Crafts_and_Sewing" "Automotive" "Books" "CDs_and_Vinyl" "Cell_Phones_and_Accessories" "Clothing_Shoes_and_Jewelry" "Digital_Music" "Electronics" "Gift_Cards" "Grocery_and_Gourmet_Food" "Home_and_Kitchen" "Industrial_and_Scientific" "Kindle_Store" "Luxury_Beauty" "Magazine_Subscriptions" "Movies_and_TV" "Musical_Instruments" "Office_Products" "Patio_Lawn_and_Garden" "Pet_Supplies" "Prime_Pantry" "Software" "Sports_and_Outdoors" "Tools_and_Home_Improvement" "Toys_and_Games" "Video_Games")
data_links=()
meta_links=()

for category in "${categories[@]}"
do
    data_links+=("https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/categoryFiles/$category.json.gz")
    meta_links+=("https://datarepo.eng.ucsd.edu/mcauley_group/data/amazon_v2/metaFiles2/meta_$category.json.gz")
done

mkdir -p /data/json/
wget -P /data/json/ "${data_links[@]}" "${meta_links[@]}"

gunzip /data/json/*.gz

With the above script, we download the review and metadata files for each category, store them in the `/data/json/` directory and decompress them. The resulting JSON files amount to almost 300 GB of data.
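
With this much data, it can be worth checking the free disk space before starting the download. The following is a small sketch using only the standard library. The 300 GB budget is an assumption derived from the figure above, and `has_free_space` is a hypothetical helper.

```python
import shutil

def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes

# Assumed budget: 300 GB for the extracted JSON files (decimal gigabytes).
REQUIRED_BYTES = 300 * 10**9

print(has_free_space(".", REQUIRED_BYTES))
```

In practice, the path passed in would be the target directory, here `/data/json/`, once it exists.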

## 2. Data Preparation

Now that we have downloaded these huge files, we cannot simply load them into memory and start working with them. One way of dealing with this problem is to use [Spark](https://spark.apache.org/) on top of the [Hadoop Distributed File System (HDFS)](https://hadoop.apache.org/), which allows us to distribute both the data and the computations on it. The following steps therefore require a Spark and Hadoop installation on the machine.

### Rename fields

In [None]:
%%bash
sed -i -e 's/:":/":/g' /data/json/*.json

sed -i -e 's/"size":/"_size":/g' /data/json/meta_AMAZON_FASHION.json
sed -i -e 's/"size":/"_size":/g' /data/json/meta_Clothing_Shoes_and_Jewelry.json

Before we begin to work with the data using Spark, we have to rename some fields in the JSON files, because their names are not compatible with Spark DataFrames.

Some of the fields contain a `:` at the end of their name, which was likely introduced by some mistake in the data collection process and is not allowed in Spark. We remove the `:` from the field names with the first command.

The metadata files of the categories `AMAZON_FASHION` and `Clothing_Shoes_and_Jewelry` also contain a field called `size`, which is a reserved keyword in Spark. Therefore, the second and third commands rename this field to `_size`.
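
To make the effect of the `sed` expressions concrete, the same substitutions can be sketched in plain Python. The sample lines are made up for illustration and only the key patterns matter; `strip_trailing_colon` and `rename_size` are hypothetical helpers.

```python
def strip_trailing_colon(line):
    """Mirror `sed 's/:":/":/g'`: turn keys such as "Size:" into "Size"."""
    return line.replace(':":', '":')

def rename_size(line):
    """Mirror `sed 's/"size":/"_size":/g'`: rename the key "size" to "_size"."""
    return line.replace('"size":', '"_size":')

# Made-up sample lines; only the key patterns matter here.
review_line = '{"style": {"Size:": " Large", "Color:": " Red"}}'
meta_line = '{"asin": "A1", "size": "M"}'

print(strip_trailing_colon(review_line))
print(rename_size(meta_line))
```

Like the `sed` commands, these are plain text substitutions and not JSON-aware, so they act on every occurrence of the pattern in a line.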

### Join data and metadata

In [11]:
import pyspark.pandas as ps
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, BooleanType, ArrayType, LongType

spark = (
    SparkSession.builder
    .master('local[*]')
    .config('spark.driver.memory', '8g')
    .config('spark.executor.memory', '8g')
    .config('spark.memory.offHeap.enabled', 'true')
    .config('spark.memory.offHeap.size', '28g')
    .getOrCreate()
)

Now we are ready to start working with the data in Spark. We create a new session whose memory settings match our machine.

In [12]:
categories = ['AMAZON_FASHION', 'All_Beauty', 'Appliances', 'Arts_Crafts_and_Sewing', 'Automotive', 'Books', 'CDs_and_Vinyl', 'Cell_Phones_and_Accessories', 'Clothing_Shoes_and_Jewelry', 'Digital_Music', 'Electronics', 'Gift_Cards', 'Grocery_and_Gourmet_Food', 'Home_and_Kitchen', 'Industrial_and_Scientific', 'Kindle_Store', 'Luxury_Beauty', 'Magazine_Subscriptions', 'Movies_and_TV', 'Musical_Instruments', 'Office_Products', 'Patio_Lawn_and_Garden', 'Pet_Supplies', 'Prime_Pantry', 'Software', 'Sports_and_Outdoors', 'Tools_and_Home_Improvement', 'Toys_and_Games', 'Video_Games']

Since the original data is split into two files for each category, we need a list of all category names, called `categories`, in order to work with the data.
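
Each category name maps to one review file and one metadata file. The sketch below derives both paths from a name and lists any expected file that is missing. It assumes the `/data/json/` layout from the download step; `category_files` and `missing_files` are hypothetical helpers.

```python
from pathlib import Path

def category_files(category, base="/data/json"):
    """Return the (reviews, metadata) paths for one category."""
    base = Path(base)
    return base / f"{category}.json", base / f"meta_{category}.json"

def missing_files(categories, base="/data/json"):
    """List expected input files that do not exist under `base`."""
    return [
        path
        for category in categories
        for path in category_files(category, base)
        if not path.exists()
    ]

reviews_path, meta_path = category_files("Gift_Cards")
print(reviews_path, meta_path)
```

Calling `missing_files(categories)` before the processing loop gives an early warning if a download failed.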

In [3]:
schema = StructType([
    StructField('asin', StringType()),
    StructField('image', ArrayType(StringType())),
    StructField('overall', DoubleType()),
    StructField('reviewText', StringType()),
    StructField('reviewTime', StringType()),
    StructField('reviewerID', StringType()),
    StructField('reviewerName', StringType()),
    StructField('style', StructType([
        StructField('Color', StringType()),
        StructField('Color Name', StringType()),
        StructField('Design', StringType()),
        StructField('Flavor', StringType()),
        StructField('Format', StringType()),
        StructField('Item Package Quantity', StringType()),
        StructField('Package Quantity', StringType()),
        StructField('Package Type', StringType()),
        StructField('Pattern', StringType()),
        StructField('Scent Name', StringType()),
        StructField('Size', StringType()),
        StructField('Size Name', StringType()),
        StructField('Style', StringType()),
        StructField('Style Name', StringType()),
    ])),
    StructField('summary', StringType()),
    StructField('unixReviewTime', LongType()),
    StructField('verified', BooleanType()),
    StructField('vote', StringType()),
])

Our goal is to combine all the available data into a single (yet distributed) file. However, the structure of the review files differs slightly between categories, which leads to problems when we try to combine them. Therefore, we first unify the structure by reading every review file with the same schema, which we define above as the Spark `StructType` called `schema`.
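
The effect of a fixed schema can be illustrated in plain Python: every record ends up with the same fields, unknown fields are dropped and missing ones become null. This is only a rough sketch of what reading with an explicit schema achieves. The records are made up and `conform` is a hypothetical helper.

```python
# Top-level field names taken from the `schema` defined above.
FIELDS = [
    "asin", "image", "overall", "reviewText", "reviewTime", "reviewerID",
    "reviewerName", "style", "summary", "unixReviewTime", "verified", "vote",
]

def conform(record, fields=FIELDS):
    """Keep only the known fields and fill missing ones with None."""
    return {name: record.get(name) for name in fields}

# Made-up records with different key sets.
a = {"asin": "A1", "overall": 5.0, "vote": "3"}
b = {"asin": "B2", "overall": 1.0, "extra": "not in schema"}

rows = [conform(a), conform(b)]
print(rows[0].keys() == rows[1].keys())
```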

In [None]:
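for category in categories:
    print(f'Processing {category}')

    # Read the reviews with the unified schema, indexed by reviewer and product.
    data = ps.read_json(f'/data/json/{category}.json', schema=schema, index_col=['reviewerID', 'asin'])

    # Read the product metadata, indexed by product ID, and drop the columns we do not keep.
    meta = ps.read_json(f'/data/json/meta_{category}.json', index_col='asin')
    meta = meta.drop(['similar_item', 'details', 'tech1', 'tech2'], axis=1)

    # Attach the metadata to each review via `asin`, then restore the (reviewerID, asin) index.
    df = data.reset_index().join(meta, on='asin')
    df.reset_index(inplace=True)
    df.set_index(['reviewerID', 'asin'], inplace=True)

    # Tag every row with its category and write the result as parquet.
    df['category'] = category
    df.to_parquet(f'/data/{category}.parquet', index_col=['reviewerID', 'asin'])

    print(f'Finished {category}')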

We can then read the review and metadata files for each category into pandas-on-Spark DataFrames, join them on the product ID `asin` and write the resulting DataFrame to a parquet file.
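
Since `join` is called without a `how` argument, it performs a left join: every review is kept, and reviews without matching metadata get null values in the metadata columns. The following plain-Python sketch illustrates these semantics on made-up data. The column names `title` and `brand` are chosen only for illustration, and `left_join` is a hypothetical helper.

```python
# Made-up reviews and product metadata; `asin` is the product identifier.
reviews = [
    {"reviewerID": "R1", "asin": "A1", "overall": 5.0},
    {"reviewerID": "R2", "asin": "A1", "overall": 3.0},
    {"reviewerID": "R3", "asin": "B2", "overall": 1.0},
]
meta_by_asin = {"A1": {"title": "Widget", "brand": "Acme"}}
meta_columns = ["title", "brand"]

def left_join(reviews, meta_by_asin, meta_columns):
    """Attach metadata to each review; unmatched reviews get None values."""
    joined = []
    for review in reviews:
        meta = meta_by_asin.get(review["asin"], {})
        row = dict(review)
        row.update({col: meta.get(col) for col in meta_columns})
        joined.append(row)
    return joined

rows = left_join(reviews, meta_by_asin, meta_columns)
print(rows[0]["title"], rows[2]["title"])
```

One difference to keep in mind: a dictionary holds a single entry per `asin`, whereas a DataFrame join produces one output row for every matching metadata row.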

The pandas API on Spark (`pyspark.pandas`) is a layer on top of Spark DataFrames that can be used much like the popular Python library `pandas`, with the added benefit that both the data and the computations are distributed.

Parquet is a columnar storage format that is optimized for distributed data processing. It is also Spark's default data source format.

### Concatenate categories

Now that all the data has a clean structure and is stored in an efficient file format, it is much easier to work with. However, we still have a separate file for each category, and handling them individually would be cumbersome. Therefore, we concatenate all the parquet files into a single file.

In [14]:
df_list = [ps.read_parquet(f'/data/{category}.parquet', index_col=['reviewerID', 'asin']) for category in categories]
df = ps.concat(df_list)
df.to_parquet('/data/data.parquet', index_col=['reviewerID', 'asin'])

To combine the files, we read the separate parquet files into DataFrames, concatenate them and write the resulting DataFrame to a new parquet file.
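
By default, `ps.concat` combines the columns with an outer join (`join='outer'`), so a column that exists in only some categories is kept and filled with nulls elsewhere. The following plain-Python sketch shows this behaviour on made-up rows. The columns `title` and `rank` are illustrative, and `concat_outer` is a hypothetical helper.

```python
def concat_outer(frames):
    """Concatenate lists of row dicts, keeping the union of all columns."""
    columns = []
    for rows in frames:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    return [{name: row.get(name) for name in columns} for rows in frames for row in rows]

# Made-up rows for two categories with slightly different columns.
books = [{"asin": "A1", "category": "Books", "title": "Novel"}]
software = [{"asin": "B2", "category": "Software", "rank": "12"}]

combined = concat_outer([books, software])
print(len(combined), sorted(combined[0]))
```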