# Pre-(before-the-workshop)-processing of UCSD Data

This solution and workshop use the [Amazon Review Data (2018)](https://nijianmo.github.io/amazon/index.html) dataset published by UCSD, as used in the paper:

**Justifying recommendations using distantly-labeled reviews and fine-grained aspects**<br/>
Jianmo Ni, Jiacheng Li, Julian McAuley<br/>
_Empirical Methods in Natural Language Processing (EMNLP), 2019 [(PDF)](http://cseweb.ucsd.edu/~jmcauley/pdfs/emnlp19a.pdf)_


## The Challenge

The base dataset is:

1. **Large** - We're only trying to create a representative demo here, not needlessly consume lots of resources
2. **Sparse** - Since views > purchases > reviews, only a fraction of products (and users) have multiple reviews

UCSD's published data already helps with this, by offering **5-core reviews** subsets pre-filtered to include only the reviews:

* by users who've written 5 or more reviews
* for products that received 5 or more reviews
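
To make the two conditions concrete, here is a minimal sketch of a "k-core" filter on a list of review records. It assumes each review is a dict with `reviewerID` and `asin` fields, and it repeats the filter until nothing changes so that both conditions hold at once. Whether UCSD built their subsets exactly this way is an assumption on our part, and `k_core` is just an illustrative name.

```python
from collections import Counter


def k_core(reviews, k=5):
    """Drop reviews until every remaining user and product has at least k reviews.

    Removing a sparse product can push a user below k (and vice versa),
    so we repeat the filter until a pass removes nothing.
    """
    reviews = list(reviews)
    while True:
        user_counts = Counter(r["reviewerID"] for r in reviews)
        item_counts = Counter(r["asin"] for r in reviews)
        kept = [
            r for r in reviews
            if user_counts[r["reviewerID"]] >= k and item_counts[r["asin"]] >= k
        ]
        if len(kept) == len(reviews):
            return kept
        reviews = kept


# Toy example with k=2: user "u3" only survives the first pass.
toy = [
    {"reviewerID": user, "asin": item}
    for user, item in [
        ("u1", "a"), ("u1", "b"),
        ("u2", "a"), ("u2", "b"),
        ("u3", "a"), ("u3", "c"),
    ]
]
toy_core = k_core(toy, k=2)
```

In the toy example, product `c` has a single review and is dropped first, which leaves `u3` with only one review, so `u3` is dropped on the next pass.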

If we focus on one particular product category (say, _"Sports and Outdoors"_ or _"Grocery and Gourmet Food"_) then this gets us to a nice manageable volume of reviews for training demo recommender engines - great!

...The only remaining problem is that there **aren't 5-core filtered versions of the product metadata files**... So our demo's start-up time (populating the products database) would still be unacceptable! 😭


## Notebook Overview

Here we simply **filter out (most of) the unreviewed products from a UCSD category product metadata file**, creating a slimmed-down product metadata file for the demo solution to use. We do this while:

* Assuming the parsed 5-core reviews dataset fits into memory (so lookups are fast), but the full category product list doesn't (so streaming is necessary)
* Preserving the data format (so users can relate it to the source dataset, and the workshop can still do some of the more "real" data pre-processing)

**Why "most of", not all?** We'd still like *some* unreviewed products, to demonstrate cold-start functionality... We just don't want our website swamped with 'em!
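
As a rough illustration of the idea (not the actual implementation in `preproc.py`, which isn't reproduced here), a streaming filter could look something like the sketch below. The function name `filter_products_sketch` and its `keep_unreviewed` parameter are hypothetical, and treating the cold-start setting as a per-product keep probability is our assumption, not necessarily how `max_cold_start` is interpreted.

```python
import gzip
import io
import json
import random


def filter_products_sketch(src, dst, reviewed_ids, keep_unreviewed=0.01, seed=None):
    """Stream gzipped JSON-lines from src to dst, one product at a time.

    Keeps every product whose "asin" is in reviewed_ids, plus a random
    sample of the rest (ASSUMPTION: each kept with prob. keep_unreviewed).
    Returns (n_reviewed_kept, n_unreviewed_kept, n_total_read).
    """
    rng = random.Random(seed)
    n_reviewed = n_cold = n_total = 0
    with gzip.GzipFile(fileobj=src, mode="rb") as fin:
        with gzip.GzipFile(fileobj=dst, mode="wb") as fout:
            for line in fin:
                if not line.strip():
                    continue  # Skip blank lines
                n_total += 1
                if json.loads(line)["asin"] in reviewed_ids:
                    n_reviewed += 1
                elif rng.random() < keep_unreviewed:
                    n_cold += 1
                else:
                    continue  # Drop this unreviewed product
                # Copy the original bytes so the record format is preserved
                fout.write(line)
    return n_reviewed, n_cold, n_total
```

Only one line is held in memory at a time, and writing the original bytes back out (rather than re-serializing the parsed object) keeps each record identical to the source.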


## Setup

Imports, AWS connection, and configuration:


In [None]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
import io

# External Dependencies:
import boto3

# Local Dependencies:
from preproc import remove_unused_items
from dataformat import json_gz_reader


In [None]:
session = boto3.session.Session()
region = session.region_name
s3 = session.resource("s3")


In [None]:
# Note we only mirror a subset of the files available through
# https://nijianmo.github.io/amazon/index.html
category = "Grocery_and_Gourmet_Food"
max_cold_start = 0.01

reviews_uri = f"s3://public-personalize-demo-assets-{region}/data/{category}_5.json.gz"
products_raw_uri = f"s3://public-personalize-demo-assets-{region}/data/meta_{category}.json.gz"
products_out_uri = \
    f"s3://public-personalize-demo-assets-{region}/data/meta_{category}_{5 + max_cold_start}.json.gz"


## Identify Reviewed Product IDs

Loading the interactions file from S3 directly into RAM might take a while depending on the size of the category.

We only need the set of product IDs mentioned in any review, so that's all we store.

*Grocery_and_Gourmet_Food_5.json.gz* (1.1M reviews) took ~14s on our t3 instance
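
For readers without the local `dataformat` module to hand, a reader for gzipped JSON-lines could be sketched as below. This is an assumption about what a helper like `json_gz_reader` needs to do, not its actual source, and `iter_json_gz` is a hypothetical name. The usage example swaps the S3 response body for an in-memory buffer.

```python
import gzip
import io
import json


def iter_json_gz(stream):
    """Yield one parsed object per non-empty line of a gzipped JSON-lines stream."""
    with gzip.GzipFile(fileobj=stream, mode="rb") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)


# Usage with an in-memory stand-in for the streamed S3 object body:
buf = io.BytesIO(gzip.compress(b'{"asin": "A1"}\n{"asin": "A2"}\n{"asin": "A1"}\n'))
item_ids = {review["asin"] for review in iter_json_gz(buf)}
```

Because this is a generator, only the set of IDs grows in memory, not the parsed reviews themselves.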


In [None]:
%%time

reviewed_item_ids = set()
n_reviews = 0

reviews_bucket, _, reviews_key = reviews_uri[len("s3://"):].partition("/")

for review in json_gz_reader(s3.Object(reviews_bucket, reviews_key).get()["Body"]):
    n_reviews += 1
    reviewed_item_ids.add(review["asin"])

print(f"{len(reviewed_item_ids)} products reviewed over {n_reviews} reviews")


## Filter the Product Metadata

Now we just need to filter out the majority of the unreviewed products. The console output below should show just how extreme the data reduction is!

We buffer the filtered binary data into memory (why create files everywhere?) and then directly upload it to S3.

(On our t3 instance with *meta_Grocery_and_Gourmet_Food.json.gz*, this took around 50s end-to-end)


In [None]:
%%time

products_raw_bucket, _, products_raw_key = products_raw_uri[len("s3://"):].partition("/")
products_out_bucket, _, products_out_key = products_out_uri[len("s3://"):].partition("/")

# See the local file this function was imported from above for implementation details.
# The output file should fit in memory, so let's not pollute the filesystem.
with io.BytesIO() as fout:
    print(f"Filtering data from {products_raw_uri}...")
    remove_unused_items(
        s3.Object(products_raw_bucket, products_raw_key).get()["Body"],
        fout,
        reviewed_item_ids,
        max_cold_start=max_cold_start,
    )

    fout.seek(0)
    print(f"\nUploading to {products_out_uri}...")
    s3.Object(products_out_bucket, products_out_key).put(Body=fout)
    print("Uploaded!")


That's all! We could validate this work by, for example:

- Re-running `remove_unused_items()` on the filtered product dataset to check the result is the same
- Iterating through the filtered product dataset to check the item format is the same

...We could even call `pandas.read_json(..., lines=True)` on the output S3 URI, but for most product categories that would still need more RAM than an `ml.t*.medium` instance provides.
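
A lighter-weight check that avoids loading everything at once might look like the sketch below. It streams through the gzipped output, confirms every record is a JSON object with an `asin` field, and counts how many kept products are unreviewed. The function name `summarize_filtered_products` is hypothetical, and the check covers only the `asin` field rather than the full item format.

```python
import gzip
import io
import json


def summarize_filtered_products(gz_bytes, reviewed_ids):
    """Return (n_items, n_unreviewed) for gzipped JSON-lines product data.

    Raises ValueError if a record is not a JSON object with an "asin" field.
    """
    n_items = n_cold = 0
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes), mode="rb") as lines:
        for line_no, line in enumerate(lines, start=1):
            if not line.strip():
                continue  # Ignore blank lines
            item = json.loads(line)
            if not isinstance(item, dict) or "asin" not in item:
                raise ValueError(f"Line {line_no} is not a product record with an 'asin' field")
            n_items += 1
            n_cold += item["asin"] not in reviewed_ids
    return n_items, n_cold


# Toy data: two products, one of which has no reviews.
toy_gz = gzip.compress(b'{"asin": "A1"}\n{"asin": "B2"}\n')
n_items, n_unreviewed = summarize_filtered_products(toy_gz, {"A1"})
```

In this notebook, something like `fout.getvalue()` could be passed in before the `BytesIO` buffer is closed, with `reviewed_item_ids` as the second argument.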