# CSE272 HW2 Recommendation System 

Author: Anthony Liu ([e-mail](mailto:hliu35@ucsc.edu))

Date: 05/20/2021

Please read README.md before running.

## 1. Data Selection and Preprocessing

The default dataset used is the video game [review dataset](http://jmcauley.ucsd.edu/data/amazon/) from Amazon. Please note that we are using the FULL .json dataset instead of the 5-category subset.

The dataset is split into training and testing sets at an 80%:20% ratio. The cells in this section load the reviews and preprocess them ahead of that split.

In [1]:
import os
import time
import json
import pickle
import numpy as np
import pandas as pd

import utils as UT

# working directory
print(os.getcwd())

/home/ry/Documents/recommender


In [2]:
# list the contents of the data directory (shell command: ls data)
!ls data

Video_Games_5.json


In [3]:
# read the JSON Lines file: one review record per line
data_list = []
with open("data/Video_Games_5.json") as f:
    for line in f:
        data_list.append(json.loads(line))

print(len(data_list))

# sort the data by reviewerID (uid), then by asin (pid)
data_list.sort(key=lambda r: (r["reviewerID"], r["asin"]))


231780
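
Sorting by `(reviewerID, asin)` places each user's reviews next to each other, which makes per-user grouping straightforward. The sketch below illustrates this on three made-up records; the IDs and the `per_user` name are hypothetical and are not used elsewhere in the notebook.

```python
import json
from itertools import groupby

# Hypothetical sample in JSON Lines form (one JSON object per line).
raw = """\
{"reviewerID": "U2", "asin": "P9", "overall": 4.0}
{"reviewerID": "U1", "asin": "P7", "overall": 5.0}
{"reviewerID": "U1", "asin": "P3", "overall": 2.0}
"""

# Parse each non-empty line, then sort with the same key as the notebook.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]
records.sort(key=lambda r: (r["reviewerID"], r["asin"]))

# After sorting, each user's reviews are contiguous, so groupby can walk them.
per_user = {
    uid: [r["asin"] for r in group]
    for uid, group in groupby(records, key=lambda r: r["reviewerID"])
}
print(per_user)
```

Note that `groupby` only merges adjacent items, so it depends on the sort having been applied first.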


In [4]:
uids, pids, ratings, reviews, titles, unixtimestamps, helpfulness = [], [], [], [], [], [], []
col_titles = list(data_list[0].keys())
print(col_titles)

['reviewerID', 'asin', 'reviewerName', 'helpful', 'reviewText', 'overall', 'summary', 'unixReviewTime', 'reviewTime']


In [5]:
# convert to Pandas-friendly format (column-wise)
product_names = dict() # storing for recommendation
uid_set = set()
pid_set = set()

for row in data_list:
    uids.append(row["reviewerID"])
    pids.append(row["asin"])
    ratings.append(row["overall"])
    reviews.append(row["reviewText"])
    titles.append(row["summary"])
    unixtimestamps.append(row["unixReviewTime"])
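    # row["helpful"] is [helpful_votes, total_votes]; the argument below is total minus helpful
    h = UT.sigmoid(row["helpful"][1]-row["helpful"][0]) # squash the vote difference with a sigmoid instead of using it directly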
    helpfulness.append(h)

    # "summary" is the review headline, so this keeps the last-seen headline per product
    product_names[row["asin"]] = row["summary"]
    uid_set.add(row["reviewerID"]) # IMPORTANT: ALL USER ID IN DATASET
    pid_set.add(row["asin"]) # IMPORTANT: ALL PRODUCT ID IN DATASET

print("User count:",len(uid_set))
print("Product count:",len(pid_set))

# generate a map from each uid/pid to its ndarray index
uid_idx = {uid: i for i, uid in enumerate(uid_set)}
pid_idx = {pid: j for j, pid in enumerate(pid_set)}

User count: 24303
Product count: 10672
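
The helper `UT.sigmoid` lives in `utils.py` and is not shown in this notebook. The sketch below assumes it is the standard logistic function and mirrors how the helpfulness value is computed in the cell above; `sigmoid` and `helpfulness_score` here are illustrative stand-ins, not the actual contents of `utils.py`.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x is negative here, so exp(x) cannot overflow
    return z / (1.0 + z)

def helpfulness_score(helpful):
    """Mirror the notebook: sigmoid of helpful[1] - helpful[0]."""
    helpful_votes, total_votes = helpful  # assumed layout of the "helpful" field
    return sigmoid(total_votes - helpful_votes)

print(helpfulness_score([0, 0]))  # no votes at all
print(helpfulness_score([2, 3]))  # one vote out of three was not "helpful"
```

Under this reading the score grows with the number of votes that did not mark the review as helpful, which may or may not be the intended direction.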


In [6]:
with open("indices.pickle", "wb") as f:
    pickle.dump([uid_idx, pid_idx, uid_set, pid_set], f)
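
As a rough illustration of how the saved index maps might be used later, the sketch below writes a toy `indices.pickle` to a temporary directory, reloads it, and fills a small user-by-product rating matrix. The toy IDs, the `ratings` list, and the matrix name `R` are assumptions made for this example. It also uses `sorted()` so that the indices are reproducible across runs, which differs from the notebook, where the sets are enumerated directly.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical toy IDs standing in for the real reviewerID / asin values.
uid_set = {"U1", "U2", "U3"}
pid_set = {"P1", "P2"}
uid_idx = {uid: i for i, uid in enumerate(sorted(uid_set))}
pid_idx = {pid: j for j, pid in enumerate(sorted(pid_set))}

# Save in the same list layout the notebook uses.
path = os.path.join(tempfile.mkdtemp(), "indices.pickle")
with open(path, "wb") as f:
    pickle.dump([uid_idx, pid_idx, uid_set, pid_set], f)

# Reload and unpack in the same order.
with open(path, "rb") as f:
    uid_idx2, pid_idx2, uid_set2, pid_set2 = pickle.load(f)

# Fill a dense user-by-product matrix; 0.0 marks "no rating".
ratings = [("U1", "P2", 5.0), ("U3", "P1", 2.0)]
R = np.zeros((len(uid_set2), len(pid_set2)))
for uid, pid, rating in ratings:
    R[uid_idx2[uid], pid_idx2[pid]] = rating

print(R)
```

A dense matrix is only practical for a toy example like this one. With 24303 users and 10672 products, a dense `float64` array would need about 2 GB, so a sparse representation is likely a better fit at full scale.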

In [7]:
data = {"user-id":uids,
        "product-id":pids, 
        #"helpfulness":helpfulness, 
        "rating":ratings, 
        #"review":reviews, 
        #"title":titles, 
        #"timestamp":unixtimestamps
}
DF = pd.DataFrame(data)
DF.to_pickle("DF.pickle")
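
The 80%:20% split itself is not performed in this part of the notebook. One simple way to do it, assuming a random row-wise split is acceptable, is sketched below on a toy frame with the same three columns as `DF`. The toy values and the fixed `random_state` are assumptions for illustration, and the split actually used in `part2a.ipynb` may differ (for example, it could split per user).

```python
import pandas as pd

# Hypothetical toy frame with the same columns as the notebook's DF.
DF = pd.DataFrame({
    "user-id": [f"U{i % 3}" for i in range(10)],
    "product-id": [f"P{i}" for i in range(10)],
    "rating": [float(1 + i % 5) for i in range(10)],
})

# Draw 80% of the rows at random for training; the rest form the test set.
train = DF.sample(frac=0.8, random_state=0)
test = DF.drop(train.index)

print(len(train), len(test))
```

A purely random split can leave some users or products with no training rows, which matters for collaborative filtering, so a per-user split may be worth considering.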

#### *continues in* `part2a.ipynb`