### Reference
- https://towardsdatascience.com/building-a-collaborative-filtering-recommender-system-with-clickstream-data-dffc86c8c65
- https://pypi.org/project/python-amazon-simple-product-api/
- https://github.com/benfred/implicit
- https://medium.com/@patelneha1495/recommendation-system-in-python-using-als-algorithm-and-apache-spark-27aca08eaab3
- https://towardsdatascience.com/prototyping-a-recommender-system-step-by-step-part-2-alternating-least-square-als-matrix-4a76c58714a1

## Data format
- The file contains one review per line, in JSON. Each review has the following fields:

    - reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
    - asin - ID of the product, e.g. 0000013714
    - reviewerName - name of the reviewer
    - vote - helpful votes of the review
    - style - a dictionary of the product metadata, e.g., "Format" is "Hardcover"
    - reviewText - text of the review
    - overall - rating of the product
    - summary - summary of the review
    - unixReviewTime - time of the review (unix time)
    - reviewTime - time of the review (raw)
    - image - images that users post after they have received the product
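
As a rough illustration of the one-review-per-line format, the sketch below streams a gzipped JSON-lines file with the standard library only. The `parse` helper, the sample file name, and the two sample reviews are made up for this example; it is not the loader used later in this notebook, which relies on Spark.

```python
import gzip
import json
import os
import tempfile

def parse(path):
    """Yield one review dict per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)

# Hypothetical two-review sample, written to a temporary file for illustration only
sample_reviews = [
    {"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "overall": 5.0, "vote": "2"},
    {"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013715", "overall": 4.0},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample_reviews.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    for review in sample_reviews:
        handle.write(json.dumps(review) + "\n")

# Keep only the three fields the recommender uses
triples = [(r["reviewerID"], r["asin"], r["overall"]) for r in parse(sample_path)]
print(triples)
```

Fields such as `vote` are simply absent from a review that has none, so `dict.get` is the safer accessor for optional fields.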

In [None]:
!pwd

In [None]:
!pip install ipython-autotime

In [1]:
#### To measure all running time
# https://github.com/cpcloud/ipython-autotime

%load_ext autotime

In [2]:
import gc

collected = gc.collect()
print ("Garbage collector: collected %d objects." % collected)

Garbage collector: collected 0 objects.
time: 15.8 ms


In [None]:
!pip install implicit

In [3]:
import pandas as pd
import scipy.sparse as sparse
import numpy as np
import random
import implicit
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics

time: 665 ms


In [4]:
import os
import time
import tqdm
import codecs

# spark imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import UserDefinedFunction, explode, desc
from pyspark.sql.types import StringType, ArrayType
from pyspark.ml.evaluation import RegressionEvaluator

# data science imports
import math

# visualization imports
import seaborn as sns
import matplotlib.pyplot as plt

import json

%matplotlib inline

time: 205 ms


In [5]:
import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

time: 845 µs


In [6]:
number_cores = 16
memory_gb = 32

spark = SparkSession \
    .builder \
    .appName("amazon recommendation") \
    .config("spark.driver.memory", '{}g'.format(memory_gb)) \
    .config("spark.master", 'local[{}]'.format(number_cores)) \
    .getOrCreate()

# get spark context
sc = spark.sparkContext

time: 2.21 s


- Download dataset from: http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Clothing_Shoes_and_Jewelry.json.gz

In [7]:
!ls -alh 

total 3.4G
drwxrwxr-x  5 ec2-user ec2-user 4.0K May 14 17:55 .
drwxrwxr-x 10 ec2-user ec2-user 4.0K May 14 17:30 ..
-rw-rw-r--  1 ec2-user ec2-user  46K May 14 17:55 0504-JH-DSE-260-Capstone.ipynb
-rw-rw-r--  1 ec2-user ec2-user 3.4G May 14 07:16 Clothing_Shoes_and_Jewelry.json.gz
drwxrwxr-x  3 ec2-user ec2-user 4.0K May 14 17:36 gluon_recommender_system
drwxrwxr-x  2 ec2-user ec2-user 4.0K May 14 07:14 .ipynb_checkpoints
-rw-rw-r--  1 ec2-user ec2-user  83K May 14 02:03 JH-DSE-260-Capstone-ALS-2.ipynb
drwxrwxr-x  2 ec2-user ec2-user 4.0K May 14 07:18 spark-warehouse
time: 117 ms


In [8]:
DATA_PATH = './'
REVIEW_DATA = 'Clothing_Shoes_and_Jewelry.json.gz'

time: 880 µs


1. Please unzip Clothing_Shoes_and_Jewelry.json.gz to Clothing_Shoes_and_Jewelry.json
2. Load Clothing_Shoes_and_Jewelry.json (14.1 GB (14,144,939,923 bytes))

In [9]:
# Spark infers the schema from the JSON records; the CSV-only options (header, inferSchema) are not needed
ratings = spark.read.json(DATA_PATH + REVIEW_DATA)

time: 3min 59s


In [10]:
ratings.show(3)

+----------+-----+-------+--------------------+-----------+--------------+-------------+--------------------+--------------------+--------------+--------+----+
|      asin|image|overall|          reviewText| reviewTime|    reviewerID| reviewerName|               style|             summary|unixReviewTime|verified|vote|
+----------+-----+-------+--------------------+-----------+--------------+-------------+--------------------+--------------------+--------------+--------+----+
|0871167042| null|    5.0|This book has bea...| 05 4, 2014|A2IC3NZN488KWK|   Ruby Tulip|[,,,,,,,,  Paperb...|      Unique designs|    1399161600|    true|   2|
|0871167042| null|    4.0|I love the ideas ...|04 26, 2014|A3OT9BYASFGU2X|    Laurie K.|[,,,,,,,,  Paperb...|makes you want to...|    1398470400|    true|null|
|0871167042| null|    5.0|As someone who ha...|04 17, 2014|A28GK1G2KDXHRP|Marie Rhoades|[,,,,,,,,  Paperb...|Highly Recommend ...|    1397692800|   false|   6|
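+----------+-----+-------+--------------------+-----------+--------------+-------------+--------------------+--------------------+--------------+--------+----+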

In [11]:
type(ratings)

pyspark.sql.dataframe.DataFrame

time: 2.92 ms


In [12]:
# print("Shape of Data", (ratings.count(), len(ratings.columns)))

time: 871 µs


## Drop and Clean data
- Drop rows where `vote` is null.
- Reviews that received helpful votes are treated as more reliable.
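
To make the filtering step concrete, here is a small pandas sketch of the same idea on the three reviews shown in the `ratings.show(3)` output above. It is only an illustration: the notebook itself does this in Spark, and `toy` and `voted` are names introduced for this example.

```python
import pandas as pd

# The three reviews from the preview above; None marks a review without helpful votes
toy = pd.DataFrame({
    "asin": ["0871167042", "0871167042", "0871167042"],
    "reviewerID": ["A2IC3NZN488KWK", "A3OT9BYASFGU2X", "A28GK1G2KDXHRP"],
    "overall": [5.0, 4.0, 5.0],
    "vote": ["2", None, "6"],
})

# Keep only rows where `vote` is present, similar in spirit to na.drop(subset='vote') in Spark
voted = toy.dropna(subset=["vote"])
print(len(toy), "->", len(voted))
```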

In [13]:
clean_ratings = ratings.na.drop(how='any', subset='vote')

time: 43.3 ms


In [14]:
print("Shape of Data", (clean_ratings.count(), len(clean_ratings.columns)))

Shape of Data (2886813, 12)
time: 1min 47s


In [15]:
clean_ratings.columns

['asin',
 'image',
 'overall',
 'reviewText',
 'reviewTime',
 'reviewerID',
 'reviewerName',
 'style',
 'summary',
 'unixReviewTime',
 'verified',
 'vote']

time: 2.01 ms


#### Extract ['asin', 'overall', 'reviewerID'] from dataset

In [16]:
ratings.columns

['asin',
 'image',
 'overall',
 'reviewText',
 'reviewTime',
 'reviewerID',
 'reviewerName',
 'style',
 'summary',
 'unixReviewTime',
 'verified',
 'vote']

time: 3.27 ms


In [28]:
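# Keep only the three columns needed for collaborative filtering.
# Note: this selects from the unfiltered `ratings`; use `clean_ratings` instead
# to keep only reviews that received votes.
product_ratings = ratings.select('asin', 'overall', 'reviewerID')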

time: 5.38 ms


In [30]:
product_ratings.show()

+----------+-------+--------------+
|      asin|overall|    reviewerID|
+----------+-------+--------------+
|0871167042|    5.0|A2IC3NZN488KWK|
|0871167042|    4.0|A3OT9BYASFGU2X|
|0871167042|    5.0|A28GK1G2KDXHRP|
|0871167042|    5.0|A3NFXFEKW8OK0E|
|0871167042|    5.0|A3I6G5TKBVJEK9|
|0871167042|    5.0|A1A7Y1M8AJWNZ8|
|0871167042|    5.0|A30FG02C424EJ5|
|0871167042|    5.0| ADQQYU1UCDEWB|
|0871167042|    5.0|A39YL2NXZORK56|
|0871167042|    5.0|A2PRY50ZESF1MH|
|0871167042|    5.0|A2G9GWQEWWNQUB|
|0871167042|    4.0|A3RGH15H17SM1Z|
|0871167042|    3.0|A20QJNRKLJVP1E|
|0871167042|    5.0|A1G26EYQGW3YF1|
|0871167042|    4.0|A2JGAZF2Y2BDU6|
|0871167042|    5.0|A3NI5OGW35SLY2|
|0871167042|    5.0|A1OPRA4NE56EV6|
|0871167042|    4.0|A3M6UXIK7XTA7A|
|0871167042|    5.0|A3I3B5OSB80ZXC|
|0871167042|    5.0| A62O7C5RQB353|
+----------+-------+--------------+
only showing top 20 rows

time: 36.3 ms


In [31]:
type(product_ratings)

pyspark.sql.dataframe.DataFrame

time: 1.85 ms


#### Convert pyspark.sql.dataframe.DataFrame to Pandas dataframe

In [32]:
# rating_df = product_ratings.toPandas()

time: 512 µs


- Write the three columns to CSV. Spark writes a directory of part files rather than a single file.

In [33]:
product_ratings.write.csv("./asin_overall_reviewerID_with_voted_review.csv")

time: 2min 27s


In [34]:
!ls -al ./

total 3471284
drwxrwxr-x  6 ec2-user ec2-user       4096 May 14 18:15 .
drwxrwxr-x 10 ec2-user ec2-user       4096 May 14 17:30 ..
-rw-rw-r--  1 ec2-user ec2-user      31494 May 14 18:15 0504-JH-DSE-260-Capstone.ipynb
drwxrwxr-x  2 ec2-user ec2-user       4096 May 14 18:14 asin_overall_reviewerID_with_voted_review.csv
-rw-rw-r--  1 ec2-user ec2-user 3554445765 May 14 07:16 Clothing_Shoes_and_Jewelry.json.gz
drwxrwxr-x  2 ec2-user ec2-user       4096 May 14 07:14 .ipynb_checkpoints
-rw-rw-r--  1 ec2-user ec2-user      84739 May 14 02:03 JH-DSE-260-Capstone-ALS-2.ipynb
drwxrwxr-x  3 ec2-user ec2-user       4096 May 14 17:36 JH-gluon_recommender_system
drwxrwxr-x  2 ec2-user ec2-user       4096 May 14 07:18 spark-warehouse
time: 115 ms


#### Load dataset 
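
Because Spark writes a directory of `part-*.csv` files, reading a single part file may load only a slice of the data. The sketch below shows one way to read every part file with pandas. The directory and file names are invented stand-ins built in a temporary folder, not the real output of this notebook.

```python
import glob
import os
import tempfile

import pandas as pd

# Build a tiny stand-in for a Spark CSV output directory (hypothetical file names)
out_dir = os.path.join(tempfile.mkdtemp(), "ratings.csv")
os.makedirs(out_dir)
with open(os.path.join(out_dir, "part-00000.csv"), "w") as f:
    f.write("0871167042,5.0,A2IC3NZN488KWK\n")
with open(os.path.join(out_dir, "part-00001.csv"), "w") as f:
    f.write("0871167042,4.0,A3OT9BYASFGU2X\n")

# Read every part file and stack them into one DataFrame
columns = ["asin", "overall", "reviewerID"]
part_files = sorted(glob.glob(os.path.join(out_dir, "part-*.csv")))
all_ratings = pd.concat(
    [pd.read_csv(p, names=columns, dtype={"asin": str}) for p in part_files],
    ignore_index=True,
)
print(all_ratings)
```

Reading `asin` as a string matters here: an all-digit ASIN such as `0871167042` would otherwise be parsed as a number and lose its leading zero.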

In [None]:
# Read asin as a string so that all-digit ASINs keep their leading zeros
rating_df = pd.read_csv('./data/asin_overall_reviewerID.csv/part-00000-6ef94642-3c25-4f7d-ade9-981f91953b81-c000.csv',
                        names=['asin', 'overall', 'reviewerID'],
                        dtype={'asin': str})

In [None]:
rating_df.head(n=10)

In [None]:
rating_df['overall'].value_counts()

In [None]:
# rating_df.groupby(['reviewerID', 'asin']).sum()

- Drop duplicated records.
- Sum `overall` for each (reviewerID, asin) pair.
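
A toy example may help show what these two steps do when a reviewer has several rows for the same product. The product and reviewer IDs below (`P1`, `U1`, ...) are placeholders, not values from the dataset.

```python
import pandas as pd

# Toy ratings (hypothetical values): one exact duplicate and one re-rating
toy = pd.DataFrame({
    "asin": ["P1", "P1", "P1", "P2"],
    "overall": [5.0, 5.0, 3.0, 4.0],
    "reviewerID": ["U1", "U1", "U1", "U2"],
})

deduped = toy.drop_duplicates()  # removes the repeated (P1, 5.0, U1) row
grouped = deduped.groupby(["reviewerID", "asin"]).sum().reset_index()
print(grouped)
```

After grouping, `overall` is better read as an interaction strength than as a star rating: a reviewer who rated the same product twice with different scores ends up with their sum, which can exceed 5.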

In [None]:
rating_df = rating_df.drop_duplicates()
grouped_df = rating_df.groupby(['reviewerID', 'asin']).sum().reset_index()
grouped_df.head(10)

In [None]:
grouped_df.dtypes

In [None]:
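# Encode reviewer and product IDs as integer category codes
grouped_df['reviewerID_encode'] = grouped_df['reviewerID'].astype("category")
grouped_df['asin_encode'] = grouped_df['asin'].astype("category")
grouped_df['reviewerID_encode'] = grouped_df['reviewerID_encode'].cat.codes
grouped_df['asin_encode'] = grouped_df['asin_encode'].cat.codes
grouped_df = grouped_df[['reviewerID','reviewerID_encode', 'asin', 'asin_encode', 'overall']]

# Item-by-user matrix: rows are products, columns are reviewers
sparse_content_person = sparse.csr_matrix(
    (grouped_df['overall'].astype(float), 
    (grouped_df['asin_encode'], grouped_df['reviewerID_encode']))
)

# User-by-item matrix: rows are reviewers, columns are products
sparse_person_content = sparse.csr_matrix(
    (grouped_df['overall'].astype(float), 
    (grouped_df['reviewerID_encode'], grouped_df['asin_encode']))
)

# use_gpu=True requires a CUDA-enabled build of implicit; set it to False to train on CPU
model = implicit.als.AlternatingLeastSquares(
    factors=20, 
    regularization=0.1, 
    iterations=50, 
    use_gpu=True)

# Scale the summed ratings by alpha to obtain confidence weights
alpha = 15
data = (sparse_content_person * alpha).astype('double')

# This passes an item-by-user matrix, which older implicit releases expect.
# Check the installed version: implicit 0.5 and later expect a user-by-item matrix here.
model.fit(data)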

In [None]:
grouped_df

- `asin` and `reviewerID` are label-encoded as integer category codes.
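
The following sketch shows, on made-up IDs, how category codes turn string IDs into row and column indices of a sparse matrix, and how a code can be mapped back to its original ID. The names `toy`, `person_content` and `code_to_asin` are introduced for this illustration only.

```python
import pandas as pd
import scipy.sparse as sparse

# Hypothetical reviewers (U1, U2) and products (P3, P9)
toy = pd.DataFrame({
    "reviewerID": ["U2", "U1", "U1"],
    "asin": ["P9", "P9", "P3"],
    "overall": [4.0, 5.0, 3.0],
})

# Category codes are assigned in sorted order of the distinct values
reviewer_cat = toy["reviewerID"].astype("category")
asin_cat = toy["asin"].astype("category")
toy["reviewerID_encode"] = reviewer_cat.cat.codes
toy["asin_encode"] = asin_cat.cat.codes

# Rows are reviewers, columns are products
person_content = sparse.csr_matrix(
    (toy["overall"].to_numpy(),
     (toy["reviewerID_encode"].to_numpy(), toy["asin_encode"].to_numpy()))
)

# The category index maps a code back to the original ASIN
code_to_asin = dict(enumerate(asin_cat.cat.categories))
print(person_content.toarray())
print(code_to_asin)
```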

### Recommend ASIN(Products) based on product
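
Product-to-product recommendation in this section amounts to cosine similarity between item factor vectors. Here is a minimal NumPy sketch of that idea on invented factors; `most_similar` is a hypothetical helper, not part of `implicit`.

```python
import numpy as np

# Hypothetical item factors: 4 products, 2 latent dimensions
item_factors = np.array([
    [1.0, 0.0],
    [2.0, 0.0],   # same direction as product 0
    [0.0, 1.0],   # orthogonal to product 0
    [1.0, 1.0],
])

def most_similar(item_factors, item_id, n=2):
    """Return (index, cosine similarity) pairs for the n closest products."""
    norms = np.linalg.norm(item_factors, axis=1)
    scores = item_factors.dot(item_factors[item_id]) / (norms * norms[item_id])
    top = np.argpartition(scores, -n)[-n:]  # unordered indices of the n largest scores
    return sorted(zip(top.tolist(), scores[top].tolist()), key=lambda x: -x[1])

similar = most_similar(item_factors, item_id=0, n=2)
print(similar)
```

The query product always appears in its own result list with a similarity of 1, so in practice one usually asks for `n + 1` results and drops the query itself.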

In [None]:
asin='B00NX2IHS4'

asin_encode = grouped_df.loc[grouped_df['asin'] == asin].iloc[0].asin_encode
print("Convert asin: %s to encoded asin: %d" %(asin, asin_encode))

In [None]:
n_similar = 20

person_vecs = model.user_factors
content_vecs = model.item_factors

content_norms = np.sqrt((content_vecs * content_vecs).sum(axis=1))

scores = content_vecs.dot(content_vecs[asin_encode]) / content_norms
top_idx = np.argpartition(scores, -n_similar)[-n_similar:]
similar = sorted(zip(top_idx, scores[top_idx] / content_norms[asin_encode]), key=lambda x: -x[1])

for content in similar:
    idx, score = content
    print("Encoded ASIN: %d" %(idx), 
          "| Similarity Score: %.5f" %(round(score, 5)), 
          "| https://www.amazon.com/dp/"+grouped_df.asin.loc[grouped_df.asin_encode == idx].iloc[0])
#     print("\n")

In [None]:
# grouped_df.loc[grouped_df['person_id'] == 50].sort_values(by=['eventStrength'], ascending=False)[['title', 'person_id', 'eventStrength']].head(10)

In [None]:
grouped_df.asin.loc[grouped_df.asin_encode == 1564263].iloc[0]

In [None]:
n_similar = 20
output_filename = 'product_based_recommend.tsv'

content_vecs = model.item_factors

# The norms do not change between iterations, so compute them once
content_norms = np.sqrt((content_vecs * content_vecs).sum(axis=1))

# Map each encoded id back to its ASIN once, instead of filtering grouped_df on every lookup
code_to_asin = grouped_df.drop_duplicates('asin_encode').set_index('asin_encode')['asin']

# grouped_df has one row per (reviewer, product) pair, so visit each product only once
asin_encode_list = grouped_df['asin_encode'].unique().tolist()

with tqdm.tqdm(total=len(asin_encode_list)) as progress:
    with codecs.open(output_filename, "w", "utf8") as o:
        for asin_encode in asin_encode_list:
            # Cosine similarity between this product and every product
            scores = content_vecs.dot(content_vecs[asin_encode]) / content_norms
            top_idx = np.argpartition(scores, -n_similar)[-n_similar:]
            similar = sorted(zip(top_idx, scores[top_idx] / content_norms[asin_encode]), key=lambda x: -x[1])

            # Look up the query ASIN directly rather than inferring it from a score of 1.0
            input_asin = code_to_asin[asin_encode]
            for idx, score in similar:
                asin = code_to_asin[idx]
                o.write("%s\t%s\t%.5f\t%s\n" % (input_asin, asin, score, "https://www.amazon.com/dp/" + asin))
            progress.update(1)

### Recommend ASIN(Products) to Persons
- The following function returns the top recommendations (10 by default) for a given person. Products are ranked by the dot product of the person vector and the content vectors, and products the person has already interacted with are excluded.
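
The ranking-and-masking idea can be sketched on a tiny dense example. All factors and interactions below are invented, and `top_unseen` is a hypothetical helper that mirrors the steps of the function defined next rather than replacing it.

```python
import numpy as np

# Hypothetical factors: 2 persons and 4 products in a 2-dimensional latent space
person_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
content_vecs = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.5, 0.5]])
# interactions[p, c] > 0 means person p already rated product c
interactions = np.array([[5.0, 0.0, 0.0, 0.0],
                         [0.0, 0.0, 4.0, 0.0]])

def top_unseen(person_id, n=2):
    """Rank products the person has not interacted with by predicted score."""
    scores = person_vecs[person_id].dot(content_vecs.T)
    # Min-max scale to [0, 1]; assumes the scores are not all identical
    scaled = (scores - scores.min()) / (scores.max() - scores.min())
    scaled[interactions[person_id] > 0] = 0.0  # suppress products already seen
    order = np.argsort(scaled)[::-1][:n]
    return [(int(i), float(scaled[i])) for i in order]

recs = top_unseen(0)
print(recs)
```

Person 0 has already rated product 0, so it is excluded even though it has the highest raw score.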

In [None]:
def recommend(person_id, sparse_person_content, person_vecs, content_vecs, num_contents=10):
    # Get the interaction scores for this person from the sparse person-content matrix
    person_interactions = sparse_person_content[person_id,:].toarray()
    # Add 1 to everything, so that products with no interaction yet become equal to 1
    person_interactions = person_interactions.reshape(-1) + 1
    # Set products the person already interacted with to zero
    person_interactions[person_interactions > 1] = 0
    # Get dot product of the person vector and all content vectors
    rec_vector = person_vecs[person_id,:].dot(content_vecs.T).toarray()
    
    # Scale this recommendation vector between 0 and 1
    min_max = MinMaxScaler()
    rec_vector_scaled = min_max.fit_transform(rec_vector.reshape(-1,1))[:,0]
    # Content already interacted have their recommendation multiplied by zero
    recommend_vector = person_interactions * rec_vector_scaled
    # Sort the indices of the content into order of best recommendations
    content_idx = np.argsort(recommend_vector)[::-1][:num_contents]
    
    # Start empty lists to store product URLs and scores
    asin_list = []
    scores = []

    for idx in content_idx:
        # Append the product URL and its score to the lists
        asin_list.append("https://www.amazon.com/dp/"+grouped_df.asin.loc[grouped_df.asin_encode == idx].iloc[0])
        scores.append(recommend_vector[idx])

    recommendations = pd.DataFrame({'ASIN': asin_list, 'SCORE': scores})

    return recommendations
    


In [None]:
# Create recommendations for person
reviewerID="A0000040I1OM9N4SGBD8"
reviewerID_encode = grouped_df.loc[grouped_df['reviewerID'] == reviewerID].iloc[0].reviewerID_encode
print("Convert reviewerID: %s to encoded reviewerID: %d" %(reviewerID, reviewerID_encode))

In [None]:
# Get the trained person and content vectors. We convert them to csr matrices
person_vecs = sparse.csr_matrix(model.user_factors)
content_vecs = sparse.csr_matrix(model.item_factors)

person_id = reviewerID_encode

recommendations = recommend(person_id, sparse_person_content, person_vecs, content_vecs)

print("\n** Recommended list for reviewer:", reviewerID)
print()
print(recommendations)

#### Here we have top recommendations for reviewerID="A0000040I1OM9N4SGBD8". 


In [None]:
grouped_df.loc[grouped_df['reviewerID'] == 'A0000040I1OM9N4SGBD8'].sort_values(by=['overall'], ascending=False)[['asin', 'reviewerID', 'overall']]

## Evaluating the Recommender System
- https://nbviewer.jupyter.org/github/jmsteinw/Notebooks/blob/master/RecEngine_NB.ipynb
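
The evaluation hides a fraction of the known interactions from the training matrix and then checks, via AUC, whether the model ranks the hidden items above items the user never touched. Below is a small self-contained sketch of both pieces using a dense toy matrix. `mask_interactions` and `auc` are hypothetical helpers written for this illustration; the cells that follow use sparse matrices and scikit-learn instead.

```python
import random

import numpy as np

def mask_interactions(matrix, pct_test=0.2, seed=0):
    """Return a training copy with a fraction of the nonzero entries set to zero."""
    train = matrix.copy()
    pairs = list(zip(*np.nonzero(matrix)))
    rng = random.Random(seed)
    hidden = rng.sample(pairs, int(np.ceil(pct_test * len(pairs))))
    for row, col in hidden:
        train[row, col] = 0
    return train, hidden

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical item-by-user interaction matrix
full = np.array([[5, 0, 3],
                 [0, 4, 0],
                 [1, 0, 2],
                 [0, 2, 0],
                 [4, 0, 0]], dtype=float)
train, hidden = mask_interactions(full, pct_test=0.2)

# Both positives score higher than both negatives, so the ranking is perfect
example_auc = auc(scores=[0.9, 0.2, 0.6, 0.1], labels=[1, 0, 1, 0])
print(len(hidden), example_auc)
```

An AUC of 0.5 corresponds to random ranking, which is why the notebook also reports a popularity baseline for comparison.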

In [None]:
import random

def make_train(ratings, pct_test = 0.2):
    test_set = ratings.copy() # Make a copy of the original set to be the test set. 
    test_set[test_set != 0] = 1 # Store the test set as a binary preference matrix
    
    training_set = ratings.copy() # Make a copy of the original data we can alter as our training set. 
    
    nonzero_inds = training_set.nonzero() # Find the indices in the ratings data where an interaction exists
    nonzero_pairs = list(zip(nonzero_inds[0], nonzero_inds[1])) # Zip these pairs together of item,user index into list

    
    random.seed(0) # Set the random seed to zero for reproducibility
    
    num_samples = int(np.ceil(pct_test*len(nonzero_pairs))) # Round the number of samples needed up to the next integer
    samples = random.sample(nonzero_pairs, num_samples) # Sample a random number of item-user pairs without replacement

    content_inds = [index[0] for index in samples] # Get the item row indices

    person_inds = [index[1] for index in samples] # Get the user column indices

    
    training_set[content_inds, person_inds] = 0 # Assign all of the randomly chosen user-item pairs to zero
    training_set.eliminate_zeros() # Get rid of zeros in sparse array storage after update to save space
    
    return training_set, test_set, list(set(person_inds))

In [None]:
content_train, content_test, content_persons_altered = make_train(sparse_content_person, pct_test = 0.2)

In [None]:
def auc_score(predictions, test):
    fpr, tpr, thresholds = metrics.roc_curve(test, predictions)
    return metrics.auc(fpr, tpr)

In [None]:
def calc_mean_auc(training_set, altered_persons, predictions, test_set):
    store_auc = [] # An empty list to store the AUC for each user that had an item removed from the training set
    popularity_auc = [] # To store popular AUC scores
    pop_contents = np.array(test_set.sum(axis = 1)).reshape(-1) # Get sum of item interactions to find most popular
    content_vecs = predictions[1]
    for person in altered_persons: # Iterate through each user that had an item altered
        training_column = training_set[:,person].toarray().reshape(-1) # Get the training set column
        zero_inds = np.where(training_column == 0) # Find where the interaction had not yet occurred
        
        # Get the predicted values based on our user/item vectors
        person_vec = predictions[0][person,:]
        pred = person_vec.dot(content_vecs).toarray()[0,zero_inds].reshape(-1)
        
        # Get only the items that were originally zero
        # Select all ratings from the MF prediction for this user that originally had no interaction
        actual = test_set[:,person].toarray()[zero_inds,0].reshape(-1)
        
        # Select the binarized yes/no interaction pairs from the original full data
        # that align with the same pairs in training 
        pop = pop_contents[zero_inds] # Get the item popularity for our chosen items
        
        store_auc.append(auc_score(pred, actual)) # Calculate AUC for the given user and store
        
        popularity_auc.append(auc_score(pop, actual)) # Calculate AUC using most popular and store
    # End users iteration
    
    return float('%.3f'%np.mean(store_auc)), float('%.3f'%np.mean(popularity_auc))

In [None]:
calc_mean_auc(content_train, content_persons_altered,
              [person_vecs, content_vecs.T], content_test)