Omar Pineda Jr.
DATA612: Recommender Systems, Summer 2020
CUNY SPS MS Data Science
Project 1: Simple Recommender System for Amazon Healthcare Products

In [72]:
import pandas as pd
from numpy import nan
import random
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import mean_squared_error
from math import sqrt

This simple recommender system predicts how a user would rate a healthcare product based on how they rated other healthcare products. These predicted ratings can be used to suggest new items that a user would probably enjoy and, from a seller's perspective, buy. We sourced this data from Amazon product reviews written between May 1996 and July 2014: http://jmcauley.ucsd.edu/data/amazon/links.html

First, we load our dataset of roughly 3 million reviews (2,982,326 rows) and add column names. The users are identified by unique reviewer IDs and the products are coded with Amazon Standard Identification Numbers (ASINs).

In [20]:
hc = pd.read_csv('ratings_Health_and_Personal_Care.csv', header = None)
hc.head()

In [26]:
hc.shape

(2982326, 3)

In [22]:
hc.columns = ['user', 'product', 'rating', 'time']
hc = hc.drop(columns='time')
hc.head()

,user,product,rating
0,ARMDSTEI0Z7YW,77614992,5.0
1,A3FYN0SZYWN74,615208479,5.0
2,A2J0WRZSAAHUAP,615269990,5.0
3,A38RKP6G5P8J63,615269990,5.0
4,ARENM677YXZKX,615269990,2.0


We then reshaped our dataframe from long to wide format, with a row for each reviewer and a column for each product, producing a user-item matrix. To keep things simple for this project, we only looked at the first 25 entries in the dataset. Unfortunately, that limited our scope so much that each reviewer had a review for only a single product.

In [50]:
hc2 = hc.head(25)
hc3 = hc2.pivot(index = 'user', columns = 'product', values = 'rating')
hc3

user / product,0077614992,0615208479,0615269990,0615315860,0615406394,0615836828,0641710577,0641864507,0681504498,0705394638
A102TGNH1D915Z,,,,,,,,,3.0,
A11O3IHGGJBH67,,,,,,,,5.0,,
A12V35OD8T4ZVP,,,,,5.0,,,,,
A1LBXMFXPT2F7Q,,,,,,,,,1.0,
A1P27BGF8NAI29,,,,,,,,,,5.0
A2BWNU3Z38JEZ7,,,,,,,5.0,,,
A2COS3K6OVGHO8,,,,,5.0,,,,,
A2J0WRZSAAHUAP,,,5.0,,,,,,,
A2KX3GMQY9LS9N,,,,5.0,,,,,,
A2RGVNP1D6LRTA,,,,,,,,,5.0,
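
As a small illustration of the long-to-wide reshape, here is a minimal sketch on a hypothetical toy table (the users `u1`-`u3`, products `p1`-`p3`, and their ratings are made up). It also measures how sparse the resulting user-item matrix is, which is the problem we ran into above.

```python
import pandas as pd

# Hypothetical toy data in the same long format as hc (user, product, rating).
toy_long = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u3"],
    "product": ["p1", "p2", "p1", "p3"],
    "rating": [5.0, 3.0, 4.0, 1.0],
})

# One row per user, one column per product; unrated pairs become NaN.
toy_wide = toy_long.pivot(index="user", columns="product", values="rating")

# Fraction of user-product cells that have no rating.
sparsity = toy_wide.isna().to_numpy().mean()
print(toy_wide)
print(f"sparsity: {sparsity:.2f}")
```

Note that `pivot` raises an error if the same user rated the same product more than once; `pivot_table` with an aggregation function is one way to handle that case.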


Instead, we refilled our table with dummy ratings from 1 to 5 so that a reviewer would have ratings for multiple products. We also included some missing values (NaN) in our matrix. We hope to use the original complete dataset in a future project.

In [83]:
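random_ratings = [1, 2, 3, 4, 5, nan]
hc4 = hc3.copy()
# Overwrite every cell with a random rating from 1-5 or a missing value.
# .loc[row, column] avoids chained assignment, which may write to a temporary copy.
for i in hc4.columns.values:
    for j in hc4.index.values:
        hc4.loc[j, i] = random.choice(random_ratings)
hc4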

user / product,0077614992,0615208479,0615269990,0615315860,0615406394,0615836828,0641710577,0641864507,0681504498,0705394638
A102TGNH1D915Z,2.0,3.0,4.0,5.0,,5.0,4.0,3.0,,3.0
A11O3IHGGJBH67,5.0,4.0,1.0,2.0,5.0,3.0,4.0,4.0,2.0,1.0
A12V35OD8T4ZVP,2.0,2.0,,3.0,1.0,4.0,,4.0,1.0,1.0
A1LBXMFXPT2F7Q,,,3.0,,,5.0,1.0,4.0,3.0,3.0
A1P27BGF8NAI29,,3.0,5.0,4.0,1.0,2.0,5.0,5.0,5.0,4.0
A2BWNU3Z38JEZ7,2.0,5.0,5.0,2.0,4.0,3.0,3.0,3.0,5.0,
A2COS3K6OVGHO8,,2.0,,5.0,2.0,2.0,4.0,1.0,1.0,4.0
A2J0WRZSAAHUAP,2.0,,,,3.0,3.0,2.0,3.0,4.0,5.0
A2KX3GMQY9LS9N,1.0,5.0,2.0,2.0,3.0,2.0,5.0,1.0,2.0,
A2RGVNP1D6LRTA,3.0,2.0,4.0,5.0,2.0,5.0,2.0,5.0,5.0,
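
Because the dummy ratings are random, rerunning the cell gives a different matrix each time. A minimal sketch of a seeded alternative follows, assuming NumPy's random generator instead of the `random` module; the helper name `make_dummy_ratings`, the labels, and the seed are hypothetical.

```python
import numpy as np
import pandas as pd

def make_dummy_ratings(users, products, seed):
    """Return a users x products frame of random ratings in 1-5 with some NaNs."""
    rng = np.random.default_rng(seed)
    choices = np.array([1, 2, 3, 4, 5, np.nan])  # each value equally likely
    values = rng.choice(choices, size=(len(users), len(products)))
    return pd.DataFrame(values, index=users, columns=products)

# Hypothetical labels; the seed 612 is an arbitrary choice.
dummy_a = make_dummy_ratings(["u1", "u2", "u3", "u4"], ["p1", "p2", "p3"], seed=612)
dummy_b = make_dummy_ratings(["u1", "u2", "u3", "u4"], ["p1", "p2", "p3"], seed=612)
print(dummy_a)
print(dummy_a.equals(dummy_b))  # same seed, same matrix
```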


Next, we split our data into training and test sets by reviewer. 18 reviewers went into our training set and 7 went into our test set.

In [84]:
hc_train, hc_test = tts(hc4)

In [85]:
hc_train.shape

(18, 10)

In [86]:
hc_test.shape

(7, 10)
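
The 18/7 split follows from the default `test_size` of 0.25 applied to 25 rows. The sketch below reproduces those sizes on a hypothetical placeholder frame and passes `random_state` (an arbitrary value, not something we set above) so that the same reviewers land in each set on every run.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for hc4: 25 users by 10 products of placeholder values.
toy = pd.DataFrame(np.arange(250, dtype=float).reshape(25, 10))

# Default test_size is 0.25; random_state fixes the shuffle.
toy_train, toy_test = train_test_split(toy, random_state=0)
print(toy_train.shape, toy_test.shape)
```

Since whole rows are split, every reviewer in the test set is absent from the training set.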

We found that the average rating in our training dataset was 3.11, close to the midpoint of the 1-5 scale. Since these are randomly generated dummy ratings, this reflects the high ratings balancing out the low ratings rather than how the products are actually perceived.

In [89]:
avg_rating = hc_train.sum(numeric_only=True).sum() / hc_train.count().sum()
avg_rating

3.108843537414966
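
The raw average divides the sum of all observed ratings by the number of observed ratings, so missing cells are ignored rather than counted as zeros. A minimal sketch on a hypothetical 2x3 matrix shows that this matches NumPy's `nanmean`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy user-item matrix with two missing ratings.
toy = pd.DataFrame(
    {"p1": [5.0, np.nan], "p2": [np.nan, 1.0], "p3": [3.0, 4.0]},
    index=["u1", "u2"],
)

total = toy.sum(numeric_only=True).sum()   # sum of observed ratings
n_rated = toy.count().sum()                # number of observed ratings
avg_by_hand = total / n_rated
avg_nanmean = np.nanmean(toy.to_numpy())   # same quantity in one call
print(avg_by_hand, avg_nanmean)
```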

Next, we calculated the RMSE for our training dataset using the overall raw average as the predicted values, and it was 1.37. Below is the user-item matrix with the calculated squared errors.

In [109]:
# Squared error of the raw-average prediction for each rated cell; unrated cells stay NaN.
hc_train2 = (hc_train - avg_rating) ** 2
hc_train2

user / product,0077614992,0615208479,0615269990,0615315860,0615406394,0615836828,0641710577,0641864507,0681504498,0705394638
A34IX57PYIKFFF,,0.79416,4.447221,0.011847,0.011847,3.576473,4.447221,0.79416,4.447221,1.229534
A2BWNU3Z38JEZ7,1.229534,3.576473,3.576473,1.229534,0.79416,0.011847,0.011847,0.011847,3.576473,
A1LBXMFXPT2F7Q,,,0.011847,,,3.576473,4.447221,0.79416,0.011847,0.011847
A1P27BGF8NAI29,,0.011847,3.576473,0.79416,4.447221,1.229534,3.576473,3.576473,3.576473,0.79416
ARENM677YXZKX,,4.447221,,,0.011847,3.576473,,4.447221,1.229534,1.229534
A35G5VLYZIDBAU,1.229534,0.011847,0.011847,,1.229534,1.229534,0.79416,4.447221,4.447221,0.79416
AJCPRB73A2EPV,,4.447221,4.447221,0.011847,,,0.011847,4.447221,,0.011847
A12V35OD8T4ZVP,1.229534,1.229534,,0.011847,4.447221,0.79416,,0.79416,4.447221,4.447221
A30CP7L9JPBWX,3.576473,3.576473,3.576473,0.79416,,0.011847,4.447221,0.011847,0.011847,1.229534
A361YMXSRYL4K4,0.011847,0.79416,4.447221,0.79416,,,0.011847,,4.447221,0.79416


In [111]:
train_rmse = sqrt(hc_train2.sum(numeric_only=True).sum() / hc_train2.count().sum())
train_rmse

1.3708791146370387
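
The two steps above (squared errors, then the square root of their mean over rated cells) can be wrapped in one small function. This is a sketch on a hypothetical toy matrix; the helper name `rmse` is ours and is not part of pandas or scikit-learn.

```python
import numpy as np
import pandas as pd

def rmse(actual, predicted):
    """Root mean squared error over the cells where a rating exists."""
    squared_errors = (actual - predicted) ** 2
    return float(np.sqrt(np.nanmean(squared_errors.to_numpy())))

# Hypothetical toy user-item matrix; its raw average is 3.25.
toy = pd.DataFrame(
    {"p1": [5.0, np.nan], "p2": [np.nan, 1.0], "p3": [3.0, 4.0]},
    index=["u1", "u2"],
)
toy_avg = np.nanmean(toy.to_numpy())
toy_rmse = rmse(toy, toy_avg)  # predict the raw average for every cell
print(toy_rmse)
```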

We also calculated the RMSE for our test dataset and found it to be 1.55. Below is the user-item matrix with the squared errors.

In [92]:
# Squared error on the test set, still predicting the training raw average.
hc_test2 = (hc_test - avg_rating) ** 2
hc_test2

user / product,0077614992,0615208479,0615269990,0615315860,0615406394,0615836828,0641710577,0641864507,0681504498,0705394638
A42VPA76SE1U3,,4.447221,,4.447221,3.576473,0.79416,0.79416,1.229534,0.79416,3.576473
ARMDSTEI0Z7YW,,0.011847,4.447221,4.447221,0.79416,3.576473,4.447221,4.447221,3.576473,4.447221
A11O3IHGGJBH67,3.576473,0.79416,4.447221,1.229534,3.576473,0.011847,0.79416,0.79416,1.229534,4.447221
A36IT4R58E2F10,0.011847,1.229534,1.229534,4.447221,,0.79416,3.576473,3.576473,,
A2KX3GMQY9LS9N,4.447221,3.576473,1.229534,1.229534,0.011847,1.229534,3.576473,4.447221,1.229534,
A3NNL2LPM66ZH5,4.447221,3.576473,,,3.576473,0.79416,1.229534,3.576473,,0.011847
AMBJQQSRCAOHS,,3.576473,3.576473,0.79416,0.79416,0.011847,0.79416,0.79416,4.447221,3.576473


In [95]:
test_rmse = sqrt(hc_test2.sum(numeric_only=True).sum() / hc_test2.count().sum())
test_rmse

1.552336448432413

Next, we take a different approach to predicting ratings by accounting for user and item biases. A user's bias is their average rating across all the items they rated minus the overall average rating for our user-product matrix. In the training set, user AJCPRB73A2EPV is the harshest rater (bias -1.11) and user ANLSE84SL6HWI is the most lenient (bias +0.99).

In [104]:
user_train_bias = hc_train.mean(axis=1) - avg_rating
user_train_bias

user
A34IX57PYIKFFF   -0.442177
A2BWNU3Z38JEZ7    0.446712
A1LBXMFXPT2F7Q    0.057823
A1P27BGF8NAI29    0.668934
ARENM677YXZKX    -0.775510
A35G5VLYZIDBAU   -0.664399
AJCPRB73A2EPV    -1.108844
A12V35OD8T4ZVP   -0.858844
A30CP7L9JPBWX     0.335601
A361YMXSRYL4K4   -0.251701
ANLSE84SL6HWI     0.991156
A2J0WRZSAAHUAP    0.034014
AVFXCA5AW8I7F     0.002268
A2COS3K6OVGHO8   -0.483844
A38RKP6G5P8J63    0.191156
A102TGNH1D915Z    0.516156
A3FYN0SZYWN74    -0.108844
A2RGVNP1D6LRTA    0.557823
dtype: float64

We looked at bias in our healthcare products the same way. Product 0615406394 has the lowest average rating (bias -0.65) and product 0615315860 has the highest (bias +0.61).

In [105]:
product_train_bias = hc_train.mean(axis=0) - avg_rating
product_train_bias

product
0077614992   -0.025510
0615208479   -0.233844
0615269990    0.462585
0615315860    0.605442
0615406394   -0.654298
0615836828    0.491156
0641710577    0.016156
0641864507   -0.285314
0681504498   -0.358844
0705394638   -0.046344
dtype: float64
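
To make the two bias calculations concrete, here is a minimal sketch on a hypothetical 2x3 matrix. It also uses `idxmin` and `idxmax` to pick out the harshest and most lenient user instead of reading them off the output by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical toy user-item matrix; its raw average is 3.25.
toy = pd.DataFrame(
    {"p1": [5.0, np.nan], "p2": [np.nan, 1.0], "p3": [3.0, 4.0]},
    index=["u1", "u2"],
)
toy_avg = np.nanmean(toy.to_numpy())

# mean() skips NaN by default, so each bias uses only the observed ratings.
toy_user_bias = toy.mean(axis=1) - toy_avg      # one value per user (row)
toy_product_bias = toy.mean(axis=0) - toy_avg   # one value per product (column)

harshest_user = toy_user_bias.idxmin()
most_lenient_user = toy_user_bias.idxmax()
print(toy_user_bias, toy_product_bias, harshest_user, most_lenient_user, sep="\n")
```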

We then calculated baseline predictions for our user-item matrix from the raw average and the corresponding user and product biases:

baseline prediction = raw average + user bias + product bias

Below is the user-item matrix with the baseline predictions.

In [106]:
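hc_train_baseline_pred = hc_train.copy()
# Fill every cell, rated or not, with raw average + user bias + product bias.
# .loc[row, column] avoids chained assignment, which may write to a temporary copy.
for i in hc_train_baseline_pred.columns.values:
    for j in hc_train_baseline_pred.index.values:
        hc_train_baseline_pred.loc[j, i] = avg_rating + user_train_bias[j] + product_train_bias[i]
hc_train_baseline_pred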

user / product,0077614992,0615208479,0615269990,0615315860,0615406394,0615836828,0641710577,0641864507,0681504498,0705394638
A34IX57PYIKFFF,2.641156,2.432823,3.129252,3.272109,2.012369,3.157823,2.682823,2.381353,2.307823,2.620323
A2BWNU3Z38JEZ7,3.530045,3.321712,4.018141,4.160998,2.901257,4.046712,3.571712,3.270241,3.196712,3.509212
A1LBXMFXPT2F7Q,3.141156,2.932823,3.629252,3.772109,2.512369,3.657823,3.182823,2.881353,2.807823,3.120323
A1P27BGF8NAI29,3.752268,3.543934,4.240363,4.38322,3.12348,4.268934,3.793934,3.492464,3.418934,3.731434
ARENM677YXZKX,2.307823,2.09949,2.795918,2.938776,1.679035,2.82449,2.34949,2.048019,1.97449,2.28699
A35G5VLYZIDBAU,2.418934,2.210601,2.907029,3.049887,1.790146,2.935601,2.460601,2.15913,2.085601,2.398101
AJCPRB73A2EPV,1.97449,1.766156,2.462585,2.605442,1.345702,2.491156,2.016156,1.714686,1.641156,1.953656
A12V35OD8T4ZVP,2.22449,2.016156,2.712585,2.855442,1.595702,2.741156,2.266156,1.964686,1.891156,2.203656
A30CP7L9JPBWX,3.418934,3.210601,3.907029,4.049887,2.790146,3.935601,3.460601,3.15913,3.085601,3.398101
A361YMXSRYL4K4,2.831633,2.623299,3.319728,3.462585,2.202845,3.348299,2.873299,2.571829,2.498299,2.810799
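
As a check on the first cell of the table, the sketch below adds up the raw average and the two biases reported earlier for user A34IX57PYIKFFF and product 0077614992. It then shows one way to build the whole prediction matrix without loops, on a hypothetical toy matrix, using `np.add.outer` to add every user bias to every product bias.

```python
import numpy as np
import pandas as pd

# Values copied from the outputs earlier in this notebook.
avg_rating_reported = 3.108843537414966
user_bias_reported = -0.442177      # user A34IX57PYIKFFF
product_bias_reported = -0.025510   # product 0077614992
example_pred = avg_rating_reported + user_bias_reported + product_bias_reported
print(round(example_pred, 4))  # agrees with the first table entry after rounding

# Hypothetical toy user-item matrix; its raw average is 3.25.
toy = pd.DataFrame(
    {"p1": [5.0, np.nan], "p2": [np.nan, 1.0], "p3": [3.0, 4.0]},
    index=["u1", "u2"],
)
toy_avg = np.nanmean(toy.to_numpy())
toy_user_bias = toy.mean(axis=1) - toy_avg
toy_product_bias = toy.mean(axis=0) - toy_avg

# Outer sum gives a (users x products) grid of user bias + product bias.
toy_pred = pd.DataFrame(
    toy_avg + np.add.outer(toy_user_bias.to_numpy(), toy_product_bias.to_numpy()),
    index=toy.index,
    columns=toy.columns,
)
print(toy_pred)
```

In the toy output some predictions fall outside the 1-5 scale. Clipping them to that range is a common option, though it is not something we do in this notebook.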


Finally, we calculated the training RMSE using the baseline predictions that account for bias. It went down to 1.20, compared to 1.37 when we used the raw average as the prediction.

In [108]:
# Squared error of the baseline prediction for each rated cell; unrated cells stay NaN.
hc_train3 = (hc_train - hc_train_baseline_pred) ** 2
hc_train3

user / product,0077614992,0615208479,0615269990,0615315860,0615406394,0615836828,0641710577,0641864507,0681504498,0705394638
A34IX57PYIKFFF,,2.456043,4.533713,0.074043,0.975416,3.393616,2.831894,2.62002,1.710401,0.384801
A2BWNU3Z38JEZ7,2.341039,2.816651,0.964048,4.669911,1.207235,1.095606,0.326855,0.07303,3.251848,
A1LBXMFXPT2F7Q,,,0.395958,,,1.801439,4.764717,1.251372,0.036932,0.014478
A1P27BGF8NAI29,,0.295864,0.577049,0.146858,4.509166,5.148063,1.454595,2.272666,2.499769,0.072128
ARENM677YXZKX,,1.208878,,,1.744948,4.732845,,1.098344,0.000651,0.082363
A35G5VLYZIDBAU,0.175506,0.623151,0.008644,,0.044039,0.875349,2.36975,1.343583,1.178529,2.566081
AJCPRB73A2EPV,,0.586996,2.139155,0.155676,,,0.967948,0.510776,,1.094835
A12V35OD8T4ZVP,0.050396,0.000261,,0.020897,0.354861,1.584687,,4.142504,0.79416,1.448789
A30CP7L9JPBWX,2.499769,3.201949,1.194585,0.002489,,0.875349,6.054557,0.025322,0.007328,1.954686
A361YMXSRYL4K4,0.028348,1.895305,5.381137,0.288815,,,0.016053,,2.244901,1.414198


In [110]:
train_baseline_pred_rmse = sqrt(hc_train3.sum(numeric_only=True).sum() / hc_train3.count().sum())
train_baseline_pred_rmse

1.2016448287544108
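
The baseline RMSE above is measured on the training set only. Because the split is by reviewer, test users have no training-set user bias. One possible way to score the baseline on held-out users, sketched here on hypothetical toy data, is to treat an unseen user's bias as 0 so the prediction falls back to raw average + product bias. That fallback is our assumption, not a step taken in this notebook, and the helper names are hypothetical.

```python
import numpy as np
import pandas as pd

def baseline_predict(avg, user_bias, product_bias, users, products):
    """avg + user bias + product bias; unseen users or products get bias 0."""
    ub = user_bias.reindex(users).fillna(0.0).to_numpy()
    pb = product_bias.reindex(products).fillna(0.0).to_numpy()
    return pd.DataFrame(avg + np.add.outer(ub, pb), index=users, columns=products)

def rmse(actual, predicted):
    """Root mean squared error over the cells where a rating exists."""
    return float(np.sqrt(np.nanmean(((actual - predicted) ** 2).to_numpy())))

# Hypothetical toy training and test matrices; user u9 appears only in the test set.
train = pd.DataFrame(
    {"p1": [5.0, np.nan], "p2": [np.nan, 1.0], "p3": [3.0, 4.0]},
    index=["u1", "u2"],
)
test = pd.DataFrame({"p1": [4.0], "p2": [2.0], "p3": [np.nan]}, index=["u9"])

# Fit the average and both biases on the training data only.
avg = np.nanmean(train.to_numpy())
user_bias = train.mean(axis=1) - avg
product_bias = train.mean(axis=0) - avg

test_pred = baseline_predict(avg, user_bias, product_bias, test.index, test.columns)
toy_test_rmse = rmse(test, test_pred)
print(test_pred)
print(toy_test_rmse)
```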

To summarize, on the training set the predictions that account for user and product biases had an RMSE of 1.20, while the simpler raw-average predictions had an RMSE of 1.37 (1.55 on the test set). So, on this data, a recommender system that considers bias fits a user's known ratings better than the raw average does; we did not compute a test-set RMSE for the baseline predictor. These predicted ratings can be used to recommend additional items to a user so that they enjoy future purchases and continue using Amazon's retail services.