# **Bayesian Personalized Ranking (BPR) Recommender on RetailRocket Data**

Authored by: Preet Khowaja

Edits by: Anna Dai

In this notebook we train a BPR model to recommend a ranked list of items to each user in the dataset. We use RetailRocket's e-commerce data to train the model.

The BPR implementation used is from the `cornac` Python package.

In [17]:
# # Install dependencies
# # AD NOTE: recommenders only works on Python 3.9 or lower; if the cornac install gets stuck, it can be installed through conda

# !pip install cornac
# !pip install papermill
# !pip install scrapbook
# !pip install recommenders 


In [2]:
## Import required packages
import sys
import os
import cornac
import papermill as pm
import scrapbook as sb
import pandas as pd
import numpy as np

from recommenders.datasets.python_splitters import python_random_split
from recommenders.evaluation.python_evaluation import map_at_k, ndcg_at_k, precision_at_k, recall_at_k
from recommenders.models.cornac.cornac_utils import predict_ranking
from recommenders.utils.timer import Timer
from recommenders.utils.constants import SEED

## check what version of cornac is available
print("System version: {}".format(sys.version))
print("Cornac version: {}".format(cornac.__version__))

System version: 3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:45:29) 
[GCC 10.4.0]
Cornac version: 1.14.2


## Loading RetailRocket Data

1. Mount this notebook to Google Drive

2. Download the RetailRocket data available [here](https://www.kaggle.com/datasets/retailrocket/ecommerce-dataset) and save the **zipped** events.csv file somewhere in your Drive

3. Change the command below to reflect the path where your zipped file is saved. The command unzips the file so that it is only available at run time

In [19]:
# empty cache
!rm -rf /tmp/cornac

In [20]:
## Unzip events file of Retail Rocket and store in local directory
# !unzip drive/My\ Drive/aipi/final_project/RR_events.csv.zip

In [21]:
MOVE_DIR = True
if MOVE_DIR:
    DATA_PATH = "../data/RetailRocket/"
else:
    DATA_PATH = ""

In [22]:
## Read the dataset
df = pd.read_csv(DATA_PATH + 'events.csv')
df.head()

Unnamed: 0,timestamp,visitorid,event,itemid,transactionid
0,1433221332117,257597,view,355908,
1,1433224214164,992329,view,248676,
2,1433221999827,111016,view,318965,
3,1433221955914,483717,view,253185,
4,1433221337106,951259,view,367447,


In [23]:
## The event column shows us the feedback between user and item pairings in the dataset
df.event.value_counts()

view           2664312
addtocart        69332
transaction      22457
Name: event, dtype: int64

In [24]:
## Take a 0.5% random sample because the full dataset is massive
# df = df.sample(n=5000, random_state=45)  # earlier alternative: fixed-size sample
df = df.sample(frac=0.005, random_state=45)

## Adding Ranking information to run BPR

BPR learns from pairwise preferences: for each user, items they interacted with should rank above items they did not. Here we assume that if a user interacted with an item, the pair is rated 1; otherwise, it is rated 0. Our dataset contains only positive feedback, so we generate the negative feedback ourselves.
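To make the pairwise idea concrete, here is a minimal sketch of the BPR loss for a single (user, positive item, negative item) triple, assuming a matrix-factorization scorer as in Rendle et al. The function name `bpr_triple_loss` and the toy vectors are hypothetical, and this is not cornac's implementation.

```python
import numpy as np

def bpr_triple_loss(p_u, q_i, q_j, lambda_reg=0.001):
    """BPR loss for one triple: user factors p_u, positive item q_i, negative item q_j."""
    # Score difference between the positive and the negative item
    x_uij = p_u @ q_i - p_u @ q_j
    # -log(sigmoid(x)) written as log(1 + exp(-x)) for numerical stability
    loss = np.logaddexp(0.0, -x_uij)
    # L2 regularization on the factors involved in this triple
    reg = lambda_reg * (p_u @ p_u + q_i @ q_i + q_j @ q_j)
    return loss + reg

rng = np.random.default_rng(0)
p_u = rng.normal(size=20)   # hypothetical user factors (k=20)
q_i = p_u.copy()            # item aligned with the user's taste
q_j = -p_u                  # item opposed to the user's taste

loss_good = bpr_triple_loss(p_u, q_i, q_j)  # positive item scores higher
loss_bad = bpr_triple_loss(p_u, q_j, q_i)   # positive item scores lower
print(loss_good < loss_bad)
```

The loss is small when the interacted item outscores the non-interacted one, which is the ordering BPR tries to learn.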

In [25]:
## Assign a score of 1 to user-item interactions that are available
rr_data = df[['visitorid', 'itemid']].copy()
rr_data['Feedback'] = 1
rr_data = rr_data.drop_duplicates()

In [26]:
rr_data.head()

Unnamed: 0,visitorid,itemid,Feedback
2519133,961663,27127,1
2052038,1206986,435421,1
522082,437200,170460,1
2496162,731484,26210,1
2241508,1011436,401802,1


In [27]:
# Rename the columns for consistency 
rr_data.rename(columns = {'visitorid': 'userID', 'itemid': 'itemID', 'Feedback': 'rating'}, inplace = True)

In [28]:
# check size of df in memory
rr_data.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 13723 entries, 2519133 to 2669183
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   userID  13723 non-null  int64
 1   itemID  13723 non-null  int64
 2   rating  13723 non-null  int64
dtypes: int64(3)
memory usage: 428.8 KB


In [29]:
rr_data.nunique()

userID    13066
itemID    11323
rating        1
dtype: int64

In [30]:
# reset user and item ids to run from 0 to n_users-1 and 0 to n_items-1 to save more space
rr_data['userID'] = rr_data['userID'].astype("category")
rr_data['itemID'] = rr_data['itemID'].astype("category")
rr_data['userID'] = rr_data['userID'].cat.codes
rr_data['itemID'] = rr_data['itemID'].cat.codes

In [31]:
# downcast IDs to int16 and ratings to int8
rr_data = rr_data.astype({'userID': 'int16', 'itemID': 'int16', 'rating': 'int8'})

In [32]:
rr_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13723 entries, 2519133 to 2669183
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   userID  13723 non-null  int16
 1   itemID  13723 non-null  int16
 2   rating  13723 non-null  int8 
dtypes: int16(2), int8(1)
memory usage: 174.2 KB
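The re-indexing and downcast above can be sketched on toy data. This is a minimal illustration, assuming only pandas and NumPy; the helper `smallest_int_dtype` is hypothetical and simply checks that the largest code fits the target type before casting.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"userID": [961663, 1206986, 437200, 961663]})

# Category codes are 0-based positions in the sorted list of unique IDs
codes = toy["userID"].astype("category").cat.codes
print(codes.tolist())  # [1, 2, 0, 1]

def smallest_int_dtype(max_value):
    """Return the narrowest signed integer dtype that can hold max_value."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        if max_value <= np.iinfo(dtype).max:
            return np.dtype(dtype)
    raise ValueError("value does not fit in a 64-bit signed integer")

# 13066 users means a largest code of 13065, which fits in int16 (max 32767)
print(smallest_int_dtype(13065))
```

With a larger sample the codes could exceed 32767, in which case a cast to int16 would silently overflow, so a check like this is worth running before downcasting.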


In [33]:
# assert userID and itemID are not negative
assert rr_data.userID.min() >= 0
assert rr_data.itemID.min() >= 0

In [34]:
## Create a list of unique users and unique items from our sample
users = np.array(rr_data['userID'].unique())
items = np.array(rr_data['itemID'].unique())
print("unique users: ", len(users))
print("unique items: ", len(items))


unique users:  13066
unique items:  11323


### AD edit
The nested-loop approach further below (now commented out) was too slow, so I used a meshgrid to generate all user-item pairs instead.

In [35]:
# create df for all user-item pairs using meshgrid
rr_data_all = pd.DataFrame(np.array(np.meshgrid(users, items)).T.reshape(-1,2), columns=['userID', 'itemID'])

In [36]:
rr_data_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 147946318 entries, 0 to 147946317
Data columns (total 2 columns):
 #   Column  Dtype
---  ------  -----
 0   userID  int16
 1   itemID  int16
dtypes: int16(2)
memory usage: 564.4 MB
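On a toy example, the meshgrid expression produces every user-item pair exactly once. The sketch below uses made-up IDs and also compares the result with `pd.MultiIndex.from_product`, which is one alternative way to build the same Cartesian product.

```python
import numpy as np
import pandas as pd

toy_users = np.array([0, 1, 2], dtype="int16")
toy_items = np.array([10, 11], dtype="int16")

# Same expression as in the notebook, applied to toy arrays
pairs = np.array(np.meshgrid(toy_users, toy_items)).T.reshape(-1, 2)
toy_all = pd.DataFrame(pairs, columns=["userID", "itemID"])
print(toy_all)

# Alternative construction of the same product, for comparison
alt = pd.MultiIndex.from_product(
    [toy_users, toy_items], names=["userID", "itemID"]
).to_frame(index=False)
print(toy_all.values.tolist() == alt.values.tolist())
```

The number of rows is `len(users) * len(items)`, which is why the full frame in the notebook reaches 147,946,318 rows for 13,066 users and 11,323 items.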


In [37]:
# ## Adding negative feedback for interactions not present
# interaction_lst = set()
# for user in users:
#     for item in items:
#         interaction_lst.add((user, item))

# ## Dataframe created for all negative feedback, i.e. where user has not interacted with the item
# rr_data_all = pd.DataFrame(data=interaction_lst, columns=["userID", "itemID", "rating"])

In [38]:
# Merge the datasets with positive and negative feedback
# rr_feedback = pd.merge(rr_data_all, rr_data, on=['userID', 'itemID'], how='outer').fillna(0)#.drop('rating_x', axis = 1)

In [39]:
# rr_feedback.head()

In [40]:
# # Cleaning up the column names
# rr_feedback.rename(columns = {'rating_y': 'rating'}, inplace = True)

### AD note
Pandas crashes the kernel when merging in the negative feedback, even with 1% of the data and 4-16 GB of RAM, and even though the meshgrid step itself did not crash.
I switched to PySpark to run the merge, which does not crash but still takes over 60 minutes to run.
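One possible way to avoid the merge altogether is sketched below. It is a swapped-in technique rather than what this notebook does: it assumes user and item IDs are already the 0-based category codes created earlier, fills a dense `int8` label matrix, and flattens it. All names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy positive interactions with 0-based IDs (assumption: IDs are category codes)
toy_pos = pd.DataFrame({"userID": [0, 1, 2], "itemID": [1, 0, 1]})
n_users, n_items = 3, 2

# Start from all zeros (negative feedback), then mark the observed pairs with 1
labels = np.zeros((n_users, n_items), dtype="int8")
labels[toy_pos["userID"].to_numpy(), toy_pos["itemID"].to_numpy()] = 1

# Flatten row by row: each user is repeated once per item
toy_feedback = pd.DataFrame({
    "userID": np.repeat(np.arange(n_users, dtype="int16"), n_items),
    "itemID": np.tile(np.arange(n_items, dtype="int16"), n_users),
    "rating": labels.ravel(),
})
print(toy_feedback)
```

The dense matrix costs one byte per user-item pair, so it is only practical when `n_users * n_items` fits in memory.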

In [41]:
# # PySpark take (requires `import pyspark`)
# spark = pyspark.sql.SparkSession.builder.getOrCreate()
# rr_data_spark = spark.createDataFrame(rr_data)
# rr_data_all_spark = spark.createDataFrame(rr_data_all)
# rr_feedback_spark = rr_data_all_spark.join(rr_data_spark, on=['userID', 'itemID'], how='outer').fillna(0)

In [42]:
import time
tac = time.time()

# merge in to get 0s for negative feedback and 1 for positive
rr_feedback = rr_data_all.merge(rr_data, how='left', on=['userID', 'itemID']).fillna(0)

tic = time.time()
print("Time to create df: ", tic - tac)

Time to create df:  60.1868257522583


In [44]:
# Check how many positive and negative feedback signals we have
# We should have more 0's because the users interact with fewer items 
rr_feedback['rating'].value_counts()

0.0    147932595
1.0        13723
Name: rating, dtype: int64

In [45]:
# quantize to save more space
rr_feedback["rating"] = rr_feedback["rating"].astype("int8")
rr_feedback.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 147946318 entries, 0 to 147946317
Data columns (total 3 columns):
 #   Column  Dtype
---  ------  -----
 0   userID  int16
 1   itemID  int16
 2   rating  int8 
dtypes: int16(2), int8(1)
memory usage: 1.8 GB


In [46]:
# save to csv
rr_feedback.to_csv(DATA_PATH + 'rr_feedback.csv', index=False)

In [47]:
tac = time.time()

# Split data into training and testing
train_rr, test_rr = python_random_split(rr_feedback, 0.7)

tic = time.time()
print("Time to split data: ", tic - tac)

Time to split data:  17.956244230270386
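For intuition, a 70/30 random split can be sketched with plain pandas. This is an illustration on toy data, not the implementation of `python_random_split`; the frame and variable names are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "userID": range(10),
    "itemID": range(10),
    "rating": [1, 0] * 5,
})

# Draw 70% of the rows at random for training; the remaining rows form the test set
toy_train = toy.sample(frac=0.7, random_state=42)
toy_test = toy.drop(toy_train.index)

print(len(toy_train), len(toy_test))  # 7 3
```

Because rows are drawn without regard to user or rating, a user's few positive interactions can land entirely in one of the two sets.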


## Initiate BPR Model and Train on Dataset

In [48]:
## Initiating a BPR model 
bpr = cornac.models.BPR(
    k=20,
    max_iter=30,
    learning_rate=0.01,
    lambda_reg=0.001,
    verbose=True,
    seed=43
)

For a sample of n=5000, the next cell takes about 50 seconds to run and uses a small amount of RAM. On the 0.5% sample used here, it took about 295 seconds.

In [49]:
tac = time.time()

train_set_rr = cornac.data.Dataset.from_uir(train_rr.itertuples(index=False), seed=4747)
print('Number of users: {}'.format(train_set_rr.num_users))
print('Number of items: {}'.format(train_set_rr.num_items))

tic = time.time()
print("Time to build dataset: ", tic - tac)

Number of users: 13066
Number of items: 11323
Time to build dataset:  294.9264464378357
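The dataset is built from (user, item, rating) triples, so the column order of the frame matters. The toy sketch below, using pandas only, shows what `itertuples(index=False)` yields for a frame laid out like `train_rr`.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "userID": [0, 1],
    "itemID": [5, 6],
    "rating": [1, 0],
})

# Each row becomes one tuple in column order: (userID, itemID, rating)
triples = [tuple(row) for row in toy_train.itertuples(index=False)]
print(triples)  # [(0, 5, 1), (1, 6, 0)]
```

If the columns were in a different order, they would need to be reordered before being passed on, for example with `toy_train[["userID", "itemID", "rating"]]`.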


NOTE: the cell below takes around 30 minutes to run on 0.5% of the data

In [50]:
## Training the BPR model on our data
with Timer() as t:
    bpr.fit(train_set_rr)
print("Took {} seconds for training.".format(t))

100%|██████████| 30/30 [30:09<00:00, 60.32s/it, correct=53.72%, skipped=70.00%]

Optimization finished!
Took 1822.7974 seconds for training.





In [51]:
with Timer() as t:
    all_predictions = predict_ranking(bpr, train_rr, usercol='userID', itemcol='itemID', remove_seen = False)
print("Took {} seconds for prediction.".format(t))

Took 92.8688 seconds for prediction.


In [52]:
## Each user-item pair is given a predicted score
## Sorting a user's items by this score, highest first,
## gives the ranked item list for that user
all_predictions.head()

Unnamed: 0,userID,itemID,prediction
0,12101,6511,0.23316
1,12101,3187,-0.071604
2,12101,5292,0.001401
3,12101,10047,0.087307
4,12101,1974,-0.18014
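Turning such scores into a ranked list per user can be sketched with a sort and a group-wise `head`. The example uses toy scores and a hypothetical helper `top_k_per_user`; the column names follow the predictions frame above.

```python
import pandas as pd

toy_preds = pd.DataFrame({
    "userID": [0, 0, 0, 1, 1, 1],
    "itemID": [10, 11, 12, 10, 11, 12],
    "prediction": [0.2, -0.1, 0.5, 0.0, 0.3, 0.1],
})

def top_k_per_user(preds, k):
    """Keep the k highest-scoring items for each user, best first."""
    ranked = preds.sort_values(["userID", "prediction"], ascending=[True, False])
    return ranked.groupby("userID").head(k)

top2 = top_k_per_user(toy_preds, k=2)
print(top2)
```

For user 0 this keeps items 12 and 10, and for user 1 items 11 and 12.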


In [53]:
with Timer() as t:
    ## Evaluation metrics computed for the top k recommendations (k = 10)
    k = 10
    ## Mean Average Precision
    eval_map = map_at_k(test_rr, all_predictions, col_prediction='prediction', k=k)
    ## NDCG
    eval_ndcg = ndcg_at_k(test_rr, all_predictions, col_prediction='prediction', k=k)
    ## Precision
    eval_precision = precision_at_k(test_rr, all_predictions, col_prediction='prediction', k=k)
    ## Recall
    eval_recall = recall_at_k(test_rr, all_predictions, col_prediction='prediction', k=k)

    print("MAP:\t%f" % eval_map,
          "NDCG:\t%f" % eval_ndcg,
          "Precision@K:\t%f" % eval_precision,
          "Recall@K:\t%f" % eval_recall, sep='\n')

print("Took {} seconds for evaluation.".format(t))

MAP:	0.000268
NDCG:	0.207392
Precision@K:	0.209559
Recall@K:	0.000617
Took 1000.0047 seconds for evaluation.
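As a reminder of what two of these metrics measure, here is a minimal single-user sketch of precision@k and recall@k in plain Python. The function name is hypothetical, and the library's functions aggregate over all users and may differ in details.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Precision@k and recall@k for one user's ranked list."""
    top_k = ranked_items[:k]
    hits = len(set(top_k) & set(relevant_items))
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall

# Toy example: 4 ranked items, 3 relevant items, evaluated at k=2
p, r = precision_recall_at_k([12, 10, 11, 13], {10, 13, 14}, k=2)
print(p, r)
```

Here one of the top 2 items is relevant, so precision@2 is 0.5, while only one of the three relevant items is retrieved, so recall@2 is 1/3.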


# **References**

1. Microsoft Recommenders, BPR Deep Dive
https://github.com/microsoft/recommenders/blob/main/examples/02_model_collaborative_filtering/cornac_bpr_deep_dive.ipynb

2. Microsoft Recommenders Preparing Data
https://github.com/microsoft/recommenders/blob/main/examples/01_prepare_data/data_transform.ipynb

3. Aghiles Salah, Quoc-Tuan Truong, Hady W. Lauw; *Cornac: A Comparative Framework for Multimodal Recommender Systems*; Journal of Machine Learning Research 21 (2020) 1-5.
https://dl.acm.org/doi/pdf/10.5555/3455716.3455811

4. Rendle, S., Freudenthaler, C., Gantner, Z. & Schmidt-Thieme, L. (2012). BPR: Bayesian Personalized Ranking from Implicit Feedback. arXiv:1205.2618. Appears in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI 2009).
https://arxiv.org/ftp/arxiv/papers/1205/1205.2618.pdf 