# **Bayesian Personalized Ranking (BPR) Recommender on Diginetica Data**

Authored by: Preet Khowaja

In this notebook we train a BPR model to recommend a ranked list of items to each user in the dataset. We use Diginetica's e-commerce data to train this model.

The BPR implementation used is from the Python package cornac.

In [None]:
## Install dependencies
!pip install cornac
!pip install papermill
!pip install scrapbook
!pip install recommenders


In [3]:
## Import required packages
import sys
import os
import cornac
import papermill as pm
import scrapbook as sb
import pandas as pd
import numpy as np

from recommenders.datasets.python_splitters import python_random_split
from recommenders.evaluation.python_evaluation import map_at_k, ndcg_at_k, precision_at_k, recall_at_k
from recommenders.models.cornac.cornac_utils import predict_ranking
from recommenders.utils.timer import Timer
from recommenders.utils.constants import SEED

## check what version of cornac is available
print("System version: {}".format(sys.version))
print("Cornac version: {}".format(cornac.__version__))

System version: 3.8.16 (default, Dec  7 2022, 01:12:13) 
[GCC 7.5.0]
Cornac version: 1.14.2


## Loading Diginetica Data

1. Mount Google Drive in this notebook

2. Save the **zipped** Diginetica dataset (`dataset-train-diginetica.zip`) somewhere in your Drive. The data is available [here](https://drive.google.com/drive/folders/0B7XZSACQf0KdXzZFS21DblRxQ3c?resourcekey=0-3k4O5YlwnZf0cNeTZ5Y_Uw)

3. Change the command below to reflect the path where your zipped file is saved. The command unzips the archive into the runtime's working directory, so the extracted files are only available during run-time
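
If you prefer to stay in Python rather than call the shell, the unzip step can be sketched with the standard-library `zipfile` module. The helper name `extract_archive` and the placeholder path in the usage comment are assumptions for illustration, not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir="."):
    """Extract every member of a zip archive and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Hypothetical usage; replace the path with the location of your own zip file:
# extract_archive("/content/drive/MyDrive/<your-folder>/dataset-train-diginetica.zip")
```

The returned list of names plays the same role as the `inflating: ...` lines printed by the shell command.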

In [4]:
## Unzip the Diginetica training archive into the local working directory
!unzip /content/drive/MyDrive/aipi/final_project/dataset-train-diginetica.zip

Archive:  /content/drive/MyDrive/aipi/final_project/dataset-train-diginetica.zip
  inflating: train-clicks.csv        
  inflating: train-purchases.csv     
  inflating: train-item-views.csv    
  inflating: train-queries.csv       
  inflating: products.csv            
  inflating: product-categories.csv  


In [5]:
## Read the dataset
view = pd.read_csv('train-item-views.csv', sep = ';')
purchase = pd.read_csv('train-purchases.csv', sep = ';')

In [12]:
## Drop rows with NA values for userId
view = view[view['userId'].notnull()]
purchase = purchase[purchase['userId'].notnull()]

In [13]:
purchase.head()

Unnamed: 0,sessionId,userId,timeframe,eventdate,ordernumber,itemId
0,150,18278.0,17100868,2016-05-06,16421,25911
2,156,7.0,1721689387,2016-05-27,21173,35324
4,246,34.0,2311046,2016-05-09,16936,34677
7,322,167422.0,1085316,2016-05-25,20849,69167
8,365,87.0,377426862,2016-02-27,3072,10290


In [14]:
view.head()

Unnamed: 0,sessionId,userId,itemId,timeframe,eventdate
164,48,2.0,24764,8863,2016-04-09
165,48,2.0,24764,496381,2016-04-09
166,48,2.0,24764,265216,2016-04-09
167,48,2.0,24764,519975,2016-04-09
168,48,2.0,24764,456437,2016-04-09


In [15]:
## Drop the columns we don't need 
view = view[['userId', 'itemId']]
purchase = purchase[['userId', 'itemId']]

In [16]:
## Join views and purchases into one dataset
events = pd.concat([purchase, view], ignore_index=True, axis=0)
# check the data is correctly merged 
assert(purchase.shape[0] + view.shape[0] == events.shape[0])
     

In [17]:
events.shape

(379694, 2)

In [19]:
## Take a random sample because the full cross product of users and items built later would be too large
df = events.sample(n=10000, random_state=23)

## Adding Ranking information to run BPR

BPR learns from implicit feedback by ranking the items a user has interacted with above the items they have not. Here we assign a rating of 1 to every user-item pair with an observed interaction and 0 to every other pair. Our dataset contains only positive feedback, so we generate the negative feedback ourselves.
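
To make the pairwise idea concrete, here is a minimal NumPy sketch of the BPR objective for a single (user, positive item, negative item) triplet, following the standard formulation of maximising ln σ(x_ui − x_uj). The function name and the toy vectors are assumptions for illustration; this is not how cornac implements training internally.

```python
import numpy as np


def bpr_triplet_loss(user_vec, pos_item_vec, neg_item_vec):
    """Negative log-probability that the user prefers the positive item."""
    x_ui = np.dot(user_vec, pos_item_vec)  # score of the interacted item
    x_uj = np.dot(user_vec, neg_item_vec)  # score of the non-interacted item
    # -log(sigmoid(d)) == log(1 + exp(-d)); logaddexp keeps this numerically stable
    return np.logaddexp(0.0, -(x_ui - x_uj))


rng = np.random.default_rng(0)
user = rng.normal(size=4)

# Positive item aligned with the user, negative item opposed: low loss
loss_good = bpr_triplet_loss(user, user, -user)
# The reverse ordering: high loss
loss_bad = bpr_triplet_loss(user, -user, user)
```

The loss shrinks as the positive item's score rises above the negative item's score, which is what pushes interacted items up the ranking.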

In [21]:
## Assign a score of 1 to user-item interactions that are available
dig_data = df[['userId', 'itemId']].copy()
dig_data['Feedback'] = 1
dig_data = dig_data.drop_duplicates()

In [22]:
dig_data.head()

Unnamed: 0,userId,itemId,Feedback
259496,119134.0,77531,1
356446,211587.0,25493,1
120579,45360.0,159881,1
146451,56425.0,387,1
302801,153085.0,574,1


In [23]:
# Rename the columns for consistency 
dig_data.rename(columns = {'userId': 'userID', 'itemId': 'itemID', 'Feedback': 'rating'}, inplace = True)

In [24]:
## Create a list of unique users and unique items from our sample
users = dig_data['userID'].unique()
items = dig_data['itemID'].unique()

In [25]:
## Build every user-item combination with a rating of 0
interaction_lst = []
for user in users:
    for item in items:
        interaction_lst.append([user, item, 0])

## Dataframe of all user-item pairs, initialised as negative feedback
## Pairs with an observed interaction are set to 1 in the merge below
dig_data_all = pd.DataFrame(data=interaction_lst, columns=["userID", "itemID", "rating"])

In [26]:
## Check if the rating column has 0
dig_data_all.head()

Unnamed: 0,userID,itemID,rating
0,119134.0,77531,0
1,119134.0,25493,0
2,119134.0,159881,0
3,119134.0,387,0
4,119134.0,574,0
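
The nested loop that builds every user-item pair can also be written without an explicit Python loop. The sketch below uses `pd.MultiIndex.from_product`; the helper name `all_user_item_pairs` and the toy inputs are assumptions for illustration.

```python
import pandas as pd


def all_user_item_pairs(users, items):
    """Return one row per (user, item) combination with rating 0."""
    index = pd.MultiIndex.from_product([users, items], names=["userID", "itemID"])
    pairs = index.to_frame(index=False)
    pairs["rating"] = 0
    return pairs


toy = all_user_item_pairs([1.0, 2.0], [10, 20, 30])
```

The rows come out in the same user-major order as the loop, so `all_user_item_pairs(users, items)` should be usable in place of `interaction_lst` plus the `pd.DataFrame` call.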


In [27]:
## Merge the datasets with positive and negative feedback
dig_feedback = pd.merge(dig_data_all, dig_data, on=['userID', 'itemID'], how='outer').fillna(0).drop('rating_x', axis = 1)

In [28]:
# Cleaning up the column names
dig_feedback.rename(columns = {'rating_y': 'rating'}, inplace = True)
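
To see what the outer merge and the `rating_x` / `rating_y` clean-up do, here is the same sequence on a toy example. The frames `negatives` and `positives` are made-up data for illustration only.

```python
import pandas as pd

# All user-item pairs, initialised to 0 (plays the role of dig_data_all)
negatives = pd.DataFrame(
    {"userID": [1, 1, 2, 2], "itemID": [10, 20, 10, 20], "rating": 0}
)
# Observed interactions (plays the role of dig_data)
positives = pd.DataFrame({"userID": [1], "itemID": [20], "rating": [1]})

merged = pd.merge(negatives, positives, on=["userID", "itemID"], how="outer")
# rating_x comes from `negatives` (always 0);
# rating_y comes from `positives` (1 where observed, NaN elsewhere)
feedback = (
    merged.fillna(0)
    .drop("rating_x", axis=1)
    .rename(columns={"rating_y": "rating"})
)
```

Because `rating_y` contains NaN before `fillna(0)`, the final `rating` column is stored as floats, which is why the counts in the next cell are reported for `0.0` and `1.0`.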

In [29]:
## Check how many positive and negative feedback signals we have
## We should have more 0's because the users interact with fewer items 
dig_feedback['rating'].value_counts()

0.0    68530316
1.0        9940
Name: rating, dtype: int64

In [30]:
## Check how many observations
dig_feedback.shape

(68540256, 3)

In [34]:
## Split data into training and testing
train_dig, test_dig = python_random_split(dig_feedback, 0.8)

## Initiate BPR Model and Train on Dataset

In [32]:
## Initiating a BPR model 
bpr = cornac.models.BPR(
    k=20,
    max_iter=30,
    learning_rate=0.01,
    lambda_reg=0.001,
    verbose=True,
    seed=43
)

The next cell takes about 50 seconds to run and uses a modest amount of RAM.

In [37]:
train_set_dig = cornac.data.Dataset.from_uir(train_dig.itertuples(index=False), seed=4747)
print('Number of users: {}'.format(train_set_dig.num_users))
print('Number of items: {}'.format(train_set_dig.num_items))


Number of users: 8928
Number of items: 7677


In [38]:
## Training the BPR model on our data
with Timer() as t:
    bpr.fit(train_set_dig)
print("Took {} seconds for training.".format(t))

  0%|          | 0/30 [00:00<?, ?it/s]

Optimization finished!
Took 996.6799 seconds for training.


In [39]:
with Timer() as t:
    all_predictions = predict_ranking(bpr, train_dig, usercol='userID', itemcol='itemID', remove_seen = False)
print("Took {} seconds for prediction.".format(t))

Took 37.6833 seconds for prediction.


In [40]:
## Each user-item pair is given a prediction score
## A higher score means the model ranks the item higher for that user,
## so sorting a user's items by score gives their ranked list
all_predictions.head()

Unnamed: 0,userID,itemID,prediction
0,83227.0,267053,0.131266
1,83227.0,14108,-0.08225
2,83227.0,139015,0.217529
3,83227.0,25185,0.051923
4,83227.0,252669,-0.077208
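
Turning the scores into a ranked list per user is a sort followed by a per-user cut-off. The sketch below shows this on a toy frame with the same column names as `all_predictions`; the helper name `top_k_per_user` and the toy values are assumptions for illustration.

```python
import pandas as pd


def top_k_per_user(predictions, k):
    """Return the k highest-scoring items for each user, best first."""
    ranked = predictions.sort_values(
        ["userID", "prediction"], ascending=[True, False]
    )
    return ranked.groupby("userID").head(k).reset_index(drop=True)


toy_predictions = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2],
    "itemID": [10, 20, 30, 10, 20],
    "prediction": [0.13, -0.08, 0.22, 0.05, -0.07],
})
top2 = top_k_per_user(toy_predictions, k=2)
```

Calling `top_k_per_user(all_predictions, k=10)` would give, under the same assumptions, the ten items with the highest scores for each user.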


In [41]:
## For the top 10 recommendations, here are the computed evaluation metrics:
k = 10
## Mean Average Precision
eval_map = map_at_k(test_dig, all_predictions, col_prediction='prediction', k=k)
## NDCG
eval_ndcg = ndcg_at_k(test_dig, all_predictions, col_prediction='prediction', k=k)
## Precision
eval_precision = precision_at_k(test_dig, all_predictions, col_prediction='prediction', k=k)
## Recall
eval_recall = recall_at_k(test_dig, all_predictions, col_prediction='prediction', k=k)

print("MAP:\t%f" % eval_map,
      "NDCG:\t%f" % eval_ndcg,
      "Precision@K:\t%f" % eval_precision,
      "Recall@K:\t%f" % eval_recall, sep='\n')

MAP:	0.000246
NDCG:	0.104363
Precision@K:	0.106474
Recall@K:	0.000693
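
As a reading aid for the numbers above, here is a simplified, single-user sketch of precision@k and recall@k. The function name and toy data are assumptions for illustration; the `recommenders` functions used above aggregate over all users and may differ in detail.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Precision@k and recall@k for one user's ranked list."""
    top_k = ranked_items[:k]
    hits = len(set(top_k) & set(relevant_items))
    precision = hits / k  # share of the top k that is relevant
    # share of all relevant items that made it into the top k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall


# One of the top-2 items ("b") is relevant, out of three relevant items overall
p, r = precision_recall_at_k(["a", "b", "c", "d"], {"b", "d", "z"}, k=2)
```

Recall is divided by the number of relevant items, so it stays small whenever a user has far more relevant test items than `k`.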


# **References**

1. Microsoft Recommenders, BPR Deep Dive
https://github.com/microsoft/recommenders/blob/main/examples/02_model_collaborative_filtering/cornac_bpr_deep_dive.ipynb

2. Microsoft Recommenders Preparing Data
https://github.com/microsoft/recommenders/blob/main/examples/01_prepare_data/data_transform.ipynb

3. Aghiles Salah, Quoc-Tuan Truong, Hady W. Lauw; *"Cornac: A Comparative Framework for Multimodal Recommender Systems"*; Journal of Machine Learning Research 21 (2020) 1-5.
https://dl.acm.org/doi/pdf/10.5555/3455716.3455811