# **Bayesian Personalized Ranking (BPR) Recommender on H&M Data**

Authored by:

In this notebook we train a BPR model to recommend a ranked list of items to each user in the dataset. We use H&M's transaction data to train this model.

The BPR implementation used is from the `cornac` Python package.

In [1]:
!pip install cornac
!pip install recommenders

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting cornac
  Downloading cornac-1.14.2-cp38-cp38-manylinux1_x86_64.whl (14.4 MB)
Collecting powerlaw
  Downloading powerlaw-1.5-py3-none-any.whl (24 kB)
Installing collected packages: powerlaw, cornac
Successfully installed cornac-1.14.2 powerlaw-1.5
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting recommenders
  Downloading recommenders-1.1.1-py3-none-any.whl (339 kB)
Collecting retrying>=1.3.3
  Downloading retrying-1.3.4-py3-none-any.whl (11 kB)
Collecting scikit-surprise>=1.0.6
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
Collecting lightfm<2,>=1.15
  Downloading lightfm-1.16.tar.gz (310 kB)

In [2]:
import pandas as pd
import numpy as np

In [6]:
from google.colab import drive
drive.mount('/content/gdrive/')

import sys
sys.path.append('/content/gdrive/MyDrive/recommenders_aipi590')


from Non_DRL_Recommenders.bpr_model import run_bpr_model

Mounted at /content/gdrive/


In [7]:
## Read the dataset
df = pd.read_csv('/content/gdrive/MyDrive/H&M_Dataset/transactions_train.csv')

df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2


In [8]:
## Take a random sample from the full data because it's massive
df = df.sample(n=3000, random_state=40)

## Adding Ranking information to run BPR

BPR learns from pairwise preferences: for each user, an item the user interacted with should rank above an item the user did not interact with. Here we encode an observed user-item interaction as a rating of 1 and every other user-item pair as 0. Our dataset contains only positive feedback, so we generate the negative feedback ourselves.
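
To make the pairwise idea concrete, here is a minimal sketch of the BPR criterion for a single (user, positive item, negative item) triple, assuming a plain matrix-factorization scorer. The function name `bpr_loss` and the `reg` parameter are illustrative and are not part of cornac's API; cornac's own implementation may differ in details such as item biases.

```python
import numpy as np

def bpr_loss(user_vec, pos_item_vec, neg_item_vec, reg=0.0):
    """BPR loss for one (user, positive item, negative item) triple.

    Hypothetical helper for illustration only, not cornac's implementation.
    """
    x_ui = user_vec @ pos_item_vec   # predicted score of the interacted item
    x_uj = user_vec @ neg_item_vec   # predicted score of the non-interacted item
    diff = x_ui - x_uj
    # -log(sigmoid(diff)) computed as log(1 + exp(-diff)) for numerical stability
    loss = np.logaddexp(0.0, -diff)
    # L2 penalty on the three latent vectors involved in this triple
    penalty = reg * (user_vec @ user_vec
                     + pos_item_vec @ pos_item_vec
                     + neg_item_vec @ neg_item_vec)
    return loss + penalty

# Toy latent vectors (made-up values)
u = np.array([1.0, 0.0])
i_pos = np.array([2.0, 0.0])
i_neg = np.array([0.0, 0.0])
print(bpr_loss(u, i_pos, i_neg))
```

The loss shrinks as the positive item's score pulls ahead of the negative item's score, which is what pushes interacted items up the ranked list during training.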

In [9]:
## Assign a score of 1 to user-item interactions that are available
df = df[['customer_id', 'article_id']].copy()
df['FEEDBACK'] = 1
df = df.drop_duplicates()

df.head()

Unnamed: 0,customer_id,article_id,FEEDBACK
24735390,d7951ef0b31959505d778831269ca5a549eaeb09488bc6...,901793006,1
907800,ae89ba4c28f12a6b274290ee20e864635d1461d9f9d0c7...,610776001,1
19966814,0b84a0bc6b37be1e3dc7ca5e68a000248a6873991372e4...,715828013,1
1902535,d226a541b24b86419b219dc6772fd39bf06abe8f408635...,637141001,1
1688640,bf4f13ea9f390d7290fca875e591c4453960900153ab4e...,501820001,1


In [10]:
# Rename the columns for consistency 
df.rename(columns = {'customer_id': 'userID', 'article_id': 'itemID', 'FEEDBACK': 'rating'}, inplace = True)

In [11]:
## Create a list of unique users and unique items from our sample
user_ids = df['userID'].unique()
item_ids = df['itemID'].unique()

In [12]:
## Build the full user x item grid with a placeholder rating of 0
## (pairs that were actually observed are set back to 1 when merged with df)
absent_interactions_feedback = [[user, item, 0] for item in item_ids for user in user_ids] 

In [13]:
## Put the grid into a DataFrame; every user-item pair starts with rating 0,
## including pairs the user did interact with
negative_feedback_df = pd.DataFrame(data=absent_interactions_feedback, columns=["userID", "itemID", "rating"])

negative_feedback_df.head()

Unnamed: 0,userID,itemID,rating
0,d7951ef0b31959505d778831269ca5a549eaeb09488bc6...,901793006,0
1,ae89ba4c28f12a6b274290ee20e864635d1461d9f9d0c7...,901793006,0
2,0b84a0bc6b37be1e3dc7ca5e68a000248a6873991372e4...,901793006,0
3,d226a541b24b86419b219dc6772fd39bf06abe8f408635...,901793006,0
4,bf4f13ea9f390d7290fca875e591c4453960900153ab4e...,901793006,0


In [14]:
## Merge the grid with the positive feedback
## rating_x is the placeholder 0 from the grid; rating_y is 1 for observed pairs and NaN otherwise
prepared_dataset = pd.merge(negative_feedback_df, df, on=['userID', 'itemID'], how='outer').fillna(0).drop('rating_x', axis = 1)
# Clean up the column names
prepared_dataset.rename(columns = {'rating_y': 'rating'}, inplace = True)

prepared_dataset.head()

Unnamed: 0,userID,itemID,rating
0,d7951ef0b31959505d778831269ca5a549eaeb09488bc6...,901793006,1.0
1,ae89ba4c28f12a6b274290ee20e864635d1461d9f9d0c7...,901793006,0.0
2,0b84a0bc6b37be1e3dc7ca5e68a000248a6873991372e4...,901793006,0.0
3,d226a541b24b86419b219dc6772fd39bf06abe8f408635...,901793006,0.0
4,bf4f13ea9f390d7290fca875e591c4453960900153ab4e...,901793006,0.0
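
The grid-and-merge steps above can also be sketched more compactly with `pd.MultiIndex.from_product` and the `indicator` option of `DataFrame.merge`. This is a sketch on toy data, not the code used in this notebook; the IDs below are made up, and the positives are assumed to be de-duplicated as they are earlier in the notebook.

```python
import pandas as pd

# Toy positive interactions (hypothetical IDs, not taken from the H&M data)
positives = pd.DataFrame({
    "userID": ["u1", "u1", "u2"],
    "itemID": [10, 20, 20],
})

# Every (user, item) combination among the users and items in the sample
grid = pd.MultiIndex.from_product(
    [positives["userID"].unique(), positives["itemID"].unique()],
    names=["userID", "itemID"],
).to_frame(index=False)

# Left-join the positives; the _merge column marks pairs that were observed
merged = grid.merge(positives, on=["userID", "itemID"], how="left", indicator=True)
merged["rating"] = (merged["_merge"] == "both").astype(int)
prepared = merged.drop(columns="_merge")

print(prepared)
```

Using the indicator column avoids the temporary `rating_x` and `rating_y` columns and keeps `rating` as an integer rather than a float.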


In [15]:
## Check how many positive and negative feedback signals we have
## Expect far more 0s than 1s, since each user interacts with only a tiny fraction of the items
prepared_dataset['rating'].value_counts()

0.0    8112135
1.0       3000
Name: rating, dtype: int64
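
As a quick sanity check on these counts, the prepared dataset should hold one row per user-item pair. The arithmetic below assumes the 2989 unique users and 2715 unique items reported in the training log of this notebook, together with the 3000 positive interactions counted above.

```python
n_users = 2989      # unique users, as reported in the training log
n_items = 2715      # unique items, as reported in the training log
n_positive = 3000   # observed interactions (rating == 1)

n_pairs = n_users * n_items          # size of the full user x item grid
n_negative = n_pairs - n_positive    # pairs filled in with rating 0
density = n_positive / n_pairs       # share of pairs with an interaction

print(n_pairs, n_negative, f"{density:.4%}")
```

The grid has 8,115,135 pairs, of which 8,112,135 are zeros, matching the `value_counts()` output. Only about 0.037% of the pairs are positive.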

In [18]:
result = run_bpr_model(data=prepared_dataset, k=10, epochs=20, learning_rate=0.01, train_size=0.8)

rating_threshold = 1.0
exclude_unknowns = False
---
Training data:
Number of users = 2989
Number of items = 2715
Number of ratings = 6492108
Max rating = 1.0
Min rating = 0.0
Global mean = 0.0
---
Test data:
Number of users = 2989
Number of items = 2715
Number of ratings = 1623027
Number of unknown users = 0
Number of unknown items = 0
---
Total users = 2989
Total items = 2715

[BPR] Training started!



Optimization finished!

[BPR] Evaluation started!



In [19]:
print(result)

    |    MAP |    MRR | NDCG@10 | Train (s) | Test (s)
--- + ------ + ------ + ------- + --------- + --------
BPR | 0.0022 | 0.0022 |  0.0012 |   28.6848 |   2.3500
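
For readers unfamiliar with the ranking metrics in this table, here is a minimal sketch of reciprocal rank (averaged over users to give MRR) and NDCG@k for a single user with binary relevance. These helper functions are illustrative and are not cornac's implementations, which may differ in details such as how items seen during training are handled.

```python
import math

def reciprocal_rank(ranked_items, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is ranked."""
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_items, relevant, k=10):
    """NDCG@k with binary relevance (a relevant item has gain 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked_items[:k], start=1)
        if item in relevant
    )
    # Ideal ordering places all relevant items at the top of the list
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy ranking for one user (made-up item IDs)
ranking = ["a", "b", "c"]
held_out = {"b"}
print(reciprocal_rank(ranking, held_out), ndcg_at_k(ranking, held_out, k=10))
```

Both metrics lie between 0 and 1 and reward placing held-out items near the top of the list, so values near zero mean that the held-out items rarely appear near the top of the recommendations.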



# **References**

1. Microsoft Recommenders, BPR Deep Dive
https://github.com/microsoft/recommenders/blob/main/examples/02_model_collaborative_filtering/cornac_bpr_deep_dive.ipynb

2. Microsoft Recommenders, Preparing Data
https://github.com/microsoft/recommenders/blob/main/examples/01_prepare_data/data_transform.ipynb

3. Aghiles Salah, Quoc-Tuan Truong, Hady W. Lauw; *Cornac: A Comparative Framework for Multimodal Recommender Systems*; Journal of Machine Learning Research 21 (2020) 1-5.
https://dl.acm.org/doi/pdf/10.5555/3455716.3455811