# Product Recommender on H&M Dataset

- We use the Cornac package and Microsoft's recommenders module to train a Bayesian Personalised Ranking (BPR) model on the H&M e-commerce dataset.

- The model learns from user-product interactions, ranks the products for each user, and recommends the top K items.

- The full dataset is very large, so we train the model on a subsample in Google Colab.
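
For reference, the pairwise objective from the BPR paper listed under References (Rendle et al., 2009) can be written as below. The matrix-factorisation score in the second line is one common choice and is stated here as an assumption; the Cornac implementation used in this notebook may include extra terms such as an item bias.

$$
\text{BPR-Opt} = \sum_{(u,i,j) \in D_S} \ln \sigma(\hat{x}_{uij}) - \lambda_\Theta \lVert \Theta \rVert^2, \qquad \hat{x}_{uij} = \hat{x}_{ui} - \hat{x}_{uj}
$$

$$
\hat{x}_{ui} = \langle p_u, q_i \rangle
$$

Here $D_S$ is the set of triples where user $u$ interacted with item $i$ but not with item $j$, $\sigma$ is the logistic sigmoid, $\Theta$ collects the model parameters (the user factors $p_u$ and item factors $q_i$), and $\lambda_\Theta$ is the regularisation strength. Maximising this objective pushes the score of an observed item above the score of an unobserved one for the same user.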

In [1]:
AUTHORNAME = "Archit Kaila"
COLLABORATORS = "Shrey Gupta, Shen Juin Lee"

In [2]:
## Clone the repository and code base to run Non DRL Recommenders
!git clone https://github.com/architkaila/recommenders_aipi590.git

Cloning into 'recommenders_aipi590'...
remote: Enumerating objects: 238, done.
remote: Counting objects: 100% (238/238), done.
remote: Compressing objects: 100% (152/152), done.
remote: Total 238 (delta 119), reused 188 (delta 75), pack-reused 0
Receiving objects: 100% (238/238), 99.66 KiB | 850.00 KiB/s, done.
Resolving deltas: 100% (119/119), done.


In [3]:
## Install required libraries (only needed on Google Colab)
!pip install git+https://github.com/textomatic/cornac.git
!pip install recommenders

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/textomatic/cornac.git
  Cloning https://github.com/textomatic/cornac.git to /tmp/pip-req-build-lyvfc414
  Running command git clone -q https://github.com/textomatic/cornac.git /tmp/pip-req-build-lyvfc414
Collecting powerlaw
  Downloading powerlaw-1.5-py3-none-any.whl (24 kB)
Building wheels for collected packages: cornac
  Building wheel for cornac (setup.py) ... done
  Created wheel for cornac: filename=cornac-1.14.2-cp38-cp38-linux_x86_64.whl size=14288992 sha256=b55a7efdc09d25f57cc386819fec270e2e4353b5fb5f7782936bbabe9cddda4c
  Stored in directory: /tmp/pip-ephem-wheel-cache-s3997y9g/wheels/ce/60/14/a887f00b396951c22e4e119ac64935f8619aa113312bb949b5
Successfully built cornac
Installing collected packages: powerlaw, cornac
Successfully installed cornac-1.14.2 powerlaw-1.5
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev

In [4]:
## Fetch the dataset from S3 Bucket
!wget https://aipi590.s3.amazonaws.com/transactions_train.csv -P "/content/recommenders_aipi590/Non_DRL_Recommenders/Dataset_2_HM/"

--2022-12-15 16:51:52--  https://aipi590.s3.amazonaws.com/transactions_train.csv
Resolving aipi590.s3.amazonaws.com (aipi590.s3.amazonaws.com)... 54.231.196.153, 52.217.132.201, 54.231.128.73, ...
Connecting to aipi590.s3.amazonaws.com (aipi590.s3.amazonaws.com)|54.231.196.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3488002253 (3.2G) [text/csv]
Saving to: ‘/content/recommenders_aipi590/Non_DRL_Recommenders/Dataset_2_HM/transactions_train.csv’


2022-12-15 16:53:12 (41.8 MB/s) - ‘/content/recommenders_aipi590/Non_DRL_Recommenders/Dataset_2_HM/transactions_train.csv’ saved [3488002253/3488002253]



In [5]:
## Import standard libraries
import pandas as pd
import numpy as np

In [6]:
## Import python script to run and evaluate BPR model
from recommenders_aipi590.Non_DRL_Recommenders.bpr_model import run_bpr_model

### Read dataset

In [7]:
## Reading the e-commerce dataset
df = pd.read_csv('/content/recommenders_aipi590/Non_DRL_Recommenders/Dataset_2_HM/transactions_train.csv')
df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2


In [8]:
## We take a subsample of our original dataset to train our model
df = df.sample(n=3000, random_state=0)

### Prepare dataset

- The BPR implementation in the Cornac module works on rankings (implicit feedback) for each user-item pair.

- We use the Negative Sampling method to prepare our data. It assumes that if there is an interaction between a user and an item, the rating is set to 1; otherwise it is set to 0.

- The positive interactions are already present in our dataset; the negative interactions we construct manually.
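
The labelling idea above can be sketched on a toy example. The user and item IDs below are hypothetical and not taken from the H&M data; the sketch only illustrates how every user-item pair gets a 1 or a 0, not the exact pandas code used in this notebook.

```python
from itertools import product

# Toy positive interactions (hypothetical IDs, for illustration only)
positives = {("u1", "a"), ("u1", "b"), ("u2", "b"), ("u3", "c")}

# Unique users and items seen in the positive interactions
users = sorted({u for u, _ in positives})
items = sorted({i for _, i in positives})

# Label every (user, item) pair: 1 if the interaction was observed, else 0
labelled = [
    (u, i, 1 if (u, i) in positives else 0)
    for u, i in product(users, items)
]

n_pos = sum(rating for _, _, rating in labelled)
n_neg = len(labelled) - n_pos
print(len(labelled), n_pos, n_neg)
```

The labelled grid has one row per user-item pair, so its size is the number of users times the number of items. For the 3,000-row subsample, the training log later in this notebook reports 2,987 users and 2,718 items, which gives 8,118,666 pairs, nearly all of them negatives.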

In [9]:
## Set ranking (implicit feedback) to 1 for interactions between user and item
df = df[['customer_id', 'article_id']].copy()
df['FEEDBACK'] = 1

# Remove duplicates from our samples
df = df.drop_duplicates()

# Rename the columns for clarity
df.rename(columns = {'customer_id': 'userID', 'article_id': 'itemID', 'FEEDBACK': 'rating'}, inplace = True)

df.head()

Unnamed: 0,userID,itemID,rating
24618421,373476abfd4e2b7aa917c31bf8e19c3f978300dd0367b0...,706016003,1
94306,f711fb816205c7f40dca9a379bff18ca523f4da7757ecd...,493810014,1
726248,64f03fa3ab5ad9e5850e2458d0f1e7c97fc4b57ba4131c...,594541012,1
3607236,86fd3f1033c86d842b284dfa1a6d434162edc7d877fc22...,719712001,1
8502693,6884d5c356ebb2dffabeeb88994d55b161a8b59e674e29...,700737007,1


In [10]:
## Obtain the unique items and users present in our dataset to generate negative interactions
item_ids = df['itemID'].unique()
user_ids = df['userID'].unique()

In [11]:
## Adding negative feedback (0 ranking) for instances of no interaction between items and users
absent_interactions_feedback = [[user, item, 0] for item in item_ids for user in user_ids] 

# Convert prepared data into a dataframe
negative_feedback_df = pd.DataFrame(data=absent_interactions_feedback, columns=["userID", "itemID", "rating"])

negative_feedback_df.head()

Unnamed: 0,userID,itemID,rating
0,373476abfd4e2b7aa917c31bf8e19c3f978300dd0367b0...,706016003,0
1,f711fb816205c7f40dca9a379bff18ca523f4da7757ecd...,706016003,0
2,64f03fa3ab5ad9e5850e2458d0f1e7c97fc4b57ba4131c...,706016003,0
3,86fd3f1033c86d842b284dfa1a6d434162edc7d877fc22...,706016003,0
4,6884d5c356ebb2dffabeeb88994d55b161a8b59e674e29...,706016003,0


In [12]:
## Merge the positive and negative feedback into one single master dataframe
prepared_dataset = pd.merge(negative_feedback_df, df, on=['userID', 'itemID'], how='outer').fillna(0).drop('rating_x', axis = 1)

# Cleaning up the column names
prepared_dataset.rename(columns = {'rating_y': 'rating'}, inplace = True)

prepared_dataset.head()

Unnamed: 0,userID,itemID,rating
0,373476abfd4e2b7aa917c31bf8e19c3f978300dd0367b0...,706016003,1.0
1,f711fb816205c7f40dca9a379bff18ca523f4da7757ecd...,706016003,0.0
2,64f03fa3ab5ad9e5850e2458d0f1e7c97fc4b57ba4131c...,706016003,0.0
3,86fd3f1033c86d842b284dfa1a6d434162edc7d877fc22...,706016003,0.0
4,6884d5c356ebb2dffabeeb88994d55b161a8b59e674e29...,706016003,0.0


In [13]:
## Check number of positive and negative feedback samples
prepared_dataset['rating'].value_counts()

0.0    8115666
1.0       3000
Name: rating, dtype: int64

### Run and Evaluate Product Ranking Model

- We use the Cornac module to train and evaluate a Bayesian Personalised Ranking model
- We set top K to 10 and train our model for 20 epochs
- We set the learning rate (LR) to 0.01
- We use 80% of our dataset for training and 20% for testing
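
To give a feel for what one training update does, here is a minimal sketch of a single BPR stochastic gradient step for a matrix-factorisation scorer. It is not the Cornac implementation: the function `bpr_sgd_step`, the toy vectors, and the regularisation value are all assumptions made for illustration. Only the learning rate of 0.01 comes from the settings above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bpr_sgd_step(p_u, q_i, q_j, lr=0.01, reg=0.01):
    """One gradient-ascent step on ln(sigmoid(x_ui - x_uj)) with L2 shrinkage.

    p_u: user factors, q_i: factors of an interacted item,
    q_j: factors of a non-interacted item. Hypothetical helper.
    """
    x_uij = dot(p_u, q_i) - dot(p_u, q_j)
    g = 1.0 - sigmoid(x_uij)  # derivative of ln(sigmoid(x)) at x_uij
    new_p = [p + lr * (g * (qi - qj) - reg * p) for p, qi, qj in zip(p_u, q_i, q_j)]
    new_qi = [qi + lr * (g * p - reg * qi) for p, qi in zip(p_u, q_i)]
    new_qj = [qj + lr * (-g * p - reg * qj) for p, qj in zip(p_u, q_j)]
    return new_p, new_qi, new_qj

# Toy factors (hypothetical values): the positive item starts ranked below the negative one
p_u, q_i, q_j = [0.1, -0.2], [0.0, 0.1], [0.3, 0.0]
margin_before = dot(p_u, q_i) - dot(p_u, q_j)

for _ in range(100):
    p_u, q_i, q_j = bpr_sgd_step(p_u, q_i, q_j)

margin_after = dot(p_u, q_i) - dot(p_u, q_j)
print(margin_before, margin_after)
```

Each step nudges the score of the interacted item up relative to the non-interacted one, so the margin grows over the repeated updates. A real trainer would sample a fresh (user, positive item, negative item) triple at every step instead of reusing one.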

In [14]:
## Call our BPR model train and evaluation script on our prepared dataset
result = run_bpr_model(data=prepared_dataset, k=10, epochs=20, learning_rate=0.01, train_size=0.8)

rating_threshold = 1.0
exclude_unknowns = False
---
Training data:
Number of users = 2987
Number of items = 2718
Number of ratings = 6494932
Max rating = 1.0
Min rating = 0.0
Global mean = 0.0
---
Test data:
Number of users = 2987
Number of items = 2718
Number of ratings = 1623734
Number of unknown users = 0
Number of unknown items = 0
---
Total users = 2987
Total items = 2718

[BPR] Training started!


  0%|          | 0/20 [00:00<?, ?it/s]

Optimization finished!

[BPR] Evaluation started!


Ranking:   0%|          | 0/2987 [00:00<?, ?it/s]

In [15]:
## Capture the model metric results on test data
print(result)

    |  HR@10 |    MAP |    MRR | NDCG@10 | Train (s) | Test (s)
--- + ------ + ------ + ------ + ------- + --------- + --------
BPR | 0.0034 | 0.0026 | 0.0026 |  0.0014 |   73.5463 |   2.3558
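
As a rough guide to reading the table, the sketch below computes per-user hit ratio, reciprocal rank, and NDCG at K using common textbook definitions with binary relevance. The function names and the toy ranking are assumptions for illustration; the Cornac build used in this notebook may define or average these metrics differently.

```python
import math

def hit_ratio_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0.0 if none is ranked."""
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / pos
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(pos + 1)
        for pos, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy ranking for one user (hypothetical item IDs); the held-out item is "c"
ranked = ["a", "b", "c", "d"]
relevant = {"c"}

hr3 = hit_ratio_at_k(ranked, relevant, 3)
rr = reciprocal_rank(ranked, relevant)
ndcg3 = ndcg_at_k(ranked, relevant, 3)
print(hr3, rr, ndcg3)
```

Under these definitions, the reported value for each metric would be the average of the per-user values over all test users.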



# **References**

1. Data Preparation for Collaborative Filtering | Microsoft
https://github.com/microsoft/recommenders/blob/main/examples/01_prepare_data/data_transform.ipynb

2. Cornac Movie Recommendation using BPR | Microsoft
https://github.com/microsoft/recommenders/blob/main/examples/02_model_collaborative_filtering/cornac_bpr_deep_dive.ipynb

3. Bayesian Personalised Ranking (BPR) Evaluation Example | PreferredAI, Cornac
https://github.com/PreferredAI/cornac/blob/master/examples/bpr_netflix.py
https://cornac.preferred.ai/

4. BPR: Bayesian personalized ranking from implicit feedback | Rendle, S., Freudenthaler, C., Gantner, Z., & Schmidt-Thieme, L. (2009, June).
https://arxiv.org/ftp/arxiv/papers/1205/1205.2618.pdf