<a href="https://colab.research.google.com/github/kluo9/HM-personalized-fashion-recommendation/blob/main/HM_recall.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The purpose of the recall stage is to reduce the number of candidate items from about 100K to a few hundred for the next-stage ranking model. 
The goal is to keep the candidate set as small as possible while still including every item the user is likely to buy in the next week.

The recall stage is evaluated with three metrics:
1. Precision: number of recalled items that were purchased / total number of items recalled
2. Item Recall Rate: number of recalled items that were purchased / total number of items users purchased
3. User Recall Rate: number of users who purchased at least one recalled item / total number of users
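
The three metrics above can be sketched as follows. This is a minimal illustration, not code from the notebook: it assumes recalled candidates and actual purchases are each stored as a dict mapping a user to a list of item IDs, that counts are pooled over all users, and that "total number of users" means users with at least one purchase in the evaluation window. The function name `recall_stage_metrics` and the toy data are hypothetical.

```python
def recall_stage_metrics(recalled, purchased):
    """Compute precision, item recall rate and user recall rate.

    recalled  : dict user -> list of recalled item ids (assumed layout)
    purchased : dict user -> list of item ids the user actually bought
    """
    hits = n_recalled = n_purchased = users_hit = 0
    for user, bought in purchased.items():
        bought_set = set(bought)
        recalled_set = set(recalled.get(user, []))
        overlap = bought_set & recalled_set
        hits += len(overlap)                # recalled items that were purchased
        n_recalled += len(recalled_set)     # all items recalled for this user
        n_purchased += len(bought_set)      # all items this user purchased
        users_hit += bool(overlap)          # user had at least one hit
    return {
        "precision": hits / n_recalled if n_recalled else 0.0,
        "item_recall_rate": hits / n_purchased if n_purchased else 0.0,
        "user_recall_rate": users_hit / len(purchased) if purchased else 0.0,
    }

# Toy data (hypothetical): u1 has one hit, u2 has none
recalled = {"u1": [1, 2, 3], "u2": [4, 5, 6]}
purchased = {"u1": [1, 9], "u2": [7]}
metrics = recall_stage_metrics(recalled, purchased)
print(metrics)
```

In the toy data, one of six recalled items was bought, one of three purchased items was recalled, and one of two users had a hit.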

The recall strategies:
1. popularity (time-weighted)
2. purchase history (up to 4 weeks)
3. items related to what the user recently purchased (items bought together)
4. popular items among users with the same attributes (age, gender)
5. items with the same prod-name


In [1]:
import numpy as np
import pandas as pd
import os
import glob
from tqdm import tqdm
import datetime

# Read data

In [2]:
! pip install -q kaggle
from google.colab import files

In [3]:
uploaded = files.upload() # upload kaggle token downloaded from kaggle personal account page 'kaggle.json'

Saving kaggle.json to kaggle.json


In [4]:
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

In [5]:
! kaggle competitions download -c h-and-m-personalized-fashion-recommendations -f transactions_train.csv

Downloading transactions_train.csv.zip to /content
 99% 577M/584M [00:07<00:00, 70.6MB/s]
100% 584M/584M [00:07<00:00, 78.1MB/s]


In [6]:
transaction_df = pd.read_csv('/content/transactions_train.csv.zip')
transaction_df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2


Keep the four roughly week-long windows before 2020-09-16 as training data and the last week (2020-09-16 to 2020-09-22) as validation.

In [7]:
print("All Transactions Date Range: {} to {}".format(transaction_df['t_dat'].min(), transaction_df['t_dat'].max()))

transaction_df["t_dat"] = pd.to_datetime(transaction_df["t_dat"])
train1 = transaction_df.loc[(transaction_df["t_dat"] >= datetime.datetime(2020,9,8)) & (transaction_df['t_dat'] < datetime.datetime(2020,9,16))]
train2 = transaction_df.loc[(transaction_df["t_dat"] >= datetime.datetime(2020,9,1)) & (transaction_df['t_dat'] < datetime.datetime(2020,9,8))]
train3 = transaction_df.loc[(transaction_df["t_dat"] >= datetime.datetime(2020,8,23)) & (transaction_df['t_dat'] < datetime.datetime(2020,9,1))]
train4 = transaction_df.loc[(transaction_df["t_dat"] >= datetime.datetime(2020,8,15)) & (transaction_df['t_dat'] < datetime.datetime(2020,8,23))]

val = transaction_df.loc[transaction_df["t_dat"] >= datetime.datetime(2020,9,16)]

All Transactions Date Range: 2018-09-20 to 2020-09-22


In [8]:
# List of all purchases per user (has repetitions)
positive_items_per_user1 = train1.groupby(['customer_id'])['article_id'].apply(list)
positive_items_per_user2 = train2.groupby(['customer_id'])['article_id'].apply(list)
positive_items_per_user3 = train3.groupby(['customer_id'])['article_id'].apply(list)
positive_items_per_user4 = train4.groupby(['customer_id'])['article_id'].apply(list)

# popularity (time-weighted)

Next we compute a time-decayed popularity score for each item, so that items bought more recently carry more weight in the popularity list. Each purchase is weighted by 1 / (number of days before the start of the validation period). In simple terms, item A bought 5 times on the first day of the train period ranks below item B bought 4 times on the last day of the train period.
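
The A-versus-B comparison can be checked with a few lines of plain Python. This is a small sketch rather than the notebook's pandas code: it assumes the same 1/days weighting relative to 2020-09-16, takes 2020-08-15 and 2020-09-15 as the first and last days of the train period, and uses hypothetical toy purchases.

```python
from collections import defaultdict
from datetime import date

cutoff = date(2020, 9, 16)  # first day of the validation period

# Toy purchases (hypothetical): A bought 5 times on the first train day,
# B bought 4 times on the last train day
purchases = [("A", date(2020, 8, 15))] * 5 + [("B", date(2020, 9, 15))] * 4

# Each purchase contributes 1 / (days before the cutoff) to its item's score
scores = defaultdict(float)
for article_id, day in purchases:
    scores[article_id] += 1 / (cutoff - day).days

ranking = sorted(scores, key=scores.get, reverse=True)
print(dict(scores))  # {'A': 0.15625, 'B': 4.0}
print(ranking)       # ['B', 'A']
```

Item A scores 5/32 because its purchases are 32 days old, while item B scores 4/1, so B ranks first despite having fewer purchases.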

In [9]:
train = pd.concat([train1, train2, train3, train4], axis=0)
train['pop_factor'] = train['t_dat'].apply(lambda x: 1/(datetime.datetime(2020,9,16) - x).days)
train['pop_factor'].describe()

count    1.179208e+06
mean     1.182349e-01
std      1.629160e-01
min      3.125000e-02
25%      4.166667e-02
50%      6.250000e-02
75%      1.111111e-01
max      1.000000e+00
Name: pop_factor, dtype: float64

In [10]:
popular_items_group = train.groupby(['article_id'])['pop_factor'].sum()

_, popular_items = zip(*sorted(zip(popular_items_group, popular_items_group.keys()))[::-1])

# purchase history (up to 4 weeks)

Find the items bought by each user in the past month.

In [35]:
train = pd.concat([train1, train2, train3, train4], axis=0)
# List of all purchases per user 
items_per_user = train.groupby(['customer_id'])['article_id'].apply(list) 

In [None]:
from collections import Counter

def purchase_history(user, purchase_data_group):
  # Distinct items bought by `user`, most frequently purchased first
  return [item for item, _ in Counter(purchase_data_group[user]).most_common()]

# related items to what the user recently purchased (items bought together)
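
This section has no code in the notebook. One common way to find items bought together is to count how often two articles appear in the same customer's purchase list; the sketch below assumes that approach and is not the author's implementation. The names `build_co_purchase_counts` and `related_items`, the `top_n` parameter, and the toy baskets are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_co_purchase_counts(items_per_user):
    """Map each item to a Counter of other items bought by the same user."""
    co_counts = defaultdict(Counter)
    for items in items_per_user:
        # Count every ordered pair of distinct items in one user's history
        for a, b in permutations(set(items), 2):
            co_counts[a][b] += 1
    return co_counts

def related_items(recent_items, co_counts, top_n=12):
    """Rank items co-purchased with recent_items, skipping items already bought."""
    scores = Counter()
    for item in recent_items:
        scores.update(co_counts.get(item, {}))
    for item in recent_items:
        scores.pop(item, None)
    return [item for item, _ in scores.most_common(top_n)]

# Toy purchase lists (hypothetical), one list per user
baskets = [[1, 2, 3], [1, 2], [2, 4]]
co_counts = build_co_purchase_counts(baskets)
print(related_items([1], co_counts))  # [2, 3]
```

With the notebook's data, a list of per-user purchases such as `items_per_user.values` could presumably be passed as `items_per_user`, though counting all pairs may be expensive for users with long histories.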

# Validation

Define evaluation metric: Mean Average Precision @ 12
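
For reference, the metric computed by the `apk` and `mapk` functions in this notebook can be written as below. The notation is introduced here for illustration: $U$ is the set of evaluated users, $m_u$ the number of items user $u$ actually bought, $n_u$ the number of predictions for $u$, $P_u(i)$ the precision over the first $i$ predictions, and $\mathrm{rel}_u(i)$ equals 1 when the $i$-th prediction is a purchased item not already predicted at an earlier position, and 0 otherwise. When $m_u = 0$ the score is defined as 0.

$$
\mathrm{AP@}k(u) = \frac{1}{\min(m_u, k)} \sum_{i=1}^{\min(n_u, k)} P_u(i)\,\mathrm{rel}_u(i),
\qquad
\mathrm{MAP@}k = \frac{1}{|U|} \sum_{u \in U} \mathrm{AP@}k(u)
$$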

In [11]:
def apk(actual, predicted, k=12):
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)

def mapk(actual, predicted, k=12):
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])

Construct a data set of the items bought by each user in the validation period.

In [12]:
positive_items_val = val.groupby(['customer_id'])['article_id'].apply(list)

In [13]:
# Purchased items per validation user, in the same order as val_users
val_users = positive_items_val.keys()
val_items = [positive_items_val[user] for user in val_users]
    
print("Total users in validation:", len(val_users))

Total users in validation: 68984


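Test the strategy on the validation set: each user's candidates are their purchases from the four training windows, most recent window first, padded with the most popular items.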

In [14]:
from collections import Counter
outputs = []
cnt = 0

popular_items = list(popular_items)

for user in tqdm(val_users):
    user_output = []
    if user in positive_items_per_user1.keys():
        most_common_items_of_user = {k:v for k, v in Counter(positive_items_per_user1[user]).most_common()}
        user_output += list(most_common_items_of_user.keys())[:12]
    if user in positive_items_per_user2.keys():
        most_common_items_of_user = {k:v for k, v in Counter(positive_items_per_user2[user]).most_common()}
        user_output += list(most_common_items_of_user.keys())[:12]
    if user in positive_items_per_user3.keys():
        most_common_items_of_user = {k:v for k, v in Counter(positive_items_per_user3[user]).most_common()}
        user_output += list(most_common_items_of_user.keys())[:12]
    if user in positive_items_per_user4.keys():
        most_common_items_of_user = {k:v for k, v in Counter(positive_items_per_user4[user]).most_common()}
        user_output += list(most_common_items_of_user.keys())[:12]
    
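    # Pad with popular items only when fewer than 12 candidates were found;
    # max() stops a negative slice bound from appending almost the whole list
    user_output += popular_items[:max(0, 12 - len(user_output))]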
    outputs.append(user_output)
    
print("mAP Score on Validation set:", mapk(val_items, outputs))

100%|██████████| 68984/68984 [00:04<00:00, 15542.77it/s]


mAP Score on Validation set: 0.023448012511813318
