# NOTINO Sampling Campaigns Optimisation - Recommender system for perfumes based on implicit feedback

The following sections present the development of a recommender system based on implicit feedback, namely the purchase history of products in the perfume category. The system is meant to help Notino target customers more precisely during its sampling campaigns, with the aim of increasing the conversion rate of these campaigns.

In [151]:
import os

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pymssql
import openpyxl
from utils_recommender import *
from IPython.display import HTML, display, IFrame, Markdown as md, display_html, Image
from IPython import get_ipython

%reload_ext autoreload
%autoreload 2
plt.rcParams['axes.labelsize'] = 10
plt.rcParams['axes.xmargin'] = 0.01
plt.rcParams['axes.titlesize'] = 10
pd.set_option('display.float_format', lambda x: '%.0f' % x if (x == x and x*10 % 10 == 0) else ('%.1f' % x if (x == x and x*100 % 10 == 0) else '%.2f' % x))
pd.set_option('display.max_columns', 80)
pd.set_option('display.max_rows', 70)
plt.style.use('ggplot')
%matplotlib inline

## 1. Data initialisation

**The purchase history data used in this model have been collected at so-called purchase (or item) level and not at the usual transaction level**, i.e. each observation in the input data represents a particular item bought by a customer. For the modelling purposes, this purchases data will be aggregated to the user-item level, summing the number of times each perfume item was purchased by a user into the `purchases` variable. In addition, for notational convenience, the original labels `customer_id` and `product_name_id` were renamed as `user_id` and `item_id`.

In [4]:
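import sqlalchemy as sql  # assumption: `sql` is SQLAlchemy (otherwise provided by the star import from utils_recommender)

engine = sql.create_engine(os.environ['SIMPLITY_DB'])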

# Training set (2018-19)
query_train = """select * from recommender_perfume_purchase_history_train;"""
purchases_train = pd.read_sql(query_train, engine, parse_dates=['order_date'])
purchases_train.customer_id = purchases_train.customer_id.astype(str)
purchases_train = purchases_train.rename(columns={'customer_id': 'user_id', 'product_name_id': 'item_id'})

# Validation & Test set (Q1/2020)
query_valid_test = """select * from recommender_perfume_purchase_history_valid_test;"""
purchases_valid_test = pd.read_sql(query_valid_test, engine, parse_dates=['order_date'])
purchases_valid_test.customer_id = purchases_valid_test.customer_id.astype(str)
purchases_valid_test = purchases_valid_test.rename(columns={'customer_id': 'user_id', 'product_name_id': 'item_id'})

### 1.1 Training set construction

First, let's check the **number of purchased items and users** in the training purchases dataset, which was constructed from users who purchased **at least 3 and at most 75 perfume items (i.e. at most roughly 3 perfumes per month) from 1.1.2018 until 31.12.2019.** Of all customers who purchased a perfume item within the 2018-19 training period, the training purchases dataset contains roughly 25% (the remaining 75% of customers purchased either fewer than 3 or more than 75 items).

In [4]:
report(purchases_train)

Unnamed: 0,user_id,order_date,item_id,master_product_id,subcategory,type
0,42580914-f4e1-425f-9c7d-8b2bafbde5a6,2018-11-24,Cerruti 1881 pour Femme EDT W,CRTPFNW_AEDT10,Women’s perfumes,Eaux de Toilette
1,7ca100db-7db9-4b75-97c9-021a913b7e86,2018-02-10,Cerruti 1881 pour Femme EDT W,CRTPFNW_AEDT10,Women’s perfumes,Eaux de Toilette
2,a5602c80-5554-4e3d-aa41-3d1a94ded2a7,2019-05-21,Cerruti 1881 pour Femme EDT W,CRTPFNW_AEDT10,Women’s perfumes,Eaux de Toilette
3,245beb0b-bba6-4517-803f-3e0419c9d046,2019-06-25,Cerruti 1881 pour Femme EDT W,CRTPFNW_AEDT10,Women’s perfumes,Eaux de Toilette
4,9503f640-e2fc-4dbf-a227-d8a030ca7f4a,2019-02-20,Cerruti 1881 pour Femme EDT W,CRTPFNW_AEDT10,Women’s perfumes,Eaux de Toilette


The number of purchased perfume items is **6,614,682.**


The number of distinct users is **1,161,116.**


The number of distinct items is **8,104.**


Let's perform two consecutive technical aggregations. They are needed because the same product sometimes appears under several distinct values in the `master_product_id` column and in the original `product_name` column (which was replaced by `product_name_id`).
**As a result of these aggregations, the output dataset will store distinct aggregated user-item interactions.**

In [5]:
purchases_train_agg_1st = purchases_train.groupby(
    by=['user_id', 'master_product_id', 'subcategory', 'type']).agg(
        purchases=('user_id', 'count'),
        item_id=('item_id', lambda x: x.str.upper().max()),
        first_order=('order_date', min),
        last_order=('order_date', max)).reset_index()

purchases_train_agg_2nd = purchases_train_agg_1st.groupby(
    by=['user_id', 'item_id', 'subcategory', 'type']).agg(
        purchases=('purchases', sum),
        first_order=('first_order', min),
        last_order=('last_order', max)).reset_index()
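The effect of the two aggregation stages can be illustrated on a toy dataset. The sketch below is not the notebook's data: the rows, identifiers and the reduced set of grouping columns are made up for illustration. It shows how a product recorded under two `master_product_id` values and two spellings collapses into a single user-item row.

```python
import pandas as pd

# Toy purchase-level data (hypothetical values): one row per purchased item.
# The same perfume appears under two master_product_id values and two spellings.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "master_product_id": ["A_10", "A_10", "A_20", "A_10"],
    "item_id": ["Brand X EDT W", "BRAND X EDT W", "Brand X EDT W", "Brand X EDT W"],
    "order_date": pd.to_datetime(["2018-01-05", "2018-03-01", "2019-02-10", "2018-07-07"]),
})

# Stage 1: count rows per (user, master product) and normalise the item label.
stage1 = (toy.groupby(["user_id", "master_product_id"])
             .agg(purchases=("order_date", "count"),
                  item_id=("item_id", lambda s: s.str.upper().max()),
                  first_order=("order_date", "min"),
                  last_order=("order_date", "max"))
             .reset_index())

# Stage 2: merge master products that share the same normalised item label.
stage2 = (stage1.groupby(["user_id", "item_id"])
                .agg(purchases=("purchases", "sum"),
                     first_order=("first_order", "min"),
                     last_order=("last_order", "max"))
                .reset_index())

print(stage2)
```

The total of `purchases` is preserved across both stages, which mirrors the unchanged interaction count reported before and after aggregation.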

The last dataset used before constructing the training set is the `purchases_train_agg_2nd` dataset. Let's check its properties:

In [6]:
report(purchases_train_agg_2nd)

Unnamed: 0,user_id,item_id,subcategory,type,purchases,first_order,last_order
0,00000542-7d7d-4344-864a-8490d7cc708e,BRUNO BANANI PURE WOMAN EDT W,Women’s perfumes,Eaux de Toilette,2,2019-01-30,2019-09-16
1,00000542-7d7d-4344-864a-8490d7cc708e,COACH COACH NEW YORK EDP W,Women’s perfumes,Eaux de Parfum,1,2019-11-24,2019-11-24
2,00000542-7d7d-4344-864a-8490d7cc708e,GUESS DARE EDT W,Women’s perfumes,Eaux de Toilette,2,2018-08-14,2019-11-24
3,00000542-7d7d-4344-864a-8490d7cc708e,HUGO BOSS BOSS BOTTLED EDT M,Men’s perfumes,Eaux de Toilette,2,2019-09-16,2019-10-21
4,00000542-7d7d-4344-864a-8490d7cc708e,LACOSTE TOUCH OF PINK EDT W,Women’s perfumes,Eaux de Toilette,1,2018-08-14,2018-08-14


The number of aggregated users interactions, i.e. distinct user-item pairs, is **5,085,787.**


The number of users interactions is **6,614,682.**


The number of distinct users is **1,161,116.**


The number of distinct items is **8,074.**


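Let's check the **sparsity of the future interactions matrix**. The figure reported below is the number of distinct user-item interactions divided by the product of the number of users and the number of items, i.e. the share of non-empty cells (strictly speaking the density of the matrix; the lower the value, the sparser the matrix).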

In [7]:
n_users = len(purchases_train_agg_2nd.user_id.unique())
n_items = len(purchases_train_agg_2nd.item_id.unique())

print(f'Number of unique users: {n_users:,.0f}')
print(f'Number of unique items: {n_items:,.0f}')
print(f'Sparsity: {len(purchases_train_agg_2nd)/(n_users*n_items):,.4%}')

Number of unique users: 1,161,116
Number of unique items: 8,074
Sparsity: 0.0542%
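As a quick arithmetic check, the reported ratio can be recomputed from the three counts printed above. The variable names `density` and `sparsity` are introduced here only for illustration; the notebook itself labels the ratio "Sparsity".

```python
# Counts taken from the report output above
n_users, n_items, n_pairs = 1_161_116, 8_074, 5_085_787

density = n_pairs / (n_users * n_items)   # share of filled user-item cells
sparsity = 1 - density                    # share of empty user-item cells

print(f"Filled cells: {density:.4%}")
print(f"Empty cells:  {sparsity:.4%}")
```

In other words, more than 99.9% of the cells of the interactions matrix are empty.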


In order to make the interactions matrix slightly denser and to **ensure that each item has been purchased at least three times (and, symmetrically, that each user has at least three item purchases)**, the thresholds will be applied iteratively on the training set until the numbers stabilize.

In [8]:
train = threshold_interactions(purchases_train_agg_2nd, users_min=3, items_min=3, interactions='purchases')

Starting data info
Number of unique users: 1,161,116
Number of unique items: 8,074
Sparsity: 0.0542%
Ending data info
Number of unique users: 1,160,996
Number of unique items: 7,561
Sparsity: 0.0579%
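The implementation of `threshold_interactions` lives in `utils_recommender` and is not shown here. A minimal sketch of the idea, assuming the thresholds apply to summed purchases per user and per item, could look as follows; the function name `threshold_interactions_sketch` and the toy data are hypothetical.

```python
import pandas as pd

def threshold_interactions_sketch(df, users_min=3, items_min=3, interactions="purchases"):
    """Drop items and users below the minimum summed interactions,
    repeating until a full pass removes no rows."""
    while True:
        n_rows = len(df)
        item_totals = df.groupby("item_id")[interactions].transform("sum")
        df = df[item_totals >= items_min]
        user_totals = df.groupby("user_id")[interactions].transform("sum")
        df = df[user_totals >= users_min]
        if len(df) == n_rows:
            return df.reset_index(drop=True)

toy = pd.DataFrame({
    "user_id":   ["u1", "u1", "u2", "u2", "u3", "u3"],
    "item_id":   ["a",  "b",  "a",  "b",  "c",  "a"],
    "purchases": [2,    1,    1,    2,    2,    2],
})
filtered = threshold_interactions_sketch(toy)
print(filtered)
```

In the toy data, item `c` is removed first (only 2 purchases), which pushes user `u3` below the user threshold. This cascade is why the filter has to be repeated until the counts stop changing.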


### 1.2 Training set statistics

Let's check whether **each user has a minimum of three purchases and each item has been purchased at least three times.**

In [9]:
nitems_head = train.groupby(by='user_id').purchases.sum().value_counts().head().to_frame(name='users')
nitems_head.index.name = 'purchases'
nitems_head.reset_index(inplace=True)
nitems_head = nitems_head.style.format('{:,.0f}').set_table_attributes("style='display:inline'").hide_index()

nitems_tail = train.groupby(by='user_id').purchases.sum().value_counts().tail().to_frame(name='users')
nitems_tail.index.name = 'purchases'
nitems_tail.reset_index(inplace=True)
nitems_tail = nitems_tail.style.format('{:,.0f}').set_table_attributes("style='display:inline'").hide_index()

nusers_head = train.groupby(by='item_id').purchases.sum().sort_values().head().to_frame(name='purchases')
nusers_head.index.name = 'item'
nusers_head.reset_index(inplace=True)
nusers_head = nusers_head.style.set_table_attributes("style='display:inline'").hide_index()

nusers_tail = train.groupby(by='item_id').purchases.sum().sort_values().tail().to_frame(name='purchases')
nusers_tail.index.name = 'item'
nusers_tail.reset_index(inplace=True)
nusers_tail = nusers_tail.style.set_table_attributes("style='display:inline'").hide_index()

display_html(nitems_head._repr_html_() + "\xa0" * 10 + nitems_tail._repr_html_() + "\xa0" * 40 + \
             nusers_head._repr_html_() + "\xa0" * 10 + nusers_tail._repr_html_(), raw=True)

purchases,users
3,403611
4,236266
5,145347
6,97631
7,66339

purchases,users
72,28
74,26
73,25
70,24
75,20

item,purchases
ANGEL SCHLESSER AGUA DE JAZMIN EDT W,3
DELAROM HOMME EAU SPORT EDP M,3
ALEXANDRE.J ULTIMATE COLLECTION: ST. HONORE EDP W,3
ALEXANDRE.J ULTIMATE COLLECTION: PURE ART EDP U,3
M. MICALLEF MON PARFUM GOLD EDP 30 ML,3

item,purchases
LANCOME LA VIE EST BELLE EDP W,73071
CHLOÉ CHLOÉ EDP W,73975
CALVIN KLEIN ETERNITY EDP W,77333
HUGO BOSS BOSS BOTTLED EDT M,103227
CALVIN KLEIN EUPHORIA EDP W,128083


**Final check of the number of users, items and interactions in the training set.**

In [10]:
report(train)
md(f"The training set date range is from **{train.first_order.min():%d.%m.%Y}** to **{train.last_order.max():%d.%m.%Y}**.")

Unnamed: 0,user_id,item_id,subcategory,type,purchases,first_order,last_order
0,00000542-7d7d-4344-864a-8490d7cc708e,BRUNO BANANI PURE WOMAN EDT W,Women’s perfumes,Eaux de Toilette,2,2019-01-30,2019-09-16
1,00000542-7d7d-4344-864a-8490d7cc708e,COACH COACH NEW YORK EDP W,Women’s perfumes,Eaux de Parfum,1,2019-11-24,2019-11-24
2,00000542-7d7d-4344-864a-8490d7cc708e,GUESS DARE EDT W,Women’s perfumes,Eaux de Toilette,2,2018-08-14,2019-11-24
3,00000542-7d7d-4344-864a-8490d7cc708e,HUGO BOSS BOSS BOTTLED EDT M,Men’s perfumes,Eaux de Toilette,2,2019-09-16,2019-10-21
4,00000542-7d7d-4344-864a-8490d7cc708e,LACOSTE TOUCH OF PINK EDT W,Women’s perfumes,Eaux de Toilette,1,2018-08-14,2018-08-14


The number of aggregated users interactions, i.e. distinct user-item pairs, is **5,084,851.**


The number of users interactions is **6,613,716.**


The number of distinct users is **1,160,996.**


The number of distinct items is **7,561.**


The training set date range is from **01.01.2018** to **31.12.2019**.

### 1.3 Validation and test sets construction

First, let's check the **number of purchases, distinct users and items in the joint valid_test purchases dataset**. This dataset was built in the SQL query from users of the training set, with the time range **restricted to the 3 months after the end of the training period (Q1/2020).**

In [35]:
report(purchases_valid_test)

Unnamed: 0,user_id,order_date,item_id,master_product_id,subcategory,type
0,ce1b7b76-de37-4ac9-ba83-07c110a5f5e5,2020-02-05,Davidoff Hot Water EDT M,DDFHOTM_AEDT10,Men’s perfumes,Eaux de Toilette
1,fb686a75-0747-4ae7-988d-d6ca5f9546b7,2020-02-09,Davidoff Hot Water EDT M,DDFHOTM_AEDT10,Men’s perfumes,Eaux de Toilette
2,dfa18f30-eb3b-4704-8c9c-38c276693382,2020-02-25,Davidoff Hot Water EDT M,DDFHOTM_AEDT10,Men’s perfumes,Eaux de Toilette
3,c937fa89-2abb-4c4d-b2dd-3ac7a89cfe07,2020-03-02,Davidoff Hot Water EDT M,DDFHOTM_AEDT10,Men’s perfumes,Eaux de Toilette
4,2c294b1e-232f-43b9-808e-85e7f2aeb4f6,2020-03-31,Davidoff Cool Water EDT M,DDFCWMM_AEDT10,Men’s perfumes,Eaux de Toilette


The number of purchased perfume items is **425,287.**


The number of distinct users is **229,772.**


The number of distinct items is **5,086.**


Let's further **randomly split the users from the joint valid_test purchases dataset into two halves and keep only the first half. The validation set is then constructed from the first 10,100 users and the test set from all the remaining users**. As a result, the `purchases_valid` and `purchases_test` datasets will be created. Only ~10,000 users were used for the validation sample, primarily for computational reasons: each iteration of the grid search is considerably time-consuming (around 7 minutes per 10K users).

In [70]:
all_users_ids = pd.Series(purchases_valid_test.user_id.drop_duplicates().index)
valid_test_users_ids = all_users_ids.sample(frac=0.5, random_state=1)

valid_users = purchases_valid_test.user_id[valid_test_users_ids[:10100].values]
test_users = purchases_valid_test.user_id[valid_test_users_ids[10100:].values]

purchases_valid = purchases_valid_test[purchases_valid_test.user_id.isin(valid_users)]
purchases_test = purchases_valid_test[purchases_valid_test.user_id.isin(test_users)]
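The same user-level split can be sketched more directly by shuffling the distinct user identifiers. This is an alternative formulation, not the code used above: `split_users_sketch` is a hypothetical helper, and because it uses a NumPy generator rather than `Series.sample` it will not reproduce the exact split of this notebook.

```python
import numpy as np
import pandas as pd

def split_users_sketch(purchases, n_valid, frac=0.5, seed=1):
    """Shuffle distinct users, keep a fraction of them, then assign the first
    n_valid kept users to validation and the rest to test."""
    users = purchases["user_id"].drop_duplicates().to_numpy()
    rng = np.random.default_rng(seed)
    kept = rng.permutation(users)[: int(len(users) * frac)]
    valid_users, test_users = kept[:n_valid], kept[n_valid:]
    valid = purchases[purchases["user_id"].isin(valid_users)]
    test = purchases[purchases["user_id"].isin(test_users)]
    return valid, test

# Toy data: 20 users with 2 purchase rows each (hypothetical)
toy = pd.DataFrame({"user_id": [f"u{i}" for i in range(20)] * 2,
                    "item_id": ["a"] * 20 + ["b"] * 20})
toy_valid, toy_test = split_users_sketch(toy, n_valid=3)
print(toy_valid.user_id.nunique(), toy_test.user_id.nunique())
```

Splitting by user rather than by row keeps all purchases of a given user in the same set, so validation and test users never overlap.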

Let's perform the same two consecutive technical aggregations **on the validation and test purchases datasets**. They are needed because the same product sometimes appears under several distinct values in the `master_product_id` column and in the original `product_name` column (which was replaced by `product_name_id`). **As a result of these aggregations, the output dataset will store distinct aggregated user-item interactions.**

In [71]:
purchases_valid_agg_1st = purchases_valid.groupby(
    by=['user_id', 'master_product_id', 'subcategory', 'type']).agg(
        purchases=('user_id', 'count'),
        item_id=('item_id', lambda x: x.str.upper().max()),
        first_order=('order_date', min),
        last_order=('order_date', max)).reset_index()

purchases_valid_agg_2nd = purchases_valid_agg_1st.groupby(
    by=['user_id', 'item_id', 'subcategory', 'type']).agg(
        purchases=('purchases', sum),
        first_order=('first_order', min),
        last_order=('last_order', max)).reset_index()

purchases_test_agg_1st = purchases_test.groupby(
    by=['user_id', 'master_product_id', 'subcategory', 'type']).agg(
        purchases=('user_id', 'count'),
        item_id=('item_id', lambda x: x.str.upper().max()),
        first_order=('order_date', min),
        last_order=('order_date', max)).reset_index()

purchases_test_agg_2nd = purchases_test_agg_1st.groupby(
    by=['user_id', 'item_id', 'subcategory', 'type']).agg(
        purchases=('purchases', sum),
        first_order=('first_order', min),
        last_order=('last_order', max)).reset_index()

**Let's check the properties of the aggregated validation dataset:**

In [72]:
valid = purchases_valid_agg_2nd.copy()
report(valid)

Unnamed: 0,user_id,item_id,subcategory,type,purchases,first_order,last_order
0,00071d8d-1499-4b40-abd4-3cc6555e2c21,MONTBLANC LEGEND EDT M,Men’s perfumes,Eaux de Toilette,1,2020-01-28,2020-01-28
1,000aed4c-6b1a-490f-92cf-7300fc493522,BOUCHERON QUATRE EDP W,Women’s perfumes,Eaux de Parfum,1,2020-02-17,2020-02-17
2,000aed4c-6b1a-490f-92cf-7300fc493522,BVLGARI AQVA POUR HOMME EDT M,Men’s perfumes,Eaux de Toilette,1,2020-02-11,2020-02-11
3,00130e8e-0047-4f92-bd97-8eb644b0b43d,CHOPARD CAŠMIR EDP W,Women’s perfumes,Eaux de Parfum,1,2020-02-05,2020-02-05
4,00130e8e-0047-4f92-bd97-8eb644b0b43d,GUCCI GUILTY ABSOLUTE EDP M,Men’s perfumes,Eaux de Parfum,1,2020-01-30,2020-01-30


The number of aggregated users interactions, i.e. distinct user-item pairs, is **16,958.**


The number of users interactions is **18,520.**


The number of distinct users is **10,100.**


The number of distinct items is **2,666.**


**Let's check the properties of the aggregated test dataset:**

In [73]:
test = purchases_test_agg_2nd.copy()
report(test)

Unnamed: 0,user_id,item_id,subcategory,type,purchases,first_order,last_order
0,0000b60a-57be-4bce-a1f3-45dddf3e55e5,PRADA PRADA L'HOMME EDT M,Men’s perfumes,Eaux de Toilette,1,2020-02-06,2020-02-06
1,0000f5d4-c2d2-433d-8a98-c659439a337c,GIORGIO ARMANI CODE ABSOLU EDP M,Men’s perfumes,Eaux de Parfum,1,2020-01-26,2020-01-26
2,0000f5d4-c2d2-433d-8a98-c659439a337c,GIVENCHY PÍ EDT M,Men’s perfumes,Eaux de Toilette,1,2020-01-26,2020-01-26
3,00015adc-127b-4290-8612-9b0974028a3b,CALVIN KLEIN EUPHORIA EDP W,Women’s perfumes,Eaux de Parfum,1,2020-03-20,2020-03-20
4,00017c74-0cf5-4a88-9b54-b21e54ab5fc4,GIORGIO ARMANI ACQUA DI GIO POUR HOMME EDT M,Men’s perfumes,Eaux de Toilette,2,2020-01-20,2020-01-20


The number of aggregated users interactions, i.e. distinct user-item pairs, is **176,475.**


The number of users interactions is **194,164.**


The number of distinct users is **104,786.**


The number of distinct items is **4,663.**


### 1.4 Mappings of user/item string identifiers to integer indices

First, let's define the mappings to integer-based indices, which will later serve as row and column identifiers in the matrix factorisation algorithm. Each distinct `user_id` is mapped to a `user_idx` (and vice versa) and, similarly, each distinct `item_id` is mapped to an `item_idx` (and vice versa). **The mappings are built from the lists of items and users contained in the training set and are then applied to the training, validation and test sets.**

In [None]:
train_items = sorted(train.item_id.unique())
train_users = sorted(train.user_id.unique())

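# Map each string identifier from the training set to a consecutive integer index
item_to_idx = {item: idx for idx, item in enumerate(train_items)}
user_to_idx = {user: idx for idx, user in enumerate(train_users)}

def map_ids(row, mapper: dict):
    # Returns None for identifiers that are absent from the training set
    return mapper.get(row)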

Calculate both `user_idx` and `item_idx` as new columns.

In [75]:
train['user_idx'] = train.user_id.apply(map_ids, mapper=user_to_idx)
train['item_idx'] = train.item_id.apply(map_ids, mapper=item_to_idx)

valid['user_idx'] = valid.user_id.apply(map_ids, mapper=user_to_idx)
valid['item_idx'] = valid.item_id.apply(map_ids, mapper=item_to_idx)

test['user_idx'] = test.user_id.apply(map_ids, mapper=user_to_idx)
test['item_idx'] = test.item_id.apply(map_ids, mapper=item_to_idx)

Since the **validation and test sets contain unmapped users that are no longer in the training set** (they were excluded during the cleaning stage when enforcing the minimum number of purchases per user and item), **these users were also excluded from the validation and test sets**, so that the recommender results evaluated on these sets are compatible with the training set. Moreover, since the **validation and test sets also contain unmapped items that were not purchased during the training period** (primarily newly launched products or products purchased only by excluded users), **these items were excluded as well**.

In [76]:
valid = valid[~valid.user_idx.isnull()]
valid = valid[~valid.item_idx.isnull()]
valid = valid.astype({'user_idx': int, 'item_idx': int})

test = test[~test.user_idx.isnull()]
test = test[~test.item_idx.isnull()]
test = test.astype({'user_idx': int, 'item_idx': int})
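The mapping and filtering steps can be condensed with `Series.map`, which returns `NaN` for identifiers missing from the dictionary. The following is a small sketch on made-up data (`train_toy` and `valid_toy` are hypothetical), including the reverse mapping mentioned in the text.

```python
import pandas as pd

train_toy = pd.DataFrame({"user_id": ["u2", "u1", "u1"], "item_id": ["B", "A", "B"]})
valid_toy = pd.DataFrame({"user_id": ["u1", "u3", "u2"], "item_id": ["A", "A", "C"]})

# Build the mappings from the training set only
user_to_idx = {u: i for i, u in enumerate(sorted(train_toy.user_id.unique()))}
item_to_idx = {it: i for i, it in enumerate(sorted(train_toy.item_id.unique()))}

# Unknown users/items become NaN and are dropped afterwards
valid_toy["user_idx"] = valid_toy.user_id.map(user_to_idx)
valid_toy["item_idx"] = valid_toy.item_id.map(item_to_idx)
valid_toy = (valid_toy.dropna(subset=["user_idx", "item_idx"])
                      .astype({"user_idx": int, "item_idx": int}))

# Reverse mapping, e.g. to translate recommended indices back to identifiers
idx_to_user = {i: u for u, i in user_to_idx.items()}
print(valid_toy)
```

Here user `u3` and item `C` do not occur in the toy training set, so only the `(u1, A)` row survives.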

### 1.5 Validation set statistics

The final validation set statistics are as follows:

In [77]:
report(valid)
md(f"The validation set date range is from **{valid.first_order.min():%d.%m.%Y}** to **{valid.last_order.max():%d.%m.%Y}**.")

Unnamed: 0,user_id,item_id,subcategory,type,purchases,first_order,last_order,user_idx,item_idx
0,00071d8d-1499-4b40-abd4-3cc6555e2c21,MONTBLANC LEGEND EDT M,Men’s perfumes,Eaux de Toilette,1,2020-01-28,2020-01-28,139,5417
1,000aed4c-6b1a-490f-92cf-7300fc493522,BOUCHERON QUATRE EDP W,Women’s perfumes,Eaux de Parfum,1,2020-02-17,2020-02-17,206,1263
2,000aed4c-6b1a-490f-92cf-7300fc493522,BVLGARI AQVA POUR HOMME EDT M,Men’s perfumes,Eaux de Toilette,1,2020-02-11,2020-02-11,206,1370
3,00130e8e-0047-4f92-bd97-8eb644b0b43d,CHOPARD CAŠMIR EDP W,Women’s perfumes,Eaux de Parfum,1,2020-02-05,2020-02-05,361,1801
4,00130e8e-0047-4f92-bd97-8eb644b0b43d,GUCCI GUILTY ABSOLUTE EDP M,Men’s perfumes,Eaux de Parfum,1,2020-01-30,2020-01-30,361,3306


The number of aggregated users interactions, i.e. distinct user-item pairs, is **16,650.**


The number of users interactions is **18,189.**


The number of distinct users is **10,005.**


The number of distinct items is **2,549.**


The validation set date range is from **01.01.2020** to **31.03.2020**.

### 1.6 Test set statistics

The final test set statistics are as follows:

In [79]:
report(test)
md(f"The test set date range is from **{test.first_order.min():%d.%m.%Y}** to **{test.last_order.max():%d.%m.%Y}**.")

Unnamed: 0,user_id,item_id,subcategory,type,purchases,first_order,last_order,user_idx,item_idx
0,0000b60a-57be-4bce-a1f3-45dddf3e55e5,PRADA PRADA L'HOMME EDT M,Men’s perfumes,Eaux de Toilette,1,2020-02-06,2020-02-06,9,6196
1,0000f5d4-c2d2-433d-8a98-c659439a337c,GIORGIO ARMANI CODE ABSOLU EDP M,Men’s perfumes,Eaux de Parfum,1,2020-01-26,2020-01-26,15,3113
2,0000f5d4-c2d2-433d-8a98-c659439a337c,GIVENCHY PÍ EDT M,Men’s perfumes,Eaux de Toilette,1,2020-01-26,2020-01-26,15,3248
3,00015adc-127b-4290-8612-9b0974028a3b,CALVIN KLEIN EUPHORIA EDP W,Women’s perfumes,Eaux de Parfum,1,2020-03-20,2020-03-20,24,1562
4,00017c74-0cf5-4a88-9b54-b21e54ab5fc4,GIORGIO ARMANI ACQUA DI GIO POUR HOMME EDT M,Men’s perfumes,Eaux de Toilette,2,2020-01-20,2020-01-20,28,3103


The number of aggregated users interactions, i.e. distinct user-item pairs, is **173,280.**


The number of users interactions is **190,659.**


The number of distinct users is **103,768.**


The number of distinct items is **4,371.**


The test set date range is from **01.01.2020** to **31.03.2020**.

### 1.7 Dump training, validation and test sets into HDF5 files

In [107]:
dump_df_to_hdf5(train, valid, test, dirpath='./data/hdf5/rec', names=['train', 'valid', 'test'])

### 1.8 Load training, validation and test sets from HDF5 files

In [224]:
train, valid, test = load_df_from_hdf5(dirpath='./data/hdf5/rec', names=['train', 'valid', 'test'])
    
print(f'TRAIN SET: {len(train):,.0f}')
print(f'VALID SET: {len(valid):,.0f}')
print(f'TEST SET:  {len(test):,.0f}')

TRAIN SET: 5,084,851
VALID SET: 16,650
TEST SET:  173,280


## 2. Latent factor model (ALS algorithm)

### 2.1 Popularity-based baseline model

First, let's **calculate items popularity ranking** based on the number of purchases in the training set. This non-personalised popularity-based recommendation model, using as the top recommended items the most popular ones, will be used as a baseline for comparison with our latent factor model.

In [226]:
items_pop = get_item_popularity(train)
items_pop.head().style

Unnamed: 0,item_idx,purchases,rank_pct_pop,rank_pop
1562,1562,128083,0.000132,1
3586,3586,103227,0.000265,2
1548,1548,77333,0.000397,3
1782,1782,73975,0.000529,4
4628,4628,73071,0.000661,5


### 2.2 Item-oriented neighborhood-based baseline model 

The second baseline model used for comparison purposes will be the item-oriented neighborhood-based model (an analogue of k-nearest neighbors). This model will use the **cosine similarity measure** and will predict the preference $\hat p_{ui}$ of user $u$ for item $i$ by **calculating the dot product of the vector of similarities between item $i$ and all items $j$ that user $u$ has purchased and the vector of the respective purchases** $r_{uj}$ of these items: $$\hat p_{ui} = \sum_{j} s_{ij}r_{uj},$$ where $$s_{ij} = \frac{r_{i}^Tr_j}{\Vert r_i \Vert \Vert r_j \Vert}$$ is the similarity between items $i$ and $j$, and $r_i \in \mathbb{R}^m$ contains the $r_{ui}$ values for all $m$ users. The item-item cosine similarities are calculated using the `surprise` library and the baseline model is inspired by the research paper mentioned below.

In [18]:
train_Sij = get_item_cosine_similarity(train)
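To make the two formulas above concrete, here is a toy computation in plain NumPy. It does not use the `surprise` library that the notebook relies on, and the purchase matrix is made up; it only illustrates how $s_{ij}$ and $\hat p_{ui}$ are obtained.

```python
import numpy as np

# Toy user-item purchase matrix R (3 users x 3 items), hypothetical values
R = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 3, 1]], dtype=float)

# Cosine similarity between item columns: s_ij = r_i^T r_j / (||r_i|| ||r_j||)
norms = np.linalg.norm(R, axis=0)
S = (R.T @ R) / np.outer(norms, norms)

# Predicted preference p_ui = sum_j s_ij * r_uj (S is symmetric)
P_hat = R @ S

# Recommend only items the user has not purchased yet
scores = np.where(R > 0, -np.inf, P_hat)
best_new_item = scores.argmax(axis=1)
print(np.round(S, 4))
print(best_new_item)
```

For the full training set a dense item-item matrix of about 7,561 x 7,561 entries is still manageable, but the user-item matrix itself would have to stay sparse.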

### 2.3 Fitting the latent factor model

The predicted user preferences for items will be calculated with the frequently used **Alternating Least Squares (ALS) approximate matrix factorisation** algorithm, implemented in the open-source GitHub repository [implicit](https://github.com/benfred/implicit), which is dedicated to building implicit feedback recommender systems. The documentation of the library is available [here](https://implicit.readthedocs.io/en/latest/). **The first step prior to fitting the model is to create the training-set confidence matrix $C_{iu}$ based on the raw interactions matrix $R_{iu}$ representing the number of user/item interactions (i.e. purchases).** One possible transformation for calculating $C_{iu}$ from $R_{iu}$ is $C_{iu} = 1 + \alpha R_{iu}$, where $\alpha$ is a positive constant (the paper referenced below reports good results with $\alpha = 40$). Both matrices are sparse matrices of type `scipy.sparse.csr_matrix`. The confidence matrix $C_{iu}$ is then used as an input to the ALS algorithm. A detailed description of the algorithm, including the construction of the confidence matrix $C_{iu}$, is provided in the following [research paper](http://yifanhu.net/PUB/cf.pdf).

The initialization and actual fitting of the latent factor model using ALS algorithm is performed using the appropriate values for hyperparameters `factors`, `regularization`, `iterations`, and `alpha`. **After calling the fit method, the $m \times f$ matrix $X$ of `user_factors` and $n \times f$ matrix $Y$ of `item_factors` will be calculated as an output of the latent factor model.**
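For intuition, the alternating updates can be written out in a few lines of dense NumPy. This is a didactic sketch of the algorithm from the referenced paper on a toy matrix, not the `implicit` implementation and not the notebook's `fit_als_latent_factor` helper. The function name, toy data and hyperparameter values are assumptions, and the matrix is oriented users x items here.

```python
import numpy as np

def als_implicit_sketch(R, factors=2, regularization=0.1, alpha=40.0, iterations=10, seed=0):
    """Dense toy ALS for implicit feedback; returns user factors X (m x f),
    item factors Y (n x f) and the loss after each iteration."""
    m, n = R.shape
    P = (R > 0).astype(float)            # binary preferences p_ui
    C = 1.0 + alpha * R                  # confidences c_ui = 1 + alpha * r_ui
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=0.1, size=(m, factors))
    Y = rng.normal(scale=0.1, size=(n, factors))
    lam_I = regularization * np.eye(factors)

    def solve(fixed, conf, pref):
        # Row k solves (F^T C_k F + lambda I) z = F^T C_k p_k
        out = np.empty((conf.shape[0], factors))
        for k in range(conf.shape[0]):
            Ck = np.diag(conf[k])
            out[k] = np.linalg.solve(fixed.T @ Ck @ fixed + lam_I,
                                     fixed.T @ Ck @ pref[k])
        return out

    def loss(X, Y):
        return float((C * (P - X @ Y.T) ** 2).sum()
                     + regularization * ((X ** 2).sum() + (Y ** 2).sum()))

    history = [loss(X, Y)]
    for _ in range(iterations):
        X = solve(Y, C, P)               # update users with items fixed
        Y = solve(X, C.T, P.T)           # update items with users fixed
        history.append(loss(X, Y))
    return X, Y, history

R_toy = np.array([[3, 0, 1],
                  [0, 2, 0],
                  [1, 0, 4],
                  [0, 1, 0]], dtype=float)
X, Y, history = als_implicit_sketch(R_toy)
print(X.shape, Y.shape, round(history[0], 2), round(history[-1], 2))
```

Each half-step solves its least-squares problem exactly, so the regularised loss cannot increase from one iteration to the next. Production implementations avoid the dense diagonal matrices used here by exploiting the sparsity of the interactions.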

### 2.4 Grid search for optimal hyperparameters on validation set using evaluation metrics calculated on predicted preferences

Since there are four hyperparameters to optimize, we decided to search for the optimal combination of values using grid search. This means that the model is fitted on the training set for multiple combinations of hyperparameters, and each combination is evaluated on the validation set using two metrics based on the predicted preferences $\hat p_{ui}$. **Predicted preferences $\hat p_{ui} = x_u^Ty_i$ are the components of the vector $\hat p_u$, which holds the estimated preferences of user $u$ for each item $i$ (by default restricted to items the user did not purchase in the training set). These preferences are calculated using the fitted user and item latent factors.**
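The scoring step can be sketched with plain matrix algebra. The factor matrices below are made-up toy values and `top_n_new_items` is a hypothetical helper; in the notebook the fitted factors come from the ALS model.

```python
import numpy as np

def top_n_new_items(X, Y, R_train, N=3):
    """Score every user-item pair as x_u^T y_i, mask items already purchased
    in training, and return the indices of the N best remaining items."""
    scores = X @ Y.T
    scores = np.where(R_train > 0, -np.inf, scores)
    top = np.argsort(-scores, axis=1)[:, :N]
    return top, scores

X_toy = np.array([[1.0, 0.0],
                  [0.0, 1.0]])                    # 2 users x 2 factors
Y_toy = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.5],
                  [0.7, 0.0]])                    # 4 items x 2 factors
R_train_toy = np.array([[1, 0, 0, 0],
                        [0, 2, 0, 0]])            # training purchases

top, scores = top_n_new_items(X_toy, Y_toy, R_train_toy, N=2)
print(top)
```

User 0 would score item 0 highest, but since that item was already purchased in training it is masked and the list starts with the next best item.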

**The first evaluation metric** calculated to assess the model performance on the validation set is a general-purpose metric proposed by the above-mentioned research paper, called `mean percentile ranking` (`MPR`) of the predicted items. It is calculated using the formula below, where $r_{ui}^t$ is the validation set interaction of user $u$ with item $i$, i.e. the number of purchases of **item $i$, which was not purchased in the training set (new purchases)**, and $rank_{ui}$ is its percentile ranking within the list of predicted items (0% means the top of the list, 100% means the bottom of the list).
$$MPR = \overline{rank} = \frac{\sum_{u,i}r_{ui}^t\times rank_{ui}}{\sum_{u,i}r_{ui}^t}$$
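A toy computation of the formula follows. It is a sketch under two simplifying assumptions that may differ from the notebook's helper: the percentile is computed as (rank - 1) / (n - 1), so that the top item gets exactly 0% and the bottom item 100%, and the score matrix is assumed to contain only items that may be recommended.

```python
import numpy as np

def mean_percentile_rank(scores, R_valid):
    """scores: users x items predicted preferences; R_valid: users x items
    purchase counts in the validation period."""
    n_users, n_items = scores.shape
    order = np.argsort(-scores, axis=1)              # best item first
    ranks = np.empty_like(order)
    rows = np.arange(n_users)[:, None]
    ranks[rows, order] = np.arange(n_items)          # 0 = top of the list
    rank_pct = ranks / (n_items - 1)                 # 0.0 = top, 1.0 = bottom
    return float((R_valid * rank_pct).sum() / R_valid.sum())

scores_toy = np.array([[0.9, 0.5, 0.1],
                       [0.2, 0.8, 0.4]])
R_valid_toy = np.array([[1, 0, 0],
                        [0, 0, 2]], dtype=float)

mpr = mean_percentile_rank(scores_toy, R_valid_toy)
print(f"MPR = {mpr:.4f}")
```

Lower values are better: purchases that land near the top of the predicted list pull the weighted average towards 0%, whereas a random ordering is expected to give about 50%.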

**The second evaluation metric** is specifically tailored to the Notino use case and is based on the well-known `recall` measure. The `mean recall at N` (`MRN`) is calculated as the number of distinct user-item pairs from the TOP-N recommendations that were actually purchased during the validation period, divided by the number of all distinct user-item pairs purchased within the validation period that could have been recommended (i.e. excluding items purchased in the training set).

$$MRN = \frac{\sum_{ui} r_{ui, count}^{t, topN}}{\sum_{ui}r_{ui,  count}^t}$$
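The ratio can be illustrated with plain Python sets. The helper `mean_recall_at_n` and the toy dictionaries are hypothetical and only mirror the definition given in the text.

```python
def mean_recall_at_n(top_n, purchased_new):
    """top_n: user -> list of N recommended items.
    purchased_new: user -> set of items bought in the validation period
    that were not bought in the training period."""
    hits = sum(len(set(top_n.get(user, [])) & items)
               for user, items in purchased_new.items())
    total = sum(len(items) for items in purchased_new.values())
    return hits / total

top_n_toy = {"u1": ["a", "b", "c"], "u2": ["a", "d", "e"]}
purchased_new_toy = {"u1": {"b", "x"}, "u2": {"f"}}

mrn = mean_recall_at_n(top_n_toy, purchased_new_toy)
print(f"MRN = {mrn:.4f}")
```

Only one of the three new purchases (`b` for `u1`) appears in a top-3 list, so the toy value is 1/3. Unlike MPR, this metric ignores purchase counts and the position of an item within the top N.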

Both metrics will be calculated for the popularity-based and neighborhood-based baseline models as well as for the latent factor model.

In [185]:
pPui = grid_search_als_latent_factor(train=train, 
                                     valid=valid, 
                                     train_Sij=train_Sij, 
                                     factors=[250], 
                                     regularizations=[2400], 
                                     alphas=[120], 
                                     iterations=20, 
                                     N=3)

+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
|   users    |  factors   |   lambda   |   alpha    | MPR [Fac]  | MPR [Pop]  | MPR [Knn]  | Rui Count  |  Rui Sum   |     N      | MRN [Fac]  | MRN [Pop]  | MRN [Knn]  |
|   10005    |    250     |    2400    |    120     |  8.5683%   |  10.4856%  |  11.5774%  |   10810    |   11548    |     3      |  5.2359%   |  3.0065%   |  0.6846%   |
+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
Time elapsed: 32.74 minutes.


In [None]:
# Try also user-specific brand-based popularity recommendation as alternative to general popularity
# Try slicing validation set into user bins (frequent buyers) or item bins (frequently purchased items) or gender

### 2.5 Testing the final model on test set 

Let's **initialize the latent factor model and fit it on the training set using the ALS algorithm with the optimal values of the `factors`, `regularization`, `alpha` and `iterations` hyperparameters**. Then, we will predict the top $N=3$ recommendations for each user in the test set, restricted to products that the user did not purchase in the training set. Finally, we will add the recommendations to the test set dataframe as new columns, along with the binary indicator columns `item_not_in_train` and `item_in_topN`. The evaluation metrics will also be calculated and presented.

#### 2.5.1 Fitting the final model

In [116]:
mod_als = fit_als_latent_factor(train=train, 
                                alpha=120, 
                                factors=250, 
                                regularization=2400,
                                iterations=20)



#### 2.5.2 Predicting users preferences and scoring the test set

In [217]:
test_scored = fit_predict_score_als_latent_factor(train=train, 
                                                  test=test, 
                                                  factors=250,
                                                  regularization=2400, 
                                                  iterations=20, 
                                                  alpha=120, 
                                                  N=3,
                                                  K=3,
                                                  mod_als=mod_als,
                                                  explain=False)

+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
|   users    |  factors   |   lambda   |   alpha    | MPR [Fac]  | MPR [Pop]  | MPR [Knn]  | Rui Count  |  Rui Sum   |     N      | MRN [Fac]  | MRN [Pop]  | MRN [Knn]  |
|   103768   |    250     |    2400    |    120     |  8.6714%   |  10.5154%  |  11.4808%  |   111597   |   120412   |     3      |  4.9508%   |  2.5951%   |  0.6676%   |
+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
Time elapsed: 257.67 minutes.


#### 2.5.3 Dumping the scored test set into HDF5 file and MS Excel 

In [221]:
test_scored.to_excel('../output/recommender_perfume_test_scored.xlsx', index=False)
dump_df_to_hdf5(test_scored, dirpath='./data/hdf5/rec', names=['test_scored'])

### 2.6 Analysis on scored test set

In [111]:
test_scored = load_df_from_hdf5(dirpath='./data/hdf5/rec', names=['test_scored'])
print(f'TEST SCORED SET: {len(test_scored):,.0f}')

TEST SCORED SET: 173,280


## 3. Export to HTML

In [None]:
test_scored[:10]

In [90]:
os.system('jupyter nbconvert campaigns_recommender_perfume.ipynb \
          --output ../output/campaigns_recommender_perfume \
          --to html \
          --no-input \
          --no-prompt \
          --SlidesExporter.reveal_transition=fade');

In [None]:
# from scipy.sparse import csc_matrix, coo_matrix
# from scipy.sparse.linalg import svds, eigs

# A = np.array([[1, 0, 0], [5, 0, 2], [0, -1, 0], [0, 0, 3]], dtype=float)
# print('A:\n', A)
# u, s, vt = svds(A, k=2, which='LM')
# u, s, vt

# u*np.sqrt(s) @ (vt * np.sqrt(s)[:, np.newaxis])
# u @ np.diag(s) @ vt

# A = np.array([[1, 0, 0], [5, 0, 2], [0, -1, 0], [0, 0, 3]], dtype=float)
# U, S, VT = np.linalg.svd(A)
# U, S, VT

# norms = np.linalg.norm(u, axis=-1)
# u, norms[:, np.newaxis]

# X = np.array([[4, 1, 3], [8, 3, -2]], dtype=float)
# artist_factors, _, user_factors = np.linalg.svd(X)
# artist_factors[:, :1]
# user_factors

# user = valid_users.values[6]
# user_Pui = pd.DataFrame(valid_Pui[user], columns=['item_idx', 'Pui'])
# user_Pui['rank'] = user_Pui['Pui'].rank(method='average', ascending=False).astype(int)
# user_Pui['rank_pct'] = user_Pui['Pui'].rank(method='average', ascending=False, pct=True)
# user_Pui['user_idx'] = user
# user_Pui['valid_Rui'] = user_Pui.merge(valid[valid.user_idx == user], how='left', on='item_idx')['purchases']
# user_Pui['rank_pct_valid_Rui'] = user_Pui['rank_pct'] * user_Pui['valid_Rui']
# user_Pui[~user_Pui.valid_Rui.isnull()].style.format({'rank_pct_valid_Rui': '{:,.9f}'})

# a = time.time()
# n_items = 1000 # len(train.item_idx.drop_duplicates())
# train_Sij = np.full((n_items, n_items), None)
# data = train[['user_idx', 'item_idx', 'purchases']]

# for x, (i, j) in enumerate(combinations(range(n_items), 2)):
#     users_i = data[data.item_idx == i]
#     users_j = data[data.item_idx == j]
#     users_ij = users_i.merge(users_j, how='inner', on='user_idx')
#     users_ij['ri_rj'] = users_ij.purchases_x * users_ij.purchases_y
#     train_Sij[i, j] = users_ij.ri_rj.sum() / (np.sqrt(sum(users_i['purchases'] ** 2)) * np.sqrt(sum(users_j['purchases'] ** 2)))
#     if x % 1000 == 0:
#         print(x)

# print(f'Time elapsed: {time.time() - a:,.2f} seconds / {(time.time() - a)/60:,.2f} minutes.')

#train_Sij[np.ix_(user_pPui.item_idx, user_Rui.item_idx)] 

#   pPui[user] = mod_als.recommend(user, train_Rui, N=7561, filter_already_liked_items=True)
#     # Percentile-ranking of predicted items within the ordered list of predicted preferences Pui
#     user_pPui = pd.DataFrame(pPui[user], columns=['item_idx', 'pPui'])
#     user_pPui['rank_pct_fac'] = user_pPui['pPui'].rank(method='average', ascending=False, pct=True)
    
#     # Popularity-ranking of items (excluding already purchased items in training set)
#     user_pPui['rank_pct_pop'] = user_pPui.merge(items_pop, how='left', on='item_idx')['rank_pct_pop']    
#     user_pPui['Rui'] = user_pPui.merge(valid[valid.user_idx == user], how='left', on='item_idx')['purchases']
    
#     user_pPui['Rui_rank_pct_fac'] = user_pPui['Rui'] * user_pPui['rank_pct_fac']
#     user_pPui['Rui_rank_pct_pop'] = user_pPui['Rui'] * user_pPui['rank_pct_pop']
#     Rui_rank_pct_fac += user_pPui.Rui_rank_pct_fac.sum()
#     Rui_rank_pct_pop += user_pPui.Rui_rank_pct_pop.sum()
#     Rui += user_pPui.Rui.sum()


# pPui[user] = model.recommend(userid=0, user_items=train_Rui, N=7561, filter_already_liked_items=True)
# # Percentile-ranking of predicted items within the ordered list of predicted preferences Pui
# user_pPui = pd.DataFrame(pPui[user], columns=['item_idx', 'pPui_fac'])
# user_pPui['rank_pct_fac'] = user_pPui['pPui_fac'].rank(method='average', ascending=False, pct=True)
# # Popularity-ranking of items 
# user_pPui['rank_pct_pop'] = user_pPui.merge(items_pop, how='left', on='item_idx')['rank_pct_pop']    
# # Item-item neighborhood-based ranking model
# user_pPui['pPui_knn'] = train_Sij[np.ix_(user_pPui.item_idx, 
#                                           train[train.user_idx == user].item_idx
#                                          )].dot(train[train.user_idx == user].purchases)
# user_pPui['rank_pct_knn'] = user_pPui['pPui_knn'].rank(method='average', ascending=False, pct=True)

# # Mean percentile ranking of all three models
# user_pPui['Rui'] = user_pPui.merge(valid[valid.user_idx == user], how='left', on='item_idx')['purchases']
# user_pPui['Rui_rank_pct_fac'] = user_pPui['Rui'] * user_pPui['rank_pct_fac']
# user_pPui['Rui_rank_pct_pop'] = user_pPui['Rui'] * user_pPui['rank_pct_pop']
# user_pPui['Rui_rank_pct_knn'] = user_pPui['Rui'] * user_pPui['rank_pct_knn']
# Rui_rank_pct_fac += user_pPui.Rui_rank_pct_fac.sum()
# Rui_rank_pct_pop += user_pPui.Rui_rank_pct_pop.sum()
# Rui_rank_pct_knn += user_pPui.Rui_rank_pct_knn.sum()
# Rui += user_pPui.Rui.sum()

#             print(f'\n')
#             print(f'Mean Percentile Ranking [Factor]: {MPR_fac:,.4f}')
#             print(f'Mean Percentile Ranking [Popularity]: {MPR_pop:,.4f}')
#             print(f'Mean Percentile Ranking [Neighborhood]: {MPR_knn:,.4f}')
#             print(f'Time elapsed: {time.time() - a:,.2f} seconds / {(time.time() - a)/60:,.2f} minutes.')

# def grid_search_als_latent_factor(factors=[150], regularizations=[6000, 7000, 8000], 
#                                  iterations=20, chunk=10000, n_chunks=5):
#     n_users = min(chunk * n_chunks, len(valid_users))
#     row = ['users', 'factors', 'regularization', 'alpha', 'MPR [Factor]', 'MPR [Popularity]', 'MPR [Neighborhood]', 'Rui']
#     print_log(row, header=True, spacing=20)
        
#     for factor in factors:
#         for regularization in regularizations:
#            # print(f'ALS algorithm: factors={factor}, regularization={regularization}, iterations={iterations}')
#             mod_als = implicit.als.AlternatingLeastSquares(factors=factor, regularization=regularization, 
#                                                            iterations=iterations, random_state=1,
#                                                            use_cg=True, use_native=True, num_threads=0)
#             mod_als.fit(train_Ciu, show_progress=False)

#             pPui = {}
#             Rui = 0
#             Rui_rank_pct_fac = 0
#             Rui_rank_pct_pop = 0
#             Rui_rank_pct_knn = 0

#             # Percentile-ranking of items within the ordered list of predicted preferences Pui for three models for each user
#             # a = time.time()
#             for i in range(n_chunks):
#                 print(i)
#                 for user in valid_users[i*chunk:(i+1)*chunk]:    
#                     pPui[user] = mod_als.recommend(userid=user, user_items=train_Rui, N=7561, filter_already_liked_items=True)

#                     # Latent factor model-based ranking of predicted items 
#                     user_pPui = pd.DataFrame(pPui[user], columns=['item_idx', 'pPui_fac'])
#                     user_pPui['rank_pct_fac'] = user_pPui['pPui_fac'].rank(method='average', ascending=False, pct=True)

#                     # Popularity-based ranking
#                     user_pPui['rank_pct_pop'] = user_pPui.merge(items_pop, how='left', on='item_idx')['rank_pct_pop']    

#                     # Item-item neighborhood-based ranking
#                     user_pPui['pPui_knn'] = train_Sij[np.ix_(user_pPui.item_idx, 
#                                                               train[train.user_idx == user].item_idx
#                                                              )].dot(train[train.user_idx == user].purchases)
#                     user_pPui['rank_pct_knn'] = user_pPui['pPui_knn'].rank(method='average', ascending=False, pct=True)

#                     # Mean percentile ranking of all three models
#                     user_pPui['Rui'] = user_pPui.merge(valid[valid.user_idx == user], how='left', on='item_idx')['purchases']
#                     user_pPui['Rui_rank_pct_fac'] = user_pPui['Rui'] * user_pPui['rank_pct_fac']
#                     user_pPui['Rui_rank_pct_pop'] = user_pPui['Rui'] * user_pPui['rank_pct_pop']
#                     user_pPui['Rui_rank_pct_knn'] = user_pPui['Rui'] * user_pPui['rank_pct_knn']
#                     Rui_rank_pct_fac += user_pPui.Rui_rank_pct_fac.sum()
#                     Rui_rank_pct_pop += user_pPui.Rui_rank_pct_pop.sum()
#                     Rui_rank_pct_knn += user_pPui.Rui_rank_pct_knn.sum()
#                     Rui += user_pPui.Rui.sum()    

#                     pPui[user] = [i[0] for i in pPui[user][:10]]
#                     del user_pPui

#             MPR_fac = Rui_rank_pct_fac/Rui
#             MPR_pop = Rui_rank_pct_pop/Rui
#             MPR_knn = Rui_rank_pct_knn/Rui
#             row = [n_users, factor, regularization, alpha, MPR_fac, MPR_pop, MPR_knn, int(Rui)]
#             print_log(row, header=False, spacing=20)
#     return pPui       

# mod_als = implicit.als.AlternatingLeastSquares(factors=150, regularization=2100, 
#                                                iterations=20, random_state=1,
#                                                use_cg=True, use_native=True, num_threads=0,
#                                                use_gpu=False)
# mod_als.fit(train_Ciu, show_progress=False)

# user = 10  # [10, 11, 24, 28, 36, 38, 51, 58, 66, 70]
# N = 5
# pPui = {}
# Rui = 0
# Rui_count = 0
# Rui_count_in_topN_pPui_fac = 0
# Rui_count_in_topN_pPui_pop = 0
# Rui_count_in_topN_pPui_knn = 0
# Rui_rank_pct_fac = 0
# Rui_rank_pct_pop = 0
# Rui_rank_pct_knn = 0

# pPui[user] = mod_als.recommend(userid=user, user_items=train_Rui, N=7561, filter_already_liked_items=True)

# # Latent factor model-based ranking of predicted items 
# user_pPui = pd.DataFrame(pPui[user], columns=['item_idx', 'pPui_fac'])
# user_pPui['rank_pct_fac'] = user_pPui['pPui_fac'].rank(method='average', ascending=False, pct=True)
# user_pPui['rank_fac'] = user_pPui['pPui_fac'].rank(method='average', ascending=False).astype(int)

# # Popularity-based ranking
# user_pPui['rank_pct_pop'] = user_pPui.merge(items_pop, how='left', on='item_idx')['rank_pct_pop']   
# user_pPui['rank_pop'] = user_pPui.merge(items_pop, how='left', on='item_idx')['rank_pop']  

# # Item-item neighborhood-based ranking
# user_pPui['pPui_knn'] = train_Sij[np.ix_(user_pPui.item_idx, 
#                                           train[train.user_idx == user].item_idx
#                                          )].dot(train[train.user_idx == user].purchases)
# user_pPui['rank_pct_knn'] = user_pPui['pPui_knn'].rank(method='average', ascending=False, pct=True)
# user_pPui['rank_knn'] = user_pPui['pPui_knn'].rank(method='average', ascending=False).astype(int)

# # Mean percentile ranking of all three models
# user_pPui['Rui'] = user_pPui.merge(valid[valid.user_idx == user], how='left', on='item_idx')['purchases']
# user_pPui['Rui_rank_pct_fac'] = user_pPui['Rui'] * user_pPui['rank_pct_fac']
# user_pPui['Rui_rank_pct_pop'] = user_pPui['Rui'] * user_pPui['rank_pct_pop']
# user_pPui['Rui_rank_pct_knn'] = user_pPui['Rui'] * user_pPui['rank_pct_knn']
# Rui_rank_pct_fac += user_pPui.Rui_rank_pct_fac.sum()
# Rui_rank_pct_pop += user_pPui.Rui_rank_pct_pop.sum()
# Rui_rank_pct_knn += user_pPui.Rui_rank_pct_knn.sum()
# Rui += user_pPui.Rui.sum()    
# Rui_count += user_pPui.Rui.count()
# # Ranks start at 1, so the top-N items are those with rank <= N
# Rui_count_in_topN_pPui_fac += (user_pPui[~user_pPui.Rui.isnull()].rank_fac <= N).sum()
# Rui_count_in_topN_pPui_knn += (user_pPui[~user_pPui.Rui.isnull()].rank_knn <= N).sum()
# Rui_count_in_topN_pPui_pop += (user_pPui[~user_pPui.Rui.isnull()].rank_pop <= N).sum()

# user_pPui[~user_pPui.Rui.isnull()].style
# Rui_count_in_topN_pPui_fac, Rui_count_in_topN_pPui_pop, Rui_count_in_topN_pPui_knn, Rui_count

# pPui = {}
# Rui = 0
# Rui_count = 0
# Rui_rank_pct_fac = 0
# Rui_rank_pct_pop = 0
# Rui_rank_pct_knn = 0
# Rui_count_in_topN_pPui_fac = 0
# Rui_count_in_topN_pPui_pop = 0
# Rui_count_in_topN_pPui_knn = 0

In [None]:
# train_Rui = csr_matrix((train.purchases, (train.user_idx, train.item_idx)))
# score, top_contributions, _ = mod_als.explain(userid=15, user_items=train_Rui, itemid=2488, N=3)
# score, top_contributions

# pPui, metrics = predict_als_latent_factor(mod_als=mod_als, train=train, test=test[test.user_idx==15], N=3)
# test_scored_15 = calc_test_scored(train=train, test=test[test.user_idx==15], pPui=pPui, idx_to_item=idx_to_item, N=3, K=3, mod_als=mod_als)
# test_scored_15
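
The commented-out loop over `combinations(range(n_items), 2)` in the scratch code above computes item-item cosine similarities one pair at a time. As a hedged sketch, the same quantity can be obtained with a single matrix product. The function name `item_cosine_similarity` and the dense toy matrix are illustrative assumptions; a user-item matrix of the size used in this notebook would need a sparse representation instead.

```python
import numpy as np

def item_cosine_similarity(Rui):
    """Cosine similarity between item columns of a dense user-item matrix.

    Rui : array-like of shape (n_users, n_items) holding purchase counts.
    Returns an (n_items, n_items) symmetric matrix.
    """
    Rui = np.asarray(Rui, dtype=float)
    # Euclidean norm of each item column over all users
    norms = np.linalg.norm(Rui, axis=0)
    # Items with no purchases get norm 1 so their similarities stay 0 instead of NaN
    norms[norms == 0] = 1.0
    # Entry (i, j) of Rui.T @ Rui sums r_ui * r_uj over users who bought both items
    return (Rui.T @ Rui) / np.outer(norms, norms)

# Toy matrix (made up): 2 users x 2 items
Sij = item_cosine_similarity([[1, 0],
                              [1, 1]])
```

Unlike the loop, which fills only the entries with `i < j`, this sketch returns the full symmetric matrix including the diagonal.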