## Chapter 7

### 7.1 User-Item similarity recommendations

In this chapter we will explore another approach, in which we project users and items into the same feature space and recommend based on a distance metric. More precisely, we will represent each user by the average of the coupons they purchased or visited.

There is a lot of freedom in how one could combine purchases with visits, and even in how one could weight individual coupon contributions. For example, one could weight each coupon by the number of items the user bought with it.

Because this is mostly an illustration of the technique, I will consider only unique purchased and viewed coupons. I will then combine them using a weight parameter that I will optimize against the Mean Average Precision (MAP).
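
Before going through the real data, here is a minimal sketch of the idea on toy numbers. The arrays, the `user_vector` helper and the value of `alpha` are all made up for illustration; the quantity-weighted variant at the end is just one possible reading of the weighting idea mentioned above.

```python
import numpy as np

# Hypothetical toy data: each row is one coupon described by 3 numeric features
purchased = np.array([[1.0, 0.0, 2.0],
                      [3.0, 0.0, 4.0]])
visited = np.array([[0.0, 2.0, 0.0],
                    [0.0, 4.0, 2.0]])

def user_vector(purchased, visited, alpha):
    """Mean purchased coupon plus alpha times the mean visited coupon."""
    return purchased.mean(axis=0) + alpha * visited.mean(axis=0)

vec = user_vector(purchased, visited, alpha=0.5)
print(vec)

# Possible variant: weight each purchased coupon by the number of items bought
# (the quantities below are invented for the example)
quantities = np.array([1, 3])
weighted_purchase_mean = np.average(purchased, axis=0, weights=quantities)
print(weighted_purchase_mean)
```

The user ends up living in the same space as the coupons, so any distance between a user vector and a coupon vector is well defined.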

Let's have a look at the code.

In [3]:
import pandas as pd
import numpy as np
import pickle
import os

from sklearn.metrics.pairwise import cosine_distances, euclidean_distances
# jaccard_similarity_score is not used below and was removed in scikit-learn 0.23,
# so it is not imported here
from recutils.average_precision import mapk

inp_dir = "../datasets/Ponpare/data_processed/"
train_dir = "train"
valid_dir = "valid"

In [4]:
# training interactions
df_purchases_train = pd.read_pickle(os.path.join(inp_dir, 'train', 'df_purchases_train.p'))
df_visits_train = pd.read_pickle(os.path.join(inp_dir, 'train', 'df_visits_train.p'))
df_visits_train.rename(index=str, columns={'view_coupon_id_hash': 'coupon_id_hash'}, inplace=True)

# train users and coupons features
df_coupons_train_feat = pd.read_pickle(os.path.join(inp_dir, 'train', 'df_coupons_train_feat.p'))
df_user_train_feat = pd.read_pickle(os.path.join(inp_dir, 'train', 'df_users_train_feat.p'))
train_users = df_user_train_feat.user_id_hash.unique()
train_coupons = df_coupons_train_feat.coupon_id_hash.unique()

# subset activities according to the users seen in training
df_vtr = df_visits_train[df_visits_train.user_id_hash.isin(train_users) &
    df_visits_train.coupon_id_hash.isin(train_coupons)]
df_ptr = df_purchases_train[df_purchases_train.user_id_hash.isin(train_users) &
    df_purchases_train.coupon_id_hash.isin(train_coupons)]

When computing the user mean vectors, each purchased coupon will be counted once per user, no matter how many times it was bought (we keep the most recent purchase). I insist, there is a lot of freedom when building a representation of customers and items. Please feel free to try any other approach you think (or know!) will work better.

In [5]:
# Most recent purchase
df_ptr_most_recent = (df_ptr
    .groupby(['user_id_hash','coupon_id_hash'])['days_to_present']
    .min()
    .reset_index())
df_ptr_most_recent.drop('days_to_present', axis=1, inplace=True)

df_ptr_most_recent.head()

Unnamed: 0,user_id_hash,coupon_id_hash
0,0000b53e182165208887ba65c079fc21,38beeadfe3f97e640367eddae4a8c1b5
1,00035b86e6884589ec8d28fbf2fe7757,25a27d420caa1c46a8d3c0572d27868a
2,0005b1068d5f2b8f2a7c978fcfe1ca06,4a79cd05ecb2bf8672e1d955f5faa7fa
3,0005b1068d5f2b8f2a7c978fcfe1ca06,f0f66195d527a5a9509e139ed367b879
4,000cc06982785a19e2a2fdb40b1c9d59,229ff5cc21c8d26615493be7f3b42841


In order to build a representation of users (i.e. projecting users onto the item/coupon feature space), we will treat purchased and viewed coupons separately. This means that if a coupon was viewed and then purchased, only the purchase interaction will be considered when computing the mean. Also, if someone interacted with a coupon many times, we will only count it once, keeping the most recent interaction.
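
To make the rule concrete, here is a small sketch on a made-up visit log. The column names mirror the ones used in this notebook, but the rows are invented, and the anti-join via `merge(..., indicator=True)` is just one way of expressing the filter, not necessarily the one used in the next cell.

```python
import pandas as pd

# Hypothetical toy visit log; purchaseid_hash is missing when the visit
# did not end in a purchase
visits = pd.DataFrame({
    "user_id_hash": ["u1", "u1", "u1", "u2"],
    "coupon_id_hash": ["c1", "c1", "c2", "c1"],
    "purchaseid_hash": [None, "p1", None, None],
    "days_to_present": [10, 8, 5, 3],
})
pairs = ["user_id_hash", "coupon_id_hash"]

# (user, coupon) pairs that ended in a purchase at least once
bought = visits.loc[visits["purchaseid_hash"].notna(), pairs].drop_duplicates()

# Anti-join: keep only the visits whose (user, coupon) pair was never purchased
merged = visits.merge(bought, on=pairs, how="left", indicator=True)
views_only = (merged[merged["_merge"] == "left_only"]
              .groupby(pairs, as_index=False)["days_to_present"]
              .min())
print(views_only)
```

User `u1` viewed and later bought `c1`, so both of those rows disappear from the "views" side, while `u1`/`c2` and `u2`/`c1` survive with their most recent visit.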

In [6]:
df_vtr_visits = df_vtr.copy()
df_vtr_visits['activity_hash'] = df_vtr_visits['user_id_hash'] + "_" + df_vtr_visits['coupon_id_hash']
purchases = df_vtr_visits[~df_vtr_visits.purchaseid_hash.isna()]['activity_hash'].unique()
df_vtr_visits = (df_vtr_visits[~df_vtr_visits.activity_hash
    .isin(purchases)][['user_id_hash','coupon_id_hash','days_to_present']])

# Most recent visit is the view that will be considered
df_vtr_most_recent = (df_vtr_visits
    .groupby(['user_id_hash','coupon_id_hash'])['days_to_present']
    .min()
    .reset_index())
df_vtr_most_recent.drop('days_to_present', axis=1, inplace=True)

df_vtr_most_recent.head()

Unnamed: 0,user_id_hash,coupon_id_hash
0,0000b53e182165208887ba65c079fc21,0645faa156f34104e6d8910160868f9f
1,0000b53e182165208887ba65c079fc21,18097cd25ab6b7e8eb0481b0e3a3cfd8
2,0000b53e182165208887ba65c079fc21,1b581f2ed53f2f2eafbc1560db640194
3,0000b53e182165208887ba65c079fc21,1d04e76c44c231d5d05dc1634d20fe8c
4,0000b53e182165208887ba65c079fc21,2ab16b8f5aeead6d31dbdb9bd59c41db


Merge with coupon features

In [7]:
# Merge with coupon features
df_ptr_most_recent = (df_ptr_most_recent
    .merge(df_coupons_train_feat, on='coupon_id_hash',how='left'))
df_vtr_most_recent = (df_vtr_most_recent
    .merge(df_coupons_train_feat, on='coupon_id_hash',how='left'))
df_ptr_most_recent.head()

Unnamed: 0,user_id_hash,coupon_id_hash,price_rate,catalog_price,discount_price,dispperiod,validperiod,usable_date_mon_cat,usable_date_tue_cat,usable_date_wed_cat,...,dispend_cat,dispperiod_cat,price_rate_cat,catalog_price_cat,discount_price_cat,capsule_text_cat,genre_name_cat,large_area_name_cat,ken_name_cat,small_area_name_cat
0,0000b53e182165208887ba65c079fc21,38beeadfe3f97e640367eddae4a8c1b5,62,7900,2980,3,179,1,1,1,...,6,1,2,1,1,0,0,1,9,11
1,00035b86e6884589ec8d28fbf2fe7757,25a27d420caa1c46a8d3c0572d27868a,52,3110,1490,4,165,1,1,1,...,5,2,1,0,0,0,0,3,6,8
2,0005b1068d5f2b8f2a7c978fcfe1ca06,4a79cd05ecb2bf8672e1d955f5faa7fa,90,1060,100,14,64,3,3,3,...,1,3,2,0,0,8,8,0,2,2
3,0005b1068d5f2b8f2a7c978fcfe1ca06,f0f66195d527a5a9509e139ed367b879,72,72500,19800,5,73,3,3,3,...,3,3,2,2,2,7,7,0,2,2
4,000cc06982785a19e2a2fdb40b1c9d59,229ff5cc21c8d26615493be7f3b42841,79,1950,400,7,131,3,3,3,...,2,3,2,0,0,9,9,0,2,5


Let's do some data manipulation before we compute the User Mean Vectors (hereafter UMV)

In [8]:
# Read the validation coupons features
df_coupons_valid_feat = pd.read_pickle(os.path.join(inp_dir, 'valid', 'df_coupons_valid_feat.p'))

# for convenience, we will flag each dataset (I realised that this could be coded better...)
df_ptr_most_recent['flag_cat'] = 0
df_vtr_most_recent['flag_cat'] = 1
df_coupons_valid_feat['flag_cat'] = 2
flag_cols = ['flag_cat_0','flag_cat_1','flag_cat_2']

# categorical and non categorical columns
cat_cols = [c for c in df_ptr_most_recent.columns if '_cat' in c]
non_cat_cols = [c for c in df_ptr_most_recent.columns if c not in cat_cols]
non_cat_cols_valid = [c for c in non_cat_cols if c != 'user_id_hash']

# again, the one hot encoding needs to be done with all the features 
# (and hence datasets) at once to ensure all datasets will end up with 
# the same number of features
tmp_df = pd.concat([df_ptr_most_recent[cat_cols],
    df_vtr_most_recent[cat_cols],
    df_coupons_valid_feat[cat_cols]],
    ignore_index=True)
df_dummy_feats = pd.get_dummies(tmp_df.astype('category'))
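
Concatenating everything before calling `pd.get_dummies` is what guarantees identical columns across datasets. As a side note, the same guarantee can be obtained by fixing the set of levels up front with `pd.CategoricalDtype`. The sketch below uses two invented toy frames; it is only an illustration under that assumption, not the code used in this notebook.

```python
import pandas as pd

# Hypothetical toy frames: level 2 only appears in validation
train = pd.DataFrame({"genre_name_cat": [0, 1, 1]})
valid = pd.DataFrame({"genre_name_cat": [2, 0]})

# Collect the full set of levels once, across both datasets
levels = sorted(set(train["genre_name_cat"]) | set(valid["genre_name_cat"]))
dtype = pd.CategoricalDtype(categories=levels)

# get_dummies creates one column per declared category, observed or not
train_oh = pd.get_dummies(train.astype(dtype))
valid_oh = pd.get_dummies(valid.astype(dtype))

print(list(train_oh.columns))
print(list(valid_oh.columns))
```

Both frames end up with the same three columns, and there is no need for a flag column to split them apart afterwards.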

In [9]:
# One hot encoded feat for purchases and visits during training
# (there is a better way to do this that I will show in another notebook)
tmp_df_ptr_oh = (df_dummy_feats[df_dummy_feats.flag_cat_0 != 0]
    .drop(flag_cols, axis=1)
    .reset_index(drop=True))
tmp_df_vtr_oh = (df_dummy_feats[df_dummy_feats.flag_cat_1 != 0]
    .drop(flag_cols, axis=1)
    .reset_index(drop=True))
# One hot encoded feat for validation
tmp_df_valid_oh = (df_dummy_feats[df_dummy_feats.flag_cat_2 != 0]
    .drop(flag_cols, axis=1)
    .reset_index(drop=True))

df_ptr_most_recent_oh = pd.concat([df_ptr_most_recent[non_cat_cols],
    tmp_df_ptr_oh], axis=1)
df_vtr_most_recent_oh = pd.concat([df_vtr_most_recent[non_cat_cols],
    tmp_df_vtr_oh], axis=1)
df_coupons_valid_feat_oh = pd.concat([df_coupons_valid_feat[non_cat_cols_valid],
    tmp_df_valid_oh], axis=1)

# let's have a look at one of these dataframes
df_ptr_most_recent_oh.head()

Unnamed: 0,user_id_hash,coupon_id_hash,price_rate,catalog_price,discount_price,dispperiod,validperiod,usable_date_mon_cat_0,usable_date_mon_cat_1,usable_date_mon_cat_2,...,small_area_name_cat_45,small_area_name_cat_46,small_area_name_cat_47,small_area_name_cat_48,small_area_name_cat_49,small_area_name_cat_50,small_area_name_cat_51,small_area_name_cat_52,small_area_name_cat_53,small_area_name_cat_54
0,0000b53e182165208887ba65c079fc21,38beeadfe3f97e640367eddae4a8c1b5,62,7900,2980,3,179,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,00035b86e6884589ec8d28fbf2fe7757,25a27d420caa1c46a8d3c0572d27868a,52,3110,1490,4,165,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,0005b1068d5f2b8f2a7c978fcfe1ca06,4a79cd05ecb2bf8672e1d955f5faa7fa,90,1060,100,14,64,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0005b1068d5f2b8f2a7c978fcfe1ca06,f0f66195d527a5a9509e139ed367b879,72,72500,19800,5,73,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,000cc06982785a19e2a2fdb40b1c9d59,229ff5cc21c8d26615493be7f3b42841,79,1950,400,7,131,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's check that the dimensions are consistent and then compute the UMV

In [10]:
df_coupons_valid_feat_oh.head()

Unnamed: 0,coupon_id_hash,price_rate,catalog_price,discount_price,dispperiod,validperiod,usable_date_mon_cat_0,usable_date_mon_cat_1,usable_date_mon_cat_2,usable_date_mon_cat_3,...,small_area_name_cat_45,small_area_name_cat_46,small_area_name_cat_47,small_area_name_cat_48,small_area_name_cat_49,small_area_name_cat_50,small_area_name_cat_51,small_area_name_cat_52,small_area_name_cat_53,small_area_name_cat_54
0,282b5bda1758e147589ca517e02195c3,50,2980,1490,5,154,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0f43ef71c25d409c250f5a5042806342,50,7810,3905,7,99,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,28ff0fb4b561a2fd6a360fe28f465e07,51,7200,3480,7,178,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,864f351e66cd3aeece5d06987fc2ed4b,50,5649,2790,7,101,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,279ba64539609d30114b68874cd0fb42,55,4400,1980,7,134,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


All fine

In [11]:
# User mean vectors
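# numeric_only=True skips the non-numeric coupon_id_hash column
# (pandas >= 2.0 raises a TypeError otherwise)
user_mean_purchase_vector = (df_ptr_most_recent_oh.groupby('user_id_hash')
    .mean(numeric_only=True)
    .reset_index())
user_mean_visit_vector = (df_vtr_most_recent_oh.groupby('user_id_hash')
    .mean(numeric_only=True)
    .reset_index())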
user_mean_visit_vector.head()

Unnamed: 0,user_id_hash,price_rate,catalog_price,discount_price,dispperiod,validperiod,usable_date_mon_cat_0,usable_date_mon_cat_1,usable_date_mon_cat_2,usable_date_mon_cat_3,...,small_area_name_cat_45,small_area_name_cat_46,small_area_name_cat_47,small_area_name_cat_48,small_area_name_cat_49,small_area_name_cat_50,small_area_name_cat_51,small_area_name_cat_52,small_area_name_cat_53,small_area_name_cat_54
0,0000b53e182165208887ba65c079fc21,57.918367,11393.918367,4151.591837,3.836735,130.040816,0.081633,0.612245,0.020408,0.285714,...,0.0,0.0,0.0,0.0,0.0,0.020408,0.0,0.0,0.0,0.0
1,0005b1068d5f2b8f2a7c978fcfe1ca06,80.0,500.0,100.0,14.0,36.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,000cc06982785a19e2a2fdb40b1c9d59,56.184,6095.608,2647.56,3.608,120.56,0.072,0.448,0.008,0.472,...,0.008,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0013518e41c416cd6a181d277dd8ca0b,58.0,10351.111111,4682.222222,3.666667,99.888889,0.0,0.888889,0.0,0.111111,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,001acdee812a18acfd7509172bed5700,62.098592,10024.859155,3256.0,3.957746,105.943662,0.014085,0.309859,0.0,0.676056,...,0.014085,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's now load the dictionary with all the `user:[items]` interactions during validation. Remember, out of the 6924 users seen during validation, 6071 interacted with at least one validation coupon. These are the users that we are considering for this exercise.

In [12]:
with open("../datasets/Ponpare/data_processed/valid/interactions_valid_dict.p", "rb") as f:
    interactions_valid_dict = pickle.load(f)

There is a final consideration before we move on to the more interesting part of this notebook. Remember that the UMV have been created from the features of the coupons each user interacted with during training. This means that if any of these 6071 users never interacted with any of the coupons **in the coupon training dataset**, that user will not have a corresponding vector. This would throw an error later in the code, when we try to compute the similarity between users and items. Therefore we need to remove these users, if there are any. Let's have a look.

In [13]:
user_mean_purchase_vector_valid = (user_mean_purchase_vector[user_mean_purchase_vector
    .user_id_hash
    .isin(interactions_valid_dict.keys())]
    .reset_index(drop=True))
user_mean_visit_vector_valid = (user_mean_visit_vector[user_mean_visit_vector
    .user_id_hash
    .isin(interactions_valid_dict.keys())]
    .reset_index(drop=True))
users_valid = pd.concat([user_mean_purchase_vector_valid,
    user_mean_visit_vector_valid])['user_id_hash'].unique()

lost_users = [usr for usr in interactions_valid_dict.keys() if usr not in users_valid]

print(lost_users)

['25e2b645bfcd0980b2a5d0a4833f237a']


We lose one user who visited just one coupon, and that coupon is not in the `train_coupons` dataset. Not a big deal.

Overall: 6070 users and 358 coupons. 

In [14]:
del interactions_valid_dict[lost_users[0]]

Now is when things are going to get a bit interesting (we'll see if there is a prize in the end). We have the UMV for both purchases and visits. A priori, one would think that purchases are more important than visits, so we would like to weight them more in the mean. But how much more?

With that in mind we are going to use two Python libraries designed for parameter tuning and optimisation, [hyperopt](https://github.com/hyperopt/hyperopt) and [scikit-optimize](https://scikit-optimize.github.io/). Each has its pros and cons. I normally use `hyperopt`, but I don't think it is maintained anymore. For example, `hyperopt` uses `networkx 1.11` while the latest version is `2.1`. Therefore, if you are using some package that requires a later version of `networkx`, you will need to work in different environments (as I do). `Skopt` does not have that problem, but I usually find it slower and it normally yields slightly worse results.

The way they work is fairly similar. You define an objective function and pass it to a method of the library, which will try to minimise it using some clever algorithm. In the case of `hyperopt`, the algorithm is called [Tree-structured Parzen Estimator Approach (TPE)](https://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf), and so far it has worked very well for me. For `skopt` there are a number of algorithms; just have a look at the link above. Here we will use `gbrt_minimize`, a sequential optimization using gradient boosted trees.

Enough talking, let's have a look at the code. I will first code a function that, given a weight for the features derived from visits, returns the MAP@10.

In [15]:
from hyperopt import hp, tpe
from hyperopt.fmin import fmin
from skopt import gbrt_minimize

def mapk_similarity_rec(alpha, at_random=False):
    """
    function to compute the MAP using a user-item similarity method
    
    Params:
    -------
    alpha: visits weight for the weighted mean: weighted mean = purchases + (alpha*visits)
    
    Returns:
    -------
    map@10
    """

    mpv = user_mean_purchase_vector_valid.copy()
    feat_cols = [c for c in mpv.columns if 'id_hash' not in c]
    mvv = user_mean_visit_vector_valid.copy()
    
    # Weight features coming from visits using alpha
    mvv[feat_cols] = alpha*mvv[feat_cols]

    # Combine the two
    user_vector= (pd.concat([mpv, mvv])
        .groupby('user_id_hash')
        .sum()
        .reset_index())

    user_ids = user_vector.user_id_hash.values
    item_ids = df_coupons_valid_feat_oh.coupon_id_hash.values

    # ensure the same column order
    user_cols = ['user_id_hash'] + [c for c in user_vector.columns if 'id_hash' not in c]
    item_cols = ['coupon_id_hash'] + [c for c in user_vector.columns if 'id_hash' not in c]
    user_feat = user_vector[user_cols[1:]].values
    item_feat = df_coupons_valid_feat_oh[item_cols[1:]].values

    # Compute distances between users and items and rank items from closest to farthest
    # (despite its name, user_item_sim holds distances, so ascending order is what we want)
    user_item_sim = euclidean_distances(user_feat, item_feat)
    top_n_idx = np.argsort(user_item_sim, axis=1)

    recomendations_dict = {}
    for user,idx in zip(user_ids,top_n_idx):
        recomendations_dict[user] = [item_ids[i] for i in idx]

    actual = []
    pred = []
    for k,_ in recomendations_dict.items():
        actual.append(list(interactions_valid_dict[k]))
        pred.append(list(recomendations_dict[k]))

    return mapk(actual, pred)
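
The source of `recutils.average_precision.mapk` is not shown in this notebook. I am assuming it follows the usual definition of MAP@k with `k=10` by default, which can be sketched in plain Python as below. The function names and the toy lists are only for illustration.

```python
def apk_sketch(actual, predicted, k=10):
    """Average precision at k for one user (common definition, assumed here)."""
    if not actual:
        return 0.0
    predicted = predicted[:k]
    score, hits = 0.0, 0
    for i, p in enumerate(predicted):
        # count each relevant item once, at the rank where it first appears
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1)
    return score / min(len(actual), k)

def mapk_sketch(actual, predicted, k=10):
    """Mean of the per-user average precision at k."""
    return sum(apk_sketch(a, p, k) for a, p in zip(actual, predicted)) / len(actual)

# Toy example: two users, invented item ids
actual = [["a", "b"], ["c"]]
predicted = [["a", "x", "b"], ["x", "c"]]
score = mapk_sketch(actual, predicted)
print(score)
```

For the first user the hits are at ranks 1 and 3, giving (1/1 + 2/3)/2; for the second the single hit is at rank 2, giving 1/2. The mean of the two is 2/3.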

Let's now code the true optimization function

In [16]:
def sim_objective(params, method="hyperopt"):
    
    # params input needs to be a dictionary for hyperopt and a list for skopt
    # (strings must be compared with ==, not with the identity operator `is`)
    if method == "hyperopt":
        sim_objective.i+=1
        alpha = params['alpha']
    elif method == "skopt":
        alpha = params[0]

    score = mapk_similarity_rec(alpha)
    
    # Remember this function will be minimized, so it should output -score or 1-score to maximize the score
    return 1-score

Objective functions **MUST depend ONLY** on `params` (or whatever you want to call the parameter space input, whether a dictionary for `hyperopt` or a list of values for `skopt`). Therefore, if we want to use an objective function with more parameters, we need to turn it into a partial function of just `params`. This can easily be done with `lambda` functions.

In [17]:
partial_objective = lambda params: sim_objective(params, method=method)
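
One detail worth knowing: a `lambda` like this looks up `method` when it is called, not when it is defined, so re-assigning `method` later changes its behaviour. If you prefer to freeze the value, `functools.partial` is an alternative. The sketch below uses a hypothetical stand-in objective that simply returns the method name, so the difference is easy to see.

```python
from functools import partial

def toy_objective(params, method="hyperopt"):
    # Hypothetical stand-in for the real objective: report which branch would run
    return method

method = "hyperopt"
late_bound = lambda params: toy_objective(params, method=method)
early_bound = partial(toy_objective, method="hyperopt")

# Re-assigning the variable affects the lambda, but not the partial
method = "skopt"
print(late_bound(None))   # skopt
print(early_bound(None))  # hyperopt
```

In this notebook the late lookup is convenient, because the same `partial_objective` can be reused for both libraries just by changing `method`.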

Let's go! Let's first try `hyperopt`.

In [18]:
hp_params = {'alpha': hp.uniform('alpha', 0.01, 1.)}
method = "hyperopt"
sim_objective.i=0
hp_best = fmin(fn=partial_objective,
    space=hp_params,
    algo=tpe.suggest,
    max_evals=100)

100%|██████████| 100/100 [00:59<00:00,  1.68it/s, best loss: 0.9882249899914203]


Let's see what the best value of alpha is and what the MAP@10 is for that value (one could save all iterations; we will see how in a later notebook. You could also have a look [here](https://github.com/nadbordrozd/text-top-model/blob/master/ttm/tune_params.py)).

In [20]:
hp_best

{'alpha': 0.044104857119314146}

In [21]:
mapk_similarity_rec(hp_best['alpha'])

0.011706886992395348

In [22]:
sk_params = [(0.01, 1, 'uniform')]
method = "skopt"
sk_best = gbrt_minimize(partial_objective,
    sk_params,
    n_calls=100,
    random_state=0,
    verbose=False,
    n_jobs=-1)

In [24]:
# best alpha
sk_best.x

[0.04264725225194222]

In [25]:
# best MAP@10
1-sk_best.fun

0.011707592003181166

All this work just to find out that the MAP@10 is significantly worse than that obtained with "most popular" recommendations ($\sim$0.018). As in previous exercises, let's think for a second about why this could be. The number of ways to combine the different features when calculating the mean vectors is quite large. For example, here I used all the features, and there is a lot of redundancy among them. One might want to use just a subset. Also, we might want to consider all interactions (rather than just the most recent) and weight them according to recency. In summary, having a lot of "freedom" can be a burden sometimes.
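
As an illustration of the recency idea, here is a minimal sketch that weights every interaction with an exponential decay on `days_to_present`. The decay form, the `half_life` value and the toy arrays are all assumptions made for the example; I have not evaluated this variant on the Ponpare data.

```python
import numpy as np

# Hypothetical toy data: two coupons (rows) with two features each,
# and how many days ago the user interacted with each of them
coupons = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
days_to_present = np.array([0.0, 30.0])

def recency_weighted_mean(features, days, half_life=30.0):
    """Mean coupon vector where an interaction loses half its weight every half_life days."""
    weights = 0.5 ** (days / half_life)
    return np.average(features, axis=0, weights=weights)

vec = recency_weighted_mean(coupons, days_to_present)
print(vec)
```

With a 30 day half-life the older coupon counts half as much as the one seen today, so the user vector leans towards the recent coupon. The half-life would become one more parameter to tune, next to `alpha`.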

For this technique to work (in my experience this has never been a winning technique) one needs **a lot** of exploration to find the right setup when combining features to build the user and item representations. In addition, there is no guarantee that such a setup will remain stable as the business changes, for example if there is a large, sudden increase in user base or stock.

OK, at this stage we have explored Most-Common and Distance-Based recommendations. Let's start using more advanced algorithms and see if we can obtain better metrics.