## Chapter 6

### 6.1 Recommending the most popular items

In the previous chapter we presented our evaluation metric, MAP, and computed its value for the case of random recommendations. Let's now move on to the next natural set of recommendations before exploring different algorithms: *"most-popular"* recommendations.

As always, let's start by loading the required packages and defining some useful names.

In [3]:
import pandas as pd
import numpy as np
import pickle
import os

from sklearn.metrics.pairwise import pairwise_distances
from sklearn.preprocessing import MinMaxScaler
from recutils.average_precision import mapk

inp_dir = "/home/ubuntu/projects/RecoTour/datasets/Ponpare/data_processed/"
train_dir = "train"
valid_dir = "valid"

And, as in the previous chapter, let's load the data.

In [5]:
# training interactions
df_purchases_train = pd.read_pickle(os.path.join(inp_dir, 'train', 'df_purchases_train.p'))
df_visits_train = pd.read_pickle(os.path.join(inp_dir, 'train', 'df_visits_train.p'))
df_visits_train.rename(index=str, columns={'view_coupon_id_hash': 'coupon_id_hash'}, inplace=True)

# train users
df_user_train_feat = pd.read_pickle(os.path.join(inp_dir, 'train', 'df_users_train_feat.p'))
train_users = df_user_train_feat.user_id_hash.unique()

Now we need to define *"popularity"*. A priori one could simply use the number of times an item was purchased. However, here we also have information about visits, so we can compute a measure of popularity that combines the two. In our case, we will simply compute popularity as:

$$\text{Coupon Popularity} = N_{purchases} + f \times N_{visits}$$

For this particular exercise, $f = 0.1$.
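
As a quick sanity check of the formula, the sketch below applies it to a few (purchases, visits) pairs copied from the first rows of the `df_popularity` output shown in this section. The helper name `coupon_popularity` is hypothetical and only used for illustration.

```python
# Popularity = N_purchases + f * N_visits, with f = 0.1 as in the text.
F_VISITS = 0.1


def coupon_popularity(n_purchases, n_visits, f=F_VISITS):
    """Weighted sum of purchases and visits for a single coupon (illustrative helper)."""
    return n_purchases + f * n_visits


# (purchase_counts, visit_counts) pairs copied from the df_popularity output
examples = [(5760, 14778), (1511, 3063), (863, 444)]
scores = [round(coupon_popularity(p, v), 1) for p, v in examples]
print(scores)
```

With $f = 0.1$ a visit weighs ten times less than a purchase, so a coupon needs ten visits to gain what a single purchase gives it.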

In [6]:
df_n_purchases = (df_purchases_train
    .coupon_id_hash
    .value_counts()
    .reset_index())
df_n_purchases.columns = ['coupon_id_hash','purchase_counts']
df_n_visits = (df_visits_train
    .coupon_id_hash
    .value_counts()
    .reset_index())
df_n_visits.columns = ['coupon_id_hash','visit_counts']

df_popularity = df_n_purchases.merge(df_n_visits, on='coupon_id_hash', how='left')
df_popularity.fillna(0, inplace=True)
df_popularity['popularity'] = df_popularity['purchase_counts'] + 0.1*df_popularity['visit_counts']
df_popularity.sort_values('popularity', ascending=False , inplace=True)
df_popularity.reset_index(inplace=True, drop=True)

Let's have a look at the dataframe.

In [7]:
df_popularity.head(10)

Unnamed: 0,coupon_id_hash,purchase_counts,visit_counts,popularity
0,a262c7ff56a5cd3de3c5c40443f3018c,5760,14778.0,7237.8
1,3d9029d3ec66802b11ee2645dc16e8cb,1511,3063.0,1817.3
2,09411858ae07c0be91aeeddacf4556b4,1016,2562.0,1272.2
3,7fc6567f470af5356ae97097dbe18486,863,444.0,907.4
4,bf69bd9e0e26fa1f62243d1fcada38f1,663,1810.0,844.0
5,047fb1f23d8cedea8cb86956cfd4b7cf,628,1785.0,806.5
6,229ff5cc21c8d26615493be7f3b42841,494,2626.0,756.6
7,4a79cd05ecb2bf8672e1d955f5faa7fa,466,2623.0,728.3
8,d0e1b63cb7cc32edc3a6c619e4215368,355,3345.0,689.5
9,3e6d617c55328b761d62510167c43c08,504,1686.0,672.6


At this stage we need to consider the fact that the validation coupons were, of course, never seen during training. Therefore, we need to find a way to compute their "popularity". What I use here is one method of many, so please, feel free to try anything you might consider better.

First I will compute the distance between each validation coupon and the top 10 most popular training coupons. Given that we have 358 validation coupons, this first step will result in a matrix of shape (358,10). I will then compute the mean per row, which gives an idea of how similar a validation coupon is to the 10 most popular coupons during training.

The next question is, of course, how do we compute the distance between coupons? In this particular dataset we have a rich set of coupon features, which allows for substantial experimentation. For example, you might want to consider an implementation that combines the [Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) for the one-hot encoded categorical features with the Euclidean distance for the numerical features. There is an implementation of this approach in `recutils`.
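
To make the idea concrete, here is a minimal sketch of such a combined distance. It is not the `recutils` implementation: the function name `mixed_distance` and the weight `alpha` are assumptions introduced only for illustration, and the numerical features are assumed to be already scaled to a comparable range.

```python
import numpy as np


def mixed_distance(cat_a, num_a, cat_b, num_b, alpha=0.5):
    """Blend the Jaccard distance between two binary (one-hot) vectors with the
    Euclidean distance between two numerical vectors.
    alpha is a hypothetical weight, not a value taken from the text."""
    cat_a = np.asarray(cat_a, dtype=bool)
    cat_b = np.asarray(cat_b, dtype=bool)
    union = np.logical_or(cat_a, cat_b).sum()
    inter = np.logical_and(cat_a, cat_b).sum()
    # Jaccard distance = 1 - |A and B| / |A or B|; defined as 0 when both are empty
    jaccard = 0.0 if union == 0 else 1.0 - inter / union
    euclid = np.linalg.norm(np.asarray(num_a, dtype=float) - np.asarray(num_b, dtype=float))
    return alpha * jaccard + (1 - alpha) * euclid


# two toy coupons: 4 one-hot columns and 2 scaled numerical columns each
d = mixed_distance([1, 0, 1, 0], [0.2, 0.5], [1, 0, 0, 1], [0.2, 0.9])
print(round(d, 3))
```

The weight between the two terms is itself something to experiment with, since the Jaccard term is bounded by 1 while the Euclidean term grows with the number of numerical features.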

Here, I will simply stack the two feature-sets and use the cosine distance.

In [8]:
# select top 10 most popular coupons from the training dataset
top10 = df_popularity.coupon_id_hash.tolist()[:10]

# and their features
df_coupons_train_feat = pd.read_pickle(os.path.join(inp_dir, 'train', 'df_coupons_train_feat.p'))
df_top_10_feat = (df_coupons_train_feat[df_coupons_train_feat.coupon_id_hash.isin(top10)]
    .reset_index())

In [9]:
# let's read the validation coupons and select categorical and numerical columns
df_coupons_valid_feat = pd.read_pickle(os.path.join(inp_dir, 'valid', 'df_coupons_valid_feat.p'))
coupons_valid_ids = df_coupons_valid_feat.coupon_id_hash.values
cat_cols = [c for c in df_coupons_train_feat.columns if c.endswith('_cat')]
id_cols = ['coupon_id_hash']
num_cols = [c for c in df_coupons_train_feat.columns if
    (c not in cat_cols) and (c not in id_cols)]
print(cat_cols)
print(num_cols)

['usable_date_mon_cat', 'usable_date_tue_cat', 'usable_date_wed_cat', 'usable_date_thu_cat', 'usable_date_fri_cat', 'usable_date_sat_cat', 'usable_date_sun_cat', 'usable_date_holiday_cat', 'usable_date_before_holiday_cat', 'validperiod_method1_cat', 'validperiod_method2_cat', 'validfrom_method1_cat', 'validfrom_method2_cat', 'validend_method1_cat', 'validend_method2_cat', 'dispfrom_cat', 'dispend_cat', 'dispperiod_cat', 'price_rate_cat', 'catalog_price_cat', 'discount_price_cat', 'capsule_text_cat', 'genre_name_cat', 'large_area_name_cat', 'ken_name_cat', 'small_area_name_cat']
['price_rate', 'catalog_price', 'discount_price', 'dispperiod', 'validperiod']


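The one-hot encoding needs to happen over all the coupons under consideration (the top 10 training coupons and the validation coupons together), so that we account for all possible feature values and both sets end up with the same columns.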
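
The toy example below illustrates why. The column name `genre_cat` and its values are made up for the illustration; the pattern (flag, concatenate, encode, split) mirrors the one used in this section.

```python
import pandas as pd

# toy frames: the two splits do not share the same set of category values
train = pd.DataFrame({"genre_cat": [0, 1]})
valid = pd.DataFrame({"genre_cat": [1, 2]})

# encoding each frame on its own produces mismatched columns
sep_train = pd.get_dummies(train, columns=["genre_cat"])
sep_valid = pd.get_dummies(valid, columns=["genre_cat"])
print(list(sep_train.columns))
print(list(sep_valid.columns))

# encoding the concatenation gives both splits identical columns
both = pd.concat([train.assign(flag=0), valid.assign(flag=1)], ignore_index=True)
joint = pd.get_dummies(both, columns=["genre_cat"])
joint_train = joint[joint.flag == 0].drop("flag", axis=1)
joint_valid = joint[joint.flag == 1].drop("flag", axis=1)
print(list(joint_train.columns) == list(joint_valid.columns))
```

With mismatched columns the two feature matrices could not be compared row against row, which is what the distance computation needs.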

In [10]:
df_top_10_feat['flag'] = 0
df_coupons_valid_feat['flag'] = 1
tmp_df = pd.concat([
    df_top_10_feat[cat_cols+['flag']],
    df_coupons_valid_feat[cat_cols+['flag']]
    ],
    ignore_index=True)
df_dummy_feats = pd.get_dummies(tmp_df, columns=cat_cols)
df_dummy_feats.head()

Unnamed: 0,flag,usable_date_mon_cat_0,usable_date_mon_cat_1,usable_date_mon_cat_2,usable_date_mon_cat_3,usable_date_tue_cat_0,usable_date_tue_cat_1,usable_date_tue_cat_2,usable_date_tue_cat_3,usable_date_wed_cat_0,...,small_area_name_cat_39,small_area_name_cat_40,small_area_name_cat_43,small_area_name_cat_44,small_area_name_cat_45,small_area_name_cat_46,small_area_name_cat_47,small_area_name_cat_48,small_area_name_cat_49,small_area_name_cat_51
0,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
# let's split back to train and validation
df_top_10_feat_oh = (df_dummy_feats[df_dummy_feats.flag == 0]
    .drop('flag', axis=1)
    .values)
coupons_valid_feat_oh = (df_dummy_feats[df_dummy_feats.flag == 1]
    .drop('flag', axis=1)
    .values)
del(tmp_df, df_dummy_feats)

In [12]:
# scaling the numerical features in training and validation
df_top_10_feat_num = df_top_10_feat[num_cols].values
coupons_valid_feat_num = df_coupons_valid_feat[num_cols].values

scaler = MinMaxScaler()
df_top_10_feat_num_norm = scaler.fit_transform(df_top_10_feat_num)
coupons_valid_feat_num_norm = scaler.transform(coupons_valid_feat_num)



And now it is time to compute the distance metric.

In [13]:
coupons_train_feat = np.hstack([df_top_10_feat_num_norm, df_top_10_feat_oh])
coupons_valid_feat = np.hstack([coupons_valid_feat_num_norm, coupons_valid_feat_oh])

dist_mtx = pairwise_distances(coupons_valid_feat, coupons_train_feat, metric='cosine')

# let's check "all makes sense"
dist_mtx.shape

(358, 10)

And finally the validation coupons' "popularity", expressed as how similar each validation coupon is to the most popular training coupons, that is, 1 minus the mean cosine distance.
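
The following sketch reproduces that computation on toy data using only NumPy. The helper `cosine_distance_matrix` is a hypothetical stand-in for `pairwise_distances(..., metric='cosine')` and assumes that no row is all zeros.

```python
import numpy as np


def cosine_distance_matrix(a, b):
    """Pairwise cosine distance between the rows of a and the rows of b.
    Assumes no row has zero norm (illustrative helper)."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_unit @ b_unit.T


# toy features: 3 "validation" coupons and 2 "most popular" training coupons
valid_feat = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
top_feat = np.array([[1.0, 0.0], [1.0, 1.0]])

dist = cosine_distance_matrix(valid_feat, top_feat)  # one row per validation coupon
popularity = 1.0 - dist.mean(axis=1)                 # mean similarity to the top coupons
print(dist.shape)
print(popularity.round(3))
```

The second toy coupon points away from the first popular coupon, so it gets the lowest score. Note that the resulting values only rank validation coupons relative to each other; they are not on the same scale as the training popularity.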

In [14]:
mean_distances = dist_mtx.mean(axis=1)
df_valid_popularity = pd.DataFrame({'coupon_id_hash': coupons_valid_ids,
    'popularity': 1-mean_distances})

df_valid_popularity.head()

Unnamed: 0,coupon_id_hash,popularity
0,282b5bda1758e147589ca517e02195c3,0.193095
1,0f43ef71c25d409c250f5a5042806342,0.175737
2,28ff0fb4b561a2fd6a360fe28f465e07,0.149547
3,864f351e66cd3aeece5d06987fc2ed4b,0.136935
4,279ba64539609d30114b68874cd0fb42,0.30255


Perfect, at this stage we have a measure of popularity for the coupons in the validation set. The code below is identical to that in Chapter 5, so we will save the result of the cell and avoid writing the whole snippet again.

In [15]:
# validation activities
df_purchases_valid = pd.read_pickle(os.path.join(inp_dir, 'valid', 'df_purchases_valid.p'))
df_visits_valid = pd.read_pickle(os.path.join(inp_dir, 'valid', 'df_visits_valid.p'))
df_visits_valid.rename(index=str, columns={'view_coupon_id_hash': 'coupon_id_hash'}, inplace=True)

# subset users that were seen in training. Code below is identical to that
# in the previous chapter. I will save the corresponding dictionary to avoid
# too much code repetition
df_vva = df_visits_valid[df_visits_valid.user_id_hash.isin(train_users)]
df_pva = df_purchases_valid[df_purchases_valid.user_id_hash.isin(train_users)]

id_cols = ['user_id_hash', 'coupon_id_hash']
df_interactions_valid = pd.concat([df_pva[id_cols], df_vva[id_cols]], ignore_index=True)
df_interactions_valid = (df_interactions_valid.groupby('user_id_hash')
    .agg({'coupon_id_hash': 'unique'})
    .reset_index())
tmp_valid_dict = pd.Series(df_interactions_valid.coupon_id_hash.values,
    index=df_interactions_valid.user_id_hash).to_dict()

valid_coupon_ids = df_coupons_valid_feat.coupon_id_hash.values

# keep only users that interacted with at least one validation coupon
keep_users = set()
for user, coupons in tmp_valid_dict.items():
    if np.intersect1d(valid_coupon_ids, coupons).size != 0:
        keep_users.add(user)

interactions_valid_dict = {k: v for k, v in tmp_valid_dict.items() if k in keep_users}
with open("../datasets/Ponpare/data_processed/valid/interactions_valid_dict.p", "wb") as f:
    pickle.dump(interactions_valid_dict, f)

And finally let's recommend the most popular items to every user that interacted with at least one validation coupon during validation.
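
Before running this on the real data, here is a toy sketch of the "every user gets the same popularity ranking" step. It assumes pandas 1.2 or later, where `merge` accepts `how="cross"` and builds the Cartesian product without a helper key column; the user ids, coupon ids and popularity values are made up.

```python
import pandas as pd

users = pd.DataFrame({"user_id_hash": ["u1", "u2"]})
coupons = pd.DataFrame({"coupon_id_hash": ["c1", "c2", "c3"],
                        "popularity": [0.2, 0.9, 0.5]})

# Cartesian product between users and coupons (requires pandas >= 1.2)
pairs = users.merge(coupons, how="cross")

# rank coupons by decreasing popularity within each user
ranked = (pairs
    .sort_values(["user_id_hash", "popularity"], ascending=[True, False])
    .groupby("user_id_hash")["coupon_id_hash"]
    .apply(list)
    .to_dict())
print(ranked)
```

Since the popularity score does not depend on the user, every user receives exactly the same ordered list, which is what makes this a non-personalised baseline.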

In [16]:
# Cartesian Product between users and validation items
left = pd.DataFrame({'user_id_hash':list(interactions_valid_dict.keys())})
left['key'] = 0
# .copy() avoids pandas' SettingWithCopyWarning when adding the 'key' column
right = df_coupons_valid_feat[['coupon_id_hash']].copy()
right['key'] = 0
df_valid = (pd.merge(left, right, on='key', how='outer')
    .drop('key', axis=1))
df_valid = pd.merge(df_valid, df_valid_popularity, on='coupon_id_hash')

# rank based on popularity
df_ranked = df_valid.sort_values(['user_id_hash', 'popularity'], ascending=[False, False])
df_ranked = (df_ranked
    .groupby('user_id_hash')['coupon_id_hash']
    .apply(list)
    .reset_index())
recomendations_dict = pd.Series(df_ranked.coupon_id_hash.values,
    index=df_ranked.user_id_hash).to_dict()

# Compute mapk
actual = []
pred = []
for k,_ in recomendations_dict.items():
    actual.append(list(interactions_valid_dict[k]))
    pred.append(list(recomendations_dict[k]))

print(mapk(actual,pred))

0.01856165521321787


This is a lot better than random, as expected. In fact, let's pause for one second and reflect on that value and the solution in this notebook. 

In this particular dataset we have 18622 coupons and 22624 users in the training dataset. These are relatively small numbers. In addition, the most popular coupons are indeed very popular. The most popular coupon was purchased 5760 times by 5760 different customers. Therefore, it is straightforward to understand that in this scenario, a "most-popular" recommendation approach should work fairly well. In fact, I can anticipate that it will not be easy to improve.

As the data size increases, or if there are no big popularity differences between the items in your dataset, this approach does not normally work that well. Nonetheless, if you are in a company where your boss or "the business" asks you for an "ML-based" recommendation system, you are in a rush, and your scenario is similar to the one described here (a relatively small number of items, some of them really popular), you might want to quickly implement and productionise a most-popular recommendation system and then move to a true algorithmic/ML solution.