## Chapter 9

### 9.1 KNN Collaborative Filtering

The number of blog posts and papers covering this technique is *gigantic*, so I will not spend much time explaining how it works. For an example, have a look [here](https://beckernick.github.io/music_recommender/). Let me also use this opportunity to recommend a good book on recommender systems: [Recommender Systems](https://www.amazon.co.uk/Recommender-Systems-Textbook-Charu-Aggarwal/dp/3319296574/ref=sr_1_1?ie=UTF8&qid=1531491611&sr=8-1&keywords=recommender+systems).

Nonetheless, here is a quick explanation. In Chapter 8 we built an *"Interaction Matrix"* (hereafter $R$, as a generic reference to a *rating matrix*) of dimensions $U\times I$, where $U$ is the number of users and $I$ is the number of items. Each element $R_{ij}$ of that matrix is the interest of user $i$ in coupon $j$. If we transpose this matrix ($R^{T}$), each row describes a coupon through the users who interacted with it, so we can use it to compute the similarity between coupons based on how users interacted with them. Then, if a user has shown interest in a given coupon, we can recommend similar coupons based on that similarity metric. In other words, we can recommend similar items using **item-based collaborative filtering**.
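To make the idea concrete, here is a minimal sketch of item-based similarity on a toy matrix. The matrix values and the helper names (`item_cosine_similarity`, `most_similar_items`) are made up for illustration and are not part of the chapter's code.

```python
import numpy as np

# Toy rating matrix R (users x items); the values are made up for illustration.
R = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

def item_cosine_similarity(R):
    """Cosine similarity between the columns of R (i.e. the rows of R^T)."""
    items = R.T                                    # items x users
    norms = np.linalg.norm(items, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                        # items nobody interacted with
    unit = items / norms
    return unit @ unit.T                           # items x items

def most_similar_items(sim, item, n=2):
    """Indexes of the n items most similar to `item`, excluding itself."""
    order = np.argsort(-sim[item])
    return [int(j) for j in order if j != item][:n]

sim = item_cosine_similarity(R)
print(most_similar_items(sim, item=0))
```

Items 0 and 1 were seen by exactly the same users, so they come out as each other's closest neighbour.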

However, as straightforward as this approach might sound, there is a caveat here, as in any approach based purely on past interactions between users and items: the coupons that need to be recommended in a given week have never been seen before. Therefore, they are not in $R$.

Here, we are going to "overcomplicate" things a bit, simply because I thought that any "tour" through recommendation algorithms would not be complete without at least illustrating the use of CF. What we will do is the following:

1. Use kNN CF as usual, recommending training coupons based on interactions.
2. As in previous chapters, compute the distance between training and validation coupons using only coupon features.
3. Build a dictionary mapping training coupons to validation coupons.
4. Map the training coupon recommendations to validation coupon recommendations.

Yes, this is overcomplicating, if not outright misusing, the technique. Anyway, one advantage might be that we add some sense of interaction-based recommendation as we recommend the new coupons. Still, you might wonder: *"why not simply recommend to a given user the new validation coupons that most resemble those he/she interacted with during training?!"* And you would be right to ask. I will leave this as an exercise if you want to do it. Simply: 1) take the $N$ coupons a user interacted with during training, 2) find the most similar validation coupons for each of them, and 3) rank them by similarity, maybe adding a weight based on interest.
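One possible shape for that exercise is sketched below. Everything here is an assumption made for illustration: the function names, the toy feature vectors, and the choice of scoring each validation coupon by its best interest-weighted match (summing the weighted similarities would be an equally reasonable choice). It also assumes no feature row is all zeros.

```python
import numpy as np

def cosine_similarity(A, B):
    """Cosine similarity between the rows of A and the rows of B."""
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_unit @ B_unit.T

def recommend_validation_coupons(user_train_feat, interest, valid_feat, valid_ids, top_n=10):
    # 1) similarity between each coupon the user interacted with and each validation coupon
    sim = cosine_similarity(user_train_feat, valid_feat)   # (n_user_coupons, n_valid)
    # 2) weight each row by the user's interest in that training coupon
    weighted = sim * interest[:, None]
    # 3) score each validation coupon by its best weighted match and rank
    scores = weighted.max(axis=0)
    order = np.argsort(-scores)[:top_n]
    return [valid_ids[i] for i in order]

# Toy data (made up): two training coupons the user interacted with, three validation coupons
user_train_feat = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0]])
interest = np.array([1.0, 0.5])
valid_feat = np.array([[0.0, 0.0, 1.0],
                       [1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
valid_ids = ["v0", "v1", "v2"]

recs = recommend_validation_coupons(user_train_feat, interest, valid_feat, valid_ids, top_n=3)
print(recs)
```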

With all that in mind, let's go.

In [1]:
import numpy as np
import pandas as pd
import os
import pickle
import multiprocessing

from time import time
from sklearn.metrics.pairwise import pairwise_distances
from scipy.sparse import csr_matrix, load_npz
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors
from recutils.average_precision import mapk

# Make sure you import this package last (read below)
from joblib import Parallel, delayed

inp_dir = "../datasets/Ponpare/data_processed/"
train_dir = "train"
valid_dir = "valid"

In [2]:
df_coupons_train_feat = pd.read_pickle(os.path.join(inp_dir, train_dir, 'df_coupons_train_feat.p'))
df_coupons_valid_feat = pd.read_pickle(os.path.join(inp_dir, valid_dir, 'df_coupons_valid_feat.p'))
coupons_train_ids = df_coupons_train_feat.coupon_id_hash.values
coupons_valid_ids = df_coupons_valid_feat.coupon_id_hash.values

As in Chapter 6, we compute the distance between coupons by simply stacking the one-hot encoded and numerical features and using the cosine distance.

Let's first get the one-hot encoded features...

In [3]:
# let's add a flag for convenience
df_coupons_train_feat['flag_cat'] = 0
df_coupons_valid_feat['flag_cat'] = 1

# There are better ways of doing this. See Chapter 14, for example. 
# In any case, computation-wise it takes the same time
flag_cols = ['flag_cat_0','flag_cat_1']

cat_cols = [c for c in df_coupons_train_feat.columns if '_cat' in c]
id_cols = ['coupon_id_hash']
num_cols = [c for c in df_coupons_train_feat.columns if
    (c not in cat_cols) and (c not in id_cols)]

tmp_df = pd.concat([df_coupons_train_feat[cat_cols],
    df_coupons_valid_feat[cat_cols]],
    ignore_index=True)

df_dummy_feats = pd.get_dummies(tmp_df.astype('category'))

coupons_train_feat_oh = (df_dummy_feats[df_dummy_feats.flag_cat_0 != 0]
    .drop(flag_cols, axis=1)
    .values)
coupons_valid_feat_oh = (df_dummy_feats[df_dummy_feats.flag_cat_1 != 0]
    .drop(flag_cols, axis=1)
    .values)

And the numeric ones

In [4]:
coupons_train_feat_num = df_coupons_train_feat[num_cols].values
coupons_valid_feat_num = df_coupons_valid_feat[num_cols].values

scaler = MinMaxScaler()
coupons_train_feat_num_norm = scaler.fit_transform(coupons_train_feat_num)
coupons_valid_feat_num_norm = scaler.transform(coupons_valid_feat_num)
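
Note that the scaler is fitted on the training coupons only and then applied to the validation coupons. A minimal NumPy sketch of what that implies, assuming the scaler's default `(0, 1)` range and using made-up numbers:

```python
import numpy as np

# Toy numeric features (made up): three training coupons, one validation coupon
train = np.array([[10.0, 1.0],
                  [20.0, 3.0],
                  [30.0, 5.0]])
valid = np.array([[40.0, 2.0]])

# min and max are learned from the training data only
col_min, col_max = train.min(axis=0), train.max(axis=0)

train_norm = (train - col_min) / (col_max - col_min)
valid_norm = (valid - col_min) / (col_max - col_min)

print(train_norm)
print(valid_norm)
```

Training values end up in $[0, 1]$, while a validation value outside the training range (40 here) lands outside that interval.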



Stack -> distance -> to dictionary.

In [5]:
coupons_train_feat = np.hstack([coupons_train_feat_num_norm, coupons_train_feat_oh])
coupons_valid_feat = np.hstack([coupons_valid_feat_num_norm, coupons_valid_feat_oh])

dist_mtx = pairwise_distances(coupons_valid_feat, coupons_train_feat, metric='cosine')

# now we have a matrix of distances, let's build the dictionaries
valid_to_train_top_n_idx = np.argsort(dist_mtx, axis=1)
train_to_valid_top_n_idx = np.argsort(dist_mtx.T, axis=1)
# there is one coupon in validation: '0a8e967835e2c20ac4ed8e69ee3d7349' that
# is never among the most similar to those previously seen.
train_to_valid_most_similar = dict(zip(coupons_train_ids,
    coupons_valid_ids[train_to_valid_top_n_idx[:,0]]))
valid_to_train_most_similar = dict(zip(coupons_valid_ids,
    coupons_train_ids[valid_to_train_top_n_idx[:,0]]))
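
A toy version of the mapping built in the cell above may help to see why a validation coupon can end up never being recommended. The ids and distances below are made up, and `argmin` is used as a shortcut for taking the first column of the sorted indexes.

```python
import numpy as np

# Toy cosine distances (made up): rows are validation coupons, columns are training coupons
valid_ids = np.array(["v0", "v1", "v2"])
train_ids = np.array(["t0", "t1", "t2", "t3"])
dist = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.1, 0.2],
    [0.2, 0.3, 0.4, 0.5],
])

# for every training coupon (column), the closest validation coupon (row)
train_to_valid = dict(zip(train_ids.tolist(), valid_ids[dist.argmin(axis=0)].tolist()))
# for every validation coupon (row), the closest training coupon (column)
valid_to_train = dict(zip(valid_ids.tolist(), train_ids[dist.argmin(axis=1)].tolist()))

# validation coupons that are no training coupon's nearest neighbour
never_recommended = sorted(set(valid_ids.tolist()) - set(train_to_valid.values()))
print(train_to_valid)
print(never_recommended)
```

Here `v2` is never the closest validation coupon to any training coupon, so a top-1 mapping from training to validation coupons cannot surface it.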

Let's quickly build a dictionary of interactions during training

In [6]:
# build a dictionary of interactions during training
df_interest = pd.read_pickle(os.path.join(inp_dir, train_dir, "df_interest.p"))
df_interactions_train = (df_interest.groupby('user_id_hash')
    .agg({'coupon_id_hash': 'unique'})
    .reset_index())
interactions_train_dict = pd.Series(df_interactions_train.coupon_id_hash.values,
    index=df_interactions_train.user_id_hash).to_dict()

Load the interactions matrix and user/item indexes

In [7]:
# let's load the activity matrix and dict of indexes
interactions_mtx = load_npz(os.path.join(inp_dir, train_dir, "interactions_mtx.npz"))

# We built the matrix as user x items, but for knn item based CF we need items x users
interactions_mtx_knn = interactions_mtx.T

# users and items indexes
with open(os.path.join(inp_dir, train_dir, "items_idx_dict.p"), 'rb') as f:
    items_idx_dict = pickle.load(f)
with open(os.path.join(inp_dir, train_dir, "users_idx_dict.p"), 'rb') as f:
    users_idx_dict = pickle.load(f)
idx_item_dict = {v:k for k,v in items_idx_dict.items()}

And run knn in two lines

In [8]:
# Let's build the KNN model...two lines :)
model = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
model.fit(interactions_mtx_knn)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)

Load the dictionary of interactions during validation

In [9]:
# Load the validation interactions dictionary
with open(os.path.join(inp_dir, valid_dir, "interactions_valid_dict.p"), "rb") as f:
    interactions_valid_dict = pickle.load(f)
# Remember that there is one user who visited a single coupon, and that coupon is not
# in the training set of coupons and, in consequence, not in the interactions matrix.
interactions_valid_dict.pop("25e2b645bfcd0980b2a5d0a4833f237a")

array(['fe28f9f9055fde46855b1520a40e3c08'], dtype=object)

In [10]:
# just for convenience
user_items_tuple = [(k,v) for k,v in interactions_valid_dict.items()]

def build_recommendations(user):
    #given a user seen during training and validation, get her training interactions
    coupons = interactions_train_dict[user]
    
    # get the training coupon indexes
    idxs = [items_idx_dict[c] for c in coupons]
    
    # compute the k=11 NN
    dist, nnidx = model.kneighbors(interactions_mtx_knn[idxs], n_neighbors = 11)

    # Drop the 1st result as the closest to a coupon is always itself
    dist, nnidx = dist[:, 1:], nnidx[:,1:]
    dist, nnidx = dist.ravel(), nnidx.ravel()

    # rank based on distances and keep the top 50 (10 is really enough)
    ranked_dist = np.argsort(dist)
    ranked_cp_idxs = nnidx[ranked_dist][:50]

    # recover the train coupon ids from their indexes and map them to validation coupons
    ranked_cp_ids  = [idx_item_dict[i] for i in ranked_cp_idxs]
    ranked_cp_idxs_valid = [train_to_valid_most_similar[c] for c in ranked_cp_ids]

    return (user,ranked_cp_idxs_valid)
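
The flatten-and-rank step in the function above can be seen in isolation with the toy sketch below. The arrays stand in for the output of `kneighbors` and are made up. The de-duplication step is an optional extra that the cell above does not perform: the same neighbour can appear for several of a user's coupons, and several training coupons can map to the same validation coupon. The helper name `rank_neighbours` is hypothetical.

```python
import numpy as np

def rank_neighbours(dist, nnidx, top_n=50):
    """Flatten per-coupon neighbour lists and rank all candidates by distance."""
    dist, nnidx = dist.ravel(), nnidx.ravel()
    order = np.argsort(dist, kind="stable")
    ranked = nnidx[order]
    # keep only the first (closest) occurrence of each candidate;
    # an optional step that the original cell does not perform
    _, first_pos = np.unique(ranked, return_index=True)
    return ranked[np.sort(first_pos)][:top_n].tolist()

# Toy neighbours for a user with two coupons and 3 neighbours each,
# self-matches already dropped (values made up)
dist = np.array([[0.10, 0.40, 0.70],
                 [0.20, 0.30, 0.90]])
nnidx = np.array([[5, 2, 9],
                  [2, 7, 5]])

ranked = rank_neighbours(dist, nnidx)
print(ranked)
```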

Initially I tried to use the `joblib` package to run the function in the cell above in parallel. Note that I imported `joblib` last. This is **IMPORTANT** when using Linux. Packages like `numpy`, `scipy` or `pandas` link against multithreaded OpenBLAS libraries. As a result, if you import them after `joblib`, `joblib` will not run, or will run very slowly without using all cores. There are a few [ways around](https://stackoverflow.com/questions/15639779/why-does-multiprocessing-use-only-a-single-core-after-i-import-numpy) this, but the easiest one is simply to import `joblib` after all the other required packages (sometimes it works, sometimes it does not). However, **I have never managed to run it in a Jupyter notebook on Linux** (I am on an EC2 instance on AWS). Maybe it is the environment, I don't know.
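
As far as I can tell, the Stack Overflow thread linked above attributes the single-core behaviour to the CPU affinity of the process being changed when those libraries are imported. A quick diagnostic, sketched under that assumption, is to compare the number of cores the process is allowed to use with the total number of cores. The helper name `usable_cpu_count` is hypothetical, and `os.sched_getaffinity` is only available on some platforms (Linux among them), hence the fallback.

```python
import os

def usable_cpu_count():
    """Number of CPUs the current process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):    # not available on every platform
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1

print(usable_cpu_count(), "of", os.cpu_count(), "cores usable")
```

If the first number is smaller than the second after all imports have run, the affinity has probably been restricted, which would be consistent with the workers piling up on one core.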

Eventually I decided to use `multiprocessing.Pool`

In [11]:
start = time()
cores = multiprocessing.cpu_count()
all_users = list(interactions_valid_dict.keys())
# the context manager shuts the worker processes down once the map is done
with multiprocessing.Pool(cores) as pool:
    recommend_coupons = pool.map(build_recommendations, all_users)
print(time()-start)

302.82322239875793


Let's see how this technique performs
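
The next cell scores the recommendations with `mapk` from `recutils`. For reference, here is a minimal sketch of mean average precision at $k$ as it is commonly defined. It is an assumption that `recutils.average_precision.mapk` follows this definition and defaults to $k=10$; the `_sketch` names are hypothetical, chosen to avoid shadowing the imported function.

```python
def apk_sketch(actual, predicted, k=10):
    """Average precision at k for a single user."""
    if not actual:
        return 0.0
    predicted = predicted[:k]
    score, hits = 0.0, 0
    for i, p in enumerate(predicted):
        # count each relevant item once, at its first position in the ranking
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1.0)
    return score / min(len(actual), k)

def mapk_sketch(actual, predicted, k=10):
    """Mean of the per-user average precision at k."""
    return sum(apk_sketch(a, p, k) for a, p in zip(actual, predicted)) / len(actual)

# Toy check (made up): the first user is a perfect hit, the second finds the item at rank 2
print(mapk_sketch([["a"], ["b"]], [["a"], ["x", "b"]], k=2))
```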

In [12]:
recommendations_dict = {k:v for k,v in recommend_coupons}
actual = []
pred = []
for k,_ in recommendations_dict.items():
    actual.append(list(interactions_valid_dict[k]))
    pred.append(list(recommendations_dict[k]))

result = mapk(actual, pred)
print(result)

0.019374116349816513


Okay, so this is just slightly better than "most popular" recommendations. As I mentioned at the beginning, kNN CF is not the best technique for this problem.

However, I can tell you that for scenarios where one has to recommend products in stock to existing customers, Collaborative Filtering is almost *always* my go-to recommendation algorithm. It normally performs really well, and there are (approximate) kNN implementations in [python](https://github.com/spotify/annoy) that are really fast and ready for production (apart from, of course, the `sklearn` one). Therefore, if I faced a problem where I needed to recommend both existing and new products, and I could afford it, I would probably build two algorithms: one based on CF for the existing products with recorded interactions/ratings, and another one based purely on features for the new products (or a hybrid version).