## Chapter 9

### 1. KNN Collaborative Filtering

The number of blog posts and the amount of literature about this technique is *gigantic*, so I will not spend much time explaining how it works. For example, have a look [here](https://beckernick.github.io/music_recommender/). Let me also use this opportunity to recommend one of the few good books about recommender systems: [Recommender Systems](https://www.amazon.co.uk/Recommender-Systems-Textbook-Charu-Aggarwal/dp/3319296574/ref=sr_1_1?ie=UTF8&qid=1531491611&sr=8-1&keywords=recommender+systems).

Nonetheless, here is a quick explanation. In Chapter 8 we built an "Interaction Matrix" (hereafter $R$) of dimensions $U\times I$ where $U$ is the number of users and $I$ is the number of items. Each element of that matrix $R_{ij}$ is the interest of user $i$ in coupon $j$. If we transpose this matrix ($R^{T}$) we can use it to compute the similarity between coupons based on user interest. Then, if a user has shown interest in a given coupon, we can recommend similar coupons based on that similarity metric. In other words, we can recommend similar items using item-based collaborative filtering. 
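As a minimal sketch of the idea, the toy example below builds a small $R$ (the values are made up for illustration), transposes it, and computes the cosine similarity between coupons. It is not the code used later in the chapter, only an illustration of what "similarity between coupons based on user interest" means.

```python
import numpy as np

# Toy interaction matrix R (users x items): 4 users, 3 coupons.
# The values are invented for illustration only.
R = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
])

# Item-based CF works on the transpose: one row per coupon.
items = R.T

# Cosine similarity between every pair of coupon rows.
norms = np.linalg.norm(items, axis=1, keepdims=True)
item_sim = (items @ items.T) / (norms * norms.T)

# Most similar coupon to coupon 0, excluding coupon 0 itself.
sim_to_0 = item_sim[0].copy()
sim_to_0[0] = -np.inf
most_similar_to_0 = int(np.argmax(sim_to_0))
print(most_similar_to_0)
```

Coupons 0 and 1 share two interested users, while coupons 0 and 2 share only one, so coupon 1 comes out as the closest neighbour of coupon 0.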

However, as straightforward as this approach might sound, there is an issue we need to address for this particular problem, and for any approach based purely on past interactions between users and items: the coupons that need to be recommended in a given week have never been seen before. Therefore, they are not in $R$.

In real life, if one decides to go with this approach, one could do the following:

1. As in previous chapters, we could compute the distance between validation and training coupons using only coupon features. 
2. We could then build a dictionary mapping validation into training coupons.
3. Use kNN CF "as usual". 
4. Once we have the recommended training coupons, we can map them back to validation. 
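
The four steps above can be sketched on toy data as follows. All ids, distances and neighbour lists are hypothetical and only meant to show how the two dictionaries are used; the real computation appears later in the chapter.

```python
# Hypothetical toy data: three training coupons, two validation coupons.
train_ids = ["t0", "t1", "t2"]
valid_ids = ["v0", "v1"]

# Step 1: feature-based distance from each validation to each training coupon.
feat_dist = {
    "v0": {"t0": 0.1, "t1": 0.8, "t2": 0.9},
    "v1": {"t0": 0.7, "t1": 0.6, "t2": 0.2},
}

# Step 2: dictionaries mapping each coupon to its nearest counterpart.
valid_to_train = {v: min(train_ids, key=lambda t: feat_dist[v][t]) for v in valid_ids}
train_to_valid = {t: min(valid_ids, key=lambda v: feat_dist[v][t]) for t in train_ids}

# Step 3: kNN CF "as usual" on training coupons. These neighbour lists are
# invented; in practice they come from the interaction matrix.
cf_neighbours = {"t0": ["t1", "t2"], "t1": ["t0", "t2"], "t2": ["t1", "t0"]}

def recommend(valid_coupon):
    """Map to training, look up CF neighbours, then map back (step 4)."""
    train_coupon = valid_to_train[valid_coupon]
    similar_train = cf_neighbours[train_coupon]
    return [train_to_valid[t] for t in similar_train]

recs_v0 = recommend("v0")
recs_v1 = recommend("v1")
print(recs_v0, recs_v1)
```

Note that the mapped-back list can contain repeats, because several training coupons may share the same nearest validation coupon.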

As you might have guessed, this approach will be very slow, but will it perform well enough to justify trading away some speed? Let's see.

In [25]:
import numpy as np
import pandas as pd
import os
import pickle
import multiprocessing

from sklearn.metrics.pairwise import pairwise_distances
from scipy.sparse import csr_matrix, load_npz
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors
from recutils.average_precision import mapk

# Make sure you import this package last (read below)
from joblib import Parallel, delayed

inp_dir = "../datasets/Ponpare/data_processed/"
train_dir = "train"
valid_dir = "valid"

In [26]:
df_coupons_train_feat = pd.read_pickle(os.path.join(inp_dir, train_dir, 'df_coupons_train_feat.p'))
df_coupons_valid_feat = pd.read_pickle(os.path.join(inp_dir, valid_dir, 'df_coupons_valid_feat.p'))
coupons_train_ids = df_coupons_train_feat.coupon_id_hash.values
coupons_valid_ids = df_coupons_valid_feat.coupon_id_hash.values

In Chapter 6, we computed the distance between coupons by combining Euclidean and Jaccard distances for the numerical and one-hot encoded features, respectively. Here we will illustrate another approach, which consists of simply stacking the two feature sets and using the cosine distance.
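
For reference, and assuming the standard definition of cosine distance (one minus cosine similarity), the distance between two stacked feature vectors $x = [x_{num}, x_{oh}]$ and $y = [y_{num}, y_{oh}]$ can be written as below. The subscripts $num$ and $oh$ are notation introduced here only for illustration.

$$
d_{\cos}(x, y) = 1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = 1 - \frac{x_{num} \cdot y_{num} + x_{oh} \cdot y_{oh}}{\lVert x \rVert \, \lVert y \rVert}
$$

Stacking therefore lets both blocks of features contribute to a single dot product, instead of combining two separate distances as in Chapter 6.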

Let's first get the one-hot encoded features...

In [27]:
df_coupons_train_feat['flag'] = 0
df_coupons_valid_feat['flag'] = 1

cat_cols = [c for c in df_coupons_train_feat.columns if '_cat' in c]
id_cols = ['coupon_id_hash']
num_cols = [c for c in df_coupons_train_feat.columns if
    (c not in cat_cols) and (c not in id_cols)]

tmp_df = pd.concat([
    df_coupons_train_feat[cat_cols+['flag']],
    df_coupons_valid_feat[cat_cols+['flag']]
    ],
    ignore_index=True)

df_dummy_feats = pd.get_dummies(tmp_df, columns=cat_cols)

coupons_train_feat_oh = (df_dummy_feats[df_dummy_feats.flag == 0]
    .drop('flag', axis=1)
    .values)
coupons_valid_feat_oh = (df_dummy_feats[df_dummy_feats.flag == 1]
    .drop('flag', axis=1)
    .values)
del(tmp_df, df_dummy_feats)

And the numeric ones

In [28]:
coupons_train_feat_num = df_coupons_train_feat[num_cols].values
coupons_valid_feat_num = df_coupons_valid_feat[num_cols].values

scaler = MinMaxScaler()
coupons_train_feat_num_norm = scaler.fit_transform(coupons_train_feat_num)
coupons_valid_feat_num_norm = scaler.transform(coupons_valid_feat_num)



Stack -> distance -> to dictionary.

In [29]:
coupons_train_feat = np.hstack([coupons_train_feat_num_norm, coupons_train_feat_oh])
coupons_valid_feat = np.hstack([coupons_valid_feat_num_norm, coupons_valid_feat_oh])

dist_mtx = pairwise_distances(coupons_valid_feat, coupons_train_feat, metric='cosine')

# Most similar coupons
valid_to_train_top_n_idx = np.argsort(dist_mtx, axis=1)
train_to_valid_top_n_idx = np.argsort(dist_mtx.T, axis=1)

# there is one coupon in validation: '0a8e967835e2c20ac4ed8e69ee3d7349' that
# is never among the most similar to those previously seen.
train_to_valid_most_similar = dict(zip(coupons_train_ids,
    coupons_valid_ids[train_to_valid_top_n_idx[:,0]]))
valid_to_train_most_similar = dict(zip(coupons_valid_ids,
    coupons_train_ids[valid_to_train_top_n_idx[:,0]]))

Before we move on, there is a caveat that we need to mention:

In [30]:
n_valid_coupons_considered = len(set(train_to_valid_most_similar.values()))
n_train_coupons_considered = len(set(valid_to_train_most_similar.values()))

print("number of distinct validation coupons included in the train to validation map: {}".format(n_valid_coupons_considered)) 
print("number of distinct train coupons included in the validation to train map: {}".format(n_train_coupons_considered)) 

number of distinct validation coupons included in the train to validation map: 358
number of distinct train coupons included in the validation to train map: 346


This means that some validation coupons are mapped to the same training coupon. More precisely, the following train coupons appear more than once in the mapping:

In [31]:
from collections import Counter
train_coupons_considered = valid_to_train_most_similar.values()
repeated_train_coupons = Counter(train_coupons_considered).most_common()[:12]
repeated_train_coupons

[('694376875e20968fa307ebbf9cbd20e6', 2),
 ('3beea916a869cefb10cf77a939624521', 2),
 ('16bae16c33056fc3df73d51cd8991ac7', 2),
 ('b0c4c37bb0c5ca1de047707b7be0286c', 2),
 ('578878859b5d03bd0951c1ca8369998b', 2),
 ('68b8f4ff1151b51f864764cab41a30b5', 2),
 ('0f2ef03220f9b2a2f7bccd643c197a5a', 2),
 ('18b118bb3542acce6cbf595db4dc7805', 2),
 ('5028ccee5ea9ae59327245f1873f39c2', 2),
 ('22dce9981cc481ea9c06cee50c852a1a', 2),
 ('f6a8f0f454c308c85fe670d083d8d866', 2),
 ('83aee104d72bd5b59218e8842968007a', 2)]

In [32]:
[k for k,v in valid_to_train_most_similar.items() if v == repeated_train_coupons[0][0]]

['8ea9a399ef134138829f834197acf34b', 'd574d880f08b3eeb7ad731de7bd955f8']

These two validation coupons are mapped to the same train coupon (`694376875e20968fa307ebbf9cbd20e6`). So, what does this mean in terms of recommending? During the recommendation process (see below), we will first map validation to train coupons and then "map back" train to validation coupons. During the first mapping, the 358 validation coupons will be mapped into 346 train coupons. Then, during the second mapping, these 346 training coupons will be mapped back to 346 validation coupons. This means that there are 12 validation coupons that will never be recommended. In real life this is, of course, unacceptable, so one should find a way of including these coupons. There are a number of ways of doing this and they all come with some caveats. I will let the reader explore the different possibilities; for the time being, we will move on knowing that we ignore 12 of the 358 validation coupons.
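
As a starting point for that exploration, here is a minimal sketch that lists which validation coupons are lost in the round trip described above. It uses hypothetical toy dictionaries standing in for `valid_to_train_most_similar` and `train_to_valid_most_similar`, and it follows only the round-trip argument; it does not propose a fix.

```python
# Hypothetical toy maps standing in for valid_to_train_most_similar
# and train_to_valid_most_similar.
valid_to_train = {"v0": "t0", "v1": "t0", "v2": "t2"}
train_to_valid = {"t0": "v0", "t1": "v1", "t2": "v2"}

# Training coupons that at least one validation coupon maps into.
train_reached = set(valid_to_train.values())

# Validation coupons we land on when mapping those training coupons back.
valid_reached = {train_to_valid[t] for t in train_reached}

# Validation coupons that the round trip never returns.
valid_never_reached = sorted(set(valid_to_train) - valid_reached)
print(valid_never_reached)
```

In the toy data, `v0` and `v1` collapse onto `t0`, which maps back only to `v0`, so `v1` is the coupon that drops out.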

Let's now describe the recommendation process:
1. For each user seen in validation that has interacted with validation coupons, collect the user:items interactions.
2. For each validation coupon, get the most similar training coupon.
3. Recommend the N nearest neighbours (kNN) based on `interactions_mtx`, ranked by proximity.
4. Map the training coupons back to validation coupons.

Let's do it

In [33]:
# Load the validation interactions dictionary
interactions_valid_dict = pickle.load(open(os.path.join(inp_dir,valid_dir,"interactions_valid_dict.p"), "rb"))

# let's load the activity matrix and dict of indexes
interactions_mtx = load_npz(os.path.join(inp_dir, train_dir, "interactions_mtx.npz"))

# Remember we built the matrix R as Users x Items, but for knn item based CF we need Items x Users
interactions_mtx_knn = interactions_mtx.T

# item index
items_idx_dict = pickle.load(open(os.path.join(inp_dir, train_dir, "items_idx_dict.p"),'rb'))
idx_item_dict = {v:k for k,v in items_idx_dict.items()}

kNN in two lines...

In [34]:
model = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
model.fit(interactions_mtx_knn)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)

Map validation to train coupons

In [35]:
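# set lookup is O(1); `in` on the numpy array scans every training id
coupons_train_ids_set = set(coupons_train_ids)
interactions_mapped = {}
for user, coupons in interactions_valid_dict.items():
    # per user...
    mapped_coupons = []
    for coupon in coupons:
        # if the coupon is not among the training coupons...
        if coupon not in coupons_train_ids_set: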
            try:
                # then map it to a training coupon (exception in case there
                # were no features for that coupon)
                coupon = valid_to_train_most_similar[coupon]
            except KeyError:
                continue
        mapped_coupons.append(coupon)
    interactions_mapped[user] = mapped_coupons

In [36]:
print(interactions_valid_dict['002ae30377cd30f65652e52618e8b2d6'])
print(interactions_mapped['002ae30377cd30f65652e52618e8b2d6'])

['1ae11153f2bfacec6ab5450d01453c4d' '404d7f06930ed5435f8b87accfeb5329']
['ff47ed446ec0b6bba223c6cfbf82fd1e', '404d7f06930ed5435f8b87accfeb5329']


User `002ae30377cd30f65652e52618e8b2d6` interacted with (visited or purchased) two coupons during the validation period. One of them was already displayed during training, and hence it is in the coupon training data. The other belongs to the validation data and is mapped to its most similar training coupon:

`1ae11153f2bfacec6ab5450d01453c4d` $\rightarrow$ `ff47ed446ec0b6bba223c6cfbf82fd1e`

We are now going to build the recommendations themselves. For this, we need a function that will:
1. Find coupon index in the $R$ matrix
2. Find k (11 in this case) NN
3. Rank them 
4. Map train coupons back to validation

All these tasks can run in parallel, so let's use the `joblib` package for that. Note that when I imported the packages, I imported `joblib` last. This is **IMPORTANT** on Linux. Packages like `numpy`, `scipy` and `pandas` link against multithreaded OpenBLAS libraries, and if you import them after `joblib`, `joblib` will either not run or run very slowly without using all cores. There are a few [ways around](https://stackoverflow.com/questions/15639779/why-does-multiprocessing-use-only-a-single-core-after-i-import-numpy) this, but the easiest one is simply to import `joblib` after all the other required packages (sometimes it works, sometimes it does not). However, **I have never managed to run it in a Jupyter notebook on Linux** (I am on an EC2 instance on AWS). Maybe it is the environment, I don't know. Therefore, the results shown below were obtained on my Mac. I will also include the numbers one would obtain by simply running

`python knn_cf.py`

on the C5 instance (the script contains **identical code**)
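
One workaround commonly suggested for the single-core symptom is to reset the CPU affinity of the process after the numerical imports. The sketch below uses the standard-library call `os.sched_setaffinity`, which only exists on some Unix platforms (not macOS). The helper name `reset_cpu_affinity` is hypothetical, and whether this cures the notebook problem described above is an assumption that has not been verified here.

```python
import os

def reset_cpu_affinity():
    """Allow the current process to run on every CPU again (Linux only)."""
    # os.sched_setaffinity is not available on every platform (e.g. macOS).
    if not hasattr(os, "sched_setaffinity"):
        return None
    try:
        # pid 0 means "the calling process".
        os.sched_setaffinity(0, range(os.cpu_count() or 1))
    except OSError:
        # Restricted environments may refuse; keep the current affinity.
        pass
    return os.sched_getaffinity(0)

allowed_cpus = reset_cpu_affinity()
print(allowed_cpus)
```

If used, it would be called once after importing `numpy`, `scipy` and `pandas`, and before the `Parallel` call.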

In [37]:
# let's put the user:items in tuples, and build a function to run it in Parallel
user_item_tuple = [(k,v) for k,v in interactions_mapped.items()]

def build_recommendations(user,coupons):

    # when ranking the coupons, the most similar ones to the existing ones will be,
    # of course, themselves. We ignore them
    ignore = len(coupons)

    # indexes in the matrix of interactions
    idxs = [items_idx_dict[c] for c in coupons]
    dist, nnidx = model.kneighbors(interactions_mtx_knn[idxs], n_neighbors = 11)
    dist, nnidx = dist.ravel(), nnidx.ravel()

    # rank indexes based on distance
    ranked_dist = np.argsort(dist)[ignore:]
    ranked_cp_idxs = nnidx[ranked_dist]
    ranked_train_cp = [idx_item_dict[i] for i in ranked_cp_idxs]

    # map training into validation coupons
    ranked_valid_cp = [train_to_valid_most_similar[c] for c in ranked_train_cp]

    return (user, ranked_valid_cp)
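
To make the ranking step inside `build_recommendations` easier to follow, here is a toy walk-through of the flatten, sort and skip logic. The distances and indexes are made up, and `kind="stable"` is added only so that the toy output is deterministic.

```python
import numpy as np

# Hypothetical kneighbors output for a user with 2 coupons and n_neighbors=3.
# Each row holds the neighbours of one query coupon; the first column is the
# query coupon itself, at distance 0.
dist = np.array([[0.0, 0.2, 0.5],
                 [0.0, 0.1, 0.4]])
nnidx = np.array([[7, 3, 9],
                  [4, 8, 3]])

# One entry to skip per query coupon.
ignore = dist.shape[0]

# Flatten, sort all neighbours by distance, and drop the first `ignore` entries.
flat_dist, flat_idx = dist.ravel(), nnidx.ravel()
order = np.argsort(flat_dist, kind="stable")[ignore:]
ranked = flat_idx[order].tolist()
print(ranked)
```

Two things are worth noticing. The shortcut of dropping the first `ignore` entries relies on every query coupon being returned as its own zero-distance neighbour, and an item that is a neighbour of several query coupons (index 3 here) appears more than once in the ranked list.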

In [38]:
from time import time
start = time()
cores = multiprocessing.cpu_count()
recommend_coupons = Parallel(n_jobs=cores)(delayed(build_recommendations)(user,coupons) for user,coupons in user_item_tuple)
print(time()-start)

89.3989622592926


When running this on an EC2 C5.4xlarge instance via:

`python knn_cf.py`

it takes around 39 sec.

Let's see how this technique performs
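
Performance is measured with MAP@k through `mapk` from `recutils`, whose source is not shown in this chapter. Assuming it follows the common definition of mean average precision at k, a minimal sketch would look like the code below. The function names and the default `k=10` are assumptions made for illustration.

```python
def apk_sketch(actual, predicted, k=10):
    """Average precision at k for one user (common definition, assumed)."""
    if not actual:
        return 0.0
    predicted = predicted[:k]
    score, hits = 0.0, 0
    for i, p in enumerate(predicted):
        # Count a hit only the first time a relevant item appears.
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1.0)
    return score / min(len(actual), k)

def mapk_sketch(actual, predicted, k=10):
    """Mean of the per-user average precision at k."""
    scores = [apk_sketch(a, p, k) for a, p in zip(actual, predicted)]
    return sum(scores) / len(scores)

# Toy check: two users, hits at ranks 1 and 3 for the first, rank 2 for the second.
toy_actual = [["a", "b"], ["c"]]
toy_pred = [["a", "x", "b"], ["x", "c"]]
toy_map = mapk_sketch(toy_actual, toy_pred)
print(round(toy_map, 4))
```

Under this definition, repeated items in a prediction list are only counted once, and hits near the top of the list are rewarded more than hits further down.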

In [39]:
recommendations_dict = {k:v for k,v in recommend_coupons}
actual = []
pred = []
for k,_ in recommendations_dict.items():
    actual.append(list(interactions_valid_dict[k]))
    pred.append(list(recommendations_dict[k]))

result = mapk(actual, pred)
print(result)

0.04845404501319918


Boom! This is A LOT better than *"most popular"* recommendations. In fact, I can tell you that this will be THE BEST MAP value. **However**, on an EC2 instance with 16 cores it takes ~40 sec to recommend 358 coupons to slightly over 6000 customers. In many setups, this is just not acceptable. The mapping back and forth consumes too much computing time and is prone to bugs. Again, it all depends on your restrictions when you go to production.

In general, Collaborative Filtering is almost *always* my go-to recommendation algorithm. It normally performs really well and there are kNN implementations in [python](https://github.com/spotify/annoy) that are really fast and ready for production (apart from, of course, the `sklearn` one). Therefore, if I faced a problem where I needed to recommend both existing and new products, and I could afford it, I would probably build two algorithms: one based on CF for the existing products with recorded interactions/ratings, and another one based purely on features for new products (or a hybrid version ;) ).

However, it is also possible that we can't afford that, in which case we might prefer other solutions like the one described in the next chapter.