## Chapter 9

### 1. KNN Collaborative Filtering

The number of blog posts and papers about this technique is *gigantic*, so I will not spend much time explaining how it works. For an example, have a look [here](https://beckernick.github.io/music_recommender/). Let me also take this opportunity to recommend one of the few good books about recommender systems: [Recommender Systems](https://www.amazon.co.uk/Recommender-Systems-Textbook-Charu-Aggarwal/dp/3319296574/ref=sr_1_1?ie=UTF8&qid=1531491611&sr=8-1&keywords=recommender+systems).

Nonetheless, here is a quick explanation. In Chapter 8 we built an "Interaction Matrix" (hereafter $R$) of dimensions $U\times I$, where $U$ is the number of users and $I$ is the number of items. Each element $R_{ij}$ of that matrix is the interest of user $i$ in coupon $j$. If we transpose this matrix ($R^{T}$), each row corresponds to a coupon, and we can use it to compute the similarity between coupons based on user interest. Then, if a user has shown interest in a given coupon, we can recommend similar coupons based on that similarity metric. In other words, we can recommend similar items using item-based collaborative filtering.
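
To make the idea concrete, here is a minimal sketch of item-based similarity on a toy interaction matrix. The matrix values and the helper names (`item_cosine_similarity`, `most_similar_items`) are made up for illustration and are not part of the code used in this chapter.

```python
import numpy as np

# Toy interaction matrix R (users x items). The values are invented
# purely for illustration.
R = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

def item_cosine_similarity(R):
    """Cosine similarity between the rows of R.T, i.e. between items."""
    items = R.T
    norms = np.linalg.norm(items, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for items with no interactions
    unit = items / norms
    return unit @ unit.T

def most_similar_items(sim, item, k=1):
    """Indices of the k items most similar to `item`, excluding the item itself."""
    order = np.argsort(-sim[item])
    return [int(j) for j in order if j != item][:k]

sim = item_cosine_similarity(R)
print(most_similar_items(sim, 0))  # item 1 shares its users with item 0
```

Users who liked item 0 also liked item 1, so item 1 comes out as its nearest neighbour, and likewise for items 2 and 3.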

However, as straightforward as this approach might sound, there is an issue we need to address for this particular problem, and for any approach based purely on past interactions between users and items: the coupons that need to be recommended in a given week have never been seen before. Therefore, they are not in $R$.

In real life, if one decided to go with this approach, one could do the following:

1. As in previous chapters, compute the distance between validation and training coupons using only coupon features.
2. Build a dictionary mapping validation coupons to training coupons.
3. Use kNN CF "as usual".
4. Once we have the recommended training coupons, map them back to validation coupons.
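
The four steps above can be sketched on toy data as follows. This is only an illustration under simplifying assumptions: the ids, features, interaction matrix and the `recommend` helper are all hypothetical, and the scoring rule (sum of item similarities weighted by the user's interest) is one simple variant of item-based kNN CF.

```python
import numpy as np

# Hypothetical toy data: 3 training coupons, 2 validation coupons.
train_ids = np.array(["t0", "t1", "t2"])
valid_ids = np.array(["v0", "v1"])
train_feat = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
valid_feat = np.array([[0.9, 0.1], [0.1, 0.8]])

# Toy interactions (users x training coupons).
R = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 1.0, 0.0],
])

def cosine_distance(A, B):
    """Pairwise cosine distance between the rows of A and the rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - A @ B.T

# Steps 1 and 2: distances from coupon features only, then the mappings.
dist = cosine_distance(valid_feat, train_feat)  # shape (n_valid, n_train)
valid_to_train = dict(zip(valid_ids, train_ids[dist.argmin(axis=1)]))
train_to_valid = dict(zip(train_ids, valid_ids[dist.argmin(axis=0)]))

# Step 3: item-based CF on the training interactions.
item_sim = 1.0 - cosine_distance(R.T, R.T)

def recommend(user_idx, k=2):
    """Rank training coupons for a user, then map them to validation coupons."""
    scores = R[user_idx] @ item_sim
    ranked_train = train_ids[np.argsort(-scores, kind="stable")]
    # Step 4: map back to validation coupons, dropping duplicates in order.
    recs = []
    for t in ranked_train:
        v = str(train_to_valid[t])
        if v not in recs:
            recs.append(v)
    return recs[:k]

print(recommend(0))
print(recommend(1))
```

Note that several training coupons can map to the same validation coupon, which is why the last step removes duplicates while preserving the ranking.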

As you might have guessed, this approach will be very slow. Will it perform well enough to justify the loss of speed? Let's see.

In [46]:
import numpy as np
import pandas as pd
import os
import pickle
import multiprocessing

from joblib import Parallel, delayed
from sklearn.metrics.pairwise import pairwise_distances
from scipy.sparse import csr_matrix, load_npz
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import NearestNeighbors
from recutils.average_precision import mapk

inp_dir = "../datasets/Ponpare/data_processed/"
train_dir = "train"
valid_dir = "valid"

In [57]:
df_coupons_train_feat = pd.read_pickle(os.path.join(inp_dir, train_dir, 'df_coupons_train_feat.p'))
df_coupons_valid_feat = pd.read_pickle(os.path.join(inp_dir, valid_dir, 'df_coupons_valid_feat.p'))
coupons_train_ids = df_coupons_train_feat.coupon_id_hash.values
coupons_valid_ids = df_coupons_valid_feat.coupon_id_hash.values

We will use a different approach to compute the distance between training and validation coupons than the one described in Chapter 6. There, we combined Euclidean and Jaccard distances, computed separately for the numerical and one-hot encoded features. Here, we will stack the two feature sets and use the cosine distance.
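
As a reminder, for two stacked feature vectors $a$ and $b$ (notation introduced here only for illustration), the cosine distance is:

$$
d_{\cos}(a, b) = 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
$$

It depends only on the angle between the two vectors, not on their lengths. For non-negative features, such as min-max scaled numerical columns and one-hot encoded columns, it lies between 0 (same direction) and 1 (no feature in common).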

Let's first get the one-hot encoded features:

In [59]:
df_coupons_train_feat['flag'] = 0
df_coupons_valid_feat['flag'] = 1

cat_cols = [c for c in df_coupons_train_feat.columns if '_cat' in c]
id_cols = ['coupon_id_hash']
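# 'flag' is only a helper used to split train/valid after get_dummies,
# so it must not be treated as a numeric feature
num_cols = [c for c in df_coupons_train_feat.columns if
    (c not in cat_cols) and (c not in id_cols) and (c != 'flag')]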

tmp_df = pd.concat([
    df_coupons_train_feat[cat_cols+['flag']],
    df_coupons_valid_feat[cat_cols+['flag']]
    ],
    ignore_index=True)

df_dummy_feats = pd.get_dummies(tmp_df, columns=cat_cols)

coupons_train_feat_oh = (df_dummy_feats[df_dummy_feats.flag == 0]
    .drop('flag', axis=1)
    .values)
coupons_valid_feat_oh = (df_dummy_feats[df_dummy_feats.flag == 1]
    .drop('flag', axis=1)
    .values)
del tmp_df, df_dummy_feats

And now the numeric ones:

In [60]:
coupons_train_feat_num = df_coupons_train_feat[num_cols].values
coupons_valid_feat_num = df_coupons_valid_feat[num_cols].values

scaler = MinMaxScaler()
coupons_train_feat_num_norm = scaler.fit_transform(coupons_train_feat_num)
coupons_valid_feat_num_norm = scaler.transform(coupons_valid_feat_num)



Finally, we stack the two feature sets, compute the distance matrix, and build the mapping dictionaries.

In [65]:
coupons_train_feat = np.hstack([coupons_train_feat_num_norm, coupons_train_feat_oh])
coupons_valid_feat = np.hstack([coupons_valid_feat_num_norm, coupons_valid_feat_oh])

dist_mtx = pairwise_distances(coupons_valid_feat, coupons_train_feat, metric='cosine')

# sort each row by increasing distance: column 0 holds the nearest coupon
valid_to_train_top_n_idx = np.argsort(dist_mtx, axis=1)
train_to_valid_top_n_idx = np.argsort(dist_mtx.T, axis=1)

# There is one validation coupon, '0a8e967835e2c20ac4ed8e69ee3d7349', that
# is never the most similar coupon to any of the training coupons.
train_to_valid_most_similar = dict(zip(coupons_train_ids,
    coupons_valid_ids[train_to_valid_top_n_idx[:,0]]))
valid_to_train_most_similar = dict(zip(coupons_valid_ids,
    coupons_train_ids[valid_to_train_top_n_idx[:,0]]))
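
The remark in the comment above can be checked directly: a validation coupon that is never the nearest neighbour of any training coupon will not appear among the values of the train-to-validation dictionary, so it cannot be reached when mapping recommendations back. A minimal sketch on toy data (the ids and distances below are hypothetical):

```python
import numpy as np

# Hypothetical ids and distances, for illustration only.
train_ids = np.array(["t0", "t1", "t2"])
valid_ids = np.array(["v0", "v1", "v2"])

# Rows are validation coupons, columns are training coupons.
dist = np.array([
    [0.1, 0.9, 0.2],
    [0.8, 0.2, 0.7],
    [0.3, 0.5, 0.4],
])

# Nearest validation coupon for each training coupon.
nearest_valid_per_train = valid_ids[dist.argmin(axis=0)]
train_to_valid = dict(zip(train_ids, nearest_valid_per_train))

# Validation coupons that no training coupon maps to.
never_nearest = sorted(
    set(valid_ids.tolist()) - set(nearest_valid_per_train.tolist())
)
print(never_nearest)
```

With the real data, the same set difference between `coupons_valid_ids` and the values of `train_to_valid_most_similar` should list the coupons affected.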