# Chapter 16

## 16.1 Wrapping Up

So, we made it up to here! we have gone through:

1. Most popular recommendations
2. User-Item similarity based recommendations
3. kNN collaborative filtering (well, sort of...)
4. GBM-based methods
5. Non-Negative Matrix Factorization
6. Factorization Machines
7. Field Aware Factorization Machines 
8. Deep Learning based recommendations (1st contact)

The best and most reliable technique has been **GBMs**, turning our recommendation algorithm into a regression problem and using `lightGBM`. Of course, this might be due to the fact that I might have not used the best packages for some of the other techniques (e.g. factorization machines) or the best set ups (e.g. wide and deep). Nonetheless, we know `lightGBM` produces a decent **MAP@10 (0.032)** and is a fully tested, production-ready package, so let's use it. 

Before going to the code in this notebook let me remind myself for a second the problem we are trying to solve here. We are given a training set (a year's worth of data) and I need to recommend new items (in this case coupons) to my customers. Let's say I receive new coupons during the week and I need to display them to my customers (or uses) during Sunday night, so next morning they all see fresh recommendations in the site or their mobile. From there in advance there are a series of possibilities one might consider that I do not have the information (or the time) to implement them here.  

For example, the easiest scenario is one in which the recommended coupons stay there for the whole "next week". However, it is possible that we have to update the recommendations every time a user interacts with one of the recommended coupons. Let's assume that based on a thorough analysis of the existing data we know that we need at least 10 interactions for the user interaction-based features to have any predicting power. What do we do before those 10 interactions? Well, we could recommend based on how similar the new coupons are to the most popular existing coupons or we could just recommend based on coupon content (content-based recommendations). 

On the other hand it might be that our recommendations should focus more on new coupons, or maybe we should prioritize old/existing coupons that are about to expire. In this later case we could build an identical recommendation algorithm to the one I will show here simply adding "about-to-expire" coupons to the batch of new coupons. Then we would have to add a rule so that we do know show existing coupons to those customers that (somehow) expressed dislike or already purchased them. A further possibility is one where some of our sponsors pay us more to promote their coupons and we need to give their coupons an "extra push" in the ranking. There are many possibilities and they determine the final shape of your recommendation algorithm. With the information provided for this Kaggle competition and the dataset **I have, I will simply assume I am at that point on Sunday when I need to recommend new coupons to my customers**. 

Finally, let me mention that I have tested the algorithms with existing users, leaving outside the test those users that were not seen during training. However, my final recommendation algorithm should also contemplate the possibility that a new user might register on Monday evening and he/she needs to see some recommendations as well (I will assume that for this new user I have no information on his/her likes). 

Therefore, considering all of the above, my final solution will be a combination of: 

1. Most popular recommendations for unseen customers (cold start problem for users)
2. `lightGBM`-based recommendations for existing customers

To the code...

## 16.2 Utility functions

With the aim of making the code more readable I have wrap up most of the code from previous Chapters in functions. I will still include them here in this notebook, but they could be place somewhere (say a module called `recommender_utils` and simply do: 

    from recommeder_utils import *

Also, we have "validated" already, so here I will be using all the data but the last week for training and the last week for testing. You could just append the train and validation data or you could run the code in Chapters 2, 3 and 4 simply changing the lines that look like this: 

    df_visits['days_to_present_flag'] = df_visits.days_to_present.apply(
        lambda x: 0 if x<=tp-1 else 1 if ((x>tp-1) and (x<=(tp*2)-1)) else 2)

into this:

    df_visits['days_to_present_flag'] = df_visits.days_to_present.apply(
    lambda x: 0 if x<=tp-1 else 1)

This is a trivial change and I will leave you to do it. I did it and place the results in a directory called `ftrain`, such that:

    ~/projects/RecoTour/datasets/Ponpare/data_processed$ tree
    .
    ├── ftrain
    │   ├── df_coupons_train_feat.p
    │   ├── df_coupons_train.p
    │   ├── df_interest.p
    │   ├── df_purchases_train.p
    │   ├── df_users_train_feat.p
    │   ├── df_users_train.p
    │   └── df_visits_train.p
    ├── test
    │   ├── df_coupons_test_feat.p
    │   ├── df_coupons_test.p
    │   ├── df_purchases_test.p
    │   ├── df_users_test.p
    │   └── df_visits_test.p
    
Ok, let's go, one function per cell with comments.

In [25]:
import numpy as np
import pandas as pd
import pickle
import os
import argparse
import lightgbm as lgb
import warnings
import multiprocessing

from recutils.utils import coupon_similarity_function
from recutils.average_precision import mapk
from sklearn.metrics.pairwise import pairwise_distances
from hyperopt import hp, tpe, fmin, Trials

warnings.filterwarnings("ignore")
cores = multiprocessing.cpu_count()

### Top 10 Most Popular Coupons during training

In [26]:
def top10_train_coupon_populatiry(train_purchases_path, train_visits_path):

    # train coupon popularity based on purchases and visits
    df_purchases_train = pd.read_pickle(train_purchases_path)
    df_visits_train = pd.read_pickle(train_visits_path)
    df_visits_train.rename(index=str, columns={'view_coupon_id_hash': 'coupon_id_hash'}, inplace=True)

    # popularity = n_purchases + 0.1*n_visits
    df_n_purchases = (df_purchases_train
        .coupon_id_hash
        .value_counts()
        .reset_index())
    df_n_purchases.columns = ['coupon_id_hash','counts']
    df_n_visits = (df_visits_train
        .coupon_id_hash
        .value_counts()
        .reset_index())
    df_n_visits.columns = ['coupon_id_hash','counts']

    df_popularity = df_n_purchases.merge(df_n_visits, on='coupon_id_hash', how='left')
    df_popularity.fillna(0, inplace=True)
    df_popularity['popularity'] = df_popularity['counts_x'] + 0.1*df_popularity['counts_y']
    df_popularity.sort_values('popularity', ascending=False , inplace=True)

    # select top 10 most popular coupons from the training dataset
    top10 = df_popularity.coupon_id_hash.tolist()[:10]

    return top10

### Test Coupon Popularity 

Based on how similar they are to the top 10 most popular training coupons

In [27]:
def test_coupon_populatiry(train_coupons_path, test_coupons_path):

    # Find top 10 most popular coupons during training
    top10 = top10_train_coupon_populatiry(train_purchases_path, train_visits_path)

    # Load train and test coupon features
    df_coupons_train_feat = pd.read_pickle(train_coupons_path)
    df_coupons_test_feat = pd.read_pickle(test_coupons_path)
    coupons_test_ids = df_coupons_test_feat.coupon_id_hash.values

    id_cols = ['coupon_id_hash']
    cat_cols = [c for c in df_coupons_train_feat.columns if c.endswith('_cat')]
    num_cols = [c for c in df_coupons_train_feat.columns if
        (c not in cat_cols) and (c not in id_cols)]

    # Compute test coupon popularity
    # 1. Normalize numerical columns (remember this process needs to happen all at once)
    df_coupons_train_feat['flag'] = 0
    df_coupons_test_feat['flag'] = 1

    tmp_df = pd.concat(
        [df_coupons_train_feat,df_coupons_test_feat],
        ignore_index=True)

    tmp_df_num = tmp_df[num_cols]
    tmp_df_norm = (tmp_df_num-tmp_df_num.min())/(tmp_df_num.max()-tmp_df_num.min())
    tmp_df[num_cols] = tmp_df_norm

    # 2. one-hot encoding for categorical features
    tmp_df[cat_cols] = tmp_df[cat_cols].astype('category')
    tmp_df_dummy = pd.get_dummies(tmp_df, columns=cat_cols)
    coupons_train_feat = tmp_df_dummy[tmp_df_dummy.flag==0]
    coupons_test_feat = tmp_df_dummy[tmp_df_dummy.flag==1]

    # get the values for the pairwise_distances method
    df_top_10_feat = (coupons_train_feat[coupons_train_feat.coupon_id_hash.isin(top10)]
        .reset_index()
        .drop(['flag','coupon_id_hash','index'], axis=1)
        )
    coupons_test_feat = (coupons_test_feat
        .drop(['flag','coupon_id_hash'], axis=1)
        .values)

    # 3. cosine distance
    dist_mtx = pairwise_distances(coupons_test_feat, df_top_10_feat, metric='cosine')

    # 4. Test coupons average distance to top 10 most popular coupons during training
    mean_distances = np.apply_along_axis(np.mean, 1, dist_mtx)
    df_test_popularity = pd.DataFrame({'coupon_id_hash': coupons_test_ids,
        'popularity': 1-mean_distances})

    return df_test_popularity

### Build a dictionary of interactions

In [28]:
def build_interactions_dictionary(interest_path, test_purchases_path, test_visits_path, is_hot=True):
    """
    Function to build a dictionary of real interactions: {user_id: [coupon_id1, ..., coupon_idN]}
    parameters are trivial with perhaps the exception of is_hot
    
    Params:
    -------
    is_hot: boolean. Whether we are dealing with existing or unseen customers
    """
        
    # interest dataframe
    df_interest = pd.read_pickle(interest_path)
    train_users = df_interest.user_id_hash.unique()
    del(df_interest)

    # test activities
    df_purchases_test = pd.read_pickle(test_purchases_path)
    df_visits_test = pd.read_pickle(test_visits_path)
    df_visits_test.rename(index=str, columns={'view_coupon_id_hash': 'coupon_id_hash'}, inplace=True)

    # whether they are existing users (hot) or not (cold)
    if is_hot:
        df_vte = df_visits_test[df_visits_test.user_id_hash.isin(train_users)]
        df_pte = df_purchases_test[df_purchases_test.user_id_hash.isin(train_users)]
    else:
        df_vte = df_visits_test[~df_visits_test.user_id_hash.isin(train_users)]
        df_pte = df_purchases_test[~df_purchases_test.user_id_hash.isin(train_users)]

    # dictionary of real interactions
    id_cols = ['user_id_hash', 'coupon_id_hash']

    df_interactions_test = pd.concat([df_pte[id_cols], df_vte[id_cols]], ignore_index=True)
    df_interactions_test = (df_interactions_test.groupby('user_id_hash')
        .agg({'coupon_id_hash': 'unique'})
        .reset_index())
    interactions_test_dict = pd.Series(df_interactions_test.coupon_id_hash.values,
        index=df_interactions_test.user_id_hash).to_dict()

    return interactions_test_dict

### Build a dictionary of recommendations

In [29]:
def build_recomendations_dictionary(ranking_df, ranking_metric='interest'):
    """
    Params:
    -------
    ranking_df: pandas dataframe with 3 cols (user_id, coupon_id, ranking_metric)    

    Returns:
    recomendations_dict: dictionary with recommendations: {user_id: recommended_coupon_id1, ...}
    """
    
    df_ranked = ranking_df.sort_values(['user_id_hash', ranking_metric], ascending=[False, False])
    df_ranked = (df_ranked
        .groupby('user_id_hash')['coupon_id_hash']
        .apply(list)
        .reset_index())
    recomendations_dict = pd.Series(df_ranked.coupon_id_hash.values,
        index=df_ranked.user_id_hash).to_dict()

    return recomendations_dict

### Most Popular recommendations function

In [30]:
def most_popular_recommendations(train_coupons_path, test_coupons_path, interest_path):

    # test coupons popularity
    df_test_popularity = test_coupon_populatiry(train_coupons_path, test_coupons_path)

    # list of purchases and visits for new users
    interactions_test_dict = build_interactions_dictionary(interest_path,
        test_purchases_path, test_visits_path, is_hot=False)

    # ranking dataframe
    left = pd.DataFrame({'user_id_hash':list(interactions_test_dict.keys())})
    left['key'] = 0
    right = pd.read_pickle(test_coupons_path)[['coupon_id_hash']]
    right['key'] = 0
    df_test = (pd.merge(left, right, on='key', how='outer')
        .drop('key', axis=1))
    df_test = pd.merge(df_test, df_test_popularity, on='coupon_id_hash')

    recomendations_dict = build_recomendations_dictionary(df_test, ranking_metric='popularity')

    return recomendations_dict

Moving onto lightGBM...

### Building lightGBM training set

In [31]:
def build_lightgbm_train_set(train_coupons_path, train_users_path, interest_path):

    # train coupon features
    df_coupons_train_feat = pd.read_pickle(train_coupons_path)
    drop_cols = [c for c in df_coupons_train_feat.columns
        if ((not c.endswith('_cat')) or ('method2' in c)) and (c!='coupon_id_hash')]
    df_coupons_train_cat_feat = df_coupons_train_feat.drop(drop_cols, axis=1)

    # train user features
    df_users_train_feat = pd.read_pickle(train_users_path)

    # interest dataframe
    df_interest = pd.read_pickle(interest_path)
    train_users = df_interest.user_id_hash.unique()

    df_train = pd.merge(df_interest, df_users_train_feat, on='user_id_hash')
    df_train = pd.merge(df_train, df_coupons_train_cat_feat, on = 'coupon_id_hash')

    # for the time being we ignore recency
    df_train.drop(['user_id_hash','coupon_id_hash','recency_factor'], axis=1, inplace=True)
    train = df_train.drop('interest', axis=1)
    y_train = df_train.interest
    all_cols = train.columns.tolist()
    cat_cols = [c for c in train.columns if c.endswith("_cat")]

    return train.values, y_train, all_cols, cat_cols, drop_cols

### Building lightGBM training set

In [32]:
def build_lightgbm_test_set(train_users_path, test_coupons_path, test_purchases_path, 
                            test_visits_path, interest_path, drop_cols):

    interactions_test_dict = build_interactions_dictionary(
        interest_path, test_purchases_path, test_visits_path, is_hot=True)

    df_users_train_feat = pd.read_pickle(train_users_path)
    df_coupons_test_feat = (pd.read_pickle(test_coupons_path)
        .drop(drop_cols, axis=1))

    left = pd.DataFrame({'user_id_hash':list(interactions_test_dict.keys())})
    left['key'] = 0
    right = pd.read_pickle(test_coupons_path)[['coupon_id_hash']]
    right['key'] = 0
    df_test = (pd.merge(left, right, on='key', how='outer')
        .drop('key', axis=1))
    df_test = pd.merge(df_test, df_users_train_feat, on='user_id_hash')
    df_test = pd.merge(df_test, df_coupons_test_feat, on = 'coupon_id_hash')
    X_test = (df_test
        .drop(['user_id_hash','coupon_id_hash'], axis=1)
        .values)
    df_rank = df_test[['user_id_hash','coupon_id_hash']]

    return X_test, df_rank

And a wrap up around MAP@10 

In [33]:
def compute_mapk(interactions_dict, recomendations_dict):
    actual = []
    pred = []
    for k,_ in recomendations_dict_hot.items():
        actual.append(list(interactions_dict[k]))
        pred.append(list(recomendations_dict[k]))
    return mapk(actual,pred)

## 16.3 FINAL SOLUTION: Most Popular + lightGBM

In [34]:
inp_dir = "../datasets/Ponpare/data_processed/"
train_dir = "ftrain"
test_dir = "test"
model_dir = "models"

# Training datasets
train_visits_path = os.path.join(inp_dir,train_dir, 'df_visits_train.p')
train_purchases_path = os.path.join(inp_dir,train_dir, 'df_purchases_train.p')
train_coupons_path = os.path.join(inp_dir,train_dir, 'df_coupons_train_feat.p')
train_users_path = os.path.join(inp_dir,train_dir, 'df_users_train_feat.p')
interest_path = os.path.join(inp_dir,train_dir, 'df_interest.p')

# Testing datasets
test_visits_path = os.path.join(inp_dir,test_dir, 'df_visits_test.p')
test_purchases_path = os.path.join(inp_dir,test_dir, 'df_purchases_test.p')
test_coupons_path = os.path.join(inp_dir,test_dir, 'df_coupons_test_feat.p')
test_users_path = os.path.join(inp_dir,test_dir, 'df_users_test_feat.p')

# Remember we have already optimize lightGBM, so we can upload the best 
# parameters or the model directly
best_params_path = os.path.join(inp_dir, model_dir, 'gbm_optimal_parameters.p')

### 16.3.1 Most Popular Recommendations for new customers

In [35]:
recomendations_dict_cold = most_popular_recommendations(train_coupons_path,
    test_coupons_path, interest_path)

### 16.3.2 LightGBM Recommendations for existing customers

In [36]:
train,y_train,all_cols,cat_cols,drop_cols = build_lightgbm_train_set(
    train_coupons_path,
    train_users_path,
    interest_path)
X_test, df_rank = build_lightgbm_test_set(
    train_users_path,
    test_coupons_path,
    test_purchases_path,
    test_visits_path,
    interest_path,
    drop_cols)

best = pickle.load(open(best_params_path, "rb"))
model = lgb.LGBMRegressor(**best)
model.fit(train,y_train,feature_name=all_cols,categorical_feature=cat_cols)
preds = model.predict(X_test)
df_rank['interest'] = preds
recomendations_dict_hot = build_recomendations_dictionary(df_rank)

### 16.3.3 Putting all together

In [37]:
recomendations_dict = recomendations_dict_cold.copy()
recomendations_dict.update(recomendations_dict_hot)

interactions_dict_cold = build_interactions_dictionary(
    interest_path,
    test_purchases_path,
    test_visits_path,
    is_hot=False)
interactions_dict_hot = build_interactions_dictionary(
    interest_path,
    test_purchases_path,
    test_visits_path,
    is_hot=True)
interactions_dict = interactions_dict_cold.copy()
interactions_dict.update(interactions_dict_hot)

final_mapk = compute_mapk(interactions_dict, recomendations_dict)
print(final_mapk)

0.04239416493503196


MAP@10=0.042 pretty good! And with this we are done...building the algorithm and running al offline test. 

Now is when the real fun begins, where (data) scientists and engineers sit down and design a good online test and a decent production pipeline. But this is an entire different history. 

THE END