## Chapter 10 

### 10.1 Non-negative matrix factorization (NMF)

If you have worked, or are working, with recommendation algorithms, you are probably familiar with matrix factorization. Just in case, let's quickly go through the formulation before jumping into the code.

Given a ratings (or scores) matrix $R$ with dimensions $M \times N$, we aim to find two matrices $C$ and $U$ with dimensions $M \times K$ and $N \times K$ respectively such that

$$R \approx C \times U^T = \hat{R}$$

$K$ is the number of latent factors (or latent dimensions), which we choose at our convenience. Then the rating of item $i$ by user $j$ can be computed as the dot product of row $i$ of $C$ and row $j$ of $U$

$$ \hat{r}_{ij} = c_i u_j^T = \sum_{k=1}^{K}{c_{ik}u_{jk}}$$


In our case, $R$, $C$ and $U$ are our interest, coupon and user matrices respectively. Since we have no measure of negative interest, all matrices are non-negative, hence non-negative matrix factorization. You can find a nice tutorial in Python [here](http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/).
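To make the formulation concrete, here is a minimal sketch of NMF on a toy matrix using the classic multiplicative update rules. This is **not** the solver that `sklearn.decomposition.NMF` uses by default, and the function name `nmf_multiplicative`, the toy matrix and the number of iterations are all assumptions made for illustration. Zeros are treated as "no interest", not as missing values.

```python
import numpy as np

def nmf_multiplicative(R, K, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix R (M x N) as C @ U.T with C, U >= 0."""
    rng = np.random.default_rng(seed)
    M, N = R.shape
    C = rng.random((M, K))  # M x K, non-negative start
    U = rng.random((N, K))  # N x K, non-negative start
    for _ in range(n_iter):
        # Multiplicative updates for the squared error ||R - C U^T||^2.
        # Factors stay non-negative because we only multiply by ratios >= 0.
        C *= (R @ U) / (C @ (U.T @ U) + eps)
        U *= (R.T @ C) / (U @ (C.T @ C) + eps)
    return C, U

# Toy interest matrix (hypothetical values): 5 rows x 4 columns
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)

C, U = nmf_multiplicative(R, K=2)
R_hat = C @ U.T          # full reconstruction
r_hat_01 = C[0] @ U[1]   # single entry as a dot product of two latent vectors
```

Each entry of `R_hat` is just the dot product of one row of `C` and one row of `U`, which is the sum over the $K$ latent factors written above.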

Once we have computed $\hat{R}$ we will be in a position where we can recommend existing coupons to customers based on past interactions. **However** let's emphasise once more that this is **NOT** the problem we are solving here. Here we have a batch of new, unseen coupons and we need to recommend them to existing customers. This is what I will do:

1. Compute $\hat{R}$, $C$ and $U$
2. Compute similarity between new and old coupons based on features (price, category, etc), and assign the latent factors of the old coupons to the most similar new coupons. 
3. Build a dataset horizontally stacking user and item latent factors.
4. Use a regressor to predict interest and rank

In [32]:
import numpy as np
import pandas as pd
import os
import pickle
import warnings
import multiprocessing
import lightgbm as lgb

from time import time
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.decomposition import NMF
from sklearn.model_selection import train_test_split
from scipy.sparse import csr_matrix, load_npz
from recutils.average_precision import mapk
from hyperopt import hp, tpe, fmin, Trials

warnings.filterwarnings("ignore")
cores = multiprocessing.cpu_count()

inp_dir = "../datasets/Ponpare/data_processed/"
train_dir = "train"
valid_dir = "valid"

In [33]:
# train and validation coupons
df_coupons_train_feat = pd.read_pickle(os.path.join(inp_dir, train_dir, 'df_coupons_train_feat.p'))
df_coupons_valid_feat = pd.read_pickle(os.path.join(inp_dir, valid_dir, 'df_coupons_valid_feat.p'))

# train and validation coupon ids
coupons_train_ids = df_coupons_train_feat.coupon_id_hash.values
coupons_valid_ids = df_coupons_valid_feat.coupon_id_hash.values

In [34]:
id_cols = ['coupon_id_hash']
cat_cols = [c for c in df_coupons_train_feat.columns if c.endswith('_cat')]
num_cols = [c for c in df_coupons_train_feat.columns if
    (c not in cat_cols) and (c not in id_cols)]

As in previous Chapters, let's calculate the similarity between new and old coupons. 

**Note**. In the `recutils` module there is a submodule simply called `utils` that contains the `coupon_similarity_function` method. All the code below is wrapped up in that function.

In [35]:
# Add a train/test flag
df_coupons_train_feat['flag'] = 0
df_coupons_valid_feat['flag'] = 1

tmp_df = pd.concat(
    [df_coupons_train_feat,df_coupons_valid_feat],
    ignore_index=True)

Normalize numerical columns

In [36]:
# Normalize numerical columns
tmp_df_num = tmp_df[num_cols]
tmp_df_norm = (tmp_df_num-tmp_df_num.min())/(tmp_df_num.max()-tmp_df_num.min())
tmp_df[num_cols] = tmp_df_norm
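As a side note, the min-max formula above divides by `max - min`, so a column with a single constant value would produce `NaN`. A minimal sketch of the same scaling with a guard for that case follows. The helper name `min_max_scale` and the choice of mapping constant columns to 0 are my assumptions, not something the notebook does.

```python
import numpy as np

def min_max_scale(X):
    """Scale every column of X to [0, 1]; constant columns are mapped to 0."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Replace zero ranges by 1 so that constant columns give 0 instead of NaN
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (X - col_min) / safe_range

# Hypothetical numeric features: price_rate, catalog_price, and a constant column
X = np.array([[50.0, 1000.0, 7.0],
              [75.0, 3000.0, 7.0],
              [100.0, 2000.0, 7.0]])
X_scaled = min_max_scale(X)
```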

One-hot encode categorical columns

In [37]:
# one hot categorical
tmp_df[cat_cols] = tmp_df[cat_cols].astype('category')
tmp_df_dummy = pd.get_dummies(tmp_df, columns=cat_cols)

coupons_train_feat = tmp_df_dummy[tmp_df_dummy.flag==0]
coupons_valid_feat = tmp_df_dummy[tmp_df_dummy.flag==1]
coupons_train_feat = (coupons_train_feat
    .drop(['flag','coupon_id_hash'], axis=1)
    .values)
coupons_valid_feat = (coupons_valid_feat
    .drop(['flag','coupon_id_hash'], axis=1)
    .values)

Distance Matrix

In [38]:
dist_mtx = pairwise_distances(coupons_valid_feat, coupons_train_feat, metric='cosine')
valid_to_train_top_n_idx = np.argsort(dist_mtx, axis=1)
valid_to_train_most_similar = dict(zip(coupons_valid_ids,
    coupons_train_ids[valid_to_train_top_n_idx[:,0]]))

In [39]:
valid_to_train_most_similar['f1540e7a08cce1a8d5a5ebd8233e1db0']

'31e98da3c0c1df31559848688d25eb01'

In [40]:
df_coupons_train_feat[df_coupons_train_feat.coupon_id_hash == '31e98da3c0c1df31559848688d25eb01']

Unnamed: 0,price_rate,catalog_price,discount_price,dispperiod,validperiod,usable_date_mon_cat,usable_date_tue_cat,usable_date_wed_cat,usable_date_thu_cat,usable_date_fri_cat,usable_date_sat_cat,usable_date_sun_cat,usable_date_holiday_cat,usable_date_before_holiday_cat,coupon_id_hash,validperiod_method1_cat,validperiod_method2_cat,validfrom_method1_cat,validfrom_method2_cat,validend_method1_cat,validend_method2_cat,dispfrom_cat,dispend_cat,dispperiod_cat,price_rate_cat,catalog_price_cat,discount_price_cat,capsule_text_cat,genre_name_cat,large_area_name_cat,ken_name_cat,small_area_name_cat,flag
17075,54,2150,980,3,90,3,3,3,3,3,3,3,3,3,31e98da3c0c1df31559848688d25eb01,4,1,7,2,7,0,1,4,1,1,0,0,6,6,0,2,5,0


In [41]:
df_coupons_valid_feat[df_coupons_valid_feat.coupon_id_hash == "f1540e7a08cce1a8d5a5ebd8233e1db0"]

Unnamed: 0,price_rate,catalog_price,discount_price,dispperiod,validperiod,usable_date_mon_cat,usable_date_tue_cat,usable_date_wed_cat,usable_date_thu_cat,usable_date_fri_cat,usable_date_sat_cat,usable_date_sun_cat,usable_date_holiday_cat,usable_date_before_holiday_cat,coupon_id_hash,validperiod_method1_cat,validperiod_method2_cat,validfrom_method1_cat,validfrom_method2_cat,validend_method1_cat,validend_method2_cat,dispfrom_cat,dispend_cat,dispperiod_cat,price_rate_cat,catalog_price_cat,discount_price_cat,capsule_text_cat,genre_name_cat,large_area_name_cat,ken_name_cat,small_area_name_cat,flag
210,55,2200,980,3,99,3,3,3,3,3,3,3,3,3,f1540e7a08cce1a8d5a5ebd8233e1db0,4,1,7,2,7,0,1,4,1,1,0,0,6,6,0,2,5,1


Overall, very similar. 
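For reference, the mapping from each new coupon to its most similar old coupon can be sketched in plain NumPy. The feature vectors and ids below are made up, and `cosine_distance_matrix` is a hypothetical helper that assumes no feature row is all zeros. Since only the closest coupon is needed, `argmin` is enough and avoids a full sort of each row.

```python
import numpy as np

def cosine_distance_matrix(A, B):
    """Cosine distance (1 - cosine similarity) between rows of A and rows of B."""
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - A_unit @ B_unit.T

# Hypothetical, already normalized/encoded coupon features
train_feat = np.array([[1.0, 0.0, 0.2],
                       [0.0, 1.0, 0.9],
                       [0.5, 0.5, 0.5]])
valid_feat = np.array([[0.9, 0.1, 0.2],
                       [0.1, 0.9, 1.0]])
train_ids = np.array(["old_a", "old_b", "old_c"])
valid_ids = np.array(["new_x", "new_y"])

dist = cosine_distance_matrix(valid_feat, train_feat)  # shape (n_valid, n_train)
nearest_idx = dist.argmin(axis=1)                      # closest old coupon per new coupon
most_similar = dict(zip(valid_ids.tolist(), train_ids[nearest_idx].tolist()))
```

Here `most_similar` plays the role of `valid_to_train_most_similar` in the notebook.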

Let's now load the interaction matrix

In [42]:
# let's load the activity matrix and dict of indexes
interactions_mtx = load_npz(os.path.join(inp_dir, train_dir, "interactions_mtx.npz"))
items_idx_dict = pickle.load(open(os.path.join(inp_dir, train_dir, "items_idx_dict.p"),'rb'))
users_idx_dict = pickle.load(open(os.path.join(inp_dir, train_dir, "users_idx_dict.p"),'rb'))
interactions_mtx

<22623x18622 sparse matrix of type '<class 'numpy.float64'>'
	with 1560464 stored elements in Compressed Sparse Row format>

Non-negative matrix factorization with default values and `ncomp` (50 to start with) components/factors.

In [19]:
ncomp = 50
nmf_model = NMF(n_components=ncomp, init='random', random_state=1981)
user_factors = nmf_model.fit_transform(interactions_mtx)
item_factors = nmf_model.components_.T
# pickle.dump(nmf_model, open("../datasets/Ponpare/data_processed/models/nmf_model.p", "wb"))

And just like that we have our item and user projections onto our latent space

In [43]:
print(user_factors.shape)
print(item_factors.shape)

(22623, 50)
(18622, 50)


Let's make sure every user/item points to the right latent vector

In [44]:
# make sure every user/item points to the right factors
user_factors_dict = {k: user_factors[v] for k, v in users_idx_dict.items()}

item_factors_dict = {k: item_factors[v] for k, v in items_idx_dict.items()}

Now the only thing left to do is to train a regressor, more precisely, our favourite: lightGBM. Let's build the training/testing datasets and then the model. By the way, now there are no categorical features, so our life is just a bit easier.

In [45]:
df_interest = pd.read_pickle(os.path.join(inp_dir, train_dir, 'df_interest.p'))
df_user_factors = (pd.DataFrame.from_dict(user_factors_dict, orient="index")
    .reset_index())
df_user_factors.columns = ['user_id_hash'] + ['user_factor_'+str(i) for i in range(ncomp)]
df_item_factors = (pd.DataFrame.from_dict(item_factors_dict, orient="index")
    .reset_index())
df_item_factors.columns = ['coupon_id_hash'] + ['item_factor_'+str(i) for i in range(ncomp)]

#### TRAIN

In [46]:
# TRAIN
df_train = pd.merge(df_interest[['user_id_hash','coupon_id_hash','interest']],
    df_item_factors, on='coupon_id_hash')
df_train = pd.merge(df_train, df_user_factors, on='user_id_hash')
X = df_train.iloc[:,3:].values
y = df_train.interest.values
print(df_train.shape)
df_train.head()

(1560464, 103)


Unnamed: 0,user_id_hash,coupon_id_hash,interest,item_factor_0,item_factor_1,item_factor_2,item_factor_3,item_factor_4,item_factor_5,item_factor_6,item_factor_7,item_factor_8,item_factor_9,item_factor_10,item_factor_11,item_factor_12,item_factor_13,item_factor_14,item_factor_15,item_factor_16,item_factor_17,item_factor_18,item_factor_19,item_factor_20,item_factor_21,item_factor_22,item_factor_23,item_factor_24,item_factor_25,item_factor_26,item_factor_27,item_factor_28,item_factor_29,item_factor_30,item_factor_31,item_factor_32,item_factor_33,item_factor_34,item_factor_35,item_factor_36,item_factor_37,item_factor_38,item_factor_39,item_factor_40,item_factor_41,item_factor_42,item_factor_43,item_factor_44,item_factor_45,item_factor_46,...,user_factor_0,user_factor_1,user_factor_2,user_factor_3,user_factor_4,user_factor_5,user_factor_6,user_factor_7,user_factor_8,user_factor_9,user_factor_10,user_factor_11,user_factor_12,user_factor_13,user_factor_14,user_factor_15,user_factor_16,user_factor_17,user_factor_18,user_factor_19,user_factor_20,user_factor_21,user_factor_22,user_factor_23,user_factor_24,user_factor_25,user_factor_26,user_factor_27,user_factor_28,user_factor_29,user_factor_30,user_factor_31,user_factor_32,user_factor_33,user_factor_34,user_factor_35,user_factor_36,user_factor_37,user_factor_38,user_factor_39,user_factor_40,user_factor_41,user_factor_42,user_factor_43,user_factor_44,user_factor_45,user_factor_46,user_factor_47,user_factor_48,user_factor_49
0,7a971028976de1a048c6b711b7889d17,48948527d6a8e075090393f3d95e31bf,0.45,0.002203,0.0,0.0,0.002933,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020744,0.0,0.036788,0.015414,0.008512,0.0,0.000171,0.015847,0.0,0.009768,0.0,0.003239,0.002925,0.0,0.0,0.022358,0.0,0.0,0.0,0.001181,0.0,0.0,0.0,0.062034,0.000428,0.029339,...,0.021197,0.0,0.035426,0.0,0.0,0.140428,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047493,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.137815,0.0,0.0,0.0,0.010867,0.003565,0.0,3e-06,0.092271,0.0,0.0,0.0,0.002384,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000115,0.0
1,7a971028976de1a048c6b711b7889d17,a262c7ff56a5cd3de3c5c40443f3018c,1.0,0.0,0.0,0.0,0.0,0.0,0.009887,0.0,0.004451,0.0,0.014536,0.0,0.0,0.0,0.004243,0.0,0.0,0.0,0.0,0.011751,21.018275,0.000612,0.0,0.002987,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.9e-05,0.0,0.0,0.01778,0.002519,0.0,0.0,0.0,0.001527,0.0,0.0,0.0,0.0,0.0,0.0,...,0.021197,0.0,0.035426,0.0,0.0,0.140428,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047493,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.137815,0.0,0.0,0.0,0.010867,0.003565,0.0,3e-06,0.092271,0.0,0.0,0.0,0.002384,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000115,0.0
2,7a971028976de1a048c6b711b7889d17,7fc6567f470af5356ae97097dbe18486,1.0,0.0,0.0,0.0,0.0,0.0,7.082221,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.021197,0.0,0.035426,0.0,0.0,0.140428,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047493,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.137815,0.0,0.0,0.0,0.010867,0.003565,0.0,3e-06,0.092271,0.0,0.0,0.0,0.002384,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000115,0.0
3,7a971028976de1a048c6b711b7889d17,db295c37a59baca890641e5faf0f2f7b,0.107283,0.012354,0.010444,0.0096,0.0,0.013905,0.0,0.0,0.004099,0.0,0.0,0.039814,0.059949,0.0,0.0,0.0,0.043538,0.021911,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.023068,0.124153,0.138026,0.044227,0.0,0.0,0.007606,0.0,0.067632,0.070299,0.011877,0.00283,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028807,0.0,0.0,...,0.021197,0.0,0.035426,0.0,0.0,0.140428,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047493,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.137815,0.0,0.0,0.0,0.010867,0.003565,0.0,3e-06,0.092271,0.0,0.0,0.0,0.002384,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000115,0.0
4,7a971028976de1a048c6b711b7889d17,1dfb7cc88be25a8f7b435cd988859ebf,0.107283,0.0,0.0,0.012573,0.0,0.007583,0.0,0.126861,0.0,0.0,0.0,0.072697,0.0,0.025464,0.0,0.0,0.0,0.004041,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000654,0.032683,0.0,0.0,0.0,0.010861,0.0,0.0,0.0,0.15798,0.0,0.0,0.0,0.0,0.0,0.016541,0.0,0.0,0.0,0.0,0.007658,0.001791,0.0,...,0.021197,0.0,0.035426,0.0,0.0,0.140428,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047493,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.137815,0.0,0.0,0.0,0.010867,0.003565,0.0,3e-06,0.092271,0.0,0.0,0.0,0.002384,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000115,0.0


#### VALIDATION

In [47]:
# VALIDATION
interactions_valid_dict = pickle.load(
    open("../datasets/Ponpare/data_processed/valid/interactions_valid_dict.p","rb"))

# remember that one user that visited one coupon and that coupon is not in the training set of coupons.
# and in consequence not in the interactions matrix
interactions_valid_dict.pop("25e2b645bfcd0980b2a5d0a4833f237a")

array(['fe28f9f9055fde46855b1520a40e3c08'], dtype=object)

In [50]:
left = pd.DataFrame({'user_id_hash':list(interactions_valid_dict.keys())})
left['key'] = 0
right = df_coupons_valid_feat[['coupon_id_hash']].copy()
right['key'] = 0
df_valid = (pd.merge(left, right, on='key', how='outer')
    .drop('key', axis=1))
df_valid['mapped_coupons'] = (df_valid.coupon_id_hash
    .apply(lambda x: valid_to_train_most_similar[x]))
df_valid = pd.merge(df_valid, df_item_factors,
    left_on='mapped_coupons', right_on='coupon_id_hash')
df_valid = pd.merge(df_valid, df_user_factors,
    on='user_id_hash')
df_valid.drop('coupon_id_hash_y', axis=1, inplace=True)
df_valid.rename(index=str, columns={'coupon_id_hash_x': 'coupon_id_hash'}, inplace=True)
df_preds = df_valid[['user_id_hash', 'coupon_id_hash']].copy()
df_valid.head()

Unnamed: 0,user_id_hash,coupon_id_hash,mapped_coupons,item_factor_0,item_factor_1,item_factor_2,item_factor_3,item_factor_4,item_factor_5,item_factor_6,item_factor_7,item_factor_8,item_factor_9,item_factor_10,item_factor_11,item_factor_12,item_factor_13,item_factor_14,item_factor_15,item_factor_16,item_factor_17,item_factor_18,item_factor_19,item_factor_20,item_factor_21,item_factor_22,item_factor_23,item_factor_24,item_factor_25,item_factor_26,item_factor_27,item_factor_28,item_factor_29,item_factor_30,item_factor_31,item_factor_32,item_factor_33,item_factor_34,item_factor_35,item_factor_36,item_factor_37,item_factor_38,item_factor_39,item_factor_40,item_factor_41,item_factor_42,item_factor_43,item_factor_44,item_factor_45,item_factor_46,...,user_factor_0,user_factor_1,user_factor_2,user_factor_3,user_factor_4,user_factor_5,user_factor_6,user_factor_7,user_factor_8,user_factor_9,user_factor_10,user_factor_11,user_factor_12,user_factor_13,user_factor_14,user_factor_15,user_factor_16,user_factor_17,user_factor_18,user_factor_19,user_factor_20,user_factor_21,user_factor_22,user_factor_23,user_factor_24,user_factor_25,user_factor_26,user_factor_27,user_factor_28,user_factor_29,user_factor_30,user_factor_31,user_factor_32,user_factor_33,user_factor_34,user_factor_35,user_factor_36,user_factor_37,user_factor_38,user_factor_39,user_factor_40,user_factor_41,user_factor_42,user_factor_43,user_factor_44,user_factor_45,user_factor_46,user_factor_47,user_factor_48,user_factor_49
0,002ae30377cd30f65652e52618e8b2d6,282b5bda1758e147589ca517e02195c3,ec178b741b164c55ea87b4589318ef87,0.0,0.0,0.0,0.000471,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00424,0.0,0.001528,0.049214,0.0,0.0,0.0,0.0,0.0,0.0,0.002479,0.0,0.0,0.0,0.019263,0.0,0.000639,0.0,0.0,0.0,0.00375,0.0,0.0,0.003346,0.00373,0.0,0.0,0.0,0.0,0.0,0.0,0.000403,0.0507,0.0,0.0,0.0,...,0.216339,0.0,0.099827,0.0,0.107872,0.0,0.02954,0.0,0.0,0.0,0.007376,0.025764,0.0,0.0,0.0,0.000486,0.000334,0.0,0.0,0.0,0.0,0.006297,0.0,0.0,0.0,0.001352,0.001981,0.0,0.0,0.004619,0.0,0.010192,0.001414,0.000171,0.0,0.0,0.0,0.000364,0.0,0.0,0.0,0.000247,0.0,0.0,0.0,0.0,0.012614,0.0,0.0,0.021166
1,002ae30377cd30f65652e52618e8b2d6,0f43ef71c25d409c250f5a5042806342,3a80034f0ec74c42ed9fa7933a4e2945,0.0,0.0,0.0,0.0,0.0,0.0,0.099698,0.0,0.0,0.0,0.160438,0.0,0.0,0.0,0.00492,0.0,0.0,0.030002,0.0,0.0,0.0,0.029892,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.082048,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009478,0.0,0.016193,0.0,0.0,0.0,0.0,0.0,0.0,...,0.216339,0.0,0.099827,0.0,0.107872,0.0,0.02954,0.0,0.0,0.0,0.007376,0.025764,0.0,0.0,0.0,0.000486,0.000334,0.0,0.0,0.0,0.0,0.006297,0.0,0.0,0.0,0.001352,0.001981,0.0,0.0,0.004619,0.0,0.010192,0.001414,0.000171,0.0,0.0,0.0,0.000364,0.0,0.0,0.0,0.000247,0.0,0.0,0.0,0.0,0.012614,0.0,0.0,0.021166
2,002ae30377cd30f65652e52618e8b2d6,28ff0fb4b561a2fd6a360fe28f465e07,1d4bbd6a9bcb8b8349dce28ae25b1c34,0.0,0.0,0.0,0.0,0.01213,0.0,0.047793,0.0,0.0,0.0,0.0,0.004743,0.0,0.0,0.0,0.0,0.0,0.004402,0.0,0.0,0.0,0.034334,0.0,0.0,0.0,0.001826,0.0,0.0,0.0,0.042891,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006935,0.003588,0.0,0.0,0.0,...,0.216339,0.0,0.099827,0.0,0.107872,0.0,0.02954,0.0,0.0,0.0,0.007376,0.025764,0.0,0.0,0.0,0.000486,0.000334,0.0,0.0,0.0,0.0,0.006297,0.0,0.0,0.0,0.001352,0.001981,0.0,0.0,0.004619,0.0,0.010192,0.001414,0.000171,0.0,0.0,0.0,0.000364,0.0,0.0,0.0,0.000247,0.0,0.0,0.0,0.0,0.012614,0.0,0.0,0.021166
3,002ae30377cd30f65652e52618e8b2d6,864f351e66cd3aeece5d06987fc2ed4b,5d4f76bd6de8e64bc5fd65670b4527cf,0.0,0.0,0.002574,0.0,0.09186,0.0,0.0,0.0,0.005344,0.0,0.010375,0.0,0.0,0.0,0.0,0.002895,0.0,0.045026,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019893,0.0,0.010034,0.0,0.016285,0.0,0.0,0.0,0.014651,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.216339,0.0,0.099827,0.0,0.107872,0.0,0.02954,0.0,0.0,0.0,0.007376,0.025764,0.0,0.0,0.0,0.000486,0.000334,0.0,0.0,0.0,0.0,0.006297,0.0,0.0,0.0,0.001352,0.001981,0.0,0.0,0.004619,0.0,0.010192,0.001414,0.000171,0.0,0.0,0.0,0.000364,0.0,0.0,0.0,0.000247,0.0,0.0,0.0,0.0,0.012614,0.0,0.0,0.021166
4,002ae30377cd30f65652e52618e8b2d6,279ba64539609d30114b68874cd0fb42,0d65be97c9daa9363aa4f996facd3725,0.0,0.0,0.032641,0.0,0.0,0.0,0.04215,0.0,0.01483,0.0,0.066713,0.0,0.043134,0.0,0.004645,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011284,0.0,0.0,0.0,0.006607,0.0,0.0,0.0,0.217567,0.0,0.0,0.0,0.0,0.006989,0.0,0.020393,0.0,0.00403,0.0,0.00549,0.0,0.0,...,0.216339,0.0,0.099827,0.0,0.107872,0.0,0.02954,0.0,0.0,0.0,0.007376,0.025764,0.0,0.0,0.0,0.000486,0.000334,0.0,0.0,0.0,0.0,0.006297,0.0,0.0,0.0,0.001352,0.001981,0.0,0.0,0.004619,0.0,0.010192,0.001414,0.000171,0.0,0.0,0.0,0.000364,0.0,0.0,0.0,0.000247,0.0,0.0,0.0,0.0,0.012614,0.0,0.0,0.021166


In [51]:
X_valid = df_valid.iloc[:, 3:].values
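If the merges above feel heavy, the same "every user against every new coupon" matrix can be sketched directly in NumPy. This is an alternative illustration, not what the notebook runs; the factor values are invented and the column order (item factors first, then user factors) simply mirrors `df_valid`.

```python
import numpy as np

# Hypothetical latent factors with K=2
user_factors_toy = np.array([[0.1, 0.0],
                             [0.3, 0.2],
                             [0.0, 0.5]])      # 3 users
new_item_factors_toy = np.array([[0.4, 0.1],
                                 [0.0, 0.7]])  # 2 new coupons (factors borrowed from old ones)

n_users, n_items = len(user_factors_toy), len(new_item_factors_toy)

# Each user row is repeated once per coupon ...
user_block = np.repeat(user_factors_toy, n_items, axis=0)
# ... and the whole coupon block is tiled once per user
item_block = np.tile(new_item_factors_toy, (n_users, 1))

# Horizontally stack item and user factors: one row per (user, coupon) pair
X_pairs = np.hstack([item_block, user_block])
```

Row `u * n_items + i` of `X_pairs` corresponds to user `u` and coupon `i`, so predictions can be reshaped to `(n_users, n_items)` for ranking.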

Now that we have the two matrices, `X` and `X_valid`, we have two options: keep it simple or run with full optimization. 

#### SIMPLE SOLUTION

I will run a single fit with a large number of estimators and "aggressive" early stopping (5)

In [53]:
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.25)
model1 = lgb.LGBMRegressor(n_estimators=1000)
model1.fit(X_train,y_train,
    eval_set = [(X_eval,y_eval)],
    early_stopping_rounds=5,
    eval_metric="rmse")

[1]	valid_0's rmse: 0.273456
Training until validation scores don't improve for 10 rounds.
[2]	valid_0's rmse: 0.271579
[3]	valid_0's rmse: 0.269902
[4]	valid_0's rmse: 0.268547
[5]	valid_0's rmse: 0.267429
[6]	valid_0's rmse: 0.266278
[7]	valid_0's rmse: 0.265348
[8]	valid_0's rmse: 0.264341
[9]	valid_0's rmse: 0.263683
[10]	valid_0's rmse: 0.263036
[11]	valid_0's rmse: 0.262435
[12]	valid_0's rmse: 0.261834
[13]	valid_0's rmse: 0.261268
[14]	valid_0's rmse: 0.260704
[15]	valid_0's rmse: 0.260081
[16]	valid_0's rmse: 0.259663
[17]	valid_0's rmse: 0.259175
[18]	valid_0's rmse: 0.258766
[19]	valid_0's rmse: 0.258466
[20]	valid_0's rmse: 0.258074
[21]	valid_0's rmse: 0.257747
[22]	valid_0's rmse: 0.257403
[23]	valid_0's rmse: 0.256945
[24]	valid_0's rmse: 0.25666
[25]	valid_0's rmse: 0.256395
[26]	valid_0's rmse: 0.256037
[27]	valid_0's rmse: 0.255718
[28]	valid_0's rmse: 0.255353
[29]	valid_0's rmse: 0.255057
[30]	valid_0's rmse: 0.254847
[31]	valid_0's rmse: 0.254597
...

[270]	valid_0's rmse: 0.234532
[271]	valid_0's rmse: 0.234513
[272]	valid_0's rmse: 0.234476
[273]	valid_0's rmse: 0.234443
[274]	valid_0's rmse: 0.234418
[275]	valid_0's rmse: 0.234404
[276]	valid_0's rmse: 0.23439
[277]	valid_0's rmse: 0.234348
[278]	valid_0's rmse: 0.234313
[279]	valid_0's rmse: 0.234295
[280]	valid_0's rmse: 0.234277
[281]	valid_0's rmse: 0.234236
[282]	valid_0's rmse: 0.234216
[283]	valid_0's rmse: 0.234186
[284]	valid_0's rmse: 0.234171
[285]	valid_0's rmse: 0.23412
[286]	valid_0's rmse: 0.234092
[287]	valid_0's rmse: 0.234078
[288]	valid_0's rmse: 0.234034
[289]	valid_0's rmse: 0.233992
[290]	valid_0's rmse: 0.233961
[291]	valid_0's rmse: 0.23394
[292]	valid_0's rmse: 0.233919
[293]	valid_0's rmse: 0.233862
[294]	valid_0's rmse: 0.233826
[295]	valid_0's rmse: 0.23378
[296]	valid_0's rmse: 0.233763
[297]	valid_0's rmse: 0.233756
[298]	valid_0's rmse: 0.233737
[299]	valid_0's rmse: 0.23371
[300]	valid_0's rmse: 0.233697
[301]	valid_0's rmse: 0.233684
...

[539]	valid_0's rmse: 0.228987
[540]	valid_0's rmse: 0.228953
[541]	valid_0's rmse: 0.228936
[542]	valid_0's rmse: 0.228931
[543]	valid_0's rmse: 0.228908
[544]	valid_0's rmse: 0.228903
[545]	valid_0's rmse: 0.228892
[546]	valid_0's rmse: 0.228873
[547]	valid_0's rmse: 0.228861
[548]	valid_0's rmse: 0.228855
[549]	valid_0's rmse: 0.228847
[550]	valid_0's rmse: 0.228833
[551]	valid_0's rmse: 0.228824
[552]	valid_0's rmse: 0.228817
[553]	valid_0's rmse: 0.22879
[554]	valid_0's rmse: 0.228754
[555]	valid_0's rmse: 0.228743
[556]	valid_0's rmse: 0.228706
[557]	valid_0's rmse: 0.228691
[558]	valid_0's rmse: 0.228672
[559]	valid_0's rmse: 0.228644
[560]	valid_0's rmse: 0.228629
[561]	valid_0's rmse: 0.228624
[562]	valid_0's rmse: 0.228622
[563]	valid_0's rmse: 0.228616
[564]	valid_0's rmse: 0.228612
[565]	valid_0's rmse: 0.228578
[566]	valid_0's rmse: 0.228547
[567]	valid_0's rmse: 0.228537
[568]	valid_0's rmse: 0.228518
[569]	valid_0's rmse: 0.228505
[570]	valid_0's rmse: 0.228491
...

[809]	valid_0's rmse: 0.225882
[810]	valid_0's rmse: 0.22588
[811]	valid_0's rmse: 0.225879
[812]	valid_0's rmse: 0.225869
[813]	valid_0's rmse: 0.225867
[814]	valid_0's rmse: 0.225864
[815]	valid_0's rmse: 0.225856
[816]	valid_0's rmse: 0.22584
[817]	valid_0's rmse: 0.225839
[818]	valid_0's rmse: 0.22583
[819]	valid_0's rmse: 0.225819
[820]	valid_0's rmse: 0.225806
[821]	valid_0's rmse: 0.225792
[822]	valid_0's rmse: 0.225783
[823]	valid_0's rmse: 0.225777
[824]	valid_0's rmse: 0.225776
[825]	valid_0's rmse: 0.225776
[826]	valid_0's rmse: 0.225767
[827]	valid_0's rmse: 0.225756
[828]	valid_0's rmse: 0.225739
[829]	valid_0's rmse: 0.225727
[830]	valid_0's rmse: 0.225715
[831]	valid_0's rmse: 0.2257
[832]	valid_0's rmse: 0.225692
[833]	valid_0's rmse: 0.225688
[834]	valid_0's rmse: 0.225687
[835]	valid_0's rmse: 0.225682
[836]	valid_0's rmse: 0.22568
[837]	valid_0's rmse: 0.225676
[838]	valid_0's rmse: 0.225666
[839]	valid_0's rmse: 0.225667
[840]	valid_0's rmse: 0.225659
...

LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
       learning_rate=0.1, max_depth=-1, min_child_samples=20,
       min_child_weight=0.001, min_split_gain=0.0, n_estimators=1000,
       n_jobs=-1, num_leaves=31, objective=None, random_state=None,
       reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,
       subsample_for_bin=200000, subsample_freq=0)
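In case the early stopping logic is not obvious, here is a minimal pure-Python sketch of the idea: keep the best validation score seen so far and stop once a given number of rounds pass without improving it. This is only an illustration of the concept, not LightGBM's internal code, and the function name and RMSE values are made up.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, best_score), stopping after `patience` rounds
    without improvement. Rounds are 1-based and lower scores are better."""
    best_score = float("inf")
    best_round = 0
    for rnd, score in enumerate(scores, start=1):
        if score < best_score:
            best_score, best_round = score, rnd
        elif rnd - best_round >= patience:
            break  # no improvement for `patience` consecutive rounds
    return best_round, best_score

# Hypothetical validation RMSE per boosting round
rmse_per_round = [0.30, 0.28, 0.27, 0.271, 0.272, 0.269, 0.270, 0.271, 0.272, 0.2695]

patient = early_stopping_round(rmse_per_round, patience=3)
aggressive = early_stopping_round(rmse_per_round, patience=2)
```

With these numbers the more aggressive setting stops at round 3, while the more patient one reaches the slightly better round 6. That is the trade-off: a small patience saves time but can stop on a temporary plateau.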

In [54]:
preds = model1.predict(X_valid)
df_preds['interest'] = preds

In [55]:
df_ranked = df_preds.sort_values(['user_id_hash', 'interest'], ascending=[False, False])
df_ranked = (df_ranked
    .groupby('user_id_hash')['coupon_id_hash']
    .apply(list)
    .reset_index())
recomendations_dict = pd.Series(df_ranked.coupon_id_hash.values,
    index=df_ranked.user_id_hash).to_dict()

actual = []
pred = []
for k,_ in recomendations_dict.items():
    actual.append(list(interactions_valid_dict[k]))
    pred.append(list(recomendations_dict[k]))

print(mapk(actual,pred))

0.021764365943766
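For readers who have not looked inside `recutils.average_precision`, below is a sketch of the usual definition of mean average precision at `k`. I am assuming that `mapk` follows this common definition with a default of `k=10`; the names `apk_sketch` and `mapk_sketch` and the toy lists are hypothetical.

```python
def apk_sketch(actual, predicted, k=10):
    """Average precision at k for one user."""
    if not actual:
        return 0.0
    predicted = predicted[:k]
    score, num_hits = 0.0, 0
    for i, p in enumerate(predicted):
        # Count a hit only the first time a relevant item appears
        if p in actual and p not in predicted[:i]:
            num_hits += 1
            score += num_hits / (i + 1)  # precision at this rank
    return score / min(len(actual), k)

def mapk_sketch(actual, predicted, k=10):
    """Mean of the per-user average precision at k."""
    return sum(apk_sketch(a, p, k) for a, p in zip(actual, predicted)) / len(actual)

# Two hypothetical users: coupons they interacted with vs. ranked recommendations
actual_toy = [["c1", "c3"], ["c2"]]
pred_toy = [["c1", "c2", "c3"], ["c1", "c3", "c2"]]
map_toy = mapk_sketch(actual_toy, pred_toy)
```

The first user scores (1/1 + 2/3) / 2 and the second 1/3, so the mean is 7/12. Note how much a relevant coupon at rank 3 costs compared to rank 1.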


Not amazing, let's see what happens with optimization...

#### WITH OPTIMIZATION

Here we will use hyperopt, as in the previous Chapter, to optimise `lightGBM`.

It is worth mentioning that here features are numerical. This normally makes things slightly simpler. In this scenario you might want to try libraries like [tpot](https://epistasislab.github.io/tpot/) for automatic ML with genetic programming (if you have the time and the memory) or [ml-lens](http://ml-ensemble.com/info/start/ensembles.html) to build ensemble algorithms. 

Let's start with the usual objective function, optimising using the `MAP`

In [56]:
def lgb_objective_map(params):
    """
    objective function for lightgbm.
    """

    # hyperopt casts as float
    params['num_boost_round'] = int(params['num_boost_round'])
    params['num_leaves'] = int(params['num_leaves'])

    # need to be passed as parameter
    params['verbose'] = -1
    params['seed'] = 1

    cv_result = lgb.cv(
        params,
        lgtrain,
        nfold=3,
        metrics='rmse',
        num_boost_round=params['num_boost_round'],
        early_stopping_rounds=20,
        stratified=False,
    )
    early_stop_dict[lgb_objective_map.i] = len(cv_result['rmse-mean'])
    params['num_boost_round'] = len(cv_result['rmse-mean'])

    model = lgb.LGBMRegressor(**params)
    model.fit(X,y)
    preds = model.predict(X_valid)

    df_preds['interest'] = preds
    df_ranked = df_preds.sort_values(['user_id_hash', 'interest'], ascending=[False, False])
    df_ranked = (df_ranked
        .groupby('user_id_hash')['coupon_id_hash']
        .apply(list)
        .reset_index())
    recomendations_dict = pd.Series(df_ranked.coupon_id_hash.values,
        index=df_ranked.user_id_hash).to_dict()

    actual = []
    pred = []
    for k,_ in recomendations_dict.items():
        actual.append(list(interactions_valid_dict[k]))
        pred.append(list(recomendations_dict[k]))

    result = mapk(actual,pred)
    print("INFO: iteration {} MAP {:.3f}".format(lgb_objective_map.i, result))

    lgb_objective_map.i+=1

    return 1-result

Next, let's convert the data to the `lightGBM` dataset format and define the parameter space.

In [57]:
# lgb dataset object
lgtrain = lgb.Dataset(X,
    label=y,
    free_raw_data=False)

# defining the parameter space
lgb_parameter_space = {
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.5),
    'num_boost_round': hp.quniform('num_boost_round', 100, 500, 50),
    'num_leaves': hp.quniform('num_leaves', 30,1024,5),
    'min_child_weight': hp.quniform('min_child_weight', 1, 50, 2),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1.),
    'subsample': hp.uniform('subsample', 0.5, 1.),
    'reg_alpha': hp.uniform('reg_alpha', 0.01, 1.),
    'reg_lambda': hp.uniform('reg_lambda', 0.01, 1.),
}
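The `int(...)` casts in the objective function exist because quantized draws come back as floats. As far as I understand the hyperopt documentation, `hp.quniform(label, low, high, q)` returns a value like `round(uniform(low, high) / q) * q`. The pure-Python sketch below mimics that behaviour; `quniform_sketch` is a hypothetical stand-in and not hyperopt code.

```python
import random

def quniform_sketch(low, high, q, rng):
    """Draw uniformly in [low, high], snap to a multiple of q, return a float."""
    return float(round(rng.uniform(low, high) / q) * q)

rng = random.Random(0)

# Mimics the 'num_boost_round' entry of the parameter space above
samples = [quniform_sketch(100, 500, 50, rng) for _ in range(5)]

# The cast the objective function needs before using the value as a count
as_ints = [int(s) for s in samples]
```

Every sample is a multiple of 50 between 100 and 500, but typed as a float (e.g. `250.0`), which is why parameters such as `num_leaves` must be cast before training.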

I have not run the cell below here. It takes 98 min on a c5.4xlarge EC2 instance (30GB, 16 cores), so I used `screen` in the terminal and went for a run myself.

In [None]:
early_stop_dict = {}
trials = Trials()
start = time()
lgb_objective_map.i = 0
best = fmin(fn=lgb_objective_map,
            space=lgb_parameter_space,
            algo=tpe.suggest,
            max_evals=50,
            trials=trials)
best['num_boost_round'] = early_stop_dict[trials.best_trial['tid']]
best['num_leaves'] = int(best['num_leaves'])
best['verbose'] = -1
print(1-trials.best_trial['result']['loss'])
print(time()-start)
print(best)
pickle.dump(best,open("../datasets/Ponpare/data_processed/models/gbm_nmf_optimal_parameters.p", "wb"))

The best result is MAP: 0.021808

and the corresponding best parameters are

In [58]:
pickle.load(open("../datasets/Ponpare/data_processed/models/gbm_nmf_optimal_parameters.p", "rb"))

{'colsample_bytree': 0.70026944963067,
 'learning_rate': 0.13477552502641502,
 'min_child_weight': 40.0,
 'num_boost_round': 200,
 'num_leaves': 355,
 'reg_alpha': 0.4739150442922858,
 'reg_lambda': 0.7609758831113889,
 'subsample': 0.796699692621813,
 'verbose': -1}

Totally **not** worth it. Furthermore, using the so-called "Simple Solution" (i.e. no optimization) with `ncomp=100` you obtain MAP=0.0226255, and "in no time". Nonetheless, this is still significantly lower than what you get using `lightGBM` directly on the features themselves (see Chapter 10).

However, if you face a problem where this technique performs well, it is indeed a very useful technique. This is because the latent factors can be used for a number of things other than recommending. They have been learned based on users' behaviour. Therefore, you might want to use them for campaign targeting instead of demographic-based features (such as age, location, etc) for example. In this scenario, you will be targeting your users based on their behaviour instead of some "human-readable" features, which is possibly more adequate. 
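As a rough illustration of that idea, the sketch below finds the users whose latent vectors are closest (by cosine similarity) to a given seed user, which is one simple way to build a behaviour-based target group. The function name, the toy factors and the "lookalike" framing are my own assumptions and not part of the notebook.

```python
import numpy as np

def top_similar_users(user_factors, user_idx, n=2):
    """Indices of the n users most similar to `user_idx` in latent space."""
    norms = np.linalg.norm(user_factors, axis=1)
    norms = np.where(norms == 0, 1.0, norms)  # guard users with all-zero factors
    unit = user_factors / norms[:, None]
    sims = unit @ unit[user_idx]              # cosine similarity to the seed user
    sims[user_idx] = -np.inf                  # exclude the seed user itself
    return np.argsort(-sims)[:n]

# Hypothetical user latent factors (5 users, K=3)
toy_user_factors = np.array([[0.9, 0.1, 0.0],
                             [0.8, 0.2, 0.0],
                             [0.0, 0.1, 0.9],
                             [0.1, 0.0, 0.8],
                             [0.7, 0.0, 0.1]])

lookalikes = top_similar_users(toy_user_factors, user_idx=0, n=2)
```

With the real `user_factors` matrix you would map the returned row indices back to `user_id_hash` through `users_idx_dict`.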

Before we leave this notebook make sure you are familiar with the concept of latent factors, since similar principles with a different formulation will be applied when using our next technique: Factorization Machines.