### Chapter 13

### 13.1 Field Aware Factorization Machines

This chapter is the natural continuation of the previous one. Let's remind ourselves where we are: we want to predict the CTR, or more generally the interest of a user in an item. To that end we can use FMs, defined as:

$$\phi_\text{FM}(\boldsymbol{w}, \boldsymbol{x}) = \boldsymbol{w}_{0} + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} (\boldsymbol{w}_i \cdot \boldsymbol{w}_j) x_i x_j$$

where the feature interactions are learned through inner products between the latent vectors ($\boldsymbol{w}$) associated with each feature. The computational complexity of that expression is $O(\overline{n}k)$, where $\overline{n}$ is the average number of non-zero elements per instance and $k$ is the number of latent factors.
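
The linear cost comes from a well-known rewriting of the pairwise term, $\sum_{i<j} (\boldsymbol{w}_i \cdot \boldsymbol{w}_j) x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_i w_{i,f} x_i\right)^2 - \sum_i w_{i,f}^2 x_i^2\right]$. Below is a minimal pure-Python sketch that checks the identity on toy data. The function names and the toy values are mine, chosen for illustration only; this is not how `xlearn` implements it.

```python
import random

def fm_pairwise_naive(x, V):
    """Sum over i<j of (V[i] . V[j]) * x[i] * x[j], computed pair by pair."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            total += dot * x[i] * x[j]
    return total

def fm_pairwise_fast(x, V):
    """Same quantity via 0.5 * sum_f [(sum_i V[i][f] x_i)^2 - sum_i (V[i][f] x_i)^2]."""
    k = len(V[0])
    total = 0.0
    for f in range(k):
        s, s_sq = 0.0, 0.0
        for i, xi in enumerate(x):
            if xi == 0.0:
                continue  # only non-zero features contribute
            t = V[i][f] * xi
            s += t
            s_sq += t * t
        total += s * s - s_sq
    return 0.5 * total

# Toy instance: 6 features, 4 of them non-zero, k = 3 latent factors
random.seed(0)
n, k = 6, 3
x = [1.0, 0.0, 1.0, 0.0, 0.5, 1.0]
V = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n)]
assert abs(fm_pairwise_naive(x, V) - fm_pairwise_fast(x, V)) < 1e-12
```

The fast version touches each non-zero feature once per latent factor, which is where the $O(\overline{n}k)$ comes from.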

Although this technique captures feature interactions, there is no notion of "field". Let's consider the same example shown in the previous chapter:

|  |Publisher (P)| Advertiser (A)| Gender (G)| 
|----|-------|--------|--|
| YES| ESPN  | Nike   |M |


Using FMs, the outcome for this instance (`YES` in this case) would be predicted as:

$$\phi_{FM} = 
\boldsymbol{w}_0 + 
\boldsymbol{w}_\text{ESPN} x_\text{ESPN} + 
\boldsymbol{w}_\text{Nike} x_\text{Nike} + 
\boldsymbol{w}_\text{M} x_\text{M} + 
(\boldsymbol{w}_\text{ESPN}\cdot\boldsymbol{w}_\text{Nike}) x_\text{ESPN}x_\text{Nike} + 
(\boldsymbol{w}_\text{Nike}\cdot\boldsymbol{w}_\text{M}) x_\text{Nike}x_\text{M} + 
(\boldsymbol{w}_\text{ESPN}\cdot\boldsymbol{w}_\text{M}) x_\text{ESPN}x_\text{M}
$$

In this expression, there is no notion that ESPN and Nike are values of the fields Publisher and Advertiser respectively. Field Aware Factorization Machines address this "limitation" by introducing field information. More precisely:

$$\phi_\text{FFM}(\boldsymbol{w}, \boldsymbol{x}) = 
\boldsymbol{w}_{0} + 
\sum_{i=1}^{n} w_i x_i + 
\sum_{i=1}^{n}\sum_{j=i+1}^{n} (\boldsymbol{w}_{i, f_j} \cdot \boldsymbol{w}_{j, f_i}) x_i x_j$$

where $f_i$ and $f_j$ are the fields that features $i$ and $j$ belong to. The complexity of computing that expression is $O(\overline{n}^2 k)$ (i.e. slower than FMs).

When using FMs there is one latent vector per feature, learned through the interactions with all other features. In FFMs, each feature has several latent vectors, one for each field of the other features it interacts with. Let's go back to our example:

$$\phi_{FFM} = 
\boldsymbol{w}_0 + 
\boldsymbol{w}_\text{ESPN} x_\text{ESPN} + 
\boldsymbol{w}_\text{Nike} x_\text{Nike} + 
\boldsymbol{w}_\text{M} x_\text{M} + 
(\boldsymbol{w}_\text{ESPN,A}\cdot\boldsymbol{w}_\text{Nike,P}) x_\text{ESPN}x_\text{Nike} + 
(\boldsymbol{w}_\text{Nike,G}\cdot\boldsymbol{w}_\text{M,A}) x_\text{Nike}x_\text{M} + 
(\boldsymbol{w}_\text{ESPN,G}\cdot\boldsymbol{w}_\text{M,P}) x_\text{ESPN}x_\text{M}
$$

We can see that for the Publisher feature value ESPN, two latent vectors are learned, $\boldsymbol{w}_\text{ESPN,A}$ and $\boldsymbol{w}_\text{ESPN,G}$, depending on whether the interaction occurs with the field Advertiser or Gender. Because in FFMs each latent vector only needs to learn the effect with a specific field, usually $k_{FFM} \ll k_{FM}$.
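
To make the indexing concrete, here is a minimal pure-Python sketch of the FFM pairwise term for the ESPN/Nike/M instance. The dictionary layout, the function names and the random initialisation are assumptions of mine for illustration; `xlearn` stores and trains these vectors in its own way.

```python
import random

random.seed(0)
k = 4  # number of latent factors (illustrative value)

# Each feature value belongs to exactly one field
field_of = {"ESPN": "P", "Nike": "A", "M": "G"}
fields = ["P", "A", "G"]

# One latent vector per (feature, field of the *other* feature)
W = {(feat, fld): [random.gauss(0, 0.1) for _ in range(k)]
     for feat in field_of for fld in fields if fld != field_of[feat]}

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def ffm_pairwise(x, W, field_of):
    """Sum over pairs i<j of (w_{i,f_j} . w_{j,f_i}) * x_i * x_j."""
    feats = [f for f, v in x.items() if v != 0]
    total = 0.0
    for a in range(len(feats)):
        for b in range(a + 1, len(feats)):
            i, j = feats[a], feats[b]
            total += dot(W[(i, field_of[j])], W[(j, field_of[i])]) * x[i] * x[j]
    return total

x = {"ESPN": 1.0, "Nike": 1.0, "M": 1.0}
expected = (dot(W[("ESPN", "A")], W[("Nike", "P")])
            + dot(W[("Nike", "G")], W[("M", "A")])
            + dot(W[("ESPN", "G")], W[("M", "P")]))
assert abs(ffm_pairwise(x, W, field_of) - expected) < 1e-12
```

Every pair of active features needs its own dot product, since the vectors involved change with the pair. That double loop is the $O(\overline{n}^2 k)$ cost mentioned above.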

Let's see how all this looks in code.

In [1]:
import numpy as np
import pandas as pd
import random
import os
import xlearn as xl
import pickle

from recutils.average_precision import mapk
from time import time
from recutils.datasets import dump_libffm_file
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_squared_error
from hyperopt import hp, tpe, Trials
from hyperopt.fmin import fmin

inp_dir = "../datasets/Ponpare/data_processed/"
train_dir = "train"
valid_dir = "valid"

#### COUPONS

In [2]:
# COUPONS
df_coupons_train_feat = pd.read_pickle(os.path.join(inp_dir, train_dir, 'df_coupons_train_feat.p'))
drop_cols = [c for c in df_coupons_train_feat.columns
    if ((not c.endswith('_cat')) or ('method2' in c)) and (c!='coupon_id_hash')]
df_coupons_train_cat_feat = df_coupons_train_feat.drop(drop_cols, axis=1)
coupon_categorical_cols = [c for c in df_coupons_train_cat_feat.columns if c!="coupon_id_hash"]
df_coupons_train_feat.shape

(18622, 32)

#### USERS

In [3]:
# USERS
df_users_train_feat = pd.read_pickle(os.path.join(inp_dir, train_dir, 'df_users_train_feat.p'))
user_categorical_cols = [c for c in df_users_train_feat.columns if c.endswith('_cat')]
user_numerical_cols = [c for c in df_users_train_feat.columns
    if ((c not in user_categorical_cols) and (c!='user_id_hash'))]

# Normalizing numerical features
user_numerical_df = df_users_train_feat[user_numerical_cols]
user_numerical_df_norm = (user_numerical_df-user_numerical_df.min())/(user_numerical_df.max()-user_numerical_df.min())
df_users_train_feat.drop(user_numerical_cols, axis=1, inplace=True)
df_users_train_feat = pd.concat([user_numerical_df_norm, df_users_train_feat], axis=1)
df_users_train_feat.shape

(22624, 63)

#### TRAINING AND VALIDATION DATA

In [4]:
# interest dataframe
df_interest = pd.read_pickle(os.path.join(inp_dir, train_dir, 'df_interest.p'))
df_train = pd.merge(df_interest, df_users_train_feat, on='user_id_hash')
df_train = pd.merge(df_train, df_coupons_train_cat_feat, on = 'coupon_id_hash')

# for the time being we ignore recency
df_train.drop(['user_id_hash','coupon_id_hash','recency_factor'], axis=1, inplace=True)

In [5]:
# I want/need to ensure some order
all_cols = [c for c in df_train.columns.tolist() if c != 'interest']
cat_cols = [c for c in all_cols if c.endswith('_cat')]
num_cols = [c for c in all_cols if c not in cat_cols]
target = 'interest'
col_order=[target]+num_cols+cat_cols
df_train = df_train[col_order]

In [6]:
# load the validation interactions and coupon info
df_coupons_valid_feat = pd.read_pickle(os.path.join(inp_dir, 'valid', 'df_coupons_valid_feat.p'))
df_coupons_valid_cat_feat = df_coupons_valid_feat.drop(drop_cols, axis=1)

with open(os.path.join(inp_dir, valid_dir, "interactions_valid_dict.p"), "rb") as f:
    interactions_valid_dict = pickle.load(f)

In [7]:
# Build validation data: the cartesian product of validation users and coupons
left = pd.DataFrame({'user_id_hash':list(interactions_valid_dict.keys())})
left['key'] = 0
# .copy() avoids pandas' SettingWithCopyWarning when adding the 'key' column
right = df_coupons_valid_feat[['coupon_id_hash']].copy()
right['key'] = 0
df_valid = (pd.merge(left, right, on='key', how='outer')
    .drop('key', axis=1))
df_valid = pd.merge(df_valid, df_users_train_feat, on='user_id_hash')
df_valid = pd.merge(df_valid, df_coupons_valid_cat_feat, on = 'coupon_id_hash')
# placeholder target so that df_valid has the same columns as df_train
df_valid['interest'] = 0.1
# .copy() because the predictions are added to df_preds as a new column later on
df_preds = df_valid[['user_id_hash','coupon_id_hash']].copy()
df_valid.drop(['user_id_hash','coupon_id_hash'], axis=1, inplace=True)
df_valid = df_valid[col_order]


### XLEARN

In [8]:
# All needs to go to libffm format
XLEARN_DIR = inp_dir+"xlearn_data"
train_data_file = os.path.join(XLEARN_DIR,"xltrain_ffm.txt")
valid_data_file = os.path.join(XLEARN_DIR,"xlvalid_ffm.txt")
xlmodel_fname = os.path.join(XLEARN_DIR,"xlffm_model.out")
xlpreds_fname = os.path.join(XLEARN_DIR,"xlffm_preds.txt")

#### LibFFM Format

Recall from the previous chapter that, when using xlearn's `create_fm` method, the data have to be in `libsvm` format. In our example:

    Yes P-ESPN:1 A-Nike:1 G-Male:1

Luckily for us, `sklearn` comes with two convenient utilities in the `datasets` module: `dump_svmlight_file` and `load_svmlight_file`. When using the `create_ffm` method, we also need to encode the "field". There are a number of ways of doing it; please read Section 3.3 of the FFM [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf). For example, for categorical features we do:

    Yes P:P-ESPN:1 A:A-Nike:1 G:G-Male:1
    
For numerical features, let's consider the following example (extracted from their paper):

| Accepted | AR | Hidx | Cite| 
|----|-------|--------|--|
| YES| 45.47  | 2   |3 |

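After discretizing each numerical feature so that it can be treated as a categorical one (here `AR` is rounded to an integer), this instance would be represented as: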

    Yes AR:45:1 Hidx:2:1 Cite:3:1
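
In practice, libffm-style files use integer indices for both fields and features rather than names. The snippet below is a minimal sketch of that encoding for categorical columns only. The helper `to_libffm_line` and its arguments are hypothetical names of mine; it is not the `dump_libffm_file` function used later in this chapter.

```python
def to_libffm_line(label, row, field_ids, feature_ids):
    """Encode one instance as 'label field:feature:value ...'.

    `row` maps a column name to its categorical value. Every column gets
    its own field index and every (column, value) pair its own feature index.
    Both lookups are updated in place so they can be reused across rows.
    """
    tokens = [str(label)]
    for col, val in row.items():
        field = field_ids.setdefault(col, len(field_ids))
        feature = feature_ids.setdefault((col, val), len(feature_ids))
        tokens.append("{}:{}:1".format(field, feature))
    return " ".join(tokens)

field_ids, feature_ids = {}, {}
line1 = to_libffm_line(1, {"P": "ESPN", "A": "Nike", "G": "Male"},
                       field_ids, feature_ids)
line2 = to_libffm_line(0, {"P": "ESPN", "A": "Adidas", "G": "Female"},
                       field_ids, feature_ids)
print(line1)  # 1 0:0:1 1:1:1 2:2:1
print(line2)  # 0 0:0:1 1:3:1 2:4:1
```

Note how `ESPN` keeps the same feature index in both rows, while `Adidas` and `Female` receive new ones. The same lookups must be reused when encoding the validation data, which is why the codes returned for the training file are passed on to the validation call below.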

Fortunately, there are lots of clever people around and thanks to this Kaggle [kernel](https://www.kaggle.com/scirpus/libffm-generator-lb-280) by [Scirpus](https://www.kaggle.com/scirpus) we have a function that does the job. It is included in the `recutils.datasets` module in this repo.

This is how I use it:

In [9]:
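# 0 flags a numerical column, 1 a categorical one
catdict = {}
for x in num_cols:
    catdict[x] = 0
for x in cat_cols:
    catdict[x] = 1

# counter and lookup passed to (and returned by) dump_libffm_file
currentcode = len(num_cols)
catcodes = {}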

In [10]:
start = time()
currentcode_tr, catcodes_tr =  dump_libffm_file(df_train,
    target, catdict, currentcode, catcodes, train_data_file, verbose=True)
print("{} min".format(round((time()-start)/60., 3)))

Row 100000
Row 200000
Row 300000
Row 400000
Row 500000
Row 600000
Row 700000
Row 800000
Row 900000
Row 1000000
Row 1100000
Row 1200000
Row 1300000
Row 1400000
Row 1500000
8.224 min


In [11]:
start = time()
currentcode_va, catcodes_va =  dump_libffm_file(df_valid,
    target, catdict, currentcode_tr, catcodes_tr, valid_data_file, verbose=True)
print("{} min".format(round((time()-start)/60., 3)))

Row 100000
Row 200000
Row 300000
Row 400000
Row 500000
Row 600000
Row 700000
Row 800000
Row 900000
Row 1000000
Row 1100000
Row 1200000
Row 1300000
Row 1400000
Row 1500000
Row 1600000
Row 1700000
Row 1800000
Row 1900000
Row 2000000
Row 2100000
11.476 min


### 13.2 Experiments

### 13.2.1 Experiment 1: defaults

In [16]:
# Before we start with the optimization let's try the defaults
params = {'epoch': 20, 'task': 'reg', 'metric': 'rmse'}
xl_model = xl.create_ffm()
xl_model.setTrain(train_data_file)
xl_model.setTest(valid_data_file)
xl_model.fit(params, xlmodel_fname)
xl_model.predict(xlmodel_fname, xlpreds_fname)

In [26]:
preds = np.loadtxt(xlpreds_fname)
df_preds['interest'] = preds

df_ranked = df_preds.sort_values(['user_id_hash', 'interest'],
    ascending=[False, False])
df_ranked = (df_ranked
    .groupby('user_id_hash')['coupon_id_hash']
    .apply(list)
    .reset_index())
recomendations_dict = pd.Series(df_ranked.coupon_id_hash.values,
    index=df_ranked.user_id_hash).to_dict()

actual = []
pred = []
for k,_ in recomendations_dict.items():
    actual.append(list(interactions_valid_dict[k]))
    pred.append(list(recomendations_dict[k]))

print(mapk(actual,pred))

0.026023183621136616


0.026 out of the box! That is encouraging. Let's see if we can push it higher with some optimization.
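
For reference, this is roughly what a MAP@k computation looks like. It is a minimal sketch that assumes `recutils.average_precision.mapk` follows the widely used `ml_metrics` definition with a default of `k=10`; I have not reproduced the actual `recutils` code here, and the `_sketch` names are mine.

```python
def apk_sketch(actual, predicted, k=10):
    """Average precision at k for one user.

    `actual` is the collection of relevant items, `predicted` the ranked
    list of recommendations (best first).
    """
    if not actual:
        return 0.0
    predicted = predicted[:k]
    score, num_hits = 0.0, 0
    for i, p in enumerate(predicted):
        # count each relevant item once, at the rank where it first appears
        if p in actual and p not in predicted[:i]:
            num_hits += 1
            score += num_hits / (i + 1.0)
    return score / min(len(actual), k)

def mapk_sketch(actual, predicted, k=10):
    """Mean of the per-user average precision at k."""
    scores = [apk_sketch(a, p, k) for a, p in zip(actual, predicted)]
    return sum(scores) / len(scores)

# One user with hits at ranks 1 and 3, another with no hits at all
actual = [["a", "b"], ["c"]]
predicted = [["a", "x", "b"], ["y", "z"]]
print(mapk_sketch(actual, predicted))
```

Under this definition only the order of the first `k` recommendations per user matters, so the absolute values of the predicted `interest` are irrelevant as long as the ranking is right.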

### 13.2.2 Experiment 2: with optimization

My initial idea was to use the `df_train` dataset and split it into training and evaluation sets for optimization purposes. This way, I would be able to use the early stopping functionality in `xlearn`. I would then train and evaluate with these datasets and predict with the original validation data (`df_valid`), which is just the cartesian product of the validation coupons and the users that were seen during both training and validation.

However, in the many manual runs, I never found that the algorithm *"early-stopped"*. Therefore, I decided to just use the existing training and validation data files (`train_data_file` and `valid_data_file`) and not use early stopping. Nonetheless, below I also include the code that you would need if you wanted to use early stopping.

Let's have a look at the code.

In [1]:
# UNCOMMENT THE TWO CELLS BELOW IF YOU EVENTUALLY WANT TO USE EARLY STOPPING

# # Train and validation to be used during optimization (they will both come from df_train)
# train_data_file_opt = os.path.join(XLEARN_DIR,"xltrain_ffm_opt.txt")
# valid_data_file_opt = os.path.join(XLEARN_DIR,"xlvalid_ffm_opt.txt")

In [2]:
# df_train_opt, df_valid_opt = train_test_split(df_train, test_size=0.3, random_state=1981)

# catdict = {}
# for x in num_cols:
#     catdict[x] = 0
# for x in cat_cols:
#     catdict[x] = 1

# currentcode = len(num_cols)
# catcodes = {}

# currentcode_tr_opt, catcodes_tr_opt =  dump_libffm_file(df_train_opt,
#     target, catdict, currentcode, catcodes, train_data_file_opt, verbose=True)

# currentcode_va_opt, catcodes_va_opt =  dump_libffm_file(df_valid_opt,
#     target, catdict, currentcode_tr_opt, catcodes_tr_opt, valid_data_file_opt, verbose=True)

# currentcode_va, catcodes_va =  dump_libffm_file(df_valid,
#     target, catdict, currentcode_va_opt, catcodes_va_opt, valid_data_file, verbose=True)

Temporary files for the model and prediction outputs during optimization:

In [7]:
# temporary outputs of the model during optimization
xlmodel_fname_tmp = os.path.join(XLEARN_DIR,"xlffm_model_tmp.out")
xlpreds_fname_tmp = os.path.join(XLEARN_DIR,"xlffm_preds_tmp.txt")

Let's now define our optimization function

In [8]:
def xl_objective(params):

    start = time()
    xl_objective.i+=1

    params['task'] = 'reg'
    params['metric'] = 'rmse'
    
    # uncomment this line if you are using early stopping and define your window
    # params['stop_window'] = 3

    # remember hyperopt casts as floats
    params['epoch'] = int(params['epoch'])
    params['k'] = int(params['k'])

    xl_model.fit(params, xlmodel_fname_tmp)
    xl_model.predict(xlmodel_fname_tmp, xlpreds_fname_tmp)

    # We optimize using the recommendations success metric: MAP
    # Therefore, we add the predictions to the df_preds dataframe,
    # we rank and calculate the MAP
    predictions = np.loadtxt(xlpreds_fname_tmp)
    df_preds['interest'] = predictions

    df_ranked = df_preds.sort_values(['user_id_hash', 'interest'],
        ascending=[False, False])
    df_ranked = (df_ranked
        .groupby('user_id_hash')['coupon_id_hash']
        .apply(list)
        .reset_index())
    recomendations_dict = pd.Series(df_ranked.coupon_id_hash.values,
        index=df_ranked.user_id_hash).to_dict()

    actual = []
    pred = []
    for k,_ in recomendations_dict.items():
        actual.append(list(interactions_valid_dict[k]))
        pred.append(list(recomendations_dict[k]))

    score = mapk(actual,pred)
    end = round((time() - start)/60.,2)

    print("INFO: iteration {} was completed in {} min. Score {:.3f}".format(xl_objective.i, end, score))

    return 1-score

We define the parameter space. These values are mostly based on the values I see in their [documentation](http://xlearn-doc.readthedocs.io/en/latest/python_api.html#). Nonetheless, this process takes ages for a dataset of this size, so we will limit our exploration to narrow ranges. Also note that I have not run some cells in the notebook: as I mentioned, it takes a long time, so I used `screen` in the terminal and left the process running for some time.

In [9]:
xl_parameter_space = {
    'lr': hp.uniform('lr', 0.1, 0.4),
    'lambda': hp.uniform('lambda', 0.00002, 0.0001),
    'init': hp.uniform('init', 0.4, 0.8),
    'epoch': hp.quniform('epoch', 5, 20, 2),
    'k': hp.quniform('k', 4, 8, 1)
    }

Instantiate the model and load the corresponding datasets

In [None]:
xl_model = xl.create_ffm()
xl_model.setTrain(train_data_file)
xl_model.setTest(valid_data_file)

# # IF YOU WERE USING EARLY STOPPING THIS CELL WOULD LOOK LIKE THIS:
# xl_model = xl.create_ffm()
# xl_model.setTrain(train_data_file_opt)
# xl_model.setValidate(valid_data_file_opt)
# xl_model.setTest(valid_data_file)

And optimize (go for a coffee, a run, a swim, a night out...)

In [None]:
trials = Trials()
xl_objective.i = 0
best_ffm = fmin(
    fn=xl_objective,
    space=xl_parameter_space,
    algo=tpe.suggest,
    max_evals=7,
    trials=trials
    )
pickle.dump(best_ffm, open(os.path.join(XLEARN_DIR,'best_ffm.p'), "wb"))
pickle.dump(trials.best_trial, open(os.path.join(XLEARN_DIR,'best_trial_ffm.p'), "wb"))

I never managed to get anything better than **MAP@10=0.028** with these parameters, which is not bad. Also note that **I only ran 7 iterations**. As I mentioned in the previous chapter (and below), either I am doing something wrong or there is a problem with the package, since memory usage grows by 3GB+ per iteration. I guess that if this problem were fixed and I could run more iterations, we might get to values of 0.03 or higher, similar to those obtained with `lightGBM`.

Anyway, the best parameters obtained out of these 7 iterations were:

In [4]:
best_ffm = pickle.load(open(os.path.join(XLEARN_DIR,'best_ffm.p'), "rb"))
best_ffm

{'epoch': 20.0,
 'init': 0.5727517342785235,
 'k': 5.0,
 'lambda': 3.899154394317734e-05,
 'lr': 0.1598910860257523}

### 13.3 VERDICT

As I mentioned in the previous chapter, `xlearn` is not production ready. There are a number of aspects that need refinement. Here are a few of the problems I have faced, from least to most relevant:

1. Too much verbosity, even when `setQuiet` is used.
2. The sklearn API is a bit "odd". I normally obtained all `nan` when the same process worked well with the native methods.
3. Still on the subject of `nan`: sometimes, after a run where all seemed to be ok, I obtained a `test loss: nan` message, yet the `MAP` was still decent (MAP@10$\geq$0.025). Maybe this is because the `rmse`, as measured by the package, was negligible(?).
4. As soon as there is a `"file does not exists"` error the program terminates and kicks you out of interactive tools like the `ipython` console. Not a deal breaker, but annoying.
5. Small changes in parameters lead to large changes in the predicted `rmse` and `MAP`. I personally do not feel confident when these things happen. However, it might also have to do with the data, although this behaviour was never seen when using `lightGBM`.
6. It is hard to access the model results. For example, I can't find the details of the folds (not even the score) when using `cv`.
7. You will have noticed that when using `hyperopt` I only ran 7 iterations. This is because every iteration adds around $\sim$3.5GB of RAM. I am using a c5.4xlarge with 30GB, so 7 is the maximum number I can run before I see the nice `killed` message. I am not sure whether this is the expected behaviour or there is some kind of memory leak. I tried removing the model in each iteration, as well as the `xlffm_model_tmp.out` file, in case the information accumulates in the file and blows up when it is loaded in the `xl_model.predict` call, but nothing helped and the memory is not released. Maybe a workaround for this issue is using `skopt` and the `x0` initialization parameter. We could run steps of 7 iterations and start the next one with the best parameters from the previous one. To be honest, just writing it sounds a bit painful, so I will leave it to you if you want to give it a go.
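
To give an idea of what the batched workaround in point 7 could look like, here is a minimal sketch. To keep it self-contained I swap `skopt` for a plain random search and `xl_objective` for a toy function, so `toy_objective`, `sample_params` and `run_batch` are all hypothetical names of mine. The only point is the control flow: each batch starts from the best parameters of the previous one.

```python
import random

def toy_objective(params):
    # Hypothetical stand-in for xl_objective: smaller is better
    return (params["lr"] - 0.2) ** 2 + 0.01 * (params["k"] - 6) ** 2

def sample_params(rng):
    # Same kind of ranges as xl_parameter_space, restricted to two parameters
    return {"lr": rng.uniform(0.1, 0.4), "k": rng.randint(4, 8)}

def run_batch(objective, rng, n_evals, start=None):
    """Evaluate `start` (if given) plus random candidates; return (loss, params) of the best."""
    candidates = [start] if start is not None else []
    candidates += [sample_params(rng) for _ in range(n_evals - len(candidates))]
    scored = [(objective(p), p) for p in candidates]
    return min(scored, key=lambda t: t[0])

rng = random.Random(1981)
best_loss, best_params = None, None
history = []
for batch in range(3):
    # in a real run each batch would be a separate process, so that
    # whatever memory it accumulated is released when it exits
    best_loss, best_params = run_batch(toy_objective, rng, 7, start=best_params)
    history.append(best_loss)

print(history, best_params)
```

Because the previous best is always re-evaluated as the first candidate, the best loss can never get worse from one batch to the next. With `skopt`, the analogous step would be passing the previous best point through `x0`.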

For all of the above, I can only conclude that while it has been "fun" to play with the package for a while, I would probably turn to other packages when it comes to factorization methods, if I decided to go with them to production. `ALS` in `Spark MLlib` has been a good option for me in the past. I recently came across the [spotlight](https://maciejkula.github.io/spotlight/) package, by the creator of `lightFM`, which also seems a potential option. However, it is based on `PyTorch` which, at the time of writing, is not production ready either. On their site they say that version 1.0, ready for research and production, will be out soon.

Let's now move on to our final example: a Deep Learning based recommendation algorithm. The Ponpare dataset is not really well suited for these types of algorithms (I discuss this further in the next chapter), but let's simply illustrate an example and I will use a better suited dataset in the future.