## Chapter 12

### 12.1 Factorization Machines

The text in this notebook is based on this [paper](https://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf). [Chao Ma](https://github.com/aksnzhy) is the author of the [xlearn](http://xlearn-doc.readthedocs.io/en/latest/start.html) package, which looks really promising.

The differentiating method in `xlearn` is Field-aware Factorization Machines (FFMs), which have been the winning method in a couple of click-through rate (CTR) prediction competitions. However, the library also includes [Factorization Machines](https://cseweb.ucsd.edu/classes/fa17/cse291-b/reading/Rendle2010FM.pdf) (FMs) and linear methods for large datasets. Therefore, we will first explore FMs and then move on to FFMs in the next chapter/notebook.

There are a number of packages for Factorization Machines in python: 


1. [pyFM](https://github.com/coreylynch/pyFM)
2. [pywFM](https://github.com/jfloff/pywFM)
3. [fastFM](https://github.com/ibayer/fastFM)
4. [lightFM](https://github.com/lyst/lightfm)

While I am familiar with `pyFM` and `lightFM`, I have never used `pywFM` and have only "played a bit" with `fastFM`. To be honest, with the exception of `lightFM`, I do not think any of them are production ready. Let me clarify that `lightFM` is not, strictly speaking, an FM implementation, but a hybrid matrix factorisation model. However, given the resemblance between the methods (see Section 3 of Maciej Kula's [paper](https://arxiv.org/pdf/1507.08439.pdf)), the author decided to call it `lightFM`, and I think it must be included in the list above.

So, what are factorization machines? Let's see if I can answer this question with some math and plain English.

Let's assume we have $m$ items that are displayed on a site. For each item we have $(y_i, \boldsymbol{x}_i)$, where $i=1,...,m$, $\boldsymbol{x}_i$ is an $n$-dimensional feature vector and $y_i$ is our *"target"*, for example whether a user will click on a specific link to an item. In this scenario (logistic regression, click or not) the model can be obtained by solving the following optimization problem, i.e. minimizing the log-loss with regularization:

$$ \min\limits_{\boldsymbol{w}} \frac{\lambda}{2} ||\boldsymbol{w}||^{2}_{2} + \sum_{i=1}^{m} \log(1+\exp(-y_i\phi_{LM}(\boldsymbol{w},\boldsymbol{x}_i)))$$

where $\lambda$ is the regularization parameter and $\phi_{LM}$ is linear:

$$\phi_{LM}(\boldsymbol{w},\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x}$$
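To make the objective concrete, here is a minimal NumPy sketch of the regularized log-loss for the linear model. It assumes the labels are coded as $y_i \in \{-1, +1\}$ (which is what the $-y_i\phi$ term implies); the function names and the toy data are mine and only for illustration.

```python
import numpy as np

def phi_lm(w, X):
    """Linear model: one score per row of X."""
    return X @ w

def regularized_log_loss(w, X, y, lam):
    """lam/2 * ||w||^2 + sum_i log(1 + exp(-y_i * phi(w, x_i))), with y_i in {-1, +1}."""
    margins = y * phi_lm(w, X)
    # logaddexp(0, -m) equals log(1 + exp(-m)) and is numerically stable
    return 0.5 * lam * np.dot(w, w) + np.sum(np.logaddexp(0.0, -margins))

# toy data: 4 instances, 3 features (made-up values)
X = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])
y = np.array([1., -1., 1., -1.])

w_zero = np.zeros(3)
loss_at_zero = regularized_log_loss(w_zero, X, y, lam=0.1)
# with w = 0 every margin is 0, so the loss is 4 * log(2)
print(loss_at_zero)
```

Any gradient-based optimizer can then be used to minimize this function with respect to `w`.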

Let's use the example in the [Yuchin Juan, et al 2017](https://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf) paper. 

| Clicked | Unclicked | Publisher (P) | Advertiser (A) |
|-----|-----|--------|--------|
| +80 | −20 |  ESPN  | Nike   |
| +10 | −90 |  ESPN  | Gucci  |
| +0  | −1  |  ESPN  | Adidas |
| +15 | −85 |  Vogue | Nike   |
| +90 | −10 |  Vogue | Gucci  |
| +10 | −90 |  Vogue | Adidas |
| +85 | −15 |  NBC   | Nike   |
| +0  | −0  |  NBC   | Gucci  |
| +90 | −10 |  NBC   | Adidas |

where + (−) represents the number of clicked (unclicked) impressions. And a single instance:

| Clicked | Publisher (P) | Advertiser (A) | Gender (G) |
|----|-------|--------|--|
| YES| ESPN  | Nike   |M |

With a linear model, the score used to compute the click probability for this observation would be calculated as:

$$\phi_{LM} = \boldsymbol{w}_0 + \boldsymbol{w}_\text{ESPN} x_\text{ESPN} + \boldsymbol{w}_\text{Nike} x_\text{Nike} + \boldsymbol{w}_\text{M} x_\text{M} = \boldsymbol{w}_0 + \boldsymbol{w}_\text{ESPN} + \boldsymbol{w}_\text{Nike} + \boldsymbol{w}_\text{M}$$ 

Note that `Publisher`, `Advertiser` and `Gender` are categorical features. Once one-hot encoded, $x_\text{ESPN}$ will be 1 if the value for the feature Publisher is ESPN and 0 otherwise, and hence the equality in the expression above.
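The equality is easy to check numerically. The sketch below uses a hypothetical vocabulary of one-hot slots (the `P=`, `A=`, `G=` names and the random weights are made up for illustration) and shows that the dot product collapses to the sum of the three active weights.

```python
import numpy as np

# hypothetical one-hot vocabulary: one slot per category value
features = ["P=ESPN", "P=Vogue", "P=NBC", "A=Nike", "A=Gucci", "A=Adidas", "G=M", "G=F"]
index = {name: i for i, name in enumerate(features)}

def one_hot(active):
    """Return a vector with 1.0 in the slots listed in `active`, 0.0 elsewhere."""
    x = np.zeros(len(features))
    for name in active:
        x[index[name]] = 1.0
    return x

rng = np.random.default_rng(0)
w0 = 0.1
w = rng.normal(size=len(features))   # made-up weights

x = one_hot(["P=ESPN", "A=Nike", "G=M"])
phi_dot = w0 + w @ x
phi_sum = w0 + w[index["P=ESPN"]] + w[index["A=Nike"]] + w[index["G=M"]]
print(np.isclose(phi_dot, phi_sum))
```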

Of course, a limitation of that model is that it does not capture feature interactions. A way to address that limitation is using polynomial models, for example, of degree 2: 

$$\phi_\text{Poly2} = \boldsymbol{w}_0 + \boldsymbol{w}_\text{ESPN} x_\text{ESPN} + \boldsymbol{w}_\text{Nike} x_\text{Nike} + \boldsymbol{w}_\text{M} x_\text{M} + \boldsymbol{w}_\text{ESPN, Nike} x_\text{ESPN} x_\text{Nike} + \boldsymbol{w}_\text{Nike, M} x_\text{Nike} x_\text{M} + \boldsymbol{w}_\text{ESPN, M} x_\text{ESPN} x_\text{M}$$

or in a more compact notation:

$$\phi_\text{Poly2}(\boldsymbol{w}, \boldsymbol{x}) = \boldsymbol{w}_{0} + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \boldsymbol{w}_{h(i,j)}x_{i} x_{j}$$

where $h(i, j)$ is a function encoding $i$ and $j$ into a natural number. The complexity of computing that expression is $O(\overline{n}^2)$, where $\overline{n}$ is the average number of non-zero elements per instance. One drawback of the Poly2 solution is its "limited learning". For example, in our first table, there is only one example of the pair (ESPN, Adidas), and the user did not click. For Poly2, it is likely that a very negative weight $w_\text{ESPN,Adidas}$ is learned. Also, there are no examples for the pair (NBC, Gucci) and, in consequence, no weight will be learned for it. This limitation is overcome by Factorization Machines.
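As an illustration of the Poly2 expression, here is a small sketch with one possible choice of $h(i, j)$: the position of the pair in the upper triangle of an $n \times n$ matrix. This particular `h` is my assumption (real implementations often use a hashing trick instead). Only the non-zero features are visited, which is where the $O(\overline{n}^2)$ cost comes from.

```python
import numpy as np

def h(i, j, n):
    """Map a pair i < j (0-based) to a unique index in [0, n*(n-1)/2)."""
    return i * n - i * (i + 1) // 2 + (j - i - 1)

def phi_poly2(w0, w_lin, w_pair, x):
    """Degree-2 polynomial score; loops only over the non-zero entries of x."""
    n = len(x)
    nz = np.flatnonzero(x)
    score = w0 + np.dot(w_lin[nz], x[nz])
    for a in range(len(nz)):
        for b in range(a + 1, len(nz)):
            i, j = nz[a], nz[b]
            score += w_pair[h(i, j, n)] * x[i] * x[j]
    return score

n = 4
# every pair gets its own weight slot: 0, 1, ..., n*(n-1)/2 - 1
pair_ids = [h(i, j, n) for i in range(n) for j in range(i + 1, n)]

x = np.array([1., 0., 1., 1.])
w_pair = np.arange(6, dtype=float)   # made-up pair weights
score = phi_poly2(0.0, np.zeros(n), w_pair, x)
print(pair_ids, score)
```

Note that a pair weight only receives updates when both features are active in the same instance, which is exactly the "limited learning" issue described above.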

Factorization Machines (FMs), proposed by Steffen Rendle in his [paper](https://cseweb.ucsd.edu/classes/fa17/cse291-b/reading/Rendle2010FM.pdf), implicitly learn a latent vector for each feature. Each latent vector contains $k$ latent factors, where $k$ is a user-specified parameter. Then, the effect of feature conjunction is modelled by the inner product of two latent vectors: 

$$\phi_\text{FM}(\boldsymbol{w}, \boldsymbol{x}) = \boldsymbol{w}_{0} + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} (\boldsymbol{w}_i \cdot \boldsymbol{w}_j) x_i x_j$$

One can prove that the complexity of computing that expression is $O(\overline{n}k)$. Since one would expect the number of latent factors to be smaller than the average number of non-zero elements per instance, FMs are normally faster than Poly2 approaches.
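The proof relies on the identity from Rendle's paper, $\sum_{i<j} (\boldsymbol{v}_i \cdot \boldsymbol{v}_j) x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_i v_{if} x_i\right)^2 - \sum_i v_{if}^2 x_i^2\right]$. The sketch below checks it numerically; I write the latent vectors as rows of a matrix `V` to keep them apart from the linear weights, and the data is random.

```python
import numpy as np

def fm_pairwise_naive(V, x):
    """Sum over i < j of (v_i . v_j) x_i x_j, computed with a double loop."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.dot(V[i], V[j]) * x[i] * x[j]
    return total

def fm_pairwise_fast(V, x):
    """Same quantity via 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2]."""
    nz = np.flatnonzero(x)            # only non-zero features are touched
    Vx = V[nz] * x[nz, None]          # shape (number of non-zeros, k)
    return 0.5 * np.sum(Vx.sum(axis=0) ** 2 - (Vx ** 2).sum(axis=0))

rng = np.random.default_rng(0)
n, k = 8, 3
V = rng.normal(size=(n, k))
x = np.array([1., 0., 0., 1., 0., 2., 0., 0.])

naive = fm_pairwise_naive(V, x)
fast = fm_pairwise_fast(V, x)
print(np.isclose(naive, fast))
```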

In addition, for FMs the prediction of (ESPN, Adidas) is determined by $w_\text{ESPN} \cdot w_\text{Adidas}$, and $w_\text{ESPN}$ and $w_\text{Adidas}$ are also learned from other pairs (e.g. (ESPN, Nike), (NBC, Adidas)). Therefore, it is likely that the corresponding prediction will be more accurate. Furthermore, even though there is no training data for the pair (NBC, Gucci), because $w_\text{NBC}$ and $w_\text{Gucci}$ can be learned from other pairs, it is still possible to make meaningful predictions with those weights.

Although the formulation of the problem is different, the idea is similar to the Matrix Factorization (MF) technique described in the previous chapter. There, the ratings for an item were obtained as the inner product of two latent vectors (item and user latent vectors) with $k$ latent factors.

One difference is that in the example here, the latent vectors are associated with features (Publisher, Advertiser, etc). However, user and item vectors can also be learned, along with feature latent vectors, when using FMs. You just need to encode them as part of the sparse matrix of features (see Figure 1 in Rendle 2010). Therefore, if features are important when predicting ratings or CTR, FMs are likely to perform better than MF.

Note that I say: *"if features are important"*. This is because I sometimes find that using only user behaviour (e.g. interactions) yields better results than using user behaviour plus user and item features. Of course, this is something that needs to be carefully explored when building your algorithm.
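To illustrate the link with MF mentioned above, here is a small sketch in which the only features are a one-hot user block and a one-hot item block. With that encoding the FM score reduces to a global bias, a user bias, an item bias and the inner product of the user and item latent vectors. All sizes, names and weights are made up for the example.

```python
import numpy as np

n_users, n_items, k = 3, 4, 2
rng = np.random.default_rng(1)
w0 = 0.5
w = rng.normal(size=n_users + n_items)        # linear weights
V = rng.normal(size=(n_users + n_items, k))   # latent vectors, one row per feature

def encode(user, item):
    """Concatenate a one-hot user block and a one-hot item block."""
    x = np.zeros(n_users + n_items)
    x[user] = 1.0
    x[n_users + item] = 1.0
    return x

def fm_score(x):
    """FM score: bias + linear term + pairwise term over the non-zero features."""
    nz = np.flatnonzero(x)
    Vx = V[nz] * x[nz, None]
    pairwise = 0.5 * np.sum(Vx.sum(axis=0) ** 2 - (Vx ** 2).sum(axis=0))
    return w0 + w[nz] @ x[nz] + pairwise

u, i = 2, 1
fm = fm_score(encode(u, i))
mf_with_biases = w0 + w[u] + w[n_users + i] + V[u] @ V[n_users + i]
print(np.isclose(fm, mf_with_biases))
```

Appending extra columns (user or item features) to the encoded vector simply adds more terms to the same expression.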

I hope at this stage we are clear on why FMs are powerful when building prediction algorithms. Let's use them for the example here with the Ponpare dataset.

In [1]:
import numpy as np
import pandas as pd
import random
import os
import gc
import xlearn as xl
import pickle

from time import time
from sklearn.datasets import dump_svmlight_file, load_svmlight_file
from scipy.sparse import csr_matrix, save_npz
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_squared_error
from hyperopt import hp, tpe
from hyperopt.fmin import fmin
from recutils.average_precision import mapk

inp_dir = "../datasets/Ponpare/data_processed/"
train_dir = "train"
valid_dir = "valid"

**COUPONS**

In [2]:
# train coupon features (with coupons we will focus on categorical features only)
df_coupons_train_feat = pd.read_pickle(os.path.join(inp_dir, train_dir, 'df_coupons_train_feat.p'))
drop_cols = [c for c in df_coupons_train_feat.columns
    if (('_cat' not in c) or ('method2' in c)) and (c!='coupon_id_hash')]

df_coupons_train_cat_feat = df_coupons_train_feat.drop(drop_cols, axis=1)
coupons_cols_to_oh = [c for c in df_coupons_train_cat_feat.columns if (c!='coupon_id_hash')]

drop_cols

['price_rate',
 'catalog_price',
 'discount_price',
 'dispperiod',
 'validperiod',
 'validperiod_method2_cat',
 'validfrom_method2_cat',
 'validend_method2_cat']

In [3]:
# We are going to use FMs (and linear models) with xlearn. Here there is no "automatic" 
# treatment of categorical features. Therefore, we need to one-hot encode them. 
# To one hot encode we need to do it all at once, validation and training coupons

# Read the validation coupon features
df_coupons_valid_feat = pd.read_pickle(os.path.join(inp_dir, 'valid', 'df_coupons_valid_feat.p'))
df_coupons_valid_cat_feat = df_coupons_valid_feat.drop(drop_cols, axis=1)

df_coupons_train_cat_feat['is_valid'] = 0
df_coupons_valid_cat_feat['is_valid'] = 1

# DataFrame.append was removed in pandas 2.0; pd.concat is equivalent here
df_all_coupons = pd.concat(
    [df_coupons_train_cat_feat, df_coupons_valid_cat_feat], ignore_index=True)

In [4]:
df_all_coupons_oh_feat = pd.get_dummies(df_all_coupons, columns=coupons_cols_to_oh)
df_coupons_train_oh_feat = (df_all_coupons_oh_feat[df_all_coupons_oh_feat.is_valid==0]
    .drop('is_valid', axis=1))
df_coupons_valid_oh_feat = (df_all_coupons_oh_feat[df_all_coupons_oh_feat.is_valid==1]
    .drop('is_valid', axis=1))
df_coupons_train_oh_feat.shape

(18622, 233)
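The reason for one-hot encoding training and validation coupons together is column alignment: if a category value appears in only one of the two sets, encoding them separately produces different columns. A toy sketch (the column names and values below are made up):

```python
import pandas as pd

train = pd.DataFrame({"coupon_id": ["a", "b"], "genre_cat": [0, 1]})
valid = pd.DataFrame({"coupon_id": ["c", "d"], "genre_cat": [1, 2]})  # category 2 unseen in train

# encoding separately yields different column sets
sep_train = pd.get_dummies(train, columns=["genre_cat"])
sep_valid = pd.get_dummies(valid, columns=["genre_cat"])
print(list(sep_train.columns), list(sep_valid.columns))

# encoding jointly, then splitting on a flag, keeps the columns aligned
both = pd.concat([train.assign(is_valid=0), valid.assign(is_valid=1)], ignore_index=True)
both_oh = pd.get_dummies(both, columns=["genre_cat"])
joint_train = both_oh[both_oh.is_valid == 0].drop("is_valid", axis=1)
joint_valid = both_oh[both_oh.is_valid == 1].drop("is_valid", axis=1)
print(list(joint_train.columns) == list(joint_valid.columns))
```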

**USERS**

In [5]:
# train user-features: there are a lot of features for users, both, numerical
# and categorical. We keep them all
df_users_train_feat = pd.read_pickle(os.path.join(inp_dir, train_dir, 'df_users_train_feat.p'))

In [6]:
# Normalizing the numerical columns
user_categorical_cols = [c for c in df_users_train_feat.columns if c.endswith('_cat')]
user_numerical_cols = [c for c in df_users_train_feat.columns
    if ((c not in user_categorical_cols) and (c!='user_id_hash'))]
user_numerical_df = df_users_train_feat[user_numerical_cols]

# I know I could use MinMaxScaler(), but it returns a np array. I would have to transform the
# object into a pandas df and add column names. That is easy enough, but the line below is easier
user_numerical_df_norm = (user_numerical_df-user_numerical_df.min())/(user_numerical_df.max()-user_numerical_df.min())
df_users_train_feat.drop(user_numerical_cols, axis=1, inplace=True)
df_users_train_feat = pd.concat([user_numerical_df_norm, df_users_train_feat], axis=1)
df_users_train_oh_feat = pd.get_dummies(df_users_train_feat, columns=user_categorical_cols)
df_users_train_oh_feat.shape

(22624, 456)
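One thing worth checking with this kind of manual min-max scaling is constant columns: when `max == min` the division is 0/0 and the whole column becomes `NaN`. The helper below is a hypothetical variant (not part of the notebook) that maps such columns to 0 instead; the toy data is made up.

```python
import pandas as pd

def min_max_normalize(df):
    """Scale each column to [0, 1]; constant columns are mapped to 0 instead of NaN."""
    col_min = df.min()
    col_range = df.max() - col_min
    # a zero range would produce 0/0 = NaN, so divide by 1 for those columns
    safe_range = col_range.replace(0, 1)
    return (df - col_min) / safe_range

toy = pd.DataFrame({"age": [20., 30., 40.], "constant": [5., 5., 5.]})

plain = (toy - toy.min()) / (toy.max() - toy.min())   # same expression as in the cell above
guarded = min_max_normalize(toy)
print(plain["constant"].isna().all(), guarded["constant"].tolist())
```

A quick `df.isna().sum().sum()` after normalizing tells you whether this affects your data.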

**INTEREST DF**

In [7]:
# Load interest dataframe
df_interest = pd.read_pickle(os.path.join(inp_dir, train_dir, 'df_interest.p'))
df_train = pd.merge(df_interest, df_users_train_oh_feat, on='user_id_hash')
df_train = pd.merge(df_train, df_coupons_train_oh_feat, on = 'coupon_id_hash')

# drop unnecessary columns
df_train.drop(['user_id_hash','coupon_id_hash','recency_factor'], axis=1, inplace=True)
y_train = df_train.interest.values
df_train.drop('interest', axis=1, inplace=True)

df_train.shape

(1560464, 687)

### 12.2 The joys of xlearn 

(If you want to jump to the final solution simply go to the next section 12.3 below)

As I mentioned at the beginning of this notebook, I will be using `xlearn`. While this package is really promising, it is still "rough around the edges", and I will illustrate why in the following lines. Nonetheless, it is always fun to check new packages as they are created, and I do hope they bring it up to production standards.

I normally prefer to use the native methods of the packages I use. This time, however, after reading the documentation I decided to start with the sklearn-like wrapper. This is what happened. Let's start with a small sample so that tests run quickly.

In [8]:
# random sample of 10000/1000 train/test instances
rnd_indx = random.sample(range(df_train.shape[0]), 11000)
rnd_indx_tr = rnd_indx[:10000]
rnd_indx_te = rnd_indx[10000:]

# temporary matrices
tmp_X_train = df_train.iloc[rnd_indx_tr,:].values
tmp_y_train = y_train[rnd_indx_tr]
tmp_X_test = df_train.iloc[rnd_indx_te,:].values
tmp_y_test = y_train[rnd_indx_te]

Let's now try the linear method in xlearn

In [9]:
# Following the tutorial on their site:
lr_model = xl.LRModel(task='reg', epoch=10, lr=0.1)
lr_model.fit(tmp_X_train, tmp_y_train)

This outputs the following on the terminal (not in the notebook):

```
----------------------------------------------------------------------------------------------
           _
          | |
     __  _| |     ___  __ _ _ __ _ __
     \ \/ / |    / _ \/ _` | '__| '_ \
      >  <| |___|  __/ (_| | |  | | | |
     /_/\_\_____/\___|\__,_|_|  |_| |_|

        xLearn   -- 0.31 Version --
----------------------------------------------------------------------------------------------

[ WARNING    ] Validation file not found, xLearn has already disable early-stopping.
[ WARNING    ] Validation file not found, xLearn has already disable (-x auc) option.
[ ACTION     ] Read Problem ...
[------------] First check if the text file has been already converted to binary format.
[------------] Binary file (/tmp/tmpcqz8a9rk.bin) NOT found. Convert text file to binary file.
[------------] Number of Feature: 687
[------------] Time cost for reading problem: 0.16 (sec)
[ ACTION     ] Initialize model ...
[------------] Model size: 2.69 KB
[------------] Time cost for model initial: 0.00 (sec)
[ ACTION     ] Start to train ...
[------------] Epoch      Train mse_loss     Time cost (sec)
[   10%      ]     1                -nan                0.00
[   20%      ]     2                -nan                0.00
[   30%      ]     3                -nan                0.00
[   40%      ]     4                -nan                0.00
[   50%      ]     5                -nan                0.00
[   60%      ]     6                -nan                0.01
[   70%      ]     7                -nan                0.01
[   80%      ]     8                -nan                0.01
[   90%      ]     9                -nan                0.01
[  100%      ]    10                -nan                0.01
[ ACTION     ] Start to save model ...
[------------] Model file: /tmp/tmpjkhnbd2f
[------------] Time cost for saving model: 0.00 (sec)
[ ACTION     ] Start to save txt model ...
[------------] TXT Model file: /tmp/tmpkcmsbtwl
[------------] Time cost for saving txt model: 0.00 (sec)
[ ACTION     ] Finish training
[ ACTION     ] Clear the xLearn environment ...
[------------] Total time cost: 0.22 (sec)
```

Ok, all `NaNs`. I have tried a number of setups, parameters and data sizes (yes, desperate attempts) and nothing changed. So I decided to move to the native methods. Here the input has to be passed as files read from disk in svmlight format. Let's have a look:

In [10]:
# dump to svmlight
%time dump_svmlight_file(tmp_X_train, tmp_y_train, "trainfm.txt")

lr_model = xl.create_linear()
lr_model.setTrain("trainfm.txt")
param = {'task':'reg', 'lr':0.1, 'epoch': 10}
lr_model.fit(param, "model.out")

CPU times: user 2.42 s, sys: 8 ms, total: 2.43 s
Wall time: 2.43 s


Ok, the output now looks better:

```
----------------------------------------------------------------------------------------------
           _
          | |
     __  _| |     ___  __ _ _ __ _ __
     \ \/ / |    / _ \/ _` | '__| '_ \
      >  <| |___|  __/ (_| | |  | | | |
     /_/\_\_____/\___|\__,_|_|  |_| |_|

        xLearn   -- 0.31 Version --
----------------------------------------------------------------------------------------------

[ WARNING    ] Validation file not found, xLearn has already disable early-stopping.
[ ACTION     ] Read Problem ...
[------------] First check if the text file has been already converted to binary format.
[------------] Binary file (trainfm.txt.bin) NOT found. Convert text file to binary file.
[------------] Number of Feature: 687
[------------] Time cost for reading problem: 0.15 (sec)
[ ACTION     ] Initialize model ...
[------------] Model size: 5.38 KB
[------------] Time cost for model initial: 0.00 (sec)
[ ACTION     ] Start to train ...
[------------] Epoch      Train mse_loss     Time cost (sec)
[   10%      ]     1            0.058563                0.01
[   20%      ]     2            0.039636                0.01
[   30%      ]     3            0.038048                0.01
[   40%      ]     4            0.036483                0.01
[   50%      ]     5            0.035845                0.01
[   60%      ]     6            0.035200                0.01
[   70%      ]     7            0.035077                0.01
[   80%      ]     8            0.034705                0.01
[   90%      ]     9            0.034303                0.01
[  100%      ]    10            0.033958                0.01
[ ACTION     ] Start to save model ...
[------------] Model file: model.out
[------------] Time cost for saving model: 0.00 (sec)
[ ACTION     ] Finish training
[ ACTION     ] Clear the xLearn environment ...
[------------] Total time cost: 0.23 (sec)
```
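For reference, the svmlight text format used above stores one instance per line as a label followed by `index:value` pairs for the non-zero features only. The toy sketch below writes a tiny matrix to a temporary file, prints the lines and reads the file back; the data and the file name are made up.

```python
import os
import tempfile
import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

X = np.array([[1.0, 0.0, 0.5],
              [0.0, 2.0, 0.0]])
y = np.array([1.0, 0.0])

path = os.path.join(tempfile.mkdtemp(), "toy_svmlight.txt")
dump_svmlight_file(X, y, path)

# each line is "<label> <index>:<value> ...", and zero entries are omitted
with open(path) as f:
    lines = f.read().splitlines()
print(lines)

# round trip: the file loads back as a sparse matrix plus a target vector
X_back, y_back = load_svmlight_file(path, n_features=X.shape[1])
print(np.allclose(X_back.toarray(), X))
```

Because only non-zero entries are written, the file size grows with the number of non-zeros rather than with the full 687 columns.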

At this point I thought I would use these methods to optimize and cross-validate, given that the library comes with a convenient `.cv` method:

In [11]:
lr_model = xl.create_linear()
lr_model.setTrain("trainfm.txt")
param = {'task':'reg', 'lr':0.1, 'epoch': 10}
lr_model.cv(param)

```
[ ACTION     ] Cross-validation: 1/5:
[------------] Epoch      Train mse_loss       Test mse_loss     Time cost (sec)
[   10%      ]     1            0.059444            0.044607                0.01
[   20%      ]     2            0.040383            0.040577                0.00
[   30%      ]     3            0.038289            0.040104                0.00
[   40%      ]     4            0.036922            0.039543                0.00
[   50%      ]     5            0.036106            0.037690                0.00
[   60%      ]     6            0.035162            0.036985                0.00
[   70%      ]     7            0.035021            0.036967                0.00
[   80%      ]     8            0.034697            0.036919                0.00
[   90%      ]     9            0.034426            0.037782                0.00
[  100%      ]    10            0.034217            0.037063                0.01
[ ACTION     ] Cross-validation: 2/5:
[------------] Epoch      Train mse_loss       Test mse_loss     Time cost (sec)
[   10%      ]     1            0.090296            0.039171                0.00
[   20%      ]     2            0.039904            0.038145                0.00
[   30%      ]     3            0.037674            0.037457                0.00
[   40%      ]     4            0.036615            0.036940                0.00
[   50%      ]     5            0.036017            0.036663                0.00
[   60%      ]     6            0.035474            0.036272                0.00
[   70%      ]     7            0.034949            0.036434                0.01
[   80%      ]     8            0.034661            0.035400                0.00
[   90%      ]     9            0.034408            0.035600                0.00
[  100%      ]    10            0.034342            0.036243                0.00
[ ACTION     ] Cross-validation: 3/5:
[------------] Epoch      Train mse_loss       Test mse_loss     Time cost (sec)
[   10%      ]     1            0.085794            0.043569                0.00
[   20%      ]     2            0.039700            0.040191                0.00
[   30%      ]     3            0.037755            0.037411                0.00
[   40%      ]     4            0.036678            0.037501                0.00
[   50%      ]     5            0.036058            0.038654                0.00
[   60%      ]     6            0.035338            0.036062                0.00
[   70%      ]     7            0.034920            0.035899                0.00
[   80%      ]     8            0.034780            0.038572                0.00
[   90%      ]     9            0.034402            0.035061                0.01
[  100%      ]    10            0.034162            0.036144                0.00
[ ACTION     ] Cross-validation: 4/5:
[------------] Epoch      Train mse_loss       Test mse_loss     Time cost (sec)
[   10%      ]     1            0.079564            0.041987                0.00
[   20%      ]     2            0.040053            0.036263                0.00
[   30%      ]     3            0.038571            0.035752                0.00
[   40%      ]     4            0.037176            0.036989                0.00
[   50%      ]     5            0.036446            0.034759                0.00
[   60%      ]     6            0.036075            0.034664                0.00
[   70%      ]     7            0.035224            0.034854                0.00
[   80%      ]     8            0.035005            0.037606                0.00
[   90%      ]     9            0.034773            0.036742                0.00
[  100%      ]    10            0.034599            0.034972                0.00
[ ACTION     ] Cross-validation: 5/5:
[------------] Epoch      Train mse_loss       Test mse_loss     Time cost (sec)
[   10%      ]     1            0.072777            0.041085                0.00
[   20%      ]     2            0.039750            0.039631                0.01
[   30%      ]     3            0.038511            0.040434                0.00
[   40%      ]     4            0.037066            0.039133                0.00
[   50%      ]     5            0.036170            0.036739                0.00
[   60%      ]     6            0.035572            0.036710                0.00
[   70%      ]     7            0.035060            0.035377                0.00
[   80%      ]     8            0.034890            0.037823                0.00
[   90%      ]     9            0.034619            0.035106                0.00
[  100%      ]    10            0.034623            0.035278                0.00
[------------] Average mse_loss: 0.035940
[ ACTION     ] Finish Cross-Validation
[ ACTION     ] Clear the xLearn environment ...
[------------] Total time cost: 0.47 (sec)
```

OK, that worked. However, I have no way of accessing the score per fold (the `xlearn` default is 5 folds).

All the callable methods of our linear model are:

In [12]:
model_methods = [method for method in dir(lr_model) if callable(getattr(lr_model, method))]
model_methods

['__class__',
 '__del__',
 '__delattr__',
 '__dir__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_set_Param',
 'cv',
 'disableEarlyStop',
 'disableLockFree',
 'disableNorm',
 'fit',
 'predict',
 'setOnDisk',
 'setPreModel',
 'setQuiet',
 'setSigmoid',
 'setSign',
 'setTXTModel',
 'setTest',
 'setTrain',
 'setValidate',
 'setValidateDMatrix',
 'show']

There is no `score` or `result` attribute (or at least I can't see one), nor any method that, given some argument, retrieves the information I want. At this point I decided to code my own cross validation and move on. This is how I went about it (spoiler alert: it was not the final solution):

1. Take a fraction of the training data (25% here) and split it into 3 folds. This is because with all the data it takes ages.
2. Save the folds to disk in their respective files.
3. Perform cv manually within a hyperopt function.

Before we move on, let me clean up:

In [13]:
del(tmp_X_test, tmp_X_train, tmp_y_test, tmp_y_train, lr_model)
gc.collect()

24

In [14]:
XLEARN_DIR = inp_dir+"xlearn_data"

rnd_indx_cv = random.sample(range(df_train.shape[0]), round(df_train.shape[0]*0.25))
X_train_cv = csr_matrix(df_train.iloc[rnd_indx_cv,:].values)
y_train_cv =  y_train[rnd_indx_cv]
seed = 37
kf = KFold(n_splits=3, shuffle=True, random_state=seed)
train_fpaths, valid_fpaths, valid_target_fpaths = [],[],[]

# Here we go...
for i, (train_index, valid_index) in enumerate(kf.split(X_train_cv)):

    print("INFO: iteration {} of {}".format(i+1,kf.n_splits))

    x_tr, y_tr = X_train_cv[train_index], y_train_cv[train_index]
    x_va, y_va = X_train_cv[valid_index], y_train_cv[valid_index]

    train_fpath = os.path.join(XLEARN_DIR,'train_part_'+str(i)+".txt")
    valid_fpath = os.path.join(XLEARN_DIR,'valid_part_'+str(i)+".txt")
    valid_target_fpath = os.path.join(XLEARN_DIR,'target_part_'+str(i)+".txt")

    print("INFO: saving svmlight training file to {}".format(train_fpath))
    dump_svmlight_file(x_tr, y_tr, train_fpath)

    print("INFO: saving svmlight validation file to {}".format(valid_fpath))
    dump_svmlight_file(x_va, y_va, valid_fpath)

    print("INFO: saving y_valid to {}".format(valid_target_fpath))
    np.savetxt(valid_target_fpath, y_va)

    train_fpaths.append(train_fpath)
    valid_fpaths.append(valid_fpath)
    valid_target_fpaths.append(valid_target_fpath)

INFO: iteration 1 of 3
INFO: saving svmlight training file to ../datasets/Ponpare/data_processed/xlearn_data/train_part_0.txt
INFO: saving svmlight validation file to ../datasets/Ponpare/data_processed/xlearn_data/valid_part_0.txt
INFO: saving y_valid to ../datasets/Ponpare/data_processed/xlearn_data/target_part_0.txt
INFO: iteration 2 of 3
INFO: saving svmlight training file to ../datasets/Ponpare/data_processed/xlearn_data/train_part_1.txt
INFO: saving svmlight validation file to ../datasets/Ponpare/data_processed/xlearn_data/valid_part_1.txt
INFO: saving y_valid to ../datasets/Ponpare/data_processed/xlearn_data/target_part_1.txt
INFO: iteration 3 of 3
INFO: saving svmlight training file to ../datasets/Ponpare/data_processed/xlearn_data/train_part_2.txt
INFO: saving svmlight validation file to ../datasets/Ponpare/data_processed/xlearn_data/valid_part_2.txt
INFO: saving y_valid to ../datasets/Ponpare/data_processed/xlearn_data/target_part_2.txt


This took some time. Ideally, one would wrap the content of that loop into a function and use joblib's `Parallel` to parallelise the process. At least you would be using 3 cores instead of one. I will leave that to you, the reader.
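For readers who want a starting point, here is one possible sketch of that idea. It is not what I ran for this notebook: the `dump_fold` helper is hypothetical, and the toy matrices stand in for `X_train_cv` and `y_train_cv`. Each fold is written by a separate joblib worker.

```python
import os
import tempfile
import numpy as np
from joblib import Parallel, delayed
from scipy.sparse import csr_matrix
from sklearn.datasets import dump_svmlight_file
from sklearn.model_selection import KFold

def dump_fold(i, train_index, valid_index, X, y, out_dir):
    """Write the svmlight train/valid files and the validation target for fold i."""
    train_fpath = os.path.join(out_dir, "train_part_{}.txt".format(i))
    valid_fpath = os.path.join(out_dir, "valid_part_{}.txt".format(i))
    target_fpath = os.path.join(out_dir, "target_part_{}.txt".format(i))
    dump_svmlight_file(X[train_index], y[train_index], train_fpath)
    dump_svmlight_file(X[valid_index], y[valid_index], valid_fpath)
    np.savetxt(target_fpath, y[valid_index])
    return train_fpath, valid_fpath, target_fpath

# toy stand-ins for X_train_cv / y_train_cv
rng = np.random.default_rng(0)
X_toy = csr_matrix(rng.random((30, 5)))
y_toy = rng.random(30)
out_dir = tempfile.mkdtemp()

kf = KFold(n_splits=3, shuffle=True, random_state=37)
# one worker per fold; each call returns the three paths it wrote
paths = Parallel(n_jobs=3)(
    delayed(dump_fold)(i, tr, va, X_toy, y_toy, out_dir)
    for i, (tr, va) in enumerate(kf.split(X_toy))
)
train_fpaths, valid_fpaths, valid_target_fpaths = map(list, zip(*paths))
print(train_fpaths)
```

Keep in mind that each worker receives its own copy of the data, so memory usage grows with `n_jobs`.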

The required files are now created. Let's define our parameter space and hyperopt objective function.

In [15]:
xl_parameter_space = {
    'lr': hp.uniform('lr', 0.01, 0.5),           
    'lambda': hp.uniform('lambda', 0.001,0.01),  # regularization
    'init': hp.uniform('init', 0.2,0.8),         # model (w) initialization
    'epoch': hp.quniform('epoch', 10, 200, 10),
    'k': hp.quniform('k', 2, 10, 1),             # latent factors
}

In [28]:
def xl_objective(params, method="fm"):

    xl_objective.i+=1

    params['task'] = 'reg'
    params['metric'] = 'rmse'

    # remember hyperopt casts as floats
    params['epoch'] = int(params['epoch'])
    params['k'] = int(params['k'])

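    # compare strings with ==, not "is" (identity)
    if method == "linear":
        xl_model = xl.create_linear()
    elif method == "fm":
        xl_model = xl.create_fm()
    else:
        raise ValueError("method must be 'linear' or 'fm'")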

    results = []
    for train, valid, target in zip(train_fpaths, valid_fpaths, valid_target_fpaths):

        preds_fname = os.path.join(XLEARN_DIR, 'tmp_output.txt')
        model_fname = os.path.join(XLEARN_DIR, "tmp_model.out")

        xl_model.setTrain(train)
        xl_model.setTest(valid)
        # whether quiet or not, it'll output a lot of stuff...
        xl_model.setQuiet()
        xl_model.fit(params, model_fname)
        xl_model.predict(model_fname, preds_fname)

        y_valid = np.loadtxt(target)
        predictions = np.loadtxt(preds_fname)
        loss = np.sqrt(mean_squared_error(y_valid, predictions))

        results.append(loss)

    error = np.mean(results)
    print("INFO: iteration {} error {:.3f}".format(xl_objective.i, error))

    return error

Let's turn the objective function into a partial function of params and run 3 iterations to see why I ended up optimizing without `cv`.

In [17]:
partial_objective = lambda params: xl_objective(
    params,
    method="fm")

start = time()
xl_objective.i = 0
best_fm = fmin(
    fn=partial_objective,
    space=xl_parameter_space,
    algo=tpe.suggest,
    max_evals=3
    )
end = time()-start
print("{} min".format(round(end/60,3)))

pickle.dump(best_fm, open(os.path.join(XLEARN_DIR,'best_fm.p'), "wb"))

INFO: iteration 1 error 0.265                      
INFO: iteration 2 error 0.264                                                
INFO: iteration 3 error 0.261                                                 
100%|██████████| 3/3 [07:23<00:00, 147.84s/it, best loss: 0.26129352027329267]
7.392 min


Three iterations on a c5n.4xlarge instance (40GB of RAM and 16 cores) take over 7 minutes, and that is using only 25% of the data. Therefore I decided to **NOT use cross validation** and to optimize with a single train/test split. Not ideal, but better than trying parameters manually.

### 12.3 Final Solution

Let's define a series of paths that we will use later

In [8]:
XLEARN_DIR = inp_dir+"xlearn_data"

# train and validation paths
train_fpath = os.path.join(XLEARN_DIR,"train_xl.txt")
train_target_path = os.path.join(XLEARN_DIR,"train_target_xl.txt")
valid_fpath = os.path.join(XLEARN_DIR,"valid_xl.txt")

# temporary filenames for the optimization process
xlmodel_fname_tmp = os.path.join(XLEARN_DIR,"xlfm_model_tmp.out")
xlpreds_fname_tmp = os.path.join(XLEARN_DIR,"xlfm_preds_tmp.txt")

Take a random sample of the training dataset (I ran out of names):

In [9]:
# Due to memory constraints, I still do not use the whole dataset here, only 50%.
rnd_indx = random.sample(range(df_train.shape[0]), round(df_train.shape[0]*0.50))
X_train_rn = csr_matrix(df_train.iloc[rnd_indx,:].values)
y_train_rn =  y_train[rnd_indx]
X_train_rn.shape

(780232, 687)

Let's save them in svmlight format:

In [10]:
print("INFO: saving svmlight training file to {}".format(train_fpath))
dump_svmlight_file(X_train_rn, y_train_rn, train_fpath)

print("INFO: saving target to {}".format(train_target_path))
np.savetxt(train_target_path, y_train_rn)

INFO: saving svmlight training file to ../datasets/Ponpare/data_processed/xlearn_data/train_xl.txt
INFO: saving target to ../datasets/Ponpare/data_processed/xlearn_data/train_target_xl.txt


In [11]:
del(X_train_rn, y_train_rn)

Preparing the validation data

In [12]:
# Read the interactions during validation
interactions_valid_dict = pickle.load(
    open("../datasets/Ponpare/data_processed/valid/interactions_valid_dict.p", "rb"))

left = pd.DataFrame({'user_id_hash':list(interactions_valid_dict.keys())})
left['key'] = 0
# .copy() avoids pandas' SettingWithCopyWarning when adding the 'key' column
right = df_coupons_valid_feat[['coupon_id_hash']].copy()
right['key'] = 0
df_valid = (pd.merge(left, right, on='key', how='outer')
    .drop('key', axis=1))
df_valid = pd.merge(df_valid, df_users_train_oh_feat, on='user_id_hash')
df_valid = pd.merge(df_valid, df_coupons_valid_oh_feat, on = 'coupon_id_hash')
# copy again, since an 'interest' column is added to df_preds later on
df_preds = df_valid[['user_id_hash','coupon_id_hash']].copy()
print(df_valid.shape)
print(df_preds.shape)



(2173418, 689)
(2173418, 2)
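The constant `key` column used above is a way of building all user-coupon pairs, i.e. a cross join. A toy sketch of the trick, next to the built-in `how="cross"` option that, as far as I know, is available from pandas 1.2 onwards (the ids below are made up):

```python
import pandas as pd

users = pd.DataFrame({"user_id_hash": ["u1", "u2"]})
coupons = pd.DataFrame({"coupon_id_hash": ["c1", "c2", "c3"]})

# constant-key trick: every user row matches every coupon row
pairs_key = (pd.merge(users.assign(key=0), coupons.assign(key=0), on="key", how="outer")
    .drop("key", axis=1))

# built-in cross join (assumes pandas >= 1.2)
pairs_cross = pd.merge(users, coupons, how="cross")

print(len(pairs_key), len(pairs_cross))
```

Both produce `n_users * n_coupons` rows, which is why `df_valid` has over 2 million rows.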


In [13]:
X_valid = csr_matrix(df_valid
    .drop(['user_id_hash','coupon_id_hash'], axis=1)
    .values)
# svmlight needs a target column
y_valid = np.array([0.1]*X_valid.shape[0])
%time dump_svmlight_file(X_valid,y_valid,valid_fpath)
del(X_valid)

CPU times: user 9min 30s, sys: 2.63 s, total: 9min 32s
Wall time: 9min 33s


Define the objective function

In [16]:
def xl_objective(params, method="fm"):

    start = time()
    xl_objective.i+=1

    params['task'] = 'reg'
    params['metric'] = 'rmse'

    # remember hyperopt casts as floats
    params['epoch'] = int(params['epoch'])
    params['k'] = int(params['k'])

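    # I added an option in case you want to use linear
    # (compare strings with ==, not "is", which tests object identity)
    if method == "linear":
        xl_model = xl.create_linear()
    elif method == "fm":
        xl_model = xl.create_fm()
    else:
        raise ValueError("method must be 'linear' or 'fm', got {!r}".format(method))

    # if you just want fm or linear, the first 3 lines here can go outside the function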
    xl_model.setTrain(train_fpath)
    xl_model.setTest(valid_fpath)
    xl_model.disableNorm()
    #xl_model.setQuiet()    
    xl_model.fit(params, xlmodel_fname_tmp)
    xl_model.predict(xlmodel_fname_tmp, xlpreds_fname_tmp)

    # add predictions and rank
    preds = np.loadtxt(xlpreds_fname_tmp)
    df_preds['interest'] = preds

    df_ranked = df_preds.sort_values(['user_id_hash', 'interest'],
        ascending=[False, False])
    df_ranked = (df_ranked
        .groupby('user_id_hash')['coupon_id_hash']
        .apply(list)
        .reset_index())
    recomendations_dict = pd.Series(df_ranked.coupon_id_hash.values,
        index=df_ranked.user_id_hash).to_dict()

    actual = []
    pred = []
    for k,_ in recomendations_dict.items():
        actual.append(list(interactions_valid_dict[k]))
        pred.append(list(recomendations_dict[k]))

    score = mapk(actual,pred)
    end = round((time() - start)/60.,2)

    print("INFO: iteration {} was completed in {} min. Score {:.3f}".format(xl_objective.i, end, score))

    return 1-score
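
The ranking and scoring steps inside `xl_objective` can be sketched without pandas on a toy example. This is only an illustration under assumptions: `apk_sketch` and `mapk_sketch` are hypothetical names, and they assume that `mapk` follows the usual definition of mean average precision at k (as implemented, for instance, in the `ml_metrics` package) with `k=10`. The users, coupons and interest values are made up.

```python
from collections import defaultdict


def apk_sketch(actual, predicted, k=10):
    """Average precision at k for a single user (hypothetical helper)."""
    if not actual:
        return 0.0
    hits, score = 0, 0.0
    for i, p in enumerate(predicted[:k]):
        # count each relevant item once, at the rank where it first appears
        if p in actual and p not in predicted[:i]:
            hits += 1
            score += hits / (i + 1)
    return score / min(len(actual), k)


def mapk_sketch(actual, predicted, k=10):
    """Mean of apk_sketch over all users."""
    return sum(apk_sketch(a, p, k) for a, p in zip(actual, predicted)) / len(actual)


# toy (user, coupon, predicted interest) triples, made up for illustration
scored = [("u1", "c1", 0.2), ("u1", "c2", 0.9), ("u1", "c3", 0.5),
          ("u2", "c1", 0.7), ("u2", "c2", 0.1), ("u2", "c3", 0.3)]
# coupons each user actually interacted with during validation
interactions = {"u1": ["c2", "c1"], "u2": ["c3"]}

# rank the coupons of every user by decreasing predicted interest
by_user = defaultdict(list)
for user, coupon, interest in sorted(scored, key=lambda t: -t[2]):
    by_user[user].append(coupon)

users = sorted(by_user)
score = mapk_sketch([interactions[u] for u in users],
                    [by_user[u] for u in users], k=10)
print(round(score, 3))
```

Because `fmin` minimizes, the objective returns `1-score`: a higher MAP means a lower loss.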

Define the parameter space

In [17]:
xl_parameter_space = {
    'lr': hp.uniform('lr', 0.01, 0.4),                # learning rate default 0.2
    'lambda': hp.uniform('lambda', 0.,0.02),          # regularization default 0.00002
    'init': hp.uniform('init', 0.4,0.8),              # model (w) initialization default 0.66
    'epoch': hp.quniform('epoch', 10, 50, 5),         # epoch default 10
    'k': hp.quniform('k', 4, 20, 2),                  # latent factors default 4
}
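
The two `hp.quniform` entries explain the `int(...)` casts in `xl_objective`. According to the hyperopt documentation, `hp.quniform(label, low, high, q)` returns a value like `round(uniform(low, high) / q) * q`, which comes back as a float. The snippet below mimics that behaviour with the standard library only; `quniform_sketch` is a hypothetical stand-in, not hyperopt's implementation.

```python
import random


def quniform_sketch(low, high, q, rng):
    """Mimic hp.quniform's documented behaviour (hypothetical helper)."""
    # quantize a uniform draw to a multiple of q; the result is returned as a float
    return float(round(rng.uniform(low, high) / q) * q)


rng = random.Random(0)
raw = {"epoch": quniform_sketch(10, 50, 5, rng),
       "k": quniform_sketch(4, 20, 2, rng)}

# cast to int before handing the values to xlearn, as xl_objective does
params = {name: int(value) for name, value in raw.items()}
print(raw, params)
```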

And we run the same experiment

In [18]:
partial_objective = lambda params: xl_objective(
    params,
    method="fm")

start = time()
xl_objective.i = 0
best_fm = fmin(
    fn=partial_objective,
    space=xl_parameter_space,
    algo=tpe.suggest,
    max_evals=10
    )
end = time()-start
print("{} min".format(round(end/60,3)))

pickle.dump(best_fm, open(os.path.join(XLEARN_DIR,'best_fm.p'), "wb"))

  0%|          | 0/10 [00:00<?, ?it/s, best loss: ?]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



INFO: iteration 1 was completed in 1.15 min. Score 0.014
INFO: iteration 2 was completed in 0.7 min. Score 0.015                      
INFO: iteration 3 was completed in 1.05 min. Score 0.012                     
INFO: iteration 4 was completed in 0.94 min. Score 0.011                     
INFO: iteration 5 was completed in 0.53 min. Score 0.012                     
INFO: iteration 6 was completed in 0.58 min. Score 0.013                     
INFO: iteration 7 was completed in 1.34 min. Score 0.013                     
INFO: iteration 8 was completed in 2.24 min. Score 0.014                     
INFO: iteration 9 was completed in 0.58 min. Score 0.010                     
INFO: iteration 10 was completed in 1.24 min. Score 0.012                    
100%|██████████| 10/10 [10:21<00:00, 62.14s/it, best loss: 0.9853323567678748]
10.356 min


I ran this on a c5n.4xlarge instance with 40GB of memory, and by the time it finished it had consumed almost all of it. Previously, when using a c5.4xlarge (30GB), I got the always nice **`Memory error`** message ("`Killed`") in the terminal.

I am not sure whether there is a memory leak in the package or some other memory-related issue, because memory use grows by around 3GB with every iteration.
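
One way to investigate would be to log the peak resident memory of the process after every call to the objective. The snippet below is a minimal sketch using the standard-library `resource` module (Unix only); `peak_rss_gb` is a hypothetical helper, and the `bytearray` is just a stand-in for one optimization iteration. Note that the peak never decreases, so this shows growth but cannot show memory being released.

```python
import resource
import sys


def peak_rss_gb():
    """Peak resident memory of this process in GB (hypothetical helper, Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS
    scale = 1 if sys.platform == "darwin" else 1024
    return peak * scale / 1024 ** 3


before = peak_rss_gb()
buffer = bytearray(50 * 1024 ** 2)  # stand-in for one optimization iteration
after = peak_rss_gb()
print("peak RSS changed by {:.3f} GB".format(after - before))
```

Calling `peak_rss_gb()` at the end of `xl_objective` and adding the value to the `INFO` message would show how much each iteration adds.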

Nonetheless, I would say that these are disappointing results. Of course, there are a number of ways to improve on them, starting with the obvious: using the whole dataset instead of 50% (although the optimization will take longer and memory use blows up quickly). One could also increase the number of factors, but the impact is less noticeable.

For example, when using the whole training dataset and these parameters:

```
param_fm = {'epoch': 20,
 'init': 0.4,
 'k': 10,
 'lambda': 0.2,
 'lr': 0.01,
 'task': 'reg',
 'metric': 'rmse'}
```

you get MAP@10=0.02. It looks like it is "all about the data": the more, the better, obviously. Anyway, we will take all the lessons learned in this chapter and move on to the next, more powerful technique: Field-Aware Factorization Machines.