## Training the model

In Notebooks 01 and 02 I described how to prepare the data and build the model. Here I will show how to train, validate and test it. 

Once again, I will focus on the `Mxnet` implementation.

Let's start by loading the data

In [1]:
import os
import pickle
import sys
from datetime import datetime
from pathlib import Path

import mxnet as mx
import numpy as np
from mxnet import autograd, gluon, nd
from tqdm import trange

sys.path.append(os.path.abspath('../'))
from models.mxnet_models import MultiDAE, MultiVAE
from utils.data_loader import DataLoader
from utils.metrics import NDCG_binary_at_k_batch, Recall_at_k_batch
from utils.parser import parse_args



In [2]:
DATA_DIR = Path("../data")
data_path = DATA_DIR / "movielens_processed"

In [3]:
data_loader = DataLoader(data_path)
n_items = data_loader.n_items
train_data = data_loader.load_data("train")
valid_data_tr, valid_data_te = data_loader.load_data("validation")
test_data_tr, test_data_te = data_loader.load_data("test")

In [4]:
train_data

<116677x20108 sparse matrix of type '<class 'numpy.float32'>'
	with 8538846 stored elements in Compressed Sparse Row format>

As you can see, the training data (the same applies to validation and test) is the binary sparse matrix of interactions. Have a look at the `DataLoader` class if you want a few more details on how it is built.

As described in Notebook 02, [Liang et al, 2018](https://arxiv.org/pdf/1802.05814.pdf) interpret the Kullback-Leibler divergence as a regularization term. With that in mind, they add a regularization parameter $\beta$ and, in a procedure inspired by [Samuel R. Bowman et al, 2016](https://arxiv.org/abs/1511.06349), they linearly anneal the KL term slowly over a large number of training steps. This is perhaps the part that can get a bit (just a bit) confusing, especially if you focus on the paper alone and do not look at their code.

In their paper, Liang et al write the following, referring to their Figure 1 and their annealing approach: "*[...] we plot the validation ranking metric [...] with KL annealing all the way to $\beta$ = 1 [...] ($\beta$ reaches 1 at around 80 epochs) [...] Having identified the best $\beta$ based on the peak validation metric, we can retrain the model with the same annealing schedule, but stop increasing $\beta$ after reaching that value*"

When I read these lines, together with their Figure 1, my initial interpretation was the following: $\beta$ reaches 1 at around 80 epochs and the best validation metric occurs at approximately epoch 20. By then, $\beta$ must have a value of $\sim$0.25. Therefore, I understood that they would retrain the model with the same annealing schedule, but stop increasing $\beta$ at epoch 20, when it reaches $\sim$0.25.

However, when I went to [their implementation](https://github.com/dawenl/vae_cf/blob/master/VAE_ML20M_WWW2018.ipynb), the authors do the following: using a batch size of 500 they set the total number of annealing steps to 200000. Given that the training dataset has a size of 116677, every epoch has 234 training steps. Their `anneal_cap` value, i.e. the maximum annealing reached during training, is set to 0.2, and during training they use the following approach: 

```python
            if total_anneal_steps > 0:
                anneal = min(anneal_cap, 1. * update_count / total_anneal_steps)
            else:
                anneal = anneal_cap
```

where `update_count` increases by 1 at every training step/batch. They train for 200 epochs, so, if we do the math, the `anneal` value stops increasing when `update_count / total_anneal_steps` = 0.2, i.e. after 40000 training steps or, in other words, after around 170 epochs, which is $\sim$80-85% of the total number of epochs. Therefore, what they really meant is that, once you have selected the best-performing $\beta$, you apply the same schedule as the one used when annealing all the way to $\beta$ = 1, reaching the maximum annealing value (e.g. 0.2) at $\sim$80% of the total number of epochs.
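
The arithmetic behind those numbers can be checked in a few lines. This is only a sketch that plugs in the values quoted above (dataset size, batch size, number of epochs, annealing steps and cap); the variable names `cap_step`, `cap_epoch` and `cap_fraction` are mine and do not appear in the original code.

```python
import math

n_users = 116677              # rows of the training matrix
batch_size = 500
n_epochs = 200
total_anneal_steps = 200000
anneal_cap = 0.2

# Number of batches (training steps) in one epoch
steps_per_epoch = math.ceil(n_users / batch_size)

# `update_count / total_anneal_steps` reaches the cap at this step
cap_step = anneal_cap * total_anneal_steps

# The same point expressed in epochs and as a fraction of the whole run
cap_epoch = cap_step / steps_per_epoch
cap_fraction = cap_epoch / n_epochs

print(steps_per_epoch)         # 234
print(round(cap_step))         # 40000
print(round(cap_epoch, 1))     # 170.9
print(round(cap_fraction, 2))  # 0.85
```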

With that in mind, my implementation looks like this:

In [5]:
batch_size = 500
anneal_epochs = None
anneal_cap = 0.2
constant_anneal = False
n_epochs = 200

In [6]:
training_steps = len(range(0, train_data.shape[0], batch_size))
try:
    total_anneal_steps = (
        training_steps * (n_epochs - int(n_epochs * 0.2))
    ) / anneal_cap
except ZeroDivisionError:
    assert (
        constant_anneal
    ), "if 'anneal_cap' is set to 0.0 'constant_anneal' must be set to 'True'"
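
To see what this choice of `total_anneal_steps` implies, here is a small stand-alone sketch of the resulting schedule. The helper `anneal_at` is a hypothetical name introduced only for illustration; it mirrors the `min(anneal_cap, update_count / total_anneal_steps)` rule used in the training loop and hard-codes the 234 steps per epoch computed earlier.

```python
def anneal_at(update_count, total_anneal_steps, anneal_cap=0.2, constant_anneal=False):
    """KL weight (beta) applied at a given training step."""
    if constant_anneal:
        return anneal_cap
    return min(anneal_cap, update_count / total_anneal_steps)


steps_per_epoch = 234  # ceil(116677 / 500)
n_epochs = 200
anneal_cap = 0.2
total_anneal_steps = steps_per_epoch * (n_epochs - int(n_epochs * 0.2)) / anneal_cap

# Beta at the start of a few selected epochs
for epoch in (0, 40, 80, 160, 200):
    beta = anneal_at(epoch * steps_per_epoch, total_anneal_steps, anneal_cap)
    print(epoch, round(beta, 3))
```

With these settings the weight grows linearly (0.05 at epoch 40, 0.1 at epoch 80) and reaches the cap of 0.2 at epoch 160, i.e. at 80% of the 200 epochs, and stays there afterwards.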


The following two functions will look very familiar if you are used to `Pytorch`

### Train step

In [7]:
def train_step(model, optimizer, data, epoch):

    running_loss = 0.0
    global update_count
    N = data.shape[0]
    idxlist = list(range(N))
    np.random.shuffle(idxlist)
    training_steps = len(range(0, N, batch_size))

    with trange(training_steps) as t:
        for batch_idx, start_idx in zip(t, range(0, N, batch_size)):
            t.set_description("epoch: {}".format(epoch + 1))

            end_idx = min(start_idx + batch_size, N)
            X_inp = data[idxlist[start_idx:end_idx]]
            X_inp = nd.array(X_inp.toarray()).as_in_context(ctx)

            if constant_anneal:
                anneal = anneal_cap
            else:
                anneal = min(anneal_cap, update_count / total_anneal_steps)
            update_count += 1

            with autograd.record():
                if model.__class__.__name__ == "MultiVAE":
                    X_out, mu, logvar = model(X_inp)
                    loss = vae_loss_fn(X_inp, X_out, mu, logvar, anneal)
                    train_step.anneal = anneal
                elif model.__class__.__name__ == "MultiDAE":
                    X_out = model(X_inp)
                    loss = -nd.mean(nd.sum(nd.log_softmax(X_out) * X_inp, -1))
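            loss.backward()
            # Note: the update goes through the global `trainer` (defined in a
            # later cell); the `optimizer` argument is not used inside this function
            trainer.step(X_inp.shape[0])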
            running_loss += loss.asscalar()
            avg_loss = running_loss / (batch_idx + 1)

            t.set_postfix(loss=avg_loss)


### Evaluate step

In [8]:
def eval_step(data_tr, data_te, data_type="valid"):

    running_loss = 0.0
    eval_idxlist = list(range(data_tr.shape[0]))
    eval_N = data_tr.shape[0]
    eval_steps = len(range(0, eval_N, batch_size))

    n100_list, r20_list, r50_list = [], [], []

    with trange(eval_steps) as t:
        for batch_idx, start_idx in zip(t, range(0, eval_N, batch_size)):
            t.set_description(data_type)

            end_idx = min(start_idx + batch_size, eval_N)
            X_tr = data_tr[eval_idxlist[start_idx:end_idx]]
            X_te = data_te[eval_idxlist[start_idx:end_idx]]
            X_tr_inp = nd.array(X_tr.toarray()).as_in_context(ctx)

            with autograd.predict_mode():
                if model.__class__.__name__ == "MultiVAE":
                    X_out, mu, logvar = model(X_tr_inp)
                    loss = vae_loss_fn(X_tr_inp, X_out, mu, logvar, train_step.anneal)
                elif model.__class__.__name__ == "MultiDAE":
                    X_out = model(X_tr_inp)
                    loss = -nd.mean(nd.sum(nd.log_softmax(X_out) * X_tr_inp, -1))

            running_loss += loss.asscalar()
            avg_loss = running_loss / (batch_idx + 1)

            # Exclude examples from training set
            X_out = X_out.asnumpy()
            X_out[X_tr.nonzero()] = -np.inf

            n100 = NDCG_binary_at_k_batch(X_out, X_te, k=100)
            r20 = Recall_at_k_batch(X_out, X_te, k=20)
            r50 = Recall_at_k_batch(X_out, X_te, k=50)
            n100_list.append(n100)
            r20_list.append(r20)
            r50_list.append(r50)

            t.set_postfix(loss=avg_loss)

        n100_list = np.concatenate(n100_list)
        r20_list = np.concatenate(r20_list)
        r50_list = np.concatenate(r50_list)

    return avg_loss, np.mean(n100_list), np.mean(r20_list), np.mean(r50_list)


I have discussed the evaluation metrics (NDCG@k and Recall@k) extensively in a number of notebooks in this repo (and the corresponding posts). Therefore, and with the aim of not making another infinite notebook, I will not describe their implementation here. If you want details on those evaluation metrics, please go to the `metrics.py` module in `utils`. The code there is a very small adaptation of the one in the [original implementation](https://github.com/dawenl/vae_cf/blob/master/VAE_ML20M_WWW2018.ipynb).
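
For quick orientation only, this is roughly what the two metrics compute, written with plain NumPy on small dense arrays. It is a simplified sketch, not the code in `utils/metrics.py`: the function names (`top_k_items`, `recall_at_k`, `ndcg_at_k`) and the toy data are made up for illustration, and it assumes every user has at least one held-out item.

```python
import numpy as np


def top_k_items(scores, k):
    # Indices of the k highest-scoring items per user, best first
    return np.argsort(-scores, axis=1)[:, :k]


def recall_at_k(scores, heldout, k):
    hits = np.take_along_axis(heldout, top_k_items(scores, k), axis=1).sum(axis=1)
    # Normalise by the largest number of hits a user could get
    return hits / np.minimum(k, heldout.sum(axis=1))


def ndcg_at_k(scores, heldout, k):
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    gains = np.take_along_axis(heldout, top_k_items(scores, k), axis=1)
    dcg = (gains * discounts).sum(axis=1)
    # Ideal DCG: every held-out item (up to k) ranked at the top
    n_rel = np.minimum(k, heldout.sum(axis=1)).astype(int)
    idcg = np.array([discounts[:n].sum() for n in n_rel])
    return dcg / idcg


# Toy data (made up): 2 users, 4 items
scores = np.array([[0.9, 0.1, 0.8, 0.3], [0.2, 0.7, 0.1, 0.6]])
seen = np.array([[1, 0, 0, 0], [0, 0, 0, 0]])                      # "tr" interactions
heldout = np.array([[0.0, 1.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]])  # "te" interactions

scores[seen.nonzero()] = -np.inf  # never recommend items already seen
recall = recall_at_k(scores, heldout, k=2)
ndcg = ndcg_at_k(scores, heldout, k=2)
print(recall, ndcg)
```

The masking line is the same trick used in `eval_step`: items the user interacted with in the "tr" part get a score of `-inf`, so they can never enter the top k.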

### Running the process

Let's first instantiate the model

In [9]:
model = MultiVAE(
    p_dims=[200, 600, n_items],
    q_dims=[n_items, 600, 200],
    dropout_enc=[0.5, 0.0],
    dropout_dec=[0.0, 0.0],
)

In [10]:
model

MultiVAE(
  (encode): VAEEncoder(
    (q_layers): HybridSequential(
      (0): Dropout(p = 0.5, axes=())
      (1): Dense(20108 -> 600, linear)
      (2): Dropout(p = 0.0, axes=())
      (3): Dense(600 -> 400, linear)
    )
  )
  (decode): Decoder(
    (p_layers): HybridSequential(
      (0): Dropout(p = 0.0, axes=())
      (1): Dense(200 -> 600, linear)
      (2): Dropout(p = 0.0, axes=())
      (3): Dense(600 -> 20108, linear)
    )
  )
)

Then the usual: use a GPU if available, hybridize the model so that it runs as a static graph where possible (see Notebook 02 and [here](https://gluon.mxnet.io/chapter07_distributed-learning/hybridize.html)), etc.

In [11]:
ctx = mx.gpu() if mx.context.num_gpus() else mx.cpu()
model.initialize(mx.init.Xavier(), ctx=ctx)
model.hybridize()
optimizer = mx.optimizer.Adam(learning_rate=0.001, wd=0.)
trainer = gluon.Trainer(model.collect_params(), optimizer=optimizer)

Remember, we need our custom loss (Eq 10 in Notebook 2)

In [12]:
def vae_loss_fn(inp, out, mu, logvar, anneal):
    neg_ll = -nd.mean(nd.sum(nd.log_softmax(out) * inp, -1))
    KLD = -0.5 * nd.mean(nd.sum(1 + logvar - nd.power(mu, 2) - nd.exp(logvar), axis=1))
    return neg_ll + anneal * KLD
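
As a sanity check of the two terms, the same loss can be sketched in plain NumPy and evaluated on a tiny hand-made example. This is only an illustration, not the code used for training; `log_softmax` and `vae_loss_np` are names I introduce here, and the inputs are made up.

```python
import numpy as np


def log_softmax(x):
    # Numerically stable log-softmax over the item axis
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))


def vae_loss_np(inp, out, mu, logvar, anneal):
    # Multinomial negative log-likelihood, averaged over users
    neg_ll = -np.mean(np.sum(log_softmax(out) * inp, axis=1))
    # KL divergence between N(mu, exp(logvar)) and the N(0, I) prior
    kld = -0.5 * np.mean(np.sum(1 + logvar - mu ** 2 - np.exp(logvar), axis=1))
    return neg_ll + anneal * kld, neg_ll, kld


inp = np.array([[1.0, 0.0, 1.0, 0.0]])  # one user, two clicked items
out = np.zeros((1, 4))                  # uniform logits over 4 items

# Posterior equal to the prior: the KL term vanishes
loss_a, neg_ll_a, kld_a = vae_loss_np(inp, out, np.zeros((1, 2)), np.zeros((1, 2)), anneal=0.2)

# Shifting one latent mean to 1 gives KL = 0.5, weighted by `anneal`
loss_b, neg_ll_b, kld_b = vae_loss_np(inp, out, np.array([[1.0, 0.0]]), np.zeros((1, 2)), anneal=0.2)

print(neg_ll_a, kld_a, kld_b, loss_b - loss_a)
```

With uniform logits each clicked item contributes `log(4)` to the negative log-likelihood, and the only difference between the two losses is the annealed KL term, `0.2 * 0.5`.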

And we are ready. Let's run one epoch on a small sample to make sure everything works fine.

In [13]:
stop_step = 0
update_count = 0
eval_every = 1
stop = False
for epoch in range(1):
    train_step(model, optimizer, train_data[:2000], epoch)
    if epoch % eval_every == (eval_every - 1):
        val_loss, n100, r20, r50 = eval_step(valid_data_tr[:1000], valid_data_te[:1000])
        print("=" * 80)
        print(
            "| valid loss {:4.3f} | n100 {:4.3f} | r20 {:4.3f} | "
            "r50 {:4.3f}".format(val_loss, n100, r20, r50)
        )
        print("=" * 80)

epoch: 1: 100%|██████████| 4/4 [00:05<00:00,  1.28s/it, loss=737]
valid: 100%|██████████| 2/2 [00:01<00:00,  1.00it/s, loss=562]

| valid loss 562.083 | n100 0.005 | r20 0.003 | r50 0.006





And with a few more bells and whistles (e.g. an optional learning rate scheduler, early stopping, etc.), this is exactly the code that you will find in `main_mxnet.py`.
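
To give an idea of what the early stopping part can look like, here is a generic patience-based sketch. It is not the code in `main_mxnet.py`: the `EarlyStopping` class, its parameters and the validation values below are hypothetical and only illustrate the idea of a counter such as `stop_step` that is reset whenever the validation metric improves.

```python
class EarlyStopping:
    """Stop when the monitored metric has not improved for `patience` evaluations."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.stop_step = 0

    def step(self, metric):
        """Return True when training should stop (higher metric is better)."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.stop_step = 0
        else:
            self.stop_step += 1
        return self.stop_step >= self.patience


stopper = EarlyStopping(patience=2)
history = [0.10, 0.15, 0.14, 0.15, 0.16]  # made-up validation NDCG@100 values
stopped_at = None
for epoch, n100 in enumerate(history):
    if stopper.step(n100):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)  # 3 0.15
```

Note the trade-off that patience introduces: with `patience=2` the loop above stops before seeing the later improvement to 0.16, so in practice the value should be chosen with the noise of the validation metric in mind.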

Before I move to Notebook 04, just a quick comment about something I often find in these scientific publications. Normally, once the authors have found the best hyperparameters on the validation set, they test the model on the test set. In "real-life" scenarios there would be one additional step: merging the train and validation sets, re-training the model with the best hyperparameters and only then testing on the test set. In any case, since my goal here is not to build a real-life system, I will follow the same procedure as the one found in the original [implementation](https://github.com/dawenl/vae_cf/blob/master/VAE_ML20M_WWW2018.ipynb).

Time now to have a look at the results obtained with both `Pytorch` and `Mxnet`.