**Recommender Systems**

Let's examine neural network architectures for recommender systems. We'll take a closer look at neural collaborative filtering (NCF). The code comes from NVIDIA's [Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Recommendation/NCF) repository and is written in PyTorch. Let's get into it!

The `README.md` gives a nice overview of the model architecture. There are "two branches" of the model. The "Multi Layer Perceptron (MLP) branch... transforms the input through fully connected layers with ReLU activations and dropout." The "Matrix Factorization (MF) branch... performs collaborative filtering factorization. Each user and each item has two embedding vectors associated with it: one for the MLP branch and the other for the MF branch. The outputs from these branches are concatenated and fed into the final fully connected layer with sigmoid activation. This can be interpreted as the probability of a user interacting with a given item."

If this doesn't make sense just yet, hopefully it'll become clearer once we dive into the code. `ncf.py` looks like the logical next step. What are the functions in this file?

`parse_args` defines the command-line arguments needed to train the model, along with each argument's data type, which is string (`str`), integer (`int`), or floating point (`float`), and (optionally) its default value:

`--data`: "path to test and training data files" (`str`). Presumably this is a path to a directory containing test files in one subdirectory and training files in another. From experience, our data in this directory needs to be in a particular structure or we'll get errors. There is no default value for `--data`. We need to pass a directory or our script will not train.

`-e`, or `--epochs`: number of forward and backward passes through the full dataset (`int`, default: 30).

`-b`, or `--batch_size`: number of examples in a training batch (`int`, default: $2^{20}$). That is an enormous batch size compared to what we typically see for e.g. convolutional neural networks.

`--valid_batch_size`: number of examples in a validation batch (`int`, default: $2^{20}$).

`-f`, or `--factors`: number of predictive factors (`int`, default: 64). This defines the size of the last layer of the NCF. The [paper](http://papers.www2017.com.au.s3-website-ap-southeast-2.amazonaws.com/proceedings/p173.pdf) introducing NCF notes that large factors may lead to overfitting and experiments with factors of size 8, 16, 32, and 64. We should be mindful of this when training the model.

`--layers`: number of nodes per hidden layer in MLP. By default, this is a list of `int` data: `[256, 256, 128, 64]`.

`-n`, or `--negative_samples`: To optimize the network, we use "log loss with negative sampling" (see the paper). This parameter sets how many negative (unobserved) examples to sample for each positive (observed) interaction during training. Again, we pass our argument as an `int` (default: 4).
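To make the negative-sampling idea concrete, here is a minimal sketch in plain Python. It assumes negatives are drawn uniformly from items a user has not interacted with, a fixed number per observed interaction. The helper `sample_negatives` and the toy data are hypothetical and are not taken from the NVIDIA code.

```python
import random


def sample_negatives(positives, nb_items, negative_samples=4, seed=1):
    """Pair each observed (user, item) interaction with sampled negatives.

    Hypothetical helper for illustration: returns (user, item, label)
    triples, with label 1.0 for observed pairs and 0.0 for sampled ones.
    """
    rng = random.Random(seed)
    seen = set(positives)
    examples = []
    for user, item in positives:
        examples.append((user, item, 1.0))
        drawn = 0
        while drawn < negative_samples:
            candidate = rng.randrange(nb_items)
            # Keep only items this user has not interacted with.
            if (user, candidate) not in seen:
                examples.append((user, candidate, 0.0))
                drawn += 1
    return examples


positives = [(0, 1), (0, 3), (1, 2)]
examples = sample_negatives(positives, nb_items=50, negative_samples=4)
print(len(examples))  # 3 positives + 3 * 4 negatives = 15
```

The labels are what the log loss is computed against: observed pairs push the predicted probability toward 1, sampled pairs push it toward 0.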

`-l`, or `--learning_rate`: We update our parameters by the negative of our gradient, scaling by our learning rate (`float`, default: 0.0045).

`-k`, or `--topk`: To evaluate accuracy, we need to tell the recommender system how many of its top-ranked items to consider for each validation example. We set this to 10 by default (data type: `int`).
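One common way to use a top-k cutoff is a hit-rate check: rank the candidate items for each user by predicted score and count a hit if the held-out item lands in the first `k`. The sketch below assumes that style of evaluation. The function `hit_rate_at_k` and the toy scores are made up for illustration and may differ from how the script computes its metric.

```python
def hit_rate_at_k(scores_per_user, held_out_items, k=10):
    """Fraction of users whose held-out item is ranked in the top k."""
    hits = 0
    for scores, target in zip(scores_per_user, held_out_items):
        # Sort item ids by descending score and keep the first k.
        ranked = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += target in ranked
    return hits / len(held_out_items)


scores_per_user = [
    {"a": 0.9, "b": 0.2, "c": 0.5},  # user 0: ranking is a, c, b
    {"a": 0.1, "b": 0.8, "c": 0.3},  # user 1: ranking is b, c, a
]
held_out_items = ["c", "a"]
print(hit_rate_at_k(scores_per_user, held_out_items, k=2))  # 0.5
```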

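`--seed`, or `-s`: sets a random seed (`int`, default: 1). We haven't yet seen where the script uses it, but a seed generally makes random steps, such as weight initialization and sampling, repeatable across runs.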

`--threshold`, or `-t`: sets early stopping for our training. Not sure how this works just yet (`float`, default: 1.0).

`--beta1`, or `-b1`: When we use the Adam optimizer, we need to set two `beta` values. We use `beta1` to compute a decaying average of prior gradients. For details, there's a great [blog post](http://ruder.io/optimizing-gradient-descent/index.html#adam) on Adam and other optimizers (`float`, default: 0.25).

`--beta2`, or `-b2`: Similar to `beta1`, we use `beta2` to compute a decaying average of prior squared gradients (`float`, default: 0.5).

`--eps`: We also need a (very small) epsilon value for computing gradient updates with Adam (`float`, default: 1e-8).
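To show where `beta1`, `beta2`, and `eps` enter the update, here is a sketch of the standard Adam rule for a single scalar parameter, written in plain Python. It follows the textbook formulation with bias correction and uses the defaults listed above. The function `adam_step` is a hypothetical name, and this is not PyTorch's or the repository's implementation.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.0045, beta1=0.25, beta2=0.5, eps=1e-8):
    """One Adam update for a single scalar parameter (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Minimize f(x) = x^2, whose gradient is 2x, starting from x = 1.0.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    x, m, v = adam_step(x, 2 * x, m, v, t)
print(abs(x) < 1.0)  # True: x has moved toward the minimum at 0
```

Note that `eps` only guards against division by zero when the squared-gradient average is tiny.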

`--dropout`: randomly sets activations (the inputs to a layer, not the learned parameters) to zero with a given probability during training (`float`, default: 0.5).

`--checkpoint_dir`: outputs the model to a specified directory (`str`, default: `/data/checkpoints/`).

`--load_checkpoint_path`: loads a model and (presumably) initializes this model with parameters from the checkpoint (`str`, no default path).

`--mode`: runs the script in either `train` mode (in which the model trains for the number of `epochs` specified) or `test` mode (in which the model "runs a single evaluation" - presumably a full forward pass over all the test data?). The type is `str` and the default is `train` (we only have our two options: `train` and `test`).

`--grads_accumulated`: "number of gradients to accumulate before performing an optimization step" (`int`, default: 1)
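Gradient accumulation is easiest to see in a small loop. The sketch below shows a common PyTorch pattern, assuming the loss is divided by the number of accumulated micro-batches so the summed gradient matches an average. The tiny linear model, the random data, and the value `grads_accumulated = 2` are placeholders for illustration, and the repository's training loop may differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
model = nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0045)
loss_fn = nn.BCEWithLogitsLoss()

grads_accumulated = 2                       # hypothetical setting for this sketch
features = torch.randn(8, 4)
labels = torch.randint(0, 2, (8, 1)).float()
micro_batches = zip(features.chunk(4), labels.chunk(4))  # four batches of two rows

optimizer.zero_grad()
steps_taken = 0
for i, (x, y) in enumerate(micro_batches, start=1):
    # Scale so the accumulated gradient is an average over the micro-batches.
    loss = loss_fn(model(x), y) / grads_accumulated
    loss.backward()                         # gradients add up in each .grad
    if i % grads_accumulated == 0:
        optimizer.step()                    # one update per accumulated group
        optimizer.zero_grad()
        steps_taken += 1
print(steps_taken)  # 2
```

With the default of 1, this reduces to an ordinary loop that steps after every batch.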

`--opt_level`: "optimization level for automatic mixed precision" (`str`, default: `O2`, choices: `['O0', 'O2']`). Note that these names begin with the letter O, not the digit zero.

`--local_rank`: "necessary for multi-GPU training" (`int`, default: `0`)
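As a recap, a parser for a handful of these flags might look like the sketch below. It mirrors the names, types, and defaults described above, but it is not copied from `ncf.py`: marking `--data` as required and reading `--layers` with `nargs="+"` are assumptions made for this example.

```python
from argparse import ArgumentParser


def parse_args(argv=None):
    """Sketch of a parser covering a handful of the flags described above."""
    parser = ArgumentParser(description="NCF training sketch")
    # Assumption: no default, so the flag is marked as required here.
    parser.add_argument("--data", type=str, required=True,
                        help="path to test and training data files")
    parser.add_argument("-e", "--epochs", type=int, default=30)
    parser.add_argument("-b", "--batch_size", type=int, default=2 ** 20)
    parser.add_argument("-f", "--factors", type=int, default=64)
    # Assumption: a space-separated list of integers on the command line.
    parser.add_argument("--layers", nargs="+", type=int,
                        default=[256, 256, 128, 64])
    parser.add_argument("-l", "--learning_rate", type=float, default=0.0045)
    parser.add_argument("--mode", choices=["train", "test"], default="train")
    return parser.parse_args(argv)


args = parse_args(["--data", "/tmp/ncf", "--layers", "128", "64"])
print(args.epochs, args.batch_size, args.layers)  # 30 1048576 [128, 64]
```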

Our next function is `init_distributed`, which "initializes distributed communication" for GPUs. Let's not worry about the specific logic here. We want to examine the neural network itself.

Looks like there are two more functions in `ncf.py`: `val_epoch` and `main`, and neither defines the network architecture. However, if we return to the `README.md`, the "Advanced" section notes that "the model architecture is defined in `neumf.py`." Let's examine it.

First we have our imports and a line defining `LOGGER.model = 'ncf'`. Let's comment out the line beginning with `sys.path.append` (if we run it in the block below, we get `NameError: name '__file__' is not defined`). `logger` resides in the same directory (`NCF`) as `neumf.py`, and thus we'll also comment out those lines - they won't run unless we clone the repository.

In [1]:
import numpy as np
import torch
import torch.nn as nn

import sys
from os.path import abspath, join, dirname

#sys.path.append(abspath(dirname(__file__) + '/'))

#from logger.logger import LOGGER
#from logger import tags

#LOGGER.model = 'ncf'

Everything else in `neumf.py` resides in the `NeuMF` class. Let's sketch the basic structure of the class and its functions:

In [2]:
class NeuMF(nn.Module):
    def __init__(self, nb_users, nb_items,
                 mf_dim, mlp_layer_sizes, dropout=0):
        
        super().__init__()

        # attributes and layers (embeddings, MLP, final layer) go here
        
        def glorot_uniform(layer):
            
            # code to initialize with glorot uniform
            
            pass
            
        def lecunn_uniform(layer):
        
            # code to initialize with lecun uniform
            
            pass
            
    def forward(self, user, item, sigmoid=False):
    
        # defines forward propagation?
        
        pass

Let's build out this class. Our constructor takes `self` and five arguments: `nb_users`, `nb_items`, `mf_dim`, `mlp_layer_sizes`, and `dropout`. We also define a term `nb_mlp_layers` equal to the length of `mlp_layer_sizes`.

Now we get into the network architecture! In lines 56-60, we see four `nn.Embedding` layers followed by `dropout` (zero by default). What is `nn.Embedding` exactly? If we go to the [documentation](https://pytorch.org/docs/stable/nn.html), search for "Embedding," and click on `[SOURCE]`, we see that the `Embedding` class inherits from the general `Module` class and is defined in `torch.nn.modules.sparse` as "a simple lookup table that stores embeddings of a fixed dictionary and size."

Returning to the first layer of our `NeuMF` class, we see that `num_embeddings` - the size of our dictionary - is equal to the `nb_users` argument passed to the constructor. `embedding_dim` - the size of each embedding vector - is equal to `mf_dim`.

Thus our architecture in lines 56-60 looks like this:

1) A layer `mf_user_embed` that learns `nb_users` embeddings of size `mf_dim`

2) A layer `mf_item_embed` that learns `nb_items` embeddings of size `mf_dim`

3) A layer `mlp_user_embed` that learns `nb_users` embeddings of size `mlp_layer_sizes[0] // 2`

4) A layer `mlp_item_embed` that learns `nb_items` embeddings of size `mlp_layer_sizes[0] // 2`

5) An optional dropout layer (omitted by default)
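The four embedding layers listed above can be sketched directly with `nn.Embedding`. The sizes below are toy values picked for illustration, and the final concatenation is an assumption based on the `mlp_layer_sizes[0] // 2` sizing, which suggests that the two MLP embeddings together form an input of width `mlp_layer_sizes[0]`.

```python
import torch
import torch.nn as nn

# Toy sizes chosen for illustration only.
nb_users, nb_items = 100, 50
mf_dim = 8
mlp_layer_sizes = [32, 16, 8]

mf_user_embed = nn.Embedding(nb_users, mf_dim)
mf_item_embed = nn.Embedding(nb_items, mf_dim)
mlp_user_embed = nn.Embedding(nb_users, mlp_layer_sizes[0] // 2)
mlp_item_embed = nn.Embedding(nb_items, mlp_layer_sizes[0] // 2)

user = torch.tensor([0, 7, 99])   # a batch of three user ids
item = torch.tensor([3, 3, 49])   # the matching item ids

print(mf_user_embed(user).shape)    # torch.Size([3, 8])
print(mlp_item_embed(item).shape)   # torch.Size([3, 16])

# An embedding is a lookup table: id 7 returns row 7 of the weight matrix.
print(torch.equal(mf_user_embed(user)[1], mf_user_embed.weight[7]))  # True

# Assumption: the two MLP embeddings are concatenated along the feature axis.
mlp_input = torch.cat((mlp_user_embed(user), mlp_item_embed(item)), dim=1)
print(mlp_input.shape)              # torch.Size([3, 32])
```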

Let's take a step back and examine other implementations of NCF. If we Google "neural collaborative filtering pytorch," something like [this](https://github.com/yihong-chen/neural-collaborative-filtering) should come up. This is based on the original NCF paper (He et al. 2017). The NCF authors implemented their model in Keras. This repo is in PyTorch instead. Again, we'll examine `neumf.py`. The second class `NeuMFEngine` inherits from `Engine`, and thus won't run without importing from that `engine.py` file. We'll omit that and focus on the `NeuMF` class, analogous to the same class in the prior example.

In [3]:
class NeuMF(torch.nn.Module):
    def __init__(self, config):
        super(NeuMF, self).__init__()
        
    def forward(self, user_indices, item_indices):
        pass
        
    def init_weight(self):
        pass
    
    def load_pretrain_weights(self):
        pass

Our constructor `__init__` uses the `config` argument to define a set of variables:

`num_users`: the number of embeddings we need to learn for `embedding_user_mlp` (multi-layer perceptron) and `embedding_user_mf` (matrix factorization). Presumably this is equal to the number of users, and we need to learn one embedding per user?

`num_items`: Just as for users, we learn `mlp` and `mf` embeddings for each item.

`latent_dim_mf`: We also need to specify the "latent dimension" of the `mf` embeddings vectors - in other words, the size of these vectors.

`latent_dim_mlp`: Just as for the `mf` embeddings, we need to define the latent dimension (size) of the `mlp` vectors.

`fc_layers` is set to `torch.nn.ModuleList()`. We then append `nn.Linear` layers of `in_size` input size and `out_size` output size according to the `layers` defined in `config`.

We then have an `affine_output` layer: a linear layer whose input size is the output size of the final layer in `layers` plus `latent_dim_mf` (the width of the two branches once they are concatenated), and which returns one output feature.

Finally, we define a `logistic` layer as `torch.nn.Sigmoid()`.

`forward` defines how we pass our data through the network. We have data for users (`user_indices`) and items (`item_indices`), and perform the following operations:

1) Apply our `mlp` and `mf` embeddings to each set of indices. 

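2) Concatenate our `mlp` embeddings for users and items (into `mlp_vector`), and multiply our `mf` embeddings for users and items element-wise (into `mf_vector`). The product keeps `mf_vector` at size `latent_dim_mf`, which is the width `affine_output` expects from the MF branch.

3) Pass `mlp_vector` through the linear layers defined in `fc_layers`, applying a `ReLU` activation after each one.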

4) Concatenate `mlp_vector` and `mf_vector`.

5) Apply a linear transformation to the output of 4) as defined in `affine_output`.

6) Pass the output of 5) through a sigmoid activation function and return this as our rating.
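Putting the constructor and the six steps together, a compact sketch of such a two-branch model might look like this. It is an illustration, not the repository's code. The class name `NeuMFSketch` and the toy `config` values are hypothetical, and the MF branch here combines user and item embeddings with an element-wise product so that its width stays at `latent_dim_mf`.

```python
import torch
import torch.nn as nn


class NeuMFSketch(nn.Module):
    """Illustrative two-branch model following the six steps above."""

    def __init__(self, config):
        super().__init__()
        self.embedding_user_mlp = nn.Embedding(config["num_users"], config["latent_dim_mlp"])
        self.embedding_item_mlp = nn.Embedding(config["num_items"], config["latent_dim_mlp"])
        self.embedding_user_mf = nn.Embedding(config["num_users"], config["latent_dim_mf"])
        self.embedding_item_mf = nn.Embedding(config["num_items"], config["latent_dim_mf"])

        # Consecutive entries of config["layers"] give (in_size, out_size) pairs.
        sizes = config["layers"]
        self.fc_layers = nn.ModuleList(
            [nn.Linear(in_size, out_size) for in_size, out_size in zip(sizes[:-1], sizes[1:])]
        )
        self.affine_output = nn.Linear(sizes[-1] + config["latent_dim_mf"], 1)
        self.logistic = nn.Sigmoid()

    def forward(self, user_indices, item_indices):
        # Step 1: look up both sets of embeddings.
        user_mlp = self.embedding_user_mlp(user_indices)
        item_mlp = self.embedding_item_mlp(item_indices)
        user_mf = self.embedding_user_mf(user_indices)
        item_mf = self.embedding_item_mf(item_indices)

        # Step 2: combine user and item embeddings within each branch.
        mlp_vector = torch.cat([user_mlp, item_mlp], dim=-1)
        mf_vector = user_mf * item_mf  # element-wise product (see lead-in)

        # Step 3: fully connected layers with ReLU on the MLP branch.
        for layer in self.fc_layers:
            mlp_vector = torch.relu(layer(mlp_vector))

        # Steps 4-6: concatenate branches, project to one value, squash to (0, 1).
        vector = torch.cat([mlp_vector, mf_vector], dim=-1)
        return self.logistic(self.affine_output(vector))


# Toy configuration: layers[0] must equal 2 * latent_dim_mlp.
config = {"num_users": 100, "num_items": 50,
          "latent_dim_mf": 8, "latent_dim_mlp": 8, "layers": [16, 32, 16, 8]}
model = NeuMFSketch(config)
rating = model(torch.tensor([0, 1, 2]), torch.tensor([10, 20, 30]))
print(rating.shape)  # torch.Size([3, 1])
```

Each row of `rating` is a value between 0 and 1 for one (user, item) pair.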

There's another function for loading pretrained weights, but we've covered the model architecture.