# Scalable Recommendation with Poisson Factorization
#### This notebook implements the mean-field variational algorithm for approximate posterior inference in Hierarchical Poisson Factorization (Gopalan et al., 2013), followed by some visualizations.

The module **hpf_vi** contains the model implementation.

In [0]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy.special
import scipy.stats
import sklearn.metrics
from hpf_vi_while import hpf_vi_while
from sklearn.metrics import mean_squared_error

In [0]:
ratings = pd.read_pickle("ratings_df.pkl")
ratings = np.array(ratings)

We work in a parametric framework, so the number of latent components $K$ is fixed in advance. Unlike in a non-parametric setup, $K$ is therefore not selected automatically by the model. I set this issue aside for the moment and fix $K = 10$.

## Initialization of the variational parameters
The model initializes the user parameters $\gamma_u$ and the item parameters $\lambda_i$ to the prior, plus a small random offset drawn from a $\text{Uniform}(0,1)$.

In this way, $\gamma^{shp}_{uk} = a$ and $\gamma^{rte}_{uk} = a/b'$, so that the expectation of the variational $\theta$ is $b'$.\
Similarly, $\lambda_{ik}^{shp} = c$ and $\lambda_{ik}^{rte} = c/d'$, so that the expectation of the variational $\beta$ is $d'$.

Going up in the hierarchy, for the same reason, we set the variational rate parameters of $\xi_u$ and $\eta_i$, namely $\kappa^{rte}_u$ and $\tau_i^{rte}$, to the prior, plus the same small random offset.

This initialization is embedded in the **hpf_vi** model.
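
As a rough illustration of the scheme described above, and not the code inside **hpf_vi**, the initialization could look like the sketch below. The function name `init_variational_params`, the hyperparameter defaults, and the `offset_scale` factor that keeps the uniform offset small are all assumptions made for this example.

```python
import numpy as np

def init_variational_params(n_users, n_items, K=10,
                            a=0.3, a_prime=0.3, b_prime=1.0,
                            c=0.3, c_prime=0.3, d_prime=1.0,
                            offset_scale=0.01, seed=0):
    """Hypothetical helper: set variational parameters to the prior plus a small offset."""
    rng = np.random.default_rng(seed)

    def jitter(shape):
        # Small positive offset: a Uniform(0, 1) draw scaled down (assumed scale)
        return offset_scale * rng.uniform(0.0, 1.0, size=shape)

    return {
        # User weights theta_uk: shape a, rate a / b'
        "gamma_shp": a + jitter((n_users, K)),
        "gamma_rte": a / b_prime + jitter((n_users, K)),
        # Item weights beta_ik: shape c, rate c / d'
        "lambda_shp": c + jitter((n_items, K)),
        "lambda_rte": c / d_prime + jitter((n_items, K)),
        # Rates of the user activity xi_u and item popularity eta_i
        "kappa_rte": a_prime / b_prime + jitter(n_users),
        "tau_rte": c_prime / d_prime + jitter(n_items),
    }

params = init_variational_params(n_users=10, n_items=15)
# The variational mean of theta is shape / rate, which stays close to b'
theta_mean = params["gamma_shp"] / params["gamma_rte"]
print(theta_mean.shape)
```

The random offset breaks the symmetry between the $K$ components, which would otherwise all receive identical updates.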

In [0]:
ratings_toy = [[0,0,0,0,0,1,1,1,0,0,0,0,1,0,0],[0,0,0,0,0,1,1,0,0,0,0,1,1,0,0],[0,0,0,0,0,1,1,1,0,0,0,0,0,0,0],[0,0,0,0,0,1,0,1,0,0,0,1,1,0,0],[0,0,1,1,0,0,0,0,0,0,0,0,0,0,0],[0,0,1,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,1,1,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,1,0,0,0,0,0,0,0,0,0,0,0],[1,0,0,0,1,0,0,0,0,0,0,0,0,1,1],[1,0,0,0,0,0,0,0,0,0,0,0,0,0,1]]
ratings_toy = np.array(ratings_toy)

The function below splits the data into a training and a validation set.\
As suggested by the authors of the paper, the held-out set consists of randomly selected ratings, which are set to zero during training.

In [0]:
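def train_val_split(data, valid_dim=0.2):
    '''
    Split a ratings matrix into a training and a validation set.

    Parameters:
    data = np.array, users x items ratings matrix
    valid_dim = float, fraction of each user's nonzero ratings moved to validation

    Returns:
    train, valid = np.arrays with the same shape as data; held-out ratings
    are zeroed in train and stored in valid
    '''
    if not 0 <= valid_dim < 1:
        raise ValueError("valid_dim must be in [0, 1)")

    train = data.copy()
    valid = np.zeros(data.shape)

    # Only users with at least one nonzero rating are visited
    for u in np.unique(data.nonzero()[0]):
        ind = data[u].nonzero()[0]
        # Hold out a fixed share of this user's ratings, sampled without replacement
        valid_ind = np.random.choice(ind, round(len(ind) * valid_dim), replace=False)
        valid[u, valid_ind] = data[u, valid_ind]
        train[u, valid_ind] = 0
    return train, valid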

train, valid = train_val_split(ratings_toy)
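
A quick sanity check on such a split is that no rating appears in both sets and that the two sets add back up to the original matrix. The helper below is a minimal sketch of that check; `check_split` is a hypothetical name, and the small matrices are made up for the example.

```python
import numpy as np

def check_split(data, train, valid):
    """Return True if train and valid partition the nonzero entries of data."""
    data, train, valid = (np.asarray(x) for x in (data, train, valid))
    # No entry may be nonzero in both the training and the validation matrix
    no_overlap = not np.any((train != 0) & (valid != 0))
    # Summing the two matrices must give back the original ratings
    reconstructs = np.array_equal(train + valid, data)
    return no_overlap and reconstructs

# Tiny hand-made example: one rating of the first user is held out
data = np.array([[1, 0, 1], [0, 1, 1]])
valid = np.array([[0, 0, 1], [0, 0, 0]])
train = data - valid
print(check_split(data, train, valid))  # True
```

On the notebook's data the same check would be called as `check_split(ratings_toy, train, valid)`.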

We can now fit the model, using the predefined settings.

In [177]:
model = hpf_vi_while() # instantiating the model
model.fit(train, valid = valid, iterations = 100)

AttributeError: ignored

In [107]:
list(np.argsort(model.predicted[0])[::-1])

[5, 12, 7, 6, 11, 3, 2, 14, 13, 10, 9, 8, 4, 1, 0]
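
The first positions of this ranking (items 5, 12, 7 and 6) are exactly the items the first user has already rated in `ratings_toy`. To turn scores into recommendations, already-rated items are usually masked out before ranking. The sketch below shows one way to do that; `top_n_unseen` is a hypothetical helper and the scores are made up for the example.

```python
import numpy as np

def top_n_unseen(scores, seen, n=3):
    """Indices of the n highest-scoring items the user has not rated yet."""
    scores = np.asarray(scores, dtype=float)
    seen = np.asarray(seen) != 0
    # Push already-rated items to the bottom of the ranking
    masked = np.where(seen, -np.inf, scores)
    order = np.argsort(masked)[::-1]
    # Never return more items than are actually unseen
    n_unseen = int((~seen).sum())
    return order[:min(n, n_unseen)].tolist()

scores = np.array([0.1, 0.9, 0.5, 0.7])
seen = np.array([0, 1, 0, 0])  # the user already rated item 1
print(top_n_unseen(scores, seen, n=2))  # [3, 2]
```

With the fitted model, the call for the first user would be `top_n_unseen(model.predicted[0], ratings_toy[0])`, assuming `model.predicted` holds the users x items score matrix as in the cell above.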

In [85]:
# Gap between predicted and observed ratings; large positive values flag candidate recommendations
ranking = model.predicted - ratings_toy
names = ["Anna", "Baldo", "Carlo", "Dante", "Emma", "Fabio", "Giorgio", "Helena", "Ida", "Lando"]
# Only 9 titles are listed, so only the first 9 of the 15 items can be named
movies = ["Lo squalo", "Quei bravi ragazzi", "Apocalypse Now", "Cantando sotto la pioggia", "Pulp Fiction", "Fight Club", "Toro scatenato", "L'appartamento", "C'era una volta il West"]

for u, name in enumerate(names):
    for i, movie in enumerate(movies):
        if ranking[u, i] > 0.1:
            print(f"Hey {name}! You may also like this movie: {movie}")

In [111]:
ratings_toy[0].nonzero()[0]

array([ 5,  6,  7, 12])

In [114]:
ratings_toy.nonzero()

(array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 7, 8,
        8, 8, 8, 9, 9]),
 array([ 5,  6,  7, 12,  5,  6, 11, 12,  5,  6,  7,  5,  7, 11, 12,  2,  3,
         2,  2,  3,  3,  0,  4, 13, 14,  0, 14]))

In [122]:
np.unique(ratings_toy.nonzero()[0])

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [124]:
ratings_toy[0,0]

0

In [163]:
mod.fit(ratings_toy,100)

AttributeError: ignored

In [157]:
mean_squared_error(ratings_toy,ratings_toy)

0.0