## 01 Preparing the data

In this notebook I will describe how the experiment is designed and how the data is prepared before being passed to the algorithm(s).

Before going any further, let me acknowledge the authors of the paper that I will be implementing here, as well as the code that inspired most of the code in this repo.

This repo is based on the paper [`Variational Autoencoders for Collaborative Filtering`](https://arxiv.org/pdf/1802.05814.pdf). The paper has a companion [repo](https://github.com/dawenl/vae_cf) where you can find the `tensorflow` implementation of the algorithm. As always, I strongly recommend having a look at both the repo and, of course, the paper.

On the other hand, the `Pytorch` and `Mxnet` implementations of the algorithm in this repo are greatly inspired by the code in this [repo](https://github.com/younggyoseo/vae-cf-pytorch). I have adapted the code to my coding preferences and added some options and flexibility to run multiple experiments.

Now that all those clever people have been acknowledged for their contributions, let's have a look at the data. Throughout this exercise I will use two datasets: the [Amazon Movies and TV](http://jmcauley.ucsd.edu/data/amazon/) dataset (see also [here](https://arxiv.org/pdf/1602.01585.pdf) and [here](https://arxiv.org/pdf/1506.04757.pdf)) and the [Movielens](https://grouplens.org/datasets/movielens/20m/) dataset. The latter is mainly used to make sure that I obtain results consistent with those in the paper.

As we will see throughout the notebook, the Amazon dataset is significantly more challenging than the Movielens dataset.

In [2]:
import os
import sys
import pandas as pd
import numpy as np
import pickle

from fire import Fire
from typing import Tuple, Dict, Union
from pathlib import Path

sys.path.append(os.path.abspath('../'))

Let's define a few constants

In [7]:
DATA_DIR = Path("../data")
new_colnames = ["user", "item", "rating", "timestamp"]

Let me focus on the `Amazon` data here. 

In [8]:
inp_path = DATA_DIR / "amazon-movies"
filename = "reviews_Movies_and_TV_5.json.gz"
raw_data = pd.read_json(inp_path / filename, lines=True)
keep_cols = ["reviewerID", "asin", "overall", "unixReviewTime"]
raw_data = raw_data[keep_cols]
raw_data.columns = new_colnames

# # Replace those lines with the ones below for the movielens dataset
# inp_path = DATA_DIR / "ml-20m"
# filename = "ratings.csv"
# raw_data = pd.read_csv(inp_path / filename, header=0)
# raw_data.columns = new_colnames
# raw_data = raw_data[raw_data["rating"] > 3.5]

In [33]:
raw_data.shape

(1697533, 4)

In [9]:
raw_data.head()

| | user | item | rating | timestamp |
|---|---|---|---|---|
| 0 | ADZPIG9QOCDG5 | 5019281 | 4 | 1203984000 |
| 1 | A35947ZP82G7JH | 5019281 | 3 | 1388361600 |
| 2 | A3UORV8A9D5L2E | 5019281 | 3 | 1388361600 |
| 3 | A1VKW06X1O2X7V | 5019281 | 5 | 1202860800 |
| 4 | A3R27T4HADWFFJ | 5019281 | 4 | 1387670400 |


The first thing we do is "*filter triplets*" (hereafter referred to as `tp`) based on the number of times a user interacted with items (`min_user_click`) or the number of times an item was "*interacted with*" by users (`min_item_click`).

We do this with the following two functions

In [11]:
def get_count(tp: pd.DataFrame, id: str) -> pd.Series:
    """
    Returns the number of rows in `tp` per value of column `id`,
    as a Series indexed by `id`
    """
    # Group with the default as_index=True so that `id` becomes the index.
    # With as_index=False, pandas >= 1.1 returns a DataFrame from .size(),
    # which breaks the index-based filtering in `filter_triplets`
    count = tp.groupby(id).size()
    return count


def filter_triplets(
    tp: pd.DataFrame, min_user_click: int, min_item_click: int
) -> Tuple[pd.DataFrame, pd.Series, pd.Series]:
    """
    Returns triplets (`tp`) of user-item-rating for users/items with
    at least min_user_click/min_item_click counts
    """
    # Keep only items with at least `min_item_click` interactions
    if min_item_click > 0:
        itemcount = get_count(tp, "item")
        tp = tp[tp["item"].isin(itemcount.index[itemcount >= min_item_click])]

    # Keep only users with at least `min_user_click` interactions
    if min_user_click > 0:
        usercount = get_count(tp, "user")
        tp = tp[tp["user"].isin(usercount.index[usercount >= min_user_click])]

    # Recompute the counts on the filtered data
    usercount, itemcount = get_count(tp, "user"), get_count(tp, "item")

    return tp, usercount, itemcount

In [12]:
filtered_raw_data, user_activity, item_popularity = filter_triplets(
    raw_data, min_user_click=5, min_item_click=0
)

In [34]:
filtered_raw_data.shape

(1697533, 4)

In [13]:
filtered_raw_data.head()

| | user | item | rating | timestamp |
|---|---|---|---|---|
| 0 | ADZPIG9QOCDG5 | 5019281 | 4 | 1203984000 |
| 1 | A35947ZP82G7JH | 5019281 | 3 | 1388361600 |
| 2 | A3UORV8A9D5L2E | 5019281 | 3 | 1388361600 |
| 3 | A1VKW06X1O2X7V | 5019281 | 5 | 1202860800 |
| 4 | A3R27T4HADWFFJ | 5019281 | 4 | 1387670400 |


In [15]:
user_activity.head(5)

user
A00295401U6S2UG3RAQSZ     6
A00348066Q1WEW5BMESN      5
A0040548BPHKXMHH3NTI     10
A00438023NNXSDBGXK56L     5
A0048168OBFNFN7WW8XC      9
dtype: int64

Note that, since I am using `"reviews_Movies_and_TV_5.json.gz"` (i.e. the 5-core dataset, where every user and every item has at least 5 reviews), `filter_triplets` has no effect on the `Amazon` dataset. It does, however, filter some users/items in the case of the `Movielens` dataset.
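
Because the 5-core Amazon file leaves nothing to filter, the effect of the thresholds is easier to see on toy data. The snippet below is a minimal sketch of the user-side filter only; the toy dataframe and the threshold of 2 are made up for illustration and are not the values used in this notebook.

```python
import pandas as pd

# Hypothetical toy interactions: "u1" has 3 clicks, "u2" has 2, "u3" has 1
toy = pd.DataFrame(
    {
        "user": ["u1", "u1", "u1", "u2", "u2", "u3"],
        "item": ["a", "b", "c", "a", "b", "a"],
    }
)

min_user_click = 2  # illustrative threshold only

# Number of interactions per user, as a Series indexed by user id
user_count = toy.groupby("user").size()

# Keep only the rows whose user reaches the threshold
keep_users = user_count.index[user_count >= min_user_click]
toy_filtered = toy[toy["user"].isin(keep_users)]

print(toy_filtered["user"].unique())
```

Here the single interaction of `u3` is dropped, while all rows of `u1` and `u2` survive.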

Let's now have a look at the sparsity of the dataset

In [17]:
sparsity = (
    1.0
    * filtered_raw_data.shape[0]
    / (user_activity.shape[0] * item_popularity.shape[0])
)

print(
    "After filtering, there are %d watching events from %d users and %d movies (sparsity: %.3f%%)"
    % (
        filtered_raw_data.shape[0],
        user_activity.shape[0],
        item_popularity.shape[0],
        sparsity * 100,
    )
)


After filtering, there are 1697533 watching events from 123960 users and 50052 movies (sparsity: 0.027%)


Comparing these numbers to those of the `Movielens` dataset (9990682 watching events from 136677 users and 20720 movies, sparsity: 0.353%; see the [notebook](https://github.com/dawenl/vae_cf/blob/master/VAE_ML20M_WWW2018.ipynb) corresponding to the original publication, or the [original publication](https://arxiv.org/pdf/1802.05814.pdf) itself), one can see that the `Amazon` dataset is $\sim$13 times sparser than the `Movielens` dataset. Note that "sparsity" as computed above is the fraction of user-item pairs that are observed, so a lower value means a sparser dataset. Consequently, I expect the algorithm to find it more challenging, resulting in lower ranking metrics.
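
The factor of roughly 13 can be checked directly from the counts quoted above. This is only a small arithmetic sketch; the helper name `density` is mine and does not appear elsewhere in the repo.

```python
def density(n_interactions: int, n_users: int, n_items: int) -> float:
    """Fraction of the user-item matrix that is observed."""
    return n_interactions / (n_users * n_items)


# Counts quoted in the text for the two datasets
amazon = density(1_697_533, 123_960, 50_052)
movielens = density(9_990_682, 136_677, 20_720)

# How many times denser Movielens is than Amazon
ratio = movielens / amazon
print(f"Amazon: {amazon:.3%}, Movielens: {movielens:.3%}, ratio: {ratio:.1f}")
```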

Once the raw data is filtered, we follow the same procedure as the original authors to split the users into training, validation and test users. This happens within the next function:

In [21]:
def split_users(
    unique_uid: pd.Index, test_users_size: Union[float, int]
) -> Tuple[pd.Index, pd.Index, pd.Index]:

    n_users = unique_uid.size

    if isinstance(test_users_size, int):
        n_heldout_users = test_users_size
    else:
        n_heldout_users = int(test_users_size * n_users)

    tr_users = unique_uid[: (n_users - n_heldout_users * 2)]
    vd_users = unique_uid[(n_users - n_heldout_users * 2) : (n_users - n_heldout_users)]
    te_users = unique_uid[(n_users - n_heldout_users) :]

    return tr_users, vd_users, te_users

In [23]:
unique_uid = user_activity.index
np.random.seed(98765)
idx_perm = np.random.permutation(unique_uid.size)
unique_uid = unique_uid[idx_perm]
tr_users, vd_users, te_users = split_users(
    unique_uid, test_users_size=0.1
)

In [26]:
print(tr_users.shape, vd_users.shape, te_users.shape)

(99168,) (12396,) (12396,)
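
These sizes follow from simple arithmetic on the number of users. As a quick sanity check, assuming the 123960 users reported after filtering and the same `test_users_size=0.1`:

```python
n_users = 123_960        # users left after filtering
test_users_size = 0.1    # fraction held out for validation, and again for test

# Same rule as in `split_users` when a float is passed
n_heldout_users = int(test_users_size * n_users)
n_train_users = n_users - 2 * n_heldout_users

print(n_train_users, n_heldout_users, n_heldout_users)
```

Passing an `int` instead of a float would use that number directly as the size of each held-out group.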


In [28]:
tr_users[:10]

Index(['A2AA6XJ6ZPU52I', 'A2D6VDR54FDOK1', 'A1GOCS6P9ZJLDJ', 'AD3IEMGOL22CZ',
       'A1O0YYQ587UV5X', 'AJ13O1EERMG98', 'A36ZCBPPL2MSRG', 'A1FA82C5JQAC2H',
       'A2Z2XRM3TNKUS', 'AWN86EEALSBNR'],
      dtype='object', name='user')

In [29]:
# Select the training observations raw data 
tr_obsrv = filtered_raw_data.loc[filtered_raw_data["user"].isin(tr_users)]
tr_items = pd.unique(tr_obsrv["item"])

# Save index dictionaries to "numerise" later on
item2id = dict((sid, i) for (i, sid) in enumerate(tr_items))
user2id = dict((pid, i) for (i, pid) in enumerate(unique_uid))

And this is how the authors design the experiment: for validation and test they consider "*only*" items that have been seen during training

In [30]:
vd_obsrv = filtered_raw_data[
    filtered_raw_data["user"].isin(vd_users)
    & filtered_raw_data["item"].isin(tr_items)
]

te_obsrv = filtered_raw_data[
    filtered_raw_data["user"].isin(te_users)
    & filtered_raw_data["item"].isin(tr_items)
]

Note that, in reality, this is not really an aggressive filtering, since the total number of items was 50052 and the number of training items is

In [32]:
print(tr_items.shape)

(50049,)


Now that we have the validation and test users and their interactions, we will split those interactions into the so-called "*validation_train, validation_test, test_train and test_test sets*".

I know this sounds convoluted, but stay with me, it is not that complex. The "*validation_train and test_train sets*" will be used to build what we could think of as an input binary *"image"* to be "*encoded -> decoded*" by the trained auto-encoder, while the "*validation_test and test_test sets*" will be used to compute the ranking metrics at validation/test time.

For example, let's assume that we have already trained a VAE and we are at validation time. Let's say that a validation user has interacted with 10 items, and that we use 8 of these 10 as `validation_train` set and the remaining 2 as `validation_test` set. Let's also assume that the total number of items in our entire database is 20. 

Let's say that, once indexed, those 10 items that the user has interacted with are: *{0, 1, 3, 4, 5, 7, 9, 12, 13, 19}*

validation train items: *{0, 1, 3, 5, 7, 9, 12, 19}*

validation test items:  *{4, 13}*

In [58]:
x_val_tr = np.zeros(20)
x_val_te = np.zeros(20)

# validation items that will be passed as inputs to the trained AE and will be used
# to compute a "reconstruction" loss
x_val_tr[[0, 1, 3, 5, 7, 9, 12, 19]] = 1

# reconstructed ratings by the AE
x_val_out = np.random.uniform(0, 1, 20)

# validation items that will be used to compute ranking metrics
x_val_te[[4, 13]] = 1

In [59]:
# this, when using batches, can be thought of as a binary image to be encoded-decoded by the auto-encoder
x_val_tr

array([1., 1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 0., 1., 0., 0., 0., 0.,
       0., 0., 1.])

In [60]:
x_val_te

array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
       0., 0., 0.])

In [61]:
x_val_out

array([0.30278161, 0.14136984, 0.06735148, 0.49820737, 0.66620596,
       0.89161954, 0.20605658, 0.86947657, 0.33678987, 0.59325109,
       0.92272292, 0.17465857, 0.55970304, 0.18423661, 0.57690288,
       0.51976802, 0.35363432, 0.52543602, 0.01105844, 0.946705  ])

At validation time, we would compute the loss as `loss(x_val_out, x_val_tr)`.

We would then set the entries of `x_val_out` that correspond to the items in `x_val_tr` to a very low number, since we do not want to consider them when computing the ranking metrics

In [64]:
x_val_out[x_val_tr.nonzero()] = -np.inf

In [65]:
x_val_out

array([      -inf,       -inf, 0.06735148,       -inf, 0.66620596,
             -inf, 0.20605658,       -inf, 0.33678987,       -inf,
       0.92272292, 0.17465857,       -inf, 0.18423661, 0.57690288,
       0.51976802, 0.35363432, 0.52543602, 0.01105844,       -inf])

and then `ranking_metric(x_val_out, x_val_te)`
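
To make `ranking_metric` concrete, here is a minimal sketch of one possible choice, Recall@k, applied to the masked scores shown above. The function name `recall_at_k` is hypothetical, and normalising by `min(k, number of held-out items)` is my reading of the paper's Recall@R, so treat both as assumptions rather than the code used in this repo.

```python
import numpy as np


def recall_at_k(scores: np.ndarray, heldout: np.ndarray, k: int) -> float:
    """Recall@k for one user, normalised by min(k, number of held-out items)."""
    top_k = np.argsort(-scores)[:k]   # indices of the k highest scores
    hits = heldout[top_k].sum()       # held-out items found among the top k
    return float(hits / min(k, heldout.sum()))


# Masked scores copied from the output above (-inf marks validation-train items)
x_val_out = np.array([
    -np.inf, -np.inf, 0.06735148, -np.inf, 0.66620596,
    -np.inf, 0.20605658, -np.inf, 0.33678987, -np.inf,
    0.92272292, 0.17465857, -np.inf, 0.18423661, 0.57690288,
    0.51976802, 0.35363432, 0.52543602, 0.01105844, -np.inf,
])

# Held-out (validation test) items: {4, 13}
x_val_te = np.zeros(20)
x_val_te[[4, 13]] = 1

print(recall_at_k(x_val_out, x_val_te, k=5))
print(recall_at_k(x_val_out, x_val_te, k=10))
```

With these random scores, item 4 is ranked second and item 13 ninth, so only one of the two held-out items is recovered in the top 5, while both appear in the top 10.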

The validation and test train/test split is accomplished by these functions

In [67]:
def split_train_test(
    data: pd.DataFrame, test_size: float
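) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Splits each user's interactions into a "train" part and a held-out
    "test" part of relative size `test_size`. Users with fewer than 5
    interactions keep all of them in the "train" part
    """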
    data_grouped_by_user = data.groupby("user")
    tr_list, te_list = list(), list()

    np.random.seed(98765)

    for i, (nm, group) in enumerate(data_grouped_by_user):
        n_items_u = len(group)

        if n_items_u >= 5:
            idx = np.zeros(n_items_u, dtype="bool")
            idx[
                np.random.choice(
                    n_items_u, size=int(test_size * n_items_u), replace=False
                ).astype("int64")
            ] = True

            tr_list.append(group[np.logical_not(idx)])
            te_list.append(group[idx])
        else:
            tr_list.append(group)

        if i % 1000 == 0:
            print("%d users sampled" % i)
            sys.stdout.flush()

    data_tr = pd.concat(tr_list)
    data_te = pd.concat(te_list)

    return data_tr, data_te


def numerize(tp: pd.DataFrame, user2id: Dict, item2id: Dict) -> pd.DataFrame:
    user = [user2id[x] for x in tp["user"]]
    item = [item2id[x] for x in tp["item"]]
    return pd.DataFrame(data={"user": user, "item": item}, columns=["user", "item"])

In [69]:
vd_items_tr, vd_items_te = split_train_test(vd_obsrv, test_size=0.2)
te_items_tr, te_items_te = split_train_test(te_obsrv, test_size=0.2)

0 users sampled
1000 users sampled
2000 users sampled
3000 users sampled
4000 users sampled
5000 users sampled
6000 users sampled
7000 users sampled
8000 users sampled
9000 users sampled
10000 users sampled
11000 users sampled
12000 users sampled
0 users sampled
1000 users sampled
2000 users sampled
3000 users sampled
4000 users sampled
5000 users sampled
6000 users sampled
7000 users sampled
8000 users sampled
9000 users sampled
10000 users sampled
11000 users sampled
12000 users sampled
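
To connect these splits back to the binary "*image*" described earlier, the sketch below applies the same id-to-index mapping as `numerize` to a toy dataframe and fills a dense binary user-by-item matrix. The toy ids and dictionaries are hypothetical, and a dense `numpy` array is used only for readability: with tens of thousands of users and items a sparse matrix would be the natural choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two users and four items known from training
user2id = {"u1": 0, "u2": 1}
item2id = {"a": 0, "b": 1, "c": 2, "d": 3}
toy_tr = pd.DataFrame({"user": ["u1", "u1", "u2"], "item": ["a", "c", "d"]})

# Same mapping step as `numerize`: string ids -> integer indices
rows = toy_tr["user"].map(user2id).to_numpy()
cols = toy_tr["item"].map(item2id).to_numpy()

# One row per user, one column per item, 1 where an interaction exists
x = np.zeros((len(user2id), len(item2id)))
x[rows, cols] = 1

print(x)
```

Each row of `x` plays the role of `x_val_tr` in the earlier example.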


And that's it regarding the data preparation. 

Now let's have a look at the models.