# Chapter 1

## 1.1 Introduction

The content in this `Amazon` directory follows a similar (if not identical) approach to that in the `Ponpare` directory. That is, there will be a series of Jupyter notebooks with more "explanation-oriented" code, followed by the corresponding companion Python scripts.

Here I will concentrate mostly on matrix factorization algorithms, using the [Amazon Reviews dataset](https://arxiv.org/pdf/1602.01585.pdf) [1] [2], in particular the 5-core Movies and TV reviews.


Using that dataset I will implement the [Xiangnan He et al. 2017](https://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf) paper [3], *Neural Collaborative Filtering*.


## 1.2 Data Preparation

An implementation of Xiangnan He's paper in PyTorch, Gluon and Keras (the original) can be found [here](https://github.com/jrzaurin/neural_cf), along with an explanation of the algorithm. In this repo I will use the PyTorch implementation and explain the main components again.

The problem, as framed in the paper, consists of predicting whether a user "interacted" with an item (1) or not (0) (i.e. ignoring the actual rating), using implicit negative feedback. The success metrics are the Hit Ratio (HR) and the [Normalized Discounted Cumulative Gain (NDCG)](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) at K, with K=10 in this exercise. For more details on the problem formulation I recommend reading the paper and having a look at the code linked above.
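In a leave-one-out setting (one held-out item per user, ranked among a set of other candidates), both metrics reduce to simple functions of the position of the held-out item. A sketch of the usual definitions, where $r_u$ (a symbol introduced here only for illustration) is the 1-based position of user $u$'s held-out item in the ranked list and $U$ is the set of test users:

```latex
\mathrm{HR@}K = \frac{1}{|U|} \sum_{u \in U} \mathbb{1}[r_u \le K],
\qquad
\mathrm{NDCG@}K = \frac{1}{|U|} \sum_{u \in U} \frac{\mathbb{1}[r_u \le K]}{\log_2(1 + r_u)}
```

With a single relevant item per user the ideal DCG equals 1, so no further normalization is needed. For example, a held-out item ranked third contributes 1 to HR@10 and $1/\log_2 4 = 0.5$ to NDCG@10.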

To start, simply download the dataset

    wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Movies_and_TV_5.json.gz
 
to your `workdir`. In my case I place all the datasets I use for this RecoTour repo at: 

    /home/ubuntu/projects/RecoTour/datasets/
    
There I have Ponpare, Amazon, etc...

Let's now move to the code

In [1]:
import numpy as np
import pandas as pd
import gzip
import pickle
import argparse
import scipy.sparse as sp

from time import time
from pathlib import Path
from scipy.sparse import save_npz
from joblib import Parallel, delayed

In [2]:
DATA_PATH = Path("/home/ubuntu/projects/RecoTour/datasets/Amazon")
reviews = "reviews_Movies_and_TV_5.json.gz"

In [3]:
df = pd.read_json(DATA_PATH/reviews, lines=True)
keep_cols = ['reviewerID', 'asin', 'unixReviewTime', 'overall']
new_colnames = ['user', 'item', 'timestamp', 'rating']
df = df[keep_cols]
df.columns = new_colnames
df.head()

|   | user | item | timestamp | rating |
|---|------|------|-----------|--------|
| 0 | ADZPIG9QOCDG5 | 5019281 | 1203984000 | 4 |
| 1 | A35947ZP82G7JH | 5019281 | 1388361600 | 3 |
| 2 | A3UORV8A9D5L2E | 5019281 | 1388361600 | 3 |
| 3 | A1VKW06X1O2X7V | 5019281 | 1202860800 | 5 |
| 4 | A3R27T4HADWFFJ | 5019281 | 1387670400 | 4 |
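If loading the whole file with `pd.read_json` turns out to be too memory-hungry, the gzipped file can also be read line by line, keeping only the fields that are needed. The sketch below is an alternative under that assumption, not what this notebook does; `read_json_lines` is a hypothetical helper and the two-record file it reads is made-up demo data.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def read_json_lines(path, keep_cols):
    """Read a gzipped JSON-lines file, keeping only `keep_cols` from each record."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            rows.append({col: record.get(col) for col in keep_cols})
    return pd.DataFrame(rows, columns=keep_cols)


# Tiny made-up file with the same field names as the reviews dataset
demo_records = [
    {"reviewerID": "U1", "asin": "I1", "unixReviewTime": 100, "overall": 5, "summary": "x"},
    {"reviewerID": "U2", "asin": "I1", "unixReviewTime": 200, "overall": 3, "summary": "y"},
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_reviews.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    for record in demo_records:
        f.write(json.dumps(record) + "\n")

demo_df = read_json_lines(demo_path, ["reviewerID", "asin", "unixReviewTime", "overall"])
print(demo_df)
```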


In [4]:
df.rating.value_counts()

5    906608
4    382994
3    201302
1    104219
2    102410
Name: rating, dtype: int64

A lot of people seem to love the movies they watch: there are more 5s than 1s, 2s, 3s and 4s together.

In [5]:
# (temporal) rank of items bought
df['rank'] = df.groupby("user")["timestamp"].rank(ascending=True, method='dense')
df.drop("timestamp", axis=1, inplace=True)
df.head()

|   | user | item | rating | rank |
|---|------|------|--------|------|
| 0 | ADZPIG9QOCDG5 | 5019281 | 4 | 2.0 |
| 1 | A35947ZP82G7JH | 5019281 | 3 | 1.0 |
| 2 | A3UORV8A9D5L2E | 5019281 | 3 | 3.0 |
| 3 | A1VKW06X1O2X7V | 5019281 | 5 | 1.0 |
| 4 | A3R27T4HADWFFJ | 5019281 | 4 | 2.0 |
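To see what `method='dense'` does, here is a toy example on made-up data: within each user, tied timestamps share the same rank and the ranks stay consecutive.

```python
import pandas as pd

toy = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "timestamp": [30, 10, 10, 5, 7],
})

# Rank timestamps within each user; ties get the same rank, with no gaps
toy["rank"] = toy.groupby("user")["timestamp"].rank(ascending=True, method="dense")
print(toy)
```

Because reviews with identical timestamps get the same rank, a user's "last" rating may be tied between several items.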


In [6]:
# mapping user and item ids to (contiguous) integers
user_mappings = {k:v for v,k in enumerate(df.user.unique())}
item_mappings = {k:v for v,k in enumerate(df.item.unique())}
df['user'] = df['user'].map(user_mappings)
df['item'] = df['item'].map(item_mappings)
df = df[['user','item','rank','rating']].astype(np.int64)
n_users = df.user.nunique()
n_items = df.item.nunique()
df.head()

|   | user | item | rank | rating |
|---|------|------|------|--------|
| 0 | 0 | 0 | 2 | 4 |
| 1 | 1 | 0 | 1 | 3 |
| 2 | 2 | 0 | 3 | 3 |
| 3 | 3 | 0 | 1 | 5 |
| 4 | 4 | 0 | 2 | 4 |
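The dictionary-based mapping assigns integer ids in order of first appearance. As a sketch on made-up data, `pd.factorize` yields the same codes in one call, which may be convenient when the mapping dictionary itself is not needed.

```python
import pandas as pd

toy_users = pd.Series(["u9", "u3", "u9", "u7", "u3"])

# Dictionary approach: enumerate the unique values in order of appearance
mapping = {k: v for v, k in enumerate(toy_users.unique())}
mapped = toy_users.map(mapping)

# pd.factorize returns (codes, uniques); codes follow order of first appearance
codes, uniques = pd.factorize(toy_users)

print(mapped.tolist())
print(codes.tolist())
```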


And now is where the "proper preparation" and problem setup begin. We will use each user's last rating for testing and all the previous ones for training.

In [7]:
dfc = df.copy()
dfc.sort_values(['user','rank'], ascending=[True,True], inplace=True)
dfc.reset_index(inplace=True, drop=True)

# use last ratings for testing and all the previous for training
test = dfc.groupby('user').tail(1)
train = pd.merge(dfc, test, on=['user','item'],
    how='outer', suffixes=('', '_y'))
train = train[train.rating_y.isnull()]
test = test[['user','item','rating']]
train = train[['user','item','rating']]
print(train.shape, test.shape)

(1573573, 3) (123960, 3)
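The outer merge followed by the `rating_y.isnull()` filter acts as an anti-join: it keeps the rows of `dfc` whose (user, item) pair does not appear in `test`. A minimal sketch of the same leave-one-out split on made-up data, dropping the test rows by index instead (an alternative that assumes each (user, item) pair appears only once):

```python
import pandas as pd

toy = pd.DataFrame({
    "user":   [0, 0, 0, 1, 1],
    "item":   [10, 11, 12, 10, 13],
    "rank":   [1, 2, 3, 1, 2],
    "rating": [5, 4, 3, 2, 5],
})
toy = toy.sort_values(["user", "rank"]).reset_index(drop=True)

# Last interaction of every user goes to test, everything else to train
toy_test = toy.groupby("user").tail(1)
toy_train = toy.drop(toy_test.index)

print(toy_train.shape, toy_test.shape)
```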


## 1.3 Testing Method

During testing, we will use 99 random movies per user that were never rated by that user. The resulting 100 items (1 rated + 99 non-rated) will be ranked, and our ranking metrics will be the already mentioned HR@10 and NDCG@10. Later in the notebooks we will reflect a bit on the pros and cons of this setup.
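A minimal sketch of how such a 100-item ranking could be scored, assuming a `scores` array of shape `(n_users, 100)` whose first column holds the score of the rated item. Random numbers stand in for model predictions here, and `rank_metrics` is a hypothetical helper, not part of this repo.

```python
import numpy as np


def rank_metrics(scores, k=10):
    """HR@k and NDCG@k when column 0 of `scores` is each user's rated item."""
    # 0-based rank of the rated item = number of candidates scored strictly higher
    ranks = (scores[:, 1:] > scores[:, [0]]).sum(axis=1)
    hits = ranks < k
    hr = hits.mean()
    # A hit at 0-based rank r contributes 1 / log2(r + 2); misses contribute 0
    ndcg = np.where(hits, 1.0 / np.log2(ranks + 2), 0.0).mean()
    return hr, ndcg


rng = np.random.default_rng(0)
random_scores = rng.random((1000, 100))  # stand-in for model predictions
hr, ndcg = rank_metrics(random_scores)
print(f"HR@10: {hr:.3f}, NDCG@10: {ndcg:.3f}")
```

With random scores, HR@10 should hover around 10/100 = 0.1, which gives a rough lower bound to compare a trained model against.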

In [8]:
# select 99 random movies per user that were never rated by that user
all_items = dfc.item.unique()
rated_items = (dfc.groupby("user")['item']
    .apply(list)
    .reset_index()
    ).item.tolist()

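def sample_not_rated(item_list, rseed=1, n=99):
    # np.random.seed must be called; assigning to it (np.random.seed = rseed)
    # never seeds the generator
    np.random.seed(rseed)
    return np.random.choice(np.setdiff1d(all_items, item_list), n)

print("sampling not rated items...")
start = time()
# one seed per user: reproducible, without every user sharing the same random draws
non_rated_items = Parallel(n_jobs=4)(
    delayed(sample_not_rated)(ri, rseed=i) for i, ri in enumerate(rated_items))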
end = time() - start
print("sampling took {} min".format(round(end/60,2)))

negative = pd.DataFrame({'negative':non_rated_items})
negative[['item_n'+str(i) for i in range(99)]] =\
    pd.DataFrame(negative.negative.values.tolist(), index= negative.index)
negative.drop('negative', axis=1, inplace=True)
negative = negative.stack().reset_index()
negative = negative.iloc[:, [0,2]]
negative.columns = ['user','item']
negative['rating'] = 0
assert negative.shape[0] == len(non_rated_items)*99
test_negative = (pd.concat([test,negative])
    .sort_values('user', ascending=True)
    .reset_index(drop=True)
    )
# Ensuring that the 1st element every 100 is the rated item. This is
# fundamental for testing
test_negative.sort_values(['user', 'rating'], ascending=[True,False], inplace=True)
assert np.all(test_negative.values[0::100][:,2] != 0)

sampling not rated items...
sampling took 2.64 min
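A sketch of the sampling step on made-up data, under two assumptions that differ from the cell above: it uses NumPy's `Generator` API with an explicit per-user seed, and it samples without replacement so that the negatives of a user are all distinct. `sample_negatives` is a hypothetical helper. The long-format frame is built with `np.repeat` rather than `stack`.

```python
import numpy as np
import pandas as pd


def sample_negatives(all_items, rated, n, seed):
    """Draw `n` distinct items from `all_items` that are not in `rated`."""
    rng = np.random.default_rng(seed)
    candidates = np.setdiff1d(all_items, rated)
    return rng.choice(candidates, size=n, replace=False)


toy_items = np.arange(20)
toy_rated = [[0, 1, 2], [5, 6]]  # items rated by user 0 and user 1
n_neg = 4

toy_negatives = [
    sample_negatives(toy_items, rated, n_neg, seed=user)
    for user, rated in enumerate(toy_rated)
]

# Long format: one (user, item, rating=0) row per sampled negative
toy_negative_df = pd.DataFrame({
    "user": np.repeat(np.arange(len(toy_rated)), n_neg),
    "item": np.concatenate(toy_negatives),
    "rating": 0,
})
print(toy_negative_df)
```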


Let's make sure we did a good job: let's pick a random user and check that none of the negative items in the test set were ever rated by that user.

In [9]:
# the upper bound of np.random.randint is exclusive, so n_users covers every user id
user_id = np.random.randint(0, n_users)
items_rated = test_negative[(test_negative.user==user_id) & (test_negative.rating != 0)]['item'].tolist()
items_rated+= train[train.user==user_id]['item'].tolist()
items_never_rated = test_negative[(test_negative.user==user_id) & (test_negative.rating == 0)]['item'].tolist()
assert len(np.intersect1d(items_rated, items_never_rated)) == 0

Let's define a helper function to build a sparse matrix of interactions from an array of (user, item, rating) rows.

In [10]:
def array2mtx(interactions):
    num_users = interactions[:,0].max()
    num_items = interactions[:,1].max()
    mat = sp.dok_matrix((num_users+1, num_items+1), dtype=np.float32)
    for user, item, rating in interactions:
        mat[user, item] = rating
    return mat.tocsr()
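Filling a `dok_matrix` entry by entry in a Python loop can be slow for millions of interactions. A possible vectorized alternative, sketched here under the hypothetical name `array2mtx_coo`, builds the matrix from coordinate arrays. One caveat: when converting to CSR, duplicate (user, item) entries are summed, whereas the loop keeps the last value.

```python
import numpy as np
import scipy.sparse as sp


def array2mtx_coo(interactions):
    """Build a CSR user-item matrix from an array of (user, item, rating) rows."""
    users = interactions[:, 0]
    items = interactions[:, 1]
    ratings = interactions[:, 2].astype(np.float32)
    shape = (users.max() + 1, items.max() + 1)
    return sp.coo_matrix((ratings, (users, items)), shape=shape).tocsr()


toy_interactions = np.array([[0, 0, 4], [0, 2, 5], [1, 1, 3]])
toy_mtx = array2mtx_coo(toy_interactions)
print(toy_mtx.toarray())
```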

In [11]:
print("saving training set as sparse matrix...")
train_mtx = array2mtx(train.values)

saving training set as sparse matrix...


And that's it. The required objects can be saved to disk as shown below (the calls are commented out here), and we are ready to experiment.

In [12]:
# # Save
# np.savez(DATA_PATH/"neuralcf_split.npz", train=train.values, test=test.values,
#     test_negative=test_negative.values, negatives=np.array(non_rated_items),
#     n_users=n_users, n_items=n_items)

# # Save training as sparse matrix
# print("saving training set as sparse matrix...")
# train_mtx = array2mtx(train.values)
# save_npz(DATA_PATH/"neuralcf_train_sparse.npz", train_mtx)
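
A sketch of the save and load round trip on made-up arrays, written to a temporary directory. The file names mirror the ones in the cell above; everything else is illustrative. Note that scalars stored with `np.savez` come back as 0-dimensional arrays.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp
from scipy.sparse import load_npz, save_npz

tmp_dir = tempfile.mkdtemp()
split_path = os.path.join(tmp_dir, "neuralcf_split.npz")
sparse_path = os.path.join(tmp_dir, "neuralcf_train_sparse.npz")

toy_train = np.array([[0, 0, 4], [1, 1, 3]])
toy_test = np.array([[0, 2, 5], [1, 0, 2]])
toy_train_mtx = sp.csr_matrix(np.array([[4, 0, 0], [0, 3, 0]], dtype=np.float32))

# Save the dense splits and the sparse training matrix
np.savez(split_path, train=toy_train, test=toy_test, n_users=2, n_items=3)
save_npz(sparse_path, toy_train_mtx)

# Load them back
loaded = np.load(split_path)
n_users_loaded = int(loaded["n_users"])  # 0-d array -> Python int
train_loaded = loaded["train"]
train_mtx_loaded = load_npz(sparse_path)
print(n_users_loaded, train_loaded.shape, train_mtx_loaded.shape)
```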

### REFERENCES

[1] R. He, J. McAuley. Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016

[2] J. McAuley, C. Targett, J. Shi, A. van den Hengel. Image-based recommendations on styles and substitutes. SIGIR, 2015

[3] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, T.-S. Chua. Neural collaborative filtering. arXiv:1708.05031v2, 2017