## Amazon Movies Data Preparation

To start, simply download the dataset:

```
wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Movies_and_TV_5.json.gz
```

Before we jump into the code, a quick note. Most of the notebooks in this directory are designed to run on any laptop, with no need for a GPU or a machine with a lot of memory. With that in mind, I have included in this directory a script called `generate_toy_data.py` that generates a small, random dataset replicating exactly the data format required by the Neural Graph Collaborative Filtering (NGCF) algorithm.

Therefore, if you just want to see how the algorithm works and do not intend to run it on the Amazon dataset, simply run that script, e.g.:

```
python generate_toy_data.py --n_users 1000 --n_items 2000
```

And you are good to go: you can move directly to notebook Chapter02. With that said, let's see how one would prepare the Amazon movie reviews to be passed to the NGCF algorithm.

In [1]:
import numpy as np
import pandas as pd
import scipy.sparse as sp
import csv

from tqdm import tqdm
from pathlib import Path

As with any other dataset used in this repo, I place it in `~/projects/RecoTour/datasets`.

In [2]:
DATA_PATH = Path("/home/ubuntu/projects/RecoTour/datasets/Amazon")
reviews = "reviews_Movies_and_TV_5.json.gz"

In [3]:
df = pd.read_json(DATA_PATH/reviews, lines=True)
keep_cols = ['reviewerID', 'asin', 'unixReviewTime', 'overall']
new_colnames = ['user', 'item', 'timestamp', 'rating']
df = df[keep_cols]
df.columns = new_colnames
df.head()

| | user | item | timestamp | rating |
|---|---|---|---|---|
| 0 | ADZPIG9QOCDG5 | 5019281 | 1203984000 | 4 |
| 1 | A35947ZP82G7JH | 5019281 | 1388361600 | 3 |
| 2 | A3UORV8A9D5L2E | 5019281 | 1388361600 | 3 |
| 3 | A1VKW06X1O2X7V | 5019281 | 1202860800 | 5 |
| 4 | A3R27T4HADWFFJ | 5019281 | 1387670400 | 4 |


In [4]:
df.rating.value_counts()

5    906608
4    382994
3    201302
1    104219
2    102410
Name: rating, dtype: int64

A lot of people seem to love the movies they watch. There are more 5s than 1s, 2s, 3s and 4s combined.
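
The comparison can be checked directly from the counts printed above. A minimal sketch, with the counts copied by hand into a hypothetical `counts` Series rather than recomputed from `df`:

```python
import pandas as pd

# Rating counts copied from the value_counts() output above
counts = pd.Series({5: 906608, 4: 382994, 3: 201302, 1: 104219, 2: 102410})

share = counts / counts.sum()       # fraction of reviews per rating
fives = counts[5]                   # number of 5-star reviews
others = counts.drop(5).sum()       # all 1- to 4-star reviews together

print(share.round(3).sort_index())
print(fives > others)
```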

For convenience, let's now sort the rows by `user` and, within each user, by `timestamp`. This ordering will be useful later in the process.

In [5]:
df.sort_values(['user','timestamp'], ascending=[True,True], inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()

| | user | item | timestamp | rating |
|---|---|---|---|---|
| 0 | A00295401U6S2UG3RAQSZ | 0767015533 | 1353196800 | 4 |
| 1 | A00295401U6S2UG3RAQSZ | 0792838084 | 1353196800 | 4 |
| 2 | A00295401U6S2UG3RAQSZ | 6304484054 | 1353196800 | 4 |
| 3 | A00295401U6S2UG3RAQSZ | 6305182205 | 1353196800 | 4 |
| 4 | A00295401U6S2UG3RAQSZ | B00004W22I | 1353196800 | 4 |


Let's map users and items to contiguous integers.

In [6]:
def map_user_items(df):

    dfc = df.copy()
    user_mappings = {k:v for v,k in enumerate(dfc.user.unique())}
    item_mappings = {k:v for v,k in enumerate(dfc.item.unique())}

    user_list = pd.DataFrame.from_dict(user_mappings, orient='index').reset_index()
    user_list.columns = ['orig_id', 'remap_id']
    item_list = pd.DataFrame.from_dict(item_mappings, orient='index').reset_index()
    item_list.columns = ['orig_id', 'remap_id']
    user_list.to_csv(DATA_PATH/'user_list.txt', sep=" ", index=False)
    item_list.to_csv(DATA_PATH/'item_list.txt', sep=" ", index=False)    

    dfc['user'] = dfc['user'].map(user_mappings).astype(np.int64)
    dfc['item'] = dfc['item'].map(item_mappings).astype(np.int64)
        
    return user_mappings, item_mappings, dfc

In [7]:
user_mappings, item_mappings, dfm = map_user_items(df)
dfm.head()

| | user | item | timestamp | rating |
|---|---|---|---|---|
| 0 | 0 | 0 | 1353196800 | 4 |
| 1 | 0 | 1 | 1353196800 | 4 |
| 2 | 0 | 2 | 1353196800 | 4 |
| 3 | 0 | 3 | 1353196800 | 4 |
| 4 | 0 | 4 | 1353196800 | 4 |
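
The same contiguous ids can also be obtained with `pd.factorize`, which numbers values in order of first appearance, just like the `enumerate(...unique())` dictionaries in `map_user_items`. A small sketch on a made-up frame (the `toy` data is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "user": ["u_b", "u_b", "u_a", "u_c", "u_a"],
    "item": ["i_9", "i_3", "i_9", "i_7", "i_3"],
})

# factorize returns (codes, uniques); codes follow order of first appearance
user_codes, user_index = pd.factorize(toy["user"])
item_codes, item_index = pd.factorize(toy["item"])

# Dictionary-based mapping, built the same way as in map_user_items
user_mappings = {k: v for v, k in enumerate(toy["user"].unique())}

remapped = toy.assign(user=user_codes, item=item_codes)
print(remapped)
```

Both approaches give identical ids; the dictionaries are still handy when the original-to-remapped lookup has to be saved to disk.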


###  Train/Test split

This split is designed to reproduce the data format used in the [Xiang Wang et al. 2019](https://arxiv.org/pdf/1905.08108.pdf) paper.

In [8]:
df1 = dfm[['user', 'item']]

In [9]:
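def f(df):
    # Collapse (user, item) rows into one row per user with a list of items
    keys, values = df.sort_values('user').values.T
    # index holds the position of the first row of each user in the sorted keys
    ukeys, index = np.unique(keys, return_index=True)
    # split the item column at those positions: one array per user
    arrays = np.split(values, index[1:])
    df2 = pd.DataFrame({'user':ukeys, 'item':[list(a) for a in arrays]})
    return df2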

In [10]:
interactions_df = f(df1)
interactions_df.head()

| | user | item |
|---|---|---|
| 0 | 0 | [0, 1, 2, 3, 4, 5] |
| 1 | 1 | [6, 7, 8, 9, 10] |
| 2 | 2 | [20, 19, 18, 17, 16, 11, 14, 13, 12, 15] |
| 3 | 3 | [25, 24, 21, 22, 23] |
| 4 | 4 | [34, 32, 31, 30, 33, 28, 27, 26, 29] |
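
For readers less familiar with the `np.unique`/`np.split` idiom used in `f`, the same user-to-item-list frame can be sketched with a plain `groupby`. This is typically slower on large data but easier to read; the `toy` frame below is made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({"user": [0, 0, 1, 1, 1, 2], "item": [0, 1, 2, 3, 4, 5]})

# One row per user, with that user's items collected into a list
interactions = (
    toy.groupby("user")["item"]
    .apply(list)
    .reset_index()
)
print(interactions)
```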


The split strategy we will follow is: 80% training, 20% testing. 

Then, 10% of the training set will be held out as validation to tune parameters. Once tuned, one would merge train and validation and re-train with the best-performing parameters.
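
Because both splits are applied per user and the sizes are rounded down, the effective proportions depend on how many items a user has. A quick sketch of the arithmetic (the helper `split_sizes` is hypothetical and not part of the notebook):

```python
import math

def split_sizes(n_items, p_train=0.8, p_valid=0.9):
    """Return (train, valid, test) sizes for a user with n_items items."""
    n_train_full = math.floor(n_items * p_train)   # 80% kept for training
    n_test = n_items - n_train_full                # remainder goes to test
    n_train = math.floor(n_train_full * p_valid)   # 90% of that stays in train
    n_valid = n_train_full - n_train               # the rest is validation
    return n_train, n_valid, n_test

for n in (5, 10, 100):
    print(n, split_sizes(n))
```

For a user with 5 reviews, the minimum in this 5-core dataset, this gives 3 training, 1 validation and 1 test item, so every user ends up with at least one item in each split.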

In [11]:
def train_test_split(u, i_l, p=0.8):
    s = np.floor(len(i_l)*p).astype('int')
    train = list(np.random.choice(i_l, s, replace=False))
    test  = list(np.setdiff1d(i_l, train))
    return ([u]+train, [u]+test)

In [12]:
interactions_l = [train_test_split(r['user'], r['item']) for i,r in interactions_df.iterrows()]

In [13]:
train = [interactions_l[i][0] for i in range(len(interactions_l))]
test =  [interactions_l[i][1] for i in range(len(interactions_l))]

In [14]:
print(train[0], test[0])

[0, 0, 3, 1, 5] [0, 2, 4]
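
The split above draws from NumPy's global random state, so the output changes between runs. If reproducibility matters, one option is to pass an explicit generator. A sketch under that assumption (`split_with_rng` is a hypothetical variant, not the function used in the rest of the notebook):

```python
import numpy as np

def split_with_rng(u, i_l, rng, p=0.8):
    """Same idea as train_test_split, but driven by an explicit Generator."""
    i_l = np.asarray(i_l)
    s = int(np.floor(len(i_l) * p))
    perm = rng.permutation(len(i_l))     # shuffled positions
    train = i_l[perm[:s]].tolist()       # first s shuffled items
    test = i_l[perm[s:]].tolist()        # everything not in train
    return [u] + train, [u] + test

rng = np.random.default_rng(42)          # fixed seed, arbitrary value
tr, te = split_with_rng(0, [0, 1, 2, 3, 4, 5], rng)
print(tr, te)
```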


Now let's take 10% of the training set (which is itself 80% of the data) as validation.

In [15]:
tr_interactions_l = [train_test_split(t[0], t[1:], p=0.9) for t in train]

In [16]:
train = [tr_interactions_l[i][0] for i in range(len(tr_interactions_l))]
valid = [tr_interactions_l[i][1] for i in range(len(tr_interactions_l))]

In [17]:
print(train[1], valid[1], test[1])

[1, 10, 6, 8] [1, 9] [1, 7]


In [18]:
print(min([len(t[1:]) for t in test]), min([len(v[1:]) for v in valid]))

1 1


In [20]:
train_fname = DATA_PATH/'train.txt'
valid_fname = DATA_PATH/'valid.txt'
test_fname = DATA_PATH/'test.txt'

with open(train_fname, 'w') as trf, open(valid_fname, 'w') as vaf, open(test_fname, 'w') as tef:
    trwrt = csv.writer(trf, delimiter=' ')
    vawrt = csv.writer(vaf, delimiter=' ')
    tewrt = csv.writer(tef, delimiter=' ')
    trwrt.writerows(train)
    vawrt.writerows(valid)
    tewrt.writerows(test)
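
Each line of the resulting files holds a user id followed by that user's item ids, separated by spaces. A minimal sketch of writing and reading back that layout, using an in-memory buffer and made-up rows instead of the real files:

```python
import csv
import io

rows = [[0, 0, 3, 1, 5], [1, 10, 6, 8]]   # [user, item, item, ...]

buffer = io.StringIO()
csv.writer(buffer, delimiter=" ").writerows(rows)
text = buffer.getvalue()
print(text)

# Parse the text back into {user: [items]}
user_items = {}
for line in text.splitlines():
    user, *items = (int(tok) for tok in line.split())
    user_items[user] = items
print(user_items)
```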

### Train/Test split approach 2

I will also be running the NGCF algorithm using the same train/test approach used for [neural collaborative filtering](https://arxiv.org/pdf/1708.05031.pdf) (you can read about it in the other [sub-directory](https://github.com/jrzaurin/RecoTour/tree/master/Amazon/neural_cf) of the Amazon directory). With that in mind, let me also show how to prepare the data in that scenario. Since I already described the process in detail there, I will go quickly here.

Remember, `df` is

In [19]:
df.head()

| | user | item | timestamp | rating |
|---|---|---|---|---|
| 0 | A00295401U6S2UG3RAQSZ | 0767015533 | 1353196800 | 4 |
| 1 | A00295401U6S2UG3RAQSZ | 0792838084 | 1353196800 | 4 |
| 2 | A00295401U6S2UG3RAQSZ | 6304484054 | 1353196800 | 4 |
| 3 | A00295401U6S2UG3RAQSZ | 6305182205 | 1353196800 | 4 |
| 4 | A00295401U6S2UG3RAQSZ | B00004W22I | 1353196800 | 4 |


In [20]:
from copy import copy
from time import time
from joblib import Parallel, delayed
dfc = copy(df)

In [21]:
# rank (temporal order) of items bought
dfc['rank'] = dfc.groupby("user")["timestamp"].rank(ascending=True, method='dense')
dfc.drop("timestamp", axis=1, inplace=True)

# mapping user and item ids to integers
user_mappings = {k:v for v,k in enumerate(dfc.user.unique())}
item_mappings = {k:v for v,k in enumerate(dfc.item.unique())}
dfc['user'] = dfc['user'].map(user_mappings)
dfc['item'] = dfc['item'].map(item_mappings)
dfc = dfc[['user','item','rank','rating']].astype(np.int64)

In [22]:
dfc.head()

| | user | item | rank | rating |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 4 |
| 1 | 0 | 1 | 1 | 4 |
| 2 | 0 | 2 | 1 | 4 |
| 3 | 0 | 3 | 1 | 4 |
| 4 | 0 | 4 | 1 | 4 |
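
With `method='dense'`, reviews that share a timestamp receive the same rank, and ranks increase by one between distinct timestamps. That is why all of user 0's first rows have rank 1. A tiny made-up example:

```python
import pandas as pd

toy = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "timestamp": [100, 100, 250, 300, 90],
})

# Dense rank within each user: ties share a rank, no gaps afterwards
toy["rank"] = toy.groupby("user")["timestamp"].rank(ascending=True, method="dense")
print(toy)
```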


In [23]:
# Cardinality
n_users = df.user.nunique()
n_items = df.item.nunique()

dfc.sort_values(['user','rank'], ascending=[True,True], inplace=True)
dfc.reset_index(inplace=True, drop=True)

# use last ratings for testing and all the previous for training
test = dfc.groupby('user').tail(1)
train = pd.merge(dfc, test, on=['user','item'],
    how='outer', suffixes=('', '_y'))
train = train[train.rating_y.isnull()]
test = test[['user','item','rating']]
train = train[['user','item','rating']]
print(train.shape, test.shape)

(1573573, 3) (123960, 3)
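
An alternative way to express the same leave-last-out split is to drop the test rows by index rather than through an outer merge. This is a sketch on made-up data, assuming the frame is already sorted by user and rank and has a unique index:

```python
import pandas as pd

toy = pd.DataFrame({
    "user":   [0, 0, 0, 1, 1],
    "item":   [10, 11, 12, 10, 13],
    "rank":   [1, 2, 3, 1, 2],
    "rating": [4, 5, 3, 2, 5],
})

test_toy = toy.groupby("user").tail(1)     # last interaction per user
train_toy = toy.drop(test_toy.index)       # everything else

print(train_toy[["user", "item", "rating"]])
print(test_toy[["user", "item", "rating"]])
```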


In [24]:
# select 99 random movies per user that were never rated by that user
all_items = dfc.item.unique()
rated_items = (dfc.groupby("user")['item']
    .apply(list)
    .reset_index()
    ).item.tolist()

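def sample_not_rated(item_list, rseed=1, n=99):
    # np.random.seed is a function: call it instead of overwriting it
    np.random.seed(rseed)
    # note: np.random.choice samples with replacement by default
    return np.random.choice(np.setdiff1d(all_items, item_list), n)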

print("sampling not rated items...")
start = time()
non_rated_items = Parallel(n_jobs=4)(delayed(sample_not_rated)(ri) for ri in rated_items)
end = time() - start
print("sampling took {} min".format(round(end/60,2)))

negative = pd.DataFrame({'negative':non_rated_items})
negative[['item_n'+str(i) for i in range(99)]] =\
    pd.DataFrame(negative.negative.values.tolist(), index= negative.index)
negative.drop('negative', axis=1, inplace=True)
negative = negative.stack().reset_index()
negative = negative.iloc[:, [0,2]]
negative.columns = ['user','item']
negative['rating'] = 0
assert negative.shape[0] == len(non_rated_items)*99
test_negative = (pd.concat([test,negative])
    .sort_values('user', ascending=True)
    .reset_index(drop=True)
    )
# Ensuring that the 1st element every 100 is the rated item. This is
# fundamental for testing
test_negative.sort_values(['user', 'rating'], ascending=[True,False], inplace=True)
assert np.all(test_negative.values[0::100][:,2] != 0)

sampling not rated items...
sampling took 1.0 min


Let's make sure we did a good job. Let's pick a random user and check that none of the negative items in the test set were ever rated by that user.

In [25]:
user_id = np.random.randint(0, n_users)  # upper bound is exclusive
items_rated = test_negative[(test_negative.user==user_id) & (test_negative.rating != 0)]['item'].tolist()
items_rated+= train[train.user==user_id]['item'].tolist()
items_never_rated = test_negative[(test_negative.user==user_id) & (test_negative.rating == 0)]['item'].tolist()
assert len(np.intersect1d(items_rated, items_never_rated)) == 0

Let's define a helper function to build a sparse matrix of interactions from an array of (user, item, rating) rows.

In [26]:
def array2mtx(interactions):
    num_users = interactions[:,0].max()
    num_items = interactions[:,1].max()
    mat = sp.dok_matrix((num_users+1, num_items+1), dtype=np.float32)
    for user, item, rating in interactions:
        mat[user, item] = rating
    return mat.tocsr()

In [27]:
print("saving training set as sparse matrix...")
train_mtx = array2mtx(train.values)

saving training set as sparse matrix...
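
Filling a `dok_matrix` entry by entry in a Python loop can be slow for millions of interactions. Here is a vectorized sketch that builds the CSR matrix directly from the three columns, assuming each (user, item) pair appears only once (duplicates would be summed rather than overwritten). The name `array2mtx_fast` is hypothetical:

```python
import numpy as np
import scipy.sparse as sp

def array2mtx_fast(interactions):
    """Build a users x items CSR matrix from (user, item, rating) rows."""
    users = interactions[:, 0]
    items = interactions[:, 1]
    ratings = interactions[:, 2].astype(np.float32)
    shape = (int(users.max()) + 1, int(items.max()) + 1)
    return sp.csr_matrix((ratings, (users, items)), shape=shape)

toy = np.array([[0, 1, 4], [1, 0, 5], [2, 2, 3]])
mat = array2mtx_fast(toy)
print(mat.toarray())
```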


And that's it. All the required objects can be saved to disk as shown below, and we are ready to experiment.

In [46]:
# # Save
# np.savez(DATA_PATH/"neuralcf_split.npz", train=train.values, test=test.values,
#     test_negative=test_negative.values, negatives=np.array(non_rated_items),
#     n_users=n_users, n_items=n_items)

# # Save training as sparse matrix
# print("saving training set as sparse matrix...")
# train_mtx = array2mtx(train.values)
# sp.save_npz(DATA_PATH/"neuralcf_train_sparse.npz", train_mtx)
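
For completeness, here is a sketch of how files saved this way could be read back. It round-trips small made-up arrays through a temporary directory, so the contents are hypothetical and only the file names mirror the ones above:

```python
import tempfile
from pathlib import Path

import numpy as np
import scipy.sparse as sp

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    train_toy = np.array([[0, 1, 4], [1, 0, 5]])
    np.savez(out / "neuralcf_split.npz", train=train_toy, n_users=2, n_items=2)
    sp.save_npz(out / "neuralcf_train_sparse.npz",
                sp.csr_matrix(np.eye(2, dtype=np.float32)))

    # np.load on an .npz file returns a lazy, dict-like object
    with np.load(out / "neuralcf_split.npz") as dataset:
        train_loaded = dataset["train"]
        n_users_loaded = dataset["n_users"].item()   # scalars come back as 0-d arrays
    train_mtx_loaded = sp.load_npz(out / "neuralcf_train_sparse.npz")

print(train_loaded, n_users_loaded, train_mtx_loaded.shape)
```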