## Loading the data for the model

In this notebook my intention is to go through the [original](https://github.com/xiangwang1223/neural_graph_collaborative_filtering) implementation, keeping only what I consider necessary to eventually understand the model. Some elements in the authors' original code relate to the different experiments carried out in the [paper](https://arxiv.org/pdf/1905.08108.pdf). These are not included here. As always, please go and read the original paper, and **all credit to the authors**.

As I mentioned in Chapter01, my intention is that the notebooks can run on any machine. The real goal of the notebooks is to understand the process rather than to run the real model. With that in mind, I have included a script called `generate_toy_data.py`, which generates a dataset with a small number of users and items in exactly the same format as the datasets used by the authors in the [original repo](https://github.com/xiangwang1223/neural_graph_collaborative_filtering).

To generate such a small dataset with random interactions (just to play with), simply run, for example:
    
    python generate_toy_data.py --n_users 1000 --n_items 2000 --min_interaction 11 --max_interactions 51

In [1]:
import numpy as np
import random as rd
import scipy.sparse as sp
import pdb

from pathlib import Path
from time import time
from tqdm import tqdm

from joblib import Parallel, delayed

In [2]:
path = Path("/home/ubuntu/projects/RecoTour/datasets/toy_data/")
batch_size = 32

train_file = path/'train.txt'
test_file = path/'test.txt'

Users and items are numbered from 0 to (n_users-1) and from 0 to (n_items-1) respectively, so let's count them:

In [3]:
# get number of users and items. 
n_users, n_items = 0, 0
n_train, n_test = 0, 0

exist_users = []
with open(train_file) as f:
    for l in f.readlines():
        if len(l) > 0:
            l = l.strip('\n').split(' ')
            # first element is the user_id, then items
            uid = int(l[0])
            items = [int(i) for i in l[1:]]
            exist_users.append(uid)
            n_items = max(n_items, max(items))
            n_users = max(n_users, uid)
            n_train += len(items)

# same as before but for testing
with open(test_file) as f:
    for l in f.readlines():
        if len(l) > 0:
            l = l.strip('\n')
            try:
                items = [int(i) for i in l.split(' ')[1:]]
            except Exception:
                continue
            n_items = max(n_items, max(items))
            n_test += len(items)
n_items += 1
n_users += 1

In [4]:
print(n_items, n_users)

2000 1000
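To make the file format concrete, here is a minimal sketch of the same counting logic applied to a few in-memory lines instead of the files. The sample lines and the helper name `count_users_items` are made up for illustration; only the `user_id item item ...` layout comes from the dataset.

```python
def count_users_items(lines):
    """Return (n_users, n_items, n_interactions) for lines shaped 'uid item item ...'."""
    max_uid, max_iid, n_inter = -1, -1, 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # blank line, or a user with no items: nothing to count
            continue
        uid, items = int(fields[0]), [int(i) for i in fields[1:]]
        max_uid = max(max_uid, uid)
        max_iid = max(max_iid, max(items))
        n_inter += len(items)
    # ids start at 0, so the counts are the largest id plus one
    return max_uid + 1, max_iid + 1, n_inter


# hypothetical toy lines: 3 users, items numbered 0 to 4
toy_lines = ["0 3 1 4\n", "1 2\n", "\n", "2 0 4\n"]
print(count_users_items(toy_lines))  # (3, 5, 6)
```

Unlike the cell above, this version skips blank lines explicitly, which is handy if a file ends with an empty line.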


All OK. Let's build the interactions/ratings matrix

In [9]:
# rating matrix for the training dataset
Rtr = sp.dok_matrix((n_users, n_items), dtype=np.float32)
# rating matrix for the testing dataset 
Rte = sp.dok_matrix((n_users, n_items), dtype=np.float32)

train_set, test_set = {}, {}
with open(train_file) as f_train, open(test_file) as f_test:
    for l in f_train.readlines():
        if len(l) == 0: break
        l = l.strip('\n')
        items = [int(i) for i in l.split(' ')]
        uid, train_items = items[0], items[1:]
        # simply 1 if user interacted with item, otherwise, 0.
        for i in train_items:
            Rtr[uid, i] = 1.
        train_set[uid] = train_items

    for l in f_test.readlines():
        if len(l) == 0: break
        l = l.strip('\n')
        try:
            items = [int(i) for i in l.split(' ')]
        except Exception:
            continue
        uid, test_items = items[0], items[1:]
        for i in test_items:
            Rte[uid, i] = 1.
        test_set[uid] = test_items

In [10]:
Rtr

<1000x2000 sparse matrix of type '<class 'numpy.float32'>'
	with 24228 stored elements in Dictionary Of Keys format>

In [11]:
Rte

<1000x2000 sparse matrix of type '<class 'numpy.float32'>'
	with 6552 stored elements in Dictionary Of Keys format>

In [12]:
print(train_set[0][:10])
print(test_set[0][:10])

[1365, 1073, 664, 292, 1248, 1897, 1370, 1625, 672, 729]
[194, 258, 386, 525, 674, 1265, 1347, 1683, 1763, 1930]
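Filling a `dok_matrix` one element at a time is easy to follow but slow on large datasets. As an alternative, here is a small sketch that builds the same kind of binary matrix in one go from a `{user: items}` dictionary. The helper `interactions_to_matrix` and the toy dictionary are my own assumptions, not part of the authors' code.

```python
import numpy as np
import scipy.sparse as sp


def interactions_to_matrix(user_items, n_users, n_items):
    """Build a binary user-item matrix from {uid: [item ids]}."""
    rows, cols = [], []
    for uid, items in user_items.items():
        unique_items = set(items)  # COO sums duplicate entries, so drop repeats first
        rows.extend([uid] * len(unique_items))
        cols.extend(unique_items)
    data = np.ones(len(rows), dtype=np.float32)
    return sp.coo_matrix((data, (rows, cols)), shape=(n_users, n_items)).tocsr()


# hypothetical toy data: user 2 has item 3 listed twice
toy_train = {0: [1, 3], 1: [0], 2: [2, 3, 3]}
R_toy = interactions_to_matrix(toy_train, n_users=3, n_items=4)
print(R_toy.toarray())
```

The result has one row per user and a 1 wherever that user interacted with the item, exactly like `Rtr` above.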


They use a number of different adjacency matrices, see [here](https://github.com/xiangwang1223/neural_graph_collaborative_filtering). 

Below is their implementation of a function to normalise the adjacency matrix. Here, each decay factor between two connected nodes is set as `(1/out-degree of the node)` (a few cells below there are more details):

In [13]:
def normalized_adj_single(adj):
    # rowsum = out-degree of the node    
    rowsum = np.array(adj.sum(1))
    # inverted and set to 0 if no connections
    d_inv = np.power(rowsum, -1).flatten()
    d_inv[np.isinf(d_inv)] = 0.
    # sparse diagonal matrix with the normalizing factors in the diagonal
    d_mat_inv = sp.diags(d_inv)
    # dot product resulting in a row-normalised version of the input matrix
    norm_adj = d_mat_inv.dot(adj)
    return norm_adj.tocoo()
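A tiny worked example may help to see what the row normalisation does, in particular for a node without connections. This is a standalone sketch on a made-up 3-node graph; it uses `np.divide` with a `where` mask so the infinity is never computed, which is a slightly different route to the same result as the function above.

```python
import numpy as np
import scipy.sparse as sp

# toy graph: node 0 links to nodes 1 and 2, node 1 links to node 0, node 2 is isolated
adj_toy = sp.csr_matrix(np.array([[0., 1., 1.],
                                  [1., 0., 0.],
                                  [0., 0., 0.]]))

degree = np.asarray(adj_toy.sum(axis=1)).flatten()
# invert only the non-zero degrees so isolated nodes keep a factor of 0
d_inv = np.divide(1.0, degree, out=np.zeros_like(degree), where=degree > 0)
norm_toy = sp.diags(d_inv).dot(adj_toy)

print(norm_toy.toarray())
```

Row 0 becomes `[0, 0.5, 0.5]` (two neighbours, each weighted 1/2), row 1 stays `[1, 0, 0]` and the isolated row 2 stays all zeros.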

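The function in the next cell is the authors' helper for checking the normalised adjacency matrix against a dense computation. Note that the function itself computes the row-normalised $\text{D}^{-1}\text{A}$, i.e. the dense counterpart of `normalized_adj_single`. For reference, expression 8 in their paper formulates the Laplacian matrix as:

$$
\mathcal{L} = \text{D}^{-\frac{1}{2}}\text{A}\text{D}^{-\frac{1}{2}}
$$

where D is the diagonal degree matrix and A is the adjacency matrix built from the user-item interaction matrix R: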

$$
A = \begin{bmatrix} 
0 & \text{R} \\
\text{R}^{\text{T}} & 0 
\end{bmatrix}
$$

In [14]:
def check_adj_if_equal(adj):
    dense_A = np.array(adj.todense())
    degree = np.sum(dense_A, axis=1, keepdims=False)
    temp = np.dot(np.diag(np.power(degree, -1)), dense_A)
    return temp
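For comparison, the symmetric normalisation $\text{D}^{-\frac{1}{2}}\text{A}\text{D}^{-\frac{1}{2}}$ can be sketched with sparse operations as below. The helper `symmetric_normalized_adj` is hypothetical (it is not in the authors' code, which row-normalises instead) and the 4-node graph is made up: two users followed by two items.

```python
import numpy as np
import scipy.sparse as sp


def symmetric_normalized_adj(adj):
    """Return D^{-1/2} A D^{-1/2} as a sparse matrix (hypothetical helper)."""
    degree = np.asarray(adj.sum(axis=1)).flatten()
    d_inv_sqrt = np.zeros_like(degree, dtype=np.float64)
    nonzero = degree > 0
    d_inv_sqrt[nonzero] = 1.0 / np.sqrt(degree[nonzero])
    d_mat = sp.diags(d_inv_sqrt)
    return d_mat.dot(adj).dot(d_mat).tocoo()


# nodes 0-1 are users, nodes 2-3 are items; user 0 saw both items, user 1 saw item 3
A = sp.csr_matrix(np.array([[0., 0., 1., 1.],
                            [0., 0., 0., 1.],
                            [1., 0., 0., 0.],
                            [1., 1., 0., 0.]]))
L_sym = symmetric_normalized_adj(A)
print(np.round(L_sym.toarray(), 3))
```

Each non-zero entry equals `1 / sqrt(degree_i * degree_j)`, so the result stays symmetric, whereas the row-normalised version in general does not.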

Let's build A, the adjacency matrix

In [15]:
adj_mat = sp.dok_matrix((n_users + n_items, n_users + n_items), dtype=np.float32)
adj_mat = adj_mat.tolil()

# A:
s = time()
adj_mat[:n_users, n_users:] = Rtr.tolil()
adj_mat[n_users:, :n_users] = Rtr.tolil().T
print(time()-s)
adj_mat = adj_mat.todok()

0.17641139030456543


In [16]:
adj_mat

<3000x3000 sparse matrix of type '<class 'numpy.float32'>'
	with 48456 stored elements in Dictionary Of Keys format>
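As a side note, the same block structure can also be assembled with `scipy.sparse.bmat`, which avoids the slice assignments. This is just a sketch on a made-up 2x3 interaction matrix, not the way the authors build it.

```python
import numpy as np
import scipy.sparse as sp

# toy interaction matrix: 2 users x 3 items
R_toy = sp.csr_matrix(np.array([[1., 0., 1.],
                                [0., 1., 0.]], dtype=np.float32))

# None stands for an all-zero block whose shape is inferred from its neighbours
A_toy = sp.bmat([[None, R_toy], [R_toy.T, None]], format="csr")

print(A_toy.shape)  # (5, 5)
print(A_toy.nnz)    # 6
```

The number of stored elements is twice that of the interaction matrix, which matches what we see above: 48456 is twice the 24228 elements of `Rtr`.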

Along with the "normal" adjacency matrix, we generate two additional ones:

`norm_adj_mat`: each decay factor between two connected nodes is set as `1/(out-degree of the node + self-connection)`

`mean_adj_mat`: each decay factor between two connected nodes is set as `1/(out-degree of the node)`

Eventually a fourth one will also be used (in fact it is the default one):

`ngcf_adj_mat`: each decay factor between two connected nodes is set as `1/(out-degree of the node)` and each node is also assigned 1 for self-connections. That is: `mean_adj_mat + sp.eye(mean_adj_mat.shape[0])`

In [17]:
norm_adj_mat = normalized_adj_single(adj_mat + sp.eye(adj_mat.shape[0]))
mean_adj_mat = normalized_adj_single(adj_mat)
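To see how the variants differ, here is a small sketch on a made-up graph with one user and two items. The helper `row_normalize` is my own stand-in for the row normalisation and the variable names are shortened on purpose so they do not clash with the notebook's.

```python
import numpy as np
import scipy.sparse as sp


def row_normalize(adj):
    """Divide every row by its sum; rows that sum to 0 stay at 0."""
    rowsum = np.asarray(adj.sum(axis=1)).flatten()
    inv = np.divide(1.0, rowsum, out=np.zeros_like(rowsum), where=rowsum > 0)
    return sp.diags(inv).dot(adj).tocsr()


# node 0 is a user connected to items 1 and 2
A = sp.csr_matrix(np.array([[0., 1., 1.],
                            [1., 0., 0.],
                            [1., 0., 0.]]))
eye = sp.eye(A.shape[0], format="csr")

norm_adj = row_normalize(A + eye)  # self-connection counted in the degree
mean_adj = row_normalize(A)        # plain 1/degree
ngcf_adj = mean_adj + eye          # 1/degree plus a 1 on the diagonal

print(norm_adj.toarray()[0])
print(mean_adj.toarray()[0])
print(ngcf_adj.toarray()[0])
```

For the user row, the first variant spreads the weight as 1/3 over itself and its two items, the second gives 1/2 to each item and 0 to itself, and the third keeps the 1/2 weights while adding a full 1 on the diagonal.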

Let's have a look at the first row and search for non-zero elements

In [18]:
uid0_nonzero = np.where(adj_mat[0].todense())

In [19]:
uid0_nonzero

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([1056, 1108, 1169, 1177, 1270, 1282, 1292, 1367, 1421, 1486, 1579,
        1589, 1631, 1634, 1664, 1672, 1724, 1729, 1745, 2065, 2073, 2117,
        2248, 2365, 2370, 2407, 2441, 2562, 2622, 2625, 2672, 2713, 2751,
        2758, 2803, 2897, 2913, 2943]))

Let's check the training data for the 1st user (id=0)

In [20]:
print(sorted(train_set[0]))

[56, 108, 169, 177, 270, 282, 292, 367, 421, 486, 579, 589, 631, 634, 664, 672, 724, 729, 745, 1065, 1073, 1117, 1248, 1365, 1370, 1407, 1441, 1562, 1622, 1625, 1672, 1713, 1751, 1758, 1803, 1897, 1913, 1943]


In [21]:
len(uid0_nonzero[0]) == len(train_set[0])

True

We see that the first non-zero element sits at column `n_users + min(train_set[0])`, because item columns are offset by `n_users` in the adjacency matrix. Note that if we included self-connections, the elements in the diagonal would also be non-zero. 

Let's now create "negative pools", simply collections of N items that users never interacted with

In [22]:
neg_pools = {}
for u in train_set.keys():
    neg_items = list(set(range(n_items)) - set(train_set[u]))
    pools = np.random.choice(neg_items, 10)
    neg_pools[u] = pools

In [23]:
neg_pools[0]

array([1733,  537, 1806,  655, 1140, 1846, 1995,  701, 1082,  677])
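Note that `np.random.choice` samples with replacement by default, so a pool built as above can contain the same item more than once. If distinct items are preferred, a sketch along these lines could be used. The helper `build_neg_pool` and the toy data are assumptions of mine.

```python
import numpy as np


def build_neg_pool(train_items, n_items, pool_size, rng):
    """Draw pool_size distinct items the user never interacted with (hypothetical helper)."""
    candidates = np.setdiff1d(np.arange(n_items), train_items)
    return rng.choice(candidates, size=pool_size, replace=False)


rng = np.random.default_rng(0)
# hypothetical toy data: 2 users, 10 items in total
toy_train = {0: [1, 3, 5], 1: [0, 2]}
toy_pools = {u: build_neg_pool(items, n_items=10, pool_size=4, rng=rng)
             for u, items in toy_train.items()}
print(toy_pools)
```

With `replace=False` the pool size cannot exceed the number of unseen items, otherwise NumPy raises a `ValueError`.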

The following functions sample positive and negative (never seen or interacted with) items either directly from the dataset, or from the previously generated "negative pools"

In [24]:
def sample_pos_items_for_u(u, num):
    pos_items = train_set[u]
    n_pos_items = len(pos_items)
    pos_batch = []
    while True:
        # Once we have sampled num positive items, stop
        if len(pos_batch) == num: break
        pos_id = np.random.randint(low=0, high=n_pos_items, size=1)[0]
        pos_i_id = pos_items[pos_id]
        if pos_i_id not in pos_batch: pos_batch.append(pos_i_id)
    return pos_batch

In [25]:
def sample_neg_items_for_u(u, num):
    neg_items = []
    while True:
        # Once we have sampled num negative items, stop
        if len(neg_items) == num: break
        neg_id = np.random.randint(low=0, high=n_items,size=1)[0]
        if neg_id not in train_set[u] and neg_id not in neg_items:
            neg_items.append(neg_id)
    return neg_items
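One thing to keep in mind with the two `while True` loops above: if `num` is larger than the number of available items, they never finish. Below is a hedged sketch of the same idea with an explicit guard. The functions `sample_pos` and `sample_neg` are hypothetical variants that take the user's items as an argument instead of reading the global `train_set`.

```python
import random as rd


def sample_pos(train_items, num):
    """Pick num distinct positives; fails fast instead of looping forever."""
    unique_items = list(set(train_items))
    if num > len(unique_items):
        raise ValueError(f"asked for {num} positives, user has only {len(unique_items)}")
    return rd.sample(unique_items, num)


def sample_neg(train_items, n_items, num):
    """Pick num distinct items in [0, n_items) that are not in train_items."""
    seen = set(train_items)  # set membership is O(1), a list scan is O(len)
    if num > n_items - len(seen):
        raise ValueError("not enough unseen items to sample from")
    negatives = set()
    while len(negatives) < num:
        candidate = rd.randrange(n_items)
        if candidate not in seen:
            negatives.add(candidate)
    return list(negatives)


# hypothetical toy user who interacted with 3 out of 10 items
toy_items = [2, 5, 7]
print(sample_pos(toy_items, 2))
print(sample_neg(toy_items, n_items=10, num=3))
```

The rejection loop for negatives is the same strategy as in the original function, which works well when users have seen only a small fraction of the catalogue.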

In [26]:
def sample_neg_items_for_u_from_pools(u, num):
    # this line must be a bug because no train_items[u] will ever be in neg_pools[u], 
    # neg_items = list(set(range(n_items)) - set(train_set[u]))
    # pools = np.random.choice(neg_items, 100)
    # neg_pools[u] = pools
    neg_items = list(set(neg_pools[u]) - set(train_set[u]))
    return rd.sample(neg_items, num)

# To me this should be
def sample_neg_items_for_u_from_pools(u, num):
    # neg_pools[u] is a numpy array and rd.sample needs a sequence, hence the list()
    return rd.sample(list(neg_pools[u]), num)

Let's have a look

In [27]:
users, pos_items, neg_items = [], [], []
for u in np.random.choice(exist_users, 5):
    users.append(u)
    pos_items += sample_pos_items_for_u(u, 1)
    neg_items += sample_neg_items_for_u(u, 1)

In [28]:
print(users)
print(pos_items)
print(neg_items)

[104, 290, 941, 149, 403]
[1284, 1843, 1741, 880, 1965]
[1052, 1695, 574, 1246, 1983]


And let's see if items `pos_items[2]` and `neg_items[2]` are positive and negative, respectively, for user `users[2]` (for example)

In [29]:
print(pos_items[2] in train_set[users[2]])
print(neg_items[2] not in train_set[users[2]])

True
True


And that is about it for us, because the functions below will not be used in this repo. 

These functions correspond to their study of the effect of sparsity. Have a look at their section 4.3.2 Performance Comparison w.r.t. Interaction Sparsity Levels: "*.... In particular, based on interaction number per user, we divide the test set into four groups, each of which has the same total interactions...*"

Nonetheless, here is the code and an explanation

In [31]:
def create_sparsity_split():
    all_users_to_test = list(test_set.keys())
    user_n_iid = dict()

    # generate a dictionary to store (key=n_iids, value=a list of uid).
    for uid in all_users_to_test:
        # train and test items for user_id
        train_iids = train_set[uid]
        test_iids = test_set[uid]

        # number of "interactions"
        n_iids = len(train_iids) + len(test_iids)

        if n_iids not in user_n_iid.keys():
            # dictionary where the keys are the number of interactions 
            # and the values are the users that have that number of interactions
            user_n_iid[n_iids] = [uid]
        else:
            user_n_iid[n_iids].append(uid)
    split_uids = list()

    # split the whole user set into four subsets.
    temp = []
    count = 1
    fold = 4
    # total number of interactions in the dataset
    n_count = (n_train + n_test) 
    n_rates = 0

    split_state = []
    for idx, n_iids in enumerate(sorted(user_n_iid)):
        temp += user_n_iid[n_iids]
        # n_rates -> number of ratings
        # n_iids  -> key corresponding to a certain number of interactions (e.g. 10 ratings)
        # len(user_n_iid[n_iids]) -> number of users that interacted with 10 items
        n_rates += n_iids * len(user_n_iid[n_iids])
        n_count -= n_iids * len(user_n_iid[n_iids])
        # when number of rates/interaction has reached 25% of the total number of interactions, 
        # append the corresponding users to split_uids (remember we loop over sorted(user_n_iid))
        if n_rates >= count * 0.25 * (n_train + n_test):
            split_uids.append(temp)

            state = '#inter per user<=[%d], #users=[%d], #all rates=[%d]' %(n_iids, len(temp), n_rates)
            split_state.append(state)
            print(state)

            temp = []
            n_rates = 0
            fold -= 1 # don't think we need this if we manually state 0.25
        
        if idx == len(user_n_iid.keys()) - 1 or n_count == 0:
            split_uids.append(temp)

            state = '#inter per user<=[%d], #users=[%d], #all rates=[%d]' % (n_iids, len(temp), n_rates)
            split_state.append(state)
            print(state)
    return split_uids, split_state
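The idea behind the function above is easier to see on a toy example. The following is a simplified sketch, not the authors' function: it takes a made-up `{user: number of interactions}` dictionary, walks through the users from least to most active, and closes a group every time the running total crosses another quarter of all interactions. The name `split_by_activity` is hypothetical.

```python
def split_by_activity(user_counts, n_groups=4):
    """Group users, least active first, so each group holds about 1/n_groups of all interactions."""
    total = sum(user_counts.values())
    groups, current, running = [], [], 0
    for uid, count in sorted(user_counts.items(), key=lambda kv: kv[1]):
        current.append(uid)
        running += count
        # close the group once the cumulative total crosses the next boundary
        if running >= (len(groups) + 1) * total / n_groups and len(groups) < n_groups - 1:
            groups.append(current)
            current = []
    if current:
        groups.append(current)  # whatever is left forms the last group
    return groups


# hypothetical toy data: 7 users, 24 interactions in total
toy_counts = {0: 2, 1: 2, 2: 2, 3: 3, 4: 3, 5: 6, 6: 6}
print(split_by_activity(toy_counts))  # [[0, 1, 2], [3, 4], [5], [6]]
```

Each of the four groups adds up to 6 interactions, but the sparse group needs three users to get there while the most active groups need only one.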

In [32]:
def get_sparsity_split():
    # here, once the previous function is understood, there is not much to explain
    # `path` is a pathlib.Path, so the file name is joined with `/` rather than `+`
    split_file = path/'sparsity.split'
    try:
        split_uids, split_state = [], []
        with open(split_file, 'r') as f:
            lines = f.readlines()

        for idx, line in enumerate(lines):
            if idx % 2 == 0:
                split_state.append(line.strip())
                print(line.strip())
            else:
                split_uids.append([int(uid) for uid in line.strip().split(' ')])
        print('get sparsity split.')

    except Exception:
        # no usable split file yet: compute the split and cache it on disk
        split_uids, split_state = create_sparsity_split()
        with open(split_file, 'w') as f:
            for idx in range(len(split_state)):
                f.write(split_state[idx] + '\n')
                f.write(' '.join([str(uid) for uid in split_uids[idx]]) + '\n')
        print('create sparsity split.')

    return split_uids, split_state