## Loading the data for the model

In this notebook I go through the [original](https://github.com/xiangwang1223/neural_graph_collaborative_filtering) implementation, keeping only what I consider necessary to eventually understand the model. Some elements of the authors' original code relate to the different experiments carried out in the [paper](https://arxiv.org/pdf/1905.08108.pdf); I will not include those here.

The code below is my adaptation. Where it differs significantly from the original, I will include the details. Below I use the toy dataset generated with `generate_toy_data.py`, which includes only a few users and items (with random interactions). To generate such a toy dataset (just to play with), simply run, for example:
    
    python generate_toy_data.py --n_users 100 --n_items 200 --min_interaction 11 --max_interactions 51

In [14]:
import numpy as np
import random as rd
import scipy.sparse as sp
import pdb

from pathlib import Path
from time import time
from tqdm import tqdm

from joblib import Parallel, delayed

In [15]:
path = Path("/Users/javier/ml_exercises_python/RecoTour/Amazon/neural_graph_cf/Data/gowalla/")
batch_size = 16

train_file = path/'train.txt'
test_file = path/'test.txt'

Users and items are numbered from 0 to (n_users-1) and (n_items-1) respectively, so let's count them:

In [16]:
# get number of users and items. 
n_users, n_items = 0, 0
n_train, n_test = 0, 0

exist_users = []
with open(train_file) as f:
    for l in f.readlines():
        if len(l) > 0:
            l = l.strip('\n').split(' ')
            # first element is the user_id, then items
            uid = int(l[0])
            items = [int(i) for i in l[1:]]
            exist_users.append(uid)
            n_items = max(n_items, max(items))
            n_users = max(n_users, uid)
            n_train += len(items)

# same as before but for testing
with open(test_file) as f:
    for l in f.readlines():
        if len(l) > 0:
            l = l.strip('\n')
            try:
                items = [int(i) for i in l.split(' ')[1:]]
            except Exception:
                continue
            n_items = max(n_items, max(items))
            n_test += len(items)
n_items += 1
n_users += 1

In [17]:
print(n_items, n_users)

40981 29858


All OK. Let's build the interactions/ratings matrix

In [18]:
R = sp.dok_matrix((n_users, n_items), dtype=np.float32)
train_set, test_set = {}, {}
with open(train_file) as f_train, open(test_file) as f_test:
    for l in f_train.readlines():
        if len(l) == 0: break
        l = l.strip('\n')
        items = [int(i) for i in l.split(' ')]
        uid, train_items = items[0], items[1:]
        # simply 1 if user interacted with item, otherwise, 0.
        for i in train_items:
            R[uid, i] = 1.
        train_set[uid] = train_items

    for l in f_test.readlines():
        if len(l) == 0: break
        l = l.strip('\n')
        try:
            items = [int(i) for i in l.split(' ')]
        except Exception:
            continue
        uid, test_items = items[0], items[1:]
        test_set[uid] = test_items

In [19]:
R

<29858x40981 sparse matrix of type '<class 'numpy.float32'>'
	with 810128 stored elements in Dictionary Of Keys format>

In [None]:
print(train_set[0][:10])
print(test_set[0][:10])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[7580, 3730, 5983, 5990, 7608, 1213, 6017, 7510, 7513, 8343]
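As an aside, the same kind of matrix can be assembled in one go from coordinate lists instead of assigning entries one by one. This is only a minimal sketch: `toy_train` is a hypothetical in-memory stand-in for `train.txt`, and `build_interaction_matrix` is my own name, not something from the original code.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical toy data standing in for train.txt: user id -> item ids
toy_train = {0: [0, 2], 1: [1], 2: [0, 1, 3]}

def build_interaction_matrix(train, n_users, n_items):
    # One (row, col) pair per observed interaction
    rows = [u for u, items in train.items() for _ in items]
    cols = [i for items in train.values() for i in items]
    vals = np.ones(len(rows), dtype=np.float32)
    return sp.coo_matrix((vals, (rows, cols)), shape=(n_users, n_items)).tocsr()

R_toy = build_interaction_matrix(toy_train, n_users=3, n_items=4)
print(R_toy.toarray())
```

Note that converting from COO sums duplicated (user, item) pairs, so this assumes each interaction appears only once per user.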


They use a number of different adjacency matrices, see [here](https://github.com/xiangwang1223/neural_graph_collaborative_filtering): 

In [None]:
def normalized_adj_single(adj):
    # rowsum = out-degree of the node    
    rowsum = np.array(adj.sum(1))
    # inverted and set to 0 if no connections
    d_inv = np.power(rowsum, -1).flatten()
    d_inv[np.isinf(d_inv)] = 0.
    # sparse diagonal matrix with the normalizing factors in the diagonal
    d_mat_inv = sp.diags(d_inv)
    # dot product resulting in a row-normalised version of the input matrix
    norm_adj = d_mat_inv.dot(adj)
    return norm_adj.tocoo()
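To see what the row normalisation does, here is a small worked example on a 3-node graph. It is a sketch under my own assumptions: `row_normalize` follows the same steps as `normalized_adj_single` but handles isolated nodes explicitly (which avoids the divide-by-zero warning), and `toy_adj` is made-up data.

```python
import numpy as np
import scipy.sparse as sp

def row_normalize(adj):
    # Row sums = out-degree of each node
    rowsum = np.asarray(adj.sum(axis=1)).flatten()
    # Invert only the non-zero degrees; isolated nodes keep a factor of 0
    d_inv = np.zeros_like(rowsum, dtype=np.float64)
    nonzero = rowsum > 0
    d_inv[nonzero] = 1.0 / rowsum[nonzero]
    # D^{-1} A: every non-empty row now sums to 1
    return sp.diags(d_inv).dot(adj).tocoo()

# Node 0 points to nodes 1 and 2, node 1 points to node 0, node 2 is isolated
toy_adj = sp.csr_matrix(np.array([[0., 1., 1.],
                                  [1., 0., 0.],
                                  [0., 0., 0.]]))
toy_norm = row_normalize(toy_adj).toarray()
print(toy_norm)
```

Each edge leaving node 0 gets weight 1/2, the single edge leaving node 1 gets weight 1, and the isolated row stays all zeros.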

The following function is used to check expression 8 in their paper, where the Laplacian matrix is formulated as:

$$
\mathcal{L} = \text{D}^{-\frac{1}{2}}\text{A}\text{D}^{-\frac{1}{2}}
$$

where D is the diagonal degree matrix and A is the adjacency matrix built from the interaction matrix R:

$$
A = \begin{bmatrix} 
0 & \text{R} \\
\text{R}^{\text{T}} & 0 
\end{bmatrix}
$$

In [None]:
def check_adj_if_equal(adj):
    dense_A = np.array(adj.todense())
    degree = np.sum(dense_A, axis=1, keepdims=False)
    temp = np.dot(np.diag(np.power(degree, -1)), dense_A)
    return temp
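Note that `check_adj_if_equal` computes the row-normalised matrix $\text{D}^{-1}\text{A}$, which is not the same as the symmetric normalisation $\text{D}^{-\frac{1}{2}}\text{A}\text{D}^{-\frac{1}{2}}$ of expression 8. The sketch below compares the two on a hypothetical 2-user, 2-item graph (all names ending in `_small` are mine and the data is made up).

```python
import numpy as np

# Hypothetical interaction matrix: 2 users x 2 items
R_small = np.array([[1., 1.],
                    [0., 1.]])
zeros = np.zeros((2, 2))
# A = [[0, R], [R^T, 0]]
A_small = np.block([[zeros, R_small],
                    [R_small.T, zeros]])

degree = A_small.sum(axis=1)  # every node has at least one edge here

# Symmetric normalisation: D^{-1/2} A D^{-1/2}
d_inv_sqrt = np.diag(degree ** -0.5)
L_sym = d_inv_sqrt @ A_small @ d_inv_sqrt

# Row normalisation, as in check_adj_if_equal: D^{-1} A
L_row = np.diag(degree ** -1.0) @ A_small

print(np.allclose(L_sym, L_sym.T))   # the symmetric form stays symmetric
print(L_row.sum(axis=1))             # the row-normalised form has unit row sums
print(L_sym[0, 2], L_row[0, 2])      # same edge, different weights
```

For the edge between user 0 (degree 2) and item 0 (degree 1) the symmetric form gives 1/sqrt(2 * 1), while the row-normalised form gives 1/2.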

Let's build A, the adjacency matrix

In [None]:
adj_mat = sp.dok_matrix((n_users + n_items, n_users + n_items), dtype=np.float32)
adj_mat = adj_mat.tolil()

# A:
s = time()
adj_mat[:n_users, n_users:] = R.tolil()
adj_mat[n_users:, :n_users] = R.tolil().T
print(time()-s)
adj_mat = adj_mat.todok()

In [54]:
adj_mat

<30000x30000 sparse matrix of type '<class 'numpy.float32'>'
	with 485112 stored elements in Dictionary Of Keys format>
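The block structure of A can also be assembled directly with `scipy.sparse.bmat`, which I find easier to read than slicing into a LIL matrix. This is a sketch on a hypothetical 2-user, 3-item matrix, not what the original code does.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical interaction matrix: 2 users x 3 items
R_toy = sp.csr_matrix(np.array([[1., 0., 1.],
                                [0., 1., 0.]], dtype=np.float32))
n_u, n_i = R_toy.shape

# None entries become all-zero blocks whose shape is inferred from R and R^T
A_toy = sp.bmat([[None, R_toy],
                 [R_toy.T, None]], format='csr')

print(A_toy.shape)
print(A_toy.toarray())
```

The result is a symmetric `(n_u + n_i) x (n_u + n_i)` matrix with R in the top-right block and its transpose in the bottom-left block.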

Along with the "normal" adjacency matrix, we generate two additional ones:

`norm_adj_mat`: each decay factor between two connected nodes is set as `1/(out degree of the node + self-connection)`

`mean_adj_mat`: each decay factor between two connected nodes is set as `1/(out degree of the node)`

Eventually a fourth one will also be used (in fact, it is the default), which is:

`ngcf_adj_mat`: each decay factor between two connected nodes is set as `1/(out degree of the node)` and each node is also assigned 1 for self-connections. That is: `mean_adj_mat + sp.eye(mean_adj_mat.shape[0])`

In [22]:
norm_adj_mat = normalized_adj_single(adj_mat + sp.eye(adj_mat.shape[0]))
mean_adj_mat = normalized_adj_single(adj_mat)
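To make the difference between the variants concrete, here is a dense toy version on a hypothetical 2-user, 2-item graph. I am assuming, following the description above, that the default variant is the mean matrix plus the identity; `row_normalize_dense` and the `_toy` names are mine.

```python
import numpy as np

# Hypothetical interaction matrix: 2 users x 2 items
R_toy = np.array([[1., 1.],
                  [0., 1.]])
n_u, n_i = R_toy.shape
A_toy = np.block([[np.zeros((n_u, n_u)), R_toy],
                  [R_toy.T, np.zeros((n_i, n_i))]])
eye = np.eye(n_u + n_i)

def row_normalize_dense(m):
    # Assumes every row has at least one non-zero entry (true for this toy graph)
    return m / m.sum(axis=1, keepdims=True)

norm_adj_toy = row_normalize_dense(A_toy + eye)  # 1 / (degree + 1), self-loop included
mean_adj_toy = row_normalize_dense(A_toy)        # 1 / degree, no self-loop
ngcf_adj_toy = mean_adj_toy + eye                # 1 / degree, plus 1 on the diagonal

# Row of user 0, who interacted with both items (degree 2)
print(norm_adj_toy[0])
print(mean_adj_toy[0])
print(ngcf_adj_toy[0])
```

For user 0 the three rows are `[1/3, 0, 1/3, 1/3]`, `[0, 0, 1/2, 1/2]` and `[1, 0, 1/2, 1/2]` respectively.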

Let's have a look at the first row and search for non-zero elements

In [23]:
uid0_nonzero = np.where(adj_mat[0].todense())

In [24]:
uid0_nonzero

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([104, 111, 114, 118, 119, 129, 131, 134, 138, 140, 151, 156, 158,
        167, 178, 184, 189, 193, 194, 197, 198, 208, 210, 212, 222, 232,
        245, 259, 263, 265, 266, 269, 272, 275, 281, 288, 290, 299]))

Let's check the training data for the 1st user (id=0)

In [28]:
print(sorted(train_set[0]))

[4, 11, 14, 18, 19, 29, 31, 34, 38, 40, 51, 56, 58, 67, 78, 84, 89, 93, 94, 97, 98, 108, 110, 112, 122, 132, 145, 159, 163, 165, 166, 169, 172, 175, 181, 188, 190, 199]


In [26]:
len(uid0_nonzero[0]) == len(train_set[0])

True

We see that the first non-zero element is at column 104, which equals the number of users (100) plus the first item_id that user 0 interacted with (4). Note that if we included self-connections, the diagonal elements would also be non-zero.
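The index shift can be written down in one line. A tiny sketch, assuming the toy dataset has 100 users (as in the example command at the top) and using the first few training items of user 0 from the output above:

```python
import numpy as np

n_users_toy = 100                  # assumed number of users in the toy dataset
train_items_u0 = [4, 11, 14, 18]   # first few training items of user 0

# Item j sits at column n_users + j of the adjacency matrix
adj_cols = np.array(train_items_u0) + n_users_toy
print(adj_cols)
```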

Let's now create "negative pools": simply collections of N items (10 here) that each user never interacted with

In [33]:
neg_pools = {}
for u in train_set.keys():
    neg_items = list(set(range(n_items)) - set(train_set[u]))
    pools = np.random.choice(neg_items, 10)
    neg_pools[u] = pools

In [34]:
neg_pools[0]

array([107, 152,  59, 154, 149, 102,  33,  99, 195, 100])
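One caveat: `np.random.choice` samples with replacement by default, so a pool built this way can contain repeated items. If distinct items are wanted, a possible variant is sketched below. `build_neg_pools`, `toy_train` and `n_items_toy` are hypothetical names and data of mine, and the seed is there only to make the sketch reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only for reproducibility

n_items_toy = 20
toy_train = {0: [1, 3, 5], 1: [0, 2]}  # hypothetical user -> seen items

def build_neg_pools(train, n_items, pool_size):
    all_items = np.arange(n_items)
    pools = {}
    for u, seen in train.items():
        # Items the user never interacted with
        candidates = np.setdiff1d(all_items, seen)
        # replace=False guarantees pool_size distinct items
        pools[u] = rng.choice(candidates, size=pool_size, replace=False)
    return pools

toy_pools = build_neg_pools(toy_train, n_items_toy, pool_size=10)
print(toy_pools[0])
```

This requires `pool_size` to be no larger than the number of unseen items for every user.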

The following functions sample positive and negative (never seen or interacted with) items either directly from the dataset, or from the previously generated "negative pools"

In [38]:
def sample_pos_items_for_u(u, num):
    pos_items = train_set[u]
    n_pos_items = len(pos_items)
    pos_batch = []
    while True:
        # Once we have sampled num positive items, stop
        if len(pos_batch) == num: break
        pos_id = np.random.randint(low=0, high=n_pos_items, size=1)[0]
        pos_i_id = pos_items[pos_id]
        if pos_i_id not in pos_batch: pos_batch.append(pos_i_id)
    return pos_batch

In [39]:
def sample_neg_items_for_u(u, num):
    neg_items = []
    while True:
        # Once we have sampled num negative items, stop
        if len(neg_items) == num: break
        neg_id = np.random.randint(low=0, high=n_items,size=1)[0]
        if neg_id not in train_set[u] and neg_id not in neg_items:
            neg_items.append(neg_id)
    return neg_items

In [40]:
def sample_neg_items_for_u_from_pools(u, num):
    # the set difference below must be a bug (or at least redundant): neg_pools[u] is built
    # as shown in the next three lines, so no item of train_set[u] will ever be in it
    # neg_items = list(set(range(n_items)) - set(train_set[u]))
    # pools = np.random.choice(neg_items, 100)
    # neg_pools[u] = pools
    neg_items = list(set(neg_pools[u]) - set(train_set[u]))
    return rd.sample(neg_items, num)

# To me this should be
def sample_neg_items_for_u_from_pools(u, num):
    # rd.sample needs a sequence, so convert the numpy array to a list first
    return rd.sample(list(neg_pools[u]), num)

Let's have a look

In [47]:
users, pos_items, neg_items = [], [], []
for u in np.random.choice(exist_users, 5):
    users.append(u)
    pos_items += sample_pos_items_for_u(u, 1)
    neg_items += sample_neg_items_for_u(u, 1)

In [49]:
print(users)
print(pos_items)
print(neg_items)

[39, 42, 62, 51, 94]
[23, 162, 50, 188, 125]
[170, 184, 95, 30, 137]

And let's check that items 50 and 95 are positive and negative, respectively, for user 62 (for example)

In [53]:
print(50 in train_set[62])
print(95 not in train_set[62])

True
True
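The sampling steps above can be wrapped into a single helper that returns (user, positive item, negative item) triples for a batch. This is a minimal sketch with one positive and one negative per user; `sample_batch` and the toy data are my own, and the negatives are drawn by simple rejection sampling.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only for reproducibility

n_items_toy = 10
toy_train = {0: [1, 3, 5], 1: [0, 2], 2: [7, 8, 9]}  # hypothetical data

def sample_batch(train, n_items, batch_size):
    # Users are drawn with replacement, as np.random.choice does by default
    users = rng.choice(list(train.keys()), size=batch_size).tolist()
    pos_items, neg_items = [], []
    for u in users:
        seen = set(train[u])
        # One positive: any item the user interacted with
        pos_items.append(int(rng.choice(train[u])))
        # One negative: redraw until the item is unseen
        neg = int(rng.integers(n_items))
        while neg in seen:
            neg = int(rng.integers(n_items))
        neg_items.append(neg)
    return users, pos_items, neg_items

users_b, pos_b, neg_b = sample_batch(toy_train, n_items_toy, batch_size=4)
print(users_b, pos_b, neg_b)
```

Rejection sampling is cheap here because each user has seen only a small fraction of the items; it would slow down for users who have interacted with most of the catalogue.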


And that is about it for us, because the functions below will not be used in this repo. 

These functions correspond to their study of the effect of sparsity. Have a look at their Section 4.3.2, Performance Comparison w.r.t. Interaction Sparsity Levels: *".... In particular, based on interaction number per user, we divide the test set into four groups, each of which has the same total interactions..."*

Nonetheless, here is the code and an explanation

In [52]:
def create_sparsity_split():
    all_users_to_test = list(test_set.keys())
    user_n_iid = dict()

    # generate a dictionary to store (key=n_iids, value=a list of uid).
    for uid in all_users_to_test:
        # train and test items for user_id
        train_iids = train_set[uid]
        test_iids = test_set[uid]

        # number of "interactions"
        n_iids = len(train_iids) + len(test_iids)

        if n_iids not in user_n_iid.keys():
            # dictionary where the keys are the number of interactions 
            # and the values are the users that have that number of interactions
            user_n_iid[n_iids] = [uid]
        else:
            user_n_iid[n_iids].append(uid)
    split_uids = list()

    # split the whole user set into four subset.
    temp = []
    count = 1
    fold = 4
    # total number of interactions in the dataset
    n_count = (n_train + n_test) 
    n_rates = 0

    split_state = []
    for idx, n_iids in enumerate(sorted(user_n_iid)):
        temp += user_n_iid[n_iids]
        # n_rates -> number of ratings
        # n_iids  -> key corresponding to a certain number of interactions (e.g. 10 ratings)
        # len(user_n_iid[n_iids]) -> number of users that interacted with 10 items
        n_rates += n_iids * len(user_n_iid[n_iids])
        n_count -= n_iids * len(user_n_iid[n_iids])
        # when number of rates/interaction has reached 25% of the total number of interactions, 
        # append the corresponding users to split_uids (remember we loop over sorted(user_n_iid))
        if n_rates >= count * 0.25 * (n_train + n_test):
            split_uids.append(temp)

            state = '#inter per user<=[%d], #users=[%d], #all rates=[%d]' %(n_iids, len(temp), n_rates)
            split_state.append(state)
            print(state)

            temp = []
            n_rates = 0
            fold -= 1 # don't think we need this if we manually state 0.25
        
        if idx == len(user_n_iid.keys()) - 1 or n_count == 0:
            split_uids.append(temp)

            state = '#inter per user<=[%d], #users=[%d], #all rates=[%d]' % (n_iids, len(temp), n_rates)
            split_state.append(state)
            print(state)
    return split_uids, split_state
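The idea behind the split is easier to see on a small example. The sketch below is a simplified variant, not the authors' function: it compares a running total against successive quarters of all interactions instead of resetting the counter after each group. `split_by_interactions` and `n_inter` are hypothetical names and data.

```python
from collections import defaultdict

# Hypothetical user -> total number of interactions (train + test)
n_inter = {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1,
           6: 2, 7: 2, 8: 2,
           9: 3, 10: 3,
           11: 6}

def split_by_interactions(n_inter, n_groups=4):
    # Bucket users by their number of interactions
    by_count = defaultdict(list)
    for u, n in n_inter.items():
        by_count[n].append(u)
    total = sum(n_inter.values())
    groups, current, running = [], [], 0
    # Walk the buckets from least to most active users
    for n in sorted(by_count):
        current += by_count[n]
        running += n * len(by_count[n])
        # Close the group once the running total reaches the next quantile
        if running >= (len(groups) + 1) * total / n_groups:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

groups = split_by_interactions(n_inter)
for g in groups:
    print(g, sum(n_inter[u] for u in g))
```

With this toy data each of the four groups accounts for 6 of the 24 interactions. On real data the groups are only approximately balanced, and fewer groups can come out when a single bucket spans several quantiles.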

In [53]:
def get_sparsity_split():
    # here, once the previous function is understood, there is not much to explain
    try:
        split_uids, split_state = [], []
        # path is a pathlib.Path, so join with "/" (Path + str raises a TypeError)
        lines = open(path/'sparsity.split', 'r').readlines()

        for idx, line in enumerate(lines):
            if idx % 2 == 0:
                split_state.append(line.strip())
                print(line.strip())
            else:
                split_uids.append([int(uid) for uid in line.strip().split(' ')])
        print('get sparsity split.')

    except Exception:
        split_uids, split_state = create_sparsity_split()
        # path is a pathlib.Path, so join with "/"; the with block closes (and flushes) the file
        with open(path/'sparsity.split', 'w') as f:
            for idx in range(len(split_state)):
                f.write(split_state[idx] + '\n')
                f.write(' '.join([str(uid) for uid in split_uids[idx]]) + '\n')
        print('create sparsity split.')

    return split_uids, split_state