# 2.0 - graphing

In this notebook I use the prepared data to build the train and test graphs and the dataloader functions that the models will use for training and testing.

## Bayesian Personalized Ranking (BPR)

A concept we rely on heavily in this work is Bayesian Personalized Ranking (BPR). Put simply, for a given set of users we sample "positive" items, i.e. user-item interactions that exist in the data, and "negative" items, i.e. user-item pairs that do not.
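As a toy illustration of the triplet idea, the sketch below builds (user, positive item, negative item) triplets from a small hand-made interaction table. The names `interactions`, `n_items` and `sample_triplet` are hypothetical and exist only for this example; they are not part of the notebook's data or code.

```python
import random

# hypothetical toy data: user id -> set of items the user interacted with
interactions = {0: {1, 3}, 1: {0}, 2: {2, 3, 4}}
n_items = 5  # assumed size of the item catalogue


def sample_triplet(user, rng):
    """Return (user, positive item, negative item) for one user."""
    seen = interactions[user]
    # positive: an item the user actually interacted with
    pos = rng.choice(sorted(seen))
    # negative: an item the user never interacted with
    unseen = [i for i in range(n_items) if i not in seen]
    neg = rng.choice(unseen)
    return user, pos, neg


rng = random.Random(0)
triplets = [sample_triplet(u, rng) for u in interactions]
```

Each triplet says "this user should rank the positive item above the negative one", which is exactly the signal the BPR loss consumes.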


In [1]:
import pandas as pd
import numpy as np
import torch
import torch.nn.functional as F

# load train dataframe for example
train = pd.read_csv("../data/interim/train.csv")

train.head()

Unnamed: 0,rating,timestamp,user_id_idx,item_id_idx
0,3,887080905,575,275
1,4,891464148,829,691
2,5,879456334,526,179
3,4,879376967,869,9
4,4,879442377,803,654


In [67]:
def dataloader(df: pd.DataFrame, batch_size: int = 32):
    """
    dataloader uses the BPR idea to create batches of (user, positive item, negative item) indices
    """

    n_users = df.user_id_idx.nunique()

    # sample 'batch_size' users
    users = np.random.choice(n_users, size=batch_size, replace=False)
    # sort users
    users.sort()

    # helper function to sample from dataframe group
    def sample(group):
        return group.sample(1)

    # sample one existing (positive) item for each user in the batch
    items = (
        df[df.user_id_idx.isin(users)].groupby("user_id_idx").apply(sample).item_id_idx
    )

    # for each batch user, sample one item that the user has NOT interacted with
    # (sampling from other users' items does not guarantee a non-existing pair)
    all_items = df.item_id_idx.unique()
    non_items = pd.Series(
        [
            np.random.choice(
                np.setdiff1d(all_items, df.loc[df.user_id_idx == u, "item_id_idx"])
            )
            for u in users
        ],
        index=users,
    )

    # offset item indices by the number of users so that item node ids
    # do not collide with user node ids in the graph
    items = items + n_users
    non_items = non_items + n_users

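    # return int64 index tensors (built directly, without a float round trip)
    return (
        torch.as_tensor(users, dtype=torch.long),
        torch.as_tensor(items.to_numpy(), dtype=torch.long),
        torch.as_tensor(non_items.to_numpy(), dtype=torch.long),
    )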

In [68]:
batch = dataloader(train, batch_size=3)

batch

(tensor([311, 463, 901]),
 tensor([1908, 1690, 1069]),
 tensor([1488, 1032, 1886]))
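The item tensors above are shifted by the number of users, so users and items share one node-index space in the graph. A minimal sketch of that mapping follows; `item_to_node`, `node_to_item` and the tiny `n_users = 4` are hypothetical and chosen only for illustration, since the notebook computes `n_users` from the data.

```python
n_users = 4  # hypothetical number of users, for illustration only


def item_to_node(item_idx, n_users):
    """Map an item index to its node id in the joint user-item graph."""
    return item_idx + n_users


def node_to_item(node_id, n_users):
    """Inverse mapping: recover the item index from a node id."""
    return node_id - n_users


# users keep ids 0..n_users-1, items start right after them
user_nodes = list(range(n_users))
item_nodes = [item_to_node(i, n_users) for i in range(3)]
```

This is why the checks that follow subtract `n_users` before looking an item up in the dataframe.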

In [80]:
# check whether our logic was correct

users = batch[0].numpy()
items = batch[1].numpy()
non_items = batch[2].numpy()

n_users = train.user_id_idx.nunique()

entry = train[
    (train.user_id_idx == users[0]) & (train.item_id_idx == items[0] - n_users)
]

assert entry.shape[0] == 1

Check that there is no edge between the first user and the corresponding negative (non-existing) item

In [82]:
entry = train[
    (train.user_id_idx == users[0]) & (train.item_id_idx == non_items[0] - n_users)
]

assert entry.shape[0] == 0

## Losses

Since we use BPR, we also need its loss; a definition and an implementation can be found [here](https://d2l.ai/chapter_recommender-systems/ranking.html). I will combine the approaches from that page and from [this work](https://medium.com/stanford-cs224w/recommender-systems-with-gnns-in-pyg-d8301178e377).
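To make the objective concrete: for one (user, positive item, negative item) triplet, with scores given by dot products of embeddings, the BPR term is the negative log of the sigmoid of the score difference. The plain-Python sketch below works through the arithmetic on made-up 2-d embeddings (all values are hypothetical) and shows that the softplus form gives the same number; in PyTorch the corresponding functions are `F.logsigmoid` and `F.softplus`.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    return math.log1p(math.exp(x))


# hypothetical 2-d embeddings for one (user, positive, negative) triplet
user = [0.5, 1.0]
pos_item = [1.0, 0.5]
neg_item = [-0.5, 0.2]

pos_score = dot(user, pos_item)  # 0.5 * 1.0 + 1.0 * 0.5
neg_score = dot(user, neg_item)  # 0.5 * -0.5 + 1.0 * 0.2

# the same BPR term written two ways
loss_log_sigmoid = -math.log(sigmoid(pos_score - neg_score))
loss_softplus = softplus(neg_score - pos_score)
```

The term shrinks toward zero as the positive score exceeds the negative score by a wider margin, so minimizing it pushes positives above negatives in the ranking.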

In [1]:
def bpr_loss(users, users_emb, pos_emb, neg_emb, user_emb0, pos_emb0, neg_emb0):
    # regularization loss computed from the initial embeddings
    reg_loss = (
        (1 / 2)
        * (user_emb0.norm().pow(2) + pos_emb0.norm().pow(2) + neg_emb0.norm().pow(2))
        / float(len(users))
    )

    # compute BPR loss from user, positive item, and negative item embeddings
    pos_scores = torch.mul(users_emb, pos_emb).sum(dim=1)
    neg_scores = torch.mul(users_emb, neg_emb).sum(dim=1)

    # softplus(neg - pos) == -log(sigmoid(pos - neg)), so this is the batch mean
    # of the negative log-likelihood; the summed variant would be
    # -torch.sum(F.logsigmoid(pos_scores - neg_scores))
    loss = torch.mean(F.softplus(neg_scores - pos_scores))

    return loss, reg_loss

## Metrics

For metrics I will use precision and recall: they are the most common choices, and the work I build on uses them as well.

- Precision: the fraction of retrieved instances that are relevant
- Recall: the fraction of all relevant instances that were retrieved
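For a recommender both metrics are usually computed over the top K recommended items of each user. The sketch below computes them for a single user; `precision_recall_at_k` and the sample lists are hypothetical and only illustrate the definitions, they are not the notebook's `metrics` function.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@K and Recall@K for one user.

    recommended: ranked list of item ids, best first
    relevant: set of held-out items the user actually interacted with
    """
    top_k = recommended[:k]
    hits = len(set(top_k) & relevant)
    precision = hits / k
    # guard against users with no held-out items
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# hypothetical ranking and held-out items for a single user
recommended = [10, 4, 7, 2, 9]
relevant = {4, 9, 11}

# two of the five recommended items (4 and 9) are relevant,
# and two of the three relevant items were retrieved
precision, recall = precision_recall_at_k(recommended, relevant, k=5)
```

Averaging these per-user values over all users gives the dataset-level Precision@K and Recall@K.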


In [None]:
def metrics(user_embeddings, item_embeddings, n_users, n_items, traindf, testdf, K):
    """
    Compute precision and recall
    """

    relevance = torch.mm(user_embeddings, item_embeddings.t())

    # sparse tensor of all train user-item interactions
    i = torch.stack(
        (
            torch.as_tensor(traindf.user_id_idx.to_numpy(), dtype=torch.long),
            torch.as_tensor(traindf.item_id_idx.to_numpy(), dtype=torch.long),
        )
    )
    v = torch.ones(traindf.shape[0])
    train_interactions = torch.sparse_coo_tensor(i, v, (n_users, n_items))

    # mask out user-item pairs already seen in training (their score is set to 0)
    relevance_score = torch.mul(relevance, (1 - train_interactions.to_dense()))

    # get top K items for each user
    _, topk_idx = torch.topk(relevance_score, K)

    # measure overlap between recommended (top-scoring) and held-out user-item interactions
    