# Lab — Graph Neural Networks for Recommender Systems

A recommender system task can be formulated as a link prediction problem on the user-item interaction graph.


Today we will work with the MovieLens dataset. It contains the ratings that users assigned to movies, the movie genres, and user features. The dataset is usually used to predict the rating itself. Here we reduce the task to predicting whether an interaction exists, so that it fits the link prediction setting.
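
The reduction from ratings to interactions can be sketched on toy data. The triples below are made up for illustration and the helper name `has_interaction` is hypothetical; the sketch uses plain Python so it runs without the lab's dependencies.

```python
# Toy (user, item, rating) triples; the values are made up for illustration.
toy_ratings = [(0, 10, 5), (0, 11, 2), (1, 10, 4)]

# Link prediction only asks whether an edge exists, so the rating value
# is dropped from the target and every rated pair becomes a positive edge.
positive_edges = {(user, item) for user, item, _ in toy_ratings}


def has_interaction(user, item):
    """Binary target of the link prediction task."""
    return (user, item) in positive_edges
```

Note that a low rating still counts as an interaction here. The rating value comes back later in the lab as a weight in the loss.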

The assignment consists of three tasks:

1. We will train a simple GraphSAGE model with user and item encoders to predict new links.
2. We will enhance the loss function with a penalty for low-rank positives, following the Uber Eats idea.
3. We will add importance weights to the sampling strategy, following the PinSAGE model.

In [None]:
! pip install dgl-cu111 -f https://data.dgl.ai/wheels/repo.html

## Data preparation

Before we train any model, we need to prepare the datasets and define all the required helpers.

Let us download the data from the official repository. We will take a fairly small version with only one million ratings.

In [None]:
! wget http://files.grouplens.org/datasets/movielens/ml-1m.zip
! unzip ml-1m.zip

Let us see what the data looks like.

In [None]:
import pandas as pd

# the multi-character separator "::" requires the python parsing engine
ratings = pd.read_csv("ml-1m/ratings.dat", sep="::", header=None, engine="python")
ratings.columns = ["user_id", "item_id", "rating", "timestamp"]
ratings.head()

Let us check for gaps in the `user_id` column.

In [None]:
set(range(ratings.user_id.max() + 1)) - set(ratings.user_id)

Looks OK. The ids start from `1`, so we need to subtract `1` from the `user_id` column for proper indexing of the feature tensors.

In [None]:
ratings.user_id -= 1

What about items?

In [None]:
# the multi-character separator "::" requires the python parsing engine
items = pd.read_csv("ml-1m/movies.dat", sep="::", encoding='cp1251', header=None, engine="python")
items.columns = ["item_id", "title", "tags"]
items.head()

In [None]:
len(set(range(items.item_id.max() + 1)) - set(items.item_id))

OK, some item indices are missing, so let us remap the ids to a contiguous range.

In [None]:
# map the sorted raw ids to 0..n-1 so that the mapping is deterministic
item_mapper = {raw_id: new_id for new_id, raw_id in enumerate(sorted(items.item_id.unique()))}
items.item_id = items.item_id.map(item_mapper)
ratings.item_id = ratings.item_id.map(item_mapper)
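
The remapping idea can be illustrated on a toy list of ids with gaps. This is a minimal sketch in plain Python; `raw_ids` is made-up data, not taken from MovieLens.

```python
# Raw ids with gaps (3, 4, 6, 7 and 8 are missing) and one repeated id.
raw_ids = [1, 2, 5, 9, 5]

# Assign consecutive indices to the sorted unique ids.
id_map = {raw: new for new, raw in enumerate(sorted(set(raw_ids)))}

# Apply the mapping; repeated raw ids get the same new index.
remapped = [id_map[raw] for raw in raw_ids]
print(remapped)
```

After the remapping, every id can be used directly as a row index of a feature matrix.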

Movies are described by a set of tags (genres), so we will use a multi-hot encoding of the tags as item features. Our goal is to build an inductive model that can handle the cold-start problem.

In [None]:
item_features = pd.DataFrame(items.tags.str.split("|").map(lambda x: {i: 1 for i in x}).tolist()).fillna(0).values
item_features
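
To see what the encoding above produces, here is a small sketch of the same idea in plain Python. The tag strings are made up, and the column order (sorted tag names) is an assumption of this sketch; the pandas version above orders columns by first appearance.

```python
# Toy tag strings in the same "A|B" format as movies.dat (values made up).
toy_tags = ["Animation|Comedy", "Comedy", "Drama|Animation"]

# Vocabulary of all tags, sorted to fix the column order.
vocab = sorted({tag for tags in toy_tags for tag in tags.split("|")})

# One row per movie, one column per tag: 1 if the movie has the tag.
multi_hot = [
    [1 if tag in tags.split("|") else 0 for tag in vocab]
    for tags in toy_tags
]
print(vocab)
print(multi_hot)
```

A movie with several genres gets several ones in its row, which is why the encoding is multi-hot rather than strictly one-hot.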

Now we are ready to work with the user features (do not forget to subtract `1` from the id).

In [None]:
users = pd.read_csv("ml-1m/users.dat", sep="::", header=None)
users.columns = ["user_id", "gender", "age", "occupation", "zipcode"]
users.user_id -= 1
users.head()

Age and occupation are given as groups, so it is better to encode them as one-hot vectors. Zipcode is a high-cardinality categorical feature, so we will just drop it for simplicity.

In [None]:
# change gender to indicator
users.gender = (users.gender == "F").astype(int)
# extract OHE vector from age (dtype=int keeps the feature matrix numeric)
users = users.join(pd.get_dummies(users.age, dtype=int))
# extract OHE vector from occupation
users = users.join(pd.get_dummies(users.occupation, dtype=int), rsuffix="_occ")
# drop non-required fields
user_features = users.drop(["user_id", "age", "occupation", "zipcode"], axis=1).values

Now we are ready to define a graph in DGL format and assign features to its nodes.

In [None]:
import dgl
import torch
import torch.nn.functional as F

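# create bipartite graph for user-item interactions:
# users keep their ids, item ids are shifted by the number of users
num_users, num_items = len(users), len(items)
src = torch.tensor(ratings.user_id.values)
dst = torch.tensor(ratings.item_id.values) + num_users
# fix the node count so that it matches the feature matrices
graph = dgl.graph((src, dst), num_nodes=num_users + num_items)
graph.edata["rating"] = torch.Tensor(ratings.rating.values)
# add reverse edges for message passing
graph = dgl.add_reverse_edges(graph, copy_edata=True)
# store the node type: 0 for users, 1 for items
graph.ndata["ntype"] = torch.cat((torch.zeros(num_users), torch.ones(num_items))).reshape(-1, 1)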
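
The id layout of this graph can be sketched without DGL. The example below uses made-up edges and a hypothetical helper `is_item`; it only illustrates the offset trick and the reverse edges.

```python
num_users = 3
# Toy (user_id, item_id) interactions with separate id spaces (made up).
toy_edges = [(0, 0), (0, 1), (2, 1)]

# Users keep their ids, items are shifted behind the last user id.
src = [user for user, _ in toy_edges]
dst = [item + num_users for _, item in toy_edges]

# Reverse edges let messages flow from items to users as well.
src_bi = src + dst
dst_bi = dst + src


def is_item(node):
    """Nodes with an id of at least `num_users` are items."""
    return node >= num_users
```

Every edge still connects a user with an item, so the graph stays bipartite even though all nodes share one id space.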

To simplify the message passing code, we align the user and item feature dimensions so that both can be stored in the same node attribute.

In [None]:
print(user_features.shape, item_features.shape)

In [None]:
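import numpy as np

# zero-pad the narrower item features to the width of the user features
pad_width = user_features.shape[1] - item_features.shape[1]
item_features = np.hstack([item_features, np.zeros((item_features.shape[0], pad_width))])
graph.ndata["feat"] = torch.Tensor(np.vstack([user_features, item_features]).astype(np.float32))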
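
The alignment step is plain zero padding. A minimal sketch on toy rows, assuming a hypothetical helper `pad_right` and made-up feature values:

```python
def pad_right(rows, width):
    """Append zeros to every row until it has `width` entries."""
    return [row + [0.0] * (width - len(row)) for row in rows]


# Toy features: users have 4 columns, items only 2 (values made up).
user_rows = [[1.0, 0.0, 1.0, 0.0]]
item_rows = [[1.0, 1.0]]

# Pad both blocks to the common width and stack users before items,
# which matches the node id layout (users first, then items).
width = max(len(user_rows[0]), len(item_rows[0]))
features = pad_right(user_rows, width) + pad_right(item_rows, width)
print(features)
```

The padded columns carry no information. They only let both node types share one tensor, while the separate user and item encoders decide how each row is interpreted.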

We also need to split our graph into train, validation and test parts. We will split it by nodes, because the GraphSAGE model can do inductive inference (for previously unseen nodes).

In [None]:
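import numpy as np

def prepare_graph_for_loaders(graph, perm, l=None, r=None, cum=True):
    # nodes that belong to this split: perm[l * n : r * n]
    n_total = perm.shape[0]
    real_nodes = perm[int((l or 0) * n_total): int((r or 1) * n_total)]
    # with cum=True the subgraph also keeps the nodes of all earlier splits
    g = graph.subgraph(perm[:int((r or 1) * n_total)] if cum else real_nodes)
    # positions of the split's nodes inside the subgraph
    n = torch.where(torch.isin(g.ndata[dgl.NID], torch.as_tensor(real_nodes)))[0]
    f, t = g.edges()
    # keep the edges that touch at least one node of the split
    edge_filter = torch.logical_or(torch.isin(f, n), torch.isin(t, n))
    return g, real_nodes, g.edges("eid")[edge_filter]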

np.random.seed(0)
perm = np.random.permutation(graph.nodes().numpy())

train_graph, train_nodes, train_edges = prepare_graph_for_loaders(graph, perm, r=0.8)
val_graph, val_nodes, val_edges = prepare_graph_for_loaders(graph, perm, l=0.8, r=0.9)
test_graph, test_nodes, test_edges = prepare_graph_for_loaders(graph, perm, l=0.9, r=1)
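
The node-based split can be sketched in plain Python. The helpers `split_nodes` and `edges_touching` are hypothetical names for this illustration; they mimic the cumulative behaviour above, where the validation graph contains the train nodes plus the validation nodes, and only the edges touching a validation node are evaluated.

```python
import random


def split_nodes(num_nodes, seed=0):
    """Shuffle node ids and cut them into 80% / 10% / 10% parts."""
    nodes = list(range(num_nodes))
    random.Random(seed).shuffle(nodes)
    cut_train, cut_val = int(0.8 * num_nodes), int(0.9 * num_nodes)
    return nodes[:cut_train], nodes[cut_train:cut_val], nodes[cut_val:]


def edges_touching(edges, visible, targets):
    """Edges inside the visible subgraph that touch a target node."""
    visible, targets = set(visible), set(targets)
    return [
        (s, d) for s, d in edges
        if s in visible and d in visible and (s in targets or d in targets)
    ]


train_part, val_part, test_part = split_nodes(10)
toy_edges = [(0, 1), (1, 2), (2, 3)]
# Nodes 0..2 are visible, node 2 is the only "new" node of the split.
selected = edges_touching(toy_edges, visible=[0, 1, 2], targets=[2])
print(selected)
```

Edges between two old nodes are left to the earlier split, and edges to not yet visible nodes are dropped.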

## Model definition

User and item data have different kinds of features, so we need to project them into one space. We can use simple MLP encoders.

The model will be GraphSAGE. It works as follows:

1. Sample several layers of neighbors uniformly.
2. Pass the features through a linear transformation.
3. Aggregate the features of adjacent nodes from the neighbor layer.
4. Apply a non-linearity.
5. Iterate from the farthest neighbors to the current node.
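
Steps 2 to 4 can be sketched for a single layer with mean aggregation. This is a toy version in plain Python, assuming one weight matrix for the node itself and one for the averaged neighbors; the names `sage_layer`, `w_self` and `w_neigh` are hypothetical, and the sketch has no bias, dropout or normalization.

```python
def matvec(weight, x):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sage_layer(h, sampled_neighbors, w_self, w_neigh):
    """One layer: relu(W_self * h_v + W_neigh * mean(h_u for sampled u))."""
    out = {}
    for node, h_node in h.items():
        picked = sampled_neighbors.get(node, [])
        # Nodes without sampled neighbors aggregate a zero vector.
        agg = mean([h[u] for u in picked]) if picked else [0.0] * len(h_node)
        z = [a + b for a, b in zip(matvec(w_self, h_node), matvec(w_neigh, agg))]
        out[node] = [max(0.0, value) for value in z]
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
h0 = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [2.0, 2.0]}
h1 = sage_layer(h0, {0: [1, 2]}, identity, identity)
print(h1[0])
```

Stacking several such layers, each fed with the output of the previous one, gives step 5: information travels from the farthest sampled neighbors to the target node.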

Here we write a custom GraphSAGE because our graph is bipartite and each node type needs its own encoder.

In [None]:
from torch import nn
class GraphSAGE(nn.Module):
    def __init__(self, n_input, n_hidden, n_layers, activation, dropout):
        super(GraphSAGE, self).__init__()
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.layers = nn.ModuleList()

        self.user_encoder = nn.Sequential(
            nn.Linear(n_input, n_hidden * 2),
            nn.ReLU(),
            nn.Linear(n_hidden * 2, n_hidden),
        )
        self.item_encoder = nn.Sequential(
            nn.Linear(n_input, n_hidden * 2),
            nn.ReLU(),
            nn.Linear(n_hidden * 2, n_hidden),
        )

        for i in range(n_layers):
            self.layers.append(dgl.nn.SAGEConv(n_hidden, n_hidden, "mean"))
        self.dropout = nn.Dropout(dropout)
        self.activation = activation

    def forward(self, pos_graph, neg_graph, blocks):
        # encode the input nodes with both encoders and select one by node type
        feat = blocks[0].srcdata["feat"]
        ntype = blocks[0].srcdata["ntype"]
        u, i = self.user_encoder(feat), self.item_encoder(feat)
        h = (1 - ntype) * u + ntype * i
        h = F.normalize(h, dim=1)

        for layer, block in zip(self.layers, blocks):
            h = self.activation(h)
            h = self.dropout(h)
            # edge weights exist only if the sampler provides them
            h = layer(block, h, block.edata.get("weight"))
            h = F.normalize(h, dim=1)

        # score an edge by the dot product of its endpoint embeddings
        s, d = pos_graph.edges()
        pos_scores = (h[s] * h[d]).sum(dim=1)
        if neg_graph is not None:
            s, d = neg_graph.edges()
            neg_scores = (h[s] * h[d]).sum(dim=1)
        else:
            neg_scores = None
        return pos_scores, neg_scores

## Data loaders and train loops

We aim to solve a link prediction problem, so our dataloader should iterate over the edges.

Our model should aggregate information over several layers of neighbors. This can be done using the `MultiLayerNeighborSampler` in DGL. We can set the number of layers and the number of neighbors sampled at each layer by passing a list of fanouts. The neighbors are sampled uniformly.

Let us use a 3-hop neighborhood with 10, 5 and 3 sampled neighbors. The general idea of such a setting is that the local structure plays a much bigger role in our network. Note that DGL reads the list per GNN layer, from the input layer to the output layer, so the last fanout applies to the direct neighbors of the target nodes.



In [None]:
sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 5, 3])
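
What multi-layer uniform sampling does can be sketched on a toy adjacency list. The function `sample_hops` is a hypothetical plain-Python helper, not DGL code. It applies the last fanout to the seed nodes, which we assume matches DGL's per-layer ordering.

```python
import random


def sample_hops(adj, seeds, fanouts, seed=0):
    """Uniformly sample up to `fanout` neighbors per node, hop by hop."""
    rng = random.Random(seed)
    frontier = sorted(set(seeds))
    hops = []
    # Walk outward from the seeds; the last fanout is used for the seeds.
    for fanout in reversed(fanouts):
        sampled = {}
        for node in frontier:
            neighbors = adj.get(node, [])
            sampled[node] = rng.sample(neighbors, min(fanout, len(neighbors)))
        hops.append(sampled)
        # The next hop must cover the current nodes and their sampled neighbors.
        frontier = sorted(set(frontier) | {u for picked in sampled.values() for u in picked})
    # Return the hops ordered from the farthest one to the seeds.
    return hops[::-1]


toy_adj = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [0], 4: [1]}
hops = sample_hops(toy_adj, seeds=[0], fanouts=[2, 1])
print(hops)
```

Capping the number of neighbors per node keeps the cost of a batch bounded, no matter how many neighbors a popular movie has.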

Negative sampling is also required for the link prediction problem.
We will sample one negative edge for each positive one. However, our graph is bipartite, so we need to modify the standard sampler to draw only nodes of the opposite type.

In [None]:
class BipartiteUniform(dgl.dataloading.negative_sampler._BaseNegativeSampler):
    def __init__(self, k):
        self.k = k
    def _generate(self, g, eids, canonical_etype):
        shape = dgl.backend.shape(eids)
        dtype = dgl.backend.dtype(eids)
        ctx = dgl.backend.context(eids)
        shape = (shape[0] * self.k,)
        src, _ = g.find_edges(eids, etype=canonical_etype)
        src = dgl.backend.repeat(src, self.k, 0)
        num_users = g.number_of_nodes() - g.ndata["ntype"].sum().int().item()
        dst_from_item = dgl.backend.randint(shape, dtype, ctx, 0, num_users)
        dst_from_user = dgl.backend.randint(shape, dtype, ctx, 0, g.ndata["ntype"].sum().int().item()) + num_users
        dst = torch.where(g.ndata["ntype"][src].view(-1) == 1, dst_from_item, dst_from_user)
        return src, dst

In [None]:
negative_sampler = BipartiteUniform(1)
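
The logic of the sampler can be sketched without DGL. The helper `bipartite_negatives` is a hypothetical plain-Python version that assumes the same id layout as our graph: users occupy the ids below `num_users`, items follow.

```python
import random


def bipartite_negatives(src_nodes, num_users, num_items, k=1, seed=0):
    """Draw `k` random destinations of the opposite type for every source."""
    rng = random.Random(seed)
    pairs = []
    for src in src_nodes:
        for _ in range(k):
            if src < num_users:
                # The source is a user, so the negative destination is an item.
                dst = num_users + rng.randrange(num_items)
            else:
                # The source is an item, so the negative destination is a user.
                dst = rng.randrange(num_users)
            pairs.append((src, dst))
    return pairs


negatives = bipartite_negatives([0, 4, 1], num_users=3, num_items=2, k=2)
print(negatives)
```

Like the sampler above, the sketch does not check whether a sampled pair is a real edge. On a sparse graph such collisions are rare, so they are usually tolerated.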

Finally, one can define the dataloaders.



In [None]:
batch_size = 512
device = "cpu"

train_loader = dgl.dataloading.EdgeDataLoader(
    train_graph,
    train_edges,
    sampler,
    negative_sampler=negative_sampler,
    device=device,
    batch_size=batch_size,
    shuffle=True,
    drop_last=False)

val_loader = dgl.dataloading.EdgeDataLoader(
    val_graph,
    val_edges,
    sampler,
    negative_sampler=negative_sampler,
    device=device,
    batch_size=batch_size,
    shuffle=True,
    drop_last=False)

test_loader = dgl.dataloading.EdgeDataLoader(
    test_graph,
    test_edges,
    sampler,
    negative_sampler=negative_sampler,
    device=device,
    batch_size=batch_size,
    shuffle=True,
    drop_last=False)

Let us define the train and evaluation functions

In [None]:
from sklearn.metrics import average_precision_score
from matplotlib import pyplot as plt
from IPython.display import clear_output
from tqdm import tqdm
import numpy as np

def train(model, loader, opt, delta=0.5):
    model.train()
    loss_log = []
    for (in_nodes, pos_graph, neg_graph, blocks) in tqdm(loader):
        preds = model(pos_graph, neg_graph, blocks)

        loss = (-preds[0] + preds[1] + delta).clamp(min=0).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
        loss_log.append(loss.item())
    return loss_log


def evaluate(model, loader, opt):
    # `opt` is not used; it is kept so that the existing calls keep working
    model.eval()
    pred_log, gt_log = [], []
    with torch.no_grad():
        for (in_nodes, pos_graph, neg_graph, blocks) in tqdm(loader):
            pos_scores, neg_scores = model(pos_graph, neg_graph, blocks)
            pred_log.append(torch.cat([pos_scores, neg_scores]).cpu().numpy())
            # label positives with 1 and negatives with 0, in the order of the scores
            gt_log.append(np.concatenate([np.ones(len(pos_scores)), np.zeros(len(neg_scores))]))
    return average_precision_score(np.concatenate(gt_log), np.concatenate(pred_log))
    
def run(model, train_loader, val_loader, test_loader, opt, n_epoch=100, delta=0.1, trainer=train):
    ap_log = []
    for epoch in range(n_epoch):
        loss_log = trainer(model, train_loader, opt, delta=delta)
        ap_log.append(evaluate(model, val_loader, opt))
        clear_output()
        plt.plot(ap_log)
        plt.title(f"Best score: {max(ap_log)} at epoch: {np.argmax(ap_log)}, last score: {ap_log[-1]} at epoch {epoch}")
        plt.show()
    print(f'Test results: {evaluate(model, test_loader, opt)}')
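
The loss in `train` is a margin (hinge) loss over pairs of a positive and a negative edge. A minimal sketch in plain Python, with made-up scores and a hypothetical function name:

```python
def margin_loss(pos_scores, neg_scores, delta=0.1):
    """Sum of max(0, neg - pos + delta) over paired scores."""
    return sum(
        max(0.0, neg - pos + delta)
        for pos, neg in zip(pos_scores, neg_scores)
    )


# First pair: the positive wins by more than delta, so it adds nothing.
# Second pair: the negative scores higher, so it is penalized.
loss = margin_loss([0.9, 0.2], [0.1, 0.3], delta=0.1)
print(loss)
```

A pair stops contributing as soon as the positive score exceeds the negative one by at least `delta`, so the model is not pushed to separate pairs that are already ranked well.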

In [None]:
model = GraphSAGE(29, 8, 3, nn.ReLU(), 0.1)
model.to(device)
opt = torch.optim.Adam(model.parameters(), 3e-4)
run(model, train_loader, val_loader, test_loader, opt, 10, delta=0.1)
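
The validation metric is average precision. The sketch below shows how it is computed for scores without ties; it is a simplified illustration with a hypothetical function name, not the scikit-learn implementation, and it does not treat tied scores the way `average_precision_score` does.

```python
def average_precision(labels, scores):
    """Mean of the precision values at the rank of every positive."""
    order = sorted(range(len(scores)), key=lambda idx: -scores[idx])
    hits, total = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            # Precision among the top `rank` predictions.
            total += hits / rank
    return total / hits if hits else 0.0


# Positives are ranked 1st and 3rd: (1/1 + 2/3) / 2.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(ap)
```

The metric only depends on the ranking of the scores, so it fits a model that outputs unbounded dot products instead of probabilities.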

We have tested the quality of the base training. However, the rating of a movie shows how much a user likes it, so it is helpful to account for this information in the loss.

So far we trained our model with a simple margin loss. To account for the heterogeneity of links, we add a low-rank positive part to it: edges with a rating of 3 or lower are treated as low-rank positives and get half the margin.

In [None]:
def train(model, loader, opt, delta=0.1):
    model.train()
    loss_log = []
    for (in_nodes, pos_graph, neg_graph, blocks) in tqdm(loader):
        preds = model(pos_graph, neg_graph, blocks)
        # edges rated above 3 are strong positives, the rest are low-rank positives
        strong = pos_graph.edata["rating"] > 3
        loss = (-preds[0][strong] + preds[1][strong] + delta).clamp(min=0).sum()
        # low-rank positives only need to beat the negatives by half the margin
        loss += (-preds[0][~strong] + preds[1][~strong] + delta / 2).clamp(min=0).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
        loss_log.append(loss.item())
    return loss_log
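
The effect of the rating-dependent margin can be checked on toy numbers. This is a sketch in plain Python with a hypothetical function name; the threshold of 3 and the halved margin follow the cell above.

```python
def rating_margin_loss(pos_scores, neg_scores, ratings, delta=0.1, threshold=3):
    """Margin loss where low-rated positives use half the margin."""
    total = 0.0
    for pos, neg, rating in zip(pos_scores, neg_scores, ratings):
        margin = delta if rating > threshold else delta / 2
        total += max(0.0, neg - pos + margin)
    return total


# Both pairs have equal positive and negative scores, so each one
# contributes exactly its margin: 0.1 for rating 5, 0.05 for rating 2.
loss = rating_margin_loss([0.5, 0.5], [0.5, 0.5], ratings=[5, 2], delta=0.1)
print(loss)
```

The model is pushed harder to rank well-liked movies above random ones than movies the user rated poorly.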

In [None]:
model = GraphSAGE(29, 8, 3, nn.ReLU(), 0.1)
model.to(device)
opt = torch.optim.Adam(model.parameters(), 3e-4)
run(model, train_loader, val_loader, test_loader, opt, 10, delta=0.1, trainer=train)

## PinSAGE sampling

In this task we will define the sampler from the PinSAGE paper.
The general idea is to sample neighbors with random walks and to weight the message passing with visit counts (an approximation of Personalized PageRank).



In [None]:
class PinSAGESampler(dgl.dataloading.MultiLayerNeighborSampler):
    def __init__(self, walk_len, num_walks_per_node, top_k, *args, **kwargs):
        super(PinSAGESampler, self).__init__(*args, **kwargs)
        self.walk_len = walk_len
        self.num_walks_per_node = num_walks_per_node
        self.top_k = top_k

    def sample_frontier(self, block_id, g, seed_nodes, exclude_eids=None):
        seed_nodes = seed_nodes["_N"] if isinstance(seed_nodes, dict) else seed_nodes
        # walk on the graph being sampled (`g`), not on the global `graph`:
        # the node ids of the train/val/test subgraphs are relabeled
        starts = torch.as_tensor(seed_nodes).repeat_interleave(self.num_walks_per_node)
        walks, _ = dgl.sampling.random_walk(g, starts, length=self.walk_len)
        # the final node of every walk becomes a neighbor candidate of its start node
        src = dgl.backend.reshape(walks[:, self.walk_len::self.walk_len], (-1,))
        dst = dgl.backend.repeat(walks[:, 0], 1, 0)
        # walks that stopped early are padded with -1, so drop them
        src_mask = (src != -1)
        src = dgl.backend.boolean_mask(src, src_mask)
        dst = dgl.backend.boolean_mask(dst, src_mask)

        # keep all nodes of `g` so that node ids stay valid in the frontier
        neighbor_graph = dgl.graph((src, dst), num_nodes=g.num_nodes())
        # merge parallel edges; "weight" counts how often a neighbor was hit
        neighbor_graph = neighbor_graph.to_simple(return_counts="weight")
        counts = neighbor_graph.edata["weight"]
        neighbor_graph = dgl.sampling.neighbor.select_topk(neighbor_graph, self.top_k, "weight")
        selected_counts = dgl.backend.gather_row(counts, neighbor_graph.edata[dgl.EID])
        neighbor_graph.edata["weight"] = selected_counts.float() / self.walk_len / self.num_walks_per_node
        neighbor_graph.ndata["feat"] = g.ndata["feat"][neighbor_graph.nodes()]
        neighbor_graph.ndata["ntype"] = g.ndata["ntype"][neighbor_graph.nodes()]
        return neighbor_graph
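
The importance-based neighborhood can be sketched in plain Python. The helper `top_k_by_visits` is a hypothetical name, and the sketch counts every node visited along a walk, which is an assumption of this illustration; the sampler above only counts the final node of each walk.

```python
import random
from collections import Counter


def top_k_by_visits(adj, start, walk_len, num_walks, top_k, seed=0):
    """Rank nodes by how often short random walks from `start` visit them."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(walk_len):
            neighbors = adj.get(node, [])
            if not neighbors:
                break  # dead end, stop this walk early
            node = rng.choice(neighbors)
            if node != start:
                visits[node] += 1
    total = sum(visits.values())
    if total == 0:
        return []
    # Normalized visit counts serve as importance weights of the neighbors.
    return [(node, count / total) for node, count in visits.most_common(top_k)]


toy_adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
neighborhood = top_k_by_visits(toy_adj, start=0, walk_len=4, num_walks=50, top_k=2)
print(neighborhood)
```

Nodes that are reachable through many short paths collect more visits, so they end up in the neighborhood with a larger weight than nodes that are merely adjacent.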

Now we need to update our data loaders with the new sampler.

In [None]:
sampler = PinSAGESampler(10, 10, 5, [1, 1, 1])

In [None]:
batch_size = 32
device = "cpu"

train_loader = dgl.dataloading.EdgeDataLoader(
    train_graph,
    train_edges,
    sampler,
    negative_sampler=negative_sampler,
    device=device,
    batch_size=batch_size,
    shuffle=True,
    drop_last=False)

val_loader = dgl.dataloading.EdgeDataLoader(
    val_graph,
    val_edges,
    sampler,
    negative_sampler=negative_sampler,
    device=device,
    batch_size=batch_size,
    shuffle=True,
    drop_last=False)

test_loader = dgl.dataloading.EdgeDataLoader(
    test_graph,
    test_edges,
    sampler,
    negative_sampler=negative_sampler,
    device=device,
    batch_size=batch_size,
    shuffle=True,
    drop_last=False)

And let us test the final model

In [None]:
model = GraphSAGE(29, 8, 3, nn.ReLU(), 0.1)
model.to(device)
opt = torch.optim.Adam(model.parameters(), 3e-4)
run(model, train_loader, val_loader, test_loader, opt, 10, delta=0.1)