Copyright (c) 2023 Graphcore Ltd. All rights reserved.

# Training Neural Bellman-Ford networks (NBFNet) for inductive knowledge graph link prediction on IPUs

<a href="https://arxiv.org/abs/2106.06935" target="_blank">Neural Bellman-Ford networks (NBFNet)</a> is a model that generalises path-based reasoning models for predicting links in homogeneous and heterogeneous graphs. 

In this notebook we use NBFNet for link prediction in the FB15k-237 knowledge graph with 14541 entities, 237 relation types and 272115 triples. In practice, however, we explicitly insert reverse edges, which doubles both counts to a total of 474 relation types and 544230 triples.
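The doubling of relation types and triples can be illustrated with a minimal sketch. It assumes that the reverse of relation `r` is stored under the id `r + num_relations`; this id convention and the helper name `add_reverse_edges` are illustrative assumptions, not taken from the dataset code used in this notebook.

```python
def add_reverse_edges(triples, num_relations):
    """Return the triples plus one reversed copy of each, and the new relation count.

    Assumption: the reverse of relation r gets the id r + num_relations.
    """
    reverse = [(tail, rel + num_relations, head) for head, rel, tail in triples]
    return triples + reverse, 2 * num_relations


# Hypothetical toy graph: 2 relation types, 2 triples (head, relation, tail).
toy_triples = [(0, 0, 1), (1, 1, 2)]
augmented, augmented_num_relations = add_reverse_edges(toy_triples, num_relations=2)

# The same doubling applied to the FB15k-237 numbers quoted above.
fb15k237_relations = 2 * 237
fb15k237_triples = 2 * 272115
```

Applied to the counts above, this gives 474 relation types and 544230 triples.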

Unlike many other knowledge graph completion models, NBFNet can be *inductive*: it can generalise to entities that do not appear in the training data. To demonstrate this inductive behaviour we train the model on a small subset of the graph (4707 entities, 54406 triples) and perform inference on the complete FB15k-237 graph.

This notebook assumes some familiarity with PopTorch as well as PyTorch Geometric (PyG). For additional resources please consult:
* [PopTorch Documentation](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/index.html),
* [PopTorch Examples and Tutorials](https://docs.graphcore.ai/en/latest/examples.html#pytorch),
* [PyTorch Geometric](https://pytorch-geometric.readthedocs.io/en/latest/),
* [PopTorch Geometric Documentation](https://docs.graphcore.ai/projects/poptorch-geometric-user-guide/en/latest/index.html).

### Running on Paperspace

The Paperspace environment lets you run this notebook with no setup. To improve your experience we preload datasets and pre-install packages. This can take a few minutes. If you experience errors immediately after starting a session, please try restarting the kernel before contacting support. If a problem persists or you want to give us feedback on the content of this notebook, please reach out through our community of developers using our [Slack channel](https://www.graphcore.ai/join-community) or raise a [GitHub issue](https://github.com/graphcore/examples).

Requirements:

* Python packages installed with `pip install -r requirements.txt`

In [None]:
%pip install -q -r requirements.txt

Let's import the required packages:

In [None]:
import notebook_utils

import os
import os.path as osp

import poptorch
import torch
from torch_geometric.datasets import RelLinkPredDataset

import data as nbfnet_data
import inference_utils
from nbfnet import NBFNet

For compatibility with the Paperspace environment variables we need to do the following:

In [None]:
poptorch.setLogLevel("ERR")
executable_cache_dir = (
    os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/") + "/pyg-nbfnet"
)
dataset_directory = os.getenv("DATASETS_DIR", "data")
available_ipus = int(os.getenv("NUM_AVAILABLE_IPU", "4"))

Now we are ready to start!

## Training the model

First we will look at the steps required to train the model.

### Defining the model hyperparameters

Here we define some model settings and hyperparameters:
- BATCH_SIZE: The micro batch size (number of triples `(head, relation, tail)`) during training
- NUM_NEGATIVES: The number of triples `(head, relation, false_tail)` to contrast against each true triple
- LEARNING_RATE: The learning rate of the optimiser
- LATENT_DIM: The hidden dimension in the Message Passing Neural Network
- NUM_LAYERS: The number of message passing layers
- NEG_ADVERSARIAL_TEMP: The temperature of a softmax that weights negative samples based on their difficulty.

In [None]:
BATCH_SIZE = 6
NUM_NEGATIVES = 32
LEARNING_RATE = 0.001
LATENT_DIM = 64
NUM_LAYERS = 6
NEG_ADVERSARIAL_TEMP = 0.7
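To make the role of `NEG_ADVERSARIAL_TEMP` more concrete, here is a minimal sketch of a temperature-scaled softmax over the scores of the negative samples. It assumes that the scores are divided by the temperature before the softmax, as in common self-adversarial negative sampling schemes; the exact weighting inside `nbfnet` may differ, and the function name is hypothetical.

```python
import math


def negative_sample_weights(negative_scores, temperature):
    """Softmax over negative scores divided by the temperature (assumed form).

    Negatives with a higher score are harder to tell apart from the true
    triple and therefore receive a larger weight in the loss.
    """
    scaled = [score / temperature for score in negative_scores]
    offset = max(scaled)  # subtract the maximum for numerical stability
    exponentials = [math.exp(value - offset) for value in scaled]
    normaliser = sum(exponentials)
    return [value / normaliser for value in exponentials]


# Hypothetical scores of three negative tails for one true triple.
toy_scores = [2.0, 0.0, -1.0]
sharp_weights = negative_sample_weights(toy_scores, temperature=0.7)
flat_weights = negative_sample_weights(toy_scores, temperature=10.0)
```

A low temperature concentrates the weight on the hardest negatives, while a high temperature approaches a uniform average over all negatives.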

### Creating the dataset and dataloader

Now we build a training and validation dataset from the small `IndFB15k-237_v4` graph and a test dataset from the full `FB15k-237` graph. Then we create a dataloader for training, validation and test. The dataloader does the following: batches data; removes edges between head and tail entities in the training dataset to make the training objective non-trivial; samples negative tails. For validation and test, all entities will be treated as potential tail nodes.

In [None]:
dataset_train = nbfnet_data.build_dataset(
    name="IndFB15k-237", path=dataset_directory, version="v4"
)
dataset_inference = nbfnet_data.build_dataset(name="FB15k-237", path=dataset_directory)

In [None]:
dataloader = dict(
    train=nbfnet_data.DataWrapper(
        nbfnet_data.NBFData(
            data=dataset_train[0],
            batch_size=BATCH_SIZE,
            is_training=True,
            num_relations=dataset_train.num_relations,
            num_negatives=NUM_NEGATIVES,
        )
    ),
    valid=nbfnet_data.DataWrapper(
        nbfnet_data.NBFData(
            data=dataset_train[1],
            batch_size=1,
            is_training=False,
        )
    ),
    test=nbfnet_data.DataWrapper(
        nbfnet_data.NBFData(
            data=dataset_inference[2],
            batch_size=1,
            is_training=False,
        )
    ),
)

num_relations = dataset_inference.num_relations + 1
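The two training-specific steps of the dataloader, removing the direct edges between head and tail and sampling negative tails, can be sketched in plain Python. This is a simplified illustration on a toy graph, not the `nbfnet_data` implementation. In particular, filtering out known true triples when sampling negatives is an assumption, and both helper names are hypothetical.

```python
import random


def remove_target_edges(edges, head, tail):
    """Drop every edge that directly links head and tail, in either direction."""
    return [(h, r, t) for h, r, t in edges if {h, t} != {head, tail}]


def sample_negative_tails(head, relation, known_triples, num_entities, num_negatives, rng):
    """Draw distinct tails t' for which (head, relation, t') is not a known triple."""
    candidates = [
        t for t in range(num_entities) if (head, relation, t) not in known_triples
    ]
    return rng.sample(candidates, num_negatives)


# Hypothetical toy graph with 6 entities and 2 relation types.
toy_edges = [(0, 0, 1), (1, 1, 0), (0, 0, 2), (3, 1, 4)]
head, relation, tail = 0, 0, 1

# Without this step the model could simply read the answer off the graph.
message_passing_edges = remove_target_edges(toy_edges, head, tail)

negative_tails = sample_negative_tails(
    head, relation, set(toy_edges), num_entities=6, num_negatives=3, rng=random.Random(0)
)
```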

### Defining the model
We can now define the model and the optimiser using the hyperparameters that we have defined above. The model is cast to float16 for improved compute and memory efficiency.

In [None]:
model = NBFNet(
    input_dim=LATENT_DIM,
    hidden_dims=[LATENT_DIM] * NUM_LAYERS,
    message_fct="mult",
    aggregation_fct="sum",
    num_mlp_layers=2,
    relation_learning="linear_query",
    adversarial_temperature=NEG_ADVERSARIAL_TEMP,
    num_relations=num_relations,
)

model.half();
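The `message_fct` and `aggregation_fct` arguments select the two operators of the generalised Bellman-Ford iteration that NBFNet builds on. The following scalar toy sketch shows the idea under simplifying assumptions: node states are single numbers rather than learned vectors, edge weights are fixed rather than produced from the query relation, and the function name is hypothetical. With `(add, min)` the iteration recovers classic shortest paths, and with `(mul, sum)` it accumulates products of edge weights over all paths, which is the scalar analogue of the `"mult"` and `"sum"` settings above.

```python
import operator


def generalized_bellman_ford(edges, num_nodes, source, message, aggregate, one, zero, num_layers):
    """Scalar toy version of the generalised Bellman-Ford iteration.

    edges: list of (from_node, to_node, weight)
    one / zero: boundary values for the source node and all other nodes
    """
    boundary = [one if node == source else zero for node in range(num_nodes)]
    state = list(boundary)
    for _ in range(num_layers):
        # Every node always receives its own boundary condition as one message.
        incoming = [[boundary[node]] for node in range(num_nodes)]
        for from_node, to_node, weight in edges:
            incoming[to_node].append(message(state[from_node], weight))
        state = [aggregate(messages) for messages in incoming]
    return state


# Classic Bellman-Ford: message = add, aggregate = min.
distance_edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 5.0)]
shortest = generalized_bellman_ford(
    distance_edges, 3, source=0, message=operator.add, aggregate=min,
    one=0.0, zero=float("inf"), num_layers=2,
)

# Sum of path products: message = mul, aggregate = sum.
weight_edges = [(0, 1, 0.5), (1, 2, 0.5), (0, 2, 0.5)]
path_sums = generalized_bellman_ford(
    weight_edges, 3, source=0, message=operator.mul, aggregate=sum,
    one=1.0, zero=0.0, num_layers=2,
)
```

The number of iterations plays the role of `NUM_LAYERS`: after `k` layers, a node has received information along paths of up to `k` edges from the source.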

In [None]:
optim = poptorch.optim.AdamW(
    model.parameters(),
    lr=LEARNING_RATE,
    bias_correction=True,
    weight_decay=0.0,
    eps=1e-8,
    betas=(0.9, 0.999),
)

The model defines a `poptorch.Stage` for every layer as well as for the preprocessing and prediction steps. We can now assign an IPU to every stage for a pipelined (or sharded) execution:

In [None]:
pipeline = {
    "preprocessing": 0,
    "layer0": 0,
    "layer1": 1,
    "layer2": 1,
    "layer3": 2,
    "layer4": 2,
    "layer5": 3,
    "prediction": 3,
}
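As a quick sanity check, the mapping can be inverted to see which stages share an IPU and how many IPUs the plan needs. This is plain Python on a copy of the dictionary above and does not touch PopTorch; the variable names are illustrative.

```python
from collections import defaultdict

# Copy of the stage-to-IPU mapping defined above.
pipeline = {
    "preprocessing": 0,
    "layer0": 0,
    "layer1": 1,
    "layer2": 1,
    "layer3": 2,
    "layer4": 2,
    "layer5": 3,
    "prediction": 3,
}

# Group the stage names by the IPU they are placed on.
stages_per_ipu = defaultdict(list)
for stage_name, ipu_id in pipeline.items():
    stages_per_ipu[ipu_id].append(stage_name)

num_pipeline_ipus = len(stages_per_ipu)

for ipu_id in sorted(stages_per_ipu):
    print(f"IPU {ipu_id}: {', '.join(stages_per_ipu[ipu_id])}")
```

With six message passing layers plus preprocessing and prediction, each of the four IPUs holds two stages.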

And assign them using the PopTorch options:

In [None]:
pipeline_plan = [poptorch.Stage(k).ipu(v) for k, v in pipeline.items()]

train_opts = poptorch.Options()
train_opts.setExecutionStrategy(poptorch.PipelinedExecution(*pipeline_plan))
train_opts.Training.gradientAccumulation(16 if available_ipus == 16 else 64)
if available_ipus == 16:
    train_opts.replicationFactor(4)
train_opts.enableExecutableCaching(executable_cache_dir)

test_opts = poptorch.Options()
test_opts.setExecutionStrategy(poptorch.PipelinedExecution(*pipeline_plan))
test_opts.deviceIterations(len(set(pipeline.values())))
test_opts.enableExecutableCaching(executable_cache_dir)
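The two gradient accumulation settings are chosen so that the number of triples contributing to one weight update stays the same with and without replication. The sketch below reproduces that arithmetic under the assumption that each micro batch holds `BATCH_SIZE` triples and that the global batch is the product of micro batch size, gradient accumulation and replication factor; the helper names are hypothetical.

```python
BATCH_SIZE = 6  # micro batch size used in this notebook


def training_config(available_ipus):
    """Mirror the options above: replicate four times only on 16 IPUs."""
    gradient_accumulation = 16 if available_ipus == 16 else 64
    replication_factor = 4 if available_ipus == 16 else 1
    return gradient_accumulation, replication_factor


def triples_per_weight_update(available_ipus, micro_batch_size=BATCH_SIZE):
    """Micro batch size x gradient accumulation x replication factor."""
    gradient_accumulation, replication_factor = training_config(available_ipus)
    return micro_batch_size * gradient_accumulation * replication_factor


global_batch_4_ipus = triples_per_weight_update(4)
global_batch_16_ipus = triples_per_weight_update(16)
```

Both configurations give 384 triples per weight update, so the learning rate does not need to be retuned when moving from 4 to 16 IPUs.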

We wrap the dataloader into a `poptorch.DataLoader`:

In [None]:
for partition in ["train", "valid", "test"]:
    dataloader[partition] = poptorch.DataLoader(
        options=train_opts if partition == "train" else test_opts,
        dataset=dataloader[partition],
        batch_size=1,
        collate_fn=nbfnet_data.custom_collate,
    )

And wrap the model into `poptorch.trainingModel` or `poptorch.inferenceModel`:

In [None]:
model_train = poptorch.trainingModel(model, options=train_opts, optimizer=optim)
model_valid = poptorch.inferenceModel(model, options=test_opts)

Now we are ready to start training.

### Training the model
We run training for 5 epochs on the `IndFB15k-237_v4` subgraph with interleaved validation.

In [None]:
num_epochs = 5

In [None]:
model_train.train()
model_valid.eval()

loss_per_epoch = []
mrr_per_epoch = []
for epoch in range(num_epochs):
    total_loss = 0
    total_count = 0
    for batch in dataloader["train"]:
        loss, count = model_train(**batch)
        loss, count = loss.mean(), count.sum()  # reduction across replicas
        total_loss += float(loss) * count
        total_count += count
    loss_per_epoch.append(total_loss / total_count)
    print(f"Epoch {epoch} finished, training loss {total_loss / total_count:.4}")

    # Interleaved validation
    mrr = 0
    total_count = 0
    model_train.detachFromDevice()
    for batch in dataloader["valid"]:
        prediction, count, mask, _ = model_valid(**batch)
        if isinstance(count, torch.Tensor):
            count = count.sum()
        prediction = prediction[mask]
        true_score = prediction[:, 0:1]
        rank = torch.sum(true_score <= prediction, dim=-1)
        mrr += float(torch.sum(1 / rank))
        total_count += count
    model_valid.detachFromDevice()
    mrr_per_epoch.append(mrr / total_count)
    print(f"Epoch {epoch}, validation MRR {mrr / total_count:.4}")
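The validation metric in the loop above is the mean reciprocal rank (MRR). The following plain-Python sketch mirrors the ranking logic on toy data: the score of the true tail sits in column 0, and its rank is the number of candidates scoring at least as high, so ties count against the model. The function name and the toy scores are illustrative.

```python
def reciprocal_ranks(score_rows):
    """Reciprocal rank of the true tail (column 0) for every row of scores."""
    ranks = [
        sum(1 for score in row if score >= row[0])  # includes the true tail itself
        for row in score_rows
    ]
    return [1.0 / rank for rank in ranks]


# Hypothetical scores for three queries with three candidate tails each.
toy_scores = [
    [0.9, 0.1, 0.5],  # true tail ranked 1st
    [0.2, 0.8, 0.1],  # true tail ranked 2nd
    [0.4, 0.4, 0.3],  # tie with one negative: rank 2
]

toy_reciprocal_ranks = reciprocal_ranks(toy_scores)
toy_mrr = sum(toy_reciprocal_ranks) / len(toy_reciprocal_ranks)
```

An MRR of 1.0 would mean that the true tail is always ranked first.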

### Running inference on the trained model

Finally, we can use our trained model to perform inference on FB15k-237.

In [None]:
inference_opts = poptorch.Options()
inference_opts.setExecutionStrategy(poptorch.ShardedExecution(*pipeline_plan))
inference_opts.enableExecutableCaching(executable_cache_dir)
model_inference = poptorch.inferenceModel(model, options=inference_opts)

We wrap the dataset with a `Prediction` object to simplify the inference process for all the different tasks.

In [None]:
pred = inference_utils.Prediction(
    dataset_inference[0],
    "static/fb15k-237_entitymapping.txt",
    osp.join(dataset_directory, "FB15k-237/raw/"),
)

And run inference with this class.

In [None]:
pred.inference(model_inference, "Good Will Hunting", "genre", top_k=5)

## Running inference on the FB15k-237 graph

Now it is time to test the model on the bigger FB15k-237 graph and make some predictions of the form `(head, relation, ?)`.
We use a simple string comparison to match input strings to graph entities and relations. `pred.entity_vocab` and `pred.relation_vocab` contain lists of all available entities and relations.

Note that the FB15k-237 graph is relatively small and not only lacks edges (which could be inferred using a knowledge graph completion model like this one) but also entities.

`pred.inference` returns a list of entities and respective scores. Tails that occur in the graph are marked with an asterisk.

In [None]:
pred.inference(model_inference, "London", "/location/location/contains", top_k=5)
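To give an idea of how an input string could be resolved to a graph entity or relation, here is a minimal sketch of a case-insensitive lookup that falls back to substring matching. The matching rule in `inference_utils` may differ; the function name and the toy vocabularies are hypothetical.

```python
def match_name(query, vocab):
    """Return the index of the first vocabulary entry matching the query.

    Assumed rule: case-insensitive exact match first, then substring match.
    """
    needle = query.strip().lower()
    lowered = [name.lower() for name in vocab]
    if needle in lowered:
        return lowered.index(needle)
    for index, name in enumerate(lowered):
        if needle in name:
            return index
    raise KeyError(f"No match for {query!r}")


# Hypothetical vocabularies standing in for pred.entity_vocab and pred.relation_vocab.
toy_entities = ["Londonderry", "London", "Good Will Hunting"]
toy_relations = ["/film/film/genre", "/location/location/contains"]

london_id = match_name("london", toy_entities)
genre_id = match_name("genre", toy_relations)
contains_id = match_name("/location/location/contains", toy_relations)
```

Under this rule an exact match wins over a longer entry that merely contains the query, and a short query such as `"genre"` resolves to the first relation containing it.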

## Interpreting the results

Another advantage of the NBFNet model is its interpretability. By passing edge weights of `1.0` along all edges we can later compute the derivative of a prediction with respect to these weights and thus identify the paths that were most important for the prediction:

In [None]:
pred.path_importance(model, head_id=4695, tail_id=5180, relation_id=31)

In [None]:
pred.path_importance(model, head_id=12481, tail_id=1810, relation_id=4)
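The idea of differentiating a prediction with respect to edge weights can be illustrated on a toy model. In this sketch the score is a sum over paths of the product of `weight * strength` along each path, and the derivative is taken numerically at all weights equal to `1.0`. The toy score function, the fixed edge strengths and the finite-difference gradient are assumptions made for illustration; `pred.path_importance` works with the real model and its gradients.

```python
from math import prod

# Hypothetical toy graph: each edge has a fixed strength standing in for a learned message.
EDGE_STRENGTH = {
    ("a", "b"): 0.9,
    ("b", "d"): 0.8,
    ("a", "c"): 0.3,
    ("c", "d"): 0.5,
}
# The two paths from head "a" to tail "d".
PATHS = [
    [("a", "b"), ("b", "d")],
    [("a", "c"), ("c", "d")],
]


def score(weights):
    """Toy prediction: sum over paths of the product of weighted edge strengths."""
    return sum(prod(weights[edge] * EDGE_STRENGTH[edge] for edge in path) for path in PATHS)


def edge_importance(eps=1e-6):
    """Central finite-difference derivative of the score at all weights equal to 1.0."""
    ones = {edge: 1.0 for edge in EDGE_STRENGTH}
    importance = {}
    for edge in EDGE_STRENGTH:
        up, down = dict(ones), dict(ones)
        up[edge] += eps
        down[edge] -= eps
        importance[edge] = (score(up) - score(down)) / (2 * eps)
    return importance


importance = edge_importance()
most_important_edges = sorted(importance, key=importance.get, reverse=True)[:2]
```

The edges on the strong path `a -> b -> d` receive the largest derivatives, which is how the most relevant path for a prediction can be read off.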

## Conclusion
Using a subgraph of FB15k-237 we have trained an inductive link prediction model for knowledge graphs. This model has been used to infer missing connections in the full FB15k-237 graph, and we have made its reasoning visible by outputting the paths in the graph that were most relevant to a given prediction.

As a next step, you could try to speed up training by replicating the model four times on a POD-16, or train on a larger graph. Training on a larger graph could be achieved by reducing the batch size or pipelining the model over more IPUs.

If you are interested in node-level or graph-level tasks, take a look at our other examples. For instance, [Prediction of Molecular Properties using SchNet on Graphcore IPUs](../../schnet/pytorch_geometric/molecular_property_prediction_with_schnet.ipynb) for graph prediction or [Cluster GCN on IPU: Node classification task on a large graph using sampling](../../cluster_gcn/pytorch_geometric/node_classification_with_cluster_gcn.ipynb) for node prediction.