# Using BESS-KGE with your Own Data

<em>Copyright (c) 2023 Graphcore Ltd. All rights reserved.</em>

BESS-KGE (`besskge`) is a PyTorch library for knowledge graph embedding (KGE) models on IPUs implementing the distribution framework [BESS](https://arxiv.org/abs/2211.12281), with embedding tables stored in the IPU SRAM.

In this notebook we will show how to use the `besskge.dataset.KGDataset` class to easily pre-process a custom knowledge graph dataset for use with BESS-KGE.

As an example, we will download and build the [OGBL-BioKG](https://ogb.stanford.edu/docs/linkprop/#ogbl-biokg) biomedical knowledge graph. While BESS-KGE provides a built-in dataloader for this dataset (see [besskge.dataset.KGDataset.build_ogbl_biokg](https://graphcore-research.github.io/bess-kge/generated/besskge.dataset.KGDataset.html#besskge.dataset.KGDataset.build_ogbl_biokg)), in this notebook — for didactic purposes — we will show how to import the dataset from scratch.

## Environment setup

While this notebook doesn't contain any code that needs to be run on IPUs or other accelerator hardware, the best way to run it is on Paperspace Gradient's cloud IPUs because everything is already set up for you.

 [![Run on Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/graphcore-research/bess-kge?container=graphcore%2Fpytorch-paperspace%3A3.3.0-ubuntu-20.04-20230703&machine=Free-IPU-POD4&file=%2Fnotebooks%2F0_custom_KG_dataset.ipynb)

## Dependencies

We recommend that you install `besskge` directly from the GitHub sources:

In [1]:
import sys
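# Use the interpreter running this notebook for both commands, so they act on the same environment
!{sys.executable} -m pip uninstall -y besskge
!{sys.executable} -m pip install -q git+https://github.com/graphcore-research/bess-kge.git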

Found existing installation: besskge 0.1
Uninstalling besskge-0.1:
  Successfully uninstalled besskge-0.1


Next, import the necessary dependencies:

In [1]:
import os
from pathlib import Path

import ogb
import pandas as pd
import torch

from besskge.dataset import KGDataset

dataset_directory = os.getenv("DATASET_DIR", "../datasets/") + "/biokg/"

## The dataset

We download the OGBL-BioKG knowledge graph using the `ogb` package (see the OGB [description of the data loader](https://ogb.stanford.edu/docs/linkprop/#data-loader) for details on how to use it). It shouldn't take more than a couple of minutes for the dataset to download.

In [2]:
dataset = ogb.linkproppred.LinkPropPredDataset(
    name="ogbl-biokg", root=dataset_directory
)

Since the objective here is to show how to build the BESS-KGE `KGDataset` class from scratch, we will not use any of the pre-processing utilities provided by `ogb`. Instead we will use the raw source files directly. To start from the most generic case, we will actually undo some of the preprocessing already performed on the data, namely the mapping from entity labels to entity indices for the different entity types.

In [3]:
label_dict = {}
for l in ["disease", "drug", "function", "protein", "sideeffect"]:
    # labels for the entities of type l
    # we append the entity type to the label to prevent label collisions across different types
    # notice that some entities, like proteins, have labels that are still numerical
    label_dict[l] = (
        pd.read_csv(
            Path(dataset_directory).joinpath(
                f"ogbl_biokg/mapping/{l}_entidx2name.csv.gz"
            )
        ).set_index("ent idx")["ent name"]
        + f" ({l})"
    ).values

# labels for relation types
rel_dict = (
    pd.read_csv(
        Path(dataset_directory).joinpath("ogbl_biokg/mapping/relidx2relname.csv.gz")
    )
    .set_index("rel idx")["rel name"]
    .values
)

# collect triples in train, valid, test DataFrames
# replacing the entity IDs with their original labels.
df_dict = {}
for split in {"test", "train", "valid"}:
    triples = []
    data = torch.load(
        Path(dataset_directory).joinpath(f"ogbl_biokg/split/random/{split}.pt")
    )
    for h, h_type, t, t_type, r in zip(
        data["head"],
        data["head_type"],
        data["tail"],
        data["tail_type"],
        data["relation"],
    ):
        h_label = label_dict[h_type][h]
        t_label = label_dict[t_type][t]
        r_label = rel_dict[r]
        triples.append((h_label, r_label, t_label))
    df_dict[split] = pd.DataFrame(
        triples, columns=["head_label", "relation_label", "tail_label"]
    )

print(f"# train triples: {df_dict['train'].shape[0]:,}")
print(f"# validation triples: {df_dict['valid'].shape[0]:,}")
print(f"# test triples: {df_dict['test'].shape[0]:,}")

# train triples: 4,762,678
# validation triples: 162,886
# test triples: 162,870


In [4]:
df_dict["train"].head()

|   | head_label | relation_label | tail_label |
|---|---|---|---|
| 0 | C0038586 (disease) | disease-protein | 1653 (protein) |
| 1 | C0751849 (disease) | disease-protein | 718 (protein) |
| 2 | C1320474 (disease) | disease-protein | 8622 (protein) |
| 3 | C0270844 (disease) | disease-protein | 3569 (protein) |
| 4 | C4279912 (disease) | disease-protein | 8856 (protein) |


We are now at a very generic starting point, where the **edges of the knowledge graph are represented as a list of (head, relation, tail) triples**, with *unique labels* for entities and relations.

In our knowledge graph, moreover, entities are of different types (disease, drug, function, protein and side-effect). This is not always the case, but BESS-KGE can leverage this additional information, for instance by constructing negative samples where an entity is replaced only by entities of the same type (see [besskge.negative_sampler.TypeBasedShardedNegativeSampler](https://graphcore-research.github.io/bess-kge/API_ref/negative_sampler.html#besskge.negative_sampler.TypeBasedShardedNegativeSampler)). We store this data by creating a mapping from entity labels to entity types.

In [5]:
entity_types = {l: t for t in label_dict.keys() for l in label_dict[t]}
entity_types = pd.DataFrame(
    {"ent_label": entity_types.keys(), "ent_type": entity_types.values()}
)
entity_types

|   | ent_label | ent_type |
|---|---|---|
| 0 | C0000737 (disease) | disease |
| 1 | C0000744 (disease) | disease |
| 2 | C0000768 (disease) | disease |
| 3 | C0000771 (disease) | disease |
| 4 | C0000772 (disease) | disease |
| ... | ... | ... |
| 93768 | C3665444 (sideeffect) | sideeffect |
| 93769 | C3665596 (sideeffect) | sideeffect |
| 93770 | C3665609 (sideeffect) | sideeffect |
| 93771 | C3665624 (sideeffect) | sideeffect |
| 93772 | C3665888 (sideeffect) | sideeffect |


In [6]:
# entity type counts
entity_types.groupby("ent_type")["ent_type"].count()

ent_type
disease       10687
drug          10533
function      45085
protein       17499
sideeffect     9969
Name: ent_type, dtype: int64
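
The type-aware corruption mentioned above can be illustrated with a minimal sketch. It is not the implementation of the BESS-KGE negative sampler: the toy labels and the helper `sample_same_type_negative` are hypothetical and only show the idea of replacing an entity with another entity of the same type.

```python
import random

# Toy mapping from entity labels to types (hypothetical labels, not from OGBL-BioKG)
toy_entity_types = {
    "d1 (disease)": "disease",
    "d2 (disease)": "disease",
    "d3 (disease)": "disease",
    "p1 (protein)": "protein",
    "p2 (protein)": "protein",
}

# Group labels by type so that candidates can be drawn from a single type
labels_by_type = {}
for label, ent_type in toy_entity_types.items():
    labels_by_type.setdefault(ent_type, []).append(label)


def sample_same_type_negative(label, rng):
    """Return a random entity of the same type as `label`, different from it."""
    same_type = labels_by_type[toy_entity_types[label]]
    candidates = [c for c in same_type if c != label]
    return rng.choice(candidates)


rng = random.Random(0)
positive = ("d1 (disease)", "disease-protein", "p1 (protein)")

# Corrupt the tail: the replacement is always another protein
negative = (positive[0], positive[1], sample_same_type_negative(positive[2], rng))
print(negative)
```

In this toy graph the only other protein is `p2 (protein)`, so the corrupted tail can never be a disease.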

## The KGDataset class in BESS-KGE

When using BESS-KGE, the knowledge graph data is stored in an instance of the [besskge.dataset.KGDataset](https://graphcore-research.github.io/bess-kge/API_ref/dataset.html#besskge.dataset.KGDataset) class. This class has built-in methods to download and build some commonly-used knowledge graph datasets, or it can be instantiated manually with custom data by specifying all the required attributes.

The `besskge.dataset.KGDataset.from_dataframe` method is a convenient way to **build a custom dataset starting from labelled triples with minimum effort**. It simply requires a pandas DataFrame containing all labelled triples (or a dictionary of DataFrames, one for each of the dataset splits). If entities are of different types, as is the case for OGBL-BioKG, this can be communicated to `KGDataset` by providing it with a mapping of entity labels to entity types, in the form of a dictionary or a pandas Series indexed by the entity labels.

In [7]:
entity_types_series = entity_types.set_index("ent_label")["ent_type"]
entity_types_series

ent_label
C0000737 (disease)          disease
C0000744 (disease)          disease
C0000768 (disease)          disease
C0000771 (disease)          disease
C0000772 (disease)          disease
                            ...    
C3665444 (sideeffect)    sideeffect
C3665596 (sideeffect)    sideeffect
C3665609 (sideeffect)    sideeffect
C3665624 (sideeffect)    sideeffect
C3665888 (sideeffect)    sideeffect
Name: ent_type, Length: 93773, dtype: object
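
Before passing your own data to `KGDataset.from_dataframe`, it can help to check that the inputs have the form described above. The following is a minimal sketch on toy data. The labels and the names `toy_triples` and `toy_types` are hypothetical, and the two checks are a suggestion rather than something the library is known to require.

```python
import pandas as pd

# Toy labelled triples (hypothetical labels, same column names as df_dict)
toy_triples = pd.DataFrame(
    {
        "head_label": ["d1 (disease)", "x1 (drug)"],
        "relation_label": ["disease-protein", "drug-disease"],
        "tail_label": ["p1 (protein)", "d1 (disease)"],
    }
)

# Toy mapping from entity labels to entity types, indexed by label
toy_types = pd.Series(
    {"d1 (disease)": "disease", "x1 (drug)": "drug", "p1 (protein)": "protein"},
    name="ent_type",
)

# Every label used as head or tail should have a type assigned
labels_in_triples = set(toy_triples["head_label"]) | set(toy_triples["tail_label"])
missing = sorted(labels_in_triples - set(toy_types.index))

# Labels are meant to be unique, so the index should contain no duplicates
print("labels without a type:", missing)
print("labels are unique:", toy_types.index.is_unique)
```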

To build `KGDataset` we only need to call `KGDataset.from_dataframe`, specifying the names of the columns of the dataframes which contain the entity and relation labels.

In [8]:
biokg = KGDataset.from_dataframe(
    df_dict,
    head_column="head_label",
    relation_column="relation_label",
    tail_column="tail_label",
    entity_types=entity_types_series,
)

That was easy, wasn't it? Let us have a closer look at the attributes of the `KGDataset` instance we just created.

In [9]:
print(f"Number of entities: {biokg.n_entity:,}\n")
print(f"Number of relation types: {biokg.n_relation_type}\n")
print(
    f"Number of triples: \n training: {biokg.triples['train'].shape[0]:,} \n validation: {biokg.triples['valid'].shape[0]:,}"
    f"\n test: {biokg.triples['test'].shape[0]:,}"
)

Number of entities: 93,773

Number of relation types: 51

Number of triples: 
 training: 4,762,678 
 validation: 162,886
 test: 162,870


Note that only the entities that appear as head or tail in at least one of the dataset's triples are counted in `KGDataset.n_entity` (and the same for relations in `KGDataset.n_relation_type`).

If we take a look at the `triples` attribute of `KGDataset`, we see that it is still structured as a dictionary, with the same keys that we used in `df_dict` to identify the different dataset splits.

In [10]:
biokg.triples.keys()

dict_keys(['valid', 'test', 'train'])

In [11]:
biokg.triples["train"]

array([[10120,     0, 79687],
       [10137,     0, 79168],
       [ 5659,     0, 78209],
       ...,
       [73643,    50, 73701],
       [80650,    50, 72104],
       [80643,    50, 76000]], dtype=int32)

Each row of these NumPy arrays corresponds to a (h,r,t) triple in the dataset, but now the **labels for entities and relations have been replaced by numerical IDs**. We can use `KGDataset.entity_dict` and `KGDataset.relation_dict` to recover the mapping from IDs to labels for entities and relation types respectively.

In [12]:
biokg.entity_dict[:10]

['C0393778 (disease)',
 'C2677109 (disease)',
 'C0796093 (disease)',
 'C2931278 (disease)',
 'C0149910 (disease)',
 'C3668942 (disease)',
 'C4225263 (disease)',
 'C0393814 (disease)',
 'C0342883 (disease)',
 'C4017556 (disease)']

This means that the entity with ID 0 is "C0393778 (disease)", the entity with ID 1 is "C2677109 (disease)", and so on (and similarly for `biokg.relation_dict`).
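
If you need the opposite direction, from label to ID, one option is to invert the list. A minimal sketch, where `toy_entity_dict` is a stand-in for `biokg.entity_dict` holding the first three labels printed above:

```python
# Stand-in for biokg.entity_dict: the position in the list is the entity ID
toy_entity_dict = [
    "C0393778 (disease)",
    "C2677109 (disease)",
    "C0796093 (disease)",
]

# Invert the list into a label -> ID lookup table
label_to_id = {label: idx for idx, label in enumerate(toy_entity_dict)}

print(label_to_id["C2677109 (disease)"])  # 1
```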

Let's do a quick sanity check.

In [13]:
part = "test"
triple_number = 1234

head_label = biokg.entity_dict[biokg.triples[part][triple_number, 0]]
relation_label = biokg.relation_dict[biokg.triples[part][triple_number, 1]]
tail_label = biokg.entity_dict[biokg.triples[part][triple_number, 2]]

head_label, relation_label, tail_label

('C0751495 (disease)', 'disease-protein', '7249 (protein)')

In the original dataframes, this should coincide with:

In [14]:
df_dict[part].iloc[triple_number][["head_label", "relation_label", "tail_label"]]

head_label        C0751495 (disease)
relation_label       disease-protein
tail_label            7249 (protein)
Name: 1234, dtype: object

It is important to note that, when entities have different types, the numerical entity IDs need to be assigned so that **entities of the same type have contiguous IDs**! This is done automatically when using `KGDataset.from_dataframe`, but it needs to be kept in mind if you are instantiating the `KGDataset` class manually.
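
If you do instantiate the class manually, one way to obtain contiguous IDs per type is to sort the entity labels by type before numbering them. The sketch below uses hypothetical toy labels and illustrates the requirement only; it is not the code used inside `from_dataframe`.

```python
# Toy mapping from entity labels to types (hypothetical labels)
toy_entity_types = {
    "p1 (protein)": "protein",
    "d1 (disease)": "disease",
    "x1 (drug)": "drug",
    "d2 (disease)": "disease",
    "p2 (protein)": "protein",
}

# Sort by (type, label): entities of the same type become adjacent
sorted_labels = sorted(
    toy_entity_types, key=lambda label: (toy_entity_types[label], label)
)

# The position in the sorted list is the entity ID
entity_ids = {label: idx for idx, label in enumerate(sorted_labels)}

# Record the first ID assigned to each type
first_id_of_type = {}
for idx, label in enumerate(sorted_labels):
    first_id_of_type.setdefault(toy_entity_types[label], idx)

print(sorted_labels)
print(first_id_of_type)
```

Here the two diseases receive IDs 0 and 1, the drug receives ID 2, and the two proteins receive IDs 3 and 4.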

Since entity IDs are now clustered by type, we only need to know the ID ranges corresponding to the different types, which are stored in `KGDataset.type_offsets`:

In [15]:
biokg.type_offsets

{'disease': 0,
 'drug': 10687,
 'function': 21220,
 'protein': 66305,
 'sideeffect': 83804}

This means that entities with ID from 0 to 10686 are of type 'disease', from 10687 to 21219 are of type 'drug' and so on.
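
The range lookup described above can be sketched with a binary search over the offsets. The helper `entity_type_of` is hypothetical (it is not part of `besskge`) and assumes that the offsets are listed in ascending order, as in the output above.

```python
from bisect import bisect_right

# Offsets as printed above by biokg.type_offsets
type_offsets = {
    "disease": 0,
    "drug": 10687,
    "function": 21220,
    "protein": 66305,
    "sideeffect": 83804,
}
type_names = list(type_offsets)
offsets = list(type_offsets.values())


def entity_type_of(entity_id):
    """Return the type whose ID range contains `entity_id`.

    No upper bound is checked: any ID >= 83804 is reported as 'sideeffect'.
    """
    return type_names[bisect_right(offsets, entity_id) - 1]


print(entity_type_of(10686))  # disease
print(entity_type_of(10687))  # drug
print(entity_type_of(79687))  # protein
```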

The type IDs (assigned following the order of the keys in `KGDataset.type_offsets`) for heads and tails of all triples in the dataset can be immediately recovered using `KGDataset.ht_types`:

In [16]:
biokg.ht_types.keys()

dict_keys(['valid', 'test', 'train'])

In [17]:
print(
    biokg.ht_types["train"].shape,
    biokg.ht_types["valid"].shape,
    biokg.ht_types["test"].shape
)

(4762678, 2) (162886, 2) (162870, 2)


Each row of these arrays stores the ID of the type of the head entity and tail entity of the corresponding triple, for example:

In [18]:
part = "valid"
triple_number = 0

biokg.ht_types[part][triple_number]

array([0, 3])

This means that the first validation triple has a head entity of type 'disease' (the 0-th key of `biokg.type_offsets`) and a tail entity of type 'protein' (the key of `biokg.type_offsets` with index 3). Indeed, we can check:

In [19]:
type_head = entity_types_series[df_dict[part].iloc[triple_number]["head_label"]]
type_tail = entity_types_series[df_dict[part].iloc[triple_number]["tail_label"]]

type_head, type_tail

('disease', 'protein')
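
For a whole array of triples, the same type IDs can be recomputed from the entity IDs and the type offsets. This is a minimal sketch assuming ascending offsets, using two rows copied from the `biokg.triples["train"]` output shown earlier; it is not necessarily how `KGDataset` builds `ht_types`.

```python
import numpy as np

# Offsets as printed above by biokg.type_offsets (ascending order)
offsets = np.array([0, 10687, 21220, 66305, 83804])

# Two (head, relation, tail) rows copied from the biokg.triples["train"] output
triples = np.array([[10120, 0, 79687], [73643, 50, 73701]])

# Keep only the head and tail columns
head_tail_ids = triples[:, [0, 2]]

# For each ID, the index of the last offset <= ID is the type ID
ht_type_ids = np.searchsorted(offsets, head_tail_ids, side="right") - 1

print(ht_type_ids)
```

The first row maps to type IDs 0 and 3 (a disease head and a protein tail), the second to 3 and 3 (both proteins).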

## Random dataset split

What if no custom train/validation/test split is provided? `KGDataset.from_dataframe` can perform a random split, with the desired ratios between the three parts. This happens whenever it is provided with a single pandas DataFrame, instead of a dictionary of DataFrames as before.

In [20]:
# Merge all triples in a single DataFrame
df_all_triples = pd.concat(
    [df_dict["train"], df_dict["valid"], df_dict["test"]], axis=0
)

print(f"Total number of triples: {df_all_triples.shape[0]:,}")

Total number of triples: 5,088,434


In [21]:
# 80/10/10 train/valid/test split
split_ratios = (0.8, 0.1, 0.1)

biokg_random = KGDataset.from_dataframe(
    df_all_triples,
    head_column="head_label",
    relation_column="relation_label",
    tail_column="tail_label",
    entity_types=entity_types_series,
    split=split_ratios,
)

print(f"Number of entities: {biokg_random.n_entity:,}\n")
print(f"Number of relation types: {biokg_random.n_relation_type}\n")
print(
    f"Number of triples: \n training: {biokg_random.triples['train'].shape[0]:,} \n validation: {biokg_random.triples['valid'].shape[0]:,}"
    f"\n test: {biokg_random.triples['test'].shape[0]:,}"
)

Number of entities: 93,773

Number of relation types: 51

Number of triples: 
 training: 4,070,747 
 validation: 508,843
 test: 508,844


## Conclusions and next steps

The `KGDataset` class has a few additional attributes, which can be defined when instantiating the class manually. For instance, for each triple it allows you to specify a set of negative heads and tails that should be used to corrupt that triple. This is useful when you want to find the best completion for a (h,r,?)/(?,r,t) query only among a specific set of candidate nodes, or you have already identified good negative samples that you want to use during training.

For more information on the `KGDataset` class, have a look at the [BESS-KGE documentation](https://graphcore-research.github.io/bess-kge/API_ref/dataset.html#besskge.dataset.KGDataset).

Once you have built your dataset as a `KGDataset` instance, you are ready to use BESS to train your preferred KGE model and perform inference with it! To learn how, we suggest starting from the introductory [KGE Training and Inference on OGBL-BioKG](1_biokg_training_inference.ipynb) notebook.