# Using BESS-KGE with your own data

<em>Copyright (c) 2023 Graphcore Ltd. All rights reserved.</em>

BESS-KGE (`besskge`) is a PyTorch library for knowledge graph embedding (KGE) models on IPUs implementing the distribution framework [BESS](https://arxiv.org/abs/2211.12281), with embedding tables stored in the IPU SRAM.

In this notebook we will show how easy it is to pre-process a custom Knowledge Graph dataset to use it with BESS-KGE, thanks to the `besskge.dataset.KGDataset` class.

As an example, we will download and build the [OpenBioLink](https://github.com/openbiolink/openbiolink) biomedical Knowledge Graph. 

## Environment setup

While this notebook doesn't contain any code that needs to be run on IPUs or other accelerating hardware, the best way to run it is on Paperspace Gradient's cloud IPUs because everything is already set up for you.

 [![Run on Gradient](https://assets.paperspace.io/img/gradient-badge.svg)](https://console.paperspace.com/github/graphcore-research/bess-kge?container=graphcore%2Fpytorch-paperspace%3A3.3.0-ubuntu-20.04-20230703&machine=Free-IPU-POD4&file=%2Fnotebooks%2F0_custom_KG_dataset.ipynb)

## Dependencies

We recommend that you install `besskge` directly from the GitHub sources:

In [1]:
import sys
!{sys.executable} -m pip uninstall -y besskge
!{sys.executable} -m pip install -q git+https://github.com/graphcore-research/bess-kge.git

Found existing installation: besskge 0.1
Uninstalling besskge-0.1:
  Successfully uninstalled besskge-0.1


Next, import the necessary dependencies. 

In [6]:
import os
import zipfile
from io import BytesIO

import numpy as np
import pandas as pd
import requests

from besskge.dataset import KGDataset

dataset_directory = os.getenv("DATASET_DIR", "../datasets/") + "/OpenBioLinkHQ/"

## The dataset

We download the directed, high-quality version of [OpenBioLink2020](https://github.com/openbiolink/openbiolink#benchmark-dataset) directly from the link provided by the authors. This shouldn't take more than a minute.

Note that OpenBioLink2020 integrates data from other sources, whose licensing terms are detailed in [this table](https://openbiolink.readthedocs.io/en/latest/sources.html) and must be respected when using or redistributing the dataset files.

In [43]:
URL = "https://zenodo.org/record/3834052/files/HQ_DIR.zip"

res = requests.get(url=URL)
res.raise_for_status()  # stop here if the download failed, instead of unzipping an error page
with zipfile.ZipFile(BytesIO(res.content)) as zip_f:
    zip_f.extractall(path=dataset_directory)

The dataset comes with a predefined train/validation/test split. We load the triples of each part into a separate pandas DataFrame.

In [7]:
column_names = ["h_label", "r_label", "t_label", "quality", "TP/TN", "source"]

train_triples = pd.read_csv(
                    dataset_directory + "HQ_DIR/train_test_data/train_sample.csv",
                    header=None,
                    names=column_names,
                    sep="\t",
                )
valid_triples = pd.read_csv(
                    dataset_directory + "HQ_DIR/train_test_data/val_sample.csv",
                    header=None,
                    names=column_names,
                    sep="\t",
                )
test_triples = pd.read_csv(
                    dataset_directory + "HQ_DIR/train_test_data/test_sample.csv",
                    header=None,
                    names=column_names,
                    sep="\t",
                )

print(f"# training triples: {train_triples.shape[0]:,}")
print(f"# validation triples: {valid_triples.shape[0]:,}")
print(f"# test triples: {test_triples.shape[0]:,}")

# training triples: 4,192,002
# validation triples: 188,394
# test triples: 183,009


In [8]:
train_triples.head()

| | h_label | r_label | t_label | quality | TP/TN | source |
|---|---|---|---|---|---|---|
| 0 | NCBIGENE:11200 | GENE_PHENOTYPE | HP:0009919 | NaN | 1 | HPO |
| 1 | NCBIGENE:2649 | GENE_EXPRESSED_ANATOMY | UBERON:0000059 | gold quality | 1 | Bgee |
| 2 | NCBIGENE:534 | GENE_EXPRESSED_ANATOMY | UBERON:0000467 | gold quality | 1 | Bgee |
| 3 | NCBIGENE:2036 | GENE_BINDING_GENE | NCBIGENE:5295 | 900 | 1 | STRING |
| 4 | NCBIGENE:51195 | GENE_UNDEREXPRESSED_ANATOMY | CL:0000738 | high quality | 1 | Bgee |


The relevant columns in the DataFrame are `h_label`, `r_label` and `t_label`, which contain the labels for the **head entity**, **relation type** and **tail entity**, respectively, of the triples (that is, the graph edges) in the dataset.

It is also worth noting that entities in the Knowledge Graph are of different types. BESS-KGE can leverage this information, for instance by constructing negative samples that corrupt entities only with entities of the same type (see [besskge.negative_sampler.TypeBasedShardedNegativeSampler](https://graphcore-research.github.io/bess-kge/API_ref/negative_sampler.html#besskge.negative_sampler.TypeBasedShardedNegativeSampler)).

In [9]:
entity_types = pd.read_csv(
                    dataset_directory + "HQ_DIR/train_test_data/train_val_nodes.csv",
                    header=None,
                    names=["ent_label", "ent_type"],
                    sep="\t",
                )
entity_types

| | ent_label | ent_type |
|---|---|---|
| 0 | GO:0051505 | GO |
| 1 | PUBCHEM.COMPOUND:10199303 | DRUG |
| 2 | PUBCHEM.COMPOUND:157838 | DRUG |
| 3 | UBERON:0025261 | ANATOMY |
| 4 | PUBCHEM.COMPOUND:10698567 | DRUG |
| ... | ... | ... |
| 184744 | GO:0043279 | GO |
| 184745 | PUBCHEM.COMPOUND:44329448 | DRUG |
| 184746 | GO:0075336 | GO |
| 184747 | PUBCHEM.COMPOUND:5289122 | DRUG |
| 184748 | HP:0000096 | PHENOTYPE |


In [10]:
print("Entity type counts")
entity_types.groupby("ent_type")["ent_type"].count()

Entity type counts


ent_type
ANATOMY      16031
DIS           9505
DRUG         77716
GENE         19604
GO           44944
PATHWAY       2363
PHENOTYPE    14586
Name: ent_type, dtype: int64

## The KGDataset class in BESS-KGE

When using BESS-KGE, the Knowledge Graph data is stored in an instance of the [besskge.dataset.KGDataset](https://graphcore-research.github.io/bess-kge/API_ref/dataset.html#besskge.dataset.KGDataset) class. This class comes with built-in methods to download and build some commonly-used Knowledge Graph datasets, or it can be instantiated manually with custom data by specifying all the required attributes.

The `besskge.dataset.KGDataset.from_dataframe` method lets you **build a custom dataset starting from labelled triples with minimum effort**. It only requires a pandas DataFrame containing all labelled triples (or a dictionary of DataFrames, one for each of the dataset splits). If entities are of different types, as in OpenBioLink, you can pass this information to `KGDataset` as a mapping from entity labels to entity types, in the form of a dictionary or a pandas Series indexed by the entity labels.

In [11]:
entity_types_series = entity_types.set_index("ent_label")["ent_type"]
entity_types_series

ent_label
GO:0051505                          GO
PUBCHEM.COMPOUND:10199303         DRUG
PUBCHEM.COMPOUND:157838           DRUG
UBERON:0025261                 ANATOMY
PUBCHEM.COMPOUND:10698567         DRUG
                               ...    
GO:0043279                          GO
PUBCHEM.COMPOUND:44329448         DRUG
GO:0075336                          GO
PUBCHEM.COMPOUND:5289122          DRUG
HP:0000096                   PHENOTYPE
Name: ent_type, Length: 184749, dtype: object

Now we only need to arrange the train/validation/test DataFrames in a dictionary and call `KGDataset.from_dataframe`, specifying the names of the DataFrame columns that contain the head, relation and tail labels.

In [26]:
df_dict = {"train": train_triples, "valid": valid_triples, "test": test_triples}

openbiolink = KGDataset.from_dataframe(df_dict,
                                        head_column="h_label", relation_column="r_label", tail_column="t_label",
                                        entity_types=entity_types_series)

That was easy, wasn't it? Let us have a closer look at the attributes of the `KGDataset` instance we just created.

In [17]:
print(f"Number of entities: {openbiolink.n_entity:,}\n")
print(f"Number of relation types: {openbiolink.n_relation_type}\n")
print(f"Number of triples: \n training: {openbiolink.triples['train'].shape[0]:,} \n validation: {openbiolink.triples['valid'].shape[0]:,}"
        f"\n test: {openbiolink.triples['test'].shape[0]:,}")

Number of entities: 184,635

Number of relation types: 28

Number of triples: 
 training: 4,192,002 
 validation: 188,394
 test: 183,009


Notice that `KGDataset.n_entity` only counts the entities that appear as head or tail in at least one of the dataset's triples (and likewise `KGDataset.n_relation_type` only counts the relation types that are actually used). This is why `n_entity` (184,635) is smaller than the number of labels in `entity_types_series` (184,749). If we take a look at the `triples` attribute of the `KGDataset`, we see that it is still structured as a dictionary, with the same keys that we used in `df_dict` to identify the different dataset splits.

In [18]:
openbiolink.triples.keys()

dict_keys(['train', 'valid', 'test'])
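The gap between the number of typed entity labels (184,749) and the number of entities used in triples (184,635) can be illustrated on toy data. The sketch below uses made-up labels and plain pandas; it does not call BESS-KGE.

```python
import pandas as pd

# Toy stand-ins for the triples and the entity-type table (hypothetical labels).
toy_triples = pd.DataFrame(
    {
        "h_label": ["GENE:1", "GENE:2", "GENE:1"],
        "r_label": ["EXPRESSED_IN", "EXPRESSED_IN", "BINDS"],
        "t_label": ["ANAT:1", "ANAT:1", "GENE:2"],
    }
)
toy_entity_types = pd.Series(
    {"GENE:1": "GENE", "GENE:2": "GENE", "GENE:3": "GENE", "ANAT:1": "ANATOMY"},
    name="ent_type",
)

# Labels that occur as head or tail of at least one triple.
used_labels = pd.unique(toy_triples[["h_label", "t_label"]].to_numpy().ravel())

# Labels listed in the type table but never used in a triple.
unused_labels = toy_entity_types.index.difference(used_labels)

print(f"typed entities: {len(toy_entity_types)}, used in triples: {len(used_labels)}")
print(f"unused: {list(unused_labels)}")
```

Applied to the concatenation of the three DataFrames in `df_dict` and to `entity_types_series`, the same steps should leave 184,749 - 184,635 = 114 unused labels, provided every entity in the triples is listed in the type table.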

In [19]:
openbiolink.triples["train"]

array([[122052,      0, 183267],
       [107363,      1,  15635],
       [107365,      1,   6510],
       ...,
       [109440,      5, 104537],
       [108133,      1,  14619],
       [111211,      1,   5477]], dtype=int32)

Each row of these NumPy arrays corresponds to a (h,r,t) triple in the dataset, but the entity and relation labels have now been replaced by numerical IDs. We can use `KGDataset.entity_dict` and `KGDataset.relation_dict` to recover the mapping from IDs to labels for entities and relation types, respectively.

In [24]:
openbiolink.entity_dict[:10]

['UBERON:2002263',
 'CL:0000532',
 'UBERON:4300129',
 'UBERON:0004122',
 'UBERON:0004247',
 'UBERON:0002249',
 'UBERON:0018407',
 'UBERON:0000000',
 'UBERON:0001385',
 'CL:0010003']

meaning that the entity with ID 0 is "UBERON:2002263", the entity with ID 1 is "CL:0000532", etc. (and similarly for `openbiolink.relation_dict`).

Let's make a quick sanity check.

In [38]:
part = "test"
triple_number = 123

head_label = openbiolink.entity_dict[openbiolink.triples[part][triple_number,0]]
relation_label = openbiolink.relation_dict[openbiolink.triples[part][triple_number,1]]
tail_label = openbiolink.entity_dict[openbiolink.triples[part][triple_number,2]]

head_label, relation_label, tail_label

('NCBIGENE:9501', 'GENE_EXPRESSED_ANATOMY', 'UBERON:0002084')

which, in the original dataframes, should coincide with

In [42]:
df_dict[part].iloc[triple_number][["h_label", "r_label", "t_label"]]

h_label             NCBIGENE:9501
r_label    GENE_EXPRESSED_ANATOMY
t_label            UBERON:0002084
Name: 123, dtype: object
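The single-triple check can be extended to a whole split at once. The helper below is a hypothetical sketch (`labels_round_trip` is not part of BESS-KGE) and is demonstrated on toy data. It assumes that the rows of the ID array are in the same order as the rows of the DataFrame, as the check above suggests for splits passed as a dictionary.

```python
import numpy as np
import pandas as pd


def labels_round_trip(triples, entity_dict, relation_dict, df,
                      columns=("h_label", "r_label", "t_label")):
    """Return True if mapping every (h, r, t) ID row back to labels reproduces df[columns]."""
    entity_labels = np.asarray(entity_dict, dtype=object)
    relation_labels = np.asarray(relation_dict, dtype=object)
    # Fancy indexing converts whole ID columns to label columns in one step.
    recovered = np.stack(
        [
            entity_labels[triples[:, 0]],
            relation_labels[triples[:, 1]],
            entity_labels[triples[:, 2]],
        ],
        axis=1,
    )
    expected = df[list(columns)].to_numpy()
    if recovered.shape != expected.shape:
        return False
    return bool((recovered == expected).all())


# Toy data with hypothetical labels.
toy_entity_dict = ["ANAT:1", "GENE:1", "GENE:2"]
toy_relation_dict = ["BINDS", "EXPRESSED_IN"]
toy_df = pd.DataFrame(
    {
        "h_label": ["GENE:1", "GENE:1"],
        "r_label": ["EXPRESSED_IN", "BINDS"],
        "t_label": ["ANAT:1", "GENE:2"],
    }
)
toy_triples = np.array([[1, 1, 0], [1, 0, 2]], dtype=np.int32)

print(labels_round_trip(toy_triples, toy_entity_dict, toy_relation_dict, toy_df))
```

In this notebook the call would be `labels_round_trip(openbiolink.triples[part], openbiolink.entity_dict, openbiolink.relation_dict, df_dict[part])`.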

It is important to note that, when entities have different types, numerical entity IDs need to be assigned so that **entities of the same type have contiguous IDs**! This is done automatically behind the scenes by `KGDataset.from_dataframe`, but you need to keep it in mind if you instantiate the `KGDataset` class manually.
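For readers preparing IDs by hand, here is a minimal sketch of one way to obtain contiguous IDs per type, on toy data with made-up labels. It is not the code used by `KGDataset.from_dataframe`: the alphabetical ordering of types and of labels within a type is an arbitrary choice made for this example, and the library may order entities differently.

```python
# Hypothetical entity -> type mapping and labelled triples.
toy_entity_types = {
    "GENE:7": "GENE",
    "ANAT:3": "ANATOMY",
    "GENE:2": "GENE",
    "DRUG:5": "DRUG",
    "ANAT:1": "ANATOMY",
}
toy_triples = [
    ("GENE:7", "EXPRESSED_IN", "ANAT:3"),
    ("DRUG:5", "TARGETS", "GENE:2"),
]

# Sort by (type, label): all entities of one type end up next to each other.
labels_by_type = sorted(toy_entity_types.items(), key=lambda kv: (kv[1], kv[0]))
entity_dict = [label for label, _ in labels_by_type]  # ID -> label
entity_to_id = {label: idx for idx, label in enumerate(entity_dict)}  # label -> ID

# The offset of a type is the ID of its first entity.
type_offsets = {}
for idx, (_, ent_type) in enumerate(labels_by_type):
    type_offsets.setdefault(ent_type, idx)

relation_dict = sorted({r for _, r, _ in toy_triples})  # ID -> relation label
relation_to_id = {label: idx for idx, label in enumerate(relation_dict)}

# Replace labels with IDs in every triple.
triple_ids = [
    (entity_to_id[h], relation_to_id[r], entity_to_id[t]) for h, r, t in toy_triples
]

print(entity_dict)
print(type_offsets)
print(triple_ids)
```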

Since entity IDs are now clustered by type, we only need to know the ID ranges corresponding to the different types, which are stored in `KGDataset.type_offsets`:

In [25]:
openbiolink.type_offsets

{'ANATOMY': 0,
 'DIS': 16031,
 'DRUG': 25526,
 'GENE': 103144,
 'GO': 122742,
 'PATHWAY': 167686,
 'PHENOTYPE': 170049}

meaning that entities with ID from 0 to 16030 are of type 'ANATOMY', from 16031 to 25525 are of type 'DIS', etc.
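Because the ranges are contiguous, the type of any entity ID can be looked up with a binary search over the offsets. The sketch below copies the offsets and the first three training triples from the outputs shown earlier in this notebook. It relies on the dictionary keys being listed in increasing order of offset, as they are here.

```python
import numpy as np

# Copied from the `openbiolink.type_offsets` output.
type_offsets = {
    "ANATOMY": 0,
    "DIS": 16031,
    "DRUG": 25526,
    "GENE": 103144,
    "GO": 122742,
    "PATHWAY": 167686,
    "PHENOTYPE": 170049,
}
type_names = list(type_offsets)
starts = np.array(list(type_offsets.values()))

# First three rows of the `openbiolink.triples["train"]` output.
triples = np.array(
    [[122052, 0, 183267], [107363, 1, 15635], [107365, 1, 6510]], dtype=np.int32
)

# For each head/tail ID, count how many offsets are <= ID; minus one gives the type index.
entity_type_ids = np.searchsorted(starts, triples[:, [0, 2]], side="right") - 1

print(entity_type_ids)
print([(type_names[h], type_names[t]) for h, t in entity_type_ids])
```

The first row resolves to ('GENE', 'PHENOTYPE'), which agrees with the first training triple in the DataFrame (an `NCBIGENE` head, a `GENE_PHENOTYPE` relation and an `HP` tail).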

The type IDs (assigned following the order of the keys in `KGDataset.type_offsets`) for heads and tails of all triples in the dataset can be immediately recovered using `KGDataset.ht_types`:

In [35]:
openbiolink.ht_types.keys()

dict_keys(['train', 'valid', 'test'])

In [36]:
openbiolink.ht_types["train"].shape, openbiolink.ht_types["valid"].shape, openbiolink.ht_types["test"].shape

((4192002, 2), (188394, 2), (183009, 2))

Each row of these arrays stores the ID of the type of the head entity and tail entity of the corresponding triple, e.g.

In [50]:
part = "valid"
triple_number = 0

openbiolink.ht_types[part][triple_number]

array([3, 0])

means that the first validation triple has head entity of type 'GENE' (the fourth key of `openbiolink.type_offsets`) and tail entity of type 'ANATOMY' (the first key of `openbiolink.type_offsets`). Indeed, we can check:

In [48]:
type_head = entity_types_series[df_dict[part].iloc[triple_number]["h_label"]]
type_tail = entity_types_series[df_dict[part].iloc[triple_number]["t_label"]]

type_head, type_tail

('GENE', 'ANATOMY')
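Type IDs and contiguous ID ranges together make it cheap to draw negative entities of the same type as the entity being corrupted, the idea mentioned earlier in connection with type-based negative sampling. The following is only an illustrative NumPy sketch, not the implementation of the BESS-KGE negative samplers. The function name `sample_same_type` is hypothetical, and the offsets and entity count are copied from the outputs above.

```python
import numpy as np

# Offsets copied from the `openbiolink.type_offsets` output; n_entity from the earlier printout.
starts = np.array([0, 16031, 25526, 103144, 122742, 167686, 170049])
n_entity = 184_635
ends = np.append(starts[1:], n_entity)  # exclusive upper bound of each type's ID range


def sample_same_type(type_ids, n_negative, rng):
    """For each type ID, draw n_negative entity IDs uniformly from that type's ID range."""
    type_ids = np.asarray(type_ids)
    low = starts[type_ids][:, None]
    high = ends[type_ids][:, None]
    # low/high broadcast against the requested output shape (one row per input type ID).
    return rng.integers(low, high, size=(len(type_ids), n_negative))


rng = np.random.default_rng(0)
tail_type_ids = np.array([6, 0, 0])  # for example, the tail column of an `ht_types` array
negatives = sample_same_type(tail_type_ids, n_negative=4, rng=rng)
print(negatives.shape)
```

Note that this simple version can return the original entity itself, or an entity that forms a true triple; filtering those out is left aside here.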

## Random dataset split

What if no custom train/validation/test split is provided? `KGDataset.from_dataframe` can perform a random split, with the desired ratios between the three parts. This happens whenever it is provided with a single pandas DataFrame, instead of a dictionary of DataFrames as before.

In [58]:
# Merge all triples in a single DataFrame
df_all_triples = pd.concat([train_triples, valid_triples, test_triples], axis=0)

print(f"Total number of triples: {df_all_triples.shape[0]:,}")

Total number of triples: 4,563,405


In [57]:
# 80/10/10 train/valid/test split
split_ratios = (0.8, 0.1, 0.1)

openbiolink_random = KGDataset.from_dataframe(df_all_triples,
                                        head_column="h_label", relation_column="r_label", tail_column="t_label",
                                        entity_types=entity_types_series,
                                        split=split_ratios)

print(f"Number of entities: {openbiolink_random.n_entity:,}\n")
print(f"Number of relation types: {openbiolink_random.n_relation_type}\n")
print(f"Number of triples: \n training: {openbiolink_random.triples['train'].shape[0]:,} \n validation: {openbiolink_random.triples['valid'].shape[0]:,}"
        f"\n test: {openbiolink_random.triples['test'].shape[0]:,}")

Number of entities: 184,635

Number of relation types: 28

Number of triples: 
 training: 3,650,724 
 validation: 456,340
 test: 456,341


## Conclusions and next steps

The `KGDataset` class has a few additional attributes, which can be defined when instantiating the class manually. For instance, you can specify for each triple a set of negative heads and tails to be used to corrupt that triple. This is useful when you want to find the best completion for a (h,r,?)/(?,r,t) query only among a specific set of candidate nodes, or when you have already identified good negative samples that you want to use during training.
For more information on the `KGDataset` class, have a look at the [BESS-KGE documentation](https://graphcore-research.github.io/bess-kge/API_ref/dataset.html#besskge.dataset.KGDataset).

Once you have built your dataset as a `KGDataset` object, you are ready to use BESS-KGE to train your preferred KGE model and perform inference with it! To learn how, we suggest starting from the introductory [KGE Training and Inference on OGBL-BioKG](1_biokg_training_inference.ipynb).