# PyKEEN Negative Sampling Extension: An Introductory Tutorial

This tutorial walks you through the basic usage of the new negative sampler classes. Before running it, unzip the provided dataset file into the data/ folder so that the code can find the data.

In [None]:
from pathlib import Path

import pykeen
import torch
from pykeen.pipeline import pipeline
from pykeen.sampling import (
    BasicNegativeSampler,
    BernoulliNegativeSampler,
    negative_sampler_resolver,
)
from pykeen.sampling.filtering import filterer_resolver
from pykeen.training import SLCWATrainingLoop

from extension.dataset import OnMemoryDataset
from extension.filtering import NullPythonSetFilterer
from extension.sampling import (
    CorruptNegativeSampler,
    NearestNeighbourNegativeSampler,
    NearMissNegativeSampler,
    RelationalNegativeSampler,
    SubSetNegativeSampler,
    TypedNegativeSampler,
)
import extension.constants as const

The following cell derives the data path from the location of this tutorial. If your data lives elsewhere, replace this path with the path to your own "data" folder.

In [2]:
data_path = Path.cwd().parent / "data"
data_path

PosixPath('/Users/navis/dev/refactor-negative-sampler/data')

## Loading the provided datasets

Let's load the YAGO4-20 dataset together with the additional metadata provided with it.

In [None]:
dataset = OnMemoryDataset(
    data_path=data_path / "YAGO4-20", load_domain_range=True, load_entity_classes=True
)

You can now use all the basic functionality of a standard PyKEEN dataset, plus the loaded domain and range properties of the relations and the class membership of the entities. Let's look at them:

In [4]:
print(f"Num Entities {dataset.num_entities}")
print(f"Num Relations {dataset.num_relations}")
print(f"Relation Mapping {dataset.relation_to_id}")

relation_id = 10
entity_id = 50

id_to_entity = {v: k for k, v in dataset.entity_to_id.items()}
id_to_relation = {v: k for k, v in dataset.relation_to_id.items()}

print(
    f"Domain and Range of '{id_to_relation[relation_id]}': {dataset.relation_id_to_domain_range[relation_id]}"
)
print(
    f"Entity Classes of '{id_to_entity[entity_id]}': {dataset.entity_id_to_classes[entity_id]}"
)

Num Entities 96910
Num Relations 70
Relation Mapping {'about': 0, 'actor': 1, 'affiliation': 2, 'alumniOf': 3, 'author': 4, 'award': 5, 'bioChemInteraction': 6, 'birthPlace': 7, 'brand': 8, 'byArtist': 9, 'character': 10, 'children': 11, 'citation': 12, 'competitor': 13, 'composer': 14, 'containedInPlace': 15, 'containsPlace': 16, 'contentLocation': 17, 'contributor': 18, 'copyrightHolder': 19, 'countryOfOrigin': 20, 'creator': 21, 'deathPlace': 22, 'director': 23, 'editor': 24, 'exampleOfWork': 25, 'familyName': 26, 'founder': 27, 'foundingLocation': 28, 'gender': 29, 'genre': 30, 'givenName': 31, 'hasMolecularFunction': 32, 'hasOccupation': 33, 'hasPart': 34, 'homeLocation': 35, 'honorificPrefix': 36, 'inLanguage': 37, 'isBasedOn': 38, 'isInvolvedInBiologicalProcess': 39, 'isLocatedInSubcellularLocation': 40, 'isPartOf': 41, 'knowsLanguage': 42, 'license': 43, 'location': 44, 'locationCreated': 45, 'material': 46, 'memberOf': 47, 'musicBy': 48, 'nationality': 49, 'parent': 50, 'paren

## Using the static negative samplers 

Let's instantiate some static negative samplers: Corrupt, Typed (which uses the additional metadata), and Relational. We simply use the PyKEEN interface and pass in the additional metadata loaded with the dataset where it is required. Here we set `integrate` to False to disable the integration of random negatives, in order to showcase the negatives actually generated by these methods. Since Relational uses a heavy filtering criterion, it creates a cache file to store the subset.

In [5]:
filterer = NullPythonSetFilterer(mapped_triples=dataset.training.mapped_triples)

samplers = {
    "Corrupt": CorruptNegativeSampler(
        mapped_triples=dataset.training.mapped_triples,
        filtered=True,
        filterer=filterer,
        num_negs_per_pos=5,
        integrate=False,
    ),
    "Typed": TypedNegativeSampler(
        mapped_triples=dataset.training.mapped_triples,
        filtered=True,
        filterer=filterer,
        num_negs_per_pos=5,
        entity_classes_dict=dataset.entity_id_to_classes,
        relation_domain_range_dict=dataset.relation_id_to_domain_range,
        integrate=False,
    ),
    "Relational": RelationalNegativeSampler(
        mapped_triples=dataset.training.mapped_triples,
        filtered=True,
        filterer=filterer,
        num_negs_per_pos=5,
        local_file="relational_cached.bin",
        integrate=False,
    ),
}

[RelationalNegativeSampler] Loading Pre-Computed Subset
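
To make the role of the extra metadata more concrete, here is a plain-Python sketch of how a type-constrained negative pool could be built: an entity is a candidate head if it belongs to the relation's domain class, and a candidate tail if it belongs to its range class. The dictionary layouts and the helper `typed_candidates` are assumptions made for illustration only; the structures returned by `OnMemoryDataset` and the logic inside `TypedNegativeSampler` may differ.

```python
# Toy metadata; both layouts are assumed for illustration only.
entity_classes = {
    0: {"Person"},
    1: {"Person", "Actor"},
    2: {"Movie"},
    3: {"Place"},
    4: {"Actor"},
    5: {"Movie"},
}
# relation id -> (domain class, range class)
relation_domain_range = {0: ("Movie", "Actor")}


def typed_candidates(relation, position, exclude):
    """Return entities whose classes match the domain (head) or range (tail)."""
    domain, range_ = relation_domain_range[relation]
    wanted = domain if position == "head" else range_
    return sorted(
        entity
        for entity, classes in entity_classes.items()
        if wanted in classes and entity != exclude
    )


# Corrupting the tail of (2, 0, 1): only other Actors qualify.
print(typed_candidates(0, "tail", exclude=1))  # [4]
# Corrupting the head of (2, 0, 1): only other Movies qualify.
print(typed_candidates(0, "head", exclude=2))  # [5]
```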


Now let's use the samplers to produce the negatives for the first 2 triples:

In [6]:
for name, sampler in samplers.items():
    print(f"Negative Sampler: {name}")
    print(sampler.sample(dataset.training.mapped_triples[:2]))
    print("")

Negative Sampler: Corrupt
(tensor([[[    0,     1, 49067],
         [63642,     1,  5226],
         [    0,     1, 81587],
         [    0,     1, 17358],
         [11297,     1,  5226]],

        [[    0,     1, 63254],
         [    0,     1, 52264],
         [ 3889,     1, 19014],
         [31922,     1, 19014],
         [    0,     1, 56737]]]), tensor([[True, True, True, True, True],
        [True, True, True, True, True]]))

Negative Sampler: Typed
(tensor([[[    0,     1,  1717],
         [10968,     1,  5226],
         [    0,     1, 73238],
         [    0,     1,  1717],
         [59817,     1,  5226]],

        [[72558,     1, 19014],
         [    0,     1,  1717],
         [92845,     1, 19014],
         [    0,     1,  1717],
         [    0,     1, 43149]]]), tensor([[True, True, True, True, True],
        [True, True, True, True, True]]))

Negative Sampler: Relational
(tensor([[[   -1,     1,  5226],
         [    0,     1, 74377],
         [    0,     1, 74377],
      

You can see that some triples have negative entity indices: these are placeholders for negatives filtered out by the NullPythonSetFilterer.
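
If you want to inspect only the valid negatives, the placeholders can be discarded by checking for negative ids. The sketch below does this on nested Python lists rather than tensors; `drop_placeholders` is a hypothetical helper, and how the extension itself treats placeholders during training is not shown in this tutorial.

```python
# Toy batch shaped like the sampler output above, as nested lists:
# one list of negatives per positive triple, each negative an (h, r, t) triple.
negatives = [
    [[-1, 1, 5226], [0, 1, 74377]],
    [[0, 1, 63254], [3889, 1, -1]],
]


def drop_placeholders(batch):
    """Keep only negatives whose head and tail are valid (non-negative) ids."""
    return [
        [triple for triple in row if triple[0] >= 0 and triple[2] >= 0]
        for row in batch
    ]


clean = drop_placeholders(negatives)
print(clean)  # [[[0, 1, 74377]], [[0, 1, 63254]]]
```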

## Using dynamic negative samplers

In order to use the dynamic negative samplers, we first need to pretrain a model. For this purpose, let's train TransE on YAGO for 2 epochs with a random sampler, just for the sake of the tutorial.

In [7]:
import pykeen.models


model = pykeen.models.TransE(triples_factory=dataset.training, embedding_dim=10)

loop = SLCWATrainingLoop(
    model=model, triples_factory=dataset.training, optimizer="Adam"
)

loop.train(triples_factory=dataset.training, num_epochs=2, batch_size=256)

No random seed is specified. This may lead to non-reproducible results.


Training epochs on cpu:   0%|          | 0/2 [00:00<?, ?epoch/s]

Training batches on cpu:   0%|          | 0/2169 [00:00<?, ?batch/s]

Training batches on cpu:   0%|          | 0/2169 [00:00<?, ?batch/s]

[0.9830639038343153, 0.8069588712087256]

Now let's define the custom function used for prediction:

In [8]:
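def sampling_model_prediction(model, hrt_batch, targets):
    """Predict the embedding of the entity to corrupt under TransE (h + r ≈ t)."""
    # One output row per triple; the width is the entity embedding dimension.
    out = torch.zeros(
        (hrt_batch.size(0), model.entity_representations[0]().size(1)),
        device=hrt_batch.device,
    )
    # Target 0 (head): the expected head embedding is t - r.
    out[targets == 0] = model.entity_representations[0](
        hrt_batch[targets == 0, 2]
    ) - model.relation_representations[0](hrt_batch[targets == 0, 1])
    # Target 2 (tail): the expected tail embedding is h + r.
    out[targets == 2] = model.entity_representations[0](
        hrt_batch[targets == 2, 0]
    ) + model.relation_representations[0](hrt_batch[targets == 2, 1])

    return out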
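
The arithmetic behind this function follows from the TransE assumption that h + r ≈ t. The toy example below reproduces it on plain Python lists; `predict_tail` and `predict_head` are illustrative names, not part of PyKEEN or the extension.

```python
def predict_tail(h, r):
    """TransE models h + r ≈ t, so the expected tail embedding is h + r."""
    return [hi + ri for hi, ri in zip(h, r)]


def predict_head(t, r):
    """Rearranging gives h ≈ t - r for the expected head embedding."""
    return [ti - ri for ti, ri in zip(t, r)]


h = [1.0, 2.0]
r = [0.5, -1.0]
t = predict_tail(h, r)
print(t)                   # [1.5, 1.0]
print(predict_head(t, r))  # [1.0, 2.0]
```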

And now we can instantiate the Adversarial negative sampler (`NearMissNegativeSampler`):

In [9]:
sampler = NearMissNegativeSampler(
    mapped_triples=dataset.training.mapped_triples,
    prediction_function=sampling_model_prediction,
    sampling_model=model,
    num_negs_per_pos=5,
    num_query_results=10,
    filtered=True,
    filterer=filterer,
)

print("Adversarial Negative Sampler")
print(sampler.sample(dataset.training.mapped_triples[:2]))

Adversarial Negative Sampler
[DONE ] [NS NearMissNegativeSampler] Calculating HEAD prediction with TransE pretrained model in 0000.0007s
[DONE ] [NS NearMissNegativeSampler] Calculating TAIL prediction with TransE pretrained model in 0000.0002s
[DONE ] [NS NearMissNegativeSampler] Querying KDTREE for HEAD predictions in 0000.0010s
[DONE ] [NS NearMissNegativeSampler] Querying KDTREE for TAIL predictions in 0000.0009s


100%|██████████| 2/2 [00:00<00:00, 667.14it/s]

(tensor([[[94881,     1,  5226],
         [65931,     1,  5226],
         [    0,     1, 68207],
         [    0,     1, 37774],
         [    0,     1, 66097]],

        [[15681,     1, 19014],
         [79459,     1, 19014],
         [    0,     1, 37774],
         [    0,     1, 26074],
         [    0,     1, 66097]]]), tensor([[True, True, True, True, True],
        [True, True, True, True, True]]))
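
The log above shows that the sampler queries a KD-tree with the predicted embeddings. The underlying idea, picking the entities closest to the prediction as hard negatives, can be sketched with a brute-force search instead of a KD-tree. This is only an illustration under that assumption: `nearest_entities` is a hypothetical helper, and `k` is assumed to play a role similar to `num_query_results`.

```python
import math


def nearest_entities(query, embeddings, k, exclude):
    """Return the ids of the k entities closest to `query`, skipping `exclude`."""
    distances = [
        (math.dist(query, vector), entity_id)
        for entity_id, vector in embeddings.items()
        if entity_id != exclude
    ]
    return [entity_id for _, entity_id in sorted(distances)[:k]]


# Toy 2-d entity embeddings.
embeddings = {0: [0.0, 0.0], 1: [1.0, 1.0], 2: [1.1, 0.9], 3: [5.0, 5.0]}

# Predicted tail embedding for a triple whose true tail is entity 1.
predicted_tail = [1.0, 1.0]
print(nearest_entities(predicted_tail, embeddings, k=2, exclude=1))  # [2, 0]
```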





Or the nearest neighbour negative sampler

In [10]:
sampler = NearestNeighbourNegativeSampler(
    mapped_triples=dataset.training.mapped_triples,
    sampling_model=model,
    num_negs_per_pos=5,
    num_query_results=10,
    filtered=True,
    filterer=filterer,
)

print("NN Negative Sampler")
print(sampler.sample(dataset.training.mapped_triples[:2]))

NN Negative Sampler
(tensor([[[    0,     1, 50626],
         [    0,     1, 41365],
         [28108,     1,  5226],
         [52837,     1,  5226],
         [    0,     1, 41365]],

        [[52837,     1, 19014],
         [    0,     1, 14871],
         [    0,     1, 81824],
         [82036,     1, 19014],
         [    0,     1,  4414]]]), tensor([[True, True, True, True, True],
        [True, True, True, True, True]]))


## Using the samplers in a full training pipeline

Let's see how to use our new samplers in a standard PyKEEN training and evaluation pipeline. First, we register the new sampler and filterer in the PyKEEN namespace so that we can refer to them by their string names.

In [11]:
negative_sampler_resolver.register(element=CorruptNegativeSampler)
filterer_resolver.register(element=NullPythonSetFilterer)
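
The string names used in the next cell ("corrupt", "nullpythonset") appear to be derived from the class names by lower-casing them and dropping the common suffix. The sketch below mimics that rule as inferred from the names used in this tutorial; `lookup_key` is a hypothetical helper and not the resolver's actual implementation.

```python
def lookup_key(class_name, suffix):
    """Lower-case the class name and strip the shared suffix, if present."""
    key = class_name.lower()
    suffix = suffix.lower()
    if key.endswith(suffix):
        key = key[: -len(suffix)]
    return key


print(lookup_key("CorruptNegativeSampler", "NegativeSampler"))  # corrupt
print(lookup_key("NullPythonSetFilterer", "Filterer"))          # nullpythonset
print(lookup_key("BasicNegativeSampler", "NegativeSampler"))    # basic
```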

Then we can use them like any other negative sampler. As a comparison, we show the same pipeline using the random sampler already available in PyKEEN.

In [12]:
# NEW SAMPLER

corrupt_result = pipeline(
    dataset="Nations",
    model="TransE",
    negative_sampler="corrupt",
    negative_sampler_kwargs=dict(filtered=True, filterer="nullpythonset"),
    training_loop="sLCWA",
)

# RANDOM SAMPLER (stored separately so both results remain available)

basic_result = pipeline(
    dataset="Nations",
    model="TransE",
    negative_sampler="basic",
    negative_sampler_kwargs=dict(filtered=True, filterer="bloom"),
    training_loop="sLCWA",
)

No random seed is specified. Setting to 1809488467.


  data = dict(torch.load(path.joinpath(cls.base_file_name)))
  data = dict(torch.load(path.joinpath(cls.base_file_name)))
  data = dict(torch.load(path.joinpath(cls.base_file_name)))
  metadata = torch.load(metadata_path) if metadata_path.is_file() else None
No cuda devices were available. The model runs on CPU


Training epochs on cpu:   0%|          | 0/5 [00:00<?, ?epoch/s]

Training batches on cpu:   0%|          | 0/7 [00:00<?, ?batch/s]

Training batches on cpu:   0%|          | 0/7 [00:00<?, ?batch/s]

Training batches on cpu:   0%|          | 0/7 [00:00<?, ?batch/s]

Training batches on cpu:   0%|          | 0/7 [00:00<?, ?batch/s]

Training batches on cpu:   0%|          | 0/7 [00:00<?, ?batch/s]



Evaluating on cpu:   0%|          | 0.00/201 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 0.06s seconds
INFO:pykeen.datasets.utils:Loading cached preprocessed dataset from file:///Users/navis/.data/pykeen/datasets/nations/cache/47DEQpj8HBSa-_TImW-5JCeuQeRkm5NM
INFO:pykeen.triples.triples_factory:Loading from file:///Users/navis/.data/pykeen/datasets/nations/cache/47DEQpj8HBSa-_TImW-5JCeuQeRkm5NM/training
  data = dict(torch.load(path.joinpath(cls.base_file_name)))
INFO:pykeen.triples.triples_factory:Loading from file:///Users/navis/.data/pykeen/datasets/nations/cache/47DEQpj8HBSa-_TImW-5JCeuQeRkm5NM/testing
  data = dict(torch.load(path.joinpath(cls.base_file_name)))
INFO:pykeen.triples.triples_factory:Loading from file:///Users/navis/.data/pykeen/datasets/nations/cache/47DEQpj8HBSa-_TImW-5JCeuQeRkm5NM/validation
  data = dict(torch.load(path.joinpath(cls.base_file_name)))
  metadata = torch.load(metadata_path) if metadata_path.is_file() else None
INFO:pykeen.pipeline.api:Using device: None


Training epochs on cpu:   0%|          | 0/5 [00:00<?, ?epoch/s]

Training batches on cpu:   0%|          | 0/7 [00:00<?, ?batch/s]

Training batches on cpu:   0%|          | 0/7 [00:00<?, ?batch/s]

Training batches on cpu:   0%|          | 0/7 [00:00<?, ?batch/s]

Training batches on cpu:   0%|          | 0/7 [00:00<?, ?batch/s]

Training batches on cpu:   0%|          | 0/7 [00:00<?, ?batch/s]



Evaluating on cpu:   0%|          | 0.00/201 [00:00<?, ?triple/s]

INFO:pykeen.evaluation.evaluator:Evaluation took 0.03s seconds


As you can see, the two pipelines are exactly the same: given the correct kwargs with the additional required information, the new samplers can be integrated into any pre-existing pipeline.

## Writing your own custom subset based negative samplers

Let's assume we want to create a negative sampler that uses our proposed abstraction to corrupt triples with the following rationale:
for each triple to be corrupted, given its relation, use the top X least occurring entities as the negative pool.
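
Before turning to the tensor-based implementation, the selection rule itself can be sketched in plain Python. The helper `least_frequent` and the toy triples are illustrative assumptions; ties are broken by entity id here, which the tensor version may not do.

```python
from collections import Counter

# Toy (head, relation, tail) triples.
triples = [(0, 0, 5), (1, 0, 5), (1, 0, 6), (2, 0, 5), (2, 0, 7), (2, 0, 6)]


def least_frequent(triples, relation, position, top_k):
    """Return the top_k least frequent entities at `position` (0=head, 2=tail)."""
    counts = Counter(t[position] for t in triples if t[1] == relation)
    # Sort by ascending count, breaking ties by entity id for determinism.
    ranked = sorted(counts, key=lambda entity: (counts[entity], entity))
    return ranked[:top_k]


print(least_frequent(triples, relation=0, position=0, top_k=2))  # [0, 1]
print(least_frequent(triples, relation=0, position=2, top_k=2))  # [7, 6]
```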


In [27]:
class TutorialSampler(SubSetNegativeSampler):
    def __init__(self, *args, top_k=100, **kwargs):
        # Define this attribute before calling super().__init__() so that it is available during subset generation
        object.__setattr__(self, "top_k", top_k)
        super().__init__(*args, **kwargs)

    # Precompute the entity set for head and tail for each relation
    def generate_subset(self, mapped_triples, **kwargs):
        subset = dict()

        for r in range(self.num_relations):

            subset[r] = {0: None, 2: None}

            for target in [const.HEAD, const.TAIL]:
                data, counts = torch.unique(
                    self.mapped_triples[self.mapped_triples[:, const.REL] == r, target],
                    return_counts=True,
                )
                ordered_data = data[torch.sort(counts, descending=False)[1]][
                    : self.top_k
                ]

                subset[r][target] = ordered_data

        return subset

    # Now let's define the negative pool for each triple
    def strategy_negative_pool(self, h, r, t, target):
        return self.subset[r][const.TARGET_TO_INDEX[target]]


# Instantiate the sampler like any other one, adding our additional variable
sampler = TutorialSampler(
    mapped_triples=dataset.training.mapped_triples, top_k=5, num_negs_per_pos=5
)

# And just get the negatives!
print(sampler.sample(dataset.training.mapped_triples[:2]))

(tensor([[[    0,     1,   939],
         [    0,     1,   939],
         [    0,     1,   939],
         [84088,     1,  5226],
         [94074,     1,  5226]],

        [[    0,     1,   939],
         [ 4040,     1, 19014],
         [94074,     1, 19014],
         [    0,     1, 32666],
         [    0,     1, 96783]]]), None)


As you can see, it's extremely easy to define new negative samplers with our abstraction. Another detail is the use of integration. For example, with a top_k of 5, each triple has a negative pool of at most 5 elements, which can be detrimental when the number of negatives per positive is very high. In this case the "integrate" parameter can be used, and the negative pool will be supplemented with additional entities sampled at random.
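
The idea behind integration can be illustrated with a small plain-Python sketch that pads a pool with random entity ids until it is large enough. `integrate_pool` is a hypothetical helper; how the extension actually draws the extra entities (for example, with or without replacement) is an assumption here.

```python
import random


def integrate_pool(pool, num_entities, num_negs, seed=0):
    """Pad `pool` with distinct random entity ids until it holds `num_negs` candidates."""
    rng = random.Random(seed)
    padded = list(pool)
    seen = set(pool)
    # Never ask for more distinct ids than exist, to guarantee termination.
    target = min(num_negs, num_entities)
    while len(padded) < target:
        candidate = rng.randrange(num_entities)
        if candidate not in seen:
            seen.add(candidate)
            padded.append(candidate)
    return padded


pool = [939, 4040, 32666]
padded = integrate_pool(pool, num_entities=96910, num_negs=10)
print(len(padded), padded[:3])
```

The original pool entries are kept at the front, and only the missing slots are filled with random entities.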

In [28]:
sampler = TutorialSampler(
    mapped_triples=dataset.training.mapped_triples,
    top_k=5,
    num_negs_per_pos=100,
    integrate=True,
)

print(sampler.sample(dataset.training.mapped_triples[:2]))

(tensor([[[    0,     1,  8268],
         [    0,     1,   939],
         [    0,     1, 96783],
         [75891,     1,  5226],
         [    0,     1,  1096],
         [    0,     1, 32666],
         [    0,     1, 85280],
         [94074,     1,  5226],
         [84088,     1,  5226],
         [31776,     1,  5226],
         [    0,     1, 64519],
         [ 4040,     1,  5226],
         [    0,     1, 28806],
         [    0,     1, 21298],
         [42033,     1,  5226],
         [    0,     1, 90941],
         [10327,     1,  5226],
         [    0,     1, 75271],
         [28065,     1,  5226],
         [    0,     1, 71786],
         [30858,     1,  5226],
         [ 2295,     1,  5226],
         [    0,     1, 53625],
         [    0,     1, 90237],
         [    0,     1, 71840],
         [66410,     1,  5226],
         [    0,     1, 38254],
         [    0,     1, 44567],
         [17956,     1,  5226],
         [69772,     1,  5226],
         [ 3414,     1,  5226],
       

Additionally, you can compute the dataset statistics directly using the functions we provide. This can take some time, since the computation has to be performed for each `<h,r,*>` and `<*,r,t>` combination; for this reason we test it on a subset of the training data.

In [32]:
sampler = TutorialSampler(
    mapped_triples=dataset.training.mapped_triples, top_k=5, num_negs_per_pos=5
)

triples = dataset.training.mapped_triples
sampler.average_pool_size(triples[torch.randperm(len(triples))][:5000])

[SubsetNegativeSampler] Computing <h,r,*> Negative Pools


100%|██████████| 4875/4875 [00:03<00:00, 1497.16it/s]


[SubsetNegativeSampler] Computing <*,r,t> Negative Pools


100%|██████████| 3547/3547 [00:02<00:00, 1382.49it/s]


(4,
 {0: (1, 0.00011873664212776063),
  2: (4, 0.0004749465685110425),
  10: (8422, 1.0),
  40: (8422, 1.0),
  100: (8422, 1.0)})

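The function returns the average number of entities in each negative pool (after checking for false negatives), followed by a dictionary that gives, for each threshold (0, 2, 10, 40, 100), the number of negative pools with at most that many entities and the corresponding fraction of all pools. The nonzero count at threshold 0 suggests that the comparison is inclusive.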
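
To clarify how such a summary can be read, the sketch below computes the same kind of statistics from a list of pool sizes. `pool_statistics` is a hypothetical helper, not the extension's `average_pool_size`, and the inclusive comparison is an assumption based on the nonzero count at threshold 0 in the output above.

```python
def pool_statistics(pool_sizes, thresholds=(0, 2, 10, 40, 100)):
    """Return the mean pool size and, per threshold, (count, fraction) of small pools."""
    total = len(pool_sizes)
    average = sum(pool_sizes) / total
    summary = {}
    for limit in thresholds:
        # Inclusive comparison: an assumption, see the lead-in above.
        count = sum(1 for size in pool_sizes if size <= limit)
        summary[limit] = (count, count / total)
    return average, summary


sizes = [0, 2, 4, 5, 5, 5, 5, 6]
average, summary = pool_statistics(sizes)
print(average)     # 4.0
print(summary[0])  # (1, 0.125)
print(summary[2])  # (2, 0.25)
```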