# PyKEEN Negative Sampling Extension: An Introductory Tutorial

This tutorial walks through the basic usage of the new negative sampler classes. Before running it, unzip the provided dataset file into the data/ folder so that the code can find the data.

In [1]:
import pykeen
from pykeen.sampling import BasicNegativeSampler, BernoulliNegativeSampler

from extension.sampling import (CorruptNegativeSampler,
                                NearestNeighbourNegativeSampler,
                                NearMissNegativeSampler,
                                RelationalNegativeSampler,
                                TypedNegativeSampler)

from extension.filtering import NullPythonSetFilterer
from extension.dataset import OnMemoryDataset
from pathlib import Path
from pykeen.pipeline import pipeline
from pykeen.training import SLCWATrainingLoop
import torch



The path below should resolve to the data folder automatically when the notebook is run from its provided location. If it does not, replace it with the path to your own "data" folder.

In [2]:
data_path = Path().cwd().parent / "data"
data_path

PosixPath('/home/navis/dev/refactor-negative-sampler/data')
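
If the dataset was not unzipped into the expected place, the failure may only show up later, when the dataset is loaded. The following is a minimal sketch of an early check; the helper name `require_dataset_dir` is hypothetical and not part of PyKEEN or the extension.

```python
from pathlib import Path


def require_dataset_dir(data_path: Path, dataset_name: str) -> Path:
    """Return data_path / dataset_name, or raise a readable error if it is missing.

    Hypothetical helper for this tutorial only.
    """
    dataset_dir = data_path / dataset_name
    if not dataset_dir.is_dir():
        raise FileNotFoundError(
            f"Expected dataset folder {dataset_dir} was not found; "
            "unzip the provided dataset into the data folder or adjust data_path."
        )
    return dataset_dir
```

Calling `require_dataset_dir(data_path, "YAGO4-20")` before loading the dataset would surface a wrong path immediately.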

## Loading the provided datasets

Let's load the YAGO4-20 dataset together with the additional metadata provided with it.

In [3]:
dataset = OnMemoryDataset(
    data_path = data_path / "YAGO4-20",
    load_domain_range = True,
    load_entity_classes = True
)

Now you can use all the basic functionality of a standard PyKEEN dataset, plus the loaded domain and range properties of each relation and the class membership of each entity. Let's look at them.

In [4]:
print(f"Num Entities {dataset.num_entities}")
print(f"Num Relations {dataset.num_relations}")
print(f"Relation Mapping {dataset.relation_to_id}")

relation_id = 10
entity_id = 50

id_to_entity = {v:k for k,v in dataset.entity_to_id.items()}
id_to_relation = {v:k for k,v in dataset.relation_to_id.items()}

print(f"Domain and Range of '{id_to_relation[relation_id]}': {dataset.relation_id_to_domain_range[relation_id]}")
print(f"Entity Classes of '{id_to_entity[entity_id]}': {dataset.entity_id_to_classes[entity_id]}")

Num Entities 96910
Num Relations 70
Relation Mapping {'about': 0, 'actor': 1, 'affiliation': 2, 'alumniOf': 3, 'author': 4, 'award': 5, 'bioChemInteraction': 6, 'birthPlace': 7, 'brand': 8, 'byArtist': 9, 'character': 10, 'children': 11, 'citation': 12, 'competitor': 13, 'composer': 14, 'containedInPlace': 15, 'containsPlace': 16, 'contentLocation': 17, 'contributor': 18, 'copyrightHolder': 19, 'countryOfOrigin': 20, 'creator': 21, 'deathPlace': 22, 'director': 23, 'editor': 24, 'exampleOfWork': 25, 'familyName': 26, 'founder': 27, 'foundingLocation': 28, 'gender': 29, 'genre': 30, 'givenName': 31, 'hasMolecularFunction': 32, 'hasOccupation': 33, 'hasPart': 34, 'homeLocation': 35, 'honorificPrefix': 36, 'inLanguage': 37, 'isBasedOn': 38, 'isInvolvedInBiologicalProcess': 39, 'isLocatedInSubcellularLocation': 40, 'isPartOf': 41, 'knowsLanguage': 42, 'license': 43, 'location': 44, 'locationCreated': 45, 'material': 46, 'memberOf': 47, 'musicBy': 48, 'nationality': 49, 'parent': 50, 'paren

## Using the static negative samplers 

Let's instantiate some static negative samplers: Corrupt and Typed. We use the PyKEEN interface and provide any additional metadata a sampler requires (here, the entity classes and the relation domain and range for the Typed sampler), as loaded with the dataset. In this case we set the integration of random negatives to false (`integrate = False`), in order to showcase the actual negatives generated by these methods.

In [5]:
filterer = NullPythonSetFilterer(
    mapped_triples=dataset.training.mapped_triples
)

samplers = {
    "Corrupt" : CorruptNegativeSampler(
        mapped_triples = dataset.training.mapped_triples,
        filtered = True,
        filterer = filterer,
        num_negs_per_pos = 5,
        integrate = False
    ),
    "Typed" : TypedNegativeSampler(
        mapped_triples = dataset.training.mapped_triples,
        filtered = True,
        filterer = filterer,
        num_negs_per_pos = 5,
        entity_classes_dict=dataset.entity_id_to_classes,
        relation_domain_range_dict=dataset.relation_id_to_domain_range,
        integrate = False
    ),
}
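
To give an intuition for what type-constrained corruption means, here is a small pure-Python sketch. It is not the extension's implementation: the function `typed_corrupt`, the toy data, and the assumed shapes of the two dictionaries (entity id to a set of class names, relation id to a `(domain, range)` pair) are all assumptions made for illustration.

```python
import random


def typed_corrupt(triple, entity_classes, domain_range, rng, corrupt_head=True):
    """Replace the head (or tail) with an entity whose classes match the
    relation's domain (or range). Returns None if no candidate exists.

    Hypothetical sketch; the real TypedNegativeSampler may work differently.
    """
    h, r, t = triple
    domain, range_ = domain_range[r]
    wanted = domain if corrupt_head else range_
    original = h if corrupt_head else t
    # Candidates: entities of the wanted class, excluding the original entity.
    candidates = sorted(
        e for e, classes in entity_classes.items()
        if wanted in classes and e != original
    )
    if not candidates:
        return None
    replacement = rng.choice(candidates)
    return (replacement, r, t) if corrupt_head else (h, r, replacement)


# Toy metadata (invented for this example, not taken from YAGO4-20).
entity_classes = {0: {"Person"}, 1: {"Person"}, 2: {"Movie"}, 3: {"Movie"}, 4: {"Place"}}
domain_range = {0: ("Person", "Movie")}
rng = random.Random(0)

neg_head = typed_corrupt((0, 0, 2), entity_classes, domain_range, rng, corrupt_head=True)
neg_tail = typed_corrupt((0, 0, 2), entity_classes, domain_range, rng, corrupt_head=False)
print(neg_head, neg_tail)  # (1, 0, 2) (0, 0, 3)
```

In the toy data each slot has exactly one type-compatible replacement, so the result does not depend on the random seed; entity 4 (a "Place") is never chosen because it fits neither the domain nor the range.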

Now let's use the samplers to produce negatives for the first 2 triples.

In [6]:
for name, sampler in samplers.items():
    print(f"Negative Sampler: {name}")
    print(sampler.sample(dataset.training.mapped_triples[:2]))
    print("")

Negative Sampler: Corrupt
(tensor([[[    0,     1, 68452],
         [    0,     1,  5994],
         [84822,     1,  5226],
         [84698,     1,  5226],
         [    0,     1,  9090]],

        [[    0,     1, 41384],
         [    0,     1, 64844],
         [ 7829,     1, 19014],
         [    0,     1, 57403],
         [85227,     1, 19014]]]), tensor([[True, True, True, True, True],
        [True, True, True, True, True]]))

Negative Sampler: Typed
(tensor([[[    0,     1, 27722],
         [    0,     1, 42456],
         [    0,     1, 79084],
         [86399,     1,  5226],
         [ 2614,     1,  5226]],

        [[    0,     1, 27722],
         [87083,     1, 19014],
         [84942,     1, 19014],
         [    0,     1, 79084],
         [    0,     1, 60725]]]), tensor([[True, True, True, True, True],
        [True, True, True, True, True]]))
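
Each sampler returns a pair: a tensor of negatives with one row of 5 triples per positive, and a boolean tensor of the same leading shape, which appears to be the filter mask (all `True` here). The sketch below labels which slot of each Corrupt negative differs from its positive. It uses plain Python lists copied from the printed output so that it runs without torch; the positive triples `(0, 1, 5226)` and `(0, 1, 19014)` are inferred from the negatives rather than printed by the notebook, and `corrupted_slot` is a hypothetical helper.

```python
def corrupted_slot(positive, negative):
    """Name the slot(s) in which a negative triple differs from its positive."""
    names = ("head", "relation", "tail")
    changed = [names[i] for i in range(3) if positive[i] != negative[i]]
    return "+".join(changed) if changed else "unchanged"


# Positives inferred from the output above (assumption: one slot corrupted per negative).
positives = [(0, 1, 5226), (0, 1, 19014)]

# Negatives copied from the "Corrupt" sampler output.
corrupt_negatives = [
    [(0, 1, 68452), (0, 1, 5994), (84822, 1, 5226), (84698, 1, 5226), (0, 1, 9090)],
    [(0, 1, 41384), (0, 1, 64844), (7829, 1, 19014), (0, 1, 57403), (85227, 1, 19014)],
]

slots = [
    [corrupted_slot(pos, neg) for neg in negs]
    for pos, negs in zip(positives, corrupt_negatives)
]
for pos, row in zip(positives, slots):
    print(pos, row)
```

For these samples every negative changes either the head or the tail, never the relation.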



## Using dynamic negative samplers

In order to use the dynamic negative samplers, we first need to pretrain a model. For this purpose, let's train TransE on YAGO4-20 for 2 epochs, which is enough for the sake of the tutorial, using a random sampler.

In [7]:
import pykeen.models


model = pykeen.models.TransE(
    triples_factory = dataset.training,
    embedding_dim=10
)

loop = SLCWATrainingLoop(
    model= model,
    triples_factory = dataset.training,
    optimizer="Adam"
)

loop.train(
    triples_factory=dataset.training,
    num_epochs=2,
    batch_size=256
)

No random seed is specified. This may lead to non-reproducible results.
Training epochs on cpu: 100%|██████████| 2/2 [00:32<00:00, 16.04s/epoch, loss=0.804, prev_loss=0.984]


[0.9840400149988215, 0.804481829570262]

Now let's define the custom function used for prediction.

In [8]:
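def sampling_model_prediction(model, hrt_batch, targets):
    # One output vector per triple, with the entity embedding dimension.
    out = torch.zeros(
        (hrt_batch.size(0), model.entity_representations[0]().size(1)),
        device=hrt_batch.device,
    )
    # Target 0 (head): estimate the head embedding as tail - relation.
    out[targets == 0] = model.entity_representations[0](
        hrt_batch[targets == 0, 2]
    ) - model.relation_representations[0](hrt_batch[targets == 0, 1])
    # Target 2 (tail): estimate the tail embedding as head + relation.
    out[targets == 2] = model.entity_representations[0](
        hrt_batch[targets == 2, 0]
    ) + model.relation_representations[0](hrt_batch[targets == 2, 1])

    # Rows with any other target value keep their zero vector.
    return out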
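
The function above relies on the TransE assumption that head + relation ≈ tail, so a missing head can be estimated as tail - relation and a missing tail as head + relation. The following sketch repeats that arithmetic on plain Python lists for a single triple; `transe_query_point` and the toy vectors are illustrative assumptions, not part of the extension.

```python
def transe_query_point(head_vec, relation_vec, tail_vec, target):
    """Return the point in embedding space where the target entity is expected.

    target == 0: estimate the head as tail - relation.
    target == 2: estimate the tail as head + relation.
    Any other target yields a zero vector, mirroring the function above.
    """
    if target == 0:
        return [t - r for t, r in zip(tail_vec, relation_vec)]
    if target == 2:
        return [h + r for h, r in zip(head_vec, relation_vec)]
    return [0.0] * len(relation_vec)


# Toy 2-dimensional embeddings chosen so that h + r == t exactly.
h = [1.0, 2.0]
r = [0.5, -1.0]
t = [1.5, 1.0]

print(transe_query_point(h, r, t, target=0))  # [1.0, 2.0], which equals h
print(transe_query_point(h, r, t, target=2))  # [1.5, 1.0], which equals t
```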

And now we can instantiate the adversarial negative sampler (`NearMissNegativeSampler`).

In [9]:
sampler = NearMissNegativeSampler(
    mapped_triples = dataset.training.mapped_triples,
    prediction_function=sampling_model_prediction,
    sampling_model=model,
    num_query_results=5
)
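
The sampler's log output mentions querying a KD-tree with the model's predictions, which suggests that negatives are drawn from the entities closest to the predicted point. The sketch below shows that idea with a brute-force search instead of a KD-tree. It is an assumption about the general technique, not the extension's code; `nearest_entities`, its `exclude` parameter, and the toy vectors are hypothetical.

```python
import math


def nearest_entities(query, entity_vectors, k, exclude=()):
    """Return the ids of the k entities closest to `query` (Euclidean distance),
    skipping ids in `exclude` (for example, the true entity of the positive triple).
    """
    ranked = sorted(
        (eid for eid in entity_vectors if eid not in exclude),
        # Break distance ties by entity id so the result is deterministic.
        key=lambda eid: (math.dist(query, entity_vectors[eid]), eid),
    )
    return ranked[:k]


# Toy entity embeddings (invented for this example).
entity_vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 2.0], 3: [3.0, 3.0]}
query = [0.9, 0.1]  # a predicted point that lies close to entity 1

near_misses = nearest_entities(query, entity_vectors, k=2, exclude={1})
print(near_misses)  # [0, 2]
```

Here `k` plays a role similar to `num_query_results` above, although the exact semantics of that parameter in the extension are not confirmed by this tutorial.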

In [None]:
print("Adversarial Negative Sampler")
print(sampler.sample(dataset.training.mapped_triples[:2]))

[[92mDONE[0m ] [NS NearMissNegativeSampler] Calculating HEAD prediction with TransE pretrained model in [96m0000.0021[0ms
[[92mDONE[0m ] [NS NearMissNegativeSampler] Calculating TAIL prediction with TransE pretrained model in [96m0000.0005[0ms
[[92mDONE[0m ] [NS NearMissNegativeSampler] Querying KDTREE for HEAD predictions in [96m0000.0025[0ms
[[92mDONE[0m ] [NS NearMissNegativeSampler] Querying KDTREE for TAIL predictions in [96m0000.0024[0ms


  0%|          | 0/2 [00:00<?, ?it/s]


KeyError: 5226