GitHub - jinzhuoran/CogKGE: CogKGE: A Knowledge Graph Embedding Toolkit and Benchmark for Representing Multi-source and Heterogeneous Knowledge. ACL 2022

CogKGE: A Knowledge Graph Embedding Toolkit and Benckmark for Representing Multi-source and Heterogeneous Knowledge

Demo system and more information is available at http://cognlp.com/cogkge

What's New?

Mar 2022: We have supported node classification framework.
Jan 2022: We have released CogKGE v1.0.

Description

CogKGE is a knowledge graph embedding toolkit that aims to represent multi-source and heterogeneous knowledge. CogKGE currently supports 17 models, 11 datasets including two multi-source heterogeneous KGs, five evaluation metrics, four knowledge adapters, four loss functions, three samplers and three built-in data containers.

This easy-to-use python package has the following advantages:

Multi-source and heterogeneous knowledge representation. CogKGE explores the unified representation of knowledge from diverse sources. Moreover, our toolkit not only contains the triple fact-based embedding models, but also supports the fusion representation of additional information, including text descriptions, node types and temporal information.
Comprehensive models and benchmark datasets. CogKGE implements lots of classic KGE models in the four categories of translation distance models, semantic matching models, graph neural network-based models and transformer-based models. Besides out-of-the-box models, we release two large benchmark datasets for further evaluating KGE methods, called EventKG240K and CogNet360K.
Extensible and modularized framework. CogKGE provides a programming framework for KGE tasks. Based on the extensible architecture, CogKGE can meet the requirements of module extension and secondary development, and pre-trained knowledge embeddings can be directly applied to downstream tasks.
Open source and visualization demo. Besides the toolkit, we also release an online system to discover knowledge visually. Source code, datasets and pre-trained embeddings are publicly available.

Install

Install from git

# clone CogKGE   
git clone https://github.com/jinzhuoran/CogKGE.git

# install CogKGE   
cd cogkge
pip install -e .   
pip install -r requirements.txt

Install from pip

pip install cogkge

Quick Start

Pre-trained Embedder for Knowledge Discovery

from cogkge import *

# loader lut
device = init_cogkge(device_id="0", seed=1)
loader = EVENTKG2MLoader(dataset_path="../dataset", download=True)
train_data, valid_data, test_data = loader.load_all_data()
node_lut, relation_lut, time_lut = loader.load_all_lut()
processor = EVENTKG2MProcessor(node_lut, relation_lut, time_lut,
                               reprocess=True,
                               type=False, time=False, description=False, path=False,
                               time_unit="year",
                               pretrain_model_name="roberta-base", token_len=10,
                               path_len=10)
node_lut, relation_lut, time_lut = processor.process_lut()

# loader model
model = BoxE(entity_dict_len=len(node_lut),
             relation_dict_len=len(relation_lut),
             embedding_dim=50)

# load predictor
predictor = Predictor(model_name="BoxE",
                      data_name="EVENTKG2M",
                      model=model,
                      device=device,
                      node_lut=node_lut,
                      relation_lut=relation_lut,
                      pretrained_model_path="data/BoxE_Model.pkl",
                      processed_data_path="data",
                      reprocess=False,
                      fuzzy_query_top_k=10,
                      predict_top_k=10)

# fuzzy query node
result_node = predictor.fuzzy_query_node_keyword('champion')
print(result_node)

# fuzzy query relation
result_relation = predictor.fuzzy_query_relation_keyword("instance")
print(result_relation)

# query similary nodes
similar_node_list = predictor.predict_similar_node(node_id=0)
print(similar_node_list)

# given head and relation, query tail
tail_list = predictor.predcit_tail(head_id=0, relation_id=0)
print(tail_list)

# given tail and relation, query head
head_list = predictor.predict_head(tail_id=0, relation_id=0)
print(head_list)

# given head and tail, query relation
relation_list = predictor.predict_relation(head_id=0, tail_id=0)
print(relation_list)

# dimensionality reduction and visualization of nodes
visual_list = predictor.show_img(node_id=100, visual_num=1000)

Programming Framework for Training Models

import torch
from torch.utils.data import RandomSampler
from cogkge import *

device = init_cogkge(device_id="0", seed=1)

loader = EVENTKG2MLoader(dataset_path="../dataset", download=True)
train_data, valid_data, test_data = loader.load_all_data()
node_lut, relation_lut, time_lut = loader.load_all_lut()

processor = EVENTKG2MProcessor(node_lut, relation_lut, time_lut,
                               reprocess=True,
                               type=True, time=False, description=False, path=False,
                               time_unit="year",
                               pretrain_model_name="roberta-base", token_len=10,
                               path_len=10)
train_dataset = processor.process(train_data)
valid_dataset = processor.process(valid_data)
test_dataset = processor.process(test_data)
node_lut, relation_lut, time_lut = processor.process_lut()

train_sampler = RandomSampler(train_dataset)
valid_sampler = RandomSampler(valid_dataset)
test_sampler = RandomSampler(test_dataset)

model = TransE(entity_dict_len=len(node_lut),
               relation_dict_len=len(relation_lut),
               embedding_dim=50)

loss = MarginLoss(margin=1.0, C=0)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0)

metric = Link_Prediction(link_prediction_raw=True,
                         link_prediction_filt=False,
                         batch_size=5000000,
                         reverse=False)

lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', patience=3, threshold_mode='abs', threshold=5,
    factor=0.5, min_lr=1e-9, verbose=True
)

negative_sampler = UnifNegativeSampler(triples=train_dataset,
                                       entity_dict_len=len(node_lut),
                                       relation_dict_len=len(relation_lut))

trainer = Trainer(
    train_dataset=train_dataset,
    valid_dataset=valid_dataset,
    train_sampler=train_sampler,
    valid_sampler=valid_sampler,
    model=model,
    loss=loss,
    optimizer=optimizer,
    negative_sampler=negative_sampler,
    device=device,
    output_path="../dataset",
    lookuptable_E=node_lut,
    lookuptable_R=relation_lut,
    metric=metric,
    lr_scheduler=lr_scheduler,
    log=True,
    trainer_batch_size=100000,
    epoch=3000,
    visualization=1,
    apex=True,
    dataloaderX=True,
    num_workers=4,
    pin_memory=True,
    metric_step=200,
    save_step=200,
    metric_final_model=True,
    save_final_model=True,
    load_checkpoint=None
)
trainer.train()

evaluator = Evaluator(
    test_dataset=test_dataset,
    test_sampler=test_sampler,
    model=model,
    device=device,
    metric=metric,
    output_path="../dataset",
    train_dataset=train_dataset,
    valid_dataset=valid_dataset,
    lookuptable_E=node_lut,
    lookuptable_R=relation_lut,
    log=True,
    evaluator_batch_size=50000,
    dataloaderX=True,
    num_workers=4,
    pin_memory=True,
    trained_model_path=None
)
evaluator.evaluate()

Model

Category	Model	Conference	Paper
Translation Distance Models	TransE	NIPS 2013	Translating embeddings for modeling multi-relational data
	TransH	AAAI 2014	Knowledge Graph Embedding by Translating on Hyperplanes
	TransR	AAAI 2015	Learning Entity and Relation Embeddings for Knowledge Graph Completion
	TransD	ACL 2015	Knowledge Graph Embedding via Dynamic Mapping Matrix
	TransA	AAAI 2015	TransA: An Adaptive Approach for Knowledge Graph Embedding
	BoxE	NIPS 2020	BoxE: A Box Embedding Model for Knowledge Base Completion
	PairRE	ACL 2021	PairRE: Knowledge Graph Embeddings via Paired Relation Vectorss
Semantic Matching Models	RESCAL	ICML 2011	A Three-Way Model for Collective Learning on Multi-Relational Data
	DistMult	ICLR 2015	Embedding Entities and Relations for Learning and Inference in Knowledge Bases
	SimplE	NIPS 2018	SimplE Embedding for Link Prediction in Knowledge Graphs
	TuckER	ACL 2019	TuckER: Tensor Factorization for Knowledge Graph Completion
	RotatE	ICLR 2019	RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space
Graph Neural Network-based Models	R-GCN	ESWC 2018	Modeling Relational Data with Graph Convolutional Networks
Graph Neural Network-based Models	CompGCN	ICLR 2020	Composition-based Multi-Relational Graph Convolutional Networks
Transformer-based Models	HittER	EMNLP 2021	HittER: Hierarchical Transformers for Knowledge Graph Embeddings
Transformer-based Models	KEPLER	TACL 2021	KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation

Dataset

EventKG240K

EventKG is a event-centric temporal knowledge graph, which incorporates over 690 thousand contemporary and historical events and over 2.3 million temporal relations. To our best knowledge, EventKG240K is the first event-centric KGE dataset. We use EventKG V3.0 data to construct the dataset. First, we filter entities and events based on their degrees. Then, we select the triple facts when both nodes' degrees are greater than 10. At last, we add text descriptions and node types for nodes and translate triples to quadruples by temporal information. The whole dataset contains 238,911 nodes, 822 relations and 2,333,986 triples.

CogNet360K

CogNet is a multi-source heterogeneous KG dedicated to integrating linguistic, world and commonsense knowledge. To build a subset as the dataset, we count the number of occurrences for each node. Then, we sort frame instances by the minimum occurrences of their connected nodes. After we have the sorted list, we filter the triple facts according to the preset frame categories. Finally, we find the nodes that participate in these triple facts and complete their information. The final dataset contains 360,637 nodes and 1,470,488 triples.

Name		Name	Last commit message	Last commit date
Latest commit History 569 Commits
cogkge		cogkge
docs		docs
examples		examples
paperlist		paperlist
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

License

jinzhuoran/CogKGE

Folders and files

Latest commit

History

Repository files navigation

What's New?

Description

Install

Install from git

Install from pip

Quick Start

Pre-trained Embedder for Knowledge Discovery

Programming Framework for Training Models

Model

Dataset

Other KGE open-source project

About

Topics

Resources

License

Stars

Watchers

Forks

Languages