WiKC

This repository provides the source code, data resources, LLM prompting outputs, and evaluation results for the academic paper "Refining Wikidata Taxonomy using Large Language Models". The project is licensed under the MIT license.

Data

The Wikidata dump can be accessed through the website. In the data folder, we provide the following resources mainly extracted from the data dump (dated March 22, 2024):

WiKC: A cleaned version of the Wikidata taxonomy, provided in NT format (also in TSV format, and in HTML format for visualization), together with a TSV mapping from WiKC to Wikidata (needed because some classes are merged).
wikidata: Useful data resources crawled from the data dump, such as direct instance counts for each class, labels and descriptions for each class, metaclasses used in Taxonomy Extraction, and identifiers that should be excluded in properties.
wikipedia: Mappings between Wikipedia and Wikidata in different languages.
evaluation: Entity typing data for extrinsic evaluation
taxonomies: All intermediate taxonomies from the refining steps
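
As an illustration of how the NT file could be consumed, here is a minimal sketch that reads N-Triples lines and collects child-to-parent links. It assumes the taxonomy edges use the `rdfs:subClassOf` predicate and that both ends are IRIs. The example IRIs and the function name `load_subclass_edges` are hypothetical and not taken from this repository.

```python
import re
from collections import defaultdict

# Matches simple N-Triples lines of the form: <subject> <predicate> <object> .
# Triples whose object is a literal (e.g. a label) do not match and are skipped.
TRIPLE_RE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$')

# Assumption: taxonomy edges are expressed with rdfs:subClassOf.
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"


def load_subclass_edges(lines, predicate=RDFS_SUBCLASS_OF):
    """Return {child: set(parents)} for every triple that uses `predicate`."""
    parents = defaultdict(set)
    for line in lines:
        match = TRIPLE_RE.match(line.strip())
        if match is None:
            continue  # blank line, comment, or literal-valued triple
        subj, pred, obj = match.groups()
        if pred == predicate:
            parents[subj].add(obj)
    return dict(parents)


# Hypothetical sample data; the real file would be read with open(...).
sample = [
    "<http://example.org/Cat> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://example.org/Mammal> .",
    "<http://example.org/Mammal> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://example.org/Animal> .",
    '<http://example.org/Cat> <http://www.w3.org/2000/01/rdf-schema#label> "cat"@en .',
]

edges = load_subclass_edges(sample)
for child, parent_set in sorted(edges.items()):
    print(child, "->", sorted(parent_set))
```

To use it on a file, pass an open file handle as `lines`, since a text file iterates line by line.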

Approach

We provide the source code for the refinement pipeline used to clean a taxonomy, enabling others to reuse it for their own taxonomy cleaning needs. Specifically:

data_mining_scripts: Source code for Taxonomy Extraction (from data dump)
llm_predict.py: Semantic prediction by zero-shot prompting on LLMs
clean.ipynb: Refinement steps using graph mining techniques
reprompt.py: Part of the Rewire step during the refinement
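
To give a flavour of what a graph mining step on a taxonomy can look like, the sketch below removes redundant subclass links, that is, direct links already implied by a longer path. This is a generic illustration written for this README, not necessarily the procedure implemented in clean.ipynb. It assumes the taxonomy is acyclic and is stored as a hypothetical `{child: set(parents)}` dictionary.

```python
def reachable(graph, start, skip_edge):
    """Return all ancestors of `start`, ignoring the single edge `skip_edge`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for parent in graph.get(node, ()):
            if (node, parent) == skip_edge or parent in seen:
                continue
            seen.add(parent)
            stack.append(parent)
    return seen


def remove_redundant_edges(graph):
    """Drop child->parent edges that are implied by another path.

    Assumes `graph` is acyclic; on a graph with cycles the result is not
    a well-defined transitive reduction.
    """
    reduced = {child: set(parents) for child, parents in graph.items()}
    for child, parents in graph.items():
        for parent in parents:
            # The edge is redundant if the parent is still reachable without it.
            if parent in reachable(graph, child, skip_edge=(child, parent)):
                reduced[child].discard(parent)
    return reduced


# Hypothetical toy taxonomy: "cat -> animal" is implied by "cat -> mammal -> animal".
toy = {
    "cat": {"mammal", "animal"},
    "mammal": {"animal"},
    "animal": set(),
}
print(remove_redundant_edges(toy))
```

The check runs one traversal per edge, which is fine for a toy example. A taxonomy the size of Wikidata's would need a more efficient approach.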

Evaluation & Visualization

Evaluation is conducted from both intrinsic and extrinsic perspectives. The extrinsic evaluation uses LLM-as-a-Judge for the entity typing task.
The data for extrinsic evaluation are provided in data/evaluation, where dataset.ipynb presents our method for generating the evaluation dataset.
Specific taxonomic paths can be visualized as either an SVG graph or an HTML page through draw.ipynb.

Note: All prompts, including those for evaluation and semantic prediction, are provided in the prompts folder for reuse by others. The results folder stores the LLM outputs for both semantic prediction and entity typing evaluation. Every time you change the file you want to run, you need to update the file path in config.ini.
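
For readers unfamiliar with INI files, the following sketch shows how a file path can be read from such a configuration using Python's standard `configparser` module. The section name `paths`, the key `input_file`, and the example path are hypothetical. Check config.ini in this repository for the actual section and key names.

```python
import configparser

# Hypothetical config content; the real config.ini may use other names.
EXAMPLE_CONFIG = """
[paths]
input_file = data/taxonomies/example.tsv
"""


def load_paths(config_text, section="paths"):
    """Parse INI text and return the key/value pairs of one section as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return dict(parser[section])


paths = load_paths(EXAMPLE_CONFIG)
print(paths["input_file"])
```

To read from disk instead of a string, `parser.read("config.ini")` can replace the `read_string` call.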

Acknowledgment

Part of our code is based on the source code of Yago4.5. We thank its authors for their contributions!

