This repository contains the dataset and code for the following paper:
Learning Relation Entailment with Structured and Textual Information (AKBC 2020)
- https://www.wikidata.org/wiki/Wikidata:Property_navboxes
- https://tools.wmflabs.org/hay/propbrowse/
- https://www.npmjs.com/package/wikidata-taxonomy
- get subproperties:
wdtaxonomy P361
- https://tools.wmflabs.org/prop-explorer/
- A tree structure constructed from "subclass of" (P279) or "subproperty of" (P1647).
- https://github.com/lucaswerkmeister/wikidata-ontology-explorer
- https://www.wikidata.org/wiki/Wikidata:Database_reports/List_of_properties/all (property types)
- https://tools.wmflabs.org/hay/propbrowse/props — all properties in a single JSON file
- Wikidata Graph Builder
- WikidataTreeBuilderSPARQL tutorial
- WikidataTreeBuilderSPARQL
- Wikidata DataModel
Example SPARQL query for the Wikidata Query Service: find items whose creator (P170) is a developer by occupation (P106) but is not listed as the item's developer (P178):
SELECT ?item ?itemLabel ?value ?valueLabel
WHERE
{
  ?item wdt:P170 ?value. # value should be the creator of item
  #?item wdt:P136 wd:Q828322. # item's genre must be a game
  #?item wdt:P31 wd:Q7397. # item is an instance of software
  #?value wdt:P452 wd:Q941594. # value's industry should be video games
  ?value wdt:P106 wd:Q5482740. # value's occupation should be developer
  #?item ?prop ?value.
  FILTER NOT EXISTS { ?item wdt:P178 ?value } # value is not the developer of item
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
LIMIT 100
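The query can also be run programmatically. Below is a minimal sketch, assuming the public Wikidata Query Service endpoint and the Python requests library (neither is part of this repo); the query string is a compact version of the one above.

import requests

# a compact version of the query above
query = """
SELECT ?item ?itemLabel ?value ?valueLabel WHERE {
  ?item wdt:P170 ?value .        # value is the creator of item
  ?value wdt:P106 wd:Q5482740 .  # value's occupation is developer
  FILTER NOT EXISTS { ?item wdt:P178 ?value }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "relation-entailment-example/0.1"},  # WDQS asks clients to set a UA
)
for row in resp.json()["results"]["bindings"]:
    print(row["itemLabel"]["value"], "->", row["valueLabel"]["value"])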
- Download the truthy dump file from https://dumps.wikimedia.org/wikidatawiki/entities/
- Generate triples.txt and split it by property.
- Only keep properties whose head and tail items are entities (IDs starting with 'Q'); see the filtering sketch after this list.
- Downsampling
- Downsample triples to keep only frequent entities and save them to triples_ds.txt.
- Downsample triples by property, keeping the most popular instances of each property. The number of instances kept is determined by taking the square root or logarithm of the property's size (see the downsampling sketch after this list).
- Inflate the downsampled properties, because a kept instance of property A might also be an instance of property B without having been selected for B. Make sure that all kept entities also have their P31 triples kept, because we use P31 to split properties.
- Build an ontology (based on the entire Wikidata, not the downsampled one) for the dataset using P31 (instance of) and P279 (subclass of). Note that Wikidata's classification system is unusual: it is a cyclic graph rather than a tree.
- Split leaf properties into ultra-fine properties based on the ontology built above.
- Compute the depth of each item in the ontology. Several heuristics are needed because the graph is cyclic (see the BFS sketch after this list).
- For each instance of a property, we use the P31 values of its head and tail entities as the signature for splitting. When an entity has no P31, we use 'Q' as a placeholder.
- For a property with K instances, all sub-properties larger than K/100 are kept and the remaining ones, if any, are merged, which means the maximum number of sub-properties we can get is 100+1 (see the splitting sketch after this list).
- Merge instances from all the properties generated by the splitting algorithm.
- Train KGE (knowledge graph embedding) methods on the merged file.
- Choose a new parent among the sub-properties. This is crucial because it influences performance significantly.
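A minimal sketch of the entity filter mentioned in the steps above, assuming triples.txt stores one tab-separated (head, property, tail) triple per line (the actual file layout may differ):

# keep only triples whose head and tail are entities (IDs starting with 'Q')
with open("triples.txt") as fin, open("triples_filtered.txt", "w") as fout:
    for line in fin:
        head, prop, tail = line.rstrip("\n").split("\t")
        if head.startswith("Q") and tail.startswith("Q"):
            fout.write(line)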
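The sqrt/log downsampling could look like the following sketch; the `instances` mapping and its pre-sorting by entity popularity are assumptions for illustration, not the repo's actual data structures:

import math

def num_to_keep(size, mode="sqrt"):
    # map a property's size to the number of instances that survive
    return max(1, int(math.sqrt(size)) if mode == "sqrt" else int(math.log(size + 1)))

def downsample(instances):
    # instances: dict mapping property id -> list of (head, tail) pairs,
    # assumed to be sorted with the most popular entities first
    return {p: pairs[:num_to_keep(len(pairs))] for p, pairs in instances.items()}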
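One plausible heuristic for computing depth on a cyclic ontology is a breadth-first search that never revisits a node, so each item gets the length of the shortest path from the root; the repo's exact heuristics may differ:

from collections import deque

def compute_depths(children, root):
    # children: dict mapping a class to its direct subclasses (e.g. via P279)
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in depth:  # skip back-edges, which breaks cycles
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth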
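A sketch of the signature-based split with the K/100 threshold described above; mapping each entity to a single P31 value is a simplifying assumption (entities can have several):

from collections import defaultdict

def split_property(pairs, p31):
    # pairs: (head, tail) instances of one property; p31: entity -> class id
    groups = defaultdict(list)
    for head, tail in pairs:
        signature = (p31.get(head, "Q"), p31.get(tail, "Q"))  # 'Q' placeholder
        groups[signature].append((head, tail))
    threshold = len(pairs) / 100  # sub-properties larger than K/100 are kept
    kept = {sig: g for sig, g in groups.items() if len(g) > threshold}
    merged = [t for g in groups.values() if len(g) <= threshold for t in g]
    return kept, merged  # at most 100 kept sub-properties + 1 merged one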
import sling  # assumes the SLING API: `doc` is a sling.Document, `mention` one of its mentions

# get the surface string of the mention
mention_str = doc.phrase(mention.begin, mention.end)
# iterate over the frames evoked by the mention
for e in mention.evokes():
    print(e.data(pretty=True))
Google sheet used to track experimental results.
Google sheet used to track ranking results.