Corpus-based lexicalization for DBpedia

Corpus-based lexicalization (CBL) is a tool for bridging the lexical gap between natural language (NL) expressions (i.e. linguistic patterns) and the content stored in an RDF knowledge base (KB), i.e. an ontology. The tool uses class-specific association rules together with null-invariant measures of interestingness to induce correspondences between lexical elements and KB elements.

The current version of the tool is implemented for DBpedia. Corpus-based lexicalization is the core task of so-called ontology lexicalization.
This page provides instructions on how to run the tool.

Requirements

- docker
- 6 GB of free space: the DBpedia resources, which include abstracts, the knowledge graph (i.e. triples), and anchor text (i.e. an rdfs:label dictionary for entities), are provided with the container. The image size is nearly 5 GB.
- The task is divided into two parts: lexicalization and create lemon. The lexicalization endpoint has to be run first (instructions 3 and 4). After it completes, run the create lemon endpoint (instructions 5 and 6).

Getting started with CBL

To run CBL on your machine, follow the six instructions below:

1. Download the CBL image (lex-cbl).

docker pull pretallod/lex-cbl

2. Run the image as a container.

docker run -p 8001:8080 -t pretallod/lex-cbl

lexicalization

Given a class, the program provides a class-specific lexicalization. That is, it links the linguistic patterns (a token or a sequence of tokens tagged with parts of speech) found in the text of DBpedia abstracts to KB elements.

A list of DBpedia classes is available online.

For simplicity, run it for a single class (for example http://dbpedia.org/ontology/Actor). The input file contains the class URL and the parameters for the lexicalization process. The parameters are described in detail in the Swagger document.

Input Example:

{
  "class_url": "http://dbpedia.org/ontology/Actor",
  "minimum_entities_per_class": 100,
  "maximum_entities_per_class": 10000,
  "minimum_onegram_length": 4,
  "minimum_pattern_count": 5,
  "minimum_anchor_count": 10,
  "minimum_propertyonegram_length": 4,
  "minimum_propertypattern_count": 5,
  "minimum_propertystring_length": 5,
  "maximum_propertystring_length": 50,
  "minimum_supportA": 5,
  "minimum_supportB": 5,
  "minimum_supportAB": 5
}
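The input above can also be generated and sanity-checked programmatically before it is sent to the endpoint. The following is a minimal Python sketch: the parameter names and default values are copied from the example, while the helper names `build_lex_input` and `check_lex_input` and the checks themselves (non-negative integers, minimum not exceeding maximum) are assumptions made for illustration and are not part of the tool.

```python
import json

# Default values copied from the input example above.
DEFAULT_PARAMS = {
    "minimum_entities_per_class": 100,
    "maximum_entities_per_class": 10000,
    "minimum_onegram_length": 4,
    "minimum_pattern_count": 5,
    "minimum_anchor_count": 10,
    "minimum_propertyonegram_length": 4,
    "minimum_propertypattern_count": 5,
    "minimum_propertystring_length": 5,
    "maximum_propertystring_length": 50,
    "minimum_supportA": 5,
    "minimum_supportB": 5,
    "minimum_supportAB": 5,
}


def build_lex_input(class_url, **overrides):
    """Return the input dictionary for one class (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULT_PARAMS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {"class_url": class_url, **DEFAULT_PARAMS, **overrides}


def check_lex_input(params):
    """Basic sanity checks; these rules are assumptions, not the tool's own validation."""
    for key, value in params.items():
        if key != "class_url" and (not isinstance(value, int) or value < 0):
            raise ValueError(f"{key} must be a non-negative integer")
    for stem in ("entities_per_class", "propertystring_length"):
        if params[f"minimum_{stem}"] > params[f"maximum_{stem}"]:
            raise ValueError(f"minimum_{stem} exceeds maximum_{stem}")
    return params


params = check_lex_input(build_lex_input("http://dbpedia.org/ontology/Actor"))
# json.dumps never emits a trailing comma, so the result is valid JSON.
lex_json = json.dumps(params, indent=2)
```

Writing `lex_json` to a file named inputLex.json gives a file that can be passed to curl with `--data-binary @inputLex.json`.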

3. Download the input file.

wget -O inputLex.json https://raw.githubusercontent.com/Pret-a-LLOD/ontology-lexicalization/master/inputLex.json

4. Run the following command. The process may take nearly 2 hours.

curl -H "Accept: application/json" -H "Content-type: application/json"  --data-binary @inputLex.json -X POST  http://localhost:8001/lexicalization

Output Example:

{
  "class_url":"http://dbpedia.org/ontology/Actor",
  "status":"Successfull completed lexicalization!!"
}
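For callers who prefer a script over curl, the same request can be built with Python's standard library. This is a sketch under stated assumptions: the helper names `build_post` and `send` are hypothetical, the three-hour timeout is a guess based on the run time mentioned above, and it assumes the endpoint replies only once processing has finished, as the output example suggests. The payload is shortened here; a real call would pass the full input shown earlier.

```python
import json
import urllib.request

# Host port 8001 comes from the `docker run -p 8001:8080` step.
BASE_URL = "http://localhost:8001"


def build_post(endpoint, payload, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request, mirroring the curl call."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{endpoint}",
        data=body,
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=3 * 60 * 60):
    """Send the request and decode the JSON reply; needs the running container."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Shortened payload for illustration only.
request = build_post("lexicalization", {"class_url": "http://dbpedia.org/ontology/Actor"})
```

Calling `send(request)` with the container running should return a dictionary shaped like the output example, so the `status` field can be inspected before moving on.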

Note: The system can also be run for all frequent DBpedia classes (i.e. 340 classes). Getting results for all of them takes more than a week.

create lemon

This step creates the results in OntoLex-Lemon format.

The input file contains the base URI of the resource (uri_basic) and the length of the ranked list of senses for each linguistic pattern (rank_limit). Details can be found in the Swagger document. An example input file is shown below.

{
  "uri_basic": "http://localhost:8080/",
  "rank_limit": 20
}

5. Download the input file for lemon creation.

wget -O inputLemon.json https://raw.githubusercontent.com/Pret-a-LLOD/ontology-lexicalization/master/inputLemon.json

6. Run the following command.

curl -H "Accept: application/json" -H "Content-type: application/json"  --data-binary @inputLemon.json -X POST  http://localhost:8001/createLemon

The output is a lemon lexicon in JSON-LD format.
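Because the create lemon endpoint depends on a finished lexicalization run, the ordering of the two calls can be captured in a small driver. The sketch below is an assumption-laden illustration, not the project's own code: `run_cbl` and `fake_post` are hypothetical names, the success test (a status containing "completed") is inferred from the single output example above, and the fake reply for createLemon is only a placeholder for the real JSON-LD output.

```python
def run_cbl(post, lex_input, lemon_input):
    """Run lexicalization, then createLemon, through a caller-supplied `post`.

    `post(endpoint, payload)` must return the decoded JSON reply as a dict.
    """
    lex_reply = post("lexicalization", lex_input)
    # Assumption: a status mentioning "completed" means success, as in the
    # output example ("Successfull completed lexicalization!!").
    if "completed" not in str(lex_reply.get("status", "")):
        raise RuntimeError(f"lexicalization did not complete: {lex_reply}")
    return post("createLemon", lemon_input)


# Stand-in for a real HTTP call, so the ordering can be tried without the container.
calls = []


def fake_post(endpoint, payload):
    calls.append(endpoint)
    if endpoint == "lexicalization":
        return {
            "class_url": payload["class_url"],
            "status": "Successfull completed lexicalization!!",
        }
    return {"@context": {}}  # placeholder, not the tool's real JSON-LD output


result = run_cbl(
    fake_post,
    {"class_url": "http://dbpedia.org/ontology/Actor"},
    {"uri_basic": "http://localhost:8080/", "rank_limit": 20},
)
```

In real use, `fake_post` would be replaced by a function that POSTs to http://localhost:8001/lexicalization and http://localhost:8001/createLemon.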

CBL code

The project contains Perl and Java code.

Reference

Please use the following citation:

@inproceedings{Buono-LREC2020,
	title = {{Bridging the gap between Ontology and Lexicon via Class-specific Association Rules Mined from a Loosely-Parallel Text-Data Corpus}},
	author = {Basil Ell and Mohammad Fazleh Elahi and Philipp Cimiano},
	booktitle = {Proceedings of the 3rd Conference on Language, Data and Knowledge (LDK 2021)},
	year = {2021},
	location = {Zaragoza, Spain},
	link = {https://pub.uni-bielefeld.de/record/2954753}
}

Developers

Mohammad Fazleh Elahi
Basil Ell

Supervisors

Dr. Philipp Cimiano
