Combining Linking Techniques (CLiT) is an entity linking combination and execution framework, allowing for the seamless integration of EL systems and result exploitation for the sake of system reusability, result reproducibility, analysis and continuous improvement. (We hate waste. Especially wasting time. So let's reuse instead!)

Combining Linking Techniques (Docker version)

How to run

  1. Requirements

    1. Git
    2. Docker / docker-compose
  2. Clone the GitHub repository and all its submodules & go into its root folder.

    1. Execute command
     git clone --recurse-submodules https://github.com/kmdn/combining-linking-techniques && cd combining-linking-techniques
    
    2. In case git gives you "over data quota" issues, you may clone without the LFS data using the command below - this will not download the benchmark data sets:
    GIT_LFS_SKIP_SMUDGE=1 git clone --recurse-submodules https://github.com/kmdn/combining-linking-techniques.git && cd combining-linking-techniques
    
  3. Run front-end

    1. Option 1: Run only simple front-end

      docker-compose up
      
    2. Option 2: Run front-end including the following Docker containers:

      1. Linker recommendation (executed by default when no linker is chosen).
      2. spaCy mention detection (default when adding an IP-based API).
      docker-compose -f ./docker-compose-all.yml up
      

Build yourself

  1. Requirements

    1. Git
    2. Docker / docker-compose
    3. Maven
    4. An internet browser and open ports.
  2. Clone the GitHub repository & go into its root folder.

 git clone --recurse-submodules https://github.com/kmdn/combining-linking-techniques && cd combining-linking-techniques
  3. Build with Maven. (Note: issues may arise depending on local mvn settings, hence the additional SSL flags in the first sub-step below.)

    1. Build clit_backend
    cd clit_backend && mvn clean install -Dmaven.wagon.http.ssl.insecure=true -Dmaven.wagon.http.ssl.allowall=true -Dmaven.wagon.http.ssl.ignore.validity.dates=true && cd ..
    
    2. Build clit_frontend (relies on the backend)
    cd clit_frontend && mvn clean install && cd ..
    
  4. Build & run Docker container(s).

    1. Run only the front-end Docker container.
    docker-compose -f ./docker-compose-build.yml up
    
    2. Run front-end including the following Docker containers. Note: building the Python dependencies may take a while.
      1. Linker recommendation (executed by default when no linker is chosen).
      2. spaCy mention detection (default when adding an IP-based API).
      docker-compose -f ./docker-compose-build-all.yml up
      
  5. Access the front-end via browser at:

 localhost:8080/

Repository structure

/linker_recommender_api

      API template for entity linker recommendation.

/spacy_md_api

      API template for mention detection based on spaCy.

/evaluation_datasets

      Evaluation data sets accessible through the framework front-end.

/img

      Images for README.md

CLiT Website

Tutorials and Videos


Tutorials will be made available at https://github.com/kmdn/clit-tutorials

For videos regarding the use of our framework, we refer to the following public YouTube playlist: https://www.youtube.com/playlist?list=PLjJ1ICImS5q7D3k6bLGSUts_qEgtlVJnO

Classical Pipeline

While Entity Linking (EL) systems vary in terms of approaches and potential steps within respective pipelines, we identify the most commonly-employed ones as the classical pipeline. We use said pipeline as a template for our framework in order to reach compatibility with as many existing systems as possible. In the figure below, we present our understanding of the functioning of a classical pipeline for a single system.

Classic pipeline

Input Document

Representing the starting point of any annotation framework, an input document consists of plain text of some form, generally carrying no additional information for systems to take into account. The document may be part of an evaluation data set or simply free text - in both cases, only text is passed as input to the subsequent step.

Mention Detection


 **Input:** Text Document
**Output:** Collection of mentions and their locations within the given document

Also referred to as spotting, this task refers to the detection of so-called mentions within a given plain text. Depending on the system, different kinds of mentions may be detected; POS tagging, NER-based techniques and string-matching against labels of specific types of entities from the considered knowledge bases are among the potential techniques. Further information pertaining to a mention is sometimes passed on to a system's subsequent step. From the given text, mentions are extracted and passed on to the following step.
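
To make the step concrete, the sketch below shows how NER-based spotting might look with spaCy. It is only an illustration, not the code behind /spacy_md_api, and assumes the en_core_web_sm model is installed:

# Minimal sketch of NER-based mention detection with spaCy (illustrative only).
import spacy

nlp = spacy.load("en_core_web_sm")

def detect_mentions(text):
    # Return detected mentions together with their character offsets in the document.
    doc = nlp(text)
    return [{"mention": ent.text, "offset": ent.start_char} for ent in doc.ents]

print(detect_mentions("Napoleon was the emperor of the First French Empire."))
# e.g. [{'mention': 'Napoleon', 'offset': 0}, ...]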

Candidate Generation


 **Input:** Collection of mentions
**Output:** Collection of candidate entities for each mention

Receiving a list of mentions, the process of candidate generation finds lists of appropriate entities for each passed mention. Some approaches additionally rank candidates at this step e.g. in terms of general importance based on statistical measures.
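
As a hedged illustration of this step, candidates could be looked up by surface form in a label index and ranked by a precomputed score. The index and scores below are invented for the example, not taken from CLiT:

# Illustrative candidate generation via a hand-built label-to-entity index;
# real systems query a knowledge base or precomputed dictionaries instead.
CANDIDATE_INDEX = {
    "empire": [
        {"assignment": "http://dbpedia.org/resource/Empire", "score": 0.7},
        {"assignment": "http://dbpedia.org/resource/First_French_Empire", "score": 0.3},
    ],
}

def generate_candidates(mentions):
    # Attach a ranked candidate list to every detected mention.
    for mention in mentions:
        candidates = CANDIDATE_INDEX.get(mention["mention"].lower(), [])
        mention["possibleAssignments"] = sorted(candidates, key=lambda c: c["score"], reverse=True)
    return mentions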

Entity Disambiguation


 **Input:** Collection of candidate entities
**Output:** Disambiguated entities, either collections ranked
            by likelihood or at most one per mention

This is the part of the pipeline that typically employs the most sophisticated techniques. Generally, this step is based on statistical measures, allowing contexts to be synthesised from detected mentions and suggested candidates.
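
As a deliberately naive sketch, the function below simply keeps the highest-scoring candidate per mention; real disambiguators score candidates against the document context:

# Naive disambiguation: keep the top-scoring candidate per mention (sketch only).
def disambiguate(mentions):
    for mention in mentions:
        candidates = mention.get("possibleAssignments", [])
        mention["assignment"] = max(candidates, key=lambda c: c["score"]) if candidates else None
    return mentions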

Results

Finally, results from the given pipeline are returned. These tend to be in the form of annotations based on the initial input document or as hyperlinks referring to specific knowledge bases, such as Wikipedia, DBpedia or Wikidata. Occasionally, some systems add an additional pruning step.

Pipeline Customization

In order to allow for customized experiences and settings, we introduce further processing possibilities with the intent of allowing for nigh-infinite combinations of system components. The following subcomponents sit conceptually in between the components presented within the classical model of an EL pipeline. We refer to them as processors or subcomponents: they handle post-processing of the structures output by prior tasks and prepare them to be further processed, in turn, by subsequent steps in the chosen workflow. We define four types of processors: splitters, combiners, filters and translators.

Splitter


 **Description:** Splits results from components in intelligent ways (e.g. routing person-type entities to one component and all other entities to another).
 **Preceded by:** Any single-connected component.
**Succeeded by:** 2 or more components.
    **Commonly:** Directly passing the same information to two (or more) components.

Allowing for processing of items prior to passing them on to a subsequent step, a splitter is utilised in the case of a single stream of data being sent to multiple components, potentially warranting specific splitting of data streams. This step encompasses both a post-processing step for a prior component, as well as a pre-processing step for a following one. A potential post-processing step may be to filter information from a prior step, such as eliminating superfluous candidate entities or unwanted mentions.

As for pre-processing, a splitter may, for instance, translate entities from one KB to another, allowing entities resulting from a prior step to be processed by the subsequent components.
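
A possible shape of such a splitter, sketched in Python (the component IDs and routing rules are hypothetical):

# Splitter sketch: forward the same partial result to several downstream components,
# with an optional user-defined routing rule per target.
def split(mentions, routes):
    # routes maps a target component ID to a predicate deciding which mentions it receives.
    return {target: [m for m in mentions if keep(m)] for target, keep in routes.items()}

# Forward everything unchanged to two disambiguators:
# split(mentions, {"ED1": lambda m: True, "ED2": lambda m: True})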

Combiner


 **Description:** Combines results from components in intelligent ways.
 **Preceded by:** Two or more components or subcomponents.
**Succeeded by:** Any single component, *translator* or *filter*.
    **Commonly:** *Union* operation, *intersection* operation.

In case multiple components were utilised in a prior step and are meant to be consolidated through a variety of possible combination actions, a combiner subcomponent must be utilised. It combines results from multiple inputs into a single output, passing merged partial results on to a subsequent component. Common operations include union - taking information from multiple sources and adding it together - and intersection - checking multiple sources for certainty of information prior to passing it on.
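
The two common operations could be sketched as follows, assuming for this example (not mandated by CLiT) that a mention is identified by its offset and surface form:

# Union and intersection combiners over mention lists from two or more upstream components.
def _key(mention):
    return (mention["offset"], mention["mention"])

def combine_union(*results):
    # Keep every mention that any upstream component produced (first occurrence wins).
    merged = {}
    for result in results:
        for mention in result:
            merged.setdefault(_key(mention), mention)
    return list(merged.values())

def combine_intersection(*results):
    # Keep only mentions that all upstream components agree on.
    common = set.intersection(*({_key(m) for m in r} for r in results))
    return [m for m in results[0] if _key(m) in common]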

Filter


 **Description:** Removes certain parts of information. Aims to facilitate further downstream processing.
 **Preceded by:** Any component or subcomponent.
**Succeeded by:** Any component or *translator*.
    **Commonly:** NER-, POS-specific or `rdf:type` filtering.

In order to allow removal of particular sets of items through user-defined rules or dynamic filtering, we introduce a subcomponent based on binary classifiers: a filter. The truth values evaluated on the passed partial results define which of them are passed on to a subsequent component or translator.
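
A minimal sketch of such a filter, with an arbitrarily chosen confidence threshold as the example predicate:

# Filter sketch: a binary predicate decides which partial results survive.
def filter_results(mentions, keep):
    return [m for m in mentions if keep(m)]

# Drop low-confidence detections (threshold chosen arbitrarily for illustration):
# filter_results(mentions, lambda m: m.get("detectionConfidence", 0.0) >= 0.5)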

Translator


 **Description:** 'Translates' entities from one KB to another (e.g. from Wikipedia to Wikidata).
 **Preceded by:** Any component or subcomponent.
**Succeeded by:** Any component or subcomponent.
    **Commonly:** `owl:sameAs` linking across KGs.

Enabling seamless use of annotation tools regardless of underlying KG, we introduce the translator subcomponent. It is meant as a processing unit capable of translating entities and potentially other features used by one tool to another, allowing further inter-system compatibility. It may be employed at any level and succeeded by any (sub)component due to its ubiquitous characteristics and necessity when working with heterogeneous systems.
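
A minimal sketch of a translator based on a static URI mapping; the mapping entry below is only an example, and a real translator would resolve owl:sameAs links or use a mapping service:

# Translator sketch: rewrite entity URIs from one KB to another via a lookup table.
WIKIDATA_TO_DBPEDIA = {
    "http://www.wikidata.org/entity/Q517": "http://dbpedia.org/resource/Napoleon",
}

def translate(mentions, mapping=WIKIDATA_TO_DBPEDIA):
    for mention in mentions:
        assignment = mention.get("assignment")
        if assignment and assignment.get("assignment") in mapping:
            assignment["assignment"] = mapping[assignment["assignment"]]
    return mentions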

Pipeline Examples

Example of a simple EL pipeline with a processor component, translating from Wikidata to DBpedia entities:

Simple pipeline graph

Example of an advanced EL pipeline with a splitter and a combiner, merging the results of the mention detection of three different EL systems:

Advanced pipeline graph

Protocol Development

For the formal definition of a pipeline we use JSON. A pipeline configuration consists of an ID, the pipeline type, a list of components for each component type, and a list of connections between the components. Each component list consists of key-value pairs, where the key is the ID of the component and the value defines either the EL system used for this component or, in the case of processors, their type. Slightly differently, for the connection list the key represents the source and the value the target component of the connection. An example of a JSON configuration for a simple pipeline (corresponding to the first of the two figures above) looks like this:

pipelineConfig = {
	"id": 123,
	"pipelineType": "complex",
	"md": [
		{"MD1": "Babelfy"} ],
	"cg": [
		{"CG1": "DBpediaSp"} ],
	"ed": [
		{"ED1": "AIDA"} ],
	"combiners": [],
	"splitters": [],
	"translators": [
		{"TR1": "WD2DBP"} ],
	"filters": [],
	"connections": [
		{"MD1": "CG1"},
		{"CG1": "ED1"},
		{"ED1": "TR1"} ]
}
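
As a usage illustration (not part of CLiT), the following sketch checks such a configuration for connections that reference undeclared component IDs; key names follow the example above:

# Validate that every connection endpoint refers to a declared component.
def declared_components(config):
    ids = set()
    for group in ("md", "cg", "ed", "combiners", "splitters", "translators", "filters"):
        for entry in config.get(group, []):
            ids.update(entry.keys())
    return ids

def dangling_connections(config):
    ids = declared_components(config)
    return [c for c in config.get("connections", [])
            if not set(c.keys()) <= ids or not set(c.values()) <= ids]

# dangling_connections(pipelineConfig) == [] for the configuration above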

JSON Result structure

Pipeline Results:

	JSONObject{
		"experimentId": int,
		"experimentTasks": 
			JSONArray[
				EXPERIMENTS
			]
	}

EXPERIMENTS:

	JSONObject{
		"currentComponent": String,
		"documents": 
			JSONArray[
				DOCUMENTS
			],
		"pipelineConfig": 
			PIPELINE,
		"errorMessage": String,
		"experimentId": int,
		"state": String (e.g. "DONE"),
		"pipelineType": String (e.g. "FULL"),
		"taskId": int
	}

DOCUMENTS:

	JSONArray[			-- need double JSONArray in order to store sub-pipeline results for each document
		JSONObject{
			"componentId": String,
			"mentions": 
				JSONArray[
					MENTIONS
				],
			"text": String (e.g. "Napoleon was the emperor of the First French Empire."),
			"pipelineType": String (e.g. "ED"),
			"uri": String
		}
	]

MENTIONS:

	JSONObject{
		"offset": int,
		"assignment": 
			JSONObject{
				"score": double,
				"assignment": String (e.g. "http://dbpedia.org/resource/Empire")
			},
		"possibleAssignments": 
			JSONArray[
				JSONObject{
					"score": double, 
					"assignment": String
				}
			],
		"originalWithoutStopwords": String (e.g. "Empire"),
		"detectionConfidence": double (e.g. 0.0),
		"originalMention": String (e.g. "Empire"),
		"mention": String (e.g. "Empire")
	}

PIPELINE:

	JSONObject{
		"startComponents": 
			JSONArray[
				String (e.g. "MD1")
			],
		"components": 
			JSONObject{
				"md" (Only if present. May be: md, cg, ed, md_cg_ed, md_cg, cg_ed, ...):
					JSONArray[
						JSONObject{
							"id": String (e.g. "MD1"; ID matches w/ IDs in connections),
							"value": String (e.g. "Babelfy")
						}
					],

				"cg_ed" (IF PRESENT): 
					JSONArray[
						JSONObject{
							"id": String (e.g. "CG_ED1"),
							"value": String (e.g. "Babelfy")
						}
					]
			},
		"exampleId": String (e.g. "md_combined_cged"),
		"endComponents": 
			JSONArray[
				String (e.g. "CG_ED1", "MD1")
			],
		"displayName": String (e.g. "MD + combined CG-ED"),
		"id": int (e.g. 1),
		"connections":
			JSONArray[
				JSONObject{
					"source": String (e.g. "MD1"),
					"target": String (e.g. "CG_ED1")
				}
			],
		"pipelineConfigType": String (e.g. "complex")
	}
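
As an illustration of consuming this structure (not part of CLiT), the following sketch collects the final entity assignment for every mention of every document; field names follow the schema described above:

# Walk the result structure and list (document URI, mention, assigned entity) triples.
def extract_assignments(result):
    rows = []
    for task in result.get("experimentTasks", []):
        for sub_result in task.get("documents", []):   # nested arrays hold per-document,
            for document in sub_result:                # per-sub-pipeline results
                for mention in document.get("mentions", []):
                    assignment = mention.get("assignment") or {}
                    rows.append((document.get("uri"), mention.get("mention"), assignment.get("assignment")))
    return rows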
