
About

In this tutorial I will explain how I installed and configured Fusepool to accomplish the following tasks:

  • Task 1: Create a sample taxonomy based on the authority data for subject indexing of the German National Library (GND)

  • Task 2: Create a sample dataset to run the SMA component on, based on useful plaintext attributes in DBpedia

  • Task 3: Integrate these two data sets into the Fusepool platform and create an enrichment chain that runs the SMA on the taxonomy

If you follow the instructions you should be able to use the Annotation Client displayed in the screenshot below:

![The Library Enrichment Engine is a simple client for semiautomated annotation of plain text with GND Identifiers. ][lee-screenshot]

[lee-screenshot]: img/lee-screenshot.png "The Library Enrichment Engine is a simple client for semiautomated annotation of plain text with GND Identifiers. " width="797px" height="633px"

Installing Fusepool

Fusepool is available as source code on GitHub, but precompiled JARs are also available in the Fusepool repository.

Download the latest build (I used #378) and change to its directory. Then type:

java -Xmx4G -XX:MaxPermSize=512M -Xss512k -XX:+UseCompressedOops -jar launcher-0.1-SNAPSHOT.jar

If you want to run Fusepool as a server in the background and be able to log out from the shell, use nohup:

nohup java -Xmx18G -XX:MaxPermSize=512M -Xss512k -XX:+UseCompressedOops -jar launcher-0.1-SNAPSHOT.jar &
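Once the launcher has started, a quick sanity check is to request the platform's start page; port 8080 is assumed here, as used throughout this tutorial:

curl -I http://localhost:8080/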

Set up Sample Taxonomy and Document Corpus for matching

This section describes the selection of data used for annotating DBpedia pages with authority data (concepts) from the German National Library (GND). Both sources can be accessed via SPARQL at the endpoints of the GND and the German DBpedia. It also describes how to integrate the data into Fusepool to build a Dictionary Annotator for the String Matching Algorithm (SMA) of the Fusepool project (Task 1 + Task 2).

Building Dictionaries

The German National Library (DNB) publishes its authority data on its website, but there is also a SPARQL endpoint provided by the German National Library of Economics (ZBW). At the time of writing this tutorial, the ZBW endpoint contains the DNB dumps of February 2014.

Fusepool's Dictionary Annotator expects a dictionary in N-Triples (.nt) format that associates a concept's identifier (URI) with the labels used for matching text documents. However, you are free to use additional properties and vocabularies in your graph. Just remember which properties you want to use for the matching and which type of your concept provides a dereferenceable URI (we will need this information for the setup of the Dictionary Annotator).

For our example I chose skos:Concept for the URI type and rdfs:label for the matching strings. Additionally we have skos:hasTopConcept and skos:prefLabel. Please note that I used the rdfs:label property as a workaround to bring both preferred labels and synonyms into the dictionary; meanwhile, the configuration supports using multiple labels for matching.

Your dictionary graph should look like this:

<http://d-nb.info/gnd/4020775-4> <http://www.w3.org/2004/02/skos/core#prefLabel> "Gesundheitswesen" .
<http://d-nb.info/gnd/4020775-4> <http://www.w3.org/2000/01/rdf-schema#label> "Gesundheitswesen" .
<http://d-nb.info/gnd/4020775-4> <http://www.w3.org/2000/01/rdf-schema#label> "Medizinalwesen" .
<http://d-nb.info/gnd/4020775-4> <http://www.w3.org/2000/01/rdf-schema#label> "Gesundheitssystem" .
<http://d-nb.info/gnd/4020775-4> <http://www.w3.org/2004/02/skos/core#hasTopConcept> <http://d-nb.info/standards/vocab/gnd/gnd-sc#27.20> .
<http://d-nb.info/gnd/4020775-4> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
<http://d-nb.info/gnd/4020775-4> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://d-nb.info/standards/elementset/gnd#SubjectHeadingSensoStricto> .

To get this modified structure we can run a CONSTRUCT query against the ZBW endpoint. For instance, I want to create a dictionary for each subclass of SubjectHeading in the GND Ontology. This has the advantage of more flexibility when building custom chains with particular types of subject headings. I chose:

  • SubjectHeading // 196,670 IDs

  • SubjectHeadingSensoStricto // 76,416 IDs

  • NomenclatureInBiologyOrChemistry // 50,285 IDs

  • HistoricSingleEventOrEra // 6,779 IDs

  • Language // 5,299 IDs

  • ProductNameOrBrandName // 4,843 IDs

  • EthnographicName // 3,971 IDs

  • GroupOfPersons // 273 IDs

A CONSTRUCT query looks like the following:

PREFIX :        <http://d-nb.info/gnd/>
PREFIX gndo:    <http://d-nb.info/standards/elementset/gnd#>
PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT {
  ?s a ?type .
  ?s a skos:Concept .
  ?s skos:hasTopConcept ?sCat .
  ?s skos:prefLabel ?sPref .
  ?s rdfs:label ?sPref .
  ?s rdfs:label ?sAlt .
}
WHERE {
  ?s a gndo:{class to construct} .
  ?s a ?type .
  ?s gndo:preferredNameForTheSubjectHeading ?sPref ;
     gndo:gndSubjectCategory ?sCat .
  OPTIONAL { ?s gndo:variantNameForTheSubjectHeading ?sAlt . }
}

Where {class to construct} is a class from the GND Ontology (cf. list above).
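For example, to build the HistoricSingleEventOrEra dictionary used later in this tutorial, the class pattern in the WHERE clause becomes:

?s a gndo:HistoricSingleEventOrEra .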

Copy the response of the SPARQL endpoint into a file and set the file extension to .nt. Now we have a dictionary that can be used by Fusepool.
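If you prefer the command line to copying from the endpoint's web interface, the same result can be produced with curl. This is only a sketch: ENDPOINT is a placeholder for the ZBW GND endpoint URL, and construct.rq is assumed to contain the CONSTRUCT query above with {class to construct} filled in.

ENDPOINT="http://example.org/sparql"   # placeholder: use the actual ZBW GND SPARQL endpoint URL
curl "$ENDPOINT" \
     --data-urlencode "query@construct.rq" \
     -H "Accept: text/plain" \
     > HistoricSingleEventOrEra.nt

Depending on the endpoint, you may need to request Accept: application/n-triples instead to get N-Triples output.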

Upload the Dictionary

Start the Fusepool platform and go to the upload form at http://localhost:8080/graph/upload-form to insert a new graph for your dictionary. You can specify any name in this syntax: urn:x-localinstance:/HistoricSingleEventOrEra.graph. Copy the name (URI), because we will need it for the setup of the dictionary. Then select an N-Triples file from your filesystem and click upload.

Now go to http://localhost:8080/admin/graphs/ to check that your graph was uploaded correctly.
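A rough check is also possible from the shell by fetching that page and looking for your graph URI. This assumes the admin page lists the graph names in its HTML and requires no login (otherwise pass credentials to curl with -u):

curl -s http://localhost:8080/admin/graphs/ | grep HistoricSingleEventOrEra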

Configure the Dictionary Annotator in Fusepool

After you have loaded the data you can proceed with setting up the Dictionary Annotator in the OSGi configuration console. Search for "Dictionary Annotator" and click the + sign to create a new factory configuration. Configure it based on the screenshot: Configure the Dictionary

Here are the values for convenience:

dictionary name: HistoricSingleEventOrEra-dict 
Description: `your description`
graph-uri: urn:x-localinstance:/HistoricSingleEventOrEra.graph
Label-field: rdfs:label
URI-Field: skos:Concept 
prefixes: `skos: <http://www.w3.org/2004/02/skos/core#>` + `rdfs: <http://www.w3.org/2000/01/rdf-schema#>`
Category: `HistoricSingleEventOrEra;http://d-nb.info/standards/elementset/gnd#HistoricSingleEventOrEra`

The recall for concept detection will increase if you set stemming to "German"; however, the precision may drop. I think it is a good idea to set case sensitivity to zero to prevent false matches caused by acronyms.

Add the dictionary to an enhancement chain

In Stanbol/Fusepool terminology, a chain combines enhancement engines that are run in sequence to enhance content, i.e. to compute matchings. You can compute matchings using the Data Life Center (DLC). At the time of writing this tutorial there is no way to specify a particular chain for batch matching, so you need to add the dictionaries to the default chain. You will find the Weighted Chain configuration in the OSGi configuration console; it contains a factory setting for the default chain. Click on configure and add the exact name of the dictionary, i.e. HistoricSingleEventOrEra-dict for our example. Add your dictionaries to the default chain

Creating your own chain

You may create a new chain to route your analytics to certain sets of dictionaries or to combine them with other engines. This is basically done the same way as adding your dictionary names to the default chain: just create a new factory setting in the Weighted Chain configuration.

Try it

Now go to http://localhost:8080/enhancer/ and check whether all engines in the default chain are available (i.e. click on default or your custom chain name). The chain only works if all of its engines are 'green'; check your configuration if they are not. Paste some text and review the results. If everything works in the Fusepool enhancer, it should also work with the Library Enrichment Engine.
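You can also send text to the enhancer from the shell. The sketch below assumes the standard Stanbol enhancer REST interface (Fusepool is built on Stanbol), which accepts plain text via POST on a chain endpoint and returns the enhancements as RDF; replace the chain name and the sample text with your own:

curl -X POST \
     -H "Content-Type: text/plain" \
     -H "Accept: text/turtle" \
     --data "Die Reform des Gesundheitswesens in Deutschland." \
     http://localhost:8080/enhancer/chain/default

If the matching works, the returned RDF should contain enhancement annotations pointing to GND URIs such as http://d-nb.info/gnd/4020775-4 (the concept from the dictionary example above).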

Download the LEE source, cd to the created directory and type:

$ npm install

LEE is started like this:

node ./bin/www

From this point you should be able to use the Library Enrichment Engine, for instance on your local host at http://localhost:8081. You can use different chains of different Fusepool instances as you like. To register new services for the app's frontend, edit ./res/services.json in the base directory, where you add the URIs of your chains to easily switch between different services.