Blend, grind, and enjoy LOD – fresh from the mill!
See the .travis.yml file for details on the CI config used by Travis.
The lodmill-rd folder contains code for working with raw data and its transformation to LOD, based on the Culturegraph Metafacture toolkit.
The lodmill-ld folder contains code for processing RDF with Hadoop and for indexing it in Elasticsearch.
The hbz/lobid repo contains a web app based on the Play framework that interacts with the resulting Elasticsearch index. It provides an HTTP API to the index and a UI for documentation and basic sample usage. See below for details on the data workflow.
Prerequisites: Java 8 and Maven 3 (run mvn -version to make sure Maven is using Java 8)
To set up a local build of the lodmill components, follow these steps:
Clone lodmill and run the Maven build
git clone https://github.com/lobid/lodmill.git; cd lodmill; echo "required for MRUnit tests:"; umask 0022; mvn clean install -DdescriptorId=jar-with-dependencies -DskipIntegrationTests --settings settings.xml
Build Hadoop Jar
cd .. ; cd lodmill-ld
mvn clean assembly:assembly --settings ../settings.xml
See the index.sh scripts in lodmill-ld/doc/scripts for how to use the resulting Jar with your Hadoop and Elasticsearch clusters.
For information on how we have set up our Hadoop and Elasticsearch backend clusters see
The complete data workflow transforms the original raw data to Linked Data served via an HTTP API:
Raw Data to Linked Data Triples
The raw data is transformed to linked data triples in the N-Triples RDF serialization. As a line-based format, N-Triples is well suited for batch processing with Hadoop. All triples, both those generated from the raw data and the enrichment data from external sources (like GND and Dewey), are stored in the Hadoop Distributed File System (HDFS) and made available to the two Hadoop jobs that process and convert the data.
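To illustrate why a line-based format suits batch processing, here is a minimal Python sketch that parses N-Triples one line at a time, so any split of the input can be handled independently. It is not the lodmill implementation (which runs on Hadoop), and the regular expression is a simplifying assumption: it only covers URI subjects and predicates, and objects that are either URIs or plain literals.

```python
import re

# Simplified pattern (assumption): URI subject and predicate,
# object is either a URI or a plain literal without datatype or language tag.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(<[^>]*>|".*")\s*\.\s*$')


def parse_triple(line):
    """Parse one N-Triples line into (subject, predicate, object), or None."""
    match = TRIPLE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


lines = [
    '<http://lobid.org/resource/HT002189125> '
    '<http://purl.org/dc/elements/1.1/creator> '
    '<http://d-nb.info/gnd/118580604> .',
    '<http://d-nb.info/gnd/118580604> '
    '<http://d-nb.info/standards/elementset/gnd#preferredNameForThePerson> '
    '"Melville, Herman" .',
]

# Each line is parsed on its own; no state is shared between lines.
triples = [parse_triple(line) for line in lines]
```

Because every line is a complete statement, the input can be cut at any line boundary and processed in parallel.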
Linked Data Triples to Records
The first Hadoop job (implemented in CollectSubjects.java) selects triples that need to be resolved to build meaningful records to be indexed later. One example would be the creator triples for lobid-resources, e.g.:
<http://lobid.org/resource/HT002189125> <http://purl.org/dc/elements/1.1/creator> <http://d-nb.info/gnd/118580604> .
The creator of the resource is identified via its GND ID. To allow searching by the actual name of the creator, we want to resolve the name literals, so we declare that the creator property needs to be resolved in the resolve.properties file:
resolve = \
  http://purl.org/dc/elements/1.1/creator; \
  [...]
With the property declared as needing resolution, the first Hadoop job maps the GND ID to the resource ID when it encounters the triple above:
<http://d-nb.info/gnd/118580604> : <http://lobid.org/resource/HT002189125>
This information is written to a zipped Hadoop map file (a persistent lookup mechanism) after the first job is complete.
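The mapping step of the first job can be sketched in plain Python as follows. This is an in-memory illustration only, not the code in CollectSubjects.java: the function name collect_subjects and the use of a dict in place of the Hadoop map file are assumptions made for the example.

```python
# Properties declared as needing resolution (taken from the example above).
RESOLVE = {"http://purl.org/dc/elements/1.1/creator"}


def collect_subjects(triples, resolve=RESOLVE):
    """Map the object of each resolvable triple to the subjects that refer to it.

    `triples` is an iterable of (subject, predicate, object) tuples.
    The returned dict stands in for the persistent Hadoop map file.
    """
    lookup = {}
    for subject, predicate, obj in triples:
        if predicate in resolve:
            # One entity (e.g. a GND ID) may be referenced by many resources.
            lookup.setdefault(obj, set()).add(subject)
    return lookup


triples = [
    (
        "http://lobid.org/resource/HT002189125",
        "http://purl.org/dc/elements/1.1/creator",
        "http://d-nb.info/gnd/118580604",
    ),
]
lookup = collect_subjects(triples)
```

Here lookup maps the GND ID to the set of resource IDs that name it as creator, which is the information the second job needs.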
The second Hadoop job (implemented in NTriplesToJsonLd.java) collects all triples with the same subject (i.e. all statements about a resource) and converts them to a JSON-LD record. In addition to the selected triples, we also need details about the entities defined as needing resolution in the first job. For instance, we want the name of the creator in our records, so we declare the corresponding property as a resolution property in the resolve.properties file:
predicates = \
  http://d-nb.info/standards/elementset/gnd#preferredNameForThePerson; \
  [...]
This will cause the second Hadoop job to perform a lookup on the subject of a triple containing that property, e.g.:
<http://d-nb.info/gnd/118580604> <http://d-nb.info/standards/elementset/gnd#preferredNameForThePerson> "Melville, Herman" .
Since the subject is mapped to the <http://lobid.org/resource/HT002189125> resource ID, we add that triple to the triples of that resource, which yields a record for <http://lobid.org/resource/HT002189125> that contains not only the triples with that subject, but also the enrichment triples defined in the resolve.properties file. That way, we effectively define our records to be subgraphs of the complete triple set.
The same mechanism is used to resolve information modeled using blank nodes. See our current resolve.properties file for details.
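The grouping and enrichment described above can be sketched in Python like this. It is a simplified in-memory illustration, not the code in NTriplesToJsonLd.java. The function name build_records is hypothetical, and filing each enrichment triple under its own subject as well as under the mapped resources is an assumption of this sketch.

```python
from collections import defaultdict

# Resolution predicates (taken from the example above).
PREDICATES = {
    "http://d-nb.info/standards/elementset/gnd#preferredNameForThePerson"
}


def build_records(triples, lookup, predicates=PREDICATES):
    """Group triples by subject and attach enrichment triples to mapped resources.

    `lookup` maps an entity ID to the set of resource IDs that reference it,
    as produced by the first job.
    """
    records = defaultdict(list)
    for triple in triples:
        subject, predicate, _ = triple
        # Every triple belongs to the record of its own subject.
        records[subject].append(triple)
        if predicate in predicates:
            # Enrichment triples are also added to each referencing resource.
            for resource in sorted(lookup.get(subject, ())):
                records[resource].append(triple)
    return dict(records)


resource = "http://lobid.org/resource/HT002189125"
gnd = "http://d-nb.info/gnd/118580604"
creator = (resource, "http://purl.org/dc/elements/1.1/creator", gnd)
name = (
    gnd,
    "http://d-nb.info/standards/elementset/gnd#preferredNameForThePerson",
    "Melville, Herman",
)

records = build_records([creator, name], {gnd: {resource}})
```

The record for the resource then holds both its own creator triple and the name triple of the GND entity, i.e. a subgraph of the complete triple set.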
Linked Data Records to Index
The second Hadoop job writes the records as expanded JSON-LD in the Elasticsearch bulk import format, which is then indexed in Elasticsearch using the Elasticsearch bulk Java API. We use expanded JSON-LD in the index to have consistent types for each field. In compact JSON-LD, if a field has just a single value, the type of that field will simply be the type of the value. If the same field has multiple values, the type will be an array, etc. Elasticsearch learns the index schema from the data, so we need to use consistent types for a given field. The expanded JSON-LD serialization does exactly this. For instance, it always uses arrays, even if there is only a single value.
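The type problem can be illustrated with a small Python sketch. The function below is not the JSON-LD expansion algorithm, only a simplified stand-in that wraps every value in an array, and the second creator URI (under example.org) is a hypothetical placeholder.

```python
def as_expanded(record):
    """Wrap every value in a list of value objects, loosely like expanded JSON-LD."""
    expanded = {}
    for key, value in record.items():
        values = value if isinstance(value, list) else [value]
        expanded[key] = [
            v if isinstance(v, dict) else {"@value": v} for v in values
        ]
    return expanded


CREATOR = "http://purl.org/dc/elements/1.1/creator"

# In compact form, the same field is an object in one record and an array in the other.
single = {CREATOR: {"@id": "http://d-nb.info/gnd/118580604"}}
multiple = {
    CREATOR: [
        {"@id": "http://d-nb.info/gnd/118580604"},
        {"@id": "http://example.org/person/1"},  # hypothetical second creator
    ]
}

expanded_single = as_expanded(single)
expanded_multiple = as_expanded(multiple)
```

After wrapping, the creator field is an array in both records, so a schema learned from the first record also fits the second.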
Linked Data Index to HTTP API
Finally, our Play frontend accesses the Elasticsearch index and serves the records as JSON-LD. It supports multiple query options and result formats; see the documentation at http://lobid.org/api (implemented in the app/views/index.scala.html template in the hbz/lobid repo). Since the expanded JSON-LD described above is cumbersome, the API serves compact JSON-LD. It also uses an external JSON-LD context document to allow shorter keys instead of full URI properties, and to encapsulate the actual properties used in the expanded form. That way, we can change the properties without requiring API clients to change how they process the JSON responses.
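How a context decouples clients from the underlying properties can be sketched in Python as follows. This is a heavily simplified stand-in for JSON-LD compaction, not the API's implementation: the function name compact and both context dicts are hypothetical, and values are flattened to plain strings for brevity.

```python
def compact(expanded, context):
    """Replace full property URIs with the short terms defined in `context`."""
    uri_to_term = {uri: term for term, uri in context.items()}
    compacted = {}
    for uri, values in expanded.items():
        plain = [v.get("@value", v.get("@id")) for v in values]
        # Single values are unwrapped, as in compact JSON-LD.
        compacted[uri_to_term.get(uri, uri)] = plain[0] if len(plain) == 1 else plain
    return compacted


gnd = {"@id": "http://d-nb.info/gnd/118580604"}

# Hypothetical contexts: the same short key bound to two different properties.
old_context = {"creator": "http://purl.org/dc/elements/1.1/creator"}
new_context = {"creator": "http://purl.org/dc/terms/creator"}

old_response = compact({"http://purl.org/dc/elements/1.1/creator": [gnd]}, old_context)
new_response = compact({"http://purl.org/dc/terms/creator": [gnd]}, new_context)
```

Both responses expose the same creator key, so a client reading that key is unaffected when the property behind it changes.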
Eclipse Public License: http://www.eclipse.org/legal/epl-v10.html