Andreas Nareike edited this page Nov 26, 2014 · 2 revisions

How to Initialize the LUL Knowledge Graph

General Setup

We use Virtuoso as the back end for the Knowledge Graph. Some transformations are needed to build an integrated graph out of the different source graphs; others are accomplished by reasoning. Some of the SPARQL update queries rely on Virtuoso-specific features, so they may need changes to run on another triplestore.

The 'entry-point' to the Knowledge Graph is a GND number. Some metadata about a GND number can be retrieved from the GND. More will be pulled from DBpedia, possibly with the help of other graphs, e.g. a graph built from geonames.org.

Much of the linking between GND and DBpedia is done with owl:sameAs. Such triples exist in the GND as well as in DBpedia, but the links are not always reciprocal: a GND number that links to a DBpedia resource is not necessarily linked back, and vice versa.
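To get a feeling for how many owl:sameAs links are one-directional, the pairs can be compared outside the triplestore. The following Python sketch is only an illustration and not part of the original setup; the function name missing_back_links and the shortened identifiers in the usage example are hypothetical.

```python
def missing_back_links(same_as_pairs):
    """Return the (subject, object) pairs that have no reverse owl:sameAs link.

    same_as_pairs is any iterable of (subject, object) tuples, e.g. the rows
    of a SELECT ?s ?o WHERE { ?s owl:sameAs ?o } result.
    """
    pairs = set(same_as_pairs)
    # A link is one-directional if the swapped pair is not present.
    return sorted((s, o) for (s, o) in pairs if (o, s) not in pairs)


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    links = [
        ("gnd:1", "dbpedia:A"),
        ("dbpedia:A", "gnd:1"),
        ("gnd:2", "dbpedia:B"),
    ]
    for subject, obj in missing_back_links(links):
        print(subject, "is not linked back from", obj)
```

With reasoning over owl:sameAs enabled in Virtuoso (see below), such gaps do not need to be repaired by hand; the sketch is only meant to make them visible.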

Graphs

In our triplestore, we are using multiple graphs:

Tools

Preparing the Basic Graphs

  • GND complete
    • preprocessing
  • GND Ontology: The GND Ontology can be fetched with curl like so:
    curl --header "Accept: application/rdf+xml"       \
         http://d-nb.info/standards/elementset/gnd#   \
         > gndo.rdf
    
  • Parts of DBpedia
    • preprocessing
      • parenthesis
      • inverse owl:sameAs? (not needed, but might be faster?)
  • Geonames resources that are linked from GND

Additional Triples

For some GND numbers, there is a link to a Wikipedia entry via foaf:page, but no link to the corresponding DBpedia resource. A Wikipedia link can be transformed into a DBpedia link by simple string substitution. In this case, the property owl:sameAs should be used instead of foaf:page. Those extra triples are generated with an INSERT query:

untested!

PREFIX gndo: <http://d-nb.info/standards/elementset/gnd#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

INSERT INTO <http://kg.ub.uni-leipzig.de/sameAs/> {
   ?gnd owl:sameAs ?dblink .
}
WHERE
{
    {
        SELECT ?gnd ?page (iri(replace(str(?page),"^http://de.wikipedia.org/wiki/", "http://de.dbpedia.org/resource/")) as ?dblink)
        WHERE {
            ?gnd foaf:page ?page .
            FILTER NOT EXISTS{
               ?gnd owl:sameAs ?uri .
               FILTER (regex(str(?uri), "dbpedia\\.org"))
            }
        }
    }
}

NOTE: This query could be refactored to use BIND instead of the subquery, which might be easier to understand at first glance. Our version of Virtuoso does not support BIND, so we did it this way.
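The string substitution used in the query can also be tried outside the triplestore, for example to spot-check a few links before running the INSERT. The following Python sketch is an illustration only; the function name wikipedia_to_dbpedia is hypothetical, and the two prefixes are copied from the replace() call in the query. Unlike the pattern in the query, the dots are escaped here so that they match only literal dots.

```python
import re

# Prefixes copied from the replace() call in the INSERT query.
WIKIPEDIA_PREFIX = "http://de.wikipedia.org/wiki/"
DBPEDIA_PREFIX = "http://de.dbpedia.org/resource/"


def wikipedia_to_dbpedia(page_iri):
    """Rewrite a German Wikipedia page IRI to a German DBpedia resource IRI.

    IRIs that do not start with the Wikipedia prefix are returned unchanged,
    just as a regex replace leaves non-matching strings untouched.
    """
    pattern = "^" + re.escape(WIKIPEDIA_PREFIX)
    return re.sub(pattern, DBPEDIA_PREFIX, page_iri)


if __name__ == "__main__":
    print(wikipedia_to_dbpedia("http://de.wikipedia.org/wiki/Leipzig"))
```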

Inference rules

Linking With Same As

Virtuoso comes with the option to enable reasoning via owl:sameAs. The symmetry and transitivity of owl:sameAs are taken into account as well, so it is not really necessary to build a transitive and/or symmetric closure as a preprocessing step. To enable reasoning via owl:sameAs, the line

DEFINE input:same-as "yes"

must be included in SPARQL queries.

Subproperties

The GND uses a variety of different properties (e.g. gndo:preferredNameForThePerson for persons) to add labels to resources. There might be some advantage to this, but it also greatly complicates the process of getting a label for a resource. The GND properties are not related to rdfs:label in the GND Ontology, so we add this information in a rules graph.

Additionally, not all GND resources are linked to DBpedia resources. For these resources, we cannot obtain a comment from DBpedia. However, there are two GND properties that serve roughly the same purpose, namely gndo:definition and gndo:biographicalOrHistoricalInformation, so we also add triples that link these properties with rdfs:comment.

Building a (rules) graph with this additional information via SPARUL:

SPARQL
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix gndo: <http://d-nb.info/standards/elementset/gnd#>
 
INSERT IN GRAPH <http://kg.ub.uni-leipzig.de/gndrules/> {
    gndo:preferredName rdfs:subPropertyOf rdfs:label .
    gndo:preferredNameForTheConferenceOrEvent rdfs:subPropertyOf rdfs:label .
    gndo:preferredNameForTheCorporateBody rdfs:subPropertyOf rdfs:label .
    gndo:preferredNameForTheFamily rdfs:subPropertyOf rdfs:label .
    gndo:preferredNameForThePerson rdfs:subPropertyOf rdfs:label .
    gndo:preferredNameForThePlaceOrGeographicName rdfs:subPropertyOf rdfs:label .
    gndo:preferredNameForTheSubjectHeading rdfs:subPropertyOf rdfs:label .
    gndo:preferredNameForTheWork rdfs:subPropertyOf rdfs:label .
    
    gndo:definition rdfs:subPropertyOf rdfs:comment .
    gndo:biographicalOrHistoricalInformation rdfs:subPropertyOf rdfs:comment .
} ;

Next, define a ruleset:

rdfs_rule_set('gndrules', 'http://kg.ub.uni-leipzig.de/gndrules/');

Note: This should be redone whenever the graphs that make use of the rules change.

Check whether the rule set has been created successfully:

SELECT * FROM DB.DBA.SYS_RDF_SCHEMA;

To use this ruleset in a SPARQL query, the header

DEFINE input:inference <gndrules>

must be included at the beginning of a SPARQL query.
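When queries are sent from a script, the DEFINE lines can be prepended programmatically. The following Python sketch is an assumption about how a client might do this, not part of the original setup; the helper name with_pragmas and its parameters are hypothetical, and the two pragma lines are copied verbatim from this page.

```python
def with_pragmas(query, same_as=False, rule_set=None):
    """Return the query with the requested Virtuoso DEFINE lines prepended.

    same_as  -- if True, add the owl:sameAs reasoning pragma
    rule_set -- name of a rule set created with rdfs_rule_set, or None
    """
    lines = []
    if same_as:
        lines.append('DEFINE input:same-as "yes"')
    if rule_set is not None:
        lines.append("DEFINE input:inference <%s>" % rule_set)
    # The pragmas have to come before the query text itself.
    lines.append(query.lstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(with_pragmas("SELECT * WHERE { ?s ?p ?o } LIMIT 1",
                       same_as=True, rule_set="gndrules"))
```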

Note: Queries generally run faster without additional reasoning. So it might be better to materialize the triples instead. This can be done with a CONSTRUCT query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT { ?s ?p2 ?o }
WHERE {
    GRAPH <http://kg.ub.uni-leipzig.de/gndrules/> {
        ?p1 rdfs:subPropertyOf ?p2
    }
    GRAPH <http://d-nb.info/gnd/> {
        ?s ?p1 ?o
    }
}

Be aware that this query produces close to 13M unique triples.
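What the CONSTRUCT query computes can be illustrated on a handful of in-memory triples. The following Python sketch is only meant to show the logic and is no substitute for running the query in the triplestore; the function name materialize_subproperties and the example resource IRI are hypothetical. Like the query, it applies one rdfs:subPropertyOf step and does not follow chains of sub-properties.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
GNDO = "http://d-nb.info/standards/elementset/gnd#"


def materialize_subproperties(rule_triples, data_triples):
    """Return the set of triples (s, p2, o) for each data triple (s, p1, o)
    where the rules contain (p1, rdfs:subPropertyOf, p2)."""
    # Map each sub-property to its direct super-properties.
    supers = {}
    for s, p, o in rule_triples:
        if p == RDFS + "subPropertyOf":
            supers.setdefault(s, set()).add(o)
    # Re-emit every matching data triple with the super-property as predicate.
    return {(s, p2, o)
            for s, p1, o in data_triples
            for p2 in supers.get(p1, ())}


if __name__ == "__main__":
    rules = [(GNDO + "preferredNameForThePerson",
              RDFS + "subPropertyOf",
              RDFS + "label")]
    # Hypothetical resource IRI and name, for illustration only.
    data = [("http://example.org/person/1",
             GNDO + "preferredNameForThePerson",
             "Example, Person")]
    for triple in materialize_subproperties(rules, data):
        print(triple)
```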
