Labels in large RDF databases

Paul Houle edited this page Apr 9, 2014 · 6 revisions

The other day I was looking at :BaseKB in the browser and trying to explain the situation about labels in :BaseKB, which are derived from labels in Freebase.

To make a long story short, Freebase has a lot of labels: labels in different languages, multiple labels for the same thing, and many identical labels shared by different things. For instance, if I write

   ?s rdfs:label "Honey"@en .

I get 1334 results from :BaseKB Gold. You'll get similar results if you try other short words. You might ask, "What are all these things?" and the way to know in detail is to write more queries. I'm sure you'll find the apicultural product in there somewhere, but out of 40 million objects there are many music tracks, music albums, books, book editions, films, television shows, television episodes and other sorts of things, even the names of tens of thousands of boats. So lots of things end up having the same name (consider all the "Honey" entries in Wikipedia).

This problem is not so bad when you work with DBpedia, for two reasons. One is that DBpedia has about 4 million entities in it, one tenth the count of :BaseKB; the other is that DBpedia names are derived from Wikipedia's namespace, in which names are also unique keys, so names in Wikipedia are always disambiguated.

Because Wikipedia forces pages to have unique names, labels in DBpedia are unique. This is not a formal system of disambiguation, but rather a process of forced human choice; as you see, people typically disambiguate things by (1) using a non-problematic name, (2) adding a type, such as "TV Channel", or (3) adding a type and a reference to a related object, such as "Hikaru Utada Song".

Fortunately, Wikipedia labels are hidden inside the Freebase keys. If we've loaded the key segment of :BaseKB we can look up the sweet syrup with

prefix : <>

select ?s {
   ?s :type.object.key "/wikipedia/en_title/Honey" .
}

and get :m.03qmh as a result, which is the real thing. Let's take a look at its Freebase keys by running this query

prefix : <>

select ?k {
   :m.03qmh :type.object.key ?k .
}

and we get back a lot of stuff.


Note there are keys pointing to all of the language Wikipedias. The /?lang/ ones can be interpreted as labels, in that they are the names of other Wikipedia pages that redirect to this page. Most are good but some are superbad. There are also the /?lang_title/ keys, which hold the canonical title from Wikipedia; that is almost always a good title to display. Note the /?lang_id/ keys point to the document id that is the real primary key of Wikipedia.

You can use these keys to link back to Wikipedia or DBpedia or you can undo the character escaping of the keys and use them as labels. I'm sure you can think of many other ways to express it that would be better, but this data satisfies the most critical requirements of the job and is well-populated.
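As a sketch of what "undoing the character escaping" involves: Freebase keys escape restricted characters as a dollar sign followed by the four-hex-digit Unicode code point, so $002C stands for a comma. Assuming that convention, a minimal decoder might look like this:

```python
import re

def unescape_freebase_key(key: str) -> str:
    """Decode Freebase key escapes: $XXXX is a 4-hex-digit Unicode code point."""
    return re.sub(r"\$([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)),
                  key)

def key_to_label(key: str) -> str:
    """Turn a decoded title key into a display label (underscores become spaces)."""
    return unescape_freebase_key(key).replace("_", " ")

print(key_to_label("Cross_Roads$002C_TX"))  # Cross Roads, TX
```

This takes the "Cross_Roads$002C_TX" key shown later in this page to "Cross Roads, TX", which is good enough to display.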

Alternate Representation of Keys in Freebase

Keys are represented two different ways in the Freebase dump. :BaseKB physically separates these two representations so you can choose to use none, either, or both. In the example we used the key segment which stores Freebase keys as fully-spelled out strings. Freebase keys form a directed acyclic graph, and you can access that graph directly if you load the keyNs segment, which looks like

<>     <>    "Cross_Roads$002C_TX"   .

Note that the namespace of this key is "/wikipedia/en", and that is encoded in the predicate field. You can find the topic matching Honey with

prefix key: <>

select ?s {
   ?s key:wikipedia.en "Honey" .
}

Label Lookup in SPARQL

As a relational database programmer I got in the habit of writing code like

SELECT id,title,price,shipping,taxes,inventory_count FROM product WHERE product_type='z'

if I was trying to draw a list of things. It's tempting to try this in SPARQL, but you have to be careful. Since :BaseKB contains multiple rdfs:label values for a given subject, you get an explosion in the size of the result set. For example, I count the number of ships in the system like so


prefix : <>

select (count(?s) as ?cnt) {
   ?s a :boats.ship .
}

and get 24909. If I write a query that brings back both the subject and the label, like so

prefix : <>

select (count(?l) as ?cnt) {
   ?s a :boats.ship .
   ?s rdfs:label ?l .
}

it returns 67537 results, which is the sum of the number of labels; that is, we get multiple rows for each subject. That's a real waste of bits and an awful mess to disentangle at the client.
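If you do end up with the exploded result set, one way to disentangle it at the client is to group the rows by subject; here is a sketch in Python, with made-up mids and names standing in for real query results:

```python
from collections import defaultdict

# rows as they might come back from the label join: (subject, label) pairs;
# the mids and ship names here are hypothetical, for illustration only
rows = [
    ("m.0aaa", "Queen Mary"),
    ("m.0aaa", "RMS Queen Mary"),
    ("m.0bbb", "Titanic"),
]

labels_by_subject = defaultdict(list)
for subject, label in rows:
    labels_by_subject[subject].append(label)

print(len(labels_by_subject))  # 2 distinct subjects, even though there were 3 rows
```

The point is that the row count is the label count, not the subject count, so the client has to rebuild the one-row-per-subject view itself.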

Doing it like a relational database

If you want this query to behave more like the typical relational query, you have to choose some property (it could be rdfs:label if you apply discipline to it, but it will have to be something different if you take other people's rdfs:label at face value). Let's call it

?s mySystem:primaryLabel ?o .

and insist that this be an owl:FunctionalProperty (one ?s can have only one ?o), and even a bit beyond that. The query above will miss a ship if it has no rdfs:label, so there is also the requirement that mySystem:primaryLabel always exist for a given ?s if statements about ?s exist (if one of them turns up in a result, you have to show the user something). If we then write a query like

prefix : <>

select ?s ?l {
   ?s a :boats.ship .
   ?s mySystem:primaryLabel ?l .
}

we get results like the relational database query, but we still run the risk that any other predicate with multiple values could cause the same multiplication of rows.

Where do names come from?

One approach is to run a batch job that precomputes names. This is easy in the map/reduce approach because we can bring all the data for a given ?s to one place, then apply whatever rules we like to generate a name.

Wikipedia names are good when they are available, but you will want rules to handle the other cases. If you don't have an "en" label available, for instance, for some crag in Spain that people rock climb, it is reasonable to serve up the Italian or Spanish name, but you'd want to serve up a name in a non-Roman alphabet only as a last resort. If no name is available at all, you should make up something based on the identifier, because then at least people can send you a screenshot of a failure and you'll know what record is involved.
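Those rules amount to a simple fallback chain. A sketch in Python follows; the particular preference order of languages is an illustrative assumption, not a recommendation:

```python
def pick_name(labels, identifier, preferred=("en", "es", "it", "fr", "de")):
    """labels maps language tags to label strings; identifier is e.g. a Freebase mid.

    The preference list here is a placeholder; a real system would tune it.
    """
    for lang in preferred:              # first choice: a language we prefer
        if lang in labels:
            return labels[lang]
    if labels:                          # last resort: any label, even non-Roman
        return labels[sorted(labels)[0]]
    return "[" + identifier + "]"       # no label at all: expose the identifier
```

For example, pick_name({"es": "Miel"}, "m.03qmh") returns "Miel", while pick_name({}, "m.03qmh") returns "[m.03qmh]", so a failure screenshot still tells you which record was involved.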

Paths not taken

Note there are other ways to make this work. For instance, if you were doing CONSTRUCT queries and putting your results into an RDF graph that you process in some way, you could pass the rdfs:label(s) and other inputs to the name rule into a later graph, against which you apply rules to create the name. Perhaps you could even make a triple store infer mySystem:primaryLabel on the fly.

Operating at webscale

An opposite approach is used by qLabel, a JavaScript library that, given Linked Data URIs annotating an HTML page, looks up labels in a chosen language. The library locates the identifiers in the HTML markup and uses Wikidata to do the name lookup. This is (1) name lookup as a service, and (2) name lookup with a parameter, the language. It illustrates a big value of this kind of database -- once you've resolved a concept you get multilingual labels "for free"; this is in contrast to the usual situation, where translating an application is an expensive process.

It makes sense for name lookup to work as a service because it can be separated from other concerns this way. For instance, there can be multiple lookup services, or parameterizations of a single service, that let you serve up different names in different contexts. You have a choice of "AMC" or "American Motors Corporation", and that matters if you're trying to make a tight grid.

Thus we need the ability to make a service that works like the Wikidata service used by qLabel, but have it be under our control.

qLabel batches name lookups on a web page together to reduce the system overhead of running multiple API calls over the public internet which can be slow in terms of bandwidth and latency. This is essential for good performance over the web. I'm thinking about running a service like that inside my server farm, and there batching might still help. Long-term this system may be worth performance tuning, but in the short-term the architecture should be designed to make performance tuning possible.
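The batching idea itself is simple to sketch; in the snippet below, fetch_labels is a hypothetical stand-in for whatever API call resolves a chunk of identifiers in one round trip:

```python
def batch_lookup(uris, fetch_labels, batch_size=50):
    """Resolve labels for many URIs in a few round trips instead of one per URI.

    fetch_labels(chunk) is assumed to return a dict mapping URI -> label.
    """
    labels = {}
    for i in range(0, len(uris), batch_size):
        chunk = uris[i:i + batch_size]
        labels.update(fetch_labels(chunk))   # one round trip per chunk
    return labels
```

Whether the per-call overhead is internet latency or just API dispatch cost inside a server farm, the number of round trips drops from len(uris) to len(uris)/batch_size.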

Let's get some sense of the size of the data. Assume a naïve Java implementation that consumes 100 bytes per topic: that would be 400MB for DBpedia, 1400MB for Wikidata, or 4000MB for Freebase. You could do much better than that with some of the data structures that come with Lucene or an off-heap hashtable. If a system is large at all (in terms of needing a lot of CPU capacity or generating a lot of revenue) it's worth keeping the labels in RAM somewhere for consistent low latency.
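For the record, the arithmetic behind those figures (the topic counts are the round numbers implied by the estimates above, roughly 4, 14, and 40 million):

```python
BYTES_PER_TOPIC = 100  # the naïve per-topic cost assumed above
topic_counts = {"DBpedia": 4_000_000, "Wikidata": 14_000_000, "Freebase": 40_000_000}
estimates_mb = {name: n * BYTES_PER_TOPIC // 1_000_000
                for name, n in topic_counts.items()}
print(estimates_mb)  # {'DBpedia': 400, 'Wikidata': 1400, 'Freebase': 4000}
```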

Names of another sort

The problem of looking up names is similar to the problem of looking up URLs. There are many types of URLs that we may want to look up. For instance, in a public-facing site, we may want to generate a URL to another page on our own site about a topic. We might have several pages with different views on the topic, sister web sites that we work with, and also want to link the concept at places like DBpedia, Freebase, Wikipedia, the New York Times, etc. The system ought to be smart enough that it doesn't need to store repetitive information. For instance, DBpedia and Wikipedia URLs are related by a simple rule, so there is no need to store both. It would be an efficient option to produce something that returns the links and labels for a concept in a single API call.
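The DBpedia/Wikipedia rule mentioned above is essentially a string rewrite: DBpedia resource identifiers are derived from English Wikipedia article names. A sketch, ignoring percent-encoding edge cases:

```python
WIKIPEDIA_PREFIX = "http://en.wikipedia.org/wiki/"
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"

def wikipedia_to_dbpedia(url: str) -> str:
    """Map an English Wikipedia article URL to the corresponding DBpedia resource."""
    if not url.startswith(WIKIPEDIA_PREFIX):
        raise ValueError("not an English Wikipedia article URL: " + url)
    return DBPEDIA_PREFIX + url[len(WIKIPEDIA_PREFIX):]

print(wikipedia_to_dbpedia("http://en.wikipedia.org/wiki/Honey"))
# http://dbpedia.org/resource/Honey
```

Because the rule is deterministic, storing one of the two URLs is enough; the other can always be derived.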

Integration with larger apps

Imagine we are trying to generate a typical topic page.

The legacy system does a monster SQL query that joins several tables to generate the photograph links in the main UI and does similar queries to generate "related topics" and "nearby locations". There are "canonical" links and titles in the major SQL rows, so this works. However, the rows are fat with many other columns so that it takes a lot of RAM to keep the rows cached.

With some kind of model-view-controller architecture, we can easily separate name and link lookup from other concerns.

In a first step we run queries (access whatever services, etc.) to retrieve the backbone of the data, that is, omitting the labels and links. After the model has been populated, we do another stage that resolves labels and links for the topics that appeared. This could be parameterized, for instance, we could have a language parameter so we get labels for a chosen language without affecting the rest of the system. Then we pass the model, with labels and links available, to the view subsystem. (If we're going to be balls-to-the-walls asynchronous we could start looking up labels as soon as they are added to the view, but let's save that for a future phase.)
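The two stages might look like this in outline; lookup below is a hypothetical function standing in for whatever service resolves a batch of ids to labels in a given language:

```python
def build_model(topic_ids, lookup, lang="en"):
    """Two-stage model population: backbone first, labels second.

    lookup(ids, lang) is assumed to return a dict mapping id -> label.
    """
    model = [{"id": t} for t in topic_ids]        # stage 1: ids only, no labels
    labels = lookup(set(topic_ids), lang)         # stage 2: one batched resolution
    for row in model:
        row["label"] = labels.get(row["id"], row["id"])  # fall back to the id
    return model
```

Because the language parameter only touches stage 2, serving the same page in another language changes nothing about how the backbone is queried.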

Internal and External Interfaces

Note that it would be useful to define internal (inside a process) and external (outside a process) implementations of name lookup. An internal interface is important for a few reasons:

  • these can be easily implemented internally (e.g. by doing SPARQL queries, applying rules, or storing a map in RAM)
  • we can put a cache layer in front of something else
  • people won't be tempted to re-implement distributed clients
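A sketch of the internal shape these points suggest, with a RAM-backed implementation and a cache layer that can sit in front of any other implementation (all names here are made up for illustration):

```python
class InMemoryLookup:
    """Name lookup backed by a map in RAM: (uri, lang) -> label."""
    def __init__(self, table):
        self.table = table

    def label(self, uri, lang="en"):
        return self.table.get((uri, lang))

class CachingLookup:
    """Cache layer wrapping any slower backend with the same label() method."""
    def __init__(self, backend):
        self.backend = backend
        self.cache = {}

    def label(self, uri, lang="en"):
        key = (uri, lang)
        if key not in self.cache:
            self.cache[key] = self.backend.label(uri, lang)
        return self.cache[key]
```

Since both classes expose the same label() method, callers never know which one they hold, which is exactly what keeps people from re-implementing distributed clients.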

An external interface will be important once we want to create a name-lookup microservice. I know JSON and things like that are fashionable, but I'd really rather see something that is balls-to-the-walls efficient, because name lookup just has to be fast.