Ripple on Linked Data

timrdf edited this page Mar 10, 2012 · 18 revisions

What is first

  • Commands for an overview of the command features
  • Running Ripple if you would like to run these examples yourself

Let's get to it

Ripple was designed for use with Linked Data. Ripple programs can express the RDF Property Paths now available in the SPARQL query language, but you can also mix general-purpose computational steps into traversal paths. This allows you to combine simple property paths with more sophisticated filters (like Unix pipelines for RDF graph data) without departing from the path-like syntax.

Ripple's default RDF query layer, LinkedDataSail, gives you a dynamic view of the Web of Data which can be queried like a single, monolithic database. As you explore different areas of the Semantic Web, LinkedDataSail pulls in new data in order to answer queries, incrementally as you traverse the sequence of RDF predicates in the path. Since Ripple programs are evaluated in a well-defined order, you can reason about the data you'll get back in response to queries.

The examples below illustrate some of the main ideas in exploring Linked Data at the Ripple command line (to start the application, see Running Ripple).

Starting from DBpedia...

Let's start traversing from DBpedia, the de facto central hub of the Web of Data. We'll pick a resource and give it a handy alias (a list which wraps the URI of the resource):

1) @list beijing: <>

You can begin exploring the Linked Data neighborhood of the resource by applying (or "dequoting") the list:

2) :beijing.

  [1]  dbr:Beijing
             "CPC Ctte Secretary"@en;

That has caused the URI of the city of Beijing to be dereferenced. In other words, LinkedDataSail has issued an HTTP GET request for the URI, which DBpedia has answered with an RDF document describing Beijing. That document has been parsed and cached in LinkedDataSail, and is used to answer queries from that point on. Whenever LinkedDataSail receives a query for information about a new resource, it attempts to fetch the corresponding document before proceeding to answer the query.
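The fetch-once, answer-from-cache behaviour described above can be sketched in Python. This is only an illustration, not LinkedDataSail's actual API: the class `DereferencingStore`, its `describe` method, and the `fake_fetch` function (which stands in for the HTTP GET and RDF parsing) are all hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


class DereferencingStore:
    """Toy store that fetches a resource's document on first use, then caches it."""

    def __init__(self, fetch: Callable[[str], List[Triple]]) -> None:
        self._fetch = fetch  # hypothetical: maps a URI to the triples in its document
        self._cache: Dict[str, List[Triple]] = {}

    def describe(self, uri: str) -> List[Triple]:
        # Dereference only the first time this URI is queried.
        if uri not in self._cache:
            self._cache[uri] = self._fetch(uri)
        return self._cache[uri]


fetch_log: List[str] = []


def fake_fetch(uri: str) -> List[Triple]:
    # Stand-in for "HTTP GET + parse"; records each call so we can count them.
    fetch_log.append(uri)
    return [(uri, "rdfs:comment", "placeholder comment")]


store = DereferencingStore(fake_fetch)
first = store.describe("dbr:Beijing")
second = store.describe("dbr:Beijing")  # answered from the cache, no second fetch
```

After both calls, `fetch_log` holds a single entry, which is the point of the cache: repeated queries about a resource do not trigger repeated requests.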

By default, the Ripple command line will show you some of that information beneath the URI itself (for reasons of space, this will be omitted from most of the examples). For Beijing, there's natural language text such as comments in various languages:

3)  :beijing. rdfs:comment.

  [1]  "北京市(唐音: ペキンし、漢音: ほくけいし、ほっけいし)は、中華人民共和国の首都である。中国の東部、河北省の中央に位置する。古くは大都・燕京・北平とも呼ばれた。現在の行政区画としては直轄市である。世界有数のグローバル都市であり、金融センターとしても高い重要性を持っている。"@ja
  [2]  "Peking eli Beijing on Kiinan kansantasavallan pääkaupunki. Nimessä bei tarkoittaa pohjoista ja jing pääkaupunkia. Suomeksi Beijing tarkoittaa siis pohjoista pääkaupunkia, erotukseksi muualla sijainneista Kiinan pääkaupungeista (vrt. Nanking = eteläinen pääkaupunki). Peking on myös yksi Kiinan neljästä provinssitasolla itsehallinnollisesta kunnasta. Peking on Kiinan kolmanneksi suurin kaupunki Chongqingin ja Shanghain jälkeen."@fi
  [14]  "La metropoli di Beijīng (悲京, in lingua cinese, che letteralmente vuol dire \"Capitale del Nord\"), o Pechino come è maggiormente conosciuta in italiano, è la capitale della Repubblica Popolare Cinese. L'intera municipalità ha dimensioni pari a più della metà del Belgio e conta 10 milioni di abitanti. Pechino è la seconda città più popolosa della Cina dopo Shanghai con 11.500.000 residenti. Confina in tutte le direzioni con la provincia dell'Hebei e a sud-est con la municipalità di Tianjin."@it

If we want, we can grab just the English-language comment using a filter:

4)  :beijing. rdfs:comment. (lang. "en" equal.) require.

  [1]  "Beijing, also known as Peking, is a metropolis in northern China, and the capital of the People's Republic of China. Governed as a municipality under direct administration of the central government, Beijing borders Hebei Province to the north, west, south, and for a small section in the east, and Tianjin Municipality to the southeast. Beijing is one of the Four Great Ancient Capitals of China. Beijing is divided into 14 urban and suburban districts and two rural counties."@en

The filter (lang. "en" equal.) require. says "look at the topmost item on the stack, which should be a literal. Find the language of the literal. If the item is not a literal, or if there is no language or if the language is not English, then this stack is not a solution (so don't include it in the query results)".
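As a rough Python analogue of that filter (a sketch, not how Ripple implements `require`), the function below keeps only values that are literals carrying the requested language tag. The `Literal` type and `require_lang` function are hypothetical names introduced for illustration, and the comment texts are shortened placeholders.

```python
from typing import List, NamedTuple, Optional


class Literal(NamedTuple):
    text: str
    lang: Optional[str] = None  # None models a literal without a language tag


def require_lang(values: List[object], lang: str) -> List[Literal]:
    # Drop anything that is not a literal, has no language, or has another language.
    return [v for v in values if isinstance(v, Literal) and v.lang == lang]


comments = [
    Literal("(Japanese comment)", "ja"),
    Literal("(Finnish comment)", "fi"),
    Literal("Beijing, also known as Peking, ...", "en"),
    Literal("untagged text"),  # no language tag: rejected
    "dbr:Beijing",             # not a literal: rejected
]

english = require_lang(comments, "en")
```

Here `english` contains only the one `@en` literal, mirroring the single solution in example 4.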

Finding related things

Apart from natural-language text, there are a number of other resources associated with Beijing through various RDF predicates. A couple of predicates which might catch our eye are owl:sameAs and dcterms:subject. E.g.

5)  :beijing. dcterms:subject.                                  

  [1]  <>
  [2]  <>
  [3]  <>
  [18]  <>

This query will take a few moments to finish, as LinkedDataSail has a few new URIs to dereference. These are categories of resources which contain Beijing as a topic, in no particular order. Let's find some other topics in the same categories. Evidently, DBpedia provides backlinks from categories to topics, because taking a step forward through the dcterms:subject mapping, and then back again through the inverse mapping, gives us a number of "related" resources:

6)  :beijing. dcterms:subject. dcterms:subject~.

  [1]  dbr:Beijing
  [2]  dbr:One-dog_policy
  [3]  dbr:Beijing_Economic_and_Technological_Development_Area
  [678]  dbr:Tianjin

You'll notice that Beijing itself is one of the solutions to the query, as we can get back to Beijing through the same dcterms:subject statement which took us to its categories.
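The forward-then-inverse step can be illustrated over a tiny in-memory triple list. This is a sketch under assumptions: the category names `cat:A` and `cat:B` are made up, and `forward` and `inverse` are hypothetical helpers, not Ripple primitives.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical data: two invented categories shared by a few places.
TRIPLES: List[Triple] = [
    ("dbr:Beijing", "dcterms:subject", "cat:A"),
    ("dbr:Beijing", "dcterms:subject", "cat:B"),
    ("dbr:Tianjin", "dcterms:subject", "cat:A"),
    ("dbr:Shanghai", "dcterms:subject", "cat:B"),
    ("dbr:Tianjin", "dcterms:subject", "cat:B"),
]


def forward(nodes: List[str], pred: str) -> List[str]:
    # Follow pred from subject to object.
    return [o for n in nodes for (s, p, o) in TRIPLES if s == n and p == pred]


def inverse(nodes: List[str], pred: str) -> List[str]:
    # Follow pred backwards, from object to subject (like "pred~" in the query).
    return [s for n in nodes for (s, p, o) in TRIPLES if o == n and p == pred]


categories = forward(["dbr:Beijing"], "dcterms:subject")
related = inverse(categories, "dcterms:subject")
```

In this toy data, `related` includes `dbr:Beijing` once per category, and `dbr:Tianjin` twice because it shares both categories with Beijing.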

Eliminating duplicates

Looking through the query results above, we see that there are a lot of repeated items. When there are multiple paths to a solution (e.g. through RDF statements that are provided in multiple documents), Ripple will produce that solution more than once. We can filter out the duplicates using distinct (see the stream library):

7)  :beijing. dcterms:subject. dcterms:subject~. distinct.

  [1]  dbr:Beijing
  [2]  dbr:One-dog_policy
  [3]  dbr:Beijing_Economic_and_Technological_Development_Area
  [169]  <>
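For readers more used to general-purpose languages, duplicate elimination over a stream of solutions can be sketched as below. One assumption is made loudly here: the sketch keeps the first occurrence of each item and preserves arrival order, which the page does not state explicitly about Ripple's `distinct`.

```python
from typing import Hashable, Iterable, List


def distinct(items: Iterable[Hashable]) -> List[Hashable]:
    # Keep the first occurrence of each item; later repeats are dropped.
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


solutions = ["dbr:Beijing", "dbr:Tianjin", "dbr:Beijing", "dbr:Shanghai", "dbr:Tianjin"]
unique_solutions = distinct(solutions)
```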

Give it a name

Our path expression is getting a little long. We have two options for cutting it down to size: either we define a named program using the @list directive (see Commands), or we give the actual query result a name. Defining a program is a good idea if we want to be able to re-use the path in future sessions, or even publish the path to an RDF data store. On the other hand, naming the query result is a good idea if we're just exploring, and want to avoid excess typing or re-computing of intermediate results. To give it a name, just append an = x to the query (or even after evaluation of the query, on the next line), where x is a new keyword:

8)  :beijing. dcterms:subject. dcterms:subject~. distinct. = x

  [1]  dbr:Beijing
  [2]  dbr:One-dog_policy
  [3]  dbr:Beijing_Economic_and_Technological_Development_Area
  [169]  <>

Now we can "replay" the query result by applying x:

9)  x.

  [1]  dbr:Beijing
  [2]  dbr:One-dog_policy
  [3]  dbr:Shanghai
  [4]  dbr:Beijing_Economic_and_Technological_Development_Area
  [169]  dbr:Barcelona

It's a stack!

Even after elimination of duplicates, there are a lot of resources in the above list. Some of them are obviously related to Beijing, such as Shanghai, while others are less obvious, like Barcelona. Seeing as Ripple is a stack language, let's put a little more information on the stack, to make things a bit clearer. All we have to do is to duplicate Beijing and the category resource so they remain on the stack(s) after we traverse to the "related resources", i.e.

10)  :beijing. dup. dcterms:subject. dup. dcterms:subject~. distinct. = y

  [1]  dbr:Beijing <> dbr:Beijing
  [2]  dbr:Beijing <> dbr:One-dog_policy
  [3]  dbr:Beijing <> dbr:Shanghai
  [4]  dbr:Beijing <> dbr:Beijing_Economic_and_Technological_Development_Area
  [187]  dbr:Beijing <> dbr:Barcelona

Now we can see that Beijing and Shanghai are both independent cities, while Beijing and Barcelona were both host cities of the summer Olympics. So far, so good, but it's starting to look like "relatedness" via shared categories is a pretty broad relationship, so let's also filter on location.
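The effect of the two dup steps in query 10 can be modelled with stacks as Python tuples. This is a simplified sketch of the idea, not Ripple's evaluator: `dup` and `step` are hypothetical helpers, and the categories `cat:A` and `cat:B` are invented placeholders for the elided category URIs.

```python
from typing import List, Tuple

Stack = Tuple[str, ...]

# Hypothetical data: Beijing shares one invented category with each other city.
TRIPLES = [
    ("dbr:Beijing", "dcterms:subject", "cat:A"),
    ("dbr:Shanghai", "dcterms:subject", "cat:A"),
    ("dbr:Beijing", "dcterms:subject", "cat:B"),
    ("dbr:Barcelona", "dcterms:subject", "cat:B"),
]


def dup(stacks: List[Stack]) -> List[Stack]:
    # Copy the top item so one copy survives the next traversal step.
    return [s + (s[-1],) for s in stacks]


def step(stacks: List[Stack], pred: str, inverse: bool = False) -> List[Stack]:
    # Replace the top of each stack with every value reachable through pred.
    out = []
    for stack in stacks:
        top = stack[-1]
        for s, p, o in TRIPLES:
            if p != pred:
                continue
            if not inverse and s == top:
                out.append(stack[:-1] + (o,))
            elif inverse and o == top:
                out.append(stack[:-1] + (s,))
    return out


stacks = [("dbr:Beijing",)]
stacks = step(dup(stacks), "dcterms:subject")
stacks = step(dup(stacks), "dcterms:subject", inverse=True)
```

Each resulting stack has the shape (start, category, related resource), so the category that links the two resources stays visible next to them.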

Into GeoNames

Backing up a little, one of the owl:sameAs links we noticed earlier points from the DBpedia resource for Beijing to the GeoNames resource for Beijing:

11)  :beijing. owl:sameAs.

  [1]  <>

We can see that each location in GeoNames has a latitude and longitude, which we can use to find places which are not only "related to" Beijing, but also near it in space. Let's define a quick-and-dirty distance function:

12)  @list x sq: x x mul.
14)  @list d wrap: d 360 gt. (d 360 sub.) (d) branch.
16)  @list lon1 lat1 lon2 lat2 haversin-kludge: \
17)      lon2 lon1 sub. abs. :wrap. :sq. \
18)      lat2 lat1 sub. :sq. \
19)      add. sqrt. (0 gt.) require.
21)  @list p lonlat: p wgspos:long. to-double. p wgspos:lat. to-double.
23)  @list p1 p2 distance: p1 :lonlat. p2 :lonlat. :haversin-kludge. 180 div.
25)  @list p1 p2 with-dist: p1 p2 p1 p2 :distance. 3 ary.

This will give us a value of 0 when two resources share the same location, while it will give a value of around 1 when they're as far apart as possible. Now we can apply it to Beijing and related places:
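For comparison, here is a Python transcription of the arithmetic in those definitions, as I read them. It is a sketch with stated assumptions: points are plain (longitude, latitude) pairs in degrees rather than RDF resources, the final `(0 gt.) require.` filter is left out, and the coordinates for Beijing and Tianjin are rough illustrative values rather than data fetched from GeoNames.

```python
import math
from typing import Tuple

Point = Tuple[float, float]  # (longitude, latitude) in degrees


def sq(x: float) -> float:
    return x * x


def wrap(d: float) -> float:
    # Mirrors ":wrap" as written: subtract 360 only when d exceeds 360.
    return d - 360 if d > 360 else d


def haversin_kludge(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    # Euclidean distance in degree space; not a true haversine formula.
    return math.sqrt(sq(wrap(abs(lon2 - lon1))) + sq(lat2 - lat1))


def distance(p1: Point, p2: Point) -> float:
    lon1, lat1 = p1
    lon2, lat2 = p2
    # Scale by 180 degrees, as ":distance" does.
    return haversin_kludge(lon1, lat1, lon2, lat2) / 180


# Approximate coordinates, for illustration only.
beijing: Point = (116.4, 39.9)
tianjin: Point = (117.2, 39.1)

same_place = distance(beijing, beijing)
nearby = distance(beijing, tianjin)
```

Because the computation treats degrees of longitude and latitude as flat coordinates, it is only a crude proxy for geographic distance, which is all the ordering of "related places" requires here.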

26)  y. rolldown. :with-dist.

  [1]  <> dbr:Bohai_Economic_Rim dbr:Beijing 0.011642707215552911E0
  [2]  <> dbr:Paris dbr:Beijing 0.6355051911679347E0
  [3]  <> dbr:Stockholm dbr:Beijing 0.5568199106659997E0
  [54]  <> dbr:Damascus dbr:Beijing 0.4464167538927126E0

What we now have on the stack are the category, the two topics (places), and an approximation of their distance. Anything which is not a place has been filtered out due to its lack of a longitude and latitude. There are quite a few near-duplicate solutions due to multiple shared categories between Beijing and some of these places, and due to some almost-but-not-quite identical longitude and latitude information we have picked up (from data sets other than GeoNames) along the way.

Ordering solutions

To get the full benefit of our "distance" metric, we can now order solutions by increasing distance, using the closed-world order primitive (see the stream library):

27)  y. rolldown. :with-dist. order.

  [1]  <> dbr:Western_Hills dbr:Beijing 0.0012904689499344082E0
  [2]  <> dbr:Tianjin dbr:Beijing 0.006176402148306225E0
  [3]  <> dbr:Tianjin dbr:Beijing 0.006176402148306225E0
  [54]  <> dbr:Tangier dbr:Beijing 0.6792334949691136E0
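A minimal Python sketch of the same idea: sort the solution stacks by their topmost item, the distance. The assumption here is that "closed-world" means every solution has to be collected before any can be emitted, which is also what `sorted` needs. The stacks reuse four distances from the output of example 26, rounded, and omit the category item for brevity.

```python
from typing import List, Tuple

Solution = Tuple[str, str, float]  # (related place, Beijing, distance)

solutions: List[Solution] = [
    ("dbr:Bohai_Economic_Rim", "dbr:Beijing", 0.0116),
    ("dbr:Paris", "dbr:Beijing", 0.6355),
    ("dbr:Stockholm", "dbr:Beijing", 0.5568),
    ("dbr:Damascus", "dbr:Beijing", 0.4464),
]

# Order by the last (topmost) item of each stack, ascending.
ordered = sorted(solutions, key=lambda stack: stack[-1])
nearest_first = [stack[0] for stack in ordered]
```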

Share and enjoy.