Scio + SPARQL = Distributed offline graph database

The scio-sparql project is a Scio extension that lets you use SPARQL on top of a collection of RDF triples.

Features

Mostly SPARQL 1.1 compliant
rdf4j based
RDF/Turtle/Trig/etc. parsers for statements

Quick Start

import com.spotify.scio.values.SCollection
import es.jolivar.scio.sparql.Interpreter.SCollectionStatements
import org.eclipse.rdf4j.model.Statement
import org.eclipse.rdf4j.query.BindingSet

val statements: SCollection[Statement] = ???
val bindingSets: SCollection[BindingSet] = statements.executeSparql(
  """
    |PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    |SELECT ?name 
    |       ?email
    |WHERE
    |  {
    |    ?person  a          foaf:Person .
    |    ?person  foaf:name  ?name .
    |    ?person  foaf:mbox  ?email .
    |  }""".stripMargin
)
bindingSets.map { bindingSet =>
  val name = bindingSet.getValue("name").stringValue()
  // Do something with the bindings
}

SPARQL Limitations

Since Apache Beam has no concept of ordering, a bare ORDER BY is not supported; the implementation treats it as a no-op. What is supported is ORDER BY in a sliced context, that is, combined with OFFSET and LIMIT. For example, the following query skips the first 5 solutions in ?c order and returns the next 3:

SELECT ?a ?b ?c
WHERE {
  ?a ?b ?c
}
ORDER BY ?c
OFFSET 5
LIMIT 3

One caveat: since Apache Beam offers no ordering guarantees, the results will not appear in order, but they are guaranteed to be the elements of the correct result set.
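
The sliced semantics can be illustrated with plain Scala collections. This is a minimal in-memory sketch, not the scio-sparql implementation: `Row` and `slicedOrderBy` are hypothetical names, and it assumes SPARQL's default ascending order (with DESC the same slicing would apply from the largest values down).

```scala
object SlicedOrderBy {
  // Hypothetical stand-in for one solution binding ?a ?b ?c.
  final case class Row(a: String, b: String, c: Int)

  // Sort by ?c (ascending), skip `offset` rows, keep the next `limit` rows.
  // Returning a Set models that the output carries no ordering guarantee:
  // only membership in the result set is guaranteed.
  def slicedOrderBy(rows: Seq[Row], offset: Int, limit: Int): Set[Row] =
    rows.sortBy(_.c).slice(offset, offset + limit).toSet
}
```

Calling `SlicedOrderBy.slicedOrderBy(rows, offset = 5, limit = 3)` mirrors the query above: the right three rows come back, in no particular order.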

MINUS and inner FILTERs are not currently supported because statements have to be processed individually, one at a time.

Property paths are supported as long as they are finite. In particular, the following quantity modifiers aren't supported:

iri* ZeroOrMorePath
iri+ OneOrMorePath

An additional limitation, due to the nature of the Apache Beam model, is that a duplicate statement in the original SCollection will surface through to the end if the query matches it. Avoiding this would require a SCollection[Statement].distinct shuffle at the beginning, which is potentially very expensive.
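
The effect of a duplicate statement can be seen in a small in-memory model. This sketch uses plain Scala collections rather than Scio, and `Triple` and `names` are hypothetical names introduced only for illustration.

```scala
object DuplicateStatements {
  // Hypothetical simplified statement: subject, predicate, object.
  final case class Triple(s: String, p: String, o: String)

  // Match the pattern `?person foaf:name ?name` and return the ?name bindings.
  def names(triples: Seq[Triple]): Seq[String] =
    triples.collect { case Triple(_, "foaf:name", name) => name }

  val input: Seq[Triple] = Seq(
    Triple("ex:alice", "foaf:name", "Alice"),
    Triple("ex:alice", "foaf:name", "Alice"), // duplicate statement
    Triple("ex:bob", "foaf:name", "Bob")
  )

  // The duplicate statement yields a duplicate solution.
  val withDuplicates: Seq[String] = names(input)
  // Deduplicating the input first removes it, at the cost of an extra pass.
  val deduplicated: Seq[String] = names(input.distinct)
}
```

On an SCollection the same idea means calling `distinct` on the statements before `executeSparql`, which is where the shuffle cost comes from.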

Currently, there's also no support for SERVICE within the SPARQL query, as that logic is an entirely different can of worms.

How it works

At dataflow building time the program parses the SPARQL query and converts it into its equivalent algebra operations. It then interprets that algebra, translating each operation into its equivalent Scio operations. The dataflow is therefore statically compiled for execution, so the overhead is equivalent to writing out the operations by hand.
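
To make the idea of interpreting an algebra concrete, here is a toy interpreter over in-memory collections. It is only a sketch under loud assumptions: the `Pattern`, `Join` and `Project` nodes and the `eval` function are hypothetical and far smaller than the real SPARQL algebra, and plain `Seq` operations stand in for Scio operations.

```scala
object AlgebraSketch {
  type Triple = (String, String, String)
  type Bindings = Map[String, String]

  // Toy algebra. Terms starting with '?' are variables, anything else is a constant.
  sealed trait Expr
  final case class Pattern(s: String, p: String, o: String) extends Expr
  final case class Join(left: Expr, right: Expr) extends Expr
  final case class Project(vars: Set[String], child: Expr) extends Expr

  // Try to match one triple against a pattern, producing variable bindings.
  private def matchTriple(pat: Pattern, t: Triple): Option[Bindings] = {
    val pairs = Seq(pat.s -> t._1, pat.p -> t._2, pat.o -> t._3)
    pairs.foldLeft(Option(Map.empty[String, String])) {
      case (Some(acc), (term, value)) if term.startsWith("?") =>
        // A variable must agree with any earlier binding of the same name.
        if (acc.get(term).forall(_ == value)) Some(acc + (term -> value)) else None
      case (Some(acc), (term, value)) =>
        if (term == value) Some(acc) else None
      case (None, _) => None
    }
  }

  // Each algebra node maps to one collection operation.
  def eval(expr: Expr, data: Seq[Triple]): Seq[Bindings] = expr match {
    case p: Pattern => data.flatMap(matchTriple(p, _))
    case Join(l, r) =>
      for {
        lb <- eval(l, data)
        rb <- eval(r, data)
        // Keep only pairs that agree on every shared variable.
        if lb.keySet.intersect(rb.keySet).forall(k => lb(k) == rb(k))
      } yield lb ++ rb
    case Project(vars, child) =>
      eval(child, data).map(_.filter { case (k, _) => vars(k) })
  }
}
```

Because the tree is walked once while the pipeline is being built, the result is just a fixed chain of collection operations. In a distributed setting the `Join` case would become a keyed join, which is where shuffling comes in.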

One very important caveat is that SPARQL queries are very prone to using joins. These translate to very large amounts of shuffling, but that's the price to pay for running this process in a distributed fashion.

In economic terms this means that some queries might have a larger shuffle cost than a compute cost. In particular this is true of Cloud Dataflow using its Dataflow Shuffle service.
