Skip to content

System Evaluation: Historical Reachability Queries

radu1 edited this page Jul 20, 2017 · 18 revisions

To emphasize the ease of realizing empirical evaluations on top of EGG, we present a performance comparison of approaches for answering historical reachability queries, a problem introduced in the article TimeReach: Historical Reachability Queries on Evolving Graphs.

We focus on the disjunctive semantics of historical reachability queries, which ask whether there exists a path between two nodes in a specified interval of time i.e., the query returns yes if there is at least one snapshot in the specified interval where the two nodes are connected.

We have implemented and evaluated the performance of two approaches for solving the aforementioned problem:

  1. Disjunctive-BFS = a simple yet efficient solution based on dynamic programming and detailed in the TimeReach paper. Disjunctive-BFS takes as input an evolving graph in a version graph format, where each pair of nodes is annotated with the union of intervals where the edge between the two nodes is valid. To do so, we easily implemented in EGG the serialization in the version graph format in addition to the standard RDF serialization. The code to realize this additional serialization can be found here, an example of the version graph for the social use case can be found here, and our code for Disjunctive-BFS is available here.

  2. SPARQL = our SPARQL implementation of historical reachability queries on top of Apache Jena. An example of such a SPARQL script is available here.

NB: we use limited recursion i.e., "a{,10}" instead of "a*" because of the limited performance of SPARQL in handling regular path queries.

We have run two experiments that we have found relevant as they express difference between the two approaches:

  1. we fix size of static graph output of gMark (100K nodes, 500K edges), fix interval size in the disjunctive historical reachability query (10 snapshots) and vary the number of snapshots.
  • Small # of snapshots (10): SPARQL is on average better than Disjunctive-BFS and has less variance. On the other hand, Disjunctive-BFS has an important variance: its time depends on the how close are the input nodes.

  • Medium # of snapshots (100): Disjunctive-BFS still has more variance, but on average is better than SPARQL.

  • Large # of snapshots (1000): Disjunctive-BFS returns in reasonable time, whereas SPARQL gives a java exception (out of memory).

  1. we fix size of static graph output of gMark as above, fix number of snapshots to 100 and vary the interval size in the disjunctive historical reachability query: 1, 10, 50, 100.
  • SPARQL needs almost constant time (i.e., has less variance) in the sense that its performance is not affected by the size of the query interval.

  • Disjunctive-BFS tends to give a better time for the queries with large intervals since if two nodes are historically-reachable, there are more chances that this is discovered and output earlier in the algorithm's execution.

The experimental results presented on this page have been obtained on a machine with 16 x 2.4 GHz CPU and 64 GB RAM, reporting averages over 10 runs.