mikelieberman/blueprints-accumulo-graph

NOTE: This code is no longer maintained. See https://github.com/JHUAPL/AccumuloGraph for a newer implementation.

Blueprints for Accumulo

This is an implementation of the TinkerPop Blueprints API backed by Accumulo. The graph is stored in a single Accumulo table. The implementation supports key/value indexing and includes some performance tweaks. If indexing is enabled, the index is stored in a separate table.

How to use it

AccumuloGraphOptions opts = new AccumuloGraphOptions();

opts.setConnectorInfo(instance, zookeepers, username, password);
// OR
opts.setConnector(connector);

opts.setGraphTable(graphTable);

// Optional
opts.setIndexTable(indexTable);
opts.setAutoflush(...);
opts.setReturnRemovedPropertyValues(...);
opts.setMock(...);

AccumuloGraph graph = new AccumuloGraph(opts);

Options are as follows:

  • Connector info: The information needed to connect to Accumulo. Alternatively, pass in an Accumulo Connector object that represents the connection. If neither is supplied, a mock instance must be used (see below).

  • Graph table: Where to store the graph.

  • Index table: Where to store the key/value index.

  • Autoflush (default: true): Flush changes to Accumulo immediately, rather than holding them back for performance reasons. Disabling this may cause timing issues (see Caveats).

  • Return removed property values (default: true): The removeProperty method specifies that the value of the removed property is returned. This potentially requires another read from Accumulo. If you don't care what is returned, disable this to speed things up.

  • Use mock instance (default: false): If you don't have an Accumulo cluster lying around, but still want to use this, you can use a "mock" instance of Accumulo which runs in memory and simulates a real cluster.
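The trade-off behind the "return removed property values" option can be sketched with a small in-memory model. This is only an illustration and not the library's code: the class `RemovePropertySketch`, its `reads` counter, and the string keys are all hypothetical, and a `HashMap` stands in for the Accumulo table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the "return removed property values" trade-off; not the library's code.
class RemovePropertySketch {
    private final Map<String, String> table = new HashMap<>(); // stands in for the Accumulo table
    private final boolean returnRemovedValues;
    int reads = 0; // counts simulated reads from the backing table

    RemovePropertySketch(boolean returnRemovedValues) {
        this.returnRemovedValues = returnRemovedValues;
    }

    void setProperty(String key, String value) {
        table.put(key, value);
    }

    // Delete the property; fetch the old value first only if the caller wants it back.
    String removeProperty(String key) {
        String old = null;
        if (returnRemovedValues) {
            reads++;
            old = table.get(key);
        }
        table.remove(key);
        return old;
    }

    public static void main(String[] args) {
        RemovePropertySketch withValue = new RemovePropertySketch(true);
        withValue.setProperty("v1/name", "alice");
        System.out.println(withValue.removeProperty("v1/name") + " reads=" + withValue.reads);

        RemovePropertySketch withoutValue = new RemovePropertySketch(false);
        withoutValue.setProperty("v1/name", "alice");
        System.out.println(withoutValue.removeProperty("v1/name") + " reads=" + withoutValue.reads);
    }
}
```

In this model the first call pays one extra read to hand back the old value, while the second skips the read and returns null.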

Caveats

There are definitely bugs.

Timing issues: There may be a lag between when you add a vertex or edge, set its properties, etc. and when the change is reflected in the backing Accumulo table. Writes are held back for performance reasons, so if you set values and then immediately read them back, the results may be inconsistent. The same holds for key/value indexes. This isn't a problem for bulk loads or for read-only use of the graph, but otherwise it may be. The autoflush option mitigates this somewhat by flushing changes to Accumulo immediately, at the cost of write performance. I have tried to reduce these timing issues as much as possible, but some may remain, and this needs more testing.
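The read-after-write behavior described above can be illustrated with a minimal in-memory sketch. It is not the AccumuloGraph code: `BufferedStoreSketch` and its methods are hypothetical names, a `HashMap` stands in for the Accumulo table, and the buffering policy is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of buffered writes; not the AccumuloGraph code.
class BufferedStoreSketch {
    private final Map<String, String> table = new HashMap<>(); // stands in for the Accumulo table
    private final List<String[]> pending = new ArrayList<>();  // writes not yet visible to readers
    private final boolean autoflush;

    BufferedStoreSketch(boolean autoflush) {
        this.autoflush = autoflush;
    }

    // Queue a write; with autoflush on, push it to the table right away.
    void put(String key, String value) {
        pending.add(new String[] {key, value});
        if (autoflush) {
            flush();
        }
    }

    // Apply all queued writes to the table, in order.
    void flush() {
        for (String[] kv : pending) {
            table.put(kv[0], kv[1]);
        }
        pending.clear();
    }

    // Reads only see what has reached the table.
    String get(String key) {
        return table.get(key);
    }

    public static void main(String[] args) {
        BufferedStoreSketch lazy = new BufferedStoreSketch(false);
        lazy.put("v1/name", "alice");
        System.out.println(lazy.get("v1/name")); // null: the write is still buffered
        lazy.flush();
        System.out.println(lazy.get("v1/name")); // alice

        BufferedStoreSketch eager = new BufferedStoreSketch(true);
        eager.put("v1/name", "alice");
        System.out.println(eager.get("v1/name")); // alice: flushed on write
    }
}
```

The eager variant pays for a flush on every write, which is the write-performance cost mentioned above.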

TODO

  • Hadoop integration.
  • Read-only usage. This would permit only read operations, which would allow caching strategies and avoid timing issues.
  • Element/property cache, to increase performance for read-only usage.
  • Bulk loading of graph elements.
  • Regular-style indexes, in addition to key/value index.
  • Tuned querying.
  • Benchmarking.
  • Documentation.

Implementation details

The graph is stored in a single table with the following schema.

| Row | CF | CQ | Val | Purpose |
|---|---|---|---|---|
| [v id] | MVERTEX | - | - | Vertex id |
| [v id] | EOUT | [e id] | [e label] | Vertex out-edge |
| [v id] | EIN | [e id] | [e label] | Vertex in-edge |
| [e id] | MEDGE | [e label] | - | Edge id |
| [e id] | VOUT | [v id] | - | Edge out-vertex |
| [e id] | VIN | [v id] | - | Edge in-vertex |
| [v/e id] | PROP | [pname] | [pval] | Element property |
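To make the graph table schema concrete, here is a rough sketch that lists the entries implied by adding two vertices, one edge, and one property. It is not taken from the library: `GraphSchemaSketch` and its helpers are hypothetical, ids and values are assumed to be plain strings, and the edge's out-vertex is assumed to be its source, as in Blueprints.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that lists the entries the schema implies; not the library's code.
class GraphSchemaSketch {
    // One entry is {row, column family, column qualifier, value}; "" stands for an empty field.
    static String[] entry(String row, String cf, String cq, String val) {
        return new String[] {row, cf, cq, val};
    }

    // Entries for a vertex with the given id.
    static List<String[]> vertexEntries(String vId) {
        List<String[]> out = new ArrayList<>();
        out.add(entry(vId, "MVERTEX", "", ""));
        return out;
    }

    // Entries for an edge from outV to inV: three on the edge's row, one on each endpoint's row.
    static List<String[]> edgeEntries(String eId, String outV, String inV, String label) {
        List<String[]> out = new ArrayList<>();
        out.add(entry(eId, "MEDGE", label, ""));
        out.add(entry(eId, "VOUT", outV, ""));
        out.add(entry(eId, "VIN", inV, ""));
        out.add(entry(outV, "EOUT", eId, label));
        out.add(entry(inV, "EIN", eId, label));
        return out;
    }

    // Entry for a property on a vertex or an edge.
    static String[] propertyEntry(String elementId, String name, String value) {
        return entry(elementId, "PROP", name, value);
    }

    public static void main(String[] args) {
        List<String[]> all = new ArrayList<>();
        all.addAll(vertexEntries("v1"));
        all.addAll(vertexEntries("v2"));
        all.addAll(edgeEntries("e1", "v1", "v2", "knows"));
        all.add(propertyEntry("v1", "name", "alice"));
        for (String[] e : all) {
            System.out.println(String.join(" | ", e));
        }
    }
}
```

Because each edge is recorded both on its own row and on the rows of its two endpoints, a vertex's edges and an edge's vertices can each be found by reading a single row.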

If the index table is enabled, it has the following schema.

| Row | CF | CQ | Val | Purpose |
|---|---|---|---|---|
| PVLIST | [p name] | - | - | Vertex property list |
| PELIST | [p name] | - | - | Edge property list |
| [p name] | [p val] | [v/e id] | - | Property index |
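A key/value lookup against the property index rows can be sketched as a range scan over sorted keys. The model below is an assumption-laden illustration, not the library's implementation: `IndexSketch` is a hypothetical class, a `TreeSet` of strings stands in for the sorted index table, values are assumed to be strings, and the PVLIST and PELIST rows are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical in-memory model of the index table; not the library's code.
class IndexSketch {
    private static final char SEP = '\u0000'; // separates row, CF, and CQ in one sorted key

    // Sorted like Accumulo keys: by row, then column family, then column qualifier.
    private final TreeSet<String> keys = new TreeSet<>();

    // Record that element `id` has property `name` = `value`.
    void index(String name, String value, String id) {
        keys.add(name + SEP + value + SEP + id);
    }

    // Find element ids by scanning the contiguous key range for (name, value).
    List<String> lookup(String name, String value) {
        String prefix = name + SEP + value + SEP;
        List<String> ids = new ArrayList<>();
        for (String key : keys.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) {
                break; // past the end of the matching range
            }
            ids.add(key.substring(prefix.length()));
        }
        return ids;
    }

    public static void main(String[] args) {
        IndexSketch idx = new IndexSketch();
        idx.index("name", "alice", "v1");
        idx.index("name", "bob", "v2");
        idx.index("name", "alice", "v3");
        System.out.println(idx.lookup("name", "alice")); // [v1, v3]
        System.out.println(idx.lookup("name", "carol")); // []
    }
}
```

Placing the property name in the row and the value in the column family keeps all elements with the same name and value adjacent, so a lookup only touches one contiguous range.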

Please contact me if you find any bugs! Thanks!
