Secondary indexing is a common design pattern in BigTable-like databases that lets users index one or more columns in a table. It enables fast lookup of records by the value of a particular column instead of by row id, bringing relational-style semantics to a NoSQL environment. The index is stored either in a reserved namespace within the table or in a separate index table. Although the pattern is common in BigTable-based applications, most implementations to date have been tightly coupled to a particular application. As a result, few general-purpose frameworks for secondary indexing on BigTable-like databases exist, and those that do are tied to a particular implementation of the BigTable model.
We developed Culvert to address this problem. It supports online index updates as well as a variation of the Hive query language. In designing Culvert, we sought to make it pluggable so that it can be used with any of the many BigTable-like databases (HBase, Cassandra, etc.). Our goal is an easy-to-use, extensible tool for the entire NoSQL community.
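To make the index-table variant of the pattern concrete, here is a minimal in-memory sketch. It is not Culvert's implementation or API: the class `SecondaryIndexSketch`, its methods, and the key layout (`column`, value, and row id joined by a NUL separator) are all assumptions chosen for illustration, and sorted maps stand in for the sorted row keys of a BigTable-like store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy model of a primary table plus a separate index table (hypothetical names). */
final class SecondaryIndexSketch {
    // Assumed separator; this sketch requires that columns and values never contain it.
    private static final char SEP = '\u0000';

    // Primary table: row id -> (column -> value), sorted by row id.
    private final SortedMap<String, Map<String, String>> primary =
        new TreeMap<String, Map<String, String>>();

    // Index table: "column SEP value SEP rowId" -> empty payload.
    // Sorting keeps all entries for one (column, value) pair contiguous.
    private final SortedMap<String, String> index = new TreeMap<String, String>();

    private static String indexKey(String column, String value, String rowId) {
        return column + SEP + value + SEP + rowId;
    }

    /** Writes the primary row, then keeps the index entry in step with it. */
    void put(String rowId, String column, String value) {
        Map<String, String> row = primary.get(rowId);
        if (row == null) {
            row = new TreeMap<String, String>();
            primary.put(rowId, row);
        }
        String old = row.put(column, value);
        if (old != null) {
            index.remove(indexKey(column, old, rowId)); // drop the stale entry
        }
        index.put(indexKey(column, value, rowId), "");
    }

    /** Finds row ids by column value with a range scan over the index table. */
    List<String> lookup(String column, String value) {
        String start = column + SEP + value + SEP;
        String end = column + SEP + value + (char) (SEP + 1); // exclusive upper bound
        List<String> rowIds = new ArrayList<String>();
        for (String key : index.subMap(start, end).keySet()) {
            rowIds.add(key.substring(start.length()));
        }
        return rowIds;
    }

    public static void main(String[] args) {
        SecondaryIndexSketch table = new SecondaryIndexSketch();
        table.put("row1", "city", "Austin");
        table.put("row2", "city", "Boston");
        table.put("row3", "city", "Austin");
        System.out.println(table.lookup("city", "Austin")); // prints [row1, row3]
    }
}
```

The lookup never touches the primary table: a prefix scan over the sorted index keys yields the matching row ids, which is what lets a column-value query avoid a full table scan.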
- Java 1.5
- Maven 3 (though Maven 2 may work).
- HBase 0.92
- Pull down the source and run: "mvn clean package". This outputs a compiled jar.
- Install the jar on the classpath of all the servers hosting your table.
- Install the jar on the local server (the 'client') from which to issue requests.
- Create an index table and update your configurations.
- Create an instance of com.bah.culvert.Client.
- Write your data into your primary table through the Client.
All support resources for Culvert are present under resources/. Currently, the folder consists of:
- CulvertFormat.xml - Eclipse code formatter settings. Apply it to all the Culvert projects from Preferences > Java > Code Style > Formatter
- Switch joins to first attempt an in-memory, server-side table before dumping results into a 'scratch' table
Enable higher consistency puts through use of coprocessors in HBase
- Switch to doing the table put before the index
- Use coprocessors to ensure that a put has been made before updating the index (two-phase commit)
- Add support for removes (consistent or otherwise)
- Add support for batch indexing existing tables
Add more index types
- Document Partitioned Index
- N-grams index
- Numeric indexes (integer, float, etc.)
- Web URL index
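As a rough illustration of what an n-gram index would store, the sketch below splits a value into overlapping character n-grams; each gram could then serve as an index term that points back to the row. How Culvert will actually build this index is not specified here, so the class `NGramSketch` and the method `ngrams` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Generates character n-grams that could serve as index terms (hypothetical helper). */
final class NGramSketch {
    /** Returns every contiguous substring of length n, in order of position. */
    static List<String> ngrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("culvert", 3)); // prints [cul, ulv, lve, ver, ert]
    }
}
```

A substring query can then be answered by looking up the grams of the search term in the index and verifying the candidate rows, rather than scanning every value.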
Culvert is a brand new project and we are continually looking to grow the community. We welcome any input, thoughts, patches, etc.
You can find help or talk development on IRC at #culvert on irc.freenode.net
Information on how to use Culvert is also available at this blog post.
The original slides from the presentation at Hadoop Summit 2011 are available on SlideShare.
Culvert is provided AS-IS, under the Apache License. See LICENSE.txt for full description.