Skip to content

Cassandra

Eric Robertson edited this page May 11, 2016 · 8 revisions

Cassandra is a wide-row store NoSQL database.

Cassandra uses partitioners to distribute data across nodes within database nodes. Nodes are organized into hierarchy of data centers contained by clusters.

Cassandra separates applications by key space. This can have an equivalence to GeoWave namespace. A keyspace contains one or more column families.

Cassandra performs best when data needed to satisfy a given query is located in the same partition key.

Partitioning

The default partitioner is the Murmur3Partitioner partitioner. It may look tempting to use the ByteOrderedPartitioner, to align with Hbase or Accumulo behaviors for the types of range scans, fundamental to GeoWave's design. The ByteOrderedPartitioner is not recommended as it presents difficulties in load balancing and hot spotting on writes.

Compaction

Choosing the correct compaction strategy impacts GeoWave performance. Compaction consolitates SSTables, removing deleted rows and producing single versions of each updated and inserted row. SSTables are organized by sorted partition id. By default, LeveledCompactionStrategy should be used since GeoWave targets read performance. LeveledCompactionStrategy levels the size of SSTables, consolidating into higher level tables of non-overlapping data. Each level increases the permitted table size by a factor of 10. In low write and high read volume applications, typically one higher level SSTable is only processed. In high write volume applications, there are many level 0 SSTables requiring more disk operations. Thus, for streaming applications, the DateTieredCompactionStrategy strategy may be more appropriate, as it compacts based on age of data.

Primary Indexing

The primary index for a Cassandra table is the row key. Secondary indices in Cassandra use the same definition as GeoWave, maintained as separate B-trees the reference (point to) the data in a table sorted by a primary key.

Cassandra design is akin to a nested sorted map. Row keys reference a sorted map of column key/values. Both are unbounded, supporting wide rows. Columns can be organized into super columns (columns within columns) and composite columns (two keys combined into a single column). Cassandra allows 2 billion columns per row. Column key names should be short, as the name is replicated for each cell. It is permissible to place the value in the key, leaving the column value null.

Wide rows have two caveats. First, a row's columns are stored together on a node. A wide row can result in a hotspot. Second, wide rows cannot fit into memory. Row caching cannot be used.

Each row contains a Bloom Filter of the column names in the row. Thus, scans across a row should be efficient. Furthermore, rows contain a column index for columns that collectively expand beyond a configured size (a page size). During query processing, the column index is when determining an column offset into the row. This occurs when a query requests a start column (position) or descending order.

The primary search criteria for GeoWave is a set of ranges over a set of bytes. Dividing bytes into two parts is the recommended design. The first part, the higher order bytes, is used as the row key and the adapter id. The second part, the lower order bytes, are used as part of the column name. The column name is a composite key, which includes the data id. The

 CREATE TABLE IDX (
  horowid bigint,
  adapterid text,
  lowrowid bigint,
  dataid text,
  data blob
  PRIMARY KEY ((horowid , adapterid), lowrowid ,dataid )
 ) WITH CLUSTERING ORDER BY (lowrowid ,dataid ASC);

Secondary Indexing

Cassandra has a rich set of secondary capabilities. SSTable attached secondary indices are an effective architecture for handling text indexing.

Unlike the GeoWave API, Cassandra's query optimizer can choose the appropriate secondary index. When using Cassandra, it may be tempting to turn off GeoWave secondary indexing. This largely depends on whether a features attributes are exposed as part of the column name.

Visibility

Cassandra does not have native cell level security. Custom filters would be needed. There are a few approaches that could used to handle security. (1) Include security within the column name. (2) Include security in a column value describing security for all attributes (the default behavior used to identify security for attributes in the GeoSever plugin).

Clone this wiki locally