Stratio's Cassandra Lucene Index

Overview
Indexing
Searching
Geographical elements
- Distance
- Transformations
  - Bounding box
  - Buffer
  - Centroid
  - Convex hull
  - Difference
  - Intersection
  - Union
Complex data types
Query builder
Spark and Hadoop
JMX interface
Performance tips

Overview

Stratio’s Cassandra Lucene Index, derived from Stratio Cassandra, is a plugin for Apache Cassandra that extends its index functionality to provide near real-time search, as in ElasticSearch or Solr, including full-text search capabilities and free multivariable, geospatial and bitemporal search. It is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. Stratio’s Cassandra indexes are one of the core modules on which Stratio’s BigData platform is based.

Index relevance searches allow you to retrieve the n most relevant results satisfying a search. The coordinator node sends the search to each node in the cluster, each node returns its n best results, and the coordinator then combines these partial results and gives you the n best of them, avoiding a full scan. You can also base the sorting on a combination of fields.
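The scatter-gather step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's code: the helper names node_top_n and coordinator_top_n, the row layout and the score field are assumptions made for the example.

```python
import heapq


def node_top_n(rows, n):
    """Return the n best-scoring rows held by one node (hypothetical helper)."""
    return heapq.nlargest(n, rows, key=lambda row: row["score"])


def coordinator_top_n(partials, n):
    """Combine the per-node partial results and keep the global n best."""
    merged = [row for partial in partials for row in partial]
    return heapq.nlargest(n, merged, key=lambda row: row["score"])


# Three nodes, each holding its own rows with an assumed relevance score.
node_a = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.2}, {"id": 3, "score": 0.5}]
node_b = [{"id": 4, "score": 0.8}, {"id": 5, "score": 0.1}]
node_c = [{"id": 6, "score": 0.7}, {"id": 7, "score": 0.6}, {"id": 8, "score": 0.3}]

n = 3
partials = [node_top_n(rows, n) for rows in (node_a, node_b, node_c)]
top = coordinator_top_n(partials, n)
print([row["id"] for row in top])
```

Each node contributes at most n rows, so the coordinator never has to look at more than n rows per node, whatever the table size.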

Any cell in the tables can be indexed, including those in the primary key as well as collections. Wide rows are also supported. You can scan token/key ranges, apply additional CQL3 clauses and page on the filtered results.

Index filtered searches are a powerful help when analyzing the data stored in Cassandra with MapReduce frameworks such as Apache Hadoop or, even better, Apache Spark. Adding Lucene filters to the job input can dramatically reduce the amount of data to be processed, avoiding a full scan.

Features

Stratio’s Cassandra Lucene Index and its integration with Lucene search technology provide:

Full text search (language-aware analysis, wildcard, fuzzy, regexp)
Boolean search (and, or, not)
Sorting (by relevance, column value, and distance)
Geospatial indexing (points, lines, polygons and their multiparts)
Geospatial transformations (bounding box, buffer, centroid, convex hull, union, difference, intersection)
Geospatial operations (intersects, contains, is within)
Bitemporal search (valid and transaction time durations)
CQL complex types (list, set, map, tuple and UDT)
CQL user defined functions (UDF)
CQL paging, even with sorted searches
Columns with TTL
Third-party CQL-based drivers compatibility
Spark and Hadoop compatibility

Not yet supported:

Thrift API
Legacy compact storage option
Indexing counter columns
Indexing static columns
Partitioners other than Murmur3

Architecture

Indexing is achieved through a Lucene based implementation of Apache Cassandra secondary indexes. Cassandra's secondary indexes are local indexes, meaning that each node of the cluster indexes its own data. As usual in Cassandra, each node can act as search coordinator. The coordinator node sends the searches to all the involved nodes, and then it post-processes the returned rows to return the required ones. This post-processing is particularly important in sorted searches.

Regarding the Cassandra-Lucene mapping, each node has a single Lucene index per indexed table, and each logical CQL row is mapped to a Lucene document. These documents are composed of the user-defined fields, the primary key and the partitioner's token. Indexing is done synchronously at the storage layer, so each row upsert implies a document upsert. This adds an extra cost to write operations, which is the price of the provided search features. Since indexing is done below the distribution layer, replication has already been achieved when the rows reach the index.

Requirements

Cassandra (identified by the first three numbers of the plugin version)
Java >= 1.8 (OpenJDK and Sun have been tested)
Maven >= 3.0

Installation

Stratio’s Cassandra Lucene Index is distributed as a plugin for Apache Cassandra. Thus, you just need to build a JAR containing the plugin and add it to Cassandra’s classpath:

Clone the project: git clone http://github.com/Stratio/cassandra-lucene-index
Change to the downloaded directory: cd cassandra-lucene-index
Checkout a plugin version suitable for your Apache Cassandra version: git checkout A.B.C.X
Build the plugin with Maven: mvn clean package
Copy the generated JAR to the lib folder of your compatible Cassandra installation: cp plugin/target/cassandra-lucene-index-plugin-*.jar <CASSANDRA_HOME>/lib/
Start/restart Cassandra as usual.

Specific Cassandra Lucene index versions target specific Apache Cassandra versions. So, cassandra-lucene-index A.B.C.X is meant to be used with Apache Cassandra A.B.C, e.g. cassandra-lucene-index:3.0.7.1 for cassandra:3.0.7. Please note that production-ready releases are version tags (e.g. 3.0.6.3); don't use branch-X or master branches in production.
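As a small illustration of this versioning rule, the following Python sketch derives the targeted Cassandra version from a plugin version. The function name cassandra_version_for is hypothetical and not part of the plugin.

```python
def cassandra_version_for(plugin_version):
    """Return the Apache Cassandra version A.B.C targeted by plugin version A.B.C.X."""
    parts = plugin_version.split(".")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        raise ValueError("expected a version like A.B.C.X, got: " + plugin_version)
    # The first three numbers identify the Cassandra release.
    return ".".join(parts[:3])


print(cassandra_version_for("3.0.7.1"))
```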

Alternatively, patching can also be done with the following Maven profile, specifying the path of your Cassandra installation. This task also deletes previous versions of the plugin's JAR from the CASSANDRA_HOME/lib/ directory:

mvn clean package -Ppatch -Dcassandra_home=<CASSANDRA_HOME>

If you don’t have an installed version of Cassandra, there is also an alternative profile to let Maven download and patch the proper version of Apache Cassandra:

mvn clean package -Pdownload_and_patch -Dcassandra_home=<CASSANDRA_HOME>

Now you can run Cassandra and do some tests using the Cassandra Query Language:

<CASSANDRA_HOME>/bin/cassandra -f
<CASSANDRA_HOME>/bin/cqlsh

Lucene’s index files will be stored in the same directories as Cassandra’s. The default data directory is /var/lib/cassandra/data, and each index is placed next to the SSTables of its indexed column family.

For more details about Apache Cassandra please see its documentation.

Upgrade

If you want to upgrade your Cassandra cluster to a newer version, you must follow the official DataStax upgrade instructions.

The rule for Lucene secondary indexes is to drop them while running the older version, upgrade Cassandra and the Lucene index JAR, and create them again while running the newer version.

If you have a huge amount of data in your cluster this could be an expensive task. We have tested it, and the following compatibility matrix states between which versions it is not necessary to drop the index:

From\ To	3.0.3.0	3.0.3.1	3.0.4.0	3.0.4.1	3.0.5.0	3.0.5.1	3.0.5.2	3.0.6.0	3.0.6.1	3.0.6.2	3.0.7.0	3.0.7.1	3.0.7.2	3.0.8.0	3.0.8.1
2.x	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.3.0	--	YES	YES	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.3.1	--	--	YES	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.4.0	--	--	--	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.4.1	--	--	--	--	YES	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.5.0	--	--	--	--	--	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.5.1	--	--	--	--	--	--	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.5.2	--	--	--	--	--	--	--	YES	YES	YES	YES	YES	YES	YES	(1)
3.0.6.0	--	--	--	--	--	--	--	--	YES	YES	YES	YES	YES	YES	(1)
3.0.6.1	--	--	--	--	--	--	--	--	--	YES	YES	YES	YES	YES	(1)
3.0.6.2	--	--	--	--	--	--	--	--	--	--	YES	YES	YES	YES	(1)
3.0.7.0	--	--	--	--	--	--	--	--	--	--	--	YES	YES	YES	(1)
3.0.7.1	--	--	--	--	--	--	--	--	--	--	--	--	YES	YES	(1)
3.0.7.2	--	--	--	--	--	--	--	--	--	--	--	--	--	YES	(1)
3.0.8.0	--	--	--	--	--	--	--	--	--	--	--	--	--	--	(1)

(1): Compatible only if you are not using geospatial mappers.
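One way to read the matrix is as four groups of consecutive versions: the index can be kept only when both versions fall in the same group. The Python sketch below encodes that reading. The names GROUPS, KNOWN and can_keep_index are hypothetical, and the matrix itself remains the authoritative reference.

```python
# Groups of consecutive versions, read off the compatibility matrix above.
GROUPS = [
    ["3.0.3.0", "3.0.3.1", "3.0.4.0"],
    ["3.0.4.1", "3.0.5.0"],
    ["3.0.5.1"],
    ["3.0.5.2", "3.0.6.0", "3.0.6.1", "3.0.6.2", "3.0.7.0",
     "3.0.7.1", "3.0.7.2", "3.0.8.0", "3.0.8.1"],
]
KNOWN = [version for group in GROUPS for version in group]


def can_keep_index(source, target):
    """Tell whether the index survives an upgrade from source to target."""
    if target not in KNOWN:
        raise ValueError("target is not covered by the matrix: " + target)
    if source.startswith("2."):
        return "NO"  # the 2.x row of the matrix is NO for every target
    if source not in KNOWN:
        raise ValueError("source is not covered by the matrix: " + source)
    if KNOWN.index(target) <= KNOWN.index(source):
        raise ValueError("target must be newer than source")
    for group in GROUPS:
        if source in group and target in group:
            if target == "3.0.8.1":
                return "YES, unless geospatial mappers are used"  # footnote (1)
            return "YES"
    return "NO"


print(can_keep_index("3.0.3.0", "3.0.4.0"))
print(can_keep_index("3.0.4.0", "3.0.4.1"))
```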

Alternative syntaxes

There are two alternative syntaxes for managing indexes. Prior to Cassandra 3.0, indexes had to be linked to a dummy column due to CQL syntax limitations:

CREATE TABLE test(pk int PRIMARY KEY, rc text);
ALTER TABLE test ADD lucene text; -- Dummy column

CREATE CUSTOM INDEX idx ON test(lucene) -- Index is linked to the dummy column
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {'schema': '{fields: {rc: {type: "text"}}}'};

This column wasn't intended to store anything, it was just a trick to embed Lucene syntax into CQL syntax, so custom search predicates could be directed to this dummy column:

SELECT * FROM test WHERE lucene = '{...}';

As a collateral benefit, this column was used to return the score assigned by the Lucene query to each of the rows.

However, Cassandra 3.0 introduced a secondary index API redesign including explicit syntactical support for custom per-row indexes using their own query language. This new syntax didn't require the dummy column anymore:

CREATE TABLE test(pk int PRIMARY KEY, rc text);

CREATE CUSTOM INDEX idx ON test() -- Index is directly linked to the table, without dummy column
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {'schema': '{fields: {rc: {type: "text"}}}'};

Instead, we can address custom search expressions directly to the index using the new 'expr' operator:

SELECT * FROM test WHERE expr(idx, '{...}');

As you can see, this new syntax is far clearer than the previous one. However, the old syntax is still supported for compatibility reasons, given that several client applications do not support the new syntax yet. The most remarkable case is DataStax's connector for Apache Spark, which doesn't allow 'expr' queries and fails to manage tables with new-style indexes even if the Spark operation doesn't use the index at all. So, unfortunately, you must continue using the old dummy column approach if you are going to use the Spark connector or any other incompatible software.

Additionally, another possible reason for using the old syntax is that it uses the fake column to show the scores assigned by the Lucene's scoring formula to each one of the matched rows. This score is internally used for sorting and selecting the matched rows according to some user-defined search criteria. Although it is more intended for internal use, showing this value could be useful in some specific cases.

Last but not least, it is important to note that you can address searches with the new syntax to indexes created with the old fake column approach:

CREATE TABLE test(pk int PRIMARY KEY, rc text);
ALTER TABLE test ADD lucene text; -- Dummy column

CREATE CUSTOM INDEX idx ON test(lucene) -- Index is linked to the dummy column
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {'schema': '{fields: {rc: {type: "text"}}}'};

SELECT * FROM test WHERE expr(idx,'{...}');

This offers a good balance between the advantages of both syntaxes.

Cassandra only allows one per-row index per table, whereas there is no limit on the number of per-column indexes that a table can have. So, an additional benefit of creating indexes over dummy columns is that you can have multiple Lucene indexes per table, since they are considered per-column indexes.

All the examples in this document use the new syntax, but all of them can be written in the old way.

Example

We will create the following table to store tweets:

CREATE KEYSPACE demo
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 1};
USE demo;
CREATE TABLE tweets (
    id INT PRIMARY KEY,
    user TEXT,
    body TEXT,
    time TIMESTAMP,
    latitude FLOAT,
    longitude FLOAT
);

Now you can create a custom Lucene index on it with the following statement:

CREATE CUSTOM INDEX tweets_index ON tweets ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            id    : {type : "integer"},
            user  : {type : "string"},
            body  : {type : "text", analyzer : "english"},
            time  : {type : "date", pattern : "yyyy/MM/dd"},
            place : {type : "geo_point", latitude:"latitude", longitude:"longitude"}
        }
    }'
};

This will index all the columns in the table with the specified types, and it will be refreshed once per second. Alternatively, you can explicitly refresh all the index shards with an empty search with consistency ALL:

CONSISTENCY ALL
SELECT * FROM tweets WHERE expr(tweets_index, '{refresh:true}');
CONSISTENCY QUORUM

Now, to search for tweets within a certain date range:

SELECT * FROM tweets WHERE expr(tweets_index, '{
    filter : {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"}
}');

The same search can be performed forcing an explicit refresh of the involved index shards:

SELECT * FROM tweets WHERE expr(tweets_index, '{
    filter : {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
    refresh : true
}') limit 100;

Now, to search for the 100 most relevant tweets whose body field contains the phrase “big data gives organizations” within the aforementioned date range:

SELECT * FROM tweets WHERE expr(tweets_index, '{
    filter : {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
    query : {type: "phrase", field: "body", value: "big data gives organizations", slop: 1}
}') LIMIT 100;

To refine the search to get only the tweets written by users whose names start with "a":

SELECT * FROM tweets WHERE expr(tweets_index, '{
    filter : [ {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
               {type: "prefix", field: "user", value: "a"} ],
    query : {type: "phrase", field: "body", value: "big data gives organizations", slop: 1}
}') LIMIT 100;

To get the 100 most recent filtered results you can use the sort option:

SELECT * FROM tweets WHERE expr(tweets_index, '{
    filter : [ {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
               {type: "prefix", field: "user", value: "a"} ],
    query : {type: "phrase", field: "body", value: "big data gives organizations", slop: 1},
    sort : {field: "time", reverse: true}
}') limit 100;

The previous search can be restricted to tweets created close to a geographical position:

SELECT * FROM tweets WHERE expr(tweets_index, '{
    filter : [ {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
               {type: "prefix", field: "user", value: "a"},
               {type: "geo_distance", field: "place", latitude: 40.3930, longitude: -3.7328, max_distance: "10km"} ],
    query : {type: "phrase", field: "body", value: "big data gives organizations", slop: 1},
    sort : {field: "time", reverse: true}
}') limit 100;

It is also possible to sort the results by distance to a geographical position:

SELECT * FROM tweets WHERE expr(tweets_index, '{
    filter: [ {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
               {type: "prefix", field: "user", value: "a"},
               {type: "geo_distance", field: "place", latitude: 40.3930, longitude: -3.7328, max_distance: "10km"} ],
    query :  {type: "phrase", field: "body", value: "big data gives organizations", slop: 1},
    sort : [ {field: "time", reverse: true},
             {field: "place", type: "geo_distance", latitude: 40.3930, longitude: -3.7328} ]
}') limit 100;

Last but not least, you can route any search to a certain token range or partition, in such a way that only a subset of the cluster nodes will be hit, saving precious resources:

SELECT * FROM tweets WHERE expr(tweets_index, '{
    filter : [ {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
               {type: "prefix", field: "user", value: "a"},
               {type: "geo_distance", field: "place", latitude: 40.3930, longitude: -3.7328, max_distance: "10km"} ],
    query :  {type: "phrase", field: "body", value: "big data gives organizations", slop: 1},
    sort : [ {field: "time", reverse: true},
             {field: "place", type: "geo_distance", latitude: 40.3930, longitude: -3.7328} ]
}') AND TOKEN(id) >= TOKEN(0) AND TOKEN(id) < TOKEN(10000000) limit 100;
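Because search expressions are JSON documents embedded in a CQL string, client code can assemble them from native data structures instead of concatenating text by hand. The following Python sketch rebuilds one of the statements above. The helper lucene_select is hypothetical, and the sketch assumes that the plugin also accepts JSON with quoted keys, a form that the examples in this document do not use.

```python
import json


def lucene_select(table, index, search, limit=None):
    """Build a SELECT statement using the expr syntax (hypothetical helper)."""
    # Serialize the search and double any single quote, as CQL string literals require.
    expression = json.dumps(search).replace("'", "''")
    statement = "SELECT * FROM {} WHERE expr({}, '{}')".format(table, index, expression)
    if limit is not None:
        statement += " LIMIT {}".format(int(limit))
    return statement + ";"


search = {
    "filter": [
        {"type": "range", "field": "time", "lower": "2014/04/25", "upper": "2014/05/01"},
        {"type": "prefix", "field": "user", "value": "a"},
    ],
    "query": {"type": "phrase", "field": "body",
              "value": "big data gives organizations", "slop": 1},
    "sort": {"field": "time", "reverse": True},
}

cql = lucene_select("tweets", "tweets_index", search, limit=100)
print(cql)
```

The resulting string can be passed to any CQL driver as a regular statement.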

Indexing

Lucene indexes are an extension of the Cassandra secondary indexes. As such, they are created through the CQL CREATE CUSTOM INDEX statement, specifying the fully qualified class name and a list of configuration options that are described in this section.

Syntax:

CREATE CUSTOM INDEX (IF NOT EXISTS)? <index_name>
                                  ON <table_name> ()
                               USING 'com.stratio.cassandra.lucene.Index'
                        WITH OPTIONS = <options>

where <options> is a JSON object:

<options> := { ('refresh_seconds'        : '<int_value>',)?
               ('ram_buffer_mb'          : '<int_value>',)?
               ('max_merge_mb'           : '<int_value>',)?
               ('max_cached_mb'          : '<int_value>',)?
               ('indexing_threads'       : '<int_value>',)?
               ('indexing_queues_size'   : '<int_value>',)?
               ('directory_path'         : '<string_value>',)?
               ('excluded_data_centers'  : '<string_value>',)?
               'schema'                  : '<schema_definition>'};

All options take a value enclosed in single quotes:

refresh_seconds: number of seconds before auto-refreshing the index reader. It is the max time taken for writes to be searchable without forcing an index refresh. Defaults to '60'.
ram_buffer_mb: size of the write buffer. Its content will be committed to disk when full. Defaults to '64'.
max_merge_mb: defaults to '5'.
max_cached_mb: defaults to '30'.
indexing_threads: number of asynchronous indexing threads. '0' means synchronous indexing. Defaults to '0'.
indexing_queues_size: max number of queued documents per asynchronous indexing thread. Defaults to '50'.
directory_path: the path of the directory where the Lucene index will be stored.
excluded_data_centers: the comma-separated list of the data centers to be excluded. The index will be created on these data centers but all the write operations will be silently ignored.
schema: see below

<schema_definition> := {
    (analyzers : { <analyzer_definition> (, <analyzer_definition>)* } ,)?
    (default_analyzer : "<analyzer_name>",)?
    fields : { <mapper_definition> (, <mapper_definition>)* }
}

Where default_analyzer defaults to 'org.apache.lucene.analysis.standard.StandardAnalyzer'.

<analyzer_definition> := <analyzer_name> : {
    type : "<analyzer_type>" (, <option> : "<value>")*
}

<mapper_definition> := <mapper_name> : {
    type : "<mapper_type>" (, <option> : "<value>")*
}
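The grammar above can also be produced programmatically. The following Python sketch renders a CREATE CUSTOM INDEX statement from a dictionary, quoting every option value as a string as the syntax requires. The helper names cql_string and create_lucene_index are hypothetical, and the sketch assumes that the plugin accepts a schema written as JSON with quoted keys.

```python
import json


def cql_string(text):
    """Quote a value as a CQL string literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def create_lucene_index(index, table, schema, **options):
    """Render a CREATE CUSTOM INDEX statement (hypothetical helper)."""
    # Every option value, including numbers, is passed as a quoted string.
    entries = {name: str(value) for name, value in options.items()}
    entries["schema"] = json.dumps(schema)
    lines = ["    {} : {}".format(cql_string(name), cql_string(value))
             for name, value in entries.items()]
    return (
        "CREATE CUSTOM INDEX IF NOT EXISTS " + index + " ON " + table + " ()\n"
        "USING 'com.stratio.cassandra.lucene.Index'\n"
        "WITH OPTIONS = {\n" + ",\n".join(lines) + "\n};"
    )


schema = {"fields": {"body": {"type": "text", "analyzer": "english"}}}
statement = create_lucene_index("tweets_index", "tweets", schema, refresh_seconds=1)
print(statement)
```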

Analyzers

Analyzer definition options depend on the analyzer type. Details and default values are listed in the table below.

Analyzer type	Option	Value type	Default value
classpath	class	string	null
snowball	language	string	null
snowball	stopwords	string	null

Classpath analyzer

Analyzer that instantiates a Lucene analyzer present in the classpath.

Example:

CREATE CUSTOM INDEX census_index on census()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        analyzers : {
            an_analyzer : {
                type  : "classpath",
                class : "org.apache.lucene.analysis.en.EnglishAnalyzer"
            }
        }
    }'
};

Snowball analyzer

Analyzer using a Snowball filter (SnowballFilter, see http://snowball.tartarus.org/).

Example:

CREATE CUSTOM INDEX census_index on census()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        analyzers : {
            an_analyzer : {
                type  : "snowball",
                language : "English",
                stopwords : "a,an,the,this,that"
            }
        }
    }'
};

Supported languages: English, French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Swedish, Norwegian, Danish, Russian, Finnish, Irish, Hungarian, Turkish, Armenian, Basque and Catalan

Mappers

Field mapping definition options specify how the CQL rows will be mapped to Lucene documents. Several mappers can be applied to the same CQL column/s. Details and default values are listed in the table below.

Mapper type	Option	Value type	Default value	Mandatory
bigdec	validated	boolean	false	No
bigdec	column	string	mapper_name of the schema	No
bigdec	integer_digits	integer	32	No
bigdec	decimal_digits	integer	32	No
bigint	validated	boolean	false	No
bigint	column	string	mapper_name of the schema	No
bigint	digits	integer	32	No
bitemporal	validated	boolean	false	No
bitemporal	vt_from	string	--	Yes
bitemporal	vt_to	string	--	Yes
bitemporal	tt_from	string	--	Yes
bitemporal	tt_to	string	--	Yes
bitemporal	pattern	string	yyyy/MM/dd HH:mm:ss.SSS Z	No
bitemporal	now_value	object	Long.MAX_VALUE	No
blob	validated	boolean	false	No
blob	column	string	mapper_name of the schema	No
boolean	validated	boolean	false	No
boolean	column	string	mapper_name of the schema	No
date	validated	boolean	false	No
date	column	string	mapper_name of the schema	No
date	pattern	string	yyyy/MM/dd HH:mm:ss.SSS Z	No
date_range	validated	boolean	false	No
date_range	from	string	--	Yes
date_range	to	string	--	Yes
date_range	pattern	string	yyyy/MM/dd HH:mm:ss.SSS Z	No
double	validated	boolean	false	No
double	column	string	mapper_name of the schema	No
double	boost	integer	0.1f	No
float	validated	boolean	false	No
float	column	string	mapper_name of the schema	No
float	boost	integer	0.1f	No
geo_point	validated	boolean	false	No
geo_point	latitude	string	--	Yes
geo_point	longitude	string	--	Yes
geo_point	max_levels	integer	11	No
geo_shape	validated	boolean	false	No
geo_shape	column	string	mapper_name of the schema	No
geo_shape	max_levels	integer	5	No
geo_shape	transformations	array	--	No
inet	validated	boolean	false	No
inet	column	string	mapper_name of the schema	No
integer	validated	boolean	false	No
integer	column	string	mapper_name of the schema	No
integer	boost	integer	0.1f	No
long	validated	boolean	false	No
long	column	string	mapper_name of the schema	No
long	boost	integer	0.1f	No
string	validated	boolean	false	No
string	column	string	mapper_name of the schema	No
string	case_sensitive	boolean	true	No
text	validated	boolean	false	No
text	column	string	mapper_name of the schema	No
text	analyzer	string	default_analyzer of the schema	No
uuid	validated	boolean	false	No
uuid	column	string	mapper_name of the schema	No

All mappers have a validated option indicating whether the mapped column values must be validated at CQL level before performing the distributed write operation. If this option is set, the coordinator node will throw an error on writes containing values that can't be mapped, causing the whole write operation to fail and notifying the client of the cause. If validation is not set, which is the default, writes to Cassandra will never fail due to the index. Instead, each failing column value will be silently discarded, and the error message will only be logged on the involved nodes. This option is useful to avoid writes containing values that can't be searched afterwards, and can also be used as a generic data validation layer. Note that mappers affecting several columns at a time, such as date_range, geo_point and bitemporal, need all the involved columns to perform validation, so no partial column updates will be allowed when validation is active.
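The difference between the two modes can be illustrated with a toy model in Python. It is a deliberately simplified sketch and not the plugin's implementation: it collapses coordinator-side validation and replica-side indexing into one function, and the names MappingError, map_integer and index_row are hypothetical.

```python
import logging


class MappingError(ValueError):
    """Raised when a column value can't be mapped to an indexed field."""


def map_integer(value):
    """Toy mapper: accept anything that parses as an integer."""
    try:
        return int(value)
    except (TypeError, ValueError):
        raise MappingError("not an integer: {!r}".format(value))


def index_row(row, mappers):
    """Build the indexed document for a row.

    mappers maps a field name to a (column, convert, validated) tuple.
    """
    document = {}
    for field, (column, convert, validated) in mappers.items():
        if column not in row:
            continue
        try:
            document[field] = convert(row[column])
        except MappingError as error:
            if validated:
                raise  # validated mapper: the whole write fails
            # Non-validated mapper: the value is dropped and the error only logged.
            logging.warning("discarding %s: %s", column, error)
    return document


row = {"id": "12", "age": "unknown"}
lenient = {"id": ("id", map_integer, False), "age": ("age", map_integer, False)}
print(index_row(row, lenient))
```

With validated set to true for the age mapper, the same call raises MappingError instead of returning a partial document.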

Cassandra allows only one custom per-row index per table, and it does not allow indexes to be modified. So, to modify an index you must drop it first and create it again. Alternatively, if you are using the classic dummy-column syntax, the index will be considered per-column, so you can create a second index with the new schema, wait until the new index is completely built, and then drop the old index.

Big decimal mapper

Maps arbitrary precision signed decimal values.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the big decimal to be indexed.
integer_digits (default = 32): the max number of decimal digits for the integer part.
decimal_digits (default = 32): the max number of decimal digits for the decimal part.

Supported CQL types:

ascii, bigint, decimal, double, float, int, smallint, text, tinyint, varchar, varint

Example:

CREATE CUSTOM INDEX census_index on census()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            bigdecimal : {
                type           : "bigdec",
                integer_digits : 2,
                decimal_digits : 2,
                validated      : true,
                column         : "column_name"
            }
        }
    }'
};

Big integer mapper

Maps arbitrary precision signed integer values.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the big integer to be indexed.
digits (default = 32): the max number of decimal digits.

Supported CQL types:

ascii, bigint, int, smallint, text, tinyint, varchar, varint

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            biginteger : {
                type      : "bigint",
                digits    : 10,
                validated : true,
                column    : "column_name"
            }
        }
    }'
};

Bitemporal mapper

Maps four columns containing the four dates defining a bitemporal fact.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
vt_from (mandatory): the name of the column storing the beginning of the valid date range.
vt_to (mandatory): the name of the column storing the end of the valid date range.
tt_from (mandatory): the name of the column storing the beginning of the transaction date range.
tt_to (mandatory): the name of the column storing the end of the transaction date range.
now_value (default = Long.MAX_VALUE): a date representing now.
pattern (default = yyyy/MM/dd HH:mm:ss.SSS Z): the date pattern for parsing Cassandra non-date columns and creating Lucene fields. Note that it can be used to index dates with reduced precision.

Supported CQL types:

ascii, bigint, date, int, text, timestamp, timeuuid, varchar, varint

Example:

CREATE CUSTOM INDEX census_index on census()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            bitemporal : {
                type      : "bitemporal",
                vt_from   : "vt_from",
                vt_to     : "vt_to",
                tt_from   : "tt_from",
                tt_to     : "tt_to",
                validated : true,
                pattern   : "yyyy/MM/dd HH:mm:ss.SSS",
                now_value : "3000/01/01 00:00:00.000"
            }
        }
    }'
};

Blob mapper

Maps a blob value.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the blob to be indexed.

Supported CQL types:

ascii, blob, text, varchar

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            blob : {
                type    : "bytes",
                column  : "column_name"
            }
        }
    }'
};

Boolean mapper

Maps a boolean value.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the boolean value to be indexed.

Supported CQL types:

ascii, boolean, text, varchar

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            bool : {
                type      : "boolean",
                validated : true,
                column    : "column_name"
            }
        }
    }'
};

Date mapper

Maps dates using either a pattern, a UNIX timestamp or a time UUID.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the date to be indexed.
pattern (default = yyyy/MM/dd HH:mm:ss.SSS Z): the date pattern for parsing Cassandra non-date columns and creating Lucene fields. Note that it can be used to index dates with reduced precision.

Supported CQL types:

ascii, bigint, date, int, text, timestamp, timeuuid, varchar, varint

Example: Index the column creation with a precision of minutes using the date format pattern yyyy/MM/dd HH:mm:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            creation : {
                type    : "date",
                pattern : "yyyy/MM/dd HH:mm"
            }
        }
    }'
};

Date range mapper

Maps a time duration/period defined by a start date and a stop date.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
from (mandatory): the name of the column storing the start date of the time duration to be indexed.
to (mandatory): the name of the column storing the stop date of the time duration to be indexed.
pattern (default = yyyy/MM/dd HH:mm:ss.SSS Z): the date pattern for parsing Cassandra non-date columns and creating Lucene fields. Note that it can be used to index dates with reduced precision.

Supported CQL types:

ascii, bigint, date, int, text, timestamp, timeuuid, varchar, varint

Example 1: Index the time period defined by the columns start and stop, using the default date pattern:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            duration : {
                type    : "date_range",
                from    : "start",
                to      : "stop"
            }
        }
    }'
};

Example 2: Index the time period defined by the columns start and stop, validating values, and using a precision of minutes:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            duration : {
                type      : "date_range",
                validated : true,
                from      : "start",
                to        : "stop",
                pattern   : "yyyy/MM/dd HH:mm"
            }
        }
    }'
};
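As a minimal sketch of data that the index in Example 2 could map, assume a hypothetical table test whose start and stop columns hold text values written with the same yyyy/MM/dd HH:mm pattern. The table layout and the inserted values are illustrative assumptions, not part of the original examples:

```sql
-- Hypothetical table: start and stop hold text dates matching the mapper pattern
CREATE TABLE IF NOT EXISTS test (
    id int PRIMARY KEY,
    start text,
    stop text
);

-- The values follow the pattern "yyyy/MM/dd HH:mm" configured in the mapper
INSERT INTO test(id, start, stop) VALUES (1, '2015/06/01 10:30', '2015/06/01 12:00');
INSERT INTO test(id, start, stop) VALUES (2, '2015/06/02 09:00', '2015/06/03 18:45');
```

Because validated is set to true in Example 2, a write whose start or stop value cannot be mapped would be expected to fail instead of only logging the error.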

Double mapper

Maps a 64-bit floating point number.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the double to be indexed.
boost (default = 0.1f): Lucene's index-time boosting factor.

Supported CQL types:

ascii, bigint, decimal, double, float, int, smallint, text, tinyint, varchar, varint

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            double : {
                type      : "double",
                boost     : 2.0,
                validated : true,
                column    : "column_name"
            }
        }
    }'
};

Float mapper

Maps a 32-bit floating point number.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the float to be indexed.
boost (default = 0.1f): Lucene's index-time boosting factor.

Supported CQL types:

ascii, bigint, decimal, double, float, int, smallint, text, tinyint, varchar, varint

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            float : {
                type      : "float",
                boost     : 2.0,
                validated : true,
                column    : "column_name"
            }
        }
    }'
};

Geo point mapper

Maps a geospatial location (point) defined by two columns containing a latitude and a longitude. Indexing is based on a composite spatial strategy that stores points in a doc values field and also indexes them into a geohash recursive prefix tree with a certain precision level. The low-accuracy prefix tree is used to quickly find results, possibly producing some false positives, and the doc values field is used to discard these false positives.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
latitude (mandatory): the name of the column storing the latitude of the point to be indexed.
longitude (mandatory): the name of the column storing the longitude of the point to be indexed.
max_levels (default = 11): the maximum number of levels in the underlying geohash search tree. False positives will be discarded using stored doc values, so this does not imply any loss of precision. Higher values will produce fewer false positives to be post-filtered, at the expense of creating more terms in the search index.

Supported CQL types:

ascii, bigint, decimal, double, float, int, smallint, text, timestamp, varchar, varint

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            geo_point : {
                type       : "geo_point",
                validated  : true,
                latitude   : "lat",
                longitude  : "long",
                max_levels : 15
            }
        }
    }'
};
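A minimal sketch of a table that the index above could be created on, assuming that the lat and long columns named in the mapper are stored as double. The table layout and the coordinates are illustrative assumptions:

```sql
-- Hypothetical table: lat and long are the columns referenced by the geo_point mapper
CREATE TABLE IF NOT EXISTS test (
    id int PRIMARY KEY,
    lat double,
    long double
);

-- Latitude goes into lat and longitude into long
INSERT INTO test(id, lat, long) VALUES (1, 51.50, -0.13);
INSERT INTO test(id, lat, long) VALUES (2, 52.49, -1.89);
```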

Geo shape mapper

Maps a geographical shape stored in a text column with Well Known Text (WKT) format. The supported WKT shapes are point, linestring, polygon, multipoint, multilinestring and multipolygon.

It is possible to specify a sequence of geometrical transformations to be applied to the shape before indexing it. It could be used for indexing only the centroid of the shape, or a buffer around it, etc.

Indexing is based on a composite spatial strategy that stores shapes in a doc values field and also indexes them into a geohash recursive prefix tree with a certain precision level. The low-accuracy prefix tree is used to quickly find results, possibly producing some false positives, and the doc values field is used to discard these false positives.

This mapper depends on Java Topology Suite (JTS). This library can't be distributed together with this project due to license compatibility problems, but you can add it by putting jts-core-1.14.0.jar into your Cassandra installation lib directory.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the shape to be indexed in WKT format.
max_levels (default = 5): the maximum number of levels in the underlying geohash search tree. False positives will be discarded using stored doc values, so this does not imply any loss of precision. Higher values will produce fewer false positives to be post-filtered, at the expense of creating more terms in the search index.
transformations (optional): sequence of geometrical transformations to be applied to each shape before indexing it.

Supported CQL types:

ascii, text, varchar

Example 1:

CREATE TABLE IF NOT EXISTS test (
    id int,
    shape text,
    lucene text,
    PRIMARY KEY (id)
);

INSERT INTO test(id, shape) VALUES (1, 'POINT(-0.13 51.50)');
INSERT INTO test(id, shape) VALUES (2, 'LINESTRING(-0.25 51.52, -0.08 51.39, -0.02 51.42)');
INSERT INTO test(id, shape) VALUES (3, 'POLYGON((-0.07 51.63, 0.03 51.54, 0.05 51.65, -0.07 51.63))');
INSERT INTO test(id, shape) VALUES (4, 'MULTIPOINT(-0.65 52.60, -1.00 51.76, -0.65 52.60)');
INSERT INTO test(id, shape) VALUES (5, 'MULTILINESTRING((-0.43 51.56, -0.33 51.35, -0.13 51.35),
                                                        (-0.25 51.56, -0.14 51.48))');
INSERT INTO test(id, shape) VALUES (6, 'MULTIPOLYGON(((-0.51 51.58, -0.18 51.14, 0.49 51.73, -0.51 51.58),
                                                      (-0.25 51.54, -0.12 51.32, 0.16 51.59, -0.25 51.54)))');

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            shape : {
                type       : "geo_shape",
                max_levels : 15
            }
        }
    }'
};

Example 2: Index only the centroid of the WKT shape contained in the indexed column:

CREATE TABLE IF NOT EXISTS cities (
    name text,
    shape text,
    lucene text,
    PRIMARY KEY (name)
);

INSERT INTO cities(name, shape) VALUES ('birmingham', 'POLYGON((-2.25 52.63, -2.26 52.49, -2.13 52.36, -1.80 52.34, -1.57 52.54, -1.89 52.67, -2.25 52.63))');
INSERT INTO cities(name, shape) VALUES ('london', 'POLYGON((-0.55 51.50, -0.13 51.19, 0.21 51.35, 0.30 51.62, -0.02 51.75, -0.34 51.69, -0.55 51.50))');

CREATE CUSTOM INDEX cities_index on cities()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            shape : {
                type            : "geo_shape",
                max_levels      : 15,
                transformations : [{type:"centroid"}]
            }
        }
    }'
};

Example 3: Index a buffer 50 kilometres around the area of a city:

CREATE TABLE IF NOT EXISTS cities (
    name text,
    shape text,
    lucene text,
    PRIMARY KEY (name)
);

INSERT INTO cities(name, shape) VALUES ('birmingham', 'POLYGON((-2.25 52.63, -2.26 52.49, -2.13 52.36, -1.80 52.34, -1.57 52.54, -1.89 52.67, -2.25 52.63))');
INSERT INTO cities(name, shape) VALUES ('london', 'POLYGON((-0.55 51.50, -0.13 51.19, 0.21 51.35, 0.30 51.62, -0.02 51.75, -0.34 51.69, -0.55 51.50))');

CREATE CUSTOM INDEX cities_index on cities()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            shape : {
                type            : "geo_shape",
                max_levels      : 15,
                transformations : [{type:"buffer", max_distance:"50km"}]
            }
        }
    }'
};

Example 4: Index a buffer 50 kilometres around the borders of a country:

CREATE TABLE IF NOT EXISTS borders (
    country text,
    shape text,
    PRIMARY KEY (country)
);

INSERT INTO borders(country, shape) VALUES ('france', 'LINESTRING(-1.8037198483943 43.463094234466, -1.3642667233943 43.331258296966 ... )');
INSERT INTO borders(country, shape) VALUES ('portugal', 'LINESTRING(-8.8789151608943 41.925008296966, -8.2636807858943 42.100789546966 ... )');

CREATE CUSTOM INDEX borders_index on borders()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            shape : {
                type            : "geo_shape",
                max_levels      : 15,
                transformations : [{type:"buffer", max_distance:"50km"}]
            }
        }
    }'
};

Example 5: Index the convex hull of the WKT shape contained in the indexed column:

CREATE TABLE IF NOT EXISTS blocks (
    id bigint PRIMARY KEY,
    shape text
);

INSERT INTO blocks(id, shape) VALUES (341, 'MULTIPOLYGON(((-86.693279 32.390691, -86.693185 32.391494, -86.691590 32.391362, -86.691621 32.391095 ... )))');

CREATE CUSTOM INDEX blocks_index on blocks()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            shape : {
                type            : "geo_shape",
                max_levels      : 15,
                transformations : [{type:"convex_hull"}]
            }
        }
    }'
};

Example 6: Index the bounding box of the WKT shape contained in the indexed column:

CREATE TABLE IF NOT EXISTS blocks (
    id bigint PRIMARY KEY,
    shape text
);

INSERT INTO blocks(id, shape) VALUES (341, 'MULTIPOLYGON(((-86.693279 32.390691, -86.693185 32.391494, -86.691590 32.391362 ... )))');

CREATE CUSTOM INDEX blocks_index on blocks()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            shape : {
                type            : "geo_shape",
                max_levels      : 15,
                transformations : [{type:"bbox"}]
            }
        }
    }'
};

Inet mapper

Maps an IP address. Both IPv4 and IPv6 are supported.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the IP address to be indexed.

Supported CQL types:

ascii, inet, text, varchar

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            inet : {
                type      : "inet",
                validated : true,
                column    : "column_name"
            }
        }
    }'
};

Integer mapper

Maps a 32-bit integer number.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the integer to be indexed.
boost (default = 0.1f): Lucene's index-time boosting factor.

Supported CQL types:

ascii, bigint, date, decimal, double, float, int, smallint, text, timestamp, tinyint, varchar, varint

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            integer : {
                type      : "integer",
                validated : true,
                column    : "column_name",
                boost     : 2.0
            }
        }
    }'
};

Long mapper

Maps a 64-bit integer number.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the long to be indexed.
boost (default = 0.1f): Lucene's index-time boosting factor.

Supported CQL types:

ascii, bigint, date, decimal, double, float, int, smallint, text, timestamp, tinyint, varchar, varint

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            long : {
                type      : "long",
                validated : true,
                column    : "column_name",
                boost     : 2.0
            }
        }
    }'
};

String mapper

Maps a not-analyzed text value.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the string to be indexed.
case_sensitive (default = true): whether the text is indexed preserving its case.

Supported CQL types:

ascii, bigint, blob, boolean, double, float, inet, int, smallint, text, timestamp, timeuuid, tinyint, uuid, varchar, varint

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            string : {
                type           : "string",
                validated      : true,
                column         : "column_name",
                case_sensitive : false
            }
        }
    }'
};

Text mapper

Maps a language-aware text value analyzed according to the specified analyzer.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the text to be indexed.
analyzer (default = default_analyzer): the name of the text analyzer to be used.

Supported CQL types:

ascii, bigint, blob, boolean, double, float, inet, int, smallint, text, timestamp, timeuuid, tinyint, uuid, varchar, varint

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        analyzers : {
            my_custom_analyzer : {
                  type      : "snowball",
                  language  : "Spanish",
                  stopwords : "el,la,lo,los,las,a,ante,bajo,cabe,con,contra"
            }
        },
        fields : {
            text : {
                type      : "text",
                validated : true,
                column    : "column_name",
                analyzer  : "my_custom_analyzer"
            }
        }
    }'
};

UUID mapper

Maps a UUID value.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the UUID to be indexed.

Supported CQL types:

ascii, text, timeuuid, uuid, varchar

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds' : '1',
    'schema' : '{
        fields : {
            id : {
                type      : "uuid",
                validated : true,
                column    : "column_name"
            }
        }
    }'
};

Example

The code below, together with the statements for creating the corresponding keyspace and table, is available in a CQL script that can be sourced from the Cassandra shell: test-users-create.cql.

CREATE CUSTOM INDEX IF NOT EXISTS users_index
ON test.users ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds'       : '60',
    'ram_buffer_mb'         : '64',
    'max_merge_mb'          : '5',
    'max_cached_mb'         : '30',
    'excluded_data_centers' : 'dc2,dc3',
    'schema' : '{
        analyzers : {
            my_custom_analyzer : {
                type      : "snowball",
                language  : "Spanish",
                stopwords : "el,la,lo,los,las,a,ante,bajo,cabe,con,contra"
            }
        },
        default_analyzer : "english",
        fields : {
            name     : {type : "string"},
            gender   : {type : "string", validated : true},
            animal   : {type : "string"},
            age      : {type : "integer"},
            food     : {type : "string"},
            number   : {type : "integer"},
            bool     : {type : "boolean"},
            date     : {type : "date", validated : true, pattern : "yyyy/MM/dd"},
            duration : {type : "date_range", from : "start_date", to : "stop_date"},
            place    : {type : "geo_point", latitude : "latitude", longitude : "longitude"},
            mapz     : {type : "string"},
            setz     : {type : "string"},
            listz    : {type : "string"},
            phrase   : {type : "text", analyzer : "my_custom_analyzer"}
        }
    }'
};

Searching

Lucene indexes are queried using a custom JSON syntax defining the kind of search to be done.

Syntax:

SELECT ( <fields> | * ) FROM <table_name> WHERE expr(<index_name>, '{
    (   filter  : ( <filter> )* )?
    ( , query   : ( <query>  )* )?
    ( , sort    : ( <sort>   )* )?
    ( , refresh : ( true | false ) )?
}');

where <filter> and <query> are JSON objects:

<filter> := { type : <type> (, <option> : ( <value> | <value_list> ) )* }
<query>  := { type : <type> (, <option> : ( <value> | <value_list> ) )* }

and <sort> is another JSON object:

<sort> := <simple_sort_field> | <geo_distance_sort_field>
<simple_sort_field> := {(type: "simple",)? field : <field> (, reverse : <reverse> )? }
<geo_distance_sort_field> := {  type: "geo_distance",
                                field : <field>,
                                latitude : <Double>,
                                longitude: <Double>
                                (, reverse : <reverse> )? }

When searching by filter, without any query or sort defined, the results are returned in Cassandra’s natural order, which is defined by the partitioner and the column name comparator. When searching by query, results are returned sorted by descending relevance. The sort option specifies the order in which the indexed rows will be traversed. When simple_sort_field sorting is used, the query scoring is delayed.
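The difference between these modes can be sketched with two searches against the users_index defined in the indexing example. This is a hedged illustration: the type identifiers range and phrase are assumed to be the lowercase names of the search types, and the searched values are made up.

```sql
-- Filter plus sort: no relevance scoring is needed, rows are ordered by name
SELECT * FROM test.users WHERE expr(users_index, '{
    filter : {type : "range", field : "age", lower : 18, include_lower : true},
    sort   : {field : "name", reverse : false}
}');

-- Query only: rows are returned sorted by descending relevance
SELECT * FROM test.users WHERE expr(users_index, '{
    query : {type : "phrase", field : "phrase", value : "camisa manchada"}
}');
```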

A geo_distance_sort_field sorts rows by their minimum distance to the point given by latitude and longitude. Its field option names the geo point mapper to be used.
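A minimal sketch of a distance sort, assuming the place geo point mapper of the users_index defined in the indexing example. The coordinates are arbitrary illustrative values:

```sql
-- Return all indexed rows, nearest to the reference point first
SELECT * FROM test.users WHERE expr(users_index, '{
    filter : {type : "all"},
    sort   : {type : "geo_distance", field : "place", latitude : 40.4168, longitude : -3.7038}
}');
```

Setting reverse to true would be expected to return the farthest rows first.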

Relevance queries must touch all the nodes in the ring in order to find the globally best results, so you should prefer filters over queries when neither relevance nor sorting is needed.

The refresh boolean option indicates whether the search must commit pending writes and refresh the Lucene IndexSearcher before being performed. This way a search with refresh set to true will view the most recent changes done to the index, independently of the index auto-refresh time. Please note that it is a costly operation, so you should not use it unless it is strictly necessary. The default value is false. You can explicitly refresh all the index shards with an empty search with consistency ALL, and then return to your desired consistency level:

CONSISTENCY ALL
SELECT * FROM <table> WHERE expr(<index_name>, '{refresh:true}');
CONSISTENCY QUORUM

This way the subsequent searches will view all the writes done before this operation, without needing to wait for the index auto refresh. It is useful to perform this operation before searching after a bulk data load.

Types of search and their options are summarized in the table below. Details for each of them are available in individual sections and the examples can be downloaded as a CQL script: extended-search-examples.cql.

In addition to the options described in the table, all search types have a “boost” option that acts as a weight on the resulting score.

Search type	Option	Value type	Default value	Mandatory
All				
Bitemporal	field	string		Yes
Bitemporal	vt_from	string/long	0L	No
Bitemporal	vt_to	string/long	Long.MAX_VALUE	No
Bitemporal	tt_from	string/long	0L	No
Bitemporal	tt_to	string/long	Long.MAX_VALUE	No
Bitemporal	operation	string	intersects	No
Boolean	must	search		No
Boolean	should	search		No
Boolean	not	search		No
Contains	field	string		Yes
Contains	values	array		Yes
Contains	doc_values	boolean	false	No
Date range	field	string		Yes
Date range	from	string/long	0	No
Date range	to	string/long	Long.MAX_VALUE	No
Date range	operation	string	is_within	No
Fuzzy	field	string		Yes
Fuzzy	value	string		Yes
Fuzzy	max_edits	integer	2	No
Fuzzy	prefix_length	integer	0	No
Fuzzy	max_expansions	integer	50	No
Fuzzy	transpositions	boolean	true	No
Geo bounding box	field	string		Yes
Geo bounding box	min_latitude	double		Yes
Geo bounding box	max_latitude	double		Yes
Geo bounding box	min_longitude	double		Yes
Geo bounding box	max_longitude	double		Yes
Geo distance	field	string		Yes
Geo distance	latitude	double		Yes
Geo distance	longitude	double		Yes
Geo distance	max_distance	string		Yes
Geo distance	min_distance	string		No
Geo shape	field	string		Yes
Geo shape	shape	string (WKT)		Yes
Geo shape	operation	string	is_within	No
Geo shape	transformations	array		No
Match	field	string		Yes
Match	value	any		Yes
Match	doc_values	boolean	false	No
None				
Phrase	field	string		Yes
Phrase	value	string		Yes
Phrase	slop	integer	0	No
Prefix	field	string		Yes
Prefix	value	string		Yes
Range	field	string		Yes
Range	lower	any		No
Range	upper	any		No
Range	include_lower	boolean	false	No
Range	include_upper	boolean	false	No
Range	doc_values	boolean	false	No
Regexp	field	string		Yes
Regexp	value	string		Yes
Wildcard	field	string		Yes
Wildcard	value	string		Yes
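The boost option mentioned before the table can be sketched as follows. This is a hedged illustration against the users_index defined in the indexing example: it assumes that the type identifiers are the lowercase names boolean and match, and that should accepts a list of searches. The food values are made up.

```sql
-- Rows matching either clause are returned; the first clause weighs twice as much
SELECT * FROM test.users WHERE expr(users_index, '{
    query : {
        type   : "boolean",
        should : [
            {type : "match", field : "food", value : "chips", boost : 2.0},
            {type : "match", field : "food", value : "turkey"}
        ]
    }
}');
```

Under these assumptions, rows matching the boosted clause would tend to score higher than rows matching only the second one.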

All search

Search for all the indexed rows.

Syntax:

SELECT ( <fields> | * ) FROM <table> WHERE expr(<index_name>, '{
    (filter | query) : { type  : "all"}
}');

Example: search for all the indexed rows:

SELECT * FROM users WHERE expr(users_index, '{
    filter : { type  : "all" }
}');

Values	Unit
mm, millimetres	millimetre
cm, centimetres	centimetre
dm, decimetres	decimetre
m, metres	metre
dam, decametres	decametre
hm, hectometres	hectometre
km, kilometres	kilometre
ft, foots	foot
yd, yards	yard
in, inches	inch
mi, miles	mile
M, NM, mil, nautical_miles	nautical mile
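Distances are written as a number followed by one of the unit values above, as in the "50km" buffers of the geo shape mapper examples. A minimal sketch of a distance search, assuming the type identifier geo_distance and the place geo point mapper of the users_index defined in the indexing example, with arbitrary coordinates:

```sql
-- Rows located between 500 metres and 10 kilometres from the reference point
SELECT * FROM test.users WHERE expr(users_index, '{
    filter : {
        type         : "geo_distance",
        field        : "place",
        latitude     : 40.4168,
        longitude    : -3.7038,
        max_distance : "10km",
        min_distance : "500m"
    }
}');
```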

Name	Type	Notes
NumDeletedDocs	Attribute	Total number of deleted documents in the index.
NumDocs	Attribute	Total number of documents in the index.
Commit	Operation	Commits all the pending index changes to disk.
Refresh	Operation	Reopens all the readers and searchers to provide a recent view of the index.
forceMerge	Operation	Optimizes the index forcing merge segments leaving the specified number of segments. It also includes a boolean parameter to block until all merging completes.
forceMergeDeletes	Operation	Optimizes the index forcing merge segments containing deletions, leaving the specified number of segments. It also includes a boolean parameter to block until all merging completes.
