# RediSearch: a full-text search engine

RediSearch implements a search engine on top of Redis. Unlike other Redis search libraries, it does not rely on Redis' built-in data structures such as Sorted Sets; instead, it implements its own data structures via the Redis Modules API.

Primary Features:

* Full-Text indexing of multiple fields in documents.
* Incremental indexing without performance loss.
* Document ranking (provided manually by the user at index time).
* Complex boolean queries with AND, OR, NOT operators between sub-queries.
* Optional query clauses.
* Prefix based searches.
* Field weights.
* Exact Phrase Search, Slop (number of words between) based search.
* Stemming based query expansion in many languages.
* Support for custom functions for query expansion and scoring.
* Limiting searches to specific document fields.
* Numeric filters and ranges.
* Geo filtering using Redis' own Geo-commands.
* Supports any UTF-8 encoded text.
* Retrieve full document content or just ids.
* Automatically index existing Hash keys as documents.
* Document deletion and updating with index garbage collection.
* Auto-complete suggestions (with fuzzy prefix suggestions).

The RediSearch project website is at [http://redisearch.io](http://redisearch.io).

RediSearch is open source; the code is available at [https://github.com/RedisLabsModules/RediSearch](https://github.com/RedisLabsModules/RediSearch).

## What is searching

Searching is a core capability of almost any service and application. In fact, consider the following examples and the part that search plays in them:

* A web search engine (e.g. Google)
* A shopping website (e.g. Amazon)
* A social network (e.g. Facebook or Twitter)
* A streaming provider (e.g. Netflix)
* Your application here

When using the term "search" to refer to that capability, it should be interpreted as fetching data based on its value(s).

## Searching in Redis

Redis is a (mostly) perfect data store, but when it comes to searching the data - i.e. get by value - it offers virtually nothing. Accessing data in Redis is always by (primary) key, so searching for values requires using one of two approaches:

1. Full scan - perform any search by iterating over all the data items
2. Indexing - maintain a data structure that supports efficient search

A full scan is not an option in the context of low latency data serving as it is a slow and compute-intensive task. Indices, on the other hand, provide a solution that fits real time search requirements in terms of performance, but at the cost of extra storage and operations.
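As a rough illustration of the trade-off (plain Python rather than Redis; the `lines` data and the helper names are made up for this sketch), a full scan inspects every item on each query, whereas an index answers the same question with a single lookup, at the cost of building and storing the extra mapping:

```python
from collections import defaultdict

# Hypothetical data: primary key -> record, as it might be stored by key
lines = {
    "1": {"character": "ROMEO", "play": "Romeo and Juliet"},
    "2": {"character": "JULIET", "play": "Romeo and Juliet"},
    "3": {"character": "OTHELLO", "play": "Othello"},
}


def full_scan(data, field, value):
    """Search by value by iterating over every item: O(N) per query."""
    return sorted(key for key, record in data.items() if record[field] == value)


def build_index(data, field):
    """Build a value -> set of keys mapping once, so lookups avoid the scan."""
    index = defaultdict(set)
    for key, record in data.items():
        index[record[field]].add(key)
    return index


play_index = build_index(lines, "play")

scan_result = full_scan(lines, "play", "Romeo and Juliet")
index_result = sorted(play_index["Romeo and Juliet"])
print(scan_result, index_result)
```

Both calls return the same keys; the difference is that the index has to be updated whenever a record changes.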

While Redis has no built-in indexing mechanisms, its core data structures can be used for creating and maintaining them. For example, Sets naturally lend themselves to representing 1:N relationships, and Sorted Sets are excellent for doing range queries. There are quite a few well-known "recipes" that use different data structures and techniques for indexing in Redis - see the [Secondary indexing with Redis](https://redis.io/topics/indexes) documentation page for more information.

Because maintaining an index can quickly become complex, there are a lot of 3rd-party open source libraries that abstract the underlying complexities by providing indexing via an object-mapping framework. Notable examples of such mappers are Ruby's [Ohm (Object-hash mapping)](http://ohm.keyvalue.org/) and Python's [ROM (Redis object mapper)](https://pypi.python.org/pypi/rom).

## Why RediSearch

While there are many existing search solutions, most do not deliver when indexing and querying billions of documents in real time. Existing technologies usually "suffer" from being:

* Written in Java
* Bloated and complex
* Designed with disk storage in mind

Redis, although extremely performant, has no built-in search and implementing your own indices on top of it can become a challenging task as described above. Moreover, the core Redis data structures are not always optimal for every indexing need in terms of time and/or space and more efficient approaches could be implemented to solve such cases.

With v4's Modules API, Redis can be extended with modules that add new data structures and commands to the server. RediSearch is an open source Redis module written from scratch to address these shortcomings: it indexes documents and executes text, numeric range, geospatial and autocomplete queries with low latency.

## Terminology

### Searching

Fetching data, referred to as a document, by its value. RediSearch lets you do exactly that.

### A document

A collection of one or more attribute properties (values). Attributes may be:

* Numerical, having numbers as properties
* Textual, made up of one or more words (terms)
* Geographical, coordinates given as longitude-latitude pairs

RediSearch can store documents as Redis Hashes, or use your existing Redis Hashes. It also allows storing an optional arbitrary binary payload with each document.

### An index

An [index](https://en.wikipedia.org/wiki/Database_index) is the core concept of searching - it is a data structure that's designed to allow efficient access to data, at the cost of additional:

* Storage resources: for keeping the index's data
* Compute resources: for keeping the data in the index up-to-date

An index is a mapping between a value and its location in the database, e.g. the primary key in an RDBMS or key name in Redis. RediSearch creates an index for each indexed attribute and uses several types of indices, depending on the properties' types.

### The documents table

While the input documents may be given any arbitrary identifiers, for compression purposes RediSearch maps each document to an internal identifier. RediSearch uses an efficient implementation of a [trie](https://en.wikipedia.org/wiki/Trie) - [triemap](https://github.com/RedisLabs/triemap) - for this purpose.
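The idea of a documents table can be sketched as follows. This is only an illustration under stated assumptions: it uses a plain `dict` instead of a trie, the `DocTable` class and its method names are hypothetical, and starting the internal ids at 1 is an arbitrary choice. The point is that long, arbitrary external identifiers are swapped for small sequential integers, which are presumably cheaper to store inside the index.

```python
class DocTable:
    """Toy documents table: maps arbitrary external ids to small integers."""

    def __init__(self):
        self._to_internal = {}  # external id -> internal id
        self._to_external = []  # position (internal id - 1) -> external id

    def get_or_add(self, external_id):
        """Return the internal id for external_id, assigning one if needed."""
        internal = self._to_internal.get(external_id)
        if internal is None:
            self._to_external.append(external_id)
            internal = len(self._to_external)  # ids are 1, 2, 3, ...
            self._to_internal[external_id] = internal
        return internal

    def external(self, internal_id):
        """Translate an internal id back to the id the user supplied."""
        return self._to_external[internal_id - 1]


table = DocTable()
first = table.get_or_add("romeo-and-juliet:2.2.35")
second = table.get_or_add("othello:1.1.1")
print(first, second, table.external(first))
```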

### An inverted index (Posting List)

An [inverted index](https://en.wikipedia.org/wiki/Inverted_index) (also referred to as Posting List) is a mapping between terms and their locations in the database (i.e. documents). RediSearch includes a custom data type that implements an inverted index - this is used for indexing textual terms in the documents.

### A range tree

A [tree data structure](https://en.wikipedia.org/wiki/Range_tree) that is used for range searches. RediSearch uses a clever mix of preallocations and linked lists to maintain trees with close-to-optimal depths.
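To make the purpose of such a structure concrete, here is a deliberately simplified stand-in: a sorted list queried with binary search. It is not a range tree and not how RediSearch implements numeric indexing; the `NumericIndex` class is hypothetical and only shows what a range search over indexed numeric values returns.

```python
import bisect


class NumericIndex:
    """Toy numeric index: values kept sorted, with doc ids in parallel."""

    def __init__(self):
        self._values = []
        self._doc_ids = []

    def add(self, doc_id, value):
        # Find the insertion point that keeps the values sorted
        pos = bisect.bisect_right(self._values, value)
        self._values.insert(pos, value)
        self._doc_ids.insert(pos, doc_id)

    def range(self, low, high):
        """Return the ids of documents whose value is within [low, high]."""
        start = bisect.bisect_left(self._values, low)
        end = bisect.bisect_right(self._values, high)
        return self._doc_ids[start:end]


chapters = NumericIndex()
for doc_id, chapter in [("a", 1), ("b", 4), ("c", 2), ("d", 4)]:
    chapters.add(doc_id, chapter)

print(chapters.range(2, 4))
```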

### A search engine

Software that builds and queries indices. The general flow for building an index is:

1. Take a document
2. Break it apart to its constituent values
3. Map terms/properties to the document using a Posting List

Searching, via queries, is getting the documents that are linked to the provided terms. RediSearch is a search engine.
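The flow above can be sketched in a few lines of plain Python. This is a minimal toy, not RediSearch's implementation: the `TinySearchEngine` class is hypothetical, tokenization is a simple regular expression, and the posting lists are ordinary Python sets.

```python
import re
from collections import defaultdict


class TinySearchEngine:
    """Toy search engine that builds and queries an inverted index."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it

    def add_document(self, doc_id, text):
        # Steps 1-2: take a document and break it apart into lowercase terms
        terms = re.findall(r"\w+", text.lower())
        # Step 3: map every term to the document in its posting list
        for term in terms:
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the ids of documents that contain all of the query's terms."""
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


engine = TinySearchEngine()
engine.add_document("86169", "O Romeo, Romeo! wherefore art thou Romeo?")
engine.add_document("made-up-id", "Thou art a villain")  # hypothetical document

print(engine.search("art thou"))
print(engine.search("wherefore thou"))
```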

## Training prelude

The following sections in this notebook show how to work with RediSearch's core capabilities. The dataset used in the examples is the complete works of William Shakespeare. The imported dataset is stored in the `will_play_text.csv` semicolon-separated file in the following format:

<code><pre>
"86169";"Romeo and Juliet";"4";"2.2.35";"JULIET";"O Romeo, Romeo! wherefore art thou Romeo?"
</pre></code>

The uncompressed file is 9.9MB and 111,396 lines long. Each record in it is a single line from Shakespeare's works.
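As a quick check of the format, the sample record above can be parsed with Python's `csv` module. The variable names are this sketch's own: `line_id`, `play`, `chapter`, `character` and `text` follow the way the columns are used later in this notebook, while `position` is just a placeholder name for the fourth column.

```python
import csv
import io

sample = (
    '"86169";"Romeo and Juliet";"4";"2.2.35";"JULIET";'
    '"O Romeo, Romeo! wherefore art thou Romeo?"\n'
)

# The file is semicolon-separated and every value is quoted
reader = csv.reader(io.StringIO(sample), delimiter=";")
record = next(reader)

line_id, play, chapter, position, character, text = record
print(line_id, play, character)
print(text)
```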

RediSearch can be used via a client that provides a language-specific interface to the module's API, and the project's website has the [full list](http://redisearch.io/#client-libraries). This notebook uses the Python client, [redisearch-py](https://github.com/RedisLabs/redisearch-py), that's installable with:

<code><pre>
$ pip install redisearch
</pre></code>

## Connecting to RediSearch

To begin using RediSearch, the first thing that's needed is a client that connects to the search engine. The search client is associated with a single index, identified by its name:

In [None]:
from redisearch import Client, TextField, NumericField

# Create a search client for the index called 'ws' using the notebook's client
client = Client('ws', conn=r)

## Index creation and deletion

A RediSearch index must be explicitly created before documents can be added to it or searched. The [`FT.CREATE`](http://redisearch.io/Commands/#ftcreate) command creates an index:

<code><pre>
FT.CREATE {index} 
    [NOOFFSETS] [NOFIELDS] [NOSCOREIDX]
    SCHEMA {field} [TEXT [WEIGHT {weight}] | NUMERIC | GEO] [SORTABLE] ...
</pre></code>

The index is identified by its unique `{index}` name.

> Note: Keep your index name short to save RAM and bandwidth, because the name is used in all key names that make up the index.

When used, the optional `NOOFFSETS`, `NOFIELDS` and `NOSCOREIDX` flags reduce the index's memory footprint, but doing so disables parts of the search functionality. The schema is a list of one or more fields and their respective types - these fields will be indexed for documents that are added to the index.
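To see how the syntax above translates into an actual command, the following sketch assembles the argument list for `FT.CREATE`. The `ft_create_args()` helper is hypothetical (it is not part of redisearch-py) and covers only the options shown above, leaving out `SORTABLE`; the resulting list could be passed to a redis-py connection with `execute_command(*args)`.

```python
def ft_create_args(index, fields, no_offsets=False, no_fields=False,
                   no_score_idx=False):
    """Build the argument list for FT.CREATE.

    `fields` is a list of (name, type, weight) tuples, where type is one of
    'TEXT', 'NUMERIC' or 'GEO' and weight is used for TEXT fields only.
    """
    args = ["FT.CREATE", index]
    if no_offsets:
        args.append("NOOFFSETS")
    if no_fields:
        args.append("NOFIELDS")
    if no_score_idx:
        args.append("NOSCOREIDX")
    args.append("SCHEMA")
    for name, field_type, weight in fields:
        args += [name, field_type]
        if field_type == "TEXT" and weight is not None:
            args += ["WEIGHT", weight]
    return args


create_args = ft_create_args("ws", [
    ("text", "TEXT", None),
    ("line", "NUMERIC", None),
    ("character", "TEXT", 0.1),
])
print(" ".join(str(arg) for arg in create_args))
```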

To create an index in Python using redisearch-py, call [`create_index()`](https://github.com/RedisLabs/redisearch-py/blob/master/API.md#create_index).

Indices can be deleted with the [`FT.DROP`](http://redisearch.io/Commands/#ftdrop) command, thus deleting all keys that are associated with the index:

<code><pre>
FT.DROP {index}
</pre></code>

The Python client's [`drop_index()`](https://github.com/RedisLabs/redisearch-py/blob/master/API.md#drop_index) method achieves the same effect:

In [None]:
from redis import exceptions

# drop the index, ignore ResponseError exception if it doesn't exist
try:
    client.drop_index()
except exceptions.ResponseError as e:
    pass

Here are a few examples that create an index in Python by calling [`create_index()`](https://github.com/RedisLabs/redisearch-py/blob/master/API.md#create_index) and drop it immediately afterwards:

In [None]:
# Index only the line's text
client.create_index((TextField('text'),))
client.drop_index()

# Index both line text and number
client.create_index((TextField('text'),
                    NumericField('line')))
client.drop_index()

# Also index the character's name with a low weight
client.create_index((TextField('text'),
                    NumericField('line'),
                    TextField('character', weight=0.1)))
client.drop_index()

## Index creation exercise

In the following code block, replace the comments with the actual field definitions that are specified:

In [None]:
try:
    client.drop_index()
except exceptions.ResponseError as e:
    pass

client.create_index((TextField('text'),
                    NumericField('line'),
                    TextField('character', weight=0.1)))
                    # Add the `play` text field with weight of 0.1
                    # Add the `chapter` numerical field

## Obtaining information about an index

The index's metadata is stored in the Redis database with a custom data structure at a key named `idx:{index}`. To fetch the metadata, call [`FT.INFO`](http://redisearch.io/Commands/#ftinfo), or the respective [`info()`](https://github.com/RedisLabs/redisearch-py/blob/master/API.md#info) method of the Python client:

In [None]:
import pprint

pprint.pprint(client.info())

## Document indexing

Once created, an index must be populated with documents before queries against it can return meaningful results. RediSearch's [`FT.ADD`](http://redisearch.io/Commands/#ftadd) command does that:

<code><pre>
FT.ADD {index} {docId} {score}
  [NOSAVE]
  [REPLACE]
  [LANGUAGE {language}] 
  [PAYLOAD {payload}]
  FIELDS {field} {value} [{field} {value} ...]
</pre></code>

Besides the mandatory `{index}` name, adding a document requires providing a unique document identifier - the `{docId}` - and a given identifier can be added to the index only once. Also, a document needs to be given a numerical score (`{score}`) between 0.0 and 1.0 that is used by the engine's scoring function. Lastly, every indexed field needs to be given a value for that added document.
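The requirements above can be made concrete with a small sketch that assembles the arguments for `FT.ADD` and rejects an out-of-range score. The `ft_add_args()` helper is hypothetical and handles only the `NOSAVE` and `REPLACE` options; it builds a list that could be sent with a redis-py connection's `execute_command(*args)`.

```python
def ft_add_args(index, doc_id, fields, score=1.0, nosave=False, replace=False):
    """Build the argument list for FT.ADD from a dict of field values."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if not fields:
        raise ValueError("at least one field value is required")
    args = ["FT.ADD", index, doc_id, score]
    if nosave:
        args.append("NOSAVE")
    if replace:
        args.append("REPLACE")
    args.append("FIELDS")
    for name, value in fields.items():
        args += [name, value]
    return args


add_args = ft_add_args("ws", "86169", {
    "text": "O Romeo, Romeo! wherefore art thou Romeo?",
    "character": "JULIET",
})
print(add_args)
```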

Adding documents to the index with the Python client is simply a matter of invoking the client's [`add_document()`](https://github.com/RedisLabs/redisearch-py/blob/master/API.md#add_document) with the right arguments. However, when doing bulk updates to the index, it is preferable to instantiate a [`BatchIndexer`](https://github.com/RedisLabs/redisearch-py/blob/master/API.md#class-batchindexer) object by calling [`batch_indexer()`](https://github.com/RedisLabs/redisearch-py/blob/master/API.md#batch_indexer). The `BatchIndexer` implements the same interface as the client, but employs Redis' pipelining for optimizing the network traffic.

The following reads the dataset from the file, adding each line as a document in the index:

In [None]:
# Since this is a bulk upload, we'll use Redis' pipelining with the...
indexer = Client.BatchIndexer(client)

# Add the complete works of William Shakespeare to the index
import csv
with open('../static/files/will_play_text.csv') as fp:
    f = csv.reader(fp, delimiter=';')
    for l in f:
        # [86169, 'Romeo and Juliet', 4, '2.2.35', 'JULIET', "O Romeo, Romeo! wherefore art thou Romeo?"]
        indexer.add_document(l[0],
                        text=l[5], line=l[0], play=l[1], chapter=(l[2] or 0), character=l[4])
indexer.commit()
pprint.pprint(client.info())

### The anatomy of a document in the index

When a document is added to the index, the engine creates the relevant mappings for it. Each `docId` is stored in Redis as a [Hash](https://redis.io/topics/data-types-intro#redis-hashes), with the Hash's fields corresponding to the document's contents:

In [None]:
docid = '86169'
print('Type of key {}: {}'.format(docid, r.type(docid)))
pprint.pprint(r.hgetall(docid))

Note that while documents are stored in Redis by default, this isn't a strict requirement. You can use RediSearch solely for the purpose of indexing, while the documents themselves are managed by another data store. To do so, call `FT.ADD` with the optional `NOSAVE` flag.

Next, the terms and attributes of every indexed field are mapped to the document with a set of custom data structures according to their type. Textual terms (words) are stored in Posting Lists (inverted index) under the keys `ft:{index}/{term}`. Numerical properties are indexed using a [range tree](https://en.wikipedia.org/wiki/Range_tree) that's stored under the keys `nm:{index}/{field}`. Lastly, geographical properties are indexed using [Redis' Geosets](https://redis.io/commands#geo) under `geo:{index}/{field}`.
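The naming scheme described above can be spelled out with a tiny helper. The `index_key_names()` function is hypothetical and simply formats the key patterns given in the text; the actual key layout may differ between RediSearch versions.

```python
def index_key_names(index, terms=(), numeric_fields=(), geo_fields=()):
    """List the key names that the patterns above yield for an index."""
    keys = ["ft:{}/{}".format(index, term) for term in terms]
    keys += ["nm:{}/{}".format(index, field) for field in numeric_fields]
    keys += ["geo:{}/{}".format(index, field) for field in geo_fields]
    return keys


key_names = index_key_names("ws", terms=["romeo", "wherefore"],
                            numeric_fields=["line"])
print(key_names)
```

Because the index name is embedded in every key, a short name such as `ws` keeps these keys compact.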

### Adding existing Redis Hashes as documents

RediSearch is also capable of indexing documents that are already in your Redis database, provided that you've used the Hash data structure for storing them. The [`FT.ADDHASH`](http://redisearch.io/Commands/#ftaddhash) command assumes that the `docId` is a Hash key name with the index schema's fields stored in it:

<code><pre>
FT.ADDHASH {index} {docId} {score} [LANGUAGE language] [REPLACE]
</pre></code>

## Deleting a document

To delete a document from the index, use the [`FT.DEL`](http://redisearch.io/Commands/#ftdel) command:

<code><pre>
FT.DEL {index} {doc_id}
</pre></code>

## Optimizing the RAM footprint of an index

While populating the index's data structures, RediSearch preallocates memory to reduce processing time. Once the indexing is complete, it is possible to instruct the engine to free any leftover allocations with the [`FT.OPTIMIZE`](http://redisearch.io/Commands/#ftoptimize) command.

## Index queries

Querying the index is possible via the use of a query language. The language has the following rules:

| Query                                    | Rule                                     |
| ---------------------------------------- | ---------------------------------------- |
| `wherefore art thou`                     | documents containing the intersection of all terms (AND) |
| `(wherefore art thou)`                   | ditto                                    |
| `"wherefore art thou"`                   | exact phrase match                       |
| <code>wherefore&#124;art&#124;thou</code> | union of documents containing one of the terms (OR) |
| `-wherefore`                             | negation, documents that do not contain the term (NOT) |
| `whe*`                                   | prefix (3 or more characters) match      |
| `@text:romeo`                            | search the term in a selected field      |
| <code>@text&#124;character: romeo</code> | search term in any of the selected fields |
| `@chapter:[(0 +inf]`                     | numeric range, Redis-like syntax         |
| `~love`                                  | optional term/clause, documents with it will rank higher |

Complex queries can be made up of one or more rules, as shown by the following example:

<code><pre>
where* ~love @character:romeo|juliet
</pre></code>
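One way to keep such queries readable in application code is to compose them from small helpers, one per rule in the table. The helper functions below are hypothetical and only concatenate strings according to the rules above; they do not validate or escape terms.

```python
def prefix(stem):
    """Prefix match; the table above requires 3 or more characters."""
    if len(stem) < 3:
        raise ValueError("a prefix needs at least 3 characters")
    return stem + "*"


def optional(clause):
    """Optional clause: documents that match it rank higher."""
    return "~" + clause


def negate(clause):
    """Negation: documents that do not contain the term."""
    return "-" + clause


def any_of(*terms):
    """Union (OR) of terms."""
    return "|".join(terms)


def in_field(field, clause):
    """Limit a clause to a selected field."""
    return "@{}:{}".format(field, clause)


def all_of(*clauses):
    """Intersection (AND) of clauses."""
    return " ".join(clauses)


query = all_of(prefix("where"),
               optional("love"),
               in_field("character", any_of("romeo", "juliet")))
print(query)
```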

## Executing a search query

Calling [`FT.SEARCH`](http://redisearch.io/Commands/#ftsearch) runs a query against an index - here's the command's full syntax:

<code><pre>
FT.SEARCH {index} {query} [NOCONTENT] [VERBATIM] [NOSTOPWORDS] [WITHSCORES] [WITHPAYLOADS]
  [FILTER {numeric_field} {min} {max}] ...
  [GEOFILTER {geo_field} {lon} {lat} {radius} m|km|mi|ft]
  [INKEYS {num} {key} ... ]
  [INFIELDS {num} {field} ... ]
  [SLOP {slop}] [INORDER]
  [LANGUAGE {language}]
  [EXPANDER {expander}]
  [SCORER {scorer}]
  [PAYLOAD {payload}]
  [SORTBY {field} [ASC|DESC]]
  [LIMIT offset num]
</pre></code>
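For readers who want to issue the raw command themselves, here is a hedged sketch of assembling a few of these options and unpacking the reply. Both helpers are hypothetical. The reply layout assumed here (the total count, followed by each document id and a flat list of its field/value pairs, or ids only with `NOCONTENT`) is this sketch's understanding of the default reply, and the sample reply is handwritten for illustration rather than captured from a server.

```python
def ft_search_args(index, query, no_content=False, numeric_filter=None,
                   limit=None):
    """Build the argument list for FT.SEARCH using a few of its options.

    numeric_filter is an optional (field, min, max) tuple and limit is an
    optional (offset, num) tuple.
    """
    args = ["FT.SEARCH", index, query]
    if no_content:
        args.append("NOCONTENT")
    if numeric_filter is not None:
        field, low, high = numeric_filter
        args += ["FILTER", field, low, high]
    if limit is not None:
        offset, num = limit
        args += ["LIMIT", offset, num]
    return args


def parse_search_reply(reply, no_content=False):
    """Split a raw reply into (total, [(doc_id, fields_dict), ...])."""
    total, rest = reply[0], reply[1:]
    if no_content:
        return total, [(doc_id, {}) for doc_id in rest]
    docs = []
    for i in range(0, len(rest), 2):
        doc_id, flat = rest[i], rest[i + 1]
        # The flat list alternates field names and values
        docs.append((doc_id, dict(zip(flat[0::2], flat[1::2]))))
    return total, docs


search_args = ft_search_args("ws", "@character:juliet",
                             numeric_filter=("chapter", 1, 5), limit=(0, 10))

# Handwritten sample reply, assumed layout (see above)
sample_reply = [1, "86169", ["character", "JULIET", "play", "Romeo and Juliet"]]
total, docs = parse_search_reply(sample_reply)
print(search_args)
print(total, docs)
```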

The client's [`search()`](https://github.com/RedisLabs/redisearch-py/blob/master/API.md#search) method can be invoked to perform a simple search:

In [None]:
results = client.search('where* ~love @character:romeo|juliet')
print('Total results: {}'.format(results.total))
print('First search result: {}'.format(results.docs[0]))

## Query exercises

1. Search for love in all of Shakespeare's works (5275 results)

In [None]:
query = 'your query here'
results = client.search(query)
print(results.total)

2. Find the lines that Othello's character utters (3762 results)

In [None]:
query = 'your query here'
results = client.search(query)
print(results.total)

3. Make your own query and run it

In [None]:
query = 'your query here'
results = client.search(query)
print(results.total)

## Autocompletion

RediSearch can provide suggestions for autocompletion. This feature is unrelated to document indexing and can be used on its own, alongside document indexing, or not at all.

### Creating a suggestions dictionary

To create a dictionary of suggestions for autocomplete, use the [`FT.SUGADD`](http://redisearch.io/Commands/#ftsuggadd) command:

<code><pre>
FT.SUGADD {key} {string} {score} [INCR]
</pre></code>

`{key}` is the name of the key that will be used for storing the dictionary. Internally, RediSearch uses a highly-compressed implementation of the [Trie data structure](https://en.wikipedia.org/wiki/Trie) to facilitate this type of search. `{string}` is the suggestion that the autocomplete will provide, along with a `{score}` that is a floating-point number used in ranking it. The optional `INCR` flag can be used to increment an existing suggestion's score, rather than replace it.
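The difference between replacing and incrementing a score can be illustrated with a toy model in which a plain `dict` stands in for the suggestions dictionary. The `sugadd()` function is hypothetical and only mimics the `INCR` behavior described above; it says nothing about how the trie stores the data.

```python
def sugadd(dictionary, string, score, incr=False):
    """Add a suggestion, replacing or incrementing its score."""
    if incr and string in dictionary:
        dictionary[string] += score
    else:
        dictionary[string] = score


suggestions = {}
sugadd(suggestions, "wherefore art thou", 1.0)
sugadd(suggestions, "wherefore art thou", 2.0)             # replaces: 2.0
sugadd(suggestions, "wherefore art thou", 1.5, incr=True)  # increments: 3.5
print(suggestions)
```

Incrementing is handy when the same string is added many times and its score should reflect how often it occurs.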

The following code constructs an autocomplete dictionary from Shakespeare's dataset by adding every line as a suggestion:

In [None]:
from redisearch import AutoCompleter, Suggestion

ac = AutoCompleter('wsac', conn=r)
with open('../static/files/will_play_text.csv') as fp:
    f = csv.reader(fp, delimiter=';')
    suggs = []
    for l in f:
        suggs.append(Suggestion(l[5]))
        if int(l[0]) % 100 == 0:
            ac.add_suggestions(*suggs, increment=True)
            suggs = []
    # Flush the suggestions left over after the last full batch
    if suggs:
        ac.add_suggestions(*suggs, increment=True)
print(ac.len())

### Deleting suggestions from the dictionary and deleting the dictionary

To delete a suggestion from the dictionary use [`FT.SUGDEL`](http://redisearch.io/Commands/#ftsugdel), or the AutoCompleter's [`delete()`](https://github.com/RedisLabs/redisearch-py/blob/master/API.md#delete) method.

The autocomplete dictionary is stored under a single key - to delete the entire dictionary simply use Redis' [`DEL`](https://redis.io/commands/del) on it.

### Getting autocomplete suggestions

Querying the autocomplete dictionary for suggestions is done with the [`FT.SUGGET`](http://redisearch.io/Commands/#ftsugget) command:

<code><pre>
FT.SUGGET {key} {prefix} [FUZZY] [MAX num] [WITHSCORES]
</pre></code>

`{key}` is the key name of the autocomplete dictionary, and `{prefix}` is the prefix to provide suggestions for. The `FUZZY` flag, when set, will perform a fuzzy search that includes suggestions at a [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) of 1 from the searched prefix. `MAX` can be specified to limit the number of returned suggestions, and `WITHSCORES` will include the suggestions' scores in the reply.

In [None]:
query = 'search'
reply = ac.get_suggestions(query)
print("Autocomplete results for '{}':".format(query))
pprint.pprint(reply)
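To build intuition for what the `FUZZY` flag means, here is a naive pure-Python sketch of prefix suggestions with an optional edit-distance tolerance of 1. It is an assumption-laden illustration: the `suggest()` function is hypothetical, the dictionary is a plain `dict`, and every entry is scanned, which is not how RediSearch's trie-based implementation works.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def suggest(dictionary, prefix, fuzzy=False, max_results=5):
    """Return up to max_results matching suggestions, highest score first."""
    matches = []
    for string, score in dictionary.items():
        if string.startswith(prefix):
            matches.append((score, string))
        elif fuzzy:
            # Compare the query with the suggestion's prefixes of similar length
            lengths = range(max(len(prefix) - 1, 0), len(prefix) + 2)
            if any(levenshtein(prefix, string[:n]) <= 1 for n in lengths):
                matches.append((score, string))
    matches.sort(key=lambda match: (-match[0], match[1]))
    return [string for _, string in matches[:max_results]]


toy_dictionary = {"romeo": 3.0, "rome": 1.0, "juliet": 2.0}
exact = suggest(toy_dictionary, "rom")
typo = suggest(toy_dictionary, "rpm")
typo_fuzzy = suggest(toy_dictionary, "rpm", fuzzy=True)
print(exact, typo, typo_fuzzy)
```

With the mistyped prefix `rpm`, the exact search finds nothing, while the fuzzy search still reaches the entries that start with `rom`.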