# The definitive guide

## Searching

# Full Text Search

In [1]:
from pprint import pprint as pp
import elasticsearch as es
e = es.Elasticsearch([{ 'host': 'localhost', 'port': 9200 }])
e.ping()

True

In [94]:
ctx_tweet = {'index': 'tweet', 'doc_type': 'doc'}
ctx_user = {'index': 'user', 'doc_type': 'doc'}

In [95]:
# bulk body: each document is an action line followed by its source line
bulk_users = """
{"index": {"_id": 1}}
{"lang": "us", "email": "john@smith.com", "name": "John Smith", "username": "@john"}
{"index": {"_id": 2}}
{"lang": "gb", "email": "mary@jones.com", "name": "Mary Jones", "username": "@mary"}
"""[1:]

bulk_tweets = """
{"index": {"_id": 3}}
{"lang": "gb", "date": "2014-09-13", "name": "Mary Jones", "tweet": "Elasticsearch means full text search has never been so easy", "user_id": 2}
{"index": {"_id": 4}}
{"lang": "us", "date": "2014-09-14", "name": "John Smith", "tweet": "@mary it is not just text, it does everything", "user_id": 1}
{"index": {"_id": 5}}
{"lang": "gb", "date": "2014-09-15", "name": "Mary Jones", "tweet": "However did I manage before Elasticsearch?", "user_id": 2}
{"index": {"_id": 6}}
{"lang": "us", "date": "2014-09-16", "name": "John Smith", "tweet": "The Elasticsearch API is really easy to use", "user_id": 1}
{"index": {"_id": 7}}
{"lang": "gb", "date": "2014-09-17", "name": "Mary Jones", "tweet": "The Query DSL is really powerful and flexible", "user_id": 2}
{"index": {"_id": 8}}
{"lang": "us", "date": "2014-09-18", "name": "John Smith", "user_id": 1}
{"index": {"_id": 9}}
{"lang": "gb", "date": "2014-09-19", "name": "Mary Jones", "tweet": "Geo-location aggregations are really cool", "user_id": 2}
{"index": {"_id": 10}}
{"lang": "us", "date": "2014-09-20", "name": "John Smith", "tweet": "Elasticsearch surely is one of the hottest new NoSQL products", "user_id": 1}
{"index": {"_id": 11}}
{"lang": "gb", "date": "2014-09-21", "name": "Mary Jones", "tweet": "Elasticsearch is built for the cloud, easy to scale", "user_id": 2}
{"index": {"_id": 12}}
{"lang": "us", "date": "2014-09-22", "name": "John Smith", "tweet": "Elasticsearch and I have left the honeymoon stage, and I still love her.", "user_id": 1}
{"index": {"_id": 13}}
{"lang": "gb", "date": "2014-09-23", "name": "Mary Jones", "tweet": "So yes, I am an Elasticsearch fanboy", "user_id": 2}
{"index": {"_id": 14}}
{"lang": "us", "date": "2014-09-24", "name": "John Smith", "tweet": "How many more cheesy tweets do I have to write?", "user_id": 1}
"""[1:]

def populate_user_index():
    print('populating user index')
    res = e.bulk(body=bulk_users, **ctx_user)
    print('errors:', res['errors'], '- indexed:', len(res['items']))

def populate_tweet_index():
    print('populating tweet index')
    res = e.bulk(body=bulk_tweets, **ctx_tweet)
    print('errors:', res['errors'], '- indexed:', len(res['items']))

In [96]:
populate_user_index()
populate_tweet_index()

populating user index
errors: True - indexed: 2
populating tweet index
errors: False - indexed: 12


### Lite Search

* https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-query-string-query.html#query-string-syntax

In [97]:
for hit in e.search(q='tweet:elasticsearch')['hits']['hits']:
    print('{} ({}): {}'.format(hit['_id'], hit['_score'], hit['_source']['tweet']))

6 (0.6931472): The Elasticsearch API is really easy to use
13 (0.6682933): So yes, I am an Elasticsearch fanboy
5 (0.37881336): However did I manage before Elasticsearch?
10 (0.35667494): Elasticsearch surely is one of the hottest new NoSQL products
12 (0.3034693): Elasticsearch and I have left the honeymoon stage, and I still love her.
11 (0.21110918): Elasticsearch is built for the cloud, easy to scale
3 (0.16044298): Elasticsearch means full text search has never been so easy


In [98]:
pp(e.search(q='+name:john +tweet:mary')['hits'])

{'hits': [{'_id': '4',
           '_index': 'tweet',
           '_score': 0.87546873,
           '_source': {'date': '2014-09-14',
                       'lang': 'us',
                       'name': 'John Smith',
                       'tweet': '@mary it is not just text, it does everything',
                       'user_id': 1},
           '_type': 'doc'}],
 'max_score': 0.87546873,
 'total': 1}


In [99]:
# name contains 'mary' or 'john'
# date is greater than 2014-08-10
# _all contains aggregations or geo

pp(e.search(q='+name:(mary john) +date:>2014-08-10 +(aggregations geo)'))

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 33, 'total': 33},
 'hits': {'hits': [{'_id': '9',
                    '_index': 'tweet',
                    '_score': 4.6021132,
                    '_source': {'date': '2014-09-19',
                                'lang': 'gb',
                                'name': 'Mary Jones',
                                'tweet': 'Geo-location aggregations are really '
                                         'cool',
                                'user_id': 2},
                    '_type': 'doc'}],
          'max_score': 4.6021132,
          'total': 1},
 'timed_out': False,
 'took': 40}


### Analyzer

An analyzer chains three kinds of functions:

* **Character Filters:** preprocess the raw text, e.g. strip markup or replace symbols
* **Tokenizer:** splits the text into individual terms
* **Token Filters:** transform the term stream, e.g. lowercasing, stemming, synonyms

Given `Set the shape to semi-transparent by calling set_trans(5)`:

* **Standard Analyzer:** `[set, the, shape, to, semi, transparent, by, calling, set_trans, 5]`
* **Simple Analyzer:** `[set, the, shape, to, semi, transparent, by, calling, set, trans]`
* **Whitespace Analyzer:** `[Set, the, shape, to, semi-transparent, by, calling, set_trans(5)]`
* **Language Analyzer:** `[set, shape, semi, transpar, call, set_tran, 5]`
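
The three stages can be sketched in plain Python. This is only an illustration of the pipeline idea, not how Elasticsearch implements its analyzers; the function names, the regular expressions and the tiny `STOPWORDS` set are assumptions made up for this example.

```python
import re

# Hypothetical stopword list, far shorter than a real one.
STOPWORDS = {'the', 'to', 'by'}

def char_filter(text):
    # Character filter: strip HTML-like markup and spell out '&'.
    text = re.sub(r'<[^>]+>', ' ', text)
    return text.replace('&', ' and ')

def tokenizer(text):
    # Tokenizer: split on every run of characters that are not
    # letters, digits, or underscores; drop empty strings.
    return [t for t in re.split(r'[^\w]+', text) if t]

def token_filters(tokens):
    # Token filters: lowercase every term, then remove stopwords.
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in STOPWORDS]

def analyze(text):
    # The analyzer is the composition of the three stages, in order.
    return token_filters(tokenizer(char_filter(text)))

tokens = analyze('<b>Set</b> the shape to semi-transparent by calling set_trans(5)')
print(tokens)
```

The result resembles the standard analyzer output above minus the stopwords; stemming (as in the language analyzer) is deliberately left out of the sketch.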

In [100]:
pp(e.indices.analyze(index='tweet', body={'analyzer': 'standard', 'text': 'Some text, hi Mom!'}))

{'tokens': [{'end_offset': 4,
             'position': 0,
             'start_offset': 0,
             'token': 'some',
             'type': '<ALPHANUM>'},
            {'end_offset': 9,
             'position': 1,
             'start_offset': 5,
             'token': 'text',
             'type': '<ALPHANUM>'},
            {'end_offset': 13,
             'position': 2,
             'start_offset': 11,
             'token': 'hi',
             'type': '<ALPHANUM>'},
            {'end_offset': 17,
             'position': 3,
             'start_offset': 14,
             'token': 'mom',
             'type': '<ALPHANUM>'}]}


### Mapping

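Commonly used field types: `boolean`, `long`, `double`, `date`, `text` and `keyword` (which replaced `string` in Elasticsearch 5), and geo types such as `geo_point`.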

In [101]:
pp(e.indices.get('tweet')['tweet']['mappings'])

{'doc': {'properties': {'date': {'type': 'date'},
                        'lang': {'fields': {'keyword': {'ignore_above': 256,
                                                        'type': 'keyword'}},
                                 'type': 'text'},
                        'name': {'type': 'text'},
                        'tag': {'type': 'keyword'},
                        'tweet': {'analyzer': 'english', 'type': 'text'},
                        'user_id': {'type': 'long'}}}}


Recreate the mapping of the tweet index manually

In [102]:
if e.indices.exists(index='tweet'):
    res = e.indices.delete(index='tweet')
    print('delete index:', res)

tweet_mapping = {
    'mappings': {
        'doc': {
            'properties': {
                'tweet': {'type': 'text', 'analyzer': 'english'},
                'date': {'type': 'date'},
                'name': {'type': 'text'},
                'user_id': {'type': 'long'},
            }
        }
    }
}

res = e.indices.create(index='tweet', body=tweet_mapping)
print('re-create index:', res)

delete index: {'acknowledged': True}
re-create index: {'acknowledged': True, 'shards_acknowledged': True, 'index': 'tweet'}


* https://www.elastic.co/blog/strings-are-dead-long-live-strings

Set the `tag` field to type `keyword`, so that its values are indexed as-is instead of being analyzed:

In [103]:
e.indices.put_mapping(body={
   'properties': {
       'tag': {
           'type': 'keyword',
           'index': True,
       }
   }
}, index='tweet', doc_type='doc')

{'acknowledged': True}

In [104]:
populate_tweet_index()

populating tweet index
errors: False - indexed: 12


### Note the difference between the `text` and `keyword` fields

In [61]:
e.indices.analyze(body={'field': 'tweet', 'text': 'Black-cats'}, index='tweet')

{'tokens': [{'token': 'black',
   'start_offset': 0,
   'end_offset': 5,
   'type': '<ALPHANUM>',
   'position': 0},
  {'token': 'cat',
   'start_offset': 6,
   'end_offset': 10,
   'type': '<ALPHANUM>',
   'position': 1}]}

In [62]:
e.indices.analyze(body={'field': 'tag', 'text': 'Black-cats'}, index='tweet')

{'tokens': [{'token': 'Black-cats',
   'start_offset': 0,
   'end_offset': 10,
   'type': 'word',
   'position': 0}]}

# Full Body Search

### Empty Search

An empty search matches every document in every index. Because that result set is very large, `from` and `size` are used to fetch a single hit:

In [105]:
e.search(body={'from': 30, 'size': 1})

{'took': 4,
 'timed_out': False,
 '_shards': {'total': 33, 'successful': 33, 'skipped': 0, 'failed': 0},
 'hits': {'total': 142692,
  'max_score': 1.0,
  'hits': [{'_index': '.monitoring-es-6-2018.05.16',
    '_type': 'doc',
    '_id': '4e93YLRBRcyShOQN7FSiEA:_na:blog:4:r',
    '_score': 1.0,
    '_source': {'cluster_uuid': 'O37AcGfuSMe1_i68NGyEPg',
     'timestamp': '2018-05-16T06:37:31.673Z',
     'interval_ms': 10000,
     'type': 'shards',
     'source_node': None,
     'state_uuid': '4e93YLRBRcyShOQN7FSiEA',
     'shard': {'state': 'UNASSIGNED',
      'primary': False,
      'node': None,
      'relocating_node': None,
      'shard': 4,
      'index': 'blog'}}}]}}

### Query DSL

In [124]:
# 'match_all' matches every document; 'match' runs a full-text query on one field

res = e.search(body={'query': {'match_all': {}}})
print('returned results:', res['hits']['total'])

hits = res['hits']['hits']
print('pagination standard values: from=0, size=10:', len(hits))

print('\nsearching for tweets containing "elasticsearch":')
res = e.search(body={'query': {'match': {'tweet': 'elasticsearch'}}})
for i, hit in enumerate(res['hits']['hits']):
    print('  ', i, hit['_score'], hit['_source']['tweet'])

returned results: 144785
pagination standard values: from=0, size=10: 10

searching for tweets containing "elasticsearch":
   0 0.6931472 The Elasticsearch API is really easy to use
   1 0.6682933 So yes, I am an Elasticsearch fanboy
   2 0.37881336 However did I manage before Elasticsearch?
   3 0.35667494 Elasticsearch surely is one of the hottest new NoSQL products
   4 0.3034693 Elasticsearch and I have left the honeymoon stage, and I still love her.
   5 0.21110918 Elasticsearch is built for the cloud, easy to scale
   6 0.16044298 Elasticsearch means full text search has never been so easy


### Combining Multiple Clauses

In [127]:
# combines all four clause types of a bool query; it yields no results
# because the filter targets the field 'age', which no tweet has

q = {
    'query': {
        'bool': {
            'must': {'match': {'tweet': 'elasticsearch'}},
            'must_not': {'match': {'name': 'mary'}},
            'should': {'match': {'tweet': 'full text'}},
            'filter': {'range': {'age': {'gt': 30}}}
        }
    }
}

res = e.search(body=q, index='tweet')
print(res)

{'took': 0, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 0, 'max_score': None, 'hits': []}}


### Queries and Filters

https://www.elastic.co/guide/en/elasticsearch/guide/master/_queries_and_filters.html

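* Filters answer yes/no questions (does the document match?) and do not influence the score
* Queries additionally compute a relevance score (how well does the document match?)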
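
A minimal sketch of how the two contexts sit side by side in one request body. The helper `bool_query` and its parameter names are hypothetical, introduced only for this illustration; the `bool`, `must` and `filter` keys are the ones already used in the section above.

```python
def bool_query(scored=None, filters=None):
    """Build a search body with a bool query.

    Clauses in `scored` go into 'must' and contribute to _score;
    clauses in `filters` go into 'filter' and only include or exclude.
    """
    bool_body = {}
    if scored:
        bool_body['must'] = list(scored)
    if filters:
        bool_body['filter'] = list(filters)
    return {'query': {'bool': bool_body}}

q = bool_query(
    scored=[{'match': {'tweet': 'elasticsearch'}}],
    filters=[
        {'term': {'user_id': 1}},
        {'range': {'date': {'gte': '2014-09-16'}}},
    ],
)
print(q)
```

With the client from the first cell, the body could be sent as `e.search(body=q, index='tweet')`; only the `match` clause would then influence the ordering of the hits.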

In [132]:
# check validity of a query
# (match and tweet are in wrong order)
e.indices.validate_query(explain=True, body={
    'query': {'tweet': {'match': 'really powerful'}}}, index='tweet')

{'valid': False,
 'error': 'org.elasticsearch.common.ParsingException: no [query] registered for [tweet]'}

In [133]:
# explanations are nice for valid queries, too
e.indices.validate_query(explain=True, body={
    'query': {'match': {'tweet': 'really powerful'}}}, index='tweet')

{'valid': True,
 '_shards': {'total': 1, 'successful': 1, 'failed': 0},
 'explanations': [{'index': 'tweet',
   'valid': True,
   'explanation': 'tweet:realli tweet:power'}]}

## Sorting

In [142]:
# sort the tweets of user 1 by recency (newest first)
hits = e.search(body={
    'query': {'bool': {'filter': {'term': {'user_id': 1}}}},
    'sort': {'date': {'order': 'desc'}}
}, index='tweet')['hits']['hits']

for i, hit in enumerate(hits):
    tweet = hit['_source']
    print(i, tweet['user_id'], hit['sort'], tweet['date'])

0 1 [1411516800000] 2014-09-24
1 1 [1411344000000] 2014-09-22
2 1 [1411171200000] 2014-09-20
3 1 [1410998400000] 2014-09-18
4 1 [1410825600000] 2014-09-16
5 1 [1410652800000] 2014-09-14


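Transform the `tweet` field into a multi-field mapping: the analyzed `text` field stays searchable, while the un-analyzed `tweet.raw` keyword sub-field can be used for sorting.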

In [149]:
e.indices.put_mapping(body={
    'properties': {
        'tweet': {
            'type': 'text',
            'analyzer': 'english',
            'fields': {
                'raw': {
                    'type': 'keyword',
                }
            }
        }
    }
}, **ctx_tweet)

{'acknowledged': True}

In [150]:
populate_tweet_index()

populating tweet index
errors: False - indexed: 12


In [152]:
hits = e.search(body={
    'query': {'match': {'tweet': 'elasticsearch'}},
    'sort': 'tweet.raw',
}, **ctx_tweet)['hits']['hits']

for i, hit in enumerate(hits):
    print(i, hit['_source']['tweet'])

0 Elasticsearch and I have left the honeymoon stage, and I still love her.
1 Elasticsearch is built for the cloud, easy to scale
2 Elasticsearch means full text search has never been so easy
3 Elasticsearch surely is one of the hottest new NoSQL products
4 However did I manage before Elasticsearch?
5 So yes, I am an Elasticsearch fanboy
6 The Elasticsearch API is really easy to use


## Relevance

In [157]:
hits = e.search(body={
    'query': {'match': {'tweet': 'honeymoon'}}}, explain=True, **ctx_tweet)['hits']['hits']

for hit in hits:
    print(hit['_source']['tweet'], '\n')
    pp(hit['_explanation'])

Elasticsearch and I have left the honeymoon stage, and I still love her. 

{'description': 'weight(tweet:honeymoon in 4) [PerFieldSimilarity], result of:',
 'details': [{'description': 'score(doc=4,freq=1.0 = termFreq=1.0\n'
                             '), product of:',
              'details': [{'description': 'idf, computed as log(1 + (docCount '
                                          '- docFreq + 0.5) / (docFreq + 0.5)) '
                                          'from:',
                           'details': [{'description': 'docFreq',
                                        'details': [],
                                        'value': 1.0},
                                       {'description': 'docCount',
                                        'details': [],
                                        'value': 4.0}],
                           'value': 1.2039728},
                          {'description': 'tfNorm, computed as (freq * (k1 + '
                                   

Calculation of the relevance score $ \text{score}(\text{doc}=4, \text{freq}=1.0) $:

### Inverse Document Frequency

* $ d_f $ **docFreq** (the number of documents that contain the term)
* $ d_c $ **docCount** (the total number of documents that have a value for the field)

$$ \text{idf}(d_c, d_f) := \log\Bigg(1 + \frac{d_c - d_f + 0.5}{d_f + 0.5}\Bigg) $$

with $ d_f = 1, d_c = 4 $:

$$ 1.20397\dots = \log\Bigg(1 + \frac{4 - 1 + 0.5}{1 + 0.5}\Bigg) $$
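
The formula can be checked with a few lines of Python. This is a direct transcription of the definition above; the function and argument names are chosen for this sketch and are not part of any Elasticsearch API.

```python
import math

def idf(doc_count, doc_freq):
    # Inverse document frequency as shown in the explain output:
    # log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

idf_value = idf(doc_count=4, doc_freq=1)
print(round(idf_value, 7))
```

A term that appears in fewer documents receives a higher value, e.g. `idf(4, 1)` is larger than `idf(4, 3)`.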

### Term Frequency Norm

* $ t_f $ - **termFreq** (how many times the term occurs in the field of this document)
* $ k_1 $ - **k1** (term frequency saturation parameter)
* $ b $ - **b** (field length normalization parameter)
* $ \bar{l} $ - **avgFieldLength** (average length of the field across documents)
* $ l $ - **fieldLength** (length of the field in this document)

$$ \text{tfn}(t_f, k_1, b, l, \bar{l}) := \frac{t_f \cdot (k_1 + 1)}{t_f + k_1 \cdot (1 - b + x)}\;, \quad x = b \cdot \frac{l}{\bar{l}} $$

with $ t_f = 1, k_1 = 1.2, b = 0.75, \bar{l} = 7, l = 10 $:

$$ 0.85082\dots = \frac{1 \cdot (1.2 + 1)}{1 + 1.2 \cdot (1 - 0.75 + 0.75 \cdot \frac{10}{7})} $$

### TF-IDF

$$ 1.02437\dots = \text{idf} \cdot \text{tfn} = 1.2039728 \cdot 0.8508287 $$
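
The term frequency norm and the final product can be reproduced the same way. The sketch below assumes the values from the explain output (k1 = 1.2, b = 0.75, avgFieldLength = 7, fieldLength = 10) and takes the idf value 1.2039728 as given; the function name `tf_norm` is made up for this example.

```python
def tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    # (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    length_ratio = field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))

tfn = tf_norm(freq=1, field_length=10, avg_field_length=7)

# The score of the single-term query is the product of idf and tfNorm.
score = 1.2039728 * tfn

print(round(tfn, 7), round(score, 5))
```

Longer-than-average fields are penalized: with the same term frequency, `tf_norm(1, 5, 7)` is larger than `tf_norm(1, 10, 7)`.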

## Analysis

Example: at index time, expand ***quick*** with its synonyms (***speedy, rapid, fast***). At search time, any one of these terms then suffices to find the document. Given a document d with id=4 that contains ***speedy***, the inverted index holds an entry for d under each synonym:

    INVERTED INDEX:
    ...
    [quick] = [..., 4]
    [speedy] = [..., 4]
    [rapid] = [..., 4]
    [fast] = [..., 4]
    ...
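
A toy version of this index-time expansion, as a rough sketch only. The synonym group, the whitespace tokenization and all function names are assumptions for illustration; Elasticsearch handles this through a synonym token filter in the analyzer rather than code like this.

```python
from collections import defaultdict

# Hypothetical synonym group: every member expands to the whole group.
SYNONYM_GROUP = {'quick', 'speedy', 'rapid', 'fast'}

def expand(token):
    # Index-time expansion: a synonym is replaced by all group members.
    return SYNONYM_GROUP if token in SYNONYM_GROUP else {token}

def index_document(index, doc_id, text):
    # Naive analysis: lowercase and split on whitespace.
    for token in text.lower().split():
        for term in expand(token):
            index[term].add(doc_id)

def search(index, term):
    # Search time: a plain lookup, no expansion needed.
    return sorted(index.get(term.lower(), set()))

index = defaultdict(set)
index_document(index, 4, 'a speedy reply')
index_document(index, 7, 'a slow reply')

print(search(index, 'quick'))
print(search(index, 'reply'))
```

The trade-off of expanding at index time is a larger index in exchange for cheap lookups at search time.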

## Note: Relevance Calculation Across Shards

The inverse document frequency is not calculated globally over all shards but locally per shard. This matters when an index holds little data, because the same term can then receive noticeably different scores on different shards. The difference between local and global IDF diminishes as the data volume grows.

* https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-is-broken.html
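
The effect can be illustrated with the idf formula from the relevance section. The shard contents below are invented numbers for the sake of the example, not measurements from the `tweet` index.

```python
import math

def idf(doc_count, doc_freq):
    # Same formula as in the explain output.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical shards: (documents on the shard, documents containing the term).
shards = [(4, 1), (3, 3)]

# Each shard scores with its own local statistics ...
local_idfs = [idf(count, freq) for count, freq in shards]

# ... whereas a global calculation would pool the counts first.
total_count = sum(count for count, _ in shards)
total_freq = sum(freq for _, freq in shards)
global_idf = idf(total_count, total_freq)

for (count, freq), value in zip(shards, local_idfs):
    print('shard with {} docs, {} matching: idf={:.4f}'.format(count, freq, value))
print('global idf={:.4f}'.format(global_idf))
```

On such a small data set the term looks rare on one shard and common on the other. For experiments on tiny indices, the search API accepts `search_type='dfs_query_then_fetch'` (for example `e.search(body=..., index='tweet', search_type='dfs_query_then_fetch')`), which collects term statistics from all shards before scoring.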