## Elasticsearch: The Definitive Guide - Python

Following the examples in the book, here are Python snippets that achieve the same effect.

Documentation for the Python libs:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic"):

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

Github repo for DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [1]:
import sys, os
sys.path.insert(1, os.path.join(sys.path[0], '..'))

In [2]:
import index
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

# Repopulate the index, as we deleted 'gb' in earlier chapters
# (see also the notebook populate.ipynb):
r = index.populate()
print('{} items created'.format(len(r['items'])))

14 items created


### Finding Exact Values

These are **non-scoring** queries, suitable for matching exact values such as numbers or tags.

In [3]:
idx = index.load_sid_examples()

In [4]:
s = Search(using=es)
q = Q('term', price=20)
s = s.query(q)
response = s.execute()

In [5]:
response[0].meta.score

1.0

In [6]:
s = Search(using=es)
q = Q('term', productID='XHDK-A-1293-#fJ3')
s = s.query(q)
s.execute()

<Response: []>

Hmmm. No results. Let's analyze:

In [7]:
body = {
  "field": "productID",
  "text": "XHDK-A-1293-#fJ3"
}
es.indices.analyze(index='my_store', body=body)

{'tokens': [{'end_offset': 4,
   'position': 0,
   'start_offset': 0,
   'token': 'xhdk',
   'type': '<ALPHANUM>'},
  {'end_offset': 6,
   'position': 1,
   'start_offset': 5,
   'token': 'a',
   'type': '<ALPHANUM>'},
  {'end_offset': 11,
   'position': 2,
   'start_offset': 7,
   'token': '1293',
   'type': '<NUM>'},
  {'end_offset': 16,
   'position': 3,
   'start_offset': 13,
   'token': 'fj3',
   'type': '<ALPHANUM>'}]}

The value was indexed as separate tokens instead of one. Let's fix this:

In [8]:
body = {
    "mappings" : {
        "products" : {
            "properties" : {
                "productID" : {
                    "type" : "string",
                    "index" : "not_analyzed" 
                }
            }
        }
    }

}
i = index.load_sid_examples(settings=body)

In [9]:
s = Search(using=es)
q = Q('term', productID='XHDK-A-1293-#fJ3')
s = s.query(q)
s.execute()

<Response: [<Hit(my_store/produces/1): {'productID': 'XHDK-A-1293-#fJ3', 'price': 10}>]>
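
To see why the mapping change matters, here is a minimal sketch of an inverted index in plain Python. It is not how Elasticsearch is implemented: `standard_like_tokens` is a hypothetical stand-in that only approximates the default analyzer (lowercase, split on non-alphanumerics), and `build_index` / `term_lookup` are illustrative helpers.

```python
import re

def standard_like_tokens(text):
    # Rough approximation of an analyzed string field (assumption, not the real analyzer).
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs, analyzed):
    """Map each indexed term to the set of document IDs that contain it."""
    inverted = {}
    for doc_id, value in docs.items():
        terms = standard_like_tokens(value) if analyzed else [value]
        for term in terms:
            inverted.setdefault(term, set()).add(doc_id)
    return inverted

def term_lookup(inverted, term):
    # A term query looks up the exact term; the search text itself is not analyzed.
    return sorted(inverted.get(term, set()))

docs = {1: "XHDK-A-1293-#fJ3", 2: "KDKE-B-9947-#kL5"}
analyzed_index = build_index(docs, analyzed=True)
not_analyzed_index = build_index(docs, analyzed=False)

print(term_lookup(analyzed_index, "XHDK-A-1293-#fJ3"))      # no such term in the index
print(term_lookup(analyzed_index, "xhdk"))                  # one of the split tokens
print(term_lookup(not_analyzed_index, "XHDK-A-1293-#fJ3"))  # the whole value is one term
```

With the analyzed index the full product ID is never stored as a term, so an exact lookup finds nothing; with the `not_analyzed` index the whole value is a single term.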

The queries above are simple, but what if we want something more complex?
We can combine several conditions with a `bool` query.

In [10]:
# First - let's go manual with the entire JSON body
body = {
   "query" : {
      "constant_score" : { 
         "filter" : {
            "bool" : {
              "should" : [
                 { "term" : {"price" : 20}}, 
                 { "term" : {"productID" : "XHDK-A-1293-#fJ3"}} 
              ],
              "must_not" : {
                 "term" : {"price" : 30} 
              }
           }
         }
      }
   }
}
es.search(index='my_store', body=body)

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '2',
    '_index': 'my_store',
    '_score': 1.0,
    '_source': {'price': 20, 'productID': 'KDKE-B-9947-#kL5'},
    '_type': 'produces'},
   {'_id': '1',
    '_index': 'my_store',
    '_score': 1.0,
    '_source': {'price': 10, 'productID': 'XHDK-A-1293-#fJ3'},
    '_type': 'produces'}],
  'max_score': 1.0,
  'total': 2},
 'timed_out': False,
 'took': 1}

In [11]:
# Or the more "Pythonic way" using the DSL
bool_query = Q('bool',
               should=[Q('term', price=20), Q('term', productID='XHDK-A-1293-#fJ3')],
               must_not=[Q('term', price=30)]
               )
s = Search(using=es)
s = s.query(bool_query)
res = s.execute()

In [12]:
for hit in res:
    print(hit.meta.score)

1.0
0.2876821


Hmmm. These are not constant scores of 1, so this query is not running as a filter. The `bool` query is applied in query context, so the hits are still scored.

In [13]:
# Or the more "Pythonic way" using the DSL **AND** a filter
bool_filter = Q('bool',
               should=[Q('term', price=20), Q('term', productID='XHDK-A-1293-#fJ3')],
               must_not=[Q('term', price=30)]
               )
s = Search(using=es)
s = s.filter(bool_filter)
res = s.execute()

In [14]:
for hit in res:
    print(hit.meta.score)

0.0
0.0


So far, so good - we are not scoring. But what if we want a constant score of 1?

In [15]:
# Or the more "Pythonic way" using the DSL **AND** a filter **AND** constant score
bool_filter = Q('bool',
               should=[Q('term', price=20), Q('term', productID='XHDK-A-1293-#fJ3')],
               must_not=[Q('term', price=30)]
               )
s = Search(using=es)
s = s.query('constant_score', filter=bool_filter)
res = s.execute()

In [16]:
for hit in res:
    print(hit.meta.score)

1.0
1.0


#### Nesting Boolean Queries

We can go further:

In [17]:
# Nested boolean
inner_bool = Q('bool', 
               must=[Q('term', price=30), Q('term', productID='JODL-X-1937-#pV7')])
outer_bool = Q('bool', 
               should=[Q('term', productID='KDKE-B-9947-#kL5'), inner_bool])
s = Search(using=es)
s = s.query('constant_score', filter=outer_bool)
res = s.execute()

In [18]:
res

<Response: [<Hit(my_store/produces/2): {'productID': 'KDKE-B-9947-#kL5', 'price': 20}>, <Hit(my_store/produces/3): {'productID': 'JODL-X-1937-#pV7', 'price': 30}>]>
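
As a sanity check on the boolean logic, the clauses can be evaluated in plain Python against the four sample products. This is only a sketch: `matches` is a hypothetical helper that supports `term` and `bool` and nothing else, and it assumes `productID` is matched as a whole string (as with the `not_analyzed` mapping).

```python
def matches(query, doc):
    """Evaluate a tiny subset of the query DSL (term, bool) against one document."""
    (kind, body), = query.items()
    if kind == "term":
        (field, value), = body.items()
        return doc.get(field) == value
    if kind == "bool":
        must = body.get("must", [])
        should = body.get("should", [])
        must_not = body.get("must_not", [])
        if any(matches(q, doc) for q in must_not):
            return False
        if not all(matches(q, doc) for q in must):
            return False
        # Assumption: with no must clauses, at least one should clause has to match.
        if should and not must:
            return any(matches(q, doc) for q in should)
        return True
    raise ValueError("unsupported query type: " + kind)

products = {
    1: {"price": 10, "productID": "XHDK-A-1293-#fJ3"},
    2: {"price": 20, "productID": "KDKE-B-9947-#kL5"},
    3: {"price": 30, "productID": "JODL-X-1937-#pV7"},
    4: {"price": 30, "productID": "QQPX-R-3956-#aD8"},
}

flat = {"bool": {
    "should": [{"term": {"price": 20}}, {"term": {"productID": "XHDK-A-1293-#fJ3"}}],
    "must_not": [{"term": {"price": 30}}],
}}
nested = {"bool": {"should": [
    {"term": {"productID": "KDKE-B-9947-#kL5"}},
    {"bool": {"must": [{"term": {"price": 30}},
                       {"term": {"productID": "JODL-X-1937-#pV7"}}]}},
]}}

flat_ids = [i for i, doc in sorted(products.items()) if matches(flat, doc)]
nested_ids = [i for i, doc in sorted(products.items()) if matches(nested, doc)]
print(flat_ids, nested_ids)
```

The flat filter selects products 1 and 2, while the nested one selects product 2 (by ID) and product 3 (price 30 and the matching ID).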

#### Finding Multiple Exact Values

This can be done using a **terms** (note the 's') query:

In [19]:
s = Search(using=es)
s = s.query('constant_score', filter=Q('terms', price=[20,30]))
res = s.execute()

In [20]:
res.hits.total

3

In [21]:
res.hits

[<Hit(my_store/produces/2): {'productID': 'KDKE-B-9947-#kL5', 'price': 20}>, <Hit(my_store/produces/4): {'productID': 'QQPX-R-3956-#aD8', 'price': 30}>, <Hit(my_store/produces/3): {'productID': 'JODL-X-1937-#pV7', 'price': 30}>]

#### Contains, but Does Not Equal

It is important to understand that `term` and `terms` are *contains* operations, not *equals*. What does that mean?

If you have a term query for `{ "term" : { "tags" : "search" } }`, it will match both of the following documents:

```
{ "tags" : ["search"] }
{ "tags" : ["search", "open_source"] }
```

When a `term` query is executed for the token `search`, it goes straight to the corresponding entry in the inverted index and extracts the associated doc IDs. Both document 1 and document 2 contain the token in the inverted index, so both are returned.
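
The lookup described above can be sketched with a toy inverted index in plain Python. The dictionary layout is an assumption made for illustration and is not how Elasticsearch stores its index.

```python
from collections import defaultdict

docs = {
    1: {"tags": ["search"]},
    2: {"tags": ["search", "open_source"]},
}

# Build term -> set of doc IDs, one entry per distinct tag.
inverted = defaultdict(set)
for doc_id, doc in docs.items():
    for tag in doc["tags"]:
        inverted[tag].add(doc_id)

search_hits = sorted(inverted["search"])
open_source_hits = sorted(inverted["open_source"])
print(search_hits, open_source_hits)
```

The entry for `search` lists both documents, because the lookup only asks whether a document contains the term, not whether it is the document's only term.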

#### Equals Exactly

If exact matching is required, then one solution is to index the number of tags:

```
{ "tags" : ["search"], "tag_count" : 1 }
{ "tags" : ["search", "open_source"], "tag_count" : 2 }
```

And then search on both indexed fields:

```
GET /my_index/my_type/_search
{
    "query": {
        "constant_score" : {
            "filter" : {
                 "bool" : {
                    "must" : [
                        { "term" : { "tags" : "search" } },
                        { "term" : { "tag_count" : 1 } }
                    ]
                }
            }
        }
    }
}
```
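
The same idea in plain Python, as a sketch only: `with_tag_count` is a hypothetical helper that adds the count at "index time", and the list comprehension plays the role of the two `must` clauses.

```python
def with_tag_count(doc):
    # Store the number of tags alongside the tags themselves.
    return {**doc, "tag_count": len(doc["tags"])}

docs = {
    1: {"tags": ["search"]},
    2: {"tags": ["search", "open_source"]},
}
indexed = {doc_id: with_tag_count(doc) for doc_id, doc in docs.items()}

# Both conditions must hold: contains "search" AND has exactly one tag.
exact_ids = [
    doc_id
    for doc_id, doc in sorted(indexed.items())
    if "search" in doc["tags"] and doc["tag_count"] == 1
]
print(exact_ids)
```

Only document 1 survives, because document 2 contains `search` but has a `tag_count` of 2.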

#### Ranges

The range query supports both inclusive and exclusive ranges, through combinations of the following options:

* `gt`: `>` greater than
* `lt`: `<` less than
* `gte`: `>=` greater than or equal to
* `lte`: `<=` less than or equal to

In [22]:
s = Search(using=es)
s = s.query('constant_score', filter=Q('range', price={ "gte": 20, "lt": 40 }))
res = s.execute()

In [23]:
res.hits

[<Hit(my_store/produces/2): {'productID': 'KDKE-B-9947-#kL5', 'price': 20}>, <Hit(my_store/produces/4): {'productID': 'QQPX-R-3956-#aD8', 'price': 30}>, <Hit(my_store/produces/3): {'productID': 'JODL-X-1937-#pV7', 'price': 30}>]
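
The four options map directly onto comparison operators. A minimal sketch in plain Python, where `in_range` is a hypothetical helper and the prices are those of the four sample products:

```python
import operator

# Range option name -> comparison function.
OPS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}

def in_range(value, bounds):
    """True if value satisfies every bound, e.g. {"gte": 20, "lt": 40}."""
    return all(OPS[name](value, limit) for name, limit in bounds.items())

prices = {1: 10, 2: 20, 3: 30, 4: 30}
range_ids = sorted(i for i, price in prices.items() if in_range(price, {"gte": 20, "lt": 40}))
print(range_ids)
```

Product 1 (price 10) falls below the inclusive lower bound, so only products 2, 3 and 4 match.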

#### Range on Dates

The `range` query can also be used on date fields:

```
"range" : {
    "timestamp" : {
        "gt" : "2014-01-01 00:00:00",
        "lt" : "2014-01-07 00:00:00"
    }
}
```

When used on date fields, the `range` query supports *date math* operations. For example, if we want to find all documents that have a timestamp sometime in the last hour:

```
"range" : {
    "timestamp" : {
        "gt" : "now-1h"
    }
}
```

This filter will now constantly find all documents with a timestamp greater than the current time minus 1 hour, making the filter a sliding window across your documents.

Date math can also be applied to actual dates, rather than a placeholder like `now`. Just add a double pipe (`||`) after the date and follow it with a date math expression:

```
"range" : {
    "timestamp" : {
        "gt" : "2014-01-01 00:00:00",
        "lt" : "2014-01-01 00:00:00||+1M"
    }
}
```

The `"lt"` means: Less than January 1, 2014 plus one month.
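
To make the arithmetic concrete, here is a rough sketch of how such expressions resolve, written with Python's `datetime`. It is not Elasticsearch's parser: `resolve` and `add_months` are hypothetical helpers, only the units `h`, `d` and `M` are handled, and the date format is assumed to be the one used in the examples above.

```python
import calendar
import re
from datetime import datetime, timedelta

def add_months(moment, months):
    # Shift by whole calendar months, clamping the day to the target month's length.
    index = moment.year * 12 + (moment.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    day = min(moment.day, calendar.monthrange(year, month)[1])
    return moment.replace(year=year, month=month, day=day)

def resolve(expression, now):
    """Resolve 'now-1h' or '<date>||+1M' style expressions (h, d, M units only)."""
    if expression.startswith("now"):
        anchor, math = now, expression[3:]
    else:
        date_text, _, math = expression.partition("||")
        anchor = datetime.strptime(date_text, "%Y-%m-%d %H:%M:%S")
    for sign, amount, unit in re.findall(r"([+-])(\d+)([hdM])", math):
        step = int(sign + amount)
        if unit == "h":
            anchor += timedelta(hours=step)
        elif unit == "d":
            anchor += timedelta(days=step)
        else:
            anchor = add_months(anchor, step)
    return anchor

now = datetime(2014, 1, 1, 12, 0, 0)  # fixed "current time" for a repeatable example
one_hour_ago = resolve("now-1h", now)
plus_one_month = resolve("2014-01-01 00:00:00||+1M", now)
print(one_hour_ago, plus_one_month)
```

With `now` fixed at noon on January 1, 2014, `now-1h` resolves to 11:00 the same day, and the `||+1M` expression resolves to midnight on February 1, 2014.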

#### Range on Strings

The range query can also operate on string fields. String ranges are calculated lexicographically or alphabetically. For example, these values are sorted in lexicographic order:

* 5, 50, 6, B, C, a, ab, abb, abc, b

NOTE: Terms in the inverted index are sorted in lexicographical order, which is why string ranges use this order.
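
Python's built-in string comparison behaves the same way for these ASCII values, so the ordering above can be reproduced with `sorted` (a quick illustration, assuming plain ASCII terms):

```python
values = ["b", "abc", "C", "50", "a", "6", "abb", "B", "5", "ab"]

# Strings compare character by character: digits, then uppercase, then lowercase.
ordered = sorted(values)
print(ordered)
```

Note that `50` sorts before `6`, because the comparison is made character by character rather than numerically.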

#### Be Careful of Cardinality

Numeric and date fields are indexed in such a way that ranges are efficient to calculate. This is not the case for string fields, however. To perform a range on a string field, Elasticsearch is effectively performing a term filter for every term that falls in the range. This is much slower than a date or numeric range.

String ranges are fine on a field with low cardinality (a small number of unique terms). But the more unique terms you have, the slower the string range will be.
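
A toy model of why cardinality matters: if a string range is expanded into every term that falls inside it, the work grows with the number of unique terms. The sketch below uses a sorted Python list as a stand-in for the term dictionary; `terms_in_range` is a hypothetical helper, not an Elasticsearch API.

```python
from bisect import bisect_left, bisect_right

def terms_in_range(sorted_terms, gte, lte):
    """Return every term t with gte <= t <= lte from a sorted term list."""
    lo = bisect_left(sorted_terms, gte)
    hi = bisect_right(sorted_terms, lte)
    return sorted_terms[lo:hi]

low_cardinality = sorted(["a", "ab", "abb", "abc", "b"])
high_cardinality = sorted("ab%04d" % i for i in range(10000))

few = terms_in_range(low_cardinality, "a", "abz")
many = terms_in_range(high_cardinality, "a", "abz")
print(len(few), len(many))
```

The same range covers 4 terms in the small field and 10,000 in the large one, and each of those terms has to be looked up.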

### Dealing with Null Values

In the earlier `term` example, `tags` is a multivalue field. A document may have one tag, many tags, or potentially no tags at all. If a field has no values, it is not present in the inverted index for that document.

Ultimately, this means that `null`, `[]` (an empty array), and `[null]` are all equivalent. They simply don't exist in the inverted index!
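
This equivalence can be sketched in plain Python, where `None` plays the role of JSON `null`. The helper `indexed_terms` is hypothetical and only models which values would produce terms.

```python
def indexed_terms(value):
    """Return the values that would end up as terms; nulls contribute nothing."""
    if value is None:
        return []
    values = value if isinstance(value, list) else [value]
    return [v for v in values if v is not None]

print(indexed_terms(None), indexed_terms([]), indexed_terms([None]))
print(indexed_terms(["search", None]))
```

All three "empty" forms yield no terms, while a list that mixes a real value with a null still yields the real value.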

Let's explore an example:

![screenshot 2017-03-14 11 15 46](https://cloud.githubusercontent.com/assets/28526/23915873/984ba1be-08a7-11e7-950e-22834445f302.png)


In [25]:
# First load the sample data for posts
idx = index.load_sid_examples(set=2)

In [47]:
# No index is specified here, so the search runs across every index
s = Search(using=es)
s = s.query('constant_score', filter=Q('bool',
                                must_not=[Q('exists', field='tags')]))
res = s.execute()

In [48]:
res.hits.total

24

In [51]:
for hit in res:
    print(hit.meta.id)

megacorp
website
my_index
5.2.2
14
5
9
8
10
12


#### Missing Query

Note: the `missing` query was [removed in ES 5.0](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/breaking_50_search_changes.html#_deprecated_queries_removed) after being deprecated in 2.x.

It was removed because an `exists` query inside a `must_not` clause does the same job:

In [64]:
from elasticsearch_dsl import Index
s = Index(name='my_index', using=es).search()
s = s.query('constant_score', filter=Q('bool',
                                       must_not=[Q('exists', field='tags')]))
res = s.execute()

In [66]:
for hit in res:
    print(hit)

<Hit(my_index/posts/4): {'tags': None}>
<Hit(my_index/posts/3): {'other_field': 'some data'}>
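
The filter logic can be mimicked in plain Python to see why these two posts are returned. This is a sketch under assumptions: `has_value` is a hypothetical helper, posts 3 and 4 are taken from the output above, and post 1 is an illustrative document that does have a tag.

```python
def has_value(doc, field):
    """Rough equivalent of an exists check: at least one non-null value."""
    value = doc.get(field)
    values = value if isinstance(value, list) else [value]
    return any(v is not None for v in values)

posts = {
    1: {"tags": ["search"]},           # illustrative post with a tag (assumption)
    3: {"other_field": "some data"},   # no tags field at all
    4: {"tags": None},                 # tags explicitly null
}

# must_not exists: keep only the posts without any value for "tags".
missing_ids = sorted(i for i, doc in posts.items() if not has_value(doc, "tags"))
print(missing_ids)
```

Posts 3 and 4 are kept: one has no `tags` field and the other has it set to null, which amounts to the same thing in the index.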
