## Elasticsearch: The Definitive Guide - Python

Following the examples in the book, here are Python snippets that achieve the same effect.

Documentation for the Python libs:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic"):

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

GitHub repo for DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [1]:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

# Let's repopulate the index as we deleted 'gb' in earlier chapters:
# Run the script: populate.ipynb

Search lite, a query-string search, is useful for ad hoc queries from the command line. To harness the full power of search, however, you should use the request body search API, so called because most parameters are passed in the HTTP request body instead of in the query string.

Request body search—henceforth known as search—not only handles the query itself, but also allows you to return highlighted snippets from your results, aggregate analytics across all results or subsets of results, and return did-you-mean suggestions, which will help guide your users to the best results quickly.

### Empty Search

> GET _search

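With no index specified, this searches every index in the cluster and matches all docs (only the first 10 hits are returned by default). We already covered this when looking at query-string search, so the techniques are similar in Python: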

In [2]:
res = es.search('_all') # same as es.search()

In [3]:
s = Search(using=es)
response = s.execute()

>GET /index_2014*/type1,type2/_search
>{}

In [4]:
# Using index, types from our test data rather than the actual example above:
res = es.search(index='us', doc_type='tweet,user')

In [5]:
s = Search(using=es, index=['us'], doc_type=['user','tweet'])
response = s.execute()

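The from and size parameters work the same way. We use different values from the book's example, because our index does not hold enough docs to start from 30 with a size of 10 (i.e. 40 docs). In the Python client the parameter is spelled `from_`, since `from` is a reserved word: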

In [6]:
# Using index, types from our test data rather than the actual example above:
res = es.search(index='us', doc_type='tweet,user', from_=5, size=5)
res['hits']

{'hits': [{'_id': '6',
   '_index': 'us',
   '_score': 1.0,
   '_source': {'date': '2014-09-16',
    'name': 'John Smith',
    'tweet': 'The Elasticsearch API is really easy to use',
    'user_id': 1},
   '_type': 'tweet'},
  {'_id': '1',
   '_index': 'us',
   '_score': 1.0,
   '_source': {'email': 'john@smith.com',
    'name': 'John Smith',
    'username': '@john'},
   '_type': 'user'}],
 'max_score': 1.0,
 'total': 7}

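There's a more "Pythonic" way in the DSL using slice notation. Note that the slice `[5:5]` asks for zero hits starting at offset 5, which is why the length below is 0; the equivalent of `from_=5, size=5` would be `[5:10]`: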

In [7]:
s = Search(using=es, index=['us'], doc_type=['user','tweet'])[5:5]
response = s.execute()

In [8]:
len(response)

0
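The slice-to-paging mapping can be sketched in plain Python. This is only an illustration of the arithmetic: `slice_to_paging` is a hypothetical helper written for these notes, not part of elasticsearch-dsl.

```python
def slice_to_paging(start, stop):
    """Translate a slice [start:stop] into from/size values (illustrative helper)."""
    if start < 0 or stop < start:
        raise ValueError("expected 0 <= start <= stop")
    # 'from' is the offset of the first hit; 'size' is how many hits to return.
    return {"from": start, "size": stop - start}

print(slice_to_paging(5, 5))   # {'from': 5, 'size': 0}
print(slice_to_paging(5, 10))  # {'from': 5, 'size': 5}
```

To check what a real `Search` object would send, `s.to_dict()` should show the generated body without contacting the cluster.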

Instead of the cryptic query-string approach, a request body search allows us to write queries by using the query domain-specific language, or query DSL.

### "Elasticsearch" (i.e. Lucene) Query DSL

Elasticsearch exposes the Lucene query language through a JSON interface: the query is passed in the 'query' parameter of the request body:

>GET /_search
{
    "query": YOUR_QUERY_HERE
}

Any query like this can be passed to the low-level API as a body document (a Python dictionary) that is supplied as a parameter to the call. The API does little more than masquerade as an HTTP request with slightly more readable function calls:

In [9]:
body = {
    "query": {
        "match_all": {}
    }
}
res = es.search(body=body)

In [10]:
body = {
    "query": {
        "match": {
            "tweet": "elasticsearch"
        }
    }
}
res = es.search(body=body)
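To see roughly what travels over the wire, the body dictionary can be serialized by hand. This sketch uses the standard-library `json` module as a stand-in; the assumption is that the client's own serializer produces equivalent JSON for a simple dictionary like this one.

```python
import json

# The same body as in the match example above.
body = {"query": {"match": {"tweet": "elasticsearch"}}}

# Serialize the dictionary to the JSON text an HTTP request body would carry.
payload = json.dumps(body)
print(payload)  # {"query": {"match": {"tweet": "elasticsearch"}}}

# Round-tripping gives back an equal dictionary, so nothing is lost.
assert json.loads(payload) == body
```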

In [11]:
body = {
    "query": {
        "bool": {
            "must":     { "match": { "tweet": "elasticsearch" }},
            "must_not": { "match": { "name":  "mary" }},
            "should":   { "match": { "tweet": "full text" }},
            "filter":   { "range": { "age" : { "gt" : 30 }} }
        }
    }
}

In [12]:
res = es.search(body=body)

In [13]:
res

{'_shards': {'failed': 0, 'successful': 16, 'total': 16},
 'hits': {'hits': [], 'max_score': None, 'total': 0},
 'timed_out': False,
 'took': 4}

Zero results here because our filter doesn't match. We have no age data in the docs. So let's first add some age data to make this example more interesting, especially before switching to the Pythonic DSL.

In [14]:
doc = { 
    "date" : "2014-09-24", 
    "name" : "Ken Dodd", 
    "tweet" : "Am I a twittiot for tweeting about elasticsearch?", 
    "user_id" : 17,
    "age": 74
}

In [15]:
res = es.create(index='gb', doc_type='tweet', body=doc, id=201)

In [16]:
body = {
    "query": {
        "bool": {
            "must":     { "match": { "tweet": "elasticsearch" }},
            "must_not": { "match": { "name":  "mary" }},
            "should":   { "match": { "tweet": "full text" }},
            "filter":   { "range": { "age" : { "gt" : 30 }} }
        }
    }
}
res = es.search(body=body)

In [17]:
res

{'_shards': {'failed': 0, 'successful': 16, 'total': 16},
 'hits': {'hits': [], 'max_score': None, 'total': 0},
 'timed_out': False,
 'took': 2}

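Still zero results, even though the new doc should satisfy every clause. The most likely explanation (an assumption, not verified here) is that the search ran before the index was refreshed: newly indexed docs only become searchable after a refresh, which by default happens about once per second.

This is where the Python DSL library becomes more attractive. What follows are DSL variants of the searches above: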

In [18]:
s = Search(using=es).query() # same as match-all
response = s.execute()

In [19]:
response['hits']

{'hits': [{'_type': 'config', '_score': 1.0, '_index': '.kib...}

In [20]:
# same, but with chaining 
s = Search(using=es)
s = s.query()
s.execute() == response

False

In [21]:
# same, but with chaining and explicitly invoking match_all
s = Search(using=es)
s = s.query('match_all')
s.execute() == response

True

In [22]:
body = {
    "query": {
        "match": {
            "tweet": "elasticsearch"
        }
    }
}
res = es.search(body=body)
# DSL equivalent
s = Search(using=es)
s = s.query('match', tweet='elasticsearch')
s.execute() == res

False
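The `False` results above do not necessarily mean that the DSL and the raw query returned different hits. A likely cause (an assumption, not checked against this cluster) is that comparing whole responses also compares per-request metadata such as `took`, which changes from call to call. A more robust check compares only the hits. The sketch below uses two hand-written response dictionaries, and `hit_ids` is a hypothetical helper.

```python
def hit_ids(raw_response):
    """Return (_index, _id) pairs from a raw search response dict, in ranked order."""
    return [(h["_index"], h["_id"]) for h in raw_response["hits"]["hits"]]

# Two hand-written responses that differ only in per-request metadata.
first = {"took": 4, "hits": {"total": 1, "hits": [{"_index": "us", "_id": "6", "_score": 1.0}]}}
second = {"took": 2, "hits": {"total": 1, "hits": [{"_index": "us", "_id": "6", "_score": 1.0}]}}

print(first == second)                    # False
print(hit_ids(first) == hit_ids(second))  # True
```

For a DSL response, `response.to_dict()` should give a plain dictionary that can be fed to the same helper.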

The Q shortcut lets us build the more elaborate searches more expressively in Python:

In [23]:
body = {
    "query": {
        "bool": {
            "must":     { "match": { "tweet": "elasticsearch" }},
            "must_not": { "match": { "name":  "mary" }},
            "should":   { "match": { "tweet": "full text" }},
            "filter":   { "range": { "age" : { "gt" : 30 }} }
        }
    }
}
res = es.search(body=body)
# Build the same clauses as Q objects (the range filter is added separately below)
q1 = Q('match', tweet='elasticsearch')
q2 = Q('match', name='mary')
q3 = Q('match', tweet='full text')
q = Q('bool', must=q1, must_not=q2, should=q3)

In [24]:
# DSL equivalent
s = Search(using=es)
s = s.query(q).filter('range', age={"gt": 30})
s.execute() == res

False

In [25]:
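# Another variant, using operators. Note that & binds tighter than |, so this
# is (q1 & ~q2) | q3, which is not the same bool query as the one above: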
q = q1 & ~q2 | q3

In [26]:
s = Search(using=es)
s = s.query(q).filter('range', age={"gt": 30})
s.execute() == res

False

In [27]:
print(q)

Bool(should=[Bool(must=[Match(tweet='elasticsearch')], must_not=[Match(name='mary')]), Match(tweet='full text')])
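The printed form shows that the operator version nests a bool inside a `should`, rather than producing the flat must / must_not / should query from before. That follows from Python's operator precedence: `~` binds tightest, then `&`, then `|`. The grouping can be made visible with a small stand-in class; `Clause` is hypothetical and only records how the operators are applied, it is not the DSL's `Q`.

```python
class Clause:
    """Hypothetical stand-in that records how Python groups &, | and ~."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        return Clause("({} AND {})".format(self.text, other.text))

    def __or__(self, other):
        return Clause("({} OR {})".format(self.text, other.text))

    def __invert__(self):
        return Clause("NOT {}".format(self.text))

q1, q2, q3 = Clause("q1"), Clause("q2"), Clause("q3")
grouped = q1 & ~q2 | q3
print(grouped.text)  # ((q1 AND NOT q2) OR q3)
```

Adding explicit parentheses, or sticking to `Q('bool', ...)`, avoids surprises when mixing `&` and `|`.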


### Filtering vs. Querying

Recap of the chapter: filtering is non-scoring, whereas querying is scoring. Filtering is more efficient because there is no need to calculate a _score for each doc. Therefore, the best use of filtering is to reduce the number of docs that have to be scored.

As a general rule, use query clauses for full-text search or for any condition that should affect the relevance score, and use filters for everything else.
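As a minimal sketch of that rule, the helper below puts relevance-affecting clauses in `must` and yes/no conditions in `filter`. The name `bool_body` and its parameters are made up for illustration; the clauses reuse the tweet and age examples from earlier.

```python
def bool_body(scoring, non_scoring):
    """Build a search body: scoring clauses go in `must`, the rest in `filter`."""
    return {"query": {"bool": {"must": list(scoring), "filter": list(non_scoring)}}}

body = bool_body(
    scoring=[{"match": {"tweet": "elasticsearch"}}],  # full text: relevance matters
    non_scoring=[{"range": {"age": {"gt": 30}}}],     # yes/no condition: no score needed
)
print(body)
```

The resulting dictionary can be passed as `body=` to `es.search`, exactly like the hand-written bodies above.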

### Most Important Queries

#### match_all Query

The match_all query is often used together with a filter, i.e. to grab a set of docs without regard to relevance, such as all the docs in a category (where ordering doesn't matter). All docs are considered equally relevant, so each receives a neutral _score of 1.

#### match Query

The match query should be the standard query that you reach for whenever you want to query for a full-text or exact value in almost any field.

If you run a match query against a full-text field, it will analyze the query string by using the correct analyzer for that field before executing the search:

> ```{ "match": { "tweet": "About Search" }}```

If you use it on a field containing an exact value, such as a number, a date, a Boolean, or a not_analyzed string field, then it will search for that **exact value**:

> ```{ "match": { "age":    26           }}```

> ```{ "match": { "date":   "2014-09-01" }}```

> ```{ "match": { "public": true         }}```

> ```{ "match": { "tag":    "full_text"  }}```

TIP: For exact-value searches, you probably want to use a filter clause instead of a query, as a filter will be cached. We’ll see some filtering examples soon.

#### term Query

The term query is used to search by exact values, be they numbers, dates, Booleans, or not_analyzed exact-value string fields:

> ```{ "term": { "age":    26           }}```

> ```{ "term": { "date":   "2014-09-01" }}```

> ```{ "term": { "public": true         }}```

> ```{ "term": { "tag":    "full_text"  }}```
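The practical difference between match and term is whether the query value is analyzed. The toy model below makes that concrete. It assumes a much simplified analyzer (lowercase, split on non-alphanumerics) that only approximates what Elasticsearch does, and the function names are invented for this sketch.

```python
import re

def analyze(text):
    """Toy analyzer: lowercase and split on non-alphanumerics (an approximation)."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def match(indexed_tokens, query_string):
    """match: analyze the query string, then look for any resulting token."""
    return any(tok in indexed_tokens for tok in analyze(query_string))

def term(indexed_tokens, value):
    """term: look for the value exactly as given, with no analysis."""
    return value in indexed_tokens

tokens = analyze("About Search")      # what an analyzed field would hold
print(tokens)                         # ['about', 'search']
print(match(tokens, "About Search"))  # True
print(term(tokens, "About Search"))   # False: no such single token
print(term(tokens, "search"))         # True
```

This is why term is best kept for exact-value fields, where the stored value and the query value are the same string.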

#### terms Query

The terms query is the same as the term query, but allows you to specify multiple values to match. If the field contains any of the specified values, the document matches:

> ```{ "terms": { "tag": [ "search", "full_text", "nosql" ] }}```

#### exists and missing Queries

The exists and missing queries are used to find documents in which the specified field either has one or more values (exists) or doesn’t have any values (missing). They are similar in nature to IS NULL (missing) and IS NOT NULL (exists) in SQL:

```
{
    "exists":   {
        "field":    "title"
    }
}
```
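Only the exists form is shown above. As far as I know, the standalone missing query is no longer available from Elasticsearch 5.0 onward, and the usual replacement is to negate exists inside a bool query. The sketch below builds that body; `missing` and `has_value` are hypothetical helpers, and `has_value` is only a rough in-memory analogue of the exists semantics.

```python
def missing(field):
    """Express "field has no value" as the negation of an exists clause."""
    return {"bool": {"must_not": {"exists": {"field": field}}}}

def has_value(doc, field):
    """Toy in-memory check mirroring exists: null and empty lists count as no value."""
    value = doc.get(field)
    return value is not None and value != []

print(missing("title"))
# {'bool': {'must_not': {'exists': {'field': 'title'}}}}
```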

### Combining Queries Together (Boolean)

Real-world queries are often complex, so they require a combination of qualifiers. For this, use the ```bool``` query.

##### must
Clauses that must match for the document to be included.
##### must_not
Clauses that must not match for the document to be included.
##### should
If these clauses match, **they increase the _score**; otherwise, they have no effect. They are simply used to refine the relevance score for each document.
##### filter
Clauses that must match, but are run in non-scoring, filtering mode. These clauses do not contribute to the score; instead, they simply include or exclude documents based on their criteria.

Because this is the first query we’ve seen that contains other queries, we need to talk about how scores are combined. Each sub-query clause will individually calculate a relevance score for the document. Once these scores are calculated, the bool query will merge the scores together and return a single score representing the total score of the boolean operation.

The following query finds documents whose title field matches the query string how to make millions and that are not marked as spam. If any documents are starred or are from 2014 onward, they will rank higher than they would have otherwise. Documents that match both conditions will rank even higher:

```
{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }},
            { "range": { "date": { "gte": "2014-01-01" }}}
        ]
    }
}
```

In [28]:
# A reminder of how to do this in Python DSL:
s = Search(using=es)
s = s.query(Q('bool',
              must=Q('match', title='how to make millions'),
              must_not=Q('match', tag='spam'),
              should=[Q('match', tag='starred'),Q('range', date={'gte':"2014-01-01"})]
             )
            )

In [29]:
s.execute()

<Response: []>

In [30]:
# Ok, let's put some docs in place then to make this work:
doc = {
    'title': 'how to make millions',
    'tag': ['spam','deleted'],
    'date': '2013-01-01'
}
es.create(index='email', doc_type='messages', body=doc, id=1)

{'_id': '1',
 '_index': 'email',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'messages',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [31]:
# A second doc: same title, tagged 'priority' and 'read':
doc = {
    'title': 'how to make millions',
    'tag': ['priority','read'],
    'date': '2013-01-01'
}
es.create(index='email', doc_type='messages', body=doc, id=2)

{'_id': '2',
 '_index': 'email',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'messages',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [32]:
# A third doc: same title, tagged 'priority' and 'starred':
doc = {
    'title': 'how to make millions',
    'tag': ['priority','starred'],
    'date': '2013-01-01'
}
es.create(index='email', doc_type='messages', body=doc, id=3)

{'_id': '3',
 '_index': 'email',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'messages',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [48]:
s = Search(using=es)
s = s.query(Q('bool',
              must=Q('match', title='how to make millions'),
              must_not=Q('match', tag='spam'),
              should=[Q('match', tag='starred'),Q('range', date={'gte':"2013-01-01"})]
             )
            )

In [49]:
res = s.execute()

In [50]:
for hit in res['hits']['hits']:
    print(hit['_score'], hit['_id'])

2.4088445 3
2.1507282 2


Note that the doc with _id=3 (which has a tag of 'starred') has a higher _score.
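The inclusion and ranking rules of the bool query can be mimicked in memory on the three email docs. This is a deliberately crude model: clauses are exact (field, value) pairs, the range clause is left out, and each matching clause is worth one point, whereas real scores come from Lucene's relevance calculation. All names here are invented for the sketch.

```python
def clause_matches(doc, clause):
    """Toy matcher: a clause is (field, value); list fields match on any element."""
    field, value = clause
    stored = doc.get(field)
    return value in stored if isinstance(stored, list) else stored == value

def toy_bool(doc, must=(), must_not=(), should=()):
    """Return a toy score, or None if the doc is excluded."""
    if not all(clause_matches(doc, c) for c in must):
        return None
    if any(clause_matches(doc, c) for c in must_not):
        return None
    # One point per matching must/should clause; real scores come from Lucene.
    return len(must) + sum(clause_matches(doc, c) for c in should)

docs = {
    1: {"title": "how to make millions", "tag": ["spam", "deleted"]},
    2: {"title": "how to make millions", "tag": ["priority", "read"]},
    3: {"title": "how to make millions", "tag": ["priority", "starred"]},
}
scores = {
    doc_id: toy_bool(doc,
                     must=[("title", "how to make millions")],
                     must_not=[("tag", "spam")],
                     should=[("tag", "starred")])
    for doc_id, doc in docs.items()
}
print(scores)  # {1: None, 2: 1, 3: 2}
```

The ordering agrees with the real result: doc 1 is excluded by must_not, and doc 3 outranks doc 2 because of the should clause.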

Now, in this case, the date range is part of the query and so contributes towards the score. If we don't want date to affect the score, then we can move it to a filter:

```
{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }}
        ],
        "filter": {
          "range": { "date": { "gte": "2014-01-01" }} 
        }
    }
}
```

In [42]:
s = Search(using=es)
s = s.query(Q('bool',
              must=Q('match', title='how to make millions'),
              must_not=Q('match', tag='spam'),
              should=Q('match', tag='starred')
             )
            )
s = s.filter('range', date={ "gte": "2013-01-01" })
s.execute()

<Response: [<Hit(email/messages/3): {'title': 'how to make millions', 'date': '2013-01-01', 'tag...}>, <Hit(email/messages/2): {'title': 'how to make millions', 'date': '2013-01-01', 'tag...}>]>

In [43]:
filtered_res = s.execute()

In [44]:
filtered_res == res

False

In [45]:
filtered_res

<Response: [<Hit(email/messages/3): {'title': 'how to make millions', 'date': '2013-01-01', 'tag...}>, <Hit(email/messages/2): {'title': 'how to make millions', 'date': '2013-01-01', 'tag...}>]>

In [46]:
for hit in filtered_res['hits']['hits']:
    print(hit['_score'], hit['_id'])

1.4088445 3
1.1507283 2


Notice that the scores are different from above.
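A quick bit of arithmetic on the two sets of scores printed in these notes shows how they differ. The values below are copied from the outputs above; the interpretation in the comments is an assumption rather than something verified against the cluster.

```python
# Scores copied from the two runs above, keyed by _id.
with_range_in_should = {"3": 2.4088445, "2": 2.1507282}
with_range_in_filter = {"3": 1.4088445, "2": 1.1507283}

for doc_id in with_range_in_should:
    diff = with_range_in_should[doc_id] - with_range_in_filter[doc_id]
    # Both differences come out at (almost exactly) 1.0, which is consistent
    # with the range clause adding a constant score while it sat in `should`.
    print(doc_id, round(diff, 4))
```

The relative ranking of the two docs is unchanged; only the constant contribution of the date clause has disappeared.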

In [47]:
# And a filter-only search
s = Search(using=es)
s = s.filter('term', tag='spam')
s.execute()

<Response: [<Hit(email/messages/1): {'title': 'how to make millions', 'date': '2013-01-01', 'tag...}>]>