## Elasticsearch: The Definitive Guide - Python

Following the examples in the book, here are Python snippets that achieve the same effect.

Documentation for the Python libs:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic"):

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

GitHub repo for the DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [1]:
import sys, os
sys.path.insert(1, os.path.join(sys.path[0], '..'))

In [2]:
import index
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q, Index
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

r = index.load_sid_examples(settings={ "settings": { "number_of_shards": 1 }},set=3)
print('{} items created'.format(len(r['items'])))

# Let's repopulate the index as we deleted 'gb' in earlier chapters:
# Run the script: populate.ipynb

4 items created


### Full-Text Search

The two most important aspects of full-text search are as follows:

##### Relevance

>The ability to rank results by how relevant they are to the given query, whether relevance is calculated using TF/IDF (see [What Is Relevance?](https://www.elastic.co/guide/en/elasticsearch/guide/master/relevance-intro.html)), proximity to a geolocation, fuzzy similarity, or some other algorithm.

##### Analysis

>The process of converting a block of text into distinct, normalized tokens (see [Analysis and Analyzers](https://www.elastic.co/guide/en/elasticsearch/guide/master/analysis-intro.html)) in order to (a) create an inverted index and (b) query the inverted index.

#### Term-Based Versus Full-Text

Two types of text query:

##### Term-based

Queries like the `term` or `fuzzy` queries are low-level queries that have no analysis phase. They operate on a single term. A `term` query for the term `Foo` looks for that exact term in the inverted index and calculates the TF/IDF relevance `_score` for each document that contains the term.

##### Full-text queries

Queries like the `match` or `query_string` queries are high-level queries that understand the mapping of a field:

* If you use them to query a date or integer field, they will treat the query string as a date or integer, respectively.

* If you query an exact value (`not_analyzed`) string field, they will treat the whole query string as a single term.

* But if you query a full-text (analyzed) field, they will first pass the query string through the appropriate analyzer to produce the list of terms to be queried.

Once the query has assembled a list of terms, it executes the appropriate low-level query for each of these terms, and then combines their results to produce the final relevance score for each document.
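The difference between the two query types can be sketched in plain Python. This is a minimal illustration, assuming a toy analyzer (lowercase, split on non-alphanumeric characters) and a dict-based inverted index. The names `analyze`, `build_index`, `term_query`, and `match_query` are hypothetical and are not part of Elasticsearch or its Python clients.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase and split on anything that is not a letter or digit."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def build_index(docs):
    """Map each analyzed term to the set of document ids that contain it."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            inverted[term].add(doc_id)
    return inverted


def term_query(inverted, term):
    """Term-based: look up the exact term, with no analysis of the input."""
    return inverted.get(term, set())


def match_query(inverted, text):
    """Full-text: analyze the input first, then look up each resulting term."""
    hits = set()
    for term in analyze(text):
        hits |= term_query(inverted, term)
    return hits


docs = {
    1: "The quick brown fox",
    2: "The quick brown fox jumps over the lazy dog",
}
inverted = build_index(docs)

# "QUICK!" is not a term in the index, so the term-style lookup finds nothing.
print(term_query(inverted, "QUICK!"))
# The match-style lookup analyzes "QUICK!" to "quick" first and finds both docs.
print(match_query(inverted, "QUICK!"))
```

The real analysis chain and relevance scoring are far more involved; the sketch only shows where the analysis step sits.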

#### The match Query

The *go-to* query.

In [3]:
s = Index('my_index', using=es).search()
s = s.query('match', title='QUICK!')
res = s.execute()

In [4]:
for hit in res:
    print(hit.meta.id, hit.title, ' - Score:', hit.meta.score)

1 The quick brown fox  - Score: 0.42327404
3 The quick brown fox jumps over the quick dog  - Score: 0.42211798
2 The quick brown fox jumps over the lazy dog  - Score: 0.2887157


Document 1 is most relevant because its title field is short, which means that quick represents a large portion of its content.


Document 3 is more relevant than document 2 because quick appears twice.

#### Multiword Queries

Obviously, we can search on more than one word at a time:


In [5]:
s = Index('my_index', using=es).search()
s = s.query('match', title='BROWN DOG!')
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, ' - Score:', hit.meta.score)

4 Brown fox brown dog  - Score: 0.58571666
2 The quick brown fox jumps over the lazy dog  - Score: 0.37400126
3 The quick brown fox jumps over the quick dog  - Score: 0.37400126
1 The quick brown fox  - Score: 0.12503365


Document 4 is the most relevant because it contains "brown" twice and "dog" once.

Documents 2 and 3 both contain brown and dog once each, and the title field is the same length in both docs, so they have the same score.

Document 1 matches even though it contains only brown, not dog.

Internally, this is a boolean query (more later). The important thing is: **any** document whose title field **contains at least one of the specified terms** will match the query. The more terms that match, the more relevant the document.

#### Improving Precision

Do we really want *ALL* the docs that contain brown and/or dog?

In [6]:
q = Q('match', title={      
                "query":    "BROWN DOG!",
                "operator": "and"
            })
s = Index('my_index', using=es).search()
s = s.query(q)
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, ' - Score:', hit.meta.score)

4 Brown fox brown dog  - Score: 0.58571666
2 The quick brown fox jumps over the lazy dog  - Score: 0.37400126
3 The quick brown fox jumps over the quick dog  - Score: 0.37400126


#### Controlling Precision

The `match` query supports the `minimum_should_match` parameter, which allows you to specify the number of terms that must match for a document to be considered relevant. While you can specify an absolute number of terms, it usually makes sense to specify a percentage instead, as you have no control over the number of words the user may enter:

In [7]:
q = Q('match', title={      
                "query":    "BROWN DOG!",
                "minimum_should_match": "75%"
            })
s = Index('my_index', using=es).search()
s = s.query(q)
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, ' - Score:', hit.meta.score)

4 Brown fox brown dog  - Score: 0.58571666
2 The quick brown fox jumps over the lazy dog  - Score: 0.37400126
3 The quick brown fox jumps over the quick dog  - Score: 0.37400126
1 The quick brown fox  - Score: 0.12503365


The `minimum_should_match` parameter is flexible. See the [full documentation](https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-minimum-should-match.html#query-dsl-minimum-should-match).
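To see why document 1 still appears in the output above, it helps to work out how many terms a percentage actually requires. The following is a minimal sketch that assumes only two forms of the parameter: a non-negative integer and a positive percentage string that rounds down. The helper `required_matches` is a hypothetical name, and the clamp to the clause count is an assumption of the sketch. The linked documentation covers negative values and combinations, which are ignored here.

```python
def required_matches(num_clauses, minimum_should_match):
    """Return how many optional clauses must match (simplified).

    Handles only a non-negative integer or a positive percentage string
    such as "75%". Percentages are rounded down.
    """
    if isinstance(minimum_should_match, str) and minimum_should_match.endswith("%"):
        percent = int(minimum_should_match[:-1])
        required = (num_clauses * percent) // 100
    else:
        required = int(minimum_should_match)
    # Assumption of this sketch: never require more clauses than exist.
    return min(required, num_clauses)


# "BROWN DOG!" analyzes to two terms; 75% of 2 is 1.5, which rounds down to 1.
print(required_matches(2, "75%"))
# With three terms, 75% of 3 is 2.25, which rounds down to 2.
print(required_matches(3, "75%"))
```

With only one of the two terms required, a title that contains `brown` but not `dog` still matches.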

#### Combining Queries

We already looked at the `bool` filter to combine multiple filter clauses with `and`, `or`, and `not` logic. In query land, the `bool` query does a similar job but with one important difference.

Filters make a binary decision: should this document be included in the results list or not? Queries decide not only whether to include a document, but also **how relevant that document is.**

In [8]:
s = Index('my_index', using=es).search()
q = Q('bool',
     must = [Q('match', title="quick")],
     must_not = [Q('match', title="lazy")],
     should = [Q('match', title="brown"), Q('match', title="dog")])
s = s.query(q)
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, ' - Score:', hit.meta.score)

3 The quick brown fox jumps over the quick dog  - Score: 0.7961192
1 The quick brown fox  - Score: 0.54830766


Document 3 scores higher because it contains both brown and dog.

#### Score Calculation

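According to the book, the `bool` query calculates the relevance `_score` for each document by adding together the `_score` from all of the matching `must` and `should` clauses, and then dividing by the total number of `must` and `should` clauses. The outputs in this notebook do not show that division: each `bool` score printed here equals the plain sum of the matching clause scores.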

The `must_not` clauses do not affect the score; their only purpose is to exclude documents that might otherwise have been included.
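As a quick arithmetic check, the scores printed earlier in this notebook can be combined by hand. The sketch below copies those printed values and adds the `quick` score to the `BROWN DOG!` score for documents 1 and 3. It uses the `BROWN DOG!` match as a stand-in for the two `should` clauses on `brown` and `dog`, which is an assumption of the sketch. Whether any further normalization applies depends on the similarity in use; in these particular outputs, the sum alone reproduces the `bool` scores.

```python
import math

# Scores copied from the outputs above (keys are document ids).
quick_score = {"1": 0.42327404, "3": 0.42211798}      # match title='QUICK!'
brown_dog_score = {"1": 0.12503365, "3": 0.37400126}  # match title='BROWN DOG!'
bool_score = {"1": 0.54830766, "3": 0.7961192}        # must quick, should brown / dog

for doc_id in ("1", "3"):
    summed = quick_score[doc_id] + brown_dog_score[doc_id]
    # Printed scores are single precision, so compare with a small tolerance.
    agrees = math.isclose(summed, bool_score[doc_id], abs_tol=1e-6)
    print(doc_id, bool_score[doc_id], agrees)
```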

#### Controlling Precision

All the `must` clauses must match, and all the `must_not` clauses must not match, but how many should clauses should match? By default, none of the `should` clauses are required to match, with one exception: if there are no `must` clauses, then at least one `should` clause must match.

Just as we can control the precision of the match query, we can control how many `should` clauses need to match by using the `minimum_should_match` parameter, either as an absolute number or as a percentage:

In [9]:
s = Index('my_index', using=es).search()
q = Q('bool',
     should = [Q('match', title="brown"), 
               Q('match', title="fox"),
               Q('match', title="dog")],
     minimum_should_match=2)
s = s.query(q)
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, ' - Score:', hit.meta.score)

4 Brown fox brown dog  - Score: 0.71075034
2 The quick brown fox jumps over the lazy dog  - Score: 0.45928687
3 The quick brown fox jumps over the quick dog  - Score: 0.45928687
1 The quick brown fox  - Score: 0.2500673


This returns docs in which at least 2 of the 3 `should` clauses match (all four docs qualify in this case).

The results would include only documents whose title field contains "brown" AND "fox", "brown" AND "dog", or "fox" AND "dog". If a document contains all three, it would be considered more relevant than those that contain just two of the three.
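The matching rules for `must`, `must_not`, and `should` can be sketched as a small predicate over sets of terms. This is a toy model under stated assumptions: documents are reduced to lowercase word sets, each clause is a single term, and `bool_matches` is a hypothetical helper, not a client API. It decides only whether a document matches and does not score it.

```python
def bool_matches(terms, must=(), must_not=(), should=(), minimum_should_match=None):
    """Decide whether a document (a set of terms) matches a toy bool query."""
    if minimum_should_match is None:
        # Default: should clauses are optional, unless there are should
        # clauses but no must clauses, in which case one has to match.
        minimum_should_match = 1 if (should and not must) else 0
    if any(term not in terms for term in must):
        return False
    if any(term in terms for term in must_not):
        return False
    matched_should = sum(1 for term in should if term in terms)
    return matched_should >= minimum_should_match


titles = {
    1: "The quick brown fox",
    2: "The quick brown fox jumps over the lazy dog",
    3: "The quick brown fox jumps over the quick dog",
    4: "Brown fox brown dog",
}
docs = {doc_id: set(title.lower().split()) for doc_id, title in titles.items()}

# At least two of brown / fox / dog must be present.
two_of_three = [
    doc_id for doc_id, terms in docs.items()
    if bool_matches(terms, should=["brown", "fox", "dog"], minimum_should_match=2)
]
print(two_of_three)

# must quick, must_not lazy, optional brown / dog.
quick_not_lazy = [
    doc_id for doc_id, terms in docs.items()
    if bool_matches(terms, must=["quick"], must_not=["lazy"], should=["brown", "dog"])
]
print(quick_not_lazy)
```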


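We could also express this as a percentage. Note that 50% of three clauses is 1.5, which is rounded down to 1, so this is looser than the absolute value of 2; the results are identical here only because every doc matches at least two clauses: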

In [10]:
s = Index('my_index', using=es).search()
q = Q('bool',
     should = [Q('match', title="brown"), 
               Q('match', title="fox"),
               Q('match', title="dog")],
     minimum_should_match='50%')
s = s.query(q)
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, ' - Score:', hit.meta.score)

4 Brown fox brown dog  - Score: 0.71075034
2 The quick brown fox jumps over the lazy dog  - Score: 0.45928687
3 The quick brown fox jumps over the quick dog  - Score: 0.45928687
1 The quick brown fox  - Score: 0.2500673


#### How match Uses bool

These two queries are equivalent:



In [11]:
s = Index('my_index', using=es).search()
q = Q('bool',
     should = [Q('term', title="brown"), 
               Q('term', title="fox")],)
s = s.query(q)
search_bool = s.execute()
for hit in search_bool:
    print(hit.meta.id, hit.title, ' - Score:', hit.meta.score)

4 Brown fox brown dog  - Score: 0.2874763
1 The quick brown fox  - Score: 0.2500673
2 The quick brown fox jumps over the lazy dog  - Score: 0.17057118
3 The quick brown fox jumps over the quick dog  - Score: 0.17057118


In [12]:
s = Index('my_index', using=es).search()
q = Q('match', title="brown fox")
s = s.query(q)
search_match = s.execute()
for hit in search_match:
    print(hit.meta.id, hit.title, ' - Score:', hit.meta.score)

4 Brown fox brown dog  - Score: 0.2874763
1 The quick brown fox  - Score: 0.2500673
2 The quick brown fox jumps over the lazy dog  - Score: 0.17057118
3 The quick brown fox jumps over the quick dog  - Score: 0.17057118


In [13]:
search_match.hits

[<Hit(my_index/my_type/4): {'title': 'Brown fox brown dog'}>, <Hit(my_index/my_type/1): {'title': 'The quick brown fox'}>, <Hit(my_index/my_type/2): {'title': 'The quick brown fox jumps over the lazy dog'}>, <Hit(my_index/my_type/3): {'title': 'The quick brown fox jumps over the quick dog'}>]

In [14]:
search_bool

<Response: [<Hit(my_index/my_type/4): {'title': 'Brown fox brown dog'}>, <Hit(my_index/my_type/1): {'title': 'The quick brown fox'}>, <Hit(my_index/my_type/2): {'title': 'The quick brown fox jumps over the lazy dog'}>, <Hit(my_index/my_type/3): {'title': 'The quick brown fox jumps over the quick dog'}>]>

These searches are also equivalent:

In [15]:
# this time with **must**
s = Index('my_index', using=es).search()
q = Q('bool',
     must = [Q('term', title="brown"), 
               Q('term', title="dog")],)
s = s.query(q)
search_bool = s.execute()
for hit in search_bool:
    print(hit.meta.id, hit.title, ' - Score:', hit.meta.score)

4 Brown fox brown dog  - Score: 0.58571666
2 The quick brown fox jumps over the lazy dog  - Score: 0.37400126
3 The quick brown fox jumps over the quick dog  - Score: 0.37400126


In [16]:
# this time with **operator and**
s = Index('my_index', using=es).search()
q = Q('match', title={"query": "brown dog", "operator": "and"})
s = s.query(q)
search_match = s.execute()
for hit in search_match:
    print(hit.meta.id, hit.title, ' - Score:', hit.meta.score)

4 Brown fox brown dog  - Score: 0.58571666
2 The quick brown fox jumps over the lazy dog  - Score: 0.37400126
3 The quick brown fox jumps over the quick dog  - Score: 0.37400126


The same holds if we pass the `minimum_should_match` parameter:

In [17]:
# this time with **minimum_should_match**
s = Index('my_index', using=es).search()
q = Q('bool',
     should = [Q('term', title="brown"), 
               Q('term', title="fox"),
               Q('term', title="quick")],
             minimum_should_match=2)
s = s.query(q)
search_bool = s.execute()
for hit in search_bool:
    print(hit.meta.id, hit.title, ' - Score:', hit.meta.score)

1 The quick brown fox  - Score: 0.67334133
3 The quick brown fox jumps over the quick dog  - Score: 0.59268916
2 The quick brown fox jumps over the lazy dog  - Score: 0.45928687
4 Brown fox brown dog  - Score: 0.2874763


In [18]:
# this time with a match query and **minimum_should_match**
s = Index('my_index', using=es).search()
q = Q('match', title={"query": "brown fox quick",
                      "minimum_should_match": "75%"})
s = s.query(q)
search_match = s.execute()
for hit in search_match:
    print(hit.meta.id, hit.title, ' - Score:', hit.meta.score)

1 The quick brown fox  - Score: 0.67334133
3 The quick brown fox jumps over the quick dog  - Score: 0.59268916
2 The quick brown fox jumps over the lazy dog  - Score: 0.45928687
4 Brown fox brown dog  - Score: 0.2874763


Because there are only three clauses, the `minimum_should_match` value of 75% in the match query is rounded down to 2. At least two out of the three should clauses must match.

We would normally write these types of queries by using the `match` query, but understanding how the match query works internally lets you take control when needed.
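The rewrite described in this section can be sketched as a function that turns a match-style request into a `bool` body of `term` clauses. It is a simplified illustration: the toy analysis (lowercase, split on non-alphanumeric characters) and the helper name `match_to_bool` are assumptions, and the real query parser handles more cases than this.

```python
import json
import re


def match_to_bool(field, text, operator="or", minimum_should_match=None):
    """Rewrite a toy match query into a bool query made of term clauses."""
    # Toy analysis: lowercase and split on non-alphanumeric characters.
    terms = [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]
    clauses = [{"term": {field: term}} for term in terms]
    if operator == "and":
        # Every term is required.
        return {"bool": {"must": clauses}}
    body = {"should": clauses}
    if minimum_should_match is not None:
        body["minimum_should_match"] = minimum_should_match
    return {"bool": body}


print(json.dumps(match_to_bool("title", "brown fox"), indent=2))
print(json.dumps(match_to_bool("title", "brown dog", operator="and"), indent=2))
```

The resulting dicts have the same shape as the `Q('bool', ...)` objects built by hand in the cells above.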

#### Boosting Query Clauses

Imagine that we want to search for documents about "full-text search," but we want to give more _weight_ to documents that also mention "Elasticsearch" or "Lucene." By _more weight_, we mean that documents mentioning "Elasticsearch" or "Lucene" will receive a higher relevance `_score` than those that don’t, which means that they will appear higher in the list of results.

A simple bool query allows us to write this fairly complex logic as follows:

**NOTE**: I will use the fox examples here to exercise the existing index, and build up the query bit by bit to show the influence of the boost on the results.

In [21]:
r = index.load_sid_examples(settings={ "settings": { "number_of_shards": 1 }},set=3)
# Add another doc first
body = { "title": "The slow dog tried to chase the goat" }
es.create(index='my_index', doc_type='my_type', body=body, id=5)
# Add a second doc
body = { "title": "The goat liked the fox and the dog" }
es.create(index='my_index', doc_type='my_type', body=body, id=6)

{'_id': '6',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'my_type',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [22]:
# just search for dog first
s = Index('my_index', using=es).search()
q = Q('bool',
     must = [Q('match', title="dog")])
s = s.query(q)
search_bool = s.execute()
for hit in search_bool:
    print(hit.meta.id, hit.title, ' - Score:', hit.meta.score)

4 Brown fox brown dog  - Score: 0.29243276
2 The quick brown fox jumps over the lazy dog  - Score: 0.20276785
3 The quick brown fox jumps over the quick dog  - Score: 0.20276785
5 The slow dog tried to chase the goat  - Score: 0.20276785
6 The goat liked the fox and the dog  - Score: 0.20276785


In [25]:
# Now add in jumps and goat
s = Index('my_index', using=es).search()
q = Q('bool',
     must = [Q('match', title="dog")],
     should = [Q('match', title="jumps"),Q('match', title="goat")])
s = s.query(q)
search_bool = s.execute()
for hit in search_bool:
    print(hit.meta.id, hit.title, ' - Score:', hit.meta.score)

2 The quick brown fox jumps over the lazy dog  - Score: 1.0684667
3 The quick brown fox jumps over the quick dog  - Score: 1.0684667
5 The slow dog tried to chase the goat  - Score: 1.0684667
6 The goat liked the fox and the dog  - Score: 1.0684667
4 Brown fox brown dog  - Score: 0.29243276


In [26]:
# Now boost goat over jumps
s = Index('my_index', using=es).search()
q = Q('bool',
     must = [Q('match', title="dog")],
     should = [Q('match', title={'query' : 'jumps', 'boost': 2}),
               Q('match', title={'query' : 'goat', 'boost': 3})])
s = s.query(q)
search_bool = s.execute()
for hit in search_bool:
    print(hit.meta.id, hit.title, ' - Score:', hit.meta.score)

5 The slow dog tried to chase the goat  - Score: 2.7998643
6 The goat liked the fox and the dog  - Score: 2.7998643
2 The quick brown fox jumps over the lazy dog  - Score: 1.9341656
3 The quick brown fox jumps over the quick dog  - Score: 1.9341656
4 Brown fox brown dog  - Score: 0.29243276
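
The effect of the boost can be checked against the printed scores. The sketch below treats the boost as a simple multiplier on the score of the boosted clause, which is what these particular outputs are consistent with. It is not a general statement about how boosts are applied, and the variable names are only illustrative.

```python
import math

dog = 0.20276785             # dog-only score for documents 2, 3, 5 and 6
dog_plus_one = 1.0684667     # score once an unboosted jumps or goat clause also matches
clause = dog_plus_one - dog  # contribution of one unboosted should clause

boosted_jumps = dog + 2 * clause  # boost of 2 on jumps
boosted_goat = dog + 3 * clause   # boost of 3 on goat

# Compare with the printed scores, allowing for single-precision rounding.
print(math.isclose(boosted_jumps, 1.9341656, abs_tol=1e-5))  # True
print(math.isclose(boosted_goat, 2.7998643, abs_tol=1e-5))   # True
```

Document 4 matches neither `jumps` nor `goat`, so its score stays at the `dog`-only value in all three runs.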


#### Controlling Analysis

Queries can find only terms that actually exist in the inverted index, so it is important to ensure that the same analysis process is applied both to the document at index time, and to the query string at search time so that the terms in the query match the terms in the inverted index.

Although we say document, analyzers are determined per field. Each field can have a different analyzer, either by configuring a specific analyzer for that field or by falling back on the type, index, or node defaults. At index time, a field’s value is analyzed by using the configured or default analyzer for that field.

In [39]:
mapping = {
    "my_type": {
        "properties": {
            "english_title": {
                "type":     "text",
                "analyzer": "english"
            }
        }
    }
}
es.indices.put_mapping(index='my_index', doc_type='my_type', body=mapping)

{'acknowledged': True}

We applied the mapping, but it does not affect docs that are already indexed. We can reindex or just re-create the index altogether. Meanwhile, we can see how each field analyzes a query string by validating a query with `explain`.

In [59]:
#english (language)
i = Index('my_index', using=es)
q = Q('bool',
     should = [Q('match', title='Foxes'),
               Q('match', english_title='Foxes')])
es.indices.validate_query(index='my_index', body={ "query" :q.to_dict() }, explain=1)

{'_shards': {'failed': 0, 'successful': 1, 'total': 1},
 'explanations': [{'explanation': 'title:foxes english_title:fox',
   'index': 'my_index',
   'valid': True}],
 'valid': True}
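
The explanation shows the same query string producing a different term for each field. A toy sketch of per-field analysis follows. The suffix-stripping rule is a deliberately naive assumption for illustration and is not the algorithm used by the real `english` analyzer; `standard_like`, `english_like`, `field_analyzers`, and `explain` are hypothetical names.

```python
import re


def standard_like(text):
    """Toy stand-in for a standard-style analyzer: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def english_like(text):
    """Toy stand-in for a stemming analyzer: also strips a plural suffix (naive rule)."""
    stemmed = []
    for token in standard_like(text):
        if token.endswith("es") and len(token) > 3:
            token = token[:-2]
        elif token.endswith("s") and len(token) > 2:
            token = token[:-1]
        stemmed.append(token)
    return stemmed


# Hypothetical per-field configuration, mirroring the mapping above.
field_analyzers = {"title": standard_like, "english_title": english_like}


def explain(query_text, fields):
    """List the terms each field would be searched for, in field:term form."""
    return " ".join(
        f"{field}:{term}"
        for field in fields
        for term in field_analyzers[field](query_text)
    )


print(explain("Foxes", ["title", "english_title"]))  # title:foxes english_title:fox
```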