## Elasticsearch: The Definitive Guide - Python

Following the examples in the book, here are Python snippets that achieve the same effect.

Documentation for the Python libs:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic"):

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

GitHub repo for the DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [2]:
import sys, os
sys.path.insert(1, os.path.join(sys.path[0], '..'))

In [3]:
import index
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q, Index
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

r = index.load_sid_examples(settings={ "settings": { "number_of_shards": 1 }},set=3)
print('{} items created'.format(len(r['items'])))

# Let's repopulate the index as we deleted 'gb' in earlier chapters:
# Run the script: populate.ipynb

4 items created


### Multifield Search

Queries are seldom simple one-clause match queries.

In [101]:
if es.indices.exists('books'):
    es.indices.delete('books')
es.indices.create(index='books',
                     body={ "settings": { "number_of_shards": 1 }})

{'acknowledged': True, 'shards_acknowledged': True}

In [102]:
body = {
    "title": "War and Peace",
    "author": "Leo Tolstoy",
    "translator": "Constance Garnett"
}
r = es.create(index='books', doc_type='classics', body=body, id=1)

In [103]:
body = {
    "title": "War and Peace",
    "author": "Leo Tolstoy",
    "translator": "Louise Maude"
}
r = es.create(index='books', doc_type='classics', body=body, id=2)

In [104]:
body = {
    "title": "War and Peace",
    "author": "Leo Tolstoy",
    "format" : "hardback"
}
r = es.create(index='books', doc_type='classics', body=body, id=3)

In [105]:
s = Index('books', using=es).search()

In [106]:
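# NOTE: 'translater' in the last clause is a misspelling of 'translator'.
# It is kept because the output below was produced with it: that clause
# matches no field, so document 2 scores the same as document 3.
q = Q('bool',
         should=[Q('match', title={ "query": "War and Peace", "boost": 2}),
                 Q('match', author={ "query": "Leo Tolstoy", "boost": 2}),
                 Q('bool',
                      should=[Q('match', translator="Constance Garnett"),
                              Q('match', translater='Louise Maude')])]
         )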

In [107]:
pprint(q.to_dict())

{'bool': {'should': [{'match': {'title': {'boost': 2,
                                          'query': 'War and Peace'}}},
                     {'match': {'author': {'boost': 2,
                                           'query': 'Leo Tolstoy'}}},
                     {'bool': {'should': [{'match': {'translator': 'Constance '
                                                                   'Garnett'}},
                                          {'match': {'translater': 'Louise '
                                                                   'Maude'}}]}}]}}


In [108]:
s = s.query(q)
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.meta.score)

1 2.4280977
2 1.1842774
3 1.1842774


#### Best Fields

In [109]:
if es.indices.exists('my_index'):
    es.indices.delete('my_index')
es.indices.create(index='my_index',
                     body={ "settings": { "number_of_shards": 1 }})

{'acknowledged': True, 'shards_acknowledged': True}

In [110]:

body = {
    "title": "Quick brown rabbits",
    "body":  "Brown rabbits are commonly seen."
}
r = es.create(index='my_index', doc_type='my_type', body=body, id=1)
body = {
    "title": "Keeping pets healthy",
    "body":  "My quick brown fox eats rabbits on a regular basis."
}
r = es.create(index='my_index', doc_type='my_type', body=body, id=2)

In [111]:
s = Index('my_index', using=es).search()
s = s.query(Q('bool',
                 should=[Q('match', title="Brown fox"),
                         Q('match', body="Brown fox")]))
s = s.extra(explain=True)
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, hit.meta.score)

1 Quick brown rabbits 0.8181274
2 Keeping pets healthy 0.7616384


In [112]:
s = Index('my_index', using=es).search()
q = Q('dis_max',
                 queries=[Q('match', title="Brown fox").to_dict(),
                         Q('match', body="Brown fox").to_dict()])
s = s.query(q)
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, hit.meta.score)

2 Keeping pets healthy 0.7616384
1 Quick brown rabbits 0.6099695


#### Tuning Best Fields Queries

In [113]:
s = Index('my_index', using=es).search()
q = Q('dis_max',
                 queries=[Q('match', title="Quick pets").to_dict(),
                         Q('match', body="Quick pets").to_dict()])
s = s.query(q)
s = s.extra(explain=True)
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, hit.meta.score)

1 Quick brown rabbits 0.6099695
2 Keeping pets healthy 0.6099695


The `dis_max` query only takes into account the best-scoring clause for each document. Here the best scores are identical, so the two documents tie. To break the tie, we can take the `_score` of the other matching clauses into account by specifying the `tie_breaker` parameter:

In [115]:
s = Index('my_index', using=es).search()
q = Q('dis_max',
                 queries=[Q('match', title="Quick pets").to_dict(),
                         Q('match', body="Quick pets").to_dict()],
     tie_breaker=0.3)
s = s.query(q)
s = s.extra(explain=True)
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, hit.meta.score)

2 Keeping pets healthy 0.7908763
1 Quick brown rabbits 0.6099695


The tie_breaker parameter makes the dis_max query behave more like a halfway house between dis_max and bool. It changes the score calculation as follows:

1. Take the `_score` of the best-matching clause.
2. Multiply the score of each of the other matching clauses by the `tie_breaker`.
3. Add them all together and normalize.

With the `tie_breaker`, all matching clauses count, but the best-matching clause counts most.
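The steps above can be sketched in plain Python. The function name `dis_max_score` and the clause scores are hypothetical, and the sketch leaves out the normalization step, so it illustrates the arithmetic rather than reproducing the scores Elasticsearch reports.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Combine per-clause scores following the steps listed above.

    clause_scores holds the scores of the clauses that matched one document
    (hypothetical values, not taken from Elasticsearch).
    """
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    # Every matching clause except the best one is scaled by tie_breaker.
    others = sum(clause_scores) - best
    return best + tie_breaker * others


# Hypothetical documents: A matches in one field, B matches in two.
doc_a = [2.0]
doc_b = [2.0, 1.0]

# Without a tie_breaker only the best clause counts, so A and B tie.
print(dis_max_score(doc_a), dis_max_score(doc_b))
# With a tie_breaker the second matching clause lifts B above A.
print(dis_max_score(doc_a, 0.5), dis_max_score(doc_b, 0.5))
```

With `tie_breaker=0` only the best clause counts; with `tie_breaker=1` the sketch simply adds all clause scores, which is the bool-like end of the range described above.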

#### Multi-match queries

We can re-write a query like this:

```json
{
  "dis_max": {
    "queries":  [
      {
        "match": {
          "title": {
            "query": "Quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      },
      {
        "match": {
          "body": {
            "query": "Quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      }
    ],
    "tie_breaker": 0.3
  }
}
```

more concisely as this:

```json
{
    "multi_match": {
        "query":                "Quick brown fox",
        "type":                 "best_fields",
        "fields":               [ "title", "body" ],
        "tie_breaker":          0.3,
        "minimum_should_match": "30%"
    }
}
```

With the Python DSL, the same kind of query is shorter still:

In [117]:
s = Index('my_index', using=es).search()
q = Q('multi_match', query='Quick pets', fields=['title','body'],
      tie_breaker=0.3, type='best_fields')
s = s.query(q)
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, hit.meta.score)

2 Keeping pets healthy 0.7908763
1 Quick brown rabbits 0.6099695


In [118]:
# best_fields is the default type anyway, so can be left out (though less expressive)
s = Index('my_index', using=es).search()
q = Q('multi_match', query='Quick pets', fields=['title','body'],
      tie_breaker=0.3)
s = s.query(q)
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, hit.meta.score)

2 Keeping pets healthy 0.7908763
1 Quick brown rabbits 0.6099695
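To make the rewrite from `multi_match` to `dis_max` concrete, here is a minimal sketch that builds the longer form from the same arguments. `expand_best_fields` is a hypothetical helper written for this note; it mirrors the JSON rewrite shown earlier and is not how Elasticsearch performs the expansion internally.

```python
from pprint import pprint


def expand_best_fields(query, fields, tie_breaker=0.0, minimum_should_match=None):
    """Build the dis_max body that a best_fields multi_match stands for."""
    clauses = []
    for field in fields:
        params = {"query": query}
        # minimum_should_match is copied into every per-field match clause.
        if minimum_should_match is not None:
            params["minimum_should_match"] = minimum_should_match
        clauses.append({"match": {field: params}})
    return {"dis_max": {"queries": clauses, "tie_breaker": tie_breaker}}


body = expand_best_fields("Quick brown fox", ["title", "body"],
                          tie_breaker=0.3, minimum_should_match="30%")
pprint(body)
```

The resulting dictionary has the same shape as the `dis_max` JSON above, with one `match` clause per field.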


#### Using Wildcards in Field Names

The example below is contrived. Wildcards are more useful when several fields share a naming pattern, such as `book_title`, `chapter_title`, and `section_title`, which could all be matched with the following:

`Q('multi_match', query='Quick brown fox', fields=['*_title'])`

In [120]:
# wouldn't do this, but just for demo
s = Index('my_index', using=es).search()
q = Q('multi_match', query='Quick pets', fields=['t*','b*'],
      tie_breaker=0.3)
s = s.query(q)
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, hit.meta.score)

2 Keeping pets healthy 0.7908763
1 Quick brown rabbits 0.6099695


#### Boosting Individual Fields

Individual fields can be boosted by using the caret (^) syntax: just add ^boost after the field name, where boost is a floating-point number:

In [123]:
# without the boost
s = Index('my_index', using=es).search()
q = Q('multi_match', query='Quick pets', fields=['title','body'],
      tie_breaker=0.3)
s = s.query(q)
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, ' - ', hit.body, hit.meta.score)

2 Keeping pets healthy  -  My quick brown fox eats rabbits on a regular basis. 0.7908763
1 Quick brown rabbits  -  Brown rabbits are commonly seen. 0.6099695


In [125]:
# **with** the boost
s = Index('my_index', using=es).search()
q = Q('multi_match', query='Quick pets', fields=['title^2','body'],
      tie_breaker=0.3)
s = s.query(q)
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, ' - ', hit.body, hit.meta.score)

2 Keeping pets healthy  -  My quick brown fox eats rabbits on a regular basis. 1.4008458
1 Quick brown rabbits  -  Brown rabbits are commonly seen. 1.219939


#### Most Fields

Full-text search is a battle between recall—returning all the documents that are relevant—and precision—not returning irrelevant documents. The goal is to present the user with the most relevant documents on the first page of results.

To improve recall, we cast the net wide—we include not only documents that match the user’s search terms exactly, but also documents that we believe to be pertinent to the query.

A common technique for fine-tuning full-text relevance is to index the same text in multiple ways, each of which provides a different relevance signal. The main field would contain terms in their broadest-matching form to match as many documents as possible. For instance, we could do the following:

* Use a stemmer to index jumps, jumping, and jumped as their root form: jump. Then it doesn’t matter if the user searches for jumped; we could still match documents containing jumping.
* Include synonyms like jump, leap, and hop.
* Remove diacritics, or accents: for example, ésta, está, and esta would all be indexed without accents as esta.

However, if we have two documents, one of which contains jumped and the other jumping, the user would probably expect the first document to rank higher, as it contains exactly what was typed in.

We can achieve this by indexing the same text in other fields to provide more-precise matching. One field may contain the unstemmed version, another the original word with diacritics, and a third might use shingles to provide information about word proximity. These other fields act as signals that increase the relevance score of each matching document. The more fields that match, the better.

A document is included in the results list if it matches the broad-matching main field. If it also matches the signal fields, it gets extra points and is pushed up the results list.

#### Multifield Mapping

The same field can be indexed in more than one way by giving it sub-fields.

Below, `title` is indexed with the `english` analyzer, and its sub-field `title.std` is indexed with the `standard` analyzer:

In [128]:
settings = {
    "settings": { "number_of_shards": 1 }, 
    "mappings": {
        "my_type": {
            "properties": {
                "title": { 
                    "type":     "text",
                    "analyzer": "english",
                    "fields": {
                        "std":   { 
                            "type":     "text",
                            "analyzer": "standard"
                        }
                    }
                }
            }
        }
    }
}
index.create_my_index(body=settings)

In [129]:
body = { "title": "My rabbit jumps" }
es.create(index='my_index', doc_type='my_type', body = body, id=1)
body = { "title": "Jumping jack rabbits" }
es.create(index='my_index', doc_type='my_type', body = body, id=2)

{'_id': '2',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'my_type',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [130]:
s = Index('my_index', using=es).search()
s = s.query(Q('match', title='jumping rabbits'))
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, hit.meta.score)

1 My rabbit jumps 0.32088596
2 Jumping jack rabbits 0.32088596


Both documents score equally because the `english` analyzer stems the query and both titles to the same terms:

In [133]:
# effect of english analyzer on our string
titles = ['jumping rabbits', 'My rabbit jumps', 'Jumping jack rabbits']
for title in titles:
    tokens = es.indices.analyze(analyzer='english', body=title)['tokens']
    analyzed_text = [x['token'] for x in tokens]
    print(','.join(analyzed_text))

jump,rabbit
my,rabbit,jump
jump,jack,rabbit


Now let's try a `multi_match` query that also includes the other indexed variant of the field, `title.std`:

In [138]:
# run search again, but with most_fields setting
s = Index('my_index', using=es).search()
s = s.query(Q('multi_match', query='jumping rabbits',
              type='most_fields', fields=['title','title.std']))
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, hit.meta.score)

2 Jumping jack rabbits 1.5408249
1 My rabbit jumps 0.32088596


Document 2 now scores higher, reflecting the fact that it is very close to the search string in terms of its original (unstemmed) content.
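The effect can be reproduced without Elasticsearch in a toy sketch. `crude_stem` is a deliberately simplistic stand-in for the `english` analyzer, written only for this example, and counting shared terms stands in for real relevance scoring.

```python
def crude_stem(word):
    """Toy stemmer for this example only; far simpler than the english analyzer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokens(text, stem=False):
    """Lowercase and split on whitespace, optionally stemming each word."""
    words = text.lower().split()
    return {crude_stem(w) for w in words} if stem else set(words)


docs = {1: "My rabbit jumps", 2: "Jumping jack rabbits"}
query = "jumping rabbits"

hits = {}
for doc_id, title in docs.items():
    # Broad field: stemmed terms, so both documents match both query terms.
    stemmed = len(tokens(query, stem=True) & tokens(title, stem=True))
    # Signal field: unstemmed terms, so only exact word forms match.
    exact = len(tokens(query) & tokens(title))
    hits[doc_id] = (stemmed, exact)
    print(doc_id, "stemmed matches:", stemmed, "unstemmed matches:", exact)
```

Both titles match the query equally well once stemmed, but only document 2 also matches on the unstemmed terms, which is the extra signal that `title.std` contributes.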

We want to combine the scores from all matching fields, so we use the `most_fields` type. This causes the `multi_match` query to wrap the two field-clauses in a `bool` query instead of a `dis_max` query.

We are using the broad-matching title field to include as many documents as possible—to increase recall—but we use the title.std field as a signal to push the most relevant results to the top.
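As a sketch of what that wrapping looks like, the hypothetical helper below builds a `bool` query with one `match` clause per field. It approximates the shape of the rewritten query for illustration; the query Elasticsearch actually generates may differ in its details.

```python
from pprint import pprint


def expand_most_fields(query, fields):
    """Build a bool query with one should/match clause per field."""
    should = [{"match": {field: query}} for field in fields]
    return {"bool": {"should": should}}


body = expand_most_fields("jumping rabbits", ["title", "title.std"])
pprint(body)
```

Because the clauses sit in `should`, a document that matches both `title` and `title.std` collects a score from each clause, rather than only from the best one.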

We can also boost a field:

In [140]:
# run the search again, but boost the title field (^10)
# to make it relatively more important than title.std
s = Index('my_index', using=es).search()
s = s.query(Q('multi_match', query='jumping rabbits',
              type='most_fields', fields=['title^10','title.std']))
res = s.execute()
for hit in res:
    print(hit.meta.id, hit.title, hit.meta.score)

2 Jumping jack rabbits 4.4287987
1 My rabbit jumps 3.2088597
