## Elasticsearch: The Definitive Guide - Python

Following the examples in the book, here are Python snippets that achieve the same effect.

Documentation for the Python libs:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic"):

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

GitHub repo for DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [1]:
import sys, os
sys.path.insert(1, os.path.join(sys.path[0], '..'))

In [3]:
import index
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q, Index
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

#r = index.load_sid_examples(settings={ "settings": { "number_of_shards": 1 }},set=3)
#print('{} items created'.format(len(r['items'])))

# Let's repopulate the index as we deleted 'gb' in earlier chapters:
# Run the script: populate.ipynb

### Postcodes and Structured Data

Assume we are indexing UK postcodes (like `"W1V 3DG"`). A postcode has two parts:

* W1V: This outer part identifies the postal area and district:

    * W indicates the area (one or two letters)
    * 1V indicates the district (one or two numbers, possibly followed by a letter)

* 3DG: This inner part identifies a street or building:
    * 3 indicates the sector (one number)
    * DG indicates the unit (two letters)
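
The structure above can be sketched as a small parser. This is a minimal illustration only: the regular expression and the `split_postcode` helper are assumptions made for this notebook. They cover just the simplified format described here and are not a validator for real UK postcodes.

```python
import re

# Simplified pattern for the format described above (an assumption, not a full validator):
# area = 1-2 letters, district = 1-2 digits plus an optional letter,
# sector = 1 digit, unit = 2 letters.
POSTCODE_RE = re.compile(r'^([A-Z]{1,2})(\d{1,2}[A-Z]?) (\d)([A-Z]{2})$')

def split_postcode(postcode):
    """Return (area, district, sector, unit), or None if the pattern does not match."""
    m = POSTCODE_RE.match(postcode)
    return m.groups() if m else None

for pc in ["W1V 3DG", "WC1N 1LZ", "SW5 0BE"]:
    print(pc, '->', split_postcode(pc))
```

Because searches will target these fixed parts rather than natural-language words, it makes sense to index the postcode as a single exact value.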
    



In [4]:
# Create the index with the postcode field mapped as an exact (not_analyzed) value
body= {
    "settings": { "number_of_shards": 1 },
    "mappings": {
        "address": {
            "properties": {
                "postcode": {
                    "type":  "string",
                    "index": "not_analyzed"
                }
            }
        }
    }
}
index.create_my_index(body=body)

Now index some postcodes:


In [5]:
# Index each postcode as its own document
zips = [ "W1V 3DG", "W2F 8HW", "W1F 7HW", "WC1N 1LZ", "SW5 0BE" ]
for i, postcode in enumerate(zips):
    body = {'postcode': postcode}
    print(body)
    r = es.create(index='my_index', doc_type='address', id=i, body=body)


{'postcode': 'W1V 3DG'}
{'postcode': 'W2F 8HW'}
{'postcode': 'W1F 7HW'}
{'postcode': 'WC1N 1LZ'}
{'postcode': 'SW5 0BE'}


#### Prefix Query

In [6]:
s = Index('my_index', using=es).search()
s = s.query(Q('prefix', postcode="W1"))
s.execute()

<Response: [<Hit(my_index/address/0): {'postcode': 'W1V 3DG'}>, <Hit(my_index/address/2): {'postcode': 'W1F 7HW'}>]>

NOTE: The prefix query and filter are useful for ad hoc prefix matching but should be used with care. They can be used freely on fields with a small number of terms, but they scale poorly and can put your cluster under a lot of strain.

#### Wildcard and regexp Queries

The wildcard query is a low-level, term-based query similar in nature to the prefix query, but it allows you to specify a pattern instead of just a prefix. It uses the standard shell wildcards: `?` matches any single character, and `*` matches zero or more characters.

In [8]:
# wildcards
s = Index('my_index', using=es).search()
s = s.query(Q('wildcard', postcode="W?F*HW"))
s.execute()

<Response: [<Hit(my_index/address/1): {'postcode': 'W2F 8HW'}>, <Hit(my_index/address/2): {'postcode': 'W1F 7HW'}>]>

In [10]:
# regex
s = Index('my_index', using=es).search()
s = s.query(Q('regexp', postcode="W[0-9].+"))
s.execute()

<Response: [<Hit(my_index/address/0): {'postcode': 'W1V 3DG'}>, <Hit(my_index/address/1): {'postcode': 'W2F 8HW'}>, <Hit(my_index/address/2): {'postcode': 'W1F 7HW'}>]>
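
As a rough way to check what these two patterns select, the same matching can be reproduced client-side in plain Python. This is only a sketch: `fnmatch` and `re` are stand-ins chosen for illustration and say nothing about how Elasticsearch evaluates the queries internally. It assumes the `postcode` terms are the exact, unanalyzed strings, and that the `regexp` pattern has to match the whole term, which is why `re.fullmatch` is used.

```python
import re
from fnmatch import fnmatchcase

# The exact terms stored for the not_analyzed postcode field
terms = ["W1V 3DG", "W2F 8HW", "W1F 7HW", "WC1N 1LZ", "SW5 0BE"]

# Shell-style wildcard: ? is exactly one character, * is zero or more characters.
wildcard_hits = [t for t in terms if fnmatchcase(t, "W?F*HW")]

# Regular expression anchored to the whole term.
regexp_hits = [t for t in terms if re.fullmatch(r"W[0-9].+", t)]

print(wildcard_hits)
print(regexp_hits)
```

Both lists line up with the hits returned by the two cells above.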

#### Query-Time Search-as-you-Type

In [Phrase Matching](Proximity%20Matching.ipynb#Phrase-Matching) the `match_phrase` query matches all the specified words in the same positions relative to each other. For query-time search-as-you-type, we can use a specialization of this query, called the `match_phrase_prefix` query:

In [13]:
s = Index('my_index', using=es).search()
s = s.query(Q('match_phrase_prefix', title="quick brown f"))
s.execute()

<Response: [<Hit(my_index/my_type/1): {'title': 'The quick brown fox'}>, <Hit(my_index/my_type/2): {'title': 'The quick brown fox jumps over the lazy dog'}>, <Hit(my_index/my_type/3): {'title': 'The quick brown fox jumps over the quick dog'}>]>

This would look for:

* quick
* Followed by brown
* Followed by words beginning with f
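
The three steps above can be modelled in a few lines of plain Python. This is a simplified sketch, not Elasticsearch's implementation: the `phrase_prefix_match` helper is hypothetical, it tokenizes by lowercasing and splitting on whitespace, and it ignores `slop` entirely.

```python
def phrase_prefix_match(text, query):
    """True if the query words appear consecutively in text,
    with the last query word treated as a prefix."""
    tokens = text.lower().split()
    terms = query.lower().split()
    if not terms:
        return False
    *exact, prefix = terms  # every word but the last must match exactly
    n = len(terms)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if window[:-1] == exact and window[-1].startswith(prefix):
            return True
    return False

print(phrase_prefix_match("The quick brown fox", "quick brown f"))
print(phrase_prefix_match("The brown quick fox", "quick brown f"))
```

In this strict model the second call fails because the words are out of order.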

Like the `match_phrase` query, it accepts a `slop` parameter (see [Mixing It Up](Proximity%20Matching.ipynb#Mixing-It-Up)) to make the word order and relative positions somewhat less rigid:

In [16]:
# Adding some slop
s = Index('my_index', using=es).search()
s = s.query(Q('match_phrase_prefix', title={"query": "brown quick f", "slop":2}))
s.execute()

<Response: [<Hit(my_index/my_type/1): {'title': 'The quick brown fox'}>, <Hit(my_index/my_type/2): {'title': 'The quick brown fox jumps over the lazy dog'}>, <Hit(my_index/my_type/3): {'title': 'The quick brown fox jumps over the quick dog'}>]>

	
Even though the words are in the wrong order, the query still matches because we have set a high enough slop value to allow some flexibility in word positions. However, only the last word in the query string is ever treated as a prefix.

Earlier, in [prefix Query](#Prefix-Query), we warned that prefix queries can be resource intensive. The same is true in this case. A prefix of `a` could match hundreds of thousands of terms. Matching on that many terms would be resource intensive, and the results would not be useful to the user.

We can limit the impact of the prefix expansion by setting `max_expansions` to a reasonable number, such as 50. The cell below uses a value of 2:

In [21]:
# Adding some control on expansions:
s = Index('my_index', using=es).search()
s = s.query(Q('match_phrase_prefix', title={"query": "quick brown f", "max_expansions":2}))
s.execute()

<Response: [<Hit(my_index/my_type/1): {'title': 'The quick brown fox'}>, <Hit(my_index/my_type/2): {'title': 'The quick brown fox jumps over the lazy dog'}>, <Hit(my_index/my_type/3): {'title': 'The quick brown fox jumps over the quick dog'}>]>

The `max_expansions` parameter controls how many terms the prefix is allowed to match. It will find the first term starting with `f` and keep collecting terms (in alphabetical order) until it either runs out of terms with prefix `f`, or it has more terms than max_expansions.
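
A minimal sketch of that collection step, under stated assumptions: the term list is a made-up stand-in for the real term dictionary, the `expand_prefix` helper is hypothetical, and it stops once exactly `max_expansions` terms have been collected, which may differ in detail from what Elasticsearch does.

```python
from bisect import bisect_left

def expand_prefix(sorted_terms, prefix, max_expansions):
    """Collect terms starting with `prefix`, in sorted order, up to `max_expansions`."""
    expansions = []
    start = bisect_left(sorted_terms, prefix)  # index of the first term >= prefix
    for term in sorted_terms[start:]:
        # Stop when we leave the prefix range or hit the limit.
        if not term.startswith(prefix) or len(expansions) >= max_expansions:
            break
        expansions.append(term)
    return expansions

# Hypothetical term dictionary, kept sorted as the loop above requires
terms = sorted(["brown", "dog", "fable", "fox", "foxes", "fun", "jumps", "quick"])
print(expand_prefix(terms, "f", 2))
print(expand_prefix(terms, "f", 50))
```

With a low limit, terms that sort later (such as `fun` here) are never considered, so a small `max_expansions` trades completeness for speed.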

Don’t forget that we have to run this query every time the user types another character, so it needs to be fast. If the first set of results isn’t what users are after, they’ll keep typing until they get the results that they want.

#### Ngrams for Partial Matching

`prefix`, `wildcard`, and `regexp` queries work by iterating through the list of indexed terms and testing each one against the pattern. That is expensive compared with how search ideally works, which is to look up a token that is already in the index. Imagine instead that the prefix string itself is already stored in the index as a separate token.

We can prepare the index this way by tokenizing the docs into n-grams. For the word quick, the n-grams of each length are:

* Length 1 (unigram): [ q, u, i, c, k ]
* Length 2 (bigram): [ qu, ui, ic, ck ]
* Length 3 (trigram): [ qui, uic, ick ]
* Length 4 (four-gram): [ quic, uick ]
* Length 5 (five-gram): [ quick ]

For search-as-you-type, _edge n-grams_, which are always anchored to the start of the word, are better:

* q
* qu
* qui
* quic
* quick
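
Both lists can be reproduced with two short helpers. This is a plain-Python sketch for illustration: the function names are made up here, and the `min_gram`/`max_gram` parameter names are borrowed from the filter settings used later in this notebook.

```python
def ngrams(word, n):
    """All substrings of length n, sliding one character at a time."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def edge_ngrams(word, min_gram=1, max_gram=20):
    """Substrings anchored at the start of the word, from min_gram to max_gram long."""
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]

for n in range(1, 6):
    print(n, ngrams("quick", n))
print(edge_ngrams("quick"))
```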

#### Index-Time Search-as-You-Type

Edge n-grams are part of the tokenization process within the analysis flow.


In [22]:
body = {
    "settings": {
        "number_of_shards": 1, 
        "analysis": {
            "filter": {
                "autocomplete_filter": { 
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" 
                    ]
                }
            }
        }
    }
}
index.create_my_index(body=body)

In [24]:
# Now let's confirm how this analyzer works:
text = "quick brown"
tokens = es.indices.analyze(index='my_index', analyzer='autocomplete', text=text)['tokens']
for token in tokens:
    print('Pos {}: ({})'.format(token['position'], token['token']))

Pos 0: (q)
Pos 0: (qu)
Pos 0: (qui)
Pos 0: (quic)
Pos 0: (quick)
Pos 1: (b)
Pos 1: (br)
Pos 1: (bro)
Pos 1: (brow)
Pos 1: (brown)


In [26]:
# Update the mapping to try it out:
mapping = {
    "my_type": {
        "properties": {
            "name": {
                "type":     "string",
                "analyzer": "autocomplete"
            }
        }
    }
}
es.indices.put_mapping(index='my_index', doc_type='my_type', body=mapping)

{'acknowledged': True}

In [27]:
doc = { "name": "Brown foxes"    }
es.create(index='my_index', doc_type='my_type', body = doc, id=1)
doc = { "name": "Yellow furballs"    }
es.create(index='my_index', doc_type='my_type', body = doc, id=2)

{'_id': '2',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'my_type',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [28]:
#Now try a search:
s = Index('my_index', using=es).search()
s = s.query('match', name="brown fo")
s.execute()

<Response: [<Hit(my_index/my_type/1): {'name': 'Brown foxes'}>, <Hit(my_index/my_type/2): {'name': 'Yellow furballs'}>]>

Hmmm. What's going on here? Why did Yellow furballs score a hit?

Let's validate the query to see what Elasticsearch thinks it's looking for:

In [41]:
q = Q('match', name="brown fo").to_dict()
es.indices.validate_query(index='my_index', body={"query": q}, explain=1)

{'_shards': {'failed': 0, 'successful': 1, 'total': 1},
 'explanations': [{'explanation': 'Synonym(name:b name:br name:bro name:brow name:brown) Synonym(name:f name:fo)',
   'index': 'my_index',
   'valid': True}],
 'valid': True}

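The term `name:f` is part of the expanded query above, because the `autocomplete` analyzer was also applied to the query string. That term matches the `f` edge n-gram indexed for furballs, hence the unwanted hit.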

In this case, we want to use the `autocomplete` analyzer at index time, but the `standard` analyzer at search time.
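
To see why the search-time analyzer matters, here is a small pure-Python model of the two documents. It is a sketch under simplifying assumptions: the hypothetical `autocomplete_analyze` and `standard_analyze` helpers approximate the analyzers by lowercasing and splitting on whitespace, and a document counts as a hit if it shares at least one term with the query.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

def autocomplete_analyze(text):
    """Rough stand-in for the autocomplete analyzer: split, lowercase, edge n-grams."""
    return {gram for tok in text.lower().split() for gram in edge_ngrams(tok)}

def standard_analyze(text):
    """Rough stand-in for the standard analyzer: split and lowercase only."""
    return set(text.lower().split())

docs = {1: "Brown foxes", 2: "Yellow furballs"}
# Index time: every document goes through the autocomplete analyzer.
index_terms = {doc_id: autocomplete_analyze(name) for doc_id, name in docs.items()}

def match(query, search_analyzer):
    """Ids of documents sharing at least one term with the analyzed query."""
    query_terms = search_analyzer(query)
    return sorted(doc_id for doc_id, terms in index_terms.items() if terms & query_terms)

print(match("brown fo", autocomplete_analyze))  # query expands to b, br, ..., f, fo
print(match("brown fo", standard_analyze))      # query stays as brown, fo
```

With edge n-grams applied to the query, the single-letter term `f` is enough to pull in "Yellow furballs"; with plain terms, only "Brown foxes" matches.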

In [43]:
# Let's use the standard analyzer at search time
s = Index('my_index', using=es).search()
s = s.query('match', name={"query":"brown fo", "analyzer":"standard"})
s.execute()

<Response: [<Hit(my_index/my_type/1): {'name': 'Brown foxes'}>]>

We can also specify this in the mapping:

In [44]:
# Update the mapping to specify different index vs. search analyzers:
mapping = {
    "my_type": {
        "properties": {
            "name": {
                "type":     "string",
                "analyzer": "autocomplete",
                "search_analyzer": "standard"
            }
        }
    }
}
es.indices.put_mapping(index='my_index', doc_type='my_type', body=mapping)

{'acknowledged': True}

In [45]:
# Now try the search again without specifying an analyzer (the mapping's search_analyzer applies):
s = Index('my_index', using=es).search()
s = s.query('match', name="brown fo")
s.execute()

<Response: [<Hit(my_index/my_type/1): {'name': 'Brown foxes'}>]>

In [46]:
# And let's validate again:
q = Q('match', name="brown fo").to_dict()
es.indices.validate_query(index='my_index', body={"query": q}, explain=1)

{'_shards': {'failed': 0, 'successful': 1, 'total': 1},
 'explanations': [{'explanation': 'name:brown name:fo',
   'index': 'my_index',
   'valid': True}],
 'valid': True}