## Elasticsearch: The Definitive Guide - Python

Following the examples in the book, here are Python snippets that achieve the same effect.

Documentation for the Python libs:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic"):

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

GitHub repo for the DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [1]:
import sys, os
sys.path.insert(1, os.path.join(sys.path[0], '..'))

In [2]:
import index
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

# Let's repopulate the index, as we deleted 'gb' in earlier chapters
# (see also the notebook populate.ipynb):
r = index.populate()
print('{} items created'.format(len(r['items'])))

14 items created


### Typoes and Mispelings

Full-text search that only matches exactly will probably frustrate your users. Wouldn’t you expect a search for “quick brown fox” to match a document containing “fast brown foxes,” “Johnny Walker” to match “Johnnie Walker,” or “Arnold Shcwarzenneger” to match “Arnold Schwarzenegger”?

Fuzzy matching allows for query-time matching of misspelled words, while phonetic token filters at index time can be used for sounds-like matching.

#### Fuzziness

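Fuzzy matching treats two words that are “fuzzily” similar as if they were the same word. First, we need to define what we mean by fuzziness: it is measured as an edit distance, such as the Damerau-Levenshtein distance, which counts the single-character edits (insertions, deletions, substitutions, and transpositions of adjacent characters) needed to turn one word into another.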

Damerau observed that 80% of human misspellings have an edit distance of 1. In other words, 80% of misspellings could be corrected with a single edit to the original string.
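
To make the notion of edit distance concrete, here is a minimal sketch in plain Python. It implements the "optimal string alignment" variant of Damerau-Levenshtein distance; the function name `osa_distance` is made up for this notebook, and this is not necessarily how Elasticsearch or Lucene computes distances internally.

```python
def osa_distance(a, b):
    """Edit distance counting insertions, deletions, substitutions
    and transpositions of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # adjacent transposition, e.g. "hc" <-> "ch"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


print(osa_distance("surprize", "surprise"))   # 1
print(osa_distance("surprize", "surprised"))  # 2
print(osa_distance("hat", "mad"))             # 2
print(osa_distance("Shcwarz", "Schwarz"))     # 1 (one transposition)
```

A single swapped pair of letters counts as one edit here, which is what distinguishes Damerau-Levenshtein from plain Levenshtein distance.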

Elasticsearch supports a maximum edit distance, specified with the fuzziness parameter, of 2.

Of course, the impact that a single edit has on a string depends on the length of the string. Two edits to the word hat can produce mad, so allowing two edits on a string of length 3 is overkill. The fuzziness parameter can be set to AUTO, which results in the following maximum edit distances:

* 0 for strings of one or two characters
* 1 for strings of three, four, or five characters
* 2 for strings of more than five characters

Of course, you may find that an edit distance of 2 is still overkill, and returns results that don’t appear to be related. You may get better results, and better performance, with a maximum fuzziness of 1.
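
The AUTO rule listed above is simple enough to write down as a function. This is only an illustration of the thresholds described in the text; `auto_fuzziness` is a hypothetical helper, not part of any Elasticsearch client.

```python
def auto_fuzziness(term):
    """Maximum edit distance allowed for `term` under the AUTO rule above."""
    length = len(term)
    if length <= 2:
        return 0  # one or two characters: exact match only
    if length <= 5:
        return 1  # three to five characters: one edit
    return 2      # more than five characters: two edits


for word in ["me", "hat", "surprize"]:
    print(word, auto_fuzziness(word))
```

So `hat` is allowed a single edit under AUTO, while `surprize` (eight characters) is allowed two.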

In [33]:
data = ['Surprise me!', 'That was surprising.', "I wasn't surprised."]
for i, txt in enumerate(data):
    # Index each sentence as its own document, using its list position as the ID
    es.create(index='my_index', doc_type='my_type', id=i, body={'text': txt})

In [35]:
body = {
  "query": {
    "fuzzy": {
      "text": "surprize"
    }
  }
}
es.search(body=body)

{'_shards': {'failed': 0, 'successful': 16, 'total': 16},
 'hits': {'hits': [{'_id': '0',
    '_index': 'my_index',
    '_score': 0.22585157,
    '_source': {'text': 'Surprise me!'},
    '_type': 'my_type'},
   {'_id': '2',
    '_index': 'my_index',
    '_score': 0.1898702,
    '_source': {'text': "I wasn't surprised."},
    '_type': 'my_type'}],
  'max_score': 0.22585157,
  'total': 2},
 'timed_out': False,
 'took': 5}

The fuzzy query is a term-level query, so it doesn’t do any analysis. It takes a single term and finds all terms in the term dictionary that are within the specified fuzziness. The default fuzziness is AUTO.

In our example, surprize is within an edit distance of 2 from both surprise and surprised, so documents 0 and 2 match (the IDs start at 0 because they come from `enumerate`). We could reduce the matches to just surprise with the following query:


In [36]:
body = {
  "query": {
    "fuzzy": {
      "text": {
        "value": "surprize",
        "fuzziness": 1
      }
    }
  }
}
es.search(body=body)

{'_shards': {'failed': 0, 'successful': 16, 'total': 16},
 'hits': {'hits': [{'_id': '0',
    '_index': 'my_index',
    '_score': 0.22585157,
    '_source': {'text': 'Surprise me!'},
    '_type': 'my_type'}],
  'max_score': 0.22585157,
  'total': 1},
 'timed_out': False,
 'took': 3}

#### Improving Performance

The fuzzy query works by taking the original term and building a Levenshtein automaton—like a big graph representing all the strings that are within the specified edit distance of the original string.

The fuzzy query then uses the automaton to step efficiently through all of the terms in the term dictionary to see if they match. Once it has collected all of the matching terms that exist in the term dictionary, it can compute the list of matching documents.
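
The effect of this term expansion can be imitated with a brute-force scan. The sketch below is not a Levenshtein automaton: it simply computes a plain Levenshtein distance (no transpositions, to keep it short) against every term in a toy term dictionary, which is the slow way to get the same list of matching terms. The term set and the helper names are assumptions made up for illustration.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def expand_fuzzy(term, term_dictionary, max_edits=2):
    """Return every dictionary term within `max_edits` of `term`."""
    return sorted(t for t in term_dictionary if edit_distance(term, t) <= max_edits)


# Roughly the terms that our three example sentences might produce
terms = {"surprise", "me", "that", "was", "surprising", "i", "wasn't", "surprised"}

print(expand_fuzzy("surprize", terms, max_edits=2))  # ['surprise', 'surprised']
print(expand_fuzzy("surprize", terms, max_edits=1))  # ['surprise']
```

The two results line up with the two fuzzy searches run earlier: an edit distance of 2 reaches both surprise and surprised, while an edit distance of 1 reaches only surprise.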

Of course, depending on the type of data stored in the index, a fuzzy query with an edit distance of 2 can match a very large number of terms and perform very badly. Two parameters can be used to limit the performance impact:

##### prefix_length

>The number of initial characters that will not be “fuzzified.” **Most spelling errors occur toward the end of the word, not toward the beginning.** By using a prefix_length of 3, for example, you can significantly reduce the number of matching terms.

##### max_expansions

>If a fuzzy query expands to three or four fuzzy options, the new options may be meaningful. If it produces 1,000 options, they are essentially meaningless. Use max_expansions to limit the total number of options that will be produced. The fuzzy query will collect matching terms until it runs out of terms or reaches the max_expansions limit.
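
Both parameters go alongside `fuzziness` in the body of the fuzzy query. Here is a small sketch that builds such a body; `fuzzy_query` is a hypothetical convenience function for this notebook (not part of elasticsearch-py), and its default values are chosen to mirror the documented defaults, so check them against your Elasticsearch version.

```python
def fuzzy_query(field, value, fuzziness="AUTO", prefix_length=0, max_expansions=50):
    """Build a search body for a fuzzy query on a single field."""
    return {
        "query": {
            "fuzzy": {
                field: {
                    "value": value,
                    "fuzziness": fuzziness,
                    # leading characters that must match exactly
                    "prefix_length": prefix_length,
                    # upper bound on the number of terms the query expands to
                    "max_expansions": max_expansions,
                }
            }
        }
    }


body = fuzzy_query("text", "surprize", fuzziness=1, prefix_length=3, max_expansions=10)
print(body)
# To run it against the cluster (needs the `es` client created earlier):
# es.search(index='my_index', body=body)
```

With `prefix_length=3`, only terms starting with `sur` are considered, so far fewer terms have to be compared.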

#### Fuzzy Match Query

The `match` query supports fuzzy matching out of the box:

In [37]:
body= {
  "query": {
    "match": {
      "text": {
        "query":     "SURPRIZE ME!",
        "fuzziness": "AUTO",
        "operator":  "and"
      }
    }
  }
}
es.search(body=body)

{'_shards': {'failed': 0, 'successful': 16, 'total': 16},
 'hits': {'hits': [{'_id': '0',
    '_index': 'my_index',
    '_score': 0.48396763,
    '_source': {'text': 'Surprise me!'},
    '_type': 'my_type'}],
  'max_score': 0.48396763,
  'total': 1},
 'timed_out': False,
 'took': 6}

The `multi_match` query accepts the same `fuzziness` parameter:

In [38]:
body = {
  "query": {
    "multi_match": {
      "fields":  [ "text", "title" ],
      "query":     "SURPRIZE ME!",
      "fuzziness": "AUTO"
    }
  }
}
es.search(body=body)

{'_shards': {'failed': 0, 'successful': 16, 'total': 16},
 'hits': {'hits': [{'_id': '0',
    '_index': 'my_index',
    '_score': 0.48396763,
    '_source': {'text': 'Surprise me!'},
    '_type': 'my_type'},
   {'_id': '2',
    '_index': 'my_index',
    '_score': 0.1898702,
    '_source': {'text': "I wasn't surprised."},
    '_type': 'my_type'}],
  'max_score': 0.48396763,
  'total': 2},
 'timed_out': False,
 'took': 7}

In [39]:
# Let's add some more data to test how fuzziness relates to relevance:
data = ['The element of surprize!', 'That is surprising.', 'Inside every Kinder egg is a surprise.']
for i, txt in enumerate(data):
    # Offset the IDs by 3 so they do not collide with documents 0-2 created earlier
    es.create(index='my_index', doc_type='my_type', id=i + 3, body={'text': txt})

In [41]:
body= {
  "query": {
    "match": {
      "text": {
        "query":     "SURPRIZE!",
        "fuzziness": "AUTO"
      }
    }
  }
}
es.search(body=body)

{'_shards': {'failed': 0, 'successful': 16, 'total': 16},
 'hits': {'hits': [{'_id': '2',
    '_index': 'my_index',
    '_score': 0.45747715,
    '_source': {'text': "I wasn't surprised."},
    '_type': 'my_type'},
   {'_id': '3',
    '_index': 'my_index',
    '_score': 0.2876821,
    '_source': {'text': 'The element of surprize!'},
    '_type': 'my_type'},
   {'_id': '5',
    '_index': 'my_index',
    '_score': 0.2500978,
    '_source': {'text': 'Inside every Kinder egg is a surprise.'},
    '_type': 'my_type'},
   {'_id': '0',
    '_index': 'my_index',
    '_score': 0.22585157,
    '_source': {'text': 'Surprise me!'},
    '_type': 'my_type'}],
  'max_score': 0.45747715,
  'total': 4},
 'timed_out': False,
 'took': 8}

#### Scoring Fuzziness

Imagine that we have 1,000 documents containing “Schwarzenegger,” and just one document with the misspelling “Schwarzeneger.” According to the theory of term frequency/inverse document frequency, the misspelling is much more relevant than the correct spelling, because it appears in far fewer documents! In other words, if fuzzy matches were scored like any other match, misspellings would be favored over correct spellings.
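
To put rough numbers on this, here is a back-of-the-envelope calculation. It assumes the classic Lucene-style formula `idf = 1 + ln(num_docs / (doc_freq + 1))`; the similarity your cluster actually uses may differ (newer versions default to BM25), so treat the values as illustrative only.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Inverse document frequency, classic TF/IDF style (assumed formula)."""
    return 1 + math.log(num_docs / (doc_freq + 1))


num_docs = 1001  # 1,000 correct spellings + 1 misspelling

idf_correct = classic_idf(doc_freq=1000, num_docs=num_docs)
idf_typo = classic_idf(doc_freq=1, num_docs=num_docs)

print('correct spelling: {:.2f}'.format(idf_correct))  # 1.00
print('misspelling:      {:.2f}'.format(idf_typo))     # 7.22
```

Under this formula the rare misspelling carries roughly seven times the weight of the correct spelling, which is the opposite of what a user would want.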


Fuzzy queries alone are much less useful than they initially appear. They are better used as part of a “bigger” feature, such as the search-as-you-type completion suggester or the did-you-mean phrase suggester.

#### Phonetic Matching

It can be useful to match by phonetic similarity, that is, words that sound alike despite being spelled differently:


In [44]:
settings = {
  "settings": {
    "analysis": {
      "filter": {
        "dbl_metaphone": { 
          "type":    "phonetic",
          "encoder": "double_metaphone"
        }
      },
      "analyzer": {
        "dbl_metaphone": {
          "tokenizer": "standard",
          "filter":    "dbl_metaphone" 
        }
      }
    }
  }
}


This won't work as-is: it requires the [Phonetic Analysis plugin](https://www.elastic.co/guide/en/elasticsearch/plugins/5.2/analysis-phonetic.html) to be installed.
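
Since the settings above cannot be tried without the plugin, here is a rough pure-Python illustration of what "sounds-like" encoding means. Note that it uses Soundex, a much simpler algorithm than the Double Metaphone encoder named in the settings, so the codes below are not what Elasticsearch would produce; the `soundex` function is written for this notebook only.

```python
# Soundex digit for each consonant group; vowels, h, w and y have no digit
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character American Soundex code of `word`."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = []
    prev = CODES.get(letters[0], "")  # the first letter's digit also blocks repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same digit
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded.append(code)
        prev = code  # a vowel resets prev, so a repeated digit is kept
    # keep the first letter, then pad with zeros to four characters
    return (letters[0].upper() + "".join(encoded) + "000")[:4]


for name in ["Johnny", "Johnnie", "Schwarzenegger", "Shcwarzenneger"]:
    print(name, soundex(name))
```

Words that receive the same code would be indexed as the same token by a phonetic filter, which is how “Johnny” can match “Johnnie”.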