## Elasticsearch: The Definitive Guide - Python

Following the examples in the book, here are Python snippets that achieve the same effect.

Documentation for the Python libs:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic"):

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

GitHub repo for DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [3]:
import sys, os
sys.path.insert(1, os.path.join(sys.path[0], '..'))

In [4]:
import index
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q, Index
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

r = index.load_sid_examples(settings={"settings": {"number_of_shards": 1}}, set=3)
print('{} items created'.format(len(r['items'])))

# Let's repopulate the index as we deleted 'gb' in earlier chapters:
# Run the script: populate.ipynb

4 items created


### Multifield Search

#### Cross-fields Entity Search

Data is often spread across many fields:

```json
{
    "street":   "5 Poland Street",
    "city":     "London",
    "country":  "United Kingdom",
    "postcode": "W1V 3DG"
}
```

Here we are not concerned with multiple query strings. Instead, we want to look at a _single_ query string such as "Poland Street W1V". Because parts of this string appear in different fields of the document, `dis_max` / `best_fields` will not work: both look for the _single_ best-matching field.

#### A Naive Approach

We could try this:
```json
{
  "query": {
    "bool": {
      "should": [
        { "match": { "street":    "Poland Street W1V" }},
        { "match": { "city":      "Poland Street W1V" }},
        { "match": { "country":   "Poland Street W1V" }},
        { "match": { "postcode":  "Poland Street W1V" }}
      ]
    }
  }
}
```

This is better expressed as a `multi_match` query:
```json
{
  "query": {
    "multi_match": {
      "query":       "Poland Street W1V",
      "type":        "most_fields",
      "fields":      [ "street", "city", "country", "postcode" ]
    }
  }
}
```
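To make the relationship between the two request bodies concrete, here is a minimal Python sketch that builds both from the same field list. The helper names `naive_bool_query` and `most_fields_query` are made up for this illustration; they only build dictionaries and do not contact a cluster.

```python
from pprint import pprint

FIELDS = ["street", "city", "country", "postcode"]
QUERY = "Poland Street W1V"


def naive_bool_query(query, fields):
    """Hypothetical helper: one match clause per field inside bool/should."""
    return {"query": {"bool": {"should": [{"match": {f: query}} for f in fields]}}}


def most_fields_query(query, fields):
    """Hypothetical helper: the same search written as a multi_match query."""
    return {
        "query": {
            "multi_match": {
                "query": query,
                "type": "most_fields",
                "fields": list(fields),
            }
        }
    }


naive_body = naive_bool_query(QUERY, FIELDS)
multi_body = most_fields_query(QUERY, FIELDS)
pprint(multi_body)

# With a running cluster and the `es` client from the first cell, either body
# could be sent with something like: es.search(index='my_index', body=multi_body)
```

The `multi_match` form is shorter and keeps the query string in one place, which matters once the field list grows.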

However:

The most_fields approach to entity search has some problems that are not immediately obvious:

* It is designed to find the most fields matching **any words**, rather than to find the most matching words across **all fields.**
* It can’t use the `operator` or `minimum_should_match` parameters to reduce the long tail of less-relevant results.
* Term frequencies are different in each field and could interfere with each other to produce badly ordered results.

#### Field-Centric Queries

All three of these problems stem from `most_fields` being field-centric rather than term-centric: it looks for the most matching fields, when what we want is the most matching terms. (The same is true of `best_fields`.)

Let's look at why these problems exist:

##### Problem 1 - Matching the same word in multiple fields


In [24]:
# Let's confirm how the most_fields query works by validating the query
body = {
  "query": {
    "multi_match": {
      "query":   "Poland Street W1V",
      "type":    "most_fields",
      "fields":  [ "street", "city", "country", "postcode" ]
    }
  }
}
result = es.indices.validate_query(index='my_index', body=body, explain=True)
result['explanations'][0]['explanation']

'(city:poland city:street city:w1v) (country:poland country:street country:w1v) (postcode:poland postcode:street postcode:w1v) (street:poland street:street street:w1v)'

You can see that a document matching just the word `poland` in two fields could score higher than a document matching `poland` and `street` in one field.

NOTE: The validated explanation shows the query in [query string](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html) syntax.
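The difference between counting matching fields and counting matching terms can be sketched without a cluster. The snippet below is a toy model, not Lucene's scoring: it splits on whitespace and only counts matches, and both example documents are invented for the illustration.

```python
QUERY_TERMS = ["poland", "street", "w1v"]


def tokens(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def fields_matching_any_term(doc, terms):
    """Field-centric view: how many fields contain at least one query term."""
    return sum(1 for value in doc.values() if any(t in tokens(value) for t in terms))


def distinct_terms_matched(doc, terms):
    """Term-centric view: how many query terms appear anywhere in the document."""
    all_tokens = {tok for value in doc.values() for tok in tokens(value)}
    return sum(1 for t in terms if t in all_tokens)


# Invented documents: one repeats "poland" in two fields,
# the other matches "poland" and "street" in a single field.
repeated_word = {"street": "12 Poland Road", "city": "Poland",
                 "country": "USA", "postcode": "04274"}
two_words = {"street": "5 Poland Street", "city": "London",
             "country": "United Kingdom", "postcode": "SW1A 1AA"}

for name, doc in [("repeated_word", repeated_word), ("two_words", two_words)]:
    print(name,
          "fields:", fields_matching_any_term(doc, QUERY_TERMS),
          "terms:", distinct_terms_matched(doc, QUERY_TERMS))
```

Under the field-centric count, the document that repeats one word comes out ahead, even though the other document matches more of the query.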

##### Problem 2 - Trimming the long tail

Perhaps we could try this:


In [25]:
# Add the "and" operator so that every term is required
body = {
  "query": {
    "multi_match": {
      "query":   "Poland Street W1V",
      "type":    "most_fields",
      "operator": "and",
      "fields":  [ "street", "city", "country", "postcode" ]
    }
  }
}
result = es.indices.validate_query(index='my_index', body=body, explain=True)
result['explanations'][0]['explanation']

'(+city:poland +city:street +city:w1v) (+country:poland +country:street +country:w1v) (+postcode:poland +postcode:street +postcode:w1v) (+street:poland +street:street +street:w1v)'

The `+` prefix marks a required term, so this query requires all words to appear in the same field, which is clearly wrong! It is unlikely that any documents would match this query.
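A small offline check shows why the per-field `and` fails for the address document from the start of this section. This is a sketch that assumes simple whitespace tokenization rather than a real analyzer, and the two function names are invented for the illustration.

```python
QUERY_TERMS = ["poland", "street", "w1v"]

address = {
    "street":   "5 Poland Street",
    "city":     "London",
    "country":  "United Kingdom",
    "postcode": "W1V 3DG",
}


def tokens(text):
    """Rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def all_terms_in_one_field(doc, terms):
    """Field-centric 'and': some single field must contain every term."""
    return any(all(t in tokens(value) for t in terms) for value in doc.values())


def all_terms_across_fields(doc, terms):
    """Term-centric 'and': every term must appear in at least one field."""
    all_tokens = {tok for value in doc.values() for tok in tokens(value)}
    return all(t in all_tokens for t in terms)


print("per-field and:    ", all_terms_in_one_field(address, QUERY_TERMS))
print("across-fields and:", all_terms_across_fields(address, QUERY_TERMS))
```

The document contains every query term, but no single field does, so the field-centric check rejects it.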

##### Problem 3 - Term Frequencies

[What Is Relevance?](https://www.elastic.co/guide/en/elasticsearch/guide/master/relevance-intro.html) explains that the default similarity algorithm used to calculate the relevance score for each term is TF/IDF. Two of its factors matter here:

##### Term frequency
>The more often a term appears in a field in a single document, the more relevant the document.

##### Inverse document frequency
>The more often a term appears in a field in all documents in the index, the less relevant is that term.

When searching against multiple fields, TF/IDF can introduce some surprising results.

Consider searching for “Peter Smith” using `first_name` and `last_name` fields. Peter is a common first name and Smith is a common last name, so both will have low IDFs. But what if we have another person in the index whose name is Smith Williams? Smith as a first name is very uncommon and so will have a high IDF!

A simple query like the following may well return Smith Williams above Peter Smith, even though Peter Smith is clearly the better match.

```json
{
    "query": {
        "multi_match": {
            "query":       "Peter Smith",
            "type":        "most_fields",
            "fields":      [ "*_name" ]
        }
    }
}
```
The high IDF of `smith` in the `first_name` field can overwhelm the two low IDFs of `peter` as a first name and `smith` as a last name.
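A rough calculation makes the effect visible. The sketch below uses the IDF formula from Lucene's classic TF/IDF similarity, `1 + ln(numDocs / (docFreq + 1))`, and simply sums the IDF of every matching field/term pair. That sum is a deliberate simplification (it ignores term frequency, field-length norms and the coordination factor), and the document counts are invented for the illustration.

```python
import math

NUM_DOCS = 1000

# Invented per-field document frequencies: (field, term) -> number of docs.
DOC_FREQ = {
    ("first_name", "peter"): 499,  # Peter is a common first name
    ("last_name", "smith"): 499,   # Smith is a common last name
    ("first_name", "smith"): 1,    # Smith is a very rare first name
}


def idf(field, term):
    """Classic Lucene-style IDF, computed per field."""
    doc_freq = DOC_FREQ.get((field, term), 0)
    return 1 + math.log(NUM_DOCS / (doc_freq + 1))


def toy_score(doc, terms):
    """Simplified score: sum the IDF of each (field, term) pair that matches."""
    total = 0.0
    for field, value in doc.items():
        field_tokens = value.lower().split()
        for term in terms:
            if term in field_tokens:
                total += idf(field, term)
    return total


terms = ["peter", "smith"]
peter_smith = {"first_name": "Peter", "last_name": "Smith"}
smith_williams = {"first_name": "Smith", "last_name": "Williams"}

print("Peter Smith:   ", round(toy_score(peter_smith, terms), 2))
print("Smith Williams:", round(toy_score(smith_williams, terms), 2))
```

With these counts, the single rare match in `first_name` outweighs the two common matches combined.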

#### Solution

These problems only exist because we are dealing with multiple fields. If we were to combine all of these fields into a single field, the problems would vanish. We could achieve this by adding a `full_name` field to our person document:

```json
{
    "first_name":  "Peter",
    "last_name":   "Smith",
    "full_name":   "Peter Smith"
}
```

When querying just the `full_name` field:

* Documents with more matching words would trump documents with the same word repeated.
* The `minimum_should_match` and `operator` parameters would function as expected.
* The inverse document frequencies for first and last names would be combined so it wouldn’t matter whether Smith were a first or last name anymore.
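The last point can be checked with a back-of-the-envelope calculation. The sketch below assumes the classic Lucene IDF formula, `1 + ln(numDocs / (docFreq + 1))`, sums the IDF of each matching term as a simplified score, and uses invented document counts in which `smith` is common in the combined field regardless of whether it came from a first or a last name.

```python
import math

NUM_DOCS = 1000

# Invented document frequencies for the combined full_name field.
FULL_NAME_DOC_FREQ = {
    "peter": 499,  # all from first names
    "smith": 500,  # 499 last names plus 1 first name
}


def idf(term):
    """Classic Lucene-style IDF for the single combined field."""
    return 1 + math.log(NUM_DOCS / (FULL_NAME_DOC_FREQ.get(term, 0) + 1))


def toy_score(full_name, terms):
    """Simplified score: sum the IDF of each query term found in full_name."""
    name_tokens = full_name.lower().split()
    return sum(idf(term) for term in terms if term in name_tokens)


terms = ["peter", "smith"]
peter_score = toy_score("Peter Smith", terms)
williams_score = toy_score("Smith Williams", terms)

print("Peter Smith:   ", round(peter_score, 2))
print("Smith Williams:", round(williams_score, 2))
```

Because `smith` now has a single, low IDF, the document matching both words ranks first.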

While this would work, we don’t like having to store redundant data. Instead, Elasticsearch offers us two solutions, one at index time and one at search time.

#### Custom `_all` Fields

[Metadata: _all Field](https://www.elastic.co/guide/en/elasticsearch/guide/master/root-object.html#all-field) explains that the special `_all` field indexes the values of all other fields as one big string. A more flexible approach is to have one custom `_all` field for the person’s name and another custom `_all` field for the address.

This can be done using the `copy_to` parameter in field mappings:

```
PUT /my_index
{
    "mappings": {
        "person": {
            "properties": {
                "first_name": {
                    "type":     "string",
                    "copy_to":  "full_name"
                },
                "last_name": {
                    "type":     "string",
                    "copy_to":  "full_name"
                },
                "full_name": {
                    "type":     "string"
                }
            }
        }
    }
}
```

With this mapping in place, we can query the `first_name` field for first names, the `last_name` field for last names, or the `full_name` field for first and last names.
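In Python, the same mapping and a query against the combined field might look like the sketch below. It mirrors the book's mapping (a `person` type with `string` fields), which newer Elasticsearch versions no longer accept as written. The helper `full_name_query` is invented for this illustration, and the client calls are left as comments because they assume a running cluster and a client version that accepts a `body` argument, as `validate_query` does elsewhere in this notebook.

```python
from pprint import pprint

# Same mapping as the PUT request above, as a Python dictionary.
PERSON_MAPPING = {
    "mappings": {
        "person": {
            "properties": {
                "first_name": {"type": "string", "copy_to": "full_name"},
                "last_name":  {"type": "string", "copy_to": "full_name"},
                "full_name":  {"type": "string"},
            }
        }
    }
}


def full_name_query(text, operator="and"):
    """Hypothetical helper: a match query against the combined full_name field."""
    return {"query": {"match": {"full_name": {"query": text, "operator": operator}}}}


search_body = full_name_query("Peter Smith")
pprint(search_body)

# Assumed usage with the `es` client from the first cell:
# es.indices.create(index='my_index', body=PERSON_MAPPING)
# es.search(index='my_index', body=search_body)
```

Because the query targets a single field, `operator` applies across what used to be two separate fields.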

**NOTE:** The `copy_to` setting will not work on a multi-field. If you attempt to configure your mapping this way, Elasticsearch will throw an exception.

Just add the `copy_to` to the main field, **not** the multi-field:

```
PUT /my_index
{
    "mappings": {
        "person": {
            "properties": {
                "first_name": {
                    "type":     "string",
                    "copy_to":  "full_name",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                },
                "full_name": {
                    "type":     "string"
                }
            }
        }
    }
}
```
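Since a misplaced `copy_to` only surfaces when the mapping is sent to the cluster, it can be handy to check a mapping dictionary locally first. The function below is a hypothetical helper written for this note, not part of any Elasticsearch library: it walks a `properties` dictionary and reports any multi-field (an entry under `fields`) that carries `copy_to`.

```python
def find_copy_to_on_multi_fields(properties, prefix=""):
    """Return dotted paths of multi-fields that (incorrectly) define copy_to."""
    offenders = []
    for name, spec in properties.items():
        path = prefix + name
        # Multi-fields live under the "fields" key of the main field.
        for sub_name, sub_spec in spec.get("fields", {}).items():
            if "copy_to" in sub_spec:
                offenders.append(path + "." + sub_name)
        # Recurse into nested object properties, if any.
        offenders.extend(
            find_copy_to_on_multi_fields(spec.get("properties", {}), path + ".")
        )
    return offenders


# copy_to on the main field, as recommended above.
good_properties = {
    "first_name": {
        "type": "string",
        "copy_to": "full_name",
        "fields": {"raw": {"type": "string", "index": "not_analyzed"}},
    },
    "full_name": {"type": "string"},
}

# copy_to on the multi-field, which the note above warns against.
bad_properties = {
    "first_name": {
        "type": "string",
        "fields": {
            "raw": {"type": "string", "index": "not_analyzed", "copy_to": "full_name"}
        },
    },
    "full_name": {"type": "string"},
}

print(find_copy_to_on_multi_fields(good_properties))
print(find_copy_to_on_multi_fields(bad_properties))
```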