## Elasticsearch: The Definitive Guide - Python

Following the examples in the book, here are Python snippets that achieve the same effect.

Documentation for the Python libs:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic"):

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

GitHub repo for DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [1]:
import index
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

r = index.populate()
print('{} items created'.format(len(r['items'])))

# Let's repopulate the index as we deleted 'gb' in earlier chapters:
# Run the script: populate.ipynb

14 items created


### Sorting and Relevance

By default, results are returned sorted by relevance—with the most relevant docs first. Later in this chapter, we explain what we mean by relevance and how it is calculated, but let’s start by looking at the sort parameter and how to use it.

Relevance isn't always meaningful, for example when the query consists only of filters:

```
GET /_search
{
    "query" : {
        "bool" : {
            "filter" : {
                "term" : {
                    "user_id" : 1
                }
            }
        }
    }
}
```

In [2]:
# And a filter-only search
s = Search(using=es)
s = s.filter('term', user_id=1)
res = s.execute()

The docs are returned in effectively random order, and each one has a _score of 0:

In [3]:
for hit in res:
    print('Score:{}'.format(hit.meta.score))

Score:0.0
Score:0.0
Score:0.0
Score:0.0
Score:0.0
Score:0.0


In [4]:
res

<Response: [<Hit(us/tweet/14): {'date': '2014-09-24', 'tweet': 'How many more cheesy tweets...}>, <Hit(us/tweet/8): {'date': '2014-09-18', 'user_id': 1, 'name': 'John Smith'}>, <Hit(us/tweet/10): {'date': '2014-09-20', 'tweet': 'Elasticsearch surely is one...}>, <Hit(us/tweet/12): {'date': '2014-09-22', 'tweet': 'Elasticsearch and I have le...}>, <Hit(us/tweet/4): {'date': '2014-09-14', 'tweet': '@mary it is not just text, ...}>, <Hit(us/tweet/6): {'date': '2014-09-16', 'tweet': 'The Elasticsearch API is re...}>]>

In [5]:
# Or we can make sure the items have a constant non-zero score
s = Search(using=es).query('constant_score', filter=Q('term', user_id=1))
res = s.execute()

In [6]:
for hit in res:
    print('Score:{} with date of {}'.format(hit.meta.score,hit.date))

Score:1.0 with date of 2014-09-24
Score:1.0 with date of 2014-09-18
Score:1.0 with date of 2014-09-20
Score:1.0 with date of 2014-09-22
Score:1.0 with date of 2014-09-14
Score:1.0 with date of 2014-09-16


### Sorting by Field Values

```
GET /_search
{
    "query" : {
        "bool" : {
            "filter" : { "term" : { "user_id" : 1 }}
        }
    },
    "sort": { "date": { "order": "desc" }}
}
```

In [7]:
s = Search(using=es).query('bool', filter=Q('term', user_id=1))
s = s.sort({ "date": { "order": "desc" }})
res = s.execute()
# The results are now in descending date order:
for hit in res:
    print('Score:{} with date of {} and sort field:{}'
          .format(hit.meta.score,hit.date,hit.meta.sort))

Score:None with date of 2014-09-24 and sort field:[1411516800000]
Score:None with date of 2014-09-22 and sort field:[1411344000000]
Score:None with date of 2014-09-20 and sort field:[1411171200000]
Score:None with date of 2014-09-18 and sort field:[1410998400000]
Score:None with date of 2014-09-16 and sort field:[1410825600000]
Score:None with date of 2014-09-14 and sort field:[1410652800000]


Notice that the score is None: because the results are sorted on a field, the _score isn't needed and is not calculated. Each hit also gains a "sort" value, which is the value that was indexed internally and used to perform the sort (here, the date in milliseconds since the epoch).
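
The sort values in the output above can be converted back into dates to confirm that they match the `date` field. This is a minimal sketch using only the standard library, assuming the values are UTC milliseconds since the epoch; `sort_value_to_date` is an illustrative helper name, not part of either Elasticsearch library.

```python
from datetime import datetime, timezone

def sort_value_to_date(millis):
    """Convert epoch milliseconds (as seen in hit.meta.sort) to an ISO date string."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc).date().isoformat()

# Sort values copied from the first and last lines of the output above
for millis in (1411516800000, 1410652800000):
    print(millis, '->', sort_value_to_date(millis))
```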

### Multilevel Sorting

```
GET /_search
{
    "query" : {
        "bool" : {
            "must":   { "match": { "tweet": "manage text search" }},
            "filter" : { "term" : { "user_id" : 2 }}
        }
    },
    "sort": [
        { "date":   { "order": "desc" }},
        { "_score": { "order": "desc" }}
    ]
}
```

In [8]:
s = Search(using=es).query('bool', 
                           must=Q('match', tweet='manage text search'),
                           filter=Q('term', user_id=2))
s = s.sort({ "date":   { "order": "desc" }}, { "_score": { "order": "desc" }})
# Shorthand for the same sort (a leading '-' means descending; _score is descending by default):
#s = s.sort("-date", "_score")
res = s.execute()

In [9]:
for hit in res:
    print('Score:{} with date of {} and sort field:{}'
          .format(hit.meta.score,hit.date,hit.meta.sort))

Score:0.64433396 with date of 2014-09-15 and sort field:[1410739200000, 0.64433396]
Score:1.3434829 with date of 2014-09-13 and sort field:[1410566400000, 1.3434829]


Order is important. Results are sorted by the first criterion first. Only results whose first sort value is identical will then be sorted by the second criterion, and so on.

Multilevel sorting doesn’t have to involve the _score. You could sort by using several different fields, on geo-distance or on a custom value calculated in a script.
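
The effect of criterion order can be reproduced locally with plain Python. The sketch below uses made-up `(date, score)` pairs rather than real hits, and `multilevel_sort` is a hypothetical helper: the score only decides the order among entries that share the same date.

```python
# Hypothetical (date, score) pairs standing in for search hits
hits = [
    {'date': '2014-09-13', 'score': 1.34},
    {'date': '2014-09-15', 'score': 0.64},
    {'date': '2014-09-15', 'score': 0.90},
]

def multilevel_sort(hits):
    """Sort by date descending, then by score descending for equal dates."""
    # ISO dates compare correctly as strings; reverse=True applies to both key parts
    return sorted(hits, key=lambda h: (h['date'], h['score']), reverse=True)

for h in multilevel_sort(hits):
    print(h['date'], h['score'])
```

The hit with the highest score ends up last because its date is the oldest, which mirrors the real output above.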

#### Sorting on Multivalue Fields

Let's say we have fields with more than one item. How do we sort on them? For numbers and dates, you can reduce a multivalue field to a single value by using the min, max, avg, or sum sort modes. For instance, you could sort on the earliest date in each dates field by using the following:

```
"sort": {
    "dates": {
        "order": "asc",
        "mode":  "min"
    }
}
```

Let's create some docs to try this.

In [10]:
doc1 = {
    'title': 'How I Met Your Mother',
    'date': '2013-01-01',
    'ratings': [2,3,1,3,4,5,5,5,3,4,2]
}
doc2 = {
    'title': 'Breaking Bad',
    'date': '2013-01-01',
    'ratings': [5,5,4,3,4,5,5,5,3,5,5]
}
es.create(index='shows', doc_type='tv_series', body=doc1, id=1)
es.create(index='shows', doc_type='tv_series', body=doc2, id=2)


{'_id': '2',
 '_index': 'shows',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'tv_series',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [11]:
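# Restrict the search to the 'shows' index, the only one with a 'ratings' field
s = Search(using=es, index='shows')
# Reduce each doc's list of ratings to its average and sort on that, highest first
s = s.sort({ "ratings":   { "order": "desc", "mode":"avg" }})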
res = s.execute()

In [12]:
for hit in res:
    print(hit.title, hit.meta.sort)
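
To see what each sort mode reduces the `ratings` lists to, the arithmetic can be done locally. This is only an illustration of the min, max, avg, and sum modes in plain Python; `reduce_values` is a hypothetical helper, and the sort values Elasticsearch actually reports may differ in representation (for example through rounding, depending on the mapped field type).

```python
from statistics import mean

# Ratings copied from doc1 and doc2 above
ratings = {
    'How I Met Your Mother': [2, 3, 1, 3, 4, 5, 5, 5, 3, 4, 2],
    'Breaking Bad': [5, 5, 4, 3, 4, 5, 5, 5, 3, 5, 5],
}

# Local stand-ins for the min, max, avg, and sum sort modes
modes = {'min': min, 'max': max, 'avg': mean, 'sum': sum}

def reduce_values(values, mode):
    """Collapse a list of values to the single number a sort mode would use."""
    return modes[mode](values)

# Titles ordered by average rating, highest first
by_avg = sorted(ratings, key=lambda t: reduce_values(ratings[t], 'avg'), reverse=True)
for title in by_avg:
    print(title, round(reduce_values(ratings[title], 'avg'), 2))
```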

### String Sorting and Multifields

Sorting on text fields is problematic because an analyzed field is stored as a set of individual tokens (the output of the analyzer) rather than as the original string. If you really want to sort on a text field, sort on an unanalyzed copy of it. This can be done by adding a sub-field:

```
"tweet": { 
    "type":     "string",
    "analyzer": "english",
    "fields": {
        "raw": { 
            "type":  "string",
            "index": "not_analyzed"
        }
    }
}
```

And then sort on the raw field:

```
GET /_search
{
    "query": {
        "match": {
            "tweet": "elasticsearch"
        }
    },
    "sort": "tweet.raw"
}
```

First, I delete the tweet index and re-create it using template 2.

In [13]:
r = index.populate(template=2)

In [14]:
s = Search(using=es).query(Q('match', tweet='elasticsearch'))
s = s.sort("tweet.raw")
res = s.execute()

In [15]:
for hit in res:
    print(hit.meta.sort)
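
The difference between sorting on the analyzed field and on the raw field can be illustrated without a cluster. The sketch below is a rough approximation under stated assumptions: `naive_tokens` is a hypothetical stand-in for a real analyzer (it only lowercases and splits on whitespace), the tweets are invented, and the smallest token is assumed to act as the sort key for the analyzed form.

```python
# Hypothetical tweets; not taken from the indexed data
tweets = ['Zoo animals are fun', 'Big cats']

def naive_tokens(text):
    """Stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()

# Analyzed form: individual tokens are compared, so use the smallest token as the key
by_min_token = sorted(tweets, key=lambda t: min(naive_tokens(t)))

# Unanalyzed (raw) form: the whole string is compared
by_raw = sorted(tweets)

print(by_min_token)
print(by_raw)
```

The token-based order puts the "Zoo" tweet first because of the token `animals`, which is rarely what a reader expects from an alphabetical sort.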

### What is Relevance?

The relevance score of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant the document.

A query clause generates a _score for each document. How that score is calculated depends on the type of query clause. Different query clauses are used for different purposes: a fuzzy query might determine the _score by calculating how similar the spelling of the found word is to the original search term; a terms query would incorporate the percentage of terms that were found. However, what we usually mean by relevance is the algorithm that we use to calculate how similar the contents of a full-text field are to a full-text query string.

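The standard similarity algorithm described in the book is known as term frequency/inverse document frequency, or [TF/IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). Note that from Elasticsearch 5.0 onward the default similarity is BM25, which builds on the same ideas.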
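
As a rough sketch of the idea, the three classic factors can be written in a few lines of Python. The formulas follow the classic Lucene formulation as presented in the book (square-root term frequency, logarithmic inverse document frequency, and a field-length norm); the function names and the sample numbers are assumptions for illustration, and real scores from a cluster will differ with version, similarity settings, and per-shard statistics.

```python
import math

def tf(freq):
    """Term frequency: the more often the term appears in the field, the higher the weight."""
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms get a higher weight."""
    return 1 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms):
    """Field-length norm: shorter fields get a higher weight."""
    return 1 / math.sqrt(num_terms)

def score(freq, num_docs, doc_freq, num_terms):
    """Combine the three factors for a single term in a single field."""
    return tf(freq) * idf(num_docs, doc_freq) * field_norm(num_terms)

# Hypothetical numbers: the term appears once in a 4-term field, in 1 of 10 docs
print(score(freq=1, num_docs=10, doc_freq=1, num_terms=4))
```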

It can be difficult to understand how the relevance score was calculated, hence the availability of the explain parameter.

```
GET /_search?explain 
{
   "query"   : { "match" : { "tweet" : "honeymoon" }}
}
```

In [16]:
s = Search(using=es).query(Q('match', tweet='honeymoon'))
s = s.extra(explain=True)
res = s.execute()

In [18]:
index.RenderJSON(res['hits']['hits'])
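
Each explanation is a nested structure of `value`, `description`, and `details` entries, so it can also be printed as an indented tree. The sketch below is an alternative to the notebook helper; `format_explanation` is a hypothetical function, and the sample dict is invented for illustration rather than copied from a real response.

```python
def format_explanation(node, depth=0):
    """Return indented lines for a nested explanation dict."""
    lines = ['{}{} ({})'.format('  ' * depth, node['value'], node['description'])]
    # Recurse into the child explanations, if any
    for child in node.get('details', []):
        lines.extend(format_explanation(child, depth + 1))
    return lines

# Hypothetical explanation; values and descriptions are made up for illustration
sample = {
    'value': 0.5,
    'description': 'weight(tweet:honeymoon)',
    'details': [
        {'value': 1.0, 'description': 'tf', 'details': []},
        {'value': 0.5, 'description': 'fieldNorm', 'details': []},
    ],
}

print('\n'.join(format_explanation(sample)))
```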

In [29]:
s = Search(using=es).query(Q('match', tweet='honeymoon') & Q('match', _id=12))
s = s.extra(explain=True)
res = s.execute()

In [33]:
index.RenderJSON(res['hits']['hits'])