## Elasticsearch: The Definitive Guide - Python

These Python snippets reproduce the examples from the book.

Documentation for the Python libraries:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic"):

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

GitHub repo for DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [1]:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

### Mapping and Analysis

> GET /gb/_mapping/tweet

In [2]:
res = es.indices.get_mapping(index='gb', doc_type='tweet')
pprint(res)

{'gb': {'mappings': {'tweet': {'properties': {'date': {'type': 'date'},
                                              'name': {'fields': {'keyword': {'ignore_above': 256,
                                                                              'type': 'keyword'}},
                                                       'type': 'text'},
                                              'tweet': {'fields': {'keyword': {'ignore_above': 256,
                                                                               'type': 'keyword'}},
                                                        'type': 'text'},
                                              'user_id': {'type': 'long'}}}}}}


These queries return the same results because the analyzer tokenizes and normalizes the string `2014-09-15` into the terms `2014`, `09`, and `15`.

> GET /_search?q=2014-09-15        # 12 results !

In [3]:
res = es.search(q='2014-09-15')
print(res['hits']['total'])

12


This search runs against the `_all` meta field, so a hit is registered for every tweet in which any of these terms appears.

In [4]:
s = Search(using=es) \
    .query('match', _all='2014-09-15')
response = s.execute()
print('Total hits:{}\n'.format(response['hits']['total']))

Total hits:12
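The broad match above follows from the tokenization of the query string. A rough pure-Python approximation makes this concrete; `rough_standard_tokens` is a hypothetical helper, and its regular expression only mimics the real `standard` analyzer for simple ASCII input like this date string.

```python
import re


def rough_standard_tokens(text):
    """Approximate the standard analyzer: split on non-word characters, lowercase."""
    return [t.lower() for t in re.findall(r'\w+', text)]


# The hyphens act as separators, so the date becomes three separate terms.
print(rough_standard_tokens('2014-09-15'))  # ['2014', '09', '15']
```

Any document whose `_all` field contains one of these three terms counts as a match, which is why the hit count is so high.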



If we query the `date` field explicitly, only the one tweet with that date is returned:

In [5]:
s = Search(using=es) \
    .query('match', date='2014-09-15')
response = s.execute()
print('Total hits:{}\n'.format(response['hits']['total']))

Total hits:1
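The difference between 12 hits and 1 hit comes down to how each field is indexed: `_all` is analyzed full text, while `date` holds an exact value. The toy sketch below illustrates the idea without a cluster. The `docs` list and both helper functions are made up for illustration and are not the data in the `gb` index.

```python
import re

# Hypothetical documents, invented for this sketch.
docs = [
    {'date': '2014-09-13', 'tweet': 'first note'},
    {'date': '2014-09-15', 'tweet': 'second note'},
    {'date': '2014-09-17', 'tweet': 'third note'},
]


def tokens(text):
    """Crude stand-in for an analyzer: lowercase and split on non-word characters."""
    return re.findall(r'\w+', text.lower())


def match_all_field(query):
    """Full-text style: a document matches if it shares any term with the query."""
    wanted = set(tokens(query))
    return [d for d in docs if wanted & set(tokens(' '.join(d.values())))]


def match_date_field(query):
    """Exact-value style: the whole date must be equal to the query."""
    return [d for d in docs if d['date'] == query]


print(len(match_all_field('2014-09-15')))   # 3: every document contains '2014' and '09'
print(len(match_date_field('2014-09-15')))  # 1: only one document has that exact date
```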



#### Testing Analyzers

We can test analyzers using the `_analyze` API:
```
GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}
```

In [6]:
text = "Text to analyze"
res = es.indices.analyze(analyzer='standard', body=text)
pprint(res)

{'tokens': [{'end_offset': 4,
             'position': 0,
             'start_offset': 0,
             'token': 'text',
             'type': '<ALPHANUM>'},
            {'end_offset': 7,
             'position': 1,
             'start_offset': 5,
             'token': 'to',
             'type': '<ALPHANUM>'},
            {'end_offset': 15,
             'position': 2,
             'start_offset': 8,
             'token': 'analyze',
             'type': '<ALPHANUM>'}]}
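The REST request above sends the analyzer and the text together as a JSON body. A sketch that mirrors that shape from Python is shown here, assuming a client version whose `indices.analyze` accepts a `body` dict (check the documentation for the version you have installed). The helper name `analyze_tokens` is hypothetical.

```python
def analyze_tokens(client, text, analyzer='standard'):
    """Run `text` through `analyzer` and return only the token strings.

    `client` is assumed to be an Elasticsearch client instance, such as
    the `es` object created at the top of this notebook.
    """
    body = {'analyzer': analyzer, 'text': text}  # same shape as the REST body
    res = client.indices.analyze(body=body)
    return [t['token'] for t in res['tokens']]


# Usage against a running cluster (not executed here):
# print(analyze_tokens(es, 'Text to analyze'))
```

Wrapping the call this way also avoids repeating the same list comprehension for every analyzer you want to compare.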


Note that each `token` value is the actual term that will be stored in the inverted index:

In [7]:
for token in res['tokens']:
    print(token['token'])

text
to
analyze
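To show what "stored in the inverted index" means, here is a minimal in-memory sketch. It is not how Elasticsearch or Lucene implements the index; `build_inverted_index` and the two sample documents are hypothetical, and the documents are assumed to be already analyzed into tokens.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc_tokens in docs.items():
        for term in doc_tokens:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical, already-analyzed documents: doc id -> tokens.
docs = {
    1: ['text', 'to', 'analyze'],
    2: ['analyze', 'this', 'text'],
}

index = build_inverted_index(docs)
print(index['text'])  # [1, 2]
print(index['to'])    # [1]
```

A search then becomes a lookup of the query's terms in this mapping, which is why the query string must be analyzed the same way as the indexed text.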


#### Built-in Analyzers (Examples)

In [8]:
text = "I want to buy an i-pad and use it to purchase some socks on e-bay"

In [9]:
# standard: splits on word boundaries and lowercases
analyzed_text = [x['token'] for x in es.indices.analyze(analyzer='standard', body=text)['tokens']]
print(','.join(analyzed_text))

i,want,to,buy,an,i,pad,and,use,it,to,purchase,some,socks,on,e,bay


In [10]:
# simple: splits on anything that is not a letter and lowercases
analyzed_text = [x['token'] for x in es.indices.analyze(analyzer='simple', body=text)['tokens']]
print(','.join(analyzed_text))

i,want,to,buy,an,i,pad,and,use,it,to,purchase,some,socks,on,e,bay


In [11]:
# whitespace: splits on whitespace only, case is preserved
analyzed_text = [x['token'] for x in es.indices.analyze(analyzer='whitespace', body=text)['tokens']]
print(','.join(analyzed_text))

I,want,to,buy,an,i-pad,and,use,it,to,purchase,some,socks,on,e-bay


In [12]:
# english (language): lowercases, removes English stop words, and stems
analyzed_text = [x['token'] for x in es.indices.analyze(analyzer='english', body=text)['tokens']]
print(','.join(analyzed_text))

i,want,bui,i,pad,us,purchas,some,sock,e,bai
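The `whitespace` and `simple` outputs above can be approximated in a few lines of plain Python, which helps show what each analyzer does. This is only a sketch: `whitespace_like` and `simple_like` are hypothetical names, and the `simple` approximation handles ASCII letters only, unlike the real analyzer.

```python
import re

text = "I want to buy an i-pad and use it to purchase some socks on e-bay"


def whitespace_like(s):
    """Split on whitespace only; case and punctuation are kept."""
    return s.split()


def simple_like(s):
    """Lowercase, then keep runs of ASCII letters; everything else separates tokens."""
    return re.findall(r'[a-z]+', s.lower())


print(','.join(whitespace_like(text)))
print(','.join(simple_like(text)))
```

For this sentence both approximations reproduce the corresponding analyzer outputs shown above: `i-pad` and `e-bay` survive intact with whitespace splitting, and break into `i`, `pad`, `e`, `bay` when hyphens act as separators.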
