## Elasticsearch: The Definitive Guide - Python

Following the examples in the book, here are Python snippets that achieve the same effect.

Documentation for the Python libs:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic"):

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

Github repo for DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [1]:
import sys, os
sys.path.insert(1, os.path.join(sys.path[0], '..'))

In [2]:
import index
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

r = index.populate()
print('{} items created'.format(len(r['items'])))

# Let's repopulate the index as we deleted 'gb' in earlier chapters:
# Run the script: populate.ipynb

14 items created


### Identifying Words

A word in English is relatively simple to spot: words are separated by whitespace or (some) punctuation. Even in English, though, there can be controversy: is you’re one word or two? What about o’clock, cooperate, half-baked, or eyewitness?

The standard analyzer is used by default for any full-text analyzed string field. If we were to reimplement the standard analyzer as a custom analyzer, it would be defined as follows:

```
{
    "type":      "custom",
    "tokenizer": "standard",
    "filter":  [ "lowercase", "stop" ]
}
```
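To make the order of the stages concrete, here is a minimal pure-Python sketch of the same chain: tokenize, then lowercase, then drop stopwords. It does not call Elasticsearch. The regex is only a crude stand-in for the standard tokenizer, and `STOPWORDS` and `analyze` are hypothetical names with a deliberately tiny word list, not what the real `stop` filter uses.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for illustration);
# the real `stop` filter ships its own lists.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to"})


def analyze(text, stopwords=STOPWORDS):
    """Mimic the tokenizer -> lowercase -> stop chain on a plain string."""
    # Stage 1 (tokenizer): crude approximation that keeps runs of word
    # characters, allowing internal apostrophes as in "you're".
    tokens = re.findall(r"\w+(?:'\w+)*", text)
    # Stage 2 (lowercase filter): normalize case.
    tokens = [token.lower() for token in tokens]
    # Stage 3 (stop filter): drop tokens found in the stopword list.
    return [token for token in tokens if token not in stopwords]


result = analyze("The Quick Brown Fox and the Lazy Dog")
print(result)  # ['quick', 'brown', 'fox', 'lazy', 'dog']
```

The point is the ordering: the stop filter runs after lowercasing, so "The" and "the" are both removed.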

#### Standard Tokenizer

What is interesting is the algorithm that is used to identify words. The whitespace tokenizer simply breaks on whitespace—spaces, tabs, line feeds, and so forth—and assumes that contiguous nonwhitespace characters form a single token. For instance:

In [3]:
# Whitespace tokenizer
text = "You're the 1st runner home!"
response = es.indices.analyze(tokenizer='whitespace', body=text)
analyzed_text = [x['token'] for x in response['tokens']]
print(','.join(analyzed_text))

You're,the,1st,runner,home!
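
For intuition, the whitespace tokenizer's behavior on this sentence can be approximated locally with Python's built-in `str.split()`, with no cluster involved. This is only an illustration of "break on whitespace, keep everything else", not the tokenizer's actual implementation.

```python
text = "You're the 1st runner home!"

# With no arguments, str.split() breaks on runs of whitespace (spaces, tabs,
# line feeds) and leaves punctuation attached to the neighboring characters.
tokens = text.split()

print(','.join(tokens))  # You're,the,1st,runner,home!
```

Note that the apostrophe and the trailing exclamation mark stay inside their tokens, just as in the output above.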


In [4]:
# Standard tokenizer - uses Unicode Text Segmentation standard
text = "You're my co-opted 'favorite' cool_dude." # single quotes 'favorite'
response = es.indices.analyze(tokenizer='standard', body=text)
analyzed_text = [x['token'] for x in response['tokens']]
print(','.join(analyzed_text))

You're,my,co,opted,favorite,cool_dude


In [5]:
# Standard tokenizer - uses Unicode Text Segmentation standard
# Note that string contains an email address
text = "You're my co-opted 'favorite' cool_dude. Pls email me friend@dude.it"
response = es.indices.analyze(tokenizer='standard', body=text)
analyzed_text = [x['token'] for x in response['tokens']]
print(','.join(analyzed_text))

You're,my,co,opted,favorite,cool_dude,Pls,email,me,friend,dude.it


In [30]:
# UAX URL Email tokenizer - like the standard tokenizer, but keeps
# email addresses (and URLs) together as single tokens
text = "You're my co-opted 'favorite' cool_dude. Pls email me friend@dude.it"
response = es.indices.analyze(tokenizer='uax_url_email', text=text)
analyzed_text = [x['token'] for x in response['tokens']]
print(','.join(analyzed_text))

You're,my,co,opted,favorite,cool_dude,Pls,email,me,friend@dude.it
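
The cells above all repeat the same step: pull the `token` value out of each entry under `tokens` in the analyze response. A small helper keeps that in one place. `token_list` is a hypothetical name, and `sample_response` is a hand-written stand-in with the same shape the cells index into, so the sketch runs without a cluster. Real responses carry additional per-token fields that the helper ignores.

```python
def token_list(response):
    """Return the token strings from an analyze-style response dict."""
    return [entry["token"] for entry in response["tokens"]]


# Hand-written sample (an assumption for illustration, not real cluster output).
sample_response = {
    "tokens": [
        {"token": "Pls"},
        {"token": "email"},
        {"token": "me"},
        {"token": "friend@dude.it"},
    ]
}

print(','.join(token_list(sample_response)))  # Pls,email,me,friend@dude.it
```

Against a live cluster, the same helper could wrap any of the `es.indices.analyze(...)` calls used in this notebook.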


The standard tokenizer is a reasonable starting point for tokenizing most languages, especially Western languages. In fact, it forms the basis of most of the language-specific analyzers like the english, french, and spanish analyzers. Its support for Asian languages, however, is limited, and you should consider using the icu_tokenizer instead, which is available in the ICU plug-in.

#### Tidying Up Input Text

Tokenizers produce the best results when the input text is clean, valid text, where valid means that it follows the punctuation rules that the Unicode algorithm expects. Quite often, though, the text we need to process is anything but clean. Cleaning it up before tokenization improves the quality of the output.

For example, HTML can get messy...

```
GET /_analyze?tokenizer=standard
<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>
```

Character filters such as html_strip can clean up this kind of input before it reaches the tokenizer. To use them as part of the analyzer, add them to a custom analyzer definition:

```
PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_html_analyzer": {
                    "tokenizer":     "standard",
                    "char_filter": [ "html_strip" ]
                }
            }
        }
    }
}
```
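
To get a feel for what a character filter contributes before tokenization, the following sketch strips the markup locally with Python's standard-library `html.parser` and then splits the result into words. It only approximates the idea behind `html_strip` and does not reproduce Elasticsearch's implementation. `TextExtractor` and `strip_html` are hypothetical names, and the `\w+` regex is a crude stand-in for the standard tokenizer.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, dropping tags and decoding character references."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return the text of `markup` with tags removed and entities decoded."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.chunks)


markup = '<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>'

# Step 1 (character filter): clean the raw input.
clean = strip_html(markup)
# Step 2 (tokenizer): crude approximation using runs of word characters.
tokens = re.findall(r"\w+", clean)

print(','.join(tokens))  # Some,déjà,vu,website
```

The cleanup happens on the raw string, before any tokens exist, which is why character filters are configured separately from token filters.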

In [31]:
text = '<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>'

In [32]:
from elasticsearch_dsl import analyzer, Index

In [33]:
my_custom_analyzer = analyzer('my_html_analyzer',
        tokenizer='standard',
        char_filter='html_strip')

In [34]:
i = Index('my_index')

In [35]:
i.analyzer(my_custom_analyzer)

In [38]:
response = es.indices.analyze(index='my_index', analyzer='my_html_analyzer', text=text)
analyzed_text = [x['token'] for x in response['tokens']]
print(','.join(analyzed_text))

Some,déjà,vu,website


NOTE: I cheated here because the above method call returned an illegal exception that I was unable to debug (related to passing in the char_filter param). So I created the index using the above params 