## Elasticsearch: The Definitive Guide - Python

Following the examples in the book, here are Python snippets that achieve the same effect.

Documentation for the Python libs:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic"):

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

Github repo for DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [9]:
import sys, os
sys.path.insert(1, os.path.join(sys.path[0], '..'))

In [12]:
import index
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

# Let's repopulate the index, as we deleted 'gb' in earlier chapters
# (see the script populate.ipynb):
r = index.populate()
print('{} items created'.format(len(r['items'])))

14 items created


### Getting Started with Languages

Full-text search is a battle between precision—returning as few irrelevant documents as possible—and recall—returning as many relevant documents as possible.
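
To make the two terms concrete, here is a minimal sketch that computes both from a set of retrieved document IDs and a set of truly relevant ones. The function name `precision_recall` and the document IDs are illustrative and do not come from the book.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: share of the returned documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of the relevant documents that were returned.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, three relevant overall, two in common.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(precision, recall)
```

Returning more documents can only raise or keep recall, but it usually lowers precision, which is the trade-off described above.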

Many tactics can improve precision and recall. One is to modify the words themselves: for example, reducing "jumping", "jumps" and "jumped" to their stem (root form) "jump", so that a search for any one of them matches all three.

However, the first step is to identify words using an analyzer:

##### Tokenize text into individual words:

```The quick brown foxes → [The, quick, brown, foxes]```

##### Lowercase tokens:

```The → the```

##### Remove common stopwords:

```[The, quick, brown, foxes] → [quick, brown, foxes]```

##### Stem tokens to their root form:

```foxes → fox```
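
The four steps above can be sketched in plain Python. This is a toy illustration only, not how Elasticsearch implements analysis: the stopword list is a small made-up subset, and `toy_stem` is a crude plural stripper standing in for a real stemming algorithm.

```python
import re

# Tiny illustrative stopword list; real analyzers ship longer ones.
STOPWORDS = {"a", "an", "and", "the", "of", "to"}


def toy_stem(token):
    """Strip a plural suffix; a crude stand-in for a real stemmer."""
    for suffix in ("xes", "ches", "shes", "sses"):
        if token.endswith(suffix):
            return token[:-2]          # foxes -> fox
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]              # jumps -> jump
    return token


def toy_analyze(text):
    tokens = re.findall(r"\w+", text)                    # 1. tokenize
    tokens = [t.lower() for t in tokens]                 # 2. lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]   # 3. remove stopwords
    return [toy_stem(t) for t in tokens]                 # 4. stem


print(toy_analyze("The quick brown foxes"))  # ['quick', 'brown', 'fox']
```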

Each analyzer may also apply other transformations specific to its language in order to make words from that language more searchable:

##### The english analyzer removes the possessive 's:

```John's → john```

##### The french analyzer removes elisions like l' and qu' and diacritics like ¨ or ^:

```l'église → eglis```

##### The german analyzer normalizes terms, replacing ä and ae with a, or ß with ss, among others:

```äußerst → ausserst```
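
The language-specific transformations can be approximated with the standard library. The sketch below is an assumption-laden illustration rather than the analyzers' actual implementation: the helper names are invented, the elision list is partial, and no stemming is applied, so the French example yields `eglise` rather than the stemmed `eglis` shown above.

```python
import re
import unicodedata


def strip_possessive(token):
    """English: drop a trailing 's (straight or curly apostrophe)."""
    return re.sub(r"['’]s$", "", token)


def strip_elision(token):
    """French: drop a leading elided article or pronoun such as l' or qu'."""
    return re.sub(r"^(l|qu|d|j|m|n|s|t|c)['’]", "", token, flags=re.IGNORECASE)


def fold_diacritics(token):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_german(token):
    """German: ß does not decompose under NFKD, so replace it explicitly."""
    return fold_diacritics(token.replace("ß", "ss"))


print(strip_possessive("John's").lower())         # john
print(fold_diacritics(strip_elision("l'église")))  # eglise
print(fold_german("äußerst"))                      # ausserst
```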

### Using Language Analyzers

The built-in language analyzers are available globally and don’t need to be configured before being used. They can be specified directly in the field mapping:

```
PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": {
          "type":     "string",
          "analyzer": "english" 
        }
      }
    }
  }
}
```
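
In Python, the same mapping is just a dictionary passed as the request body. The sketch below uses a hypothetical helper, `language_mapping`, to build it. It assumes Elasticsearch 5.x, where the `text` type replaces the book's `string` type for analyzed fields and mapping types such as `blog` still exist. The call that would send it to a cluster is left as a comment.

```python
def language_mapping(doc_type, field, analyzer):
    """Build a create-index body that analyzes `field` with `analyzer`."""
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    # "text" assumes Elasticsearch 5.x; the book uses "string".
                    field: {"type": "text", "analyzer": analyzer}
                }
            }
        }
    }


blog_mapping = language_mapping("blog", "title", "english")
print(blog_mapping)

# With a client connected to a running cluster, the body could be sent with:
# es.indices.create(index='my_index', body=blog_mapping)
```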

In [37]:
# english (language analyzer)
text = "I'm not happy about the foxes"
tokens = es.indices.analyze(analyzer='english', body=text)['tokens']
analyzed_text = [x['token'] for x in tokens]
print(','.join(analyzed_text))

i'm,happi,about,fox


We can’t tell if a document mentions one fox or many foxes; the word 'not' is a stopword and is removed, so we can’t tell whether the document is happy about foxes or not. By using the english analyzer, we have **increased recall** as we can match more loosely, but we have reduced our ability to rank documents accurately.

To get the best of both worlds, we can use multifields to index the title field twice: once with the english analyzer and once with the standard analyzer:

```
PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": { 
          "type": "string",
          "fields": {
            "english": { 
              "type":     "string",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}
```

In [18]:
index_template = {
  "mappings": {
    "blog": {
      "properties": {
        "title": { 
          "type": "text",
          "fields": {
            "english": { 
              "type":     "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}


In [19]:
es.indices.create(index='my_index', body=index_template)

{'acknowledged': True, 'shards_acknowledged': True}

In [21]:
data = { "title": "I'm happy for this fox" }
es.create(index='my_index', doc_type='blog', body=data, id=1)

{'_id': '1',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'blog',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [22]:
data = { "title": "I'm not happy about my fox problem" }
es.create(index='my_index', doc_type='blog', body=data, id=2)

{'_id': '2',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'blog',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [29]:
s = Search(using=es, index='my_index', doc_type='blog')
q = Q('multi_match', type='most_fields', query='not happy foxes',
      fields=['title', 'title.english'])
# Pass the query object in; calling s.query() with no arguments ignores it.
s = s.query(q)
res = s.execute()
for hit in res:
    print(hit.title)

I'm not happy about my fox problem
I'm happy for this fox


Note that **neither** hit contains the word foxes. Both matched on fox, because the `title.english` field stems foxes to fox.

Use the ```most_fields``` query type to match the same text in as many fields as possible.
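
For comparison, the same `most_fields` query can be written as a plain request body for the low-level client. This is a sketch: the variable name `search_body` is illustrative, and the search call is left as a comment because it needs a running cluster.

```python
import json

# multi_match query of type most_fields over the main field and its
# english sub-field, mirroring the Q(...) object used above.
search_body = {
    "query": {
        "multi_match": {
            "type": "most_fields",
            "query": "not happy foxes",
            "fields": ["title", "title.english"],
        }
    }
}

print(json.dumps(search_body, indent=2))

# With a connected client:
# res = es.search(index='my_index', doc_type='blog', body=search_body)
# for hit in res['hits']['hits']:
#     print(hit['_source']['title'])
```

The `title` field keeps "not" and the exact word forms, while `title.english` contributes the looser stemmed matches.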

### Configuring Language Analyzers

It can be useful to prevent certain words from being stemmed (e.g. "organization" → "organ") when stemming would hurt precision, for instance in searches for "world health organization". The language analyzers are configurable: the example below excludes two words from stemming and supplies a custom stopword list that omits "not", so negations survive analysis:

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "organization", "organizations" ], 
          "stopwords": [ 
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "of", "on", "or", "such", "that",
            "the", "their", "then", "there", "these", "they", "this", "to",
            "was", "will", "with"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_english 
The World Health Organization does not sell organs.
```
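
To see what the two settings do, here is a toy pure-Python sketch. It is not the `english` analyzer: `crude_stem` is an invented suffix stripper that only mimics the stemmer on this one sentence, and the function names are illustrative. The stopword list is the one from the settings above.

```python
import re

STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
    "if", "in", "into", "is", "it", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to",
    "was", "will", "with",
}
STEM_EXCLUSION = {"organization", "organizations"}


def crude_stem(token):
    """Strip one known suffix; a stand-in for a real stemming algorithm."""
    for suffix in ("izations", "ization", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def toy_my_english(text, stem_exclusion=frozenset()):
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Words listed in stem_exclusion bypass the stemmer entirely.
    return [t if t in stem_exclusion else crude_stem(t) for t in tokens]


text = "The World Health Organization does not sell organs."
print(",".join(toy_my_english(text)))
# world,health,organ,doe,not,sell,organ
print(",".join(toy_my_english(text, STEM_EXCLUSION)))
# world,health,organization,doe,not,sell,organ
```

Without the exclusion, "organization" and "organs" collapse to the same term. With it, the two stay distinct, and "not" is kept in both cases because it is absent from the stopword list.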



In [31]:
es.indices.delete(index='my_index')
index_template_with_exclusions = {
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "organization", "organizations" ], 
          "stopwords": [ 
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "of", "on", "or", "such", "that",
            "the", "their", "then", "there", "these", "they", "this", "to",
            "was", "will", "with"
          ]
        }
      }
    }
  }
}

es.indices.create(index='my_index', body=index_template_with_exclusions)

{'acknowledged': True, 'shards_acknowledged': True}

In [36]:
# english (language analyzer) with exclusions: my_english
text = 'The World Health Organization does not sell organs.'
tokens = es.indices.analyze(index='my_index', analyzer='my_english',
                            body=text)['tokens']
analyzed_text = [x['token'] for x in tokens]
print(','.join(analyzed_text))

world,health,organization,doe,not,sell,organ
