## Elasticsearch: The Definitive Guide - Python

Following the examples in the book, here are Python snippets that achieve the same effect.

Documentation for the Python libs:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic"):

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

Github repo for DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [1]:
import sys, os
sys.path.insert(1, os.path.join(sys.path[0], '..'))

In [2]:
import index
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

r = index.populate()
print('{} items created'.format(len(r['items'])))

# Let's repopulate the index as we deleted 'gb' in earlier chapters:
# Run the script: populate.ipynb

14 items created


### Synonyms

While stemming helps to broaden the scope of search by simplifying inflected words to their root form, synonyms broaden the scope by relating concepts and ideas. Perhaps no documents match a query for “English queen,” but documents that contain “British monarch” would probably be considered a good match.

A user might search for “the US” and expect to find documents that contain United States, USA, U.S.A., America, or the States. However, they wouldn’t expect to see results about the states of matter or state machines.

This example provides a valuable lesson. It demonstrates how simple it is for a human to distinguish between separate concepts, and how tricky it can be for mere machines.

Synonyms can be used to conflate words that have pretty much the same meaning, such as jump, leap, and hop, or pamphlet, leaflet, and brochure. Alternatively, they can be used to make a word more generic. For instance, bird could be used as a more general synonym for owl or pigeon, and adult could be used for man or woman.

Synonyms are used to broaden the scope of what is considered a matching document. Just as with stemming or partial matching, synonym fields should not be used alone but should be combined with a query on a main field that contains the original text in unadulterated form. See [Most Fields](https://www.elastic.co/guide/en/elasticsearch/guide/master/most-fields.html) for an explanation of how to maintain relevance when using synonyms.
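
As a rough sketch of that advice, the helper below builds a `multi_match` query of type `most_fields` that searches the original text and a synonym-expanded subfield together. The subfield name `text.synonyms` and the boost of 2 on the main field are assumptions made for illustration; they are not part of the mapping used later in this notebook.

```python
def most_fields_query(user_query, main_field="text", synonym_field="text.synonyms"):
    """Build a search body that combines a main field with a synonym field.

    Both field names are hypothetical defaults; adjust them to your mapping.
    """
    return {
        "query": {
            "multi_match": {
                "query": user_query,
                # most_fields adds up the scores of every field that matches
                "type": "most_fields",
                # boost the unadulterated main field so exact matches rank higher
                "fields": [main_field + "^2", synonym_field],
            }
        }
    }


body = most_fields_query("english queen")
```

The resulting `body` could be passed to `es.search(index=..., body=body)` once an index with such a subfield exists.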

#### Using Synonyms

Synonyms are applied with the `synonym` token filter, which we add to a custom analyzer:

In [4]:
settings = {
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym", 
          "synonyms": [ 
            "british,english",
            "queen,monarch"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter" 
          ]
        }
      }
    }
  }
}
index.create_my_index(body=settings)

In [13]:
# test with my_synonyms
text = "Elizabeth is the English queen" 
response = es.indices.analyze(index='my_index', analyzer='my_synonyms', text=text)
# print each token together with its position in the token stream
for token in response['tokens']:
    print('Pos {}: ({})'.format(token['position'], token['token']))

Pos 0: (elizabeth)
Pos 1: (is)
Pos 2: (the)
Pos 3: (english)
Pos 3: (british)
Pos 4: (queen)
Pos 4: (monarch)
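
To make the positions in the output above concrete, here is a toy model of what the filter does with comma-separated (equivalent) synonyms. It is a plain-Python illustration only, not how Lucene implements the filter, and it assumes whitespace tokenization and single-word synonyms.

```python
def expand_synonyms(text, groups):
    """Emit (position, token) pairs, stacking synonyms at the same position."""
    lookup = {}
    for group in groups:
        words = [w.strip() for w in group.split(",")]
        for word in words:
            lookup[word] = words

    tokens = []
    for position, word in enumerate(text.lower().split()):
        # the original token comes first, followed by its synonyms
        variants = [word] + [w for w in lookup.get(word, []) if w != word]
        for variant in variants:
            tokens.append((position, variant))
    return tokens


tokens = expand_synonyms(
    "Elizabeth is the English queen",
    ["british,english", "queen,monarch"],
)
for position, token in tokens:
    print('Pos {}: ({})'.format(position, token))
```

Because the synonyms share a position with the original token, phrase and proximity queries still see a five-word sentence.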


In [14]:
# Let's try some actual searches:
body = { "text": text }
es.create(index='my_index', doc_type='test', body=body, id=1)

{'_id': '1',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'test',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [27]:
s = Search(using=es)
s = s.query('match', text='monarch')
s.execute()

<Response: []>

Hmmm. Nothing here. We've seen this before. What's going on?

Let's try using the analyzer on the inbound search:

In [38]:
# Using Lucene query string syntax:
q = 'text:monarch'
res = es.search(index='my_index', doc_type='test', analyzer='my_synonyms', q=q)
res['hits']

{'hits': [{'_id': '1',
   '_index': 'my_index',
   '_score': 0.2824934,
   '_source': {'text': 'Elizabeth is the English queen'},
   '_type': 'test'}],
 'max_score': 0.2824934,
 'total': 1}

In [39]:
# Using Query DSL:
query = {
    "query": {
        "match" : {
            "text" : {
                "query" : "monarch",
                "analyzer" : "my_synonyms",
            }
        }
    }
}
# pass the Query DSL body defined above (the analyzer is set inside the match clause)
res = es.search(index='my_index', doc_type='test', body=query)
res['hits']

{'hits': [{'_id': '1',
   '_index': 'my_index',
   '_score': 0.2824934,
   '_source': {'text': 'Elizabeth is the English queen'},
   '_type': 'test'}],
 'max_score': 0.2824934,
 'total': 1}

In [45]:
es.indices.get_field_mapping(index='my_index', doc_type='test', fields='text')

{'my_index': {'mappings': {'test': {'text': {'full_name': 'text',
     'mapping': {'text': {'fields': {'keyword': {'ignore_above': 256,
         'type': 'keyword'}},
       'type': 'text'}}}}}}}

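As we can see, the mapping makes no mention of an analyzer, so the `text` field was indexed with the default analyzer and our synonyms were never applied at index time. Let's re-create the index and add the mapping.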

In [52]:
settings = {
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym", 
          "synonyms": [ 
            "british,english",
            "queen,monarch"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter" 
          ]
        }
      }
    }
  },
    "mappings": {
    "test": {
      "properties": {
        "text": {
          "type":  "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
index.create_my_index(body=settings)

In [53]:
es.indices.get_field_mapping(index='my_index', doc_type='test', fields='text')

{'my_index': {'mappings': {'test': {'text': {'full_name': 'text',
     'mapping': {'text': {'analyzer': 'my_synonyms', 'type': 'text'}}}}}}}
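
The difference between the two mapping responses can also be checked in code. The sketch below assumes the response shape printed above and assumes the index does not override its default analyzer, so a missing `analyzer` key is reported as `standard`. Both `field_analyzer` and `wrap` are helper names invented for this example.

```python
def field_analyzer(response, index, doc_type, field):
    """Return the analyzer named in a get_field_mapping response.

    Falls back to 'standard' when the mapping does not name one.
    """
    mapping = response[index]["mappings"][doc_type][field]["mapping"]
    leaf = field.split(".")[-1]
    return mapping[leaf].get("analyzer", "standard")


def wrap(field_mapping):
    """Rebuild the response shape shown in the notebook output (illustration only)."""
    return {"my_index": {"mappings": {"test": {"text": {
        "full_name": "text", "mapping": {"text": field_mapping}}}}}}


before = wrap({"type": "text",
               "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}})
after = wrap({"type": "text", "analyzer": "my_synonyms"})

print(field_analyzer(before, "my_index", "test", "text"))
print(field_analyzer(after, "my_index", "test", "text"))
```

With a live cluster, the same helper could be given the result of `es.indices.get_field_mapping(...)` directly.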

In [54]:
text = "Elizabeth is the English queen" 
body = { "text": text }
es.create(index='my_index', doc_type='test', body=body, id=1)

{'_id': '1',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'test',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [56]:
# search for monarch again
s = Search(using=es)
s = s.query('match', text='monarch')
s.execute()

<Response: [<Hit(my_index/test/1): {'text': 'Elizabeth is the English queen'}>]>

Voila! It works! (Of course.)

We can double-check this with a boolean search:

In [58]:
q = Q('match', text='english') & Q('match', text='monarch')
s = Search(using=es).query(q)
s.execute()

<Response: [<Hit(my_index/test/1): {'text': 'Elizabeth is the English queen'}>]>

In [60]:
# Using explicit boolean search, but with 'British Monarch':
q = Q('bool',
      must=[Q('match', text='british'), Q('match', text='monarch')])
s = Search(using=es).query(q)
s.execute()

<Response: [<Hit(my_index/test/1): {'text': 'Elizabeth is the English queen'}>]>

In [62]:
# Or this:
q = {
    "match": {
        "text": {
            "query":    "british monarch",
            "minimum_should_match": "100%"
        }
    }
}
s = Search(using=es).query(q)
s.execute()

<Response: [<Hit(my_index/test/1): {'text': 'Elizabeth is the English queen'}>]>

In [63]:
# Now let's use the => syntax for synonyms:
settings = {
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym", 
          "synonyms": [ 
            "british => english",
            "queen => monarch, ruler"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter" 
          ]
        }
      }
    }
  },
    "mappings": {
    "test": {
      "properties": {
        "text": {
          "type":  "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
index.create_my_index(body=settings)

In [64]:
# test with my_synonyms - let's see what the analyzer does with our => mappings:
text = "Elizabeth is the English queen" 
response = es.indices.analyze(index='my_index', analyzer='my_synonyms', text=text)
# print each token together with its position in the token stream
for token in response['tokens']:
    print('Pos {}: ({})'.format(token['position'], token['token']))

Pos 0: (elizabeth)
Pos 1: (is)
Pos 2: (the)
Pos 3: (english)
Pos 4: (monarch)
Pos 4: (ruler)
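
The output above shows that `=>` rules replace a token instead of adding to it: `queen` is gone and `monarch` and `ruler` sit in its place, while `english` passes through untouched because it only appears on the right-hand side of a rule. A toy model of that behaviour, assuming whitespace tokenization and single-word terms (again, not Lucene's actual implementation):

```python
def parse_explicit_rules(rules):
    """Turn 'a, b => c, d' rules into a {source: [replacements]} mapping."""
    mapping = {}
    for rule in rules:
        left, right = rule.split("=>")
        replacements = [w.strip() for w in right.split(",")]
        for source in left.split(","):
            mapping[source.strip()] = replacements
    return mapping


def apply_rules(text, mapping):
    """Emit (position, token) pairs, replacing any token found on a left-hand side."""
    tokens = []
    for position, word in enumerate(text.lower().split()):
        # tokens without a rule pass through unchanged
        for replacement in mapping.get(word, [word]):
            tokens.append((position, replacement))
    return tokens


mapping = parse_explicit_rules(["british => english", "queen => monarch, ruler"])
for position, token in apply_rules("Elizabeth is the English queen", mapping):
    print('Pos {}: ({})'.format(position, token))
```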


In [65]:
# What happens when we put the document in the index?
text = "Elizabeth is the English queen" 
body = { "text": text }
es.create(index='my_index', doc_type='test', body=body, id=1)

{'_id': '1',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'test',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [66]:
# And now search it using a synonym:
s = Search(using=es)
s = s.query('match', text='monarch')
s.execute()

<Response: [<Hit(my_index/test/1): {'text': 'Elizabeth is the English queen'}>]>

In [67]:
# And now search it using the original text:
s = Search(using=es)
s = s.query('match', text='queen')
s.execute()

<Response: [<Hit(my_index/test/1): {'text': 'Elizabeth is the English queen'}>]>

In [74]:
# What's going on?
s = Search(using=es)
s = s.extra(explain=True)
s = s.query('match', text='ruler')
res = s.execute()
index.RenderJSON(res['hits']['hits'])
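
A plausible reading of these results: unless a separate search analyzer is configured, the field's analyzer also runs on the query string. A query for `queen` is therefore rewritten to `monarch` and `ruler` before it is looked up, while `ruler` has no rule of its own and matches the indexed `ruler` token directly. The sketch below models this with a hard-coded rule table and a simple "any token in common" match; it is an illustration of the idea, not of Elasticsearch's scoring.

```python
# the same rules as "british => english" and "queen => monarch, ruler"
RULES = {"british": ["english"], "queen": ["monarch", "ruler"]}


def analyze(text):
    """Lowercase, split on whitespace, and apply the replacement rules."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(RULES.get(word, [word]))
    return tokens


# index time: the stored document is analyzed once
indexed_tokens = set(analyze("Elizabeth is the English queen"))


def matches(query):
    """Search time: the query goes through the same analyzer."""
    return any(token in indexed_tokens for token in analyze(query))


for query in ["monarch", "queen", "ruler", "british", "king"]:
    print(query, matches(query))
```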