## Elasticsearch: The Definitive Guide - Python

Following the examples in the book, here are Python snippets that achieve the same effect.

Documentation for the Python libs:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic")

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

GitHub repo for the DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [1]:
import sys, os
sys.path.insert(1, os.path.join(sys.path[0], '..'))

In [2]:
import index
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

r = index.populate()
print('{} items created'.format(len(r['items'])))

# Let's repopulate the index as we deleted 'gb' in earlier chapters:
# Run the script: populate.ipynb

14 items created


### Synonyms

#### Expand or contract

Synonyms can be handled by simple expansion, simple contraction, or genre expansion. We will look at the trade-offs of each of these techniques in this section.
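As a rough mental model, the three rule styles can be mimicked in plain Python. The sketch below is a toy, not how Elasticsearch implements the `synonym` filter: `parse_rules` and `apply_synonyms` are hypothetical helpers, and they only handle single-word synonyms.

```python
def parse_rules(rules):
    """Turn Solr-style synonym rules into a {token: [replacements]} table."""
    table = {}
    for rule in rules:
        if "=>" in rule:
            left, right = rule.split("=>")
            sources = [t.strip() for t in left.split(",")]
            targets = [t.strip() for t in right.split(",")]
        else:
            # "a,b,c" means every listed term expands to all of them
            sources = targets = [t.strip() for t in rule.split(",")]
        for source in sources:
            table[source] = targets
    return table


def apply_synonyms(tokens, table):
    """Return (position, token) pairs; synonyms share the original position."""
    return [(pos, out) for pos, tok in enumerate(tokens)
            for out in table.get(tok, [tok])]


tokens = "the cow did not leap".split()

expansion = parse_rules(["jump,hop,leap"])             # simple expansion
contraction = parse_rules(["leap,hop => jump"])        # simple contraction
genre = parse_rules(["kitten => kitten,cat,pet"])      # genre expansion

print(apply_synonyms(tokens, expansion))
print(apply_synonyms(tokens, contraction))
print(apply_synonyms(["a", "kitten"], genre))
```

Expansion emits several tokens at one position, contraction emits a single replacement, and genre expansion keeps the original term while adding its broader categories.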

#### Simple expansion

In [3]:
settings = {
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym", 
          "synonyms": [ 
            "jump,hop,leap"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter" 
          ]
        }
      }
    }
  }
}
index.create_my_index(body=settings)

In [4]:
# test with my_synonyms
text = "the cow did not jump over the moon" 
analyzed_text = [[x['position'],x['token']] for x in es.indices.analyze\
                 (index='my_index', analyzer='my_synonyms', text=text)['tokens']]
for item in analyzed_text:
    print('Pos {}: ({})'.format(item[0],item[1]))

Pos 0: (the)
Pos 1: (cow)
Pos 2: (did)
Pos 3: (not)
Pos 4: (jump)
Pos 4: (hop)
Pos 4: (leap)
Pos 5: (over)
Pos 6: (the)
Pos 7: (moon)
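The list comprehension and print loop above are repeated throughout this notebook. One way to tidy them up is a small formatting helper. `format_tokens` below is a hypothetical name, and the sample response is hand-written to mirror the `tokens` list read in the cell above rather than captured from a cluster.

```python
def format_tokens(analyze_response):
    """Format each token of an analyze response as 'Pos N: (token)'."""
    return ['Pos {}: ({})'.format(tok['position'], tok['token'])
            for tok in analyze_response['tokens']]


# Hand-written sample with the same keys the notebook reads from the response
sample = {'tokens': [
    {'token': 'jump', 'position': 4},
    {'token': 'hop', 'position': 4},
    {'token': 'leap', 'position': 4},
]}

for line in format_tokens(sample):
    print(line)
```

With a live cluster this would be called as `format_tokens(es.indices.analyze(index='my_index', analyzer='my_synonyms', text=text))`, matching the call in the cell above.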


![screenshot 2017-03-13 11 59 34](https://cloud.githubusercontent.com/assets/28526/23870720/a15f6eac-07e4-11e7-8cfd-9099e087fd12.png)

#### Simple contraction

In [5]:
settings = {
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym", 
          "synonyms": [ 
            "leap,hop => jump"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter" 
          ]
        }
      }
    }
  },
    "mappings": {
    "test": {
      "properties": {
        "text": {
          "type":  "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
index.create_my_index(body=settings)

In [6]:
# test with my_synonyms
text = "the cow did not leap over the moon" 
analyzed_text = [[x['position'],x['token']] for x in es.indices.analyze\
                 (index='my_index', analyzer='my_synonyms', text=text)['tokens']]
for item in analyzed_text:
    print('Pos {}: ({})'.format(item[0],item[1]))

Pos 0: (the)
Pos 1: (cow)
Pos 2: (did)
Pos 3: (not)
Pos 4: (jump)
Pos 5: (over)
Pos 6: (the)
Pos 7: (moon)


Simple contraction must be applied both at index time and at query time, to ensure that query terms are mapped to the same single value that exists in the index. Let's demonstrate:

In [7]:
text = "the cow did not leap over the moon" 
body = { "text": text }
es.create(index='my_index', doc_type='test', body=body, id=1)

{'_id': '1',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'test',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [8]:
# search for hop 
s = Search(using=es)
s = s.query('match', text='hop')
s.execute()

<Response: [<Hit(my_index/test/1): {'text': 'the cow did not leap over the moon'}>]>

![screenshot 2017-03-13 12 25 10](https://cloud.githubusercontent.com/assets/28526/23871542/1d4c8556-07e8-11e7-8e47-c65a8eba39fa.png)

#### Genre Expansion

In [9]:
settings = {
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym", 
          "synonyms": [ 
            "cat    => cat,pet",
            "kitten => kitten,cat,pet",
            "dog    => dog,pet",
            "puppy  => puppy,dog,pet"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter" 
          ]
        }
      }
    }
  },
    "mappings": {
    "test": {
      "properties": {
        "text": {
          "type":  "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
index.create_my_index(body=settings)

In [10]:
# test with my_synonyms for kittens
text = "i am looking for a kitten" 
analyzed_text = [[x['position'],x['token']] for x in es.indices.analyze\
                 (index='my_index', analyzer='my_synonyms', text=text)['tokens']]
for item in analyzed_text:
    print('Pos {}: ({})'.format(item[0],item[1]))

Pos 0: (i)
Pos 1: (am)
Pos 2: (looking)
Pos 3: (for)
Pos 4: (a)
Pos 5: (kitten)
Pos 5: (cat)
Pos 5: (pet)


In [11]:
# But what about pets?
text = "i am looking for a pet" 
analyzed_text = [[x['position'],x['token']] for x in es.indices.analyze\
                 (index='my_index', analyzer='my_synonyms', text=text)['tokens']]
for item in analyzed_text:
    print('Pos {}: ({})'.format(item[0],item[1]))

Pos 0: (i)
Pos 1: (am)
Pos 2: (looking)
Pos 3: (for)
Pos 4: (a)
Pos 5: (pet)


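The analyzer has no rule for pet, so the token passes through unchanged. A query for pet still finds matches, because genre expansion adds pet to documents about dogs, cats and kittens at index time: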

In [12]:
text = "who wants a dog?" 
body = { "text": text }
es.create(index='my_index', doc_type='test', body=body, id=1)
text = "who wants a cat?" 
body = { "text": text }
es.create(index='my_index', doc_type='test', body=body, id=2)
text = "who wants a kitten?" 
body = { "text": text }
es.create(index='my_index', doc_type='test', body=body, id=3)

{'_id': '3',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'test',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [13]:
# search for a pet - ahhh, how cute!
s = Search(using=es)
s = s.query('match', text='can i find a pet?')
s.execute()

<Response: [<Hit(my_index/test/3): {'text': 'who wants a kitten?'}>, <Hit(my_index/test/2): {'text': 'who wants a cat?'}>, <Hit(my_index/test/1): {'text': 'who wants a dog?'}>]>

You could also have the best of both worlds by applying expansion at index time to ensure that the genres are present in the index. Then, at query time, you can choose to not apply synonyms (so that a query for kitten returns only documents about kittens) or to apply synonyms in order to match kittens, cats and pets (including the canine variety).
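One way to set this up is to give the field an `analyzer` for indexing and a different `search_analyzer` for querying, then override the analyzer on individual `match` queries. The dictionaries below are a sketch only: they reuse the `my_synonyms` analyzer and the `test` type from this notebook, the choice of `standard` as the default search analyzer is an assumption, and none of this was run against the cluster here.

```python
# Index with genre expansion, but search without synonyms by default
mappings = {
    "test": {
        "properties": {
            "text": {
                "type": "text",
                "analyzer": "my_synonyms",      # applied at index time
                "search_analyzer": "standard"   # applied at query time
            }
        }
    }
}

# Uses the search_analyzer: only documents that mention kitten
kitten_only = {"query": {"match": {"text": "kitten"}}}

# Opts back in to synonyms for this one query: kitten OR cat OR pet
kitten_expanded = {
    "query": {
        "match": {
            "text": {"query": "kitten", "analyzer": "my_synonyms"}
        }
    }
}
```

The per-query `analyzer` override is the same parameter used by the `match_phrase` examples later in this notebook.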

With the genre expansion rules used in this section, the IDF for kitten will be correct, while the IDF for cat and pet will be artificially deflated. However, this works in your favor: a genre-expanded query for kitten OR cat OR pet will rank documents with kitten highest, followed by documents with cat, with documents about pet right at the bottom.
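To see why the ranking falls out this way, the effect can be approximated in plain Python. This is a toy calculation, not Elasticsearch's scoring code: the four documents are made up, and the formula `1 + ln(N / (df + 1))` is the classic Lucene TF/IDF form, used here only as an assumption for illustration (Elasticsearch 5.0 and later default to BM25).

```python
import math
from collections import Counter

# Genre expansion rules from this section, as a lookup table
rules = {
    "cat": ["cat", "pet"],
    "kitten": ["kitten", "cat", "pet"],
    "dog": ["dog", "pet"],
    "puppy": ["puppy", "dog", "pet"],
}
# Made-up corpus, one document per animal
docs = ["who wants a dog", "who wants a cat",
        "who wants a kitten", "who wants a puppy"]


def index_terms(text):
    """Set of terms indexed for one document after genre expansion."""
    terms = set()
    for token in text.split():
        terms.update(rules.get(token, [token]))
    return terms


# Document frequency: in how many documents does each term appear?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(index_terms(doc))


def idf(term):
    """Classic TF/IDF-style inverse document frequency (illustrative only)."""
    return 1 + math.log(len(docs) / (doc_freq[term] + 1))


for term in ("kitten", "cat", "pet"):
    print(term, doc_freq[term], round(idf(term), 3))
```

Because expansion injects cat and pet into documents that never mentioned them, their document frequency rises and their IDF drops, which leaves kitten as the most heavily weighted term.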

#### Synonyms and The Analysis Chain

Imagine that we have an analyzer that consists of the standard tokenizer, with the lowercase token filter followed by a synonym token filter. The analysis process for the text U.S.A. would look like this:

```
original string                  → "U.S.A."
standard           tokenizer     → (U),(S),(A)
lowercase          token filter  → (u),(s),(a)
synonym            token filter  → (usa)
```

If we had specified the synonym as U.S.A., it would never match anything because, by the time my_synonym_filter sees the terms, the periods have been removed and the letters have been lowercased.
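The same point can be checked with a toy analysis chain in plain Python. `toy_analyze` is a hypothetical stand-in that follows the token stream shown above (split on anything that is not a letter or digit, lowercase, then contract synonyms). It is not the real standard tokenizer or synonym filter.

```python
import re


def toy_analyze(text, rules):
    """Toy chain: split on non-alphanumerics, lowercase, contract synonyms."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    out, i = [], 0
    while i < len(tokens):
        for source, target in rules.items():
            # A rule matches when its token sequence starts at position i
            if tuple(tokens[i:i + len(source)]) == source:
                out.append(target)
                i += len(source)
                break
        else:
            out.append(tokens[i])  # no rule starts here, keep the token
            i += 1
    return out


# Rule written the way the synonym filter actually sees the tokens
print(toy_analyze("U.S.A.", {("u", "s", "a"): "usa"}))
# Rule written like the original text: never matches the token stream
print(toy_analyze("U.S.A.", {("U.S.A.",): "usa"}))
```

The second rule leaves the tokens untouched, because nothing in the stream reaching the synonym step still contains periods or uppercase letters.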

This is an important point to consider. What if we want to combine synonyms with stemming, so that jumps, jumped, jump, leaps, leaped, and leap are all indexed as the single term jump? We could place the synonyms filter before the stemmer and list all inflections:

In [14]:
# first without any stemmer - let's see what happens:
settings = {
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym", 
          "synonyms": [ 
            "jumps,jumped,leap,leaps,leaped => jump"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter" 
          ]
        }
      }
    }
  },
    "mappings": {
    "test": {
      "properties": {
        "text": {
          "type":  "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
index.create_my_index(body=settings)
text = "the cow jumped over the moon" 
body = { "text": text }
es.create(index='my_index', doc_type='test', body=body, id=1)

{'_id': '1',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'test',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [15]:
# search for a cow that jumps
s = Search(using=es)
q = Q('match', text='cow') & Q('match', text='jumps')
s = s.query(q)
s.execute()
# expected to match: the contraction rule maps both 'jumped' and 'jumps' to 'jump'

<Response: []>

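The empty response above is unexpected, since the contraction rule covers both jumped and jumps. A possible cause, not verified here, is that the search ran before the newly created document had been refreshed into the index. Now consider the alternative strategy of using a stemmer, placed before the synonym filter: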

In [16]:
# now with a stemmer
settings = {
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym", 
          "synonyms": [ 
            "leap => jump"
          ]
        },
        "my_stemmer": {
          "type":       "stemmer",
          "language":   "english" 
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_stemmer",
            "my_synonym_filter" 
          ]
        }
      }
    }
  },
    "mappings": {
    "test": {
      "properties": {
        "text": {
          "type":  "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
index.create_my_index(body=settings)

In [17]:
text = "the cow jumped over the moon" 
body = { "text": text }
es.create(index='my_index', doc_type='test', body=body, id=1)

{'_id': '1',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'test',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [18]:
# search for a cow that jumps
s = Search(using=es)
q = Q('match', text='jumps')
s = s.query(q)
s.execute()
# it matches because the stemmer reduces both 'jumped' and 'jumps' to 'jump'

<Response: [<Hit(my_index/test/1): {'text': 'the cow jumped over the moon'}>]>

In [19]:
# Does it also catch jumps, leap, leaps, leaped and jumping?
# The clauses are ANDed together, so a hit means every one of them matched.
s = Search(using=es)
q = Q('match', text='jumps') & Q('match', text='leap') & Q('match', text='leaps') & \
    Q('match', text='leaped') & Q('match', text='jumping')
s = s.query(q)
s.execute()

<Response: [<Hit(my_index/test/1): {'text': 'the cow jumped over the moon'}>]>

#### Case-Sensitive Synonyms

Synonym filters are usually placed after the lowercase filter. But what if we really want to match `CAT scan` and not cats?

Solution: create two synonym filters:

##### Case-sensitive rules:

```
"CAT,CAT scan           => cat_scan"
"PET,PET scan           => pet_scan"
"Johnny Little,J Little => johnny_little"
"Johnny Small,J Small   => johnny_small"
```

##### Case-insensitive rules:

```
"cat                    => cat,pet"
"dog                    => dog,pet"
"cat scan,cat_scan scan => cat_scan"
"pet scan,pet_scan scan => pet_scan"
"little,small"
```

Let's try it:

In [20]:
# two sets of synonyms without a stemmer here:
settings = {
  "settings": {
    "analysis": {
      "filter": {
        "my_syns_1": {
          "type": "synonym", 
          "synonyms": [ 
            "CAT,CAT scan           => cat_scan",
            "PET,PET scan           => pet_scan",
            "Johnny Little,J Little => johnny_little",
            "Johnny Small,J Small   => johnny_small"
          ]
        },
        "my_syns_2": {
          "type": "synonym", 
          "synonyms": [ 
            "cat                    => cat,pet",
            "dog                    => dog,pet",
            "cat scan,cat_scan scan => cat_scan",
            "pet scan,pet_scan scan => pet_scan",
            "little,small"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "my_syns_1",
            "my_syns_2"
          ]
        }
      }
    }
  },
    "mappings": {
    "test": {
      "properties": {
        "text": {
          "type":  "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
index.create_my_index(body=settings)
text = "the man went to get a cat" 
body = { "text": text }
r = es.create(index='my_index', doc_type='test', body=body, id=1)
text = "the man went to get a CAT scan" 
body = { "text": text }
r = es.create(index='my_index', doc_type='test', body=body, id=2)

In [23]:
# search for a man with a cat
s = Search(using=es)
q = Q('match', text='cat')
s = s.query(q)
s.execute()
# lowercase 'cat' only hits my_syns_2 (cat => cat,pet), so only the pet document matches

<Response: [<Hit(my_index/test/1): {'text': 'the man went to get a cat'}>]>

In [24]:
# search for a man with a CAT scan
s = Search(using=es)
q = Q('match', text='CAT')
s = s.query(q)
s.execute()
# uppercase 'CAT' is contracted to cat_scan by my_syns_1, so only the scan document matches

<Response: [<Hit(my_index/test/2): {'text': 'the man went to get a CAT scan'}>]>
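The analyzer above runs both synonym filters without lowercasing in between, so mixed-case text such as `Cat` would match neither rule set. A possible variation, stated here as an assumption and not run in this notebook, is to put the `lowercase` filter between the two so that the case-sensitive rules see the original text and the case-insensitive rules see lowercased tokens:

```python
# Sketch of an alternative analyzer definition (untested against a cluster)
analyzer = {
    "my_synonyms": {
        "tokenizer": "standard",
        "filter": [
            "my_syns_1",   # case-sensitive rules, e.g. "CAT,CAT scan => cat_scan"
            "lowercase",   # normalise whatever is left
            "my_syns_2"    # case-insensitive rules, e.g. "cat => cat,pet"
        ]
    }
}
```

This dictionary would replace the `"analyzer"` entry in the settings above; the two filter definitions stay as they are.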

#### Multiword Synonyms and Phrase Queries

So far, synonyms appear to be quite straightforward. Unfortunately, this is where things start to go wrong. For phrase queries to function correctly, Elasticsearch needs to know the position that each term occupies in the original text. Multiword synonyms can play havoc with term positions, especially when the injected synonyms are of differing lengths.

To demonstrate, we’ll create a synonym token filter that uses this rule:

`"usa,united states,u s a,united states of america"`

In [25]:
settings = {
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "usa,united states,u s a,united states of america"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}
index.create_my_index(body=settings)


In [26]:
# test with my_synonyms 
text = "The United States is wealthy" 
analyzed_text = [[x['position'],x['token']] for x in es.indices.analyze\
                 (index='my_index', analyzer='my_synonyms', text=text)['tokens']]
for item in analyzed_text:
    print('Pos {}: ({})'.format(item[0],item[1]))

Pos 0: (the)
Pos 1: (united)
Pos 1: (usa)
Pos 1: (u)
Pos 1: (united)
Pos 2: (states)
Pos 2: (s)
Pos 2: (states)
Pos 3: (is)
Pos 3: (a)
Pos 3: (of)
Pos 4: (wealthy)
Pos 4: (america)


In [27]:
# Look at this query validation:
body = {
  "query": {
    "match_phrase": {
      "text": {
        "query": "usa is wealthy",
        "analyzer": "my_synonyms"
      }
    }
  }
}
es.indices.validate_query(index='my_index', body=body, explain=1)

{'_shards': {'failed': 0, 'successful': 1, 'total': 1},
 'explanations': [{'explanation': 'text:"(usa united u united) (is states s states) (wealthy a of) america"',
   'index': 'my_index',
   'valid': True}],
 'valid': True}

Any phrase that picks one term from each position in the explanation would match this query:

`(usa united u united) (is states s states) (wealthy a of) america`

* `usa is wealthy america`
* `u is of america`
* `usa states of america`

What a mess!
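To get a feel for how many phrases that is, the alternatives can be enumerated with `itertools.product`. This is a plain-Python sketch: the term lists are copied from the explanation above with duplicates removed, and it assumes the phrase query accepts any one term per position.

```python
from itertools import product

# Term alternatives at each position, copied from the explanation above
positions = [
    ["usa", "united", "u"],
    ["is", "states", "s"],
    ["wealthy", "a", "of"],
    ["america"],
]

# Every way of picking one term per position
phrases = [" ".join(combo) for combo in product(*positions)]

print(len(phrases))
for phrase in phrases[:5]:
    print(phrase)
```

The three examples listed above are all among the generated phrases.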

The way to avoid this is to use simple contraction where possible, injecting a single term that represents all synonyms, and to use the same synonym token filter at query time:

In [28]:
settings = {
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "united states,u s a,united states of america=>usa"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}
index.create_my_index(body=settings)

In [29]:
# test it now
text = "The United States is wealthy" 
analyzed_text = [[x['position'],x['token']] for x in es.indices.analyze\
                 (index='my_index', analyzer='my_synonyms', text=text)['tokens']]
for item in analyzed_text:
    print('Pos {}: ({})'.format(item[0],item[1]))

Pos 0: (the)
Pos 1: (usa)
Pos 2: (is)
Pos 3: (wealthy)


In [30]:
# And now validate the query again:
body = {
  "query": {
    "match_phrase": {
      "text": {
        "query": "usa is wealthy",
        "analyzer": "my_synonyms"
      }
    }
  }
}
es.indices.validate_query(index='my_index', body=body, explain=1)

{'_shards': {'failed': 0, 'successful': 1, 'total': 1},
 'explanations': [{'explanation': 'text:"usa is wealthy"',
   'index': 'my_index',
   'valid': True}],
 'valid': True}

#### Symbol Synonyms

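Consider these two sentences:

* I am thrilled to be at work on Sunday.
* I am thrilled to be at work on Sunday :(

The standard tokenizer would strip the emoticon out of the second sentence, so both would be indexed identically even though their intent is quite different.

If we want to handle emoticons, we can create a mapping character filter: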


In [31]:
settings = {
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [ 
            ":)=>emoticon_happy",
            ":(=>emoticon_sad"
          ]
        }
      },
      "analyzer": {
        "my_emoticons": {
          "char_filter": "emoticons",
          "tokenizer":   "standard",
          "filter":    [ "lowercase" ]
        }
      }
    }
  }
}
index.create_my_index(body=settings)

In [32]:
# test with my_emoticons - let's see what the analyzer does with our => mappings:
text = "I am :) not :(" 
analyzed_text = [[x['position'],x['token']] for x in es.indices.analyze\
                 (index='my_index', analyzer='my_emoticons', text=text)['tokens']]
for item in analyzed_text:
    print('Pos {}: ({})'.format(item[0],item[1]))

Pos 0: (i)
Pos 1: (am)
Pos 2: (emoticon_happy)
Pos 3: (not)
Pos 4: (emoticon_sad)


It is unlikely that anybody would ever search for emoticon_happy, but ensuring that important symbols like emoticons are included in the index can be helpful when doing sentiment analysis. Of course, we could equally have used real words, like happy and sad.
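The order of operations is what makes this work: the character filter rewrites the raw text before the tokenizer gets a chance to discard punctuation. The toy below imitates that order in plain Python. `toy_char_filter` and `toy_tokenize` are hypothetical stand-ins, not the real `mapping` char filter or standard tokenizer.

```python
import re

# Same mappings as the "emoticons" char_filter above
mappings = {":)": "emoticon_happy", ":(": "emoticon_sad"}


def toy_tokenize(text):
    """Toy tokenizer: keep runs of word characters, lowercased."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def toy_char_filter(text):
    """Rewrite the raw text before it reaches the tokenizer."""
    for symbol, replacement in mappings.items():
        text = text.replace(symbol, replacement)
    return text


text = "I am :) not :("
print(toy_tokenize(text))                   # emoticons are dropped
print(toy_tokenize(toy_char_filter(text)))  # emoticons survive as words
```

Swapping the replacement strings for real words such as happy and sad only requires changing the values in `mappings`.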