## Elasticsearch: The Definitive Guide - Python

Following the examples in the book, here are Python snippets that achieve the same effect.

Documentation for the Python libs:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic"):

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

Github repo for DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [10]:
import sys, os
sys.path.insert(1, os.path.join(sys.path[0], '..'))

In [11]:
import index
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

r = index.populate()
print('{} items created'.format(len(r['items'])))

# Let's repopulate the index as we deleted 'gb' in earlier chapters:
# Run the script: populate.ipynb

14 items created


### Reducing Words to Their Root Form

#### Dictionary Stemmers

Dictionary stemmers work quite differently from algorithmic stemmers. Instead of applying a standard set of rules to each word, they simply look up the word in the dictionary. Theoretically, they could produce much better results than an algorithmic stemmer. A dictionary stemmer should be able to do the following:

* Return the correct root word for irregular forms such as feet and mice
* Recognize the distinction between words that are similar but have different word senses—for example, organ and organization

**Dictionary quality** - A dictionary stemmer is only as good as its dictionary. Most electronic dictionaries contain only about 10% of the words in a full dictionary, and they have to be kept up to date as the language changes.

**Size and performance** - A dictionary stemmer needs to load all words, all prefixes, and all suffixes into memory. This can use a significant amount of RAM. Finding the right stem for a word is often considerably more complex than the equivalent process with an algorithmic stemmer.
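To make the contrast concrete, here is a toy sketch of the two approaches. The word list and suffix list are invented for illustration; neither function reflects how Hunspell or the Porter stemmers are actually implemented.

```python
# Toy illustration only: a hand-written lookup table versus blind suffix stripping.

# Hypothetical dictionary mapping inflected forms to their root words.
TOY_DICTIONARY = {
    "feet": "foot",
    "mice": "mouse",
    "organs": "organ",
    "organizations": "organization",
}

# Hypothetical suffix list, checked in order (longest first).
TOY_SUFFIXES = ["izations", "ization", "s"]


def dictionary_stem(word):
    """Look the word up; return it unchanged if the dictionary lacks it."""
    return TOY_DICTIONARY.get(word, word)


def algorithmic_stem(word):
    """Strip the first matching suffix, without knowing any vocabulary."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


for w in ["feet", "mice", "organs", "organizations", "skies"]:
    print(w, "->", dictionary_stem(w), "|", algorithmic_stem(w))
```

In this sketch the lookup handles the irregular forms and keeps "organ" and "organization" apart, while the suffix rule conflates them. The word "skies" is not in the toy dictionary, so the lookup returns it unchanged, which is the "only as good as its dictionary" limitation.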

Let's explore the Hunspell dictionary "stemmer":

```
config/
  └ hunspell/ 
      └ en_GB/ 
          ├ en_GB.dic
          ├ en_GB.aff
          └ settings.yml 
```

Note that we don't need to touch ```settings.yml```, which overrides any settings in the master settings file, ```elasticsearch.yml```. Settings can be used to ignore case, which is otherwise set to false:

* ```indices.analysis.hunspell.dictionary.ignore_case```

(NOTE: due to my British roots, I changed the example to use the GB dictionary noting that the US version is derived from it.)

In [16]:
settings = {
    "analysis" : {
        "analyzer" : {
            "en_GB" : {
                "tokenizer" : "standard",
                "filter" : [ "lowercase", "en_GB" ]
            }
        },
        "filter" : {
            "en_GB" : {
                "type" : "hunspell",
                "locale" : "en_GB"
            }
        }
    }
}
index.create_my_index(body=settings)

In [18]:
# test with the standard English analyzer
text = "You're right about organizing jack's Über generation of waiters." 
tokens = es.indices.analyze(index='my_index', analyzer='english', text=text)['tokens']
analyzed_text = [x['token'] for x in tokens]
print(','.join(analyzed_text))

you'r,right,about,organ,jack,über,gener,waiter


In [19]:
tokens = es.indices.analyze(index='my_index', analyzer='en_GB', text=text)['tokens']
analyzed_text = [x['token'] for x in tokens]
print(','.join(analyzed_text))

you're,right,about,organize,organ,jack,über,generation,generate,genera,of,wait
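The same analyze-and-extract pattern recurs in every cell of this notebook, so it could be wrapped in a small helper. This is only a sketch: `analyze_tokens` is a hypothetical name, and it assumes the client accepts the same `index`, `analyzer`, and `text` keyword arguments used in the cells above.

```python
def analyze_tokens(es, analyzer, text, index='my_index'):
    """Return the token strings that `analyzer` produces for `text`.

    `es` is any client object exposing `indices.analyze(...)` with the
    keyword arguments used elsewhere in this notebook (an assumption).
    """
    response = es.indices.analyze(index=index, analyzer=analyzer, text=text)
    return [t['token'] for t in response['tokens']]


# Example usage against a running cluster (not executed here):
# print(','.join(analyze_tokens(es, 'en_GB', text)))
```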


Let's see what happens with the following words that are known to be overstemmed by Porter stemmers (and later improved by the Porter2 stemmer):

In [25]:
text = "A generically generally generously generated organized waiter."
# English
tokens = es.indices.analyze(index='my_index', analyzer='english', text=text)['tokens']
analyzed_text = [x['token'] for x in tokens]
print(','.join(analyzed_text))

gener,gener,gener,gener,organ,waiter


In [26]:
# en_GB Hunspell:
tokens = es.indices.analyze(index='my_index', analyzer='en_GB', text=text)['tokens']
analyzed_text = [x['token'] for x in tokens]
print(','.join(analyzed_text))

a,genera,genera,generously,generous,generate,organize,organ,wait


In [29]:
english_token_filter = {
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_"
        },
        "light_english_stemmer": {
          "type":       "stemmer",
          "language":   "light_english" 
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer":  "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "light_english_stemmer", 
            "asciifolding" 
          ]
        }
      }
    }
  }
}
index.create_my_index(body=english_token_filter)

In [30]:
# my_english custom analyzer:
tokens = es.indices.analyze(index='my_index', analyzer='my_english', text=text)['tokens']
analyzed_text = [x['token'] for x in tokens]
print(','.join(analyzed_text))

generic,generally,generous,generate,organized,waiter


In [33]:
porter_token_filter = {
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_"
        },
        "porter": {
          "type":       "stemmer",
          "language":   "porter" 
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "analyzer": {
        "my_porter_english": {
          "tokenizer":  "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "porter", 
            "asciifolding" 
          ]
        }
      }
    }
  }
}
index.create_my_index(body=porter_token_filter)

In [34]:
# my_porter_english custom analyzer:
tokens = es.indices.analyze(index='my_index', analyzer='my_porter_english', text=text)['tokens']
analyzed_text = [x['token'] for x in tokens]
print(','.join(analyzed_text))

gener,gener,gener,gener,organ,waiter


In [35]:
porter2_token_filter = {
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_"
        },
        "porter2": {
          "type":       "stemmer",
          "language":   "porter2" 
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "analyzer": {
        "my_porter2_english": {
          "tokenizer":  "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "porter2", 
            "asciifolding" 
          ]
        }
      }
    }
  }
}
index.create_my_index(body=porter2_token_filter)

In [36]:
# my_porter2_english custom analyzer:
tokens = es.indices.analyze(index='my_index', analyzer='my_porter2_english', text=text)['tokens']
analyzed_text = [x['token'] for x in tokens]
print(','.join(analyzed_text))

generic,general,generous,generat,organ,waiter


### Summary of Analyzer Comparison:

text = "A generically generally generously generated organized waiter."

##### English

gener,gener,gener,gener,organ,waiter

##### Hunspell (en_GB)

a,genera,genera,generously,generous,generate,organize,organ,wait

##### "My English" (light_english stemmer)

generic,generally,generous,generate,organized,waiter

##### "My English" (Porter stemmer)

gener,gener,gener,gener,organ,waiter

##### "My English" (Porter2 stemmer)

generic,general,generous,generat,organ,waiter
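One rough way to compare how aggressively each analyzer conflates words is to count the distinct terms in the outputs listed above. The snippet below just reuses those pasted outputs as strings; the dictionary keys are labels chosen here, not analyzer names required by Elasticsearch.

```python
# Outputs copied from the summary above, keyed by an informal label.
results = {
    "english": "gener,gener,gener,gener,organ,waiter",
    "hunspell en_GB": "a,genera,genera,generously,generous,generate,organize,organ,wait",
    "light_english": "generic,generally,generous,generate,organized,waiter",
    "porter": "gener,gener,gener,gener,organ,waiter",
    "porter2": "generic,general,generous,generat,organ,waiter",
}

# Number of distinct terms each analyzer emitted for the same sentence.
distinct = {name: len(set(tokens.split(","))) for name, tokens in results.items()}

for name, tokens in results.items():
    total = len(tokens.split(","))
    print(f"{name}: {total} tokens, {distinct[name]} distinct")
```

Fewer distinct terms means more words were collapsed into the same stem. The Hunspell count is not directly comparable, because that filter can emit several stems for a single input word and the `en_GB` analyzer defined earlier has no stopword filter.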


### Preventing Stemming

It may be important to keep skies and skiing as distinct words rather than stemming them both down to ski (as would happen with the english analyzer).

The ```keyword_marker``` and ```stemmer_override``` token filters customize the stemming process.

In [40]:
stem_control_settings = {
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type": "keyword_marker",
          "keywords": [ "skies" ] 
        }
      },
      "analyzer": {
        "my_stemmer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "no_stem",
            "porter_stem"
          ]
        }
      }
    }
  }
}
index.create_my_index(body=stem_control_settings)

In [41]:
# my_stemmer custom analyzer:
text = ['sky skies skiing skis']
tokens = es.indices.analyze(index='my_index', analyzer='my_stemmer', text=text)['tokens']
analyzed_text = [x['token'] for x in tokens]
print(','.join(analyzed_text))

sky,skies,ski,ski


While the language analyzers allow us only to specify an array of words in the stem_exclusion parameter, the keyword_marker token filter also accepts a keywords_path parameter that allows us to store all of our keywords in [a file](https://www.elastic.co/guide/en/elasticsearch/guide/master/using-stopwords.html#updating-stopwords).
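As a sketch of the file-based variant, the settings below swap the inline ```keywords``` list for ```keywords_path```. The file name is hypothetical; the path is resolved relative to the Elasticsearch config directory (or can be absolute), and the file is expected to list one keyword per line and to be present on every node.

```python
stem_control_from_file = {
  "settings": {
    "analysis": {
      "filter": {
        "no_stem": {
          "type": "keyword_marker",
          # Hypothetical file: one keyword per line, e.g. a line containing "skies"
          "keywords_path": "analysis/no_stem_keywords.txt"
        }
      },
      "analyzer": {
        "my_stemmer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "no_stem",
            "porter_stem"
          ]
        }
      }
    }
  }
}

# With the notebook's helper module available, the index would be created with:
# index.create_my_index(body=stem_control_from_file)
```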

#### Customizing Stemming

Perhaps we prefer "skies" to be stemmed to "sky" instead. The ```stemmer_override``` token filter allows us to specify our own custom stemming rules. At the same time, we can handle some irregular forms like stemming mice to mouse and feet to foot:

In [44]:
my_stemmer_override = {
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          "rules": [ 
            "skies=>sky",
            "mice=>mouse",
            "feet=>foot"
          ]
        }
      },
      "analyzer": {
        "my_stemmer_override": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_stem", 
            "porter_stem"
          ]
        }
      }
    }
  }
}
index.create_my_index(body=my_stemmer_override)
# my_stemmer_override custom analyzer:
text = ['The mice came down from the skies and ran over my feet']
tokens = es.indices.analyze(index='my_index', analyzer='my_stemmer_override', text=text)['tokens']
analyzed_text = [x['token'] for x in tokens]
print(','.join(analyzed_text))

the,mouse,came,down,from,the,sky,and,ran,over,my,foot


**NOTE**: The stemmer_override filter ("custom_stem") must be placed **before** the stemmer (here "porter_stem").
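The ordering matters because the override filter both rewrites a token and protects it from later stemmers. The toy simulation below illustrates the idea in plain Python; it is not Lucene's implementation, and `toy_stem` is a made-up stand-in for a real stemmer.

```python
# Toy simulation: each token is a (text, is_keyword) pair, and a True flag
# tells any later stemmer to leave the token alone.

RULES = {"skies": "sky", "mice": "mouse", "feet": "foot"}


def override(tokens):
    """Apply the override rules and protect the results from later stemming."""
    return [(RULES[t], True) if t in RULES else (t, kw) for t, kw in tokens]


def toy_stem(tokens):
    """Hypothetical stemmer: 'ies' becomes 'i', otherwise drop a final 's'."""
    out = []
    for t, kw in tokens:
        if not kw:
            if t.endswith("ies"):
                t = t[:-3] + "i"
            elif t.endswith("s"):
                t = t[:-1]
        out.append((t, kw))
    return out


def run(filters, text):
    """Tokenize on whitespace, lowercase, then apply the filters in order."""
    tokens = [(t, False) for t in text.lower().split()]
    for f in filters:
        tokens = f(tokens)
    return [t for t, _ in tokens]


text = "mice skies feet"
print(run([override, toy_stem], text))  # override first
print(run([toy_stem, override], text))  # stemmer first
```

With the stemmer first, "skies" has already been reduced to "ski" by the time the override runs, so the `skies=>sky` rule never matches.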

Just as for the keyword_marker token filter, rules can be stored in a file whose location should be specified with the ```rules_path``` parameter.
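A minimal sketch of the file-based form follows. The file name is hypothetical; the file would hold one rule per line in the same ```token => stem``` form as the inline rules, with the path resolved relative to the Elasticsearch config directory.

```python
my_stemmer_override_from_file = {
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          # Hypothetical file containing lines such as "skies => sky"
          "rules_path": "analysis/stemmer_override_rules.txt"
        }
      },
      "analyzer": {
        "my_stemmer_override": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_stem",
            "porter_stem"
          ]
        }
      }
    }
  }
}

# With the notebook's helper module available, the index would be created with:
# index.create_my_index(body=my_stemmer_override_from_file)
```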