## Elasticsearch: The Definitive Guide - Python

Following the examples in the book, here are Python snippets that achieve the same effect.

Documentation for the Python libs:

Low-level API:

https://elasticsearch-py.readthedocs.io/en/master/index.html

Expressive DSL API (more "Pythonic"):

http://elasticsearch-dsl.readthedocs.io/en/latest/index.html

GitHub repo for DSL API:

https://github.com/elastic/elasticsearch-dsl-py


In [1]:
import sys, os
sys.path.insert(1, os.path.join(sys.path[0], '..'))

In [2]:
import index
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
from pprint import pprint

es = Elasticsearch(
    'localhost',
    # sniff before doing anything
    sniff_on_start=True,
    # refresh nodes after a node fails to respond
    sniff_on_connection_fail=True,
    # and also every 60 seconds
    sniffer_timeout=60
)

r = index.populate()
print('{} items created'.format(len(r['items'])))

# Let's repopulate the index as we deleted 'gb' in earlier chapters:
# Run the script: populate.ipynb

14 items created


### Reducing Words to Their Root Form

Most languages of the world are inflected, meaning that words can change their form to express differences in the following:

* Number: fox, foxes
* Tense: pay, paid, paying
* Gender: waiter, waitress
* Person: hear, hears
* Case: I, me, my
* Aspect: ate, eaten
* Mood: so be it, were it so

While inflection aids expressivity, it interferes with retrievability, as a single root word sense (or meaning) may be represented by many different sequences of letters.

Stemming attempts to remove the differences between inflected forms of a word, in order to reduce each word to its root form. For instance, foxes may be reduced to the root fox.

If stemming were easy, there would be only one implementation. Unfortunately, stemming is an inexact science that suffers from two issues: understemming and overstemming.

**The root form of a word may not even be a real word**. The words ```jumping``` and ```jumpiness``` may both be stemmed to ```jumpi```. It doesn’t matter—as long as the same terms are produced at index time and at search time, search will just work.

Understemming is the failure to reduce words with the same meaning to the same root. For example, ```jumped``` and ```jumps``` may be reduced to ```jump```, while ```jumping``` may be reduced to ```jumpi```. **Understemming reduces recall**: relevant documents are not returned.

Overstemming is the failure to keep two words with distinct meanings separate. For instance, ```general``` and ```generate``` may both be stemmed to ```gener```. **Overstemming reduces precision**: irrelevant documents are returned when they shouldn’t be.
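Both failure modes can be reproduced with a deliberately crude suffix stripper. The sketch below is a toy, not the Porter or Snowball algorithm used by Elasticsearch; the function name `toy_stem` and its suffix list are made up for illustration.

```python
# Toy suffix-stripping stemmer (hypothetical; NOT what Elasticsearch uses).
SUFFIXES = ("iness", "ing", "ate", "al", "ed", "s")
MIN_STEM_LENGTH = 3


def toy_stem(word):
    """Strip the first matching suffix, keeping at least MIN_STEM_LENGTH characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[:-len(suffix)]
    return word


# Overstemming: two words with distinct meanings collapse to the same stem.
print(toy_stem("general"), toy_stem("generate"))

# Understemming: the irregular form "paid" is not reduced to "pay".
print(toy_stem("pays"), toy_stem("paying"), toy_stem("paid"))
```

With these rules, ```general``` and ```generate``` both come out as ```gener```, while ```paid``` is left untouched and so never matches ```pay```.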

#### Lemmatization ####

A lemma is the canonical, or dictionary, form of a set of related words: the lemma of paying, paid, and pays is pay. Sometimes the morphology is less obvious: the lemma of is, was, am, and being is be.

Lemmatization, like stemming, tries to group related words, but it goes one step further than stemming in that it tries to group words by their word sense, or meaning. The same word may represent two meanings. For example, wake can mean to wake up or a funeral. While lemmatization would try to distinguish these two word senses, stemming would incorrectly conflate them.

Lemmatization is a much more complicated and expensive process that needs to understand the context in which words appear in order to make decisions about what they mean. In practice, stemming appears to be just as effective as lemmatization, but with a much lower cost.
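To make the contrast with stemming concrete, here is a minimal dictionary-lookup sketch. The table `LEMMAS`, the function `toy_lemmatize`, and the part-of-speech labels are all hypothetical; a real lemmatizer would have to infer the part of speech from context, which is where the cost comes from.

```python
# Toy lemma table keyed by (word, part of speech); entries are illustrative only.
LEMMAS = {
    ("paying", "VERB"): "pay",
    ("paid", "VERB"): "pay",
    ("pays", "VERB"): "pay",
    ("is", "VERB"): "be",
    ("was", "VERB"): "be",
    ("am", "VERB"): "be",
    ("being", "VERB"): "be",
    ("wakes", "VERB"): "wake",  # to wake up
    ("wakes", "NOUN"): "wake",  # a funeral gathering
}


def toy_lemmatize(word, pos):
    """Return (lemma, pos); the caller must already know the part of speech."""
    key = (word.lower(), pos)
    return (LEMMAS.get(key, word.lower()), pos)


print(toy_lemmatize("was", "VERB"))
print(toy_lemmatize("paid", "VERB"))
# The same surface form stays separate when the sense differs.
print(toy_lemmatize("wakes", "VERB"), toy_lemmatize("wakes", "NOUN"))
```

Unlike a stemmer, the lookup handles irregular forms such as ```was``` and ```paid```, but only because every form and its part of speech were supplied up front.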

### Algorithmic Stemmers

While you can use the porter_stem or kstem token filter directly, or create a language-specific Snowball stemmer with the snowball token filter, all of the algorithmic stemmers are exposed via a single unified interface: the stemmer token filter, which accepts the language parameter.

For instance, perhaps you find the default stemmer used by the english analyzer to be too aggressive and you want to make it less aggressive. The first step is to look up the configuration for the english analyzer in the language analyzers documentation, which shows the following:

```
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_"
        },
        "english_keywords": {
          "type":       "keyword_marker", 
          "keywords":   []
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english" 
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english" 
        }
      },
      "analyzer": {
        "english": {
          "tokenizer":  "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}
```
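Since only the ```language``` value of the ```stemmer``` filter needs to change, the configuration above can be generated from Python with that value as a parameter. This is a sketch: the helper name `english_analysis_settings` and its arguments are assumptions for illustration, and the returned dict simply mirrors the JSON shown above.

```python
import json


def english_analysis_settings(stemmer_language="english", analyzer_name="english"):
    """Build index settings mirroring the english analyzer (hypothetical helper).

    Only the stemmer language and the analyzer name vary; every other
    value is copied from the configuration shown above.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "english_stop": {"type": "stop", "stopwords": "_english_"},
                    "english_keywords": {"type": "keyword_marker", "keywords": []},
                    "english_stemmer": {"type": "stemmer", "language": stemmer_language},
                    "english_possessive_stemmer": {
                        "type": "stemmer",
                        "language": "possessive_english",
                    },
                },
                "analyzer": {
                    analyzer_name: {
                        "tokenizer": "standard",
                        "filter": [
                            "english_possessive_stemmer",
                            "lowercase",
                            "english_stop",
                            "english_keywords",
                            "english_stemmer",
                        ],
                    }
                },
            }
        }
    }


settings = english_analysis_settings("light_english", analyzer_name="my_english")
print(json.dumps(settings["settings"]["analysis"]["filter"]["english_stemmer"], indent=2))
```

The resulting dict has the same shape as the request body used when creating an index, so it could be passed wherever the hand-written settings are used.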

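The modified configuration below defines a "lighter" analyzer, my_english: it swaps in the less aggressive light_english stemmer, drops the empty keyword marker filter, and adds asciifolding: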

In [107]:
english_token_filter = {
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_"
        },
        "light_english_stemmer": {
          "type":       "stemmer",
          "language":   "light_english" 
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer":  "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "light_english_stemmer", 
            "asciifolding" 
          ]
        }
      }
    }
  }
}

In [108]:
index.create_my_index(body=english_token_filter)

Now let's put some data into my_index:

In [109]:
text = "You're right about those jumping jacks in the Über generation of waiters."
doc = {
    "message": text
}
es.create(index="my_index", doc_type='test', body=doc, id=1)

{'_id': '1',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'test',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [110]:
# test with the built-in 'english' analyzer
response = es.indices.analyze(index='my_index', analyzer='english', text=text)
analyzed_text = [x['token'] for x in response['tokens']]
print(','.join(analyzed_text))


you'r,right,about,those,jump,jack,über,gener,waiter


Besides reducing the plural to the singular ('jacks' => 'jack'), notice the following mappings:

* You're => you'r
* jumping => jump
* generation => gener

In [111]:
# test with the modified English analyzer - 'my_english'
response = es.indices.analyze(index='my_index', analyzer='my_english', text=text)
analyzed_text = [x['token'] for x in response['tokens']]
print(','.join(analyzed_text))

you're,right,about,those,jump,jack,uber,generation,waiter


Besides reducing the plural to the singular ('jacks' => 'jack'), notice how the mappings differ:

* You're => you're (lowercased only, i.e. not stemmed)
* generation => generation (unchanged, i.e. not stemmed)
* Über => uber (i.e. asciifolded)
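To see every difference between the two analyzers at once, the two printed token lists can be compared position by position. The token strings below are copied from the two outputs above; the helper `token_differences` is a hypothetical name, and the sketch assumes both analyzers emitted the same number of tokens, as they did here.

```python
# Token lists copied from the two analyze() outputs above.
english_tokens = "you'r,right,about,those,jump,jack,über,gener,waiter".split(",")
my_english_tokens = "you're,right,about,those,jump,jack,uber,generation,waiter".split(",")


def token_differences(left, right):
    """Pair tokens by position and keep only the pairs that differ."""
    if len(left) != len(right):
        raise ValueError("token lists must have the same length to be paired")
    return [(a, b) for a, b in zip(left, right) if a != b]


for english_token, light_token in token_differences(english_tokens, my_english_tokens):
    print("{} -> {}".format(english_token, light_token))
```

Only three positions differ: the contraction, the folded ```Über```, and ```generation```.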

But what happens if we search the indexed document for some of these transformed words?

In [91]:
s = Search(using=es, index="my_index").query('match', message="jump uber")
s.execute()

<Response: []>

Hmmm. No results for ```jump uber```. How come?
Let's check the mapping for the field ```message```:

In [92]:
res = es.indices.get_mapping(index='my_index', doc_type='test')
res
#es.indices.get_field_mapping(index='my_index', fields='message')

{'my_index': {'mappings': {'test': {'properties': {'message': {'fields': {'keyword': {'ignore_above': 256,
        'type': 'keyword'}},
      'type': 'text'}}}}}}
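The mapping above has no ```analyzer``` entry for ```message```. Assuming the field therefore falls back to the default ```standard``` analyzer (which lowercases but neither stems nor folds accents), the indexed terms never line up with the query terms. The sketch below only approximates that analyzer with a regular expression; `approx_standard_tokens` is a hypothetical stand-in, not an Elasticsearch API.

```python
import re


def approx_standard_tokens(text):
    """Rough stand-in for the standard analyzer: lowercase, then split on
    anything that is not a word character or an apostrophe."""
    return re.findall(r"[\w']+", text.lower())


text = "You're right about those jumping jacks in the Über generation of waiters."

indexed_terms = set(approx_standard_tokens(text))
query_terms = set(approx_standard_tokens("jump uber"))

print(sorted(indexed_terms))
# The intersection is empty, so a match query finds nothing.
print(query_terms & indexed_terms)
```

The document contains ```jumping``` and ```über```, while the query asks for ```jump``` and ```uber```, so neither term matches.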

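The my_english analyzer is defined in the index settings but has not been assigned to the ```message``` field, so the field is still analyzed with the default analyzer. Let's map it now: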

In [99]:
english_token_filter = {
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_"
        },
        "light_english_stemmer": {
          "type":       "stemmer",
          "language":   "light_english" 
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer":  "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "light_english_stemmer", 
            "asciifolding" 
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "message": {
          "type":     "text",
          "analyzer": "my_english"
        }
      }
    }
  }
}
index.create_my_index(body=english_token_filter)

In [100]:
text = "You're right about those jumping jacks in the Über generation of waiters."
doc = {
    "message": text
}
es.create(index="my_index", doc_type='test', body=doc, id=1)

{'_id': '1',
 '_index': 'my_index',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'test',
 '_version': 1,
 'created': True,
 'result': 'created'}

In [101]:
s = Search(using=es, index="my_index", doc_type='test').query('match', message="jump")
res = s.execute()
print(res.hits.total)
print(res[0].message)

1
You're right about those jumping jacks in the Über generation of waiters.


In [102]:
s = Search(using=es, index="my_index", doc_type='test').query('match', message="uber")
res = s.execute()
print(res.hits.total)
print(res[0].message)

1
You're right about those jumping jacks in the Über generation of waiters.
