<a href="https://colab.research.google.com/github/pragmalingu/experiments/blob/master/01_Stemming/Experiment/Hunspell_Dictionary_Stemmer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dictionary Stemmer

Dictionary stemmers look up words in a provided dictionary, replacing unstemmed word variants with stemmed words from the dictionary.

In theory, these are well suited for:
1. Stemming irregular words;
2. Discerning between words that are spelled similarly but not related conceptually.

They also have a few drawbacks:
1. Highly dependent on dictionary quality;
2. Size and performance, as a result of having to load all words, prefixes and suffixes from the dictionary.

(For more information, see the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/stemming.html#dictionary-stemmers) or the [First Experiment](https://pragmalingu.de/docs/experiments/experiment1#2-stemming) on our website.)
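To make the lookup idea concrete, here is a minimal sketch of a dictionary stemmer next to a deliberately crude suffix stripper. The lookup table `TOY_DICTIONARY` and both helper functions are hypothetical and written only for illustration; Hunspell uses full dictionary and affix files rather than a hand-written table.

```python
# Toy dictionary stemmer: a hand-written lookup table, not Hunspell.
TOY_DICTIONARY = {
    "foxes": "fox",
    "jumping": "jump",
    "mice": "mouse",  # irregular plural
    "went": "go",     # irregular verb form
}


def dictionary_stem(token, dictionary=TOY_DICTIONARY):
    """Return the dictionary stem, or the lowercased token if it is unknown."""
    token = token.lower()
    return dictionary.get(token, token)


def naive_suffix_stem(token):
    """Deliberately crude suffix stripping, for contrast only."""
    token = token.lower()
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


for word in ["mice", "went", "foxes", "jumping", "quickly"]:
    print(word, dictionary_stem(word), naive_suffix_stem(word))
```

The irregular forms `mice` and `went` are only resolved by the lookup, while a word missing from the table (`quickly`) is left unchanged, which is why the quality of the dictionary matters so much.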

## Introduction

### Set Up an Elasticsearch Instance in Google Colab

Everything needed to download, start, and connect to Elasticsearch.

In [None]:
import os
from subprocess import Popen, PIPE, STDOUT
# download elasticsearch
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.1-linux-x86_64.tar.gz -q
!tar -xzf elasticsearch-7.9.1-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.9.1

# start server
es_server = Popen(['elasticsearch-7.9.1/bin/elasticsearch'], 
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                 )
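# client-side: pin the 7.x client so that it matches the 7.9.1 server
!pip install "elasticsearch>=7,<8" -q
from elasticsearch import Elasticsearch
from datetime import datetime
es = Elasticsearch(["localhost:9200/"])
# give the server time to start before pinging it
import time
time.sleep(30)
es.ping()  # returns True once the server is reachable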
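A fixed 30-second sleep can be too short on a slow machine and wastefully long on a fast one. As an alternative, here is a minimal sketch that polls until the server answers. The helper `wait_until_ready` is hypothetical and not part of the Elasticsearch client; it accepts any zero-argument callable that returns a truthy value on success.

```python
import time


def wait_until_ready(ping, timeout=60.0, interval=1.0):
    """Call ping() until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if ping():
                return True
        except Exception:
            pass  # the server may still be refusing connections
        time.sleep(interval)
    return False


# Usage in the notebook (assumes the `es` client from the cell above):
# wait_until_ready(es.ping)
```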

### Analyser

An analyser contains three lower-level building blocks: character filters, a tokenizer, and token filters.
To apply dictionary stemming, we first need to configure a custom analyser that uses the Hunspell token filter.

(Hunspell filter reference guide at [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-hunspell-tokenfilter.html#analysis-hunspell-tokenfilter))
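The Hunspell filter reads its dictionaries from a `hunspell/<locale>` directory inside the Elasticsearch config directory, where it expects `.aff` and `.dic` files. The sketch below checks whether such files are present. The config path `elasticsearch-7.9.1/config` is an assumption based on the archive unpacked in the setup cell, and `find_hunspell_files` is a hypothetical helper.

```python
from pathlib import Path


def find_hunspell_files(config_dir, locale="en_US"):
    """Return the .aff and .dic file names found in <config_dir>/hunspell/<locale>."""
    locale_dir = Path(config_dir) / "hunspell" / locale
    if not locale_dir.is_dir():
        return [], []
    aff_files = sorted(p.name for p in locale_dir.glob("*.aff"))
    dic_files = sorted(p.name for p in locale_dir.glob("*.dic"))
    return aff_files, dic_files


# Assumed path, based on the archive unpacked in the setup cell.
aff, dic = find_hunspell_files("elasticsearch-7.9.1/config")
if not (aff and dic):
    print("No Hunspell dictionary found for en_US; index creation is likely to fail.")
else:
    print("affix files:", aff, "dictionary files:", dic)
```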

In [None]:
#the order of the "filter" and "analyzer" keys is arbitrary
dictionary_analyser = {
    "filter" : {
        "dictionary_stemmer" : {
          "type" : "hunspell",
          "locale" : "en_US",
          "dedup" : True  #duplicate tokens are removed from the filter’s output
        }
    },
    "analyzer" : {
        "default" : {
            "tokenizer" : "standard",
            "filter" : ["lowercase", "dictionary_stemmer"]
        }
    }
}

### Indexing

The next step is to specify a default analyser for the index; in the following example we do so at index creation.

(Reference guide for analyser specification at [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/specify-analyzer.html#specify-index-time-default-analyzer))

In [None]:
#create the correct settings 
settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": dictionary_analyser
    }
}

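#create index (delete any previous copy first; ignore the 404 if it does not exist yet)
es.indices.delete(index="dictionary-stemming-index", ignore=[404])
es.indices.create(index="dictionary-stemming-index", body=settings)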

#index document
doc = {
    'author': 'kimchy',
    'text': 'the foxes jumping quickly',
    'timestamp': datetime.now()
}

#refresh=True makes the document searchable immediately
res = es.index(index="dictionary-stemming-index", id=1, body=doc, refresh=True)
print(res['result'])

### Searching

Finally, let us observe stemming in action with a test query. Note that the example searches for `jump` rather than `jumping`: the dictionary stemmer removes the `-ing` suffix at index time, so the document is indexed under, and matched by, the word's root form.

In [None]:
#test query
query = {
    "query": {
        "multi_match": {
            "query": "jump",
            "fields": ["title", "text"]
        }
    }
}
res = es.search(index="dictionary-stemming-index", body=query)

print("Got %d Hits:" % res['hits']['total']['value'])

for hit in res['hits']['hits']:
    print("%(timestamp)s %(author)s: %(text)s" % hit["_source"])
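To see why `jump` matches a document containing `jumping`, it can help to inspect the tokens the analyser produces. The sketch below wraps the client's `indices.analyze` call, which uses the index's default analyser when none is named. The helper `analyzed_tokens` is hypothetical, and the usage lines assume the `es` client and the index created earlier.

```python
def analyzed_tokens(client, index, text):
    """Run `text` through the index's default analyser and return the token strings."""
    response = client.indices.analyze(index=index, body={"text": text})
    return [entry["token"] for entry in response["tokens"]]


# Usage in the notebook (assumes the `es` client and the index created above):
# print(analyzed_tokens(es, "dictionary-stemming-index", "the foxes jumping quickly"))
# print(analyzed_tokens(es, "dictionary-stemming-index", "jump"))
```

If the document text and the query text yield a common token, the query matches the document.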