<a href="https://colab.research.google.com/github/pragmalingu/experiments/blob/01_Stemming/Experiment/Algorithmic_Stemmer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Algorithmic Stemmer

Algorithmic stemmers apply a series of rules to each word to reduce it to its root form.

This rule-based approach gives them a few advantages:
1. They require little setup and usually work well out of the box;
2. They use little memory;
3. They are typically faster than dictionary stemmers.

(For more information see the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/stemming.html#algorithmic-stemmers))
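
To make "a series of rules" concrete, here is a minimal sketch of a suffix-stripping stemmer. The rule table, the `MIN_STEM_LENGTH` guard, and the name `toy_stem` are all hypothetical and chosen for illustration; this is not the algorithm Elasticsearch's stemmers use, which apply far more rules and conditions.

```python
# Toy suffix-stripping stemmer: illustrative only, not Elasticsearch's algorithm.

# Ordered (suffix, replacement) rules; earlier rules take priority.
RULES = [
    ("sses", "ss"),  # classes -> class
    ("ies", "y"),    # ponies  -> pony
    ("ing", ""),     # jumping -> jump
    ("es", ""),      # foxes   -> fox
    ("s", ""),       # cats    -> cat
]

# Hypothetical guard: reject a rule if it would leave a very short stem,
# so that a word such as "sing" is left alone.
MIN_STEM_LENGTH = 3


def toy_stem(word: str) -> str:
    """Apply the first suffix rule that matches and leaves a long enough stem."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= MIN_STEM_LENGTH:
                return stem
    return word  # no rule applied


print([toy_stem(w) for w in ["foxes", "jumping", "quickly", "sing"]])
# ['fox', 'jump', 'quickly', 'sing']
```

Because the whole stemmer is a short rule table and a loop, it needs no dictionary in memory, which is where the setup, memory, and speed advantages listed above come from.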

## Introduction

### Set Up an Elasticsearch Instance in Google Colab

The cells below contain everything needed to connect to Elasticsearch. For a detailed explanation, see [this notebook](https://).

Download Elasticsearch:

In [None]:
import os
from subprocess import Popen, PIPE, STDOUT
# download elasticsearch
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.1-linux-x86_64.tar.gz -q
!tar -xzf elasticsearch-7.9.1-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.9.1

Start a local server:

In [None]:
# start server
es_server = Popen(['elasticsearch-7.9.1/bin/elasticsearch'], 
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                 )
# client-side: pin the 7.x client so it matches the 7.9.1 server downloaded above
!pip install "elasticsearch>=7,<8" -q
from elasticsearch import Elasticsearch
from datetime import datetime
es = Elasticsearch(["localhost:9200/"])
# give the server time to finish starting before pinging it
import time
time.sleep(30)
es.ping()  # got True

True

### Analyser

An analyser contains three lower-level building blocks: character filters, a tokenizer, and token filters.
To apply stemming we first need to configure a custom analyser that makes use of the stemmer filter. 

(Stemmer filter reference guide at [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html#analysis-stemmer-tokenfilter))
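
As a rough mental model of how the three building blocks chain together, the sketch below runs text through character filters, then a tokenizer, then token filters, using plain Python functions. Every name here (`strip_html`, `word_tokenizer`, `strip_ing`, `analyze`) is hypothetical and only mimics the idea; Elasticsearch's own components behave differently in detail. The analyser configured in this notebook uses no character filters, so the HTML step is included purely to show where that stage sits.

```python
import re
from typing import Callable, List


def strip_html(text: str) -> str:
    """Character filter: operates on the raw string before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)


def word_tokenizer(text: str) -> List[str]:
    """Tokenizer: splits the filtered string into tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens: List[str]) -> List[str]:
    """Token filter: normalizes case."""
    return [t.lower() for t in tokens]


def strip_ing(tokens: List[str]) -> List[str]:
    """Token filter: crude stand-in for a stemmer (removes a trailing 'ing')."""
    return [t[:-3] if t.endswith("ing") and len(t) > 5 else t for t in tokens]


def analyze(
    text: str,
    char_filters: List[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: List[Callable[[List[str]], List[str]]],
) -> List[str]:
    """Run the three stages in order: char filters, tokenizer, token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("<b>The Foxes</b> JUMPING quickly",
              [strip_html], word_tokenizer, [lowercase, strip_ing]))
# ['the', 'foxes', 'jump', 'quickly']
```

Token filters run in the order listed, so `lowercase` is placed before the stemming step, just as `"lowercase"` precedes the stemmer in the filter list of the configuration that follows.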

In [None]:
# The order of the "filter" and "analyzer" keys is arbitrary.
stemming_analyser = {
    "filter" : {
        "eng_stemmer" : {
            "type" : "stemmer",
            "name" : "english"
        }
    },
    "analyzer" : {
        "default" : {
            "tokenizer" : "standard",
            "filter" : ["lowercase", "eng_stemmer"]
        }
    }
}

### Indexing

The next step is to specify the default analyser for the index; in the following example we do so at index creation.

(Reference guide for analyser specification at [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/specify-analyzer.html#specify-index-time-default-analyzer))

In [None]:
#create the correct settings 
settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": stemming_analyser
    }
}

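# create index; delete any previous copy first and ignore "not found" on the first run
es.indices.delete(index="stemming-index", ignore=[400, 404])
es.indices.create(index="stemming-index", body=settings)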

#indexing documents
doc = {
    'author': 'kimchy',
    'text': 'the foxes jumping quickly',
    'timestamp': datetime.now()
}

# refresh=True makes the document visible to searches immediately
res = es.index(index="stemming-index", id=1, body=doc, refresh=True)
print(res['result'])
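
To check what the index's default analyser actually produces, one option is the analyze API. The helper below is a small sketch: the name `analyzed_tokens` is hypothetical, and it assumes a 7.x Python client whose `indices.analyze` accepts `index` and `body` keyword arguments. When the request names no analyser or field, Elasticsearch falls back to the index's default analyser.

```python
def analyzed_tokens(client, index: str, text: str) -> list:
    """Return the token strings the index's default analyser emits for `text`."""
    response = client.indices.analyze(index=index, body={"text": text})
    return [token["token"] for token in response["tokens"]]


# Usage against the instance started earlier (needs the running server):
# print(analyzed_tokens(es, "stemming-index", "the foxes jumping quickly"))
```

Inspecting the tokens this way is a quick check that the custom analyser was registered as the default before running any queries.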

### Searching

Lastly, let us observe stemming in action with a mock query. Note that the example searches for `jump` rather than `jumping`: Elasticsearch's English stemmer strips the `-ing` suffix when the document is indexed, so the text is stored under the word's root form `jump` and the query still matches.

In [None]:
#test query
res = es.search(index="stemming-index", body={"query": {"match" : {"text": {"query" : "jump"} }}})

print("Got %d Hits:" % res['hits']['total']['value'])

for hit in res['hits']['hits']:
    print("%(timestamp)s %(author)s: %(text)s" % hit["_source"])
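
The reason the query matches can be sketched without Elasticsearch at all: if the same analysis runs on documents at index time and on queries at search time, both sides meet at the stemmed term. The toy inverted index below is an assumption-laden illustration (the `analyze` function, the second sample document, and the `search` helper are hypothetical), not a description of how Elasticsearch stores data internally.

```python
from collections import defaultdict


def analyze(text: str) -> list:
    """Lowercase, split on whitespace, and strip a trailing 'ing' (toy stemming)."""
    tokens = text.lower().split()
    return [t[:-3] if t.endswith("ing") and len(t) > 5 else t for t in tokens]


# Index time: map each analysed term to the ids of the documents containing it.
documents = {1: "the foxes jumping quickly", 2: "a lazy dog sleeps"}
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for term in analyze(text):
        inverted_index[term].add(doc_id)


def search(query: str) -> list:
    """Search time: analyse the query the same way, then look up each term."""
    hits = set()
    for term in analyze(query):
        hits |= inverted_index.get(term, set())
    return sorted(hits)


print(search("jump"))     # [1]
print(search("Jumping"))  # [1]
```

Both `jump` and `Jumping` reduce to the term `jump`, which is the only form stored for document 1, so either query finds it.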