<a href="https://colab.research.google.com/github/pragmalingu/experiments/blob/master/01_Stemming/Experiment/Algorithmic_Stemmer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Algorithmic Stemmer

Algorithmic stemmers apply a series of rules to each word to reduce it to its root form.

This rule-based approach gives them a few advantages:
1. They require little setup and usually work well out of the box;
2. They use little memory;
3. They are typically faster than dictionary stemmers.

(For more information, see the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/stemming.html#algorithmic-stemmers) or the [First Experiment](https://pragmalingu.de/docs/experiments/experiment1#2-stemming) on our website.)
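
To make the idea of "a series of rules" concrete, here is a minimal sketch of a rule-based suffix stripper. It is a toy written for illustration only: the rule list and the names `TOY_RULES` and `toy_stem` are our own assumptions, and it is not the algorithm used by any of the Elasticsearch filters discussed below.

```python
# Toy suffix-stripping stemmer (illustrative only; NOT Porter, KStem or Snowball).
# Each rule is (suffix, replacement); the first matching rule wins.
TOY_RULES = [
    ("sses", "ss"),  # "classes" -> "class"
    ("ies", "y"),    # "ponies"  -> "pony"
    ("ing", ""),     # "jumping" -> "jump"
    ("ly", ""),      # "quickly" -> "quick"
    ("es", ""),      # "foxes"   -> "fox"
    ("s", ""),       # "cats"    -> "cat"
]


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, unless the remaining stem gets too short."""
    word = word.lower()
    for suffix, replacement in TOY_RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem + replacement
    return word  # no rule applied


print([toy_stem(w) for w in "the foxes jumping quickly".split()])
# ['the', 'fox', 'jump', 'quick']
```

No dictionary is consulted at any point, which is why such stemmers need little setup and memory. The price is that a fixed rule list can also strip too much or too little for words it was not designed for.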

## Introduction

### Set Up an Elasticsearch Instance in Google Colab

The following cell downloads Elasticsearch, starts a local server, and connects a Python client to it:

In [None]:
import os
from subprocess import Popen, PIPE, STDOUT
# download elasticsearch
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.1-linux-x86_64.tar.gz -q
!tar -xzf elasticsearch-7.9.1-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.9.1

# start server
es_server = Popen(['elasticsearch-7.9.1/bin/elasticsearch'], 
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1)  # as daemon
                 )
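# client-side: pin the Python client to 7.x so it matches the 7.9.1 server
!pip install "elasticsearch>=7,<8" -q
from elasticsearch import Elasticsearch
from datetime import datetime
es = Elasticsearch(["localhost:9200/"])
# give the server some time to start before pinging it
import time
time.sleep(30)
es.ping()  # got True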
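
A fixed 30-second sleep can be too short on a slow machine and needlessly long on a fast one. As a hedged alternative, the sketch below polls a readiness check until it succeeds or a timeout expires. The helper name `wait_until` and its parameters are our own assumptions, not part of the Elasticsearch client.

```python
import time


def wait_until(predicate, timeout=60.0, interval=1.0):
    """Call predicate() repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if predicate():
                return True
        except Exception:
            pass  # treat errors (e.g. connection refused) as "not ready yet"
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage in this notebook, replacing time.sleep(30):
# wait_until(es.ping, timeout=120)
```

The function returns `False` instead of raising on timeout, so the calling cell can decide how to react.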

### Analyser

An analyser contains three lower-level building blocks: character filters, a tokenizer, and token filters.
To apply stemming we first need to configure a custom analyser that makes use of the stemmer filter. 

(Stemmer filter reference guide at [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stemmer-tokenfilter.html#analysis-stemmer-tokenfilter))
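
The three building blocks run in a fixed order: character filters act on the raw string, the tokenizer splits it into tokens, and token filters transform the token stream. The following pure-Python sketch mimics that order. All function names are hypothetical stand-ins, and the regular expressions are far simpler than Elasticsearch's `standard` tokenizer or its stemmers.

```python
import re


def strip_html(text):
    """Character filter: works on the raw string before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)


def simple_tokenizer(text):
    """Tokenizer: splits the string into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: lowercases every token."""
    return [t.lower() for t in tokens]


def strip_ing(tokens):
    """Token filter: crude stand-in for a stemmer, removes a trailing 'ing'."""
    return [t[:-3] if t.endswith("ing") and len(t) > 5 else t for t in tokens]


def analyse(text, char_filters, tokenizer, token_filters):
    """Apply character filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


print(analyse("The <b>foxes</b> JUMPING quickly",
              [strip_html], simple_tokenizer, [lowercase, strip_ing]))
# ['the', 'foxes', 'jump', 'quickly']
```

The order of the token filters matters: here `lowercase` runs first, so `strip_ing` only ever sees lowercase tokens, just as the stemmer follows `lowercase` in the analysers below.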

In [None]:
# the order of the "filter" and "analyzer" keys is arbitrary

#stemmer token filter
stemmer_analyser = {
    "filter" : {
        "eng_stemmer" : {
            "type" : "stemmer",
            "name" : "english"
        }
    },
    "analyzer" : {
        "default" : {
            "tokenizer" : "standard",
            "filter" : ["lowercase", "eng_stemmer"]
        }
    }
}

#kstem token filter
kstem_analyser = {
    "filter" : {
        "eng_stemmer" : {
            "type" : "kstem"  # English only; no documented parameters, so "name" is dropped
        }
    },
    "analyzer" : {
        "default" : {
            "tokenizer" : "standard",
            "filter" : ["lowercase", "eng_stemmer"]
        }
    }
}

#porter token filter
porter_analyser = {
    "filter" : {
        "eng_stemmer" : {
            "type" : "porter_stem"  # English only; no documented parameters, so "name" is dropped
        }
    },
    "analyzer" : {
        "default" : {
            "tokenizer" : "standard",
            "filter" : ["lowercase", "eng_stemmer"]
        }
    }
}

#snowball token filter
snowball_analyser = {
    "filter" : {
        "eng_stemmer" : {
            "type" : "snowball",
            "language" : "English"  # the documented parameter of this filter is "language"
        }
    },
    "analyzer" : {
        "default" : {
            "tokenizer" : "standard",
            "filter" : ["lowercase", "eng_stemmer"]
        }
    }
}
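
The four dictionaries above differ only in the token filter definition. As a sketch of how the repetition could be avoided, the hypothetical helper `build_analysis` below (our own name, not an Elasticsearch API) assembles the same structure from a filter type and optional filter parameters.

```python
def build_analysis(filter_type, filter_name="eng_stemmer", **filter_params):
    """Build an "analysis" settings dict whose default analyser uses one stemming filter."""
    token_filter = {"type": filter_type}
    token_filter.update(filter_params)  # e.g. name="english" for the stemmer filter
    return {
        "filter": {filter_name: token_filter},
        "analyzer": {
            "default": {
                "tokenizer": "standard",
                "filter": ["lowercase", filter_name],
            }
        },
    }


# Builds a dictionary equal to stemmer_analyser above:
print(build_analysis("stemmer", name="english"))
```

Because the result is a plain dictionary, it can be passed as the `"analysis"` value of the index settings in exactly the same way as the hand-written versions.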

### Indexing

The next step is to specify the default analyser for each index; in the following example we do so at index creation.

(Reference guide for analyser specification at [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/specify-analyzer.html#specify-index-time-default-analyzer))

In [None]:
#create the correct settings
stemmer_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": stemmer_analyser
    }
}

kstem_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": kstem_analyser
    }
}

porter_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": porter_analyser
    }
}

snowball_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": snowball_analyser
    }
}

# create index

# stemmer token filter
es.indices.create("stemmer-index", body=stemmer_settings)

# kstem token filter
es.indices.create("kstem-index", body=kstem_settings)

# porter token filter
es.indices.create("porter-index", body=porter_settings)

# snowball token filter
es.indices.create("snowball-index", body=snowball_settings)

{'acknowledged': True, 'index': 'snowball-index', 'shards_acknowledged': True}

In [None]:
#index document
doc = {
    'author': 'kimchy',
    'text': 'the foxes jumping quickly',
    'timestamp': datetime.now()
}

# stemmer token filter
res = es.index(index="stemmer-index", id=1, body=doc)
print(res['result'])

# kstem token filter
res = es.index(index="kstem-index", id=1, body=doc)
print(res['result'])

# porter token filter
res = es.index(index="porter-index", id=1, body=doc)
print(res['result'])

# snowball token filter
res = es.index(index="snowball-index", id=1, body=doc)
print(res['result'])

created
created
created
created


### Searching

Lastly, let us observe stemming in action with a mock query. Note that the example searches for `jump` rather than `jumping`: at index time the English stemmer strips the `-ing` suffix from `jumping`, so the index stores the word's root form `jump`, which the query then matches.

In [None]:
#test query stemmer token filter
res = es.search(index="stemmer-index", body={"query": {"match" : {"text": {"query" : "jump"} }}})
print("Got %d Hits:" % res['hits']['total']['value'])
for hit in res['hits']['hits']:
    print("Stemmer: %(timestamp)s %(author)s: %(text)s\n" % hit["_source"])

#test query kstem token filter
res = es.search(index="kstem-index", body={"query": {"match" : {"text": {"query" : "jump"} }}})
print("Got %d Hits:" % res['hits']['total']['value'])
for hit in res['hits']['hits']:
    print("Kstem: %(timestamp)s %(author)s: %(text)s\n" % hit["_source"])

#test query porter token filter
res = es.search(index="porter-index", body={"query": {"match" : {"text": {"query" : "jump"} }}})
print("Got %d Hits:" % res['hits']['total']['value'])
for hit in res['hits']['hits']:
    print("Porter: %(timestamp)s %(author)s: %(text)s\n" % hit["_source"])

#test query snowball token filter
res = es.search(index="snowball-index", body={"query": {"match" : {"text": {"query" : "jump"} }}})
print("Got %d Hits:" % res['hits']['total']['value'])
for hit in res['hits']['hits']:
    print("Snowball: %(timestamp)s %(author)s: %(text)s" % hit["_source"])

Got 1 Hits:
Stemmer: 2020-09-23T08:38:42.057548 kimchy: the foxes jumping quickly

Got 1 Hits:
Kstem: 2020-09-23T08:38:42.057548 kimchy: the foxes jumping quickly

Got 1 Hits:
Porter: 2020-09-23T08:38:42.057548 kimchy: the foxes jumping quickly

Got 1 Hits:
Snowball: 2020-09-23T08:38:42.057548 kimchy: the foxes jumping quickly
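
All four indices return the same hit, so the search results alone do not show how the stemmers differ. One way to look closer is the analyze API, which returns the tokens an analyser produces for a given text. The sketch below wraps the 7.x client call `indices.analyze`; the helper name `index_tokens` is our own assumption, and we deliberately list no expected tokens because they depend on the filter in use.

```python
def index_tokens(client, index, text):
    """Return the tokens that the index's default analyser produces for `text`."""
    response = client.indices.analyze(index=index, body={"text": text})
    return [token["token"] for token in response["tokens"]]


# Hypothetical usage against the indices created above (needs the running server):
# for name in ["stemmer-index", "kstem-index", "porter-index", "snowball-index"]:
#     print(name, index_tokens(es, name, "the foxes jumping quickly"))
```

Since no analyser is named in the request, the API falls back to the default analyser of the given index, which is the custom one configured at index creation.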
