License: GPL v3

elasticsearch7-ngrams-fuzzy-shingles-stemming-workshop

preface

  • goals of this workshop
    • introduction to typos and suggestions handling in elasticsearch
    • introduction to basic constructs boosting search
      • ngram and edge ngram (typos, prefixes)
      • shingles (phrase)
      • stemmers (operating on roots rather than words)
      • fuzzy queries (typos)
      • suggesters
  • docker-compose with Elasticsearch + Kibana (7.6) is prepared for local testing
    • cd docker/compose
    • docker-compose up -d
  • workshop tasks and answers are in the workshop directory

introduction

  • searching natural language is inherently imprecise - computers can't truly comprehend natural language
    • they need heuristics, algorithmic approximations of linguistic comprehension
  • Lucene is composed of many text processing tools
    • example: prefix query
      • simply matches the beginning letters of words
    • example: fuzzy queries
      • find words that need at most a certain number of 'edits' to match the query
    • example: snowball stemmer and the metaphone phonetic analyzer
      • mimic grammatical and phonetic aspects of language
    • example: autocomplete functionality (other names: search as you type, type-ahead search)

ngram

  • splitting a token into multiple subtokens
  • a sliding window that moves across the word - a continuous sequence of characters of the specified length
  • example
    • token: pillar
    • 1-grams: p, i, l, l, a, r
    • bigrams: pi, il, ll, la, ar
    • trigrams: pil, ill, lla, lar
  • how it can help match an incorrectly spelled word (the _analyze sketch at the end of this section reproduces these bigrams)
    • search: "pilar"
    • bigrams for "pillar": pi, il, ll, la, ar
    • bigrams for "pilar": pi, il, la, ar
      • 4 of the 5 bigrams of the correctly spelled word match
  • drawback: more words than intended will match the original
  • use-case
    • analyze text when the language is not known
    • handle multiple languages with a single analyzer
    • analyze text when the language combines words in a different manner than most European languages
      • japanese: no spaces between words
      • german: long compound words
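  • quick check with the _analyze API - a minimal sketch using the keyword tokenizer (so "pillar" stays a single token) and an inline ngram filter; the response should contain the bigrams listed above
    GET /_analyze
    {
        "tokenizer": "keyword",
        "filter": [ { "type": "ngram", "min_gram": 2, "max_gram": 2 } ],
        "text": "pillar"
    }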

edge ngrams

  • no user will search for "Database" using the "ase" chunk of characters
    • for many applications, only ngrams that start at the beginning of words are needed
    • that’s where edge n-grams come into play
  • edge n-grams = consecutive prefixes
  • example
    • token: pillar
    • edge n-grams: p, pi, pil, pill, pilla, pillar
  • helpful for searching words with prefix
    • vs prefix query: a prefix query is much more time-consuming at search time
    • but indexing edge ngrams takes longer (and indices are bigger, since they contain the prefixes)
  • index vs search analyzer
    • standard approach: same analyzer at index time and at search time
    • different advice for edge ngrams
      • index time: save all prefixes
      • search time: search for the terms that user typed in
      • good practice to set upper limit
        • for example, if the search text is longer than 8 characters, fall back to full-text search instead of the edge-ngram terms
  • prepare index with edge-ngram analyzer
    PUT /index-name
    {
        "settings": {
            "analysis": {
                "filter": {
                    "autocomplete_filter": { // filter that will split tokens into edge ngrams
                        "type": "edge_ngram",
                        "min_gram": 2, // smallest ngrams to generate, default: 1
                        "max_gram": 5 // largest ngrams to generate, default: 2
                    }
                },
                "analyzer": {
                    "autocomplete_analyzer": { // custom analyzer: standard + autocomplete
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": [ "lowercase", "stop", "autocomplete_filter"]
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "description": { // field name with autocomplete feature
                    "type": "text", // normal mapping
                    "fields" : { // multi-field mapping
                        "autocomplete": { // mapping responsible for autocomplete feature
                            "type": "text", 
                            "analyzer": "autocomplete_analyzer", 
                            "search_analyzer": "standard" // override - by default, queries use the analyzer defined above
                        },
                        "english": { // other mappings for the same field
                            "type": "text",
                            "analyzer": "english"
                        }
                    }
                }
            }
        }
    }
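
  • checking the analyzer output - a minimal _analyze sketch (assumes the index above was created); "Database" should produce the edge ngrams da, dat, data, datab
    GET /index-name/_analyze
    {
        "analyzer": "autocomplete_analyzer",
        "text": "Database"
    }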
    
  • querying for autocomplete
    GET /index-name/_search
    {
        "query": {
            "bool": {
                "should": { // filter, must
                    "match": { "description.autocomplete": "search for descr" }
                }
            }
        }
    }
    

shingles

  • ngrams at the token level instead of the character level
  • example
    • token: "please divide this sentence into shingles"
    • shingles bigrams: please divide, divide this, this sentence, sentence into, and into shingles
  • rare phrase matches will have a bigger score boost than more common phrases
  • upside: pre-baked phrase matching
    • phrases are built into the index
    • avoids creating phrases at query time, saving processing time
  • downside: larger indices
  • index-phrases option on a text field
    • two-term word combinations (shingles) are indexed into a separate field
    • allows exact phrase queries (no slop) to run more efficiently, at the expense of a larger index
      • slop = how far apart the terms are allowed to be
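    • a minimal mapping sketch enabling the option (the index and field names are placeholders)
    PUT /index-name-phrases
    {
        "mappings": {
            "properties": {
                "description": {
                    "type": "text",
                    "index_phrases": true // two-term shingles are indexed into a separate field and used automatically by phrase queries
                }
            }
        }
    }
    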
  • field type search_as_you_type
    • creates the following fields
      • my_field
        • if an analyzer is not configured, the default analyzer for the index is used
      • my_field._2gram
        • wraps the analyzer of my_field with a shingle token filter of shingle size 2
      • my_field._3gram
        • wraps the analyzer of my_field with a shingle token filter of shingle size 3
      • my_field._index_prefix
        • wraps the analyzer of my_field._3gram with an edge ngram token filter
    • params
      • min_shingle_size
      • max_shingle_size
  • the most efficient way to serve search-as-you-type is a multi_match query of type bool_prefix
    • bool_prefix - constructs a bool query from the terms
      • each term except the last is used in a term query
      • last term is used in a prefix query
    • example: please divide this sen
      • is parsed into terms: please, divide, this, sen
      • each term except last is used in exact match
      • last term is used as prefix query: sen -> sentence
  • prepare index with shingles
    PUT /index-name
    {
        "mappings": {
            "properties": {
                "description": { "type": "search_as_you_type"}
                ...
            }
        }
    }
    
  • querying
    GET /index-name/_search
    {
        "query": {
            "multi_match": {
                "query": "...",
                "type": "bool_prefix",
                "fields": [
                    "description._2gram",
                    "description._3gram"
                ]
            }
        }
    }
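
  • inspecting shingles with the _analyze API - a minimal sketch using an inline shingle filter; the response should contain the bigrams from the example above
    GET /_analyze
    {
        "tokenizer": "standard",
        "filter": [ { "type": "shingle", "min_shingle_size": 2, "max_shingle_size": 2, "output_unigrams": false } ],
        "text": "please divide this sentence into shingles"
    }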
    

stemming

  • process of reducing a word to its root form
    • root form may not be a real word
  • extremely handy when searching
    • match words sharing the root
    • make your searches more flexible than rigid exact matching
  • example
    • word: "administrations"
    • root: "administr"
    • match: "administrator", "administration", "administrate"
  • categories
    • algorithmic stemmers
      • based on a set of rules
      • example - remove the -s and -es from the end of plural words
      • advantages
        • little setup
        • little memory
        • typically faster than dictionary stemmers
      • disadvantages
        • problem with irregular words
          • be, are, and am
          • mouse and mice
          • foot and feet
      • types
        • stemmer - algorithmic stemming for several languages, some with variants
        • kstem - algorithmic stemming + built-in dictionary
        • porter_stem - porter stemming algorithm, recommended for English
        • snowball - Snowball-based stemming rules, several languages
        • differences in how aggressive they stem
          • more aggressive = chop off more
    • dictionary stemmers
      • stem words by looking in a dictionary
      • advantages
        • more accurate
        • handles irregular words
        • distinguishes words that are similar in spelling but not related conceptually
          • example: organ and organization, broker and broken
      • disadvantages
        • stemmer is only as good as its dictionary
          • must include a significant number of words, be updated regularly, change with language trends
        • use a significant amount of RAM
          • can slow the stemming process significantly
      • types
        • hunspell
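  • quick check with the _analyze API - a minimal sketch using the built-in porter_stem filter; all three forms should reduce to the root administr
    GET /_analyze
    {
        "tokenizer": "standard",
        "filter": [ "lowercase", "porter_stem" ],
        "text": "administrations administrator administrate"
    }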

fuzzy

  • some typos cannot be solved with n-grams
  • Levenshtein distance between two words
    • minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other
  • Damerau-Levenshtein distance = Levenshtein + transposition
    • changing a character (box -> fox)
    • removing a character (black -> lack)
    • inserting a character (sic -> sick)
    • transposing two adjacent characters (act -> cat)
  • Levenshtein vs Damerau-Levenshtein
    • compare 'aex' and 'axe'
    • Levenshtein distance: two edits away
      • 'aex' is as far from 'axe' as 'faxes'
    • Damerau-Levenshtein: one edit away
      • makes greater intuitive sense in most cases
  • fuzzy query
    GET /index-name/_search
    {
        "query": {
            "bool": {
                "should": [
                    { "prefix": { "field-name": "..." } },
                    { "fuzzy": { "field-name": { "value": "...", "fuzziness": 2 } } }
                ]
            }
        }
    }
    
    • returns documents that contain terms similar (by Levenshtein distance) to the search term
    • creates a set of all possible variations of the search term within a specified edit distance
    • does not analyze the query text first
      • each expansion must match indexed terms exactly
    • parameters worth mentioning
      • fuzziness - maximum edit distance allowed for matching
      • max_expansions - maximum number of variations created, default: 50
      • prefix_length - number of beginning characters left unchanged when creating expansions, default: 0
    • often combined with prefix search
  • stemming context
    • suppose a snowball analyzer is used
    • a fuzzy search for 'running' is stemmed to 'run'
    • the misspelled, indexed 'runninga' stems to 'runninga'
    • so 'running' will not match 'runninga'
      • 'run' is more than 2 edits away from 'runninga'
    • can cause a bit of confusion
      • use the simple analyzer with fuzzy queries
      • disable synonyms as well
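  • a related option: the match query also accepts a fuzziness parameter and, unlike the fuzzy query, analyzes the query text first - a minimal sketch (index and field names are placeholders)
    GET /index-name/_search
    {
        "query": {
            "match": {
                "field-name": {
                    "query": "...",
                    "fuzziness": "AUTO" // edit distance chosen by term length: 0 for 1-2 chars, 1 for 3-5, 2 for longer
                }
            }
        }
    }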

suggesters

  • suggest similar looking terms based on a provided text
  • still under development
  • types
    • term-suggester
      • suggests terms based on edit distance
    • completion-suggester
      • provides auto-complete/search-as-you-type functionality
      • not meant for spell correction or did-you-mean functionality
      • uses data structures that enable fast lookups, but are costly to build and are stored in-memory
      • supports fuzzy queries
  • define index
    PUT index-name
    {
        "mappings": {
            "properties": {
                "search_associations": {
                    "type": "completion"
                },
                ...
            }
        }
    }
    
  • index document
    POST index-name/_doc
    {
        "search_associations": [ ... ],
    }
    
  • query
    POST index-name/_search
    {
        "suggest": {
            "suggest-name": {
                "prefix": "...",
                "completion": {
                    "field": "search_associations"
                }
            }
        }
    }
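
  • fuzzy completions - since the completion suggester supports fuzziness, a minimal sketch of the same query with a fuzzy option ("pil" is just a hypothetical prefix)
    POST index-name/_search
    {
        "suggest": {
            "suggest-name": {
                "prefix": "pil", // hypothetical prefix typed by the user
                "completion": {
                    "field": "search_associations",
                    "fuzzy": { "fuzziness": 1 } // tolerate one typo in the typed prefix
                }
            }
        }
    }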
    
