Skip to content

papahigh/elasticsearch-keyboard-layout

Repository files navigation

Elasticsearch plugin for keyboard layout suggestions

Build Status License Apache%202.0 blue

This plugin exposes keyboard_layout term suggester which suggests terms according to the switched keyboard layout.

Examples of suggestions this plugin helps to provide
шзрщту ч 64пи               ⟶    iphone x 64gb
nt[yjkjus]                  ⟶    технології
dszdbo ytrfkmrb dsgflrfo    ⟶    выявіў некалькі выпадкаў
тшлу rhjccjdrb runner 2     ⟶    nike кроссовки runner 2
;tcnrbq lbcr 1n,            ⟶    жесткий диск 1тб

The following keyboard layouts are supported:

  • Russian

  • Ukrainian

  • Belarusian

Feel free to open a pull request with any other keyboard layouts.

This plugin may be used in combination with default term suggester which is based on string similarity in order to build a google-like search experience known as "did you mean?".

Installation

⚠️
Please note that due to the serialization issue this plugin is available only for Elasticsearch 7.0.0 and above.

In order to install the plugin, choose a version and run:

$ bin/elasticsearch-plugin install URL

where URL points to zip file of the appropriate release which corresponds to your elasticsearch version.

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

E.g., command for Elasticsearch 7.6.0

# install plugin on Elasticsearch 7.6.0
$ bin/elasticsearch-plugin install https://github.com/papahigh/elasticsearch-keyboard-layout/raw/7.6.0/dist/keyboard-layout-7.6.0.zip

After installation this plugin will expose new token filter and term suggester named keyboard_layout.

Getting started with Suggester

You can start using the keyboard_layout suggester by providing the suggest part of a search request:

POST _search
{
  "suggest": {
    "text": "шЗрщту ЧЫ 64пи",
    "keyboard_suggestion": {
      "keyboard_layout": {
        "field": "content",
        "language": "russian",
        "lowercase_token": true,
        "preserve_case": true,
        "add_original": false
      }
    }
  }
}

In the response you should see the original start offset and length in the suggest text and if any found a switched keyboard layout options. Each options array contains an option object that includes the suggested text and its document frequency. You may also request original token and its frequency by providing add_original option.

{
  "suggest": {
    "keyboard_suggestion": [
      {
        "text": "шЗрщту",
        "offset": 0,
        "length": 6,
        "options": [
          {
            "text": "iPhone",
            "freq": 4,
            "switch": true
          }
        ]
      },
      {
        "text": "ЧЫ",
        "offset": 7,
        "length": 2,
        "options": [
          {
            "text": "XS",
            "freq": 2,
            "switch": true
          }
        ]
      },
      {
        "text": "64пи",
        "offset": 10,
        "length": 4,
        "options": [
          {
            "text": "64gb",
            "freq": 1,
            "switch": true
          }
        ]
      }
    ]
  }
  ...
}

Extension for go client github.com/olivere/elastic: https://github.com/aaerofeev/go-elasic-keyboard-layout

Suggester options

List of the supported suggester options is as follows:

text

The suggest text. The suggest text is a required option that needs to be set globally or per suggestion.

field

The field to fetch the candidate suggestions from. This is an required option that either needs to be set globally or per suggestion.

language

The language of the keyboard layout. This is an required option. Available options are: russian, belarusian, ukrainian.

analyzer

The analyzer to analyse the suggest text with. Defaults to the whitespace analyzer.

lowercase_token

Lower cases terms before frequency evaluation and after the suggest analysis is done. Default is false.

preserve_case

Whether case should be preserved in the switched suggest options. When lower_case is set to true this option restores the original case. Defaults to false.

min_freq

The minimal threshold in number of documents a suggestion should appear in. This can be specified as an absolute number or as a relative percentage of number of documents. This can improve quality by only suggesting high frequency terms. Defaults to 0f and is not enabled. If a value higher than 1 is specified then the number cannot be fractional. The shard level document frequencies are used for this option.

max_freq

The maximum threshold in number of documents a suggest text token can exist in order to be included. Can be a relative percentage number (e.g 0.4) or an absolute number to represent document frequencies. If an value higher than 1 is specified then fractional can not be specified. Defaults to -1 and is not enabled. This can be used to exclude high frequency terms from switch keyboard suggestions. The shard level document frequencies are used for this option.

add_original

Whether original term and its frequency should be included in the suggest options. Default is false.

Contribute

Use the issue tracker and/or open pull requests.

Licence

This project is released under version 2.0 of the Apache Licence.

Releases

No releases published

Packages

No packages published

Languages