Elasticsearch Word Decompounder Analysis Plugin

This is an implementation of a word decompounder plugin for Elasticsearch.

This decompounding token filter complements the standard Elasticsearch compound word token filter.

Compounding several words into one is a property that not all languages share. Compounding is used in German, the Scandinavian languages, Finnish, and Korean.

This code is a reworked implementation of the Baseforms Tool found in the ASV toolbox of Chris Biemann, Automatische Sprachverarbeitung group, Leipzig University.

Lucene comes with two compound word token filters, a dictionary-based and a hyphenation-based variant. Both share a disadvantage: they need to load a word list into memory before they run. This decompounder does not require word lists; it can process German-language text out of the box. For efficient word segmentation, the decompounder uses the prebuilt Compact Patricia Tries provided by the ASV toolbox.

Installation

ES version   Plugin        Release date   Command
0.90.5       1.3.0         Oct 26, 2013   ./bin/plugin --install decompound --url http://bit.ly/1hmvN6Z
1.0.0.RC1    1.0.0.RC1.1   Jan 16, 2014   ./bin/plugin --install decompound --url http://bit.ly/1aNQsyM

Do not forget to restart the node after installing.

Project docs

The Maven project site is available at GitHub.

Binaries

Binaries are available at Bintray

Example

In the index settings, define a token filter of type "decompound" and reference it from a custom analyzer:

{
   "index":{
      "analysis":{
          "filter":{
              "decomp":{
                  "type" : "decompound"
              }
          },
          "analyzer" : {
              "decomp" : {
                 "type" : "custom",
                 "tokenizer" : "standard",
                 "filter" : [ "decomp" ]
              }
          }
      }
   }
}
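For a quick test, the settings above can be applied to a throwaway index and exercised with the _analyze API. The index name test and the address localhost:9200 are placeholders for this sketch, not part of the plugin; the plugin must already be installed on the node:

curl -XPUT 'localhost:9200/test' -d '{
  "settings" : {
    "index" : {
      "analysis" : {
        "filter" : { "decomp" : { "type" : "decompound" } },
        "analyzer" : { "decomp" : { "type" : "custom", "tokenizer" : "standard", "filter" : [ "decomp" ] } }
      }
    }
  }
}'

curl -XGET 'localhost:9200/test/_analyze?analyzer=decomp&text=Donaudampfschiff'

The second call should return the original token plus its segments, "Donaudampfschiff", "Donau", "dampf", "schiff".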

"Die Jahresfeier der Rechtsanwaltskanzleien auf dem Donaudampfschiff hat viel Ökosteuer gekostet" will be tokenized into "Die", "Die", "Jahresfeier", "Jahr", "feier", "der", "der", "Rechtsanwaltskanzleien", "Recht", "anwalt", "kanzlei", "auf", "auf", "dem", "dem", "Donaudampfschiff", "Donau", "dampf", "schiff", "hat", "hat", "viel", "viel", "Ökosteuer", "Ökosteuer", "gekostet", "gekosten"

It is recommended to add the Unique token filter to skip tokens that occur more than once.
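A minimal sketch of such a chain, combining the decompound filter with the built-in Elasticsearch unique token filter (the analyzer name decomp is illustrative):

{
   "index":{
      "analysis":{
          "filter":{
              "decomp":{
                  "type" : "decompound"
              }
          },
          "analyzer" : {
              "decomp" : {
                 "type" : "custom",
                 "tokenizer" : "standard",
                 "filter" : [ "decomp", "unique" ]
              }
          }
      }
   }
}

With this chain, duplicated tokens such as "Die", "Die" in the example above are emitted only once.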

The Lucene German normalization token filter is also provided:

{
  "index":{
      "analysis":{
          "filter":{
              "umlaut":{
                  "type" : "german_normalize"
              }
          },
          "analyzer" : {
              "umlaut" : {
                 "type" : "custom",
                 "tokenizer" : "standard",
                 "filter" : [ "umlaut" ]
              }
          }
      }
  }
}

The input "Ein schöner Tag in Köln im Café an der Straßenecke" will be tokenized into "Ein", "schoner", "Tag", "in", "Koln", "im", "Café", "an", "der", "Strassenecke".
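The normalization and decompounding filters can also be combined into a single analyzer. A sketch, assuming the order decompound first, then normalize, then deduplicate (the analyzer name german is illustrative):

{
   "index":{
      "analysis":{
          "filter":{
              "decomp":{
                  "type" : "decompound"
              },
              "umlaut":{
                  "type" : "german_normalize"
              }
          },
          "analyzer" : {
              "german" : {
                 "type" : "custom",
                 "tokenizer" : "standard",
                 "filter" : [ "decomp", "umlaut", "unique" ]
              }
          }
      }
   }
}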

Threshold

The decomposition algorithm uses a threshold to decide whether a word counts as successfully decomposed. If the threshold is too low, words can silently disappear from the index. In that case, adjust the threshold until words no longer disappear.

The default threshold value is 0.51. You can modify it in the settings:

{
   "index":{
      "analysis":{
          "filter":{
              "decomp":{
                  "type" : "decompound",
                  "threshold" : 0.51
              }
          },
          "analyzer" : {
              "decomp" : {
                 "type" : "custom",
                 "tokenizer" : "standard",
                 "filter" : [ "decomp" ]
              }
          }
      }
   }
}
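To check whether a suspect word still produces tokens at the configured threshold, the _analyze API can be run against an index created with these settings (the index name test is again a placeholder):

curl -XGET 'localhost:9200/test/_analyze?analyzer=decomp&text=Rechtsanwaltskanzleien'

If the response contains no tokens for a word that should be indexed, adjust the threshold and try again.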

References

The Compact Patricia Trie data structure can be found in

Morrison, D.: PATRICIA - Practical Algorithm To Retrieve Information Coded in Alphanumeric. Journal of the ACM, 1968, 15(4):514-534

The compound splitter used for generating features for document classification is described in

Witschel, F., Biemann, C.: Rigorous dimensionality reduction through linguistically motivated feature selection for text categorization. Proceedings of NODALIDA 2005, Joensuu, Finland

The base form reduction step (for Norwegian) is described in

Eiken, U.C., Liseth, A.T., Richter, M., Witschel, F., Biemann, C.: Ord i Dag: Mining Norwegian Daily Newswire. Proceedings of FinTAL 2006, Turku, Finland

License

Elasticsearch Word Decompounder Analysis Plugin

Copyright (C) 2012 Jörg Prante

Derived work of ASV toolbox http://asv.informatik.uni-leipzig.de/asv/methoden

Copyright (C) 2005 Abteilung Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
