oleg-st/russianmorphology


Russian Morphology for Apache Lucene

Russian and English morphology for Java and the Apache Lucene 7.2 framework, based on the open source dictionary from the АОТ site. It uses dictionary-based morphology with some heuristics for unknown words. It supports homonyms: for example, for the Russian word "вина" it gives two variants, "вино" and "вина".

How to use

Build the project by running mvn clean package. This produces the latest version of the artifacts, 1.7; add them to your classpath. You can choose which version to use, Russian or English.

Now you can create a Lucene Analyzer:

  RussianAnalyzer russian = new RussianAnalyzer();
  EnglishAnalyzer english = new EnglishAnalyzer();

You can write your own analyzer using a filter that converts each word to its base forms.

  LuceneMorphology luceneMorph = new EnglishLuceneMorphology();
  TokenStream tokenStream = new MorphologyFilter(result, luceneMorph);

Because a LuceneMorphology usually holds a lot of data that it needs in order to work, it is better not to create a new instance for each MorphologyFilter. Create it once and reuse it.

Also, if you need a list of the base forms of a word, you can use the following example:

  LuceneMorphology luceneMorph = new EnglishLuceneMorphology();
  // getNormalForms returns the base forms; getMorphInfo also includes grammatical information
  List<String> wordBaseForms = luceneMorph.getNormalForms(word);

Solr

You can use LuceneMorphology as a morphology filter in a Solr schema.xml via MorphologyFilterFactory:

<fieldType name="content" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="org.apache.lucene.analysis.morphology.MorphologyFilterFactory" language="Russian"/>
    <filter class="org.apache.lucene.analysis.morphology.MorphologyFilterFactory" language="English"/>
  </analyzer>
</fieldType>

Then add morphology-1.7.jar to your Solr lib directories.

Restrictions

  • It works only with UTF-8.
  • It assumes that the letters е and ё are the same.
  • Word forms with prefixes, like "наибольший", are treated as separate words.
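
The second restriction means that two spellings which differ only in е/ё are treated as the same word. If your own code compares or stores words outside the analyzer, it may help to apply the same folding there. The sketch below is only an illustration in plain Java: YoNormalizer and foldYo are hypothetical names that are not part of this library, and nothing here describes how the library handles the letters internally.

```java
public final class YoNormalizer {
    private YoNormalizer() {}

    /**
     * Hypothetical helper (not part of the library): folds the letter ё
     * into е, in both lower and upper case, leaving all other characters as is.
     */
    public static String foldYo(String word) {
        return word.replace('ё', 'е').replace('Ё', 'Е');
    }

    public static void main(String[] args) {
        // Both spellings fold to the same string, so they compare as equal.
        System.out.println(foldYo("ёлка").equals(foldYo("елка"))); // true
        System.out.println(foldYo("Ёж"));                          // Еж
    }
}
```

Folding both spellings before comparing them keeps application-side lookups consistent with the е/ё assumption stated above.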

License

Apache License, Version 2.0
