A fast and less aggressive stemmer for Turkish in Java
Latest commit b6d527f Jan 2, 2015 @hrzafer Update README.md

resha-turkish-stemmer

Resha is a fast and "less aggressive" stemmer for Turkish written in Java. It uses a stem dictionary generated by Nuve with a statistical language model based on morpheme n-grams, so it returns the most probable stem for a word without considering the neighboring words.

Main Features

  • Less aggressive and more accurate than the other stemmers available for Turkish, such as the one in Snowball
  • Contains more than 1.1 million word-stem pairs
  • Based on a HashMap, so lookups are very fast, but it uses approximately 300 MB of memory
  • The stemmer class is a thread-safe, lazily initialized singleton
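The singleton and HashMap points above can be illustrated with a minimal sketch. `DemoStemmer` and its two entries are hypothetical stand-ins for illustration, not Resha's actual source: the JVM initializes an enum constant exactly once, on first use, and publishes it safely to all threads.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for illustration; not Resha's actual source.
enum DemoStemmer {
    INSTANCE;

    // Filled once, when the JVM first initializes the enum class.
    private final Map<String, String> stems = new HashMap<>();

    DemoStemmer() {
        // A real dictionary would be loaded from a resource file here.
        stems.put("kitapçıdaki", "kitapçı");
        stems.put("kitapçığında", "kitapçık");
    }

    String stem(String word) {
        // The map is never modified after construction, so concurrent reads are safe.
        return stems.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        System.out.println(DemoStemmer.INSTANCE.stem("kitapçıdaki")); // kitapçı
        System.out.println(DemoStemmer.INSTANCE.stem("xxx"));         // xxx
    }
}
```

The trade-off is the one listed above: every lookup is a single hash probe, at the cost of keeping the whole dictionary in memory.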

Usage

// Resha is implemented as an enum to guarantee
// a singleton, thread-safe and lazily initialized object.
Stemmer stemmer = Resha.Instance;

String stem = stemmer.stem("kitapçıdaki");
System.out.println(stem); //kitapçı

// If a word contains an apostrophe,
// the part before the first apostrophe is returned as the stem.
stem = stemmer.stem("İstanbul'da");
System.out.println(stem); //İstanbul

stem = stemmer.stem("aaa'aaa'aa");
System.out.println(stem); //aaa

// If a word is not in the dictionary, it remains unstemmed.
stem = stemmer.stem("xxx");
System.out.println(stem); //xxx
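
The three behaviors shown in the usage example (dictionary hit, apostrophe rule, unstemmed fallback) can be sketched as a single lookup function. This is an assumption about how such rules could be combined, not Resha's actual code; `LookupRulesSketch` and its one-entry dictionary are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class LookupRulesSketch {
    // Tiny hypothetical dictionary; the real one holds more than 1.1 million pairs.
    static final Map<String, String> DICT = new HashMap<>();
    static {
        DICT.put("kitapçıdaki", "kitapçı");
    }

    static String stem(String word) {
        // Apostrophe rule: the part before the first apostrophe is the stem.
        int apostrophe = word.indexOf('\'');
        if (apostrophe >= 0) {
            return word.substring(0, apostrophe);
        }
        // Dictionary hit, or the word itself when it is unknown.
        return DICT.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        System.out.println(stem("kitapçıdaki")); // kitapçı
        System.out.println(stem("İstanbul'da")); // İstanbul
        System.out.println(stem("aaa'aaa'aa"));  // aaa
        System.out.println(stem("xxx"));         // xxx
    }
}
```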

Maven

Add this repository to your pom.xml file:

<repositories>
    <repository>
        <id>hrzafer-repo</id>
        <url>https://github.com/hrzafer/mvn-repo/raw/master/releases</url>
    </repository>
</repositories>

Then add the dependency:

<dependencies>
    <dependency>
        <groupId>com.hrzafer</groupId>
        <artifactId>resha-turkish-stemmer</artifactId>
        <version>1.2.1</version>
    </dependency>
</dependencies>

Jar Distribution

Download the latest jar from the link below and add it to your project: https://github.com/hrzafer/mvn-repo/tree/master/releases/com/hrzafer/resha-turkish-stemmer

Stemming in Turkish

This section gives a brief introduction to the stemming problem in Turkish and the methodology used to solve it.

In Turkish, words are composed of three consecutive parts:

root + derivational suffix(es) + inflectional suffix(es)

Prefixes don't exist, and no derivational suffix comes after an inflectional suffix. Example:

kitapçığında  => kitap + çığ [CUK] + ın [(U)n] + da [DA]
word          => root  + d. sfx    + i. sfx    + i.sfx

A stemmer is expected to analyze the word and strip off its inflectional suffixes, so the expected stem for kitapçığında is kitapçık. However, such a morphological analysis is not trivial for Turkish. Let us look at the only derivational suffix shown above to understand the difficulty.

The derivational suffix +CUK is somewhat similar to the -let suffix in English. The word kitapçık means booklet. The +CUK suffix takes different forms according to the morphemes that come before and after it. In Turkish, all of the following forms are possible for the +CUK suffix:

cık, cik, cuk, cük, çık, çik, çuk, çük, cığ, ciğ, cuğ, cüğ, çığ, çiğ, çuğ, çüğ

In the word kitapçık the suffix is in the çık form. If the root were kalem (pencil), the word would be kalemcik and the suffix would be in the cik form. When a new suffix, +(U)n, is added, the +CUK suffix changes its form from çık to çığ.
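
The sixteen forms follow from three independent choices: the first consonant (c or ç), the vowel (ı, i, u or ü) and the final consonant (k or ğ). A short sketch that enumerates them, with the hypothetical class name `CukFormsSketch`:

```java
import java.util.ArrayList;
import java.util.List;

class CukFormsSketch {
    static List<String> forms() {
        String[] initials = {"c", "ç"};          // c after kalem, ç after kitap
        String[] vowels = {"ı", "i", "u", "ü"};  // chosen by vowel harmony
        String[] finals = {"k", "ğ"};            // k becomes ğ when +(U)n follows
        List<String> result = new ArrayList<>();
        for (String f : finals) {
            for (String c : initials) {
                for (String v : vowels) {
                    result.add(c + v + f);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> all = forms();
        System.out.println(all.size()); // 16
        System.out.println(String.join(", ", all));
    }
}
```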

After stripping off the inflectional suffixes from the word kitapçığındaki, the stem becomes kitapçığ. However, it should be kitapçık. Thus a stemmer/analyzer for Turkish has to handle many such character conversions.
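
To see why these conversions are hard to handle with a simple rule, consider a deliberately naive sketch that always reverses consonant softening on the last letter (ğ to k, b to p, c to ç, d to t). This is an illustrative assumption, not how Nuve or Resha work, and the class name `FinalConsonantSketch` is hypothetical.

```java
class FinalConsonantSketch {
    // Naive rule: always undo softening on the last letter of a candidate stem.
    static String restore(String candidate) {
        if (candidate.isEmpty()) {
            return candidate;
        }
        String head = candidate.substring(0, candidate.length() - 1);
        String last = candidate.substring(candidate.length() - 1);
        switch (last) {
            case "ğ": return head + "k";
            case "b": return head + "p";
            case "c": return head + "ç";
            case "d": return head + "t";
            default:  return candidate;
        }
    }

    public static void main(String[] args) {
        System.out.println(restore("kitapçığ")); // kitapçık
        System.out.println(restore("dağ"));      // dak
    }
}
```

The rule repairs kitapçığ, but it also turns dağ (mountain), which legitimately ends in ğ, into the non-word dak. Deciding when the conversion applies requires knowledge of the lexicon, which is what a full morphological analyzer provides.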

Nuve is an NLP library that can perform the complex morphological analysis (and more) that tasks like stemming require. This analysis can be expensive for applications that have millions of words to stem.

Resha is a Turkish stemmer based on a dictionary of words that have already been stemmed by Nuve. The dictionary includes more than 1.1 million word-stem pairs.

How to add new stems or overwrite existing stems

The 1.1 million word-stem pair dictionary very likely lacks stems for some words, and you may find that some of its stems are incorrect. In either case you can add your own word-stem pairs by editing the manual.dict file. These word-stem (key-value) pairs are added to the dictionary, and if a word (key) already exists, its stem (value) is overwritten.
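
The add-or-overwrite behavior corresponds to ordinary map semantics, as in the sketch below. The format of manual.dict is not described here, so the example works on in-memory maps only; `OverrideSketch`, `merge` and the sample entries are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class OverrideSketch {
    // Manual pairs add new keys and replace the values of existing keys.
    static Map<String, String> merge(Map<String, String> generated,
                                     Map<String, String> manual) {
        Map<String, String> merged = new HashMap<>(generated);
        merged.putAll(manual);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> generated = new HashMap<>();
        generated.put("kitapçığında", "kitapçığ"); // hypothetical wrong stem

        Map<String, String> manual = new HashMap<>();
        manual.put("kitapçığında", "kitapçık");    // overwrites the entry above
        manual.put("kalemciğin", "kalemcik");      // adds a missing word

        Map<String, String> merged = merge(generated, manual);
        System.out.println(merged.get("kitapçığında")); // kitapçık
        System.out.println(merged.get("kalemciğin"));   // kalemcik
    }
}
```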