Resha is a fast and "less aggressive" stemmer for Turkish, written in Java. It uses a stem dictionary generated by Nuve with a statistical language model based on morpheme n-grams. For each word it returns the most probable stem, without considering the neighboring words.
- Less aggressive and more accurate than the other stemmers available for Turkish, such as the one in Snowball
- Contains more than 1.1 million word-stem pairs
- Based on a HashMap: very fast, but uses approximately 300 MB of memory
- The stemmer class is a singleton, thread-safe, and lazily initialized
    // Resha is implemented as an enum to guarantee
    // a singleton, thread-safe and lazily initialized object.
    Stemmer stemmer = Resha.Instance;
    String stem = stemmer.stem("kitapçıdaki");
    System.out.println(stem); // kitapçı

    // If a word contains an apostrophe,
    // the part before the first apostrophe is returned as the stem.
    stem = stemmer.stem("İstanbul'da");
    System.out.println(stem); // İstanbul
    stem = stemmer.stem("aaa'aaa'aa");
    System.out.println(stem); // aaa

    // If a word is not in the dictionary, it remains unstemmed.
    stem = stemmer.stem("xxx");
    System.out.println(stem); // xxx
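The behaviour described above (enum singleton, apostrophe rule, dictionary lookup with fallback) can be sketched in a few lines. This is only an illustration under stated assumptions, not Resha's actual source: the name `SketchStemmer` and its one-entry dictionary are hypothetical stand-ins.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, NOT Resha's source code.
enum SketchStemmer {
    INSTANCE; // created once by the JVM, on first use of the enum

    // Filled in the constructor and only read afterwards,
    // so concurrent lookups are safe.
    private final Map<String, String> dictionary = new HashMap<>();

    SketchStemmer() {
        // Tiny stand-in for the real word-stem dictionary.
        dictionary.put("kitapçıdaki", "kitapçı");
    }

    String stem(String word) {
        // Keep only the part before the first apostrophe, if there is one.
        int apostrophe = word.indexOf('\'');
        if (apostrophe >= 0) {
            return word.substring(0, apostrophe);
        }
        // Dictionary hit, otherwise the word is returned unchanged.
        return dictionary.getOrDefault(word, word);
    }

    public static void main(String[] args) {
        System.out.println(INSTANCE.stem("kitapçıdaki")); // kitapçı
        System.out.println(INSTANCE.stem("İstanbul'da")); // İstanbul
        System.out.println(INSTANCE.stem("xxx"));         // xxx
    }
}
```

An enum constant is initialized by the JVM class initializer, which is what gives the singleton, thread-safety and lazy-initialization guarantees without explicit locking.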
Add this repository to your pom.xml file:

    <repositories>
      <repository>
        <id>hrzafer-repo</id>
        <url>https://github.com/hrzafer/mvn-repo/raw/master/releases</url>
      </repository>
    </repositories>

And the dependency:

    <dependencies>
      <dependency>
        <groupId>com.hrzafer</groupId>
        <artifactId>resha-turkish-stemmer</artifactId>
        <version>1.2.1</version>
      </dependency>
    </dependencies>
Alternatively, download the latest jar from the link below and add it to your project: https://github.com/hrzafer/mvn-repo/tree/master/releases/com/hrzafer/resha-turkish-stemmer
Stemming in Turkish
This part presents a brief introduction to the stemming problem in Turkish and the methodology used to solve it.
In Turkish, words are composed of three consecutive parts:
root + derivational suffix(es) + inflectional suffix(es)
Prefixes don't exist, and no derivational suffix comes after an inflectional suffix. Example:

    kitapçığında => kitap + çığ [CUK] + ın [(U)n] + da [DA]
    word         => root  + d. sfx    + i. sfx    + i. sfx
A stemmer is expected to analyze the word and strip off its inflectional suffixes, so the expected stem for kitapçığında is kitapçık. However, such a morphological analysis is not trivial for Turkish. Let us look at the only derivational suffix shown above to understand the difficulty.
The derivational suffix +CUK is somewhat similar to the -let suffix in English: the word kitapçık means "booklet". The +CUK suffix takes different forms according to the morphemes coming before and after it. In Turkish, all of the following forms are possible for the +CUK suffix:
cık, cik, cuk, cük, çık, çik, çuk, çük, cığ, ciğ, cuğ, cüğ, çığ, çiğ, çuğ, çüğ
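One way to read the list above is as three independent slots: an initial consonant (c or ç), a vowel (ı, i, u or ü) and a final consonant (k or ğ). The sketch below enumerates the combinations under that reading; the class name `CukForms` is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: enumerates the surface forms of +CUK
// as the product of three slots (C, U, K).
class CukForms {
    static List<String> allForms() {
        String[] finals = {"k", "ğ"};           // K slot
        String[] initials = {"c", "ç"};         // C slot
        String[] vowels = {"ı", "i", "u", "ü"}; // U slot
        List<String> forms = new ArrayList<>();
        for (String last : finals) {
            for (String first : initials) {
                for (String vowel : vowels) {
                    forms.add(first + vowel + last);
                }
            }
        }
        return forms;
    }

    public static void main(String[] args) {
        // Prints the 16 forms in the same order as the list above.
        System.out.println(String.join(", ", allForms()));
    }
}
```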
In the word kitapçık the suffix is in the çık form. If the root were kalem (pencil), the word would be kalemcik and the suffix would be in the cik form. When a new suffix, +(U)n, is added, the +CUK suffix changes its form from çık to çığ.
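How the form is chosen can be sketched with the usual Turkish rules: the suffix vowel follows the last vowel of the root, the initial consonant is ç after a voiceless consonant and c otherwise, and the final k softens to ğ when a vowel-initial suffix follows. This is a simplified illustration that assumes lowercase input and ignores exceptions; `CukSelector` and `cukForm` are hypothetical names and do not come from Nuve or Resha.

```java
// Simplified sketch of choosing the surface form of +CUK for a root.
class CukSelector {
    private static final String VOICELESS = "fstkçşhp";
    private static final String VOWELS = "aeıioöuü";

    static String cukForm(String root, boolean vowelInitialSuffixFollows) {
        char lastChar = root.charAt(root.length() - 1);
        // C slot: ç after a voiceless consonant, c otherwise.
        char first = VOICELESS.indexOf(lastChar) >= 0 ? 'ç' : 'c';
        // U slot: harmonizes with the last vowel of the root.
        char vowel = harmonize(lastVowel(root));
        // K slot: k softens to ğ before a vowel-initial suffix.
        char last = vowelInitialSuffixFollows ? 'ğ' : 'k';
        return "" + first + vowel + last;
    }

    private static char lastVowel(String root) {
        for (int i = root.length() - 1; i >= 0; i--) {
            if (VOWELS.indexOf(root.charAt(i)) >= 0) {
                return root.charAt(i);
            }
        }
        throw new IllegalArgumentException("no vowel in root: " + root);
    }

    private static char harmonize(char v) {
        switch (v) {
            case 'a': case 'ı': return 'ı';
            case 'e': case 'i': return 'i';
            case 'o': case 'u': return 'u';
            default: return 'ü'; // ö, ü
        }
    }

    public static void main(String[] args) {
        System.out.println("kitap" + cukForm("kitap", false)); // kitapçık
        System.out.println("kalem" + cukForm("kalem", false)); // kalemcik
        System.out.println("kitap" + cukForm("kitap", true));  // kitapçığ
    }
}
```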
After stripping the inflectional suffixes from the word kitapçığındaki, the stem becomes kitapçığ. However, it should be kitapçık. Thus a stemmer/analyzer for Turkish has to handle many such character conversions.
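The problem can be reproduced with a deliberately naive suffix stripper. The sketch below is a toy, not how Nuve analyzes words: the three-entry suffix list and the names `NaiveStripper`, `strip` and `restoreFinalConsonant` are assumptions made for this example only.

```java
// Toy example: naive stripping leaves the softened consonant behind.
class NaiveStripper {
    // Hypothetical, tiny suffix list covering only this example.
    private static final String[] SUFFIXES = {"ki", "da", "ın"};

    // Repeatedly removes any listed suffix from the end of the word.
    static String strip(String word) {
        String current = word;
        boolean stripped = true;
        while (stripped) {
            stripped = false;
            for (String suffix : SUFFIXES) {
                // Keep at least three characters so short words are left alone.
                if (current.endsWith(suffix) && current.length() > suffix.length() + 2) {
                    current = current.substring(0, current.length() - suffix.length());
                    stripped = true;
                }
            }
        }
        return current;
    }

    // Undoes the k -> ğ softening. A real analyzer needs more than this:
    // some stems genuinely end in ğ (for example dağ).
    static String restoreFinalConsonant(String stem) {
        return stem.endsWith("ğ") ? stem.substring(0, stem.length() - 1) + "k" : stem;
    }

    public static void main(String[] args) {
        String raw = strip("kitapçığındaki");
        System.out.println(raw);                        // kitapçığ
        System.out.println(restoreFinalConsonant(raw)); // kitapçık
    }
}
```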
Nuve is an NLP library that can perform this kind of complex morphological analysis (and more), which is required for many tasks such as stemming. Such analysis can be expensive for applications that need to stem millions of words.
Resha is a Turkish stemmer based on a dictionary of words that have already been stemmed by Nuve. The dictionary includes more than 1.1 million word-stem pairs.
How to add new stems or overwrite existing stems
It is quite likely that the 1.1 million word-stem pair dictionary lacks stems for some words, or that you find some stems incorrect. In that case you can add your own word-stem pairs by editing the manual.dict file. The word-stem (key-value) pairs will be added to the dictionary, and if a word (key) already exists, its stem (value) will be overwritten.
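The add-or-overwrite behaviour amounts to a plain map update. The sketch below assumes, purely for illustration, that each line holds one tab-separated `word<TAB>stem` pair; the actual layout of manual.dict is not specified here, so check the file shipped with the library. The name `ManualDictSketch` and the sample entries are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of merging manual entries into a dictionary.
class ManualDictSketch {
    // ASSUMPTION: one "word<TAB>stem" pair per line.
    static void applyOverrides(Map<String, String> dictionary, List<String> lines) {
        for (String line : lines) {
            String[] parts = line.trim().split("\t");
            if (parts.length == 2) {
                // put() adds a new pair or overwrites the existing stem.
                dictionary.put(parts[0], parts[1]);
            }
            // Lines that do not have exactly two fields are skipped.
        }
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("kitapçığında", "kitapçığ"); // pretend this entry is wrong

        applyOverrides(dictionary, Arrays.asList(
                "kitapçığında\tkitapçık",   // overwrites the existing stem
                "kalemcikte\tkalemcik"));   // adds a new pair

        System.out.println(dictionary.get("kitapçığında")); // kitapçık
        System.out.println(dictionary.get("kalemcikte"));   // kalemcik
    }
}
```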