Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval. This package makes the Snowball stemmers available as part of the Spark ML Pipeline API.
Link against this library using SBT:
libraryDependencies += "com.github.master" %% "spark-stemming" % "0.2.0"
Using Maven:
<dependency>
  <groupId>com.github.master</groupId>
  <artifactId>spark-stemming_2.10</artifactId>
  <version>0.2.0</version>
</dependency>
Or include it when starting the Spark shell:
$ bin/spark-shell --packages com.github.master:spark-stemming_2.10:0.2.0
Currently implemented algorithms:
- Arabic
- English
- English (Porter)
- Romance stemmers:
  - French
  - Spanish
  - Portuguese
  - Italian
  - Romanian
- Germanic stemmers:
  - German
  - Dutch
- Scandinavian stemmers:
  - Swedish
  - Norwegian (Bokmål)
  - Danish
- Russian
- Finnish
- Greek
More details are on the Snowball stemming algorithms page.
The Stemmer Transformer can be used directly or as part of an ML Pipeline. In particular, it combines well with Tokenizer.
import org.apache.spark.mllib.feature.Stemmer

val data = sqlContext
  .createDataFrame(Seq(("мама", 1), ("мыла", 2), ("раму", 3)))
  .toDF("word", "id")

val stemmed = new Stemmer()
  .setInputCol("word")
  .setOutputCol("stemmed")
  .setLanguage("Russian")
  .transform(data)

stemmed.show
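
The combination with Tokenizer mentioned above might look roughly like the sketch below. It is meant for a spark-shell session started with the --packages option shown earlier, and it reuses the shell-provided sqlContext. The sample sentences and the column names (text, words, stems) are made up for illustration. It also assumes that Stemmer accepts the array-of-strings column that Tokenizer produces, which this README does not state explicitly; if your version of the package expects a different input type, adjust the input column accordingly.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.mllib.feature.Stemmer

// Illustrative input: one sentence per row (hypothetical sample data).
val sentences = sqlContext
  .createDataFrame(Seq(
    (1, "the cats were running"),
    (2, "she studies stemming algorithms")))
  .toDF("id", "text")

// Tokenizer lowercases each sentence and splits it on whitespace.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// Assumption: Stemmer can read the array column emitted by Tokenizer.
val stemmer = new Stemmer()
  .setInputCol("words")
  .setOutputCol("stems")
  .setLanguage("English")

// Both stages are plain Transformers, so fit() only chains them together.
val pipeline = new Pipeline().setStages(Array(tokenizer, stemmer))
val result = pipeline.fit(sentences).transform(sentences)

result.select("text", "stems").show()
```

Wrapping both steps in a Pipeline keeps tokenization and stemming in one reusable object, so the same stages can be applied to other DataFrames or extended with further feature transformers.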