Spark Stemming

Snowball is a small string processing language designed for creating stemming algorithms for use in information retrieval. This package makes the Snowball stemmers available as a transformer in the Spark ML Pipeline API.

Linking

Link against this library using SBT:

libraryDependencies += "com.github.master" %% "spark-stemming" % "0.2.1"

Using Maven:

<dependency>
    <groupId>com.github.master</groupId>
    <artifactId>spark-stemming_2.10</artifactId>
    <version>0.2.1</version>
</dependency>

Or include it when starting the Spark shell:

$ bin/spark-shell --packages com.github.master:spark-stemming_2.10:0.2.1

Features

Currently implemented algorithms:

  • Arabic
  • English
  • English (Porter)
  • Romance stemmers:
    • French
    • Spanish
    • Portuguese
    • Italian
    • Romanian
  • Germanic stemmers:
    • German
    • Dutch
  • Scandinavian stemmers:
    • Swedish
    • Norwegian (Bokmål)
    • Danish
  • Russian
  • Finnish
  • Greek

More details are on the Snowball stemming algorithms page.

Usage

The Stemmer transformer can be used on its own or as a stage in an ML Pipeline. In particular, it combines well with Tokenizer.

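import org.apache.spark.mllib.feature.Stemmer

// Build a small DataFrame of Russian words; assumes a SQLContext named sqlContext is in scope.
val data = sqlContext
  .createDataFrame(Seq(("мама", 1), ("мыла", 2), ("раму", 3)))
  .toDF("word", "id")

// Configure the stemmer, then apply it to the "word" column.
val stemmed = new Stemmer()
  .setInputCol("word")
  .setOutputCol("stemmed")
  .setLanguage("Russian")
  .transform(data)

stemmed.show()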
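
To illustrate the Pipeline usage mentioned above, here is a minimal sketch that chains Tokenizer and Stemmer as two stages. It assumes a `sqlContext` value is in scope, as in the previous example, and that Stemmer accepts the array-of-strings column that Tokenizer produces. The sample sentences and the column names (`sentence`, `words`, `stems`) are made up for illustration.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.mllib.feature.Stemmer

// Hypothetical sample data; assumes `sqlContext` is already defined.
val sentences = sqlContext
  .createDataFrame(Seq((1, "the cats were running"), (2, "she walked home")))
  .toDF("id", "sentence")

// Tokenizer lowercases the text and splits it on whitespace into an array of words.
val tokenizer = new Tokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")

// Assumption: Stemmer can consume the array column emitted by Tokenizer.
val stemmer = new Stemmer()
  .setInputCol("words")
  .setOutputCol("stems")
  .setLanguage("English")

val pipeline = new Pipeline().setStages(Array(tokenizer, stemmer))

// Both stages are plain transformers, so fit() only assembles the PipelineModel.
val model = pipeline.fit(sentences)
val result = model.transform(sentences)

result.select("id", "words", "stems").show(false)
```

Wrapping the stages in a Pipeline keeps tokenization and stemming together, so the same sequence can be reapplied to new data through `model.transform`.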
