Text.Stemmer

Pre-compiled Snowball stemmers for Elixir.

This package ships 36 stemming algorithms covering a wide range of natural languages, accessible through a single Text.Stemmer.stem/2 entry point. The stemmers themselves were generated from the canonical Snowball algorithm sources using the :snowball compiler, which is included as a runtime dependency.

Installation

Add :text_stemmer to your mix.exs deps:

def deps do
  [
    {:text_stemmer, "~> 0.1"}
  ]
end

Usage

iex> Text.Stemmer.stem("generalizations", :en)
"general"

iex> Text.Stemmer.stem("gouvernements", :fr)
"gouvern"

iex> Text.Stemmer.stem_list(["running", "ran", "runs"], :en)
["run", "ran", "run"]

Languages are identified by their ISO 639-1 two-letter code. Algorithm-specific variants use a <code>_<algorithm> form: :en_porter, :en_lovins, :nl_porter. See the Text.Stemmer moduledoc for the full table.

iex> length(Text.Stemmer.supported_languages())
36
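As a rough sketch of how stem/2 might slot into a small text pipeline, the helper below lowercases a string, splits it on non-letter characters, and stems each token. The StemExample module and its stem_text/3 function are hypothetical and not part of this package. Lowercasing first is an assumption based on how Snowball stemmers are commonly used. The stemmer is passed in as a function (defaulting to Text.Stemmer.stem/2) so the helper can be exercised without the dependency.

```elixir
defmodule StemExample do
  # Hypothetical helper for illustration; not part of Text.Stemmer.
  #
  # `stemmer` is any 2-arity function taking (word, lang). It defaults to
  # Text.Stemmer.stem/2, the entry point described above.
  def stem_text(text, lang, stemmer \\ &Text.Stemmer.stem/2) do
    text
    # Assumption: the stemmer is given lowercase input.
    |> String.downcase()
    # Split on runs of non-letter characters, dropping empty tokens.
    |> String.split(~r/[^\p{L}]+/u, trim: true)
    |> Enum.map(&stemmer.(&1, lang))
  end
end
```

With the package installed, StemExample.stem_text("Running and runs", :en) would stem each token using the English stemmer. The split rule is deliberately naive (it breaks on apostrophes and hyphens), so a real tokenizer may be a better fit for production text.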

Regenerating stemmers

The pre-generated stemmer modules under lib/text/stemmer/stemmers/ are produced from the .sbl algorithm sources vendored in src/algorithms/ (taken from snowballstem/snowball). To regenerate after editing or updating a source file, run:

mix snowball.gen --module-prefix Text.Stemmer.Stemmers \
                 --output-dir lib/text/stemmer/stemmers

The mix snowball.gen task is supplied by the :snowball compiler dependency.

Compliance testing

Each generated stemmer is verified against the canonical Snowball corpus from snowballstem/snowball-data, vendored under test/data/<lang>/ as gzipped voc.txt/output.txt pairs. Compliance tests are tagged :compliance and excluded by default; run them explicitly with:

mix test --only compliance
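The comparison behind such a test can be sketched as follows: decompress a vocabulary file and its expected-output file, pair them line by line, and collect every word whose stem differs from the expected one. ComplianceSketch and mismatches/3 are hypothetical names; the package's actual test code may be organised differently. The function takes the gzipped contents as binaries rather than paths, so no file names are assumed.

```elixir
defmodule ComplianceSketch do
  # Hypothetical helper for illustration; not the package's real test code.
  #
  # voc_gz / output_gz: gzipped binaries with one word (or expected stem) per line.
  # stem_fun: 1-arity function, e.g. &Text.Stemmer.stem(&1, :en).
  # Returns a list of {word, expected_stem} pairs that did not match.
  def mismatches(voc_gz, output_gz, stem_fun) do
    lines(voc_gz)
    |> Enum.zip(lines(output_gz))
    |> Enum.reject(fn {word, expected} -> stem_fun.(word) == expected end)
  end

  defp lines(gz) do
    gz
    |> :zlib.gunzip()
    # Drop only the final newline so both files stay aligned line for line.
    |> String.trim_trailing("\n")
    |> String.split("\n")
  end
end
```

An empty result means the stemmer agrees with the corpus. In ExUnit, tests like this are usually tagged with @moduletag :compliance and excluded by default via ExUnit.start(exclude: [:compliance]) in test/test_helper.exs; whether this project is configured exactly that way is an assumption.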

The corpus is not shipped with the Hex package; it lives in the source tree only. Per-language licensing notes from upstream are preserved in test/data/<lang>/COPYING files. The Arabic corpus is GPLv3; the rest are mostly BSD-3-Clause or CC BY-SA. See test/data/COPYING for the umbrella terms.

Documentation

Full API documentation is published at https://hexdocs.pm/text_stemmer.

License

Apache-2.0. See LICENSE.md.
