Pre-compiled Snowball stemmers for Elixir.
This package ships 36 stemming algorithms covering a wide range of natural languages, accessible through a single Text.Stemmer.stem/2 entry point. The stemmers themselves were generated from the canonical Snowball algorithm sources using the :snowball compiler, which is included as a runtime dependency.
Add :text_stemmer to your mix.exs deps:
def deps do
  [
    {:text_stemmer, "~> 0.1"}
  ]
end

iex> Text.Stemmer.stem("generalizations", :en)
"general"
iex> Text.Stemmer.stem("gouvernements", :fr)
"gouvern"
iex> Text.Stemmer.stem_list(["running", "ran", "runs"], :en)
["run", "ran", "run"]

Languages are identified by their ISO 639-1 two-letter code. Algorithm-specific variants use a <code>_<algorithm> form, for example :en_porter, :en_lovins, or :nl_porter. See the Text.Stemmer moduledoc for the full table.
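As an illustration of how the calls and language codes above might be combined, here is a minimal sketch that turns free text into stemmed index terms. The SearchTerms module and its normalize/3 function are hypothetical and not part of this package; only Text.Stemmer.stem_list/2 comes from the library. The tokenizer is deliberately crude, and the input is lowercased first on the assumption that the stemmers, like Snowball stemmers in general, are designed for lowercase input.

```elixir
defmodule SearchTerms do
  @moduledoc """
  Hypothetical helper (not part of :text_stemmer): turns free text into a
  list of unique, stemmed terms.
  """

  # `stem_list` defaults to Text.Stemmer.stem_list/2 from this package. It is
  # injectable so the helper can be exercised without the dependency.
  def normalize(text, lang, stem_list \\ &Text.Stemmer.stem_list/2) do
    text
    |> String.downcase()
    # Crude tokenizer: split on any run of characters that are not letters.
    |> String.split(~r/[^\p{L}]+/u, trim: true)
    |> stem_list.(lang)
    # Different surface forms can share a stem, so drop duplicates.
    |> Enum.uniq()
  end
end
```

With the package installed, SearchTerms.normalize("Running runs", :en) would pass ["running", "runs"] to Text.Stemmer.stem_list/2 and deduplicate the result. Passing the stemming function as an argument is only a convenience for testing; calling Text.Stemmer.stem_list/2 directly works just as well.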
iex> length(Text.Stemmer.supported_languages())
36

The pre-generated stemmer modules under lib/text/stemmer/stemmers/ are produced from the .sbl algorithm sources vendored in src/algorithms/ (taken from snowballstem/snowball). To regenerate after editing or updating a source file, run:
mix snowball.gen --module-prefix Text.Stemmer.Stemmers \
  --output-dir lib/text/stemmer/stemmers

The mix snowball.gen task is supplied by the :snowball compiler dependency.
Each generated stemmer is verified against the canonical Snowball corpus from snowballstem/snowball-data, vendored under test/data/<lang>/ as gzipped voc.txt/output.txt pairs. Compliance tests are tagged :compliance and excluded by default; run them explicitly with:
mix test --only compliance

The corpus is not shipped with the Hex package; it lives in the source tree only. Per-language licensing notes from upstream are preserved in test/data/<lang>/COPYING files. The Arabic corpus is GPLv3; the rest are mostly BSD-3-Clause or CC BY-SA. See test/data/COPYING for the umbrella terms.
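To make the idea of a corpus check concrete, the following is a rough sketch of comparing a gzipped vocabulary/expected-output pair against a stemming function. It is not the package's actual test code: the ComplianceSketch module is hypothetical, and the file names in the trailing comment are assumptions about how the pairs are stored. It assumes one word per line, with the two files aligned line by line.

```elixir
defmodule ComplianceSketch do
  @moduledoc """
  Hypothetical sketch of a corpus check; not the package's actual tests.
  """

  # Takes the raw gzipped contents of a voc.txt/output.txt pair and a
  # one-argument stemming function. Returns a {word, expected, actual}
  # tuple for every word whose stem differs from the expected one.
  def mismatches(voc_gz, output_gz, stem_fun) do
    words = lines(voc_gz)
    expected = lines(output_gz)

    words
    |> Enum.zip(expected)
    |> Enum.map(fn {word, want} -> {word, want, stem_fun.(word)} end)
    |> Enum.reject(fn {_word, want, got} -> want == got end)
  end

  # Decompress with Erlang's :zlib and split into one entry per line.
  defp lines(gz) do
    gz
    |> :zlib.gunzip()
    |> String.split("\n", trim: true)
  end
end

# Possible usage inside the source tree (file names are assumed):
#
#   voc = File.read!("test/data/en/voc.txt.gz")
#   out = File.read!("test/data/en/output.txt.gz")
#   ComplianceSketch.mismatches(voc, out, &Text.Stemmer.stem(&1, :en))
```

An empty list means every word stemmed as expected. In an ExUnit suite, a check like this would typically sit in a test marked with @tag :compliance, with the tag excluded by default through ExUnit.start(exclude: [:compliance]) in test_helper.exs; whether this package is wired exactly that way is an assumption.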
Full API documentation is published at https://hexdocs.pm/text_stemmer.
Apache-2.0. See LICENSE.md.