Text version 0.3.0
[0.3.0] — 2026-04-29
Added
- Text.WordCloud: multilingual keyword extraction returning a weighted term list suitable for rendering as a word cloud. Six backends: YAKE! (default, unsupervised statistical), frequency, RAKE, TextRank, TF-IDF (requires :reference_corpus), and KeyBERT (neural, requires :bumblebee). The :stem option (requires the optional :text_stemmer dependency) buckets morphological variants (demolish, demolished, demolishing) into a single entry labelled with the most-frequent surface form.
- Text.WordCloud.Layout: Wordle-style Archimedean-spiral packing that produces renderer-agnostic (x, y, width, height, font_size, rotation) placements. Pluggable :font_metrics callback so callers can supply pixel-accurate metrics from their actual font stack.
- Text.WordCloud.SVG: renders placements as a self-contained SVG document. Pluggable :palette (list of hex strings, a Color.Palette.Tonal scale, a Color.Palette.Theme, or nil for single-colour) plus three mapping strategies (:by_weight, :by_index, :by_hash). Hex-string palettes work without optional deps; Color.Palette structs require the optional :color dependency.
- Text.Stopwords: bundled multilingual stopword lists from stopwords-iso (~60 languages, MIT license). Public API: for/1, contains?/2, available_languages/0, available?/1, union/2, extend/2. Generation tooling lives in mix text.gen_stopwords.
- mix text.download_models --keybert: pre-fetches the multilingual MiniLM sentence-transformer used by Text.WordCloud.Backends.KeyBERT (~470 MB). The --bumblebee shorthand now includes --keybert alongside --sentiment --pos --ner.
- Text.POS: part-of-speech tagging via the optional :bumblebee dependency. English by default (vblagoje/bert-english-uncased-finetuned-pos); override :model for other checkpoints. Returns coarse-grained tag atoms (:noun, :verb, :adj, …) with confidence scores.
- Text.NER: named-entity recognition via the optional :bumblebee dependency. Multilingual by default (Davlan/bert-base-multilingual-cased-ner-hrl, 10 high-resource languages, CoNLL-2003 tag set). Returns Text.NER.Entity structs with span byte offsets, type atom (:per, :org, :loc, :misc), and score.
- Text.Embedding: loads pre-trained word vectors in fastText .vec format. Exposes vector/2, similarity/3, nearest/3, and analogy/5 over an L2-normalised Nx matrix. Supports :filter and :max_tokens options for partial loads.
- Text.Language.Classifier.Fasttext.ScriptDetector.han_variant/1: disambiguates Simplified (:Hans) from Traditional (:Hant) Chinese using a curated codepoint-frequency analysis. detect/1 now returns :Hans or :Hant directly for Han text when the input is unambiguous, falling back to :Hani otherwise. The script signal flows through to Text.Language.Classifier.Fasttext.Locale.resolve/2, producing zh-Hans-CN vs zh-Hant-TW automatically.
- Text.Language.normalize/1 and Text.Language.to_locale_string/1: every public function in the package that takes a :language or :locale option now accepts an atom, a string (BCP-47 or otherwise), or a Localize.LanguageTag struct (when the optional :localize dependency is loaded). The new helpers normalise to the language subtag (atom) or to a canonical BCP-47 string respectively.
- Text.Sentiment.Backend behaviour with two shipped backends: Text.Sentiment.Backends.Lexicon (the default: lexicon-based, multilingual via AFINN, always available) and Text.Sentiment.Backends.Bumblebee (optional: neural via Bumblebee and XLM-RoBERTa, requires the :bumblebee and :exla deps). Routing is via the :backend option to Text.Sentiment.analyze/2 or globally via the :sentiment_backend application configuration.
- Text.Sentiment: multilingual lexicon-based sentiment analysis. Returns a label (:positive, :negative, :neutral), a normalised compound score, and the matched-token count. Handles negation ("not good" flips polarity) and intensifiers ("very good" boosts) via VADER-style scalars.
- Text.Sentiment.Lexicons.AFINN: bundled AFINN sentiment lexicons (Apache 2.0) for English, Danish, Finnish, French, Polish, Swedish, and Turkish, plus a language-agnostic emoticon lexicon. Routed automatically by Text.Sentiment.analyze/2's :language option.
- Text.Sentiment.lexicon_for/2: composes a per-language lexicon with the emoticon lexicon and/or domain-specific overrides.
- Text.Language.Classifier.Fasttext: a pure-Elixir port of fastText's lid.176 language identification model. Validated bit-for-bit against the official C++/Python reference for hashing, subword extraction, feature assembly, and tree traversal. See the README for usage.
- Text.Language.Classifier.Fasttext.ModelLoader.load/2 parses an lid.176.bin file (~126 MB) into a typed Model struct with the input/output matrices held as Nx tensors.
- Text.Language.Classifier.Fasttext.detect/3, classify/2, and to_locale/2 for the public detection API.
- Text.Language.Classifier.Fasttext.ScriptDetector for Unicode-script-of-text classification, used to disambiguate multi-script locales (e.g. sr-Latn vs sr-Cyrl). Backed by the unicode Hex package.
- Text.Language.Classifier.Fasttext.Locale.resolve/2 for CLDR-canonical locale assembly via likely-subtags. Uses the optional localize dependency when present, with a built-in fallback table for the most common languages otherwise.
- mix text.download_lid176 task that fetches lid.176.bin into priv/lid_176/. The model file is gitignored and not part of the Hex package.
- mix text.download_models task (plural) that pre-fetches every external model used by :text (lid.176.bin plus the default Hugging Face checkpoints behind Text.Sentiment.Backends.Bumblebee, Text.POS, and Text.NER) for production environments that need every artefact present at boot. Selection flags (--lid176, --sentiment, --pos, --ner, --bumblebee) limit the download to a subset.
- mix text.gen_subword_fixtures, mix text.gen_features_fixtures, mix text.gen_predict_fixtures (via priv/scripts/*.py) for regenerating the differential test fixtures against the reference fasttext Python bindings.
- docs/lid176_binary_format.md: full byte-layout specification of fastText's model file, derived from the C++ source.
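
The Archimedean-spiral packing named in the Text.WordCloud.Layout entry can be illustrated with a minimal sketch. This is not the library's implementation: the module SpiralSketch, its function names, the centre-anchored box representation, and the step sizes are all hypothetical, chosen only to show the general technique of walking outward along r = b * theta until a box fits.

```elixir
defmodule SpiralSketch do
  @moduledoc "Hypothetical illustration of spiral placement; not Text.WordCloud.Layout."

  # Lazily yields points on the Archimedean spiral r = b * theta,
  # sampled at a fixed angular step. The first point is the origin.
  def spiral_points(step \\ 0.1, b \\ 1.0) do
    Stream.iterate(0.0, &(&1 + step))
    |> Stream.map(fn theta ->
      {b * theta * :math.cos(theta), b * theta * :math.sin(theta)}
    end)
  end

  # Boxes are {x, y, width, height}, where {x, y} is the box centre.
  # Two boxes overlap when their centres are closer than the summed
  # half-extents on both axes.
  def overlap?({x1, y1, w1, h1}, {x2, y2, w2, h2}) do
    abs(x1 - x2) * 2 < w1 + w2 and abs(y1 - y2) * 2 < h1 + h2
  end

  # Places each {width, height} box at the first spiral point where it
  # collides with nothing placed so far. Returns boxes in input order.
  def place(sizes) do
    sizes
    |> Enum.reduce([], fn {w, h}, placed ->
      {x, y} =
        Enum.find(spiral_points(), fn {x, y} ->
          not Enum.any?(placed, &overlap?({x, y, w, h}, &1))
        end)

      [{x, y, w, h} | placed]
    end)
    |> Enum.reverse()
  end
end
```

Calling SpiralSketch.place([{40, 20}, {30, 10}, {20, 10}]) puts the first box at the origin and pushes later boxes outward along the spiral. A real layout would also size boxes from font metrics and handle rotation, which this sketch leaves out.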
Changed
- The minimum Elixir version is now ~> 1.17 (raised from ~> 1.8). All development and testing targets Elixir 1.20 on Erlang/OTP 28.
- Added required dependencies on :nx and :unicode. Optional dependencies on :exla (recommended for inference performance) and :localize (for CLDR-canonical locale resolution).
- The fastText inference forward pass (take + mean + dot, plus the softmax tail for softmax-loss models) is now wrapped in Nx.Defn so that an EXLA-compiled execution runs the entire pass as a single fused XLA kernel. With EXLA configured as both backend and defn compiler, per-prediction wall time on lid.176 drops from roughly 200 μs to ~100 μs, about 2× over the unfused EXLA path and 6-9× over Nx.BinaryBackend. Bit-equivalent to the pre-fusion form; the test suite passes both ways.
- The hierarchical-softmax scoring path is now also fused into the same defn graph: per-leaf paths through the Huffman tree are pre-computed at model load time and stored as fixed-shape tensors on Text.Language.Classifier.Fasttext.HuffmanTree. The recursive BEAM-side DFS (and its accompanying f32-rounding workaround) is gone. For lid.176 specifically the latency is comparable to the previous DFS approach (~125 μs vs ~110 μs); the win materialises for larger label spaces. The simpler architecture removes a fragile spot.
- Hex package version bumped to 0.3.0.
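
The shape of the fused forward pass described in this section (take, mean, dot, then softmax) can be sketched with plain Nx.Defn. This is an illustrative sketch rather than the package's code: the module ForwardSketch, its argument layout, and the Nx version constraint are assumptions, and the real model also handles hierarchical softmax, which is omitted here.

```elixir
# Assumed version constraint, used only to make the script self-contained.
Mix.install([{:nx, "~> 0.7"}])

defmodule ForwardSketch do
  import Nx.Defn

  # input:  {vocab, dim} embedding matrix
  # output: {labels, dim} classifier matrix
  # ids:    {n} integer feature ids for one piece of text
  defn forward(input, output, ids) do
    hidden =
      input
      # Gather the rows for the active features: {n, dim}
      |> Nx.take(ids)
      # Average them into one hidden vector: {dim}
      |> Nx.mean(axes: [0])

    # One score per label: {labels}
    logits = Nx.dot(output, hidden)
    softmax(logits)
  end

  # Numerically stable softmax over a rank-1 tensor.
  defn softmax(logits) do
    shifted = Nx.exp(logits - Nx.reduce_max(logits))
    shifted / Nx.sum(shifted)
  end
end
```

Because the whole pass lives in one defn, a defn compiler such as EXLA can compile it as a single graph, whereas separate eager Nx calls are dispatched one operation at a time. The sketch runs unchanged on the default Nx.BinaryBackend.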
Removed
- Breaking: the legacy n-gram language classifiers (Text.Language.Classifier.NaiveBayesian, CummulativeFrequency, RankOrder) and their supporting modules (Text.Language, Text.Language.Classifier, Text.Corpus, Text.Vocabulary). These required a separately-installed corpus (text_corpus_udhr) and were not competitive with the fastText classifier on inputs outside the UDHR register. Use Text.Language.Classifier.Fasttext.classify/2 and detect/3 instead.
- The :meeseeks build-time HTML scraper dependency along with the English-inflection scraper module (Text.Inflect.Data.En) and its mix text.create_english_plurals task. Pluralization data continues to ship as a precompiled ETF blob in priv/inflection/en/en.etf; only the regeneration tooling is gone.
- Text.Ngram.Frequency struct, Text.frequency_tuple typedef, and the Text.ensure_compiled?/1 helper. All three existed solely to support the deleted classifier behaviour and had no other callers.