Text version 0.4.0
[0.4.0] — 2026-05-01
Added
- Text.Truecase: case restoration for ALL-CAPS or lowercased text using POS-aware heuristics for proper nouns, acronyms, and sentence starts.
- Text.Clean: pipeline-style normalization (whitespace, control characters, smart quotes, dashes, NFC/NFKC) with a composable clean/2 API.
- Text.Emoji: emoji detection, stripping, and counting. Uses the :unicode package's emoji property tables; no external data required.
- Text.Hyphenation: Knuth–Liang TeX-pattern hyphenation. Ships en-US patterns (~5k); other languages can be loaded via Text.Hyphenation.Parser from any hyph-*.tex file.
- Text.PII: pattern-based detection and redaction of phone numbers, emails, credit-card-shaped digits, IBANs, IPv4/IPv6 addresses, and US SSNs.
- Text.Spell: Norvig-style edit-distance spelling suggestions backed by Text.WordFreq. Returns ranked candidates with their corpus frequency.
- Text.Summarize: extractive summarization via a sentence-graph TextRank with configurable similarity (:cosine or :jaccard) and target length.
- Text.Syllable: English syllable counting using a vowel-group heuristic with override exceptions. Used as the per-word syllable signal feeding Text.Readability.
- Text.Readability: Flesch, Flesch–Kincaid, Gunning-Fog, SMOG, Coleman–Liau, ARI, and Linsear-Write scores plus a unified analyze/2 summary.
- Text.WordFreq: frequency lookup over a 30k-word English corpus shipped in priv/wordfreq/en.tsv. Provides rank/2, frequency/2, is_common?/2, and top/2.
- Text.Lemma: dictionary-based lemmatization. Ships an en-US table of ~42k inflected→base mappings; lookup/2 falls back to the input when no entry exists.
- Text.Inflect.En.Pluralize and Text.Inflect.En.Singularize: English noun inflection covering ~1.6 KLoC of irregular-form rules and exceptions, with Text.Inflect.En.Helpers for shared morphology utilities.
- Text.Sentiment.Lexicons.AFINN now ships sentiment lexicons for 104 languages (up from 7), an Emoji Sentiment Ranking 1.0 lexicon (:emoji, ~840 entries derived from the upstream corpus and rescaled onto AFINN's −5..+5 integer range), and per-language negator lists (negators/1). The seven hand-curated 0.3.0 lexicons (:en, :da, :fi, :fr, :pl, :sv, :tr) are preserved unchanged; the other ~95 are upstream machine-translated and ship as a baseline.
- Text.Sentiment.Backends.Lexicon automatically resolves per-language negators from Text.Sentiment.Lexicons.AFINN.negators/1 based on the requested :language option, so non-English text gets negation handling out of the box. Callers can still override with an explicit :negators list.
- mix text.gen_afinn_lexicons regenerates priv/sentiment/ from the vendored data/affin/ source files. Hand-curated TSVs are preserved unless --overwrite is passed.
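
The per-language negator resolution described for Text.Sentiment.Backends.Lexicon can be illustrated with a small standalone sketch. Everything in it is hypothetical: the NegationSketch module, its three-word lexicon, its negator lists, and the rule that a negator flips the sign of the next token are illustrative assumptions, not the library's actual data or scoring algorithm. Only the option names :language and :negators come from the entry above.

```elixir
defmodule NegationSketch do
  # Toy lexicon and negator lists (assumptions for illustration only).
  @lexicon %{"good" => 3, "bad" => -3, "bon" => 3}
  @negators %{en: ["not", "never", "no"], fr: ["pas", "jamais"]}

  # Look up the negators for a language; unknown languages get none.
  def negators(language), do: Map.get(@negators, language, [])

  # Sum lexicon scores, flipping the sign of the token that directly
  # follows a negator. An explicit :negators option overrides the
  # per-language default.
  def score(text, opts \\ []) do
    language = Keyword.get(opts, :language, :en)
    negators = Keyword.get(opts, :negators, negators(language))

    text
    |> String.downcase()
    |> String.split()
    |> Enum.reduce({0, false}, fn token, {total, negated?} ->
      if token in negators do
        {total, true}
      else
        value = Map.get(@lexicon, token, 0)
        value = if negated?, do: -value, else: value
        # Negation only reaches the immediately following token.
        {total + value, false}
      end
    end)
    |> elem(0)
  end
end
```

In this sketch, "pas bon" scores negatively only when the :fr negators are in effect, which is the kind of behaviour the automatic resolution is meant to provide without the caller listing negators by hand.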
Changed
- The :unicode_string dependency requirement is ~> 2.1. The 2.1 release replaces its regex evaluator with a single-pass DFA engine; benchmarks show ~17× faster word-cloud builds for typical English prose, with linear (rather than O(N²)) scaling on long unbroken inputs.
- Text.Word.word_count/2 documentation now explicitly calls out that the default &String.split/1 splitter does not implement UAX #29 segmentation and does not work for languages without inter-word whitespace (Chinese, Japanese, Korean, Thai, Lao, Khmer, Burmese). Examples show how to pass a UAX-aware or dictionary-aware splitter for those cases.
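
To show why a whitespace splitter breaks down on text without inter-word spaces, here is a minimal standalone sketch. WordCountSketch and both of its functions are hypothetical stand-ins that only mirror the idea of a counter taking a splitter function; they are not the Text.Word API, and the return value (a plain integer) is an assumption. The Han splitter is deliberately naive: it treats every Han character as its own token, which is neither UAX #29 segmentation nor dictionary-aware.

```elixir
defmodule WordCountSketch do
  # Hypothetical counter: applies the splitter and counts the tokens.
  # The default mirrors the whitespace splitter named in the changelog.
  def count(text, splitter \\ &String.split/1) do
    text |> splitter.() |> length()
  end

  # Naive splitter (assumption for illustration): each Han character
  # becomes one token; runs of other non-space characters stay together.
  def han_char_splitter(text) do
    ~r/\p{Han}|[^\p{Han}\s]+/u
    |> Regex.scan(text)
    |> List.flatten()
  end
end

# Whitespace splitting sees the whole Chinese sentence as one "word".
WordCountSketch.count("我喜欢读书")

# Passing a different splitter changes the segmentation.
WordCountSketch.count("我喜欢读书", &WordCountSketch.han_char_splitter/1)
```

A real application would swap the naive splitter for one backed by a proper UAX #29 word-break implementation or a dictionary-based segmenter; the point here is only that the splitter is a plain function the caller can replace.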