-
Text.Phonetic.NYSIIS — New York State Identification and Intelligence System phonetic encoding (Taft, 1970). Designed as a Soundex successor for English personal-name matching; produces pronounceable letter codes rather than digits and is more discriminating than Soundex on common name variations.
-
Text.Phonetic.Cologne — Kölner Phonetik (Postel, 1969), the German-language counterpart to Soundex. Optimized for German spelling variants — Müller / Mueller / Muller and Meyer / Mayer / Maier / Meier collapse to single codes.
-
Text.Phonetic.DoubleMetaphone — Lawrence Philips' Double Metaphone (2000), the de-facto standard for fuzzy English-name matching with non-Anglo origins. Returns a {primary, alternate} code pair so the same Anglicised name can match across multiple plausible pronunciations (e.g. Smith ↔ Schmidt, Catherine ↔ Katherine). Handles Germanic, Italian, Spanish, French, Greek, and Slavic patterns.
-
match?/2 (and match?/3 where options apply) on every Text.Phonetic.* module for direct equality comparison without manual encode/2 == encode/2 boilerplate. Text.Phonetic.DoubleMetaphone.match?/3 checks all four primary/alternate combinations.
-
Text.Clean.unaccent/1 — strip diacritics and fold non-decomposable Latin letters (Þ → Th, ß → ss, Æ → AE, ł → l, đ → d) by delegating to Unicode.Transform.LatinAscii.transform/1. Also exposed as the :unaccent option on Text.Clean.clean/2.
-
Text.Distance gains four set-based similarity metrics over character n-grams: jaccard/3, sorensen_dice/3, tanimoto/3 (alias for jaccard/3), and cosine/3. All accept an :n option for configurable shingle size (default 2). Operate at the grapheme level for Unicode correctness.
-
Text.Inflect.En.singularize/2 and Text.Inflect.En.singularize_noun/2 — invert the existing pluralizer. Combines reverse lookup of Conway's irregular tables, explicit suffix rules for unambiguous English plural forms (-ies, -shes/-ches/-xes/-zes/-sses), small whitelists for Greek-derived -is/-es plurals (analyses → analysis) and English -us plurals (geniuses → genius), and a pluralize/2 round-trip search to validate other candidates.
-
Text.Readability.dale_chall/2 and Text.Readability.spache/2 — the two classic word-list readability indices, backed by bundled easy-words lists in priv/readability/ (Dale-Chall 2,949 words, Spache 1,063 words; both sourced from the MIT-licensed py-readability-metrics distribution of the public-domain originals). statistics/2 now also returns :difficult_words and :unfamiliar_words counts.
-
Text.Hyphenation bundles six additional language packs: de-1996, fr, es, it, nl, pt. All loaded at compile time with zero I/O, joining the existing en-us pack. Source: hyph-utf8 upstream; per-file licenses (MIT/X11/BSD/LPPL) are preserved in each .tex header.
-
Text.WordFreq bundles six additional frequency tables at the same top-30,000 cap as English: de, fr, es, it, nl, pt. Source: Hermit Dave's MIT-licensed FrequencyWords OpenSubtitles 2018 corpus.
-
Text.Emoji.sentiment/1 and Text.Emoji.text_sentiment/1 — per-emoji and aggregate sentiment scoring backed by the bundled Emoji Sentiment Ranking v1.0 (Kralj Novak et al., 2015 — CC-BY-SA 3.0; data file at priv/emoji_sentiment/emoji_sentiment_v1.csv, ~750 emoji with negative/neutral/positive proportions and an aggregate score in [-1.0, 1.0]). Aggregate scoring is occurrence-weighted to match the original paper.
-
mix text.download_lemma_data <lang>... — fetches lemmatization dictionaries from the michmech upstream into the Text.Data cache without requiring the per-app auto_download_lemma_data flag. Useful as a build step when shipping a release with the dictionaries pre-warmed. Pass --list to see the supported languages; --force to refresh.