Core text analysis components for INFINI Pizza
16 tokenizers · 60+ token filters · 13 normalizers · 65 built-in language analyzers
Provides the normalizers, tokenizers, token filters, and pre-composed language analyzers that form the text analysis foundation of the INFINI Pizza search engine.
- Architecture
- Normalizers
- Tokenizers
- Token Filters
- Language Analyzers
- Stop Word Lists
- Full Pipeline Examples
- Features
- Related Crates
Pizza uses a three-stage analysis pipeline:
Input Text → [Normalizer(s)] → [Tokenizer] → [Token Filter(s)] → Indexed Tokens

| Stage | Role | Trait | Invocation |
|---|---|---|---|
| Normalize | Pre-tokenization string transforms (mutate full input in-place) | `Normalizer` | `normalize(&mut String)` |
| Tokenize | Split text into a stream of tokens | `Tokenizer` | `tokenize(&str) -> Vec<Token>` |
| Filter | Transform, remove, or inject tokens post-tokenization | `TokenFilter` | `filter(&mut Token) -> (bool, Option<Vec<Token>>)` |
The `TokenFilter::filter()` return value is interpreted as follows:
- `(true, _)`: remove the token from the stream
- `(false, None)`: keep the (possibly modified) token
- `(false, Some(extras))`: keep the token and inject additional tokens at the same position
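The three return cases can be illustrated with a small standalone sketch. It does not use the crate's types: `Tok`, `apply_filter`, and `demo_filter` are hypothetical names invented for this illustration, and the engine's real driver loop may differ.

```rust
/// Simplified stand-in for the engine's token type (hypothetical; the real
/// `Token` also carries byte offsets).
#[derive(Debug, Clone, PartialEq)]
pub struct Tok {
    pub term: String,
    pub position: u32,
}

/// Runs one filter over a token stream, honoring the three outcomes.
pub fn apply_filter<F>(tokens: Vec<Tok>, mut filter: F) -> Vec<Tok>
where
    F: FnMut(&mut Tok) -> (bool, Option<Vec<Tok>>),
{
    let mut out = Vec::with_capacity(tokens.len());
    for mut tok in tokens {
        match filter(&mut tok) {
            // (true, _): the token is dropped from the stream.
            (true, _) => {}
            // (false, None): keep the (possibly modified) token.
            (false, None) => out.push(tok),
            // (false, Some(extras)): keep the token, then the injected ones.
            (false, Some(extras)) => {
                out.push(tok);
                out.extend(extras);
            }
        }
    }
    out
}

/// Example filter: lowercases every term, drops "the", and injects "pie"
/// at the same position as "pizza".
pub fn demo_filter(tok: &mut Tok) -> (bool, Option<Vec<Tok>>) {
    tok.term = tok.term.to_lowercase();
    match tok.term.as_str() {
        "the" => (true, None),
        "pizza" => {
            let extra = Tok { term: "pie".to_string(), position: tok.position };
            (false, Some(vec![extra]))
        }
        _ => (false, None),
    }
}
```

Passing the stream "The", "Pizza", "Oven" through `apply_filter` with `demo_filter` yields "pizza", "pie", "oven", with "pie" sharing the position of "pizza".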
All components are registered into an AnalysisFactory via register_all():
use pizza_analysis_core::analyzers::register_all;
use pizza_engine::analysis::AnalysisFactory;
let mut factory = AnalysisFactory::new();
register_all(&mut factory);
// Now use any registered analyzer by name
let analyzer = factory.get_analyzer("english").unwrap();

Normalizers operate on the raw input string before tokenization. They modify the entire text in-place.
Strips HTML/XML tags and decodes HTML entities. Block-level tags are replaced with a space to preserve word boundaries.
use pizza_analysis_core::HtmlStripNormalizer;
use pizza_engine::analysis::Normalizer;
// Basic usage
let normalizer = HtmlStripNormalizer::new();
let mut text = String::from("<h1>Hello</h1><p>World &amp; Pizza</p>");
normalizer.normalize(&mut text);
assert_eq!(text, " Hello World & Pizza ");
// Preserve specific tags (escaped tags are NOT stripped)
let normalizer = HtmlStripNormalizer::new()
.with_escaped_tags(vec!["b".to_string(), "i".to_string()]);
let mut text = String::from("<b>bold</b> and <script>evil</script>");
normalizer.normalize(&mut text);
assert_eq!(text, "<b>bold</b> and evil ");

| Parameter | Type | Default | Description |
|---|---|---|---|
| `escaped_tags` | `Vec<String>` | `[]` | HTML tags to preserve (case-insensitive) |

Handles the entities `&amp;`, `&lt;`, `&gt;`, `&quot;`, `&#NNN;`, and `&#xHHHH;`.
Character/string mapping normalizer. Replaces source strings with target strings in a single pass.
use pizza_analysis_core::MappingNormalizer;
use pizza_engine::analysis::Normalizer;
// From a list of mappings
let normalizer = MappingNormalizer::from_mappings(&[
("α", "a"),
("β", "b"),
(":)", "happy"),
(":(", "sad"),
]);
let mut text = String::from("α and β :) :(");
normalizer.normalize(&mut text);
assert_eq!(text, "a and b happy sad");
// Build incrementally
let mut normalizer = MappingNormalizer::new();
normalizer.add_mapping("ö", "oe");
normalizer.add_mapping("ü", "ue");
let mut text = String::from("über öl");
normalizer.normalize(&mut text);
assert_eq!(text, "ueber oel");

| Method | Description |
|---|---|
| `new()` | Empty mapping normalizer |
| `from_mappings(&[(&str, &str)])` | Create with initial mapping pairs |
| `add_mapping(&mut self, from, to)` | Add a single mapping |
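To make the "single pass" idea concrete, here is a minimal standalone sketch. It is not the crate's implementation: `apply_mappings` is a hypothetical helper, and the longest-match rule it uses when several sources match at the same position is an assumption.

```rust
/// Replaces every occurrence of a source string with its target in one
/// left-to-right pass. When several sources match at the current position,
/// the longest one wins (an assumption made for this sketch).
pub fn apply_mappings(input: &str, mappings: &[(&str, &str)]) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(ch) = rest.chars().next() {
        // Find the longest non-empty source that matches right here.
        let best = mappings
            .iter()
            .filter(|(from, _)| !from.is_empty() && rest.starts_with(*from))
            .max_by_key(|(from, _)| from.len());
        match best {
            Some((from, to)) => {
                out.push_str(to);
                rest = &rest[from.len()..];
            }
            None => {
                // No mapping applies: copy one character and move on.
                out.push(ch);
                rest = &rest[ch.len_utf8()..];
            }
        }
    }
    out
}
```

Because replaced text is written straight to the output and never rescanned, a mapping's target cannot trigger another mapping, which is the practical benefit of doing the work in one pass.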
Regex-based find & replace on the entire input string before tokenization.
use pizza_analysis_core::PatternReplaceNormalizer;
use pizza_engine::analysis::Normalizer;
// Replace digits with placeholder
let normalizer = PatternReplaceNormalizer::new(r"\d+", "NUM");
let mut text = String::from("order 12345 shipped on 2024-01-15");
normalizer.normalize(&mut text);
assert_eq!(text, "order NUM shipped on NUM-NUM-NUM");
// Use capture groups ($1, $2, etc.)
let normalizer = PatternReplaceNormalizer::new(r"(\w+)@(\w+)", "$1 at $2");
let mut text = String::from("user@host");
normalizer.normalize(&mut text);
assert_eq!(text, "user at host");
// Persian zero-width non-joiner → space (used in Persian analyzer)
let normalizer = PatternReplaceNormalizer::new(r"\x{200C}", " ");

| Parameter | Type | Description |
|---|---|---|
| `pattern` | `&str` | Regex pattern (panics if invalid) |
| `replacement` | `&str` | Replacement string; supports `$1`, `$2` capture groups |
Simple case conversion for the entire input text.
use pizza_engine::analysis::Normalizer;
// These are provided by pizza-engine
// LowercaseNormalizer::new() - converts all text to lowercase
// UppercaseNormalizer::new() - converts all text to uppercase

Strips leading and trailing whitespace from the input text.
use pizza_analysis_core::TrimNormalizer;
use pizza_engine::analysis::Normalizer;
let normalizer = TrimNormalizer::new();
let mut text = String::from(" hello world \n");
normalizer.normalize(&mut text);
assert_eq!(text, "hello world");

Collapses consecutive whitespace characters (spaces, tabs, newlines) into a single space.
use pizza_analysis_core::CollapseWhitespaceNormalizer;
use pizza_engine::analysis::Normalizer;
let normalizer = CollapseWhitespaceNormalizer::new();
let mut text = String::from("hello world\t\tfoo\n\nbar");
normalizer.normalize(&mut text);
assert_eq!(text, "hello world foo bar");

Applies Unicode normalization (NFC, NFD, NFKC, NFKD) for handling composed vs. decomposed characters.
use pizza_analysis_core::UnicodeNormalizer;
use pizza_engine::analysis::Normalizer;
// NFC: Canonical Composition (most common)
let normalizer = UnicodeNormalizer::nfc();
let mut text = String::from("e\u{0301}"); // e + combining accent
normalizer.normalize(&mut text);
assert_eq!(text, "é"); // single composed character
// NFKC: Compatibility Composition (folds typographic variants)
let normalizer = UnicodeNormalizer::nfkc();
let mut text = String::from("ﬁ"); // fi ligature
normalizer.normalize(&mut text);
assert_eq!(text, "fi");

| Constructor | Form | Use Case |
|---|---|---|
| `UnicodeNormalizer::nfc()` | NFC | Default composition; safe for most text |
| `UnicodeNormalizer::nfd()` | NFD | Decomposition; useful before diacritic stripping |
| `UnicodeNormalizer::nfkc()` | NFKC | Compatibility; normalizes ligatures, superscripts |
| `UnicodeNormalizer::nfkd()` | NFKD | Compat decomposition; most aggressive |
Tokenizers split the (normalized) text into individual tokens. Each token carries:
- `term: Cow<str>`: the token text
- `start_offset: u32`: byte offset of the token start in the original text
- `end_offset: u32`: byte offset of the token end in the original text
- `position: u32`: positional index in the token stream
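As a rough illustration of how these fields relate, the standalone sketch below splits on whitespace and records byte offsets and positions. `SketchToken` and `whitespace_tokens` are hypothetical names for this example only; the field layout mirrors the list above, but the crate's actual `Token` type may differ.

```rust
use std::borrow::Cow;

/// Illustrative token carrying the four fields described above.
#[derive(Debug, PartialEq)]
pub struct SketchToken<'a> {
    pub term: Cow<'a, str>,
    pub start_offset: u32,
    pub end_offset: u32,
    pub position: u32,
}

/// Splits on Unicode whitespace, borrowing each term from the input.
pub fn whitespace_tokens(text: &str) -> Vec<SketchToken<'_>> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    // A trailing sentinel space flushes the final word.
    let sentinel = std::iter::once((text.len(), ' '));
    for (i, ch) in text.char_indices().chain(sentinel) {
        match (ch.is_whitespace(), start) {
            // First non-whitespace character of a word.
            (false, None) => start = Some(i),
            // Whitespace right after a word: emit the token.
            (true, Some(s)) => {
                tokens.push(SketchToken {
                    term: Cow::Borrowed(&text[s..i]),
                    start_offset: s as u32,
                    end_offset: i as u32,
                    position: tokens.len() as u32,
                });
                start = None;
            }
            _ => {}
        }
    }
    tokens
}
```

Offsets count bytes rather than characters, so for the input "über pizza" the first token spans bytes 0 to 5 even though it has four characters.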
UAX#29 Unicode word break rules. Provided by pizza-engine.
use pizza_engine::analysis::{StandardTokenizer, Tokenizer};
let tokenizer = StandardTokenizer::new();
let tokens = tokenizer.tokenize("The quick brown fox jumps!");
// ["The", "quick", "brown", "fox", "jumps"]

Handles: Unicode letters/digits, keeps contractions (don't), splits on punctuation/whitespace.
Emits the entire input as a single token. Useful for exact-match fields (IDs, tags, SKUs).
use pizza_analysis_core::KeywordTokenizer;
use pizza_engine::analysis::Tokenizer;
let tokenizer = KeywordTokenizer::new();
let tokens = tokenizer.tokenize("New York City");
assert_eq!(tokens.len(), 1);
assert_eq!(tokens[0].term, "New York City");

Splits text at any character that is not a Unicode letter. Non-letter characters are discarded.
use pizza_analysis_core::LetterTokenizer;
use pizza_engine::analysis::Tokenizer;
let tokenizer = LetterTokenizer::new();
let tokens = tokenizer.tokenize("hello-world! foo123bar");
let terms: Vec<&str> = tokens.iter().map(|t| t.term.as_ref()).collect();
assert_eq!(terms, vec!["hello", "world", "foo", "bar"]);

Equivalent to LetterTokenizer + immediate lowercasing. Slightly more efficient than chaining.
use pizza_analysis_core::LowercaseTokenizer;
use pizza_engine::analysis::Tokenizer;
let tokenizer = LowercaseTokenizer::new();
let tokens = tokenizer.tokenize("Hello WORLD Foo-Bar");
let terms: Vec<&str> = tokens.iter().map(|t| t.term.as_ref()).collect();
assert_eq!(terms, vec!["hello", "world", "foo", "bar"]);

Generates character n-grams of configurable length from text.
use pizza_analysis_core::NgramTokenizer;
use pizza_engine::analysis::Tokenizer;
let tokenizer = NgramTokenizer::new(2, 3);
let tokens = tokenizer.tokenize("pizza");
let terms: Vec<&str> = tokens.iter().map(|t| t.term.as_ref()).collect();
// ["pi", "piz", "iz", "izz", "zz", "zza", "za"]

| Parameter | Type | Default | Description |
|---|---|---|---|
| `min_gram` | `usize` | required | Minimum n-gram size (inclusive) |
| `max_gram` | `usize` | required | Maximum n-gram size (inclusive) |

Builder method:
- `.with_token_chars(Vec<TokenCharKind>)`: character classes to include (`Letter`, `Digit`, `Whitespace`, `Punctuation`, `Symbol`). Empty means all characters.
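The n-gram ordering shown in the example (all sizes for one start position before moving to the next) can be reproduced with a short standalone sketch. `char_ngrams` is a hypothetical helper, not part of the crate, and it ignores the `token_chars` setting.

```rust
/// Produces character n-grams of every size in `min_gram..=max_gram`,
/// grouped by start position. Works on `char`s, so multi-byte text is safe.
pub fn char_ngrams(text: &str, min_gram: usize, max_gram: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut grams = Vec::new();
    if min_gram == 0 || min_gram > max_gram {
        return grams;
    }
    for start in 0..chars.len() {
        for len in min_gram..=max_gram {
            let end = start + len;
            if end > chars.len() {
                // Longer grams from this start would also overflow.
                break;
            }
            grams.push(chars[start..end].iter().collect());
        }
    }
    grams
}
```

For "pizza" with sizes 2 to 3 this yields "pi", "piz", "iz", "izz", "zz", "zza", "za", the same sequence listed in the comment above.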
Generates prefix-anchored (edge) n-grams. Only produces n-grams starting from the beginning of each token/word.
use pizza_analysis_core::EdgeNgramTokenizer;
use pizza_engine::analysis::Tokenizer;
let tokenizer = EdgeNgramTokenizer::new(1, 5);
let tokens = tokenizer.tokenize("pizza");
let terms: Vec<&str> = tokens.iter().map(|t| t.term.as_ref()).collect();
assert_eq!(terms, vec!["p", "pi", "piz", "pizz", "pizza"]);

| Parameter | Type | Description |
|---|---|---|
| `min_gram` | `usize` | Starting n-gram size |
| `max_gram` | `usize` | Maximum n-gram size |
Use case: Autocomplete / search-as-you-type fields.
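For intuition about why edge n-grams suit autocomplete, the standalone sketch below generates the prefixes of a single word. `edge_ngrams` is a hypothetical helper written for this illustration and says nothing about how the crate implements the tokenizer.

```rust
/// Returns the prefixes of `word` whose length in characters lies in
/// `min_gram..=max_gram`. Slices are borrowed from the input.
pub fn edge_ngrams(word: &str, min_gram: usize, max_gram: usize) -> Vec<&str> {
    word.char_indices()
        // Byte offset just past each character, i.e. a valid prefix end.
        .map(|(i, ch)| i + ch.len_utf8())
        .enumerate()
        // `n` is zero-based, so the prefix length in chars is `n + 1`.
        .filter(|(n, _)| n + 1 >= min_gram && n + 1 <= max_gram)
        .map(|(_, end)| &word[..end])
        .collect()
}
```

Indexing these prefixes means a query such as "piz" can match the stored term "pizza" with an exact term lookup instead of a wildcard scan.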
Splits text on configurable character sets: either specific characters or entire character classes.
use pizza_analysis_core::CharGroupTokenizer;
use pizza_engine::analysis::Tokenizer;
// Split on specific characters
let tokenizer = CharGroupTokenizer::new(vec!['-', '_', '.']);
let tokens = tokenizer.tokenize("hello-world_foo.bar");
let terms: Vec<&str> = tokens.iter().map(|t| t.term.as_ref()).collect();
assert_eq!(terms, vec!["hello", "world", "foo", "bar"]);
// Split on whitespace + punctuation character classes
let tokenizer = CharGroupTokenizer::new(vec![])
.split_on_whitespace()
.split_on_punctuation();

| Method | Description |
|---|---|
| `new(chars: Vec<char>)` | Split on specific characters |
| `.split_on_whitespace()` | Also split on whitespace class |
| `.split_on_letter()` | Also split on letter class |
| `.split_on_digit()` | Also split on digit class |
| `.split_on_punctuation()` | Also split on punctuation class |
| `.split_on_symbol()` | Also split on symbol class |
Tokenizes filesystem-like paths into hierarchical segments for faceted navigation.
use pizza_analysis_core::PathHierarchyTokenizer;
use pizza_engine::analysis::Tokenizer;
let tokenizer = PathHierarchyTokenizer::default();
let tokens = tokenizer.tokenize("/usr/local/bin");
let terms: Vec<&str> = tokens.iter().map(|t| t.term.as_ref()).collect();
assert_eq!(terms, vec!["/usr", "/usr/local", "/usr/local/bin"]);
// Custom separator with skip
let tokenizer = PathHierarchyTokenizer::new()
.with_separator('.')
.with_skip(1); // skip first segment
let tokens = tokenizer.tokenize("com.example.app.Main");
let terms: Vec<&str> = tokens.iter().map(|t| t.term.as_ref()).collect();
assert_eq!(terms, vec!["example", "example.app", "example.app.Main"]);

| Method | Default | Description |
|---|---|---|
| `.with_separator(char)` | `'/'` | Path separator character |
| `.with_replacement(char)` | `'/'` | Character used in output |
| `.with_skip(usize)` | `0` | Skip first N path segments |
| `.reversed()` | `false` | Output in reverse hierarchy order |
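The hierarchy expansion can be sketched in a few lines of standalone Rust. `path_hierarchy` is a hypothetical function for illustration; it covers only the separator and skip options, and how it treats a leading separator after skipping is an assumption rather than documented crate behavior.

```rust
/// Expands a path into cumulative prefixes, one per kept segment.
/// Empty segments (for example from a leading separator) are ignored.
pub fn path_hierarchy(path: &str, separator: char, skip: usize) -> Vec<String> {
    let leading = path.starts_with(separator);
    let mut current = String::new();
    let mut out = Vec::new();
    for segment in path.split(separator).filter(|s| !s.is_empty()).skip(skip) {
        // Re-add the separator between segments, and in front of the first
        // one when the original path started with it.
        if leading || !current.is_empty() {
            current.push(separator);
        }
        current.push_str(segment);
        out.push(current.clone());
    }
    out
}
```

Each emitted term is a complete ancestor path, so a filter on "/usr/local" matches every document stored underneath that directory.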
Regex-based tokenizer with two operating modes:
- Split mode (default, group = -1): Pattern is the delimiter; text between matches becomes tokens
- Match mode (group ≥ 0): Pattern matches become tokens; capture groups extracted
use pizza_analysis_core::PatternTokenizer;
use pizza_engine::analysis::Tokenizer;
// Split mode: split on non-word characters (default)
let tokenizer = PatternTokenizer::default(); // pattern = r"\W+"
let tokens = tokenizer.tokenize("hello, world! foo-bar");
let terms: Vec<&str> = tokens.iter().map(|t| t.term.as_ref()).collect();
assert_eq!(terms, vec!["hello", "world", "foo", "bar"]);
// Split on custom delimiter
let tokenizer = PatternTokenizer::new(r"[,;]\s*");
let tokens = tokenizer.tokenize("one, two; three");
let terms: Vec<&str> = tokens.iter().map(|t| t.term.as_ref()).collect();
assert_eq!(terms, vec!["one", "two", "three"]);
// Match mode: extract emails
let tokenizer = PatternTokenizer::with_group(r"\b[\w.]+@[\w.]+\b", 0);
let tokens = tokenizer.tokenize("contact user@example.com or admin@site.org");
let terms: Vec<&str> = tokens.iter().map(|t| t.term.as_ref()).collect();
assert_eq!(terms, vec!["user@example.com", "admin@site.org"]);

| Constructor | Description |
|---|---|
| `PatternTokenizer::default()` | Split on `\W+` |
| `PatternTokenizer::new(pattern)` | Split on custom regex |
| `PatternTokenizer::with_group(pattern, group)` | Match mode; group=0 for full match, 1+ for capture groups |
Legacy tokenizer that recognizes English grammar patterns:
- Preserves acronyms (U.S.A.)
- Preserves company names with apostrophes (O'Reilly)
- Keeps email addresses and hostnames intact
- Splits on most punctuation
use pizza_analysis_core::ClassicTokenizer;
use pizza_engine::analysis::Tokenizer;
let tokenizer = ClassicTokenizer::new();
let tokens = tokenizer.tokenize("U.S.A. email: test@example.com");
let terms: Vec<&str> = tokens.iter().map(|t| t.term.as_ref()).collect();
// Keeps "U.S.A." and "test@example.com" as single tokens

| Method | Default | Description |
|---|---|---|
| `.with_max_token_length(usize)` | `255` | Maximum characters per token |
UAX#29-based tokenizer that additionally recognizes URLs and email addresses as single tokens.
use pizza_analysis_core::UaxUrlEmailTokenizer;
use pizza_engine::analysis::Tokenizer;
let tokenizer = UaxUrlEmailTokenizer::new();
let tokens = tokenizer.tokenize("Visit https://pizza.dev or email hello@pizza.dev today");
let terms: Vec<&str> = tokens.iter().map(|t| t.term.as_ref()).collect();
assert_eq!(terms, vec!["Visit", "https://pizza.dev", "or", "email", "hello@pizza.dev", "today"]);

| Method | Default | Description |
|---|---|---|
| `.with_max_token_length(usize)` | `255` | Maximum characters per token |
Lightweight regex tokenizers:
- SimplePatternTokenizer: Text matching the pattern becomes tokens
- SimplePatternSplitTokenizer: Pattern is a delimiter; text between matches becomes tokens
use pizza_analysis_core::SimplePatternTokenizer;
use pizza_engine::analysis::Tokenizer;
// Extract sequences of digits
let tokenizer = SimplePatternTokenizer::new(r"\d+").unwrap();
let tokens = tokenizer.tokenize("order 123 has 4 items");
let terms: Vec<&str> = tokens.iter().map(|t| t.term.as_ref()).collect();
assert_eq!(terms, vec!["123", "4"]);

use pizza_analysis_core::SimplePatternSplitTokenizer;
use pizza_engine::analysis::Tokenizer;
// Split on underscores
let tokenizer = SimplePatternSplitTokenizer::new(r"_+").unwrap();
let tokens = tokenizer.tokenize("foo__bar_baz");
let terms: Vec<&str> = tokens.iter().map(|t| t.term.as_ref()).collect();
assert_eq!(terms, vec!["foo", "bar", "baz"]);

Segments Thai text at script boundaries (Thai/non-Thai transitions). Handles Thai-specific whitespace and punctuation rules.
use pizza_analysis_core::ThaiTokenizer;
use pizza_engine::analysis::Tokenizer;
let tokenizer = ThaiTokenizer::new();
let tokens = tokenizer.tokenize("การทดสอบ test");
// Splits at Thai/Latin script boundary

Note: For full dictionary-based Thai word segmentation, use an external ICU-based tokenizer.
Segments Myanmar/Burmese script text at syllable boundaries using Unicode code points and the virama (killer) character \u{1039}.
use pizza_analysis_core::BurmeseTokenizer;
use pizza_engine::analysis::Tokenizer;
let tokenizer = BurmeseTokenizer::new();
let tokens = tokenizer.tokenize("မြန်မာစာ");
// Segments at Myanmar syllable boundaries

Non-Myanmar text is split at whitespace/punctuation as usual.
Token filters transform, remove, or inject tokens after tokenization. They are applied sequentially in the order configured.
Converts all token text to lowercase using Unicode-aware lowercasing.
use pizza_analysis_core::LowercaseTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = LowercaseTokenFilter::new();
let mut token = Token::new("HELLO World", 0, 11, 0);
filter.filter(&mut token);
assert_eq!(token.term, "hello world");

Converts all token text to uppercase.
use pizza_analysis_core::UppercaseTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = UppercaseTokenFilter::new();
let mut token = Token::new("hello", 0, 5, 0);
filter.filter(&mut token);
assert_eq!(token.term, "HELLO");

Removes leading and trailing whitespace from each token.
use pizza_analysis_core::TrimTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = TrimTokenFilter::new();
let mut token = Token::new(" hello ", 0, 9, 0);
filter.filter(&mut token);
assert_eq!(token.term, "hello");

Reverses the character order of each token. Useful for leading-wildcard search simulation.
use pizza_analysis_core::ReverseTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = ReverseTokenFilter::new();
let mut token = Token::new("hello", 0, 5, 0);
filter.filter(&mut token);
assert_eq!(token.term, "olleh");

Truncates tokens to a maximum number of characters.
use pizza_analysis_core::TruncateTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = TruncateTokenFilter::new(5);
let mut token = Token::new("university", 0, 10, 0);
filter.filter(&mut token);
assert_eq!(token.term, "unive");

| Parameter | Type | Description |
|---|---|---|
| `length` | `usize` | Maximum characters to keep |
Removes tokens that fall outside the specified character length range.
use pizza_analysis_core::LengthTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = LengthTokenFilter::new(3, 10);
let mut short = Token::new("ab", 0, 2, 0);
let (remove, _) = filter.filter(&mut short);
assert!(remove); // "ab" is too short (< 3)
let mut ok = Token::new("hello", 0, 5, 1);
let (remove, _) = filter.filter(&mut ok);
assert!(!remove); // "hello" is 5 chars, within [3, 10]

| Parameter | Type | Description |
|---|---|---|
| `min` | `usize` | Minimum token length (chars); shorter tokens removed |
| `max` | `usize` | Maximum token length (chars); longer tokens removed |
Limits the total number of tokens emitted. Stateful: call reset() between documents.
use pizza_analysis_core::LimitTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = LimitTokenFilter::new(3);
// First 3 tokens pass through, subsequent ones are removed

| Parameter | Type | Description |
|---|---|---|
| `max_token_count` | `u32` | Maximum tokens to emit per document |
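Why `reset()` matters is easiest to see in a standalone sketch of a counting filter. `LimitSketch` and `should_remove` are hypothetical names, and using `Cell` so the counter can change behind `&self` is an assumption made for this example; the crate may track state differently.

```rust
use std::cell::Cell;

/// Counts tokens seen so far and flags everything past the limit.
pub struct LimitSketch {
    max_token_count: u32,
    seen: Cell<u32>,
}

impl LimitSketch {
    pub fn new(max_token_count: u32) -> Self {
        Self { max_token_count, seen: Cell::new(0) }
    }

    /// Returns `true` when the current token should be removed.
    pub fn should_remove(&self) -> bool {
        let n = self.seen.get();
        if n >= self.max_token_count {
            return true;
        }
        self.seen.set(n + 1);
        false
    }

    /// Clears the counter; call this before starting the next document.
    pub fn reset(&self) {
        self.seen.set(0);
    }
}
```

Without the reset, the count from one document would carry over and every token of the next document would be dropped.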
Removes duplicate tokens from the stream. Only keeps the first occurrence.
use pizza_analysis_core::UniqueTokenFilter;
// If stream is ["the", "quick", "the", "fox"], removes second "the"

Removes duplicate tokens at the same position (e.g., from synonym expansion). Uses RemoveDuplicatesState for tracking.
Removes common stop words from the token stream.
use pizza_analysis_core::StopTokenFilter;
use pizza_analysis_core::token_filters::stopwords;
use pizza_engine::analysis::{Token, TokenFilter};
// Using built-in language stop words
let words = stopwords::get_stop_words("english").unwrap();
let filter = StopTokenFilter::new(words);
let mut token = Token::new("the", 0, 3, 0);
let (remove, _) = filter.filter(&mut token);
assert!(remove); // "the" is a stop word
let mut token = Token::new("pizza", 0, 5, 1);
let (remove, _) = filter.filter(&mut token);
assert!(!remove); // "pizza" is NOT a stop word
// Case-insensitive mode
let filter = StopTokenFilter::new(&["the", "a", "an"])
.with_ignore_case(true);
let mut token = Token::new("THE", 0, 3, 0);
let (remove, _) = filter.filter(&mut token);
assert!(remove); // matches "the" case-insensitively

| Parameter | Type | Default | Description |
|---|---|---|---|
| `words` | `&[&str]` | required | Stop word list |
| `ignore_case` | `bool` | `false` | Case-insensitive matching |
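The effect of `ignore_case` can be shown with a standalone set lookup. `StopSketch` is a hypothetical type for this illustration; lowercasing both the list and the term is one simple way to get case-insensitive matching and is assumed here, not taken from the crate.

```rust
use std::collections::HashSet;

/// Minimal stop word check backed by a hash set.
pub struct StopSketch {
    words: HashSet<String>,
    ignore_case: bool,
}

impl StopSketch {
    pub fn new(words: &[&str], ignore_case: bool) -> Self {
        // In case-insensitive mode the list is stored lowercased once.
        let words: HashSet<String> = words
            .iter()
            .map(|w| if ignore_case { w.to_lowercase() } else { w.to_string() })
            .collect();
        Self { words, ignore_case }
    }

    /// Returns `true` when `term` should be removed from the stream.
    pub fn is_stop_word(&self, term: &str) -> bool {
        if self.ignore_case {
            self.words.contains(&term.to_lowercase())
        } else {
            self.words.contains(term)
        }
    }
}
```

When a lowercase filter already runs earlier in the chain, the default case-sensitive mode avoids the extra allocation per lookup.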
The inverse of stop filter: only keeps tokens in the whitelist; removes everything else.
use pizza_analysis_core::KeepWordsTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = KeepWordsTokenFilter::new(
vec!["pizza".to_string(), "pasta".to_string(), "salad".to_string()]
);
let mut token = Token::new("pizza", 0, 5, 0);
let (remove, _) = filter.filter(&mut token);
assert!(!remove); // "pizza" is in keep list
let mut token = Token::new("burger", 0, 6, 1);
let (remove, _) = filter.filter(&mut token);
assert!(remove); // "burger" is NOT in keep list

| Parameter | Type | Default | Description |
|---|---|---|---|
| `words` | `Vec<String>` | required | Whitelist of words to keep |
| `ignore_case` | `bool` | `false` | Case-insensitive matching |
Folds Unicode characters to their ASCII equivalents: ü→u, é→e, ñ→n, ß→ss, ø→o, and hundreds more including Greek/Cyrillic transliterations.
use pizza_analysis_core::AsciiFoldingTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = AsciiFoldingTokenFilter::new();
let mut token = Token::new("résumé", 0, 8, 0);
filter.filter(&mut token);
assert_eq!(token.term, "resume");
// Preserve original + emit folded version
let filter = AsciiFoldingTokenFilter::preserving_original();
let mut token = Token::new("über", 0, 5, 0);
let (_, extra) = filter.filter(&mut token);
assert_eq!(token.term, "uber"); // modified to ASCII
// extra contains original "über" at same position

Converts Unicode decimal digits from any script (Arabic-Indic, Devanagari, Thai, etc.) to ASCII 0-9.
use pizza_analysis_core::DecimalDigitTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = DecimalDigitTokenFilter::new();
let mut token = Token::new("٣٢١", 0, 6, 0); // Arabic-Indic digits
filter.filter(&mut token);
assert_eq!(token.term, "321");

Normalizes fullwidth/halfwidth CJK character variants:
- Fullwidth ASCII (Ａ-Ｚ, ０-９) → normal ASCII (A-Z, 0-9)
- Halfwidth Katakana → fullwidth Katakana
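The fullwidth-to-ASCII half of this mapping follows directly from Unicode layout: the fullwidth forms U+FF01 to U+FF5E sit exactly 0xFEE0 above their ASCII counterparts. The standalone sketch below covers only that half (halfwidth Katakana needs a lookup table and is left out); `fold_fullwidth_ascii` is a hypothetical helper, not the crate's filter.

```rust
/// Maps fullwidth ASCII variants (U+FF01..=U+FF5E) to plain ASCII and
/// leaves every other character untouched.
pub fn fold_fullwidth_ascii(term: &str) -> String {
    term.chars()
        .map(|c| match c {
            // Fixed offset between the fullwidth block and ASCII.
            '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
            _ => c,
        })
        .collect()
}
```

Each fullwidth character takes three bytes in UTF-8 while its ASCII form takes one, so folded terms are shorter than the original offsets suggest.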
use pizza_analysis_core::CjkWidthTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = CjkWidthTokenFilter::new();
let mut token = Token::new("Ｔｅｓｔ", 0, 12, 0); // fullwidth
filter.filter(&mut token);
assert_eq!(token.term, "Test");

Strips everything after (and including) the first apostrophe. Useful for Turkish, Italian.
use pizza_analysis_core::ApostropheTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = ApostropheTokenFilter::new();
let mut token = Token::new("Istanbul'un", 0, 11, 0);
filter.filter(&mut token);
assert_eq!(token.term, "Istanbul");

Removes leading articles/elisions in Romance languages (text before an apostrophe when it matches a known article).
use pizza_analysis_core::ElisionTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
// French elisions
let filter = ElisionTokenFilter::french();
let mut token = Token::new("l'avion", 0, 7, 0);
filter.filter(&mut token);
assert_eq!(token.term, "avion");
let mut token = Token::new("qu'est", 0, 6, 0);
filter.filter(&mut token);
assert_eq!(token.term, "est");
// Custom articles
let filter = ElisionTokenFilter::new(&["l", "d", "n", "qu"]);
// Pre-built language sets
let french = ElisionTokenFilter::french(); // l', m', t', qu', n', s', j', d'
let italian = ElisionTokenFilter::italian(); // l', all', dall', dell', nell', ...
let catalan = ElisionTokenFilter::catalan(); // d', l', m', n', s', qu'

| Constructor | Articles |
|---|---|
| `ElisionTokenFilter::new(&[&str])` | Custom article list |
| `ElisionTokenFilter::french()` | l, m, t, qu, n, s, j, d |
| `ElisionTokenFilter::italian()` | l, all, dall, dell, nell, sull, un, quest, quell |
| `ElisionTokenFilter::catalan()` | d, l, m, n, s, qu |
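The elision rule amounts to "if the text before the first apostrophe is a known article, drop it together with the apostrophe". A standalone sketch, with `strip_elision` as a hypothetical helper and the handling of the typographic apostrophe (U+2019) as an assumption:

```rust
/// Strips a leading article plus apostrophe when the article is in the list.
/// Returns a slice of the input, so nothing is allocated.
pub fn strip_elision<'a>(term: &'a str, articles: &[&str]) -> &'a str {
    // Accept the ASCII apostrophe and the typographic one.
    let found = term
        .char_indices()
        .find(|(_, c)| *c == '\'' || *c == '\u{2019}');
    if let Some((idx, apostrophe)) = found {
        let prefix = &term[..idx];
        if articles.iter().any(|a| a.eq_ignore_ascii_case(prefix)) {
            return &term[idx + apostrophe.len_utf8()..];
        }
    }
    term
}
```

Words such as "aujourd'hui" are left intact because "aujourd" is not an article, which is what distinguishes this filter from the plain apostrophe filter described earlier.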
K-stem algorithm for English. Combines algorithmic suffix stripping with a dictionary for high-quality English stemming.
use pizza_analysis_core::KStemTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = KStemTokenFilter::new();
let mut token = Token::new("running", 0, 7, 0);
filter.filter(&mut token);
assert_eq!(token.term, "run");

Less aggressive than Porter stemmer. Does not over-stem: "university" stays "university" (not "univers").
Post-processing for ClassicTokenizer: removes trailing possessives ('s) and dots from acronyms.
use pizza_analysis_core::ClassicTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = ClassicTokenFilter::new();
let mut token = Token::new("U.S.A.", 0, 6, 0);
filter.filter(&mut token);
assert_eq!(token.term, "USA");
let mut token = Token::new("children's", 0, 10, 0);
filter.filter(&mut token);
assert_eq!(token.term, "children");

Dictionary-based stem override. Apply before algorithmic stemming to handle exceptions and irregular words.
use pizza_analysis_core::StemmerOverrideTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = StemmerOverrideTokenFilter::from_rules(&[
("running", "run"),
("better", "good"),
("mice", "mouse"),
]);
let mut token = Token::new("mice", 0, 4, 0);
filter.filter(&mut token);
assert_eq!(token.term, "mouse");

| Constructor | Format |
|---|---|
| `from_rules(&[(&str, &str)])` | (word, stem) pairs |
| `new(HashMap<String, String>)` | Pre-built HashMap |
| `.with_ignore_case(bool)` | Case-insensitive lookup (default: false) |
Dictionary-based stemming using loaded word→stem mappings. Alternative to algorithmic stemmers for domain-specific vocabularies.
use pizza_analysis_core::DictionaryStemTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
// From tab-separated file format
let filter = DictionaryStemTokenFilter::from_tab_separated(
"running\trun\nswimming\tswim\nchildren\tchild"
);
// From arrow-separated format
let filter = DictionaryStemTokenFilter::from_arrow_separated(
"running => run\nswimming => swim"
);
let mut token = Token::new("running", 0, 7, 0);
filter.filter(&mut token);
assert_eq!(token.term, "run");

| Method | Description |
|---|---|
| `new(entries: Vec<(String, String)>)` | From (word, stem) pairs |
| `from_tab_separated(content: &str)` | Parse "word\tstem" lines |
| `from_arrow_separated(content: &str)` | Parse "word => stem" lines |
| `.with_case_insensitive(bool)` | Case-insensitive (default: true) |
Morphological stemming using Hunspell-style affix rules. Supports prefix/suffix stripping with conditions.
use pizza_analysis_core::{HunspellStemFilter, AffixRule};
use pizza_engine::analysis::{Token, TokenFilter};
let mut filter = HunspellStemFilter::new();
// Add suffix rule: strip "ing", add "", condition "." (any)
filter.add_suffix_rule("ing", "", ".");
// Add suffix rule: strip "s", add "", condition "." (any)
filter.add_suffix_rule("s", "", ".");
let mut token = Token::new("running", 0, 7, 0);
filter.filter(&mut token);
assert_eq!(token.term, "runn"); // strips "ing"

| Method | Parameters | Description |
|---|---|---|
| `add_suffix_rule(strip, affix, condition)` | All `&str` | Add a suffix stripping rule |
| `add_prefix_rule(strip, affix, condition)` | All `&str` | Add a prefix stripping rule |
Config fields: dedup: bool (default: true), longest_only: bool (default: false).
All language stemmers take no parameters (::new()) and implement lightweight suffix-stripping algorithms.
| Filter | Language | Algorithm |
|---|---|---|
| `ArabicStemTokenFilter` | Arabic | Root extraction (prefix/suffix/pattern removal) |
| `BengaliStemTokenFilter` | Bengali | Common Bengali suffix removal |
| `BrazilianStemTokenFilter` | Portuguese (BR) | Plural, gender, verb, and noun stemming |
| `BulgarianStemTokenFilter` | Bulgarian | Light suffix stripping |
| `CzechStemTokenFilter` | Czech | Dolamic/Savoy light stemmer + palatalization |
| `DutchStemTokenFilter` | Dutch | Kraaij-Pohlmann suffix algorithm |
| `FinnishLightStemTokenFilter` | Finnish | Case/number suffix removal (-ssa, -lla, -lta, etc.) |
| `FrenchLightStemTokenFilter` | French | ~70 rules; gender/plural/verb endings |
| `FrenchMinimalStemTokenFilter` | French | Minimal: plural + feminine only |
| `GalicianStemTokenFilter` | Galician | Full: plural + derivational suffixes |
| `GalicianMinimalStemTokenFilter` | Galician | Minimal: plural only |
| `GermanLightStemTokenFilter` | German | Light compound-aware stemmer |
| `GermanMinimalStemTokenFilter` | German | Minimal: plurals only |
| `GreekStemTokenFilter` | Greek | Greek suffix rule set |
| `HindiStemTokenFilter` | Hindi | Hindi suffix removal |
| `HungarianLightStemTokenFilter` | Hungarian | Case suffix removal (-ban, -nak, -ból, etc.) |
| `IndonesianStemTokenFilter` | Indonesian | Prefix (me-, ber-, di-) + suffix (-kan, -an, -i) |
| `ItalianLightStemTokenFilter` | Italian | Light plurals/gender |
| `KannadaStemTokenFilter` | Kannada | Vibhakti (case marker) removal |
| `LatvianStemTokenFilter` | Latvian | Noun/adjective/verb endings |
| `NorwegianLightStemTokenFilter` | Norwegian | Light (Bokmål + Nynorsk) |
| `PersianStemTokenFilter` | Persian | Persian suffix stemmer |
| `PortugueseLightStemTokenFilter` | Portuguese | Light plural/gender removal |
| `RussianLightStemTokenFilter` | Russian | Lightweight suffix stripping |
| `SpanishLightStemTokenFilter` | Spanish | Light plural/gender |
| `TamilStemTokenFilter` | Tamil | Case/plural suffix stripping |
| `TeluguStemTokenFilter` | Telugu | Case marker suffix stripping |
Example:
use pizza_analysis_core::FrenchLightStemTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = FrenchLightStemTokenFilter::new();
let mut token = Token::new("chevaux", 0, 7, 0);
filter.filter(&mut token);
assert_eq!(token.term, "cheval");
let mut token = Token::new("nationale", 0, 9, 0);
filter.filter(&mut token);
assert_eq!(token.term, "national");

These normalize language-specific character variations. All take no parameters (::new()).
Normalizes Arabic orthographic variations:
- Alef variants (أ إ آ) → Alef (ا)
- Teh Marbuta (ة) → Heh (ه)
- Yeh variants (ى) → Yeh (ي)
- Removes diacritics (Fatha, Kasra, Damma, Shadda, Sukun)
use pizza_analysis_core::ArabicNormalizationTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = ArabicNormalizationTokenFilter::new();
// Normalizes Arabic character variants for consistent indexing

Normalizes German umlaut characters and sharp-s:
- ä → a, ö → o, ü → u
- ß → ss
- ae → a, oe → o, ue → u (digraph normalization)
use pizza_analysis_core::GermanNormalizationTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = GermanNormalizationTokenFilter::new();
let mut token = Token::new("über", 0, 5, 0);
filter.filter(&mut token);
assert_eq!(token.term, "uber");
let mut token = Token::new("straße", 0, 7, 0);
filter.filter(&mut token);
assert_eq!(token.term, "strasse");

Shared normalization across all Indic scripts (Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Telugu, Kannada, Malayalam). Normalizes nukta, canonical equivalents, and visarga.
Hindi-specific normalization (applied after IndicNormalization):
- Chandrabindu → Anunasika
- Nukta removal
- Final halant removal
Bengali-specific character normalizations on top of the generic Indic normalization.
Normalizes Persian character variants:
- Arabic Yeh (ي) → Persian Yeh (ی)
- Arabic Keh (ك) → Persian Keh (ک)
Romanian diacritic normalization (handles both old and new standard):
- ş (cedilla) → ș (comma below)
- ţ (cedilla) → ț (comma below)
Normalizes interchangeable Scandinavian vowels:
- ä, æ → a
- ö, ø → o
- å → o (for Swedish/Norwegian equivalence)
use pizza_analysis_core::ScandinavianNormalizationTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = ScandinavianNormalizationTokenFilter::new();
let mut token = Token::new("räksmörgås", 0, 13, 0); // 10 characters, 13 bytes in UTF-8
filter.filter(&mut token);
// Normalizes Scandinavian vowels for cross-language matching

More aggressive Scandinavian folding than normalization:
- å → a, ä → a, æ → a
- ö → o, ø → o
- ü → u
Transliterates Serbian Cyrillic to Latin equivalent for unified indexing.
use pizza_analysis_core::SerbianNormalizationTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = SerbianNormalizationTokenFilter::new();
let mut token = Token::new("Београд", 0, 14, 0);
filter.filter(&mut token);
assert_eq!(token.term, "Beograd");

Sorani Kurdish normalization: handles Yeh/Alef Maksura equivalence, Heh/Ae variations.
Greek-aware lowercasing that handles:
- Tonos (accent) removal
- Final sigma handling (σ/ς) after lowercasing
- Dialytika preservation
use pizza_analysis_core::GreekLowercaseTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = GreekLowercaseTokenFilter::new();
let mut token = Token::new("ΑΘΗΝΑ", 0, 10, 0);
filter.filter(&mut token);
assert_eq!(token.term, "αθηνα"); // tonos removed, lowercased

Turkish-specific lowercasing with dotted/dotless I handling:
- İ (U+0130) → i
- I → ı (U+0131, dotless i)
use pizza_analysis_core::TurkishLowercaseTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = TurkishLowercaseTokenFilter::new();
let mut token = Token::new("İSTANBUL", 0, 9, 0);
filter.filter(&mut token);
assert_eq!(token.term, "istanbul");
let mut token = Token::new("ISPARTA", 0, 7, 0);
filter.filter(&mut token);
assert_eq!(token.term, "ısparta"); // I → ı (dotless)

- IrishLowercaseTokenFilter: Handles Irish eclipsis mutations (nDún → dún when lowercasing)
- IrishElisionTokenFilter: Strips Irish elisions: d', n-, t-
Generates character-level n-grams from each token.
use pizza_analysis_core::NgramTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = NgramTokenFilter::new(2, 3);
let mut token = Token::new("hello", 0, 5, 0);
let (_, extra) = filter.filter(&mut token);
// token becomes "he" (first 2-gram)
// extra contains: "hel", "el", "ell", "ll", "llo", "lo"

| Parameter | Type | Default | Description |
|---|---|---|---|
| `min_gram` | `usize` | required | Minimum n-gram size |
| `max_gram` | `usize` | required | Maximum n-gram size |
| `preserve_original` | `bool` | `false` | Keep original token |

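To make the emission order in the example above concrete, here is a small standalone sketch of character n-gram generation. It is an illustration only: `char_ngrams` is a hypothetical helper, not part of `pizza_analysis_core`, and it assumes grams are ordered by start position and then by length.

```rust
/// Returns every character n-gram of `term` with a length in
/// `min_gram..=max_gram`, ordered by start position, then by length.
/// Hypothetical helper, not the crate's API.
fn char_ngrams(term: &str, min_gram: usize, max_gram: usize) -> Vec<String> {
    // Work on chars, not bytes, so multi-byte characters are never split.
    let chars: Vec<char> = term.chars().collect();
    let mut grams: Vec<String> = Vec::new();
    if min_gram == 0 || min_gram > max_gram {
        return grams;
    }
    for start in 0..chars.len() {
        for len in min_gram..=max_gram {
            let end = start + len;
            if end > chars.len() {
                break;
            }
            grams.push(chars[start..end].iter().collect::<String>());
        }
    }
    grams
}
```

With `("hello", 2, 3)` the first gram is `"he"` and the rest are `"hel"`, `"el"`, `"ell"`, `"ll"`, `"llo"`, `"lo"`, which matches the token/extra split shown in the example.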
Generates prefix n-grams from each token (useful for autocomplete at index time).
use pizza_analysis_core::EdgeNgramTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = EdgeNgramTokenFilter::new(1, 4)
    .with_preserve_original(true);
let mut token = Token::new("pizza", 0, 5, 0);
let (_, extra) = filter.filter(&mut token);
// token becomes "p" (min edge-gram)
// extra contains: "pi", "piz", "pizz", and original "pizza"

| Parameter | Type | Default | Description |
|---|---|---|---|
| `min_gram` | `usize` | required | Starting prefix length |
| `max_gram` | `usize` | required | Maximum prefix length |
| `preserve_original` | `bool` | `false` | Keep original token |

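The prefix expansion can be sketched in a few lines of plain Rust. This is an illustrative stand-in rather than the crate's code: `edge_ngrams` is a hypothetical name, and the handling of `preserve_original` (append the full term only if it was not already emitted) is an assumption.

```rust
/// Returns the prefixes of `term` with lengths in `min_gram..=max_gram`,
/// optionally followed by the full term. Hypothetical helper, not the crate's API.
fn edge_ngrams(term: &str, min_gram: usize, max_gram: usize, preserve_original: bool) -> Vec<String> {
    // Count in chars so multi-byte characters are kept whole.
    let chars: Vec<char> = term.chars().collect();
    let longest = max_gram.min(chars.len());
    let mut grams: Vec<String> = (min_gram.max(1)..=longest)
        .map(|len| chars[..len].iter().collect::<String>())
        .collect();
    // Assumption: the original is appended only when it is not already the last prefix.
    if preserve_original && !term.is_empty() && grams.last().map(String::as_str) != Some(term) {
        grams.push(term.to_string());
    }
    grams
}
```

Because every prefix is indexed, a search-time term such as `"piz"` matches `"pizza"` with an exact term lookup and no wildcard query.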
Creates word-level n-grams (shingles) for phrase search optimization. Stateful: tokens are fed one at a time through the `add_token()` API.
use pizza_analysis_core::ShingleTokenFilter;
let mut filter = ShingleTokenFilter::new(2, 3)
    .with_separator(" ")
    .with_output_unigrams(false);
// Feed tokens one at a time
filter.reset();
let shingles1 = filter.add_token("the"); // []
let shingles2 = filter.add_token("quick"); // ["the quick"]
let shingles3 = filter.add_token("fox"); // ["quick fox", "the quick fox"]

| Parameter | Type | Default | Description |
|---|---|---|---|
| `min_size` | `usize` | `2` | Minimum shingle size (words) |
| `max_size` | `usize` | required | Maximum shingle size (words) |
| `separator` | `String` | `" "` | Word separator in output |
| `output_unigrams` | `bool` | `true` | Output individual tokens too |
| `filler_token` | `String` | `"_"` | Placeholder for position gaps |

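The stateful behaviour shown in the example can be modelled with a sliding window over the most recent words. The sketch below is a simplified illustration: the `Shingler` type is hypothetical, and unigram output and filler tokens are deliberately left out.

```rust
use std::collections::VecDeque;

/// Minimal sliding-window shingler (hypothetical type, not the crate's API).
struct Shingler {
    min_size: usize,
    max_size: usize,
    separator: String,
    window: VecDeque<String>,
}

impl Shingler {
    fn new(min_size: usize, max_size: usize, separator: &str) -> Self {
        Shingler {
            min_size: min_size.max(1),
            max_size,
            separator: separator.to_string(),
            window: VecDeque::new(),
        }
    }

    /// Adds one token and returns the shingles ending at it, shortest first.
    fn add_token(&mut self, token: &str) -> Vec<String> {
        self.window.push_back(token.to_string());
        // Keep at most `max_size` words in the window.
        if self.window.len() > self.max_size {
            self.window.pop_front();
        }
        let mut out = Vec::new();
        for size in self.min_size..=self.window.len() {
            let start = self.window.len() - size;
            let words: Vec<&str> = self.window.iter().skip(start).map(String::as_str).collect();
            out.push(words.join(self.separator.as_str()));
        }
        out
    }
}
```

Feeding `"the"`, `"quick"`, `"fox"` into `Shingler::new(2, 3, " ")` returns `[]`, `["the quick"]`, and `["quick fox", "the quick fox"]`, in line with the comments in the example.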
Creates bigrams that pair common words with their neighboring tokens, preserving phrase-query capability while reducing the index impact of stop words.
use pizza_analysis_core::CommonGramsTokenFilter;
let mut filter = CommonGramsTokenFilter::new(
    vec!["the".to_string(), "is".to_string(), "a".to_string()]
);
filter.reset();
let r1 = filter.process_token("the"); // None (buffered)
let r2 = filter.process_token("quick"); // Some("the_quick") bigram
let r3 = filter.process_token("fox"); // None (not common)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `words` | `Vec<String>` | required | Common/frequent words |
| `ignore_case` | `bool` | `false` | Case-insensitive |
| `separator` | `String` | `"_"` | Bigram separator |

Creates bigrams from consecutive CJK characters (Han, Hiragana, Katakana, Hangul).
use pizza_analysis_core::CjkBigramTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = CjkBigramTokenFilter::new();
let mut token = Token::new("東京都", 0, 9, 0);
let (_, extra) = filter.filter(&mut token);
// Produces bigrams: "東京", "京都"

| Method | Default | Description |
|---|---|---|
| `.with_output_unigrams(bool)` | `false` | Also output individual CJK chars |
| `.with_han(bool)` | `true` | Include Han (Chinese) characters |
| `.with_hiragana(bool)` | `true` | Include Hiragana |
| `.with_katakana(bool)` | `true` | Include Katakana |
| `.with_hangul(bool)` | `true` | Include Hangul (Korean) |

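Overlapping bigrams are simply a window of width two sliding over the characters. A minimal sketch, assuming the input is already a run of CJK characters (script detection and the per-script switches above are omitted, and `cjk_bigrams` is a hypothetical name):

```rust
/// Returns the overlapping character bigrams of `term`.
/// Hypothetical helper; it does not check which script each character belongs to.
fn cjk_bigrams(term: &str) -> Vec<String> {
    let chars: Vec<char> = term.chars().collect();
    chars
        .windows(2)
        .map(|pair| pair.iter().collect::<String>())
        .collect()
}
```

A three-character term therefore yields two bigrams, and a single character yields none, which is the case the unigram option exists to cover.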
Expands or contracts synonyms. Supports two modes:
- Expand: All synonyms emitted at the same position (for recall)
- Contract: Map multiple forms to a single canonical form (for precision)
use pizza_analysis_core::{SynonymTokenFilter, SynonymMode};
use pizza_engine::analysis::{Token, TokenFilter};
let mut filter = SynonymTokenFilter::new(true); // case-insensitive
// Equivalence group: all terms are interchangeable
filter.add_equivalence(&["fast", "quick", "speedy"]);
// Explicit mapping: "big" → replace with "large"
filter.add_mapping("big", &["large"], SynonymMode::Contract);
// Parse Solr/ES format
filter.parse_rules("
happy, glad, joyful
sad => unhappy
");
let mut token = Token::new("fast", 0, 4, 0);
let (_, extra) = filter.filter(&mut token);
// token = "fast", extra = ["quick", "speedy"] at same position

| Parameter | Type | Default | Description |
|---|---|---|---|
| `ignore_case` | `bool` | required | Case-insensitive matching |

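The difference between the two modes can be illustrated with a plain lookup table. This sketch is not the crate's implementation: `SynonymMap` and its methods are hypothetical, matching is case-sensitive, and multi-word synonyms are not handled.

```rust
use std::collections::HashMap;

/// Toy synonym table (hypothetical type, not the crate's API).
#[derive(Default)]
struct SynonymMap {
    /// term -> terms to emit at the same position; the first entry
    /// takes the slot of the original token.
    rules: HashMap<String, Vec<String>>,
}

impl SynonymMap {
    /// Expand: every member of the group maps to the whole group, itself first.
    fn add_equivalence(&mut self, group: &[&str]) {
        for &term in group {
            let mut out = vec![term.to_string()];
            out.extend(group.iter().filter(|&&t| t != term).map(|t| t.to_string()));
            self.rules.insert(term.to_string(), out);
        }
    }

    /// Contract: `from` is replaced by the single canonical form `to`.
    fn add_contraction(&mut self, from: &str, to: &str) {
        self.rules.insert(from.to_string(), vec![to.to_string()]);
    }

    /// Returns the terms to emit for `term`; unknown terms pass through unchanged.
    fn apply(&self, term: &str) -> Vec<String> {
        self.rules
            .get(term)
            .cloned()
            .unwrap_or_else(|| vec![term.to_string()])
    }
}
```

Expansion grows the index (or the query) to improve recall, while contraction keeps a single canonical term and so must be applied identically at index and search time.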
| Mode | Format | Behavior |
|---|---|---|
| Expand | `"a, b, c"` | Any of a/b/c → emits all three |
| Contract | `"a => b"` | "a" → replaced with "b" |

Emits each token twice: once as a keyword (protected from stemming) and once for normal processing. Used with RemoveDuplicatesTokenFilter downstream.
use pizza_analysis_core::KeywordRepeatTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = KeywordRepeatTokenFilter::new();
let mut token = Token::new("running", 0, 7, 0);
let (_, extra) = filter.filter(&mut token);
// token = "running" (will be stemmed)
// extra = ["running"] at same position (keyword, skip stemming)

Marks specific tokens as keywords to prevent downstream stemming.
use pizza_analysis_core::KeywordMarkerTokenFilter;
let filter = KeywordMarkerTokenFilter::new(
    vec!["iPhone".to_string(), "PlayStation".to_string()]
);

Extracts regex capture groups as additional tokens. Useful for splitting compound patterns.
use pizza_analysis_core::PatternCaptureTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = PatternCaptureTokenFilter::new(
    vec![r"(\d+)-(\w+)"],
    true // preserve original
);
let mut token = Token::new("123-abc", 0, 7, 0);
let (_, extra) = filter.filter(&mut token);
// extra contains "123" and "abc" (capture groups)
// original "123-abc" preserved

| Parameter | Type | Description |
|---|---|---|
| `patterns` | `Vec<&str>` | List of regex patterns with capture groups |
| `preserve_original` | `bool` | Keep original token in stream |

Regex-based find/replace within individual token text.
use pizza_analysis_core::PatternReplaceTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = PatternReplaceTokenFilter::new(r"[_-]", " ").unwrap();
let mut token = Token::new("hello_world-test", 0, 16, 0);
filter.filter(&mut token);
assert_eq!(token.term, "hello world test");
// Replace only first occurrence
let filter = PatternReplaceTokenFilter::new(r"\d+", "N")
    .unwrap()
    .with_replace_all(false);

| Parameter | Type | Default | Description |
|---|---|---|---|
| `pattern` | `&str` | required | Regex pattern |
| `replacement` | `&str` | required | Replacement string |
| `replace_all` | `bool` | `true` | Replace all vs. first only |

Note: Tokens are removed if replacement produces an empty string.
Splits tokens at case transitions, letter/digit boundaries, and delimiter characters.
use pizza_analysis_core::{WordDelimiterTokenFilter, WordDelimiterConfig};
use pizza_engine::analysis::{Token, TokenFilter};
let config = WordDelimiterConfig {
    split_on_case_change: true,
    split_on_numerics: true,
    generate_word_parts: true,
    generate_number_parts: true,
    catenate_words: false,
    catenate_numbers: false,
    preserve_original: false,
};
let filter = WordDelimiterTokenFilter::new(config);
let mut token = Token::new("camelCase", 0, 9, 0);
let (_, extra) = filter.filter(&mut token);
assert_eq!(token.term, "camel");
assert_eq!(extra.unwrap()[0].term, "Case");
let mut token = Token::new("Wi-Fi", 0, 5, 0);
let (_, extra) = filter.filter(&mut token);
// "Wi", "Fi"

| Config Field | Type | Default | Description |
|---|---|---|---|
| `split_on_case_change` | `bool` | `true` | Split camelCase → camel + Case |
| `split_on_numerics` | `bool` | `true` | Split letter-digit boundaries |
| `generate_word_parts` | `bool` | `true` | Output alphabetic sub-parts |
| `generate_number_parts` | `bool` | `true` | Output numeric sub-parts |
| `catenate_words` | `bool` | `false` | Also emit concatenated word parts |
| `catenate_numbers` | `bool` | `false` | Also emit concatenated number parts |
| `preserve_original` | `bool` | `false` | Keep original token |

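The three kinds of boundary (case change, letter/digit transition, delimiter) can be detected in a single pass over the characters. The sketch below is a simplified illustration with all splitting options switched on and no catenation; `split_word_parts` is a hypothetical name, not the crate's API.

```rust
/// Splits `term` at lower-to-upper case changes, letter/digit boundaries,
/// and non-alphanumeric delimiters. Hypothetical helper for illustration.
fn split_word_parts(term: &str) -> Vec<String> {
    let mut parts: Vec<String> = Vec::new();
    let mut current = String::new();
    let mut prev: Option<char> = None;
    for c in term.chars() {
        if !c.is_alphanumeric() {
            // Delimiter: close the current part and drop the delimiter itself.
            if !current.is_empty() {
                parts.push(std::mem::take(&mut current));
            }
            prev = None;
            continue;
        }
        if let Some(p) = prev {
            let case_change = p.is_lowercase() && c.is_uppercase();
            let letter_digit = p.is_alphabetic() != c.is_alphabetic();
            if (case_change || letter_digit) && !current.is_empty() {
                parts.push(std::mem::take(&mut current));
            }
        }
        current.push(c);
        prev = Some(c);
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}
```

Only lower-to-upper transitions count as a case change here, so an all-caps run such as `"HTTP"` stays in one piece.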
Graph-aware version with correct position tracking for phrase queries. Additional options:
| Config Field | Type | Default | Description |
|---|---|---|---|
| `concatenate_all` | `bool` | `false` | Emit concatenation of all parts |
| `stem_english_possessive` | `bool` | `true` | Remove trailing 's |

Splits compound words (common in Germanic languages) using a dictionary of known word parts.
use pizza_analysis_core::DictionaryDecompounderTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let dict = vec![
    "donner".to_string(), "wetter".to_string(),
    "butter".to_string(), "brot".to_string(),
    "schule".to_string(), "kind".to_string(),
];
let filter = DictionaryDecompounderTokenFilter::new(dict)
    .with_min_word_size(5)
    .with_min_subword_size(3);
let mut token = Token::new("donnerwetter", 0, 12, 0);
let (_, extra) = filter.filter(&mut token);
// token = "donnerwetter" (preserved)
// extra = ["donner", "wetter"]

| Method | Default | Description |
|---|---|---|
| `.with_min_word_size(usize)` | `5` | Minimum input word length to attempt decomposition |
| `.with_min_subword_size(usize)` | `2` | Minimum component length |
| `.with_max_subword_size(usize)` | `15` | Maximum component length |
| `.with_only_longest_match(bool)` | `false` | Only emit longest decomposition |

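Dictionary decompounding boils down to testing every substring within the size bounds against the word list. A brute-force sketch under that assumption (`decompound` is a hypothetical function; the minimum word size check and longest-match option are omitted):

```rust
use std::collections::HashSet;

/// Returns every dictionary word found inside `word`, ordered by start position.
/// Hypothetical helper; the real filter may use a different search strategy.
fn decompound(
    word: &str,
    dictionary: &HashSet<&str>,
    min_subword_size: usize,
    max_subword_size: usize,
) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut parts = Vec::new();
    for start in 0..chars.len() {
        for len in min_subword_size.max(1)..=max_subword_size {
            if start + len > chars.len() {
                break;
            }
            let candidate: String = chars[start..start + len].iter().collect();
            if dictionary.contains(candidate.as_str()) {
                parts.push(candidate);
            }
        }
    }
    parts
}
```

Raising the minimum subword size is the usual way to suppress spurious short matches, since short dictionary entries tend to occur inside unrelated words.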
Same as DictionaryDecompounder but uses hyphenation patterns to find possible split points before dictionary lookup.
use pizza_analysis_core::HyphenationDecompounderTokenFilter;
let filter = HyphenationDecompounderTokenFilter::new(
    vec!["butter".to_string(), "brot".to_string()]
);

Same configuration options as DictionaryDecompounderTokenFilter.
Encodes tokens using phonetic algorithms for sound-based matching ("sounds like" search).
use pizza_analysis_core::{PhoneticTokenFilter, PhoneticEncoder};
use pizza_engine::analysis::{Token, TokenFilter};
// Metaphone encoding
let filter = PhoneticTokenFilter::new(PhoneticEncoder::Metaphone(6));
let mut token = Token::new("smith", 0, 5, 0);
filter.filter(&mut token);
assert_eq!(token.term, "SM0"); // phonetic code
// Keep original + emit phonetic as extra token
let filter = PhoneticTokenFilter::new(PhoneticEncoder::Soundex)
.with_replace(false);
let mut token = Token::new("robert", 0, 6, 0);
let (_, extra) = filter.filter(&mut token);
// token = "robert" (original preserved)
// extra = ["R163"] (Soundex code)

| Encoder | Description | Example |
|---|---|---|
| `Metaphone(max_len)` | Standard Metaphone | "smith" → "SM0" |
| `DoubleMetaphone(max_len)` | Two encodings per word | "smith" → "SM0"/"XMT" |
| `Soundex` | Classic 4-char code | "robert" → "R163" |
| `RefinedSoundex` | More granular Soundex | More distinctions |
| `Caverphone1` | NZ English optimized | |
| `Caverphone2` | Updated Caverphone | |
| `ColognePhonetic` | German phonetic | "müller" → "657" |
| `Nysiis` | NY state algorithm | |
| `DaitchMokotoff` | Eastern European names | |

| Parameter | Type | Default | Description |
|---|---|---|---|
| `encoder` | `PhoneticEncoder` | required | Algorithm to use |
| `replace` | `bool` | `true` | Replace original (true) or emit alongside (false) |

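As a concrete instance of phonetic encoding, the classic American Soundex rules fit in a short function. This is a standalone sketch of the well-known algorithm, not the code behind `PhoneticEncoder::Soundex`; the function names are hypothetical and only ASCII letters are considered.

```rust
/// Soundex digit for a lowercase ASCII letter, or None for letters without a code.
fn soundex_digit(c: char) -> Option<char> {
    match c {
        'b' | 'f' | 'p' | 'v' => Some('1'),
        'c' | 'g' | 'j' | 'k' | 'q' | 's' | 'x' | 'z' => Some('2'),
        'd' | 't' => Some('3'),
        'l' => Some('4'),
        'm' | 'n' => Some('5'),
        'r' => Some('6'),
        _ => None, // vowels, h, w, y
    }
}

/// Classic four-character Soundex code; returns None if `word` has no ASCII letters.
fn soundex(word: &str) -> Option<String> {
    let letters: Vec<char> = word
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_lowercase())
        .collect();
    let first = *letters.first()?;
    let mut code = String::new();
    code.push(first.to_ascii_uppercase());
    let mut last = soundex_digit(first);
    for &c in &letters[1..] {
        let digit = soundex_digit(c);
        if let Some(d) = digit {
            // Adjacent letters with the same digit are encoded once.
            if digit != last {
                code.push(d);
                if code.len() == 4 {
                    break;
                }
            }
        }
        // 'h' and 'w' do not separate equal digits; vowels do.
        if c != 'h' && c != 'w' {
            last = digit;
        }
    }
    while code.len() < 4 {
        code.push('0');
    }
    Some(code)
}
```

Names that sound alike collapse to one code, for example "Robert" and "Rupert" both map to `R163`.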
Beider-Morse Phonetic Matching for multi-language surname matching. Generates phonetic representations considering multiple possible language origins.
use pizza_analysis_core::{BeiderMorseFilter, BmNameType, BmRuleType};
use pizza_engine::analysis::{Token, TokenFilter};
let filter = BeiderMorseFilter::new()
    .with_name_type(BmNameType::Generic)
    .with_rule_type(BmRuleType::Approx)
    .with_max_phonemes(10);
let mut token = Token::new("Schmidt", 0, 7, 0);
let (_, extra) = filter.filter(&mut token);
// Multiple phonetic variants for different language origins

| Method | Options | Default | Description |
|---|---|---|---|
| `.with_name_type()` | `Generic`, `Ashkenazi`, `Sephardic` | `Generic` | Name origin type |
| `.with_rule_type()` | `Approx`, `Exact` | `Approx` | Matching strictness |
| `.with_replace(bool)` | | `true` | Replace or emit alongside |
| `.with_max_phonemes(usize)` | | `20` | Max phonetic variants |

Parses and normalizes phone numbers for consistent indexing.
use pizza_analysis_core::PhoneNumberFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = PhoneNumberFilter::new();
// Normalizes various phone formats
// +1 (555) 123-4567 → 15551234567

| Field | Type | Default | Description |
|---|---|---|---|
| `generate_variants` | `bool` | `true` | Generate format variants for matching |

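The normalization in the comment above can be approximated by keeping only the digits. This is a deliberately naive sketch (`normalize_phone` is a hypothetical helper): it does no country-code parsing and generates no format variants.

```rust
/// Keeps only ASCII digits, dropping "+", spaces, parentheses, and dashes.
/// Hypothetical helper; real phone parsing is considerably more involved.
fn normalize_phone(raw: &str) -> String {
    raw.chars().filter(|c| c.is_ascii_digit()).collect()
}
```

Differently punctuated spellings of the same number then index to one term.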
Creates a "fingerprint" of a document by sorting unique tokens and joining them. Useful for near-duplicate detection.
use pizza_analysis_core::FingerprintAccumulator;
// Use FingerprintAccumulator for stream-level fingerprinting
let mut acc = FingerprintAccumulator::new(" ", 1024);
acc.add_token("quick");
acc.add_token("the");
acc.add_token("brown");
acc.add_token("the"); // duplicate, ignored
let fingerprint = acc.finish();
assert_eq!(fingerprint, "brown quick the"); // sorted, deduped

| Method | Default | Description |
|---|---|---|
| `.with_max_output_size(usize)` | `1024` | Maximum fingerprint length |
| `.with_separator(&str)` | `" "` | Token separator |

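Sorting and de-duplicating are both handled by an ordered set, so the whole technique is a few lines. In this sketch `fingerprint` is a hypothetical function, and returning `None` when the result exceeds the size limit is an assumption about how an oversized fingerprint might be treated.

```rust
use std::collections::BTreeSet;

/// Joins the sorted, unique tokens with `separator`.
/// Assumption: an over-long fingerprint is dropped rather than truncated.
fn fingerprint(tokens: &[&str], separator: &str, max_output_size: usize) -> Option<String> {
    // BTreeSet removes duplicates and iterates in sorted order.
    let unique: BTreeSet<&str> = tokens.iter().copied().collect();
    let joined = unique.into_iter().collect::<Vec<_>>().join(separator);
    if joined.len() > max_output_size {
        None
    } else {
        Some(joined)
    }
}
```

Two documents whose tokens differ only in order or repetition produce the same fingerprint, which is what makes it usable as a near-duplicate key.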
Generates MinHash signatures for locality-sensitive hashing (document similarity / near-duplicate detection).
use pizza_analysis_core::MinHashTokenFilter;
let filter = MinHashTokenFilter::new()
    .with_hash_count(1)
    .with_bucket_count(512)
    .with_hash_set_size(1)
    .with_rotation(true);

| Method | Default | Description |
|---|---|---|
| `.with_hash_count(usize)` | `1` | Number of hash functions |
| `.with_bucket_count(usize)` | `512` | Buckets per hash |
| `.with_hash_set_size(usize)` | `1` | Minimum hashes per bucket |
| `.with_rotation(bool)` | `true` | Fill empty buckets from neighbors |

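For readers new to MinHash, the core idea is to keep the minimum hash value of a token set under several hash functions and compare signatures slot by slot. The sketch below uses the textbook variant with one minimum per seeded hash function; it does not reproduce the bucket and rotation scheme configured above, and all function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hashes `token` together with a seed, giving one hash function per seed.
fn seeded_hash(token: &str, seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    token.hash(&mut hasher);
    hasher.finish()
}

/// Signature: the minimum hash of the token set under each seeded function.
fn minhash_signature(tokens: &[&str], hash_count: usize) -> Vec<u64> {
    (0..hash_count as u64)
        .map(|seed| {
            tokens
                .iter()
                .map(|t| seeded_hash(t, seed))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// The fraction of equal slots estimates the Jaccard similarity of the two sets.
fn estimate_similarity(a: &[u64], b: &[u64]) -> f64 {
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}
```

Because only minima are kept, token order and repetition do not affect the signature, and more hash functions give a tighter similarity estimate at the cost of a longer signature.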
Runs tokens through multiple sub-filter chains independently, emitting all variants.
use pizza_analysis_core::{MultiplexerTokenFilter, AsciiFoldingTokenFilter, LowercaseTokenFilter};
use pizza_engine::analysis::TokenFilter;
let filter = MultiplexerTokenFilter::new(vec![
    Box::new(LowercaseTokenFilter::new()),
    Box::new(AsciiFoldingTokenFilter::new()),
]).with_preserve_original(true);
// Token "Résumé" → emits "résumé" (lowercased) + "Resume" (folded) + "Résumé" (original)

| Method | Default | Description |
|---|---|---|
| `.with_preserve_original(bool)` | `true` | Keep original token |

Applies a sub-filter only to tokens matching a predicate.
use pizza_analysis_core::{ConditionalTokenFilter, MinLengthPredicate, LowercaseTokenFilter};
use pizza_engine::analysis::TokenFilter;
// Only lowercase tokens >= 4 characters
let filter = ConditionalTokenFilter::new(
    Box::new(MinLengthPredicate(4)),
    Box::new(LowercaseTokenFilter::new()),
);

Built-in predicates:

| Predicate | Description |
|---|---|
| `MinLengthPredicate(usize)` | Token length ≥ N |
| `MaxLengthPredicate(usize)` | Token length ≤ N |
| `PatternPredicate::new(regex)` | Token matches regex pattern |

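The predicate-plus-sub-filter pattern can be expressed with ordinary closures. This is a conceptual sketch only: `apply_if`, `min_length`, and `max_length` are hypothetical names, and measuring length in characters rather than bytes is an assumption.

```rust
/// Applies `transform` only when `predicate` accepts the term;
/// otherwise the term passes through unchanged.
fn apply_if<P, F>(term: &str, predicate: P, transform: F) -> String
where
    P: Fn(&str) -> bool,
    F: Fn(&str) -> String,
{
    if predicate(term) {
        transform(term)
    } else {
        term.to_string()
    }
}

/// Predicate: term has at least `n` characters (assumed unit: chars, not bytes).
fn min_length(n: usize) -> impl Fn(&str) -> bool {
    move |term: &str| term.chars().count() >= n
}

/// Predicate: term has at most `n` characters.
fn max_length(n: usize) -> impl Fn(&str) -> bool {
    move |term: &str| term.chars().count() <= n
}
```

With `min_length(4)` and a lowercasing transform, `"PIZZA"` becomes `"pizza"` while the short acronym `"USA"` keeps its case.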
Removes tokens based on script type or custom predicate.
use pizza_analysis_core::{PredicateTokenFilter, ScriptType, TokenPredicateType};
// Keep only Latin script tokens
let filter = PredicateTokenFilter::new(TokenPredicateType::ScriptIs(ScriptType::Latin));

Script types: Latin, Cyrillic, Arabic, Devanagari, Han, Hangul, Hiragana, Katakana, Thai, Greek, Hebrew, Other
Flattens a token graph (produced by synonym graph or word delimiter graph filters) into a linear stream suitable for indexing.
use pizza_analysis_core::FlattenGraphTokenFilter;
let filter = FlattenGraphTokenFilter::new();

Extracts payload data from tokens in the format `term|payload`.
use pizza_analysis_core::DelimitedPayloadTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = DelimitedPayloadTokenFilter::new('|');
let mut token = Token::new("pizza|0.95", 0, 10, 0);
filter.filter(&mut token);
assert_eq!(token.term, "pizza");
// Payload "0.95" extracted (stored separately)

Extracts term frequency from tokens in the format `term|freq`.
use pizza_analysis_core::DelimitedTermFreqTokenFilter;
use pizza_engine::analysis::{Token, TokenFilter};
let filter = DelimitedTermFreqTokenFilter::new('|');
let mut token = Token::new("pizza|5", 0, 7, 0);
filter.filter(&mut token);
assert_eq!(token.term, "pizza");
// Term frequency 5 extracted

Pre-composed language analyzers follow Elasticsearch/Lucene conventions. Each is registered by name and can be retrieved from AnalysisFactory.
Snowball stemmers: Some languages (e.g. `polish`, `swedish`, `turkish`, `armenian`, `basque`, `catalan`, `estonian`, `lithuanian`) are registered here as `lowercase + stop` only. To enable full Snowball stemming for these languages, also call `pizza_analysis_stemmers::register_all(&mut factory)` after `pizza_analysis_core::analyzers::register_all`. See `pizza-analysis-stemmers`.
| Analyzer | Pipeline | Use Case |
|---|---|---|
| `keyword` | KeywordTokenizer (no filters) | Exact-match fields |
| `simple` | LetterTokenizer → Lowercase | Basic word splitting |
| `stop` | LetterTokenizer → Lowercase → English Stop | English with stop removal |
| `pattern` | PatternTokenizer (default `\W+`) → Lowercase | Regex-based splitting |
| `fingerprint` | StandardTokenizer → Lowercase → AsciiFolding → Stop → Fingerprint | Deduplication |

Each language analyzer is tuned for its language with appropriate normalization, stemming, and stop word removal:
| Analyzer | Pipeline |
|---|---|
| arabic | Standard → Lowercase → DecimalDigit → ArabicNorm → Stop → ArabicStem |
| bengali | Standard → Lowercase → DecimalDigit → IndicNorm → BengaliNorm → Stop → BengaliStem |
| brazilian | Standard → Lowercase → Stop → BrazilianStem |
| bulgarian | Standard → Lowercase → Stop → BulgarianStem |
| catalan | Standard → Elision(l,d,qu,m,n,s) → Lowercase → Stop |
| cjk | Standard → CjkWidth → Lowercase → CjkBigram → Stop |
| czech | Standard → Lowercase → Stop → CzechStem |
| danish | Standard → Lowercase → Stop → ScandinavianNorm → ScandinavianFolding |
| dutch | Standard → Lowercase → Stop → DutchStem |
| english | Standard → Lowercase → Stop |
| finnish | Standard → Lowercase → Stop → FinnishLightStem |
| french | Standard → Elision(french) → Lowercase → Stop → FrenchLightStem |
| galician | Standard → Lowercase → Stop → GalicianStem |
| german | Standard → Lowercase → Stop → GermanNorm → GermanLightStem |
| greek | Standard → GreekLowercase → Stop → GreekStem |
| hindi | Standard → Lowercase → DecimalDigit → IndicNorm → HindiNorm → Stop → HindiStem |
| hungarian | Standard → Lowercase → Stop → HungarianLightStem |
| indonesian | Standard → Lowercase → Stop → IndonesianStem |
| irish | Standard → IrishElision → IrishLowercase → Stop |
| italian | Standard → Elision(italian) → Lowercase → Stop → ItalianLightStem |
| latvian | Standard → Lowercase → Stop → LatvianStem |
| marathi | Standard → Lowercase → DecimalDigit → IndicNorm → Stop |
| nepali | Standard → Lowercase → DecimalDigit → IndicNorm → Stop |
| norwegian | Standard → Lowercase → Stop → NorwegianLightStem |
| persian | PatternReplace(ZWNJ→space) + Standard → Lowercase → DecimalDigit → ArabicNorm → PersianNorm → Stop |
| portuguese | Standard → Lowercase → Stop → PortugueseLightStem |
| romanian | Standard → Lowercase → Stop → RomanianNorm |
| russian | Standard → Lowercase → Stop → RussianLightStem |
| serbian | Standard → Lowercase → Stop → SerbianNorm |
| sorani | Standard → SoraniNorm → Lowercase → DecimalDigit → Stop |
| spanish | Standard → Lowercase → Stop → SpanishLightStem |
| swedish | Standard → Lowercase → Stop → ScandinavianNorm → ScandinavianFolding |
| tamil | Standard → Lowercase → DecimalDigit → IndicNorm → Stop → TamilStem |
| thai | Thai → Lowercase → DecimalDigit → Stop |
| turkish | Standard → Apostrophe → TurkishLowercase → Stop |
| urdu | Standard → Lowercase → DecimalDigit → IndicNorm → Stop |

These languages have stop word removal but no specialized stemmer available in this crate:
afrikaans, amharic, armenian, azerbaijani, basque, croatian, estonian, filipino, georgian, hebrew, lithuanian, malay, mongolian, polish, slovak, slovenian, swahili, tagalog, ukrainian, vietnamese
Pipeline: Standard → Lowercase → Stop
Tip: For Snowball-based stemming on these languages, add `pizza-analysis-stemmers`, which provides 33 algorithmic stemmers.
Build your own analyzer by combining any normalizers, tokenizer, and filters:
use pizza_analysis_core::*;
use pizza_engine::analysis::{Analyzer, Normalizer, StandardTokenizer, TokenFilter};
// Custom e-commerce analyzer
let normalizers: Vec<Box<dyn Normalizer>> = vec![
    Box::new(HtmlStripNormalizer::new()),
    Box::new(PatternReplaceNormalizer::new(r"\b(SKU|sku):\s*", "")),
];
let filters: Vec<Box<dyn TokenFilter>> = vec![
    Box::new(LowercaseTokenFilter::new()),
    Box::new(AsciiFoldingTokenFilter::new()),
    Box::new(StopTokenFilter::new(
        token_filters::stopwords::get_stop_words("english").unwrap()
    )),
    Box::new(KStemTokenFilter::new()),
    Box::new(LengthTokenFilter::new(2, 50)),
];
let analyzer = Analyzer::new(
    normalizers,
    Box::new(StandardTokenizer::new()),
    filters,
);

Pre-built stop word lists for 57 languages, accessible via the `token_filters::stopwords` module.
| Language | Constant | Language | Constant |
|---|---|---|---|
| Afrikaans | `AFRIKAANS_STOP_WORDS` | Latvian | `LATVIAN_STOP_WORDS` |
| Amharic | `AMHARIC_STOP_WORDS` | Lithuanian | `LITHUANIAN_STOP_WORDS` |
| Arabic | `ARABIC_STOP_WORDS` | Malay | `MALAY_STOP_WORDS` |
| Armenian | `ARMENIAN_STOP_WORDS` | Marathi | `MARATHI_STOP_WORDS` |
| Azerbaijani | `AZERBAIJANI_STOP_WORDS` | Mongolian | `MONGOLIAN_STOP_WORDS` |
| Basque | `BASQUE_STOP_WORDS` | Nepali | `NEPALI_STOP_WORDS` |
| Bengali | `BENGALI_STOP_WORDS` | Norwegian | `NORWEGIAN_STOP_WORDS` |
| Brazilian Portuguese | `BRAZILIAN_STOP_WORDS` | Persian | `PERSIAN_STOP_WORDS` |
| Bulgarian | `BULGARIAN_STOP_WORDS` | Polish | `POLISH_STOP_WORDS` |
| Catalan | `CATALAN_STOP_WORDS` | Portuguese | `PORTUGUESE_STOP_WORDS` |
| Chinese | `CHINESE_STOP_WORDS` | Romanian | `ROMANIAN_STOP_WORDS` |
| CJK (generic) | `CJK_STOP_WORDS` | Russian | `RUSSIAN_STOP_WORDS` |
| Croatian | `CROATIAN_STOP_WORDS` | Serbian | `SERBIAN_STOP_WORDS` |
| Czech | `CZECH_STOP_WORDS` | Slovak | `SLOVAK_STOP_WORDS` |
| Danish | `DANISH_STOP_WORDS` | Slovenian | `SLOVENIAN_STOP_WORDS` |
| Dutch | `DUTCH_STOP_WORDS` | Sorani Kurdish | `SORANI_STOP_WORDS` |
| English | `ENGLISH_STOP_WORDS` | Spanish | `SPANISH_STOP_WORDS` |
| Estonian | `ESTONIAN_STOP_WORDS` | Swahili | `SWAHILI_STOP_WORDS` |
| Filipino | `FILIPINO_STOP_WORDS` | Swedish | `SWEDISH_STOP_WORDS` |
| Finnish | `FINNISH_STOP_WORDS` | Tagalog | `TAGALOG_STOP_WORDS` |
| French | `FRENCH_STOP_WORDS` | Tamil | `TAMIL_STOP_WORDS` |
| Galician | `GALICIAN_STOP_WORDS` | Thai | `THAI_STOP_WORDS` |
| Georgian | `GEORGIAN_STOP_WORDS` | Turkish | `TURKISH_STOP_WORDS` |
| German | `GERMAN_STOP_WORDS` | Ukrainian | `UKRAINIAN_STOP_WORDS` |
| Greek | `GREEK_STOP_WORDS` | Urdu | `URDU_STOP_WORDS` |
| Hebrew | `HEBREW_STOP_WORDS` | Vietnamese | `VIETNAMESE_STOP_WORDS` |
| Hindi | `HINDI_STOP_WORDS` | | |
| Hungarian | `HUNGARIAN_STOP_WORDS` | | |
| Indonesian | `INDONESIAN_STOP_WORDS` | | |
| Irish | `IRISH_STOP_WORDS` | | |
| Italian | `ITALIAN_STOP_WORDS` | | |
| Japanese | `JAPANESE_STOP_WORDS` | | |
| Korean | `KOREAN_STOP_WORDS` | | |

use pizza_analysis_core::token_filters::stopwords::get_stop_words;
// Look up by language name
if let Some(words) = get_stop_words("french") {
    println!("French has {} stop words", words.len());
}
// Also supports underscore-wrapped format (ES-compatible)
let words = get_stop_words("_german_").unwrap();

use pizza_analysis_core::*;
use pizza_engine::analysis::{Normalizer, Tokenizer, Token, TokenFilter};
// 1. Normalize: strip HTML
let normalizer = HtmlStripNormalizer::new();
let mut text = String::from("<p>The Quick Brown Fox</p>");
normalizer.normalize(&mut text);
// text = " The Quick Brown Fox "
// 2. Tokenize
let tokenizer = LetterTokenizer::new();
let mut tokens = tokenizer.tokenize(&text);
// ["The", "Quick", "Brown", "Fox"]
// 3. Lowercase
let lowercase = LowercaseTokenFilter::new();
for token in &mut tokens {
    lowercase.filter(token);
}
// ["the", "quick", "brown", "fox"]
// 4. Remove stop words
let stop_words = token_filters::stopwords::get_stop_words("english").unwrap();
let stop = StopTokenFilter::new(stop_words);
tokens.retain(|token| {
    let mut t = token.clone();
    let (remove, _) = stop.filter(&mut t);
    !remove
});
// ["quick", "brown", "fox"]

use pizza_analysis_core::*;
use pizza_engine::analysis::{StandardTokenizer, Tokenizer, Token, TokenFilter};
let tokenizer = StandardTokenizer::new();
let mut tokens = tokenizer.tokenize("Donaudampfschifffahrtsgesellschaft");
let lowercase = LowercaseTokenFilter::new();
let stop = StopTokenFilter::new(
    token_filters::stopwords::get_stop_words("german").unwrap()
);
let norm = GermanNormalizationTokenFilter::new();
let stem = GermanLightStemTokenFilter::new();
let decomp = DictionaryDecompounderTokenFilter::new(
    vec!["donau".into(), "dampf".into(), "schiff".into(),
         "fahrt".into(), "gesellschaft".into()]
).with_min_subword_size(4);
for token in &mut tokens {
    lowercase.filter(token);
    norm.filter(token);
    stem.filter(token);
}
// Tokens processed through German normalization + stemming
// Decompounding produces sub-words for compound terms

use pizza_analysis_core::*;
use pizza_engine::analysis::{Analyzer, StandardTokenizer, TokenFilter, Normalizer};
// Index-time analyzer: generate prefixes
let index_analyzer = Analyzer::new(
    vec![],
    Box::new(StandardTokenizer::new()),
    vec![
        Box::new(LowercaseTokenFilter::new()),
        Box::new(EdgeNgramTokenFilter::new(2, 15)),
    ],
);
// Search-time analyzer: just lowercase (no edge-grams)
let search_analyzer = Analyzer::new(
    vec![],
    Box::new(StandardTokenizer::new()),
    vec![
        Box::new(LowercaseTokenFilter::new()),
    ],
);

use pizza_analysis_core::*;
use pizza_engine::analysis::{Analyzer, StandardTokenizer, TokenFilter};
let analyzer = Analyzer::new(
    vec![],
    Box::new(StandardTokenizer::new()),
    vec![
        Box::new(LowercaseTokenFilter::new()),
        Box::new(PhoneticTokenFilter::new(PhoneticEncoder::DoubleMetaphone(6))
            .with_replace(false)), // Keep original + phonetic
    ],
);
// "Stephen" → ["stephen", "STFN"] (both indexed at same position)
// Matches queries for "Steven", "Stefan", etc.

use pizza_analysis_core::*;
use pizza_engine::analysis::{Analyzer, StandardTokenizer, TokenFilter};
let mut synonyms = SynonymTokenFilter::new(true); // case-insensitive
synonyms.add_equivalence(&["laptop", "notebook", "portable computer"]);
synonyms.add_mapping("ny", &["new york"], SynonymMode::Expand);
let analyzer = Analyzer::new(
    vec![],
    Box::new(StandardTokenizer::new()),
    vec![
        Box::new(LowercaseTokenFilter::new()),
        Box::new(synonyms),
        Box::new(StopTokenFilter::new(
            token_filters::stopwords::get_stop_words("english").unwrap()
        )),
    ],
);

| Feature | Default | Description |
|---|---|---|
| `std` | ✓ | Standard library support |
| (none) | | `no_std` compatible (uses `alloc` only) |

| Crate | Description |
|---|---|
| `pizza-engine` | Core engine: `Normalizer`, `Tokenizer`, `TokenFilter`, `Analyzer` traits |
| `pizza-analysis-all` | Auto-generated meta-crate: one `register_all()` that wires every discovered plugin |
| `pizza-plugin-discovery` | CLI tool that scans contrib crates and (re-)generates `pizza-analysis-all` |
| `pizza-analysis-stemmers` | Snowball stemming algorithms (33 languages) |
| `pizza-analysis-ik` | IK Chinese segmenter (smart/max_word modes) |
| `pizza-analysis-jieba` | Jieba Chinese segmenter |
| `pizza-analysis-pinyin` | Chinese Pinyin tokenizer + filter |
| `pizza-analysis-stconvert` | Simplified ↔ Traditional Chinese conversion |

MIT