I noticed that the word-breaking algorithm you are using in `zspell::dictionary::Dictionary::check` is `stringmetrics::tokenizers::split_whitespace_remove_punc`. In the code, this is in `dictionary.rs`.
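For context, a whitespace-plus-punctuation-stripping tokenizer behaves roughly like the hypothetical sketch below. This is an illustration of the general approach, not the actual `stringmetrics` code, and it shows where that approach falls short:

```rust
// Hypothetical sketch (not the actual `stringmetrics` implementation) of
// whitespace splitting plus punctuation stripping:
fn naive_tokens(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|w| w.trim_matches(|c: char| c.is_ascii_punctuation()).to_string())
        .filter(|w| !w.is_empty())
        .collect()
}

fn main() {
    // Fine for simple ASCII prose:
    assert_eq!(naive_tokens("Hello, world!"), ["Hello", "world"]);
    // But scripts written without spaces never get split at all:
    assert_eq!(naive_tokens("日本語のテキスト"), ["日本語のテキスト"]);
    // And non-ASCII punctuation survives: the guillemets are kept.
    assert_eq!(naive_tokens("«quoted»"), ["«quoted»"]);
    println!("all assertions passed");
}
```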
Word-breaking is actually pretty complex in Unicode, and the best implementation I know of in Rust is `unicode-segmentation`. It implements the trait `unicode_segmentation::UnicodeSegmentation` for `str`, following Unicode Standard Annex #29 (see the sketch below for what its output looks like).

Would you consider changing the algorithm and documenting how you split the words?
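A minimal sketch of tokenizing with that crate, assuming it has been added to `Cargo.toml`; the input string is modeled on the crate's own documentation example:

```rust
// UAX #29 word segmentation via the unicode-segmentation crate.
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let text = "The quick (\"brown\") fox can't jump 32.3 feet, right?";

    // `unicode_words()` yields only the word-like segments, with
    // punctuation and whitespace already dropped, per UAX #29.
    let words: Vec<&str> = text.unicode_words().collect();
    assert_eq!(
        words,
        ["The", "quick", "brown", "fox", "can't", "jump", "32.3", "feet", "right"]
    );
    println!("{words:?}");
}
```

Note that it keeps `can't` intact and treats `32.3` as a single token, which whitespace splitting plus punctuation stripping cannot do reliably.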
Whenever ICU4X is ready (the Rust crate for ICU from the Unicode Consortium), we could even consider locale-aware word-breaking rules. That would give us a full Unicode word-breaking algorithm.
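For a rough idea of the shape this could take, here is a sketch based on the `icu_segmenter` crate's published API; the constructor names and options are still evolving, so treat this as an assumption rather than a stable interface:

```rust
// Rough sketch of locale-aware word segmentation with ICU4X.
// Assumes the icu_segmenter crate roughly as published in its 1.x
// releases; the API may differ in later versions.
use icu_segmenter::WordSegmenter;

fn main() {
    // `new_auto()` selects dictionary- or model-based segmentation for
    // scripts (Chinese, Japanese, Thai, ...) that are written without spaces.
    let segmenter = WordSegmenter::new_auto();

    let text = "ภาษาไทยง่ายนิดเดียว"; // Thai: no spaces between words
    // `segment_str` yields the byte index of every word boundary.
    let breakpoints: Vec<usize> = segmenter.segment_str(text).collect();
    println!("{breakpoints:?}");
}
```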