[Improvement] Word breaking algorithm #23

saona-raimundo · 2022-10-27T10:02:59Z

I noticed that the word-breaking algorithm you are using in zspell::dictionary::Dictionary::check is stringmetrics::tokenizers::split_whitespace_remove_punc.
In the code, this is in dictionary.rs.

Word-breaking is actually pretty complex in Unicode, and the best option I know of in Rust is unicode-segmentation. It implements the trait unicode_segmentation::UnicodeSegmentation for str, following Unicode Standard Annex #29.
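To make the concern concrete, here is a minimal sketch of a whitespace-based splitter. The function naive_words is a hypothetical stand-in written with the standard library only; it approximates the idea of "split on whitespace, trim punctuation" and is not the actual stringmetrics implementation, whose exact behavior may differ.

```rust
/// Hypothetical stand-in for a "split on whitespace, strip punctuation" tokenizer.
/// This is NOT the stringmetrics code, only an approximation for illustration.
fn naive_words(text: &str) -> Vec<&str> {
    text.split_whitespace()
        // Trim ASCII punctuation from both ends of each whitespace-delimited chunk.
        .map(|chunk| chunk.trim_matches(|c: char| c.is_ascii_punctuation()))
        // Drop chunks that consisted of punctuation only.
        .filter(|word| !word.is_empty())
        .collect()
}

fn main() {
    // Plain English text with ASCII punctuation splits as expected.
    println!("{:?}", naive_words("Hello, world!"));
    // Words separated by punctuation but no whitespace stay glued together.
    println!("{:?}", naive_words("red/green,blue"));
    // Non-ASCII punctuation (here, curly quotes) is not stripped.
    println!("{:?}", naive_words("“quoted” text"));
}
```

The last two inputs come back as "red/green,blue" and "“quoted”", neither of which is a dictionary word. A UAX #29 segmenter, such as the unicode_words method provided by the UnicodeSegmentation trait, decides boundaries from Unicode character properties rather than from whitespace alone.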

Would you consider changing the algorithm and documenting how you split the words?

Once ICU4X is ready (the Rust crate for ICU from the Unicode Consortium), we could even consider locale-aware word-breaking rules. That would be a full Unicode word-breaking algorithm.


tgross35 · 2022-10-27T16:06:13Z

Absolutely! Thank you for bringing this to my attention. I'll add it to the to-do list.

tgross35 · 2022-11-04T10:44:32Z

Awesome suggestion 👍 It also came with a nice 20% speed boost:

Spellcheck: 188 word paragraph                                                                             
                        time:   [50.602 µs 50.869 µs 51.235 µs]
                        change: [-21.704% -20.700% -19.645%] (p = 0.00 < 0.05)
                        Performance has improved.

Change is in #25

tgross35 mentioned this issue Nov 4, 2022

Update word breaking to use unicode segmentation #25

Merged

tgross35 closed this as completed in #25 Nov 4, 2022
