Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Improvement] Word breaking algorithm #23

Closed
saona-raimundo opened this issue Oct 27, 2022 · 2 comments · Fixed by #25
Closed

[Improvement] Word breaking algorithm #23

saona-raimundo opened this issue Oct 27, 2022 · 2 comments · Fixed by #25

Comments

@saona-raimundo
Copy link

I noticed that the word-breaking algorithm you are using in zspell::dictionary::Dictionary::check is stringmetrics::tokenizers::split_whitespace_remove_punc.
In the code, this is in dictionary.rs.

Word-breaking is actually pretty complex in Unicode, and the best I know of in Rust is unicode-segmentation it implements a trait unicode_segmentation::UnicodeSegmentation for str, following the Unicode Standard Annex #29.

Would you consider changing the algorithm and documenting how you split the words?

Whenever ICU4X is ready (Rust crate for ICU from the Unicode consortium), we could even consider locale-aware word-breaking rules. This would be a full Unicode word-breaking algorithm.

@tgross35
Copy link
Contributor

Absolutely! Thank you for bringing this to my attention, I'll add it to the todo list

@tgross35
Copy link
Contributor

tgross35 commented Nov 4, 2022

Awesome suggestion 👍 also came with a nice 20% speed boost

Spellcheck: 188 word paragraph                                                                             
                        time:   [50.602 µs 50.869 µs 51.235 µs]
                        change: [-21.704% -20.700% -19.645%] (p = 0.00 < 0.05)
                        Performance has improved.

Change is in #25

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants