Use Unicode line breaking algorithm to find words #313

Merged: mgeisler merged 1 commit into master from unicode-line-breaks on May 2, 2021

@mgeisler (Owner) commented Apr 8, 2021

This adds a new optional dependency on the unicode-linebreak crate which implements the line breaking algorithm from Unicode Standard Annex #14.

The new dependency is enabled by default since these line breaks are more correct than what you get by splitting on whitespace.

This should help address #220 and #80, though I’m no expert on non-Western languages. More feedback from the community would be needed here.
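To illustrate why splitting on whitespace falls short, here is a minimal, std-only sketch of that older approach. The function name `split_on_ascii_space` is made up for this example and is not the textwrap API; it only shows the behaviour the description refers to.

```rust
/// Hypothetical helper, not part of textwrap: find "words" by splitting
/// at ASCII spaces only. Each word keeps its trailing spaces.
fn split_on_ascii_space(text: &str) -> Vec<&str> {
    let mut words = Vec::new();
    let mut start = 0;
    let mut in_space = false;
    for (idx, ch) in text.char_indices() {
        if in_space && ch != ' ' {
            // The first non-space character after a run of spaces
            // starts a new word.
            words.push(&text[start..idx]);
            start = idx;
        }
        in_space = ch == ' ';
    }
    if start < text.len() {
        // Whatever is left after the last break is the final word.
        words.push(&text[start..]);
    }
    words
}
```

With this approach, text written without spaces (for example a run of CJK ideographs) comes back as one long word, so a wrapping algorithm has no place to break the line.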

@mgeisler force-pushed the unicode-line-breaks branch 3 times, most recently from fdfa47f to 500978b, on April 14, 2021 22:02
@mgeisler (Owner, Author) commented

I'll need to make a small detour before I can continue with this PR: the `?Sized` bound for the `WordSplitter` trait cannot be generalized since we can only have one unsized field in a struct. I'll rework this first and then we can continue with this PR.
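For readers unfamiliar with the restriction, a minimal sketch of it follows. The `Splitter` trait, `NoSplit` type, and `Options` struct below are simplified stand-ins invented for this example; they are not the actual textwrap definitions.

```rust
/// Hypothetical stand-in for a splitter trait (not the textwrap trait).
trait Splitter {
    fn split_points(&self, word: &str) -> Vec<usize>;
}

/// A splitter that never splits a word.
struct NoSplit;

impl Splitter for NoSplit {
    fn split_points(&self, _word: &str) -> Vec<usize> {
        Vec::new()
    }
}

/// One `?Sized` type parameter works, as long as it is used for the
/// last field of the struct.
struct Options<S: ?Sized> {
    width: usize,
    splitter: S,
}

// A second unsized field is rejected by the compiler, because only the
// last field of a struct may be dynamically sized:
//
// struct TwoUnsized<A: ?Sized, B: ?Sized> {
//     first: A, // error: the size of `A` is not known at compile time
//     second: B,
// }

/// The sized `Options<NoSplit>` coerces to `Options<dyn Splitter>`
/// behind a `Box`, which is what the single `?Sized` bound allows.
fn boxed_options(width: usize) -> Box<Options<dyn Splitter>> {
    Box::new(Options { width, splitter: NoSplit })
}
```

This is why adding a second trait-object field next to the splitter does not generalize: the same trick cannot be repeated for another field.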

@mgeisler (Owner, Author) commented May 2, 2021

> I'll need to make a small detour before I can continue with this PR: the `?Sized` bound for the `WordSplitter` trait cannot be generalized since we can only have one unsized field in a struct. I'll rework this first and then we can continue with this PR.

The detour has been completed with #331 and the WordSeparator trait introduced in #332.

This adds a new optional dependency on the unicode-linebreak crate,
which implements the line breaking algorithm from [Unicode Standard
Annex #14](https://www.unicode.org/reports/tr14/). We can use this to
find words in non-ASCII text.

The new dependency is enabled by default since these line breaks are
more correct than what you get by splitting on ASCII space.

This should help address #220 and #80, though I’m no expert on
non-Western languages. More feedback from the community would be
needed here.
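
As a rough sketch of the idea, finding words from line break opportunities amounts to slicing the text at the byte offsets where a break is allowed. The helper `words_from_breaks` is hypothetical, and the offsets passed to it are hand-written stand-ins for what a UAX #14 implementation would report; the sketch does not call the unicode-linebreak crate.

```rust
/// Hypothetical helper: cut `text` into words at the given byte offsets.
/// `breaks` is assumed to hold break opportunities in ascending order.
fn words_from_breaks<'a>(text: &'a str, breaks: &[usize]) -> Vec<&'a str> {
    let mut words = Vec::new();
    let mut start = 0;
    for &end in breaks {
        // Ignore offsets that are out of order, out of range, or that
        // would cut a multi-byte character in half.
        if end > start && end <= text.len() && text.is_char_boundary(end) {
            words.push(&text[start..end]);
            start = end;
        }
    }
    if start < text.len() {
        // Keep any text after the last reported break.
        words.push(&text[start..]);
    }
    words
}
```

Under UAX #14 a break is allowed between adjacent ideographs, so for `"你好世界"` (three bytes per character) the offsets 3, 6, 9 and 12 yield four one-character words, where splitting on ASCII space would yield a single word.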
@mgeisler changed the title from "Find line breaks according to the Unicode line breaking algorithm" to "Use Unicode line breaking algorithm to find words" on May 2, 2021
@mgeisler merged commit 01fa58b into master on May 2, 2021
@mgeisler deleted the unicode-line-breaks branch on May 2, 2021 17:49
This was referenced May 30, 2021
@mgeisler mgeisler mentioned this pull request Feb 26, 2022