Use Unicode line breaking algorithm to find words #313

mgeisler · 2021-04-08T22:10:44Z

This adds a new optional dependency on the unicode-linebreak crate which implements the line breaking algorithm from Unicode Standard Annex #14.

The new dependency is enabled by default since these line breaks are more correct than what you get by splitting on whitespace.

This should help address #220 and #80, though I’m no expert on non-Western languages. More feedback from the community would be needed here.

mgeisler · 2021-04-18T11:20:24Z

I'll need to make a small detour before I can continue with this PR — the ?Sized bound for the WordSplitter trait cannot be generalized since we can only have one unsized field in a struct. I'll rework this first and then we can continue with this PR.
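To illustrate the restriction mentioned above, here is a minimal sketch of a struct with a single `?Sized` type parameter. The `Options` struct, the `NoHyphenation` type, the `name` method, and the `describe` function are simplified stand-ins invented for this example; they are not the actual textwrap definitions. The key assumption is the Rust rule that only the last field of a struct may have a dynamically sized type, so a second trait-object-style parameter cannot be added the same way.

```rust
// Simplified stand-in for a splitter trait (hypothetical, not textwrap's API).
trait WordSplitter {
    fn name(&self) -> &'static str;
}

struct NoHyphenation;

impl WordSplitter for NoHyphenation {
    fn name(&self) -> &'static str {
        "none"
    }
}

// `S: ?Sized` lets `Options<dyn WordSplitter>` exist, but only because
// `splitter` is the last field. A second `?Sized` parameter would need its
// own unsized field, and a struct can have at most one.
struct Options<S: ?Sized> {
    width: usize,
    splitter: S,
}

// Accepts any splitter through a single dynamically sized type.
fn describe(options: &Options<dyn WordSplitter>) -> String {
    format!("width={}, splitter={}", options.width, options.splitter.name())
}

fn main() {
    let options = Options { width: 80, splitter: NoHyphenation };
    // `&Options<NoHyphenation>` coerces to `&Options<dyn WordSplitter>`.
    println!("{}", describe(&options));
}
```

Adding a second field such as `separator: T` with `T: ?Sized` would be rejected by the compiler, which is why a different design is needed once more than one pluggable trait is involved.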

mgeisler · 2021-05-02T17:23:15Z

The detour has been completed with #331 and the WordSeparator trait introduced in #332.

This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standard Annex #14](https://www.unicode.org/reports/tr14/). We can use this to find words in non-ASCII text. The new dependency is enabled by default since these line breaks are more correct than what you get by splitting on ASCII space. This should help address #220 and #80, though I’m no expert on non-Western languages. More feedback from the community would be needed here.
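As a rough illustration of why splitting on ASCII space falls short, the sketch below uses only the standard library. The function `find_words_ascii_space` is a hypothetical helper written for this example, not the code in this PR or in the unicode-linebreak crate. It keeps trailing spaces attached to each word and shows that text written without spaces comes back as one unbreakable word.

```rust
/// Hypothetical helper: split `text` into words at ASCII spaces,
/// keeping any trailing spaces attached to the preceding word.
fn find_words_ascii_space(text: &str) -> Vec<&str> {
    let mut words = Vec::new();
    let mut start = 0;
    let mut in_whitespace = false;
    for (idx, ch) in text.char_indices() {
        // A new word starts at the first non-space after a run of spaces.
        if in_whitespace && ch != ' ' {
            words.push(&text[start..idx]);
            start = idx;
        }
        in_whitespace = ch == ' ';
    }
    if start < text.len() {
        words.push(&text[start..]);
    }
    words
}

fn main() {
    // Space-separated text is split as expected.
    println!("{:?}", find_words_ascii_space("Hello  World!"));
    // Text without ASCII spaces yields a single word, so a wrapping
    // algorithm built on this has no place to break the line.
    println!("{:?}", find_words_ascii_space("你好世界"));
}
```

The Unicode line breaking algorithm instead reports break opportunities based on character properties, so it can offer break points inside text like the second example.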

mgeisler mentioned this pull request Apr 9, 2021

Does not work for languages without word separators #220

Closed

mgeisler force-pushed the unicode-line-breaks branch 3 times, most recently from fdfa47f to 500978b Compare April 14, 2021 22:02

mgeisler force-pushed the unicode-line-breaks branch from 500978b to 6c5220b Compare May 2, 2021 17:07

mgeisler mentioned this pull request May 2, 2021

Add fast path to UnicodeBreakProperties::find_words #333

Open

mgeisler force-pushed the unicode-line-breaks branch from 6c5220b to bbecb5a Compare May 2, 2021 17:32

mgeisler force-pushed the unicode-line-breaks branch from bbecb5a to ecbbde4 Compare May 2, 2021 17:40

mgeisler changed the title ~~Find line breaks according to the Unicode line breaking algorithm~~ Use Unicode line breaking algorithm to find words May 2, 2021

mgeisler merged commit 01fa58b into master May 2, 2021

mgeisler deleted the unicode-line-breaks branch May 2, 2021 17:49

mgeisler mentioned this pull request May 2, 2021

Investigate how textwrap works for East Asian languages #80

Closed

This was referenced May 30, 2021

Release 0.14.0 #372

Closed

Release 0.14.0 #373

Merged

mgeisler mentioned this pull request Feb 26, 2022

Unicode whitespaces #306

Closed
