This repository has been archived by the owner on Aug 8, 2023. It is now read-only.

Support language dictionary-based line breaking #7362

Closed
ChrisLoer opened this issue Dec 9, 2016 · 7 comments
Labels: archived (Archived because of inactivity), Core (The cross-platform C++ core, aka mbgl), text rendering

Comments

@ChrisLoer
Contributor

We don't currently support line breaking at all for languages like Khmer, Lao, or Thai that don't use spaces. We support line breaking for the CJK languages by breaking in between characters whenever we need to make a break, but this is sub-optimal if it breaks in the middle of a word made out of multiple characters.

We can do better by including line breaking dictionaries that tell us where we can best break apart words in these languages. The challenge is that these dictionaries are large (the ICU line breaking dictionaries take up a few megabytes), so we really don't want to pull them in as a dependency.

With gl-native, however, I think we can avoid pulling in the dependency by using the line breaking support built into our host platforms. For Android, we can use BreakIterator. For iOS, we can use NSAttributedString::lineBreak. For Qt, we can use QTextBoundaryFinder.

cc @1ec5 @nickidlugash @ansis

@ChrisLoer ChrisLoer self-assigned this Dec 9, 2016
@1ec5
Contributor

1ec5 commented Dec 9, 2016

> For iOS, we can use NSAttributedString::lineBreak.

-[NSAttributedString lineBreakBeforeIndex:withinRange:] is only available on macOS, not iOS. A lower-level, cross-platform implementation would create a CFStringTokenizerRef using kCFStringTokenizerUnitLineBreak. However, I don’t think either API will detect the appropriate line breaking dictionary to apply. CFStringTokenizerCreate() lets you specify a language code, defaulting to the current locale (so if the current locale isn’t Thai, no line breaking would happen within a Thai sentence).

It sounds like, in the absence of something like mapbox/DEPRECATED-mapbox-gl#21, you’re looking for a universal line breaking facility that is nonetheless locale-aware. Unfortunately, line breaking, like hyphenation, varies by language rather than by script.

@1ec5
Contributor

1ec5 commented Dec 9, 2016

Apparently -[NSAttributedString lineBreakBeforeIndex:withinRange:] does line break within a Thai sentence even when the locale is English. So perhaps CFStringTokenizer does as well.

@ChrisLoer
Contributor Author

It occurs to me we could also approach this on the data side instead of the client side. We could run a line breaking dictionary against all of our labels and insert zero-width spaces (or maybe some other code point, we'd have to make sure not to mess up shaping) at all potential line breaks. For ideographic text, we could still allow the current breaking behavior, but we'd give it a slight penalty so as to favor word-aligned breaks. We'd have to give a tool to customers providing their own data to allow them to insert the same potential-break metadata... but on the other hand, the solution would be portable across gl-js and gl-native without requiring any dictionary downloads.
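A minimal sketch of the data-side idea above, assuming the labels have already been segmented into words by some dictionary-based tool (the segmentation itself is not shown). The function names, the penalty value, and the rough ideograph check are all hypothetical and only illustrate the shape of the approach:

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, used here as the potential-break marker

# Hypothetical penalty for breaking between two ideographs without a marker.
IDEOGRAPHIC_BREAK_PENALTY = 1


def insert_break_hints(words):
    """Join pre-segmented words with ZWSP so a client can break between them."""
    return ZWSP.join(words)


def is_ideographic(ch):
    # Assumption for this sketch: only the CJK Unified Ideographs block counts.
    return "\u4e00" <= ch <= "\u9fff"


def break_candidates(label):
    """Return (index, penalty) pairs, where index is where a new line could start."""
    candidates = []
    for i, ch in enumerate(label):
        if ch == ZWSP:
            # Word-aligned break supplied by the data: no penalty.
            candidates.append((i + 1, 0))
        elif i + 1 < len(label) and is_ideographic(ch) and is_ideographic(label[i + 1]):
            # Current behavior (break between any two ideographs), slightly penalized.
            candidates.append((i + 1, IDEOGRAPHIC_BREAK_PENALTY))
    return candidates
```

A line breaker could then prefer the zero-penalty candidates and fall back to the penalized ones only when no word-aligned break fits.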

@mb12

mb12 commented Dec 10, 2016

@ChrisLoer A change that involves encoding the labels like this should be part of the Mapbox Vector Tile spec. For example, geometry commands are bitwise encoded and coordinates are zigzag encoded, and all of this is clearly documented in the vector tile spec (and also implemented in the Mapnik vector tile writer, among other writers available on GitHub).
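For readers unfamiliar with the encodings mentioned above, here is a small sketch of how the vector tile spec packs geometry commands and zigzag-encodes 32-bit signed parameters. The function names are made up for illustration; only the bit layout follows the spec:

```python
def command_integer(command_id, count):
    """Pack a command id (low 3 bits) and a repeat count (remaining bits)."""
    return (command_id & 0x7) | (count << 3)


def zigzag_encode(value):
    """Map a signed 32-bit value to an unsigned one so small magnitudes stay small."""
    return (value << 1) ^ (value >> 31)


def zigzag_decode(encoded):
    """Inverse of zigzag_encode."""
    return (encoded >> 1) ^ -(encoded & 1)
```

With the spec's command ids (MoveTo = 1, LineTo = 2, ClosePath = 7), a single MoveTo encodes as 9 and three LineTo commands encode as 26.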

@1ec5
Contributor

1ec5 commented Jan 27, 2017

> We could run a line breaking dictionary against all of our labels and insert zero-width spaces (or maybe some other code point, we'd have to make sure not to mess up shaping) at all potential line breaks.

The soft hyphen was designed for this purpose, and in fact it’s fairly common in Thai and Khmer text. We implemented support for breaking at soft hyphens in #2598.

I think it would be perfectly reasonable for a source such as Mapbox Streets to insert soft hyphens into names written in non-space-delimited languages. This would lessen the need for a “thin” client like GL JS to bundle a Thai or Khmer word list. I would suspect (but am uncertain) that separating multisyllabic and compound words using soft hyphens would also be acceptable in Japanese and Chinese text, respectively.

On the other hand, we’ll have to decide whether it makes sense to consider the soft hyphens when filtering features or returning feature querying results. mapbox/mapbox-gl-style-spec#548 could allow the style author to choose explicitly.
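One possible way to handle the filtering question, sketched under the assumption that break hints should be ignored when comparing names. The helper names are hypothetical and this is not how any existing filter is implemented:

```python
# Invisible break hints that a data source might insert into names.
BREAK_HINTS = {"\u00ad", "\u200b"}  # SOFT HYPHEN, ZERO WIDTH SPACE


def strip_break_hints(text):
    """Remove break hints so comparisons see only the visible characters."""
    return "".join(ch for ch in text if ch not in BREAK_HINTS)


def name_matches(feature_name, query):
    """Compare a feature name to a query while ignoring break hints in both."""
    return strip_break_hints(feature_name) == strip_break_hints(query)
```

The same normalization could be applied to the names returned from feature queries if the style author opts in.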

> A change that involves encoding the labels like this should be part of the mapbox vector tile spec.

The vector tile specification shouldn’t need to concern itself with the use of soft hyphens. After all, it doesn’t even specify a character set.

@1ec5
Contributor

1ec5 commented Jan 27, 2017

Actually, a soft hyphen isn’t quite what we want, since it’s rendered as a hyphen when taken into account. ZWSP sounds good in that case.
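To make the difference concrete, here is a rough sketch of what a renderer would show when breaking at each kind of hint. `break_at` and `visible` are hypothetical helpers, not part of gl-native or GL JS:

```python
SHY = "\u00ad"   # SOFT HYPHEN: shown as a hyphen only when a break is taken there
ZWSP = "\u200b"  # ZERO WIDTH SPACE: never shown, even when a break is taken there


def visible(text):
    """Drop hints where no break was taken; they are invisible in that case."""
    return text.replace(SHY, "").replace(ZWSP, "")


def break_at(label, i):
    """Break the label at the hint character at index i; return the two rendered lines."""
    hint = label[i]
    if hint not in (SHY, ZWSP):
        raise ValueError("no break hint at this index")
    first = visible(label[:i])
    second = visible(label[i + 1:])
    if hint == SHY:
        first += "-"  # a taken soft hyphen is rendered as a visible hyphen
    return first, second
```

Breaking `"ab\u00adcd"` at index 2 yields the lines `ab-` and `cd`, while breaking `"ab\u200bcd"` at the same index yields `ab` and `cd`, which is the behavior wanted for scripts that do not hyphenate.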

@kkaefer kkaefer added the Core The cross-platform C++ core, aka mbgl label May 9, 2017
@stale stale bot added the archived Archived because of inactivity label Nov 7, 2018
@stale

stale bot commented Nov 27, 2018

This issue has been automatically detected as stale because it has not had recent activity and will be archived. Thank you for your contributions.

@stale stale bot closed this as completed Nov 27, 2018
4 participants