Index by grapheme #2458

Closed
wants to merge 15 commits into from

wipfli
Member

@wipfli wipfli commented Apr 29, 2023

Sharing this code here because I really like the idea of having better text support in MapLibre GL JS. It is a draft and for inspiration only at this point...

Demo: https://github.com/wipfli/index-by-grapheme

How does it work?

  • Always use the TinySDF code path in MapLibre GL JS (rasterize glyphs in the client)
  • Assume that the text in the tiles has been marked such that characters that should be treated as a cluster are joined by an @ mark
    • "Hallo" -> ["H", "a", "l", "l", "o"]
    • "H@all@o" -> ["Ha", "l", "lo"]
    • "H@a@llo" -> ["Hal", "l", "o"]
  • Use clusters to index the glyph atlas rather than Unicode code points
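
The marking scheme in the list above can be sketched as a small parser. This is a minimal illustration, not the code from the PR branch: the function name `parseClusters` and its `joiner` parameter are hypothetical, and it assumes that an @ joins the character before it with the character after it, which is what the three examples imply.

```javascript
// Hypothetical helper (not from the PR): split a marked-up label into clusters.
// Assumption: the joiner glues the previous character to the next one.
function parseClusters(marked, joiner = '@') {
  const clusters = [];
  let joinNext = false;
  // for...of iterates by code point, so surrogate pairs stay intact.
  for (const ch of marked) {
    if (ch === joiner) {
      joinNext = true;
    } else if (joinNext && clusters.length > 0) {
      // Extend the current cluster instead of starting a new one.
      clusters[clusters.length - 1] += ch;
      joinNext = false;
    } else {
      clusters.push(ch);
      joinNext = false;
    }
  }
  return clusters;
}

console.log(parseClusters('H@all@o')); // [ 'Ha', 'l', 'lo' ]
```

Each resulting string, rather than each code point, would then be the unit that gets rasterized and looked up in the glyph atlas.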

What can it do?

It can render complex text on point labels and along lines.

Here are some languages:

Here are some cool cities:

And more:

What can it not do?

I don't know. Feel free to give some feedback if stuff does not work in your language...

Right-to-left languages like Hebrew and Arabic are not handled correctly.

  • Confirm your changes do not include backports from Mapbox projects (unless with compliant license) - if you are not sure about this, please ask!
  • Briefly describe the changes in this PR.
  • Link to related issues.
  • Include before/after visuals or gifs if this PR includes visual changes.
  • Write tests for all new functionality.
  • Document any changes to public APIs.
  • Post benchmark scores.
  • Add an entry to CHANGELOG.md under the ## main section.

@wipfli wipfli marked this pull request as draft April 29, 2023 21:16
@wipfli
Member Author

wipfli commented Apr 29, 2023

The Bengali label should be ক্রিসেন্ট লেক (the diacritics have broken continuity; thanks @thehoneymad for the pointer)

@wipfli
Member Author

wipfli commented Apr 29, 2023

The Khmer label in the image is wrong. It should be ក្រុងសៀមរាប

@ramSeraph

ramSeraph commented Apr 30, 2023

Screenshot 2023-04-30 at 10 14 25 AM

For Telugu, I am seeing clipping. The label should be చుండూరు

@maxammann
Contributor

Looks very promising!

So has TinySDF already been used in maplibre-gl-js for a long time to go from fonts to SDFs? Where does the font data come from in this case?

It does not work in Firefox yet, as it uses a Chrome-only API right now. It can probably be polyfilled, though.

Object { message: "Intl.Segmenter is not a constructor", stack: "" }

@wipfli
Member Author

wipfli commented Apr 30, 2023

TinySDF is used to render CJK in the client. It uses system fonts and the canvas element. @bdon told me that there are 10k CJK characters and that the ones in use are not close together in Unicode code point order.

@1ec5
Contributor

1ec5 commented Apr 30, 2023

See also the discussions in maplibre/maplibre-style-spec#145 (reply in thread) and maplibre/maplibre-native#778 (reply in thread) that led to this PR.

@1ec5
Contributor

1ec5 commented Apr 30, 2023

It uses system fonts and the canvas element.

System fonts have been a long-requested feature: mapbox/mapbox-gl-native#7862. This PR overcomes the two hurdles blocking that feature: the overreliance on Unicode codepoints and the inability to include combining characters in the same glyph. Besides complex text support, leveraging the browser’s text layout engine also allows font fallback per glyph for free without forcing the style to specify a pan-Unicode fallback font. With some minor tweaks to make TinySDF’s font usage more flexible, this library could even support Web fonts for a more modern alternative to the fontstack mechanism.

It does not work in Firefox yet, as it uses a Chrome-only API right now. It can probably be polyfilled, though.

There has been a patch for Gecko for a few years but it got stalled due to the size it adds to the browser. There are several JavaScript libraries that claim to implement the same Unicode algorithm as Intl.Segmenter, but I guess they would add a similar amount to this library’s size.
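
As a rough sketch of how a page could cope with the missing constructor reported above, feature detection with a fallback is one option. The function name `segmentGraphemes` is hypothetical, and the fallback is only an assumption for illustration: splitting by code point is not equivalent to grapheme segmentation and will break up combining sequences.

```javascript
// Hypothetical helper: use Intl.Segmenter when the runtime provides it,
// otherwise fall back to plain code points (a lossy approximation).
function segmentGraphemes(text, locale) {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
    // Each item yielded by segment() carries the substring in `.segment`.
    return Array.from(segmenter.segment(text), (item) => item.segment);
  }
  // Fallback: Array.from iterates by code point, not by grapheme cluster.
  return Array.from(text);
}

console.log(segmentGraphemes('ক্রিসেন্ট লেক'));
```

A real polyfill would have to ship the Unicode segmentation data, which is where the size concern mentioned above comes from.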

in Bengali should be ক্রিসেন্ট লেক (diacritics have broken continuity)

I think it’s important to set expectations for now: this approach still relies on slicing and dicing the string, just not as granularly as before. But grapheme clusters can still foil important complex text features such as initial/medial/final character forms, since these features don’t affect collation or text selection: maplibre/maplibre-native#778 (reply in thread). Maybe we can hack around it using joiner characters, but I don’t know.

Assuming the result looks less broken than before, it would be wonderful to land this improvement, perhaps behind a runtime option like the existing option for CJK in TinySDF. But improving upon this might require either word-based segmentation (which looks rough on curved lines) or ditching the custom text shaper in favor of Harfbuzz – back to square one essentially.

@wipfli
Member Author

wipfli commented Apr 30, 2023

Intl.Segmenter gets some graphemes wrong, which leads to the bugs reported above. What I do now is check whether the concatenation ${grapheme1}${grapheme2} renders the same way as consecutive calls to fillText with grapheme1 and grapheme2; see the CanvasComparer class (thanks ChatGPT for writing it!). If the two renderings differ, the graphemes are treated as one unit.

So that CanvasComparer class is super slow and I am positive that one can make it faster. So now the demo is a bit slow (a bit a lot) but it fixes the problems above:

image

image

Also the clipping is fixed although that is probably just because I use TinySDF with a larger 200 px canvas (I changed this in the node_modules tinysdf index.js file...):

image
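
The merging idea described in this comment can be sketched independently of the canvas. This is not the CanvasComparer class from the branch: `mergeByRendering` is a hypothetical name, and the `rendersSame` predicate is an assumption standing in for the pixel comparison (in a browser it would draw `a + b` with one fillText call, draw `a` and `b` with two calls, and compare the image data).

```javascript
// Hypothetical sketch: merge neighbouring graphemes whenever drawing them
// together looks different from drawing them one after the other.
// rendersSame(a, b) is supplied by the caller and is an assumption here.
function mergeByRendering(graphemes, rendersSame) {
  const clusters = [];
  for (const grapheme of graphemes) {
    const last = clusters[clusters.length - 1];
    if (last !== undefined && !rendersSame(last, grapheme)) {
      // The pair shapes or kerns differently when joined: keep it as one unit.
      clusters[clusters.length - 1] = last + grapheme;
    } else {
      clusters.push(grapheme);
    }
  }
  return clusters;
}

// Stand-in predicate for demonstration only: pretend that "T" + "o" kerns.
const fakeRendersSame = (a, b) => !(a === 'T' && b === 'o');
console.log(mergeByRendering(['T', 'o', 'm'], fakeRendersSame)); // [ 'To', 'm' ]
```

Because the predicate is called once per neighbouring pair, the cost of each canvas comparison dominates the running time, which matches the slowness reported here.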

@wipfli
Member Author

wipfli commented Apr 30, 2023

Let me know if there are still some bugs in the demo now!

@wipfli
Member Author

wipfli commented Apr 30, 2023

If we can make the CanvasComparer class more efficient, we can also drop the Intl.Segmenter and just give the canvas comparer the individual codepoints. For the labels I looked at it worked, but it was even slower than the current version because the input of the canvas comparer was larger...

@wipfli
Member Author

wipfli commented Apr 30, 2023

@bdon we will need your higher-resolution TinySDF version if this should ever be used for real. At the moment, the Latin letters, for example, look very pixelated, and I am sure it is the same for the other languages...

@wipfli
Member Author

wipfli commented Apr 30, 2023

You can open the browser console to see what the graphemes are:

image

Interestingly, this approach will also do kerning for us. To is not the same as T and o...

@wipfli
Member Author

wipfli commented Apr 30, 2023

If someone could have a look at the CanvasComparer class and make it more efficient, it would be great. Help with this would be super welcome!

@1ec5
Contributor

1ec5 commented Apr 30, 2023

Interestingly, this approach will also do kerning for us. To is not the same as T and o...

That is very nice, but on a line-placed label, wouldn’t that make the curvature less smooth, kind of choppy? Maybe in a line-placed label, when you detect a difference but it isn’t just one grapheme cluster, then you can break it apart just in case.

@wipfli
Member Author

wipfli commented Apr 30, 2023

For Latin text we can use the browser API to segment.

By the way, here is a cool location to see some Burmese text along a line:

https://wipfli.github.io/index-by-grapheme/#map=15.01/16.80186/96.17123

@1ec5
Contributor

1ec5 commented Apr 30, 2023

The memory usage is so intense that it sends MobileSafari into a crash loop. 🙈

@wipfli
Member Author

wipfli commented Apr 30, 2023

Oops

@wipfli
Member Author

wipfli commented Apr 30, 2023

@1ec5 does it work now? I made it a bit faster by comparing only the parts of the canvases that actually have text. Also, I use a smaller font.

@wipfli
Member Author

wipfli commented May 7, 2023

That is an interesting idea, @1ec5. In Chrome on Ubuntu, it unfortunately did not work:

<!DOCTYPE html>
<html>
<head>
<style>
p {
  width: 0; 
  border: 1px solid #000000;
  word-break: break-all;
}

</style>
</head>
<body>

<h2>word-break: break-all:</h2>
<p>မြေတွေလှရွှေတေ</p>

</body>
</html>

gives:
image

@wipfli
Member Author

wipfli commented May 7, 2023

I updated the code and removed the segmentation in the client. With the new version, I assume that the tiles contain strings which have marks between characters which should be treated as a cluster. I used the @ character as a mark.

If the user inputs "H@all@o", I will assume that I should use these clusters: ["Ha", "l", "lo"]. Similarly:

  • "Hallo" -> ["H", "a", "l", "l", "o"]
  • "H@a@llo" -> ["Hal", "l", "o"]

The point labels in the demo now use such marked strings with the @ separator. The tiles for line labels are still being generated (this might take some days). Until then, the line labels will be broken.

Here is the script I used to generate marked strings: https://github.com/wipfli/swiss-map/blob/main/planetiler/cluster/index.js It is a rudimentary script and some things would need to be improved; in particular, it did not seem to get the Khmer labels right. But I am happy with the assumption that we start in MapLibre GL JS with text that contains explicit cluster information.
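
Once labels arrive with explicit cluster information, the glyph cache can be keyed by the cluster string instead of a code point. The sketch below only illustrates that idea: `ClusterGlyphCache` and the `rasterize` callback are hypothetical names, and the real glyph atlas in MapLibre GL JS is organized differently.

```javascript
// Hypothetical cache keyed by cluster string rather than by code point.
// `rasterize` is an assumption: any function that turns a string into a
// glyph bitmap, for example a wrapper around a TinySDF instance in a browser.
class ClusterGlyphCache {
  constructor(rasterize) {
    this.rasterize = rasterize;
    this.glyphs = new Map();
  }

  get(cluster) {
    let glyph = this.glyphs.get(cluster);
    if (glyph === undefined) {
      // First time this cluster is seen: rasterize once and remember it.
      glyph = this.rasterize(cluster);
      this.glyphs.set(cluster, glyph);
    }
    return glyph;
  }
}

// Demonstration with a fake rasterizer that just records the string width.
const cache = new ClusterGlyphCache((cluster) => ({ cluster, width: cluster.length }));
cache.get('Ha');
cache.get('Ha'); // served from the cache, no second rasterization
console.log(cache.glyphs.size); // 1
```

A Map accepts arbitrary strings as keys, so multi-code-point clusters need no special encoding.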

@1ec5
Contributor

1ec5 commented May 7, 2023

In Chrome on Ubuntu, it unfortunately did not work

Wow, that looks pretty ugly. I don’t know why break-all would realistically break on anything more granular than a grapheme cluster, since this property value is intended for display of human-readable text. It looks like possibly a bug in Blink or Harfbuzz.

This is what I see in various browsers I have readily available:

Safari 13.1 on macOS 10.13

Safari and all other browsers on iOS 16.4

Firefox 112.0 on macOS 10.13

SeaMonkey 2.53 (like Firefox 91.0) on macOS 10.13

Chrome 115.0 on macOS 10.13

Since Firefox seems to be handling break-all the best, maybe we can use it as a workaround for Firefox while other (modern versions of) browsers use Intl.Segmenter?

@1ec5
Contributor

1ec5 commented May 7, 2023

I am happy with the assumption that we start in MapLibre GL JS with text that contains explicit cluster information.

That sounds OK, but if you generally expect tilesets to sprinkle zero-width spaces in text regardless of writing system, then you’re effectively forcing word-break: break-all behavior in any point-placed label with text-max-width, in any style, because ZWSP is also a line-breaking opportunity:

[0x200b]: true, // zero-width space

I guess there is some precedent in that Mapbox Streets v8 now inserts zero-width spaces in names – but only in “text that is meant to be rendered on multiple lines”. What the documentation doesn’t say is that it’s also limited to certain writing systems, such as CJK, that don’t use spaces to segment words or break lines. Using it liberally on all writing systems would be a misuse of the character, as far as I can tell.

There are also some unfortunate side-effects to expecting the server side to munge what would normally be human-readable text for presentational purposes. For example, it would interfere with any data-driven styling based on the same feature properties, and some feature querying code could also be affected. For example, the VoiceOver screenreader integration built into the iOS map SDK, which is based on feature querying, would begin spelling out the name of every POI unless the ZWSPs are stripped out.
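
To make the line-breaking concern concrete, here is a small sketch of what happens when a zero-width space sits between every character. The `breakable` table below is an abbreviated, assumed stand-in (only the 0x200b entry is taken from the line quoted above), and `breakOpportunities` is a hypothetical helper, not GL JS code.

```javascript
// Assumed, abbreviated table of code points that allow a line break.
// Only the 0x200b entry comes from the thread; the space is added for contrast.
const breakable = {
  0x20: true, // space
  0x200b: true, // zero-width space
};

// Hypothetical helper: list the code point indices where a line may break.
function breakOpportunities(text) {
  const indices = [];
  let index = 0;
  for (const ch of text) {
    if (breakable[ch.codePointAt(0)]) indices.push(index);
    index += 1;
  }
  return indices;
}

console.log(breakOpportunities('New York')); // [ 3 ]
console.log(breakOpportunities('N\u200be\u200bw')); // [ 1, 3 ]
```

With ZWSPs sprinkled everywhere, every gap becomes a candidate break, which is the break-all behavior the comment warns about.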

@wipfli
Member Author

wipfli commented May 8, 2023

The line labels should work again now. I updated the tiles.

On Thursday, May 11th, 2023 at 8 AM CEST we will have our next MapLibre Eastern Call and discuss text rendering there. Feel free to join. The Zoom link is in the Slack.

@wipfli
Member Author

wipfli commented May 8, 2023

Intl.Segmenter gives you graphemes, but not clusters, so using this browser API does not solve our problem.

Harfbuzz docs: https://harfbuzz.github.io/clusters.html says this about clusters and graphemes:

In text shaping, a cluster is a sequence of characters that needs to be treated as a single, indivisible unit. A single letter or symbol can be a cluster of its own. Other clusters correspond to longer subsequences of the input code points — such as a ligature or conjunct form — and require the shaper to ensure that the cluster is not broken during the shaping process.

A cluster is distinct from a grapheme, which is the smallest unit of meaning in a writing system or script.

I should actually rename this pull request to "index by cluster"...

@wipfli
Member Author

wipfli commented May 8, 2023

The side-effects of having explicit joining characters can be mitigated by removing them before using text in expressions and voice-over.

Ideally, we could have a default customJoiningCharacter but also offer a style-spec property to let the user specify it.
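
A minimal sketch of the mitigation described here, assuming the joining character is configurable. The function name `stripJoiningCharacter` and the default value are hypothetical; `customJoiningCharacter` is only the name floated in this comment, not an existing style-spec property.

```javascript
// Hypothetical helper: remove the joining character before the text is used
// in expressions, feature queries, or screen readers.
function stripJoiningCharacter(text, customJoiningCharacter = '@') {
  // split/join removes every occurrence without any regex escaping concerns.
  return text.split(customJoiningCharacter).join('');
}

console.log(stripJoiningCharacter('H@all@o')); // Hallo
```

One caveat with this approach: a joining character that can also occur in real names (such as a literal @) would be stripped too, so the choice of character matters.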

@wipfli
Member Author

wipfli commented May 8, 2023

@1ec5 the demo was using OffscreenCanvas, which is probably why it did not work on your iPhone 8. Now I removed the OffscreenCanvas and use the normal DOM canvas again. Does it work for you now?

@brawer

brawer commented May 8, 2023

@wipfli Try FontView to see HarfBuzz (+Raqm+FriBiDi) acting on a single font file. There’s no need for grapheme clustering in this code; before calling into HarfBuzz, Raqm asks FriBiDi for bidi runs, and Raqm has its own (small) logic for splitting script runs. There’s also a demo of HarfBuzz in a browser which might perhaps be more relevant for MapLibre GL JS, but it’s the same HarfBuzz library called underneath.

@wipfli
Member Author

wipfli commented May 9, 2023

A fun side-effect of generating the SDFs in the client is that we can use web fonts:

We already have the map.localIdeographFontFamily option. I've used this one to let the user configure the font family in the demo via the fontFamily part in the URL:

Serif

https://wipfli.github.io/index-by-grapheme/#map=4.82/47.76/12.2&fontFamily=serif

Monospace

https://wipfli.github.io/index-by-grapheme/#map=4.82/47.76/12.2&fontFamily=monospace
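
Reading the font family from the URL hash, as the demo links above do, could look roughly like this. It is a sketch under assumptions: `fontFamilyFromHash` and its fallback value are hypothetical, and the demo's actual parsing code may differ.

```javascript
// Hypothetical helper: extract the fontFamily part from a hash such as
// "#map=4.82/47.76/12.2&fontFamily=serif".
function fontFamilyFromHash(hash, fallback = 'sans-serif') {
  const params = new URLSearchParams(hash.replace(/^#/, ''));
  return params.get('fontFamily') || fallback;
}

const fontFamily = fontFamilyFromHash('#map=4.82/47.76/12.2&fontFamily=serif');
console.log(fontFamily); // serif

// In a browser, the value could then be handed to the map, for example:
//   new maplibregl.Map({ ..., localIdeographFontFamily: fontFamily });
```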

@1ec5
Contributor

1ec5 commented May 9, 2023

Intl.Segmenter gives you graphemes, but not clusters, so using this browser API does not solve our problem.

Intl.Segmenter gives you “grapheme clusters”, which Harfbuzz calls “clusters” for short. If it gave you just graphemes, it would be equivalent to passing the empty string into String.prototype.split.

From the same documentation:

For example, two individual letters are often two separate graphemes. When two letters form a ligature, however, they combine into a single glyph. They are then part of the same cluster and are treated as a unit by the shaping engine — even though the two original, underlying letters remain separate graphemes.

Intl.Segmenter won’t always give you perfect results. It has no context about the font, and there are different interpretations of what should constitute a (grapheme) cluster, for example based on whether the font happens to create a ligature at a given font size. I view the whole grapheme cluster idea as a stopgap, but one that’s less onerous on both tileset generators and application developers than littering the text with ZWSPs.

The side-effects of having explicit joining characters can be mitigated by removing them before using text in expressions and voice-over.

There is still data loss. In some languages like Thai, and to a lesser extent Chinese,[1] ZWSPs or soft hyphens are typically used as word boundaries, analogous to the spaces in Latin. Overloading ZWSP to also represent a grapheme cluster boundary prevents GL JS from word-wrapping at ZWSPs as users expect. Stripping ZWSP from feature properties doesn’t solve this problem, but it does expand the problem, preventing natively rendered text from behaving correctly too.

Footnotes

  1. Well-typeset maps in Chinese avoid breaking up character compounds (which are mostly two or three Chinese characters long). Otherwise, it’s very easy for a label to accidentally say something very naughty or even illegal if the reader doesn’t realize the lexeme has been broken apart.

@1ec5
Contributor

1ec5 commented May 9, 2023

the demo was using OffscreenCanvas, which is probably why it did not work on your iPhone 8. Now I removed the OffscreenCanvas and use the normal DOM canvas again. Does it work for you now?

Yes.

@wipfli
Member Author

wipfli commented May 10, 2023

Thanks for the insight, @1ec5. I think we can use a custom character to describe where joining should happen. That way, we can avoid conflicts in Thai and other languages.

@wipfli
Member Author

wipfli commented May 17, 2023

If we somehow could encode the font used when doing the server-side text segmentation, then we could do really cool stuff like using Noto Nastaliq Urdu for Persian labels.

Here is an example where I use a Nastaliq font by default in TinySDF. As a result, all Arabic labels show up in Nastaliq:

// in tinysdf
-  ctx.font = `${fontStyle} ${fontWeight} ${fontSize}px ${fontFamily}`;
+  ctx.font = `${fontStyle} ${fontWeight} ${fontSize}px 'Noto Nastaliq Urdu',Verdana,sans`;

https://wipfli.github.io/index-by-grapheme/nastaliq/#map=5.09/31.62/68.39
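
Instead of hard-coding the family list as in the patch above, the canvas font string could be assembled from a list of families. This is only a sketch: `cssFontString` is a hypothetical helper, and the rule used to decide which names get quoted (anything that is not all lowercase letters and hyphens) is an assumption that happens to cover generic families such as sans-serif.

```javascript
// Hypothetical helper: build a value for ctx.font from a list of families.
// Names that look like generic CSS families (lowercase letters and hyphens)
// stay unquoted; everything else is wrapped in single quotes.
function cssFontString({ fontStyle = 'normal', fontWeight = 'normal', fontSize = 24, families }) {
  const list = families
    .map((name) => (/^[a-z-]+$/.test(name) ? name : `'${name}'`))
    .join(', ');
  return `${fontStyle} ${fontWeight} ${fontSize}px ${list}`;
}

console.log(cssFontString({ families: ['Noto Nastaliq Urdu', 'Verdana', 'sans-serif'] }));
// normal normal 24px 'Noto Nastaliq Urdu', 'Verdana', sans-serif
```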

@1ec5
Contributor

1ec5 commented May 19, 2023

If we somehow could encode the font used when doing the server-side text segmentation, then we could do really cool stuff like using Noto Nastaliq Urdu for Persian labels.

Yes, that would be wonderful, also for distinguishing between Chinese and Japanese variants of the same Unicode codepoints. This presupposes that the tiles or TileJSON somehow indicate the language of the field(s) being inserted into text-field – or maybe that the style indicate the language, in the case of an Americana-like code generation mechanism.

@wipfli
Member Author

wipfli commented May 20, 2023

Following the HarfBuzz simple shaping example (https://harfbuzz.github.io/a-simple-shaping-example.html), one needs the following ingredients for correct text shaping:

  • the text itself
  • the direction, language, and script
  • the font

I think we can encode all of the above information in the tiles. A trivial way of doing it would be for example to use JSON strings like this one:

{
  "text": "Oliver",
  "direction": "ltr",
  "language": "en",
  "script": "latin",
  "font": "Noto Sans Regular"
}
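
On the client, such a JSON string would need to be decoded while still tolerating ordinary plain-text names. The following is a sketch under assumptions: `parseLabel` and the default values are hypothetical, and nothing like this exists in MapLibre GL JS today.

```javascript
// Hypothetical decoder for a label property that is either plain text or a
// JSON string with text, direction, language, script, and font.
function parseLabel(value) {
  const defaults = { direction: 'ltr', language: 'und', script: null, font: null };
  try {
    const parsed = JSON.parse(value);
    if (parsed && typeof parsed === 'object' && typeof parsed.text === 'string') {
      return { ...defaults, ...parsed };
    }
  } catch (err) {
    // Not JSON: treat the value as a plain label below.
  }
  return { ...defaults, text: value };
}

console.log(parseLabel('{"text": "Oliver", "language": "en"}').language); // en
console.log(parseLabel('Zürich').text); // Zürich
```

The "und" default is the BCP 47 tag for an undetermined language; the other defaults are arbitrary choices for this illustration.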

Note that in the canvas you cannot set the language, but one could use for example html-to-image https://www.npmjs.com/package/html-to-image instead of the canvas. Like that, we could do stuff like showing CJK in different languages:

  • <span lang="zh-Hant">令</span> ->
  • <span lang="ja">令</span> ->

I am still a bit unsure why HarfBuzz needs to know the script. Does it maybe have something to do with Arabic/Urdu/Nastaliq?

@1ec5
Contributor

1ec5 commented May 20, 2023

I think we can encode all of the above information in the tiles.

In principle, yes, although GL JS has never made such detailed assumptions about the tiles’ contents up to now. Instead, it has relied on TileJSON (or the inline TileJSON inside the style JSON) to describe the tiles. I think it would be prudent to extend that approach rather than make implicit assumptions. For one thing, the most popular OSM-based tilesets contain multiple name fields in various languages, not to mention a generic name field whose language is undetermined. The TileJSON could include an object that maps properties to their languages.

There is a separation of concerns between TileJSON and the style JSON. Fonts are typically defined in the latter, and I mostly don’t see a reason to depart from that approach for this feature. The iOS SDK already interprets the text-font property as either a fontstack (for server-side rendering) or a list of local font names (for client-side rendering). By analogy, you’d just set ctx.font to the evaluated text-font value.

Note that in the canvas you cannot set the language, but one could use for example html-to-image https://www.npmjs.com/package/html-to-image instead of the canvas. Like that, we could do stuff like showing CJK in different languages:

Clever library – it works by embedding the HTML element in an SVG document, creating an HTML image out of the SVG, and rendering the image into a canvas.

But assuming that the whole glyph belongs to a single language, there’s a much simpler solution: just set the <canvas> element’s lang attribute to the text’s language, and the browser will select the fonts accordingly. For example, try changing zh to ja in this demo; you should see the inner strokes of 海 change just as in ZeLonewolf/openstreetmap-americana#613 (comment). (In your TinySDF-based demo, it would probably involve setting ctx.canvas.lang.)

I am still a bit unsure why HarfBuzz needs to know the script. Does it maybe have something to do with Arabic/Urdu/Nastaliq?

I’m not entirely sure, but maybe HarfBuzz doesn’t maintain a mapping from language codes to default scripts? There are also plenty of edge cases, such as punctuation characters that don’t inherently belong to one script or another, but that different fonts might treat differently depending on the language.

@brawer

brawer commented May 20, 2023

I am still a bit unsure why HarfBuzz needs to know the script. Does it maybe have something to do with Arabic/Urdu/Nastaliq?

For better or worse, this is due to how OpenType works internally. No, it’s unrelated to Nastaliq. Rather, the script is a property of the Unicode sequence being rendered, as defined by Unicode Standard Annex #24. Before calling HarfBuzz, you need to split the string into “script runs”, which are sequences of characters that have the same script, and call HarfBuzz separately for each run. For example, if a label ローソンATM is tagged with language ja, you’ll have to call HarfBuzz twice: once for ローソン with script Kana and language ja, and once for ATM with script Latn and (again) language ja.

The process of splitting text into script runs is called “script itemization”. There are some subtleties around punctuation and emoji, and unfortunately, the algorithm has never been formally defined. My personal recommendation would be to leave this all up to a higher-level library like Raqm or Minikin. If you really have to implement it yourself, check out what Raqm does in _raqm_itemize() and _raqm_resolve_scripts(). Should you really want to go down this rabbit hole, w3c/font-text-cg#37 might be a good starting point, along with this comparison of existing implementations.

For a general introduction, see Text layout is a loose hierarchy of segmentation.
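
To illustrate what a script run is, here is a deliberately naive itemizer built on JavaScript's Unicode property escapes. It is not what Raqm or HarfBuzz do: the name `scriptRuns`, the four-entry script table, and the rule that neutral characters simply join the preceding run are all simplifying assumptions, and real itemization has to handle punctuation, emoji, and paired brackets far more carefully.

```javascript
// Tiny, assumed script table; a real itemizer covers every Unicode script.
const SCRIPTS = {
  Latn: /\p{Script_Extensions=Latin}/u,
  Kana: /\p{Script_Extensions=Katakana}/u,
  Hira: /\p{Script_Extensions=Hiragana}/u,
  Hani: /\p{Script_Extensions=Han}/u,
};

// Hypothetical helper: split text into runs of characters sharing a script.
function scriptRuns(text) {
  const runs = [];
  for (const ch of text) {
    const candidates = Object.keys(SCRIPTS).filter((tag) => SCRIPTS[tag].test(ch));
    const last = runs[runs.length - 1];
    if (last && (candidates.length === 0 || candidates.includes(last.script))) {
      // Neutral character, or one that can belong to the current script.
      last.text += ch;
    } else {
      // "Zyyy" (Common) marks a run whose script this table cannot resolve.
      runs.push({ script: candidates[0] || 'Zyyy', text: ch });
    }
  }
  return runs;
}

console.log(scriptRuns('ローソンATM'));
// [ { script: 'Kana', text: 'ローソン' }, { script: 'Latn', text: 'ATM' } ]
```

Script_Extensions is used rather than Script so that the prolonged sound mark ー, which is shared between Hiragana and Katakana, stays inside the Katakana run.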

@wipfli
Member Author

wipfli commented May 22, 2023

Fascinating stuff, thanks @brawer! I think I will stick to Raqm because when we render text with the canvas object from JavaScript, we have basically the same API as Raqm, which takes:

  • text
  • language
  • direction
  • font

So since we cannot set the script in an HTML canvas, we probably do not need to ship it to the client.

@brawer

brawer commented May 22, 2023

I think I will stick to Raqm [instead of building a custom text rendering stack]

Sounds wise. From what I can see, the main differences to Minikin are:

  1. line breaking;
  2. font fallback.

Regarding line breaking, @khaledhosny once wrote a branch for Raqm but according to HOST-Oman/libraqm#50 it won’t get merged because libunibreak is better at finding line breaking opportunities. However, I’m not sure if Raqm can already call libunibreak; Khaled would know best. Note that Minikin also does hyphenation (using LaTeX hyphenation dictionaries), whereas libunibreak just implements the Unicode line breaking algorithm. But the latter is probably good enough for rendering map labels.

Regarding font fallback, it would be good to know how big an issue it really is for MapLibre. Since you’re already running Raqm on OpenStreetMap, can you count how many missing characters (glyph index zero) you see in the output glyph vectors?

@1ec5
Contributor

1ec5 commented May 22, 2023

Note that Minikin also does hyphenation (using LaTeX hyphenation dictionaries), whereas libunibreak just implements the Unicode line breaking algorithm. But the latter is probably good enough for rendering map labels.

The homegrown line breaking code in GL JS implements LaTeX-style line balancing (not hyphenation), which as far as I know isn’t part of the Unicode line breaking algorithm. Line balancing keeps point-placed labels looking tidy. Without it, text-max-width effectively determines the length of the first line of the label, which mostly defeats the purpose of line wrapping.
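
The difference between filling lines greedily and balancing them can be sketched with a small dynamic program. This is not the GL JS implementation: `balanceLines` is a hypothetical function, character counts stand in for glyph advances, and the cost function (squared distance from an ideal line width) is an assumption chosen for brevity.

```javascript
// Hypothetical sketch of line balancing: pick breaks so that every line is
// as close as possible to the same target width.
function balanceLines(words, maxWidth) {
  const total = words.join(' ').length;
  const lineCount = Math.ceil(total / maxWidth);
  const target = total / lineCount; // ideal width of each line
  // best[i] holds the cheapest way to lay out the first i words.
  const best = [{ cost: 0, lines: [] }];
  for (let end = 1; end <= words.length; end++) {
    let choice = null;
    for (let start = 0; start < end; start++) {
      const line = words.slice(start, end).join(' ');
      const cost = best[start].cost + (line.length - target) ** 2;
      if (choice === null || cost < choice.cost) {
        choice = { cost, lines: [...best[start].lines, line] };
      }
    }
    best.push(choice);
  }
  return best[words.length].lines;
}

console.log(balanceLines(['New', 'York', 'City', 'Hall'], 14));
// [ 'New York', 'City Hall' ]
```

A greedy fill with the same limit of 14 would produce "New York City" followed by a lone "Hall", which is the untidy result described above.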

@1ec5
Contributor

1ec5 commented May 22, 2023

Regarding font fallback, it would be good to know how big an issue it really is for MapLibre. Since you’re already running Raqm on OpenStreetMap, can you count how many missing characters (glyph index zero) you see in the output glyph vectors?

This would primarily be a consideration when using OSM’s local-language name key throughout the world, as in OSM Americana or any style that aims to reproduce openstreetmap-carto.

The status quo of server-side glyph rasterization all but forces the style designer to specify a pan-Unicode font as the last font in the font fallback list. (The iOS SDK removes this font from the list, in favor of the system font fallbacks, when rendering glyphs locally.)

Even then, GL JS occasionally runs into its lack of support for non-BMP characters: mapbox/mapbox-gl-js#4001 (comment).

@HarelM
Member

HarelM commented Jun 28, 2023

@wipfli what's the status of this PR?

@wipfli
Member Author

wipfli commented Jun 29, 2023

I think this was a successful proof of concept. The next step would be to write a design proposal for the style specification.

Do you think this is the right direction?

@HarelM
Member

HarelM commented Jun 29, 2023

Sure, my main question was about the reasons to keep this PR open...

@wipfli
Member Author

wipfli commented Jun 30, 2023

We can close it. The branch will continue to exist in my repo.

@wipfli wipfli closed this Jun 30, 2023

7 participants