Add an option to prefer decomposed forms #653

Closed
khaledhosny opened this issue Dec 16, 2017 · 22 comments

Comments

@khaledhosny
Collaborator

As discussed on this Twitter thread.

Alternatively, we can just prefer the form used in the input if the font supports the characters; if that form renders suboptimally, we can say it is a font bug.

@behdad
Member

behdad commented Jan 5, 2018

Jonathan and I are discussing this now. I'm inclined to just change the default to prefer decomposed. Initially when we did this, it was to improve shaping with SBL Hebrew and older Latin fonts that lacked GPOS mark positioning. Maybe we can continue precomposing if the font doesn't have GPOS. Just exposing more options is not the best solution if there are no clear criteria for how the client of the library should set that option.

@khaledhosny
Collaborator Author

I agree about not having an option.

@behdad
Member

behdad commented Sep 10, 2018

@jfkthame any preference here?

@jfkthame
Collaborator

Generally, I think I'd favor preferring the form as provided in the input, so that as far as possible harfbuzz stays out of the way of whatever the client & font are trying to do, rather than introducing additional magic.

But of course there are exceptions, where harfbuzz steps in to do something that wasn't explicitly requested by either the client or the font, but in practice often helps: fallback mark positioning is a great example. And precomposing accented letters if the font lacks GPOS seems like it fits in that category; it's basically a better alternative to fallback mark positioning.

When a font has a GPOS table, though, we should just use it and not try to second-guess the designer. If the GPOS table then fails to position accents well, that's a font bug.

@behdad
Member

behdad commented Sep 11, 2018

> Generally, I think I'd favor preferring the form as provided in the input, so that as far as possible harfbuzz stays out of the way of whatever the client & font are trying to do, rather than introducing additional magic.

I think it serves us better if HarfBuzz renders canonically-equivalent sequences the same independent of the font.

@jfkthame
Collaborator

> I think it serves us better if HarfBuzz renders canonically-equivalent sequences the same independent of the font.

Usually, I'd agree with that, but I don't see it as an absolute rule that is entirely the responsibility of the shaping engine.

(For example, there are the CJK Compatibility Ideographs at U+F900, which are canonically equivalent to other chars in the Unified block, but where a distinction in rendering must be maintained if we're to support text mapped from certain legacy encodings.)
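The canonical equivalence mentioned here can be checked with Python's standard `unicodedata` module. This is only an illustrative sketch using U+F900 as the sample character; it says nothing about how HarfBuzz itself handles these code points.

```python
import unicodedata

compat = "\uF900"   # CJK COMPATIBILITY IDEOGRAPH-F900
unified = "\u8C48"  # the unified ideograph it canonically maps to

# Raw decomposition field from the character database. The absence of a
# "<tag>" prefix means the mapping is canonical, not a compatibility one.
decomposition = unicodedata.decomposition(compat)

# Both canonical normal forms replace the compatibility ideograph, so the
# distinction is lost as soon as the text is normalized.
nfc = unicodedata.normalize("NFC", compat)
nfd = unicodedata.normalize("NFD", compat)

print(decomposition, nfc == unified, nfd == unified)
```

Because both NFC and NFD fold the compatibility ideograph into its unified counterpart, any renderer that normalizes first cannot keep the two visually distinct.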

For the case of Latin accents, I'm torn between the desire to render canonically equivalent sequences identically, the desire to allow clients to achieve effects such as mark coloring (dependent on decomposed rendering), and the desire to render decent-looking results with as many fonts as possible (often dependent on precomposed).

Consider me confused & conflicted.... :\

@behdad
Member

behdad commented Sep 11, 2018

> Consider me confused & conflicted.... :\

Same here. But yeah, in the interest of mark coloring, I like going in that direction.

@khaledhosny
Collaborator Author

I think something like the following would be a good compromise:

  • If input uses decomposed form, and the characters are supported by the font, and the font has a GPOS table, use them.
  • Else, use the composed form, if supported by the font.

If after that canonically equivalent forms render differently, then it is a font bug (which can still happen with the current scheme, as seen in #1092).
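As a rough illustration of the two rules above, here is a minimal Python sketch. `ToyFont`, `covered`, `has_gpos`, and `choose_form` are hypothetical names invented for this example, not HarfBuzz API, and "input uses decomposed form" is approximated as "input differs from its NFC form".

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyFont:
    """Hypothetical stand-in for a font: covered characters plus a GPOS flag."""
    covered: frozenset
    has_gpos: bool

    def supports(self, text: str) -> bool:
        return all(ch in self.covered for ch in text)


def choose_form(text: str, font: ToyFont) -> str:
    composed = unicodedata.normalize("NFC", text)
    # Assumption: the input counts as "decomposed" if NFC would change it.
    input_is_decomposed = text != composed
    # Rule 1: decomposed input, characters supported, GPOS present.
    if input_is_decomposed and font.supports(text) and font.has_gpos:
        return text
    # Rule 2: otherwise use the composed form if the font supports it.
    if font.supports(composed):
        return composed
    # Neither rule applies; this sketch passes the input through unchanged.
    return text


with_gpos = ToyFont(frozenset("e\u0301\u00e9"), has_gpos=True)
no_gpos = ToyFont(frozenset("e\u0301\u00e9"), has_gpos=False)
print(choose_form("e\u0301", with_gpos) == "e\u0301")
print(choose_form("e\u0301", no_gpos) == "\u00e9")
```

The proposal does not say what should happen when neither rule applies, so the pass-through in the last branch is a choice made for this sketch only.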

@behdad
Member

behdad commented Sep 20, 2018

So, basically: if GPOS available, prefer decomposed. Else, prefer composed. I think that makes sense. That said, our fallback positioning also kicks in if GPOS is not available. So, maybe always decomposed is fine...

@jfkthame
Collaborator

> That said, our fallback positioning also kicks in if GPOS is not available. So, maybe always decomposed is fine...

I don't think so; although fallback positioning is (much) better than nothing for otherwise-unsupported combinations, it's unlikely to be acceptable as a substitute for precomposed glyphs provided by the font.

@behdad
Member

behdad commented Sep 21, 2018

Ok, but then can we agree on decomposed if font has GPOS? Has GPOS and mark feature?

@jfkthame
Collaborator

Yes, I think that's reasonable. IMO "has GPOS and mark feature" would be a better condition than just "has GPOS", as it seems plausible there'll be fonts that include kerning (and newer tools may have put the kern table into GPOS rather than a legacy 'kern' table), but no mark positioning has been implemented.
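The difference between the two conditions can be sketched as a small predicate. This is a hypothetical Python helper, not HarfBuzz code: `prefers_decomposed` is an invented name, and it assumes the caller has already collected the feature tags of the font's GPOS table, passing `None` when the font has no GPOS table at all.

```python
from typing import Iterable, Optional


def prefers_decomposed(gpos_feature_tags: Optional[Iterable[str]]) -> bool:
    """Return True only when GPOS exists and exposes a 'mark' feature."""
    if gpos_feature_tags is None:
        # No GPOS table at all: keep preferring precomposed glyphs.
        return False
    # A GPOS table holding only kerning does not qualify.
    return "mark" in set(gpos_feature_tags)


print(prefers_decomposed(None))
print(prefers_decomposed(["kern"]))
print(prefers_decomposed(["kern", "mark"]))
```

A kerning-only font is exactly the case where "has GPOS" and "has GPOS and mark feature" give different answers.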

@dscorbett
Collaborator

That would break fonts that assume either that the shaper uses NFC, or that the shaper might not normalize but that most text is already in NFC. For example, here are U+0123 LATIN SMALL LETTER G WITH CEDILLA and U+0386 GREEK CAPITAL LETTER ALPHA WITH TONOS in Noto Sans when the default shaper uses HB_OT_SHAPE_NORMALIZATION_MODE_DECOMPOSED.

ģΆ

@behdad
Member

behdad commented Sep 24, 2018

> That would break fonts that assume either that the shaper uses NFC, or that the shaper might not normalize but that most text is already in NFC. For example, here are U+0123 LATIN SMALL LETTER G WITH CEDILLA and U+0386 GREEK CAPITAL LETTER ALPHA WITH TONOS in Noto Sans when the default shaper uses HB_OT_SHAPE_NORMALIZATION_MODE_DECOMPOSED.

What's wrong with that?!!

@behdad closed this as completed in 62d1e08 Sep 24, 2018

@jfkthame
Collaborator

SMALL G WITH CEDILLA is normally rendered with the "cedilla" as an inverted comma above, rather than a standard cedilla below; CAPITAL ALPHA WITH TONOS is conventionally rendered with the accent beside the top left of the A, rather than above.

Precomposed glyphs in the font will reflect these quirks of specific characters; using decomposition and applying fallback positioning loses this entirely.

I'm afraid such fonts will be rather common, given that shapers have typically either used NFC or not applied normalization but just rendered glyphs for the character sequence as provided.
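To see what a decomposing shaper would be working with for these two characters, the canonical decompositions can be listed with Python's standard `unicodedata` module. This is an illustrative sketch only; `decompose` is a helper name made up for this example.

```python
import unicodedata


def decompose(ch: str) -> list:
    """Code points of the canonical decomposition (NFD) of one character."""
    return [ord(part) for part in unicodedata.normalize("NFD", ch)]


for ch in ("\u0123", "\u0386"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    for cp in decompose(ch):
        # Each part is a plain base letter or a generic combining mark.
        print(f"    U+{cp:04X} {unicodedata.name(chr(cp))}")
```

The decomposed sequences contain only a base letter followed by a generic combining cedilla or acute accent. Nothing in them records the character-specific shape or placement that a precomposed glyph can carry.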

@jfkthame reopened this Sep 24, 2018

@behdad
Member

behdad commented Sep 24, 2018

Umm. Ok, I'm reverting the behavior change for now.

@behdad
Member

behdad commented Sep 24, 2018

Should we hand-pick glyphs that we want specifically composed?

@jfkthame
Collaborator

I don't think that's really a solution, unfortunately. There are some special cases (like g-cedilla) that we could list, but there are also fonts that choose to style many "normal" accented glyphs somewhat differently from a base glyph + positioned accent. For example, some fonts use reduced-height uppercase letters when applying an accent above. Preferring decomposition means we'll lose carefully-crafted effects like this. :(

The more I think about it, the more I'm feeling that the only generally-safe way forward here is for the font designer to opt in to preferring-decomposition behavior, as only the font designer knows whether the components are designed such that dynamic composition will give an acceptable result. But we don't have a mechanism for the font designer to indicate this preference. :(

So for now, I'm inclined to vote for simply using the form found in the input, unless the font lacks the glyphs for it. (In which case applying either composition or decomposition to get something renderable is better than ending up with .notdef.)
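The policy in the last paragraph (keep the input form, and normalize only to avoid .notdef) might look roughly like this in Python. `pick_renderable_form` and the `supported` set are hypothetical, and trying NFC before NFD is an arbitrary ordering chosen for this sketch.

```python
import unicodedata


def pick_renderable_form(text: str, supported: set) -> str:
    """Prefer the input as given; normalize only if the font cannot render it."""
    candidates = (
        text,                                # form found in the input
        unicodedata.normalize("NFC", text),  # try composing
        unicodedata.normalize("NFD", text),  # try decomposing
    )
    for candidate in candidates:
        if all(ch in supported for ch in candidate):
            return candidate
    # Nothing is fully renderable; fall back to the input unchanged.
    return text


only_parts = {"e", "\u0301"}
only_precomposed = {"\u00e9"}
print(pick_renderable_form("\u00e9", only_parts) == "e\u0301")
print(pick_renderable_form("e\u0301", only_precomposed) == "\u00e9")
```

When the font covers both forms, the function returns the input untouched, which is exactly the behavior that lets canonically equivalent strings render differently.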

@behdad
Member

behdad commented Sep 24, 2018

> So for now, I'm inclined to vote for simply using the form found in the input,

Generally I'm against that because it encourages users to expect the two canonically-equivalent forms to have distinct representations. I'd rather we pick one way or another and stick to it (like we've been).

@behdad
Member

behdad commented Sep 24, 2018

Anyway, leaving as is for now.

@behdad
Member

behdad commented Jun 24, 2022

I wonder if we can ever make a move here.

@khaledhosny
Collaborator Author

I think we can close this. An API is unlikely to be useful for most users, preferring the decomposed form will certainly break many fonts, and using the form found in the input means we no longer maintain the same rendering for canonically equivalent strings.

Fonts can already work around this if it is really needed.

@khaledhosny closed this as not planned Jun 25, 2022