Codepoint references in HarfBuzz source #2862

jyavner · 2021-02-15T23:56:17Z

HarfBuzz is great! I grepped the source code for all references to U+xxxx numbers and combined that list with a scan of the first six chapters of the Unicode Standard.
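
A scan of this kind can be sketched in a few lines of Python. This is only an illustration, not the script actually used: the regular expression, the `find_codepoint_refs` helper, and the sample text are all assumptions.

```python
import re

# Matches "U+" followed by 4 to 6 hex digits, e.g. U+25CC or U+1F600.
# The pattern is an assumption about what counts as a "reference".
CODEPOINT_RE = re.compile(r"\bU\+([0-9A-Fa-f]{4,6})\b")

def find_codepoint_refs(text):
    """Return the sorted, de-duplicated codepoints referenced in text."""
    return sorted({int(m.group(1), 16) for m in CODEPOINT_RE.finditer(text)})

# Made-up sample standing in for a source file.
sample = "/* U+25CC DOTTED CIRCLE */ ... U+180B..U+180D, again U+25CC"
refs = find_codepoint_refs(sample)
print([f"U+{cp:04X}" for cp in refs])
```

Running the same function over every file in a source tree and merging the results gives one list of codepoints per file, which can then be cross-referenced against other sources.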

results

Just search for "harfbuzz"!

Naturally, this scan has left me with some questions:

Does (enum UPROPS_MASK_HIDDEN) correspond to something in the Unicode
Consortium docs? Is it correct that only Mongolian variation selectors
have this property and not the other kinds of variation selectors?

There seems to be a lot of fallback code to support fonts without GSUB
tables. Are those still common? Perhaps HarfBuzz could offer a general
HB_NO_OLD_FONTS config option, to subsume
HB_NO_SHAPE_COMPLEX_HEBREW_FALLBACK and
HB_NO_OT_SHAPE_COMPLEX_THAI_FALLBACK and the various other bits of code
and data that support old-style fonts and might not be needed for embedded
applications.

Why are there four modules that have their own copies of the UTF-16 conversions and seven modules that each create and insert their own U+25CC dotted circles?

Support for Hangul composition seems redundant between files hb-ucd.cc and
hb-ot-shape-complex-hangul.cc. Can duplicate code be removed?

Support for Thai seems redundant and inconsistent between files
hb-ot-shape-complex-thai.cc and hb-ot-shape-fallback.cc. Can these be
merged somehow?

Support for Myanmar is inconsistent. Module
hb-ot-shape-complex-myanmar.hh seems to override the general category of
many punctuation marks, while gen-use-table.py overrides fewer. Is this
because only the USE-overridden ones are still in use by Myanmar people, or
because the others caused too many problems and only the
least-objectionable ones were retained?

The USE module seems to be getting a lot of attention recently. It is
chock-full of explicit codepoint numbers. Is this a list of everything
that was ever wrong once, or are all of these still wrong in the latest
UCD?

Answered by behdad

Feb 17, 2021

behdad (Maintainer) · 2021-02-16T19:44:40Z

Your resource is amazing! Great work! I'll review your comments soon. Thanks.

behdad (Maintainer) · 2021-02-17T02:09:09Z

> HarfBuzz is great! I grepped the source code for all references to U+xxxx numbers and combined that list with a scan of the first six chapters of the Unicode Standard.

Let me say it again: this is an amazing and monumental project! How much time have you spent on it so far? Would you mind if we link to it from our documentation or advertise it on Twitter?

> results

> Just search for "harfbuzz"!

> Naturally, this scan has left me with some questions:

> Does (enum UPROPS_MASK_HIDDEN) correspond to something in the Unicode
> Consortium docs? Is it correct that only Mongolian variation selectors
> have this property and not the other kinds of variation selectors?

I don't think it can be derived from the UCD, but I think you can derive it from the text of the Standard. The reason this was introduced is that the Mongolian variation selectors should NOT be "ignored" (i.e. possibly skipped over) during GSUB substitutions, but their glyph shapes, if they survive the substitution rules, should also NOT be displayed to the user. This is different from any other set of characters.

Most other Default_Ignorable codepoints, including other variation selectors, can be both ignored during substitutions and hidden from the user. We initially treated Mongolian variation selectors this way as well. Note that HarfBuzz is the only shaping engine that tries to be Unicode-compliant about Default_Ignorables, in that we skip over them instead of failing to match ligatures, etc. For most Default_Ignorable codepoints this is a good idea and produces more Unicode-compliant rendering. But for the Mongolian ones it caused fonts that work with other shaping engines to produce incorrect results with HarfBuzz. That's why we changed their behavior to their current, unique state.

Another way to put it: Uniscribe implemented them uniquely, so we had to match.
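
The distinction can be pictured as two independent flags per character: whether a matcher may skip over it, and whether its glyph is shown if it survives. The sketch below is a toy model, not HarfBuzz's code: the `IGNORABLE` and `HIDDEN` flag names, the `classify`, `match_sequence`, and `visible` helpers, and the small character sets are all hypothetical.

```python
IGNORABLE = 1  # may be skipped over while matching substitution rules
HIDDEN = 2     # glyph is not displayed if it survives substitution

# Illustrative subsets only, not complete Unicode data.
MONGOLIAN_VS = {0x180B, 0x180C, 0x180D}     # Mongolian free variation selectors
OTHER_IGNORABLE = {0x00AD, 0x2060, 0xFE00}  # soft hyphen, word joiner, VS1

def classify(cp):
    """Return the flag set for a codepoint in this toy model."""
    if cp in MONGOLIAN_VS:
        return HIDDEN               # hidden, but NOT skipped during matching
    if cp in OTHER_IGNORABLE:
        return IGNORABLE | HIDDEN   # skipped during matching and hidden
    return 0

def match_sequence(text, start, pattern):
    """Match pattern at start, skipping IGNORABLE codepoints between
    components. Returns the index just past the match, or None."""
    i = start
    for k, want in enumerate(pattern):
        # Skip ignorables only between components, never before the first.
        while k > 0 and i < len(text) and classify(text[i]) & IGNORABLE:
            i += 1
        if i >= len(text) or text[i] != want:
            return None
        i += 1
    return i

def visible(text):
    """Drop codepoints whose glyphs should not be displayed."""
    return [cp for cp in text if not classify(cp) & HIDDEN]

f, i_ = ord("f"), ord("i")
print(match_sequence([f, 0x2060, i_], 0, [f, i_]))  # word joiner is skipped
print(match_sequence([f, 0x180B, i_], 0, [f, i_]))  # Mongolian FVS blocks the match
print(visible([f, 0x180B, i_]) == [f, i_])          # but it is still hidden
```

In this model the Mongolian selectors are the only characters that carry `HIDDEN` without `IGNORABLE`, which is the unique state described above.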

> There seems to be a lot of fallback code to support fonts without GSUB
> tables. Are those still common? Perhaps HarfBuzz could offer a general
> HB_NO_OLD_FONTS config option, to subsume
> HB_NO_SHAPE_COMPLEX_HEBREW_FALLBACK and
> HB_NO_OT_SHAPE_COMPLEX_THAI_FALLBACK and the various other bits of code
> and data that support old-style fonts and might not be needed for embedded
> applications.

That's what I implemented when I joined Facebook in 2019. See:

https://github.com/harfbuzz/harfbuzz/blob/master/CONFIG.md

> Why are there four modules that have their own copies of the UTF-16 conversions and seven modules that each create and insert their own U+25CC dotted circles?

We recently merged many of those. Every shaper that enforces a syllable grammar needs to do that. The rest cannot be meaningfully merged.
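
As a rough picture of what dotted-circle insertion means, here is a deliberately simplified sketch. It is not how any HarfBuzz shaper decides this: real shapers use per-script syllable grammars, whereas the hypothetical `insert_dotted_circles` below only checks whether a combining mark has a letter or another mark immediately before it, using Unicode general categories as a stand-in.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"

def insert_dotted_circles(text):
    """Insert U+25CC before any combining mark that has no base.

    Simplification (an assumption of this sketch): a base is any
    character whose general category starts with 'L' (a letter).
    """
    out = []
    has_base = False
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            if not has_base:
                out.append(DOTTED_CIRCLE)  # give the orphaned mark a base
                has_base = True            # later marks attach to it too
            out.append(ch)
        else:
            has_base = unicodedata.category(ch).startswith("L")
            out.append(ch)
    return "".join(out)

# Devanagari KA + vowel sign I is well-formed; the vowel sign alone is not.
print(insert_dotted_circles("\u0915\u093F") == "\u0915\u093F")
print(insert_dotted_circles("\u093F") == "\u25CC\u093F")
```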

> Support for Hangul composition seems redundant between files hb-ucd.cc and
> hb-ot-shape-complex-hangul.cc. Can duplicate code be removed?

I don't think the duplication can be efficiently removed.
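
For readers unfamiliar with it, Hangul composition is the arithmetic algorithm given in the Unicode Standard (Chapter 3, conjoining jamo behavior). The sketch below follows that published algorithm; it is not copied from either HarfBuzz file, and the `compose_hangul` name is hypothetical.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total

def compose_hangul(a, b):
    """Compose two codepoints into one Hangul syllable, or return None."""
    # Leading consonant + vowel -> LV syllable.
    if L_BASE <= a < L_BASE + L_COUNT and V_BASE <= b < V_BASE + V_COUNT:
        return S_BASE + ((a - L_BASE) * V_COUNT + (b - V_BASE)) * T_COUNT
    # LV syllable + trailing consonant -> LVT syllable.
    if (S_BASE <= a < S_BASE + S_COUNT
            and (a - S_BASE) % T_COUNT == 0
            and T_BASE < b < T_BASE + T_COUNT):
        return a + (b - T_BASE)
    return None

lv = compose_hangul(0x1112, 0x1161)   # HIEUH + A
lvt = compose_hangul(lv, 0x11AB)      # + trailing NIEUN
print(hex(lv), hex(lvt))
# Cross-check against Python's own NFC normalization.
print(chr(lvt) == unicodedata.normalize("NFC", "\u1112\u1161\u11AB"))
```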

> Support for Thai seems redundant and inconsistent between files
> hb-ot-shape-complex-thai.cc and hb-ot-shape-fallback.cc. Can these be
> merged somehow?

How so? They both do "fallback positioning", but in very different ways.

> Support for Myanmar is inconsistent. Module
> hb-ot-shape-complex-myanmar.hh seems to override the general category of
> many punctuation marks, while gen-use-table.py overrides fewer. Is this
> because only the USE-overridden ones are still in use by Myanmar people, or
> because the others caused too many problems and only the
> least-objectionable ones were retained?

Myanmar has its own shaper in OpenType. So it doesn't go through USE at all. Any overlap is unintentional.

> The USE module seems to be getting a lot of attention recently. It is
> chock-full of explicit codepoint numbers. Is this a list of everything
> that was ever wrong once, or are all of these still wrong in the latest
> UCD?

We always report those to Unicode and prefer that the UCD data files be fixed. But many times Unicode refuses. A recurring argument is that "Unicode data files are not designed to be OpenType-specific". I don't agree with that assessment. USE was built on Unicode's Indic syllabic model. So if that data is not enough for USE, it means Unicode data is not enough to display encoded text.

Anyway, you can read one instance of that from last week in #2849.

Cheers

behdad (Maintainer) · 2021-02-17T21:09:26Z

> There seems to be a lot of fallback code to support fonts without GSUB
> tables. Are those still common?

Everything we implemented, we implemented because clients required it. Firefox, for example, was the first to want to use HarfBuzz instead of Uniscribe on Windows, and they needed it to handle legacy fonts that were either shipped with Windows or in common use.

We might be able to remove those over time. But since there's an easy way to disable them for size-sensitive clients, I don't see a reason to do that any time soon.

behdad (Maintainer) · 2021-02-17T22:01:52Z

> Why are there four modules that have their own copies of the UTF-16 conversions

We only do it once, in hb-utf.hh. But to interface with system APIs that use UTF-16, we have to do some conversion in hb-uniscribe, hb-directwrite, and hb-coretext. None of those is a core part of our library, and sharing doesn't seem feasible.
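
For context, the conversion in question is small. The sketch below shows the standard codepoint-to-UTF-16 encoding with surrogate pairs; it is not taken from hb-utf.hh or any of the platform backends, the `encode_utf16` name is hypothetical, and it assumes valid input (it does not reject lone surrogates or out-of-range values).

```python
def encode_utf16(codepoints):
    """Encode a list of Unicode codepoints as a list of UTF-16 code units."""
    units = []
    for cp in codepoints:
        if cp < 0x10000:
            # Basic Multilingual Plane: one code unit, unchanged.
            units.append(cp)
        else:
            # Supplementary planes: split the 20-bit offset into a
            # high (lead) and low (trail) surrogate.
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))
            units.append(0xDC00 + (offset & 0x3FF))
    return units

units = encode_utf16([0x41, 0x1F600])
print([f"{u:04X}" for u in units])
# Cross-check against Python's built-in encoder (big-endian, no BOM).
expected = "A\U0001F600".encode("utf-16-be")
print(b"".join(u.to_bytes(2, "big") for u in units) == expected)
```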
