Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor script and language tags #730

Merged
merged 11 commits into from Oct 11, 2018
Merged

Refactor script and language tags #730

merged 11 commits into from Oct 11, 2018

Conversation

dscorbett
Copy link
Collaborator

@dscorbett dscorbett commented Jan 28, 2018

This pull request updates everything related to parsing script and language tags. It fixes #362.

A private use subtag beginning with “hbsc” is parsed as a script override. See #645, which this pull request subsumes.

The functions that determine the script and language of a buffer can return any number of results, not just two scripts and one language. Among the results, the map builder chooses the first script and language found in the font.

The mappings between BCP 47 and OpenType tags are automatically generated. Doing it manually has resulted in missing tags (BSK), typos (BER for BBR), and overgeneration (kvd → KUI). There are still manual overrides where necessary. diff <(git show language-tags~8:src/hb-ot-tag.cc) <(git show language-tags:src/hb-ot-tag-table.hh) shows the mapping changes.

@coveralls
Copy link

coveralls commented Mar 13, 2018

Coverage Status

Coverage decreased (-0.8%) to 65.62% when pulling 99d67a3 on dscorbett:language-tags into 93dad9a on harfbuzz:master.

@behdad
Copy link
Member

behdad commented Jul 17, 2018

Wow, this is amazing. Thank you!

I did a quick review. Looks great. I'll do a more thorough review of the API changes and merge after making a release with existing master.

@behdad
Copy link
Member

behdad commented Jul 17, 2018

On the generated table:

  • hb_ot_ambiguous_tag_to_language() should be converted to a table and bsearch().
  • I'm slightly worried about performance of hb_ot_tags_from_complex_language(). Can you at least do a switch based on the first letter of the tag to compare?

@behdad
Copy link
Member

behdad commented Jul 17, 2018

Can you rebase on master please? That should fix the bots.

@behdad
Copy link
Member

behdad commented Jul 17, 2018

Where the old functions return the default tags 'DFLT' and 'dflt',
hb_ot_tags returns an empty array. The caller is responsible for
using the default tag in that case.

I don't think I like this part, but can live with it. Will do a more thorough review.

I was thinking, instead of hb_ot_tags() name, we need hb_ot_tags_from_bcp47 and hb_ot_tags_to_bcp47. No?

Also, we like to have hb_scripts_from_language() and hb_language_likely_for_script() kind of functionality. Any way you can add those (possibly separately) as well?

@behdad
Copy link
Member

behdad commented Jul 17, 2018

Also, opening braces "{" go on a new line. :) Thanks.

@behdad
Copy link
Member

behdad commented Jul 17, 2018

Also, HB_OT_MAX_TAGS_PER_SCRIPT we want to increase since dev3 is planned. Maybe both of them set to 8?

@behdad
Copy link
Member

behdad commented Jul 17, 2018

Also, in the codebase itself, we use bool, not hb_bool_t. The latter is used mostly in the C API only.

@dscorbett
Copy link
Collaborator Author

I'm slightly worried about performance of hb_ot_tags_from_complex_language(). Can you at least do a switch based on the first letter of the tag to compare?

I’ll fix this.

hb_ot_ambiguous_tag_to_language() should be converted to a table and bsearch().

I can do this, but is it necessary? It is just a switch statement on a uint32_t, so wouldn’t it get compiled to something like binary search anyway?

we like to have hb_scripts_from_language() and hb_language_likely_for_script() kind of functionality. Any way you can add those (possibly separately) as well?

This requires CLDR. I’d like to do it as a separate pull request.

I was thinking, instead of hb_ot_tags() name, we need hb_ot_tags_from_bcp47 and hb_ot_tags_to_bcp47.

Sure, though I would prefer bcp_47 over bcp47, since it is called “BCP 47”, not “BCP47”. hb_ot_tags_to_bcp47 would just be a wrapper around hb_ot_tag_to_script and hb_ot_tag_to_language; should those be deprecated?

opening braces "{" go on a new line. :)

HarfBuzz is inconsistent about this, but okay, I will change it.

HB_OT_MAX_TAGS_PER_SCRIPT we want to increase since dev3 is planned. Maybe both of them set to 8?

That seems wasteful. My idea was that HB_OT_MAX_TAGS_PER_SCRIPT would be exactly the maximum, and that when 'dev3' was implemented, it would be increased to 3. Otherwise, you’ll get users allocating space for 8 scripts when there are only 2 or 3.

in the codebase itself, we use bool, not hb_bool_t.

The codebase is inconsistent. I propose that I use bool except when a function is a wrapper around, or a close analogue of, an existing function that uses hb_bool_t.

@behdad
Copy link
Member

behdad commented Jul 18, 2018

hb_ot_ambiguous_tag_to_language() should be converted to a table and bsearch().

I can do this, but is it necessary? It is just a switch statement on a uint32_t, so wouldn’t it get compiled to something like binary search anyway?

I'm not sure they do. Do they?

we like to have hb_scripts_from_language() and hb_language_likely_for_script() kind of functionality. Any way you can add those (possibly separately) as well?

This requires CLDR. I’d like to do it as a separate pull request.

Sounds good.

I was thinking, instead of hb_ot_tags() name, we need hb_ot_tags_from_bcp47 and hb_ot_tags_to_bcp47.

Sure, though I would prefer bcp_47 over bcp47, since it is called “BCP 47”, not “BCP47”.

We already have things like hb_script_from_iso15924_tag, so I prefer bcp47. Also, to me, the underscores are semantic separators, not necessarily just space replacements. We also use "codepoint", not "code point", etc.

hb_ot_tags_to_bcp47 would just be a wrapper around hb_ot_tag_to_script and hb_ot_tag_to_language; should those be deprecated?

No, not deprecated.

opening braces "{" go on a new line. :)

HarfBuzz is inconsistent about this, but okay, I will change it.

You are absolutely right. I started with putting it on same line. Over the years, found it unreadable and switched style. Whenever touching code, it's encouraged to update the surrounding (but not far away) to new style.

HB_OT_MAX_TAGS_PER_SCRIPT we want to increase since dev3 is planned. Maybe both of them set to 8?

That seems wasteful. My idea was that HB_OT_MAX_TAGS_PER_SCRIPT would be exactly the maximum, and that when 'dev3' was implemented, it would be increased to 3. Otherwise, you’ll get users allocating space for 8 scripts when there are only 2 or 3.

This gets compiled into application code. So next time HarfBuzz is updated to add dev3, application code would also need updating. Since we are talking about 8*4 bytes allocated on the stack, I don't think it's a problem. But sure, I'd be fine with 3.

in the codebase itself, we use bool, not hb_bool_t.

The codebase is inconsistent. I propose that I use bool except when a function is a wrapper around, or a close analogue of, an existing function that uses hb_bool_t.

Again, correct. Started as hb_bool_t; over the years switched to bool when C API was not getting in the way.

Thanks

@dscorbett
Copy link
Collaborator Author

Switching on the first character made hb_ot_tags_from_complex_language much faster. "en" went from an average of 0.586 μs per call to 0.056 μs. "cdo" got a little worse (from 0.588 μs to 0.687 μs) but it is worth it. I’ll push this soon.

To determine whether a switch statement is compiled to binary search, I ran objdump -rwCS -Mintel libharfbuzz.so.0. The first line of hb_ot_ambiguous_tag_to_language is:

   188e0:   81 ff 20 49 56 4c       cmp    edi,0x4c564920

which corresponds to 'LVI ', which would indeed be the first pivot, given this set of tags. So it’s probably doing something like binary search, at least with g++ on CentOS.

After thinking about it some more, I don’t think it makes sense to rename hb_ot_tags to hb_ot_tags_from_bcp47. The function takes an hb_language_t and an hb_script_t. An hb_language_t is a wrapper around a BCP 47 string. The name hb_ot_tags_from_bcp47 would be synonymous with hb_ot_tags_from_language, which is confusingly similar to the existing hb_ot_tag_from_language, and would ignore the script parameter. How about hb_ot_tags_from_script_and_language?

@behdad
Copy link
Member

behdad commented Jul 23, 2018

Thanks. Switches seem to be compiled fine to bsearch indeed.

@behdad
Copy link
Member

behdad commented Jul 23, 2018

After thinking about it some more, I don’t think it makes sense to rename hb_ot_tags to hb_ot_tags_from_bcp47. The function takes an hb_language_t and an hb_script_t. An hb_language_t is a wrapper around a BCP 47 string. The name hb_ot_tags_from_bcp47 would be synonymous with hb_ot_tags_from_language, which is confusingly similar to the existing hb_ot_tag_from_language, and would ignore the script parameter. How about hb_ot_tags_from_script_and_language?

You are right. hb_ot_tags_from_script_and_language sounds great. Do we need the inverse?

@dscorbett
Copy link
Collaborator Author

No, the inverse is not needed, because there are already hb_ot_tag_to_script and hb_ot_tag_to_language. It might be convenient to include it though.

@behdad
Copy link
Member

behdad commented Jul 23, 2018

No, the inverse is not needed, because there are already hb_ot_tag_to_script and hb_ot_tag_to_language. It might be convenient to include it though.

But you can't encode, say, deva vs dev2 distinction in the old API.

@dscorbett
Copy link
Collaborator Author

On master, hb_ot_tag_to_script returns HB_SCRIPT_DEVANAGARI for both 'deva' and 'dev2', so I am not sure what you mean.

@behdad
Copy link
Member

behdad commented Jul 23, 2018

On master, hb_ot_tag_to_script returns HB_SCRIPT_DEVANAGARI for both 'deva' and 'dev2', so I am not sure what you mean.

Exactly. One of the reasons this new API is being added is to be able to choose 'deva' when the font has both; using the -hbsc extension. Right? I think if the script tag passed to the reverse function is NOT the first tag for script, we should encode it in the language tag.

Also, if the script is not the most likely script for the language, encode it in the language tag as well? Probably not.

Ok, I'm not sure about these, but my first point about deva makes roundtripping much more likely.

@dscorbett
Copy link
Collaborator Author

Now I see what you mean. That makes sense.

@harfbuzz harfbuzz deleted a comment Jul 24, 2018
@harfbuzz harfbuzz deleted a comment Jul 24, 2018
@dscorbett
Copy link
Collaborator Author

I think I’ve made all the requested changes. I don’t know why Codacy is finding fault with the character buffer code in hb_ot_tags_to_script_and_language.

@harfbuzz harfbuzz deleted a comment Jul 24, 2018
@harfbuzz harfbuzz deleted a comment Jul 24, 2018
@harfbuzz harfbuzz deleted a comment Jul 24, 2018
@harfbuzz harfbuzz deleted a comment Jul 24, 2018
@harfbuzz harfbuzz deleted a comment Jul 24, 2018
@harfbuzz harfbuzz deleted a comment Jul 24, 2018
@harfbuzz harfbuzz deleted a comment Jul 24, 2018
@harfbuzz harfbuzz deleted a comment Jul 24, 2018
@harfbuzz harfbuzz deleted a comment Jul 27, 2018
@harfbuzz harfbuzz deleted a comment Jul 27, 2018
@behdad
Copy link
Member

behdad commented Sep 10, 2018

I think I’ve made all the requested changes.

Any reason you moved hb_ot_tags_from_script() from hb-ot-tag.h to hb-ot-layout.h?

Also, hardcoding max script tags to 2 is not good. We want at least 3, since dev3 etc will be added in the future. By setting it to 3 now, applications wouldn't need to be recompiled.

@behdad
Copy link
Member

behdad commented Sep 10, 2018

I still have to sit down and do a thorough review of this.

@behdad
Copy link
Member

behdad commented Oct 8, 2018

@dscorbett Is this ready from your point of view? I like to get it in before next release, so I'll do a review tonight.

@dscorbett
Copy link
Collaborator Author

From my point of view, this is ready. clang-everything is failing, though, because of a -Wunreachable-code-return violation. I’ll fix that.

`hb_language_from_string` accepts not only ISO 639 but also BCP 47. Not
all ISO 639 codes are valid BCP 47 tags but the function does not accept
overlong language subtags anyway.
The old hb-ot-tag.cc functions, `hb_ot_tags_from_script` and
`hb_ot_tag_from_language`, are now wrappers around a new function:
`hb_ot_tags`. It converts a script and a language to arrays of script
tags and language tags. This will make it easier to add new script tags
to scripts, like 'dev3'. It also allows for language fallback chains;
nothing produces more than one language yet though.

Where the old functions return the default tags 'DFLT' and 'dflt',
`hb_ot_tags` returns an empty array. The caller is responsible for
using the default tag in that case.

The new function also adds a new private use subtag syntax for script
overrides: "x-hbscabcd" requests a script tag of 'abcd'.

The old hb-ot-layout.cc functions,`hb_ot_layout_table_choose_script` and
`hb_ot_layout_script_find_language` are now wrappers around the new
functions `hb_ot_layout_table_select_script` and
`hb_ot_layout_script_select_language`. They are essentially the same as
the old ones plus a tag count parameter.

Closes #495.
`hb_ot_tags` replaces `hb_ot_tags_from_script` and
`hb_ot_tag_from_language`.

`hb_ot_layout_table_select_script` replaces
`hb_ot_layout_table_choose_script`.

`hb_ot_layout_script_select_language` replaces
`hb_ot_layout_script_find_language`.
The new script, gen-tag-table.py, generates `ot_languages` automatically
from the [OpenType language system tag registry][ot] and the [IANA
Language Subtag Registry][bcp47] with some manual modifications. If an
OpenType tag maps to a BCP 47 macrolanguage, all the macrolanguage's
individual languages are mapped to the same OpenType tag, except for
individual languages with their own OpenType mappings. Deprecated
BCP 47 tags are canonicalized.

[ot]: https://docs.microsoft.com/en-us/typography/opentype/spec/languagetags
[bcp47]: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Some OpenType tags correspond to multiple ISO 639 codes. The mapping
from ISO 639 codes lists OpenType tags in priority order, such that more
specific or more likely tags appear first.

Some OpenType tags have no corresponding ISO 639 code in the registry so
their mappings use BCP 47 subtags besides the language. For example, any
BCP 47 tag with a fonipa variant subtag is mapped to 'IPPH', and 'IPPH'
is mapped back to und-fonipa.

Other OpenType tags have no corresponding ISO 639 code because it is not
clear what they are for. HarfBuzz just ignores these tags.

One such ignored tag is 'ZHP ' (Chinese Phonetic). It probably means
zh-Latn. However, it is used in Microsoft JhengHei and Microsoft YaHei
with the script tag 'hani', implying that it is not a romanization
scheme after all. It would be simple enough to add this mapping to
gen-tag-table.py once a definitive mapping is determined.

The manual modifications are mainly either obvious mappings that the
OpenType registry omits or mappings for compatibility with previous
versions of HarfBuzz. Some of the old mappings were discarded, though,
for homophonous language names. For example, OpenType maps 'KUI ' to
kxu; previous versions of HarfBuzz also mapped it to kvd, because kvd
and kxu both happen to be called "Kui".

gen-tag-table.py also generates a function to convert multi-subtag tags
like el-polyton and zh-HK to OpenType tags, replacing `ot_languages_zh`
and the hard-coded list of special cases in `hb_ot_tags_from_language`.
It also generates a function to convert OpenType tags to BCP 47,
replacing the hard-coded list of special cases in
`hb_ot_tag_to_language`.
If the second subtag of a BCP 47 tag is three letters long, it denotes
an extended language. The tag converter ignores the language subtag and
uses the extended language instead.

There are some grandfathered exceptions, which are handled earlier.
The font supports the deprecated tag 'DHV ' instead of 'DIV '. dv is
mapped to 'DIV ' and 'DHV ', in that order. The test specifies
`--language=dv`, demonstrating that if a font does not support the first
OpenType tag mapped to a BCP 47 tag, it will fall back to the next tag.
OpenType only officially maps four ISO 639 codes to Quechua languages,
but prior versions of HarfBuzz also mapped qu to 'QUZ '. Because qu is a
macrolanguage, the mapping now applies to all individual Quechua
languages. OpenType calls 'QUZ ' "Quechua", but it really corresponds to
Cusco Quechua, so the individual Quechua languages should not all
necessarily be mapped to it.
This results in a tenfold speed-up for the common case of tags that are
not complex, in the sense of `hb_ot_tags_from_complex_language`.
No script has 3 tags yet, but the plan is for the Indic scripts to each
get a third tag someday.
@dscorbett
Copy link
Collaborator Author

There are now a couple of deprecation warnings. It is not clear how to fix _hb_graphite2_shape, which calls hb_ot_tags_from_script, which I deprecated.

hb_ot_tags_from_script (hb_buffer_get_script (buffer), &script_tag[0], &script_tag[1]);
seg = gr_make_seg (nullptr, grface,
script_tag[1] == HB_TAG_NONE ? script_tag[0] : script_tag[1],
feats,
gr_utf32, chars, buffer->len,
2 | (hb_buffer_get_direction (buffer) == HB_DIRECTION_RTL ? 1 : 0));

This code assumes that script_tag[1] might be HB_TAG_NONE, but hb_ot_tags_from_script actually uses HB_OT_TAG_DEFAULT_SCRIPT as the default tag instead. It seems there is a bug here, but I don’t know what this code is supposed to do, so I am wary of fixing it.

@behdad
Copy link
Member

behdad commented Oct 11, 2018

This looks REALLY good. Thank you!!! I'm merging now. We'll deal with breakages later.

@behdad behdad merged commit 66790d6 into harfbuzz:master Oct 11, 2018
behdad added a commit that referenced this pull request Oct 11, 2018
Thanks to David Corbett who revamped our script and language processing
and implemented full BCP 47 support.

#730

New API:
+hb_ot_layout_table_select_script()
+hb_ot_layout_script_select_language()
+HB_OT_MAX_TAGS_PER_SCRIPT
+HB_OT_MAX_TAGS_PER_LANGUAGE
+hb_ot_tags_from_script_and_language()
+hb_ot_tags_to_script_and_language()

Deprecated API:
-hb_ot_layout_table_choose_script()
-hb_ot_layout_script_find_language()
-hb_ot_tags_from_script()
-hb_ot_tag_from_language()
@behdad
Copy link
Member

behdad commented Oct 11, 2018

This code assumes that script_tag[1] might be HB_TAG_NONE, but hb_ot_tags_from_script actually uses HB_OT_TAG_DEFAULT_SCRIPT as the default tag instead. It seems there is a bug here, but I don’t know what this code is supposed to do, so I am wary of fixing it.

This code want to reach out to 'deva' instead of 'dev2'. Basically, give it the last non-default tag.

Does your new code now depend on DEFAULT being in the list to be picked up?

@behdad
Copy link
Member

behdad commented Oct 11, 2018

Does your new code now depend on DEFAULT being in the list to be picked up?

I see not.

behdad added a commit that referenced this pull request Oct 11, 2018
@behdad
Copy link
Member

behdad commented Oct 11, 2018

All should be fixed now!

@dscorbett dscorbett deleted the language-tags branch October 11, 2018 19:52
@ebraminio
Copy link
Collaborator

ebraminio commented Oct 16, 2018

Is this fixed also?

@behdad
Copy link
Member

behdad commented Oct 16, 2018

Is this fixed also?

Probably as much as we'd do. I don't think we want to expose API for it. Or maybe one or two API entries to better compare languages is fine.

Getting most-likelly language from script, and scripts from language is still interesting API. Maybe open an issue and assign to @dscorbett ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Erroneous conversion of fonipa to IPPH
4 participants