Refactor script and language tags #730
Wow, this is amazing. Thank you! I did a quick review. Looks great. I'll do a more thorough review of the API changes and merge after making a release with existing master. |
On the generated table:
|
Can you rebase on master please? That should fix the bots. |
I don't think I like this part, but can live with it. Will do a more thorough review. I was thinking, instead of Also, we like to have |
Also, opening braces "{" go on a new line. :) Thanks. |
Also, |
Also, in the codebase itself, we use |
I’ll fix this.
I can do this, but is it necessary? It is just a switch statement on a
This requires CLDR. I’d like to do it as a separate pull request.
Sure, though I would prefer
HarfBuzz is inconsistent about this, but okay, I will change it.
That seems wasteful. My idea was that
The codebase is inconsistent. I propose that I use |
I'm not sure they do. Do they?
Sounds good.
We already have things like hb_script_from_iso15924_tag, so I prefer bcp47. Also, to me, the underscores are semantic separators, not necessarily just space replacements. We also use "codepoint", not "code point", etc.
No, not deprecated.
You are absolutely right. I started with putting it on same line. Over the years, found it unreadable and switched style. Whenever touching code, it's encouraged to update the surrounding (but not far away) to new style.
This gets compiled into application code. So next time HarfBuzz is updated to add dev3, application code would also need updating. Since we are talking about 8*4 bytes allocated on the stack, I don't think it's a problem. But sure, I'd be fine with 3.
Again, correct. Started as hb_bool_t; over the years switched to bool when C API was not getting in the way. Thanks |
Switching on the first character made To determine whether a switch statement is compiled to binary search, I ran 188e0: 81 ff 20 49 56 4c cmp edi,0x4c564920 which corresponds to 'LVI ', which would indeed be the first pivot, given this set of tags. So it’s probably doing something like binary search, at least with g++ on CentOS. After thinking about it some more, I don’t think it makes sense to rename |
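The constant in that disassembly can be checked by hand. An OpenType tag packs four ASCII bytes into a 32-bit integer, most significant byte first, which is what HarfBuzz's `HB_TAG` macro does. The sketch below reproduces that packing; the helper names `pack_tag` and `unpack_tag` are made up for this illustration and are not HarfBuzz API.

```python
def pack_tag(tag: str) -> int:
    """Pack a four-character OpenType tag into a big-endian 32-bit integer."""
    data = tag.encode("ascii")
    if len(data) != 4:
        raise ValueError("an OpenType tag is exactly four ASCII characters")
    return int.from_bytes(data, "big")


def unpack_tag(value: int) -> str:
    """Inverse of pack_tag: recover the four-character tag from the integer."""
    return value.to_bytes(4, "big").decode("ascii")
```

With these helpers, `pack_tag("LVI ")` equals `0x4C564920`, the immediate operand of the `cmp` instruction quoted above.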
Thanks. Switches seem to be compiled fine to bsearch indeed. |
You are right. |
No, the inverse is not needed, because there are already |
But you can't encode, say, deva vs dev2 distinction in the old API. |
On |
Exactly. One of the reasons this new API is being added is to be able to choose 'deva' when the font has both; using the -hbsc extension. Right? I think if the script tag passed to the reverse function is NOT the first tag for script, we should encode it in the language tag. Also, if the script is not the most likely script for the language, encode it in the language tag as well? Probably not. Ok, I'm not sure about these, but my first point about deva makes roundtripping much more likely. |
Now I see what you mean. That makes sense. |
I think I’ve made all the requested changes. I don’t know why Codacy is finding fault with the character buffer code in |
Any reason you moved Also, hardcoding max script tags to 2 is not good. We want at least 3, since dev3 etc will be added in the future. By setting it to 3 now, applications wouldn't need to be recompiled. |
I still have to sit down and do a thorough review of this. |
@dscorbett Is this ready from your point of view? I'd like to get it in before the next release, so I'll do a review tonight. |
From my point of view, this is ready. clang-everything is failing, though, because of a |
`hb_language_from_string` accepts not only ISO 639 but also BCP 47. Not all ISO 639 codes are valid BCP 47 tags but the function does not accept overlong language subtags anyway.
The old hb-ot-tag.cc functions, `hb_ot_tags_from_script` and `hb_ot_tag_from_language`, are now wrappers around a new function: `hb_ot_tags`. It converts a script and a language to arrays of script tags and language tags. This will make it easier to add new script tags to scripts, like 'dev3'. It also allows for language fallback chains; nothing produces more than one language yet though.

Where the old functions return the default tags 'DFLT' and 'dflt', `hb_ot_tags` returns an empty array. The caller is responsible for using the default tag in that case.

The new function also adds a new private use subtag syntax for script overrides: "x-hbscabcd" requests a script tag of 'abcd'.

The old hb-ot-layout.cc functions, `hb_ot_layout_table_choose_script` and `hb_ot_layout_script_find_language`, are now wrappers around the new functions `hb_ot_layout_table_select_script` and `hb_ot_layout_script_select_language`. They are essentially the same as the old ones plus a tag count parameter.

Closes #495.
`hb_ot_tags` replaces `hb_ot_tags_from_script` and `hb_ot_tag_from_language`. `hb_ot_layout_table_select_script` replaces `hb_ot_layout_table_choose_script`. `hb_ot_layout_script_select_language` replaces `hb_ot_layout_script_find_language`.
The new script, gen-tag-table.py, generates `ot_languages` automatically from the [OpenType language system tag registry][ot] and the [IANA Language Subtag Registry][bcp47] with some manual modifications. If an OpenType tag maps to a BCP 47 macrolanguage, all the macrolanguage's individual languages are mapped to the same OpenType tag, except for individual languages with their own OpenType mappings. Deprecated BCP 47 tags are canonicalized.

[ot]: https://docs.microsoft.com/en-us/typography/opentype/spec/languagetags
[bcp47]: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Some OpenType tags correspond to multiple ISO 639 codes. The mapping from ISO 639 codes lists OpenType tags in priority order, such that more specific or more likely tags appear first.

Some OpenType tags have no corresponding ISO 639 code in the registry so their mappings use BCP 47 subtags besides the language. For example, any BCP 47 tag with a fonipa variant subtag is mapped to 'IPPH', and 'IPPH' is mapped back to und-fonipa.

Other OpenType tags have no corresponding ISO 639 code because it is not clear what they are for. HarfBuzz just ignores these tags. One such ignored tag is 'ZHP ' (Chinese Phonetic). It probably means zh-Latn. However, it is used in Microsoft JhengHei and Microsoft YaHei with the script tag 'hani', implying that it is not a romanization scheme after all. It would be simple enough to add this mapping to gen-tag-table.py once a definitive mapping is determined.

The manual modifications are mainly either obvious mappings that the OpenType registry omits or mappings for compatibility with previous versions of HarfBuzz. Some of the old mappings were discarded, though, for homophonous language names. For example, OpenType maps 'KUI ' to kxu; previous versions of HarfBuzz also mapped it to kvd, because kvd and kxu both happen to be called "Kui".
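The macrolanguage rule described above can be sketched in a few lines. This is an illustration of the idea only, not the code in gen-tag-table.py: the function name `expand_macrolanguages`, the dictionary layout, and the three-entry Malay data set are assumptions chosen for the example, and the real script works from the full registries.

```python
def expand_macrolanguages(ot_by_bcp47, members_by_macrolanguage):
    """Let each individual language inherit its macrolanguage's OpenType tag.

    An individual language that already has its own OpenType mapping keeps it.
    """
    expanded = dict(ot_by_bcp47)
    for macrolanguage, members in members_by_macrolanguage.items():
        if macrolanguage not in ot_by_bcp47:
            continue  # nothing to inherit
        for member in members:
            # setdefault leaves an existing, more specific mapping untouched.
            expanded.setdefault(member, ot_by_bcp47[macrolanguage])
    return expanded


# Illustrative subset only, not copied from the generated table.
ot_by_bcp47 = {"ms": "MLY ", "id": "IND "}
members_by_macrolanguage = {"ms": ["id", "zlm", "zsm"]}
expanded = expand_macrolanguages(ot_by_bcp47, members_by_macrolanguage)
```

In this toy data, zlm and zsm inherit 'MLY ' from the macrolanguage ms, while id keeps its own 'IND ' mapping.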
gen-tag-table.py also generates a function to convert multi-subtag tags like el-polyton and zh-HK to OpenType tags, replacing `ot_languages_zh` and the hard-coded list of special cases in `hb_ot_tags_from_language`. It also generates a function to convert OpenType tags to BCP 47, replacing the hard-coded list of special cases in `hb_ot_tag_to_language`.
If the second subtag of a BCP 47 tag is three letters long, it denotes an extended language. The tag converter ignores the language subtag and uses the extended language instead. There are some grandfathered exceptions, which are handled earlier.
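A minimal sketch of that rule, assuming a well-formed tag and leaving out the grandfathered exceptions mentioned above. The function name `effective_language` is hypothetical; the actual converter is C++ and is structured differently.

```python
def effective_language(bcp47_tag: str) -> str:
    """Return the subtag that identifies the language for lookup purposes.

    If the second subtag is three letters long, it is an extended language
    and takes precedence over the primary language subtag.
    """
    subtags = bcp47_tag.lower().split("-")
    if len(subtags) > 1:
        second = subtags[1]
        # Three digits would be a UN M.49 region such as 419, so require letters.
        if len(second) == 3 and second.isascii() and second.isalpha():
            return second
    return subtags[0]
```

For example, zh-yue is looked up as yue, while zh-HK and es-419 are still looked up as zh and es, because HK is two letters and 419 is a numeric region code.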
The font supports the deprecated tag 'DHV ' instead of 'DIV '. dv is mapped to 'DIV ' and 'DHV ', in that order. The test specifies `--language=dv`, demonstrating that if a font does not support the first OpenType tag mapped to a BCP 47 tag, it will fall back to the next tag.
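The fallback the test exercises amounts to taking the first candidate tag that the font supports. The sketch below shows that selection on plain lists; `select_language` here is a made-up helper for illustration and does not mirror the signature of `hb_ot_layout_script_select_language`.

```python
def select_language(font_language_tags, candidate_tags):
    """Return the first candidate tag the font supports, or None.

    candidate_tags is in priority order, as produced by the tag mapping.
    """
    supported = set(font_language_tags)
    for tag in candidate_tags:
        if tag in supported:
            return tag
    return None  # the caller falls back to the default language system
```

With a font that lists only 'DHV ', `select_language(["DHV "], ["DIV ", "DHV "])` skips 'DIV ' and picks 'DHV '. A font that lists both yields 'DIV ', since it comes first in the mapping.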
OpenType only officially maps four ISO 639 codes to Quechua languages, but prior versions of HarfBuzz also mapped qu to 'QUZ '. Because qu is a macrolanguage, the mapping now applies to all individual Quechua languages. OpenType calls 'QUZ ' "Quechua", but it really corresponds to Cusco Quechua, so the individual Quechua languages should not all necessarily be mapped to it.
This results in a tenfold speed-up for the common case of tags that are not complex, in the sense of `hb_ot_tags_from_complex_language`.
No script has 3 tags yet, but the plan is for the Indic scripts to each get a third tag someday.
There are now a couple of deprecation warnings. It is not clear how to fix Lines 281 to 287 in 4035158
This code assumes that |
This looks REALLY good. Thank you!!! I'm merging now. We'll deal with breakages later. |
Thanks to David Corbett who revamped our script and language processing and implemented full BCP 47 support. #730

New API:
+hb_ot_layout_table_select_script()
+hb_ot_layout_script_select_language()
+HB_OT_MAX_TAGS_PER_SCRIPT
+HB_OT_MAX_TAGS_PER_LANGUAGE
+hb_ot_tags_from_script_and_language()
+hb_ot_tags_to_script_and_language()

Deprecated API:
-hb_ot_layout_table_choose_script()
-hb_ot_layout_script_find_language()
-hb_ot_tags_from_script()
-hb_ot_tag_from_language()
This code wants to reach out to 'deva' instead of 'dev2'. Basically, give it the last non-default tag. Does your new code now depend on DEFAULT being in the list to be picked up? |
I see not. |
All should be fixed now! |
Is this fixed also? |
Probably as much as we'd do. I don't think we want to expose API for it. Or maybe one or two API entries to better compare languages is fine. Getting the most likely language from a script, and scripts from a language, is still interesting API. Maybe open an issue and assign to @dscorbett? |
This pull request updates everything related to parsing script and language tags. It fixes #362.
A private use subtag beginning with “hbsc” is parsed as a script override. See #645, which this pull request subsumes.
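A rough sketch of how such a subtag could be pulled out of a language tag. The function name `script_override` is hypothetical, and the lowercasing and space padding of short values are assumptions of this example rather than documented HarfBuzz behavior.

```python
def script_override(bcp47_tag: str):
    """Return the script tag requested by an "hbsc" private use subtag, or None.

    Private use subtags are the ones that follow the "x" singleton.
    """
    subtags = bcp47_tag.lower().split("-")
    if "x" not in subtags:
        return None
    private_use = subtags[subtags.index("x") + 1:]
    for subtag in private_use:
        if subtag.startswith("hbsc") and len(subtag) > 4:
            # Assumption: pad to four characters with spaces, like other tags.
            return subtag[4:8].ljust(4)
    return None
```

For instance, hi-x-hbscdeva requests the script tag 'deva', and a tag without a private use section requests no override.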
The functions that determine the script and language of a buffer can return any number of results, not just two scripts and one language. Among the results, the map builder chooses the first script and language found in the font.
The mappings between BCP 47 and OpenType tags are automatically generated. Doing it manually has resulted in missing tags (BSK), typos (BER for BBR), and overgeneration (kvd → KUI). There are still manual overrides where necessary.
`diff <(git show language-tags~8:src/hb-ot-tag.cc) <(git show language-tags:src/hb-ot-tag-table.hh)` shows the mapping changes.