Skip to content

Conversation

behdad
Copy link
Member

@behdad behdad commented Oct 16, 2018

Work in progress.

Fixes #432

@behdad behdad force-pushed the name-table branch 2 times, most recently from 8eb4f4b to bbe1ae8 Compare October 23, 2018 12:47
@behdad
Copy link
Member Author

behdad commented Oct 24, 2018

I fleshed most of this out.

What's left is 1. language handling and 2. encoding handling. The UTF conversion routines can use some review. @jfkthame Thanks in advance!

It assumes all names are encoded in UTF16-BE.  Other than that, and not
listing languages correctly, it's *supposed* to work.
Accept Mac Latin name entries as ASCII as well.
@behdad
Copy link
Member Author

behdad commented Oct 24, 2018

I'm done with the implementation. FULLY untested though... Going to write a tool to list names in a font...

@behdad
Copy link
Member Author

behdad commented Oct 24, 2018

Added the enum. Should I change hb_ot_name_t to be the enum type, or stay "unsigned int"?

We already have API that returns nameIDs using "hb_name_id_t *out". So, this is not the safest change. But that API is also unused by everyone right now.

We already define hb_script_t as an enum where we expect non-enumerated values to be used as well. That's legitimate in C. Not sure about C++. nameIDs are slightly different in that non-enumerated ones are more common?

@behdad
Copy link
Member Author

behdad commented Oct 24, 2018

I think I like to rename hb_name_id_t to hb_ot_name_id_t. That's a API-breakage, but since no one uses that API so far and it's very recent, I don't think it's an issue.

@behdad
Copy link
Member Author

behdad commented Oct 24, 2018

Maybe we don't care about subtags. It's not clear what the desired behavior is... Is good if we add subtags that distinguish script. But for regions, say, if user asks for fa-af and there's only fa available we should return it.

@behdad
Copy link
Member Author

behdad commented Oct 25, 2018

if user asks for fa-af and there's only fa available we should return it.

This is similar to what we do for ot-tags already. Do we need a couple functions for match-level of two language tags? Or just implement internally here?

@brawer
Copy link
Contributor

brawer commented Oct 25, 2018

If HarfBuzz really wants to do language matching (“language subtag handling”), have a look at the Go implementation which is quite cleanly written, and works on small data tables. See blog post and source code. ICU has an implementation too, which is more involved than Go. But my personal recommendation would be to stay away from this; it seems to go beyond the scope of HarfBuzz, and getting it correct is not trivial.

@brawer
Copy link
Contributor

brawer commented Oct 25, 2018

Regarding ltag, note that the table may contain a mix of language and script codes (according to the spec, and also in practice in Apple’s shipping fonts). Thus, if an ltag entry consists of four letters, it needs to be prefixed with und- to form a valid BCP47 code. For example, if ltag contains Hans, it should be converted to und-Hans, whereas sr-Latn needs of course no prefix. Not sure if that’s already handled, but I couldn’t see it in the current code.

@behdad
Copy link
Member Author

behdad commented Oct 25, 2018

Regarding ltag, note that the table may contain a mix of language and script codes (according to the spec, and also in practice in Apple’s shipping fonts). Thus, if an ltag entry consists of four letters, it needs to be prefixed with und- to form a valid BCP47 code. For example, if ltag contains Hans, it should be converted to und-Hans, whereas sr-Latn needs of course no prefix. Not sure if that’s already handled, but I couldn’t see it in the current code.

It doesn't. This is where it should be handled: https://github.com/harfbuzz/harfbuzz/pull/1254/files#diff-fc989e6dfa921e399954f43b75b091c0R68

@behdad
Copy link
Member Author

behdad commented Oct 25, 2018

Regarding ltag, note that the table may contain a mix of language and script codes (according to the spec, and also in practice in Apple’s shipping fonts). Thus, if an ltag entry consists of four letters, it needs to be prefixed with und- to form a valid BCP47 code. For example, if ltag contains Hans, it should be converted to und-Hans, whereas sr-Latn needs of course no prefix. Not sure if that’s already handled, but I couldn’t see it in the current code.

It doesn't. This is where it should be handled: https://github.com/harfbuzz/harfbuzz/pull/1254/files#diff-fc989e6dfa921e399954f43b75b091c0R68

If HarfBuzz really wants to do language matching (“language subtag handling”), have a look at the Go implementation which is quite cleanly written, and works on small data tables. See blog post and source code. ICU has an implementation too, which is more involved than Go. But my personal recommendation would be to stay away from this; it seems to go beyond the scope of HarfBuzz, and getting it correct is not trivial.

Thanks for the links. Okay, I'm convinced we don't want to be in this business. I know Akira TAGOH wrote a library for this, can't find it right now.

@brawer
Copy link
Contributor

brawer commented Oct 25, 2018

Hm, the ltag spec actually refers to standard BCP47; the non-standard omission of the language subtag is described in the Meta spec. Still, I believe I've encountered this in actual fonts. Anyhow, I'll give it a try, thanks for the pointer to the code.

@behdad
Copy link
Member Author

behdad commented Oct 25, 2018

the non-standard omission of the language subtag is described in the Meta spec.

Yeah that looks different.

@khaledhosny
Copy link
Collaborator

I know Akira TAGOH wrote a library for this, can't find it right now.

https://bitbucket.org/tagoh/liblangtag/overview

@brawer
Copy link
Contributor

brawer commented Oct 25, 2018

the non-standard omission of the language subtag is described in the Meta spec.

Just checked the system fonts on macOS 10.14. They contain the following language tags in ltag tables: en, fa, pl, and zh-Hant. All in standard BCP-47 syntax; sorry about my confusion of ltag with Meta.

@khaledhosny
Copy link
Collaborator

O think treating null language as “en” makes sense as the English not the localized names is what most applications want by default so makes the common case slightly simpler.

@khaledhosny
Copy link
Collaborator

We can make hb-name tool that takes name id (or even human readable aliases) and language and returns the name, just like the API do. It would also have an option to dumb all the names like what test-name-table is doing now.

@behdad
Copy link
Member Author

behdad commented Oct 25, 2018

O think treating null language as “en” makes sense as the English not the localized names is what most applications want by default so makes the common case slightly simpler.

Agreed. Done.

Also need docs for new API and hooking them up in the docs.

@behdad
Copy link
Member Author

behdad commented Oct 25, 2018

We can make hb-name tool that takes name id (or even human readable aliases) and language and returns the name, just like the API do. It would also have an option to dumb all the names like what test-name-table is doing now.

Right. Any idea what to do with newlines in the name?

@behdad
Copy link
Member Author

behdad commented Oct 27, 2018

I merged this manually. The cmdline tool can come later.

@behdad behdad closed this Oct 27, 2018
@ebraminio ebraminio deleted the name-table branch October 27, 2018 07:34
@behdad
Copy link
Member Author

behdad commented Oct 27, 2018

I was going to make a release, but had to make some last minute changes to this. And ended up making hb_name_id_t be an enum instead of unsigned int. @khaledhosny is that going to be a problem for any code you have written? Have you used hb_ot_layout_get_size_params anywhere? If yes, can you test if they compile fine? They should if they are C, but if C++, would need change. I'm open to reverting this. But it's much cleaner this way. Thanks.

d941f66

@behdad
Copy link
Member Author

behdad commented Oct 30, 2018

One last question before this goes into a release. Should I rename hb_name_id_t to hb_ot_name_id_t? It affects two APIs that went out in 2.0.0, but I'm sure they are unused.

@khaledhosny
Copy link
Collaborator

Is there anything OT specific about the name ids (not sure I fully grasp the OT/non-OT split in the API naming).

@behdad
Copy link
Member Author

behdad commented Oct 30, 2018

Is there anything OT specific about the name ids (not sure I fully grasp the OT/non-OT split in the API naming).

Do those name-id numbers make sense for Type1, BDF, PCF, ...? No.

@khaledhosny
Copy link
Collaborator

Right, in my mind non-OT meant AAT/Graphite etc. I have no strong opinion either way, but probably hb_ot_name_id_t makes more sense.

@behdad
Copy link
Member Author

behdad commented Oct 30, 2018

Right, in my mind non-OT meant AAT/Graphite etc.

We refer to GSUB/GPOS as ot-layout. To the script logic as ot-shape. Use OT as a encompassing of OpenType and extensions...

@jfkthame
Copy link
Collaborator

Agree that hb_ot_name.... seems appropriate.

behdad added a commit that referenced this pull request Oct 30, 2018
ebraminio pushed a commit to ebraminio/harfbuzz that referenced this pull request Nov 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants