Add rough first draft of script matching doc #1

Merged (2 commits) on Mar 5, 2019

raphlinus
Contributor

No description provided.

@raphlinus
Contributor Author

raphlinus commented Mar 4, 2019

Rendered.

I'm very open to ideas regarding open questions and TODO items.

@raphlinus
Contributor Author

@jfkthame I'd love to get your input on this, even if just a few pointers of where to look in the Gecko codebase.

@RazrFalcon

As a simple note, we can't rely on fontconfig alone on Linux, because KDE doesn't modify fontconfig's configuration by default. Instead, it stores its font settings in ~/.config/kdeglobals, so we have to handle this too.

This is how Qt does this.
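
To make the kdeglobals point concrete, here is a minimal sketch of pulling a family name out of that file's contents. It assumes the general font lives under a `[General]` group as a `font=` entry whose first comma-separated field is the family name; that layout and the function name `kde_general_font_family` are assumptions for illustration, not something confirmed in this thread.

```rust
/// Extract the family name from the `font=` entry of the `[General]` group
/// in kdeglobals-style INI text.
/// Assumption: the value is a comma-separated font description whose first
/// field is the family name.
fn kde_general_font_family(contents: &str) -> Option<String> {
    let mut in_general = false;
    for line in contents.lines() {
        let line = line.trim();
        // A group header such as `[General]` switches the current group.
        if line.starts_with('[') && line.ends_with(']') {
            in_general = line == "[General]";
            continue;
        }
        if !in_general {
            continue;
        }
        if let Some((key, value)) = line.split_once('=') {
            if key.trim() == "font" {
                // Keep only the first comma-separated field.
                let family = value.split(',').next()?.trim();
                if !family.is_empty() {
                    return Some(family.to_string());
                }
            }
        }
    }
    None
}
```

A caller would read `~/.config/kdeglobals` into a string and consult this before, or alongside, whatever fontconfig reports.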

@jfkthame

jfkthame commented Mar 4, 2019

For Gecko, see gfxFontGroup::FindFontForChar, which in turn will call into WhichPrefFontSupportsChar and WhichSystemFontSupportsChar.

Note that the structure of the Gecko font preferences is pretty ancient, with roots in the old world of multiple 8-bit and double-byte codepages for different "language groups", and could really use an extensive rewrite...

@raphlinus
Contributor Author

@jfkthame Thanks, that's useful, but I find myself still mystified by where, in particular, Han unification logic happens. It seems like it should be in WhichSystemFontSupportsChar (as that takes an aRunScript argument), but when I drill down, I can't find any actual Han unification logic: GetCommonFallbackFonts seems not to cover CJK (except for plane 2 astral), and PlatformGlobalFontFallback seems to just drop aRunScript. I can keep digging, but maybe you know off the top of your head?

@jfkthame

jfkthame commented Mar 4, 2019

You probably want to look at WhichPrefFontSupportsChar, as that's where whichever CSS generic is applicable will be mapped to a font family from the (user-configurable) prefs. It'll look up a "unicode range" for the character, and then map this through gfxPlatformFontList::GetFontPrefLangFor and gfxPlatformFontList::GetLangPrefs to determine which set of prefs to use.

So in most cases, if a CJK font hasn't been explicitly named, this is where it'll get selected. Only if the font specified via prefs doesn't cover the character in question will we end up in WhichSystemFontSupportsChar.
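
The ordering described here (preference fonts first, system fallback only when none of them covers the character) can be sketched roughly as follows. This is not Gecko's code: the `Font` type, its `coverage` ranges, and `find_font_for_char` are hypothetical names used only to show the two-stage lookup.

```rust
use std::ops::RangeInclusive;

/// Hypothetical font record: a family name plus the code point ranges it covers.
struct Font {
    family: &'static str,
    coverage: Vec<RangeInclusive<char>>,
}

impl Font {
    fn supports(&self, c: char) -> bool {
        self.coverage.iter().any(|range| range.contains(&c))
    }
}

/// Two-stage lookup: try the fonts selected via prefs first, and consult the
/// system-wide fallback list only if none of them covers the character.
fn find_font_for_char<'a>(
    c: char,
    pref_fonts: &'a [Font],
    system_fonts: &'a [Font],
) -> Option<&'a Font> {
    pref_fonts
        .iter()
        .find(|font| font.supports(c))
        .or_else(|| system_fonts.iter().find(|font| font.supports(c)))
}
```

In the real code the preference list would itself depend on the language group chosen for the character, which this sketch leaves out.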

@raphlinus
Contributor Author

Ok, that's helpful, though I've got to say it's not easy to figure out what's going on from reading the code.

However, having come across the bug "implement font cascading for system fonts under OSX", it seems like this might be the answer I'm looking for: CTFontCopyDefaultCascadeListForLanguages. That linked bug identifies a few problems with the approach, but I'm wondering whether I should be pursuing this or trying to replicate what Gecko does.

And after a little more digging, I found the source of truth for that: lang-tags in the font.name-list settings, with a "hardcoded" list of fonts from platform-specific #ifdefs in libpref/init/all.js. It's not really hardcoded because these can be changed, but I think it's a good bet that 99.99% of users won't touch those. This raises the requirements question of whether this needs to be configurable, or whether we can count on font-kit to get the correct information out of the system.

More background and information, mostly from investigations into Blink,
Gecko, and Qt.
@raphlinus
Contributor Author

I've added significant new content, based on investigations of Gecko and Qt. This certainly seems to be a complicated problem domain. Again, feedback is welcome!


### Fontconfig

The [Fontconfig] configuration file format specifies "langset" as an "RFC-3066-style" language. [RFC 3066] (dated 2001) is a predecessor to BCP-47, and basically specifies language and country, with no provision for explicit script or variant. For the purpose of Han unification, the convention is to infer script from country. For example, "zh-CN" could be translated to "zh-Hans", "zh-TW" to "zh-Hant". However, after a little investigation, it's not clear to me how useful it is to do sophisticated processing here, as the default fontconfig on a clean Debian 9 install doesn't specify "langset" attributes, but just has a few informal descriptions in comments.
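
As a rough illustration of the "infer script from country" convention, the sketch below maps only the two examples given above. The function name `infer_han_script` is hypothetical, and accepting `_` as a separator and ignoring case are assumptions about the input, not fontconfig behavior.

```rust
/// Map an RFC-3066-style Chinese tag to a script-qualified tag.
/// Only the two mappings mentioned in the text are covered; anything else
/// returns None so the caller can decide how to fall back.
fn infer_han_script(tag: &str) -> Option<&'static str> {
    // Assumption: both `-` and `_` may appear as separators in the input.
    let mut parts = tag.split(|c: char| c == '-' || c == '_');
    let language = parts.next()?;
    if !language.eq_ignore_ascii_case("zh") {
        return None;
    }
    let region = parts.next()?;
    match region.to_ascii_uppercase().as_str() {
        "CN" => Some("zh-Hans"),
        "TW" => Some("zh-Hant"),
        _ => None,
    }
}
```

A fuller table would need entries for other regions, which is deliberately left out here.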

IIRC a lot of systems don't handle this so everyone just uses CN and TW. Not sure about font systems specifically.

I have had hans vs hant trigger differences in browsers

Contributor Author

I think this is just the march of progress, and hopefully in a few decades the use of country to represent script will fade away.


I strongly recommend the use of BCP-47 as the identifier for language, script, and other locale metadata. This is an easy decision for web use cases, as it is the standard for the [lang] tag. The main challenge is that mechanisms for system font metadata in general predate BCP-47, so there will be some impedance matching.
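
For a sense of what a BCP-47 representation needs to carry, here is a deliberately simplified sketch that splits a tag into language, optional script, and optional region. `LangTag` and `parse_lang_tag` are hypothetical names rather than an existing crate's API, and extended language subtags, variants, extensions, and private-use subtags are ignored.

```rust
#[derive(Debug, PartialEq)]
struct LangTag {
    language: String,
    script: Option<String>,
    region: Option<String>,
}

/// Simplified parse of `language[-Script][-REGION]`; trailing subtags are ignored.
fn parse_lang_tag(tag: &str) -> Option<LangTag> {
    let is_alpha = |s: &str| s.chars().all(|c| c.is_ascii_alphabetic());
    let is_digit = |s: &str| s.chars().all(|c| c.is_ascii_digit());

    let mut subtags = tag.split('-').peekable();
    // Simplification: accept only 2- or 3-letter primary language subtags.
    let language = subtags.next()?;
    if !(2..=3).contains(&language.len()) || !is_alpha(language) {
        return None;
    }

    let mut script = None;
    if let Some(&s) = subtags.peek() {
        // Script subtags are four letters, conventionally title-cased.
        if s.len() == 4 && is_alpha(s) {
            let mut titled = s.to_ascii_lowercase();
            titled[..1].make_ascii_uppercase();
            script = Some(titled);
            subtags.next();
        }
    }

    let mut region = None;
    if let Some(&r) = subtags.peek() {
        // Region subtags are two letters or three digits.
        if (r.len() == 2 && is_alpha(r)) || (r.len() == 3 && is_digit(r)) {
            region = Some(r.to_ascii_uppercase());
            subtags.next();
        }
    }

    Some(LangTag { language: language.to_ascii_lowercase(), script, region })
}
```

With this shape, "zh-Hant-TW" keeps its explicit script, while an older-style "zh-CN" parses with `script: None` and leaves the script to be inferred from the region.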

TODO: investigate Rust ecosystem for common BCP-47 tag representation.

cc @zbraniecki

Pretty sure fluent-rs needs this too, worth knowing what y'all are using

Contributor Author

Yeah, I'm curious what work is happening. I did just a bit of searching, didn't find any clear consensus on what people are doing, so would be very happy to hear from any efforts in this space.

I'm using a crate fluent-locale which provides the basic BCP47 locale management and negotiation.

I'm hoping for a Locale class to be added to unic, but that is going a bit slower and awaits open-i18n/rust-unic#195 (comment)

@raphlinus raphlinus merged commit 28c3c61 into master Mar 5, 2019
@raphlinus raphlinus deleted the script_matching_doc branch March 5, 2019 05:37