[gen-metadata] More aggressively declare subsets in METADATA files #602
fonts.google.com launched a new specimen page last week that reuses the improvements developed last year for the fonts.google.com/noto section specimen pages.
However, this has highlighted a problem with Kumbh Sans and Nabla, where they had Google Fonts Latin Plus glyph set support, but their METADATA files didn't declare the
In google/fonts#5095 I add the math subset to the 2 families where we've noticed this issue surface. I believe the root cause is that the METADATA generator has a threshold for when it auto-adds subsets to a METADATA file: when a font has only 1% of the math subset's characters, the subset isn't added, but when 100% of the latin-core subset is in the font, it is. I forget the details of the threshold; I think it was set per subset.
It seems that we probably ought to be more aggressive about declaring subsets: if even 1 character from a subset exists in any font in the family, that subset should be declared in METADATA, with comments recording the % the tool found to inform later hand curation of the METADATA file.
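A minimal sketch of that proposal, for discussion only. It is not the METADATA generator's actual code: the function names, the toy `SUBSETS` table, and the comment wording are all hypothetical, and in practice the family's codepoints would be gathered from each font's cmap (for example with fontTools' `TTFont(path).getBestCmap()`).

```python
from typing import Dict, List, Set

# Toy subset definitions (hypothetical; the real ones live in the glyphsets encodings).
SUBSETS: Dict[str, Set[int]] = {
    "latin": set(range(0x41, 0x5B)),           # A-Z only, for illustration
    "math": {0x2212, 0x221E, 0x2248, 0x2260},  # − ∞ ≈ ≠
}


def subset_coverage(family_cps: Set[int], subsets: Dict[str, Set[int]]) -> Dict[str, float]:
    """Fraction (0.0-1.0) of each subset's codepoints present in the family."""
    return {
        name: len(family_cps & cps) / len(cps)
        for name, cps in subsets.items()
        if cps  # skip empty subsets to avoid dividing by zero
    }


def declare_subsets(
    family_cps: Set[int],
    subsets: Dict[str, Set[int]],
    min_coverage: float = 0.0,
) -> List[str]:
    """Emit one METADATA-style line per subset whose coverage exceeds min_coverage.

    min_coverage=0.0 is the "aggressive" behaviour: one shared character is enough.
    A higher value mimics a threshold (assumed here to be a simple fraction).
    """
    lines = []
    for name, cov in sorted(subset_coverage(family_cps, subsets).items()):
        if cov > min_coverage:
            lines.append(f'subsets: "{name}"  # {cov:.0%} of subset found in family')
    return lines


# Union of codepoints across every font in a (made-up) family.
family = set(range(0x41, 0x5B)) | {0x2212}
for line in declare_subsets(family, SUBSETS):
    print(line)
```

With `min_coverage=0.0` both subsets are declared and the trailing comment records how thin the math coverage is; passing something like `min_coverage=0.5` drops math again, which is roughly the behaviour that hid the problem.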
We probably then need to make a special version of the METADATA generator to lint the whole existing library, and then PR updates to a bunch of families.
It would also be good for the linter to tell us what unicode characters are in any font in a family that are not in any declared subset encoding, and not in any encoding (https://github.com/googlefonts/glyphsets/tree/main/Lib/glyphsets/encodings)
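Roughly, such a lint check could look like the sketch below. The function name, the report keys, and the toy subset table are hypothetical; real subset definitions would be loaded from the glyphsets encodings linked above, and the family's codepoints from the fonts' cmaps.

```python
from typing import Dict, Iterable, List, Set


def unencoded_report(
    family_cps: Set[int],
    subsets: Dict[str, Set[int]],
    declared: Iterable[str],
) -> Dict[str, List[int]]:
    """Report codepoints in the family that the declared subsets do not cover.

    "outside_declared": in the fonts, but in none of the declared subsets.
    "outside_all":      in the fonts, but in no known subset encoding at all.
    """
    declared_cps: Set[int] = set()
    for name in declared:
        # Unknown subset names contribute nothing rather than raising.
        declared_cps |= subsets.get(name, set())
    all_cps: Set[int] = set().union(*subsets.values())
    return {
        "outside_declared": sorted(family_cps - declared_cps),
        "outside_all": sorted(family_cps - all_cps),
    }


# Toy data: A-Z plus a minus sign and one codepoint no subset knows about.
toy_subsets = {
    "latin": set(range(0x41, 0x5B)),
    "math": {0x2212, 0x221E},
}
toy_family = set(range(0x41, 0x5B)) | {0x2212, 0x1F600}
report = unencoded_report(toy_family, toy_subsets, declared=["latin"])
for key, cps in report.items():
    print(key, [f"U+{cp:04X}" for cp in cps])
```

The first list is the set of characters that would go missing if only the declared subsets were served; the second points at gaps in the encodings themselves.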
Would be good to improve CJK detection as well. We have an ongoing issue where that tool adds many CJK subsets when only one is appropriate. The "I only have an hour" approach would be to simply prompt the user if this occurs and make them pick which one(s) they want. Given more time, identify characters that strongly suggest (high frequency, only in that script) support for specific CJKs and use them to autopick.
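The "given more time" idea could be sketched as below. The marker sets are illustrative placeholders, not a vetted list: kana and Hangul syllables are reasonable signals, but the four ideographs used for the Chinese subsets are guesses that would need frequency data behind them, and some also appear in fonts aimed at other CJK languages. The function name and the `min_ratio` cutoff are likewise hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical marker characters per CJK subset (placeholders, not vetted).
CJK_MARKERS: Dict[str, Set[int]] = {
    "japanese": set(range(0x3041, 0x3097)),   # hiragana letters
    "korean": set(range(0xAC00, 0xD7A4)),     # Hangul syllables
    "chinese-simplified": {0x4EEC, 0x8FD9},   # 们 这
    "chinese-traditional": {0x5011, 0x9019},  # 們 這
}


def suggest_cjk_subsets(
    font_cps: Set[int],
    markers: Dict[str, Set[int]] = CJK_MARKERS,
    min_ratio: float = 0.9,
) -> List[str]:
    """Return the CJK subsets whose marker characters are (almost) all present."""
    picks = []
    for name, cps in markers.items():
        ratio = len(font_cps & cps) / len(cps)
        if ratio >= min_ratio:
            picks.append(name)
    return sorted(picks)


# A made-up Japanese font: all hiragana plus a couple of kanji.
japanese_font = set(range(0x3041, 0x3097)) | {0x4E00, 0x65E5}
print(suggest_cjk_subsets(japanese_font))

# A made-up pan-CJK font matches every marker set, so nothing can be autopicked.
pan_cjk_font = set().union(*CJK_MARKERS.values())
print(suggest_cjk_subsets(pan_cjk_font))
```

When exactly one subset comes back it can be autopicked; when several (or none) do, the tool would fall back to the "I only have an hour" approach and prompt the user.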
We discussed this issue in our team meeting last Friday.
We came to the conclusion that decreasing the number of glyphs needed to activate a subset won't work, because users can search for fonts by subset. If we start enabling subsets just because the font contains a single glyph within a given subset, users will get annoyed because the font doesn't fully support the subset.
Language drop down enables users to search for fonts by subset
@nathan-williams Is there a reason why we're not using the font's cmap to construct the glyphs palette? If we simply used the cmap, we wouldn't get fallback glyphs.
nabla on sandbox showing fallback glyphs because they don't exist in the font being used to display the glyphs
However, the CMAP of the fonts in the API is a subset of the CMAP of the upstream fonts, because the upstream HAS the glyphs:
But because the subset that contains the character isn't declared, the character is removed by the API and so unavailable to the Catalog:
I personally believe the subsets that cover the glyphs in the font should be enabled, so that all glyphs can be accessed, and the language dropdown should be fixed so that filtering happens by actual script support, even if the subset is included. The API deleting glyphs is a foundational issue with the API that should be fixed as a priority, and then the Catalog should accommodate a correct API.
But @vv-monsalve and @m4rc1e make a fair argument that fixing the Catalog glyph table preview issue by modifying the API subsets configured for families will break the Catalog language filter, and the Catalog language filter is a more important Catalog feature, so we should not modify the subsets as they currently work, and we should modify the Catalog glyph table preview functionality to fix its issue.
Taking a higher level view, I think there are 3 semantics which are being conflated in the GF system architecture:
We should bring these into harmony and use the appropriate data for the appropriate Catalog feature, but that's a larger (2023) effort.
Therefore I think the short term solution to the 'unavailable glyphs being seen' issue is not, as per this issue title, to _more aggressively declare subsets in METADATA files_, but rather for the Catalog front-end code team (@nathan-williams :) to fix it in that code.
But to start on bringing those things into harmony, for (1) we should get better data on which characters exist in which font files but are not covered by any of the subsets declared in the families' METADATA files, and I'd like to ask @m4rc1e to gather that data in a Google Sheet.
Ha, just after posting this, I caught up on google chatrooms, and Nate said he already wrote a patch that excludes unavailable glyphs, which will be applied the next time such a family is pushed :)
So that seems weird and makes a detailed analysis and investigation of the subsets more important, although not urgent.