Encode unencoded glyphs as F0000 + hex(GID) #388

davelab6 · 2014-08-20T23:20:48Z

I was chatting with @twardoch today about how glyph names are the 'primary key' for fonts, because in any contemporary font you have so many unencoded glyphs, accessed with OpenType Layout logic... But unencoded glyphs are tricky to precisely call, because OTL logic is per-font. I mentioned that I might like to use the Unicode Private Use Area to encode otherwise-unencoded glyphs.

Adam kindly mentioned that he had already thought about this and concluded that the Private Use Plane A (Unicode Plane 15) is ideal, as it spans U+F0000..U+FFFFD, so you can use a value of F0000 + hex(GID) to cleanly, logically, encode all unencoded glyphs.

Let's do it!

add a "3.10" cmap subtable


twardoch · 2014-08-20T23:25:25Z

All it requires is a simple tool which adds a "3.10" cmap subtable which maps glyph ids to the PUP A (sequentially, by adding 0xF0000 to the glyph ID). Because the codepoints will be in the same order as the glyph IDs, you can use the space-saving cmap format 12 which only defines the start and end of the cmap mapping range. So the added size overhead is small.
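
As a minimal sketch of the arithmetic described here (the function names are made up for illustration, and the upper bound assumes the last two code points of Plane 15, U+FFFFE and U+FFFFF, are skipped because they are noncharacters):

```python
SPUA_A_START = 0xF0000
SPUA_A_END = 0xFFFFD  # U+FFFFE and U+FFFFF are noncharacters, not private use


def gid_to_codepoint(gid: int) -> int:
    """Return the Plane 15 code point for a glyph ID (0xF0000 + GID)."""
    codepoint = SPUA_A_START + gid
    if gid < 0 or codepoint > SPUA_A_END:
        raise ValueError(f"glyph ID {gid} does not fit in U+F0000..U+FFFFD")
    return codepoint


def codepoint_to_gid(codepoint: int) -> int:
    """Inverse mapping: recover the glyph ID from a Plane 15 code point."""
    if not SPUA_A_START <= codepoint <= SPUA_A_END:
        raise ValueError(f"U+{codepoint:X} is outside U+F0000..U+FFFFD")
    return codepoint - SPUA_A_START


print(f"U+{gid_to_codepoint(388):X}")  # U+F0184
```

Under this assumption, glyph IDs 65534 and 65535 cannot be mapped, which only matters for fonts that come close to the 16-bit glyph ID limit.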

vitalyvolkov · 2014-09-02T10:23:23Z

@behdad Is there any way to determine that a glyph is unencoded using fontTools?

davelab6 · 2014-09-02T11:43:23Z

@hash3g you can find the glyph IDs in the GlyphOrder table, eg https://github.com/hash3g/yesevaone/blob/master/YesevaOne-Regular.ttf.GlyphOrder.ttx and you can find glyphs that are encoded in the cmap table. I guess you need to make a set of the glyph names and a set of the encoded glyph names and compare them to get the set of unencoded glyphs.
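
The set comparison could be sketched roughly like this. The helper name and the sample glyph names are hypothetical; the inputs are a plain list and dict so the example runs without a font file:

```python
def unencoded_glyphs(glyph_order, cmap):
    """Return glyph names that no code point in cmap maps to, in glyph order.

    glyph_order: list of glyph names, index == glyph ID
    cmap: dict mapping code point (int) -> glyph name
    """
    encoded = set(cmap.values())
    return [name for name in glyph_order if name not in encoded]


# Hypothetical sample data standing in for a real font
glyph_order = [".notdef", "A", "B", "A.swash", "f_i"]
cmap = {0x41: "A", 0x42: "B"}

print(unencoded_glyphs(glyph_order, cmap))  # ['.notdef', 'A.swash', 'f_i']
```

With fontTools, the two inputs would presumably come from `TTFont.getGlyphOrder()` and `TTFont.getBestCmap()`; which cmap subtable counts as "the" encoding is a choice this sketch does not make for you.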

twardoch · 2014-09-02T12:21:47Z

I would advocate that it would be much simpler (and more storage-effective) if you just encode ALL glyphs as U+F0000 + GID.

1. Check if there already is a cmap subtable with PID 3 EID 10 (3.10 for short). If there is, skip to step 4.
2. Create a cmap subtable in format 12 and assign it to the cmap table as 3.10.
3. Copy all mappings from the PID 3 EID 1 (3.1 for short) cmap subtable to the 3.10 subtable, as the spec requires 3.10 to be a superset of 3.1.
4. "Blindly" assign mappings to all glyphs from GlyphOrder, from U+F0000 to U+F0000 + len(GlyphOrder) - 1.

This has the advantage that cmap subtable format 12 uses an efficient storage for continuous code-to-GID ranges. With my method, you'll only create one such range, so it'll only add a few bytes to the size, and will be very fast.
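
To illustrate why a single range is enough, here is a rough sketch that collapses a code-point-to-GID mapping into format 12 style groups (startCharCode, endCharCode, startGlyphID). The function name is invented for this example, and the byte count assumes the format 12 layout of a 16-byte header plus 12 bytes per group:

```python
def sequential_groups(code_to_gid):
    """Collapse {code point: glyph ID} into (startCharCode, endCharCode, startGlyphID) groups.

    A group can be extended only while both the code points and the
    glyph IDs keep increasing by exactly one.
    """
    groups = []
    for code in sorted(code_to_gid):
        gid = code_to_gid[code]
        if groups:
            start, end, start_gid = groups[-1]
            if code == end + 1 and gid == start_gid + (code - start):
                groups[-1] = (start, code, start_gid)
                continue
        groups.append((code, code, gid))
    return groups


num_glyphs = 500  # arbitrary glyph count for the example
pua = {0xF0000 + gid: gid for gid in range(num_glyphs)}
groups = sequential_groups(pua)

print(len(groups))            # 1
print(16 + 12 * len(groups))  # 28 bytes for a subtable holding only this range
```

In the real 3.10 subtable the mappings copied from 3.1 add their own groups, so the figure above covers only the extra PUA range.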

This approach has an additional benefit:
As a user of such a font, I am not forced to address the properly encoded (i.e. via Unicode) glyphs using the F0000+ codes.

I can still use the proper Unicodes. But if I do so, the browser/app will always perform the Unicode processing and default OpenType Layout shaping for complex scripts. So I won't really have the guarantee that the glyph I'm seeing is actually the glyph assigned to the Unicode codepoint in the font's cmap. It will be for most Unicodes but for some codepoints, the "Unicode+OTL magic" will kick in.

But if I address even the "properly" encoded glyphs using the U+F0000+ codepoints, I will have a WYSIWYG guarantee. Even more: with harfbuzz.js, I can run a JS port of HarfBuzz in the browser, take the output GIDs, add 0xF0000 to them and have my own explicit custom OTL processing if I need to. So I'm completely in control and independent of any "browser magic".
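
A rough sketch of that last step, turning a list of shaped glyph IDs into a PUA string. The glyph IDs below are invented and no real shaper is called; the helper name is hypothetical:

```python
def gids_to_pua_text(gids):
    """Build a string that addresses each glyph ID via U+F0000 + GID."""
    return "".join(chr(0xF0000 + gid) for gid in gids)


# Hypothetical shaper output: glyph IDs in visual order
shaped_gids = [36, 912, 54]
text = gids_to_pua_text(shaped_gids)

print([f"U+{ord(ch):X}" for ch in text])  # ['U+F0024', 'U+F0390', 'U+F0036']
print(len(text.encode("utf-8")))          # 12, i.e. four bytes per glyph
```

This only covers the encoding side. Positioning data from the shaper (advances, offsets) is not carried by the string and would have to be applied separately.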

twardoch · 2014-09-02T13:36:54Z

Here is my code that does exactly what I described above.

#! /usr/bin/python
# -*- coding: utf-8 -*-
# 
# pyftaddspuaabygids.py
# Map all glyphs to the Supplementary PUA-A plane (U+F0000..U+FFFFF) 
# by 0xF0000 + glyphID
#  
# Copyright (c) 2014 by Adam Twardoch
# 
# Licensed to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import fontTools.ttLib, sys, copy

def addSPUAByGlyphIDsMappingToCMAP(ttx):
    cmap = ttx["cmap"]
    # Check if an UCS-2 cmap exists
    for ucs2cmapid in ((3, 1), (0, 3), (3, 0)): 
        ucs2cmap = cmap.getcmap(ucs2cmapid[0], ucs2cmapid[1])
        if ucs2cmap: 
            break
    # Create UCS-4 cmap and copy the contents of UCS-2 cmap
    # unless UCS 4 cmap already exists
    ucs4cmap = cmap.getcmap(3, 10)
    if not ucs4cmap: 
        cmapModule = fontTools.ttLib.getTableModule('cmap')
        ucs4cmap = cmapModule.cmap_format_12(12)
        ucs4cmap.platformID = 3
        ucs4cmap.platEncID = 10
        ucs4cmap.language = 0
        if ucs2cmap: 
            ucs4cmap.cmap = copy.deepcopy(ucs2cmap.cmap)
        cmap.tables.append(ucs4cmap)
    # Map all glyphs to UCS-4 cmap Supplementary PUA-A codepoints 
    # by 0xF0000 + glyphID
    ucs4cmap = cmap.getcmap(3, 10)
    for glyphID, glyphName in enumerate(ttx.getGlyphOrder()):
        ucs4cmap.cmap[0xF0000 + glyphID] = glyphName

def usage():
    print("Map all glyphs to the Supplementary PUA-A plane (U+F0000..U+FFFFF) by 0xF0000 + glyphID")
    print("python %s inputfile[.otf|.ttf] outputfile[.otf|.ttf]" % sys.argv[0])

if len(sys.argv) == 3:
    inpath = sys.argv[1]
    outpath = sys.argv[2]
    ttx = fontTools.ttLib.TTFont(inpath)
    addSPUAByGlyphIDsMappingToCMAP(ttx)
    ttx.save(outpath)
    ttx.close() 
else: 
    usage()

behdad · 2014-09-03T09:09:32Z

I categorically reject this and think it's a bad idea. Nowhere in this report do I see any reasoning for why this is needed or is a good idea.

twardoch · 2014-09-03T09:24:39Z

Ah, yes. We talked with Dave about this. Sorry it didn't become clear.

The idea is not to do this for production-ready fonts but for the purpose of development, to be used within the context of document-driven type design and similar such applications.

In a way, think of it as the "debug" mode of building fonts. Such debug mode might include other options that generate some redundant data (such as, well, glyph names! :) ) which is useful while designing but when building fonts in "release" mode, this stuff should not be included.

behdad · 2014-09-03T09:31:43Z

Ok, sure. Yeah, that would be useful.

davelab6 · 2014-09-03T23:17:18Z

@behdad could you explain more about why you think this is a bad idea..? You think that if all Google Fonts have this feature, that it will increase the use of PUA characters and documents tightly bound to particular fonts in general usage?

behdad · 2014-09-04T07:03:50Z

@davelab6 for the same reasons that non-Unicode encodings are bad. This is even worse, this is full custom encoding, meaning any text encoded in those is illegible to any text processing use.

vitalyvolkov · 2014-09-04T13:55:32Z

@davelab6 please check that test and fix is applied

https://github.com/googlefonts/fontbakery-cli/commit/5acd915d47e9385ef529be646906790411bd731d

davelab6 · 2014-09-04T15:46:42Z

@behdad I am skeptical that this would find any general usage.

It is a secondary method that is not for text processing, but debugging: it is supplementing, not replacing, the Unicode encodings and OTL tables.

Part of Document Driven Type Design is having good examples to refer to; specifically for the re-implementation of http://fuelproject.org/utrrs/index (which is the result of a 24-hour overnight sprint, but the concept is valid and needed.)

Since we don't have OTL processing in <canvas>, I figure this secondary encoding would be the best way to get that done. And the good examples will be in the production Fonts API.

davelab6 · 2014-09-04T15:47:20Z

@hash3g for now, can you make this optional in the same way as fontcrunch is optional, via bakery.yml and set up page?

davelab6 · 2016-02-05T17:24:45Z

Per TypeThursday's Laura Worthington article we should consider this, perhaps only for display fonts, if it's become important for casual users of desktop fonts.

anthrotype · 2016-02-05T20:20:04Z

What's this article you are referring to?

davelab6 · 2016-02-05T20:25:02Z

Oh it's not out yet. Stay tuned.

anthrotype · 2016-02-05T20:32:40Z

ok. I hope you don't want to revive the idea of using PUA codes in released fonts...

behdad · 2016-02-05T20:40:36Z

> Oh it's not out yet. Stay tuned.

lol. Ping us when it is. That said, people have had bad ideas forever; doesn't mean we should support them. I'm more willing to implement a HarfBuzz tool to render arbitrary glyphs than to add a hack in fonttools.

felipesanches · 2016-02-10T23:41:30Z

I risk saying something stupid here, but if a strict mapping should not be inferred to be normative, then can't these tools randomise mappings on purpose?

davelab6 · 2016-02-10T23:51:51Z

> randomise mappings on purpose?

That seems wise, to mitigate the practical problem with PUA text.

kenlunde · 2016-02-11T00:24:02Z

PUA Text: Hope for best, expect the worst.

davelab6 · 2016-02-11T03:02:18Z

@kenlunde I understand the categorical criticism of this, but I'm curious what your advice is for Laura Worthington. What should she do to serve her customers?

kenlunde · 2016-02-11T03:36:19Z

@davelab6: I wrote the following at the bottom of page 162 of CJKV Information Processing, Second Edition: "The use of PUA code points should be avoided at all costs, because their interpretation, in terms of character properties, and their interaction with legacy character set standards (in other words, interoperability) cannot be guaranteed."

With that said, Laura's article suggests that PUA usage is a necessary evil in order to access glyphs that are not directly encoded. However, the main caveat is that the more an implementation depends on PUA code points, the more closed said implementation is.

davelab6 · 2016-02-11T03:44:32Z

> the more closed said implementation is

I'm not entirely sure what you mean by 'closed,' please could you clarify? :)

I am framing this as fallback to help users of implementations that are so poorly done that this is the only way to make the font useful. It isn't that implementations should depend on the PUA codepoints for providing large glyph sets to users, it is that they are oblivious to this need and PUA is a workaround for people who are held hostage by these implementations' incompetence.

kenlunde · 2016-02-11T03:57:32Z

@davelab6: What I mean by closed is that the implementation is closely tied to the PUA mappings of a particular font, and changing the selected font to anything else is virtually guaranteed to result in illegible text.

Furthermore, in the context of font fallback, I would claim that PUA usage is at least an order of magnitude more dangerous than environments that do not employ it, because it is not clear from which font (or fonts) the glyphs are being displayed.

Also, for fonts with glyphs that are not encoded and require OpenType feature support to access them, there may also be metrics-related dependencies that would likely be overlooked by simpler apps, meaning that even if a user is somehow able to enter a glyph via a PUA code point, the resulting glyph may not behave as expected, due to limitations in the simpler authoring app.

davelab6 · 2016-02-11T04:04:38Z

Hmm.

So, it seems that Laura should simply pyftfeatfreeze her font into a set of single-style font families, and I should set up a freezemyfont.com site to make it easy for regular people to freeze their libre OT fonts to use in such implementations, and maybe lobby Font Squirrel to add this to the Web Font Generator.

Would that be wiser than recommending SPUA-GID encoding?

khaledhosny · 2016-02-11T07:12:04Z

Where is the glyph positioning in all this (so 60s) PUA talk? It is not like OpenType is only about glyph replacements. OK, you are not building fonts that need mark positioning or all this fancy stuff, but what about kerning?

kenlunde · 2016-02-11T13:31:24Z

@davelab6: I am not sure what pyftfeatfreeze is and its ramifications. I have no strong objections to someone who feels the need to use PUA code points, and my point is that those who decide to go down that path simply need to understand the consequences, which is that nothing is guaranteed, and any oddball behavior is likely to be related to the decision to use PUA code points.

@khaledhosny: That was sort of my point, specifically that there is more to glyphs than merely having them encoded. Perhaps such an approach works for Western fonts, but I can pretty much guarantee that it will crumble when non-Western fonts enter into the picture.

twardoch · 2016-02-11T14:13:21Z

All OpenType substitutions and positioning happen on the glyph level. Principally, it doesn't matter which codepoints a glyph is invoked through.

Of course since the OT Layout model relies on per-script shaping engines, reordering happens or certain features are automatically applied in a specific order when glyphs are invoked via their true Unicode codepoints.

If the same glyphs are invoked via PUA, the shaper probably classifies them as "DFLT" script. Indeed if a PUA codepoint is inserted in the middle of "true" Arabic or Devanagari or even Latin or Cyrillic text, some OTL engines may interpret that single glyph inserted via PUA as a separate run, and then would not execute feature interactions with neighboring glyphs (kern, mark etc.). Which does indeed pose a problem.

Other engines may fold that glyph into the dominant run and it will work. Inconsistent run itemization and inability to perform positionings or contextual substitutions across run boundaries is indeed a very weak aspect of OTL.

But if the PUA-invoked glyphs end up to be in the same run, universally applied default features like kern or mark, and any explicitly user specified features, both GSUB and GPOS, will work.
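
A toy model of the "separate run" behaviour, for illustration only. Real engines itemize by script, font, direction and more; this hypothetical helper merely breaks the text wherever it switches between private-use characters (general category `Co`) and everything else:

```python
import unicodedata
from itertools import groupby


def split_at_private_use(text):
    """Toy itemizer: start a new run whenever the private-use status changes."""
    def is_private_use(ch):
        return unicodedata.category(ch) == "Co"

    return ["".join(chars) for _, chars in groupby(text, key=is_private_use)]


# "Ty", one glyph addressed via U+F0041, then "pe"
text = "Ty" + chr(0xF0041) + "pe"
runs = split_at_private_use(text)

print(len(runs))  # 3
```

In an engine that behaves like this, a kerning pair between "y" and the PUA-addressed glyph would never be looked up, because the two glyphs end up in different runs. An engine that folds the glyph into the surrounding run would keep everything in one piece.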

The pyftfeatfreeze method isn't problematic. It only remaps the "cmap" and typically does so within the same languagesystem, so the new default glyphs (assigned to Unicode codepoints) remain within the same script run, and everything works as it should, unless your features do really weird circular stuff.

For example, if you have swash Arabic glyphs in the "swsh" feature inside the "arab" languagesystem, these glyphs would normally already partake in the init, medi, fina, isol, curs or mark features in the original font. If you freeze the swsh feature using pyftfeatfreeze, those swash glyphs get mapped as the default Arabic letters in the cmap table, but since they get classified as the Arabic run, are fed into the Arabic shaper and already partake in the "arab" features defined in the font, everything works as expected.
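
The cmap remapping idea can be sketched as follows. This is not pyftfeatfreeze's actual code, just an illustration for one-to-one substitutions, with invented glyph names and a hypothetical helper:

```python
def freeze_single_substitutions(cmap, substitutions):
    """Return a new cmap in which each glyph is replaced by its substitute, if it has one.

    cmap: dict mapping code point (int) -> glyph name
    substitutions: dict mapping glyph name -> substitute glyph name,
                   as a one-to-one feature lookup would define it
    """
    return {code: substitutions.get(name, name) for code, name in cmap.items()}


# Hypothetical data: two letters have swash alternates, one does not
cmap = {0x41: "A", 0x42: "B", 0x51: "Q"}
swsh = {"A": "A.swsh", "Q": "Q.swsh"}

frozen = freeze_single_substitutions(cmap, swsh)
print(frozen[0x41], frozen[0x42], frozen[0x51])  # A.swsh B Q.swsh
```

Because the code points stay the same, the text is still itemized into the same script run, and the substituted glyphs take part in whatever other lookups the font defines for them.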

Freezing "swsh" is sometimes even a better method than applying a user defined feature to one character via, say a span with a local style="font-feature-settings: 'swsh'" property, because doing the latter may force a run break (or generally cause the span to be rendered in a separate step), which also stops the interaction with the neighbors -- unless the higher-level text engine is smart enough to detect and ignore certain span changes or somehow fuses the line together before passing it to the OTL engine.

Again, this is all shit. :)

davelab6 · 2016-02-11T14:15:09Z

Khaled, non-OT implementations are TrueType implementations; thus they use the KERN table. :) Ken, please see https://github.com/twardoch/fonttools-utils/tree/master/pyftfeatfreeze

davelab6 · 2016-02-11T14:19:10Z

On 11 February 2016 at 09:13, Adam Twardoch notifications@github.com wrote:

> If the same glyphs are invoked via PUA, the shaper

Hmm. Do we have concrete examples of implementations that have an OT shaper, but no UI for users to set/unset non-default features? (I'll ask Laura on TypeDrawers about this.)

twardoch · 2016-02-11T14:19:33Z

@kenlunde pyftfeatfreeze is a tool I published which e.g. could turn the Source Han Sans superfont that uses the OT features to switch between SC, TC, J and K variants into a series of SC, TC, J and K fonts which all have the appropriate variants mapped in the cmap. In fact, after you published Source Han Sans, I realized that such an approach may no longer be as frowned upon as it used to be by some people, and I finished and published the tool (I had a simple working version of it for a long time now).

twardoch · 2016-02-11T14:26:47Z

@davelab Start with Notepad. :) It supports calt (contextual alternates) but gives no way to override its results. Zapfino Extra LT Pro that I made in 2003 shows how this works -- you see substitutions happening as you type but once you have finished typing, there is no way to pick another variant. And then virtually every Windows app that uses the standard GDI Windows text controls, any Photoshop clone, older versions of Word or Corel Draw, and tons of simple "add text to image" apps, or apps for vinyl sign cutting or CAD apps or motion graphics/video editing apps or apps that add subtitles or captions etc. All that uses the standard text stack on Windows.

davelab6 · 2016-02-11T14:33:28Z

Right. I see why this is a really big problem 😠

Looking back at Laura's post on TD, she says that feeding the new hobbyist retail market PUA fonts is increasing their use and appreciation of fonts, that puts pressure on app developers to provide better typography features.... but the example is a glyph picker, not a real OpenType UI. I wonder if any apps have added OT UIs, so I've updated my question to Laura to be about that :)

khaledhosny · 2016-02-11T15:28:20Z

That swash example makes no sense at all, since you usually want to enable it for just part of the word (at least for Arabic), and you can’t do that with a “frozen” font. That is really just another hack suitable for certain fonts. That CSS font-features currently force run breaks is an implementation bug; Firefox, for example, applies many style changes across spans without breaking OpenType logic and it should do that for font features too (since HarfBuzz already supports that).

khaledhosny · 2016-02-11T15:31:18Z

@dave I don’t see how kern table can handle all kinds of kerning supported by GPOS pair positioning, nor how it is certain that the kern table will be always supported in these situations.

twardoch · 2016-02-11T16:31:39Z

@khaledhosny Replace "swash" with any other, e.g. one variant letter accessible through ss02 that you want to appear consistently in your text, or a localized form. Though admittedly, pyftfeatfreeze does not currently allow to freeze a feature for only some glyphs.

simoncozens · 2023-08-14T10:18:49Z

I'm going to be dogmatic here and say that encoding stuff in the PUA for general-purpose fonts is a bad idea and if anything we should check that it's not happening. I understand that it's a necessary evil for old applications which don't read the glyph table, but fontbakery is about best practices, not bad practices. If anyone still strongly insists that it should happen, feel free to ~~fight me~~ reopen the issue.

davelab6 added this to the 2014-Q3 Week 21 milestone Aug 20, 2014

davelab6 added the build label Aug 20, 2014

davelab6 assigned vitalyvolkov Aug 20, 2014

vitalyvolkov assigned andriyko and unassigned vitalyvolkov Sep 4, 2014

vitalyvolkov assigned davelab6 and unassigned andriyko Sep 4, 2014

davelab6 mentioned this issue Mar 20, 2015

Choose + Review: Show all the glyphs, including unencoded typefacedesign/document-driven-typedesign#71

Open

davelab6 added P4 Someday-maybe and removed build labels Feb 2, 2016

davelab6 modified the milestones: 2.0, 2014-Q3 Week 21 Feb 2, 2016

davelab6 removed P1 Quick P2 Important P3 Soon P4 Someday-maybe labels Jul 8, 2016

davelab6 mentioned this issue Feb 16, 2017

Work with unencoded glyphs? googlefonts/gfregression#5

Closed

davelab6 removed their assignment Mar 14, 2017

felipesanches modified the milestones: MISC, Left-over from 2014 Dec 14, 2018

simoncozens closed this as not planned Aug 14, 2023

felipesanches removed this from the Left-over from 2014 milestone Sep 12, 2023
