support "dist" feature for indic fonts in KernFeatureWriter #176

Closed
anthrotype opened this issue Nov 13, 2017 · 28 comments

Comments

@anthrotype
Member

Some Noto fonts for Indic scripts don't have explicit MTI sources for the OpenType features, but instead have self-contained *.glyphs sources with FEA code, and these rely on a Glyphs.app feature that splits the kerning data at export time between a regular "kern" feature and a "dist" feature that holds only the kern pairs between Indic glyphs.

googlefonts/glyphsLib#223

Kerning on most Indic scripts is put into the dist feature on export... When exporting OTF/TTF from Glyphs the kerning is split so that LGC kerning is put in the kern feature and Indic kerning is put in the dist feature.
All kerning between glyphs that belong to Indic scripts is put into dist

Indic scripts expect "dist" because it is usually on by default, whereas "kern" is not.

I think ufo2ft's KernFeatureWriter should do this by default as well: i.e. emit not just "kern", but also "dist" for Indic glyphs, if any.

The argument that this would be a Glyphs.app-only feature may be true, but:

  1. this could be useful for any project, not just those that use Glyphs.app;
  2. the kern feature is already generated from kerning.plist inside ufo2ft, so it would be strange for the dist feature to be generated in glyphsLib while kern is generated in ufo2ft.

Whether we extend the existing kern feature writer or add an additional dist feature writer subclass is not so important, as long as we agree this is needed here.

Georg said he has a heuristic to determine what's Indic or not, probably by splitting ligatures at "_" and dropping "." suffixes. We could do a similar thing (I think we already do, to determine whether a kern pair is left-to-right or right-to-left).

Also, in a *.glyphs source one can assign a custom script property to a glyph (the default is given by the GlyphData.xml database used).

In UFO, there isn't such a thing. So I was thinking of defining a private UFO glyph lib entry that specifies a custom script (as well as one for category, which could be useful for other things) for glyphs that either don't have a unicode assigned, or do have one but the user may want (for whatever reason) to override the default script value for them.

Comments?

@anthrotype
Member Author

it would be nice to have this PR cleaned-up/merged before extending the feature writers #156
I could have a go myself, if @moyogo can't this week.
This one has become something of a priority as it affects several Indic Noto fonts, so I would like to get this working ASAP, provided we agree this is the way to go.

@khaledhosny
Collaborator

Are there actually any applications that support Indic scripts but turn off kern feature by default?

@anthrotype
Member Author

@punchcutter

@anthrotype
Member Author

Are there actually any applications that support Indic scripts but turn off kern feature by default?

I haven't tried, but I guess it's the usual suspects.. MS Word and the like.

@behdad
Collaborator

behdad commented Nov 13, 2017

I haven't tried, but I guess it's the usual suspects.. MS Word and the like.

Yes. kern feature is not part of OT Indic specs.

@anthrotype
Member Author

Do we want to extend the functionality of the existing KernFeatureWriter so that whenever it finds any "indic" glyphs in the font it splits the kerning.plist data into two sets, one to build the "dist" feature and the rest to build the normal "kern" feature?

Or do we want to devise some way for clients (fontmake) to override the default choice of KernFeatureWriter and have ufo2ft use a DistFeatureWriter implementation that does both kern and dist?

@anthrotype
Member Author

anthrotype commented Nov 21, 2017

For the question above, I prefer option 1) a single KernFeatureWriter that does both "kern" and "dist" by default.

Another big problem is how do we define the "Indic-ness" of a glyph in a kerning pair?

There are actually two related problems here:

  1. One is: which scripts should we consider "Indic" for the purpose of creating a "dist" feature instead of "kern"?

I looked at the "Script-specific development" section on the Microsoft website, and collected all the script tags where "dist" is mentioned among the recommended features:
https://www.microsoft.com/en-us/Typography/SpecificationsOverview.aspx

Please tell me if this is correct or if you would like to add or remove from it:

OT_INDIC_SCRIPT_TAGS = {
    "bng2": "Bengali",
    "beng": "Bengali",
    "bugi": "Buginese",
    "dev2": "Devanagari",
    "deva": "Devanagari",
    "gjr2": "Gujarati",
    "gujr": "Gujarati",
    "gur2": "Gurmukhi",
    "guru": "Gurmukhi",
    "java": "Javanese",
    "knd2": "Kannada",
    "knda": "Kannada",
    "khmr": "Khmer",
    "mlm2": "Malayalam",
    "mlym": "Malayalam",
    "ory2": "Oriya",
    "orya": "Oriya",
    "sinh": "Sinhala",
    "tml2": "Tamil",
    "taml": "Tamil",
    "tel2": "Telugu",
    "telu": "Telugu",
    "mym2": "Myanmar",
    "mymr": "Myanmar",
}
  2. The second problem is how we assign a glyph that has no unicode codepoint (an alternate form, a ligature, etc.) to one of these scripts. In Glyphs.app, there's a default GlyphData.xml database that assigns script and category properties directly to glyph names, independently of their unicode codepoint; and each glyph object in Glyphs.app can override the global data with optional glyph-level properties, so users can embed their own custom database in the .glyphs file.
    I am thinking of defining a private GLIF lib key (e.g. "com.github.googlei18n.ufo2ft.script") that can be used to assign a script property to un-encoded glyphs. Then glyphsLib could write that automatically when exporting *.glyphs to UFOs.
    The alternative is to rely on a consistent AGL-compliant glyph naming scheme, which is what we currently do for detecting the right-to-left-ness of a glyph in order to define LTR and RTL kern lookups. That is, we try splitting the glyph name at "." and "_" until we find a part of the glyph name that corresponds to a glyph with a unicode codepoint. I believe this may work for, say, Arabic, but not so well for Indic fonts, where the glyph naming conventions are different (given the large number of ligature glyphs).
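The name-splitting alternative described above could look roughly like the sketch below. It is only an illustration: `SCRIPT_RANGES`, `script_for_codepoint` and `guess_script` are hypothetical names, the range table is a tiny hand-written subset of Unicode blocks mapped to new-style script tags, and this is not the code ufo2ft uses.

```python
# Hypothetical, partial table: (first codepoint, last codepoint, OpenType script tag).
SCRIPT_RANGES = [
    (0x0900, 0x097F, "dev2"),  # Devanagari block
    (0x0980, 0x09FF, "bng2"),  # Bengali block
    (0x0C80, 0x0CFF, "knd2"),  # Kannada block
]


def script_for_codepoint(codepoint):
    """Return the script tag for a codepoint, or None if it is not in the table."""
    for first, last, tag in SCRIPT_RANGES:
        if first <= codepoint <= last:
            return tag
    return None


def guess_script(glyph_name, cmap):
    """Guess a glyph's script from its name.

    `cmap` maps the names of encoded glyphs to their codepoints. The full
    name is tried first, then the name without its "." suffix, then each
    "_"-separated ligature component. The first encoded match wins.
    """
    base = glyph_name.split(".", 1)[0]
    for candidate in [glyph_name, base] + base.split("_"):
        codepoint = cmap.get(candidate)
        if codepoint is not None:
            return script_for_codepoint(codepoint)
    return None
```

With `cmap = {"ka-deva": 0x0915, "ssa-deva": 0x0937}`, a ligature named `ka_ssa-deva.alt` resolves only through its second component, because `ka` on its own is not an encoded glyph name. That is the kind of fragility with Indic naming conventions mentioned above.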

Any comments or suggestions would be appreciated

/cc @punchcutter @behdad @moyogo

@punchcutter

Eventually there will also be the Indic-3 tags, so I suppose you could add those, but it's not a big deal at the moment.

I don't know about Buginese, Javanese or Myanmar. At least for Noto we aren't using dist. Khmer also doesn't necessarily need it. I've used it just for adjustments, but the main kerning is still in kern. @ohbendy What do you think? Do you expect dist for Myanmar or Khmer or anything else?

@schriftgestalt Which scripts does Glyphs apply dist to?

/cc @kalapi

@ohbendy

ohbendy commented Nov 21, 2017

I'm no expert in the engineering side, but dist is very helpful in the Southeast Asian scripts. I consider kerning as more to do with evening out the gaps, for improving the appearance of a glyph sequence, while dist is more about composing clusters so that components are positioned correctly which would otherwise not be correct or readable.

We've used dist extensively in our latest Burmese fonts. I can also imagine it being used in Thai or Lao to shift clusters starting with โใไ/ໂໃໄ that follow clusters with above-marks. I gather that in scripts like Telugu, where marks need to be anchored to a base but also need to have an advance width for their post-base part, the dist feature is used.

I think it's useful to consider and implement these kinds of adjustments in a different way than normal base-to-base kerning.

@NorbertLindenberg would be able to expand.

@anthrotype
Member Author

Thanks Zachary and Ben.

Eventually there will also be the Indic-3 tags

we can add these later once they are fully spec'ed.

I don't know about Buginese, Javanese or Myanmar. At least for Noto we aren't using dist. Khmer also doesn't necessarily need it. I've used it just for adjustments, but the main kerning is still in kern

For the scripts you mentioned, I believe that the corresponding Noto fonts all use MTI feature files instead of Adobe's FEA, so they shouldn't be affected by this.

kerning as more to do with evening out the gaps, for improving the appearance of a glyph sequence, while dist is more about composing clusters

I see. So does that mean one would like to be able to decide which part of the kerning pairs should be written out as "kern" feature, and which other part should compose the "dist" feature, instead of everything in either one or the other depending on the script tags?

I can also imagine it being used in Thai or Lao to shift clusters starting with...

I could also add Thai and Lao to the list. Myanmar and Telugu are already present.


About the problem of how to split the data contained in groups.plist and kerning.plist into kern- vs dist-related.
Maybe instead of (or in addition to) looking up the script associated with the glyphs' codepoints -- and as an alternative to defining a private lib key containing the script property of each glyph --, we could devise a way to split the kern-kerning from the dist-kerning based on the presence of some special keyword in the kerning classes' names defined in groups.plist. E.g. if the kerning group contains the word "DIST" or "dist", then we treat all the glyphs contained in there accordingly, and thus build a "dist" feature instead of a "kern" feature with the kerning pairs that involve that group.

Or, yet another way could be to check for the presence of the script name (or 4-letter iso 15924 code or opentype script tag) in the kerning groups' names, and use that as a way to split up the kern vs dist "kerning".

Of course any approach that relies on kerning classes' naming conventions can't fully account for the glyph-to-glyph kerning pairs (unless these are "exceptions" to class-based kerning). For these we would have to guess from the glyphs' codepoints or glyph names (or again, the lib key with script property).
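To make the group-name idea concrete, here is a minimal sketch of splitting kerning pairs by a keyword in the kerning group names. It assumes UFO 3 style group names prefixed with `public.kern1.` / `public.kern2.`; the function names and the `dist` keyword convention are hypothetical and not part of any existing tool.

```python
GROUP_PREFIXES = ("public.kern1.", "public.kern2.")


def is_dist_group(name, keyword="dist"):
    """True if `name` is a kerning group whose own name contains the keyword."""
    if not name.startswith(GROUP_PREFIXES):
        return False  # a plain glyph name, not a kerning group
    suffix = name.split(".", 2)[2]  # the part after "public.kernN."
    return keyword in suffix.lower()


def split_by_group_keyword(kerning, keyword="dist"):
    """Split {(left, right): value} into (kern_pairs, dist_pairs).

    A pair goes to dist if either side is a group marked with the keyword.
    Glyph-to-glyph pairs carry no marker, so they all fall back to kern.
    """
    kern, dist = {}, {}
    for pair, value in kerning.items():
        target = dist if any(is_dist_group(side, keyword) for side in pair) else kern
        target[pair] = value
    return kern, dist
```

As noted above, glyph-to-glyph pairs end up in kern here, so a real implementation would still need a second criterion (codepoints, glyph names, or a lib key) for those.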

@khaledhosny
Collaborator

(Can’t help but wonder if it is time to have a richer kerning model in UFO that does not lump everything in one big table and is not limited by what some ancient versions of FontLab were capable of doing.)

@adrientetar
Collaborator

@khaledhosny Any specific proposal you have in mind?

@NorbertLindenberg

From a usage point of view, I tend to think of kern as improving the spacing between base glyphs, while dist is primarily used to avoid collisions between or with above- or below-base marks, which in Brahmic scripts are often wider than the bases they sit above or below. Other people may use them differently. The rules in dist features are very often contextual and only look at selected marks (e.g., only above-base or only below-base) – I don’t know whether those can be generated automatically.

OT_INDIC_SCRIPT_TAGS = {

The list of script tags is missing all the ones listed in the Universal Shaping Engine specification:
https://www.microsoft.com/typography/OpenTypeDev/USE/intro.htm
Indic-3 tags are already implemented in Apple’s CoreText – bng3, dev3, gjr3, gur3, knd3, mlm3, ory3, tml3, tel3 are routed to the Universal Shaping Engine.

The list is correct in omitting Lao and Thai (and pre-Windows 10 Tibetan) because their specification doesn’t have dist.

@anthrotype
Member Author

anthrotype commented Nov 24, 2017

I'm even more confused now.

What Norbert said about kern being about improving spacing between base glyphs vs dist about avoiding collision between or with marks, and often contextual, does not fit with the way (as I understood it) Glyphs.app auto-generates the dist feature for Indic scripts.

Glyphs.app automatically creates (or appends) a "dist" feature whenever there are kerning pairs between glyphs that are classified as "Indic" (according to a list of scripts which I still haven't completely figured out). Whereas any other kerned glyphs are included in the normal "kern" feature. As simple as that.
Note that these are regular PairPos lookups. The contextual lookups that Norbert is talking about, have to be written manually by the user, and thus are outside of the scope of what I'm proposing here.
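The behaviour described above (as understood in this thread, not taken from Glyphs.app itself) amounts to a simple partition of the flat kerning pairs. The sketch below assumes a glyph-to-script mapping is already available; `INDIC_TAGS` is an illustrative subset, and sending mixed pairs to kern is an assumption.

```python
# Illustrative subset of the script tags discussed earlier in the thread.
INDIC_TAGS = {"dev2", "deva", "knd2", "knda", "gjr2", "gujr"}


def split_pairs_by_script(kerning, glyph_scripts):
    """Split {(left, right): value} into (kern_pairs, dist_pairs).

    `glyph_scripts` maps glyph names to script tags. A pair goes to dist
    only when both glyphs belong to an Indic script; everything else,
    including mixed pairs (an assumption), stays in kern.
    """
    kern, dist = {}, {}
    for (left, right), value in kerning.items():
        both_indic = (
            glyph_scripts.get(left) in INDIC_TAGS
            and glyph_scripts.get(right) in INDIC_TAGS
        )
        (dist if both_indic else kern)[(left, right)] = value
    return kern, dist
```

Both outputs would still be compiled as regular PairPos lookups; only the feature they are registered under differs.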

(The main reason I'd like to extend the kern feature writer in ufo2ft is to be able to match Glyphs.app behaviour, as some Indic Noto fonts have been engineered using FEA (instead of MTI features) inside Glyphs.app, and they rely on these Glyphs.app-specific automatic features)

I don't have much expertise in making fonts for Indic scripts, that's why I'm asking for help. If Glyphs.app users and Indic fonts experts are happy with the way it auto-generates the dist feature for them, then this would be an argument for matching this behavior in our fontmake pipeline.
But if this is not the correct way, or if there are alternative ways that are more or equally valid, then maybe this should not be included as the default behavior.

The list of script tags is missing all the ones listed in the Universal Shaping Engine spec

Does that mean I should consider all the scripts listed at the end of that page for this auto-generation of a dist feature (as defined above, as alternative to the regular kern feature)? (I see stuff like N'ko in there.. Is that considered an "Indic" script?)

@anthrotype
Member Author

Also, I don't understand what @behdad means by

kern feature is not part of OT Indic specs.

If I look at, for example, https://www.microsoft.com/typography/OpenTypeDev/kannada/intro.htm, I see that "kern" is listed among the positioning features alongside "dist", "abvm" and "blwm".

Both "kern" and "dist" are characterized in those documents as intended to "adjust distances", the only difference is that "dist"

does not rely on the application to enable kerning. Therefore, if you want to make sure certain spacing adjustments will always be displayed, you should use the 'dist' feature

@ohbendy

ohbendy commented Nov 24, 2017

What Norbert said about kern being about improving spacing between base glyphs vs dist about avoiding collision between or with marks, and often contextual, does not fit with the way (as I understood it) Glyphs.app auto-generates the dist feature for Indic scripts.

As you say, base-to-base spacing adjustments can be handled easily by just kerning, which can be done graphically in Glyphs now. As Norbert says, marks often need to trigger spacing adjustments, and these are likely to be contextual because different bases and marks are usually different shapes/widths. Currently Glyphs does not offer a way to preview other spacing adjustments, or write a dist feature for those, but things are often being updated so I wouldn't necessarily think it's a meaningful criterion (I mean, yes, we write them manually now, but that seems like something likely to be improved).

I'm not sure of the reason why Glyphs would put all kerning (including base-to-base pairs) in dist for all Indic scripts, it's not the way I would do things, especially considering your observation that Kannada does include kern. I would not be happy for all kerning in a Burmese font to be moved into the dist feature behind the scenes.

The list is correct in omitting Lao and Thai (and pre-Windows 10 Tibetan) because their specification doesn’t have dist.

The specifications are not always very comprehensive...does it mean that dist will not be activated in those scripts even if a font has the feature? If so I think I'd like to suggest to the specification-writer (who is it?) that it should be possible (though not essential) to use dist in Thai and Lao.

@punchcutter

I think there are two issues here:

  1. Generating the same GPOS code that Glyphs GUI does
  2. Having more control over how things like kern and dist (also abvm and blwm) are handled in UFO independent of Glyphs

The reason I started this whole issue was to get the same output that we get for some Indic scripts (like Kannada, Gurmukhi, Gujarati) when exporting from Glyphs. The use of kern vs dist in many situations can be personal preference. For a simple PairPos lookup it hardly matters which feature it's under. When looking at the 1st point the theory of what kern or dist are intended for or how somebody might use them is irrelevant. We want to match Glyphs (whether Glyphs is doing it correctly or not is another topic, but there's a lot of feedback directly to Georg and on the forums so I think it's looking pretty good in general). The reason for this issue and the reason Glyphs creates a dist feature is because for Indic scripts dist is expected to be on by default in apps like MS Word. kern, on the other hand, needs to be activated. For the general user who doesn't know these things they expect to open up an application and see text rendered correctly. It's pretty well known that dist is on by default and therefore desirable to use for adjustments required for correct shaping. I can give plenty of examples where dist is used in various scripts.

Related to the 2nd point above, one thing I do have a huge problem with is the automatic generation that Glyphs does. Intent cannot be automated. There should be more control over how the user decides to use these features. This is the main reason I have never used Glyphs generated OpenType tables. I write my own because I don't like how Glyphs autogenerates everything. If I want kern for this and dist for that then I decide. The same goes for more complex tables like chained contextual lookups for dist. For some users the default is fine, though.

What Norbert said about kern being about improving spacing between base glyphs vs dist about avoiding collision between or with marks, and often contextual, does not fit with the way (as I understood it) Glyphs.app auto-generates the dist feature for Indic scripts.

Norbert is absolutely right about the general use of dist. Using it instead of kern for Indic scripts is kind of an exception to that rule because dist is on by default and is necessary for correct shaping.

Does that mean I should consider all the scripts listed at the end of that page for this auto-generation of a dist feature (as defined above, as alternative to the regular kern feature)? (I see stuff like N'ko in there.. Is that considered an "Indic" script?)

I don't think USE shaped scripts are necessarily relevant here except for the Indic 3 script tags, but if Glyphs is also generating dist for the others then I think we should do the same here. Again, my main goal here is matching Glyphs. If a designer works in Glyphs and generates with Glyphs and tests and then delivers and the pipeline outputs a totally different font then I'd say that's an issue with the pipeline. However, I 100% agree with the idea of not tying this so tightly to Glyphs because using Glyphs is not at all a requirement for using fontmake or ufo2ft.

@khaledhosny
Collaborator

@khaledhosny Any specific proposal you have in mind?

@adrientetar, off the top of my head: the ability to control feature tags, how kerning is split into lookups, their order, language systems, and so on. Right now you either have nearly no control over how the kern feature will be written, or you abandon UFO kerning completely and write feature code manually (and keep it up to date as glyphs are added, removed, or renamed).

@behdad
Collaborator

behdad commented Nov 26, 2017

If I look at, for example, https://www.microsoft.com/typography/OpenTypeDev/kannada/intro.htm, I see that "kern" is listed among the positioning features alongside "dist", "abvm" and "blwm".

You are right. Maybe they were added later. I remember them not being there. What was the original bug report? Did anyone observe something not working, or just our output being different from Glyphs?

@kalapi

kalapi commented Nov 27, 2017

@anthrotype

If I look at, for example, https://www.microsoft.com/typography/OpenTypeDev/kannada/intro.htm, I see that "kern" is listed among the positioning features alongside "dist", "abvm" and "blwm".

Let's just say that the Microsoft specifications aren't really very accurate ;)

@anthrotype
Member Author

I am thinking of defining a private GLIF lib key (e.g. "com.github.googlei18n.ufo2ft.script") that can be used to assign a script property to un-encoded glyphs.

see googlefonts/glyphsLib#308

However, I've also been thinking about an alternative way to assign all these Unicode character properties such as script, category, bidi type, etc. to every glyph in a UFO, especially those without a unicode codepoint.

The Glyphs.app way is to define a global database which maps predefined glyph names (including alternative equivalent names) to a set of Unicode character properties; if a font uses these "nice" names, the latter are looked up automatically in there; in addition, the user can override some of these on an individual glyph basis, and the overrides are stored in the .glyphs source file, under respective glyph properties like "script", "category", etc.

However, these are properties of unicode characters, not of glyphs as such. They are only assigned to glyphs indirectly because of their association with some unicode character, either via their unicode codepoint if any (i.e. the cmap table), or because of some GSUB substitution rule that replaces a glyph-with-a-unicode with another glyph-without-a-unicode. And since they are all properties of a unicode character, it's impossible to override them as there's only one Unicode Character Database (ok, several versions of it, but that's not the point); and they come together as a bundle, and the key is the unicode codepoint.
Even glyphs without an explicit unicode codepoint in the cmap have some closely-related unicode codepoint: i.e. the one associated with the base glyph of which they are an alternate form, or the string of characters required to typeset a ligature glyph.
The glyph name by itself is not something that can be relied upon to "guess" these unicode properties, because it is first and foremost meant to be readable by the designer, so it's usually compact and does not necessarily follow the AGL rules (the "." suffix for alternate forms, the "_" for ligatures, etc.).

What I'm thinking is, instead of storing each of these properties separately as key/value pairs in a UFO GLIF's <lib> dictionary (e.g., one for script, one for category, one for subcategory, one for bidirectional type, etc.), we store only a single unicode string (encoded as utf-8) which represents that glyph. All the other properties can be derived by looking up the Unicode Character Database using that string of unicode characters.
I'm not sure what to call this yet (com.github.googlei18n.ufo2ft.unistring maybe).
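As a rough illustration of the single-string idea, the sketch below derives per-character properties from such a string using Python's standard `unicodedata` module. The function name is hypothetical. Note that the standard library exposes general category and bidi class but not the Script property, so a script lookup would need another source of Unicode data.

```python
import unicodedata


def properties_from_unistring(text):
    """Return the Unicode properties of each character in a glyph's string.

    `text` is the string of characters the glyph stands for, e.g. one
    character for an alternate form or several for a ligature.
    """
    return [
        {
            "char": ch,
            "category": unicodedata.category(ch),    # e.g. "Lo", "Mn"
            "bidi": unicodedata.bidirectional(ch),   # e.g. "L", "AL", "NSM"
        }
        for ch in text
    ]


# Devanagari ka + virama + ssa, the sequence behind a "ksha" ligature glyph.
for props in properties_from_unistring("\u0915\u094D\u0937"):
    print(props)
```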

What do you guys think? Am I missing something? Can you think of a case where one would like to assign any of these unicode properties to a glyph independently without also associating the glyph to a unicode character? I can't.

@anthrotype
Member Author

(perhaps one may argue that even that is redundant because technically it could be derived from the substitution rules defined in the features.fea... 🤔 )

@punchcutter

@anthrotype the only objection that comes to mind is that the Unicode category for some characters is not necessarily "correct" for how OpenType deals with that character. So pointing a glyph to its associated character as represented in the Unicode Character Database will not give the desired category. Mostly this has to do with the various types of marks like non-spacing, spacing, spacing combining. That's usually where things can go wrong. For example, a character defined as a Spacing Mark or Spacing Combining mark might need to be classified as a Letter to allow spacing. I think in general your idea will work, especially for script, bidi etc, but category/subcategory is where I've always run into problems because the GDEF only understands 1: base, 2: ligature, 3: mark and the Unicode database has way more categories that don't always fit nicely into that.
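A small sketch of why an override is needed: the default mapping from Unicode general category to a GDEF glyph class below is an assumption (all "M*" categories become marks), and `gdef_class` with its `override` argument is a hypothetical helper, not an existing API.

```python
import unicodedata

# GDEF glyph classes as mentioned above.
GDEF_BASE, GDEF_LIGATURE, GDEF_MARK = 1, 2, 3


def gdef_class(char, override=None):
    """Pick a GDEF class for a glyph that represents a single character.

    By default every mark category (Mn, Mc, Me) maps to the mark class,
    which is exactly where automatic guesses can go wrong for spacing
    marks. An explicit per-glyph `override` always wins.
    """
    if override is not None:
        return override
    if unicodedata.category(char).startswith("M"):
        return GDEF_MARK
    return GDEF_BASE


# U+093E DEVANAGARI VOWEL SIGN AA is a spacing combining mark (Mc).
print(gdef_class("\u093E"))                      # guessed from Unicode
print(gdef_class("\u093E", override=GDEF_BASE))  # designer's explicit choice
```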

@ohbendy

ohbendy commented Jan 17, 2018

I'm not certain whether I'm understanding correctly so please disregard if I've got the wrong end of the stick. Certain Unicode characters in Burmese are marks by default, but glyph alternates are spacing letters. So the spacing/letter alternate glyph would need to have different properties from the nonspacing/mark default character. And if a glyph is a ligature of a letter and a mark, which category should it inherit? That needs to be up to the user, sometimes depending on the design.

@punchcutter

@ohbendy You are understanding correctly. We are basically saying the same thing: it can't all be automated or guessed.

@anthrotype
Member Author

Thanks guys. I had not considered that. So I guess, we do need to allow users to override category/subCategory on a glyph basis, even when a unicode codepoint is assigned.

But I feel a bit uneasy about allowing overrides of properties like script or bidirectionality for glyphs that do have a unicode codepoint and cannot be understood in a different way. One shouldn't be able to say this glyph with codepoint 0061 has script "Arabic"...
Perhaps we should only allow these on un-encoded glyphs? (but then again some glyphs may not be encoded yet, but may be in a future Unicode version, and "explicit is better than implicit"... oh my!)

@anthrotype
Member Author

also "garbage in, garbage out". If that glyph with codepoint 0061 has a wrong script name, it will end up in the wrong lookup, but then the user is to blame, not the tool.
Ok then, back to square one. I'll add all this extra stuff to the UFO GLIF lib. When exporting UFOs from glyphsLib, the GlyphData.xml or the .glyphs source file's overrides will provide the source for this info. For UFO-only projects, it'll be up to the user to manually populate them.

@anthrotype
Member Author

I think this issue is now fixed by #255.
In the end, I went for the simple approach. Instead of splitting kern and dist into separate lookups, I simply make a single kern lookup containing all the pairs (or two, if there're both LTR and RTL characters), and if there are any Indic scripts defined in the features.fea languagesystem statements, I then make a dist feature containing the same kern lookup and register it only for those Indic scripts.
Having the same lookup being shared by both dist and kern features (they can both be present if there are other non-dist-related scripts) is better than not having any dist feature at all.
The logic in the kern feature writer is already becoming quite complex (given that it needs to handle LTR vs RTL) and I wouldn't like to complicate it further.
If we need to, we can improve on this later.
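For readers who want to picture the result, here is a minimal sketch of the shape of the generated feature code: one shared lookup referenced from both kern and dist. It is not the code from #255. The function name, the lookup name `kern_ltr` and the exact registration statements are assumptions, and a real feature file would also need the corresponding languagesystem statements.

```python
def build_kern_and_dist(pairs, dist_scripts):
    """Build FEA text with one kern lookup shared by kern and dist.

    `pairs` is {(left, right): value}. `dist_scripts` is a collection of
    script tags for which the dist feature should be registered.
    """
    lines = ["lookup kern_ltr {"]
    for (left, right), value in sorted(pairs.items()):
        lines.append(f"    pos {left} {right} {value};")
    lines.append("} kern_ltr;")
    lines += ["", "feature kern {", "    lookup kern_ltr;", "} kern;"]
    if dist_scripts:
        lines += ["", "feature dist {"]
        for tag in sorted(dist_scripts):
            # Register the same lookup under dist for each Indic script.
            lines += [f"    script {tag};", "    language dflt;", "    lookup kern_ltr;"]
        lines.append("} dist;")
    return "\n".join(lines)


print(build_kern_and_dist({("ka-deva", "ssa-deva"): -20}, {"dev2", "deva"}))
```

If `dist_scripts` is empty, only the kern feature is emitted, which matches the case of a font with no Indic scripts declared.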

8 participants