complex shaper: avoid font warning for non-printable codepoints #508

avih · 2021-05-07T21:14:25Z

This was performed in two steps (commits):

Separate the non-printable simple-shaper detection into a stand-alone function.
Reuse this function by the complex shaper.

avih · 2021-05-07T21:27:13Z

Repushed with trivial changes:

Consistent termilonogy at the commit messages (non-printable/zero-width --> control-codepoint).
Trivial change at the second commit: if (is_control(info->symbol)) --> if (is_control(symbol))

avih · 2021-05-07T21:59:12Z

Note, there's a case which I'm not sure how it behaves - though I think it's not a new case: If all the symbols at the run don't try to resolve a font (e.g. if all are below 0x20, or newly also if is_control returns true) and there's only one run, then no font is resolved at all, and ass_font_get_index will set *glyph_index = 0 and *face_index = 0.

I don't know if this ends up valid later on, or even if this tries to use invalid or out-of-bounds index.

Though again - not a new issue as far as I can tell.

avih · 2021-05-08T06:35:02Z

So there are two things which might need addressing:

The second commit message states incorrectly that the simple shaper didn't show this warning - but it did, and the same commit addressed it unintentioally, as both shapers use ass_shaper_find_runs to resolve fonts. I'll revisit the commit message.
The simple shaper now scans the same set of exclusion-codepoints twice: once when resolving the font, and once afterwards to mark it as skipped. Preferably we should only test each symbol once. I didn't try to evaluate yet where's the best place at the code for this would be. I think a good approach would be to scan and mark these symbols early, so that both shapers, and maybe also ass_shaper_find_runs would make use of the same scan result.

Or we can push it now, and unify the scans later - they do use the same code, but it does run twice now for the simple shaper. Let me know what you think.

EDIT - also, now that I think about it, I don't know that the second commit doesn't introduce regressions. I think it has the potential to end up with shorter runs (because control symbols use font 0, but other symbols might use some other font), and I don't know what the implication of a run is.

For instance, if only complete runs are passed to harfbuzz, then these control symbols might be on "runs" of their own, and I don't know whether they'd be taken them into account by the shaper while shaping surrounding runs.

I'd need some help with assessing that. It's the first time I'm looking at libass code.

avih · 2021-05-08T10:24:10Z

Re-pushed with these changes:

First commit: only commit message and comments changes.
Second commit: only commit message changes, and raised questions - hence marked as WIP.
New third commit: avoid is_control redundancy for the simple shaper.

I'm copying the WIP questions asked at the commit message to allow easier discussion:

What happens if harfbuzz does want a font for these codepoints?
Should we use the same list of codepoint both to skip rendering at the simple shaper and to avoid font resolution? The codepoints are now documented, and it seems to me that the same list should be OK.
What if none of the symbols resolves a font? (not a new issue).

Once we answer those, I'll remove the WIP text from the second commit.

And, the last commit message suggests that it might be possible to reuse the existing .skip member instead of adding a new .is_control member. It's probably fine as-is currently, and here are some considerations:

They do seem at least somewhat overlapping in function.
Reusing will definitely NOT add any regression for the complex shaper, because right after ass_shaper_find_runs (which tests is_control) - ass_shaper_shape is called, and at the complex shaper code path it resets and re-calculates the .skip member anyway (at shape_harfbuzz).
For the simple shaper code path, it can cause regressions only if the .skip member is set selectively before calling ass_shaper_find_runs. If it's not, then immediately afterwards shape_fribidi is called (which doesn't seem to use .skip), and then ass_shaper_skip_characters (which sets it to the same values calculated earlier by is_control).

However, even if .skip can be reused 100% safely, there's still a question of whether we want to. I don't think I can answer this on my own, so for now I think it's OK to use the new member .is_control.

astiob · 2021-05-08T14:17:33Z

I don't know that the second commit doesn't introduce regressions. I think it has the potential to end up with shorter runs (because control symbols use font 0, but other symbols might use some other font), and I don't know what the implication of a run is.

Yes, that sounds like a regression, unfortunately. Good catch!

Or for some fonts, more like… lack of fix than regression. I think this is the very issue that triggered #71 (comment).

For e. g. bidi control overrides it shouldn’t matter, but e. g. U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER must make it to HarfBuzz as part of their surrounding shaping runs, because they affect shaping. That means the control character must inherit shape_run_id of the preceding (or succeeding) character, regardless of whether and which fonts contain or lack a glyph for it and regardless of whether the preceding (or succeeding) character itself comes from the main font or a fallback font. It’s probably safe to stick to copying the run ID of the preceding character and use 0 or skip or whatever if the control character itself leads the string.

For what it’s worth, I’ve verified that VSFilter/Uniscribe seems to handle ZWJ correctly (unlike current libass) in a situation where it joins Arabic letters and the main font (Helvetica) has a ZWJ glyph but not Arabic glyphs.

What happens if harfbuzz does want a font for these codepoints?

If it does, and we don’t handle it specially, it will presumably be handled like missing glyphs are handled at the moment. See also my next answer.

Should we use the same list of codepoint both to skip rendering at the simple shaper and to avoid font resolution?

I think that’s what you do in this PR, so is the question rather “is there anything wrong with this approach”? No, I think it’s fine. But we may want to revise the list (again) to better match HarfBuzz.

I’m not fully sure how HarfBuzz decides which code points to request glyphs for and which not. That (plus the potential for changes over time) is why ideally I’d like to somehow make HarfBuzz pick these characters out for us, rather than us try to guess and hope the lists match.

I think it may be based on the default-ignorable Unicode property. I see HarfBuzz uses an is_default_ignorable function internally, but it seems that it’s not public. @khaledhosny Is this right?

What if none of the symbols resolves a font? (not a new issue).

Uh, well, is it really any different from when some of the symbols don’t resolve any font?

it might be possible to reuse the existing .skip member instead of adding a new .is_control member.

Indeed, it is. You’re right when you say the complex shaper resets .skip, and concerning the simple shaper, .skip is indeed not set anywhere before the shaper is called as far as I can tell & remember.

So you can just move the ass_shaper_skip_characters call from the simple shaper path in ass_shaper_shape to the common path in ass_shaper_find_runs or ass_render_event.

avih · 2021-05-08T14:51:59Z

Yes, that sounds like a regression, unfortunately. Good catch!

Right. Good catch, bad patch :)

must make it to HarfBuzz as part of their surrounding shaping runs

Indeed. I also presumed this could be the case, though as you mentioned, it can happen anyway before this PR if the glyph is missing - which as we know can definitely happen for control codepoints.

I think that’s what you do in this PR, so is the question rather “is there anything wrong with this approach”?

Correct. I thought this was obvious in this context, but I could have phrased it better. Thanks.

ideally I’d like to somehow make HarfBuzz pick these characters out for us

Completely agreed. That was one of the things I thought might be happening here #507 (comment)

It's definitely better to not guesswork/duplicate it, unless it's specified explicitly how harfbuzz does this.

What if none of the symbols resolves a font? (not a new issue).

Uh, well, is it really any different from when some of the symbols don’t resolve any font?

I think it is different. Because if no font was resolved, and later you're required to provide a font, then it might end up accessing index 0 of the array, and maybe the array size is 0 to begin with because no font was ever found.

I didn't try to follow the code enough to understand whether it's a real issue or not, but regardless, it's not a new issue because theoretically (but I don't know if practically) it can already happen before this PR if all the symbols are smaller than x20.

So you can just move the ass_shaper_skip_characters call from the simple shaper path in ass_shaper_shape to the common path in ass_shaper_find_runs or ass_render_event.

Can I? I wouldn't have thought so myself, because then shape_fribidi - which runs after ass_shaper_find_runs would not see the direction codepoints (because by then the symbol was modified to 0).

So, considering the potential (additional) regression with even more cases of too-short runs than before, here's what I suggest we do:

Do this now U+2060 (Word Joiner) warns "fontselect: failed to find any fallback for font: (sans-serif, 400, 0)" #507 (comment) (just suppress warnings for control codepoints). PR-wise, this means keeping only the first commit, and adding a new commit which reuses this function to test whether it should warn or just do info.
At a later point someone might want to have a go at the more correct U+2060 (Word Joiner) warns "fontselect: failed to find any fallback for font: (sans-serif, 400, 0)" #507 (comment) (refactor the dataflow - which I accidentally thought was posted by you, and only now noticed it's by a different person).

astiob · 2021-05-08T14:56:46Z

because by then the symbol was modified to 0

Er, right. Just drop the part that changes symbol to 0. (It’s important for the complex shaper, too.) I don’t think it adds any meaning for later processing stages anyway (when skip is set)!

astiob · 2021-05-08T15:01:42Z

On the other hand, if the font does contain a glyph for a control character and we’re using the “simple shaper”, I think we end up using the glyph’s metrics for layout despite ass_shaper_skip_characters. I’m not too confident though, so this needs to be tested.

This PR might actually fix that.

avih · 2021-05-08T15:10:02Z

Er, right. Just drop the part that changes symbol to 0. (It’s important for the complex shaper, too.) I don’t think it adds any meaning for later processing stages anyway (when skip is set)!

Right. That was indeed one thing I also considered, but didn't know if both .skip and .symbol=0 are required, or whether .skip is enough. And that's also the reason I thought that .skip might be set selectively before ass_shaper_find_runs in a way which is not completely overlapping .symbol==0 - and ultimately the reason I didn't reuse .skip (because otherwise why were both symbol and skip modified?).

It’s probably safe to stick to copying the run ID of the preceding character and use 0 or skip or whatever if the control character itself leads the string.

Yes, definitely better to not change the run ID, but it can become tricky to a degree I felt uncomfortable to make this change without deeper knowledge of the codebase.

So, I still stand by this:

So, considering the potential (additional) regression with even more cases of too-short runs than before, here's what I suggest we do:

Do this now ... (just suppress warnings for control codepoints). PR-wise, this means keeping only the first commit, and adding a new commit which reuses this function to test whether it should warn or just do info.

Now also because different approaches which work better than this PR (or just suppression) do also require deeper knowledge of the codebase, and I'm simply not comfortable taking them without knowing for a fact that I don't introduce regressions.

If you think that we can do better right now than simply addressing the warning on its own (i.e. my suggestion), then I'd prefer that you take over and handle it however you think is best.

What say you?

avih · 2021-05-09T10:52:27Z

That means the control character must inherit shape_run_id of the preceding (or succeeding) character, regardless of whether and which fonts contain or lack a glyph for it and regardless of whether the preceding (or succeeding) character itself comes from the main font or a fallback font.

I agree that's the best thing to do, and I decided to have a go at this too. I think on top of the existing 3 commits would be OK.

It’s probably safe to stick to copying the run ID of the preceding character and use 0 or skip or whatever if the control character itself leads the string.

That's implementation details, which I don't immediately know how to translate to code. Also, not sure about .skip. I recommend that for now, unless you think this would be incorrect, keep the existing usage of skip independent of the new code. This can be optimized later, instead of adding noise to the current discussion.

I've had a different(?) idea at the implementation:

The base is the current code - with the 3 commits which are already at this PR.
Keep the current function of ass_shaper_find_runs as is, but with an additional single bit which remembers whether at least one control codepoint was encountered throughout the entire length (spanning runs).
After the loop is complete - if no control codepoints were encountered or if there was only one run (shape_run == 0), then return normally. I.e. so far this is identical as before - but only for cases where control codepoints didn't exist or didn't break the run.
Otherwise - i.e. there were control codepoint[s] and more than one run: make a second pass of the whole length and re-calculate the runs. Depending on answers to the questions below - this might be integrated with the first pass instead of as an additional pass.

Few things I'd need help with:

Does anyone care later on if a run has more than one font? Specifically - can we leave the font as 0 for control codepoints, while surrounding symbols of the same run have a different font? If no one cares then it will make the logic quite simpler, because we wouldn't need to find an actual font from another symbol at the run to copy to the control codepoint[s] of the run.
Why is ass_font_get_index (which resolves a font) called only if (!info->drawing_text.str) ? I.e. what's the meaning of having or not having info->drawing_text.str? My guess would be that the code considers info->drawing_text.str != NULL if and only if font info was already resolved for this symbol. Is this correct?
At the existing code, to test whether a new run starts, the font info is compared with the previous glyph at the array - NOT with the previous glyph for which ass_font_get_index was executed (e.g. if info->drawing_text.str is not NULL). To me it looks that the code assume (see previous question) that the glyph already has front info. Is this correct?
Assuming info->drawing_text.str != NULL indeed means "already has font info", then just some info question (has no effect on any new logic): who sets the font prior to running ass_shaper_find_runs? and why only to some of the symbols at the array?

avih · 2021-05-09T11:54:08Z

Does anyone care later on if a run has more than one font?
Assuming info->drawing_text.str != NULL indeed means "already has font info"

Assuming no one cares if control codepoints inside a run have a different font than surrounding symbols at the run, and assuming info->drawing_text.str != NULL indeed means "already has font info", then here's my suggested patch on top of the existing 3 commits. I.e. This integrates the new logic into the existing pass without adding a new pass:

diff --git a/libass/ass_shaper.c b/libass/ass_shaper.c
index 396185f..50390c6 100644
--- a/libass/ass_shaper.c
+++ b/libass/ass_shaper.c
@@ -835,8 +835,17 @@ void ass_shaper_find_runs(ASS_Shaper *shaper, ASS_Renderer *render_priv,
             ass_font_get_index(render_priv->fontselect, info->font,
                     symbol, &info->face_index, &info->glyph_index);
         }
-        if (i > 0) {
-            GlyphInfo *last = glyphs + i - 1;
+
+        // test if a new run starts, completely ignoring control codepoints
+        int last_font = info->is_control ? -1 : i - 1;
+        while (last_font >= 0 && glyphs[last_font].is_control)
+            last_font--;
+        // last_font is -1 if NOT considering a new run (first symbol, or
+        // control codepoint, or all prior symbols are control), else nearest
+        // earlier index to compare font (guaranteed not a control codepoint).
+
+        if (last_font >= 0) {
+            GlyphInfo *last = glyphs + last_font;
             if ((last->font != info->font ||
                     last->face_index != info->face_index ||
                     last->script != info->script ||

This fixes the new cases of too-short runs introduced by the second commit - control codepoint breaks the run even if a glyph can be found.

And it also fixes earlier cases of too-short runs - control codepoint breaks the run because no glyph was found.

EDIT: or even a better patch which fixes the same two issues (less code, O(1) even in pathological edge cases):

diff --git a/libass/ass_shaper.c b/libass/ass_shaper.c
index 396185f..ea6bd5e 100644
--- a/libass/ass_shaper.c
+++ b/libass/ass_shaper.c
@@ -818,6 +818,7 @@ void ass_shaper_find_runs(ASS_Shaper *shaper, ASS_Renderer *render_priv,
 {
     int i;
     int shape_run = 0;
+    int last_font = -1; // latest earlier index which is not a control codepoint
 
     ass_shaper_determine_script(shaper, glyphs, len);
 
@@ -835,8 +836,15 @@ void ass_shaper_find_runs(ASS_Shaper *shaper, ASS_Renderer *render_priv,
             ass_font_get_index(render_priv->fontselect, info->font,
                     symbol, &info->face_index, &info->glyph_index);
         }
-        if (i > 0) {
-            GlyphInfo *last = glyphs + i - 1;
+        if (!info->is_control && last_font >= 0) {
+            // Consider a new run: current symbol is not a control point,
+            // and we also have a prior symbol which is not a control point.
+            // TODO: if we have last_font followed by several control symbols,
+            //       and the current symbol font differs from the last_font
+            //       (i.e. a new run begins), should the intermediate control
+            //       symbols belong to the previous run or to the new run?
+            //       (currently they remain with the previous run)
+            GlyphInfo *last = glyphs + last_font;
             if ((last->font != info->font ||
                     last->face_index != info->face_index ||
                     last->script != info->script ||
@@ -845,6 +853,9 @@ void ass_shaper_find_runs(ASS_Shaper *shaper, ASS_Renderer *render_priv,
                 shape_run++;
         }
         info->shape_run_id = shape_run;
+
+        if (!info->is_control)
+            last_font = i;
     }
 }

avih · 2021-05-09T21:48:34Z

Assuming no one cares if control codepoints inside a run have a different font than surrounding symbols at the run

Actually, I'd probably not make such assumption, unless we know 100% for a fact that that is the case. The reason being that I can imagine harfbuz wanting to use some font metrics while appying these control codepoints, and with a "dummy" font 0 - it could end up using wrong metrics from the font even if the glyph itself is not used.

Luckily, making sure that all the control codepoints font info in a run is copied from a codepoint for which a font was actually resolved is relatively simple to add to the patch above. Probably 2-3 more LOC.

I'm not updating the patch above, but unless you prefer otherwise (i.e. keep the patch above as is in this regard), then I'll add it to the commit, assuming you'd want this functionality added.

avih · 2021-05-10T18:07:38Z

Forced-pushed with the following changes:

First commit (stand-alone is_control): change unsigned to uint_least32_t - explained at the updated commit message.
Second commit (don't resolve control codes face): removed WIP, updated commit message and one code comment.
Third commit (avoid redundant is_control): only updated commit message about reusing the .skip member.
New fourth commit: don't break runs on control codepoints (bugfix of the inline patch posted at an earlier comment).
New fifth commit: WIP: make control codes use the surrounding face_index.

I'm fairly confident that this patchset:

Doesn't introduce regressions.
Avoids warnings for missing control code glyphs.
Possibly improves performance due to skipping face resolution for control codes.
Fixes accidental run breaks due to unresolved control codes glyphs (or due to resolution at a different face) .

The fourth commit (don't break runs on control codepoints) adds three TODO comments:

// TODO: currently only control codes face_index is not-comparable,
//       but we should consider the same also if a face was not
//       resolved by ass_font_get_index for other reasons

This is not a new question, and the code doesn't change existing behavior. However, someone might want to revisit the behavior in the future.

// TODO: else face_comparable remains true. is this correct?

(when info->drawing_text.str != 0). This is not a new question, and the code doesn't change existing behavior. However, someone might want to revisit the behavior in the future.

// TODO: if there are both comparable and uncomparable faces at
//       the previous run, then their face_index can differ.
//       if this matters to the shaper, it can be adjusted here.

This is a new question, and is a result of the change that face_index for control codes can now differ from surrounding symbols, while previously face_index was indentical for the whole run (because the run broke, possibly incorrectly, when it changes).

I don't know the answer to that (i.e. whether it matters to the complex shaper), but if the answer is "yes, it does matter", then the fifth WIP commit addresses this concern. If it doesn't matter then keeping the fifth commit doesn't hurt, but also doesn't add anything useful, so maybe it can be removed.

khaledhosny · 2021-05-10T23:20:43Z

I think it may be based on the default-ignorable Unicode property. I see HarfBuzz uses an is_default_ignorable function internally, but it seems that it’s not public. @khaledhosny Is this right?

Yes, this isn't exposed in public API (the check is tailored for font shaping in HarfBuzz, it doesn't match the Unicode property 100%).

avih · 2021-05-11T10:07:27Z

I think it may be based on the default-ignorable Unicode property. I see HarfBuzz uses an is_default_ignorable function internally, but it seems that it’s not public. @khaledhosny Is this right?

Yes, this isn't exposed in public API (the check is tailored for font shaping in HarfBuzz, it doesn't match the Unicode property 100%).

FWIW, I found this definition here https://en.wikipedia.org/wiki/Module:Unicode_data/derived_core_properties :

data.default_ignorable = {
	singles = {
		[0x000AD] = true,
		[0x0034F] = true,
		[0x0061C] = true,
		[0x03164] = true,
		[0x0FEFF] = true,
		[0x0FFA0] = true,
	},
	
	ranges = {
		{ 0x0115F, 0x01160 },
		{ 0x017B4, 0x017B5 },
		{ 0x0180B, 0x0180E },
		{ 0x0200B, 0x0200F },
		{ 0x0202A, 0x0202E },
		{ 0x02060, 0x0206F },
		{ 0x0FE00, 0x0FE0F },
		{ 0x0FFF0, 0x0FFF8 },
		{ 0x1BCA0, 0x1BCA3 },
		{ 0x1D173, 0x1D17A },
		{ 0xE0000, 0xE0FFF },
	},
}

the libass set which this patchset puts inside the is_control test is a subset of this, and unless I'm mistaken, there's only one codepoint which libass defines in this category which the definition above doesn't: 0x0180F.

In the other way around - the definition above includes quite a bit more than the definition of is_control.

However, I don't know how credible this definition is, and not sure what it refers to. though google did place this page near the top of search results, so I guess people do link to it.

avih · 2021-05-11T10:11:59Z

@khaledhosny for codepoints in this (or some other) default_ignorable list:

Does harfbuzz try to retrieve glyph info?
Does harfbuzz try to retrieve face info/metrics?

My guess would be probably "no" for glyph info, as glyphs for these codepoints are probably not very relevant.

But for face data for these codepoitns - I can imagine that there could be value even without the glyph.

So, for instance, if libass tries to resolve faces for runs, and there are default_ignorable codepoints embedded within these runs which libass doesn't try to resolve a glyph (and similarly neither a face), should their face be adjusted to match the surrounding face at the run?

khaledhosny · 2021-05-11T15:37:12Z

@khaledhosny for codepoints in this (or some other) default_ignorable list:
* Does harfbuzz try to retrieve glyph info?

* Does harfbuzz try to retrieve face info/metrics?

I don’t quite understand what you mean. But HarfBuzz will behave shaping wise whether the font have glyphs for these code points or not. Under normal circumstances it will replace them with space glyph and zero its widths (the replacement glyph can be controlled with the API), it can also retain the glyph in the output under some special circumstances (if requested with the API, or if the fonts has some substitution rules for these glyphs).

avih · 2021-05-11T16:28:11Z

I don’t quite understand what you mean

OK, I based my question on this (#507 (comment)):

ass_shaper_shape runs HarfBuzz, which requests font glyphs via callbacks and receives our previously found glyphs. I assume it doesn’t request a glyph for U+200B or U+2060. ...

This PR changes how "our previously found glyphs" works. I.e. it takes place at the stage of resolving the glyphs, before calling harfbuzz, and tries to avoid glyph resolution for default_ignorable codepoints, and deal with any consequences it might have.

So I'm trying to understand for a fact what happens without assuming: whether or not harfbuzz does try to "request the font glyph via callbacks" for codepoints at some default_ignorable list.

And the reason I'm asking is to understand whether or not libass should try to resolve a face which has a glyph for it, or whether it can simply say "it's ignorable, I'm not going to bother finding a (fallback) face which has this glyph".

The second question is basically: does harfbuzz care at all what the face is for codepoints in a default_ignorable list - under normal circumstances (without some "show hidden" mode etc).

I.e. I think it's not unimaginable that harfbuzz knows that the glyph itself is not relevant under normal circumstances, but that the face might still be relevant to extract some metrics in order to apply the codepoint correctly during layout. This is a guess - but I hoped for more knowledgable info by you.

The reason I'm asking is that due to implementation details, if libass doesn't try to resolve a face for the glyph then the face ends up arbitrary. So this patchset can either leave it with some arbitraty value (if harfbuzz doesn't care what the face is) or it needs to do some work in order to adjust the face to match the resolved face of surrounding glyphs.

So this is how I interpret your reply (please confirm):

But HarfBuzz will behave shaping wise whether the font have glyphs for these code points or not. Under normal circumstances it will replace them with space glyph and zero its widths (the replacement glyph can be controlled with the API)

So assuming a replacement glyph is not provided via the API, and under "normal" circumstances, do I understand correctly that harfbuzz does not need to access neither the glyph nor the font which is associated with this codepoint?

astiob · 2021-05-11T17:58:00Z

the check is tailored for font shaping in HarfBuzz, it doesn't match the Unicode property 100%

Hmm, so there’s no way to know which code points require font data and which don’t other than going the Chrome way (with extra complications due to synthesized glyphs) or duplicating internal HarfBuzz code?

Speaking of the Chrome way, is it not possible that a font intentionally uses glyph #0 for a supported character (or substitution) such that it should not invoke font fallback?

And the proper handling of default ignorable code points is… feed them to HarfBuzz first; if it returns .notdef, perform font fallback as usual; if out of fonts and they’re still .notdef, turn them invisible? It would be great if hb-unicode exposed the raw default-ignorable property so we didn’t have to maintain our own list likely to fall out of sync over time.

(By the way, I don’t see Chrome do this—turn ignorables invisible if no font has them and HarfBuzz has let them through—although I’m merely foolishly looking at unfamiliar source code. It looks like Firefox does this. But then Firefox had this issue but I don’t understand how it could be an issue to begin with and hence whether I need to worry about it in libass.)

astiob · 2021-05-11T18:10:40Z

@avih Sorry for my silence in the last few days. I very much appreciate your enthusiasm and effort, but I think you’re overcomplicating this :-)

Just copy the last glyph’s font face in ass_shaper_find_runs instead of invoking ass_font_get_index.

By the way, currently, libass will try to fetch the glyph later anyway, regardless of HarfBuzz.

avih · 2021-05-11T18:30:19Z

Just copy the last glyph’s font face in ass_shaper_find_runs instead of invoking ass_font_get_index

I don't think this would work, but I'd be to glad be proven wrong (and my suggestion for you to just take over is still in place, I really don't mind who writes the code).

The point is that a run could have several ignorable codepoints in a row at the beginning the run. I.e. you have nothing to copy from. And once the run ends - only then you can look back at the entire run - and if it has both resolved and unresolved glyhs - copy the resolved ones to the unresolved ones.

But sure, it's also entirely possible I'm overly complicating :)

astiob · 2021-05-11T19:20:23Z

libass/ass_shaper.c

+            // NOTE: if face_index changes then it can split the run further,
+            //       but cannot accidentally combine adjacent runs (e.g. if two
+            //       symbols had different face_index but they have the same
+            //       face_index after ass_font_get_index). that's because split


FYI: This is the very place that sets face_index. It is zero until here.

FYI: This is the very place that sets face_index. It is zero until here.

Right, that's the only thing I wasn't 100% sure of, so I added this comment to make it obvious if I'm assuming wrong. Thanks.

The reason I assumed that is that split_style_runs splits on bold and italic, and assumed face_index was derived from the style somewhere before ass_shaper_find_runs, and that then ass_font_get_index starts with the desired face/style, and then falls back to others. That's the very first thing which ass_font_get_index does - tries the glyph at the input face_index - and if it's found, then nothing happens further later:

// try with the requested face if (*face_index < font->n_faces) { face = font->faces[*face_index]; index = FT_Get_Char_Index(face, ass_font_index_magic(face, symbol)); }

And the comment // try with the requested face makes it look even more like the input face_index was already set explicitly to some requested value.

astiob · 2021-05-11T19:58:14Z

The point is that a run could have several ignorable codepoints in a row at the beginning the run. I.e. you have nothing to copy from.

This is correct.

I was about to say:
But for almost all default-ignorable code points, this shouldn’t make a difference as they only have meaning when they follow other things. And even for those that do have meaning as prefixes, in this PR’s approach, this would be consistent with our handling of ignorables in the middle of a run, where we always propagate fonts/runs forwards, never backwards. Using/leaving face_index 0 is fine in these cases.

But:

I didn’t notice that this kind of situation can happen in the middle of the text, where shaping runs are broken for some other reason and the first code point is one of the problematic/interesting ones. This somewhat belittles my argument about consistency.
⚠️ HarfBuzz can keep one of these code points visible if it’s named in an OpenType substitution. For this to happen, it needs to see the right font, and we need it, too, to fetch the glyph for display. (Thanks for mentioning this, Khaled. I saw the linked issue the other day, but it didn’t click that this affects us.)

I think this alone actually answers all your questions about needing the right font.

Oh welp, so we do want the right face_index for run-leading code points.

if no font was resolved, and later you're required to provide a font, then it might end up accessing index 0 of the array, and maybe the array size is 0 to begin with because no font was ever found.

This is impossible, so you need not worry. libass makes sure each font always starts with one face. (See: font check in parse_events, which follows update_font, which invokes ass_font_construct.)

Why is ass_font_get_index (which resolves a font) called only if (!info->drawing_text.str) ? I.e. what's the meaning of having or not having info->drawing_text.str?

It checks whether a “glyph” is actually an inline ASS vector drawing.

ASS can contain vector drawings instead of (in addition to) text. Drawings need no text shaping and delimit shaper runs.

to test whether a new run starts, the font info is compared with the previous glyph at the array - NOT with the previous glyph for which ass_font_get_index was executed (e.g. if info->drawing_text.str is not NULL).

Every drawing, and every glyph immediately following a drawing, has info->starts_new_run = true.

avih · 2021-05-11T23:12:01Z

maybe the array size is 0 to begin with because no font was ever found.

This is impossible, so you need not worry ..

Good. Thanks :)

I was about to say:
But for almost all default-ignorable code points ...

Right. But I didn't make any assumptions about prior states, except at my comment which you say is wrong :) (about face_index always being 0 before calling ass_font_get_index). Luckily the comment is wrong only in its assumption that face_index could be different on entry, but not in its conclusion - that if ass_font_get_index changes face_index then it can only split the run further but not accidentally combine adjacent runs - which was my main concern.

TBH, by looking at ass_font_get_index it still definitely does not look to me like a function which expects face_index to always be 0 on entry, and that n_faces is always at least 1, but I'll take your word that this is the case.

HarfBuzz can keep one of these code points visible ...

I don't think I really understand the statement or its implications. See below.

I think this alone actually answers all your questions about needing the right font.
Oh welp, so we do want the right face_index for run-leading code points.

Not 100% sure I follow. The right font as the surrounding glyphs (i.e. artificial adjustment)? or the right font as one which has a glyph for this codepoint (i.e. we actually need ass_font_get_index to try and resolve a face rather than calling it with symbol==0)?

If it's the latter and it does have a resolved face which (can) differs from its surroundings, do we want to break the run due to the changed face (assuming nothing else except the face changed)? or do we want to keep it in one run which will now have more than one face?

And also if it's the latter, what happens if we can't find a face which has this glyph? Do we still do the warning?

ASS can contain vector drawings instead of (in addition to) text. Drawings need no text shaping and delimit shaper runs.

Right. I didn't assume anything about it, and the code doesn't need modification following this new (to me) information, but it does answer one of the TODO items:

// TODO: else face_comparable remains true. is this correct?

So the answer is that it's correct because it doesn't matter - it's force-delimited by starts_new_run anyway. Good.

khaledhosny · 2021-05-12T02:05:10Z

So I'm trying to understand for a fact what happens without assuming: whether or not harfbuzz does try to "request the font glyph via callbacks" for codepoints at some default_ignorable list.

It depends on whether HarfBuzz wants to show this glyph or not, and like I said earlier there are few situations where it would.

And the reason I'm asking is to understand whether or not libass should try to resolve a face which has a glyph for it, or whether it can simply say "it's ignorable, I'm not going to bother finding a (fallback) face which has this glyph".

This should work, as long as it is kept in the same run (runs, by definition, must use the same font) with its surroundings.

Speaking of the Chrome way, is it not possible that a font intentionally uses glyph #0 for a supported character (or substitution) such that it should not invoke font fallback?

HarfBuzz does not have any special meaning of 0 glyph index, the callback function returns false when it does not find a glyph for the specified code point and that what tells HarfBuzz the font does not support it (Pango, for example, returns a very high glyph index that also encodes the code point to use it later for drawing missing glyph hex box).

feed them to HarfBuzz first; if it returns .notdef, perform font fallback as usual; if out of fonts and they’re still .notdef, turn them invisible

You are unlikely to ever end up with .notdef for Default_Ignorables unless no font in the fallback stack has the replacement character (space by default).

astiob · 2021-05-12T10:14:30Z

You are unlikely to ever end up with .notdef for Default_Ignorables unless no font in the fallback stack has the replacement character (space by default).

This is true for most of them, but what about the Default_Ignorables that are excluded from HarfBuzz’s internal list, such as hangul fillers and shorthand format controls? My understanding is that it does not treat them as ignorables in any manner at all. If the font supports them, the font itself will ligate/substitute them into blank glyphs. But if not, we should not show fallback .notdef, because that’s the definition of Default_Ignorable.

astiob · 2021-05-12T10:44:07Z

HarfBuzz does not have any special meaning of 0 glyph index, the callback function returns false when it does not find a glyph for the specified code point and that what tells HarfBuzz the font does not support it (Pango, for example, returns a very high glyph index that also encodes the code point to use it later for drawing missing glyph hex box).

Aha, I see it now. Nice, thanks!

If the hb_font_get_nominal/variation_glyph returns false, is it safe to assume HarfBuzz will not call any of the other callbacks for that glyph?

khaledhosny · 2021-05-15T20:38:48Z

This is true for most of them, but what about the Default_Ignorables that are excluded from HarfBuzz’s internal list

They would be returned as .notdef indeed, though I'm not familiar with these code points to know if it is an issue in practice.

If the hb_font_get_nominal/variation_glyph returns false, is it safe to assume HarfBuzz will not call any of the other callbacks for that glyph?

I think so.

@astiob

This removes the function ass_shaper_skip_characters, and sets .skip at ass_shaper_find_runs to the same value as before. However, unlike before - it doesn't also set .symbol=0. This affects only the simple shaper, because the complex shaper resets and re-calculates .skip later at shape_harfbuzz. According to @astiob this is equivalent and without regressions, though I didn't try to prove its correctness - avih. The type uint_least32_t is different than other places which hold a codepoint (`uint32_t' or `unsigned' are used) because that's the only type for which C99 guarantees support, and is at least 32 bits (uint32_t is optional, and unsigned can theoretically be 16 bits).

@astiob

default-ignorable (DI) codepoints glyphs are not present at all fonts, are not used by the simple shaper for rendering, and are mostly not used by the complex shaper for rendering. This avoids missing-glyph warnings for DI codepoints. Because now we don't try to resolve DI glyphs - their face_index could differ from their surrounding, which would break the run. Similarly, before this commit DI could also break the run due to missing glyphs or if resolved-with-different-face. So now we also don't allow DI face_index to break the run. However, DI face_index can still differ from the surrounding, so we adjust DI face_index to match the rest of the run if possible so that the complex shaper later gets the correct face_index if needed (technically, only the first face_index of a run is used later). There's still a potential issue if the complex shaper needs DI glyphs for the rendering but they're not available at the run's face_index. After discussion with @astiob, it was concluded that perfect solution can't exist without major refactoring of the dataflow, and that the the approach at this commit (don't resolve DI glyphs, don't break runs on DI, adjust DI face_index to match the run) is both better than before and also the best we can do within the current code form.

avih · 2021-05-26T16:48:15Z

Force-pushed two commits which replace the previous ones, as requested by @astiob on IRC.

avih · 2021-06-12T07:19:30Z

For what it's worth, I've been using the last patch-set (two-commits) in my own build of libass with mpv for the past 2 weeks. It seems to solve the original issue (no more warnings for missing glyphs of default-ignorable), and I didn't notice a negative impact, though I didn't try to measure anything.

astiob · 2021-07-09T23:50:55Z

I’ll get back to this now.

Preliminary tests suggested that Uniscribe in VSFilter doesn’t attach leading ZWJ to a subsequent fallback shape run, which should let us simplify our code further. But first, I want to do another test with something like U+1D179 MUSICAL SYMBOL BEGIN PHRASE that is logically a leading character, to make sure it doesn’t depend on the exact character.

avih · 2021-07-10T05:47:00Z

Preliminary tests suggested ...

I don't think I understand how is this related to the issue which this PR tries to solve - avoid warning for missing glyphs which are not rendered anyway.

Fonts don't need to contain any glyphs for these code points. Their special effect on layout (if any) will be handled by FriBidi and/or HarfBuzz, and they will be replaced by zero-width glyphs regardless of whether any font glyph was provided. By trying hard to find a glyph as we currently do, we end up mistakenly splitting shape runs on control characters that lack font glyphs and therefore ruining shaping, as well as logging unnecessary warnings about the missing glyphs. There are some quirks and exceptions. In particular, font glyphs for normally-invisible characters may be retained by recent HarfBuzz if OpenType substitutions are involved (which sometimes matches and sometimes doesn't match Uniscribe). This commit happens to retain this behavior for visible glyphs contained in the main font (as opposed to any arbitrarily-chosen fallback fonts) by virtue of defaulting run-leading invisibles to face_index 0, where HarfBuzz will find the glyphs on its own. Meanwhile, the case of _fallback_ fonts inserting visible glyphs doesn't even seem to be desirable to begin with. Preliminary testing shows this commit's logic matches Uniscribe in some common cases, but tests with musical symbols show strange results. More research is needed to match Uniscribe and hence VSFilter even better. On the other hand, this logic is likely not sophisticated enough for "optimal" typesetting without regard to Uniscribe compatibility. For example, Blink does not look up glyphs and split font-fallback runs prior to running HarfBuzz at all, and instead runs HarfBuzz on the whole higher-level run at once and then reruns it with the next fallback font on just those substrings that HarfBuzz failed to shape. In any case, some Default_Ignorable glyphs are more appropriately viewed as logical prefixes than suffixes and should be grouped with the following, not preceding, character in case of doubt. Furthermore, when a fallback font is used for a sequence of text, it may be desirable to use the same font for intervening or surrounding whitespace or punctuation even if the main font (or another fallback font) also contains glyphs for them; and whitespace characters can be synthesized in lieu of import from another font, and indeed HarfBuzz does synthesize them, but this is not always desirable. Additionally, the glyph HarfBuzz uses for zero-width substitutes may actually be visible as it defaults to the font's basic space glyph: if we were to forget about Uniscribe, we should define a custom, fake glyph index and translate it into `skip`. Finally, our higher-level shape run splitting is not perfect either: bidi splitting after script splitting can create awkward short runs, and in general, script splitting seems problematic in itself. Fixes libass#507 (log messages). Supersedes and closes libass#508. Fixes misshaping such as in the sample in libass#71 (comment), libass#71 (comment). However, this particular sample was actually a combined result of unnecessary font fallback (which this commit fixes) with another bug that was fixed in commit 80ce637: it already renders correctly since that commit, but the current commit alone would also have fixed it.

avih force-pushed the master branch from da6d5dd to fea34c7 Compare May 7, 2021 21:24

avih force-pushed the master branch from fea34c7 to 659dcc1 Compare May 8, 2021 10:07

astiob mentioned this pull request May 8, 2021

U+2060 (Word Joiner) warns "fontselect: failed to find any fallback for font: (sans-serif, 400, 0)" #507

Closed

avih force-pushed the master branch from 659dcc1 to dce0b7e Compare May 10, 2021 18:03

astiob reviewed May 11, 2021

View reviewed changes

avih added 2 commits May 26, 2021 19:46

avih force-pushed the master branch from dce0b7e to d9efc94 Compare May 26, 2021 16:47

astiob mentioned this pull request Sep 1, 2021

Don’t try to look up glyphs for invisible control characters #543

Merged

astiob closed this in d170d89 Sep 8, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

complex shaper: avoid font warning for non-printable codepoints #508

complex shaper: avoid font warning for non-printable codepoints #508

avih commented May 7, 2021 •

edited

Loading

avih commented May 7, 2021

avih commented May 7, 2021

avih commented May 8, 2021 •

edited

Loading

avih commented May 8, 2021

astiob commented May 8, 2021 •

edited

Loading

avih commented May 8, 2021

astiob commented May 8, 2021

astiob commented May 8, 2021 •

edited

Loading

avih commented May 8, 2021 •

edited

Loading

avih commented May 9, 2021 •

edited

Loading

avih commented May 9, 2021 •

edited

Loading

avih commented May 9, 2021

avih commented May 10, 2021 •

edited

Loading

khaledhosny commented May 10, 2021

avih commented May 11, 2021 •

edited

Loading

avih commented May 11, 2021 •

edited

Loading

khaledhosny commented May 11, 2021

avih commented May 11, 2021 •

edited

Loading

astiob commented May 11, 2021 •

edited

Loading

astiob commented May 11, 2021

avih commented May 11, 2021

astiob May 11, 2021

avih May 11, 2021 •

edited

Loading

astiob commented May 11, 2021 •

edited

Loading

avih commented May 11, 2021

khaledhosny commented May 12, 2021

astiob commented May 12, 2021 •

edited

Loading

astiob commented May 12, 2021

khaledhosny commented May 15, 2021

avih commented May 26, 2021

avih commented Jun 12, 2021

astiob commented Jul 9, 2021

avih commented Jul 10, 2021

complex shaper: avoid font warning for non-printable codepoints #508

complex shaper: avoid font warning for non-printable codepoints #508

Conversation

avih commented May 7, 2021 • edited Loading

avih commented May 7, 2021

avih commented May 7, 2021

avih commented May 8, 2021 • edited Loading

avih commented May 8, 2021

astiob commented May 8, 2021 • edited Loading

avih commented May 8, 2021

astiob commented May 8, 2021

astiob commented May 8, 2021 • edited Loading

avih commented May 8, 2021 • edited Loading

avih commented May 9, 2021 • edited Loading

avih commented May 9, 2021 • edited Loading

avih commented May 9, 2021

avih commented May 10, 2021 • edited Loading

khaledhosny commented May 10, 2021

avih commented May 11, 2021 • edited Loading

avih commented May 11, 2021 • edited Loading

khaledhosny commented May 11, 2021

avih commented May 11, 2021 • edited Loading

astiob commented May 11, 2021 • edited Loading

astiob commented May 11, 2021

avih commented May 11, 2021

astiob May 11, 2021

Choose a reason for hiding this comment

avih May 11, 2021 • edited Loading

Choose a reason for hiding this comment

astiob commented May 11, 2021 • edited Loading

avih commented May 11, 2021

khaledhosny commented May 12, 2021

astiob commented May 12, 2021 • edited Loading

astiob commented May 12, 2021

khaledhosny commented May 15, 2021

avih commented May 26, 2021

avih commented Jun 12, 2021

astiob commented Jul 9, 2021

avih commented Jul 10, 2021

avih commented May 7, 2021 •

edited

Loading

avih commented May 8, 2021 •

edited

Loading

astiob commented May 8, 2021 •

edited

Loading

astiob commented May 8, 2021 •

edited

Loading

avih commented May 8, 2021 •

edited

Loading

avih commented May 9, 2021 •

edited

Loading

avih commented May 9, 2021 •

edited

Loading

avih commented May 10, 2021 •

edited

Loading

avih commented May 11, 2021 •

edited

Loading

avih commented May 11, 2021 •

edited

Loading

avih commented May 11, 2021 •

edited

Loading

astiob commented May 11, 2021 •

edited

Loading

avih May 11, 2021 •

edited

Loading

astiob commented May 11, 2021 •

edited

Loading

astiob commented May 12, 2021 •

edited

Loading