Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

complex shaper: avoid font warning for non-printable codepoints #508

Closed
wants to merge 2 commits into from

Conversation

avih
Copy link
Contributor

@avih avih commented May 7, 2021

This was performed in two steps (commits):

  • Separate the non-printable simple-shaper detection into a stand-alone function.
  • Reuse this function by the complex shaper.

@avih
Copy link
Contributor Author

avih commented May 7, 2021

Repushed with trivial changes:

  • Consistent termilonogy at the commit messages (non-printable/zero-width --> control-codepoint).
  • Trivial change at the second commit: if (is_control(info->symbol)) --> if (is_control(symbol))

@avih
Copy link
Contributor Author

avih commented May 7, 2021

Note, there's a case which I'm not sure how it behaves - though I think it's not a new case: If all the symbols at the run don't try to resolve a font (e.g. if all are below 0x20, or newly also if is_control returns true) and there's only one run, then no font is resolved at all, and ass_font_get_index will set *glyph_index = 0 and *face_index = 0.

I don't know if this ends up valid later on, or even if this tries to use invalid or out-of-bounds index.

Though again - not a new issue as far as I can tell.

@avih
Copy link
Contributor Author

avih commented May 8, 2021

So there are two things which might need addressing:

  • The second commit message states incorrectly that the simple shaper didn't show this warning - but it did, and the same commit addressed it unintentioally, as both shapers use ass_shaper_find_runs to resolve fonts. I'll revisit the commit message.
  • The simple shaper now scans the same set of exclusion-codepoints twice: once when resolving the font, and once afterwards to mark it as skipped. Preferably we should only test each symbol once. I didn't try to evaluate yet where's the best place at the code for this would be. I think a good approach would be to scan and mark these symbols early, so that both shapers, and maybe also ass_shaper_find_runs would make use of the same scan result.

Or we can push it now, and unify the scans later - they do use the same code, but it does run twice now for the simple shaper. Let me know what you think.

EDIT - also, now that I think about it, I don't know that the second commit doesn't introduce regressions. I think it has the potential to end up with shorter runs (because control symbols use font 0, but other symbols might use some other font), and I don't know what the implication of a run is.

For instance, if only complete runs are passed to harfbuzz, then these control symbols might be on "runs" of their own, and I don't know whether they'd be taken them into account by the shaper while shaping surrounding runs.

I'd need some help with assessing that. It's the first time I'm looking at libass code.

@avih
Copy link
Contributor Author

avih commented May 8, 2021

Re-pushed with these changes:

  • First commit: only commit message and comments changes.
  • Second commit: only commit message changes, and raised questions - hence marked as WIP.
  • New third commit: avoid is_control redundancy for the simple shaper.

I'm copying the WIP questions asked at the commit message to allow easier discussion:

  • What happens if harfbuzz does want a font for these codepoints?
  • Should we use the same list of codepoint both to skip rendering at the simple shaper and to avoid font resolution? The codepoints are now documented, and it seems to me that the same list should be OK.
  • What if none of the symbols resolves a font? (not a new issue).

Once we answer those, I'll remove the WIP text from the second commit.

And, the last commit message suggests that it might be possible to reuse the existing .skip member instead of adding a new .is_control member. It's probably fine as-is currently, and here are some considerations:

  • They do seem at least somewhat overlapping in function.
  • Reusing will definitely NOT add any regression for the complex shaper, because right after ass_shaper_find_runs (which tests is_control) - ass_shaper_shape is called, and at the complex shaper code path it resets and re-calculates the .skip member anyway (at shape_harfbuzz).
  • For the simple shaper code path, it can cause regressions only if the .skip member is set selectively before calling ass_shaper_find_runs. If it's not, then immediately afterwards shape_fribidi is called (which doesn't seem to use .skip), and then ass_shaper_skip_characters (which sets it to the same values calculated earlier by is_control).

However, even if .skip can be reused 100% safely, there's still a question of whether we want to. I don't think I can answer this on my own, so for now I think it's OK to use the new member .is_control.

@astiob
Copy link
Member

astiob commented May 8, 2021

I don't know that the second commit doesn't introduce regressions. I think it has the potential to end up with shorter runs (because control symbols use font 0, but other symbols might use some other font), and I don't know what the implication of a run is.

Yes, that sounds like a regression, unfortunately. Good catch!

Or for some fonts, more like… lack of fix than regression. I think this is the very issue that triggered #71 (comment).

For e. g. bidi control overrides it shouldn’t matter, but e. g. U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER must make it to HarfBuzz as part of their surrounding shaping runs, because they affect shaping. That means the control character must inherit shape_run_id of the preceding (or succeeding) character, regardless of whether and which fonts contain or lack a glyph for it and regardless of whether the preceding (or succeeding) character itself comes from the main font or a fallback font. It’s probably safe to stick to copying the run ID of the preceding character and use 0 or skip or whatever if the control character itself leads the string.

For what it’s worth, I’ve verified that VSFilter/Uniscribe seems to handle ZWJ correctly (unlike current libass) in a situation where it joins Arabic letters and the main font (Helvetica) has a ZWJ glyph but not Arabic glyphs.

What happens if harfbuzz does want a font for these codepoints?

If it does, and we don’t handle it specially, it will presumably be handled like missing glyphs are handled at the moment. See also my next answer.

Should we use the same list of codepoint both to skip rendering at the simple shaper and to avoid font resolution?

I think that’s what you do in this PR, so is the question rather “is there anything wrong with this approach”? No, I think it’s fine. But we may want to revise the list (again) to better match HarfBuzz.

I’m not fully sure how HarfBuzz decides which code points to request glyphs for and which not. That (plus the potential for changes over time) is why ideally I’d like to somehow make HarfBuzz pick these characters out for us, rather than us try to guess and hope the lists match.

I think it may be based on the default-ignorable Unicode property. I see HarfBuzz uses an is_default_ignorable function internally, but it seems that it’s not public. @khaledhosny Is this right?

What if none of the symbols resolves a font? (not a new issue).

Uh, well, is it really any different from when some of the symbols don’t resolve any font?

it might be possible to reuse the existing .skip member instead of adding a new .is_control member.

Indeed, it is. You’re right when you say the complex shaper resets .skip, and concerning the simple shaper, .skip is indeed not set anywhere before the shaper is called as far as I can tell & remember.

So you can just move the ass_shaper_skip_characters call from the simple shaper path in ass_shaper_shape to the common path in ass_shaper_find_runs or ass_render_event.

@avih
Copy link
Contributor Author

avih commented May 8, 2021

Yes, that sounds like a regression, unfortunately. Good catch!

Right. Good catch, bad patch :)

must make it to HarfBuzz as part of their surrounding shaping runs

Indeed. I also presumed this could be the case, though as you mentioned, it can happen anyway before this PR if the glyph is missing - which as we know can definitely happen for control codepoints.

I think that’s what you do in this PR, so is the question rather “is there anything wrong with this approach”?

Correct. I thought this was obvious in this context, but I could have phrased it better. Thanks.

ideally I’d like to somehow make HarfBuzz pick these characters out for us

Completely agreed. That was one of the things I thought might be happening here #507 (comment)

It's definitely better to not guesswork/duplicate it, unless it's specified explicitly how harfbuzz does this.

What if none of the symbols resolves a font? (not a new issue).

Uh, well, is it really any different from when some of the symbols don’t resolve any font?

I think it is different. Because if no font was resolved, and later you're required to provide a font, then it might end up accessing index 0 of the array, and maybe the array size is 0 to begin with because no font was ever found.

I didn't try to follow the code enough to understand whether it's a real issue or not, but regardless, it's not a new issue because theoretically (but I don't know if practically) it can already happen before this PR if all the symbols are smaller than x20.

So you can just move the ass_shaper_skip_characters call from the simple shaper path in ass_shaper_shape to the common path in ass_shaper_find_runs or ass_render_event.

Can I? I wouldn't have thought so myself, because then shape_fribidi - which runs after ass_shaper_find_runs would not see the direction codepoints (because by then the symbol was modified to 0).

So, considering the potential (additional) regression with even more cases of too-short runs than before, here's what I suggest we do:

  1. Do this now U+2060 (Word Joiner) warns "fontselect: failed to find any fallback for font: (sans-serif, 400, 0)" #507 (comment) (just suppress warnings for control codepoints). PR-wise, this means keeping only the first commit, and adding a new commit which reuses this function to test whether it should warn or just do info.
  2. At a later point someone might want to have a go at the more correct U+2060 (Word Joiner) warns "fontselect: failed to find any fallback for font: (sans-serif, 400, 0)" #507 (comment) (refactor the dataflow - which I accidentally thought was posted by you, and only now noticed it's by a different person).

@astiob
Copy link
Member

astiob commented May 8, 2021

because by then the symbol was modified to 0

Er, right. Just drop the part that changes symbol to 0. (It’s important for the complex shaper, too.) I don’t think it adds any meaning for later processing stages anyway (when skip is set)!

@astiob
Copy link
Member

astiob commented May 8, 2021

On the other hand, if the font does contain a glyph for a control character and we’re using the “simple shaper”, I think we end up using the glyph’s metrics for layout despite ass_shaper_skip_characters. I’m not too confident though, so this needs to be tested.

This PR might actually fix that.

@avih
Copy link
Contributor Author

avih commented May 8, 2021

Er, right. Just drop the part that changes symbol to 0. (It’s important for the complex shaper, too.) I don’t think it adds any meaning for later processing stages anyway (when skip is set)!

Right. That was indeed one thing I also considered, but didn't know if both .skip and .symbol=0 are required, or whether .skip is enough. And that's also the reason I thought that .skip might be set selectively before ass_shaper_find_runs in a way which is not completely overlapping .symbol==0 - and ultimately the reason I didn't reuse .skip (because otherwise why were both symbol and skip modified?).

It’s probably safe to stick to copying the run ID of the preceding character and use 0 or skip or whatever if the control character itself leads the string.

Yes, definitely better to not change the run ID, but it can become tricky to a degree I felt uncomfortable to make this change without deeper knowledge of the codebase.

So, I still stand by this:

So, considering the potential (additional) regression with even more cases of too-short runs than before, here's what I suggest we do:

Do this now ... (just suppress warnings for control codepoints). PR-wise, this means keeping only the first commit, and adding a new commit which reuses this function to test whether it should warn or just do info.

Now also because different approaches which work better than this PR (or just suppression) do also require deeper knowledge of the codebase, and I'm simply not comfortable taking them without knowing for a fact that I don't introduce regressions.

If you think that we can do better right now than simply addressing the warning on its own (i.e. my suggestion), then I'd prefer that you take over and handle it however you think is best.

What say you?

@avih
Copy link
Contributor Author

avih commented May 9, 2021

That means the control character must inherit shape_run_id of the preceding (or succeeding) character, regardless of whether and which fonts contain or lack a glyph for it and regardless of whether the preceding (or succeeding) character itself comes from the main font or a fallback font.

I agree that's the best thing to do, and I decided to have a go at this too. I think on top of the existing 3 commits would be OK.

It’s probably safe to stick to copying the run ID of the preceding character and use 0 or skip or whatever if the control character itself leads the string.

That's implementation details, which I don't immediately know how to translate to code. Also, not sure about .skip. I recommend that for now, unless you think this would be incorrect, keep the existing usage of skip independent of the new code. This can be optimized later, instead of adding noise to the current discussion.

I've had a different(?) idea at the implementation:

  • The base is the current code - with the 3 commits which are already at this PR.
  • Keep the current function of ass_shaper_find_runs as is, but with an additional single bit which remembers whether at least one control codepoint was encountered throughout the entire length (spanning runs).
  • After the loop is complete - if no control codepoints were encountered or if there was only one run (shape_run == 0), then return normally. I.e. so far this is identical as before - but only for cases where control codepoints didn't exist or didn't break the run.
  • Otherwise - i.e. there were control codepoint[s] and more than one run: make a second pass of the whole length and re-calculate the runs. Depending on answers to the questions below - this might be integrated with the first pass instead of as an additional pass.

Few things I'd need help with:

  • Does anyone care later on if a run has more than one font? Specifically - can we leave the font as 0 for control codepoints, while surrounding symbols of the same run have a different font? If no one cares then it will make the logic quite simpler, because we wouldn't need to find an actual font from another symbol at the run to copy to the control codepoint[s] of the run.
  • Why is ass_font_get_index (which resolves a font) called only if (!info->drawing_text.str) ? I.e. what's the meaning of having or not having info->drawing_text.str? My guess would be that the code considers info->drawing_text.str != NULL if and only if font info was already resolved for this symbol. Is this correct?
  • At the existing code, to test whether a new run starts, the font info is compared with the previous glyph at the array - NOT with the previous glyph for which ass_font_get_index was executed (e.g. if info->drawing_text.str is not NULL). To me it looks that the code assume (see previous question) that the glyph already has front info. Is this correct?
  • Assuming info->drawing_text.str != NULL indeed means "already has font info", then just some info question (has no effect on any new logic): who sets the font prior to running ass_shaper_find_runs? and why only to some of the symbols at the array?

@avih
Copy link
Contributor Author

avih commented May 9, 2021

Does anyone care later on if a run has more than one font?
Assuming info->drawing_text.str != NULL indeed means "already has font info"

Assuming no one cares if control codepoints inside a run have a different font than surrounding symbols at the run, and assuming info->drawing_text.str != NULL indeed means "already has font info", then here's my suggested patch on top of the existing 3 commits. I.e. This integrates the new logic into the existing pass without adding a new pass:

diff --git a/libass/ass_shaper.c b/libass/ass_shaper.c
index 396185f..50390c6 100644
--- a/libass/ass_shaper.c
+++ b/libass/ass_shaper.c
@@ -835,8 +835,17 @@ void ass_shaper_find_runs(ASS_Shaper *shaper, ASS_Renderer *render_priv,
             ass_font_get_index(render_priv->fontselect, info->font,
                     symbol, &info->face_index, &info->glyph_index);
         }
-        if (i > 0) {
-            GlyphInfo *last = glyphs + i - 1;
+
+        // test if a new run starts, completely ignoring control codepoints
+        int last_font = info->is_control ? -1 : i - 1;
+        while (last_font >= 0 && glyphs[last_font].is_control)
+            last_font--;
+        // last_font is -1 if NOT considering a new run (first symbol, or
+        // control codepoint, or all prior symbols are control), else nearest
+        // earlier index to compare font (guaranteed not a control codepoint).
+
+        if (last_font >= 0) {
+            GlyphInfo *last = glyphs + last_font;
             if ((last->font != info->font ||
                     last->face_index != info->face_index ||
                     last->script != info->script ||

This fixes the new cases of too-short runs introduced by the second commit - control codepoint breaks the run even if a glyph can be found.

And it also fixes earlier cases of too-short runs - control codepoint breaks the run because no glyph was found.

EDIT: or even a better patch which fixes the same two issues (less code, O(1) even in pathological edge cases):

diff --git a/libass/ass_shaper.c b/libass/ass_shaper.c
index 396185f..ea6bd5e 100644
--- a/libass/ass_shaper.c
+++ b/libass/ass_shaper.c
@@ -818,6 +818,7 @@ void ass_shaper_find_runs(ASS_Shaper *shaper, ASS_Renderer *render_priv,
 {
     int i;
     int shape_run = 0;
+    int last_font = -1; // latest earlier index which is not a control codepoint
 
     ass_shaper_determine_script(shaper, glyphs, len);
 
@@ -835,8 +836,15 @@ void ass_shaper_find_runs(ASS_Shaper *shaper, ASS_Renderer *render_priv,
             ass_font_get_index(render_priv->fontselect, info->font,
                     symbol, &info->face_index, &info->glyph_index);
         }
-        if (i > 0) {
-            GlyphInfo *last = glyphs + i - 1;
+        if (!info->is_control && last_font >= 0) {
+            // Consider a new run: current symbol is not a control point,
+            // and we also have a prior symbol which is not a control point.
+            // TODO: if we have last_font followed by several control symbols,
+            //       and the current symbol font differs from the last_font
+            //       (i.e. a new run begins), should the intermediate control
+            //       symbols belong to the previous run or to the new run?
+            //       (currently they remain with the previous run)
+            GlyphInfo *last = glyphs + last_font;
             if ((last->font != info->font ||
                     last->face_index != info->face_index ||
                     last->script != info->script ||
@@ -845,6 +853,9 @@ void ass_shaper_find_runs(ASS_Shaper *shaper, ASS_Renderer *render_priv,
                 shape_run++;
         }
         info->shape_run_id = shape_run;
+
+        if (!info->is_control)
+            last_font = i;
     }
 }

@avih
Copy link
Contributor Author

avih commented May 9, 2021

Assuming no one cares if control codepoints inside a run have a different font than surrounding symbols at the run

Actually, I'd probably not make such assumption, unless we know 100% for a fact that that is the case. The reason being that I can imagine harfbuz wanting to use some font metrics while appying these control codepoints, and with a "dummy" font 0 - it could end up using wrong metrics from the font even if the glyph itself is not used.

Luckily, making sure that all the control codepoints font info in a run is copied from a codepoint for which a font was actually resolved is relatively simple to add to the patch above. Probably 2-3 more LOC.

I'm not updating the patch above, but unless you prefer otherwise (i.e. keep the patch above as is in this regard), then I'll add it to the commit, assuming you'd want this functionality added.

@avih
Copy link
Contributor Author

avih commented May 10, 2021

Forced-pushed with the following changes:

  • First commit (stand-alone is_control): change unsigned to uint_least32_t - explained at the updated commit message.
  • Second commit (don't resolve control codes face): removed WIP, updated commit message and one code comment.
  • Third commit (avoid redundant is_control): only updated commit message about reusing the .skip member.
  • New fourth commit: don't break runs on control codepoints (bugfix of the inline patch posted at an earlier comment).
  • New fifth commit: WIP: make control codes use the surrounding face_index.

I'm fairly confident that this patchset:

  • Doesn't introduce regressions.
  • Avoids warnings for missing control code glyphs.
  • Possibly improves performance due to skipping face resolution for control codes.
  • Fixes accidental run breaks due to unresolved control codes glyphs (or due to resolution at a different face) .

The fourth commit (don't break runs on control codepoints) adds three TODO comments:

// TODO: currently only control codes face_index is not-comparable,
//       but we should consider the same also if a face was not
//       resolved by ass_font_get_index for other reasons

This is not a new question, and the code doesn't change existing behavior. However, someone might want to revisit the behavior in the future.

// TODO: else face_comparable remains true. is this correct?

(when info->drawing_text.str != 0). This is not a new question, and the code doesn't change existing behavior. However, someone might want to revisit the behavior in the future.

// TODO: if there are both comparable and uncomparable faces at
//       the previous run, then their face_index can differ.
//       if this matters to the shaper, it can be adjusted here.

This is a new question, and is a result of the change that face_index for control codes can now differ from surrounding symbols, while previously face_index was indentical for the whole run (because the run broke, possibly incorrectly, when it changes).

I don't know the answer to that (i.e. whether it matters to the complex shaper), but if the answer is "yes, it does matter", then the fifth WIP commit addresses this concern. If it doesn't matter then keeping the fifth commit doesn't hurt, but also doesn't add anything useful, so maybe it can be removed.

@khaledhosny
Copy link
Contributor

I think it may be based on the default-ignorable Unicode property. I see HarfBuzz uses an is_default_ignorable function internally, but it seems that it’s not public. @khaledhosny Is this right?

Yes, this isn't exposed in public API (the check is tailored for font shaping in HarfBuzz, it doesn't match the Unicode property 100%).

@avih
Copy link
Contributor Author

avih commented May 11, 2021

I think it may be based on the default-ignorable Unicode property. I see HarfBuzz uses an is_default_ignorable function internally, but it seems that it’s not public. @khaledhosny Is this right?

Yes, this isn't exposed in public API (the check is tailored for font shaping in HarfBuzz, it doesn't match the Unicode property 100%).

FWIW, I found this definition here https://en.wikipedia.org/wiki/Module:Unicode_data/derived_core_properties :

data.default_ignorable = {
	singles = {
		[0x000AD] = true,
		[0x0034F] = true,
		[0x0061C] = true,
		[0x03164] = true,
		[0x0FEFF] = true,
		[0x0FFA0] = true,
	},
	
	ranges = {
		{ 0x0115F, 0x01160 },
		{ 0x017B4, 0x017B5 },
		{ 0x0180B, 0x0180E },
		{ 0x0200B, 0x0200F },
		{ 0x0202A, 0x0202E },
		{ 0x02060, 0x0206F },
		{ 0x0FE00, 0x0FE0F },
		{ 0x0FFF0, 0x0FFF8 },
		{ 0x1BCA0, 0x1BCA3 },
		{ 0x1D173, 0x1D17A },
		{ 0xE0000, 0xE0FFF },
	},
}

the libass set which this patchset puts inside the is_control test is a subset of this, and unless I'm mistaken, there's only one codepoint which libass defines in this category which the definition above doesn't: 0x0180F.

In the other way around - the definition above includes quite a bit more than the definition of is_control.

However, I don't know how credible this definition is, and not sure what it refers to. though google did place this page near the top of search results, so I guess people do link to it.

@avih
Copy link
Contributor Author

avih commented May 11, 2021

@khaledhosny for codepoints in this (or some other) default_ignorable list:

  • Does harfbuzz try to retrieve glyph info?
  • Does harfbuzz try to retrieve face info/metrics?

My guess would be probably "no" for glyph info, as glyphs for these codepoints are probably not very relevant.

But for face data for these codepoitns - I can imagine that there could be value even without the glyph.

So, for instance, if libass tries to resolve faces for runs, and there are default_ignorable codepoints embedded within these runs which libass doesn't try to resolve a glyph (and similarly neither a face), should their face be adjusted to match the surrounding face at the run?

@khaledhosny
Copy link
Contributor

@khaledhosny for codepoints in this (or some other) default_ignorable list:

* Does harfbuzz try to retrieve glyph info?

* Does harfbuzz try to retrieve face info/metrics?

I don’t quite understand what you mean. But HarfBuzz will behave shaping wise whether the font have glyphs for these code points or not. Under normal circumstances it will replace them with space glyph and zero its widths (the replacement glyph can be controlled with the API), it can also retain the glyph in the output under some special circumstances (if requested with the API, or if the fonts has some substitution rules for these glyphs).

@avih
Copy link
Contributor Author

avih commented May 11, 2021

I don’t quite understand what you mean

OK, I based my question on this (#507 (comment)):

ass_shaper_shape runs HarfBuzz, which requests font glyphs via callbacks and receives our previously found glyphs. I assume it doesn’t request a glyph for U+200B or U+2060. ...

This PR changes how "our previously found glyphs" works. I.e. it takes place at the stage of resolving the glyphs, before calling harfbuzz, and tries to avoid glyph resolution for default_ignorable codepoints, and deal with any consequences it might have.

So I'm trying to understand for a fact what happens without assuming: whether or not harfbuzz does try to "request the font glyph via callbacks" for codepoints at some default_ignorable list.

And the reason I'm asking is to understand whether or not libass should try to resolve a face which has a glyph for it, or whether it can simply say "it's ignorable, I'm not going to bother finding a (fallback) face which has this glyph".

The second question is basically: does harfbuzz care at all what the face is for codepoints in a default_ignorable list - under normal circumstances (without some "show hidden" mode etc).

I.e. I think it's not unimaginable that harfbuzz knows that the glyph itself is not relevant under normal circumstances, but that the face might still be relevant to extract some metrics in order to apply the codepoint correctly during layout. This is a guess - but I hoped for more knowledgable info by you.

The reason I'm asking is that due to implementation details, if libass doesn't try to resolve a face for the glyph then the face ends up arbitrary. So this patchset can either leave it with some arbitraty value (if harfbuzz doesn't care what the face is) or it needs to do some work in order to adjust the face to match the resolved face of surrounding glyphs.

So this is how I interpret your reply (please confirm):

But HarfBuzz will behave shaping wise whether the font have glyphs for these code points or not. Under normal circumstances it will replace them with space glyph and zero its widths (the replacement glyph can be controlled with the API)

So assuming a replacement glyph is not provided via the API, and under "normal" circumstances, do I understand correctly that harfbuzz does not need to access neither the glyph nor the font which is associated with this codepoint?

@astiob
Copy link
Member

astiob commented May 11, 2021

the check is tailored for font shaping in HarfBuzz, it doesn't match the Unicode property 100%

Hmm, so there’s no way to know which code points require font data and which don’t other than going the Chrome way (with extra complications due to synthesized glyphs) or duplicating internal HarfBuzz code?

Speaking of the Chrome way, is it not possible that a font intentionally uses glyph #0 for a supported character (or substitution) such that it should not invoke font fallback?

And the proper handling of default ignorable code points is… feed them to HarfBuzz first; if it returns .notdef, perform font fallback as usual; if out of fonts and they’re still .notdef, turn them invisible? It would be great if hb-unicode exposed the raw default-ignorable property so we didn’t have to maintain our own list likely to fall out of sync over time.

(By the way, I don’t see Chrome do this—turn ignorables invisible if no font has them and HarfBuzz has let them through—although I’m merely foolishly looking at unfamiliar source code. It looks like Firefox does this. But then Firefox had this issue but I don’t understand how it could be an issue to begin with and hence whether I need to worry about it in libass.)

@astiob
Copy link
Member

astiob commented May 11, 2021

@avih Sorry for my silence in the last few days. I very much appreciate your enthusiasm and effort, but I think you’re overcomplicating this :-)

Just copy the last glyph’s font face in ass_shaper_find_runs instead of invoking ass_font_get_index.

By the way, currently, libass will try to fetch the glyph later anyway, regardless of HarfBuzz.

@avih
Copy link
Contributor Author

avih commented May 11, 2021

Just copy the last glyph’s font face in ass_shaper_find_runs instead of invoking ass_font_get_index

I don't think this would work, but I'd be to glad be proven wrong (and my suggestion for you to just take over is still in place, I really don't mind who writes the code).

The point is that a run could have several ignorable codepoints in a row at the beginning the run. I.e. you have nothing to copy from. And once the run ends - only then you can look back at the entire run - and if it has both resolved and unresolved glyhs - copy the resolved ones to the unresolved ones.

But sure, it's also entirely possible I'm overly complicating :)

// NOTE: if face_index changes then it can split the run further,
// but cannot accidentally combine adjacent runs (e.g. if two
// symbols had different face_index but they have the same
// face_index after ass_font_get_index). that's because split
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: This is the very place that sets face_index. It is zero until here.

Copy link
Contributor Author

@avih avih May 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: This is the very place that sets face_index. It is zero until here.

Right, that's the only thing I wasn't 100% sure of, so I added this comment to make it obvious if I'm assuming wrong. Thanks.

The reason I assumed that is that split_style_runs splits on bold and italic, and assumed face_index was derived from the style somewhere before ass_shaper_find_runs, and that then ass_font_get_index starts with the desired face/style, and then falls back to others. That's the very first thing which ass_font_get_index does - tries the glyph at the input face_index - and if it's found, then nothing happens further later:

    // try with the requested face
    if (*face_index < font->n_faces) {
        face = font->faces[*face_index];
        index = FT_Get_Char_Index(face, ass_font_index_magic(face, symbol));
    }

And the comment // try with the requested face makes it look even more like the input face_index was already set explicitly to some requested value.

@astiob
Copy link
Member

astiob commented May 11, 2021

The point is that a run could have several ignorable codepoints in a row at the beginning the run. I.e. you have nothing to copy from.

This is correct.

I was about to say:
But for almost all default-ignorable code points, this shouldn’t make a difference as they only have meaning when they follow other things. And even for those that do have meaning as prefixes, in this PR’s approach, this would be consistent with our handling of ignorables in the middle of a run, where we always propagate fonts/runs forwards, never backwards. Using/leaving face_index 0 is fine in these cases.

But:

  1. I didn’t notice that this kind of situation can happen in the middle of the text, where shaping runs are broken for some other reason and the first code point is one of the problematic/interesting ones. This somewhat belittles my argument about consistency.

  2. ⚠️ HarfBuzz can keep one of these code points visible if it’s named in an OpenType substitution. For this to happen, it needs to see the right font, and we need it, too, to fetch the glyph for display. (Thanks for mentioning this, Khaled. I saw the linked issue the other day, but it didn’t click that this affects us.)

    I think this alone actually answers all your questions about needing the right font.

Oh welp, so we do want the right face_index for run-leading code points.

if no font was resolved, and later you're required to provide a font, then it might end up accessing index 0 of the array, and maybe the array size is 0 to begin with because no font was ever found.

This is impossible, so you need not worry. libass makes sure each font always starts with one face. (See: font check in parse_events, which follows update_font, which invokes ass_font_construct.)

Why is ass_font_get_index (which resolves a font) called only if (!info->drawing_text.str) ? I.e. what's the meaning of having or not having info->drawing_text.str?

It checks whether a “glyph” is actually an inline ASS vector drawing.

ASS can contain vector drawings instead of (in addition to) text. Drawings need no text shaping and delimit shaper runs.

to test whether a new run starts, the font info is compared with the previous glyph at the array - NOT with the previous glyph for which ass_font_get_index was executed (e.g. if info->drawing_text.str is not NULL).

Every drawing, and every glyph immediately following a drawing, has info->starts_new_run = true.

@avih
Copy link
Contributor Author

avih commented May 11, 2021

maybe the array size is 0 to begin with because no font was ever found.

This is impossible, so you need not worry ..

Good. Thanks :)

I was about to say:
But for almost all default-ignorable code points ...

Right. But I didn't make any assumptions about prior states, except at my comment which you say is wrong :) (about face_index always being 0 before calling ass_font_get_index). Luckily the comment is wrong only in its assumption that face_index could be different on entry, but not in its conclusion - that if ass_font_get_index changes face_index then it can only split the run further but not accidentally combine adjacent runs - which was my main concern.

TBH, by looking at ass_font_get_index it still definitely does not look to me like a function which expects face_index to always be 0 on entry, and that n_faces is always at least 1, but I'll take your word that this is the case.

HarfBuzz can keep one of these code points visible ...

I don't think I really understand the statement or its implications. See below.

I think this alone actually answers all your questions about needing the right font.
Oh welp, so we do want the right face_index for run-leading code points.

Not 100% sure I follow. The right font as the surrounding glyphs (i.e. artificial adjustment)? or the right font as one which has a glyph for this codepoint (i.e. we actually need ass_font_get_index to try and resolve a face rather than calling it with symbol==0)?

If it's the latter and it does have a resolved face which (can) differs from its surroundings, do we want to break the run due to the changed face (assuming nothing else except the face changed)? or do we want to keep it in one run which will now have more than one face?

And also if it's the latter, what happens if we can't find a face which has this glyph? Do we still do the warning?

ASS can contain vector drawings instead of (in addition to) text. Drawings need no text shaping and delimit shaper runs.

Right. I didn't assume anything about it, and the code doesn't need modification following this new (to me) information, but it does answer one of the TODO items:

// TODO: else face_comparable remains true. is this correct?

So the answer is that it's correct because it doesn't matter - it's force-delimited by starts_new_run anyway. Good.

@khaledhosny
Copy link
Contributor

So I'm trying to understand for a fact what happens without assuming: whether or not harfbuzz does try to "request the font glyph via callbacks" for codepoints at some default_ignorable list.

It depends on whether HarfBuzz wants to show this glyph or not, and like I said earlier there are few situations where it would.

And the reason I'm asking is to understand whether or not libass should try to resolve a face which has a glyph for it, or whether it can simply say "it's ignorable, I'm not going to bother finding a (fallback) face which has this glyph".

This should work, as long as it is kept in the same run (runs, by definition, must use the same font) with its surroundings.

Speaking of the Chrome way, is it not possible that a font intentionally uses glyph #0 for a supported character (or substitution) such that it should not invoke font fallback?

HarfBuzz does not have any special meaning of 0 glyph index, the callback function returns false when it does not find a glyph for the specified code point and that what tells HarfBuzz the font does not support it (Pango, for example, returns a very high glyph index that also encodes the code point to use it later for drawing missing glyph hex box).

feed them to HarfBuzz first; if it returns .notdef, perform font fallback as usual; if out of fonts and they’re still .notdef, turn them invisible

You are unlikely to ever end up with .notdef for Default_Ignorables unless no font in the fallback stack has the replacement character (space by default).

@astiob
Copy link
Member

astiob commented May 12, 2021

You are unlikely to ever end up with .notdef for Default_Ignorables unless no font in the fallback stack has the replacement character (space by default).

This is true for most of them, but what about the Default_Ignorables that are excluded from HarfBuzz’s internal list, such as hangul fillers and shorthand format controls? My understanding is that it does not treat them as ignorables in any manner at all. If the font supports them, the font itself will ligate/substitute them into blank glyphs. But if not, we should not show fallback .notdef, because that’s the definition of Default_Ignorable.

@astiob
Copy link
Member

astiob commented May 12, 2021

HarfBuzz does not have any special meaning of 0 glyph index, the callback function returns false when it does not find a glyph for the specified code point and that what tells HarfBuzz the font does not support it (Pango, for example, returns a very high glyph index that also encodes the code point to use it later for drawing missing glyph hex box).

Aha, I see it now. Nice, thanks!

If the hb_font_get_nominal/variation_glyph returns false, is it safe to assume HarfBuzz will not call any of the other callbacks for that glyph?

@khaledhosny
Copy link
Contributor

This is true for most of them, but what about the Default_Ignorables that are excluded from HarfBuzz’s internal list

They would be returned as .notdef indeed, though I'm not familiar with these code points to know if it is an issue in practice.

If the hb_font_get_nominal/variation_glyph returns false, is it safe to assume HarfBuzz will not call any of the other callbacks for that glyph?

I think so.

avih added 2 commits May 26, 2021 19:46
This removes the function ass_shaper_skip_characters, and sets .skip
at ass_shaper_find_runs to the same value as before. However, unlike
before - it doesn't also set .symbol=0. This affects only the simple
shaper, because the complex shaper resets and re-calculates .skip
later at shape_harfbuzz.

According to @astiob this is equivalent and without regressions,
though I didn't try to prove its correctness - avih.

The type uint_least32_t is different than other places which hold a
codepoint (`uint32_t' or `unsigned' are used) because that's the
only type for which C99 guarantees support, and is at least 32 bits
(uint32_t is optional, and unsigned can theoretically be 16 bits).
default-ignorable (DI) codepoints glyphs are not present at all fonts,
are not used by the simple shaper for rendering, and are mostly not
used by the complex shaper for rendering.

This avoids missing-glyph warnings for DI codepoints.

Because now we don't try to resolve DI glyphs - their face_index could
differ from their surrounding, which would break the run. Similarly,
before this commit DI could also break the run due to missing glyphs
or if resolved-with-different-face.

So now we also don't allow DI face_index to break the run.

However, DI face_index can still differ from the surrounding, so we
adjust DI face_index to match the rest of the run if possible so that
the complex shaper later gets the correct face_index if needed
(technically, only the first face_index of a run is used later).

There's still a potential issue if the complex shaper needs DI glyphs
for the rendering but they're not available at the run's face_index.

After discussion with @astiob, it was concluded that perfect solution
can't exist without major refactoring of the dataflow, and that the
the approach at this commit (don't resolve DI glyphs, don't break runs
on DI, adjust DI face_index to match the run) is both better than
before and also the best we can do within the current code form.
@avih
Copy link
Contributor Author

avih commented May 26, 2021

Force-pushed two commits which replace the previous ones, as requested by @astiob on IRC.

@avih
Copy link
Contributor Author

avih commented Jun 12, 2021

For what it's worth, I've been using the last patch-set (two-commits) in my own build of libass with mpv for the past 2 weeks. It seems to solve the original issue (no more warnings for missing glyphs of default-ignorable), and I didn't notice a negative impact, though I didn't try to measure anything.

@astiob
Copy link
Member

astiob commented Jul 9, 2021

I’ll get back to this now.

Preliminary tests suggested that Uniscribe in VSFilter doesn’t attach leading ZWJ to a subsequent fallback shape run, which should let us simplify our code further. But first, I want to do another test with something like U+1D179 MUSICAL SYMBOL BEGIN PHRASE that is logically a leading character, to make sure it doesn’t depend on the exact character.

@avih
Copy link
Contributor Author

avih commented Jul 10, 2021

Preliminary tests suggested ...

I don't think I understand how is this related to the issue which this PR tries to solve - avoid warning for missing glyphs which are not rendered anyway.

astiob added a commit to astiob/libass that referenced this pull request Sep 1, 2021
Fonts don't need to contain any glyphs for these code points.
Their special effect on layout (if any) will be handled by FriBidi
and/or HarfBuzz, and they will be replaced by zero-width glyphs
regardless of whether any font glyph was provided.

By trying hard to find a glyph as we currently do, we end up
mistakenly splitting shape runs on control characters that lack font
glyphs and therefore ruining shaping, as well as logging unnecessary
warnings about the missing glyphs.

There are some quirks and exceptions. In particular, font glyphs
for normally-invisible characters may be retained by recent HarfBuzz
if OpenType substitutions are involved (which sometimes matches
and sometimes doesn't match Uniscribe). This commit happens to retain
this behavior for visible glyphs contained in the main font (as opposed
to any arbitrarily-chosen fallback fonts) by virtue of defaulting
run-leading invisibles to face_index 0, where HarfBuzz will find
the glyphs on its own. Meanwhile, the case of _fallback_ fonts inserting
visible glyphs doesn't even seem to be desirable to begin with.

Preliminary testing shows this commit's logic matches Uniscribe in some
common cases, but tests with musical symbols show strange results.
More research is needed to match Uniscribe and hence VSFilter even better.

On the other hand, this logic is likely not sophisticated enough
for "optimal" typesetting without regard to Uniscribe compatibility.
For example, Blink does not look up glyphs and split font-fallback runs
prior to running HarfBuzz at all, and instead runs HarfBuzz on the whole
higher-level run at once and then reruns it with the next fallback font
on just those substrings that HarfBuzz failed to shape. In any case,
some Default_Ignorable glyphs are more appropriately viewed as logical
prefixes than suffixes and should be grouped with the following,
not preceding, character in case of doubt. Furthermore, when a fallback
font is used for a sequence of text, it may be desirable to use the same
font for intervening or surrounding whitespace or punctuation even if
the main font (or another fallback font) also contains glyphs for them;
and whitespace characters can be synthesized in lieu of import from
another font, and indeed HarfBuzz does synthesize them, but this is not
always desirable. Additionally, the glyph HarfBuzz uses for zero-width
substitutes may actually be visible as it defaults to the font's basic
space glyph: if we were to forget about Uniscribe, we should define
a custom, fake glyph index and translate it into `skip`.
Finally, our higher-level shape run splitting is not perfect either:
bidi splitting after script splitting can create awkward short runs,
and in general, script splitting seems problematic in itself.

Fixes libass#507 (log messages).

Supersedes and closes libass#508.

Fixes misshaping such as in the sample in
  libass#71 (comment),
  libass#71 (comment).
However, this particular sample was actually a combined result of
unnecessary font fallback (which this commit fixes) with another bug
that was fixed in commit 80ce637:
it already renders correctly since that commit,
but the current commit alone would also have fixed it.
@astiob astiob closed this in d170d89 Sep 8, 2021
astiob added a commit to astiob/libass that referenced this pull request Sep 8, 2021
Fonts don't need to contain any glyphs for these code points.
Their special effect on layout (if any) will be handled by FriBidi
and/or HarfBuzz, and they will be replaced by zero-width glyphs
regardless of whether any font glyph was provided.

By trying hard to find a glyph as we currently do, we end up
mistakenly splitting shape runs on control characters that lack font
glyphs and therefore ruining shaping, as well as logging unnecessary
warnings about the missing glyphs.

There are some quirks and exceptions. In particular, font glyphs
for normally-invisible characters may be retained by recent HarfBuzz
if OpenType substitutions are involved (which sometimes matches
and sometimes doesn't match Uniscribe). This commit happens to retain
this behavior for visible glyphs contained in the main font (as opposed
to any arbitrarily-chosen fallback fonts) by virtue of defaulting
run-leading invisibles to face_index 0, where HarfBuzz will find
the glyphs on its own. Meanwhile, the case of _fallback_ fonts inserting
visible glyphs doesn't even seem to be desirable to begin with.

Preliminary testing shows this commit's logic matches Uniscribe in some
common cases, but tests with musical symbols show strange results.
More research is needed to match Uniscribe and hence VSFilter even better.

On the other hand, this logic is likely not sophisticated enough
for "optimal" typesetting without regard to Uniscribe compatibility.
For example, Blink does not look up glyphs and split font-fallback runs
prior to running HarfBuzz at all, and instead runs HarfBuzz on the whole
higher-level run at once and then reruns it with the next fallback font
on just those substrings that HarfBuzz failed to shape. In any case,
some Default_Ignorable glyphs are more appropriately viewed as logical
prefixes than suffixes and should be grouped with the following,
not preceding, character in case of doubt. Furthermore, when a fallback
font is used for a sequence of text, it may be desirable to use the same
font for intervening or surrounding whitespace or punctuation even if
the main font (or another fallback font) also contains glyphs for them;
and whitespace characters can be synthesized in lieu of import from
another font, and indeed HarfBuzz does synthesize them, but this is not
always desirable. Additionally, the glyph HarfBuzz uses for zero-width
substitutes may actually be visible as it defaults to the font's basic
space glyph: if we were to forget about Uniscribe, we should define
a custom, fake glyph index and translate it into `skip`.
Finally, our higher-level shape run splitting is not perfect either:
bidi splitting after script splitting can create awkward short runs,
and in general, script splitting seems problematic in itself.

Fixes libass#507 (log messages).

Supersedes and closes libass#508.

Fixes misshaping such as in the sample in
  libass#71 (comment),
  libass#71 (comment).
However, this particular sample was actually a combined result of
unnecessary font fallback (which this commit fixes) with another bug
that was fixed in commit 80ce637:
it already renders correctly since that commit,
but the current commit alone would also have fixed it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants