Cascadia Code unexpected complexity analysis result #411

skyline75489 · 2021-02-13T03:59:03Z

Environment

Cascadia Code version number: 2009.22
Application (with version) used to display text: Windows Terminal dev branch
OS platform and version: 10.0.19042.804
Screen resolution (i.e. 220dpi):  4K @ 200% 

Any other software? No

Steps to reproduce

Running cmatrix in Windows Terminal with no font specified (fallback to Cascadia Mono, I think) & Hack.

Expected behavior

The performance should be about the same.

Actual behavior

This is the CPU usage breakdown:

Cascadia:

Hack:

A significant more amount of CPU is consumed with Cascadia.

The reason behind this is that for some reason, analyzer->GetTextComplexity() gives unexpected result for pure ASCII strings when using Cascadia.

For example, take a random string produced by cmatrix:

r   8   1   z $   J   p `   X \ M   o ;             E   o d A G 9   O                       T

With Cascadia GetTextComplexity reports this result:

Pos 0 - 17 : Simple
Pos 18 : Complex
Pos 18 - 97 : Simple

However with Hack, the entire text is reported as simple so that we can optimize the layout process.

I've seen a dozen of examples. The letter J seems to be the cause. I have no idea why the letter J is so special.

The text was updated successfully, but these errors were encountered:

aaronbell · 2021-02-13T05:35:34Z

Thanks for the feedback! Do you happen to know how GetTextComplexity determines simple vs complex?

skyline75489 · 2021-02-13T05:52:40Z

I'll have to summon @DHowett and @miniksa for this. Maybe they know some DX experts

skyline75489 · 2021-02-13T06:18:27Z

From MSDN:

isTextSimple: If true, the text is simple, and the glyphIndices array will already have the nominal glyphs for you. Otherwise, you need to call IDWriteTextAnalyzer::GetGlyphs to properly shape complex scripts and OpenType features.

Interestingly I've also seen similar behaviour with Fira Code, where ? is the special character.

DHowett · 2021-02-13T06:31:45Z

Hmm. This is weird.

@skyline75489 it doesn't also happen with I, does it?

skyline75489 · 2021-02-13T08:19:29Z

To be specific, with Cascadia and the following string:

abcdefghijklmnoprqstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

The following characters are reported as complex: j, l and J, L.

DHowett · 2021-02-13T08:20:37Z

Hmm. I first thought it would be the locl table. I believe that’s for I and J though.

aaronbell · 2021-02-13T15:12:13Z

@DHowett, you might be on to something, actually. The locl table contains a variety of glyphs, but among the base Latin set, the only ones mentioned are j, J, l, and L. I would expect, then, that í, Ć and Д would be seen as complex too.

To further prove this theory, if you run GetTextComplexity with Hack on Şş, do these register as complex? Hack has a much more limited OpenType feature set, and their locl table only includes those two glyphs.

skyline75489 · 2021-02-13T16:09:10Z

I don’t know about fonts that much. Seems this is like a feature instead of bug? I wonder if it’s possible to stop some ASCII characters from being recognized as complex to improve the overall performance. 获取 Outlook for iOS<https://aka.ms/o0ukef>

…

________________________________ 发件人: Aaron <notifications@github.com> 发送时间: Saturday, February 13, 2021 11:12:29 PM 收件人: microsoft/cascadia-code <cascadia-code@noreply.github.com> 抄送: Chester Liu <skyline75489@outlook.com>; Mention <mention@noreply.github.com> 主题: Re: [microsoft/cascadia-code] Cascadia Code unexpected complexity analysis result (#411) @DHowett<https://github.com/DHowett>, you might be on to something, actually. The locl table contains a variety of glyphs, but among the base Latin set, the only ones mentioned are j, J, l, and L. I would expect, then, that í, Ć and Д would be seen as complex too. To further prove this theory, if you run GetTextComplexity with Hack on Şş, do these register as complex? Hack has a much more limited OpenType feature set, and their locl table only includes those two glyphs. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<#411 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ABD6BLYD3VP4WXHY3BCT3ODS62JF3ANCNFSM4XRXI55Q>.

kenmcd · 2021-02-13T20:13:11Z

获取 Outlook for iOShttps://aka.ms/o0ukef

This is spam. Please fix the signature in your phone email app.

DHowett · 2021-02-13T20:22:55Z

@kenmcd at least Chester’s message had meaningful content in addition to his signature, unlike yours which mailed the entire subscriber list for... this. 😄

A signature is not spam. A comment that does not contribute to the discussion is also not spam, but it's much closer to spam than an e-mail signature is.

kenmcd · 2021-02-13T23:53:01Z

A signature is not spam.

That "signature" is put there by the app developer to promote the app.
The user has just been too lazy to change it.
The app developer did the same thing with the android app.
It is nothing less than a promotional link for a commercial product.
The definition of spam.

If there is another easier/better way to report spam on GitHub, please advise.

skyline75489 · 2021-02-14T00:40:08Z

Hi @kenmcd sorry if my email response upsets you with its automatic signature.

I’ve been around GitHub for about 8 years and I’ve only recently started using email response because of some special security policy that only applies to Microsoft employees (Dustin knows what I mean). Overall it’s because of my laziness.
You’re absolutely right about this.

I’ve seen a lot of email replied comments that comes with various signature. I don’t find them spam-like unless the comments themselves have no meaning content, or the promotion is about a car or something.

You do realize that some of us around the repo work for Microsoft, right? I feel somewhat eligible to promote the products of our own company. And it’s an app that sends emails. Surely someone would need an app like this.

Anyway this is way off topic. I hope this discussion about signatures can just end here.

skyline75489 · 2021-02-14T00:41:46Z

@DHowett if the complex result is unavoidable, I think we can still optimize the analysis process with the partial-simple text. I’ll open an issue about this in the Terminal repo later.

aaronbell · 2021-02-14T03:17:52Z

I thought I'd wait until next week when y'all aren't on a long weekend, but let's talk :)

FYI @skyline75489, the purpose of the locl table is to enable localized versions of glyphs to be substituted instead of the default forms. So for example, in Polish, there's an alternate form of the acute mark that those users would prefer and a number of letters with acutes need to be replaced.

My understanding is that the way things work is that when the string is analyzed, if any letter found to be in the locl feature is discovered, DWrite decides that it is "complex" and then looks up the user's language settings to activate the relevant substitutions. This helps ensure that users get the best experience and I think that it makes sense to continue to do this—particularly for Romanian / Moldovian / Bulgarian / Serbian / Polish.

The two letters in question, j and l, are relevant in the case of Dutch to access IJacute and ijacute and Catalan to access the ŀ and Ŀ, by substituting l· or L·. However, a thought occurred to me. In both these cases, two glyphs are being replaced with one glyph. This is fine in a proportional font, but perhaps in a monospace font, especially one for coding, they don't make sense (particularly the ŀ / Ŀ).

Potentially we can remove those substitutions which I think will improve your speed tests (and I'm thinking about making an update). But removing them won't fully achieve parity with a font like Hack that has fewer locl substitutions. So you'll still see a significant CPU use impact for Cyrillic text string.

To actually solve the issue it would be better to improve the analysis process. For example, when GetTextComplexity is run, it should only consider substitutions which match the user's language settings (rather than all locl substitutions). So if a user is not using Catalan, why should the 'L' be considered complex?

Of course, all this is predicated on that locl is the only OpenType feature that impacts "simple" vs "complex". I'm still wondering why ? is considered complex in Fira Code.

Anyway, the most important thing is to understand what drives the logic of GetTextComplexity to figure out if there is a way we can address it in font, or if another solution is necessary, or if there is a solution.

skyline75489 · 2021-02-14T05:49:51Z

@aaronbell thank you for the detailed explanation. I’ll also try to find a way to improve the performance on the terminal side.

By the way I may be wrong about the exact character that breaks the simplicity in Fira Code since I don’t use Fira Code that much, but I’m positive that there is a lot characters that belong to the category.

This PR aims to optimize the text analysis process by breaking the text into simple & complex runs according to the result of `GetTextComplexity`. For simple runs, we can skip certain processing steps to improve the analysis performance. Previous to this PR, we rely on the result of `AnalyzeBidi`, `AnalyzeScript` and `AnalyzeNumberSubstitution` to both break the text into different runs and attach the corresponding bidi/script/number_substitution information to the run. Thanks to #6695 we have the chance to skip the expensive analysis process when we found the *entire text* is determined to be simple. Inspired by microsoft/cascadia-code#411 and discussions in #9156, I found that the "entire text simplicity" is often hard to meet. In order to fully utilize the complexity information of the text, we need to first break the text into simple & complex ranges. These ranges are also the initial runs prior to the bidi/script/number_substitution analysis. This way we can skip the text analysis for simple runs to speed up the process. VALIDATION Build & run cmatrix, cacafire, cat big.txt with it. Initial simple run PR: #6695 Closes #9156

skyline75489 mentioned this issue Feb 14, 2021

Fine-grained DWrite text analysis based on text complexity microsoft/terminal#9156

Closed

skyline75489 mentioned this issue Feb 19, 2021

Initial implementation of fine-grained text analysis microsoft/terminal#9202

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cascadia Code unexpected complexity analysis result #411

Cascadia Code unexpected complexity analysis result #411

skyline75489 commented Feb 13, 2021

aaronbell commented Feb 13, 2021

skyline75489 commented Feb 13, 2021

skyline75489 commented Feb 13, 2021

DHowett commented Feb 13, 2021

skyline75489 commented Feb 13, 2021

DHowett commented Feb 13, 2021

aaronbell commented Feb 13, 2021

skyline75489 commented Feb 13, 2021 via email

kenmcd commented Feb 13, 2021

DHowett commented Feb 13, 2021 •

edited

kenmcd commented Feb 13, 2021

skyline75489 commented Feb 14, 2021

skyline75489 commented Feb 14, 2021

aaronbell commented Feb 14, 2021

skyline75489 commented Feb 14, 2021

Cascadia Code unexpected complexity analysis result #411

Cascadia Code unexpected complexity analysis result #411

Comments

skyline75489 commented Feb 13, 2021

Environment

Steps to reproduce

Expected behavior

Actual behavior

aaronbell commented Feb 13, 2021

skyline75489 commented Feb 13, 2021

skyline75489 commented Feb 13, 2021

DHowett commented Feb 13, 2021

skyline75489 commented Feb 13, 2021

DHowett commented Feb 13, 2021

aaronbell commented Feb 13, 2021

skyline75489 commented Feb 13, 2021 via email

kenmcd commented Feb 13, 2021

DHowett commented Feb 13, 2021 • edited

kenmcd commented Feb 13, 2021

skyline75489 commented Feb 14, 2021

skyline75489 commented Feb 14, 2021

aaronbell commented Feb 14, 2021

skyline75489 commented Feb 14, 2021

DHowett commented Feb 13, 2021 •

edited