Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cascadia Code unexpected complexity analysis result #411

Open
skyline75489 opened this issue Feb 13, 2021 · 15 comments
Open

Cascadia Code unexpected complexity analysis result #411

skyline75489 opened this issue Feb 13, 2021 · 15 comments

Comments

@skyline75489
Copy link

Environment

Cascadia Code version number: 2009.22
Application (with version) used to display text: Windows Terminal dev branch
OS platform and version: 10.0.19042.804
Screen resolution (i.e. 220dpi):  4K @ 200% 

Any other software? No

Steps to reproduce

Running cmatrix in Windows Terminal with no font specified (fallback to Cascadia Mono, I think) & Hack.

Expected behavior

The performance should be about the same.

Actual behavior

This is the CPU usage breakdown:

Cascadia:

wt-cascadia

Hack:

wt-hack

A significant more amount of CPU is consumed with Cascadia.

The reason behind this is that for some reason, analyzer->GetTextComplexity() gives unexpected result for pure ASCII strings when using Cascadia.

For example, take a random string produced by cmatrix:

r   8   1   z $   J   p `   X \ M   o ;             E   o d A G 9   O                       T

With Cascadia GetTextComplexity reports this result:

  • Pos 0 - 17 : Simple
  • Pos 18 : Complex
  • Pos 18 - 97 : Simple

However with Hack, the entire text is reported as simple so that we can optimize the layout process.

I've seen a dozen of examples. The letter J seems to be the cause. I have no idea why the letter J is so special.

@aaronbell
Copy link
Collaborator

Thanks for the feedback! Do you happen to know how GetTextComplexity determines simple vs complex?

@skyline75489
Copy link
Author

I'll have to summon @DHowett and @miniksa for this. Maybe they know some DX experts

@skyline75489
Copy link
Author

From MSDN:

isTextSimple: If true, the text is simple, and the glyphIndices array will already have the nominal glyphs for you. Otherwise, you need to call IDWriteTextAnalyzer::GetGlyphs to properly shape complex scripts and OpenType features.

Interestingly I've also seen similar behaviour with Fira Code, where ? is the special character.

@DHowett
Copy link
Member

DHowett commented Feb 13, 2021

Hmm. This is weird.

@skyline75489 it doesn't also happen with I, does it?

@skyline75489
Copy link
Author

To be specific, with Cascadia and the following string:

abcdefghijklmnoprqstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

The following characters are reported as complex: j, l and J, L.

@DHowett
Copy link
Member

DHowett commented Feb 13, 2021

Hmm. I first thought it would be the locl table. I believe that’s for I and J though.

@aaronbell
Copy link
Collaborator

@DHowett, you might be on to something, actually. The locl table contains a variety of glyphs, but among the base Latin set, the only ones mentioned are j, J, l, and L. I would expect, then, that í, Ć and Д would be seen as complex too.

To further prove this theory, if you run GetTextComplexity with Hack on Şş, do these register as complex? Hack has a much more limited OpenType feature set, and their locl table only includes those two glyphs.

@skyline75489
Copy link
Author

skyline75489 commented Feb 13, 2021 via email

@kenmcd
Copy link

kenmcd commented Feb 13, 2021

获取 Outlook for iOShttps://aka.ms/o0ukef

This is spam. Please fix the signature in your phone email app.

@DHowett
Copy link
Member

DHowett commented Feb 13, 2021

@kenmcd at least Chester’s message had meaningful content in addition to his signature, unlike yours which mailed the entire subscriber list for... this. 😄

A signature is not spam. A comment that does not contribute to the discussion is also not spam, but it's much closer to spam than an e-mail signature is.

@kenmcd
Copy link

kenmcd commented Feb 13, 2021

A signature is not spam.

That "signature" is put there by the app developer to promote the app.
The user has just been too lazy to change it.
The app developer did the same thing with the android app.
It is nothing less than a promotional link for a commercial product.
The definition of spam.

If there is another easier/better way to report spam on GitHub, please advise.

@skyline75489
Copy link
Author

Hi @kenmcd sorry if my email response upsets you with its automatic signature.

I’ve been around GitHub for about 8 years and I’ve only recently started using email response because of some special security policy that only applies to Microsoft employees (Dustin knows what I mean). Overall it’s because of my laziness.
You’re absolutely right about this.

I’ve seen a lot of email replied comments that comes with various signature. I don’t find them spam-like unless the comments themselves have no meaning content, or the promotion is about a car or something.

You do realize that some of us around the repo work for Microsoft, right? I feel somewhat eligible to promote the products of our own company. And it’s an app that sends emails. Surely someone would need an app like this.

Anyway this is way off topic. I hope this discussion about signatures can just end here.

@skyline75489
Copy link
Author

@DHowett if the complex result is unavoidable, I think we can still optimize the analysis process with the partial-simple text. I’ll open an issue about this in the Terminal repo later.

@aaronbell
Copy link
Collaborator

I thought I'd wait until next week when y'all aren't on a long weekend, but let's talk :)

FYI @skyline75489, the purpose of the locl table is to enable localized versions of glyphs to be substituted instead of the default forms. So for example, in Polish, there's an alternate form of the acute mark that those users would prefer and a number of letters with acutes need to be replaced.

My understanding is that the way things work is that when the string is analyzed, if any letter found to be in the locl feature is discovered, DWrite decides that it is "complex" and then looks up the user's language settings to activate the relevant substitutions. This helps ensure that users get the best experience and I think that it makes sense to continue to do this—particularly for Romanian / Moldovian / Bulgarian / Serbian / Polish.

The two letters in question, j and l, are relevant in the case of Dutch to access IJacute and ijacute and Catalan to access the ŀ and Ŀ, by substituting or . However, a thought occurred to me. In both these cases, two glyphs are being replaced with one glyph. This is fine in a proportional font, but perhaps in a monospace font, especially one for coding, they don't make sense (particularly the ŀ / Ŀ).

Potentially we can remove those substitutions which I think will improve your speed tests (and I'm thinking about making an update). But removing them won't fully achieve parity with a font like Hack that has fewer locl substitutions. So you'll still see a significant CPU use impact for Cyrillic text string.

To actually solve the issue it would be better to improve the analysis process. For example, when GetTextComplexity is run, it should only consider substitutions which match the user's language settings (rather than all locl substitutions). So if a user is not using Catalan, why should the 'L' be considered complex?

Of course, all this is predicated on that locl is the only OpenType feature that impacts "simple" vs "complex". I'm still wondering why ? is considered complex in Fira Code.

Anyway, the most important thing is to understand what drives the logic of GetTextComplexity to figure out if there is a way we can address it in font, or if another solution is necessary, or if there is a solution.

@skyline75489
Copy link
Author

@aaronbell thank you for the detailed explanation. I’ll also try to find a way to improve the performance on the terminal side.

By the way I may be wrong about the exact character that breaks the simplicity in Fira Code since I don’t use Fira Code that much, but I’m positive that there is a lot characters that belong to the category.

ghost pushed a commit to microsoft/terminal that referenced this issue Apr 28, 2021
This PR aims to optimize the text analysis process by breaking the text
into simple & complex runs according to the result of
`GetTextComplexity`. For simple runs, we can skip certain processing
steps to improve the analysis performance.

Previous to this PR, we rely on the result of `AnalyzeBidi`,
`AnalyzeScript` and `AnalyzeNumberSubstitution` to both break the text
into different runs and attach the corresponding
bidi/script/number_substitution information to the run. Thanks to #6695
we have the chance to skip the expensive analysis process when we found
the *entire text* is determined to be simple.

Inspired by microsoft/cascadia-code#411 and
discussions in #9156, I found that the "entire text simplicity" is often
hard to meet. In order to fully utilize the complexity information of
the text, we need to first break the text into simple & complex ranges.
These ranges are also the initial runs prior to the
bidi/script/number_substitution analysis. This way we can skip the text
analysis for simple runs to speed up the process.

VALIDATION
Build & run cmatrix, cacafire, cat big.txt with it.

Initial simple run PR: #6695
Closes #9156
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants