Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arabic not rendering correctly #484

Closed
2 tasks done
ScrubN opened this issue Dec 20, 2022 · 33 comments · Fixed by #485
Closed
2 tasks done

Arabic not rendering correctly #484

ScrubN opened this issue Dec 20, 2022 · 33 comments · Fixed by #485
Labels
bug Something isn't working

Comments

@ScrubN
Copy link
Collaborator

ScrubN commented Dec 20, 2022

Checklist

Edition

Both

Describe your issue here

Some or all Arabic may not be rendering correctly. 1.50.7 also has the problem but as far as I can tell 1.50.6 does not

Source
image

1.50.7 - Master
image

1.50.6
image

@ScrubN ScrubN added the bug Something isn't working label Dec 20, 2022
@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 20, 2022

I'm looking at the diff for ChatRenderer.cs 1.50.6 -> 1.50.7 and I honestly can't figure it out.

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 20, 2022

RTL is not rendered as expected in #437. That PR didn't touch rendering at all though and the previous commit to ChatRenderer.cs was from #424

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 20, 2022

Rendered using #437 as the base but using #424's ChatRenderer.cs and the RTL is rendered incorrectly. What..?

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 20, 2022

RTL is not rendered as expected in #437. That PR didn't touch rendering at all though and the previous commit to ChatRenderer.cs was from #424

Just realized that the conflict merge commit was pushed to the branch along with everything from master up to then.

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 20, 2022

Found the problem. It doesn't seem like it should have, but this line change is what caused it all.

[Option('f', "font", Default = "Inter Embedded", HelpText = "Font to use.")]
public string Font { get; set; }

Yes, really. Fixing the CLI not using the embedded Inter font broke Arabic rendering. I don't know why.

Edit: It's because Arabic support for the Inter fontface is broken. Just installed Inter from fonts.google.com and that version also renders incorrectly.

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 20, 2022

rsms/inter#523 (comment) It looks like it's official that Inter doesn't support Arabic.

@ScrubN ScrubN changed the title RTL not rendered correctly Arabic not rendering correctly Dec 20, 2022
@mohad12211
Copy link
Contributor

mohad12211 commented Dec 20, 2022

while we are at it, the picture for 1.50.6 you posted has another problem.
this is what twitch shows
image
there's no spacing between words, before a certain version, this was random, sometimes its right, sometimes its wrong. the latest few versions, it was almost always wrong (no spaces or wrong spaces)

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 21, 2022

there's no spacing between words, before a certain version, this was random, sometimes its right, sometimes its wrong. the latest few versions, it was almost always wrong (no spaces or wrong spaces)

It looks like this is because SKShaper thinks they're 0 width characters (font is Segoe UI) so the only spacing that ends up getting drawn is message padding on each individual character. I want to get rid of drawing message padding per-char on nonfont messages because it looks super ugly and further breaks block arts which I just mostly fixed.
image

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 21, 2022

So I fixed both block arts and arabic rendering, but in order to achieve the image below I had to give it an invalid font via the CLI.
image

Using any valid font produces this for arabic, which makes sense now why fixing the CLI not calling Inter Embedded was causing the problem: I didn't have Inter installed so it wasn't a valid font. Now we need to figure out why the fallback font renders wrong but the SKTypeface.FromFamilyName() fallback renders correctly (on Windows at least)
image

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 21, 2022

@mohad12211 can you compile that branch and see if Arabic renders correctly when given a font that properly supports it. I'm paranoid it's the fault of my system.

@mohad12211
Copy link
Contributor

mohad12211 commented Dec 21, 2022

I used the github-actios binaries instead of compiling, assuming that it won't affect anything, let me know if you want me to actually compile it instead of using the github-actions.
I tried multiple fonts that supports Arabic, all rendered properly. but there's a couple of small problems.

rendered:
image
message json:
image
there's an additional ♿ emoji and a "?" one, the Arabic text is rendered properly tho, I'm not sure if the Arabic text has anything to do with this problem.

rendered:
image
message json:
image
note that this message has Arabic and directly after it the emojis, no space between the text and the emoji.
it's not rendered properly, it's rendered like I used a font that doesn't support Arabic

chat.json: https://file.io/ANLWjGSSBf4D

edit: yes, the first problem has nothing to do with Arabic, replacing the Arabic text with hello produced this:
image

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 21, 2022

I used the github-actios binaries instead of compiling, assuming that it won't affect anything, let me know if you want me to actually compile it instead of using the github-actions. I tried multiple fonts that supports Arabic, all rendered properly. but there's a couple of small problems.

Github-actions build works great

there's an additional ♿ emoji and a "?" one, the Arabic text is rendered properly tho, I'm not sure if the Arabic text has anything to do with this problem.

This is probably an artifact of this change made in #464. Before it would advance by the amount of codepoints but 99.9% of emojis are handled as a single char in Substring's eyes so an index out of bounds was not uncommon. I personally would rather have the duplicate emojis than the ioob.

https://github.com/ScrubN/TwitchDownloader/blob/bf2f862091b049a860e96ebfaa1c745a8e9e3979/TwitchDownloaderCore/ChatRenderer.cs#L577-L579

image
note that this message has Arabic and directly after it the emojis, no space between the text and the emoji.

I can see why this is a problem and how it occurs. My solution to this is would be a rewrite of ChatRenderer.cs in OOP because the structure of it as it stands can be a headache to work with and is very unoptimal. For example there is no reason to draw nonFont chars one-by-one and we could speed up rendering while also making it easier to read by drawing whole nonFont sections in one go, but doing this would require rewriting a few functions.

@mohad12211
Copy link
Contributor

I see, it's a rare occurrence anyways, for now everything seems to be working.

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 21, 2022

I still don't understand why when using an invalid font the arabic is rendered correctly but the GetFallbackFont() font doesn't. On Windows both are Segoe UI and the rendering problem is present on my Debian WSL.

Also this may be wrong but I believe there is a lot of wasted memory because it looks like the font stored in the fallback cache for a given character contains the entire language.
image

@mohad12211
Copy link
Contributor

what do you mean by an "invalid font"?

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 21, 2022

what do you mean by an "invalid font"?

In order for the arabic to render correctly in the image below, I passed --font foo as an argument to cause SKTypeFace.FromFamilyName() to return the system fallback because I do not have a font named foo installed.
image

if (renderOptions.Font == "Inter Embedded")
{
nameFont.Typeface = GetInterTypeface(renderOptions.UsernameFontStyle);
messageFont.Typeface = GetInterTypeface(renderOptions.MessageFontStyle);
}
else
{
nameFont.Typeface = SKTypeface.FromFamilyName(renderOptions.Font, renderOptions.UsernameFontStyle);
messageFont.Typeface = SKTypeface.FromFamilyName(renderOptions.Font, renderOptions.MessageFontStyle);
}

When I pass any other valid font, such as --font 'Yu Gothic UI', or pass no font argument so Inter Embedded is the chosen message font, the arabic must render using the font provided by the function ChatRenderer.GetFallbackFont() which should be functionally identical to passing an invalid font as it is literally the exact same typeface but instead it renders incorrectly:
image

Both images had the exact same message input as the first message in this issue thread, I just copy-pasted it into a different chat for faster render times.

@mohad12211
Copy link
Contributor

mohad12211 commented Dec 21, 2022

Ohhh now I see, in that case I have the same problem.
passing a font that supports Arabic works.
passing an invalid font (e.g. -f "foo") will also work. (will use my system font)
passing a font that doesn't support Arabic/Passing no -f argument will not work.

Tested on Linux.

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 21, 2022

Ohhh now I see, in that case I have the same problem.
passing a font that supports Arabic works. passing an invalid font (e.g. -f "foo") will also work. (will use my system font)
passing a font that doesn't support Arabic/Passing no -f argument will not work.

The thing is though, passing a font that doesnt support arabic/no -f argument goes and fetches your system font to render nonFont chars with, but only in the case of arabic (as far as i can tell) does it fail to render correctly in this specific scenario despite it fetching the same font as it would have had you passed -f foo.

@mohad12211
Copy link
Contributor

mohad12211 commented Dec 21, 2022

For example there is no reason to draw nonFont chars one-by-one

I think that this is the problem, maybe...
passing a font that doesn't support Arabic will actually use the fallback font, but the wrong form of the letter.

image
image

I can see that both of these images use the same font, the letters are drawn in the same style (font). but the letters are in the wrong form.

in Arabic, the shape of the letters changes depending on whether they occur alone, initially, medially, or finally within a word.
in the second image, where the letters are rendered incorrectly, all of the letters are rendered in the alone case, basically its assuming that there are no letters before it or after it.

so, I assume that you normally render word by word, but in the case of nonFont chars, you render them one by one.
this will cause the chat to be correctly rendered when passing an invalid font, because it's rendering word by word.
and will cause the chat to be incorrectly rendered when passing a valid font with no Arabic support, because you are rendering char by char, and Arabic depends on the previous and next chars.

is that the case? are you rendering word by word normally but in nonFont chars, char by char?

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 21, 2022

Y'know I never even thought of that. I'll test if that's the case and I guess also test my theory about the wasted memory.

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 21, 2022

image
Well, the answers are: true and true

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 21, 2022

And as a side effect, rendering nonFont got slightly faster with the chat you provided

Before: FINISHED. RENDER TIME: 15s SPEED: 3.81x
After: FINISHED. RENDER TIME: 7s SPEED: 8.42x

Edit: This seems to be a flat speed increase rather than a % speed increase

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 23, 2022

@mohad12211 Fixed this edge case.
image

Curiously though, this only lost the duplicate emoji. I have no clue what the white thing is. Perhaps a modifier that our emoji codec doesn't support?
image
With patch:
image
1.51.1:
image

@lay295
Copy link
Owner

lay295 commented Dec 23, 2022

This is probably an artifact of this change made in #464. Before it would advance by the amount of codepoints but 99.9% of emojis are handled as a single char in Substring's eyes so an index out of bounds was not uncommon. I personally would rather have the duplicate emojis than the ioob.

Were ioob that common? Why weren't there more crashes reported because of this.

99.9% of emojis are handled as a single char in Substring's eyes

Aren't 99.9% of emojis multiple characters? Because the char in C# internally is UTF-16 and most emojis seem to be a high and low surrogate of these UTF-16 characters to make a UTF-32 character.

EDIT:
Yeah the more I look at it I still don't think it really makes sense to me. Even a very basic emoji like 😊 is 2 characters. I'm not really understanding why we'd only remove 1 character.

@lay295
Copy link
Owner

lay295 commented Dec 23, 2022

8-30-21.iLoveKeepo69.-.emojis.mp4

Some old emoji testing I did, but it never resulted in an OOB exception for me.

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 23, 2022

Were ioob that common? Why weren't there more crashes reported because of this.

I ran into them when someone sent the English flag 8 times in a row. Substring sees it as 1 char but it's made of 14 codepoints so before we would substring up 14 chars and IOOB.

Aren't 99.9% of emojis multiple characters? Because the char in C# internally is UTF-16 and most emojis seem to be a high and low surrogate of these UTF-16 characters to make a UTF-32 character.

It's a mix of both.

Working with Unicode has been a pain in my ass. There's technically another edge case that results in the wrong font being used for nonFont rendering because there is no way to get the Unicode range of a specific char. I looked online for like 3 days, stackoverflow always egotistically says to have the dev specify the desired Unicode range in the source code which does jack for us.

@lay295
Copy link
Owner

lay295 commented Dec 23, 2022

I ran into them when someone sent the English flag 8 times in a row. Substring sees it as 1 char but it's made of 14 codepoints so before we would substring up 14 chars and IOOB.

Do you still have an example JSON or could construct one that would crash on the older versions? I did testing with flags before and I didn't notice anything like that.

I feel pretty strongly that we should not be decrementing just by 1.

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 23, 2022

I ran into them when someone sent the English flag 8 times in a row. Substring sees it as 1 char but it's made of 14 codepoints so before we would substring up 14 chars and IOOB.

Do you still have an example JSON or could construct one that would crash on the older versions? I did testing with flags before and I didn't notice anything like that.

I feel pretty strongly that we should not be decrementing just by 1.

I just changed it to using the length of the sequence string rather than incrementing by just 1, thats what fixed the duplicate emojis in #484 (comment). Before the patch we were incrementing by the number of codepoints. 5974546 337bc10

@lay295
Copy link
Owner

lay295 commented Dec 23, 2022

Oh that works. I just don't see how it's functionally different than what was in before #464. Besides the obvious trim difference, I feel it would always return the same thing?

fragmentString = fragmentString.Substring(selectedEmoji.Sequence.AsString.Trim('\uFE0F').Length);

fragmentSpan = fragmentSpan.Slice(selectedEmoji.Sequence.AsString.Length);

Oh you catch the OOB now. If you could paste me a string that hits the OOB I'm still really curious...

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 23, 2022

Oh. Guess I'll add the trim back. I swear at some point we were substringing by the amount of codepoints.

Heres the OOB string: 🏴🏴🏴🏴🏴🏴🏴🏴🏴🏴🏴🏴

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 23, 2022

Ok I'm running it again and its not OOB anymore, what?? I'm so confused man. Did I accidentally change it to codepoints length and cause the OOB myself or did swapping to spans fix it?

@lay295
Copy link
Owner

lay295 commented Dec 23, 2022

No yeah, it did render on 1.51.1 alright

@ScrubN
Copy link
Collaborator Author

ScrubN commented Dec 23, 2022

Well I've re-added the trim but left the try-catch incase it happens again. In other words it's back to how it used to substring in a6dda93

1.51.0/1.51.1 substring by 1 only char.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants