Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On Windows code points from Supplementary Planes (U+010000 to U+10FFFF) are not accepted from keyboard #391

Closed
Konstantin-Glukhov opened this issue Jun 25, 2023 · 17 comments

Comments

@Konstantin-Glukhov
Copy link
Contributor

No description provided.

@gwsw
Copy link
Owner

gwsw commented Aug 28, 2023

@adoxa @avih
I've been experimenting a bit, and the behavior of ReadConsoleInputW seems really strange to me. If I paste a non-BMP char, ReadConsoleInputW return a KeyDown event with AsciiChar and UnicodeChar both set to zero and wVirtualScanCode set to 0x38, then a KeyUp event with UnicodeChar set to the codepoint of the pasted char.

In contrast, if I press an arrow key, I get a KeyDown event with again AsciiChar and UnicodeChar both set to zero and wVirtualScanCode set to the arrow's scan code, and then a KeyUp event with the fields set the same. This looks very similar to the non-BMP paste. Currently less processes such an event on the KeyDown, as it should, but to support a non-BMP paste, it seems I would have to wait until I get the KeyUp to distinguish them. This would be a poor user experience.

Is there a way to distinguish a non-BMP char from a regular non-ASCII key like the arrows? The 0x38 scan code seems to be the ALT key. If I press ALT I get a KeyDown event that looks (AFAICT) identical to the first event of a non-BMP char. I don't understand why a non-BMP char doesn't deliver the codepoint in the KeyDown event.

@gwsw
Copy link
Owner

gwsw commented Aug 28, 2023

Correction: it's not non-BMP chars that fail to populate the KeyDown event. Some chars work and some don't, and I can't see a pattern. For example, U+00AA works, but all the others between U+00A1 and U+00B4 do not. In the cases that work, the KeyDown event has wVirtualScanCode and wVirtualKeyCode set to zero, while in the cases that fail, the KeyDown event has wVirtualScanCode == 0x38 and wVirtualKeyCode == VK_MENU. I don't understand this.

@avih
Copy link
Contributor

avih commented Aug 28, 2023

It seems there are many issues with ReadConsoleInputW (and A), among them missing key-down events, missing key-up events, 0 instead of key-down, and additional issues when pasting unicode and with surrogate pairs.

The issues can also happen with BMP codepoints, see the samples at the comments at the link below.

Additionally, it behaves differently in different consoles (win XP/7/10) and windows terminal (win 10/11).

That's the solution I wrote for busybox-w32, which behaves like ReadConsoleInputA which delivers UTF8, while internally it uses ReadConsoleInputW:
https://github.com/rmyorston/busybox-w32/blob/d6fb7a2c36f8a6d5fac4754cdfdb5483a03f2892/win32/winansi.c#L1212-L1419

@gwsw
Copy link
Owner

gwsw commented Aug 28, 2023

OMG that's unbelievable that something as fundamental as reading chars from the console is broken. cmd itself doesn't seem to have this problem; I wonder what it does internally. I glanced through your busybox solution but haven't studied it in detail yet -- does it defer non-ASCII keypresses until key up happens? It seems like that would be an unpleasant experience for the user.

@avih
Copy link
Contributor

avih commented Aug 28, 2023

OMG that's unbelievable that something as fundamental as reading chars from the console is broken

Agreed. I'd love to have an alternative/smaller solution, but for now that's the best we got.

does it defer non-ASCII keypresses until key up happens?

No. The KB handling code mainly looks only at key-down, and ReadConsoleInput_utf8 delivers down/up as they come. But if it sees a key-up without matching prior key-down then it delivers a key-down event instead of the up, without further key-up of that event - which is fine for the usage which busybox-w32 does with it.

It also handles surrogate pairs which can be even more painful as a sequene of key-events.

For what it's worth, the windows terminal generally does a better job of consistent down/up events compared to the builtin windows console (conhost), but it's also buggy in some cases. If you try it now, use version 1.17 (1.18 is pre-whatever, and has more issues).

Generally speaking, the windows terminal is a front end for the new conhost, using the new CONPTY interface which is available since windows 10. The new conhost is developed at the same github repository as the windows terminal. If you only want to try the console (conhost) without the fancy windows terminal, just run OpenConsole.exe at the directory of the windows termibal. This binary is stand alone and looks identical to the builtin windows console/conhost, but is much improved (and it will launch cmd.exe by default).

If you want to try ReadConsoleInput_utf8, then delete these lines https://github.com/rmyorston/busybox-w32/blob/d6fb7a2c36f8a6d5fac4754cdfdb5483a03f2892/win32/winansi.c#L1371-L1373 - they're only relevant at the context of busybox-w32.

@adoxa
Copy link
Contributor

adoxa commented Aug 29, 2023

It's sending the equivalent Alt+numpad sequence for the current code page, or '?' (Alt+63) if n/a. It looks like U+00AA, U+00B5 & U+00BA count as alphabetic, which is why they have a direct code. CMD uses the high-level ReadConsoleW.

@avih
Copy link
Contributor

avih commented Aug 29, 2023

I've been experimenting a bit, and the behavior of ReadConsoleInputW seems really strange to me

It's sending the equivalent Alt+numpad sequence for the current code page, or '?' (Alt+63) if n/a.

Is that for ReadConsoleInputW? I didn't think the W variant cared about codepages?

Also, which codepage? ANSI (GetACP())? or the console's input CP (GetConsoleCP())?

@adoxa
Copy link
Contributor

adoxa commented Aug 29, 2023

The current code page, I didn't go so far as to test if it's explicitly input or output.

C:\Projects\less>chcp 437
Active code page: 437

C:\Projects\less>echo ¡|clip

C:\Projects\less>tst
UP; VK: 0d; scan: 1c; repeat: 1; state: 0020; ascii: 0d; unicode: 000d
DN; VK: 12; scan: 38; repeat: 1; state: 0002; ascii: 00; unicode: 0000
DN; VK: 61; scan: 4f; repeat: 1; state: 0002; ascii: 00; unicode: 0000
UP; VK: 61; scan: 4f; repeat: 1; state: 0002; ascii: 00; unicode: 0000
DN; VK: 67; scan: 47; repeat: 1; state: 0002; ascii: 00; unicode: 0000
UP; VK: 67; scan: 47; repeat: 1; state: 0002; ascii: 00; unicode: 0000
DN; VK: 63; scan: 51; repeat: 1; state: 0002; ascii: 00; unicode: 0000
UP; VK: 63; scan: 51; repeat: 1; state: 0002; ascii: 00; unicode: 0000
UP; VK: 12; scan: 38; repeat: 1; state: 0000; ascii: a1; unicode: 00a1
DN; VK: 0d; scan: 1c; repeat: 1; state: 0000; ascii: 0d; unicode: 000d
UP; VK: 0d; scan: 1c; repeat: 1; state: 0000; ascii: 0d; unicode: 000d
DN; VK: 1b; scan: 01; repeat: 1; state: 0020; ascii: 1b; unicode: 001b

C:\Projects\less>chcp 1252
Active code page: 1252

C:\Projects\less>tst
UP; VK: 0d; scan: 1c; repeat: 1; state: 0020; ascii: 0d; unicode: 000d
DN; VK: 12; scan: 38; repeat: 1; state: 0002; ascii: 00; unicode: 0000
DN; VK: 61; scan: 4f; repeat: 1; state: 0002; ascii: 00; unicode: 0000
UP; VK: 61; scan: 4f; repeat: 1; state: 0002; ascii: 00; unicode: 0000
DN; VK: 66; scan: 4d; repeat: 1; state: 0002; ascii: 00; unicode: 0000
UP; VK: 66; scan: 4d; repeat: 1; state: 0002; ascii: 00; unicode: 0000
DN; VK: 61; scan: 4f; repeat: 1; state: 0002; ascii: 00; unicode: 0000
UP; VK: 61; scan: 4f; repeat: 1; state: 0002; ascii: 00; unicode: 0000
UP; VK: 12; scan: 38; repeat: 1; state: 0000; ascii: a1; unicode: 00a1
DN; VK: 0d; scan: 1c; repeat: 1; state: 0000; ascii: 0d; unicode: 000d
UP; VK: 0d; scan: 1c; repeat: 1; state: 0000; ascii: 0d; unicode: 000d
DN; VK: 1b; scan: 01; repeat: 1; state: 0020; ascii: 1b; unicode: 001b

So with CP437 it uses Alt+173, with CP1252 it switches to Alt+161. Note, too, that the state has NumLock off (no 20), which is the opposite of what you need when entering at the keyboard.

@adoxa
Copy link
Contributor

adoxa commented Aug 29, 2023

Oh, that is with W, a buffer of 8, in 10.0.19045.3086 ConHost.

@gwsw
Copy link
Owner

gwsw commented Aug 29, 2023

This should be fixed in 0cd5427.

@avih
Copy link
Contributor

avih commented Aug 30, 2023

I've been experimenting a bit, and the behavior of ReadConsoleInputW seems really strange to me. If I paste a non-BMP char, ReadConsoleInputW return a KeyDown event with AsciiChar and UnicodeChar both set to zero and wVirtualScanCode set to 0x38, then a KeyUp event with UnicodeChar set to the codepoint of the pasted char.

Did you end up addressing this?

You're describing a case where the key-down for a codepoint seems to be missing, and the key-up event exists - which I've observed as well.

Is this addressed in any way at the KB handling code?

@gwsw
Copy link
Owner

gwsw commented Aug 30, 2023

Yes, that's handled by the last_down* stuff in console_input, specifically line 2845.

@avih
Copy link
Contributor

avih commented Aug 30, 2023

Right, I missed that, it's 76ea557

So, one thing I noticed is that you do that unconditionally, while at busybox-w32 we do that only for non-ASCII codepoints (128 or more).

The reason is that we've never observed this issue with ASCII codepoints, and also, this might generate an incorrect ENTER-down when invoking less from CLI.

Say one types less myfile.c <ENTER>. The shell will launch less on the ENTER-down event, and less will see the following ENTER-up event (at least this happens with busybox-w32), and because there was no prior ENTER-down, it might change it to ENTER-down, which will generate an incorrect ENTER-down key event right after launching less.

Does this issue happen with your current code?

@gwsw
Copy link
Owner

gwsw commented Aug 31, 2023

Thanks, yes, I can reproduce a spurious ENTER KeyDown event at startup. I have fixed this in 010d808.

@avih
Copy link
Contributor

avih commented Aug 31, 2023

Good.

@gwsw
Copy link
Owner

gwsw commented Sep 2, 2023

@avih I copied the code to construct the codepoint from the surrogate pair from your busybox code. But on review, I don't think it is correct. In line 1279 you OR together three expressions, one of which is 0x10000. But this means the result can never have a zero in bit 16, so it cannot represent a value such as U+20001. I think the 0x10000 term should be ADDed rather than ORed into the codepoint.

@avih
Copy link
Contributor

avih commented Sep 2, 2023

you OR together three expressions, one of which is 0x10000. But this means the result can never have a zero in bit 16

True, and there are other errors too, like if that bit is 1 also from the high surrogate, then it remains 1 instead of being added (becoming 0 and a higher bit.should become 1)

EDIT: actually, it's exactly the same case. The issue occurs exactly and only when the high surrogate has a 1 bit which overlaps the 1 of 0x1000 - which should then become 0 (and a higher bit should become 1), but doesn't due to the OR.

I'll fix that. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants