[Epic] COOKED READ and character conversion fun with all our different Convert* functions #7777

DHowett · 2020-09-29T17:22:40Z

[Claimed by miniksa]
This issue exists to resolve many issues around cooked read data processing including choking on the return of bytes in various A function calls from Read methods (with both DBCS and UTF-8 codepages), the improper display of Emoji or other U+10000 characters on the edit line (commonly seen in cmd.exe), and the general buffer grossness of mixmatching both char and wchar_t text within the same client-owned buffer space during waiting read operations.

Related issues

These issues are related and should be improved or resolved by this work:

These issues should be easier to resolve after this work or be fixed by it:

Uncategorized:

ReadConsoleW fails with non-BMP characters #4628

This is what I expect to do:

Start with _handlePostCharInputLoop and revise it so it doesn't attempt to precount the number of "bytes" by counting the "width" of characters. Instead move to just translating the text and storing the excess, much like an aliased command would do.
Realize TranslateUnicodeToOem has almost no usages anymore (and ConvertToOem) and could likely be converged with ConvertToA. Also that the fallback behavior of TranslateUnicodeToOem can be accomplished just by asking WC2MB to replace with the default character anyway.
Realize that if we don't need to know the CP's default character, then why are we loading it on startup? Oh, it's still because we need the lead byte table to know when something IsDBCSLeadByteConsole or not.
Realize that now that we don't use custom codepages anymore... there doesn't really seem to be a difference between IsDBCSLeadByteConsole and just calling the winnls.h exported IsDBCSLeadByteEx with the same codepage (and then not holding onto the CPInfo stuff at all.)
Doing the same thing for output codepage as input codepage in respect to dumping the load of that information and just asking winnls.h directly.
Look closely at CheckBisectStringA because it only seems to have one consumer that's really just checking if the final cell of a row IsDBCSLeadByteEx...
The ConvertInputToUnicode and ConvertOutputToUnicode are pretty darn close to ConvertToA and ConvertToW anyway. The only variations I can see in the pattern of using MB2WC and WC2MB are: no flags at all, putting the default character in when choking, or using glyph chars instead of ctrl chars. So why can't we just have ConvertToA and ConvertToW have those modes and run all the translations through those and use the safer and more sensible string-in-string-out pattern to translate everything.
If we're going all in on ConvertToA and ConvertToW... perhaps it's time for til::u8u16 and til::u16u8 to get their til::au16 and til::u16a variants brought in, have the flags added, and just have a unified way of converting things around here.
Can we also torch CharToWchar since its just translation of a short string (single character) but with the glyph for ctrl char substitution into a til::au16 with the glyph chars flag?

The expected effects are primarily:

Returning classic codepage or UTF-8 information out of ReadConsoleA, ReadFile and such should work correctly
We should eliminate a giant pile of static analysis suppressions
The cooked line should be able to just hold/display/convert emoji correctly and erase them correctly

More will show up in each of these headings as I discover it or we have feedback

The text was updated successfully, but these errors were encountered:

DHowett · 2020-11-05T00:18:15Z

I expect (hope) that @eryksun has some things to contribute here, given how intimately he's reverse-engineered our stack
and how much of it he understands 😄

miniksa · 2020-11-05T23:32:43Z

Research from today....

I'm writing with WriteConsoleInputW. I'm reading with ReadConsoleInputA off of a cooked read.

This is the scenario overview:

Char Name	Char	`wchar_t`	437 `char`	932 `char*`
Greek Lowercase Alpha	α	`0x03b1`	`0xe0`	`0x83 0xbf`
Greek Lowercase Beta	β	`0x03b2`	`0xe1`	`0x83 0xc0`
Greek Lowercase Delta	δ	`0x03b4`	`0xeb`	`0x83 0xc2`
Greek Lowercase Epsilon	ε	`0x03b5`	`0xee`	`0x83 0xc3`

In conhostv1, if you were half way through reading a multibyte sequence and changed the codepage, the remaining bytes are discarded for the next call in the new codepage. By contrast, you're normally allowed to read byte by byte, even in a multibyte codepage like Japanese (shift-jis 932) and we'll store the leftover trail byte for the next call. But it looks like there's precedent to chuck it if you change codepage mid way. It also gives you an extra "byte read" for funsies.

SetConsoleMode on the input handle to ensure ENABLE_ECHO_INPUT | ENABLE_LINE_INPUT | ENABLE_PROCESSED_INPUT for cooked mode.
WriteConsoleInputW as 0x03b1 0x03b2 0x03b4 0x03b5 0x000d 0x000a (newline terminate it to trigger end of cooked read.)
SetConsoleCP to 932.
ReadConsoleInputA for 3 bytes. Receive 0x83 0xbf 0x83. Reported 4 bytes read.
SetConsoleCP to 437.
ReadConsoleInputA for 490 bytes. Receive 0xee 0x0d 0x0a. Reported 3 bytes read.
0xc0 from beta discarded.. where'd the delta go? Also... thanks for reporting another byte read that you didn't give me...

Also in conhostv1... it looks like if you change codepage while reading.... it just flat out loses the next character to the void.

SetConsoleMode on the input handle to ensure ENABLE_ECHO_INPUT | ENABLE_LINE_INPUT | ENABLE_PROCESSED_INPUT for cooked mode.
WriteConsoleInputW as 0x03b1 0x03b2 0x03b4 0x03b5 0x000d 0x000a (newline terminate it to trigger end of cooked read.)
SetConsoleCP to 932.
ReadConsoleInputA for 2 bytes. Receive 0x83 0xbf. Reported 2 bytes read.
SetConsoleCP to 437.
ReadConsoleInputA for 490 bytes. Receive 0xeb 0xee 0x0d 0x0a. Reported 4 bytes read.
Where'd the beta go?

miniksa · 2020-11-05T23:43:25Z

Current thoughts:

Giving leftover bytes of previous codepage after a switch on read continuation

I'm pretty sold that the v1 behavior is right here. If you changed the input codepage half way through reading a byte stream, I think it's easy to say you don't care about the left behind trailing bytes for DBCS or theoretically the rest of the UTF-8 sequence.

Losing a character when the codepage changes

This feels absolutely wrong on v1's part. We shouldn't be losing characters for switching a codepage. I'd like to figure out why, but I don't intend to preserve that.

Returning the read bytes as how many were pulled off the input queue, not how much of the given buffer was used

This feels really strange and feels like an ages old mistake in conhostv1. I'm not sure why you'd return lpNumberOfCharsRead to be longer than the initial size of the buffer given in nNumberOfCharsToRead for the ReadConsoleA API. That means that the caller now has extra work to realize that we didn't overrun their buffer but we're implying that there's a continuation that didn't fit. That's a really strange API pattern relative to other Read* sorts of APIs in Windows. But it now has sort of a 20 year precedent and I can imagine someone is depending on that amount read > number to read to realize that there are trailing bytes.

So I could see that one going either way. And for UTF-8 it could easily be 1-3 greater than the buffer size, not just the 1 greater that happens with a torn DBCS lead byte.

EDIT: I'm now more strongly favoring the number read should never exceed the buffer size. Because then what do we say on the return of the extra held bytes? Is the total count now absurdly larger than the total concatenated byte length? Is it zero or smaller than the length of buffer we filled? Yuck!

miniksa · 2020-11-06T00:20:13Z

Also in conhostv1... if you read 1 byte at a time off of Japanese codepage 932.... it loses the trailing character to the void every time.

SetConsoleMode on the input handle to ensure ENABLE_ECHO_INPUT | ENABLE_LINE_INPUT | ENABLE_PROCESSED_INPUT for cooked mode.
WriteConsoleInputW as 0x03b1 0x03b2 0x03b4 0x03b5 0x000d 0x000a (newline terminate it to trigger end of cooked read.)
SetConsoleCP to 932.
ReadConsoleInputA for 1 byte. Receive 0x83. Reported 2 bytes read.
ReadConsoleInputA for 1 byte. Receive 0x83. Reported 2 bytes read.
ReadConsoleInputA for 1 byte. Receive 0x83. Reported 2 bytes read.
ReadConsoleInputA for 1 byte. Receive 0x83. Reported 2 bytes read.
ReadConsoleInputA for 1 byte. Receive 0x0d. Reported 1 bytes read.
ReadConsoleInputA for 1 byte. Receive 0x0a. Reported 1 bytes read.
Where are my trailing bytes?!

Also in conhostv1... if you read 1 byte off and then read the rest in Japanese codepage 932... it loses the trailing character and doesn't stitch it to the next string. But it does present the rest of the string with a reasonable byte count.

SetConsoleMode on the input handle to ensure ENABLE_ECHO_INPUT | ENABLE_LINE_INPUT | ENABLE_PROCESSED_INPUT for cooked mode.
WriteConsoleInputW as 0x03b1 0x03b2 0x03b4 0x03b5 0x000d 0x000a (newline terminate it to trigger end of cooked read.)
SetConsoleCP to 932.
ReadConsoleInputA for 1 byte. Receive 0x83. Reported 2 bytes read.
ReadConsoleInputA for 100 bytes. Receive 0x83 0xc0 0x83 0xc2 0x83 0xc3 0x0d 0x0a. Reported 8 bytes read.
You lost one of my trailing bytes!

miniksa · 2020-11-06T00:21:57Z

Losing the trailing byte when reading 1 by 1

The console host, including v1, has a lot of provisions for storing that trailing byte in the input handle that made the request. It theoretically should be presenting it as the first thing up on the next read.

Am I not seeing it because I'm using a thread to read? I had to change the screen buffer to allow shared reading to make my threaded read work... maybe its actually on a different handle and therefore I don't get the trail?

miniksa · 2020-11-06T00:39:27Z

Losing the trailing byte when reading 1 by 1

The console host, including v1, has a lot of provisions for storing that trailing byte in the input handle that made the request. It theoretically should be presenting it as the first thing up on the next read.

Am I not seeing it because I'm using a thread to read? I had to change the screen buffer to allow shared reading to make my threaded read work... maybe its actually on a different handle and therefore I don't get the trail?

Doesn't look like it's a threading thing. I had to allow shared on the test screen buffer because cooked input is on and it needs a handle to the output buffer so it can echo. So an unshared test screen buffer won't work.

miniksa · 2020-11-06T00:55:38Z

I have a theory on this. I think if we look in TranslateUnicodeToOem which has been historically responsible for identifying and storing that trailing byte (this code is very similar in v1 console)..

terminal/src/host/dbcs.cpp

Lines 183 to 202 in 015675d

    
           if (IsDBCSLeadByteConsole(AsciiDbcs[0], &gci.CPInfo)) 
        
           { 
        
               if (j < cbAnsi - 1) 
        
               { // -1 is safe DBCS in buffer 
        
                   pchAnsi[j] = AsciiDbcs[0]; 
        
                   j++; 
        
                   pchAnsi[j] = AsciiDbcs[1]; 
        
                   AsciiDbcs[1] = 0; 
        
               } 
        
               else 
        
               { 
        
                   pchAnsi[j] = AsciiDbcs[0]; 
        
                   break; 
        
               } 
        
           } 
        
           else 
        
           { 
        
               pchAnsi[j] = AsciiDbcs[0]; 
        
               AsciiDbcs[1] = 0; 
        
           }

I think the only time that it will actually manage to store the trailing byte in the INPUT_READ_HANDLE_DATA and not set AsciiDbcs[1] = 0; immediately is when the size of the buffer coming down is landing on an odd number (or has been offset by 1 somehow).

However, if we look in the method that is decoding the ReadConsole packet and deciding what to do... ServerReadConsole... which is arguably a lot cleaned up in v2 console but should have about the same logic as v1... GetAugmentedOutputBuffer is always doubling the user's specified buffer size... presumably because it's using it willy nilly as scratch space while it operates.

terminal/src/server/ApiDispatchers.cpp

Lines 271 to 276 in ae550e0

    
           // TODO: This is dumb. We should find out how much we need, not guess. 
        
           // If the request is not in Unicode mode, we must allocate an output buffer that is twice as big as the actual caller buffer. 
        
           RETURN_IF_FAILED(m->GetAugmentedOutputBuffer((a->Unicode != FALSE) ? 1 : 2, 
        
                                                        &pvBuffer, 
        
                                                        &cbBufferSize));

So I'm not sure it's even possible to have the trailing byte get stored appropriately and be delivered later.

But I've only looked at this for cooked read. So...

Can anything actually managed to store the trailing byte for pickup later? Raw or direct mode instead of cooked?
Is there really a way to get the buffer to give me just the lead byte and faithfully hold the trail? Offsetting the buffer by 1 somehow with a 1 byte shift-JIS sequence that shifts everything off?
If it never worked... do we go with the spirit of what it was supposed to do and hold the trailing bytes correctly so someone could read byte by byte? And if so... what do we indicate on the "number of chars read"?

miniksa · 2020-11-10T00:48:28Z

Can anything actually managed to store the trailing byte for pickup later? Raw or direct mode instead of cooked?

Still need to check.

Is there really a way to get the buffer to give me just the lead byte and faithfully hold the trail? Offsetting the buffer by 1 somehow with a 1 byte shift-JIS sequence that shifts everything off?

Nope. Doesn't work. I tried zα and reading either 1 or 2 bytes. If you read 1, you get just the z. If you read 2, it sends back z then 0x83 0xbf and reports 3 bytes read, but only 2 ever fill the client app's buffer and it's just gone. I didn't find what in the driver/client servicing is preventing the overrun... but something is. And it's losing the extra bytes.

If it never worked... do we go with the spirit of what it was supposed to do and hold the trailing bytes correctly so someone could read byte by byte? And if so... what do we indicate on the "number of chars read"?

I believe the answer is yes. conhost.exe should never be returning more bytes than the user buffer said it could hold. The fact that the driver/client is preventing an overrun by truncating isn't something we should rely on.

Also reading the client code, we should be returning the number of chars read as the number of the client application buffer filled. Not more than that. Something's saving the console system from the overrun, but we're being mean and inducing an overrun in the client.

miniksa · 2020-11-10T23:25:23Z

Can anything actually managed to store the trailing byte for pickup later? Raw or direct mode instead of cooked?

Still need to check.

For cooked, raw....(ReadConsoleA) no. The buffer is overfilled/overdeclared in conhostv1 to the driver. It truncates it to the actual size of the user buffer, losing the trailing byte into space.

For direct...(ReadConsoleInputA) no. The buffer is filled with the trailing byte, but then the method early returns without telling the driver that the buffer has >0 bytes of payload to return.... again losing the trailing byte into space.

miniksa · 2020-11-12T00:25:19Z

For direct... (ReadConsoleInputA)... the no is with an exception. It's only no if it's asked for 1 record at a time. If it's anything greater than 1, it will successfully return the trail. That's pretty much the only way to successfully get a trailing byte back, as far as I can tell.

miniksa · 2020-11-13T17:41:57Z

Oh no. Oh no no. ReadConsoleA also has the problem where if you set it to raw read and ask it for more bytes than it has in storage... it just fills the buffer up with wchar_ts and early returns without translating them to the codepage you asked for.

miniksa · 2020-11-13T17:46:54Z

So here it is.

For conhostv1 (and thus as far back as we can tell for compatibility reasons):

The only A way to read input correctly is with ReadConsoleInputA and providing a buffer much larger than you ever expected to fill.
Otherwise, you must use the W versions of the ReadConsole* APIs to avoid pitfalls.

The pitfalls include:

Trailing bytes disappearing to space in one of two different ways.
Extra nulls being padded into the result.
Receiving UTF-16 data that is untranslated to the codepage you selected.

So my plan is to codify this in a series of tests against v1. Then add a switch in the tests for what the behavior SHOULD be for any app to actually be able to use these APIs (that is... take all those bugs/mistakes out.) And as a part of running the bugs out, I should also be lighting up UTF-8 on those APIs.

This is technically a compatibility break, but given that the APIs were pretty much useless for as far back as we can determine providing garbage data to any app unfortunate enough to attempt to use them... we're going to call that acceptable risk for now.

miniksa · 2020-11-13T21:12:58Z

To future self: It's not safe to go alone, take this:

%TAEF% .\bin\x64\debug\conhost.feature.tests.dll /select:"(@Name='*ReadCharByChar*' or @Name='*ReadChangeCodepageBetween*' or @Name='*ReadChangeCodepageIn*' or @Name='*ReadLeadTrail*')" /p:TestAsV1=true

lhecker · 2023-08-01T14:33:31Z

FYI large parts of this issue will be resolved by #15783. Some parts of conhost are still unaware about UTF-16 and UTF-8 however. But we're getting close.

zadjii-msft · 2023-08-23T15:02:32Z

Do we have tracking issues for what parts of this aren't fixed by #15783? It kinda looks like everything linked here will be closed when that merges.

mominshaikhdevs · 2024-03-29T12:41:16Z

#15783 has been merged. Is this issue still relevant?

lhecker · 2024-03-29T13:09:10Z

Hmm yeah I think it'd be fair to close this issue. There are still some edge cases but they aren't even mentioned in this issue. Thanks!

ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Sep 29, 2020

zadjii-msft added ⛺ Reserved For future use and removed Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting labels Sep 29, 2020

This comment has been minimized.

Sign in to view

miniksa changed the title ~~<camper>~~ [Epic] COOKED_READ Oct 21, 2020

ghost removed the Needs-Tag-Fix Doesn't match tag requirements label Oct 21, 2020

miniksa added Issue-Feature Complex enough to require an in depth planning process and actual budgeted, scheduled work. Impact-Correctness It be wrong. labels Oct 21, 2020

miniksa changed the title ~~[Epic] COOKED_READ~~ [Epic] COOKED READ Oct 21, 2020

miniksa changed the title ~~[Epic] COOKED READ~~ [Epic] COOKED READ and character conversion fun with all our different Convert* functions Oct 21, 2020

miniksa added this to the Windows vNext milestone Oct 21, 2020

miniksa mentioned this issue Oct 28, 2020

2009 update MicrosoftDocs/Console-Docs#162

Merged

wez mentioned this issue Nov 4, 2020

Ctrl-Space inserts quotes using the German keyboard layout wez/wezterm#318

Closed

DHowett mentioned this issue Nov 20, 2020

Pasting Unicode characters in Cygwin Bash loses the first element of the surrogate pair #8302

Closed

dankamongmen mentioned this issue Aug 26, 2021

replace MSYS2 write() with WriteFile dankamongmen/notcurses#2102

Closed

alexrp mentioned this issue Jan 2, 2022

Switch to UTF-8 input on Windows vezel-dev/cathode#68

Closed

zadjii-msft modified the milestones: Windows vNext, 22H2 Jan 4, 2022

zadjii-msft added Area-CookedRead The cmd.exe COOKED_READ handling Issue-Scenario and removed Issue-Feature Complex enough to require an in depth planning process and actual budgeted, scheduled work. Issue-Task It's a feature request, but it doesn't really need a major design. labels Feb 23, 2022

everything411 mentioned this issue May 6, 2022

Help text has wrong encoding when "Beta: Use Unicode UTF-8 for worldwide language support" is enabled microsoft/openjdk#43

Closed

kelleyma49 mentioned this issue Aug 14, 2022

Cyrillic type input is not working on Windows when height parameter is used junegunn/fzf#2921

Open

3 tasks

zadjii-msft modified the milestones: 22H2, Megathreads Dec 5, 2022

zadjii-msft mentioned this issue Dec 13, 2022

Add emoji support to Windows Console #190

Closed

lhecker self-assigned this Jan 23, 2023

lhecker closed this as completed Mar 29, 2024

microsoft-github-policy-service bot added the Needs-Tag-Fix Doesn't match tag requirements label Mar 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Epic] COOKED READ and character conversion fun with all our different Convert* functions #7777

[Epic] COOKED READ and character conversion fun with all our different Convert* functions #7777

DHowett commented Sep 29, 2020 •

edited by miniksa

Loading

This comment has been minimized.

This comment has been minimized.

DHowett commented Nov 5, 2020 •

edited

Loading

miniksa commented Nov 5, 2020 •

edited

Loading

miniksa commented Nov 5, 2020 •

edited

Loading

miniksa commented Nov 6, 2020

miniksa commented Nov 6, 2020

miniksa commented Nov 6, 2020

Losing the trailing byte when reading 1 by 1

miniksa commented Nov 6, 2020 •

edited

Loading

miniksa commented Nov 10, 2020 •

edited

Loading

miniksa commented Nov 10, 2020

miniksa commented Nov 12, 2020

miniksa commented Nov 13, 2020

miniksa commented Nov 13, 2020

miniksa commented Nov 13, 2020

lhecker commented Aug 1, 2023

zadjii-msft commented Aug 23, 2023

mominshaikhdevs commented Mar 29, 2024

lhecker commented Mar 29, 2024

[Epic] COOKED READ and character conversion fun with all our different Convert* functions #7777

[Epic] COOKED READ and character conversion fun with all our different Convert* functions #7777

Comments

DHowett commented Sep 29, 2020 • edited by miniksa Loading

Related issues

These issues are related and should be improved or resolved by this work:

These issues should be easier to resolve after this work or be fixed by it:

Uncategorized:

This is what I expect to do:

The expected effects are primarily:

This comment has been minimized.

This comment has been minimized.

DHowett commented Nov 5, 2020 • edited Loading

miniksa commented Nov 5, 2020 • edited Loading

miniksa commented Nov 5, 2020 • edited Loading

Giving leftover bytes of previous codepage after a switch on read continuation

Losing a character when the codepage changes

Returning the read bytes as how many were pulled off the input queue, not how much of the given buffer was used

miniksa commented Nov 6, 2020

miniksa commented Nov 6, 2020

Losing the trailing byte when reading 1 by 1

miniksa commented Nov 6, 2020

Losing the trailing byte when reading 1 by 1

miniksa commented Nov 6, 2020 • edited Loading

miniksa commented Nov 10, 2020 • edited Loading

miniksa commented Nov 10, 2020

miniksa commented Nov 12, 2020

miniksa commented Nov 13, 2020

miniksa commented Nov 13, 2020

miniksa commented Nov 13, 2020

lhecker commented Aug 1, 2023

zadjii-msft commented Aug 23, 2023

mominshaikhdevs commented Mar 29, 2024

lhecker commented Mar 29, 2024

DHowett commented Sep 29, 2020 •

edited by miniksa

Loading

DHowett commented Nov 5, 2020 •

edited

Loading

miniksa commented Nov 5, 2020 •

edited

Loading

miniksa commented Nov 5, 2020 •

edited

Loading

miniksa commented Nov 6, 2020 •

edited

Loading

miniksa commented Nov 10, 2020 •

edited

Loading