Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Epic] COOKED READ and character conversion fun with all our different Convert* functions #7777

Closed
Tracked by #190
DHowett opened this issue Sep 29, 2020 · 19 comments
Closed
Tracked by #190
Assignees
Labels
Area-CookedRead The cmd.exe COOKED_READ handling Area-Input Related to input processing (key presses, mouse, etc.) Impact-Correctness It be wrong. Issue-Bug It either shouldn't be doing this or needs an investigation. Issue-Scenario Needs-Tag-Fix Doesn't match tag requirements Product-Conhost For issues in the Console codebase Product-Conpty For console issues specifically related to conpty
Milestone

Comments

@DHowett
Copy link
Member

DHowett commented Sep 29, 2020

[Claimed by miniksa]
This issue exists to resolve many issues around cooked read data processing including choking on the return of bytes in various A function calls from Read methods (with both DBCS and UTF-8 codepages), the improper display of Emoji or other U+10000 characters on the edit line (commonly seen in cmd.exe), and the general buffer grossness of mixmatching both char and wchar_t text within the same client-owned buffer space during waiting read operations.

Related issues

These issues are related and should be improved or resolved by this work:

These issues should be easier to resolve after this work or be fixed by it:

Uncategorized:

This is what I expect to do:

  • Start with _handlePostCharInputLoop and revise it so it doesn't attempt to precount the number of "bytes" by counting the "width" of characters. Instead move to just translating the text and storing the excess, much like an aliased command would do.
  • Realize TranslateUnicodeToOem has almost no usages anymore (and ConvertToOem) and could likely be converged with ConvertToA. Also that the fallback behavior of TranslateUnicodeToOem can be accomplished just by asking WC2MB to replace with the default character anyway.
  • Realize that if we don't need to know the CP's default character, then why are we loading it on startup? Oh, it's still because we need the lead byte table to know when something IsDBCSLeadByteConsole or not.
  • Realize that now that we don't use custom codepages anymore... there doesn't really seem to be a difference between IsDBCSLeadByteConsole and just calling the winnls.h exported IsDBCSLeadByteEx with the same codepage (and then not holding onto the CPInfo stuff at all.)
  • Doing the same thing for output codepage as input codepage in respect to dumping the load of that information and just asking winnls.h directly.
  • Look closely at CheckBisectStringA because it only seems to have one consumer that's really just checking if the final cell of a row IsDBCSLeadByteEx...
  • The ConvertInputToUnicode and ConvertOutputToUnicode are pretty darn close to ConvertToA and ConvertToW anyway. The only variations I can see in the pattern of using MB2WC and WC2MB are: no flags at all, putting the default character in when choking, or using glyph chars instead of ctrl chars. So why can't we just have ConvertToA and ConvertToW have those modes and run all the translations through those and use the safer and more sensible string-in-string-out pattern to translate everything.
  • If we're going all in on ConvertToA and ConvertToW... perhaps it's time for til::u8u16 and til::u16u8 to get their til::au16 and til::u16a variants brought in, have the flags added, and just have a unified way of converting things around here.
  • Can we also torch CharToWchar since its just translation of a short string (single character) but with the glyph for ctrl char substitution into a til::au16 with the glyph chars flag?

The expected effects are primarily:

  • Returning classic codepage or UTF-8 information out of ReadConsoleA, ReadFile and such should work correctly
  • We should eliminate a giant pile of static analysis suppressions
  • The cooked line should be able to just hold/display/convert emoji correctly and erase them correctly

More will show up in each of these headings as I discover it or we have feedback

@ghost ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Sep 29, 2020
@zadjii-msft zadjii-msft added ⛺ Reserved For future use and removed Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting labels Sep 29, 2020
@codeofdusk

This comment has been minimized.

@zadjii-msft

This comment has been minimized.

@miniksa miniksa changed the title <camper> [Epic] COOKED_READ Oct 21, 2020
@miniksa miniksa added Area-Input Related to input processing (key presses, mouse, etc.) Issue-Bug It either shouldn't be doing this or needs an investigation. Issue-Task It's a feature request, but it doesn't really need a major design. Product-Conhost For issues in the Console codebase Product-Conpty For console issues specifically related to conpty and removed ⛺ Reserved For future use labels Oct 21, 2020
@ghost ghost removed the Needs-Tag-Fix Doesn't match tag requirements label Oct 21, 2020
@miniksa miniksa added Issue-Feature Complex enough to require an in depth planning process and actual budgeted, scheduled work. Impact-Correctness It be wrong. labels Oct 21, 2020
@miniksa miniksa changed the title [Epic] COOKED_READ [Epic] COOKED READ Oct 21, 2020
@miniksa miniksa changed the title [Epic] COOKED READ [Epic] COOKED READ and character conversion fun with all our different Convert* functions Oct 21, 2020
@miniksa miniksa added this to the Windows vNext milestone Oct 21, 2020
@DHowett
Copy link
Member Author

DHowett commented Nov 5, 2020

I expect (hope) that @eryksun has some things to contribute here, given how intimately he's reverse-engineered our stack
and how much of it he understands 😄

@miniksa
Copy link
Member

miniksa commented Nov 5, 2020

Research from today....

I'm writing with WriteConsoleInputW. I'm reading with ReadConsoleInputA off of a cooked read.

This is the scenario overview:

Char Name Char wchar_t 437 char 932 char*
Greek Lowercase Alpha α 0x03b1 0xe0 0x83 0xbf
Greek Lowercase Beta β 0x03b2 0xe1 0x83 0xc0
Greek Lowercase Delta δ 0x03b4 0xeb 0x83 0xc2
Greek Lowercase Epsilon ε 0x03b5 0xee 0x83 0xc3
  1. In conhostv1, if you were half way through reading a multibyte sequence and changed the codepage, the remaining bytes are discarded for the next call in the new codepage. By contrast, you're normally allowed to read byte by byte, even in a multibyte codepage like Japanese (shift-jis 932) and we'll store the leftover trail byte for the next call. But it looks like there's precedent to chuck it if you change codepage mid way. It also gives you an extra "byte read" for funsies.
  • SetConsoleMode on the input handle to ensure ENABLE_ECHO_INPUT | ENABLE_LINE_INPUT | ENABLE_PROCESSED_INPUT for cooked mode.
  • WriteConsoleInputW as 0x03b1 0x03b2 0x03b4 0x03b5 0x000d 0x000a (newline terminate it to trigger end of cooked read.)
  • SetConsoleCP to 932.
  • ReadConsoleInputA for 3 bytes. Receive 0x83 0xbf 0x83. Reported 4 bytes read.
  • SetConsoleCP to 437.
  • ReadConsoleInputA for 490 bytes. Receive 0xee 0x0d 0x0a. Reported 3 bytes read.
  • 0xc0 from beta discarded.. where'd the delta go? Also... thanks for reporting another byte read that you didn't give me...
  1. Also in conhostv1... it looks like if you change codepage while reading.... it just flat out loses the next character to the void.
  • SetConsoleMode on the input handle to ensure ENABLE_ECHO_INPUT | ENABLE_LINE_INPUT | ENABLE_PROCESSED_INPUT for cooked mode.
  • WriteConsoleInputW as 0x03b1 0x03b2 0x03b4 0x03b5 0x000d 0x000a (newline terminate it to trigger end of cooked read.)
  • SetConsoleCP to 932.
  • ReadConsoleInputA for 2 bytes. Receive 0x83 0xbf. Reported 2 bytes read.
  • SetConsoleCP to 437.
  • ReadConsoleInputA for 490 bytes. Receive 0xeb 0xee 0x0d 0x0a. Reported 4 bytes read.
  • Where'd the beta go?

@miniksa
Copy link
Member

miniksa commented Nov 5, 2020

Current thoughts:

Giving leftover bytes of previous codepage after a switch on read continuation

I'm pretty sold that the v1 behavior is right here. If you changed the input codepage half way through reading a byte stream, I think it's easy to say you don't care about the left behind trailing bytes for DBCS or theoretically the rest of the UTF-8 sequence.

Losing a character when the codepage changes

This feels absolutely wrong on v1's part. We shouldn't be losing characters for switching a codepage. I'd like to figure out why, but I don't intend to preserve that.

Returning the read bytes as how many were pulled off the input queue, not how much of the given buffer was used

This feels really strange and feels like an ages old mistake in conhostv1. I'm not sure why you'd return lpNumberOfCharsRead to be longer than the initial size of the buffer given in nNumberOfCharsToRead for the ReadConsoleA API. That means that the caller now has extra work to realize that we didn't overrun their buffer but we're implying that there's a continuation that didn't fit. That's a really strange API pattern relative to other Read* sorts of APIs in Windows. But it now has sort of a 20 year precedent and I can imagine someone is depending on that amount read > number to read to realize that there are trailing bytes.

So I could see that one going either way. And for UTF-8 it could easily be 1-3 greater than the buffer size, not just the 1 greater that happens with a torn DBCS lead byte.

EDIT: I'm now more strongly favoring the number read should never exceed the buffer size. Because then what do we say on the return of the extra held bytes? Is the total count now absurdly larger than the total concatenated byte length? Is it zero or smaller than the length of buffer we filled? Yuck!

@miniksa
Copy link
Member

miniksa commented Nov 6, 2020

  1. Also in conhostv1... if you read 1 byte at a time off of Japanese codepage 932.... it loses the trailing character to the void every time.
  • SetConsoleMode on the input handle to ensure ENABLE_ECHO_INPUT | ENABLE_LINE_INPUT | ENABLE_PROCESSED_INPUT for cooked mode.
  • WriteConsoleInputW as 0x03b1 0x03b2 0x03b4 0x03b5 0x000d 0x000a (newline terminate it to trigger end of cooked read.)
  • SetConsoleCP to 932.
  • ReadConsoleInputA for 1 byte. Receive 0x83. Reported 2 bytes read.
  • ReadConsoleInputA for 1 byte. Receive 0x83. Reported 2 bytes read.
  • ReadConsoleInputA for 1 byte. Receive 0x83. Reported 2 bytes read.
  • ReadConsoleInputA for 1 byte. Receive 0x83. Reported 2 bytes read.
  • ReadConsoleInputA for 1 byte. Receive 0x0d. Reported 1 bytes read.
  • ReadConsoleInputA for 1 byte. Receive 0x0a. Reported 1 bytes read.
  • Where are my trailing bytes?!
  1. Also in conhostv1... if you read 1 byte off and then read the rest in Japanese codepage 932... it loses the trailing character and doesn't stitch it to the next string. But it does present the rest of the string with a reasonable byte count.
  • SetConsoleMode on the input handle to ensure ENABLE_ECHO_INPUT | ENABLE_LINE_INPUT | ENABLE_PROCESSED_INPUT for cooked mode.
  • WriteConsoleInputW as 0x03b1 0x03b2 0x03b4 0x03b5 0x000d 0x000a (newline terminate it to trigger end of cooked read.)
  • SetConsoleCP to 932.
  • ReadConsoleInputA for 1 byte. Receive 0x83. Reported 2 bytes read.
  • ReadConsoleInputA for 100 bytes. Receive 0x83 0xc0 0x83 0xc2 0x83 0xc3 0x0d 0x0a. Reported 8 bytes read.
  • You lost one of my trailing bytes!

@miniksa
Copy link
Member

miniksa commented Nov 6, 2020

Losing the trailing byte when reading 1 by 1

The console host, including v1, has a lot of provisions for storing that trailing byte in the input handle that made the request. It theoretically should be presenting it as the first thing up on the next read.

Am I not seeing it because I'm using a thread to read? I had to change the screen buffer to allow shared reading to make my threaded read work... maybe its actually on a different handle and therefore I don't get the trail?

@miniksa
Copy link
Member

miniksa commented Nov 6, 2020

Losing the trailing byte when reading 1 by 1

The console host, including v1, has a lot of provisions for storing that trailing byte in the input handle that made the request. It theoretically should be presenting it as the first thing up on the next read.

Am I not seeing it because I'm using a thread to read? I had to change the screen buffer to allow shared reading to make my threaded read work... maybe its actually on a different handle and therefore I don't get the trail?

Doesn't look like it's a threading thing. I had to allow shared on the test screen buffer because cooked input is on and it needs a handle to the output buffer so it can echo. So an unshared test screen buffer won't work.

@miniksa
Copy link
Member

miniksa commented Nov 6, 2020

I have a theory on this. I think if we look in TranslateUnicodeToOem which has been historically responsible for identifying and storing that trailing byte (this code is very similar in v1 console)..

terminal/src/host/dbcs.cpp

Lines 183 to 202 in 015675d

if (IsDBCSLeadByteConsole(AsciiDbcs[0], &gci.CPInfo))
{
if (j < cbAnsi - 1)
{ // -1 is safe DBCS in buffer
pchAnsi[j] = AsciiDbcs[0];
j++;
pchAnsi[j] = AsciiDbcs[1];
AsciiDbcs[1] = 0;
}
else
{
pchAnsi[j] = AsciiDbcs[0];
break;
}
}
else
{
pchAnsi[j] = AsciiDbcs[0];
AsciiDbcs[1] = 0;
}

I think the only time that it will actually manage to store the trailing byte in the INPUT_READ_HANDLE_DATA and not set AsciiDbcs[1] = 0; immediately is when the size of the buffer coming down is landing on an odd number (or has been offset by 1 somehow).

However, if we look in the method that is decoding the ReadConsole packet and deciding what to do... ServerReadConsole... which is arguably a lot cleaned up in v2 console but should have about the same logic as v1... GetAugmentedOutputBuffer is always doubling the user's specified buffer size... presumably because it's using it willy nilly as scratch space while it operates.

// TODO: This is dumb. We should find out how much we need, not guess.
// If the request is not in Unicode mode, we must allocate an output buffer that is twice as big as the actual caller buffer.
RETURN_IF_FAILED(m->GetAugmentedOutputBuffer((a->Unicode != FALSE) ? 1 : 2,
&pvBuffer,
&cbBufferSize));

So I'm not sure it's even possible to have the trailing byte get stored appropriately and be delivered later.

But I've only looked at this for cooked read. So...

  1. Can anything actually managed to store the trailing byte for pickup later? Raw or direct mode instead of cooked?
  2. Is there really a way to get the buffer to give me just the lead byte and faithfully hold the trail? Offsetting the buffer by 1 somehow with a 1 byte shift-JIS sequence that shifts everything off?
  3. If it never worked... do we go with the spirit of what it was supposed to do and hold the trailing bytes correctly so someone could read byte by byte? And if so... what do we indicate on the "number of chars read"?

@miniksa
Copy link
Member

miniksa commented Nov 10, 2020

  1. Can anything actually managed to store the trailing byte for pickup later? Raw or direct mode instead of cooked?

Still need to check.

  1. Is there really a way to get the buffer to give me just the lead byte and faithfully hold the trail? Offsetting the buffer by 1 somehow with a 1 byte shift-JIS sequence that shifts everything off?

Nope. Doesn't work. I tried zα and reading either 1 or 2 bytes. If you read 1, you get just the z. If you read 2, it sends back z then 0x83 0xbf and reports 3 bytes read, but only 2 ever fill the client app's buffer and it's just gone. I didn't find what in the driver/client servicing is preventing the overrun... but something is. And it's losing the extra bytes.

  1. If it never worked... do we go with the spirit of what it was supposed to do and hold the trailing bytes correctly so someone could read byte by byte? And if so... what do we indicate on the "number of chars read"?

I believe the answer is yes. conhost.exe should never be returning more bytes than the user buffer said it could hold. The fact that the driver/client is preventing an overrun by truncating isn't something we should rely on.

Also reading the client code, we should be returning the number of chars read as the number of the client application buffer filled. Not more than that. Something's saving the console system from the overrun, but we're being mean and inducing an overrun in the client.

@miniksa
Copy link
Member

miniksa commented Nov 10, 2020

  1. Can anything actually managed to store the trailing byte for pickup later? Raw or direct mode instead of cooked?

Still need to check.

For cooked, raw....(ReadConsoleA) no. The buffer is overfilled/overdeclared in conhostv1 to the driver. It truncates it to the actual size of the user buffer, losing the trailing byte into space.

For direct...(ReadConsoleInputA) no. The buffer is filled with the trailing byte, but then the method early returns without telling the driver that the buffer has >0 bytes of payload to return.... again losing the trailing byte into space.

@miniksa
Copy link
Member

miniksa commented Nov 12, 2020

For direct... (ReadConsoleInputA)... the no is with an exception. It's only no if it's asked for 1 record at a time. If it's anything greater than 1, it will successfully return the trail. That's pretty much the only way to successfully get a trailing byte back, as far as I can tell.

@miniksa
Copy link
Member

miniksa commented Nov 13, 2020

Oh no. Oh no no. ReadConsoleA also has the problem where if you set it to raw read and ask it for more bytes than it has in storage... it just fills the buffer up with wchar_ts and early returns without translating them to the codepage you asked for.

@miniksa
Copy link
Member

miniksa commented Nov 13, 2020

So here it is.

For conhostv1 (and thus as far back as we can tell for compatibility reasons):

  • The only A way to read input correctly is with ReadConsoleInputA and providing a buffer much larger than you ever expected to fill.
  • Otherwise, you must use the W versions of the ReadConsole* APIs to avoid pitfalls.

The pitfalls include:

  • Trailing bytes disappearing to space in one of two different ways.
  • Extra nulls being padded into the result.
  • Receiving UTF-16 data that is untranslated to the codepage you selected.

So my plan is to codify this in a series of tests against v1. Then add a switch in the tests for what the behavior SHOULD be for any app to actually be able to use these APIs (that is... take all those bugs/mistakes out.) And as a part of running the bugs out, I should also be lighting up UTF-8 on those APIs.

This is technically a compatibility break, but given that the APIs were pretty much useless for as far back as we can determine providing garbage data to any app unfortunate enough to attempt to use them... we're going to call that acceptable risk for now.

@miniksa
Copy link
Member

miniksa commented Nov 13, 2020

To future self: It's not safe to go alone, take this:

%TAEF% .\bin\x64\debug\conhost.feature.tests.dll /select:"(@Name='*ReadCharByChar*' or @Name='*ReadChangeCodepageBetween*' or @Name='*ReadChangeCodepageIn*' or @Name='*ReadLeadTrail*')" /p:TestAsV1=true

@lhecker
Copy link
Member

lhecker commented Aug 1, 2023

FYI large parts of this issue will be resolved by #15783. Some parts of conhost are still unaware about UTF-16 and UTF-8 however. But we're getting close.

@zadjii-msft
Copy link
Member

Do we have tracking issues for what parts of this aren't fixed by #15783? It kinda looks like everything linked here will be closed when that merges.

@mominshaikhdevs
Copy link

#15783 has been merged. Is this issue still relevant?

@lhecker
Copy link
Member

lhecker commented Mar 29, 2024

Hmm yeah I think it'd be fair to close this issue. There are still some edge cases but they aren't even mentioned in this issue. Thanks!

@lhecker lhecker closed this as completed Mar 29, 2024
@microsoft-github-policy-service microsoft-github-policy-service bot added the Needs-Tag-Fix Doesn't match tag requirements label Mar 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-CookedRead The cmd.exe COOKED_READ handling Area-Input Related to input processing (key presses, mouse, etc.) Impact-Correctness It be wrong. Issue-Bug It either shouldn't be doing this or needs an investigation. Issue-Scenario Needs-Tag-Fix Doesn't match tag requirements Product-Conhost For issues in the Console codebase Product-Conpty For console issues specifically related to conpty
Projects
None yet
Development

No branches or pull requests

6 participants