COOKED_READ doesn't return UTF-8 on *A APIs in CP_UTF8 #4551
Comments
An ordinal in the range [0x000000, 0x00FFFF], i.e. the Basic Multilingual Plane (BMP), uses a single For non-ASCII ordinals, the internal For
@eryksun thanks for the detailed write-up, and @stwish-msft thanks for the report. It looks like we don't actually have a bug tracking Regardless, this is now the one. Eryk, would you mind filing a separate issue for the non-BMP
I'm giving this one the unusual denomination of "bugtask". We have a couple of them -- it's a bug, yes, but it's a fairly big chunk of work and a new feature to boot. 😄
Hi, I'm assuming that by "cooked" you mean that the following are enabled in the console mode on stdin: For me, the problem occurs even with those disabled. Here is my test code:

    #include <stdio.h>
    #include <windows.h>

    int main(void)
    {
        // Set UTF-8 code page for input and output
        if (!SetConsoleOutputCP(65001))
            return 1;
        if (!SetConsoleCP(65001))
            return 2;
        printf("Output code page is %d, input code page is %d\n",
               (int)GetConsoleOutputCP(), (int)GetConsoleCP());
        puts(u8"We can output utf-8: γατάκι");

        HANDLE hStdin = GetStdHandle(STD_INPUT_HANDLE);
        if (hStdin == INVALID_HANDLE_VALUE || hStdin == NULL)
            return 3;

        // Set input mode: turn off echo, line buffering and processed input
        DWORD mode;
        if (!GetConsoleMode(hStdin, &mode))
            return 4;
        mode &= ~(ENABLE_ECHO_INPUT | ENABLE_LINE_INPUT | ENABLE_PROCESSED_INPUT);
        if (!SetConsoleMode(hStdin, mode))
            return 5;
        // Read the mode back to see what the console actually applied
        if (!GetConsoleMode(hStdin, &mode))
            return 6;
        printf("Console mode for stdin is 0x%08x\n", (unsigned)mode);

        puts("Input is now in 'raw' mode. Type something.");
        char b;
        do {
            // Read one byte at a time and print it as hex until 'q' is seen
            DWORD numRead;
            if (!ReadFile(hStdin, &b, 1, &numRead, NULL) || numRead == 0)
                return 7;
            printf("%02x ", (int)b & 0x0ff);
        } while (b != 'q');
        return 0;
    }

If I run this in Windows Terminal running cmd.exe and paste the text "I8Σπ q" when prompted, the output looks like this:
You can see that Σ and π are read as zeros. Conhost running cmd.exe is the same except that the reported console mode is 0x1b0.

Some of the set flags in the console modes 0x1f0 or 0x1b0 seem to be undocumented, or am I reading them wrong? Console mode flags reference: https://docs.microsoft.com/en-us/windows/console/high-level-console-modes

Instead of ReadFile() I've also tried getchar(), scanf_s(), fgetc() - none of them worked either.

I'm new to console programming, so sorry if I've overlooked something that should be obvious. Thanks!

Microsoft Visual Studio Community 2019 Version 16.8.3
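For reference, the two mode values mentioned in this comment can be decoded bit by bit. The sketch below is portable C; the flag values are copied from the Windows SDK console headers as I recall them, so treat the table as an assumption and check it against your SDK. `describe_input_mode` is a hypothetical helper, not a Windows API.

```c
#include <stddef.h>
#include <stdio.h>

/* Console input mode bits. Assumption: these values mirror the Windows
   SDK headers; verify them against the headers you build with. */
struct mode_flag { unsigned value; const char *name; };

static const struct mode_flag input_flags[] = {
    { 0x0001, "ENABLE_PROCESSED_INPUT" },
    { 0x0002, "ENABLE_LINE_INPUT" },
    { 0x0004, "ENABLE_ECHO_INPUT" },
    { 0x0008, "ENABLE_WINDOW_INPUT" },
    { 0x0010, "ENABLE_MOUSE_INPUT" },
    { 0x0020, "ENABLE_INSERT_MODE" },
    { 0x0040, "ENABLE_QUICK_EDIT_MODE" },
    { 0x0080, "ENABLE_EXTENDED_FLAGS" },
    { 0x0100, "ENABLE_AUTO_POSITION" },
    { 0x0200, "ENABLE_VIRTUAL_TERMINAL_INPUT" },
};

/* Writes the names of all known set bits into out, separated by " | ".
   Returns the bits that no entry in the table accounts for. */
unsigned describe_input_mode(unsigned mode, char *out, size_t cap)
{
    size_t used = 0;
    if (cap > 0)
        out[0] = '\0';
    for (size_t i = 0; i < sizeof input_flags / sizeof input_flags[0]; i++) {
        if (!(mode & input_flags[i].value))
            continue;
        mode &= ~input_flags[i].value;
        if (used < cap) {
            int n = snprintf(out + used, cap - used, "%s%s",
                             used ? " | " : "", input_flags[i].name);
            if (n > 0) /* on truncation, stop appending further names */
                used += (size_t)n < cap - used ? (size_t)n : cap - used;
        }
    }
    return mode;
}
```

Under this table, 0x1f0 decodes to mouse input, insert mode, quick edit, extended flags and auto position, and 0x1b0 is the same set without `ENABLE_QUICK_EDIT_MODE`. `ENABLE_AUTO_POSITION` (0x100) is, as far as I can tell, not described on the linked console modes page, which may be the bit that looks undocumented.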
Due to a bug in Windows, ReadFile() and ReadConsoleA() (and thus _read()), return zeros instead of non-ASCII characters when the console codepage is set to 65001. See this ticket for more details: microsoft/terminal#4551 This commit works around that bug by using ReadConsoleW() inside win32_read() when the passed fd points to the console and the console codepage is set to 65001. Fixes Perl#18701
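The workaround described in this commit message can be sketched roughly as follows. This is not the actual Perl patch: `utf16_to_utf8` and `read_console_utf8` are hypothetical names, the conversion is hand-written so that it compiles anywhere, and the Windows-only part is guarded by `_WIN32`. It assumes the console code page is 65001, so the UTF-16 text returned by `ReadConsoleW` is converted to UTF-8.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Converts n UTF-16 code units to UTF-8. Returns the number of bytes
   written, or 0 if the output does not fit. Unpaired surrogates are
   replaced with U+FFFD. */
size_t utf16_to_utf8(const uint16_t *in, size_t n,
                     unsigned char *out, size_t cap)
{
    static const unsigned char lead[] = { 0, 0, 0xC0, 0xE0, 0xF0 };
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            /* Combine a surrogate pair into one non-BMP code point. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[++i] - 0xDC00);
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; /* lone surrogate */
        }
        size_t len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + len > cap)
            return 0;
        if (len == 1) {
            out[o++] = (unsigned char)cp;
        } else {
            /* Lead byte: length marker plus the top bits of the code point. */
            out[o++] = (unsigned char)(lead[len] | (cp >> (6 * (len - 1))));
            /* Continuation bytes: 10xxxxxx, six bits each. */
            for (size_t k = len - 1; k-- > 0; )
                out[o++] = (unsigned char)(0x80 | ((cp >> (6 * k)) & 0x3F));
        }
    }
    return o;
}

#ifdef _WIN32
/* Reads up to 256 UTF-16 code units from a console handle and stores
   them as UTF-8 in buf. Returns the byte count, or -1 on failure. */
int read_console_utf8(HANDLE h, char *buf, size_t cap)
{
    wchar_t wide[256];
    DWORD got = 0;
    if (!ReadConsoleW(h, wide, 256, &got, NULL))
        return -1;
    size_t n = utf16_to_utf8((const uint16_t *)wide, got,
                             (unsigned char *)buf, cap);
    return (got != 0 && n == 0) ? -1 : (int)n;
}
#endif
```

A real implementation also has to handle a surrogate pair split across two reads, and leftover bytes when the caller's buffer is smaller than the converted text; both are left out here to keep the sketch short.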
…h UTF8 codepage Corresponding Windows bug: microsoft/terminal#4551. Use ReadConsoleW instead and convert to the console's input codepage as a workaround. Also, disable VT sequences in the console output: as we do not know what type of data comes with SELECT, we do not want VT escapes there. Remove my_cgets()
The overarching intention of this PR is to improve our Unicode support. Most of our APIs still don't support anything beyond UCS-2 and DBCS sequences. This commit doesn't fix the UTF-16 support (by supporting surrogate pairs), but it does improve support for UTF-8 by allowing longer `char` sequences.

It does so by removing `TranslateUnicodeToOem`, which seems to have had an almost viral effect on code quality wherever it was used. It made the assumption that _all_ narrow glyphs encode to 1 `char` and most wide glyphs to 2 `char`s. It also didn't bother to check whether `WideCharToMultiByte` failed or returned a different number of `char`s. So up until now it was easily possible to read uninitialized stack memory from conhost. Any code that used this function was forced to do the same "measurement" of narrow/wide glyphs, because _of course_ it didn't have any way to indicate to the caller how much memory it needs to store the result. Instead all callers were forced to sorta replicate how it worked to calculate the required storage ahead of time. Unsurprisingly, none of the callers used the same algorithm...

Without it the code is now much leaner and easier to understand. The best example is `COOKED_READ_DATA::_handlePostCharInputLoop`, which used to contain 3 blocks of _almost_ identical code, but with ever so subtle differences. After reading the old code for hours I still don't know if they were relevant or not. It used to be 200 lines of code lacking any documentation and it's now 50 lines with descriptive function names. I hope this doesn't break anything, but to be honest I can't imagine anyone having relied on this mess in the first place.

I needed some helpers to handle byte slices (`std::span<char>`), which is why a new `til/bytes.h` header was added. Initially I wrote a `buf_writer` class but felt like such a wrapper around a slice/span was annoying to use. As such I've opted for freestanding functions which take slices as mutable references and "advance" them (offset the start) whenever they're read from or written to. I'm not particularly happy with the design but they do the job.

Related to #8000
Fixes #4551
Fixes #7589
Fixes #8663

## Validation Steps Performed
* Unit and feature tests ✅
* Far Manager ✅
* Fixes test cases in #4551, #7589 and #8663 ✅
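The "advance the slice" design described in this PR could look roughly like the following. The PR itself is C++ and uses `std::span<char>`; this is a C analog with hypothetical names (`byte_slice`, `slice_write`), not the contents of `til/bytes.h`.

```c
#include <stddef.h>
#include <string.h>

/* A mutable view of the bytes that are still free in a caller's buffer. */
typedef struct {
    char *data;
    size_t len;
} byte_slice;

/* Copies as much of src as fits into *dst and advances *dst past the
   written bytes. Returns the number of bytes copied, so the caller can
   tell a partial write from a complete one. */
size_t slice_write(byte_slice *dst, const char *src, size_t n)
{
    size_t k = n < dst->len ? n : dst->len;
    if (k != 0)
        memcpy(dst->data, src, k);
    dst->data += k; /* offset the start of the slice */
    dst->len -= k;  /* shrink the remaining capacity */
    return k;
}
```

Because the slice itself shrinks on every write, a caller does not need to compute the required storage ahead of time; it can compare the return value with the requested length to detect that the converted text did not fit.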
Environment
Impact
This issue is affecting reading console input via the Universal C Runtime as well - `_read`, `getchar`, `fread`, `scanf`, etc. Using `_cgets_s` only works around this issue because it uses `ReadConsoleW` instead of `ReadFile`. This is also reported against the UCRT on Developer Community here: _read() cannot read UTF-8 but _cgets_s() can.

Steps to reproduce

When using `ReadFile` to read from a console handle, UTF-8 input is not correctly returned. Using `ReadFile` on other types of handles (files, pipes) can read UTF-8 without issue. `SetConsoleCP` and `SetConsoleOutputCP` do not appear to affect this behavior.

Expected behavior

Running `win32_test.exe` and entering '我是中文字符' input on the console should return `e6 88 91 e6 98 af e4 b8 ad e6 96 87 e5 ad 97 e7 ac a6 0d 0a` as this is the UTF-8 representation of that string, plus CR LF.

Actual behavior

Running `win32_test.exe` and entering '我是中文字符' input on the console will return 6 null characters and CR LF, but still returns that the read operation was successful.
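The byte counts in the expected and actual behavior can be cross-checked with a small decoder. This is only an illustrative sketch: `utf8_count` is a hypothetical helper that looks at lead bytes and sequence lengths and does no further validation.

```c
#include <stddef.h>

/* Counts the code points in a UTF-8 buffer by inspecting lead bytes.
   Returns (size_t)-1 on an invalid lead byte or a truncated sequence.
   Continuation bytes are skipped, not validated. */
size_t utf8_count(const unsigned char *s, size_t n)
{
    size_t count = 0, i = 0;
    while (i < n) {
        unsigned char b = s[i];
        size_t len = b < 0x80           ? 1
                   : (b & 0xE0) == 0xC0 ? 2
                   : (b & 0xF0) == 0xE0 ? 3
                   : (b & 0xF8) == 0xF0 ? 4
                   : 0;
        if (len == 0 || i + len > n)
            return (size_t)-1;
        i += len;
        count++;
    }
    return count;
}
```

Applied to the 20 expected bytes, this counts 8 code points: six three-byte characters plus CR and LF. That is consistent with the report of six null characters followed by CR LF, i.e. one byte returned per character instead of the full UTF-8 sequence.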