Cope with UTF-8 codepage #1468

perennialmind · 2020-03-03T21:25:25Z

Add UTF-8 codepage 65001, sometimes referred to as "Unicode (UTF-8 without signature)". Recent versions of Windows allow for locale variants with UTF-8 encoding, making `CP_UTF8` a valid return for `GetOEMCP()`. Alternatively, you could drop `cp_hr_list` and instead call [GetCPInfoExA](https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getcpinfoexa).
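For readers unfamiliar with the lookup involved, here is a minimal sketch of a codepage-to-name table with a 65001 entry. The struct, the `cp_names` table, the `cp_to_name` helper and the 437 and 65001 labels are all hypothetical and only illustrate the idea; the project's actual `cp_hr_list` may be laid out differently.

```c
#include <stddef.h>

/* Hypothetical table layout; not the project's actual cp_hr_list. */
struct cp_name {
    unsigned int cp;
    const char *name;
};

static const struct cp_name cp_names[] = {
    { 437,   "United States" },            /* illustrative label */
    { 858,   "Western-European (Euro)" },
    { 65001, "UTF-8" },                    /* CP_UTF8; illustrative label */
};

/* Return the display name for cp, or NULL if the codepage is not listed. */
static const char *cp_to_name(unsigned int cp)
{
    for (size_t i = 0; i < sizeof(cp_names) / sizeof(cp_names[0]); i++) {
        if (cp_names[i].cp == cp)
            return cp_names[i].name;
    }
    return NULL;
}
```

With a table like this, a codepage that is missing from the list yields `NULL`, which is the kind of condition a caller can assert on to find unsupported codepages.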

pbatard · 2020-03-07T18:18:49Z

Good find. Please mention that this triggers an assert, coz that helps me prioritize these matters.

I don't think I want to use GetCPInfoExA because I want to be in control of the displayed name, and I also want to have an idea of what codepages don't work, so that I can add proper support for them (hence the assert).

Besides your patch, I guess I'm going to have to add a special case for UTF-8, because the end result is that 437 rather than 850 (well, 858 since we want the € symbol) is being used on UK/IE localized platforms, which isn't ideal, so I got to think about this some more...

pbatard · 2020-03-11T13:15:35Z

I think I have a proper solution for this issue now, that doesn't rely on falling back to US codepage for all systems that use UTF-8.

But boy is it a pain in the ass to get a Windows system that defaults to UTF-8 to give you an OEM codepage. I'll just leave this for people interested in programmatically finding out the real codepage of a system with default UTF-8 locale:

UINT actual_cp;
GetLocaleInfoA(GetUserDefaultUILanguage(), LOCALE_IDEFAULTCODEPAGE | LOCALE_RETURN_NUMBER,
               (char*)&actual_cp, sizeof(actual_cp));

The above is for OEM. If you want ANSI, you should use LOCALE_IDEFAULTANSICODEPAGE.

With this, one finally gets the expected result:

Will use DOS keyboard 'uk' [UK-English]
Will use codepage 858 [Western-European (Euro)]
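As a rough sketch of how the snippet above might be wrapped into a fallback, the code below only consults the locale's default OEM codepage when `GetOEMCP()` reports UTF-8. The helper names `pick_dos_codepage` and `get_dos_codepage` are hypothetical, as are the non-Windows stand-in definitions, which exist only so the selection logic compiles elsewhere. This is not necessarily how the actual commit does it.

```c
#ifdef _WIN32
#include <windows.h>
#else
/* Stand-ins so the selection logic can be compiled off Windows. */
typedef unsigned int UINT;
#define CP_UTF8 65001
#endif

/*
 * Keep the OEM codepage unless it is UTF-8, in which case prefer the
 * locale's default OEM codepage. A locale_cp of 0 means the lookup
 * failed, so the original value is kept.
 */
static UINT pick_dos_codepage(UINT oem_cp, UINT locale_cp)
{
    if (oem_cp == CP_UTF8 && locale_cp != 0 && locale_cp != CP_UTF8)
        return locale_cp;
    return oem_cp;
}

#ifdef _WIN32
/* Windows-only wrapper: query the system, then apply the rule above. */
static UINT get_dos_codepage(void)
{
    UINT oem_cp = GetOEMCP();
    UINT locale_cp = 0;

    if (oem_cp == CP_UTF8) {
        /* Same query as in the comment above; returns 0 on failure. */
        if (GetLocaleInfoA(GetUserDefaultUILanguage(),
                           LOCALE_IDEFAULTCODEPAGE | LOCALE_RETURN_NUMBER,
                           (char*)&locale_cp, sizeof(locale_cp)) == 0)
            locale_cp = 0;
    }
    return pick_dos_codepage(oem_cp, locale_cp);
}
#endif
```

Keeping the decision in a pure function makes the rule easy to check without a UTF-8 locale at hand: `pick_dos_codepage(65001, 858)` yields 858, while a non-UTF-8 OEM codepage is passed through unchanged.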

I will close this PR once I push the relevant commit. Once again, thanks for reporting this!

* Recent versions of Windows can set the default locale to codepage 65001 (UTF-8).
* This produces an assert due to a missing entry in `cp_hr_list[]`, so fix that.
* However, this fix alone is not enough: a `GetOEMCP()` that returns 65001 means that any system set to UTF-8 falls back to codepage 437 for DOS, which is definitely not what we want. Add an extra call to determine the actual OEM codepage when UTF-8 is detected.
* Closes pbatard#1468

pbatard added this to the 3.10 milestone Mar 7, 2020

pbatard self-assigned this Mar 7, 2020

pbatard closed this in 5681c3b Mar 11, 2020

github-actions bot locked as resolved and limited conversation to collaborators May 26, 2021
