osd/strconv.cpp: Support UTF-8 Windows ANSI code page. #12131

Merged
merged 2 commits into from
Apr 15, 2024

invertego
Contributor

  • Defer to MAME's built-in UTF-8 decoding, because MultiByteToWideChar needs the byte length of the encoded character, and once that has been determined it is convenient to do the rest of the conversion in MAME code and sidestep the complication of UTF-16 (surrogate pairs).
  • Remove the FIXME about other variable-length encodings such as GB18030. With the sole exception of UTF-8, only single-byte and double-byte encodings are supported as the active ANSI code page. Other code pages are only usable in conversion functions and are not relevant here. Source: [MS-UCODEREF]: Supported Codepage in Windows | Microsoft Learn.
  • Remove the FIXME about surrogate pairs because these only arise in the context of Unicode inputs not representable as a single UTF-16 code unit, and as mentioned above UTF-8 (the only selectable Unicode ANSI code page) is now handled directly in MAME code.
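
To illustrate the first and third points, here is a minimal standalone sketch of decoding one UTF-8 character directly to a code point. It is not the code from this PR and not MAME's actual decoder: `decode_utf8_char` and `utf16_units_needed` are hypothetical names made up for this example. It shows that the sequence length falls out of the lead byte, and that once the length is known the remaining decoding is a few shifts and masks, with no UTF-16 step where surrogate pairs would come into play.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not MAME's API): decodes one UTF-8 sequence from the
// start of `in`. Returns the number of bytes consumed, or 0 if the input is
// empty, truncated, or malformed. `out` is only written on success.
std::size_t decode_utf8_char(std::string_view in, char32_t &out)
{
    if (in.empty())
        return 0;

    const unsigned char lead = static_cast<unsigned char>(in[0]);
    std::size_t len;
    char32_t cp;
    char32_t min_cp; // smallest code point that legitimately needs `len` bytes

    // The lead byte alone determines the length of the sequence.
    if (lead < 0x80)                { len = 1; cp = lead;        min_cp = 0x0; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min_cp = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min_cp = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min_cp = 0x10000; }
    else
        return 0; // stray continuation byte or invalid lead byte

    if (in.size() < len)
        return 0; // truncated sequence

    // Each continuation byte contributes six payload bits.
    for (std::size_t i = 1; i < len; i++)
    {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if ((c & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (c & 0x3F);
    }

    // Reject overlong forms, UTF-16 surrogate values, and out-of-range values.
    if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    out = cp;
    return len;
}

// Number of UTF-16 code units a code point would need if it were converted
// to UTF-16; values above U+FFFF are the ones that require a surrogate pair.
std::size_t utf16_units_needed(char32_t cp)
{
    assert(cp <= 0x10FFFF);
    return (cp >= 0x10000) ? 2 : 1;
}
```

A caller would pass the remaining input and advance by the returned length. A four-byte sequence such as F0 9F 98 80 decodes to U+1F600, which would need two UTF-16 code units if it went through a wide-character API but is a single value here.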

@cuavas
Member

cuavas commented Mar 12, 2024

Character-at-a-time conversion will still behave badly with nominal single-byte encodings that use combining characters like Windows-1258 because newer versions of Windows attempt to get clever when converting text runs, e.g. F5 D2 will be converted to a single UTF-16 code unit U+1EDF, but if you naïvely attempt to convert one character at a time, you’ll get the sequence U+01A1 U+0309. The function is fundamentally unsound in concept.
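
The concern can be modelled with a toy sketch. This is not how Windows implements the conversion; it is an assumption-laden illustration in which the decoding table covers only the two bytes from the example and the composition step knows only the one pair mentioned. All function names are hypothetical. It shows why a converter that composes across a run and a caller that feeds it one character at a time can disagree on the output.

```cpp
#include <string>
#include <string_view>

// Toy Windows-1258 table: only the two bytes from the example are mapped.
// ASCII passes through; everything else becomes U+FFFD in this sketch.
char32_t decode_cp1258_byte(unsigned char b)
{
    switch (b)
    {
    case 0xF5: return 0x01A1; // LATIN SMALL LETTER O WITH HORN
    case 0xD2: return 0x0309; // COMBINING HOOK ABOVE
    default:   return (b < 0x80) ? char32_t(b) : char32_t(0xFFFD);
    }
}

// Toy composition step: knows only the single pair from the example.
bool compose(char32_t base, char32_t mark, char32_t &out)
{
    if (base == 0x01A1 && mark == 0x0309)
    {
        out = 0x1EDF; // precomposed o with horn and hook above
        return true;
    }
    return false;
}

// Character-at-a-time conversion: each byte is converted in isolation, so a
// base letter and its combining mark can never be merged.
std::u32string convert_per_char(std::string_view in)
{
    std::u32string result;
    for (char c : in)
        result.push_back(decode_cp1258_byte(static_cast<unsigned char>(c)));
    return result;
}

// Run conversion: the converter sees the neighbouring characters and may
// replace a base-plus-mark pair with the precomposed form.
std::u32string convert_run(std::string_view in)
{
    std::u32string result;
    for (char c : in)
    {
        const char32_t cp = decode_cp1258_byte(static_cast<unsigned char>(c));
        char32_t composed;
        if (!result.empty() && compose(result.back(), cp, composed))
            result.back() = composed;
        else
            result.push_back(cp);
    }
    return result;
}
```

In this model, `convert_per_char("\xF5\xD2")` yields U+01A1 U+0309 while `convert_run("\xF5\xD2")` yields the single code point U+1EDF. Whether a given Windows version actually composes in this way is exactly what is in question in this thread.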

@invertego
Contributor Author

I can't reproduce that behavior with your example on Win11 23H2. If I process both bytes together in a single call, I still get two wchars out.

WCHAR wide[16];
int len = MultiByteToWideChar(1258, 0, "\xF5\xD2", 2, wide, 16);
for (int i = 0; i < len; i++)
	printf("%04X ", wide[i]);

Prints 01A1 0309

That said, this PR only affects systems with a UTF-8 active code page, and it avoids the vagaries of the Win32 API on those systems.

@cuavas cuavas merged commit 657bd51 into mamedev:master Apr 15, 2024
0 of 5 checks passed
@invertego invertego deleted the utf8-acp branch April 16, 2024 02:18