Convert 4 byte utf8 to two surrogate utf16 and back not implemented #851

rovasiras · 2022-09-16T16:58:17Z

The UTF-8 to UTF-16 converter functions cannot convert a 4-byte UTF-8 sequence to a UTF-16 surrogate pair, or back. This could be resolved by converting in two steps: 4-byte UTF-8 -> UTF-32 -> two UTF-16 surrogates. See the Unicode FAQ on "UTF-8, UTF-16, UTF-32".
The reverse conversion can be implemented with the same logic. The first byte of a 4-byte UTF-8 sequence is in the range 0xF0 to 0xF7 (of these, only 0xF0 to 0xF4 encode valid code points).
This is needed for languages and scripts whose code points lie above 0xFFFF, for example the Old Hungarian script.
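
The two-step conversion described above can be sketched roughly as follows. This is only an illustration of the arithmetic, assuming a single well-formed 4-byte sequence. The helper names (`decode_utf8_4`, `to_surrogates`, `from_surrogates`, `encode_utf8_4`) are hypothetical and are not the existing functions in hunspell/csutil.cxx.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helpers, not part of Hunspell.

// Decode the 4-byte UTF-8 sequence starting at s[i] into a code point.
// Returns false if the bytes are not a well-formed 4-byte sequence.
bool decode_utf8_4(const std::string& s, std::size_t i, char32_t& cp) {
  if (i + 4 > s.size()) return false;
  const unsigned char b0 = static_cast<unsigned char>(s[i]);
  const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
  const unsigned char b2 = static_cast<unsigned char>(s[i + 2]);
  const unsigned char b3 = static_cast<unsigned char>(s[i + 3]);
  if ((b0 & 0xF8) != 0xF0) return false;  // lead byte must be 11110xxx
  if ((b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80 || (b3 & 0xC0) != 0x80)
    return false;                         // trail bytes must be 10xxxxxx
  cp = (static_cast<char32_t>(b0 & 0x07) << 18) |
       (static_cast<char32_t>(b1 & 0x3F) << 12) |
       (static_cast<char32_t>(b2 & 0x3F) << 6) |
       static_cast<char32_t>(b3 & 0x3F);
  // Reject overlong encodings and values beyond the Unicode range.
  return cp >= 0x10000 && cp <= 0x10FFFF;
}

// Split a supplementary-plane code point into a high and a low surrogate.
std::pair<char16_t, char16_t> to_surrogates(char32_t cp) {
  assert(cp >= 0x10000 && cp <= 0x10FFFF);
  const char32_t v = cp - 0x10000;  // 20 significant bits remain
  return {static_cast<char16_t>(0xD800 + (v >> 10)),
          static_cast<char16_t>(0xDC00 + (v & 0x3FF))};
}

// Combine a high and a low surrogate back into a code point.
char32_t from_surrogates(char16_t hi, char16_t lo) {
  assert(hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF);
  return 0x10000 + ((static_cast<char32_t>(hi) - 0xD800) << 10) +
         (static_cast<char32_t>(lo) - 0xDC00);
}

// Encode a supplementary-plane code point as 4 UTF-8 bytes.
std::string encode_utf8_4(char32_t cp) {
  assert(cp >= 0x10000 && cp <= 0x10FFFF);
  std::string out;
  out += static_cast<char>(0xF0 | (cp >> 18));
  out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
  out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
  out += static_cast<char>(0x80 | (cp & 0x3F));
  return out;
}
```

For example, U+10C80 (the first letter of the Old Hungarian block) is the UTF-8 bytes F0 90 B2 80 and the UTF-16 surrogate pair D803 DC80.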

rovasiras · 2022-09-16T18:15:12Z

@caolanm what do you think?

cuellius · 2022-10-04T11:39:35Z

Theoretically, C++11 provides such transformations in the <codecvt> header, and WinAPI in the functions MultiByteToWideChar/WideCharToMultiByte. But <codecvt> is deprecated in C++17, and WinAPI is available only on Windows and requires two iterations over the string.

rovasiras · 2022-10-04T19:04:26Z

> Theoretically, C++11 provides such transformations in the <codecvt> header, and WinAPI in the functions MultiByteToWideChar/WideCharToMultiByte. But <codecvt> is deprecated in C++17, and WinAPI is available only on Windows and requires two iterations over the string.

@cuellius I know that well. I was writing about the u16_u8 and u8_u16 functions in the hunspell/csutil.cxx file!

rovasiras · 2022-10-05T12:20:52Z

@caolanm In the hunspell/csutil.cxx source code, comments in the u8_u16 and u16_u8 functions say that conversion of 4-byte UTF-8 sequences is not implemented yet.

I think the algorithm in #851 (comment) could be useful.
See the Unicode FAQ on "UTF-8, UTF-16, UTF-32".

rovasiras · 2022-10-05T15:05:52Z

@caolanm I'm working on it.

rovasiras · 2022-10-19T15:47:34Z

@caolanm Unicode 15.0 includes an Arabic extension block that lies above 0xFFFF.
Although I promised to work on this problem, I cannot. I have a lot of other work.
