Convert 4 byte utf8 to two surrogate utf16 and back not implemented #851

Open
rovasiras opened this issue Sep 16, 2022 · 6 comments

Comments

@rovasiras

The UTF-8 to UTF-16 converter functions cannot convert 4-byte UTF-8 sequences to a pair of UTF-16 surrogates, or back. This could be resolved with a
4-byte UTF-8 -> UTF-32 -> two UTF-16 surrogates conversion. See the Unicode FAQ on "UTF-8, UTF-16, UTF-32".
The reverse conversion can be implemented with the same logic. The first byte of a 4-byte UTF-8 sequence is in the range 0xF0 to 0xF7.
This is needed for languages/scripts whose code points lie above 0xFFFF, for example the Old Hungarian script.
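
A minimal sketch of the forward direction described above (4-byte UTF-8 -> UTF-32 -> two UTF-16 surrogates). The function name and signature are hypothetical and are not taken from hunspell/csutil.cxx; this only illustrates the arithmetic, assuming the input is held in a std::string and the output in a std::u16string.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Hunspell): decodes one 4-byte UTF-8
// sequence starting at s[pos] and appends the matching UTF-16 surrogate
// pair to out. Returns false if the sequence is truncated or malformed.
bool utf8_4byte_to_surrogates(const std::string& s, std::size_t pos,
                              std::u16string& out) {
    if (pos > s.size() || s.size() - pos < 4) return false;

    const std::uint32_t b0 = static_cast<unsigned char>(s[pos]);
    const std::uint32_t b1 = static_cast<unsigned char>(s[pos + 1]);
    const std::uint32_t b2 = static_cast<unsigned char>(s[pos + 2]);
    const std::uint32_t b3 = static_cast<unsigned char>(s[pos + 3]);

    // Lead byte must match 11110xxx, continuation bytes 10xxxxxx.
    if ((b0 & 0xF8u) != 0xF0u) return false;
    if ((b1 & 0xC0u) != 0x80u || (b2 & 0xC0u) != 0x80u ||
        (b3 & 0xC0u) != 0x80u) {
        return false;
    }

    // Step 1: UTF-8 -> UTF-32 code point (3 + 6 + 6 + 6 payload bits).
    std::uint32_t cp = ((b0 & 0x07u) << 18) | ((b1 & 0x3Fu) << 12) |
                       ((b2 & 0x3Fu) << 6) | (b3 & 0x3Fu);

    // Reject overlong encodings and values beyond the Unicode range.
    if (cp < 0x10000u || cp > 0x10FFFFu) return false;

    // Step 2: UTF-32 -> surrogate pair (10 high bits, 10 low bits).
    cp -= 0x10000u;
    out.push_back(static_cast<char16_t>(0xD800u + (cp >> 10)));
    out.push_back(static_cast<char16_t>(0xDC00u + (cp & 0x3FFu)));
    return true;
}
```

For example, the bytes F0 90 B2 80 (U+10C80, in the Old Hungarian block) would yield the surrogate pair D803 DC80.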

@rovasiras
Author

@caolanm what do you think?

@cuellius
Contributor

cuellius commented Oct 4, 2022

In theory, C++11 provides such conversions in the <codecvt> header, and WinAPI provides them in the functions MultiByteToWideChar/WideCharToMultiByte. But <codecvt> is deprecated in C++17, and WinAPI is available only on Windows and requires two passes over the string.

@rovasiras
Author

> Theoretically, C++11 provides such transformations in the <codecvt> header, and WinAPI in the functions MultiByteToWideChar/WideCharToMultiByte. But <codecvt> is deprecated in C++17 and WinAPI is available only on Windows and require two iterations over the string.

@cuellius I am well aware of that. I was writing about the u16_u8 and u8_u16 functions in the hunspell/csutil.cxx file!

@rovasiras
Author

rovasiras commented Oct 5, 2022

@caolanm In the hunspell/csutil.cxx source code, comments in the u8_u16 and u16_u8 functions say that conversion of 4-byte UTF-8 sequences is not implemented yet.

I think the algorithm in #851 (comment) could be useful.
See the Unicode FAQ on "UTF-8, UTF-16, UTF-32".
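
A rough sketch of the back conversion (two UTF-16 surrogates -> UTF-32 -> 4-byte UTF-8), following the same logic. The helper name is hypothetical and this is not the actual u16_u8 code from csutil.cxx, only an illustration of the bit arithmetic.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Hunspell): combines a high/low UTF-16
// surrogate pair into a code point and appends its 4-byte UTF-8 encoding
// to out. Returns false if the pair is not a valid surrogate pair.
bool surrogates_to_utf8_4byte(char16_t hi, char16_t lo, std::string& out) {
    if (hi < 0xD800 || hi > 0xDBFF) return false;  // not a high surrogate
    if (lo < 0xDC00 || lo > 0xDFFF) return false;  // not a low surrogate

    // Step 1: surrogate pair -> UTF-32 code point in 0x10000..0x10FFFF.
    const std::uint32_t cp =
        0x10000u + ((static_cast<std::uint32_t>(hi) - 0xD800u) << 10) +
        (static_cast<std::uint32_t>(lo) - 0xDC00u);

    // Step 2: UTF-32 -> 4-byte UTF-8 (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx).
    out.push_back(static_cast<char>(0xF0u | (cp >> 18)));
    out.push_back(static_cast<char>(0x80u | ((cp >> 12) & 0x3Fu)));
    out.push_back(static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu)));
    out.push_back(static_cast<char>(0x80u | (cp & 0x3Fu)));
    return true;
}
```

Because a valid surrogate pair always maps to a code point between 0x10000 and 0x10FFFF, the lead byte produced here falls in the range 0xF0 to 0xF4.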

@rovasiras
Author

@caolanm I'm working on it.

@rovasiras
Author

@caolanm Unicode 15.0 includes an Arabic extension block whose code points are above 0xFFFF.
Although I promised to work on this problem, I cannot do it. I have a lot of other work.


2 participants