Unicode supplemental planes in Harbour #242

Open
alcz opened this issue Apr 28, 2021 · 1 comment

alcz commented Apr 28, 2021

Issue by @kcarmody; I'm just reposting it from harbour-devel:
https://groups.google.com/g/harbour-devel/c/HWgaMNa7T-Y/m/WydcCeH6AAAJ

Surely Przemek and Viktor and everyone helping them know that:

- Unicode is more than a 16-bit encoding
- its upper limit is 0x10FFFF, not 0xFFFF
- it consists of 17 "planes", each of which contains 2^16 code points
- the first plane, the Basic Multilingual Plane (BMP), has code points 0 to 0xFFFF
- the other 16 planes, the supplementary planes, have code points 0x10000 to 0x10FFFF
- the first supplementary plane, the Supplementary Multilingual Plane (SMP), has code points 0x10000 to 0x1FFFF
- until recently only the BMP was normally used
- the recent addition of emoji like 😊 😲 😔 to the SMP has made it much more popular
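
As a quick illustration of the plane layout described above, here is a minimal Python sketch. The helper name `plane_of` is made up for this example; it relies only on the fact that each plane spans 2^16 code points, so the plane number is the code point shifted right by 16 bits.

```python
MAX_CODE_POINT = 0x10FFFF  # upper limit of Unicode, as stated above


def plane_of(cp: int) -> int:
    """Return the plane number (0..16) of a code point (hypothetical helper)."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # Each plane holds 2**16 code points, so the plane is the bits above bit 15.
    return cp >> 16


for cp in (0x41, 0xF60A, 0x1F60A, 0x10FFFF):
    print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```

Plane 0 is the BMP and plane 1 is the SMP, so U+F60A lands in the BMP while U+1F60A lands in the SMP.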

Why then do we see the following?

UTF8 for U+F60A <private-use-F60A> = e"\xEF\x98\x8A"
UTF8 for U+1F60A SMILING FACE WITH SMILING EYES = e"\xF0\x9F\x98\x8A"
HB_UTF8CHR(0x1F60A) --> e"\xEF\x98\x8A"
HB_NUMTOHEX(HB_UTF8ASC(HB_UTF8CHR(0x1F60A))) --> "F60A"
HB_NUMTOHEX(HB_UTF8ASC(e"\xF0\x9F\x98\x8A")) --> "F60A"

Is it a bug or a feature?
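
The outputs above are consistent with the code point being cut down to its low 16 bits before encoding and after decoding. That is only an inference from the observed values, not a statement about Harbour's internals. A minimal Python sketch of standard UTF-8 encoding, with a made-up helper `utf8_encode`, reproduces both byte sequences:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (supplementary planes)
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


full = 0x1F60A
truncated = full & 0xFFFF                  # assumed 16-bit truncation: 0xF60A

print(utf8_encode(full).hex(" "))          # f0 9f 98 8a
print(utf8_encode(truncated).hex(" "))     # ef 98 8a

# Decoding the correct 4-byte sequence and truncating gives F60A again.
decoded = ord(b"\xF0\x9F\x98\x8A".decode("utf-8"))
print(f"{decoded & 0xFFFF:X}")             # F60A
```

The three-byte result matches what `HB_UTF8CHR(0x1F60A)` returned, and the truncated decode matches the `"F60A"` reported by `HB_UTF8ASC`.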

alcz commented Apr 28, 2021

IMHO it should be classified as a bug, since not all the necessary "features" are implemented.
Anyway, the current state reflects the general needs of Harbour users, and the initial UTF-8 support was actually donated and developed by Przemek. Most interested users seem to be happy with the Basic Multilingual Plane (right now including me).

Similarly, old Windows NT versions used UCS-2 under the Unicode wide API, which later evolved into UTF-16LE. Oracle also used CESU-8 under the UTF8 name. There is nothing new about such misclassifications, and an upgrade path for the responsible code is not ruled out.
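
To make the UCS-2 / UTF-16 / CESU-8 distinction concrete, here is a small Python sketch. The helper names `utf16_surrogates` and `cesu8_encode` are invented for this example and it only handles the cases shown. UTF-16 represents a supplementary code point as a surrogate pair, which plain UCS-2 cannot do, and CESU-8 encodes each of those two surrogates as its own 3-byte sequence instead of using one 4-byte UTF-8 sequence.

```python
def utf16_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need surrogates")
    v = cp - 0x10000                      # 20 significant bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def cesu8_encode(cp: int) -> bytes:
    """Encode a code point as CESU-8 (sketch; lone surrogates not handled)."""
    if cp < 0x10000:
        return chr(cp).encode("utf-8")    # same as UTF-8 inside the BMP
    # Each surrogate is written as a separate 3-byte sequence.
    return b"".join(chr(unit).encode("utf-8", "surrogatepass")
                    for unit in utf16_surrogates(cp))


hi, lo = utf16_surrogates(0x1F60A)
print(f"{hi:04X} {lo:04X}")                        # D83D DE0A
print(cesu8_encode(0x1F60A).hex(" "))              # ed a0 bd ed b8 8a
print(chr(0x1F60A).encode("utf-8").hex(" "))       # f0 9f 98 8a
```

A conforming UTF-8 decoder rejects the six-byte CESU-8 form, which is why labelling it "UTF8" counts as a misclassification.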

It would be nice to have some Unicode tests like yours in hbtest too, placed in a new utils/hbtest/rt_ustr.prg or something similar.
