
Allow Unicode code points as chars #2024

Closed · rebolbot opened this issue Apr 10, 2013 · 8 comments

Comments


rebolbot commented Apr 10, 2013

Submitted by: Ladislav

Currently, only BMP code points are supported by the CHAR! datatype, while:

"In a move of historic significance for software supporting Unicode, the PRC decided to mandate support of certain code points outside the BMP. This means that software can no longer get away with treating characters as 16 bit fixed width entities (UCS-2)."

CC - Data [ Version: r3 master; Type: Wish; Platform: All; Category: Datatype; Reproduce: Always; Fixed-in: none ]


rebolbot commented Apr 10, 2013

Submitted by: BrianH

#683 is related, though I would prefer to support such characters rather than just trigger an error. Which method would you prefer for encoding them?

One model is to start with UCS1 (aka Latin-1), then upgrade to UCS2 when characters not in UCS1 are added, then to UCS4 when characters not in UCS2 are added. Strings wouldn't downgrade to a lower UCS automatically; we would need a function to do that on demand. This model has O(1) access for indexed operations and LENGTH?. It is what Python 3 and Red have done.
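To make the trade-off concrete, here is a minimal C sketch of the widening model; the `Str` layout and the function names are hypothetical illustrations, not actual R3 (or Python/Red) source, and allocation checks are omitted:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    void   *data;   /* element buffer: 1-, 2-, or 4-byte elements */
    size_t  len;    /* length in characters, not bytes            */
    int     width;  /* 1 = UCS-1, 2 = UCS-2, 4 = UCS-4            */
} Str;

/* O(1) indexed read regardless of content: the key property of this model */
static uint32_t str_at(const Str *s, size_t i) {
    switch (s->width) {
    case 1:  return ((const uint8_t  *)s->data)[i];
    case 2:  return ((const uint16_t *)s->data)[i];
    default: return ((const uint32_t *)s->data)[i];
    }
}

/* appending picks the width the new codepoint needs */
static int width_for(uint32_t cp) {
    return cp <= 0xFF ? 1 : cp <= 0xFFFF ? 2 : 4;
}

/* upgrade in place; downgrading would be a separate on-demand function */
static void str_widen(Str *s, int width) {
    if (width <= s->width) return;
    void *wide = malloc(s->len * width);
    for (size_t i = 0; i < s->len; i++) {
        uint32_t cp = str_at(s, i);              /* still reads old width */
        if (width == 2) ((uint16_t *)wide)[i] = (uint16_t)cp;
        else            ((uint32_t *)wide)[i] = cp;
    }
    free(s->data);
    s->data  = wide;
    s->width = width;
}
```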

Another model would be to use UTF-8 or UTF-16 internally, depending on what the platform supports: Windows would be UTF-16, Linux would be UTF-8. This would have lower memory usage when higher codepoints are used, but indexed operations and LENGTH? would be O(n). Many other programming languages have done this: .NET and Java use UTF-16, and Go uses UTF-8. Note that strings would be series of codepoints, so all UTF encoding and decoding would need to be done internally, and no partial encodings (surrogate pairs or individual UTF-8 bytes) would be exposed.
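For contrast, a small illustrative sketch (hypothetical names, assuming well-formed UTF-8 input) of why indexed access becomes O(n) under an internal UTF-8 representation: reaching the i-th codepoint means decoding from the head every time.

```c
#include <stdint.h>
#include <stddef.h>

/* decode one codepoint and return a pointer just past it */
static const uint8_t *utf8_next(const uint8_t *p, uint32_t *cp) {
    uint8_t b = *p++;
    if (b < 0x80) {
        *cp = b;
    } else if ((b & 0xE0) == 0xC0) {          /* 2-byte sequence */
        *cp = ((b & 0x1F) << 6) | (p[0] & 0x3F);
        p += 1;
    } else if ((b & 0xF0) == 0xE0) {          /* 3-byte sequence */
        *cp = ((b & 0x0F) << 12) | ((p[0] & 0x3F) << 6) | (p[1] & 0x3F);
        p += 2;
    } else {                                  /* 4-byte sequence */
        *cp = ((uint32_t)(b & 0x07) << 18) | ((p[0] & 0x3F) << 12)
            | ((p[1] & 0x3F) << 6)          | (p[2] & 0x3F);
        p += 3;
    }
    return p;
}

/* PICK-style access: walk i codepoints from the head each call */
static uint32_t utf8_at(const uint8_t *s, size_t i) {
    uint32_t cp = 0;
    do { s = utf8_next(s, &cp); } while (i--);
    return cp;
}
```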


rebolbot commented Apr 11, 2013

Submitted by: BrianH

"read somewhere that Carl implemented strings that way?"

That was the plan, but he hasn't actually implemented it yet.

Much of the string/binary series code can handle 1-byte and 2-byte element sizes in the macros, so we'd need to add support for 4-byte elements. The autoconversion code isn't written yet. Most code that refers to individual characters uses the REBUNI type, which is 16-bit, even though Unicode codepoints should be 32-bit when held in individual variables rather than arrays. There is no 32-bit Unicode character type like REBUNI defined, so code that works on full codepoints tends to repurpose other defined types, and almost always mixes signed and unsigned types, especially using pointers to one to refer to the other. Finally, I have asked whether we should use signed or unsigned 32-bit values to store codepoints, but no one I've asked has been able to answer that question yet.
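As a hedged sketch, the missing declarations might look like this in C, using the names suggested in the checklist below (these are proposals for illustration, not committed API):

```c
#include <stdint.h>

typedef uint8_t  REBUCS1;  /* Latin-1 array element (same range as REBYTE)  */
typedef uint16_t REBUCS2;  /* BMP array element (the old 16-bit REBUNI)     */
typedef uint32_t REBUCS4;  /* full-range array element                      */
typedef uint32_t REBUNI;   /* a single codepoint held in a variable         */
```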

To implement the UCS switching model:

  • We need to determine whether hash calculations of strings require signed or unsigned math, which would determine whether we should use a signed or unsigned 32-bit codepoint type. (Ladislav?)
  • We need to declare a 32-bit codepoint type for people to use (I suggest it be called REBUNI, and also REBUCS4 for array element references), along with REBUCS2 (old REBUNI) and REBUCS1 (same as REBYTE) types.
  • We need a new value for the existing element-width flag to indicate 4-byte element series (how do vectors do it?).
  • We need a third set of macros for 4-byte element series operations, similar to the current 1-byte and 2-byte macros (a rough sketch follows this list).
  • We need autoconversion code that will upgrade the UCS encoding. Downgrading should be on-demand, not automatic.
  • We need to go through all of the code in R3 that operates on strings/binaries, make sure that it is using the right datatypes, and make sure that modifying code calls the autoconversion when appropriate.
  • The code that processes hex escape syntax in string! and char! needs to support 8-digit hex sequences. Not UTF-encoded, hex values of the codepoints as 32-bit integers.
  • The code that processes percent-encoded hex escape syntax in file!, email! and url! needs to support UTF-8 hex sequences (see Ordinal functions in the other direction #1983 and Internal representation of URLs #2014).
  • We need to make sure that all API calls get string data in the format they expect. On Windows this means converting from UCS4 to UTF-16, and either converting from UCS1 to UTF-16 or using the *A functions for UCS1 data. On Linux, this means converting any of our encodings to UTF-8 (yes, even UCS1, unless we want to make the byte-sized elements limited to ASCII instead of Latin-1), or whatever it requires.
  • We need a set of user-level native functions that can change the internal encoding on demand, as long as the characters in the string are within the new range. This lets users make their code's performance more predictable. We need a function that can tell us the current encoding mode too, for informational purposes.
  • We need to go over the tests and make sure that they match the new model.
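As a rough illustration of the third macro set mentioned above, building on the typedefs sketched earlier (the names here are invented for this sketch, not the actual R3 macros):

```c
/* per-width element access */
#define GET_UCS1(s, n)    (((REBUCS1 *)(s))[n])
#define GET_UCS2(s, n)    (((REBUCS2 *)(s))[n])
#define GET_UCS4(s, n)    (((REBUCS4 *)(s))[n])

#define SET_UCS4(s, n, c) (((REBUCS4 *)(s))[n] = (REBUCS4)(c))

/* width-dispatched read, assuming a series knows its element size w */
#define GET_CHAR(s, w, n) \
    ((w) == 1 ? GET_UCS1(s, n) : (w) == 2 ? GET_UCS2(s, n) : GET_UCS4(s, n))
```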

In comparison, to implement the UTF model:

  • Same hash calc question as UCS, and the same 32-bit REBUNI type.
  • Two new sets of macros, one for UTF-8, one for UTF-16 (platform endianness), including macros to advance from one position to the next based on decoding the referenced data (see the surrogate-pair sketch after this list).
  • Go through all string code and change all index math to use the O(n) advancing code instead. Same for length calculations.
  • Make sure that all existing REBYTE code that operates on string data in a way that would be different for UTF-8 is split into different code that operates on UTF-8. There is some sharing of code between binary series and byte-element series of other datatypes.
  • Same treatment of hex escape sequences as the UCS model. That also means using UCS4 hex sequences in strings, not UTF, and using UTF-8 percent encoding for urls even if we are using UTF-16 internally.
  • Platform API calls will already be getting the data they expect, no conversion needed, that's why we're using UTF encodings in the first place :)
  • We wouldn't need any mode-switching functions, since there would be no modes.
  • Still need to go over the tests, with a special emphasis on making sure that the internal encoding is never exposed to the user. No access to UTF data, only to decoded codepoints.
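To illustrate the position-advancing operation mentioned above for the UTF-16 case, here is a hypothetical helper (assuming well-formed input). One codepoint may span two 16-bit units, which is exactly why no half of a surrogate pair could ever be exposed as a character:

```c
#include <stdint.h>

/* decode one codepoint and return a pointer just past it */
static const uint16_t *utf16_next(const uint16_t *p, uint32_t *cp) {
    uint16_t u = *p++;
    if (u >= 0xD800 && u <= 0xDBFF) {    /* high surrogate...        */
        uint16_t lo = *p++;              /* ...must pair with a low  */
        *cp = 0x10000 + (((uint32_t)(u - 0xD800) << 10) | (lo - 0xDC00));
    } else {
        *cp = u;
    }
    return p;
}
```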

Either way, that is a lot of work, but it's doable once we put our open-source many-hands effort to it. Fortunately we already include Unicode Consortium code, so we have some code to adapt that can do almost everything mentioned above.


rebolbot commented Apr 11, 2013

Submitted by: Ladislav

"whether we should be using signed or unsigned 32-bit values to store codepoints"

I slightly prefer unsigned.

@rebolbot

Submitted by: BrianH

The main problem I ran into is that I didn't understand the hash calculation. Does it need unsigned integers or not? Does it need to be adjusted to be able to handle 32-bit values? The Unicode Consortium code doesn't have any hash code, so I can't just go off of its preferences.

Not knowing any better, I also slightly prefer unsigned, because that is the convention for 8- and 16-bit characters. The only downside is that some functions return -1 to indicate an error, with all non-negative values considered non-erroneous, and you can't do that with an unsigned datatype. This could be solved by designating another error-indication value, or by giving those functions a signed return and converting to an unsigned value after screening out the negative numbers that indicate errors.
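A small sketch of that second option in C (the decoder here is a deliberately trivial stand-in that accepts only ASCII, just to show the shape; it is not the real UTF-8 decoder): keep the signed return for error signaling, screen it, then convert.

```c
#include <stdint.h>

/* trivial stand-in decoder: signed return so -1 can mean "bad data" */
static int32_t decode_char(const uint8_t **bp) {
    uint8_t b = *(*bp)++;
    return b < 0x80 ? (int32_t)b : -1;
}

static int next_codepoint(const uint8_t **bp, uint32_t *out) {
    int32_t c = decode_char(bp);
    if (c < 0) return 0;         /* screen the negative error value... */
    *out = (uint32_t)c;          /* ...then convert to unsigned        */
    return 1;
}
```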


rebolbot commented Apr 12, 2013

Submitted by: abolka

Preferences:

  • Fixed-width internal strings with automatic widening (so the "UCS-1" to UCS-2 to UCS-4 model).
  • Slight preference for unsigned 32-bit integers for Unicode characters (but that doesn't matter much).

One remark: "UCS-1" to UCS-2 widening was not only planned, it is already implemented and used (in R3). So I think that suggests continuing with this model and providing full Unicode support internally via widening to UCS-4.

Brian's implementation plan looks very solid to me and nicely summarises many related issues.

What hash calculation are you concerned about? The one used for computing internal hash values?

@rebolbot

Submitted by: BrianH

Yes, that hash calculation.

After looking it up: UCS-4 is defined as covering only the non-negative range of a 32-bit signed integer, and in practice it is always less than that. So it would be OK to have REBUNI be either a signed or unsigned value, though REBUCS2 and REBUCS1 should still be unsigned to cover their respective ranges, so REBUNI should be unsigned too. However, the internal functions like the UTF-8 decoder that return a negative number for bad data or other errors can probably get away with returning a signed 32-bit integer, as long as they don't do so through a pointer (some functions have this problem), and don't use REBCNT as the type in case that type changes with 64-bit builds. Negative characters should never be put in a string, though; they're out of range.

Weird, Wikipedia seems to have been changed since I last looked at its Unicode stuff. It now refers to UCS2 as UCS-2 and UCS4 as UCS-4 (the hyphens used to be specifically disallowed), no longer mentions UCS1 as a synonym for Latin-1, and scarcely mentions the UCS* encodings at all anymore, preferring to talk about the UTF-* stuff.

Given the compatibility between ASCII and the ASCII range of UTF-8, it might be a good idea to just use the byte-element strings for characters in the ASCII range. That would let us use UTF-8 (Linux) or ANSI (Windows) APIs with no conversion necessary in that mode. Similarly, we can pass UCS-2 mode strings to UTF-16 APIs with no conversion needed (on Windows, OSX?, .NET and Java). We would only consistently need to convert the UCS-4/UTF-32 mode strings to call string APIs, since no one uses UTF-32 APIs.
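A sketch of the check this would require (a hypothetical helper): Latin-1 bytes 0x80-0xFF are not valid UTF-8 on their own, so only byte strings confined to the ASCII subset could be passed through unconverted.

```c
#include <stdint.h>
#include <stddef.h>

static int is_ascii_only(const uint8_t *s, size_t len) {
    for (size_t i = 0; i < len; i++)
        if (s[i] > 0x7F) return 0;   /* Latin-1 high byte: must convert */
    return 1;                        /* safe to pass to UTF-8/ANSI APIs */
}
```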

@hostilefork

The plan in Ren-C is more radical than the above but hopefully, when all is said and done, simpler: always keep strings encoded in UTF-8, and convert them only at the edges that require it (e.g. the Windows print and input devices would need to do this):

http://utf8everywhere.org/
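For illustration, the edge conversion on Windows could be a small wrapper over the Win32 MultiByteToWideChar call, along these lines (a sketch only; error handling omitted):

```c
#ifdef _WIN32
#include <windows.h>

/* widen an internal UTF-8 string to UTF-16 just before a wide API call */
static WCHAR *utf8_to_wide(const char *utf8) {
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    WCHAR *w = (WCHAR *)HeapAlloc(GetProcessHeap(), 0, n * sizeof(WCHAR));
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, w, n);
    return w;  /* caller frees with HeapFree */
}
#endif
```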

@hostilefork

This is now working: all Unicode codepoints are legal in strings and characters.
