should `charCodeAt()` return full Unicode or UTF-16 like in ES #34

mmoskal · 2022-12-12T10:17:31Z

mmoskal
Dec 12, 2022
Maintainer

The ECMAScript spec requires String.charCodeAt() to return a 16-bit value. Unicode code points outside the 16-bit range (mostly emoji, but also some historical alphabets and rare Chinese/Japanese/Korean ideographs) are represented as surrogate pairs of two 16-bit code units. String.length returns the number of UTF-16 code units in the string.
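To make the surrogate-pair arithmetic concrete, here is a small sketch that uses only standard ES string methods. The helper name `toSurrogatePair` is made up for illustration.

```typescript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
// `toSurrogatePair` is a hypothetical helper, not part of any library.
function toSurrogatePair(codePoint: number): [number, number] {
    const v = codePoint - 0x10000       // remaining 20-bit value
    const high = 0xd800 + (v >> 10)     // top 10 bits
    const low = 0xdc00 + (v & 0x3ff)    // bottom 10 bits
    return [high, low]
}

const s = String.fromCodePoint(0x1f600) // one code point outside the 16-bit range
console.log(s.length)                        // 2
console.log(s.charCodeAt(0).toString(16))    // d83d
console.log(s.charCodeAt(1).toString(16))    // de00
console.log(s.codePointAt(0)!.toString(16))  // 1f600
```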

If ES were designed today, charCodeAt() would probably return values of up to 21 bits, or the language might use yet another abstraction, since even with full 21-bit Unicode several code points can still combine into a single glyph (a character displayed on the screen).
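A quick illustration of the last point, using only standard string methods: the same glyph can be one code point or two, so even code-point indexing does not line up with what is drawn on screen.

```typescript
const composed = "\u00e9"     // "é" as a single code point
const decomposed = "e\u0301"  // "e" followed by a combining acute accent

console.log(composed.length, decomposed.length)        // 1 2
console.log(composed === decomposed)                   // false
console.log(composed === decomposed.normalize("NFC"))  // true
```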

Here are some possible string representations, numbered from 0:

0. use 1 byte per character; this only works for Latin-1 text (very common in programs regardless of the language of the actual strings, because property names etc. are encoded as strings)
1. use UTF-8 encoding with skip lists for faster indexing
2. use UTF-16 encoding
3. a string that is the concatenation of two other strings; these are kept as trees and only serialized when indexed, to avoid the quadratic complexity of building strings by repeated concatenation

MakeCode uses 0, 1 and 3 (surrogate pairs are encoded in UTF-8, so it is ES-compatible). Its encoding 0 is limited to ASCII (0-127), so all strings are valid UTF-8.
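Here is a rough sketch of how the UTF-8 with skip lists representation could keep ES-compatible indexing. It assumes each UTF-16 code unit, including each half of a surrogate pair, is stored as its own 1-3 byte UTF-8 sequence, which is one reading of the parenthetical above. The class name, helper names and the stride are invented for illustration and are not taken from MakeCode.

```typescript
// Sketch only: names and stride are assumptions, not MakeCode's actual layout.
const STRIDE = 4 // one index entry per STRIDE code units (real code would use a larger stride)

// Encode every UTF-16 code unit as its own 1-3 byte UTF-8 sequence.
function encodeUnits(s: string): Uint8Array {
    const out: number[] = []
    for (let i = 0; i < s.length; i++) {
        const u = s.charCodeAt(i)
        if (u < 0x80) out.push(u)
        else if (u < 0x800) out.push(0xc0 | (u >> 6), 0x80 | (u & 0x3f))
        else out.push(0xe0 | (u >> 12), 0x80 | ((u >> 6) & 0x3f), 0x80 | (u & 0x3f))
    }
    return Uint8Array.from(out)
}

// Byte length of the sequence that starts with the given lead byte.
function seqLen(lead: number): number {
    return lead < 0x80 ? 1 : lead < 0xe0 ? 2 : 3
}

class IndexedUtf8 {
    readonly bytes: Uint8Array
    readonly skips: number[] = [] // skips[k] = byte offset of code unit k * STRIDE
    readonly length: number       // number of UTF-16 code units

    constructor(s: string) {
        this.bytes = encodeUnits(s)
        let n = 0
        for (let off = 0; off < this.bytes.length; off += seqLen(this.bytes[off])) {
            if (n % STRIDE === 0) this.skips.push(off)
            n++
        }
        this.length = n
    }

    charCodeAt(index: number): number {
        if (index < 0 || index >= this.length) return NaN
        // Jump to the nearest recorded offset, then scan at most STRIDE - 1 sequences.
        let off = this.skips[Math.floor(index / STRIDE)]
        for (let k = index % STRIDE; k > 0; k--) off += seqLen(this.bytes[off])
        const b = this.bytes
        const len = seqLen(b[off])
        if (len === 1) return b[off]
        if (len === 2) return ((b[off] & 0x1f) << 6) | (b[off + 1] & 0x3f)
        return ((b[off] & 0x0f) << 12) | ((b[off + 1] & 0x3f) << 6) | (b[off + 2] & 0x3f)
    }
}

const sample = "na\u00efve \u20ac5 " + String.fromCodePoint(0x1f600)
const indexed = new IndexedUtf8(sample)
console.log(indexed.length)                      // 11
console.log(indexed.charCodeAt(2).toString(16))  // ef
```

Lookup cost is bounded by the stride rather than by the string length, at the price of one extra offset per STRIDE code units.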

mmoskal · 2022-12-12T14:27:27Z

mmoskal
Dec 12, 2022
Maintainer Author

After some more thought: stringified JSON will often be about 2x shorter as UTF-8 than as UTF-16. Much of the content is ASCII, but there will often be a few non-ASCII characters, which would push a string from encoding 0 to 2. I'm thus leaning towards a MakeCode-like solution, possibly without skip lists for now.
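To put rough numbers on that, the sketch below counts the bytes a mostly-ASCII JSON string would need in each encoding. The sample object and the helper names are invented for illustration.

```typescript
// Number of bytes the string would occupy as UTF-8, computed from its code points.
function utf8Length(s: string): number {
    let n = 0
    for (let i = 0; i < s.length; i++) {
        const cp = s.codePointAt(i)!
        if (cp > 0xffff) i++ // skip the low surrogate of a pair
        n += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4
    }
    return n
}

// Number of bytes the string occupies as UTF-16 (2 bytes per code unit).
function utf16Length(s: string): number {
    return s.length * 2
}

// Mostly ASCII, with two non-ASCII characters that rule out a 1-byte ASCII encoding.
const json = JSON.stringify({ name: "Zo\u00eb", temperature: 21.5, unit: "\u00b0C" })
console.log(utf8Length(json), utf16Length(json)) // 47 90
```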

1 reply

mmoskal Dec 12, 2022
Maintainer Author

The 21-bit return value is probably more intuitive and easier to implement - we'll go with that.
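For comparison, this is roughly what code-point indexing looks like when emulated on top of standard ES strings. `codePoints` is a hypothetical helper written for this comparison, not a runtime API.

```typescript
// Hypothetical helper: view a string as an array of full Unicode code points.
function codePoints(s: string): number[] {
    const out: number[] = []
    for (let i = 0; i < s.length; i++) {
        const cp = s.codePointAt(i)!
        if (cp > 0xffff) i++ // skip the low surrogate of a pair
        out.push(cp)
    }
    return out
}

const text = "a" + String.fromCodePoint(0x1f600) + "b"
const cps = codePoints(text)
console.log(text.length, cps.length)             // 4 3
console.log(text.charCodeAt(1).toString(16))     // d83d (ES: high surrogate)
console.log(cps[1].toString(16))                 // 1f600 (full code point)
```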
