After some more thought: stringified JSON will often be about 2x shorter as UTF-8 than as UTF-16. Much of the content is ASCII, but there will often be a few non-ASCII characters, which would push the string from encoding 0 to encoding 2. I am therefore leaning towards a MakeCode-like solution, possibly without skip lists for now.
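To make the size argument concrete, here is a rough sketch that compares the two byte counts for a small stringified JSON object. The helper names (utf8ByteLength, utf16ByteLength, isAscii) and the sample object are made up for illustration, and the sketch assumes the UTF-16 alternative stores 2 bytes per code unit with no header overhead.

```typescript
// Hypothetical helper: number of bytes needed to store `s` as UTF-8.
function utf8ByteLength(s: string): number {
  let bytes = 0;
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c < 0x80) {
      bytes += 1; // ASCII
    } else if (c < 0x800) {
      bytes += 2; // e.g. Latin letters with diacritics
    } else if (c >= 0xd800 && c <= 0xdbff && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        bytes += 4; // valid surrogate pair: one code point outside 16 bits
        i++;        // skip the low surrogate
      } else {
        bytes += 3; // unpaired high surrogate, counted as a 3-byte sequence
      }
    } else {
      bytes += 3; // rest of the 16-bit range
    }
  }
  return bytes;
}

// Hypothetical helper: bytes needed at 2 bytes per UTF-16 code unit.
function utf16ByteLength(s: string): number {
  return 2 * s.length;
}

// True if every code unit is in the ASCII range 0-127.
function isAscii(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    if (s.charCodeAt(i) > 0x7f) return false;
  }
  return true;
}

// Mostly-ASCII JSON with two non-ASCII letters (written as escapes).
const sampleJson: string = JSON.stringify({
  name: "Zo\u00EB",
  city: "Krak\u00F3w",
  note: "temperature ok",
});

const asUtf8: number = utf8ByteLength(sampleJson);
const asUtf16: number = utf16ByteLength(sampleJson);
const asciiOnly: boolean = isAscii(sampleJson); // false: two letters break it
```

With only two non-ASCII letters the string no longer qualifies as pure ASCII, yet the UTF-8 size stays within a couple of bytes of the character count, while the UTF-16 size is twice the character count.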
The ECMAScript spec requires String.charCodeAt() to return a 16-bit value. Unicode code points outside the 16-bit range (mostly emoji, but also some historical alphabets and rare Chinese/Japanese/Korean ideograms) are represented as surrogate pairs of two 16-bit code units. String.length returns the number of UTF-16 code units in the string.
If ES were designed today, charCodeAt() would probably return values of up to 21 bits, or the language might use yet another abstraction, since even with full 21-bit Unicode several code points can still combine into a single glyph (the character displayed on the screen).
Here are some string representations:
MakeCode uses 0, 1 and 3 (the surrogate pairs are encoded in UTF-8, so it is ES-compatible). Encoding 0 is limited to ASCII (0-127), so all strings are valid UTF-8.
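The charCodeAt() and length behavior described above can be checked with a short sketch. The helper decodeSurrogatePair and the sample strings are made up for illustration; only the standard String methods are assumed.

```typescript
// "a" followed by U+1F600, written explicitly as its surrogate pair.
const withEmoji: string = "a\uD83D\uDE00";

// length counts UTF-16 code units, not code points: 1 + 2.
const codeUnits: number = withEmoji.length;

// charCodeAt() returns each 16-bit half separately.
const highSurrogate: number = withEmoji.charCodeAt(1); // 0xD83D
const lowSurrogate: number = withEmoji.charCodeAt(2);  // 0xDE00

// Hypothetical helper: rebuild the code point from a surrogate pair.
function decodeSurrogatePair(high: number, low: number): number {
  return ((high - 0xd800) << 10) + (low - 0xdc00) + 0x10000;
}

const codePoint: number = decodeSurrogatePair(highSurrogate, lowSurrogate); // 0x1F600

// Two code points that combine into a single glyph on screen:
// "e" followed by U+0301 (combining acute accent).
const combined: string = "e\u0301";
const combinedUnits: number = combined.length; // 2, although one glyph is shown
```

The last two lines illustrate why even returning full code points would not give one value per displayed character.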