Encodings #9

josh · 2014-12-09T23:34:40Z

We need to decide on an encoding policy. Right now all the returned strings are untagged.

In Duktape, Ecmascript strings are encoded with CESU-8 encoding. CESU-8 matches UTF-8 except that it allows codepoints in the surrogate pair range (U+D800 to U+DFFF) to be encoded directly; these are prohibited in UTF-8. CESU-8, like UTF-8, encodes all 7-bit ASCII characters as-is which is convenient for C code.

I can't say I fully understand CESU-8, but it sounds like we should always transcode Ruby Strings to UTF-8 before handing them off to Duktape. As for returned strings, they should either be tagged as UTF-8 or transcoded back to the default internal/external encoding (whichever one applies). I'd be fine with always having UTF-8 strings.

/cc @judofyr @brianmario

brianmario · 2014-12-10T00:07:56Z

> we should always be transcoding Ruby Strings to UTF-8 before handing them off to Duktape

Yeah I think that makes sense.

> As for the returned object, they should either be tagged as UTF-8 or transcoded back to default internal/external (whatever one)

The rule of thumb for this in Ruby, as I understand it, is to transcode into default_internal (if it's set) before handing the string back to the caller.
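A minimal sketch of that rule of thumb, for illustration only. The helper name `to_default_internal` is hypothetical and not part of this gem; it only uses `Encoding.default_internal` and `String#encode` from core Ruby.

```ruby
# Hypothetical helper: transcode a string into Encoding.default_internal
# when one is configured, otherwise hand the string back untouched.
def to_default_internal(str)
  internal = Encoding.default_internal
  internal ? str.encode(internal) : str
end

# With no default_internal configured (the usual case), the string is
# returned as-is; with one configured, the result carries that encoding.
result = to_default_internal("caf\u00E9")
```

Note that `String#encode` raises if a character cannot be represented in the target encoding, so a caller following this rule has to decide how to handle that case.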

svaarala · 2014-12-10T13:37:30Z

The situation with Duktape strings seems a bit odd, but there's a reason for that:

Standard Ecmascript strings are BMP-only (U+0000 to U+FFFF). Ecmascript requires support for arbitrary surrogate codepoints which can be used freely without necessarily forming valid surrogate pairs. Duktape encodes all 16-bit codepoints with UTF-8, except it also allows the surrogate codepoints. Technically this is called CESU-8.
Duktape also allows non-BMP string data to be represented and the internal string algorithms support them to some extent. These strings are represented using UTF-8.
Duktape uses extended UTF-8 (codepoints above U+10FFFF) for its internal uses, e.g. to represent regexp bytecode.
Finally, note that Duktape strings can contain arbitrary bytes (not necessarily valid for any UTF-8 variant). The user can easily create such strings e.g. as String(Duktape.dec('hex', 'fffefdfc')). Invalid UTF-8 strings are used to implement "internal properties".

Implications for mapping into Duktape strings:

The mapping from UTF-8 to Duktape can be 1:1, all UTF-8 strings should be handled by Duktape in a reasonable fashion.
But note that strings containing non-BMP characters are not valid Ecmascript, so technically the best conversion would be to convert non-BMP characters to surrogate pairs, and then encode using CESU-8. Ecmascript code would then see surrogate pair codepoints when it examined the string data. (I don't really like this approach, but it'd be closest to how standard Ecmascript is intended to work.)
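To make the second option concrete, here is a rough sketch of splitting non-BMP characters into surrogate pairs and encoding each half as a 3-byte sequence, which is what CESU-8 does. The helper names `utf8_to_cesu8` and `encode_3byte` are made up for this example; nothing here is taken from the gem or from Duktape's API.

```ruby
# Hypothetical helper: encode a 16-bit code unit as a 3-byte UTF-8-style sequence.
def encode_3byte(unit)
  [0xE0 | (unit >> 12), 0x80 | ((unit >> 6) & 0x3F), 0x80 | (unit & 0x3F)].pack("C*")
end

# Hypothetical helper: re-encode a valid Ruby string as CESU-8 bytes.
# BMP characters keep their UTF-8 bytes; non-BMP characters become a
# surrogate pair, each half encoded separately.
def utf8_to_cesu8(str)
  out = String.new(encoding: Encoding::BINARY)
  str.encode(Encoding::UTF_8).each_codepoint do |cp|
    if cp > 0xFFFF
      v = cp - 0x10000
      high = 0xD800 | (v >> 10)    # high (lead) surrogate
      low  = 0xDC00 | (v & 0x3FF)  # low (trail) surrogate
      out << encode_3byte(high) << encode_3byte(low)
    else
      out << [cp].pack("U").b
    end
  end
  out
end

cesu = utf8_to_cesu8("a\u{1F600}")
cesu.bytes.map { |b| format("%02X", b) }.join(" ")
# => "61 ED A0 BD ED B8 80"
```

The result is a binary-tagged string, because Ruby's UTF-8 encoding treats encoded surrogates as invalid.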

Implications for mapping from Duktape strings:

Since non-BMP characters are expressed as surrogate pairs (with CESU-8) you may want to combine the surrogate pairs into valid UTF-8.
Extended UTF-8 codepoints above U+10FFFF are not valid UTF-8, so you may want to replace them with e.g. the Unicode Replacement Character or perhaps escape them into printable data like <U+12345678>.
Finally, since string data can be arbitrary bytes in some cases, you may want to skip invalid extended UTF-8/CESU-8 sequences and replace them with the Unicode Replacement Character (or escape them into printable bytes like <FF>). Another approach is to fail at the first invalid sequence and map the whole string 1:1 to Unicode codepoints U+0000 to U+00FF to maintain the bytes as is.
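One possible shape for the reverse direction, as a sketch under stated assumptions: combine CESU-8 surrogate pairs into 4-byte UTF-8, then replace whatever is still invalid (lone surrogates, extended UTF-8 above U+10FFFF, arbitrary bytes) with the Unicode Replacement Character. The name `cesu8_to_utf8` and the constant `SURROGATE_PAIR` are hypothetical; the replacement step uses Ruby's `String#scrub` (Ruby 2.1+).

```ruby
# Matches a high surrogate (ED A0..AF xx) followed by a low surrogate
# (ED B0..BF xx), both in their 3-byte CESU-8 form.
SURROGATE_PAIR = /\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF]/n

# Hypothetical helper: turn bytes coming back from Duktape into a valid
# UTF-8 Ruby string.
def cesu8_to_utf8(bytes)
  combined = bytes.b.gsub(SURROGATE_PAIR) do |pair|
    b = pair.bytes
    high = ((b[1] & 0x0F) << 6) | (b[2] & 0x3F)  # low 10 bits of the high surrogate
    low  = ((b[4] & 0x0F) << 6) | (b[5] & 0x3F)  # low 10 bits of the low surrogate
    [0x10000 + (high << 10) + low].pack("U").b
  end
  # Anything still invalid as UTF-8 is replaced with U+FFFD.
  combined.force_encoding(Encoding::UTF_8).scrub
end

cesu8_to_utf8("a\xED\xA0\xBD\xED\xB8\x80".b) == "a\u{1F600}"  # => true
```

This takes the "replace" route; the alternative of failing at the first invalid sequence and mapping bytes 1:1 to U+0000..U+00FF would need a different fallback branch.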

Note that I'm not necessarily suggesting doing something as complicated as above, but just wanted to describe the various alternatives :)

josh · 2014-12-11T07:33:27Z

Alright, I've got the basics in.

@svaarala thanks for the explanation!

For mapping into Duktape, we'll always transcode into UTF-8.
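As an illustration of the inbound side (not the code from the commits), transcoding in Ruby comes down to `String#encode`. The helper name `to_duktape_utf8` is hypothetical.

```ruby
# Hypothetical helper: make sure a Ruby string is UTF-8 before it is
# passed to Duktape. Unlike force_encoding, this converts the bytes.
def to_duktape_utf8(str)
  str.encode(Encoding::UTF_8)
end

latin1 = "caf\xE9".dup.force_encoding(Encoding::ISO_8859_1)
utf8 = to_duktape_utf8(latin1)
utf8.encoding  # => #<Encoding:UTF-8>
utf8.bytes     # => [99, 97, 102, 195, 169]
```

A string tagged as binary (ASCII-8BIT) that contains bytes above 0x7F raises `Encoding::UndefinedConversionError` here, since Ruby has no mapping for those bytes.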

For returned strings, I think we should just always tag as UTF-8 regardless of internal encoding. JSON.parse always does this as well. The nice thing is that it gives the extended code point and arbitrary bytestrings cases some leeway. It will be up to the user to call encode(invalid: :replace) to strip those characters or they can just treat it as raw bytes and retag it as binary. We won't validate the encoding so it will never crash in these cases.
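A small sketch of what that leaves to the caller. The `raw` value stands in for bytes returned from Duktape. This example uses `String#scrub` (Ruby 2.1+) for the clean-up step rather than the `encode(invalid: :replace)` call mentioned above; that substitution is mine.

```ruby
raw = "ok\xFF".b                               # stand-in for bytes coming back from Duktape
tagged = raw.force_encoding(Encoding::UTF_8)   # retag only: no validation, bytes unchanged

tagged.valid_encoding?   # => false, but nothing has raised so far

cleaned = tagged.scrub("?")   # option 1: replace invalid bytes
binary  = tagged.b            # option 2: treat the data as raw bytes

cleaned        # => "ok?"
binary.bytes   # => [111, 107, 255]
```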

We'll probably want to do something about the surrogate pairs encoding but maybe in another PR.

Here's how Ruby JSON handles those cases:

https://github.com/ruby/ruby/blob/trunk/ext/json/parser/parser.c#L1304-L1372
https://github.com/ruby/ruby/blob/trunk/ext/json/generator/generator.c#L125-L219

Encodings

Returned Strings should always be UTF8

927aef8

josh added 2 commits December 10, 2014 23:02

Tag returned duk strings as UTF-8

ef7cbed

Transcode to UTF-8 before passing to duk

0dc9d6f

judofyr added a commit that referenced this pull request Dec 11, 2014

Merge pull request #9 from josh/encodings

703ae71

Encodings

judofyr merged commit 703ae71 into judofyr:master Dec 11, 2014

josh deleted the encodings branch December 11, 2014 18:10

josh mentioned this pull request Dec 11, 2014

Surrogate Pair Encodings #11

Closed
