Encodings #9

Merged
merged 3 commits into from
Dec 11, 2014

@josh commented Dec 9, 2014

We need to decide on an encoding policy. Right now all the returned strings are untagged.

In Duktape, Ecmascript strings are encoded with CESU-8 encoding. CESU-8 matches UTF-8 except that it allows codepoints in the surrogate pair range (U+D800 to U+DFFF) to be encoded directly; these are prohibited in UTF-8. CESU-8, like UTF-8, encodes all 7-bit ASCII characters as-is which is convenient for C code.

I can't say I fully understand CESU-8, but it sounds like we should always be transcoding Ruby Strings to UTF-8 before handing them off to Duktape. As for the returned object, they should either be tagged as UTF-8 or transcoded back to default internal/external (whatever one). I'd be fine with always having UTF-8 strings.

/cc @judofyr @brianmario

@brianmario

we should always be transcoding Ruby Strings to UTF-8 before handing them off to Duktape

Yeah I think that makes sense.

As for the returned object, they should either be tagged as UTF-8 or transcoded back to default internal/external (whatever one)

The rule of thumb for this in Ruby, as I understand it, is to transcode into default_internal (if it's set) before handing the string back to the caller.
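
A rough sketch of that round trip, in case it helps the discussion. The helper names `to_duktape_input` and `from_duktape_output` are made up for illustration and are not part of this gem; the only real APIs used are `String#encode`, `String#force_encoding`, and `Encoding.default_internal`.

```ruby
# Hypothetical helpers (names are assumptions, not the gem's API).

# Transcode any Ruby string to UTF-8 before handing it to Duktape.
def to_duktape_input(str)
  str.encode(Encoding::UTF_8)
end

# Tag the bytes coming back from Duktape as UTF-8, then transcode to
# Encoding.default_internal when it is set. Note that encode raises if
# the bytes are not valid UTF-8 and a different target encoding is set.
def from_duktape_output(bytes)
  str = bytes.dup.force_encoding(Encoding::UTF_8)
  internal = Encoding.default_internal
  internal ? str.encode(internal) : str
end

# "café" stored as ISO-8859-1 (a single 0xE9 byte for the last letter).
latin1 = [0x63, 0x61, 0x66, 0xE9].pack("C*").force_encoding(Encoding::ISO_8859_1)
utf8   = to_duktape_input(latin1)
utf8.bytes  # => [99, 97, 102, 195, 169]
```

With `default_internal` unset (the usual case), the second helper simply returns a UTF-8 tagged copy.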

@svaarala

The situation with Duktape strings seems a bit odd, but there's a reason for that:

  • Standard Ecmascript strings are BMP-only (U+0000 to U+FFFF). Ecmascript requires support for arbitrary surrogate codepoints which can be used freely without necessarily forming valid surrogate pairs. Duktape encodes all 16-bit codepoints with UTF-8, except it also allows the surrogate codepoints. Technically this is called CESU-8.
  • Duktape also allows non-BMP string data to be represented and the internal string algorithms support them to some extent. These strings are represented using UTF-8.
  • Duktape uses extended UTF-8 (codepoints above U+10FFFF) for its internal uses, e.g. to represent regexp bytecode.
  • Finally, note that Duktape strings can contain arbitrary bytes (not necessarily valid for any UTF-8 variant). The user can easily create such strings e.g. as String(Duktape.dec('hex', 'fffefdfc')). Invalid UTF-8 strings are used to implement "internal properties".
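
To make the four cases above concrete from the Ruby side, here is a small sketch that tags sample byte sequences as UTF-8 and asks Ruby whether it considers them valid. The helper `utf8_tagged` is a made-up name for illustration; the byte sequences are hand-written examples, not output captured from Duktape.

```ruby
# Hypothetical helper: build a string from raw bytes and tag it as UTF-8.
def utf8_tagged(bytes)
  bytes.pack("C*").force_encoding(Encoding::UTF_8)
end

astral    = utf8_tagged([0xF0, 0x9F, 0x98, 0x80])  # U+1F600 as plain UTF-8
lone_high = utf8_tagged([0xED, 0xA0, 0x80])        # U+D800 encoded CESU-8 style
extended  = utf8_tagged([0xF4, 0x90, 0x80, 0x80])  # U+110000, above the Unicode range
raw_bytes = utf8_tagged([0xFF, 0xFE, 0xFD, 0xFC])  # arbitrary bytes

astral.valid_encoding?     # => true
lone_high.valid_encoding?  # => false
extended.valid_encoding?   # => false
raw_bytes.valid_encoding?  # => false
```

So only the plain UTF-8 case survives Ruby's validation; the other three are the ones a binding has to make a decision about.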

Implications for mapping into Duktape strings:

  • The mapping from UTF-8 to Duktape can be 1:1, all UTF-8 strings should be handled by Duktape in a reasonable fashion.
  • But note that strings containing non-BMP characters are not valid Ecmascript, so technically the best conversion would be to convert non-BMP characters to surrogate pairs, and then encode using CESU-8. Ecmascript code would then see surrogate pair codepoints when it examined the string data. (I don't really like this approach, but it'd be closest to how standard Ecmascript is intended to work.)
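
For reference, the second option could look roughly like this in Ruby. This is only a sketch: `utf8_to_cesu8` and `encode_3byte` are hypothetical names, and nothing here is taken from the gem or from Duktape's own code.

```ruby
# Hypothetical helper: encode one 16-bit code unit with the 3-byte UTF-8 pattern.
def encode_3byte(unit)
  [0xE0 | (unit >> 12), 0x80 | ((unit >> 6) & 0x3F), 0x80 | (unit & 0x3F)].pack("C*")
end

# Hypothetical helper: re-encode a string as CESU-8 bytes. BMP characters keep
# their normal UTF-8 bytes; non-BMP characters become a surrogate pair, with
# each half written as a 3-byte sequence.
def utf8_to_cesu8(str)
  out = String.new(encoding: Encoding::BINARY)
  str.encode(Encoding::UTF_8).each_codepoint do |cp|
    if cp > 0xFFFF
      v = cp - 0x10000
      out << encode_3byte(0xD800 | (v >> 10))    # high surrogate
      out << encode_3byte(0xDC00 | (v & 0x3FF))  # low surrogate
    else
      out << [cp].pack("U").b
    end
  end
  out
end

utf8_to_cesu8("a\u{1F600}").bytes.map { |b| format("%02X", b) }.join(" ")
# => "61 ED A0 BD ED B8 80"
```

Ecmascript code reading that string would see three code units: `a`, `\uD83D`, `\uDE00`.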

Implications for mapping from Duktape strings:

  • Since non-BMP characters are expressed as surrogate pairs (with CESU-8) you may want to combine the surrogate pairs into valid UTF-8.
  • Extended UTF-8 codepoints above U+10FFFF are not valid UTF-8, so you may want to replace them with e.g. the Unicode Replacement Character or perhaps escape them into printable data like <U+12345678>.
  • Finally, since string data can be arbitrary bytes in some cases, you may want to skip invalid extended UTF-8/CESU-8 sequences and replace them with the Unicode Replacement Character (or escape them into printable bytes like <FF>). Another approach is to fail at the first invalid sequence and map the whole string 1:1 to Unicode codepoints U+0000 to U+00FF to maintain the bytes as is.
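
A sketch of the first and third options combined, assuming the Ruby side receives raw bytes: surrogate pairs are merged into 4-byte UTF-8, and whatever is still invalid afterwards (lone surrogates, extended codepoints, arbitrary bytes) is replaced with U+FFFD via `String#scrub` (Ruby 2.1+). `cesu8_to_utf8` and `SURROGATE_PAIR` are made-up names for illustration.

```ruby
# Matches a high surrogate (ED A0..AF xx) followed by a low surrogate
# (ED B0..BF xx), both written as 3-byte sequences. /n makes it a binary regexp.
SURROGATE_PAIR = /\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF]/n

# Hypothetical helper: turn bytes from Duktape into a valid UTF-8 string.
def cesu8_to_utf8(bytes)
  merged = bytes.b.gsub(SURROGATE_PAIR) do |pair|
    b = pair.bytes
    high = ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)
    low  = ((b[3] & 0x0F) << 12) | ((b[4] & 0x3F) << 6) | (b[5] & 0x3F)
    cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    [cp].pack("U").b
  end
  # Anything still invalid is replaced with U+FFFD.
  merged.force_encoding(Encoding::UTF_8).scrub("\uFFFD")
end

cesu8_to_utf8([0x61, 0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x80].pack("C*"))  # => "a😀"
cesu8_to_utf8("a\xFFb".b)                                             # => "a\uFFFDb"
```

Escaping into printable text like `<FF>` would be a matter of passing a block to `scrub` instead of a fixed replacement string.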

Note that I'm not necessarily suggesting doing something as complicated as above, but just wanted to describe the various alternatives :)

@josh commented Dec 11, 2014

Alright, I've got the basics in.

@svaarala thanks for the explanation!

For mapping into Duktape, we'll always transcode into UTF-8.

For returned strings, I think we should just always tag as UTF-8 regardless of internal encoding. JSON.parse always does this as well. The nice thing is that it gives the extended code point and arbitrary bytestrings cases some leeway. It will be up to the user to call encode(invalid: :replace) to strip those characters or they can just treat it as raw bytes and retag it as binary. We won't validate the encoding so it will never crash in these cases.
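
Roughly what that looks like from the caller's side. The `raw` bytes are a hand-written stand-in for a string returned from Duktape, and `String#scrub` (Ruby 2.1+) is used here as one way to do the replacement; this is an illustration, not the gem's code.

```ruby
# Stand-in for bytes returned by Duktape: "ok" followed by an invalid 0xFF byte.
raw = [0x6F, 0x6B, 0xFF].pack("C*")

# Tagging never raises, even when the bytes are not valid UTF-8.
tagged = raw.dup.force_encoding(Encoding::UTF_8)
tagged.valid_encoding?  # => false

# Option 1: the caller replaces the invalid bytes.
cleaned = tagged.scrub("?")  # => "ok?"

# Option 2: the caller treats the data as raw bytes and retags it as binary.
binary = tagged.dup.force_encoding(Encoding::BINARY)
binary.bytes  # => [111, 107, 255]
```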

We'll probably want to do something about the surrogate pair encoding, but maybe in another PR.

Here's how Ruby JSON handles those cases.

https://github.com/ruby/ruby/blob/trunk/ext/json/parser/parser.c#L1304-L1372
https://github.com/ruby/ruby/blob/trunk/ext/json/generator/generator.c#L125-L219

judofyr added a commit that referenced this pull request Dec 11, 2014
@judofyr judofyr merged commit 703ae71 into judofyr:master Dec 11, 2014
@josh josh deleted the encodings branch December 11, 2014 18:10
@josh josh mentioned this pull request Dec 11, 2014