Document UTF-8 Conversion in Spec + Simplify marshall/unmarshall Implementations. #62

guilt · 2020-04-01T08:53:33Z

After looking at this I thought we were doing something special, but it turns out some characters weren't encoded right. Single character string: '𐍈' is an example

We should switch to utf8.js since it makes the code easier to debug. If we wanted to save memory allocations, we could make it re-use a fixed buffer and return the buffer, length pair. And then we can rewrite the code for subsequent for adjusting the size in.

The Java version can be simplified with new String(bytes, utf8CharSet); and s.getBytes(utf8CharSet); As of Java 8, they decided to move away from UTF-16 as a default encoding, and previously were into UTF16. So we should just use the native implementation as much as possible than invent ours.

@pascaldekloe would like to hear your thoughts. :)

The text was updated successfully, but these errors were encountered:

pascaldekloe · 2020-04-01T12:08:32Z

I think text conversion depends on the implementation, i.e., the rules are not related to the data format. The compiler manual (see README) states the following.

	Text validation is not part of the marshalling and unmarshalling
	process. C and Go just pass any malformed UTF-8 characters. Java
	and JavaScript replace unmappable content with the '?' character
	(ASCII 63).

So java.lang.String is no longer backed by a char(acter) array. With the new implementation it is even harder to access the data in an efficient way. 😖 Happy to hear about better alternatives for String#charAt(int). String#getBytes allocates memory. The unmarshaller uses String(byte[],int,int,java.nio.charset.Charset) now, and that works fine.

No external libraries for generated code is key!

Feel free to open an issue for a specific improvement idea.

pascaldekloe · 2020-04-01T14:41:38Z

Had a quick look at the new streams with String#chars. It is way slower 😱than String#charAt(int).

guilt · 2020-04-01T16:55:36Z

Thank you for clarifying. So here are some specific suggestions:

Java's size fit should probably use *4 or *5 as a general rule. Assuming 𐍈🐱‍🐉🤷‍♂️🥗🚂 emojis are the norm than the exception.
We should probably add ser/deser tests with UTF characters across all 4 character ranges in these tests.

pascaldekloe · 2020-04-01T18:16:19Z

The size fit is the maximum ratio from UTF-16 char(acters) to UTF-8 bytes. That is, encoding of a char costs 1, 2 or 3 bytes; never more.
The golden cases have a string with all cases covered.

guilt · 2020-04-01T18:28:42Z

Ah, yes, based on characters, not code points. That should be okay for now.

pascaldekloe · 2020-04-02T12:47:56Z

What do you mean with "for now"? 😬 This must hold forever, even with malformed UTF-16 sequences.

pascaldekloe · 2020-04-02T12:55:39Z

--- a/ecma/test.js
+++ b/ecma/test.js
@@ -50,6 +50,7 @@ function newGoldenCases() {
                '87ffffffffffffffff2e5da4e77f': {t: new Date(-223), t_ns: 888999},
                '0801417f': {s: 'A'},
                '080261007f': {s: 'a\x00'},
+               '0804f0908d887f': {s: '𐍈'},
                '0809c280e0a080f09080807f': {s: '\u0080\u0800\u{10000}'},
                '08800120202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020207f': {s: '                                                                                                                                '},
                '0901ff7f': {a: new Uint8Array([0xFF])},

… passes just fine.

guilt · 2020-04-02T21:52:18Z

For now, it meant, until, Unicode ups the range dramatically.

pascaldekloe · 2020-04-03T10:09:33Z

Unicode doesn't up the range dramatically. It would would also be against their own stability policy.

guilt changed the title ~~Document UTF-8 Conversion of some characters~~ Document UTF-8 Conversion in Spec + Simplify marshall/unmarshall Implementations. Apr 1, 2020

pascaldekloe closed this as completed Apr 1, 2020

pascaldekloe added the enhancement label Apr 1, 2020

pascaldekloe added the invalid label Apr 2, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Document UTF-8 Conversion in Spec + Simplify marshall/unmarshall Implementations. #62

Document UTF-8 Conversion in Spec + Simplify marshall/unmarshall Implementations. #62

guilt commented Apr 1, 2020 •

edited

pascaldekloe commented Apr 1, 2020

pascaldekloe commented Apr 1, 2020

guilt commented Apr 1, 2020

pascaldekloe commented Apr 1, 2020

guilt commented Apr 1, 2020

pascaldekloe commented Apr 2, 2020

pascaldekloe commented Apr 2, 2020

guilt commented Apr 2, 2020

pascaldekloe commented Apr 3, 2020

Document UTF-8 Conversion in Spec + Simplify marshall/unmarshall Implementations. #62

Document UTF-8 Conversion in Spec + Simplify marshall/unmarshall Implementations. #62

Comments

guilt commented Apr 1, 2020 • edited

pascaldekloe commented Apr 1, 2020

pascaldekloe commented Apr 1, 2020

guilt commented Apr 1, 2020

pascaldekloe commented Apr 1, 2020

guilt commented Apr 1, 2020

pascaldekloe commented Apr 2, 2020

pascaldekloe commented Apr 2, 2020

guilt commented Apr 2, 2020

pascaldekloe commented Apr 3, 2020

guilt commented Apr 1, 2020 •

edited