Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document UTF-8 Conversion in Spec + Simplify marshall/unmarshall Implementations. #62

Closed
guilt opened this issue Apr 1, 2020 · 9 comments

Comments

@guilt
Copy link
Contributor

guilt commented Apr 1, 2020

After looking at this I thought we were doing something special, but it turns out some characters weren't encoded right. Single character string: '𐍈' is an example

We should switch to utf8.js since it makes the code easier to debug. If we wanted to save memory allocations, we could make it re-use a fixed buffer and return the buffer, length pair. And then we can rewrite the code for subsequent for adjusting the size in.

The Java version can be simplified with new String(bytes, utf8CharSet); and s.getBytes(utf8CharSet); As of Java 8, they decided to move away from UTF-16 as a default encoding, and previously were into UTF16. So we should just use the native implementation as much as possible than invent ours.

@pascaldekloe would like to hear your thoughts. :)

@guilt guilt changed the title Document UTF-8 Conversion of some characters Document UTF-8 Conversion in Spec + Simplify marshall/unmarshall Implementations. Apr 1, 2020
@pascaldekloe
Copy link
Owner

I think text conversion depends on the implementation, i.e., the rules are not related to the data format. The compiler manual (see README) states the following.

	Text validation is not part of the marshalling and unmarshalling
	process. C and Go just pass any malformed UTF-8 characters. Java
	and JavaScript replace unmappable content with the '?' character
	(ASCII 63).

So java.lang.String is no longer backed by a char(acter) array. With the new implementation it is even harder to access the data in an efficient way. 😖 Happy to hear about better alternatives for String#charAt(int). String#getBytes allocates memory. The unmarshaller uses String(byte[],int,int,java.nio.charset.Charset) now, and that works fine.

No external libraries for generated code is key!

Feel free to open an issue for a specific improvement idea.

@pascaldekloe
Copy link
Owner

Had a quick look at the new streams with String#chars. It is way slower 😱than String#charAt(int).

@guilt
Copy link
Contributor Author

guilt commented Apr 1, 2020

Thank you for clarifying. So here are some specific suggestions:

  1. Java's size fit should probably use *4 or *5 as a general rule. Assuming 𐍈🐱‍🐉🤷‍♂️🥗🚂 emojis are the norm than the exception.
  2. We should probably add ser/deser tests with UTF characters across all 4 character ranges in these tests.

@pascaldekloe
Copy link
Owner

  1. The size fit is the maximum ratio from UTF-16 char(acters) to UTF-8 bytes. That is, encoding of a char costs 1, 2 or 3 bytes; never more.

  2. The golden cases have a string with all cases covered.

@guilt
Copy link
Contributor Author

guilt commented Apr 1, 2020

Ah, yes, based on characters, not code points. That should be okay for now.

@pascaldekloe
Copy link
Owner

What do you mean with "for now"? 😬 This must hold forever, even with malformed UTF-16 sequences.

@pascaldekloe
Copy link
Owner

--- a/ecma/test.js
+++ b/ecma/test.js
@@ -50,6 +50,7 @@ function newGoldenCases() {
                '87ffffffffffffffff2e5da4e77f': {t: new Date(-223), t_ns: 888999},
                '0801417f': {s: 'A'},
                '080261007f': {s: 'a\x00'},
+               '0804f0908d887f': {s: '𐍈'},
                '0809c280e0a080f09080807f': {s: '\u0080\u0800\u{10000}'},
                '08800120202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020207f': {s: '                                                                                                                                '},
                '0901ff7f': {a: new Uint8Array([0xFF])},

… passes just fine.

@guilt
Copy link
Contributor Author

guilt commented Apr 2, 2020

For now, it meant, until, Unicode ups the range dramatically.

@pascaldekloe
Copy link
Owner

Unicode doesn't up the range dramatically. It would would also be against their own stability policy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants