Document UTF-8 Conversion in Spec + Simplify marshall/unmarshall Implementations. #62
Comments
I think text conversion depends on the implementation, i.e., the rules are not related to the data format. The compiler manual (see README) states the following.
So java.lang.String is no longer backed by a char(acter) array. With the new implementation it is even harder to access the data in an efficient way. 😖 Happy to hear about better alternatives to String#charAt(int). String#getBytes allocates memory. The unmarshaller uses String(byte[],int,int,java.nio.charset.Charset) now, and that works fine. Having no external libraries in generated code is key! Feel free to open an issue for a specific improvement idea.
Had a quick look at the new streams with String#chars. It is way slower 😱 than String#charAt(int).
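To make the trade-off concrete, here is a minimal sketch of an allocation-free encoder that walks the string with String#charAt(int) and writes UTF-8 into a caller-supplied buffer, next to the String(byte[],int,int,Charset) decode mentioned above. The class name Utf8Sketch, the method names, and the choice to replace unpaired surrogates with U+FFFD are assumptions for illustration, not the project's generated code.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Sketch {
    /**
     * Writes s as UTF-8 into buf starting at offset i and returns the new offset.
     * The caller must provide enough room (at most 3 bytes per char).
     */
    static int encode(String s, byte[] buf, int i) {
        for (int j = 0, n = s.length(); j < n; j++) {
            int c = s.charAt(j);
            if (c < 0x80) {
                buf[i++] = (byte) c;
            } else if (c < 0x800) {
                buf[i++] = (byte) (0xC0 | (c >> 6));
                buf[i++] = (byte) (0x80 | (c & 0x3F));
            } else if (Character.isHighSurrogate((char) c) && j + 1 < n
                    && Character.isLowSurrogate(s.charAt(j + 1))) {
                // Well-formed surrogate pair: one code point, four bytes.
                int cp = Character.toCodePoint((char) c, s.charAt(++j));
                buf[i++] = (byte) (0xF0 | (cp >> 18));
                buf[i++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                buf[i++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                buf[i++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                // Assumption: an unpaired surrogate becomes U+FFFD.
                if (Character.isSurrogate((char) c)) c = 0xFFFD;
                buf[i++] = (byte) (0xE0 | (c >> 12));
                buf[i++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                buf[i++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        return i;
    }

    /** Decodes a slice of buf with the JDK's own UTF-8 decoder. */
    static String decode(byte[] buf, int offset, int length) {
        return new String(buf, offset, length, StandardCharsets.UTF_8);
    }
}
```

For well-formed input the output should match String#getBytes(StandardCharsets.UTF_8) byte for byte, without the intermediate array.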
Thank you for clarifying. So here are some specific suggestions:
|
|
Ah, yes, based on characters, not code points. That should be okay for now. |
What do you mean by "for now"? 😬 This must hold forever, even with malformed UTF-16 sequences.
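For readers following the characters versus code points distinction, a small sketch may help. It compares UTF-16 code units with code points and uses a strict JDK encoder to detect malformed UTF-16 such as an unpaired surrogate. The class name Utf16Units and the helper isWellFormed are made up for this illustration.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf16Units {
    /** Number of UTF-16 code units, i.e. what String#length() counts. */
    static int units(String s) {
        return s.length();
    }

    /** Number of Unicode code points; a surrogate pair counts once. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Reports whether s can be encoded as UTF-8 without any replacement. */
    static boolean isWellFormed(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false; // e.g. an unpaired surrogate
        }
    }
}
```

Note that String#getBytes(Charset) does not fail on such input; it substitutes the charset's replacement bytes, so a strict check has to go through an encoder like this one.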
--- a/ecma/test.js
+++ b/ecma/test.js
@@ -50,6 +50,7 @@ function newGoldenCases() {
'87ffffffffffffffff2e5da4e77f': {t: new Date(-223), t_ns: 888999},
'0801417f': {s: 'A'},
'080261007f': {s: 'a\x00'},
+ '0804f0908d887f': {s: '𐍈'},
'0809c280e0a080f09080807f': {s: '\u0080\u0800\u{10000}'},
'08800120202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020207f': {s: ' '},
'0901ff7f': {a: new Uint8Array([0xFF])},
… passes just fine.
By "for now" I meant: until Unicode ups the range dramatically.
Unicode doesn't up the range dramatically. It would also be against their own stability policy.
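The same golden bytes can be cross-checked against the JDK. The sketch below assumes that the bytes between the two-byte prefix and the trailing 7f in each golden key are the UTF-8 payload of the string; the class GoldenCase and the helper utf8Hex are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class GoldenCase {
    /** Returns the UTF-8 encoding of s as lowercase hex. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+10348 written as a surrogate pair to stay independent of the source encoding.
        System.out.println(utf8Hex("\uD800\uDF48"));
        // The neighbouring golden case: U+0080, U+0800, U+10000.
        System.out.println(utf8Hex("\u0080\u0800\uD800\uDC00"));
    }
}
```

The first line corresponds to the payload in '0804f0908d887f', the second to the payload in '0809c280e0a080f09080807f'.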
After looking at this I thought we were doing something special, but it turns out some characters weren't encoded right. The single-character string '𐍈' is an example.
We should switch to utf8.js since it makes the code easier to debug. If we wanted to save memory allocations, we could make it reuse a fixed buffer and return a (buffer, length) pair. We could then rewrite the subsequent code that adjusts the size.
The Java version can be simplified with
new String(bytes, utf8CharSet);
and
s.getBytes(utf8CharSet);
As of Java 9 (compact strings), java.lang.String no longer always stores its content as UTF-16 in a char array, as earlier versions did. So we should use the native implementation as much as possible rather than invent our own.
@pascaldekloe, I would like to hear your thoughts. :)
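One way to combine the native implementation with the fixed-buffer idea is a reusable java.nio.charset.CharsetEncoder. This is only a sketch: the class ReusableUtf8, its methods, and the -1 overflow convention are assumptions. CharBuffer.wrap still allocates a small wrapper object per call, but no byte array.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class ReusableUtf8 {
    // REPLACE mirrors what String#getBytes(Charset) does with malformed input.
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
    private final ByteBuffer buf;

    ReusableUtf8(int capacity) {
        buf = ByteBuffer.allocate(capacity);
    }

    /** Encodes s into the shared buffer; returns the byte count, or -1 if it does not fit. */
    int encode(String s) {
        buf.clear();
        encoder.reset();
        CoderResult r = encoder.encode(CharBuffer.wrap(s), buf, true);
        if (r.isOverflow()) return -1;
        r = encoder.flush(buf);
        if (r.isOverflow()) return -1;
        return buf.position();
    }

    /** The backing array; only the first encode(...) bytes are valid. */
    byte[] buffer() {
        return buf.array();
    }
}
```

Usage would look like `int n = utf8.encode(s);` followed by copying `n` bytes from `utf8.buffer()` into the output, while decoding stays a plain `new String(bytes, offset, n, StandardCharsets.UTF_8)`.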