Support Unicode #1

Closed
sparkprime opened this issue Aug 10, 2014 · 14 comments

@sparkprime
Contributor

Currently only ASCII is supported in strings. It should not be too hard to accept UTF-8 (raising an error for invalid input), adjust the internal string routines to unparse those strings correctly, and add routines for iterating over codepoints, correctly determining the length (in codepoints), etc.
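
For illustration (a sketch, not code from the repository): determining the length in codepoints of a UTF-8 string means counting only the lead bytes and skipping the 10xxxxxx continuation bytes. Assuming the input has already been validated as UTF-8:

#include <cstddef>
#include <string>

// Count codepoints in a UTF-8 string by skipping continuation bytes
// (bytes of the form 10xxxxxx). Assumes valid UTF-8 input; a real
// implementation would also detect and report invalid sequences.
std::size_t utf8_codepoint_length(const std::string &s)
{
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)  // not a continuation byte
            n++;
    return n;
}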

@sparkprime sparkprime self-assigned this Aug 10, 2014
sparkprime pushed a commit that referenced this issue Mar 1, 2015
Merge pull from master
@johnboiles

This would be great. How painful did this change look to be? I might be able to contribute if it's not huge.

@sparkprime
Contributor Author

My plan was to avoid having a dependency on ICU -- store everything internally as wstring and assume that wchar is a unicode codepoint. Then we just need to tweak the lexer to parse utf8 in string literals and the string output function to render it back as utf8. It shouldn't be too hard as I left some placeholders and TODOs in there. You're very welcome to have a try at it.

I suggest 1) modifying the internal string representation in state.h, 2) modifying the output code to encode utf8 and testing it with std.char(x) for x > 127, and 3) modifying the lexer to parse utf8. It would be possible to run all tests and commit upstream at each intermediate point.
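
A rough sketch of the lexer-side step (illustrative only -- decode_utf8_codepoint is a hypothetical name, and bounds checking is omitted for brevity):

#include <cstddef>
#include <stdexcept>
#include <string>

// Decode one codepoint from UTF-8 input starting at index i, advancing
// i past the bytes consumed. A real lexer would also check bounds and
// report errors with file/line information.
char32_t decode_utf8_codepoint(const std::string &s, std::size_t &i)
{
    unsigned char c0 = s[i++];
    if ((c0 & 0x80) == 0x00) return c0;                            // 0xxxxxxx
    int extra;
    char32_t cp;
    if      ((c0 & 0xE0) == 0xC0) { extra = 1; cp = c0 & 0x1F; }   // 110xxxxx
    else if ((c0 & 0xF0) == 0xE0) { extra = 2; cp = c0 & 0x0F; }   // 1110xxxx
    else if ((c0 & 0xF8) == 0xF0) { extra = 3; cp = c0 & 0x07; }   // 11110xxx
    else throw std::runtime_error("invalid UTF-8 lead byte");
    for (int k = 0; k < extra; ++k) {
        unsigned char c = s[i++];
        if ((c & 0xC0) != 0x80)
            throw std::runtime_error("invalid UTF-8 continuation byte");
        cp = (cp << 6) | (c & 0x3F);
    }
    return cp;
}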


@johnboiles

Great, thanks for the info @sparkprime. I'll update here if I get a chance to try it; I need more emoji in my json 🍻

I really love Jsonnet BTW. My team is using it along with ApiDoc to create API documentation that doubles as a mock API server for developing apps against APIs that aren't finished yet.

@sparkprime
Contributor Author

Glad you like it!

I did some reading and it seems wstring is not what we want because it has UTF-16 behavior on Windows. So we probably need to do something like

typedef std::basic_string<char32_t> JsonnetString;

with functions to convert from UTF8-encoded std::string to that and back.

There are a bunch of places where the HeapString internal representation leaks out as well, e.g. field names, std.extVar() keys, filenames (from std.thisFile), etc.
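
A minimal sketch of that shape (the names below are illustrative, not necessarily the ones used in the codebase):

#include <string>

// Internal string type: one char32_t per Unicode codepoint, so indexing
// and length are codepoint-based regardless of the platform's wchar_t.
typedef std::basic_string<char32_t> JsonnetString;

// Conversions between the external UTF-8 encoding and the internal
// representation (declarations only; the bit-level work is the
// encode/decode logic sketched elsewhere in this thread).
JsonnetString jsonnet_string_from_utf8(const std::string &utf8);
std::string jsonnet_string_to_utf8(const JsonnetString &str);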

@sparkprime
Contributor Author

@hotdog929 you may be interested

@sparkprime
Contributor Author

I'm going to have a go at this because I think it's probably harder / more work than I originally thought.

@sparkprime
Contributor Author

That was a productive 4 hours ;)

@johnboiles

Wow @sparkprime, way to kill it!!

@davidzchen
Contributor

Nice! :D

Perhaps I should also add a jsonnet_test Bazel rule since it is possible to write tests in Jsonnet, such as the unicode.jsonnet test you just added. :)

@johnboiles

Looks like ordinary Unicode characters are working fine, but the longer 4-byte sequences for emoji (like 🚀 -- "\xF0\x9F\x9A\x80") always become the sequence "\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD" (the U+FFFD replacement character repeated four times)

I'm suspicious of the encode_utf8 method, but I'm struggling to understand what all the bit masking and shifting is doing.
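
For reference, the masking and shifting just splits the codepoint into 6-bit groups and prefixes each byte with the right lead bits. A sketch of a UTF-8 encoder with the same structure (not the repository's encode_utf8 itself, just the general idea):

#include <string>

// Append the UTF-8 encoding of codepoint cp to out. Each continuation
// byte carries 6 bits of the codepoint (mask 0x3F) behind a 10xxxxxx
// prefix (0x80); the lead byte's prefix encodes the sequence length.
void append_utf8(char32_t cp, std::string &out)
{
    if (cp < 0x80) {
        out += char(cp);                             // 0xxxxxxx
    } else if (cp < 0x800) {
        out += char(0xC0 | (cp >> 6));               // 110xxxxx
        out += char(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += char(0xE0 | (cp >> 12));              // 1110xxxx
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    } else {
        out += char(0xF0 | (cp >> 18));              // 11110xxx
        out += char(0x80 | ((cp >> 12) & 0x3F));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    }
}

With this structure, 🚀 (U+1F680) encodes to the four bytes F0 9F 9A 80, matching the byte sequence quoted above.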

@johnboiles

I think I have a fix; it looks like a typo on this line:

} else if ((c0 & 0xF8) == 0xF) { //11110zzz 10zzyyyy 10yyyyxx 10xxxxxx

Changing that to the following seems more correct, since c0 & 0xF8 keeps only the top five bits of the lead byte and so can never equal 0x0F:

} else if ((c0 & 0xF8) == 0xF0) { //11110zzz 10zzyyyy 10yyyyxx 10xxxxxx

@johnboiles

Submitted a fix as #78.

I didn't see an easy way to test this, as the \u escape sequence only supports 4 hex digits (i.e. up to character code 0xFFFF). So adding this is invalid:

std.assertEqual("\u1F680", "🚀") &&

One solution for testing could be to add support for ECMAScript 6 code point escapes (like \u{1F680}).

If you have another idea for testing, I'd love to hear it!

@sparkprime
Contributor Author

Thanks for tracking this down!

I suppose you can do things like "🚀🚀🚀"[1] which should == "🚀".

\u{XXX} should be a no-brainer though, it could be added in the lexer quite easily.

I have been worried for a long time about the limitation of \u, and whether it's necessary to also support things like this: https://bugs.launchpad.net/zorba/+bug/1024448
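
A hedged sketch of how the lexer could handle both forms -- ECMAScript-style \u{1F680} and JSON-style surrogate pairs such as \uD83D\uDE80 -- assuming the backslash and the 'u' have already been consumed (hypothetical helper; hex validation and bounds checking omitted for brevity):

#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical sketch: parse the text following "\u" at position i in s,
// returning one codepoint and advancing i. Accepts \u{XXXXXX} as well as
// 4-digit \uXXXX, and combines a UTF-16 surrogate pair into one codepoint.
// Error checking (bad hex digits, truncated input, unpaired surrogates)
// is omitted for brevity.
char32_t lex_unicode_escape(const std::string &s, std::size_t &i)
{
    auto hex_digit = [](char c) -> char32_t {
        return std::isdigit((unsigned char)c)
            ? c - '0'
            : std::tolower((unsigned char)c) - 'a' + 10;
    };
    if (s[i] == '{') {                 // \u{...} code point escape
        ++i;
        char32_t cp = 0;
        while (s[i] != '}')
            cp = cp * 16 + hex_digit(s[i++]);
        ++i;                           // skip '}'
        return cp;
    }
    auto hex4 = [&](std::size_t pos) {
        char32_t v = 0;
        for (int k = 0; k < 4; ++k)
            v = v * 16 + hex_digit(s[pos + k]);
        return v;
    };
    char32_t cp = hex4(i);
    i += 4;
    // A high surrogate followed by "\u" and a low surrogate encodes one
    // codepoint above U+FFFF; combine the pair.
    if (cp >= 0xD800 && cp <= 0xDBFF && s.compare(i, 2, "\\u") == 0) {
        char32_t lo = hex4(i + 2);
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            i += 6;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    return cp;
}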

@johnboiles

No problem! It was enlightening to learn more about the inner workings of Unicode.

sbarzowski pushed a commit to sbarzowski/jsonnet that referenced this issue Jun 10, 2024
Fix build errors, implement parseCommaList
sbarzowski pushed a commit to sbarzowski/jsonnet that referenced this issue Jun 10, 2024
Port lexer changes from google/jsonnet 0c96da7 to 27ddf2c Fix google#1