Decoding of string with null-byte. #108

Open
remicollet opened this Issue Sep 26, 2013 · 8 comments

@remicollet
Contributor

Strings may contain a null byte, e.g. "foo\u0000bar".

While json-c 0.11 allows decoding such a value, I can't find an easy way to decode such a key in an object.

Ex: {"foo\u0000bar":"bar\u0000baz"}

Related to remicollet/pecl-json-c#7

@remicollet referenced this issue in remicollet/pecl-json-c on Sep 26, 2013:
json_decode: strings cut off after first null-byte #7 (Open)

@remicollet
Contributor

A solution could be to change the linkhash function to accept a c_string (char *str + int len) as key, using an _ex function and keeping the current function as a wrapper, to keep the API stable.
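
A minimal sketch of that wrapper pattern, using hypothetical names rather than the actual linkhash signatures:

```c
#include <stddef.h>
#include <string.h>

struct my_table;                      /* stand-in for struct lh_table */

/* New length-aware entry point that tolerates embedded NUL bytes in the key. */
void *my_table_lookup_ex(struct my_table *t, const char *key, size_t key_len);

/* The existing NUL-terminated call becomes a thin wrapper, so callers that
 * never see embedded NULs keep the current API unchanged. */
void *my_table_lookup(struct my_table *t, const char *key)
{
	return my_table_lookup_ex(t, key, strlen(key));
}
```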

What do you think?

@hawicz
Member
hawicz commented Mar 3, 2014

Extending the linkhash code to take a c_string won't be a problem, since you can already choose the types of keys that a particular hash table stores. That part should be just a matter of defining a lh_c_string_hash() function, and then allocating the jso->o.c_object table with that. However, that will increase the memory usage and might slightly slow things down too, so I'm reluctant to have it enabled all the time.
Another difficult part of this will be the json_object_object_{get_ex,add,del,foreach} functions in json_object.h, which work with normal C strings (char *'s). We'd need to add c_string versions of each of these.
Also, when the c_string hash table isn't enabled, parsing something with embedded 0 characters should fail, as should attempting to use the new c_string functions.
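
For illustration, a length-aware key and hash/equality pair could look roughly like this; the struct layout and the djb2-style hash are assumptions for the sketch, not what json-c actually defines:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counted-string key: the length is stored explicitly, so
 * embedded NUL bytes are just ordinary data. */
struct c_string {
	const char *data;
	size_t      len;
};

static unsigned long lh_c_string_hash(const void *k)
{
	const struct c_string *s = k;
	unsigned long h = 5381;
	size_t i;

	/* Hash every byte up to len instead of stopping at the first NUL. */
	for (i = 0; i < s->len; i++)
		h = h * 33 + (unsigned char)s->data[i];
	return h;
}

static int lh_c_string_equal(const void *k1, const void *k2)
{
	const struct c_string *a = k1, *b = k2;

	return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}
```

A table allocated with such functions would carry (pointer, length) keys, which is where the extra memory use and slight slowdown mentioned above would come from.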

@skaes
skaes commented Apr 24, 2014

I have been bitten by the same problem: I receive JSON from external sources over which I have no control. Sometimes the source sends "\u0000" as part of a key. It would be really great if json-c could handle this, as it is perfectly valid JSON.

In my opinion, it's a bug, not a missing feature. I'd rather have a small performance impact on parsing than be unable to handle embedded null characters correctly.

A quick and dirty hack, to at least not lose most of the key, would be to simply not unescape \u0000 when parsing JSON. Or provide a parser callback so the user can specify how to handle embedded null characters.
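
Purely as an illustration of the callback idea, an option like the following could be imagined; none of these names exist in json-c:

```c
#include <stddef.h>

/* Hypothetical policy the application could choose per parser instance. */
enum nul_policy {
	NUL_POLICY_ERROR,        /* reject the document                      */
	NUL_POLICY_KEEP_ESCAPED, /* leave the literal "\u0000" text in place */
	NUL_POLICY_TRUNCATE      /* current behavior: cut at the NUL         */
};

/* Hypothetical callback: given the string parsed so far, decide what to do
 * when an embedded NUL is encountered. */
typedef enum nul_policy (*nul_handler)(void *userdata,
                                       const char *str_so_far, size_t len);
```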

@hawicz
Member
hawicz commented Apr 24, 2014

I don't care much for that hack. You'll run into issues if someone includes \u0000 in the string.
Also, I don't think parser callbacks will help if there isn't already a way for json-c to do something reasonable with the nul-containing string. However, if you've got an API and example usage (or even better some actual code) then I'm all ears.

@rgerhards
Contributor

Sorry for reviving this old thread. I came across the NUL byte problem when working on some performance enhancements. It looks like json-c as a producer supports them, so there is indeed some inconsistency. But what I mostly wonder is how one would expect to work with them in a generic API. All APIs basically work with C strings. So let's assume a NUL is read in. How would you expect to pass this buffer to a caller? Of course, you can pass a buffer pointer and size, but is that what a typical C program expects? I would really appreciate some feedback on this issue.

@hawicz
Member
hawicz commented Nov 17, 2015

Well, no, a C program written to work with "normal", nul terminated strings will not expect a buffer and size. As I mentioned before, there would need to be "c_string" (i.e. the buffer+len data structure) variants of all of the json-c API functions. If you wanted to take advantage of any features that allow for embedded nul characters your C code would need to change to use those new (not yet written) functions.
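
To make the shape of such an API concrete, length-aware variants might look like the following from the caller's side; these functions do not exist in json-c and the names are only assumptions:

```c
#include <stddef.h>
#include <json-c/json.h>   /* usual install path for the json-c headers */

/* Hypothetical length-aware counterparts of the existing asciiz calls. */
int json_object_object_get_len(struct json_object *obj,
                               const char *key, size_t key_len,
                               struct json_object **value);
int json_object_object_add_len(struct json_object *obj,
                               const char *key, size_t key_len,
                               struct json_object *value);

/* Usage sketch: look up a key that contains an embedded NUL byte. */
static struct json_object *lookup_foo_nul_bar(struct json_object *obj)
{
	static const char key[] = "foo\0bar";   /* 7 bytes, NUL in the middle */
	struct json_object *val = NULL;

	if (json_object_object_get_len(obj, key, sizeof(key) - 1, &val))
		return val;
	return NULL;
}
```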

@rgerhards
Contributor

Well, I have no desire at all to use unencoded NUL bytes in my program. But if we do not want to support that in json-c, we could officially state so. In that case, some simplifications could be made inside the code base (there is even a test for NUL byte encoding in the testbench...).

@hawicz
Member
hawicz commented Nov 18, 2015

I think you misunderstand me. I am not saying that json-c should not support that; I'm saying that fully doing so will be a significant change. I expect that adding c_string variants of all the API functions will actually mean changing all internal handling of strings to use c_string, and turning the current asciiz API calls into wrappers around the c_string ones. However, I haven't actually spent any time evaluating the actual scope of the change, so who knows, maybe it'll be easy. :)
Clearly, something should be done, and if not the full conversion to c_string, then it would be a good idea to at least cause embedded \u0000's to result in an error, perhaps with "quick and dirty hack" options (as @skaes said) to either pass those through as-is, or re-enable the current rather broken behavior.

@hawicz added the new-feature label on Jun 8, 2016