Output items sequence as original bytes #56

skeggse · 2021-08-16T20:23:51Z

Is your feature request related to a problem? Please describe.

It'd be great if there were a way to rapidly break a large stream of multiple JSON values (i.e. the multiple_values option) into its constituent values. For use-cases where you just need to know e.g. the number of JSON values in a stream, or need to multiplex an incoming stream across threads, or simply substring match the entire raw JSON value without first interpreting it, this is a pretty useful feature. As a point of reference, some JSON libraries like Golang's support this out of the box: in that case, you can decode a JSON-containing byte array into a json.RawMessage, which just copies the byte array.

Describe the solution you'd like

I'd like some equivalent to ijson.items that simply produces the original bytes (possibly copied) instead of parsing the items themselves.

Describe alternatives you've considered

If I had full control over the production of these JSON streams, I could require that the output were newline-delimited. At present, this is not the case.

I think the current workaround is to run jq -cM in a subprocess and pipe the stream into jq, which will force sequences like {}{} to get produced as {}\n{}\n. I could try to reserialize the original items, but that doesn't always result in the desired behavior (and would probably be slower than the jq equivalent). This is an imperfect solution because it'll mangle the original bytes, which may not be the desired behavior when searching for item-level substring matches.

rtobar · 2021-08-19T17:30:59Z

Thanks @skeggse for the interesting proposal, and sorry for the delay on this initial response, busy days.

First things first, let me rephrase your idea to make sure I'm understanding correctly. You basically want something like this:

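import ijson

data = b'{}{}'
# new_method_you_want is a placeholder for the proposed API; it does not exist in ijson
for raw_json in ijson.new_method_you_want(data):
    ...  # raw_json is b'{}' each time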

Is that a fair depiction of what you're looking for?

As you mentioned, a way to currently achieve this is doing something like:

import json
import ijson

data = b'{}{}'
for raw_json in map(json.dumps, ijson.items(data, '', multiple_values=True)):
    ...  # raw_json is '{}' each time

The drawback is that this indeed builds each document fully as a Python object just to dump it back into its string form. In the process you might also lose some information (but not necessarily).
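To make the information loss concrete, here is a minimal sketch of such a round trip. It uses the standard-library json module instead of ijson purely to stay dependency-free, and the helper name roundtrip is made up for illustration. Whitespace, number formatting and duplicate keys do not survive, even though the parsed value is the same.

```python
import json


def roundtrip(raw: bytes) -> bytes:
    """Parse one JSON value and serialize it again (hypothetical helper)."""
    return json.dumps(json.loads(raw)).encode("utf-8")


original = b'{"a": 1.0e2,   "b": [ ], "a": 3}'
rebuilt = roundtrip(original)

# The bytes differ: the duplicate key collapses and the spacing changes...
print(rebuilt != original)
# ...although both forms parse to the same Python object.
print(json.loads(rebuilt) == json.loads(original))
```

This is why reserializing is a poor fit for substring matching against the original bytes.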

From the point of view of ijson and its inner workings here are some thoughts:

- In the example above, and the one you mention in your original comment, the top-level documents in the single stream consist of JSON objects. Note however that in general they could be any JSON value; e.g. {} [], [] [], 1 2, true {2}, etc.
- The above means that ijson cannot simply look for a starting/ending bracket, parenthesis or the like. Instead, and to ensure correct behavior, parsing of the original document must be done; in other words, there are no shortcuts. In particular, also note that although it might work most of the time, using newlines to determine the end of a JSON value is obviously not fully reliable (e.g., {\n} is a valid, single JSON value).
- To produce individual documents consisting of verbatim copies of the original bytes we then need to fully parse the document, while keeping track of the bytes the parser considered in the process (this is the key). To begin with, none of the ijson routines is "low-level" enough to offer this information -- we need to go to the parser technologies that power our backends.
- Of those we currently have a few: our own pure python parser, the yajl library (versions 1 and 2), and a not-yet-on-the-master-branch boost json parser.
  - We could change our own python parser to keep track of input bytes
  - From memory the boost json library might keep track of this information already
  - But the yajl parser (in either version) doesn't, so for most of our backends it would be simply impossible to provide this information.
- Moreover, even if all underlying parsers exposed this information, it would still require some non-trivial amount of work to add your desired functionality on top of that.
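As an illustration of "parse fully while tracking the consumed input", here is a rough sketch outside of ijson. It relies on the standard library's json.JSONDecoder.raw_decode, which returns the index where each parsed value ends. It assumes the whole stream fits in memory and is valid UTF-8, so it is not a streaming solution, and it still builds (and discards) each Python object. The name iter_raw_values is hypothetical.

```python
import json
from typing import Iterator


def iter_raw_values(data: bytes) -> Iterator[bytes]:
    """Yield each top-level JSON value in `data` as its original bytes."""
    text = data.decode("utf-8")
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos] in " \t\n\r":
            pos += 1
        if pos >= len(text):
            return
        # The parsed object is ignored; only the end offset matters.
        _, end = decoder.raw_decode(text, pos)
        # Re-encoding a slice of valid UTF-8 text gives back the same bytes.
        yield text[pos:end].encode("utf-8")
        pos = end


for raw_json in iter_raw_values(b'{"a": [1,  2]}\n[ ] 1 true'):
    print(raw_json)
```

Malformed input raises json.JSONDecodeError at the offending position, since every value is really parsed.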

In summary, I think this is simply not possible because of the restrictions imposed by the underlying parsing technologies we use, and even if it were, at least for some of the backends, it would be too much effort for little gain.

skeggse · 2021-11-10T17:08:34Z

Okay, I might just pull in a parser backend. It seems like a relatively simple parser atop a tokenizer would be sufficient. Thanks for your consideration!
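For anyone landing here later, the kind of boundary scanner I have in mind could look roughly like the sketch below. This is not part of ijson, and split_raw_values is a made-up name. It only tracks string state and nesting depth to find where each top-level value ends, so it does not validate anything: malformed JSON passes through silently, and two top-level scalars must be separated by whitespace or a structural character to be told apart.

```python
from typing import Iterable, Iterator

WHITESPACE = b" \t\n\r"
QUOTE = 0x22      # "
BACKSLASH = 0x5C  # \


def split_raw_values(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield the original bytes of each top-level JSON value (no validation)."""
    buf = bytearray()  # bytes of the value currently being scanned
    depth = 0          # nesting depth of {...} and [...]
    in_string = False
    escaped = False    # previous byte was a backslash inside a string
    for chunk in chunks:
        for byte in chunk:
            if in_string:
                buf.append(byte)
                if escaped:
                    escaped = False
                elif byte == BACKSLASH:
                    escaped = True
                elif byte == QUOTE:
                    in_string = False
                    if depth == 0:  # a top-level string just ended
                        yield bytes(buf)
                        buf.clear()
                continue
            if byte in WHITESPACE:
                if depth > 0:
                    buf.append(byte)  # keep whitespace inside a value verbatim
                elif buf:
                    yield bytes(buf)  # whitespace ends a top-level scalar
                    buf.clear()
                continue
            if depth == 0 and buf and byte in b'{["':
                yield bytes(buf)      # scalar directly followed by a new value
                buf.clear()
            buf.append(byte)
            if byte == QUOTE:
                in_string = True
            elif byte in b"{[":
                depth += 1
            elif byte in b"}]":
                depth -= 1
                if depth == 0:        # a top-level object or array just ended
                    yield bytes(buf)
                    buf.clear()
    if buf:
        yield bytes(buf)              # trailing scalar at end of input


for raw_json in split_raw_values([b'{"a": 1}{"b"', b': [2, 3]} 4 tr', b'ue']):
    print(raw_json)
```

State is carried across chunks, so a value may span several reads, which covers the counting and multiplexing use-cases without building any Python objects.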

skeggse added the feature label Aug 16, 2021

skeggse closed this as completed Nov 10, 2021
