Output items sequence as original bytes #56
Comments
Thanks @skeggse for the interesting proposal, and sorry for the delay on this initial response, busy days. First things first, let me rephrase your idea to make sure I'm understanding correctly. You basically want something like this:

    data = b'{}{}'
    for raw_json in ijson.new_method_you_want(data):
        ...  # raw_json is b'{}' each time

Is that a fair depiction of what you're looking for? As you mentioned, a way to currently achieve this is doing something like:

    data = b'{}{}'
    for raw_json in map(json.dumps, ijson.items(data, '', multiple_values=True)):
        ...  # raw_json is '{}' each time

The drawback is that this indeed builds each document fully as a Python object just to dump it back into its string form. In the process you might also lose some information (but not necessarily). From the point of view of ijson and its inner workings here are some thoughts:
In summary, I think this is simply not exactly possible because of the restrictions imposed by the underlying parsing technologies we use, and even if it was, at least for some of the backends, it would be too much effort for little gain.

Okay, I might just pull in a parser backend. It seems like a relatively simple parser atop a tokenizer would be sufficient. Thanks for your consideration!
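
For readers wondering what such a splitter might look like, here is a minimal sketch in pure Python. It is not part of ijson, and the name `split_json_values` is made up for illustration. It assumes the stream is well-formed and that every top-level value is an object or an array (top-level scalars such as `1` or `"x"` are not handled). It only tracks string state and nesting depth, so no Python objects are built for the values themselves.

```python
def split_json_values(chunks):
    """Yield the raw bytes of each top-level JSON object or array.

    Hypothetical helper: assumes well-formed input whose top-level
    values are all objects or arrays.
    """
    current = bytearray()  # bytes of the value being collected
    depth = 0              # nesting depth of {} and []
    in_string = False      # inside a JSON string literal
    escaped = False        # previous byte was a backslash in a string
    for chunk in chunks:
        for byte in chunk:  # iterating over bytes yields ints
            if depth == 0:
                if byte in b'{[':
                    depth = 1
                    current.append(byte)
                # anything else between values (e.g. whitespace) is skipped
                continue
            current.append(byte)
            if in_string:
                if escaped:
                    escaped = False
                elif byte == 0x5C:  # backslash starts an escape
                    escaped = True
                elif byte == 0x22:  # unescaped quote ends the string
                    in_string = False
            elif byte == 0x22:
                in_string = True
            elif byte in b'{[':
                depth += 1
            elif byte in b'}]':
                depth -= 1
                if depth == 0:
                    yield bytes(current)
                    current.clear()


# Values may straddle chunk boundaries; braces inside strings are ignored.
parts = list(split_json_values([b'{"a": "}{"', b'} [1, {"b": 2}]{}']))
```

Because the function consumes an iterable of byte chunks, it can be fed from a file or socket read loop. A byte-at-a-time loop in Python is slow, so this illustrates the idea rather than a fast implementation.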
Is your feature request related to a problem? Please describe.

It'd be great if there were a way to rapidly break a large stream of multiple JSON values (i.e. the `multiple_values` option) into its constituent values. For use-cases where you just need to know e.g. the number of JSON values in a stream, or need to multiplex an incoming stream across threads, or simply substring match the entire raw JSON value without first interpreting it, this is a pretty useful feature. As a point of reference, some JSON libraries like Golang's support this out of the box: in that case, you can decode a JSON-containing byte array into a `json.RawMessage`, which just copies the byte array.

Describe the solution you'd like

I'd like some equivalent to `ijson.items` that simply produces the original bytes (possibly copied) instead of parsing the items themselves.

Describe alternatives you've considered

If I had full control over the production of these JSON streams, I could require that the output were newline-delimited. At present, this is not the case.

I think the current workaround is to run `jq -cM` in a subprocess and pipe the stream into `jq`, which will force sequences like `{}{}` to get produced as `{}\n{}\n`. I could try to reserialize the original items, but that doesn't always result in the desired behavior (and would probably be slower than the `jq` equivalent). This is an imperfect solution because it'll mangle the original bytes, which may not be the desired behavior when searching for item-level substring matches.
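
To make the "mangling" concern concrete, the sketch below compares reserialized values against the original text. It uses only the standard library's `json.JSONDecoder.raw_decode`, which returns the end index of each parsed value, so the original substring can be sliced out. The helper name `iter_raw_values` is invented for this example. Note that it still parses every value and needs the whole stream in memory as a `str`, so it illustrates the difference rather than solving the performance problem.

```python
import json


def iter_raw_values(text):
    """Yield the original text of each concatenated JSON value in `text`.

    Hypothetical helper built on the stdlib decoder; it parses each
    value but returns the untouched source substring.
    """
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        if text[pos] in " \t\r\n":  # raw_decode does not skip whitespace
            pos += 1
            continue
        _, end = decoder.raw_decode(text, pos)
        yield text[pos:end]
        pos = end


stream = '{"n": 1.0e2}{"s":"\\u00e9"} [1,2]'
raw = list(iter_raw_values(stream))

# Round-tripping through Python objects changes number formatting and spacing.
reserialized = [json.dumps(json.loads(value)) for value in raw]
```

Here `raw[0]` keeps the `1.0e2` spelling, while `reserialized[0]` does not, so a substring search for `1.0e2` only succeeds against the original text.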