Feature request: Return bytes at prefix #43

jpmckinney · 2021-02-27T00:26:06Z

I don't know if this is too narrow a use case for this library, or if there is another way to do this.

I work with large remote JSON objects. If I know a JSON key occurs near the start of the file, I might download only a few kilobytes, and then use ijson.items to read that key. ijson will not raise an IncompleteJSONError, because it never had to read to the end of the file. So far, so good.

While most of these JSON objects are standardized, some are not. For example, a publisher might wrap a standard object inside another object, like {"result": { standardized content } }. My data pipeline currently handles this, by being able to set the prefix for a given remote file (in this case result). A pipeline step returns the data at that prefix, using ijson.items.

When I combine these two tactics, I run into trouble. ijson.items(data, 'result') tries to read the entire result key, but the JSON is incomplete, so it raises an error.

One solution is to collapse the two steps into one, i.e. index to result.some_key in one step. In my case, this is undesirable, because then I'll have multiple pipelines using similar code, which is more work to maintain.

Another solution might be to have a method that just returns the bytes from the given prefix, without parsing them. (Since the prefix might include one or more items, I guess the return value would be a string of bytes.) In that case, I could then use ijson.items as usual on one of the return values.

rtobar · 2021-03-02T04:30:16Z

@jpmckinney if I understand your use case correctly from the description above, you basically want to iterate over your target object in the JSON content twice: once just to extract its raw bytes from the stream, then to actually parse it.

The feature you propose is actually far from trivial to implement. Almost all (if not all) parsing backends return already-parsed values, so in order to return the bytes at the given prefix we'd have to reconstruct the original bytes from the given values, which is not always possible. Even if all backends gave us access to the underlying bytes and offsets needed to keep track of the original contents, note that "return the bytes from the given prefix, without parsing them" is self-contradictory: there is no way to detect the end of a prefix's content without actually parsing the content coming after it. You can avoid creating the actual objects, but the parsing and lexing will still be necessary. Finally, even if all of this were implemented, you'd end up parsing the JSON content twice. In your case it might not be much of a concern given that you are dealing with small amounts, but it just doesn't sound necessary.
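To illustrate the point about lexing, here is a rough standard-library sketch, not ijson code. The function `value_end` and the sample document are hypothetical. Even a routine that only wants the raw bytes of a value has to track strings, escapes, and nesting to know where that value stops.

```python
def value_end(data: bytes, start: int) -> int:
    """Return the index just past the JSON object or array beginning at start.

    Raises ValueError if the data ends before that container closes.
    """
    if data[start:start + 1] not in (b"{", b"["):
        raise ValueError("expected an object or array at start")
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(data)):
        ch = data[i]
        if in_string:
            if escaped:
                escaped = False      # character after a backslash is literal
            elif ch == 0x5C:         # backslash starts an escape
                escaped = True
            elif ch == 0x22:         # unescaped quote closes the string
                in_string = False
        elif ch == 0x22:
            in_string = True
        elif ch in b"{[":
            depth += 1
        elif ch in b"}]":
            depth -= 1
            if depth == 0:
                return i + 1
    raise ValueError("input ended before the value closed")


doc = b'{"result": {"note": "a } in a string", "n": [1, 2]}, "x": 1}'
start = doc.index(b"{", 1)           # offset of the object stored under "result"
raw = doc[start:value_end(doc, start)]
```

The brace inside the string shows why a plain search for the closing delimiter is not enough. On a truncated download the function can only report that the value never closed, which is the same situation `ijson.items` runs into.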

All in all this sounds like a huge amount of work, with a high chance of not being possible to implement in the first place, and for little benefit. So unfortunately I'll have to mark this as "wontfix".

Back to your problem at hand, it's not entirely clear to me why collapsing the two steps into one is undesirable. On the one hand you mentioned you already have some code that accepts the extra prefix, but on the other you say that this results in having to maintain "multiple pipelines". I'm not saying you don't have valid reasons, it's just that I don't understand them with the given information. A third alternative worth considering is using events interception: use ijson.parse to consume all the events leading to the beginning of the extra prefix element and ignore them, then pass the rest to ijson.items with your full prefix. I'm not sure if this falls into your category of collapsing two steps into one, so it might be as undesirable for your pipeline design as the first option you proposed.

jpmckinney · 2021-03-03T13:38:01Z

Thanks for the explanation! I’m convinced it will be simpler to just collapse the two steps in the end :)

rtobar added the "feature" and "wontfix" (This will not be worked on) labels Mar 2, 2021

rtobar closed this as completed Mar 2, 2021