Feature request: Return bytes at prefix #43

Closed
jpmckinney opened this issue Feb 27, 2021 · 2 comments
Labels
feature, wontfix (This will not be worked on)

Comments

@jpmckinney

I don't know if this is too narrow a use case for this library, or if there is another way to do this.

I work with large remote JSON objects. If I know a JSON key occurs near the start of the file, I might download only a few kilobytes, and then use ijson.items to read that key. ijson will not raise an IncompleteJSONError, because it never had to read to the end of the file. So far, so good.

While most of these JSON objects are standardized, some are not. For example, a publisher might wrap a standard object inside another object, like {"result": { standardized content } }. My data pipeline currently handles this by letting me set the prefix for a given remote file (in this case result). A pipeline step returns the data at that prefix, using ijson.items.

When I combine these two tactics, I run into trouble. ijson.items(data, 'result') tries to read the entire value of the result key, but the JSON is incomplete, so it raises an error.

One solution is to collapse the two steps into one, i.e. index to result.some_key in one step. In my case, this is undesirable, because then I'll have multiple pipelines using similar code, which is more work to maintain.

Another solution might be to have a method that just returns the bytes from the given prefix, without parsing them. (Since the prefix might match one or more items, I guess the return value would be a string of bytes.) In that case, I could then use ijson.items as usual on one of the return values.

rtobar added the feature and wontfix labels Mar 2, 2021
@rtobar

rtobar commented Mar 2, 2021

@jpmckinney if I understand your use case correctly from the description above, you basically want to iterate over your target object in the JSON content twice: once just to extract its raw bytes from the stream, then to actually parse it.

The feature you propose is actually far from trivial to implement. Almost all (if not all) parsing backends return already-parsed values, so in order to return the bytes at the given prefix we'd have to reconstruct the original bytes from those values, which is not always possible. Even if all backends gave us access to the underlying bytes and offsets required to keep track of the original contents, note that "return the bytes from the given prefix, without parsing them" is self-contradictory: there is no way to detect the end of a prefix's content without actually parsing the content coming after it. You can get away without creating the actual objects, but the lexing and parsing will still be necessary. Finally, even if all of this were implemented, you'd end up parsing the JSON content twice. In your case it might not be much of a concern given that you are dealing with small amounts of data, but it just doesn't sound necessary.

All in all this sounds like a huge amount of work, with a high chance of not being possible to implement in the first place, and for little benefit. So unfortunately I'll have to mark this as "wontfix".

Back to your problem at hand, it's not entirely clear to me why collapsing the two steps into one is undesirable. On the one hand you mention you already have some code that accepts the extra prefix, but on the other you say that this results in having to maintain "multiple pipelines". I'm not saying you don't have valid reasons, it's just that I don't understand them with the given information. A third alternative worth considering is events interception: use ijson.parse to consume and ignore all the events leading up to the beginning of the extra prefix element, then pass the rest to ijson.items with your full prefix. I'm not sure if this falls into your category of collapsing two steps into one, so it might be as undesirable for your pipeline design as the first option you proposed.

rtobar closed this as completed Mar 2, 2021
@jpmckinney

Thanks for the explanation! I’m convinced it will be simpler to just collapse the two steps in the end :)
