Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ijson.items(file, prefix) waits for EOF #104

Closed
green-green-avk opened this issue Oct 3, 2023 · 8 comments
Closed

ijson.items(file, prefix) waits for EOF #104

green-green-avk opened this issue Oct 3, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@green-green-avk
Copy link

Describe the bug
ijson.items(file, prefix) waits for EOF before yielding any values.

How to reproduce
The input (file object f) represents an infinite stream of JSON objects: {"a":1} {"b":2} ...

gen = ijson.items(f, "", multiple_values=True)
for v in gen:
    print(v)

Expected behavior
Yield a value as soon as an object is parsed. Don't wait for EOF.

Execution information:

  • Python version: 3.11.2
  • ijson version: 3.2.3
  • ijson backend: python
  • ijson installation: pip
  • OS: Linux

Additional context
It looks similar to #72 but I see no reason to use any async here as it is not an asynchronous situation at all: just a simplest stream pull parser use case.

@green-green-avk green-green-avk added the bug Something isn't working label Oct 3, 2023
@rtobar
Copy link

rtobar commented Oct 3, 2023

Thanks @green-green-avk for the interesting report. Do you have an example file that can be used for testing this? In particular, I'd like to understand/confirm that you really don't get any values yielded until you hit the EOF. How are 100% certain of that? The example code looks correct of course, so I suppose your report is based on observations of some kind.

If f is an actual file on disk, and you opened it via open, you could try adding a print(f.tell()) in your for loop to see how much the file has been advanced every time you get a value.

@rtobar
Copy link

rtobar commented Oct 3, 2023

Also: does this happen with other backends? Just noticed this was reported for the python backend, so maybe (hopefully!) ithe issue, if there's one, is specific to this backend.

@green-green-avk
Copy link
Author

Uh. Huh. It's about streams, not files at all.

Please, consider this one-liner repro case:

python3 -c "$(echo -en 'import time\nwhile(True): print("{\"test\":1}", flush=True);time.sleep(1)')" | python3 -c "$(echo -en 'import sys\nimport ijson\nfor v in ijson.items(sys.stdin.buffer, "", multiple_values=True): print(v)')"

@green-green-avk
Copy link
Author

Ouch, it seems, I missed buf_size=1 stanza so bad here...

It works:

python3 -c "$(echo -en 'import time\nwhile(True): print("{\"test\":1}", flush=True);time.sleep(1)')" | python3 -c "$(echo -en 'import sys\nimport ijson\nfor v in ijson.items(sys.stdin.buffer, "", multiple_values=True, buf_size=1): print(v)')"

However, I think, the exact semantics of the buf_size option deserves to be documented.

@rtobar
Copy link

rtobar commented Oct 3, 2023

@green-green-avk thanks for that reproducer and the clarification on it being a continuous, potentially infinite stream of data (you did mention it in the original description, and I was a bit puzzled about what you exactly meant, but didn't ask further). In any case, it being a "stream" or a file on disk is irrelevant: ijson is presented with a file object regardless, and the f.tell() test should still be possible (haven't tested myself though, don't have a computer at hand ATM).

Am I understanding correctly that there is actually no issue, and that the actual problem was that your stream was generating data in chunks smaller than the default buf_size of 64kB? And not only that, but also that your streams in total were smaller than 64kB, since you were getting values only after writing your whole stream? If that's the case, could you please go ahead and close this bug report to keep things tidy? If I'm misinterpreting your last comment and there is a bug, then you'll need to wait until I get my hands on a computer with some time to check this.

To clarify: my understanding is that ijson was blocking when reading data from stdin via sys.stdin.buffer.read(64k). The buffer object is a io.BufferedIOBase whose read method tries to read as much data as requested (or until EOF). That's not ijson's fault, just the behaviour of that particular file object, although I understand it could be misleading and confusing.

On the matter of the documentation of buf_size, I'd be interested in hearing some thoughts on how to improve it. I would have said it was good enough, but it clearly isn't since it's leading to confusions. Maybe the confusion arose mostly around the particulars of reading from sys.stdin.buffer as explained above, so maybe some words of caution could be thrown in the documentation.

@green-green-avk
Copy link
Author

green-green-avk commented Oct 3, 2023

Aha!

In this case, we just need to add an option to use read1() instead. This way we could avoid the performance impact of buf_size=1 and preserve an ability to work with data streaming.

By data streams, I mean cases such as a network socket with small portions of data passing each several seconds. ijson must be able yield parsed objects as soon as they are available, without waiting for the buffer to be full.

@rtobar
Copy link

rtobar commented Oct 3, 2023

Or try with sys.stdin.buffer.raw (if that exists). Or choose a buf_size that better matches your situation.

I'd try to exhaust your options before jumping in and trying to implement a new feature that seems to cover only a very minor corner-case, but would take some effort to get right (and thus I'd be a bit unwilling to implement myself, although I'd be happy to review PRs). If you really want to go down that route, please let's track that on a separate issue to keep things separate.

Also, could I ask again if you could address the questions I posed in my previous comment? I want to confirm there's no actual issue in ijson ATM, and that this is mostly about managing expectations when using stdin.

@rtobar
Copy link

rtobar commented Oct 11, 2023

Haven't heard back in a week for further feedback, and the issue wasn't really a problem with ijson itself, so I'm closing this one.

@rtobar rtobar closed this as completed Oct 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants