Speed up parsing -- delay string and number parsing? #1857
Well, as I have mentioned on the corresponding PR #1752, parsing a string into a decimal representation requires an order of magnitude fewer computations than parsing it into binary floating point. Early in my implementation I actually pushed a literal-number PR which cached exactly the original number string on the jv_number (hence the name of the branch), but we concluded that this technique is too restrictive in terms of what can further be accomplished with this string representation, and since then I've reimplemented the feature with the support of the

I think that @wtlangford was going to run some performance tests on this branch, but I'm not sure if he has gotten around to actually doing that.
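As an aside, the decimal-vs-binary distinction being discussed can be sketched in Python (illustrative only; jq's actual decimal support uses decNumber in C, not Python's `decimal`):

```python
from decimal import Decimal

# A decimal representation keeps the literal digits exactly as written,
# while binary floating point must round to the nearest IEEE-754 double.
literal = "0.120"
as_decimal = Decimal(literal)   # preserves the original text's digits
as_float = float(literal)       # rounded to the nearest double

print(str(as_decimal))  # "0.120" -- the original spelling survives
print(repr(as_float))   # "0.12"  -- the trailing zero is gone
```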
@leonid-s-usov BTW, thank you for your patience! So sorry this is taking so long. And, yes, I had that work in mind.
These aren't the only options. If we utilize
Yeah. Another thing I've been wanting to do is to add support for binary values, which would appear to jq programs to be no different from arrays (of numbers, all integers between 0 and 255 inclusive) but would actually be backed by a plain buffer. The cache could be of that sort. (On output, binary values could either print as arrays of small integers, or as base64-encoded strings. That could be a global option.) A
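The two proposed output modes for such a buffer-backed binary value can be sketched as follows (a Python illustration of the idea, not jq's jv internals; the variable names are hypothetical):

```python
import base64

# Hypothetical "binary value": backed by a plain byte buffer rather
# than an array of per-element number cells.
buf = bytes([0, 127, 255, 10])

# Output mode 1: print as an array of integers in 0..255.
as_array = list(buf)
print(as_array)  # [0, 127, 255, 10]

# Output mode 2: print as a base64-encoded string.
as_b64 = base64.b64encode(buf).decode("ascii")
print(as_b64)
```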
Wait, but binary values would need to be embedded into the input JSON somehow, wouldn't they? How would a binary input be referenced from the input JSON?
@leonid-s-usov the idea is that you could:
What would be the consequence of such lazy parsing on an expression like "(.a)" for an input like {"a": 0.0101010101101001001010101011010101111110101010}? In general, is it always required to convert the value? I feel like values could be passed through with no interpretation.
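The concern here can be made concrete: with the default parse, a long numeric literal is rounded to a double, while a literal-preserving parse could feed interpolation the original digits. A Python illustration (jq itself would use its own decimal representation, not Python's `Decimal`):

```python
import json
from decimal import Decimal

doc = '{"a": 0.0101010101101001001010101011010101111110101010}'

# Default: the number is converted to a double; most digits are lost.
lossy = json.loads(doc)["a"]

# A literal-preserving parse keeps the token's exact digits, so string
# interpolation could emit them unchanged.
exact = json.loads(doc, parse_float=Decimal)["a"]
print(str(exact))  # the full original literal
```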
Speaking of binary data, the restriction that
@vdukhovni You might be interested to know that gojq's
should recover B. I used this approach successfully to test gojq against https://www.w3.org/2001/06/utf-8-wrong/UTF-8-test.html :
Sorry, I think I may have given misleading input by providing just 0s and 1s. I consider jq a tool to transform input JSON to output JSON. E.g., normally I see the input {"a": 0.120, "b": 100.0} converted by the expression "." into {"a": 0.12, "b": 100}. I may understand why these conversions are done; anyway, this is another view. So when I saw this thread about delayed parsing, I was thinking that maybe with delayed parsing, in this particular case, no excessive conversions would be done. There is no transformation of the values whatsoever, so why couldn't jq just pass them through and yield all values exactly as they appear in the input? Similarly, if I were to use them in string interpolation, I would accept (or even expect?) "0.120" and "100.0", because why not? The input is pure text, string interpolation requires the values to be text, and the text is already there; no conversion would be needed.
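The canonicalization being described happens at parse time, before any filter runs, which is why even the identity program changes the text. A Python round trip shows the same effect (Python keeps `100.0` where jq would print `100`, but the loss of the original spelling is the same):

```python
import json

# A plain parse/serialize round trip canonicalizes numbers: the
# spellings "0.120" and "100.0" are lost the moment the text is
# parsed into doubles, so the identity program cannot restore them.
src = '{"a": 0.120, "b": 100.0}'
out = json.dumps(json.loads(src))
print(out)  # {"a": 0.12, "b": 100.0}
```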
No, delayed parsing is simply lazy parsing that avoids doing too much work on values that don't ultimately get used. The output must still be canonical. So numbers that are syntactically valid may remain as raw strings until their numeric value is needed, but outputting the value is such a "need". Similarly, strings could remain as escaped raw input until first use. But since keys can be computed on the fly, that's ETOOHARD, I think. Perhaps with streaming (if fast enough) we could find a way to filter the stream for the wanted keys quickly enough to then reassemble just the desired object fragments: some predicate on the stream piped to
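The "raw string until the numeric value is needed" idea can be sketched minimally (Python illustration; the class name is hypothetical and this is not jq's jv representation):

```python
# A lazily parsed number: the raw token is kept verbatim and only
# converted to a machine number when its numeric value is needed.
class LazyNumber:
    def __init__(self, raw: str):
        self.raw = raw      # token validated at scan time, not converted
        self._value = None  # conversion deferred

    def value(self) -> float:
        if self._value is None:     # convert on first use only
            self._value = float(self.raw)
        return self._value

n = LazyNumber("0.120")
print(n.raw)      # "0.120" -- original text still available
print(n.value())  # 0.12   -- converted only now
```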
We have to validate the input, so we could avoid allocations for objects/arrays until they're needed, but when we then do those allocations we'd be wasting cycles on re-validation, unless... unless we use a SIMD parser to validate and retain all the offsets to all the things, and then allocate them only when they are "observed" or altered with any of

But all the SIMD JSON parsers I've seen are non-incremental, and the jq processor absolutely needs an incremental JSON parser. There are cases where we could use a non-incremental JSON parser because we know the size of the text a priori and have it in memory already (e.g., in

I suspect that trying to be lazy about parsing in general will mainly just add too many branches in
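The "validate once, record offsets, materialize on observation" scheme can be sketched like this (a deliberately naive Python illustration; real SIMD parsers such as simdjson build a full structural index, and the regex here would also match digits inside string literals in general input):

```python
import json
import re

# Simplified number-token pattern; fine for this sample input only.
NUMBER = re.compile(r'-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?')

def index_numbers(text: str):
    json.loads(text)  # validate the whole document up front
    # Retain only (start, end) offsets; defer conversion entirely.
    return [m.span() for m in NUMBER.finditer(text)]

text = '{"a": 0.120, "b": 100.0}'
spans = index_numbers(text)

# Materialize only the second number when it is "observed":
start, end = spans[1]
print(float(text[start:end]))  # 100.0
```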
We've talked about this before, especially in the context of printing numbers exactly as they were input.
@vdukhovni was telling me about a Haskell JSON parser that does this, so if you have large JSON texts and you're interested in some tiny portion, you don't need to pay the full cost of string and number parsing for all of the bits you don't need.
Realistically, we'd have to either still check for invalid strings and numbers at parse time, or else need a new option to enable lazy parsing (delayed parse errors would be surprising). And we still need to do enough work at parse time to find the start and end positions of any numbers or strings.
(Another option too is to hold on to a reference on the input JSON text and have the delayed-parse values refer to substrings of that, thus limiting memory pressure. But this wouldn't save that much allocation (arrays and objects would still need to be allocated), and wouldn't help the streaming parser (which wants to hold on to nothing in order to have a very small memory footprint). So I'd say no to this option.)
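The memory-pressure trade-off in that parenthetical is the classic zero-copy pitfall, illustrated here with Python's buffer protocol (the offsets and sample text are made up for the example):

```python
# Delayed-parse values holding references into the input buffer:
# slicing a memoryview copies nothing, but it keeps the entire
# underlying buffer alive -- the memory-pressure concern above.
data = bytearray(b'{"name": "very long json text ..."}')
view = memoryview(data)

token = view[10:33]   # zero-copy reference into the input buffer
print(bytes(token))   # the referenced substring

# The full `data` buffer cannot be released (or even resized) while
# `token` is alive, no matter how small the slice is.
```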
Speaking of the streaming parser, it is actually slower than the non-streaming parser! We should disentangle the two parsers and make the streaming parser reuse all of the path and output value arrays it uses.