Async methods for JSON.parse and JSON.stringify #7543
Comments
So much for "don't block the event loop".
There are already a lot of third-party modules that do streaming JSON parsing. There's no reason to include one in node.js itself: $ npm search json stream parse
NAME DESCRIPTION AUTHOR DATE VERSION KEYWORDS
baucis-json Baucis uses this to parse and format streams of JSON. =wprl 2014-05-01 1.0.0-prer… baucis stream json parse parser format
clarinet SAX based evented streaming JSON parser in JavaScript… =dscape =thejh 2014-02-17 0.8.1 sax json parser stream streaming event events emitter async streamer browser
clarinet-object-stream Wrap the Clarinet JSON parser with an object stream: JSON… =exratione 2014-02-19 0.0.3 clarinet json stream
csv-string PARSE and STRINGIFY for CSV strings. It's like JSON object… =touv 2013-12-09 2.1.1 csv parser string generator
csv2json-stream Streaming parser csv to json =mrdnk 2013-03-03 0.1.2 csv2json csv json parser
dave JSON Hypermedia API Language stream parser for Node.js. =isaac 2012-10-19 0.0.0 hal json hypermedia rest
dummy-streaming-array-parser Dummy Parser for streaming JSON as actual JSON Array =floby 2013-07-05 1.0.1 streaming stream json parser
fast `fast` is a very small JSON over TCP messaging framework.… =mcavage 2014-01-30 0.3.8
jaxon Jaxon is a sequential access, event-driven JSON parser… =deyles 2013-08-11 0.0.3 JSON parser stream parser SAX
jlick Streaming configurably terminated (simple) JSON parser =deoxxa 2013-03-14 0.0.4 json parse stream newline split whitespace
json-- A streaming JSON parser that sometimes might be faster than… =alFReD-NSH 2012-08-21 0.0.2 JSON
json-body Concat stream and parse JSON =tellnes 2013-08-09 0.0.0 json body parse stream concat
json-parse-stream streaming json parser =chrisdickinson 2014-04-14 0.0.2 json parse readable stream
json-parser-stream JSON.parse transform stream =nisaacson 2014-01-09 0.1.0 stream json parse
json-scrape scrape json from messy input streams =substack 2012-09-12 0.0.2 json scrape parse
json-stream New line-delimeted JSON parser with a stream interface =mmalecki 2013-06-05 0.2.0
json-stream2 JSON.parse and JSON.stringify wrapped in a node.js stream =jasonkuhrt 2013-10-23 0.0.3 json stream
json2csv-stream Transform stream data from json to csv =zemirco 2013-11-08 0.1.2 json csv stream parse json2csv convert transform
jsonparse This is a pure-js JSON streaming parser for node.js =creationix… 2014-01-31 0.0.6
jsons Transform stream for parsing and stringifying JSON =uggedal 2013-11-01 0.1.1 json stringify parse stream transform array
jsonsp JSON stream parser for Node.js. =jaredhanson 2012-09-09 0.2.0 json
JSONStream rawStream.pipe(JSONStream.parse()).pipe(streamOfObjects) =dominictarr 2014-04-30 0.7.3 json stream streaming parser async parsing
jstream Continously reads in JSON and outputs Javascript objects. =fent 2014-03-13 0.2.7 stream json parse api
jstream2 rawStream.pipe(JSONStream.parse()).pipe(streamOfObjects) =tjholowaychuk 2013-03-19 0.4.4
jsuck Streaming (optionally) newline/whitespace delimited JSON… =deoxxa 2013-03-24 0.0.4 json parse stream newline split whitespace
kazoo streaming json parser with the interface of clarinet but… =soldair 2012-10-19 0.0.0
ldjson-csv streaming csv to line delimited json parser =maxogden 2013-08-30 0.0.2
ldjson-stream streaming line delimited json parser + serializer =maxogden 2013-08-30 0.0.1
naptan-xml2json-parser Takes a stream of NaPTAN xml data and transforms it to a… =mrdnk 2012-12-17 0.0.2 NaPTAN stream parser
new-stream Parse and Stringify newline separated streams (including… =forbeslindesay 2013-07-07 1.0.0
oboe Oboe.js reads json, giving you the objects as they are… =joombar 2014-03-19 1.14.3 json parser stream progressive http sax event emitter async browser
parse2 event stream style JSON parse stream =bhurlow 2014-04-10 0.0.1 event stream through2 through stream object stream
regex-stream node.js stream module to use regular expressions to parse a… =jgoodall 2012-11-20 0.0.3 regex parser stream
rfc822-json Parses an RFC-822 message stream (standard email) into JSON… =andrewhallagan 2014-01-27 0.3.6 email json rfc 822 message stream
stream-json stream-json is a collection of node.js 0.10 stream… =elazutkin 2013-08-16 0.0.5 scanner lexer tokenizer parser
streamitems Simple stream parser. Emits 'item' and 'garbage'. Created… =pureppl 2012-04-25 0.0.0
svn-log-parser Parses SVN logs as into relevant JSON. =jswartwood 2012-08-02 0.2.0 svn parse stream json xml
through-json Through stream that parses each write as a JSON message. =mafintosh 2014-03-20 0.1.1 stream streams2 through json parse
through-parse parse json in a through stream, extracted from event… =hij1nx 2013-11-24 0.1.0 through streams parse json throughstream throughstreams streaming parser
tidepool-dexcom-stream Parse Dexcom text files into json. =cheddar… 2014-02-27 0.0.2 Dexcom export text json parser stream
to-string-stream stringify binary data transform stream =nisaacson 2014-01-09 0.1.0 stream json parse
wormhole A streaming message queue system for Node.JS focused on… =aikar 2011-09-24 3.0.0 message queue pass stream parser fast json
xmpp-ftw-item-parser Used to parse "standard" XMPP pubsub payloads both from… =lloydwatkin 2014-04-09 1.1.0 xmpp xmpp-ftw xml json rss atom activitystreams activitystrea.ms parse parser
I'm not asking for a streaming parser... I'm asking for an async parser that does the parse on the thread pool, so that it doesn't block the event loop... As it stands, many node-based API services can easily be DDoS'd by passing large JSON requests. Yes, you can check request/response size, but many times you want the large-ish JSON... we were seeing issues with parsing many requests involving data from twitter, for example. The long parse times are/were holding up the event loop. This is an area ripe for easy DDoS; having out-of-band parse/stringify should be the typical path.
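The blocking cost being described is easy to observe directly. Below is a minimal sketch of that kind of measurement; the payload size and shape are invented for illustration, not taken from the thread:

```javascript
// Build a large (illustrative) JSON payload, then measure how long the
// synchronous JSON.parse call keeps the event loop from doing anything else.
const payload = JSON.stringify({
  items: Array.from({ length: 100000 }, (_, i) => ({ id: i, name: 'item-' + i })),
});

const start = process.hrtime();
const parsed = JSON.parse(payload); // synchronous: no I/O callbacks run meanwhile
const [sec, nsec] = process.hrtime(start);
const blockedMs = sec * 1e3 + nsec / 1e6;

console.log('payload bytes:', Buffer.byteLength(payload));
console.log('event loop blocked for ~' + blockedMs.toFixed(1) + ' ms');
```

Numbers vary by machine and V8 version, which is exactly the point: the latency is paid on the main thread, in full, per request.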
I see what you're saying... but the reality is that there are lots of things in JavaScript that can block the event loop, and I think being educated about it, and knowing when to use something async vs. something sync, is the better way to go about it. We're not going to start adding magical sync-looking-but-really-async functions to node to try to prevent users from shooting themselves in the foot. Personally, I'd recommend using generators/fibers/streamline/whatever to make your async code feel sync.
From a more architectural standpoint, yes, you may want the large-ish JSON object, but if it is large, you probably want to do some kind of streaming work on it while it's on the way in, rather than buffering the entire thing into memory and then parsing it. You can architect your app to be a lot more memory-efficient this way.
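A hedged sketch of that architecture, assuming newline-delimited JSON input (every name below is invented for illustration): each record is parsed as soon as it arrives, so no single JSON.parse call is ever large.

```javascript
// Minimal sketch of "work on it on the way in" for newline-delimited JSON:
// parse one small record per line instead of buffering the whole body.
function makeLineParser(onRecord) {
  let tail = '';
  return (chunk) => {
    const lines = (tail + chunk).split('\n');
    tail = lines.pop();                 // keep the partial last line for later
    for (const line of lines) {
      if (line.trim()) onRecord(JSON.parse(line)); // one small parse per record
    }
  };
}

const records = [];
const feed = makeLineParser((r) => records.push(r));
feed('{"id":1}\n{"id');                 // chunks can split records anywhere
feed('":2}\n{"id":3}\n');
console.log(records.map((r) => r.id)); // [ 1, 2, 3 ]
```

In a real server, `feed` would be called from a request stream's `'data'` events.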
I'm reopening this. So long as Node's internal IPC mechanism relies upon JSON, I think it's reasonable to explore a way to provide async JSON without changing the semantics of the runtime we're relying upon. That is to say, we want to use V8's… In any event, I'd entertain PRs from interested people, along with some good data behind some of the costs associated with it, since it won't necessarily be free.
The complexity of doing this isn't going to be trivial. IPC won't work… It's not like you could just uv_queue_work() and pass the char* and tell v8… Anyways, you get the idea. TBH I like the concept, but it won't be easy.
@trevnorris is it even possible to pass objects between isolates? |
@vkurchatkin this is an interesting question; in fact, no. But it could be possible with some relocation machinery. The question is how much work it will actually take, and whether there will be any speed benefit, because you may be doing a lot of double work with off-thread allocation and parsing. I'm going to look into it and experiment with it.
@indutny I think maybe we can use a 3rd-party parser off the main thread to do the heavy lifting, and then create the actual v8 values on the main thread from an intermediate representation
Isn't it all about the general discipline of working with node? If you want to do something resource-consuming, you should do it either in small chunks (streaming) or in another thread. What next? Async map/forEach/reduce?
@vkurchatkin What's costly? Is it parsing, or the construction of the result? If the cost comes primarily from the construction of the result and if we have no choice but do it in the main thread, it won't help much to offload just the parsing to a separate thread and build from an intermediate representation. My gut feeling is that parsing must be really cheap. There are two problems here: 1) JSON.parse takes CPU in the main thread and 2) JSON.parse blocks the event loop. We are trying to solve 1. Maybe we should solve 2 instead by keeping all the processing in the main loop but yielding periodically to the event loop. Not as good but maybe good enough. |
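The "yield periodically to the event loop" option can be sketched as follows. The function name and the 10 ms slice are invented for illustration, and the "work" here is parsing many small documents; a real incremental parser would resume mid-document rather than between documents:

```javascript
// Keep the work on the main thread, but yield to the event loop between
// time slices so other callbacks can run in between.
function parseManyAsync(strings, done) {
  const out = [];
  let i = 0;
  (function step() {
    const deadline = Date.now() + 10;            // ~10 ms slice (illustrative)
    while (i < strings.length && Date.now() < deadline) {
      out.push(JSON.parse(strings[i++]));
    }
    if (i < strings.length) setImmediate(step);  // yield, then continue
    else done(null, out);
  })();
}

parseManyAsync(['{"a":1}', '{"b":2}'], (err, objs) => {
  console.log(objs); // [ { a: 1 }, { b: 2 } ]
});
```

This trades total throughput for latency fairness: overall CPU usage stays the same, but no other request is stalled for more than one slice.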
@bjouhier that's the question. My hypothesis is that actual parsing is pretty costly and this approach is beneficial for large JSONs. Also values can possibly be created lazily. |
First of all, object sharing among Isolates is not possible (because of heap memory indexes, GC levels, and even optimizations).
For the above code, V8 optimizes the operation and "may not" do much (since we don't use a dict), and while doing this it doesn't check anything on the other Isolates. Eventually, sharing an object among Isolates produces unexpected results. BTW, it doesn't matter whether the external JSON parser etc. is efficient or not. You will end up blocking the main thread even more. JSON.stringify: … JSON.parse: …
This sounds a lot like the hilarious fibonacci benchmark, where just moving to continuable functions and… This sounds like people want streaming parsers, but are not throttling the connection / input to the streaming parser. Since Isolates are not able to share Objects across threads, we would have to reconstruct the object manually either way... I am unsure how you would speed this up. You could remove the minimal fluff while parsing out of thread (quotes, colons, braces, brackets, commas...), but that would not really save much time, since you would have to move the strings to be V8::Strings when you make an object. The inverse applies during stringification.
Maybe implement streaming JSON parser in node core then? |
@Dream707 why? It won't make it faster. |
@indutny but at least it won't block main thread |
You could do non-blocking stuff in user-land too. Also, it is pretty questionable how well it'll perform, especially considering flickering back and forth between the worker thread and the main loop.
Reasons why off-loading parsing to a thread won't help much have been outlined above by @obastemur and others. I suggest closing this. |
I'll let @tjfontaine decide on this one ;) |
@bnoordhuis do you mean "Lots of "Set"/"New" calls on the native side." ? Because you definitely can avoid these. |
That's not quite what I mean. There are two cost centers when parsing JSON: actual parsing and converting parsed input to JS values. You can farm out the first one to a worker thread but not the second one. I'm fairly confident (having profiled it with perf in the past) that of the two, rematerializing JS values is much more expensive than parsing, doesn't matter if you're going through the V8 API or not.

So while you can spend a lot of time optimizing the parser, the biggest cost center still runs inside your main thread. Never mind that the vagaries of thread scheduling mean you'll add variable (as opposed to deterministic) latency to your deserializer. (For the serializer, it's even worse because you cannot access V8 objects from outside the main thread. Ergo, it's not really possible to off-load the work.)

Having said that, here is a semi-plausible way of implementing an off-thread serializer / deserializer. The main thread would need to release the V8 isolate using a v8::Unlocker right before entering epoll_wait() and reacquire it after returning from the system call. That way, the other thread can acquire the isolate and start serializing or deserializing away.

However, that will only marginally improve matters, because if the system call returns before the worker thread is finished, the main thread will block until the worker is done. A secondary issue is that Locker and Unlocker objects are backed by a system mutex, and that opens the usual can of worms about unfavorable thread rescheduling when the contention rate is high.
I don't really see how this can help, if only one thread can work with isolate. That means that no js can run in parallel with parse/stringify (not even other parse/stringify). Or am I missing something? |
The original message was about making it async, not faster.
@Dream707 actually it is about blocking less. If you just want async, you can split the large JSON into pieces and feed it to a streaming parser using…
How is it done with databases that are not optimized for async operations? With file IO?
@bmeck. Good point about buffers. Parsing incrementally from a utf-8 buffer is not much more difficult than parsing incrementally from a string. |
I just pushed a… Another reason to move to C++.
Just curious. What part of this would possibly be faster if moved to C++? |
First, it should make it easy to bring the Buffer implementation on par with the string one, but I also see a number of micro-optimizations: eliminating bounds checking in the automata, allocating state frames on a free list, etc. But the proof of the pudding is in the eating!
FWIW, I made good progress on a C++ implementation of the incremental parser. Typical run output: …
So on average i-json is 1.63 times slower than JSON.parse on a full parse and 2.05 times slower on an incremental parse. This is significantly better than the JS implementation (was 2.65 and 2.8 times slower). Getting faster is starting to be challenging because JSON.parse is using internal V8 functions to build objects and optimize the allocation of ascii-only strings. The allocation of strings/objects/arrays and the assignments to array/object slots account for 66% of the overall processing time in the i-json C++ implementation (with my test data). The allocation of strings alone account for 38% of the time. |
For a large JSON request it would be possible to validate the POST request early and close it before uploading the rest of the 270kb if it contains an error. It's not so much about performance.
I would very much enjoy it if a streaming (or stream->buffering) JSON parser were included in the stdlib. I would like to be able to do something like the following, without any dependencies: echo '{"foo":42}' | node -e 'JSON.parseStream(process.stdin, function (err, o) { console.log(o.foo) })'
@tjfontaine Excellent decision to keep this open and explore the possibilities. Thanks. Blocking on… I see value in a non-blocking async API, where processing is backgrounded off the main thread and amortized across event-loop ticks: JSON.parseAsync(string, (err, parsedObject) => {}) Having a callback (non-stream) version is analogous to… One can conceive of a scenario where a JSON payload must be acted upon as a whole. In this case, streaming provides no benefit, and an async API avoids latency compounding in concurrent requests. Background-ability / deprioritization is a common pro-thread anti-node argument, so I see benefit in addressing this. That said, the value of a core async…
@bjouhier @tjfontaine I just wanted to echo @CrabDude in thanking you for keeping this open... I was able to offload work to other workers as a string, so that they don't pile up and block the main service loop, but it was really cumbersome, and much less than ideal. |
Huge +1 to tackle this in core. Streaming libs only work around the problem, and can work OK in many cases -- where the parts are small enough to individually be (de)serialized -- but the whole process is very inefficient. What makes a true… With most popular Node.js apps being web-based in nature, the value and impact of such an addition cannot be overstated. This would be a huge win.
Another +1 for this core addition. I understand that the current implementation is based in V8, but isn't that just a better reason to implement a core asynchronous version based in Node? Using synchronous JSON.parse from the V8 runtime should be considered bad style! The implementation details need not be so confusing... beyond whatever optimizations are found and decided, something like:
is clearly the convention. A few buffered parsing functions might also become handy. Maybe there is something I've missed; if so, please advise my thinking. Otherwise, the nature of the operation at hand (converting a string into a JavaScript object and vice versa) is absolutely a "core" function for NodeJS users. Edit: Is a pull request for a simple function in util.js, which just creates a child process for JSON.parse, a bad idea? Why?
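A userland sketch makes the limitation concrete. The wrapper below (parseAsync is an invented name, not a real API) has an async signature, but the parse still runs on, and blocks, the main thread; deferring with setImmediate only moves it to a later tick:

```javascript
// Naive "async" wrapper: async-shaped API, but the heavy JSON.parse call
// is still one uninterruptible chunk of main-thread work.
function parseAsync(str, cb) {
  setImmediate(() => {
    let result;
    try {
      result = JSON.parse(str); // still blocks the event loop while it runs
    } catch (err) {
      return cb(err);
    }
    cb(null, result);
  });
}

parseAsync('{"ok":true}', (err, obj) => console.log(err, obj));
```

A child process has the inverse problem: the parse happens elsewhere, but the result has to be serialized again to cross the process boundary, so nothing is saved.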
It's definitely not a good idea. You will have a parsed object in a different process. Then what? How do you pass it back? |
If it were that easy we wouldn't be requesting it in core ;-). |
And the same holds for parsing in a separate thread. V8 does not allow you to materialize the result directly from another thread (*). So you are a bit stuck. I see two ways to handle JSON parsing asynchronously:
Best would be to combine the two approaches to 1) materialize incrementally and 2) reduce CPU usage in the main thread. (*) Parsing is very different from compressing/decompressing or doing crypto computation because the result is a tree of JS objects, not a string or Buffer (that you can pass across threads). |
Another solution is to parse into a very efficient binary format (BSON or variant thereof) in a worker thread and then materialize from binary in the main thread. |
@vkurchatkin @asilvas @bjouhier: Fascinating. These insights are sensible and I see the challenge more clearly. If the primary block is memory-managing a JavaScript Object with unknown value characteristics, then working with BSON is quite compelling. However, I see an exciting implication arising. Although JSON is beautifully convenient, it may not be what I really need in my applications. Cool, thank you! Otherwise, am I overlooking some alternative to handling JSON in, for instance, REST API requests? Are most REST API's that exist in production suffering from this bottleneck? Is it common to hack together a parsing strategy? |
In theory BSON sounds like an interesting workaround (read: not a solution), but native BSON processing is actually far slower than JSON -- run the benchmarks if you like. Any solution involving serialization will likely be a nonstarter. Technically it might be possible to build a native module that handles passing the object references between threads (including thread safety) for serialization, but it'd probably be quite fragile unless it's in core. |
CL FTW |
@asilvas If BSON is slower than JSON, then maybe an ad hoc binary format could help. But still, I agree that this is only a poor workaround. It will only reduce CPU significantly in the main thread if parsing dominates materialization, which does not match my experience with i-json.
Any pointers here? I've been tracking this and so far all I've read is that only strings, buffers and typed arrays are transferable. If we could parse in a worker thread and transfer the result without copying it, the problem would be easy to solve (and it could be solved in userland). |
@bjouhier @asilvas I think the idea of doing incremental parsing in the main thread, after X amount of work resuming on I do know that generally speaking, 4K at a time would probably be a good place to start... not sure if JSON3 seems to be MIT licensed which might be a good place to start... |
@tracker1 I don't get how JSON3 would help. I looked at it and did not see any incremental parsing in it. Did you try i-json? It is incremental and you can control the size of the chunks. There is a C++ implementation for node.js and a fallback JS implementation. I did a quick bench of JSON3 against the i-json C++ parser on node 0.12 OSX: …
This might help squelch any thinking around stream-based serialization as being a good solution. https://github.com/asilvas/json-stream-bench Please feel free to add other parsers and/or tests. |
Granted, there is still active conversation around this, but introducing async JSON parsing and serialization into the core lib is likely not going to happen, and definitely would not happen in joyent/node. Going to close this issue. If anyone wishes to pursue this, nodejs/io.js or nodejs/node would be the appropriate venue.
Given that parse/stringify of large JSON objects can tie up the current thread, it would be nice to see async versions of these methods. I know that parseAsync and stringifyAsync are a bit alien in node.js; just the same, it is functionality that would be best served by a thread pool in a separate thread internally.
Also, it would be good to expose options for when to use a given level of fallback internally based on the size of the string/object. Ex: a 300kb JSON takes about 12-14ms to parse, holding up the main thread. This is an extreme example, but it all adds up.