
Added fast.json D library beating RapidJSON #46

Merged · 1 commit into kostya:master on Oct 13, 2015

Conversation

@mleise (Contributor) commented Oct 12, 2015

Since RapidJSON claims to be the fastest, I thought I'd accept the challenge. For better comparability I chose the GCC-based backend.

@kostya merged commit 9dbd4cf into kostya:master on Oct 13, 2015
kostya added a commit that referenced this pull request on Oct 13, 2015
@kostya (Owner) commented Oct 13, 2015

Wow, how is this so fast?

@mleise (Contributor, Author) commented Oct 14, 2015

When I started, I just scanned for object/array starts and ends with SSE and measured the time. It was somewhere between 0.1s and 0.2s. Then, when adding actual number parsing, string parsing and UTF-8 validation, I kept a close eye on my time budget. Some of the performance is gained by doing heuristics first. While parsing numbers you can calculate exactly whether an integer will overflow, or you can go faster by just comparing against a fixed value, still cover 99.99% of cases, and handle the rest outside of the hot loop. Or you can add heuristics to the white-space skipping function like RapidJSON did. SSE is fast when you have more than two bytes to skip, but white-space in JSON is often at most one character, so it pays to check for that before doing any SIMD processing. Things like force-inlining the white-space skipping function also made a 10% difference in overall speed at this level.
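To make the white-space point concrete, here is a rough sketch of that fast path in D (illustrative only, not the code actually used in fast.json):

```d
// A minimal sketch of the "scalar check before SIMD" idea: most gaps between
// JSON tokens are zero or one character wide, so a couple of cheap byte
// compares handle the common case and the vectorized loop only runs on
// longer runs of white-space.
const(char)* skipWhitespace(const(char)* p)
{
    // Fast path: zero or one white-space character, by far the common case.
    if (*p == ' ') p++;
    if (*p != ' ' && *p != '\t' && *p != '\n' && *p != '\r')
        return p;

    // Slow path: a longer run; this scalar loop stands in for the SSE code.
    do { p++; }
    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r');
    return p;
}
```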
What I added was UTF-8 validation of strings coming from files, as I believe you should always do that on external input. But by far the biggest difference (a factor of 2) is the parsing style. I know four ways to do it:

  1. Build a DOM and work on JSON data by adding and removing nodes. Fields have no static type but some kind of "Algebraic" or "Variant". You can set a string field to 0.1 and it will change to a number field behind the curtain.
  2. Pull parsing, where you can peek at the next value and then skip or consume it in one function call. For objects and arrays you pass a callback and continue processing in there. Fundamentally there is no document any more and processing is linear. On the other hand, you get rid of dynamic types in statically typed languages like C++ or D and have the freedom to skip values you are not interested in (see the interface sketch after this list). In particular, it is faster to just validate that a JSON number is correctly formatted than to convert it to a double.
  3. Pull parsing, where you receive tokens in the form (type, text). An object would start with (Type.object, "{"). The structure is flattened and you need to keep track of the nesting levels. You can work with the raw strings though, which is great if you want to reproduce the original input.
  4. Push parsing. Much like 3., but instead of you pulling things from the parser, you implement a set of functions, one for each type of token, and the parser calls you back for each.

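To illustrate option 2, a hypothetical pull-parsing interface might look roughly like this in D (all names here are made up for illustration; this is not fast.json's actual API):

```d
// Hypothetical option-2 interface: the caller peeks at the next value and
// either consumes or skips it, so no DOM is ever materialised.
enum JsonType { object, array, str, number, boolean, null_ }

interface JsonPullParser
{
    JsonType peek();                        // look at the next value's type
    void skipValue();                       // scan past it without converting
    double readDouble();                    // consume and convert a number
    string readString();                    // consume and decode a string
    void readObject(void delegate(string key) onKey);  // callback per key
    void readArray(void delegate() onElement);          // callback per element
}
```

A deserializer built on such an interface can call skipValue() on anything it does not care about and never pays for converting or allocating those parts.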
RapidJSON supports 1 and 4 as far as I can tell. 4 is far too cumbersome to use routinely and 1 adds unnecessary overhead in many cases. I decided to go with 2 and add a layer of convenience on top. For example, I used D's dispatch operator to make a JSON key lookup look like a property: json.coordinates gets rewritten into json.singleKey!("coordinates")(), a function that takes one compile-time argument and no run-time arguments and processes only that one key in an object while skipping over the others. Only inside read do I make use of D's garbage collector, while building the dynamic array of Coord structures. The low memory consumption also stems from the fact that I don't build a complete DOM; I map the file into memory in one go, often just reusing the disk cache as-is.
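A minimal sketch of how such a dispatch operator can work (the singleKey signature and return type here are only assumed from the description above; the body is omitted):

```d
// opDispatch is instantiated for any member name that is not a real field,
// so `json.coordinates` is rewritten by the compiler into
// `json.opDispatch!"coordinates"()`, which just forwards to singleKey.
struct Json
{
    auto opDispatch(string key)()
    {
        return singleKey!key();
    }

    // Processes exactly one key of the current object and skips all others.
    // The real implementation is omitted; only the shape of the call matters.
    Json singleKey(string key)()
    {
        Json result;
        // ... scan the object, descend into `key`, skip everything else ...
        return result;
    }
}
```

With that in place, json.coordinates and json.singleKey!"coordinates"() are literally the same call, resolved at compile time.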

On the downside, I did not validate the unused side structures. I think it is not necessary to validate data you are not using, so basically I only scan them just far enough to find where they end. Granted, it is a bit of an optimization for the benchmark, but it is actually handy in real life as well. After all, you could still raise the validation level to maximum if you really cared, or call one of the validation functions.

Thanks for adding fast.json. Being #1 feels good ;)

@kostya (Owner) commented Oct 14, 2015

Great, you should write an article about this and post it somewhere.

@miloyip (Contributor) commented Oct 22, 2015

I have fixed the file reading part and provided a SAX version for RapidJSON in #53.

Since I do not have the GNU D compiler on this machine yet, I cannot do a comparison directly.
The new SAX version should use much less memory, although writing such a handler is a little bit clumsy.

This comparison may not be entirely fair in some respects. For example, the RapidJSON test does not turn on UTF-8 validation, and fast.json does not validate the parts that are not used. Should the tests be adjusted?

@mleise (Contributor, Author) commented Oct 28, 2015

Good to see you, @miloyip. I had hoped it would take a while before you noticed that with SAX parsing you could do a lot better in this benchmark. Something is strange with the results though. I tried your push parser in a quick hack, just to see what to expect once you get around to improving your benchmark entry, and it was more than twice as fast as the DOM parser ... on my i5.

As for the validation, it seems that everyone has a slightly different idea of what needs to be validated, and as the authors of the parsers we know which kinds of validation cost us precious milliseconds. My stance was that external input needs to be validated down to the character encoding level. But since my parser is effectively just a one-way deserializer, errors in unused parts don't matter, as they can't propagate.

We could certainly arrange for our two parsers to do the same level of validation, but what about the other entries? Some "read file as string" functions perform UTF-8 validation already, others don't. I'm actually in favor of keeping things as is, since not validating unused parts gives fast.json such a huge advantage, while still catching any errors that could invalidate the program output.
