-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Improve performance of json_encode() #636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I selected master, but IMO this might as well go into 5.6. I'll leave that up to merger. |
Hi, I run few tests on your branch. I haven't seen any difference in decoding which I expected as there are no changes in the code. How did you get 5% and what could be the reason for that? It looks that you just removed utf16 check? Please have you got some tests that result in such improvement? I have done few small string tests that were more or less the same but I would probably have to do much more comprehensive testing for it. Then I tested I got a big decrease in performance:
YOUR PATCH
I tested the results with JSOND as well to make sure that the env is the same. As you can see the One more question: Have you compiled your tests with -O2 (disabled debug)? I think that such changes need to be properly tested with variable data. I could help with some testing as I do that for jsond anyway. However I don't think that it should go to 5.6 as it can bring some regressions if not properly tested. Thanks |
Very interesting results, thanks for testing. Decode performance was just an observation from my benchmarks and happened "by accident". I suspect it's due to the removed condition and due to better inlining since My build configuration for benchmarks is As a benchmark I've used these scripts: https://gist.github.com/Rican7/6457237 (small data, 50000 iterations) and https://gist.github.com/realityking/9877752 (large data, 5000 iterations) Here are results with the large data and default setting for
PHP-5.6 branch
I've also repeated your test (minus the jsond part as I don't have that build), and the results for encode go in the same direction as yours (67% worse) but interestingly decode is still slightly faster ;) PATCH
PHP-5.6 branch
I've also repeated your test with JSON_UNESCAPED_UNICODE and the results are much closer, but still in favor of current master. PATCH
PHP-5.6 branch
I suspect the big difference is that my tests were mostly ASCII, while the linked test goes to town with unicode (and manages to freeze TextWrangler). Looking at the code, this makes sense too. I basically traded the UTF-16 conversion for more costly encoding of unicode (I suspect the snprintf is a part of that), I'll see if I can optimize that part a bit. |
By eliminating the Default settings
JSON_UNESCAPED_UNICODE
So for JSON_UNESCAPED_UNICODE it's now faster than the PHP-5.6 branch, with default settings however it's still about 50% slower. Edit: Any for my benchmark script above it's now almost twice as fast. It seems do to rather well with longer strings. |
As suspected
That's around 15% than PHP-5.6 for me. My test case is now up to 90% faster. @bukka Could you rerun the test on your setup and see what results you now get? |
if (utf16) { | ||
efree(utf16); | ||
if (state != UTF8_ACCEPT) { | ||
buf->len = oldlen; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually the correct thing do to or am I causing some kind of side effect I'm not aware of?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should probably reset buf->c
before you set new len
otherwise null
can be written after the new buf->len
. It should be something like
buf->c -= oldlen - buf->len;
buf->len = oldlend;
It would be good to add some tests for it to check if it generates the same output when there is ill-formed utf8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior is already tested, for example here https://github.com/php/php-src/blob/master/ext/json/tests/bug43941.phpt
So apparently the append functions cover that, but I can probably add that line without harm. Will test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No sorry don't add that line. I just checked smart_str_appendc_ex
. It would screw things up. It's ok as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright I won't ;)
I am a bit busy right now but I will have a look and do the testing later... ;) |
}; | ||
|
||
static inline php_uint32 | ||
decode(php_uint32* state, php_uint32* codep, php_uint32 byte) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change the name of the function to prevent possible (future) collisions with php codebase. Something like php_json_utf8_decode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
I finally properly looked at the code and it looks nice! However you should probably add more tests to check if all UTF-8 encoding byte sequences are correct. Look to the Unicode Standard (it's table 3.7 in version 6.2). It would be also good to check surrogate pair the same way as it's in the current impl. I also run the test and the regression is fixed:
Then I would like to add it to jsond and do a bit more testing if it's ok with you? |
Thanks again for testing! As mentioned above, invalid data is already tested and I've encounter at least 1, 2 and 4 byte sequences in my testing. I don't see a check for surrogate pairs in the old code, what do you mean? As for jsond, feel free to pull this in but I'd like to keep this open separately. IMO it's a change with far less risk and could be merged earlier. |
You are adding two unicode escapes for the surrogate pair which looks correct. I was wondering how is this handled in the previous code. So it looks like a correct change but the encoded string will be different. I think that such changes should be at least added to the upgrade notes. In case someone did checksums for really weird json string... |
it works fine in both impl. I just tested ;) |
I plan to do some regression test on surrogates only which should cover perf so don't worry about that :) |
That's the reason the old code transcoded to UTF-16, it than looped over the individual bytes and created the unicode escape sequences (except for those in the ASCII range) for each byte. The new code now works based on codepoints, thus the code becomes a bit clearer (IMO). The output should be 100% identical between the old and the new implementation. |
Yeah I saw that after the post... :) I asked because I found a bug in decoding where it accepts invalid surrogate character. Just wanted to check the encoding is fine... ;) |
Now that jsond has been merged into master, is this perf improvement still relevant? |
Last I checked, jsond didn't change the encoder. I'll get the branch running on PHP7 in the next days to check. |
Can one of the admins verify this patch? |
I re-worked the patch ac2548e to verify (without merge conflicts) against the PHP-5.6.7 branch.
Result: No fuzz or merge conflicts' @ PHP-5.6.7 |
@bukka what do you think of this? |
The revised patch was applied and tested against the latest PHP branch @ 5.6.7 on a production PHP web server, and over 11 hours [time window], log events indicate that this revised json_encode() is without code issues. I have no solid benchmarks - other than seeing valid code using the revised json_encode() - as given above. |
I have got a modified version of this in the jsond ext ( https://github.com/bukka/php-jsond/blob/4ab59945459ed7bc121b323cfcaa3ab5ba7a2da8/jsond_encoder.c#L179 ). It copies data to the buffer ( I have got it on my list as one of the priorities and I would like to soon port it and compare it with the current master. I hope I will do that the next few weeks. I don’t think that this should go to 5.6 as it is not a bug fix and it is a bit invasive for a point release. It can go only to master (PHP 7) IMHO… |
Just a quick update I have ported this to PHP 7. I also ported the jsond version of that. Both commits can be seen here: Both of these were apparent improvements on PHP 5. However it's not the case for PHP 7 which got also some encoding optimization from Dmitry. I have done benchmarks for 531 various json instances and I see some decrease in performance (for both commits) compare to current master. However there is a big room for possible improvements so it might eventually get better. Before I start looking into it I would like to first improve my benchmark tools to get a more complete results. I would also like to do some real app (e.g. RESTful services) testing as the looping tests don't have to always get the correct idea about perf in real apps. |
@bukka |
already implemented in josnd |
Just to clarify. This is not a part of jsond that is in the core as it didn't perform as expected. As I said in my previous comment, it will require a bit more testing and improving which I have not had time yet... Anyway it will be different patch if anything so it's cool to close this. Cheers. |
Encoding with default settings is about 20-30% faster, encoding with JSON_UNESCAPED_UNICODE about 15% faster
Decoding is also ~5% faster
Encoding performance is achieved by avoiding the UTF-8 to UTF-16 conversation. The slight improvement to decoding performance is probably due to more inlining oppertunities as
json_utf8_to_utf16
is now only used when decoding.Binary size is reduced by 56 bytes.