-
Notifications
You must be signed in to change notification settings - Fork 15.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
C++ json decoding is too slow in 3.1.0 #2305
Comments
Can you also attach the proto definition and maybe also data for us to take a look? |
Sorry our code is not public so I cannot share the proto definition. But I confirmed that this is a problem on Linux too. Infact I had to kill the program after several tens of minutes. Fortunately I profiled the code and found the issue. This commit explains and solves the problem: na-ka-na@920f190. Timing went from 35 mins to 11 sec (almost 200x). |
Wow, thanks. I didn't know that the opensourced version of STLStringResizeUninitialized is that slow. Any way to fix STLStringResizeUninitialized? |
No portable way I know of since I can't find any std::string API which would allow for that. The comment on STLStringResizeUninitialized says so too. One sure shot way is to only call STLStringResizeUninitialized when necessary. |
I think StringOutputStream2 would work just as well as StringOutputStream inside Google. It just keeps an extra int64. If so, I'd recommend replacing all usage of StringOutputStream with StringOutputStream2. Never know what more performance issues are already lurking and/or will be prevented from happening in future. |
Any updates here? We would like to use this feature but I don't want to maintain my fork. Is there a fix coming? Shall I open a PR? |
I prefer a fix to the existing StringInputStream rather than creating a new one. Can you send a pull request for that? |
That was my first thought and I'm happy to do so. But I couldn't come up with a satisfactorily transparent fix for all clients of StringInputStream. Do you have ideas? Problem is that StringInputStream assumes that it does not own the string pointer passed to it. So theoretically after every mutating function call of StringInputStream (next, backup) - the string should be well formed. And doing that is very expensive. You can see the diff between StringOutputStream & 2 here: https://gist.github.com/na-ka-na/07db1eed4f8ba2bda5832c87c9338267/revisions. StringOutputStream2 has a different interface, it owns the buffer. Let me know your thoughts. |
When using StringInputStream, the string is only required to be well-formed after the destruction of the wrapping StringInputStream. Is that still expensive to ensure? |
Is that true? If it is, I can just do the adjustment in the destructor and it should work just fine. But I'm not sure if that assumption is true: // Create a StringOutputStream which appends bytes to the given string. Specifically the "Either be sure there's no further usage" part - makes me think that that assumption might not hold. |
Maybe there is a way to examine (and potentially fix) all current usages of StringOutputStream. But that similar to moving all usages of StringOutputStream to StringOutputStream2. |
Hmm, I hadn't noticed the "either be sure there is no further usage" part in the comment. To make things simpler I would prefer we just remove that and require the stream being destroyed before accessing the string content, but I don't know how many clients are relying on it though. I'll dig around and see if I can find some alternatives. |
Cool thanks. |
It turns out calling StringOutputStream::Next()/BackUp() repeatedly is very costly in opensource protobuf because it keeps resize() the string back and forth. The current JSON conversion API suffers this problem and leads to ridiculously long parsing time: protocolbuffers#2305 (comment) This change fixes the problem but caching the buffer of Next() and avoid calling BackUp() as much as possible.
Just had a top level glance: that might fix the problem! But IMO its not a good resolution because of following reasons:
|
What you mentioned are all true, but the problem can't be fixed unless we update StringOutputStream's implementation and that may not be easy because it's an API behavior change. There isn't any better way as far as I can see. Introducing another StringOutputStream2 implementation doesn't magically fix problems of the original StringOutputStream class and users of the original StringOutputStream will continue to have the same issue unless they do what I proposed in #2355. I believe #2355 is the safest solution to proceed. If you want to fix StringOutputStream instead, could you make a change to StringOutputStream directly? |
One way to fix StringOutputStream is for it to take a boolean parameter in its constructor. Default will be to keep current behavior. New clients can set the boolean to true to get optimized behavior at the cost of additional call to fix up the string before consumption. What do you think? |
Could you change StringOutputStream's behavior without adding a new parameter? I will patch the change in Google and verify whether there is any code relying on the old behavior. |
ok let me do that. |
Here it is, all make check tests are passing with this: na-ka-na@0f7070d. I had to modify proto_writer.cc too. If it looks okay I can open a PR. |
I can even remove the FixUp() function and make adapter_ a pointer in proto_writer. |
Hey any updates here. |
@na-ka-na Can you send it out as a PR and remove the FixUp function? I would like to avoid any new APIs. |
Ok will do. Busy today, might do it by tomorrow. |
Here's the PR: #2441 |
Hi, any update here? |
Ping, just bringing this up in your queue. |
Hey did you manage to test the PR internally inside Google? |
Yes, I have ran a Google global test. It's finished but I have problem viewing the test result. I'll retry the test if the result still doesn't show up tomorrow. I also sent the change to other teammates to review. Will post back here when I get more result/feedback. |
Some updates: I checked with our testing team and confirmed that there are too many failing tests caused by the change and the result webpage fails to show result because of timeouts when fetching data from test server. I know that you have put a lot of time on this issue, but it's unlikely we can push this change through. Sorry for that. :( We probably have to live with the current behavior of StringOutputStream and avoid inefficient usage pattern like Next()/BackUp()/Next()/BackUp()/.. wherever we use StringOutputStream. |
That is very unfortunate :(. But this is why my first instinct was to introduce a new class with fixed behavior, and over time clients could migrate as necessary. I can rename the fixed class to StringOutputStream2 and let StringOutputStream be, let me know. |
Hey this is still lingering. Can we push this through. If we cannot push PR #2441 (which is the best fix). I have validated two other potential fixes:
I think 1) is a better fix since it is fixing the underlying problem, more in line with PR #2441. But I good with 2) also. In one of our test cases, for a smaller 13MB json file, on linux, protobuf 3.1 takes 13sec, while both the fixes 1), 2) above take 900ms. |
oops, sorry I just realized I forgot to merge the pull request #2355 I sent out to fix the problem. It uses the same approach as you proposed in fix (2). Let me know if you still experience the slow parsing problem. |
See https://groups.google.com/forum/#!topic/protobuf/xUEdPWq5x_o for details. It is taking 35 mins to decode a 120MB json std::string into proto on Mac OS X El Capitan.
The text was updated successfully, but these errors were encountered: