Ruby assert can sometimes trigger, depending on garbage collector. #2004
Comments
If this problem is caused by a bug in protobuf code, it could be caused by a Ruby object that is temporarily not visible to the garbage collector. This means that it is not on the stack, and no other Ruby object holds a reference to it. That could happen if at some point the only reference to a Ruby object is in a C struct on the heap. |
Thanks for all the work you've done on this already! If we have a deterministic crash, can we run it under Valgrind? That would put us in a great position to understand what exactly is going on. |
Maybe? I'm not exactly familiar with what's involved in running the Ruby VM under Valgrind, and my attempts so far haven't succeeded. |
Hmm. I know Ruby throws a bunch of Valgrind errors even in normal operation. But did the crash itself not trigger under Valgrind? |
Let me say it differently: I can't even get there. Valgrind segfaults while [...]
|
Ok. In that case I'll probably need to look deeper into the repro and try to debug it. How can I reproduce this? |
Is something missing from the repro case in the report? |
Ah sorry, I missed that you attached a zip file! I'll take a look as soon as I can. Hopefully today. |
So with this Dockerfile.txt, I can reproduce the VM crash deterministically. And I can also run valgrind on it. But under valgrind, there's no crash.
And nothing happens. The code just loops forever trying to trigger the corruption. Or maybe my machine isn't strong enough to support valgrind under such heavy load and never gets to the state where it fails. |
I take that back. The same crash reproduces properly under Valgrind. It just takes a very long time:
|
After adding --track-origins, we also get this:
|
The attached PR fixes the crash for me. |
@haberman Hi, when do you plan to publish the gem with this fix? I am struggling with this error in production. |
@xfxyjwf Hi, is it possible to give us a timeline for the release of this fix? |
Sorry I didn't get to this yesterday. I will work on the release packages
|
@haberman Great! Thanks! |
Updated packages are now available on RubyGems. |
@haberman Thanks! |
Basically, we were investigating this: grpc/grpc#7661
Our investigation led us to realize that this assert in the protobuf code is being triggered, but only if the garbage collector has been exercised enough: https://github.com/google/protobuf/blob/master/ruby/ext/google/protobuf_c/map.c#L74
If the garbage collector is really under heavy stress, we can even produce a VM crash: http://pastebin.com/hzgHPJGq
I have included a zip file with our current reproduction case: ruby-repro.zip. Right now, this crashes every version of Ruby I've been able to try it with. The reproduction steps are as follows:
The idea of the repro is to load a baked binary protobuf from disk and deserialize it in memory enough times to eventually cause a failure. The failure is evidently due to some corruption that happens in the Ruby VM. We have checked that the raw memory itself hasn't been altered, and even if it had been, the internal assert should not have been triggered in the first place.
When using a vanilla version of Ruby, the crash is not deterministic. However, compiling a custom Ruby library with the timer_thread disabled makes the crash fully deterministic. Changing the number of times we deserialize the object while the garbage collector is disabled alters how the problem behaves.
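For readers without the zip file, the general shape of such a harness might look like the sketch below. This is not the code from ruby-repro.zip: `hammer_decoder`, its parameters, and the `Marshal` stand-in payload are all hypothetical, chosen so the sketch runs without protobuf installed. In the real repro the block would call the protobuf decode on the baked binary instead.

```ruby
# Hypothetical harness, NOT the attached repro: decode the same bytes many
# times, keeping the garbage collector disabled for the first N iterations.
def hammer_decoder(payload, iterations:, gc_disabled_for:, &decode)
  raise ArgumentError, "decode block required" unless decode

  completed = 0
  GC.disable                             # let garbage pile up first
  begin
    iterations.times do |i|
      GC.enable if i == gc_disabled_for  # from here on the collector may run
      decode.call(payload)
      completed += 1
    end
  ensure
    GC.enable                            # never leave the GC disabled
  end
  GC.start                               # force a full collection at the end
  completed
end

# Stand-in payload and decoder (assumption: Marshal replaces protobuf here).
payload = Marshal.dump({ "key" => "value", "list" => [1, 2, 3] })
count = hammer_decoder(payload, iterations: 1_000, gc_disabled_for: 500) do |bytes|
  Marshal.load(bytes)
end
puts count
```

Varying `gc_disabled_for` corresponds to the knob described above. Setting `GC.stress = true` around the loop is another standard way to make collections far more frequent, at a large cost in speed.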
It would also be reasonable to suspect that there is an actual bug in the Ruby VM. I have cross-filed a bug there too: https://bugs.ruby-lang.org/issues/12699