Ruby assert can sometimes trigger, depending on the garbage collector. #2004

Closed
nicolasnoble opened this issue Aug 24, 2016 · 18 comments

@nicolasnoble
Contributor

Basically, we were investigating this: grpc/grpc#7661

Our investigation led us to realize that this assert in the protobuf code is triggered, but only if the garbage collector has been exercised enough: https://github.com/google/protobuf/blob/master/ruby/ext/google/protobuf_c/map.c#L74

If the garbage collector is really under heavy stress, we can even produce a VM crash: http://pastebin.com/hzgHPJGq

I have included a zip file with our current reproduction case: ruby-repro.zip. Right now, this crashes every version of Ruby I've been able to try it with. The reproduction steps are as follows:

$ bundle install
$ bundle exec ruby repro.rb

The idea of the repro is to load a baked binary protobuf from disk and deserialize it in memory enough times to eventually cause a failure. The failure is evidently due to some corruption that happens in the Ruby VM. We have checked that the actual raw memory itself hasn't been altered - and even if it had been, the internal assert shouldn't have been triggered in the first place.

When using a vanilla build of Ruby, the crash is not deterministic. However, compiling a custom Ruby with the timer_thread disabled makes the crash fully deterministic. Changing the number of times we deserialize the object while the garbage collector is disabled alters the behavior of the problem.

It would also be reasonable to suspect that there is an actual bug in the Ruby VM. I have cross-filed a bug there too: https://bugs.ruby-lang.org/issues/12699

@murgatroid99
Contributor

murgatroid99 commented Aug 24, 2016

If this problem is caused by a bug in the protobuf code, the likely culprit is a Ruby object that is temporarily not visible to the garbage collector: it is not on the stack, and no other Ruby object holds a reference to it. That can happen if, at some point, the only reference to a Ruby object is in a C struct on the heap.
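
For illustration, here is a minimal C sketch of that pattern (hypothetical code, not taken from the protobuf sources; the struct, function, and module names are made up). The point is that a VALUE whose only reference lives in heap-allocated C memory is invisible to Ruby's conservative GC, and whether the compiler happens to leave a copy in a scanned register or stack slot is exactly what makes such a crash nondeterministic:

#include <ruby.h>

/* Hypothetical extension struct: the only reference to a Ruby object
   lives on the C heap, where the garbage collector never scans. */
struct frame {
  VALUE msg;
};

static VALUE risky(VALUE self) {
  struct frame *f = ALLOC(struct frame);
  f->msg = rb_str_new_cstr("payload");
  /* If a GC cycle runs here, the string can be swept unless the compiler
     happened to keep a copy of it in a scanned register or stack slot. */
  VALUE out = rb_funcall(f->msg, rb_intern("upcase"), 0);
  xfree(f);
  return out;
}

static VALUE safer(VALUE self) {
  VALUE msg = rb_str_new_cstr("payload");  /* stack-held, so GC-visible */
  struct frame *f = ALLOC(struct frame);
  f->msg = msg;
  VALUE out = rb_funcall(f->msg, rb_intern("upcase"), 0);
  xfree(f);
  RB_GC_GUARD(msg);  /* keep the stack slot alive until this point */
  return out;
}

void Init_gc_sketch(void) {
  VALUE m = rb_define_module("GcSketch");
  rb_define_module_function(m, "risky", risky, 0);
  rb_define_module_function(m, "safer", safer, 0);
}

For references that have to outlive a single C call, the usual alternatives are rb_gc_register_address()/rb_gc_unregister_address() or marking the field from the wrapping object's mark callback.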

@haberman
Member

Thanks for all the work you've done on this already! If we have a deterministic crash, can we run it under Valgrind? That would put us in a great position to understand what exactly is going on.

@nicolasnoble
Contributor Author

Maybe? I'm not exactly familiar with what's involved in running the Ruby VM through Valgrind, and my attempts so far haven't been rewarded with much success.

@haberman
Member

Hmm. I know Ruby throws a bunch of Valgrind errors even in normal operation. But did the crash itself not trigger under Valgrind?

@nicolasnoble
Contributor Author

Let me say it differently: I can't even get there. Valgrind segfaults while trying to run Ruby at the moment.


@haberman
Member

Ok. In that case I'll probably need to look deeper into the repro and try to debug it. How can I reproduce this?

@nicolasnoble
Contributor Author

Would there be something missing from the repro case in the report?

@haberman
Member

Ah sorry, I missed that you attached a zip file! I'll take a look as soon as I can. Hopefully today.

@nicolasnoble
Contributor Author

So with this Dockerfile.txt, I can reproduce the VM crash deterministically, and I can also run Valgrind on it. But under Valgrind, there's no crash.

$ docker build .
[...]
Successfully built 61695a99f23b
$ docker run 61695a99f23b ruby repro.rb
Preparing the GC...
Done - trying to hit the bug
repro.rb:16: [BUG] Segmentation fault at 0x00000000000000
ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-linux]
[...]
$ docker run 61695a99f23b
==1== Memcheck, a memory error detector
==1== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==1== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==1== Command: ruby repro.rb
==1==
Preparing the GC...
Done - trying to hit the bug

And nothing happens. The code just loops forever trying to trigger the corruption. Or maybe my machine isn't powerful enough to run Valgrind under such heavy load and never gets to the state where it fails.

@nicolasnoble
Contributor Author

I take that back. The same crash reproduces properly under Valgrind; it just takes a very long time:

==1== Use of uninitialised value of size 8
==1==    at 0x226120: rb_id_table_lookup (id_table.c:1516)
==1==    by 0x26F2FB: lookup_method_table (vm_method.c:190)
==1==    by 0x26F2FB: search_method (vm_method.c:701)
==1==    by 0x26F2FB: method_entry_get_without_cache (vm_method.c:726)
==1==    by 0x272E98: method_entry_get (vm_method.c:786)
==1==    by 0x272E98: rb_callable_method_entry (vm_method.c:831)
==1==    by 0x280C89: rb_call0 (vm_eval.c:343)
==1==    by 0x2813F3: rb_call (vm_eval.c:637)
==1==    by 0x2813F3: rb_funcall (vm_eval.c:835)
==1==    by 0x2098F7: rb_obj_as_string (string.c:1254)
==1==    by 0x1F5C17: ruby__sfvextra (sprintf.c:1375)
==1==    by 0x1F6BA9: BSD_vfprintf (vsnprintf.c:836)
==1==    by 0x1FD333: rb_enc_vsprintf (sprintf.c:1421)
==1==    by 0x1FD333: rb_vsprintf (sprintf.c:1445)
==1==    by 0x2FB26D: rb_raise (error.c:2060)
==1==    by 0x2FC47D: rb_check_type (error.c:599)
==1==    by 0x7195074: Map_index_set (in /usr/local/lib/ruby/gems/2.3.0/gems/google-protobuf-3.0.0.alpha.5.0.5.1-x86_64-linux/lib/google/2.3/protobuf_c.so)
==1==
==1== Invalid read of size 4
==1==    at 0x226120: rb_id_table_lookup (id_table.c:1516)
==1==    by 0x26F2FB: lookup_method_table (vm_method.c:190)
==1==    by 0x26F2FB: search_method (vm_method.c:701)
==1==    by 0x26F2FB: method_entry_get_without_cache (vm_method.c:726)
==1==    by 0x272E98: method_entry_get (vm_method.c:786)
==1==    by 0x272E98: rb_callable_method_entry (vm_method.c:831)
==1==    by 0x280C89: rb_call0 (vm_eval.c:343)
==1==    by 0x2813F3: rb_call (vm_eval.c:637)
==1==    by 0x2813F3: rb_funcall (vm_eval.c:835)
==1==    by 0x2098F7: rb_obj_as_string (string.c:1254)
==1==    by 0x1F5C17: ruby__sfvextra (sprintf.c:1375)
==1==    by 0x1F6BA9: BSD_vfprintf (vsnprintf.c:836)
==1==    by 0x1FD333: rb_enc_vsprintf (sprintf.c:1421)
==1==    by 0x1FD333: rb_vsprintf (sprintf.c:1445)
==1==    by 0x2FB26D: rb_raise (error.c:2060)
==1==    by 0x2FC47D: rb_check_type (error.c:599)
==1==    by 0x7195074: Map_index_set (in /usr/local/lib/ruby/gems/2.3.0/gems/google-protobuf-3.0.0.alpha.5.0.5.1-x86_64-linux/lib/google/2.3/protobuf_c.so)
==1==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==1==

@nicolasnoble
Contributor Author

After adding --track-origins, we also get this:

==1==  Uninitialised value was created by a client request
==1==    at 0x14B118: gc_page_sweep (gc.c:3384)
==1==    by 0x14B118: gc_sweep_step (gc.c:3559)
==1==    by 0x150EC2: gc_sweep_rest (gc.c:3608)
==1==    by 0x150EC2: gc_rest.part.63 (gc.c:6267)
==1==    by 0x151FAC: gc_rest (gc.c:6161)
==1==    by 0x151FAC: garbage_collect (gc.c:6154)
==1==    by 0x152187: garbage_collect_with_gvl (gc.c:6394)
==1==    by 0x152203: objspace_malloc_increase.isra.67 (gc.c:7558)
==1==    by 0x1534BD: objspace_xmalloc (gc.c:7650)
==1==    by 0x7192326: Message_alloc (in /usr/local/lib/ruby/gems/2.3.0/gems/google-protobuf-3.0.0.alpha.5.0.5.1-x86_64-linux/lib/google/2.3/protobuf_c.so)
==1==    by 0x195D7F: rb_obj_alloc (object.c:1823)
==1==    by 0x196070: rb_class_new_instance (object.c:1855)
==1==    by 0x719668C: submsg_handler (in /usr/local/lib/ruby/gems/2.3.0/gems/google-protobuf-3.0.0.alpha.5.0.5.1-x86_64-linux/lib/google/2.3/protobuf_c.so)
==1==    by 0x71B0DFF: run_decoder_vm (in /usr/local/lib/ruby/gems/2.3.0/gems/google-protobuf-3.0.0.alpha.5.0.5.1-x86_64-linux/lib/google/2.3/protobuf_c.so)
==1==    by 0x719891F: Message_decode (in /usr/local/lib/ruby/gems/2.3.0/gems/google-protobuf-3.0.0.alpha.5.0.5.1-x86_64-linux/lib/google/2.3/protobuf_c.so)
==1==

@haberman
Member

The attached PR fixes the crash for me.
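
Side note for anyone debugging similar crashes in their own C extensions: what follows is a general, hedged sketch of the usual shape of this kind of fix, not a description of what the attached PR actually changes (the names below are made up). The idea is to make every VALUE held in C-allocated memory reachable by the GC, for example by wrapping the struct with TypedData and marking the field from its dmark callback.

#include <ruby.h>

/* Hypothetical wrapper struct that owns a Ruby object. */
struct wrapper {
  VALUE payload;
};

static void wrapper_mark(void *p) {
  struct wrapper *w = p;
  rb_gc_mark(w->payload);  /* tell the GC this heap-held reference is live */
}

static void wrapper_free(void *p) {
  xfree(p);
}

static const rb_data_type_t wrapper_type = {
  "Wrapper",
  { wrapper_mark, wrapper_free, NULL },
  NULL, NULL,
  RUBY_TYPED_FREE_IMMEDIATELY,
};

static VALUE wrapper_alloc(VALUE klass) {
  struct wrapper *w;
  VALUE obj = TypedData_Make_Struct(klass, struct wrapper, &wrapper_type, w);
  w->payload = Qnil;
  return obj;
}

void Init_wrapper_sketch(void) {
  VALUE c = rb_define_class("WrapperSketch", rb_cObject);
  rb_define_alloc_func(c, wrapper_alloc);
}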

@timanovsky

@haberman Hi, when do you plan to publish the gem with this fix? I'm struggling with this error in production.

@quartzmo

@xfxyjwf Hi, is it possible to give us a timeline for the release of this fix?

@haberman
Member

Sorry I didn't get to this yesterday. I will work on the release packages today.

@quartzmo

@haberman Great! Thanks!

@haberman
Member

Updated packages are now available on RubyGems.

@timanovsky

@haberman Thanks!
