Ruby assert can sometimes trigger, depending on the garbage collector. #2004

Closed
nicolasnoble opened this issue Aug 24, 2016 · 18 comments

@nicolasnoble
Contributor

Basically, we were investigating this: grpc/grpc#7661

Our investigation led us to realize that this assert in the protobuf code is triggered, but only if the garbage collector has been exercised enough: https://github.com/google/protobuf/blob/master/ruby/ext/google/protobuf_c/map.c#L74

If the garbage collector is really under heavy stress, we can even produce a VM crash: http://pastebin.com/hzgHPJGq

I have included a zip file with our current reproduction case: ruby-repro.zip. Right now, this crashes every version of Ruby I've been able to try it with. The reproduction steps are as follows:

$ bundle install
$ bundle exec ruby repro.rb

The idea of the repro is to load a baked binary protobuf from disk and deserialize it in memory enough times to eventually cause a failure. The failure is evidently due to some corruption that happens in the Ruby VM. We have checked that the actual raw memory itself hasn't been altered - and even if it had been, the internal assert shouldn't have been triggered in the first place.

When using a vanilla build of Ruby, the crash is not deterministic. However, compiling a custom Ruby with the timer_thread disabled makes the crash fully deterministic. Changing the number of times we deserialize the object while the garbage collector is disabled alters the behavior of the problem.

It would also be reasonable to suspect that there is an actual bug in the Ruby VM. I have cross-filed a bug there too: https://bugs.ruby-lang.org/issues/12699

@murgatroid99
Contributor

murgatroid99 commented Aug 24, 2016

If this problem is caused by a bug in the protobuf code, the likely culprit is a Ruby object that is temporarily not visible to the garbage collector: it is not on the stack, and no other Ruby object holds a reference to it. That can happen if, at some point, the only reference to a Ruby object is in a C struct on the heap.
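
For illustration, here is a minimal C sketch of that pattern (hypothetical code, not taken from the protobuf sources; the struct, function, and module names are made up). The point is that a VALUE whose only reference lives in heap-allocated C memory is invisible to Ruby's conservative GC, and whether the compiler happens to leave a copy in a scanned register or stack slot is exactly what makes such a crash nondeterministic:

#include <ruby.h>

/* Hypothetical extension struct: the only reference to a Ruby object
   lives on the C heap, where the garbage collector never scans. */
struct frame {
  VALUE msg;
};

static VALUE risky(VALUE self) {
  struct frame *f = ALLOC(struct frame);
  f->msg = rb_str_new_cstr("payload");
  /* If a GC cycle runs here, the string can be swept unless the compiler
     happened to keep a copy of it in a scanned register or stack slot. */
  VALUE out = rb_funcall(f->msg, rb_intern("upcase"), 0);
  xfree(f);
  return out;
}

static VALUE safer(VALUE self) {
  VALUE msg = rb_str_new_cstr("payload");  /* stack-held, so GC-visible */
  struct frame *f = ALLOC(struct frame);
  f->msg = msg;
  VALUE out = rb_funcall(f->msg, rb_intern("upcase"), 0);
  xfree(f);
  RB_GC_GUARD(msg);  /* keep the stack slot alive until this point */
  return out;
}

void Init_gc_sketch(void) {
  VALUE m = rb_define_module("GcSketch");
  rb_define_module_function(m, "risky", risky, 0);
  rb_define_module_function(m, "safer", safer, 0);
}

For references that have to outlive a single C call, the usual alternatives are rb_gc_register_address()/rb_gc_unregister_address() or marking the field from the wrapping object's mark callback.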

@haberman
Member

Thanks for all the work you've done on this already! If we have a deterministic crash, can we run it under Valgrind? That would put us in a great position to understand what exactly is going on.

@nicolasnoble
Contributor Author

Maybe? I'm not exactly familiar with what's involved in running the Ruby VM through Valgrind, and my attempts so far haven't been rewarded with much success.

@haberman
Member

Hmm. I know Ruby throws a bunch of Valgrind errors even in normal operation. But did the crash itself not trigger under Valgrind?

@nicolasnoble
Contributor Author

Let me say it differently: I can't even get there. Valgrind segfaults while trying to run Ruby at the moment.


@haberman
Member

Ok. In that case I'll probably need to look deeper into the repro and try to debug it. How can I reproduce this?

@nicolasnoble
Contributor Author

Would there be something missing from the repro case in the report?

@haberman
Member

Ah sorry, I missed that you attached a zip file! I'll take a look as soon as I can. Hopefully today.

@nicolasnoble
Contributor Author

So with this Dockerfile.txt, I can reproduce the VM crash deterministically, and I can also run Valgrind on it. But under Valgrind, there's no crash.

$ docker build .
[...]
Successfully built 61695a99f23b
$ docker run 61695a99f23b ruby repro.rb
Preparing the GC...
Done - trying to hit the bug
repro.rb:16: [BUG] Segmentation fault at 0x00000000000000
ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-linux]
[...]
$ docker run 61695a99f23b
==1== Memcheck, a memory error detector
==1== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==1== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==1== Command: ruby repro.rb
==1==
Preparing the GC...
Done - trying to hit the bug

And nothing happens. The code just loops forever trying to trigger the corruption. Or maybe my machine isn't powerful enough to run Valgrind under such heavy load and never gets to the state where it fails.

@nicolasnoble
Contributor Author

I take that back. The same crash reproduces properly under Valgrind; it just takes a very long time:

==1== Use of uninitialised value of size 8
==1==    at 0x226120: rb_id_table_lookup (id_table.c:1516)
==1==    by 0x26F2FB: lookup_method_table (vm_method.c:190)
==1==    by 0x26F2FB: search_method (vm_method.c:701)
==1==    by 0x26F2FB: method_entry_get_without_cache (vm_method.c:726)
==1==    by 0x272E98: method_entry_get (vm_method.c:786)
==1==    by 0x272E98: rb_callable_method_entry (vm_method.c:831)
==1==    by 0x280C89: rb_call0 (vm_eval.c:343)
==1==    by 0x2813F3: rb_call (vm_eval.c:637)
==1==    by 0x2813F3: rb_funcall (vm_eval.c:835)
==1==    by 0x2098F7: rb_obj_as_string (string.c:1254)
==1==    by 0x1F5C17: ruby__sfvextra (sprintf.c:1375)
==1==    by 0x1F6BA9: BSD_vfprintf (vsnprintf.c:836)
==1==    by 0x1FD333: rb_enc_vsprintf (sprintf.c:1421)
==1==    by 0x1FD333: rb_vsprintf (sprintf.c:1445)
==1==    by 0x2FB26D: rb_raise (error.c:2060)
==1==    by 0x2FC47D: rb_check_type (error.c:599)
==1==    by 0x7195074: Map_index_set (in /usr/local/lib/ruby/gems/2.3.0/gems/google-protobuf-3.0.0.alpha.5.0.5.1-x86_64-linux/lib/google/2.3/protobuf_c.so)
==1==
==1== Invalid read of size 4
==1==    at 0x226120: rb_id_table_lookup (id_table.c:1516)
==1==    by 0x26F2FB: lookup_method_table (vm_method.c:190)
==1==    by 0x26F2FB: search_method (vm_method.c:701)
==1==    by 0x26F2FB: method_entry_get_without_cache (vm_method.c:726)
==1==    by 0x272E98: method_entry_get (vm_method.c:786)
==1==    by 0x272E98: rb_callable_method_entry (vm_method.c:831)
==1==    by 0x280C89: rb_call0 (vm_eval.c:343)
==1==    by 0x2813F3: rb_call (vm_eval.c:637)
==1==    by 0x2813F3: rb_funcall (vm_eval.c:835)
==1==    by 0x2098F7: rb_obj_as_string (string.c:1254)
==1==    by 0x1F5C17: ruby__sfvextra (sprintf.c:1375)
==1==    by 0x1F6BA9: BSD_vfprintf (vsnprintf.c:836)
==1==    by 0x1FD333: rb_enc_vsprintf (sprintf.c:1421)
==1==    by 0x1FD333: rb_vsprintf (sprintf.c:1445)
==1==    by 0x2FB26D: rb_raise (error.c:2060)
==1==    by 0x2FC47D: rb_check_type (error.c:599)
==1==    by 0x7195074: Map_index_set (in /usr/local/lib/ruby/gems/2.3.0/gems/google-protobuf-3.0.0.alpha.5.0.5.1-x86_64-linux/lib/google/2.3/protobuf_c.so)
==1==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==1==

@nicolasnoble
Contributor Author

After adding --track-origins, we also get this:

==1==  Uninitialised value was created by a client request
==1==    at 0x14B118: gc_page_sweep (gc.c:3384)
==1==    by 0x14B118: gc_sweep_step (gc.c:3559)
==1==    by 0x150EC2: gc_sweep_rest (gc.c:3608)
==1==    by 0x150EC2: gc_rest.part.63 (gc.c:6267)
==1==    by 0x151FAC: gc_rest (gc.c:6161)
==1==    by 0x151FAC: garbage_collect (gc.c:6154)
==1==    by 0x152187: garbage_collect_with_gvl (gc.c:6394)
==1==    by 0x152203: objspace_malloc_increase.isra.67 (gc.c:7558)
==1==    by 0x1534BD: objspace_xmalloc (gc.c:7650)
==1==    by 0x7192326: Message_alloc (in /usr/local/lib/ruby/gems/2.3.0/gems/google-protobuf-3.0.0.alpha.5.0.5.1-x86_64-linux/lib/google/2.3/protobuf_c.so)
==1==    by 0x195D7F: rb_obj_alloc (object.c:1823)
==1==    by 0x196070: rb_class_new_instance (object.c:1855)
==1==    by 0x719668C: submsg_handler (in /usr/local/lib/ruby/gems/2.3.0/gems/google-protobuf-3.0.0.alpha.5.0.5.1-x86_64-linux/lib/google/2.3/protobuf_c.so)
==1==    by 0x71B0DFF: run_decoder_vm (in /usr/local/lib/ruby/gems/2.3.0/gems/google-protobuf-3.0.0.alpha.5.0.5.1-x86_64-linux/lib/google/2.3/protobuf_c.so)
==1==    by 0x719891F: Message_decode (in /usr/local/lib/ruby/gems/2.3.0/gems/google-protobuf-3.0.0.alpha.5.0.5.1-x86_64-linux/lib/google/2.3/protobuf_c.so)
==1==

@haberman
Member

The attached PR fixes the crash for me.
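
Side note for anyone debugging similar crashes in their own C extensions: what follows is a general, hedged sketch of the usual shape of this kind of fix, not a description of what the attached PR actually changes (the names below are made up). The idea is to make every VALUE held in C-allocated memory reachable by the GC, for example by wrapping the struct with TypedData and marking the field from its dmark callback.

#include <ruby.h>

/* Hypothetical wrapper struct that owns a Ruby object. */
struct wrapper {
  VALUE payload;
};

static void wrapper_mark(void *p) {
  struct wrapper *w = p;
  rb_gc_mark(w->payload);  /* tell the GC this heap-held reference is live */
}

static void wrapper_free(void *p) {
  xfree(p);
}

static const rb_data_type_t wrapper_type = {
  "Wrapper",
  { wrapper_mark, wrapper_free, NULL },
  NULL, NULL,
  RUBY_TYPED_FREE_IMMEDIATELY,
};

static VALUE wrapper_alloc(VALUE klass) {
  struct wrapper *w;
  VALUE obj = TypedData_Make_Struct(klass, struct wrapper, &wrapper_type, w);
  w->payload = Qnil;
  return obj;
}

void Init_wrapper_sketch(void) {
  VALUE c = rb_define_class("WrapperSketch", rb_cObject);
  rb_define_alloc_func(c, wrapper_alloc);
}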

@timanovsky

@haberman Hi, when do you plan to publish the gem with this fix? I'm struggling with this error in production.

@quartzmo

@xfxyjwf Hi, is it possible to give us a timeline for the release of this fix?

@haberman
Member

Sorry I didn't get to this yesterday. I will work on the release packages today.

@quartzmo

@haberman Great! Thanks!

@haberman
Member

Updated packages are now available on RubyGems.

@timanovsky

@haberman Thanks!
