The JVM SIGSEGVs when using RubyEtc.getgrgid #4057

Closed
nbarrientos opened this Issue Aug 9, 2016 · 13 comments

Comments

Projects
None yet
2 participants
@nbarrientos
Contributor

nbarrientos commented Aug 9, 2016

Hi,

[root@foo jruby-1.7.25]# PATH=bin/:$PATH ./bin/irb
irb(main):001:0> JRUBY_VERSION
=> "1.7.25"
irb(main):002:0> require 'java'
=> false
irb(main):003:0> java_import java.lang.System
=> [Java::JavaLang::System]
irb(main):004:0> System.getProperties["java.runtime.version"]
=> "1.7.0_111-mockbuild_2016_07_27_10_50-b00"
irb(main):005:0> a = [2688,2765,2746,2764,1008,1077,2745,1338,1665,1135,1157,1158,1160,1395,1470,1018,1399,1307]
=> [2688, 2765, 2746, 2764, 1008, 1077, 2745, 1338, 1665, 1135, 1157, 1158, 1160, 1395, 1470, 1018, 1399, 1307]
irb(main):006:0> (0..2).each{|th| Thread.new { (0..100000).each{|rr| a.each{ |x| puts Etc.getgrgid(x.to_i).name}} }}
=> 0..2
... # Output removed to hide group names...
irb(main):007:0>
...
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f840213b3e5, pid=29540, tid=140204612916992
#
# JRE version: OpenJDK Runtime Environment (7.0_99) (build 1.7.0_99-mockbuild_2016_03_25_01_00-b00)
# Java VM: OpenJDK 64-Bit Server VM (24.95-b01 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.6.5
# Distribution: CentOS Linux release 7.2.1511 (Core) , package rhel-2.6.5.0.el7_2-x86_64 u99-b00
# Problematic frame:
# C  [libc.so.6+0x1633e5]  __strlen_sse2_pminub+0x35
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/jvm-29540/hs_error.log
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
#   http://icedtea.classpath.org/bugzilla
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted

Is this supposed to happen? IOW, is RubyEtc.getgrgid (or the libs used underneath) meant to be thread-safe?

We stumbled upon this crash when running Puppetserver that ships jRuby 1.7.20.1 however as you can see above the bug can be reproduced using the latest 1.7.

Full crash report

Thanks in advance for looking into it.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Aug 10, 2016

Member

I don't know about threadsafe, but if MRI doesn't do this we shouldn't either. I assume MRI doesn't do this :-)

Member

headius commented Aug 10, 2016

I don't know about threadsafe, but if MRI doesn't do this we shouldn't either. I assume MRI doesn't do this :-)

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Aug 10, 2016

Member

Your script runs ok for me on OS X (JRuby 9k).

The definition used by JRuby 1.7 may have lagged behind, or perhaps neither version have the right function definition on your platform. And yes, there's the possibility that this call is not threadsafe, but it doesn't effect MRI due to the global lock.

Can you reproduce it if you lock a Mutex around the getgrgid calls? If not, it would seem the call is bound propertly but not threadsafe.

Judging by the crash report, it wouldn't surprise me if there's some shared state used by this call that's getting stepped on.

Member

headius commented Aug 10, 2016

Your script runs ok for me on OS X (JRuby 9k).

The definition used by JRuby 1.7 may have lagged behind, or perhaps neither version have the right function definition on your platform. And yes, there's the possibility that this call is not threadsafe, but it doesn't effect MRI due to the global lock.

Can you reproduce it if you lock a Mutex around the getgrgid calls? If not, it would seem the call is bound propertly but not threadsafe.

Judging by the crash report, it wouldn't surprise me if there's some shared state used by this call that's getting stepped on.

@headius headius added this to the JRuby 1.7.26 milestone Aug 10, 2016

@nbarrientos

This comment has been minimized.

Show comment
Hide comment
@nbarrientos

nbarrientos Aug 10, 2016

Contributor

Hi,

Thanks for your reply.

I'm sorry I might have not given you all the data you need to reproduce the issue. The groups which GIDs are in the array named 'a' must exist and have members, otherwise my guess is that the code that triggers the crash won't be executed.

I can also reproduce the bug using 9k:

[root@foo jruby-9.1.2.0]# PATH=bin/:$PATH ./bin/irb
irb(main):001:0> JRUBY_VERSION
=> "9.1.2.0"
irb(main):002:0> require 'java'
=> false
irb(main):003:0> java_import java.lang.System
=> [Java::JavaLang::System]
irb(main):004:0> System.getProperties["java.runtime.version"]
=> "1.7.0_111-mockbuild_2016_07_27_10_50-b00"
irb(main):005:0> a = [2688,2765,2746,2764,1008,1077,2745,1338,1665,1135,1157,1158,1160,1395,1470,1018,1399,1307]
=> [2688, 2765, 2746, 2764, 1008, 1077, 2745, 1338, 1665, 1135, 1157, 1158, 1160, 1395, 1470, 1018, 1399, 1307]
irb(main):006:0> (0..2).each{|th| Thread.new { (0..100000).each{|rr| a.each{ |x| puts Etc.getgrgid(x.to_i).name}} }}
=> 0..2
irb(main):007:0>
...
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fd1862393e5, pid=4133, tid=140536982243072
#
# JRE version: OpenJDK Runtime Environment (7.0_111-b01) (build 1.7.0_111-mockbuild_2016_07_27_10_50-b00)
# Java VM: OpenJDK 64-Bit Server VM (24.111-b01 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.6.7
# Distribution: CentOS Linux release 7.2.1511 (Core) , package rhel-2.6.7.2.el7_2-x86_64 u111-b01
# Problematic frame:
# C  [libc.so.6+0x1633e5]  __strlen_sse2_pminub+0x35
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/jvm-4133/hs_error.log
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
#   http://icedtea.classpath.org/bugzilla
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted

Will try later to protect the calls with a Mutex and see what happens.

Contributor

nbarrientos commented Aug 10, 2016

Hi,

Thanks for your reply.

I'm sorry I might have not given you all the data you need to reproduce the issue. The groups which GIDs are in the array named 'a' must exist and have members, otherwise my guess is that the code that triggers the crash won't be executed.

I can also reproduce the bug using 9k:

[root@foo jruby-9.1.2.0]# PATH=bin/:$PATH ./bin/irb
irb(main):001:0> JRUBY_VERSION
=> "9.1.2.0"
irb(main):002:0> require 'java'
=> false
irb(main):003:0> java_import java.lang.System
=> [Java::JavaLang::System]
irb(main):004:0> System.getProperties["java.runtime.version"]
=> "1.7.0_111-mockbuild_2016_07_27_10_50-b00"
irb(main):005:0> a = [2688,2765,2746,2764,1008,1077,2745,1338,1665,1135,1157,1158,1160,1395,1470,1018,1399,1307]
=> [2688, 2765, 2746, 2764, 1008, 1077, 2745, 1338, 1665, 1135, 1157, 1158, 1160, 1395, 1470, 1018, 1399, 1307]
irb(main):006:0> (0..2).each{|th| Thread.new { (0..100000).each{|rr| a.each{ |x| puts Etc.getgrgid(x.to_i).name}} }}
=> 0..2
irb(main):007:0>
...
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fd1862393e5, pid=4133, tid=140536982243072
#
# JRE version: OpenJDK Runtime Environment (7.0_111-b01) (build 1.7.0_111-mockbuild_2016_07_27_10_50-b00)
# Java VM: OpenJDK 64-Bit Server VM (24.111-b01 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.6.7
# Distribution: CentOS Linux release 7.2.1511 (Core) , package rhel-2.6.7.2.el7_2-x86_64 u111-b01
# Problematic frame:
# C  [libc.so.6+0x1633e5]  __strlen_sse2_pminub+0x35
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/jvm-4133/hs_error.log
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
#   http://icedtea.classpath.org/bugzilla
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted

Will try later to protect the calls with a Mutex and see what happens.

@nbarrientos

This comment has been minimized.

Show comment
Hide comment
@nbarrientos

nbarrientos Aug 10, 2016

Contributor
[root@foo jruby-9.1.2.0]# PATH=bin/:$PATH ./bin/irb                                                                                                       
irb(main):001:0> m = Mutex.new                                                                                                                                
=> #<Mutex:0xf6a5a4e>                                                                                                                                         
irb(main):002:0> a = [2688,2765,2746,2764,1008,1077,2745,1338,1665,1135,1157,1158,1160,1395,1470,1018,1399,1307]                                              
=> [2688, 2765, 2746, 2764, 1008, 1077, 2745, 1338, 1665, 1135, 1157, 1158, 1160, 1395, 1470, 1018, 1399, 1307]                                               
irb(main):003:0> (0..2).each{|th| Thread.new { (0..100000).each{|rr| a.each{ |x| m.lock; Etc.getgrgid(x.to_i).name; m.unlock }} }}
=> 0..2
irb(main):004:0>                                                                                                                                                   

:)

Contributor

nbarrientos commented Aug 10, 2016

[root@foo jruby-9.1.2.0]# PATH=bin/:$PATH ./bin/irb                                                                                                       
irb(main):001:0> m = Mutex.new                                                                                                                                
=> #<Mutex:0xf6a5a4e>                                                                                                                                         
irb(main):002:0> a = [2688,2765,2746,2764,1008,1077,2745,1338,1665,1135,1157,1158,1160,1395,1470,1018,1399,1307]                                              
=> [2688, 2765, 2746, 2764, 1008, 1077, 2745, 1338, 1665, 1135, 1157, 1158, 1160, 1395, 1470, 1018, 1399, 1307]                                               
irb(main):003:0> (0..2).each{|th| Thread.new { (0..100000).each{|rr| a.each{ |x| m.lock; Etc.getgrgid(x.to_i).name; m.unlock }} }}
=> 0..2
irb(main):004:0>                                                                                                                                                   

:)

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Aug 11, 2016

Member

Nice! So either we are doing something thread-unsafe or the getgrgid call is. I'll try to reproduce here with proper group IDs and then I'll do a little digging.

Member

headius commented Aug 11, 2016

Nice! So either we are doing something thread-unsafe or the getgrgid call is. I'll try to reproduce here with proper group IDs and then I'll do a little digging.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Aug 11, 2016

Member

Interestingly, the manpage for getgrgid on OS X has this to say:

On OS X, these routines are thread-safe and return a pointer to a thread-specific data structure.

Definitely getting the feeling that the Linux getgrgid is not thread-safe.

Member

headius commented Aug 11, 2016

Interestingly, the manpage for getgrgid on OS X has this to say:

On OS X, these routines are thread-safe and return a pointer to a thread-specific data structure.

Definitely getting the feeling that the Linux getgrgid is not thread-safe.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Aug 11, 2016

Member

The getgrgid entry from the POSIX Programmer's Manual has another perspective:

The getgrgid() function need not be thread-safe.

I think our best option here is to add a lock in JRuby around getgrgid.

Member

headius commented Aug 11, 2016

The getgrgid entry from the POSIX Programmer's Manual has another perspective:

The getgrgid() function need not be thread-safe.

I think our best option here is to add a lock in JRuby around getgrgid.

@headius headius closed this in 64ab8ea Aug 11, 2016

headius added a commit that referenced this issue Aug 11, 2016

Synchronize around getgrgid call. Fixes #4057.
getgrgid is thread-safe on OS X, but other platforms are not
required to be thread-safe by the POSIX specification. It appears
that at least some flavors of Linux do not have thread-safe
getgrgid. To aid users who might be calling it across threads, we
lock around the native getgrgid call.

I chose to lock around the RubyEtc.class to ensure the native call
is single-threaded for all JRuby instances in a given classloader.
There's no clean way to make that guarantee across classloaders.
@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Aug 11, 2016

Member

I patched this for 9.1.3.0 and 1.7.26. I believe 9.1.3.0 will be out first. Wish Github let me set multiple target milestones.

Member

headius commented Aug 11, 2016

I patched this for 9.1.3.0 and 1.7.26. I believe 9.1.3.0 will be out first. Wish Github let me set multiple target milestones.

@headius headius modified the milestones: JRuby 9.1.3.0, JRuby 1.7.26 Aug 11, 2016

camlow325 added a commit to camlow325/jruby that referenced this issue Aug 17, 2016

Merge from upstream 'jruby-1_7' at 0dc6fa9
* jruby-1_7:
  Use absolute classpath URL when loading jar-bootstrap. See #4000.
  Synchronize around getgrgid call. Fixes #4057.
  Only lock Digest mutex in one place.
  Explicitly close input stream in CompoundJarURLStreamHandler
  Write a log message when findClass IOException is caught
@nbarrientos

This comment has been minimized.

Show comment
Hide comment
@nbarrientos

nbarrientos Sep 12, 2016

Contributor

Did the patch make it to 1.7.26? I can still reproduce the problem and I fail to find anything in the release notes.

[root@foo jruby-1.7.26]# PATH=$PATH:bin
[root@foo jruby-1.7.26]# ./bin/irb 
irb(main):001:0> JRUBY_VERSION
=> "1.7.26"
irb(main):002:0> require 'java'
=> false
irb(main):003:0> java_import java.lang.System
=> [Java::JavaLang::System]
irb(main):004:0> System.getProperties["java.runtime.version"]
=> "1.7.0_99-mockbuild_2016_03_25_01_00-b00"
irb(main):005:0> a = [2688,2765,2746,2764,1008,1077,2745,1338,1665,1135,1157,1158,1160,1395,1470,1018,1399,1307]
=> [2688, 2765, 2746, 2764, 1008, 1077, 2745, 1338, 1665, 1135, 1157, 1158, 1160, 1395, 1470, 1018, 1399, 1307]
irb(main):006:0> (0..2).each{|th| Thread.new { (0..100000).each{|rr| a.each{ |x| puts Etc.getgrgid(x.to_i).name}} }}
=> 0..2
...
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f8c4440fa4f, pid=27310, tid=140240080443136
#
# JRE version: OpenJDK Runtime Environment (7.0_99) (build 1.7.0_99-mockbuild_2016_03_25_01_00-b00)
# Java VM: OpenJDK 64-Bit Server VM (24.95-b01 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.6.5
# Distribution: CentOS Linux release 7.2.1511 (Core) , package rhel-2.6.5.0.el7_2-x86_64 u99-b00
# Problematic frame:
# C  [libc.so.6+0x13aa4f] __strlen_sse42+0xf
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/jvm-27310/hs_error.log [thread 140240082544384 also had an error]
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
#   http://icedtea.classpath.org/bugzilla
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted
Contributor

nbarrientos commented Sep 12, 2016

Did the patch make it to 1.7.26? I can still reproduce the problem and I fail to find anything in the release notes.

[root@foo jruby-1.7.26]# PATH=$PATH:bin
[root@foo jruby-1.7.26]# ./bin/irb 
irb(main):001:0> JRUBY_VERSION
=> "1.7.26"
irb(main):002:0> require 'java'
=> false
irb(main):003:0> java_import java.lang.System
=> [Java::JavaLang::System]
irb(main):004:0> System.getProperties["java.runtime.version"]
=> "1.7.0_99-mockbuild_2016_03_25_01_00-b00"
irb(main):005:0> a = [2688,2765,2746,2764,1008,1077,2745,1338,1665,1135,1157,1158,1160,1395,1470,1018,1399,1307]
=> [2688, 2765, 2746, 2764, 1008, 1077, 2745, 1338, 1665, 1135, 1157, 1158, 1160, 1395, 1470, 1018, 1399, 1307]
irb(main):006:0> (0..2).each{|th| Thread.new { (0..100000).each{|rr| a.each{ |x| puts Etc.getgrgid(x.to_i).name}} }}
=> 0..2
...
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f8c4440fa4f, pid=27310, tid=140240080443136
#
# JRE version: OpenJDK Runtime Environment (7.0_99) (build 1.7.0_99-mockbuild_2016_03_25_01_00-b00)
# Java VM: OpenJDK 64-Bit Server VM (24.95-b01 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.6.5
# Distribution: CentOS Linux release 7.2.1511 (Core) , package rhel-2.6.5.0.el7_2-x86_64 u99-b00
# Problematic frame:
# C  [libc.so.6+0x13aa4f] __strlen_sse42+0xf
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/jvm-27310/hs_error.log [thread 140240082544384 also had an error]
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
#   http://icedtea.classpath.org/bugzilla
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted
@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Sep 13, 2016

Member

@nbarrientos The commit does appear to be on 1.7 branch...but my commit is confusing since it only fixes getgrent. I may have botched this fix and not noticed it.

Given the threading problems with Etc I'm wondering if we shouldn't just synchronize the whole class.

Reopening, because this may not actually have been fixed.

Member

headius commented Sep 13, 2016

@nbarrientos The commit does appear to be on 1.7 branch...but my commit is confusing since it only fixes getgrent. I may have botched this fix and not noticed it.

Given the threading problems with Etc I'm wondering if we shouldn't just synchronize the whole class.

Reopening, because this may not actually have been fixed.

@headius headius reopened this Sep 13, 2016

@headius headius modified the milestones: JRuby 1.7.27, JRuby 9.1.3.0 Sep 13, 2016

@headius headius closed this in f7746d0 Sep 13, 2016

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Sep 13, 2016

Member

I've committed to master what I'll shortly commit to 1.7: synchronizing all the Etc methods against our RubyEtc class. Seems like it's the safest option, since I reviewed several Etc function and all have questionable threading guarantees.

Member

headius commented Sep 13, 2016

I've committed to master what I'll shortly commit to 1.7: synchronizing all the Etc methods against our RubyEtc class. Seems like it's the safest option, since I reviewed several Etc function and all have questionable threading guarantees.

headius added a commit that referenced this issue Sep 13, 2016

Synchronize all of Etc because UNIX is dumb. Fixes #4057.
This also incorporates diffs from master, aligning both
Etc implementations.
@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Sep 13, 2016

Member

Fixed again, along with all the other Etc functions, for 9.1.6.0 and 1.7.27.

Member

headius commented Sep 13, 2016

Fixed again, along with all the other Etc functions, for 9.1.6.0 and 1.7.27.

@nbarrientos

This comment has been minimized.

Show comment
Hide comment
@nbarrientos

nbarrientos Sep 14, 2016

Contributor

Thanks, @headius :)

Contributor

nbarrientos commented Sep 14, 2016

Thanks, @headius :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment