Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The JVM SIGSEGVs when using RubyEtc.getgrgid #4057

Closed
nbarrientos opened this issue Aug 9, 2016 · 13 comments
Closed

The JVM SIGSEGVs when using RubyEtc.getgrgid #4057

nbarrientos opened this issue Aug 9, 2016 · 13 comments

Comments

@nbarrientos
Copy link
Contributor

@nbarrientos nbarrientos commented Aug 9, 2016

Hi,

[root@foo jruby-1.7.25]# PATH=bin/:$PATH ./bin/irb
irb(main):001:0> JRUBY_VERSION
=> "1.7.25"
irb(main):002:0> require 'java'
=> false
irb(main):003:0> java_import java.lang.System
=> [Java::JavaLang::System]
irb(main):004:0> System.getProperties["java.runtime.version"]
=> "1.7.0_111-mockbuild_2016_07_27_10_50-b00"
irb(main):005:0> a = [2688,2765,2746,2764,1008,1077,2745,1338,1665,1135,1157,1158,1160,1395,1470,1018,1399,1307]
=> [2688, 2765, 2746, 2764, 1008, 1077, 2745, 1338, 1665, 1135, 1157, 1158, 1160, 1395, 1470, 1018, 1399, 1307]
irb(main):006:0> (0..2).each{|th| Thread.new { (0..100000).each{|rr| a.each{ |x| puts Etc.getgrgid(x.to_i).name}} }}
=> 0..2
... # Output removed to hide group names...
irb(main):007:0>
...
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f840213b3e5, pid=29540, tid=140204612916992
#
# JRE version: OpenJDK Runtime Environment (7.0_99) (build 1.7.0_99-mockbuild_2016_03_25_01_00-b00)
# Java VM: OpenJDK 64-Bit Server VM (24.95-b01 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.6.5
# Distribution: CentOS Linux release 7.2.1511 (Core) , package rhel-2.6.5.0.el7_2-x86_64 u99-b00
# Problematic frame:
# C  [libc.so.6+0x1633e5]  __strlen_sse2_pminub+0x35
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/jvm-29540/hs_error.log
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
#   http://icedtea.classpath.org/bugzilla
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted

Is this supposed to happen? IOW, is RubyEtc.getgrgid (or the libs used underneath) meant to be thread-safe?

We stumbled upon this crash when running Puppetserver that ships jRuby 1.7.20.1 however as you can see above the bug can be reproduced using the latest 1.7.

Full crash report

Thanks in advance for looking into it.

@headius
Copy link
Member

@headius headius commented Aug 10, 2016

I don't know about threadsafe, but if MRI doesn't do this we shouldn't either. I assume MRI doesn't do this :-)

@headius
Copy link
Member

@headius headius commented Aug 10, 2016

Your script runs ok for me on OS X (JRuby 9k).

The definition used by JRuby 1.7 may have lagged behind, or perhaps neither version have the right function definition on your platform. And yes, there's the possibility that this call is not threadsafe, but it doesn't effect MRI due to the global lock.

Can you reproduce it if you lock a Mutex around the getgrgid calls? If not, it would seem the call is bound propertly but not threadsafe.

Judging by the crash report, it wouldn't surprise me if there's some shared state used by this call that's getting stepped on.

@headius headius added this to the JRuby 1.7.26 milestone Aug 10, 2016
@nbarrientos
Copy link
Contributor Author

@nbarrientos nbarrientos commented Aug 10, 2016

Hi,

Thanks for your reply.

I'm sorry I might have not given you all the data you need to reproduce the issue. The groups which GIDs are in the array named 'a' must exist and have members, otherwise my guess is that the code that triggers the crash won't be executed.

I can also reproduce the bug using 9k:

[root@foo jruby-9.1.2.0]# PATH=bin/:$PATH ./bin/irb
irb(main):001:0> JRUBY_VERSION
=> "9.1.2.0"
irb(main):002:0> require 'java'
=> false
irb(main):003:0> java_import java.lang.System
=> [Java::JavaLang::System]
irb(main):004:0> System.getProperties["java.runtime.version"]
=> "1.7.0_111-mockbuild_2016_07_27_10_50-b00"
irb(main):005:0> a = [2688,2765,2746,2764,1008,1077,2745,1338,1665,1135,1157,1158,1160,1395,1470,1018,1399,1307]
=> [2688, 2765, 2746, 2764, 1008, 1077, 2745, 1338, 1665, 1135, 1157, 1158, 1160, 1395, 1470, 1018, 1399, 1307]
irb(main):006:0> (0..2).each{|th| Thread.new { (0..100000).each{|rr| a.each{ |x| puts Etc.getgrgid(x.to_i).name}} }}
=> 0..2
irb(main):007:0>
...
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fd1862393e5, pid=4133, tid=140536982243072
#
# JRE version: OpenJDK Runtime Environment (7.0_111-b01) (build 1.7.0_111-mockbuild_2016_07_27_10_50-b00)
# Java VM: OpenJDK 64-Bit Server VM (24.111-b01 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.6.7
# Distribution: CentOS Linux release 7.2.1511 (Core) , package rhel-2.6.7.2.el7_2-x86_64 u111-b01
# Problematic frame:
# C  [libc.so.6+0x1633e5]  __strlen_sse2_pminub+0x35
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/jvm-4133/hs_error.log
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
#   http://icedtea.classpath.org/bugzilla
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted

Will try later to protect the calls with a Mutex and see what happens.

@nbarrientos
Copy link
Contributor Author

@nbarrientos nbarrientos commented Aug 10, 2016

[root@foo jruby-9.1.2.0]# PATH=bin/:$PATH ./bin/irb                                                                                                       
irb(main):001:0> m = Mutex.new                                                                                                                                
=> #<Mutex:0xf6a5a4e>                                                                                                                                         
irb(main):002:0> a = [2688,2765,2746,2764,1008,1077,2745,1338,1665,1135,1157,1158,1160,1395,1470,1018,1399,1307]                                              
=> [2688, 2765, 2746, 2764, 1008, 1077, 2745, 1338, 1665, 1135, 1157, 1158, 1160, 1395, 1470, 1018, 1399, 1307]                                               
irb(main):003:0> (0..2).each{|th| Thread.new { (0..100000).each{|rr| a.each{ |x| m.lock; Etc.getgrgid(x.to_i).name; m.unlock }} }}
=> 0..2
irb(main):004:0>                                                                                                                                                   

:)

@headius
Copy link
Member

@headius headius commented Aug 11, 2016

Nice! So either we are doing something thread-unsafe or the getgrgid call is. I'll try to reproduce here with proper group IDs and then I'll do a little digging.

@headius
Copy link
Member

@headius headius commented Aug 11, 2016

Interestingly, the manpage for getgrgid on OS X has this to say:

On OS X, these routines are thread-safe and return a pointer to a thread-specific data structure.

Definitely getting the feeling that the Linux getgrgid is not thread-safe.

@headius
Copy link
Member

@headius headius commented Aug 11, 2016

The getgrgid entry from the POSIX Programmer's Manual has another perspective:

The getgrgid() function need not be thread-safe.

I think our best option here is to add a lock in JRuby around getgrgid.

@headius headius closed this in 64ab8ea Aug 11, 2016
headius added a commit that referenced this issue Aug 11, 2016
getgrgid is thread-safe on OS X, but other platforms are not
required to be thread-safe by the POSIX specification. It appears
that at least some flavors of Linux do not have thread-safe
getgrgid. To aid users who might be calling it across threads, we
lock around the native getgrgid call.

I chose to lock around the RubyEtc.class to ensure the native call
is single-threaded for all JRuby instances in a given classloader.
There's no clean way to make that guarantee across classloaders.
@headius
Copy link
Member

@headius headius commented Aug 11, 2016

I patched this for 9.1.3.0 and 1.7.26. I believe 9.1.3.0 will be out first. Wish Github let me set multiple target milestones.

@headius headius modified the milestones: JRuby 9.1.3.0, JRuby 1.7.26 Aug 11, 2016
camlow325 added a commit to camlow325/jruby that referenced this issue Aug 17, 2016
* jruby-1_7:
  Use absolute classpath URL when loading jar-bootstrap. See jruby#4000.
  Synchronize around getgrgid call. Fixes jruby#4057.
  Only lock Digest mutex in one place.
  Explicitly close input stream in CompoundJarURLStreamHandler
  Write a log message when findClass IOException is caught
@nbarrientos
Copy link
Contributor Author

@nbarrientos nbarrientos commented Sep 12, 2016

Did the patch make it to 1.7.26? I can still reproduce the problem and I fail to find anything in the release notes.

[root@foo jruby-1.7.26]# PATH=$PATH:bin
[root@foo jruby-1.7.26]# ./bin/irb 
irb(main):001:0> JRUBY_VERSION
=> "1.7.26"
irb(main):002:0> require 'java'
=> false
irb(main):003:0> java_import java.lang.System
=> [Java::JavaLang::System]
irb(main):004:0> System.getProperties["java.runtime.version"]
=> "1.7.0_99-mockbuild_2016_03_25_01_00-b00"
irb(main):005:0> a = [2688,2765,2746,2764,1008,1077,2745,1338,1665,1135,1157,1158,1160,1395,1470,1018,1399,1307]
=> [2688, 2765, 2746, 2764, 1008, 1077, 2745, 1338, 1665, 1135, 1157, 1158, 1160, 1395, 1470, 1018, 1399, 1307]
irb(main):006:0> (0..2).each{|th| Thread.new { (0..100000).each{|rr| a.each{ |x| puts Etc.getgrgid(x.to_i).name}} }}
=> 0..2
...
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f8c4440fa4f, pid=27310, tid=140240080443136
#
# JRE version: OpenJDK Runtime Environment (7.0_99) (build 1.7.0_99-mockbuild_2016_03_25_01_00-b00)
# Java VM: OpenJDK 64-Bit Server VM (24.95-b01 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.6.5
# Distribution: CentOS Linux release 7.2.1511 (Core) , package rhel-2.6.5.0.el7_2-x86_64 u99-b00
# Problematic frame:
# C  [libc.so.6+0x13aa4f] __strlen_sse42+0xf
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/jvm-27310/hs_error.log [thread 140240082544384 also had an error]
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
#   http://icedtea.classpath.org/bugzilla
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted
@headius
Copy link
Member

@headius headius commented Sep 13, 2016

@nbarrientos The commit does appear to be on 1.7 branch...but my commit is confusing since it only fixes getgrent. I may have botched this fix and not noticed it.

Given the threading problems with Etc I'm wondering if we shouldn't just synchronize the whole class.

Reopening, because this may not actually have been fixed.

@headius headius reopened this Sep 13, 2016
@headius headius modified the milestones: JRuby 1.7.27, JRuby 9.1.3.0 Sep 13, 2016
@headius headius closed this in f7746d0 Sep 13, 2016
@headius
Copy link
Member

@headius headius commented Sep 13, 2016

I've committed to master what I'll shortly commit to 1.7: synchronizing all the Etc methods against our RubyEtc class. Seems like it's the safest option, since I reviewed several Etc function and all have questionable threading guarantees.

headius added a commit that referenced this issue Sep 13, 2016
This also incorporates diffs from master, aligning both
Etc implementations.
@headius
Copy link
Member

@headius headius commented Sep 13, 2016

Fixed again, along with all the other Etc functions, for 9.1.6.0 and 1.7.27.

@nbarrientos
Copy link
Contributor Author

@nbarrientos nbarrientos commented Sep 14, 2016

Thanks, @headius :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.