Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault in IO.select #3201

Closed
nirvdrum opened this issue Aug 7, 2023 · 6 comments · Fixed by #3241
Closed

Segfault in IO.select #3201

nirvdrum opened this issue Aug 7, 2023 · 6 comments · Fixed by #3241
Assignees
Labels

Comments

@nirvdrum
Copy link
Collaborator

nirvdrum commented Aug 7, 2023

I have a Rails application that periodically segfaults. The hs_err log indicates it's stemming from the dnsruby gem as it calls IO.select. I'm seeing this with TruffleRuby 23.0.0 on Oracle GraalVM running in JVM mode. The gem is used by an in-house gem that wraps around ruby-kafka. As the hs_err shows, there's a background thread that does periodic DNS checks.

I haven't yet debugged the code to see what values are flowing through the system when the error occurs. The dnsruby gem uses class variables for various bits of state and I haven't yet verified the thread-safety of that code.

hs_err.log

@eregon
Copy link
Member

eregon commented Aug 8, 2023

So it must be from this line:
https://github.com/alexdalitz/dnsruby/blob/053d4743e9a74c2d9c5fe34eebea6bec7984e632/lib/dnsruby/select_thread.rb#L193

ready, _write, _errors = IO.select(sockets, nil, nil, timeout)

It could be an issue in truffleposix_select() (more likely) or in NFI.
Obviously IO.select should never segfault.

I wonder if it's a case of a file descriptor being > FD_SETSIZE (1024).
If so the reason is most likely that we use FD_ZERO() and that doesn't support such big fds.
BTW select(2) does not officially support any FD bigger than FD_SETSIZE but it seems in practice it does:

DESCRIPTION
       WARNING:  select()  can  monitor only file descriptors numbers that are less than FD_SETSIZE (1024)—an unreasonably
       low limit for many modern applications—and this limitation will not change.  All modern applications should instead
       use poll(2) or epoll(7), which do not suffer this limitation.

@eregon eregon added the bug label Aug 8, 2023
@eregon
Copy link
Member

eregon commented Aug 8, 2023

There is rb_fd_init(), rb_fd_zero() etc to deal with such big fds.
Those headers are not really accessible to libtruffleposix though, and the implementation is in libtruffleruby.
But we don't really want IO.select depend on loading C extensions support.

Another idea would be to use poll(2) to implement IO.select, not sure whether if it's a good idea.

@eregon
Copy link
Member

eregon commented Aug 8, 2023

I tried to reproduce this with:
On CRuby:

$ ruby -ve 'ios=2048.times.map { io=File.new("README.md"); p io.fileno; io }; p ios.last.fileno; IO.select([ios.last], nil, nil, 1.0)'
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
...
1022
1023
-e:1:in `initialize': Too many open files @ rb_sysopen - README.md (Errno::EMFILE)

On TruffleRuby:

$ ruby -ve 'ios=4096.times.map { io=File.new("README.md"); p io.fileno; io }; p ios.last.fileno; p IO.select([ios.last], nil, nil, 1.0)'
truffleruby 23.1.0-dev-b6c229a4, like ruby 3.1.3, GraalVM CE Native [x86_64-linux]
...
4095
4096
4097
4098
4099
4100
4101
4101
zsh: segmentation fault (core dumped)  ruby -ve 

So interestingly CRuby raises EMFILE for what would be fd 1024.
And for some unknown reason TruffleRuby goes higher.

At least what we can do is add a check in IO.select to raise if fd > FD_SETSIZE.

@eregon
Copy link
Member

eregon commented Aug 8, 2023

open(2) mentions:

EMFILE The per-process limit on the number of open file descriptors has been reached (see the description of RLIMIT_NOFILE in getrlimit(2)).

and getrlimit(2) mentions:

       RLIMIT_NOFILE
              This specifies a value one greater than the maximum file descriptor  number  that  can  be  opened  by  this
              process.  Attempts (open(2), pipe(2), dup(2), etc.)  to exceed this limit yield the error EMFILE.  (Histori‐
              cally, this limit was named RLIMIT_OFILE on BSD.)

              Since Linux 4.5, this limit also defines the maximum number of file descriptors that an unprivileged process
              (one  without  the  CAP_SYS_RESOURCE  capability)  may  have "in flight" to other processes, by being passed
              across UNIX domain sockets.  This limit applies to the sendmsg(2) system call.   For  further  details,  see
              unix(7).

And the default limit for me is $ limit: descriptors 1024.
So it seems the JVM/SVM increases the soft limit to the hard limit:

$ ruby -ve 'p Process.getrlimit Process::RLIMIT_NOFILE'
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
[1024, 524288]

$ ruby -ve 'p Process.getrlimit Process::RLIMIT_NOFILE'
truffleruby 23.1.0-dev-b6c229a4, like ruby 3.1.3, GraalVM CE Native [x86_64-linux]
[524288, 524288]

$ jt -q ruby -ve 'p Process.getrlimit Process::RLIMIT_NOFILE'
truffleruby 23.1.0-dev-82d3a4a5, like ruby 3.1.3, Interpreted JVM [x86_64-linux]
[524288, 524288]

In SVM it's done in https://github.com/oracle/graal/blob/79738daa192f8dd941691a07e4bbd5238fcccd81/substratevm/src/com.oracle.svm.core.posix/src/com/oracle/svm/core/posix/PosixNativeLibraryFeature.java#L81-L90

@eregon
Copy link
Member

eregon commented Aug 8, 2023

On CRuby if we set the limit then it works with higher fds:

$ ruby -ve 'Process.setrlimit(Process::RLIMIT_NOFILE, 524288); ios=4096.times.map { io=File.new("README.md"); p io.fileno; io }; p ios.last.fileno; p IO.select([ios.last], nil, nil, 1.0)'
...
4099
4100
4100
[[#<File:README.md>], [], []]

@eregon eregon self-assigned this Aug 8, 2023
@eregon
Copy link
Member

eregon commented Aug 28, 2023

I have a branch reimplementing IO.select with poll() or ppoll() which seems a pretty good fit, removes this limit of fd < 1024 and is likely faster than select(2).

graalvmbot pushed a commit that referenced this issue Aug 29, 2023
* This avoids the issue of file descriptors >= 1024 that FD_ZERO does not handle.
* Fixes #3201
mtortonesi pushed a commit to mtortonesi/truffleruby that referenced this issue Nov 10, 2023
* This avoids the issue of file descriptors >= 1024 that FD_ZERO does not handle.
* Fixes oracle#3201
* Skip TestIO#test_select_exceptfds on macOS, macOS's poll(2) is buggy with TCP MSG_OOB or macOS does not really support MSG_OOB:
  For a fd to which out-of-band data has been sent,
  poll() with POLLIN|POLLPRI returns POLLRDNORM|POLLPRI|POLLIN on Linux (correct) but POLLRDNORM|POLLIN on macOS (bug).
  poll() with just POLLPRI works fine on Linux but hangs on macOS (bug).
  It seems a bug of macOS poll() not handling TCP MSG_OOB.
  The man page of poll() on macOS talks about: "The distinction between normal, priority, and high-priority data
  is specific to particular file types or devices." but gives no details, it seems a mess.
  MSG_OOB is poorly supported across platforms anyway and extremely rarely if ever used.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants