Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix intermittent fails in CI #8007

Merged
merged 13 commits into from Nov 8, 2023
Merged

Conversation

headius
Copy link
Member

@headius headius commented Nov 6, 2023

This will contain small fixes to get our CI more reliable, by fixing the causes of intermittent failures.

  • IO#read raises IOError when stream is closed by another thread intermittent due to different ways of handling a nulled-out "fd". All cases where "fd" is null when it was previously confirmed to be non-null should indicate parallel close occurred. (Better state management here would be helpful, to ensure we know it only could have been closed in parallel.

* Rename some methods to describe their purpose better.
* Use shortcut syntax to create interrupted IOException.
* Raise interrupted IOException when about to read but fd has
  suddenly been cleared.

The last item here was responsible for occasional EBADF failues in
an IO#read spec testing for parallel close and thread interrupts.
If we have reached this point in the code, the "fd" should have
been null-checked much earlier, meaning this null has to have been
done in parallel. Therefore, we should raise the interrupt
IOException here.

This will fix one intermittent rubyspec failure:

```
IO#read raises IOError when stream is closed by another thread FAILED
Expected #<Errno::EBADF: Bad file descriptor - Bad file descriptor> (Errno::EBADF)
to be kind of IOError
/home/runner/work/jruby/jruby/spec/ruby/core/io/read_spec.rb:426:in `block in <main>'
```

A variation on that spec, run in a loop, was used to confirm that
we sometimes raise EBADF at the point I have fixed here.
@headius headius added this to the JRuby 9.4.6.0 milestone Nov 6, 2023
@headius
Copy link
Member Author

headius commented Nov 6, 2023

TestIO#test_read_unlocktmp_ensure:
Errno::EBADF: Bad file descriptor - No message available
    org/jruby/RubyIO.java:2196:in `close'
    org/jruby/RubyIO.java:[45](https://github.com/jruby/jruby/actions/runs/6748586512/job/18405119095#step:4:46)04:in `pipe'
    /home/runner/work/jruby/jruby/test/mri/ruby/test_io.rb:3592:in `test_read_unlocktmp_ensure'

Unable to reproduce standalone so far.

The fileno mapping is managed globally for a given JRuby runtime
and during testing there may be other threads still cleaning up
and removing filenos from the mapping. We give this test ten tries
to have the starting count and ending count match.

This should help avoid intermittent failures like the following:

```
Failure: test_rubyio_fileno_mapping_leak(TestIO)
/home/runner/work/jruby/jruby/test/jruby/test_io.rb:527:in `test_rubyio_fileno_mapping_leak'
     524:     io = org.jruby.RubyIO.new(JRuby.runtime, java.io.ByteArrayOutputStream.new)
     525:
     526:     open_io_count = fileno_util.number_of_wrappers
  => 527:     assert_equal(starting_count + 1, open_io_count)
     528:
     529:     io.close
     530:     closed_io_count = fileno_util.number_of_wrappers
```
The Ruby.newBlahError methods are expected to return a fully-
prepared RaiseException with stack trace and current $! set. If
these methods are used to create a RubyException for throwing on
another thread, the current thread's stack trace and $! will be
used incorrectly.

Instead, I start the process of adding direct constructors for
Ruby exceptions, starting with RubyIOError.newIOError, which will
return just the Ruby exception object without any preparation for
throwing.
These methods that bottom out in `newRaiseException` or
`toThrowable` prepare the exception for an immediate throw, which
means:

* The backtrace will be populated
* The cause will be set to the currently in-flight exception
* The current thread's $! error info will be set to this exception

As such it should not be used to create a "simple" Ruby Exception
object for throwing on a different thread or at a later time.
Even with a 20 second GC loop this still occasionally fails on
Windows. This means one of two things:

* 20 seconds of GCs with 0.1s pause between each is insufficient
  to ensure that the tempfiles have been cleaned up. If so, there
  may be no amount of time we can run to ensure this test passes.
* Windows may be depositing or detecting other files in this temp
  dir and failing as a result.

The former can't be fixed; if that is the case, and this test
works correctly on other platforms, we may have to accept that it
can't be run reliably on Windows (and exclude it there).

I attempt to improve the latter here by only searching for files
that will match the requested temp file name.
@headius
Copy link
Member Author

headius commented Nov 6, 2023

Failure: test_cleanup(TestTempfileCleanup)
D:/a/jruby/jruby/test/jruby/test_tempfile_cleanup.rb:38:in `test_cleanup'
     35: 
     36:     tmp_files = Dir["#{@tmpdir}/*"]
     37:     # test that the files are gone
  => 38:     assert_equal 0, tmp_files.size, "Files were not cleaned up: (#{tmp_files.size}) #{tmp_files}"
     39:   end
     40: end

May indicate we can't reliably clean up tempfiles on Windows (this seems to only be intermittent on Windows). I have made a commit to narrow the glob, and if that works we will probably need to skip this on Windows.

Update: excluded on Windows; it fails more consistently than expected, and a narrow glob did not help.

Failures that do not report properly are impossible to investigate
with dot or file output.
Windows does not seem to reliably clean up the tempfiles based on
our GC loop for 20 seconds. Since this is reliably working on
other platforms, we'll skip it on Windows and assume it is just
JDK or platform weirdness.
@headius
Copy link
Member Author

headius commented Nov 7, 2023

TestIO#test_readpartial_buffer_error:
Errno::EBADF: Bad file descriptor - No message available
    org/jruby/RubyIO.java:2196:in `close'
    /home/runner/work/jruby/jruby/test/mri/ruby/test_io.rb:65:in `with_pipe'
    /home/runner/work/jruby/jruby/test/mri/ruby/test_io.rb:1505:in `test_readpartial_buffer_error'

Similar to test_read_unlocktmp_ensure in that a pipe IO appears to be closed already by the time we try to close it (either manually or via IO.pipe's block form

We always check for null fd when entering an IO operation; this
check occurs later, and must mean that the fd was nulled out in
parallel. Raise a more appropriate error.
We poll after each failed select to make sure we're not supposed
to wake up and die or propagate an exception. If we are supposed
to block, we should keep blocking here until interrupted or the
stream has data.

If we proceed from here to a blocking read when data is
unavailable, we may end up inside NIO code that closes the stream
upon any interrupt. Selects are interrupt, so we work harder to
stay in select logic when doing a blocking read that might be
interrupted.
@headius
Copy link
Member Author

headius commented Nov 7, 2023

Example took longer than the configured timeout of 120.0s:
Signal.trap accepts a proc

This one appears to be getting stuck, perhaps losing a signal somewhere and failing to trigger. When it fails, mspec dumps all thread stack traces which usually includes many unfinalized threads and fibers, making this difficult to find in the output.

Update: debugging a shortened version of this shows that the signal handler lambda seems to be getting desynced from the surrounding scope; I see it spinning waiting for done to be false in a different scope than the outer block that would set it to false:

loop do
  done = false

  handler = -> signo {
    print("o")
    done = true
    Thread.pass while done
  }

  print(Signal.trap(:HUP, handler) ? "." : "X")

  Process.kill :HUP, Process.pid
  Thread.pass until done
  done = false
end

If this is made into a while loop, not creating a new scope as in the loop case, it never hangs presumably because the signal lambda always points at that one scope.

Something here seems to be causing the lambda for the signal to get associated with a previous iteration's scope.

Update: the issue seems to be that we re-enable the same handler at the end of the handler logic, which might wipe out any other handler installed meanwhile. Fix is in this PR.

@headius
Copy link
Member Author

headius commented Nov 7, 2023

Most recent commit 8722fd2 modifies the read-select logic in IO to that if select fails (returns false, possibly a sporadic wake-up or error), we re-try it rather than ignoring the result and proceeding into a blocking read. If we do not do this, we may end up in an NIO blocking read that closes the stream on interrupt. I suspect this is the cause of these unexplained EBADF; we close an in-progress NIO read, which causes the stream to be closed, and then subsequent IO operations see ChannelClosedException and raise it as EBADF.

Doing this re-set at the bottom led to a situation where the
signal handler thread – running in parallel – might overlap with
another signal trap, and while finishing up this handler logic it
could wipe out that trap's handler. This led to unpredictable
hangs in some signal-related tests and specs as detailed in jruby#8007.
@headius
Copy link
Member Author

headius commented Nov 7, 2023

The signal handler issue seems to be due to us re-setting the handler to the same handler, as recommended in many guides on using unix signal handling. Unfortunately, we set it in a finally block in the handler logic, which meant that it might overwrite any new handler update that happens before this handler has finished executing responding to the signal. The fix is to move the re-set to the top of the handler, to avoid stepping on any code that might change the handler while we execute.

See #8007 (comment) for the problem discussion and example script.

@headius
Copy link
Member Author

headius commented Nov 7, 2023

See https://www.cs.kent.edu/~ruttan/sysprog/lectures/signals.html for a discussion about signal handling at the C level, which shows the re-set of the handler as the first thing the handler does.

There are concurrent shutdown issues with WEBrick that have never
been fixed (ruby/webrick#102 which may explain why we get these
sporadic failures in several of the net/http tests:

```
  1) Error:
TestNetHTTP_v1_2#test_s_start:
Errno::EPIPE: Broken pipe - No message available
    org/jruby/RubyIO.java:1391:in `write_nonblock'
    /home/runner/work/jruby/jruby/test/tool/lib/webrick/server.rb:227:in `block in stop'
    /home/runner/work/jruby/jruby/test/tool/lib/webrick/server.rb:352:in `alarm_shutdown_pipe'
    /home/runner/work/jruby/jruby/test/tool/lib/webrick/server.rb:227:in `stop'
    /home/runner/work/jruby/jruby/test/tool/lib/webrick/server.rb:235:in `shutdown'
    /home/runner/work/jruby/jruby/test/mri/net/http/utils.rb:37:in `teardown'
```

I have commented on the issue and this commit adds more exceptions
to ignore in an attempt to prevent unsafe shutdown from causing
any tests to fail.
@headius
Copy link
Member Author

headius commented Nov 8, 2023

This ErlangActor sporadic failure appears to also have been noticed on TruffleRuby and they added skips for it.

  1) Concurrent ErlangActor on pool behaves like erlang actor asking timing out
     Failure/Error: raise NoActor.new(@Pid) if @Terminated.resolved?

     Concurrent::ErlangActor::NoActor:
       #<Concurrent::ErlangActor::Pid:0x7f58d87f terminated normally with true>
     Shared Example Group: "erlang actor" called from ./spec/concurrent/edge/erlang_actor_spec.rb:975
     # ./lib/concurrent-ruby-edge/concurrent/edge/erlang_actor.rb:706:in `ask'
     # ./lib/concurrent-ruby-edge/concurrent/edge/erlang_actor.rb:82:in `ask'
     # ./spec/concurrent/edge/erlang_actor_spec.rb:934:in `block in <main>'
     # org/jruby/RubyBasicObject.java:2614:in `instance_exec'

https://github.com/ruby-concurrency/concurrent-ruby/pull/943/files#diff-80d88e9530950f5a6b2198da6c47ed17d0cf98e088412359936ce86bc2998890R920

I am adding the same skip and some improvements to another sporadic failure in ruby-concurrency/concurrent-ruby#1012.

@headius headius merged commit e0a77c5 into jruby:master Nov 8, 2023
73 of 74 checks passed
@headius headius deleted the fix_intermittent_fails branch November 8, 2023 02:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant