New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix intermittent fails in CI #8007
Conversation
* Rename some methods to describe their purpose better. * Use shortcut syntax to create interrupted IOException. * Raise interrupted IOException when about to read but fd has suddenly been cleared. The last item here was responsible for occasional EBADF failues in an IO#read spec testing for parallel close and thread interrupts. If we have reached this point in the code, the "fd" should have been null-checked much earlier, meaning this null has to have been done in parallel. Therefore, we should raise the interrupt IOException here. This will fix one intermittent rubyspec failure: ``` IO#read raises IOError when stream is closed by another thread FAILED Expected #<Errno::EBADF: Bad file descriptor - Bad file descriptor> (Errno::EBADF) to be kind of IOError /home/runner/work/jruby/jruby/spec/ruby/core/io/read_spec.rb:426:in `block in <main>' ``` A variation on that spec, run in a loop, was used to confirm that we sometimes raise EBADF at the point I have fixed here.
Unable to reproduce standalone so far. |
The fileno mapping is managed globally for a given JRuby runtime and during testing there may be other threads still cleaning up and removing filenos from the mapping. We give this test ten tries to have the starting count and ending count match. This should help avoid intermittent failures like the following: ``` Failure: test_rubyio_fileno_mapping_leak(TestIO) /home/runner/work/jruby/jruby/test/jruby/test_io.rb:527:in `test_rubyio_fileno_mapping_leak' 524: io = org.jruby.RubyIO.new(JRuby.runtime, java.io.ByteArrayOutputStream.new) 525: 526: open_io_count = fileno_util.number_of_wrappers => 527: assert_equal(starting_count + 1, open_io_count) 528: 529: io.close 530: closed_io_count = fileno_util.number_of_wrappers ```
The Ruby.newBlahError methods are expected to return a fully- prepared RaiseException with stack trace and current $! set. If these methods are used to create a RubyException for throwing on another thread, the current thread's stack trace and $! will be used incorrectly. Instead, I start the process of adding direct constructors for Ruby exceptions, starting with RubyIOError.newIOError, which will return just the Ruby exception object without any preparation for throwing.
These methods that bottom out in `newRaiseException` or `toThrowable` prepare the exception for an immediate throw, which means: * The backtrace will be populated * The cause will be set to the currently in-flight exception * The current thread's $! error info will be set to this exception As such it should not be used to create a "simple" Ruby Exception object for throwing on a different thread or at a later time.
Even with a 20 second GC loop this still occasionally fails on Windows. This means one of two things: * 20 seconds of GCs with 0.1s pause between each is insufficient to ensure that the tempfiles have been cleaned up. If so, there may be no amount of time we can run to ensure this test passes. * Windows may be depositing or detecting other files in this temp dir and failing as a result. The former can't be fixed; if that is the case, and this test works correctly on other platforms, we may have to accept that it can't be run reliably on Windows (and exclude it there). I attempt to improve the latter here by only searching for files that will match the requested temp file name.
May indicate we can't reliably clean up tempfiles on Windows (this seems to only be intermittent on Windows). I have made a commit to narrow the glob, and if that works we will probably need to skip this on Windows. Update: excluded on Windows; it fails more consistently than expected, and a narrow glob did not help. |
Failures that do not report properly are impossible to investigate with dot or file output.
Windows does not seem to reliably clean up the tempfiles based on our GC loop for 20 seconds. Since this is reliably working on other platforms, we'll skip it on Windows and assume it is just JDK or platform weirdness.
Similar to |
We always check for null fd when entering an IO operation; this check occurs later, and must mean that the fd was nulled out in parallel. Raise a more appropriate error.
We poll after each failed select to make sure we're not supposed to wake up and die or propagate an exception. If we are supposed to block, we should keep blocking here until interrupted or the stream has data. If we proceed from here to a blocking read when data is unavailable, we may end up inside NIO code that closes the stream upon any interrupt. Selects are interrupt, so we work harder to stay in select logic when doing a blocking read that might be interrupted.
This one appears to be getting stuck, perhaps losing a signal somewhere and failing to trigger. When it fails, mspec dumps all thread stack traces which usually includes many unfinalized threads and fibers, making this difficult to find in the output. Update: debugging a shortened version of this shows that the signal handler lambda seems to be getting desynced from the surrounding scope; I see it spinning waiting for loop do
done = false
handler = -> signo {
print("o")
done = true
Thread.pass while done
}
print(Signal.trap(:HUP, handler) ? "." : "X")
Process.kill :HUP, Process.pid
Thread.pass until done
done = false
end If this is made into a while loop, not creating a new scope as in the loop case, it never hangs presumably because the signal lambda always points at that one scope. Something here seems to be causing the lambda for the signal to get associated with a previous iteration's scope. Update: the issue seems to be that we re-enable the same handler at the end of the handler logic, which might wipe out any other handler installed meanwhile. Fix is in this PR. |
Most recent commit 8722fd2 modifies the read-select logic in IO to that if select fails (returns false, possibly a sporadic wake-up or error), we re-try it rather than ignoring the result and proceeding into a blocking read. If we do not do this, we may end up in an NIO blocking read that closes the stream on interrupt. I suspect this is the cause of these unexplained EBADF; we close an in-progress NIO read, which causes the stream to be closed, and then subsequent IO operations see ChannelClosedException and raise it as EBADF. |
Doing this re-set at the bottom led to a situation where the signal handler thread – running in parallel – might overlap with another signal trap, and while finishing up this handler logic it could wipe out that trap's handler. This led to unpredictable hangs in some signal-related tests and specs as detailed in jruby#8007.
The signal handler issue seems to be due to us re-setting the handler to the same handler, as recommended in many guides on using unix signal handling. Unfortunately, we set it in a See #8007 (comment) for the problem discussion and example script. |
See https://www.cs.kent.edu/~ruttan/sysprog/lectures/signals.html for a discussion about signal handling at the C level, which shows the re-set of the handler as the first thing the handler does. |
There are concurrent shutdown issues with WEBrick that have never been fixed (ruby/webrick#102 which may explain why we get these sporadic failures in several of the net/http tests: ``` 1) Error: TestNetHTTP_v1_2#test_s_start: Errno::EPIPE: Broken pipe - No message available org/jruby/RubyIO.java:1391:in `write_nonblock' /home/runner/work/jruby/jruby/test/tool/lib/webrick/server.rb:227:in `block in stop' /home/runner/work/jruby/jruby/test/tool/lib/webrick/server.rb:352:in `alarm_shutdown_pipe' /home/runner/work/jruby/jruby/test/tool/lib/webrick/server.rb:227:in `stop' /home/runner/work/jruby/jruby/test/tool/lib/webrick/server.rb:235:in `shutdown' /home/runner/work/jruby/jruby/test/mri/net/http/utils.rb:37:in `teardown' ``` I have commented on the issue and this commit adds more exceptions to ignore in an attempt to prevent unsafe shutdown from causing any tests to fail.
This
I am adding the same skip and some improvements to another sporadic failure in ruby-concurrency/concurrent-ruby#1012. |
This will contain small fixes to get our CI more reliable, by fixing the causes of intermittent failures.
IO#read raises IOError when stream is closed by another thread
intermittent due to different ways of handling a nulled-out "fd". All cases where "fd" is null when it was previously confirmed to be non-null should indicate parallel close occurred. (Better state management here would be helpful, to ensure we know it only could have been closed in parallel.