Fix intermittent fails in CI #8007

headius · 2023-11-06T20:52:39Z

This will contain small fixes to get our CI more reliable, by fixing the causes of intermittent failures.

IO#read raises IOError when stream is closed by another thread intermittent due to different ways of handling a nulled-out "fd". All cases where "fd" is null when it was previously confirmed to be non-null should indicate parallel close occurred. (Better state management here would be helpful, to ensure we know it only could have been closed in parallel.

* Rename some methods to describe their purpose better. * Use shortcut syntax to create interrupted IOException. * Raise interrupted IOException when about to read but fd has suddenly been cleared. The last item here was responsible for occasional EBADF failues in an IO#read spec testing for parallel close and thread interrupts. If we have reached this point in the code, the "fd" should have been null-checked much earlier, meaning this null has to have been done in parallel. Therefore, we should raise the interrupt IOException here. This will fix one intermittent rubyspec failure: ``` IO#read raises IOError when stream is closed by another thread FAILED Expected #<Errno::EBADF: Bad file descriptor - Bad file descriptor> (Errno::EBADF) to be kind of IOError /home/runner/work/jruby/jruby/spec/ruby/core/io/read_spec.rb:426:in `block in <main>' ``` A variation on that spec, run in a loop, was used to confirm that we sometimes raise EBADF at the point I have fixed here.

headius · 2023-11-06T21:04:50Z

TestIO#test_read_unlocktmp_ensure:
Errno::EBADF: Bad file descriptor - No message available
    org/jruby/RubyIO.java:2196:in `close'
    org/jruby/RubyIO.java:[45](https://github.com/jruby/jruby/actions/runs/6748586512/job/18405119095#step:4:46)04:in `pipe'
    /home/runner/work/jruby/jruby/test/mri/ruby/test_io.rb:3592:in `test_read_unlocktmp_ensure'

Unable to reproduce standalone so far.

The fileno mapping is managed globally for a given JRuby runtime and during testing there may be other threads still cleaning up and removing filenos from the mapping. We give this test ten tries to have the starting count and ending count match. This should help avoid intermittent failures like the following: ``` Failure: test_rubyio_fileno_mapping_leak(TestIO) /home/runner/work/jruby/jruby/test/jruby/test_io.rb:527:in `test_rubyio_fileno_mapping_leak' 524: io = org.jruby.RubyIO.new(JRuby.runtime, java.io.ByteArrayOutputStream.new) 525: 526: open_io_count = fileno_util.number_of_wrappers => 527: assert_equal(starting_count + 1, open_io_count) 528: 529: io.close 530: closed_io_count = fileno_util.number_of_wrappers ```

The Ruby.newBlahError methods are expected to return a fully- prepared RaiseException with stack trace and current $! set. If these methods are used to create a RubyException for throwing on another thread, the current thread's stack trace and $! will be used incorrectly. Instead, I start the process of adding direct constructors for Ruby exceptions, starting with RubyIOError.newIOError, which will return just the Ruby exception object without any preparation for throwing.

These methods that bottom out in `newRaiseException` or `toThrowable` prepare the exception for an immediate throw, which means: * The backtrace will be populated * The cause will be set to the currently in-flight exception * The current thread's $! error info will be set to this exception As such it should not be used to create a "simple" Ruby Exception object for throwing on a different thread or at a later time.

Even with a 20 second GC loop this still occasionally fails on Windows. This means one of two things: * 20 seconds of GCs with 0.1s pause between each is insufficient to ensure that the tempfiles have been cleaned up. If so, there may be no amount of time we can run to ensure this test passes. * Windows may be depositing or detecting other files in this temp dir and failing as a result. The former can't be fixed; if that is the case, and this test works correctly on other platforms, we may have to accept that it can't be run reliably on Windows (and exclude it there). I attempt to improve the latter here by only searching for files that will match the requested temp file name.

headius · 2023-11-06T22:23:46Z

Failure: test_cleanup(TestTempfileCleanup)
D:/a/jruby/jruby/test/jruby/test_tempfile_cleanup.rb:38:in `test_cleanup'
     35: 
     36:     tmp_files = Dir["#{@tmpdir}/*"]
     37:     # test that the files are gone
  => 38:     assert_equal 0, tmp_files.size, "Files were not cleaned up: (#{tmp_files.size}) #{tmp_files}"
     39:   end
     40: end

May indicate we can't reliably clean up tempfiles on Windows (this seems to only be intermittent on Windows). I have made a commit to narrow the glob, and if that works we will probably need to skip this on Windows.

Update: excluded on Windows; it fails more consistently than expected, and a narrow glob did not help.

Failures that do not report properly are impossible to investigate with dot or file output.

Windows does not seem to reliably clean up the tempfiles based on our GC loop for 20 seconds. Since this is reliably working on other platforms, we'll skip it on Windows and assume it is just JDK or platform weirdness.

headius · 2023-11-07T00:25:02Z

TestIO#test_readpartial_buffer_error:
Errno::EBADF: Bad file descriptor - No message available
    org/jruby/RubyIO.java:2196:in `close'
    /home/runner/work/jruby/jruby/test/mri/ruby/test_io.rb:65:in `with_pipe'
    /home/runner/work/jruby/jruby/test/mri/ruby/test_io.rb:1505:in `test_readpartial_buffer_error'

Similar to test_read_unlocktmp_ensure in that a pipe IO appears to be closed already by the time we try to close it (either manually or via IO.pipe's block form

We always check for null fd when entering an IO operation; this check occurs later, and must mean that the fd was nulled out in parallel. Raise a more appropriate error.

We poll after each failed select to make sure we're not supposed to wake up and die or propagate an exception. If we are supposed to block, we should keep blocking here until interrupted or the stream has data. If we proceed from here to a blocking read when data is unavailable, we may end up inside NIO code that closes the stream upon any interrupt. Selects are interrupt, so we work harder to stay in select logic when doing a blocking read that might be interrupted.

headius · 2023-11-07T20:23:06Z

Example took longer than the configured timeout of 120.0s:
Signal.trap accepts a proc

This one appears to be getting stuck, perhaps losing a signal somewhere and failing to trigger. When it fails, mspec dumps all thread stack traces which usually includes many unfinalized threads and fibers, making this difficult to find in the output.

Update: debugging a shortened version of this shows that the signal handler lambda seems to be getting desynced from the surrounding scope; I see it spinning waiting for done to be false in a different scope than the outer block that would set it to false:

loop do
  done = false

  handler = -> signo {
    print("o")
    done = true
    Thread.pass while done
  }

  print(Signal.trap(:HUP, handler) ? "." : "X")

  Process.kill :HUP, Process.pid
  Thread.pass until done
  done = false
end

If this is made into a while loop, not creating a new scope as in the loop case, it never hangs presumably because the signal lambda always points at that one scope.

Something here seems to be causing the lambda for the signal to get associated with a previous iteration's scope.

Update: the issue seems to be that we re-enable the same handler at the end of the handler logic, which might wipe out any other handler installed meanwhile. Fix is in this PR.

headius · 2023-11-07T20:26:25Z

Most recent commit 8722fd2 modifies the read-select logic in IO to that if select fails (returns false, possibly a sporadic wake-up or error), we re-try it rather than ignoring the result and proceeding into a blocking read. If we do not do this, we may end up in an NIO blocking read that closes the stream on interrupt. I suspect this is the cause of these unexplained EBADF; we close an in-progress NIO read, which causes the stream to be closed, and then subsequent IO operations see ChannelClosedException and raise it as EBADF.

Doing this re-set at the bottom led to a situation where the signal handler thread – running in parallel – might overlap with another signal trap, and while finishing up this handler logic it could wipe out that trap's handler. This led to unpredictable hangs in some signal-related tests and specs as detailed in jruby#8007.

headius · 2023-11-07T21:47:16Z

The signal handler issue seems to be due to us re-setting the handler to the same handler, as recommended in many guides on using unix signal handling. Unfortunately, we set it in a finally block in the handler logic, which meant that it might overwrite any new handler update that happens before this handler has finished executing responding to the signal. The fix is to move the re-set to the top of the handler, to avoid stepping on any code that might change the handler while we execute.

See #8007 (comment) for the problem discussion and example script.

headius · 2023-11-07T22:21:25Z

See https://www.cs.kent.edu/~ruttan/sysprog/lectures/signals.html for a discussion about signal handling at the C level, which shows the re-set of the handler as the first thing the handler does.

There are concurrent shutdown issues with WEBrick that have never been fixed (ruby/webrick#102 which may explain why we get these sporadic failures in several of the net/http tests: ``` 1) Error: TestNetHTTP_v1_2#test_s_start: Errno::EPIPE: Broken pipe - No message available org/jruby/RubyIO.java:1391:in `write_nonblock' /home/runner/work/jruby/jruby/test/tool/lib/webrick/server.rb:227:in `block in stop' /home/runner/work/jruby/jruby/test/tool/lib/webrick/server.rb:352:in `alarm_shutdown_pipe' /home/runner/work/jruby/jruby/test/tool/lib/webrick/server.rb:227:in `stop' /home/runner/work/jruby/jruby/test/tool/lib/webrick/server.rb:235:in `shutdown' /home/runner/work/jruby/jruby/test/mri/net/http/utils.rb:37:in `teardown' ``` I have commented on the issue and this commit adds more exceptions to ignore in an attempt to prevent unsafe shutdown from causing any tests to fail.

headius · 2023-11-08T01:41:09Z

This ErlangActor sporadic failure appears to also have been noticed on TruffleRuby and they added skips for it.

  1) Concurrent ErlangActor on pool behaves like erlang actor asking timing out
     Failure/Error: raise NoActor.new(@Pid) if @Terminated.resolved?

     Concurrent::ErlangActor::NoActor:
       #<Concurrent::ErlangActor::Pid:0x7f58d87f terminated normally with true>
     Shared Example Group: "erlang actor" called from ./spec/concurrent/edge/erlang_actor_spec.rb:975
     # ./lib/concurrent-ruby-edge/concurrent/edge/erlang_actor.rb:706:in `ask'
     # ./lib/concurrent-ruby-edge/concurrent/edge/erlang_actor.rb:82:in `ask'
     # ./spec/concurrent/edge/erlang_actor_spec.rb:934:in `block in <main>'
     # org/jruby/RubyBasicObject.java:2614:in `instance_exec'

https://github.com/ruby-concurrency/concurrent-ruby/pull/943/files#diff-80d88e9530950f5a6b2198da6c47ed17d0cf98e088412359936ce86bc2998890R920

I am adding the same skip and some improvements to another sporadic failure in ruby-concurrency/concurrent-ruby#1012.

headius added this to the JRuby 9.4.6.0 milestone Nov 6, 2023

headius added 5 commits November 6, 2023 15:16

This file uses Tempfile but does not require it

5428a72

headius added 2 commits November 6, 2023 16:28

Switch rubyspec CI to specdoc output

4d1db58

Failures that do not report properly are impossible to investigate with dot or file output.

Skip this test on Windows

a339c7e

Windows does not seem to reliably clean up the tempfiles based on our GC loop for 20 seconds. Since this is reliably working on other platforms, we'll skip it on Windows and assume it is just JDK or platform weirdness.

headius added 3 commits November 7, 2023 13:58

If fd is null here it was closed in parallel

07bc504

We always check for null fd when entering an IO operation; this check occurs later, and must mean that the fd was nulled out in parallel. Raise a more appropriate error.

More descriptive name

ae9051d

headius merged commit e0a77c5 into jruby:master Nov 8, 2023
73 of 74 checks passed

headius deleted the fix_intermittent_fails branch November 8, 2023 02:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix intermittent fails in CI #8007

Fix intermittent fails in CI #8007

headius commented Nov 6, 2023

headius commented Nov 6, 2023 •

edited

headius commented Nov 6, 2023 •

edited

headius commented Nov 7, 2023

headius commented Nov 7, 2023 •

edited

headius commented Nov 7, 2023

headius commented Nov 7, 2023

headius commented Nov 7, 2023

headius commented Nov 8, 2023

Fix intermittent fails in CI #8007

Fix intermittent fails in CI #8007

Conversation

headius commented Nov 6, 2023

headius commented Nov 6, 2023 • edited

headius commented Nov 6, 2023 • edited

headius commented Nov 7, 2023

headius commented Nov 7, 2023 • edited

headius commented Nov 7, 2023

headius commented Nov 7, 2023

headius commented Nov 7, 2023

headius commented Nov 8, 2023

headius commented Nov 6, 2023 •

edited

headius commented Nov 6, 2023 •

edited

headius commented Nov 7, 2023 •

edited