Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky OpenSSL tests #8118

Open
jcharaoui opened this issue Feb 22, 2024 · 13 comments
Open

Flaky OpenSSL tests #8118

jcharaoui opened this issue Feb 22, 2024 · 13 comments

Comments

@jcharaoui
Copy link
Contributor

Hello,

I'm seeing intermittent mri-stdlib test failures with JRuby 9.4.5.0 on Debian testing/sid with OpenJDK 17.0.10.

The failures are related to one specific test: TestOpenURISSL#test_validation_noverify.

When I loop this command, usually after anywhere between one and forty runs, it always ends up failing:

while true; do echo "attempt #${n} ..."; jruby test/mri/runner.rb --excludes=test/mri/excludes:test/mri/excludes_wip test/mri/open-uri/test_ssl.rb; sleep 1; n=$(( n+1 )); done

When it fails, this is the stack trace:

Run options: 
  --seed=61134
  --excludes=test/mri/excludes:test/mri/excludes_wip

# Running tests:

warning: thread "Ruby-0-Thread-22: /tmp/autopkgtest-lxc.6djqypt5/downtmp/autopkgtest_tmp/test/mri/open-uri/test_ssl.rb:45" terminated with exception (report_on_exception is true):
EOFError: end of file reached
                 rbuf_fill at /usr/share/jruby/lib/ruby/stdlib/net/protocol.rb:227
                 readuntil at /usr/share/jruby/lib/ruby/stdlib/net/protocol.rb:193
                  readline at /usr/share/jruby/lib/ruby/stdlib/net/protocol.rb:203
          read_status_line at /usr/share/jruby/lib/ruby/stdlib/net/http/response.rb:42
                  read_new at /usr/share/jruby/lib/ruby/stdlib/net/http/response.rb:31
         transport_request at /usr/share/jruby/lib/ruby/stdlib/net/http.rb:1609
                     catch at org/jruby/RubyKernel.java:1310
         transport_request at /usr/share/jruby/lib/ruby/stdlib/net/http.rb:1600
                   request at /usr/share/jruby/lib/ruby/stdlib/net/http.rb:1573
                 open_http at /usr/share/jruby/lib/ruby/stdlib/open-uri.rb:333
                     start at /usr/share/jruby/lib/ruby/stdlib/net/http.rb:985
                 open_http at /usr/share/jruby/lib/ruby/stdlib/open-uri.rb:327
               buffer_open at /usr/share/jruby/lib/ruby/stdlib/open-uri.rb:760
                 open_loop at /usr/share/jruby/lib/ruby/stdlib/open-uri.rb:214
                     catch at org/jruby/RubyKernel.java:1310
                 open_loop at /usr/share/jruby/lib/ruby/stdlib/open-uri.rb:212
                  open_uri at /usr/share/jruby/lib/ruby/stdlib/open-uri.rb:153
                      open at /usr/share/jruby/lib/ruby/stdlib/open-uri.rb:740
                      open at /usr/share/jruby/lib/ruby/stdlib/open-uri.rb:29
  test_validation_noverify at /tmp/autopkgtest-lxc.6djqypt5/downtmp/autopkgtest_tmp/test/mri/open-uri/test_ssl.rb:88
                with_https at /tmp/autopkgtest-lxc.6djqypt5/downtmp/autopkgtest_tmp/test/mri/open-uri/test_ssl.rb:47

  1) Failure:
TestOpenURISSL#test_validation_noverify [/tmp/autopkgtest-lxc.6djqypt5/downtmp/autopkgtest_tmp/test/mri/open-uri/test_ssl.rb:52]:
exceptions on 1 threads:
end of file reached

Finished tests in 61.018906s, 0.0492 tests/s, 0.1639 assertions/s.
3 tests, 10 assertions, 1 failures, 0 errors, 0 skips

I don't see any other failures like this, related to OpenSSL, in the full test suite run.

@jcharaoui jcharaoui changed the title Flaky OpenSSL "noverify" test Flaky OpenSSL tests Feb 23, 2024
@jcharaoui
Copy link
Contributor Author

I've tested 9.4.6.0 and I still see the same errors. One thing I've seen however is that the failing test case isn't always the same, eg. I've seen TestOpenURISSL#test_validation_success failures. That said, the issue doesn't seem to be related to the unit tests themselves, but instead to the evaluation of assert_join_threads(threads) on line 52. This is the line that produces the failure.

@headius
Copy link
Member

headius commented Feb 23, 2024

@jcharaoui That's good information. Some tests inherited from CRuby expect threads to not run in parallel, or to behave more predictably than parallel threads. I'll have a look at the test and see if I can suss out why it's flaky.

@jcharaoui
Copy link
Contributor Author

jcharaoui commented Feb 23, 2024

Interesting! So I've tried adding -j1 to the runner command line but that fails with NameError: uninitialized constant RubyVM.

So I tried instead with both -n test_validation_noverify -n test_validation_success and the test seemed to work fine even after > 100 runs. So it indeed looks like maybe the parallel runner is introducing some kind of issue here.

I had a look around the issue tracker and found #7611 which is quite oddly familiar because the test_cp_r_symlink_preserve mentioned is also one that I've identified having unknown some issues, and had been added to Debian CI's own list of skipped tests for JRuby!

@jcharaoui
Copy link
Contributor Author

I'm wondering if the same bug could be also responsible for #8037.

@headius
Copy link
Member

headius commented Feb 26, 2024

I ran a long loop on my M1 and could not reproduce the error, but your evidence still tells me that there are lingering threads causing future tests to fail. I'll see if I can figure out a cleaner way to shut them down when each test completes.

@headius
Copy link
Member

headius commented Feb 26, 2024

uninitialized constant RubyVM

The OpenURI tests only seem to use RubyVM in one test, but it is omitted if defined?(RubyVM.stat) fails (and it should since RubyVM is not defined). Can you provide a stack trace for that? If there's some other unguarded access of RubyVM, we should fix it.

@headius
Copy link
Member

headius commented Feb 26, 2024

@jcharaoui Since you seem to be able to reproduce more easily than I can, could you try this patch and see if it helps? The shutdown logic for the threads seems sound, but I noticed it does not attempt to wait for the server thread, which could easily interfere with future tests.

diff --git a/test/mri/open-uri/test_ssl.rb b/test/mri/open-uri/test_ssl.rb
index 3f94cab40f..b0f7231027 100644
--- a/test/mri/open-uri/test_ssl.rb
+++ b/test/mri/open-uri/test_ssl.rb
@@ -35,7 +35,7 @@ class TestOpenURISSL
         :Port => 0})
       _, port, _, host = srv.listeners[0].addr
       threads = []
-      server_thread = srv.start
+      threads << server_thread = srv.start
       threads << Thread.new {
         server_thread.join
         if log_tester

@jcharaoui
Copy link
Contributor Author

uninitialized constant RubyVM

The OpenURI tests only seem to use RubyVM in one test, but it is omitted if defined?(RubyVM.stat) fails (and it should since RubyVM is not defined). Can you provide a stack trace for that? If there's some other unguarded access of RubyVM, we should fix it.

Here it is, let me know if you need anything else to track this down.

$ jruby test/mri/runner.rb -j1 --excludes=test/mri/excludes:test/mri/excludes_wip test/mri/open-uri/test_ssl.rb
Run options: 
  --seed=59660
  -j1
  --excludes=test/mri/excludes:test/mri/excludes_wip

# Running tests:

NameError: uninitialized constant RubyVM
     const_missing at org/jruby/RubyModule.java:4333
            <main> at /tmp/autopkgtest-lxc.2e2x5ymy/downtmp/autopkgtest_tmp/test/tool/lib/iseq_loader_checker.rb:9
           require at org/jruby/RubyKernel.java:1175
  require_relative at org/jruby/RubyKernel.java:1204
            <main> at /tmp/autopkgtest-lxc.2e2x5ymy/downtmp/autopkgtest_tmp/test/tool/lib/test/unit/parallel.rb:8

Some worker was crashed. It seems ruby interpreter's bug
or, a bug of test/unit/parallel.rb. try again without -j
option.

@jcharaoui
Copy link
Contributor Author

jcharaoui commented Feb 26, 2024

@jcharaoui Since you seem to be able to reproduce more easily than I can, could you try this patch and see if it helps? The shutdown logic for the threads seems sound, but I noticed it does not attempt to wait for the server thread, which could easily interfere with future tests.

Thank you! I've started a loop with this patch, and will let you know the results.

One thing you may find interesting (before and after the patch), is that the majority of iterations complete in just under one second. Some iterations, at random it seems, complete after just under 31 seconds, which seems to indicate that somtimes, the test is hitting a 30-second timeout of some sort. Furthermore, I had noticed that failed iterations would complete after just over 60 seconds, which might tell us that a failure happens when the timeout is getting hit twice? It might also not mean anything but just throwing it out there is case it's useful!

@jcharaoui
Copy link
Contributor Author

jcharaoui commented Feb 26, 2024

Here's a log of the test loop running 35 times:

https://paste.lib3.net/lavamind/2024-02-26-2Pao29HezYUjLqAL8zrChgpq70KKaB0IFZiycDT4GRk/stdin.txt

It failed 3 times. As I mentioned, each of these failures happened after 60 seconds of wall clock time.

Edit: this is with the patch at #8118 (comment)

@headius
Copy link
Member

headius commented Feb 27, 2024

Some discoveries that might help:

  • One test explicitly sets an open_timeout of 60s, but net/http appears to default to 60s so it may not be this test causing it. Excluding it and re-running the test loop would tell us if that fixes it.
  • The timeout is implemented within net/http and just uses the timeout library. I could not see any obvious bug in the timeout library that would cause it to fail to cancel a timeout request.
  • The RubyVM error on -j1 is only because that loads parallel testing logic that apparently loads an iseq-related file that depends on RubyVM now. We'll need to fix that so it doesn't use iseq stuff (unrelated to this issue).

This is not an issue with running tests in parallel; rather it seems the thread in the failing test (test_ssl.rb, test_validation_noverify) are hanging until the timeout triggers, but we have a bug in our read_nonblock logic where any error result not EAGAIN/EWOULDBLOCK gets raised as an EOFError.

@jcharaoui
Copy link
Contributor Author

jcharaoui commented Feb 28, 2024

One test explicitly sets an open_timeout of 60s, but net/http appears to default to 60s so it may not be this test causing it. Excluding it and re-running the test loop would tell us if that fixes it.

I couldn't find open_timeout in test/mri/open-uri/test_ssl.rb directly, so I'm not sure which test I should try excluding. The only ones executed in my loop are TestOpenURISSL#test_validation_failure, TestOpenURISSL#test_validation_success and TestOpenURISSL#test_validation_noverify.

For now I'm going to exclude these three tests from the autopkgtest in Debian, with a reference to this issue.

Let me know if you think of anything else I should try.

@jcharaoui
Copy link
Contributor Author

I hit another seemingly related error in a run of mri-stdlib, but another test:

2) Error:
TestNetHTTPS#test_verify_none:
EOFError: end of file reached
    /usr/share/jruby/lib/ruby/stdlib/net/protocol.rb:227:in `rbuf_fill'
    /usr/share/jruby/lib/ruby/stdlib/net/protocol.rb:193:in `readuntil'
    /usr/share/jruby/lib/ruby/stdlib/net/protocol.rb:203:in `readline'
    /usr/share/jruby/lib/ruby/stdlib/net/http/response.rb:42:in `read_status_line'
    /usr/share/jruby/lib/ruby/stdlib/net/http/response.rb:31:in `read_new'
    /usr/share/jruby/lib/ruby/stdlib/net/http.rb:1609:in `block in transport_request'
    org/jruby/RubyKernel.java:1414:in `catch'
    /usr/share/jruby/lib/ruby/stdlib/net/http.rb:1600:in `transport_request'
    /usr/share/jruby/lib/ruby/stdlib/net/http.rb:1573:in `request'
    /usr/share/jruby/lib/ruby/stdlib/net/http.rb:1566:in `block in request'
    /usr/share/jruby/lib/ruby/stdlib/net/http.rb:985:in `start'
    /usr/share/jruby/lib/ruby/stdlib/net/http.rb:1564:in `request'
    /usr/share/jruby/lib/ruby/stdlib/net/http.rb:1474:in `request_get'
    /tmp/autopkgtest-lxc.31tqowoa/downtmp/autopkgtest_tmp/test/mri/net/http/test_https.rb:204:in `test_verify_none'

From: https://salsa.debian.org/lavamind/jruby/-/jobs/5380922#L7735

I was able to reproduce it locally using the same sort of test loop:

ruby -v: jruby 9.4.6.0 (3.1.4) 2024-02-29 9.4.6.0+ds OpenJDK 64-Bit Server VM 17.0.10+7-Debian-1 on 17.0.10+7-Debian-1 +jit [x86_64-linux]
attempt #35 ...
Run options: 
  --seed=40756
  --excludes=test/mri/excludes:test/mri/excludes_wip

# Running tests:


  1) Error:
TestNetHTTPS#test_verify_none:
EOFError: end of file reached
    /usr/share/jruby/lib/ruby/stdlib/net/protocol.rb:227:in `rbuf_fill'
    /usr/share/jruby/lib/ruby/stdlib/net/protocol.rb:193:in `readuntil'
    /usr/share/jruby/lib/ruby/stdlib/net/protocol.rb:203:in `readline'
    /usr/share/jruby/lib/ruby/stdlib/net/http/response.rb:42:in `read_status_line'
    /usr/share/jruby/lib/ruby/stdlib/net/http/response.rb:31:in `read_new'
    /usr/share/jruby/lib/ruby/stdlib/net/http.rb:1609:in `block in transport_request'
    org/jruby/RubyKernel.java:1414:in `catch'
    /usr/share/jruby/lib/ruby/stdlib/net/http.rb:1600:in `transport_request'
    /usr/share/jruby/lib/ruby/stdlib/net/http.rb:1573:in `request'
    /usr/share/jruby/lib/ruby/stdlib/net/http.rb:1566:in `block in request'
    /usr/share/jruby/lib/ruby/stdlib/net/http.rb:985:in `start'
    /usr/share/jruby/lib/ruby/stdlib/net/http.rb:1564:in `request'
    /usr/share/jruby/lib/ruby/stdlib/net/http.rb:1474:in `request_get'
    /tmp/autopkgtest-lxc.6dt5egfh/downtmp/autopkgtest_tmp/test/mri/net/http/test_https.rb:204:in `test_verify_none'

Finished tests in 60.826215s, 0.0822 tests/s, 0.2630 assertions/s.
5 tests, 16 assertions, 0 failures, 1 errors, 0 skips

I'm not sure if test/mri/net/http/test_https.rb uses the same kinds of threading as test_ssl.rb but at first glance, it seems not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants