jruby-openssl spins endlessly trying to read from a non-blocking SocketChannel while no data is available. #1280

Closed
xb opened this Issue Nov 29, 2013 · 22 comments


@xb
xb commented Nov 29, 2013

Under jruby-1.7.6, I observed multiple threads spinning in this state:

   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        - locked <0x00000000e495c940> (a java.lang.Object)
        at org.jruby.ext.openssl.SSLSocket.readAndUnwrap(SSLSocket.java:516)
        at org.jruby.ext.openssl.SSLSocket.read(SSLSocket.java:504)
        at org.jruby.ext.openssl.SSLSocket.do_sysread(SSLSocket.java:616)
        at org.jruby.ext.openssl.SSLSocket.sysread(SSLSocket.java:634)

Using a Java debugger, I found that the SocketChannelImpl object is in non-blocking mode (its "blocking" flag is false) even though org.jruby.ext.openssl.SSLSocket.readAndUnwrap() was called with its blocking parameter set to true.

Looking at the current implementation of readAndUnwrap():

    private int readAndUnwrap(boolean blocking) throws IOException {
        int bytesRead = getSocketChannel().read(peerNetData);
        if (bytesRead == -1) {
            if (!peerNetData.hasRemaining() || (status == SSLEngineResult.Status.BUFFER_UNDERFLOW)) {
                closeInbound();
                return -1;
            }
            // inbound channel has been already closed but closeInbound() must
            // be defered till the last engine.unwrap() call.
            // peerNetData could not be empty.
        }

it is clear that bytesRead == 0 (because the SocketChannel is in non-blocking mode and no data is available); however, processing does not fail in any way. Instead, readAndUnwrap() returns 0, which in turn leaves rr == 0 in this code (in org.jruby.ext.openssl.SSLSocket.do_sysread()):

            // ensure >0 bytes read; sysread is blocking read.
            while (rr <= 0) {
                if (engine == null) {
                    rr = getSocketChannel().read(dst);
                } else {
                    rr = read(dst, blocking);
                }
                if (rr == -1) {
                    throw getRuntime().newEOFError();
                }
            }

So this while loop spins forever. The comment says that sysread is a blocking read, but evidently it is not.
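The NIO behavior behind the spin can be seen in isolation. This is a minimal, self-contained sketch (NonBlockingReadDemo is an illustrative name, not jruby-openssl code): a read on a non-blocking SocketChannel with no data pending returns 0 immediately, which is exactly what makes a `while (rr <= 0)` loop busy-wait instead of sleep:

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NonBlockingReadDemo {
    // Returns what SocketChannel.read() yields on a freshly connected,
    // non-blocking channel with no data pending.
    static int readWithNoData() throws Exception {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress());
                 SocketChannel peer = server.accept()) {
                // Same state observed in the debugger: non-blocking mode.
                client.configureBlocking(false);
                // The peer never wrote anything, so read() cannot block;
                // it returns 0 immediately instead of waiting for data.
                return client.read(ByteBuffer.allocate(1024));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("bytesRead = " + readWithNoData()); // prints "bytesRead = 0"
    }
}
```

In blocking mode the same read() would sleep until at least one byte arrived; a loop retrying on `rr <= 0` is only safe under that assumption.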

It therefore appears that either a SocketChannel that is supposed to be blocking gets switched to non-blocking mode under some odd circumstances, or it is non-blocking from the start.

I'd recommend catching bugs like this by checking whether the SocketChannel is actually in blocking mode whenever it is expected to be, and throwing an exception if the check fails. A quick workaround would be to force the SocketChannel back into blocking mode wherever blocking behavior is expected; however, that hides the underlying bug and may not fix it completely, since some interfering code may still be switching the channel into non-blocking mode.
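The suggested guard could look something like this hypothetical sketch (BlockingModeGuard and checkBlocking are illustrative names, not jruby-openssl API):

```java
import java.nio.channels.SocketChannel;

// Hypothetical fail-fast check: throw immediately when the channel's real
// mode disagrees with what the caller expects, instead of busy-waiting.
final class BlockingModeGuard {
    static void checkBlocking(SocketChannel channel, boolean expectBlocking) {
        if (channel.isBlocking() != expectBlocking) {
            throw new IllegalStateException(
                "SocketChannel is in " + (channel.isBlocking() ? "blocking" : "non-blocking")
                + " mode but the caller expected "
                + (expectBlocking ? "blocking" : "non-blocking"));
        }
    }
}
```

A call like `checkBlocking(getSocketChannel(), blocking)` at the top of readAndUnwrap() would turn the silent spin into a stack trace pointing at whoever left the channel in the wrong mode.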

@headius
Member
headius commented Nov 29, 2013

Hmmm yes. The blocking and non-blocking state of the underlying SocketChannel is probably getting mucked up or set to non-blocking and not reset. I don't suppose you have a small reproduction for us?

@xb xb pushed a commit to xb/jruby that referenced this issue Dec 3, 2013
xb Rough testcase for jruby#1280. 32fa59d
@xb
xb commented Dec 3, 2013

Well, after some analysis, actually I do have a reproduction: xb@32fa59d

@lardcanoe

I am seeing this as well on our production servers using JRuby 1.7.3 and 1.7.8. I've also tried a few versions of excon, and all result in threads locking up.

I would be more than happy to test any code since this starts affecting us within hours of starting our processes.

Great find, @xb! I had no desire to go hunting in the JRuby code.

at org.jruby.ext.openssl.SSLSocket.readAndUnwrap(SSLSocket.java:549)
at org.jruby.ext.openssl.SSLSocket.read(SSLSocket.java:501)
at org.jruby.ext.openssl.SSLSocket.do_sysread(SSLSocket.java:613)
at org.jruby.ext.openssl.SSLSocket.sysread_nonblock(SSLSocket.java:636)
at org.jruby.ext.openssl.SSLSocket$INVOKER$i$0$1$sysread_nonblock.call(SSLSocket$INVOKER$i$0$1$sysread_nonblock.gen)
at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:650)
at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:213)
at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:202)
at rubyjit.OpenSSL::Buffering$$read_nonblock_EC873C394AF0ADC44443D2944E076A7326B4842B1899205018.file(/home/worker/.rvm/rubies/jruby-1.7.3/lib/ruby/1.9/openssl/buffering.rb:174)
at rubyjit.OpenSSL::Buffering$$read_nonblock_EC873C394AF0ADC44443D2944E076A7326B4842B1899205018.file(/home/worker/.rvm/rubies/jruby-1.7.3/lib/ruby/1.9/openssl/buffering.rb)
at org.jruby.ast.executable.AbstractScript.file(AbstractScript.java:42)
at org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:181)
at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)
at rubyjit.Excon::Socket$$read_681A07264576C40909D6E9427E3B5F029D0FBD611899205018.chained_0_rescue_1$RUBY$SYNTHETIC__file__(...../vendor/bundle/jruby/1.9/gems/excon-0.30.0/lib/excon/socket.rb:38)

@amalagaura

I am having this problem as well. I also have a jstack trace with similar lines:

This is JRuby 1.7.9, in 1.9 mode:

    at org.jruby.ext.openssl.SSLSocket.readAndUnwrap(SSLSocket.java:513)
    at org.jruby.ext.openssl.SSLSocket.read(SSLSocket.java:501)
    at org.jruby.ext.openssl.SSLSocket.do_sysread(SSLSocket.java:613)
    at org.jruby.ext.openssl.SSLSocket.sysread_nonblock(SSLSocket.java:636)

I have the full trace if necessary: https://gist.github.com/amalagaura/7945662

@tbeauvais

Seems related to #1156, where the same thing happens on a write.
These are causing issues for us in production. Any workarounds, or an ETA for a fix?

@mohamedhafez
Contributor

Issues like this and #1156 that cause requests to spin forever and use up all the CPU are killing those of us using JRuby for scraping in production; I'd love to see a higher priority assigned to these bugs... Thank you so much @xb for finding and documenting them so well!

@headius
Member
headius commented Dec 19, 2013

Ok, we should prioritize this for 1.7.10 or 1.7.11. Looks like we have some good doco now and some reproduction cases. I will try to investigate, but as we're getting into the holiday it may not happen until after xmas.

@headius
Member
headius commented Dec 19, 2013

Ok, so the real issue here is that the partial TLS header never gets consumed, so we never go back into the blocking select. We keep seeing that there's data available and going ahead with the read, which can't proceed because there's not a complete packet, so it returns nothing, and so on.
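The "not a complete packet" case can be reproduced directly against SSLEngine. In this self-contained sketch (PartialRecordDemo is an illustrative name, not jruby-openssl code), handing the engine a TLS record header whose declared payload hasn't arrived yet yields BUFFER_UNDERFLOW, the state the spinning reader keeps hitting:

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLEngineResult;
import java.nio.ByteBuffer;

public class PartialRecordDemo {
    // Feed SSLEngine a truncated TLS record and report the resulting status.
    static SSLEngineResult.Status unwrapPartialRecord() throws Exception {
        SSLEngine engine = SSLContext.getDefault().createSSLEngine();
        engine.setUseClientMode(false);

        // 5-byte TLS record header: handshake (22), TLS 1.0 (3, 1),
        // declared payload length 50 -- but no payload bytes follow.
        ByteBuffer partial = ByteBuffer.wrap(new byte[]{22, 3, 1, 0, 50});
        ByteBuffer appData =
            ByteBuffer.allocate(engine.getSession().getApplicationBufferSize());

        // The engine cannot make progress on an incomplete record: it
        // consumes nothing and reports BUFFER_UNDERFLOW. A caller that
        // retries the read without waiting for more bytes spins forever.
        return engine.unwrap(partial, appData).getStatus();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(unwrapPartialRecord());
    }
}
```

On BUFFER_UNDERFLOW the correct move is to wait (select) for more network data, which is exactly what the fixes below aim for.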

I'm going to try to fix two things:

  • Properly set the socket back into blocking mode after the select. We should not be setting it to nonblocking permanently.
  • Always enter the select logic if we can't read any decrypted data. This should prevent spinning when there's partial data in the pipe.
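The first bullet might look roughly like this sketch (BlockingReadSketch and waitReadable are illustrative names, not the actual jruby-openssl patch), under the assumption that the reader temporarily switches to non-blocking mode because select() requires it:

```java
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

final class BlockingReadSketch {
    // Sleep until the channel has data, then put it back into blocking
    // mode no matter what, so later blocking reads really block.
    static void waitReadable(SocketChannel channel) throws IOException {
        Selector selector = Selector.open();
        try {
            channel.configureBlocking(false);  // select() requires non-blocking mode
            channel.register(selector, SelectionKey.OP_READ);
            selector.select();                 // sleeps until data (or EOF) arrives
        } finally {
            selector.close();                  // deregisters the channel first
            channel.configureBlocking(true);   // always restore blocking mode
        }
    }
}
```

The finally block is the point of the sketch: restoring blocking mode unconditionally prevents the non-blocking state from leaking into later reads, which is the leak described at the top of this issue.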
@headius
Member
headius commented Dec 19, 2013

Ok, so with a few modifications I have gotten it to block properly. This causes the test case from @xb to hang without spinning. Now before you think this is wrong...MRI appears to block too.

I'm not sure the given test case is correct. Continuing to explore.

@headius
Member
headius commented Dec 19, 2013

After reading through the other comments, it seems like what's needed here is just to get sysread to properly block when there's no data or partial data. However, nobody other than @xb has provided something I can use to reproduce the issue, and @xb's case is flawed because it hangs on both JRuby (with my tweaks) and MRI.

Does anyone have a reproduction I can use to test that my fixes do what they should?

@xb
xb commented Dec 19, 2013

As far as I remember, the problem is not the blocking per se, but the spinning while blocking, meaning 100% CPU usage per thread. If the network weather is not perfect (e.g. there is packet loss and hence the TLS client thinks the TCP connection is still alive while the TLS server has already shut down the TCP connection), this may lead to all CPU cores being used up.

I tried to produce a testcase, but reproducing the exact situation appeared to be quite difficult. Reproducing the situations in production is also difficult, because it apparently depends on the "network weather".

@amalagaura

@headius I can only speak as someone who submitted a stack trace. If sysread blocks will it result in an eventual timeout with your fix? The problem we were facing is CPU usage without a timeout.

Edit: Sorry I don't know if it should block and timeout or just exit since @xb's test case closes the connection.

@mohamedhafez
Contributor

@headius I believe you are correct about @xb's test case being flawed: it used the blocking read call from openssl/buffering.rb, so I'm guessing it's correct behavior that it blocks without spinning after your fixes. I've made a copy of @xb's testcase that uses read_nonblock instead; I think that is a better reproduction of the problem: it blocks and spins on JRuby (without your fixes), but doesn't spin or block on MRI: mohamedhafez@860f0c3

@mohamedhafez
Contributor

The new testcase linked to above (mohamedhafez@860f0c3) is a better reproduction of the issue that I, @amalagaura, and @lardcanoe have been facing: from all of our stack traces it looks like we are all spinning indefinitely on a read_nonblock, so my test above exercises that directly. However, I could be mistaken, but it looks like @xb's stack trace is from a blocking read call; perhaps a different test would be better to make sure things behave correctly in that situation, though it sounds like your fixes, @headius, already block without spinning, as MRI does, in that case.

@mohamedhafez mohamedhafez referenced this issue in sparklemotion/mechanize Jan 7, 2014
Closed

Switching Net::HTTP for Faraday #339

@mohamedhafez
Contributor

@headius if you've got a branch with the fixes you referred to earlier I'd be happy to run it against the test I've written above

@lardcanoe

Using 1.7.10, still getting this issue.

Some thread dumps: https://gist.github.com/lardcanoe/8447830

No clue how to reproduce other than a large load.

(Not sure if we're experiencing this or #1156, but leaning towards this one.)

@mohamedhafez
Contributor

Would anybody be interested in going in on a bounty for this issue? I know some maintainers don't like bounties, are there any objections to that?

@lardcanoe

@mohamedhafez You read my mind! Many of our servers have become my pets due to this one. I'm in for a good chunk of dough.

@mohamedhafez
Contributor

@headius could you post those tweaks you made earlier that at least partially solved the problem? They may be a good starting point for others...

@mohamedhafez mohamedhafez added a commit to mohamedhafez/ruby that referenced this issue Feb 16, 2014
@mohamedhafez mohamedhafez read_nonblock on a partial TLS record should result in an IO::WaitReadable

Currently in JRuby, trying to read from an SSL socket with only a partial TLS record results in spinning indefinitely until the rest of the record comes in (issue jruby/jruby#1280). This makes sure that a non-blocking read correctly results in an IO::WaitReadable in that case instead.
1122d50
@mohamedhafez
Contributor

Been running with my fix (in the pull request above) in production for a week now, not a single thread spinning forever, and not a single spike to 100% CPU all week :)

@amalagaura

Reading this thread, it would be funny if @headius made the fix but didn't check it against the new testcases, which we think do reproduce the issue. Are @headius's changes the same as @mohamedhafez's changes?

@headius headius added this to the JRuby 1.7.12 milestone Feb 26, 2014
@headius headius added the openssl label Feb 26, 2014
@headius
Member
headius commented Feb 26, 2014

This should be fixed with #1508 from @mohamedhafez, so I'm going to close it. We'll try to get a jruby-openssl release out that works on earlier JRuby versions too.

@headius headius closed this Feb 26, 2014