jruby-openssl spins endlessly trying to read from a non-blocking SocketChannel while no data is available. #1280

Closed
xb opened this Issue Nov 29, 2013 · 22 comments


@xb
xb commented Nov 29, 2013

Under jruby-1.7.6, I observed multiple threads spinning in this state:

   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        - locked <0x00000000e495c940> (a java.lang.Object)
        at org.jruby.ext.openssl.SSLSocket.readAndUnwrap(SSLSocket.java:516)
        at org.jruby.ext.openssl.SSLSocket.read(SSLSocket.java:504)
        at org.jruby.ext.openssl.SSLSocket.do_sysread(SSLSocket.java:616)
        at org.jruby.ext.openssl.SSLSocket.sysread(SSLSocket.java:634)

Using a Java debugger, I found that the SocketChannelImpl object is in non-blocking mode (its "blocking" flag is false) even though org.jruby.ext.openssl.SSLSocket.readAndUnwrap() was called with its blocking parameter set to true.

Looking at the current implementation of readAndUnwrap():

    private int readAndUnwrap(boolean blocking) throws IOException {
        int bytesRead = getSocketChannel().read(peerNetData);
        if (bytesRead == -1) {
            if (!peerNetData.hasRemaining() || (status == SSLEngineResult.Status.BUFFER_UNDERFLOW)) {
                closeInbound();
                return -1;
            }
            // inbound channel has been already closed but closeInbound() must
            // be defered till the last engine.unwrap() call.
            // peerNetData could not be empty.
        }

it is clear that bytesRead == 0 (because the SocketChannel is in non-blocking mode and no data is available); however, processing does not fail in any way. Instead, readAndUnwrap() returns 0, which in turn leaves rr == 0 in this code (in org.jruby.ext.openssl.SSLSocket.do_sysread()):

            // ensure >0 bytes read; sysread is blocking read.
            while (rr <= 0) {
                if (engine == null) {
                    rr = getSocketChannel().read(dst);
                } else {
                    rr = read(dst, blocking);
                }
                if (rr == -1) {
                    throw getRuntime().newEOFError();
                }
            }

So this while loop spins forever. The comment says that sysread is a blocking read, but evidently it is not.
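The NIO behavior behind the spin can be seen in isolation. This is a minimal, self-contained sketch (NonBlockingReadDemo is an illustrative name, not jruby-openssl code): a read on a non-blocking SocketChannel with no data pending returns 0 immediately, which is exactly what makes a `while (rr <= 0)` loop busy-wait instead of sleep:

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NonBlockingReadDemo {
    // Returns what SocketChannel.read() yields on a freshly connected,
    // non-blocking channel with no data pending.
    static int readWithNoData() throws Exception {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress());
                 SocketChannel peer = server.accept()) {
                // Same state observed in the debugger: non-blocking mode.
                client.configureBlocking(false);
                // The peer never wrote anything, so read() cannot block;
                // it returns 0 immediately instead of waiting for data.
                return client.read(ByteBuffer.allocate(1024));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("bytesRead = " + readWithNoData()); // prints "bytesRead = 0"
    }
}
```

In blocking mode the same read() would sleep until at least one byte arrived; a loop retrying on `rr <= 0` is only safe under that assumption.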

It therefore appears that either a SocketChannel that is supposed to be blocking gets switched to non-blocking mode under some odd circumstances, or it is non-blocking from the start.

I'd recommend catching bugs like this by checking whether the SocketChannel is actually in blocking mode whenever it is expected to be, and throwing an exception if the check fails. A quick workaround would be to force the SocketChannel back into blocking mode wherever blocking behavior is expected; however, that hides the underlying bug and may not fix it completely, since some interfering code may still be switching the channel into non-blocking mode.
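The suggested guard could look something like this hypothetical sketch (BlockingModeGuard and checkBlocking are illustrative names, not jruby-openssl API):

```java
import java.nio.channels.SocketChannel;

// Hypothetical fail-fast check: throw immediately when the channel's real
// mode disagrees with what the caller expects, instead of busy-waiting.
final class BlockingModeGuard {
    static void checkBlocking(SocketChannel channel, boolean expectBlocking) {
        if (channel.isBlocking() != expectBlocking) {
            throw new IllegalStateException(
                "SocketChannel is in " + (channel.isBlocking() ? "blocking" : "non-blocking")
                + " mode but the caller expected "
                + (expectBlocking ? "blocking" : "non-blocking"));
        }
    }
}
```

A call like `checkBlocking(getSocketChannel(), blocking)` at the top of readAndUnwrap() would turn the silent spin into a stack trace pointing at whoever left the channel in the wrong mode.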

@headius
Member
headius commented Nov 29, 2013

Hmmm yes. The blocking and non-blocking state of the underlying SocketChannel is probably getting mucked up or set to non-blocking and not reset. I don't suppose you have a small reproduction for us?

@xb xb pushed a commit to xb/jruby that referenced this issue Dec 3, 2013
xb Rough testcase for jruby#1280. 32fa59d
@xb
xb commented Dec 3, 2013

Well, after some analysis, actually I do have a reproduction: xb@32fa59d

@lardcanoe

I am seeing this as well on our production servers using JRuby 1.7.3 and 1.7.8. I've also tried a few versions of excon, and all result in threads locking up.

I would be more than happy to test any code since this starts affecting us within hours of starting our processes.

Great find, @xb! I had no desire to go hunting in the JRuby code.

at org.jruby.ext.openssl.SSLSocket.readAndUnwrap(SSLSocket.java:549)
at org.jruby.ext.openssl.SSLSocket.read(SSLSocket.java:501)
at org.jruby.ext.openssl.SSLSocket.do_sysread(SSLSocket.java:613)
at org.jruby.ext.openssl.SSLSocket.sysread_nonblock(SSLSocket.java:636)
at org.jruby.ext.openssl.SSLSocket$INVOKER$i$0$1$sysread_nonblock.call(SSLSocket$INVOKER$i$0$1$sysread_nonblock.gen)
at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:650)
at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:213)
at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:202)
at rubyjit.OpenSSL::Buffering$$read_nonblock_EC873C394AF0ADC44443D2944E076A7326B4842B1899205018.file(/home/worker/.rvm/rubies/jruby-1.7.3/lib/ruby/1.9/openssl/buffering.rb:174)
at rubyjit.OpenSSL::Buffering$$read_nonblock_EC873C394AF0ADC44443D2944E076A7326B4842B1899205018.file(/home/worker/.rvm/rubies/jruby-1.7.3/lib/ruby/1.9/openssl/buffering.rb)
at org.jruby.ast.executable.AbstractScript.file(AbstractScript.java:42)
at org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:181)
at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)
at rubyjit.Excon::Socket$$read_681A07264576C40909D6E9427E3B5F029D0FBD611899205018.chained_0_rescue_1$RUBY$SYNTHETIC__file__(...../vendor/bundle/jruby/1.9/gems/excon-0.30.0/lib/excon/socket.rb:38)

@amalagaura

I am having this problem as well. I also have a jstack trace with similar lines:

This is JRuby 1.7.9, in 1.9 mode:

    at org.jruby.ext.openssl.SSLSocket.readAndUnwrap(SSLSocket.java:513)
    at org.jruby.ext.openssl.SSLSocket.read(SSLSocket.java:501)
    at org.jruby.ext.openssl.SSLSocket.do_sysread(SSLSocket.java:613)
    at org.jruby.ext.openssl.SSLSocket.sysread_nonblock(SSLSocket.java:636)

I have the full trace if necessary: https://gist.github.com/amalagaura/7945662

@tbeauvais

Seems related to #1156, where the same thing happens on a write.
These are causing issues for us in production. Any workarounds, or an ETA for a fix?

@mohamedhafez
Contributor

Issues like this and #1156 that cause requests to spin forever and use up all the CPU are killing those of us using JRuby for scraping in production; I'd love to see a higher priority assigned to these bugs... Thank you so much @xb for finding and documenting them so well!

@headius
Member
headius commented Dec 19, 2013

Ok, we should prioritize this for 1.7.10 or 1.7.11. Looks like we have some good doco now and some reproduction cases. I will try to investigate, but as we're getting into the holiday it may not happen until after xmas.

@headius
Member
headius commented Dec 19, 2013

Ok, so the real issue here is that the partial TLS header never gets consumed, so we never go back into the blocking select. We keep seeing that there's data available and going ahead with the read, which can't proceed because there's not a complete packet, so it returns nothing, and so on.
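The "not a complete packet" case can be reproduced directly against SSLEngine. In this self-contained sketch (PartialRecordDemo is an illustrative name, not jruby-openssl code), handing the engine a TLS record header whose declared payload hasn't arrived yet yields BUFFER_UNDERFLOW, the state the spinning reader keeps hitting:

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLEngineResult;
import java.nio.ByteBuffer;

public class PartialRecordDemo {
    // Feed SSLEngine a truncated TLS record and report the resulting status.
    static SSLEngineResult.Status unwrapPartialRecord() throws Exception {
        SSLEngine engine = SSLContext.getDefault().createSSLEngine();
        engine.setUseClientMode(false);

        // 5-byte TLS record header: handshake (22), TLS 1.0 (3, 1),
        // declared payload length 50 -- but no payload bytes follow.
        ByteBuffer partial = ByteBuffer.wrap(new byte[]{22, 3, 1, 0, 50});
        ByteBuffer appData =
            ByteBuffer.allocate(engine.getSession().getApplicationBufferSize());

        // The engine cannot make progress on an incomplete record: it
        // consumes nothing and reports BUFFER_UNDERFLOW. A caller that
        // retries the read without waiting for more bytes spins forever.
        return engine.unwrap(partial, appData).getStatus();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(unwrapPartialRecord());
    }
}
```

On BUFFER_UNDERFLOW the correct move is to wait (select) for more network data, which is exactly what the fixes below aim for.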

I'm going to try to fix two things:

  • Properly set the socket back into blocking mode after the select. We should not be setting it to nonblocking permanently.
  • Always enter the select logic if we can't read any decrypted data. This should prevent spinning when there's partial data in the pipe.
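The first bullet might look roughly like this sketch (BlockingReadSketch and waitReadable are illustrative names, not the actual jruby-openssl patch), under the assumption that the reader temporarily switches to non-blocking mode because select() requires it:

```java
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

final class BlockingReadSketch {
    // Sleep until the channel has data, then put it back into blocking
    // mode no matter what, so later blocking reads really block.
    static void waitReadable(SocketChannel channel) throws IOException {
        Selector selector = Selector.open();
        try {
            channel.configureBlocking(false);  // select() requires non-blocking mode
            channel.register(selector, SelectionKey.OP_READ);
            selector.select();                 // sleeps until data (or EOF) arrives
        } finally {
            selector.close();                  // deregisters the channel first
            channel.configureBlocking(true);   // always restore blocking mode
        }
    }
}
```

The finally block is the point of the sketch: restoring blocking mode unconditionally prevents the non-blocking state from leaking into later reads, which is the leak described at the top of this issue.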
@headius
Member
headius commented Dec 19, 2013

Ok, so with a few modifications I have gotten it to block properly. This causes the test case from @xb to hang without spinning. Now before you think this is wrong...MRI appears to block too.

I'm not sure the given test case is correct. Continuing to explore.

@headius
Member
headius commented Dec 19, 2013

After reading through the other comments, it seems like what's needed here is just to get sysread to properly block when there's no data or partial data. However, nobody other than @xb has provided something I can use to reproduce the issue, and @xb's case is flawed because it hangs on both JRuby (with my tweaks) and MRI.

Does anyone have a reproduction I can use to test that my fixes do what they should?

@xb
xb commented Dec 19, 2013

As far as I remember, the problem is not the blocking per se, but the spinning while blocking, meaning 100% CPU usage per thread. If the network weather is not perfect (e.g. there is packet loss and hence the TLS client thinks the TCP connection is still alive while the TLS server has already shut down the TCP connection), this may lead to all CPU cores being used up.

I tried to produce a testcase, but reproducing the exact situation appeared to be quite difficult. Reproducing the situations in production is also difficult, because it apparently depends on the "network weather".

@amalagaura

@headius I can only speak as someone who submitted a stack trace. If sysread blocks will it result in an eventual timeout with your fix? The problem we were facing is CPU usage without a timeout.

Edit: Sorry I don't know if it should block and timeout or just exit since @xb's test case closes the connection.

@mohamedhafez
Contributor

@headius I believe you are correct about @xb's test case being flawed: it used the blocking read call from openssl/buffering.rb, so I'm guessing it's correct behavior that it blocks without spinning after your fixes. I've made a copy of @xb's testcase that uses read_nonblock instead; I think that is a better reproduction of the problem: it blocks and spins on JRuby (without your fixes), but doesn't spin or block on MRI: mohamedhafez@860f0c3

@mohamedhafez
Contributor

The new testcase linked to above (mohamedhafez@860f0c3) is a better reproduction of the issue that I, @amalagaura, and @lardcanoe have been facing: from all of our stack traces it looks like we are all spinning indefinitely on a read_nonblock, so my test above exercises that directly. However, I could be mistaken, but it looks like @xb's stack trace is from a blocking read call; perhaps a different test would be better to make sure things behave correctly in that situation, though it sounds like your fixes, @headius, already block without spinning, as MRI does, in that case.

@mohamedhafez mohamedhafez referenced this issue in sparklemotion/mechanize Jan 7, 2014
Closed

Switching Net::HTTP for Faraday #339

@mohamedhafez
Contributor

@headius if you've got a branch with the fixes you referred to earlier I'd be happy to run it against the test I've written above

@lardcanoe

Using 1.7.10, still getting this issue.

Some thread dumps: https://gist.github.com/lardcanoe/8447830

No clue how to reproduce other than a large load.

(Not sure if we're experiencing this or #1156, but leaning towards this one.)

@mohamedhafez
Contributor

Would anybody be interested in going in on a bounty for this issue? I know some maintainers don't like bounties, are there any objections to that?

@lardcanoe

@mohamedhafez You read my mind! Many of our servers have become my pets due to this one. I'm in for a good chunk of dough.

@mohamedhafez
Contributor

@headius could you post those tweaks you made earlier that at least partially solved the problem? They may be a good starting point for others...

@mohamedhafez mohamedhafez added a commit to mohamedhafez/ruby that referenced this issue Feb 16, 2014
@mohamedhafez mohamedhafez read_nonblock on a partial TLS record should result in an IO::WaitReadable

Currently in JRuby, trying to read from an SSL socket with only a partial TLS record results in spinning indefinitely until the rest of the record comes in (issue jruby/jruby#1280). This makes sure that a non-blocking read correctly results in an IO::WaitReadable in that case instead.
1122d50
@mohamedhafez
Contributor

Been running with my fix (in the pull request above) in production for a week now, not a single thread spinning forever, and not a single spike to 100% CPU all week :)

@amalagaura

Reading this thread, it would be funny if @headius made the fix but didn't check it against the new testcases, which we think do reproduce the issue. Are @headius's changes the same as @mohamedhafez's changes?

@headius headius added this to the JRuby 1.7.12 milestone Feb 26, 2014
@headius headius added the openssl label Feb 26, 2014
@headius
Member
headius commented Feb 26, 2014

This should be fixed with #1508 from @mohamedhafez, so I'm going to close it. We'll try to get a jruby-openssl release out that works on earlier JRuby versions too.

@headius headius closed this Feb 26, 2014