jruby-openssl spins endlessly trying to write to a socket which is in state CLOSE_WAIT #1156

Closed
xb opened this Issue Oct 24, 2013 · 6 comments

Projects

None yet

5 participants

@xb
xb commented Oct 24, 2013

On

$ jruby --version
jruby 1.7.5 (1.9.3p392) 2013-10-07 74e9291 on OpenJDK 64-Bit Server VM 1.7.0_40-b31 [linux-amd64]

with the jruby-openssl-0.9.4 gem installed, when running an application which accessess a HTTPS site as a HTTPS client, I repeatably observe following patterns after some running time:

  1. The CPU usage of the application is slightly above 100% times n, where n is a positive integer.
  2. The number of sockets which are actually HTTPS sockets and in CLOSE_WAIT state (as of "lsof") is equal to n.
  3. The number of threads calling org.jruby.ext.openssl.SSLSocket.syswrite(SSLSocket.java:664) in at a particular time (as evidenced by "jstack" or by "kill -3") is almost always n.
  4. The thread ids for these n affected threads being the same over time.

I conclude that there seems to be a bug in jruby-openssl that is characterized by a SSL client socket getting into the CLOSE_WAIT state (e.g. the remote party or the network terminating the connection unexpectedly) and jruby-openssl apparently not noticing this state, and, rather than that, attempting to write to the socket in an endless loop.

Some Java stack traces which were captured:

   java.lang.Thread.State: RUNNABLE
        at java.lang.Object.hashCode(Native Method)
        at java.util.HashMap.hash(HashMap.java:366)
        at java.util.HashMap.put(HashMap.java:496)
        at java.util.HashSet.add(HashSet.java:217)
        at sun.nio.ch.EPollSelectorImpl.implRegister(EPollSelectorImpl.java:165)
        at sun.nio.ch.SelectorImpl.register(SelectorImpl.java:133)
        - locked <0x00000000e3ce1d60> (a java.util.Collections$UnmodifiableSet)
        at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:209)
        - locked <0x00000000e42aa638> (a java.lang.Object)
        - locked <0x00000000e42aa648> (a java.lang.Object)
        at java.nio.channels.SelectableChannel.register(SelectableChannel.java:277)
        at org.jruby.ext.openssl.SSLSocket.waitSelect(SSLSocket.java:288)
        at org.jruby.ext.openssl.SSLSocket.do_syswrite(SSLSocket.java:644)
        at org.jruby.ext.openssl.SSLSocket.syswrite(SSLSocket.java:664)
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollCtl(Native Method)
        at sun.nio.ch.EPollArrayWrapper.updateRegistrations(EPollArrayWrapper.java:287)
        - locked <0x00000000e3cdf488> (a java.lang.Object)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:256)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
        - locked <0x00000000e3cdf410> (a sun.nio.ch.Util$2)
        - locked <0x00000000e3cdf400> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00000000e3cdf1c8> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
        at org.jruby.ext.openssl.SSLSocket$2.run(SSLSocket.java:316)
        at org.jruby.RubyThread.executeBlockingTask(RubyThread.java:1063)
        at org.jruby.ext.openssl.SSLSocket.waitSelect(SSLSocket.java:295)
        at org.jruby.ext.openssl.SSLSocket.do_syswrite(SSLSocket.java:644)
        at org.jruby.ext.openssl.SSLSocket.syswrite(SSLSocket.java:664)
   java.lang.Thread.State: RUNNABLE
        at org.jruby.util.io.ChannelStream.fflush(ChannelStream.java:661)
        - locked <0x00000000e4430e08> (a org.jruby.util.io.ChannelStream)
        at org.jruby.RubyIO.flush(RubyIO.java:2168)
        at org.jruby.ext.openssl.SSLSocket.do_syswrite(SSLSocket.java:654)
        at org.jruby.ext.openssl.SSLSocket.syswrite(SSLSocket.java:664)

The bug seems to be similar to https://jira.codehaus.org/browse/JRUBY-6973 .

@xb
xb commented Oct 24, 2013

The bug is probably that in https://github.com/jruby/jruby/blob/1.7.6/lib/ruby/shared/jopenssl19/openssl/buffering.rb?source=cc#L318

      while remain > 0
        str = @wbuffer[nwritten,remain]
        begin
          nwrote = syswrite(str)
        rescue Errno::EAGAIN
          retry
        end
        remain -= nwrote
        nwritten += nwrote
      end

nwrote is either 0 each time or Errno::EAGAIN is raised each time although the call is a blocking call (which means that neither nwrote==0 is allowed to happen nor that Errno::EAGAIN is allowed to be raised).

@xb
xb commented Oct 25, 2013

Analyzing this problem further, it turns out that in https://github.com/jruby/jruby/blob/1.7.6/ext/openssl/src/main/java/org/jruby/ext/openssl/SSLSocket.java#L484

    public int write(ByteBuffer src, boolean blocking) throws SSLException, IOException {
        if(initialHandshake) {
            throw new IOException("Writing not possible during handshake");
        }

        SelectableChannel selectable = getSocketChannel();
        boolean blockingMode = selectable.isBlocking();
        if (!blocking) selectable.configureBlocking(false);

        try {
            if(netData.hasRemaining()) {
                flushData(blocking);
            }
            netData.clear();
            SSLEngineResult res = engine.wrap(src, netData);
            netData.flip();
            flushData(blocking);
            return res.bytesConsumed();
        } finally {
            if (!blocking) selectable.configureBlocking(blockingMode);
        }
    }

the line

SSLEngineResult res = engine.wrap(src, netData); 

Returns an SSLEngineResult object "res". This SSLEngineResult has a status (obtainable by res.getStatus()) One of the possible states is "CLOSED". In case of this bug, the status is actually closed.

It turns out the place where this CLOSED state is created is http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/9b8c96f96a0f/src/share/classes/sun/security/ssl/SSLEngineImpl.java

    /*
     * Makes additional checks for unwrap, but this time more
     * specific to this packet and the current state of the machine.
     */
    private SSLEngineResult writeAppRecord(EngineArgs ea) throws IOException {

        Status status = null;
        HandshakeStatus hsStatus = null;

        /*
         * See if the handshaker needs to report back some SSLException.
         */
        checkTaskThrown();

        /*
         * short circuit if we're closed/closing.
         */
        if (writer.isOutboundDone()) {
            return new SSLEngineResult(Status.CLOSED, getHSStatus(null), 0, 0);
        }

So in this case, writer.isOutboundDone() returns true and thus the SSLEngineResult contains status Status.CLOSED.

The netData buffer then contains 0 bytes and when attempting to write it, 0 bytes are actually written. Thus, the syswrite() method returns 0, and spins forever.

@tbeauvais

Yes, I'm seeing the same thing. Several threads(5-6) in similar state and each causes the CPU to spinning at 600% or so.

Is there a Jira ticket for this? any workarounds?

Thread 31345: (state = IN_VM)
 - org.jruby.internal.runtime.ThreadService.getCurrentContext() @bci=0, line=189 (Interpreted frame)
 - org.jruby.Ruby.getCurrentContext() @bci=4, line=2557 (Interpreted frame)
 - org.jruby.RubyThread.pollThreadEvents() @bci=5, line=489 (Interpreted frame)
 - org.jruby.RubyThread.executeBlockingTask(org.jruby.RubyThread$BlockingTask) @bci=10, line=1004 (Interpreted frame)
 - org.jruby.ext.openssl.SSLSocket.waitSelect(int, boolean) @bci=100, line=295 (Interpreted frame)
 - org.jruby.ext.openssl.SSLSocket.do_syswrite(org.jruby.runtime.ThreadContext, org.jruby.runtime.builtin.IRubyObject, boolean) @bci=13, line=644 (Interpreted frame)
 - org.jruby.ext.openssl.SSLSocket.syswrite(org.jruby.runtime.ThreadContext, org.jruby.runtime.builtin.IRubyObject) @bci=4, line=664 (Interpreted frame)
Thread 13372: (state = IN_VM)
 - org.jruby.internal.runtime.ThreadService.getCurrentContext() @bci=0, line=189 (Interpreted frame)
 - org.jruby.Ruby.getCurrentContext() @bci=4, line=2557 (Interpreted frame)
 - org.jruby.ext.openssl.SSLSocket.waitSelect(int, boolean) @bci=21, line=283 (Interpreted frame)
 - org.jruby.ext.openssl.SSLSocket.do_syswrite(org.jruby.runtime.ThreadContext, org.jruby.runtime.builtin.IRubyObject, boolean) @bci=13, line=644 (Interpreted frame)
 - org.jruby.ext.openssl.SSLSocket.syswrite(org.jruby.runtime.ThreadContext, org.jruby.runtime.builtin.IRubyObject) @bci=4, line=664 (Interpreted frame)

@xb xb pushed a commit to xb/jruby that referenced this issue Nov 19, 2013
Xb Fix bug jruby#1156 . 566fd59
@tbeauvais

Can we get this merged into the jruby repo?

@mohamedhafez
Contributor

@xb, would you mind doing a pull request into the main jruby repo for your fix?

@headius headius added a commit that referenced this issue Dec 19, 2013
@headius Xb + headius Fix bug #1156 .
Signed-off-by: Charles Oliver Nutter <headius@headius.com>
e4a8933
@headius
Member
headius commented Dec 19, 2013

I went ahead and cherry-picked your commit to jruby-1_7 and master branches. Perhaps you can come up with a test case PR too?

@headius headius closed this Dec 19, 2013
@enebo enebo modified the milestone: JRuby 1.7.10, JRuby 1.7.11 Feb 24, 2014
@kares kares added a commit to kares/jruby-openssl that referenced this issue Jun 25, 2014
@kares Xb + kares Fix bug jruby/jruby#1156 .
Signed-off-by: Charles Oliver Nutter <headius@headius.com>
c13884a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment