Connect Timeouts with NonBlocking CreateEndPoint #1920

Closed
gregw opened this issue Oct 25, 2017 · 21 comments
Labels: Bug (For general bugs on Jetty side), High Priority, Sponsored (This issue affects a user with a commercial support agreement)
Milestone: 9.4.x


gregw commented Oct 25, 2017

There is evidence that #1804 is causing some deployments to see connection timeouts when a Jetty HttpClient is communicating with a Jetty server. The problem has been bisected to a few commits, with the fixes to #1804 being the most likely suspects as the change affects both how the client creates outgoing endpoints and how the server accepts incoming ones.

gregw added this to the 9.4.x milestone Oct 25, 2017
gregw added the Bug, High Priority, and Sponsored labels Oct 25, 2017
gregw added a commit that referenced this issue Oct 25, 2017
Improved the ManagedSelector Dump
Reverted CreateEndPoint to be a blocking task
sbordet added a commit that referenced this issue Oct 25, 2017
Made CreateEndPoint a blocking task, because it calls into application
code via Connection.Listener, and for safety we assume it may be
blocking code, to avoid stalling the processing of NIO selected keys.

migue commented Oct 26, 2017

We were hit by this, and testing the current development branch seemed to solve the problems.

Do you know when this would be released?

Thanks a lot!


sbordet commented Oct 26, 2017

@migue we are very interested in the details, as it appears that the problem (connect timeouts) is not fixed for one of our clients, and that this issue actually had nothing to do with the real problem.

Can you reproduce your problem so that we can also reproduce it and see what's going on ?


migue commented Oct 26, 2017

Hello @sbordet

Let me try to summarise my scenario:

We have an HTTP/2 endpoint which receives messages and sends them to a Kafka topic. We were running Jetty 9.4.2 and everything was behaving properly.

We tried to move to Jetty 9.4.7 (for unrelated reasons) and we ran into (or so we think) the issue described here. All of our clients' requests started failing with an SSL handshake timeout error.

Early this morning we built the 9.4.x branch locally and deployed it in our test environment, and everything seems to be back to normal.

I would be glad to share any details you need; this is just a quick overview of our scenario and how we're trying to fix it.


sbordet commented Oct 26, 2017

@migue can you run an experiment for us?

I understand you have a problem with the Jetty HttpClient, which you use to call Kafka, is that right?
So you don't have a server problem, just a client problem, is that right?

Can you build commit 6e94d40 and tell us if it works fine for you?
Then, can you build commit ce39d50 and tell us if it works fine for you?


migue commented Oct 26, 2017

Hello @sbordet

This is a server problem for us.

Let me try to give some more details:

  • Our client (let me call it load_replay) is just a Netty + Rx client replaying some data recorded from our production environment.

  • The server side (let me call it acceptor) is based on the Jetty HTTP/2 server. This server accepts requests and stores them in Kafka (using the Kafka Java client).

The problem we're facing is that when we moved the acceptor to Jetty 9.4.7 we started to see lots of SSL handshake timeouts in the load_replay process.

I don't know the Jetty code well enough, sorry; I just did some profiling, looked through the issues, and found this one. That's the reason I deployed the current 9.4.x branch to our load test environment.

Do you want me to test the previous commits anyway?

gregw added a commit that referenced this issue Oct 28, 2017
Added a LiveLock (BusyBlocking) test.
Modified ManagedSelector to fair share between actions and selecting

gregw commented Oct 28, 2017

@sbordet I improved the live lock test. I no longer run a really high connect rate (which may be beyond what the client/server can handle even with this issue fixed). Instead the connection/request rate is a modest 5/s over 5 seconds, with a 1s connect timeout. In parallel, on both the client and the server, I submit a NonBlocking runnable that resubmits itself, ensuring that the action queue always contains at least one action. The test fails 100% of the time without the livelock fix and passes 100% with it.

I have also changed the livelock fix a little from what we discussed on the hangout. We looked at using the time since the last select to bypass actions; however, that could result in a busy selector livelocking the actions, i.e. preventing any action from being run. Instead I now apply the limit only from the beginning of handling actions, so that after every select we get at least one period of handling actions followed by handling all the events from a single select.

I also added an IO test to confirm that a wakeup() does not prevent a select() call from selecting: it only prevents it from blocking, and any IO activity will still be detected. Thus I use the simpler wakeup() call when the action queue is not empty.
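
For anyone curious about that last point, here is a minimal standalone NIO sketch (not the actual Jetty test; the class name and the short sleep are mine) showing that a wakeup() issued before select() stops the select from blocking, but does not stop it from reporting a channel that is already ready:

    import java.net.InetSocketAddress;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;

    public class WakeupStillSelects
    {
        public static void main(String[] args) throws Exception
        {
            try (Selector selector = Selector.open();
                 ServerSocketChannel server = ServerSocketChannel.open())
            {
                server.bind(new InetSocketAddress("localhost", 0));
                server.configureBlocking(false);
                server.register(selector, SelectionKey.OP_ACCEPT);

                // Make the server channel ready by connecting to it.
                try (SocketChannel client = SocketChannel.open(server.getLocalAddress()))
                {
                    // Give the kernel a moment to queue the pending accept.
                    Thread.sleep(100);

                    // wakeup() before select(): the select will not block...
                    selector.wakeup();
                    int selected = selector.select();

                    // ...but the pending accept is still reported (expected: 1).
                    System.out.println("selected keys: " + selected);
                }
            }
        }
    }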


sbordet commented Oct 29, 2017

@gregw I moved the livelock issue to #1924, so let's use that issue.
This issue turned out to be a false positive.
I'm inclined to leave CreateEndPoint non-blocking, as DestroyEndPoint is non-blocking and they both call Connection.Listeners, so they should be either both non-blocking or both blocking.
Implementations of Connection.Listener must therefore be non-blocking.
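
To make that concrete, here is a minimal sketch of a Connection.Listener implementation that stays non-blocking (the ConnectionCounter class is made up for illustration, and it assumes the usual pattern of registering the listener as a bean on the connector or on HttpClient). Anything heavier than counter updates should be offloaded to another thread:

    import java.util.concurrent.atomic.LongAdder;
    import org.eclipse.jetty.io.Connection;

    // Counts opened/closed connections without ever blocking, so it is safe to
    // call from the selector thread whether CreateEndPoint/DestroyEndPoint run
    // as blocking or non-blocking tasks.
    public class ConnectionCounter implements Connection.Listener
    {
        private final LongAdder opened = new LongAdder();
        private final LongAdder closed = new LongAdder();

        @Override
        public void onOpened(Connection connection)
        {
            opened.increment();
        }

        @Override
        public void onClosed(Connection connection)
        {
            closed.increment();
        }

        public long getOpened() { return opened.longValue(); }

        public long getClosed() { return closed.longValue(); }
    }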

sbordet added a commit that referenced this issue Oct 29, 2017
Reverted CreateEndPoint to be non-blocking, as
the real issue was determined to be #1924 instead.

sbordet commented Oct 29, 2017

@migue, we think we have found the problem; it's described in #1924.

Can you please try commit 0142509 ?

$ git clone https://github.com/eclipse/jetty.project.git jetty
$ cd jetty
$ git checkout 0142509975f1554868d01bceaaf28bb70ba309bf
$ mvn clean install -DskipTests

Update your dependencies to the just built 9.4.8-SNAPSHOT, and let us know if you still see the problem. Thanks !


migue commented Oct 30, 2017

I will give it a try (probably tomorrow).


mlex commented Nov 7, 2017

Hi,
because Miguel is otherwise occupied, I have taken over the re-testing.

Setup

  • Server is a Dropwizard application providing HTTP/2 via Jetty, with maxThreads=1024 and queueSize=16384. Otherwise the default Dropwizard/Jetty configuration.
  • Client is a load-testing tool that opens 1000 SSL connections at once. When a connection is closed (or fails), the connection attempt is retried after 1 second. For every successfully opened connection, the tool sends one HTTP/2 request per second.

Observed problem

With Jetty 9.4.7 the client experiences SSL handshake timeouts. Because of the (almost) immediate reconnect attempts, this seems to overload the server completely. Observations (on the server):

  • number of threads stays constant (no new threads)
  • heap grows steadily (until the JVM stalls completely with 100% CPU doing GC)
  • no HTTP/2 requests are processed at all
  • a thread dump shows just one thread actually busy:
dw-21 id=21 state=RUNNABLE
    at java.math.BigInteger.implMontgomeryMultiply(BigInteger.java:2608)
[...]
    at sun.security.ssl.Handshaker$1.run(Handshaker.java:916)
[...]
    at org.eclipse.jetty.http2.HTTP2Connection.fill(HTTP2Connection.java:124)
    at org.eclipse.jetty.http2.HTTP2Connection.access$600(HTTP2Connection.java:40)
    at org.eclipse.jetty.http2.HTTP2Connection$HTTP2Producer.produce(HTTP2Connection.java:217)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:179)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:140)
    at org.eclipse.jetty.http2.HTTP2Connection.onOpen(HTTP2Connection.java:97)
    at org.eclipse.jetty.http2.server.HTTP2ServerConnection.onOpen(HTTP2ServerConnection.java:146)
    at org.eclipse.jetty.io.ssl.SslConnection.onOpen(SslConnection.java:249)
    at org.eclipse.jetty.io.SelectorManager.connectionOpened(SelectorManager.java:354)
    at org.eclipse.jetty.io.ManagedSelector.createEndPoint(ManagedSelector.java:210)
    at org.eclipse.jetty.io.ManagedSelector.access$1600(ManagedSelector.java:59)
    at org.eclipse.jetty.io.ManagedSelector$CreateEndPoint.run(ManagedSelector.java:623)
    at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.nextAction(ManagedSelector.java:328)
    at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:293)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:179)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:140)
    at org.eclipse.jetty.io.ManagedSelector$$Lambda$51/192264336.run(Unknown Source)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:679)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:597)
    at java.lang.Thread.run(Thread.java:745)

Problem fixed with 9.4.8-SNAPSHOT

I tested the following Jetty 9.4.8-SNAPSHOT versions:

All these versions worked fine. After the load-test tool was started, the number of threads in the server JVM started to grow, SSL handshakes were successful, and HTTP/2 requests were processed successfully.

Bisecting

I finally bisected my way to the commit ab85e01 and issue #1849. This commit fixed our problem (with the parent e58a7b4 I could still reproduce the problem).

My theory is that because of #1849 we were running just a single selector thread. And because CreateEndPoint is now NonBlocking, the SSL handshake is now done on the selector thread. So in the end the single selector thread had to terminate 3000 SSL connections at once, which took longer than the SSL handshake timeout, which led to even more connection requests.

To validate this, I tried two things:

  • Run e58a7b4 with the explicit config selectorThreads=8.
  • Take ab85e01 and cherry-pick ce5993b, which makes CreateEndPoint blocking again.

Both tests were successful (meaning that all SSL handshakes were successful and requests were processed).

So in the end, I'd say this was rather a misconfiguration and bad luck with default settings on our side. We will configure selectorThreads explicitly from now on. For anybody doing SSL termination with Jetty, it is probably helpful to know that starting with Jetty 9.4.7 the SSL handshake is done on the selector threads, and that you should configure your thread pool accordingly.
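
For reference, here is a minimal embedded-Jetty sketch of choosing the selector count explicitly instead of relying on the default heuristic (this is not our Dropwizard setup; keystore path, password, port and thread counts are placeholders, and plain HTTP/1.1 over TLS is shown instead of HTTP/2 for brevity):

    import org.eclipse.jetty.server.HttpConfiguration;
    import org.eclipse.jetty.server.HttpConnectionFactory;
    import org.eclipse.jetty.server.SecureRequestCustomizer;
    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.ServerConnector;
    import org.eclipse.jetty.server.SslConnectionFactory;
    import org.eclipse.jetty.util.ssl.SslContextFactory;

    public class ExplicitSelectors
    {
        public static void main(String[] args) throws Exception
        {
            Server server = new Server();

            SslContextFactory sslContextFactory = new SslContextFactory();
            sslContextFactory.setKeyStorePath("/path/to/keystore");  // placeholder
            sslContextFactory.setKeyStorePassword("changeit");       // placeholder

            HttpConfiguration httpsConfig = new HttpConfiguration();
            httpsConfig.addCustomizer(new SecureRequestCustomizer());

            // The second and third arguments are the acceptor and selector counts;
            // passing them explicitly avoids the core-count based default.
            ServerConnector connector = new ServerConnector(
                server,
                1,   // acceptors
                8,   // selectors, chosen explicitly
                new SslConnectionFactory(sslContextFactory, "http/1.1"),
                new HttpConnectionFactory(httpsConfig));
            connector.setPort(8443);

            server.addConnector(connector);
            server.start();
            server.join();
        }
    }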


sbordet commented Nov 7, 2017

@mlex Thanks for the feedback, it confirms that the real issue behind your problem was indeed #1924.
The increase in the number of selectors simply gave each selector more time to actually perform the selection, rather than continuing to process actions.

I'd say that you were bitten by #1924 rather than a misconfiguration on your part; IIUC, with the original configuration (single selector) and 0142509 you don't see the issue anymore, so it was indeed our problem.

Thanks !


mlex commented Nov 8, 2017

I actually didn't test with a single selector and 0142509. But I did so just now. The result:

With 1000 SSL connections everything worked well: all connections were established, all clients were sending data, and the server was processing everything.

With 2000 SSL connections the picture was different: only ~570 connections were successfully established, and the server was processing data from these connections. The other clients still ended up in the endless try-connect -> ssl-handshake-timeout -> retry cycle.

The most notable difference from Jetty 9.4.7 (with a single selector): the server did not become unresponsive and did not end up with 100% heap usage. Those connections that were successfully established could send data, and this data was processed correctly (and without noticeable delay). Also, when reducing the load, the server quickly recovered.

So the fix for #1924 improves the situation a lot. Thanks for that! But even with this fix, a single selector is not enough for our use case of hundreds of clients simultaneously trying to establish an SSL connection.


sbordet commented Nov 8, 2017

@mlex what exactly is queueSize=16384?

If it is the max size of the thread pool's queue, can you set it to unbounded and try again with 1 selector and 2000 TLS connections?

When it fails, do you see actual connection timeouts on the client, or just connections being closed by the server before the TLS handshake completes?


mlex commented Nov 8, 2017

Yes, it is the size of the queue of the underlying thread pool. But the queue is empty most of the time anyway (at least judging from the metrics exported by Dropwizard), so I don't think setting it to unbounded would make a difference.

What I observed is that the main load is caused by CreateEndPoint tasks (because the SSL handshake happens there). If I understood it correctly, those tasks are executed directly on the selector thread, so this work never really reaches the thread pool queue.

If it helps, I could provide a heapdump of the described situation.
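
For context on why that work bypasses the queue: Jetty tags its tasks with an Invocable.InvocationType, and the selector's producer can run NON_BLOCKING tasks inline on the selector thread instead of handing them to the thread pool. A rough illustrative sketch of that decision (the ActionRunner class below is hypothetical, not Jetty's actual ManagedSelector code):

    import java.util.concurrent.Executor;
    import org.eclipse.jetty.util.thread.Invocable;

    // Hypothetical helper illustrating the dispatch decision: a task declared
    // NON_BLOCKING is run inline (here: on the selector thread), so it never
    // shows up in the thread pool queue; anything else is dispatched to the pool.
    class ActionRunner
    {
        private final Executor threadPool;

        ActionRunner(Executor threadPool)
        {
            this.threadPool = threadPool;
        }

        void run(Runnable action)
        {
            if (Invocable.getInvocationType(action) == Invocable.InvocationType.NON_BLOCKING)
                action.run();               // stays on the calling (selector) thread
            else
                threadPool.execute(action); // goes through the pool (and its queue)
        }
    }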

sbordet added a commit that referenced this issue Nov 8, 2017
Made CreateEndPoint and DestroyEndPoint blocking.

sbordet commented Nov 8, 2017

@mlex can you try d1e5883 and report if it's working for you ?


mlex commented Nov 8, 2017

With d1e5883 and a single selector thread everything works fine. All connections were established successfully and the server then processed requests. I also tried the load test with 5000 TLS connections and everything worked well.

One notable difference is a small peak in the queue size, which probably represents all the CreateEndPoint tasks that are now pushed onto the thread pool queue.

Also, the stack trace now shows that the SSL handshake is done by a non-selector thread (what do you call those threads?):

dw-298 id=298 state=BLOCKED
    - waiting to lock <0x51068ee2> (a sun.security.ssl.SSLContextImpl$TLSContext)
     owned by dw-514 id=514
    at sun.security.ssl.SSLContextImpl.getDefaultCipherSuiteList(SSLContextImpl.java:291)
    at sun.security.ssl.SSLEngineImpl.init(SSLEngineImpl.java:402)
    at sun.security.ssl.SSLEngineImpl.<init>(SSLEngineImpl.java:357)
    at sun.security.ssl.SSLContextImpl.engineCreateSSLEngine(SSLContextImpl.java:210)
    at javax.net.ssl.SSLContext.createSSLEngine(SSLContext.java:361)
    at org.eclipse.jetty.util.ssl.SslContextFactory.newSSLEngine(SslContextFactory.java:1608)
    at org.eclipse.jetty.util.ssl.SslContextFactory.newSSLEngine(SslContextFactory.java:1630)
    at org.eclipse.jetty.server.SslConnectionFactory.newConnection(SslConnectionFactory.java:85)
    at com.instana.acceptor.http2.Http2InstrumentedConnectionFactory.newConnection(Http2InstrumentedConnectionFactory.java:53)
    at org.eclipse.jetty.server.ServerConnector$ServerConnectorManager.newConnection(ServerConnector.java:542)
    at org.eclipse.jetty.io.ManagedSelector.createEndPoint(ManagedSelector.java:207)
    at org.eclipse.jetty.io.ManagedSelector.access$1700(ManagedSelector.java:58)
    at org.eclipse.jetty.io.ManagedSelector$CreateEndPoint.run(ManagedSelector.java:637)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:708)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
    at java.lang.Thread.run(Thread.java:745)


mlex commented Nov 8, 2017

I took a heapdump during a run with 0142509 and a single selector and 2000 concurrent connection attempts. It shows that the ManagedSelector._actions queue is full of Accept, CreateEndPoint and DestroyEndPoint tasks.


Of those tasks, CreateEndPoint is the most expensive (because of the SSL handshake).



sbordet commented Nov 8, 2017

@mlex, thanks for verifying the commits and analyzing the thread dumps.
I'm now convinced that CreateEndPoint must be a blocking action, and that DestroyEndPoint must be made blocking too (this was commit d1e5883, which you tested as good).


mlex commented Nov 9, 2017

But wouldn't making DestroyEndPoint a blocking action again lead to the possible memory leak described in #1705?


sbordet commented Nov 9, 2017

@mlex no, because now it implements Closeable.


mlex commented Nov 10, 2017

Ah, I see. Thank you for the explanation!
