Connect Timeouts with NonBlocking CreateEndPoint #1920
Improved the ManagedSelector Dump.
Reverted CreateEndPoint to be a blocking task.
Made CreateEndPoint a blocking task because it calls into application code via Connection.Listener; for safety we assume that code may block, which avoids stalling the processing of NIO selected keys.
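The rationale above — classify a task as blocking or non-blocking, and keep blocking work off the selector thread — can be sketched in self-contained Java. Note this is a simplified stand-in for the pattern, not Jetty's actual ManagedSelector or Invocable classes; the `Task` interface and `dispatch` method here are hypothetical:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Simplified stand-in for the invocation-type pattern: a task declares
// whether it may block, and the selector thread uses that to decide
// where to run it.
interface Task extends Runnable {
    enum InvocationType { BLOCKING, NON_BLOCKING }
    InvocationType getInvocationType();
}

public class SelectorDispatch {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    // Run NON_BLOCKING tasks inline on the selector thread; hand BLOCKING
    // tasks (e.g. a CreateEndPoint that may call application listeners)
    // to the thread pool so selected keys keep being processed.
    public void dispatch(Task task) {
        if (task.getInvocationType() == Task.InvocationType.NON_BLOCKING)
            task.run();
        else
            pool.execute(task);
    }
}
```

The point of the classification is exactly the trade-off discussed in this issue: running a possibly-blocking task inline stalls the selector, while handing every task to the pool adds latency and queue pressure.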
We were hit by this, and testing the current development branch seemed to solve the problems. Do you know when this will be released? Thanks a lot!
@migue we are very interested in the details, as it appears that the problem (connect timeouts) is not fixed for one of our clients, and that this issue actually had nothing to do with the real problem. Can you reproduce your problem so that we can also reproduce it and see what's going on?
Hello @sbordet Let me try to summarise my scenario: we have an HTTP/2 endpoint which receives messages and sends them to a Kafka topic. We were running Jetty 9.4.2 and everything was behaving properly. We tried to move to Jetty 9.4.7 (for different reasons) and we ran into (or so we think) the issue described here: all of our clients' requests started failing with an SSL handshake timeout error. Early this morning we built the 9.4.x branch locally and deployed it in our test environment, and things seem to be back to normal. I would be glad to share any details you need; this is just a quick overview of our scenario and how we're trying to fix it.
@migue can you do an experiment for us? I understand you have a problem with Jetty. Can you build commit 6e94d40 and tell us if it works fine for you?
Hello @sbordet This is a server problem for us. Let me try to give some more details:
The problem we're facing is that when we moved to Jetty server 9.4.7 we started to see lots of SSL handshake timeouts in the load_replay process. I don't know the Jetty code well enough, sorry; I just did some profiling, looked at the issues, and found this. That's the reason I deployed the current 9.4.x branch to our load test environment. Do you want me to test the previous commits anyway?
Added a LiveLock (BusyBlocking) test.
Modified ManagedSelector to fair share between actions and selecting.
@sbordet I improved the live lock test. I no longer run a really high connect rate (which may be beyond what the client/server can handle even with this issue fixed). Instead the connection/request rate is a modest 5/s over 5 seconds, with a 1s connect timeout. In parallel, on both the client and the server, I submit a NonBlocking runnable that resubmits itself, ensuring that the action queue always holds at least one action. This fails 100% of the time without the livelock fix and passes 100% with it.

I have also changed the livelock fix a little from what we discussed on the hangout. We looked at using the time since the last select to bypass actions, but that could result in a busy selector livelocking the actions, i.e. preventing any action from being run. Instead I now measure only from the beginning of handling actions, so that after every select we get at least one period of handling actions followed by handling all the events from a single select.

I also added an IO test to confirm that a wakeup() does not prevent a select() call from selecting: it only prevents it from blocking, and any IO activity will still be detected. Thus I use the simpler wakeup() call when the action queue is not empty.
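The wakeup() semantics described above can be demonstrated with plain java.nio, independent of Jetty (a standalone sketch; the class and method names are ours, not from the Jetty test):

```java
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class WakeupDemo {
    // Returns the number of selected keys from a select() issued after a
    // wakeup(): wakeup() only prevents select() from blocking, it does not
    // prevent it from detecting IO that is already ready.
    public static int selectAfterWakeup() throws Exception {
        Selector selector = Selector.open();
        Pipe pipe = Pipe.open();
        pipe.source().configureBlocking(false);
        pipe.source().register(selector, SelectionKey.OP_READ);

        selector.wakeup();                                  // select() will not block...
        pipe.sink().write(ByteBuffer.wrap(new byte[]{1}));  // ...but IO is ready
        int selected = selector.select();                   // returns immediately

        pipe.sink().close();
        pipe.source().close();
        selector.close();
        return selected;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(selectAfterWakeup());
    }
}
```

Because the select() still performs a full selection and picks up the readable key, it is safe to call wakeup() whenever the action queue is non-empty without risking missed IO events.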
@gregw I moved the livelock issue to #1924, so let's use that issue. |
Reverted CreateEndPoint to be non-blocking, as the real issue was determined to be #1924 instead.
@migue, we think we have found the problem; it's described in #1924. Can you please try commit 0142509?
Update your dependencies to the just-built 9.4.8-SNAPSHOT, and let us know if you still see the problem. Thanks!
Will give it a try (probably tomorrow).
Hi,

Setup
Observed problem

With Jetty 9.4.7 the client experiences SSL handshake timeouts. Because of the (almost) immediate reconnect attempt, this seems to overload the server completely. Observations (on the server):
Problem fixed with 9.4.8-SNAPSHOT

I tested it with the following Jetty 9.4.8-SNAPSHOT versions: All these versions worked fine. After the load-test tool was started, the number of threads in the server JVM started to grow, SSL handshakes were successful, and HTTP/2 requests were processed successfully.

Bisecting

I finally bisected my way to commit ab85e01 and issue #1849. This commit fixed our problem (with the parent e58a7b4 I could still reproduce the problem). My theory is that because of #1849 we were running just a single selector thread, and because CreateEndPoint is now NonBlocking, the SSL handshake is now done in the selector thread. So in the end the single selector thread had to terminate 3000 SSL connections at once, which took longer than the SSL handshake timeout, which led to even more connection requests. To validate this, I tried two things:
So in the end, I'd say this was rather a misconfiguration and bad luck with default settings on our side. We will configure the selector threads explicitly from now on. For anybody doing SSL termination with Jetty, it is probably helpful to know that starting with Jetty 9.4.7 the SSL handshake is done on the selector threads, and that you should configure your thread pool accordingly.
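For reference, the number of selector (and acceptor) threads can be set explicitly when constructing the connector. This is a configuration sketch against Jetty 9.4's `ServerConnector(Server, int acceptors, int selectors, ConnectionFactory...)` constructor; the values (4 selectors, port 8443) are illustrative, not recommendations:

```java
import org.eclipse.jetty.server.HttpConfiguration;
import org.eclipse.jetty.server.HttpConnectionFactory;
import org.eclipse.jetty.server.SecureRequestCustomizer;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;
import org.eclipse.jetty.server.SslConnectionFactory;
import org.eclipse.jetty.util.ssl.SslContextFactory;

public class ConnectorConfig {
    public static ServerConnector newSslConnector(Server server, SslContextFactory sslContextFactory) {
        HttpConfiguration config = new HttpConfiguration();
        config.addCustomizer(new SecureRequestCustomizer());

        // ServerConnector(server, acceptors, selectors, factories):
        // -1 for acceptors keeps Jetty's default; 4 selector threads is
        // an illustrative value sized for the expected handshake load.
        ServerConnector connector = new ServerConnector(server, -1, 4,
            new SslConnectionFactory(sslContextFactory, "http/1.1"),
            new HttpConnectionFactory(config));
        connector.setPort(8443);
        return connector;
    }
}
```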
@mlex Thanks for the feedback, it confirms that the real issue behind your problem was indeed #1924. I'd say that you were bitten by #1924 rather than by a misconfiguration on your part; IIUC, with the original configuration (single selector) and 0142509 you don't see the issue anymore, so it was indeed our problem. Thanks!
I actually didn't test with a single selector and 0142509, but I did so just now. The result: with 1000 SSL connections everything worked well; all connections were established, all clients were sending data, and the server was processing everything. With 2000 SSL connections the picture was different: only ~570 connections were successfully established, and the server was processing data from these connections. The other clients still ended up in the endless try-connect -> ssl-handshake-timeout -> retry cycle.

The most notable difference to Jetty 9.4.7 (with a single selector): the server did not become unresponsive and did not end up with 100% heap usage. Those connections that were successfully established could send data, and this data was processed correctly (and without noticeable delay). Also, when reducing the load, the server quickly recovered. So the fix for #1924 improves the situation a lot, thanks for that! But even with this fix, a single selector is not enough for our use case of hundreds of clients simultaneously trying to establish an SSL connection.
@mlex what exactly is that setting? If it is the max size of the queue of the thread pool, can you set it to unbounded and try again with 1 selector and 2000 TLS connections? When it fails, do you see actual connection timeouts on the client, or just connections being closed by the server before the TLS handshake completes?
Yes, it is the size of the queue of the underlying thread pool. But the queue is empty most of the time anyway (at least judging from the metrics exported by Dropwizard), so I don't think setting it to unbounded would make a difference. What I observed is that the main load is caused by CreateEndPoint tasks (because the SSL handshake is happening there). If I understood it correctly, those tasks are directly executed on the selector, so this work never really reaches the thread pool queue. If it helps, I could provide a heap dump of the described situation.
Made CreateEndPoint and DestroyEndPoint blocking.
With d1e5883 and a single selector thread everything works fine. All connections were established successfully and the server was then processing requests. I also tried the load test with 5000 TLS connections and everything worked well. One notable difference is a small peak in the queue size, which probably represents all the CreateEndPoint tasks that are now pushed onto the thread pool queue. Also, the stack trace now shows that the SSL handshake is done by a non-selector thread (what do you call those threads?):
I took a heapdump during a run with 0142509 and a single selector and 2000 concurrent connection attempts. It shows that the … Of those tasks, the …
But wouldn't DestroyEndPoint being a blocking action again lead to the possible memory leak described in #1705?
@mlex no, because now it implements …
Ah, I see. Thank you for the explanation!
There is evidence that #1804 is causing some deployments to see connection timeouts when a Jetty HttpClient is communicating with a Jetty server. The problem has been bisected to a few commits, with the fixes to #1804 being the most likely suspects as the change affects both how the client creates outgoing endpoints and how the server accepts incoming ones.