Better handling of fatal Selector failures #4798

sbordet · 2020-04-21T19:29:31Z

Jetty version
9.4.x

Description
Various java.nio.channels.Selector APIs throw IOException, such as Selector.select().

However, fatal selector failures should not happen, and recent work, especially in Java 13 (see https://mail.openjdk.java.net/pipermail/nio-dev/2020-February/007051.html), made sure that this is the case implementation-wise, although the API signature still allows exceptions to be thrown.

Jetty does not handle fatal selector failures well.

If a ManagedSelector fails, Jetty exits the selector thread but leaves the ManagedSelector in the array of selectors in SelectorManager._selectors.

A new connection may arrive and be assigned to the failed selector, ending up in its update queue which will never be drained, since the selector thread has exited already.

#3989 introduced some overridable behavior in ManagedSelector but SelectorManager still references the failed selector.


sbordet · 2020-04-21T19:33:10Z

I think we must make an effort to exclude (and replace) the failed selector from SelectorManager._selectors otherwise we have a broken server.
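To make the idea concrete, here is a minimal sketch of excluding and replacing a failed selector. It is not Jetty's code: `SelectorPool`, `choose()` and `replace()` are hypothetical names that stand in for `SelectorManager._selectors` and its selector-choosing logic, and the pool holds raw `java.nio.channels.Selector` instances rather than `ManagedSelector`s.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Selector;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical stand-in for SelectorManager._selectors; not Jetty's actual code.
final class SelectorPool
{
    private final AtomicReferenceArray<Selector> selectors;
    private final AtomicInteger next = new AtomicInteger();

    SelectorPool(int size)
    {
        selectors = new AtomicReferenceArray<>(size);
        for (int i = 0; i < size; i++)
            selectors.set(i, open());
    }

    // Round-robin choice of the selector for a newly accepted connection.
    Selector choose()
    {
        int index = Math.floorMod(next.getAndIncrement(), selectors.length());
        return selectors.get(index);
    }

    // Swaps the failed selector for a fresh one, so that new connections are
    // never queued on a selector whose thread has already exited.
    // Returns false if the selector is not (or no longer) in the pool.
    boolean replace(Selector failed)
    {
        for (int i = 0; i < selectors.length(); i++)
        {
            if (selectors.get(i) != failed)
                continue;
            Selector fresh = open();
            if (selectors.compareAndSet(i, failed, fresh))
            {
                closeQuietly(failed);
                return true;
            }
            // Another thread replaced it first: discard our fresh selector.
            closeQuietly(fresh);
            return false;
        }
        return false;
    }

    boolean contains(Selector selector)
    {
        for (int i = 0; i < selectors.length(); i++)
        {
            if (selectors.get(i) == selector)
                return true;
        }
        return false;
    }

    void close()
    {
        for (int i = 0; i < selectors.length(); i++)
            closeQuietly(selectors.get(i));
    }

    private static Selector open()
    {
        try
        {
            return Selector.open();
        }
        catch (IOException x)
        {
            throw new UncheckedIOException(x);
        }
    }

    private static void closeQuietly(Selector selector)
    {
        try
        {
            selector.close();
        }
        catch (IOException ignored)
        {
            // Best effort: the selector is being discarded anyway.
        }
    }
}
```

The compare-and-set keeps two threads from both replacing the same slot. The sketch says nothing about the connections that were registered with the failed selector, which is a separate concern.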

I understand that if a client can trigger a selector failure, that is a DoS attack and a JVM bug, but I think we must respect the fact that a selector may fail, because its APIs throw IOException.
The implementation may not actually throw in Java 13 or later, but many still use older versions, so I think we should put in some workaround (or better handling) for fatal selector failures.

@gregw your thoughts?

gregw · 2020-04-22T08:10:18Z

As per our hangout conversation, my thoughts are as follows:

If selectors can be made to fail as a result of external connections, then nothing we can do will help much. Every failure will take out a large percentage of our current connections (12.5% for 8 selectors to 100% for 1 selector), so the DoS will be pretty fundamental even if we restore a selector or exclude a broken one from new connections.
If selector failure can be provoked by external connections, then it is a serious security problem and must be fixed, be that in Jetty, the JVM or the OS. So our priority has to be to identify the exception being thrown and to work back to the root cause.
If selector failure is due to some hard OS issue, then recreating them will just fail again and we can end up in a hard loop.
Selector failure is almost in the category of OOME, in that the best action might be System.exit. However, just like OOME, it is bad form for the server to make this decision, so reporting and limping on is the best we can do.
If we are to limp on, then there are a few things we could probably do better in our handling of a selector exception:
- I don't think we should immediately close the selector. Instead we should try to iterate over the keys, identify any likely problematic ones, and cancel them. Only if selector exceptions persist should we close the selector.
- Once we close a ManagedSelector, we should replace it with a new one.
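The first bullet above (iterate over the keys and cancel the suspect ones before giving up on the selector) could be sketched as follows. This is only an illustration under assumptions: `SelectorTriage` and `cancelSuspectKeys` are hypothetical names, "suspect" is narrowed to "key no longer valid or channel already closed", and the method is assumed to run on the selector thread, since the key set is not safe for concurrent access.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical helper; not part of Jetty.
final class SelectorTriage
{
    // Cancels the keys that are no longer valid or whose channel is closed.
    // Returns how many keys were found suspect.
    static int cancelSuspectKeys(Selector selector)
    {
        int suspects = 0;
        for (SelectionKey key : selector.keys())
        {
            if (!key.isValid() || !key.channel().isOpen())
            {
                // cancel() is idempotent; the key is removed from the
                // key set during the next selection operation.
                key.cancel();
                suspects++;
            }
        }
        return suspects;
    }

    // Registers one healthy and one closed channel, then triages.
    // Returns {suspect keys found, keys left after the next select}.
    static int[] demo()
    {
        try
        {
            Pipe healthy = Pipe.open();
            Pipe broken = Pipe.open();
            try (Selector selector = Selector.open();
                 Pipe.SourceChannel healthySource = healthy.source();
                 Pipe.SinkChannel healthySink = healthy.sink();
                 Pipe.SourceChannel brokenSource = broken.source();
                 Pipe.SinkChannel brokenSink = broken.sink())
            {
                healthySource.configureBlocking(false);
                brokenSource.configureBlocking(false);
                healthySource.register(selector, SelectionKey.OP_READ);
                brokenSource.register(selector, SelectionKey.OP_READ);

                // Simulate a channel that went away behind the selector's back.
                brokenSource.close();

                int suspects = cancelSuspectKeys(selector);
                // A selection operation flushes the cancelled keys.
                selector.selectNow();
                return new int[]{suspects, selector.keys().size()};
            }
        }
        catch (IOException x)
        {
            throw new UncheckedIOException(x);
        }
    }
}
```

A real implementation would likely need a retry budget as well, so that persistent failures eventually lead to closing the selector rather than looping.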

I think our priorities are:

Identify if there is an iceberg (something that breaks the selector)
Try to avoid hitting the iceberg (can we accept TCP pings differently to avoid the failure)
Only if we still hit the iceberg do we then consider how to arrange the deck chairs and if the band should keep playing or not.

Amarendraar23 · 2020-04-23T12:03:27Z

We are seeing a similar issue as discussed above:
One of the selectors is dead, yet it keeps accepting connections but doesn't act upon them.

[2020-03-20 06:54:59.022] ALL 000000000000 GLOBAL_SCOPE 2020-03-20 06:54:59.022:DBUG:oeji.ManagedSelector:qtp-113589028-112-acceptor-1@62c969a5-ServerConnector@451274c0{SSL,[ssl, http/1.1]}{mfsgas1u1:60001}: Queued change org.eclipse.jetty.io.ManagedSelector$Accept@63b3dc53 on ManagedSelector@822b2077{STARTED} id=3 keys=-1 selected=-1 updates=2123

Tracing back to see whether any exception occurred for the selector, we found this exception logged around the time the selector last accepted a connection successfully.

[2020-03-19 22:10:04.823] ALL 000000000000 GLOBAL_SCOPE 2020-03-19 22:10:04.822:WARN:oeji.ManagedSelector:qtp-113589028-795900: Fatal select() failure
java.io.IOException: Bad file descriptor
at sun.nio.ch.FileDispatcherImpl.close0(Native Method)
at sun.nio.ch.SocketDispatcher.close(SocketDispatcher.java:67)
at sun.nio.ch.SocketChannelImpl.kill(SocketChannelImpl.java:894)
at sun.nio.ch.EPollSelectorImpl.implDereg(EPollSelectorImpl.java:206)
at sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:168)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:109)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:98)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:109)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:113)
at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:472)
at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:409)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:360)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:184)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:388)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
at java.lang.Thread.run(Thread.java:811)

After this exception, we see the selector only accepting connections, and the selector queue building up, as shown in the attached image.

We are using IBM JDK 1.8 and Red Hat 7.

lorban · 2020-04-24T10:22:33Z

Maybe we could try other Selector implementations, and/or document how to check them? They can be changed by setting the java.nio.channels.spi.SelectorProvider system property to a class name, and I found a few in the OpenJDK 11 source code:

Linux-specific sun.nio.ch.EPollSelectorProvider, which is the default one on this platform: https://github.com/AdoptOpenJDK/openjdk-jdk11/tree/master/src/java.base/linux/classes/sun/nio/ch
Unix-generic sun.nio.ch.PollSelectorProvider: https://github.com/AdoptOpenJDK/openjdk-jdk11/tree/master/src/java.base/unix/classes/sun/nio/ch
Solaris-specific sun.nio.ch.DevPollSelectorProvider and sun.nio.ch.EventPortSelectorProvider: https://github.com/AdoptOpenJDK/openjdk-jdk11/tree/master/src/java.base/solaris/classes/sun/nio/ch
MacOS-specific sun.nio.ch.KQueueSelectorProvider: https://github.com/AdoptOpenJDK/openjdk-jdk11/tree/master/src/java.base/macosx/classes/sun/nio/ch
AIX-specific sun.nio.ch.AixAsynchronousChannelProvider: https://github.com/AdoptOpenJDK/openjdk-jdk11/tree/master/src/java.base/aix/classes/sun/nio/ch

I wonder if adding -Djava.nio.channels.spi.SelectorProvider=sun.nio.ch.PollSelectorProvider could help in certain cases without too much of a performance impact.
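A small sketch for checking which provider is actually in effect after passing such a flag. `ShowSelectorProvider` is a hypothetical class name; `SelectorProvider.provider()` is the standard JDK call. Whether a given provider class exists or can be loaded depends on the JDK and platform, so the printed name will vary.

```java
import java.nio.channels.spi.SelectorProvider;

// Hypothetical utility class for illustration only.
final class ShowSelectorProvider
{
    // Name of the provider class the JVM resolved as the system-wide default.
    static String activeProviderName()
    {
        return SelectorProvider.provider().getClass().getName();
    }

    public static void main(String[] args)
    {
        // The value requested on the command line, if any.
        String requested = System.getProperty("java.nio.channels.spi.SelectorProvider", "<not set>");
        System.out.println("requested: " + requested);
        System.out.println("active:    " + activeProviderName());
    }
}
```

Running it once with and once without the `-D` flag shows whether the override took effect on that JVM.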

Implemented selector recovery by transferring all keys to a newly created selector. Updated code so that it does not assume that the SelectionKey never changes.
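As a rough illustration of what "transferring all keys to a newly created selector" means at the java.nio level, here is a minimal sketch. It is not the code from the actual commit: `SelectorTransfer` and `transfer` are hypothetical names, and it assumes the failed selector is still open (so its key set can be read) and that it is called from the selector thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical helper; not the code from the Jetty commit.
final class SelectorTransfer
{
    // Re-registers every valid key of the failed selector with a new one,
    // keeping interest ops and attachment, then closes the failed selector.
    static Selector transfer(Selector failed) throws IOException
    {
        Selector fresh = Selector.open();
        for (SelectionKey oldKey : failed.keys())
        {
            if (!oldKey.isValid())
                continue;
            SelectableChannel channel = oldKey.channel();
            try
            {
                // register() returns a NEW SelectionKey: code that cached the
                // old key must switch to the new one.
                channel.register(fresh, oldKey.interestOps(), oldKey.attachment());
            }
            catch (ClosedChannelException | CancelledKeyException x)
            {
                // The channel went away in the meantime: nothing to transfer.
            }
        }
        failed.close();
        return fresh;
    }

    // Returns true if a registered channel survives the transfer intact.
    static boolean demo()
    {
        try
        {
            Pipe pipe = Pipe.open();
            try (Pipe.SourceChannel source = pipe.source();
                 Pipe.SinkChannel sink = pipe.sink())
            {
                source.configureBlocking(false);
                Selector old = Selector.open();
                Object attachment = "endpoint";
                SelectionKey oldKey = source.register(old, SelectionKey.OP_READ, attachment);

                try (Selector fresh = transfer(old))
                {
                    SelectionKey newKey = source.keyFor(fresh);
                    return !old.isOpen() &&
                        !oldKey.isValid() &&
                        newKey != null &&
                        newKey != oldKey &&
                        newKey.attachment() == attachment &&
                        newKey.interestOps() == SelectionKey.OP_READ;
                }
            }
        }
        catch (IOException x)
        {
            throw new UncheckedIOException(x);
        }
    }
}
```

The fact that each channel ends up with a different `SelectionKey` is why code must not assume that the key never changes.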

BrownKathy · 2020-04-28T13:51:28Z

Thank you for the update. Can you provide more detail about what the fix is doing specifically, any performance impact it has and what build you are intending to put it in? Will you have something for @Amarendraar23 to test?


BrownKathy · 2020-04-29T15:37:20Z

Does it have any performance impact, and which build do you intend to put it in? Will you have something for @Amarendraar23 to test?


sbordet · 2020-05-04T17:15:50Z

@lorban and I have done a couple of JMH benchmarks for both socket open/close and HTTP/1.0 requests (the worst cases for this change) and could not see any relevant performance difference - we checked with PerfNorm and PerfAsm profilers as well as load test throughput.

@Amarendraar23 can you please test the current jetty-9.4.x branch?

sbordet · 2020-05-08T08:42:35Z

@Amarendraar23 news about the testing?

BrownKathy · 2020-05-08T12:10:40Z

@sbordet, thank you for checking in. We are still working through testing at this time. We did initial testing and it seems ok but we are going to have our QA team test it as well. We'll let you know what happens.

sbordet · 2020-05-08T13:25:23Z

@BrownKathy we are preparing for the Jetty 9.4.29 release, probably early next week.
Do you think you'll have some data by then?

sbordet · 2020-05-19T07:56:53Z

@BrownKathy @Amarendraar23 news?

BrownKathy · 2020-05-19T14:23:44Z

@sbordet, we are still testing and need to build some new environments. We will provide an update as soon as possible.

stale · 2021-06-02T17:48:03Z

This issue has been automatically marked as stale because it has been a full year without activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2021-07-21T02:38:09Z

This issue has been closed due to it having no activity.

sbordet mentioned this issue Apr 23, 2020

Handle closed selector exception #3580

Closed

sbordet added a commit that referenced this issue Apr 27, 2020

Issue #4798 - Better handling of fatal Selector failures.

952a20f

Implemented selector recovery by transferring all keys to a newly created selector. Updated code so that it does not assume that the SelectionKey never changes.

sbordet linked a pull request Apr 27, 2020 that will close this issue

Issue #4798 - Recover from Selector Failures #4823

Merged

sbordet added a commit that referenced this issue Apr 29, 2020

Issue #4798 - Better handling of fatal Selector failures.

8c75eec

Updates after review. Signed-off-by: Simone Bordet <simone.bordet@gmail.com>

sbordet added a commit that referenced this issue Apr 30, 2020

Issue #4798 - Better handling of fatal Selector failures.

0a028b6

More updates after review. Signed-off-by: Simone Bordet <simone.bordet@gmail.com>

sbordet added a commit that referenced this issue May 4, 2020

Merge pull request #4823 from eclipse/jetty-9.4.x-4798-recover_select…

0ab2b42

…or_failures Issue #4798 - Recover from Selector Failures

sbordet mentioned this issue May 19, 2020

Repeated "Could not process key for channel" IllegalStateException for unix sockets #4865

Closed

stale bot added the Stale label Jun 2, 2021

stale bot closed this as completed Jul 21, 2021
