Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better handling of fatal Selector failures #4798

Closed
sbordet opened this issue Apr 21, 2020 · 14 comments · Fixed by #4823
Closed

Better handling of fatal Selector failures #4798

sbordet opened this issue Apr 21, 2020 · 14 comments · Fixed by #4823
Labels
Stale For auto-closed stale issues and pull requests

Comments

@sbordet
Copy link
Contributor

sbordet commented Apr 21, 2020

Jetty version
9.4.x

Description
Various java.nio.channels.Selector APIs throw IOException, such as Selector.select().

However, fatal selector failures should not happen, and recent work especially on Java 13 (see https://mail.openjdk.java.net/pipermail/nio-dev/2020-February/007051.html), made sure that it was the case implementation wise, although the API signature still allows exceptions to be thrown.

Jetty does not handle well fatal selector failures.

If a ManagedSelector fails, Jetty exits the selector thread but leaves the ManagedSelector in the array of selectors in SelectorManager._selectors.

A new connection may arrive and be assigned to the failed selector, ending up in its update queue which will never be drained, since the selector thread has exited already.

#3989 introduced some overridable behavior in ManagedSelector but SelectorManager still references the failed selector.

@sbordet
Copy link
Contributor Author

sbordet commented Apr 21, 2020

I think we must make an effort to exclude (and replace) the failed selector from SelectorManager._selectors otherwise we have a broken server.

I understand that if a client may trigger a selector failure that is a DoS attack and a JVM bug, but I think we must respect the fact that a selector may fail because its APIs throw IOException.
The implementation may not actually throw in Java 13 or later, but many still use older versions, so I think we should put some workaround (or better handling) in case of fatal selector failures.

@gregw your thoughts?

@gregw
Copy link
Contributor

gregw commented Apr 22, 2020

As per our hangout conversation, my thoughts are as follows:

  • If selectors can be made to fail as a result of external connections, then nothing we can do will help much. Every failure will take out a large percentage of our current connections (12.5% for 8 selectors to 100% for 1 selector), so the DOS will be pretty fundamental even if we restore a selector or exclude a broken one from new connections.
  • If selector failure can be provoked by external connections, then it is a serious security problem and must be fixed - be that in jetty, the JVM or the OS. So our priority has to be to identify the exception being thrown and to work back to the root cause.
  • If selector failure is due to some hard OS issue, then recreating them will just fail again and we can end up in a hard loop.
  • Selector failure is almost in the category of OOME, in that the best action might be System.exit. However, just like OOME, it is bad form for the server to make this decision, so reporting and limping on is the best we can do.
  • If we are to limp on, then there are a few things we could probably do better in our handling of a selector exception:
    • I don't think we should immediately close the selector. Instead we should try to iterate over the keys and identify any likely problematic ones and cancel them. Only if selector exceptions persist should be close the selector.
    • Once we close a ManagedSelector, we should replace it with a new one.

I think our priorities are:

  1. Identify if there is an iceberg (something that breaks the selector)
  2. Try to avoid hitting the iceberg (can we accept TCP pings differently to avoid the failure)
  3. Only if we still hit the iceberg do we then consider how to arrange the deck chairs and if the band should keep playing or not.

@Amarendraar23
Copy link

We are seeing a similar issue as discussed above:
One of the selector which is dead keeps accepting connections but doesnt act upon it.

[2020-03-20 06:54:59.022] ALL 000000000000 GLOBAL_SCOPE 2020-03-20 06:54:59.022:DBUG:oeji.ManagedSelector:qtp-113589028-112-acceptor-1@62c969a5-ServerConnector@451274c0{SSL,[ssl, http/1.1]}{mfsgas1u1:60001}: Queued change org.eclipse.jetty.io.ManagedSelector$Accept@63b3dc53 on ManagedSelector@822b2077{STARTED} id=3 keys=-1 selected=-1 updates=2123

Tracing back to see if any exception occurs for the selector we found this exception log around the time when the selector last accepted any connection successfully.

[2020-03-19 22:10:04.823] ALL 000000000000 GLOBAL_SCOPE 2020-03-19 22:10:04.822:WARN:oeji.ManagedSelector:qtp-113589028-795900: Fatal select() failure
java.io.IOException: Bad file descriptor
at sun.nio.ch.FileDispatcherImpl.close0(Native Method)
at sun.nio.ch.SocketDispatcher.close(SocketDispatcher.java:67)
at sun.nio.ch.SocketChannelImpl.kill(SocketChannelImpl.java:894)
at sun.nio.ch.EPollSelectorImpl.implDereg(EPollSelectorImpl.java:206)
at sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:168)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:109)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:98)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:109)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:113)
at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:472)
at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:409)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:360)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:184)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:388)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
at java.lang.Thread.run(Thread.java:811)

After this exception we see the selector only accepting the connections and selector queue build up as shown in the attached image.
image

We are using IBM JDK 1.8 and redhat 7.

@lorban
Copy link
Contributor

lorban commented Apr 24, 2020

Maybe we could try and/or document to check other Selector implementations? They can be changed by setting the java.nio.channels.spi.SelectorProvider system property to a class name and I found a few in the OpenJDK 11 source code:

I wonder if adding -Djava.nio.channels.spi.SelectorProvider=sun.nio.ch.PollSelectorProvider could help in certain cases without too much of a performance impact.

sbordet added a commit that referenced this issue Apr 27, 2020
Implemented selector recovery by transferring
all keys to a newly created selector.

Updated code so that it does not assume that the
SelectionKey never changes.
@sbordet sbordet linked a pull request Apr 27, 2020 that will close this issue
@BrownKathy
Copy link

BrownKathy commented Apr 28, 2020

Thank you for the update. Can you provide more detail about what the fix is doing specifically, any performance impact it has and what build you are intending to put it in? Will you have something for @Amarendraar23 to test?

sbordet added a commit that referenced this issue Apr 29, 2020
Updates after review.

Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
@BrownKathy
Copy link

Is there any performance impact it has and what build you are intending to put it in? Will you have something for @Amarendraar23 to test?

sbordet added a commit that referenced this issue Apr 30, 2020
More updates after review.

Signed-off-by: Simone Bordet <simone.bordet@gmail.com>
sbordet added a commit that referenced this issue May 4, 2020
…or_failures

Issue #4798 - Recover from Selector Failures
@sbordet
Copy link
Contributor Author

sbordet commented May 4, 2020

@lorban and I have done a couple of JMH benchmarks for both socket open/close and HTTP/1.0 requests (the worst cases for this change) and could not see any relevant performance difference - we checked with PerfNorm and PerfAsm profilers as well as load test throughput.

@Amarendraar23 can you please test the current jetty-9.4.x branch?

@sbordet
Copy link
Contributor Author

sbordet commented May 8, 2020

@Amarendraar23 news about the testing?

@BrownKathy
Copy link

@sbordet, thank you for checking in. We are still working through testing at this time. We did initial testing and it seems ok but we are going to have our QA team test it as well. We'll let you know what happens.

@sbordet
Copy link
Contributor Author

sbordet commented May 8, 2020

@BrownKathy we are preparing for the Jetty 9.4.29 release, probably early next week.
Do you think you'll have some data by then?

@sbordet
Copy link
Contributor Author

sbordet commented May 19, 2020

@BrownKathy @Amarendraar23 news?

@BrownKathy
Copy link

@sbordet, we are still testing and need to build some new environments. We will provide an update as soon as possible.

@stale
Copy link

stale bot commented Jun 2, 2021

This issue has been automatically marked as stale because it has been a full year without activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Stale For auto-closed stale issues and pull requests label Jun 2, 2021
@stale
Copy link

stale bot commented Jul 21, 2021

This issue has been closed due to it having no activity.

@stale stale bot closed this as completed Jul 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Stale For auto-closed stale issues and pull requests
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants