Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AsynchronousTlsChannelGroup#processPendingInterests can throw CancelledKeyException #18

Closed
jyemin opened this issue Feb 27, 2021 · 5 comments

Comments

@jyemin
Copy link

jyemin commented Feb 27, 2021

In tests of the MongoDB Java driver that use this library, I've seen occasional, non-deterministic failures where AsynchronousTlsChannelGroup#processPendingInterests throws CancelledKeyException, causing AsynchronousTlsChannelGroup.loop to exit. It happens in cases where we are forcing the server to close the socket in order to test failure scenarios.

I'm not exactly sure why this is happening, but I do see that in AsynchronousTlsChannelGroup.loop there is already code that wraps calls to java.nio.channels.SelectionKey#interestOps(int) in a try/catch of CancelledKeyException. Does it make sense to do a similar thing in AsynchronousTlsChannelGroup#processPendingInterests , e.g.

  private void processPendingInterests() {
    for (SelectionKey key : selector.keys()) {
      RegisteredSocket socket = (RegisteredSocket) key.attachment();
      int pending = socket.pendingOps.getAndSet(0);
      if (pending != 0) {
        try {
          key.interestOps(key.interestOps() | pending);
        } catch (CancelledKeyException e) {
          // ignore
        }
      }
    }
  }
@marianobarrios
Copy link
Owner

Hi Jeff, thanks for the report. I tried a bit but could not reproduce the problem. However, what you suggests makes sense and is completely consistent with the other catch we already have, so I just added it. It's in master already.

Best regards.

@martinandersson
Copy link

Why is this exception ignored? There's probably an assumption involved there. Can this be officially explained in docs somewhere?

@stIncMale
Copy link
Contributor

@martinandersson Below is my explanation of the problem and the fix.

The method AsynchronousTlsChannelGroup.RegisteredSocket.close may be called by any thread as a result of it calling
AsynchronousTlsChannel.close. AsynchronousTlsChannelGroup.RegisteredSocket.close calls SelectionKey.cancel, which

Requests that the registration of this key's channel with its selector be cancelled. Upon return the key will be invalid and will have been added to its selector's cancelled-key set. The key will be removed from all of the selector's key sets during the next selection operation.

If we now look at Selector, we can see that

The cancelled-key set is the set of keys that have been cancelled but whose channels have not yet been deregistered. This set is not directly accessible. The cancelled-key set is always a subset of the key set.

This means that selector.keys in AsynchronousTlsChannelGroup.processPendingInterests
may return cancelled SelectionKeys. Calling key.interestOps on such SelectionKeys results in CancelledKeyException as per the documentation of SelectionKey.interestOps.

Thus, depending on how AsynchronousTlsChannelGroup and AsynchronousTlsChannel are used in a program, the program may have a race condition.

Two approaches are possible:

  1. change the usage of SelectionKey.cancel, Selector.select, Selector.keys, SelectionKey.isValid, SelectionKey.interestOps methods
    in AsynchronousTlsChannelGroup in such a way that there can be no such race condition anymore;
  2. catch and ignore CancelledKeyException when it happens as a result of a program having the race condition.

The second approach seems (maybe surprisingly) more optimal in this case because it is both simpler and introduces smaller performance overhead assuming that CancelledKeyException is thrown much more rarely than the method SelectionKey.cancel is called.

@martinandersson
Copy link

Oh, I can only agree. Exceptions are most definitely a valid branching mechanism. I use it myself from time to time. If you ask me, I think any performance degradation - if t here is any - is probably negligible and should not really matter other than how we design our API. For example I would give my users a method boolean hasNext() in order to keep their code "clean", but if a particular interface is only used internally then I might just as well skip adding the method and only rely on the client catching a NoSuchElementException. But I am sure I am preaching to the choir here hahaha, love your reply! Detailed. So thank you! The only thing I'd like to add is that your answer should perhaps be copy-pasted into internal documentation?

@marianobarrios
Copy link
Owner

The Selector API is already racy here. But a lot of "closing workflows" are racy and benign, as typically not much happens after a close to matter anyway.

Something that would help: having a test that show non-deterministic behavior due to this race.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants