AsynchronousTlsChannelGroup#processPendingInterests can throw CancelledKeyException #18

jyemin · 2021-02-27T00:18:19Z

In tests of the MongoDB Java driver that use this library, I've seen occasional, non-deterministic failures where AsynchronousTlsChannelGroup#processPendingInterests throws CancelledKeyException, causing AsynchronousTlsChannelGroup.loop to exit. It happens in cases where we are forcing the server to close the socket in order to test failure scenarios.

I'm not exactly sure why this is happening, but I do see that in AsynchronousTlsChannelGroup.loop there is already code that wraps calls to java.nio.channels.SelectionKey#interestOps(int) in a try/catch of CancelledKeyException. Does it make sense to do a similar thing in AsynchronousTlsChannelGroup#processPendingInterests , e.g.

  private void processPendingInterests() {
    for (SelectionKey key : selector.keys()) {
      RegisteredSocket socket = (RegisteredSocket) key.attachment();
      int pending = socket.pendingOps.getAndSet(0);
      if (pending != 0) {
        try {
          key.interestOps(key.interestOps() | pending);
        } catch (CancelledKeyException e) {
          // ignore
        }
      }
    }
  }

The text was updated successfully, but these errors were encountered:

marianobarrios · 2021-02-27T22:40:46Z

Hi Jeff, thanks for the report. I tried a bit but could not reproduce the problem. However, what you suggests makes sense and is completely consistent with the other catch we already have, so I just added it. It's in master already.

Best regards.

martinandersson · 2021-02-28T07:13:46Z

Why is this exception ignored? There's probably an assumption involved there. Can this be officially explained in docs somewhere?

stIncMale · 2021-03-03T19:15:00Z

@martinandersson Below is my explanation of the problem and the fix.

The method AsynchronousTlsChannelGroup.RegisteredSocket.close may be called by any thread as a result of it calling
AsynchronousTlsChannel.close. AsynchronousTlsChannelGroup.RegisteredSocket.close calls SelectionKey.cancel, which

Requests that the registration of this key's channel with its selector be cancelled. Upon return the key will be invalid and will have been added to its selector's cancelled-key set. The key will be removed from all of the selector's key sets during the next selection operation.

If we now look at Selector, we can see that

The cancelled-key set is the set of keys that have been cancelled but whose channels have not yet been deregistered. This set is not directly accessible. The cancelled-key set is always a subset of the key set.

This means that selector.keys in AsynchronousTlsChannelGroup.processPendingInterests
may return cancelled SelectionKeys. Calling key.interestOps on such SelectionKeys results in CancelledKeyException as per the documentation of SelectionKey.interestOps.

Thus, depending on how AsynchronousTlsChannelGroup and AsynchronousTlsChannel are used in a program, the program may have a race condition.

Two approaches are possible:

change the usage of SelectionKey.cancel, Selector.select, Selector.keys, SelectionKey.isValid, SelectionKey.interestOps methods
in AsynchronousTlsChannelGroup in such a way that there can be no such race condition anymore;
catch and ignore CancelledKeyException when it happens as a result of a program having the race condition.

The second approach seems (maybe surprisingly) more optimal in this case because it is both simpler and introduces smaller performance overhead assuming that CancelledKeyException is thrown much more rarely than the method SelectionKey.cancel is called.

martinandersson · 2021-03-06T09:36:36Z

Oh, I can only agree. Exceptions are most definitely a valid branching mechanism. I use it myself from time to time. If you ask me, I think any performance degradation - if t here is any - is probably negligible and should not really matter other than how we design our API. For example I would give my users a method boolean hasNext() in order to keep their code "clean", but if a particular interface is only used internally then I might just as well skip adding the method and only rely on the client catching a NoSuchElementException. But I am sure I am preaching to the choir here hahaha, love your reply! Detailed. So thank you! The only thing I'd like to add is that your answer should perhaps be copy-pasted into internal documentation?

marianobarrios · 2021-03-14T19:05:27Z

The Selector API is already racy here. But a lot of "closing workflows" are racy and benign, as typically not much happens after a close to matter anyway.

Something that would help: having a test that show non-deterministic behavior due to this race.

marianobarrios closed this as completed Feb 27, 2021

stIncMale mentioned this issue Dec 8, 2021

Closing an AsynchronousTlsChannel concurrently with an AsynchronousTlsChannelGroup registering it sometimes causes termination of the selectorThread #34

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AsynchronousTlsChannelGroup#processPendingInterests can throw CancelledKeyException #18

AsynchronousTlsChannelGroup#processPendingInterests can throw CancelledKeyException #18

jyemin commented Feb 27, 2021

marianobarrios commented Feb 27, 2021

martinandersson commented Feb 28, 2021

stIncMale commented Mar 3, 2021

martinandersson commented Mar 6, 2021

marianobarrios commented Mar 14, 2021

AsynchronousTlsChannelGroup#processPendingInterests can throw CancelledKeyException #18

AsynchronousTlsChannelGroup#processPendingInterests can throw CancelledKeyException #18

Comments

jyemin commented Feb 27, 2021

marianobarrios commented Feb 27, 2021

martinandersson commented Feb 28, 2021

stIncMale commented Mar 3, 2021

martinandersson commented Mar 6, 2021

marianobarrios commented Mar 14, 2021