ISPN-9589 Client Listener notifications can deadlock #6318

Merged

Conversation

wburns commented Oct 5, 2018

wburns requested a review from galderz on October 5, 2018 21:31
wburns commented Oct 5, 2018

You can verify the fix by running ClientEventStressTest, which failed every time for me on master due to a TimeoutException before this change.

wburns force-pushed the ISPN-9589_client_listener_deadlock branch from ea60bf0 to 1b26b5a on October 5, 2018 21:34
wburns commented Oct 5, 2018

I am not sold on using Caffeine for this, but it was a quick fix for the problem of listeners being added or removed dynamically. I didn't have much time to vet it further, but this should fix the issue :D

wburns commented Oct 7, 2018

This is not quite ready yet. It can still deadlock an event loop thread during event submission.

This invocation https://github.com/infinispan/infinispan/blob/master/server/hotrod/src/main/java/org/infinispan/server/hotrod/ClientListenerRegistry.java#L393 can still block forever if it is invoked on the event loop thread. We may have to time out and drain, then attempt to re-add. I can try some different options later.

If anyone has any suggestions they would be appreciated as well.

wburns force-pushed the ISPN-9589_client_listener_deadlock branch 5 times, most recently from 985c222 to 6263fe8 on October 10, 2018 20:13
wburns commented Oct 10, 2018

Updated, and it should work properly now. There are still performance issues, especially if a flood of async writes occurs, but those will have to be fixed in a different PR.

@galderz (Member) left a comment

Some small comments.

@@ -170,6 +172,12 @@ public void handleRemovedEvent(ClientCacheEntryRemovedEvent<K> event) {
invalidate(event.getKey());
}

@ClientCacheEntryExpired
Member

Seems like you're fixing ISPN-7087 indirectly here?

Member Author (@wburns, Oct 11, 2018)

Yeah, let me make a separate commit for it.

private Address clusterAddress;
private ServerAddress address;
private Cache<Address, ServerAddress> addressCache;
private Map<String, AdvancedCache> knownCaches = CollectionFactory.makeConcurrentMap(4, 0.9f, 16);
private Map<String, Configuration> knownCacheConfigurations = CollectionFactory.makeConcurrentMap(4, 0.9f, 16);
private Map<String, ComponentRegistry> knownCacheRegistries = CollectionFactory.makeConcurrentMap(4, 0.9f, 16);
private final Map<String, CacheInfo> cacheInfo = new ConcurrentHashMap<>();
// This needs to be an expiration map so we can update knowledge about if sync listeners are registered
Member

Why do you need expiration exactly?

Member Author

The reason for expiration is that this CacheInfo is cached. Before this PR the information was static for a given cache configuration. However, now that we need to check whether a sync listener is installed, which can change over time, we need a way to recheck this every so often.

There may be a way to plug into listener installation to see if one is sync, but I have my doubts about that working.

And as I mentioned in my comment below, when I refactor listeners for 10.0 this will no longer be needed and can be reverted to a regular ConcurrentHashMap.
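
For reference, a minimal sketch of how such an expiration map could be built with Caffeine (the library mentioned earlier). The helper name expirationMap and the 10-second window appear in the diff, but the actual implementation isn't shown in this thread, so treat this as illustrative only:

import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;

import com.github.benmanes.caffeine.cache.Caffeine;

// Entries silently expire after the given duration, so a later getCacheInfo()
// call misses, recomputes CacheInfo and picks up any sync listeners added since.
static <K, V> ConcurrentMap<K, V> expirationMap(long duration, TimeUnit unit) {
   return Caffeine.newBuilder()
         .expireAfterWrite(duration, unit)
         .<K, V>build()
         .asMap();
}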

@@ -228,6 +236,7 @@ public void handleCreatedModifiedEvent(ClientCacheEntryCustomEvent<byte[]> e) {
}

@ClientCacheEntryRemoved
@ClientCacheEntryExpired
Member

2 methods annotated with @ClientCacheEntryExpired?

Member Author

One is for the eager near cache and the other is for the invalidated near cache.

Member

I didn't realize EagerNearCacheListener is still there, but it's private and never used, so I'd rather delete it.
There is a reference to it in the code, but that condition in NearCacheService.createListener() is never false.

Member Author

I will add a sub JIRA under https://issues.jboss.org/browse/ISPN-8070 for it.

Member

It's not user-visible, so I don't think we need to wait until 10.0

Member

@wburns I would still like to remove EagerNearCacheListener in this PR; I don't think we'll get any eager near cache-related PRs any time soon.

Member Author

Sure

log.tracef("Channel %s writability changed to %b", channel, writeable);
}
if (writeable) {
server.getClientListenerRegistry().findAndWriteEvents(channel);
Member

writeEventsIfPossible() already checks channel.isWritable() before writing.

Member Author

Yes, but findAndWriteEvents doesn't, and it allocates additional objects and uses event loop resources. It's a minor optimization.

Member

Oops, I've seen the method name at least 10 times and only now I see that the method is also called when the channel becomes non-writable :)

But I think findAndWriteEvents() also needs some work, because it does the event loop thing even if there is no client sender associated with this channel, and the way we look up the client sender in the map is pretty atrocious.

Member Author

Hrmm, findAndWriteEvents only invokes writeEventsIfPossible if the channel that is now writeable is associated with the given event sender. I'm guessing you are referring to how we have to loop through all the values :(

I didn't write this, but I don't see why we couldn't have a ConcurrentMap<Channel, ConcurrentHashSet<BaseClientEventSender>> or something similar.
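
Something like the following sketch of that suggestion; purely illustrative, with BaseClientEventSender being the existing type mentioned above and the field/method names made up here:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import io.netty.channel.Channel;

// Senders keyed by their channel, so findAndWriteEvents(channel) becomes a
// direct lookup instead of iterating every registered sender.
private final ConcurrentMap<Channel, Set<BaseClientEventSender>> sendersByChannel =
      new ConcurrentHashMap<>();

void registerSender(Channel channel, BaseClientEventSender sender) {
   sendersByChannel.computeIfAbsent(channel, c -> ConcurrentHashMap.newKeySet()).add(sender);
}

void findAndWriteEvents(Channel channel) {
   Set<BaseClientEventSender> senders = sendersByChannel.get(channel);
   if (senders == null || senders.isEmpty()) {
      return; // nothing registered for this channel, no event loop submission needed
   }
   channel.eventLoop().submit(() -> senders.forEach(BaseClientEventSender::writeEventsIfPossible));
}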

Member

Yeah, I was thinking about the iteration, which always has to happen in the event loop, whatever that means :)

Member

OK, so I finally started reading a bit about the Netty thread model and saw my expectations were completely wrong. I didn't realize a channel is actually bound to a single thread; I thought the event loop was only bound to a thread pool and every event arrived on a different thread in the pool, something like the 5.0 thread model, only with regular thread pools instead of ForkJoinPool.

}
if (writeable) {
server.getClientListenerRegistry().findAndWriteEvents(channel);
server.getClientCounterNotificationManager().channelActive(channel);
Member

Maybe we should check that the channel is still writable after sending the events?

Member Author

Not sure what you mean here. Unfortunately, the channel can become non-writeable for many different reasons, since it is shared by many different operations.

The way it works currently is that we submit to the event loop if the channel is writeable, but there is no guarantee those events are written, since the writes are only queued. Another operation could fully exhaust the outbound buffer, making the channel non-writeable again before the events fire. Thus we need this callback to try writing those events again once the channel becomes writeable.
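
For context, the retry mechanism described above boils down to something like this sketch of the Netty writability callback. The registry/manager calls and the trace log are the ones visible in the diff; the surrounding handler class, the server field and the exact structure are assumptions:

@Override
public void channelWritabilityChanged(ChannelHandlerContext ctx) throws Exception {
   Channel channel = ctx.channel();
   boolean writeable = channel.isWritable();
   if (log.isTraceEnabled()) {
      log.tracef("Channel %s writability changed to %b", channel, writeable);
   }
   if (writeable) {
      // Earlier event writes were only queued; another operation may have filled
      // the outbound buffer before they flushed, so retry them now that it drained.
      server.getClientListenerRegistry().findAndWriteEvents(channel);
      server.getClientCounterNotificationManager().channelActive(channel);
   }
   super.channelWritabilityChanged(ctx);
}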

Member

I meant it's much more likely that the channel is no longer active after we wrote some events than right after we got the notification that the channel is active. Unless super.channelWritabilityChanged(ctx) can also cause some writes to the channel?

Member Author

Not sure if this is intermingled with the other comment chain now? But yes, I agree the channel will most likely not be writeable after draining all 100+ elements from the event queue. I don't understand what change you are proposing here, though. If the channel is no longer writeable after writing the events, we just wait until the callback tells us the channel is writeable again before writing more.

Member

Sorry, I wrote this before I realized that channel.isWriteable() could be false at the start of the method, and forgot to update it. But what I meant was that I would have added a separate channel.isWriteable() check between writing the events and calling channelActive().

EventLoop loop = ch.eventLoop();
// We shouldn't be in the event loop, but just in case we can't get stuck blocking on putting into the queue
// so we offer and try to catch up on events if possible
if (loop.inEventLoop()) {
Member

IMO we could just throw an exception instead of complicating the code

Member Author

Unfortunately, we can't really do that. The problem is that someone could add a sync listener after the CacheRequestProcessor has already checked and not found one, which could cause the event to fire in the event loop.
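
To make that concrete, the submission path being discussed looks roughly like the sketch below. eventQueue, writeEventsIfPossible and the inEventLoop() check come from the diff; the drain-and-retry details, timeoutMs and the error handling are illustrative, not the exact code:

void enqueueEvent(Channel ch, Object remoteEvent) throws InterruptedException {
   EventLoop loop = ch.eventLoop();
   if (loop.inEventLoop()) {
      // A blocking put() here could deadlock: the queue only drains by writing
      // events from this very thread, so offer and try to catch up instead.
      if (!eventQueue.offer(remoteEvent)) {
         writeEventsIfPossible.run();
         eventQueue.offer(remoteEvent); // best-effort retry; real code needs a fallback if still full
      }
   } else if (!eventQueue.offer(remoteEvent, timeoutMs, TimeUnit.MILLISECONDS)) {
      // Off the event loop we can wait, but bounded, so a stuck client surfaces
      // as an error instead of a thread blocked forever.
      throw new IllegalStateException("Timed out queueing event for client listener");
   }
   if (ch.isWritable()) {
      // Make sure queued events actually get written on the channel's event loop.
      loop.submit(writeEventsIfPossible);
   }
}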

Member

TBH I'm still a little confused why we would ever keep the event loop busy while we are running any cache operation. I mean that prevents any other thread from reading from the channel, so the server is going to spend a lot of time twiddling its thumbs unless there are lots of clients per server. I understand that we wouldn't want to incur a context switch to process short requests, but couldn't we allow another thread to start reading from/writing to the channel while we are doing some other work?

Member Author

That is just the way @rvansa wrote it when he changed it over to share the connection.

I originally wrote it the way you said, except gets and ping were exceptions allowing reads on the io thread.

ch.pipeline().addLast("local-handler", new LocalContextHandler(transport));

Once more things were moved onto the async stack, Radim felt it would be better to allow all operations that could be async to do this. This PR/bug is a side effect of missing a case :)

I am not sure whether what we have now is better or worse than before (assuming no bugs with blocking). But maybe write operations are too heavy for the io thread even when they are purely CPU bound?

Member

We should trigger a perf ack job with and without passing requests to another thread pool again, but I'd argue that even the serialization of read results could be heavy enough to justify doing it on another thread.

}
} else {
try {
eventQueue.put(remoteEvent);
Member

I think we should use eventQueue.offer(remoteEvent, remoteTimeout, MILLISECONDS) here, so we eventually time out and report an exception.

Member Author

Sure.

ch.eventLoop().submit(writeEventsIfPossible);
if (ch.isWritable()) {
// Make sure we write any event in main event loop
loop.submit(writeEventsIfPossible);
Member

This is the same in both branches. The comment is different, but this one is better ;)

Member Author

Sure, I can move them out of the if/else block.

private Address clusterAddress;
private ServerAddress address;
private Cache<Address, ServerAddress> addressCache;
private Map<String, AdvancedCache> knownCaches = CollectionFactory.makeConcurrentMap(4, 0.9f, 16);
private Map<String, Configuration> knownCacheConfigurations = CollectionFactory.makeConcurrentMap(4, 0.9f, 16);
private Map<String, ComponentRegistry> knownCacheRegistries = CollectionFactory.makeConcurrentMap(4, 0.9f, 16);
private final Map<String, CacheInfo> cacheInfo = new ConcurrentHashMap<>();
// This needs to be an expiration map so we can update knowledge about if sync listeners are registered
private final Map<String, CacheInfo> cacheInfo = expirationMap(10, TimeUnit.SECONDS);
Member

I don't like this approach; it means we could have event loop threads processing write requests for up to 10 seconds after the listener is added. Since we don't really care about all cache listeners, only about client listeners, and only on the originator, we could intercept client listener add requests here and update the map synchronously.

Member Author

Hrmm, we care about all sync listeners, no matter where they come from. Otherwise we would be blocking the io event loop for user listeners, cluster listeners, or others as well. The two types mentioned are added outside of this class. I will see if I can do something else, though.

Member Author

I am not sure of an easy way to do this without changing a bunch of stuff. Given that this is temporary anyway, I would rather touch as little as possible for now.

Member

I was thinking this is the server, so we shouldn't have any sync or async listeners except for the client listeners, but I guess it's fine if you add a comment that it's all going away in 10 so I can hold you to it :)

Member Author

A client listener is registered as a sync cluster listener on the other nodes, which is the big case this misses. Technically this shouldn't be an issue, as you mentioned, since it should be the primary owner, but sometimes it may not be :)

Member

I was thinking that even on the primary owner, the client listener is triggered only after the backups ack the write, on a remote thread. But your approach is definitely safer: the primary could temporarily be the only owner, and we don't want to make too many assumptions about how distribution works here in the server.

@@ -432,12 +460,14 @@ public CacheInfo getCacheInfo(AdvancedCache<byte[], byte[]> cache, HotRodHeader
.noFlags().withFlags(LOCAL_NON_BLOCKING_GET);
if (cache.getStatus() != ComponentStatus.RUNNING) {
// stay on the safe side
return new CacheInfo(localNonBlocking, true, true);
return new CacheInfo(localNonBlocking, true, true, true);
Member

I wonder which parameter the comment above refers to...

Member Author

:) I was assuming it was all of them - since the optimizations only fire if the values are false.

Member

I don't think this check is needed; we can't get a reference to a cache before it's started, and if it's stopping/stopped we'll throw an exception immediately.

Member Author

I have no idea personally; that's just how it was added by @rvansa. He would know more about why this check is present. I would be up for removing it, but I wouldn't want to do it in this PR.

Member

Why not? 😃

@danberindei

Shouldn't we change ClientEventStressTest to use only 3 (or even 2) io threads by default?

I'd prefer a non-stress test that blocks the client and generates just enough events for the server-side queue to fill up, but I'm not sure how easy or hard that would be to write.

wburns commented Oct 11, 2018

@danberindei It should be pretty easy to write a test that exhibits this issue without my changes, since we have a few other issues with listeners imo. All you need is a client listener that takes even a millisecond per event plus a lot of writes; that will fill up the buffer and eventually hit the event loop and deadlock.
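
As a rough illustration of that kind of test listener (class and method names here are hypothetical, not from the PR), assuming the standard Hot Rod client listener annotations:

import org.infinispan.client.hotrod.annotation.ClientCacheEntryCreated;
import org.infinispan.client.hotrod.annotation.ClientListener;
import org.infinispan.client.hotrod.event.ClientCacheEntryCreatedEvent;

@ClientListener
public class SlowClientListener {
   @ClientCacheEntryCreated
   public void onCreated(ClientCacheEntryCreatedEvent<?> event) {
      try {
         // Even ~1ms per event is enough to back up the server-side event queue
         // under a burst of writes.
         Thread.sleep(1);
      } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
      }
   }
}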

wburns commented Oct 11, 2018

Also, this is just a bandaid until 10.0, when I am planning to add support for listeners working with the async interceptor stack, at which point this issue goes away.

wburns force-pushed the ISPN-9589_client_listener_deadlock branch from 6263fe8 to e0c7e16 on October 11, 2018 15:44
wburns commented Oct 11, 2018

Actually, working on this more: unfortunately making a simple test is not that simple, as you need multiple threads to overwhelm the io thread pools. It would essentially become a stress test itself. I personally feel we need to make sure we have the stress tests running on CI more often and actually pay attention to the outcomes :)

wburns force-pushed the ISPN-9589_client_listener_deadlock branch from e0c7e16 to 2f764c7 on October 11, 2018 17:06
wburns commented Oct 11, 2018

Broke out an additional commit and also addressed the review comments that I could.

@danberindei

> I personally feel like we need to make sure we have the stress tests running on CI more often and actually pay attention to the outcomes :)

I feel the same way, but I'm not creating the job in https://ci.infinispan.org/job/InfinispanAlternateBuilds/ either :D

@danberindei

> As you need to use multiple threads to overwhelm the io thread pools.

If Netty really uses a single thread for each channel, then we can add a ChannelInboundHandler that waits on a latch.
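
Something along these lines, as a sketch of a handler that parks the channel on a latch until the test releases it (class and field names are made up):

import java.util.concurrent.CountDownLatch;

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

final class LatchBlockingHandler extends ChannelInboundHandlerAdapter {
   private final CountDownLatch latch;

   LatchBlockingHandler(CountDownLatch latch) {
      this.latch = latch;
   }

   @Override
   public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
      // Hold the channel's single event loop thread until the test counts the
      // latch down, so events queue up on the server side in the meantime.
      latch.await();
      super.channelRead(ctx, msg);
   }
}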

@danberindei

wburns commented Oct 12, 2018

It is possible, I would have to look into it.

wburns commented Oct 12, 2018

Looking at the test, these changes shouldn't affect it. It seems that the clear invocation is causing a remove event to be generated. The changes here shouldn't affect delivery of those messages, other than now receiving expired events for the near cache, which this wasn't testing. I will have to debug to look closer.

wburns commented Oct 12, 2018

Okay, looking closer, it is a randomly failing test: the remote cache notifies listeners asynchronously. The remote JCache implementation doesn't skip listener notification for a clear invocation, whereas embedded does. So the failure is a different issue. My fixes may have made the event be generated a little quicker, so the test may exhibit the issue more frequently.

wburns commented Oct 12, 2018

It looks like this was discussed in https://issues.jboss.org/browse/ISPN-9155 and #6036, but nothing came out of it.

wburns commented Oct 23, 2018

Updated to remove the eager near cache listener.

wburns force-pushed the ISPN-9589_client_listener_deadlock branch from 2f764c7 to 8db0b82 on October 26, 2018 13:12
wburns added this to the 9.4.1.Final milestone on Oct 26, 2018
danberindei merged commit c5063e7 into infinispan:master on Oct 26, 2018
@danberindei

Integrated, thanks Will!
