ISPN-9589 Client Listener notifications can deadlock #6318
Conversation
You can verify the test by running:
Force-pushed from ea60bf0 to 1b26b5a
I am not sold on using Caffeine for this, but it was a quick fix for the issue of dynamically adding or removing listeners. I didn't have much time to vet it further, but this should fix things :D
This is not quite ready yet. It can still deadlock an event thread during event submission. This invocation https://github.com/infinispan/infinispan/blob/master/server/hotrod/src/main/java/org/infinispan/server/hotrod/ClientListenerRegistry.java#L393 can still block forever if it is invoked on the event loop thread. We may have to time out, drain, and then attempt to re-add. I can try some different options later. If anyone has any suggestions, they would be appreciated as well.
Force-pushed from 985c222 to 6263fe8
Updated and it should work properly now. There are still performance issues, especially if a flood of async writes occurs, but that will have to be fixed in a different PR.
Some small comments.
@@ -170,6 +172,12 @@ public void handleRemovedEvent(ClientCacheEntryRemovedEvent<K> event) {
      invalidate(event.getKey());
   }

   @ClientCacheEntryExpired
Seems like you're fixing ISPN-7087 indirectly here?
Yeah, let me make a separate commit for it.
private Address clusterAddress;
private ServerAddress address;
private Cache<Address, ServerAddress> addressCache;
private Map<String, AdvancedCache> knownCaches = CollectionFactory.makeConcurrentMap(4, 0.9f, 16);
private Map<String, Configuration> knownCacheConfigurations = CollectionFactory.makeConcurrentMap(4, 0.9f, 16);
private Map<String, ComponentRegistry> knownCacheRegistries = CollectionFactory.makeConcurrentMap(4, 0.9f, 16);
private final Map<String, CacheInfo> cacheInfo = new ConcurrentHashMap<>();
// This needs to be an expiration map so we can update knowledge about if sync listeners are registered
Why do you need expiration exactly?
The reason for expiration is that this CacheInfo is cached. Before this PR, the information was static to the cache configuration. However, now that we need to check whether a sync listener is installed, which can change over time, we need a way to recheck this every so often.
There may be a way to plug into listener installation to see if one is sync, but I have my doubts about that working.
And as I mentioned in my comment below, when I refactor listeners for 10.0 this will no longer be needed and can be reverted to a regular ConcurrentHashMap.
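The periodic-recheck idea above can be sketched without Caffeine. This is a hypothetical, minimal illustration (not Infinispan code): a concurrent map whose entries are recomputed once they are older than a maximum age, so cached knowledge such as "does this cache have a sync listener?" is refreshed every so often.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: entries older than maxAgeMillis are recomputed on the
// next access, so stale per-cache knowledge is eventually refreshed.
public final class ExpiringCache<K, V> {
   private static final class Timestamped<T> {
      final T value;
      final long writeNanos;
      Timestamped(T value, long writeNanos) { this.value = value; this.writeNanos = writeNanos; }
   }

   private final Map<K, Timestamped<V>> map = new ConcurrentHashMap<>();
   private final long maxAgeNanos;

   public ExpiringCache(long maxAgeMillis) {
      this.maxAgeNanos = maxAgeMillis * 1_000_000L;
   }

   public V get(K key, Function<K, V> loader) {
      long now = System.nanoTime();
      // compute() runs atomically per key, reloading only expired entries
      Timestamped<V> t = map.compute(key, (k, old) ->
            (old == null || now - old.writeNanos > maxAgeNanos)
                  ? new Timestamped<>(loader.apply(k), now)
                  : old);
      return t.value;
   }
}
```

Caffeine's `expireAfterWrite` builder gives the same behavior with proper eviction; the sketch only shows why a time bound lets the server pick up newly added sync listeners without hooking listener registration directly.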
@@ -228,6 +236,7 @@ public void handleCreatedModifiedEvent(ClientCacheEntryCustomEvent<byte[]> e) {
   }

   @ClientCacheEntryRemoved
   @ClientCacheEntryExpired
Two methods annotated with @ClientCacheEntryExpired?
One is for eager and the other is for invalidated near cache.
I didn't realize EagerNearCacheListener is still there, but it's private and never used, so I'd rather delete it.
There is a reference to it in the code, but that condition in NearCacheService.createListener() is never false.
I will add a sub JIRA under https://issues.jboss.org/browse/ISPN-8070 for it.
It's not user-visible, so I don't think we need to wait until 10.0
@wburns I would still like to remove EagerNearCacheListener in this PR; I don't think we'll get any eager near cache-related PRs any time soon.
Sure
log.tracef("Channel %s writability changed to %b", channel, writeable);
}
if (writeable) {
   server.getClientListenerRegistry().findAndWriteEvents(channel);
writeEventsIfPossible() already checks channel.isWritable() before writing.
Yes, but findAndWriteEvents doesn't, and it allocates additional objects and uses event loop resources. It's a minor optimization.
Oops, I've seen the method name at least 10 times and only now I see that the method is also called when the channel becomes non-writable :)
But I think findAndWriteEvents() also needs some work, because it does the event loop thing even if there is no client sender associated with this channel, and the way we look up the client sender in the map is pretty atrocious.
Hrmm, it only invokes writeEventsIfPossible if the channel that is now writable is associated with the given event sender in findAndWriteEvents. And you are referring to how we have to loop through all values, I am guessing :(
I didn't write this, but I don't see why we couldn't have a ConcurrentMap<Channel, ConcurrentHashSet<BaseClientEventSender>> or something similar.
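The per-channel index suggested above can be sketched with the JDK alone. This is a hypothetical illustration, not the actual Infinispan code: `Channel` and `Sender` stand in for io.netty.channel.Channel and BaseClientEventSender, so a writability change only visits the senders registered on that channel instead of scanning every registered sender.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the suggested ConcurrentMap<Channel, Set<Sender>>:
// registration and lookup are per-channel, avoiding a full scan of all senders.
public final class SenderIndex<Channel, Sender> {
   private final Map<Channel, Set<Sender>> byChannel = new ConcurrentHashMap<>();

   public void register(Channel ch, Sender sender) {
      // newKeySet() gives a concurrent Set backed by ConcurrentHashMap
      byChannel.computeIfAbsent(ch, c -> ConcurrentHashMap.newKeySet()).add(sender);
   }

   public void unregister(Channel ch, Sender sender) {
      byChannel.computeIfPresent(ch, (c, senders) -> {
         senders.remove(sender);
         return senders.isEmpty() ? null : senders;  // drop empty entries
      });
   }

   // Only the senders bound to this channel are visited on a writability change.
   public Set<Sender> sendersFor(Channel ch) {
      return byChannel.getOrDefault(ch, Set.of());
   }
}
```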
Yeah, I was thinking about the iteration, which always has to happen in the event loop, whatever that means :)
Ok, so I finally started reading a bit about the Netty thread model and I saw my expectations were completely wrong. I didn't realize a channel was actually bound to a single thread, I thought the event loop was only bound to a thread pool and every event arrived on another thread in the pool, something like the 5.0 thread model, only with regular thread pools instead of ForkJoinPool.
}
if (writeable) {
   server.getClientListenerRegistry().findAndWriteEvents(channel);
   server.getClientCounterNotificationManager().channelActive(channel);
Maybe we should check that the channel is still writable after sending the events?
Not sure what you mean here. Unfortunately, the channel can become non-writable for many different reasons, since it is shared by many different operations.
The way it works currently is that we submit to the event loop if the channel is writable; there is no guarantee those events are written, though, as the event writes are queued. Another operation could fully exhaust the outbound buffer, making the channel non-writable again before the events fire. Thus we need this callback to try to write those events again once the channel becomes writable.
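The drain-on-writable pattern described here can be simulated in a single-threaded sketch. This is hypothetical illustration code, not the Netty handler: events queue up, a drain writes only while the "channel" is writable (outbound buffer below capacity), and the writability callback re-triggers the drain once the peer has consumed data.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical single-threaded model of the pattern discussed above.
public final class DrainOnWritable {
   private final Queue<String> pending = new ArrayDeque<>();
   private final int outboundCapacity;
   private int outboundBuffered;

   DrainOnWritable(int outboundCapacity) { this.outboundCapacity = outboundCapacity; }

   boolean isWritable() { return outboundBuffered < outboundCapacity; }

   void enqueue(String event) {
      pending.add(event);
      // analogous to submitting writeEventsIfPossible when the channel is writable
      if (isWritable()) drain();
   }

   // Invoked again when the outbound buffer drains, like channelWritabilityChanged
   void channelWritabilityChanged() {
      if (isWritable()) drain();
   }

   void peerConsumed(int n) { outboundBuffered = Math.max(0, outboundBuffered - n); }

   private void drain() {
      // stop as soon as the outbound buffer fills; leftover events wait for the callback
      while (isWritable() && !pending.isEmpty()) {
         pending.poll();
         outboundBuffered++;
      }
   }

   int pendingCount() { return pending.size(); }
}
```

The key point the sketch shows: a full outbound buffer halts the drain mid-queue, so without the writability callback the remaining events would never be written.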
I meant it's much more likely that the channel is no longer active after we wrote some events than right after we got the notification that the channel is active. Unless super.channelWritabilityChanged(ctx) can also cause some writes to the channel?
Not sure if this is intermingled with the other comment chain now. But yes, I agree that the channel will most likely not be writable after exhausting all 100+ elements from the event queue. I don't understand what change you are proposing here, though. If the channel is no longer writable after writing the events, we just wait until we get the callback telling us the channel is writable again to continue writing more.
Sorry, I wrote this before I realized that channel.isWritable() could be false at the start of the method, and forgot to update it. But what I meant was that I would have added a separate channel.isWritable() check between writing the events and calling channelActive().
EventLoop loop = ch.eventLoop();
// We shouldn't be in the event loop, but just in case we can't get stuck blocking on putting into the queue
// so we offer and try to catch up on events if possible
if (loop.inEventLoop()) {
IMO we could just throw an exception instead of complicating the code
Unfortunately, we can't really do that. The problem is someone could add a sync listener after the CacheRequestProcessor has already checked and not found one, which could cause it to possibly fire in the event loop.
TBH I'm still a little confused about why we would ever keep the event loop busy while we are running any cache operation. That prevents any other thread from reading from the channel, so the server is going to spend a lot of time twiddling its thumbs unless there are lots of clients per server. I understand that we wouldn't want to incur a context switch to process short requests, but couldn't we allow another thread to start reading from/writing to the channel while we are doing some other work?
That is just the way @rvansa wrote it when he changed it over to share the connection.
I originally wrote it the way you said, except gets and ping were exceptions allowing reads on the io thread.
Line 47 in 1283927:
ch.pipeline().addLast("local-handler", new LocalContextHandler(transport));
When things moved more to the async stack, Radim felt it would be better to allow all operations that could be async to do this. This PR/bug is a side effect of missing a case :)
I am not sure if what we have now is better or worse than before (assuming no bugs with blocking). But maybe write operations are too big even when CPU-bound?
We should trigger a perf ack job with and without passing requests to another thread pool again, but I'd argue that even the serialization of read results could be heavy enough to justify doing it on another thread.
}
} else {
   try {
      eventQueue.put(remoteEvent);
I think we should use eventQueue.offer(remoteEvent, remoteTimeout, MILLISECONDS) here, so we eventually time out and report an exception.
Sure.
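The queueing policy discussed in this thread can be sketched directly against java.util.concurrent. This is an illustrative sketch (class and method names are hypothetical, not the PR's code): never wait on the event-loop thread, and bound the wait everywhere else with a timed offer so a stuck client surfaces as a reported failure instead of a deadlocked thread.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the event submission policy discussed above.
public final class EventQueuePolicy {
   private final BlockingQueue<Object> eventQueue;
   private final long remoteTimeoutMillis;

   EventQueuePolicy(int capacity, long remoteTimeoutMillis) {
      this.eventQueue = new ArrayBlockingQueue<>(capacity);
      this.remoteTimeoutMillis = remoteTimeoutMillis;
   }

   /** @param inEventLoop true when called from the channel's event-loop thread */
   boolean submit(Object remoteEvent, boolean inEventLoop) throws InterruptedException {
      if (inEventLoop) {
         // On the event loop we must not wait at all: a full queue is reported
         // to the caller, which can drain pending events and retry.
         return eventQueue.offer(remoteEvent);
      }
      // Off the event loop a bounded wait is fine; false means we timed out
      // and should raise an exception rather than block forever via put().
      return eventQueue.offer(remoteEvent, remoteTimeoutMillis, TimeUnit.MILLISECONDS);
   }

   Object poll() { return eventQueue.poll(); }
}
```

The contrast with the original `eventQueue.put(remoteEvent)` is that `put` can park the calling thread indefinitely, which is exactly the deadlock this PR works around when the caller turns out to be the event loop.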
ch.eventLoop().submit(writeEventsIfPossible);
if (ch.isWritable()) {
   // Make sure we write any event in main event loop
   loop.submit(writeEventsIfPossible);
This is the same in both branches. The comment is different, but this one is better ;)
Sure, I can move them out of if/else block.
private Address clusterAddress;
private ServerAddress address;
private Cache<Address, ServerAddress> addressCache;
private Map<String, AdvancedCache> knownCaches = CollectionFactory.makeConcurrentMap(4, 0.9f, 16);
private Map<String, Configuration> knownCacheConfigurations = CollectionFactory.makeConcurrentMap(4, 0.9f, 16);
private Map<String, ComponentRegistry> knownCacheRegistries = CollectionFactory.makeConcurrentMap(4, 0.9f, 16);
private final Map<String, CacheInfo> cacheInfo = new ConcurrentHashMap<>();
// This needs to be an expiration map so we can update knowledge about if sync listeners are registered
private final Map<String, CacheInfo> cacheInfo = expirationMap(10, TimeUnit.SECONDS);
I don't like this approach, it means we could have event loop threads processing write requests for 10s after the listener is added. Since we don't really care about all cache listeners, only about client listeners, and only on the originator, we could intercept client listener add requests here and update the map synchronously.
Hrmm, we care about all sync listeners, no matter where they come from. Otherwise we would be blocking the io event loop for user listeners, cluster listeners, or others as well. The two types mentioned are added outside of this class. I will see if I can do something else though.
I am not sure of an easy way to do this without changing a bunch of stuff. Given that this is temporary anyways, I would rather just touch as little as possible for now.
I was thinking this is the server, so we shouldn't have any sync or async listeners except for the client listeners, but I guess it's fine if you add a comment that it's all going away in 10 so I can hold you to it :)
A client listener is registered as a sync cluster listener on other nodes, which is the big case this misses. Technically this shouldn't be an issue as you mentioned as it should be the primary owner, but sometimes it may not :)
I was thinking that even on the primary owner, the client listener is triggered only after the backups ack the write, on a remote thread. But your approach is definitely safer, the primary could temporarily be the only owner, and we don't want to make too many assumptions about how distribution works here in the server.
@@ -432,12 +460,14 @@ public CacheInfo getCacheInfo(AdvancedCache<byte[], byte[]> cache, HotRodHeader
   .noFlags().withFlags(LOCAL_NON_BLOCKING_GET);
if (cache.getStatus() != ComponentStatus.RUNNING) {
   // stay on the safe side
   return new CacheInfo(localNonBlocking, true, true);
   return new CacheInfo(localNonBlocking, true, true, true);
I wonder which parameter the comment above refers to...
:) I was assuming it was all of them - since the optimizations only fire if the values are false.
I don't think this check is needed, we can't get a reference to a cache before it's started and if it's stopping/stopped we'll throw an exception immediately.
I have no idea personally, just how it was added by @rvansa. He would know more about why this check is present. But I would be up for removing, however I wouldn't want to do it in this PR.
Why not? 😃
I'd prefer a non-stress test that blocks the client and generates just enough events for the server-side queue to fill up, but I'm not sure how easy or hard that would be to write.
@danberindei It should be pretty easy to write a test that exhibits this issue without my changes, since we have a few other issues with listeners imo. All you need is a client listener that takes even a millisecond and a lot of writes; it will fill up the buffer and eventually hit the event loop and deadlock.
Also, this is just a bandaid until 10.0, when I am planning to add support for listeners working with the async interceptor stack, in which case this issue goes away.
Force-pushed from 6263fe8 to e0c7e16
Actually, working on this more, making a simple test is unfortunately not that simple, as you need multiple threads to overwhelm the io thread pools. It would essentially become a stress test itself. I personally feel we need to make sure the stress tests run on CI more often and that we actually pay attention to the outcomes :)
Force-pushed from e0c7e16 to 2f764c7
Broke out an additional commit and also reworked the comments that I could.
I feel the same way, but I'm not creating the job in https://ci.infinispan.org/job/InfinispanAlternateBuilds/ either :D
If Netty really uses a single thread for each channel, then we can add a
Could this failure be related to your changes?
It is possible, I would have to look into it.
Looking at the test, these changes shouldn't affect it. It seems that the clear invocation is causing a remove event to be generated. The changes here shouldn't affect delivery of those messages, other than now receiving expired events for the near cache, which this wasn't testing. I will have to debug to look closer.
Okay, looking closer, it is a randomly failing test: the remote cache notifies listeners asynchronously. The remote JCache implementation doesn't ignore listeners for a clear method, whereas embedded does. So the failure is a different issue. My fixes may have made the event be generated a little quicker, so it may exhibit the issue more frequently.
It looks like this was discussed in https://issues.jboss.org/browse/ISPN-9155 and #6036, but nothing came out of it.
Updated to remove the eager near cache listener.
* Sync listeners can block as well - fire on executor
* Remove EAGER client listener
Force-pushed from 2f764c7 to 8db0b82
Integrated, thanks Will!
https://issues.jboss.org/browse/ISPN-9589
https://issues.jboss.org/browse/ISPN-7087